Perspective Talks Speakers

Tuesday (June 06)

Wenwu Wang

University of Surrey


Cross-modal generation of audio and text has emerged as an important research area in audio signal processing and natural language processing. Audio-to-text generation, also known as automated audio captioning, aims to provide a meaningful language description of the content of an audio clip. This can be used to assist the hearing-impaired in understanding environmental sounds, facilitate retrieval of multimedia content, and analyze sounds for security surveillance. Text-to-audio generation aims to produce an audio clip from a text prompt, i.e., a language description of the audio content to be generated. This can serve as a sound synthesis tool for film making, game design, virtual reality/metaverse, and digital media, and as a digital assistant for text understanding by the visually impaired. To achieve cross-modal audio-text generation, it is essential to comprehend the audio events and scenes within an audio clip, as well as to interpret the textual information presented in natural language; learning the mapping and alignment between these two streams of information is also crucial. Exciting developments have recently emerged in the field of automated audio-text cross-modal generation. In this talk, we will give an introduction to this field, including the problem description, potential applications, datasets, open challenges, recent technical progress, and possible future research directions.


Wenwu Wang is a Professor in Signal Processing and Machine Learning, and a Co-Director of the Machine Audition Lab within the Centre for Vision, Speech and Signal Processing, University of Surrey, UK. He is also an AI Fellow at the Surrey Institute for People-Centred Artificial Intelligence. His current research interests include signal processing, machine learning and perception, artificial intelligence, machine audition (listening), and statistical anomaly detection. He has (co-)authored over 300 papers in these areas. He has been involved as Principal or Co-Investigator in more than 30 research projects funded by UK and EU research councils and industry (e.g. BBC, NPL, Samsung, Tencent, Huawei, Saab, Atlas, and Kaon).

He is a (co-)author or (co-)recipient of over 15 awards including the 2022 IEEE Signal Processing Society Young Author Best Paper Award, ICAUS 2021 Best Paper Award, DCASE 2020 Judge’s Award, DCASE 2019 and 2020 Reproducible System Award, LVA/ICA 2018 Best Student Paper Award, FSDM 2016 Best Oral Presentation, and Dstl Challenge 2012 Best Solution Award.

He is a Senior Area Editor for IEEE Transactions on Signal Processing, an Associate Editor for IEEE/ACM Transactions on Audio, Speech and Language Processing, an Associate Editor for (Nature) Scientific Reports, and a Specialty Chief Editor of Frontiers in Signal Processing. He is a Board Member of the IEEE Signal Processing Society (SPS) Technical Directions Board, the elected Chair of the IEEE SPS Machine Learning for Signal Processing Technical Committee, the Vice Chair of the EURASIP Technical Area Committee on Acoustic, Speech and Music Signal Processing, an elected Member of the IEEE SPS Signal Processing Theory and Methods Technical Committee, and an elected Member of the International Steering Committee of Latent Variable Analysis and Signal Separation. He was a Satellite Workshop Co-Chair for INTERSPEECH 2022, a Publication Co-Chair for IEEE ICASSP 2019, a Local Arrangements Co-Chair for IEEE MLSP 2013, and a Publicity Co-Chair for IEEE SSP 2009.

Xuedong Huang

Microsoft AI

We have been on a quest to advance AI beyond existing techniques by taking a more holistic, human-centric approach to learning and understanding. We view the relationship among three attributes of human cognition, monolingual text (X), audio or visual sensory signals (Y), and multilingual (Z), as essential, and from their intersection we derive what we call XYZ-code: a joint representation for creating more powerful AI that can speak, hear, see, and understand humans better. We believe XYZ-code will enable us to fulfill our long-term vision of cross-domain transfer learning spanning modalities and languages. The goal is to have pretrained models that can jointly learn representations to support a broad range of downstream AI tasks, much in the way humans do today. AI foundation models have provided us with strong signals toward our more ambitious aspiration to produce a leap in AI capabilities: multisensory and multilingual learning that is closer in line with how humans learn and understand. The journey toward such integrative AI must also be grounded with external knowledge sources in downstream AI tasks.



Xuedong Huang is a Microsoft Technical Fellow and served as Microsoft’s Azure AI Chief Technology Officer.

In 1993, Huang left Carnegie Mellon University to found Microsoft's speech technology group, and he has led Microsoft's spoken language efforts for over 30 years. In addition to bringing speech to the mass market, he led Microsoft in achieving several historic human-parity milestones in speech recognition, machine translation, and computer vision. He is best known for leading Microsoft Azure Cognitive Services, including Computer Speech, Computer Vision, Natural Language, and OpenAI, from its inception, and for making Azure AI an industrial AI platform serving billions of customers worldwide.

Huang is an IEEE fellow (2000) and ACM fellow (2017). Huang was elected as a member of the National Academy of Engineering (2023) and the American Academy of Arts and Sciences (2023).

Huang received his BS, MS, and PhD from Hunan University (1982), Tsinghua University (1984), and the University of Edinburgh (1989), respectively.

Wednesday (June 07)

Karen Livescu

Toyota Technological Institute at Chicago


Sign languages are used by millions of deaf and hard of hearing individuals around the world.  Research on sign language video processing is needed to make all of the technologies that are now available for spoken and written languages available also for sign languages.  Sign languages are under-resourced and unwritten, so research in this area shares many of the same challenges faced by research on other low-resource languages.  However, there are also sign language-specific challenges, such as the difficulty of analyzing fast, coarticulated body pose changes.  Sign languages also include certain linguistic properties that are specific to this modality.
There has been some encouraging progress, including on tasks like isolated sign recognition and sign-to-written language translation.  However, the current state of the art is far from handling arbitrary linguistic domains and visual environments.  This talk will provide a perspective on research in this area, including work in my group and others aimed at a broad range of real-world domains and conditions.  Along the way, I will present recently collected datasets as well as technical strategies that have been developed for handling the challenges of natural sign language video.


Karen Livescu is a Professor at Toyota Technological Institute at Chicago (TTI-Chicago). She completed her PhD in electrical engineering and computer science at Massachusetts Institute of Technology (MIT) in 2005 and her Bachelor’s degree in physics at Princeton University in 1996.  Her main research interests are in speech and language processing, as well as related problems in machine learning.  Her recent work includes self-supervised representation learning, spoken language understanding, acoustic word embeddings, visually grounded speech and language models, and automatic sign language processing.
Dr. Livescu is a 2021 IEEE SPS Distinguished Lecturer and an Associate Editor for IEEE TPAMI, and has previously served as Associate Editor for IEEE/ACM TASLP and IEEE OJSP.  She has also previously served as a member of the IEEE Speech and Language Processing Technical Committee and as a Technical Co-Chair of ASRU.  Outside of the IEEE, she is an ISCA Fellow and has recently served as Technical Program Co-Chair for Interspeech (2022); Program Co-Chair for ICLR (2019); Associate Editor for TACL (2021-present); and Area Chair for a number of speech, machine learning, and NLP conferences.

Spyros Raptis

Innoetics / Samsung Electronics Hellas


Powered by the recent advances in AI-based representation and generation, text-to-speech technology has reached unprecedented levels in quality and flexibility.
Self-supervised learning techniques have provided ways to formulate efficient latent spaces that offer more control over different qualities of the generated speech, zero-shot approaches have allowed matching the characteristics of unseen speakers, and efficient prior networks have contributed to disentangling content, speaker, emotion, and other dimensions of speech.
These developments have boosted existing application areas but also allowed tackling new ones that previously seemed much more distant. We’ll discuss some of the recent advances in specific areas in the field, including our team’s work on multi-speaker, multi-/cross-lingual, expressive and controllable TTS, on synthesized singing, as well as on automatic synthetic speech evaluation. We’ll also look into cloning existing speakers as well as generating novel ones.
Finally, we’ll touch on the valid concerns that such unprecedented technical capabilities raise. Voice is a key element of one’s identity and although such technologies hold great promise for useful applications, at the same time they have a potential for abuse, thus raising ethical and intellectual property questions, both in the context of the creative industries and in our everyday lives.


Spyros Raptis is the Head of Text-to-Speech R&D at Samsung Electronics Hellas. Before joining Samsung, he was the Director of the Institute for Language and Speech Processing (ILSP), Vice President of the Board of the “Athena” Research Center, and co-founder of the INNOETICS startup company, which was acquired by Samsung Electronics.
He holds a Diploma and a PhD in computational intelligence and robotics from the National Technical University of Athens, Greece.
He has coordinated various national and European R&D projects in the broader field of Human-Computer Interaction, and he has led the ILSP TTS team that developed award-winning speech synthesis technologies at international competitions.
He is the co-author of more than 60 publications in scientific books, journals, and international conferences in the areas of speech processing, computational intelligence, and music technology. He has taught speech and language processing at the undergraduate and postgraduate levels.
His research interests include speech analysis, modeling, and generation; voice assistants and speech-enabled applications and services; and tools for accessibility.

Thursday (June 08)

Chandra Murthy

Indian Institute of Science

This talk presents a set of tools based on a Bayesian framework to address the general problem of sparse signal recovery, and discusses the challenges associated with them. Bayesian methods offer superior performance compared to convex optimization-based methods, are free of parameter tuning, and have the flexibility to handle a more diverse range of measurement modalities and structured sparsity in signals than hitherto possible. We discuss recent developments towards providing rigorous theoretical guarantees for these methods. Further, we show that, by re-interpreting the Bayesian cost function as a technique for performing covariance matching, one can develop new and ultra-fast Bayesian algorithms for sparse signal recovery. As example applications, we discuss the utility of these algorithms in the context of (a) 5G communications, with several case studies including wideband time-varying channel estimation and low-resolution ADCs, and (b) controllability and observability of linear dynamical systems under sparsity constraints.
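To give a concrete flavor of the tuning-free Bayesian recovery the abstract refers to, the classic sparse Bayesian learning (SBL) iteration via expectation-maximization can be sketched in a few lines of NumPy. This is a textbook instance of the framework, not the speaker's specific algorithms, and all problem sizes, the noise level, and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: y = Phi @ x + noise, with x sparse.
n, m, k = 40, 100, 5                      # measurements, dimension, sparsity
Phi = rng.standard_normal((n, m)) / np.sqrt(n)
x_true = np.zeros(m)
support = rng.choice(m, k, replace=False)
x_true[support] = rng.standard_normal(k)
sigma2 = 1e-4                             # noise variance (assumed known here)
y = Phi @ x_true + np.sqrt(sigma2) * rng.standard_normal(n)

# SBL via EM: Gaussian prior x_i ~ N(0, gamma_i); the hyperparameters
# gamma are learned from the data, so no regularization knob is tuned.
gamma = np.ones(m)
for _ in range(200):
    # E-step: Gaussian posterior of x given the current gamma
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(1.0 / gamma))
    mu = Sigma @ Phi.T @ y / sigma2
    # M-step: update each prior variance (clamped for numerical stability)
    gamma = np.maximum(mu**2 + np.diag(Sigma), 1e-10)

x_hat = mu   # posterior mean is the sparse estimate
```

The only inputs are the measurements and the dictionary; the per-coefficient variances `gamma` shrink toward zero off the support, which is what produces sparsity without a hand-tuned threshold.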
Chandra R. Murthy received the B.Tech. degree in Electrical Engineering from the Indian Institute of Technology Madras, Chennai, India in 1998 and the M.S. degree in Electrical and Computer Engineering from Purdue University, West Lafayette, IN, USA in 2000. In 2006, he obtained the Ph.D. degree in Electrical and Computer Engineering from UC San Diego, La Jolla, CA, USA.
From Aug. 2000 to Aug. 2002, he worked on WCDMA baseband transceiver design and 802.11b baseband receivers at Qualcomm, Inc, San Jose, USA; and from Aug. 2006 to Aug. 2007, he worked on advanced receiver algorithms for the 802.16e mobile WiMAX system at Beceem Communications (now Broadcom), Bangalore, India. Currently, he is working as a Professor in the department of Electrical Communication Engineering at the Indian Institute of Science, Bangalore, India.
His research interests are in the areas of sparse signal recovery, energy harvesting-based communication, and performance analysis and optimization of 5G and beyond communications. His papers received Best Paper Awards at NCC 2014 and NCC 2023, and papers co-authored with his students received Student Best Paper Awards at IEEE ICASSP 2018, IEEE ISIT 2021, and IEEE SPAWC 2022.
He is an IEEE Fellow (Class of 2023), a senior area editor for the IEEE Transactions on Signal Processing and an associate editor for the IEEE Transactions on Information Theory. He served as an associate editor for the IEEE Signal Processing Letters during 2012-16, the IEEE Transactions on Signal Processing during 2018-20, and the IEEE Transactions on Communications during 2017-22.

Anthony Vetro

Mitsubishi Electric Research Labs

Machine learning has demonstrated impressive achievements across a wide range of applications, but many systems are unable to provide a high level of reliability and trustworthiness in their results. This is especially important for industrial and safety-critical systems, where a higher level of assurance in the results is essential. This talk will offer a perspective on the emerging area of physics-grounded machine learning for the design, optimization, and control of real-world engineering systems. Such a framework can enforce physical principles and constraints while still leveraging the power of data-driven machine learning techniques. Several industrial applications will be covered to demonstrate the benefits of this framework, including radar-based imaging, which leverages the physics of wave propagation; airflow sensing and optimization, which is governed by the Navier-Stokes equations; and a range of additional applications that can be modeled as dynamical systems or through geometric constraints.
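As a minimal illustration of grounding a data-driven estimate in a governing physical law (a toy sketch under assumed names and a toy decay model, not the speaker's framework), consider estimating the rate constant of the ODE x' = -k x directly from noisy observations by least squares on the equation residual:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: noisy samples of exponential decay governed by x' = -k x
k_true = 0.7
t = np.linspace(0.0, 5.0, 200)
x = np.exp(-k_true * t) + 0.01 * rng.standard_normal(t.size)

# Physics-grounded estimate: pick k that best satisfies the ODE on the data,
#   min_k sum_i (dx/dt_i + k * x_i)^2,
# which has a closed-form least-squares solution.
dxdt = np.gradient(x, t)                  # finite-difference time derivative
k_hat = -np.sum(dxdt * x) / np.sum(x * x)
```

Rather than fitting an unconstrained curve to the samples, the physical model reduces the problem to one interpretable parameter; the same residual-penalty idea scales to richer constraints such as the wave or Navier-Stokes equations mentioned above.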
Anthony Vetro is President & CEO of Mitsubishi Electric Research Labs in Cambridge, Massachusetts. His primary area of research has been multimedia signal processing. He has published more than 200 papers and was a member of the MPEG and ITU-T video coding standardization committees for many years, serving in numerous leadership roles. In his 25+ years with the company, he has contributed to its strategic R&D directions, led teams in a variety of emerging technology areas, and contributed to the transfer and development of several technologies into Mitsubishi products, including digital television receivers and displays, surveillance and camera monitoring systems, automotive equipment, and satellite imaging systems.
He has also been active in various IEEE conferences, technical committees, and editorial boards. Currently, he serves on the Board of Governors and Conferences Board of the IEEE Signal Processing Society. Past roles include Senior Associate Editor of the IEEE Open Journal on Signal Processing, Senior Editorial Board of IEEE Journal on Selected Topics in Signal Processing and IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Editorial Board of IEEE Signal Processing Magazine and IEEE Multimedia, Associate Editor of IEEE Transactions on Circuits and Systems for Video Technology and IEEE Transactions on Image Processing, Chair of TC on Multimedia Signal Processing of the IEEE Signal Processing Society, and Steering Committee of IEEE Transactions on Multimedia. He was a General Co-Chair of ICIP 2017 and ICME 2015, and a Technical Program Co-Chair of ICME 2016.
Dr. Vetro received the B.S., M.S. and Ph.D. degrees in Electrical Engineering from NYU. He has received several awards for his work on transcoding and is a Fellow of the IEEE.
