Signal Processing Grand Challenges
The ICASSP 2023 Grand Challenge Committee is pleased to announce 15 Grand Challenges that will be hosted at ICASSP 2023. You are cordially invited to participate in the challenges. More information about each challenge can be found by visiting the challenge website (click on a challenge title to visit the challenge website). The five top-ranked teams of each challenge will be invited to write a 2-page proceedings paper and present their work at ICASSP 2023 during a special session dedicated to the challenge. Furthermore, the teams that present their work at the conference are also invited to submit a full paper about their work to a special issue of the IEEE Open Journal of Signal Processing in which all challenges are covered.
The 5th Deep Noise Suppression (DNS) Challenge aims to motivate the development of DNS models that deliver high speech quality in the presence of reverberation, noise, and interfering (neighboring) talkers. DNS has gained momentum given the new trends of hybrid and remote work in a variety of daily-life scenarios. Improving speech quality reduces meeting fatigue and improves clarity of communication. This challenge is intended to promote industry-academia collaboration on deep speech enhancement research. The challenge studies headset and speakerphone DNS separately, allowing for the possibility of new insights. The headset DNS track motivates researchers to leverage the acoustic properties of headset scenarios in developing models that suppress neighboring talkers without enrollment speech. As in past challenges, models in both tracks (Track 1: Headset DNS; Track 2: Speakerphone DNS) will be ranked by a final score obtained as a weighted average of subjective P.835 scores and word accuracy (WAcc). The goal is to enhance the audio signal so as to preserve the primary talker while suppressing neighboring talkers, noise, and reverberation. All datasets used in the challenge are full band (48 kHz).
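As a rough illustration of how a weighted average of a subjective score and word accuracy could be formed, here is a minimal sketch. The equal weights, the rescaling of the mean opinion score from the 1-5 range to 0-1, and the function name `final_score` are assumptions made for this example; the challenge website defines the actual formula.

```python
def final_score(p835_mos: float, wacc: float, mos_weight: float = 0.5) -> float:
    """Combine a P.835 mean opinion score with word accuracy.

    p835_mos   : subjective score on the usual 1-5 MOS scale
    wacc       : word accuracy in [0, 1]
    mos_weight : weight given to the MOS term (assumed value, not official)
    """
    if not 1.0 <= p835_mos <= 5.0:
        raise ValueError("MOS must lie in [1, 5]")
    if not 0.0 <= wacc <= 1.0:
        raise ValueError("word accuracy must lie in [0, 1]")
    if not 0.0 <= mos_weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")

    # Rescale MOS to [0, 1] so both terms share the same range (assumption).
    mos_normalized = (p835_mos - 1.0) / 4.0
    return mos_weight * mos_normalized + (1.0 - mos_weight) * wacc


if __name__ == "__main__":
    print(final_score(p835_mos=3.0, wacc=0.8))
```

Putting both terms on a common 0-1 range before weighting keeps either metric from dominating purely because of its scale.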
The Multimodal Information Based Speech Processing (MISP) 2022 Challenge aims to extend the application of signal processing technology in specific scenarios, using audio and video data. We target the home TV scenario, where 2-6 people communicate with each other with TV noise in the background. Our new tracks focus on audio-visual speaker diarization (AVSD), and audio-visual diarization and recognition (AVDR).
This signal processing challenge is designed to get the latest advancements in speech enhancement applied to hearing aids. 430 million people worldwide require rehabilitation to address hearing loss. Yet even in developed countries, only 40% of people who could benefit from hearing aids have them and use them often enough, because they believe that hearing aids perform poorly.
The scenario for the challenge is listening to speech in the presence of typical domestic noise. Entrants will be tasked to enhance speech-in-noise, which is then fed to a fixed hearing aid processor. We will tune this to the hearing characteristics of the listener. Thus you can enter without in-depth knowledge of hearing aids, and just concentrate on the task of de-noising the signal that the hearing aid receives. We will provide the signals captured by the microphones on a pair of behind-the-ear hearing aids. The challenge is to improve the speech intelligibility without excessive loss of quality. To this end, entries will be evaluated using an objective metric that is an average of the Hearing Aid Speech Perception Index (HASPI) and Hearing Aid Speech Quality Index (HASQI).
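The evaluation described above can be sketched as follows. The sketch assumes that HASPI and HASQI values have already been computed for each scene and that both lie in the range 0 to 1; the function names and the averaging over scenes are illustrative assumptions, not the organizers' evaluation code.

```python
from statistics import mean


def scene_score(haspi: float, hasqi: float) -> float:
    """Average the intelligibility (HASPI) and quality (HASQI) indices for one scene."""
    for name, value in (("HASPI", haspi), ("HASQI", hasqi)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} is expected to lie in [0, 1]")
    return 0.5 * (haspi + hasqi)


def overall_score(scene_results):
    """Mean of the per-scene scores; scene_results is a list of (haspi, hasqi) pairs."""
    return mean(scene_score(haspi, hasqi) for haspi, hasqi in scene_results)


if __name__ == "__main__":
    print(overall_score([(0.8, 0.6), (0.4, 0.2)]))
```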
The challenge aims at identifying novel signal processing solutions for discriminating between birds and drones appearing in video sequences. The specific goal is to detect a drone appearing at some time in a scene where birds may also be present, under different conditions. The algorithm should raise an alarm and provide a position estimate only when a drone is present, while not issuing alarms on birds. A dataset for training is made available upon request.
The L3DAS23 Challenge is aimed at encouraging and fostering collaborative research on machine learning for 3D audio signal processing, with a particular focus on 3D speech enhancement (SE) and 3D sound event localization and detection (SELD) in augmented reality applications. Alongside the challenge, we release the L3DAS23 dataset, a set of first-order Ambisonics recordings in reverberant simulated environments, accompanied by a Python API that facilitates the data usage and results submission stage. In the L3DAS22 Challenge, we introduced a novel multichannel audio configuration based on multiple-source and multiple-perspective Ambisonics recordings, performed with an array of two first-order Ambisonics microphones. For the L3DAS23 Challenge, we include additional visual information provided by pictures showing the frontal view from the microphone.
The challenge concerns the analysis and processing of long-term continuous recordings of biosignals from wearable sensors embedded in smartwatches, in order to extract high-level representations of the wearer’s activity and behavior for two downstream tasks: 1) identification of the wearer of the smartwatch, and 2) detection of relapses in patients in the psychotic spectrum. These tasks are of great importance to the biomedical signal processing and psychiatry communities: identifying digital phenotypes from wearable signals can yield useful insights into the distinctive behavioral patterns and relapse course of patients with psychiatric disorders, contributing to early symptom identification and eventually to better outcomes of the disorder. Interested participants are invited to apply their approaches and methods to a large-scale dataset acquired through the e-Prevention project (https://eprevention.gr/), including continuous measurements from accelerometers, gyroscopes and heart rate monitors, as well as information about the daily step count and sleep, collected from patients in the psychotic spectrum for a monitoring period of up to 2.5 years, and a control subgroup for a provisional period of 3 months.
Epilepsy is one of the most common neurological disorders, affecting almost 1% of the population worldwide. Seizures are usually categorized by the seizure onset zone (the area of the brain where the seizure initiates), the progression of the seizure, and the awareness status of the patient who experiences the seizure. Focal onset seizures are the most common type of seizure in adults with epilepsy. Around 30% of patients with epilepsy are not seizure free despite the use of anti-seizure medication (ASM). It is therefore paramount to accurately monitor and log seizures to improve therapeutic decisions. Nevertheless, patients report less than 50% of their seizures. Such under-reporting renders the seizure diary an unreliable method in clinical practice, as well as an unreliable surrogate endpoint in ASM trials. Automated electroencephalography (EEG)-based seizure detection systems are important to objectively detect and register seizures during long-term video-EEG (vEEG) recording. However, this standard full scalp-EEG recording setup is of limited use outside the hospital, and a discreet, wearable device is needed for capturing seizures in the home setting. In this challenge, contestants are asked to train Machine Learning (ML) models to accurately detect seizure events in data obtained using a wearable device.
We propose a six-week challenge on seizure detection using wearable EEG datasets obtained at UZ Leuven. Two separate tasks will be set for the participants:
- Obtain the best overall performance in seizure detection
- Systematically engineer the data in a data-centric task to optimize a given ML model (Chrononet) for seizure detection
Various neuroimaging techniques can be used to investigate how the brain processes sound. Electroencephalography (EEG) is popular because it is relatively easy to conduct and has a high temporal resolution. An increasingly popular method is to relate a person’s EEG to a feature of the natural speech signal they were listening to. This is typically done using linear regression or relatively simple neural networks to predict the EEG signal from the stimulus or to decode the stimulus from the EEG.
In the Auditory-EEG challenge, teams will compete to build the best model to relate speech to EEG. We provide a large auditory EEG dataset containing data from 85 subjects who listen on average to 108 minutes of single-speaker stimuli for a total of 157 hours of data. We define two tasks:
Task 1 match-mismatch: given two segments of speech and a segment of EEG, which segment of speech matches the EEG?
Task 2 regression: reconstruct the speech envelope from the EEG.
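To make the two tasks concrete, the sketch below assumes that a decoder (not shown) has already reconstructed a speech envelope from the EEG segment. It then uses Pearson correlation both as a plausible quality measure for the reconstruction and as a simple decision rule for match-mismatch. The correlation-based rule and all function names are assumptions for illustration, not the challenge's baseline or official metric.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def pick_matching_segment(decoded_envelope, candidate_a, candidate_b):
    """Return 0 if candidate_a correlates at least as well with the decoded envelope, else 1."""
    score_a = pearson(decoded_envelope, candidate_a)
    score_b = pearson(decoded_envelope, candidate_b)
    return 0 if score_a >= score_b else 1


if __name__ == "__main__":
    decoded = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
    matched = [0.1, 1.1, 2.0, 2.9, 2.1, 0.9]
    mismatched = [3.0, 2.0, 1.0, 0.0, 1.0, 2.0]
    print(pick_matching_segment(decoded, matched, mismatched))
```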
Spoken Language Understanding (SLU) is a critical component of conversational voice assistants, requiring the conversion of user utterances into a structured format for task execution. SLU systems typically consist of an ASR component that converts audio to text and an NLU component that converts text to a tree-like structure. Recently, however, end-to-end (E2E) SLU systems have attracted increasing interest as a way to improve quality, model efficiency, and data efficiency. In this task, participants are asked to leverage the Spoken Task Oriented Parsing (STOP) dataset, a multi-domain compositional spoken language understanding dataset, to explore E2E spoken language understanding along three axes: (1) quality, (2) on-device, and (3) low-resource and domain scaling. Five winners will be selected from this challenge based on different criteria and invited to submit a 2-page paper to ICASSP 2023.
The MADReSS SPGC targets a difficult automatic prediction problem of societal and medical relevance, namely, the detection of Alzheimer’s Dementia (AD). Dementia is a category of neurodegenerative diseases that entails a long-term and usually gradual decrease of cognitive functioning. While there has been much interest in automated methods for cognitive impairment detection by the signal processing and machine learning communities, most of the proposed approaches have not investigated which speech features can be generalised and transferred across languages for AD prediction, and little work has been done on acoustic features of the speech signal in multilingual AD detection. The MADReSS Challenge targets this issue by defining a prediction task whereby participants train their models based on English speech data and assess their models’ performance on spoken Spanish data. It is expected that the models submitted to the challenge will focus on acoustic features of the speech signal and discover features whose predictive power is preserved across languages, but other approaches can be considered.
Applying spoken language processing (SLP) technologies to meeting transcripts is crucial for distilling, organizing, and prioritizing information. Meeting transcripts pose two key challenges for SLP tasks. First, meeting transcripts exhibit a wide variety of spoken language phenomena, leading to dramatic performance degradation. Second, meeting transcripts are usually long-form documents with several thousand words or more, posing a great challenge to mainstay Transformer-based models with high computational complexity. Publicly available meeting corpora supporting SLP tasks are quite limited and small in scale, severely hindering the advancement of SLP on meetings. To fuel SLP research on meetings, we launch a General Meeting Understanding and Generation (MUG) challenge. To facilitate the MUG challenge, we construct and release a meeting dataset, the Alimeeting4MUG Corpus (AMC). To the best of our knowledge, AMC is so far the largest meeting corpus in scale and facilitates the most SLP tasks. The MUG challenge includes five tracks, namely, Track 1 Topic Segmentation, Track 2 Topic-level and Session-level Extractive Summarization, Track 3 Topic Title Generation, Track 4 Keyphrase Extraction, and Track 5 Action Item Detection.
The LIMMITS’23 challenge on LIghtweight, Multi-speaker, Multi-lingual Indic Text-to-Speech Synthesis is being organized as part of the Signal Processing Grand Challenge track at ICASSP 2023. As a part of this challenge, TTS corpora in Marathi, Hindi, and Telugu will be released. These TTS corpora are being built in the SYSPIN project at SPIRE lab, Indian Institute of Science (IISc) Bangalore, India. In the SYSPIN project, TTS corpora comprising 40 hours of speech from a male and a female voice artist in each of nine Indian languages (Hindi, Telugu, Marathi, Bengali, Kannada, Bhojpuri, Magahi, Chhattisgarhi, Maithili) are being collected.
The challenge aims to help and encourage the advancement of TTS in Indian languages. The basic challenge is to take the released speech data, build TTS voices, and share the voices in web API form for evaluation. The output from each synthesizer will be evaluated both objectively and subjectively. The primary objective of this challenge is to understand and compare the various approaches to building TTS in a multi-speaker, multi-lingual setting in three Indian languages.
In wireless communications, the pathloss (or large-scale fading coefficient) quantifies the loss of signal strength between a transmitter (Tx) and a receiver (Rx) due to large-scale effects, such as free-space propagation loss and interactions of the radio waves with obstacles that block line of sight (buildings, vehicles, pedestrians), e.g. penetrations, reflections, and diffractions.
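Of the effects listed above, free-space propagation loss is the only one with a simple closed form: in decibels it equals 20 log10(4 pi d f / c) for distance d, carrier frequency f, and speed of light c. The sketch below evaluates that textbook formula only. It ignores obstacles entirely, so it is not a substitute for the ray-tracing simulations behind the challenge dataset, and the function name is chosen for this example.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def free_space_pathloss_db(distance_m: float, frequency_hz: float) -> float:
    """Free-space pathloss in dB between isotropic antennas (no obstacles)."""
    if distance_m <= 0 or frequency_hz <= 0:
        raise ValueError("distance and frequency must be positive")
    ratio = 4.0 * math.pi * distance_m * frequency_hz / SPEED_OF_LIGHT_M_PER_S
    return 20.0 * math.log10(ratio)


if __name__ == "__main__":
    # Example: a 2.4 GHz link over 100 m
    print(round(free_space_pathloss_db(100.0, 2.4e9), 2))
```

Doubling the distance adds about 6 dB of loss, which follows directly from the 20 log10(d) term.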
Many present or envisioned applications in wireless communications explicitly rely on the knowledge of the pathloss function, and thus, estimating pathloss is a crucial task.
Deterministic simulation methods such as ray-tracing are well-known to provide very good estimations of pathloss values. However, their high computational complexity renders them unsuitable for most of the envisioned applications.
In recent years, many research groups have developed deep learning-based methods that achieve accuracy comparable to ray-tracing with computation times that are orders of magnitude lower, making accurate pathloss estimations available for these applications.
In order to foster research and facilitate fair comparisons among the methods, we provide a novel pathloss radio map dataset based on ray-tracing simulations and launch the First Pathloss Radio Map Prediction Challenge. In addition to the pathloss prediction task, the challenge also includes coverage classification as a second independent task, where the locations in a city map should be classified to be above or below a given pathloss value.
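The coverage classification task can be pictured as thresholding a pathloss map. The sketch below represents a city map as a small grid of pathloss values in dB and labels each location by whether its pathloss is at or below the threshold. The grid representation, the convention that `True` means "at or below the threshold", and the accuracy measure are assumptions for illustration; the challenge defines its own data format and evaluation.

```python
def classify_coverage(pathloss_map_db, threshold_db):
    """Label each grid location True if its pathloss is at or below the threshold."""
    return [[value <= threshold_db for value in row] for row in pathloss_map_db]


def classification_accuracy(predicted, reference):
    """Fraction of grid locations where predicted and reference labels agree."""
    pairs = [
        (p, r)
        for predicted_row, reference_row in zip(predicted, reference)
        for p, r in zip(predicted_row, reference_row)
    ]
    if not pairs:
        raise ValueError("maps must not be empty")
    return sum(p == r for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    simulated_map = [[90.0, 120.0], [105.0, 140.0]]   # reference pathloss values in dB
    predicted_map = [[95.0, 100.0], [108.0, 150.0]]   # output of a hypothetical model
    reference_labels = classify_coverage(simulated_map, threshold_db=110.0)
    predicted_labels = classify_coverage(predicted_map, threshold_db=110.0)
    print(classification_accuracy(predicted_labels, reference_labels))
```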
The ICASSP 2023 Speech Signal Improvement Challenge is intended to stimulate research on improving speech signal quality in communication systems. Speech signal quality can be measured with SIG in ITU-T P.835 and is still a top issue in audio communication and conferencing systems. For example, in the ICASSP 2022 Deep Noise Suppression challenge, the improvement in the background (BAK) and overall (OVRL) quality is impressive, but the improvement in the speech signal (SIG) is statistically zero. To improve SIG, the following speech impairment areas must be addressed: coloration, discontinuity, loudness, and reverberation. We provide a dataset and test set for this challenge, and the winners will be determined using an extended crowdsourced implementation of ITU-T P.804.
The ICASSP 2023 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and is still a top issue in audio communication. This is the fourth AEC challenge, and it is enhanced by adding a second track for personalized acoustic echo cancellation, reducing the algorithmic latency to 20 ms, and including a full-band version of AECMOS. We open source two large datasets to train AEC models under both single-talk and double-talk scenarios. These datasets consist of recordings from more than 10,000 real audio devices and human speakers in real environments, as well as a synthetic dataset. We open source an online subjective test framework and provide an online objective metric service for researchers to quickly test their results. The winners of this challenge were selected based on the average Mean Opinion Score (MOS) achieved across all scenarios and the word accuracy rate.
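Word accuracy is commonly computed as one minus the word error rate, where the word error rate is the word-level edit distance between a reference transcript and a recognizer's output, divided by the number of reference words. The sketch below follows that common definition; the challenge's own scoring pipeline (recognizer, text normalization) is not described here, so treat this as an illustration only.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - WER, using word-level Levenshtein distance.

    The result can be negative when the hypothesis contains many insertions.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current

    word_error_rate = previous[-1] / len(ref_words)
    return 1.0 - word_error_rate


if __name__ == "__main__":
    print(word_accuracy("turn on the light", "turn on light"))
```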