
Bringing Multimodal Foundation Models to Hearing Aids

Abstract

Recent work has shown that adaptive speech denoising can reduce the computational overhead of deep learning models running on hearing aid devices by leveraging information about the acoustic scene to condition the main denoising model. This is typically done through an auxiliary encoder that processes fingerprints of the background noise and is trained jointly with the main denoising network. This work explores the hypothesis that using multimodal foundation models (FMs) as pre-trained feature extractors, instead of randomly initialised encoders, can further improve performance. This would extend the promise of FMs to the domain of hearing aids by offloading their execution to external devices (e.g., smartphones), which are queried periodically to analyse the background acoustic scene. We present a series of experiments that put this hypothesis to the test. Our results show that FMs do not bring any additional benefits compared to randomly initialised encoders, and our accompanying analysis shows that the introduction of fingerprints does not affect the denoising as expected, highlighting the need for more research in this promising direction.
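The conditioning scheme described in the abstract can be illustrated with a minimal sketch. The FiLM-style feature modulation, module names, and dimensions below are illustrative assumptions, not the paper's actual architecture: a scene embedding (from either a jointly trained auxiliary encoder or a frozen foundation model) modulates the hidden features of a spectrogram-masking denoiser.

import torch
import torch.nn as nn

class SceneConditionedDenoiser(nn.Module):
    # Toy denoiser whose hidden features are modulated by an
    # acoustic-scene embedding (FiLM-style conditioning, an assumption
    # for illustration). The embedding can come from a randomly
    # initialised encoder trained jointly with the denoiser, or from a
    # frozen pre-trained foundation model.
    def __init__(self, n_freq=257, hidden=256, scene_dim=512):
        super().__init__()
        self.encoder = nn.Linear(n_freq, hidden)      # per-frame spectral encoder
        self.film = nn.Linear(scene_dim, 2 * hidden)  # scene embedding -> scale/shift
        self.decoder = nn.Linear(hidden, n_freq)      # predicts a denoising mask

    def forward(self, noisy_spec, scene_emb):
        # noisy_spec: (batch, frames, n_freq) magnitude spectrogram
        # scene_emb:  (batch, scene_dim) fingerprint of the background scene
        h = torch.relu(self.encoder(noisy_spec))
        gamma, beta = self.film(scene_emb).chunk(2, dim=-1)
        h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)  # broadcast over frames
        mask = torch.sigmoid(self.decoder(h))
        return mask * noisy_spec                        # masked (denoised) spectrogram

# The scene embedding would be refreshed only periodically, e.g. by
# querying a foundation model offloaded to a paired smartphone.
model = SceneConditionedDenoiser()
noisy = torch.randn(2, 100, 257).abs()
scene = torch.randn(2, 512)  # stand-in for an FM or auxiliary-encoder embedding
enhanced = model(noisy, scene)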

ICASSP 2026

IEEE International Conference on Acoustics, Speech and Signal Processing. Barcelona, Spain, May 04-08, 2026.

Authors

A. Triantafyllopoulos • I. Tsangko • B. Schuller

Links

DOI

Research Area

B3 | Multimodal Perception

BibTeXKey: TAT26
