Research

WP2 Natural Voice Processing will investigate how neural systems in living organisms (humans and other animals) and in artificial systems (machines) decode and represent the socio-affective and identity information expressed in voices.

To achieve these cross-disciplinary objectives, we will use a range of experimental methods and data analysis approaches. We will perform psychoacoustic laboratory studies with human participants to understand the cognitive dynamics and decisional mechanisms of voice information decoding, especially across different decoding contexts (e.g., conversational dynamics) and language contexts.

To characterize neurophysiological mechanisms in humans and animals, we will collect brain data in experimental paradigms suitable for EEG, MEG, and fMRI recordings. We will investigate how natural neural networks (human and animal brains) decode and represent socio-affective and identity information conveyed in voices, and compare this to how artificial neural networks (machine learning, deep neural networks, acoustic modeling) decode and represent such information.

We will investigate the behavioral, cognitive, and neurophysiological dynamics both in healthy humans and in patients with conditions (e.g., hearing impairment, cochlear implant use) that considerably affect accurate cognitive and neural voice decoding mechanisms.

Knowledge gained in WP2 will be crucial to the design of more suitable, more socially relevant synthetic voices.


WP3 Synthetic Voice Design will focus on advancing our understanding of synthetic voice perception and optimizing voice synthesis technology. 

To achieve this, we will employ a variety of methodologies. We will use computational voice synthesis and voice manipulation tools to create voice stimuli that exhibit specific intended acoustic features, and thus convey particular perceived person characteristics such as humanlikeness, trustworthiness, attractiveness, or a specific person identity. These stimuli will be combined with behavioural, fMRI, and EEG methods to understand how listeners form perceptual impressions of synthesised voices.

This work will focus on first-impression formation and the perception of naturalness and persuasiveness, and will uncover both the behavioural and neural underpinnings of synthetic voice perception. We will further examine how the synthesis of customised voices can be improved, using vocal tract MRI to acquire anatomical measures alongside computational approaches to voice synthesis.
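One common way to create voice stimuli with controlled acoustic properties is to interpolate between the acoustic feature vectors of two voices, yielding a graded morph continuum for perception experiments. The sketch below is a toy illustration of that idea using linear interpolation over hypothetical feature values (e.g., mean F0 and a formant frequency); it is not the project's actual synthesis pipeline.

```python
import numpy as np

def morph_features(src, tgt, alpha):
    """Linearly interpolate between two acoustic feature vectors.

    alpha = 0.0 reproduces the source voice's features,
    alpha = 1.0 reproduces the target's; intermediate values
    yield graded morphs along the source-target continuum.
    """
    src = np.asarray(src, dtype=float)
    tgt = np.asarray(tgt, dtype=float)
    return (1.0 - alpha) * src + alpha * tgt

# A seven-step continuum between two hypothetical voices,
# described here by [mean F0 in Hz, first formant in Hz]:
voice_a = [120.0, 500.0]
voice_b = [220.0, 600.0]
continuum = [morph_features(voice_a, voice_b, a)
             for a in np.linspace(0.0, 1.0, 7)]
```

In practice, such interpolation is applied inside a synthesis or morphing framework that resynthesises audio from the interpolated parameters, rather than on raw feature values alone.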


WP4 Voice Applications will transfer the basic knowledge gathered as part of WPs 2 and 3 into innovative user-oriented applications in two main domains: Health and Forensics. 

Health. Medical devices that restore a sensation of hearing in hearing-impaired and profoundly deaf individuals have anchored their technology in improving speech recognition scores. However, recent studies have shown that realistic speech communication requires more than understanding sentences: it also depends on the interlocutor's vocal features, dynamic vocal adaptation, and speaker identity and familiarity, which are fundamental social aspects of speech perception.

VoCS research will join this emerging research trend and focus on dimensions beyond speech recognition, aiming to improve the quality of life of hearing-impaired users, to provide the healthcare sector with new approaches for detecting multiple respiratory diseases, and to characterize voice distortion in Parkinson's disease.

Forensics. Individual recognition by voice plays a pivotal role in legal courts worldwide, exemplified by forensic speaker comparison. A major emerging threat to forensic analysis procedures is voice generation by deepfakes, with which novel utterances can be produced in any speaker's voice. This WP aims to advance forensic speaker comparison methods through Data Challenge projects that apply machine learning, leveraging deep speaker embeddings such as x-vectors for automatic speaker recognition and models such as LCNN and AASIST for detecting spoofed (deepfake) voices, and by exploring new dimensions for individual recognition, particularly temporal dimensions expected to carry robust speaker-specific information resistant to the noise often present in forensically relevant material.
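In embedding-based speaker comparison, two utterances are each mapped to a fixed-length vector (e.g., an x-vector), and the vectors are compared with a similarity score such as cosine similarity; the same-speaker hypothesis is accepted when the score exceeds a threshold calibrated on development data. The sketch below shows only this scoring step, with plain NumPy arrays standing in for real embeddings and an illustrative, uncalibrated threshold.

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings.

    Returns a value in [-1, 1]; higher values indicate that the
    two utterances are acoustically more similar in embedding space.
    """
    return float(np.dot(emb_a, emb_b)
                 / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray,
                 threshold: float = 0.7) -> bool:
    """Accept the same-speaker hypothesis above a score threshold.

    The threshold here is purely illustrative; in a real system it is
    calibrated on held-out trials, often after score normalization.
    """
    return cosine_score(emb_a, emb_b) > threshold
```

Forensic practice typically goes further, reporting calibrated likelihood ratios rather than a hard accept/reject decision, but the pairwise embedding comparison above is the core operation the more elaborate pipelines build on.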