Limitations of AI models in reproducing the expressiveness and depth of human verbal communication
In the heart of the University of Pennsylvania, the Penn Phonetics Laboratory has been a hub for research on human speech and its digital counterparts. This summer, Associate Professor of Linguistics Jianjing Kuang mentored three undergraduate students, Kevin Li, Henry Huang, and Ethan Yang, in a research project comparing human and AI speech in both production and perception.
The research project, part of the Penn Undergraduate Research Mentoring Program (PURM), focused on analyzing the capabilities of various text-to-speech (TTS) platforms. The team, led by Li and Huang, found that most TTS models struggled to place emphasis on the correct word in a sentence the way human speakers do. However, companies such as ElevenLabs and AIdentical have developed platforms with outstanding capabilities for generating prosodically emphasized sentences, offering highly realistic speech synthesis with control over timing, emphasis, and emotion.
To conduct their research, the students used the software Praat to analyze acoustic measures such as the pitch, intensity, and duration of words. They generated the sentence "Molly mailed a melon" with 15 AI text-to-speech platforms and ran a perception experiment, asking human listeners to rate the naturalness of each audio clip and identify whether the speaker was human or AI.
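For readers curious what such an acoustic analysis looks like in practice, here is a minimal sketch using Parselmouth, a Python interface to Praat, to pull pitch, intensity, and duration from a recording. The file name is hypothetical, and the team's actual workflow may have used Praat directly rather than through Python:

```python
import parselmouth

# Hypothetical file name; any mono WAV of the test sentence would work.
snd = parselmouth.Sound("molly_mailed_a_melon.wav")

# Fundamental frequency (pitch) track; unvoiced frames come back as 0 Hz.
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
voiced = f0[f0 > 0]
print(f"Mean F0: {voiced.mean():.1f} Hz")

# Intensity contour in dB.
intensity = snd.to_intensity()
print(f"Mean intensity: {intensity.values.mean():.1f} dB")

# Total utterance duration in seconds; per-word durations require a
# time-aligned annotation (see the TextGrid sketch further below).
print(f"Duration: {snd.duration:.3f} s")
```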
The results were revealing. Listeners identified human versus AI speakers with very high accuracy, suggesting that AI speech is still far from human-like. The average duration of the word "mailed" was significantly longer in the human recordings than in any of the TTS systems. Furthermore, there was substantial variability among the TTS models: some were unable to emphasize certain words, while others produced incorrect intonation.
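Word-level duration measurements like the one behind that comparison are typically read off a time-aligned annotation. A minimal sketch, assuming each recording has a Praat TextGrid with a word tier (tier 1) produced by manual labeling or forced alignment; the file names and the word_duration helper are hypothetical:

```python
import parselmouth
from parselmouth.praat import call

def word_duration(textgrid_path: str, word: str, tier: int = 1) -> float:
    """Return the duration in seconds of the first interval labeled `word`."""
    tg = parselmouth.read(textgrid_path)
    n_intervals = int(call(tg, "Get number of intervals", tier))
    for i in range(1, n_intervals + 1):  # Praat intervals are 1-indexed
        if call(tg, "Get label of interval", tier, i).strip().lower() == word:
            start = call(tg, "Get starting point", tier, i)
            end = call(tg, "Get end point", tier, i)
            return end - start
    raise ValueError(f"'{word}' not found on tier {tier}")

# Compare the duration of "mailed" in a human and a TTS recording.
for name in ["human_01", "tts_platform_01"]:
    dur = word_duration(f"{name}.TextGrid", "mailed")
    print(f"{name}: 'mailed' lasts {dur * 1000:.0f} ms")
```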
Ethan Yang, a third-year mechanical engineering major from Diamond Bar, California, learned to control intonation in TTS models through the project. He found that the systems had an easier time emphasizing the sentence-initial word "Molly" than words later in the sentence.
Professor Kuang emphasizes the importance of building bridges between science and industry, both to improve AI speech and to better understand what makes human speech unique. The project highlights ongoing efforts to create more human-like AI speech and the potential for collaboration between academia and industry in this field.