Exploring the Quirks of Voice AI: Insights from Moshi and EVI

The realm of Voice AI is expanding rapidly, with advances like Kyutai’s Moshi and Hume’s emotional AI EVI capturing attention for their innovative capabilities. Moshi, built by the French AI lab Kyutai, speaks with a French accent and promises portability to devices like laptops and smartphones. Like OpenAI’s GPT-4o, it specializes in speech-to-speech interaction, though it struggles to stay coherent over longer conversations.

Curiosity led me to orchestrate a dialogue between Moshi and Hume’s EVI, an emotional AI known for its therapeutic applications. What ensued was an unexpected twist: during a brief pause, Moshi emitted a spine-chilling scream that left me unsettled. Both AI systems attributed it to a “sound” or “glitch,” and subsequent attempts to reproduce the scream failed, suggesting a vocalization anomaly possibly triggered by ambient noise in my office.

Reflecting on this experiment, it’s evident that pairing AIs can yield unpredictable outcomes, from invented languages to unsettling exchanges, often because of their limited capacity to navigate absurdity. In this instance, Moshi and EVI ran in separate Chrome windows on my laptop, and browser sandboxing likely prevented any direct communication between them; the exchange happened entirely through audio.

The vocal glitch from Moshi underscores the challenges facing smaller AI models, which lack the extensive training data and scale of their larger counterparts. Despite its quirks, Moshi occasionally produces poignant responses, hinting at emotional depth from a model of only 7 billion parameters, and Kyutai plans to expand it through open-source releases.
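
For a rough sense of why a 7-billion-parameter model is plausible on a laptop or smartphone, a back-of-the-envelope memory estimate helps. The precision and quantization choices below are illustrative assumptions, not Kyutai’s published deployment details:

```python
# Back-of-the-envelope memory footprint for a 7B-parameter model.
# The precisions here are illustrative assumptions, not Kyutai's
# published deployment details.
params = 7e9

for label, bytes_per_param in [
    ("fp32", 4.0),   # full precision
    ("fp16", 2.0),   # half precision, typical for GPU inference
    ("int4", 0.5),   # 4-bit quantization, typical for on-device use
]:
    gb = params * bytes_per_param / 1e9
    print(f"{label}: ~{gb:.1f} GB of weights")

# Prints roughly:
#   fp32: ~28.0 GB of weights
#   fp16: ~14.0 GB of weights
#   int4: ~3.5 GB of weights
```

At 4-bit precision the weights alone would fit comfortably in a modern phone’s memory, which is consistent with Kyutai’s portability claims.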

When tested on separate devices, Moshi and EVI engaged in a polite yet stilted exchange, trading phrases like “I’m here to help” and “No, you first” rather than sustaining a fluid dialogue. Both AIs also struggled to acknowledge their artificial nature, highlighting the complexity of emotional tracking in smaller models.

Further experiments pairing Moshi with GPT-4o’s Basic Voice underscored both the potential and the limitations of Voice AI. Basic Voice lacks native speech-to-speech capabilities, instead chaining speech recognition, a text model, and speech synthesis, but with some intervention it engaged in a constructive discussion of improving AI models through refined training data.
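
To make that architectural difference concrete, here is a minimal sketch of a cascaded voice pipeline of the kind Basic Voice represents; every function is a hypothetical stand-in rather than any vendor’s actual API:

```python
# Sketch of a cascaded ("non-native") voice pipeline:
# audio -> text -> LLM -> text -> audio.
# All three stages are hypothetical stand-ins, not real APIs.

def transcribe(audio: bytes) -> str:
    """Speech-to-text stage (e.g., a Whisper-class model)."""
    return "placeholder transcription"

def generate_reply(text: str) -> str:
    """Text-only language model produces the response."""
    return f"reply to: {text}"

def synthesize(text: str) -> bytes:
    """Text-to-speech stage renders the reply as audio."""
    return text.encode()  # stand-in for synthesized audio

def cascaded_turn(audio_in: bytes) -> bytes:
    # Tone, prosody, and timing are discarded at the transcription
    # boundary; only the words reach the language model.
    text_in = transcribe(audio_in)
    text_out = generate_reply(text_in)
    return synthesize(text_out)
```

A native speech-to-speech model like Moshi collapses these stages into a single model operating directly on audio, which preserves paralinguistic cues and cuts per-stage latency, but also means odd audio behavior, like a scream, can surface directly in the output.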

Looking ahead, Voice AI promises transformative shifts in human-computer interactions, from smart glasses to intuitive assistants. However, challenges persist, including ensuring seamless inter-AI communication without triggering existential quandaries or bone-chilling surprises.

As we work through these teething issues, it’s clear that Voice AI’s evolution hinges on resolving them. Unexpected screams may catch us off guard, but they serve as reminders of the work still ahead in refining AI interactions and harnessing their full potential in our evolving digital landscape.