Sesame AI is dedicated to creating dialogue experiences with "voice presence," making AI voice interactions more natural, emotional, and human-like.
At Sesame, our goal is to achieve "voice presence" - the magical quality that makes spoken interactions feel real, understood, and valued. We're creating conversational partners that don't just process requests; they engage in genuine dialogue, building confidence and trust over time.
Through this, we aim to realize the untapped potential of voice as the ultimate interface for guidance and understanding.
Emotional intelligence: reading and responding to emotional context
Conversational dynamics: natural timing, pauses, interruptions, and emphasis
Contextual awareness: adjusting tone and style to match context
Consistent personality: maintaining a coherent, reliable, and appropriate presence
To create truly interactive AI companions, voice generation must go beyond producing high-quality audio—it must understand and adapt to context in real-time. Traditional Text-to-Speech (TTS) models generate spoken output directly from text but lack the contextual awareness needed for natural dialogue.
Even recent models that produce highly human-like speech still struggle with the one-to-many problem: there are countless valid ways to say a sentence, but only some fit a specific context. Without additional context—including tone, rhythm, and conversation history—models lack information to choose the best option.
To address this, we introduce the Conversational Speech Model (CSM), which frames the problem as an end-to-end multimodal learning task using transformers. It leverages conversation history to produce more natural and coherent speech.
Traditional benchmarks like Word Error Rate (WER) and Speaker Similarity (SIM) have saturated—modern models, including CSM, now achieve near-human performance on these metrics.
To better evaluate pronunciation and context understanding, we introduce a new set of speech transcription-based benchmarks.
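For readers unfamiliar with the Word Error Rate metric mentioned above, the following is a minimal sketch of how it is commonly computed: the word-level edit distance between a reference transcript and a transcript of the generated audio, divided by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-based normalization are assumptions made for illustration; this is not Sesame's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: normalization here is just lowercasing and
    whitespace splitting, which real evaluation pipelines usually extend.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A score of 0.0 means the transcript matches the reference exactly. Because many different renditions of a sentence transcribe to the same words, a low score says little about whether the prosody fits the conversation, which is one reason the metric can saturate.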
We conducted two Comparative Mean Opinion Score (CMOS) studies on the Expresso dataset to evaluate the naturalness and prosodic appropriateness of speech generated by CSM-Medium.
The graph shows win rates between real human recordings and CSM-generated speech samples in both studies. Without conversational context, human evaluators showed no clear preference between generated and real speech, indicating saturation in naturalness. However, when context was included, evaluators consistently preferred original recordings. These findings suggest that there remains a significant gap in prosody between generated and human speech in conversational contexts.
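As a rough illustration of how win rates like these can be derived from pairwise listening tests, the sketch below assumes each evaluator judgment has already been reduced to one of three labels: `"human"`, `"generated"`, or `"tie"`. The label names, the `win_rates` function, and the choice to split ties evenly are all assumptions for this example and are not taken from the study itself.

```python
from collections import Counter

VALID_LABELS = {"human", "generated", "tie"}

def win_rates(preferences):
    """Return the share of comparisons won by each side.

    `preferences` holds one label per rated pair. Ties are split evenly
    between the two sides (an assumption of this sketch).
    """
    counts = Counter(preferences)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one preference is required")

    tie_share = counts["tie"] / 2
    return {
        "human": (counts["human"] + tie_share) / total,
        "generated": (counts["generated"] + tie_share) / total,
    }

# Hypothetical ratings, not data from the study.
example = ["human", "generated", "tie", "human"]
rates = win_rates(example)
```

Under this convention, a result near 0.5 for both sides corresponds to the "no clear preference" outcome described above, while a human win rate well above 0.5 corresponds to evaluators preferring the original recordings.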
Sesame AI is a platform focused on voice interaction technology. Its voice presence technology aims to make AI voice interactions more natural and human-like.
At the core of Sesame AI is the Conversational Speech Model (CSM), an end-to-end multimodal learning system that uses conversation history to generate more natural and emotionally expressive speech.
Sesame AI has unique advantages in voice interaction: advanced emotional intelligence, natural dialogue dynamics, precise context awareness, and stable personality expression. This makes Sesame AI a leader in voice interaction technology.
Using deep learning and affective computing, Sesame AI provides users with more natural, human-like dialogue experiences. Each interaction is optimized for the quality of the voice experience.
Sesame AI can be widely applied in customer service, education and training, medical consultation, and other fields. Whether for daily conversations or professional communication, Sesame AI can provide excellent voice interaction experiences.