Continuing from last week's post on the rise of the Voice Stack, there's an area that today's voice-based systems often struggle with: Voice Activity Detection (VAD) and the turn-taking paradigm of communication.

When communicating with a text-based chatbot, the turns are clear: you write something, then the bot does, then you do, and so on. The success of text-based chatbots with clear turn-taking has influenced the design of voice-based bots, most of which also use the turn-taking paradigm. A key part of building such a system is a VAD component to detect when the user is talking. This allows our software to take the parts of the audio stream in which the user is saying something and pass them to the model for the user's turn. It also supports interruption in a limited way: if a user insistently interrupts the AI system while it is talking, eventually the VAD system will realize the user is talking, shut off the AI's output, and let the user take a turn.

This works reasonably well in quiet environments. However, VAD systems today struggle with noisy environments, particularly when the background noise is other human speech. For example, if you are in a noisy cafe speaking with a voice chatbot, VAD, which is usually trained to detect human speech, tends to be inaccurate at figuring out when you, or someone else, is talking. (In comparison, it works much better if you are in a noisy vehicle, since the background noise is more clearly not human speech.) It might think you are interrupting when it was merely someone in the background speaking, or fail to recognize that you've stopped talking. This is why today's speech applications often struggle in noisy environments.

Intriguingly, last year, Kyutai Labs published Moshi, a model that had many technical innovations. An important one was enabling persistent, bi-directional audio streams from the user to Moshi and from Moshi to the user. If you and I were speaking in person or on the phone, we would constantly be streaming audio to each other (through the air or the phone system), and we'd use social cues to know when to listen and how to politely interrupt if one of us felt the need. Thus, the streams would not need to explicitly model turn-taking. Moshi works like this. It's listening all the time, and it's up to the model to decide when to stay silent and when to talk. This means an explicit VAD step is no longer necessary.

Just as the architecture of text-only transformers has gone through many evolutions, voice models are going through a lot of architecture exploration. Given the importance of foundation models with voice-in and voice-out capabilities, many large companies right now are investing in developing better voice models. I'm confident we'll see many more good voice models released this year.

[Reached length limit; full text: https://lnkd.in/g9wGsPb2 ]
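To make the turn-taking approach concrete, here is a minimal sketch of a VAD-gated turn detector built on the open-source webrtcvad package. The frame length, aggressiveness level, and the roughly 700 ms silence threshold for ending a turn are illustrative assumptions, not the design of any particular product mentioned in the post.

```python
# Minimal sketch: segment an audio stream into user "turns" using webrtcvad.
# Assumptions (not from the post): 16 kHz 16-bit mono PCM, 30 ms frames,
# and ~700 ms of continuous silence marks the end of a turn.
import webrtcvad

SAMPLE_RATE = 16000                     # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30                           # frames must be 10, 20, or 30 ms long
END_OF_TURN_FRAMES = 700 // FRAME_MS    # consecutive silent frames that end a turn

def user_turns(frames, aggressiveness=2):
    """Yield one bytes buffer per detected user turn.

    `frames` is any iterable of 30 ms, 16-bit mono PCM frames (e.g., from a mic).
    """
    vad = webrtcvad.Vad(aggressiveness)  # 0 = most permissive, 3 = most aggressive
    turn, silent = bytearray(), 0
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            turn.extend(frame)
            silent = 0
        elif turn:
            silent += 1
            if silent >= END_OF_TURN_FRAMES:
                yield bytes(turn)        # hand this segment to the model as the user's turn
                turn, silent = bytearray(), 0
    if turn:
        yield bytes(turn)
```

The weakness described above falls straight out of this loop: in a noisy cafe, is_speech also fires on background speakers, so turns get cut off too early, too late, or attributed to the wrong voice. A full-duplex model like Moshi sidesteps the gating step entirely.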
User Experience for Voice Interfaces
-
-
Over the last year, I've seen many people fall into the same trap: They launch an AI-powered agent (chatbot, assistant, support tool, etc.)… but only track surface-level KPIs, like response time or number of users. That's not enough.

To create AI systems that actually deliver value, we need holistic, human-centric metrics that reflect:
• User trust
• Task success
• Business impact
• Experience quality

This infographic highlights 15 essential dimensions to consider:
↳ Response Accuracy: Are your AI answers actually useful and correct?
↳ Task Completion Rate: Can the agent complete full workflows, not just answer trivia?
↳ Latency: Response speed still matters, especially in production.
↳ User Engagement: How often are users returning or interacting meaningfully?
↳ Success Rate: Did the user achieve their goal? This is your north star.
↳ Error Rate: Irrelevant or wrong responses? That's friction.
↳ Session Duration: Longer isn't always better; it depends on the goal.
↳ User Retention: Are users coming back after the first experience?
↳ Cost per Interaction: Especially critical at scale. Budget-wise agents win.
↳ Conversation Depth: Can the agent handle follow-ups and multi-turn dialogue?
↳ User Satisfaction Score: Feedback from actual users is gold.
↳ Contextual Understanding: Can your AI remember and refer to earlier inputs?
↳ Scalability: Can it handle volume without degrading performance?
↳ Knowledge Retrieval Efficiency: This is key for RAG-based agents.
↳ Adaptability Score: Is your AI learning and improving over time?

If you're building or managing AI agents, bookmark this. Whether it's a support bot, GenAI assistant, or a multi-agent system, these are the metrics that will shape real-world success.

Did I miss any critical ones you use in your projects? Let's make this list even stronger; drop your thoughts below.
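To make a few of these dimensions concrete, here is a minimal sketch that computes task completion rate, error rate, average latency, cost per interaction, and a rough retention proxy from logged agent interactions. The field names are assumptions about what you log, not a standard schema.

```python
# Minimal sketch: computing a handful of the metrics above from interaction logs.
# The Interaction fields are illustrative assumptions, not a standard schema.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Interaction:
    user_id: str
    latency_ms: float
    completed_task: bool   # did the agent finish the full workflow?
    was_error: bool        # irrelevant or wrong response, per user flag or judge
    cost_usd: float        # model tokens, telephony, TTS, etc.

def agent_kpis(logs: list[Interaction]) -> dict:
    n = len(logs)
    if n == 0:
        return {}
    by_user = Counter(i.user_id for i in logs)
    return {
        "task_completion_rate": sum(i.completed_task for i in logs) / n,
        "error_rate": sum(i.was_error for i in logs) / n,
        "avg_latency_ms": sum(i.latency_ms for i in logs) / n,
        "cost_per_interaction_usd": sum(i.cost_usd for i in logs) / n,
        # retention proxy: share of users with more than one interaction
        "returning_user_share": sum(c > 1 for c in by_user.values()) / len(by_user),
    }
```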
-
I've open-sourced a key component of one of my latest projects: Voice Lab, a comprehensive testing framework that removes the guesswork from building and optimizing voice agents across language models, prompts, and personas. Speech is increasingly becoming a prominent modality companies employ to enable user interaction with their products, yet the AI community is still figuring out systematic evaluation for such applications.

Key features:
(1) Metrics and analysis: define custom metrics like brevity or helpfulness in JSON format and evaluate them using LLM-as-a-Judge. No more manual reviews.
(2) Model migration and cost optimization: confidently switch between models (e.g., from GPT-4 to smaller models) while evaluating performance and cost trade-offs.
(3) Prompt and performance testing: systematically test multiple prompt variations and simulate diverse user interactions to fine-tune agent responses.
(4) Testing different agent personas, from an angry United Airlines representative to a hotel receptionist who tries to jailbreak your agent to book all available rooms.

While designed for voice agents, Voice Lab is versatile and can evaluate any LLM-based agent. ⭐️ I invite the community to contribute and would highly appreciate your support by starring the repo to make it more discoverable for others.

GitHub repo (commercially permissive): https://lnkd.in/gAaZ-tkA
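As an illustration of the LLM-as-a-Judge pattern behind feature (1), here is a hypothetical sketch, not Voice Lab's actual API (see the repo for that): metric definitions live in JSON, and a judge model scores a transcript against each one. The judge model name and prompt wording are assumptions.

```python
# Hypothetical sketch of LLM-as-a-Judge with JSON metric definitions.
# Not Voice Lab's API; just the general pattern it describes.
import json
from openai import OpenAI

client = OpenAI()

METRICS = json.loads("""
[
  {"name": "brevity",     "definition": "Responses are concise and avoid filler."},
  {"name": "helpfulness", "definition": "Responses move the caller toward their goal."}
]
""")

def judge(transcript: str, metric: dict, judge_model: str = "gpt-4o-mini") -> dict:
    """Ask a judge model for a 1-5 score and a short rationale for one metric."""
    prompt = (
        f"Metric: {metric['name']}. {metric['definition']}\n"
        "Score the agent in this transcript from 1 (poor) to 5 (excellent).\n"
        'Reply as JSON: {"score": <int>, "rationale": "<one sentence>"}\n\n'
        f"Transcript:\n{transcript}"
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# scores = {m["name"]: judge(my_transcript, m) for m in METRICS}
```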
-
AI changes how we measure UX. We've been thinking and iterating on how we track user experiences with AI. In our open Glare framework, we use a mix of attitudinal, behavioral, and performance metrics. AI tools open the door to customizing metrics based on how people use each experience. I'd love to hear who else is exploring this.

To measure UX in AI tools, it helps to follow the user journey and match the right metrics to each step. Here's a simple way to break it down:

1. Before using the tool: Start by understanding what users expect and how confident they feel. This gives you a sense of their goals and trust levels.
2. While prompting: Track how easily users explain what they want. Look at how much effort it takes and whether the first result is useful.
3. While refining the output: Measure how smoothly users improve or adjust the results. Count retries, check how well they understand the output, and watch for moments when the tool really surprises or delights them.
4. After seeing the results: Check if the result is actually helpful. Time-to-value and satisfaction ratings show whether the tool delivered on its promise.
5. After the session ends: See what users do next. Do they leave, return, or keep using it? This helps you understand the lasting value of the experience.

We need sharper ways to measure how people use AI. Clicks can't tell the whole story. But getting this data is not easy. What matters is whether the experience builds trust, sparks creativity, and delivers something users feel good about. These are the signals that show us if the tool is working, not just technically, but emotionally and practically. How are you thinking about this?

#productdesign #uxmetrics #productdiscovery #uxresearch
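As a rough illustration (this is not the Glare framework itself), stage-based measurement can be as simple as tagging each logged event with a journey stage and aggregating per stage. The stage names and measure keys below are assumptions for the sketch.

```python
# Minimal sketch: journey-stage metrics for an AI tool. Stage names and the
# measures dict keys are illustrative assumptions, not a standard taxonomy.
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean

class Stage(Enum):
    BEFORE = "before_use"      # expectations, confidence
    PROMPTING = "prompting"    # effort, first-result usefulness
    REFINING = "refining"      # retries, comprehension, delight
    RESULTS = "results"        # time-to-value, satisfaction
    AFTER = "after_session"    # leave, return, keep using

@dataclass
class StageEvent:
    user_id: str
    stage: Stage
    measures: dict = field(default_factory=dict)  # e.g. {"retries": 2, "satisfaction": 4}

def stage_average(events: list[StageEvent], stage: Stage, key: str):
    """Average one numeric measure across all events logged for a journey stage."""
    values = [e.measures[key] for e in events if e.stage is stage and key in e.measures]
    return mean(values) if values else None

# stage_average(events, Stage.REFINING, "retries")  -> mean retries while refining output
```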
-
Is This the Future of Human-AI Interaction? Sesame's "Voice Presence" is Astonishing.

Have you ever truly felt like you were having a conversation with an AI? Sesame, founded by Oculus co-founder Brendan Iribe, is pushing the boundaries of AI voice technology with its Conversational Speech Model (CSM). The results are striking. As The Verge's Sean Hollister noted, it's "the first voice assistant I've ever wanted to talk to more than once." Why? Because Sesame focuses on "voice presence," creating spoken interactions that feel genuinely real and understood.

What's the potential impact for businesses?
• Enhanced Customer Service: Imagine AI assistants that can handle complex inquiries with empathy and natural conversation flow.
• Improved Accessibility: More natural voice interfaces can make technology accessible to more users.
• Revolutionized Content Creation: Voice models like Maya and Miles could open up new audio and video content possibilities.
• Training and Education: Interactive AI tutors could provide personalized and engaging learning experiences.

The most impressive part? In blind listening tests, humans often couldn't distinguish Sesame's AI from real human recordings.

#AI #ArtificialIntelligence #VoiceTechnology #Innovation #FutureofWork #CustomerExperience #MachineLearning #SesameAI
-
A lot of us still rely on simple trend lines or linear regression when analyzing how user behavior changes over time. But in recent years, the tools available to us have evolved significantly. For behavioral and UX data, especially when it's noisy, nonlinear, or limited, there are now better methods to uncover meaningful patterns.

Machine learning models like LSTMs can be incredibly useful when you're trying to understand patterns that unfold across time. They're good at picking up both short-term shifts and long-term dependencies, like how early frustration might affect engagement later in a session. If you want to go further, newer models that combine graph structures with time series, like graph-based recurrent networks, help make sense of how different behaviors influence each other.

Transformers, originally built for language processing, are also being used to model behavior over time. They're especially effective when user interactions don't follow a neat, regular rhythm. What's interesting about transformers is their ability to highlight which time windows matter most, which makes them easier to interpret in UX research.

Not every trend is smooth or gradual. Sometimes we're more interested in when something changes, like a sudden drop in satisfaction after a feature rollout. This is where change point detection comes in. Methods like Bayesian Online Change Point Detection or PELT can find those key turning points, even in noisy data or with few observations.

When trends don't follow a straight line, generalized additive models (GAMs) can help. Instead of fitting one global line, they let you capture smooth curves and more realistic patterns. For example, users might improve quickly at first but plateau later; GAMs are built to capture that shape.

If you're tracking behavior across time and across users or teams, mixed-effects models come into play. These models account for repeated measures or nested structures in your data, like individual users within groups or cohorts. The Bayesian versions are especially helpful when your dataset is small or uneven, which happens often in UX research.

Some researchers go a step further by treating behavior over time as continuous functions. This lets you compare entire curves rather than just time points. Others use matrix factorization methods that simplify high-dimensional behavioral data, like attention logs or biometric signals, into just a few evolving patterns.

Understanding not just what changed, but why, is becoming more feasible too. Techniques like Gaussian graphical models and dynamic Bayesian networks are now used to map how one behavior might influence another over time, offering deeper insights than simple correlations. And for those working with small samples, new Bayesian approaches are built exactly for that. Some use filtering to maintain accuracy with limited data, and ensemble models are proving useful for increasing robustness when datasets are sparse or messy.
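To ground one of these methods, here is a short sketch of change-point detection with the open-source ruptures package (PELT). The synthetic "satisfaction drops after a feature rollout" series and the penalty value are illustrative assumptions.

```python
# Short sketch: detecting a change point in a noisy satisfaction series with
# ruptures (PELT). The data and penalty value are illustrative assumptions.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
# Simulated daily satisfaction scores: stable for 60 days, then a drop after a rollout.
signal = np.concatenate([rng.normal(4.2, 0.3, 60), rng.normal(3.4, 0.3, 40)])

algo = rpt.Pelt(model="rbf").fit(signal)
breakpoints = algo.predict(pen=10)   # indices where behavior shifts; last value is len(signal)
print(breakpoints)                   # e.g. [60, 100]: a change detected near day 60
```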
-
OpenAI just released their new GPT-4o-mini-tts model, and the "steerability" feature is a big step forward. This means developers can now guide voice agents not only on what to say, but exactly how to say it, allowing precise control over tone, emotion, and style.

Traditionally, voice agents worked through a multi-step pipeline: converting speech to text (STT), processing it with an AI model, and then converting the text back to speech (TTS). But newer innovations, like OpenAI's model and the voice-to-voice approach, streamline this process. They enable voice agents to directly understand and respond to speech without translating into text first, reducing delays and significantly improving realism and emotional depth.

At Element451, we're excited by this innovation because it empowers higher education institutions to create customized voice agents. Schools can now design agents or even mascots that truly reflect their personality and brand, dramatically enhancing the student experience. We dive deeper into this topic in our latest #GenerationAI podcast episode, exploring exactly how voice agents are transforming higher education. https://lnkd.in/eXH3ts3x
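For developers who want to try steerability, here is a small sketch using the OpenAI Python SDK's speech endpoint, where an instructions field describes how the line should be delivered. The voice name, mascot phrasing, and file handling are assumptions; check OpenAI's current audio docs for the exact options.

```python
# Small sketch: steerable TTS with gpt-4o-mini-tts. The `instructions` field
# controls delivery style; voice choice and wording here are assumptions.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Welcome to campus! Your orientation session starts in the main hall at 9 a.m.",
    instructions="Speak as a warm, upbeat university mascot: friendly and energetic, not rushed.",
) as response:
    response.stream_to_file("welcome.mp3")  # write the synthesized audio to disk
```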