User Experience for Voice Interfaces

Explore top LinkedIn content from expert professionals.

  • Andrew Ng

    Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of LandingAI

    2,287,294 followers

    Continuing from last week’s post on the rise of the Voice Stack, there’s an area that today’s voice-based systems often struggle with: Voice Activity Detection (VAD) and the turn-taking paradigm of communication. When communicating with a text-based chatbot, the turns are clear: you write something, then the bot does, then you do, and so on. The success of text-based chatbots with clear turn-taking has influenced the design of voice-based bots, most of which also use the turn-taking paradigm.

    A key part of building such a system is a VAD component to detect when the user is talking. This allows our software to take the parts of the audio stream in which the user is saying something and pass that to the model for the user’s turn. It also supports interruption in a limited way: if a user insistently interrupts the AI system while it is talking, eventually the VAD will realize the user is talking, shut off the AI’s output, and let the user take a turn.

    This works reasonably well in quiet environments. However, VAD systems today struggle with noisy environments, particularly when the background noise is other human speech. For example, if you are in a noisy cafe speaking with a voice chatbot, VAD, which is usually trained to detect human speech, tends to be inaccurate at figuring out when you, or someone else, is talking. (In comparison, it works much better if you are in a noisy vehicle, since the background noise is more clearly not human speech.) It might think you are interrupting when it was merely someone in the background speaking, or fail to recognize that you’ve stopped talking. This is why today’s speech applications often struggle in noisy environments.

    Intriguingly, last year Kyutai Labs published Moshi, a model with many technical innovations. An important one was enabling persistent bi-directional audio streams from the user to Moshi and from Moshi to the user. If you and I were speaking in person or on the phone, we would constantly be streaming audio to each other (through the air or the phone system), and we’d use social cues to know when to listen and how to politely interrupt if one of us felt the need. Thus, the streams would not need to explicitly model turn-taking. Moshi works like this: it’s listening all the time, and it’s up to the model to decide when to stay silent and when to talk. This means an explicit VAD step is no longer necessary.

    Just as the architecture of text-only transformers has gone through many evolutions, voice models are going through a lot of architecture exploration. Given the importance of foundation models with voice-in and voice-out capabilities, many large companies right now are investing in developing better voice models. I’m confident we’ll see many more good voice models released this year. [Reached length limit; full text: https://lnkd.in/g9wGsPb2 ]
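
Editor's note: as a rough illustration of the VAD-gated turn-taking loop the post describes (not Moshi's approach or any production system), here is a minimal energy-threshold sketch in Python. The frame size, threshold, and hangover values are invented for the example; real deployments use trained speech/non-speech models such as WebRTC VAD or neural VADs, which is exactly why background speech in a cafe trips them up.

```python
# Minimal energy-threshold VAD sketch for turn-taking (illustrative only).
# Assumes 16 kHz, 16-bit mono PCM audio arriving in 30 ms frames.
import numpy as np

SAMPLE_RATE = 16_000
FRAME_MS = 30
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

def frame_energy(frame: np.ndarray) -> float:
    """Root-mean-square energy of one int16 audio frame."""
    x = frame.astype(np.float32) / 32768.0
    return float(np.sqrt(np.mean(x * x)))

class TurnTaker:
    """Tracks whether the user is speaking and when their turn has ended."""

    def __init__(self, threshold=0.02, hangover_frames=20):
        self.threshold = threshold        # energy above this counts as speech
        self.hangover = hangover_frames   # silent frames before ending a turn
        self.silent_run = 0
        self.user_speaking = False

    def process(self, frame: np.ndarray) -> str:
        is_speech = frame_energy(frame) > self.threshold
        if is_speech:
            self.silent_run = 0
            if not self.user_speaking:
                self.user_speaking = True
                return "user_started"      # e.g., stop the agent's TTS output
        elif self.user_speaking:
            self.silent_run += 1
            if self.silent_run >= self.hangover:
                self.user_speaking = False
                return "user_finished"     # pass the buffered turn to the model
        return "no_change"

# Example with synthetic audio: quiet noise, then one second of louder "speech".
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    quiet = rng.normal(0, 100, SAMPLE_RATE).astype(np.int16)
    loud = rng.normal(0, 3000, SAMPLE_RATE).astype(np.int16)
    audio = np.concatenate([quiet, loud, quiet])
    taker = TurnTaker()
    for i in range(0, len(audio) - FRAME_SAMPLES, FRAME_SAMPLES):
        event = taker.process(audio[i : i + FRAME_SAMPLES])
        if event != "no_change":
            print(f"{i / SAMPLE_RATE:.2f}s: {event}")
```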

  • Brij kishore Pandey

    AI Architect | Strategist | Generative AI | Agentic AI

    687,358 followers

    Over the last year, I’ve seen many people fall into the same trap: they launch an AI-powered agent (chatbot, assistant, support tool, etc.) but only track surface-level KPIs, like response time or number of users. That’s not enough. To create AI systems that actually deliver value, we need holistic, human-centric metrics that reflect:
    • User trust
    • Task success
    • Business impact
    • Experience quality

    This infographic highlights 15 essential dimensions to consider:
    ↳ Response Accuracy: Are your AI answers actually useful and correct?
    ↳ Task Completion Rate: Can the agent complete full workflows, not just answer trivia?
    ↳ Latency: Response speed still matters, especially in production.
    ↳ User Engagement: How often are users returning or interacting meaningfully?
    ↳ Success Rate: Did the user achieve their goal? This is your north star.
    ↳ Error Rate: Irrelevant or wrong responses? That’s friction.
    ↳ Session Duration: Longer isn’t always better; it depends on the goal.
    ↳ User Retention: Are users coming back after the first experience?
    ↳ Cost per Interaction: Especially critical at scale. Budget-wise agents win.
    ↳ Conversation Depth: Can the agent handle follow-ups and multi-turn dialogue?
    ↳ User Satisfaction Score: Feedback from actual users is gold.
    ↳ Contextual Understanding: Can your AI remember and refer to earlier inputs?
    ↳ Scalability: Can it handle volume without degrading performance?
    ↳ Knowledge Retrieval Efficiency: This is key for RAG-based agents.
    ↳ Adaptability Score: Is your AI learning and improving over time?

    If you're building or managing AI agents, bookmark this. Whether it's a support bot, GenAI assistant, or a multi-agent system, these are the metrics that will shape real-world success. Did I miss any critical ones you use in your projects? Let’s make this list even stronger. Drop your thoughts 👇
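
Editor's note: as a hedged illustration of how a few of these dimensions could be instrumented, here is a minimal Python sketch. The field names, chosen metrics, and numbers are invented for the example and are not from the post or any particular framework.

```python
# Illustrative only: logging agent interactions and computing a few of the
# metrics named above (task completion rate, error rate, latency, cost).
from dataclasses import dataclass
from statistics import mean

@dataclass
class Interaction:
    session_id: str
    completed_task: bool   # did the user reach their goal?
    flagged_error: bool    # irrelevant or wrong response reported
    latency_s: float       # time to first response
    cost_usd: float        # model + infra cost for this interaction

def summarize(interactions: list[Interaction]) -> dict:
    n = len(interactions)
    return {
        "task_completion_rate": sum(i.completed_task for i in interactions) / n,
        "error_rate": sum(i.flagged_error for i in interactions) / n,
        "avg_latency_s": mean(i.latency_s for i in interactions),
        "cost_per_interaction_usd": mean(i.cost_usd for i in interactions),
    }

if __name__ == "__main__":
    log = [
        Interaction("a", True, False, 1.2, 0.004),
        Interaction("a", True, False, 0.9, 0.003),
        Interaction("b", False, True, 2.5, 0.006),
    ]
    print(summarize(log))
```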

  • Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    40,696 followers

    I’ve open-sourced a key component of one of my latest projects: Voice Lab, a comprehensive testing framework that removes the guesswork from building and optimizing voice agents across language models, prompts, and personas. Speech is increasingly becoming a prominent modality companies employ to enable user interaction with their products, yet the AI community is still figuring out systematic evaluation for such applications.

    Key features:
    (1) Metrics and analysis – define custom metrics like brevity or helpfulness in JSON format and evaluate them using LLM-as-a-Judge. No more manual reviews.
    (2) Model migration and cost optimization – confidently switch between models (e.g., from GPT-4 to smaller models) while evaluating performance and cost trade-offs.
    (3) Prompt and performance testing – systematically test multiple prompt variations and simulate diverse user interactions to fine-tune agent responses.
    (4) Testing different agent personas – from an angry United Airlines representative to a hotel receptionist who tries to jailbreak your agent to book all available rooms.

    While designed for voice agents, Voice Lab is versatile and can evaluate any LLM-based agent. ⭐️ I invite the community to contribute and would highly appreciate your support by starring the repo to make it more discoverable for others. GitHub repo (commercially permissive): https://lnkd.in/gAaZ-tkA
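
Editor's note: Voice Lab defines its own metric format in the repo; the following is only a generic LLM-as-a-Judge sketch (not Voice Lab's API) showing the idea of scoring an agent reply against a JSON-defined metric with the OpenAI Python SDK. The metric schema, prompt wording, and model choice are invented for illustration.

```python
# Generic LLM-as-a-Judge sketch (illustrative, not the Voice Lab API).
# Requires OPENAI_API_KEY and the `openai` package.
from openai import OpenAI

client = OpenAI()

brevity_metric = {
    "name": "brevity",
    "description": "The reply is concise and avoids unnecessary filler.",
    "scale": "1 (very verbose) to 5 (as short as possible while complete)",
}

def judge(metric: dict, user_msg: str, agent_reply: str) -> int:
    """Ask a judge model to score one agent reply against one metric."""
    prompt = (
        f"You are grading a voice agent's reply on '{metric['name']}': "
        f"{metric['description']} Scale: {metric['scale']}.\n\n"
        f"User: {user_msg}\nAgent: {agent_reply}\n\n"
        "Respond with a single integer score only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return int(resp.choices[0].message.content.strip())

print(judge(brevity_metric, "What time do you open?", "We open at 9 am."))
```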

  • Bryan Zmijewski

    Started and run ZURB. 2,500+ teams made design work.

    12,205 followers

    AI changes how we measure UX. We’ve been thinking and iterating on how we track user experiences with AI. In our open Glare framework, we use a mix of attitudinal, behavioral, and performance metrics. AI tools open the door to customizing metrics based on how people use each experience. I’d love to hear who else is exploring this.

    To measure UX in AI tools, it helps to follow the user journey and match the right metrics to each step. Here's a simple way to break it down:

    1. Before using the tool
    Start by understanding what users expect and how confident they feel. This gives you a sense of their goals and trust levels.

    2. While prompting
    Track how easily users explain what they want. Look at how much effort it takes and whether the first result is useful.

    3. While refining the output
    Measure how smoothly users improve or adjust the results. Count retries, check how well they understand the output, and watch for moments when the tool really surprises or delights them.

    4. After seeing the results
    Check if the result is actually helpful. Time-to-value and satisfaction ratings show whether the tool delivered on its promise.

    5. After the session ends
    See what users do next. Do they leave, return, or keep using it? This helps you understand the lasting value of the experience.

    We need sharper ways to measure how people use AI. Clicks can’t tell the whole story. But getting this data is not easy. What matters is whether the experience builds trust, sparks creativity, and delivers something users feel good about. These are the signals that show us if the tool is working, not just technically, but emotionally and practically. How are you thinking about this?

    #productdesign #uxmetrics #productdiscovery #uxresearch
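
Editor's note: one way to make the stage-to-metric matching concrete is a simple lookup structure. This is a hedged sketch; the stage keys and metric names are invented examples, not the Glare framework's actual metric set.

```python
# Illustrative only: map each journey stage to example metrics to log,
# loosely following the five steps described above.
JOURNEY_METRICS = {
    "before_use":      ["expectation_survey", "self_reported_confidence"],
    "while_prompting":  ["prompt_edit_count", "time_to_first_useful_result"],
    "while_refining":   ["retry_count", "output_comprehension_rating", "delight_moments"],
    "after_results":    ["time_to_value_s", "satisfaction_rating"],
    "after_session":    ["return_within_7d", "feature_reuse_rate"],
}

def metrics_for(stage: str) -> list[str]:
    """Look up which metrics to collect for a given journey stage."""
    return JOURNEY_METRICS.get(stage, [])

print(metrics_for("while_prompting"))
```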

  • Rajni Jaipaul

    AI Enthusiast | Real-World AI Use cases | Project Manager

    7,218 followers

    Is This the Future of Human-AI Interaction? Sesame's "Voice Presence" is Astonishing.

    Have you ever truly felt like you were having a conversation with an AI? Sesame, founded by Oculus co-founder Brendan Iribe, is pushing the boundaries of AI voice technology with its Conversational Speech Model (CSM). The results are striking. As The Verge's Sean Hollister noted, it's "the first voice assistant I've ever wanted to talk to more than once." Why? Because Sesame focuses on "voice presence," creating spoken interactions that feel genuinely real and understood.

    What's the potential impact for businesses?
    • Enhanced Customer Service: Imagine AI assistants that can handle complex inquiries with empathy and natural conversation flow.
    • Improved Accessibility: More natural voice interfaces can make technology accessible to more users.
    • Revolutionized Content Creation: Voice models like Maya and Miles could open up new audio and video content possibilities.
    • Training and Education: Interactive AI tutors could provide personalized and engaging learning experiences.

    The most impressive part? In blind listening tests, humans often couldn't distinguish Sesame's AI from real human recordings.

    #AI #ArtificialIntelligence #VoiceTechnology #Innovation #FutureofWork #CustomerExperience #MachineLearning #SesameAI

  • Bahareh Jozranjbar, PhD

    UX Researcher @ Perceptual User Experience Lab | Human-AI Interaction Researcher @ University of Arkansas at Little Rock

    7,958 followers

    A lot of us still rely on simple trend lines or linear regression when analyzing how user behavior changes over time. But in recent years, the tools available to us have evolved significantly. For behavioral and UX data - especially when it's noisy, nonlinear, or limited - there are now better methods to uncover meaningful patterns.

    Machine learning models like LSTMs can be incredibly useful when you’re trying to understand patterns that unfold across time. They’re good at picking up both short-term shifts and long-term dependencies, like how early frustration might affect engagement later in a session. If you want to go further, newer models that combine graph structures with time series - like graph-based recurrent networks - help make sense of how different behaviors influence each other.

    Transformers, originally built for language processing, are also being used to model behavior over time. They’re especially effective when user interactions don’t follow a neat, regular rhythm. What’s interesting about transformers is their ability to highlight which time windows matter most, which makes them easier to interpret in UX research.

    Not every trend is smooth or gradual. Sometimes we’re more interested in when something changes - like a sudden drop in satisfaction after a feature rollout. This is where change point detection comes in. Methods like Bayesian Online Change Point Detection or PELT can find those key turning points, even in noisy data or with few observations.

    When trends don’t follow a straight line, generalized additive models (GAMs) can help. Instead of fitting one global line, they let you capture smooth curves and more realistic patterns. For example, users might improve quickly at first but plateau later - GAMs are built to capture that shape.

    If you’re tracking behavior across time and across users or teams, mixed-effects models come into play. These models account for repeated measures or nested structures in your data, like individual users within groups or cohorts. The Bayesian versions are especially helpful when your dataset is small or uneven, which happens often in UX research.

    Some researchers go a step further by treating behavior over time as continuous functions. This lets you compare entire curves rather than just time points. Others use matrix factorization methods that simplify high-dimensional behavioral data - like attention logs or biometric signals - into just a few evolving patterns.

    Understanding not just what changed, but why, is becoming more feasible too. Techniques like Gaussian graphical models and dynamic Bayesian networks are now used to map how one behavior might influence another over time, offering deeper insights than simple correlations. And for those working with small samples, new Bayesian approaches are built exactly for that. Some use filtering to maintain accuracy with limited data, and ensemble models are proving useful for increasing robustness when datasets are sparse or messy.
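
Editor's note: as one concrete example of the change point detection idea mentioned above, here is a minimal sketch using PELT via the ruptures package. The synthetic satisfaction data and penalty value are invented for illustration; real analyses would tune the penalty and cost model to the data.

```python
# Change point detection sketch with PELT (requires `numpy` and `ruptures`),
# applied to a noisy UX metric such as a daily satisfaction score.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(42)

# Simulate 90 days of a satisfaction score with a drop after a rollout on day 60.
signal = np.concatenate([
    rng.normal(4.2, 0.3, 60),   # before the feature rollout
    rng.normal(3.6, 0.3, 30),   # after the feature rollout
])

# PELT balances goodness of fit against a penalty for each added change point;
# a higher `pen` value yields fewer detected changes.
algo = rpt.Pelt(model="rbf").fit(signal)
change_points = algo.predict(pen=5)

print(change_points)  # indices where segments end, ideally close to [60, 90]
```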

  • Ardis Kadiu

    Innovator in AI & EdTech | Founder & CEO at Element451 | Educator & Speaker | Developer of AI Courses & Workshops | Host of #GenerationAI Podcast

    6,035 followers

    OpenAI just released their new GPT-4o-mini-tts model, and the “steerability” feature is a big step forward. This means developers can now guide voice agents not only on what to say, but exactly how to say it, allowing precise control over tone, emotion, and style.

    Traditionally, voice agents worked as a chained pipeline: convert speech to text (STT), process it with an AI model, then convert the text back to speech (TTS). Newer innovations, like OpenAI’s model and the voice-to-voice approach, streamline this process. They enable voice agents to understand and respond to speech directly, without translating into text first, reducing delays and significantly improving realism and emotional depth.

    At Element451, we’re excited by this innovation because it empowers higher education institutions to create customized voice agents. Schools can now design agents or even mascots that truly reflect their personality and brand, dramatically enhancing the student experience. We dive deeper into this topic in our latest #GenerationAI podcast episode, exploring exactly how voice agents are transforming higher education. https://lnkd.in/eXH3ts3x
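
Editor's note: a hedged sketch of what steerability looks like in code, using the OpenAI Python SDK. The voice name, instructions text, and parameters follow OpenAI's published examples for gpt-4o-mini-tts at the time of the post; verify against the current SDK documentation before relying on them.

```python
# Steerable TTS sketch: specify not just what to say (input) but how to say it
# (instructions). Requires OPENAI_API_KEY and the `openai` package.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Welcome to campus! I'm here to help you get registered for orientation.",
    instructions=(
        "Speak as a warm, upbeat university mascot. "
        "Sound encouraging, slightly playful, and never rushed."
    ),
) as response:
    # Write the synthesized audio to disk as it streams back.
    response.stream_to_file("welcome.mp3")
```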
