Continuing from last week's post on the rise of the Voice Stack, there's an area that today's voice-based systems often struggle with: Voice Activity Detection (VAD) and the turn-taking paradigm of communication.

When communicating with a text-based chatbot, the turns are clear: you write something, then the bot does, then you do, and so on. The success of text-based chatbots with clear turn-taking has influenced the design of voice-based bots, most of which also use the turn-taking paradigm. A key part of building such a system is a VAD component to detect when the user is talking. This allows our software to take the parts of the audio stream in which the user is saying something and pass them to the model for the user's turn. It also supports interruption in a limited way: if a user insistently interrupts the AI system while it is talking, eventually the VAD system will realize the user is talking, shut off the AI's output, and let the user take a turn.

This works reasonably well in quiet environments. However, VAD systems today struggle with noisy environments, particularly when the background noise is other human speech. For example, if you are in a noisy cafe speaking with a voice chatbot, VAD, which is usually trained to detect human speech, tends to be inaccurate at figuring out when you, or someone else, is talking. (In comparison, it works much better if you are in a noisy vehicle, since the background noise is more clearly not human speech.) It might think you are interrupting when it was merely someone in the background speaking, or fail to recognize that you've stopped talking. This is why today's speech applications often struggle in noisy environments.

Intriguingly, last year, Kyutai Labs published Moshi, a model that had many technical innovations. An important one was enabling persistent, bi-directional audio streams from the user to Moshi and from Moshi to the user. If you and I were speaking in person or on the phone, we would constantly be streaming audio to each other (through the air or the phone system), and we'd use social cues to know when to listen and how to politely interrupt if one of us felt the need. Thus, the streams would not need to explicitly model turn-taking. Moshi works like this. It's listening all the time, and it's up to the model to decide when to stay silent and when to talk. This means an explicit VAD step is no longer necessary.

Just as the architecture of text-only transformers has gone through many evolutions, voice models are going through a lot of architecture exploration. Given the importance of foundation models with voice-in and voice-out capabilities, many large companies right now are investing in developing better voice models. I'm confident we'll see many more good voice models released this year.

[Reached length limit; full text: https://lnkd.in/g9wGsPb2 ]
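To make the turn-taking approach concrete, here is a minimal sketch of a VAD-gated turn detector built on the open-source webrtcvad package. The frame length, aggressiveness level, and the roughly 700 ms silence threshold for ending a turn are illustrative assumptions, not the design of any particular product mentioned in the post.

```python
# Minimal sketch: segment an audio stream into user "turns" using webrtcvad.
# Assumptions (not from the post): 16 kHz 16-bit mono PCM, 30 ms frames,
# and ~700 ms of continuous silence marks the end of a turn.
import webrtcvad

SAMPLE_RATE = 16000                     # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30                           # frames must be 10, 20, or 30 ms long
END_OF_TURN_FRAMES = 700 // FRAME_MS    # consecutive silent frames that end a turn

def user_turns(frames, aggressiveness=2):
    """Yield one bytes buffer per detected user turn.

    `frames` is any iterable of 30 ms, 16-bit mono PCM frames (e.g., from a mic).
    """
    vad = webrtcvad.Vad(aggressiveness)  # 0 = most permissive, 3 = most aggressive
    turn, silent = bytearray(), 0
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            turn.extend(frame)
            silent = 0
        elif turn:
            silent += 1
            if silent >= END_OF_TURN_FRAMES:
                yield bytes(turn)        # hand this segment to the model as the user's turn
                turn, silent = bytearray(), 0
    if turn:
        yield bytes(turn)
```

The weakness described above falls straight out of this loop: in a noisy cafe, is_speech also fires on background speakers, so turns get cut off too early, too late, or attributed to the wrong voice. A full-duplex model like Moshi sidesteps the gating step entirely.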
User Experience for Voice Interfaces
-
-
Over the last year, I've seen many people fall into the same trap: They launch an AI-powered agent (chatbot, assistant, support tool, etc.)… but only track surface-level KPIs, like response time or number of users. That's not enough.

To create AI systems that actually deliver value, we need holistic, human-centric metrics that reflect:
• User trust
• Task success
• Business impact
• Experience quality

This infographic highlights 15 essential dimensions to consider:
↳ Response Accuracy: Are your AI answers actually useful and correct?
↳ Task Completion Rate: Can the agent complete full workflows, not just answer trivia?
↳ Latency: Response speed still matters, especially in production.
↳ User Engagement: How often are users returning or interacting meaningfully?
↳ Success Rate: Did the user achieve their goal? This is your north star.
↳ Error Rate: Irrelevant or wrong responses? That's friction.
↳ Session Duration: Longer isn't always better; it depends on the goal.
↳ User Retention: Are users coming back after the first experience?
↳ Cost per Interaction: Especially critical at scale. Budget-wise agents win.
↳ Conversation Depth: Can the agent handle follow-ups and multi-turn dialogue?
↳ User Satisfaction Score: Feedback from actual users is gold.
↳ Contextual Understanding: Can your AI remember and refer to earlier inputs?
↳ Scalability: Can it handle volume without degrading performance?
↳ Knowledge Retrieval Efficiency: This is key for RAG-based agents.
↳ Adaptability Score: Is your AI learning and improving over time?

If you're building or managing AI agents, bookmark this. Whether it's a support bot, GenAI assistant, or a multi-agent system, these are the metrics that will shape real-world success.

Did I miss any critical ones you use in your projects? Let's make this list even stronger; drop your thoughts below.
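To make a few of these dimensions concrete, here is a minimal sketch that computes task completion rate, error rate, average latency, cost per interaction, and a rough retention proxy from logged agent interactions. The field names are assumptions about what you log, not a standard schema.

```python
# Minimal sketch: computing a handful of the metrics above from interaction logs.
# The Interaction fields are illustrative assumptions, not a standard schema.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Interaction:
    user_id: str
    latency_ms: float
    completed_task: bool   # did the agent finish the full workflow?
    was_error: bool        # irrelevant or wrong response, per user flag or judge
    cost_usd: float        # model tokens, telephony, TTS, etc.

def agent_kpis(logs: list[Interaction]) -> dict:
    n = len(logs)
    if n == 0:
        return {}
    by_user = Counter(i.user_id for i in logs)
    return {
        "task_completion_rate": sum(i.completed_task for i in logs) / n,
        "error_rate": sum(i.was_error for i in logs) / n,
        "avg_latency_ms": sum(i.latency_ms for i in logs) / n,
        "cost_per_interaction_usd": sum(i.cost_usd for i in logs) / n,
        # retention proxy: share of users with more than one interaction
        "returning_user_share": sum(c > 1 for c in by_user.values()) / len(by_user),
    }
```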
-
I've open-sourced a key component of one of my latest projects: Voice Lab, a comprehensive testing framework that removes the guesswork from building and optimizing voice agents across language models, prompts, and personas. Speech is increasingly becoming a prominent modality companies employ to enable user interaction with their products, yet the AI community is still figuring out systematic evaluation for such applications.

Key features:
(1) Metrics and analysis: define custom metrics like brevity or helpfulness in JSON format and evaluate them using LLM-as-a-Judge. No more manual reviews.
(2) Model migration and cost optimization: confidently switch between models (e.g., from GPT-4 to smaller models) while evaluating performance and cost trade-offs.
(3) Prompt and performance testing: systematically test multiple prompt variations and simulate diverse user interactions to fine-tune agent responses.
(4) Testing different agent personas, from an angry United Airlines representative to a hotel receptionist who tries to jailbreak your agent to book all available rooms.

While designed for voice agents, Voice Lab is versatile and can evaluate any LLM-based agent. ⭐️ I invite the community to contribute and would highly appreciate your support by starring the repo to make it more discoverable for others.

GitHub repo (commercially permissive): https://lnkd.in/gAaZ-tkA
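As an illustration of the LLM-as-a-Judge pattern behind feature (1), here is a hypothetical sketch, not Voice Lab's actual API (see the repo for that): metric definitions live in JSON, and a judge model scores a transcript against each one. The judge model name and prompt wording are assumptions.

```python
# Hypothetical sketch of LLM-as-a-Judge with JSON metric definitions.
# Not Voice Lab's API; just the general pattern it describes.
import json
from openai import OpenAI

client = OpenAI()

METRICS = json.loads("""
[
  {"name": "brevity",     "definition": "Responses are concise and avoid filler."},
  {"name": "helpfulness", "definition": "Responses move the caller toward their goal."}
]
""")

def judge(transcript: str, metric: dict, judge_model: str = "gpt-4o-mini") -> dict:
    """Ask a judge model for a 1-5 score and a short rationale for one metric."""
    prompt = (
        f"Metric: {metric['name']}. {metric['definition']}\n"
        "Score the agent in this transcript from 1 (poor) to 5 (excellent).\n"
        'Reply as JSON: {"score": <int>, "rationale": "<one sentence>"}\n\n'
        f"Transcript:\n{transcript}"
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# scores = {m["name"]: judge(my_transcript, m) for m in METRICS}
```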
-
AI changes how we measure UX. We've been thinking and iterating on how we track user experiences with AI. In our open Glare framework, we use a mix of attitudinal, behavioral, and performance metrics. AI tools open the door to customizing metrics based on how people use each experience. I'd love to hear who else is exploring this.

To measure UX in AI tools, it helps to follow the user journey and match the right metrics to each step. Here's a simple way to break it down:

1. Before using the tool: Start by understanding what users expect and how confident they feel. This gives you a sense of their goals and trust levels.
2. While prompting: Track how easily users explain what they want. Look at how much effort it takes and whether the first result is useful.
3. While refining the output: Measure how smoothly users improve or adjust the results. Count retries, check how well they understand the output, and watch for moments when the tool really surprises or delights them.
4. After seeing the results: Check if the result is actually helpful. Time-to-value and satisfaction ratings show whether the tool delivered on its promise.
5. After the session ends: See what users do next. Do they leave, return, or keep using it? This helps you understand the lasting value of the experience.

We need sharper ways to measure how people use AI. Clicks can't tell the whole story. But getting this data is not easy. What matters is whether the experience builds trust, sparks creativity, and delivers something users feel good about. These are the signals that show us if the tool is working, not just technically, but emotionally and practically. How are you thinking about this?

#productdesign #uxmetrics #productdiscovery #uxresearch
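As a rough illustration (this is not the Glare framework itself), stage-based measurement can be as simple as tagging each logged event with a journey stage and aggregating per stage. The stage names and measure keys below are assumptions for the sketch.

```python
# Minimal sketch: journey-stage metrics for an AI tool. Stage names and the
# measures dict keys are illustrative assumptions, not a standard taxonomy.
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean

class Stage(Enum):
    BEFORE = "before_use"      # expectations, confidence
    PROMPTING = "prompting"    # effort, first-result usefulness
    REFINING = "refining"      # retries, comprehension, delight
    RESULTS = "results"        # time-to-value, satisfaction
    AFTER = "after_session"    # leave, return, keep using

@dataclass
class StageEvent:
    user_id: str
    stage: Stage
    measures: dict = field(default_factory=dict)  # e.g. {"retries": 2, "satisfaction": 4}

def stage_average(events: list[StageEvent], stage: Stage, key: str):
    """Average one numeric measure across all events logged for a journey stage."""
    values = [e.measures[key] for e in events if e.stage is stage and key in e.measures]
    return mean(values) if values else None

# stage_average(events, Stage.REFINING, "retries")  -> mean retries while refining output
```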
-
Is This the Future of Human-AI Interaction? Sesame's "Voice Presence" is Astonishing.

Have you ever truly felt like you were having a conversation with an AI? Sesame, founded by Oculus co-founder Brendan Iribe, is pushing the boundaries of AI voice technology with its Conversational Speech Model (CSM). The results are striking. As The Verge's Sean Hollister noted, it's "the first voice assistant I've ever wanted to talk to more than once." Why? Because Sesame focuses on "voice presence," creating spoken interactions that feel genuinely real and understood.

What's the potential impact for businesses?
• Enhanced Customer Service: Imagine AI assistants that can handle complex inquiries with empathy and natural conversation flow.
• Improved Accessibility: More natural voice interfaces can make technology accessible to more users.
• Revolutionized Content Creation: Voice models like Maya and Miles could open up new audio and video content possibilities.
• Training and Education: Interactive AI tutors could provide personalized and engaging learning experiences.

The most impressive part? In blind listening tests, humans often couldn't distinguish Sesame's AI from real human recordings.

#AI #ArtificialIntelligence #VoiceTechnology #Innovation #FutureofWork #CustomerExperience #MachineLearning #SesameAI
-
A lot of us still rely on simple trend lines or linear regression when analyzing how user behavior changes over time. But in recent years, the tools available to us have evolved significantly. For behavioral and UX data, especially when it's noisy, nonlinear, or limited, there are now better methods to uncover meaningful patterns.

Machine learning models like LSTMs can be incredibly useful when you're trying to understand patterns that unfold across time. They're good at picking up both short-term shifts and long-term dependencies, like how early frustration might affect engagement later in a session. If you want to go further, newer models that combine graph structures with time series, like graph-based recurrent networks, help make sense of how different behaviors influence each other.

Transformers, originally built for language processing, are also being used to model behavior over time. They're especially effective when user interactions don't follow a neat, regular rhythm. What's interesting about transformers is their ability to highlight which time windows matter most, which makes them easier to interpret in UX research.

Not every trend is smooth or gradual. Sometimes we're more interested in when something changes, like a sudden drop in satisfaction after a feature rollout. This is where change point detection comes in. Methods like Bayesian Online Change Point Detection or PELT can find those key turning points, even in noisy data or with few observations.

When trends don't follow a straight line, generalized additive models (GAMs) can help. Instead of fitting one global line, they let you capture smooth curves and more realistic patterns. For example, users might improve quickly at first but plateau later; GAMs are built to capture that shape.

If you're tracking behavior across time and across users or teams, mixed-effects models come into play. These models account for repeated measures or nested structures in your data, like individual users within groups or cohorts. The Bayesian versions are especially helpful when your dataset is small or uneven, which happens often in UX research.

Some researchers go a step further by treating behavior over time as continuous functions. This lets you compare entire curves rather than just time points. Others use matrix factorization methods that simplify high-dimensional behavioral data, like attention logs or biometric signals, into just a few evolving patterns.

Understanding not just what changed, but why, is becoming more feasible too. Techniques like Gaussian graphical models and dynamic Bayesian networks are now used to map how one behavior might influence another over time, offering deeper insights than simple correlations. And for those working with small samples, new Bayesian approaches are built exactly for that. Some use filtering to maintain accuracy with limited data, and ensemble models are proving useful for increasing robustness when datasets are sparse or messy.
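To ground one of these methods, here is a short sketch of change-point detection with the open-source ruptures package (PELT). The synthetic "satisfaction drops after a feature rollout" series and the penalty value are illustrative assumptions.

```python
# Short sketch: detecting a change point in a noisy satisfaction series with
# ruptures (PELT). The data and penalty value are illustrative assumptions.
import numpy as np
import ruptures as rpt

rng = np.random.default_rng(0)
# Simulated daily satisfaction scores: stable for 60 days, then a drop after a rollout.
signal = np.concatenate([rng.normal(4.2, 0.3, 60), rng.normal(3.4, 0.3, 40)])

algo = rpt.Pelt(model="rbf").fit(signal)
breakpoints = algo.predict(pen=10)   # indices where behavior shifts; last value is len(signal)
print(breakpoints)                   # e.g. [60, 100]: a change detected near day 60
```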
-
OpenAI just released their new GPT-4o-mini-tts model, and the "steerability" feature is a big step forward. This means developers can now guide voice agents not only on what to say, but exactly how to say it, allowing precise control over tone, emotion, and style.

Traditionally, voice agents worked through a multi-step pipeline: converting speech to text (STT), processing it with an AI model, and then converting the text back to speech (TTS). But newer innovations, like OpenAI's model and the voice-to-voice approach, streamline this process. They enable voice agents to directly understand and respond to speech without translating into text first, reducing delays and significantly improving realism and emotional depth.

At Element451, we're excited by this innovation because it empowers higher education institutions to create customized voice agents. Schools can now design agents or even mascots that truly reflect their personality and brand, dramatically enhancing the student experience. We dive deeper into this topic in our latest #GenerationAI podcast episode, exploring exactly how voice agents are transforming higher education. https://lnkd.in/eXH3ts3x
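For developers who want to try steerability, here is a small sketch using the OpenAI Python SDK's speech endpoint, where an instructions field describes how the line should be delivered. The voice name, mascot phrasing, and file handling are assumptions; check OpenAI's current audio docs for the exact options.

```python
# Small sketch: steerable TTS with gpt-4o-mini-tts. The `instructions` field
# controls delivery style; voice choice and wording here are assumptions.
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Welcome to campus! Your orientation session starts in the main hall at 9 a.m.",
    instructions="Speak as a warm, upbeat university mascot: friendly and energetic, not rushed.",
) as response:
    response.stream_to_file("welcome.mp3")  # write the synthesized audio to disk
```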