Over the last year, I've seen many people fall into the same trap: they launch an AI-powered agent (chatbot, assistant, support tool, etc.)… but only track surface-level KPIs like response time or number of users. That's not enough.

To create AI systems that actually deliver value, we need holistic, human-centric metrics that reflect:
• User trust
• Task success
• Business impact
• Experience quality

This infographic highlights 15 essential dimensions to consider:
↳ Response Accuracy - Are your AI answers actually useful and correct?
↳ Task Completion Rate - Can the agent complete full workflows, not just answer trivia?
↳ Latency - Response speed still matters, especially in production.
↳ User Engagement - How often are users returning or interacting meaningfully?
↳ Success Rate - Did the user achieve their goal? This is your north star.
↳ Error Rate - Irrelevant or wrong responses? That's friction.
↳ Session Duration - Longer isn't always better; it depends on the goal.
↳ User Retention - Are users coming back after the first experience?
↳ Cost per Interaction - Especially critical at scale. Budget-wise agents win.
↳ Conversation Depth - Can the agent handle follow-ups and multi-turn dialogue?
↳ User Satisfaction Score - Feedback from actual users is gold.
↳ Contextual Understanding - Can your AI remember and refer to earlier inputs?
↳ Scalability - Can it handle volume without degrading performance?
↳ Knowledge Retrieval Efficiency - This is key for RAG-based agents.
↳ Adaptability Score - Is your AI learning and improving over time?

If you're building or managing AI agents, bookmark this. Whether it's a support bot, a GenAI assistant, or a multi-agent system, these are the metrics that will shape real-world success.

Did I miss any critical ones you use in your projects? Let's make this list even stronger - drop your thoughts below.
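Editor's note: a minimal sketch of how a few of the dimensions above could be rolled up from raw interaction logs. The log fields (resolved, error, latency_ms, cost_usd, user_id) are hypothetical placeholders for illustration, not part of the original post.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    user_id: str
    resolved: bool      # did the user reach their goal? (hypothetical field)
    error: bool         # response flagged as irrelevant or wrong
    latency_ms: float   # time to first response
    cost_usd: float     # model + infra cost for this interaction

def agent_metrics(logs: list[Interaction]) -> dict[str, float]:
    """Roll up a handful of the dimensions above from interaction logs."""
    n = len(logs)
    if n == 0:
        return {}
    users = {i.user_id for i in logs}
    return {
        "success_rate": sum(i.resolved for i in logs) / n,
        "error_rate": sum(i.error for i in logs) / n,
        "avg_latency_ms": sum(i.latency_ms for i in logs) / n,
        "cost_per_interaction": sum(i.cost_usd for i in logs) / n,
        # retention: share of users who came back for more than one interaction
        "user_retention": sum(
            1 for u in users if sum(j.user_id == u for j in logs) > 1
        ) / len(users),
    }

# Example usage with toy data
logs = [
    Interaction("u1", True, False, 820.0, 0.004),
    Interaction("u1", True, False, 950.0, 0.005),
    Interaction("u2", False, True, 1400.0, 0.007),
]
print(agent_metrics(logs))
```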
User Experience Design for Chatbots
Explore top LinkedIn content from expert professionals.
-
What's the right KPI to measure an AI agent's performance?

Here's the trap: most companies still measure the wrong thing. They track activity (tasks completed, chats answered) instead of impact. Based on my experience, effective measurement is multi-dimensional. Think of it as six lenses:

1️⃣ Accuracy - Is the agent correct?
Response accuracy (right answers); intent recognition accuracy (did it understand the ask?)

2️⃣ Efficiency - Is it fast and smooth?
Response time; task completion rate (fully autonomous vs. guided vs. human takeover)

3️⃣ Reliability - Is it stable over time?
Uptime & availability; error rate

4️⃣ User Experience & Engagement - Do people trust and return?
CSAT (outcome + interaction + confidence); repeat usage rate; friction metrics (repeats, clarifying questions, misunderstandings)

5️⃣ Learning & Adaptability - Does it get better?
Improvement over time; adaptation speed to new data/conditions; retraining frequency & impact

6️⃣ Business Outcomes - Does it move the needle?
Conversion & revenue impact; cost per interaction & ROI; strategic goal contribution (retention, compliance, expansion)

Gartner predicts that by 2027, 60% of business leaders will rely on AI agents to make critical decisions. If that's true, then measuring them right is existential.

So, here's the debate: should AI agents be held to the same KPIs as humans (outcomes, growth, value), or do they need an entirely new framework?

If you had to pick ONE metric tomorrow, what would you measure first?

#AI #Agents #KPIs #FutureOfWork #BusinessValue #Productivity #DecisionMaking
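Editor's note: a minimal sketch of how the six lenses could be combined into a single review score. The lens names follow the post; the weights and the 0-1 sub-scores are hypothetical illustrations, not a prescribed framework.

```python
# Hypothetical weights per lens; tune to what "impact" means for your business.
LENS_WEIGHTS = {
    "accuracy": 0.25,
    "efficiency": 0.15,
    "reliability": 0.15,
    "user_experience": 0.15,
    "learning": 0.10,
    "business_outcomes": 0.20,
}

def agent_scorecard(lens_scores: dict[str, float]) -> float:
    """Weighted roll-up of per-lens scores, each normalized to the 0..1 range."""
    missing = set(LENS_WEIGHTS) - set(lens_scores)
    if missing:
        raise ValueError(f"missing lens scores: {missing}")
    return sum(LENS_WEIGHTS[k] * lens_scores[k] for k in LENS_WEIGHTS)

# Example: strong on activity-style lenses, weak on business outcomes.
print(agent_scorecard({
    "accuracy": 0.92, "efficiency": 0.88, "reliability": 0.99,
    "user_experience": 0.71, "learning": 0.60, "business_outcomes": 0.35,
}))
```

The weighting makes the post's argument concrete: an agent can look great on activity lenses and still score poorly once business outcomes carry real weight.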
-
AI products like Cursor, Bolt and Replit are shattering growth records. Not because they're "AI agents". Or because they've got impossibly small teams (although that's cool to see). It's because they've mastered the user experience around AI, somehow balancing pro-like capabilities with B2C-like UI. This is product-led growth on steroids.

Yaakov Carno tried the most viral AI products he could get his hands on. Here are the surprising patterns he found (don't miss the full breakdown in today's bonus Growth Unhinged: https://lnkd.in/ehk3rUTa):

1. Their AI doesn't feel like a black box. Pro-tips from the best:
- Show step-by-step visibility into AI processes.
- Let users ask, "Why did AI do that?"
- Use visual explanations to build trust.

2. Users don't need better AI - they need better ways to talk to it. Pro-tips from the best:
- Offer pre-built prompt templates to guide users.
- Provide multiple interaction modes (guided, manual, hybrid).
- Let AI suggest better inputs ("enhance prompt") before executing an action.

3. The AI works with you, not just for you. Pro-tips from the best:
- Design AI tools to be interactive, not just output-driven.
- Provide different modes for different types of collaboration.
- Let users refine and iterate on AI results easily.

4. Let users see (& edit) the outcome before it's irreversible. Pro-tips from the best:
- Allow users to test AI features before full commitment (many let you use it without even creating an account).
- Provide preview or undo options before executing AI changes.
- Offer exploratory onboarding experiences to build trust.

5. The AI weaves into your workflow, it doesn't interrupt it. Pro-tips from the best:
- Provide simple accept/reject mechanisms for AI suggestions.
- Design seamless transitions between AI interactions.
- Prioritize the user's context to avoid workflow disruptions.

The TL;DR: having "AI" isn't the differentiator anymore - great UX is. Pardon the Sunday interruption & hope you enjoyed this post as much as I did. #ai #genai #ux #plg
-
In the rapidly evolving world of conversational AI, Large Language Model (LLM) based chatbots have become indispensable across industries, powering everything from customer support to virtual assistants. However, evaluating their effectiveness is no simple task, as human language is inherently complex, ambiguous, and context-dependent. In a recent blog post, Microsoft's Data Science team outlined key performance metrics designed to assess chatbot performance comprehensively.

Chatbot evaluation can be broadly categorized into two key areas: search performance and LLM-specific metrics. On the search front, one critical factor is retrieval stability, which ensures that slight variations in user input do not drastically change the chatbot's search results. Another vital aspect is search relevance, which can be measured through multiple approaches, such as comparing chatbot responses against a ground-truth dataset or conducting A/B tests to evaluate how well the retrieved information aligns with user intent.

Beyond search performance, chatbot evaluation must also account for LLM-specific metrics, which focus on how well the model generates responses. These include:
- Task Completion: Measures the chatbot's ability to accurately interpret and fulfill user requests. A high-performing chatbot should successfully execute tasks, such as setting reminders or providing step-by-step instructions.
- Intelligence: Assesses coherence, contextual awareness, and the depth of responses. A chatbot should go beyond surface-level answers and demonstrate reasoning and adaptability.
- Relevance: Evaluates whether the chatbot's responses are appropriate, clear, and aligned with user expectations in terms of tone, clarity, and courtesy.
- Hallucination: Ensures that the chatbot's responses are factually accurate and grounded in reliable data, minimizing misinformation and misleading statements.

Effectively evaluating LLM-based chatbots requires a holistic, multi-dimensional approach that integrates search performance and LLM-generated response quality. By considering these diverse metrics, developers can refine chatbot behavior, enhance user interactions, and build AI-driven conversational systems that are not only intelligent but also reliable and trustworthy.

#DataScience #MachineLearning #LLM #Evaluation #Metrics #SnacksWeeklyonDataScience

Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts:
-- Spotify: https://lnkd.in/gKgaMvbh
-- Apple Podcast: https://lnkd.in/gj6aPBBY
-- Youtube: https://lnkd.in/gcwPeBmR
https://lnkd.in/gAC8eXmy
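Editor's note: a minimal sketch of one way to probe retrieval stability as described above - retrieve for a query and several paraphrases, then measure how much the returned document sets overlap. The search callable, the toy index, and the Jaccard-based score are hypothetical stand-ins, not Microsoft's published method.

```python
from typing import Callable

def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two sets of retrieved document IDs."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def retrieval_stability(
    search: Callable[[str], list[str]],   # hypothetical: query -> ranked doc IDs
    query: str,
    paraphrases: list[str],
    k: int = 5,
) -> float:
    """Average top-k overlap between the original query and its paraphrases.
    Values near 1.0 mean small input changes barely move the results."""
    base = set(search(query)[:k])
    overlaps = [jaccard(base, set(search(p)[:k])) for p in paraphrases]
    return sum(overlaps) / len(overlaps)

# Example usage with a toy lookup standing in for the search system
toy_index = {
    "reset password": ["d1", "d2", "d3"],
    "how do i reset my password": ["d1", "d3", "d4"],
}
score = retrieval_stability(lambda q: toy_index.get(q, []),
                            "reset password", ["how do i reset my password"])
print(f"stability: {score:.2f}")
```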
-
AI models like ChatGPT and Claude are powerful, but they aren't perfect. They can sometimes produce inaccurate, biased, or misleading answers due to issues related to data quality, training methods, prompt handling, context management, and system deployment. These problems arise from the complex interaction between model design, user input, and infrastructure. Here are the main factors that explain why incorrect outputs occur:

1. Model Training Limitations
AI relies on the data it is trained on. Gaps, outdated information, or insufficient coverage of niche topics lead to shallow reasoning, overfitting to common patterns, and poor handling of rare scenarios.

2. Bias & Hallucination Issues
Models can reflect social biases or create "hallucinations," which are confident but false details. This leads to made-up facts, skewed statistics, or misleading narratives.

3. External Integration & Tooling Issues
When AI connects to APIs, tools, or data pipelines, miscommunication, outdated integrations, or parsing errors can result in incorrect outputs or failed workflows.

4. Prompt Engineering Mistakes
Ambiguous, vague, or overloaded prompts confuse the model. Without clear, refined instructions, outputs may drift off-task or omit key details.

5. Context Window Constraints
AI has a limited memory span. Long inputs can cause it to forget earlier details, compress context poorly, or misinterpret references, resulting in incomplete responses.

6. Lack of Domain Adaptation
General-purpose models struggle in specialized fields. Without fine-tuning, they provide generic insights, misuse terminology, or overlook expert-level knowledge.

7. Infrastructure & Deployment Challenges
Performance relies on reliable infrastructure. Problems with GPU allocation, latency, scaling, or compliance can lower accuracy and system stability.

Wrong outputs don't mean AI is "broken." They show the challenge of balancing data quality, engineering, context management, and infrastructure. Tackling these issues makes AI systems stronger, more dependable, and ready for businesses. #LLM
-
Product managers & designers working with AI face a unique challenge: designing a delightful product experience that cannot fully be predicted.

Traditionally, product development followed a linear path. A PM defined the problem, a designer drew the solution, and software teams coded the product. The outcome was largely predictable, and the user experience was consistent. However, with AI, the rules have changed. Non-deterministic ML models introduce uncertainty & chaotic behavior. The same question asked four times produces different outputs. Asking the same question in different ways - even just an extra space in the question - elicits different results.

How does one design a product experience in the fog of AI? The answer lies in embracing the unpredictable nature of AI and adapting your design approach. Here are a few strategies to consider:

1. Fast feedback loops: Great machine learning products elicit user feedback passively. Just click on the first result of a Google search and come back to the second one. That's a great signal for Google to know that the first result is not optimal - without typing a word.

2. Evaluation: Before products launch, it's critical to run the machine learning systems through a battery of tests to understand how the LLM will respond in the most likely use cases.

3. Over-measurement: It's unclear what will matter in product experiences today, so measure as much as possible in the user experience, whether it's session times, conversation topic analysis, sentiment scores, or other numbers.

4. Couple with deterministic systems: Some startups are using large language models to suggest ideas that are then evaluated with deterministic or classic machine learning systems. This design pattern can quash some of the chaotic, non-deterministic nature of LLMs.

5. Smaller models: Smaller models that are tuned or optimized for specific use cases will produce narrower output, controlling the experience.

The goal is not to eliminate unpredictability altogether but to design a product that can adapt and learn alongside its users. Just as much as the technology has changed products, our design processes must evolve as well.
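Editor's note: a minimal sketch of strategy 4, coupling an LLM with a deterministic check - the model proposes, a plain rule-based validator disposes. The llm_suggest_discount stub and the policy caps are hypothetical placeholders, not any particular startup's implementation.

```python
import random

def llm_suggest_discount(customer_summary: str) -> float:
    """Stand-in for a non-deterministic LLM call that proposes a discount percentage."""
    # In a real product this would parse a number out of a model completion.
    return random.uniform(0.0, 40.0)

def validate_discount(discount: float, customer_tier: str) -> float:
    """Deterministic guardrail: clamp the suggestion to policy limits."""
    caps = {"free": 0.0, "pro": 10.0, "enterprise": 25.0}  # hypothetical policy table
    cap = caps.get(customer_tier, 0.0)
    return min(max(discount, 0.0), cap)

def propose_discount(customer_summary: str, customer_tier: str) -> float:
    raw = llm_suggest_discount(customer_summary)   # creative but unpredictable
    return validate_discount(raw, customer_tier)   # predictable and auditable

# Example usage: the result never exceeds the 10% cap for the "pro" tier.
print(propose_discount("long-time customer, two support escalations", "pro"))
```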
-
Why would your users distrust flawless systems?

Recent data shows 40% of leaders identify explainability as a major GenAI adoption risk, yet only 17% are actually addressing it. This gap determines whether humans accept or override AI-driven insights. As founders building AI-powered solutions, we face a counterintuitive truth: technically superior models often deliver worse business outcomes because skeptical users simply ignore them. The most successful implementations reveal that interpretability isn't about exposing mathematical gradients - it's about delivering stakeholder-specific narratives that build confidence.

Three practical strategies separate winning AI products from those gathering dust:

1️⃣ Progressive disclosure layers
Different stakeholders need different explanations. Your dashboard should let users drill from plain-language assessments to increasingly technical evidence.

2️⃣ Simulatability tests
Can your users predict what your system will do next in familiar scenarios? When users can anticipate AI behavior with >80% accuracy, trust metrics improve dramatically. Run regular "prediction exercises" with early users to identify where your system's logic feels alien.

3️⃣ Auditable memory systems
Every autonomous step should log its chain-of-thought in domain language. These records serve multiple purposes: incident investigation, training data, and regulatory compliance. They become invaluable when problems occur, providing immediate visibility into decision paths.

For early-stage companies, these trust-building mechanisms are more than luxuries. They accelerate adoption. When selling to enterprises or regulated industries, they're table stakes. The fastest-growing AI companies don't just build better algorithms - they build better trust interfaces. While resources may be constrained, embedding these principles early costs far less than retrofitting them after hitting an adoption ceiling. Small teams can implement "minimum viable trust" versions of these strategies with focused effort.

Building AI products is fundamentally about creating trust interfaces, not just algorithmic performance.

#startups #founders #growth #ai
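Editor's note: a minimal sketch of the "auditable memory" idea from strategy 3 - each autonomous step appends a structured, human-readable record of what the agent saw, why it decided, and what it did. The field names and the JSONL file are hypothetical choices, not a specific product's schema.

```python
import json
from datetime import datetime, timezone

def log_decision(path: str, step: str, inputs: dict, reasoning: str, action: str) -> None:
    """Append one auditable record per autonomous step, phrased in domain language."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,              # e.g. "triage_ticket"
        "inputs": inputs,          # the facts the agent saw
        "reasoning": reasoning,    # plain-language chain of thought
        "action": action,          # what it actually did
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example usage
log_decision(
    "agent_audit.jsonl",
    step="triage_ticket",
    inputs={"ticket_id": "T-1042", "category": "billing"},
    reasoning="Refund request under $50 and within 30 days, matches auto-approve policy.",
    action="auto_approved_refund",
)
```

The same records can later feed incident reviews, training data, and compliance reports, which is why logging them in domain language rather than raw model traces pays off.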
-
I spoke to a company last week that makes software for doctors. But sometimes patients - usually in crisis - create an account looking for their doctor. When this happens, their current big-name AI solution just starts happily giving totally irrelevant (and dangerous) answers. Lorikeet's agent instead instantly disengaged and escalated the ticket to a human agent.

This is a great illustration of how hard it is to build a truly good CX AI solution when you focus on containment or deflection. In fact, I think the excessive focus on deflection is the Achilles' heel for a lot of the solutions in our space. Focusing on deflection weakens the product in five core ways:

1. Product architecture reflects different values - chatbots maximize engagement, agents know their limits.
2. Self-awareness is a real technical challenge - most vendors avoid the hard engineering work.
3. Bad metrics create bad feedback loops - you can't improve what you can't measure properly.
4. Testing tools get built around the wrong goals - celebrating coverage instead of quality.
5. Workflow design suffers - optimizing for engagement over effectiveness.

More on what we've learned about these trade-offs in comments.
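Editor's note: a minimal sketch of the "agents know their limits" idea - classify whether a request is in scope before answering, and escalate rather than deflect when it isn't. The classify_scope stub, labels, and decision type are hypothetical illustrations, not Lorikeet's implementation.

```python
from dataclasses import dataclass

ESCALATE_LABELS = {"patient_in_crisis", "medical_advice", "unknown"}  # hypothetical

@dataclass
class AgentDecision:
    answer: str | None
    escalated: bool

def classify_scope(message: str) -> str:
    """Stand-in for an intent/scope classifier (could itself be an LLM call)."""
    text = message.lower()
    if "doctor" in text or "help me" in text:
        return "patient_in_crisis"
    return "provider_support"

def handle(message: str) -> AgentDecision:
    label = classify_scope(message)
    if label in ESCALATE_LABELS:
        # Disengage and hand off to a human instead of maximizing "containment".
        return AgentDecision(answer=None, escalated=True)
    return AgentDecision(answer=f"(draft answer for: {message})", escalated=False)

# Example usage: a patient in crisis gets escalated, not answered.
print(handle("I need to reach my doctor, please help me"))
```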
-
Sharing 10 personal learnings from studying Claude's system prompt. If you're building with LLMs, it's a goldmine of design principles for shaping behavior toward the intended user experience.

1️⃣ Define the assistant's purpose clearly
Key insight: Claude begins with a values-driven role: helpful, honest, harmless.
Hook: What's your AI for - and what's off-limits? Most prompts never say.

2️⃣ Tone is a first-class citizen
Key insight: Claude isn't just accurate. It's grounded, warm, and clear - by design.
Hook: Your LLM doesn't just need goals. It needs vibes.

3️⃣ Embrace ambiguity with options, not guesses
Key insight: Claude offers multiple interpretations when users are unclear.
Hook: Defaulting to "I'm not sure - did you mean X or Y?" beats hallucinating.

4️⃣ Use conditional logic for guardrails
Key insight: Claude refuses dangerous content and redirects to constructive alternatives.
Hook: "No" isn't the end. It's the start of a better direction.

5️⃣ Respect user preferences - but know when to ignore them
Key insight: Claude applies preferences only when relevant.
Hook: Over-personalization is just as risky as ignoring context.

6️⃣ Provide examples, not just instructions
Key insight: Claude learns from examples in the prompt (e.g. bad vs. good clarifying questions).
Hook: LLMs imitate better than they obey. Feed them patterns, not laws.

7️⃣ Explicitly define how to handle limits
Key insight: Claude has phrasing for "I can't help with that" - no awkward dead ends.
Hook: Saying "no" is easy. Saying it gracefully takes practice.

8️⃣ Reflect before responding
Key insight: Claude can "think silently" on complex questions before answering.
Hook: Add a reasoning step. It's like turning on a prefrontal cortex.

9️⃣ Use modular roles to scale capabilities
Key insight: Claude uses role-based wrappers (e.g. writing coach, code reviewer) with consistent norms.
Hook: Want reusable behavior? Prompt like you're building Lego blocks.

🔟 Prompts shape behavior probabilistically
Key insight: The system prompt nudges - not forces - behavior.
Hook: You're not programming. You're parenting.
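Editor's note: a minimal sketch of learnings 1, 6, and 9 combined - assembling a system prompt from a purpose statement, a role-specific wrapper, and a few-shot example block. The role texts and example pairs are hypothetical placeholders, not Claude's actual system prompt.

```python
# Hypothetical building blocks; a production system prompt is far richer.
PURPOSE = "You are a helpful, honest assistant. Decline requests that could cause harm."

ROLES = {
    "writing_coach": "Act as a writing coach: give concrete, kind feedback on drafts.",
    "code_reviewer": "Act as a code reviewer: point out bugs and risky patterns first.",
}

EXAMPLES = {
    "clarifying_question": (
        "Bad: 'What do you mean?'\n"
        "Good: 'I'm not sure - did you mean X or Y?'"
    ),
}

def build_system_prompt(role: str, example_keys: list[str]) -> str:
    """Compose purpose + role wrapper + examples into one system prompt."""
    parts = [PURPOSE, ROLES[role]]
    parts += [f"Example ({k}):\n{EXAMPLES[k]}" for k in example_keys]
    return "\n\n".join(parts)

# Example usage
print(build_system_prompt("writing_coach", ["clarifying_question"]))
```

Treating the prompt as composable blocks keeps role behavior consistent and makes it easy to reuse the same purpose and examples across agents.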
-
A year ago, for me, ChatGPT was just a work tool - a writing aid for social media posts. Today, it's also crept into my personal life. That shift is showing up in the data too. According to Marc Zao-Sanders in Harvard Business Review, "therapy and companionship" is now the #1 use case for GenAI. People aren't just using chatbots to get things done - they're using them to feel better, find clarity, and connect emotionally. But is it working, and at what long-term cost?

A recent RCT from AHA at MIT Media Lab and OpenAI offers some insight into what that kind of use actually does to us. Nearly 1,000 participants were asked to chat daily with ChatGPT for 4 weeks. Each was assigned to 1 of 9 combinations of modality (text, neutral voice, or emotionally expressive voice) and conversation type (personal prompts, non-personal prompts, or open-ended).

*The researchers found that more frequent use - regardless of format or topic - was consistently associated with greater loneliness, stronger emotional dependence, and lower social interaction with real people.*

Interestingly, text-based chats were more emotionally "sticky" than voice, prompting more self-disclosure and stronger attachment. And while personal prompts (like reflecting on values or gratitude) led to a slight uptick in loneliness, they were also linked to lower emotional dependence and less problematic use. On the other hand, non-personal prompts - the kind we often think of as purely practical - were more likely to foster emotional reliance over time.

That nuance matters. The study didn't suggest that emotionally expressive AI is inherently risky, or that personal conversations are always harmful. Instead, it showed how easily frequent, habitual use - even for neutral tasks - can shift from support to substitution. Over time, chatbots can become not just a tool, but a source of comfort, perspective, and emotional regulation. And that comes with tradeoffs.

The takeaway? Overuse (even for neutral tasks) is the clearest risk factor for emotional dependence. But how we use GenAI matters too. Structured, self-reflective prompts may help users think without over-attaching, and voice-based interactions - often seen as more "human" - can actually be less emotionally sticky than text.

As more people turn to GenAI for emotional support, this research is a reminder: design and intention matter. We can build AI that supports reflection without replacing relationships, but only if we design for that edge where helpful turns into habitual.

This post is part of my Friday Findings series - curated research at the intersection of minds and machines.

Reference: Cathy (Mengying) F., Auren Liu, Valdemar Danry, Eunhae L., Samantha Chan, Pat Pataranutaporn, Pattie Maes, Jason Phang, Michael Lampe & Sandhini Agarwal (2025). How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study. arXiv preprint.

Nuance Behavior