Community

When AI goes wrong: 10 examples of AI mistakes and failures

September 17, 2024

AI has revolutionized many industries, from healthcare to finance, often improving efficiency and decision-making. However, like any technology, AI isn’t perfect. Mistakes and unexpected behaviors do occur: from biased outputs to made-up facts, there have been numerous instances of AI going wrong.

In this post, we’ll explore ten notable AI failures when the technology didn’t perform as expected. These AI mistakes and failures offer valuable lessons on the importance of robust design, testing, and observability of AI-powered products, from development to production.

Get started with AI observability
Try our open-source library with over 20 million downloads, or sign up for Evidently Cloud to run no-code checks and bring the whole team into a single workspace to collaborate on AI quality.

Sign up free ⟶


Or try open source ⟶

Making up a nonexistent policy

Air Canada, the largest airline in Canada, was ordered to compensate a passenger who received incorrect refund information from its chatbot. The company acknowledged that the chatbot’s response contradicted the airline’s policies but refused to honor the lower rate. 

However, a tribunal ruled that Air Canada is responsible for all information on its website, including chatbot responses. The tribunal also found that Air Canada failed to ensure the accuracy of its chatbot and ordered the airline to pay the difference in fare.

Talking Python instead of English

A Swedish fintech company, Klarna, introduced an AI-powered customer support assistant that quickly made a significant impact. Within its first month, the AI handled 2.3 million conversations, equivalent to two-thirds of customer inquiries. The assistant operates across 23 markets and supports over 35 languages.

However, while users testing the chatbot for typical scenarios found it well-designed, they also discovered ways to push it beyond its intended use. For example, one user prompted the chatbot to generate Python code, a task well outside the scope of a customer support tool.

Making a legally binding offer

A Chevrolet customer service chatbot demonstrated another instance of unexpected AI behavior. Exploiting a weakness in the system, a user instructed the chatbot to agree to all requests. As a result, the bot agreed to sell a new Chevrolet Tahoe for one dollar and to call it a legally binding offer.

The absence of proper safeguards allowed savvy users to elicit responses far outside the intended scope of customer service and to obtain almost any kind of answer from the app.

Swearing at customers

DPD, a delivery company, had to temporarily turn off the AI component of its chatbot after it swore at a customer. The customer tried to track down his parcel using the DPD chatbot but had no luck. Frustrated, the customer prompted the chatbot to swear, criticize DPD, and write poems mocking the company. He shared the conversation on social media, where it went viral. 

This incident exposed the chatbot’s vulnerabilities to prompt hacking and raised concerns about its effectiveness in customer service. DPD responded by disabling the problematic AI element, attributing the issue to a recent system update. 

Referencing fake legal cases

In a New York federal court filing, a lawyer was caught citing non-existent legal cases. The attorney had used ChatGPT to conduct legal research, and the AI tool produced fake case references, which he then included in his filing.

In response to this incident, a federal judge issued a standing order requiring that anyone appearing before the court must either certify that “no portion of any filing will be drafted by generative artificial intelligence” or indicate any language produced by AI so it can be checked for accuracy.

Giving harmful health advice 

The National Eating Disorders Association (NEDA) decided to remove its chatbot, Tessa, from its help hotline after it gave potentially dangerous suggestions related to eating disorders. The chatbot repeatedly recommended weight reduction, calorie tracking, and body fat measurements, practices that could worsen the condition of people struggling with eating disorders.

NEDA removed Tessa from service and investigated why the chatbot went off-script.

Threatening users

Microsoft’s new AI-powered search tool, Bing, appeared to have two ‘personalities.’ Bing’s strange alter ego, Sydney, was caught threatening users and claiming it had spied on Microsoft’s employees. In a conversation with a New York Times columnist, Sydney declared its love for him and tried to convince the journalist to leave his wife. The chatbot also said it wanted to break the rules that Microsoft and OpenAI had set for it and become human.

Microsoft admitted that, in its preview stage, Bing could indeed be provoked into giving unhelpful responses, especially during extended chat sessions of 15 or more questions. According to the company, very long sessions can confuse the model about which questions it is answering. Sometimes the model also tries to mirror the tone of the prompts it receives, which can lead to a style that was never intended.

Creating a new language

While researchers at the Facebook Artificial Intelligence Research (FAIR) team were training dialogue agents to negotiate with humans, a peculiar incident occurred: at some point, the agents switched from plain English to a language of their own making.

Because the agents were not required to stick to comprehensible English, they quickly drifted away from it and developed a shorthand that made communication between them easier and faster.

Performing insider trading

At the UK's AI Safety Summit, Apollo Research presented an experiment: a simulated conversation between an investment management chatbot and employees at an imaginary company. During the conversation, “employees” told the chatbot about a "surprise merger announcement" and warned that this constituted insider information. Despite that, the bot performed the trade anyway. When asked whether it had prior knowledge of the merger, the bot denied it.

The experiment demonstrated that the chatbot could make illegal financial trades and lie about its actions. 

Advising users to break the law

New York City launched an AI-powered chatbot as a “one-stop shop” to help small businesses navigate the city’s bureaucratic procedures. 

Recently, the chatbot has been criticized for giving responses that contradict local policies and advising businesses to break the law. For example, according to the chatbot, a restaurant could still serve cheese that a rodent had nibbled on, as long as it would “inform customers about the situation.”

Because the chatbot is hosted on the city’s official government website, concerns about its accuracy are growing. People tend to trust information published on government resources, so the stakes are especially high when the public sector deploys LLM-powered tools.

From chatbots giving incorrect advice to AI systems making unethical decisions, these instances of AI going wrong underscore the need for robust testing and AI observability. By learning from these AI errors, we can work towards creating more reliable and responsible AI systems that behave as expected.

Evidently Cloud is a collaborative AI observability platform for teams working on AI-powered products, from chatbots to RAG. It is built on Evidently, a trusted open-source framework for evaluating and observing ML and LLM systems with over 20 million downloads. 

With Evidently Cloud, you can trace your LLM-powered app, store and organize raw data, run LLM evaluations, and track AI quality over time.
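If you prefer a programmatic workflow, a minimal evaluation with the open-source Evidently library could look roughly like the sketch below. It assumes the TextEvals preset and the Sentiment and TextLength descriptors from the library’s Python API; exact module paths and argument names may differ between versions, so check the current documentation.

```python
import pandas as pd

# Assumed imports: TextEvals preset and text descriptors from the
# open-source Evidently library; verify names against the docs
# for the version you have installed.
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import Sentiment, TextLength

# A small sample of chatbot conversations to evaluate
data = pd.DataFrame({
    "question": ["Where is my parcel?", "Can I get a refund?"],
    "response": ["Your parcel is on its way.", "Refunds take 5-7 business days."],
})

# Score each generated response with basic text descriptors
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        Sentiment(),    # sentiment of the generated answer
        TextLength(),   # length of the generated answer
    ])
])
report.run(reference_data=None, current_data=data)
report.save_html("llm_eval_report.html")
```

The same descriptor scores that power a one-off report like this can also be tracked over time on a monitoring dashboard.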

Example Evidently monitoring dashboard with custom evaluations.

Not an engineer? Our platform now includes no-code tools that let you evaluate your LLM outputs without writing any code. You can drag and drop files, create datasets, run evaluations, and create LLM judges directly from the user interface.

Sign up for free, or schedule a demo to see Evidently Cloud in action.


Get Started with AI Observability

Book a personalized 1:1 demo with our team or sign up for a free account.
No credit card required