“Good enough” isn’t “safe enough”.
Generative AI tools like ChatGPT have become go-to assistants in business workflows. They summarize reports, draft responses, support internal research, and generate code. But behind the polished outputs lie persistent issues that can cause real harm if left unchecked.
In 2023, a New York lawyer made headlines after using ChatGPT to draft a legal filing. The model confidently cited nonexistent court cases, leading to professional embarrassment and sanctions. That same year, media companies faced backlash after publishing AI-generated articles with factual inaccuracies, undermining their credibility. These incidents are the symptoms of a broader issue: GPT can sound right, even when it’s wrong.
Despite its versatility, GPT doesn’t understand the way humans do. It predicts language based on patterns in its training data, not grounded knowledge. This makes it prone to hallucinations, bias, and blind spots, all of which raise serious concerns when the model is embedded in business-critical tools.
What GPT Still Gets Wrong
Despite significant improvements, GPT-based models still exhibit systemic weaknesses that can affect trust, safety, and utility. Understanding these limitations is key before integrating them into business processes.

Hallucinations
GPT sometimes generates information that sounds plausible but is entirely fabricated. This can include nonexistent legal and academic citations, incorrect statistics and historical data, fake URLs, and code snippets that just don’t work.
For example, in 2023, CNET quietly paused its AI-generated financial content after discovering factual errors in dozens of articles, many of which were only corrected after public scrutiny.
Lack of Contextual Memory
Unless it is fine-tuned or integrated with retrieval tools, GPT doesn’t “remember” prior exchanges in a meaningful way: the model is stateless and only sees what the application passes into the current prompt. Anything that falls outside that context window is effectively forgotten, which leads to contradictory answers within the same session, confusion over role-based tasks or personas, and repetitive or circular responses when asked clarifying questions.
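A minimal sketch of what this means in practice, assuming the OpenAI Python SDK (v1+), an API key in the environment, and an illustrative model name: the application, not the model, is responsible for resending any history it wants “remembered,” and whatever gets trimmed out of the prompt is simply gone.

```python
# Minimal sketch: GPT is stateless, so the application must resend any context
# it wants the model to "remember" on every call. Assumes the OpenAI Python SDK
# (v1+) and an OPENAI_API_KEY in the environment; the model name and the simple
# trimming strategy are illustrative only.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a support assistant for Acme Corp."}]

def ask(user_message: str, max_turns: int = 10) -> str:
    history.append({"role": "user", "content": user_message})
    # Keep the system prompt plus only the most recent turns; older turns are dropped.
    trimmed = [history[0]] + history[1:][-max_turns * 2:]
    response = client.chat.completions.create(model="gpt-4o-mini", messages=trimmed)
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer
```

Once a turn drops out of `trimmed`, the model has no trace of it, which is exactly where mid-session contradictions come from.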
Bias and Toxicity
GPT models reflect the data they were trained on, including historical, cultural, and linguistic biases. As a result, ChatGPT can reproduce stereotypical or discriminatory language and skewed perspectives in summaries or recommendations, which creates legal and reputational risk when it’s used in customer-facing applications.
Amazon famously scrapped an internal AI recruiting tool back in 2018 after it was found to downgrade CVs that mentioned all-women’s colleges, a legacy bias rooted in its training data.
Poor Judgment with Ambiguity or Edge Cases
GPT models tend to choose the most statistically likely continuation, even when a more nuanced or cautious response is appropriate. In high-stakes industries like healthcare, law, and finance, this can lead to oversimplified interpretations of regulations, confident but incorrect medical or financial advice, and missing disclaimers or escalation recommendations.
Why These Issues Matter in Business Contexts
When companies embed GPT-based tools into their workflows, whether for customer support, internal knowledge bases, or document generation, the consequences of failure extend beyond technical issues. They’re operational, legal, and reputational.
Customer Trust Can Erode Instantly
If a chatbot invents a product feature, misquotes return policies, or offers incorrect instructions, it undermines customer confidence. Even one bad interaction can lead to negative reviews or lost conversions.
Air Canada’s chatbot mistakenly promised a customer a bereavement fare refund that didn’t exist. When the airline refused to honor it, a Canadian tribunal ruled the company was still liable for its chatbot’s statements. The financial remedy was modest; the lasting cost was eroded trust and the added scrutiny now applied to every statement the company makes, which are the real consequences of neglecting AI oversight.
Legal Accountability Isn’t Delegated to AI
Whether the issue stems from hallucinated advice or biased outcomes, legal systems tend to hold the deploying entity responsible, not the AI tool.
GDPR regulators have made it clear that automated decision-making must be explainable and auditable. Businesses using LLMs without proper safeguards risk regulatory penalties, especially in data-sensitive industries.
Internal Use Isn’t Risk-Free
Even if GPT is used only for internal tasks, like drafting reports or summarizing documents, it can still:
- Introduce factual errors into decision-making
- Spread outdated or biased views among employees
- Leak sensitive information if prompts aren’t sanitized properly (see the sanitization sketch below)
Internal misuse can quietly accumulate damage, especially if employees overtrust the model or assume output is always vetted.
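On the sanitization point, here is a minimal sketch of what “sanitized properly” can mean in practice: strip obvious identifiers before a prompt ever leaves the company. The regex patterns and placeholder tokens below are illustrative only; production systems typically rely on dedicated PII-detection tooling rather than hand-rolled rules.

```python
import re

# Minimal sketch of prompt sanitization before text is sent to an external LLM.
# The patterns below (email, phone, simple ID format) are illustrative only;
# real deployments usually use dedicated PII-detection services.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
    (re.compile(r"\b[A-Z]{2}\d{6,10}\b"), "[CUSTOMER_ID]"),
]

def sanitize(prompt: str) -> str:
    for pattern, placeholder in REDACTIONS:
        prompt = pattern.sub(placeholder, prompt)
    return prompt

print(sanitize("Refund order for jane.doe@example.com, phone +1 415 555 0100, ID AB1234567"))
# -> "Refund order for [EMAIL], phone [PHONE], ID [CUSTOMER_ID]"
```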
Scaling GPT Usage Multiplies the Risk
The more places GPT is used (sales emails, contracts, dashboards, etc.), the greater the need for governance. Without clear usage policies, businesses risk inconsistency, misinformation, and exposure. And if each deployment isn’t scoped and configured for its specific task, data and context from different use cases can bleed into one another, producing unexpected results.
Combine that with a lack of human-in-the-loop oversight, and you have a perfect storm with consequences that are hard to predict.
How to Make GPT Safer for Business Use
While the limitations of GPT and similar LLMs are well-documented, there are practical, proven methods to reduce risks and make these tools safer and more reliable in real-world business environments.

Implement Human-in-the-Loop (HITL) Systems
Don’t fully automate high-stakes decisions. For content generation, customer communication, or data summaries, add a layer of human review before publishing or acting on the output.
- Example: Law firms using GPT for legal drafts ensure all outputs are reviewed by qualified staff before submission, following the fallout from lawyers submitting GPT-generated, hallucinated case law in 2023.
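A minimal sketch of what such a review gate can look like: model output is treated as a draft and held until a named reviewer approves it. The Draft and ReviewQueue types here are hypothetical; a real system would persist this state and plug into existing ticketing or CMS tooling.

```python
from dataclasses import dataclass
from enum import Enum

# Minimal human-in-the-loop sketch: model output is a draft until a named
# reviewer approves it. Draft and ReviewQueue are hypothetical types.

class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class Draft:
    content: str
    status: Status = Status.PENDING
    reviewer: str | None = None

class ReviewQueue:
    def __init__(self) -> None:
        self._drafts: list[Draft] = []

    def submit(self, model_output: str) -> Draft:
        """Queue a model-generated draft; nothing is published at this point."""
        draft = Draft(content=model_output)
        self._drafts.append(draft)
        return draft

    def approve(self, draft: Draft, reviewer: str) -> str:
        """Only approved content is ever released downstream."""
        draft.status, draft.reviewer = Status.APPROVED, reviewer
        return draft.content

    def reject(self, draft: Draft, reviewer: str) -> None:
        draft.status, draft.reviewer = Status.REJECTED, reviewer
```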
Use Retrieval-Augmented Generation (RAG)
RAG architecture grounds GPT responses in factual, company-approved data. This reduces hallucinations by having the model “read” internal sources, like a knowledge base or documentation, before answering.
- Many enterprises now combine GPT with RAG systems using tools like LangChain or Haystack to ensure answers come from real, trusted documents.
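A minimal RAG sketch, assuming the OpenAI Python SDK: the “retriever” here is a naive keyword-overlap search over an in-memory document list, standing in for the embedding-backed vector store a framework like LangChain or Haystack would provide. The policy snippets and model name are illustrative.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any LLM client works

client = OpenAI()

# Toy "knowledge base": in production this would be a vector store queried
# with embeddings (e.g., via LangChain or Haystack).
DOCS = [
    "Refund policy: customers may return items within 30 days with a receipt.",
    "Shipping policy: standard delivery takes 3-5 business days.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Naive keyword-overlap retrieval, purely for illustration.
    words = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))
    return scored[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer using ONLY the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\nQuestion: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

The instruction to answer only from the supplied context, and to say so when it can’t, is what does most of the work in curbing hallucinations.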
Apply Guardrails and Filters
Use moderation layers or custom prompts to filter sensitive or harmful output. You can also restrict certain topics, formats, or tones depending on the use case.
- Tools like Azure OpenAI and Google Vertex AI offer built-in content filters and safety scoring, ideal for customer-facing or regulated applications.
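For teams rolling their own guardrails, here is a minimal sketch of an output filter that screens responses before they reach a customer. The blocked-topic list and patterns are placeholders; managed platforms such as Azure OpenAI and Vertex AI expose configurable content filters that serve the same purpose without hand-maintained rules.

```python
import re

# Minimal guardrail sketch: screen model output before it reaches a customer.
# The topic list and patterns are placeholders chosen for illustration.
BLOCKED_TOPICS = re.compile(r"\b(medical advice|legal advice|investment advice)\b", re.I)
COMMITMENT_LANGUAGE = re.compile(r"\b(we guarantee|full refund|100% safe)\b", re.I)

def screen(output: str) -> tuple[bool, str]:
    """Return (allowed, text); flagged output is replaced with an escalation message."""
    if BLOCKED_TOPICS.search(output) or COMMITMENT_LANGUAGE.search(output):
        return False, "I can't help with that directly; let me connect you with a specialist."
    return True, output

allowed, text = screen("We guarantee a full refund on any order, no questions asked.")
print(allowed, text)  # False, plus the escalation message instead of the risky promise
```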
Log and Audit Interactions
Maintain logs of GPT interactions to analyze usage patterns, detect recurring errors, and refine prompts or behavior over time.
- Audit trails are essential for compliance and help explain outcomes if something goes wrong. They’re also increasingly required in AI regulations like the EU AI Act.
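A minimal audit-logging sketch: every prompt/response pair is appended to a JSON Lines file with a timestamp and trace ID. The field names and file-based sink are illustrative; a production deployment would write to a database or SIEM and redact sensitive fields first.

```python
import json
import time
import uuid
from pathlib import Path

# Minimal audit-logging sketch: append each interaction as one JSON record per line.
# Field names and the file-based sink are illustrative only.
LOG_PATH = Path("gpt_audit.jsonl")

def log_interaction(user_id: str, prompt: str, response: str, model: str) -> str:
    trace_id = str(uuid.uuid4())
    record = {
        "trace_id": trace_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user_id": user_id,
        "model": model,
        "prompt": prompt,
        "response": response,
    }
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return trace_id  # hand the trace ID back so downstream systems can reference it
```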
Train or Fine-Tune for Domain Context
Generic GPT models are more likely to hallucinate when they don’t understand the domain. Fine-tuning or prompt engineering with domain-specific language helps reduce inaccuracies.
- Companies in healthcare and finance often train models on their internal datasets, ensuring the LLM reflects their vocabulary, policies, and context.
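As a lightweight illustration, here is how internal Q&A pairs can be turned into chat-style fine-tuning examples. The JSONL “messages” layout follows the format accepted by OpenAI’s fine-tuning API at the time of writing; the system prompt and sample pairs (for a fictional “Acme Bank”) are purely illustrative.

```python
import json

# Minimal sketch: turning internal Q&A pairs into chat-style fine-tuning examples.
# The system prompt, bank name, and sample answers below are fictional placeholders.
SYSTEM = "You are a compliance assistant for Acme Bank. Answer only from approved policy."

qa_pairs = [
    ("What is the wire transfer cutoff time?",
     "Domestic wires submitted before 4 p.m. ET settle the same day."),
    ("Can tellers approve overdrafts?",
     "No. Overdraft exceptions require branch-manager approval."),
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for question, answer in qa_pairs:
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
        f.write(json.dumps(example) + "\n")
```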
Takeaway
GPT and similar large language models have transformed how businesses approach automation, content generation, and customer service. But relying on them without safeguards can backfire, from producing false or biased information to exposing companies to legal or reputational risk.
The good news is that safer deployment is achievable. Techniques like Retrieval-Augmented Generation, human-in-the-loop design, and domain adaptation are being used right now by companies building responsible AI solutions.
Making GPT safer isn’t just about avoiding mistakes. It’s about building trust. For your customers, for your teams, and for any stakeholder who depends on your technology to deliver reliable, verifiable outcomes.