Hallucinations in LLMs Explained

Ruman
Apr 7, 2024 · 12 min read


This article takes a deep dive into the issue of hallucination in LLMs: the tendency of these models to generate plausible-sounding but factually inaccurate information. We’ll explore the underlying reasons why LLMs sometimes hallucinate, and discuss the approaches that researchers and practitioners are exploring to measure and mitigate this issue.


Outline

  • Understanding Hallucinations
  • Types of Hallucination in LLMs
  • Why do LLMs Hallucinate?
  • Impact of Hallucinations in LLMs
  • Measuring Hallucinations in LLMs
  • Mitigating Hallucination Risks
  • Conclusion

Understanding Hallucinations

Hallucinations aren’t a novelty; they’ve been extensively studied, primarily within the context of human cognition and behaviour.

A hallucination is a false perception of objects or events involving your senses: sight, sound, smell, touch and taste. Hallucinations seem real, but they’re not. — Cleveland Clinic

Having said that, hallucinations operate differently within LLMs compared to human cognition. In human cognition, hallucinations might arise due to mental illness or brain disorders, while in LLMs, they can come from the architecture of the model, the training data used, their statistical modelling, and the prompt query.

Types of Hallucination in LLMs

Hallucinations in LLMs can be broadly categorized into two types:

i. Intrinsic Hallucinations

  • Intrinsic hallucinations refer to outputs generated by the LLM that are inconsistent or contradictory to the information provided in the input prompt or context.
  • For example, imagine the prompt “What is the capital city of France?” is given to an LLM and it responds with “The capital city of France is Berlin.” This is an intrinsic hallucination, as the output directly contradicts the known fact that the capital of France is Paris.

ii. Extrinsic Hallucinations

  • Extrinsic hallucinations occur when the LLM generates outputs that are factually inconsistent with external, real-world information, even if they may seem plausible in the given context.
  • For example, if an LLM is asked to summarize a news article about a recent scientific breakthrough but the summary describes the discovery as having been made decades ago, this is an extrinsic hallucination. The output is inconsistent with the actual timeline of events, even though it may sound reasonable within the context of the prompt.

Why do LLMs Hallucinate?

Photo by Brett Sayles: https://www.pexels.com/photo/bridge-1926866/

LLMs are trained to generate tokens; that’s it. They don’t check whether a generated token is factually correct. Even so, their training process produces models that give remarkably good results for a given prompt, acquiring knowledge, creative capabilities, and imagination along the way.

They use their pre-existing knowledge along with their creativity and imagination, producing jaw-droppingly good results.

Sometimes, that imagination or creativity leads to hallucinations, producing entirely incorrect information. It’s as if LLMs lack the self-awareness to distinguish between what is imagined and what is factually true or false.

This inability to differentiate can pose significant challenges in ensuring the accuracy and reliability of the generated output, especially in critical applications.

It’s like “having a double-edged sword 🗡️: my strength can also be my weakness.” 💫

Let's Look at the Reasons Why LLMs Hallucinate:

Issues with the Training Data

  • LLMs are trained on vast amounts of data (essentially the whole internet 😃), which can contain human errors, biases, and inconsistencies.
  • For example, if the training data includes user-generated content with factual errors, the model may learn to reproduce those errors in its own outputs.
  • Duplicates in the training corpus can also skew the model’s behaviour, causing it to overuse certain phrases or generate repetitive content.

Data Distribution Shift

  • The data distribution encountered during model inference may differ from the distribution in the training data.
  • For instance, if the model was trained on historical news articles, but is now being used to generate content about recent events, the shift in data distribution can lead to hallucinations.

Lack of Context

  • LLMs can struggle to fully understand and incorporate the context provided in the input, especially in open-ended or ambiguous situations.
  • Without a deep understanding of the real-world context, the model may generate outputs that are inconsistent or irrelevant.
  • For example, an LLM may hallucinate details about a fictional character’s personal life when asked to summarize a book, as it lacks the necessary contextual understanding.

Source-Target Divergence

  • When the training data contains inconsistencies or misalignments between the source text and the target/reference text, the model may learn to generate outputs that are not grounded in the source.
  • This can happen unintentionally, such as when the target text contains information not present in the source, or intentionally, as in tasks that prioritize diversity over factual accuracy.
  • A real-world case could be a summarization task where the summary contains details not mentioned in the original article.

Stochastic Nature of Decoding Strategies

  • The techniques used to generate text, such as top-k sampling, top-p (nucleus) sampling, and temperature adjustment, can introduce randomness and diversity into the outputs.
  • While these strategies can improve the overall quality of the generated text, they can also lead to increased hallucinations, as the model is not always constrained to the most probable and factually consistent outputs.

Check out my article on setting the top-k, top-p, and temperature parameters in large language models (LLMs).
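
To make the effect of these parameters concrete, here is a minimal sketch (using NumPy and a made-up five-token vocabulary with hypothetical logits, not any real model) of how temperature, top-k, and top-p reshape the next-token distribution before sampling:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Toy next-token sampler: temperature scaling, then optional top-k / top-p filtering."""
    logits = np.asarray(logits, dtype=float) / temperature      # lower temperature => sharper distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                        # softmax

    if top_k is not None:                                       # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    if top_p is not None:                                       # keep the smallest set whose mass reaches top_p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        last = np.searchsorted(cumulative, top_p) + 1           # index of the token that crosses the threshold
        mask = np.zeros_like(probs)
        mask[order[:last]] = probs[order[:last]]
        probs = mask

    probs /= probs.sum()                                        # renormalize after filtering
    return np.random.choice(len(probs), p=probs), probs

logits = [4.0, 2.5, 1.0, 0.5, 0.1]                                # hypothetical logits over a 5-token vocabulary
_, sharp = sample_next_token(logits, temperature=0.3, top_k=2)    # conservative: mass piles onto the top tokens
_, flat = sample_next_token(logits, temperature=1.5, top_p=0.95)  # creative: far more randomness
print(sharp.round(3), flat.round(3))
```

The same trade-off appears here in miniature: the conservative settings almost always pick the most probable token, while the creative settings regularly sample lower-probability (and potentially less factual) continuations.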

Difference Between Training-Time and Inference-Time Decoding

  • During training, LLMs are often taught to predict the next token based on the ground-truth prefix sequence. However, at inference time, the model generates text based on its own previously generated outputs.
  • This discrepancy can cause the model to drift away from the intended meaning and generate hallucinated content, especially in longer sequences.
  • For example, in a question-answering task the model may drift away from the original question, as errors in its own earlier output accumulate over the course of the response and lead to hallucinations (a toy illustration follows this list).
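
Here is a toy illustration of this train/inference mismatch (often called exposure bias), using a made-up `predict_next` function standing in for a real language model: during training every step is conditioned on the ground-truth prefix, while at inference each error is fed back in and compounds.

```python
def predict_next(prefix):
    """Stand-in for a real LM: continues the count, but makes one mistake after seeing a 7."""
    return 0 if prefix[-1] == 7 else (prefix[-1] + 1) % 10

ground_truth = list(range(10))   # the sequence the model should reproduce: 0, 1, 2, ..., 9

# Training-style decoding (teacher forcing): every step sees the ground-truth prefix.
teacher_forced = [predict_next(ground_truth[:t]) for t in range(1, 10)]

# Inference-style decoding: every step sees the model's own previous outputs.
generated = [ground_truth[0]]
for _ in range(9):
    generated.append(predict_next(generated))

print(teacher_forced)  # [1, 2, 3, 4, 5, 6, 7, 0, 9]    -> one isolated error
print(generated)       # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1] -> the error derails everything after it
```

With teacher forcing the mistake stays local, but in free-running generation the wrong token becomes part of the context for every later step, which is how small errors snowball into hallucinated passages.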

Parametric Knowledge Bias

  • LLMs tend to prioritize the knowledge encoded in their model parameters (acquired during pre-training) over the contextual information provided during inference.
  • This “parametric knowledge bias” can lead the model to generate outputs that are more aligned with its pre-trained knowledge, even if they are factually inconsistent with the given input.
  • For example, in a task involving current events, the model may generate outputs that reflect its background knowledge of politics or historical trends, even if they are not directly relevant to the given context.

Architectural and Training Objectives

  • Hallucinations can also arise from flaws in the model architecture or suboptimal training objectives that do not align with the desired output characteristics, such as factual accuracy and coherence.
  • For instance, a model designed for general-purpose language generation may hallucinate when prompted to write content for a specific domain, such as educational materials for children, due to a misalignment between the model’s training and the desired output characteristics.

Prompt Quality

  • The way prompts are engineered can influence the occurrence of hallucinations. Clear and specific prompts that guide the model towards generating relevant and accurate responses can help mitigate hallucinations.
  • Poorly designed prompts that are ambiguous or lack sufficient context can lead the model to generate hallucinated outputs that do not match the user’s intent.
  • For example, when asking an LLM to generate a product description, a vague prompt like “Describe a new smartphone” may lead to hallucinated features or specifications, while a more specific prompt like “Describe the key features and technical specifications of the latest iPhone model” better guides the model toward accurate and relevant information (see the sketch below).
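
As a rough illustration, much of the difference comes down to giving the model facts it cannot invent. The snippet below builds both styles of prompt; the product details are placeholder values standing in for data you would pull from your own catalogue:

```python
# Vague prompt: the model must invent every detail, which invites hallucination.
vague_prompt = "Describe a new smartphone."

# Grounded prompt: the facts are supplied, so the model only has to phrase them.
product = {                                   # placeholder specs, not a real device
    "name": "Acme Phone X",
    "display": "6.1-inch OLED, 120 Hz",
    "battery": "4,500 mAh",
    "price": "$699",
}
facts = "\n".join(f"- {key}: {value}" for key, value in product.items())
grounded_prompt = (
    "Using ONLY the specifications below, write a short product description. "
    "Do not mention any feature that is not listed.\n" + facts
)
print(grounded_prompt)
```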

Impact of Hallucinations in LLMs

Inaccurate information generated by LLMs due to hallucinations can have serious consequences for end users, businesses, and even governments. Let’s look at a few examples:

Google Lost BILLIONS 💵

Alphabet, Google’s parent company, lost around $100 billion in market value after its Bard chatbot produced incorrect information in a promotional demo.

Yikes 😬

Source : https://time.com/6254226/alphabet-google-bard-100-billion-ai-error/

A passenger was misled by Air Canada’s chatbot about the airline’s refund policy, and a tribunal later held the airline responsible for the chatbot’s answer.

https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-should-know

LLMs in Education

LLMs can lead students astray with incorrect answers, since they are known to fabricate responses when they hallucinate.

Computer scientist Wei Wang from UCLA tested GPT-3.5 and its successor, GPT-4, on university-level questions in physics, chemistry, computer science, and mathematics. The models made plenty of errors: GPT-4 answered only about one-third of the textbook questions correctly, although it scored 80% on one exam.

What's the solution?

To address hallucinations in LLMs, we must:

  1. Measure hallucination effectively
  2. Apply the right mitigation strategy
  3. Evaluate the effectiveness of the chosen mitigation strategy
  4. Maintain positivity 😃

Measuring Hallucinations in LLMs

Photo by AbsolutVision on Unsplash

Measuring hallucination is still an active area of research, and effective evaluation methods are still being developed.

Evaluating and quantifying the hallucinations produced by LLMs is a crucial step in developing effective mitigation strategies. There are broadly two approaches to measuring hallucinations in LLMs:

1. Human Evaluation:

  • This involves human evaluators manually reviewing the outputs generated by LLMs and identifying instances of hallucinations or factual inconsistencies.
  • Human evaluation is considered the gold standard, as it taps into human judgment and reasoning capabilities to assess the coherence and truthfulness of the generated text.
  • However, this method can be time-consuming, expensive, and subject to individual biases or inconsistencies among evaluators.

2. Quantitative Measurement:

  • In this approach, researchers/engineers develop automated metrics and methods to computationally assess the factual accuracy and consistency of LLM outputs.
  • Quantitative measurement techniques are generally faster, more scalable, and less resource-intensive than human evaluation.
  • By defining clear evaluation frameworks and metrics, we can systematically measure the effectiveness of different hallucination mitigation strategies employed in LLM development.

While human evaluation provides invaluable insights, the need for scalable and consistent measurement has led to the emergence of various quantitative techniques for assessing hallucinations in LLMs.

Framework for Quantitative Measurement

The four-step framework for quantitative measurement:

i. Define a Ground Truth:

  • For factual claims, the ground truth can be sourced from high-quality knowledge bases, Wikipedia, or curated datasets. The choice of ground-truth data varies by use case.
  • In tasks like summarization or question-answering, human-written reference texts can serve as the ground truth.
  • For open-ended generation tasks, the ground truth may be more abstract, focusing on attributes like coherence, relevance, or consistency.

ii. Prepare the Test Sets:

  • The test sets should cover a diverse range of inputs, topics, and complexity levels to assess the model’s performance thoroughly.

iii. Evaluate the Metrics:

  • Fact-Checking against Knowledge Bases: Checking the factual claims in the generated text against structured knowledge sources and computing metrics like precision, recall, and F1-score.
  • Natural Language Inference (NLI): Using NLI models to determine if the generated text is entailed by, contradicts, or is neutral with respect to the input context or ground truth. For example, if the input context is “The Eiffel Tower is located in Paris,” and the generated text is “The Eiffel Tower is located in London,” an NLI model would classify this as a contradiction (see the sketch after this list).
  • Semantic Similarity Metrics: Measuring the semantic overlap between the generated text and the ground truth using metrics like BLEU, ROUGE, or BERTScore. Lower similarity scores can indicate potential hallucinations.
  • Consistency Evaluation: Assessing the consistency of the model’s outputs across different prompts or contexts to identify contradictory or incoherent generations. For instance, if the model generates “The Eiffel Tower is located in Paris” for one prompt, but “The Eiffel Tower is located in London” for a similar prompt, this would indicate a lack of consistency and potential hallucinations.
  • Adversarial Evaluation: Crafting carefully designed prompts or inputs to trigger hallucinations and measuring the model’s resilience to such adversarial examples. This could involve introducing subtle inconsistencies, ambiguities, or factual errors into the input to see how the model responds and whether it maintains coherence and truthfulness in its outputs.
  • FActScore: A recently proposed automatic metric that decomposes the generated output into atomic facts and verifies each one against evidence retrieved from a large corpus, such as Wikipedia.
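
As a concrete example of the NLI approach above, the sketch below scores a generated claim against a reference statement with an off-the-shelf NLI model from Hugging Face (`roberta-large-mnli` is assumed here as a commonly available checkpoint; any NLI-finetuned model works the same way):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"            # assumed checkpoint; swap in any NLI model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def nli_verdict(premise: str, hypothesis: str) -> dict:
    """Return entailment / neutral / contradiction probabilities for a (premise, hypothesis) pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1).squeeze(0)
    return {model.config.id2label[i]: round(p.item(), 3) for i, p in enumerate(probs)}

reference = "The Eiffel Tower is located in Paris."
generated = "The Eiffel Tower is located in London."
print(nli_verdict(reference, generated))     # a high contradiction score flags a likely hallucination
```

Run over a whole test set, the fraction of generations labelled as contradictions gives a simple, repeatable hallucination score.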

iv. Iterative Refinement:

  • The quantitative evaluation process should be iterative, with the results informing the development of new hallucination mitigation strategies.

This four-step framework for quantitative measurement provides a structured approach to assess the prevalence and severity of hallucinations in LLMs. By defining a reliable ground truth, preparing diverse test sets, evaluating key metrics, and iteratively refining the process, researchers and engineers can gain valuable insights into the hallucination problem.

Mitigating Hallucination Risks in LLMs

Photo by Aleksandar Pasaric: https://www.pexels.com/photo/lighted-vending-machines-on-street-2338113/

Mitigating hallucinations in LLMs can be approached at two key levels:

  • Pre-training / During Training
  • At Inference Time

Let’s look at both of them in detail.

i. Pre-training / During Training

A variety of factors that contribute to hallucinations in the final model can be addressed before training starts or during training itself. These include:

1. Fixing Data Issues:

  • Carefully curate and filter the training data to remove noise, biases, and inconsistencies.
  • Employ duplicate-detection and deduplication techniques to eliminate redundant or repetitive examples (a minimal sketch follows this list).
  • Augment the training data with high-quality, factual information from reliable sources to improve the model’s grounding in reality.
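
As a minimal sketch of the deduplication step, the snippet below drops exact and near-exact repeats by hashing a normalized form of each document; production pipelines typically use fuzzier techniques such as MinHash, but the idea is the same:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so trivial variants hash identically."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(documents):
    """Keep only the first occurrence of each normalized document."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "The capital of France is Paris.",
    "the capital of France is Paris!!",   # punctuation/casing variant of the first document
    "Berlin is the capital of Germany.",
]
print(deduplicate(corpus))                # the near-duplicate is dropped
```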

2. Architecture-level Improvements:

  • Explore model architectures that are better suited for maintaining factual consistency and coherence, such as incorporating external knowledge sources or memory modules.
  • Experiment with larger model sizes, as increased model capacity can sometimes help in learning more robust representations and reducing hallucinations.

3. Training Process Improvement:

  • Utilize reinforcement learning from human feedback (RLHF) to fine-tune the model’s behaviour and incentivize the generation of factually accurate and coherent outputs.
  • Implement reward modelling approaches that explicitly reward the model for producing truthful and consistent text.
  • Incorporate training objectives and loss functions that encourage the model to maintain logical consistency across generated outputs.

ii. At Inference Time

Retraining such huge language models can be a very expensive and impractical option for the majority of developers and engineers. However, there are several strategies that can be employed at the inference or deployment stage to mitigate hallucinations:

1. Better Prompting:

  • Carefully craft prompts that provide clear, specific, and contextual information to guide the model towards generating relevant and accurate responses.
  • Experiment with different prompt engineering techniques, such as including relevant background information or breaking down complex tasks into smaller, more manageable steps.

2. Controlling Output Randomness:

  • Adjust the decoding parameters, such as top-k, top-p, and temperature, to balance the diversity and factual consistency of the generated outputs (see the sketch after this list).
  • Explore alternative decoding strategies, such as beam search, that constrain generation more tightly and may be less prone to hallucinations.
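
For models you run yourself, these knobs are exposed directly on the generation call. A minimal sketch with Hugging Face `transformers`, using `gpt2` purely as a small stand-in model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                   # small stand-in; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The capital of France is", return_tensors="pt")

# Conservative decoding: low temperature and a tight nucleus keep the model on high-probability text.
conservative = model.generate(
    **inputs, do_sample=True, temperature=0.3, top_k=50, top_p=0.85, max_new_tokens=20
)

# Creative decoding: higher temperature and a wide nucleus produce more diverse, more error-prone text.
creative = model.generate(
    **inputs, do_sample=True, temperature=1.3, top_p=0.98, max_new_tokens=20
)

print(tokenizer.decode(conservative[0], skip_special_tokens=True))
print(tokenizer.decode(creative[0], skip_special_tokens=True))
```

For hosted models, the equivalent temperature and top-p parameters are usually available on the provider’s API as well.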

3. Enhancing Reasoning Capabilities:

  • Implement “chain-of-thought” prompting techniques to encourage the model to engage in step-by-step reasoning before producing the final output.
  • Leverage techniques like self-consistency or multi-step reasoning to help the model maintain coherence and factual accuracy over longer sequences (a sketch follows this list).
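
Here is a rough sketch of chain-of-thought prompting combined with self-consistency. The `ask_llm` helper is a placeholder that returns canned completions for illustration; in practice it would call your model of choice with a non-zero temperature:

```python
import random
import re
from collections import Counter

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns canned chain-of-thought samples for illustration."""
    canned = [
        "They start with 23, use 20, then buy 6: 23 - 20 + 6 = 9. Answer: 9",
        "23 - 20 = 3 and 3 + 6 = 9. Answer: 9",
        "Add 6 first: 23 + 6 = 29, then subtract 20: 9. Answer: 9",
        "Misreads the question entirely. Answer: 27",   # one hallucinated reasoning path
    ]
    return random.choice(canned)

def self_consistent_answer(question: str, n_samples: int = 7) -> str:
    """Sample several reasoning paths, then majority-vote on the final answer."""
    prompt = f"{question}\nLet's think step by step, then finish with 'Answer: <value>'."
    answers = []
    for _ in range(n_samples):
        completion = ask_llm(prompt)
        match = re.search(r"Answer:\s*(\S+)", completion)
        if match:
            answers.append(match.group(1))
    return Counter(answers).most_common(1)[0][0]

question = ("The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
            "How many do they have now?")
print(self_consistent_answer(question))   # the occasional bad path is outvoted, usually yielding 9
```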

4. Incorporating External Knowledge:

  • Use retrieval-augmented generation (RAG) to dynamically retrieve relevant information from external knowledge sources and integrate it into the model’s outputs (a minimal sketch follows this list).
  • Fine-tune the LLM on domain-specific data or knowledge bases to improve its factual grounding in particular areas.
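
Below is a minimal retrieval sketch: it ranks a tiny placeholder knowledge base with TF-IDF from scikit-learn and prepends the best match to the prompt. A production RAG system would use dense embeddings and a vector store, but the grounding idea is the same:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder knowledge base; in practice these would be chunks of your own documents.
documents = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
    "The Great Barrier Reef lies off the coast of Queensland, Australia.",
]

def retrieve(query: str, k: int = 1):
    """Rank documents by TF-IDF cosine similarity to the query and return the top k."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

question = "When was the Eiffel Tower completed?"
context = "\n".join(retrieve(question))
augmented_prompt = (
    "Answer using ONLY the context below. If the answer is not in the context, say you don't know.\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(augmented_prompt)   # this grounded prompt is what gets sent to the LLM
```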

5. Selecting Appropriate LLMs:

  • For critical applications where hallucinations can have significant consequences, consider using more capable and robust models, such as GPT-4, which tend to hallucinate less than earlier models like GPT-3.5.

6. Fine-tuning for Domain-Specific Tasks:

  • Adapt the pre-trained LLM to specific domains or use cases by fine-tuning it on relevant data and incorporating task-specific objectives.
  • This can help the model develop a stronger understanding of the context and factual information relevant to the target application, reducing the likelihood of hallucinations.

By employing a combination of these mitigation strategies at both training and inference time, we can work towards creating LLMs that are more reliable, factually accurate, and trustworthy in their outputs.

🚨 I’ll be publishing a series of articles on assessing the performance of LLMs in terms of accuracy and hallucination, complete with code examples. Stay tuned for that 🤩

Conclusion

Hallucinations in LLMs are a concerning issue that must be addressed as these powerful AI systems become more widely adopted. There are various reasons behind this, ranging from flaws in the training data to architectural limitations that prevent true contextual understanding and reasoning.

As the field of LLMs continues to evolve, tackling the hallucination challenge will be crucial to unlocking the full potential of these LLMs while ensuring their outputs remain factually accurate and coherent.

If you enjoyed this article, your applause would be greatly appreciated!

