How we cut the rate of GPT hallucinations from 20%+ to less than 2%
tl;dr: Instead of fine-tuning, we used a combination of prompt chaining and pre/post-processing to reduce the rate of hallucinations by an order of magnitude, though it required 3–4x as many calls to OpenAI. There’s still a lot more room for improvement!
One of the biggest challenges with using large language models like GPT is their tendency to fabricate information. This can be fine for use cases like creative writing or brainstorming sessions, but it can be disastrous when the output is used for business applications like customer support. Hallucinations, or the generation of false information, are particularly harmful in these contexts: even one fabricated answer could damage a company’s reputation, lead to legal liability, and harm customers.
There are a few ways to address this challenge. One common method is to fine-tune the model on a domain-specific dataset to improve its accuracy. The problem with fine-tuning is that collecting a domain-specific dataset is hard when you have a multi-tenant SaaS product, where every customer has a slightly different use case and different user personas. So we had to find other ways to solve the problem.
Here’s what we’ve done so far.
Prompt Chaining
The first thing we tried was to use prompt chaining techniques to break a complex prompt into parts, and have GPT “check its answers” at each step.
For example, instead of having a single call to GPT with the user input and injected content, we first asked GPT to evaluate whether it could even answer the question, and to justify its response. We currently have three steps: a Preprocessing step, an Evaluation step, and a Response step.
Here’s an example of the prompt we used at the Evaluation step. It simply asks GPT whether it can answer the question given the content provided.
"""<|im_start|>system You found the following content by searching through documentation. Use only this content to construct your response. {content}<|im_end|>
<|im_start|>user First, determine if the content found is sufficient to resolve the issue. Second, respond with a JSON in the format:
{
"content_contains_answer": boolean, // true or false. Whether the information in the content is sufficient to resolve the issue.
"justification": string // Why you believe the content you found is or is not sufficient to resolve the issue.
}
The inquiry: {inquiry}<|im_end|><|im_start|>assistant {
"content_contains_answer":<|im_end|>"""
Note that we asked GPT to return its answer in JSON format and seeded the assistant’s answer with the expected structure. This ensured that we would be able to parse the response, and it works almost 100% of the time. We also noticed that simply asking the model to provide a justification improved its accuracy at predicting content_contains_answer, even if we didn’t use the justification for anything. You just gotta call GPT out on its bullshit!
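To make the chaining concrete, here’s a minimal sketch of what our Evaluation step could look like in Python. It assumes the legacy Completions endpoint (which accepts a raw prompt string and returns token logprobs); the model name, helper name, and max_tokens value are placeholders, the JSON schema comments are trimmed for brevity, and the seeded assistant turn is left open with <|im_end|> used as a stop sequence.

```python
import json

import openai

EVAL_PROMPT = """<|im_start|>system
You found the following content by searching through documentation. Use only this content to construct your response.
{content}<|im_end|>
<|im_start|>user
First, determine if the content found is sufficient to resolve the issue. Second, respond with a JSON in the format:
{{
  "content_contains_answer": boolean,
  "justification": string
}}
The inquiry: {inquiry}<|im_end|>
<|im_start|>assistant
{{
  "content_contains_answer":"""

# The prefix we seeded the assistant's turn with, so we can rebuild valid JSON.
SEED = '{\n  "content_contains_answer":'


def evaluate(content: str, inquiry: str):
    prompt = EVAL_PROMPT.format(content=content, inquiry=inquiry)
    response = openai.Completion.create(
        model="text-davinci-003",  # placeholder model name
        prompt=prompt,
        max_tokens=256,
        temperature=0,
        stop="<|im_end|>",  # stop at the end of the assistant turn
        logprobs=1,         # keep token logprobs for the post-processing step
    )
    completion = response["choices"][0]["text"]
    # Re-attach the seeded prefix so the whole thing parses as JSON.
    evaluation = json.loads(SEED + completion)
    return evaluation, response
```

Keeping the logprobs around is what makes the post-processing check described below possible.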
This approach reduced the rate of hallucinations from over 20% to roughly 5%.
These techniques are well documented here and here.
Post-processing
The next thing that helped us get from 5% to 2% was post-processing GPT’s outputs. There were several steps to this:
- Check if the e^(logprob) of the true token is below 90%. If so, we re-run the evaluation prompt and force content_contains_answer to be false. We’ve found this to reduce false positives without too much impact on false negatives.
- If content_contains_answer is false, we’ll use the justification returned and a second call to the GPT API to reword the justification to target it towards the user. This reduces the chance that our final output has weird phrasing like “The user should…”. Not exactly a hallucination, but also not an optimal experience. A rough sketch of both steps follows this list.
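Here’s a sketch of both post-processing steps, under the same assumptions as the Evaluation sketch above. The response argument is the raw Completions response and evaluation is the parsed JSON; the 90% threshold is the one from the list above, everything else (model name, rewording prompt, tokenization handling) is illustrative, and the evaluation re-run is omitted for brevity.

```python
import math

import openai


def postprocess(evaluation: dict, response) -> dict:
    tokens = response["choices"][0]["logprobs"]["tokens"]
    token_logprobs = response["choices"][0]["logprobs"]["token_logprobs"]

    # 1. Confidence check: find the generated `true` token and make sure
    #    e^(logprob) is at least 0.9; otherwise treat it as a false positive.
    #    (Exact tokenization can vary by model, hence the strip/lower.)
    if evaluation["content_contains_answer"]:
        idx = next(
            (i for i, t in enumerate(tokens) if t.strip().lower() == "true"), None
        )
        if idx is not None and math.exp(token_logprobs[idx]) < 0.9:
            evaluation["content_contains_answer"] = False

    # 2. If there's no answer, reword the justification so it speaks to the
    #    user directly instead of talking about "the user".
    if not evaluation["content_contains_answer"]:
        reworded = openai.Completion.create(
            model="text-davinci-003",  # placeholder model name
            prompt=(
                "Reword the following so it addresses the user directly "
                "(say 'you' instead of 'the user'):\n\n"
                f"{evaluation['justification']}"
            ),
            max_tokens=256,
            temperature=0,
        )
        evaluation["justification"] = reworded["choices"][0]["text"].strip()

    return evaluation
```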
Pre-processing
This was the most recent step we added, and it got us to <2% hallucinations. The first thing we did was have GPT classify the intent of a user’s inquiry. Depending on the intent, we’ll use a different prompt for the evaluation and response steps.
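As a rough illustration of that routing, here’s a sketch with made-up intents and placeholder prompt templates; the real intents, prompts, and classification wording will depend on your product.

```python
import openai

# Hypothetical intents and prompt templates -- yours will differ.
PROMPTS_BY_INTENT = {
    "how_to_question": {
        "evaluation": "<evaluation prompt tuned for how-to questions>",
        "response": "<response prompt tuned for how-to questions>",
    },
    "bug_report": {
        "evaluation": "<evaluation prompt tuned for bug reports>",
        "response": "<response prompt tuned for bug reports>",
    },
}


def classify_intent(inquiry: str) -> str:
    """Ask GPT to label the inquiry with one of the known intents."""
    response = openai.Completion.create(
        model="text-davinci-003",  # placeholder model name
        prompt=(
            "Classify the following support inquiry as one of: "
            "how_to_question, bug_report.\n\n"
            f"Inquiry: {inquiry}\nIntent:"
        ),
        max_tokens=5,
        temperature=0,
    )
    return response["choices"][0]["text"].strip()


def prompts_for(inquiry: str) -> dict:
    """Pick the evaluation/response prompts based on the classified intent."""
    intent = classify_intent(inquiry)
    return PROMPTS_BY_INTENT.get(intent, PROMPTS_BY_INTENT["how_to_question"])
```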
We’re also experimenting with additional pre-processing on the user input to make it more likely to find relevant results at the search step. This can be done by extracting entities from the user’s query and running the vector search with a higher weight on sparse embeddings. This helps for questions that are technical and involve specific token combinations like keras.save_model, as keyword search is more useful than semantic search in these cases. This is all made possible through Pinecone’s new hybrid search functionality.
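For illustration, here’s a sketch of how the weighting could look with the pinecone-client 2.x API, following the convex-combination pattern from Pinecone’s hybrid search docs. The index name, alpha values, and the heuristic for spotting technical queries are all made up, and the dense/sparse vectors come from whatever embedding models you use.

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENV")  # placeholder credentials
index = pinecone.Index("docs")                                 # placeholder index name


def choose_alpha(inquiry: str) -> float:
    """Crude heuristic: if the inquiry contains code-like tokens such as
    keras.save_model, lean on keyword (sparse) matching; otherwise lean dense."""
    looks_technical = any("." in tok or "_" in tok for tok in inquiry.split())
    return 0.3 if looks_technical else 0.8


def hybrid_query(dense_vec, sparse_vec, alpha: float, top_k: int = 5):
    """Weight dense scores by alpha and sparse scores by (1 - alpha),
    then run a single hybrid query against the index."""
    scaled_dense = [v * alpha for v in dense_vec]
    scaled_sparse = {
        "indices": sparse_vec["indices"],
        "values": [v * (1 - alpha) for v in sparse_vec["values"]],
    }
    return index.query(
        vector=scaled_dense,
        sparse_vector=scaled_sparse,
        top_k=top_k,
        include_metadata=True,
    )
```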
Final Thoughts
One final tip that might be useful is to wrap your content in <Content></Content> tags. This helps GPT understand the difference between different sources, and even return placeholders (e.g. Content1) that you can later str.replace() with a link. You can also do this with any other data that’s injected into the prompt.
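A small sketch of that pattern follows; the chunk dicts with text and url keys are an assumption about how your retrieved content is structured.

```python
def wrap_content(chunks):
    """Wrap each retrieved chunk in numbered <Content> tags so GPT can tell
    the sources apart and refer back to them by placeholder."""
    return "\n".join(
        f"<Content{i}>\n{chunk['text']}\n</Content{i}>"
        for i, chunk in enumerate(chunks, start=1)
    )


def replace_placeholders(answer, chunks):
    """Swap any ContentN placeholder GPT returned for that source's link."""
    for i, chunk in enumerate(chunks, start=1):
        answer = answer.replace(f"Content{i}", chunk["url"])
    return answer
```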
Overall, we found that a combination of prompt chaining, pre-processing, and post-processing can do a great job of mitigating the risk of hallucinations and improving the accuracy of GPT. The downside is that it requires a lot more API calls, but with the recent 90% reduction in price, this is now very feasible.
We’re also open source! Email us at founders@getsidekick.ai and let us know if you’ve found this to be useful, or if you have tips to share on better ways to prevent hallucinations.