Improving vector search by converting documents to question/answer pairs

Jason Fan
2 min read · Jul 13, 2023


We’ve heard from developers building with vector databases that using GPT to transform documents into different formats could improve the reliability of vector search when building RAG applications.

For example, converting documents into question and answer pairs and indexing them based on vectors generated from those pairs intuitively seems like it would yield better results for queries that are phrased as questions. Here is one email transformed this way:

{
  "questions_and_answers": [
    {
      "question": "Who is the email from?",
      "answer": "The email is from phillip.allen@enron.com."
    },
    {
      "question": "Who is the email to?",
      "answer": "The email is to david.delainey@enron.com."
    },
    {
      "question": "What is the issue the back office is having?",
      "answer": "The back office is having a hard time dealing with the $11 million dollars that is to be recognized as transport expense by the west desk then recouped from the Office of the Chairman."
    },
    ...
  ]
}
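
Pairs like these can be generated by prompting a chat model to emit structured JSON. The snippet below is a minimal sketch using the OpenAI chat completions API; the prompt wording, the gpt-3.5-turbo model choice, and the `document_to_qa_pairs` helper are illustrative assumptions, not the exact code from the notebook.

import json
import openai

# Minimal sketch of the Q/A transformation step. Assumes the pre-1.0 OpenAI
# Python client and that the model returns valid JSON; production code would
# need error handling for malformed responses.
def document_to_qa_pairs(document: str) -> dict:
    prompt = (
        "Generate question and answer pairs covering the key facts in the "
        "following document. Respond only with JSON of the form "
        '{"questions_and_answers": [{"question": "...", "answer": "..."}]}.'
        "\n\nDocument:\n" + document
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)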

We were curious whether that was true in practice as well as in theory, so we created a basic benchmark using LangChain and FAISS to determine whether these performance improvements are real, and under what conditions.

You can test out the notebook here: https://github.com/psychic-api/doctran/blob/main/benchmark.ipynb
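
The core of the benchmark is to embed both the raw emails and the serialized question/answer pairs, index each in its own FAISS store, and compare the nearest-neighbor distance for every test query. The sketch below approximates that setup with LangChain; `raw_emails`, `qa_texts`, and `queries` are placeholders for the Enron emails, the generated pairs, and the benchmark questions, and the exact calls may differ from the notebook.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# raw_emails: list of email bodies; qa_texts: the serialized question/answer
# pairs generated from those emails; queries: the benchmark questions.
embeddings = OpenAIEmbeddings()
email_index = FAISS.from_texts(raw_emails, embeddings)
qa_index = FAISS.from_texts(qa_texts, embeddings)

for query in queries:
    # similarity_search_with_score returns (document, distance) pairs;
    # a lower distance means a closer match.
    _, email_distance = email_index.similarity_search_with_score(query, k=1)[0]
    _, qa_distance = qa_index.similarity_search_with_score(query, k=1)[0]
    print(f"{query!r}: email_distance={email_distance:.3f}, qa_distance={qa_distance:.3f}")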

Summary of results

We were able to consistently achieve a higher true positive rate by vectorizing the question/answer pairs generated from this email dataset instead of the raw emails, as indicated by `qa_distance` being consistently lower than `email_distance` for questions whose answers were present in the dataset.

However, for queries whose answers were not present in the dataset (“What were the two most exciting events of 2023?”), or that were completely nonsensical (“How many legs does the city of San Francisco have?”), converting documents to question/answer pairs before vectorizing tends to yield more false positives.

For use cases where users are expected to ask questions rather than provide instructions to an LLM, converting source documents into question and answer pairs is likely to yield more reliable vector retrieval. For example, this preprocessing technique is probably better suited to customer support or knowledge base search, and less well suited to agent-based workflows.

Disclaimer: we used a very small number of test cases! We highly recommend testing on your own dataset and queries before drawing any conclusions.
