Build a bot to answer questions over documents with GPT and Weaviate

Jason Fan
8 min read · Apr 4, 2023


We’ve spent the last few weeks scraping all sorts of developer docs for our customers and ingesting them into a vector store. This allowed us to create chat APIs for our customers, who can then use them to get super fast answers to their questions without having to search through the docs directly.

We just made our product 100% open source — including our data connectors. If you’re just looking to get started fast, consider checking out our repo! If you’re looking to build your own data connectors for use with GPT3/4 or other language models, read on to learn how we did it.

Here’s an example of the end result, using Slack as the interface.

This also lets you build some pretty useful documentation search features like Supabase’s new Clippy for docs.

Why we built our own data connectors

We started out trying tools like LangChain and LlamaHub that provide out-of-the-box document loaders, but we found that while they cover a large variety of data sources, the markdown loader was fairly basic and couldn’t handle many of the edge cases we encountered when ingesting developer docs hosted online.

A few of the issues we ran into that required a custom data connector:

Langchain splitters don’t preserve structure

Langchain splitters are general purpose by design, so they don’t let you extract as much semantically meaningful metadata from a document as writing a custom data connector would.

For example, the langchain MarkdownTextSplitter simply splits a markdown file using headings as the separator. This doesn’t capture the hierarchy of headings: a paragraph about “getting started with langchain agents” should be vectorized with both the “Getting started” h2 heading and the “Agents” title of the document, so that searches for topics close to both “agents” and “getting started” in the embedding space return this paragraph.
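As a rough illustration (a sketch assuming LangChain’s MarkdownTextSplitter API; the document text is made up), the splitter returns plain strings with no record of which headings a paragraph sits under:

# Sketch assuming langchain's MarkdownTextSplitter; the markdown below is made up
from langchain.text_splitter import MarkdownTextSplitter

doc = """# Agents

## Getting started

Agents use a language model to decide which tools to call and in what order.
"""

splitter = MarkdownTextSplitter(chunk_size=60, chunk_overlap=0)
chunks = splitter.split_text(doc)
# chunks is a list of plain strings; nothing records that the paragraph
# belongs under the "Getting started" h2 in the "Agents" document
print(chunks)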

MDX files are a nightmare to parse

MDX files are a popular way to create nice looking documentation with embedded JSX. The problem is that it’s impossible to tell what the final content on a page will be without compiling and running the page in React, at which point we may as well just use an HTML parser to process the content.

How we build new connectors

Here are the six steps we followed to build a markdown parser specifically for ingesting content into LLM-based chatbots:

  1. Decide on a chunking scheme
  2. Write a custom parser for markdown and HTML
  3. Annotate chunks
  4. Upload to a vector store
  5. Test with GPT
  6. (Optional) Tweak results with hybrid search

This might sound like a lot but it’s simpler than it seems!

Decide on a chunking scheme

The first thing we did was decide how we wanted to load data into the context window of our prompt. GPT3 has a token limit of just 2048 for the prompt and completion combined, while GPT4 will have 8000. We allocated 500 tokens for the completion and ~500 tokens for the prompt, which left roughly 1000 tokens for the context window. With those 1000 tokens, we wanted to include at least 3 results, so the average size of each chunk would need to be ~300 tokens.
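A quick way to sanity-check a chunk against this budget is to count its tokens with tiktoken (a minimal sketch; the budget numbers are just the ones above):

# Minimal sketch: check a chunk against the token budget described above
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

MAX_TOKENS = 2048          # prompt + completion budget
COMPLETION_BUDGET = 500
PROMPT_BUDGET = 500
CONTEXT_BUDGET = MAX_TOKENS - COMPLETION_BUDGET - PROMPT_BUDGET  # ~1000 tokens
TARGET_CHUNK_TOKENS = CONTEXT_BUDGET // 3                        # ~300 tokens per chunk

def num_tokens(text: str) -> int:
    return len(enc.encode(text))

chunk = "Some candidate chunk of documentation..."
print(num_tokens(chunk) <= TARGET_CHUNK_TOKENS)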

The next thing to consider is where to split the content, and how much overlap to include between chunks. One approach is to simply split the content every 1000 characters (roughly 300 tokens) with 250 characters of overlap. This naive approach worked surprisingly well, but occasionally caused hallucinated results when content was cut off in the middle of a sentence.
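For reference, the naive splitter is only a few lines (a sketch using the 1000-character window and 250-character overlap mentioned above):

# Naive fixed-size splitter with overlap (window and overlap sizes from above)
def naive_split(text: str, size: int = 1000, overlap: int = 250) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # slide the window, keeping `overlap` characters
    return chunks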

Another common approach is to split chunks at sentence boundaries. This avoids getting back chunks that are cut off in the middle of a word, but still doesn’t preserve the semantic relationships between related parts of the content.

The approach we ultimately used was simple — we chunked content using headers as the delimiter, but also included the headers as metadata for the chunk.

Write a custom parser

[Code here]

We used the markdown2 library to convert markdown files to HTML, and BeautifulSoup to parse the content. This was the easiest way to get started since we were familiar with BeautifulSoup, but in practice writing a line-by-line parser for markdown probably would have worked better.

import markdown2
from bs4 import BeautifulSoup
html = markdown2.markdown(markdown_content, extras=["tables"])
soup = BeautifulSoup(html, 'html.parser')

We included two types of output in the parser: raw_markdown, which preserves the markdown formatting and therefore the structure of code blocks and lists, and content, which is just the plain text of the chunk. Using the plain text content for vectorization leads to better results. The raw markdown is useful because GPT can be instructed to preserve markdown formatting from its context window when responding in destinations that support markdown, like Slack or documentation search.
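To make the distinction concrete, here is a small sketch of the two outputs for a single HTML fragment (this assumes the markdownify package, which provides an md() HTML-to-markdown helper like the one used in the parser below):

# Sketch of the two parser outputs for a single HTML fragment
from bs4 import BeautifulSoup
from markdownify import markdownify as md

html = "<h3>Install</h3><p>Run <code>pip install weaviate-client</code> to get started.</p>"

soup = BeautifulSoup(html, "html.parser")
content = " ".join(soup.get_text(" ").split())  # plain text, used for vectorization
raw_markdown = md(html)                         # keeps headings and inline code for Slack

print(content)        # "Install Run pip install weaviate-client to get started."
print(raw_markdown)   # markdown with a "### Install" heading and backticked code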

def get_new_block(c: list[Tag]) -> DocumentChunk:
    if len(c) == 0:
        return None
    stripped_content = " ".join([tag.get_text(" ") for tag in c])
    stripped_content = " ".join(stripped_content.split())
    raw_markdown = "\n\n".join(md(tag) for tag in c)
    ## Remove any empty content blocks, e.g. those with only images
    if len(stripped_content) < 1:
        return None
    # title, h2, h3, h4, product, and source are tracked by the parsing loop below
    chunk = DocumentChunk(
        id=str(uuid.uuid4()),
        content=stripped_content,
        raw_markdown=raw_markdown,
        title=title,
        h2=md(h2) if h2 else None,
        h3=md(h3) if h3 else None,
        h4=md(h4) if h4 else None,
        product=product,
        source_type=Source.github_markdown,
        url=source
    )
    return chunk

Markdown files often contain links to other files in the same directory, given as relative paths. Since we wanted to pull markdown out of the vector store and use it in Slack, we had to convert relative paths to full URLs; otherwise links in the markdown would be broken. If you don’t need to get markdown-formatted content back from the vector store, you can skip this step.

# Update all relative links to absolute links
for a in soup.find_all('a', href=True):
    source_path = Path(source).parent
    href = a['href']
    if href.startswith('/') or href.startswith('#'):
        a['href'] = source_path / href
    # For complex relative links, e.g. ../../../../, we need to go up the directory tree
    elif not href.startswith('http'):
        parts = Path(href).parts
        for part in parts:
            if part == '..':
                source_path = source_path.parent
            else:
                if part.endswith('.md'):
                    part = part[:-3]
                source_path = source_path / part
        a['href'] = source_path

Any <p>, <ul>, <ol>, and <pre> tags are grouped into the current chunk, and headers indicate the start of a new chunk. We excluded tables because they took up too many tokens and rarely contained content that was useful for documentation search use cases.

for tag in soup.children:
    ## Reset all lower level headers if we hit a higher level header
    if tag.name == 'h1':
        chunks.append(get_new_block(content))
        content = []
        h2 = None
        h3 = None
        h4 = None
    elif tag.name == 'h2':
        chunks.append(get_new_block(content))
        content = []
        h2 = tag
        h3 = None
        h4 = None
    elif tag.name == 'h3':
        chunks.append(get_new_block(content))
        content = []
        h3 = tag
        h4 = None
    elif tag.name == 'h4':
        chunks.append(get_new_block(content))
        content = []
        h4 = tag
    elif tag.name in ['p', 'ul', 'ol', 'pre']:
        content.append(tag)
        continue

# Flush whatever is left after the last header
chunks.append(get_new_block(content))

Edge cases

Since markdown has a flat structure, converting it to HTML and parsing it line by line generalizes very well. When it comes to scraping documentation sites that are already rendered as HTML, however, we ran into a lot of edge cases.

Here is a list of some of them. You’ll likely run into many more:

  • Header containers: Sometimes headers would be wrapped in one or more <div>s instead of being at the same level as the content
  • Relative paths: Some sites use relative paths in <a> elements instead of absolute paths, e.g. /tutorial instead of https://docs.example.com/tutorial. This means we had to check if a link was a relative path, and if so reconstruct the full URL before scraping and parsing it (see the sketch after this list).
  • Dynamic sites: Some sites are dynamically rendered client-side. Playwright can handle these sites by running in headless=False mode and automating clicks/scrolls, but it’s still a pain to deal with.
  • DDoS protection: Some sites have rate limits or use products like Cloudflare to prevent DDoS attacks, which makes scraping them more challenging.
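Resolving relative links against the page they were found on is one line with the standard library (a minimal sketch; the URLs are made up):

# Minimal sketch: resolve relative hrefs against the page URL (example URLs)
from urllib.parse import urljoin

page_url = "https://docs.example.com/guides/tutorial"

print(urljoin(page_url, "/tutorial"))      # https://docs.example.com/tutorial
print(urljoin(page_url, "../reference"))   # https://docs.example.com/reference
print(urljoin(page_url, "#install"))       # https://docs.example.com/guides/tutorial#install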

If the documentation is available via an API or as raw markdown files, use those. Web scraping should only be used as a fallback, since it results in lower quality data but has the benefit of working on almost every website.

Annotate chunks

To preserve the surrounding context of each chunk within the document it belongs to, we included each header level (title to h4) as a property of the chunk. When we upload the content to the vector store, each header label is included in the vector space alongside the content. This ensures that semantic searches will consider the position in the document of each chunk, as represented by the most recent headers.

For example, in this chunk, title, h2, and h3 are vectorized alongside the content, so searches for “How do I create a BentoService object in Spark” will return this chunk even though Spark is not mentioned anywhere in the content itself.

{
  "source": "https://docs.bentoml.org/en/latest/integrations/spark.html",
  "title": "Spark",
  "product": "bentoml",
  "h2": "Run Bentos in Spark",
  "h3": "Create a BentoService object",
  "h4": null,
  "raw_markdown": "Create a BentoService object using the BentoML service you want to use for the batch inference\njob. Here, we first try to use `bentoml.get` to get the bento from the local BentoML store. If it\nis not found, we retrieve the bento from the BentoML public S3 and import it.\n\n\n```\nimport bentoml\n\nbento = bentoml.get(\"iris\\_classifier:latest\")\n\n```\n",
  "content": "Create a BentoService object using the BentoML service you want to use for the batch inference\njob. Here, we first try to use bentoml.get to get the bento from the local BentoML store. If it\nis not found, we retrieve the bento from the BentoML public S3 and import it. import bentoml \n\n bento = bentoml . get ( \"iris_classifier:latest\" )",
  "id": "818a403c-6c13-427a-b2e3-cb2d074c7dfd"
},

Finally, for cases where the raw markdown files were not available, or where the files were in .mdx format, we used the Playwright library to scrape the HTML content of the hosted docs and parsed it in the same way described above. When using the HTML parser, we recursively download every page linked from the docs site and parse each one.
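A minimal Playwright sketch for fetching the rendered HTML of a single page looks like this (the URL is a placeholder, and the waiting logic will vary by site):

# Minimal sketch: fetch rendered HTML with Playwright, then hand it to BeautifulSoup
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

url = "https://docs.example.com/tutorial"  # placeholder URL

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(url, wait_until="networkidle")  # wait for client-side rendering to settle
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")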

Upload to a vector store

[Code here]

Once we have the chunks, it’s easy to upload them to any vector store. We used Weaviate since it’s open source. It also supports vectorization of properties which is handy, though it may be possible to achieve similar results with Pinecone’s metadata filters.

We uploaded each chunk to Weaviate, vectorizing only title, h2, h3, h4, and content with the text-embedding-ada-002 embeddings. Note that you’ll need to create a new schema using the Weaviate Python client first.
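A schema along these lines would work (a sketch assuming the v3 Weaviate Python client and the text2vec-openai module; the class name DocumentChunk and the local URL are placeholders):

# Sketch of a schema for the chunks above (class name and URL are placeholders)
import weaviate

client = weaviate.Client("http://localhost:8080")

schema_class = {
    "class": "DocumentChunk",
    "vectorizer": "text2vec-openai",
    "moduleConfig": {"text2vec-openai": {"model": "ada", "type": "text"}},
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "h2", "dataType": ["text"]},
        {"name": "h3", "dataType": ["text"]},
        {"name": "h4", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
        # Stored but skipped during vectorization
        {"name": "url", "dataType": ["text"],
         "moduleConfig": {"text2vec-openai": {"skip": True}}},
        {"name": "product", "dataType": ["text"],
         "moduleConfig": {"text2vec-openai": {"skip": True}}},
        {"name": "raw_markdown", "dataType": ["text"],
         "moduleConfig": {"text2vec-openai": {"skip": True}}},
        {"name": "source_type", "dataType": ["text"],
         "moduleConfig": {"text2vec-openai": {"skip": True}}},
    ],
}

client.schema.create_class(schema_class)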

async def _upsert(self, chunks: List[DocumentChunk]) -> List[str]:
    """
    Takes in a list of document chunks and inserts them into the database.
    Returns a list of document ids.
    """
    doc_ids = []
    with self.client.batch as batch:
        batch.batch_size = 100
        # Batch import all documents
        for chunk in chunks:
            print(f"importing document: {chunk.title}")
            properties = {
                "title": chunk.title,
                "url": chunk.url,
                "product": chunk.product,
                "h2": chunk.h2,
                "h3": chunk.h3,
                "h4": chunk.h4,
                "content": chunk.content,
                "raw_markdown": chunk.raw_markdown,
                "source_type": chunk.source_type
            }
            # Derive a deterministic UUID from the chunk id
            batch.add_data_object(properties, WEAVIATE_INDEX, generate_uuid5(chunk.id))
            doc_ids.append(chunk.id)
    return doc_ids

Test with GPT

[Code here]

Finally, it’s time to test whether the chunks work effectively. This can be done by making a query to the vector store, processing the results, inserting them into a prompt via template strings, and finally making a call to the OpenAI API. In this example we used the gpt-3.5-turbo model, which is the fastest and cheapest model in the GPT-3 series.
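A sketch of that flow, assuming the schema above and the 2023-era openai.ChatCompletion API (the class name, URL, question, and prompt wording are placeholders):

# Sketch: retrieve chunks, build a prompt, and call gpt-3.5-turbo
import openai
import weaviate

client = weaviate.Client("http://localhost:8080")
question = "How do I run a bento in Spark?"

# 1. Retrieve the top 3 chunks from the vector store
results = (
    client.query.get("DocumentChunk", ["title", "h2", "h3", "content"])
    .with_near_text({"concepts": [question]})
    .with_limit(3)
    .do()
)
chunks = results["data"]["Get"]["DocumentChunk"]

# 2. Insert the chunks into the prompt via a template string
context = "\n\n".join(chunk["content"] for chunk in chunks)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)

# 3. Call the OpenAI API
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response["choices"][0]["message"]["content"])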

(Optional) Tweak results with hybrid search

[Code here]

Both Weaviate and Pinecone have parameters that can be tweaked to get better search results depending on whether your use case favors keyword search (sparse vectors) or semantic search (dense vectors).

def extract_schema_properties(schema):
    properties = schema["properties"]
    return {property["name"] for property in properties}

# alpha balances keyword (BM25) and vector search: 0 is pure keyword, 1 is pure vector
result = (
    self.client.query.get(
        WEAVIATE_INDEX,
        list(extract_schema_properties(SCHEMA)),
    )
    .with_hybrid(query=query.query, alpha=0.5, vector=query.embedding)
    .with_where(filters_)
    .with_limit(query.top_k)  # type: ignore
    .with_additional(["score", "vector"])
    .do()
)

Conclusion

That was a comprehensive walkthrough of how we built a developer documentation source connector for our GPT-powered chatbot. If you’re working on your own connector, come join our Slack and share what you’re working on, or contribute a new connector to the Sidekick repo.
