How to connect Llama 2 to your own data, privately

Jason Fan
7 min read · Jul 19, 2023

Llama 2 + RAG = 🤯

RAG stands for Retrieval Augmented Generation, a technique where the capabilities of a large language model (LLM) are augmented by retrieving information from other systems and inserting it into the LLM’s context window via a prompt. This gives LLMs information beyond what was provided in their training data, which is necessary for almost every enterprise application. Examples include data from current web pages, data from SaaS apps like Confluence or Salesforce, and data from documents like sales contracts and PDFs.
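
In code, the core loop looks something like this minimal sketch (retrieve_documents and call_llm are hypothetical placeholders for whatever retrieval system and hosted model you end up using):

# Minimal sketch of the RAG flow. retrieve_documents() and call_llm() are
# hypothetical placeholders for your own retrieval system and hosted LLM.
def answer_with_rag(question: str) -> str:
    # 1. Retrieve the documents most relevant to the question
    documents = retrieve_documents(question, top_k=5)

    # 2. Insert them into the LLM's context window via the prompt
    context = "\n\n".join(doc["content"] for doc in documents)
    prompt = (
        "Context:\n"
        f"{context}\n\n"
        f"Given the above context, answer the question: {question}"
    )

    # 3. Ask the LLM to answer using that context
    return call_llm(prompt)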

RAG works better than fine-tuning the model because:

  • It’s cheaper: fine-tuning models is expensive because the weights of the model parameters themselves must be adjusted. RAG is simply a series of vector/SQL queries and API calls, which cost tiny fractions of a cent.
  • It’s faster: fine-tuning takes hours each time, sometimes days depending on model size. With RAG, changes to a knowledge base are reflected in model responses instantly because it’s retrieving just-in-time and directly from the system of record.
  • It’s reliable: RAG allows you to track metadata on the documents retrieved, including information about where each document came from, making it much easier to check for hallucinations. Fine-tuning, on the other hand, is a black box.

Llama 2 by itself is like a new hire — it has great general knowledge and reasoning capabilities, but lacks any experience or context on your organization.

Llama 2 with RAG is like a seasoned employee — it understands how your business works and can provide context-specific assistance on everything from customer feedback analysis to financial planning and forecasting.

Configure the local gcloud client

We’ll need to first get our Project ID from the GCP Console and sign in to the gcloud client.

Then run the following commands in the terminal, replacing $PROJECT_ID with your project ID.

gcloud auth login
gcloud config set project $PROJECT_ID

Now we have two options to deploy each service — we can use RAGstack, or we can deploy each service manually.

Deploy using RAGstack

RAGstack is an open source tool that uses Terraform and Truss to automate deploying an LLM (Falcon or Llama 2) and a vector store. It also includes an API service and a lightweight UI that make it easy to accept user queries and retrieve context.

RAGstack also allows us to run each service locally, so we can test out the application before deploying!

Developing locally

To run RAGstack locally, run:

./run-dev.sh

This will set up your local dev environment and install all the necessary Python and Node.js dependencies. Changes to files under server and ragstack-ui will be reflected automatically.

Deploying

To deploy RAGstack to GCP, run:

./deploy-gcp.sh

If you don’t have Terraform installed, you’ll have to install it first by following these instructions. On Mac, it’s just two commands:

brew tap hashicorp/tap
brew install hashicorp/tap/terraform

However, if you’d prefer to set things up yourself, read on!

Deploy manually

Step 1: Deploy Llama 2 to GCP

  • To learn how to deploy a Llama 2 model to GCP, check out this tutorial.
  • To deploy on AWS or Azure, contact us at founders@psychic.dev

Step 2: Deploy a vector database

Vector databases are the most common way to store context for retrieval, since similarity search over embeddings lends itself well to querying in natural language.

Some of the most popular vector databases include Pinecone, Weaviate, Chroma, Milvus, and Qdrant.

In this tutorial we’ll use Qdrant, since it has a convenient Docker image we can pull and deploy.

In the terminal, run:

gcloud run deploy qdrant --image qdrant/qdrant:v1.3.0 --region $REGION --platform managed --set-env-vars QDRANT__SERVICE__HTTP_PORT=8080

Replace $REGION with the region you want to deploy to. We typically deploy to us-west1 but you should deploy to a datacenter close to you or your users.
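
Once the Qdrant instance is up, you’ll also need a collection to store embeddings in. Here’s a minimal sketch using the qdrant-client Python package; the URL placeholder, the collection name ("documents"), and the vector size (384, matching the all-MiniLM-L6-v2 sentence-transformers model used later) are assumptions you should adjust to your own setup:

from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

# Assumed values: replace with your own Cloud Run URL and collection name
QDRANT_URL = "https://your-qdrant-service-url.a.run.app"
COLLECTION_NAME = "documents"

client = QdrantClient(url=QDRANT_URL)

# 384 dimensions matches sentence-transformers' all-MiniLM-L6-v2;
# use the output size of whatever embedding model you choose.
client.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)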

Step 3: Upload documents to the vector database

We’ll need some way to collect documents from our users. The easiest way is to read in a file path from the command line. The RAGstack library has a simple UI that handles file uploads and parsing.

# Assumes `uuid` is imported and `Document` is the document model used
# throughout this tutorial (content, title, id, uri).
def read_document() -> Document:
    while True:
        try:
            # Get the file path from the user
            file_path = input("Enter the file path: ")
            # Read the file
            with open(file_path, 'r') as file:
                data = file.read()
            return Document(
                content=data,
                title="Doc Title",
                id=str(uuid.uuid4()),
                uri=file_path
            )
        except Exception as e:
            print(f"An error occurred: {e}")
            input("Press enter to continue...")

We’ll also need a function to convert these documents into embeddings and insert them into Qdrant.

Source

# Assumes LangChain's RecursiveCharacterTextSplitter and Document (page_content
# + metadata), qdrant_client's PointStruct, uuid, and an embeddings_model are
# imported and initialized elsewhere.
async def upsert(self, documents: List[Document]) -> bool:
    # Wrap our documents in LangChain Documents so they can be chunked
    langchain_docs = [
        Document(
            page_content=doc.content,
            metadata={"title": doc.title, "id": doc.id, "source": doc.uri}
        ) for doc in documents
    ]
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    split_docs = text_splitter.split_documents(langchain_docs)

    points = []
    seen_docs = {}
    for doc in split_docs:
        # Derive a deterministic chunk ID from the document ID and chunk index
        doc_id = doc.metadata["id"]
        if doc_id not in seen_docs:
            seen_docs[doc_id] = 1
            chunk_id = uuid.uuid5(uuid.NAMESPACE_DNS, f"{doc_id}_1")
        else:
            seen_docs[doc_id] += 1
            chunk_id = uuid.uuid5(uuid.NAMESPACE_DNS, f"{doc_id}_{seen_docs[doc_id]}")
        # Embed the chunk and store it along with its metadata
        vector = embeddings_model.encode([doc.page_content])[0]
        vector = vector.tolist()
        points.append(PointStruct(
            id=str(chunk_id),
            payload={
                "metadata": {
                    "title": doc.metadata["title"],
                    "source": doc.metadata["source"],
                    "chunk_id": str(chunk_id),
                    "doc_id": doc_id,
                },
                "content": doc.page_content
            },
            vector=vector
        ))
    self.client.upsert(
        collection_name=self.collection_name,
        points=points
    )
    return True
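
The upsert method above assumes an embeddings_model and a Qdrant client are already initialized. Here’s a minimal sketch of that setup, assuming the sentence-transformers model all-MiniLM-L6-v2 and a hypothetical QdrantVectorStore wrapper class; swap in whichever embedding model you prefer, as long as its output dimension matches your collection’s vector size:

from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

# Assumed embedding model; any sentence-transformers model works as long as
# its output dimension matches the vector size of your Qdrant collection.
embeddings_model = SentenceTransformer("all-MiniLM-L6-v2")

class QdrantVectorStore:
    def __init__(self, url: str, collection_name: str = "documents"):
        # Points at the Qdrant instance deployed to Cloud Run in step 2
        self.client = QdrantClient(url=url)
        self.collection_name = collection_name

    # The upsert() and query() methods from this tutorial live on this class.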

Step 4: Connect the dots

In order to run our RAG model end to end, we’ll need to set up some additional glue functionality in Python.

First, accept the user’s query, convert it into an embedding, and run a similarity search in Qdrant:

Source

async def query(self, query: str) -> List[PsychicDocument]:
    query_vector = embeddings_model.encode([query])[0]
    query_vector = query_vector.tolist()
    # query_vector = embeddings.embed_query(query)
    results = self.client.search(
        collection_name=self.collection_name,
        query_vector=query_vector,
        limit=5
    )
    return results
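
One thing to note: client.search returns Qdrant point objects rather than the Document model used elsewhere in this tutorial, so depending on how your ask function consumes the results, you may want a small conversion step. Here’s a sketch assuming the payload layout from the upsert function above (points_to_documents is a hypothetical helper):

def points_to_documents(results) -> List[Document]:
    # Each Qdrant hit carries the payload stored in upsert():
    # {"metadata": {...}, "content": "..."}
    return [
        Document(
            content=point.payload["content"],
            title=point.payload["metadata"]["title"],
            id=point.payload["metadata"]["doc_id"],
            uri=point.payload["metadata"]["source"],
        )
        for point in results
    ]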

Next, construct the prompt using a template and the retrieved documents, then send it to the hosted Llama 2 model:

Source — this is from the code for the Falcon 7B model, but since we use Truss to serve models, the code will be the same when connecting to Llama 2.

async def ask(self, documents: List[Document], question: str) -> str:
    context_str = ""
    for doc in documents:
        context_str += f"{doc.title}: {doc.content}\n\n"
    prompt = (
        "Context: \n"
        "---------------------\n"
        f"{context_str}"
        "\n---------------------\n"
        f"Given the above context and no other information, answer the question: {question}\n"
    )
    data = {"prompt": prompt}
    # base_url points at the Truss-served model deployed in step 1
    res = requests.post(f"{base_url}:8080/v1/models/model:predict", json=data)
    res_json = res.json()
    return res_json['data']['generated_text']

The easiest way to put all this together is to set up an API server with FastAPI or Flask that handles all the communication and coordination between the hosted Llama 2 instance, the hosted Qdrant instance, and user inputs.

Here’s an example using FastAPI:

from fastapi import FastAPI, File, HTTPException, Depends, Body, UploadFile
from fastapi.security import HTTPBearer
from fastapi.middleware.cors import CORSMiddleware
from models.api import (
    AskQuestionRequest,
    AskQuestionResponse,
)

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
bearer_scheme = HTTPBearer()

@app.post(
    "/ask-question",
    response_model=AskQuestionResponse,
)
async def ask_question(
    request: AskQuestionRequest = Body(...),
):
    try:
        question = request.question
        # vector_store and llm should be classes that provide an interface
        # into your hosted Qdrant and Llama 2 instances
        documents = await vector_store.query(question)
        answer = await llm.ask(documents, question)
        return AskQuestionResponse(answer=answer)
    except Exception as e:
        print(e)
        raise HTTPException(status_code=500, detail=str(e))

You can see how these pieces fit together in the RAGstack library — the entry point for the application is a FastAPI server run from server/main.py (https://github.com/psychic-api/rag-stack/blob/main/server/server/main.py).

Disable authentication

In order to make testing our new RAG model easier, we can select “Allow unauthenticated invocations” for each of our GCP services (the hosted Llama 2 model, the hosted Qdrant image, and any API server you have set up).

Make sure you set up authentication after your testing is complete or you might run into some surprises on your next billing cycle. GPUs ain’t cheap!
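
For example, to lock the Qdrant service back down after testing, you can remove the allUsers invoker binding (repeat for each service you opened up, using the service name and region you deployed with):

gcloud run services remove-iam-policy-binding qdrant \
  --member="allUsers" \
  --role="roles/run.invoker" \
  --region=$REGION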

Test your RAG model

Here comes the fun part! We now have a Llama 2 7B service, a Qdrant instance, and an API service to connect all the pieces together, all deployed on GCP.

Use this simple cURL command to query your new RAG model:

curl --location '$API_SERVICE_URL/ask-question' \
--header 'Content-Type: application/json' \
--data '{
    "question": "Who was president of the United States of America in 1890?"
}'

Replace $API_SERVICE_URL with the URL to the service that is responsible for connecting the vector store and Llama 2.

Once you get your answer back, you’ll have effectively deployed the equivalent of ChatGPT entirely on your own infrastructure, and connected it to your own documents. You can ask questions and get answers without making a single API call to external third parties!

Next steps: Sync your knowledge base apps to your vector database

If you’re looking to deploy your RAG application to production, uploading text files and PDFs doesn’t scale. For production use cases, check out Psychic — we make it easy to connect your private data to LLMs with a turnkey solution that handles auth, data syncs, and data cleaning so you can get back exactly the documents you need in a format that’s optimized for vectorization.
