How to deploy Llama 2 to Google Cloud (GCP)

5 min read · Jul 19, 2023

Llama 2 tops the benchmarks for open source models

On almost every benchmark, Llama 2 outperforms both the 7B and 40B parameter versions of Falcon, the previous state-of-the-art open source model. Based on other benchmarks, it’s comparable to GPT-3.5 and falls just short of GPT-4, which is incredible considering GPT-4 is technically still in beta.

You can try out Llama 2 in a playground here.

Step 1: Get permission to access Llama 2

As of July 19, 2023, Meta has Llama 2 gated behind a signup flow. First, you will need to request access from Meta. Then, request access on Hugging Face so that the Docker container can download the model through the Hugging Face Hub.

Step 2: Containerize Llama 2

In order to deploy Llama 2 to Google Cloud, we will need to wrap it in a Docker container with a REST endpoint. We compared a couple different options for this step, including LocalAI and Truss. We ended up going with Truss because of its flexibility and extensive GPU support.

You can also use RAGStack, an MIT licensed project, to automate the other steps in this tutorial. It deploys Llama 2 to GCP with Terraform, and also includes a vector database and API server so you can upload files that Llama 2 can then retrieve.

To containerize Llama 2, start off by creating a Truss project:

truss init llama2-7b

This will create a directory named llama2-7b with the files that Truss needs. We’ll need to configure the project to download Llama 2 from Hugging Face and use it for predictions. Replace the contents of llama2-7b/model/model.py with the following code:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from typing import Dict
from huggingface_hub import login
import os

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
DEFAULT_MAX_LENGTH = 128

class Model:
    def __init__(self, data_dir: str, config: Dict, **kwargs) -> None:
        self._data_dir = data_dir
        self._config = config
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        print("THE DEVICE INFERENCE IS RUNNING ON IS: ", self.device)
        self.tokenizer = None
        self.pipeline = None
        # The token is read from the TRUSS_SECRET_huggingface_api_token
        # environment variable, which is set in the Kubernetes deployment.
        self.huggingface_api_token = os.environ.get("TRUSS_SECRET_huggingface_api_token")

    def load(self):
        login(token=self.huggingface_api_token)
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_auth_token=self.huggingface_api_token)
        model_8bit = AutoModelForCausalLM.from_pretrained(
            MODEL_NAME,
            device_map="auto",
            load_in_8bit=True,
            trust_remote_code=True)

        self.pipeline = pipeline(
            "text-generation",
            model=model_8bit,
            tokenizer=self.tokenizer,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
            device_map="auto",
        )

    def predict(self, request: Dict) -> Dict:
        with torch.no_grad():
            try:
                prompt = request.pop("prompt")
                data = self.pipeline(
                    prompt,
                    eos_token_id=self.tokenizer.eos_token_id,
                    max_length=DEFAULT_MAX_LENGTH,
                    **request
                )[0]
                return {"data": data}

            except Exception as exc:
                return {"status": "error", "data": None, "message": str(exc)}
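
One thing to watch in predict above: max_length is passed explicitly and the rest of the request is unpacked with **request, so a request body that includes its own max_length would raise a TypeError (caught and returned as an error response). A minimal sketch of one way to merge defaults instead is shown below. split_request is a hypothetical helper name, not part of Truss or transformers.

```python
from typing import Any, Dict, Tuple

DEFAULT_MAX_LENGTH = 128  # same default as in model.py


def split_request(request: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Separate the prompt from the generation kwargs, applying defaults.

    Hypothetical helper for illustration only.
    """
    params = dict(request)          # copy so the caller's dict is not mutated
    prompt = params.pop("prompt")   # raises KeyError if the prompt is missing
    # Only fall back to the default when the caller did not set max_length.
    params.setdefault("max_length", DEFAULT_MAX_LENGTH)
    return prompt, params
```

Inside predict, this could be used as prompt, params = split_request(request), followed by self.pipeline(prompt, eos_token_id=self.tokenizer.eos_token_id, **params).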

Because Llama 2 access on Hugging Face is gated, you will need to supply your Hugging Face API token when retrieving the model. Truss has a built-in secret management system so that we can use our API token without exposing it publicly in our Docker container. Update llama2-7b/config.yaml to:

apply_library_patches: true
bundled_packages_dir: packages
data_dir: data
description: null
environment_variables: {}
examples_filename: examples.yaml
external_package_dirs: []
input_type: Any
live_reload: false
model_class_filename: model.py
model_class_name: Model
model_framework: custom
model_metadata: {}
model_module_dir: model
model_name: llama2-7b
model_type: custom
python_version: py39
requirements:
- torch
- peft
- sentencepiece
- accelerate
- bitsandbytes
- einops
- scipy
- git+https://github.com/huggingface/transformers.git
resources:
  use_gpu: true
  cpu: "3"
  memory: 14Gi
secrets: {}
spec_version: '2.0'
system_packages: []

Now we can create the Docker image. Run the following Python script from the parent directory of the llama2-7b folder we just created:

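import truss
from pathlib import Path

# Load the Truss project created earlier and set up its Docker build directory.
tr = truss.load("./llama2-7b")
command = tr.docker_build_setup(build_dir=Path("./llama2-7b"))
# Print the docker build command to run next.
print(command)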

This script will give you the command to build the docker image:

docker build llama2-7b -t llama2-7b:latest

Once you build the image, you can push it to Docker Hub for use in GKE:

docker tag llama2-7b $DOCKER_USERNAME/llama2-7b
docker push $DOCKER_USERNAME/llama2-7b

Step 3: Deploy Llama 2 using Google Kubernetes Engine (GKE)

Now that we have a docker image with Llama, we can deploy it to GKE. We’ll need to open up the Google Cloud dashboard to Google Kubernetes Engine and create a new Standard Kubernetes Cluster named gpu-cluster. Set the zone to us-central1-c.

In the default-pool > Nodes tab, set

Machine Configuration from General Purpose to GPU
GPU type: Nvidia T4
Number of GPUs: 1
Enable GPU time sharing
Max shared clients per GPU: 8
Machine type: n1-standard-4
Boot disk size: 50 GB
Enable nodes on spot VMs

Once the GKE cluster has been created, we will need to install some Nvidia drivers. In your terminal, run

gcloud config set compute/zone us-central1-c
gcloud container clusters get-credentials gpu-cluster
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

After the drivers have been installed, it’s time to deploy to GKE. Create a YAML file named kubernetes_deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama2-7b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      component: llama2-7b-layer
  template:
    metadata:
      labels:
        component: llama2-7b-layer
    spec:
      containers:
      - name: llama2-7b-container
        image: psychicapi/llama2-7b:latest
        ports:
          - containerPort: 8080
        env:
          - name: TRUSS_SECRET_huggingface_api_token
            value: "$YOUR_HUGGINGFACE_API_TOKEN"
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: llama2-7b-service
  namespace: default
spec:
  type: LoadBalancer
  selector:
    component: llama2-7b-layer
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080

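Make sure to replace $YOUR_HUGGINGFACE_API_TOKEN with your Hugging Face API token. This is necessary because access to Llama 2 is gated. If you pushed your own image in the previous step, also replace psychicapi/llama2-7b:latest with $DOCKER_USERNAME/llama2-7b.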
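
If you would rather not paste the token into the file by hand, the placeholder can be filled in by a small script. This is a minimal sketch using Python's string.Template, which happens to use the same $NAME placeholder syntax as the manifest. The function name render_manifest and the output file name mentioned afterwards are assumptions for illustration.

```python
from string import Template


def render_manifest(manifest_text: str, token: str) -> str:
    """Fill the $YOUR_HUGGINGFACE_API_TOKEN placeholder in the manifest text."""
    # safe_substitute leaves any other $-prefixed text in the file untouched
    # instead of raising KeyError for placeholders we did not supply.
    return Template(manifest_text).safe_substitute(
        YOUR_HUGGINGFACE_API_TOKEN=token
    )
```

You could read kubernetes_deployment.yaml, pass its contents through render_manifest along with a token taken from an environment variable, and write the result to a separate file (for example kubernetes_deployment.rendered.yaml) that stays out of version control.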

Then run

kubectl create -f kubernetes_deployment.yaml

Once the deployment is complete, you can get the external IP of llama2-7b-service by calling

kubectl get svc
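
The external IP can take a minute or two to appear while GCP provisions the load balancer. As a rough sketch, the helper below pulls the IP out of the JSON that kubectl get svc llama2-7b-service -o json prints, returning None while it is still pending. The function name external_ip is hypothetical. The status.loadBalancer.ingress path is the standard location in a Kubernetes Service object.

```python
import json
from typing import Optional


def external_ip(service_json: str) -> Optional[str]:
    """Return the first load-balancer IP from a Service's JSON, or None."""
    service = json.loads(service_json)
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    # The ingress list is missing or empty while the IP is still being provisioned.
    return ingress[0].get("ip") if ingress else None
```

To feed it live data, you could capture the command's output with subprocess.run(["kubectl", "get", "svc", "llama2-7b-service", "-o", "json"], capture_output=True, text=True, check=True).stdout.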

Step 4: Test Your Deployed Model

Copy this cURL command and run it from your terminal, replacing $EXTERNAL_IP with the IP you got in the previous step.

curl --location 'http://$EXTERNAL_IP:8080/v1/models/model:predict' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "Who was president of the united states of america in 1890?"
}'

If you get a response back, it means your service is working as intended! 🎉
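
If you prefer to call the endpoint from Python, the same request can be sketched with the standard library. This assumes the service listens on port 8080, as declared in the Service manifest, and uses the same /v1/models/model:predict path as the cURL command. The function names are illustrative.

```python
import json
import urllib.request


def build_predict_request(
    external_ip: str, prompt: str, port: int = 8080
) -> urllib.request.Request:
    """Build the POST request for the predict endpoint without sending it."""
    url = f"http://{external_ip}:{port}/v1/models/model:predict"
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(external_ip: str, prompt: str, timeout: float = 120.0) -> dict:
    """Send the prompt to the deployed model and return the parsed JSON reply."""
    req = build_predict_request(external_ip, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling predict with your external IP and a prompt string should return the same JSON body that cURL prints. The generous timeout is a guess, since the first request may be slow.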

Keep in mind that the LoadBalancer service exposes your model on a public IP with no authentication, so anyone who knows the IP or URL of your service can make requests to your hosted Llama 2 model. You’ll probably want to set up authentication within GCP to allow only your own services to call the endpoints that interact with your model, otherwise your GPU bill can rack up very quickly 💸.

Next steps - connect your data

A self-hosted version of Llama 2 is useful. What takes this model from useful to mission critical for an organization is the ability to connect it to external and internal data sources, for example the web, or your company’s internal knowledge base.

We can achieve this with a technique called RAG (retrieval-augmented generation).

One way to think about RAG:

Llama 2 is like a new hire - it has general knowledge and reasoning capabilities, but lacks the experience necessary to be effective in any organization-specific contexts, which is most of the work employees need to do day-to-day.
Llama 2 with RAG is like a seasoned employee - it understands how your business works and can provide context-specific assistance on everything from customer feedback analysis to financial planning and forecasting.
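
To make the idea concrete, here is a toy sketch of the two halves of RAG: retrieve the most relevant documents for a question, then put them into the prompt sent to the model. The word-overlap scoring is a stand-in for the vector search a real setup would use, and none of these function names come from RAGStack.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: List[str], k: int = 2) -> str:
    """Prepend the retrieved documents to the question as context."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The string returned by build_prompt is what you would send as the prompt field to the deployed Llama 2 endpoint.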

Check out this tutorial to learn how to set up a RAG stack using the open source RAGStack library.

If you need help with this tutorial, or with deploying open source models in general, reach out on our Slack community!