Deploying LLMs like Google Gemma
In this tutorial we will run Google Gemma with Ollama so that you can send queries via a REST API.
Ollama lets you work with large language models (LLMs) through its library of open-source models and its user-friendly API. This makes it easy to choose the best LLM for a specific task, whether that's text generation, translation, or code analysis, and to interact with different models in a consistent way.
Quick start guide
- Prerequisites
- Starting a VM with cudoctl
- Installing Ollama via SSH
- Using Docker to start an LLM API
Prerequisites
- Create a project and add an SSH key
- Download the CLI tool
Starting a VM with cudoctl
Start a VM with the base image you require; here we will use an image that already has the NVIDIA drivers installed.
You can use the web console to start a VM with the Ubuntu 22.04 + NVIDIA drivers + Docker image, or alternatively use the command line tool cudoctl.
To use the command line tool you will need to get an API key from the web console; see here: API key.
Then run cudoctl init and enter your API key.
First, we search for a VM type to start:
cudoctl search --vcpus 4 --mem 8 --gpus 1
Find an image:
cudoctl search images
After deciding on the epyc-milan-rtx-a4000 machine type (16GB GPU) in the se-smedjebacken-1 data center and the ubuntu-2204-nvidia-535-docker-v20240214 image, we can start a VM:
cudoctl vm create --id my-ollama --image ubuntu-2204-nvidia-535-docker-v20240214 --machine-type epyc-milan-rtx-a4000 --memory 8 --vcpus 4 --gpus 1 --boot-disk-size 80 --boot-disk-class network --data-center se-smedjebacken-1
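Provisioning can take a minute or two. You can check on the VM with the same vm get command that we use below to find its IP address:
cudoctl vm get my-ollama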
Installing Ollama via SSH
Get the IP address of the VM:
cudoctl -json vm get my-ollama | jq '.externalIP'
SSH into the VM:
ssh root@<IP_ADDRESS>
Install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
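As a quick sanity check, you should now be able to print the installed version, and on systemd-based distributions such as the Ubuntu image used here the installer normally registers an ollama service as well:
ollama --version
systemctl status ollama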
Download and run the Google Gemma LLM; then you can enter your prompt:
ollama run gemma:7b
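You can also pass a prompt directly as an argument to ollama run for a one-off, non-interactive response (the prompt below is just an example, not from the original guide):
ollama run gemma:7b "Write a haiku about GPUs"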
From the Ollama docs:
| Model | Parameters | Size | Download |
|---|---|---|---|
| Llama 2 | 7B | 3.8GB | ollama run llama2 |
| Mistral | 7B | 4.1GB | ollama run mistral |
| Dolphin Phi | 2.7B | 1.6GB | ollama run dolphin-phi |
| Phi-2 | 2.7B | 1.7GB | ollama run phi |
| Neural Chat | 7B | 4.1GB | ollama run neural-chat |
| Starling | 7B | 4.1GB | ollama run starling-lm |
| Code Llama | 7B | 3.8GB | ollama run codellama |
| Llama 2 Uncensored | 7B | 3.8GB | ollama run llama2-uncensored |
| Llama 2 13B | 13B | 7.3GB | ollama run llama2:13b |
| Llama 2 70B | 70B | 39GB | ollama run llama2:70b |
| Orca Mini | 3B | 1.9GB | ollama run orca-mini |
| Vicuna | 7B | 3.8GB | ollama run vicuna |
| LLaVA | 7B | 4.5GB | ollama run llava |
| Gemma | 2B | 1.4GB | ollama run gemma:2b |
| Gemma | 7B | 4.8GB | ollama run gemma:7b |
Using Docker to start an LLM API
If you created a VM in the previous step, delete it by running:
cudoctl vm delete my-ollama
Create a text file with a command to start the Ollama Docker container:
start-ollama.txt
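# Start the Ollama server in the background with access to all GPUs,
# persist downloaded models in the "ollama" Docker volume, and expose the API on port 11434.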
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Create a VM, adding the flag --start-script-file start-ollama.txt to include the start script:
cudoctl vm create --id my-ollama --image ubuntu-2204-nvidia-535-docker-v20240214 \
--machine-type epyc-milan-rtx-a4000 --memory 8 --vcpus 4 --gpus 1 --boot-disk-size 80 \
--boot-disk-class network --data-center se-smedjebacken-1 --start-script-file start-ollama.txt
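Optionally, to confirm that the start script ran, you can SSH into the VM as before and inspect the container with standard Docker commands (the container name ollama comes from the start script above):
docker ps
docker logs ollama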
Once the VM is running, you can curl the API to pull the model you require. Here we pull gemma:7b:
curl http://<IP_ADDRESS>:11434/api/pull -d '{"name": "gemma:7b"}'
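Pulling the 7B model can take a few minutes. To confirm the download finished, you can list the models the server has available using Ollama's /api/tags endpoint:
curl http://<IP_ADDRESS>:11434/api/tags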
Now it is ready to respond to a prompt:
curl http://<IP_ADDRESS>:11434/api/generate -d '{
"model": "gemma:7b",
"prompt":"Why when you leave water overnight in a glass does it create bubbles in the water?",
"stream":false
}' | jq '.response'
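Ollama also exposes a chat-style endpoint, /api/chat, which takes a list of messages instead of a single prompt and is convenient for multi-turn conversations. A minimal sketch using the same model as above:
curl http://<IP_ADDRESS>:11434/api/chat -d '{
"model": "gemma:7b",
"messages": [
{"role": "user", "content": "Explain in one sentence why the sky is blue."}
],
"stream": false
}' | jq '.message.content'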
Want to learn more?
You can learn more by contacting us, or you can just get started right away!