Deploying LLMs like Google Gemma with Cudo Compute

In this tutorial we will run Google Gemma with Ollama so that you can send queries via a REST API.

Ollama empowers users to work with large language models (LLMs) through its library of open-source models and its user-friendly API. This allows users to choose the best LLM for their specific task, whether it's text generation, translation, or code analysis. Ollama also simplifies interaction with different LLMs, making them accessible to a wider audience and fostering a more flexible and efficient LLM experience.

Quick start guide

  1. Prerequisites
  2. Starting a VM with cudoctl
  3. Installing Ollama via SSH
  4. Using Docker to start an LLM API

Prerequisites

You will need a Cudo Compute account with an API key, the cudoctl command line tool installed (or access to the web console), and an SSH client. The examples below also use curl and jq.

Starting a VM with cudoctl

Start a VM with the base image you require; here we will use an image that already has the NVIDIA drivers installed.

You can use the web console to start a VM with the Ubuntu 22.04 + NVIDIA drivers + Docker image, or alternatively use the command line tool cudoctl.

To use the command line tool you will need an API key from the web console (see: API key). Then run cudoctl init and enter your API key.
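
For reference, initialising the CLI looks like this; it will prompt you for the key you generated in the web console:

cudoctl init   # paste your API key when prompted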

First, search for a VM type to start:

cudoctl search --vcpus 4 --mem 8 --gpus 1

Find an image:

cudoctl search images

After deciding on the epyc-milan-rtx-a4000 machine type (16GB GPU) in the se-smedjebacken-1 data center and the ubuntu-2204-nvidia-535-docker-v20240214 image, we can start a VM:

cudoctl vm create --id my-ollama --image ubuntu-2204-nvidia-535-docker-v20240214 --machine-type epyc-milan-rtx-a4000 --memory 8 --vcpus 4 --gpus 1 --boot-disk-size 80 --boot-disk-class network --data-center se-smedjebacken-1

Installing Ollama via SSH

Get the IP address of the VM:

cudoctl -json vm get my-ollama | jq '.externalIP'
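
For convenience, you can capture the address in a shell variable; this assumes the field is named externalIP exactly as in the command above:

IP_ADDRESS=$(cudoctl -json vm get my-ollama | jq -r '.externalIP')
echo $IP_ADDRESS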

SSH into the VM:

ssh root@<IP_ADDRESS>
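
Once connected, it is worth confirming that the GPU and driver are visible before installing anything; the image used here ships with NVIDIA driver 535:

nvidia-smi   # should list the RTX A4000 and driver version 535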

Install Ollama:

curl -fsSL https://ollama.com/install.sh | sh
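
To check that the install succeeded, print the version:

ollama --version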

Download and run the Google Gemma LLM; you can then enter your prompt:

ollama run gemma:7b
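
Ollama also exposes a REST API on port 11434 (the same API used in the Docker section below), so you can test a request locally from the VM. This sketch assumes the gemma:7b model pulled above:

curl http://localhost:11434/api/generate -d '{
  "model": "gemma:7b",
  "prompt": "Write one sentence about GPUs.",
  "stream": false
}'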

From the Ollama docs:

Model                Parameters  Size    Download
Llama 2              7B          3.8GB   ollama run llama2
Mistral              7B          4.1GB   ollama run mistral
Dolphin Phi          2.7B        1.6GB   ollama run dolphin-phi
Phi-2                2.7B        1.7GB   ollama run phi
Neural Chat          7B          4.1GB   ollama run neural-chat
Starling             7B          4.1GB   ollama run starling-lm
Code Llama           7B          3.8GB   ollama run codellama
Llama 2 Uncensored   7B          3.8GB   ollama run llama2-uncensored
Llama 2 13B          13B         7.3GB   ollama run llama2:13b
Llama 2 70B          70B         39GB    ollama run llama2:70b
Orca Mini            3B          1.9GB   ollama run orca-mini
Vicuna               7B          3.8GB   ollama run vicuna
LLaVA                7B          4.5GB   ollama run llava
Gemma                2B          1.4GB   ollama run gemma:2b
Gemma                7B          4.8GB   ollama run gemma:7b

Using Docker to start an LLM API

If you created a VM in the previous step, delete it by running:

cudoctl vm delete my-ollama

Create a text file named start-ollama.txt containing the command to start the Ollama Docker container:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Create a VM and include the --start-script-file start-ollama.txt option to add a start script:

cudoctl vm create --id my-ollama --image ubuntu-2204-nvidia-535-docker-v20240214 \
--machine-type epyc-milan-rtx-a4000 --memory 8 --vcpus 4 --gpus 1 --boot-disk-size 80 \
--boot-disk-class network --data-center se-smedjebacken-1 --start-script-file start-ollama.txt
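
If you want to confirm the start script ran, you can SSH into the new VM as before and inspect the container with standard Docker commands:

docker ps --filter name=ollama   # the ollama container should be listed as running
docker logs -f ollama            # follow the Ollama server logs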

Once the VM is running, you can curl the API to pull the model you require (here we use gemma:7b):

curl http://<IP_ADDRESS>:11434/api/pull -d '{"name": "gemma:7b"}'
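
To confirm the download has completed, you can list the models the server now holds locally via Ollama's /api/tags endpoint; here jq extracts just the model names from the returned models array:

curl http://<IP_ADDRESS>:11434/api/tags | jq '.models[].name'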

Now it is ready to respond to a prompt:

curl http://<IP_ADDRESS>:11434/api/generate -d '{
  "model": "gemma:7b",
  "prompt": "Why when you leave water overnight in a glass does it create bubbles in the water?",
  "stream": false
}' | jq '.response'

Want to learn more?

You can learn more about this by contacting us, or you can just get started right away!