Dedicated LLM inference

Quick deploy popular LLMs on a dedicated GPU.

Setup

The LLMs are served using a vLLM OpenAI-compatible server, so they expose an OpenAI-compatible API. An example is shown below.

After launching a model, you may have to wait up to 30 minutes for the API to become live.

To call the model from Python you will need:

  • Your CUDO_TOKEN
  • The Model ID

To find your Model ID, click on your virtual machine in the console and look for the Metadata panel.


Copy the CUDO_TOKEN value and add it to the example below.
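
If you prefer not to hardcode the token in your script, one option is to export it as an environment variable and read it with os.environ. A minimal sketch, assuming you have exported CUDO_TOKEN in your shell:

    import os
    from openai import OpenAI

    # Read the token from the environment instead of hardcoding it
    # (assumes you have run: export CUDO_TOKEN=<your token>)
    client = OpenAI(
        base_url="http://VM-IP-ADDRESS:8000/v1",
        api_key=os.environ["CUDO_TOKEN"],
    )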

Quick Reference

| Model | App Option | Model ID |
| --- | --- | --- |
| DeepSeek-R1 | 14b/24GB GPU | RedHatAI/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 |
| DeepSeek-R1 | 32b/48GB GPU | RedHatAI/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 |
| DeepSeek-R1 | 70b/80GB GPU | RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 |
| Llama 3.3 70b | w4a16/48GB GPU | RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 |
| Llama 3.3 70b | w4a16/80GB GPU | RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 |
| Llama 3.3 70b | FP8/94GB GPU | RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic |
| Llama 3.1 405b | 3xA100 80GB | RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 |
| Llama 3.1 405b | 4xH100 NVL 94GB | RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 |
| Llama 3.1 405b | 4xH100 SXM 80GB | RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 |
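
If you are unsure which model ID your VM is serving, you can also check programmatically: the vLLM server exposes the OpenAI-compatible /v1/models endpoint. A minimal sketch, with VM-IP-ADDRESS and CUDO_TOKEN as placeholders for your own values:

    from openai import OpenAI

    client = OpenAI(
        base_url="http://VM-IP-ADDRESS:8000/v1",  # your VM IP address
        api_key="CUDO_TOKEN",                     # your CUDO_TOKEN
    )

    # The /v1/models endpoint lists what the vLLM server is serving;
    # the returned id is the value to pass as MODEL_ID in completion requests.
    for model in client.models.list():
        print(model.id)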

Python example

Make sure you know your VM IP address, Hugging Face model ID and CUDO_TOKEN.

    
    from openai import OpenAI

    # Point the client at your VM and authenticate with your CUDO_TOKEN
    client = OpenAI(
        base_url="http://VM-IP-ADDRESS:8000/v1",
        api_key="CUDO_TOKEN",
    )

    # Use the model ID from the Quick Reference table above
    completion = client.chat.completions.create(
        model="MODEL_ID",
        messages=[
            {"role": "user", "content": "How big is the universe?"},
        ],
    )

    print(completion.choices[0].message)

Read more about the vLLM OpenAI API spec here.
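
The server also supports streaming responses through the standard stream=True parameter of the chat completions API. A minimal sketch, reusing the client from the example above:

    # Stream tokens as they are generated instead of waiting for the full reply
    stream = client.chat.completions.create(
        model="MODEL_ID",
        messages=[{"role": "user", "content": "How big is the universe?"}],
        stream=True,
    )

    for chunk in stream:
        # Each chunk carries a delta with the newly generated text (may be None)
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()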

DeepSeek-R1 on vLLM

These options use quantized, distilled versions of deepseek-ai/DeepSeek-R1.

14b/24GB GPU

This option runs on a single 24GB GPU using a 14B model: RedHatAI/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16

32b/48GB GPU

This option runs on a single 48GB GPU using a 32B model: RedHatAI/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16

70b/80GB GPU

This option runs on a single 80GB GPU using a 70B model: RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16
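
The R1 distills are reasoning models and typically return their chain of thought wrapped in <think>...</think> tags before the final answer; the exact output format can vary, so treat that as an assumption. A minimal sketch that separates the reasoning from the answer, with the same placeholders as the Python example above:

    from openai import OpenAI

    client = OpenAI(base_url="http://VM-IP-ADDRESS:8000/v1", api_key="CUDO_TOKEN")

    completion = client.chat.completions.create(
        model="RedHatAI/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16",
        messages=[{"role": "user", "content": "What is 17 * 24?"}],
    )

    # Assumes the distilled R1 models wrap their chain of thought in <think>...</think>
    content = completion.choices[0].message.content
    if "</think>" in content:
        reasoning, answer = content.split("</think>", 1)
        print("Reasoning:", reasoning.replace("<think>", "").strip())
        print("Answer:", answer.strip())
    else:
        print(content)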

Llama 3.3 70b on vLLM

These options use a quantized version of meta-llama/Llama-3.3-70B-Instruct.

w4a16/48GB GPU

This option runs on a single 48GB GPU using a 70B model quantized at w4a16: RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16

w4a16/80GB GPU

This option runs the same model as the previous option but on an H100 SXM (80GB). It provides much lower latency and supports more concurrent requests: RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16

FP8/94GB GPU

This option runs the same base model on an H100 NVL (PCIe, 94GB). It uses FP8 quantization, which preserves more of the model's original detail and accuracy, leading to better outputs: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
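
If you want to compare these options for your own workload, a simple starting point is to time a request end to end. A minimal sketch, with VM-IP-ADDRESS and CUDO_TOKEN as placeholders and an illustrative prompt:

    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://VM-IP-ADDRESS:8000/v1", api_key="CUDO_TOKEN")

    # Time a single request end to end as a rough latency comparison
    start = time.perf_counter()
    completion = client.chat.completions.create(
        model="RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
        messages=[{"role": "user", "content": "Summarise the theory of relativity in one sentence."}],
        max_tokens=128,
    )
    print(f"Completed in {time.perf_counter() - start:.2f}s")
    print(completion.choices[0].message.content)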

Llama 3.1 405b on vLLM

These options use a quantized version of meta-llama/Llama-3.1-405B-Instruct.

3xA100 80GB

This option runs on three 80GB A100s and is the lowest-cost way to run the 405B model. Uses model ID: RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16

4xH100 NVL 94GB

If you need to serve more concurrent requests, this is the best low-cost option, using H100 NVL PCIe GPUs. Uses model ID: RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16

4xH100 SXM 80GB

If you need the lowest latency and the highest number of concurrent requests, choose this option. Uses model ID: RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
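
The multi-GPU options are aimed at serving many requests at once. A minimal sketch of sending concurrent requests with the async OpenAI client (placeholders and prompts are illustrative):

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(
        base_url="http://VM-IP-ADDRESS:8000/v1",
        api_key="CUDO_TOKEN",
    )

    async def ask(prompt: str) -> str:
        completion = await client.chat.completions.create(
            model="RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16",
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content

    async def main() -> None:
        prompts = ["How big is the universe?", "What is quantum entanglement?"]
        # Send the requests concurrently; the server processes them in parallel
        answers = await asyncio.gather(*(ask(p) for p in prompts))
        for prompt, answer in zip(prompts, answers):
            print(prompt, "->", answer)

    asyncio.run(main())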