Dedicated LLM inference

Quick deploy popular LLMs on a dedicated GPU.

Setup

The LLMs are served using a vLLM OpenAI-compatible server, so they expose an OpenAI-compatible API. An example is shown below.

After launching a model, you may have to wait up to 30 minutes for the API to become live.

To call the model from Python you will need:

  • Your CUDO_TOKEN
  • The Model ID

To find your Model ID, click on your virtual machine in the console and look for the Metadata panel.


Copy the CUDO_TOKEN value and add it to the example below.
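
If you prefer not to hardcode the token in your script, one option is to export it as an environment variable and read it with os.environ. A minimal sketch, assuming you have exported CUDO_TOKEN in your shell:

    import os
    from openai import OpenAI

    # Read the token from the environment instead of hardcoding it
    # (assumes you have run: export CUDO_TOKEN=<your token>)
    client = OpenAI(
        base_url="http://VM-IP-ADDRESS:8000/v1",
        api_key=os.environ["CUDO_TOKEN"],
    )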

Quick Reference

| Model | App Option | Model ID |
| --- | --- | --- |
| DeepSeek-R1 | 14b/24GB GPU | RedHatAI/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16 |
| DeepSeek-R1 | 32b/48GB GPU | RedHatAI/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16 |
| DeepSeek-R1 | 70b/80GB GPU | RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16 |
| Llama 3.3 70b | w4a16/48GB GPU | RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 |
| Llama 3.3 70b | w4a16/80GB GPU | RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16 |
| Llama 3.3 70b | FP8/94GB GPU | RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic |
| Llama 3.1 405b | 3xA100 80GB | RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 |
| Llama 3.1 405b | 4xH100 NVL 94GB | RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 |
| Llama 3.1 405b | 4xH100 SXM 80GB | RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16 |
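
If you are unsure which model ID your VM is serving, you can also check programmatically: the vLLM server exposes the OpenAI-compatible /v1/models endpoint. A minimal sketch, with VM-IP-ADDRESS and CUDO_TOKEN as placeholders for your own values:

    from openai import OpenAI

    client = OpenAI(
        base_url="http://VM-IP-ADDRESS:8000/v1",  # your VM IP address
        api_key="CUDO_TOKEN",                     # your CUDO_TOKEN
    )

    # The /v1/models endpoint lists what the vLLM server is serving;
    # the returned id is the value to pass as MODEL_ID in completion requests.
    for model in client.models.list():
        print(model.id)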

Python example

Make sure you know your VM IP address, Hugging Face model ID and CUDO_TOKEN.

    
    from openai import OpenAI

    # Point the client at your VM and authenticate with your CUDO_TOKEN
    client = OpenAI(
        base_url="http://VM-IP-ADDRESS:8000/v1",
        api_key="CUDO_TOKEN",
    )

    # Use the model ID from the Quick Reference table above
    completion = client.chat.completions.create(
        model="MODEL_ID",
        messages=[
            {"role": "user", "content": "How big is the universe?"},
        ],
    )

    print(completion.choices[0].message)

Read more about the vLLM OpenAI API spec here.
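
The server also supports streaming responses through the standard stream=True parameter of the chat completions API. A minimal sketch, reusing the client from the example above:

    # Stream tokens as they are generated instead of waiting for the full reply
    stream = client.chat.completions.create(
        model="MODEL_ID",
        messages=[{"role": "user", "content": "How big is the universe?"}],
        stream=True,
    )

    for chunk in stream:
        # Each chunk carries a delta with the newly generated text (may be None)
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()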

DeepSeek-R1 on vLLM

These options use quantized, distilled versions of deepseek-ai/DeepSeek-R1.

14b/24GB GPU

This option runs on a single 24GB GPU using a 14B model: RedHatAI/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16

32b/48GB GPU

This option runs on a single 48GB GPU using a 32B model: RedHatAI/DeepSeek-R1-Distill-Qwen-32B-quantized.w4a16

70b/80GB GPU

This option runs on a single 80GB GPU using a 70B model: RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w4a16
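
The R1 distills are reasoning models and typically return their chain of thought wrapped in <think>...</think> tags before the final answer; the exact output format can vary, so treat that as an assumption. A minimal sketch that separates the reasoning from the answer, with the same placeholders as the Python example above:

    from openai import OpenAI

    client = OpenAI(base_url="http://VM-IP-ADDRESS:8000/v1", api_key="CUDO_TOKEN")

    completion = client.chat.completions.create(
        model="RedHatAI/DeepSeek-R1-Distill-Qwen-14B-quantized.w4a16",
        messages=[{"role": "user", "content": "What is 17 * 24?"}],
    )

    # Assumes the distilled R1 models wrap their chain of thought in <think>...</think>
    content = completion.choices[0].message.content
    if "</think>" in content:
        reasoning, answer = content.split("</think>", 1)
        print("Reasoning:", reasoning.replace("<think>", "").strip())
        print("Answer:", answer.strip())
    else:
        print(content)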

Llama 3.3 70b on vLLM

These options use a quantized version of meta-llama/Llama-3.3-70B-Instruct.

w4a16/48GB GPU

This option runs on a single 48GB GPU using a 70B model quantized at w4a16: RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16

w4a16/80GB GPU

This option runs the same model as the previous option but on an H100 SXM (80GB). It provides much lower latency and supports more concurrent requests: RedHatAI/Llama-3.3-70B-Instruct-quantized.w4a16

FP8/94GB GPU

This option runs the same base model on an H100 NVL (PCIe, 94GB). It uses FP8 quantization, which preserves more of the model's original detail and accuracy, leading to better outputs: RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic
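
If you want to compare these options for your own workload, a simple starting point is to time a request end to end. A minimal sketch, with VM-IP-ADDRESS and CUDO_TOKEN as placeholders and an illustrative prompt:

    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://VM-IP-ADDRESS:8000/v1", api_key="CUDO_TOKEN")

    # Time a single request end to end as a rough latency comparison
    start = time.perf_counter()
    completion = client.chat.completions.create(
        model="RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
        messages=[{"role": "user", "content": "Summarise the theory of relativity in one sentence."}],
        max_tokens=128,
    )
    print(f"Completed in {time.perf_counter() - start:.2f}s")
    print(completion.choices[0].message.content)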

Llama 3.1 405b on vLLM

These options use a quantized version of meta-llama/Llama-3.1-405B-Instruct.

3xA100 80GB

This option runs on three 80GB A100s and is the lowest-cost way to run the 405B model. Uses model ID: RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16

4xH100 NVL 94GB

If you need to serve more concurrent requests, this is the best low-cost option, using H100 NVL PCIe GPUs. Uses model ID: RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16

4xH100 SXM 80GB

If you need the lowest latency and the highest number of concurrent requests, choose this option. Uses model ID: RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16
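
The multi-GPU options are aimed at serving many requests at once. A minimal sketch of sending concurrent requests with the async OpenAI client (placeholders and prompts are illustrative):

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(
        base_url="http://VM-IP-ADDRESS:8000/v1",
        api_key="CUDO_TOKEN",
    )

    async def ask(prompt: str) -> str:
        completion = await client.chat.completions.create(
            model="RedHatAI/Meta-Llama-3.1-405B-Instruct-quantized.w4a16",
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content

    async def main() -> None:
        prompts = ["How big is the universe?", "What is quantum entanglement?"]
        # Send the requests concurrently; the server processes them in parallel
        answers = await asyncio.gather(*(ask(p) for p in prompts))
        for prompt, answer in zip(prompts, answers):
            print(prompt, "->", answer)

    asyncio.run(main())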