Deploy LLMs on Kubernetes: Complete Guide with Examples

Why Use Kubernetes for Your LLM Projects?

Kubernetes makes it easier to run large language models in production. It handles changing workloads, manages GPU resources, and restarts your apps if something fails, which makes it a strong choice for scalable AI systems. In this guide, you’ll learn how to deploy an LLM on Kubernetes using clear, working examples.


Part 1: Getting Started with Your First LLM

Step 1: Create Your Model Container

First, create a simple Dockerfile:

FROM pytorch/pytorch:latest
WORKDIR /app
RUN pip install transformers fastapi uvicorn
COPY app.py .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Now add app.py:

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
import torch

app = FastAPI()

model_name = "microsoft/phi-2"
# device=0 selects the first GPU; -1 falls back to CPU
pipe = pipeline("text-generation", model=model_name,
                device=0 if torch.cuda.is_available() else -1)

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate_text(req: GenerateRequest):
    # Accept the prompt as a JSON body (matches the curl test below);
    # max_new_tokens bounds the continuation, not the whole sequence
    result = pipe(req.prompt, max_new_tokens=100)
    return {"response": result[0]["generated_text"]}

@app.get("/health")
def health_check():
    return {"status": "healthy"}

Step 2: Create Kubernetes Manifests

Add llm-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-app
  template:
    metadata:
      labels:
        app: llm-app
    spec:
      containers:
      - name: llm-container
        image: your-registry/llm-app:v1
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "8Gi"
            cpu: "2"
            nvidia.com/gpu: 1
          limits:
            memory: "12Gi"
            nvidia.com/gpu: 1
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-app
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Step 3: Deploy Everything

docker build -t your-registry/llm-app:v1 .
docker push your-registry/llm-app:v1
kubectl apply -f llm-deployment.yaml
kubectl get pods
kubectl get services

Test the model:

SERVICE_IP=$(kubectl get service llm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

curl -X POST http://$SERVICE_IP/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Explain Kubernetes in simple terms"}'


Part 2: Better Deployment with vLLM

vLLM provides fast inference and serves many concurrent requests through continuous batching, which makes it a much better fit for production than a plain transformers pipeline.

Production vLLM Setup

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llm
  template:
    metadata:
      labels:
        app: vllm-llm
    spec:
      containers:
      - name: vllm-server
        image: vllm/vllm-openai:latest
        args:
        - "--model=microsoft/phi-2"
        - "--port=8000"
        - "--gpu-memory-utilization=0.85"
        - "--max-model-len=2048"
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 16Gi
          requests:
            nvidia.com/gpu: 1
            memory: 14Gi
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-llm
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
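Once it’s running, vLLM exposes an OpenAI-compatible HTTP API, so any OpenAI-style client can talk to it. The sketch below shows the request shape for /v1/completions and how to pull the generated text out of a response; the service URL and helper names are illustrative, not part of vLLM.

```python
import json

# Assumed in-cluster address of the Service defined above.
VLLM_URL = "http://vllm-service/v1/completions"

def build_completion_request(prompt, max_tokens=100):
    # The OpenAI-compatible server expects a "model" field matching
    # the --model argument vLLM was started with.
    return {
        "model": "microsoft/phi-2",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def extract_text(response_body):
    # Completion responses carry the generated text at choices[0].text.
    return json.loads(response_body)["choices"][0]["text"]
```

POST the payload from build_completion_request to VLLM_URL with any HTTP client, then pass the response body to extract_text.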

Auto-Scaling Your LLM

This HPA scales the vLLM deployment on CPU utilization and requests per second:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-production
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: 50
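Note that the requests_per_second Pods metric is not built into Kubernetes; it assumes a custom-metrics pipeline (typically Prometheus plus the Prometheus Adapter) fed by the application itself. As a rough, stdlib-only sketch of the per-pod quantity each replica would export:

```python
import time
from collections import deque

class RequestRateTracker:
    """Tracks requests per second over a sliding window.

    In a real deployment you would expose this value on a metrics
    endpoint that Prometheus scrapes; this class only illustrates
    the per-pod rate the HPA consumes.
    """

    def __init__(self, window_seconds=60.0):
        self.window = window_seconds
        self.timestamps = deque()

    def record(self, now=None):
        # Call once per handled request.
        self.timestamps.append(time.monotonic() if now is None else now)

    def rate(self, now=None):
        # Drop timestamps older than the window, then average over it.
        now = time.monotonic() if now is None else now
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) / self.window
```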


Part 3: Chat Application Example

You can connect a simple FastAPI chat app to the vLLM backend.

Chat App Config

apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-config
data:
  MODEL_ENDPOINT: "http://vllm-service.default.svc.cluster.local/v1"
  MODEL_NAME: "microsoft/phi-2"
  MAX_TOKENS: "500"

Chat App Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: chat-application
spec:
  replicas: 2
  selector:
    matchLabels:
      app: chat-app
  template:
    metadata:
      labels:
        app: chat-app
    spec:
      containers:
      - name: chat-backend
        image: python:3.9-slim
        # Install dependencies at startup and serve the mounted app.py;
        # for production, bake them into a custom image instead.
        command: ["sh", "-c",
                  "pip install fastapi uvicorn requests && uvicorn app:app --host 0.0.0.0 --port 8080"]
        workingDir: /app
        ports:
        - containerPort: 8080
        envFrom:
        - configMapRef:
            name: chat-config
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1"
        volumeMounts:
        - name: app-volume
          mountPath: /app
      volumes:
      - name: app-volume
        configMap:
          # Holds the code below; create it with:
          # kubectl create configmap chat-app-code --from-file=app.py
          name: chat-app-code

Chat App Code

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
import requests
import os

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

MODEL_ENDPOINT = os.getenv("MODEL_ENDPOINT", "http://localhost:8000/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "microsoft/phi-2")
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "500"))

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
def chat_endpoint(req: ChatRequest):
    try:
        # vLLM's OpenAI-compatible API requires the model name in the body.
        response = requests.post(
            f"{MODEL_ENDPOINT}/completions",
            json={"model": MODEL_NAME, "prompt": req.message,
                  "max_tokens": MAX_TOKENS, "temperature": 0.7},
            timeout=30,
        )
        response.raise_for_status()
        return {"response": response.json()["choices"][0]["text"]}
    except requests.RequestException as e:
        raise HTTPException(status_code=502, detail=f"Model service error: {e}")

@app.get("/health")
def health_check():
    return {"status": "healthy", "model_endpoint": MODEL_ENDPOINT}
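Model pods come and go during scale events and node maintenance, so the chat backend should tolerate transient failures from the model service. One option is a small retry helper wrapped around the requests.post call; this is a sketch, and the function and parameter names are illustrative:

```python
import time

def call_with_retries(fn, attempts=3, base_delay=0.5):
    """Call a zero-argument function, retrying with exponential backoff.

    `fn` might be, e.g., `lambda: requests.post(...)` from the chat
    endpoint above; on the final failed attempt the exception propagates
    to the caller.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Back off 0.5s, 1s, 2s, ... between attempts.
            time.sleep(base_delay * (2 ** attempt))
```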


Part 4: Monitoring GPU Usage

You can track GPU performance by installing NVIDIA’s DCGM exporter:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      name: dcgm-exporter
  template:
    metadata:
      labels:
        name: dcgm-exporter
    spec:
      serviceAccountName: dcgm-exporter
      containers:
      - name: dcgm-exporter
        image: nvidia/dcgm-exporter:latest
        ports:
        - containerPort: 9400
          name: metrics
        securityContext:
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: pod-gpu-resources
          mountPath: /var/lib/kubelet/pod-resources
          readOnly: true
      volumes:
      - name: pod-gpu-resources
        hostPath:
          path: /var/lib/kubelet/pod-resources


Part 5: Best Practices

1. Set Resource Requests and Limits

resources:
  requests:
    memory: "12Gi"
    cpu: "4"
    nvidia.com/gpu: 1
  limits:
    memory: "14Gi"
    nvidia.com/gpu: 1

2. Add Health Checks

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30

3. Cache Models on Disk

volumes:
- name: model-cache
  persistentVolumeClaim:
    claimName: model-pvc

volumeMounts:
- name: model-cache
  mountPath: /root/.cache/huggingface


Part 6: Troubleshooting

GPU Memory Issues

kubectl describe pod <pod-name> | grep -A 5 -B 5 "GPU"

Reduce how many sequences vLLM schedules at once (or lower --gpu-memory-utilization):

--max-num-seqs=4

Slow Model Loading

Large models can take several minutes to load. Increase the probe delay, or use a startupProbe, which exists for exactly this case:

initialDelaySeconds: 300

Slow Responses

kubectl top pods

If pods report no GPUs at all, make sure the NVIDIA device plugin is installed so GPUs are schedulable:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml


Part 7: Quick Start Script

#!/bin/bash
set -e

MODEL_NAME=${1:-"microsoft/phi-2"}
GPU_COUNT=${2:-1}
REPLICAS=${3:-2}

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: quick-llm
spec:
  replicas: $REPLICAS
  selector:
    matchLabels:
      app: quick-llm
  template:
    metadata:
      labels:
        app: quick-llm
    spec:
      containers:
      - name: llm
        image: vllm/vllm-openai:latest
        args: ["--model", "$MODEL_NAME", "--port", "8000"]
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: $GPU_COUNT
            memory: 12Gi
---
apiVersion: v1
kind: Service
metadata:
  name: quick-llm-service
spec:
  selector:
    app: quick-llm
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer
EOF


Final Thoughts

Deploying an LLM on Kubernetes becomes much simpler when you break the process into small steps. First, containerize your model. Next, create Kubernetes manifests. After that, add probes, resource limits, and monitoring. As your traffic grows, auto-scaling will help your system handle more users.

By following these practices, you’ll have a stable and scalable deployment that you can improve over time.
