
Why Use Kubernetes for Your LLM Projects?
Kubernetes makes it easier to run large language models in production. It handles changing workloads, manages GPU resources, and restarts your apps if something fails. That makes it a strong choice for scalable AI systems. In this guide, you’ll learn how to deploy an LLM on Kubernetes using clear, working examples.
Part 1: Getting Started with Your First LLM
Step 1: Create Your Model Container
First, create a simple Dockerfile:
FROM pytorch/pytorch:latest
WORKDIR /app
RUN pip install transformers fastapi uvicorn
COPY app.py .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
Now add app.py:
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
import torch

app = FastAPI()

model_name = "microsoft/phi-2"
pipe = pipeline(
    "text-generation",
    model=model_name,
    device=0 if torch.cuda.is_available() else -1,
)

# Accept the prompt as a JSON body ({"prompt": "..."}) so the
# curl example later in this guide works as written.
class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate_text(req: GenerateRequest):
    result = pipe(req.prompt, max_new_tokens=100)
    return {"response": result[0]["generated_text"]}

@app.get("/health")
def health_check():
    return {"status": "healthy"}
Step 2: Create Kubernetes Manifests
Add llm-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-app
  template:
    metadata:
      labels:
        app: llm-app
    spec:
      containers:
        - name: llm-container
          image: your-registry/llm-app:v1
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
              nvidia.com/gpu: 1
            limits:
              memory: "12Gi"
              nvidia.com/gpu: 1
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm-app
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
Step 3: Deploy Everything
docker build -t your-registry/llm-app:v1 .
docker push your-registry/llm-app:v1
kubectl apply -f llm-deployment.yaml
kubectl get pods
kubectl get services
Test the model:
SERVICE_IP=$(kubectl get service llm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -X POST http://$SERVICE_IP/generate \
-H "Content-Type: application/json" \
-d '{"prompt":"Explain Kubernetes in simple terms"}'
Part 2: Better Deployment with vLLM
vLLM provides fast inference through continuous batching and serves many concurrent requests efficiently, which makes it well suited for production.
Production vLLM Setup
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llm
  template:
    metadata:
      labels:
        app: vllm-llm
    spec:
      containers:
        - name: vllm-server
          image: vllm/vllm-openai:latest
          args:
            - "--model=microsoft/phi-2"
            - "--port=8000"
            - "--gpu-memory-utilization=0.85"
            - "--max-model-len=2048"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 16Gi
            requests:
              nvidia.com/gpu: 1
              memory: 14Gi
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm-llm
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
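Clients talk to this deployment through vLLM's OpenAI-compatible REST API, posting JSON to /v1/completions. As a sketch, here is how such a request can be built with only the Python standard library (the in-cluster DNS name assumes the vllm-service Service above in the default namespace):

```python
import json
import urllib.request

VLLM_URL = "http://vllm-service.default.svc.cluster.local/v1/completions"

def build_completion_request(prompt: str, max_tokens: int = 100) -> urllib.request.Request:
    """Build an OpenAI-compatible completion request for the vLLM service."""
    payload = {
        "model": "microsoft/phi-2",  # must match the --model arg in the Deployment
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it (only works from inside the cluster):
# with urllib.request.urlopen(build_completion_request("Explain Kubernetes")) as resp:
#     print(json.loads(resp.read())["choices"][0]["text"])
```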
Auto-Scaling Your LLM
This HPA setup scales on CPU utilization and requests per second. Note that requests_per_second is a custom Pods metric: it only works if a custom metrics pipeline (such as Prometheus with the Prometheus Adapter) is installed, whereas CPU utilization needs only metrics-server.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-production
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "50"
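For intuition when tuning these targets: the HPA controller scales using desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue), clamped to the min/max bounds. A small sketch of that formula with the bounds from the manifest above:

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_r: int = 2, max_r: int = 10) -> int:
    """HPA's core scaling formula, clamped to minReplicas/maxReplicas."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_r, min(max_r, desired))

# 3 pods averaging 90% CPU against a 70% target -> scale up to 4
print(desired_replicas(3, 90, 70))  # 4
```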
Part 3: Chat Application Example
You can connect a simple FastAPI chat app to the vLLM backend.
Chat App Config
apiVersion: v1
kind: ConfigMap
metadata:
  name: chat-config
data:
  MODEL_ENDPOINT: "http://vllm-service.default.svc.cluster.local/v1"
  MAX_TOKENS: "500"
Chat App Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chat-application
spec:
  replicas: 2
  selector:
    matchLabels:
      app: chat-app
  template:
    metadata:
      labels:
        app: chat-app
    spec:
      containers:
        - name: chat-backend
          # For a quick demo, install dependencies at startup and run the
          # code mounted below. In production, bake code and dependencies
          # into a custom image instead.
          image: python:3.9
          command: ["sh", "-c"]
          args: ["pip install fastapi uvicorn requests && uvicorn app:app --host 0.0.0.0 --port 8080"]
          workingDir: /app
          ports:
            - containerPort: 8080
          envFrom:
            - configMapRef:
                name: chat-config
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1"
          volumeMounts:
            - name: app-volume
              mountPath: /app
      volumes:
        - name: app-volume
          configMap:
            # ConfigMap holding the app code; create it with:
            #   kubectl create configmap chat-app-code --from-file=app.py
            name: chat-app-code
Chat App Code
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
import requests
import os

app = FastAPI()

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

MODEL_ENDPOINT = os.getenv("MODEL_ENDPOINT", "http://localhost:8000/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "microsoft/phi-2")
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "500"))

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
async def chat_endpoint(req: ChatRequest):
    try:
        # vLLM's OpenAI-compatible API expects the model name in the payload.
        response = requests.post(
            f"{MODEL_ENDPOINT}/completions",
            json={
                "model": MODEL_NAME,
                "prompt": req.message,
                "max_tokens": MAX_TOKENS,
                "temperature": 0.7,
            },
            timeout=30,
        )
        if response.status_code == 200:
            return {"response": response.json()["choices"][0]["text"]}
        raise HTTPException(status_code=500, detail="Model service error")
    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
def health_check():
    return {"status": "healthy", "model_endpoint": MODEL_ENDPOINT}
Part 4: Monitoring GPU Usage
You can track GPU performance by installing NVIDIA’s DCGM exporter:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      name: dcgm-exporter
  template:
    metadata:
      labels:
        name: dcgm-exporter
    spec:
      serviceAccountName: dcgm-exporter
      containers:
        - name: dcgm-exporter
          image: nvidia/dcgm-exporter:latest
          ports:
            - containerPort: 9400
              name: metrics
          securityContext:
            runAsNonRoot: false
            runAsUser: 0
          volumeMounts:
            - name: pod-gpu-resources
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
      volumes:
        - name: pod-gpu-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
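The exporter serves Prometheus-format text on port 9400 at /metrics (reachable locally via kubectl port-forward). As an illustrative sketch, a sample line such as the GPU-utilization metric DCGM_FI_DEV_GPU_UTIL can be parsed like this (label values are made up; this simple parser ignores edge cases like commas inside quoted labels):

```python
def parse_prom_line(line: str):
    """Parse one Prometheus exposition-format sample line into
    (metric_name, labels_dict, value)."""
    name_part, value = line.rsplit(" ", 1)
    if "{" in name_part:
        name, raw_labels = name_part.split("{", 1)
        raw_labels = raw_labels.rstrip("}")
        labels = dict(item.split("=", 1) for item in raw_labels.split(","))
        labels = {k: v.strip('"') for k, v in labels.items()}
    else:
        name, labels = name_part, {}
    return name, labels, float(value)

# Example line in the shape the exporter emits (values are illustrative):
sample = 'DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc"} 87'
print(parse_prom_line(sample))
```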
Part 5: Best Practices
1. Set Resource Requests and Limits
resources:
  requests:
    memory: "12Gi"
    cpu: "4"
    nvidia.com/gpu: 1
  limits:
    memory: "14Gi"
    nvidia.com/gpu: 1
2. Add Health Checks
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30
3. Cache Models on Disk
volumes:
  - name: model-cache
    persistentVolumeClaim:
      claimName: model-pvc
volumeMounts:
  - name: model-cache
    mountPath: /root/.cache/huggingface
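The model-pvc claim referenced above must exist before the pod can start. A minimal PersistentVolumeClaim sketch (the 50Gi size and ReadWriteOnce access mode are assumptions; size it to fit your model weights):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
```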
Part 6: Troubleshooting
GPU Memory Issues
kubectl describe pod <pod-name> | grep -A 5 -B 5 "GPU"
Reduce vLLM's concurrent batch size (vLLM controls this with --max-num-seqs rather than a --max-batch-size flag):
--max-num-seqs=4
Slow Model Loading
Increase probe delay:
initialDelaySeconds: 300
Slow Responses
kubectl top pods
Make sure the NVIDIA device plugin is installed so pods can actually be scheduled onto GPUs:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml
Part 7: Quick Start Script
#!/bin/bash
set -e

MODEL_NAME=${1:-"microsoft/phi-2"}
GPU_COUNT=${2:-1}
REPLICAS=${3:-2}

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: quick-llm
spec:
  replicas: $REPLICAS
  selector:
    matchLabels:
      app: quick-llm
  template:
    metadata:
      labels:
        app: quick-llm
    spec:
      containers:
        - name: llm
          image: vllm/vllm-openai:latest
          args: ["--model", "$MODEL_NAME", "--port", "8000"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: $GPU_COUNT
              memory: 12Gi
---
apiVersion: v1
kind: Service
metadata:
  name: quick-llm-service
spec:
  selector:
    app: quick-llm
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
EOF
Save this as quick-deploy.sh, make it executable, and pass the model, GPU count, and replica count as optional arguments, for example: ./quick-deploy.sh microsoft/phi-2 1 3
Final Thoughts
Deploying an LLM on Kubernetes becomes much simpler when you break the process into small steps. First, containerize your model. Next, create Kubernetes manifests. After that, add probes, resource limits, and monitoring. As your traffic grows, auto-scaling will help your system handle more users.
By following these practices, you’ll have a stable and scalable deployment that you can improve over time.