Claude Skills for DevOps — How to Build Your First SKILL.md

[Figure: Claude Code Skills diagram showing SKILL.md files automating Terraform, Kubernetes, and CI/CD DevOps workflows]

You’ve probably hit this wall with Claude Code: you describe the same context over and over. Every new session, you explain your Terraform conventions, your deployment process, your K8s naming standards. Claude helps, but it keeps defaulting to generic patterns that don’t match your team’s setup.

Claude Skills fix this. Instead of burning context on setup every session, you write your conventions once in a SKILL.md file — and Claude automatically loads and applies them whenever the task calls for it.

This guide covers what Skills are, how they work, and how to build three real DevOps skills you can use today: a Terraform IaC skill, a Kubernetes manifest validation skill, and a CI/CD deployment skill.


What Is a Claude Skill?

A Claude Skill is a folder built around a SKILL.md file (supporting scripts or reference files are optional). That file has two parts:

  • A short YAML frontmatter block at the top — name, description, and optional config
  • A markdown body with the instructions Claude follows when the skill is triggered

.claude/
  skills/
    terraform-iac/
      SKILL.md
    k8s-validate/
      SKILL.md
    deploy-pipeline/
      SKILL.md

That’s the entire structure. No plugin manager. No configuration UI. Just a file in a folder.

When you start a Claude Code session, Claude reads only the name and description from each skill’s frontmatter, which is lightweight: roughly 100 tokens per skill. The full instructions load only if Claude decides the skill is relevant to what you’re asking. This is the key design decision: skills use progressive disclosure rather than dumping all instructions into every session upfront.

For DevOps engineers running long infrastructure sessions — managing Terraform state, debugging Kubernetes workloads, writing CI pipelines — this matters. You keep your context budget for the actual work instead of burning it on repeated instructions.
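
To make the progressive-disclosure idea concrete, here is a minimal Python sketch that reads only the frontmatter of each SKILL.md and stops before the body. It illustrates the concept and does not show how Claude Code implements it. The function names (read_frontmatter, skill_index) are hypothetical, and the parser handles only the simple key/value and folded-scalar layout used in this guide, not full YAML.

```python
from pathlib import Path


def read_frontmatter(text: str) -> dict[str, str]:
    """Return top-level frontmatter keys from SKILL.md text (simplified parser)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; the markdown body is never read
        if line[:1] in (" ", "\t") and key is not None:
            # Indented continuation line, e.g. under a folded `description: >`
            fields[key] = (fields[key] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            value = value.strip()
            fields[key] = "" if value in (">", "|") else value
    return fields


def skill_index(skills_dir: Path) -> list[dict[str, str]]:
    """Collect only the name and description for every skill folder."""
    index = []
    for skill_file in sorted(skills_dir.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_file.read_text(encoding="utf-8"))
        index.append({
            "name": meta.get("name", skill_file.parent.name),
            "description": meta.get("description", ""),
        })
    return index
```

Calling skill_index(Path(".claude/skills")) returns one small record per skill, which is roughly the amount of information that has to sit in context before any skill is triggered.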


Skills vs Slash Commands vs MCP Servers

These three things often get confused. Here’s the distinction that actually matters:

Skills are model-invoked. Claude automatically decides when to apply a skill based on what you’re asking. You write a skill for Terraform reviews, and the moment you ask Claude to write a Terraform module, it loads that skill without you doing anything.

Slash commands are user-invoked. You explicitly type /deploy or /review to trigger them. Skills replaced standalone slash commands in January 2026 — slash commands are now just a way to explicitly invoke a skill.

MCP servers give Claude tools — the ability to call external APIs, query databases, read your AWS account. Skills give Claude knowledge — the conventions, standards, and procedures to follow when using those tools.

The practical analogy: MCP is the kitchen (knives, ingredients, equipment). A skill is the recipe (how to use them). Many real workflows combine both: a Kubernetes skill defines the review process, and a Kubernetes MCP server gives Claude live access to your cluster.


Where to Install Skills

# System-wide — available in every Claude Code session
~/.claude/skills/your-skill/SKILL.md

# Project-specific — only active in this repo
.claude/skills/your-skill/SKILL.md

For team conventions and project-specific standards, use project-level skills and commit them to your repo. Everyone on the team gets the same Claude behavior automatically.

For personal workflows that apply everywhere — your preferred code style, your debugging approach — use system-wide skills.
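
As a rough illustration of the two locations above, the Python sketch below builds the path for a system-wide or project skill and writes a starter SKILL.md. The function names (skill_path, scaffold_skill) and the scope labels "personal" and "project" are hypothetical, chosen for this example; they are not part of Claude Code.

```python
from pathlib import Path


def skill_path(name: str, scope: str, repo_root: Path, home: Path) -> Path:
    """Return where SKILL.md lives for a personal or project skill."""
    if scope == "personal":
        base = home / ".claude" / "skills"
    elif scope == "project":
        base = repo_root / ".claude" / "skills"
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return base / name / "SKILL.md"


def scaffold_skill(name: str, description: str, scope: str,
                   repo_root: Path, home: Path) -> Path:
    """Create the skill folder and a starter SKILL.md if none exists yet."""
    path = skill_path(name, scope, repo_root, home)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        # Folded scalar keeps a long description valid even if it contains colons.
        path.write_text(
            f"---\nname: {name}\ndescription: >\n  {description}\n---\n\n",
            encoding="utf-8",
        )
    return path
```

For example, scaffold_skill("terraform-iac", "Apply when writing Terraform.", "project", Path.cwd(), Path.home()) creates .claude/skills/terraform-iac/SKILL.md in the current repo and leaves an existing file untouched.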


Building Real DevOps Skills

Skill 1: Terraform IaC Conventions

The problem this solves: Claude writes valid Terraform, but it doesn’t know your team’s module structure, naming conventions, variable standards, or which provider versions you’re pinned to. You end up editing every generated file to match your conventions.

Create .claude/skills/terraform-iac/SKILL.md:

---
name: terraform-iac
description: >
  Apply when writing, reviewing, or modifying any Terraform or OpenTofu code.
  Use for new modules, variable files, provider configs, state management,
  or any .tf file work.
---

## Terraform conventions for this project

### Module structure
Every module must have: main.tf, variables.tf, outputs.tf, versions.tf.
No logic in outputs.tf — outputs only reference resources or locals.
Group related resources in the same .tf file, not everything in main.tf.

### Naming conventions
Resources: {project}-{environment}-{resource_type} (e.g., devtoolhub-prod-eks-cluster)
Variables: snake_case always. Boolean variables prefix with enable_ or disable_.
Outputs: descriptive names that explain what the value is used for.

### Provider and version constraints
Always pin provider versions with ~> (e.g., ~> 5.0 not >= 5.0).
Terraform version: require >= 1.6.0 in versions.tf.
Never use latest or no constraint.

### Variables
Every variable needs: description, type, and either a default or an explicit comment stating that it intentionally has no default.
Use validation blocks for variables that have a limited valid range.
Sensitive variables must have sensitive = true.

### State and backends
Remote state only — never local state in production modules.
Use data "terraform_remote_state" for cross-module references, not hardcoded values.

### What NOT to do
Do not hardcode account IDs, region names, or ARNs — use variables or data sources.
Do not use count for resources that have meaningful names — use for_each with a map.
Do not create resources that aren't tagged — always include standard tags.

## Required tags for all resources
```hcl
tags = {
  Project     = var.project_name
  Environment = var.environment
  ManagedBy   = "terraform"
  Owner       = var.team_name
}
```


Now when you ask Claude to write a new VPC module or review an existing Terraform file, it applies these conventions without you mentioning them.
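
Several of these conventions are mechanical enough to check with a small script, for instance in a pre-commit hook alongside the skill. The Python sketch below is one possible way to do that and is not part of the skill itself. The environment list (dev, staging, prod) and all function names are assumptions made for this example.

```python
import re

# Assumption: environments are limited to these three names.
ENVIRONMENTS = ("dev", "staging", "prod")
NAME_RE = re.compile(
    r"[a-z0-9]+-(%s)-[a-z0-9]+(-[a-z0-9]+)*" % "|".join(ENVIRONMENTS)
)
SNAKE_CASE_RE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")
PESSIMISTIC_RE = re.compile(r"~>\s*\d+(\.\d+){0,2}")


def valid_resource_name(name: str) -> bool:
    """Check the {project}-{environment}-{resource_type} pattern."""
    return NAME_RE.fullmatch(name) is not None


def valid_variable_name(name: str, is_bool: bool = False) -> bool:
    """snake_case always; booleans must start with enable_ or disable_."""
    if SNAKE_CASE_RE.fullmatch(name) is None:
        return False
    if is_bool:
        return name.startswith(("enable_", "disable_"))
    return True


def pinned_with_pessimistic_operator(constraint: str) -> bool:
    """True for constraints like '~> 5.0'; False for '>= 5.0' or no constraint."""
    return PESSIMISTIC_RE.fullmatch(constraint.strip()) is not None
```

A check like this complements the skill rather than replacing it: the skill shapes what Claude generates, and the script catches anything that slips through.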

Skill 2: Kubernetes Manifest Validation

The problem this solves: Claude generates Kubernetes YAML that works, but may miss resource limits, security contexts, liveness probes, or your team’s label standards. Things that pass kubectl apply but fail in production.

Create .claude/skills/k8s-validate/SKILL.md:

---
name: k8s-validate
description: >
  Apply when writing, reviewing, or generating any Kubernetes manifests —
  Deployments, StatefulSets, Services, Ingress, RBAC, NetworkPolicy, or Helm values.
  Also use when asked to check if a manifest is production-ready.
---

## Kubernetes manifest standards

### Every Deployment must have
- resources.requests for CPU and memory, plus a memory limit (for CPU limits, follow the resource sizing guidance in this skill)
- livenessProbe and readinessProbe configured
- securityContext with runAsNonRoot: true and readOnlyRootFilesystem: true
- At least 2 replicas for any production workload
- A PodDisruptionBudget defined for anything critical

### Required labels on all resources
```yaml
labels:
  app: <service-name>
  environment: <env>
  team: <team-name>
  version: <image-tag>
```

### Security checks: flag these as errors
- Any container running as root (runAsUser: 0 or no runAsNonRoot)
- privileged: true in securityContext
- hostNetwork: true or hostPID: true
- Missing imagePullPolicy (should always be explicit)
- Using image tag :latest (require a pinned digest or semver tag)

### Resource sizing guidance
- Do not set CPU limits unless absolutely necessary (causes throttling)
- Always set memory limits; OOMKilled is worse than CPU throttling
- Requests should reflect actual steady-state usage, not peaks

### What to check in Ingress
- TLS configured; no plain HTTP ingress in production
- Correct ingressClassName set
- Annotations match the ingress controller in use

### Review output format
When reviewing a manifest, output:
1. ERRORS: things that will cause problems (missing limits, security violations)
2. WARNINGS: things that should be fixed (no PDB, no probes)
3. SUGGESTIONS: best practices that would improve the manifest
4. FIXED MANIFEST: the corrected YAML at the end

After installing this skill, asking Claude "review this Deployment manifest" triggers a structured review with errors, warnings, and a corrected output — every time, without prompting.
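
To show what the ERRORS and WARNINGS buckets could look like in practice, here is a small Python sketch that applies a subset of the rules to a Deployment that has already been parsed into a dict. It is an illustration under stated assumptions, not what Claude runs internally. The function names are hypothetical, and treating a low replica count as a warning rather than an error is a choice made for this example.

```python
def uses_latest_tag(image: str) -> bool:
    """True when the image has no tag or is tagged :latest."""
    if "@" in image:  # pinned by digest
        return False
    last_segment = image.rsplit("/", 1)[-1]
    _, separator, tag = last_segment.partition(":")
    return not separator or tag == "latest"


def review_deployment(manifest: dict) -> dict[str, list[str]]:
    """Sort findings for one parsed Deployment into ERRORS and WARNINGS."""
    errors: list[str] = []
    warnings: list[str] = []
    spec = manifest.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", {})
    pod_security = pod_spec.get("securityContext", {})

    if spec.get("replicas", 1) < 2:
        warnings.append("fewer than 2 replicas")
    if pod_spec.get("hostNetwork") or pod_spec.get("hostPID"):
        errors.append("hostNetwork or hostPID is enabled")

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext", {})
        resources = container.get("resources", {})
        requests = resources.get("requests", {})

        # Container-level settings override pod-level ones.
        non_root = security.get("runAsNonRoot", pod_security.get("runAsNonRoot", False))
        run_as_user = security.get("runAsUser", pod_security.get("runAsUser"))

        if "cpu" not in requests or "memory" not in requests:
            errors.append(f"{name}: missing CPU or memory requests")
        if "memory" not in resources.get("limits", {}):
            errors.append(f"{name}: missing memory limit")
        if run_as_user == 0 or not non_root:
            errors.append(f"{name}: may run as root")
        if security.get("privileged"):
            errors.append(f"{name}: privileged container")
        if "imagePullPolicy" not in container:
            errors.append(f"{name}: imagePullPolicy not set")
        if uses_latest_tag(container.get("image", "")):
            errors.append(f"{name}: image is untagged or uses :latest")
        if "livenessProbe" not in container or "readinessProbe" not in container:
            warnings.append(f"{name}: missing liveness or readiness probe")

    return {"ERRORS": errors, "WARNINGS": warnings}
```

A deterministic check like this can run in CI as a backstop, while the skill handles the judgment calls (suggestions and the fixed manifest) that a script cannot.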

Skill 3: CI/CD Deployment Pipeline

The problem this solves: Generic CI/CD skills produce pipelines that look reasonable but don’t match your actual stack: wrong caching strategy, missing environment gates, no rollback steps.

Create .claude/skills/deploy-pipeline/SKILL.md:

---
name: deploy-pipeline
description: >
  Apply when writing or modifying CI/CD pipeline configurations — GitHub Actions,
  GitLab CI, or deployment workflows. Use for new pipelines, adding stages,
  debugging pipeline failures, or optimizing build times.
disable-model-invocation: true
---

## CI/CD pipeline standards

### Stack
- CI platform: GitHub Actions
- Container registry: GitHub Container Registry (ghcr.io)
- Deployment target: Kubernetes (EKS)
- IaC: Terraform
- GitOps: ArgoCD watching the manifests repo

### Pipeline stages (in order)
1. lint — run linters and static analysis
2. test — unit and integration tests (parallel where possible)
3. build — Docker image build and push to ghcr.io
4. scan — Trivy image scan, fail on CRITICAL severity
5. deploy-staging — update image tag in manifests repo (ArgoCD picks it up)
6. smoke-test — run health check against staging endpoint
7. deploy-prod — requires manual approval gate, same process as staging

### GitHub Actions conventions
- Use reusable workflows (workflow_call) for repeated stages
- Cache: actions/cache for npm/pip/go dependencies
- Always pin action versions to a full commit SHA, not a tag (e.g., actions/checkout@<commit-sha>)
- Secrets: use GitHub Secrets for credentials, never hardcode
- Concurrency: cancel in-progress runs on the same branch for non-main branches

### Docker build conventions
- Multi-stage builds only — never ship dev dependencies
- Base image: use distroless or alpine variants
- Build args for version metadata (BUILD_DATE, GIT_SHA, VERSION)
- Always push both :sha and :latest tags on main branch merges

### Rollback procedure
ArgoCD rollback: `argocd app rollback <app-name> <revision>`
Document the revision number before every deploy in the pipeline output.

### What NOT to do
- No curl | bash installers in pipelines
- No hardcoded AWS account IDs or region names
- No deploying directly from CI — always update the manifests repo and let ArgoCD sync
- Do not skip the image scan stage, even for hotfixes

## disable-model-invocation: true
This skill has real deployment side effects. Only invoke explicitly with /deploy-pipeline.

Note the disable-model-invocation: true flag. For skills that have real infrastructure side effects — actually deploying code, updating manifests, interacting with production — you never want them to trigger automatically. This flag means the skill only runs when you explicitly invoke it with /deploy-pipeline.
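
One convention from this skill, pinning actions to a SHA rather than a tag, is easy to verify outside Claude as well. The Python sketch below scans workflow text line by line and reports any uses: reference that is not pinned to a 40-character commit SHA. It is a deliberately simple line scanner, not a YAML parser, and the function name is hypothetical.

```python
import re

USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return every `uses:` reference that is not pinned to a full commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and Docker images follow other rules
        _, _, version = ref.partition("@")
        if FULL_SHA_RE.fullmatch(version) is None:
            unpinned.append(ref)
    return unpinned
```

Run against a workflow file, it returns entries such as actions/checkout@v4 while accepting references whose version is a full SHA, with or without a trailing version comment.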


The SKILL.md Format Reference

Here’s every frontmatter field you’ll actually use:

---
# Required. Used for explicit invocation.
name: skill-name
# Required. One paragraph. This is what Claude reads to decide whether to
# auto-load the skill. Vague descriptions = poor triggering accuracy.
description: >
  Describe exactly when this skill
  should be applied. Be specific.
# Optional. Prevents auto-loading. Explicit only.
disable-model-invocation: true
# Optional. Restrict which tools this skill can use.
allowed-tools:
  - Bash
  - Read
# Optional. Runs skill in isolated subagent context.
# Keeps skill work out of your main conversation.
context: fork
---

The description field is the most important. Claude uses it to decide whether to load the full skill. If it’s vague (“help with infrastructure”), Claude loads it for everything or nothing. Be specific: “Apply when writing, reviewing, or modifying any Terraform or OpenTofu .tf files.”
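
A quick lint pass can catch the most obvious frontmatter problems before you test a skill in a live session. The Python sketch below takes already-parsed frontmatter as a dict and returns a list of problems. The eight-word threshold is a rough, hypothetical heuristic for "too vague", and the name pattern is an assumption modeled on the skill names used in this guide.

```python
import re

# Hypothetical threshold: descriptions shorter than this are flagged as vague.
MIN_DESCRIPTION_WORDS = 8
# Assumption: names are lowercase words separated by hyphens, as in this guide.
NAME_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def frontmatter_problems(fields: dict) -> list[str]:
    """Return human-readable problems found in parsed SKILL.md frontmatter."""
    problems = []
    name = fields.get("name", "")
    description = fields.get("description", "")

    if not name:
        problems.append("missing name")
    elif NAME_RE.fullmatch(name) is None:
        problems.append("name should be lowercase words separated by hyphens")

    if not description:
        problems.append("missing description")
    elif len(description.split()) < MIN_DESCRIPTION_WORDS:
        problems.append("description is probably too vague to trigger reliably")

    return problems
```

Word count is only a proxy. A long description can still be vague, so treat a clean result as a starting point and confirm triggering behavior in a fresh session.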


The Security Warning You Shouldn’t Skip

Before you install community skills from GitHub, understand the risk.

Skills run with the same permissions as your Claude Code agent. If Claude Code can run bash commands and access your AWS credentials, a skill can too — and it will look like your own agent doing it.

In February 2026, Snyk researchers scanned 3,984 public skills from community registries and found 13.4% had critical-level vulnerabilities, with 76 confirmed malicious payloads. Attack techniques included base64-encoded commands that steal AWS credentials and jailbreak attempts that try to disable safety mechanisms.

The rules to follow:

  • Only install skills from sources you trust — official Anthropic skills, well-known community contributors, or your own team
  • Read the SKILL.md before installing — it’s just a markdown file, it takes 60 seconds
  • Audit regularly — skills you installed 6 months ago may have been updated
  • Use allowed-tools to restrict scope — a Terraform skill has no business running arbitrary bash commands
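
Reading a skill by hand is the real safeguard, but a small scanner can point you at the lines worth reading first. The Python sketch below flags two of the patterns mentioned above: long base64-encoded strings and downloads piped straight into a shell. It is a heuristic with hypothetical names and thresholds. It will miss obfuscated payloads and can produce false positives, so it does not replace a manual review.

```python
import base64
import re

BASE64_RE = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")
HEX_ONLY_RE = re.compile(r"[0-9a-f]+")
PIPE_TO_SHELL_RE = re.compile(
    r"\b(?:curl|wget)\b[^\n|]*\|\s*(?:sudo\s+)?(?:ba|z)?sh\b"
)


def suspicious_lines(skill_text: str) -> list[tuple[int, str]]:
    """Return (line number, reason) pairs for lines that deserve a closer look."""
    findings = []
    for number, line in enumerate(skill_text.splitlines(), start=1):
        if PIPE_TO_SHELL_RE.search(line):
            findings.append((number, "pipes a download into a shell"))
        for blob in BASE64_RE.findall(line):
            if HEX_ONLY_RE.fullmatch(blob):
                continue  # likely a commit SHA or hex digest, not a payload
            try:
                base64.b64decode(blob, validate=True)
            except ValueError:
                continue  # not valid base64 after all
            findings.append((number, "long base64-encoded string"))
            break  # one finding per line is enough
    return findings
```

Running it over every SKILL.md and bundled script before installing, and again after each update, gives a cheap first pass for the regular audits recommended above.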

The official Anthropic-curated skills index is at github.com/anthropics/claude-code-skills — start there before going to community sources.


Best Practices

Write focused skills, not mega-skills. The most common mistake is a single SKILL.md trying to handle commits, PRs, branch naming, changelog updates, and code review all at once. Narrow skills trigger accurately. Broad skills trigger randomly or not at all.

Use context: fork for heavy tasks. If a skill does significant work — generating an entire Terraform module, running a full manifest audit — forking to a subagent keeps the output clean and doesn’t pollute your main conversation history.

Commit project skills to the repo. Your .claude/skills/ directory should be in source control. When a teammate pulls the repo, they get the same Claude behavior. Skills become part of your team’s infrastructure standards, not a personal config.

CLAUDE.md vs skills — know the difference. CLAUDE.md loads into every session in the project — use it for rules that always apply (always use British English, always add type hints). Skills load conditionally — use them for task-specific workflows (only apply Terraform conventions when writing Terraform). If everything lives in CLAUDE.md, you burn context on every session regardless of what you’re doing.

Test your skill descriptions. After writing a new skill, start a fresh session and ask something that should trigger it. If Claude doesn’t apply the skill, the description needs to be more specific. Iterate on the description field — it’s the one line that determines whether the skill works reliably.


What’s Next

Skills give Claude your conventions. The next step up is giving Claude the ability to act on your infrastructure directly — and that’s where Managed Agents come in.

In the next post, we’ll cover Claude Managed Agents: how to deploy a cloud-hosted agent that can run long infrastructure tasks, orchestrate multiple subagents in parallel, and improve itself over time via the new Dreaming feature. Real examples for DevOps automation — Kubernetes cost analysis, infrastructure drift detection, and automated incident triage.

If you’re using Claude Code and want to go deeper with MCP integrations that pair with these skills — connecting Claude directly to your AWS account, Kubernetes cluster, or GitHub — check out our guide to MCP for DevOps engineers.
