GitOps as Source of Truth: Rebuilding Clusters from Git

What Source of Truth Actually Means

Kubernetes workloads are disposable: when a node fails, the orchestrator reschedules them elsewhere. But that only works if the desired state is stored somewhere reliable.

What was running on that node? What configuration did it have? With GitOps, the answer is always: whatever's in the git repository.

Source of truth means:

The repository defines what should be running
The cluster continuously reconciles to match
Divergence is automatically corrected
You can rebuild from the repository alone

We've tested this. Not in a theoretical drill, but in actual cluster rebuilds. It works.

The Rebuild Process (Tested)

We've done this multiple times when setting up new environments and testing disaster recovery:

Step 1: Provision infrastructure (~15 minutes)

cd infrastructure/opentofu/environments/production
tofu init
tofu apply -auto-approve

# This creates:
# - Kubernetes cluster (Scaleway Kapsule)
# - Node pools (3 nodes, multi-AZ)
# - Object storage buckets
# - DNS records
# - Load balancers

Step 2: Bootstrap Flux (~5 minutes)

# Get cluster credentials
export KUBECONFIG=./kubeconfig-production.yaml

# Bootstrap Flux
flux bootstrap github \
  --owner=aknostic \
  --repository=clouds-of-europe \
  --path=gitops/clusters/production \
  --personal

# Create SOPS decryption key
kubectl create secret generic sops-age \
  --namespace=flux-system \
  --from-file=age.agekey="$HOME/.config/sops/age/production.key"

Flux installs itself and connects to the Git repository.

Step 3: Wait for reconciliation (~30-40 minutes)

# Watch Flux sync everything
flux get kustomizations --watch

# Monitor pods coming up
watch kubectl get pods --all-namespaces

Flux automatically deploys (in order):

Infrastructure components (cert-manager, external-dns)
Monitoring stack (Prometheus, Grafana, Loki)
Application namespaces and RBAC
Application deployments
Gateway API routes and certificates
Secrets (decrypted from SOPS)
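
The two watch commands above are interactive. For an unattended rebuild, a blocking wait is one option. This is a minimal sketch, not part of the original procedure: the function name wait_for_flux is hypothetical, and the 45m default is an assumption derived from the 30-40 minute estimate.

```shell
#!/usr/bin/env bash
# Block until every Flux Kustomization in flux-system reports Ready,
# or return non-zero when kubectl gives up.
# The 45m default is an assumption based on the ~30-40 minute estimate.
wait_for_flux() {
  local timeout="${1:-45m}"
  kubectl wait kustomizations.kustomize.toolkit.fluxcd.io --all \
    --namespace=flux-system \
    --for=condition=Ready \
    --timeout="$timeout"
}

# Usage: wait_for_flux 45m && echo "cluster reconciled"
```

Note that kubectl wait --all only covers Kustomizations that already exist when it starts. If some Kustomizations are created by others later in the rollout, the wait may need to be repeated.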

Step 4: Verify (~5 minutes)

# Check all pods running
kubectl get pods --all-namespaces

# Check ingress working
curl https://clouds-of-europe.eu

# Check database
kubectl exec -it postgres-cluster-1 -n app -- \
  psql -U postgres -c "SELECT COUNT(*) FROM users;"
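
Reading the full pod list by eye gets tedious on a busy cluster. A sketch of a scripted version of the first check follows, assuming that "healthy" means phase Running or Succeeded. The function name unhealthy_pods is hypothetical, and a Running pod can still fail its readiness probe, so this is a coarse filter rather than a full health check.

```shell
#!/usr/bin/env bash
# Print pods whose phase is neither Running nor Succeeded.
# Returns 0 if there are none, 1 if any are listed, 2 if kubectl itself failed.
unhealthy_pods() {
  local pods
  if ! pods="$(kubectl get pods --all-namespaces --no-headers \
      --field-selector=status.phase!=Running,status.phase!=Succeeded 2>/dev/null)"; then
    echo "kubectl failed; cluster state unknown" >&2
    return 2
  fi
  if [ -n "$pods" ]; then
    printf '%s\n' "$pods"
    return 1
  fi
  return 0
}

# Usage: unhealthy_pods && echo "all pods Running or Succeeded"
```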

Total time: ~60 minutes from "cluster doesn't exist" to "serving production traffic."

No runbooks to follow manually. No configurations to remember. No tribal knowledge required. Just: provision infrastructure, bootstrap Flux, wait.

Why This Matters

Confidence: Knowing you CAN rebuild eliminates a category of anxiety. Infrastructure is cattle, not pets. Lose a cluster? Rebuild it.

Disaster recovery: Real DR requires testing. We've tested this. Multiple times. In different environments. It works.

Documentation: Git IS the documentation. Want to know how Grafana is configured? Read the Helm values in Git. Want to know what version of PostgreSQL? Read the manifest.

Onboarding: New team member: "Clone the repo, read the gitops/ directory." That's the entire system, readable and navigable.

Preventing Drift

Drift is when cluster state diverges from Git state. This happens through:

Manual kubectl apply commands
UI changes (clicking buttons in dashboards)
Scripts that modify resources directly
"Quick fixes" during incidents

Why drift is dangerous:

Git lies about reality: Documentation says X, cluster runs Y. Which is the truth?
Rebuilds fail: A rebuild from Git produces a different result than the current cluster
Changes get lost: Someone fixes something manually, then a cluster update reverts it
Debugging gets harder: Logs show configuration that doesn't match Git

Preventing drift: Flux with prune: true

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
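spec:
  interval: 10m  # Required field; 10m is an assumed value, not from our config
  prune: true  # Delete resources Flux applied earlier that are no longer in Git
  path: ./gitops/apps/production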
  sourceRef:
    kind: GitRepository
    name: flux-system

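With prune: true, Flux deletes any resource it previously applied once that resource is removed from Git. Manual edits to resources Flux manages are reverted to the Git-defined values at the next reconciliation. This sounds aggressive. It enforces discipline.

Result:

Manual kubectl apply on a managed resource? Overwritten at next reconciliation
UI changes to managed resources? Reverted within minutes
Resources removed from Git? Deleted from the cluster
Resources created by hand that Flux never applied? Not pruned, which is why the rule in the next section matters
Everything Flux manages stays consistent with Git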
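
To see drift before Flux corrects it, the Flux CLI can compare a local checkout against the live cluster. The sketch below wraps flux diff kustomization, which builds the manifests at the given path and runs a server-side dry-run. The wrapper name check_drift is hypothetical, and the exit-code mapping (0 for no differences, 1 for differences, greater than 1 for errors) follows the CLI help as I understand it, so verify it against your Flux version.

```shell
#!/usr/bin/env bash
# Compare what Git would apply with what the cluster currently runs.
# Assumed exit codes of `flux diff kustomization`:
#   0 = no differences, 1 = differences found, >1 = the diff itself failed.
check_drift() {
  local name="$1" path="$2" rc=0
  flux diff kustomization "$name" --path "$path" || rc=$?
  case "$rc" in
    0) echo "in sync: $name" ;;
    1) echo "drift detected: $name"; return 1 ;;
    *) echo "diff failed: $name" >&2; return 2 ;;
  esac
}

# Usage, from the repository root:
#   check_drift apps ./gitops/apps/production
```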

The No kubectl apply Rule

We have a simple rule: No kubectl apply against production. Ever.

Changes go through Git:

# Wrong
kubectl apply -f hotfix.yaml  # Don't do this

# Right
git add hotfix.yaml
git commit -m "Fix: increase memory limit for api-server"
git push
# Wait for Flux to reconcile

For urgent changes:

# Trigger immediate reconciliation
flux reconcile kustomization apps --with-source

This makes "change via Git" acceptable even during incidents. Commit fix, trigger reconciliation, change is live in ~30 seconds.
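
A rule like this is easier to keep with a small guardrail in the shell. The following is a sketch only, not something described elsewhere in this post: the function names, the list of blocked verbs, and the assumption that production contexts contain the string "production" are all illustrative.

```shell
#!/usr/bin/env bash
# Return 0 (blocked) when a mutating verb targets a production context.
# The substring match and the verb list are illustrative assumptions.
is_blocked() {
  local context="$1" verb="$2"
  case "$context" in
    *production*) ;;
    *) return 1 ;;
  esac
  case "$verb" in
    apply|edit|patch|replace) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper for an interactive shell profile.
# It only inspects the first argument, so flags placed before the verb bypass it.
kubectl() {
  local context
  context="$(command kubectl config current-context 2>/dev/null)" || context=""
  if is_blocked "$context" "${1:-}"; then
    echo "Blocked: '$1' against $context. Commit to Git and let Flux reconcile." >&2
    return 1
  fi
  command kubectl "$@"
}
```

This is a speed bump against habit, not access control. Restricting write permissions through RBAC is the stronger form of the same rule.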

Our Repository Structure

clouds-of-europe/
├── infrastructure/
│   └── opentofu/
│       └── environments/
│           ├── foundation/      # S3, registry, DNS
│           ├── management/      # Monitoring cluster
│           ├── test/            # Test environment
│           └── production/      # Production
└── gitops/
    └── clusters/
        ├── management/
        │   ├── flux-system/        # Flux components
        │   ├── infrastructure/     # cert-manager, external-dns
        │   └── observability/      # Prometheus, Grafana, Loki
        ├── test/
        │   ├── flux-system/
        │   ├── infrastructure/
        │   └── apps/               # Application deployments
        └── production/
            ├── flux-system/
            ├── infrastructure/
            └── apps/

Every resource defined in Git. Nothing manual. Nothing in someone's head.

Practical Benefits

Reproducibility: Same Git commit = same infrastructure state. Testing, staging, production can be identical.

Auditability: "Who changed the firewall rules?" git log shows exactly who, when, why.

Review process: Infrastructure changes go through pull requests. Second pair of eyes before production.

Rollback: Bad deployment? git revert and wait for reconciliation. Clean, auditable rollback.
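
The rollback path can be sketched as a single helper that chains the revert, the push, and the immediate reconciliation shown earlier. The function name rollback is hypothetical, and the default Kustomization name apps is taken from the example above.

```shell
#!/usr/bin/env bash
# Revert one commit, push, and ask Flux to reconcile right away.
# Stops at the first failing step so a failed revert never triggers a reconcile.
rollback() {
  local commit="${1:-}" kustomization="${2:-apps}"
  if [ -z "$commit" ]; then
    echo "usage: rollback <commit> [kustomization]" >&2
    return 2
  fi
  git revert --no-edit "$commit" &&
    git push &&
    flux reconcile kustomization "$kustomization" --with-source
}

# Usage: rollback <commit-sha>
```

Because the revert is itself a commit, the rollback shows up in git log like any other change.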

#gitops #flux #sourceoftruth #disasterrecovery #kubernetes