GitOps as Source of Truth: Rebuilding Clusters from Git
Our disaster recovery isn't theoretical—we've tested it. Fresh cluster from Git: OpenTofu provisions infrastructure, Flux bootstraps, under an hour to working cluster. Drift is the enemy.
By Jurg van Vliet
Published Nov 15, 2025
What Source of Truth Actually Means
Kubernetes workloads are disposable—when a node fails, the orchestrator reschedules elsewhere. But this only works if the state is stored somewhere reliable.
What was running on that node? What configuration did it have? With GitOps, the answer is always: whatever's in the git repository.
Source of truth means:
- The repository defines what should be running
- The cluster continuously reconciles to match
- Divergence is automatically corrected
- You can rebuild from the repository alone
We've tested this. Not in a theoretical drill—in actual cluster rebuilds. It works.
The Rebuild Process (Tested)
We've done this multiple times when setting up new environments and testing disaster recovery:
Step 1: Provision infrastructure (~15 minutes)
```bash
cd infrastructure/opentofu/environments/production
tofu init
tofu apply -auto-approve

# This creates:
# - Kubernetes cluster (Scaleway Kapsule)
# - Node pools (3 nodes, multi-AZ)
# - Object storage buckets
# - DNS records
# - Load balancers
```
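As an illustration of what that OpenTofu code might contain, here is a minimal sketch using the Scaleway provider. The resource names, Kubernetes version, and node type are placeholders rather than our actual configuration, and the provider may require additional arguments (a private network, for instance) depending on its version:

```hcl
# Hypothetical sketch only -- names, versions, and node types are illustrative.
resource "scaleway_k8s_cluster" "production" {
  name    = "production"
  version = "1.30"
  cni     = "cilium"

  # Whether to clean up auto-created resources
  # (load balancers, volumes) when the cluster is destroyed.
  delete_additional_resources = false
}

resource "scaleway_k8s_pool" "default" {
  cluster_id  = scaleway_k8s_cluster.production.id
  name        = "default"
  node_type   = "DEV1-M"
  size        = 3    # matches the 3-node pool above
  autohealing = true
}
```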
Step 2: Bootstrap Flux (~5 minutes)
```bash
# Get cluster credentials
export KUBECONFIG=./kubeconfig-production.yaml

# Bootstrap Flux
flux bootstrap github \
  --owner=aknostic \
  --repository=clouds-of-europe \
  --path=gitops/clusters/production \
  --personal

# Create SOPS decryption key
kubectl create secret generic sops-age \
  --namespace=flux-system \
  --from-file=age.agekey=$HOME/.config/sops/age/production.key
```
Flux installs itself and connects to the Git repository.
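The sops-age secret only takes effect if the Kustomization that syncs encrypted manifests declares SOPS decryption. A minimal sketch, with an illustrative name and path:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./gitops/apps/production
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      name: sops-age  # the secret created above
```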
Step 3: Wait for reconciliation (~30-40 minutes)
```bash
# Watch Flux sync everything
flux get kustomizations --watch

# Monitor pods coming up
watch kubectl get pods --all-namespaces
```
Flux automatically deploys (in order):
- Infrastructure components (cert-manager, external-dns)
- Monitoring stack (Prometheus, Grafana, Loki)
- Application namespaces and RBAC
- Application deployments
- Gateway API routes and certificates
- Secrets (decrypted from SOPS)
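This ordering isn't guesswork: Flux Kustomizations can declare explicit dependencies with dependsOn, so applications only reconcile once their prerequisites are ready. A sketch with illustrative names:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m
  path: ./gitops/apps/production
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: infrastructure  # don't start apps until cert-manager etc. are in place
```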
Step 4: Verify (~5 minutes)
```bash
# Check all pods running
kubectl get pods --all-namespaces

# Check ingress working
curl https://clouds-of-europe.eu

# Check database
kubectl exec -it postgres-cluster-1 -n app -- \
  psql -U postgres -c "SELECT COUNT(*) FROM users;"
```
Total time: ~60 minutes from "cluster doesn't exist" to "serving production traffic."
No runbooks to follow manually. No configurations to remember. No tribal knowledge required. Just: provision infrastructure, bootstrap Flux, wait.
Why This Matters
Confidence: Knowing you CAN rebuild eliminates a category of anxiety. Infrastructure is cattle, not pets. Lose a cluster? Rebuild it.
Disaster recovery: Real DR requires testing. We've tested this. Multiple times. In different environments. It works.
Documentation: Git IS the documentation. Want to know how Grafana is configured? Read the Helm values in Git. Want to know what version of PostgreSQL? Read the manifest.
Onboarding: New team member: "Clone the repo, read the gitops/ directory." That's the entire system, readable and navigable.
Preventing Drift
Drift is when cluster state diverges from Git state. This happens through:
- Manual `kubectl apply` commands
- UI changes (clicking buttons in dashboards)
- Scripts that modify resources directly
- "Quick fixes" during incidents
Why drift is dangerous:
- Git lies about reality: Documentation says X, cluster runs Y. Which is truth?
- Rebuilds fail: Rebuild from Git produces different result than current cluster
- Changes get lost: Someone fixes something manually, then cluster update reverts it
- Debugging is impossible: Logs show configuration that doesn't match Git
Preventing drift: Flux with prune: true
```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m  # reconcile at least every 10 minutes
  prune: true    # Delete resources not in Git
  path: ./gitops/apps/production
  sourceRef:
    kind: GitRepository
    name: flux-system
```
Resources not defined in Git get deleted at next reconciliation. This sounds aggressive. It enforces discipline.
Result:
- Manual `kubectl apply`? Resource gets deleted at next reconciliation
- UI changes? Reverted within minutes
- Everything stays consistent with Git
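One caveat: pruning should never be able to delete data you can't recreate. Flux supports opting individual resources out of garbage collection with an annotation; the PVC below is a hypothetical example:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  annotations:
    kustomize.toolkit.fluxcd.io/prune: disabled  # never garbage-collected by Flux
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
```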
The No kubectl apply Rule
We have a simple rule: No kubectl apply against production. Ever.
Changes go through Git:
```bash
# Wrong
kubectl apply -f hotfix.yaml  # Don't do this

# Right
git add hotfix.yaml
git commit -m "Fix: increase memory limit for api-server"
git push
# Wait for Flux to reconcile
```
For urgent changes:
```bash
# Trigger immediate reconciliation
flux reconcile kustomization apps --with-source
```
This makes "change via Git" acceptable even during incidents. Commit fix, trigger reconciliation, change is live in ~30 seconds.
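The ~30 seconds holds because `flux reconcile --with-source` forces an immediate Git fetch. Without the manual trigger, latency is bounded by the polling interval on the GitRepository source. A sketch (the interval and branch are illustrative; the URL follows from the bootstrap flags above):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m  # poll Git for new commits every minute
  url: https://github.com/aknostic/clouds-of-europe
  ref:
    branch: main
```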
Our Repository Structure
```text
clouds-of-europe/
├── infrastructure/
│   └── opentofu/
│       └── environments/
│           ├── foundation/   # S3, registry, DNS
│           ├── management/   # Monitoring cluster
│           ├── test/         # Test environment
│           └── production/   # Production
└── gitops/
    └── clusters/
        ├── management/
        │   ├── flux-system/    # Flux components
        │   ├── infrastructure/ # cert-manager, external-dns
        │   └── observability/  # Prometheus, Grafana, Loki
        ├── test/
        │   ├── flux-system/
        │   ├── infrastructure/
        │   └── apps/           # Application deployments
        └── production/
            ├── flux-system/
            ├── infrastructure/
            └── apps/
```
Every resource defined in Git. Nothing manual. Nothing in someone's head.
Practical Benefits
Reproducibility: Same Git commit = same infrastructure state. Testing, staging, production can be identical.
Auditability: "Who changed the firewall rules?" git log shows exactly who, when, why.
Review process: Infrastructure changes go through pull requests. Second pair of eyes before production.
Rollback: Bad deployment? git revert and wait for reconciliation. Clean, auditable rollback.
#gitops #flux #sourceoftruth #disasterrecovery #kubernetes