Backstage + Crossplane + ArgoCD: from the workshop counter to the assembled engine
⚠️ Factory warning: I like cars. A lot. So let me apologize in advance — this post is packed with references to the shop, engine builds, and turbos. 🏎️ If you’re more into software mechanics than the real kind, relax: every analogy comes with the technical translation right beside it.
Every tuning shop starts the same way: one good mechanic, a lug wrench, and zero process. The customer shows up, describes what they want (“I want around 300 hp, but it still has to work as a daily driver”), and the mechanic builds it all by hand: picks the turbo, sizes the injectors, dials in the tune. It works beautifully… until the line of customers grows. Then the builder becomes the bottleneck, every car comes out different from the last, and nobody remembers which tune went into which engine.
An infrastructure platform is the same thing. The do-everything builder is the DevOps team answering tickets: create a namespace for me, spin up a bucket, provision a database. Every request is hand-crafted, every delivery is slightly different, and the knowledge lives in two people’s heads.
Elite shops solve this with three things: a kit catalog on the wall (Stage 1, Stage 2, Stage 3 — the customer picks the kit, not the bolt), an assembly line that turns an order into an engine, and an obsessive shop foreman who checks that the car out on the road is exactly the same as the project on the bench.
In platform engineering, those three roles have names:
- Backstage is the counter with the kit catalog.
- Crossplane is the assembly line.
- ArgoCD is the shop foreman.
This post is a lesson in two parts. In Part 1, I build the three pillars from
scratch, with a local lab you can copy and paste — actual baby steps. In Part
2, I pop the hood on whisperops, a real project that uses
exactly this triad to deliver self-service AI agents, and I dissect three custom
resources in increasing order of complexity: XDataset, XAgentBudget, and
XDatasetAgent. By the end, you should walk away able to build your own line.
Grab your coffee ☕ (or the gas-station energy drink) and come along.
Part 1 — The three pillars, baby steps
Backstage: the workshop counter 🛎️
Backstage is an open source developer portal created by Spotify. It does many things (service catalog, documentation, plugins), but for this lesson what matters is the scaffolder — the Software Templates mechanism.
A Software Template is the shop’s order form: a form with a few well-chosen fields. The customer doesn’t fill in “camshaft lobe diameter” — they pick “Stage 2” and the rest is derived. The anatomy of a template has two halves:
# template.yaml — the kit's order form
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
name: garage-stage-kit
title: Stage Kit
description: Order a complete tuning kit for your team.
spec:
owner: platform-team
type: service
# HALF 1 — the form. Each property becomes a field in the UI.
# Pure JSON Schema: validation happens in the browser, before
# anything touches the cluster.
parameters:
- title: Order
required: [team_name, stage]
properties:
team_name:
title: Team name
type: string
# regex in the schema = an invalid order never leaves the counter
pattern: '^[a-z][a-z0-9-]{2,28}$'
stage:
title: Kit
type: string
enum: [stage1, stage2, stage3]
# HALF 2 — the steps. What the counter does when the customer signs.
steps:
# 1. Render the skeleton/ files, substituting ${{values.X}}
- id: fetch
action: fetch:template
input:
url: ./skeleton
values:
team_name: ${{ parameters.team_name }}
stage: ${{ parameters.stage }}
# 2. Create a Git repository and push the result.
# The signed order goes into the shop's project ledger.
# (Gitea = the local Git server we'll spin up in the lab)
- id: publish
action: publish:gitea
input:
repoUrl: cnoe.localtest.me:8443/gitea?repo=garage-${{ parameters.team_name }}
defaultBranch: main
# 3. Register the ArgoCD Application pointing at the new repo.
# A custom action that ships with the CNOE Backstage build: it does, via
# the API, the same kubectl apply we'll do by hand in the lab.
- id: argocd
action: cnoe:create-argocd-app
input:
appName: garage-${{ parameters.team_name }}
appNamespace: argocd
argoInstance: in-cluster
projectName: default
repoUrl: https://cnoe.localtest.me:8443/gitea/giteaAdmin/garage-${{ parameters.team_name }}
path: manifests
Notice two details that will come back in Part 2:
- The template doesn’t create the order’s resources. It writes the order into a Git repository and, at most, registers the Application that tells ArgoCD to watch that repo. Applying the content is another piece’s job (spoiler: the shop foreman).
- The syntax is ${{ values.x }}, with the $ in front. Backstage uses Nunjucks underneath, but with that custom prefix. Forgetting the $ makes the expression pass raw into Git, and the one that blows up is ArgoCD at apply time, with a cryptic invalid map key error. Write that one down.
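To see why a missing $ slips through unnoticed, here is a toy renderer in Python. It is a deliberately simplified stand-in for the scaffolder's Nunjucks engine (one regex, no filters or conditionals), and render is a hypothetical helper, not a Backstage API:

```python
import re

# Matches ${{ values.name }}. A bare {{ values.name }} (no "$") is deliberately NOT matched.
PLACEHOLDER = re.compile(r"\$\{\{\s*values\.(\w+)\s*\}\}")


def render(text: str, values: dict) -> str:
    """Substitute ${{ values.x }} placeholders; anything else passes through untouched."""
    return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), text)


if __name__ == "__main__":
    values = {"team_name": "ae86", "stage": "stage2"}
    print(render("teamName: ${{ values.team_name }}", values))  # placeholder is replaced
    print(render("stage: {{ values.stage }}", values))          # no "$": left exactly as written
```

The second line reaches Git unchanged. A YAML parser then reads {{ values.stage }} as a mapping used as a key, which is consistent with the invalid map key error mentioned above.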
Crossplane: the assembly line 🏭
Crossplane turns Kubernetes into a universal control plane: beyond Pods and Services, the cluster learns to create GCS buckets, service accounts, databases — any resource that has a provider. But the superpower isn’t talking to the cloud; it’s layered abstraction. Three concepts:
| Concept | In the shop | What it is |
|---|---|---|
| XRD (CompositeResourceDefinition) | The order form’s homologation | Defines the API of your composite resource: which fields the order accepts, which are required, the regex for each |
| Composition | The kit’s assembly manual | Says HOW to expand an order into N real resources |
| XR (Composite Resource) | A specific order | “Stage 2 for team ae86”: an instance of the API the XRD defined |
And underneath it all, the Managed Resources (MRs) — the individual parts (a bucket, an IAM binding, a Deployment) that the providers reconcile.
The part that leveled up in recent versions: the modern Composition runs in
Pipeline mode, a sequence of
Composition Functions
— each function is a station on the assembly line. The first station measures
the order, the second machines the parts, the third assembles, the last runs the
dyno and stamps “done.” Functions can be off-the-shelf generics
(function-go-templating, function-auto-ready) or your own, written in
Python,
Go, or KCL.
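As a mental model only, Pipeline mode can be pictured as plain function chaining. The sketch below is Python, not the Crossplane SDK: State, render, auto_ready, and run_pipeline are hypothetical names, and real functions exchange RunFunctionRequest/RunFunctionResponse messages over gRPC rather than Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    observed_xr: dict                              # the order as it exists in the cluster
    observed: dict = field(default_factory=dict)   # composed parts already in the cluster, by logical name
    desired: dict = field(default_factory=dict)    # parts the pipeline wants, by logical name
    context: dict = field(default_factory=dict)    # notes passed between stations
    xr_ready: bool = False


def render(state: State) -> State:
    """Station 1: derive the desired parts from the order."""
    team = state.observed_xr["spec"]["teamName"]
    state.desired["namespace"] = {"kind": "Namespace", "name": f"garage-{team}"}
    state.desired["spec-sheet"] = {"kind": "ConfigMap", "name": "spec-sheet"}
    return state


def auto_ready(state: State) -> State:
    """Last station: the XR is ready only when every desired part is observed as ready."""
    state.xr_ready = bool(state.desired) and all(
        state.observed.get(name, {}).get("ready", False) for name in state.desired
    )
    return state


def run_pipeline(state: State, steps) -> State:
    # Each station receives the output of the previous one, in order.
    for step in steps:
        state = step(state)
    return state


if __name__ == "__main__":
    order = {"spec": {"teamName": "ae86", "stage": "stage2"}}
    first_pass = run_pipeline(State(observed_xr=order), [render, auto_ready])
    print(sorted(first_pass.desired), first_pass.xr_ready)
```

On the first pass nothing has been observed yet, so the XR is not ready; it flips to ready on a later pass, once both parts report ready. That is why the readiness station goes last.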
A minimal example — the XRD first:
# xrd.yaml — the homologation: which fields an XGarage order accepts
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
# convention: <plural>.<group>
name: xgarages.blog.opsbogus.dev
spec:
# Cluster-scoped: the XR lives outside namespaces (it's going to CREATE one).
# Crossplane v2 also supports namespaced XRs — more on that in Part 2.
scope: Cluster
group: blog.opsbogus.dev
names:
kind: XGarage
plural: xgarages
# which Composition to use when the order doesn't specify one
defaultCompositionRef:
name: xgarage-default
versions:
- name: v1alpha1
served: true
referenceable: true
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
teamName:
type: string
pattern: '^[a-z][a-z0-9-]{2,28}$'
stage:
type: string
enum: [stage1, stage2, stage3]
required: [teamName, stage]
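The same pattern and enum appear twice: once in the Backstage form and once here in the XRD, so an order is checked at the counter and again at admission. A minimal Python sketch of that check, assuming nothing beyond the two rules shown (validate_order is a hypothetical helper; the real validation is done by JSON Schema in the browser and by the API server's OpenAPI validation):

```python
import re

# Same constraints as the form and the XRD: team name pattern, stage enum.
TEAM_NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{2,28}$")
STAGES = ("stage1", "stage2", "stage3")


def validate_order(team_name: str, stage: str) -> list[str]:
    """Return a list of problems; an empty list means the order is accepted."""
    problems = []
    if not TEAM_NAME_PATTERN.fullmatch(team_name):
        problems.append(f"teamName {team_name!r} does not match {TEAM_NAME_PATTERN.pattern}")
    if stage not in STAGES:
        problems.append(f"stage {stage!r} is not one of {list(STAGES)}")
    return problems


if __name__ == "__main__":
    print(validate_order("ae86", "stage2"))  # accepted: no problems
    print(validate_order("AE86", "stage9"))  # rejected: bad name and bad stage
```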
And the Composition, in Pipeline mode with two stations:
# composition.yaml — the assembly manual for the XGarage kit
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
name: xgarage-default
spec:
# ties this manual to the API the XRD homologated
compositeTypeRef:
apiVersion: blog.opsbogus.dev/v1alpha1
kind: XGarage
mode: Pipeline
pipeline:
# STATION 1: render the desired resources from the order.
# function-go-templating is the "off-the-shelf generic" function:
# Go templates reading the observed XR.
- step: render
functionRef:
name: function-go-templating
input:
apiVersion: gotemplating.fn.crossplane.io/v1beta1
kind: GoTemplate
source: Inline
inline:
template: |
{{- $team := .observed.composite.resource.spec.teamName }}
{{- $stage := .observed.composite.resource.spec.stage }}
# Part 1: the team's bay (a Namespace).
# Object is the provider-kubernetes MR: an "envelope" that
# applies any K8s manifest as a managed resource.
apiVersion: kubernetes.crossplane.io/v1alpha2
kind: Object
metadata:
# explicit name = predictable in kubectl (no random suffix)
name: garage-{{ $team }}-namespace
annotations:
# logical name of the part inside the Composition
gotemplating.fn.crossplane.io/composition-resource-name: namespace
spec:
forProvider:
manifest:
apiVersion: v1
kind: Namespace
metadata:
name: garage-{{ $team }}
# tells the provider HOW to authenticate — explained in the lab
providerConfigRef:
name: in-cluster
---
# Part 2: the spec sheet taped to the bay wall (ConfigMap).
apiVersion: kubernetes.crossplane.io/v1alpha2
kind: Object
metadata:
name: garage-{{ $team }}-spec-sheet
annotations:
gotemplating.fn.crossplane.io/composition-resource-name: spec-sheet
spec:
forProvider:
manifest:
apiVersion: v1
kind: ConfigMap
metadata:
name: spec-sheet
namespace: garage-{{ $team }}
data:
# quoted so that a value such as "true" or "null" stays a string
stage: {{ $stage | quote }}
team: {{ $team | quote }}
providerConfigRef:
name: in-cluster
# FINAL STATION: the dyno. function-auto-ready marks the XR as Ready
# when all composed parts become Ready. ALWAYS last.
- step: ready
functionRef:
name: function-auto-ready
The order itself — notice how ridiculously small it is compared to what it generates:
# xr.yaml — the order: "Stage 2 for team ae86"
apiVersion: blog.opsbogus.dev/v1alpha1
kind: XGarage
metadata:
name: projeto-ae86
spec:
teamName: ae86
stage: stage2
That asymmetry is the heart of the pattern: the interface is lean, the expansion is fat. The customer signs one line; the assembly line delivers the complete engine. And because the expansion happens inside the cluster, in Crossplane’s reconcile, it holds: if someone deletes the ConfigMap by hand, Crossplane recreates it. It’s an engine that reassembles itself.
ArgoCD: the shop foreman 🧐
ArgoCD implements GitOps: Git is
the single source of truth, and a controller continuously compares what’s
declared in the repository with what’s running in the cluster. Detected a
difference? It corrects it. In the shop: the foreman walks around with the
project tucked under his arm and won’t tolerate “parking-lot hacks” — if the car
on the road diverges from the project in the ledger, he undoes the hack
(selfHeal) or removes the part that isn’t in the project (prune).
The unit of work is the Application: this repository, at this path, applied to
this cluster.
# application.yaml — the shop foreman takes over the garage-ae86 project
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: garage-ae86
namespace: argocd
spec:
project: default
source:
# Gitea in-cluster URL (ArgoCD runs inside the cluster,
# so it uses the service DNS, not the external hostname)
repoURL: http://my-gitea-http.gitea.svc.cluster.local:3000/giteaAdmin/garage-ae86.git
targetRevision: HEAD
# only this folder is applied — files outside it are invisible
path: manifests
destination:
server: https://kubernetes.default.svc
syncPolicy:
automated:
prune: true # part not in the project? remove it
selfHeal: true # hack on the car? undo it
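What prune and selfHeal decide can be pictured as a diff between two maps. A toy Python sketch, not ArgoCD's actual diffing engine: plan_sync and the dict layout are assumptions for illustration, and live stands for the resources this Application tracks, not the whole cluster.

```python
def plan_sync(git: dict, live: dict, prune: bool = True, self_heal: bool = True) -> dict:
    """Compare manifests declared in Git with the tracked live ones (both keyed by resource name)."""
    plan = {"create": [], "heal": [], "prune": []}
    for name, manifest in git.items():
        if name not in live:
            plan["create"].append(name)   # declared in the ledger but missing on the road
        elif live[name] != manifest and self_heal:
            plan["heal"].append(name)     # hand-edited in the cluster: undo the hack
    if prune:
        # still running, but no longer in the project ledger
        plan["prune"] = [name for name in live if name not in git]
    return plan


if __name__ == "__main__":
    git = {"xgarage/projeto-ae86": {"stage": "stage2"}}
    live = {"xgarage/projeto-ae86": {"stage": "stage3"}, "configmap/leftover": {}}
    print(plan_sync(git, live))
```

With both flags off, the same drift produces an empty plan: the foreman still sees the difference, he just does not act on it by himself.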
Two ArgoCD patterns that show up in every serious platform:
- App-of-apps: a “root” Application whose content is… other Applications. You apply it ONCE by hand, and it pulls in the rest of the platform. It’s the official bootstrap pattern.
- Sync waves: argocd.argoproj.io/sync-wave: "3" annotations that order the apply within a sync. You can’t torque the cylinder head before seating the block, and you can’t apply a ProviderConfig before its CRD exists. See the ArgoCD docs on sync waves.
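The ordering rule behind sync waves fits in a few lines of Python. This sketch models only the wave number (ArgoCD also orders by sync phase and by kind within a wave), and the wave values in the example are made up for illustration:

```python
WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave"


def sync_wave(manifest: dict) -> int:
    """Read the sync-wave annotation; resources without it belong to wave 0."""
    annotations = manifest.get("metadata", {}).get("annotations") or {}
    return int(annotations.get(WAVE_ANNOTATION, "0"))


def apply_order(manifests: list[dict]) -> list[str]:
    # sorted() is stable, so manifests in the same wave keep their original order
    return [m["metadata"]["name"] for m in sorted(manifests, key=sync_wave)]


if __name__ == "__main__":
    manifests = [
        {"kind": "ProviderConfig", "metadata": {"name": "in-cluster", "annotations": {WAVE_ANNOTATION: "3"}}},
        {"kind": "Provider", "metadata": {"name": "provider-kubernetes", "annotations": {WAVE_ANNOTATION: "1"}}},
        {"kind": "Namespace", "metadata": {"name": "crossplane-system"}},
    ]
    print(apply_order(manifests))  # lowest wave first, whatever the order in the file
```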
The triangle: who writes, who delivers, who assembles 🔺
Now the magic — how the three connect. The short answer: they never talk to each other directly. They communicate through Git and the API server.
The division of responsibilities:
- Backstage writes. The scaffolder renders the order (an XR) and pushes it to Git. It has no credential to create any bucket — and that’s a feature: the portal’s attack surface is “write YAML into a repo.”
- ArgoCD delivers. It carries the order from the ledger to the cluster and ensures it stays there, identical, forever. It also doesn’t know what a bucket is — to it, the XR is just YAML like any other.
- Crossplane assembles. It takes the XR and expands it into the N real resources, with continuous reconcile. It doesn’t know a portal or a Git repo exists.
Each tool has ONE job. You can swap Backstage for another portal (or for a
git push by hand — we’ll do exactly that shortly) without touching the rest. You
can swap Gitea for GitHub. You can add Kyverno policies in the middle without any
of the three knowing. This combination even has a reference-stack name in the
community: the BACK stack (Backstage, ArgoCD,
Crossplane, Kyverno).
Why not just let Backstage create the resources directly via API? Because then the order isn’t recorded anywhere — no audit trail, no rollback via
git revert, no continuous reconcile. The Git in the middle of the path is what turns “a script that creates things” into “a platform that maintains things.”
Hands-on: the local lab 🔧
Time to get our hands dirty. We’ll build the complete triangle on your machine
using idpbuilder, the tool from the
CNOE community that spins up a local IDP with one command:
a kind cluster with Gitea + ArgoCD + ingress, all talking to each other,
certificates and DNS resolved (the cnoe.localtest.me domain points at
127.0.0.1).
Prerequisites: Docker running, kubectl, helm, git, and curl.
Step 1 — install idpbuilder and create the IDP:
The install is a single binary: download, extract, run. No magic.
# macOS Apple Silicon — for Linux/Intel swap "darwin-arm64" for
# "linux-amd64" (or your OS-arch pair; see the releases page)
curl -fsSL -o idpbuilder.tar.gz \
https://github.com/cnoe-io/idpbuilder/releases/latest/download/idpbuilder-darwin-arm64.tar.gz
tar xzf idpbuilder.tar.gz idpbuilder
./idpbuilder version
# optional: put it on the PATH — from here on I just call `idpbuilder`
# (if you skip this line, use ./idpbuilder in the next commands)
sudo install -m 0755 idpbuilder /usr/local/bin/
# spin up the IDP: kind cluster + Gitea + ArgoCD + nginx ingress (~2 min).
# --use-path-routing serves everything under ONE hostname (cnoe.localtest.me:8443
# /gitea, /argocd…) — the same mode the Backstage package in Step 7 uses,
# so the URLs don't change midway through the lab.
idpbuilder create --use-path-routing
# service credentials
idpbuilder get secrets
# ArgoCD UI: https://cnoe.localtest.me:8443/argocd
# Gitea UI: https://cnoe.localtest.me:8443/gitea
Step 2 — install Crossplane:
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
# --wait holds until the control plane is up
helm install crossplane crossplane-stable/crossplane \
--namespace crossplane-system --create-namespace --wait
Step 3 — install the provider and the functions:
provider-kubernetes is the provider that applies arbitrary K8s manifests as
Managed Resources — perfect for the lab because it needs no cloud credential at
all. Mind the order within the file: the DeploymentRuntimeConfig comes
before the Provider that references it.
# provider-and-functions.yaml
# 1st: pin the provider's ServiceAccount name. Without this, Crossplane
# generates the SA with a hash suffix (provider-kubernetes-abc123), so the
# static ClusterRoleBinding in the next block never matches it, and the
# suffix changes again whenever the Provider is upgraded. Stable name = stable RBAC.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
name: provider-kubernetes-runtime
spec:
serviceAccountTemplate:
metadata:
name: provider-kubernetes
---
# 2nd: the Provider is a PACKAGE: Crossplane pulls the image, installs the
# CRDs (Object, ProviderConfig), and brings up the controller pod. This takes
# ~1 min — keep that latency in mind, it becomes a gotcha in Part 2.
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
name: provider-kubernetes
spec:
package: xpkg.upbound.io/crossplane-contrib/provider-kubernetes:v1.2.1
runtimeConfigRef:
name: provider-kubernetes-runtime # ← the fixed-name SA above
---
# The two off-the-shelf functions the Composition uses.
apiVersion: pkg.crossplane.io/v1
kind: Function
metadata:
name: function-go-templating
spec:
package: xpkg.upbound.io/crossplane-contrib/function-go-templating:v0.11.0
---
apiVersion: pkg.crossplane.io/v1
kind: Function
metadata:
name: function-auto-ready
spec:
package: xpkg.upbound.io/crossplane-contrib/function-auto-ready:v0.6.4
kubectl apply -f provider-and-functions.yaml
# wait for everything to be INSTALLED=True HEALTHY=True
kubectl get providers.pkg.crossplane.io,functions.pkg.crossplane.io
Now the provider needs two things: permission (RBAC) to create resources in the cluster, and a ProviderConfig telling it how to authenticate.
# provider-rbac-and-config.yaml
# The provider pod runs with the ServiceAccount we pinned above;
# we give it cluster-admin because this is a single-tenant lab. In
# production, restrict it to a ClusterRole with exactly the kinds your
# Compositions emit.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: provider-kubernetes-cluster-admin
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-admin
subjects:
- kind: ServiceAccount
# this name is only stable because the DeploymentRuntimeConfig pinned it
name: provider-kubernetes
namespace: crossplane-system
---
# "InjectedIdentity" = use the provider pod's own ServiceAccount
# to talk to the API server. The standard for acting on the SAME cluster.
apiVersion: kubernetes.crossplane.io/v1alpha1
kind: ProviderConfig
metadata:
name: in-cluster
spec:
credentials:
source: InjectedIdentity
kubectl apply -f provider-rbac-and-config.yaml
(whisperops uses exactly this trio, with the RuntimeConfig in a sync wave before the Provider’s — the same ordering lesson comes back in Part 2.)
Step 4 — apply the XRD and the Composition (the xrd.yaml and
composition.yaml from the previous section):
kubectl apply -f xrd.yaml
kubectl apply -f composition.yaml
# the XRD needs to become ESTABLISHED=True — that's the moment
# Kubernetes starts accepting orders of kind XGarage
kubectl get xrd
# NAME ESTABLISHED OFFERED AGE
# xgarages.blog.opsbogus.dev True 15s
kubectl get compositions
# NAME XR-KIND XR-APIVERSION AGE
# xgarage-default XGarage blog.opsbogus.dev/v1alpha1 10s
Step 5 — first test, no Git yet. Apply the order directly and watch the assembly line work:
kubectl apply -f xr.yaml # the XGarage "projeto-ae86"
# the order
kubectl get xgarage
# NAME SYNCED READY COMPOSITION AGE
# projeto-ae86 True True xgarage-default 30s
# the parts it expanded
kubectl get objects.kubernetes.crossplane.io
# NAME KIND PROVIDERCONFIG SYNCED READY AGE
# garage-ae86-namespace Namespace in-cluster True True 40s
# garage-ae86-spec-sheet ConfigMap in-cluster True True 40s
kubectl get namespace garage-ae86
kubectl get configmap -n garage-ae86 spec-sheet -o yaml
Now the test that sells the concept — try to pull a hack:
# delete the ConfigMap by hand (the "parking-lot hack")
kubectl delete configmap -n garage-ae86 spec-sheet
How long until it comes back? Here lies a nuance worth learning early: the
provider does not watch the wrapped part — it detects the drift on the next
poll of the Object MR, and the provider-kubernetes default is 10 minutes
(the XR reconcile itself runs every ~60s, but it only ensures the MR exists
with the right spec; the one that notices the ConfigMap is gone is the provider’s
poll). To avoid waiting, force a reconcile by touching the MR — any update to the
MR enqueues it immediately:
kubectl annotate objects.kubernetes.crossplane.io garage-ae86-spec-sheet \
reconcile.crossplane.io/now="$(date +%s)" --overwrite
kubectl get configmap -n garage-ae86 spec-sheet
# it's back 🪄 — the reconcile reassembled the part.
Step 6 — the full GitOps loop. Now we’ll do BY HAND what Backstage would do — because Backstage performs no magic, it does exactly this:
# local Gitea credential — WARNING: the generated password comes FULL of
# special characters the shell interprets ([, }, ?, *, $…).
# Always paste it in SINGLE QUOTES, or zsh blows up with
# "bad pattern" / "no matches found".
idpbuilder get secrets -p gitea # user: giteaAdmin
GITEA_PASS='<PASTE-THE-PASSWORD-HERE>'
# and to embed it in the remote URL, it needs URL-encoding
# (a raw "?" in the middle of the password breaks URL parsing — git thinks
# the port isn't a number):
GITEA_PASS_ENC=$(printf '%s' "$GITEA_PASS" | \
python3 -c "import sys,urllib.parse; print(urllib.parse.quote(sys.stdin.read(), safe=''))")
# 1. create the repository via API (what publish:gitea would do)
# (in -u the password goes raw — double quotes suffice, curl doesn't parse a URL here)
curl -k -X POST "https://cnoe.localtest.me:8443/gitea/api/v1/user/repos" \
-u "giteaAdmin:${GITEA_PASS}" -H "Content-Type: application/json" \
-d '{"name": "garage-ae86", "default_branch": "main"}'
# 2. assemble the content: the order inside manifests/
mkdir -p garage-ae86/manifests && cd garage-ae86
cp ../xr.yaml manifests/xgarage.yaml
git init -b main && git add . && git commit -m "order: stage2 on ae86"
# 3. push (self-signed cert → sslVerify=false; ENCODED password in the URL)
git -c http.sslVerify=false remote add origin \
"https://giteaAdmin:${GITEA_PASS_ENC}@cnoe.localtest.me:8443/gitea/giteaAdmin/garage-ae86.git"
git -c http.sslVerify=false push -u origin main
# (got the URL wrong on the first try and git complains "remote origin
# already exists"? Fix it with: git remote set-url origin "<correct-URL>")
# 4. register the Application (what cnoe:create-argocd-app would do)
kubectl apply -f application.yaml
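A side note on the URL-encoding step in the script above. Git has its own URL parser, but Python's shows the same class of failure, which makes the problem easy to reproduce. The password below is a made-up example, and port_of is a hypothetical helper:

```python
from urllib.parse import quote, unquote, urlsplit

password = "s3cr?t[x}"  # hypothetical password full of URL-hostile characters


def port_of(url: str):
    """Return the port a URL parser sees, or the parse error as text."""
    try:
        return urlsplit(url).port
    except ValueError as exc:
        return f"error: {exc}"


# Raw password: the "?" ends the host part early, so the "port" becomes garbage.
raw_url = f"https://giteaAdmin:{password}@cnoe.localtest.me:8443/gitea"

# Encoded password: safe="" forces every reserved character to be percent-encoded.
encoded = quote(password, safe="")
safe_url = f"https://giteaAdmin:{encoded}@cnoe.localtest.me:8443/gitea"

if __name__ == "__main__":
    print(port_of(raw_url))    # a parse error about the port
    print(port_of(safe_url))   # the real port
    print(unquote(urlsplit(safe_url).password) == password)  # the password survives the round trip
```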
The XR from Step 5 already exists in the cluster — ArgoCD simply adopts it, because the content in Git is identical to what’s running. From now on, the ledger is in charge.
There: the triangle is closed. Now edit the order in Git — change
stage: stage2 to stage: stage3, commit, push — and watch ArgoCD sync and
Crossplane update the ConfigMap. You never touched the cluster again; just the
project ledger.
# follow the propagation
kubectl get application -n argocd garage-ae86 -w
kubectl get configmap -n garage-ae86 spec-sheet -o jsonpath='{.data.stage}'
Step 7 (optional) — the real Backstage. idpbuilder installs the complete portal (Backstage + Keycloak + Argo Workflows) via the CNOE community’s reference package. One honest caveat: packages only register at cluster creation, so the command recreates the kind cluster from scratch (~6 min), and the state from steps 2–6 evaporates. The good news: since we’ve been on path-routing since Step 1, all the URLs stay the same, and re-applying steps 2–4 is a copy-paste of the same files. Good engines reassemble fast.
idpbuilder create --recreate --use-path-routing \
-p https://github.com/cnoe-io/stacks//ref-implementation
From here on it’s step-by-step in the portal:
7.1 — Login. Open https://cnoe.localtest.me:8443/ and click Sign In.
Backstage delegates authentication to Keycloak: user user1, password in the
USER_PASSWORD field of the idpbuilder get secrets output.
7.2 — Register the template. First, push template.yaml + the skeleton/
folder to a Gitea repository — the same API + push flow from Step 6, in a repo
called garage-template. Just one piece I haven’t shown yet: the skeleton, which
is literally xr.yaml with placeholders in place of the literals:
# skeleton/manifests/xgarage.yaml — what fetch:template renders.
# By default, fetch:template processes ALL skeleton files as
# templates, substituting ${{ values.x }} with the form inputs.
apiVersion: blog.opsbogus.dev/v1alpha1
kind: XGarage
metadata:
name: garage-${{ values.team_name }}
spec:
teamName: ${{ values.team_name }}
stage: ${{ values.stage }}
In Backstage: Create… → Register Existing Component, pointing at the template’s
raw URL:
https://cnoe.localtest.me:8443/gitea/giteaAdmin/garage-template/raw/branch/main/template.yaml
7.3 — Order the kit. Create… again: the Stage Kit card now appears
alongside the CNOE example templates. Fill in team_name and pick the stage —
notice the name regex is validated in real time, in the browser, before anything
touches the cluster. An invalid order never leaves the counter.
7.4 — Watch the execution. When you click Create, the scaffolder runs the steps in the template’s order — fetch:template → publish:gitea → cnoe:create-argocd-app — logging each one in real time.
7.5 — Check the triangle. The new repo is in Gitea
(/gitea/giteaAdmin/garage-<team>), the Application in ArgoCD
(/argocd/applications), and the XR running in the cluster (kubectl get xgarage). The flow is identical to what you did by hand in Steps 5–6 — and that’s
why we did it by hand first.
Debug tip: crossplane beta trace xgarage projeto-ae86 shows the full tree of the XR with each part’s state, the equivalent of popping the hood with the engine running. (The CLI installs with brew install crossplane or via the official script at docs.crossplane.io.)
End of Part 1. You have the triangle working locally. Now let’s see what happens when this pattern meets a real problem.
Part 2 — whisperops: the real shop
whisperops is my test bench for these concepts: a platform on GCP where anyone can create, through Backstage, an AI agent that analyzes a CSV — two LLM agents (planner + worker, orchestrated by kagent, running Gemini via Vertex AI), a Python sandbox to run analyses, a web chat, budget enforcement, and full observability. All on a single VM running kind, with the entire IDP layer (Gitea, Keycloak, ArgoCD, Backstage) brought in by the same idpbuilder from Part 1’s lab.
What matters for this lesson is the core: three custom resources in increasing order of complexity. I’ll present them as three levels of tuning:
- XDataset: the catalog of homologated fuels (not even Crossplane!)
- XAgentBudget: the electronic wastegate with cutoff (3-function pipeline)
- XDatasetAgent: the complete Stage 3 kit (6-function pipeline, 22 parts)
Level 1 — XDataset: the catalog of homologated fuels ⛽
Every agent analyzes a dataset, and the datasets live in a GCS bucket. The design question: how does the Backstage form know which datasets exist? And how does the Composition validate that the requested dataset is real?
whisperops’s answer is a registry: one XDataset resource per CSV in the
bucket. And here comes the first lesson, which sounds like a trick question:
XDataset is NOT a Crossplane XRD. It’s a plain Kubernetes CRD.
# xdataset-xrd.yaml (the filename is misleading — read the comment)
# XDataset doesn't compose anything — it's a pure data record, managed
# by the dataset-watcher controller, which owns the status and the
# ready transitions. An XRD would require a Composition owning the
# reconcile, which would CONFLICT with the controller writing status.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
name: xdatasets.whisperops.io
spec:
group: whisperops.io
scope: Cluster
names:
kind: XDataset
plural: xdatasets
shortNames: [xds, datasets]
versions:
- name: v1alpha1
served: true
storage: true
subresources:
status: {} # separate status: only the controller writes it
schema:
openAPIV3Schema:
type: object
properties:
spec:
type: object
properties:
gcsPath:
type: string
pattern: "^gs://.+\\.csv$" # validation at admission
displayName:
type: string
sizeBytes:
type: integer
minimum: 1
required: [gcsPath, displayName, sizeBytes]
status:
type: object
properties:
ready: { type: boolean }
sizeHuman: { type: string }
lastSeen: { type: string, format: date-time }
The lesson: not everything needs to be an XR. If the resource composes
nothing — if it’s just a record with a clear owner — a plain CRD with a controller
is the right tool. Forcing an XRD here would create an ownership conflict over the
status.
The one keeping the catalog is the dataset-watcher, a Python controller (~430
lines in the main module) that runs a reconcile every 30 seconds — the robot
stockroom clerk of the shop, checking the fuel inventory:
1. Lists the *.csv in the gs://whisperops-datasets/ bucket.
2. For each blob, upserts an XDataset (normalizing the name: Athlete_Recovery.csv becomes athlete-recovery, but spec.gcsPath preserves the blob’s original name).
3. In parallel, publishes a Backstage Resource entity to a Git catalog repository; this is where the form’s dropdown feeds from.
4. If the CSV disappears from the bucket, the CR and the entity disappear together.
Point 3 closes an elegant circuit. The Backstage form doesn’t use a hardcoded
enum; it uses an EntityPicker filtering catalog entities:
# excerpt from the dataset-whisperer template — the dynamic dropdown
dataset_ref:
title: Dataset
type: string
ui:field: EntityPicker # dropdown fed by the catalog
ui:options:
catalogFilter:
kind: Resource
spec.type: dataset # only entities published by the watcher
Result: make upload-datasets uploads a new CSV, and in ~1 minute it shows up in
the form’s dropdown — zero code or template change. A new fuel arrived at the
shop, the clerk labeled it, and the order form already offers the option.
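One pass of the clerk's inventory check can be sketched as a pure function. This is not the dataset-watcher's code: cr_name and plan_registry are hypothetical names, the normalization rule is my guess from the single Athlete_Recovery.csv example, and filling displayName from the file stem is an assumption too.

```python
import re
from pathlib import PurePosixPath


def cr_name(blob_name: str) -> str:
    """Athlete_Recovery.csv -> athlete-recovery (assumed rule: lowercase, other characters to '-')."""
    stem = PurePosixPath(blob_name).stem.lower()
    return re.sub(r"[^a-z0-9]+", "-", stem).strip("-")


def plan_registry(blobs: dict[str, int], existing: set[str], bucket: str = "whisperops-datasets"):
    """Return (upserts, deletes) for one reconcile pass.

    blobs maps blob name -> size in bytes; existing holds the XDataset names already in the cluster.
    """
    upserts = {}
    for blob, size in blobs.items():
        if not blob.endswith(".csv"):
            continue  # only CSVs are cataloged
        upserts[cr_name(blob)] = {
            "gcsPath": f"gs://{bucket}/{blob}",          # keeps the blob's original name
            "displayName": PurePosixPath(blob).stem,     # assumption: human label = file stem
            "sizeBytes": size,
        }
    # records whose CSV is gone from the bucket disappear from the catalog
    deletes = sorted(existing - set(upserts))
    return upserts, deletes


if __name__ == "__main__":
    upserts, deletes = plan_registry(
        {"Athlete_Recovery.csv": 1024, "notes.txt": 5},
        existing={"athlete-recovery", "old-dataset"},
    )
    print(upserts)
    print(deletes)
```

The same plan would feed both projections: the XDataset upsert in the cluster and the Resource entity in the catalog repository.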
The catalog in action — kubectl get xds reads like the shop’s labeled shelf
(the printer columns come from the CRD itself):
kubectl get xds
NAME SIZE READY LAST_SEEN AGE
california-housing 1.9 MiB true 21s 3d2h
online-retail-ii 90.4 MiB true 21s 3d2h
spotify-tracks 19.2 MiB true 21s 3d2h
And who reads the record? The agent’s Composition, in the first station —
using a Crossplane mechanism called extra resources: the function declares “I
need the XDataset called california-housing” and Crossplane fetches it for it.
Important detail: because the mechanism fetches any resource by group+kind+name,
it works with a plain CRD — one more reason not to force an XRD on the
registry. We’ll see the code for this read in Level 3.
The complete registry architecture, in a diagram:
One source of truth (the bucket), three projections (CR, entity, pipeline context) — each consumer reads the projection it understands, and none of them needs a GCS credential beyond the watcher itself.
🙃 Garage confession: I got carried away here, I’ll admit. A simple upload button in Backstage would have solved the problem — but I wanted to explore the registry pattern, write a custom controller, see the EntityPicker fed dynamically… Sometimes you buy the forged camshaft for an engine that didn’t even need it, just to see how it beds in. The upside: you just got the full tour of the pattern.
Level 2 — XAgentBudget: the electronic wastegate 💸
An LLM agent spends money on every token. Without control, an agent stuck in a
loop is an engine with the wastegate jammed shut: boost climbs until it blows —
except here what blows is the invoice. XAgentBudget is the electronic
wastegate with cutoff: it measures the pressure continuously and, if it crosses
the limit, cuts the fuel.
Before the “how,” the “why” of the architecture. The first version was a 470-line imperative controller — and the classic bug appeared: three different components (the controller, a probe on the frontend, and an error classifier) independently inferred “is this agent paused?”, and they diverged. A false “Budget Exhausted” banner every time the planner hiccupped. The rewrite as a Composition fixes that at the root: the XR’s status becomes the single source of truth, written by a pipeline and read by everyone.
The XRD (abridged) shows the contract:
# xrd.yaml of XAgentBudget (note the scope)
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xagentbudgets.whisperops.io
spec:
  scope: Namespaced   # the XR lives INSIDE the agent's namespace
  group: whisperops.io
  names:
    kind: XAgentBudget
    shortNames: [xab, budget]
  versions:
    - name: v1alpha1
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                agentName: { type: string }
                budgetUsd: { type: number, minimum: 0, maximum: 10000 }
                pricingRef:        # per-token price table
                  type: object     # (external ConfigMap, so changing
                  properties:      #  price needs no rebuild)
                    name: { type: string }
                    namespace: { type: string }
                enforcement:
                  type: string
                  enum: [enabled, monitor-only]   # bench mode: measures, doesn't cut
                  default: enabled
            status:
              type: object
              properties:
                spentUsd: { type: number }
                ratio: { type: number }       # spend / budget
                paused: { type: boolean }     # THE source of truth
                cause:
                  type: string
                  enum: [running, budget-exhausted, agent-unreachable, unknown]
The Composition is a pipeline of 3 stations + dyno, running every ~60 seconds (Crossplane’s default poll interval — more on that below):
Station 1 — fetch-spend (Python, with the
function-sdk-python):
queries Mimir (Prometheus), summing the agent’s token counters and multiplying by
the price table:
# function-budget-fetch-spend: the boost gauge
# For each token type (input/output/cached), sum the increase()
# over the XR's lifetime window and convert to USD via the price table
# (ConfigMap mounted at /etc/pricing, so changing price is config only).
for metric, price_key in (
    ("whisperops_tokens_input_total", "input_per_million"),
    ("whisperops_tokens_output_total", "output_per_million"),
    ("whisperops_tokens_cached_input_total", "cached_input_per_million"),
):
    unit_price = prices[price_key] / 1_000_000
    # the model matcher matters: the price table is per-model
    expr = (
        f'sum(increase({metric}{{agent_name="{agent}",model="{MODEL_NAME}"}}'
        f"[{window_s}s])) * {unit_price}"
    )
...
# Design decision: FAIL OPEN. If Mimir goes down, spend = 0.0 and the
# agent keeps running: an observability outage can never pause a
# healthy agent. The price of this: silent under-enforcement while
# Mimir is down.
except httpx.HTTPError as e:
    response.warning(rsp, f"mimir query failed: {e}; using spend=0.0")
    spend = 0.0
# The result goes into the pipeline CONTEXT, not the status.
# Context = a note passed station to station, dies at the end of the
# reconcile. Status = what gets written to the XR.
rsp.context["spend_usd"] = spend
rsp.context["window_sec"] = window_s
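The arithmetic behind this station is small enough to check by hand. A minimal sketch, assuming the per-token-type increases have already been fetched: the helper name `spend_usd`, the token counts, and the prices are hypothetical illustrations (not whisperops code, not real pricing), and passing `None` to signal a failed query is my own convention for mirroring the fail-open decision.

```python
from typing import Mapping, Optional

# Counter name -> key in the price table (names taken from the excerpt above).
METRIC_TO_PRICE_KEY = {
    "whisperops_tokens_input_total": "input_per_million",
    "whisperops_tokens_output_total": "output_per_million",
    "whisperops_tokens_cached_input_total": "cached_input_per_million",
}


def spend_usd(increases: Optional[Mapping[str, float]],
              prices: Mapping[str, float]) -> float:
    """Convert token increases to USD. None (query failed) fails open to 0.0."""
    if increases is None:
        # Fail open: an observability outage must not look like spend.
        return 0.0
    total = 0.0
    for metric, price_key in METRIC_TO_PRICE_KEY.items():
        unit_price = prices[price_key] / 1_000_000  # USD per single token
        total += increases.get(metric, 0.0) * unit_price
    return total


# Hypothetical price table and token counts, for illustration only.
prices = {"input_per_million": 2.0, "output_per_million": 8.0,
          "cached_input_per_million": 0.5}
increases = {
    "whisperops_tokens_input_total": 500_000,
    "whisperops_tokens_output_total": 50_000,
    "whisperops_tokens_cached_input_total": 200_000,
}
print(round(spend_usd(increases, prices), 4))  # 1.0 + 0.4 + 0.1
print(spend_usd(None, prices))                 # fail-open path
```

Keeping the conversion separate from the HTTP call means the price math can be unit-tested without a Mimir instance.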
Station 2 — decide: the pure logic. Compares, decides, and writes the
status — the only place the truth is written:
# function-budget-decide: the ECU
ratio = spend / budget if budget > 0 else 0.0
if deletion_ts:                          # XR being deleted? unpause
    should_pause = False                 # (hand the car back running)
elif enforcement == "monitor-only":      # bench mode: measures, doesn't cut
    should_pause = False
else:
    should_pause = ratio >= 1.0          # 100% of budget = cutoff
# write the source of truth into the XR's status
resource.update(rsp.desired.composite, {
    "status": {
        "spentUsd": round(spend, 4),
        "ratio": round(ratio, 4),
        "ratioPct": f"{ratio * 100:.2f}%",   # for kubectl get xab
        "paused": should_pause,
        "cause": "budget-exhausted" if should_pause else "running",
    },
})
# and pass the decision to the next station via context
rsp.context["should_pause"] = should_pause
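Because the decision is pure logic, it can be lifted into a plain function and exercised without a cluster. A sketch under that assumption: the `decide` wrapper and its signature are hypothetical, while the branches mirror the excerpt above.

```python
def decide(spend: float, budget: float, enforcement: str = "enabled",
           deleting: bool = False) -> dict:
    """Return the status block the decide station would write."""
    ratio = spend / budget if budget > 0 else 0.0
    if deleting or enforcement == "monitor-only":
        should_pause = False           # never cut on delete or in bench mode
    else:
        should_pause = ratio >= 1.0    # 100% of budget = cutoff
    return {
        "spentUsd": round(spend, 4),
        "ratio": round(ratio, 4),
        "ratioPct": f"{ratio * 100:.2f}%",
        "paused": should_pause,
        "cause": "budget-exhausted" if should_pause else "running",
    }


# Spend values chosen to sit on either side of a 5.00 USD budget.
for spend in (1.4602, 5.0213):
    status = decide(spend, 5.00)
    print(status["ratioPct"], status["paused"], status["cause"])

# Same overspend in monitor-only mode: measured, not cut.
print(decide(5.0213, 5.00, enforcement="monitor-only")["paused"])
```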
Station 3 — render: the actuation. Emits two Object MRs of the
provider-kubernetes
that patch spec.replicas of the planner and worker Deployments:
# function-budget-render: the fuel cut
def _make_object_mr(role: str, namespace: str, replicas: int) -> dict:
    return {
        # NAMESPACED flavor of Object (.m.), required because the
        # XAgentBudget is a namespaced XR, and in Crossplane v2 a
        # namespaced XR only composes namespaced MRs. The legacy Object
        # (kubernetes.crossplane.io/v1alpha2) is cluster-scoped and
        # fails with "cannot apply cluster scoped composed resource".
        "apiVersion": "kubernetes.m.crossplane.io/v1alpha1",
        "kind": "Object",
        "metadata": {"name": f"{namespace}-{role}-replicas",
                     "namespace": namespace},
        "spec": {
            # Observe + Update, NO Create and NO Delete:
            # - don't create: the Deployment already exists (kagent made it)
            # - don't delete: MR GC must not bring down the Deployment
            "managementPolicies": ["Observe", "Update"],
            "forProvider": {
                # deliberately SPARSE manifest: only the field we
                # want to own. Server-side apply makes the provider
                # own spec.replicas and NOTHING else; a complete
                # manifest here would steal ownership of every
                # field from the kagent operator.
                "manifest": {
                    "apiVersion": "apps/v1",
                    "kind": "Deployment",
                    "metadata": {"name": role, "namespace": namespace},
                    "spec": {"replicas": replicas},
                },
            },
            "providerConfigRef": {"kind": "ClusterProviderConfig",
                                  "name": "in-cluster"},
        },
    }
And who consumes the source of truth? The chat-frontend does a GET on the XR on
every request (no cache, on purpose — unpausing becomes visible within one cycle)
and maps it to HTTP with honest semantics: 402 Payment Required only when
paused && cause == "budget-exhausted"; 503 is reserved for infrastructure
failure. It was exactly this separation that killed the false banner.
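The mapping fits in a few lines. A sketch, assuming the frontend has already fetched the XR's status as a dict: the function name `http_status_for`, the use of `None` for "could not read the XR", and the choice to answer 503 for a pause with any other cause are my assumptions, not whisperops code.

```python
from typing import Optional


def http_status_for(xr_status: Optional[dict]) -> int:
    """Map the XAgentBudget status to the HTTP code the chat would return."""
    if xr_status is None:
        return 503                      # XR unreadable: infrastructure failure
    paused = xr_status.get("paused", False)
    cause = xr_status.get("cause", "unknown")
    if paused and cause == "budget-exhausted":
        return 402                      # the only case that blames the budget
    if paused:
        return 503                      # paused, but not because of money
    return 200


print(http_status_for({"paused": True, "cause": "budget-exhausted"}))
print(http_status_for({"paused": False, "cause": "running"}))
print(http_status_for(None))
```

The point of the design is that 402 needs both conditions at once, so a planner hiccup alone can never produce the budget banner.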
The nuances worth the lesson:
- The 60 seconds of latency aren’t configured anywhere: it’s Crossplane’s default poll interval. An agent can overshoot the budget by up to one cycle of spend before the cutoff. A conscious trade-off: halving the interval would double the Mimir queries per agent.
- There’s no ownership fight over spec.replicas: the patch is server-side apply with its own field manager, owning ONE field. And ArgoCD never reverts it because the planner/worker Deployments aren’t even managed by it; the kagent operator creates them. Layers that don’t see each other don’t fight.
- kubectl get xab is the runbook: the printer columns (BUDGET, SPENT, RATIO, PAUSED, CAUSE) make the XR itself the incident summary.
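To put a number on "up to one cycle", here is a back-of-the-envelope sketch. The token rate and the price are made-up inputs, and the figure of three queries per reconcile is my assumption (one per token type, as in the fetch-spend loop).

```python
def max_overshoot_usd(tokens_per_second: float, usd_per_million_tokens: float,
                      poll_interval_s: float = 60.0) -> float:
    """Worst-case spend that can accumulate between two reconciles."""
    return tokens_per_second * poll_interval_s * usd_per_million_tokens / 1_000_000


def queries_per_hour(poll_interval_s: float, queries_per_reconcile: int = 3) -> float:
    """Metric queries one agent generates per hour at a given poll interval."""
    return 3600 / poll_interval_s * queries_per_reconcile


# Hypothetical runaway agent: 2000 tokens/s at 8 USD per million tokens.
print(round(max_overshoot_usd(2000, 8.0), 2))        # exposure at 60 s
print(round(max_overshoot_usd(2000, 8.0, 30.0), 2))  # exposure at 30 s
print(queries_per_hour(60.0), queries_per_hour(30.0))
```

Halving the interval halves the exposure and doubles the query load, which is the trade-off the first bullet describes.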
The wastegate in action, at two moments:
# at cruise — 29% of budget consumed, agent running:
kubectl get xab -n agent-housing-bot
NAME BUDGET SPENT RATIO PAUSED CAUSE
housing-bot 5.00 1.4602 29.20% false running
# …after a day of heavy questions — crossed 100%, cut off ✂️
kubectl get xab -n agent-housing-bot
NAME BUDGET SPENT RATIO PAUSED CAUSE
housing-bot 5.00 5.0213 100.43% true budget-exhausted
# and the cutoff is visible in the engine — planner and worker zeroed out:
kubectl get deploy -n agent-housing-bot planner worker
NAME READY UP-TO-DATE AVAILABLE AGE
planner 0/0 0 0 3d2h
worker 0/0 0 0 3d2h
Level 3 — XDatasetAgent: the complete Stage 3 kit 🏎️
Now the boss fight. The XDatasetAgent (XDA) is the XR that represents an
entire agent — and its Composition expands from ONE declaration into 22
resources spanning Kubernetes and GCP. It’s the Stage 3 kit: forged engine,
turbo, intercooler, ECU, and even the Level 2 electronic wastegate installed at
the factory.
First, the before-and-after that justifies everything. In the previous version of
whisperops, the Backstage skeleton had ~20 Nunjucks templates — each agent
resource was a .njk file rendered by the scaffolder and applied by ArgoCD. It
worked, but with two costs: the template was a monster to maintain, and ArgoCD
managed dozens of resources that mutated at runtime, requiring 4 blocks of
ignoreDifferences to keep it from fighting with the budget kill-switch, with
budget top-ups, and with Kyverno defaults.
After the rewrite, what ArgoCD applies is one file (plus the Backstage
catalog-info.yaml, which sits at the repo root, out of its reach):
# skeleton/manifests/xdatasetagent.yaml.njk: the complete order.
# The Composition renders the agent's 22 resources from this.
apiVersion: whisperops.io/v1alpha1
kind: XDatasetAgent
metadata:
  name: ${{ values.agent_name }}
spec:
  crossplane:
    compositionRef:
      name: xdatasetagent-default        # v2 way of choosing the manual
  agentName: ${{ values.agent_name }}
  datasetRef:
    name: ${{ values.dataset_id }}       # validates against the catalog (Level 1)
  budgetUsd: ${{ values.budget_usd }}    # becomes an XAgentBudget (Level 2)
  description: "${{ values.description }}"
  baseDomain: ${{ values.base_domain }}
  projectId: ${{ values.project_id }}
And the 4 blocks of ignoreDifferences? They vanished. ArgoCD now manages a
single resource — the order — and all runtime mutations happen on the composed
parts, below its line of sight. This is perhaps the most important
architectural lesson in the post: shrinking the surface managed by GitOps
dissolves the conflicts between selfHeal and runtime mutation, instead of
administering them exception by exception.
The Composition declares the line with 6 stations + dyno:
# xdatasetagent-default.yaml: the complete assembly line
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xdatasetagent-default
spec:
  compositeTypeRef:
    apiVersion: whisperops.io/v1alpha1
    kind: XDatasetAgent
  mode: Pipeline
  pipeline:
    - step: validate-dataset     # 1. check the fuel in the catalog
      functionRef: { name: function-xda-validate-dataset }
    - step: compute-tuning       # 2. size the engine to the dataset
      functionRef: { name: function-xda-compute-tuning }
    - step: render-iam           # 3. GCP parts: bucket, SA, IAM
      functionRef: { name: function-xda-render-iam }
    - step: render-workloads     # 4. K8s parts: 15 resources
      functionRef: { name: function-xda-render-workloads }
    - step: render-dashboard     # 5. instrument panel (Grafana)
      functionRef: { name: function-xda-render-dashboard }
    - step: emit-budget          # 6. install the wastegate (XAgentBudget!)
      functionRef: { name: function-xda-emit-budget }
    - step: ready                # final dyno
      functionRef: { name: function-auto-ready }
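The data flow between stations uses the two channels you already know from Level 2, and here the distinction becomes vital: context is the note passed from station to station that dies at the end of the reconcile, while status is what gets written to the XR.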
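As a toy model of those two channels (plain Python, no Crossplane involved), the sketch below runs two stand-in stations over a shared context and a shared status. The station names, keys, and values are hypothetical; only the lifetime rule comes from the post: context lives for one reconcile, status is what survives.

```python
from typing import Callable, List

# A station receives (context, status) and mutates them in place.
Step = Callable[[dict, dict], None]


def validate(context: dict, status: dict) -> None:
    context["dataset"] = {"name": "demo", "sizeBytes": 1024}  # note for later stations
    status["phase"] = "Validating"                            # durable record


def tune(context: dict, status: dict) -> None:
    # Reads the note left by the previous station in the same reconcile.
    status["datasetName"] = context["dataset"]["name"]


def run_pipeline(steps: List[Step]) -> dict:
    context: dict = {}   # created fresh for ONE reconcile
    status: dict = {}    # what ends up on the XR
    for step in steps:
        step(context, status)
    return status        # context is dropped here


print(run_pipeline([validate, tune]))

# A station that expects a note nobody left finds nothing.
try:
    run_pipeline([tune])
except KeyError as missing:
    print(f"missing context key: {missing}")
```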
Station by station, with the nuance of each:
1. validate-dataset — check the fuel. Reads the Level 1 XDataset via
extra resources. The pattern has a two-phase bootstrap subtlety that catches
everyone off guard:
# The function DECLARES what it needs; Crossplane fetches and RE-INVOKES.
selector = rsp.requirements.extra_resources["xdataset"]
selector.api_version = "whisperops.io/v1alpha1"
selector.kind = "XDataset"
selector.match_name = dataset_ref
# On the FIRST invocation, req.extra_resources comes EMPTY: the
# requirement hasn't been met yet. The function needs to return
# early with an honest state (phase=Validating) instead of failing.
# On the next invocation, the XDataset comes populated.
if not xds_items:
    resource.update(rsp.desired.composite, {"status": {
        "phase": "Validating",
        "conditions": [{"type": "Ready", "status": "Unknown",
                        "reason": "BootstrappingExtraResources", ...}],
    }})
    return rsp
# Dataset doesn't exist or not-ready? Fail EARLY, with a readable cause.
# Better an XR Failed with "DatasetNotFound" than a sandbox crashing
# 10 minutes later with a GCS error.
Validated, it puts {name, gcsPath, sizeBytes, displayName} into the context.
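The two-phase handshake is easier to see in a simulation. The sketch below is not the SDK: `validate_dataset`, `reconcile`, and the dict shapes are hypothetical stand-ins, and using `None` to mean "requirement not served yet" is a simplification of how the real request signals it.

```python
from typing import Optional


def validate_dataset(dataset_ref: str, extra: Optional[dict]) -> dict:
    """One invocation: always restate the requirement, then act on what arrived."""
    response = {"requirements": {"xdataset": {"kind": "XDataset",
                                              "match_name": dataset_ref}}}
    if extra is None:
        # Phase 1: the requirement was declared but nothing was fetched yet.
        response["status"] = {"phase": "Validating"}
        return response
    if not extra.get("ready"):
        # Fail early with a readable cause.
        response["status"] = {"phase": "Failed", "reason": "DatasetNotFound"}
        return response
    # Phase 2: hand the dataset facts to the next stations.
    response["context"] = {"name": extra["name"], "sizeBytes": extra["sizeBytes"]}
    return response


def reconcile(dataset_ref: str, cluster: dict) -> list:
    """Stand-in for Crossplane: invoke, fetch what was asked for, invoke again."""
    first = validate_dataset(dataset_ref, None)
    wanted = first["requirements"]["xdataset"]["match_name"]
    found = cluster.get(wanted, {"ready": False})
    second = validate_dataset(dataset_ref, found)
    return [first, second]


# Hypothetical catalog with a single ready dataset.
cluster = {"housing": {"ready": True, "name": "housing", "sizeBytes": 1_400_000}}
for result in reconcile("housing", cluster):
    print(result.get("status") or result.get("context"))
print(reconcile("missing", cluster)[1]["status"])
```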
2. compute-tuning — size the engine. A bigger carburetor demands more fuel;
a bigger dataset demands more memory in the sandbox. The heuristic: pandas takes
~3.5× the CSV size in RAM; round up to the next GiB, with a floor of 1 GiB and a
ceiling of 8 GiB:
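import math
PANDAS_BLOAT_FACTOR = 3.5            # validated against the real datasets
GIB_IN_MIB = 1024
MIN_SANDBOX_MIB = 1 * GIB_IN_MIB     # floor of 1 GiB, as stated above
MAX_SANDBOX_MIB = 8 * GIB_IN_MIB     # ceiling of 8 GiB, as stated above
def compute_sandbox_mem_mi(size_bytes: int) -> int:
    raw_mib = math.ceil((size_bytes * PANDAS_BLOAT_FACTOR) / (1024 * 1024))
    gib_rounded = math.ceil(raw_mib / GIB_IN_MIB) * GIB_IN_MIB
    return max(MIN_SANDBOX_MIB, min(gib_rounded, MAX_SANDBOX_MIB))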
This station also carries the trickiest footgun of the Python SDK: the
resource.update(rsp.desired.composite, {"status": {...}}) does a SHALLOW update
— the entire status block is REPLACED, erasing what the previous station wrote.
The fix is read-merge-write:
# Read the ACCUMULATED desired status (which came from previous stations),
# merge the new field, and write the complete block back.
# Without this, sandboxMemMi would erase phase/datasetFmt/conditions
# written by validate-dataset.
desired_status = (
    resource.struct_to_dict(rsp.desired.composite.resource).get("status") or {}
)
desired_status["sandboxMemMi"] = sandbox_mem_mi
resource.update(rsp.desired.composite, {"status": desired_status})
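The footgun can be reproduced with plain dicts. In the sketch below, `shallow_update` is a hypothetical stand-in that replaces top-level keys wholesale, which is the behaviour described above; it is not the SDK call itself, and the field values are illustrative.

```python
def shallow_update(desired: dict, patch: dict) -> None:
    """Replace each top-level key of `desired` with the value from `patch`."""
    for key, value in patch.items():
        desired[key] = value


# Status accumulated by an earlier station.
earlier = {"phase": "Validating", "datasetFmt": "csv"}

# Naive write: the whole status block is replaced, earlier fields vanish.
naive = {"status": dict(earlier)}
shallow_update(naive, {"status": {"sandboxMemMi": 1024}})
print(naive["status"])

# Read-merge-write: read the accumulated block, add one field, write it all back.
merged = {"status": dict(earlier)}
status = dict(merged.get("status") or {})
status["sandboxMemMi"] = 1024
shallow_update(merged, {"status": status})
print(merged["status"])
```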
3. render-iam — the GCP parts. Emits 5 Managed Resources from Crossplane’s
GCP providers: the agent’s bucket, a ServiceAccount, a ServiceAccountKey, and two
ProjectIAMMember — with IAM CEL conditions restricting each grant to exactly the
right bucket (viewer on the shared datasets bucket, admin only on the agent’s own
bucket). Least-privilege per agent, generated by code.
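A sketch of what "generated by code" can look like for the two grants. Everything here is illustrative: the helper names, the bucket and service-account names, and the two role IDs are my assumptions, and the output is a plain descriptor rather than the provider's managed-resource schema. The condition follows the Cloud Storage resource-name format `projects/_/buckets/<bucket>`.

```python
def bucket_condition(bucket: str) -> str:
    """CEL expression matching one bucket and the objects inside it."""
    prefix = f"projects/_/buckets/{bucket}"
    # Exact match for the bucket, prefix match (with the slash) for its objects,
    # so "my-bucket" never also matches "my-bucket-2".
    return f'resource.name == "{prefix}" || resource.name.startsWith("{prefix}/")'


def agent_grants(sa_email: str, own_bucket: str, shared_bucket: str) -> list:
    """Two grants per agent: read the shared datasets, administer only its own bucket."""
    member = f"serviceAccount:{sa_email}"
    return [
        {"member": member, "role": "roles/storage.objectViewer",
         "condition": bucket_condition(shared_bucket)},
        {"member": member, "role": "roles/storage.objectAdmin",
         "condition": bucket_condition(own_bucket)},
    ]


# Hypothetical names, for illustration only.
for grant in agent_grants("agent-demo@example-project.iam.gserviceaccount.com",
                          own_bucket="agent-demo-bucket",
                          shared_bucket="shared-datasets"):
    print(grant["role"], "->", grant["condition"])
```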
4. render-workloads — the bulk of the engine. 15 Kubernetes manifests
wrapped in Object MRs: the agent’s Namespace, prompts ConfigMap, NetworkPolicy,
ModelConfig, and the two kagent Agent CRs (planner + worker), sandbox
(Deployment + Service + RemoteMCPServer, the kagent MCP tool-server),
chat-frontend (SA + RoleBinding + Deployment + Service + Ingress), and a Kyverno
policy. Two design decisions make this station work:
# DECISION 1 — the chicken-and-egg of the namespace:
# The XDA is cluster-scoped because it CREATES the namespace the parts
# live in (you can't live inside what you haven't built yet).
# But a namespaced Object MR needs to exist in a namespace that ALREADY
# exists at create time. Solution: the MR lives in crossplane-system
# (always exists), while the wrapped manifest points at the agent's
# namespace — provider-kubernetes tries to apply, fails while the
# namespace doesn't exist, and converges on its own when the Namespace
# part settles. Eventual consistency doing the work.
return {
    "apiVersion": "kubernetes.m.crossplane.io/v1alpha1",
    "kind": "Object",
    "metadata": {"name": mr_name, "namespace": "crossplane-system"},
    "spec": {"forProvider": {"manifest": manifest},  # ← target: agent-{name}
             "providerConfigRef": {"kind": "ClusterProviderConfig",
                                   "name": "in-cluster"}},
}
# DECISION 2 — two flavors of Object coexisting:
# The Namespace (cluster-scoped) uses the LEGACY Object
# (kubernetes.crossplane.io/v1alpha2), because the .m. flavor only has
# a namespaced Object. The namespaced parts use the .m. flavor by
# CONVENTION of Crossplane v2 — in Part 1 we wrapped a ConfigMap
# in the legacy Object and it worked: a cluster-scoped XR CAN compose
# legacy Objects wrapping namespaced manifests. The HARD scope rule
# ("namespaced XR only composes namespaced MR") only binds on the
# Level 2 XAgentBudget. Two provider configs coexist, both
# "in-cluster": a ProviderConfig (legacy group) and a
# ClusterProviderConfig (.m. group).
5. render-dashboard — the instrument panel. Renders one Grafana dashboard
per agent (a ConfigMap from a JSON template). A veteran’s detail: the substitution
uses str.replace with sentinels (__AGENT__), never str.format() — the
Grafana JSON’s braces would blow up .format(). And the result goes through
json.loads before becoming a ConfigMap, to fail at reconcile and not silently in
Grafana.
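Both halves of that detail fit in a short, runnable sketch. The template below is a made-up miniature of a dashboard (the panel expression and metric name are hypothetical); what matters is that it contains literal braces, like any Grafana JSON.

```python
import json

# Built with json.dumps so the template is guaranteed to be valid JSON.
TEMPLATE = json.dumps({
    "title": "Agent __AGENT__",
    "panels": [{"expr": 'sum(rate(tokens_total{agent_name="__AGENT__"}[5m]))'}],
})


def render_dashboard(template: str, agent: str) -> dict:
    """Substitute the sentinel, then parse so a broken result fails right here."""
    rendered = template.replace("__AGENT__", agent)
    return json.loads(rendered)


dashboard = render_dashboard(TEMPLATE, "housing-bot")
print(dashboard["title"])
print(dashboard["panels"][0]["expr"])

# str.format() treats every brace in the JSON as a replacement field.
try:
    TEMPLATE.format(agent="housing-bot")
except (KeyError, ValueError, IndexError) as exc:
    print(f"str.format() rejects the template: {type(exc).__name__}")
```

The `json.loads` call is the cheap part of the safety net: a substitution that produces invalid JSON raises during the reconcile instead of surfacing later as an empty panel.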
6. emit-budget — the factory wastegate. The final station emits… another
XR. The Level 2 XAgentBudget is born here, as a composed part of the XDA:
# Composition-of-compositions: the XDA emits an XAgentBudget, which has
# its OWN pipeline (fetch-spend → decide → render) running on its
# own reconcile cycle. Emitting an XR is identical to emitting an
# MR — only the wrapped kind changes.
xab = {
    "apiVersion": "whisperops.io/v1alpha1",
    "kind": "XAgentBudget",
    "metadata": {
        # CONTRACT: the XAB name == the agent name. The chat-frontend
        # looks up getXAgentBudget("agent-{name}", "{name}"); a
        # mismatch breaks the chat with a misleading 503.
        "name": agent,
        "namespace": f"agent-{agent}",
    },
    "spec": {
        "crossplane": {"compositionRef": {"name": "xagentbudget-default"}},
        "agentName": agent,
        "budgetUsd": budget_usd,
        "pricingRef": {"name": "whisperops-pricing",
                       "namespace": "crossplane-system"},
        "enforcement": "enabled",
    },
}
resource.update(rsp.desired.resources["xagentbudget"], xab)
This nesting is what lets Level 2 exist as an independent product: the budget has its own lifecycle, reconcile, and API — the XDA just instantiates it. Like the electronic wastegate you buy separately, but which the Stage 3 kit ships pre-installed.
The complete inventory, to settle the count of the 22 parts:
| Station | Parts | What |
|---|---|---|
| render-iam | 5 | Bucket, ServiceAccount, ServiceAccountKey, 2× ProjectIAMMember (GCP) |
| render-workloads | 15 | Namespace, prompts CM, NetworkPolicy, ModelConfig, 2× Agent (kagent), sandbox (Deploy+Svc+RemoteMCPServer), chat-frontend (SA+RB+Deploy+Svc+Ingress), Kyverno policy |
| render-dashboard | 1 | ConfigMap with the agent’s Grafana dashboard |
| emit-budget | 1 | XAgentBudget (nested XR) |
And the result in the terminal — the XR with the printer columns derived from the status the stations wrote:
kubectl get xda
NAME DATASET BUDGET READY URL AGE
housing-bot California Housing 5.00 True https://agent-housing-bot.34.61.7.12.sslip.io:8443/ 12m
And popping the hood with crossplane beta trace — the XR tree with the 22 parts
hanging off it (abridged output):
crossplane beta trace xdatasetagent housing-bot
NAME SYNCED READY
XDatasetAgent/housing-bot True True
├─ Object/agent-housing-bot-namespace True True
├─ Object/agent-housing-bot-agent-planner True True
├─ Object/agent-housing-bot-agent-worker True True
├─ Object/agent-housing-bot-sandbox-deployment True True
├─ Object/agent-housing-bot-chat-frontend-deploy… True True
├─ Bucket/agent-housing-bot-bucket True True
├─ ServiceAccount/agent-housing-bot-sa True True
├─ XAgentBudget/housing-bot True True
└─ … (14 more parts)
The complete flow: from form to chat 💬
Putting the three levels together, the path from a click in Backstage to an agent holding a conversation:
Two template tricks worth noting:
- Hidden parameters with sed-bake: base_domain and project_id are ui:widget: hidden fields with sentinel defaults (__BASE_DOMAIN__). At deploy, the bootstrap queries the VM’s metadata server and “bakes” the real values into the template via sed. The dev never types an IP or project ID, and because the hidden fields have validation regex, a failed sed-bake breaks the form loudly, instead of scaffolding with a placeholder.
- path: manifests on the Application: the catalog-info.yaml sits at the repo root (for Backstage to discover) and out of ArgoCD’s reach (which only looks at manifests/). Without this, ArgoCD would try to apply a Backstage entity as a K8s resource and stay eternally OutOfSync.
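The sed-bake safety net can be modeled in a few lines. This is a sketch, not the whisperops bootstrap: the `__PROJECT_ID__` sentinel, both regexes, and the helper names are hypothetical, chosen so that an unbaked sentinel (uppercase with underscores) can never pass validation.

```python
import re

# Hidden-field defaults as they sit in the template before the bake.
DEFAULTS = {"base_domain": "__BASE_DOMAIN__", "project_id": "__PROJECT_ID__"}

# Hypothetical validation patterns; a sentinel matches neither of them.
PATTERNS = {
    "base_domain": re.compile(r"[a-z0-9.-]+\.[a-z]{2,}"),
    "project_id": re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]"),
}


def bake(defaults: dict, real_values: dict) -> dict:
    """Swap each sentinel for its real value; sentinels without a value stay put."""
    return {field: real_values.get(sentinel, sentinel)
            for field, sentinel in defaults.items()}


def validate_form(values: dict) -> None:
    """Reject the form loudly if any hidden field fails its pattern."""
    for field, value in values.items():
        if not PATTERNS[field].fullmatch(value):
            raise ValueError(f"{field}={value!r} failed validation (unbaked sentinel?)")


baked = bake(DEFAULTS, {"__BASE_DOMAIN__": "34.61.7.12.sslip.io",
                        "__PROJECT_ID__": "my-project-123"})
validate_form(baked)  # passes silently

half_baked = bake(DEFAULTS, {"__BASE_DOMAIN__": "34.61.7.12.sslip.io"})
try:
    validate_form(half_baked)
except ValueError as exc:
    print(exc)
```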
And Day-2? Changing a live agent’s budget, swapping the dataset, destroying
the agent — these are all Backstage templates too, and they all respect GitOps:
the change goes to Git first, and ArgoCD applies it. An in-cluster Job does a
GET on the file via the Gitea API (capturing the SHA), edits the exact line with
sed/awk, does a PUT with the previous SHA (optimistic concurrency), and
annotates the Application with refresh=hard so it doesn’t wait for the poll. No
kubectl patch on the live resource — the shop foreman would revert it in
seconds, and rightly so.
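The GET, edit, PUT-with-SHA dance is a compare-and-swap. The sketch below shows the idea against an in-memory stand-in instead of a real Git host: `FakeRepo`, `set_budget`, the file path, and the regex are hypothetical, and the real Job does the edit with sed/awk rather than Python.

```python
import hashlib
import re


class StaleWrite(Exception):
    """Raised when the file changed between our GET and our PUT."""


class FakeRepo:
    """In-memory stand-in for a contents API: GET returns a SHA, PUT requires it."""

    def __init__(self, files: dict):
        self._files = dict(files)

    @staticmethod
    def _sha(content: str) -> str:
        return hashlib.sha1(content.encode("utf-8")).hexdigest()

    def get(self, path: str) -> tuple:
        content = self._files[path]
        return content, self._sha(content)

    def put(self, path: str, content: str, expected_sha: str) -> None:
        _, current_sha = self.get(path)
        if current_sha != expected_sha:
            raise StaleWrite(f"{path} changed since it was read")
        self._files[path] = content


def set_budget(repo: FakeRepo, path: str, new_budget: float) -> None:
    content, sha = repo.get(path)                    # GET, capturing the SHA
    updated, hits = re.subn(                         # edit exactly one line
        r"(?m)^([ \t]*budgetUsd:[ \t]*).*$", rf"\g<1>{new_budget}", content
    )
    if hits != 1:
        raise ValueError(f"expected one budgetUsd line, found {hits}")
    repo.put(path, updated, expected_sha=sha)        # PUT with the SHA we read


PATH = "manifests/xdatasetagent.yaml"  # hypothetical path
repo = FakeRepo({PATH: "spec:\n  agentName: housing-bot\n  budgetUsd: 5\n"})
set_budget(repo, PATH, 10)
print(repo.get(PATH)[0])
```

If two Day-2 jobs race, the second PUT carries a SHA that no longer matches and is rejected instead of silently overwriting the first change.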
The destroy deserves its diagram, because the order is the lesson:
Deleting the repo before the XR would invert the race: selfHeal would lose its source, but the orphaned XR would be left behind. In GitOps, teardown is choreography: you silence the reconciler, dismantle the state, and only then erase the project from the ledger.
Lessons from the shop (what I broke so you don’t have to) ⚠️
Consolidating the nuances that appeared along the way, plus a few that only show up in production:
- The CRD-establish race, at three scales. A Crossplane Provider is a package: the CR syncs in seconds, but the CRDs it installs take minutes. If the ProviderConfig is in the same Application, ArgoCD’s dry-run fails (“kind doesn’t exist”). whisperops defends in three layers: separate Applications with sync waves (providers in wave 3, config in wave 5), retry with exponential backoff, and SkipDryRunOnMissingResource=true annotated on the resource. The same race reappears inside the providers app (the DeploymentRuntimeConfig must precede the Provider) and in the content (example XRs are excluded from the sync with exclude: "examples/*", because applying an XR before the XRD establishes brings down the whole sync).
- ignoreDifferences alone isn’t enough: it only changes the diff. For selfHeal not to overwrite the field, you need RespectIgnoreDifferences=true in syncOptions. And the mature version of this lesson: if you need many ignoreDifferences, maybe the problem is the surface ArgoCD manages, so shrink it (that was exactly the XDA rewrite).
- Sync waves order the apply, not the readiness, and sync-wave annotations on Crossplane-composed resources are decorative (Crossplane creates those resources, and it ignores the annotation). For real ordering between async layers, whisperops uses a PreSync hook that polls the prerequisite, converting async state into a hard precondition.
- An XRD’s spec.names is immutable in Crossplane v2 (CEL self == oldSelf). Adding a shortName on a live cluster is rejected at admission; it only lands on a recreate. Plan the names in the first version.
- The function-sdk-python footguns: proto Struct has no .get() (convert with resource.struct_to_dict), nested assignment on a Struct blows up (use resource.update), the status update is shallow (read-merge-write), and req.extra_resources is a MessageMap the default converter doesn’t understand (iterate the keys). None of these show up in a tutorial; all of them show up in the first real pipeline.
- Context propagates, as long as each station cooperates. In the Python SDK, response.to(req) copies the received context into the response; a function that builds the response by hand, without that helper, silently discards the note and the following stations see nothing. whisperops re-emits the critical keys explicitly, as belt-and-suspenders. And the distinction still holds: context is a station note (dies at the end of the reconcile); status is the durable record.
- packagePullPolicy: Always on functions with the :latest tag. Otherwise the digest gets cached and the function pods keep running old code after a rebuild, silently.
- Fail open where the measuring system’s failure can’t punish the measured (fetch-spend with Mimir down), and fail early and readably where the order is invalid (DatasetNotFound at station 1, regex on the form, limits on the XRD). Layered validation (browser, admission, pipeline) gives feedback at the cheapest possible point.
Build your own 🛠️
The checklist to adapt this to your context, in implementation order:
- Start with the empty triangle: idpbuilder create, Crossplane via Helm, provider-kubernetes. Reproduce Part 1’s lab up to the “edit it in Git and watch it propagate” step. Without that solid, the rest collapses.
- Model ONE lean abstraction. Pick the resource your team requests most (a web service? a database? a bucket?) and design the XRD with the minimum of fields. If the form needs more than 5, the abstraction is leaking.
- Composition in Pipeline mode from day 1, even if with a generic function (function-go-templating) + function-auto-ready. Migrating from patch-and-transform later hurts more.
- When the logic grows, write your own functions (Python or Go). Validation with readable failure first, render after, auto-ready always last. Durable status on the XR; context only between stations.
- Backstage comes last, when the manual flow (push + Application) is solid. The scaffolder template is just the form in front of what you already proved works.
- Day-2 is born GitOps: every mutation goes to Git first. If you catch yourself writing kubectl patch in a runbook, go back two squares.
- Record the races you lose. CRD-establish, namespace chicken-and-egg, selfHeal vs runtime: they all have a declarative solution (waves, retry, hooks, scoping). The parking-lot hack holds until the shop foreman walks by.
The measure of success is the same as the shop’s: the customer picks the kit at
the counter, signs a one-page form, and days of hand-crafted work become minutes
of assembly line — with the foreman ensuring every car on the road is identical to
the project in the ledger. When your kubectl get shows an XR Ready: True with
22 parts hanging off it, you’ll understand why I call it an engine that
reassembles itself.
References 📚
Official documentation:
- Backstage - Software Templates and Writing Templates - the scaffolder and the ${{ }} syntax
- Backstage - Software Catalog - the catalog that feeds the EntityPicker
- Crossplane — Compositions — Compositions and Pipeline mode
- Crossplane — Composite Resource Definitions — XRDs, scopes, and versioning
- Crossplane — Write a Composition Function in Python — the official Python SDK guide
- Argo CD — Cluster Bootstrapping (app-of-apps)
- Argo CD — Sync Phases and Waves
- Argo CD — Automated Sync Policy — prune, selfHeal, and friends
- CNOE and idpbuilder — the one-command local IDP
- kagent — the cloud-native AI agent framework
- Kyverno — Generate Rules — the fourth pillar of the BACK stack
Tools and code:
- crossplane/function-sdk-python — the Python functions SDK
- crossplane-contrib/provider-kubernetes — the provider behind the Object MRs
- cnoe-io/idpbuilder — idpbuilder code and packages
- gitops-bridge-dev/gitops-bridge — patterns for the Terraform → GitOps bridge
- whisperops — the project dissected in Part 2 (LAB page)
Research and fundamentals:
- CNCF Platforms White Paper — the canonical definition of an internal platform
- DORA Research — the research linking self-service and automation to delivery performance
- Team Topologies — Key Concepts — platform teams and the Thinnest Viable Platform
- Accelerating Control Systems with GitOps (arXiv) — recent research on GitOps as a single source of truth
Further reading:
- The BACK Stack and the intro post — the Backstage + ArgoCD + Crossplane + Kyverno reference stack
- Crossplane Blog — Introducing function-python
- Crossplane Blog — Composition Functions in Production — a real case from VSHN
- CNCF TAG App Delivery — Platform Engineering trends
- Argo CD Application Dependencies — app-of-apps + sync waves
- Building an IDP with Backstage, AKS, Crossplane and Argo CD — the same triangle in another cloud, for contrast