Backstage + Crossplane + ArgoCD: from the workshop counter to the assembled engine

⚠️ Factory warning: I like cars. A lot. So let me apologize in advance — this post is packed with references to the shop, engine builds, and turbos. 🏎️ If you’re more into software mechanics than the real kind, relax: every analogy comes with the technical translation right beside it.

Every tuning shop starts the same way: one good mechanic, a lug wrench, and zero process. The customer shows up, describes what they want (“I want around 300 hp, but it’s got to handle a daily driver”), and the mechanic builds it all by hand — picks the turbo, sizes the injectors, dials in the tune. It works beautifully… until the line of customers grows. Then the builder becomes the bottleneck, every car comes out different from the last, and nobody remembers which tune went into which engine.

An infrastructure platform is the same thing. The “builder-does-everything” is the DevOps team answering tickets: create a namespace for me, spin up a bucket, open a database. Every request is hand-crafted, every delivery is slightly different, and the knowledge lives in two people’s heads.

Elite shops solve this with three things: a kit catalog on the wall (Stage 1, Stage 2, Stage 3 — the customer picks the kit, not the bolt), an assembly line that turns an order into an engine, and an obsessive shop foreman who checks that the car out on the road is exactly the same as the project on the bench.

In platform engineering, those three roles have names:

  • Backstage is the counter with the kit catalog.
  • Crossplane is the assembly line.
  • ArgoCD is the shop foreman.

This post is a lesson in two parts. In Part 1, I build the three pillars from scratch, with a local lab you can copy and paste — actual baby steps. In Part 2, I pop the hood on whisperops, a real project that uses exactly this triad to deliver self-service AI agents, and I dissect three custom resources in increasing order of complexity: XDataset, XAgentBudget, and XDatasetAgent. By the end, you should walk away able to build your own line.

Grab your coffee ☕ (or the gas-station energy drink) and come along.


Part 1 — The three pillars, baby steps

Backstage: the workshop counter 🛎️

Backstage is an open source developer portal created by Spotify. It does many things (service catalog, documentation, plugins), but for this lesson what matters is the scaffolder — the Software Templates mechanism.

A Software Template is the shop’s order form: a form with a few well-chosen fields. The customer doesn’t fill in “camshaft lobe diameter” — they pick “Stage 2” and the rest is derived. The anatomy of a template has two halves:

# template.yaml — the kit's order form
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: garage-stage-kit
  title: Stage Kit
  description: Order a complete tuning kit for your team.
spec:
  owner: platform-team
  type: service

  # HALF 1 — the form. Each property becomes a field in the UI.
  # Pure JSON Schema: validation happens in the browser, before
  # anything touches the cluster.
  parameters:
    - title: Order
      required: [team_name, stage]
      properties:
        team_name:
          title: Team name
          type: string
          # regex in the schema = an invalid order never leaves the counter
          pattern: '^[a-z][a-z0-9-]{2,28}$'
        stage:
          title: Kit
          type: string
          enum: [stage1, stage2, stage3]

  # HALF 2 — the steps. What the counter does when the customer signs.
  steps:
    # 1. Render the skeleton/ files, substituting ${{values.X}}
    - id: fetch
      action: fetch:template
      input:
        url: ./skeleton
        values:
          team_name: ${{ parameters.team_name }}
          stage: ${{ parameters.stage }}

    # 2. Create a Git repository and push the result.
    #    The signed order goes into the shop's project ledger.
    #    (Gitea = the local Git server we'll spin up in the lab)
    - id: publish
      action: publish:gitea
      input:
        repoUrl: cnoe.localtest.me:8443/gitea?repo=garage-${{ parameters.team_name }}
        defaultBranch: main

    # 3. Register the ArgoCD Application pointing at the new repo.
    #    A custom action the CNOE Backstage ships built-in — it does
    #    via API the kubectl apply we'll do by hand in the lab.
    - id: argocd
      action: cnoe:create-argocd-app
      input:
        appName: garage-${{ parameters.team_name }}
        appNamespace: argocd
        argoInstance: in-cluster
        projectName: default
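        # in-cluster Gitea URL, same as the Application in the lab: ArgoCD
        # runs inside the cluster, so it uses the service DNS
        repoUrl: http://my-gitea-http.gitea.svc.cluster.local:3000/giteaAdmin/garage-${{ parameters.team_name }}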
        path: manifests

Notice two details that will come back in Part 2:

  1. The template doesn’t create the order’s resources. It writes the order into a Git repository and, at most, registers the Application that tells ArgoCD to watch that repo. Applying the content is another piece’s job (spoiler: the shop foreman).
  2. The syntax is ${{ values.x }}, with the $ in front. Backstage uses Nunjucks underneath, but with that custom prefix. Forget the $ and the expression goes into Git raw; what blows up is ArgoCD at apply time, with a cryptic invalid map key error. Write that one down.
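To see why a missing $ slips through silently, here is a toy renderer. It is not Backstage’s code. The function render_skeleton is hypothetical, and it substitutes only ${{ values.x }} placeholders. Anything else is left exactly as written, which is how a bare {{ values.x }} ends up committed to Git.

```python
import re

# Hypothetical toy renderer: only "${{ values.<key> }}" counts as an
# expression; a bare "{{ values.<key> }}" is passed through untouched.
EXPR = re.compile(r"\$\{\{\s*values\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render_skeleton(text: str, values: dict) -> str:
    """Substitute ${{ values.x }} placeholders and leave everything else raw."""
    return EXPR.sub(lambda m: str(values[m.group(1)]), text)


skeleton = (
    "metadata:\n"
    "  name: garage-${{ values.team_name }}\n"
    "spec:\n"
    "  stage: {{ values.stage }}\n"  # forgot the "$"
)
print(render_skeleton(skeleton, {"team_name": "ae86", "stage": "stage2"}))
```

Only the first placeholder is rendered; the second reaches Git untouched. In YAML, stage: {{ values.stage }} parses as a flow mapping whose key is itself a mapping. The JSON-based decoding on the Kubernetes side cannot represent that, so the error only surfaces at apply time.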

Crossplane: the assembly line 🏭

Crossplane turns Kubernetes into a universal control plane: beyond Pods and Services, the cluster learns to create GCS buckets, service accounts, databases — any resource that has a provider. But the superpower isn’t talking to the cloud; it’s layered abstraction. Three concepts:

| Concept | In the shop | What it is |
|---|---|---|
| XRD (CompositeResourceDefinition) | The order form’s homologation | Defines the API of your composite resource: which fields the order accepts, which are required, the regex for each |
| Composition | The kit’s assembly manual | Says HOW to expand an order into N real resources |
| XR (Composite Resource) | A specific order | “Stage 2 for team ae86”: an instance of the API the XRD defined |

And underneath it all, the Managed Resources (MRs) — the individual parts (a bucket, an IAM binding, a Deployment) that the providers reconcile.

The part that leveled up in recent versions: the modern Composition runs in Pipeline mode, a sequence of Composition Functions — each function is a station on the assembly line. The first station measures the order, the second machines the parts, the third assembles, the last runs the dyno and stamps “done.” Functions can be off-the-shelf generics (function-go-templating, function-auto-ready) or your own, written in Python, Go, or KCL.
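The station metaphor maps onto a simple data flow. Each function receives the observed XR plus the desired state built so far, and it passes an updated desired state to the next function. The sketch below models that flow with plain dicts. It does not use the Crossplane function SDK. The names render, auto_ready and run_pipeline are hypothetical stand-ins for function-go-templating and function-auto-ready.

```python
from typing import Callable

# A toy model of a Pipeline-mode Composition: plain dicts, no Crossplane SDK.
Desired = dict
Station = Callable[[dict, Desired], Desired]


def render(observed: dict, desired: Desired) -> Desired:
    """Stand-in for function-go-templating: add the composed parts."""
    team = observed["composite"]["spec"]["teamName"]
    resources = dict(desired.get("resources", {}))
    resources["namespace"] = {"kind": "Namespace", "name": f"garage-{team}"}
    resources["spec-sheet"] = {"kind": "ConfigMap", "name": "spec-sheet"}
    return {**desired, "resources": resources}


def auto_ready(observed: dict, desired: Desired) -> Desired:
    """Stand-in for function-auto-ready: Ready only if every desired part is observed Ready."""
    seen = observed.get("resources", {})
    parts = desired.get("resources", {})
    ready = all(seen.get(name, {}).get("ready", False) for name in parts)
    return {**desired, "ready": ready}


def run_pipeline(stations: list[Station], observed: dict) -> Desired:
    desired: Desired = {}
    for station in stations:
        # Each station sees the observed state plus everything the previous
        # stations added to the desired state.
        desired = station(observed, desired)
    return desired


observed = {"composite": {"spec": {"teamName": "ae86", "stage": "stage2"}}}
print(run_pipeline([render, auto_ready], observed)["ready"])  # False: no part observed yet
```

Swap the stations, as in run_pipeline([auto_ready, render], observed), and auto_ready inspects an empty desired state. Since all() over nothing is True, the toy XR reports Ready before a single part exists. That is one way to see why the Composition below keeps function-auto-ready as the last station.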

A minimal example — the XRD first:

# xrd.yaml — the homologation: which fields an XGarage order accepts
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  # convention: <plural>.<group>
  name: xgarages.blog.opsbogus.dev
spec:
  # Cluster-scoped: the XR lives outside namespaces (it's going to CREATE one).
  # Crossplane v2 also supports namespaced XRs — more on that in Part 2.
  scope: Cluster
  group: blog.opsbogus.dev
  names:
    kind: XGarage
    plural: xgarages
  # which Composition to use when the order doesn't specify one
  defaultCompositionRef:
    name: xgarage-default
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                teamName:
                  type: string
                  pattern: '^[a-z][a-z0-9-]{2,28}$'
                stage:
                  type: string
                  enum: [stage1, stage2, stage3]
              required: [teamName, stage]
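The XRD repeats the exact regex from the Backstage form. An order that skips the browser therefore still hits the same wall at admission. Below is a quick way to probe the pattern locally. The helper is_valid_team is hypothetical. Python’s $ also matches before a trailing newline, unlike the browser (ECMA-262) and the API server (Go’s regexp), so the sketch uses re.fullmatch on the unanchored body.

```python
import re

# Same rule as the form's team_name and the XRD's spec.teamName:
# one lowercase letter, then 2 to 28 of [a-z0-9-], so 3 to 29 chars total.
TEAM_NAME = re.compile(r"[a-z][a-z0-9-]{2,28}")


def is_valid_team(name: str) -> bool:
    # fullmatch gives strict end-of-input semantics, like JSON Schema's "$"
    return TEAM_NAME.fullmatch(name) is not None


for candidate in ["ae86", "AE86", "86ae", "ab", "team_x", "abc-"]:
    print(f"{candidate!r:10} -> {is_valid_team(candidate)}")
```

The last case deserves a second look. abc- passes, yet garage-abc- is not a valid Namespace name, because Namespace names must be DNS-1123 labels, which end with an alphanumeric character. A pattern such as ^[a-z][a-z0-9-]{1,27}[a-z0-9]$ keeps the same 3 to 29 length range and closes that gap.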

And the Composition, in Pipeline mode with two stations:

# composition.yaml — the assembly manual for the XGarage kit
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xgarage-default
spec:
  # ties this manual to the API the XRD homologated
  compositeTypeRef:
    apiVersion: blog.opsbogus.dev/v1alpha1
    kind: XGarage
  mode: Pipeline
  pipeline:
    # STATION 1: render the desired resources from the order.
    # function-go-templating is the "off-the-shelf generic" function:
    # Go templates reading the observed XR.
    - step: render
      functionRef:
        name: function-go-templating
      input:
        apiVersion: gotemplating.fn.crossplane.io/v1beta1
        kind: GoTemplate
        source: Inline
        inline:
          template: |
            {{- $team := .observed.composite.resource.spec.teamName }}
            {{- $stage := .observed.composite.resource.spec.stage }}
            # Part 1: the team's bay (a Namespace).
            # Object is the provider-kubernetes MR: an "envelope" that
            # applies any K8s manifest as a managed resource.
            apiVersion: kubernetes.crossplane.io/v1alpha2
            kind: Object
            metadata:
              # explicit name = predictable in kubectl (no random suffix)
              name: garage-{{ $team }}-namespace
              annotations:
                # logical name of the part inside the Composition
                gotemplating.fn.crossplane.io/composition-resource-name: namespace
            spec:
              forProvider:
                manifest:
                  apiVersion: v1
                  kind: Namespace
                  metadata:
                    name: garage-{{ $team }}
              # tells the provider HOW to authenticate — explained in the lab
              providerConfigRef:
                name: in-cluster
            ---
            # Part 2: the spec sheet taped to the bay wall (ConfigMap).
            apiVersion: kubernetes.crossplane.io/v1alpha2
            kind: Object
            metadata:
              name: garage-{{ $team }}-spec-sheet
              annotations:
                gotemplating.fn.crossplane.io/composition-resource-name: spec-sheet
            spec:
              forProvider:
                manifest:
                  apiVersion: v1
                  kind: ConfigMap
                  metadata:
                    name: spec-sheet
                    namespace: garage-{{ $team }}
                  data:
                    stage: {{ $stage }}
                    team: {{ $team }}
              providerConfigRef:
                name: in-cluster

    # FINAL STATION: the dyno. function-auto-ready marks the XR as Ready
    # when all composed parts become Ready. ALWAYS last.
    - step: ready
      functionRef:
        name: function-auto-ready

The order itself — notice how ridiculously small it is compared to what it generates:

# xr.yaml — the order: "Stage 2 for team ae86"
apiVersion: blog.opsbogus.dev/v1alpha1
kind: XGarage
metadata:
  name: projeto-ae86
spec:
  teamName: ae86
  stage: stage2

That asymmetry is the heart of the pattern: the interface is lean, the expansion is fat. The customer signs one line; the assembly line delivers the complete engine. And because the expansion happens inside the cluster, in Crossplane’s reconcile loop, it is continuously enforced: if someone deletes the ConfigMap by hand, Crossplane recreates it (how quickly is a nuance we’ll hit in the lab). It’s an engine that reassembles itself.
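A plain-Python mirror of the go-template station makes the asymmetry measurable. This is a sketch only: object_mr and expand_order are hypothetical helpers and no Crossplane is involved. The output is just the two provider-kubernetes Object MRs that the Composition above renders for the order.

```python
import json


def object_mr(name: str, resource_name: str, manifest: dict) -> dict:
    """Wrap a plain K8s manifest in a provider-kubernetes Object MR."""
    return {
        "apiVersion": "kubernetes.crossplane.io/v1alpha2",
        "kind": "Object",
        "metadata": {
            "name": name,
            "annotations": {
                "gotemplating.fn.crossplane.io/composition-resource-name": resource_name,
            },
        },
        "spec": {
            "forProvider": {"manifest": manifest},
            "providerConfigRef": {"name": "in-cluster"},
        },
    }


def expand_order(xr: dict) -> list[dict]:
    """Mirror of the go-template station: one XGarage becomes two Object MRs."""
    team, stage = xr["spec"]["teamName"], xr["spec"]["stage"]
    namespace = {"apiVersion": "v1", "kind": "Namespace",
                 "metadata": {"name": f"garage-{team}"}}
    spec_sheet = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "spec-sheet", "namespace": f"garage-{team}"},
        "data": {"stage": stage, "team": team},
    }
    return [
        object_mr(f"garage-{team}-namespace", "namespace", namespace),
        object_mr(f"garage-{team}-spec-sheet", "spec-sheet", spec_sheet),
    ]


order = {
    "apiVersion": "blog.opsbogus.dev/v1alpha1",
    "kind": "XGarage",
    "metadata": {"name": "projeto-ae86"},
    "spec": {"teamName": "ae86", "stage": "stage2"},
}
parts = expand_order(order)
print(len(json.dumps(order)), "bytes of order ->", len(json.dumps(parts)), "bytes of parts")
```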

ArgoCD: the shop foreman 🧐

ArgoCD implements GitOps: Git is the single source of truth, and a controller continuously compares what’s declared in the repository with what’s running in the cluster. Detected a difference? It corrects it. In the shop: the foreman walks around with the project tucked under his arm and won’t tolerate “parking-lot hacks” — if the car on the road diverges from the project in the ledger, he undoes the hack (selfHeal) or removes the part that isn’t in the project (prune).

The unit of work is the Application: this repository, at this path, applied to this cluster.

# application.yaml — the shop foreman takes over the garage-ae86 project
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: garage-ae86
  namespace: argocd
spec:
  project: default
  source:
    # Gitea in-cluster URL (ArgoCD runs inside the cluster,
    # so it uses the service DNS, not the external hostname)
    repoURL: http://my-gitea-http.gitea.svc.cluster.local:3000/giteaAdmin/garage-ae86.git
    targetRevision: HEAD
    # only this folder is applied — files outside it are invisible
    path: manifests
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true     # part not in the project? remove it
      selfHeal: true  # hack on the car? undo it
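The two syncPolicy switches are easiest to reason about as a diff between the ledger and the road. The Python below is a toy model, not ArgoCD’s implementation, which also handles waves, hooks and resource tracking. plan_sync is a hypothetical name, and resources are keyed by a simple "kind/name" string. The rule it encodes is the documented one: under an automated policy, a new Git revision triggers a sync, live drift alone triggers one only with selfHeal, and deletions happen only with prune.

```python
def plan_sync(desired: dict, live: dict, *, git_changed: bool,
              prune: bool, self_heal: bool) -> list[tuple[str, str]]:
    """Toy automated-sync decision for one Application.

    desired: resources rendered from Git, keyed by "kind/name".
    live:    resources in the cluster tracked by this Application.
    """
    # Drift without a new commit is only corrected when selfHeal is on.
    if not (git_changed or self_heal):
        return []
    actions = []
    for key, spec in desired.items():
        if key not in live:
            actions.append(("create", key))
        elif live[key] != spec:
            actions.append(("update", key))
    if prune:
        # Tracked resources that are no longer in Git get removed.
        actions.extend(("delete", key) for key in live if key not in desired)
    return actions


ledger = {"XGarage/projeto-ae86": {"stage": "stage2"}}
road = {"XGarage/projeto-ae86": {"stage": "stage3"},  # a parking-lot hack
        "ConfigMap/removed-from-git": {"data": {}}}
print(plan_sync(ledger, road, git_changed=False, prune=True, self_heal=True))
# [('update', 'XGarage/projeto-ae86'), ('delete', 'ConfigMap/removed-from-git')]
```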

Two ArgoCD patterns that show up in every serious platform:

  • App-of-apps: a “root” Application whose content is… other Applications. You apply it ONCE by hand, and it pulls in the rest of the platform. It’s the official bootstrap pattern.
  • Sync waves: argocd.argoproj.io/sync-wave: "3" annotations that order the apply within a sync. You can’t torque the cylinder head before seating the block, and you can’t apply a ProviderConfig before its CRD exists.
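Within a sync, ArgoCD applies lower waves first and waits for each wave to be healthy before moving on. Resources without the annotation sit in wave 0, and negative values are allowed. The sketch below models only that ordering. plan_waves is a hypothetical helper, and the wave numbers in the example are illustrative rather than the ones whisperops uses. Real ArgoCD also orders by sync phase and by kind within a wave.

```python
from itertools import groupby

WAVE = "argocd.argoproj.io/sync-wave"


def wave_of(manifest: dict) -> int:
    # No annotation means wave 0; the value is a string that may be negative.
    annotations = manifest.get("metadata", {}).get("annotations", {})
    return int(annotations.get(WAVE, "0"))


def plan_waves(manifests: list[dict]) -> list[tuple[int, list[str]]]:
    """Group manifests by sync wave, lowest wave first (ordering only)."""
    ordered = sorted(manifests, key=wave_of)  # stable sort keeps file order per wave
    return [
        (wave, [f'{m["kind"]}/{m["metadata"]["name"]}' for m in group])
        for wave, group in groupby(ordered, key=wave_of)
    ]


bundle = [
    {"kind": "ProviderConfig",
     "metadata": {"name": "in-cluster", "annotations": {WAVE: "3"}}},
    {"kind": "Provider",
     "metadata": {"name": "provider-kubernetes", "annotations": {WAVE: "1"}}},
    {"kind": "DeploymentRuntimeConfig",
     "metadata": {"name": "provider-kubernetes-runtime", "annotations": {WAVE: "0"}}},
]
for wave, names in plan_waves(bundle):
    print(wave, names)
```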

The triangle: who writes, who delivers, who assembles 🔺

Now the magic — how the three connect. The short answer: they never talk to each other directly. They communicate through Git and the API server.

[Diagram: the triangle. (1) The dev (the customer) fills in Backstage (the counter, which writes); (2) Backstage does a git push of one file to Git/Gitea (the ledger); (3) Backstage creates the Application via the API; (4) ArgoCD (the shop foreman, which delivers) syncs from Git; (5) ArgoCD applies the XR to the cluster; (6) Crossplane (the assembly line, which builds) expands it via reconcile into N real resources: Namespaces, RBAC, Deployments, buckets, IAM…]

The division of responsibilities:

  1. Backstage writes. The scaffolder renders the order (an XR) and pushes it to Git. It has no credential to create any bucket — and that’s a feature: the portal’s attack surface is “write YAML into a repo.”
  2. ArgoCD delivers. It carries the order from the ledger to the cluster and ensures it stays there, identical, forever. It also doesn’t know what a bucket is — to it, the XR is just YAML like any other.
  3. Crossplane assembles. It takes the XR and expands it into the N real resources, with continuous reconcile. It doesn’t know a portal or a Git repo exists.

Each tool has ONE job. You can swap Backstage for another portal (or for a git push by hand — we’ll do exactly that shortly) without touching the rest. You can swap Gitea for GitHub. You can add Kyverno policies in the middle without any of the three knowing. This combination even has a reference-stack name in the community: the BACK stack (Backstage, ArgoCD, Crossplane, Kyverno).

Why not just let Backstage create the resources directly via API? Because then the order isn’t recorded anywhere — no audit trail, no rollback via git revert, no continuous reconcile. The Git in the middle of the path is what turns “a script that creates things” into “a platform that maintains things.”

Hands-on: the local lab 🔧

Time to get our hands dirty. We’ll build the complete triangle on your machine using idpbuilder, the tool from the CNOE community that spins up a local IDP with one command: a kind cluster with Gitea + ArgoCD + ingress, all talking to each other, certificates and DNS resolved (the cnoe.localtest.me domain points at 127.0.0.1).

Prerequisites: Docker running, kubectl, helm, git, and curl.

Step 1 — install idpbuilder and create the IDP:

The install is a single binary: download, extract, run. No magic.

# macOS Apple Silicon — for Linux/Intel swap "darwin-arm64" for
# "linux-amd64" (or your OS-arch pair; see the releases page)
curl -fsSL -o idpbuilder.tar.gz \
  https://github.com/cnoe-io/idpbuilder/releases/latest/download/idpbuilder-darwin-arm64.tar.gz
tar xzf idpbuilder.tar.gz idpbuilder
./idpbuilder version

# optional: put it on the PATH — from here on I just call `idpbuilder`
# (if you skip this line, use ./idpbuilder in the next commands)
sudo install -m 0755 idpbuilder /usr/local/bin/

# spin up the IDP: kind cluster + Gitea + ArgoCD + nginx ingress (~2 min).
# --use-path-routing serves everything under ONE hostname (cnoe.localtest.me:8443
# /gitea, /argocd…) — the same mode the Backstage package in Step 7 uses,
# so the URLs don't change midway through the lab.
idpbuilder create --use-path-routing

# service credentials
idpbuilder get secrets
# ArgoCD UI: https://cnoe.localtest.me:8443/argocd
# Gitea UI:  https://cnoe.localtest.me:8443/gitea

Step 2 — install Crossplane:

helm repo add crossplane-stable https://charts.crossplane.io/stable
helm repo update
# --wait holds until the control plane is up
helm install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --create-namespace --wait

Step 3 — install the provider and the functions:

provider-kubernetes is the provider that applies arbitrary K8s manifests as Managed Resources — perfect for the lab because it needs no cloud credential at all. Mind the order within the file: the DeploymentRuntimeConfig comes before the Provider that references it.

# provider-and-functions.yaml
# 1st: pin the provider's ServiceAccount name. Without this, Crossplane
# generates the SA with a hash suffix (provider-kubernetes-abc123) and the
# static ClusterRoleBinding in the next block doesn't match until you
# update the Provider. Stable name = stable RBAC.
apiVersion: pkg.crossplane.io/v1beta1
kind: DeploymentRuntimeConfig
metadata:
  name: provider-kubernetes-runtime
spec:
  serviceAccountTemplate:
    metadata:
      name: provider-kubernetes
---
# 2nd: the Provider is a PACKAGE: Crossplane pulls the image, installs the
# CRDs (Object, ProviderConfig), and brings up the controller pod. This takes
# ~1 min — keep that latency in mind, it becomes a gotcha in Part 2.
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-kubernetes
spec:
  package: xpkg.upbound.io/crossplane-contrib/provider-kubernetes:v1.2.1
  runtimeConfigRef:
    name: provider-kubernetes-runtime   # ← the fixed-name SA above
---
# The two off-the-shelf functions the Composition uses.
apiVersion: pkg.crossplane.io/v1
kind: Function
metadata:
  name: function-go-templating
spec:
  package: xpkg.upbound.io/crossplane-contrib/function-go-templating:v0.11.0
---
apiVersion: pkg.crossplane.io/v1
kind: Function
metadata:
  name: function-auto-ready
spec:
  package: xpkg.upbound.io/crossplane-contrib/function-auto-ready:v0.6.4

kubectl apply -f provider-and-functions.yaml

# wait for everything to be INSTALLED=True HEALTHY=True
kubectl get providers.pkg.crossplane.io,functions.pkg.crossplane.io

Now the provider needs two things: permission (RBAC) to create resources in the cluster, and a ProviderConfig telling it how to authenticate.

# provider-rbac-and-config.yaml
# The provider pod runs with the ServiceAccount we pinned above;
# we give it cluster-admin because this is a single-tenant lab. In
# production, restrict it to a ClusterRole with exactly the kinds your
# Compositions emit.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: provider-kubernetes-cluster-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    # this name is only stable because the DeploymentRuntimeConfig pinned it
    name: provider-kubernetes
    namespace: crossplane-system
---
# "InjectedIdentity" = use the provider pod's own ServiceAccount
# to talk to the API server. The standard for acting on the SAME cluster.
apiVersion: kubernetes.crossplane.io/v1alpha1
kind: ProviderConfig
metadata:
  name: in-cluster
spec:
  credentials:
    source: InjectedIdentity

kubectl apply -f provider-rbac-and-config.yaml

(whisperops uses exactly this trio, with the RuntimeConfig in a sync wave before the Provider’s — the same ordering lesson comes back in Part 2.)

Step 4 — apply the XRD and the Composition (the xrd.yaml and composition.yaml from the previous section):

kubectl apply -f xrd.yaml
kubectl apply -f composition.yaml

# the XRD needs to become ESTABLISHED=True — that's the moment
# Kubernetes starts accepting orders of kind XGarage
kubectl get xrd
# NAME                         ESTABLISHED   OFFERED   AGE
# xgarages.blog.opsbogus.dev   True                    15s

kubectl get compositions
# NAME              XR-KIND   XR-APIVERSION                 AGE
# xgarage-default   XGarage   blog.opsbogus.dev/v1alpha1    10s

Step 5 — first test, no Git yet. Apply the order directly and watch the assembly line work:

kubectl apply -f xr.yaml    # the XGarage "projeto-ae86"

# the order
kubectl get xgarage
# NAME           SYNCED   READY   COMPOSITION       AGE
# projeto-ae86   True     True    xgarage-default   30s

# the parts it expanded
kubectl get objects.kubernetes.crossplane.io
# NAME                     KIND        PROVIDERCONFIG   SYNCED   READY   AGE
# garage-ae86-namespace    Namespace   in-cluster       True     True    40s
# garage-ae86-spec-sheet   ConfigMap   in-cluster       True     True    40s

kubectl get namespace garage-ae86
kubectl get configmap -n garage-ae86 spec-sheet -o yaml

Now the test that sells the concept — try to pull a hack:

# delete the ConfigMap by hand (the "parking-lot hack")
kubectl delete configmap -n garage-ae86 spec-sheet

How long until it comes back? Here lies a nuance worth learning early: the provider does not watch the wrapped part. It detects the drift on the next poll of the Object MR, and the provider-kubernetes default poll interval is 10 minutes. The XR reconcile itself runs every ~60s, but it only ensures the MR exists with the right spec; what notices that the ConfigMap is gone is the provider’s poll. To avoid waiting, force a reconcile by touching the MR, since any update to the MR enqueues it immediately:

kubectl annotate objects.kubernetes.crossplane.io garage-ae86-spec-sheet \
  reconcile.crossplane.io/now="$(date +%s)" --overwrite

kubectl get configmap -n garage-ae86 spec-sheet
# it's back 🪄 — the reconcile reassembled the part.

Step 6 — the full GitOps loop. Now we’ll do BY HAND what Backstage would do — because Backstage performs no magic, it does exactly this:

# local Gitea credential — WARNING: the generated password comes FULL of
# special characters the shell interprets ([, }, ?, *, $…).
# Always paste it in SINGLE QUOTES, or zsh blows up with
# "bad pattern" / "no matches found".
idpbuilder get secrets -p gitea   # user: giteaAdmin
GITEA_PASS='<PASTE-THE-PASSWORD-HERE>'

# and to embed it in the remote URL, it needs URL-encoding
# (a raw "?" in the middle of the password breaks URL parsing — git thinks
# the port isn't a number):
GITEA_PASS_ENC=$(printf '%s' "$GITEA_PASS" | \
  python3 -c "import sys,urllib.parse; print(urllib.parse.quote(sys.stdin.read(), safe=''))")

# 1. create the repository via API (what publish:gitea would do)
#    (in -u the password goes raw — double quotes suffice, curl doesn't parse a URL here)
curl -k -X POST "https://cnoe.localtest.me:8443/gitea/api/v1/user/repos" \
  -u "giteaAdmin:${GITEA_PASS}" -H "Content-Type: application/json" \
  -d '{"name": "garage-ae86", "default_branch": "main"}'

# 2. assemble the content: the order inside manifests/
mkdir -p garage-ae86/manifests && cd garage-ae86
cp ../xr.yaml manifests/xgarage.yaml
git init -b main && git add . && git commit -m "order: stage2 on ae86"

# 3. push (self-signed cert → sslVerify=false; ENCODED password in the URL)
git -c http.sslVerify=false remote add origin \
  "https://giteaAdmin:${GITEA_PASS_ENC}@cnoe.localtest.me:8443/gitea/giteaAdmin/garage-ae86.git"
git -c http.sslVerify=false push -u origin main
# (got the URL wrong on the first try and git complains "remote origin
#  already exists"? Fix it with: git remote set-url origin "<correct-URL>")

# 4. register the Application (what cnoe:create-argocd-app would do)
#    (we're inside garage-ae86/ after the cd above, hence the ../)
kubectl apply -f ../application.yaml
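The URL-encoding step in the block above guards against a specific failure mode, and Python’s urllib.parse.urlsplit reproduces it in a few lines. It is not git’s URL parser, so treat this as an illustration of the same ambiguity rather than git’s exact behavior. The password value is made up.

```python
from urllib.parse import quote, unquote, urlsplit

# Made-up password containing a "?"; the host and path mirror the lab.
password = "pa?ss"
base = "cnoe.localtest.me:8443/gitea/giteaAdmin/garage-ae86.git"

raw = urlsplit(f"https://giteaAdmin:{password}@{base}")
try:
    raw.port
except ValueError as err:
    # The "?" ends the authority early: "ss@cnoe..." becomes the query,
    # and "pa" is read as the port.
    print("raw URL:", err)

encoded = urlsplit(f"https://giteaAdmin:{quote(password, safe='')}@{base}")
print("encoded URL:", encoded.hostname, encoded.port, unquote(encoded.password))
```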

The XR from Step 5 already exists in the cluster — ArgoCD simply adopts it, because the content in Git is identical to what’s running. From now on, the ledger is in charge.

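There: the triangle is closed. Now edit the order in Git (change stage: stage2 to stage: stage3, commit, push) and watch ArgoCD sync and Crossplane update the ConfigMap. ArgoCD polls Git roughly every 3 minutes by default, so give it a moment, or hit Refresh on the Application in the UI. You never touched the cluster again, just the project ledger.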

# follow the propagation
kubectl get application -n argocd garage-ae86 -w
kubectl get configmap -n garage-ae86 spec-sheet -o jsonpath='{.data.stage}'

Step 7 (optional) — the real Backstage. idpbuilder installs the complete portal (Backstage + Keycloak + Argo Workflows) via the CNOE community’s reference package. One honest caveat: packages only register at cluster creation, so the command recreates the kind from scratch (~6 min) — the state from steps 2–6 evaporates. The good news: since we’ve been on path-routing since Step 1, all the URLs stay the same, and re-applying steps 2–4 is copy-paste of the same files. Good engines reassemble fast.

idpbuilder create --recreate --use-path-routing \
  -p https://github.com/cnoe-io/stacks//ref-implementation

From here on it’s step-by-step in the portal:

7.1 — Login. Open https://cnoe.localtest.me:8443/ and click Sign In. Backstage delegates authentication to Keycloak: user user1, password in the USER_PASSWORD field of the idpbuilder get secrets output.

7.2 — Register the template. First, push template.yaml + the skeleton/ folder to a Gitea repository — the same API + push flow from Step 6, in a repo called garage-template. Just one piece I haven’t shown yet: the skeleton, which is literally xr.yaml with placeholders in place of the literals:

# skeleton/manifests/xgarage.yaml — what fetch:template renders.
# By default, fetch:template processes ALL skeleton files as
# templates, substituting ${{ values.x }} with the form inputs.
apiVersion: blog.opsbogus.dev/v1alpha1
kind: XGarage
metadata:
  name: garage-${{ values.team_name }}
spec:
  teamName: ${{ values.team_name }}
  stage: ${{ values.stage }}

In Backstage: Create… → Register Existing Component, pointing at the template’s raw URL: https://cnoe.localtest.me:8443/gitea/giteaAdmin/garage-template/raw/branch/main/template.yaml

7.3 — Order the kit. Create… again: the Stage Kit card now appears alongside the CNOE example templates. Fill in team_name and pick the stage — notice the name regex is validated in real time, in the browser, before anything touches the cluster. An invalid order never leaves the counter.

7.4 — Watch the execution. When you click Create, the scaffolder runs the steps in the template’s order — fetch:template → publish:gitea → cnoe:create-argocd-app — logging each one in real time.

7.5 — Check the triangle. The new repo is in Gitea (/gitea/giteaAdmin/garage-<team>), the Application in ArgoCD (/argocd/applications), and the XR running in the cluster (kubectl get xgarage). The flow is identical to what you did by hand in Steps 5–6 — and that’s why we did it by hand first.

Debug tip: crossplane beta trace xgarage projeto-ae86 shows the full tree of the XR with each part’s state — the equivalent of popping the hood with the engine running. (The CLI installs with brew install crossplane or via the official script at docs.crossplane.io.)

End of Part 1. You have the triangle working locally. Now let’s see what happens when this pattern meets a real problem.


Part 2 — whisperops: the real shop

whisperops is my test bench for these concepts: a platform on GCP where anyone can create, through Backstage, an AI agent that analyzes a CSV — two LLM agents (planner + worker, orchestrated by kagent, running Gemini via Vertex AI), a Python sandbox to run analyses, a web chat, budget enforcement, and full observability. All on a single VM running kind, with the entire IDP layer (Gitea, Keycloak, ArgoCD, Backstage) brought in by the same idpbuilder from Part 1’s lab.

[Diagram: a GCP VM running a kind cluster, in three layers.
  • IDP layer (CNOE/idpbuilder): Gitea · Keycloak · ArgoCD · Backstage
  • Platform layer: Crossplane · providers · Kyverno · kagent · LGTM · Reflector · dataset-watcher
  • Agent layer (1 namespace per agent): chat-frontend · planner · worker · sandbox, plus XAgentBudget (the Level 2 wastegate)
  External call: Vertex AI (Gemini 2.5 Pro)]

What matters for this lesson is the core: three custom resources in increasing order of complexity. I’ll present them as three levels of tuning:

  1. XDataset — the catalog of homologated fuels (not even Crossplane!)
  2. XAgentBudget — the electronic wastegate with cutoff (3-function pipeline)
  3. XDatasetAgent — the complete Stage 3 kit (6-function pipeline, 22 parts)

Level 1 — XDataset: the catalog of homologated fuels ⛽

Every agent analyzes a dataset, and the datasets live in a GCS bucket. The design question: how does the Backstage form know which datasets exist? And how does the Composition validate that the requested dataset is real?

whisperops’s answer is a registry: one XDataset resource per CSV in the bucket. And here comes the first lesson, which sounds like a trick question:

XDataset is NOT a Crossplane XRD. It’s a plain Kubernetes CRD.

# xdataset-xrd.yaml (the filename is misleading — read the comment)
# XDataset doesn't compose anything — it's a pure data record, managed
# by the dataset-watcher controller, which owns the status and the
# ready transitions. An XRD would require a Composition owning the
# reconcile, which would CONFLICT with the controller writing status.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: xdatasets.whisperops.io
spec:
  group: whisperops.io
  scope: Cluster
  names:
    kind: XDataset
    plural: xdatasets
    shortNames: [xds, datasets]
  versions:
    - name: v1alpha1
      served: true
      storage: true
      subresources:
        status: {}            # separate status: only the controller writes it
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                gcsPath:
                  type: string
                  pattern: "^gs://.+\\.csv$"   # validation at admission
                displayName:
                  type: string
                sizeBytes:
                  type: integer
                  minimum: 1
              required: [gcsPath, displayName, sizeBytes]
            status:
              type: object
              properties:
                ready: { type: boolean }
                sizeHuman: { type: string }
                lastSeen: { type: string, format: date-time }

The lesson: not everything needs to be an XR. If the resource composes nothing — if it’s just a record with a clear owner — a plain CRD with a controller is the right tool. Forcing an XRD here would create an ownership conflict over the status.

The catalog is kept by the dataset-watcher, a Python controller (~430 lines in the main module) that runs a reconcile every 30 seconds. Think of it as the shop’s robot stockroom clerk, checking the fuel inventory:

  1. Lists the *.csv in the gs://whisperops-datasets/ bucket.
  2. For each blob, upserts an XDataset (normalizing the name: Athlete_Recovery.csv becomes athlete-recovery, but spec.gcsPath preserves the blob’s original name).
  3. In parallel, publishes a Backstage Resource entity to a Git catalog repository — this is where the form’s dropdown feeds from.
  4. If the CSV disappears from the bucket, the CR and the entity disappear together.
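The four steps boil down to a set difference between the bucket listing and the existing records. The sketch below is hypothetical. It is not the dataset-watcher’s code, and it leaves out the GCS and Kubernetes clients as well as the sizeBytes and displayName fields. normalize is one plausible naming rule that reproduces the text’s example of Athlete_Recovery.csv becoming athlete-recovery.

```python
import re
from pathlib import PurePosixPath


def normalize(blob_name: str) -> str:
    """Hypothetical naming rule: file stem, lowercased, non-alphanumerics -> '-'."""
    stem = PurePosixPath(blob_name).stem.lower()
    return re.sub(r"[^a-z0-9]+", "-", stem).strip("-")


def plan(blobs: list[str], existing: set[str], bucket: str) -> dict:
    """One reconcile pass: which records to upsert and which to delete."""
    desired = {
        # spec.gcsPath keeps the blob's original name; only the CR name is normalized
        normalize(blob): {"gcsPath": f"gs://{bucket}/{blob}"}
        for blob in blobs
        if blob.endswith(".csv")
    }
    return {
        "upsert": desired,                          # XDataset CRs + catalog entities
        "delete": sorted(existing - set(desired)),  # CSV gone: drop CR and entity together
    }


result = plan(
    ["Athlete_Recovery.csv", "notes.txt"],
    existing={"athlete-recovery", "old-dataset"},
    bucket="whisperops-datasets",
)
print(result["upsert"])
print(result["delete"])
```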

Point 3 closes an elegant circuit. The Backstage form doesn’t use a hardcoded enum; it uses an EntityPicker filtering catalog entities:

# excerpt from the dataset-whisperer template — the dynamic dropdown
dataset_ref:
  title: Dataset
  type: string
  ui:field: EntityPicker          # dropdown fed by the catalog
  ui:options:
    catalogFilter:
      kind: Resource
      spec.type: dataset           # only entities published by the watcher

Result: make upload-datasets uploads a new CSV, and in ~1 minute it shows up in the form’s dropdown — zero code or template change. A new fuel arrived at the shop, the clerk labeled it, and the order form already offers the option.

The catalog in action — kubectl get xds reads like the shop’s labeled shelf (the printer columns come from the CRD itself):

kubectl get xds
NAME                 SIZE       READY   LAST_SEEN   AGE
california-housing   1.9 MiB    true    21s         3d2h
online-retail-ii     90.4 MiB   true    21s         3d2h
spotify-tracks       19.2 MiB   true    21s         3d2h
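The SIZE column is status.sizeHuman, a rendering of sizeBytes. The post doesn't show the watcher's formatter. What follows is a plausible sketch that uses binary (1024-based) units, which is consistent with the MiB values above. `human_size` is a hypothetical name.

```python
def human_size(size_bytes: int) -> str:
    """Render a byte count with 1024-based units and one decimal, e.g. '1.9 MiB'."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    value = float(size_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if value < 1024:
            return f"{value:.0f} B" if unit == "B" else f"{value:.1f} {unit}"
        value /= 1024
    return f"{value:.1f} TiB"
```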

And who reads the record? The agent’s Composition, in its first station. It uses a Crossplane mechanism called extra resources: the function declares “I need the XDataset called california-housing,” and Crossplane fetches it. One important detail: the mechanism fetches any resource by group+kind+name, so it works with a plain CRD. That is one more reason not to force an XRD on the registry. We’ll see the code for this read in Level 3.

The complete registry architecture, in a diagram:

gs://whisperops-datasets/        the physical stock (the source)
          │ listed every 30s
          ▼
dataset-watcher                  the stockkeeper
     │ upsert                          │ publishes entities
     ▼                                 ▼
XDataset CRs                     Backstage catalog
the K8s catalog                  the front-desk catalog
↑ read by the Composition        ↑ read by the EntityPicker

One source of truth (the bucket), three projections (CR, entity, pipeline context) — each consumer reads the projection it understands, and none of them needs a GCS credential beyond the watcher itself.

🙃 Garage confession: I got carried away here, I’ll admit. A simple upload button in Backstage would have solved the problem — but I wanted to explore the registry pattern, write a custom controller, see the EntityPicker fed dynamically… Sometimes you buy the forged camshaft for an engine that didn’t even need it, just to see how it beds in. The upside: you just got the full tour of the pattern.

Level 2 — XAgentBudget: the electronic wastegate 💸

An LLM agent spends money on every token. Without control, an agent stuck in a loop is an engine with the wastegate jammed shut: boost climbs until it blows — except here what blows is the invoice. XAgentBudget is the electronic wastegate with cutoff: it measures the pressure continuously and, if it crosses the limit, cuts the fuel.

Before the “how,” the “why” of the architecture. The first version was a 470-line imperative controller, and it hit the classic bug. Three different components (the controller, a probe on the frontend, and an error classifier) each inferred “is this agent paused?” on their own, and their answers diverged. The result was a false “Budget Exhausted” banner every time the planner hiccupped. The rewrite as a Composition fixes that at the root: the XR’s status becomes the single source of truth, written by a pipeline and read by everyone.

The XRD (abridged) shows the contract:

# xrd.yaml of XAgentBudget — note the scope
apiVersion: apiextensions.crossplane.io/v2   # the v2 XRD API is the one with spec.scope
kind: CompositeResourceDefinition
metadata:
  name: xagentbudgets.whisperops.io
spec:
  scope: Namespaced          # the XR lives INSIDE the agent's namespace
  group: whisperops.io
  names:
    kind: XAgentBudget
    shortNames: [xab, budget]
  versions:
    - name: v1alpha1
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                agentName:  { type: string }
                budgetUsd:  { type: number, minimum: 0, maximum: 10000 }
                pricingRef:                  # per-token price table
                  type: object               # (external ConfigMap — changing
                  properties:                # price needs no rebuild)
                    name: { type: string }
                    namespace: { type: string }
                enforcement:
                  type: string
                  enum: [enabled, monitor-only]   # bench mode: measures, doesn't cut
                  default: enabled
            status:
              type: object
              properties:
                spentUsd: { type: number }
                ratio:    { type: number }       # spend / budget
                paused:   { type: boolean }      # THE source of truth
                cause:
                  type: string
                  enum: [running, budget-exhausted, agent-unreachable, unknown]

The Composition is a pipeline of 3 stations + dyno, running every ~60 seconds (Crossplane’s default poll interval — more on that below):

Every ~60s, the reconcile runs the line:

  fetch-spend   measures the pressure (Mimir)        → ctx: spend_usd
  decide        compares vs. limit, writes status    → ctx: should_pause
  render        acts: replicas 0 or 1

  decide → XR.status.paused (the source of truth)
  render → Object MRs patch planner + worker

Station 1 — fetch-spend (Python, with the function-sdk-python): queries Mimir (Prometheus), summing the agent’s token counters and multiplying by the price table:

# function-budget-fetch-spend: the boost gauge
import httpx
from crossplane.function import response

# For each token type (input/output/cached), sum the increase()
# over the XR's lifetime window and convert to USD via the price table
# (ConfigMap mounted at /etc/pricing; changing price is config only).
spend = 0.0
try:
    for metric, price_key in (
        ("whisperops_tokens_input_total", "input_per_million"),
        ("whisperops_tokens_output_total", "output_per_million"),
        ("whisperops_tokens_cached_input_total", "cached_input_per_million"),
    ):
        unit_price = prices[price_key] / 1_000_000
        # the model matcher matters: the price table is per-model
        expr = (
            f'sum(increase({metric}{{agent_name="{agent}",model="{MODEL_NAME}"}}'
            f"[{window_s}s])) * {unit_price}"
        )
        ...  # run expr against Mimir (elided) and add the USD result to spend

# Design decision: FAIL OPEN. If Mimir goes down, spend = 0.0 and the
# agent keeps running: an observability outage can never pause a
# healthy agent. The price of this: silent under-enforcement while
# Mimir is down.
except httpx.HTTPError as e:
    response.warning(rsp, f"mimir query failed: {e}; using spend=0.0")
    spend = 0.0

# The result goes into the pipeline CONTEXT — not the status.
# Context = a note passed station to station, dies at the end of the
# reconcile. Status = what gets written to the XR.
rsp.context["spend_usd"] = spend
rsp.context["window_sec"] = window_s
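To exercise the gauge without a cluster, the sketch below pulls the same logic into pure functions. It builds the three PromQL expressions and reads each result from a Prometheus instant-query response at `data.result[0].value[1]`, the standard HTTP API shape that Mimir also serves. It also fails open. The `query` callable stands in for the real HTTP call and is assumed to raise `OSError` (for example `ConnectionError`) on failure. The function names and the prices in the test are hypothetical.

```python
from typing import Callable, Mapping, Optional, Tuple

TOKEN_METRICS = (
    ("whisperops_tokens_input_total", "input_per_million"),
    ("whisperops_tokens_output_total", "output_per_million"),
    ("whisperops_tokens_cached_input_total", "cached_input_per_million"),
)


def spend_expr(metric: str, agent: str, model: str, window_s: int, unit_price: float) -> str:
    """PromQL: tokens consumed in the window, times USD per token."""
    return (
        f'sum(increase({metric}{{agent_name="{agent}",model="{model}"}}'
        f"[{window_s}s])) * {unit_price}"
    )


def scalar_from_vector(payload: dict) -> float:
    """Read the single sample of an instant-query vector; no series means no spend."""
    result = payload["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def fetch_spend(
    query: Callable[[str], dict],
    agent: str,
    model: str,
    window_s: int,
    prices: Mapping[str, float],
) -> Tuple[float, Optional[str]]:
    """Return (spend_usd, warning). Fails open: any query error yields (0.0, message)."""
    spend = 0.0
    try:
        for metric, price_key in TOKEN_METRICS:
            unit_price = prices[price_key] / 1_000_000
            expr = spend_expr(metric, agent, model, window_s, unit_price)
            spend += scalar_from_vector(query(expr))
    except OSError as e:
        return 0.0, f"mimir query failed: {e}; using spend=0.0"
    return spend, None
```

Injecting `query` keeps the fail-open rule testable: a stub that raises proves an outage can never produce a pause.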

Station 2 — decide: the pure logic. Compares, decides, and writes the status — the only place the truth is written:

# function-budget-decide — the ECU
ratio = spend / budget if budget > 0 else 0.0

if deletion_ts:                      # XR being deleted? unpause
    should_pause = False             # (hand the car back running)
elif enforcement == "monitor-only":  # bench mode: measures, doesn't cut
    should_pause = False
else:
    should_pause = ratio >= 1.0      # 100% of budget = cutoff

# write the source of truth into the XR's status
resource.update(rsp.desired.composite, {
    "status": {
        "spentUsd": round(spend, 4),
        "ratio": round(ratio, 4),
        "ratioPct": f"{ratio * 100:.2f}%",   # for kubectl get xab
        "paused": should_pause,
        "cause": "budget-exhausted" if should_pause else "running",
    },
})
# and pass the decision to the next station via context
rsp.context["should_pause"] = should_pause

Station 3, render: the actuation. It emits two provider-kubernetes Object MRs that patch spec.replicas on the planner and worker Deployments:

# function-budget-render — the fuel cut
def _make_object_mr(role: str, namespace: str, replicas: int) -> dict:
    return {
        # NAMESPACED flavor of Object (.m.) — required because the
        # XAgentBudget is a namespaced XR, and in Crossplane v2 a
        # namespaced XR only composes namespaced MRs. The legacy Object
        # (kubernetes.crossplane.io/v1alpha2) is cluster-scoped and
        # fails with "cannot apply cluster scoped composed resource".
        "apiVersion": "kubernetes.m.crossplane.io/v1alpha1",
        "kind": "Object",
        "metadata": {"name": f"{namespace}-{role}-replicas",
                     "namespace": namespace},
        "spec": {
            # Observe + Update, NO Create and NO Delete:
            # - don't create: the Deployment already exists (kagent made it)
            # - don't delete: MR GC must not bring down the Deployment
            "managementPolicies": ["Observe", "Update"],
            "forProvider": {
                # deliberately SPARSE manifest: only the field we
                # want to own. Server-side apply makes the provider
                # own spec.replicas and NOTHING else — a complete
                # manifest here would steal ownership of every
                # field from the kagent operator.
                "manifest": {
                    "apiVersion": "apps/v1",
                    "kind": "Deployment",
                    "metadata": {"name": role, "namespace": namespace},
                    "spec": {"replicas": replicas},
                },
            },
            "providerConfigRef": {"kind": "ClusterProviderConfig",
                                  "name": "in-cluster"},
        },
    }

And who consumes the source of truth? The chat-frontend does a GET on the XR on every request (no cache, on purpose — unpausing becomes visible within one cycle) and maps it to HTTP with honest semantics: 402 Payment Required only when paused && cause == "budget-exhausted"; 503 is reserved for infrastructure failure. It was exactly this separation that killed the false banner.
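The frontend's mapping is small enough to write down. This sketch follows the rule as described: 402 only for a budget cut, 503 for infrastructure trouble. Several parts are assumptions and not the frontend's actual code: the function name, the convention that `None` means "the XR could not be read," treating a pause with any other cause as 503, and returning 200 otherwise.

```python
from typing import Optional


def http_status_for(budget_status: Optional[dict]) -> int:
    """Map an XAgentBudget .status (or None if the GET failed) to the chat's HTTP status."""
    if budget_status is None:
        return 503                       # couldn't read the source of truth: infra failure
    paused = budget_status.get("paused", False)
    cause = budget_status.get("cause", "unknown")
    if paused and cause == "budget-exhausted":
        return 402                       # Payment Required: the only "budget" answer
    if paused:
        return 503                       # paused for any other cause: not the user's budget
    return 200                           # running: serve the request
```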

The nuances worth the lesson:

  • The 60 seconds of latency aren’t configured anywhere: it’s Crossplane’s default poll interval. An agent can overshoot its budget by up to one cycle’s worth of spend before the cutoff. A conscious trade-off: halving the interval would double the Mimir queries per agent.
  • There’s no ownership fight over spec.replicas: the patch is server-side apply with its own field manager, owning ONE field. And ArgoCD never reverts it because the planner/worker Deployments aren’t even managed by it — the one that creates them is the kagent operator. Layers that don’t see each other don’t fight.
  • kubectl get xab is the runbook: the printer columns (BUDGET, SPENT, RATIO, PAUSED, CAUSE) make the XR itself the incident summary.
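Putting numbers on the first bullet: the worst-case overshoot is roughly one poll interval of spend (ignoring metric scrape delay), and query load scales inversely with the interval. This back-of-the-envelope sketch assumes one Mimir query per token type per reconcile, i.e. three, as in fetch-spend. The burn rate in the test is hypothetical.

```python
SECONDS_PER_DAY = 86_400
QUERIES_PER_RECONCILE = 3   # assumption: one query per token type (input, output, cached)


def max_overshoot_usd(burn_usd_per_min: float, interval_s: int) -> float:
    """Worst case: the limit is crossed right after a reconcile, so a full interval runs unchecked."""
    return burn_usd_per_min * interval_s / 60


def mimir_queries_per_day(interval_s: int, agents: int = 1) -> int:
    """Queries per day sent to Mimir by the budget pipelines of `agents` agents."""
    return SECONDS_PER_DAY // interval_s * QUERIES_PER_RECONCILE * agents
```

At a hypothetical $0.10/min burn, a 60 s interval lets at most about $0.10 slip past a $5.00 budget (2%). Each agent costs 4,320 queries/day at 60 s and 8,640 at 30 s.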

The wastegate in action, at two moments:

# at cruise — 29% of budget consumed, agent running:
kubectl get xab -n agent-housing-bot
NAME          BUDGET   SPENT    RATIO    PAUSED   CAUSE
housing-bot   5.00     1.4602   29.20%   false    running

# …after a day of heavy questions — crossed 100%, cut off ✂️
kubectl get xab -n agent-housing-bot
NAME          BUDGET   SPENT    RATIO     PAUSED   CAUSE
housing-bot   5.00     5.0213   100.43%   true     budget-exhausted

# and the cutoff is visible in the engine — planner and worker zeroed out:
kubectl get deploy -n agent-housing-bot planner worker
NAME      READY   UP-TO-DATE   AVAILABLE   AGE
planner   0/0     0            0           3d2h
worker    0/0     0            0           3d2h

Level 3 — XDatasetAgent: the complete Stage 3 kit 🏎️

Now the boss fight. The XDatasetAgent (XDA) is the XR that represents an entire agent — and its Composition expands from ONE declaration into 22 resources spanning Kubernetes and GCP. It’s the Stage 3 kit: forged engine, turbo, intercooler, ECU, and even the Level 2 electronic wastegate installed at the factory.

First, the before-and-after that justifies everything. In the previous version of whisperops, the Backstage skeleton had ~20 Nunjucks templates — each agent resource was a .njk file rendered by the scaffolder and applied by ArgoCD. It worked, but with two costs: the template was a monster to maintain, and ArgoCD managed dozens of resources that mutated at runtime, requiring 4 blocks of ignoreDifferences to keep it from fighting with the budget kill-switch, with budget top-ups, and with Kyverno defaults.

After the rewrite, what ArgoCD applies is one file (plus the Backstage catalog-info.yaml, which sits at the repo root, out of its reach):

# skeleton/manifests/xdatasetagent.yaml.njk — the complete order.
# The Composition renders the agent's 22 resources from this.
apiVersion: whisperops.io/v1alpha1
kind: XDatasetAgent
metadata:
  name: ${{ values.agent_name }}
spec:
  crossplane:
    compositionRef:
      name: xdatasetagent-default    # v2 way of choosing the manual
  agentName: ${{ values.agent_name }}
  datasetRef:
    name: ${{ values.dataset_id }}   # validates against the catalog (Level 1)
  budgetUsd: ${{ values.budget_usd }}  # becomes an XAgentBudget (Level 2)
  description: "${{ values.description }}"
  baseDomain: ${{ values.base_domain }}
  projectId: ${{ values.project_id }}

And the 4 blocks of ignoreDifferences? They vanished. ArgoCD now manages a single resource — the order — and all runtime mutations happen on the composed parts, below its line of sight. This is perhaps the most important architectural lesson in the post: shrinking the surface managed by GitOps dissolves the conflicts between selfHeal and runtime mutation, instead of administering them exception by exception.

The Composition declares the line with 6 stations + dyno:

# xdatasetagent-default.yaml — the complete assembly line
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: xdatasetagent-default
spec:
  compositeTypeRef:
    apiVersion: whisperops.io/v1alpha1
    kind: XDatasetAgent
  mode: Pipeline
  pipeline:
    - step: validate-dataset      # 1. check the fuel in the catalog
      functionRef: { name: function-xda-validate-dataset }
    - step: compute-tuning        # 2. size the engine to the dataset
      functionRef: { name: function-xda-compute-tuning }
    - step: render-iam            # 3. GCP parts: bucket, SA, IAM
      functionRef: { name: function-xda-render-iam }
    - step: render-workloads      # 4. K8s parts: 15 resources
      functionRef: { name: function-xda-render-workloads }
    - step: render-dashboard      # 5. instrument panel (Grafana)
      functionRef: { name: function-xda-render-dashboard }
    - step: emit-budget           # 6. install the wastegate (XAgentBudget!)
      functionRef: { name: function-xda-emit-budget }
    - step: ready                 # final dyno
      functionRef: { name: function-auto-ready }

The data flow between stations uses the two channels you already know from Level 2 — and here the distinction becomes vital:

  1. validate-dataset: checks the dataset (Level 1)
  2. compute-tuning: sizes the sandbox
  3. render-iam: 5 GCP MRs (bucket · SA · IAM)
  4. render-workloads: 15 K8s MRs (the bulk of the engine)
  5. render-dashboard: 1 ConfigMap (Grafana dashboard)
  6. emit-budget: 1 nested XR (XAgentBudget!)

The two data channels:
  • CONTEXT: a ticket passed between stations; dies at the end of the reconcile
  • XR.STATUS: the durable record; visible in kubectl get xda

Station by station, with the nuance of each:

1. validate-dataset — check the fuel. Reads the Level 1 XDataset via extra resources. The pattern has a two-phase bootstrap subtlety that catches everyone off guard:

# The function DECLARES what it needs; Crossplane fetches and RE-INVOKES.
selector = rsp.requirements.extra_resources["xdataset"]
selector.api_version = "whisperops.io/v1alpha1"
selector.kind = "XDataset"
selector.match_name = dataset_ref

# On the FIRST invocation, req.extra_resources comes EMPTY — the
# requirement hasn't been met yet. The function needs to return
# early with an honest state (phase=Validating) instead of failing.
# On the next invocation, the XDataset comes populated.
if not xds_items:
    resource.update(rsp.desired.composite, {"status": {
        "phase": "Validating",
        "conditions": [{"type": "Ready", "status": "Unknown",
                        "reason": "BootstrappingExtraResources", ...}],
    }})
    return rsp

# Dataset doesn't exist or not-ready? Fail EARLY, with a readable cause.
# Better an XR Failed with "DatasetNotFound" than a sandbox crashing
# 10 minutes later with a GCS error.

Validated, it puts {name, gcsPath, sizeBytes, displayName} into the context.
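The two-phase dance is easier to see as a pure decision over what the function received. This sketch uses plain dicts in place of the SDK's protobuf messages. `Validating` and `DatasetNotFound` come from the post; `DatasetNotReady`, the function name, and the return shape are hypothetical. It also assumes one way to tell "not fetched yet" apart from "doesn't exist": checking whether the requirement key was answered at all.

```python
from typing import List, Optional, Tuple


def validate_dataset(
    requirement_met: bool, xds_items: List[dict]
) -> Tuple[str, Optional[dict]]:
    """Return (phase_or_reason, context_entry). The context is filled only once validated."""
    if not requirement_met:
        # first invocation: Crossplane hasn't fetched the XDataset yet, so wait honestly
        return "Validating", None
    if not xds_items:
        # the requirement was answered but nothing matched: the order names missing fuel
        return "DatasetNotFound", None
    xds = xds_items[0]
    if not xds.get("status", {}).get("ready", False):
        return "DatasetNotReady", None
    spec = xds["spec"]
    return "Validated", {
        "name": xds["metadata"]["name"],
        "gcsPath": spec["gcsPath"],
        "sizeBytes": spec["sizeBytes"],
        "displayName": spec["displayName"],
    }
```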

2. compute-tuning — size the engine. A bigger carburetor demands more fuel; a bigger dataset demands more memory in the sandbox. The heuristic: pandas takes ~3.5× the CSV size in RAM; round up to the next GiB, with a floor of 1 GiB and a ceiling of 8 GiB:

import math

PANDAS_BLOAT_FACTOR = 3.5     # validated against the real datasets
GIB_IN_MIB = 1024
MIN_SANDBOX_MIB = 1 * GIB_IN_MIB   # floor: 1 GiB
MAX_SANDBOX_MIB = 8 * GIB_IN_MIB   # ceiling: 8 GiB

def compute_sandbox_mem_mi(size_bytes: int) -> int:
    raw_mib = math.ceil((size_bytes * PANDAS_BLOAT_FACTOR) / (1024 * 1024))
    gib_rounded = math.ceil(raw_mib / GIB_IN_MIB) * GIB_IN_MIB
    return max(MIN_SANDBOX_MIB, min(gib_rounded, MAX_SANDBOX_MIB))
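Here is the same heuristic as a formula, with S the CSV size in bytes and the results in MiB. Three worked sizes follow: the first is the online-retail-ii dataset shown earlier, and the other two are hypothetical.

```latex
\begin{aligned}
r(S) &= \left\lceil \frac{3.5\,S}{2^{20}} \right\rceil,
\qquad m(S) = \max\bigl(1024,\ \min(1024\,\lceil r(S)/1024 \rceil,\ 8192)\bigr) \\[4pt]
S = 90.4\ \text{MiB} &\;\Rightarrow\; r \approx 317 \;\Rightarrow\; m = 1024\ \text{MiB (1 GiB)} \\
S = 1.5\ \text{GiB} &\;\Rightarrow\; r = 5376 \;\Rightarrow\; m = 6144\ \text{MiB (6 GiB)} \\
S = 3\ \text{GiB} &\;\Rightarrow\; r = 10752 \;\Rightarrow\; m = 8192\ \text{MiB (ceiling)}
\end{aligned}
```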

This station also carries the trickiest footgun of the Python SDK: the resource.update(rsp.desired.composite, {"status": {...}}) does a SHALLOW update — the entire status block is REPLACED, erasing what the previous station wrote. The fix is read-merge-write:

# Read the ACCUMULATED desired status (which came from previous stations),
# merge the new field, and write the complete block back.
# Without this, sandboxMemMi would erase phase/datasetFmt/conditions
# written by validate-dataset.
desired_status = (
    resource.struct_to_dict(rsp.desired.composite.resource).get("status") or {}
)
desired_status["sandboxMemMi"] = sandbox_mem_mi
resource.update(rsp.desired.composite, {"status": desired_status})
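Why the merge matters, reduced to plain dicts: a top-level update replaces the whole status value, so fields written by an earlier station vanish. This only models the behavior the post describes. `shallow_update` and `merge_status` are hypothetical stand-ins, not the SDK itself.

```python
import copy


def shallow_update(composite: dict, patch: dict) -> None:
    """Models the footgun: top-level keys in `patch` REPLACE the existing values."""
    composite.update(patch)


def merge_status(composite: dict, **fields) -> None:
    """Read-merge-write: start from the accumulated status, add fields, write it all back."""
    status = copy.deepcopy(composite.get("status") or {})
    status.update(fields)
    shallow_update(composite, {"status": status})
```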

3. render-iam — the GCP parts. Emits 5 Managed Resources from Crossplane’s GCP providers: the agent’s bucket, a ServiceAccount, a ServiceAccountKey, and two ProjectIAMMember — with IAM CEL conditions restricting each grant to exactly the right bucket (viewer on the shared datasets bucket, admin only on the agent’s own bucket). Least-privilege per agent, generated by code.
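What those CEL conditions can look like: in IAM Conditions, GCS resources appear as projects/_/buckets/BUCKET, with objects under .../objects/OBJECT. Matching that exact prefix pins a grant to one bucket and its objects. This sketch builds the two grants as plain dicts. The role choices (roles/storage.objectViewer and roles/storage.objectAdmin), the agent bucket name, and the dict shape are all assumptions. None of them is the provider's exact ProjectIAMMember schema.

```python
from typing import Dict, List


def bucket_condition(bucket: str) -> str:
    """CEL expression matching one bucket and the objects inside it, and nothing else."""
    prefix = f"projects/_/buckets/{bucket}"
    # the trailing "/" keeps "whisperops-datasets-other" from matching "whisperops-datasets"
    return f'resource.name == "{prefix}" || resource.name.startsWith("{prefix}/")'


def agent_grants(agent: str, sa_email: str, datasets_bucket: str = "whisperops-datasets") -> List[Dict]:
    own_bucket = f"agent-{agent}-bucket"   # hypothetical naming, mirroring the trace output
    member = f"serviceAccount:{sa_email}"
    return [
        {"role": "roles/storage.objectViewer", "member": member,
         "condition": {"title": "read-shared-datasets",
                       "expression": bucket_condition(datasets_bucket)}},
        {"role": "roles/storage.objectAdmin", "member": member,
         "condition": {"title": f"admin-own-bucket-{agent}",
                       "expression": bucket_condition(own_bucket)}},
    ]
```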

4. render-workloads — the bulk of the engine. 15 Kubernetes manifests wrapped in Object MRs: the agent’s Namespace, prompts ConfigMap, NetworkPolicy, ModelConfig, and the two kagent Agent CRs (planner + worker), sandbox (Deployment + Service + RemoteMCPServer, the kagent MCP tool-server), chat-frontend (SA + RoleBinding + Deployment + Service + Ingress), and a Kyverno policy. Two design decisions make this station work:

# DECISION 1 — the chicken-and-egg of the namespace:
# The XDA is cluster-scoped because it CREATES the namespace the parts
# live in (you can't live inside what you haven't built yet).
# But a namespaced Object MR needs to exist in a namespace that ALREADY
# exists at create time. Solution: the MR lives in crossplane-system
# (always exists), while the wrapped manifest points at the agent's
# namespace — provider-kubernetes tries to apply, fails while the
# namespace doesn't exist, and converges on its own when the Namespace
# part settles. Eventual consistency doing the work.
return {
    "apiVersion": "kubernetes.m.crossplane.io/v1alpha1",
    "kind": "Object",
    "metadata": {"name": mr_name, "namespace": "crossplane-system"},
    "spec": {"forProvider": {"manifest": manifest},   # ← target: agent-{name}
             "providerConfigRef": {"kind": "ClusterProviderConfig",
                                    "name": "in-cluster"}},
}

# DECISION 2 — two flavors of Object coexisting:
# The Namespace (cluster-scoped) uses the LEGACY Object
# (kubernetes.crossplane.io/v1alpha2), because the .m. flavor only has
# a namespaced Object. The namespaced parts use the .m. flavor by
# CONVENTION of Crossplane v2 — in Part 1 we wrapped a ConfigMap
# in the legacy Object and it worked: a cluster-scoped XR CAN compose
# legacy Objects wrapping namespaced manifests. The HARD scope rule
# ("namespaced XR only composes namespaced MR") only binds on the
# Level 2 XAgentBudget. Two provider configs coexist, both
# "in-cluster": a ProviderConfig (legacy group) and a
# ClusterProviderConfig (.m. group).
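The two decisions combine into one small wrapping rule. This sketch follows the conventions above: the legacy Object for the cluster-scoped Namespace, and the .m. Object parked in crossplane-system for everything else. The helper name, the MR naming scheme, the set of cluster-scoped kinds, and the legacy providerConfigRef shape are assumptions.

```python
CLUSTER_SCOPED_KINDS = {"Namespace"}   # assumption: the only cluster-scoped part here


def wrap_in_object(agent: str, suffix: str, manifest: dict) -> dict:
    """Wrap a Kubernetes manifest in the provider-kubernetes Object flavor its scope requires."""
    mr_name = f"agent-{agent}-{suffix}"
    if manifest["kind"] in CLUSTER_SCOPED_KINDS:
        # legacy, cluster-scoped Object: the MR itself has no metadata.namespace
        return {
            "apiVersion": "kubernetes.crossplane.io/v1alpha2",
            "kind": "Object",
            "metadata": {"name": mr_name},
            "spec": {"forProvider": {"manifest": manifest},
                     "providerConfigRef": {"name": "in-cluster"}},
        }
    # namespaced .m. Object: the MR lives in crossplane-system, the manifest targets agent-{name}
    return {
        "apiVersion": "kubernetes.m.crossplane.io/v1alpha1",
        "kind": "Object",
        "metadata": {"name": mr_name, "namespace": "crossplane-system"},
        "spec": {"forProvider": {"manifest": manifest},
                 "providerConfigRef": {"kind": "ClusterProviderConfig", "name": "in-cluster"}},
    }
```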

5. render-dashboard — the instrument panel. Renders one Grafana dashboard per agent (a ConfigMap from a JSON template). A veteran’s detail: the substitution uses str.replace with sentinels (__AGENT__), never str.format() — the Grafana JSON’s braces would blow up .format(). And the result goes through json.loads before becoming a ConfigMap, to fail at reconcile and not silently in Grafana.
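The sentinel trick in a few lines. str.replace leaves Grafana's braces alone, while .format() would try to read every {...} as a field. json.loads then turns a broken template into a reconcile-time error. The ConfigMap name and data key are hypothetical.

```python
import json

SENTINEL = "__AGENT__"


def render_dashboard(template: str, agent: str) -> dict:
    """Substitute the sentinel, validate the JSON, and wrap it in a ConfigMap manifest."""
    rendered = template.replace(SENTINEL, agent)
    json.loads(rendered)  # raises json.JSONDecodeError (a ValueError) on a broken template
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": f"{agent}-dashboard"},   # hypothetical name
        "data": {"dashboard.json": rendered},         # hypothetical key
    }
```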

6. emit-budget — the factory wastegate. The final station emits… another XR. The Level 2 XAgentBudget is born here, as a composed part of the XDA:

# Composition-of-compositions: the XDA emits an XAgentBudget, which has
# its OWN pipeline (fetch-spend → decide → render) running on its
# own reconcile cycle. Emitting an XR is identical to emitting an
# MR — only the wrapped kind changes.
xab = {
    "apiVersion": "whisperops.io/v1alpha1",
    "kind": "XAgentBudget",
    "metadata": {
        # CONTRACT: the XAB name == the agent name. The chat-frontend
        # looks up getXAgentBudget("agent-{name}", "{name}") — a
        # mismatch breaks the chat with a misleading 503.
        "name": agent,
        "namespace": f"agent-{agent}",
    },
    "spec": {
        "crossplane": {"compositionRef": {"name": "xagentbudget-default"}},
        "agentName": agent,
        "budgetUsd": budget_usd,
        "pricingRef": {"name": "whisperops-pricing",
                       "namespace": "crossplane-system"},
        "enforcement": "enabled",
    },
}
resource.update(rsp.desired.resources["xagentbudget"], xab)

This nesting is what lets Level 2 exist as an independent product: the budget has its own lifecycle, reconcile, and API — the XDA just instantiates it. Like the electronic wastegate you buy separately, but which the Stage 3 kit ships pre-installed.

The complete inventory, to settle the count of the 22 parts:

STATION            PARTS   WHAT
render-iam         5       Bucket, ServiceAccount, ServiceAccountKey, 2× ProjectIAMMember (GCP)
render-workloads   15      Namespace, prompts CM, NetworkPolicy, ModelConfig, 2× Agent (kagent), sandbox (Deploy+Svc+RemoteMCPServer), chat-frontend (SA+RB+Deploy+Svc+Ingress), Kyverno policy
render-dashboard   1       ConfigMap with the agent’s Grafana dashboard
emit-budget        1       XAgentBudget (nested XR)
Total              22

And the result in the terminal — the XR with the printer columns derived from the status the stations wrote:

kubectl get xda
NAME          DATASET              BUDGET   READY   URL                                                   AGE
housing-bot   California Housing   5.00     True    https://agent-housing-bot.34.61.7.12.sslip.io:8443/   12m

And popping the hood with crossplane beta trace — the XR tree with the 22 parts hanging off it (abridged output):

crossplane beta trace xdatasetagent housing-bot
NAME                                                SYNCED   READY
XDatasetAgent/housing-bot                           True     True
├─ Object/agent-housing-bot-namespace               True     True
├─ Object/agent-housing-bot-agent-planner           True     True
├─ Object/agent-housing-bot-agent-worker            True     True
├─ Object/agent-housing-bot-sandbox-deployment      True     True
├─ Object/agent-housing-bot-chat-frontend-deploy…   True     True
├─ Bucket/agent-housing-bot-bucket                  True     True
├─ ServiceAccount/agent-housing-bot-sa              True     True
├─ XAgentBudget/housing-bot                         True     True
└─ (14 more parts)

The complete flow: from form to chat 💬

Putting the three levels together, the path from a click in Backstage to an agent holding a conversation:

  1. DEV fills in 4 fields: name · description · dataset · budget
  2. fetch:template renders the skeleton (1 XR + catalog-info)
  3. publish:gitea creates the repo agent-<name> and pushes
  4. cnoe:create-argocd-app creates the Application (path: manifests/)
  5. ArgoCD sync → kubectl apply of the XDatasetAgent
  6. Crossplane runs the line: 6 stations → 22 parts
  7. kagent brings up planner + worker (Vertex AI)
  8. chat-frontend is live → the dev chats with the dataset

Two template tricks worth noting:

  • Hidden parameters with sed-bake: base_domain and project_id are ui:widget: hidden fields with sentinel defaults (__BASE_DOMAIN__). At deploy, the bootstrap queries the VM’s metadata server and “bakes” the real values into the template via sed. The dev never types an IP or project ID — and because the hidden fields have validation regex, a failed sed-bake breaks the form loudly, instead of scaffolding with a placeholder.
  • path: manifests on the Application: the catalog-info.yaml sits at the repo root (for Backstage to discover) and out of ArgoCD’s reach (which only looks at manifests/). Without this, ArgoCD would try to apply a Backstage entity as a K8s resource and stay eternally OutOfSync.
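The bake-then-validate idea as a sketch. The `__BASE_DOMAIN__` sentinel is from the post. The `__PROJECT_ID__` sentinel, the regexes, and the helper names are hypothetical: the real bootstrap uses sed, and the real patterns live in the template. The mechanism is the same, though. If the bake silently fails, the sentinel survives and the pattern rejects it.

```python
import re

PATTERNS = {
    # hypothetical: lowercase DNS-style names with at least one dot (e.g. 34.61.7.12.sslip.io)
    "base_domain": re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)+"),
    # GCP project IDs: 6-30 chars, lowercase letter first, no trailing hyphen
    "project_id": re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]"),
}


def bake(template: str, values: dict) -> str:
    """Replace every __KEY__ sentinel with its value (what the bootstrap's sed does)."""
    for key, value in values.items():
        template = template.replace(f"__{key.upper()}__", value)
    return template


def field_is_valid(field: str, value: str) -> bool:
    """The form-side guard: an unbaked sentinel never matches the field's pattern."""
    return PATTERNS[field].fullmatch(value) is not None
```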

And Day-2? Changing a live agent’s budget, swapping the dataset, destroying the agent — these are all Backstage templates too, and they all respect GitOps: the change goes to Git first, and ArgoCD applies it. An in-cluster Job does a GET on the file via the Gitea API (capturing the SHA), edits the exact line with sed/awk, does a PUT with the previous SHA (optimistic concurrency), and annotates the Application with refresh=hard so it doesn’t wait for the poll. No kubectl patch on the live resource — the shop foreman would revert it in seconds, and rightly so.
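The Day-2 edit is a compare-and-swap on a file. This sketch covers the pure part only. It assumes Gitea's contents API shape: the GET returns the file's base64 `content` and its `sha`, and the update PUT sends the new base64 `content` plus the previously read `sha`, so a concurrent edit makes it fail. The HTTP calls are left out, and the function name is hypothetical.

```python
import base64
import re


def build_budget_update(get_response: dict, new_budget_usd: float, message: str) -> dict:
    """Edit the budgetUsd line of the fetched manifest and build the PUT body."""
    text = base64.b64decode(get_response["content"]).decode("utf-8")
    edited, count = re.subn(
        r"^([ \t]*budgetUsd:[ \t]*).*$",
        lambda m: f"{m.group(1)}{new_budget_usd:g}",
        text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError(f"expected exactly one budgetUsd line, found {count}")
    return {
        "content": base64.b64encode(edited.encode("utf-8")).decode("ascii"),
        "sha": get_response["sha"],   # the SHA we read: a concurrent edit makes the PUT fail
        "message": message,
    }
```

Once the PUT succeeds, the argocd.argoproj.io/refresh: hard annotation on the Application triggers the immediate refresh described above.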

The destroy deserves its diagram, because the order is the lesson:

destroy-agent (order matters):

  0. Turn off the Application’s selfHeal (otherwise ArgoCD RECREATES the XR!)
  1. kubectl delete xdatasetagent (the cascade tears down the 22 parts)
  2. Remove finalizers + delete the Application
  3. Delete the namespace (defensive)
  4. Delete the repo on Gitea (the source of truth, last!)
  5. Purge the entities from Backstage
  6. Audit event to Loki (who · what · when)

Deleting the repo before the XR fails in the opposite direction: selfHeal loses its source, but the XR is left behind as an orphan. In GitOps, teardown is choreography: you silence the reconciler, dismantle the state, and only then erase the project from the ledger.

Lessons from the shop (what I broke so you don’t have to) ⚠️

Consolidating the nuances that appeared along the way, plus a few that only show up in production:

  1. The CRD-establish race, at three scales. A Crossplane Provider is a package: the CR syncs in seconds, but the CRDs it installs take minutes. If the ProviderConfig is in the same Application, ArgoCD’s dry-run fails (“kind doesn’t exist”). whisperops defends in three layers: separate Applications with sync waves (providers in wave 3, config in wave 5), retry with exponential backoff, and SkipDryRunOnMissingResource=true annotated on the resource. The same race reappears inside the providers app (the DeploymentRuntimeConfig must precede the Provider) and in the content (example XRs are excluded from the sync with exclude: "examples/*", because applying an XR before the XRD establishes brings down the whole sync).
  2. ignoreDifferences alone isn’t enough — it only changes the diff. For selfHeal not to overwrite the field, you need RespectIgnoreDifferences=true in syncOptions. And the mature version of this lesson: if you need many ignoreDifferences, maybe the problem is the surface ArgoCD manages — shrink it (that was exactly the XDA rewrite).
  3. Sync waves order the apply, not the readiness — and sync-wave annotations on Crossplane-composed resources are decorative (the one that creates them is Crossplane, which ignores them). For real ordering between async layers, whisperops uses a PreSync hook that polls the prerequisite — converting async state into a hard precondition.
  4. An XRD’s spec.names is immutable in Crossplane v2 (CEL self == oldSelf). Adding a shortName on a live cluster is rejected at admission; it only lands on a recreate. Plan the names in the first version.
  5. The function-sdk-python footguns: proto Struct has no .get() (convert with resource.struct_to_dict), nested assignment on a Struct blows up (use resource.update), the status update is shallow (read-merge-write), and req.extra_resources is a MessageMap the default converter doesn’t understand (iterate the keys). None of these show up in a tutorial; all of them show up in the first real pipeline.
  6. Context propagates — as long as each station cooperates. In the Python SDK, response.to(req) copies the received context into the response; a function that builds the response by hand, without that helper, silently discards the note and the following stations see nothing. whisperops re-emits the critical keys explicitly, as belt-and-suspenders. And the distinction still holds: context is a station note (dies at the end of the reconcile); status is the durable record.
  7. packagePullPolicy: Always on functions with the :latest tag — otherwise the digest gets cached and the function pods keep running old code after a rebuild, silently.
  8. Fail open where the measuring system’s failure can’t punish the measured (fetch-spend with Mimir down), and fail early and readably where the order is invalid (DatasetNotFound at station 1, regex on the form, limits on the XRD). Layered validation — browser, admission, pipeline — gives feedback at the cheapest possible point.
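Lesson 6 in miniature. This toy model uses plain dicts instead of the SDK's request and response messages. `to` only imitates what the post says response.to(req) does, which is copy the incoming context; nothing here is the real SDK. It shows how one hand-built response in the middle silently empties the context for every station downstream.

```python
from typing import Callable, Dict, List

Station = Callable[[dict], dict]


def to(req: dict) -> dict:
    """Imitates response.to(req): the response starts with a copy of the received context."""
    return {"context": dict(req["context"])}


def validate(req: dict) -> dict:
    rsp = to(req)
    rsp["context"]["dataset"] = "california-housing"
    return rsp


def hand_built_tuning(req: dict) -> dict:
    # builds the response from scratch: whatever arrived in the context is dropped
    return {"context": {"sandbox_mem_mi": 1024}}


def render(req: dict) -> dict:
    rsp = to(req)
    rsp["context"]["saw_dataset"] = "dataset" in req["context"]
    return rsp


def run_pipeline(stations: List[Station]) -> Dict:
    """Feed each station's response context into the next station's request."""
    context: dict = {}
    for station in stations:
        context = station({"context": context})["context"]
    return context
```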

Build your own 🛠️

The checklist to adapt this to your context, in implementation order:

  1. Start with the empty triangle: idpbuilder create, Crossplane via Helm, provider-kubernetes. Reproduce Part 1’s lab up to the “edit it in Git and watch it propagate” step. If that isn’t solid, the rest collapses.
  2. Model ONE lean abstraction. Pick the resource your team requests most (a web service? a database? a bucket?) and design the XRD with the minimum of fields — if the form needs more than 5, the abstraction is leaking.
  3. Composition in Pipeline mode from day 1, even if only with a generic function (function-go-templating) + function-auto-ready. Migrating from patch-and-transform later hurts more.
  4. When the logic grows, write your own functions (Python or Go). Validation with readable failure first, render after, auto-ready always last. Durable status on the XR; context only between stations.
  5. Backstage comes last — when the manual flow (push + Application) is solid. The scaffolder template is just the form in front of what you already proved works.
  6. Day-2 is born GitOps: every mutation goes to Git first. If you catch yourself writing kubectl patch in a runbook, back up two squares.
  7. Record the races you lose. CRD-establish, namespace chicken-and-egg, selfHeal vs runtime — they all have a declarative solution (waves, retry, hooks, scoping). The parking-lot hack holds until the shop foreman walks by.

The measure of success is the same as the shop’s: the customer picks the kit at the counter, signs a one-page form, and days of hand-crafted work become minutes of assembly line — with the foreman ensuring every car on the road is identical to the project in the ledger. When your kubectl get shows an XR Ready: True with 22 parts hanging off it, you’ll understand why I call it an engine that reassembles itself.

