Whisperops — LAB — opsbogus.dev

The initial idea behind this project was to put a few DevOps and SRE concepts to the test and, as a bonus, try out an AI feature directly.

Key Highlights

There are a few ways to look at this project, which gradually turned into a small monster.

From an end-user perspective, this is an extremely simple project. The user creates their own agent to analyze any CSV file. Behind the scenes, the platform creates two agents that talk to each other: a planner and a worker. The user interacts with the planner through a chat frontend, asking questions about the dataset. The worker has predefined instructions to run Python commands in a sandbox, analyze the dataset, generate a deterministic response, and, when needed, add an interactive chart to the chat response.
From the application perspective, there are some interesting resources here as study material:
1. The first and most relevant one is the use of an IDP (Backstage using CNOE) integrated with ArgoCD and Crossplane Compositions. Instead of having several Nunjucks templates, we have an XDatasetAgent Composition going through a Crossplane pipeline with 6 functions (validate-dataset → compute-tuning → render-iam → render-workloads → render-dashboard → emit-budget) that, at the end of the chain, creates and manages around 22 Kubernetes and GCP resources. There is full multi-tenant control over each deployment, where, directly from the IDP UI, it is possible to manage the lifecycle of each agent-app by changing configurations (budget control, dataset changes, etc.) and deleting the entire context in an isolated and safe way. All actions performed are also auditable and observable.
2. Each agent-app has a budget that can be configured through the IDP, with full observability across the agent lifecycle. Once spend reaches 100% of the configured budget, the chat stops working.
3. The project includes a fairly complete observability bundle built on a full LGTM stack (OTel Collector, Grafana Alloy, Mimir, Loki, and Tempo), delivering 9 extremely detailed dashboards, plus one additional dashboard for each deployed agent-app (also controlled by the IDP + Compositions flow). This might be one of the strongest parts of the project.
1. Cluster Health (USE Method) 🖥️ The health of the machine underneath everything. CPU, memory, disk, and network Utilization/Saturation/Errors for the node, plus OOMKills, container restarts, and pods outside the Running state. If the e2-standard-8 VM starts choking, this is where you see it first.
2. LLM Platform Overview 🤖 The top-level product view. Chat RED metrics (req/min, error %), TTFT p50/p95, input and output tokens/min, active agents, conversation turns, and sandbox executions. The single pane to know whether the agent platform is healthy.
3. Agent Lifecycle Activity 📋 Day-2 audit trail. How many agents were scaffolded, actions in the last 24h, who performed them (top actors), recent destroys, and breakdown by agent/type. It answers “who changed what and when” through Loki logs.
4. SLO Compliance 🎯 The reliability contracts. Availability 99%, TTFT@30s 95%, sandbox success 99%, with remaining error budget and multi-window burn rate. This is the dashboard that triggers, and justifies, the SLO alerts.
5. Service Map 🕸️ The topology via Tempo spanmetrics + service-graphs. Dependency graph (chat→planner→worker→sandbox A2A), calls/min, p95 and error % by service, slowest operations, and inter-service edges. Where latency lives along the path.
6. Cost & Token Economics 💰 The money. Lifetime cost, spend rate $/min, total vs remaining budget per agent, token throughput (in/out/cached), cache hit rate, accumulated cache savings, and cost per turn. Calibrated for Gemini pricing.
7. RED Method (per service) 🔴 Granular RED by service, not aggregated. Rate/Errors/Duration split across chat-frontend, sandbox, and planner, including sandbox errors by type. When dashboard 2 (LLM Platform Overview) points to a problem, this one isolates which service is to blame.
8. Apdex per Agent 😊 User satisfaction translated into a number. Apdex per agent (T=30s, tolerable=90s, calibrated for Pro): satisfied/tolerating/frustrated by TTFT, percentiles, heatmap, and which agents fell into the “frustrated zone” (<0.7).
9. ArgoCD + Crossplane Platform Health ⚙️ The health of the GitOps/control-plane layer. ArgoCD apps Synced/Healthy, reconcile latency, sync churn, Kyverno decisions/violations by policy, Crossplane pod phases, and provider restarts. The dashboard for “the platform that runs the platform.”
10. Per-Agent Detail 🔍 A dashboard automatically generated for each scaffolded agent. It shows, filtered only for that specific agent, health (availability, TTFT, Apdex), cost (spend vs budget), token/sandbox/resource usage, plus logs and traces. It is meant for deep investigation when a specific agent misbehaves.
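Two of the numbers on the dashboards above, Apdex (dashboard 8) and burn rate (dashboard 4), reduce to short formulas. The sketch below is only an illustration in plain Python, not the project's actual queries: the function names are hypothetical, the thresholds (30s satisfied, 90s tolerable) are the ones quoted above, and the burn-rate definition is the common "observed error ratio divided by allowed error ratio" convention, which is an assumption about how the dashboard computes it.

```python
def apdex(ttft_seconds, satisfied_t=30.0, tolerable_t=90.0):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    Thresholds default to the values quoted in the post; the function
    itself is a hypothetical illustration.
    """
    if not ttft_seconds:
        raise ValueError("need at least one TTFT sample")
    satisfied = sum(1 for t in ttft_seconds if t <= satisfied_t)
    tolerating = sum(1 for t in ttft_seconds if satisfied_t < t <= tolerable_t)
    # Samples above tolerable_t count as frustrated and contribute zero.
    return (satisfied + tolerating / 2) / len(ttft_seconds)


def burn_rate(error_ratio, slo_target):
    """Observed error ratio divided by the error ratio the SLO allows.

    A value of 1.0 means the error budget is being consumed exactly at the
    pace the SLO permits; higher values mean it runs out early.
    """
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed
```

For example, `apdex([10, 20, 45, 120])` evaluates to 0.625 (two satisfied, one tolerating, one frustrated), which would land in the frustrated zone (<0.7). With a 99% availability target, an observed error ratio of 2% gives a burn rate of roughly 2, meaning the error budget is being spent about twice as fast as allowed.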
Last but not least, I also configured some interesting security guardrails for this project scenario:
1. Hardened sandbox: the code generated by the LLM runs as non-root, with a read-only filesystem, no capabilities, and no K8s token, to limit what prompt injection and similar attacks can do.
2. Default-deny egress: the sandbox can only talk to GCS, DNS, and OTel; any exfiltration attempt outside the cluster is blocked by NetworkPolicy.
3. No Git Secrets: all SA keys are generated on every deployment and rotated, so there are no committed or persisted credentials.
4. IAM least privilege: each SA has only the minimum required role (Vertex = inference; agent = only its own bucket), limiting the blast radius if a key leaks.
5. Cluster-wide Kyverno admission: 6 policies block images from untrusted registries, privileged containers, and pods without resource limits.
6. Day-2 with SSO + minimal RBAC: operational actions require Keycloak login (built into CNOE) and run under an SA that cannot create agents or grant itself permissions.
7. Audit trail: every Day-2 action records the actor, action, and agent in Loki, providing full traceability over any IDP action.
8. Accepted residual risks: broad bootstrap SA, ~60s budget latency, and replicated Vertex key, all documented with a mitigation path (GKE + Workload Identity).

Other Features

Dataset control is handled through a GCP bucket, where any CSV can be uploaded directly. When creating an agent-app, the IDP displays a drop-down menu with all CSVs available in the bucket. This feature also works through Compositions (XDataset), which monitor the bucket state, validate the CSVs, perform some normalizations, and make them available in the IDP on a recurring basis.
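To make the "validate and normalize" step more concrete, here is a minimal sketch of what such a check could look like. The post does not say which validations or normalizations XDataset actually performs, so everything here is an assumption for illustration: the function names are hypothetical, and the rules (non-empty file, unique column names, consistent row width, snake_case headers) are just plausible examples.

```python
import csv
import io
import re


def normalize_header(name: str) -> str:
    """Hypothetical normalization: lowercase snake_case column names."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_")
    return cleaned.lower()


def validate_csv(text: str) -> list[str]:
    """Return the normalized header, or raise ValueError if the CSV is unusable."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("empty file")
    header = [normalize_header(col) for col in rows[0]]
    if any(not col for col in header):
        raise ValueError("blank column name")
    if len(set(header)) != len(header):
        raise ValueError("duplicate column names")
    # Every data row must have the same number of fields as the header.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            raise ValueError(
                f"row {line_no} has {len(row)} fields, expected {len(header)}"
            )
    return header
```

Calling `validate_csv("Order ID,Unit Price ($)\n1,9.5\n")` returns `["order_id", "unit_price"]`, while a file with a short row raises `ValueError`.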
Agent budget control is also handled through a Composition (XAgentBudget) that runs a 3-step pipeline: fetch-spent reads the amount spent from Mimir, decide evaluates the resulting spend-to-budget ratio, and render updates the agent Deployments according to that decision.
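The three steps can be sketched as plain functions. This is a hedged illustration of the control flow only, not the real Crossplane functions: `decide`, `render`, `run_pipeline`, the `BudgetDecision` type, and the `chat_enabled` field are all hypothetical names, the Mimir query is replaced by an injected callable, and the cut-off at a ratio of 1.0 follows the "100% of budget" rule described earlier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BudgetDecision:
    ratio: float    # spent / budget
    blocked: bool   # True once the budget is exhausted


def decide(spent_usd: float, budget_usd: float) -> BudgetDecision:
    """Hypothetical 'decide' step: compare spend against the budget."""
    if budget_usd <= 0:
        raise ValueError("budget must be positive")
    ratio = spent_usd / budget_usd
    return BudgetDecision(ratio=ratio, blocked=ratio >= 1.0)


def render(decision: BudgetDecision) -> dict:
    """Hypothetical 'render' step: the desired state to apply to the agent.

    The real pipeline updates Deployments; this dict is only a stand-in.
    """
    return {
        "chat_enabled": not decision.blocked,
        "budget_ratio": round(decision.ratio, 4),
    }


def run_pipeline(fetch_spent: Callable[[], float], budget_usd: float) -> dict:
    """fetch-spent -> decide -> render, with the metrics query injected."""
    spent = fetch_spent()  # stand-in for the Mimir query
    return render(decide(spent, budget_usd))
```

With a budget of 10.0, `run_pipeline(lambda: 5.0, 10.0)` yields `{"chat_enabled": True, "budget_ratio": 0.5}`, and a spend of 10.0 flips `chat_enabled` to `False`. Injecting `fetch_spent` keeps the decision logic testable without a metrics backend.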
The project was designed as a one-shot deployment because it is a demo. All the main resources in the project are GitOps-first, so, to speed up deployment, there is a Makefile that brings up the infrastructure with Terraform, builds the images, and pushes all the structure code to a Gitea instance inside the cluster. The deployment process takes around 30 minutes and is fully idempotent.
It is not exactly a feature, but inside the docs directory, there are many diagrams and explanations about each layer of the project. These are detailed technical documents covering each project decision. Soon, I will go deeper into some of those details in specific posts on this blog.
This project was developed entirely using Spec-Driven Development with Claude. The framework chosen to execute the project was a personal fork of AgentSpec.
