
Agentic AI · April 25, 2026 · 18 min read

Architecting the Agentic Enterprise: Gemini, ADK, A2A, and A2UI

Notes from building on this stack. What each layer actually buys you, where the demos break in production, and the calls I would tell my past self to make sooner.

Abhishek Anand · UX Engineer at Google

Agentic AI · Agent Development Kit · ADK · Gemini · A2A · A2UI · Multi-agent Systems · Data Pipelines · Enterprise AI


For most of the last few years the question I asked of a language model was simple: can it answer this? In production, the question has changed. It is now: can it execute, recover, and stay inside the rails for the next four hours, across half a dozen tools, on behalf of a person who is not watching?

That second question is much harder. It is also why a single model is no longer the unit of architecture. The unit is a stack. A model that reasons. A framework that turns probabilistic reasoning into something you can deploy and roll back. A wire protocol that lets independent agents collaborate across trust boundaries. And a UI contract that lets those agents project interfaces back to people without handing them a script tag and a prayer.

I have been building on this stack for a while now in Google’s Ads Central UX team, on agentic tooling that helps UX researchers reason over third-party reports at scale. The four layers I keep coming back to are Gemini, ADK, A2A, and A2UI. This post is what I would tell myself a year ago about each of them.

The four layers

Gemini, doing the thinking

The first habit to drop is treating the model as a text generator. In an agent, it is a probabilistic decision engine with a much narrower job description: plan, pick a tool from a registry, read the result, decide whether the plan is still valid, and either continue or back up. Gemini is tuned for this, and two properties matter most.

The first is the context window. With Gemini 1.5 Pro and onward, the agent can carry its full action history in-context across a long task. Every prior tool call, every error, every intermediate decision, all of it stays available without bolting on a retrieval system just to remember itself. That collapses an entire class of short-term-memory bugs that plague smaller-context systems. Most of the time when I see an agent appear to forget what it did three steps ago, the cause is a smaller model with a window so tight that context had to be summarised away. Pay for the bigger window. It is almost always cheaper than the bug.

The second is structured tool use. The Gemini API supports parallel function calling, where multiple independent tools fire in a single turn. It supports compositional calls, where the output of one tool is fed straight into another. And it enforces a strict id contract that pairs every functionCall with its corresponding functionResponse. That id contract sounds boring until you have ten parallel tool returns landing asynchronously and a single race condition would corrupt the turn. The id is the thing that prevents that, quietly, every time. Do not invent your own.
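
To make that concrete, here is the pairing logic in miniature. The types are hand-rolled for the post rather than the SDK's own; the point they illustrate is that the id, not the tool name or the arrival order, is the join key.

parallel_calls_sketch.ts
// Hand-rolled shapes for illustration; the real SDK types differ in detail.
interface FunctionCall { id: string; name: string; args: Record<string, unknown> }
interface FunctionResponse { id: string; name: string; response: Record<string, unknown> }

type Tool = (args: Record<string, unknown>) => Promise<Record<string, unknown>>;

// Run independent calls concurrently, then pair each response to its
// originating call by id. Never match on name (duplicates exist) or on
// arrival order (races exist); the id is the only safe join key.
async function executeParallelCalls(
  calls: FunctionCall[],
  tools: Record<string, Tool>,
): Promise<FunctionResponse[]> {
  return Promise.all(
    calls.map(async (call) => ({
      id: call.id, // the id travels with the result
      name: call.name,
      response: await tools[call.name](call.args),
    })),
  );
}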

How multi-step actually works

The pattern most of us end up with is some flavour of ReAct. At each step, the model produces its reasoning before its action, then evaluates the result of the action before proposing the next one. Reflection, where the agent checks whether its last step contradicted an earlier conclusion, is what stops it from confidently walking off a cliff. The ratio of compute spent on reflection versus action is one of the two or three knobs that meaningfully changes outcomes.

A more useful mental model than “the agent decides what to do” is this: at every step, the agent samples an action from a conditional distribution over its current state. The state is the accumulated history of every prior tool call, observation, and intermediate decision. Long-horizon success is mostly a function of how well that distribution stays inside the part of action space your guardrails approve of. Phrased that way, the engineering work is obvious: shape the state, shape the tool surface, and the action distribution will mostly take care of itself.
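
Written down as code, the loop is smaller than it sounds. This is a schematic sketch, not ADK's runner; propose, execute, and reflect stand in for whatever your framework actually provides.

react_loop_sketch.ts
type Step =
  | { kind: 'act'; tool: string; args: Record<string, unknown> }
  | { kind: 'finish'; answer: string };

interface AgentState { history: string[] } // the accumulated trajectory is the state

// Schematic ReAct loop: reason, act, observe, reflect, bounded.
async function runAgent(
  goal: string,
  propose: (s: AgentState) => Promise<{ reasoning: string; step: Step }>,
  execute: (tool: string, args: Record<string, unknown>) => Promise<string>,
  reflect: (s: AgentState) => Promise<boolean>, // true = trajectory still coherent
  maxSteps = 20,
): Promise<string> {
  const state: AgentState = { history: [`goal: ${goal}`] };
  for (let i = 0; i < maxSteps; i++) {
    const { reasoning, step } = await propose(state); // reasoning before action
    state.history.push(`thought: ${reasoning}`);
    if (step.kind === 'finish') return step.answer;
    const observation = await execute(step.tool, step.args);
    state.history.push(`action: ${step.tool}`, `observation: ${observation}`);
    // Reflection: does the latest observation contradict an earlier conclusion?
    if (!(await reflect(state))) {
      state.history.push('note: contradiction found, plan revised');
    }
  }
  throw new Error('step budget exhausted'); // the bound is non-negotiable
}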

ADK, doing the structure

Cognition is necessary; it is nowhere near sufficient. A bare model in a loop is one of the easier ways to set money on fire. The loop runs, the prompt drifts, and the only intervention available is killing the process. The Agent Development Kit is the framework layer that turns a model into a software system you can reason about, deploy, and roll back.

ADK is open-source, model-agnostic in principle (it routes through LiteLLM if you want Claude or GPT under the hood), and ships in Python, TypeScript, Go, and Java. Most of the value, in practice, shows up at the orchestration layer. ADK 2.0 leaned hard into graph-based execution, and I think that was the right call.

Graphs over prompts

The single most important architectural decision in agent frameworks right now is whether the workflow is a prompt or a graph. Prompt-based systems try to talk the model into following a procedure. Graph-based systems describe the procedure deterministically and let the model fill in the interesting parts. The first feels easier on day one and breaks badly by day thirty.

ADK gives you four primitives to compose with:

  • Sequential. A then B then C. Use it where the order is fixed by logic or compliance.
  • Parallel. Five sub-agents fire concurrently against five backends; the parent waits for all of them. Most of the latency wins live here.
  • Iterative loop. Generator agent emits code, compiler agent rejects it, generator tries again, bounded by a max iteration count. The bound is what separates this from the failure mode at the start of this section.
  • Dynamic routing. The model itself acts as a router, classifying intent and dispatching to a specialist sub-agent. This is the escape hatch from the temptation to put every instruction into one giant prompt.
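
Composed together, the primitives look roughly like the sketch below. I am assuming the TypeScript surface mirrors the Python ADK's workflow agent names (SequentialAgent, ParallelAgent, LoopAgent); check the version of @google/adk you are on for the exact API.

workflow_sketch.ts
import { LlmAgent, SequentialAgent, ParallelAgent, LoopAgent } from '@google/adk';

// Tiny specialist factory so the sketch stands alone.
const specialist = (name: string, instruction: string) =>
  new LlmAgent({ name, model: 'gemini-2.5-pro', instruction });

// Parallel: fan out across backends, wait for all of them.
const fanOut = new ParallelAgent({
  name: 'fetch_signals',
  subAgents: [
    specialist('logs_agent', 'Pull the relevant stack traces.'),
    specialist('metrics_agent', 'Pull latency and volume metrics.'),
  ],
});

// Loop: generate, validate, retry, with a hard bound on iterations.
const generateUntilValid = new LoopAgent({
  name: 'generate_until_valid',
  maxIterations: 5,
  subAgents: [
    specialist('generator', 'Propose a fix as a tool invocation.'),
    specialist('validator', 'Reject the fix if it violates policy.'),
  ],
});

// Sequential: the deterministic spine that fixes the order.
export const workflow = new SequentialAgent({
  name: 'diagnose_and_fix',
  subAgents: [fanOut, generateUntilValid],
});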

The pattern that actually scales is what people call an “agent team”. A small constellation of specialist sub-agents, each with a tight prompt, a small tool surface, and a single responsibility. A root planner refines the user’s goal, gets a human-in-the-loop approval if it needs one, and delegates execution to the specialists. Below is roughly the shape of a research agent in this style, simplified for the post but built on the same bones.

adk_agent.ts
import { LlmAgent, FunctionTool } from '@google/adk';

interface Report {
  sourceId: string;
  title: string;
  publishedAt: string;
  body: string;
}

// Tool: fetch a single research report by id from the source registry.
async function fetchReport(sourceId: string): Promise<Report> {
  // ...lookup, auth, retry, schema validation
  return {} as Report;
}

export const researchAgent = new LlmAgent({
  name: 'research_agent',
  model: 'gemini-2.5-pro',
  description: 'Pulls third-party reports and summarises them on demand.',
  instruction: `You are a research agent. Given a research question,
identify the most relevant sources, fetch them in parallel using
the fetchReport tool, and produce a structured synthesis with
citations. Never fabricate a source id.`,
  tools: [new FunctionTool(fetchReport)],
});
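
On top of that specialist sits the root planner. Continuing the same file, it might look like the following; I am assuming subAgents on LlmAgent mirrors the Python ADK's sub_agents for LLM-driven delegation.

adk_agent.ts (continued)
export const rootPlanner = new LlmAgent({
  name: 'root_planner',
  model: 'gemini-2.5-pro',
  description: 'Refines the user goal and delegates to specialists.',
  instruction: `Refine the user's research question into concrete
sub-tasks. Delegate fetching and synthesis to research_agent.
Ask the user to confirm before starting anything expensive.`,
  // LLM-driven delegation: the model routes to the specialist by name.
  subAgents: [researchAgent],
});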

State, persistence, and governance

Production agents have to survive process restarts, day-long tasks, and the occasional human stepping away mid-flow. ADK paired with the Vertex AI Agent Engine gives you persistent session state, an Agent Memory Bank that curates long-term memories across interactions, and (with extensions like Restate) durable execution that resumes a long task exactly where it stopped. The first time a long-running task survived a redeploy without losing its place, I stopped thinking of agents as scripts and started thinking of them as services.

The piece that matters most for enterprise work is governance. The platform layer enforces a few non-negotiables: a sandbox for any agent-generated code execution, an Agent Registry as the single source of truth for which tools and sub-agents are approved, an Agent Gateway that applies IAM consistently, a Model Armor layer that filters prompt injection at the gateway, and an evaluator that asserts trajectories stay inside known-good logical paths. None of these are glamorous. All of them are what makes the difference between “cool demo” and “something I can ship.”

A2A, the wire between agents

The moment you have more than one agent, you have a fragmentation problem. A logistics agent built on ADK cannot natively talk to an inventory agent built on a different framework. Every team writes their own brittle bridge, and the bridges become a second product nobody wanted to build. The Agent-to-Agent (A2A) protocol, announced in April 2025 and donated to the Linux Foundation soon after, is the common-language layer that makes this go away.

A2A is not a replacement for MCP; it is orthogonal to it. MCP standardises how a single agent reaches its tools. A2A standardises how distinct agents negotiate and exchange work as peers. The two compose cleanly. An A2A-exposed agent will typically use MCP internally to reach its own tools, and expose only its capabilities, not its tool surface, to the outside world.

The architectural commitment underneath A2A is opacity. An agent should be able to be useful to another agent without leaking its prompts, its tool registry, or its memory. An external recruiting agent can ask an internal HR agent to schedule an interview; the HR agent can do the work without exposing which calendar API it calls, what fields are in the underlying record, or which model it is running on. That opacity is what makes cross-organisation agent collaboration tractable from a security perspective. Without it, every integration becomes a privacy review.

The AgentCard, and what it gets you

Every A2A-compliant agent advertises itself with an AgentCard. It is a small JSON document at a well-known URL (typically /.well-known/agent.json) describing the agent’s identity, supported input/output modalities, authentication model, and the named skills it offers. The card is the discovery primitive. A client agent finds a card, reads the skills, and constructs a delegation request against the skill it needs. Think robots.txt, but for capabilities.

/.well-known/agent.json
{
  "name": "WebSearchAgent",
  "description": "Performs grounded web search and returns cited summaries.",
  "url": "https://agents.example.com/web-search",
  "version": "1.2.0",
  "capabilities": {
    "streaming": true,
    "pushNotifications": true
  },
  "defaultInputModes":  ["text"],
  "defaultOutputModes": ["text", "structured"],
  "skills": [
    {
      "id": "search",
      "name": "Search",
      "description": "Issue a search query and return ranked results with snippets.",
      "tags": ["web", "research"],
      "examples": [
        "Find recent analyst commentary on agentic infrastructure."
      ]
    }
  ]
}

Tasks and the long-running case

Delegation in A2A is modelled as a Task with a defined lifecycle. Short tasks are synchronous and feel like an RPC call. Long tasks (a multi-step research job, a supplier negotiation that takes hours, anything that needs human intervention along the way) are asynchronous. The client fires the task, the server acknowledges, the client disconnects, and the server delivers updates and the final result via push notifications to a client-supplied webhook. The first time you ship a real long-running agent, the push model goes from “nice to have” to obvious.

The transport choices are deliberately boring. HTTP for requests, JSON-RPC for the message shape, and Server-Sent Events for streaming intermediate updates and large artifacts. Boring is the point. The lower the operational novelty in the transport layer, the easier it is to reason about firewalls, observability, and retries when something goes wrong at 3 AM.
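
A delegation request on the wire, then, looks roughly like this. The method and field names follow the A2A spec as I read it at the time of writing; treat the exact shape as illustrative rather than normative.

message/send (JSON-RPC over HTTP)
{
  "jsonrpc": "2.0",
  "id": "req-42",
  "method": "message/send",
  "params": {
    "message": {
      "role": "user",
      "messageId": "msg-001",
      "parts": [
        { "kind": "text", "text": "Find recent analyst commentary on agentic infrastructure." }
      ]
    }
  }
}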

A2UI, agent-driven interfaces without the XSS

A2A solves machine-to-machine. A2UI solves the harder problem of agent-to-human, where the right interface for a given turn is rarely the same as the interface for the previous turn. A user asking an agent to book a multi-leg flight should not get back a wall of markdown. They should get a stateful flight search form. A user asking it to summarise a document should get prose. The agent, not the application, decides which one is appropriate.

The obvious way to do this, letting the model emit HTML and render it, is also the way to ship an XSS factory. A2UI takes the opposite philosophy: extreme decoupling between what the agent describes and what the client renders.

The registry pattern

A2UI agents do not emit executable code. They emit a declarative JSON description of UI components composed from a fixed catalogue the client app registered up front. The host app ships with a hardcoded set of trusted, native widgets like Card, Button, TextField, and DatePicker, implemented in React, Lit, Flutter, or whatever framework you are using. During handshake, the client advertises which catalogue IDs it supports; the agent commits to using only those identifiers in its responses.

The descriptions themselves are typically declared with Zod schemas on the server, with the natural-language description of each component injected directly into the agent’s system prompt so it knows what it can ask for. When the client receives a payload like { "type": "date-picker", ... }, the local A2UI renderer maps that abstract identifier to its real, vetted implementation. The model never touches a DOM node. It cannot. That single property is what lets you ship this to a real user.
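
On the server, a catalogue entry in that style might look like the sketch below; the component names and prop shapes here are illustrative, not a published schema.

catalogue.ts
import { z } from 'zod';

// One schema per catalogue id the client advertised at handshake.
// The .describe() strings are what ends up in the agent's system prompt.
export const datePicker = z.object({
  type: z.literal('date-picker'),
  id: z.string(),
  props: z.object({
    label: z.string().describe('Visible label for the field'),
    bind: z.string().describe('Path into the surface data model, e.g. $.date'),
  }),
}).describe('A native, trusted date picker bound to the data model');

export const button = z.object({
  type: z.literal('button'),
  id: z.string(),
  props: z.object({
    label: z.string(),
    action: z.enum(['submit', 'cancel']).describe('Named client-side action'),
  }),
}).describe('A pre-registered button; the agent picks it, never builds it');

// The agent may only emit members of this union. Anything else is rejected
// before it reaches the renderer.
export const component = z.discriminatedUnion('type', [datePicker, button]);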

Why a flat adjacency list

The other interesting design call in A2UI is the data structure. A naive approach would represent the UI as a deeply nested tree and stream it as a single JSON blob. That fails for two reasons. First, LLMs are visibly bad at producing deeply nested JSON in a single forward pass without misclosing brackets or drifting on hierarchy depth. Second, the user experience of waiting for a complete payload before anything renders is terrible.

A2UI represents the interface as a flat adjacency list. Every component is an independent object in an array, with parent and child relationships expressed as string IDs. The server streams new components incrementally as JSON Lines over SSE. The client buffers them by surfaceId. An explicit beginRendering message with the root component’s ID tells the client when the surface is structurally complete enough to walk and paint. This kills the “flash of incomplete content” problem and, more importantly, dramatically improves the model’s reliability. Emitting a flat list of small objects is a much easier prediction task than emitting a well-balanced tree in one shot.

A2UI stream (JSONL over SSE)
// Stream order is incremental. The client buffers until beginRendering arrives.
{"type":"surfaceUpdate","surfaceId":"flightSearch","components":[
  {"id":"root","type":"column","children":["title","form"]}
]}
{"type":"surfaceUpdate","surfaceId":"flightSearch","components":[
  {"id":"title","type":"heading","props":{"text":"Find a flight"}}
]}
{"type":"surfaceUpdate","surfaceId":"flightSearch","components":[
  {"id":"form","type":"row","children":["from","to","date","go"]}
]}
{"type":"surfaceUpdate","surfaceId":"flightSearch","components":[
  {"id":"from","type":"text-field","props":{"label":"From","bind":"$.from"}},
  {"id":"to","type":"text-field","props":{"label":"To","bind":"$.to"}},
  {"id":"date","type":"date-picker","props":{"label":"Date","bind":"$.date"}},
  {"id":"go","type":"button","props":{"label":"Search","action":"submit"}}
]}
{"type":"beginRendering","surfaceId":"flightSearch","rootId":"root"}

Interactivity follows the same pattern in reverse. When a user interacts with a component, the client constructs a userAction payload, resolves any data-bound values against its local data model, and ships the event back over the A2A transport. The agent evaluates the new state and streams back targeted surfaceUpdate or dataModelUpdate messages, patching rather than re-rendering. Frameworks like CopilotKit and the AG-UI ecosystem have started providing the React/Next.js scaffolding to make all of this feel native to a modern app.
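
The return leg, with the data-bound values already resolved client-side, is a small payload. The shape below is illustrative:

userAction (client → agent)
{
  "type": "userAction",
  "surfaceId": "flightSearch",
  "componentId": "go",
  "action": "submit",
  "resolvedBindings": {
    "$.from": "ZRH",
    "$.to": "SFO",
    "$.date": "2026-05-02"
  }
}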

A worked example: agentic data pipelines

The cleanest stress test for the entire stack is enterprise data engineering. Modern data architectures rely on multi-stage pipelines. A Bronze layer for raw ingestion, a Silver layer for cleansing and enrichment, a Gold layer for business-level aggregates, all of it typically materialised in a warehouse like BigQuery and orchestrated with Dataproc or Airflow. These pipelines fail constantly, and the usual remedy is a tired SRE at 2 AM grepping logs across half a dozen tabs to figure out why the executive dashboard is showing yesterday’s numbers.

The agentic version of the same task is straightforward, and it is a useful end-to-end example because it touches every layer of the stack at once. The SRE asks an agent “why did the orders pipeline fail?” The agent fans out across telemetry through MCP-exposed tools, pulling the stack trace from Cloud Logging, the schema and freshness of the target table from BigQuery, and the upstream lineage from Dataplex. It correlates an out-of-memory exception in a Spark worker with an upstream volume spike in the Bronze layer and proposes a concrete fix: bump the Dataproc batch memory from 8GB to 16GB and rerun.

What it does not do is execute that fix on its own. That is the point.

The deterministic sandwich

The pattern that makes this safe in production is what people have started calling the deterministic sandwich. The probabilistic part of the system, the model, is encapsulated between two deterministic layers that own the consequential decisions.

  • The brain (probabilistic). Gemini reads the messy, cross-system signal and makes the diagnostic call. Pattern-matching across noisy logs is what it is good at, and what humans are slow at.
  • The safety layer (deterministic). A hardcoded policy engine intercepts every proposed action before it runs. It rejects categorically destructive ones like DROP TABLE and DELETE against production datasets, and anything that touches the warehouse’s history, regardless of how confident the model is. The model does not get to argue with the policy.
  • The hands (deterministic). The actual mitigations are vetted, unit-tested scripts in the tool registry. The model does not write restart_spark_batch.sh; it picks it from the registry and supplies arguments. If a script does not exist, the agent does not invent one.
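
The safety layer in the middle does not need to be clever, just unambiguous. A minimal sketch of the gate, with script names and patterns that are illustrative rather than exhaustive:

policy_gate.ts
// Deterministic policy gate. The model's proposals pass through here, and
// the rules are code, not prompt.
interface ProposedAction {
  script: string;
  args: string[];
}

const APPROVED_SCRIPTS = new Set(['restart_spark_batch.sh', 'bump_batch_memory.sh']);
const FORBIDDEN = [/\bDROP\s+TABLE\b/i, /\bDELETE\s+FROM\b/i, /\bTRUNCATE\b/i];

export function checkPolicy(action: ProposedAction): { allowed: boolean; reason: string } {
  // Only registry scripts may run; the model cannot invent new ones.
  if (!APPROVED_SCRIPTS.has(action.script)) {
    return { allowed: false, reason: `${action.script} is not in the registry` };
  }
  // Categorically destructive operations are rejected regardless of model confidence.
  const rendered = [action.script, ...action.args].join(' ');
  for (const pattern of FORBIDDEN) {
    if (pattern.test(rendered)) {
      return { allowed: false, reason: `matches forbidden pattern ${pattern}` };
    }
  }
  return { allowed: true, reason: 'ok' }; // still subject to human approval downstream
}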

Wrapped around all of this is a hard human-in-the-loop check for any consequential action. The agent does not self-execute the memory bump. It presents the fully formulated plan as a gcloud command for the SRE to approve. Every action driven through the agent is routed through Application Default Credentials, which means every action is recorded in Cloud Audit Logs and attributed to a real identity. There is no shadow account doing the work.

The agent is also grounded in localised context. A GEMINI.md file at the repo root captures naming conventions, business rules, and the painful gotchas of the specific data estate. Dataplex glossary tags on BigQuery columns give the agent semantic context for fields whose column names alone would not communicate it. Every time the agent makes a mistake the SRE corrects, the correction goes into local knowledge so the same mistake does not happen twice. After enough rounds, what you have is not really an agent any more. It is the codified judgment of the team that runs the pipeline, with a model wrapped around it.

Closing notes

The headline change of the last year is not that models got smarter. It is that the surrounding architecture finally got serious. Gemini does the reasoning. ADK turns the reasoning into a system you can reason about. A2A lets that system collaborate across organisational boundaries without leaking itself. A2UI lets it talk to humans without owning the render tree. None of these layers replaces the others, and trying to collapse them, by putting all the orchestration logic into a single mega-prompt, or letting the model emit raw HTML, or inventing a private inter-agent protocol, is the kind of decision that looks fine in a demo and fails in production three weeks in.

The interesting work for the next year is going to happen in two places. The first is evaluators: how do you know your agent is getting better, and not just more confident? The second is the cost model: parallel sub-agents are fast and expensive, sequential chains are slow and cheap, and almost nobody has a clean mental model for the trade-off yet. Both of those problems get easier when the underlying stack is already separated into layers you can measure, swap, and budget independently. That, more than any individual capability, is what the four layers buy you.

If you are starting on this stack now, my one piece of advice is the boring one. Do not skip the framework. The temptation to wire a model directly to a few tools and call it an agent is enormous, and so is the regret three months in when you realise you have rebuilt half of ADK without the opinions, half of A2A without the contracts, and half of A2UI without the safety. Start at the stack, not at the model. The model is the easy part.
