The word "agent" is doing a lot of work in 2026. It means at least three different things depending on who's saying it:
- A general-purpose autonomous loop (the "AGI" framing).
- A task-specific structured loop with tools, memory, and a stopping condition (the useful framing).
- A LangChain demo (the framing that's wasted the most engineering hours).
For clinical decision support, only the second framing matters. An agent is a structured loop that:
- Receives a goal.
- Has a finite, vetted set of tools.
- Reasons about state and chooses a next action.
- Stops when the goal is met, escalates when it can't be, and logs everything.
That description is recognizable to anyone who's ever written a workflow engine. The new part is that the "reason about state" step uses an LLM. The rest of the engineering is the same engineering you've always done.
The three failure modes of clinical agent systems
Before the architecture, the failure modes — because the architecture is shaped by them.
Unbounded loops. The agent thinks longer, calls more tools, reaches deeper into context, and eventually times out or produces noise. Mitigation: hard step caps, hard time caps, hard tool-call caps.
Silent confabulation. The model invents a fact that the orchestration layer accepts because the schema is loose. Mitigation: every tool output has a strict schema; every model output has a strict schema; mismatches are fatal, not warnings.
Quiet escalation failure. The agent encounters something it can't handle and produces a plausible-looking output instead of escalating. Mitigation: explicit "I don't know" pathways, calibrated confidence, and a default-to-human routing when confidence is low.
The architecture
For each clinical scenario:
A goal contract
A typed input describing the task: patient context shape, scenario type, what "done" looks like. The contract is reviewed by the clinical team. It's the part that matters most and it's the smallest.
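A goal contract can be as small as a frozen dataclass. A minimal sketch in Python, with illustrative field names (the real contract is whatever the clinical team reviews and signs off on):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoalContract:
    """Typed description of one clinical scenario's task.

    Field names here are illustrative, not a prescribed schema.
    """
    scenario: str                  # e.g. "med-reconciliation"
    patient_context_schema: dict   # JSON-Schema-style shape of the input
    required_output_fields: tuple  # what "done" looks like
    max_steps: int = 10            # reasoning budget, enforced by the orchestrator

    def is_satisfied(self, output: dict) -> bool:
        # "Done" means every required field is present and non-empty.
        return all(output.get(f) not in (None, "")
                   for f in self.required_output_fields)


contract = GoalContract(
    scenario="med-reconciliation",
    patient_context_schema={"type": "object",
                            "required": ["med_list", "allergies"]},
    required_output_fields=("reconciled_list", "flags", "confidence"),
)
```

Keeping the contract frozen and typed means the definition of "done" cannot drift at runtime; it changes only through review.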
A vetted tool set
A small list of tools the agent can call:
- Retrieve a specific clinical document
- Look up a guideline
- Check a value against a reference range
- Draft a structured output
Every tool has a schema. Every tool's output has a schema. The agent cannot call tools outside the list. The list is auditable.
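A sketch of what "cannot call tools outside the list" and "mismatches are fatal" look like in code. This uses a simple field-set check as a stand-in for full schema validation; the registry shape and names are assumptions, not a prescribed API:

```python
class ToolViolation(Exception):
    """Schema mismatches are fatal, not warnings."""


class ToolRegistry:
    def __init__(self):
        self._tools = {}  # name -> (fn, input_fields, output_fields)

    def register(self, name, fn, input_fields, output_fields):
        self._tools[name] = (fn, frozenset(input_fields), frozenset(output_fields))

    def call(self, name, args: dict) -> dict:
        # The agent cannot call anything not on the vetted list.
        if name not in self._tools:
            raise ToolViolation(f"tool not in vetted list: {name}")
        fn, in_fields, out_fields = self._tools[name]
        if set(args) != in_fields:
            raise ToolViolation(f"bad input fields for {name}: {set(args)}")
        out = fn(**args)
        if set(out) != out_fields:
            raise ToolViolation(f"bad output fields from {name}: {set(out)}")
        return out


registry = ToolRegistry()
# Illustrative tool: check a potassium value against a reference range.
registry.register(
    "check_reference_range",
    lambda analyte, value: {"in_range": 3.5 <= value <= 5.0},
    input_fields=("analyte", "value"),
    output_fields=("in_range",),
)
```

Because the registry is a plain dict of vetted entries, the auditable tool list is literally `registry._tools.keys()`.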
A reasoning loop with a budget
The loop:
- Plan: model produces a step plan grounded in the goal contract.
- Execute: model calls a tool. Output is schema-validated.
- Reflect: model assesses progress. Decides to continue, stop, or escalate.
- Repeat — with a hard step cap (often 8–12 for clinical scenarios).
The budget is enforced by the orchestrator, not requested politely from the model.
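The plan-execute-reflect loop with orchestrator-enforced caps can be sketched as follows. `model` and `tools` are stand-ins: `model(state)` is assumed to return a decision dict, and `tools(name, args)` is the schema-checked registry call from above's vetted list; the exact shapes are assumptions for illustration:

```python
import time


def run_loop(goal, model, tools, *, step_cap=10, tool_cap=20, time_cap_s=120):
    """Plan -> execute -> reflect, with the budget enforced here, not by the model."""
    state = {"goal": goal, "history": [], "tool_calls": 0}
    deadline = time.monotonic() + time_cap_s
    for _ in range(step_cap):                    # hard step cap
        if time.monotonic() > deadline:          # hard time cap
            return {"halt": "budget", "reason": "time cap"}
        decision = model(state)                  # plan / reflect
        if decision["action"] == "stop":
            return {"halt": "done", "output": decision["output"]}
        if decision["action"] == "escalate":
            return {"halt": "escalate", "reason": decision["reason"]}
        if state["tool_calls"] >= tool_cap:      # hard tool-call cap
            return {"halt": "budget", "reason": "tool-call cap"}
        result = tools(decision["tool"], decision["args"])  # schema-validated
        state["tool_calls"] += 1
        state["history"].append((decision, result))
    return {"halt": "budget", "reason": "step cap"}
```

Note that every exit path returns a structured halt reason: the model can recommend stopping, but only the orchestrator can actually stop, and it always says why.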
A halting condition
Either:
- The goal contract is satisfied (output schema present, all required fields filled, validation passed).
- The model returns "I cannot complete this task because X."
- The step budget is exhausted.
Each halting condition routes to a specific downstream action. The orchestrator decides; the model recommends.
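The routing itself can be a plain lookup table, which keeps it auditable. Action names below are illustrative:

```python
# Each halting condition maps to exactly one downstream action.
# The orchestrator performs the routing; the model only recommends.
HALT_ROUTES = {
    "done": "deliver_for_clinician_review",
    "escalate": "route_to_human_with_reason",
    "budget": "route_to_human_flag_budget_exhausted",
}


def route(halt_state: str) -> str:
    if halt_state not in HALT_ROUTES:
        # An unrecognized halting state is itself a defect: fail closed.
        raise ValueError(f"unknown halt state: {halt_state}")
    return HALT_ROUTES[halt_state]
```

Failing closed on an unknown halt state means a new exit path added to the loop cannot silently bypass routing.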
A governance pass
Before any output reaches a human, a governance pass:
- Validates output schema.
- Applies hard rules (small set of "never do this").
- Computes a confidence signal from agreement with the golden case neighborhood.
- Routes low-confidence outputs to mandatory human review.
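The governance pass is a pure function over the output: it either blocks, routes to review, or releases. A sketch, where `hard_rules` is the small set of "never do this" predicates and `confidence_fn` stands in for the golden-case agreement signal (assumed to return a float in [0, 1]):

```python
def governance_pass(output: dict, hard_rules, confidence_fn, threshold=0.8) -> dict:
    """Gate between agent output and a human. Names are illustrative."""
    # Hard rules first: any violation blocks the output outright.
    violations = [rule.__name__ for rule in hard_rules if not rule(output)]
    if violations:
        return {"route": "blocked", "violations": violations}
    # Then the confidence signal decides review vs. release.
    conf = confidence_fn(output)
    if conf < threshold:
        return {"route": "mandatory_human_review", "confidence": conf}
    return {"route": "release", "confidence": conf}


# Example hard rule: never emit a dose without a unit (e.g. "5 mg", not "5mg").
def dose_has_unit(output: dict) -> bool:
    return all(" " in dose for dose in output.get("doses", []))
```

Keeping the hard rules as named predicates means the "never do this" list is greppable and the violation report names the exact rule that fired.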
Audit logging
Every step is logged: plan, tool call, tool output, reflection, decision. The whole trace is replayable. The trace is what the auditor will see when there's a question; it's also what your engineers will use when they're tuning.
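A replayable trace needs nothing fancier than an append-only event log. A minimal sketch (in production the events would go to durable storage as JSON lines rather than an in-memory list):

```python
import json
import time


class AuditLog:
    """Append-only, replayable trace of every loop step."""

    def __init__(self):
        self.events = []

    def record(self, kind: str, payload: dict):
        # kind is one of: plan, tool_call, tool_output, reflection, decision
        self.events.append({"t": time.time(), "kind": kind, "payload": payload})

    def replay(self):
        # Yields events in order, for the auditor or for tuning.
        yield from self.events

    def dump(self) -> str:
        # One JSON object per line: trivially diffable and greppable.
        return "\n".join(json.dumps(e) for e in self.events)
```

The same trace serves both audiences the text names: `replay()` for engineers tuning a scenario, `dump()` for the record an auditor reads.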
Why a small tool set beats a big one
The temptation, especially with general-purpose models, is to give the agent twenty tools and let it figure out which to use. This is a mistake in clinical contexts. Each additional tool is:
- A failure mode that has to be evaluated
- A schema your code has to validate
- A capability the safety reviewer has to clear
Three or four tools per scenario, deeply specified, beats twelve tools loosely specified. Always.
Why one agent per scenario beats one super-agent
Same logic. Coupling all your clinical scenarios to one prompt or one tool set makes the evaluation matrix combinatorial and the governance story unbearable.
A skill file per scenario; an agent loop per scenario; an evaluation suite per scenario. They share infrastructure (the orchestrator, the governance layer, the logging) but they're independent units of safety.
The thing that's actually new
If you take the description above and remove the word "model," it reads like the design of a careful workflow engine. The architecture is familiar.
What's new is that one of the steps — the reasoning step — is now flexible in a way it wasn't before. That flexibility is what lets you ship new scenarios in days. It's also what makes the safety work harder. Every gain in flexibility is paid for in evaluation, governance, and logging.
The teams that win at this are the ones that pay that price up front.
If your team is staring at an agent demo and wondering how to make it audit-ready and clinically defensible, that's the conversation I have on most discovery calls. Book one — 30 minutes, no deck, real diagrams.