The word "agent" is doing a lot of work in 2026. It means at least three different things depending on who's saying it:
- A general-purpose autonomous loop (the "AGI" framing).
- A task-specific structured loop with tools, memory, and a stopping condition (the useful framing).
- A LangChain demo (the framing that's wasted the most engineering hours).
For clinical decision support, only the second framing matters. An agent is a structured loop that:
- Receives a goal.
- Has a finite, vetted set of tools.
- Reasons about state and chooses a next action.
- Stops when the goal is met, escalates when it can't be, and logs everything.
That description is recognizable to anyone who's ever written a workflow engine. The new part is that the "reason about state" step uses an LLM. The rest of the engineering is the same engineering you've always done.
The three failure modes of clinical agent systems
Before the architecture, the failure modes — because the architecture is shaped by them.
Unbounded loops. The agent thinks longer, calls more tools, reaches deeper into context, and eventually times out or produces noise. Mitigation: hard step caps, hard time caps, hard tool-call caps.
Silent confabulation. The model invents a fact that the orchestration layer accepts because the schema is loose. Mitigation: every tool output has a strict schema; every model output has a strict schema; mismatches are fatal, not warnings.
Quiet escalation failure. The agent encounters something it can't handle and produces a plausible-looking output instead of escalating. Mitigation: explicit "I don't know" pathways, calibrated confidence, and a default-to-human routing when confidence is low.
The architecture
For each clinical scenario:
A goal contract
A typed input describing the task: patient context shape, scenario type, what "done" looks like. The contract is reviewed by the clinical team. It's the part that matters most and it's the smallest.
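A goal contract can be as small as a frozen dataclass. A minimal sketch in Python, with illustrative field names (the real contract is whatever the clinical team reviews and signs off on):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoalContract:
    """Typed description of one clinical scenario's task.

    Field names here are illustrative, not a prescribed schema.
    """
    scenario: str                  # e.g. "med-reconciliation"
    patient_context_schema: dict   # JSON-Schema-style shape of the input
    required_output_fields: tuple  # what "done" looks like
    max_steps: int = 10            # reasoning budget, enforced by the orchestrator

    def is_satisfied(self, output: dict) -> bool:
        # "Done" means every required field is present and non-empty.
        return all(output.get(f) not in (None, "")
                   for f in self.required_output_fields)


contract = GoalContract(
    scenario="med-reconciliation",
    patient_context_schema={"type": "object",
                            "required": ["med_list", "allergies"]},
    required_output_fields=("reconciled_list", "flags", "confidence"),
)
```

Keeping the contract frozen and typed means the definition of "done" cannot drift at runtime; it changes only through review.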
A vetted tool set
A small list of tools the agent can call:
- Retrieve a specific clinical document
- Look up a guideline
- Check a value against a reference range
- Draft a structured output
Every tool has a schema. Every tool's output has a schema. The agent cannot call tools outside the list. The list is auditable.
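A sketch of what "cannot call tools outside the list" and "mismatches are fatal" look like in code. This uses a simple field-set check as a stand-in for full schema validation; the registry shape and names are assumptions, not a prescribed API:

```python
class ToolViolation(Exception):
    """Schema mismatches are fatal, not warnings."""


class ToolRegistry:
    def __init__(self):
        self._tools = {}  # name -> (fn, input_fields, output_fields)

    def register(self, name, fn, input_fields, output_fields):
        self._tools[name] = (fn, frozenset(input_fields), frozenset(output_fields))

    def call(self, name, args: dict) -> dict:
        # The agent cannot call anything not on the vetted list.
        if name not in self._tools:
            raise ToolViolation(f"tool not in vetted list: {name}")
        fn, in_fields, out_fields = self._tools[name]
        if set(args) != in_fields:
            raise ToolViolation(f"bad input fields for {name}: {set(args)}")
        out = fn(**args)
        if set(out) != out_fields:
            raise ToolViolation(f"bad output fields from {name}: {set(out)}")
        return out


registry = ToolRegistry()
# Illustrative tool: check a potassium value against a reference range.
registry.register(
    "check_reference_range",
    lambda analyte, value: {"in_range": 3.5 <= value <= 5.0},
    input_fields=("analyte", "value"),
    output_fields=("in_range",),
)
```

Because the registry is a plain dict of vetted entries, the auditable tool list is literally `registry._tools.keys()`.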
A reasoning loop with a budget
The loop:
- Plan: model produces a step plan grounded in the goal contract.
- Execute: model calls a tool. Output is schema-validated.
- Reflect: model assesses progress. Decides to continue, stop, or escalate.
- Repeat — with a hard step cap (often 8–12 for clinical scenarios).
The budget is enforced by the orchestrator, not requested politely from the model.
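The plan-execute-reflect loop with orchestrator-enforced caps can be sketched as follows. `model` and `tools` are stand-ins: `model(state)` is assumed to return a decision dict, and `tools(name, args)` is the schema-checked registry call from above's vetted list; the exact shapes are assumptions for illustration:

```python
import time


def run_loop(goal, model, tools, *, step_cap=10, tool_cap=20, time_cap_s=120):
    """Plan -> execute -> reflect, with the budget enforced here, not by the model."""
    state = {"goal": goal, "history": [], "tool_calls": 0}
    deadline = time.monotonic() + time_cap_s
    for _ in range(step_cap):                    # hard step cap
        if time.monotonic() > deadline:          # hard time cap
            return {"halt": "budget", "reason": "time cap"}
        decision = model(state)                  # plan / reflect
        if decision["action"] == "stop":
            return {"halt": "done", "output": decision["output"]}
        if decision["action"] == "escalate":
            return {"halt": "escalate", "reason": decision["reason"]}
        if state["tool_calls"] >= tool_cap:      # hard tool-call cap
            return {"halt": "budget", "reason": "tool-call cap"}
        result = tools(decision["tool"], decision["args"])  # schema-validated
        state["tool_calls"] += 1
        state["history"].append((decision, result))
    return {"halt": "budget", "reason": "step cap"}
```

Note that every exit path returns a structured halt reason: the model can recommend stopping, but only the orchestrator can actually stop, and it always says why.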
A halting condition
Either:
- The goal contract is satisfied (output schema present, all required fields filled, validation passed).
- The model returns "I cannot complete this task because X."
- The step budget is exhausted.
Each halting condition routes to a specific downstream action. The orchestrator decides; the model recommends.
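The routing itself can be a plain lookup table, which keeps it auditable. Action names below are illustrative:

```python
# Each halting condition maps to exactly one downstream action.
# The orchestrator performs the routing; the model only recommends.
HALT_ROUTES = {
    "done": "deliver_for_clinician_review",
    "escalate": "route_to_human_with_reason",
    "budget": "route_to_human_flag_budget_exhausted",
}


def route(halt_state: str) -> str:
    if halt_state not in HALT_ROUTES:
        # An unrecognized halting state is itself a defect: fail closed.
        raise ValueError(f"unknown halt state: {halt_state}")
    return HALT_ROUTES[halt_state]
```

Failing closed on an unknown halt state means a new exit path added to the loop cannot silently bypass routing.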
A governance pass
Before any output reaches a human, a governance pass:
- Validates output schema.
- Applies hard rules (small set of "never do this").
- Computes a confidence signal from agreement with the golden case neighborhood.
- Routes low-confidence outputs to mandatory human review.
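The governance pass is a pure function over the output: it either blocks, routes to review, or releases. A sketch, where `hard_rules` is the small set of "never do this" predicates and `confidence_fn` stands in for the golden-case agreement signal (assumed to return a float in [0, 1]):

```python
def governance_pass(output: dict, hard_rules, confidence_fn, threshold=0.8) -> dict:
    """Gate between agent output and a human. Names are illustrative."""
    # Hard rules first: any violation blocks the output outright.
    violations = [rule.__name__ for rule in hard_rules if not rule(output)]
    if violations:
        return {"route": "blocked", "violations": violations}
    # Then the confidence signal decides review vs. release.
    conf = confidence_fn(output)
    if conf < threshold:
        return {"route": "mandatory_human_review", "confidence": conf}
    return {"route": "release", "confidence": conf}


# Example hard rule: never emit a dose without a unit (e.g. "5 mg", not "5mg").
def dose_has_unit(output: dict) -> bool:
    return all(" " in dose for dose in output.get("doses", []))
```

Keeping the hard rules as named predicates means the "never do this" list is greppable and the violation report names the exact rule that fired.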
Audit logging
Every step is logged: plan, tool call, tool output, reflection, decision. The whole trace is replayable. The trace is what the auditor will see when there's a question; it's also what your engineers will use when they're tuning.
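A replayable trace needs nothing fancier than an append-only event log. A minimal sketch (in production the events would go to durable storage as JSON lines rather than an in-memory list):

```python
import json
import time


class AuditLog:
    """Append-only, replayable trace of every loop step."""

    def __init__(self):
        self.events = []

    def record(self, kind: str, payload: dict):
        # kind is one of: plan, tool_call, tool_output, reflection, decision
        self.events.append({"t": time.time(), "kind": kind, "payload": payload})

    def replay(self):
        # Yields events in order, for the auditor or for tuning.
        yield from self.events

    def dump(self) -> str:
        # One JSON object per line: trivially diffable and greppable.
        return "\n".join(json.dumps(e) for e in self.events)
```

The same trace serves both audiences the text names: `replay()` for engineers tuning a scenario, `dump()` for the record an auditor reads.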
Why a small tool set beats a big one
The temptation, especially with general-purpose models, is to give the agent twenty tools and let it figure out which to use. This is a mistake in clinical contexts. Each additional tool is:
- A failure mode that has to be evaluated
- A schema your code has to validate
- A capability the safety reviewer has to clear
Three or four tools per scenario, deeply specified, beats twelve tools loosely specified. Always.
Why one agent per scenario beats one super-agent
Same logic. Coupling all your clinical scenarios to one prompt or one tool set makes the evaluation matrix combinatorial and the governance story unbearable.
A skill file per scenario; an agent loop per scenario; an evaluation suite per scenario. They share infrastructure (the orchestrator, the governance layer, the logging) but they're independent units of safety.
The thing that's actually new
If you take the description above and remove the word "model," it reads like the design of a careful workflow engine. The architecture is familiar.
What's new is that one of the steps — the reasoning step — is now flexible in a way it wasn't before. That flexibility is what lets you ship new scenarios in days. It's also what makes the safety work harder. Every gain in flexibility is paid for in evaluation, governance, and logging.
The teams that win at this are the ones that pay that price up front.
If your team is staring at an agent demo and wondering how to make it audit-ready and clinically defensible, that's the conversation I have on most discovery calls. Book one — 30 minutes, no deck, real diagrams.