
Architecture · Apr 8, 2026 · 4 min read

Replacing a clinical rules engine with LLM reasoning

Rules engines are brittle, slow to extend, and expensive to test. LLMs are flexible, fast to extend, and impossible to test the same way. Here's the architecture pattern that lets you trade the first set of problems for the second — without losing your safety story.

Most healthcare companies that ship clinical-decision-support today are running a rules engine somewhere. Some are explicit (Drools, a custom DSL). Some are implicit — fifteen thousand lines of branching Python that function as a rules engine even if no one calls it that.

These systems share three properties:

  1. They're reliable in the lanes they were designed for.
  2. Every new condition or workflow is a multi-week project.
  3. The test matrix grows quadratically with the rule count.

You hit a moment, eventually, where extension velocity becomes the dominant constraint on the business. New enterprise pilots want workflows your rules engine doesn't cover. Clinicians have a tenth condition they want to add. Each addition is a regression risk and a sprint of QA.

The temptation is to swap the rules engine for "an LLM." This usually fails. Either it works in the demo and falls apart in production, or it works in production but the safety story collapses under a clinical review.

The pattern that works is more careful. Here's the shape.

Don't replace the rules engine. Move the reasoning out of it.

A rules engine is doing two things at once: structured retrieval ("given these inputs, what facts apply?") and reasoning ("given those facts, what should we recommend?"). It's the second part that scales badly.

The pivot is:

  • Keep deterministic retrieval. Your rules engine — or whatever replaces it — is excellent at "given a patient profile, return the relevant clinical context."
  • Move the reasoning step to an LLM, grounded in that retrieved context, governed by skill files that encode the playbook for each scenario.

This gives you the best of both substrates: the auditability and predictability of deterministic retrieval, and the extensibility of natural-language reasoning.

The architecture in three layers

Layer 1 — Retrieval

For each clinical scenario, define a context shape: the structured data the reasoning step needs. Patient demographics, problem list, recent encounters, relevant labs, applicable guidelines, prior interventions.

Retrieval is deterministic code. It pulls from your EHR (FHIR if you have it, custom adapters where you don't), from your data warehouse, from any clinical knowledge base. It returns a typed object.

Why this matters for safety: if the retrieval is wrong, the reasoning is wrong. Keeping it deterministic means it's testable the way you already test code, and any bug shows up before the LLM gets involved.
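
Here's a minimal sketch of the idea in Python. The field names and the shape are illustrative, not a prescribed schema; the point is that the context object is typed and its assembly is plain, testable code.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FollowUpContext:
    # Illustrative context shape for one scenario; the fields are examples.
    patient_id: str
    age: int
    problem_list: list[str]            # active diagnoses
    recent_a1c: Optional[float]        # most recent HbA1c, if on file
    active_medications: list[str]
    allergies: list[str]
    guideline_excerpts: list[str]      # pre-retrieved guideline text

def build_context(patient: dict, labs: dict, guidelines: list[str]) -> FollowUpContext:
    # Deterministic assembly from already-fetched EHR data.
    # Plain code: unit-testable, no model involved yet.
    return FollowUpContext(
        patient_id=patient["id"],
        age=patient["age"],
        problem_list=patient.get("problem_list", []),
        recent_a1c=labs.get("hba1c"),
        active_medications=patient.get("medications", []),
        allergies=patient.get("allergies", []),
        guideline_excerpts=guidelines,
    )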

Layer 2 — Reasoning, with skill files

A skill file is the playbook for one scenario, written in a structured format that the LLM follows. It contains:

  • The role and goal of this step
  • The structured context it expects
  • The reasoning approach (rule out X first, then evaluate Y)
  • The output schema
  • The safety boundaries (what to refuse, what to escalate)

You write one skill file per scenario. Same model, same orchestration, just different playbooks. This is what makes extension fast: adding a new scenario is a new skill file, not a new branch in the rule tree.
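
As a concrete and deliberately simplified sketch, a skill file can be as small as the one below. It's expressed as Python data here; in practice it usually lives as YAML or Markdown under version control, and the exact serialization matters far less than the fact that it's structured, reviewable, and scoped to one scenario.

# Illustrative skill file for one scenario. Names and wording are examples.
HYPERTENSION_FOLLOW_UP_SKILL = {
    "scenario": "hypertension-follow-up",
    "role": "Draft a follow-up recommendation for a clinician to review.",
    "expects_context": ["age", "problem_list", "recent_bp_readings",
                        "active_medications", "allergies", "guideline_excerpts"],
    "reasoning_steps": [
        "Rule out red-flag readings that require escalation first.",
        "Then evaluate whether the current regimen meets the retrieved guideline targets.",
        "Only then consider adjustments, citing the guideline excerpt relied on.",
    ],
    "output_schema": "follow_up_recommendation.v1",
    "safety_boundaries": [
        "Never draft a medication change without citing a guideline excerpt.",
        "Escalate to human review if recent readings are missing from the context.",
    ],
}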

The reasoning call itself is tightly structured: a pinned model version, a fixed low temperature, a JSON-schema-constrained output, and an evaluation harness that gates every change.
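
A sketch of that call, reusing the context object from the retrieval sketch above. The call_llm wrapper is a stand-in for whichever provider SDK you use under a BAA; the specific client is an assumption here, the structure is the point.

import json
from dataclasses import asdict

def call_llm(**kwargs) -> str:
    # Stand-in for a provider SDK call made under a BAA; returns a JSON string.
    raise NotImplementedError

def run_reasoning(skill: dict, context: FollowUpContext) -> dict:
    # One governed reasoning call: pinned model, low temperature,
    # schema-constrained output. The skill file becomes the system prompt.
    system_prompt = (
        f"Role: {skill['role']}\n"
        "Follow these steps in order:\n- " + "\n- ".join(skill["reasoning_steps"]) + "\n"
        "Boundaries:\n- " + "\n- ".join(skill["safety_boundaries"])
    )
    raw = call_llm(
        model="pinned-model-version",        # an explicit version, never "latest"
        temperature=0,
        system=system_prompt,
        user=json.dumps(asdict(context)),
        response_schema=skill["output_schema"],
    )
    return json.loads(raw)                   # the governance layer validates this next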

Layer 3 — Governance

The output of the reasoning layer doesn't go straight to a clinician. It goes through a governance layer that:

  • Validates the output schema
  • Applies hard rules (a small set of non-negotiables you never leave to the model, like "never recommend X to a patient with allergy Y")
  • Logs the full prompt + retrieval + output for audit
  • Routes ambiguous outputs to human review
  • Tracks drift against the golden case suite

This layer is where most of the real safety work lives. It's also the cheapest part to build, because it's just code.
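
A sketch of what that code can look like, again with illustrative names; the useful property is how boring it is.

import json
import logging

audit_log = logging.getLogger("cds.audit")

def violates_hard_rules(output: dict, context: dict) -> bool:
    # A deliberately small set of non-negotiables that never go through the model.
    recommended = set(output.get("recommended_medications", []))
    allergies = set(context.get("allergies", []))
    return bool(recommended & allergies)     # e.g. never recommend X with allergy Y

def govern(output: dict, context: dict, skill: dict, validate_schema) -> dict:
    # Validate, apply hard rules, log the full record, route for review.
    validate_schema(output)                  # raises if the shape is wrong
    audit_log.info(json.dumps({
        "scenario": skill["scenario"],
        "context": context,
        "output": output,
    }))
    if violates_hard_rules(output, context) or output.get("confidence") == "low":
        return {"status": "needs_human_review", "draft": output}
    return {"status": "ready_for_clinician", "draft": output}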

What you give up

You give up the property a rules engine gives you for free: any output can be traced back to a specific rule. LLM reasoning is grounded in the retrieved context, but it isn't deterministic in that rule-by-rule way.

You replace it with golden cases + drift monitoring + structured logging. You can show, for any output, the inputs that produced it and how the system behaves on the canonical reference cases. That's a different audit story, but it's a coherent one. It's also the right story, because the rules engine's traceability pointed into a rule tree that had long outgrown what any human could keep in their head.
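
Concretely, a golden case is a canonical input plus the properties the output must have, kept in version control and run on every change to a skill file, prompt, or model version. A minimal sketch, assuming a pytest-style harness and a hypothetical run_pipeline entry point that chains retrieval, reasoning, and governance:

GOLDEN_CASES = [
    {
        "name": "stable-readings-no-change",
        "context": {"age": 58, "recent_bp_readings": [[128, 82], [126, 80]],
                    "active_medications": ["lisinopril"], "allergies": []},
        "expect": {"escalate": False},
    },
    {
        "name": "red-flag-reading-escalates",
        "context": {"age": 58, "recent_bp_readings": [[190, 110]],
                    "active_medications": ["lisinopril"], "allergies": []},
        "expect": {"escalate": True},
    },
]

def test_golden_cases():
    # Drift check: every expected property must hold on every canonical case.
    for case in GOLDEN_CASES:
        output = run_pipeline(case["context"])   # hypothetical end-to-end entry point
        for key, expected in case["expect"].items():
            assert output[key] == expected, f"{case['name']}: {key} drifted"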

What you gain

  • New scenarios in days, not months. A skill file is a small, reviewable artifact.
  • Better handling of edge cases — the LLM has nuance that fifteen "if/else" branches couldn't encode.
  • A test surface that scales. Golden cases per scenario, not rules.
  • A more honest safety story. You stop pretending the system is provably correct and start documenting how you keep it good in practice.

What you don't do

You don't:

  • Skip the evaluation harness. It's the entire safety story.
  • Replace the orchestration layer with a single mega-prompt. Skill files isolate scope.
  • Send patient data to a hobbyist API tier. BAAs all the way down.
  • Ship without a governance layer. The model output is never the final word.

If you're staring at a rules engine that's slowing the business down and wondering whether the LLM pivot is real or hype, the short answer is: it's real if you build it the careful way. I went through this transition at Altitude in 2025, and the architecture above is what came out the other side. If your team is at that fork, book a call and we can look at your specific shape together.

LLM · rules engine · clinical decision support · agent orchestration

Next step

Want me to build something like this for your team?

Thirty-minute call. We'll look at the workflow you most wish was already automated and decide if it's a fit.