Forward Deployed Engineering
An honest map of what AI is and is not good for. Why probabilistic systems should handle cognition, deterministic systems should handle verification and execution, and why this is exactly how the human brain has always worked. A field guide for solutions architects, embedded engineers, and anyone tempted to wire a chatbot to production.
1. The Premise
There is a job title that has quietly become the most important one in software. Palantir invented it in the early 2010s and called it the Forward Deployed Software Engineer. Internally the role was called Delta. For a stretch Palantir employed more Deltas than traditional software engineers, and the role has since been borrowed by every serious AI company on earth.1 A Forward Deployed Engineer embeds with one customer. They build production workflows on top of a platform. They are half engineer and half anthropologist. They go where the work actually happens and they ship the thing that finally works.
The role is having a renaissance because nothing about the AI era is shippable from a desk in Palo Alto. The model is a generalist; the customer has a specialist's problem. The model hallucinates; the customer pays the bill when it does. The model has no idea what the customer's data looks like, what its compliance team will tolerate, or which of its forty thousand internal acronyms means something different on a Tuesday. Someone has to go sit next to the customer and figure that out. That someone is the Forward Deployed Engineer.
This essay is about what that work should look like in the AI era, and it is going to argue something that may sound counterintuitive at first. The right job for a Forward Deployed Engineer in 2026 is not to wire a chatbot to a production system and watch it run. It is to build, with the customer, a hardened deterministic library of the customer's actual workflows. To stand up the update scripts that keep the underlying data fresh. To write the validators that prove an output is correct before anyone sees it. And then, only after that scaffolding exists, to use a language model as the thin cognitive layer that lets a human ask for any of it in plain English.
Probabilistic AI gets cognition. Deterministic libraries get verification and execution. That is the thesis. Everything that follows is an elaboration of why this division of labor is the right one, why it mirrors how the human brain has worked for the entire history of human brains, and why the alternative (the one currently being sold as the future) is going to age very badly.
Two consequences fall out of the thesis and are worth naming at the top. The first: most of what passes for AI engineering today is the wrong shape. Plugging an autonomous agent into a workflow and hoping it figures things out is a wager that the model's failure rate on this customer's data is acceptable. The customer has not been told the failure rate. Often the agent's deployer has not measured it either. The second: the actual value of a Forward Deployed Engineer is the deterministic asset they leave behind. The prompts they write are ephemeral; the models they call will be deprecated in eighteen months. The vetted workflows, the validated data sources, the rule packs, the update scripts; these compound, they are citable, and they remain useful long after the model behind them has been replaced by the next one.
The reader is asked to hold one operational definition for the rest of the essay. Deterministic logic is logic where the same inputs produce the same outputs every time, where the rules are inspectable, and where a failure is reproducible. Probabilistic logic is logic where the same inputs produce a distribution over outputs, where the rules are not directly inspectable, and where a failure may or may not reproduce. Language models are emphatically probabilistic. Calculators are emphatically deterministic. Both are useful. They are not interchangeable, and the entire question of how to do useful AI work is the question of which one to use where.
2. Two Kinds of Logic
Start with a definition that is sharper than the usual hand-wave. A function is deterministic if, for any input x, it returns the same output f(x) on every invocation, with no dependence on hidden state, no dependence on time, and no dependence on the temperature of the room.2 A square root is deterministic. A regex match is deterministic. A SHA-256 hash is deterministic. A bridge-load calculation done from public physics is deterministic. The check on a contract's governing-law clause done against a vetted ruleset is deterministic. If you run it again tomorrow you will get the same answer, and if it was wrong yesterday it will be wrong tomorrow in exactly the same way; both properties are gifts.
A function is probabilistic if the output is drawn from a distribution conditioned on the input. The output is not a value; it is a sample. Run it again and you may get a different value. The behavior is not a bug. The behavior is the function. A language model, at the level the user interacts with it, samples tokens one at a time from a probability distribution over a vocabulary, where each distribution is conditioned on the tokens that came before and on the model's weights.3 Even at temperature zero, where the sampler is deterministic given identical hardware and software, the model's behavior is still statistical in the sense that matters: the answer it produces is the model's best guess given a training distribution, and there is no rule inside the model you can point at and say "this is why."
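The distinction can be made concrete in a few lines. A toy sketch, with a SHA-256 hash standing in for the deterministic side and a random choice standing in for sampling from a distribution:

```python
import hashlib
import random

def deterministic(x: str) -> str:
    # Same input, same output, on every invocation, forever.
    return hashlib.sha256(x.encode()).hexdigest()

def probabilistic(x: str) -> str:
    # Same input, a *sample* from a fixed distribution over outputs.
    return random.choice([x.upper(), x.lower(), x.title()])

# The deterministic property is testable in one line; the probabilistic
# one is not, and that asymmetry is the entire point.
assert deterministic("hello") == deterministic("hello")
```

The second function is not broken; sampling is its job. The question the rest of this essay asks is where in a system each kind of function belongs.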
The two kinds of logic have complementary virtues. Deterministic logic is auditable; you can read the source, identify the rule that fired, and cite the input that caused the verdict. It is reproducible; every output can be regenerated. It is composable in the strong sense; two deterministic functions composed are themselves deterministic. It is cheap to run once written. And it is narrow; it solves the problem it was written to solve and not a millimeter more.
Probabilistic logic is broad; one model can read a contract, write a poem, summarize a paper, and translate Slovenian. It is tolerant of fuzzy input; it does not need the user to phrase the question in exactly the right way. It is fluent in a register that deterministic logic will never be. And it is expensive; every invocation pays for the inference. It is opaque; the rule that produced the answer is distributed across billions of weights and is not citable. It is variable; the answer today and the answer tomorrow may not agree, and on the cases that matter the disagreement is exactly the cases you would have wanted to know about.
Fig. 1. The two kinds of logic, side by side

The fundamental trade is not a fight about which kind of logic is better. It is a question of placement. Pick the right one for the layer of the system you are building. The mistake the field is making, in 2026, is using probabilistic logic in places where deterministic logic was sitting right there. The other mistake, the rarer one, is using deterministic logic in places where probabilistic logic would have been kinder; nobody should write a regex to parse natural English when a small model can do it in twenty milliseconds.
A useful shorthand. Deterministic logic is front-loaded labor; you pay the cost up front, in design and writing and testing, and then it costs almost nothing to run. Probabilistic logic is back-loaded labor; you pay almost no cost up front (the model already exists) and then you pay forever in inference, in monitoring, in incident response when the model gets a thing wrong, and in legal fees when the wrong thing matters. The bill comes due either way. Architects choose which kind of bill they want.
3. The Twenty Percent Brain
Here is a fact that has been quietly available for sixty years and that should govern more design decisions than it does. The adult human brain weighs about three pounds, or roughly two percent of body mass. It consumes about twenty percent of the body's metabolic budget at rest.4 A square inch of brain tissue is the most expensive square inch the organism owns. Evolution did not give the brain that allocation lightly. It gave it because the brain does work that nothing cheaper can do, and it gave only the minimum allocation needed to do that work.
Read the implication again carefully. The most computationally expensive organ in the body is the one that does the probabilistic, generalist, fuzzy-input, broad-domain work. Reasoning is metabolically expensive. So is imagination. So is planning. So is paying attention. Marcus Raichle's research group, in a 2002 paper that became foundational, called the brain's resting energy use the "default mode" and traced it to a network of regions that hum continuously whether you are doing anything specific or not.5 Thinking, even idle thinking, is expensive.
The body has noticed this. Over hundreds of millions of years it has built a remarkable arsenal of deterministic offload mechanisms: the cerebellum for motor patterns once learned, the spinal cord for reflexes that bypass the brain entirely, the basal ganglia for habit, the autonomic nervous system for organ regulation, the gut's own enteric nervous system for digestion. Each of these is a low-power dedicated circuit. None of them does general reasoning. All of them do exactly one thing reliably. The pattern across biology is unmistakable: offload everything that can be made deterministic, and reserve the expensive probabilistic substrate for the cases that genuinely need it.
Daniel Kahneman gave the cognitive version of this architecture its most famous name. System 1 is fast, automatic, and effortless; it is the part of the mind that completes "two plus two equals" before you have finished reading the sentence. System 2 is slow, deliberate, and effortful; it is the part of the mind that solves seventeen times twenty-four with a pencil.6 Kahneman's lifelong project was demonstrating that humans are, by default, lazy with System 2 and prefer to coast on System 1. The reason is not character. The reason is calories. System 2 burns glucose; System 1 hardly does. A brain that ran System 2 continuously would starve.
The relevance to AI architecture is unsubtle. A language model is the silicon equivalent of System 2. It is a fluent generalist that costs serious energy per invocation. Every token sampled is real electricity off the grid; the inference cost of frontier models, at scale, is measured in megawatts.7 An organization that runs its operations entirely on language model calls has built the cognitive equivalent of a brain that refuses to use the cerebellum. It will work. It will be slow. It will be expensive. And it will, eventually, get something wrong in a way a more thoughtful architecture would have prevented.
The honest framing is the opposite. Use deterministic logic for everything the cerebellum could do (routine, repeatable, narrow, well-defined). Use probabilistic logic for everything that genuinely needs the prefrontal cortex (novel situations, ambiguous inputs, fuzzy semantics, intent inference). Then make sure the boundary between the two is the cheapest and tightest part of the design. The human nervous system is a proof of concept for this architecture hundreds of millions of years in the making. We can copy the homework.
The brain spends twenty percent of the body's calories doing the work nothing cheaper can do. Use AI the same way.
One more biology note before moving on, because it sharpens the design implication. The Karl Friston free energy principle, controversial but generative, models the brain as a prediction machine that is constantly trying to minimize surprise.8 The brain runs an internal model of the world, predicts what comes next, and updates the model when reality disagrees. The expensive part is the prediction. The cheap part is the comparison with reality. The brain spends its calories on the probabilistic forward pass and offloads the verification to the much cheaper sensory feedback loop. Deterministic verification is biological; it is what an eyeball is for.
The translation to AI engineering writes itself. The model predicts. The deterministic linter compares the prediction to reality. The cheap part stays cheap. The expensive part is contained. The system as a whole stays accurate because the cheap-and-deterministic outer loop disciplines the expensive-and-probabilistic inner one. This is the architecture the rest of the essay is going to defend.
4. What Probabilistic Logic Is Genuinely Good At
It would be unfair to the technology to argue for deterministic-first without first giving language models their due. The honest list is long and worth writing down, because most of the bad architecture in the field comes from people who have not separated the things AI is genuinely great at from the things it is being asked to do because the demo looked good.
The following capabilities are real, durable, and not going away. An architect should reach for a model first when a task is in this list.
Intent extraction from natural language
Humans phrase things badly. A user types "i need to know how much wire for 60a at 80 feet" and what they mean is a voltage-drop calculation under the National Electrical Code with assumed copper conductor and a target three-percent drop. No regex parses that input correctly. A language model parses it in twenty milliseconds, returns a structured JSON object naming the tool and the arguments, and almost never gets the high-level intent wrong on a small, fixed tool registry. This is the single most valuable thing language models do, and it is the work they should be doing every time the user opens their mouth.
Translation between representations
Free text to JSON. JSON to free text. SQL to English. English to SQL. Python to Rust. A code review comment to a Jira ticket. A medical chart note to an ICD-10 code candidate set. All of these are translation tasks. Translation is what language models are literally trained to do; the original transformer paper was a translation paper.9 When the translation has a deterministic ground truth (a schema, a grammar, a glossary), the model's output can be deterministically validated and the system gets the best of both worlds.
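When the target representation has a grammar, the validation step is a single call. A minimal sketch for English-to-Python translation, where Python's own parser is the deterministic ground truth; the `model_output` argument stands in for whatever the model actually returned:

```python
import ast

def validated_python(model_output: str) -> str:
    # The target grammar is the ground truth: if the model's draft does not
    # parse, reject it before anyone runs or even reviews it.
    ast.parse(model_output)  # raises SyntaxError on invalid Python
    return model_output
```

The same shape works for any target with a parser: a JSON schema, a SQL grammar, an ICD-10 code list. The model translates; the grammar judges.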
Summarization
Compressing a long document into a short one, preserving the load-bearing claims, is a task humans hate and models do well. The caveat is that summarization is also where the most-cited LLM benchmark failure lives. Vectara's Hallucination Evaluation Model leaderboard, updated through 2025, found that the best-performing models on grounded summarization hallucinate on under one percent of cases on a forgiving benchmark, and on harder benchmarks the same "reasoning" models exceed ten percent.10 The capability is real and the failure rate is not zero. Use models for summarization; have a human, or a deterministic checker, read the summary before it goes anywhere consequential.
Drafting
First drafts of emails, essays, code, policies, slide decks. The first draft is the part of any creative task with the highest cognitive activation energy and the lowest stakes; getting unblocked is worth more than getting it right. A model writes the first draft. A human edits the second draft. The result is faster than either alone and rarely worse than the human alone. The error mode is the human who ships the first draft without editing it; this is a workflow failure, not a model failure.
Classification with fuzzy boundaries
"Is this email a sales pitch, a customer support request, or a phishing attempt?" The categories overlap. The features that distinguish them are implicit. A rule-based classifier gets seventy percent accuracy and a model gets ninety-five. The remaining five percent is where the model is wrong and a deterministic post-check on the high-stakes path (does the email contain a payment instruction?) keeps the system safe.
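The shape of that post-check is simple. A sketch, assuming a hypothetical label set and a deliberately crude keyword pattern for the high-stakes path:

```python
import re

# Crude on purpose: the deterministic check owns the high-stakes path
# and errs toward human review regardless of what label the model chose.
HIGH_STAKES = re.compile(r"wire transfer|IBAN|routing number|payment", re.I)

def post_check(model_label: str, email_text: str) -> str:
    if model_label != "phishing" and HIGH_STAKES.search(email_text):
        return "needs_human_review"
    return model_label
```

The model supplies the ninety-five percent accuracy; the twelve-line check makes the remaining five percent survivable.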
Code generation in well-trodden domains
Boilerplate. CRUD endpoints. Test scaffolding. Conversion from one framework to another. Anything that a competent junior could write and that has a million prior examples in the training corpus. The model is fast, the human reviews, and the deterministic compiler is the final judge. The compiler is the linter; this architecture already exists and it works because the compiler is deterministic.
Reasoning about unstructured context
"Given this thirty-page deposition transcript and this contract, where do the parties disagree about scope?" No deterministic rule pack will get there. A model will, imperfectly, often well enough to direct a human's attention to the right pages. This is the cognitive-router move: the model points the human at the work, and the human does the work. The model is a flashlight, not a hammer.
Brainstorming
Variations on a theme, alternative phrasings, candidate names, possible counterarguments. The probabilistic nature of the output is a feature here; you want variety. A model generating ten taglines and a human picking one is faster than the human writing the ten themselves, and the model never gets bored.
Tutoring and explanation
Re-phrasing a concept until it lands. Generating examples. Answering follow-up questions about a known body of material. With retrieval grounding, where the model is restricted to citing from a deterministic corpus, this is reliable enough to be useful, and the citation requirement keeps the model honest about what it actually knows versus what it is inventing.
Notice what is not on this list. Doing math. Running calculations. Following a regulation precisely. Verifying that an output meets a constraint. Producing the same answer twice. Anything that has to be cited in a court filing. Any task where being slightly wrong matters more than being roughly right. Those are the deterministic side of the line. The list above is real and it is not small. The model has plenty to do without being asked to do those.
5. What Deterministic Logic Is Genuinely Good At
Now the other side. The following tasks are ones where deterministic logic is not merely an option but the only honest choice. Reach for the model in any of these and you are choosing variance, opacity, and ongoing inference cost when you could have had a function call.
Arithmetic
Every commercial language model occasionally gets long multiplication wrong. Every calculator gets it right. The right design is to detect that an arithmetic operation is needed and call the calculator. Frontier vendors now do this internally through tool use; it took roughly five years of public embarrassment before the field accepted that asking a model to do arithmetic was using the wrong tool.11 Do not repeat the mistake at the application layer.
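At the application layer the pattern is a dispatch table: the model names the operation and the operands, and a deterministic function does the arithmetic. A sketch using exact decimal arithmetic:

```python
import operator
from decimal import Decimal

OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}

def calculate(op: str, a: str, b: str) -> Decimal:
    # The model's job ended when it produced {"op": ..., "a": ..., "b": ...}.
    # The arithmetic itself is exact and reproducible.
    return OPS[op](Decimal(a), Decimal(b))
```

`Decimal` also sidesteps binary floating point, so `calculate("add", "0.1", "0.2")` is exactly `0.3`, which is the kind of guarantee no sampled token stream can make.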
Unit and dimensional conversion
Newtons to pounds-force. Milligrams per kilogram to milligrams per pound. Kilowatt-hours to BTUs. The conversion factor is a fixed number. The output should be a fixed number. A model that occasionally drops a decimal point is not a tool, it is a hazard.
Schema and grammar validation
"Is this document valid JSON?" is a question with one answer. "Does this YAML parse?" is a question with one answer. "Does this command match the protocol?" is a question with one answer. Modern decoding techniques (constrained sampling, JSON-mode, structured outputs) push grammar enforcement into the sampler itself so the model literally cannot emit invalid output; this is the right design.12
Cryptographic operations
Signing. Verifying. Hashing. Encrypting. Decrypting. Generating keys. Every one of these has a published specification and a deterministic implementation. A language model that "computes" an Ed25519 signature is not signing anything; it is generating plausible-looking bytes. The bytes will not verify. The correct architecture is to have the model decide that a signature is needed and call a function that actually signs.
Regulatory and contractual rule application
"Does this contract contain a unilateral indemnification clause?" "Does this medical bill apply the correct CPT modifier?" "Does this purchase order violate the SOX delegation-of-authority matrix?" These are questions with citable rules, public sources, and unambiguous answers when the rule is applied correctly. The model can help find the rule. The model should not be the final authority on whether the rule fired. Vaulytica's design is the worked example: about eighty deterministic rules over ten categories, every finding tied to a rule ID and a dataset version, no model in the verification path.13
Engineering calculations from public physics
Voltage drop. Friction loss. Conduit fill. Refrigerant superheat. Beam deflection. Drug-dose-per-kilogram. The formulas exist. The constants exist. The right answer is a number that can be regenerated by anyone with the same inputs. The right architecture is a calculator (or 342 calculators, if you happen to be Rough Logic) that anyone can use and that nobody has to trust.14
Data parsing and structured extraction with known formats
An X.509 certificate has a defined ASN.1 structure. A PGP key has a defined binary format. A CSV has a defined grammar. An RFC 4253 SSH key has a defined wire format. Parsing these is a deterministic exercise; the result is either correct or the input is malformed. Encrypt A Lotta's parsers do this in the browser without a server because the operation is mechanical.
State machines and protocol implementations
TCP. TLS. OAuth. SAML. Every protocol is a finite state machine; the transitions are defined; the rules are inspectable. The compatibility cost of getting a protocol slightly wrong is enormous (every other implementation of the protocol has to interoperate with yours). The model is not going to invent a new state machine; it is going to invent a wrong one. Use the deterministic library that someone has already certified.
Output validation
This is the load-bearing one and deserves a section of its own (see "The Architecture" below). When a model produces a structured output, a deterministic validator can check, before anyone consumes the output, that it parses, that required fields are present, that values are in range, that the output's claims are grounded in the source the model was supposed to be working from, and that no high-stakes invariant has been violated. A deterministic validator is the cheapest, fastest, most trustworthy second pair of eyes available.
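One grounding check from that list fits in a few lines: every number the model's output asserts must literally appear in the source it was working from. Crude, deterministic, and aimed at the failure mode that matters most:

```python
import re

NUM = re.compile(r"\d[\d,]*(?:\.\d+)?")

def ungrounded_numbers(model_output: str, source: str) -> list[str]:
    # Any number in the output that is absent from the source is flagged
    # before a human ever sees the output.
    source_numbers = set(NUM.findall(source))
    return [n for n in NUM.findall(model_output) if n not in source_numbers]
```

A non-empty return value blocks the output from shipping. The check costs microseconds; the hallucinated dollar figure it catches costs whatever the contract was worth.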
Audit trails and provenance
What rules fired, with what versions, against what inputs, producing what verdicts. This is metadata about the run, and it is exactly the kind of thing models are bad at producing because the model does not actually know what it did. Determinism shines here. Every rule has an ID, every dataset has a version, every verdict has a timestamp and a signature, and the entire trace is reproducible.15
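A sketch of what one such record can look like; field names are illustrative, and the content hash makes after-the-fact tampering detectable:

```python
import hashlib
import json
import time

def audit_record(rule_id: str, rule_version: str, dataset_version: str,
                 inputs: dict, verdict: str) -> dict:
    # Every field a reviewer needs to regenerate the verdict. The record
    # is deterministic given its inputs; the timestamp is the one
    # run-specific field.
    record = {
        "rule_id": rule_id,
        "rule_version": rule_version,
        "dataset_version": dataset_version,
        "inputs": inputs,
        "verdict": verdict,
        "timestamp": time.time(),
    }
    canonical = json.dumps(record, sort_keys=True).encode()
    record["sha256"] = hashlib.sha256(canonical).hexdigest()
    return record
```

The model could not produce this record honestly even if asked, because the model does not know which rule fired. The deterministic layer knows, because the deterministic layer is the thing that fired it.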
Repetition
Anything that has to happen the same way ten thousand times in a row. Payroll. Tax filings. Patient triage protocols. Aircraft startup checklists. The model is overkill and its variance is the wrong shape; the deterministic procedure has been the right tool for these jobs since the invention of the procedure.
The combined list is most of what an enterprise actually does. The cognitive surface area where "an answer that is roughly right is good enough" is real, but it is a much smaller fraction of any business than its marketing material suggests. The job of the Forward Deployed Engineer is to find that surface area, draw a line around it, put a model on the right side of the line, and put deterministic libraries on the other side.
6. The Failure Frontier
It is easier to argue about architecture in the abstract than to look at the actual incidents that have already happened, but the actual incidents are the only honest data. The following are not cherry-picked; they are the most-cited examples of probabilistic systems failing in ways a deterministic outer loop would have caught.
Moffatt versus Air Canada (2024)
An Air Canada chatbot told Jake Moffatt, whose grandmother had just died, that bereavement fares could be applied retroactively after travel. They could not. The airline's own static webpage said so. Moffatt booked the flight, applied for the refund, was denied, and sued. The British Columbia Civil Resolution Tribunal found Air Canada liable for negligent misrepresentation. The airline's argument that the chatbot was "a separate legal entity responsible for its own actions" was rejected.16 The relevant fact for our purposes is not the damages award (about eight hundred Canadian dollars). The relevant fact is that the chatbot's answer was not validated against the company's own deterministic policy. A trivial deterministic check (does the chatbot's answer match the static refund policy page?) would have caught it. None existed. Air Canada is not stupid. The architecture was wrong.
Microsoft Tay (2016)
An early consumer-facing chatbot was deployed to Twitter with no deterministic output filter on inflammatory content. Within sixteen hours users had induced it to produce racist and pro-genocide outputs. Microsoft pulled it. The failure was not the model's; the model was working as designed (it learned from inputs). The failure was that no deterministic content filter sat between the model and the public. Every consumer-facing model since has had one. The lesson, learned the expensive way, was that the deterministic outer loop was not optional.
Lawyers citing fake cases
In 2023 a New York attorney filed a brief that cited six legal cases. The cases did not exist; ChatGPT had invented them. The court sanctioned the lawyers. By 2025 the pattern had repeated dozens of times in different jurisdictions. The deterministic check is trivial: every cited case must resolve to a real entry in a legal database. Westlaw and LexisNexis are deterministic. The validator that compares "case the model cited" against "cases that actually exist" is a one-day project. It was not in place because the workflow was "ask the model, file the brief"; the workflow had no outer loop.
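The validator really is that small. A sketch, with a hypothetical in-memory set standing in for a query against a real legal database; the entries are illustrative:

```python
# Stand-in for Westlaw/LexisNexis; in production this is a database query.
KNOWN_CASES = {
    "Moffatt v. Air Canada, 2024 BCCRT 149",
}

def unverifiable_citations(cited: list[str]) -> list[str]:
    # Every citation must resolve to a real entry. Anything left over
    # blocks the filing until a human resolves it.
    return [c for c in cited if c not in KNOWN_CASES]
```

The model drafts the brief; this function decides whether the brief is allowed to leave the building. That division of labor is the entire essay in four lines.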
Air Canada (again), and every other airline chatbot
Air Canada is not alone. Reporting through 2024 and 2025 identified similar pattern failures at airlines, banks, telecoms, and insurers, where customer-facing chatbots made commitments their own backends would not honor. Each is the same architectural error: a probabilistic agent given write authority over commitments without a deterministic check that the commitment is allowed.
The non-AI ancestor: Knight Capital (2012)
This was not an AI failure but it is the cleanest illustration of why deterministic outer loops matter. A software deployment activated dormant code in a trading system. Over forty-five minutes the system placed roughly four million orders, losing the firm approximately $460 million and effectively ending it.17 The deterministic check that would have stopped it (a position-limit guard, an order-rate circuit breaker) was not in place. The pattern is identical to the AI pattern: an autonomous process operating without a deterministic boundary on its blast radius. The cost was four hundred sixty million dollars in forty-five minutes. The boundary would have cost a week of engineering.
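The kind of guard that was missing can be sketched as a sliding-window order-rate breaker; the limits here are illustrative, and a real deployment would pair it with a position-limit check:

```python
class OrderRateBreaker:
    # The deterministic boundary on blast radius: a hard cap on orders per
    # window, enforced outside the strategy logic that generates the orders.
    def __init__(self, max_orders: int, window_s: float):
        self.max_orders = max_orders
        self.window_s = window_s
        self.sent: list[float] = []

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        self.sent = [t for t in self.sent if now - t < self.window_s]
        if len(self.sent) >= self.max_orders:
            return False  # trip: refuse the order and page a human
        self.sent.append(now)
        return True
```

Note that the breaker knows nothing about trading strategy. It does not need to. Its job is to make the worst case boring, which is exactly the job the deterministic outer loop has in every architecture this essay describes.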
Hallucination in production summarization
Vectara's HHEM leaderboard documents the current state. Best frontier models hallucinate on roughly one percent of grounded summarization tasks under a forgiving benchmark, and ten percent or more on a harder benchmark, with the "reasoning" models often doing worse on grounded tasks than the non-reasoning ones.10 The numbers are getting better, slowly. They are not on a trajectory toward zero. A summarization workflow that does not validate against the source material is shipping the hallucination rate to its users; for most enterprise contexts (legal, medical, financial) that rate is unacceptable.
The general pattern
Every well-known AI failure since 2016 has the same shape. A probabilistic system was given authority over a consequential output. No deterministic gate stood between the output and the consumer. The gate was technically easy to build; it was not built because the deployer either did not believe it was necessary or had not thought of it. The failure was discovered in production, at the customer's expense, often in court.
A model without a deterministic outer loop is a writeable production database with no integrity constraints. It will, eventually, write something that costs more than the outer loop would have.
The good news is that the pattern is also the prescription. Every failure on the list above is a piece of training data for the architecture this essay is arguing for. Build the deterministic outer loop. Run the model inside it. Validate every output before it becomes a commitment, a payment, a filing, or a decision. The outer loop is the cheap part; skipping it does not avoid the bill, it only defers it to the back-loaded probabilistic one that comes later.
7. The Architecture
The architecture this essay defends has five layers. Each layer has a job. Each job is matched to the kind of logic that does it best. The boundaries between layers are where most of the engineering effort goes; the interiors are mostly already-solved problems.
Fig. 2. The five-layer architecture

The layers in detail.
Layer 1: Human intent
This is just text. A sentence. A request. "Decode this medical bill." "What gauge wire for 60 amps at 80 feet." "Lint this contract." The user is not asked to learn anything. The user does not see the architecture. The architecture is the thing that disappears.
Layer 2: Cognitive router (probabilistic, but small)
A small language model (Phi-3 Mini, Gemma 2 2B, Qwen 2.5 1.5B) running locally in the browser or on the device.18 Its job is bounded and specific: read the user's text, pick the right tool from a fixed registry, and produce a structured JSON object naming the tool and its arguments. If arguments are missing it asks one clarifying question. It does not execute anything. It does not generate the final answer. It is a smart switch.
Why small and local. The cognitive task here is narrow (routing among a known set of tools) and the latency budget is tight. A frontier model is wildly overkill. A small model runs in twenty to two hundred milliseconds, fits in a few gigabytes, and works offline. The privacy properties are also strict: the user's text never leaves the device. transformers.js and WebLLM make this trivial in a browser as of 2025.19
Why probabilistic at all at this layer. Because natural language is ambiguous and a deterministic intent classifier would require maintaining ten thousand regexes and would still miss "i need to figure out how much wire i need for sixty amps yo" while the model gets it on the first try. Routing is exactly the task language models are good at.
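The fixed registry is what keeps the routing task small. A sketch of the deterministic acceptance check that sits immediately behind the router; the tool names and argument keys are invented for illustration:

```python
# The registry the router picks from. The model emits only a tool name and
# arguments, never the final answer.
TOOL_REGISTRY = {
    "voltage_drop": {"args": {"amps", "length_ft", "volts"}},
    "conduit_fill": {"args": {"conduit_trade_size", "conductors"}},
}

def accept_route(route: dict) -> bool:
    # Deterministic check: the routed tool must exist, and the model must
    # not have invented argument names outside the tool's declared set.
    spec = TOOL_REGISTRY.get(route.get("tool"))
    return spec is not None and set(route.get("arguments", {})) <= spec["args"]
```

A rejected route bounces back to the router with the error; it never reaches execution. The probabilistic layer proposes, the deterministic layer disposes.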
Layer 3: Argument validator (deterministic)
The router's JSON output is checked against a schema. The arguments are type-checked, range-checked, and unit-checked. "60 amps" parses to a positive integer ampere value. "80 feet" parses to a positive distance with a unit. "voltage 120 or 240 or 480" passes a domain check. If anything fails, the request bounces back to layer 2 with the specific error and the user is asked a clarifying question. The validator itself is fifty lines of code per tool. It runs in microseconds. It is the first place the architecture says no.
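A sketch of one such validator, with illustrative bounds; a production version would also normalize units before checking them:

```python
def validate_voltage_drop_args(args: dict) -> dict:
    # Deterministic gate between router and execution. Collect every error
    # so the clarifying question back to the user is specific.
    errors = {}
    amps = args.get("amps")
    if not isinstance(amps, int) or not 1 <= amps <= 4000:
        errors["amps"] = "must be an integer between 1 and 4000"
    length = args.get("length_ft")
    if not isinstance(length, (int, float)) or length <= 0:
        errors["length_ft"] = "must be a positive distance in feet"
    if args.get("volts") not in (120, 240, 480):
        errors["volts"] = "must be one of 120, 240, 480"
    if errors:
        raise ValueError(errors)  # bounce to layer 2 with the specifics
    return args
```

The exception payload is the clarifying question's raw material: the router turns "must be one of 120, 240, 480" into "is this a 120, 240, or 480 volt circuit?" and the loop closes.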
Layer 4: Execution (deterministic)
The actual workflow runs. A pure function. Takes the validated arguments. Produces the result. Writes an audit trail naming the function version, the rule pack version, the inputs, the intermediate computations, and the output. The execution layer is the vetted asset. It is the thing the Forward Deployed Engineer spent six months building. It does not call the model. It does not know the model exists. It is the cerebellum.
Layer 5: Output linter and signed verdict (deterministic)
This is the "second pair of eyes" layer and it is the part of the architecture that is genuinely undervalued in 2026. The output of layer 4 is run through a deterministic rule pack appropriate to the domain. For a contract check the rule pack contains the contractual invariants (no unilateral indemnification without consideration, no governing-law clause pointing at a jurisdiction not in the approved list, and so on). For a medical dose calculation the rule pack contains the safety bounds (no acetaminophen above 4000 mg per day, no pediatric dose above weight-adjusted maximum). For a generated document the rule pack contains structural and citation checks. The verdict is signed with an Ed25519 key; downstream consumers can verify that "this output passed rule pack X version Y" before they trust it.20 The signed verdict is the cryptographic version of Vaulytica's audit trail. Anyone can verify it. Nobody has to trust the producer.
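The signing step can be sketched with the third-party `cryptography` package; the verdict payload fields here are illustrative:

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# In production the private key lives in an HSM or key vault; generating it
# inline here is for illustration only.
signing_key = Ed25519PrivateKey.generate()

verdict = json.dumps(
    {"rule_pack": "contract-lint", "version": "4.1", "result": "pass"},
    sort_keys=True,
).encode()
signature = signing_key.sign(verdict)

# Downstream consumers hold only the public key. verify() raises
# InvalidSignature if the verdict bytes were altered after signing.
signing_key.public_key().verify(signature, verdict)
```

The asymmetry is the point: anyone holding the public key can check the verdict in microseconds, and nobody, including the producer, can forge one without the private key.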
The rendered output is then assembled by a deterministic template (not by the model). The numbers come out as the numbers. The citations come out as the citations. The model does not get a second pass to "make it more readable"; the template is already readable, and the second pass is where hallucinations enter.
What about the cases that need the model in the middle
Some tasks legitimately need probabilistic synthesis inside the workflow. Summarizing a contract for a layperson, for instance. The architecture handles this by allowing the deterministic workflow to call a model with a hardened prompt as one step. The model's output is then itself routed through layer 5 (validated against the source material, checked for citation grounding, signed). The user never connects directly to a chatbot. The model is a function call inside a governed workflow. The probabilistic surface is contained.
Fig. 3. The bounded model call inside a deterministic workflow
The pattern is the kinetic execution firewall pattern, applied to text instead of motors.21 Cognition lives upstream of the firewall. Execution lives downstream. The firewall is deterministic, signed, and citable. The architecture is the same architecture safety-critical industries have been using since the invention of safety-critical industries; nothing about AI changes the basic shape.
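The bounded-call pattern can be sketched as a workflow step that calls a model and refuses to surface the draft unless every citation it makes is grounded in the source material. `call_model` here is a hypothetical stand-in for a real model API, and the containment check is deliberately crude:

```python
# Sketch: a model call contained inside a governed workflow step.
# The model output never reaches the user without a grounding check.
import re

def call_model(prompt: str) -> str:
    # Hypothetical stand-in. A real deployment would call a pinned
    # model with a hardened, versioned prompt. We return a canned draft.
    return "Clause 4.2 caps liability at the contract value [source: clause 4.2]."

def grounded_summary(source_clauses: dict[str, str]) -> dict:
    draft = call_model(f"Summarize for a layperson: {source_clauses}")
    # Layer 5 for model output: every cited clause ID must exist in
    # the source material, or the draft is rejected before anyone sees it.
    cited = re.findall(r"\[source: (clause [\d.]+)\]", draft)
    missing = [c for c in cited if c not in source_clauses]
    if not cited or missing:
        return {"ok": False, "reason": f"ungrounded: {missing or 'nothing cited'}"}
    return {"ok": True, "summary": draft, "citations": cited}
```

A real grounding check would be stronger (entailment against the source text, not just citation containment), but the shape is the point: the probabilistic surface is one function call inside deterministic code.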
8. The Job
Now the part this essay has been building toward. What does a Forward Deployed Engineer actually do, day to day, if the thesis is right?
The short answer is that they build the deterministic asset. Concretely, a Forward Deployed Engineer on a real engagement does the following work, roughly in order, often in parallel.
1. Map the customer's actual workflows
Sit with the customer. Watch what they do. Write it down. Most enterprise workflows have never been written down at the precision required for software. The first deliverable is a workflow inventory: a list of the discrete tasks the customer performs, the inputs each one takes, the outputs each one produces, and the rules each one applies. This is the part that looks like anthropology and is the part that no remote model can do for you.
2. Identify the deterministic core of each workflow
For each workflow, ask: what is the rule? Where does it come from? What is the citable source? Most workflows in a regulated industry have a citable source. The customer may not know what it is or may know it imperfectly. The Forward Deployed Engineer finds it, names it, and writes it into the rule pack with a version and a source link. A surprising fraction of "the way we have always done it" turns out to derive from a specific regulation, a specific contract clause, or a specific safety standard; the source matters because when the source updates, the rule has to update with it.
3. Build the update scripts
This is the unglamorous part and it is the part that compounds. Every data source the deterministic logic depends on (the CPT code set, the National Electrical Code, the federal poverty level table, the FDA drug schedule, the customer's own internal price book) has a refresh cadence. Build a GitHub Actions workflow, a cron job, a Cloud Run scheduler, whatever fits the customer's stack, that pulls the latest version, diffs it against the cached version, runs the regression tests, and either ships the update or files an issue for human review. Without this layer the deterministic logic is correct on the day it is written and slowly wrong every day after; with this layer it stays correct indefinitely.
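The core loop of such an update script is small. In this skeleton the three hooks (`fetch_latest`, `load_cached`, `run_regression`) are hypothetical stand-ins for whatever fits the customer's stack; the decision logic is the part that matters:

```python
# Skeleton of an update script's decision loop: fetch, diff, test,
# then ship or escalate to a human. Hook functions are stand-ins.
import hashlib, json

def check_for_update(fetch_latest, load_cached, run_regression) -> str:
    latest = fetch_latest()
    cached = load_cached()

    def digest(dataset: dict) -> str:
        return hashlib.sha256(
            json.dumps(dataset, sort_keys=True).encode()).hexdigest()

    if digest(latest) == digest(cached):
        return "no-change"             # nothing to do until the next run
    if run_regression(latest):
        return "ship-update"           # tests pass: commit the new dataset
    return "file-issue-for-human"      # tests fail: a human reviews the diff
```

In practice this runs under cron, a Cloud Run scheduler, or a GitHub Actions workflow on the source's refresh cadence, and "file an issue" is a real issue in the customer's tracker.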
4. Write the rule packs and validators
This is most of the work. Encode the rules as deterministic functions. Each rule has an ID, a description, a citation, a version, a unit test, and ideally a few real-world examples it has caught. The rule pack is the asset. Vaulytica's eighty rules, Sophie Well's drug-dose calculators, Rough Logic's three hundred forty-two field math functions; these are the worked examples. They look small until you appreciate that each one is several days of research and verification, and that together they constitute a serious deterministic surface that did not exist before.
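The shape of a rule pack entry is simple enough to show directly. The single rule below (indemnification without consideration) is an illustrative toy, not Vaulytica's actual code, and the citation URL is a placeholder:

```python
# Sketch of a rule pack entry: ID, description, citation, version,
# and a deterministic check function. The rule shown is a toy.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    citation: str    # where the rule comes from, with a source link
    version: str
    check: Callable[[str], bool]  # True means the finding fires

RULES = [
    Rule(
        rule_id="INDEM-001",
        description="Unilateral indemnification without consideration",
        citation="https://example.com/contract-law-source",  # placeholder
        version="2026.1",
        check=lambda text: "indemnify" in text.lower()
                           and "consideration" not in text.lower(),
    ),
]

def run_rule_pack(text: str) -> list[str]:
    """Return the IDs of every rule that fired against the text."""
    return [r.rule_id for r in RULES if r.check(text)]
```

Each entry's unit tests live next to it, and the version bumps when the cited source does; that discipline is what makes the finding citable years later.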
5. Build the cognitive router on top
Once the deterministic core exists, the model layer is the easy part. A small local model, a tool registry pointing at the deterministic functions, a JSON schema for arguments, a clarifying-question loop. Two weeks of work, perhaps less. This is the part that the demo videos show. It is the smallest and least durable part of the engagement.
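The deterministic half of that router is just a registry and a dispatch function. In this sketch the model's job (mapping free text to a tool name and raw arguments) is assumed to have already happened; everything after it is plain code, including the clarifying-question loop:

```python
# Sketch of the router's deterministic half: a tool registry plus
# dispatch with a clarifying-question loop for missing arguments.
def size_feeder(amps: int, run_feet: float, voltage: int) -> str:
    return f"feeder sized for {amps} A over {run_feet} ft at {voltage} V"

TOOL_REGISTRY = {
    "size_feeder": {
        "fn": size_feeder,
        "required": ["amps", "run_feet", "voltage"],
    },
}

def dispatch(tool_name: str, raw_args: dict) -> dict:
    tool = TOOL_REGISTRY.get(tool_name)
    if tool is None:
        return {"clarify": f"No tool named {tool_name!r}; rephrase the request."}
    missing = [k for k in tool["required"] if k not in raw_args]
    if missing:
        # Ask only for what is missing, not for everything again.
        return {"clarify": f"Please provide: {', '.join(missing)}"}
    return {"result": tool["fn"](**raw_args)}
```

Note what the registry buys: the model can only ever name a tool, never invent one, and a bad tool name degrades to a clarifying question rather than an action.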
6. Wire up the audit trail and the signed verdict
Every workflow run produces a structured audit log. Every output gets a signed verdict naming the rule pack version, the input hash, and the findings. The customer can cite the output, prove it was the output, and demonstrate which rules fired and why. This is the part that turns a useful tool into a tool that survives contact with a regulator, a court, or an unhappy auditor.
7. Hand off, document, leave
The deliverable is the deterministic asset, not the consultant's continuing presence. The customer should be able to operate, extend, and audit the system without you. The documentation should be such that the next engineer (yours or theirs) can read it and continue the work. The model behind it is replaceable. The library, the rule packs, and the update scripts are not.
Notice what is not on this list. There is no "build an agent that handles customer support autonomously." There is no "let the model run overnight and we will see what it does." There is no chatbot that touches a production database without a validator between them. The job is the deterministic asset. The model is a layer on top. The asset compounds; the model gets deprecated.
What a real engagement looks like
Imagine a regional hospital system that wants to "use AI" to help patients understand their medical bills. The standard pitch from a competitor is "we will deploy a chatbot fine-tuned on healthcare data; patients ask questions and it answers." The standard outcome is the Air Canada outcome, plus HIPAA.
The Forward Deployed Engineer's version of the engagement is different.
Month one is a workflow inventory. The hospital's billing team explains what patients actually ask. The team writes down the twenty-five most common questions. Each question is mapped to the deterministic resource that answers it: the CPT code set, the modifier list, the EOB grammar, the financial-assistance policy, the price transparency rule, the explanation of how to read an itemized statement.
Months two and three are the deterministic asset. Each of the twenty-five questions gets a function. Each function has unit tests, a citation, a rule pack, and an update script. The CPT code set updates annually; the cron job pulls it. The hospital's price book updates quarterly; the cron job pulls it from the hospital's existing data warehouse. The financial-assistance policy updates when policy updates; a human reviews and bumps the version.
Month four is the cognitive layer. A small local model (running in the patient's browser, no PHI leaves the device) reads the patient's question, picks the right tool, asks for the missing details, runs the tool, and renders the answer through a deterministic template. The signed verdict says which tool ran, which rule pack version, and what the audit trail was. The patient sees a clear answer with citations. The hospital sees a tamper-evident log.
Month five is hardening, observability, and handoff. The hospital's billing team can audit every interaction. The Forward Deployed Engineer leaves. The system keeps running. The model behind it will be replaced three times in the next ten years; the rule packs and update scripts will keep working through every replacement.
This is what real Forward Deployed Engineering looks like. It is slower than the demo. Over one year it costs about the same as the demo; over three years, a fraction of it. It produces an asset the hospital owns rather than a vendor dependency the hospital rents.
9. The 24/7 Problem
A useful thought experiment. Imagine an organization that runs its day-to-day operations entirely through probabilistic agents. No deterministic checks. The agents read email, write replies, make commitments, transfer funds, schedule employees, and update records. They run continuously. They never sleep. Their failure rate is one percent per decision.
Suppose the organization makes one hundred decisions per day. At a one percent failure rate that is one bad decision per day. Some of those bad decisions are harmless (a customer is offered a slightly wrong support article). Some are inconvenient (a meeting is scheduled in the wrong timezone). Some are expensive (a wire transfer goes to the wrong account). Some are catastrophic (an invoice is paid that should not have been, or a contract clause is agreed to that should not have been).
Now scale the thought experiment. A medium-sized enterprise makes more than one hundred decisions per day; many make tens of thousands. A one percent failure rate over ten thousand decisions per day is one hundred bad decisions per day. Of those, perhaps one will be catastrophic. Over a year, three hundred and sixty-five catastrophic outcomes.
The math is approximate. The point is exact. An organization that operates without deterministic outer loops has accepted a steady-state rate of catastrophic outcomes proportional to its decision volume and its model's failure rate. The organization may not know this. The deployer of the system may not have told them. The board may have approved an AI strategy without anyone walking through this arithmetic. It is, nevertheless, the deal.
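The thought experiment's arithmetic, made explicit. The one percent failure rate and the one-in-a-hundred catastrophic share are the essay's assumed figures, not measurements:

```python
# The steady-state arithmetic of an organization running without
# deterministic outer loops, using the essay's assumed rates.
decisions_per_day = 10_000
failure_rate = 0.01          # one bad decision per hundred decisions
catastrophic_share = 0.01    # one catastrophe per hundred bad decisions

bad_per_day = round(decisions_per_day * failure_rate)                  # 100
catastrophic_per_year = round(bad_per_day * catastrophic_share * 365)  # 365
```

Every term is adjustable; the structure is not. Halving the failure rate halves the catastrophes, but only a deterministic floor takes a class of decision out of the probabilistic column entirely.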
This is not a hypothetical objection. It is the reason aviation runs the way it does. A commercial pilot does not improvise the engine start sequence; they read the checklist. The checklist is deterministic. The reason it is deterministic is that the failure mode of an improvised start sequence is loss of the aircraft. The checklist is decades of accumulated, deterministic, version-controlled lessons. The deterministic layer is the operational substrate; the human's probabilistic judgment is reserved for the situations the checklist does not cover.22
The same pattern explains why every safety-critical industry has deterministic floors. Nuclear has them. Medicine has them (every hospital has a code-blue protocol; nobody invents a resuscitation from first principles in the moment). Finance has them (every trading firm has position limits, kill switches, and trade-rate guards). The deterministic floor is not a substitute for skilled humans; it is the substrate that lets the humans focus on the cases that are genuinely novel without having to relitigate the routine ones.
A safety-critical industry without a deterministic floor is not a safety-critical industry. It is an industry that has not yet had its bad day.
The AI industry has been operating without a deterministic floor in customer-facing applications for a few years now and the bad day has already started arriving on schedule. Moffatt versus Air Canada was a small one. The next several will be larger. The question for any organization considering an AI deployment is not "what is the upside if the model works well." It is "what is the cost when the model gets one wrong, multiplied by how often we should expect that to happen, divided by the cost of the deterministic outer loop we are choosing not to build." The arithmetic almost always favors the outer loop.
10. How Humans Actually Run
One of the satisfying things about this architecture is that it is not new. Humans have been running it for as long as there have been humans. The hardware is just biological.
Consider an experienced surgeon. The surgeon does not improvise an appendectomy. The steps are a deterministic protocol; the anatomy is a deterministic map; the instruments are laid out in a deterministic order; the time-out before incision is a deterministic checklist mandated by the World Health Organization since 2008.23 The surgeon's probabilistic judgment is reserved for the moments when something unexpected happens (a vessel in an unusual location, a complication, an instrument failure). The deterministic layer holds the routine so the probabilistic layer can attend to the novel. The result is a surgery completed in forty-five minutes that would, without the deterministic layer, take the same surgeon four hours and produce a worse outcome.
Consider an experienced cook. The cook does not measure salt during sauteing; that part is automatic, deterministic, and lives in the cerebellum. The cook does, however, taste the sauce at the end and adjust; that part is probabilistic, conscious, and lives in the prefrontal cortex. The boundary is so well-tuned that the cook is not even aware of it. The architecture is doing its job.
Consider an experienced accountant. The accountant does not compute the depreciation schedule by hand; they use software, which is deterministic. They do, however, judge whether an asset qualifies as a capital expenditure or an operating one; that is probabilistic, requires reading the relevant regulation in context, and is exactly the work the software should not be doing on its own.
Consider any expert in any field. The pattern repeats. Deterministic procedures handle the routine. Probabilistic judgment handles the novel. The expert who tries to do everything probabilistically burns out and makes more errors than the expert who has internalized the procedures. The expert who tries to do everything deterministically is rigid and fails the moment something unusual happens.
Daniel Kahneman gave this dual-process architecture its System 1 / System 2 framing in Thinking, Fast and Slow, drawing on decades of experiments he ran with Amos Tversky and others.6 Karl Friston modeled it as the free energy principle and gave it a precise mathematical form.8 The contemplative tradition has been describing it for over a thousand years in different language; Patanjali's Yoga Sutras distinguish the witnessing consciousness from the patterned modifications of mind that the consciousness observes.24 The architecture is older than the word for it.
The lesson for AI is that the architecture has been there all along; the engineering question is just how to build it in silicon. The honest answer is: more or less the way the brain does. Use the probabilistic substrate for the work it does well. Offload everything else to deterministic procedures that are cheap, reliable, and trainable. Make the boundary between them tight, fast, and inspectable. Spend most of the design effort on the boundary.
The deterministic offload tools humans already use
It is also worth noticing how many deterministic tools humans already use to offload work the brain is bad at. The list is long, and every item is a piece of evidence for the thesis.
- The calculator: arithmetic offloaded.
- The calendar: time offloaded.
- The checklist: procedure offloaded.
- The recipe: a cooking protocol offloaded.
- The shopping list: short-term memory offloaded.
- The map: spatial reasoning offloaded.
- The spreadsheet: bookkeeping offloaded.
- The clock: timekeeping offloaded.
- The thermostat: regulation offloaded.
- The traffic light: coordination offloaded.
- The contract: agreement offloaded.
- The law: norms offloaded.
- The standard: interoperability offloaded.
Every one of these is a deterministic system that frees the human brain to do the work the brain is genuinely good at. The list of things AI tooling can offload is exactly continuous with this list; the right way to think about AI engineering is as the continuation of a fifty-thousand-year project to make the prefrontal cortex less busy. The probabilistic substrate (the brain, the model) should be reserved for the irreducibly novel.
11. Honest Trade-Offs
This essay has argued strongly for one shape of architecture. Honesty requires acknowledging the cases where the shape is the wrong shape and where the trade-off goes the other way. The framework should be load-bearing, not religious.
When probabilistic-first is the right call
Open-ended creative work. If the deliverable is a poem, a brainstorm, a draft, or a variation on a theme, the right tool is the model. There is no deterministic rule pack for "make this email warmer." The variance is the value. Lint at the boundary (no slurs, no fabricated facts, no personally identifying information) but let the model do its work in the middle.
Long-tail support questions. A customer support workflow handling the top one hundred questions deterministically and routing the rest to a model is the right shape. The deterministic layer covers the routine, the model handles the novel, and the cost-of-being-wrong is bounded by the kind of question (returns policy, not legal commitment).
Pure exploration. Research, ideation, "what should I learn next," "what are the analogous failures in other industries." Open-ended cognition has no deterministic answer; the right move is to talk to the model and treat its output as a starting point for human verification.
When the cost of being wrong is small and reversible. A spam filter that mis-classifies one in a thousand emails is a usability issue, not a catastrophe. The user moves the email back to the inbox. The deterministic outer loop adds overhead that does not earn its keep. Probabilistic is fine.
When the deterministic rule pack does not exist and would take longer to build than the lifetime of the use case. Be honest about this one. Sometimes the workflow is novel enough or temporary enough that writing the rule pack is not worth it. A model with a human in the loop is the right tool for the temporary case. Do not pretend the rule pack is coming if you do not intend to write it.
What the deterministic-first architecture costs
Front-loaded labor. The deterministic asset is expensive to build. Months, not weeks. The customer pays in time before they see the payoff. This is the largest practical objection to the architecture and it is real.
Narrowness. The deterministic workflow handles the workflows it was built to handle. It does not generalize. Adding a new workflow is more work than asking a model a new question. The breadth of "the model can do anything" is real and the deterministic-first architecture gives some of it up.
Maintenance. Every data source the deterministic logic depends on has to be kept fresh. Every rule has to be revisited when the underlying source changes. The update scripts and the regression tests are not free.
The boundary work is real engineering. The cognitive router and the output linter are the most novel parts of the system and the parts most likely to have subtle bugs. A bad router picks the wrong tool; a bad linter passes outputs that should have failed. Spend the effort on the boundary.
The risk of false confidence. A signed verdict is not a guarantee of correctness; it is a guarantee that a specific rule pack version was run and produced a specific finding. If the rule pack is incomplete, the verdict can be confidently wrong. Determinism is a property of the procedure, not of the world. The rule pack is still a human artifact and is still subject to error; the difference is that the error is now citable, reproducible, and fixable, not nebulous.
The non-tradeoff
One thing that is not a trade-off, and is worth stating cleanly: using a deterministic outer loop does not reduce the user's experience of fluency. The user still types in natural language. The model still does the routing. The output is rendered through a template that reads well. The user does not see the architecture and does not need to. The deterministic layer is invisible to them. It is invisible to them in the same way a hospital's surgical safety checklist is invisible to the patient on the table; the patient does not need to know the procedure exists for the procedure to be saving their life.
12. The Manifesto
The Forward Deployed Engineer of the AI era has the most interesting job in software. They sit at the boundary where probabilistic generalists meet deterministic specifics, where vendor capabilities meet customer realities, where the demo meets the production system. The job is not to be a model whisperer. The job is to build, with the customer, the deterministic asset that survives the next ten model deprecations.
Here is the position, restated, for anyone who skipped to the end.
Probabilistic AI gets cognition. Intent extraction. Routing. Translation. Drafting. Summarization. Classification with fuzzy edges. The model does the work it does well. The model is the cortex.
Deterministic libraries get verification and execution. Math. Rule application. Schema enforcement. Cryptographic operations. Engineering calculations. Regulatory checks. Output validation. The libraries are the cerebellum. They are cheap to run. They are inspectable. They are citable. They compound.
The boundary between them is the architecture. Argument validators on the way in. Output linters on the way out. Signed verdicts that downstream consumers can verify. Audit trails that survive contact with regulators and courts. The boundary is where the engineering effort goes.
The Forward Deployed Engineer's job is to build the deterministic asset. Workflow inventory. Rule packs. Update scripts. Validators. Audit trails. The model layer goes on top in the last two weeks of the engagement. The asset is the deliverable. The asset is what the customer owns. The asset is what compounds.
Open source the asset. If the asset is genuinely useful (and most deterministic rule packs for genuinely useful workflows are), share it under a permissive license. Vaulytica's contract checks. Sophie Well's drug calculators. Rough Logic's field math. Encrypt A Lotta's crypto utilities. Each of these is a deterministic public good. None of them depended on a specific model. Each will be useful for decades. None of them phones home. None of them tracks the user. None of them is for sale.
Stay humble about what AI is. The model is not a colleague. It is a tool. A useful one. An expensive one. A probabilistic one. The model does not know what it does not know. The deterministic outer loop is the thing that does.
If a customer asks you to deploy an autonomous agent into a production workflow without a deterministic outer loop, say no. Show them the Moffatt v. Air Canada ruling. Show them the lawyers who got sanctioned. Show them the Knight Capital arithmetic. Offer them the architecture in this essay instead. Some will hire you. The ones who do not were going to learn the same lesson the expensive way regardless. You are not obligated to be there when they do.
The work is patient. The work is unglamorous. The work compounds.
Go forward. Get deployed. Build the deterministic asset. Use the model where the model is useful. Be honest about where it is not. Leave the customer with something they own. Leave the open-source ecosystem with one more rule pack the next person does not have to write. Leave the field with one more proof that this is how the work should be done.
The probabilistic substrate is for cognition. The deterministic libraries are for verification and execution. The boundary between them is the architecture. Build the boundary. Then go home.
Would you like to work together?
Forward deployed engineering, the way this essay describes it. Let's build something that compounds.
Get in touch
1 Gergely Orosz, "What are Forward Deployed Engineers, and why are they so in demand?" The Pragmatic Engineer, 2024. Documents the origin of the role at Palantir as "Delta," its prevalence at the company prior to 2016, and its subsequent adoption across the AI industry. See also "A Day in the Life of a Palantir Forward Deployed Software Engineer," Palantir Blog.
2 The formal definition of determinism in computation is given in any introductory text on computability, but Hopcroft, Motwani, and Ullman's Introduction to Automata Theory, Languages, and Computation (3rd ed., 2006) gives the canonical treatment. A function is deterministic in the sense used here when it can be expressed as a deterministic Turing machine.
3 Vaswani et al., "Attention Is All You Need." NeurIPS 2017. The transformer architecture that underlies all modern large language models samples the next token from a softmax distribution over the vocabulary, conditioned on the prior tokens and the learned weights. The probabilistic nature of the output is baked into the architecture, not an artifact of the implementation.
4 Raichle, M.E. and Gusnard, D.A. "Appraising the brain's energy budget." Proceedings of the National Academy of Sciences, Vol. 99, No. 16, 2002, pp. 10237-10239. The 20 percent figure is the consensus value across multiple imaging modalities and traces back to the Kety-Schmidt nitrous oxide measurements of cerebral blood flow and metabolism in the late 1940s; Sokoloff's later [14C]deoxyglucose method (1977) extended the measurement to regional resolution. See also Attwell, D. and Laughlin, S.B. "An energy budget for signaling in the grey matter of the brain." Journal of Cerebral Blood Flow and Metabolism, Vol. 21, No. 10, 2001, pp. 1133-1145.
5 Raichle, M.E. et al. "A default mode of brain function." PNAS, Vol. 98, No. 2, 2001, pp. 676-682. The original identification of the default mode network and its baseline energy consumption.
6 Kahneman, Daniel. Thinking, Fast and Slow. Farrar, Straus and Giroux, 2011. The popular synthesis of decades of research on dual-process cognition. The "System 1 / System 2" terminology originates earlier in Stanovich and West (2000) but Kahneman's framing is the one that traveled.
7 Estimates of frontier model inference cost are unstable but the order of magnitude is documented in multiple industry reports. Patterson et al., "Carbon Emissions and Large Neural Network Training," 2021, gives an early baseline; the Stanford AI Index reports from 2023, 2024, and 2025 document the scaling.
8 Friston, K. "The free-energy principle: a unified brain theory?" Nature Reviews Neuroscience, Vol. 11, 2010, pp. 127-138. Friston's model is controversial in its strongest formulation but the weaker claim (that the brain operates by minimizing prediction error against an internal generative model) is broadly accepted.
9 Vaswani et al., op. cit. The original transformer paper trained the architecture on English-to-German translation; the broader applicability to language modeling was discovered subsequently.
10 Vectara's Hallucination Evaluation Model (HHEM) Leaderboard, accessible at huggingface.co/spaces/vectara/leaderboard. As of mid-2025, frontier models on the next-generation benchmark exceed 10 percent hallucination rates on grounded summarization, with "reasoning" models often performing worse than non-reasoning counterparts.
11 Frontier vendors (Anthropic, OpenAI, Google) have all added tool-use APIs that allow the model to call out to a deterministic calculator. The capability was added in 2023-2024 across the major providers and is now considered standard.
12 See Willard, B. and Louf, R., "Efficient Guided Generation for Large Language Models," 2023, for the technical basis of grammar-constrained sampling. OpenAI's "Structured Outputs" feature and Anthropic's tool-use schema enforcement implement variants of the same idea.
13 Vaulytica is an open-source deterministic contract checker available at vaulytica.com. About eighty rules across ten categories, every finding tied to a rule ID and a dataset version, no model in the verification path. MIT licensed.
14 Rough Logic, available at roughlogic.com, implements three hundred forty-two calculators for the trades, every formula computed from public physics or public-domain data. The project serves as a demonstration that "deterministic library as public utility" is a viable category at scale.
15 The audit-trail pattern is borrowed from the FDA's 21 CFR Part 11 and EU Annex 11 requirements for electronic records in pharmaceutical manufacturing, which mandate tamper-evident audit trails for any automated decision affecting patient safety. The pattern generalizes well outside pharma and is the right shape for any AI-mediated workflow with consequential outputs.
16 Moffatt v. Air Canada, 2024 BCCRT 149. British Columbia Civil Resolution Tribunal, decision issued February 14, 2024. The tribunal held Air Canada liable for negligent misrepresentation arising from incorrect bereavement-fare information given to the plaintiff by the airline's chatbot. Damages of CAD $650.88 plus interest and fees.
17 Securities and Exchange Commission Order, In the Matter of Knight Capital Americas LLC, Release No. 70694, October 16, 2013. Documents the August 1, 2012 incident in which a software deployment activated dormant code, causing approximately $460 million in pre-tax losses over 45 minutes and effectively ending the firm. The deterministic guards that would have prevented the outcome (position limits, order-rate circuit breakers) were not in place.
18 The named small models are Phi-3 (Microsoft), Gemma 2 (Google), and Qwen 2.5 (Alibaba). All are open-weight, all run on consumer hardware, all are suitable for in-browser deployment via transformers.js or WebLLM as of 2025.
19 Xenova's transformers.js library and the MLC-AI WebLLM project both demonstrate that small language models can be run entirely in the browser using WebGPU, with model weights cached locally and no server-side inference. The privacy and latency properties follow directly: no request data ever leaves the device.
20 Bernstein, D.J. et al. "High-speed high-security signatures." Journal of Cryptographic Engineering, Vol. 2, No. 2, 2012, pp. 77-89. Ed25519 provides 128-bit security with 64-byte signatures and sub-millisecond verification on modern hardware, suitable for signing every workflow verdict produced by a deterministic linter.
21 See "The Kinetic Execution Firewall" (claygood.com/essays/the-kinetic-execution-firewall) for the worked example in the robotics domain. The pattern (probabilistic cognition upstream, deterministic firewall in the middle, signed actuation downstream) is the same one defended here, applied to motors instead of text.
22 Gawande, Atul. The Checklist Manifesto: How to Get Things Right. Metropolitan Books, 2009. The canonical popular treatment of why deterministic procedural floors outperform expert improvisation in safety-critical settings, with extensive case studies from aviation, surgery, and construction.
23 World Health Organization Surgical Safety Checklist, 2008. A 19-item checklist deployed across hospitals in eight pilot countries reduced surgical mortality by approximately 47 percent in the validation study. Haynes, A.B. et al. "A Surgical Safety Checklist to Reduce Morbidity and Mortality in a Global Population." New England Journal of Medicine, Vol. 360, 2009, pp. 491-499.
24 Patanjali, Yoga Sutras. Scholarly dating varies between the 2nd century BCE and the 4th century CE; recent academic consensus (Maas and others) favors a date around 400 CE. Book I, Sutras 2-4 distinguish between the witnessing consciousness (purusha) and the patterned modifications of mind (chitta-vritti) that the consciousness observes. The contemplative architecture predates the engineering one by roughly a millennium and a half.
Note. This essay is a working document. The author welcomes correction. Send notes to hi@claygood.com.