Multi-agent systems shouldn't gossip. They should share a case file.
Multi-agent LLM systems commonly couple specialised agents through natural-language messages. One agent produces a textual result. Another agent reads that result, re-encodes it in its own vocabulary, and produces its own textual output. I argue that this is an architectural anti-pattern, not an intended feature.
Each natural-language re-encoding step is a lossy compression: it erases types, structure, and provenance. It introduces semantic drift that accumulates with every hop. It makes individual agents untestable because their inputs and outputs are strings. And it multiplies the token budget for anything non-trivial.
I propose the Clipboard Pattern: a shared typed state object,
enforced by a TypedDict schema, that flows through a sequence of
specialist nodes within a cognitive unit (a LangGraph). Each specialist reads the
fields it needs, writes its results into structured fields, and hands the clipboard
to the next specialist. No message passing. No re-encoding.
This article shows how the pattern is implemented in Novaberg's graphs, how a three-level taxonomy — Roles, Departments, Graphs — keeps composition clean at scale, and why a single typed state object is the technical expression of the principle: transport data completely, let the consumer format.
A case file moves from desk to desk. The litigator reads the facts, drafts a legal position, slides the file to the paralegal. The paralegal pulls the precedents, adds them to the file, slides it back. Nobody writes emails to each other. Nobody summarises the file in their own words. The file is the truth. Each person contributes a section. At the end, the file is the output.
Now look at most multi-agent LLM systems in the wild. An agent generates a paragraph. Another agent reads that paragraph, writes its own paragraph back. A third agent writes about both paragraphs. By the time the workflow ends, nobody can tell you what happened, and the tokens you paid for were spent on three agents each rewriting the previous one's summary in their own voice.
Put the two pictures side by side. One is obviously how serious work happens. The other is a children's game dressed up as orchestration. The rest of this article is a defence of the first picture and a critique of the second — grounded in code, in a working system, and in the pragmatic admission that this pattern is not new. It is just not named.
```python
# Pattern A — agent-to-agent messaging
result = agent_b.invoke(agent_a.invoke(user_message))

# Pattern B — clipboard
state = agent_a(state)
state = agent_b(state)
```
The default pattern in popular multi-agent frameworks — CrewAI, AutoGen, and most "agent-supervisor" templates in the LangChain ecosystem — is this: specialised agents communicate by sending each other natural-language strings. One agent's output becomes another's prompt. The framework takes care of scheduling and handoffs. The application author writes roles and instructions; the agents do the rest.
This is the pattern Marina Wyss documents carefully in her 2025 overview of the field, "AI Agents: Complete Course" — a 150-page synthesis by a senior practitioner at Amazon. She devotes a chapter to "Communication Pitfalls," and she is not wrong about what they are. What I disagree with is the proposed remedy. The remedy offered by the consensus is better prompts and clearer roles. I think the remedy has to be architectural. The pitfalls are not behaviours the paradigm accidentally produces; they are what the paradigm is.
Every time an agent reads another agent's text and writes its own, something subtle changes. The receiving agent has its own vocabulary canon, its own sense of what matters, its own training-data-induced priors. It restates the claim in its own voice. That voice is not identical to the source. After two hops, the error is still small. After five, the original intent is a rumour.
In one system I audited, a legal-review agent was asked to verify a contract. What it received was not the contract. It was a compliance officer's summary of the contract. The summary had dropped the termination clause. The legal-review agent dutifully found nothing wrong. Nobody was at fault. The architecture was at fault.
Every natural-language hop costs tokens. The receiving agent must read the sender's text and generate its own. If four agents are in the loop, you pay for four re-encodings of roughly the same information. A typed field that says `confidence: 0.83` costs a few tokens in the prompt and nothing to re-encode. A paragraph that says "I'm fairly confident, maybe eighty percent or so" costs fifteen tokens to write and fifteen to read, and tells the next agent less.
An agent whose input is a free-form string cannot be unit-tested in any meaningful way. You can assert that the output "mentions termination" — and that assertion passes or fails on the whim of the generation. There is no contract to enforce and no fixture to pin. Production LLM work is work; it deserves tests. Strings-in, strings-out is the opposite of tests.
A typed field, by contrast, is a thing you can assert about. `state["agent_results"][0].status == "abgeschlossen"` either holds or it does not. The node that should produce it can be driven by fixture input and verified by fixture output. The gods of CI smile.
A trail of natural-language exchanges is, technically, an audit trail. Practically it is an archaeology site. Who decided to include the precedent? Why was the termination clause dropped? You can read the exchange and guess. A typed state, written to by named nodes with declared write-sets, lets you answer those questions in O(1): the field was set by this node at this graph step.
| Pattern | Communication | Symptom |
|---|---|---|
| Agent-to-agent messages (CrewAI · AutoGen) | Text strings | Semantic drift, token cost, string-in I/O |
| Supervisor + tool-handoff (LangChain · ReAct) | Text + implicit tool args | Better, but the arg schema is implicit |
| Shared memory blocks (Letta / MemGPT) | LLM-edited memory | Flexible, non-deterministic |
| Clipboard (Novaberg · LangGraph) | Typed state + dispatch | Deterministic, testable, no drift |
None of this is a claim that text-messaging is never the right choice. It is a claim that text-messaging as the default coupling between specialists is expensive, fragile, and opaque. The default has to change.
The Clipboard Pattern is not a framework. It is a discipline. In Python, the
discipline is enforced by a TypedDict and by the runtime guarantees of
a state graph like LangGraph. The rules are three.
First, there is exactly one state object per cognitive unit — one clipboard per case, if you like. Second, every node declares, by convention, which fields it reads and which it writes; a reviewer reading the code can produce the dependency graph of the pipeline without running it. Third, there are no messages. Nodes do not talk to each other. They talk to the state.
```python
from typing import TypedDict

class ConversationState(TypedDict):
    user_input: str
    intent: str
    current_emotion: str
    needs_memory: bool
    memory_context: str
    agent_name: str                   # one agent per turn
    agent_results: list[AgentResult]  # audit trail; AgentResult is Novaberg's result record
    response: str
```
`ConversationState` in Novaberg carries over sixty fields — each with a single declared writer and one or more readers.
The crucial discipline is not in TypedDict itself — Python will happily
run you off a cliff — but in the code-review culture around it. A node function has
the shape State → State. It receives the clipboard, modifies its own
fields, and returns. That is the entire contract. Consider the planner that routes
work inside the character's graph:
```python
def planner_node(state: ConversationState) -> ConversationState:
    # reads:  intent, management_target, agent_results
    # writes: agent_name (empty when nothing left to do)
    state["agent_name"] = next_agent_to_run(state)
    return state


def agent_dispatch_node(state: ConversationState) -> ConversationState:
    # reads:  agent_name, state (as a whole — the clipboard)
    # writes: agent_results (+1), agent_name = ""
    dispatch = find_dispatch(state["agent_name"])
    state["agent_results"].append(dispatch(state))
    state["agent_name"] = ""
    return state


# LangGraph conditional edges:
#   planner        -> agent_dispatch   if agent_name != ""
#   planner        -> responder        if agent_name == ""
#   agent_dispatch -> planner          (loop: planner decides anew)
```
Note what is missing: there is no `agent_queue`. No batch. The planner makes exactly one decision per pass — "which agent, if any" — and the dispatch runs it. Multi-agent turns happen by iteration, never by a pre-filled list; later agents can build on the results of earlier ones, and the planner exits the moment the answer is ripe.
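Those three comment lines translate almost one for one into LangGraph's API. A minimal wiring sketch, assuming the two node functions above plus a hypothetical `responder_node` that writes `state["response"]`:

```python
from langgraph.graph import StateGraph, END

graph = StateGraph(ConversationState)
graph.add_node("planner", planner_node)
graph.add_node("agent_dispatch", agent_dispatch_node)
graph.add_node("responder", responder_node)  # hypothetical: writes state["response"]

graph.set_entry_point("planner")
graph.add_conditional_edges(
    "planner",
    lambda state: "agent_dispatch" if state["agent_name"] else "responder",
)
graph.add_edge("agent_dispatch", "planner")  # the loop: planner decides anew
graph.add_edge("responder", END)

app = graph.compile()
```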
Two things deserve to be said out loud. The difference from the text-messaging pattern is not stylistic; it is type safety. The difference from a queue-based scheduler is not stylistic either; it is observability. Each iteration of the loop is a distinct graph step with a full state snapshot — you can pause, inspect, replay. A queue would hide the same decisions inside a function's local scope.
The word "dispatch" in agent_dispatch_node hides a small but consequential
pattern. A specialist agent lives inside its own world — its own vocabulary, its own
subgraph, its own state schema. The planner cannot know any of that. The planner only
knows the outer ConversationState. Something must translate between
the outer clipboard and the agent's inner one. That something is the dispatch.
One of the principles I have kept coming back to, in design notes, is this:
Every department ships its own dispatch. The notes agent has a
dispatch.py that unpacks the parts of ConversationState it
needs — the current intent, the user's emotion, the relevant memory fragments — into
a department-local AgentState, runs the agent, and folds the
AgentResult back into the outer clipboard. No central router touches
it. Plugin-shaped by construction.
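A sketch of the shape such a `dispatch.py` takes. The field selection, the `AgentState` layout, and the `notes_subgraph` name are illustrative stand-ins for Novaberg's actual definitions:

```python
from typing import TypedDict


class AgentState(TypedDict):
    # Department-local clipboard: only what the notes agent needs.
    intent: str
    emotion: str
    memory_fragments: str
    result: AgentResult | None


def dispatch(state: ConversationState) -> AgentResult:
    """Unpack the outer clipboard, run the subgraph, hand the result back."""
    inner: AgentState = {
        "intent": state["intent"],
        "emotion": state["current_emotion"],
        "memory_fragments": state["memory_context"],
        "result": None,
    }
    final = notes_subgraph.invoke(inner)  # the department's compiled LangGraph
    return final["result"]  # agent_dispatch_node folds this into agent_results
```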
```
agents/notizen/
├── agent.py           # subgraph wiring (LangGraph nodes)
├── klassifikation.py  # classify — recognise the action
├── suche.py           # resolve — find the target entry
├── crud.py            # crud — create/read/update/delete
├── resume.py          # resume — continue after a clarification
├── bestaetigung.py    # confirm — phrase the result for the responder
├── dispatch.py        # ConversationState ↔ AgentState
├── init.sql           # schema (bi-temporal, soft-delete indexed)
├── __init__.py        # package marker for auto-discovery
└── AGENT.md           # capabilities, triggers, tests
```
Adding a new agent means making a new directory with its own dispatch.py
and its own init.sql. No central router code is touched. Auto-discovery
finds the dispatch the same way it finds the agent. The schema lives with the agent,
not in a monolithic migrations file — because the department that owns the behaviour
also owns the storage.
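Auto-discovery can be as small as a package walk. A sketch, assuming every department sub-package exposes a module-level `dispatch` callable; Novaberg's actual mechanism may differ in detail:

```python
import importlib
import pkgutil
from collections.abc import Callable

import agents  # the package holding one sub-package per department


def discover_dispatches() -> dict[str, Callable]:
    """Map department name -> dispatch callable by walking agents/<name>/dispatch.py."""
    found: dict[str, Callable] = {}
    for info in pkgutil.iter_modules(agents.__path__):
        if info.ispkg:  # each department is a sub-package
            module = importlib.import_module(f"agents.{info.name}.dispatch")
            found[info.name] = module.dispatch
    return found


def find_dispatch(agent_name: str) -> Callable:
    return discover_dispatches()[agent_name]
```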
The dispatch resolves the boundary between the outer clipboard and the inner one. But there is a deeper boundary inside the agent itself: the boundary between what the user said and what the specialist needs to hear. Crossing it cleanly is what turns a domain agent from a clever prompt into a reliable piece of software.
A patient walks into a clinic. He tells the receptionist: "Something feels off, my chest gets tight when I climb stairs, and I've been dizzy all week." The receptionist does not diagnose. She routes — cardiology, not dermatology. That is the Router. In the cardiology intake, a nurse takes the patient's words and fills out an admission form: dyspnoea on exertion, vertigo, seven days, no prior cardiac history. That is the Classify node. It translates natural speech into the department's professional vocabulary.
The cardiologist reads the admission form. He does not ask the patient to tell the story again. He reads structured observations in his own language and decides what comes next. That is the CRUD. The patient's original words — the worry, the phrasing, the tone — are not lost. They sit in their own fields on the clipboard, read by the responder when it is time to answer the patient in the patient's language. But the specialist never sees them. The specialist works on validated, domain-specific data. Two layers inside the same state object, consumed by different nodes.
Novaberg calls this domain language normalisation. Every
department declares its own professional vocabulary — the actions it recognises,
the entities it operates on, the shape it expects. The Classify node translates
the user's natural sentence into that vocabulary, inline, as part of its existing
LLM call. No extra round-trip. No separate NLU pipeline. The output is a single
structured field, normalised, that the downstream roles can rely on.
```
User says:  "Throw the bananas off the list."

Classify:   action     = remove_content
            target     = "Shopping list"
            normalised = "remove_content: remove bananas
                          from note 'Shopping list'"

CRUD reads: action, target, normalised
Responder:  reads user_input — preserves tone in reply
```
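Producing those fields is a single structured-output call inside the Classify node, not a separate NLU pass. A sketch assuming a LangChain chat model; the schema and prompt are illustrative, not Novaberg's:

```python
from pydantic import BaseModel
from langchain_openai import ChatOpenAI


class Classification(BaseModel):
    action: str      # one of the department's declared actions, e.g. "remove_content"
    target: str      # the entity the action applies to
    normalised: str  # the request restated in the department's vocabulary


llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(Classification)

result = llm.invoke(
    "Translate the user's request into the notes department's vocabulary.\n"
    'Request: "Throw the bananas off the list."'
)
# Expected shape (values are filled by the model):
#   result.action     -> "remove_content"
#   result.target     -> "Shopping list"
#   result.normalised -> "remove_content: remove bananas from note 'Shopping list'"
```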
The same principle from the other side. An architect tells a client, "We'll use oak beams, stained in a warm grey." The client nods. But the structural engineer who picks up the file does not read "warm grey." He reads: load-bearing capacity 14 kN/m, span 4.2 m, cross-section 120 × 240 mm. Architect and engineer share a file. They do not share a vocabulary. The file carries both layers — the client-facing description and the engineering specification — and each reader takes what belongs to them.
This is why a department has five specialised roles — not because the domain is complicated, but because each role needs to be relieved of the work that does not belong to it. A single agent asked to do all five in one prompt collapses under context load. Five roles hold five small contexts, each individually testable.
**Classify** translates. It bridges the gap between natural speech and a named action. It fills out the department's intake form in the department's professional language. Everything downstream operates on what Classify produced — not on what the user actually said.

**Resolve** finds. It locates the right entity — the right list, the right appointment, the right note — using fuzzy search with score-gap disambiguation and an embedding fallback. It never interprets intent. If the match is ambiguous, it flags a clarification rather than guessing.

**CRUD** executes. Exactly the operation that Classify named, on the entity that Resolve located. CRUD never reads the user's original words. Its input is entirely the structured fields written by its upstream peers. This is what makes the four-phase hardening — recognise, validate, execute, verify — possible at all.

**Resume** re-enters. When a clarification was needed last turn — "which list, household or office?" — Resume picks up the pending state on the next user input and restarts the subgraph from the point where it paused. The pause is not an error path; it is a first-class state the clipboard can hold.

**Confirm** phrases the outcome in domain-appropriate language for the Responder to weave into the reply. Confirm is the only role that produces natural language inside the agent, and it produces one sentence, for one consumer. The Responder still owns the user-facing voice; Confirm just hands it a clean fact.
Domain language is what makes a new department a small change instead of a large one. Adding a department means declaring a new vocabulary — the actions, the entities, a handful of translation examples — and dropping a new directory next to the others. No central code is touched. No coordinator has to learn the new domain. The department arrives with its own intake form, its own specialists, its own dispatch. The clipboard carries whatever they write.
So the clipboard is not just a data-transport mechanism. It is the translation boundary between human speech and machine-readable action — and it keeps both representations alive, side by side, for the different readers that need them.
Metaphors and schemas only go so far. Here is what actually runs in Novaberg when a
user types one sentence. Novaberg's event model splits a conversational turn across
two graphs: Path 1 — the HumanGraph — perceives the user's turn and persists it;
Path 2 — the CharacterGraph — generates the reply. They are decoupled by a Redis
event queue. The user gets an immediate 202 Accepted after Path 1 and
the reply arrives later over a WebSocket. Latency is unchanged; the graphs are free
to be specialised.
In the accompanying diagram, the clipboard (a single `ConversationState`) moves left to right through every node. No arrow carries a string between peers; every arrow is a state transition. The dashed rail between the paths is a Redis event; the two paths share a session but not a graph. Quality-control nodes (Thinker, Tribunal, Corrector) and the character's self-perception are omitted for clarity.
The sentence for this walkthrough: "Put butter on the shopping list." Perception extracts `intent="task"`, `topic="shopping"`, `emotion="neutral"`. The Enricher loads the session, short-term memory, long-term memory, and the character hash. EI-Calc runs on the user's stream: the user's emotion trajectory, the current affect vector, a plausibility check on the communication mode. Salience decides that "butter / shopping list" is worth remembering and adds a pending write for the short-term store. The Dispatcher persists the turn and emits an event onto the Redis queue. No agent call here. Path 1 perceives and stores. Done.
The character's graph picks up the event. Its own Enricher loads what it needs. The
Router sees management_action="update" and flags the notes department.
The Planner sets agent_name="notizen". Agent Dispatch translates the
outer clipboard into the notes agent's AgentState, runs the subgraph —
classify recognises an append action, resolve finds the shopping list by name using
fuzzy matching with an embedding fallback, CRUD appends "butter," confirm phrases the
result — and folds the AgentResult back into
agent_results.
The Planner now sees a fresh result and decides there is nothing more to do:
agent_name="". The Responder reads agent_results[0].ergebnis
and writes: "Done — butter is on the shopping list." At every point between Path 1
and the user seeing that sentence, the state was typed, readable, and pausable.
If you print the state after every node, you get a complete causal trace of the turn — not a transcript of what a model said about what another model said about what the user said. The clipboard is the conversation. The user-facing sentence is a side-effect.
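In LangGraph that trace is one call away: `stream_mode="values"` yields the complete state after every node. A sketch, with an illustrative input:

```python
for snapshot in app.stream(
    {"user_input": "Put butter on the shopping list"},
    stream_mode="values",
):
    # Each snapshot is the full ConversationState after one node:
    # the clipboard at each desk.
    print(snapshot.get("agent_name"), len(snapshot.get("agent_results", [])))
```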
Scale is where design decisions either compound or decay. If every new capability requires touching a central file, a system grows less maintainable per capability added. The Clipboard Pattern scales — in Novaberg at least — because the departments organise themselves around three recurring levels, not because any framework demands it. I want to be careful here. The three levels are a pattern the code has settled into, not a contract the runtime verifies.
Roles are recurring functional patterns that most workflow agents share. There are five of them, and they have converged unforced. Classify decides what kind of action is being requested. Resolve finds the target entity — the right shopping list, the right appointment, the right note. CRUD performs the data operation. Resume handles the case where the agent is being re-entered after a clarification turn ("which list? the household one or the office one?"). Confirm phrases the outcome in domain-appropriate language for the responder to weave into the reply.
Each role lives in its own file. The shape is a convention, not a subclass of anything. Workflow agents share the convention; agents whose work is shaped differently deviate, on purpose. The characterisation agents (the ones that update Nova's self-model) do not have a Resolve step; they have a different skeleton. That is fine. The taxonomy is descriptive, not prescriptive.
A department is a composition of roles, specialised for a domain. The notes department
(notizen), the timeline department, the character-identity department —
all share the five-role skeleton, each instantiating it with its own vocabulary,
its own storage, its own domain semantics. In Novaberg's current codebase there
are eleven agent directories, each with its own dispatch.
A graph is a cognitive unit. It orchestrates departments into meaningful workflows. Novaberg runs three compiled graphs: the HumanGraph (perceives and stores the user's turn), the CharacterGraph (decides and responds), and a smaller AgentGraph used by the background worker. Each graph has its own state schema — though ConversationState is shared between HumanGraph and CharacterGraph, carrying the same session forward.
I want to resist the temptation to turn this into a formal framework. The value is not in a Role base class or a Department registry. The value is that the repetition becomes visible. When you know a new agent will likely want classify/resolve/crud/ resume/confirm, you start with a template and override where the domain demands it. When you know a graph routes work by means of Planner → Dispatch → Planner → Responder, you recognise the shape immediately in a new codebase. The taxonomy is a reading aid first and a factoring hint second. It is not a type system.
A reader who takes nothing else from this article should take this: the Clipboard Pattern is the primitive. The three levels are what the code looks like after you apply the primitive for long enough.
Adopting the Clipboard Pattern changes what you can say about your system's behaviour. Not in abstractions — in concrete claims that either hold or do not.
A node function is State → State. If it is not stochastic on purpose —
most routing, planning, dispatch nodes are not — then it is deterministic. Feed it
the same clipboard twice, get the same clipboard twice. You cannot say that about an
agent-to-agent conversation; model temperature alone makes the same prompt produce a
different paragraph on the second run.
A node's test looks like: build a fixture state, invoke the node, assert on named fields of the output. Planner gets `agent_results[0].status == "abgeschlossen"` and should exit — does it? Dispatch receives `agent_name="notizen"` and should call the notes dispatch — does it? These are real, fast, robust tests. An agent-to-agent "assertion" that the output "mentions termination" is not.
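Concretely, with pytest. `make_state` is a fixture builder and `AgentResult(status=...)` stands in for the real constructor; both are assumptions, but the shape of the assertion is the point:

```python
def make_state(**overrides) -> ConversationState:
    """Fixture: a minimal, fully populated clipboard, overridable per test."""
    state: ConversationState = {
        "user_input": "", "intent": "", "current_emotion": "neutral",
        "needs_memory": False, "memory_context": "",
        "agent_name": "", "agent_results": [], "response": "",
    }
    state.update(overrides)
    return state


def test_planner_exits_after_completed_result():
    state = make_state(agent_results=[AgentResult(status="abgeschlossen")])
    out = planner_node(state)
    assert out["agent_name"] == ""  # nothing left to do: route to responder
```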
Fields are fields. A confidence: 0.83 is a number, not a paragraph.
Natural language appears exactly twice in the pipeline: once at the input (the user's
turn) and once at the output (the responder's reply). Every stage in between operates
on structured data. Token usage follows directly: one perception call, one response
call, plus whatever the agent's own internal work requires. Contrast with a four-agent
messaging system where every hop is a full round-trip.
LangGraph persists state snapshots between nodes. A production incident becomes a replay: load the snapshot for graph step 7, run the node in a debugger, watch the decision happen. The trail is not a transcript; it is the clipboard at each desk.
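A sketch of that replay workflow, assuming the graph is compiled with one of LangGraph's checkpointers:

```python
from langgraph.checkpoint.memory import MemorySaver

app = graph.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "session-42"}}

app.invoke({"user_input": "Put butter on the shopping list"}, config)

# Walk back through the persisted snapshots to the step under suspicion,
# inspect the clipboard, and resume from exactly there.
for snapshot in app.get_state_history(config):
    print(snapshot.next, snapshot.values.get("agent_name"))
```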
The most common objection I hear to the Clipboard Pattern is that it sacrifices emergent behaviour — the thing where two agents surprise you by producing a better answer through their exchange than either would alone. There is a grain of truth in this. For the kind of workflow I have described — a user turn that wants a grounded, correct, persistent response — emergent behaviour is a liability, not an asset. You do not want the compliance officer and the legal reviewer to improvise. You want them to read the file.
Where free agent conversation genuinely helps — brainstorming, debate, negotiation, roleplay — the Clipboard Pattern is the wrong tool. Build those with a messaging substrate and accept the token cost. But most systems advertised as "multi-agent" are pipelines. Calling a pipeline a conversation does not make it one. It just makes it expensive.
The clipboard pattern is not mine. It lives in any LangGraph codebase that takes state seriously, in any Rails or Django controller that keeps logic out of HTTP bodies, and — probably — in every law firm you have ever visited.
What I have done is give it a name, codify the discipline, and show what happens when you commit to it all the way down. Novaberg implements the pattern end-to-end. The source lives at codeberg.org/ClausVomBerg/Novaberg under Apache 2.0. If this argument has any force, you should be able to point at the code and find a line where it breaks. If you can, I would very much like to hear about it in the issues.
Next in this series: why an AI assistant needs its own emotions, and what goes wrong when it only mirrors yours.
The Clipboard Pattern sits at the intersection of several traditions. LangGraph itself is the runtime that makes it practical in Python. Elixir/OTP's GenServer messaging inspired the structured-payload discipline (via process isolation rather than shared state, but the principle is kin). React's unidirectional data flow is the closest analogue outside the LLM world.
The Amazon practitioner overview I referenced — Marina Wyss, "AI Agents: Complete Course" (Medium / Data Science Collective, December 2025) — is the most careful defence of the messaging paradigm I know, and worth reading even if you ultimately disagree. Her Complexity/Precision matrix for task selection is orthogonal to this argument and compatible with either paradigm. Her Memory taxonomy — dynamic and static — is a useful introduction but thinner than the five-layer memory system in Novaberg; that's material for a later paper.
Graphiti / Zep (arXiv 2501.13956) argues adjacent: structured memory beats text-based conversational history. langgraph-supervisor-py, and the pattern in JoshuaC215/agent-service-toolkit and cgoncalves94/multi_agent_system, move in the same direction as this paper without naming the pattern as such. If the name sticks, I hope it helps those efforts cohere.
| Objection | Response |
|---|---|
| Text messages are more flexible. | Yes — and in production, more flexible means less reliable. A typed state can always carry a free-form notes field when flexibility is truly needed. |
| Amazon practitioners use the messaging pattern. | They do, and they devote a chapter of its own to the "Communication Pitfalls". The problem is recognised; the remedy stays inside the paradigm. This paper proposes a paradigm shift, not a patch. |
| The state dict will grow enormous. | Discipline required. Novaberg's ConversationState has over sixty fields, code-reviewed, versioned, readable. It has not exploded in over sixty development sessions. |
| This is just shared memory. | No. Typed. Versioned through LangGraph's state snapshots. Orchestrated through a graph. No concurrent thread access — nodes run in a deterministic sequence. |
| What about emergent behaviour? | Covered in § 7. Pick the right tool. Most "multi-agent" systems are pipelines wearing the word conversation. |
| What if a node errors? | Every `AgentResult` carries a `status` field — `abgeschlossen`, `fehler`, `rueckfrage` (completed, error, clarification request). The responder sees the error and speaks to it. Nodes that raise are caught, logged, and routed to a degraded path. |
| What about streaming? | LangGraph + FastAPI's SSE support lets every node emit events. The state is not compromised by streaming the reply; streaming is a delivery detail. |