# Recursive Language Model Agent
The `RecursiveLanguageModelAgent` (also exported as `synalinks.RLM`) is a specialization of `CodeModeAgent` for tasks where the input itself is too large or too noisy to feed straight into the language model. Instead of packing a whole book, log dump, or scraped corpus into the primary LM's context, the agent treats those inputs as an external environment: it writes Python that programmatically slices, filters, and aggregates the data, and recursively delegates semantic work to a sub-LM on the snippets it actually cares about.
The pattern follows *Recursive Language Models* (Zhang, Kraska, Khattab, 2025).
## Why Recursive?
A long context is expensive on three axes that compound:
- Token cost scales linearly with the prompt size; a single 200K prompt can outweigh a hundred small ones.
- Latency scales linearly too, and provider rate limits often bottleneck on input tokens before output tokens.
- Accuracy regresses past a model-specific knee — the "lost in the middle" effect — even when the relevant span fits.
Two patterns are common at this scale:
- Stuff it all in. Hits the wall above. Easy to write, expensive to run, fragile on the recall side.
- Off-line preprocess + retrieval. Build an index, retrieve top-k chunks per query, ground the LM on the chunks. Fast at query time, but you've committed to a chunking strategy, an embedding model, and a retrieval ceiling before you know what the question is.
RLM is the third option. The primary LM never sees the long input; it sees only an `InputsSummary` (per-field previews and sizes). The full value lives in the sandbox under `inputs[field]`. The LM writes code that decides — per query, per turn — what slice matters, then calls a sub-LM on that slice. No fixed chunking, no fixed retriever, no context blowup on the primary LM.
```mermaid
flowchart TD
    A[Long input + Query] --> S[InputsSummary<br/>previews + sizes only]
    S --> P[Primary LM]
    P --> C[Python snippet]
    C --> X[Monty Sandbox<br/>inputs is dict, full value]
    X --> Q{semantic work?}
    Q -->|Yes| L[llm_query / llm_query_batched<br/>sub-LM on a snippet]
    Q -->|No| R[Pure code: regex, slicing, set ops]
    L --> O[Observation]
    R --> O
    O --> P
    P -->|done| SU[submit result]
```
The primary LM stays in a small, structured context (summary + tool catalog + accumulated trajectory). Sub-LM calls each see only the snippet that the code carved out. The framework caps how many sub-LM calls one run can make, so a runaway loop can't blow your bill.
## Minimal Example
```python
import synalinks

class Doc(synalinks.DataModel):
    text: str

class Answer(synalinks.DataModel):
    answer: str

primary = synalinks.LanguageModel(model="openai/gpt-4o")

inputs = synalinks.Input(data_model=Doc)
outputs = await synalinks.RLM(
    data_model=Answer,
    language_model=primary,
)(inputs)
agent = synalinks.Program(inputs=inputs, outputs=outputs)

long_text = open("book.txt").read()
result = await agent(Doc(text=long_text))
print(result.prettify_json())
```
The whole `book.txt` lives in the sandbox; the primary LM only sees its preview and length.
## Inputs as External Environment
`RecursiveLanguageModelAgent` inherits the `CodeModeAgent` rule that the user input is bound as a dict named `inputs` inside the sandbox — but it leans on it harder. The primary LM sees an `InputsSummary` in its prompt instead of the raw values:
```yaml
fields:
  - name: text
    type: str
    size: 482917          # full character length
    preview: "Chapter 1... (first 200 chars)"
    truncated: true
```
Two things follow from that:
- **Read full values through `inputs[field]`**, never re-type them from the preview. The preview is truncated; the sandbox dict is not.
- **Aggregate in code, query the sub-LM on the result.** A regex pulls every quoted string from a 500K document in one snippet; only the matching spans get passed to `llm_query` for analysis (see the sketch below).
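A minimal sketch of that second rule as the LM might write it inside the sandbox (the field name `text`, the regex, and the prompt wording are all illustrative, and it assumes the sandbox whitelist includes `re`):

```python
import re

async def main():
    # Pure code decides WHERE to look: every quoted span in the full document.
    quotes = re.findall(r'"([^"]{20,400})"', inputs["text"])
    # Only the carved-out spans reach the sub-LM, one prompt per quote.
    out = await llm_query_batched(prompts=[
        f"Who is most likely the speaker of this quote? Answer with a name only.\n\n{q}"
        for q in quotes[:20]
    ])
    print(list(zip(quotes[:20], out["result"])))

asyncio.run(main())
```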
This is the inversion of the usual "give the LM the whole document and let it figure it out" approach: the LM stays small and orchestrates, the data stays big and external.
## The Two Recursive Helpers
The sandbox always exposes two extra async tools beyond `submit`:

### `llm_query(prompt)`

```python
async def main():
    out = await llm_query(prompt="Summarize this paragraph: ...")
    print(out["result"])

asyncio.run(main())
```
Sends one prompt to the sub-LM (configurable, see below). Returns `{"result": <text>}`. Use it for semantic work on a snippet you've already carved out with code: classification, summarization, entity extraction, reformatting.
### `llm_query_batched(prompts)`

```python
async def main():
    out = await llm_query_batched(prompts=[
        "Is this paragraph about finance? <p1>",
        "Is this paragraph about finance? <p2>",
        "Is this paragraph about finance? <p3>",
    ])
    # `paragraphs` is the list the prompts above were built from earlier in the snippet
    finance = [p for p, label in zip(paragraphs, out["result"]) if "yes" in label.lower()]

asyncio.run(main())
```
Same as `llm_query` but takes a list and dispatches the prompts concurrently via `asyncio.gather`. Returns `{"result": [<text-or-error>, ...]}`, one entry per input prompt, in input order. Strongly preferred over a Python loop of `llm_query` calls — sequential calls waste wall time inside the per-snippet timeout.
Failed sub-prompts come back as strings prefixed with `[error] <ExceptionType>: <message>`. Filter them before aggregating so an exception doesn't get silently treated as data.
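A defensive pass might look like this (a sketch; `prompts` stands for a list built earlier in the snippet):

```python
async def main():
    out = await llm_query_batched(prompts=prompts)
    # Drop failed sub-prompts so "[error] TimeoutError: ..." never lands in the aggregate.
    labels = [r for r in out["result"] if not r.startswith("[error]")]
    print(f"{len(labels)}/{len(out['result'])} prompts succeeded")

asyncio.run(main())
```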
### Budget
Both helpers share one counter capped at `max_llm_calls` (default 50). When the budget is exhausted, both short-circuit: the call returns an error payload immediately instead of reaching the sub-LM. No quota is consumed by the short-circuit, and the error message explicitly tells the LM to fall back to code-side aggregation. The counter resets on every `agent(...)` invocation — concurrent calls get independent budgets.
For `llm_query_batched`, the budget check is all-or-nothing: a batch of 10 with 5 calls remaining short-circuits with `{"result": [], ...}` rather than being partially fulfilled.
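A snippet can stay defensive about exhaustion by checking for the `error` key the helpers return when the budget is gone (see Key Takeaways below; `prompts` is again a list built earlier in the snippet):

```python
async def main():
    out = await llm_query_batched(prompts=prompts[:10])
    if "error" in out:
        # Budget exhausted: fall back to pure code-side aggregation, as the
        # error message itself instructs.
        count = sum(1 for p in prompts if "finance" in p.lower())
        await submit(result={"answer": f"{count} candidate paragraphs (code-side count)"})
    else:
        print(out["result"])

asyncio.run(main())
```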
## Adding Your Own Tools
`RecursiveLanguageModelAgent` inherits the `tools=` argument from `CodeModeAgent`. The names `llm_query`, `llm_query_batched`, and `submit` are reserved, but anything else you bind is exposed inside the sandbox as a global async callable — alongside the recursive helpers — and can be composed with them in the same snippet.

Because tool bodies run on the host with full Python privileges (filesystem, network, third-party libraries), they're how the agent reaches anything outside Monty's whitelist. Decorate each tool with `@synalinks.saving.register_synalinks_serializable()` so it round-trips through `get_config` / `from_config`:
```python
@synalinks.saving.register_synalinks_serializable()
async def fetch_url(url: str) -> dict:
    """Fetch a URL and return its body.

    Args:
        url (str): the URL to GET.
    """
    import httpx

    async with httpx.AsyncClient() as client:
        resp = await client.get(url)
    return {"status": resp.status_code, "body": resp.text}

agent = synalinks.RLM(
    data_model=Answer,
    language_model=primary,
    sub_language_model=cheap,
    tools=[synalinks.Tool(fetch_url)],
)
```
Inside a snippet the LM can now interleave host effects with recursive sub-LM work — fetch a page, hand its body to `llm_query` for classification, then submit (see the sketch below). See *Binding Tools — The Bridge to the Host* in the `CodeModeAgent` guide for the full contract, naming gotchas, and the security implications of bound tools.
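A sketch of that interleaving as a sandbox snippet (the URL and prompt wording are illustrative; `submit(result={...})` follows the contract noted in Key Takeaways):

```python
async def main():
    # Host effect: the bound tool fetches the page outside the sandbox.
    page = await fetch_url(url="https://example.com/report")
    # Recursive step: only a carved-out slice of the body reaches the sub-LM.
    out = await llm_query(
        prompt=f"Is this page about finance? Answer yes or no.\n\n{page['body'][:4000]}"
    )
    await submit(result={"answer": out["result"]})

asyncio.run(main())
```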
## Choosing a Sub-LM
`sub_language_model` defaults to `language_model` — the primary LM is reused for sub-queries. This is fine for prototyping but usually wasteful: the primary LM does the orchestration, planning, and final answer formatting, where capability matters; the sub-LM handles "summarize this 800-token snippet" or "is this paragraph about X?", where a much smaller, cheaper model is enough.
Pass a separate `sub_language_model` to split the workload:

```python
primary = synalinks.LanguageModel(model="openai/gpt-4o")
cheap = synalinks.LanguageModel(model="openai/gpt-4o-mini")

agent = synalinks.RLM(
    data_model=Answer,
    language_model=primary,
    sub_language_model=cheap,
)
```
A typical RLM run dispatches dozens of sub-LM calls and one or two primary-LM turns, so the bill is dominated by the sub-LM's per-call cost. Picking a cheap model here is the single biggest cost lever.
## Working Rules (Inside the Prompt)
The default instructions hand the LM a numbered checklist of how to operate the recursive loop. They show up verbatim in the prompt, but they're worth reading directly because they describe the failure modes this agent design has hit in practice:
1. **EXPLORE FIRST** — print sample values, lengths, types, and shapes of `inputs[field]` before slicing or batching. A cheap probe turn prevents wasted sub-LM calls on the wrong field.
2. **CODE FOR STRUCTURE, `llm_query` FOR MEANING** — regex, slicing, and set ops find WHERE things are; the sub-LM understands WHAT they mean. Don't burn `llm_query` budget on aggregation a one-liner can do.
3. **MINIMIZE RETYPING** — when values are long, precise, or error-prone (IDs, numbers, quoted text, code), re-access them via `inputs[field]` and compute in Python. Never copy from the `InputsSummary` preview into a sub-LM prompt — the preview is truncated.
4. **VERIFY BEFORE SUBMITTING** — if results look wrong (empty, zeros, unexpected shape), inspect on a separate turn. Don't submit a guess.
5. **`submit` IS TERMINAL** — the snippet runs to completion (so a `print` next to `submit` is captured), but a successful submit ends the loop with no follow-up turn — you never get to read that print. Inspect on one turn, submit on the next (see the two-turn sketch after this list).
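A two-turn sketch of rules 4 and 5 (the field name and query are illustrative):

```python
# Turn N: inspect only; the printed output comes back as the observation.
async def main():
    hits = [p for p in inputs["text"].split("\n\n") if "revenue" in p.lower()]
    print(len(hits), hits[:2])

asyncio.run(main())
```

```python
# Turn N+1: submit only, once the previous observation looked right.
async def main():
    await submit(result={"answer": "..."})

asyncio.run(main())
```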
You can replace these with your own `instructions=` if your task wants a different operational style, but the defaults are tuned for the "long input, sparse query" workload RLM is built for.
## Defaults Differ from CodeModeAgent
| Setting | `CodeModeAgent` | `RecursiveLanguageModelAgent` | Why |
|---|---|---|---|
| `max_iterations` | 5 | 20 | RLM workflows explore → carve → batch-query → aggregate → submit. |
| `timeout` | 5s | 60s | One `llm_query_batched(20 prompts)` snippet routinely needs >5s. |
| `max_llm_calls` | n/a | 50 | Hard ceiling; the LM is told about it and falls back to code on overrun. |
Override any of them if your workload knows better.
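Overrides pass straight through the constructor, e.g. (values here are arbitrary, and this assumes `timeout` is accepted as a constructor argument like the other two settings):

```python
agent = synalinks.RLM(
    data_model=Answer,
    language_model=primary,
    sub_language_model=cheap,
    max_iterations=30,   # deeper explore/carve/query loops
    timeout=120,         # headroom for large batched snippets
    max_llm_calls=100,   # bigger sub-LM budget per run
)
```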
## RLM vs CodeModeAgent vs FunctionCallingAgent
| Aspect | FunctionCallingAgent | CodeModeAgent | RecursiveLanguageModelAgent |
|---|---|---|---|
| Action shape | JSON tool call | Python snippet | Python snippet |
| Primary LM context | Full input + tool catalog | `InputsSummary` + tool catalog | `InputsSummary` + tool catalog |
| Long-input handling | Stuff into prompt | Stuff into prompt (read via `inputs`) | Treat as external environment |
| Recursive sub-LM | — | — | `llm_query` / `llm_query_batched` built-in |
| Cost lever | Pick one model | Pick one model | Pick a cheap `sub_language_model` |
| Best for | Discrete, well-typed actions | Composition / batching / control flow | Long inputs, sparse / semantic queries |
Reach for `RecursiveLanguageModelAgent` when the input dwarfs the question — long documents, large logs, big lists where only a few items matter — and you want the LM to decide what to look at rather than commit upfront to a chunking or retrieval strategy.
## Complete Example
```python
import asyncio

from dotenv import load_dotenv

import synalinks


class Doc(synalinks.DataModel):
    text: str = synalinks.Field(description="The document to analyze")


class Answer(synalinks.DataModel):
    answer: str = synalinks.Field(description="The final answer to the user")


async def main():
    load_dotenv()
    synalinks.clear_session()

    primary = synalinks.LanguageModel(model="openai/gpt-4o")
    cheap = synalinks.LanguageModel(model="openai/gpt-4o-mini")

    inputs = synalinks.Input(data_model=Doc)
    outputs = await synalinks.RLM(
        data_model=Answer,
        language_model=primary,
        sub_language_model=cheap,
        max_iterations=10,
        max_llm_calls=20,
    )(inputs)
    agent = synalinks.Program(inputs=inputs, outputs=outputs, name="rlm_qa")

    long_text = open("book.txt").read()
    result = await agent(Doc(text=long_text))
    print(result.prettify_json())


if __name__ == "__main__":
    asyncio.run(main())
```
Inside that `agent(...)` call, the primary LM never sees more than the preview of `book.txt`. It writes Python that scans the full text in the sandbox, picks out spans worth reading, dispatches them to the cheap sub-LM, aggregates the answers, and calls `submit`.
## Key Takeaways
- **Long inputs as external environment**: the primary LM sees a metadata summary; the full value lives in `inputs[field]` inside the sandbox.
- **Two recursive helpers**: `llm_query(prompt)` and `llm_query_batched(prompts)` send work to a sub-LM. Both return dicts with a `result` key (and an `error` key when the budget is exhausted or a batched prompt fails).
- **Shared budget**: `max_llm_calls` caps total sub-LM calls per run and is shared between the two helpers. Each `agent(...)` invocation gets a fresh counter; concurrent calls do not race.
- **Pick a cheap `sub_language_model`**: the bill is dominated by sub-LM calls. Splitting primary vs. sub-LM is the largest cost lever.
- **Working rules in the prompt**: the default instructions hand the LM a 5-item operating checklist tuned for the long-input, sparse-query workload. Override `instructions=` if your task wants a different style.
- **Same `submit` discipline as `CodeModeAgent`**: the LM ends a run by calling `submit(result={...})`. Schema validation, the `max_iterations` fallback, and the per-snippet `timeout` semantics all carry over unchanged from `CodeModeAgent`.
## API References

- `Answer`
- `Doc`
- `build_haystack(needle, paragraphs=200)`: return a long document with `needle` placed near the middle.