Applied AI Conference · Berlin, 05/28/2026 · 01 / 17

The Anatomy
of LobsterX

a document processing agent

Clelia Astra Bertelli · LlamaIndex

Redefine Document Workflows with AI Agents

Intro

Brain

Loop

World

Recap

Introduction · 02 / 17

Hi, I'm Clelia

Member of Technical Staff @ LlamaIndex, where I work on agents, retrieval systems and OSS
Background in computational biology, then slowly drifted into AI and engineering
I build small, opinionated agents to stress-test the framework I work on
Today: a guided tour of one of them → LobsterX 🦞

Introduction · 03 / 17

What is LobsterX?

A document-processing AI agent that lives in a Telegram chat. You send it a PDF and a task; it parses, extracts, classifies, reasons, and replies asynchronously when it's done.

~600 LOC of agent implementation

~1.5k LOC of workflow orchestration underneath

3 swappable LLM providers (OpenAI, Anthropic, Google)

Small enough to dissect on stage. Real enough to be interesting.

Introduction · 04 / 17

Why dissect an agent?

Most agents look like a black box: prompt in, answer out. The interesting engineering lives in the gap between those two. We'll walk through it using four anatomical metaphors:

Brain — the LLM, steered into structured behaviour
Loop — the event-driven workflow that drives it
Eyes & Limbs — the filesystem and tools it touches
Ears & Mouth — how it talks to a human

The Brain · 05 / 17

The Brain: an LLM with a problem

The LLM is the only decision-maker — it picks what to think, what to call, when to stop
But LLMs are non-deterministic: same prompt, different shapes of output
For an agent, that's fatal. You can't parse "well I think maybe call the tool with..." with a regex and hope for the best
We need a way to constrain the model's output to something the rest of the system can rely on

If the brain is unreliable, every downstream step inherits that unreliability. The fix has to start at the LLM call itself.

The Brain · 06 / 17

Steering: Structured Outputs

Every LLM call in LobsterX is constrained by a typed JSON schema. The model cannot reply with free-form prose — it must fill in a known shape.

One schema per operation: a Think looks different from an Act
Forces a clean separation between reasoning and action
The LLM wrapper exposes only structured-generation methods — there is no "raw chat" escape hatch in the agent code
Same schema works across OpenAI, Anthropic and Google providers
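A minimal sketch of what "one schema per operation" could look like, using only the Python standard library. The `Think` and `Act` dataclasses, their fields, and the `parse_structured` helper are hypothetical names chosen for illustration; they are not LobsterX's actual schemas or its LLM wrapper. In a real setup the provider would enforce the schema at generation time, so the validation here only illustrates why free-form prose cannot get through.

```python
import json
from dataclasses import dataclass, fields
from typing import Any, Type, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Think:
    """Hypothetical schema for a reasoning step."""
    reasoning: str
    done: bool


@dataclass(frozen=True)
class Act:
    """Hypothetical schema for a tool-call step."""
    tool_name: str
    tool_input: dict


def parse_structured(raw: str, schema: Type[T]) -> T:
    """Turn a raw model reply into an instance of `schema`, or raise ValueError."""
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    # Field names and their (plain) types, read from the dataclass itself.
    expected = {f.name: f.type for f in fields(schema)}
    if set(data) != set(expected):
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    for name, typ in expected.items():
        if not isinstance(data[name], typ):
            raise ValueError(f"field {name!r} must be of type {typ.__name__}")
    return schema(**data)
```

With this shape, `parse_structured('{"reasoning": "parse first", "done": false}', Think)` yields a `Think` instance, while a reply such as "well I think maybe call the tool with..." raises `ValueError` instead of reaching the rest of the agent.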

The Loop · 07 / 17

The Loop: Agent Workflows

LobsterX is built on LlamaIndex Agent Workflows: an event-driven, async-first stepwise execution engine.

Each step is a typed Python function that consumes one event type and emits another
No central orchestrator — the event types wire the steps together implicitly
Async by construction: a long-running tool call doesn't block anything else
Loops are not special cases — they're just steps that re-emit upstream event types
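The wiring described in these bullets can be sketched with a toy engine. This is not the LlamaIndex Workflows API: `MiniWorkflow`, its `step` decorator, and the three event classes are hypothetical stand-ins, written only to show how a parameter's type annotation can decide which step receives an event, and how a loop falls out of a step returning an event type that routes back to itself.

```python
import asyncio
import inspect
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Union


@dataclass
class StartEvent:
    n: int


@dataclass
class TickEvent:
    n: int


@dataclass
class StopEvent:
    result: str


Step = Callable[..., Awaitable[object]]


class MiniWorkflow:
    """Routes each event to the step whose parameter annotation matches its type."""

    def __init__(self) -> None:
        self._routes: Dict[type, Step] = {}

    def step(self, fn: Step) -> Step:
        params = list(inspect.signature(fn).parameters.values())
        if len(params) != 1:
            raise TypeError("a step takes exactly one event parameter")
        # The annotation is the routing key: no explicit graph is declared.
        self._routes[params[0].annotation] = fn
        return fn

    async def run(self, event: object) -> str:
        while not isinstance(event, StopEvent):
            event = await self._routes[type(event)](event)
        return event.result


wf = MiniWorkflow()


@wf.step
async def begin(ev: StartEvent) -> TickEvent:
    return TickEvent(n=ev.n)


@wf.step
async def countdown(ev: TickEvent) -> Union[TickEvent, StopEvent]:
    await asyncio.sleep(0)  # yield control, as a real async tool call would
    if ev.n == 0:
        return StopEvent(result="done")
    return TickEvent(n=ev.n - 1)  # re-emitting the same event type is the loop
```

Running `asyncio.run(wf.run(StartEvent(n=3)))` returns `"done"` after `countdown` has re-entered itself a few times; neither step knows about the other, only about event types.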

The Loop · 08 / 17

Event-Driven Execution

Input (user prompt) → Think (reason about next step) → Act (call a tool) → Observe (process result) ↺

Stop (final answer)

Each arrow is a typed event. Observe re-enters Think until Think decides the task is done and emits Stop — at which point the workflow terminates and the answer goes back to the user.
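The cycle can be sketched as a plain loop with a scripted stand-in for the LLM. Everything here is an assumption made for illustration: the `think` function replaces a real model call, `word_count` is a made-up tool, and the `max_turns` budget is a safety guard this sketch adds rather than something the talk describes.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class ActEvent:
    tool: str
    arg: str


@dataclass
class StopEvent:
    answer: str


# Hypothetical tool table; real tools would be async calls to external services.
TOOLS: Dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
}


def think(prompt: str, observations: List[str]) -> Union[ActEvent, StopEvent]:
    """Stand-in for the LLM: call one tool, then stop once a result is available."""
    if not observations:
        return ActEvent(tool="word_count", arg=prompt)
    return StopEvent(answer=f"The prompt has {observations[-1]} words.")


async def run_agent(prompt: str, max_turns: int = 5) -> str:
    observations: List[str] = []
    for _ in range(max_turns):
        decision = think(prompt, observations)       # Think
        if isinstance(decision, StopEvent):          # Stop ends the workflow
            return decision.answer
        result = TOOLS[decision.tool](decision.arg)  # Act
        observations.append(result)                  # Observe, then re-enter Think
        await asyncio.sleep(0)                       # let other tasks run
    raise RuntimeError("turn budget exhausted without a Stop")
```

Only `think` decides when to stop; the loop itself never inspects the content of an observation.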

World · 09 / 17

Three windows on the world

The brain and the loop together are general-purpose scaffolding. What makes LobsterX a document agent is the set of three interfaces it exposes to the world.

Filesystem: where documents live and where the agent writes its outputs

Document Tools: parse, extract and classify unstructured content via LlamaCloud

Chat Interface: Telegram bot with async upload and notification

World · Filesystem · 10 / 17

The Eyes: a virtual filesystem

File ops route through AgentFS, a virtualized layer — not the real machine FS
The agent gets read / write / edit / grep / glob — no delete, no shell execution
Scope is bounded to a working directory; common credential files (.env and others) are excluded entirely
Telegram-uploaded PDFs are written into AgentFS, never your real disk

If the agent is jailbroken into writing something destructive, the damage stays inside the virtual FS. Nothing leaks to the host unless you explicitly sync it.
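To make the bounded-primitives idea concrete, here is a small in-memory sketch. It is not AgentFS and does not use its API: the `VirtualFS` class, its method signatures, and the `EXCLUDED` list are hypothetical, and only `.env` is listed because that is the one credential file the slide names.

```python
import fnmatch
import re
from typing import Dict, List, Tuple


class VirtualFS:
    """In-memory stand-in for a sandboxed filesystem: no delete, no shell."""

    EXCLUDED = (".env",)  # assumption: file names treated as credential files

    def __init__(self) -> None:
        self._files: Dict[str, str] = {}

    def _check(self, path: str) -> str:
        parts = path.split("/")
        if path.startswith("/") or ".." in parts:
            raise PermissionError(f"path escapes the working directory: {path}")
        if parts[-1] in self.EXCLUDED:
            raise PermissionError(f"credential file is excluded: {path}")
        return path

    def write(self, path: str, content: str) -> None:
        self._files[self._check(path)] = content

    def read(self, path: str) -> str:
        return self._files[self._check(path)]

    def edit(self, path: str, old: str, new: str) -> None:
        text = self.read(path)
        if old not in text:
            raise ValueError(f"{old!r} not found in {path}")
        self._files[path] = text.replace(old, new, 1)  # first occurrence only

    def glob(self, pattern: str) -> List[str]:
        return sorted(p for p in self._files if fnmatch.fnmatchcase(p, pattern))

    def grep(self, pattern: str) -> List[Tuple[str, int, str]]:
        """Return (path, line number, line) for every matching line."""
        rx = re.compile(pattern)
        return [
            (path, lineno, line)
            for path in sorted(self._files)
            for lineno, line in enumerate(self._files[path].splitlines(), 1)
            if rx.search(line)
        ]
```

The class has no delete method at all, so a destructive instruction has nothing to call, and every write lands in the `_files` dict rather than on the host disk.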

World · Tools · 11 / 17

The Limbs: document tools

Filesystem ops alone only see plain text. To actually understand unstructured documents, the agent calls three LlamaCloud tools — each with its own typed input schema.

LlamaParse: full-text parsing of PDFs, Office docs and more via OCR, VLMs and agentic pipelines

LlamaExtract: schema-driven extraction; you define the JSON shape, the tool fills it in

LlamaClassify: classification into user-defined categories with confidence signals

World · Tools · 12 / 17

Why these tools change the game

A document agent is only as good as its eyes on unstructured content. Generic OCR isn't enough — layout, tables and figures all carry meaning that naive text extraction loses.

LlamaParse: layout-aware parsing for PDFs, DOCX, PPTX, XLSX, images. Tables stay as tables; figures get described by VLMs; reading order is preserved across columns.

LlamaExtract: you hand it a JSON schema, it hands you back populated objects that are typed, citation-linked and validated. No glue prompt engineering on the agent side.

LlamaClassify: user-defined categories with confidence signals; the agent uses it to route documents (invoice? contract? report?) before deciding what to do next.

Each tool exposes a typed input schema, so the Act step can call them with full structured-output guarantees end to end.
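A rough sketch of how an Act step might dispatch to tools through typed inputs. None of this is the LlamaCloud SDK: `ParseInput`, `ClassifyInput`, `fake_parse`, `fake_classify` and `call_tool` are hypothetical placeholders, and the two tool bodies return canned values instead of calling any service.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class ParseInput:
    """Hypothetical input schema for a parsing tool."""
    file_path: str


@dataclass(frozen=True)
class ClassifyInput:
    """Hypothetical input schema for a classification tool."""
    file_path: str
    categories: list

    def __post_init__(self) -> None:
        if not self.categories:
            raise ValueError("at least one category is required")


def fake_parse(args: ParseInput) -> str:
    return f"parsed text of {args.file_path}"  # placeholder, no real parsing


def fake_classify(args: ClassifyInput) -> str:
    return args.categories[0]  # placeholder: always picks the first category


# Each tool name maps to (input schema, implementation).
TOOLS: Dict[str, Tuple[type, Callable[[Any], str]]] = {
    "parse": (ParseInput, fake_parse),
    "classify": (ClassifyInput, fake_classify),
}


def call_tool(name: str, raw_input: Dict[str, Any]) -> str:
    """Build the typed input first; the tool only runs if that succeeds."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    schema, implementation = TOOLS[name]
    try:
        args = schema(**raw_input)  # missing or unexpected keys raise TypeError
    except TypeError as exc:
        raise ValueError(f"bad input for {name}: {exc}") from exc
    return implementation(args)
```

The design choice worth noting is that the tool body never sees a raw dict: a misspelled argument such as `{"path": "a.pdf"}` fails at the boundary with a `ValueError` the loop can feed back to the model.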

World · Chat · 13 / 17

Ears & Mouth: async by default

Telegram was chosen specifically because messaging is async — no spinner, no held-open HTTP connection
Documents come in as message attachments and land in AgentFS. Text messages are dispatched as workflow inputs
Document workflows can take minutes to half an hour — the agent pings you back when done
This maps cleanly onto the workflow engine, which is async-first already

The right interface for a long-running agent isn't a chatbot — it's a colleague who replies when they're finished.
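The "reply when finished" pattern can be sketched with plain asyncio. No Telegram library is used here: `handle_message`, `process_document` and the `notify` callback are hypothetical names, and the short sleep stands in for a document job that might really take minutes.

```python
import asyncio
from typing import Awaitable, Callable, List

Notify = Callable[[str], Awaitable[None]]


async def process_document(name: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for a long parse/extract run
    return f"summary of {name}"


async def handle_message(name: str, notify: Notify) -> "asyncio.Task[None]":
    """Acknowledge right away, do the work in the background, ping when done."""
    await notify(f"Got {name}, working on it.")

    async def job() -> None:
        result = await process_document(name)
        await notify(f"Done: {result}")

    return asyncio.create_task(job())  # the handler returns without waiting


async def demo() -> List[str]:
    outbox: List[str] = []

    async def notify(text: str) -> None:
        outbox.append(text)

    task = await handle_message("report.pdf", notify)
    outbox.append("(handler returned; the chat is free for other messages)")
    await task
    return outbox
```

In `asyncio.run(demo())`, the acknowledgement and the "handler returned" marker are recorded before the "Done" notification, which is the ordering a user of an async chat frontend would see.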

World · API Mode · 14 / 17

Same agent, different shell

The Telegram bot is one frontend. The same agent core also runs as a FastAPI server, with the workflow's async-first shape carried all the way through.

Task manager: in-memory dict of task_id → asyncio.Task, guarded by a lock. POST /task spawns, GET polls, DELETE cancels.

Rate limiting: per-endpoint per-minute limits via fastapi-throttle; uploads, creates, polls and deletes each have their own budget

Auth & CORS: Starlette middlewares for bearer-token auth plus configurable allowed origins
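The task manager described above can be sketched without any web framework. The `TaskManager` class, its method names, and the status strings are assumptions for illustration; the FastAPI routes, rate limiting and auth are left out, with comments marking which endpoint each method would plausibly sit behind.

```python
import asyncio
import uuid
from typing import Any, Coroutine, Dict


class TaskManager:
    """In-memory map of task_id -> asyncio.Task, guarded by a lock."""

    def __init__(self) -> None:
        self._tasks: Dict[str, "asyncio.Task[Any]"] = {}
        self._lock = asyncio.Lock()

    async def spawn(self, coro: Coroutine[Any, Any, Any]) -> str:
        # Would sit behind POST /task.
        task_id = uuid.uuid4().hex
        async with self._lock:
            self._tasks[task_id] = asyncio.create_task(coro)
        return task_id

    async def status(self, task_id: str) -> str:
        # Would sit behind the GET polling endpoint.
        async with self._lock:
            task = self._tasks.get(task_id)
        if task is None:
            return "not_found"
        if not task.done():
            return "running"
        if task.cancelled():
            return "cancelled"
        if task.exception() is not None:
            return "failed"
        return "done"

    async def cancel(self, task_id: str) -> bool:
        # Would sit behind the DELETE endpoint; the entry is dropped on cancel.
        async with self._lock:
            task = self._tasks.pop(task_id, None)
        if task is None:
            return False
        task.cancel()
        return True
```

Because the state lives in a process-local dict, tasks disappear on restart and are not shared across workers; that is a property of this sketch and of any purely in-memory design.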

Recap · Safety · 15 / 17

A note on safety

Virtual filesystem — no exposure of the host FS
No shell — the agent cannot run arbitrary bash
Read / write / edit only — no delete primitive at all
No skills — custom behaviour comes in via an AGENTS.md file, not via potentially unvetted instructions
Credential files excluded from the virtual FS — the agent can't even read them

None of this prevents prompt injection from a malicious document the agent has been asked to read. The mitigations bound the blast radius; they don't eliminate it.

Recap · Anatomy · 16 / 17

The full anatomy

The Brain: LLM steered into Think / Act / Observe / Stop via structured outputs

The Eyes: AgentFS, a sandboxed virtual filesystem with bounded primitives

The Limbs: LlamaParse, LlamaExtract, LlamaClassify, each a typed tool call

Ears & Mouth: Telegram (or FastAPI), async-first and notification-driven

Recap · Takeaways · 17 / 17

Key takeaways

Structured outputs are the single biggest lever for turning an LLM into a reliable agent component
Layout-aware doc tools (Parse / Extract / Classify) are what allow the agent to really understand unstructured documents
Event-driven workflows give you loops, branches and async for free
Virtual filesystems let you grant filesystem-style tools without the filesystem-style risk
Async interfaces are the right shape for long-running document work

Clelia Astra Bertelli · clelia@runllama.ai · LinkedIn · X · clelia.dev

Thank you! Questions?
