Applied AI Conference · Berlin, 05/28/2026  ·  01 / 17

The Anatomy
of LobsterX

a document processing agent

Clelia Astra Bertelli  ·  LlamaIndex
Redefine Document Workflows with AI Agents
Intro
Brain
Loop
World
Recap
Follow Along  ·  01b / 17

Open on your device

QR code

Scan to follow along on your phone or tablet

Introduction  ·  02 / 17

Hi, I'm Clelia

  • Member of Technical Staff @ LlamaIndex, where I work on agents, retrieval systems and OSS
  • Background in computational biology, then slowly drifted into AI and engineering
  • I build small, opinionated agents to stress-test the framework I work on
  • Today: a guided tour of one of them, LobsterX 🦞
Introduction  ·  03 / 17

What is LobsterX?

A document-processing AI agent that lives in a Telegram chat. You send it a PDF and a task; it parses, extracts, classifies, reasons, and replies asynchronously when it's done.

  • ~600 LOC of agent implementation
  • ~1.5k LOC of workflow orchestration underneath
  • 3 swappable LLM providers (OpenAI, Anthropic, Google)

Small enough to dissect on stage. Real enough to be interesting.

Introduction  ·  04 / 17

Why dissect an agent?

Most agents look like a black box: prompt in, answer out. The interesting engineering lives in the gap between those two. We'll walk through it using four anatomical metaphors:

  • Brain — the LLM, steered into structured behaviour
  • Loop — the event-driven workflow that drives it
  • Eyes & Limbs — the filesystem and tools it touches
  • Ears & Mouth — how it talks to a human
The Brain  ·  05 / 17

The Brain: an LLM with a problem

  • The LLM is the only decision-maker — it picks what to think, what to call, when to stop
  • But LLMs are non-deterministic: same prompt, different shapes of output
  • For an agent, that's fatal. You can't parse "well I think maybe call the tool with..." with a regex and hope for the best
  • We need a way to constrain the model's output to something the rest of the system can rely on

If the brain is unreliable, every downstream step inherits that unreliability. The fix has to start at the LLM call itself.

The Brain  ·  06 / 17

Steering: Structured Outputs

Every LLM call in LobsterX is constrained by a typed JSON schema. The model cannot reply with free-form prose — it must fill in a known shape.

  • One schema per operation: a Think looks different from an Act
  • Forces a clean separation between reasoning and action
  • The LLM wrapper exposes only structured-generation methods — there is no "raw chat" escape hatch in the agent code
  • Same schema works across OpenAI, Anthropic and Google providers
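
The "one schema per operation" idea can be sketched with the standard library alone. This is a minimal illustration, not LobsterX's actual code: the `Think` and `Act` field names and the `parse_structured` helper are assumptions, and real providers enforce the schema during generation, whereas this sketch only validates the output afterwards.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Think:
    """Reasoning only: this shape has no way to express a tool call."""
    thought: str
    done: bool


@dataclass(frozen=True)
class Act:
    """Action only: a tool name plus its arguments."""
    tool: str
    arguments: dict


def parse_structured(raw: str, schema: type):
    """Parse raw model output as JSON and check it against `schema`.

    Raises ValueError for free-form prose, missing or extra keys,
    and values of the wrong type.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    expected = {f.name: f.type for f in fields(schema)}
    if set(data) != set(expected):
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(data)}")
    for name, typ in expected.items():
        if not isinstance(data[name], typ):
            raise ValueError(f"{name!r} must be {typ.__name__}")
    return schema(**data)
```

A reply such as `{"thought": "parse the PDF first", "done": false}` becomes a `Think` instance, while "well I think maybe call the tool with..." is rejected before any downstream step sees it.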
The Loop  ·  07 / 17

The Loop: Agent Workflows

LobsterX is built on LlamaIndex Agent Workflows: an event-driven, async-first stepwise execution engine.

  • Each step is a typed Python function that consumes one event type and emits another
  • No central orchestrator — the event types wire the steps together implicitly
  • Async by construction: a long-running tool call doesn't block anything else
  • Loops are not special cases — they're just steps that re-emit upstream event types
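
To make "the event types wire the steps together" concrete, here is a toy dispatcher written with only the standard library. It is not the LlamaIndex Workflows API: `MiniWorkflow`, `Start`, `Tick` and `Done` are hypothetical names, and the routing rule (look up the step by the annotated type of its first parameter) is an assumption chosen to mirror the description above.

```python
import asyncio
import inspect
from dataclasses import dataclass
from typing import Union


@dataclass
class Start:
    n: int


@dataclass
class Tick:
    n: int


@dataclass
class Done:
    result: str


class MiniWorkflow:
    def __init__(self):
        self._steps = {}  # event type -> step coroutine function

    def step(self, fn):
        """Register `fn` under the annotated type of its first parameter."""
        first_param = next(iter(inspect.signature(fn).parameters.values()))
        self._steps[first_param.annotation] = fn
        return fn

    async def run(self, event, stop_type):
        # No orchestration logic: the type of each event selects the next step.
        while not isinstance(event, stop_type):
            event = await self._steps[type(event)](event)
        return event


wf = MiniWorkflow()


@wf.step
async def begin(ev: Start) -> Tick:
    return Tick(ev.n)


@wf.step
async def countdown(ev: Tick) -> Union[Tick, Done]:
    await asyncio.sleep(0)  # yield control, as a real async step would
    if ev.n == 0:
        return Done("finished")
    return Tick(ev.n - 1)  # re-emitting the consumed type is the loop
```

Running `asyncio.run(wf.run(Start(3), Done))` walks `begin` once and `countdown` four times; nothing in `run` knows that a loop exists.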
The Loop  ·  08 / 17

Event-Driven Execution

  • Input: user prompt
  • Think: reason about next step
  • Act: call a tool
  • Observe: process result
  • Stop: final answer

Each arrow is a typed event. Observe re-enters Think until Think decides the task is done and emits Stop — at which point the workflow terminates and the answer goes back to the user.
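
The cycle described above can be sketched as follows. Everything here is illustrative: the event classes, the scripted `think` function standing in for the LLM, the fake tool result, and the `max_steps` budget are assumptions, not LobsterX's implementation.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ThinkEvent:
    history: list  # everything the agent has seen so far


@dataclass
class ActEvent:
    history: list
    tool: str


@dataclass
class ObserveEvent:
    history: list
    result: str


@dataclass
class StopEvent:
    answer: str


async def think(ev: ThinkEvent):
    # Stand-in for the LLM: call a hypothetical "parse" tool once, then stop.
    observations = [h for h in ev.history if h.startswith("observation:")]
    if observations:
        return StopEvent(answer=f"done after {len(observations)} tool call(s)")
    return ActEvent(ev.history, tool="parse")


async def act(ev: ActEvent):
    result = f"{ev.tool} returned 3 pages"  # fake tool output
    return ObserveEvent(ev.history, result)


async def observe(ev: ObserveEvent):
    # Record the result and re-enter Think.
    return ThinkEvent(ev.history + [f"observation: {ev.result}"])


HANDLERS = {ThinkEvent: think, ActEvent: act, ObserveEvent: observe}


async def run_agent(prompt: str, max_steps: int = 20) -> str:
    event = ThinkEvent(history=[f"user: {prompt}"])
    for _ in range(max_steps):
        if isinstance(event, StopEvent):
            return event.answer
        event = await HANDLERS[type(event)](event)
    raise RuntimeError("step budget exhausted")
```

Only `think` can produce a `StopEvent`, which matches the rule that Think alone decides when the task is done. The step budget is an extra guard against a brain that never stops.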

World  ·  09 / 17

Three windows on the world

The brain and the loop together are general-purpose scaffolding. What makes LobsterX a document agent are the three interfaces it exposes to the world.

  • Filesystem: where documents live and where the agent writes its outputs
  • Document Tools: parse, extract and classify unstructured content via LlamaCloud
  • Chat Interface: Telegram bot with async upload and notification
World · Filesystem  ·  10 / 17

The Eyes: a virtual filesystem

  • File ops route through AgentFS, a virtualized layer — not the real machine FS
  • The agent gets read / write / edit / grep / glob — no delete, no shell execution
  • Scope is bounded to a working directory; common credential files (.env and others) are excluded entirely
  • Telegram-uploaded PDFs are written into AgentFS, never your real disk

If the agent is jailbroken into writing something destructive, the damage stays inside the virtual FS. Nothing leaks to the host unless you explicitly sync it.
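
A bounded virtual filesystem of this kind could look roughly like the sketch below. It is not AgentFS: the `VirtualFS` class, the `/work` root and the blocked-name patterns other than `.env` are assumptions made for illustration, and everything is held in an in-memory dict.

```python
import fnmatch
import posixpath
import re


class VirtualFS:
    """In-memory files under one root. No delete, no shell."""

    # Illustrative patterns only; the real exclusion list is not given here.
    BLOCKED = (".env", ".env.*", "*.pem")

    def __init__(self, root: str = "/work"):
        self.root = root
        self._files = {}  # normalised absolute path -> text content

    def _resolve(self, path: str) -> str:
        full = posixpath.normpath(posixpath.join(self.root, path))
        if full != self.root and not full.startswith(self.root + "/"):
            raise PermissionError(f"{path!r} escapes the working directory")
        name = posixpath.basename(full)
        if any(fnmatch.fnmatch(name, pat) for pat in self.BLOCKED):
            raise PermissionError(f"{path!r} is an excluded credential file")
        return full

    def write(self, path: str, text: str) -> None:
        self._files[self._resolve(path)] = text

    def read(self, path: str) -> str:
        return self._files[self._resolve(path)]

    def edit(self, path: str, old: str, new: str) -> None:
        full = self._resolve(path)
        if old not in self._files[full]:
            raise ValueError(f"{old!r} not found in {path!r}")
        self._files[full] = self._files[full].replace(old, new, 1)

    def glob(self, pattern: str) -> list:
        return sorted(
            p for p in self._files
            if fnmatch.fnmatch(posixpath.relpath(p, self.root), pattern)
        )

    def grep(self, regex: str) -> list:
        rx = re.compile(regex)
        return [
            (path, lineno, line)
            for path in sorted(self._files)
            for lineno, line in enumerate(self._files[path].splitlines(), 1)
            if rx.search(line)
        ]
```

Every operation passes through `_resolve`, so `../etc/passwd`, `/etc/passwd` and `.env` are all refused in one place, and the absence of a `delete` method is the whole deletion policy.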

World · Tools  ·  11 / 17

The Limbs: document tools

Filesystem ops alone only see plain text. To actually understand unstructured documents, the agent calls three LlamaCloud tools — each with its own typed input schema.

  • LlamaParse: full-text parsing of PDFs, Office docs and more via OCR, VLMs and agentic pipelines
  • LlamaExtract: schema-driven extraction, where you define the JSON shape and the tool fills it in
  • LlamaClassify: classification into user-defined categories with confidence signals
World · Tools  ·  12 / 17

Why these tools change the game

A document agent is only as good as its eyes on unstructured content. Generic OCR isn't enough — layout, tables and figures all carry meaning that naive text extraction loses.

  • LlamaParse: layout-aware parsing for PDFs, DOCX, PPTX, XLSX, images. Tables stay as tables; figures get described by VLMs; reading order is preserved across columns.
  • LlamaExtract: you hand it a JSON schema, it hands you back populated objects that are typed, citation-linked and validated. No glue prompt engineering on the agent side.
  • LlamaClassify: user-defined categories with confidence signals. The agent uses it to route documents (invoice? contract? report?) before deciding what to do next.

Each tool exposes a typed input schema, so the Act step can call them with full structured-output guarantees end to end.
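
One way to picture the Act step calling a tool through a typed input schema is the sketch below. The tool functions are local fakes, not the LlamaCloud clients, and `ParseInput`, `ClassifyInput`, `TOOLS` and `call_tool` are hypothetical names. The check covers argument names only, not value types.

```python
from dataclasses import dataclass


@dataclass
class ParseInput:
    path: str


@dataclass
class ClassifyInput:
    path: str
    categories: list


def fake_parse(args: ParseInput) -> str:
    # Placeholder for a real parsing call.
    return f"parsed text of {args.path}"


def fake_classify(args: ClassifyInput) -> str:
    # Placeholder: always picks the first category.
    return args.categories[0]


# Tool name -> (input schema, implementation)
TOOLS = {
    "parse": (ParseInput, fake_parse),
    "classify": (ClassifyInput, fake_classify),
}


def call_tool(name: str, arguments: dict) -> str:
    """Build the typed input for `name` from `arguments`, then run the tool."""
    if name not in TOOLS:
        raise ValueError(f"unknown tool {name!r}")
    schema, fn = TOOLS[name]
    try:
        typed = schema(**arguments)  # missing or unexpected keys raise TypeError
    except TypeError as exc:
        raise ValueError(f"bad arguments for {name!r}: {exc}") from exc
    return fn(typed)
```

Because the arguments are turned into a schema instance before the tool runs, a malformed Act output fails at the boundary instead of inside the tool.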

World · Chat  ·  13 / 17

Ears & Mouth: async by default

  • Telegram was chosen specifically because messaging is async — no spinner, no held-open HTTP connection
  • Documents come in as message attachments and land in AgentFS. Text messages are dispatched as workflow inputs
  • Document workflows can take minutes to half an hour — the agent pings you back when done
  • This maps cleanly onto the workflow engine, which is async-first already

The right interface for a long-running agent isn't a chatbot — it's a colleague who replies when they're finished.

World · API Mode  ·  14 / 17

Same agent, different shell

The Telegram bot is one frontend. The same agent core also runs as a FastAPI server, with the workflow's async-first shape carried all the way through.

  • Task manager: in-memory dict of task_id → asyncio.Task, guarded by a lock. POST /task spawns, GET polls, DELETE cancels.
  • Rate limiting: per-endpoint per-minute limits via fastapi-throttle. Uploads, creates, polls and deletes each have their own budget.
  • Auth & CORS: Starlette middlewares with bearer-token auth and configurable allowed origins
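
The task manager described above can be sketched without any web framework. The `TaskManager` class, its method names and the status dictionaries are assumptions for illustration; in the server, the spawn, status and cancel calls would sit behind the POST, GET and DELETE handlers respectively.

```python
import asyncio
import uuid


class TaskManager:
    """In-memory registry of task_id -> asyncio.Task, guarded by a lock."""

    def __init__(self):
        self._tasks = {}
        self._lock = asyncio.Lock()

    async def spawn(self, coro) -> str:
        task_id = uuid.uuid4().hex
        async with self._lock:
            self._tasks[task_id] = asyncio.create_task(coro)
        return task_id

    async def status(self, task_id: str) -> dict:
        async with self._lock:
            task = self._tasks[task_id]
        if not task.done():
            return {"state": "running"}
        if task.cancelled():
            return {"state": "cancelled"}
        if task.exception() is not None:
            return {"state": "failed", "error": str(task.exception())}
        return {"state": "done", "result": task.result()}

    async def cancel(self, task_id: str) -> bool:
        async with self._lock:
            task = self._tasks.pop(task_id)
        return task.cancel()  # False if the task had already finished
```

The manager should be created inside a running event loop (for example in an application startup hook). Because the registry lives in process memory, tasks are lost on restart, which is a trade-off this sketch accepts.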
Recap · Safety  ·  15 / 17

A note on safety

  • Virtual filesystem — no exposure of the host FS
  • No shell — the agent cannot run arbitrary bash
  • Read / write / edit only — no delete primitive at all
  • No skills — custom behaviour comes in via an AGENTS.md file, not via potentially unvetted instructions
  • Credential files excluded from the virtual FS — the agent can't even read them

None of this prevents prompt injection from a malicious document the agent has been asked to read. The mitigations bound the blast radius; they don't eliminate it.

Recap · Anatomy  ·  16 / 17

The full anatomy

  • The Brain: LLM steered into Think / Act / Observe / Stop via structured outputs
  • The Eyes: AgentFS, a sandboxed virtual filesystem with bounded primitives
  • The Limbs: LlamaParse, LlamaExtract, LlamaClassify, each a typed tool call
  • Ears & Mouth: Telegram (or FastAPI), async-first and notification-driven
Recap · Takeaways  ·  17 / 17

Key takeaways

  • Structured outputs are the single biggest lever for turning an LLM into a reliable agent component
  • Layout-aware doc tools (Parse / Extract / Classify) are what allow the agent to really understand unstructured documents
  • Event-driven workflows give you loops, branches and async for free
  • Virtual filesystems let you grant filesystem-style tools without the filesystem-style risk
  • Async interfaces are the right shape for long-running document work
Clelia Astra Bertelli  ·  clelia@runllama.ai  ·  LinkedIn  ·  X  ·  clelia.dev
Thank you! Questions?