Building Stitcher: an AI that animates anything you type

Four services, a sandbox that runs AI-written code, a retrieval stack that teaches a model Manim, and a collaborative video editor. A field guide to the most complex thing I have built.

You know those 3Blue1Brown videos? The ones where math just sort of unfolds on screen and suddenly eigenvectors make sense? Those are made with a Python library called Manim. It is gorgeous, and it is also a pain. You write code, wait for it to render, find out your arrow is 0.3 units too far left, fix it, render again. Hours vanish.

Stitcher is my answer to that. You type a sentence like "explain Dijkstra's algorithm" and you get back a narrated, animated video. No code, no render farm to babysit, no timeline to fight. An AI plans the scenes, writes the Manim code, runs it in a sandbox, records a voiceover, and stitches the whole thing together. Think 3Blue1Brown, except the part where a human writes the animation is now a pipeline.

I spent around ten months on this. It is, by a wide margin, the most complex thing I have ever built, and most of that complexity is invisible if you only watch the final video. So this is me lifting the hood. I will keep it skimmable: each chapter is one piece of the machine, and you can bail the moment you have seen enough. Let's go.

1. The shape of the thing

The first real decision was whether to keep everything in one app. I did not. Stitcher is four separate services, and each one does exactly one job:

The API (Node, Express) is the front door. Auth, credits, and it owns the job queue.
The RAG service (Node, Fastify) figures out which Manim APIs and examples the model is going to need.
The Agent (Node, Fastify, LangGraph) is the brain that actually writes the animation code.
The Render service (Python, FastAPI) runs that code and produces a video.

Why four and not one? Because they fail differently and scale differently. The render service is CPU-bound Python that I would not trust with my house keys. The agent mostly sits around waiting on LLMs. The API is a normal web server. Cram them into one process and a single bad render takes down your login page. Split them and I can run ten render boxes against one API box, and a crash in the sandbox is just a failed job, not an outage.

flowchart TD
    U([you type a prompt]) --> API["API orchestrator<br/>auth · credits · job queue"]
    API -->|BullMQ job| W([worker runs the pipeline])
    W --> RAG["1 · RAG<br/>retrieve Manim examples"]
    RAG --> AG["2 · Agent<br/>write the animation code"]
    AG --> RND["3 · Render<br/>run it in a sandbox"]
    RND --> OUT([a narrated video])
    RAG -.-> DB[(Mongo + vector index)]
    AG -.-> LLM{{OpenAI · Claude · Gemini}}
    RND -.-> CDN[(Cloudinary CDN)]

They talk over plain HTTP with a shared X-Api-Key, and every request carries a trace id so I can follow one video through all four services in Grafana. Nothing clever. Boring on purpose. The interesting part is what each service does, so let's open them up.
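To make the "shared key plus trace id" idea concrete, here is a minimal sketch of how a service might build its outbound headers. The header name `X-Trace-Id` and the function name are assumptions for illustration; only `X-Api-Key` comes from the setup described above.

```python
import uuid
from typing import Dict, Mapping, Optional

TRACE_HEADER = "X-Trace-Id"  # assumed header name, not taken from the real services


def outbound_headers(api_key: str,
                     incoming: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Headers for a service-to-service call: shared key plus a propagated trace id."""
    incoming = incoming or {}
    # Reuse the caller's trace id so one video maps to one timeline;
    # mint a fresh one only when the request has none yet.
    trace_id = incoming.get(TRACE_HEADER) or uuid.uuid4().hex
    return {"X-Api-Key": api_key, TRACE_HEADER: trace_id}
```

The only rule that matters is that a service never invents a new id when it was handed one.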

2. Teaching a model to write Manim

Here is the dirty secret of "AI writes code": models are genuinely mid at Manim. It is a niche library, the training data is thin, and the model will confidently call methods that do not exist. YOLO a prompt straight into an LLM and you get code that looks right and crashes on line 12.

So before the model writes anything, the RAG service does the homework. And it does not work the way the textbook RAG diagrams do.

2.1 Rewrite the question first

A prompt like "show me how binary search works" is a terrible search query. It has zero Manim vocabulary in it. So step one is a tiny, fast LLM call that rewrites the prompt into the language of the library: it pulls out the actual APIs this will probably need (MArray, Text, Indicate) and flags whether a plugin fits (this one screams manim-dsa). The raw question never touches the vector store. The rewritten, API-flavored version does.

One small trick that paid off: the giant list of Manim symbols I feed that rewriter is static, and it sits at the very front of the prompt every single time. That means the provider can cache it, and I only pay full freight for the few hundred tokens that actually change. Latency and cost both drop, for free.
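A rough sketch of that prompt layout, assuming a toy symbol list and made-up wording for the instructions. The point is only the ordering: everything static goes first and is byte-identical on every call, and the user's text goes last.

```python
# Tiny stand-in for the real symbol list, which is far longer.
MANIM_SYMBOLS = ["Circle", "Indicate", "MArray", "Text", "VGroup"]

# Built once at import time so the prefix never changes between calls.
STATIC_PREFIX = (
    "You rewrite user prompts into Manim vocabulary.\n"
    "Known symbols:\n" + "\n".join(sorted(MANIM_SYMBOLS)) + "\n"
)


def build_rewriter_prompt(user_prompt: str) -> str:
    """Static text first, variable text last, so the shared prefix stays cacheable."""
    return STATIC_PREFIX + "\nUser prompt:\n" + user_prompt.strip() + "\n"
```

Anything that varies per request (a timestamp, a user id) would break the shared prefix if it crept in ahead of the symbol list.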

2.2 Two searches, then a merge

Retrieval is hybrid. A dense vector search (Gemini embeddings over a vector index in Mongo) catches semantic matches, and a sparse BM25 search catches the exact-keyword stuff embeddings fumble, like a specific class name. I fuse the two ranked lists with Reciprocal Rank Fusion, which is a fancy way of saying "things that rank high on both lists win." Then a cheap reranker model does a final pass and scores how relevant each example actually is.
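For the curious, Reciprocal Rank Fusion fits in a dozen lines. This is a generic sketch, not the service's code: each document scores 1 / (k + rank) on every list it appears in, and the scores are summed. The constant k = 60 is a commonly used default and an assumption here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]    # e.g. vector search results
sparse = ["b", "d", "a"]   # e.g. BM25 results
print(rrf([dense, sparse]))  # ['b', 'a', 'd', 'c']
```

Note that "b" wins even though "a" topped the dense list, because "b" is near the top of both.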

2.3 Coverage beats diversity

Normally, to stop your retrieved examples from being five near-identical snippets, you reach for MMR (Maximal Marginal Relevance), which trades a little relevance for a little diversity. I went a different route, because for code generation I do not care about abstract diversity. I care about API coverage.

So the final selection step is coverage-greedy. It looks at every candidate, sees which Manim APIs each one demonstrates, and greedily builds a set that covers the most distinct APIs the model is about to need. The coder ends up staring at examples that, together, show it every tool for the job, instead of five different ways to draw the same circle. Small idea. The generated code called noticeably fewer methods that do not exist.
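Here is a small sketch of what coverage-greedy selection can look like. It is the classic greedy set-cover loop; the candidate format (an id plus the set of APIs it demonstrates, already in reranker order) and the tie-breaking rule are assumptions for illustration.

```python
from typing import List, Set, Tuple

Candidate = Tuple[str, Set[str]]  # (example id, Manim APIs it demonstrates)


def coverage_pick(candidates: List[Candidate], needed: Set[str], limit: int) -> List[str]:
    """Repeatedly take the example that covers the most still-uncovered APIs."""
    uncovered = set(needed)
    pool = list(candidates)  # assumed to arrive in reranker order
    chosen: List[str] = []
    while uncovered and pool and len(chosen) < limit:
        # max() returns the first best item, so ties go to the higher-ranked example.
        best = max(pool, key=lambda c: len(c[1] & uncovered))
        if not (best[1] & uncovered):
            break  # nothing left adds coverage
        chosen.append(best[0])
        uncovered -= best[1]
        pool.remove(best)
    return chosen


examples = [
    ("circle_a", {"Circle", "Create"}),
    ("circle_b", {"Circle", "Create"}),
    ("array", {"MArray", "Indicate"}),
    ("text", {"Text", "Create"}),
]
needed = {"Circle", "Create", "MArray", "Indicate", "Text"}
print(coverage_pick(examples, needed, limit=3))  # ['circle_a', 'array', 'text']
```

The second circle example never gets picked, because after the first one it adds nothing new.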

flowchart TD
    Q([user prompt: explain binary search]) --> RW["rewrite into Manim vocabulary:<br/>the likely APIs + a matching plugin"]
    RW --> DS["dense + sparse search<br/>vectors and BM25, run together"]
    DS --> RRF[RRF: fuse the two ranked lists]
    RRF --> RR[rerank with a cheap LLM]
    RR --> CV["coverage pick:<br/>keep the examples that cover the<br/>most distinct APIs, not MMR"]
    CV --> G([grounded context, handed to the agent])

3. The agent that writes, runs, and fixes its own code

Once retrieval hands over a pile of grounded context, the agent takes over. It is a LangGraph state machine, and it runs a loop that should feel familiar to anyone who has actually shipped code: plan, write, run, and when it breaks, fix it.

flowchart TD
    P([prompt]) --> PL["planner<br/>a style guide + the scene beats"]
    PL --> CD["coder<br/>Manim Python + the narration per beat"]
    CD --> RN["render<br/>run it in the sandbox"]
    RN --> Q{works?}
    Q -->|yes| DONE([ship the video])
    Q -->|no| DBG["debugger<br/>read the traceback, patch the code,<br/>log a CoderLesson"]
    DBG -->|"retry, up to 3x"| RN

The planner turns the prompt into a style guide (colors, pacing, where things sit on screen) and a list of scene beats. The coder writes the Manim Python for those beats, including the exact narration line for each one. The render step runs it. If it crashes, the debugger reads the real Python traceback, patches the code, and we try again, up to three times before we give up and tell the user honestly.
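Stripped of LangGraph, the render-and-fix loop is roughly the following. This is a plain-Python sketch, not the real graph: `render` and `debug` are hypothetical callables, and I am reading "up to three times" as one initial render plus at most three fix-and-retry rounds.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

MAX_FIXES = 3


@dataclass
class Outcome:
    ok: bool
    code: str
    renders: int
    lessons: List[str] = field(default_factory=list)


def run_agent(code: str,
              render: Callable[[str], Optional[str]],
              debug: Callable[[str, str], Tuple[str, str]],
              max_fixes: int = MAX_FIXES) -> Outcome:
    """render(code) returns a traceback or None; debug returns (patched code, lesson)."""
    lessons: List[str] = []
    renders = 0
    for fixes_used in range(max_fixes + 1):
        traceback = render(code)
        renders += 1
        if traceback is None:
            return Outcome(True, code, renders, lessons)
        if fixes_used == max_fixes:
            break  # out of retries: report the failure honestly
        code, lesson = debug(code, traceback)
        lessons.append(lesson)
    return Outcome(False, code, renders, lessons)
```

The debugger only ever sees a real traceback from a real run, never the model's guess about what might go wrong.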

3.1 The voiceover contract

Every spoken line is treated as a contract. The audio is generated once from those exact strings, and the narration text becomes a kind of cache key. If a later debugging pass quietly reworded a sentence, the audio would no longer line up with the animation timing, so the system is borderline obsessive about copying narration beats back in, character for character. It is the least glamorous code in the repo and it is completely load-bearing.
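One way to picture the contract, as a sketch under assumptions: a `Beat` pairs a narration line with its code, the cache key is a hash over the exact narration strings, and after any debugging pass the original narration is copied back over whatever the patch produced. The names and the choice of SHA-256 are illustrative.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Beat:
    narration: str
    code: str


def narration_key(beats: List[Beat]) -> str:
    """Hash of the exact spoken strings; any reworded character changes the key."""
    digest = hashlib.sha256()
    for beat in beats:
        digest.update(beat.narration.encode("utf-8"))
        digest.update(b"\x00")  # separator so beat boundaries matter
    return digest.hexdigest()


def enforce_contract(original: List[Beat], patched: List[Beat]) -> List[Beat]:
    """Keep the debugger's code changes, but restore the narration character for character."""
    if len(original) != len(patched):
        raise ValueError("beat count changed; audio can no longer line up")
    return [Beat(narration=o.narration, code=p.code) for o, p in zip(original, patched)]
```

If the key is unchanged after a fix, the audio generated earlier is still valid and nothing needs to be re-recorded.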

3.2 The model that learns from its mistakes

This is the part that low-key feels like cheating: the coder has a memory. Every time a render fails and then gets fixed, the system writes a one- or two-line lesson about what went wrong ("don't pass a list to this thing, it wants a VGroup"). Those lessons get fed back into future generations. So the model is not just smart, it is getting less dumb over time on the exact mistakes it tends to make. I call them CoderLessons, and watching that table fill itself up was one of the more satisfying parts of the whole build.
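A toy version of that memory, to show the shape of the idea. The in-memory counter and the "most frequent lessons first" selection are assumptions made for this sketch; the real table and its selection logic are not described here.

```python
from collections import Counter


class LessonStore:
    """Counts how often each lesson was learned and surfaces the most common ones."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, lesson: str) -> None:
        text = " ".join(lesson.split())  # normalise whitespace so duplicates merge
        if text:
            self._counts[text] += 1

    def prompt_block(self, limit: int = 5) -> str:
        """Bullet list to prepend to the coder's prompt on the next generation."""
        top = [text for text, _ in self._counts.most_common(limit)]
        return "\n".join(f"- {text}" for text in top)
```

A lesson that keeps recurring floats to the top, which is exactly the mistake you most want the model to stop making.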

4. Running AI-written code without getting owned

Okay, the scary chapter. The render service takes Python that an AI wrote, based on a prompt that a stranger on the internet typed, and runs it on my server. If that sentence does not unsettle you a little, read it again.

The threat model is simple: assume the code is hostile. Assume it wants to read /etc/passwd, fork forever, eat all the memory, phone home, and scribble over the disk. The goal is that even if all of that is true, the worst case is one failed render. So the sandbox is defense in depth, layers stacked so getting past one still leaves you stuck behind the next.

flowchart TD
    IN([AI-written Python arrives]) --> L1
    subgraph L1["layer 1 · AST guard"]
      subgraph L2["layer 2 · bubblewrap namespaces · no network"]
        subgraph L3["layer 3 · rlimits: cpu, memory, procs"]
          subgraph L4["layer 4 · seccomp syscall allowlist"]
            CODE([the AI's code finally runs here])
          end
        end
      end
    end
    CODE --> OUT([killed at 120s · OOM-detected · scratch dir wiped])

The outer layer is the cheapest: before spawning anything, an AST check rejects code that imports os or subprocess, calls eval, or tries to open files it has no business touching. Most obvious abuse dies right here, for free.
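A stripped-down sketch of that kind of check, using Python's `ast` module. It is deliberately simplistic and not the real guard: it bans `open` outright instead of checking which path is being opened, and the banned lists are just the examples named above.

```python
import ast
from typing import List

BANNED_MODULES = {"os", "subprocess"}
BANNED_CALLS = {"eval", "open"}  # simplification: every open() call is rejected


def violations(source: str) -> List[str]:
    """Return human-readable reasons to reject the code; an empty list means it passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    found: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] in BANNED_MODULES:
                found.append(f"line {node.lineno}: import of {name}")
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            found.append(f"line {node.lineno}: call to {node.func.id}()")
    return found
```

A static check like this is easy to dodge (for example, `__import__("os")` sails straight past this sketch), which is exactly why it is only the first layer.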

Get past that and it runs inside bubblewrap with fresh namespaces, so it gets its own mount, network, and process views. There is no network at all. The filesystem is read-only except one scratch directory. It runs as a user id that does not exist on the host. Then rlimits cap CPU, memory, file size, open files, and process count, so a fork bomb just smacks into a wall. Then an optional seccomp filter narrows the allowed syscalls down to the short list Manim actually uses and explicitly bans things like ptrace and mount. Wrapping all of it, a wall-clock timer kills the entire process tree at 120 seconds, OOM kills get detected, and the scratch directory is wiped after every single render.
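Here is roughly how the middle layers could be wired from Python, as a sketch under assumptions rather than the actual render service. The limit values, the uid 4242, and the function names are invented for illustration, the seccomp filter is left out, and the bubblewrap flags shown are standard ones.

```python
import os
import resource
import signal
import subprocess
from typing import List

# Illustrative caps only; the real numbers are not given in this post.
RLIMITS = {
    resource.RLIMIT_CPU: 120,              # CPU seconds
    resource.RLIMIT_AS: 2 * 1024**3,       # address space, bytes
    resource.RLIMIT_FSIZE: 512 * 1024**2,  # largest file it may write
    resource.RLIMIT_NOFILE: 256,           # open file descriptors
    resource.RLIMIT_NPROC: 64,             # processes, so a fork bomb hits a wall
}


def apply_rlimits() -> None:
    """Runs in the child between fork and exec, so the caps hit the sandbox only."""
    for res, cap in RLIMITS.items():
        resource.setrlimit(res, (cap, cap))


def bwrap_argv(scratch: str, script: str, uid: int = 4242) -> List[str]:
    """Command line for bubblewrap: fresh namespaces, no network, read-only root."""
    return [
        "bwrap",
        "--unshare-user", "--unshare-pid", "--unshare-net", "--unshare-ipc",
        "--uid", str(uid), "--gid", str(uid),  # hypothetical id with no host account
        "--ro-bind", "/", "/",                 # whole filesystem, read-only
        "--bind", scratch, scratch,            # the one writable directory
        "--proc", "/proc",
        "--dev", "/dev",
        "--chdir", scratch,
        "--die-with-parent",
        "python3", script,
    ]


def run_sandboxed(argv: List[str], timeout_s: float = 120.0) -> int:
    """Run with rlimits and a wall-clock deadline; returns the exit status."""
    proc = subprocess.Popen(argv, preexec_fn=apply_rlimits, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # the whole process group, not just the leader
        proc.wait()
        return -signal.SIGKILL
```

The order of the bind flags matters: the read-only root is mounted first and the scratch directory is bound on top of it. One caveat with `RLIMIT_NPROC` is that it is counted per real user id, so the cap has to leave room for whatever else that user is already running.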

Once the video exists, ffmpeg muxes in the voiceover (copying the video stream untouched, re-encoding only the audio), grabs a thumbnail, and the whole thing gets uploaded to a CDN. The render service is stateless. It remembers nothing between jobs. That is the entire point.
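The mux step boils down to one ffmpeg invocation. Below is a sketch of how the argument lists might be built; the file names, the AAC audio codec, and the one-second thumbnail offset are assumptions, while the flags themselves are standard ffmpeg options.

```python
from typing import List


def mux_argv(video: str, voice: str, out: str) -> List[str]:
    """Video stream copied bit for bit, voiceover encoded and attached as the audio track."""
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-i", voice,
        "-map", "0:v:0",   # picture from the first input
        "-map", "1:a:0",   # sound from the second input
        "-c:v", "copy",    # no re-encode, so no quality loss and very little CPU
        "-c:a", "aac",     # assumed audio codec
        out,
    ]


def thumbnail_argv(video: str, out: str, at_seconds: float = 1.0) -> List[str]:
    """Grab a single frame as the thumbnail."""
    return ["ffmpeg", "-y", "-ss", str(at_seconds), "-i", video, "-frames:v", "1", out]
```

Copying the video stream is what keeps this step cheap: the expensive render already happened, so re-encoding the picture would be pure waste.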

5. The glue

Four services is lovely until someone asks the obvious question: how does the user watch a progress bar move while all of this is happening across several machines? This is my favorite part, honestly, because it is the least visible, and it is what makes the thing feel alive instead of feeling like a form you submit and then stare at.

A generation is a BullMQ job sitting in Redis. The worker pulls it and runs the pipeline: retrieve, write, render, mux, upload. It gets retries with exponential backoff for free, so one flaky LLM call does not nuke your whole video.
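BullMQ handles this for you, but the idea behind exponential backoff is small enough to sketch by hand. This is a generic doubling schedule in Python, not BullMQ's API; the attempt count and base delay are made-up values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(step: Callable[[], T],
                     attempts: int = 3,
                     base_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call step(); on failure wait base_s, then 2 * base_s, then 4 * base_s, and so on."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the job fail for real
            sleep(base_s * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Passing `sleep` in as a parameter is just a convenience so the schedule can be checked without actually waiting.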

The live updates work like this. Every time the worker finishes a step, it publishes a tiny event to a Redis channel named after the chat. The API is subscribed to those channels and re-emits whatever it hears over Socket.IO to the right browser. No polling, no "refresh to check status." The worker says "rendering now," and a fraction of a second later your screen says it too.

flowchart TD
    WS([worker finishes a step, say rendering]) -->|"redis.publish(chat:id)"| RP[(Redis pub/sub)]
    RP --> API[the API re-emits over Socket.IO]
    API --> BR([the browser moves the progress bar, no polling])
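To show the moving parts without Redis or Socket.IO, here is an in-memory sketch of the same fan-out. `Broker` stands in for Redis pub/sub and `emit` stands in for a Socket.IO emit to one room; the event name and payload fields are assumptions.

```python
import json
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class Broker:
    """Minimal stand-in for Redis pub/sub: channels map to lists of handlers."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subs[channel].append(handler)

    def publish(self, channel: str, payload: str) -> int:
        handlers = self._subs.get(channel, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)  # number of subscribers that received the message


def channel_for(chat_id: str) -> str:
    return f"chat:{chat_id}"


def worker_step_done(broker: Broker, chat_id: str, step: str) -> int:
    """Worker side: announce that a pipeline step finished."""
    return broker.publish(channel_for(chat_id), json.dumps({"chatId": chat_id, "step": step}))


def api_bridge(broker: Broker, chat_id: str, emit: Callable[[str, str, str], None]) -> None:
    """API side: re-emit whatever arrives on the chat's channel to that chat's room."""
    def on_message(payload: str) -> None:
        data = json.loads(payload)
        emit(data["chatId"], "progress", data["step"])
    broker.subscribe(channel_for(chat_id), on_message)
```

The worker never knows who is listening, and the API never knows which worker is talking. The channel name is the whole contract.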

One trace id ties the whole thing together, end to end, so when something goes wrong I can pull up a single timeline and see exactly which service ate the latency. Wiring up observability early is one of those things that feels like a tax until the first time it saves you three hours, and then you never go back.

6. The editor nobody asked for, then everybody wanted

Generating a video is great right up until the user wants the title two seconds later and the music a touch quieter. So Stitcher ships a full timeline editor in the browser, and it is collaborative, because of course I could not help myself.

The editor is React on Vite, and the collaboration runs on Yjs, a CRDT library. Two people can drag clips on the same timeline at once and it just merges, no "someone else is editing" locks, no last-write-wins stomping. You see their cursor move, you see what they have selected, in real time.

The bug that taught me the most lives here. Yjs syncs only what you put in the shared document, and I had put in the bare minimum: asset ids, not the fat asset objects with their urls and thumbnails. When a second person joined, their timeline would hydrate from just the ids and the rich data would vanish. The fix was a hydration cache: keep a local map from id to full object, and rebuild the timeline from the synced positions plus that cache, treating the synced doc as the source of truth for "where things are" and the cache as the source of truth for "what things are." Obvious written down. Took me a while.
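The real editor is React and Yjs, but the hydration idea is language-agnostic, so here is a sketch in Python. The field names (`asset_id`, `start`) and the decision to report missing ids back to the caller are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Asset:
    id: str
    url: str
    thumbnail: str


@dataclass(frozen=True)
class Clip:
    start: float
    asset: Asset


def hydrate(synced: List[dict], cache: Dict[str, Asset]) -> Tuple[List[Clip], List[str]]:
    """Join synced positions ("where things are") with cached objects ("what things are")."""
    clips: List[Clip] = []
    missing: List[str] = []
    for entry in synced:
        asset = cache.get(entry["asset_id"])
        if asset is None:
            missing.append(entry["asset_id"])  # fetch these, fill the cache, hydrate again
        else:
            clips.append(Clip(start=entry["start"], asset=asset))
    return clips, missing
```

A newly joined collaborator starts with an empty cache, so their first pass is mostly `missing` ids. That is fine, as long as nothing treats a missing asset as a deleted one.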

Performance got its own pass too: the playback clock runs on requestAnimationFrame for smooth scrubbing, cursor presence is throttled to about 25 updates a second so the network does not melt, routes are code-split and lazy-loaded, and the timeline's frame thumbnails are pulled on demand from the CDN instead of generated up front.
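The presence throttle is the easiest of those to show. This is a leading-edge throttle sketched in Python rather than the editor's actual code; the class name is invented, and a production version would probably also flush the last dropped update so the cursor lands in the right place.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Lets at most `per_second` events through; the rest are dropped."""

    def __init__(self, per_second: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._min_gap = 1.0 / per_second
        self._clock = clock
        self._last: Optional[float] = None

    def allow(self) -> bool:
        now = self._clock()
        if self._last is not None and now - self._last < self._min_gap:
            return False  # too soon after the last update that went out
        self._last = now
        return True
```

At 25 updates a second the minimum gap is 40 ms, which is still faster than most people can perceive a cursor stutter.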

And the name finally earns itself here. When you hit export, the backend uses ffmpeg to stitch the per-scene clips into one continuous video. That is the literal Stitcher. Payments run through Razorpay on a credit model, where generating and editing both cost credits, so the economics actually close.
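One common way to join clips with ffmpeg is the concat demuxer, which reads a small text file listing the inputs. The sketch below assumes that approach and stream copy; the export path may well do something different, and stream copy only works when every clip shares the same codec settings.

```python
from typing import List


def concat_list(paths: List[str]) -> str:
    """Contents of the list file the concat demuxer reads: one `file '...'` line per clip."""
    lines = []
    for path in paths:
        escaped = path.replace("'", "'\\''")  # close quote, escaped quote, reopen quote
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_argv(list_path: str, out: str) -> List[str]:
    """ffmpeg command that joins the listed clips without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", out]
```

Since every scene comes out of the same render pipeline with the same settings, matching codecs is a reasonable assumption for the generated clips.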

7. The plugins I forked

One last thing I am quietly proud of. Manim has a small ecosystem of community plugins for specific domains, and Stitcher leans on six of them: chess, data structures, machine learning, physics, chemistry, and astronomy. Instead of asking the model to draw a chessboard from primitives (pain, and it will look wrong), the retrieval layer can route a chess prompt to manim-chess, which already knows how to render a board, parse FEN and PGN, and animate legal moves including castling and en passant.

Two of those, chess and the data structures one (manim-dsa), I had to fork and fix myself. They targeted a newer Python than my render image runs, so they imported names that did not exist in my interpreter and refused to install at all. The fixes were tiny (relax a version pin in one, route a typing import through a backport in the other) but without them, two whole domains of my product simply would not render. I wrote up exactly what I changed in a FIXES file inside each fork, so future me knows why those lines look weird. Then I wired both into the system so the model knows the plugins exist and how to drive them.

Where this goes next

That is Stitcher, or at least the half of it that fits in a blog post. The honest takeaway after ten months is that the AI was never the hard part. Getting an LLM to write Manim is a Tuesday. The hard part was everything around it: making untrusted code safe to run, keeping four services in sync, stopping a CRDT from corrupting a timeline, and making the whole thing feel instant while it does an enormous amount of work behind a single progress bar.

If you want the deeper cuts (the seccomp policy, the exact RRF math, the LangGraph state, the way the tracing is wired) I am happy to write follow-ups. For now, go type a sentence and watch it turn into a video. That still has not gotten old. ✨