AV
13 min read

CHAD: building a 99M-param Hinglish texting bot from scratch

A transformer hand-written in PyTorch, four billion tokens of Indian Reddit, knowledge distillation through a Codex proxy, and the hard wall every tiny model hits. Trained on zero dollars of compute, runs entirely in your browser.

Everyone is building wrappers around someone else's model right now. I wanted to know what was actually inside the box, so I built one. From scratch. A real decoder-only transformer, hand-written in PyTorch, ~99 million parameters, trained from nothing on four billion tokens of text I had to go out and manufacture myself.

I trained it to talk like my group chat: romanized Hinglish, lowercase, dry, a little mean. I named it chad. It is a chill Gen-Z texting bot, and the best part is that it runs entirely in your browser tab, on your own CPU, which means it costs me exactly nothing to serve.

The whole thing took about two weeks of nights and zero dollars of compute. It is also the project that taught me the single most useful thing I know about LLMs, which is what they fundamentally cannot do no matter how good your data is. We will get there. This is the full build, and you can skim it by chapter.

1. A transformer from scratch

I did not import a model. I wrote every piece by hand: rotary position embeddings (RoPE), grouped-query attention (GQA), a SwiGLU feed-forward, RMSNorm. This is the exact modern Llama and Mistral recipe, just shrunk down. The final shape is d_model 768, 12 layers, 12 query heads but only 3 key/value heads, a 1024 token context, and a tied output head, which lands at exactly 98,913,024 parameters.

flowchart TD
    T([token ids]) --> E["embedding 32000 × 768<br/>(tied to the output head)"]
    E --> B
    subgraph B["× 12 transformer blocks"]
      direction TB
      A1["x = x + attn(RMSNorm x)<br/>GQA · 12 q / 3 kv heads · RoPE"]
      A2["x = x + ffn(RMSNorm x)<br/>SwiGLU gated FFN"]
      A1 --> A2
    end
    B --> N([final RMSNorm])
    N --> H["output head 768 → 32000<br/>(same matrix as the embedding)"]
    H --> O([next-token logits])

Every one of those choices earns its place, and writing them by hand is the only way I actually understood why:

  • RoPE encodes position by rotating the query and key vectors, so attention depends on the relative distance between tokens. It adds zero parameters and extrapolates past the training length.
  • GQA is the reason there are 3 KV heads instead of 12. At generation time the model caches a key and value for every past token, and that cache is what eats your memory. Sharing KV heads across query heads shrinks it 4x for basically no quality loss.
  • SwiGLU is a gated feed-forward (three matrices, a SiLU gate) that consistently beats a plain ReLU block at the same parameter count.
  • Weight tying makes the input embedding and the output projection the same matrix. That alone saves 24.5 million parameters and couples the two representations, which helps a small model.

The goal here was not to be clever. It was to write something I could read top to bottom and fully believe. A modern LLM is not magic. It is these five ideas stacked twelve times.

2. A tokenizer for a language that does not spell consistently

Hinglish has no spelling standard. "bhai," "bhaii," "bhaiya," all valid, all in the wild. People code-switch mid-sentence and drop an emoji in the middle of a word. A normal English tokenizer would choke on this and emit a stream of unknown tokens.

So I trained a custom 32,000-token byte-level BPE tokenizer on the corpus. Byte-level is the key word: the base alphabet is literally the 256 possible bytes, so every conceivable input is already made of tokens and there is no such thing as an out-of-vocabulary character, ever. No <unk>, no fallback, just bytes merging into whole Hinglish words. I also forced digit-splitting (every number breaks into single digits) so the model does not waste vocab memorizing "2024" as one atom. The whole thing trained in 159 seconds on my Mac and ships as one 2.2 MB JSON file.

3. Manufacturing four billion tokens of Hinglish

This was the hardest part of the entire project, and it has nothing to do with the model. There is no big, clean, romanized-Hinglish pretraining corpus. The giant web crawls are overwhelmingly English. The serious Indian-language datasets are in Devanagari script, which is the wrong format, because Indians do not text in Devanagari, they text in roman letters. So I had to build the corpus from scratch.

The only place on earth with billions of tokens of real, casual, romanized Hinglish is Indian Reddit. I pulled it in bulk from the Pushshift archive torrents, with no Reddit API and no rate limits. The whole trick lives in one detail:

flowchart TD
    TOR[(Pushshift per-subreddit torrent<br/>~80k files = all of Reddit)] --> PICK["parse the catalog in memory,<br/>pick 214 Indian subs by name"]
    PICK --> DL["aria2c --select-file<br/>pull only those .zst, not the 4 TB"]
    DL --> ZSTD["decompress with long-distance zstd<br/>(2 GB window)"]
    ZSTD --> CLEAN["clean: drop bots, non-Latin, junk;<br/>keep romanized only"]
    CLEAN --> DEDUP["Bloom-filter dedup, fixed memory"]
    DEDUP --> OUT([~4.13B tokens, packed to a uint16 .bin])

The Pushshift dumps exist in two layouts of the same data. One is by-month (a single file holds every subreddit for that month), which is useless, because grabbing one community means downloading roughly 4 TB and scanning it. The other is by-subreddit (one file per community), which is exactly what you want. A torrent file is just a small catalog, so I parsed it in memory, name-filtered down to 214 Indian subreddits, and handed only those file indices to the downloader. One non-obvious flag, --file-allocation=none, was load-bearing: without it the client preallocates the entire multi-TB torrent on disk before downloading a single byte.

Then came cleaning, which is where most of the real work hides. Drop the bots and AutoModerator. Drop anything not in Latin script (some "Indian" subs like r/Kerala are actually Malayalam, wrong language for this corpus). Drop the junk and the copypasta. Dedup with a fixed-memory Bloom filter. What survived was about 4.13 billion clean tokens across 128.9 million comments, packed into a compact uint16 binary stream. That pile of unhinged Indian Reddit is the entire personality of the model before I ever fine-tuned it.

4. Pretraining on free TPUs for zero dollars

I do not own a GPU and I had no budget, so the constraint was simple: this trains on $0 of compute or it does not happen. The tokenizer ran on my Mac. The pretraining ran on Kaggle's free TPU v5e-8 (eight chips, and crucially, with no credit card on file they physically cannot bill me). The fine-tuning ran on Kaggle's free T4.

I never opened a notebook. Everything ran headless: I would build one self-contained training script, push it to Kaggle as a "kernel" (a batch job) straight from my laptop, poll for status, and download the checkpoint when it finished.

It was not smooth. The pretraining run took five attempts to get one clean pass. I OOM'd the chips at batch size 32. I caught a SIGTERM at exactly 460 seconds from a missing XLA mark_step call. I hit Kaggle's 8-hour session wall halfway through and had to checkpoint and resume in a second session. The final model landed at a validation loss of about 3.77: fluent, idiomatic Hinglish, no persona yet, and a habit of leaking raw Reddit because I had pretrained with no boundaries between comments. That last bug is exactly what the next phase exists to fix.

5. The distillation trick: stealing humor

Here is the core idea of the whole project, and the thing I am proudest of. A 99M model cannot be funny. That is just true. It can learn the shape of a joke (the rhythm, the slang, the gif at the end) but it cannot write one, because a real joke needs world knowledge and a model of what the other person finds cutting, and a tiny model has neither. So I stopped asking it to be funny.

Instead I used instruction backtranslation. Take a real, upvoted Reddit comment (upvotes above 50). That comment is the funny answer, and a human already wrote it. Now have a big LLM write a plausible question that this comment would be the perfect reply to. You end up with a chat example where the hard part (being funny) was done by a human, and the LLM only did the trivial part (invent a setup).

flowchart TD
    C["a real, upvoted Reddit comment<br/>(the ANSWER, written by a human)"] --> T{{"teacher LLM<br/>GPT-5.5 / GPT-5.4-mini via XyPro,<br/>or DeepSeek V4 Flash"}}
    T -->|writes a question it answers| PAIR["a chat pair: a question,<br/>then the real comment as the answer"]
    PAIR --> EX([one training example: the LLM only wrote the easy setup])

5.1 Where the teacher models came from (XyPro)

I generated tens of thousands of these pairs, which means tens of thousands of LLM calls, which normally means a real bill. I had a Codex Pro subscription, and I had already built a little side project called XyPro that wraps that subscription behind a plain OpenAI-compatible endpoint on localhost. So I pointed my generation scripts at XyPro for the GPT-5.5 and GPT-5.4-mini calls, and at DeepSeek V4 Flash for the cheap, fast bulk. DeepSeek had one genuinely cursed footgun worth knowing: you have to send exactly thinking: {"type": "disabled"} or it silently leaves reasoning on, dumps the answer into a hidden field, and hands you back empty content. Cost me a couple hours before I pinned it down.

5.2 The persona, and six tries to land it

Distillation teaches humor, but it does not teach manners. So I also generated a "default voice" across 50 categories (greetings, sad, smalltalk, roast-bait, homework, and so on) from a single written persona spec. And the persona itself pivoted hard. The first three versions chased a loud, cocky "bhai" that roasted everything, including people who were genuinely sad, which was just mean and bad. From v4 on I dumped that entirely for a chill Gen-Z texter: lowercase, dry, roughly 55% romanized hindi and 45% english woven into every reply, with one golden rule (match the user's energy, and be savage only when invited).

It took six full data regenerations, v1 through v4_3, to get there. The thing I want to stress: the model never changed across any of those. Same 99M weights, same architecture, byte for byte. Every single improvement came from regenerating the data and re-measuring. That is the actual job.

6. Tools without touching the vocabulary

I gave chad two tools, a calculator and reaction gifs, under one hard constraint: no new tokens. Adding special tokens means a bigger embedding table, a resized output head, and re-running the entire export pipeline. So the tags (<calc>, <gif>, <result>) are just ordinary plain text the model learns to type, and the runtime spots them with a regex on the decoded output.

The calculator is a real little protocol. The model writes the expression and then stops dead, control returns to the runtime, the runtime does the actual arithmetic, and then the model writes its real reply using the answer:

flowchart TD
    U([user: 256 + 789 kitna hota hai]) --> M["chad emits a calc tag wrapping 256+789,<br/>then STOPS"]
    M --> RT["the runtime computes it<br/>(the model never does the math)"]
    RT --> RES["re-prompt with result = 1045"]
    RES --> A([chad: 1045 bro, itna toh ho hi jata hai])

And this is where it gets interesting. The mechanism is flawless: on a 16-prompt test the model fired the tool, wrote a valid expression, and used the result in character 16 out of 16 times. But the arithmetic itself is only right about 56% of the time. Direct "A op B" is perfect. Word problems mangle the operands ("I had 5000, bought a 2345 phone, how much left" turns into 2345 minus 1798, which is just nonsense). Hold onto that split. It is the whole punchline of the last chapter.

7. Aligning it, and grading something with no right answer

After fine-tuning there were still residual problems: it would occasionally cave when provoked, leak an offensive English line, or blurt out its maker's name. I did a final preference pass to clean those up (DPO, technically RPO once I learned that pure DPO has no stable operating point on a model this small). It sharpened the tone and killed the leaks. It also added exactly zero new capability, which is foreshadowing.

Grading was its own problem, because there is no correct answer to a roast, so perplexity tells you nothing. I used LLM-as-a-judge: DeepSeek V4 Flash scoring every reply on a rubric (relevance, tone-match, voice, coherence, plus a flag for "did it roast someone who did not ask for it"). When the persona pivoted from savage to chill, I had to rewrite the rubric too, because the old grader would have happily rewarded the exact cringe I was trying to remove. The final model scores 1.84 out of 2 on relevance and near the ceiling on voice, with the bad-roast rate down from 24% to about 2%.

8. Shipping it to the browser for $0

This is the part I am smug about. The model is small enough to send to the user and run on their machine. No inference server, no GPU, no per-token bill. It downloads once, caches forever, and every token is generated on the visitor's own CPU.

The conversion is almost a cheat. My hand-written architecture is, structurally, identical to a Llama (same RMSNorm, RoPE, GQA, SwiGLU, tied embeddings). So turning my custom checkpoint into a HuggingFace Llama is a pure key-rename, no weight surgery, and I parity-check the logits to prove the two models are the same. Then I export to ONNX and run it with transformers.js on WASM, inside a web worker so it never freezes the page.

flowchart TD
    PT[("best.pt, my custom PyTorch model")] --> HF["rename the keys → a HuggingFace Llama<br/>(no surgery, logits parity-checked)"]
    HF --> ONNX["export to ONNX, fp32, 472 MB"]
    ONNX --> JS["transformers.js + onnxruntime-web (WASM)"]
    JS --> BR([runs in your browser tab, on your cpu, ~78 tok/s])
    BR --> COST([inference cost to me: $0])

It does about 78 tokens a second on my Mac. I actually ship the full fp32 model (472 MB) instead of the int8 one (119 MB), because int8 quietly wrecks the calculator's digit copying, and I would rather eat the download than ship broken math. There is a Next.js app for the chat and a tiny Express/Mongo backend that does nothing but log (chat works perfectly even when it is completely down). And every conversation a real person rates becomes a future training example, so the thing has a built-in data flywheel. Free inference and free data. I like that a lot.

9. The 99M ceiling (the real finding)

If you take one thing from this whole project, take this. Across six data regenerations and an alignment pass, every capability sorted cleanly into one of two buckets.

Data-addressable. Voice, relevance, staying on topic, manners, output format, tool-call format, leak suppression. Fine-tuning moved all of these, every time. They are pattern-level behaviors, and enough good data installs them.

Parameter-bound. Genuine humor, factual knowledge, math from a word problem, and working memory. Nothing moved these. Ever. Tell chad your name and ask for it two turns later, while the name is still sitting right there in the context window, and it will confidently say a different name. When you inspect the logits, the correct name token sits around rank 4,000 out of 32,000, meaning essentially zero probability. Ask it a fact and it word-salads, because the parameters to actually store facts are not there.

The cleanest demonstration is that calculator from chapter 6. The tool protocol (a pattern) is 16 out of 16. The operand binding (reasoning) is 56%. Same model, same prompt, one half learned flawlessly and the other half slammed into a wall, in a single feature.

Pattern-level coherence is data-addressable. Variable-binding and reasoning are parameter-bound.

That one line is the entire project. It is also why every workaround for the "be funny" problem was the same move: do not ask the small model to reason, source the capability from outside (real human humor, a deterministic calculator) and let the 99M model do the only thing it is genuinely great at, which is sounding exactly like a guy from your group chat.

What I actually learned

I set out to demystify the box, and I did. A modern LLM is RoPE and grouped-query attention and a gated feed-forward and a mountain of data and a lot of deeply unglamorous plumbing. None of it is magic. The part that actually is magic, the reasoning, the facts, the wit, is the part you cannot buy with cleaner data. It only shows up with scale, and now I have felt exactly where that wall is with my own hands.

chad is tiny and it is dumb and it talks exactly like my friends, and it runs in a browser tab for free. Honestly could not be prouder of the little guy. 🫡

aillmfrom-scratchdeep-learning