Running Transformers.js in a Chrome Extension: What I Learned Building a Gemma 4 Assistant

I recently built a Chrome extension that runs a local AI assistant powered by Gemma 4 E2B, all through Transformers.js. The idea was simple: give users a side panel that can understand page content, answer questions, and highlight relevant sections without sending data to any server.

Turns out, making that work under Manifest V3 constraints involves some interesting trade-offs. Here’s what I learned.

The architecture split that actually works

The first decision is where to put what. This extension runs in three MV3 contexts: the background service worker, the side panel, and a content script. The instinct might be to keep everything in one place, but that breaks down fast when you’re dealing with large models.

I landed on this split:

Background handles everything heavy: model loading, inference, agent logic, and shared state like conversation history.
Side panel is purely UI: chat input, streaming output, and setup controls.
Content script is the page bridge: it extracts DOM content and applies highlights.

The key insight is that the background is the single source of truth. The side panel doesn’t hold any state about the conversation; it sends events like AGENT_GENERATE_TEXT to the background, which appends the message, runs inference, and emits MESSAGES_UPDATE back. The side panel just renders what it receives.

This avoids duplicate model loads across tabs and keeps the UI responsive. More importantly, it respects Chrome’s security model: the content script is the only context that can touch the page’s DOM, and the background is the only one that should hold expensive resources.
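A minimal sketch of that flow, assuming hypothetical `generate` and `emit` callbacks in place of the real model call and the real channel back to the side panel. Only the event names AGENT_GENERATE_TEXT and MESSAGES_UPDATE come from the extension itself; the message shapes and helper names are illustrative.

```typescript
// Background-owned conversation state (sketch, not the extension's actual code).
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

type PanelRequest = { type: "AGENT_GENERATE_TEXT"; text: string };
type BackgroundEvent = { type: "MESSAGES_UPDATE"; messages: ChatMessage[] };

// Hypothetical stand-ins for inference and for messaging the side panel.
type Generate = (history: readonly ChatMessage[]) => Promise<string>;
type Emit = (event: BackgroundEvent) => void;

function createConversation(generate: Generate, emit: Emit) {
  // The only copy of the history lives here, in the background.
  const messages: ChatMessage[] = [];

  // Send a snapshot of the array so the receiver cannot mutate the source list.
  const publish = (): void =>
    emit({ type: "MESSAGES_UPDATE", messages: [...messages] });

  return {
    async handle(request: PanelRequest): Promise<void> {
      messages.push({ role: "user", content: request.text });
      publish(); // lets the panel render the user's message immediately

      const reply = await generate(messages);
      messages.push({ role: "assistant", content: reply });
      publish();
    },
  };
}
```

In the extension, `handle` would presumably be called from a `chrome.runtime.onMessage` listener, and `emit` would wrap `chrome.runtime.sendMessage`. The side panel then needs no state beyond the last MESSAGES_UPDATE it received.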

Messaging: keep it typed and simple

Once you split runtimes, messaging becomes the backbone. I defined all message types as enums in a shared types file. The pattern is straightforward:

Side panel asks background to do something (initialize models, generate text, clear messages).
Background responds with progress updates and state changes.
Background also talks to the content script for page extraction and highlighting.

The rule I followed: the background coordinates everything. Neither the side panel nor the content script acts on its own. They send requests, the background decides what happens, and the background pushes updates back.
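As a rough illustration, the contract can be written as a discriminated union with a runtime guard. String literal types stand in for the enums here to keep the sketch short. CLEAR_MESSAGES and every payload field (`text`, `percent`, `messages`) are assumptions, not names taken from the real shared types file.

```typescript
// Shared message contract (sketch). Payload shapes are illustrative.

// Requests sent from the side panel to the background.
type PanelToBackground =
  | { type: "CHECK_MODELS" }
  | { type: "INITIALIZE_MODELS" }
  | { type: "AGENT_GENERATE_TEXT"; text: string }
  | { type: "CLEAR_MESSAGES" }; // hypothetical name

// Updates pushed from the background to the side panel.
type BackgroundToPanel =
  | { type: "DOWNLOAD_PROGRESS"; percent: number }
  | { type: "MESSAGES_UPDATE"; messages: { role: string; content: string }[] };

// Messages arrive untyped at runtime, so validate before trusting them.
function isPanelRequest(value: unknown): value is PanelToBackground {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { type?: unknown; text?: unknown };
  switch (candidate.type) {
    case "CHECK_MODELS":
    case "INITIALIZE_MODELS":
    case "CLEAR_MESSAGES":
      return true;
    case "AGENT_GENERATE_TEXT":
      return typeof candidate.text === "string";
    default:
      return false;
  }
}

// Exhaustive routing: adding a request type without a case fails to compile.
function routeRequest(request: PanelToBackground): string {
  switch (request.type) {
    case "CHECK_MODELS":
      return "inspect the model cache";
    case "INITIALIZE_MODELS":
      return "download and load models";
    case "AGENT_GENERATE_TEXT":
      return `run inference on: ${request.text}`;
    case "CLEAR_MESSAGES":
      return "reset the conversation";
    default: {
      const unhandled: never = request;
      throw new Error(`unhandled message: ${JSON.stringify(unhandled)}`);
    }
  }
}
```

Keeping requests and updates in separate unions makes the direction of each message visible in the types, which matches the rule that only the background pushes state.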

Model loading and the cache gotcha

This project uses two models: Gemma 4 (text generation, quantized to q4f16) and MiniLM (embeddings, fp32). The split is intentional: Gemma handles reasoning and tool decisions, while MiniLM does semantic search for features like “find similar pages.”

All inference runs in the background service worker. This gives a single model host for all tabs and sessions, which avoids duplicate memory usage. There’s also a subtle benefit I didn’t expect: models loaded from a service worker are cached under the extension’s origin (chrome-extension://), not under the origin of whatever site the user is on. So you get one shared cache for the entire extension install, and after the first download the weights load from that cache instead of the network.

The downside? MV3 service workers can be suspended and restarted at any time. You can’t assume models stay in memory forever. The code needs to check what’s cached, estimate remaining download size, and be ready to re-initialize when the worker wakes up.
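One way to sketch the cache check and the remaining-size estimate, assuming you ship a list of model files with expected sizes. The `ModelFile` list, the `CacheLike` interface, and `checkModels` are hypothetical names; `CacheLike` is just the slice of the browser Cache API that the sketch needs.

```typescript
// Cache inspection (sketch). The file list and sizes are assumed inputs.

// Minimal slice of the Cache API: resolves to undefined on a miss.
interface CacheLike {
  match(url: string): Promise<unknown | undefined>;
}

interface ModelFile {
  url: string;
  bytes: number; // expected size, taken from a manifest you maintain (assumption)
}

interface CacheReport {
  cachedBytes: number;
  remainingBytes: number;
  missing: string[];
}

async function checkModels(
  cache: CacheLike,
  files: readonly ModelFile[],
): Promise<CacheReport> {
  const report: CacheReport = { cachedBytes: 0, remainingBytes: 0, missing: [] };
  for (const file of files) {
    const hit = await cache.match(file.url);
    if (hit !== undefined) {
      report.cachedBytes += file.bytes;
    } else {
      // Anything not cached still has to be downloaded.
      report.remainingBytes += file.bytes;
      report.missing.push(file.url);
    }
  }
  return report;
}
```

In the service worker, `cache` would come from `caches.open(...)` with whatever cache name your Transformers.js version writes to. I believe the default is `transformers-cache`, but treat that name as an assumption and verify it. Because the report is computed from the cache rather than from memory, it stays correct after the worker restarts.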

I made the model lifecycle explicit: CHECK_MODELS inspects the cache, INITIALIZE_MODELS downloads and loads everything, and the UI shows progress via DOWNLOAD_PROGRESS events. Once loaded, the pipeline instances are reused until the worker dies.
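The reuse-until-the-worker-dies behavior can be sketched as a small host that memoizes load promises. The `Loader` callbacks are hypothetical stand-ins for the real Transformers.js calls, and `createModelHost` and `DownloadProgress` are illustrative names.

```typescript
// Lazy model host (sketch). Loaders are injected stand-ins for real model loads.

interface DownloadProgress {
  model: string;
  percent: number;
}

// A loader reports progress while it downloads and resolves to the loaded model.
type Loader<T> = (onProgress: (percent: number) => void) => Promise<T>;

function createModelHost<T>(
  loaders: Record<string, Loader<T>>,
  emit: (progress: DownloadProgress) => void,
) {
  // In-memory only: this map starts empty again whenever the worker restarts,
  // so the first call after a restart triggers a fresh load.
  const loaded = new Map<string, Promise<T>>();

  function get(name: string): Promise<T> {
    let pending = loaded.get(name);
    if (!pending) {
      const loader = loaders[name];
      if (!loader) return Promise.reject(new Error(`unknown model: ${name}`));
      pending = loader((percent) => emit({ model: name, percent }));
      // Forget failed loads so a later call can retry.
      pending.catch(() => loaded.delete(name));
      loaded.set(name, pending);
    }
    return pending;
  }

  return {
    get,
    // Roughly what an INITIALIZE_MODELS handler would call.
    initializeAll: (): Promise<T[]> => Promise.all(Object.keys(loaders).map(get)),
  };
}
```

Storing the promise rather than the resolved model means two requests that arrive during a download share one load instead of starting two. In the extension, each loader would likely wrap a Transformers.js `pipeline(...)` call and forward its `progress_callback` events as DOWNLOAD_PROGRESS messages.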

What I’d do differently next time

A few things I’d change if I were starting over:

First, the conversation history lives in the background service worker’s memory. That makes sense for avoiding duplication, but it means if the worker dies, you lose the chat history. I should persist it to chrome.storage.local or IndexedDB periodically (window.localStorage isn’t available in a service worker).
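A minimal sketch of that persistence, assuming a storage object with promise-based `get` and `set` methods like the ones `chrome.storage.local` exposes under MV3. The key name and helper names are hypothetical.

```typescript
// History persistence (sketch). StorageAreaLike mirrors the promise-based
// get/set shape of chrome.storage.local; the key name is made up.

interface StorageAreaLike {
  get(key: string): Promise<Record<string, unknown>>;
  set(items: Record<string, unknown>): Promise<void>;
}

interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

const HISTORY_KEY = "conversationHistory"; // hypothetical key

function isChatMessage(value: unknown): value is ChatMessage {
  if (typeof value !== "object" || value === null) return false;
  const m = value as { role?: unknown; content?: unknown };
  return (m.role === "user" || m.role === "assistant") && typeof m.content === "string";
}

async function saveHistory(
  storage: StorageAreaLike,
  messages: readonly ChatMessage[],
): Promise<void> {
  await storage.set({ [HISTORY_KEY]: messages });
}

// Call once when the worker starts; returns an empty history if nothing is stored.
async function restoreHistory(storage: StorageAreaLike): Promise<ChatMessage[]> {
  const stored = (await storage.get(HISTORY_KEY))[HISTORY_KEY];
  // Stored data may be stale or malformed, so keep only well-formed entries.
  return Array.isArray(stored) ? stored.filter(isChatMessage) : [];
}
```

Saving after every MESSAGES_UPDATE is the simplest policy. Long conversations may call for trimming or a move to IndexedDB.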

Second, the content script is simpler than I’d like. Right now it just extracts page text and applies highlights. I’d like to add more granular element selection and better handling of dynamic content that loads after the initial page load.

Third, I underestimated the download size. Gemma 4 at q4f16 is still a few hundred megabytes. The first load takes a while, and users on slower connections might not have the patience. A progress indicator helps, but I’m considering offering a smaller model as a fallback.

The takeaway

Running Transformers.js in a Chrome extension is absolutely feasible, but you need to be deliberate about architecture. The MV3 constraints aren’t arbitrary: they force you to think about resource management and lifecycle in ways that actually make your extension more robust.

If you’re building something similar, start with the messaging contract. Define what each runtime owns, make the background the single coordinator, and treat model state as recoverable. The rest is just plumbing.

The full source is on GitHub if you want to see how it all fits together. It’s not perfect, but it works.