A real language model that reads your dataset and answers questions — running entirely in the browser tab via WebGPU, with no API key and no server.
▶ Live demo: https://www.johnmikelregida.com/labs/copilot
Runs entirely in your browser. No API key, no backend, no telemetry — your data never leaves the device.
Paste a small dataset (CSV or a plain list), then ask a question or request a data-quality read. A quantised open-weights LLM is downloaded once, compiled to your GPU, and runs inference locally; the answer streams back token-by-token without a single network request leaving the page. This is interesting to anyone who deals with data that cannot leave the device — regulated, on-prem, air-gapped, or simply privacy-sensitive workflows — and to engineers tracking the browser's emergence as a genuine inference runtime.
It is deliberately scoped: a ~0.5B-parameter model is small. It is good for structural and data-quality observations over a few rows, not for heavy reasoning over large tables. The honesty matters more than the demo.
The whole app is a single self-contained index.html — no build step, no framework, no bundler.
- Inference engine: WebLLM (`@mlc-ai/web-llm`), loaded as an ES module straight from `https://esm.run/@mlc-ai/web-llm`. WebLLM uses the MLC/Apache TVM runtime to compile the model to WebGPU compute shaders and run it locally.
- Model: `Qwen2.5-0.5B-Instruct-q4f16_1-MLC`, which is Qwen2.5 0.5B Instruct with 4-bit weight quantisation and fp16 activations (`q4f16_1`). The download is ~0.5 GB and is cached by the browser after the first load.
- Gated load: the model is not fetched on page open. WebGPU is feature-detected with `navigator.gpu`; if it's missing, the load button is disabled and the page shows a clear fallback message plus a worked example so it stays useful without WebGPU. The ~0.5 GB download is only triggered by an explicit Load model button, with a live progress bar wired to WebLLM's `initProgressCallback`.
- Prompt construction: a fixed system prompt frames the model as a precise data analyst that must cite real values and never invent data. The user's pasted dataset is truncated to the first 2000 characters and concatenated with the question.
- Generation: `engine.chat.completions.create({ ..., temperature: 0.3, stream: true })`, an OpenAI-shaped streaming API. Output is accumulated and HTML-escaped before rendering, so pasted data can't inject markup.
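The prompt-construction and escaping steps can be sketched roughly as follows. This is a minimal illustration, not the code in `index.html`: the function names, the system prompt wording, and the `DATA:` / `QUESTION:` layout are all assumptions. Only the 2000-character cap and the escape-before-render step come from the description above.

```javascript
// Hypothetical names throughout; the real index.html may differ.
const MAX_DATA_CHARS = 2000;

// Assumed wording: the actual system prompt is not reproduced in this README.
const SYSTEM_PROMPT =
  "You are a precise data analyst. Cite real values from the data and never invent data.";

// Build an OpenAI-shaped message list from the pasted dataset and the question.
function buildMessages(dataset, question) {
  const data = dataset.slice(0, MAX_DATA_CHARS); // keep only the first 2000 characters
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: `DATA:\n${data}\n\nQUESTION:\n${question}` },
  ];
}

// Escape the characters that are significant in HTML so that model output
// (which may echo pasted data) is rendered as text rather than markup.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;") // must run first so later entities are not double-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}
```

Truncating by characters rather than rows keeps the sketch simple, but it can cut a CSV row in half; a variant that trims back to the last complete line would be a reasonable refinement.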
page load → navigator.gpu check → (button) import WebLLM via esm.run
→ CreateMLCEngine("Qwen2.5-0.5B-Instruct-q4f16_1-MLC") [~0.5 GB, cached]
→ system prompt + DATA (≤2000 chars) + QUESTION
→ streamed completion (temp 0.3) → escaped, rendered token-by-token
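The load and streaming stages of the flow above might look something like this. It is a sketch under assumptions: `loadEngine`, `streamAnswer`, and the callback names are hypothetical, while `CreateMLCEngine`, `initProgressCallback`, and `engine.chat.completions.create` are the WebLLM calls named in this README. `loadEngine` only works in a WebGPU-capable browser.

```javascript
const MODEL_ID = "Qwen2.5-0.5B-Instruct-q4f16_1-MLC";

// Load WebLLM on demand (browser only). onProgress receives WebLLM's progress
// report, which carries a 0..1 `progress` value and a human-readable `text`.
async function loadEngine(onProgress) {
  const { CreateMLCEngine } = await import("https://esm.run/@mlc-ai/web-llm");
  return CreateMLCEngine(MODEL_ID, { initProgressCallback: onProgress });
}

// Stream a completion, calling onText with the accumulated answer after each chunk.
async function streamAnswer(engine, messages, onText) {
  const stream = await engine.chat.completions.create({
    messages,
    temperature: 0.3,
    stream: true,
  });
  let answer = "";
  for await (const chunk of stream) {
    // Some chunks carry no content, so fall back to an empty string.
    answer += chunk.choices[0]?.delta?.content ?? "";
    onText(answer);
  }
  return answer;
}
```

Passing the accumulated text to `onText`, rather than each delta, lets the caller escape and re-render the whole answer on every update, which keeps the escaping step in one place.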
A note on provenance: the default dataset shipped in the textarea is a small synthetic stops table (ATCO-style codes, deliberately seeded with empty coordinates, stale modified dates, an inactive-but-present row, and a "Bank of Engalnd" misspelling) used to demonstrate the data-quality read. It is illustrative example data, not a real export.
On-device inference removes the two hardest objections to putting an LLM near sensitive data: the data never crosses a trust boundary, and there is no per-call API cost or rate limit — the compute is the user's own GPU. The same pattern generalises to data-platform work: a data-quality "linter" that runs in the analyst's browser, contract validation that never ships rows to a vendor, or an agent step that can operate offline and air-gapped. As WebGPU support matures, "the model runs where the data already is" becomes a real architectural option rather than a compliance compromise.
The app is a static page that fetches ES modules and WebGPU/WASM assets from CDNs, so it must be served over HTTP; file:// will not work.
    cd data-copilot
    python3 -m http.server 8000
    # then open http://localhost:8000/index.html in a WebGPU-capable browser

There is no build step, no npm install, and no data pipeline: `index.html` is the entire application. To run the model live you need a desktop browser with WebGPU enabled (recent Chrome, Edge, or Safari) and enough VRAM for the quantised 0.5B model. Without WebGPU the page still loads and shows the worked example.
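To check whether a given browser meets the WebGPU requirement, a detection helper along these lines could be pasted into the console. The page itself is described as checking `navigator.gpu`; the extra `requestAdapter()` call here is an assumption added for a stricter check, and `hasWebGPU` is a hypothetical name.

```javascript
// Returns true when a usable WebGPU adapter is available.
// `nav` defaults to the global navigator; it is a parameter only so the
// check can be exercised outside a browser.
async function hasWebGPU(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return false; // API not exposed at all
  try {
    // requestAdapter() resolves to null when no suitable adapter exists.
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}

// Browser usage: hasWebGPU().then((ok) => console.log("WebGPU available:", ok));
```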
- WebLLM (`@mlc-ai/web-llm`) over the MLC / Apache TVM runtime
- WebGPU for on-device compute
- Qwen2.5-0.5B-Instruct, `q4f16_1` quantisation
- Vanilla JS ES modules via `esm.run` (no framework, no bundler)
- Single self-contained `index.html`
Built by John Mikel Regida — Lead Data Architect (Thoughtworks; UK Dept for Transport / NaPTAN; ex-CTO; 5× Google Cloud Professional). GitHub: github.com/johnmikel. Site: https://www.johnmikelregida.com
Part of the JMR Labs suite — https://www.johnmikelregida.com/labs