Most parsers do too much. They build a semantic model of the file when all you need is its structure — scopes, lists, fields. The meaning belongs to the application reading the parsed data, not to the parser itself. But the lex+yacc tradition forces both: tokenize first, then reconstruct everything with a state machine. Covering a full format that way is enormous work — and almost never finished.
rawast formalizes the structure-first approach as a universal bidirectional grammar-driven engine for structured text and binary formats. Every EDA tool today reimplements its own readers for LEF, DEF, GDSII, Liberty, and every other format the field uses — every one re-parsing the same files. rawast inverts that: one engine, grammars as data files, and a binary container that distributes parsed data so downstream consumers never re-parse text at all. Ships as a C++17 library with Python bindings.
The parser is one engine; the grammar is data — a JSON / .rawast file you load at runtime. The engine reads text or bytes and produces a JSON-shaped value tree (arrays, dicts, scalars). One engine reads any format, no recompile. The output is queryable without a format-specific API.
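As a rough illustration of "queryable without a format-specific API": once the output is plain arrays, dicts, and scalars, one generic walker serves every format. The sketch below is plain Python and does not call rawast. The helper names `walk` and `find_key` and the sample tree are made up for this example.

```python
from typing import Any, Iterator, List, Tuple

Path = Tuple[Any, ...]


def walk(node: Any, path: Path = ()) -> Iterator[Tuple[Path, Any]]:
    """Yield (path, value) for every node of a JSON-shaped tree, depth-first."""
    yield path, node
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))


def find_key(tree: Any, key: str) -> List[Any]:
    """Collect every value stored under `key`, at any depth."""
    return [value for path, value in walk(tree) if path and path[-1] == key]


# Hypothetical tree, shaped like the output of a structural parser.
tree = {
    "name": "alice",
    "items": [1, 2, 3],
    "nested": [{"name": "bob"}, {"name": "carol", "items": []}],
}

names = find_key(tree, "name")
scalars = [v for _, v in walk(tree) if not isinstance(v, (dict, list))]
print(names)
print(scalars)
```

The same two helpers would work unchanged on a tree parsed from any other format, because nothing in them knows what the keys mean.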
Three properties make this work: it's a structural parser driven by an external grammar; the grammar is itself JSON-shaped data the engine can read with itself (self-hosting); and the engine is bidirectional — the same grammar that parses also re-emits text from a value tree. Binary formats slot in by registering terminal parsers; GDSII — the standard binary format for IC layout — is the worked example.
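To make "the grammar is data" and "bidirectional" concrete, here is a deliberately tiny toy, assuming a line-oriented record format. It is not the .rawast grammar language and not how the rawast engine is implemented. The dict keys `fields`, `field_sep`, and `record_sep` are invented for this sketch. The point is only that one pair of functions, driven by a grammar value, can both parse and re-emit.

```python
from typing import Any, Dict, List

Grammar = Dict[str, Any]
Record = Dict[str, str]


def parse(grammar: Grammar, text: str) -> List[Record]:
    """Split text into records and fields as the grammar value describes."""
    records = []
    for line in text.split(grammar["record_sep"]):
        if not line:
            continue  # tolerate a trailing separator
        values = line.split(grammar["field_sep"])
        if len(values) != len(grammar["fields"]):
            raise ValueError(f"expected {len(grammar['fields'])} fields: {line!r}")
        records.append(dict(zip(grammar["fields"], values)))
    return records


def emit(grammar: Grammar, records: List[Record]) -> str:
    """Inverse of parse: walk the same grammar to rebuild the text."""
    return grammar["record_sep"].join(
        grammar["field_sep"].join(record[name] for name in grammar["fields"])
        for record in records
    )


# Two toy grammars; both are plain JSON-shaped data.
COMMA = {"fields": ["name", "age"], "field_sep": ",", "record_sep": "\n"}
COLON = {"fields": ["name", "age"], "field_sep": ":", "record_sep": "\n"}

source = "alice,30\nbob,25"
tree = parse(COMMA, source)
round_trip = emit(COMMA, tree)   # same grammar: text comes back unchanged
converted = emit(COLON, tree)    # different grammar: cross-format conversion
print(converted)
```

Swapping the grammar value changes the format without touching `parse` or `emit`, which is the property the paragraph above describes at a much larger scale.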
The planned .jast container builds on this: grammar + parsed tree, serialised together in a binary file. "Parse once" — every later consumer reads the value tree directly, never re-parses text, and can still emit the text form because the grammar travels with the data. See docs/ROADMAP.md.
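The "grammar travels with the data" idea can be sketched with the standard library alone. This is only an illustration of the concept: the real .jast layout is still planned, and the `TOY1` marker, the `pack` / `unpack` names, and the JSON-plus-zlib encoding below are all assumptions made for this example.

```python
import json
import zlib
from typing import Any, Tuple

MAGIC = b"TOY1"  # hypothetical marker, not the .jast signature


def pack(grammar: Any, tree: Any) -> bytes:
    """Bundle a grammar and its parsed tree into one compressed blob."""
    payload = json.dumps({"grammar": grammar, "tree": tree}, separators=(",", ":"))
    return MAGIC + zlib.compress(payload.encode("utf-8"))


def unpack(blob: bytes) -> Tuple[Any, Any]:
    """Recover (grammar, tree) without re-parsing any source text."""
    if not blob.startswith(MAGIC):
        raise ValueError("not a toy container")
    doc = json.loads(zlib.decompress(blob[len(MAGIC):]).decode("utf-8"))
    return doc["grammar"], doc["tree"]


# Made-up grammar and tree values, standing in for real parser output.
grammar = {"fields": ["name", "age"], "field_sep": ","}
tree = [{"name": "alice", "age": "30"}]

blob = pack(grammar, tree)
restored_grammar, restored_tree = unpack(blob)
print(restored_tree)
```

A consumer that receives `blob` gets the value tree directly, and because the grammar is in the same file it still has what it needs to re-emit text.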
EDA is the first proving ground because the files are large, the formats are many, and every tool currently reimplements its own reader and writer. The PoC parses 100% of a 3,132-file production corpus across four formats (GDSII / LEF / DEF / Tcl); funding is being sought to turn the PoC into shippable infrastructure.
```shell
python -m venv .venv && source .venv/bin/activate
pip install rawast
```

Compiles the C++ engine from source (no pre-built wheels yet), so it needs C++17 (GCC 7+, Clang 5+, Apple Clang 9+, MSVC 2017+) and CMake 3.20+ on your PATH. Compile takes ~15–20 seconds on a modern laptop. Zero runtime Python dependencies.
For development against the repo, see docs/BUILD.md.
```python
import rawast

g = rawast.Grammar("json")      # bundled grammar by short name
ast = g.parse_string('{"name": "alice", "items": [1, 2, 3]}')
# ast == {"name": "alice", "items": [1, 2, 3]}
text = g.save(ast)              # bytes; works for binary grammars too
issues = g.lint()               # warnings about ambiguous Choices, if any
```

Bundled grammars: Grammar("json"), Grammar("rawast"), Grammar("gdsii"), Grammar("lefdef"), Grammar("tcl"). Load your own with Grammar.load("path/to/my_format.rawast").
Cross-format conversion in three lines:
```python
gdsii = rawast.Grammar("gdsii")
json_g = rawast.Grammar("json")
print(json_g.save(gdsii.parse_file("layout.gds")).decode("utf-8"))
```

CLI:
```shell
rawast --help
rawast parse grammars/json.json file.json
rawast pydantic grammars/lefdef.rawast > models.py   # typed Pydantic v2 models
rawast pycode grammars/lefdef.rawast file.lef \
    --start LEF --models-module models               # Python source that reconstructs the model
```

Full reference: docs/CLI.md.
| Document | What |
|---|---|
| docs/FEATURES.md | All engine capabilities: parsing, save, profiling, Pydantic + pycode, perf wins |
| docs/CLI.md | Every CLI command, every flag, with examples |
| docs/EXAMPLES.md | Worked examples per capability: parse / save, cross-format, Pydantic + pycode, Tcl recursion, GDSII binary, linting, profiling |
| docs/AGENTS.md | Using rawast with LLM tools and agents: why structured AST beats text-pattern matching, what an agent should read to author a grammar, prompt structure |
| docs/GRAMMARS.md | Shipped grammars (GDSII / LEF / DEF / Tcl / JSON / rawast meta) with corpus numbers |
| docs/BUILD.md | Building from source: Python, C++ library, sdist |
| docs/ARCHITECTURE.md | Engine internals: parser groups, use:, ignore policy, subparse, the bidirectional walk |
| docs/ROADMAP.md | Path to 1.0: M1–M4, funding context |
| docs/rawast-format.md | The .rawast grammar language specification |
| examples/ | Runnable scripts |
| SECURITY.md | Vulnerability-reporting policy |
| CONTRIBUTING.md | How to build, test, submit changes |
rawast is the C++ rewrite of an earlier Python prototype (2023–2025) that validated the data-driven grammar approach, the catcher-based value-tree mechanism, and the bidirectional walk. The current implementation is the productionisation of those ideas as a maintained C++17 codebase; most of the commit history here reflects the rewrite phase. Design decisions and the architecture they came from are documented in docs/ and in the prototype's history.
```
include/rawast/     public C++ API headers
src/                engine implementation
grammars/           community-maintained grammars (.rawast and .json)
docs/               language, feature, CLI, grammar, build, architecture, roadmap docs
tests/              doctest-based C++ test suite
python/             Python binding + CLI (nanobind extension module)
  src/native.cc     binding implementation
  rawast/           Python package (CLI in cli.py; docs/schema generators in docs.py / schema.py)
  tests/            pytest suite
examples/           small worked examples (parse → modify → save, etc.)
```
The work outlined in docs/ROADMAP.md is the basis of the NLnet NGI0 Commons funding application. Sponsorship via GitHub Sponsors at https://github.com/sponsors/lanserge is the most direct way to help.
MIT — see LICENSE.
Serge Rabyking · LinkedIn