From 80b6d42dc6dcb2f789464c75176f37e84484594b Mon Sep 17 00:00:00 2001
From: Tomasz Okon
Date: Fri, 12 Sep 2025 23:18:45 +0000
Subject: [PATCH] docs

---
 README.md                | 192 ++++++++++++++++++++++++++++++++++++++-
 docs/AST.md              |   5 +
 docs/ASTVisitor.md       |   5 +
 docs/CodeGenerator.md    |   5 +
 docs/IR.md               |   5 +
 docs/IRGenerator.md      |   5 +
 docs/Lexer.md            |   5 +
 docs/Parser.md           |   5 +
 docs/SemanticAnalyzer.md |   5 +
 docs/Token.md            |   5 +
 10 files changed, 235 insertions(+), 2 deletions(-)
 create mode 100644 docs/AST.md
 create mode 100644 docs/ASTVisitor.md
 create mode 100644 docs/CodeGenerator.md
 create mode 100644 docs/IR.md
 create mode 100644 docs/IRGenerator.md
 create mode 100644 docs/Lexer.md
 create mode 100644 docs/Parser.md
 create mode 100644 docs/SemanticAnalyzer.md
 create mode 100644 docs/Token.md

diff --git a/README.md b/README.md
index b94e42c..6795601 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,191 @@
-# Comp
-Compiler
+````markdown
+# miniC
+miniC is a lightweight compiler for a simplified C-like language. It processes source code through these stages:
+
+1. **Lexing**: Tokenization of input source.
+2. **Parsing**: Building an Abstract Syntax Tree (AST).
+3. **Semantic Analysis**: Type checking and error detection.
+4. **IR Generation**: Intermediate Representation (IR) for optimization.
+5. **Code Generation**: Emitting x86-64 NASM assembly that can be assembled and linked into an executable.
+
+Key features:
+- Modular design with separate components for each stage.
+- Built with CMake for cross-platform builds.
+- Includes unit tests using Google Test.
+- Supports basic C syntax (variables, functions, control flow).
+
+---
+
+## Project Structure
+- [CMakeLists.txt](./CMakeLists.txt): Top-level CMake configuration
+- docs/
+  - [dev.md](./docs/dev.md)
+  - [ASTVisitor.md](./docs/ASTVisitor.md)
+  - [CodeGenerator.md](./docs/CodeGenerator.md)
+  - [IRGenerator.md](./docs/IRGenerator.md)
+  - [IR.md](./docs/IR.md)
+  - [Lexer.md](./docs/Lexer.md)
+  - [Parser.md](./docs/Parser.md)
+  - [SemanticAnalyzer.md](./docs/SemanticAnalyzer.md)
+  - [Token.md](./docs/Token.md)
+- include/
+  - minic/
+    - [AST.hpp](./include/minic/AST.hpp)
+    - [ASTVisitor.hpp](./include/minic/ASTVisitor.hpp)
+    - [CodeGenerator.hpp](./include/minic/CodeGenerator.hpp)
+    - [IRGenerator.hpp](./include/minic/IRGenerator.hpp)
+    - [IR.hpp](./include/minic/IR.hpp)
+    - [Lexer.hpp](./include/minic/Lexer.hpp)
+    - [Parser.hpp](./include/minic/Parser.hpp)
+    - [SemanticAnalyzer.hpp](./include/minic/SemanticAnalyzer.hpp)
+    - [Token.hpp](./include/minic/Token.hpp)
+- [README.md](./README.md): Root README
+- src/
+  - [CMakeLists.txt](./src/CMakeLists.txt)
+  - [CodeGenerator.cpp](./src/CodeGenerator.cpp)
+  - [IRGenerator.cpp](./src/IRGenerator.cpp)
+  - [Lexer.cpp](./src/Lexer.cpp)
+  - [main.cpp](./src/main.cpp)
+  - [Parser.cpp](./src/Parser.cpp)
+  - [SemanticAnalyzer.cpp](./src/SemanticAnalyzer.cpp)
+- tests/
+  - [CMakeLists.txt](./tests/CMakeLists.txt)
+  - [main.cpp](./tests/main.cpp)
+  - [TestAST.cpp](./tests/TestAST.cpp)
+  - [TestExample.cpp](./tests/TestExample.cpp)
+  - [TestIRGenerator.cpp](./tests/TestIRGenerator.cpp)
+  - [TestLexer.cpp](./tests/TestLexer.cpp)
+  - [TestParser.cpp](./tests/TestParser.cpp)
+  - [TestSemanticAnalyzer.cpp](./tests/TestSemanticAnalyzer.cpp)
+
+---
+
+## Example Compilation
+
+### Input (miniC source)
+
+```c
+int main() {
+    int x = 5;
+    if (x > 0) {
+        while (x < 10) {
+            x = x + 1;
+        }
+    }
+    return x;
+}
+````
+
+### Output (NASM x86-64 assembly)
+
+```asm
+section .data
+section .text
+global _start
+_start:
+    call main
+    mov rdi, rax
+    mov rax, 60
+    syscall
+
+main:
+    push rbp
+    mov rbp, rsp
+    sub rsp, 64
+entry_0:
+    mov qword [rbp - 8], 5
+    mov rax, [rbp - 8]
+    mov [rbp - 64], rax
+    mov qword [rbp - 16], 0
+    mov rax, [rbp - 64]
+    cmp rax, [rbp - 16]
+    setg al
+    movzx rax, al
+    mov [rbp - 24], rax
+    mov rax, [rbp - 24]
+    cmp rax, 0
+    je if_else_2
+if_then_1:
+    jmp while_cond_4
+while_cond_4:
+    mov qword [rbp - 32], 10
+    mov rax, [rbp - 64]
+    cmp rax, [rbp - 32]
+    setl al
+    movzx rax, al
+    mov [rbp - 40], rax
+    mov rax, [rbp - 40]
+    cmp rax, 0
+    je while_end_6
+while_body_5:
+    mov qword [rbp - 48], 1
+    mov rax, [rbp - 64]
+    add rax, [rbp - 48]
+    mov [rbp - 56], rax
+    mov rax, [rbp - 56]
+    mov [rbp - 64], rax
+    jmp while_cond_4
+while_end_6:
+    jmp if_end_3
+if_else_2:
+    jmp if_end_3
+if_end_3:
+    mov rax, [rbp - 64]
+    jmp main_epilogue
+main_epilogue:
+    leave
+    ret
+```
+
+---
+
+## Explanation of the NASM Code
+
+1. **Program entry**
+
+   * `_start` is the Linux entry point (instead of relying on libc).
+   * It calls `main`, retrieves the return value in `rax`, and invokes the `exit` system call with that value.
+
+2. **Stack frame setup**
+
+   * `push rbp` / `mov rbp, rsp` establish a base pointer.
+   * `sub rsp, 64` reserves 64 bytes of stack space for local variables and temporaries.
+
+3. **Variable initialization**
+
+   * `int x = 5;` becomes `mov qword [rbp - 8], 5`, and the value is then copied into `[rbp - 64]`, the slot that holds `x`.
+   * miniC uses stack slots for both declared variables and intermediate results.
+
+4. **If condition (`x > 0`)**
+
+   * `cmp` compares the two values and sets the CPU flags.
+   * `setg al` writes 1 to `al` if the left operand is greater (signed comparison) and 0 otherwise; `movzx` zero-extends that byte into `rax`.
+   * The result is stored in another stack slot (`[rbp - 24]`) and compared against 0 to decide whether to jump to `if_else_2`.
+
+5. **While loop (`while (x < 10)`)**
+
+   * Loop condition: compare `x` with `10`.
+   * If true, execution continues into the loop body; otherwise, it jumps to `while_end_6`.
+   * The loop body increments `x` by 1 and jumps back to `while_cond_4`.
+
+6. **Return value**
+
+   * At the end, the function moves the final value of `x` into `rax`, the return register.
+   * `_start` uses this to exit the program with the correct return code.
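
The comparison lowering described in step 4 can be sketched in C++ as below. This is only an illustration of the emitted pattern, not the actual `CodeGenerator` code: the helper names `slot` and `emit_greater_than` are hypothetical, and the offsets are taken from the listing above.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: formats a stack-slot operand such as "[rbp - 24]".
std::string slot(int offset) {
    assert(offset > 0 && "locals and temporaries live below rbp");
    return "[rbp - " + std::to_string(offset) + "]";
}

// Hypothetical helper: emits NASM for `result = (lhs > rhs)`, where all
// three operands are stack slots identified by their offset from rbp.
std::string emit_greater_than(int lhs, int rhs, int result) {
    std::ostringstream out;
    out << "    mov rax, " << slot(lhs) << "\n"     // load the left operand
        << "    cmp rax, " << slot(rhs) << "\n"     // compare, setting CPU flags
        << "    setg al\n"                          // al = 1 if lhs > rhs (signed), else 0
        << "    movzx rax, al\n"                    // zero-extend the byte into rax
        << "    mov " << slot(result) << ", rax\n"; // spill the result to its slot
    return out.str();
}
```

Calling `emit_greater_than(64, 16, 24)` reproduces the five instructions that evaluate `x > 0` in `entry_0`.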
+
+---
+
+## Why is Memory Allocated for Every Statement?
+
+You may notice that **every intermediate computation is written back to the stack** (`rbp - 16`, `rbp - 24`, `rbp - 32`, etc.) instead of keeping values purely in registers.
+
+This happens because:
+
+* **Naive code generation**: miniC currently generates straightforward stack-based code. Each sub-expression or condition is lowered into a temporary slot on the stack. This avoids the complexity of **register allocation**.
+* **Simplicity over optimization**: By storing results explicitly in memory, the compiler ensures correctness without worrying about limited register availability or lifetime analysis.
+* **Debugging clarity**: Using stack slots makes it easy to map AST nodes and IR instructions directly to memory locations, which is useful during development.
+
+In more advanced compilers, an **optimization pass** or a **register allocator** would keep many of these values in CPU registers, drastically reducing memory traffic and improving performance. For now, miniC favors **correctness and simplicity** over efficiency.
+
+---
\ No newline at end of file
diff --git a/docs/AST.md b/docs/AST.md
new file mode 100644
index 0000000..72c16f7
--- /dev/null
+++ b/docs/AST.md
@@ -0,0 +1,5 @@
+### How It Works
+The AST (Abstract Syntax Tree) module represents the parsed structure of miniC source code as a hierarchy of nodes. It uses a base ASTNode class for polymorphism, with Expr as the base for expressions (like literals, identifiers, unary/binary operations) and Stmt as the base for statements (like returns, ifs, whiles, assignments, variable declarations). Specific subclasses hold details: for instance, IntLiteral stores an integer value, BinaryExpr links left/right subexpressions with an operator token type, and VarDeclStmt includes type, name, and optional initializer. The Function class groups parameters (via a simple Parameter struct) and body statements, while the top-level Program holds all functions. Nodes are connected via unique pointers for ownership, allowing recursive tree building without cycles. Dynamic casting is used downstream for type-specific handling, but the structure itself is lightweight and focused on syntax representation.
+
+### Example of Use
+After parsing source code, the AST is built by creating nodes like an IntLiteral for a number, wrapping it in a BinaryExpr for addition with an Identifier, then placing that in an AssignStmt for a variable, and finally enclosing it in a Function's body under a Program. This tree can then be traversed by a visitor to perform analysis or generation, such as checking types or emitting IR for a simple expression like "x = 1 + 2;".
diff --git a/docs/ASTVisitor.md b/docs/ASTVisitor.md
new file mode 100644
index 0000000..338509b
--- /dev/null
+++ b/docs/ASTVisitor.md
@@ -0,0 +1,5 @@
+### How It Works
+The ASTVisitor class provides an interface for the Visitor design pattern to traverse the AST without altering the nodes. It defines pure virtual methods to visit the Program (overall structure), Function (with params and body), Stmt (generic statements), and Expr (generic expressions). Derived classes override these to implement specific logic, using const references to nodes for read-only access. Inside overrides, downcasting to specific subclasses (like ReturnStmt or BinaryExpr) allows targeted processing. This separates traversal from node structure, enabling multiple passes (e.g., one for semantics, another for IR) on the same tree. The virtual destructor ensures proper cleanup for derived visitors.
+
+### Example of Use
+Instantiate a derived class like a semantic checker that overrides visit methods to inspect types in expressions and report errors in statements. Then, call visit on the Program root, which recursively calls visits on functions, their statements, and subexpressions, allowing the visitor to build a symbol table while traversing a full program with multiple functions and control flows.
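
The visitor shape described above can be sketched as follows. The node classes here are simplified stand-ins, not the real declarations from `minic/AST.hpp`, and `ReturnCounter` and `make_sample_program` are hypothetical names used only for illustration. The sketch assumes C++14 for `std::make_unique`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for the AST nodes (assumptions, not the miniC headers).
struct Expr { virtual ~Expr() = default; };
struct IntLiteral : Expr {
    int value;
    explicit IntLiteral(int v) : value(v) {}
};
struct Stmt { virtual ~Stmt() = default; };
struct ReturnStmt : Stmt {
    std::unique_ptr<Expr> value;  // may be null for a bare `return;`
    explicit ReturnStmt(std::unique_ptr<Expr> v) : value(std::move(v)) {}
};
struct Function {
    std::string name;
    std::vector<std::unique_ptr<Stmt>> body;
};
struct Program { std::vector<std::unique_ptr<Function>> functions; };

// Visitor interface: pure virtual methods taking const references.
struct ASTVisitor {
    virtual ~ASTVisitor() = default;
    virtual void visit(const Program&) = 0;
    virtual void visit(const Function&) = 0;
    virtual void visit(const Stmt&) = 0;
    virtual void visit(const Expr&) = 0;
};

// Hypothetical pass: counts return statements and sums returned literals.
struct ReturnCounter : ASTVisitor {
    int returns = 0;
    int literal_sum = 0;

    void visit(const Program& program) override {
        for (const auto& fn : program.functions) {
            assert(fn != nullptr);
            visit(*fn);
        }
    }
    void visit(const Function& fn) override {
        for (const auto& stmt : fn.body) visit(*stmt);
    }
    void visit(const Stmt& stmt) override {
        // Downcast inside the override to handle one specific statement kind.
        if (const auto* ret = dynamic_cast<const ReturnStmt*>(&stmt)) {
            ++returns;
            if (ret->value) visit(*ret->value);
        }
    }
    void visit(const Expr& expr) override {
        if (const auto* lit = dynamic_cast<const IntLiteral*>(&expr)) {
            literal_sum += lit->value;
        }
    }
};

// Builds the tree for `int main() { return 42; }` by hand.
Program make_sample_program() {
    auto fn = std::make_unique<Function>();
    fn->name = "main";
    fn->body.push_back(std::make_unique<ReturnStmt>(std::make_unique<IntLiteral>(42)));
    Program program;
    program.functions.push_back(std::move(fn));
    return program;
}
```

Because the traversal lives in the visitor, a second pass (for example IR generation) could reuse the same tree without changes to the node classes.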
diff --git a/docs/CodeGenerator.md b/docs/CodeGenerator.md
new file mode 100644
index 0000000..8e780ac
--- /dev/null
+++ b/docs/CodeGenerator.md
@@ -0,0 +1,5 @@
+### How It Works
+The CodeGenerator class takes an IRProgram and translates it into textual NASM assembly code for x86-64. It processes each function by allocating stack space for variables and parameters, emitting a function prologue (setting up the stack frame), handling parameter passing via registers (like rdi for the first param), and emitting instructions for each basic block. For each IR instruction, it generates corresponding assembly lines (e.g., converting an ADD operation to an addition in rax). It manages labels for control flow, infers branch targets for loops and conditionals, and adds an epilogue to clean up the stack. If an output file is specified, it writes there; otherwise, it uses the provided stream. Debug messages trace the process, and it throws errors for unsupported operations or file issues. Stack alignment is ensured to 16 bytes for ABI compliance.
+
+### Example of Use
+To use it, create an instance with an output stream, then call generate on a populated IRProgram, optionally providing a filename like "output.asm". The result is assembly code that can be assembled and linked into an executable, such as emitting a simple main function that adds two numbers and returns the result via syscall exit.
diff --git a/docs/IR.md b/docs/IR.md
new file mode 100644
index 0000000..f03b2b1
--- /dev/null
+++ b/docs/IR.md
@@ -0,0 +1,5 @@
+### How It Works
+The IR (Intermediate Representation) module structures compiled code as a platform-independent format using three-address instructions. The IROpcode enum lists operations like arithmetic (ADD, SUB), comparisons (EQ, LT), assignments (ASSIGN), memory access (LOAD, STORE), control flow (JUMP, JUMPIF), returns, and labels. An IRInstruction holds an opcode plus up to two operands and a result (for temps or labels). BasicBlock groups instructions under a unique label for control flow units. IRFunction encapsulates a function's name, return type, parameters, and owned basic blocks. The top-level IRProgram owns all functions. This setup allows linear scanning for optimizations and easy translation to assembly, with strings for variable/temporary names and vectors for collections.
+
+### Example of Use
+From an AST, generate an IRProgram by creating IRInstructions for operations (e.g., ASSIGN for variable init, ADD for binary plus), grouping them into labeled BasicBlocks for conditionals (like then/else for if), assembling blocks into an IRFunction for main, and adding it to the IRProgram. This IR can then be passed to a code generator to produce assembly for a loop that increments a counter until a condition.
\ No newline at end of file
diff --git a/docs/IRGenerator.md b/docs/IRGenerator.md
new file mode 100644
index 0000000..8ea2a9f
--- /dev/null
+++ b/docs/IRGenerator.md
@@ -0,0 +1,5 @@
+### How It Works
+The IRGenerator class, inheriting from ASTVisitor, walks the AST to build an IRProgram by emitting instructions during traversal. It starts with generate on the Program, creating an IRProgram and visiting each Function to make an IRFunction with an entry BasicBlock, mapping parameters to variables, and clearing counters for temps/labels. For statements, it dispatches: variable declarations assign initializers if present, assignments compute values and store, returns emit RETURN ops, ifs create then/else/end blocks with conditional jumps, and whiles set up cond/body/end with loops. Expressions are handled recursively in generate_expr, producing temps for literals (direct assign), identifiers (lookup map), unaries (NEG/NOT), and binaries (map token ops to IROpcode like PLUS to ADD). It uses counters for unique temps ("tN") and labels (prefixed_N), a map for variable tracking, and emit to append instructions to the current block. Throws on unsupported nodes.
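
The temp counter, label counter, and `emit` mechanism described above can be sketched roughly as below. `TinyIRBuilder`, its member names, the field layout of `IRInstruction`, and the choice to start both counters at 0 are assumptions for illustration. The real IRGenerator walks AST nodes rather than taking operand names directly.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Reduced opcode set; the real IROpcode enum has more entries.
enum class IROpcode { ASSIGN, ADD };

// Three-address instruction: result = arg1 op arg2 (arg2 empty when unused).
struct IRInstruction {
    IROpcode op;
    std::string result;
    std::string arg1;
    std::string arg2;
};

// Hypothetical builder showing unique temps ("tN"), labels ("prefix_N"),
// and emit() appending to the current block.
class TinyIRBuilder {
public:
    std::string new_temp() { return "t" + std::to_string(temp_counter_++); }

    std::string new_label(const std::string& prefix) {
        return prefix + "_" + std::to_string(label_counter_++);
    }

    void emit(IRInstruction inst) { block_.push_back(std::move(inst)); }

    // A literal is materialized by assigning it to a fresh temp.
    std::string gen_literal(int value) {
        std::string temp = new_temp();
        emit({IROpcode::ASSIGN, temp, std::to_string(value), ""});
        return temp;
    }

    // A binary plus combines two already-computed operands into a new temp.
    std::string gen_add(const std::string& lhs, const std::string& rhs) {
        assert(!lhs.empty() && !rhs.empty());
        std::string temp = new_temp();
        emit({IROpcode::ADD, temp, lhs, rhs});
        return temp;
    }

    const std::vector<IRInstruction>& block() const { return block_; }

private:
    int temp_counter_ = 0;
    int label_counter_ = 0;
    std::vector<IRInstruction> block_;
};
```

Lowering `x + 1` with this sketch would call `gen_literal(1)` and then `gen_add("x", ...)`, leaving two instructions in the current block.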
+
+### Example of Use
+Call generate on a Program AST to produce an IRProgram; for a function with an if statement checking a condition and assigning in branches, it creates separate blocks, emits JUMPIFNOT to skip else, generates expr temps for the condition, and jumps to end labels, resulting in structured IR ready for code generation like translating a conditional assignment into branched assembly.
\ No newline at end of file
diff --git a/docs/Lexer.md b/docs/Lexer.md
new file mode 100644
index 0000000..879bf84
--- /dev/null
+++ b/docs/Lexer.md
@@ -0,0 +1,5 @@
+### How It Works
+The Lexer class tokenizes miniC source code by scanning a string input character by character. It maintains position, line, and column trackers, skipping whitespace and comments (single-line // or multi-line /* */). It identifies tokens like keywords (e.g., int, if), identifiers (alphanumeric with underscore), integer literals (digits), string literals (quoted, with escapes like \n, \t), operators (e.g., +, ==, <=), punctuation (e.g., {, ;), and special tokens like newline or EOF. For strings, it handles escapes and throws errors for unclosed quotes or invalid escapes. Numbers are parsed as integers, throwing on invalid formats. The main Lex method collects all tokens into a vector, adding an EOF at the end. Recursive calls handle skipped elements like comments.
+
+### Example of Use
+Initialize with source code like "int main() { return 42; }", then call Lex to get a vector of tokens: starting with KEYWORD_INT, IDENTIFIER "main", LPAREN, RPAREN, LBRACE, KEYWORD_RETURN, LITERAL_INT 42, SEMICOLON, RBRACE, and EOF. This output can feed into a parser for a simple main function returning a constant.
diff --git a/docs/Parser.md b/docs/Parser.md
new file mode 100644
index 0000000..61fe2a7
--- /dev/null
+++ b/docs/Parser.md
@@ -0,0 +1,5 @@
+### How It Works
+The Parser class builds an AST from tokens using recursive descent. It tracks current position, peeking/advancing/consuming tokens, and throws on mismatches. The parse method loops over functions to create a Program. Functions parse return type (int/void/str), name, parameters (type-name pairs), and block body. Blocks collect statements until }. Statements include var decls (type name [= expr];), assignments (id = expr;), returns (return [expr];), ifs (if (expr) block [else block]), whiles (while (expr) block). Expressions handle precedence: comparisons (==, !=, <, etc.), terms (+, -), factors (*, /), primaries (literals, ids, parens, unaries like ! or -). Synchronization skips to semicolons on errors. Parameters are comma-separated type-name.
+
+### Example of Use
+Feed tokens from "int add(int a, int b) { return a + b; }" into parse to get a Program with one Function "add" (int return, params a/b as int), body as ReturnStmt with BinaryExpr (IDENTIFIER "a" OP_PLUS IDENTIFIER "b"), ready for semantic analysis.
\ No newline at end of file
diff --git a/docs/SemanticAnalyzer.md b/docs/SemanticAnalyzer.md
new file mode 100644
index 0000000..297c00e
--- /dev/null
+++ b/docs/SemanticAnalyzer.md
@@ -0,0 +1,5 @@
+### How It Works
+The SemanticAnalyzer class, deriving from ASTVisitor, checks the AST for correctness by traversing nodes and enforcing rules. It uses a stack of symbol tables for scopes (pushed/popped for functions/blocks) and a global function map. For programs, it detects function redefinitions and visits each function, setting its return type. In functions, it declares parameters and visits body statements. For statements, it checks variable declarations (no redeclares, no void types, initializer type match), assignments (declared var, type match), returns (type matches function), ifs/whiles (int condition, visits branches/body). Expressions are validated: identifiers must be declared, binaries/unaries check operand types (e.g., arithmetic needs ints). It infers types for literals/identifiers/binaries and throws SemanticError on issues like undeclared vars or mismatches.
+
+### Example of Use
+After parsing, create an instance and call visit on the Program AST for a function with an int declaration, assignment, and return; it verifies the initializer matches int, the assigned value matches the var type, and the return matches the function type, throwing if a string is assigned to an int var.
diff --git a/docs/Token.md b/docs/Token.md
new file mode 100644
index 0000000..c88b8ea
--- /dev/null
+++ b/docs/Token.md
@@ -0,0 +1,5 @@
+### How It Works
+The Token struct represents individual lexer outputs with a TokenType enum for categories like keywords (int, void, str, if, else, while, return), identifiers, literals (int, string), operators (plus, minus, multiply, divide, assign, equal, not, not equal, less, greater, less eq, greater eq), punctuation (lparen, rparen, lbrace, rbrace, colon, comma, semicolon), newline, and EOF. It stores a variant value (int for numbers, string for identifiers/keywords/strings), plus line and column for error reporting. No methods; it's a simple data holder for passing to parsers.
+
+### Example of Use
+In lexing "if (x == 1)", tokens include KEYWORD_IF, LPAREN, IDENTIFIER "x", OP_EQUAL, LITERAL_INT 1, RPAREN, allowing the parser to build an if condition expression from these structured elements.
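
A data holder with the shape described for Token might look like the sketch below, assuming C++17 for `std::variant`. The field names, the reduced `TokenType` enum, and the `describe` helper are hypothetical and may not match `minic/Token.hpp`.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Reduced token categories, limited to those in the "if (x == 1)" example.
enum class TokenType { KEYWORD_IF, LPAREN, IDENTIFIER, OP_EQUAL, LITERAL_INT, RPAREN };

// Plain data holder: category, payload, and source position.
struct Token {
    TokenType type;
    std::variant<int, std::string> value;  // int for numbers, string otherwise
    int line;
    int column;
};

// Hypothetical helper: renders a token's payload and position for diagnostics.
std::string describe(const Token& tok) {
    assert(tok.line >= 1 && tok.column >= 1);
    std::string text = std::holds_alternative<int>(tok.value)
                           ? std::to_string(std::get<int>(tok.value))
                           : std::get<std::string>(tok.value);
    return text + " at " + std::to_string(tok.line) + ":" + std::to_string(tok.column);
}
```

Storing the payload in a variant lets a parser read an integer literal without re-parsing text, while identifiers and strings keep their spelling.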