This project has been created as part of the 42 curriculum by kacherch.
call-me-maybe is a function calling system that translates natural language prompts into structured, executable function calls using a small language model (Qwen3-0.6B). The project demonstrates how constrained decoding can force a 0.6B-parameter model to produce valid JSON on every run.
Instead of relying on prompting alone (which achieves only ~30% accuracy with small models), this implementation uses token-level constraints to guarantee syntactically valid and schema-compliant output every time.
Input:
"What is the sum of 40 and 2?"
Output:
{
  "name": "fn_add_numbers",
  "parameters": {"a": 40.0, "b": 2.0}
}
The system doesn't answer "42". It provides the tools to solve it: the correct function name and properly typed arguments.
- Python 3.13+
- uv package manager (recommended) or pip
- ~5GB disk space for model download
# Clone the repository
git clone <your-repo-url>
cd call-me-maybe
# Install dependencies
make install
# or manually:
uv sync
make run
# or
uv run python -m src
This reads from:
- data/input/functions_definition.json (available functions)
- data/input/function_calling_tests.json (prompts to process)
And writes to:
- data/output/function_calls.json (structured results)
uv run python -m src \
--functions_definition custom/functions.json \
--input custom/prompts.json \
--output results/output.json
make debug
# or
uv run python -m pdb -m src
make lint # flake8 + mypy
make lint-strict # mypy strict mode
make clean
The implementation uses a two-phase constrained decoding approach:
- Function Selection — Token-by-token generation constrained to valid function names
- Argument Extraction — Smart extraction and constrained generation based on parameter types
Instead of hoping the model generates a valid function name, we guide it token by token:
Generated so far: "fn_"
Reachable functions: ["fn_add_numbers", "fn_greet", "fn_reverse_string"]
Valid next tokens: only tokens that keep at least one function reachable
Generated so far: "fn_add"
Reachable functions: ["fn_add_numbers"]
Valid next tokens: tokens that are a prefix of the remaining "_numbers" (for example "_" or "_n")
Result: guaranteed valid function name
This is done by:
- Building a prompt that lists all available functions
- Generating one token at a time
- Filtering logits to only allow tokens that maintain at least one valid function as reachable
- Stopping when an exact match is found
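The filtering step can be sketched without a model. This is a minimal illustration, not the project's actual decoder: the helper names (`reachable`, `allowed_tokens`, `select_function`), the toy vocabulary, and the `score` callback standing in for model logits are all assumptions made for the example. It also assumes no function name is a prefix of another, so stopping on the first exact match is safe.

```python
from typing import Callable


def reachable(generated: str, function_names: list[str]) -> list[str]:
    """Function names that can still be completed from the text so far."""
    return [name for name in function_names if name.startswith(generated)]


def allowed_tokens(generated: str, function_names: list[str],
                   vocab: list[str]) -> list[str]:
    """Tokens that keep at least one function name reachable."""
    return [tok for tok in vocab
            if tok and reachable(generated + tok, function_names)]


def select_function(function_names: list[str], vocab: list[str],
                    score: Callable[[str, str], float]) -> str:
    """Greedy decoding restricted to allowed tokens.

    `score(prefix, token)` is a stand-in for the model's logit for `token`.
    """
    generated = ""
    while generated not in function_names:
        candidates = allowed_tokens(generated, function_names, vocab)
        if not candidates:
            raise ValueError(f"no token can extend {generated!r}")
        generated += max(candidates, key=lambda tok: score(generated, tok))
    return generated


# Toy data (hypothetical vocabulary, not the real tokenizer's).
NAMES = ["fn_add_numbers", "fn_greet", "fn_reverse_string"]
VOCAB = ["fn", "_", "add", "_n", "umbers", "greet",
         "reverse", "_string", "hello", "numbers"]


def toy_score(prefix: str, token: str) -> float:
    # Pretend the model prefers "add" whenever it is allowed.
    return 1.0 if token == "add" else 0.0


if __name__ == "__main__":
    print(allowed_tokens("fn_add", NAMES, VOCAB))
    print(select_function(NAMES, VOCAB, toy_score))
```

Because every candidate token extends a prefix of some real function name, the loop can only terminate on a name from the list, whatever the scores are.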
Different strategies based on parameter type:
Number parameters: Extract values directly from the prompt, in order of appearance:
"What is the sum of 265 and 345?"
→ numbers_seen = [265.0, 345.0]
→ a = 265.0, b = 345.0
Fallback: if there are more number parameters than numbers in the prompt, use constrained token generation (digits, dot, and minus sign only).
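A minimal sketch of the in-order assignment, assuming a simple regular expression for numbers. The names `NUMBER_RE` and `assign_numbers` are hypothetical and the real extraction rules may differ. Parameters left without a value are returned separately so a caller could hand them to the constrained-generation fallback.

```python
import re

# Optional minus sign, digits, optional fractional part (an assumed pattern).
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def assign_numbers(prompt: str,
                   number_params: list[str]) -> tuple[dict[str, float], list[str]]:
    """Assign numbers found in the prompt to parameters, in order.

    Returns the assigned values and the parameters that got no value.
    """
    numbers_seen = [float(text) for text in NUMBER_RE.findall(prompt)]
    assigned = dict(zip(number_params, numbers_seen))
    missing = number_params[len(numbers_seen):]
    return assigned, missing


if __name__ == "__main__":
    print(assign_numbers("What is the sum of 265 and 345?", ["a", "b"]))
```

Extra numbers in the prompt are ignored, because `zip` stops at the shorter of the two sequences.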
Single string parameter: Extract from prompt using regex
"Greet shrek" → name = "shrek"
"Reverse 'hello'" → s = "hello"
Multiple string parameters: Smart positional and semantic extraction
"Replace all numbers in "Hello 34..." with NUMBERS"
→ source_string = "Hello 34 I'm 233 years old" (longest quoted)
→ regex = "\d+" ("all numbers" → pattern)
→ replacement = "NUMBERS" (after "with")
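Keyword rules of this kind could be sketched as follows. This is an illustration only: `detect_regex` and `detect_replacement` are hypothetical names, and the real heuristics may be ordered or worded differently.

```python
import re
from typing import Optional


def detect_regex(prompt: str) -> Optional[str]:
    """Map a phrase in the prompt to a regex pattern, or None if no rule fires."""
    lowered = prompt.lower()
    if "all numbers" in lowered:
        return r"\d+"
    if "all vowels" in lowered:
        return "[aeiouAEIOU]"
    match = re.search(r"word\s+'([^']+)'", prompt, flags=re.IGNORECASE)
    if match:
        return match.group(1)  # the literal word becomes the pattern
    return None


def detect_replacement(prompt: str) -> Optional[str]:
    """Take the replacement from the text that follows "with"."""
    if "with asterisks" in prompt.lower():
        return "*"
    match = re.search(r"\bwith\s+(\S+)", prompt)
    return match.group(1).strip("'\"") if match else None


if __name__ == "__main__":
    prompt = "Replace all numbers in 'Hello 34 I'm 233 years old' with NUMBERS"
    pattern = detect_regex(prompt)
    replacement = detect_replacement(prompt)
    print(re.sub(pattern, replacement, "Hello 34 I'm 233 years old"))
```

Rules like these are easy to audit but only cover the phrasings they list, so a prompt that matches no rule should fall through to another strategy.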
Pattern detection rules:
- "all numbers" → \d+
- "all vowels" → [aeiouAEIOU]
- "word 'X'" → X
- "with asterisks" → *
Boolean parameters: Constrain logits to only the true or false tokens.
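Restricting logits to two tokens can be sketched with plain lists. The token ids below are made up for the example, and `mask_logits` and `pick_boolean` are hypothetical helpers; a real implementation would look the ids up in the tokenizer and operate on the model's logit tensor.

```python
import math


def mask_logits(logits: list[float], allowed_ids: set[int]) -> list[float]:
    """Set every logit outside `allowed_ids` to negative infinity."""
    return [value if index in allowed_ids else -math.inf
            for index, value in enumerate(logits)]


def pick_boolean(logits: list[float], true_id: int, false_id: int) -> bool:
    """Choose between the `true` and `false` tokens only."""
    masked = mask_logits(logits, {true_id, false_id})
    best = max(range(len(masked)), key=masked.__getitem__)
    return best == true_id


if __name__ == "__main__":
    # Token 3 has the highest raw logit but is not allowed.
    print(pick_boolean([5.0, 0.1, 0.7, 9.0], true_id=1, false_id=2))
```

Masking with negative infinity leaves the relative order of the allowed tokens unchanged, so the model's own preference between true and false still decides the value.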
Small models struggle with structured output because they do not reliably produce strictly well-formed syntax. By removing invalid options at each generation step, we force the model to stay on track without requiring it to "know" JSON syntax perfectly.
Key insight: guidance beats capability. On structured output, a 0.6B model with constraints can be more reliable than a 7B model with prompting alone.
Prompting alone achieves ~30% accuracy with small models. Constrained decoding achieves ~100% by making invalid outputs literally impossible.
The Qwen3-0.6B model is prone to hallucination and repetition loops when generating free-form strings. Extracting values that are already in the prompt is:
- More reliable
- Faster
- Aligned with how humans naturally phrase requests
Quoted strings are parsed so that apostrophes are handled correctly ("I'm" should not be split into "I" and "m").
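One way to keep apostrophes inside a quoted string is to accept a single quote as a delimiter only when it is not glued to a word character. The sketch below assumes that rule; `QUOTE_PATTERNS`, `quoted_strings`, and `longest_quoted` are hypothetical names, not the project's own.

```python
import re
from typing import Optional

QUOTE_PATTERNS = [
    # Opening ' must not follow a word character; closing ' must not precede one.
    re.compile(r"(?<!\w)'(.+?)'(?!\w)"),
    re.compile(r'"(.+?)"'),
]


def quoted_strings(prompt: str) -> list[str]:
    """All quoted substrings, single-quoted first, then double-quoted."""
    found: list[str] = []
    for pattern in QUOTE_PATTERNS:
        found.extend(match.group(1) for match in pattern.finditer(prompt))
    return found


def longest_quoted(prompt: str) -> Optional[str]:
    """The longest quoted substring, or None when nothing is quoted."""
    candidates = quoted_strings(prompt)
    return max(candidates, key=len) if candidates else None


if __name__ == "__main__":
    print(longest_quoted(
        "Replace all numbers in 'Hello 34 I'm 233 years old' with 'N'"))
```

The apostrophe in "I'm" is followed by a letter, so it cannot close the quote, and the match continues to the quote after "old".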
The vocabulary filtering happens in three layers:
- Structural validity — tokens that maintain JSON syntax
- Schema compliance — tokens that match the expected type
- Semantic relevance — tokens that make sense for this specific parameter
- Function selection: ~95%+ on provided test set
- Argument extraction: ~90%+ for simple cases, ~80%+ for complex multi-parameter strings
- JSON validity: 100% (guaranteed by constrained decoding)
- Model load time: ~5-10 seconds (one-time)
- Per-prompt processing: ~2-5 seconds on CPU
- Total for 11 test prompts: ~20-30 seconds
The constrained decoding approach ensures that:
- Every output is valid JSON
- Every function name is from the available set
- Every parameter type matches its schema definition
- No hallucinated keys or extra fields
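These guarantees can also be checked after the fact. The following is a small standalone checker written for illustration. It is not the project's Pydantic models, and `validate_call` and `TYPE_CHECKS` are hypothetical names. It assumes the definition format shown in this README, with the type names "number", "string", and "boolean".

```python
from typing import Any, Callable

TYPE_CHECKS: dict[str, Callable[[Any], bool]] = {
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
}


def validate_call(call: dict[str, Any],
                  definitions: list[dict[str, Any]]) -> list[str]:
    """Return a list of problems; an empty list means the call is valid."""
    by_name = {definition["name"]: definition for definition in definitions}
    definition = by_name.get(call.get("name"))
    if definition is None:
        return [f"unknown function: {call.get('name')!r}"]

    expected = definition["parameters"]
    actual = call.get("parameters", {})
    errors = [f"unexpected key: {key}"
              for key in sorted(actual.keys() - expected.keys())]
    for key, spec in expected.items():
        check = TYPE_CHECKS.get(spec["type"])
        if key not in actual:
            errors.append(f"missing key: {key}")
        elif check is None or not check(actual[key]):
            errors.append(f"wrong type for {key}: expected {spec['type']}")
    return errors


DEFINITIONS = [{
    "name": "fn_add_numbers",
    "description": "Add two numbers together",
    "parameters": {"a": {"type": "number"}, "b": {"type": "number"}},
    "returns": {"type": "number"},
}]

if __name__ == "__main__":
    print(validate_call(
        {"name": "fn_add_numbers", "parameters": {"a": 2.0, "b": 3.0}},
        DEFINITIONS))
```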
Problem: The model would generate repetitive patterns like "shrek" Answer: Function: fn_greet name: shrek Answer: ... and never stop.
Solution: Added stop-token detection and switched to extraction-first strategy for strings.
Problem: All string parameters received the same extracted value.
Solution: Implemented parameter-specific extraction heuristics based on parameter name semantics (source_string, regex, replacement).
Problem: The second number was extracted incorrectly: a = 2.0, b = 0.0 instead of a = 2.0, b = 3.0.
Solution: Extract all numbers from the prompt first, then assign them in order to number-typed parameters.
Problem: Token strings include prefix characters that mark a leading space (Ġ in byte-level BPE, ▁ in SentencePiece), and these needed to be stripped.
Solution: Implemented _clean_token() utility to normalize token strings before comparison.
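A guess at what such a normalizer might look like, for illustration only. The project's `_clean_token()` is not shown here, so the `clean_token` function below and its exact behavior (dropping the markers instead of turning them into spaces) are assumptions.

```python
# U+0120 (Ġ) marks a leading space in byte-level BPE vocabularies;
# U+2581 (▁) plays the same role in SentencePiece vocabularies.
SPACE_MARKERS = "\u0120\u2581"


def clean_token(token: str) -> str:
    """Remove leading space-marker characters from a raw vocabulary token."""
    return token.lstrip(SPACE_MARKERS)


if __name__ == "__main__":
    print([clean_token(t) for t in ["\u0120add", "\u2581add", "add"]])
```

With the markers removed, a token such as "Ġadd" compares equal to the plain text "add" when checking whether it extends a function name.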
Created diverse test cases covering:
- Simple arithmetic ("sum of 2 and 3")
- String operations ("reverse 'hello'")
- Complex regex substitutions
- Edge cases (apostrophes, multiple quoted strings)
- Run on provided test set
- Inspect JSON output for validity
- Verify function names match expected
- Check argument types and values
- Test with modified/custom function definitions
- Struggles with ambiguous prompts that could map to multiple functions
- May fail on prompts with unusual phrasing not covered by extraction heuristics
- Regex pattern detection is rule-based, not exhaustive
data/input/functions_definition.json:
[
  {
    "name": "fn_add_numbers",
    "description": "Add two numbers together",
    "parameters": {
      "a": {"type": "number"},
      "b": {"type": "number"}
    },
    "returns": {"type": "number"}
  }
]
data/input/function_calling_tests.json:
[
{"prompt": "What is the sum of 2 and 3?"}
]
make run
data/output/function_calls.json:
[
  {
    "prompt": "What is the sum of 2 and 3?",
    "name": "fn_add_numbers",
    "parameters": {"a": 2.0, "b": 3.0}
  }
]
call-me-maybe/
├── src/
│ ├── __init__.py # Package marker
│ ├── __main__.py # Entry point + CLI argument parsing
│ ├── config.py # Default paths configuration
│ ├── models.py # Pydantic models for validation
│ └── decoder.py # Core constrained decoding logic
├── llm_sdk/ # Provided LLM wrapper (not modified)
├── data/
│ ├── input/ # Input JSON files
│ └── output/ # Generated results (not in git)
├── pyproject.toml # Dependencies and project metadata
├── Makefile # Build automation
└── README.md # This file
- Constrained Decoding for Language Models
- Outlines: Structured Text Generation
- Function Calling in Large Language Models
- Documentation: Drafting docstrings and README sections
This project is part of the 42 school curriculum and follows its academic guidelines.