call-me-maybe

This project has been created as part of the 42 curriculum by kacherch.

Description

call-me-maybe is a function-calling system that translates natural-language prompts into structured, executable function calls using a small language model (Qwen3-0.6B). The project demonstrates how constrained decoding can force a 0.6B-parameter model to produce valid JSON on every run.

Instead of relying on prompting alone (which achieves only ~30% accuracy with small models), this implementation uses token-level constraints to guarantee syntactically valid and schema-compliant output every time.

Example

Input:

"What is the sum of 40 and 2?"

Output:

{
  "name": "fn_add_numbers",
  "parameters": {"a": 40.0, "b": 2.0}
}

The system doesn't answer "42" — it provides the tools to solve it: the correct function name and properly typed arguments.

Instructions

Prerequisites

Python 3.13+
uv package manager (recommended) or pip
~5GB disk space for model download

Installation

# Clone the repository
git clone <your-repo-url>
cd call-me-maybe

# Install dependencies
make install
# or manually:
uv sync

Usage

Basic usage (default paths)

make run
# or
uv run python -m src

This reads from:

data/input/functions_definition.json (available functions)
data/input/function_calling_tests.json (prompts to process)

And writes to:

data/output/function_calls.json (structured results)

Custom paths

uv run python -m src \
  --functions_definition custom/functions.json \
  --input custom/prompts.json \
  --output results/output.json

Debug mode

make debug
# or
uv run python -m pdb -m src

Code quality checks

make lint         # flake8 + mypy
make lint-strict  # mypy strict mode

Clean temporary files

make clean

Algorithm Explanation

Overview

The implementation uses a two-phase constrained decoding approach:

Function Selection — Token-by-token generation constrained to valid function names
Argument Extraction — Smart extraction and constrained generation based on parameter types

Phase 1: Function Name Selection

Instead of hoping the model generates a valid function name, we guide it token by token, checking each candidate token against the function names that are still reachable:

Generated so far: "fn_"
Reachable functions: ["fn_add_numbers", "fn_greet", "fn_reverse_string"]
Valid next tokens: only tokens that keep at least one function reachable

Generated so far: "fn_add"
Reachable functions: ["fn_add_numbers"]
Valid next tokens: tokens that are a prefix of the remaining text "_numbers" (such as "_", "_n", or "_numbers")

Result: guaranteed valid function name

This is done by:

Building a prompt that lists all available functions
Generating one token at a time
Filtering logits to only allow tokens that maintain at least one valid function as reachable
Stopping when an exact match is found
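
The steps above can be sketched in a few lines. This is a minimal illustration, not the project's decoder: `allowed_tokens`, `select_function`, the toy vocabulary, and the `score` callback (standing in for the model's logits) are all hypothetical names introduced here.

```python
from typing import Callable


def allowed_tokens(prefix: str, names: list[str], vocab: dict[int, str]) -> list[int]:
    """Return ids of tokens that keep at least one function name reachable."""
    return [
        token_id
        for token_id, text in vocab.items()
        if text and any(name.startswith(prefix + text) for name in names)
    ]


def select_function(
    names: list[str],
    vocab: dict[int, str],
    score: Callable[[str, int], float],
) -> str:
    """Greedily build a function name, one allowed token at a time."""
    prefix = ""
    while prefix not in names:  # stop as soon as an exact match is found
        candidates = allowed_tokens(prefix, names, vocab)
        if not candidates:
            raise ValueError(f"no token can extend {prefix!r}")
        # Hypothetical scorer: a real decoder would use the model's logits here.
        best = max(candidates, key=lambda token_id: score(prefix, token_id))
        prefix += vocab[best]
    return prefix


# Toy vocabulary and fixed scores, purely for illustration.
vocab = {0: "fn", 1: "_", 2: "add", 3: "_numbers", 4: "greet", 5: "hello", 6: "_n"}
fixed_scores = {0: 0.1, 1: 0.2, 2: 0.9, 3: 0.3, 4: 0.5, 5: 5.0, 6: 0.1}
names = ["fn_add_numbers", "fn_greet"]

chosen = select_function(names, vocab, lambda prefix, token_id: fixed_scores[token_id])
print(chosen)  # fn_add_numbers
```

Note that the token "hello" has the highest raw score but is never chosen, because it cannot extend any valid name. Since generation stops at the first exact match, a name that is a strict prefix of another name would always win under this sketch; that case would need an explicit end-of-name decision.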

Phase 2: Argument Extraction

Different strategies based on parameter type:

Numbers

Extract directly from the prompt in order of appearance:

"What is the sum of 265 and 345?"
→ numbers_seen = [265.0, 345.0]
→ a = 265.0, b = 345.0

Fallback: if the function has more number parameters than there are numbers in the prompt, the remaining values come from constrained token generation (digits, dot, and minus sign only).
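
A minimal sketch of the in-order assignment, assuming a simple regular expression is enough to find the numbers. `NUMBER_RE` and `assign_numbers` are hypothetical names; the second return value lists the parameters that would still need the constrained-generation fallback.

```python
import re

# Assumption: integers and simple decimals only. A leading minus is treated
# as a sign, so "5-3" would be read as 5 and -3.
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def assign_numbers(
    prompt: str, number_params: list[str]
) -> tuple[dict[str, float], list[str]]:
    """Assign numbers found in the prompt to parameters, in order of appearance."""
    numbers_seen = [float(text) for text in NUMBER_RE.findall(prompt)]
    assigned = dict(zip(number_params, numbers_seen))
    missing = number_params[len(numbers_seen):]  # left for the fallback
    return assigned, missing


print(assign_numbers("What is the sum of 265 and 345?", ["a", "b"]))
# ({'a': 265.0, 'b': 345.0}, [])
```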

Strings

Single string parameter: Extract from prompt using regex

"Greet shrek" → name = "shrek"
"Reverse 'hello'" → s = "hello"

Multiple string parameters: Smart positional and semantic extraction

"Replace all numbers in "Hello 34..." with NUMBERS"
→ source_string = "Hello 34 I'm 233 years old"  (longest quoted)
→ regex = "\d+"                                   ("all numbers" → pattern)
→ replacement = "NUMBERS"                         (after "with")

Pattern detection rules:

"all numbers" → \d+
"all vowels" → [aeiouAEIOU]
"word 'X'" → X
"with asterisks" → *

Booleans

Constrain logits to only true or false tokens.
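
As an illustration of the masking idea, the sketch below works on a plain list of logits. It assumes that "true" and "false" each map to a single token id, which may not hold for every tokenizer; `mask_logits` and `pick_boolean` are hypothetical names.

```python
import math


def mask_logits(logits: list[float], allowed_ids: set[int]) -> list[float]:
    """Set every logit outside allowed_ids to -inf so it can never be chosen."""
    return [
        value if token_id in allowed_ids else -math.inf
        for token_id, value in enumerate(logits)
    ]


def pick_boolean(logits: list[float], true_id: int, false_id: int) -> bool:
    """Choose between the two boolean tokens, ignoring all other tokens."""
    masked = mask_logits(logits, {true_id, false_id})
    best = max(range(len(masked)), key=masked.__getitem__)
    return best == true_id


# Token 0 has the highest raw logit but is masked out.
print(pick_boolean([9.0, 1.5, 0.2, 2.5], true_id=1, false_id=3))  # False
```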

Why This Works

Small models struggle with structured output because nothing in their training guarantees syntactically perfect text. By removing invalid options at each generation step, we force the model to stay on track without requiring it to "know" JSON syntax perfectly.

Key insight: guidance beats capability. When the goal is valid structured output, a 0.6B model with constraints can outperform a 7B model that relies on prompting alone.

Design Decisions

Why constrained decoding over prompting?

Prompting alone achieves ~30% accuracy with small models. Constrained decoding brings output validity to 100% by making invalid outputs impossible to generate.

Why extract strings from prompts instead of generating them?

The Qwen3-0.6B model is prone to hallucination and repetition loops when generating free-form strings. Extracting values that are already in the prompt is:

More reliable
Faster
Aligned with how humans naturally phrase requests

Why prioritize double quotes over single quotes?

To handle apostrophes correctly ("I'm" should not be split into "I" and "m").
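
A small sketch of the priority rule, assuming quoted values are found with regular expressions. `extract_quoted` is a hypothetical helper, not necessarily how the project implements it.

```python
import re


def extract_quoted(prompt: str) -> list[str]:
    """Return double-quoted strings if any exist, else single-quoted ones."""
    double_quoted = re.findall(r'"([^"]*)"', prompt)
    if double_quoted:
        return double_quoted  # apostrophes inside these strings stay intact
    return re.findall(r"'([^']*)'", prompt)


print(extract_quoted('Replace "I\'m" with "you\'re"'))  # ["I'm", "you're"]
print(extract_quoted("Reverse 'hello'"))                # ['hello']
```

Trying single quotes first on the first prompt would pair the two apostrophes and capture the text between them, which is why the double-quote pass runs first.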

Token-level filtering strategy

The vocabulary filtering happens in three layers:

Structural validity — tokens that maintain JSON syntax
Schema compliance — tokens that match the expected type
Semantic relevance — tokens that make sense for this specific parameter

Performance Analysis

Accuracy

Function selection: ~95%+ on provided test set
Argument extraction: ~90%+ for simple cases, ~80%+ for complex multi-parameter strings
JSON validity: 100% (guaranteed by constrained decoding)

Speed

Model load time: ~5-10 seconds (one-time)
Per-prompt processing: ~2-5 seconds on CPU
Total for 11 test prompts: ~20-30 seconds

Reliability

The constrained decoding approach ensures that:

Every output is valid JSON
Every function name is from the available set
Every parameter type matches its schema definition
No hallucinated keys or extra fields

Challenges Faced

Challenge 1: String generation loops

Problem: The model would generate repetitive patterns such as "shrek" Answer: Function: fn_greet name: shrek Answer: ... and never stop on its own.

Solution: Added stop-token detection and switched to an extraction-first strategy for strings.
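
One way stop detection could look, as a sketch under stated assumptions: generated text arrives as an iterable of string pieces, the stop markers are plain substrings, and a token budget caps runaway loops. `generate_until_stop`, the marker "Answer:", and the default budget are hypothetical choices for illustration.

```python
from collections.abc import Iterable


def generate_until_stop(
    pieces: Iterable[str], stop_markers: list[str], max_tokens: int = 32
) -> str:
    """Accumulate pieces until a stop marker appears or the budget runs out."""
    text = ""
    for count, piece in enumerate(pieces):
        if count >= max_tokens:
            break  # hard cap so a repetition loop cannot run forever
        text += piece
        for marker in stop_markers:
            cut = text.find(marker)
            if cut != -1:
                return text[:cut].strip()  # drop the marker and what follows
    return text.strip()


stream = ["shrek", " Answer", ":", " Function", ":", " fn_greet"]
print(generate_until_stop(stream, ["Answer:"]))  # shrek
```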

Challenge 2: Multi-parameter string functions

Problem: All string parameters received the same extracted value.

Solution: Implemented parameter-specific extraction heuristics based on parameter name semantics (source_string, regex, replacement).

Challenge 3: Numbers appearing in wrong parameters

Problem: Numbers ended up in the wrong parameters, for example a = 2.0, b = 0.0 instead of a = 2.0, b = 3.0.

Solution: Extract all numbers from the prompt first, then assign them in order to number-typed parameters.

Challenge 4: Vocabulary token cleaning

Problem: Vocabulary tokens carry space-marker prefixes (Ġ in byte-level BPE vocabularies, ▁ in SentencePiece vocabularies) that need to be stripped before comparison.

Solution: Implemented _clean_token() utility to normalize token strings before comparison.
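
A hypothetical version of such a helper is sketched below; the project's actual `_clean_token()` may behave differently. It is named `clean_token` here to keep the two apart, and it assumes that dropping the leading marker is acceptable because function names contain no spaces.

```python
SPACE_MARKERS = "Ġ▁"  # byte-level BPE and SentencePiece space markers


def clean_token(token: str) -> str:
    """Strip leading space markers so tokens compare cleanly with identifiers."""
    return token.lstrip(SPACE_MARKERS)


print(clean_token("Ġfn"))       # fn
print(clean_token("▁add"))      # add
print(clean_token("_numbers"))  # _numbers
```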

Testing Strategy

Manual testing

Created diverse test cases covering:

Simple arithmetic ("sum of 2 and 3")
String operations ("reverse 'hello'")
Complex regex substitutions
Edge cases (apostrophes, multiple quoted strings)

Validation approach

Run on provided test set
Inspect JSON output for validity
Verify function names match expected
Check argument types and values
Test with modified/custom function definitions
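
The checks above could be partly automated. The sketch below is an assumption about how such a check might look, not the project's test code: `validate_call` and `JSON_TYPES` are hypothetical names, and the type names "string" and "boolean" are assumed spellings (only "number" appears in the example definitions).

```python
JSON_TYPES = {"number": (int, float), "string": (str,), "boolean": (bool,)}


def validate_call(call: dict, definitions: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    by_name = {definition["name"]: definition for definition in definitions}
    definition = by_name.get(call.get("name"))
    if definition is None:
        return [f"unknown function: {call.get('name')!r}"]

    expected = definition["parameters"]
    actual = call.get("parameters", {})
    errors = []
    if set(actual) != set(expected):
        errors.append(f"keys {sorted(actual)} != expected {sorted(expected)}")
    for key, spec in expected.items():
        if key not in actual:
            continue
        value, type_name = actual[key], spec["type"]
        python_types = JSON_TYPES.get(type_name)
        if python_types is None:
            continue  # unknown type name: nothing to check
        # bool is a subclass of int in Python, so reject it for "number".
        is_stray_bool = isinstance(value, bool) and type_name != "boolean"
        if is_stray_bool or not isinstance(value, python_types):
            errors.append(f"{key}: expected {type_name}, got {value!r}")
    return errors


definitions = [
    {
        "name": "fn_add_numbers",
        "description": "Add two numbers together",
        "parameters": {"a": {"type": "number"}, "b": {"type": "number"}},
        "returns": {"type": "number"},
    }
]
call = {
    "prompt": "What is the sum of 2 and 3?",
    "name": "fn_add_numbers",
    "parameters": {"a": 2.0, "b": 3.0},
}
print(validate_call(call, definitions))  # []
```

In practice the two lists would be loaded with `json.load` from the definition and output files rather than written inline.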

Known limitations

Struggles with ambiguous prompts that could map to multiple functions
May fail on prompts with unusual phrasing not covered by extraction heuristics
Regex pattern detection is rule-based, not exhaustive

Example Usage

Input files

data/input/functions_definition.json:

[
  {
    "name": "fn_add_numbers",
    "description": "Add two numbers together",
    "parameters": {
      "a": {"type": "number"},
      "b": {"type": "number"}
    },
    "returns": {"type": "number"}
  }
]

data/input/function_calling_tests.json:

[
  {"prompt": "What is the sum of 2 and 3?"}
]

Running the tool

make run

Output

data/output/function_calls.json:

[
  {
    "prompt": "What is the sum of 2 and 3?",
    "name": "fn_add_numbers",
    "parameters": {"a": 2.0, "b": 3.0}
  }
]

Project Structure

call-me-maybe/
├── src/
│   ├── __init__.py           # Package marker
│   ├── __main__.py           # Entry point + CLI argument parsing
│   ├── config.py             # Default paths configuration
│   ├── models.py             # Pydantic models for validation
│   └── decoder.py            # Core constrained decoding logic
├── llm_sdk/                  # Provided LLM wrapper (not modified)
├── data/
│   ├── input/                # Input JSON files
│   └── output/               # Generated results (not in git)
├── pyproject.toml            # Dependencies and project metadata
├── Makefile                  # Build automation
└── README.md                 # This file

Resources

Academic & Technical

Documentation

AI Usage

Documentation: Drafting docstrings and README sections

License

This project is part of the 42 school curriculum and follows its academic guidelines.
