Skip to content

Kheldin/call-me-maybe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This project has been created as part of the 42 curriculum by kacherch.

call-me-maybe

Description

call-me-maybe is a function calling system that translates natural language prompts into structured, executable function calls using a small language model (Qwen3-0.6B). The project demonstrates how constrained decoding can force a tiny 0.6B parameter model to generate 100% valid JSON output with near-perfect reliability.

Instead of relying on prompting alone (which achieves only ~30% accuracy with small models), this implementation uses token-level constraints to guarantee syntactically valid and schema-compliant output every time.

Example

Input:

"What is the sum of 40 and 2?"

Output:

{
  "name": "fn_add_numbers",
  "parameters": {"a": 40.0, "b": 2.0}
}

The system doesn't answer "42" — it provides the tools to solve it: the correct function name and properly typed arguments.


Instructions

Prerequisites

  • Python 3.13+
  • uv package manager (recommended) or pip
  • ~5GB disk space for model download

Installation

# Clone the repository
git clone <your-repo-url>
cd call-me-maybe

# Install dependencies
make install
# or manually:
uv sync

Usage

Basic usage (default paths)

make run
# or
uv run python -m src

This reads from:

  • data/input/functions_definition.json (available functions)
  • data/input/function_calling_tests.json (prompts to process)

And writes to:

  • data/output/function_calls.json (structured results)

Custom paths

uv run python -m src \
  --functions_definition custom/functions.json \
  --input custom/prompts.json \
  --output results/output.json

Debug mode

make debug
# or
uv run python -m pdb -m src

Code quality checks

make lint         # flake8 + mypy
make lint-strict  # mypy strict mode

Clean temporary files

make clean

Algorithm Explanation

Overview

The implementation uses a two-phase constrained decoding approach:

  1. Function Selection — Token-by-token generation constrained to valid function names
  2. Argument Extraction — Smart extraction and constrained generation based on parameter types

Phase 1: Function Name Selection

Instead of hoping the model generates a valid function name, we guide it character-by-character:

Generated so far: "fn_"
Reachable functions: ["fn_add_numbers", "fn_greet", "fn_reverse_string"]
Valid next tokens: only tokens that keep at least one function reachable

Generated so far: "fn_add"
Reachable functions: ["fn_add_numbers"]
Valid next tokens: tokens that start with "_n"

Result: guaranteed valid function name

This is done by:

  1. Building a prompt that lists all available functions
  2. Generating one token at a time
  3. Filtering logits to only allow tokens that maintain at least one valid function as reachable
  4. Stopping when an exact match is found

Phase 2: Argument Extraction

Different strategies based on parameter type:

Numbers

Extract directly from the prompt in order of appearance:

"What is the sum of 265 and 345?"
→ numbers_seen = [265.0, 345.0]
→ a = 265.0, b = 345.0

Fallback: if more parameters than numbers in prompt, use constrained token generation (digits/dot/minus only).

Strings

Single string parameter: Extract from prompt using regex

"Greet shrek" → name = "shrek"
"Reverse 'hello'" → s = "hello"

Multiple string parameters: Smart positional and semantic extraction

"Replace all numbers in "Hello 34..." with NUMBERS"
→ source_string = "Hello 34 I'm 233 years old"  (longest quoted)
→ regex = "\d+"                                   ("all numbers" → pattern)
→ replacement = "NUMBERS"                         (after "with")

Pattern detection rules:

  • "all numbers"\d+
  • "all vowels"[aeiouAEIOU]
  • "word 'X'"X
  • "with asterisks"*

Booleans

Constrain logits to only true or false tokens.

Why This Works

Small models struggle with structured output because they weren't trained to be perfectly syntactic. By removing invalid options at each generation step, we force the model to stay on track without requiring it to "know" JSON syntax perfectly.

Key insight: guidance beats capability. A 0.6B model with constraints outperforms a 7B model with prompting alone.


Design Decisions

Why constrained decoding over prompting?

Prompting alone achieves ~30% accuracy with small models. Constrained decoding achieves ~100% by making invalid outputs literally impossible.

Why extract strings from prompts instead of generating them?

The Qwen3-0.6B model is prone to hallucination and repetition loops when generating free-form strings. Extracting values that are already in the prompt is:

  • More reliable
  • Faster
  • Aligned with how humans naturally phrase requests

Why prioritize double quotes over single quotes?

To handle apostrophes correctly ("I'm" should not be split into "I" and "m").

Token-level filtering strategy

The vocabulary filtering happens in three layers:

  1. Structural validity — tokens that maintain JSON syntax
  2. Schema compliance — tokens that match the expected type
  3. Semantic relevance — tokens that make sense for this specific parameter

Performance Analysis

Accuracy

  • Function selection: ~95%+ on provided test set
  • Argument extraction: ~90%+ for simple cases, ~80%+ for complex multi-parameter strings
  • JSON validity: 100% (guaranteed by constrained decoding)

Speed

  • Model load time: ~5-10 seconds (one-time)
  • Per-prompt processing: ~2-5 seconds on CPU
  • Total for 11 test prompts: ~20-30 seconds

Reliability

The constrained decoding approach ensures that:

  • Every output is valid JSON
  • Every function name is from the available set
  • Every parameter type matches its schema definition
  • No hallucinated keys or extra fields

Challenges Faced

Challenge 1: String generation loops

Problem: The model would generate repetitive patterns like "shrek" Answer: Function: fn_greet name: shrek Answer: ... infinitely.

Solution: Added stop-token detection and switched to extraction-first strategy for strings.

Challenge 2: Multi-parameter string functions

Problem: All string parameters received the same extracted value.

Solution: Implemented parameter-specific extraction heuristics based on parameter name semantics (source_string, regex, replacement).

Challenge 3: Numbers appearing in wrong parameters

Problem: a = 2.0, b = 0.0 instead of a = 2.0, b = 3.0.

Solution: Extract all numbers from the prompt first, then assign them in order to number-typed parameters.

Challenge 4: Vocabulary token cleaning

Problem: BPE tokens include prefix characters (Ġ for space, for sentencepiece) that needed to be stripped.

Solution: Implemented _clean_token() utility to normalize token strings before comparison.


Testing Strategy

Manual testing

Created diverse test cases covering:

  • Simple arithmetic ("sum of 2 and 3")
  • String operations ("reverse 'hello'")
  • Complex regex substitutions
  • Edge cases (apostrophes, multiple quoted strings)

Validation approach

  1. Run on provided test set
  2. Inspect JSON output for validity
  3. Verify function names match expected
  4. Check argument types and values
  5. Test with modified/custom function definitions

Known limitations

  • Struggles with ambiguous prompts that could map to multiple functions
  • May fail on prompts with unusual phrasing not covered by extraction heuristics
  • Regex pattern detection is rule-based, not exhaustive

Example Usage

Input files

data/input/functions_definition.json:

[
  {
    "name": "fn_add_numbers",
    "description": "Add two numbers together",
    "parameters": {
      "a": {"type": "number"},
      "b": {"type": "number"}
    },
    "returns": {"type": "number"}
  }
]

data/input/function_calling_tests.json:

[
  {"prompt": "What is the sum of 2 and 3?"}
]

Running the tool

make run

Output

data/output/function_calls.json:

[
  {
    "prompt": "What is the sum of 2 and 3?",
    "name": "fn_add_numbers",
    "parameters": {"a": 2.0, "b": 3.0}
  }
]

Project Structure

call-me-maybe/
├── src/
│   ├── __init__.py           # Package marker
│   ├── __main__.py           # Entry point + CLI argument parsing
│   ├── config.py             # Default paths configuration
│   ├── models.py             # Pydantic models for validation
│   └── decoder.py            # Core constrained decoding logic
├── llm_sdk/                  # Provided LLM wrapper (not modified)
├── data/
│   ├── input/                # Input JSON files
│   └── output/               # Generated results (not in git)
├── pyproject.toml            # Dependencies and project metadata
├── Makefile                  # Build automation
└── README.md                 # This file

Resources

Academic & Technical

Documentation

AI Usage

  • Documentation: Drafting docstrings and README sections

License

This project is part of the 42 school curriculum and follows its academic guidelines.

About

Function calling system using constrained decoding on a tiny LLM (Qwen3-0.6B). Translates natural language prompts into schema-compliant JSON function calls with guaranteed validity. 42 project.

Resources

Stars

Watchers

Forks

Contributors