Ctx2Skill: From Context to Skills

Can Language Models Learn from Context Skillfully?

The code of our paper "From Context to Skills: Can Language Models Learn from Context Skillfully? ".

Ctx2Skill is a self-evolving framework that autonomously discovers, refines, and selects context-specific skills from complex contexts, requiring no human annotation and no external feedback. The resulting natural-language skills can be plugged into any language model at inference time to enhance context learning capability.

Overview

Many real-world tasks require language models to reason over complex contexts (e.g., technical documents, research papers, code repositories) that lie outside their parametric knowledge. An intuitive solution is inference-time skill augmentation — extracting rules and procedures from the context into explicit, natural-language skills. However, constructing such skills faces two fundamental challenges:

Prohibitive cost of manual skill annotation for long, technically dense contexts
Lack of external feedback for automated skill construction in context learning scenarios

Ctx2Skill addresses both challenges through a multi-agent self-play loop:

Method

Multi-Agent Self-Play Loop

The core of Ctx2Skill is a self-play loop comprising five frozen-LM agent roles:

Agent	Role
Challenger	Generates probing tasks and rubrics based on the context and its own evolving skill set
Reasoner	Attempts to solve tasks guided by the context and its current skill set
Judge	Provides binary per-rubric verdicts and partitions tasks into solved/failed sets
Proposer (one per side)	Diagnoses failure/success patterns and synthesizes high-level skill update proposals
Generator (one per side)	Materializes proposals into concrete skill set updates

Both the Challenger and Reasoner co-evolve through accumulated natural-language skills: failed cases drive Reasoner skill updates, while easily solved cases drive Challenger skill updates, maintaining sustained adversarial pressure.

Cross-Time Replay Mechanism

A key risk in self-play is adversarial collapse — the Challenger generates increasingly extreme tasks while the Reasoner's skills over-specialize. To address this, the Cross-Time Replay mechanism:

Collects representative hard/easy probe tasks during self-play
Re-evaluates all historical skill set candidates on these probes
Selects the skill set that maximizes the product of hard-set and easy-set solving rates, ensuring robust generalization

Results

Evaluated on four context learning tasks from CL-bench, Ctx2Skill consistently improves solve rates across backbone models:

Model	Without Skills	With Ctx2Skill	Improvement
GPT-4.1	11.1%	16.5%	+5.4%
GPT-5.1	21.2%	25.8%	+4.6%
GPT-5.2	18.2%	21.4%	+3.2%

We conduct our experiments using the API from newapi, openai, and azure api.

Quick Start

Prerequisites

Python 3.8+
OpenAI-compatible API access

Installation

git clone https://github.com/S1s-Z/Ctx2Skill.git
cd Ctx2Skill

Data Preparation

Download the CL-Bench dataset from this link files and place them in the project root:

CL-bench-context-dedup.jsonl — deduplicated contexts (used for skill generation)
CL-bench-with-task-delimiter.jsonl — tasks with delimiters (used for evaluation)

Running the Self-Play Loop

# Configure API
export OPENAI_BASE_URL="your-api-base-url"
export OPENAI_API_KEY="your-api-key"

# Run the self-play skill discovery loop
python selfplay_loop.py \
    --challenger-model gpt-4.1 \
    --reasoner-model gpt-4.1 \
    --judge-model gpt-5.1 \
    --proposer-model gpt-4.1 \
    --generator-model gpt-4.1 \
    --input ./CL-bench-context-dedup.jsonl \
    --output outputs/loop_data/loop_output.jsonl \
    --num-iterations 5 \
    --num-tasks 5 \
    --skills-dir skills-output \
    --workers 32

Inference with Discovered Skills

python infer.py \
    --model gpt-4.1 \
    --input ./CL-bench-with-task-delimiter.jsonl \
    --workers 32 \
    --skills-dir skills-output/reasoner \
    --output outputs/inference_output.jsonl

Evaluation

python eval_ignore_none.py \
    --input outputs/inference_output.jsonl \
    --judge-model gpt-5.1 \
    --workers 32

Project Structure

Ctx2Skill/
├── selfplay_loop.py       # Main self-play loop with all five agents
├── challenger.py           # Challenger agent implementation
├── infer.py                # Inference script with skill augmentation
├── eval.py                 # Evaluation script
├── eval_ignore_none.py     # Evaluation script (ignoring None responses)
├── prompts/                # Prompt templates for each agent role
│   ├── challenger.txt
│   ├── challenger_generator.txt
│   ├── challenger_proposer.txt
│   ├── reasoner_generator.txt
│   └── reasoner_proposer.txt
└── run.sh                  # Example run script

Citation

@misc{si2026contextskillslanguagemodels,
      title={From Context to Skills: Can Language Models Learn from Context Skillfully?}, 
      author={Shuzheng Si and Haozhe Zhao and Yu Lei and Qingyi Wang and Dingwei Chen and Zhitong Wang and Zhenhailong Wang and Kangyang Luo and Zheng Wang and Gang Chen and Fanchao Qi and Minjia Zhang and Maosong Sun},
      year={2026},
      eprint={2604.27660},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2604.27660}, 
}

License

This project is released under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Ctx2Skill: From Context to Skills

Overview

Method

Multi-Agent Self-Play Loop

Cross-Time Replay Mechanism

Results

Quick Start

Prerequisites

Installation

Data Preparation

Running the Self-Play Loop

Inference with Discovered Skills

Evaluation

Project Structure

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
assets		assets
prompts		prompts
.gitignore		.gitignore
README.md		README.md
challenger.py		challenger.py
eval.py		eval.py
eval_ignore_none.py		eval_ignore_none.py
infer.py		infer.py
run.sh		run.sh
selfplay_loop.py		selfplay_loop.py

Folders and files

Latest commit

History

Repository files navigation

Ctx2Skill: From Context to Skills

Overview

Method

Multi-Agent Self-Play Loop

Cross-Time Replay Mechanism

Results

Quick Start

Prerequisites

Installation

Data Preparation

Running the Self-Play Loop

Inference with Discovered Skills

Evaluation

Project Structure

Citation

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages