Agent SFT data infrastructure for generation, normalization, chat-template rendering, response masking, and training audits.
Teich is not only a dataset generator. It is a bridge between messy agent/chat data and model-specific supervised fine-tuning.
Start from any of these:
- Fresh Codex, Pi, Claude Code, or Hermes traces
- Text-only chat generations
- Local JSONL files or folders
- Hugging Face datasets
- Already-loaded datasets.Dataset objects
Then Teich handles the training-sensitive parts:
- Normalize sources into OpenAI-style messages/tools
- Render through the target tokenizer chat template
- Preserve typed supervision spans before tokenization
- Apply response-only labels after trainer tokenization
That means the same package can:
- Generate new agent traces or chat-only distillation data.
- Load and normalize existing local or Hub datasets.
- Mix chat-only and tool-call datasets with explicit ratios.
- Preserve raw traces as source-of-truth artifacts.
- Render with arbitrary tokenizer chat templates.
- Mask assistant reasoning, final answers, and tool calls while keeping prompts/tool responses ignored.
- Audit labels before training so fully masked or misaligned rows fail early.
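To make the last point concrete, a pre-training audit can be as small as checking that every row has aligned `input_ids` and `labels` and at least one supervised token. This is a standalone sketch rather than Teich's own audit: `audit_rows` and `IGNORE_INDEX` are hypothetical names, and `-100` is assumed as the ignore index because that is the convention in Hugging Face trainers.

```python
IGNORE_INDEX = -100  # label value the loss skips (assumed Hugging Face convention)


def audit_rows(rows):
    """Raise ValueError on the first misaligned or fully masked row (hypothetical helper)."""
    for i, row in enumerate(rows):
        input_ids, labels = row["input_ids"], row["labels"]
        if len(input_ids) != len(labels):
            raise ValueError(f"row {i}: {len(input_ids)} input_ids vs {len(labels)} labels")
        if all(label == IGNORE_INDEX for label in labels):
            raise ValueError(f"row {i}: every token is masked, nothing to train on")
        for token_id, label in zip(input_ids, labels):
            # A supervised position must carry the same id as the input token.
            if label != IGNORE_INDEX and label != token_id:
                raise ValueError(f"row {i}: supervised label does not match its input token")
    return len(rows)


good = [{"input_ids": [5, 6, 7, 8], "labels": [-100, -100, 7, 8]}]
print(audit_rows(good))  # 1
```

Failing loudly on the first bad row keeps a silent "loss is zero" run from starting at all.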
prompts / traces / JSONL / HF datasets / Dataset objects
↓
load_traces() or prepare_data()
↓
normalized messages + tools
↓
tokenizer chat template rendering
↓
trainer-friendly text + Teich supervision spans
↓
SFTTrainer tokenization
↓
mask_data()
↓
audited input_ids + labels
Use only the pieces you need:
- Already have a dataset? Skip generation and go straight to prepare_data().
- Want raw trace preservation? Use the CLI.
- Want standard next-token training? Use prepare_data(..., teich_masking=False) and skip mask_data().
| Goal | Use |
|---|---|
| Generate coding-agent traces | teich generate with agent.provider: codex, pi, claude-code, or hermes |
| Generate text-only chat rows | teich generate with agent.provider: chat |
| Load raw traces manually | load_traces() |
| Prepare local/HF/mixed datasets for training | prepare_data() |
| Apply response-only labels after TRL/Unsloth tokenization | mask_data() |
| Inspect supervised vs masked tokens | preview_sft_example() / trainer.train_dataset.preview() |
pip install teich
To create a new generation project:
teich init my-project && cd my-project
teich generate -c config.yaml
Or use astral-uv:
uvx teich init my-project && cd my-project
uvx teich generate -c config.yaml
Edit config.yaml and prompts.jsonl before running a real generation batch.
- Trace-first data collection: Run real coding agents and keep raw session traces as the source of truth.
- Dataset-first training: Load existing JSONL files, folders, Hugging Face repos, or datasets.Dataset objects without using the generator.
- Multi-provider generation: Works with Docker-backed Codex, Pi, Claude Code, Hermes, and a direct OpenAI-compatible chat mode.
- Structured conversion: Converts traces into chat messages with tool calls, reasoning, tool results, metadata, and configured tool snapshots.
- Universal masking surface: Supports assistant reasoning, final answers, tool calls, user/system/developer text, and tool responses as independently configurable masking targets.
- Multi-turn and tool-aware labels: Avoids Unsloth-style single-span masking pitfalls by storing typed spans before tokenization and aligning them after trainer tokenization.
- Source mixing: Mix local paths, Hub datasets, and in-memory datasets; explicit percentages stay true by scaling to the limiting source instead of silently changing ratios.
- Hugging Face integration: Publishes dataset cards with embedded tool-schema snapshots, and loads local or Hub datasets through one API.
Requirements for agent trace generation:
- Docker
- API key for the configured provider, such as OpenAI, OpenRouter, or Anthropic. Local OpenAI-compatible endpoints are also supported where the selected runner can use them.
agent.provider: chat does not require Docker.
The Python utilities also work without Docker if you already have traces or structured JSONL datasets.
Training examples use your existing finetuning stack. For the TRL example below, install compatible versions of transformers, trl, and your model-loading stack separately.
You do not need to generate data with Teich first.
If a local file, folder, Hugging Face dataset, or datasets.Dataset has a messages column, Teich can usually prepare it directly.
from teich import prepare_data
train_dataset = prepare_data(
"TeichAI/Claude-Opus-4.6-Reasoning-887x",
tokenizer,
max_length=32768,
drop_oversized_examples=True,
trim_oversized_followups=True,
tokenize=True,
chat_template_kwargs={"enable_thinking": True, "preserve_thinking": True},
)
prepare_data() returns rendered text, Teich span metadata, and optionally input_ids / attention_mask. With trim_oversized_followups=True, multi-turn rows that exceed max_length can drop the final user follow-up and everything after it before the whole row is discarded. Call mask_data() after constructing your trainer to convert those spans into labels.
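The follow-up trimming described above can be pictured with a small standalone sketch. It is not Teich's implementation: `trim_or_drop` and `count_words` are hypothetical helpers, word counting stands in for real tokenization, and the sketch assumes trimming repeats until the row fits or only the initial prompt remains.

```python
def trim_or_drop(messages, count_tokens, max_length):
    """Return a trimmed message list that fits, or None if the row must be dropped."""
    messages = list(messages)
    while count_tokens(messages) > max_length:
        user_turns = [i for i, m in enumerate(messages) if m["role"] == "user"]
        if len(user_turns) < 2:
            return None  # no follow-up left to cut: drop the whole row
        # Cut the last user follow-up and everything after it.
        messages = messages[: user_turns[-1]]
    return messages


def count_words(messages):
    # Toy stand-in for tokenization: count whitespace-separated words.
    return sum(len(m["content"].split()) for m in messages)


row = [
    {"role": "user", "content": "Draft a compact project plan"},
    {"role": "assistant", "content": "Here is a short plan"},
    {"role": "user", "content": "Revise it for a solo developer"},
    {"role": "assistant", "content": "Revised plan for one person"},
]
trimmed = trim_or_drop(row, count_words, max_length=12)
print(len(trimmed))  # 2
```

The row keeps its first exchange instead of being lost entirely, which is the point of trimming before dropping.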
train_dataset = prepare_data(
{
"max_examples": 1000,
"agent": {"source": "badlogicgames/pi-mono", "percentage": 80},
"chat": {"source": "TeichAI/Claude-Opus-4.6-Reasoning-887x", "percentage": 20},
},
tokenizer,
max_length=32768,
drop_oversized_examples=True,
trim_oversized_followups=True,
tokenize=True,
chat_template_kwargs={"enable_thinking": True, "preserve_thinking": True},
)
Explicit percentage, proportion, and weight values are treated as true ratios.
If one source cannot fill its share after filtering or context-window drops, Teich scales the total row count down instead of silently changing the realized mix.
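The scaling rule can be worked through with plain arithmetic. The sketch below is an illustration under stated assumptions, not Teich's code: `plan_mix` is a hypothetical helper, shares are integer percentages, and `available` is the number of usable rows left in each source after filtering.

```python
from fractions import Fraction


def plan_mix(max_examples, shares, available):
    """Map source name -> rows to take while keeping the requested ratios exact."""
    total_share = sum(shares.values())
    ratios = {name: Fraction(share, total_share) for name, share in shares.items()}
    # Largest total for which every source can still supply its full share.
    ceiling = min(available[name] / ratio for name, ratio in ratios.items() if ratio > 0)
    total = min(max_examples, int(ceiling))
    return {name: int(total * ratio) for name, ratio in ratios.items()}


# The agent source has only 400 usable rows, so an 80/20 mix tops out at 500 rows.
plan = plan_mix(1000, {"agent": 80, "chat": 20}, {"agent": 400, "chat": 5000})
print(plan)  # {'agent': 400, 'chat': 100}
```

Backfilling the missing 400 agent rows from the chat source would reach 1000 rows but turn the mix into 40/60, which is the silent ratio change the scaling avoids.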
# Initialize project
teich init my-project
cd my-project
# Add prompts to prompts.jsonl, then:
export OPENAI_API_KEY=sk-...
teich generate -c config.yaml
Outputs:
- codex / pi: normalized copies of native agent session JSONL files in output/, sandboxes in sandbox/, and a README.md
- claude-code: native Claude Code transcript JSONL copied from .claude/projects/..., sandboxes in sandbox/, and a README.md
- hermes: one Hermes-native session JSONL per session exported from state.db in the export_all(source="cli") shape, including delegated subagent sessions as separate files linked by parent_session_id, sandboxes in sandbox/, and a README.md
- chat: text-only JSONL training rows in output/ and a dataset README.md
Only completed runs are kept at the top level of output/. Failed or interrupted agent traces are preserved under output/partials/ for debugging, and Teich excludes that directory from resume detection, conversion, README generation, and Hugging Face uploads.
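If you post-process the output directory yourself, the same convention is easy to follow. This is a minimal sketch that assumes completed traces are `*.jsonl` files directly under the output directory; `completed_trace_files` is a hypothetical helper, not part of the Teich API.

```python
import tempfile
from pathlib import Path


def completed_trace_files(output_dir):
    """List top-level *.jsonl files; a non-recursive glob never descends into partials/."""
    return sorted(p for p in Path(output_dir).glob("*.jsonl") if p.is_file())


# Build a throwaway directory with one completed and one partial trace.
with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp)
    (output / "partials").mkdir()
    (output / "run-001.jsonl").write_text("{}\n")
    (output / "partials" / "run-002.jsonl").write_text("{}\n")
    names = [path.name for path in completed_trace_files(output)]

print(names)  # ['run-001.jsonl']
```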
Generation progress reports provider/model usage when Teich can retrieve it. For OpenRouter, Teich first queries the provider's generation stats API for native token and cost accounting, then falls back to harness-reported usage. If neither source is available, Teich prints N/A instead of treating the missing value as zero.
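The fallback order can be sketched as follows. The function and field names (`format_usage`, `total_tokens`, `cost`) are hypothetical and only illustrate the idea of keeping "unknown" distinct from "zero"; they do not describe Teich's actual progress output.

```python
def format_usage(provider_stats, harness_usage):
    """Prefer provider stats, fall back to harness usage, and report N/A when both are missing."""
    usage = provider_stats if provider_stats is not None else harness_usage
    if usage is None:
        return "tokens=N/A cost=N/A"
    tokens = usage.get("total_tokens")
    cost = usage.get("cost")
    # A missing field is shown as N/A, never as 0.
    tokens_text = "N/A" if tokens is None else str(tokens)
    cost_text = "N/A" if cost is None else f"${cost:.4f}"
    return f"tokens={tokens_text} cost={cost_text}"


print(format_usage(None, {"total_tokens": 1200}))  # tokens=1200 cost=N/A
print(format_usage(None, None))                    # tokens=N/A cost=N/A
```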
If publish.repo_id is configured, Teich also creates or updates the matching Hugging Face dataset repo.
Uploaded artifacts include:
- Generated JSONL
- Dataset README.md
- Embedded tool-schema snapshot in the dataset card when tools are present
If a long run is interrupted, use:
teich generate -c config.yaml --resume
Teich will scan existing outputs and skip prompts that have already been converted into completed training examples.
Prompt files can be JSONL/NDJSON, JSON, CSV, or plain text.
JSONL is recommended because it handles long multiline prompts, repository metadata, and follow-up turns without CSV escaping problems.
Recommended prompts.jsonl:
{"prompt":"Build a simple todo list app in React"}
{"github_repo":"armand0e/perplexica-mcp","prompt":"Add a small usability improvement and update the tests"}
{"system":"Answer as a concise project manager.","prompt":"Draft a compact project plan"}
{"prompt":"Draft a compact project plan","follow_up_prompts":["Revise it for a solo developer","Add a risk checklist"]}
system is optional and prompt-specific. If a row does not include system, Teich does not inject a default system prompt.
follow_up_prompts is supported across providers. chat sends each follow-up as a real additional user turn in one generated training row. Agent runners keep one Docker container alive for the full prompt sequence, run the initial prompt, then resume or continue the same saved agent session for each follow-up so workspace edits, tool caches, and in-container installs remain available across turns.
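To see how one prompt row expands into an ordered sequence of user turns, here is a small sketch that uses only the field names from the rows above. `prompt_turns` is a hypothetical helper for illustration, not a Teich function.

```python
import json


def prompt_turns(line):
    """Parse one prompts.jsonl line into (system, ordered user turns)."""
    row = json.loads(line)
    turns = [row["prompt"], *row.get("follow_up_prompts", [])]
    # system stays None when the row does not provide one; no default is injected.
    return row.get("system"), turns


line = (
    '{"prompt":"Draft a compact project plan",'
    '"follow_up_prompts":["Revise it for a solo developer","Add a risk checklist"]}'
)
system, turns = prompt_turns(line)
print(system, len(turns))  # None 3
```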
- codex copies the native Codex session JSONL out of the mounted CODEX_HOME/sessions directory, then normalizes known Codex event-shape edge cases.
- pi copies the native Pi session JSONL out of the mounted /home/codex/pi-sessions directory, then normalizes and validates tool-call structure before writing output. For OpenRouter, Teich forces Pi onto the chat/completions wire path because Pi's OpenRouter Responses adapter can stall before the first session event.
- claude-code copies Claude Code's native transcript JSONL from .claude/projects/... so the output keeps Claude's own user, assistant, system, and result event format. With OpenRouter non-Claude models, Teich runs a local in-container proxy: Claude Code sees a Claude surrogate model name, while the proxy rewrites outbound requests back to the configured model. The native assistant/result events keep the provider-returned model and usage fields when Claude Code records them.
- hermes runs with built-in toolsets safe, terminal, file, skills, memory, session_search, delegation, then exports Hermes Agent state.db sessions using the native export_all(source="cli") session shape: one session object per JSONL file with a nested messages list. Delegated subagents remain separate trace files rather than being merged into the orchestrator session; child traces include parent_session_id.
- chat calls an OpenAI-compatible API directly and writes structured training rows instead of raw agent traces.
agent:
provider: chat
model:
model: gpt-4.1-mini
api:
provider: openai
wire_api: responses
Each generated JSONL line will look like:
{"messages":[{"role":"user","content":"Hello","thinking":null},{"role":"assistant","content":"Hi!","thinking":"I should greet the user."}],"prompt":"Hello","thinking":"I should greet the user.","response":"Hi!","model":"gpt-4.1-mini"}
With follow-ups, the same row contains:
- Alternating user and assistant messages
- follow_up_prompts
- Per-turn responses
- Final response
Use the trainer-first path:
- prepare_data renders trainer-friendly text rows with Teich supervision metadata.
- SFTTrainer tokenizes them.
- mask_data applies multi-turn/tool-aware response-only labels to the trainer dataset.
import os
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer
from teich import mask_data, prepare_data
MAX_SEQ_LEN = 32768
MODEL_NAME = "unsloth/Qwen3.5-0.8B"
CHAT_TEMPLATE_KWARGS = {"enable_thinking": True}
PUSH_TO_HUB_REPO_ID = "username/teich-sft-model"
HF_TOKEN = os.environ.get("HF_TOKEN") or ""
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_NAME,
max_seq_length=MAX_SEQ_LEN,
load_in_4bit=False,
load_in_8bit=False,
full_finetuning=False,
)
model = FastLanguageModel.get_peft_model(
model,
r=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "out_proj"],
lora_alpha=64,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=3407,
use_rslora=False,
loftq_config=None,
)
train_dataset = prepare_data(
"TeichAI/lordx64-claude-opus-4.7-max-cleaned",
tokenizer,
split="train",
max_examples=500,
chat_template_kwargs=CHAT_TEMPLATE_KWARGS,
max_length=MAX_SEQ_LEN,
drop_oversized_examples=True,
trim_oversized_followups=True,
tokenize=True,
strict=True,
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=train_dataset,
eval_dataset=None,
args=SFTConfig(
dataset_text_field="text",
dataset_num_proc=1,
max_length=MAX_SEQ_LEN,
packing=False,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
warmup_steps=5,
num_train_epochs=1,
learning_rate=2e-4,
logging_steps=1,
optim="muon",
optim_target_modules="all-linear",
weight_decay=0.001,
lr_scheduler_type="linear",
output_dir="outputs",
seed=3407,
report_to="none",
),
)
trainer = mask_data(
trainer,
tokenizer=tokenizer,
train_on_reasoning=True,
train_on_final_answers=True,
train_on_tools=True,
)
trainer_stats = trainer.train(resume_from_checkpoint=False)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
model.push_to_hub_merged(PUSH_TO_HUB_REPO_ID, tokenizer, save_method="merged_16bit", token=HF_TOKEN)
prepare_data:
- Loads local folders, local files, Hugging Face datasets, source mixes, or datasets.Dataset objects.
- Applies the tokenizer chat template.
- Optionally tokenizes only to drop rows above max_length.
- Optionally trims final follow-up turns from multi-turn rows before dropping them for exceeding max_length.
- Returns trainer-friendly text rows with typed Teich span metadata.
- Supports teich_masking=False for plain next-token training without Teich response-only labels.
For Unsloth / TRL, pass tokenize=True so trainer setup treats the dataset as already tokenized and preserves Teich span metadata until mask_data() runs.
mask_data:
- Follows the same trainer-first shape as Unsloth's response-only helper.
- Uses Teich span metadata so multi-turn tool calls and tool responses are masked correctly.
- Trains on assistant reasoning, final answers, and tool calls by default.
- Keeps user/system/developer/tool-response text masked by default.
- Returns a compact trainer dataset with only input_ids and labels.
You can override the default policy with train_on_reasoning, train_on_final_answers, train_on_tools, train_on_user, train_on_system, train_on_developer, and train_on_tool_responses.
Keep packing=False for this flow because packed datasets merge row boundaries before masking. For long-context runs, max_supervised_tokens defaults to the trainer's max_length to cap the number of trainable answer tokens per row.
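The idea behind span-based masking can be shown without any training stack. The sketch below is a simplified stand-in, not Teich's implementation: the span format (`kind`, `start`, `end` as character offsets), `toy_tokenize`, and `labels_from_spans` are all hypothetical. With a real Hugging Face fast tokenizer, comparable character offsets come from `return_offsets_mapping=True`.

```python
import re

IGNORE_INDEX = -100  # label value the loss skips


def toy_tokenize(text):
    """Whitespace tokenizer returning (token_id, start, end); ids are just positions."""
    return [(i, m.start(), m.end()) for i, m in enumerate(re.finditer(r"\S+", text))]


def labels_from_spans(tokens, spans, train_on):
    """Supervise a token only if it lies fully inside a span whose kind is enabled."""
    labels = []
    for token_id, start, end in tokens:
        supervised = any(
            s["start"] <= start and end <= s["end"] and train_on.get(s["kind"], False)
            for s in spans
        )
        labels.append(token_id if supervised else IGNORE_INDEX)
    return labels


text = "user: hi assistant: hello there user: thanks assistant: welcome"
first = text.index("hello")
spans = [
    {"kind": "final_answer", "start": first, "end": first + len("hello there")},
    {"kind": "final_answer", "start": text.index("welcome"), "end": len(text)},
]
labels = labels_from_spans(toy_tokenize(text), spans, {"final_answer": True})
print(labels)  # [-100, -100, -100, 3, 4, -100, -100, -100, 8]
```

Both assistant turns end up supervised while every prompt token stays ignored, which is what a single-span mask cannot express for multi-turn rows.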
To combine datasets, pass a list of dataset IDs, local paths, or loaded datasets.Dataset objects:
train_dataset = prepare_data(
["username/chat-traces", "username/tool-traces"],
tokenizer,
max_length=MAX_SEQ_LEN,
drop_oversized_examples=True,
trim_oversized_followups=True,
tokenize=True,
chat_template_kwargs=CHAT_TEMPLATE_KWARGS,
)
For weighted mixes, explicit percentage, proportion, and weight values are treated as true ratios.
If one source cannot fill its share after filtering or context-window drops, Teich scales the total row count down instead of silently changing the realized mix.
Use load_traces directly when you want to own the rest of the training pipeline yourself:
- Chat-template rendering
- Filtering
- Tokenization
- Label masking
- Packing policy
- Auditing
from teich import load_traces
dataset = load_traces("./output")
example = dataset[0]
rendered = tokenizer.apply_chat_template(
example["messages"],
tools=example.get("tools") or [],
tokenize=False,
add_generation_prompt=False,
enable_thinking=True,
)
tokenized = tokenizer(rendered, truncation=True, max_length=32768)
config.yaml:
agent:
provider: codex # or pi, claude-code, hermes, or chat
model:
model: codex-mini-latest
approval_policy: never
sandbox: danger-full-access
prompts_file: prompts.jsonl
prompts: []
# For chat provider follow-up turns:
# prompts:
# - prompt: "Draft a compact project plan"
# follow_up_prompts:
# - "Revise it for a solo developer"
# - "Add a risk checklist"
output:
traces_dir: ./output
sandbox_dir: ./sandbox
pretty_name: "My Agent Traces"
publish:
repo_id: armand0e/my-dataset
hf_token: hf_xxx
private: false
Dataset tags are auto-generated from the provider and model:
- codex / pi / claude-code / hermes: agent-traces, format:agent-traces, <provider>, distillation, <model>, teich
- chat: conversational, distillation, teich, <model>
If publish.hf_token is omitted, Teich also accepts HF_TOKEN, HUGGINGFACE_HUB_TOKEN, or TEICH_HF_TOKEN from the environment.
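A lookup along these lines is sketched below. The variable names come from the sentence above, but the order in which the environment variables are tried is an assumption, and `resolve_hf_token` is a hypothetical helper rather than Teich's API.

```python
import os

# Assumed lookup order; the text above lists the names but not their priority.
ENV_FALLBACKS = ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN", "TEICH_HF_TOKEN")


def resolve_hf_token(config_token, env=None):
    """Return the config token if set, else the first non-empty environment fallback."""
    env = os.environ if env is None else env
    if config_token:
        return config_token
    for name in ENV_FALLBACKS:
        value = env.get(name)
        if value:
            return value
    return None


print(resolve_hf_token(None, {"HUGGINGFACE_HUB_TOKEN": "hf_example"}))  # hf_example
```

Reading the token from the environment also keeps it out of a config.yaml that might be committed or uploaded.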
export TEICH_PROVIDER=LMstudio
export TEICH_MODEL=gemma-4
export TEICH_BASE_URL=http://localhost:1234/v1
export TEICH_API_KEY=llm
teich generate -c config.yaml
Training examples include:
- prompt: initial task description
- follow_up_prompts: optional additional chat turns generated after the initial prompt
- messages: chat history (system, user, assistant, tool)
- tools: tool schemas used in the session
- metadata: session info, model, timestamps, and usage when available
Structured chat datasets can also include convenience top-level fields like:
- system when provided by the prompt row
- follow_up_prompts
- thinking
- response
- responses
- model
Assistant messages capture:
- content: text response
- reasoning_content: chain-of-thought traces
- tool_calls: function calls with arguments
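For orientation, an assistant message carrying these three fields might look like the dictionary below. The `tool_calls` entry follows the OpenAI chat format, with arguments as a JSON string; whether Teich stores arguments exactly this way is an assumption, and `read_file` and `supervised_parts` are hypothetical names used only for illustration.

```python
import json

assistant_message = {
    "role": "assistant",
    "content": "",
    "reasoning_content": "I need to read the file first.",
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "read_file",  # hypothetical tool name
                "arguments": json.dumps({"path": "README.md"}),
            },
        }
    ],
}


def supervised_parts(message):
    """List which independently maskable parts this assistant message contains."""
    parts = []
    if message.get("reasoning_content"):
        parts.append("reasoning")
    if message.get("content"):
        parts.append("final_answer")
    if message.get("tool_calls"):
        parts.append("tool_call")
    return parts


print(supervised_parts(assistant_message))  # ['reasoning', 'tool_call']
```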
from teich import (
prepare_data, # Recommended: render trainer-friendly text rows
mask_data, # Recommended: apply Teich labels after SFTTrainer tokenization
load_traces, # Fallback: load rows for fully manual processing
preview_sft_example, # Preview supervised vs masked tokens
Config, # Load config.yaml
TrainingExample, # Typed training example
)
README.md is the package readme used for PyPI, so these examples are the canonical public package docs.
Teich preserves the raw or native captured agent session as the source of truth:
- Collect: Run agents on real tasks → raw .jsonl traces
- Inspect/Share: Traces are human-readable and uploadable
- Convert: Transform to structured examples when ready
- Prepare: Use prepare_data() + mask_data() to apply model-specific templates and labels through the trainer-first flow
If you choose agent.provider: chat, Teich skips the trace-preservation step and writes structured text-only JSONL rows directly.
This means you can:
- Re-convert with different logic later
- Share raw traces before releasing training data
- Train on the same sessions with different model templates
uv pip install -e ".[dev]"
uv run pytest --ignore=tests/test_integration.py -q
Teich is alpha. The core workflow is stable and usable. APIs may evolve as more agent types and training workflows are added.
Apache-2.0