allenai · finbarrtimbers · May 13, 2026
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -0,0 +1,40 @@
+name: Publish to PyPI
+
+on:
+  push:
+    tags:
+      - "v*"
+
+jobs:
+  build:
+    name: Build distribution
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install uv
+        uses: astral-sh/setup-uv@v3
+      - name: Build sdist and wheel
+        run: uv build
+      - name: Upload artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: dist
+          path: dist/
+
+  publish:
+    name: Publish to PyPI
+    needs: build
+    runs-on: ubuntu-latest
+    environment:
+      name: pypi
+      url: https://pypi.org/p/ifbench
+    permissions:
+      id-token: write
+    steps:
+      - name: Download artifacts
+        uses: actions/download-artifact@v4
+        with:
+          name: dist
+          path: dist/
+      - name: Publish to PyPI
+        uses: pypa/gh-action-pypi-publish@release/v1
diff --git a/.gitignore b/.gitignore
@@ -7,6 +7,8 @@
 __pycache__/
 *.py[cod]
 *.egg-info/
+dist/
+build/
 
 # Eval outputs (optional - uncomment to track results)
 # eval/

diff --git a/PUBLISHING.md b/PUBLISHING.md
@@ -0,0 +1,81 @@
+# Publishing a new version of `ifbench` to PyPI
+
+Releases are published to PyPI automatically via GitHub Actions
+([.github/workflows/release.yml](.github/workflows/release.yml)) using
+[PyPI trusted publishing](https://docs.pypi.org/trusted-publishers/) (OIDC).
+No API tokens are stored anywhere.
+
+## One-time setup (already done — skip unless re-bootstrapping)
+
+1. Create the project on PyPI by uploading the first release manually
+   (`uv build && uv publish`), or have a PyPI admin reserve the name.
+2. On PyPI → project → **Publishing** → add a trusted publisher for GitHub Actions:
+   - Owner: `allenai`
+   - Repository name: `IFBench`
+   - Workflow filename: `release.yml`
+   - Environment name: `pypi`
+3. In GitHub → repo **Settings** → **Environments** → create an environment
+   named `pypi`. Optionally require reviewers / restrict to tags `v*`.
+
+## Cutting a release
+
+1. Make sure `main` is green and contains everything you want shipped.
+2. Bump the version in `pyproject.toml`:
+   ```toml
+   version = "0.3.0"
+   ```
+   Follow [semver](https://semver.org/): patch for bugfixes, minor for new
+   verifiers / additive registry entries, major for breaking API changes
+   (e.g., removing or renaming a registry key).
+3. Run the full test suite locally:
+   ```bash
+   uv sync
+   uv run python instructions_test.py
+   ```
+4. Build locally and sanity-check the wheel contents:
+   ```bash
+   uv build
+   uv run python -c "import zipfile; z=zipfile.ZipFile('dist/ifbench-0.3.0-py3-none-any.whl'); print('\n'.join(z.namelist()))"
+   ```
+   Confirm `ifbench/data/IFBench_test.jsonl` is present.
+5. Commit and push:
+   ```bash
+   git add pyproject.toml
+   git commit -m "Bump version to 0.3.0"
+   git push
+   ```
+6. Tag and push the tag — this is what triggers the workflow:
+   ```bash
+   git tag v0.3.0
+   git push origin v0.3.0
+   ```
+   The tag **must** start with `v` to match the workflow's `on.push.tags`
+   filter, and **must** match the version in `pyproject.toml`.
+7. Watch the run at
+   `https://github.com/allenai/IFBench/actions/workflows/release.yml`.
+   If the `pypi` environment requires approval, approve it.
+8. Verify the release went live:
+   ```bash
+   uv run --with "ifbench==0.3.0" python -c "from ifbench import instructions_registry; print(len(instructions_registry.INSTRUCTION_DICT))"
+   ```
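+   Expect `83` (or the current registry size). Run this from outside the repo checkout so `python -c` does not import the local `ifbench/` directory.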
+9. Create a GitHub Release from the tag with release notes.
+
+## If the publish fails
+
+- **`File already exists`** on PyPI: you can't overwrite an existing version.
+  Bump to the next patch, retag, push.
+- **OIDC / trusted publisher error**: verify the PyPI publisher config
+  matches the repo/workflow/environment names exactly, and that the workflow
+  ran from a tag on the default branch.
+- **Build job fails in CI**: fix on `main`, delete the bad tag locally and
+  remotely (`git tag -d v0.3.0 && git push --delete origin v0.3.0`), then
+  retag.
+
+## Yanking a bad release
+
+If a published version is broken, do **not** try to overwrite it. Either:
+
+- Yank it on PyPI (project → Manage → Releases → Yank): exact pins such as
+  `ifbench==0.3.0` still install it, but all other resolutions skip it.
+- Cut a fixed `0.3.1` and tell downstreams to upgrade.
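
Step 4 of "Cutting a release" checks the wheel contents with a one-liner that hard-codes the wheel filename. A small helper could do the same check without the filename, shown here as a sketch only. The function name is hypothetical, and the demo runs against an in-memory stand-in archive rather than a real `dist/` wheel.

```python
import io
import zipfile

# Path that step 4 says must be present inside the wheel.
DATA_MEMBER = "ifbench/data/IFBench_test.jsonl"


def wheel_contains(wheel, member: str = DATA_MEMBER) -> bool:
    """Return True when the wheel (a zip archive) has an entry named `member`."""
    with zipfile.ZipFile(wheel) as archive:
        return member in archive.namelist()


# Build a tiny in-memory archive that stands in for a built wheel.
fake_wheel = io.BytesIO()
with zipfile.ZipFile(fake_wheel, "w") as archive:
    archive.writestr("ifbench/__init__.py", "")
    archive.writestr(DATA_MEMBER, "{}\n")

has_data = wheel_contains(fake_wheel)
missing = wheel_contains(fake_wheel, "ifbench/data/missing.jsonl")
print(has_data, missing)
```

Against a real build, the same function accepts a path, for example each file matched by `pathlib.Path("dist").glob("*.whl")`.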
diff --git a/README.md b/README.md
@@ -12,11 +12,51 @@ IFBench consists of two parts:
 
 - New IF-RLVR training constraints: 29 new and challenging constraints, with corresponding verification functions. 
 
+## Installation
+
+IFBench is pip-installable. The package ships all OOD verifiers as well as the
+classic Google IFEval verifiers, so it can serve as a single source of truth
+for both registries.
+
+```bash
+pip install ifbench
+# or
+uv add ifbench
+```
+
+To work on IFBench itself, clone and sync with [uv](https://docs.astral.sh/uv/):
+
+```bash
+git clone https://github.com/allenai/IFBench.git
+cd IFBench
+uv sync
+```
+
+The package is namespaced under `ifbench` — submodules are `ifbench.instructions`,
+`ifbench.classic_instructions`, `ifbench.instructions_registry`, and
+`ifbench.instructions_util`. The test jsonl is bundled inside the wheel; access
+it via `ifbench.data_path()`.
+
+### Programmatic use
+
+```python
+from ifbench import instructions_registry
+
+checker_cls = instructions_registry.INSTRUCTION_DICT["keywords:existence"]
+checker = checker_cls("keywords:existence")
+checker.build_description(keywords=["cat", "dog"])
+checker.check_following("I saw a cat and a dog today.")  # True
+```
+
+`INSTRUCTION_DICT` contains 83 verifiers in total: 25 classic IFEval keys
+(prefixed `keywords:`, `language:`, `length_constraints:`,
+`detectable_content:`, `detectable_format:`, `combination:`, `startend:`,
+`change_case:`, `punctuation:`) plus 58 IFBench OOD keys.
+
 ## How to run the evaluation
-Install the requirements via the requirements.txt file.
 You need two jsonl files, one the IFBench_test.jsonl file (in the data folder) and one your file with eval prompts and completions (see sample_output.jsonl as an example). Then run:
-```
-python3 -m run_eval --input_data=IFBench_test.jsonl --input_response_data=sample_output.jsonl --output_dir=eval
+```bash
+uv run python -m run_eval --input_data=IFBench_test.jsonl --input_response_data=sample_output.jsonl --output_dir=eval
 ```
 
 Note: In the paper we generally report the prompt-level loose accuracy of IFBench. When we generate for evaluation, we use a temperature of 0 and adjust the maximum generated tokens depending on the model type, i.e. for thinking models we allow to generate more tokens and we then process the output to extract the answer without the reasoning chains.
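
The README's registry breakdown (25 classic IFEval keys plus 58 OOD keys, 83 in total) can be checked by splitting keys on the classic prefixes it lists. The sketch below assumes that no OOD key reuses a classic prefix. It runs on a stand-in dictionary because `ifbench` may not be installed, and the key `hypothetical_ood:example` is invented for illustration.

```python
# Classic IFEval key prefixes, as listed in the README text of this PR.
CLASSIC_PREFIXES = (
    "keywords:", "language:", "length_constraints:", "detectable_content:",
    "detectable_format:", "combination:", "startend:", "change_case:",
    "punctuation:",
)


def split_registry(registry):
    """Split registry keys into (classic, ood) lists based on their prefix."""
    classic = sorted(k for k in registry if k.startswith(CLASSIC_PREFIXES))
    ood = sorted(k for k in registry if not k.startswith(CLASSIC_PREFIXES))
    return classic, ood


# Stand-in for instructions_registry.INSTRUCTION_DICT; the values are placeholders.
stand_in = {
    "keywords:existence": object,
    "punctuation:no_comma": object,
    "hypothetical_ood:example": object,  # invented key, not a real IFBench entry
}

classic, ood = split_registry(stand_in)
print(len(classic), len(ood))
```

With the package installed, passing `instructions_registry.INSTRUCTION_DICT` instead of the stand-in should give lists of length 25 and 58 if the README's counts hold.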

diff --git a/evaluation_lib.py b/evaluation_lib.py
@@ -20,7 +20,7 @@
 import json
 from typing import Dict, Optional, Union
 
-import instructions_registry
+from ifbench import instructions_registry
 
 
 @dataclasses.dataclass

diff --git a/ifbench/__init__.py b/ifbench/__init__.py
@@ -0,0 +1,16 @@
+"""IFBench: verifiable instruction-following benchmark.
+
+Exposes the verifier classes (`instructions`, `classic_instructions`),
+the shared registry (`instructions_registry.INSTRUCTION_DICT`), and a
+helper for locating bundled data files.
+"""
+
+from importlib import resources
+from pathlib import Path
+
+from ifbench import classic_instructions, instructions, instructions_registry, instructions_util  # noqa: F401
+
+
+def data_path(name: str = "IFBench_test.jsonl") -> Path:
+    """Return the filesystem path to a bundled data file."""
+    return Path(resources.files(__name__) / "data" / name)
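
`data_path()` wraps the result of `resources.files()` in `Path`, which works when the package is installed as ordinary files on disk. Callers that only need the file contents can read through the `importlib.resources` traversable directly and skip the filesystem path. The sketch below uses a hypothetical helper name and demonstrates on the standard library's `json` package, since `ifbench` may not be installed where this runs.

```python
from importlib import resources


def read_resource_text(package: str, *parts: str) -> str:
    """Read a packaged file as text without requiring a filesystem path."""
    resource = resources.files(package)
    for part in parts:
        # Traversable objects support `/` for descending into children.
        resource = resource / part
    return resource.read_text(encoding="utf-8")


# Demo: read the source of the standard library's json package.
text = read_resource_text("json", "__init__.py")
print(len(text) > 0)
```

The analogous call for this package would be `read_resource_text("ifbench", "data", "IFBench_test.jsonl")`, assuming the data file sits at `ifbench/data/` as PUBLISHING.md describes.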