40 changes: 40 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,40 @@
name: Publish to PyPI

on:
  push:
    tags:
      - "v*"

jobs:
  build:
    name: Build distribution
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install uv
        uses: astral-sh/setup-uv@v3
      - name: Build sdist and wheel
        run: uv build
      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: dist
          path: dist/

  publish:
    name: Publish to PyPI
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/ifbench
    permissions:
      id-token: write
    steps:
      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          name: dist
          path: dist/
      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
2 changes: 2 additions & 0 deletions .gitignore
@@ -7,6 +7,8 @@
__pycache__/
*.py[cod]
*.egg-info/
dist/
build/

# Eval outputs (optional - uncomment to track results)
# eval/
81 changes: 81 additions & 0 deletions PUBLISHING.md
@@ -0,0 +1,81 @@
# Publishing a new version of `ifbench` to PyPI

Releases are published to PyPI automatically via GitHub Actions
([.github/workflows/release.yml](.github/workflows/release.yml)) using
[PyPI trusted publishing](https://docs.pypi.org/trusted-publishers/) (OIDC).
No API tokens are stored anywhere.

## One-time setup (already done — skip unless re-bootstrapping)

1. Create the project on PyPI by uploading the first release manually
(`uv build && uv publish`), or have a PyPI admin reserve the name.
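2. On PyPI → project → **Publishing** → **Add a new publisher** (if the project does not exist on PyPI yet, add a pending publisher from your account's **Publishing** page instead):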
- Owner: `allenai`
- Repository name: `IFBench`
- Workflow filename: `release.yml`
- Environment name: `pypi`
3. In GitHub → repo **Settings** → **Environments** → create an environment
named `pypi`. Optionally require reviewers / restrict to tags `v*`.

## Cutting a release

1. Make sure `main` is green and contains everything you want shipped.
2. Bump the version in `pyproject.toml`:
```toml
version = "0.3.0"
```
Follow [semver](https://semver.org/): patch for bugfixes, minor for new
verifiers / additive registry entries, major for breaking API changes
(e.g., removing or renaming a registry key).
3. Run the full test suite locally:
```bash
uv sync
uv run python instructions_test.py
```
4. Build locally and sanity-check the wheel contents:
```bash
uv build
uv run python -c "import zipfile; z=zipfile.ZipFile('dist/ifbench-0.3.0-py3-none-any.whl'); print('\n'.join(z.namelist()))"
```
Confirm `ifbench/data/IFBench_test.jsonl` is present.
5. Commit and push:
```bash
git add pyproject.toml
git commit -m "Bump version to 0.3.0"
git push
```
6. Tag and push the tag — this is what triggers the workflow:
```bash
git tag v0.3.0
git push origin v0.3.0
```
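The tag **must** start with `v` to match the workflow's `on.push.tags`
filter, and the rest of the tag should equal the version in `pyproject.toml`
(tag `v0.3.0` for version `0.3.0`). The workflow does not check this; the
version published to PyPI is the one in `pyproject.toml`.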
7. Watch the run at
`https://github.com/allenai/IFBench/actions/workflows/release.yml`.
If the `pypi` environment requires approval, approve it.
8. Verify the release went live:
```bash
uv run --with "ifbench==0.3.0" python -c "from ifbench import instructions_registry; print(len(instructions_registry.INSTRUCTION_DICT))"
```
Expect `83` (or whatever the new count is).
9. Create a GitHub Release from the tag with release notes.

## If the publish fails

- **`File already exists`** on PyPI: you can't overwrite an existing version.
Bump to the next patch, retag, push.
- **OIDC / trusted publisher error**: verify the PyPI publisher config
matches the repo/workflow/environment names exactly, and that the workflow
ran from a tag on the default branch.
- **Tests fail in CI**: fix on `main`, delete the bad tag locally and
remotely (`git tag -d v0.3.0 && git push --delete origin v0.3.0`), then
retag.
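
Picking the next version after a `File already exists` error follows the semver rules from the release steps. A small sketch of that arithmetic, assuming plain `MAJOR.MINOR.PATCH` strings with no pre-release suffix; `bump` is a hypothetical helper, not something `uv` or `ifbench` provides:

```python
def bump(version: str, part: str = "patch") -> str:
    """Return the next semver string for a plain MAJOR.MINOR.PATCH version."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Breaking change: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Additive change: reset patch only.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("0.3.0"))  # next patch release after a failed or broken 0.3.0
```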

## Yanking a bad release

If a published version is broken, do **not** try to overwrite it. Either:

- Yank it on PyPI (project → Manage → Releases → Yank) — the version stays
installable for existing pins but new resolutions skip it.
- Cut a fixed `0.3.1` and tell downstreams to upgrade.
46 changes: 43 additions & 3 deletions README.md
@@ -12,11 +12,51 @@ IFBench consists of two parts:

- New IF-RLVR training constraints: 29 new and challenging constraints, with corresponding verification functions.

## Installation

IFBench is pip-installable. The package ships all OOD verifiers as well as the
classic Google IFEval verifiers, so it can serve as a single source of truth
for both registries.

```bash
pip install ifbench
# or
uv add ifbench
```

To work on IFBench itself, clone and sync with [uv](https://docs.astral.sh/uv/):

```bash
git clone https://github.com/allenai/IFBench.git
cd IFBench
uv sync
```

The package is namespaced under `ifbench` — submodules are `ifbench.instructions`,
`ifbench.classic_instructions`, `ifbench.instructions_registry`, and
`ifbench.instructions_util`. The test jsonl is bundled inside the wheel; access
it via `ifbench.data_path()`.
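
As an illustration of reading the bundled file, here is a minimal sketch. `read_jsonl` is a hypothetical helper, not part of the package, and it assumes only that the file holds one JSON object per line; no particular field names are assumed.

```python
import json
from pathlib import Path
from typing import Dict, List


def read_jsonl(path: Path) -> List[Dict]:
    """Parse one JSON object per non-empty line of a .jsonl file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Usage, with the package installed:
#   import ifbench
#   rows = read_jsonl(ifbench.data_path())
#   print(len(rows), "prompts loaded")
```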

### Programmatic use

```python
from ifbench import instructions_registry

checker_cls = instructions_registry.INSTRUCTION_DICT["keywords:existence"]
checker = checker_cls("keywords:existence")
checker.build_description(keywords=["cat", "dog"])
checker.check_following("I saw a cat and a dog today.") # True
```

`INSTRUCTION_DICT` contains 83 verifiers in total: 25 classic IFEval keys
(prefixed `keywords:`, `language:`, `length_constraints:`,
`detectable_content:`, `detectable_format:`, `combination:`, `startend:`,
`change_case:`, `punctuation:`) plus 58 IFBench OOD keys.
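
The split between the two registries can be reproduced from the key prefixes alone. The sketch below assumes those nine prefixes are used only by the classic IFEval keys; `split_by_origin` is a hypothetical helper and `count:example_rule` is a made-up key used for illustration.

```python
from typing import Dict, Iterable, List

# Prefixes of the classic IFEval keys, as listed above.
CLASSIC_PREFIXES = (
    "keywords:", "language:", "length_constraints:",
    "detectable_content:", "detectable_format:", "combination:",
    "startend:", "change_case:", "punctuation:",
)


def split_by_origin(keys: Iterable[str]) -> Dict[str, List[str]]:
    """Group registry keys into classic IFEval keys and everything else."""
    groups: Dict[str, List[str]] = {"classic": [], "ood": []}
    for key in keys:
        # str.startswith accepts a tuple and matches any of its prefixes.
        origin = "classic" if key.startswith(CLASSIC_PREFIXES) else "ood"
        groups[origin].append(key)
    return groups


print(split_by_origin(["keywords:existence", "count:example_rule"]))
```

With the package installed, passing `instructions_registry.INSTRUCTION_DICT` to `split_by_origin` gives a quick check of the counts stated above.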

## How to run the evaluation
Install the requirements via the requirements.txt file.
You need two jsonl files: the IFBench_test.jsonl file (in the data folder) and your own file with eval prompts and completions (see sample_output.jsonl for an example). Then run:
```bash
uv run python -m run_eval --input_data=IFBench_test.jsonl --input_response_data=sample_output.jsonl --output_dir=eval
```

Note: In the paper we generally report the prompt-level loose accuracy of IFBench. When generating for evaluation, we use a temperature of 0 and adjust the maximum number of generated tokens to the model type: thinking models are allowed more tokens, and their output is post-processed to extract the answer without the reasoning chain.
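
For readers unfamiliar with the metric, the sketch below shows one way prompt-level loose accuracy can be computed. It assumes the loose variants follow the original IFEval convention (the response with its first line, last line, or both removed, each also with asterisks stripped); the function names are hypothetical, and `evaluation_lib.py` remains the authoritative implementation.

```python
from typing import Callable, Iterable, List


def loose_variants(response: str) -> List[str]:
    """Relaxed versions of a response, per the assumed IFEval convention."""
    lines = response.split("\n")
    trimmed = [
        response,
        "\n".join(lines[1:]).strip(),    # first line removed
        "\n".join(lines[:-1]).strip(),   # last line removed
        "\n".join(lines[1:-1]).strip(),  # first and last lines removed
    ]
    # Each trimmed variant is also tried with markdown asterisks stripped.
    return trimmed + [variant.replace("*", "") for variant in trimmed]


def follows_loosely(response: str, check: Callable[[str], bool]) -> bool:
    """An instruction counts as followed if any non-empty variant passes."""
    return any(bool(v.strip()) and check(v) for v in loose_variants(response))


def prompt_level_accuracy(per_prompt: Iterable[List[bool]]) -> float:
    """Fraction of prompts for which every instruction was followed."""
    results = list(per_prompt)
    if not results:
        return 0.0
    return sum(all(flags) for flags in results) / len(results)
```

A prompt with several instructions scores 1 only when all of them pass, so this number is never higher than the instruction-level accuracy on the same responses.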
2 changes: 1 addition & 1 deletion evaluation_lib.py
@@ -20,7 +20,7 @@
import json
from typing import Dict, Optional, Union

from ifbench import instructions_registry


@dataclasses.dataclass
16 changes: 16 additions & 0 deletions ifbench/__init__.py
@@ -0,0 +1,16 @@
"""IFBench: verifiable instruction-following benchmark.

Exposes the verifier classes (`instructions`, `classic_instructions`),
the shared registry (`instructions_registry.INSTRUCTION_DICT`), and a
helper for locating bundled data files.
"""

from importlib import resources
from pathlib import Path

from ifbench import classic_instructions, instructions, instructions_registry, instructions_util # noqa: F401


def data_path(name: str = "IFBench_test.jsonl") -> Path:
    """Return the filesystem path to a bundled data file."""
    return Path(resources.files(__name__) / "data" / name)