22 changes: 17 additions & 5 deletions CHANGELOG.md
@@ -7,20 +7,32 @@ Version numbers follow [SemVer](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added (not yet released)

- **Hybrid retrieval** — opt-in BM25 + dense-vector retrieval fused with
Reciprocal Rank Fusion. Local `sentence-transformers` embedder by default
(no API key), pluggable `Embedder` protocol. `orc workspace create
--embeddings`, `orc workspace embed` backfill. BM25 stays the default.
- **`orc propose`** — stage an allow-listed effect for human approval from the
CLI (the approval queue's producer surface); `orc approve list --json`.
- **`orc report <run_id>...`** — render trace(s) into a self-contained HTML
artifact reusing the trace design language.
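
For readers unfamiliar with Reciprocal Rank Fusion, the fusion step named in the hybrid-retrieval entry can be sketched roughly as follows. This is an illustration only, not Orc's implementation: the function name `reciprocal_rank_fusion`, the chunk IDs, and the constant `k=60` (a commonly used default for RRF) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    `k` is a smoothing constant; 60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical result lists from a BM25 index and a dense-vector index.
bm25_hits = ["c3", "c1", "c7"]
dense_hits = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Chunks that appear in both lists accumulate two contributions, so they tend to outrank chunks found by only one retriever, without the two score scales having to be comparable.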

### Planned

- `gads` directive (Google Ads agentic analysis: lens-based decomposition,
read-only MCP integration, evidence-bound recommendation verification).
- `orc eval consistency|perturb|retrieval|regression` reliability commands.
- Voyage-AI or local-`sentence-transformers` embeddings + hybrid retrieval (RRF over BM25 + vector).
- Voyage-AI / OpenAI embedding backends behind the existing `Embedder` protocol.
- Hosted runtime (scheduled triggers, web dashboard, team workspaces).
- Decomposition + arithmetic combined for DROP-shaped multi-step claims.

## [0.2.0] — 2026-06-11
## [0.2.0] — unreleased

First PyPI release. The distribution is named **`orc-ai`** — `orc` is taken on
PyPI by an unrelated project — but the import package (`import orc`) and the
CLI command (`orc`) are unchanged.
Packaged for PyPI as **`orc-ai`** (`orc` is taken by an unrelated project);
the import package (`import orc`) and CLI command (`orc`) are unchanged. The
release workflow publishes on a `v0.2.0` tag once the trusted publisher is
configured — not yet tagged or published.

### Added

15 changes: 13 additions & 2 deletions README.md
@@ -19,6 +19,16 @@ Bind every claim to evidence you own. Cite real sources only. Replay every decis
| **Replay** | Every call writes a trace: retrieval set, every LLM call's tokens and cache hits, the structured output. LLM sampling is pinned to `temperature=0` and the corpus is pinned by version, so `orc replay <run_id>` re-issues the original decision against the same snapshot rather than a fresh sample (best-effort against residual model nondeterminism). |
| **Approval** | Anything that would mutate the outside world is routed to an approval queue first. Skills can only *propose* a typed, schema-validated, allow-listed action; a **separate process** holding the write credentials — which the analysis plane never sees — carries out human-approved actions and records the result, either one-shot (`orc execute`) or via the auto-drain daemon (`orc worker`, with leasing + idempotency + retry/backoff). *(Hosted row-level authz per plane is Phase 3; see [docs/design/0001-isolated-write-paths.md](docs/design/0001-isolated-write-paths.md).)* |

### What the gate does and does not catch

Orc's guarantee is **"every claim is traceable to a cited source"** — not "every claim is true." Three failure modes, three different answers:

| Failure mode | Coverage |
|---|---|
| **Hallucinated citations** — the model cites a chunk that doesn't exist | **Caught reliably.** Fabricated chunk IDs are filtered structurally before the verdict ships; a verdict left with no valid grounding is downgraded to `not_found`. |
| **Unsupported claims** — the model says `supported` when the cited evidence doesn't actually back the claim | **Caught partially.** This is an LLM-judge decision, with LLM-judge limits — the faithfulness benchmark (F1 0.864) is the measured error rate, not a guarantee. |
| **Faithful-but-wrong** — the corpus itself is wrong, stale, or poisoned, and the claim cites it faithfully | **Not caught.** Orc verifies against your corpus, not against the world. Mitigate with corpus provenance and freshness controls: ingest only sources you trust (sha256 + source path are recorded automatically) and re-verify with `orc replay --live` after corpus updates. |
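
The "caught reliably" row describes a structural check rather than a model judgment. A minimal sketch of that idea, assuming a hypothetical `Verdict` type and `enforce_grounding` helper (neither name is taken from Orc's source):

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Verdict:
    """Hypothetical verdict shape: a label plus the chunk IDs the model cited."""

    label: str
    cited_chunk_ids: tuple[str, ...]


def enforce_grounding(verdict: Verdict, retrieved_ids: set[str]) -> Verdict:
    """Drop citations that were never retrieved; downgrade if none survive."""
    valid = tuple(cid for cid in verdict.cited_chunk_ids if cid in retrieved_ids)
    if not valid:
        # No real chunk backs the verdict, so it cannot ship as grounded.
        return replace(verdict, label="not_found", cited_chunk_ids=())
    return replace(verdict, cited_chunk_ids=valid)


retrieved = {"c1", "c2"}
print(enforce_grounding(Verdict("supported", ("c1", "ghost")), retrieved))
print(enforce_grounding(Verdict("supported", ("ghost",)), retrieved))
```

Because the check is a set-membership test against the retrieval set, it needs no LLM call. That is why this failure mode can be filtered deterministically while the "unsupported claims" row cannot.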

Built for **research analysts, editorial teams, legal & compliance, agentic-workflow engineers** — anyone whose AI work product has to survive a second reviewer six months later.

## Quickstart
@@ -69,6 +79,7 @@ orc verify "<claim>" [-w <name>] verify a single claim
orc verify --file <path>              extract + verify every claim in a draft
orc verify --url <url>                same, from a URL
orc research "<topic>" [-w <name>]    corpus-grounded synthesis with citations
orc report <run_id>... [-o out.html]  render trace(s) as a shareable HTML report
orc trace show <run_id>               full trace JSON
orc trace list [-w <name>]            recent runs
orc replay <run_id> [--live]          re-execute a recorded run
@@ -165,7 +176,7 @@ git clone https://github.com/Thormatt/orc.git
cd orc
uv sync --extra dev

uv run pytest # 260+ tests, <5s
uv run pytest # 360+ tests, <5s
uv run ruff check src tests
uv run orc --version
```
@@ -174,8 +185,8 @@ Live LLM tests are gated behind `ORC_TEST_ALLOW_LIVE_LLM=1` and require a real A

## Roadmap

- Embedding-based retrieval (hybrid BM25 + vector via `sqlite-vec`)
- OCR for scanned/image-only PDFs
- Voyage/OpenAI embedding backends (the `Embedder` protocol is pluggable; local `sentence-transformers` hybrid retrieval shipped as opt-in)
- Long-running directives (scheduled triggers, cloud execution)
- `marketing` directive (assisted-only at first, autonomous behind approval gates later)
- `legal` / `gads` / `code-review` directives — same runtime, new skill packages
2 changes: 1 addition & 1 deletion docs/business/roadmap.md
@@ -7,7 +7,7 @@ validated against real customer demand. Stage 0 is "land 3 pilots and learn
what to charge for"; everything past Stage 1 will be revised based on what
those pilots teach us.

Last updated: 2026-05-19. Code state: v0.1.4 (F1 = 0.864 on a stratified
Last updated: 2026-05-19. Code state: v0.2.0 — hybrid retrieval, PDF ingest, propose/report CLIs shipped (unreleased). Benchmark F1 = 0.864 on a stratified
504-item HaluBench subsample — competitive with Lynx-70B's published
home-court 0.85, not a same-set head-to-head; see
[competitive.md](../positioning/competitive.md) for caveats).
13 changes: 12 additions & 1 deletion docs/compliance/eu-ai-act.md
@@ -261,6 +261,17 @@ Honest framing matters here.
obligations fall on Anthropic; Orc passes through whatever transparency
information the upstream provider supplies.

5. **Orc verifies traceability, not truth.** The guarantee is "every claim is
traceable to a cited source," not "every claim is true." Three failure
modes, three different coverages: hallucinated citations are caught
reliably (fabricated chunk IDs are structurally filtered, ungrounded
verdicts downgraded); unsupported claims are caught partially (an
LLM-judge decision, with LLM-judge error rates — see the faithfulness
benchmarks); faithful-but-wrong corpus content is not caught at all — if
the corpus is wrong, stale, or poisoned, a claim that cites it faithfully
will pass. The mitigation is the Article 10 data-governance work above:
corpus provenance, freshness, and review remain the deployer's obligation.

---

## Runbook for deployers
@@ -336,5 +347,5 @@ For procurement, conformity-assessment, or compliance-pilot inquiries:
[thormatt@gmail.com](mailto:thormatt@gmail.com)

Source: [github.com/Thormatt/orc](https://github.com/Thormatt/orc) · Last updated:
2026-05-17. This document is part of the repository and is versioned with the
2026-06-12. This document is part of the repository and is versioned with the
runtime it describes.
10 changes: 9 additions & 1 deletion docs/positioning/competitive.md
@@ -260,6 +260,14 @@ Honest gaps, kept current so prospects know what they're buying:
will publish ours once the HHEM tokenizer-load issue is resolved.
- **No multi-tenancy or team workspace primitives in 0.1.x.** Each
workspace is owned by one filesystem.
- **Truth of the corpus.** The runtime guarantee is "every claim is
traceable to a cited source," not "every claim is true." Hallucinated
citations are caught structurally; unsupported claims are caught at
LLM-judge accuracy (the F1 numbers above); faithful-but-wrong corpus
content — wrong, stale, or poisoned sources cited faithfully — is not
caught at all. Corpus provenance and freshness controls are the
mitigation. Post-hoc judges share the same ceiling: they score
consistency with the provided context, not the truth of the context.

---

@@ -291,4 +299,4 @@ Updates land via PR with the rationale captured in the commit message.
The latest reproducible benchmark numbers always live in
[`docs/benchmarks/`](../benchmarks/).

Last updated: 2026-05-19 (Orc 0.1.4).
Last updated: 2026-06-12 (Orc 0.2.0).
2 changes: 2 additions & 0 deletions src/orc/cli.py
@@ -8,6 +8,7 @@
from orc.cli_commands import mcp as mcp_cmd
from orc.cli_commands import propose as propose_cmd
from orc.cli_commands import replay as replay_cmd
from orc.cli_commands import report as report_cmd
from orc.cli_commands import research as research_cmd
from orc.cli_commands import search as search_cmd
from orc.cli_commands import trace as trace_cmd
@@ -29,6 +30,7 @@ def main() -> None:
main.add_command(research_cmd.research_command)
main.add_command(trace_cmd.trace_group)
main.add_command(replay_cmd.replay_command)
main.add_command(report_cmd.report_command)
main.add_command(approve_cmd.approve_group)
main.add_command(propose_cmd.propose_command)
main.add_command(execute_cmd.execute_command)
51 changes: 51 additions & 0 deletions src/orc/cli_commands/report.py
@@ -0,0 +1,51 @@
"""`orc report RUN_ID...` — render traces as a self-contained HTML artifact."""

from __future__ import annotations

from pathlib import Path

import click

from orc.errors import TraceNotFoundError
from orc.rendering.trace_html import build_report_html
from orc.storage.trace_store import load_trace


@click.command("report")
@click.argument("run_ids", nargs=-1, required=True)
@click.option(
    "-o",
    "--output",
    "output_path",
    type=click.Path(dir_okay=False, writable=True, path_type=Path),
    default=None,
    help="Write the report to PATH instead of stdout.",
)
@click.option(
    "--open",
    "open_after",
    is_flag=True,
    help="Open the written report in the default browser (requires -o).",
)
def report_command(
    run_ids: tuple[str, ...],
    output_path: Path | None,
    open_after: bool,
) -> None:
    """Render one or more run traces as a self-contained HTML report."""
    # Fail before rendering: there is no file to open when writing to stdout,
    # and silently ignoring the flag would hide a typo in the invocation.
    if open_after and output_path is None:
        raise click.ClickException("--open requires -o/--output (stdout cannot be opened)")
    try:
        traces = [load_trace(run_id) for run_id in run_ids]
    except TraceNotFoundError as exc:
        raise click.ClickException(str(exc)) from exc
    html_doc = build_report_html(traces)
    if output_path is None:
        click.echo(html_doc)
        return
    output_path.write_text(html_doc, encoding="utf-8")
    click.echo(str(output_path))
    if open_after:
        click.launch(str(output_path))
1 change: 1 addition & 0 deletions src/orc/rendering/__init__.py
@@ -0,0 +1 @@
"""Rendering: turn persisted trace JSON into human-facing artifacts."""
7 changes: 7 additions & 0 deletions src/orc/rendering/assets/__init__.py
@@ -0,0 +1,7 @@
"""Static assets (trace.css, trace.js) inlined into generated reports.

A real package (not bare data files) so importlib.resources can locate the
assets from a wheel, a zipapp, or an editable install alike. trace.css and
trace.js are verbatim copies of site/trace.css and site/trace.js — the report
artifact and the public site must render traces identically.
"""
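
The docstring above explains why the assets live in a real package. As a rough sketch of how such assets might be read and inlined, assuming hypothetical helpers `load_asset` and `inline_assets` (this diff does not show how `build_report_html` actually does it):

```python
from importlib import resources


def load_asset(package: str, name: str) -> str:
    """Read a text asset shipped inside an importable package.

    resources.files() resolves the package through the import system, so the
    same call works for a wheel, a zipapp, or an editable install.
    """
    return resources.files(package).joinpath(name).read_text(encoding="utf-8")


def inline_assets(body_html: str, css: str, js: str) -> str:
    """Wrap a body fragment in a standalone document with CSS and JS inlined."""
    return (
        '<!doctype html>\n<html><head><meta charset="utf-8">'
        f"<style>{css}</style></head>"
        f"<body>{body_html}<script>{js}</script></body></html>"
    )
```

In the layout described above, the call would presumably look like `load_asset("orc.rendering.assets", "trace.css")`. Note that this sketch does no escaping, so a script asset containing a literal `</script>` would need handling before inlining.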