A 24/7 mispricing detector for Kalshi speaker mention markets. Built over three months as a research project on whether historical phrase frequency data can be used to systematically trade against market sentiment.
Live P&L: -$16.40 across 1,115 resolved bets (47% win rate; Brier skill score, BSS, of -0.47 relative to the market mid). The system is being open-sourced because the data pipeline and architecture are more interesting than the returns, and the underperformance itself has useful lessons.
Not financial advice. See DISCLAIMER.md.
| Source | Volume |
|---|---|
| Kalshi resolved outcomes (training set) | 12,490 across 17 speaker categories |
| Speaker corpus transcripts | 337 files across 7 speakers |
| Live price snapshots recorded | 547,265 |
| Scored action cards produced | 161,312 |
| Phrase hits detected in transcripts | 39,487 |
| Bets matched to resolved outcomes | 1,115 |
| Beta-Binomial posteriors maintained | 1,761 |
Outcome breakdown:
| Category | Outcomes |
|---|---|
| NBA broadcasts | 2,747 |
| Auto-detected | 3,119 |
| NCAAB broadcasts | 1,886 |
| Trump | 2,025 |
| MLB broadcasts | 712 |
| MMA/UFC broadcasts | 504 |
| Mamdani | 425 |
| Leavitt | 338 |
| Hochul | 243 |
| Newsom | 126 |
| Carney | 96 |
| Starmer | 89 |
| Melania | 84 |
| Powell/Fed | 54 |
| Homan | 25 |
| AOC | 14 |
Corpus:
| Speaker | Transcripts |
|---|---|
| Trump | 179 (rallies, briefings, addresses, signings, interviews, 2022–2026) |
| Leavitt | 85 press briefings |
| Powell | 33 Fed press conferences |
| Mamdani | 27 events |
| Starmer | 8 |
| Carney | 3 |
| Homan | 2 |
Corpus files are not included in the repo (copyright). data/corpus/README.md documents public-domain sources.
GET /markets on the Kalshi public API is polled every 30 seconds. The free tier allows 20 req/s unauthenticated; the Advanced API (30 req/s plus WebSocket streaming) is available free by application. Every price change writes to market_snapshots; unchanged snapshots are deduplicated on the write path.
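The dedupe-on-write behaviour can be sketched as follows. This is a minimal illustration rather than the project's code: the column names (`ticker`, `yes_ask`, `no_ask`, `ts`), the integer-cent prices, and the `make_db` / `write_snapshot` helpers are assumptions. Only the `market_snapshots` table name comes from the text above.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the real database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE market_snapshots ("
        "ticker TEXT, yes_ask INTEGER, no_ask INTEGER, ts INTEGER)"
    )
    return conn


def write_snapshot(conn, ticker, yes_ask, no_ask, ts) -> bool:
    """Insert a snapshot only if prices changed since the last stored row.

    Prices are integer cents so the equality check is exact.
    Returns True when a row was written, False when it was deduplicated.
    """
    last = conn.execute(
        "SELECT yes_ask, no_ask FROM market_snapshots "
        "WHERE ticker = ? ORDER BY ts DESC LIMIT 1",
        (ticker,),
    ).fetchone()
    if last == (yes_ask, no_ask):
        return False  # unchanged: skip the write
    conn.execute(
        "INSERT INTO market_snapshots (ticker, yes_ask, no_ask, ts) "
        "VALUES (?, ?, ?, ?)",
        (ticker, yes_ask, no_ask, ts),
    )
    conn.commit()
    return True


if __name__ == "__main__":
    db = make_db()
    # Three polls; the second repeats the first and should be skipped.
    for ts, (yes, no) in enumerate([(42, 60), (42, 60), (43, 59)]):
        print(ts, write_snapshot(db, "DEMO-TICKER", yes, no, ts))
```

Comparing against only the most recent row per ticker keeps the check cheap, at the cost of one extra read per poll.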
Config: KALSHI_API_KEY_ID and KALSHI_API_KEY_PATH for the authenticated tier.
Live transcripts typically require JavaScript rendering. Direct HTTP fails on most sources because the transcript updates via JS as the speaker talks. The ingestor has three modes:
- Direct HTTP: httpx with verify=True. Used for static sources.
- OpenClaw browser relay: 3-step pipeline, openclaw browser start → open <url> → evaluate --fn "() => document.body.innerText". Returns rendered DOM text.
- Fallback chain: tries Direct HTTP first, falls back to OpenClaw on failure. Configurable order via TRANSCRIPT_SOURCE.
URL refs are validated against a ^https?:// allowlist before any subprocess call. Format-string metacharacters ({, }) are rejected to prevent command injection via OPENCLAW_SCRAPE_CMD. OPENCLAW_BROWSER_PROFILE is validated against ^[a-zA-Z0-9_-]{1,64}$.
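A minimal sketch of these checks, assuming they run before any command is built. The function names are hypothetical; the two patterns are the ones quoted above. `re.fullmatch` is used for the profile name because `$` in Python also matches just before a trailing newline.

```python
import re

URL_RE = re.compile(r"^https?://")
PROFILE_RE = re.compile(r"[a-zA-Z0-9_-]{1,64}")


def validate_url(url: str) -> str:
    """Reject anything that is not http(s) or that carries format braces."""
    if not URL_RE.match(url):
        raise ValueError("URL must start with http:// or https://")
    if "{" in url or "}" in url:
        raise ValueError("format-string metacharacters are not allowed")
    return url


def validate_profile(name: str) -> str:
    """Allow only short alphanumeric, underscore, or hyphen profile names."""
    if not PROFILE_RE.fullmatch(name):
        raise ValueError("invalid browser profile name")
    return name


if __name__ == "__main__":
    print(validate_url("https://www.whitehouse.gov/briefings"))
    print(validate_profile("scraper_01"))
    for bad in ("file:///etc/passwd", "https://example.com/{0}"):
        try:
            validate_url(bad)
        except ValueError as exc:
            print("rejected:", bad, "->", exc)
```

Validation of this kind is most effective when the subprocess is also invoked with an argument list rather than a shell string, so that no shell ever interprets the URL.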
| Source | Coverage | Notes |
|---|---|---|
| whitehouse.gov | Trump, Leavitt | Public domain (US Gov work), primary source |
| federalreserve.gov | Powell | Public domain, FOMC press conferences |
| Rev.com | Trump rallies | Copyrighted, parsed via scripts/rev_transcript_cleaner.py |
| factba.se | Trump historical | Supplementary |
| C-SPAN | Mixed | Older transcripts |
File naming: data/corpus/<speaker>/<event_type>_YYYY-MM-DD_NN.txt. Event type is parsed from filename for event-specific base rate stratification (rally, briefing, signing, presser, remarks, interview, address).
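The naming convention lends itself to a small parser. The sketch below is illustrative only: `CorpusFile` and `parse_corpus_path` are hypothetical names, and it assumes `NN` is exactly two digits and that event types are lowercase.

```python
import re
from datetime import date
from pathlib import PurePosixPath
from typing import NamedTuple, Optional

# Event types listed in the naming convention.
EVENT_TYPES = {
    "rally", "briefing", "signing", "presser",
    "remarks", "interview", "address",
}
NAME_RE = re.compile(
    r"^(?P<event_type>[a-z]+)_(?P<date>\d{4}-\d{2}-\d{2})_(?P<seq>\d{2})\.txt$"
)


class CorpusFile(NamedTuple):
    speaker: str
    event_type: str
    event_date: date
    seq: int


def parse_corpus_path(path: str) -> Optional[CorpusFile]:
    """Return parsed metadata, or None if the path breaks the convention."""
    p = PurePosixPath(path)
    m = NAME_RE.match(p.name)
    if m is None or m["event_type"] not in EVENT_TYPES:
        return None
    try:
        event_date = date.fromisoformat(m["date"])
    except ValueError:
        return None  # well-formed digits but not a real calendar date
    return CorpusFile(p.parent.name, m["event_type"], event_date, int(m["seq"]))


if __name__ == "__main__":
    print(parse_corpus_path("data/corpus/trump/rally_2024-10-05_01.txt"))
    print(parse_corpus_path("data/corpus/trump/notes.txt"))
```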
make fetch-outcomes hits Kalshi /settlements. 12,490 resolved markets in the current training set. This trains the Bayesian posteriors and calibrates the legacy scorer's Platt scaling.
Polymarket (make fetch-poly) — Fuzzy-matched against Kalshi markets using phrase overlap and event title similarity with a confidence score per match. Price divergence is a signal for the legacy scorer.
Wallet flow (make fetch-wallet) — Polymarket trade flow API. Tracks conviction_weighted_flow (price-distance weighted) and extreme_bet_count. The adaptive signal learner drove this weight close to zero in production.
X/news (make fetch-x) — Configured watchlist of political journalists and official accounts via the X API. Boost factor for phrases in the news. Same neutralization pattern as wallet flow.
White House schedule (make fetch-wh) — Official daily schedule. Event type (signing vs briefing vs rally) materially changes phrase frequency profiles.
One Python process with five concurrent service loops:
| Service | Function |
|---|---|
| KalshiWatcher | Polls Kalshi API every 30s, writes market_snapshots |
| TranscriptIngestor | Polls configured transcript URLs, runs phrase matcher, writes phrase_hits |
| Scorer | Reads snapshots + hits, runs scoring model, emits action cards |
| MaintenanceRunner | Scheduled background refresh (markets, outcomes, Polymarket, calibration) |
| Watchdog | Monitors snapshot staleness; exits for launchd auto-restart if data goes cold |
Each loop has a dedicated SQLite connection (shared connections across threads cause WAL corruption — check_same_thread=False suppresses the exception but provides no locking). Auto-healing logic on startup checkpoints and removes orphaned WAL/SHM sidecar files.
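The connection-per-loop rule can be illustrated with plain threads. This is a sketch under assumptions, not the runner's code: the `heartbeats` table and all function names are hypothetical, and threads stand in for the service loops. WAL mode is set once at initialisation because it is a persistent property of the database file, while `synchronous` is per-connection.

```python
import os
import sqlite3
import tempfile
import threading

SERVICES = [
    "KalshiWatcher", "TranscriptIngestor", "Scorer",
    "MaintenanceRunner", "Watchdog",
]


def init_db(db_path: str) -> None:
    """One-time setup: enable WAL and create a demo table."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS heartbeats (service TEXT)")
    conn.commit()
    conn.close()


def service_loop(db_path: str, name: str) -> None:
    """Each loop opens its own connection inside the thread that uses it."""
    conn = sqlite3.connect(db_path, timeout=30.0)  # wait on writer contention
    conn.execute("PRAGMA synchronous=FULL")
    try:
        conn.execute("INSERT INTO heartbeats (service) VALUES (?)", (name,))
        conn.commit()
    finally:
        conn.close()


def run_demo() -> int:
    """Start one thread per service and return the number of rows written."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.db")
        init_db(db_path)
        threads = [
            threading.Thread(target=service_loop, args=(db_path, name))
            for name in SERVICES
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        conn = sqlite3.connect(db_path)
        count = conn.execute("SELECT COUNT(*) FROM heartbeats").fetchone()[0]
        conn.close()
        return count


if __name__ == "__main__":
    print("rows written:", run_demo())
```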
Boundary-aware regex with Unicode normalization. Maintains a 10-token window before each match to detect:
- Negation — "will not say", "refused to mention", etc. Hits are recorded but flagged.
- Attribution — "he said X", "according to", etc. Hits flagged as non-primary-speaker.
Phrase dictionary is assembled from config/base_rates.yaml + config/base_rates_auto.yaml. ~1,800 tracked phrases.
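A simplified sketch of boundary-aware matching with a look-back window. The cue lists contain only the examples quoted above, and the function names and returned dictionary shape are hypothetical. Offsets refer to the normalized text, which can differ in length from the raw input.

```python
import re
import unicodedata

NEGATION_CUES = ("will not say", "refused to mention")
ATTRIBUTION_CUES = ("he said", "according to")
WINDOW_TOKENS = 10  # tokens inspected before each match


def normalize(text: str) -> str:
    """Unicode-normalize and case-fold so comparisons are consistent."""
    return unicodedata.normalize("NFKC", text).casefold()


def find_hits(text: str, phrase: str) -> list:
    """Return one record per whole-word match, flagged by preceding context."""
    norm = normalize(text)
    # Lookarounds stop 'tariff' from matching inside 'tariffs'.
    pattern = re.compile(r"(?<!\w)" + re.escape(normalize(phrase)) + r"(?!\w)")
    hits = []
    for m in pattern.finditer(norm):
        window = " ".join(norm[: m.start()].split()[-WINDOW_TOKENS:])
        hits.append({
            "start": m.start(),
            "negated": any(cue in window for cue in NEGATION_CUES),
            "attributed": any(cue in window for cue in ATTRIBUTION_CUES),
        })
    return hits


if __name__ == "__main__":
    for sample in (
        "We love the Tariff.",
        "He will not say tariff today.",
        "According to the senator, tariff policy failed.",
        "Tariffs are great.",
    ):
        print(sample, "->", find_hits(sample, "tariff"))
```

Hits are flagged rather than dropped, which matches the description above: a negated or attributed mention is still recorded so downstream logic can decide how to treat it.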
SQLite, WAL mode, synchronous=FULL. Primary tables:
markets — market definitions
market_snapshots — price history (deduplicated on unchanged)
phrase_hits — transcript detection events
action_cards — every scored card with full model state
outcome_reviews — bets matched to resolutions with realized P&L
events — scheduled/live/ended state machine
BayesianScorer (USE_BAYESIAN_SCORER=1, default)
Hierarchical Beta-Binomial posteriors per (speaker, phrase) pair, trained from the 12,490 outcomes. Decision logic:
if yes_ask < ci_low: → BUY_YES
if (1 - no_ask) > ci_high: → BUY_NO
else: → WATCH
Kelly sizing on posterior mean and edge. For n_obs < 3, falls back to the speaker-level pooled prior (e.g., Trump overall ~0.45). Structural gates handle settled markets, thin books, wide spreads.
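The decision rule and sizing can be sketched end to end. Several details below are assumptions rather than facts about `bayesian_scorer.py`: the 90% interval level, the Monte Carlo interval estimate (used here only to stay within the standard library), the Kelly formula for a contract that pays 1 on a win, and every name in the code. Prices are fractions of a dollar.

```python
import random
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def updated(self, yes_count: int, n_obs: int) -> "BetaPosterior":
        """Conjugate Beta-Binomial update with yes_count successes in n_obs."""
        return BetaPosterior(self.alpha + yes_count,
                             self.beta + n_obs - yes_count)


def credible_interval(post, level=0.90, draws=20_000, seed=0):
    """Equal-tailed interval estimated by sampling (level is assumed)."""
    rng = random.Random(seed)
    xs = sorted(rng.betavariate(post.alpha, post.beta) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    return xs[int(tail * draws)], xs[int((1.0 - tail) * draws) - 1]


def decide(yes_ask, no_ask, ci_low, ci_high) -> str:
    """Buy only when the market price sits outside the credible interval."""
    if yes_ask < ci_low:
        return "BUY_YES"
    if (1 - no_ask) > ci_high:
        return "BUY_NO"
    return "WATCH"


def kelly_fraction(p_win: float, price: float) -> float:
    """Kelly stake for a contract bought at `price` that pays 1 on a win."""
    if not 0.0 < price < 1.0:
        return 0.0
    return max(0.0, (p_win - price) / (1.0 - price))


def score(speaker_prior, yes_count, n_obs, yes_ask, no_ask, min_obs=3):
    """Fall back to the pooled speaker prior when the phrase has thin data."""
    post = (speaker_prior if n_obs < min_obs
            else speaker_prior.updated(yes_count, n_obs))
    ci_low, ci_high = credible_interval(post)
    action = decide(yes_ask, no_ask, ci_low, ci_high)
    if action == "BUY_YES":
        stake = kelly_fraction(post.mean, yes_ask)
    elif action == "BUY_NO":
        stake = kelly_fraction(1.0 - post.mean, no_ask)
    else:
        stake = 0.0
    return action, stake


if __name__ == "__main__":
    trump_prior = BetaPosterior(4.5, 5.5)  # mean 0.45, hypothetical strength
    print(score(trump_prior, yes_count=9, n_obs=10, yes_ask=0.30, no_ask=0.72))
```

In practice the stake would usually be a capped fraction of full Kelly, and the structural gates mentioned above would run before any of this.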
ScoringEngine (USE_BAYESIAN_SCORER=0, legacy)
p_literal = base_rate × time_decay × news_pressure × x_buzz × llm_boost × event_llm
p_calibrated = platt_scale(p_literal)
ev = p_calibrated - market_price
- time_decay: empirical hazard rates per phrase, not static exponential
- llm_boost: GPT-5-mini analyzes phrase in event context, 45-minute schedule
- event_llm: per-event analysis producing p_floor and p_override values
- Platt scaling + 17-gate decision stack, each gate backed by live outcome data in brain/08_DECISIONS_LOG.md
Legacy model cost ~$50/month in OpenAI calls. Production Brier score: 0.352 (vs market mid: 0.238).
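The legacy formula chain can be written out as a short sketch. The Platt variant here (a logistic fit on the log-odds of `p_literal`, with coefficients `a` and `b`), the clamp to [0, 1], and the function names are all assumptions; the actual coefficients and the 17-gate stack are not reproduced.

```python
import math


def platt_scale(p: float, a: float, b: float) -> float:
    """Logistic recalibration applied to the log-odds of p."""
    p = min(max(p, 1e-6), 1.0 - 1e-6)  # keep the logit finite
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-(a * logit + b)))


def score(base_rate, multipliers, market_price, a=1.0, b=0.0):
    """Return (p_literal, p_calibrated, ev) for one phrase."""
    p_literal = base_rate
    for value in multipliers.values():
        p_literal *= value
    # Assumption: multipliers above 1 could push the product past 1.
    p_literal = min(max(p_literal, 0.0), 1.0)
    p_calibrated = platt_scale(p_literal, a, b)
    return p_literal, p_calibrated, p_calibrated - market_price


if __name__ == "__main__":
    signals = {
        "time_decay": 0.5,
        "news_pressure": 1.2,
        "x_buzz": 1.0,
        "llm_boost": 1.0,
        "event_llm": 1.0,
    }
    # a=1, b=0 makes the calibration step an identity, for illustration.
    print(score(base_rate=0.4, multipliers=signals, market_price=0.25))
```

Because the signals multiply, an error in any one of them scales the whole estimate, which is one plausible reason a layered model can end up less accurate than the market price it is compared against.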
Watches outcome_reviews and maintains win rates per (speaker, signal_source). Outputs weights in data/signal_weights.json that scale each signal multiplicatively. Signals that consistently lose money get weighted toward 0; signals that win get weighted up to 1.5. Neutralized wallet flow and X/news signals for most political speakers in production.
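One way such a learner could map win rates to weights is sketched below. The mapping is entirely hypothetical: it scales the win rate so that 50% corresponds to a weight of 1.0, shrinks toward 1.0 when there are few bets (`prior_n`), and clamps to [0, 1.5]. The `speaker|signal_source` key format is likewise an assumption about `data/signal_weights.json`.

```python
import json
from collections import defaultdict


def learn_weights(reviews, prior_n=20, max_weight=1.5):
    """reviews: iterable of (speaker, signal_source, won) tuples."""
    tallies = defaultdict(lambda: [0, 0])  # key -> [wins, bets]
    for speaker, source, won in reviews:
        tally = tallies[f"{speaker}|{source}"]
        tally[0] += int(won)
        tally[1] += 1

    weights = {}
    for key, (wins, n) in tallies.items():
        raw = (wins / n) / 0.5  # a 50% win rate maps to weight 1.0
        # Shrink toward the neutral weight 1.0 when evidence is thin.
        shrunk = (n * raw + prior_n * 1.0) / (n + prior_n)
        weights[key] = round(min(max(shrunk, 0.0), max_weight), 3)
    return weights


if __name__ == "__main__":
    history = (
        [("trump", "wallet_flow", False)] * 180
        + [("trump", "polymarket", True)] * 12
        + [("trump", "polymarket", False)] * 8
    )
    print(json.dumps(learn_weights(history), indent=2, sort_keys=True))
```

Under this rule a signal that loses across many bets decays toward 0, which is the neutralization pattern described for wallet flow and X/news.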
scheduled → live → ended
Transitions are auto-detected from Kalshi market metadata. Live state applies hazard-based time decay. Ended state with no phrase hit collapses probability to 0.02. Phrase hits during live state override to 0.98.
config/events.yaml supports manual p_overrides (hard bypass — e.g., "Kristi Noem sworn in today → 'kristi' = 0.95") and p_floors (clamp minimum).
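The state machine and its probability rules can be sketched together. The precedence used here is an assumption: observed evidence (a hit, or an ended event with none) first, then a manual `p_override`, then a `p_floor` clamp. A hit is also assumed to remain in force after the event ends. All names are hypothetical.

```python
from enum import Enum
from typing import Optional


class EventState(Enum):
    SCHEDULED = "scheduled"
    LIVE = "live"
    ENDED = "ended"


# Only forward transitions are legal: scheduled -> live -> ended.
_NEXT = {
    EventState.SCHEDULED: EventState.LIVE,
    EventState.LIVE: EventState.ENDED,
}


def advance(state: EventState) -> EventState:
    """Move to the next state; 'ended' is terminal."""
    if state not in _NEXT:
        raise ValueError(f"{state.value} is a terminal state")
    return _NEXT[state]


def resolve_probability(
    model_p: float,
    state: EventState,
    phrase_hit: bool,
    p_override: Optional[float] = None,
    p_floor: Optional[float] = None,
) -> float:
    """Apply state rules, then manual override, then floor."""
    if phrase_hit and state is not EventState.SCHEDULED:
        return 0.98  # phrase was detected during the event
    if state is EventState.ENDED:
        return 0.02  # event over with no hit
    if p_override is not None:
        return p_override  # hard bypass of the model
    if p_floor is not None:
        return max(model_p, p_floor)  # clamp minimum
    return model_p


if __name__ == "__main__":
    state = EventState.SCHEDULED
    print(state.value, resolve_probability(0.40, state, False, p_floor=0.50))
    state = advance(state)
    print(state.value, resolve_probability(0.40, state, True))
    state = advance(state)
    print(state.value, resolve_probability(0.40, state, False))
```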
Prompts in config/llm/ as Markdown — editable without code changes:
- global_signals_guide.md: global boost/suppress rules
- per_event_guide.md: per-event analysis instructions
- trump_patterns.md: data-backed behavioral profiles (e.g., Trump says "sleepy joe" at 92% of events including signings, regardless of format)
- calibration_guide.md, event_formats.md, mission.md
Single-file web UI at http://localhost:8777 — see the Dashboard section below for screenshots and tab-by-tab breakdown.
WhatsAppNotifier sends high-conviction BUY cards via openclaw message send <number> <text>. Target phone number in WHATSAPP_TARGET. Formatted for 5-second phone reads.
| Segment | Bets | Win Rate | P&L | Per bet |
|---|---|---|---|---|
| NBA BUY_NO | 283 | 51.9% | +$12.62 | +$0.045 |
| NCAAB BUY_NO | 235 | 57.0% | +$6.20 | +$0.026 |
| NCAAB BUY_YES | 24 | 41.7% | +$3.63 | +$0.151 |
| MMA BUY_NO | 60 | 61.7% | +$2.03 | +$0.034 |
| MMA BUY_YES | 12 | 41.7% | +$1.03 | +$0.086 |
| MLB BUY_NO | 128 | 42.2% | −$17.61 | −$0.138 |
| NBA BUY_YES | 93 | 33.3% | −$8.68 | −$0.093 |
| Trump BUY_NO | 42 | 33.3% | −$8.50 | −$0.202 |
| Trump BUY_YES | 125 | 33.6% | −$3.28 | −$0.026 |
| Metric | Legacy ScoringEngine | BayesianScorer |
|---|---|---|
| Bets | 975 | 132 |
| Win rate | 49.2% | 32.6% |
| P&L | −$1.97 | −$13.37 |
| BUY_YES share | 20% | 76% |
| Brier score | 0.352 | — |
| Market mid Brier | 0.238 | — |
The legacy model is worse than the market at predicting outcomes. The Bayesian model has been live since April 14 with sparse data — its current failure mode is over-firing NBA BUY_YES bets, driven by a high pooled NBA speaker prior (0.591) being applied to thin-data phrases.
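For readers unfamiliar with the metrics, the Brier score is the mean squared error between predicted probabilities and 0/1 outcomes (lower is better), and the Brier skill score compares a model against a reference forecast, here the market mid. The sketch below uses made-up toy data, not project results.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecasts and binary outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(bs_model, bs_reference):
    """1 - BS_model / BS_ref; negative means worse than the reference."""
    return 1.0 - bs_model / bs_reference


if __name__ == "__main__":
    outcomes = [1, 0, 0]            # toy resolutions
    model = [0.9, 0.2, 0.6]         # toy model forecasts
    market_mid = [0.8, 0.3, 0.4]    # toy market prices

    bs_model = brier_score(model, outcomes)
    bs_market = brier_score(market_mid, outcomes)
    print(round(bs_model, 4), round(bs_market, 4),
          round(brier_skill_score(bs_model, bs_market), 4))
```

A skill score below zero, as reported for this system, means the market price alone was the better forecast.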
Restricting to sports BUY_NO (NBA + NCAAB + MMA) only, retrospectively: 578 bets, ~55% WR, approximately +$20 gross. Whether this segment-level edge holds forward or is historical overfitting is the open research question.
Local web UI at http://localhost:8777. Single-file Python (app/dashboard.py, 5,249 lines) — no external framework, no build step, no CDN dependencies. Runs as a separate process from the runner and reads the same SQLite database. A persistent top bar shows live system state: active BUY signals, total markets tracked, AI boosts applied, gate blocks, net P&L, win rate, and snapshot freshness.
The honest view. Journal bets, rolling win rate, realized P&L, and BSS vs the market mid. Speaker Rank sorts by BSS (or P&L, win%, or bet count) so you can see which segments are pulling weight. Each speaker expands to a card with rolling performance windows (7d / 30d / 60d / 90d / all-time), a live recent-activity feed, and a per-phrase breakdown sorted by contribution.
This is the tab that tells you whether the system is working. It's where you watch segment-level edge develop or erode in real time.
Signal-layer diagnostics. The top summary row shows how many phrases are currently LLM-boosted or LLM-suppressed. Recent Events & Schedule surfaces the news items and White House calendar entries that are shaping today's signals.
The Model Health calibration table compares predicted probability to actual outcome rate across buckets — the red "Bad" quality flags in this screenshot are exactly the systematic bias the project is trying to diagnose. Below that, Top Opportunities lists current BUY candidates, and Today's Topics clusters the LLM-identified themes driving per-event p_floor adjustments.
One-click operational control. Every maintenance script in the repo is registered with a label, group, and description, then exposed as a Run button. Output streams live to an embedded terminal below each button.
Grouped by workflow: Engine Control (start/stop the runner), Pre-Event (certainties, Truth Social floors), Sports (NBA/MLB/NCAAB certainties and schedules), Data Fetching (markets, outcomes, Polymarket, wallet flow, news, X signals, Fed transcripts), and AI Intelligence (LLM signal analysis). Each script is allowlisted via _ALLOWED_SCRIPTS — arbitrary script execution is not possible.
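An allowlist of this kind can be sketched as a dictionary lookup that runs before any process is spawned. Only the `_ALLOWED_SCRIPTS` name, the label/group/description fields, and the two make targets come from this README; the registry layout and both helper functions are assumptions.

```python
import subprocess
import sys

# Hypothetical registry layout: id -> metadata plus a fixed argv.
_ALLOWED_SCRIPTS = {
    "fetch_markets": {
        "label": "Fetch markets",
        "group": "Data Fetching",
        "description": "Refresh Kalshi market definitions",
        "argv": ["make", "fetch-markets"],
    },
    "health_check": {
        "label": "Health check",
        "group": "Engine Control",
        "description": "Run runtime health checks",
        "argv": ["make", "health-check"],
    },
}


def resolve_command(script_id: str) -> list:
    """Return the fixed argv for a registered script, or refuse."""
    entry = _ALLOWED_SCRIPTS.get(script_id)
    if entry is None:
        raise PermissionError(f"script not allowlisted: {script_id!r}")
    return list(entry["argv"])  # copy so callers cannot mutate the registry


def stream_output(argv):
    """Yield combined stdout/stderr line by line while the process runs."""
    with subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for line in proc.stdout:
            yield line


if __name__ == "__main__":
    print(resolve_command("fetch_markets"))
    # Demonstrate streaming with a harmless command instead of a make target.
    for line in stream_output([sys.executable, "-c", "print('ok')"]):
        print("terminal:", line, end="")
```

Because the client only ever sends an identifier and the argv is fixed server-side, no user-supplied string reaches the command line.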
- Markets — live scored markets grouped by speaker, per-phrase cards showing side, probability, market price, EV, and the gate codes that drove the decision. Each card expands to reveal full model state (base rate, signal multipliers, Platt-calibrated probability, Kelly fraction, reason-code trail).
- Sports — NBA, NCAAB, MLB, and MMA/UFC markets grouped by game. Arena/venue overrides, universal phrase floors, active phrase probabilities per scheduled event.
- Analysis — co-occurrence matrices, phrase correlation graphs, hazard-rate curves for live events, cross-market arbitrage candidates (Kalshi vs Polymarket divergences).
- System — runtime health: snapshot freshness, scorer idle time, DB integrity, WAL checkpoint state, maintenance task history.
- Python 3.9+
- macOS (for launchd-based 24/7 mode) or Linux (manual process supervision)
- Kalshi account (public API requires no authentication for basic polling)
git clone https://github.com/<user>/kalshi-edge.git
cd kalshi-edge
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python3 -m pytest -q   # 246 tests

KALSHI_MOCK=1 python3 -m app.runner
python3 -m app.dashboard   # second terminal → http://localhost:8777

Mock mode generates ~660 deterministic fake markets. Useful for exploring the system without touching live data.
cp config/runtime.env.example config/runtime.env
# Set KALSHI_MOCK=0; other settings optional
make run-live

USE_BAYESIAN_SCORER=1 python3 -m app.runner   # default, no LLM cost
USE_BAYESIAN_SCORER=0 python3 -m app.runner   # legacy, requires OPENAI_API_KEY

Before installing launchd services, complete these three steps:
- Remove macOS quarantine on cloned files:
xattr -dr com.apple.quarantine /path/to/kalshi-edge
- Grant Terminal Full Disk Access (System Settings → Privacy & Security). Without this, launchd services silently fail to read project files.
- Place the repo outside ~/Documents, ~/Desktop, ~/Downloads. Those locations have sandbox restrictions that block launchd.
Then:
make install-24x7-all # installs runner + dashboard + watchdog + caffeinate
make local-status

Full macOS setup walkthrough (Gatekeeper approval, sleep prevention, troubleshooting): docs/QUICKSTART.md.
# Data refresh
make fetch-markets # Kalshi market definitions
make fetch-outcomes # resolved outcomes (training data)
make fetch-poly # Polymarket cross-prices
# Calibration pipeline
make calibrate
# Outcome tracking and P&L
make record-outcomes # match BUY cards to outcomes
make report-outcomes
make backtest # win rates by segment, confidence, regime
# Runtime health
make health-check
make doctor

app/
  runner.py               async service orchestrator
  bayesian_scorer.py      BayesianScorer: CI-vs-market decisions, Kelly sizing
  scoring.py              ScoringEngine: 17-gate legacy model, 2,579 lines
  dashboard.py            web UI, 5,249 lines
  transcript_sources.py   Direct HTTP + OpenClaw + file sources with fallback
  transcript_ingestor.py  phrase detection pipeline
  phrase_matcher.py       boundary-safe regex with negation/attribution
  bayesian_rates.py       Beta posteriors, hierarchical pooling
  base_rates.py           YAML base rate lookup
  bias_map.py             series-specific empirical rate overrides
  rolling_rates.py        3/5/10-speech rolling window signals
  phrase_hazard.py        empirical hazard rates for live time decay
  signal_learner.py       adaptive weight learner from outcomes
  phrase_cooccurrence.py  conditional phrase lift table
  phrase_correlation.py   phi-coefficient correlation matrix
  event_detector.py       scheduled/live/ended state machine
  event_signals.py        per-event overrides and floors
  polymarket.py           cross-market price signal
  wallet_flow.py          Polymarket trade flow signal
  price_velocity.py       YES price rate-of-change signal
  calibration.py          Platt scaling
  db.py                   SQLite setup, WAL healing
  maintenance.py          background task scheduler
  notifier.py             WhatsApp via OpenClaw
  watchdog.py             staleness monitor
scripts/                  67 fetch/compute/backtest/admin scripts
config/
  base_rates.yaml         manually-curated phrase rates
  base_rates_auto.yaml    auto-calibrated from outcomes
  runtime.env.example     all env options documented
  events.yaml             upcoming events with overrides
  llm/                    LLM prompt files
brain/                    architecture specs, decisions log
tests/                    246 pytest tests
data/                     runtime state (gitignored)
docs/                     quickstart, architecture, development archive
NBA BUY_YES over-firing (Bayesian). Pooled NBA prior of 0.591 is applied to sparse-data phrases, producing artificially high posterior means. Candidate fix: minimum CI exclusion margin — require yes_ask < ci_low − 0.05 rather than just < ci_low.
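A small worked example shows why a strong pooled prior dominates thin data, and how the proposed margin changes the decision. The prior strength of 10 pseudo-observations and both function names are hypothetical; only the 0.591 prior mean and the 0.05 margin come from the text above.

```python
def posterior_mean(prior_mean, prior_strength, yes_count, n_obs):
    """Beta-Binomial posterior mean with the prior expressed as pseudo-counts."""
    alpha = prior_mean * prior_strength + yes_count
    beta = (1.0 - prior_mean) * prior_strength + (n_obs - yes_count)
    return alpha / (alpha + beta)


def buy_yes(yes_ask, ci_low, margin=0.0):
    """BUY_YES only when the ask is below ci_low by at least `margin`."""
    return yes_ask < ci_low - margin


if __name__ == "__main__":
    # Phrase said in 0 of 2 observed games: the prior still pulls toward 0.59.
    print(round(posterior_mean(0.591, 10, yes_count=0, n_obs=2), 4))
    # Same phrase with 40 observations: the data takes over.
    print(round(posterior_mean(0.591, 10, yes_count=0, n_obs=40), 4))
    # A borderline signal fires without the margin and is blocked with it.
    print(buy_yes(0.40, 0.43), buy_yes(0.40, 0.43, margin=0.05))
```

With two observations and no hits, the posterior mean remains near 0.49, which can sit well above the market price and trigger a BUY_YES that the data does not support.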
MLB BUY_NO systematically losing. "Bunt" (19% WR, n=17), "triple" (8% WR, n=12), "wild pitch" (46% WR, n=11). The outcome data used for calibration may have different resolution criteria than current Kalshi settlements. Needs investigation — highest priority open issue.
Legacy ScoringEngine Brier score exceeds market mid Brier (0.352 vs 0.238). The layered signal model is net-negative. Fix attempt is the BayesianScorer (currently underperforming for separate reasons).
Corpus not included. Without user-supplied transcripts, calibration falls back to priors in config/base_rates_priors.yaml. Accuracy degrades significantly.
launchd-specific 24/7 mode. Linux systemd port is straightforward but unimplemented.
See CONTRIBUTING.md.
Priority contributions:
- MLB BUY_NO analysis. Anyone with a theory for why common MLB phrases are hitting far above their historical rates should open an issue.
- Corpus expansion. Redistributable sources for Carney, Starmer, Homan transcripts.
- Linux/Docker port. Scope is limited to scripts/manage_launchd.py and the runner service scripts.
Scoring logic changes must cite outcome-level evidence (bet counts, win rates, P&L) from outcome_reviews, per the convention in brain/08_DECISIONS_LOG.md.
MIT. See LICENSE.
This software lost real money in live trading. DISCLAIMER.md.


