Mike Ross AI Engine

IBM TechXchange Hackathon 2025

“Put Mike Ross from Suits in your pocket.”
Instant recall of sprawling case files, precedent pattern‑matching, clause risk triage, witness leverage maps & argument scaffolds – all powered by IBM watsonx.ai Granite models and a transparent retrieval + reasoning stack.

1. Problem Statement

Lawyers still burn 40–70% of prep time re‑reading legacy case folders, hunting precedents, diffing contracts, and stitching together witness contradictions. Project Pearson turns that grind into a reproducible 5–10 minute pipeline. The result is a personal Mike Ross (eidetic recall plus structured legal reasoning) that cuts document triage time by ~50% on internal benchmarks over sample tenancy and contract sets, so teams can reinvest those hours in higher‑order strategy, negotiation positioning, and courtroom narrative.

Mike Ross (fictional inspiration only) is the metaphor: photographic memory, pattern synthesis, tactical suggestions. Our engine approximates that utility with:

Deterministic, hash‑tracked ingestion & enrichment
Hybrid semantic + metadata retrieval over precedent & your matter corpus
Granite‑family LLM reasoning templates specialized per legal task
Four focused autonomous “agents” instead of a vague generalist

Disclaimer: “Mike Ross” is referenced solely as cultural shorthand for an associate with exceptional recall and analytical speed. No show scripts, plot lines, or proprietary content are reproduced.

2. Core Value Outcomes

Pain Today	Mike Ross AI Outcome
Manual reread of 100s of pages before each motion	Indexed once; contextual Q&A and structured briefs in minutes
Ad hoc precedent hunting with inconsistent coverage	Unified precedent & regulatory vector layer with explainable citation traces
Contract risk review diffed in spreadsheets	Clause & risk inventory + redraft suggestions surfaced instantly
Witness statement contradiction spotting done late	Early leverage map + prioritized questioning funnels
Hours lost summarizing for partners	Auto strategic briefs & argument skeletons with provenance

Time Freed (Indicative): 50%+ of low‑leverage reading hours → redeployed to strategy formation, settlement calculus, judge preference tailoring, and narrative coherence.

3. Architecture (Ingest → Retrieve → Reason → Act)

Ingest: PDF, TXT, and (DOCX) uploads plus crawled precedent are normalized, chunked, and enriched (citations, parties, acts, judges, dates).
Vectorize: watsonx embeddings map every jurisdiction (IndianKanoon, US, regulatory) into one shared embedding space.
Retrieve: Hybrid semantic + metadata filtering across legal_cases and case_files (fusion).
Reason (Granite Models): Task‑specific prompt scaffolds call Granite chat models for analysis, synthesis, ranking & structured outputs.
Act (Agents): Specialized modules return actionable artifacts (risk matrices, contradiction tables, argument trees) with citation provenance and scoring.

Technologies: FastAPI, Chroma persistent store, SQLite metadata registry, IBM watsonx.ai Granite & embedding models, regex + (extensible) NLP enrichment.

---

4. The Four Mike Ross Agents

Agent	Purpose	Key Artifacts	Granite Prompt Angle
Case Analyzer	Dissect a case’s internal structure	Strengths, weaknesses, contradictions, precedent gaps, remediation steps	Structured issue tree + risk weighting
Contract Scanner	Clause & risk intelligence	Clause taxonomy, risk tiers, missing protections, redraft proposals	Clause span scoring + mitigation synthesis
Deposition Strategist	Witness & statement leverage	Inconsistency map, credibility pressure points, question funnels	Comparative factual matrix reasoning
Precedent Locator	Argument architecture	Binding vs persuasive set, distinguishing factors, counterarguments	Multi‑precedent abstraction + analogical mapping

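Each agent reuses the same retrieval substrate but applies its own reasoning template and output schema. Constraining each agent’s role and objective keeps responses explainable and shrinks the room for hallucination. The API paths in section 10 refer to these agents by slug: case-breaker (Case Analyzer), contract-xray (Contract Scanner), deposition-strategist (Deposition Strategist), and precedent-strategist (Precedent Locator).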

5. Data & Storage Layout

storage/
  raw/        # immutable ingested source files
  curated/    # future normalized / structured facts
  feature/    # derived artifacts (timelines, graphs, clause JSON)
vector/       # persistent Chroma directories (VECTOR_DB_PATH)
pearson.db    # SQLite: documents, chunks, enrichment metadata

Evolution: swap local dirs for IBM Cloud Object Storage + Iceberg/Hudi via watsonx.data, add lineage + masking hooks.

6. Retrieval & Embeddings Strategy

Embeddings: IBM watsonx ibm/slate-30m-english-rtrvr (extensible to multilingual Granite retrievers)
Collections: legal_cases (precedent & regulatory) + case_files (matter uploads)
Fusion: parallel top‑k per corpus → reciprocal rank / score normalization
Metadata: filename, hash, chunk_index, jurisdiction, citations_local, sections_local, source_type
Provenance: Every answer returns snippet sources to allow manual validation
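
The fusion step listed above can be illustrated with reciprocal rank fusion (RRF). This is a minimal sketch under stated assumptions, not the repository's implementation: the function name, the constant k=60 (a commonly used RRF default), and the sample chunk IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each ID earns 1 / (k + rank) for every list it appears in (rank starts
    at 1), so items ranked highly in any corpus float to the top.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-k results from the two collections (IDs are made up).
legal_cases_hits = ["prec-12", "prec-07", "prec-31"]
case_files_hits = ["file-03", "file-09"]

fused = reciprocal_rank_fusion([legal_cases_hits, case_files_hits])
for chunk_id, score in fused:
    print(f"{chunk_id}: {score:.5f}")
```

Because RRF uses only ranks, it sidesteps the problem that raw distances from two different collections may not be directly comparable. Score normalization, the other option named above, would work on the raw scores instead.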

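Scoring example: cosine distance is mapped to a more intuitive similarity score with score = 1 / (1 + distance), so a distance of 0 yields a score of 1.0 and the score falls as the distance grows.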
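
The transformation above is small enough to show in full. The helper name and the sample distances below are illustrative assumptions.

```python
def distance_to_score(distance: float) -> float:
    """Map a non-negative distance to a score in (0, 1]; smaller distance means higher score."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)


for d in (0.0, 0.25, 1.0):
    print(f"distance={d:.2f} -> score={distance_to_score(d):.3f}")
# distance=0.00 -> score=1.000
# distance=0.25 -> score=0.800
# distance=1.00 -> score=0.500
```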

7. RAG Flow (Upload → Strategic Answer)

File upload → parse & chunk (size tuned for legal clause + fact density)
Enrichment (regex now; pluggable NLP upgrade path) adds structured facets
Embedding + vector insert with deterministic IDs & content hashes
User question / agent call triggers hybrid retrieval
Context windows constructed with citation ordering & duplicate suppression
Granite model prompt (task template) → structured JSON / markdown answer
Response returned with provenance & optional confidence / risk notes
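
The context-construction step in the flow above (ordering plus duplicate suppression) might look roughly like the following. This is a sketch only: the hit dictionary keys ("text", "source", "score"), the character budget, and the function name are assumptions, not the backend's actual schema.

```python
import hashlib


def build_context(hits, max_chars=2000):
    """Assemble a numbered prompt context from retrieved hits.

    Snippets whose whitespace- and case-normalized text is identical are
    treated as duplicates; the rest are ordered by descending score and
    numbered so an answer can cite them as [1], [2], ...
    """
    seen = set()
    unique = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        normalized = " ".join(hit["text"].split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same snippet already included from another source
        seen.add(digest)
        unique.append(hit)

    parts, used = [], 0
    for number, hit in enumerate(unique, start=1):
        block = f"[{number}] ({hit['source']}) {hit['text'].strip()}"
        if used + len(block) > max_chars:
            break  # stay inside the context budget
        parts.append(block)
        used += len(block)
    return "\n".join(parts)


# Hypothetical retrieval results; the second is a near-verbatim duplicate.
hits = [
    {"text": "Tenant must give 30 days notice.", "source": "lease.pdf#3", "score": 0.81},
    {"text": "tenant must give  30 days notice.", "source": "lease_copy.pdf#3", "score": 0.79},
    {"text": "Landlord may enter with 24 hours notice.", "source": "lease.pdf#5", "score": 0.64},
]
context = build_context(hits)
print(context)
```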

8. Current Implemented Backend Components

FastAPI service (backend/main.py) with upload + chat (RAG) + agent endpoints
IndianKanoon crawler (backend/app.py) for precedent seeding (ID range enumerator)
Chroma vector abstraction (backend/vectorstores/chroma_store.py)
Ingestion & enrichment services (documents → enriched chunks → vectors + SQLite)
Hybrid retrieval service with metadata filtering stubs
Session digests for large PDFs to accelerate non‑RAG follow‑ups

9. Environment & Configuration

Create a .env file in the backend folder:

WATSONX_API_KEY=
WATSONX_PROJECT_ID=
WATSONX_URL=
VECTOR_DB_PATH=legal_cases_store
CASE_LAW_COLLECTION=legal_cases
CASE_FILES_COLLECTION=case_files
SQLITE_PATH=pearson.db
RAW_STORAGE=storage/raw
CURATED_STORAGE=storage/curated
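
As a rough illustration of how these variables could be consumed, the sketch below reads them from a mapping (the process environment by default) and fails fast when the watsonx credentials are missing. The Settings class and load_settings function are hypothetical names, and the fallback values simply mirror the sample above. How the backend actually loads its .env file is not described here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    watsonx_api_key: str
    watsonx_project_id: str
    watsonx_url: str
    vector_db_path: str
    case_law_collection: str
    case_files_collection: str
    sqlite_path: str
    raw_storage: str
    curated_storage: str


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    required = ("WATSONX_API_KEY", "WATSONX_PROJECT_ID", "WATSONX_URL")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return Settings(
        watsonx_api_key=env["WATSONX_API_KEY"],
        watsonx_project_id=env["WATSONX_PROJECT_ID"],
        watsonx_url=env["WATSONX_URL"],
        # Fallbacks below copy the sample .env values shown above.
        vector_db_path=env.get("VECTOR_DB_PATH", "legal_cases_store"),
        case_law_collection=env.get("CASE_LAW_COLLECTION", "legal_cases"),
        case_files_collection=env.get("CASE_FILES_COLLECTION", "case_files"),
        sqlite_path=env.get("SQLITE_PATH", "pearson.db"),
        raw_storage=env.get("RAW_STORAGE", "storage/raw"),
        curated_storage=env.get("CURATED_STORAGE", "storage/curated"),
    )


# Placeholder values for demonstration only.
example_env = {
    "WATSONX_API_KEY": "placeholder-key",
    "WATSONX_PROJECT_ID": "placeholder-project",
    "WATSONX_URL": "https://example.invalid",
}
settings = load_settings(example_env)
print(settings.vector_db_path, settings.sqlite_path)
```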

PowerShell Quick Start (Windows):

python -m venv .venv
.venv\Scripts\activate
pip install -r backend/requirements.txt
cd backend
uvicorn main:app --reload

Seed precedent (configure DOC_START_ID / DOC_END_ID in .env if present):

python app.py

Upload & Chat:

curl -F "file=@case.pdf" -F "session_id=s1" http://localhost:8000/upload/
curl -F "user_input=Key tenancy arguments?" -F "session_id=s1" -F "use_rag=true" http://localhost:8000/chat/

10. Key API Endpoints (Summary)

Endpoint	Purpose
`POST /upload/`	Ingest & index a document
`POST /chat/`	Conversational Q&A (optional RAG)
`POST /search/`	Direct hybrid search (no LLM reasoning)
`GET /mike-ross/models`	List agents
`POST /mike-ross/case-breaker/analyze`	Case structural analysis
`POST /mike-ross/contract-xray/analyze`	Contract risk & clauses
`POST /mike-ross/deposition-strategist/analyze-witnesses`	Witness inconsistency map
`POST /mike-ross/precedent-strategist/analyze`	Precedent stack & strategy

Full expanded reference lives in backend/README.md (Appendix A & B).

11. Enrichment Signals (v1)

Citations (regex extracted – local & cross‑jurisdiction markers)
Acts / Sections
Parties & Judges
Dates (normalized forms)
Early risk flags (contract heuristics)
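
To make the regex-based signals above concrete, here is a small sketch that pulls section references and dates out of a chunk. The patterns are illustrative assumptions rather than the project's actual expressions, and the date pattern assumes a day/month/year order.

```python
import re
from datetime import date

# Illustrative patterns only; real legal citations need far broader coverage.
SECTION_RE = re.compile(r"\b[Ss]ection\s+(\d+[A-Z]?)\b")
DATE_RE = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")


def enrich(text):
    """Return structured facets (section references, ISO dates) for one chunk."""
    sections = sorted(set(SECTION_RE.findall(text)))
    dates = []
    for day, month, year in DATE_RE.findall(text):
        try:
            # Assumes day/month/year ordering; impossible dates are skipped.
            dates.append(date(int(year), int(month), int(day)).isoformat())
        except ValueError:
            continue
    return {"sections": sections, "dates": dates}


sample = "Notice under Section 106 was served on 05/03/2021; see also section 108A."
facets = enrich(sample)
print(facets)
# {'sections': ['106', '108A'], 'dates': ['2021-03-05']}
```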

Planned: Named entity graph, causal chains, outcome classification, rhetorical role tagging.

12. Roadmap (Hackathon Slice → Near Term)

Phase	Focus	Headline Additions
P1 (done)	Core ingestion + hybrid RAG	Persistent vectors, crawler, enrichment v1
P2	Graph & Timeline	Entity co‑occurrence graph, timeline synthesis, filterable retrieval params
P3	Precedent Deep Reasoning	Argument element extraction, judge profiling, adverse authority scan
P4	Contract X‑Ray Prototype	Clause diff vs playbook, risk scoring JSON schema
P5	Deposition Enhancements	Contradiction clustering, dynamic questioning funnel generator
P6	Retrieval Quality	BM25 + dense hybrid, reranker, response caching

13. Observability & Trust

Content hashes & deterministic IDs prevent duplicate vector spam
Logging: ingestion counts, retrieval timings, crawler adaptive rates
Provenance bundle: snippet text + source metadata + jurisdiction
Manual‑verification‑first design: source snippets are surfaced before the reasoning summary
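
The first point above, content hashes and deterministic IDs, can be sketched as follows. The ID format and the in-memory index are hypothetical stand-ins for the real vector store. The idea is only that re-ingesting identical bytes produces identical IDs, so an upsert overwrites rather than duplicates.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def chunk_id(doc_hash: str, chunk_index: int) -> str:
    """Deterministic chunk ID: hash prefix plus zero-padded chunk index (assumed format)."""
    return f"{doc_hash[:16]}-{chunk_index:04d}"


class InMemoryIndex:
    """Stand-in for a vector store keyed by ID (for illustration only)."""

    def __init__(self):
        self.items = {}

    def upsert(self, item_id, text):
        is_new = item_id not in self.items
        self.items[item_id] = text
        return is_new


def ingest(index, data: bytes, chunks):
    """Insert chunks and return how many were new to the index."""
    doc_hash = content_hash(data)
    return sum(index.upsert(chunk_id(doc_hash, i), c) for i, c in enumerate(chunks))


index = InMemoryIndex()
raw = b"Lease agreement. Clause 1. Clause 2."
chunks = ["Lease agreement. Clause 1.", "Clause 2."]

first_run = ingest(index, raw, chunks)   # both chunks are new
second_run = ingest(index, raw, chunks)  # same bytes, same IDs, nothing new
print(first_run, second_run, len(index.items))
```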

14. Extensibility Patterns

Add corpus: implement lightweight fetcher → normalize → feed enrichment → vector insert with jurisdiction, source_type.
Add agent: create endpoint → retrieval call → task prompt template referencing Granite model → structured output schema → register in /mike-ross/models.
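
The "add agent" pattern above can be outlined without the web layer. The sketch below uses a plain registry in place of the FastAPI endpoint and /mike-ross/models listing, and stubs out retrieval and the Granite call. The names (register_agent, clause-checker, the retrieve and generate callables) are hypothetical.

```python
from typing import Callable, Dict, List

AGENTS: Dict[str, Callable] = {}


def register_agent(name: str):
    """Decorator that records an agent under a name, mimicking a model listing."""
    def decorator(func):
        AGENTS[name] = func
        return func
    return decorator


@register_agent("clause-checker")  # hypothetical agent name
def clause_checker(question, retrieve, generate):
    # 1) retrieval call, 2) task prompt template, 3) structured output
    snippets = retrieve(question)
    prompt = (
        "You are a contract clause reviewer.\n"
        + "\n".join(f"- {s}" for s in snippets)
        + f"\nQuestion: {question}"
    )
    return {"agent": "clause-checker", "answer": generate(prompt), "sources": snippets}


def list_agents() -> List[str]:
    return sorted(AGENTS)


# Stubs standing in for the hybrid retriever and the Granite model call.
def fake_retrieve(question):
    return ["Clause 4: termination on 30 days notice."]


def fake_generate(prompt):
    return f"(model output for a {len(prompt)}-character prompt)"


result = AGENTS["clause-checker"]("Is termination covered?", fake_retrieve, fake_generate)
print(list_agents())
print(result["sources"])
```

Passing the retriever and model as arguments keeps the agent testable in isolation, in line with the contribution note about adding tests for prompt template transformations.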

15. Hackathon Narrative (IBM TechXchange)

Specialized vertical legal intelligence showcasing:

watsonx.ai Granite models orchestrated through transparent RAG
Modular agents (clear scope → lower hallucination risk)
Cross‑jurisdiction & regulatory blend
Auditability & governance‑ready data contracts
Tangible productivity claim: 50%+ reduction in low‑leverage reading → more strategy cycles

Pitch Line: “While others promise generic legal chat, we deliver a verifiable Mike Ross memory layer plus four surgical strategy agents – all on Granite.”

16. License & Source Attribution

Uses public domain / open legal sources (IndianKanoon references, US public opinions, regulatory bulletins). Ensure compliance with each source’s reuse terms. This repository’s license: see LICENSE.

17. Quick Checklist for Demo Prep

Populate .env (API key, project id, URL)
Start backend (FastAPI + Uvicorn)
(Optional) Run crawler to seed precedent range
Upload a representative contract + case bundle
Run: case breaker → precedent strategist → contract x‑ray → deposition strategist
Show provenance & time saved contrast slide

18. Contribution Notes

Please keep additions modular: new enrichers, stores, or agents should not break existing retrieval contracts. Add tests for any new parsing logic or prompt template transformations.

Hackathon Goal: rock‑solid ingestion + high‑signal retrieval + explainable, provenance‑rich strategic outputs. Your pocket Mike Ross – ethically implemented.
