jw-myrag (for technical books and documents that contain code)

A document processing and semantic search system that parses documents (PDFs, Markdown, plain text), creates semantic embeddings, and stores them in PostgreSQL with pgvector for similarity search.

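(This tool was built on the assumption that the input is technical documentation containing code. For other kinds of documents, embedding accuracy degrades.)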

Features

Multi-format parsing: PDF (with OCR fallback), Markdown, plain text
Semantic segmentation: Intelligent grouping of text, code, and images
Multi-view embeddings: Separate embeddings for text, code, images, tables
Parent-child hierarchy: Context-aware retrieval with parent documents
RAG support: LLM-powered question answering over your documents
PostgreSQL + pgvector: Scalable vector storage with HNSW indexing
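
To illustrate the parent-child idea in the list above: a search matches small child fragments, and each hit is then widened to its parent so the reader (or the LLM) sees the surrounding context. The sketch below is a minimal in-memory version. The `Hit` type, its fields, and `expand_to_parents` are hypothetical names for illustration, not the package's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A matched child fragment (hypothetical shape, for illustration only)."""
    fragment_id: str
    parent_id: str
    score: float


def expand_to_parents(hits: list[Hit], parents: dict[str, str]) -> list[tuple[str, float]]:
    """Replace each child hit with its parent text, keeping the best score per parent."""
    best: dict[str, float] = {}
    for hit in hits:
        if hit.parent_id not in parents:
            continue  # skip hits whose parent is unknown
        if hit.score > best.get(hit.parent_id, float("-inf")):
            best[hit.parent_id] = hit.score
    # Highest-scoring parent first
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(parents[parent_id], score) for parent_id, score in ranked]


parents = {"c1": "Section on indexing ...", "c2": "Section on ingestion ..."}
hits = [Hit("f1", "c1", 0.91), Hit("f2", "c1", 0.80), Hit("f3", "c2", 0.85)]
for text, score in expand_to_parents(hits, parents):
    print(f"{score:.2f}  {text}")
```

Two hits that share a parent collapse into one result, which is why parent expansion tends to return fewer but longer passages than the raw fragment list.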

Installation

From PyPI

pip install jw-my-rag

Using uv (recommended)

uv add jw-my-rag

Prerequisites

Python 3.12+
PostgreSQL with pgvector extension
Google API key (for Gemini embeddings/LLM) or Voyage API key

Database Setup

Start PostgreSQL with pgvector using Docker:

docker run -d \
  --name pgvector \
  -e POSTGRES_USER=langchain \
  -e POSTGRES_PASSWORD=langchain \
  -e POSTGRES_DB=vectordb \
  -p 5432:5432 \
  pgvector/pgvector:pg16

Or use docker-compose (if provided in the repository).

Quick Start

1. Configure Environment

Create a .env file:

# Required
PG_CONN=postgresql+psycopg://langchain:langchain@localhost:5432/vectordb
COLLECTION_NAME=my_documents
GOOGLE_API_KEY=your-api-key-here
EMBEDDING_PROVIDER=gemini
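
The `PG_CONN` value is a URL, so its parts can be checked against the Docker settings from Database Setup before anything tries to connect. This is a minimal sketch using only the standard library; `describe_pg_conn` is a hypothetical helper name, not part of the package.

```python
from urllib.parse import urlsplit


def describe_pg_conn(url: str) -> dict[str, object]:
    """Split a SQLAlchemy-style PostgreSQL URL into its components."""
    parts = urlsplit(url)
    return {
        "driver": parts.scheme,              # e.g. "postgresql+psycopg"
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


info = describe_pg_conn("postgresql+psycopg://langchain:langchain@localhost:5432/vectordb")
print(info)
```

The user, port, and database should match `POSTGRES_USER`, the published port, and `POSTGRES_DB` from the `docker run` command.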

2. Ingest Documents

# Ingest PDF files
myrag ingest documents/*.pdf

# Ingest with dry-run (parse only)
myrag ingest report.pdf --dry-run

3. Search

# Direct search
myrag search "vector database optimization"

# Search with filters
myrag search "async function" --view code --language javascript

# JSON output
myrag search "machine learning" --json

4. RAG (Question Answering)

# Ask a question
myrag rag "What is the main topic of this document?"

# With sources
myrag rag "How does the authentication work?" --sources

5. Interactive REPL

# Start search REPL
myrag

# Start RAG REPL
myrag --rag

CLI Commands

`myrag` (default)

Start the interactive REPL for search or RAG queries.

myrag              # Search mode
myrag --rag        # RAG mode (LLM-powered)
myrag --view code  # Default filter

`myrag search`

Run a single search query.

myrag search "query" [options]

Options:
  --view {text,code,image,caption,table,figure}  Filter by content type
  --language LANG       Filter by programming language
  --top-k N, -k N       Number of results (default: 5)
  --no-context          Disable parent context expansion
  --json                Output JSON format
  --verbose, -v         Enable verbose logging

`myrag ingest`

Ingest documents into the vector database.

myrag ingest FILE [FILE ...] [options]

Options:
  --dry-run      Parse only, no database writes
  --no-cache     Disable OCR cache (re-process all pages)

`myrag rag`

Ask a question using RAG (Retrieval-Augmented Generation).

myrag rag "question" [options]

Options:
  --view {text,code,image,caption,table,figure}  Filter by content type
  --language LANG       Filter by programming language
  --top-k N, -k N       Number of context documents (default: 5)
  --sources             Show source documents in response
  --verbose, -v         Enable verbose logging
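
Conceptually, a RAG query retrieves the top-k fragments and hands them to the LLM together with the question. The sketch below shows only that assembly step, with numbered sources so an answer can cite them. The prompt wording, the `build_prompt` helper, and the `(source, text)` tuple shape are assumptions for illustration, not the prompt this tool actually sends.

```python
def build_prompt(question: str, contexts: list[tuple[str, str]], top_k: int = 5) -> str:
    """Combine up to top_k (source, text) pairs and the question into one prompt."""
    lines = ["Answer the question using only the numbered context below.", ""]
    for i, (source, text) in enumerate(contexts[:top_k], start=1):
        lines.append(f"[{i}] ({source}) {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


contexts = [
    ("guide.pdf p.12", "Tokens are validated on every request."),
    ("guide.pdf p.13", "Expired tokens return HTTP 401."),
]
print(build_prompt("How does the authentication work?", contexts))
```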

`myrag quality`

Inspect data quality and statistics.

myrag quality

Output includes:

Document/concept/fragment/embedding counts
View distribution
Orphan entity check
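
An orphan check of the kind listed above looks for child records whose parent no longer exists. The following is a minimal in-memory sketch of that idea. The dictionary layout and the `find_orphans` name are hypothetical; the real command works against the PostgreSQL tables, whose schema is not shown in this README.

```python
def find_orphans(children: dict[str, str], parent_ids: set[str]) -> list[str]:
    """Return ids of children whose parent id is not in parent_ids."""
    return sorted(cid for cid, pid in children.items() if pid not in parent_ids)


concepts = {"c1", "c2"}
fragments = {"f1": "c1", "f2": "c2", "f3": "c9"}  # f3 points at a missing concept
print(find_orphans(fragments, concepts))  # prints ['f3']
```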

Configuration

All settings are configured via environment variables. See .env.example for the complete list.

Key Settings

Variable	Description	Default
`PG_CONN`	PostgreSQL connection string	(required)
`COLLECTION_NAME`	Vector store collection name	(required)
`EMBEDDING_PROVIDER`	`gemini` or `voyage`	`voyage`
`GOOGLE_API_KEY`	Google API key for Gemini	-
`VOYAGE_API_KEY`	Voyage AI API key	-
`EMBEDDING_DIM`	Embedding dimension	768
`PARENT_MODE`	Grouping mode: `unit`, `page`, `section`, `page_section`	`unit`
`ENABLE_IMAGE_OCR`	Enable Gemini Vision OCR for PDFs	`true`
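
The table above maps naturally onto a small settings object. The sketch below reads the same variables from a mapping and applies the listed defaults. The `Settings` class and `load_settings` function are illustrative names, not the package's own configuration code, and the rule used to parse the boolean is an assumption.

```python
from collections.abc import Mapping
from dataclasses import dataclass

PARENT_MODES = {"unit", "page", "section", "page_section"}


@dataclass(frozen=True)
class Settings:
    pg_conn: str
    collection_name: str
    embedding_provider: str = "voyage"
    embedding_dim: int = 768
    parent_mode: str = "unit"
    enable_image_ocr: bool = True


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from environment-style key/value pairs."""
    for name in ("PG_CONN", "COLLECTION_NAME"):
        if not env.get(name):
            raise KeyError(f"{name} is required")
    parent_mode = env.get("PARENT_MODE", "unit")
    if parent_mode not in PARENT_MODES:
        raise ValueError(f"unknown PARENT_MODE: {parent_mode!r}")
    # Assumed boolean convention: "1", "true", "yes" (case-insensitive) mean enabled
    ocr = env.get("ENABLE_IMAGE_OCR", "true").strip().lower() in {"1", "true", "yes"}
    return Settings(
        pg_conn=env["PG_CONN"],
        collection_name=env["COLLECTION_NAME"],
        embedding_provider=env.get("EMBEDDING_PROVIDER", "voyage"),
        embedding_dim=int(env.get("EMBEDDING_DIM", "768")),
        parent_mode=parent_mode,
        enable_image_ocr=ocr,
    )


settings = load_settings({
    "PG_CONN": "postgresql+psycopg://langchain:langchain@localhost:5432/vectordb",
    "COLLECTION_NAME": "my_documents",
    "EMBEDDING_PROVIDER": "gemini",
})
print(settings)
```

In real use the mapping would be `os.environ` after the .env file has been loaded by whatever mechanism the application uses.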

Architecture

Document -> Concept -> Fragment -> Embedding

Document: Source file (PDF, Markdown, text)
Concept: Semantic unit (related paragraphs, code blocks)
Fragment: Individual piece of content (text paragraph, code block, image)
Embedding: Vector representation for similarity search

Multi-View Strategy

Documents are segmented into distinct views:

text: Natural language paragraphs
code: Code blocks with language detection
image: Image references with alt text
caption: Figure/table captions
table, figure: Structured content

Each view can be filtered during search for targeted retrieval.

Development

Install from Source (uv)

git clone https://github.com/ocr-vector-db/ocr-vector-db.git
cd ocr-vector-db

# Install with uv (recommended)
uv sync --all-extras

# Or with pip
pip install -e ".[dev,test]"

Build Package

# Using uv
uv build

# Or using pip
python -m build

Run CLI (development)

# Using uv
uv run myrag --help
uv run myrag search "query"
uv run myrag ingest docs/*.pdf

# Or after pip install -e .
myrag --help

Run Tests

uv run pytest
# or
pytest

Sync Lock File

uv lock

License

MIT License - see LICENSE for details.
