A document processing and semantic search system that parses documents (PDFs, Markdown, plain text), creates semantic embeddings, and stores them in PostgreSQL with pgvector for similarity search.
(This system was built with code-bearing technical documents in mind. For other kinds of documents, embedding accuracy will be lower.)
- Multi-format parsing: PDF (with OCR fallback), Markdown, plain text
- Semantic segmentation: Intelligent grouping of text, code, and images
- Multi-view embeddings: Separate embeddings for text, code, images, tables
- Parent-child hierarchy: Context-aware retrieval with parent documents
- RAG support: LLM-powered question answering over your documents
- PostgreSQL + pgvector: Scalable vector storage with HNSW indexing
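To make the similarity-search idea concrete, here is a minimal pure-Python sketch of ranking stored vectors against a query vector. It is only an illustration: in this project the search runs inside PostgreSQL through pgvector, and the distance metric it uses is not stated here. Cosine similarity is assumed as one common choice, and `cosine_similarity`, `top_k`, and the toy three-dimensional vectors are hypothetical names and data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, items, k=5):
    """Return the k (score, label) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in items.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data (assumption): real embeddings have hundreds of dimensions.
corpus = {
    "intro paragraph": [0.9, 0.1, 0.0],
    "code sample": [0.1, 0.9, 0.2],
    "results table": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], corpus, k=2)
for score, label in results:
    print(f"{score:.3f}  {label}")
```

A brute-force scan like this is linear in the number of vectors; an approximate index such as HNSW exists to avoid that cost on large collections.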
pip install jw-my-rag
uv add jw-my-rag

- Python 3.12+
- PostgreSQL with pgvector extension
- Google API key (for Gemini embeddings/LLM) or Voyage API key
Start PostgreSQL with pgvector using Docker:
docker run -d \
--name pgvector \
-e POSTGRES_USER=langchain \
-e POSTGRES_PASSWORD=langchain \
-e POSTGRES_DB=vectordb \
-p 5432:5432 \
pgvector/pgvector:pg16

Or use docker-compose (if provided in the repository).
Create a .env file:
# Required
PG_CONN=postgresql+psycopg://langchain:langchain@localhost:5432/vectordb
COLLECTION_NAME=my_documents
GOOGLE_API_KEY=your-api-key-here
EMBEDDING_PROVIDER=gemini

# Ingest PDF files
myrag ingest documents/*.pdf
# Ingest with dry-run (parse only)
myrag ingest report.pdf --dry-run

# Direct search
myrag search "vector database optimization"
# Search with filters
myrag search "async function" --view code --language javascript
# JSON output
myrag search "machine learning" --json

# Ask a question
myrag rag "What is the main topic of this document?"
# With sources
myrag rag "How does the authentication work?" --sources

# Start search REPL
myrag
# Start RAG REPL
myrag --rag

Start the interactive REPL for search or RAG queries.
myrag # Search mode
myrag --rag # RAG mode (LLM-powered)
myrag --view code # Default filter

Run a single search query.
myrag search "query" [options]
Options:
--view {text,code,image,caption,table,figure} Filter by content type
--language LANG Filter by programming language
--top-k N, -k N Number of results (default: 5)
--no-context Disable parent context expansion
--json Output JSON format
--verbose, -v Enable verbose logging

Ingest documents into the vector database.
myrag ingest FILE [FILE ...] [options]
Options:
--dry-run Parse only, no database writes
--no-cache Disable OCR cache (re-process all pages)

Ask a question using RAG (Retrieval-Augmented Generation).
myrag rag "question" [options]
Options:
--view {text,code,image,caption,table,figure} Filter by content type
--language LANG Filter by programming language
--top-k N, -k N Number of context documents (default: 5)
--sources Show source documents in response
--verbose, -v Enable verbose logging

Inspect data quality and statistics.
myrag quality

Output includes:
- Document/concept/fragment/embedding counts
- View distribution
- Orphan entity check
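The kind of checks listed above can be sketched in a few lines of Python. This is not the implementation behind `myrag quality`; it assumes that an "orphan" is a child record whose parent id does not exist, and the record shapes, field names (`concept_id`, `view`), and the helper `find_orphans` are hypothetical.

```python
from collections import Counter


def find_orphans(children, parent_ids, parent_key):
    """Return ids of child records whose parent id is not in parent_ids."""
    return [child["id"] for child in children if child[parent_key] not in parent_ids]


# Toy records (assumption): the real tables and column names may differ.
concepts = [{"id": "c1", "document_id": "d1"}]
fragments = [
    {"id": "f1", "concept_id": "c1", "view": "text"},
    {"id": "f2", "concept_id": "c1", "view": "code"},
    {"id": "f3", "concept_id": "c9", "view": "text"},  # parent concept is missing
]

concept_ids = {concept["id"] for concept in concepts}
orphan_fragments = find_orphans(fragments, concept_ids, "concept_id")
view_distribution = Counter(fragment["view"] for fragment in fragments)

print("fragments:", len(fragments))
print("view distribution:", dict(view_distribution))
print("orphan fragments:", orphan_fragments)
```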
All settings are configured via environment variables. See .env.example for the complete list.
| Variable | Description | Default |
|---|---|---|
| PG_CONN | PostgreSQL connection string | (required) |
| COLLECTION_NAME | Vector store collection name | (required) |
| EMBEDDING_PROVIDER | gemini or voyage | voyage |
| GOOGLE_API_KEY | Google API key for Gemini | - |
| VOYAGE_API_KEY | Voyage AI API key | - |
| EMBEDDING_DIM | Embedding dimension | 768 |
| PARENT_MODE | Grouping mode: unit, page, section, page_section | unit |
| ENABLE_IMAGE_OCR | Enable Gemini Vision OCR for PDFs | true |
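As a rough illustration of how these variables and defaults fit together, the sketch below reads them from a mapping (or `os.environ`) into a typed object. It is not the package's own configuration code: the `Settings` class, `load_settings`, and the set of strings accepted as "true" are assumptions; only the variable names, allowed values, and defaults come from the table.

```python
import os
from dataclasses import dataclass

VALID_PROVIDERS = {"gemini", "voyage"}
VALID_PARENT_MODES = {"unit", "page", "section", "page_section"}


@dataclass(frozen=True)
class Settings:
    pg_conn: str
    collection_name: str
    embedding_provider: str = "voyage"
    embedding_dim: int = 768
    parent_mode: str = "unit"
    enable_image_ocr: bool = True


def load_settings(env=None):
    """Build Settings from a mapping of environment variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in ("PG_CONN", "COLLECTION_NAME") if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")

    provider = env.get("EMBEDDING_PROVIDER", "voyage")
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"EMBEDDING_PROVIDER must be one of {sorted(VALID_PROVIDERS)}")

    parent_mode = env.get("PARENT_MODE", "unit")
    if parent_mode not in VALID_PARENT_MODES:
        raise ValueError(f"PARENT_MODE must be one of {sorted(VALID_PARENT_MODES)}")

    # Assumption: these spellings count as "true"; the real parser may differ.
    ocr_flag = env.get("ENABLE_IMAGE_OCR", "true").strip().lower() in {"1", "true", "yes"}

    return Settings(
        pg_conn=env["PG_CONN"],
        collection_name=env["COLLECTION_NAME"],
        embedding_provider=provider,
        embedding_dim=int(env.get("EMBEDDING_DIM", "768")),
        parent_mode=parent_mode,
        enable_image_ocr=ocr_flag,
    )


example = load_settings({
    "PG_CONN": "postgresql+psycopg://langchain:langchain@localhost:5432/vectordb",
    "COLLECTION_NAME": "my_documents",
    "EMBEDDING_PROVIDER": "gemini",
})
print(example)
```

Validating the required variables up front means a missing `PG_CONN` fails at startup with a clear message rather than later, at the first database call.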
Document -> Concept -> Fragment -> Embedding
- Document: Source file (PDF, Markdown, text)
- Concept: Semantic unit (related paragraphs, code blocks)
- Fragment: Individual piece of content (text paragraph, code block, image)
- Embedding: Vector representation for similarity search
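The hierarchy above can be pictured as nested records. The dataclasses below are a minimal sketch under assumed field names (`source_path`, `view`, `content`, `vector`); they are not the package's actual models. The helper `parent_context` shows one plausible reading of parent-context retrieval: when a fragment matches, return every fragment of its concept.

```python
from dataclasses import dataclass, field


@dataclass
class Embedding:
    vector: list[float]


@dataclass
class Fragment:
    view: str  # e.g. "text" or "code"
    content: str
    embeddings: list[Embedding] = field(default_factory=list)


@dataclass
class Concept:
    fragments: list[Fragment] = field(default_factory=list)


@dataclass
class Document:
    source_path: str
    concepts: list[Concept] = field(default_factory=list)


def parent_context(concept: Concept) -> str:
    """Join all fragment contents of a concept (hypothetical context expansion)."""
    return "\n\n".join(fragment.content for fragment in concept.fragments)


# Toy document (assumption): one concept holding a paragraph and its code block.
doc = Document(
    source_path="guide.md",
    concepts=[
        Concept(fragments=[
            Fragment(view="text", content="Call connect() first.",
                     embeddings=[Embedding(vector=[0.1, 0.2])]),
            Fragment(view="code", content="conn = connect()",
                     embeddings=[Embedding(vector=[0.3, 0.4])]),
        ])
    ],
)
print(parent_context(doc.concepts[0]))
```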
Documents are segmented into distinct views:
- text: Natural language paragraphs
- code: Code blocks with language detection
- image: Image references with alt text
- caption: Figure/table captions
- table, figure: Structured content
Each view can be filtered during search for targeted retrieval.
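For scripted use, a view-filtered search can be assembled from the flags documented in the command reference. The helper below is a hypothetical convenience wrapper, not part of the package; it only builds the argument list and does not run anything.

```python
import shlex

# View names as listed for the --view option.
VIEWS = ("text", "code", "image", "caption", "table", "figure")


def build_search_command(query, view=None, language=None, top_k=None,
                         no_context=False, as_json=False):
    """Return the argv list for a `myrag search` call using documented flags only."""
    if view is not None and view not in VIEWS:
        raise ValueError(f"view must be one of {VIEWS}")
    cmd = ["myrag", "search", query]
    if view:
        cmd += ["--view", view]
    if language:
        cmd += ["--language", language]
    if top_k is not None:
        cmd += ["--top-k", str(top_k)]
    if no_context:
        cmd.append("--no-context")
    if as_json:
        cmd.append("--json")
    return cmd


cmd = build_search_command("async function", view="code", language="javascript")
print(shlex.join(cmd))
# prints: myrag search 'async function' --view code --language javascript
```

The resulting list can be passed directly to `subprocess.run(cmd)`, which avoids shell-quoting problems with queries that contain spaces.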
git clone https://github.com/ocr-vector-db/ocr-vector-db.git
cd ocr-vector-db
# Install with uv (recommended)
uv sync --all-extras
# Or with pip
pip install -e ".[dev,test]"

# Using uv
uv build
# Or using pip
python -m build

# Using uv
uv run myrag --help
uv run myrag search "query"
uv run myrag ingest docs/*.pdf
# Or after pip install -e .
myrag --help

uv run pytest
# or
pytest

uv lock

MIT License - see LICENSE for details.