textpress is an R toolkit for building text corpora and searching them -- no custom object classes, just plain data frames from start to finish. It covers the full arc from URL to retrieved passage through a consistent four-step API: Fetch, Read, Process, Search. Traditional tools (KWIC, BM25, dictionary matching) sit alongside modern ones (semantic search, LLM-ready chunking), all compatible with the native R pipe (|>).
From CRAN:

```r
install.packages("textpress")
```

Development version:

```r
remotes::install_github("jaytimm/textpress")
```

Conventions: a corpus is a data frame with a `text` column plus identifier column(s) passed to `by` (default `doc_id`). All outputs are plain data frames or data.tables; pipe-friendly.
Find URLs and metadata -- not full text. Pass results to read_urls() to get content.
- `fetch_urls(query, n_pages, date_filter)` -- Search engine query; returns candidate URLs with metadata.
- `fetch_wiki_urls(query, limit)` -- Wikipedia article URLs matching a search phrase.
- `fetch_wiki_refs(url, n)` -- External citation URLs from a Wikipedia article's References section.
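A minimal sketch of the fetch step, using the signatures listed above (the query strings and page counts here are illustrative, and defaults may differ):

```r
library(textpress)

# Search-engine query for candidate URLs -- metadata only, no full text
urls <- fetch_urls(query = "renewable energy policy", n_pages = 2)

# Wikipedia article URLs matching a search phrase
wiki <- fetch_wiki_urls(query = "large language model", limit = 5)
```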
Scrape and parse URLs into a structured corpus.
- `read_urls(urls, ...)` -- Character vector of URLs → `list(text, meta)`. `text` is one row per node (headings, paragraphs, lists); `meta` is one row per URL. For Wikipedia, `exclude_wiki_refs = TRUE` drops References / See also / Bibliography sections.
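Fetch output feeds directly into `read_urls()`; a sketch, assuming the fetch result exposes its URLs in a column named `url` (that column name is an assumption here):

```r
results <- read_urls(urls$url)

results$text  # one row per node: headings, paragraphs, lists
results$meta  # one row per URL
```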
Prepare text for search or indexing.
- `nlp_split_paragraphs()` -- Break documents into structural blocks.
- `nlp_split_sentences()` -- Segment blocks into individual sentences.
- `nlp_tokenize_text()` -- Normalize text into a clean token stream.
- `nlp_index_tokens()` -- Build a weighted BM25 index for ranked retrieval.
- `nlp_roll_chunks()` -- Roll sentences into fixed-size chunks with surrounding context (RAG-style).
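The processing steps chain with the native pipe; a sketch assuming each function accepts the previous step's data frame (the `chunk_size` argument name is an assumption, `context_size` is documented below):

```r
# Structural splitting and tokenization
sentences <- corpus |>
  nlp_split_paragraphs() |>
  nlp_split_sentences()

# Branch 1: a BM25 index for ranked retrieval
index <- sentences |>
  nlp_tokenize_text() |>
  nlp_index_tokens()

# Branch 2: RAG-style chunks with surrounding context
chunks <- sentences |>
  nlp_roll_chunks(chunk_size = 2, context_size = 1)
```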
Four retrieval modes over the same corpus. Data-first, pipe-friendly.
| Function | Query type | Use case |
|---|---|---|
| `search_regex(corpus, query)` | Regex pattern | Specific strings, KWIC with inline highlighting. |
| `search_dict(corpus, terms)` | Term vector | Exact phrases and MWEs; built-in `dict_generations`, `dict_political`. |
| `search_index(index, query)` | Keywords | BM25 ranked retrieval over a token index. |
| `search_vector(embeddings, query)` | Numeric vector | Semantic nearest-neighbor search; use `util_fetch_embeddings()` to embed. |
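The four modes share one call shape over the same corpus; a sketch (queries are illustrative, and the exact inputs `util_fetch_embeddings()` expects are an assumption):

```r
# Regex: specific strings, KWIC with highlighting
search_regex(corpus, query = "\\bclimate\\b")

# Dictionary: exact phrases and MWEs via a built-in term vector
search_dict(corpus, terms = dict_political)

# BM25: ranked retrieval over a token index
search_index(index, query = "renewable energy policy")

# Semantic: nearest neighbors in embedding space
embeddings <- util_fetch_embeddings(chunks)
search_vector(embeddings, query = "effects of drought on agriculture")
```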
textpress is designed to compose cleanly into retrieval-augmented generation pipelines.
Hybrid retrieval -- run search_index() and search_vector() over the same chunks, then merge with reciprocal rank fusion (RRF). Chunks that rank well under both term frequency and meaning rise to the top.
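The RRF merge itself needs only base R; a sketch that fuses two ranked id vectors, scoring each chunk by the sum of 1 / (k + rank) across lists (`chunk_id` is an assumed id column, and k = 60 is the conventional smoothing constant):

```r
rrf_fuse <- function(bm25_ids, vector_ids, k = 60) {
  ids <- union(bm25_ids, vector_ids)
  score <- sapply(ids, function(id) {
    s <- 0
    r1 <- match(id, bm25_ids)   # rank in the BM25 list, NA if absent
    r2 <- match(id, vector_ids) # rank in the semantic list, NA if absent
    if (!is.na(r1)) s <- s + 1 / (k + r1)
    if (!is.na(r2)) s <- s + 1 / (k + r2)
    s
  })
  # chunks ranked well by both term frequency and meaning rise to the top
  data.frame(chunk_id = ids, rrf = score)[order(-score), ]
}

rrf_fuse(c("a", "b", "c"), c("b", "d", "a"))
```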
Context assembly -- nlp_roll_chunks() with context_size > 0 gives each chunk a focal sentence plus surrounding context, so retrieved passages are self-contained when passed to an LLM.
Agent tool-calling -- the consistent API and plain data-frame outputs map naturally to tool use:
| Agent task | Function |
|---|---|
| "Find recent articles on X" | fetch_urls() |
| "Scrape these pages" | read_urls() |
| "Find all mentions of these entities" | search_dict() |
| "Follow citations from this Wikipedia article" | fetch_wiki_refs() |
- Web data -- `fetch_urls()` + `read_urls()`
- Basic NLP -- sentence splitting, tokenization, span-aware casting
- Wikipedia data -- `fetch_wiki_urls()` + `fetch_wiki_refs()`
- Regex search -- `search_regex()`, KWIC
- Dictionary search -- `search_dict()`, PMI co-occurrence
- Semantic search -- RAG pipeline: embeddings, BM25, hybrid RRF retrieval, LLM extraction
MIT © Jason Timm
```r
citation("textpress")
```

Report bugs or request features at https://github.com/jaytimm/textpress/issues