textpress is an R toolkit for building text corpora and searching them -- no custom object classes, just plain data frames from start to finish. It covers the full arc from URL to retrieved passage through a consistent four-step API: Fetch, Read, Process, Search. Traditional tools (KWIC, BM25, dictionary matching) sit alongside modern ones (semantic search, LLM-ready chunking), all compatible with the native R pipe (|>).
From CRAN:

```r
install.packages("textpress")
```

Development version:

```r
remotes::install_github("jaytimm/textpress")
```

Conventions: a corpus is a data frame with a `text` column plus identifier column(s) passed to `by` (default `doc_id`). All outputs are plain data frames or data.tables; pipe-friendly.
Find URLs and metadata -- not full text. Pass results to read_urls() to get content.
- `fetch_urls(query, n_pages, date_filter)` -- Search engine query; returns candidate URLs with metadata.
- `fetch_wiki_urls(query, limit)` -- Wikipedia article URLs matching a search phrase.
- `fetch_wiki_refs(url, n)` -- External citation URLs from a Wikipedia article's References section.
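A minimal sketch of the fetch step, using the signatures listed above (the query strings and page counts here are illustrative, and defaults may differ):

```r
library(textpress)

# Search-engine query for candidate URLs -- metadata only, no full text
urls <- fetch_urls(query = "renewable energy policy", n_pages = 2)

# Wikipedia article URLs matching a search phrase
wiki <- fetch_wiki_urls(query = "large language model", limit = 5)
```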
Scrape and parse URLs into a structured corpus.
- `read_urls(urls, ...)` -- Character vector of URLs → `list(text, meta)`. `text` is one row per node (headings, paragraphs, lists); `meta` is one row per URL. For Wikipedia, `exclude_wiki_refs = TRUE` drops References / See also / Bibliography sections.
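Fetch output feeds directly into `read_urls()`; a sketch, assuming the fetch result exposes its URLs in a column named `url` (that column name is an assumption here):

```r
results <- read_urls(urls$url)

results$text  # one row per node: headings, paragraphs, lists
results$meta  # one row per URL
```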
Prepare text for search or indexing.
- `nlp_split_paragraphs()` -- Break documents into structural blocks.
- `nlp_split_sentences()` -- Segment blocks into individual sentences.
- `nlp_tokenize_text()` -- Normalize text into a clean token stream.
- `nlp_index_tokens()` -- Build a weighted BM25 index for ranked retrieval.
- `nlp_roll_chunks()` -- Roll sentences into fixed-size chunks with surrounding context (RAG-style).
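The processing steps chain with the native pipe; a sketch assuming each function accepts the previous step's data frame (the `chunk_size` argument name is an assumption, `context_size` is documented below):

```r
# Structural splitting and tokenization
sentences <- corpus |>
  nlp_split_paragraphs() |>
  nlp_split_sentences()

# Branch 1: a BM25 index for ranked retrieval
index <- sentences |>
  nlp_tokenize_text() |>
  nlp_index_tokens()

# Branch 2: RAG-style chunks with surrounding context
chunks <- sentences |>
  nlp_roll_chunks(chunk_size = 2, context_size = 1)
```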
Four retrieval modes over the same corpus. Data-first, pipe-friendly.
| Function | Query type | Use case |
|---|---|---|
| `search_regex(corpus, query)` | Regex pattern | Specific strings, KWIC with inline highlighting. |
| `search_dict(corpus, terms)` | Term vector | Exact phrases and MWEs; built-in `dict_generations`, `dict_political`. |
| `search_index(index, query)` | Keywords | BM25 ranked retrieval over a token index. |
| `search_vector(embeddings, query)` | Numeric vector | Semantic nearest-neighbor search; use `util_fetch_embeddings()` to embed. |
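The four modes share one call shape over the same corpus; a sketch (queries are illustrative, and the exact inputs `util_fetch_embeddings()` expects are an assumption):

```r
# Regex: specific strings, KWIC with highlighting
search_regex(corpus, query = "\\bclimate\\b")

# Dictionary: exact phrases and MWEs via a built-in term vector
search_dict(corpus, terms = dict_political)

# BM25: ranked retrieval over a token index
search_index(index, query = "renewable energy policy")

# Semantic: nearest neighbors in embedding space
embeddings <- util_fetch_embeddings(chunks)
search_vector(embeddings, query = "effects of drought on agriculture")
```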
textpress is designed to compose cleanly into retrieval-augmented generation pipelines.
Hybrid retrieval -- run search_index() and search_vector() over the same chunks, then merge with reciprocal rank fusion (RRF). Chunks that rank well under both term frequency and meaning rise to the top.
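The RRF merge itself needs only base R; a sketch that fuses two ranked id vectors, scoring each chunk by the sum of 1 / (k + rank) across lists (`chunk_id` is an assumed id column, and k = 60 is the conventional smoothing constant):

```r
rrf_fuse <- function(bm25_ids, vector_ids, k = 60) {
  ids <- union(bm25_ids, vector_ids)
  score <- sapply(ids, function(id) {
    s <- 0
    r1 <- match(id, bm25_ids)   # rank in the BM25 list, NA if absent
    r2 <- match(id, vector_ids) # rank in the semantic list, NA if absent
    if (!is.na(r1)) s <- s + 1 / (k + r1)
    if (!is.na(r2)) s <- s + 1 / (k + r2)
    s
  })
  # chunks ranked well by both term frequency and meaning rise to the top
  data.frame(chunk_id = ids, rrf = score)[order(-score), ]
}

rrf_fuse(c("a", "b", "c"), c("b", "d", "a"))
```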
Context assembly -- nlp_roll_chunks() with context_size > 0 gives each chunk a focal sentence plus surrounding context, so retrieved passages are self-contained when passed to an LLM.
Agent tool-calling -- the consistent API and plain data-frame outputs map naturally to tool use:
| Agent task | Function |
|---|---|
| "Find recent articles on X" | fetch_urls() |
| "Scrape these pages" | read_urls() |
| "Find all mentions of these entities" | search_dict() |
| "Follow citations from this Wikipedia article" | fetch_wiki_refs() |
- Web data -- `fetch_urls()` + `read_urls()`
- Basic NLP -- sentence splitting, tokenization, span-aware casting
- Wikipedia data -- `fetch_wiki_urls()` + `fetch_wiki_refs()`
- Regex search -- `search_regex()`, KWIC
- Dictionary search -- `search_dict()`, PMI co-occurrence
- Semantic search -- RAG pipeline: embeddings, BM25, hybrid RRF retrieval, LLM extraction
MIT © Jason Timm
```r
citation("textpress")
```

Report bugs or request features at https://github.com/jaytimm/textpress/issues