Embedding of Corpora, implemented in R.
Current development version: 0.3.0.
- Embeddings served by TEI (Text Embeddings Inference; Hugging Face).
- Embeddings via a backend-neutral interface (`hf`, `openai`, `tei`).
- Scoring: prototype cosine-distance + reference-area ridge score (`distance_ridge()` + `score_ridge()`) + threshold calibration.
- Works great with DuckDB/Arrow pipelines.
- Demo defaults now use a shared structure: `demos/openalex` and `demos/openai`.
- OpenAI demo uses an explicit two-phase async workflow:
  - render now (submit/poll briefly),
  - finalize later (status/collect/compare) if the batch is still pending.
- Added tutorial-style demo narratives with clearer explanation of workflow choices, cost/latency trade-offs, and interpretation of direct-vs-batch embedding differences.
See DEVELOPMENT_CONTINUITY.md for design principles, architectural
decisions, and the required pre-commit update checklist that keeps development
context continuous for both humans and AI agents.
```r
# install.packages("devtools")
devtools::install_local("openalexVectorComp")
```

Or build & install from the zip you downloaded.
- For `provider = "tei"`: `text-embeddings-router --model BAAI/bge-small-en-v1.5 --port 8080`
- For hosted embedding backends (`provider = "hf"` or `"openai"`), set `OVC_API_TOKEN` in your environment.
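For an interactive session, the token can also be set from within R; a minimal sketch (the token value is a placeholder — for a persistent setting, add an `OVC_API_TOKEN=...` line to your `~/.Renviron` instead):

```r
# Placeholder value -- substitute your real API token.
Sys.setenv(OVC_API_TOKEN = "your-api-token")

# Confirm the variable is visible to this R session.
nzchar(Sys.getenv("OVC_API_TOKEN"))
```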
Start with `vignettes/simplestart.qmd`, then see:

- `vignettes/package-overview.qmd`
- `vignettes/openai-batch-async.qmd`
- `vignettes/abstract-cleaning.qmd`
Create a full demo in `getwd()/demos/openalex` (fixtures + Quarto analysis):

```r
run_demo_openalex(
  demo_dir = file.path(getwd(), "demos", "openalex"),
  render = FALSE
)
```

Set `render = TRUE` to run `quarto render` directly. For hosted backends
(`provider = "hf"` or `"openai"`), set `OVC_API_TOKEN` first.

The Quarto file and backend YAML are placed in `demo_dir`, while all pipeline
artifacts are written under `demo_dir/project/`.
OpenAI-specific demo (same structure, explicit API key argument):

```r
run_demo_openai(
  api_key = Sys.getenv("OVC_API_TOKEN"),
  demo_dir = file.path(getwd(), "demos", "openai"),
  render = FALSE
)
```

The OpenAI demo now follows a two-phase async flow:

- `run_demo_openai(..., render = TRUE)` submits the batch and continues.
- If the batch is still pending, finalize later:

```r
demo_finalize_openai_batch(
  demo_dir = file.path(getwd(), "demos", "openai"),
  api_key = Sys.getenv("OVC_API_TOKEN"),
  label = "corpus_batch"
)
```

This writes comparison outputs to `project/openai_batch_comparison/label=corpus_batch/`.
For long-running OpenAI embedding jobs, use the async batch helpers:

```r
backend <- backend_config(
  provider = "openai",
  model = "text-embedding-3-small"
)

# 1) submit and return immediately
batch_submit_openai(
  project_dir = "my_project",
  backend = backend,
  label = "corpus"
)

# 2) check job status
batch_status_openai(
  project_dir = "my_project",
  label = "corpus"
)

# 3) collect completed jobs and write canonical embeddings parquet
batch_collect_openai(
  project_dir = "my_project",
  backend = backend,
  label = "corpus"
)
```

Preflight checks run before submission. Jobs are auto-split by size/count when needed. A single oversized request line stops the run with a clear error.
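For unattended runs, the status and collect steps can be combined into a simple polling loop. A sketch using only the helpers above — the `all(status$complete)` completion check is an illustrative assumption, not the documented return value of `batch_status_openai()`; inspect its actual return and adapt the condition:

```r
backend <- backend_config(
  provider = "openai",
  model = "text-embedding-3-small"
)

batch_submit_openai(project_dir = "my_project", backend = backend, label = "corpus")

repeat {
  status <- batch_status_openai(project_dir = "my_project", label = "corpus")
  # Illustrative completion check -- adapt to the real return structure.
  if (isTRUE(all(status$complete))) break
  Sys.sleep(60)  # poll once a minute
}

batch_collect_openai(project_dir = "my_project", backend = backend, label = "corpus")
```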
`distance_reference_cosine()` writes one parquet file:

`distance_reference_cosine/model_id=<...>/corpus_label=<...>/reference_label=<...>/pairwise-cosine.parquet`

This is a distance-only matrix with centroid axes:

- first column `id`: corpus ids plus one row `"centroid"` (the corpus centroid)
- remaining columns: reference ids plus one column `"centroid"` (the reference centroid)
- all values are cosine distances (`1 - cosine`)
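Because the output is plain parquet, it can be inspected directly with arrow. A minimal sketch — the partition values in the path are placeholders; substitute your actual `model_id` and labels:

```r
library(arrow)
library(dplyr)

# Placeholder path -- fill in your actual partition values.
d <- read_parquet(
  "my_project/distance_reference_cosine/model_id=.../corpus_label=.../reference_label=.../pairwise-cosine.parquet"
)

# Distance from each corpus document to the reference centroid,
# closest first; drops the corpus-centroid summary row.
d |>
  filter(id != "centroid") |>
  select(id, centroid) |>
  arrange(centroid)
```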
To convert this full distance matrix to scores:
```r
score_reference_cosine(
  distance_parquet = "my_project/distance_reference_cosine/model_id=.../corpus_label=.../reference_label=...",
  method = "linear"  # or "exponential"
)
```