Embedding of Corpora, implemented in R.
Current development version: 0.3.0.
- Embeddings served by TEI (Text Embeddings Inference; Hugging Face).
- Embeddings via a backend-neutral interface (`hf`, `openai`, `tei`).
- Scoring: prototype cosine-distance + reference-area ridge score (`distance_ridge()` + `score_ridge()`) + threshold calibration.
- Works great with DuckDB/Arrow pipelines.
- Demo defaults now use a shared structure: `demos/openalex` and `demos/openai`.
- OpenAI demo uses an explicit two-phase async workflow:
  - render now (submit/poll briefly),
  - finalize later (status/collect/compare) if the batch is still pending.
- Added tutorial-style demo narratives with clearer explanation of workflow choices, cost/latency trade-offs, and interpretation of direct-vs-batch embedding differences.
See DEVELOPMENT_CONTINUITY.md for design principles, architectural
decisions, and the required pre-commit update checklist that keeps development
context continuous for both humans and AI agents.
```r
# install.packages("devtools")
devtools::install_local("openalexVectorComp")
```

Or build & install from the zip you downloaded.
- For `provider = "tei"`: `text-embeddings-router --model BAAI/bge-small-en-v1.5 --port 8080`
- For hosted embedding backends (`provider = "hf"` or `"openai"`), set `OVC_API_TOKEN` in your environment.
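For an interactive session, the token can also be set from within R; a minimal sketch (the token value is a placeholder — for a persistent setting, add an `OVC_API_TOKEN=...` line to your `~/.Renviron` instead):

```r
# Placeholder value -- substitute your real API token.
Sys.setenv(OVC_API_TOKEN = "your-api-token")

# Confirm the variable is visible to this R session.
nzchar(Sys.getenv("OVC_API_TOKEN"))
```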
Start with `vignettes/simplestart.qmd`, then see:

- `vignettes/package-overview.qmd`
- `vignettes/openai-batch-async.qmd`
- `vignettes/abstract-cleaning.qmd`
Create a full demo in `getwd()/demos/openalex` (fixtures + Quarto analysis):

```r
run_demo_openalex(
  demo_dir = file.path(getwd(), "demos", "openalex"),
  render = FALSE
)
```

Set `render = TRUE` to run `quarto render` directly. For hosted backends
(`provider = "hf"` or `"openai"`), set `OVC_API_TOKEN` first.

The Quarto file and backend YAML are placed in `demo_dir`, while all pipeline
artifacts are written under `demo_dir/project/`.
OpenAI-specific demo (same structure, explicit API key argument):

```r
run_demo_openai(
  api_key = Sys.getenv("OVC_API_TOKEN"),
  demo_dir = file.path(getwd(), "demos", "openai"),
  render = FALSE
)
```

The OpenAI demo now follows a two-phase async flow:

- `run_demo_openai(..., render = TRUE)` submits the batch and continues.
- If the batch is still pending, finalize later:

```r
demo_finalize_openai_batch(
  demo_dir = file.path(getwd(), "demos", "openai"),
  api_key = Sys.getenv("OVC_API_TOKEN"),
  label = "corpus_batch"
)
```

This writes comparison outputs to `project/openai_batch_comparison/label=corpus_batch/`.
For long-running OpenAI embedding jobs, use the async batch helpers:

```r
backend <- backend_config(
  provider = "openai",
  model = "text-embedding-3-small"
)

# 1) submit and return immediately
batch_submit_openai(
  project_dir = "my_project",
  backend = backend,
  label = "corpus"
)

# 2) check job status
batch_status_openai(
  project_dir = "my_project",
  label = "corpus"
)

# 3) collect completed jobs and write canonical embeddings parquet
batch_collect_openai(
  project_dir = "my_project",
  backend = backend,
  label = "corpus"
)
```

Preflight checks run before submission. Jobs are auto-split by size/count when needed. A single oversized request line stops the run with a clear error.
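For unattended runs, the status and collect steps can be combined into a simple polling loop. A sketch using only the helpers above — the `all(status$complete)` completion check is an illustrative assumption, not the documented return value of `batch_status_openai()`; inspect its actual return and adapt the condition:

```r
backend <- backend_config(
  provider = "openai",
  model = "text-embedding-3-small"
)

batch_submit_openai(project_dir = "my_project", backend = backend, label = "corpus")

repeat {
  status <- batch_status_openai(project_dir = "my_project", label = "corpus")
  # Illustrative completion check -- adapt to the real return structure.
  if (isTRUE(all(status$complete))) break
  Sys.sleep(60)  # poll once a minute
}

batch_collect_openai(project_dir = "my_project", backend = backend, label = "corpus")
```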
`distance_reference_cosine()` writes one parquet file:

`distance_reference_cosine/model_id=<...>/corpus_label=<...>/reference_label=<...>/pairwise-cosine.parquet`

This is a distance-only matrix with centroid axes:

- first column `id`: corpus ids plus one row `"centroid"` (the corpus centroid)
- remaining columns: reference ids plus one column `"centroid"` (the reference centroid)
- all values are cosine distances (`1 - cosine`)
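Because the output is plain parquet, it can be inspected directly with arrow. A minimal sketch — the partition values in the path are placeholders; substitute your actual `model_id` and labels:

```r
library(arrow)
library(dplyr)

# Placeholder path -- fill in your actual partition values.
d <- read_parquet(
  "my_project/distance_reference_cosine/model_id=.../corpus_label=.../reference_label=.../pairwise-cosine.parquet"
)

# Distance from each corpus document to the reference centroid,
# closest first; drops the corpus-centroid summary row.
d |>
  filter(id != "centroid") |>
  select(id, centroid) |>
  arrange(centroid)
```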
To convert this full distance matrix to scores:
```r
score_reference_cosine(
  distance_parquet = "my_project/distance_reference_cosine/model_id=.../corpus_label=.../reference_label=...",
  method = "linear"  # or "exponential"
)
```