Add native vector search #683

Draft
artpi wants to merge 7 commits into WordPress:develop from artpi:rag-mariadb

Conversation

artpi commented Jun 8, 2026

Summary

This introduces native RAG search for hosts that can support it:

New Infrastructure:

  • Add an opt-in rag-search experiment backed by MariaDB native VECTOR(1536) storage and a cosine VECTOR INDEX. Gates activation on MariaDB 11.8+ and an authenticated OpenAI connector, using text-embedding-3-small embeddings.
  • Add a fallback that stores embeddings in post meta and searches them in memory. This should be performant enough for smaller sites.
  • Add indexing lifecycle hooks, one-hour dirty-post cron scheduling, short batch indexing, transactional replace-by-post behavior where possible, cleanup hooks, and wp ai rag WP-CLI utilities.
  • Move to embeddings supplied by the WP AI Client once that is available.
[Screenshot 2026-06-8 at 20:04:37]
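
A rough sketch of how the in-memory fallback could rank posts, assuming the embeddings have already been loaded from post meta into a plain `post ID => vector` array. The function names `rag_cosine_similarity` and `rag_memory_search` are hypothetical and are not taken from the PR; the real implementation may differ.

```php
<?php
declare(strict_types=1);

/**
 * Cosine similarity between two equal-length numeric vectors.
 * Hypothetical helper, not the PR's actual code.
 */
function rag_cosine_similarity( array $a, array $b ): float {
	if ( count( $a ) !== count( $b ) || 0 === count( $a ) ) {
		throw new InvalidArgumentException( 'Vectors must be non-empty and of equal length.' );
	}

	$a      = array_values( $a );
	$b      = array_values( $b );
	$dot    = 0.0;
	$norm_a = 0.0;
	$norm_b = 0.0;

	foreach ( $a as $i => $value ) {
		$dot    += $value * $b[ $i ];
		$norm_a += $value * $value;
		$norm_b += $b[ $i ] * $b[ $i ];
	}

	// A zero vector has no direction, so treat it as unrelated.
	if ( 0.0 === $norm_a || 0.0 === $norm_b ) {
		return 0.0;
	}

	return $dot / ( sqrt( $norm_a ) * sqrt( $norm_b ) );
}

/**
 * Brute-force ranking of stored embeddings against a query embedding.
 *
 * @param array $query              Query embedding.
 * @param array $embeddings_by_post Map of post ID => embedding.
 * @param int   $limit              Maximum number of results.
 * @return array Map of post ID => similarity score, best match first.
 */
function rag_memory_search( array $query, array $embeddings_by_post, int $limit = 20 ): array {
	$scores = array();
	foreach ( $embeddings_by_post as $post_id => $vector ) {
		$scores[ $post_id ] = rag_cosine_similarity( $query, $vector );
	}

	arsort( $scores ); // Sort descending by score, keeping post IDs as keys.

	return array_slice( $scores, 0, max( 0, $limit ), true );
}

// Toy 3-dimensional vectors; real embeddings would have 1536 dimensions.
$results = rag_memory_search(
	array( 1.0, 0.0, 0.0 ),
	array(
		11 => array( 0.9, 0.1, 0.0 ),
		12 => array( 0.0, 1.0, 0.0 ),
		13 => array( 0.5, 0.5, 0.0 ),
	),
	2
);
```

Every query scans all stored vectors, so cost grows linearly with the number of chunks, which is why this approach is only expected to hold up on smaller sites.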

User facing features:

  • Augment the public search page to integrate the RAG results into the search results
  • Introduce the "related posts" functionality just like in Jetpack - for SEO
[Screenshot 2026-06-8 at 20:01:13]

Rationale

MariaDB 11.8 LTS ships a community-available vector index implementation, so WordPress can offer a lean RAG search experiment without introducing another runtime service or PHP dependency.
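
To make the native path concrete, here is a minimal sketch of the kind of SQL involved, based on my understanding of MariaDB's documented vector syntax (`VECTOR(N)`, `VECTOR INDEX ... DISTANCE=cosine`, `VEC_FromText`, `VEC_DISTANCE_COSINE`). The table and column names and the helper functions are hypothetical; this is not the schema from the PR, which stores more columns.

```php
<?php
declare(strict_types=1);

// Dimension count from the PR description (text-embedding-3-small).
const RAG_DIMENSIONS = 1536;

/**
 * Build a CREATE TABLE statement with a cosine vector index.
 * $table must be a trusted, plugin-owned table name (it is not escaped here).
 */
function rag_create_table_sql( string $table ): string {
	return sprintf(
		'CREATE TABLE %s (
	chunk_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
	post_id BIGINT UNSIGNED NOT NULL,
	embedding VECTOR(%d) NOT NULL,
	VECTOR INDEX (embedding) DISTANCE=cosine
)',
		$table,
		RAG_DIMENSIONS
	);
}

/**
 * Serialize an embedding as the JSON-style array text VEC_FromText() accepts.
 * Only finite numbers are allowed, so the result is safe to inline in SQL.
 */
function rag_vector_literal( array $vector ): string {
	$parts = array();
	foreach ( $vector as $value ) {
		if ( ( ! is_int( $value ) && ! is_float( $value ) ) || ! is_finite( (float) $value ) ) {
			throw new InvalidArgumentException( 'Embedding values must be finite numbers.' );
		}
		$parts[] = sprintf( '%.8F', $value ); // %F is locale-independent.
	}

	return '[' . implode( ',', $parts ) . ']';
}

/**
 * Build a nearest-neighbour query ordered by cosine distance.
 */
function rag_nearest_sql( string $table, array $query_vector, int $limit ): string {
	return sprintf(
		"SELECT post_id FROM %s ORDER BY VEC_DISTANCE_COSINE(embedding, VEC_FromText('%s')) LIMIT %d",
		$table,
		rag_vector_literal( $query_vector ),
		max( 1, $limit )
	);
}

// Toy 2-dimensional query vector, only to show the generated SQL shape.
$sql = rag_nearest_sql( 'wp_ai_rag_chunks', array( 0.5, -1 ), 5 );
```

In a real WordPress context the statements would go through `$wpdb` (with `$wpdb->prepare()` for values); the string builders here only illustrate the SQL shape.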

Unfortunately, MySQL 9 does not have vector search capabilities, despite being able to store vectors.

The feature is intentionally gated for supporting hosts: it only becomes operational when the site has both a supported MariaDB version and an authenticated OpenAI connector capable of producing the fixed 1536-dimensional embedding shape used by the first schema.
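
The gate described above could be sketched roughly as follows. This assumes the server version string looks like the one reported in the testing notes (`12.3.2-MariaDB-ubu2404`); the function names are hypothetical, and the connector check is reduced to a boolean that the caller is assumed to supply.

```php
<?php
declare(strict_types=1);

/**
 * Check whether a server version string identifies MariaDB at or above a minimum.
 * Hypothetical helper, not the PR's Availability class.
 */
function rag_is_supported_mariadb( string $server_version, string $minimum = '11.8' ): bool {
	// Take the version number immediately before the "-MariaDB" marker.
	// This also handles legacy strings such as "5.5.5-10.6.12-MariaDB".
	if ( 1 !== preg_match( '/(\d+\.\d+(?:\.\d+)?)-MariaDB/i', $server_version, $matches ) ) {
		return false; // Not MariaDB (for example, MySQL).
	}

	return version_compare( $matches[1], $minimum, '>=' );
}

/**
 * The native path needs both a supported database and an authenticated connector.
 */
function rag_native_available( string $server_version, bool $connector_authenticated ): bool {
	return $connector_authenticated && rag_is_supported_mariadb( $server_version );
}

$available = rag_native_available( '12.3.2-MariaDB-ubu2404', true );
```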

Testing

Manual default wp-env smoke test:

source ~/.nvm/nvm.sh
nvm use 22.21.1
npm run wp-env -- start
npm run wp-env -- run cli wp ai rag status

Environment observed:

WordPress: http://localhost:8888
MariaDB: 12.3.2-MariaDB-ubu2404
RAG available: yes
Index table: present
  • Connect OpenAI key in the default wp-env site connectors screen
  • Create a few posts that are semantically similar to a specific topic but do not contain the keywords. For example, posts about tables being broken, but NOT containing the words "furniture" or "shaking".

Then run the processing manually, since chunking only happens once an hour. You can chunk the posts initially with WP-CLI:

npm run wp-env -- run cli wp ai rag index --all --batch-size=50
npm run wp-env -- run cli wp ai rag status

Result:

Processed 8 post(s): 8 clean, 0 error, 0 removed. 0 dirty post(s) remain.
RAG index rows: 8
Indexed posts: 8

Semantic search smoke test:

http://localhost:8888/?s=how+do+I+stop+furniture+from+shaking
First result: RAG fixture ... cafe table leveling

A raw keyword lookup for furniture / shaking should return no matching posts, while the semantic search API returned the cafe-table post first.

[Screenshot 2026-06-8 at 12:17:00]

artpi changed the title from "[codex] Add MariaDB vector RAG search" to "Add native MariaDB vector RAG search" Jun 8, 2026
@JasonTheAdams

Member

This is neat! I really like the idea! Let's get some user-facing UI in here so folks can enjoy the benefits, so it isn't purely technical.

@jeffpaul @dkotter I'm curious what you think about having an experiment that introduces additional tables? Do we have any other experiments doing this?

I'm going to suggest we use this to push along the Embeddings PR to have embeddings support in the AI Client in 7.1. I'm not keen on introducing experiments that have bespoke methods for calling AI and are coupled to specific providers. Not because that's generally "bad", but because a goal of this plugin is to be a good example to developers for using the AI Client, and this would be such a cool example.

Thanks, @artpi! 🙌

jeffpaul commented Jun 8, 2026

Member

> Execution calls OpenAI embeddings directly for now because AI Client exposes provider/model discovery but does not yet expose embedding generation execution. The implementation keeps that boundary isolated in OpenAI_Embedding_Client so it can move to AI Client execution later.

This rationale makes sense to me, let's ensure there's a TODO in the code to remove the OAI requirement once the AI Client allows us to utilize any Connector with embeddings support.

@JasonTheAdams I'm fine with adding tables, though we may want to ensure there's cleanup when someone disables the experiment that either immediately removes those tables or offers to do so for the user. If this ever looks to move out of experiment towards feature we can discuss more seriously about the table needs and structure.

dkotter commented Jun 8, 2026

Collaborator

> I'm not keen on introducing experiments that have bespoke methods for calling AI and are coupled to specific providers. Not because that's generally "bad", but because a goal of this plugin is to be a good example to developers for using the AI Client, and this would be such a cool example.

If I'm reading this correctly, is the idea that this PR should wait until embedding support is in the AI Client?

At a quick glance, my other concern here is the limitation to MariaDB 11.8. I'm not sure if there are numbers out there we can look at (in terms of what most WordPress sites run on), but I'm assuming this limits the number of people who would even be able to use this. Is it worth considering a fallback approach that would work for a larger percentage of users? Basically, store the embeddings in a custom table and then do our own calculations server-side? There are performance issues to consider with that, but in my testing it does scale decently well.
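
One factor in how well a PHP-side fallback scales is how the vectors are serialized. As an illustration only (it is an assumption, not what the PR or the testing mentioned above does), an embedding can be packed as little-endian 32-bit floats and base64-encoded so it fits in a text column or meta value. The helper names are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Pack an embedding as little-endian float32 values, base64-encoded for text storage.
 * Values are rounded to float32 precision.
 */
function rag_pack_embedding( array $vector ): string {
	if ( 0 === count( $vector ) ) {
		throw new InvalidArgumentException( 'Embedding must not be empty.' );
	}

	// 'g' is a 32-bit float in little-endian byte order; '*' repeats for all values.
	return base64_encode( pack( 'g*', ...array_values( $vector ) ) );
}

/**
 * Reverse of rag_pack_embedding(): returns a zero-indexed list of floats.
 */
function rag_unpack_embedding( string $encoded ): array {
	$binary = base64_decode( $encoded, true );
	if ( false === $binary || '' === $binary || 0 !== strlen( $binary ) % 4 ) {
		throw new InvalidArgumentException( 'Not a packed float32 embedding.' );
	}

	// unpack() returns a 1-indexed array, so re-index it.
	return array_values( unpack( 'g*', $binary ) );
}

// Values chosen to be exactly representable as float32.
$encoded = rag_pack_embedding( array( 0.5, -0.25, 1.0 ) );
$decoded = rag_unpack_embedding( $encoded );

// A 1536-dimensional vector takes 1536 * 4 = 6144 bytes before base64.
$bytes = strlen( base64_decode( rag_pack_embedding( array_fill( 0, 1536, 0.0 ) ), true ) );
```

At 6144 bytes per chunk (8192 characters after base64), this is noticeably smaller than a JSON array of 1536 decimal numbers, at the cost of float32 rounding.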

artpi changed the title from "Add native MariaDB vector RAG search" to "Add native vector RAG search" Jun 8, 2026
artpi changed the title from "Add native vector RAG search" to "Add native vector search" Jun 8, 2026
*
* @since 1.1.0
*/
public function render_unavailable_notice(): void {

Author

Do we need this now that we have a fallback?

Comment thread includes/RAG/Index_Repository.php Outdated
*
* @since 1.1.0
*/
class Index_Repository implements Index_Repository_Interface {

Author

Should this be renamed to MariaDB_Index_Repository?

/**
* MariaDB vector index backend.
*/
public const BACKEND_MARIADB = 'mariadb';

Author

Should this constant live in the appropriate index repository classes?
Maybe Availability shouldn't concern itself with the details of each index repository?

Comment thread includes/RAG/Availability.php Outdated
/**
* Required embedding model.
*/
private const EMBEDDING_MODEL = 'text-embedding-3-small';

Author

Should this live in the embedding client class?
Let's move all details about how embedding works into the embedding client.

Comment thread includes/RAG/Availability.php Outdated
}

if ( empty( $this->get_available_index_backends() ) ) {
$this->unavailable_reason = __( 'MariaDB 11.8 or newer is required when the compact memory fallback is disabled.', 'ai' );

Author

This sounds like it should also live in the index repository class.

Comment thread includes/RAG/Index_Repository.php Outdated

$defaults = array(
'limit' => 20,
'model' => 'text-embedding-3-small',

Author

Should this come from the configured embedding client?

Comment thread includes/RAG/Index_Repository.php Outdated
// phpcs:disable WordPress.DB.PreparedSQL.InterpolatedNotPrepared -- Dynamic table name targets the owned vector index table.
$sql = $wpdb->prepare(
"INSERT INTO {$table_name}
(post_id, post_type, post_status, chunk_id, chunk_index, chunk_offset, anchor, title, permalink, content, content_hash, embedding, embedding_model, embedding_dimensions, indexed_at)

Author

If we have the model, do we need the dimensions?
Do we need anchor and title? What for? We should be able to infer everything from the post ID.

Comment thread includes/RAG/Index_Schema.php Outdated
*
* @since 1.1.0
*/
class Index_Schema {

Author

This index schema is only used in Index_Repository, which should be MariaDB_Index_Repository, and it is tied to that implementation. Should it live there instead?

$this->validate_embedding( $embedding );

$records[] = array(
'post_id' => (int) $post->ID,

Author

Why are we storing post data in the meta of the post itself?!
We already have this data from the post!
Even the title?!

Author

Can we store the absolute minimum necessary for these chunks?

* @param \WP_Post $post Post object.
* @return list<array{chunk_id:string, chunk_index:int, chunk_offset:int, anchor:string|null, title:string, permalink:string, content:string}> Chunk records.
*/
public function chunk_post( WP_Post $post ): array {

Author

Would it make sense for every chunk to include the title and excerpt of the post? I am not sure what to do if the excerpt is long... maybe only the title?
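
One way the title-only variant could look, as a sketch: split the content into fixed-size word windows and prepend the title only to the text sent for embedding, so the title itself never needs to be stored with the chunk. The function name, the `embedding_input` key, and the word-based splitting are all assumptions for illustration; the PR's `chunk_post()` may split differently.

```php
<?php
declare(strict_types=1);

/**
 * Split plain-text content into word-window chunks and build the embedding input
 * for each chunk by prefixing the post title.
 *
 * @param string $title     Post title.
 * @param string $content   Plain-text post content.
 * @param int    $max_words Maximum words per chunk (hypothetical default).
 * @return array List of chunks with chunk_index, content and embedding_input.
 */
function rag_chunk_with_title( string $title, string $content, int $max_words = 200 ): array {
	if ( $max_words < 1 ) {
		throw new InvalidArgumentException( 'max_words must be at least 1.' );
	}

	$words = preg_split( '/\s+/', trim( $content ), -1, PREG_SPLIT_NO_EMPTY );
	if ( false === $words || 0 === count( $words ) ) {
		return array();
	}

	$chunks = array();
	foreach ( array_chunk( $words, $max_words ) as $index => $slice ) {
		$text     = implode( ' ', $slice );
		$chunks[] = array(
			'chunk_index'     => $index,
			'content'         => $text,
			// Only the text sent to the embedding model carries the title.
			'embedding_input' => $title . "\n\n" . $text,
		);
	}

	return $chunks;
}

$chunks = rag_chunk_with_title( 'Cafe table leveling', 'one two three four five', 2 );
```

Because the title is short and bounded, it avoids the long-excerpt problem while still giving every chunk some document-level context.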

4 participants