Add native vector search #683
This is neat! I really like the idea! Let's get some user-facing UI in here so folks can enjoy the benefits, so it isn't purely technical.
@jeffpaul @dkotter I'm curious what you think about having an experiment that introduces additional tables? Do we have any other experiments doing this?
I'm going to suggest we use this to push along the Embeddings PR to have embeddings support in the AI Client in 7.1. I'm not keen on introducing experiments that have bespoke methods for calling AI and are coupled to specific providers. Not because that's generally "bad", but because a goal of this plugin is to be a good example to developers for using the AI Client, and this would be such a cool example.
Thanks, @artpi! 🙌
This rationale makes sense to me; let's ensure there's a TODO in the code to remove the OAI requirement once the AI Client allows us to utilize any Connector with embeddings support.
@JasonTheAdams I'm fine with adding tables, though we may want to ensure there's cleanup when someone disables the experiment that either immediately removes those tables or offers to do so for the user. If this ever looks to move from experiment towards a feature, we can discuss the table needs and structure more seriously.
If I'm reading this correctly, is the idea that this PR should wait until embedding support is in the AI Client?
At a quick glance, my other concern here is the limitation to MariaDB 11.8. I'm not sure if there are numbers out there we can look at (in terms of what most WordPress sites run on), but I'm assuming this limits the number of people that would even be able to use this. Is it worth considering a fallback approach that would work for a larger percentage of users? Basically, store the embeddings in a custom table and then do our own calculations server-side? There are performance issues to consider, but in my testing it does scale decently well.
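For illustration, the server-side fallback suggested above could be sketched roughly as follows. This is a minimal sketch, not code from the PR: the function names `cosine_similarity` and `rank_chunks` are hypothetical, and it assumes the embeddings have already been read from the custom table into plain PHP float arrays keyed by chunk ID.

```php
<?php
declare(strict_types=1);

/**
 * Cosine similarity between two equal-length numeric vectors.
 * Hypothetical helper, not part of the PR.
 */
function cosine_similarity( array $a, array $b ): float {
	$a = array_values( $a );
	$b = array_values( $b );

	if ( count( $a ) === 0 || count( $a ) !== count( $b ) ) {
		throw new InvalidArgumentException( 'Vectors must be non-empty and of equal length.' );
	}

	$dot    = 0.0;
	$norm_a = 0.0;
	$norm_b = 0.0;

	foreach ( $a as $i => $value ) {
		$dot    += $value * $b[ $i ];
		$norm_a += $value * $value;
		$norm_b += $b[ $i ] * $b[ $i ];
	}

	// A zero vector has no direction, so treat it as unrelated.
	if ( 0.0 === $norm_a || 0.0 === $norm_b ) {
		return 0.0;
	}

	return $dot / ( sqrt( $norm_a ) * sqrt( $norm_b ) );
}

/**
 * Scores every stored chunk against the query embedding and returns
 * the best matches as chunk_id => score, highest score first.
 * Hypothetical helper, not part of the PR.
 */
function rank_chunks( array $query, array $chunks, int $limit = 20 ): array {
	$scores = array();

	foreach ( $chunks as $chunk_id => $embedding ) {
		$scores[ $chunk_id ] = cosine_similarity( $query, $embedding );
	}

	arsort( $scores ); // Sort by score, descending, keeping chunk IDs as keys.

	return array_slice( $scores, 0, $limit, true );
}
```

This scans every stored chunk per query, so the cost grows linearly with the number of chunks and with the embedding dimensions. That is the performance trade-off to weigh against the native index.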
 *
 * @since 1.1.0
 */
public function render_unavailable_notice(): void {
Do we need this now that we have a fallback?
 *
 * @since 1.1.0
 */
class Index_Repository implements Index_Repository_Interface {
Should this be renamed to MariaDB_Index_Repository?
/**
 * MariaDB vector index backend.
 */
public const BACKEND_MARIADB = 'mariadb';
Should this constant live in the appropriate index repository classes?
Maybe availability shouldn't concern itself with the details of each index repository?
/**
 * Required embedding model.
 */
private const EMBEDDING_MODEL = 'text-embedding-3-small';
Should this live in the embedding client class?
Let's move all details about how embedding works into the embedding client.
}

if ( empty( $this->get_available_index_backends() ) ) {
	$this->unavailable_reason = __( 'MariaDB 11.8 or newer is required when the compact memory fallback is disabled.', 'ai' );
This sounds like it should also live in the index repository class.
$defaults = array(
	'limit' => 20,
	'model' => 'text-embedding-3-small',
Should this come from configured embedding client?
// phpcs:disable WordPress.DB.PreparedSQL.InterpolatedNotPrepared -- Dynamic table name targets the owned vector index table.
$sql = $wpdb->prepare(
	"INSERT INTO {$table_name}
	(post_id, post_type, post_status, chunk_id, chunk_index, chunk_offset, anchor, title, permalink, content, content_hash, embedding, embedding_model, embedding_dimensions, indexed_at)
If we have model, do we need dimensions?
Do we need anchor and title? What for? We should be able to infer everything from the post ID.
 *
 * @since 1.1.0
 */
class Index_Schema {
This index schema is only used in "Index_Repository" (which should be MariaDB_Index_Repository) and is tied to that implementation. Should it live there instead?
$this->validate_embedding( $embedding );

$records[] = array(
	'post_id' => (int) $post->ID,
Why are we storing post data in the meta of the post itself ???!!!!
We have this data from the post already!!
TITLE ?!
Can we store the absolute minimum necessary for these chunks?
 * @param \WP_Post $post Post object.
 * @return list<array{chunk_id:string, chunk_index:int, chunk_offset:int, anchor:string|null, title:string, permalink:string, content:string}> Chunk records.
 */
public function chunk_post( WP_Post $post ): array {
Would it make sense for every chunk to have the title and excerpt of the post? I am not sure what to do if the excerpt is long... maybe only the title?
Summary
This introduces native RAG search for hosts that can support it:
New Infrastructure:
rag-search experiment backed by MariaDB native VECTOR(1536) storage and a cosine VECTOR INDEX. Gates activation on MariaDB 11.8+ and an authenticated OpenAI connector, using text-embedding-3-small embeddings.
wp ai rag WP-CLI utilities.
User facing features:
Rationale
MariaDB 11.8 LTS ships a community-available vector index implementation, so WordPress can offer a lean RAG search experiment without introducing another runtime service or PHP dependency.
Unfortunately, MySQL 9 does not have vector search capabilities, despite supporting vector storage.
The feature is intentionally gated for supporting hosts: it only becomes operational when the site has both a supported MariaDB version and an authenticated OpenAI connector capable of producing the fixed 1536-dimensional embedding shape used by the first schema.
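As an illustration of the version half of that gate, here is a minimal sketch of a check against the server info string. It is not the PR's implementation: the helper name `supports_native_vectors` is hypothetical, and it assumes the string looks like the values MariaDB typically reports, including the `5.5.5-` prefix that some older releases add.

```php
<?php
declare(strict_types=1);

/**
 * Returns true when a database server info string identifies
 * MariaDB at or above the minimum version.
 * Hypothetical helper, not part of the PR.
 */
function supports_native_vectors( string $server_info, string $minimum = '11.8.0' ): bool {
	// MySQL and other servers never qualify, whatever their version number.
	if ( false === stripos( $server_info, 'mariadb' ) ) {
		return false;
	}

	// Take the version directly before "-MariaDB" so that a leading
	// "5.5.5-" compatibility prefix is not mistaken for the real version.
	if ( 1 !== preg_match( '/(\d+\.\d+\.\d+)-MariaDB/i', $server_info, $matches ) ) {
		return false;
	}

	return version_compare( $matches[1], $minimum, '>=' );
}
```

In WordPress the input could plausibly come from `$wpdb->db_server_info()`. The connector half of the gate is not covered here.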
Testing
Manual default wp-env smoke test:
Environment observed:
Then I ran the processing manually, since chunking otherwise only happens every hour. You can chunk the posts initially with WP-CLI:
Result:
Semantic search smoke test:
A raw keyword lookup for furniture/shaking should return no matching posts, while the semantic search API returned the cafe-table post first.