Most efficient way to search for values from CPT in Post content

This was a fun project. I think I could write an entire chapter in a book on this (in fact, I practically did below). Anyway, I messed around with a bunch of ways to do this, and here’s the simplest way I came up with:

function add_glossary_links($content) {
    global $wpdb, $wordlist;
    if ( ! $wordlist && ! ( $wordlist = get_option( 'wordlist' ) ) ) {
        // Raise the GROUP_CONCAT limit (default 1024 bytes) so the list isn't truncated.
        $wpdb->query( 'SET SESSION group_concat_max_len = 100000' );
        $wordlist = $wpdb->get_var( 'SELECT GROUP_CONCAT(DISTINCT word SEPARATOR "|") AS `list` FROM ' . $wpdb->prefix . 'glossary' );
        add_option( 'wordlist', $wordlist );
    }
    $content = preg_replace_callback(
        '/\b(' . $wordlist . ')\b/i',
        function ( $matches ) {
            // Link each match to /glossary/{first letter}/{word}/ (adjust the href as needed).
            return '<a href="/glossary/' . strtolower( substr( $matches[0], 0, 1 ) . '/' . $matches[0] ) . '/">' . $matches[0] . '</a>';
        },
        $content
    );
    // Strip anchors that landed inside another tag (e.g. in an image's alt attribute).
    return preg_replace( '/(<[^<]+)<a\s.*?>(.*?)<\/a>/si', '$1$2', $content );
}
add_filter( 'the_content', 'add_glossary_links' );

Discussion

Assuming a table prefix_glossary of words, we select a pipe-delimited list of all of its records. This later becomes part of the regular expression we use to search the post for those words. Before running that query, we set the session variable group_concat_max_len, because by default MySQL truncates GROUP_CONCAT results at 1,024 bytes. 100,000 characters gives you room for roughly 10,000 nine-letter words plus their pipe separators, and you can raise it further if you need more.
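
For illustration, with a three-word glossary (hypothetical words), the cached option and the pattern built from it would look like this:

$wordlist = 'apple|banana|cherry';        // value cached in the 'wordlist' option
$pattern  = '/\b(' . $wordlist . ')\b/i'; // i.e. /\b(apple|banana|cherry)\b/i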

In preg_replace_callback, we’re replacing all instances of the words with themselves wrapped in anchor tags. Adjust the HREF as necessary.
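
So, assuming "whale" is a glossary word, a sentence comes out of the callback like this:

Before: Call me a whale watcher.
After:  Call me a <a href="/glossary/w/whale/">whale</a> watcher.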

The replacement can malform some HTML (e.g. a word inside an image's alt attribute gets wrapped in an anchor tag), so the final preg_replace strips those anchors back out. If you want, you could add another pass to ensure that words already wrapped in anchor tags don't get wrapped in a second one; see the sketch below.
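
A minimal sketch of that extra pass, which collapses an anchor nested directly inside an existing link (it won't catch every possible nesting, but it covers the common case):

// Collapse <a ...><a ...>word</a></a> back down to <a ...>word</a>.
$content = preg_replace( '/(<a\b[^>]*>)<a\b[^>]*>(.*?)<\/a>/si', '$1$2', $content );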

For performance testing, I built a glossary table of 1,001 random words 5-15 characters long. Performance does indeed suffer if the SELECT GROUP_CONCAT... result is not cached in some way, to the tune of a full second per load. As you can see, what I've done here is store it in an option (of course, you'd want to refresh that option any time the glossary is updated). You could take any number of steps to boost performance here; this is just one suggestion. Testing with Apache Bench, my page load times increased by an average of 6.8% (with a very large standard deviation, so it's actually not a statistically significant difference, but I digress), which is on the fence; whether that matters is based entirely on how much traffic the site gets. Although, if it got a lot of traffic, one would hope it would have caching enabled anyway, in which case the difference would be completely negligible.
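
One way to handle that refresh, assuming you have some admin routine that fires whenever glossary rows are added or edited (hypothetical here), is simply to drop the cached option so the next page load rebuilds it:

// Hypothetical: call this from whatever routine modifies the glossary table,
// so the next front-end request rebuilds the cached wordlist.
function flush_glossary_wordlist() {
    delete_option( 'wordlist' );
}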

Conclusion

This works, but you'd want to revisit the approach if that 6.8% were critical. Specifically, you could consider doing the replacements with JavaScript in similar fashion, writing the wordlist out as a JS variable (sketched below). That way your server doesn't have to handle the search processing, and you let your users' machines do all the work.
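
A minimal sketch of handing the list to the browser, assuming a hypothetical glossary.js in your theme that runs the same word-boundary replacement client-side:

// Expose the cached wordlist to a hypothetical front-end script (glossary.js)
// that would perform the same word-boundary replacement in the browser.
function enqueue_glossary_wordlist() {
    wp_enqueue_script( 'glossary', get_template_directory_uri() . '/glossary.js', array(), null, true );
    wp_localize_script( 'glossary', 'glossaryData', array( 'wordlist' => get_option( 'wordlist' ) ) );
}
add_action( 'wp_enqueue_scripts', 'enqueue_glossary_wordlist' );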

Alternatives

Another way I tried doing this, which was significantly faster but required more coding, was to store an index when a post was published or updated. In that case, I searched the dictionary against every word in the post and stored the matches in an index table. Then on page load, the replacement runs just like above but against that shortened wordlist, so I was only processing the known matches instead of the full list (say, 5 words instead of 1,000). If you wanted to explore that route, you'd need two tables:

CREATE TABLE `wp_glossary` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `word` tinytext,
  `definition` text,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=1002 DEFAULT CHARSET=latin1;

CREATE TABLE `wp_posts_glossary` (
  `post_id` int(11) unsigned NOT NULL,
  `glossary_id` int(11) unsigned NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

And you’d create the index on post update using something like:

global $wpdb, $post;
// Clear any previous index rows for this post.
$wpdb->query('DELETE FROM '.$wpdb->prefix.'posts_glossary WHERE post_id='.$post->ID);
// Reduce the post content to a pipe-delimited list of its words.
$sql_regex = rtrim(preg_replace('/[\W\s]*(\w+)[\W\s]*/','$1|',$post->post_content),'|');
// Record the id of every glossary word that appears in the post.
$wpdb->query('INSERT INTO '.$wpdb->prefix.'posts_glossary (post_id,glossary_id)
    SELECT '.$post->ID.',id FROM '.$wpdb->prefix.'glossary WHERE word REGEXP("^('.$sql_regex.')$")');
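
Then at render time you'd fetch only the indexed matches and feed that much shorter list into the same regex as before. A rough sketch of that lookup:

// Fetch only the words known to appear in this post; this string stands in
// for the full $wordlist in the preg_replace_callback above.
global $wpdb, $post;
$wordlist = $wpdb->get_var(
    'SELECT GROUP_CONCAT(DISTINCT g.word SEPARATOR "|")
     FROM '.$wpdb->prefix.'glossary g
     JOIN '.$wpdb->prefix.'posts_glossary pg ON pg.glossary_id = g.id
     WHERE pg.post_id='.$post->ID
);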

I used my 1,001-word glossary and the 2,215-word first chapter of Moby Dick to test the indexing, and it averaged ~1.5 seconds to build. Therefore, if you went this route, you should consider leveraging wp_cron so you don't have to wait for the indexing after you hit "Update" or "Publish"; a sketch of that follows.
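
A minimal sketch of deferring it that way, assuming the indexing snippet above is wrapped in a hypothetical build_glossary_index( $post_id ) function:

// Hypothetical: push the indexing onto wp_cron so saving returns immediately.
function schedule_glossary_index( $post_id ) {
    wp_schedule_single_event( time(), 'build_glossary_index_event', array( $post_id ) );
}
add_action( 'save_post', 'schedule_glossary_index' );
// build_glossary_index( $post_id ) would run the indexing code shown above.
add_action( 'build_glossary_index_event', 'build_glossary_index' );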

Fun stuff. Hope this helps!

Cheers~
