Limit search to Latin characters

This solution filters search strings by applying a regular expression which only matches characters from the Common and Latin Unicode scripts.


Matching Latin Characters with Regular Expressions

I just had my mind blown over at Stack Overflow. As it turns out, regular expressions have a mechanism to match entire Unicode categories, including values to specify entire Unicode “scripts”, each corresponding to groups of characters used in different writing systems.

This is done by using the \p escape sequence followed by a Unicode property name in curly braces, so [\p{Common}\p{Latin}] matches a single character in either the Latin or the Common script. The Common script covers characters shared between writing systems, such as punctuation, numerals, and miscellaneous symbols.

As @Paul ‘Sparrow Hawk’ Biron points out, the u pattern modifier flag should be set at the end of the regular expression in order for PHP’s PCRE functions to treat the subject string as UTF-8 Unicode encoded.

All together then, the pattern

/^[\p{Latin}\p{Common}]+$/u

will match an entire string composed of one or more characters in the Latin and Common Unicode scripts.
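
To get a feel for what the pattern accepts, here is a minimal standalone sketch that can be run outside WordPress. The helper name wpse261038_is_latin_or_common() is made up for this illustration and is not used elsewhere in this answer.

```php
<?php
// Hypothetical helper (an assumption for this illustration): returns true only
// when every character of $text belongs to the Latin or Common Unicode scripts.
function wpse261038_is_latin_or_common( $text ) {
  // preg_match() returns 1 on a match, 0 on no match, and false on error
  // (for example when $text is not valid UTF-8), so compare strictly against 1.
  return 1 === preg_match( '/^[\p{Latin}\p{Common}]+$/u', $text );
}

var_dump( wpse261038_is_latin_or_common( 'Hello, world! 123' ) ); // Latin letters plus Common punctuation and digits
var_dump( wpse261038_is_latin_or_common( "caf\u{00E9}" ) );       // "café" with a precomposed accented Latin letter
var_dump( wpse261038_is_latin_or_common( 'привет' ) );            // Cyrillic
var_dump( wpse261038_is_latin_or_common( '' ) );                  // empty string
```

The first two calls print bool(true) and the last two print bool(false). Note that the empty string is rejected because + requires at least one character, so an empty search would be treated the same way as a non-Latin one.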


Filtering the Search String

A good place to intercept a search string is the pre_get_posts action as it fires immediately before WordPress executes the query. With more care, this could also be accomplished using a request filter.

function wpse261038_validate_search_characters( $query ) {
  // Leave admin, non-main query, and non-search queries alone
  if( is_admin() || !$query->is_main_query() || !$query->is_search() )
    return;

  // Check if the search string contains only Latin/Common Unicode characters
  $match_result = preg_match( '/^[\p{Latin}\p{Common}]+$/u', $query->get( 's' ) );

  // If the search string only contains Latin/Common characters, let it continue
  if( 1 === $match_result )
    return;

  // If execution reaches this point, the search string contains non-Latin characters
  //TODO: Handle non-Latin search strings
  //TODO: Set up logic to display error message
}

add_action( 'pre_get_posts', 'wpse261038_validate_search_characters' );

Responding to Disallowed Searches

Once it’s been determined that a search string contains non-Latin characters, you can use WP_Query::set() to modify the query by changing its named query vars, thus affecting the SQL query WordPress subsequently composes and executes.

The most relevant query variables are probably the following:

  • s is the query variable corresponding to a search string. Setting it to null or an empty string ('') will result in WordPress no longer treating the query as a search. Often this results in an archive template displaying all posts or the front page of the site, depending on the values of the other query vars. Setting it to a single space (' '), however, will result in WordPress recognizing it as a search, and thus attempting to display the search.php template.
  • page_id could be used to direct the user to a specific page of your choice.
  • post__in can restrict the query to a specific selection of posts. By setting it to an array with an impossible post ID, it can serve as a measure to ensure that the query returns absolutely nothing.

The above in mind, you might do the following in order to respond to a bad search by loading the search.php template with no results:

function wpse261038_validate_search_characters( $query ) {
  // Leave admin, non-main query, and non-search queries alone
  if( is_admin() || !$query->is_main_query() || !$query->is_search() )
    return;

  // Check if the search string contains only Latin/Common Unicode characters
  $match_result = preg_match( '/^[\p{Latin}\p{Common}]+$/u', $query->get( 's' ) );

  // If the search string only contains Latin/Common characters, let it continue
  if( 1 === $match_result )
    return;

  $query->set( 's', ' ' ); // Replace the non-latin search with an empty one
  $query->set( 'post__in', array(0) ); // Make sure no post is ever returned

  //TODO: Set up logic to display error message
}

add_action( 'pre_get_posts', 'wpse261038_validate_search_characters' );
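
For comparison, the request filter mentioned earlier could be used in roughly the following way. This is only a sketch under a few assumptions: the callback name is made up, the filter is assumed to receive the main request's query vars as an array, and extra vars such as post__in added to that array are assumed to be passed on to the main query. The extra care lies in guarding against a missing or non-string s value, since there is no WP_Query object to ask at this point.

```php
<?php
// Sketch of a 'request' filter callback (the function name is an assumption).
// It receives the array of query vars parsed for the main request.
function wpse261038_filter_search_request( $query_vars ) {
  // Leave requests without a usable search string alone
  if ( ! isset( $query_vars['s'] ) || ! is_string( $query_vars['s'] ) ) {
    return $query_vars;
  }

  // Leave admin requests alone; the function_exists() guard only serves to
  // let this sketch run outside WordPress as well
  if ( function_exists( 'is_admin' ) && is_admin() ) {
    return $query_vars;
  }

  // If the search string only contains Latin/Common characters, let it continue
  if ( 1 === preg_match( '/^[\p{Latin}\p{Common}]+$/u', $query_vars['s'] ) ) {
    return $query_vars;
  }

  $query_vars['s']        = ' ';        // Replace the non-Latin search with an empty one
  $query_vars['post__in'] = array( 0 ); // Make sure no post is ever returned

  return $query_vars;
}

// Only register the callback when WordPress is loaded
if ( function_exists( 'add_filter' ) ) {
  add_filter( 'request', 'wpse261038_filter_search_request' );
}
```

If you go this way, use it in place of the pre_get_posts callback rather than alongside it, since both would be doing the same job.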

Displaying an Error

The way in which you actually display the error message is highly dependent on your application and the abilities of your theme; there are many ways in which this can be done. If your theme calls get_search_form() in its search template, the easiest solution is probably to use a pre_get_search_form action hook to output your error immediately above the search form:

function wpse261038_validate_search_characters( $query ) {
  // Leave admin, non-main query, and non-search queries alone
  if( is_admin() || !$query->is_main_query() || !$query->is_search() )
    return;

  // Check if the search string contains only Latin/Common Unicode characters
  $match_result = preg_match( '/^[\p{Latin}\p{Common}]+$/u', $query->get( 's' ) );

  // If the search string only contains Latin/Common characters, let it continue
  if( 1 === $match_result )
    return;

  $query->set( 's', ' ' ); // Replace the non-latin search with an empty one
  $query->set( 'post__in', array(0) ); // Make sure no post is ever returned

  add_action( 'pre_get_search_form', 'wpse261038_display_search_error' );
}

add_action( 'pre_get_posts', 'wpse261038_validate_search_characters' );

function wpse261038_display_search_error() {
  echo '<div class="notice notice-error"><p>Your search could not be completed as it contains characters from non-Latin alphabets.</p></div>';
}

Some other possibilities for displaying an error message include:

  • If your site uses JavaScript which can display “flash” or “modal” messages (or you add such abilities on your own), add to it the logic to display messages on page-load when a specific variable is set, then add a wp_enqueue_scripts action hook with a $priority larger than that of the hook which enqueues that JavaScript, and use wp_localize_script() to set that variable to include your error message.
  • Use wp_redirect() to send the user to the URL of your choice (this method requires an additional page load).
  • Set a PHP variable or invoke a method which will inform your theme/plugin about the error such that it may display it where appropriate.
  • Set the s query variable to '' instead of ' ' and use page_id in place of post__in in order to return a page of your choosing.
  • Use a loop_start hook to inject a fake WP_Post object containing your error into the query results – this is most definitely an ugly hack and may not look right with your particular theme, but it has the potentially desirable side effect of suppressing the “No Results” message.
  • Use a template_include filter hook to swap out the search template with a custom one in your theme or plugin which displays your error.
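
As a rough illustration of two of these options combined (a flag that the rest of your code can check, plus a template_include filter), consider the sketch below. The function names and the search-error.php template file are assumptions made up for this example; your theme would need to provide that file.

```php
<?php
// Hypothetical helper: remembers for the rest of the request whether the
// search was rejected. Pass true to set the flag, pass nothing to read it.
function wpse261038_search_error( $set = null ) {
  static $has_error = false;

  if ( null !== $set ) {
    $has_error = (bool) $set;
  }

  return $has_error;
}

// 'template_include' callback: swap in a custom template when the flag is set
function wpse261038_search_error_template( $template ) {
  if ( ! wpse261038_search_error() ) {
    return $template;
  }

  // 'search-error.php' is an assumed file name; locate_template() returns an
  // empty string when the theme has no such file, so fall back to the default
  $custom = locate_template( 'search-error.php' );

  return '' !== $custom ? $custom : $template;
}

// Only register the callback when WordPress is loaded
if ( function_exists( 'add_filter' ) ) {
  add_filter( 'template_include', 'wpse261038_search_error_template' );
}
```

With this in place, the pre_get_posts callback would call wpse261038_search_error( true ) where it currently registers the pre_get_search_form action, and theme code could call wpse261038_search_error() wherever it needs to know whether to show a message.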

Without examining the theme in question, it’s difficult to determine which route you should take.
