If you have a strict list of allowed years:
$allowed_years = array('2018', '2019', '2020', '2021', '2022', '2023', '2024');
Then there's no need for regex at all, because you can search for those years directly:
function vyperlook_get_years_in_text( string $text, array $allowed_years ) : array {
    $found_years = []; // the years we found
    // for each year that is allowed
    foreach ( $allowed_years as $year ) {
        // if it's in the excerpt
        if ( str_contains( $text, $year ) ) {
            // add it to the list
            $found_years[] = $year;
        }
    }
    return $found_years;
}
$excerpt = "1986 Alice Bob 2020年, 1984 and 2022!";
$years = vyperlook_get_years_in_text( $excerpt, $allowed_years );
echo implode( ', ', $years );
The above outputs this:
2020, 2022
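Unicode, regex, Japanese and so on can all be ignored. I'm sure there's a variation of the regex that would fix your code, but it massively overcomplicates the problem. As for why it doesn't work: \b matches zero-width boundary positions between word and non-word characters, and what counts as a word character depends on how the pattern treats Unicode. Without the u modifier PCRE works on bytes, which are not always whole characters, and with it a character such as 年 can itself count as a word character. There's a tonne of nuance that's being ignored.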
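One caveat with a plain substring search is that a year also matches inside a longer number, e.g. 2020 inside 120205. If that matters for your content, here is a minimal sketch of one way around it that avoids \b entirely and only checks for neighbouring digits. The function name vyperlook_get_standalone_years and the sample strings are made up for illustration; treat it as a starting point rather than a definitive fix.

```php
<?php
// Hypothetical helper, not part of the code above: works like the
// str_contains() version, but a year only counts when it is not part
// of a longer run of digits.
function vyperlook_get_standalone_years( string $text, array $allowed_years ) : array {
    $found_years = [];
    foreach ( $allowed_years as $year ) {
        // (?<!\d) and (?!\d) reject a digit directly before or after the year.
        // Unlike \b, they do not care whether a neighbour such as 年 is a "word" character.
        $pattern = '/(?<!\d)' . preg_quote( $year, '/' ) . '(?!\d)/u';
        if ( preg_match( $pattern, $text ) === 1 ) {
            $found_years[] = $year;
        }
    }
    return $found_years;
}

$allowed = [ '2020', '2022' ];

echo implode( ', ', vyperlook_get_standalone_years( '1986 Alice Bob 2020年, 1984 and 2022!', $allowed ) ), "\n";
// prints: 2020, 2022

echo implode( ', ', vyperlook_get_standalone_years( 'Order 120205 shipped in 2022', $allowed ) ), "\n";
// prints: 2022
```

If your excerpts never contain longer numbers, the str_contains() version is still the simpler choice.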