This answer is just a rough roadmap. I think the question touches on some machine learning concepts, and you used the right “statistics” tag. You need to build a dictionary plugin that learns from your new posts, probably doing something like:
- Manually create a first JSON filter dataset of the most used words in your language (e.g. https://1000mostcommonwords.com/1000-most-common-english-words/). I didn’t find an API for it. It is used to filter out all the words that are considered, or that you consider, irrelevant (prepositions, pronouns, etc.).
- Write a function which processes all the existing posts and exports the content you are interested in (descriptions, titles, etc.) to a second JSON dataset. You already have the `post_meta`s to exploit as a database source. Remember to associate each piece of content with its `post_id`, because you’ll need to handle updates or removals.
- Create a function which updates that JSON whenever a post is updated or published.
- Define a gate comparing the previous two JSON sources, filtering out words and generating a new final JSON file to parse. You can use a declarative approach with the `array_map` or `array_filter` built-in functions.
- Finally, build the logic to count each word occurrence, store the counts in a new database table and display them in a dashboard page.
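The filtering and counting steps above could look roughly like the following. This is only a minimal sketch in plain PHP, not a finished plugin: the function names `tokenize` and `countWords`, the `post_id => text` array shape and the tiny stop-word list are all assumptions made for illustration, and it expects the `mbstring` extension to be available.

```php
<?php
declare(strict_types=1);

/**
 * Split text into lowercase words (letters, digits and apostrophes only).
 * Hypothetical helper, not part of WordPress.
 */
function tokenize(string $text): array
{
    $text = mb_strtolower(strip_tags($text));
    // \p{L} = any letter, \p{N} = any digit; the "u" flag enables UTF-8 mode.
    preg_match_all("/[\p{L}\p{N}']+/u", $text, $matches);
    return $matches[0];
}

/**
 * Count word occurrences across posts, skipping stop words.
 *
 * @param array<int, string> $posts     post_id => text (assumed shape)
 * @param string[]           $stopWords words to ignore (the "filter dataset")
 * @return array<string, int> word => count, most frequent first
 */
function countWords(array $posts, array $stopWords): array
{
    // Flip the list into a hash set so each lookup is a cheap isset().
    $stop = array_flip($stopWords);
    $counts = [];
    foreach ($posts as $text) {
        // This is the "gate": keep only words missing from the filter dataset.
        $words = array_filter(tokenize($text), fn($w) => !isset($stop[$w]));
        foreach (array_count_values($words) as $word => $n) {
            $counts[$word] = ($counts[$word] ?? 0) + $n;
        }
    }
    arsort($counts); // highest counts first
    return $counts;
}

// Made-up sample data; in practice both arrays would be decoded from the JSON files.
$stopWords = ['the', 'a', 'of', 'and', 'in'];
$posts = [
    12 => 'The history of jazz and the future of jazz',
    15 => 'A guide to jazz piano in Paris',
];
$counts = countWords($posts, $stopWords);
print_r($counts);
```

Flipping the stop-word list with `array_flip` keeps the filter fast even when the list holds thousands of words, which matters once many posts are parsed.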
I guess the parsing will become quite intensive after a while, if the blog grows rich in content and you often update your filter dataset. Also have a look at these libraries, which could help:
- Machine learning library for PHP https://php-ml.readthedocs.io/en/latest/
- Event-driven, non-blocking I/O with PHP https://github.com/reactphp/react
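On the parsing cost itself, one option that needs no extra library is to avoid re-parsing every post: keep the counts per `post_id` and adjust the global totals only for the post that changed. The sketch below assumes the words passed in were already filtered; `updatePost` and both array shapes are hypothetical names chosen for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Replace one post's word counts and adjust the global totals by the difference.
 *
 * @param array<int, array<string, int>> $perPost post_id => (word => count)
 * @param array<string, int>             $totals  word => count over all posts
 * @param string[]|null                  $words   filtered words, or null if the post was removed
 */
function updatePost(array &$perPost, array &$totals, int $postId, ?array $words): void
{
    // Subtract the old contribution of this post, if it had one.
    foreach ($perPost[$postId] ?? [] as $word => $n) {
        $totals[$word] -= $n;
        if ($totals[$word] <= 0) {
            unset($totals[$word]);
        }
    }
    unset($perPost[$postId]);

    if ($words === null) {
        return; // post removed: nothing to add back
    }

    // Add the new contribution.
    $perPost[$postId] = array_count_values($words);
    foreach ($perPost[$postId] as $word => $n) {
        $totals[$word] = ($totals[$word] ?? 0) + $n;
    }
}

$perPost = [];
$totals = [];
updatePost($perPost, $totals, 12, ['jazz', 'history', 'jazz']);
updatePost($perPost, $totals, 15, ['jazz', 'piano']);
updatePost($perPost, $totals, 12, ['jazz', 'future']); // post 12 edited
updatePost($perPost, $totals, 15, null);               // post 15 deleted
print_r($totals);
```

In WordPress, something like this could be called from a handler attached to the `save_post` action, with the two arrays loaded from and saved back to wherever you store the counts.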
Have fun and please share the outcome.