Filter/Remove HTML Elements on all posts and pages

Here’s an example function that might help you accomplish something like that. It fetches a batch of posts, loops through them, modifies the post_content field of each one and saves the changes.

function wpse_87695_clean_post_content() {
    $posts = get_posts(array(
        'post_type' => array('post', 'page'),
        'post_status' => 'publish',
        /*
        'meta_query' => array(
            array(
                'key' => '_wpse_87695_processed',
                // '!=' only matches posts that already have the key, so test for its absence instead
                'compare' => 'NOT EXISTS'
            )
        ),
        */
    ));

    foreach ($posts as $p) {
        $p->post_content = wpse_87695_filter_content($p->post_content);
        wp_update_post($p);

        // update_post_meta($p->ID, '_wpse_87695_processed', true);
    }

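    // Stop before WordPress renders the page; this is a one-off maintenance run.
    die();
}
// The 'wp' action fires on every front-end request, so remove this hook as soon as the cleanup is finished.
add_action('wp', 'wpse_87695_clean_post_content');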

function wpse_87695_filter_content($content) {
    return strip_tags($content); // wp_filter_nohtml_kses might be a more WordPress-friendly way to do this
}

First, you will want to refine the get_posts arguments so that they return only the posts you need to clean. You will probably also want to limit the number of posts per run, as you are unlikely to be able to process 800 posts at once, though set_time_limit can help increase the number of posts you can process in one request, depending on your configuration.

Ideally you would also mark the posts that have already been processed in some way, for example using update_post_meta, as this allows you to filter them out using the meta_query key in the arguments array. That way you could process e.g. 50 posts at a time, reloading the page until all posts have been processed. I commented it out in my example code as I think it’ll need some more work.
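The batching idea could be sketched roughly as follows. This is only an illustration, not a tested migration script: the function name wpse_87695_clean_batch, the batch size of 50 and the '_wpse_87695_processed' marker are assumptions, and it relies on wpse_87695_filter_content from the example above.

```php
<?php
// Sketch only: batch size and marker key are illustrative assumptions.
function wpse_87695_clean_batch($batch_size = 50) {
    $posts = get_posts(array(
        'post_type'   => array('post', 'page'),
        'post_status' => 'publish',
        'numberposts' => $batch_size,
        // Match only posts that do not carry the marker yet.
        'meta_query'  => array(
            array(
                'key'     => '_wpse_87695_processed',
                'compare' => 'NOT EXISTS',
            ),
        ),
    ));

    foreach ($posts as $p) {
        $result = wp_update_post(array(
            'ID'           => $p->ID,
            'post_content' => wpse_87695_filter_content($p->post_content),
        ));
        // wp_update_post() returns 0 on failure by default; mark the post only on success.
        if ($result) {
            update_post_meta($p->ID, '_wpse_87695_processed', 1);
        }
    }

    // Number of posts fetched in this run; 0 means nothing is left to process.
    return count($posts);
}
```

Each page reload would then handle the next batch, and a return value of 0 tells you that every matching post has been marked.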

Doing this work on a shared hosting environment might be very slow due to memory limits and the execution time limit. As it’s also very likely that at some point (due to human error) you’ll corrupt data and have to start over, I recommend that you run this on a backup of the database, preferably on a local machine.

An alternative, which would free you from having to run the conversion in batches, is to set up a small piece of JavaScript somewhere in the admin that repeatedly requests a certain URL, which in turn runs the above one post at a time until all posts have been processed.
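The server side of that idea might look something like the sketch below, using WordPress's admin-ajax mechanism. The action name wpse_87695_clean_one, the 'manage_options' capability check and the '_wpse_87695_processed' marker are assumptions made for illustration; the script in the admin would keep requesting admin-ajax.php with that action until the response reports done.

```php
<?php
// Sketch only: action name, capability and marker key are illustrative assumptions.
function wpse_87695_ajax_clean_one() {
    if (!current_user_can('manage_options')) {
        wp_send_json_error('forbidden', 403);
    }

    // Fetch a single post that has not been marked as processed yet.
    $posts = get_posts(array(
        'post_type'   => array('post', 'page'),
        'post_status' => 'publish',
        'numberposts' => 1,
        'meta_query'  => array(
            array(
                'key'     => '_wpse_87695_processed',
                'compare' => 'NOT EXISTS',
            ),
        ),
    ));

    if (empty($posts)) {
        // Nothing left: tell the caller to stop polling.
        wp_send_json_success(array('done' => true));
    }

    $p = $posts[0];
    wp_update_post(array(
        'ID'           => $p->ID,
        'post_content' => wpse_87695_filter_content($p->post_content),
    ));
    update_post_meta($p->ID, '_wpse_87695_processed', 1);

    // wp_send_json_success() prints the JSON response and ends the request.
    wp_send_json_success(array('done' => false, 'id' => $p->ID));
}
add_action('wp_ajax_wpse_87695_clean_one', 'wpse_87695_ajax_clean_one');
```

Because the wp_ajax_ hook only fires for logged-in users, anonymous visitors cannot trigger the cleanup; adding a nonce check on top of the capability check would be a sensible next step.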

Also, the filter function I supplied (wpse_87695_filter_content) is, as you can see, extremely rudimentary. All it does is run strip_tags() on the post_content to strip out all HTML. You will likely have to use regular expressions or an HTML parser, depending on your specific needs. For example, you will probably need to remove excess newlines and make sure paragraphs are separated by exactly two newlines.
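As a rough starting point, a slightly less rudimentary filter could look like the sketch below. The function name wpse_87695_filter_content_paragraphs and the list of block-level tags are assumptions; your content may well need different rules.

```php
<?php
// Sketch only: the function name and the list of block-level tags are assumptions.
function wpse_87695_filter_content_paragraphs($content) {
    // Turn closing block-level tags into paragraph breaks before the tags are stripped.
    $content = preg_replace('#</(?:p|div|h[1-6]|li|blockquote)\s*>#i', "\n\n", $content);
    $content = strip_tags($content);
    // Normalize Windows and old Mac line endings to "\n".
    $content = str_replace(array("\r\n", "\r"), "\n", $content);
    // Empty out lines that contain only spaces or tabs.
    $content = preg_replace('/^[ \t]+$/m', '', $content);
    // Collapse runs of three or more newlines into exactly two.
    $content = preg_replace('/\n{3,}/', "\n\n", $content);
    return trim($content);
}
```

Calling it on "<p>One</p>\r\n\r\n\r\n<p>Two <b>bold</b></p>" yields the two paragraphs "One" and "Two bold" separated by a single blank line.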

Alternative approach

Another solution could be to perform the filtering when the data is being output by WordPress. When you call the_content in your templates, WordPress will fetch $post->post_content and run a few filters on it using apply_filters('the_content', $post->post_content). This allows you to register the function I outlined above as a filter for all post content, by calling add_filter('the_content', 'wpse_87695_filter_content');.

While this approach saves you the trouble of iterating over all the posts and updating the database manually, it requires the same effort when it comes to writing a good filter function. Also, it will run every time a post is displayed, including for posts completely unrelated to your current import, unless you define some form of exception. Thus, it can be considered more of a quick fix, of course depending on the nature of your data. Perhaps you could store the most important filtering in the database and handle the rest using WordPress filters, if that helps you in some way.
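One way to define such an exception is to filter only posts that carry a flag. The sketch below assumes a hypothetical '_wpse_87695_imported' meta key that you would set on the imported posts yourself; it is not something WordPress provides, and the wrapper function name is made up for this example.

```php
<?php
// Sketch only: '_wpse_87695_imported' is a hypothetical flag set during the import.
function wpse_87695_maybe_filter_content($content) {
    // Leave posts without the flag untouched.
    if (!get_post_meta(get_the_ID(), '_wpse_87695_imported', true)) {
        return $content;
    }
    return wpse_87695_filter_content($content);
}
// Priority 5 runs before WordPress's default wpautop filter (priority 10),
// so the paragraph tags wpautop adds afterwards are not stripped again.
add_filter('the_content', 'wpse_87695_maybe_filter_content', 5);
```

The early priority is a design choice: with the default priority the tags would be stripped after WordPress has already formatted the paragraphs, leaving the output as one unbroken block of text.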
