Most CMS migration/data-import tools for WordPress rely on some degree of access to the original data types and store – e.g. the discrete data related to posts, authors, and comments – which can then be fairly directly translated to WordPress's counterparts. This particular case is considerably more involved since the comments and authors have been rendered into HTML within each post's `post_content`.
Since this is something of an unusual situation, I'm not familiar with any pre-packaged solutions which would facilitate it. To move that data into WordPress, each post's HTML `post_content` must be parsed to extract the relevant data so that it can be inserted into WordPress.
There are many different ways you could implement the actual migration logic – leveraging something like WP-CLI, or even just your site's REST API, you could write it in whatever language you're comfortable with – but from here on I'll assume you intend to write a migration script which will execute within your WordPress environment.
Tracking Processed Posts
I would use a piece of post meta-data to track which posts have already been processed: once the comments have been parsed out of a post's `post_content` and inserted as proper comments, set a meta value flagging the post as complete, i.e.

update_post_meta( $post->ID, 'wpse392841_import_complete', true );
The meta value will also enable you to leverage `WP_Query`'s meta-query arguments to retrieve batches of posts which have yet to be processed:
$import_query = new WP_Query(
    [
        'post_type'      => 'post',
        'posts_per_page' => 30,
        'meta_key'       => 'wpse392841_import_complete',
        'meta_compare'   => 'NOT EXISTS',
    ]
);

foreach ( $import_query->posts as $post ) {
    // TODO: parse data out of comments, insert as WordPress comments, update post
    // with comments removed from `post_content`.
}
Batch-Processing
If you have a large amount of content, you'll probably want to chunk the updates into smaller batches by querying for and looping through fewer posts per script execution. This helps mitigate the possibility of hitting PHP's execution time limit, which could "corrupt" part of the import if execution terminates at an inopportune moment.
There are various ways to execute this sort of batched processing logic under the constraints of a webserver/PHP. One popular option is to have the server-side code make an asynchronous HTTP request to its own server with arguments which will trigger the same logic again – that way, the script continues to invoke itself until there are no more posts remaining to process. WordPress's HTTP API can be used for this purpose.
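As a minimal sketch of that self-invocation pattern using `wp_remote_post()` (the action name is an illustrative assumption):

```php
// At the end of a batch, re-trigger the import if unprocessed posts remain.
// A very short timeout plus 'blocking' => false "fires and forgets" the request
// so the current script execution can end immediately.
if ( $import_query->found_posts > $import_query->post_count ) {
    wp_remote_post(
        admin_url( 'admin-post.php' ),
        [
            'timeout'  => 0.01,
            'blocking' => false,
            'body'     => [
                'action' => 'wpse392841_import', // Hypothetical action name.
            ],
        ]
    );
}
```

Note that a loopback request like this won't carry your login cookies, so you'd either pair it with an `admin_post_nopriv_{action}` handler guarded by a secret token, or forward the current request's cookies along in the request arguments.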
Parsing Markup
To parse the actual markup in your migration loop, while you could use regular expressions, given the amount of content you’ll be extracting, I would probably use a bona-fide HTML parser instead – probably PHP’s DOM/DOMDocument. (If you end up performing the migration in a Node.js program instead for some reason, cheerio is a wonderful package for parsing HTML. It’s not terribly fast, but it’s easy to use and lets you interact with the virtual DOM using jQuery conventions).
We can see the sort of data which you'll have to shape into a comment by examining the markup of one of the Blogger comments:
<div id="comment-header">
<div id="c2244642925767684880" class="comment-author">
<div id="comment-profile-image">
<div class="avatar-image-container avatar-stock">
<span dir="ltr">
<a id="av-1-08118927101030203407" class="avatar-hovercard" href="http://www.blogger.com/profile/08118927101030203407" rel="nofollow">
<img title="Joe Shmoe" alt="" width="16" height="16" nitro-lazy-src="https://cdn-daadn.nitrocdn.com/BKAyEzJMnNySFeYkTTGiHLTpuWRtPoXh/assets/static/optimized/img/c2dc8e1cb1b6f022685269de299214d3.b16-rounded.gif" class=" ls-is-cached lazyloaded" nitro-lazy-empty="" id="NDcxOjMyNw==-1" src="https://cdn-daadn.nitrocdn.com/BKAyEzJMnNySFeYkTTGiHLTpuWRtPoXh/assets/static/optimized/img/c2dc8e1cb1b6f022685269de299214d3.b16-rounded.gif">
</a>
</span>
</div>
</div>
<div id="comment-name-url">
<a class="comments-autor-name">Joe Shmoe</a> <a class="says">says:</a>
</div>
<div id="comment-date">
<span class="comment-timestamp">
<a class="comment-permalink">November 4, 2009 at 4:15 PM</a>
</span>
</div>
<div id="comment-body">
<div style="clear:both;"> </div>
<p class="comment-body" style="padding-right:10px;">Lorem ipsum dolor sit amet</p>
</div>
</div>
</div>
Here we can see there’s an author name, potentially an author URL and avatar image URLs in the case of the author having a Blogger profile, a timestamp, and the body of the post. All of this data will be easily “addressable” with an HTML parser due to each piece of data using a corresponding HTML class.
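To illustrate, here is a minimal `DOMDocument`/`DOMXPath` sketch which pulls those pieces out of a single comment's markup. The class names are taken verbatim from your sample – including the misspelled `comments-autor-name` – and `$html` is assumed to hold one `comment-header` block:

```php
$doc = new DOMDocument();
libxml_use_internal_errors( true ); // Suppress warnings from non-XHTML markup.
$doc->loadHTML( $html );
libxml_clear_errors();

$xpath = new DOMXPath( $doc );

// Helper: grab the first element bearing a given class name.
$first = function ( $class ) use ( $xpath ) {
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')]";
    return $xpath->query( $query )->item( 0 );
};

$author_node = $first( 'comments-autor-name' ); // Class name exactly as in the sample markup.
$avatar_node = $first( 'avatar-hovercard' );
$date_node   = $first( 'comment-permalink' );
$body_node   = $first( 'comment-body' );

$author     = $author_node ? trim( $author_node->textContent ) : '';
$author_url = $avatar_node ? $avatar_node->getAttribute( 'href' ) : '';
$body       = $body_node ? trim( $body_node->textContent ) : '';

// Convert "November 4, 2009 at 4:15 PM" into MySQL datetime format.
$date_raw = $date_node ? trim( $date_node->textContent ) : '';
$date_obj = DateTime::createFromFormat( 'F j, Y \a\t g:i A', $date_raw );
$date     = $date_obj ? $date_obj->format( 'Y-m-d H:i:s' ) : date( 'Y-m-d H:i:s' );
```

The resulting values map directly onto `wp_insert_comment()`'s `comment_author`, `comment_author_url`, `comment_content`, and `comment_date` fields.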
Unfortunately, looking at the markup for your comments section as a whole, every comment is a sibling of the others and the markup does not convey any child/parent relationships – you will not be able to reconstruct comment threads from the available markup alone.
Executing One-Off Logic
Finally, with regards to where to actually write and execute this sort of "one-off" code, I'm not aware of any strong conventions. The only major consideration I would have is to not attach it to a hook or any other mechanism which might execute it on every page load – or indeed at any point unless you explicitly trigger it. In the past I've used a custom `admin_post_{custom action name}` hook for this purpose, but I haven't had a need to standardize my approach, and there may well be a better place for it.
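For reference, wiring the import up to such a hook might look like the following (the action name, capability check, and runner function are illustrative assumptions):

```php
add_action( 'admin_post_wpse392841_import', function () {
    // Only allow administrators to trigger the import.
    if ( ! current_user_can( 'manage_options' ) ) {
        wp_die( 'Insufficient permissions.' );
    }

    wpse392841_run_import_batch(); // Hypothetical function wrapping the batch logic above.

    wp_die( 'Batch complete.' );
} );
```

With this in place, visiting `/wp-admin/admin-post.php?action=wpse392841_import` while logged in as an administrator runs one batch, and nothing executes on ordinary page loads.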