Post content includes and SEO vs. duplicate content

If the resources and the products are accessible at different unique URLs, but those pages show the same content in certain areas, it could definitely be flagged as duplicate content.

Sometimes you can’t avoid duplicate content. An example, to stay on the WordPress topic, is when you have blog posts and category pages that list those posts in full. The post and the category page would then be seen as duplicate content.

You can somewhat control how Google and other search engines treat this content. I’ll list a couple of ways. I will treat the question as if it were purely WordPress-related, and you can translate it to your own implementation. (I believe that will be the best way for WP SE to benefit.)

Canonical Pages

A subtle solution is using canonical pages. On the post’s page you would insert the following tag in the <head> section:

<link rel="canonical" href="http://www.example.com/post-title/" />

Remember to self-close this tag with /> (as with any <meta> tag). Also, here’s an article from Google about using canonical pages:

This new option lets site owners suggest the version of a page that Google should treat as canonical. Google will take this into account, in conjunction with other signals, when determining which URL sets contain identical content, and calculating the most relevant of these pages to display in search results.
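
In WordPress you can print this tag from your theme by hooking into wp_head. The sketch below uses real WordPress functions (add_action, is_singular, get_permalink, esc_url), but the function name wpse_print_canonical_link is just a placeholder of mine; also note that recent WordPress versions already output a canonical link on singular pages via rel_canonical(), so check your page source before adding a second one.

<?php
// Sketch: print a canonical link for single posts and pages.
// Check first that your theme/WordPress version is not already doing this.
function wpse_print_canonical_link() {
    if ( is_singular() ) {
        echo '<link rel="canonical" href="' . esc_url( get_permalink() ) . '" />' . "\n";
    }
}
add_action( 'wp_head', 'wpse_print_canonical_link' );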

Using the Robots <meta> tag

You can instruct search engines not to index a page. Using the same tag, you can also instruct them to ignore all the links found on that page (and not crawl through them).

<meta name="robots" content="noindex, nofollow" />

noindex will, as the name implies, instruct the search engine to not index this page.
nofollow will instruct the search engine to not ‘click through’ on any links found on that page.

Do not confuse <meta name="robots" content="nofollow" /> (which applies to every link on the page) with <a rel="nofollow"> (which applies to a single link).
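
In WordPress, a common approach is to print this meta tag only on the archive pages you consider duplicates of your single posts. The following is a sketch, again from a theme’s functions.php: is_category() and is_tag() are standard WordPress conditional tags, while the function name wpse_archive_robots_meta and the choice of which archives to exclude are assumptions you should adapt to your own site.

<?php
// Sketch: mark category and tag archives as noindex, nofollow.
// Adjust the conditionals to whichever archives duplicate your post content.
function wpse_archive_robots_meta() {
    if ( is_category() || is_tag() ) {
        echo '<meta name="robots" content="noindex, nofollow" />' . "\n";
    }
}
add_action( 'wp_head', 'wpse_archive_robots_meta' );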

Using robots.txt

You can instruct search engines to ignore entire sections of your website using a robots.txt file. Place this file in the root directory of your website, and make sure it can be reached through http://www.example.com/robots.txt.

The contents of this text file could, for example, look like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

Sitemap: http://www.example.com/sitemap.xml.gz

It is a good idea to include a sitemap reference as well (there are WordPress plugins that generate one).
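
If you would rather not maintain a physical file, WordPress serves a virtual robots.txt (as long as no real robots.txt exists in the web root) that you can extend through the robots_txt filter. The sketch below uses the real filter signature and home_url(), but the function name wpse_extend_robots_txt, the extra Disallow rule, and the /sitemap.xml path are assumptions; the sitemap URL in particular depends on which plugin generates it.

<?php
// Sketch: extend the virtual robots.txt that WordPress generates when
// no physical robots.txt file exists in the web root.
function wpse_extend_robots_txt( $output, $public ) {
    if ( $public ) { // skip when "Discourage search engines" is enabled
        $output .= "Disallow: /cgi-bin/\n";
        $output .= "Sitemap: " . home_url( '/sitemap.xml' ) . "\n";
    }
    return $output;
}
add_filter( 'robots_txt', 'wpse_extend_robots_txt', 10, 2 );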

Please know that, in the original robots.txt standard, there are no wildcards (like *) in path patterns, even if Google says differently. The asterisk in the User-agent line is the only wildcard the standard allows; it will not work in Disallow directives!

The Allow: directive is likewise not part of the original standard. While Google may support these extensions to the robots.txt concept, they are certainly not obeyed by all search engines. Unless you are specifically catering to Google, only use the directives described on the official robots.txt website.

It is good to know that even without an explicit wildcard, you can still target multiple paths, because Disallow rules match by prefix:

Disallow: / will prevent search engines from indexing your entire website (root directory and everything in it).

Disallow: /joe/ will prevent search engines from indexing everything inside the joe folder, which is located inside the root directory.

Disallow: /joe will prevent search engines from indexing everything inside the root directory starting with joe. So joe.html and joey.html will not be indexed, but hank.html will.

Last remarks

Remember that even if you do all three of these things (which I encourage), search engines do not have to obey these instructions. They are just that: instructions. Malware crawlers in particular will ignore anything you instruct, simply because they want to find out all they can about your website.