Jim Connolly posted a link on his Google+ page to “No Such Thing As A Good Scraper” by AJ Kohn, who writes a pretty good analysis and summary of what content scrapers often do with your written content.
Not wholly sure what content scraping is? The Wikipedia definition of a scraper site will help. It starts thus:
A scraper site is a spam website that copies all of its content from other websites using web scraping. In the last few years scraper sites have proliferated at an amazing rate for spamming search engines.
There’s even software you can buy that will automate the theft of content from other people’s websites. Check this description of one product (no, I’m not linking to it):
Web Content Extractor is the most powerful and easy-to-use data extraction software for web scraping and data extraction from the websites.
So it’s not a new thing and, according to AJ Kohn (and many others), it is getting worse.
In the comments on Jim’s Google+ post, I had an interesting exchange of opinions with one commenter, Joel Hughes. He raised some good points that I’ve heard before: in essence, that content scraping isn’t so bad as long as the scraper links back to your original post.
Unfortunately, scraping isn’t as innocent as recent services such as Pinterest (which faces copyright issues of its own). Scraped content that’s published on scraper websites often has the link to the original post changed, to the clear disadvantage of the original content creator, as AJ Kohn illustrates in his post.
In my opinion, content scraping is theft, pure and simple.
If someone has taken your content (a larger proportion than anyone would reasonably call fair use) without your permission and reposted it, sometimes repurposed and often surrounded with ads, that’s theft.
Clearly stating on your site what you allow others to do with your content (a Creative Commons license is the easiest way to do this) is great for honest folk. Scrapers tend not to be in this group, so terms and conditions or the like would have little impact on them.
So does that mean, then, that this is all part of the online landscape and we just live with it? AJ Kohn has a clear view on that in his post:
[…] We turn a blind eye and whistle past the graveyard happily trusting that Google will sort it all out. They’ll make sure that the original content is returned instead of the scraped content. That’s a lot of faith to put in Google, particularly as they struggle to keep up with the increasing pace of digital content.
Yet, we whine about how SEO is viewed by those outside of the industry. And we’ll whine again when Google gets a search result wrong and shows a scraper above the original content. Indignant blog posts will be written.
[…] It doesn’t have to be that way.
Why not build a Chrome extension that lets me flag and report scraper sites? Or a WordPress Plugin that lets me mark and report a site as a scraper directly within the comment interface. Or how about a section in Google Webmaster Tools where I can review links?
Such tech measures would help, undoubtedly. Yet I fear that content scraping really is part of the online landscape, as it has a lot to do with behaviours and how people use the tech. Maybe it is something that we’ll have to regard as we have regarded email spam for years: an irritant, something that is simply in our environment. As Kohn eloquently says:
This stuff is garbage. It’s content pollution. It is the arterial plaque of the Internet.
At the least, you can make sure you have clearly stated what others may do with your content on the web, even if the scrapers will steal it anyway.