Content-scraping is theft, pure and simple

Jim Connolly posted a link on his Google+ page to “No Such Thing As A Good Scraper” by AJ Kohn, who writes a pretty good analysis and summary of what content scrapers often do with your written content.

You’re not wholly sure what content scraping is? The Wikipedia definition of a scraper site will help you understand it. It starts thus:

A scraper site is a spam website that copies all of its content from other websites using web scraping.  In the last few years scraper sites have proliferated at an amazing rate for spamming search engines.

There’s even software you can buy that will automate the theft of content from other people’s websites. Check this description of one product (no, I’m not linking to it):

Web Content Extractor is the most powerful and easy-to-use data extraction software for web scraping and data extraction from the websites.

So it’s not a new thing and, according to AJ Kohn (and many others), it is getting worse.

In the comments on Jim’s Google+ post, I had an interesting exchange of opinions with one commenter, Joel Hughes. He raised some good points that I’ve heard before: in essence, that content scraping isn’t so bad as long as the scraper links back to your original post.

Unfortunately, scraping isn’t as innocent as recent services such as Pinterest (which faces copyright issues of its own). Scraped content that’s published on scraper websites often has the link to the original post changed to the clear disadvantage of the original content creator, as AJ Kohn illustrates in his post.

In my opinion, content scraping is theft, pure and simple.

If someone has taken your content – a larger proportion than anyone would reasonably call fair use – without your permission and reposted it (sometimes repurposed), often surrounded with ads, that’s theft.

Clearly stating on your site what you allow others to do with your content (a Creative Commons license is the easiest way to do this) is great for honest folk. Scrapers tend not to be in this group, so terms and conditions or the like would have little impact on them.

Providing a link back to the original post doesn’t mean that the act of the scraper in using your content without your permission is therefore legitimate. The scraper needs your permission, or to abide by any terms of use you make available on your site, in order to not be a thief.

So does that mean, then, that this is all part of the online landscape and we just live with it? AJ Kohn has a clear view on that in his post:

[…] We turn a blind eye and whistle past the graveyard happily trusting that Google will sort it all out. They’ll make sure that the original content is returned instead of the scraped content. That’s a lot of faith to put in Google, particularly as they struggle to keep up with the increasing pace of digital content.

Yet, we whine about how SEO is viewed by those outside of the industry. And we’ll whine again when Google gets a search result wrong and shows a scraper above the original content. Indignant blog posts will be written.

[…] It doesn’t have to be that way.

Why not build a Chrome extension that lets me flag and report scraper sites? Or a WordPress Plugin that lets me mark and report a site as a scraper directly within the comment interface. Or how about a section in Google Webmaster Tools where I can review links?

Such tech measures would help, undoubtedly. Yet I fear that content scraping really is part of the online landscape, as it has a lot to do with behaviours and how people use the tech. Maybe it is something that we’ll have to regard as we have done for years with email spam – an irritant, something that is in our environment. As Kohn eloquently says:

This stuff is garbage. It’s content pollution. It is the arterial plaque of the Internet.

At least you can make sure you state clearly what others may do with your content on the web, even if the scrapers will steal it anyway.

Neville Hobson

Social Strategist, Communicator, Writer, and Podcaster with a curiosity for tech and how people use it. Believer in an Internet for everyone. Early adopter (and leaver) and experimenter with social media. Occasional test pilot of shiny new objects. Avid tea drinker.

  1. Bruno Amaral

    I will have to make a case for ethical content scraping.

    The software used to scrape a website’s content is a tool, and as a tool it is agnostic in the sense that it will be good or evil depending on your purpose.

    Case in point: http://www.iefp.pt/Paginas/Home.aspx
    This is the website for the Portuguese employment and training institute. The content is relevant, but the website has not been updated for quite some time. It is hard to navigate and find the information you need.

    Then you have http://alt.iefp.eu, a small project of mine where I have scraped the site to list every single page and offer visitors a clear search engine (that needs a lot of work in terms of design and content).

    I don’t see this as a form of theft. Every result links back to the original IEFP website and the goal is to make people’s life easier. Yes, there is an ad unit on the page. However, it serves to pay for the hosting costs and over the course of its placement it hasn’t even made 1€ of revenue.

    Think of scraping as a tool, and use it to help your visitors achieve something useful.

    • Neville Hobson

      Some interesting points, Bruno, thanks. However…

      “The software used to scrape a website’s content is a tool, and as a tool it is agnostic in the sense that it will be good or evil depending on your purpose.”

      I largely agree, hence my point in my post about people’s behaviours and how they use tech.

      “I don’t see this as a form of theft. Every result links back to the original IEFP website and the goal is to make people’s life easier.”

      Sorry, Bruno, but I think you’re dead wrong – if you don’t have permission to do what you’re doing, it’s theft no matter what your motive might be. If you have good intent, why not tell that to the content owner and seek permission first?

      • Bruno Amaral

        The content that gets copied is nothing more than the URL of the page and its title. And that is the only thing that is displayed by the website. The heart of it remains untouched and can only be viewed at the original location. This approach doesn’t even take away visits from their website; it aims to provide them with more visitors and even increase the success rate of each visit to search for information.

        As far as asking for permission goes, the information that gets scraped is public, and Google scrapes it all the time without asking for it.

        And to be honest, I disagree with the way our public institutions work. Entering a process of pre-approval would be long and tedious, culminating in me losing any motivation. And let’s face it, I doubt a Portuguese public institution would acknowledge that their website is not effective.

        • Neville Hobson

          In which case, Bruno, that’s not content scraping if you publish only the URL and title. That’s even less than a typical Google search result would include.

          If on the other hand you intended or wanted to publish much of the content beyond what would reasonably be regarded as fair use, then you’d need to get permission. Otherwise, you’d probably be seen as content scraping.

          • Bruno Amaral

            I still consider this to be content scraping. The tool used is meant to map websites and their exchange of links, and I have created the website’s directory.

            It is also possible to copy the entire content to the database, making it searchable but not viewable by the visitor and still direct him/her to the original content.

            So the bigger question is not whether content scraping is or is not theft, as it is a mere tool, but how far we can go without taking advantage of someone else’s hard work.

            • Neville Hobson

              Hmm, not convinced, Bruno. Content scraping – meaning, grabbing and using someone else’s content (not just a URL and a title) for your own purpose without their permission – is theft, you do see that, right?

              As for the bigger question you refer to, well, I doubt there will be a single clear answer. More of a debating point, I think.

  2. Jim Connolly

    Thanks for widening the initial scope of the conversation.

    In my opinion, scraping is theft. It’s an automated process, designed to populate a site with content and ‘game’ search engines and drive traffic to ads.

    Thanks for the post, Neville.
