Digg Cannot Be Scraped

It appears that the news sharing site, Digg, cannot be scraped. Scraping is when a site uses a script to get the content from another site.

BACKGROUND

Scraping can be used for good or for bad. Here are two scenarios. In the first, a site goes around to various news sites and takes news articles without credit; it is questionable whether this is acceptable even with credit. In the second, a site owner might scrape Amazon results and provide links back to Amazon for fulfilment. Here, there would be little reason for Amazon to complain.

A better solution than scraping, and one that has certainly taken off, is providing an API such as a Web service or a feed. With an API there is an arranged and hopefully consistent format for the data, as well as an indication of which data the site wants other sites to use.
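To make the feed case concrete, here is a minimal sketch of reading an RSS feed in PHP with SimpleXML. The feed contents, the example.com links, and the feed_items() helper are all made up for illustration; this is not the Tapoll code, and it assumes the SimpleXML extension is enabled.

```php
<?php
// Hypothetical sample feed; a real one would be fetched from the publisher's feed URL.
$rss = <<<XML
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/stories/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/stories/2</link>
    </item>
  </channel>
</rss>
XML;

// Return each feed item as a ['title' => ..., 'link' => ...] array.
function feed_items($xml_string) {
    $feed = simplexml_load_string($xml_string);
    if ($feed === false) {
        return array(); // malformed XML: nothing to show
    }
    $items = array();
    foreach ($feed->channel->item as $item) {
        $items[] = array(
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
        );
    }
    return $items;
}

// Print one line per story: title, then link.
foreach (feed_items($rss) as $item) {
    echo $item['title'], ' - ', $item['link'], "\n";
}
```

Because the publisher fixes the element names, the script does not have to guess at page markup the way a scraper does.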

Tapoll - predict-a-poll Web 2.0 site

The Dan Zen predict-a-poll Web 2.0 site, Tapoll, uses live data from five popular sites to show items relating to the current poll. This is done with a combination of scraping and the various APIs those sites make available. Users can then reference the items and also start polls referring directly to the items by clicking the associated Tapoll link. To place the Tapoll link after each item, we need access to the data. The current Digg JavaScript interface, which is made to put Digg stories on other sites, will not do.

PROBLEM

In the past, the Tapoll site used PHP to read the Digg RSS feed. But now it appears that Digg is blocking HTTP requests from other scripts, most likely with a blanket block of non-Digg IP addresses. We get the message below when scraping or trying to access the RSS feeds with the PHP file() function. The same call works fine with most other sites, such as Flickr and del.icio.us.

failed to open stream: HTTP request failed!

If anybody has a work-around or any further information on this, please let us know. Perhaps we should have just checked with Digg rather than this attempt at sensationalism. But hey, I’m playing reporter!
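One thing worth ruling out before blaming IP addresses: as far as I know, PHP's file() sends no User-Agent header unless the user_agent setting or a stream context provides one, and some servers ignore such requests. The following is a rough sketch of adding the header. The file_with_agent() and context_with_agent() helpers and the agent string are hypothetical, and whether this satisfies Digg is an assumption, not something confirmed here.

```php
<?php
// Build a stream context that sends a User-Agent header with the request.
// The agent string passed in is a placeholder, not a value Digg is known to require.
function context_with_agent($user_agent) {
    return stream_context_create(array(
        'http' => array(
            'user_agent' => $user_agent,
            'timeout'    => 10, // seconds; give up rather than hang
        ),
    ));
}

// Read a URL into an array of lines, like file(), but with the header set.
// Returns false on failure, as file() does; @ hides the warning text.
function file_with_agent($url, $user_agent) {
    return @file($url, 0, context_with_agent($user_agent));
}

// Show the header value the context would send (no network access needed).
$context = context_with_agent('Mozilla/5.0 (compatible; ExampleReader/1.0)');
$options = stream_context_get_options($context);
echo $options['http']['user_agent'], "\n";

// Possible use against a live page:
// $lines = file_with_agent('http://digg.com/news/popular/24hours',
//                          'Mozilla/5.0 (compatible; ExampleReader/1.0)');
```

If the request still fails with a user agent set, the block is probably on something else, and the IP theory becomes more likely.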

9 Responses to “Digg Cannot Be Scraped”

  1. Nealosis Says:

    Did you ever figure this out? I rolled my own web-based RSS reader and noticed a few months back that all of Digg’s feeds were timing out when I tried to get them through .NET’s WebRequest call. They all open fine in IE or Outlook, but I spent a lot of time rolling my own client and it really pisses me off that they’re blocking in this manner.

  2. danzen Says:

    Hi Nealosis – I did not figure it out or try again… but if anyone reading this knows anything about it, please let us know.

  3. John Magda Says:

    There is a google widget that does this, here is a sample of the scrape.

    http://dannyjoo.googlepages.com/diggstop24.xml

  4. Mateo Says:

    CURL seemed to work for me.

  5. Derek Says:

    I can’t believe they did this. I’m trying to roll my own reader with PHP and get the same error. I’m guessing they are blocking it by user agent? Oh well, I can just use cURL and spoof the agent to “Mozilla/5.0 (fuckyoudigg)”

  6. Stanley Says:

    Mateo is correct, cURL works just fine. But make sure you set a user agent, otherwise it will time out.

  7. Dan Zen Says:

    Yes… cURL worked. In my PHP I used this:

    // Fetch a URL with cURL while sending the given User-Agent string.
    function download_pretending($url, $user_agent) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
        curl_setopt($ch, CURLOPT_HEADER, 0);         // leave response headers out of the result
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
        $result = curl_exec($ch);                    // false on failure
        curl_close($ch);
        return $result;
    }
    echo download_pretending("http://digg.com/news/popular/24hours", "MSIE");

  8. Dan Zen Says:

    Thanks Mateo and Stanley and PHP.net and cURL 😉 and Digg I suppose…

  9. web scraping service Says:

    hehehe it’s foolish of them.

    if you can see it on your browser, we can scrape it.
