It appears that the news sharing site Digg cannot be scraped. Scraping is when one site uses a script to fetch content from another site.
BACKGROUND
Scraping can be used for good or for bad. Consider two scenarios. In the first, a site goes around to various news sites and takes articles without credit; it is questionable whether this is acceptable even with credit. In the second, a site owner scrapes Amazon results and provides links back to Amazon for fulfilment, and there would be little reason for Amazon to complain.
A better alternative to scraping, and one that has certainly taken off, is providing an API such as a Web service or a feed. With an API there is an agreed and hopefully consistent format for the data, as well as a clear indication of which data the site wants other sites to use.
The Dan Zen predict-a-poll Web 2.0 site, Tapoll, uses live data from five popular sites to show items relating to the current poll. This is done with a combination of scraping and the various APIs made available. Users can then reference the items and start polls that refer directly to them by clicking the associated Tapoll link. To place the Tapoll link after each item, we need access to the data, and the current Digg JavaScript interface, which is made to put Digg stories on other sites, will not do.
PROBLEM
In the past, the Tapoll site used PHP to read the Digg RSS feed. But now it appears that Digg is blocking HTTP requests from other scripts, most likely with a blanket block on non-Digg IP addresses. We get the message below when scraping or trying to access the RSS feeds with the PHP file() command. This command works fine with most other sites like Flickr, del.icio.us, etc.
failed to open stream: HTTP request failed!
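For reference, a minimal sketch of the kind of file() call that fails here; the feed address below is a placeholder for illustration rather than the exact feed Tapoll requests:

// read the feed over HTTP with file(); requires allow_url_fopen to be enabled
$url = "http://digg.com/rss/index.xml"; // placeholder feed address
$lines = file($url); // emits the warning above and returns false when the request is refused
if ($lines === false) {
    die("Could not read the feed.\n");
}
echo implode("", $lines); // file() returns an array of lines, so join them back together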
If anybody has a work-around or any further information on this, please let us know. Perhaps we should have just checked with Digg rather than this attempt at sensationalism. But hey, I’m playing reporter!
July 23, 2007 at 6:24 pm
Did you ever figure this out? I rolled my own web-based RSS reader and I noticed a few months back that all of Digg's feeds were timing out when I tried to get them through .NET's WebRequest call. They all open fine in IE or Outlook, but I spent a lot of time rolling my own client and it really pisses me off that they're blocking in this manner.
July 28, 2007 at 10:04 pm
Hi Nealosis – I did not figure it out or try again… but if anyone reading this knows anything about it, please let us know.
August 2, 2007 at 9:21 pm
There is a Google widget that does this; here is a sample of the scrape.
http://dannyjoo.googlepages.com/diggstop24.xml
September 6, 2007 at 6:04 am
cURL seemed to work for me.
October 4, 2007 at 12:41 am
I can't believe they did this. I'm trying to roll my own reader with PHP and get the same error. I'm guessing they are blocking requests by user agent? Oh well, I can just use cURL and spoof the agent to "Mozilla/5.0 (fuckyoudigg)"
October 12, 2007 at 7:42 pm
Mateo is correct, cURL works just fine. But make sure you set a user agent, otherwise it will time out.
April 21, 2008 at 3:15 am
Yes… cURL worked. In my PHP I used this:
function download_pretending($url, $user_agent) {
    // fetch a URL with cURL while sending a custom User-Agent header
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($ch, CURLOPT_HEADER, 0);         // leave response headers out of the result
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body as a string instead of printing it
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

echo download_pretending("http://digg.com/news/popular/24hours", "MSIE");
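A follow-up sketch: the returned string can be handed to simplexml_load_string() to pull out item titles and links, assuming the URL actually returns a standard RSS 2.0 document (channel/item/title/link):

// sketch only: assumes the fetched document is standard RSS 2.0
$rss = download_pretending("http://digg.com/news/popular/24hours", "MSIE");
$xml = simplexml_load_string($rss);
if ($xml !== false) {
    foreach ($xml->channel->item as $item) {
        // print each story title with its link
        echo $item->title . " - " . $item->link . "\n";
    }
}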
April 21, 2008 at 3:17 am
Thanks Mateo and Stanley and PHP.net and cURL 😉 and Digg I suppose…
March 25, 2009 at 1:09 am
hehehe, it's foolish of them.
If you can see it in your browser, we can scrape it.