Scrape Digg Search Results Python Script
Digg is by far the most popular social news site on the internet. With its simple “thumbs up” system, the users of the site promote the most interesting and high-quality stories, and the best of those make it to the front page. What you end up with is a filtered view of the most interesting stuff.
It’s a great site and one that I visit every day.
I wanted to write a script that makes use of the search feature on Digg so that I could scrape out and re-purpose the best stuff to use elsewhere. The first step in writing that larger (top secret) program was to start with a scraper for Digg search.
The short Python script I came up with returns the search results from Digg in a standard Python data structure (a list of dictionaries), so it’s simple to use. It parses out the title, destination, comment count, digg link, digg count, and summary for the top 100 search results.
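As a rough illustration of that structure, the sketch below works on a hand-written list shaped like the return value of `digg_search`. The two entries are made-up placeholders rather than real Digg results, and `top_by_diggs` is a hypothetical helper name, not part of the script. It is written for Python 3. One detail worth noting: the script stores `comment_count` as an integer but keeps `digg_count` as the scraped string, so the sketch converts it before sorting.

```python
# Placeholder data shaped like the list that digg_search() returns.
# These entries are invented for illustration; they are not real search results.
sample_results = [
    {
        'title': 'Example story one',
        'destination': 'http://example.com/one',
        'comment_count': 12,
        'digg_link': '/example/story_one',
        'digg_count': '150',   # the script keeps this as the scraped string
        'summary': 'Placeholder summary text.',
    },
    {
        'title': 'Example story two',
        'destination': 'http://example.com/two',
        'comment_count': 3,
        'digg_link': '/example/story_two',
        'digg_count': '980',
        'summary': 'Another placeholder summary.',
    },
]


def top_by_diggs(results, limit=10):
    """Return up to `limit` results, highest digg count first."""
    ranked = sorted(results, key=lambda r: int(r['digg_count']), reverse=True)
    return ranked[:limit]


for story in top_by_diggs(sample_results):
    print('%s diggs: %s -> %s' % (story['digg_count'], story['title'], story['destination']))
```

Because each result is a plain dictionary, the same list can be filtered, sorted, or serialized with ordinary Python tools.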
You can perform advanced searches on digg by using a number of different flags:
- +b Add to see buried stories
- +p Add to see only promoted stories
- +np Add to see only unpromoted stories
- +u Add to see only upcoming stories
- Put terms in “quotes” for an exact search
- -d Remove the domain from the search
- Add -term to exclude a term from your query (e.g. apple -iphone)
- Begin your query with site: to only display stories from that URL.
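The flags above are simply extra tokens in the query string, so they can be assembled programmatically. The following is a minimal Python 3 sketch; `build_query` and its parameters are hypothetical names introduced here, and placing the flags after the search terms is an assumption based on the `'twitter -d'` style of query used in this post.

```python
from urllib.parse import quote_plus


def build_query(terms, flags=(), exclude=(), exact=None):
    """Combine search terms with the flags listed above into one query string.

    terms   -- plain search words, e.g. 'apple'
    flags   -- flag strings exactly as listed above, e.g. ('+p', '-d')
    exclude -- words to exclude; each one is prefixed with '-'
    exact   -- optional phrase to wrap in double quotes
    """
    parts = [terms]
    if exact:
        parts.append('"%s"' % exact)
    parts.extend('-' + word for word in exclude)
    parts.extend(flags)
    return ' '.join(parts)


query = build_query('apple', flags=('+p',), exclude=('iphone',))
print(query)              # apple -iphone +p
print(quote_plus(query))  # apple+-iphone+%2Bp
```

The second line of output shows why the query needs URL-encoding before it goes into a request: the literal `+` of a flag has to become `%2B`, otherwise it would be read as a space.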
This script also allows the search results to be sorted:
```python
from DiggSearch import digg_search

digg_search('twitter', sort='newest')  # sort by newest first
digg_search('twitter', sort='digg')    # sort by number of diggs
digg_search('twitter -d')              # sort by best match
```
Here’s the Python code:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/

import urllib, urllib2
import re

from BeautifulSoup import BeautifulSoup

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'

def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)

def digg_search(query, sort=None, pages=10):
    """Returns a list of the information I need from a digg query
    sort can be one of [None, 'digg', 'newest']
    """
    digg_results = []
    # range() excludes its end point, so use pages + 1 to fetch every requested page
    for page in range(1, pages + 1):

        #create the URL
        address = "http://digg.com/search?s=%s" % (urllib.quote_plus(query))
        if sort:
            address = address + '&sort=' + sort
        if page > 1:
            address = address + '&page=' + str(page)

        #GET the page
        request = urllib2.Request(address, None, {'User-Agent':USER_AGENT} )
        urlfile = urllib2.urlopen(request)
        html = urlfile.read(200000)
        urlfile.close()

        #scrape it
        soup = BeautifulSoup(html)
        links = soup.findAll('h3', id=re.compile(r"title\d"))
        comments = soup.findAll('a', attrs={'class':'tool comments'})
        diggs = soup.findAll('strong', id=re.compile(r"diggs-strong-\d"))
        body = soup.findAll('a', attrs={'class':'body'})
        for i in range(0,len(links)):
            item = {'title':remove_extra_spaces(' '.join(links[i].findAll(text=True))).strip(),
                    'destination':links[i].find('a')['href'],
                    'comment_count':int(comments[i].string.split()[0]),
                    'digg_link':comments[i]['href'],
                    'digg_count':diggs[i].string,
                    'summary':body[i].find(text=True)
                    }
            digg_results.append(item)

        #last page early exit
        if len(links) < 10:
            break

    return digg_results

if __name__=='__main__':
    #for testing
    results = digg_search('twitter -d', 'digg', 2)
    for r in results:
        print r
```
You can grab the source code from the bitbucket repository.
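For readers on Python 3, where `urllib.quote_plus` has moved to `urllib.parse`, the URL-building step of the script can be sketched on its own. This is a partial port under stated assumptions, not the author's code: `build_search_url` and `BASE_URL` are names introduced here, and the block only constructs the address without sending any request.

```python
from urllib.parse import quote_plus

BASE_URL = 'http://digg.com/search'  # same endpoint the script above requests


def build_search_url(query, sort=None, page=1):
    """Build the search URL the same way the scraper does.

    sort may be None, 'digg' or 'newest'. Page numbering starts at 1 and
    the page parameter is only added from the second page onwards.
    """
    address = '%s?s=%s' % (BASE_URL, quote_plus(query))
    if sort:
        address += '&sort=' + sort
    if page > 1:
        address += '&page=' + str(page)
    return address


print(build_search_url('twitter -d', sort='digg', page=2))
# http://digg.com/search?s=twitter+-d&sort=digg&page=2
```

Keeping URL construction in its own function makes the paging and sorting logic easy to check without touching the network.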



OK, so I tried to simplify your code to
```python
import urllib, urllib2
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
address = "http://services.digg.com/search/stories?query=frankincense%20tree&sort=digg_count-desc&appkey=http%3A%2F%2Fwww.beepl.com&type=json"
request = urllib2.Request(address, None, {'User-Agent':USER_AGENT} )
urlfile = urllib2.urlopen(request)
page = urlfile.read(200000)
urlfile.close()
```
...but I am receiving a 403.
Digg failed to help me out, and I tried from about 10 different servers (all hosted on GoGrid).
I am desperate. Any ideas?
Thank you!!
```
[jparicka@25358_2_42578_205369:~] python abcd.py
Traceback (most recent call last):
  File "abcd.py", line 9, in ?
    urlfile = urllib2.urlopen(request)
  File "/usr/lib/python2.4/urllib2.py", line 130, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.4/urllib2.py", line 364, in open
    response = meth(req, response)
  File "/usr/lib/python2.4/urllib2.py", line 471, in http_response
    response = self.parent.error(
  File "/usr/lib/python2.4/urllib2.py", line 402, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.4/urllib2.py", line 480, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
```
I wasn’t able to duplicate the 403 error, but a couple of things to note.
Since you’re using the developer API, you don’t need to specify the user agent stuff and you can actually make it much simpler:
```python
>>> import urllib, json
>>> address = "http://services.digg.com/search/stories?query=frankincense%20tree&sort=digg_count-desc&appkey=http%3A%2F%2Fwww.beepl.com&type=json"
>>> data = json.load(urllib.urlopen(address))
```
```
[ jparicka dev ~ ] cat delete.me
import urllib, json
address = "http://services.digg.com/search/stories?query=frankincense%20tree&sort=digg_count-desc&appkey=http%3A%2F%2Fwww.beepl.com&type=json"
data = urllib.urlopen(address)
print data.readlines()
[ jparicka dev ~ ] python delete.me
['{"error":{"timestamp":1260195682,"message":"HTTP User-Agent header required","code":1029}}']
```
I get the same problem on all our boxes. :-(
I have tried it on a number of my computers (Linux, Windows and Mac) and it works fine, all with Python 2.6.
Are you stuck behind a firewall or proxy server?
Matt, we use mostly bitnami (rightscale) server images running on GoGrid. I can bring one server instance up if you want to have a look? I tried everything … :-( Thank you! I very much appreciate your help with this.
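The error body quoted earlier in the thread ("HTTP User-Agent header required") suggests two things worth trying when debugging this kind of failure: send an explicit User-Agent header, and read the body of an error response instead of only looking at the status code. Below is a minimal Python 3 sketch of both. The names `build_request`, `decode_error_body` and `fetch_json` are introduced here for illustration, and pairing the 403 status with that JSON body in the offline demo is an assumption, since the thread shows them in separate runs. It is not a confirmed fix for the 403 above, which occurred even with a User-Agent set.

```python
import io
import json
import urllib.error
import urllib.request

# Value taken from the original script in the post.
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'


def build_request(address, user_agent=USER_AGENT):
    """Return a Request that sends an explicit User-Agent header."""
    return urllib.request.Request(address, headers={'User-Agent': user_agent})


def decode_error_body(err):
    """Read the body of an HTTPError; return parsed JSON if possible, else text."""
    raw = err.read().decode('utf-8', 'replace')
    try:
        return json.loads(raw)
    except ValueError:
        return raw


def fetch_json(address):
    """Fetch address and return (status_code, decoded_body), also on HTTP errors."""
    try:
        with urllib.request.urlopen(build_request(address)) as response:
            return response.getcode(), json.loads(response.read().decode('utf-8'))
    except urllib.error.HTTPError as err:
        return err.code, decode_error_body(err)


# Offline demonstration: a hand-built HTTPError carrying the body from the thread.
# Combining a 403 status with this particular body is an assumption.
fake_error = urllib.error.HTTPError(
    'http://services.digg.com/search/stories', 403, 'Forbidden', None,
    io.BytesIO(b'{"error":{"timestamp":1260195682,'
               b'"message":"HTTP User-Agent header required","code":1029}}'))
print(decode_error_body(fake_error)['error']['message'])
```

`fetch_json` is defined but not called here, so the block runs without network access. Comparing the decoded error body across the failing and working machines may help narrow down whether a proxy or firewall is altering the request headers.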