Scrape Digg Search Results Python Script

Digg is by far the most popular social news site on the internet. With its simple “thumbs up” system, the users of the site promote the most interesting and highest-quality stories, and the best of those make it to the front page. What you end up with is a filtered view of the most interesting stuff.

It’s a great site and one that I visit every day.

I wanted to write a script that makes use of the search feature on Digg so that I could scrape out and re-purpose the best stuff to use elsewhere. The first step in writing that larger (top secret) program was to start with a scraper for Digg search.

The short Python script I came up with returns the search results from Digg as a standard Python data structure (a list of dictionaries), so it’s simple to use. It parses out the title, destination, comment count, Digg link, digg count, and summary for the top 100 search results.
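To give a feel for that structure, here is a minimal sketch of working with a returned list. It is written for Python 3 (the script in this post targets Python 2), the sample values are made-up placeholders rather than real Digg data, and `top_by_diggs` is a hypothetical helper, not part of the script. The dictionary keys match the ones the script builds, and it assumes `digg_count` arrives as a plain numeric string.

```python
# Made-up placeholder results in the shape the script returns;
# the values are illustrative only, not real Digg data.
sample_results = [
    {'title': 'Example story A', 'destination': 'http://example.com/a',
     'comment_count': 12, 'digg_link': '/example/a',
     'digg_count': '150', 'summary': 'Placeholder summary A'},
    {'title': 'Example story B', 'destination': 'http://example.com/b',
     'comment_count': 40, 'digg_link': '/example/b',
     'digg_count': '980', 'summary': 'Placeholder summary B'},
    {'title': 'Example story C', 'destination': 'http://example.com/c',
     'comment_count': 3, 'digg_link': '/example/c',
     'digg_count': '45', 'summary': 'Placeholder summary C'},
]


def top_by_diggs(results, n=2):
    """Return the n results with the highest digg count (hypothetical helper)."""
    # The scraper stores digg_count as text, so convert before comparing.
    return sorted(results, key=lambda r: int(r['digg_count']), reverse=True)[:n]


for item in top_by_diggs(sample_results):
    print('%s (%s diggs, %d comments)'
          % (item['title'], item['digg_count'], item['comment_count']))
```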

You can perform advanced searches on Digg by adding any of these flags and operators to the query:

  • +b Add to see buried stories
  • +p Add to see only promoted stories
  • +np Add to see only unpromoted stories
  • +u Add to see only upcoming stories
  • Put terms in “quotes” for an exact search
  • -d Remove the domain from the search
  • Add -term to exclude a term from your query (e.g. apple -iphone)
  • Begin your query with site: to only display stories from that URL.
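The operators in the list above are just text inside the query string, so they can be assembled programmatically. The following is a rough sketch in Python 3; `build_query` and its parameters are hypothetical names introduced here for illustration, and the exact form of the `site:` value (placed directly after the colon) is an assumption.

```python
def build_query(terms, exact=None, exclude=(), site=None, flags=()):
    """Assemble a Digg search string from the operators listed above.

    Hypothetical helper for illustration; not part of the original script.
    """
    parts = []
    if site:
        # The list says the query should *begin* with site:
        # (assumption: the URL follows the colon with no space).
        parts.append('site:%s' % site)
    parts.append(terms)
    if exact:
        # Quoted terms request an exact match.
        parts.append('"%s"' % exact)
    # A leading minus excludes a term, e.g. apple -iphone.
    parts.extend('-%s' % term for term in exclude)
    # Flags such as '+p', '+b', '+u' or '-d' are appended unchanged.
    parts.extend(flags)
    return ' '.join(p for p in parts if p)


print(build_query('apple', exclude=['iphone']))
print(build_query('twitter', flags=['-d']))
```

The resulting string can be passed as the first argument to `digg_search`, which URL-encodes it before sending the request.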

This script also allows the search results to be sorted:

from DiggSearch import digg_search
digg_search('twitter', sort='newest')  #sort by newest first
digg_search('twitter', sort='digg')  # sort by number of diggs
digg_search('twitter -d')  # sort by best match

Here’s the Python code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'
 
def remove_extra_spaces(data):
    p = re.compile(r'\s+')
    return p.sub(' ', data)
 
def digg_search(query, sort=None, pages=10):
    """Returns a list of the information I need from a digg query
    sort can be one of [None, 'digg', 'newest']
    """
 
    digg_results = []
    for page in range(1, pages + 1):  # fetch pages 1..pages inclusive (10 pages of 10 results = 100)
 
        #create the URL
        address = "http://digg.com/search?s=%s" % (urllib.quote_plus(query))
        if sort:
            address = address + '&sort=' + sort
        if page > 1:
            address = address + '&page=' + str(page)
 
        #GET the page
        request = urllib2.Request(address, None, {'User-Agent':USER_AGENT} )
        urlfile = urllib2.urlopen(request)
        html = urlfile.read(200000)  # read at most 200,000 bytes; named html so the loop counter is not overwritten
        urlfile.close()
 
        #scrape it
        soup = BeautifulSoup(html)
        links = soup.findAll('h3', id=re.compile(r"title\d"))
        comments = soup.findAll('a', attrs={'class':'tool comments'})
        diggs = soup.findAll('strong', id=re.compile(r"diggs-strong-\d"))
        body = soup.findAll('a', attrs={'class':'body'})
        for i in range(0,len(links)):
            item = {'title':remove_extra_spaces(' '.join(links[i].findAll(text=True))).strip(), 
                    'destination':links[i].find('a')['href'],
                    'comment_count':int(comments[i].string.split()[0]),
                    'digg_link':comments[i]['href'],
                    'digg_count':diggs[i].string,
                    'summary':body[i].find(text=True)
                    }
            digg_results.append(item)
 
        #last page early exit
        if len(links) < 10:
            break
 
    return digg_results
 
if __name__=='__main__':
    #for testing
    results = digg_search('twitter -d', 'digg', 2)
    for r in results:
        print r

You can grab the source code from the bitbucket repository.
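The URL-building step of the script is easy to check in isolation. Here is a minimal sketch of the same logic for Python 3, where `quote_plus` lives in `urllib.parse` instead of `urllib`. The function name `build_search_url` is a hypothetical one chosen for this example. The `s`, `sort` and `page` parameters are the ones the script uses.

```python
from urllib.parse import quote_plus


def build_search_url(query, sort=None, page=1):
    """Build a Digg search URL the same way the script does (Python 3 sketch)."""
    # quote_plus turns spaces into '+' and escapes a literal '+' as %2B,
    # so flags such as '+p' survive the trip through the query string.
    address = 'http://digg.com/search?s=%s' % quote_plus(query)
    if sort:
        address += '&sort=' + sort
    if page > 1:
        # Page 1 is the default, so the parameter is only added afterwards.
        address += '&page=' + str(page)
    return address


print(build_search_url('twitter -d', sort='digg', page=2))
```

Keeping this step in its own function makes it possible to verify the encoding of flags without making any network requests.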


5 Comments

Comment by Jan Paricka
2009-12-04 21:23:30

OK, so I tried to simplify your code to

import urllib, urllib2

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'

address = "http://services.digg.com/search/stories?query=frankincense%20tree&sort=digg_count-desc&appkey=http%3A%2F%2Fwww.beepl.com&type=json"

request = urllib2.Request(address, None, {'User-Agent':USER_AGENT} )

urlfile = urllib2.urlopen(request)

page = urlfile.read(200000)

urlfile.close()

...but I am receiving a 403.

Digg failed to help me out, and I tried from about 10 different servers (all hosted on GoGrid).

I am desperate. Any ideas?

Thank you!!

[jparicka@25358_2_42578_205369:~] python abcd.py
Traceback (most recent call last):
  File "abcd.py", line 9, in ?
    urlfile = urllib2.urlopen(request)
  File "/usr/lib/python2.4/urllib2.py", line 130, in urlopen
    return _opener.open(url, data)
  File "/usr/lib/python2.4/urllib2.py", line 364, in open
    response = meth(req, response)
  File "/usr/lib/python2.4/urllib2.py", line 471, in http_response
    response = self.parent.error(
  File "/usr/lib/python2.4/urllib2.py", line 402, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.4/urllib2.py", line 480, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

 
Comment by Matt Warren
2009-12-06 14:41:33

I wasn’t able to duplicate the 403 error, but a couple of things to note.

Since you’re using the developer API, you don’t need to specify the user agent stuff, and you can actually make it much simpler:

>>> import urllib, json
>>> address="http://services.digg.com/search/stories?query=frankincense%20tree&sort=digg_count-desc&appkey=http%3A%2F%2Fwww.beepl.com&type=json"
>>> data = json.load(urllib.urlopen(address))

Comment by Jan Paricka
2009-12-07 09:22:34

[ jparicka dev ~ ] cat delete.me
import urllib, json
address="http://services.digg.com/search/stories?query=frankincense%20tree&sort=digg_count-desc&appkey=http%3A%2F%2Fwww.beepl.com&type=json"
data = urllib.urlopen(address)
print data.readlines()

[ jparicka dev ~ ] python delete.me
['{"error":{"timestamp":1260195682,"message":"HTTP User-Agent header required","code":1029}}']

I get the same problem on all our boxes. :-(

 
 
Comment by Matt Warren
2009-12-07 09:48:54

I have tried it on a number of my computers (Linux, Windows and Mac, all with Python 2.6) and it works fine.

Are you stuck behind a firewall or proxy server?

 
Comment by Jan Paricka
2009-12-07 09:54:43

Matt, we use mostly bitnami (rightscale) server images running on GoGrid. I can bring one server instance up if you want to have a look? I tried everything … :-( Thank you! I very much appreciate your help with this.

 