Scrape Bing Search Engine Results Page

Based on my last post on scraping the Google SERP, I decided to make a small change to scrape the organic search results from Bing.

I wasn’t able to find a way to display 100 results per page in the Bing results, so this script only returns the top 10. It could be enhanced to loop through the pages of results, but I have left that out of this code.
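For anyone who wants to try the paging idea, here is a minimal sketch of how the per-page URLs might be built. It is written for Python 3, and it assumes Bing accepts a 1-based `first` offset parameter in the query string; that parameter, the `bing_page_urls` name, and the `RESULTS_PER_PAGE` constant are assumptions for illustration and are not part of the script in this post.

```python
from urllib.parse import urlencode

# The post reports 10 organic results per page.
RESULTS_PER_PAGE = 10


def bing_page_urls(query, pages=3):
    """Return one search URL per results page.

    Assumption: Bing reads a 1-based `first` offset from the query
    string (first=1 for page 1, first=11 for page 2, and so on).
    """
    urls = []
    for page in range(pages):
        params = {"q": query, "first": page * RESULTS_PER_PAGE + 1}
        # urlencode escapes spaces and special characters in the query
        urls.append("http://www.bing.com/search?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in bing_page_urls("halotis", pages=2):
        print(url)
```

Each URL could then be fetched and parsed the same way as the single page, with the link lists joined together.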

Example Usage:

$ python BingScrape.py
http://twitter.com/halotis
http://www.halotis.com/
http://www.halotis.com/progress/
http://doi.acm.org/10.1145/367072.367328
http://runtoloseweight.com/privacy.php
http://twitter.com/halotis/statuses/2391293559
http://friendfeed.com/mfwarren
http://www.date-conference.com/archive/conference/proceedings/PAPERS/2001/DATE01/PDFFILES/07a_2.pdf
http://twitterrespond.com/
http://heatherbreen.com

Here’s the Python Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
 
from BeautifulSoup import BeautifulSoup
 
def bing_grab(query):
 
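    # Build the search URL; quote_plus escapes spaces and special characters
    address = "http://www.bing.com/search?q=%s" % (urllib.quote_plus(query))
    # Send a browser-like User-Agent with the request
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
    urlfile = urllib2.urlopen(request)
    page = urlfile.read(200000)  # read at most 200000 bytes of the response
    urlfile.close()
 
    soup = BeautifulSoup(page)
    # Each organic result is an <h3> inside <div id="results">; take the href of its first link
    links = [x.find('a')['href'] for x in soup.find('div', id='results').findAll('h3')]
 
    return links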
 
if __name__=='__main__':
    # Example: print the result URLs to stdout, one per line
    links = bing_grab('halotis')
    print '\n'.join(links)
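The script above targets Python 2 with BeautifulSoup 3. As a rough sketch of the same extraction step on Python 3 using only the standard library, something like the following could work. It is not the code from this post: the `ResultLinkParser` class and `extract_links` function are hypothetical names, and it assumes the page still wraps organic results in `<h3>` tags inside `<div id="results">`, as the list comprehension above does.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect the href of the first <a> in each <h3> under <div id="results">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._div_depth = 0        # > 0 while inside the results div
        self._in_h3 = False
        self._h3_has_link = False  # only the first link per <h3> is kept

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the results div, then track nested divs
            if self._div_depth or attrs.get("id") == "results":
                self._div_depth += 1
        elif tag == "h3" and self._div_depth:
            self._in_h3 = True
            self._h3_has_link = False
        elif tag == "a" and self._in_h3 and not self._h3_has_link:
            href = attrs.get("href")
            if href:
                self.links.append(href)
                self._h3_has_link = True

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "h3":
            self._in_h3 = False


def extract_links(html):
    """Return the result URLs found in an HTML string."""
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    sample = ('<div id="results"><h3><a href="http://www.halotis.com/">'
              'HalOtis</a></h3></div>')
    print("\n".join(extract_links(sample)))
```

Unlike the list comprehension above, this sketch skips any `<h3>` that has no link instead of raising an error.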


1 Comment »

Comment by Marcin
2009-09-19 22:14:38

Hi Matt,

great sharing there. This piece of code is short and effective!

I am actually interested in the “related searches” in Bing and was trying to use BeautifulSoup to scrape them for personal data collection.

I was using the code below to find them, but it returned other data that I am not interested in.

results = soup.findAll('div', attrs={'class' : 'sw_menu'})

After going through the HTML, I realised there are actually two divs with a similar class name but different ids. The module took the latter one, which is not the one I am interested in.

I am wondering if you know how to use BeautifulSoup more effectively?

Thanks.

Best regards
Marcin

 