Scrape Yahoo Search Results Page

Yahoo search is on the way out and will be replaced by the search engine behind Bing, but that transition won’t happen until sometime in 2010. Until then Yahoo still has 20% of the search engine market share, so it remains an important source of traffic for your websites.

This script is similar to the Google and Bing SERP scrapers that I posted earlier on this site, but Yahoo’s pages were slightly more complicated to parse. Yahoo routes result links through a redirect service, so the real target URL has to be pulled out of each link with a regular expression.
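
To make the redirect handling concrete, here is a minimal sketch of that one step in isolation. It is written for Python 3 (the script in this post targets Python 2), the helper name extract_target is made up for illustration, and the sample link in the usage note is a hypothetical stand-in for what Yahoo served at the time. The only thing taken from the script is the pattern itself: everything after "/**" is the percent-encoded destination.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as the script in this post: capture everything after '/**'.
REDIRECT_PATTERN = re.compile(r'/\*\*(.*)')


def extract_target(href):
    """Return the decoded destination of a redirect-style link.

    Hypothetical helper, not part of the original script. Links that do
    not contain '/**' are returned unchanged.
    """
    match = REDIRECT_PATTERN.search(href)
    if match is None:
        return href
    # The destination is percent-encoded inside the redirect URL.
    return unquote_plus(match.group(1))
```

For example, extract_target('http://rds.yahoo.com/_ylt=abc/**http%3a//www.halotis.com/') gives back 'http://www.halotis.com/'. Returning the link unchanged when there is no match is a design choice of this sketch; the script below would raise an IndexError in that case.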

I will be putting all these little components together into a larger program later.

Example Usage:

$ python yahooScrape.py
http://www.halotis.com/
http://www.halotis.com/2007/08/27/automation-is-key-automate-the-web/
http://twitter.com/halotis
http://www.scribd.com/halotis
http://www.topless-sandal.com/product_info.php/products_id/743?tsSid=71491a7bb080238335f7224573598606
http://feeds.feedburner.com/HalotisBlog
http://www.planet-tonga.com/sports/haloti_ngata.shtml
http://blog.oregonlive.com/ducks/2007/08/kellens_getting_it_done.html
http://friendfeed.com/mfwarren
http://friendfeed.com/mfwarren?start=30

Here’s the Script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# (C) 2009 HalOtis Marketing
# written by Matt Warren
# http://halotis.com/
 
import urllib,urllib2
import re
 
from BeautifulSoup import BeautifulSoup
 
def yahoo_grab(query):
 
    address = "http://search.yahoo.com/search?p=%s" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'} )
    urlfile = urllib2.urlopen(request)
    page = urlfile.read(200000)
    urlfile.close()
 
    soup = BeautifulSoup(page)
    # Yahoo wraps each result link in a redirect URL; the real target follows '/**'
    url_pattern = re.compile(r'/\*\*(.*)')
    links = [urllib.unquote_plus(url_pattern.findall(x.find('a')['href'])[0]) for x in soup.find('div', id='web').findAll('h3')]
 
    return links
 
if __name__=='__main__':
    # Example: print the result URLs for a query, one per line
    links = yahoo_grab('halotis')
    print '\n'.join(links)
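
The script above needs Python 2 and the old BeautifulSoup 3 package. As a rough sketch of the same parsing step on Python 3, the version below swaps BeautifulSoup for the standard library's html.parser, so it is a different technique from the one the script uses. It assumes the page layout the script relies on (result titles are h3 elements inside a div with id "web"), which may no longer match what Yahoo serves. The names ResultLinkParser and parse_results are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote_plus

REDIRECT_PATTERN = re.compile(r'/\*\*(.*)')


class ResultLinkParser(HTMLParser):
    """Collect the first link of every <h3> inside <div id="web">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # 0 means we are outside div#web
        self.in_h3 = False
        self.h3_has_link = False  # only the first <a> per <h3> is kept
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self.div_depth > 0:
                self.div_depth += 1       # nested div inside div#web
            elif attrs.get('id') == 'web':
                self.div_depth = 1        # entering div#web
        elif tag == 'h3' and self.div_depth > 0:
            self.in_h3 = True
            self.h3_has_link = False
        elif tag == 'a' and self.in_h3 and not self.h3_has_link:
            href = attrs.get('href')
            if href:
                self.h3_has_link = True
                match = REDIRECT_PATTERN.search(href)
                # Decode the redirect target, or keep a plain link as-is.
                self.links.append(
                    unquote_plus(match.group(1)) if match else href)

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == 'h3':
            self.in_h3 = False


def parse_results(page):
    """Return the decoded result URLs found in an HTML string."""
    parser = ResultLinkParser()
    parser.feed(page)
    parser.close()
    return parser.links
```

Fetching the page is left out on purpose; parse_results only takes an HTML string, so it can be tried against a saved copy of a results page before pointing it at the live site.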
Related posts:

  1. Scrape Google Search Results Page
  2. Scrape Bing Search Engine Results Page
  3. Getting links to a domain using Alexa and Python
  4. Translating Text Using Google Translate and Python
  5. SEOCheck: Track Your Google Position Over Time

Comments

Comment by Ralph
2009-08-13 01:15:02

Thank you for the great how-tos! But can you please explain how all this can be monetized? I am having difficulty thinking of ways to make money.

Thank you!

 
Comment by Matt Warren
2009-08-13 09:30:19

Scraping search engine results will not directly put cash in your wallet, but it could be useful for tracking SEO efforts. Say you download the top 100 results for a keyword you’re targeting every day and see how your position changes over time as you work on your website. You could also get notified of new competitors for that keyword, or of when you get kicked out.

You could also use these sites as a starting point for some web crawling software that would go to each of the top sites and pull out some of the content. I have used interesting tools that will allow you to quickly skim through the content of all these sites – by going through them all quickly you can find patterns and maybe get some ideas for your own site.

I’m writing these scripts with the goal of building a business intelligence application that will allow you to see at a glance how well your business is doing.
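
The position-tracking idea from this comment could be sketched roughly as follows. This is an illustration only and not code from the post: it is written for Python 3, find_position is a hypothetical helper, and it expects a list of result URLs such as the one the script prints.

```python
from urllib.parse import urlparse


def find_position(links, domain):
    """Return the 1-based rank of the first result on `domain`, or None.

    Hypothetical helper: `links` is a list of result URLs in ranked order.
    Subdomains count as a match, so 'www.halotis.com' matches 'halotis.com'.
    """
    domain = domain.lower()
    for rank, link in enumerate(links, start=1):
        host = (urlparse(link).hostname or '').lower()
        if host == domain or host.endswith('.' + domain):
            return rank
    return None
```

Running this once a day for the same keyword and storing the result next to the date would give the position-over-time record described above. A result of None means the domain was not in the downloaded results.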

 