Scrape Advertisements from Google Search Results with Python

There are a number of services out there, such as Google Cash Detective, that run searches on Google and save the advertisements shown, so you can track who is advertising for which keywords over time. It's actually a fairly accurate technique for finding out which ads are profitable.

After tracking a keyword for several weeks, it's possible to see which ads have been running consistently. The nature of pay-per-click is that only profitable advertisements will continue to run long term: the advertiser pays for every click, so an ad that loses money tends to get pulled. So if you can identify which ads are profitable for which keywords, it should be possible to duplicate them and get some of that profitable traffic for yourself.

The script below is a Python program that may break the Google terms of service, so treat it as a guide to how this kind of HTML parsing could be done. It spoofs the User-agent header so the request appears to come from a real browser, runs a search for each keyword stored in an SQLite database, and stores the ads displayed for that keyword back in the database.

The script makes use of the awesome Beautiful Soup library, which makes parsing HTML content really easy. But like any web scraper it is fragile: it makes several assumptions about the structure of the Google results page, and if Google changes its markup the script could break.
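
Before the full listing, the request step can be sketched on its own. This is a minimal Python 3 sketch (the script itself targets Python 2); build_request and BROWSER_UA are names made up for illustration, and nothing is sent to Google. It uses urlencode instead of joining words with '+', on the assumption that keyword phrases may contain characters such as '&' that need escaping.

```python
# Python 3 sketch of the request-building step only; no network access.
from urllib.parse import urlencode
from urllib.request import Request

# Same browser string the script uses to look like a real browser.
BROWSER_UA = ('Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) '
              'AppleWebKit/525.13 (KHTML, like Gecko) '
              'Chrome/0.2.149.29 Safari/525.13')


def build_request(keyword_phrase, user_agent=BROWSER_UA):
    """Return an unsent Request for a Google search on keyword_phrase."""
    # Collapse runs of whitespace, then let urlencode do the escaping.
    phrase = ' '.join(keyword_phrase.split())
    query = urlencode({'hl': 'en', 'q': phrase})
    req = Request('http://www.google.com/search?' + query)
    # Without this header urllib identifies itself as Python-urllib.
    req.add_header('User-agent', user_agent)
    return req
```

Passing the returned object to urllib.request.urlopen would perform the actual search.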

#!/usr/bin/env python
 
import urllib
import urllib2
import sqlite3
import datetime
 
from BeautifulSoup import BeautifulSoup  # available at: http://www.crummy.com/software/BeautifulSoup/
 
conn = sqlite3.connect("espionage.sqlite")
conn.row_factory = sqlite3.Row
 
def get_google_search_results(keywordPhrase):
	"""make the GET request to Google.com for the keyword phrase and return the HTML text
	"""
	#sqlite3 returns unicode strings; quote_plus needs a byte string
	if isinstance(keywordPhrase, unicode):
		keywordPhrase = keywordPhrase.encode('utf-8')
	#quote_plus escapes characters such as & and # that would otherwise corrupt the query string
	url = 'http://www.google.com/search?hl=en&q=' + urllib.quote_plus(' '.join(keywordPhrase.split()))
	req = urllib2.Request(url)
	req.add_header('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.29 Safari/525.13')
	page = urllib2.urlopen(req)
	html = page.read()
	page.close()
	return html
 
def scrape_ads(text, phraseID):
	"""Scrape the text as HTML, find and parse out all the ads and store them in a database
	"""
	soup = BeautifulSoup(text)
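	#get the ads on the right hand side of the page
	rhs = soup.find(id='rhsline')
	if rhs is None:
		#no ad block on this page (no ads shown, or Google changed the markup)
		return
	ads = rhs.findAll('li')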
	position = 0
	for ad in ads:
		position += 1
 
		#display url
		parts = ad.find('cite').findAll(text=True)
		site = ''.join([word.strip() for word in parts]).strip()
		ad.find('cite').replaceWith("")
 
		#the header line
		parts = ad.find('a').findAll(text=True)
		title = ' '.join([word.strip() for word in parts]).strip()
 
		#the destination URL
		href = ad.find('a')['href']
		start = href.find('&q=')
		if start != -1:
			dest = href[start+3:]
		else:
			#no destination found; skip this ad, since AdTable.destination is NOT NULL
			print 'error: no destination in', href
			continue
 
		ad.find('a').replaceWith("")
 
		#body of ad
		brs = ad.findAll('br')
		for br in brs:
			br.replaceWith("%BR%")
		parts = ad.findAll(text=True)
		body = ' '.join([word.strip() for word in parts]).strip()
		lines = [line.strip() for line in body.split('%BR%')]
		line1 = lines[0]
		#some ads have no second line; avoid an IndexError
		line2 = lines[1] if len(lines) > 1 else ''
 
		#see if the ad is in the database
		c = conn.cursor()
		c.execute('SELECT adID FROM AdTable WHERE destination=? and title=? and line1=? and line2=? and site=? and phraseID=?', (dest, title, line1, line2, site, phraseID))
		result = c.fetchall() 
		if len(result) == 0:
			#NEW AD - insert into the table
			c.execute('INSERT INTO AdTable (`destination`, `title`, `line1`, `line2`, `site`, `phraseID`) VALUES (?,?,?,?,?,?)', (dest, title, line1, line2, site, phraseID))
			conn.commit()
			c.execute('SELECT adID FROM AdTable WHERE destination=? and title=? and line1=? and line2=? and site=? and phraseID=?', (dest, title, line1, line2, site, phraseID))
			result = c.fetchall()
		elif len(result) > 1:
			continue
 
		adID = result[0]['adID']
 
		#record this sighting and commit, otherwise the row is lost when the script exits
		now = datetime.datetime.now()
		c.execute('INSERT INTO ShowTime (`adID`,`date`,`time`, `position`) VALUES (?,?,?,?)', (adID, now, now, position))
		conn.commit()
 
 
def do_all_keywords():
	"""Run a search for every phrase in KeywordList and store the ads shown for it
	"""
	c = conn.cursor()
	c.execute('SELECT * FROM KeywordList')
	result = c.fetchall()
	for row in result:
		html = get_google_search_results(row['keywordPhrase'])
		scrape_ads(html, row['phraseID'])
 
if __name__ == '__main__' :
	do_all_keywords()
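
The destination step in the script takes everything after '&q=' in the ad link as-is, still URL-encoded. A possible alternative, sketched here in Python 3 with only the standard library, is to parse the query string properly. The function name extract_destination and the example href are made up for illustration; the real shape of Google's ad links is an assumption. Note that this version decodes the value and stops at the next '&', so it can return a different string than the script does.

```python
# Python 3 sketch; extract_destination is a hypothetical helper.
from urllib.parse import parse_qs, urlsplit


def extract_destination(href):
    """Return the decoded 'q' parameter of an ad link, or None if absent."""
    # parse_qs maps each parameter name to a list of decoded values.
    values = parse_qs(urlsplit(href).query).get('q')
    return values[0] if values else None


if __name__ == '__main__':
    # Made-up link in the general form "redirect path + q=<landing page>".
    print(extract_destination(
        '/aclk?sa=l&ai=abc&q=http://www.example.com/landing%3Fid%3D1'))
```

Returning None for a link without a 'q' parameter lets the caller decide whether to skip the ad or log it.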

4 Comments »

Comment by Vancouver Movers
2009-07-14 18:10:55

This is a very interesting technique, one that I had not thought of before. I can see how it can really be useful if you are niche hunting or even trying to figure out the best ad phrasing for your specific keywords. I’m going to have to look into this more, it can really yield some very profitable results. Thanks for sharing this technique and the code!

 
Comment by Robert
2010-01-20 02:56:23

I am a beginner with Python…, thanks for the code!

I am getting halotis.com in the search results when I search for terms with Python in them. This is a good site with some good info.

Could you please describe how to create the database to work with this code? When I try to run it, I get an error (no KeywordList table), and when I create that table I get another error (no phraseID something), and so on.
For a newbie it is very hard (impossible?) to recreate this as working code.

Could you help please?

Regards
Robert

 
Comment by Matt Warren
2010-01-20 18:49:58

oops guess I missed that part. Here’s the SQL that I used to create the tables:

CREATE TABLE IF NOT EXISTS "main"."AdTable" ("adID" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL , "title" VARCHAR, "line1" VARCHAR, "line2" VARCHAR, "site" VARCHAR, "destination" VARCHAR NOT NULL , "phraseID" INTEGER NOT NULL );

CREATE TABLE IF NOT EXISTS "main"."ShowTime" ("adID" INTEGER PRIMARY KEY NOT NULL , "date" DATETIME, "time" DATETIME, "position" INTEGER);

CREATE TABLE IF NOT EXISTS "main"."KeywordList" ("phraseID" INTEGER PRIMARY KEY NOT NULL , "keywordPhrase" VARCHAR);

CREATE TABLE IF NOT EXISTS "main"."CPCEstimate" ("phraseID" INTEGER NOT NULL , "date" DATETIME NOT NULL , "cpc" DOUBLE, PRIMARY KEY ("phraseID", "date"));
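
For anyone who would rather set the database up from Python, here is a rough Python 3 sketch that runs the same four statements with the sqlite3 module. The helper names (create_database, add_keyword, can_record_twice) are made up for illustration, the "main". prefix is dropped because main is the default schema, and the in-memory path is only there so it can be tried without touching espionage.sqlite. One thing worth checking: the script inserts a ShowTime row every time an ad is seen, but with adID as the primary key of ShowTime only one row per ad fits, so a second sighting is rejected. The last helper shows that; dropping PRIMARY KEY from ShowTime.adID is one possible change, but that is an assumption, not part of the schema above.

```python
import sqlite3

# The four tables from the comment above, without the "main". prefix.
SCHEMA = """
CREATE TABLE IF NOT EXISTS AdTable (
    adID INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    title VARCHAR, line1 VARCHAR, line2 VARCHAR, site VARCHAR,
    destination VARCHAR NOT NULL,
    phraseID INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS ShowTime (
    adID INTEGER PRIMARY KEY NOT NULL,
    date DATETIME, time DATETIME, position INTEGER
);
CREATE TABLE IF NOT EXISTS KeywordList (
    phraseID INTEGER PRIMARY KEY NOT NULL,
    keywordPhrase VARCHAR
);
CREATE TABLE IF NOT EXISTS CPCEstimate (
    phraseID INTEGER NOT NULL,
    date DATETIME NOT NULL,
    cpc DOUBLE,
    PRIMARY KEY (phraseID, date)
);
"""


def create_database(path=":memory:"):
    """Create the tables (if missing) and return the open connection."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    return conn


def add_keyword(conn, phrase):
    """Insert one phrase into KeywordList and return its phraseID."""
    cur = conn.execute(
        "INSERT INTO KeywordList (keywordPhrase) VALUES (?)", (phrase,))
    conn.commit()
    return cur.lastrowid


def can_record_twice(conn, ad_id=1):
    """Return True if ShowTime accepts two rows for the same adID."""
    try:
        for position in (1, 2):
            conn.execute(
                "INSERT INTO ShowTime (adID, date, time, position) "
                "VALUES (?, ?, ?, ?)",
                (ad_id, "2010-01-20", "18:49:58", position))
        return True
    except sqlite3.IntegrityError:
        # The second row collides with the primary key on adID.
        return False
    finally:
        # Undo the probe rows so the table is left as it was.
        conn.rollback()
```

Calling create_database("espionage.sqlite") and then add_keyword for each phrase to track gives the script something to work with.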

Comment by Robert
2010-01-21 06:10:44

Thanks Matt, got it working…

 
 