How To Get RSS Content Into An Sqlite Database With Python – Fast

Another in my series of Python scripting blog posts. This time I’m sharing a script that can rip through RSS feeds and devour their content and stuff it into a database in a way that scales up to 1000s of feeds. To accomplish this the script is multi-threaded.

The big problem with scaling up a web script like this is that there is a huge amount of latency when requesting something over the internet. Due to the bandwidth as well as remote processing time it can take as long as a couple of seconds to get anything back. Requesting one feed after the other in series will waste a lot of time, and that makes this type of script a prime candidate for some threading.

I borrowed parts of this script from this post: Threaded data collection with Python, including examples

What could you do with all this content? Just off the top of my head I can think of many interesting things to do:

  • Create histograms of the publish times of posts to find out the most/least popular days and times are for publishing
  • Plot trends of certain words or phrases over time
  • create your own aggregation website
  • get the trending topics by doing counting the occurrence of words by day
  • Try writing some natural language processing algorithms

This script is coded at 20 threads, but that really needs to be fine tuned for the best performance. Depending on your bandwidth and the sites you want to grab you may want to tweak the THREAD_LIMIT value.

import sqlite3
import threading
import time
import Queue
from time import strftime
 
import feedparser     # available at http://feedparser.org
 
 
THREAD_LIMIT = 20
jobs = Queue.Queue(0)
rss_to_process = Queue.Queue(THREAD_LIMIT)
 
DATABASE = "rss.sqlite"
 
conn = sqlite3.connect(DATABASE)
conn.row_factory = sqlite3.Row
c = conn.cursor()
 
#insert initial values into feed database
c.execute('CREATE TABLE IF NOT EXISTS RSSFeeds (id INTEGER PRIMARY KEY AUTOINCREMENT, url VARCHAR(1000));')
c.execute('CREATE TABLE IF NOT EXISTS RSSEntries (entry_id INTEGER PRIMARY KEY AUTOINCREMENT, id, url, title, content, date);')
c.execute("INSERT INTO RSSFeeds(url) VALUES('http://www.halotis.com/feed/');")
 
feeds = c.execute('SELECT id, url FROM RSSFeeds').fetchall()
 
def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stored them in the DB """
    for entry in items:
        c.execute('SELECT entry_id from RSSEntries WHERE url=?', (entry.link,))
        if len(c.fetchall()) == 0:
            c.execute('INSERT INTO RSSEntries (id, url, title, content, date) VALUES (?,?,?,?,?)', (id, entry.link, entry.title, entry.summary, strftime("%Y-%m-%d %H:%M:%S",entry.updated_parsed)))
 
def thread():
    while True:
        try:
            id, feed_url = jobs.get(False) # False = Don't wait
        except Queue.Empty:
            return
 
        entries = feedparser.parse(feed_url).entries
        rss_to_process.put((id, entries), True) # This will block if full
 
for info in feeds: # Queue them up
    jobs.put([info['id'], info['url']])
 
for n in xrange(THREAD_LIMIT):
    t = threading.Thread(target=thread)
    t.start()
 
while threading.activeCount() > 1 or not rss_to_process.empty():
    # That condition means we want to do this loop if there are threads
    # running OR there's stuff to process
    try:
        id, entries = rss_to_process.get(False, 1) # Wait for up to a second
    except Queue.Empty:
        continue
 
    store_feed_items(id, entries)
 
conn.commit()

Technorati Tags: , , , , , , , , ,



RSS feed | Trackback URI

15 Comments »

2009-07-09 02:06:32

[...] How To Get RSS Content Into An Sqlite Database With Python (tags: python rss web programming) Leave a Comment [...]

 
Comment by Gregory Saxton
2009-08-27 12:37:00

Hi,

Sorry to bug you with another question, but I really like this script and am wondering if you have any general tips on how to modify it such that it would read from a text file containing a number of different RSS feeds, then put the information from all the feeds into a single pysqlite database?

Thanks in advance for your time.

Greg

 
Comment by Gregory Saxton
2009-08-28 12:39:01

Sorry about that–I should have seen that that was already built into your script.

It works great by the way, so thanks!

Greg

 
Comment by Gregory Saxton
2009-08-28 17:08:29

OK, I do have one question that I can’t seem to figure out. I am interested in using a list of Twitter RSS feeds, and I’d like to put the http://bit.ly URLs as well as the hashtags in separate columns in the table. However, there does not seem to be an element or item in the Twitter RSS feed separating out these things, so that the “content” column in the pysqlite database contains the content, tiny url, and hashtags all together. So, here’s my question: would you separate these out with a regular expression, or some other method?

Thanks for any advice.

Comment by Matt Warren
2009-09-02 15:23:37

you could use a regular expression like this to pull out the url:

expression = '((?:http|https)(?::\\/{2}[\\w]+)(?:[\\/|\\.]?)(?:[^\\s"]*))'
rg = re.compile(expression,re.IGNORECASE|re.DOTALL)
m = rg.findall(txt)
if m:
    httpurl=m[0]

and to get all the hash tags:

expression = '(#(?:[^\\s"]*))'
rg = re.compile(expression,re.IGNORECASE|re.DOTALL)
m = rg.findall(txt)
for tag in m:
    print tag
 
 
Comment by Gregory Saxton
2009-09-03 20:08:03

Thanks. That’s great!

 
Comment by Gregory Saxton
2009-09-09 13:16:00

Hi Matt,

I’m trying to convert this script to use sqlalchemy and have a couple of questions. I am able to successfully get everything put into the RSSEntries table using sqlalchemy, but there are a few “leftover pieces” in the code that don’t seem to translate:

1) To replace your “conn=sqlite3.connect(DATABASE) line I have written “conn=engine.commit” (and that works). However, I don’t know what to do with the “conn.row_factory = sqlite3.Row” line. What does that do? I’ve left it out and the script still seems to work.

2) If I try to use the “c=conn.cursor()” line I keep getting an error:

c = conn.cursor()
AttributeError: ‘Connection’ object has no attribute ‘cursor’

I guess sqlalchemy doesn’t use “cursor” in this fashion?

3) Because of the “cursor” error, I replaced your “feeds = …” line with “feeds = conn.execute(‘SELECT id, url FROM feeds’).fetchall()”, which works well. However, I can’t get your “if len(c.fetchall()) == 0″ line to work, even when I replace “c” with “conn” or “engine.” (In my initial test, I just commented out that line then the script works). I am guessing this line helps avert duplicates? Is that right? And if so, any idea how to do this using sqlalchemy? Here’s the error I get:

if len(conn.fetchall()) == 0:
AttributeError: ‘Connection’ object has no attribute ‘fetchall’

I hope this makes sense–I can attach the whole code if it doesn’t–and thanks in advance for any help.

-Greg

 
Comment by Matt Warren
2009-09-14 11:16:29

Assuming you have the models defined as RSSEntries and RSSFeeds classes then converting to sqlalchemy would be something like :

CONNSTRING='sqlite:///rss.sqlite'
 
Base = declarative_base()
class RSSFeeds(Base):
    __tablename__ = 'RSSFeeds'
 
    id = Column(Integer, primary_key=True)
    url = Column(String)
 
    def __init__(self, url):
        self.url = url
 
    def __repr__(self):
        return "" % (self.url, ))
 
 
class RSSEntries(Base):
    __tablename__ = 'RSSEntries'
 
    id = Column(Integer, primary_key=True)
    url = Column(String)
    title = Column(String)
    content = Column(String)
    date = Column(Date)
 
    def __init__(self, url, title, content, date):
        self.url = url
        self.title = title
        self.content = content
        self.date = date
 
    def __repr__(self):
        return "" % (self.url, self.title, ))
 
 
engine = create_engine(CONNSTRING)
 
metadata = Base.metadata
metadata.create_all(engine) 
 
Session = sessionmaker(bind=engine)
session = Session()
feeds = session.query(RSSFeeds).all()
 
def store_feed_items(id, items):
    """ Takes a feed_id and a list of items and stored them in the DB """
    for entry in items:
        checkitem = session.query(RSSEntries).filter_by(url=entry.link).all()
            if not checkitem:
            item = RSSEntries(entry.link, entry.title, entry.summary, entry.updated_parsed)
            session.add(item)
    session.commit()

Note: I haven’t actually tested this code to make sure it works.

Comment by Gregory Saxton
2009-09-14 11:26:07

Thanks, Matt! I appreciate it. I think I have the code almost working now–the last piece of the puzzle is dealing with the “duplicates” issue, and I think what you wrote will work.

I just upgraded to Snow Leopard yesterday and will cross my fingers that Python works as expected. :) If it does, then I’ll post the entire code on GitHub and send you a quick note.

By the way, I’m originally from Vancouver–hope you’re enjoying it.

 
 
Comment by Miroslav
2009-09-14 22:49:49

Hi, I am new to Perl but I do need this functionality for the site. How would I import this into a MySQL database using Python? And also, where can I find some information on how to set up Python to run under Apache?

 
Comment by Miroslav
2009-09-14 22:50:50

Oops, meant to say – new to Python :)

 
Comment by Matt Warren
2009-09-14 23:07:57

for mySQL – if you take the code in the comments that uses sqlalchemy, then you can just change the connection string to mysql://user:password@sqlserver:port and it should just work.

getting python working on apache- look into mod-python

 
2010-01-26 17:18:19

[...] the blog post from Halotis that started my looking into [...]

 
2010-05-18 01:40:28

[...] How To Get RSS Content Into An Sqlite Database With Python – Fast [...]

 
Comment by uttecttirty
2011-04-13 05:47:59

When, as an organiser , or DJ , taking into account the audiovisual lease , there are several Visual complement to the audio equipment for the Dance Party that can be used to enhance the experience of players and create a lot more success. It is important to consider these audio-visual equipment Rental meliorations, specially in relation to rental of audio and video equipment. Rental of video equipment is not inexpensive, and you need to determine meaning and apologize the cost per opus added Ab rental equipment.

Of course, audio visual rentals component Ab systems is of predominate importance . Notwithstanding, when you add a simple, unexpected, powerful , and emotions better visuals, your saltations will lift to a new storey .

What are the Visual effects are in that location to add to your installation ball? We pursue the laser clarification , pin location lighting, black lights , light sources flash, mirror musket balls, reflecting telescopes, projection screens and projectors for rent, as well as to insert more modern set of extended lighting stripteases , passed lightin.

 
Name (required)
E-mail (required - never shown publicly)
URI
Your Comment (smaller size | larger size)
You may use <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" highlight=""> in your comment.