cElementTree ftw!
My mission was simple - to crawl all of our artist databases, extract names and email addresses from a couple of different tables, and spin out some custom XML.
The tables - one for complete user profiles and another for users who have signed up to receive emails only - may or may not exist, depending on the artist.
And the XML is to be consumed by an outside marketing vendor, for use in generating email campaigns. (We do some of our own email marketing, but most of our bulk mailing is handled by this third party.) So I got no spec - didn't need one, really - just some sample output.
Writing up a quick XML generator with minidom was a piece of cake.
But then I realized that some of these artists have more than a million records between their combined mailing list and profile tables. We have at least two artists with more than a million user records in each one. (In the end, some of the XML files I'm generating are as large as 40Mb.)
So that's obviously dealing with waaay too much data to keep in memory and throw to a prettily-formatted XML file all at once. It hummed along nicely for some of our artists with, ahem, smaller followings, but once I got to some of the heavy-hitters the whole process began to bog down.
(For the record - morbid curiosity and all - I did try writing each record to the file one at a time. And yes, it was as stupid and dog-slow as you might imagine it to be.)
What I needed was to be able to stream results to an XML output file. So I looked at some of the other XML packages out there (against the recommendations of a boyfriend who, I might add, is very opinionated when it comes to which XML libraries suck and which don't).
From all the benchmarking results I could find, lxml and libxml2 seem to perform pretty well when it comes to parsing.
http://codespeak.net/lxml/performance.html
http://effbot.org/zone/celementtree.htm#benchmarks
But parsing is not my problem - I'll let someone on the receiving end worry about that. My only concern is creating it as simply and quickly as possible, and I didn't find a lot in the way of results for generating XML.
So I decided to try cElementTree. It wasn't an entirely arbitrary decision - not needing to install anything new definitely swayed me (it's been part of Python since 2.5).
I didn't do any formal benchmarking of my own, just got some approximate numbers - it looks like I'm covering my total 3.5 million records at 0.13 ms per. So, considering what I was getting out of minidom, cElementTree is lightning fast.
A few other notes on odd things you might see in the script below:
...............................
#!/usr/bin/python
# usage: ./mylittlexmlscript.py
import shutil, sys, MySQLdb as Database
from xml.etree import cElementTree as ET
# the completed files should be moved out of the way as they're generated
# well, sure, they don't *have* to be - I just like keeping things neat and tidy
processed = "/someplace/toput/allthat/xml"
db = Database.connect(host="myhostname", port=0000, user="myusername", passwd="mypassword", db="")
cursor = db.cursor()
def walktheserver():
""" walk the server, generate a list of dbs """
sql = """SELECT table_schema FROM information_schema.tables GROUP BY table_schema"""
cursor.execute(sql)
dbnames = cursor.fetchall()
return dbnames
def gettables(dbname):
""" return all the table names for a specific db """
tables = []
sql = """SHOW TABLES FROM %s"""%(dbname)
cursor.execute(sql)
rows = cursor.fetchall()
for table in rows:
ltable = list(table)
tables.append(ltable[0])
return tables
def generate_xml(dbname):
""" check to see if the db has either mailing list or user table
and generate the xml accordingly """
tables = gettables(dbname)
if "mailinglist" in tables:
create_xml(dbname, 'ml') # ml = mailing list
if "users" in tables:
create_xml(dbname, 'up') # up = user profiles
def create_xml(dbname, type):
""" here's where we're actually going to put cElementTree to work """
outputfile = dbname+"_"+type+".xml" # name the output file
BSWrapper = ET.Element("BSWrapper") # create the root element of the XML tree
Header = ET.SubElement(BSWrapper, "Header") # create the first subelement, and give it a few attributes
Header.set("AccountID", "996")
Header.set("ListID", "885")
Header.set("UserID", "774")
sql = """USE %s"""%(dbname)
cursor.execute(sql)
if type is 'fcm':
sql = """SELECT fcm.email AS Email, fcm.first_name AS FirstName
FROM mailinglist AS fcm GROUP BY fcm.email ORDER BY Email"""
if type is 'w3t':
sql = """SELECT w3t.U_Email AS Email, w3t.U_First_Name AS FirstName
FROM users AS w3t GROUP BY Email ORDER BY Email"""
cursor.execute(sql)
rows = cursor.fetchall()
for record in rows:
data = ET.SubElement(BSFImport, "data")
Email = ET.SubElement(data, "Email")
Firstname = ET.SubElement(data, "Firstname")
try:
email = record[0].encode('iso-8859-1')
except AttributeError:
email = ''
except UnicodeDecodeError:
email = ''
try:
name = record[1].encode('iso-8859-1')
except AttributeError:
name = ''
except UnicodeDecodeError:
name = ''
Email.text = email
Firstname.text = name
tree = ET.ElementTree(BSFImport)
tree.write(outputfile)
shutil.move(outputfile, processed)
def main():
dblist = walktheserver()
# they're not "bad", per se - I would never pass that kind of judgement on a database
# these are just the dbs to be excluded from consideration
badlist = ['information_schema', 'artist1', 'artist2', 'artist3']
for dbname in dblist:
if dbname[0] not in badlist:
# print dbname[0]
newxml = generate_xml(dbname[0])
cursor.close()
db.close()
if __name__ == "__main__":
main()
For the record, the XML output looks like this - no pretty formatting, alas, but at least all the nodes are where they ought to be, every element accounted for. If formatting is a problem for the vendor, I'll deal with it in a few days and let you know what I came up with:
<BSWrapper><Header AccountID="996" ListID="885" UserID="774" /><data><Email>12ewoldt@yahoo.com</Email>
<Firstname>christian</Firstname></data><data><Email>1995fxstc@sbcglobal.net</Email><Firstname>Daniel</Firstname>
</data></BSWrapper>
gioca al casinĂ² in rete - 2010-05-14 04:38:33
New web development frameworks are sprouting out at a more rapid pace than anyone could keep up with. In this article, we are going to determine how to decide on a framework for creating your next, hot web application. In this current day and age, pushing out a finished, polished application well before your competitor is key.
chanel - 2010-05-20 07:28:47
Curious, the formatting turned out not to be an issue?
gucci - 2010-05-20 07:32:19
Actually, that would be a great idea! Please click here to find out if we can reach further views?
Gucci belt - 2010-06-24 01:34:33
Brand new <a href="http://guccibelts.net/">Gucci belt</a>, <a href="http://guccibelts.net/">Gucci belts</a>, <a href="http://guccibelts.net/">Gucci men's belts</a>, <a href="http://guccibelts.net/">Gucci belts for men</a> on sale.<br>
chanel - 2010-07-22 08:03:13
chanelfsartewtregeftefdsf
moncler jacket - 2010-07-29 01:51:11
MONCLER jacket become a big brand of fashion. We are one of the leader in <a href="http://www.jacket888.com"><strong>Wholesale Moncler jackets</strong></a>. Here you can find <a href="http://www.jacket888.com/mens-moncler-c-4.html"><strong>Men's Moncler</strong></a> that you speak highly of.
http://www.jacket888.com/
Discount Moncler jackets - 2010-08-20 08:34:45
Here provide <a href="http://www.mymonclerstore.com"><strong>Discount Moncler jackets</strong></a>,<a href="http://www.mymonclerstore.com/moncler-tshirts-c-7.html"><strong>Discount Moncler T-shirts</strong></a> with amazing discount price, you can trust us, it won't let you down.
Wholesale Moncler jackets - 2010-08-23 09:10:28
I was told by one of my goodfriend that <a href="http://www.jacket888.com"><strong>Wholesale Moncler jackets</strong></a>,and I have bought one <a href="http://www.jacket888.com/mens-moncler-c-4.html"><strong>Men's Moncler</strong></a> for myself,I like it very much.you can have a try.
www.Jacket888.com
Discount Moncler - 2010-08-24 06:37:44
Just get away you from the cold winter <a href="http://www.mymonclerstore.com"><strong>Discount Moncler jackets</strong></a>,<a href="http://www.mymonclerstore.com/moncler-tshirts-c-7.html"><strong>Discount Moncler T-shirts</strong></a> with amazing discount price,it won't let you down.
http://www.mymonclerstore.com
Cheap lv bags on sale - 2010-08-24 07:43:37
<a href=http://www.worldtophandbags.com>lv bags on sale</a>






D'gou - 2010-02-04 04:30:17
Interesting.
Curious, the formatting turned out not to be an issue?