8 December 2009 5 Comments

Using XLWT and Python to export an Oracle dataset to Excel (Python Simple ETL Part 2)

A few months ago, I showed how you can do simple back and forth ETL task with python scripts - this time I'm going to take it one step further and show how with an extra python module we can actually export data to a usable Excel format as well. I often use this method for automated data extracts AS WELL AS doing full blown Excel reports (yes, that's right - with formula calculations, formatting and everything - its pretty sweet - that'll come later though).

I always say that "ease of use" and "flexibility" are at opposite ends of the spectrum - when one goes up, the other goes down. WYSIWYG applications have their place (I suppose), I prefer coding by hand in a text editor anyday.

Anyways, today we're going to make a simple Excel data export with pythons xlwt module (which is a more updated fork of the old pyExcelerator module). I'll try and comment well so you can cut-n-paste to your hearts content. As well all know, there's nothing better than Googling for a solution and finding one that is 90% what you need and being able to just hack that last 10% in - saves time and sanity. Some purists would argue that its wrong... maybe, but there is a time and place for everything.

Anyways, here we go - as always, you can go step-by-step (like the Full House theme song) with me through the code, or skip to the bottom and just download the whole file. Its small, but at least this saves you from having to do a ton of Ctrl-C and Ctrl-V...

Ok, Lets get (motherfuckin') steppin'... (pardon me - its a live Metallica reference, can't help it, I'm a product of the 80s and 90s whether I like it or not)

# 12/7/2009 - Ryan Robitaille [ryrobes.com]
# http://ryrobes.com/featured-articles/using-xlwt-and-python-to-export-an-oracle-dataset-to-excel-python-simple-etl-part-2/

import cx_Oracle, time, string
from xlwt import *  #formerly "pyExcelerator"

Standard stuff - just loading some modules that we're going to need. Time and String are part of the standard Python distriution (I'm using 2.6 in this example), cx_Oracle I already discussed HERE, and xlwt can be found HERE (http://pypi.python.org/pypi/xlwt) grab the packages for your platform of choice and go nuts.

# set oracle login variables
OraUid="scott"                  #Oracle User  
OraPwd="tiger"              #Oracle password
OraService="TNS_EETFUK"      #Oracle Service name From Tnsnames.ora file

# do a timestamp for being able to track execution time (if you want)
startscript = time.time()  # we will use this later
db = cx_Oracle.connect(OraUid + "/" + OraPwd + "@" + OraService)    #Connect to database
dev_cursor_select = db.cursor()                                             #Allocate a cursor

dev_cursor_select.execute("""SELECT DBMS_RANDOM.STRING('P',40) field1,
        DBMS_RANDOM.STRING('X',30) field2, ROUND(DBMS_RANDOM.VALUE(1000, 9999)) field3,
        DBMS_RANDOM.STRING('A',20) field4  FROM DUAL CONNECT BY LEVEL<=3000"""
# 3,000 rows of random garbage seems good

result_set = dev_cursor_select.fetchall()

There is a bit more going on here, but nothing really complex - setting up our Oracle connection, timestamping the START of the script, creating our Oracle cursor, defining our SQL query (just some random Oracle generated nonsense for the sake of this demo using the awesome and under-used 'DBMS_RANDOM' functions), and getting that result set into a python object.

# Start some Excel magic
wb = Workbook()
ws0 = wb.add_sheet('My New Worksheet')

# Grey background for the header row
BkgPat = Pattern()
BkgPat.pattern = Pattern.SOLID_PATTERN
BkgPat.pattern_fore_colour = 22

# Bold Fonts for the header row
font = Font()
font.name = 'Calibri'
font.bold = True

# Non-Bold fonts for the body
font0 = Font()
font0.name = 'Calibri'
font0.bold = False

# style and write field labels
style = XFStyle()
style.font = font
style.pattern = BkgPat

style0 = XFStyle()
style0.font = font0

Here we start some Excel formatting. Formatting is optional, but I figured that I'd throw in a little bit for some flavor. I can't stand when older Excel uses that horrible "OCR A" font as the default. So what I'm doing is defining "styles" that can applied to each cell as we write data out to them. I'm using the same font throughout (in this case Calibri), but I'm making the header row bold and having a grey cell background.

Confused? Don't worry - It will make a lot more sense in a moment.


for row in result_set:
    for item in row: #i.e. for each field in that row
        ws0.write(row_number,column_num,str(item),style0)  #write excel cell from the cursor at row 1
        column_num=column_num+1  #increment the column to get the next field

    row_number=row_number+1 #increment the row number so the next row goes below it...

Here we are looping through the result set. I'm starting the rows at "row 1", which is actually row number 2 in Excel - we need the space in row number 1 in order to put our header row at. Otherwise we would start the output on row_number 0.

For each row in out result set, we have to write out all the fields in their individual columns. It sounds like a lot of work, but Python makes it easy. Just remember to increment your column numbers out and then reset them to 0 (aka literal column 1) when you start the next row.

If we wanted to get a little more tricky and "automatically" set the column width too. That takes a bit of finesse though, since not all field string lengths are created equally...

col_width_dict = dict() # create a dictionary var

for i in range(4): # fill it up with 0s first so Python doesn't complain
    col_width_dict[i] = 0


for row in result_set:
    for item in row:
        # write the excel row from the cursor - starting at row 1 (literal row 2)
        if len(str(item)) > col_width_dict[column_num]:
            # only redefine the column width if we need it to be bigger
            col_width_dict[column_num] = len(str(item))
        ws0.col(column_num).width = len(str(item))*256
        # set the width of the column depending on incoming string


Still with me? Good. (if not, let me know in the comments and I'll try and clarify things more).

ws0.write(0,0,'Field 1',style)
ws0.write(0,1,'Field 2',style)
ws0.write(0,2,'Field 3',style)
ws0.write(0,3,'Field 4',style)


endscript = time.time()
endtime = endscript - startscript
print 'script run in ' + str(endscript - startscript) + ' seconds or ' + str((endscript - startscript)/60) + ' minutes'

Here we are writing the header row (with our fancy "style" attribute), we are explicitly writing it on the first column (the one we skipped earlier when writing out the data).

Then we save our excel file as 'sample_output.xls' or 'angry_beavers.xls' or 'chiminny_changa.xls', whatever floats your boat. Just for shits and giggles, I'm taking the start timestamp and and the end timestamp and showing how long the script too to execute, but its pretty much useless in this case and olney serves to have some (not so) interesting output for me to look at.

Is anyone confused, angry, irate, excited? Let me know in the comments!

Here is the entire commented script file for downloading.