29 July 2009

Using a simple Python script for End-to-End Data Transformation and ETL (Part 1)

In the Data Transformation Services (DTS) / Extract, Transform, and Load (ETL) world these days we've got a LOT of expensive-ass products. Some are good, some are marginal, and some are pieces of over-complicated (and poorly performing) Java-based shit. But, hey, enough with the negativity - I digress. I just want to show you that it's pretty easy to do the same types of data manipulations in pure, easy-to-read Python code. Why lock yourselves into an expensive proprietary ETL system that requires expensive consultants to maintain (badly)?

In this multi-part post I'm going to start with a simple data transport across data platforms (with minimal exception handling), and later mix in some fancier ETL functions and "robustness" (if that's even a word - I know I've heard Cheese-Nip Sales guys use it before). I guarantee it's a hell of a lot faster and more reliable than any of the commercial products out there (I know, because I've used them all).

In my not-so-humble opinion, you should NEVER trade rock-solid reliability and flexibility for point-and-click usability and prettiness.

Note: I'm going to be doing this in a nice and easy step-by-step way - I'm not trying to insult anyone's intelligence, but I find that when people gloss over things because "everyone knows XYZ", it alienates people and makes shit downright confusing for others. So if you're a Python smarty-pants and already know all this, just skip that section and go buy yourself a cup of green tea (or whatever the hell you genius types drink these days). I want these posts to be a definitive guide, and the only way to do that is to be thorough.

Ok, here is the task we're completing in this post: writing a simple (non-interactive) Python script to pull data from an Oracle table and insert it into a SQL Server table (and/or another Oracle database table). You can run it by hand on the command line or schedule it, and it will work the same way every time - all while using very little overhead.

I spend a lot of time in an old-school Oracle shop, and you'd be shocked how many people simply cannot do things like this without resorting to ridiculous (and idiotic) methods (writing PL/SQL output to a flat file on the server, reading the flat file back in as an external table, etc.) - well, enough of that madness. Think of Python as the "glue" to stick things together (WITHOUT ODBC or other shitty platform-dependent "connection technologies").

If you've ever done any programming in the past, you'll find that Python is a really easy and flexible language, which makes it very useful for doing just about any back-end task you'd want to accomplish - which (thankfully) includes grabbing extracts from virtually any database you can find a module for (and trust me, there are a lot) and then inserting that resultset into another database type.

I know I've defined the "task" above, but here's what my example script does specifically (in Part 1): select some rows from an Oracle table (10g), put them into a list, and then use that list to insert them all into destination tables on Microsoft SQL Server (2005/2008) and another Oracle instance (after truncating the destination tables, of course). "Trunc-n-load!"

First we need to set up the environment. For this example I'm going to be using a Windows box (since we need to connect to a SQL Server instance); you could do it all with a Linux machine too, but for this particular example it'll be on Windows (Linux tutorials coming up in the future).

I'm using Python 2.6.2, which is the latest stable release in the 2.6 tree as of this writing. You can install it from the Python.org site using the MSI package if you want a nice and easy install.

Now we need 2 Python modules for all this shiz-nit to work properly.

cx_Oracle (for, you guessed it, connecting to Oracle) - They've even got a fancy MSI for that too. I'm impressed.

pymssql (for connecting to Microsoft SQL Server instances) - Download the install file and run it. It's a lot easier than compiling the module from scratch (especially since doing that requires Microsoft Visual Studio 2008).

Once those 2 are installed successfully - let's make sure they work.

Open up a command-line Python interpreter... (you know, python.exe)

>>> import pymssql, cx_Oracle

You shouldn't get any errors or output of any kind, just a new '>>>' prompt below it. If so, VOILA! It's installed.
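If you want a little extra reassurance, both modules expose version info you can print. A quick sketch (the attribute names below match the builds I have handy, so check yours if they differ):

>>> print cx_Oracle.version       # version string of the cx_Oracle build
>>> print pymssql.__version__     # version string of the pymssql build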

Now on to building the actual script.

I'm going to go through the code chunk by chunk and comment on each section as we go, just in case you're new to Python, but don't worry I'll have the entire file at the end for download (with comments included in the code).

import string, pymssql, cx_Oracle

## Define Oracle connection - format ("username/password@TNSNAME")
ora_conn = cx_Oracle.connect("xxxx/xxxx@XXXXX")
ora_cursor = ora_conn.cursor()  #Allocate a cursor

## Define the MSSQL server connection - format is self-explanatory
mssql_conn = pymssql.connect(host='xxxx', user='xx',
                             password='xxx', database='XXX')
mssql_cursor = mssql_conn.cursor()  #Allocate a cursor

Ok, first we've got our module imports from above. We're importing 'string' for some output work later. (Update: turns out this edited version of the script doesn't actually use the string module - I'll leave the import in, though, since we'll need it in later, more complicated posts.) Then we define a connection to each database and allocate a cursor on each one - so we're ready for hot SQL action.
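Side note: if you don't have a TNSNAMES entry set up for the source database, cx_Oracle can also build the DSN on the fly. A minimal sketch - the host, port, and SID here are placeholders, obviously:

## Build a DSN instead of relying on a tnsnames.ora entry
dsn = cx_Oracle.makedsn('dbhost.example.com', 1521, 'ORCL')
ora_conn = cx_Oracle.connect('xxxx', 'xxxx', dsn)
ora_cursor = ora_conn.cursor()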

## Truncate our destination tables
ora_cursor.execute("truncate table test")
mssql_cursor.execute("truncate table test")

Truncate our destination tables (no mystery there). Note: in this example I'm writing data to BOTH an Oracle table and an MS SQL table, just to show you the two different flavors of "insert many" we have to use.
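One housekeeping note: the script assumes a 'test' table already exists on both sides. If you're playing along at home, a one-time setup like this will do - the column types and sizes are my guess, matched to the random data we generate below, so adjust to taste:

## One-time setup - create the destination tables (hypothetical types/sizes)
ora_cursor.execute("""CREATE TABLE test (field1 VARCHAR2(40),
    field2 VARCHAR2(30), field3 NUMBER(4), field4 VARCHAR2(20))""")
mssql_cursor.execute("""CREATE TABLE test (field1 VARCHAR(40),
    field2 VARCHAR(30), field3 INT, field4 VARCHAR(20))""")
mssql_conn.commit()  ## SQL Server DDL needs a commit (Oracle's auto-commits)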

## Grab our "Source Data"
ora_cursor.execute("""SELECT DBMS_RANDOM.STRING('P',40) field1,
                DBMS_RANDOM.STRING('X',30) field2,
                ROUND(DBMS_RANDOM.VALUE(1000, 9999)) field3,
                DBMS_RANDOM.STRING('A',20) field4
            FROM DUAL CONNECT BY LEVEL<=3000"""
)

Ok, here is a simple SELECT against Oracle - I'm fetching some source rows to play with. It just uses some Oracle functions to generate random (crap) data, purely for the sake of example.

I find it helpful in testing sometimes because using DBMS_RANDOM.STRING('P',40) generates all kinds of characters that tend to break crappy apps and webapps due to un-escaped characters.

## Transfer our cursor rows into a python list object
ResultSet_Py_List = []  #Create an empty list
for field1, field2, field3, field4 in ora_cursor:
    try:   #Then populate it with the cursor results
        ResultSet_Py_List.append((field1, field2, field3, field4))
    except AttributeError:
        pass

Basically, here I create an empty list called 'ResultSet_Py_List', then iterate through all the rows and fields in the cursor (which we filled up with data earlier) and append them to it.

Why? Well, first of all it's easier to handle than the raw cursor object, which can get tricky (especially if we're going to access the data twice), and secondly because if we're going to do any data transformation later, working off a list makes it a fun party trick instead of a tragedy.

Truth be told, the try/except isn't strictly necessary here - iterating an empty cursor just yields zero rows without complaint - but it's cheap insurance, and it keeps the pattern in place for when we bolt real transformation logic (which CAN throw) onto this loop in later posts.
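For the record, the DB-API gives you a couple of shortcuts if you don't want the loop at all. A quick sketch of the alternatives (not what the script above uses, but handy to know):

## One-shot version - fetchall() already returns a list of tuples
ResultSet_Py_List = ora_cursor.fetchall()

## Or, for big tables, pull rows down in batches to go easy on memory
ResultSet_Py_List = []
while True:
    batch = ora_cursor.fetchmany(500)  ## 500 rows at a time (arbitrary)
    if not batch:
        break
    ResultSet_Py_List.extend(batch)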

## Quick count to see how many rows we're packin'!
print str(len(ResultSet_Py_List)) + ' Records from Source'

This is pretty simple: I'm counting all the rows in our list, just to make sure it's all loaded and ready to go. When I write command-line apps (even simple ones), I like to throw in logging and verbose stuff like this so I can troubleshoot later. In this case we should see 3,000 rows in that baby.
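If you want a step up from bare print statements, the standard logging module will timestamp everything for you. A minimal sketch (the log file name is just an example):

import logging

## Log to a file with timestamps, INFO level and up
logging.basicConfig(filename='etl_run.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
logging.info('%d records pulled from source', len(ResultSet_Py_List))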

## Insert the list contents into Oracle, row by row...
ora_cursor.prepare("""INSERT INTO test(field1,field2,field3,field4)
VALUES (:field1, :field2, :field3, :field4)"""
)
ora_cursor.executemany(None, ResultSet_Py_List)
ora_conn.commit() #COMMIT that shit before it gets away!

Ok, time for the INSERT into our Oracle DB table. You'll see that we need to run a prepare with our parameterized query (those :field bind variables) - then we can run an executemany with the list to kick it off. Then, bingo - commit it.
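I did promise "minimal exception handling" this time around, but if you want a little insurance on that insert, wrapping it goes something like this (a sketch - cx_Oracle.DatabaseError is the driver's catch-all error class):

try:
    ora_cursor.executemany(None, ResultSet_Py_List)
    ora_conn.commit()  ## Only commit if every row made it in
except cx_Oracle.DatabaseError, e:
    ora_conn.rollback()  ## Undo the partial insert
    print 'Oracle INSERT failed: ' + str(e)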

## Insert the list contents into Microsoft SQL table
mssql_cursor.executemany("""INSERT INTO
    test (field1, field2, field3, field4)
    VALUES (%s, %s, %s, %s)""",
    ResultSet_Py_List)
mssql_conn.commit()

Now it's time for the MSSQL insertion, and as you can see it's a little different: pymssql has no separate prepare step - you use %s-style placeholders and hand the parameter list straight to executemany. Then we commit, and as they say - "Bob's your Uncle".

ora_cursor.execute("select count(*) from test")
ora_row_count = ora_cursor.fetchone()
print str(ora_row_count[0]) + ' Records INSERTED into Oracle table'

mssql_cursor.execute("select count(*) from test")
mssql_row_count = mssql_cursor.fetchone()
print str(mssql_row_count[0]) + ' Records INSERTED into MSSQL table'

## All done, Lars. Now let's get a drink.

## We will even be nice and close the connections.
##       It's like tipping your hat to the Database engine.
mssql_conn.close()
ora_conn.close()

All I'm doing here is counting the rows in each destination table and printing the counts on the screen, just so I'm SURE that everything went where it was supposed to go.
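As an aside, the DB-API also exposes cursor.rowcount, which would save those round-trip SELECT COUNT(*) queries - driver support varies a bit, which is why I stay paranoid above, but a sketch:

## Cheaper check - put these right after each executemany call
print str(ora_cursor.rowcount) + ' rows INSERTED (per cx_Oracle)'
print str(mssql_cursor.rowcount) + ' rows INSERTED (per pymssql)'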

Now, this was exceedingly simplistic on purpose - I want people to be able to cut-n-paste it and modify it to their own liking. In the next post in the series, it's going to get a bit more complicated, but this script is the "base" we're going to build on for our Python-based ETL empire. So when people ask you what "ETL Tool" you use, you can say.."Umm, a text editor, bitch!"

Here is the entire commented script file for download (who wants to cut and paste over and over again from that shit above? I wouldn't)!

  • speno

    I think it’s nice that you are using Python.

    You probably don’t need to be importing the string module since you aren’t using it. While it may contain a few helpful constants, you’d always want to use string methods instead, unless you were stuck with some really old version of Python. Then again, you’re not using any string methods in this script either. :-)

    You may also want to look into the cursor.rowcount attribute.

    I’m happy that I never heard of the term ETL before reading this.

    Take care.

    • Ry

      Hah, Thanks. I actually had a lot of modules in earlier (READ: really messy) versions of this particular script, and left ‘string’ in for some romantic reason. :)

      cursor.rowcount would be easier than the fetchone and var[0] printing – I might have to change that, since I’m trying to make it as SIMPLE as possible. I guess my brain is hardwired to do a lot of things the “hard way”.

      Thanks again, Speno!

  • Tony Schmidt

    Thanks for the post, Ry.

    I’ve been getting all worked up about data warehousing and ETL for my project, where I’ve gotta do stuff like build an interface for creating new order and order_item entities that relate to members (MS Access), employees (MySQL) and products (flat file).

    People have been telling me I need to read up on Inmon and Kimball and understand stuff like star schemas, data marts, EAV tables and so on.

    Do you think all I need to do is write a bunch of python ETL scripts with event triggers (like “create new order” updates all the related tables in the warehouse)?

    And why not bypass the ETL layer entirely and just perform cross-vendor joins on the application level?

    Thanks in advance for any tips.

    • Ry

      Hey Tony!

I guess it depends on what you’re trying to do exactly. If your data is that spread out all over the place, it probably should all run from a main location (all in the same DB) – and then be pushed back out to the old locations on a scheduled basis (if they even need to exist anymore).

I’ve gone down the rabbit hole on building data warehouses (and datamarts) and data cubes to report on, and all that fancy “best practice” stuff… but I honestly think that for a lot of implementations it’s total overkill. That’s just my humble opinion.

You could look at a more “Access friendly” ETL tool like SQL Server SSIS, but it costs a pretty penny. If you can find a desktop version of SQL Server 2000 somewhere, download and use that – back then they included the whole ETL tool in that version, and I’ve been using it like a “data swiss army knife” ever since.

      Let me know if I understand the problem right. :)

      • Tony Schmidt

        Hi, Ry. Thanks for your reply.

        Yeah, I think you’ve got the gist of my problem.

        So, I had been thinking about using an ETL tool (such as Talend) – but some Python people had discouraged me from that, saying that ETL tools were for people who didn’t know how to program, they came with their own product-specific learning curve, and they weren’t necessarily more rapid than just writing your own scripts at the end of the day.

So what do you think of using an ETL tool vs. writing your own scripts (like you demo in this article)?

Currently I’m leaning toward building what I’m calling a “pseudo-datawarehouse”, forgetting about Inmon and Kimball, and not looking for a cross-vendor join solution.

  • Pingback: Using XLWT and Python to export an Oracle dataset to Excel (Python Simple ETL Part 2) | Ryan Robitaille

  • Tim Day

Thanks a lot for the post. I used pyodbc for connecting to an MS Access DB and dragging things into Oracle. I really enjoyed not having to touch Access. I wonder if there is a way to get the table and column names so that we can also build the required target tables automatically?

  • Jorge

    I know this is an old post.

I come from the other end of the spectrum as a Report Developer – and I found creating reports and writing SQL off of a set of tables following Kimball’s methods (star schema) to be a positive experience!

What’s great about it is that it’s a universal structure. When I encounter a star schema, I instantly know what tables I need to join and can hit the ground running – whereas with a table structure someone has invented, I always have to go to the inventor and ask how the tables are joined, what the granularity is, and whatnot.

I also found working with a cube to be a positive thing. Writing a report that’s crunching and aggregating year-over-year sales data coming from 2,500-plus stores is extremely fast (milliseconds) compared to the same SQL.

    • Ry

      Hey Jorge, thanks for the comments!

      I guess I’ve just been spoiled over the past few years since I’d create the datasource AND do the report afterward. :)

      I’ve also def warmed up to the traditional DW methods for a few larger projects recently. I guess sometimes we just get so used to doing everything “one way”, new (well, old) concepts seem like more effort than payoff.

      Also, I LOVE how I can deploy some well-thought-out Analysis Service Cubes to my end-users, then they can manipulate them as pivot tables in Excel on their own. It saves me the trouble of writing several specific reports off the same data… (you know, the “classic” way)

      :)

      Cheers!

  • Pingback: Business Intelligence, Tools, Dirty Caveman Sex, Open-Source - Part 1

  • Pingback: How to run your Python scripts as a Windows Service | Ryan Robitaille

  • Pranav (http://pranav.me)

    Nicely written post. Have subscribed using Google Reader for RSS.

  • Joel

Say, I am quite intrigued by this sample – I am new to Python and am a GIS analyst. I have been tasked with running a GIS process that applies spatial attributes – then I have to pull the results from a SQLEXPRESS database and write to a waiting table on another SQL Server. I am at my wits’ end: I have implemented your code and I know I am retrieving data, but I keep getting an error stating that it is looking for a tuple or a dictionary. Any help would be greatly appreciated.

    Thanks,

    Joel Hanson

    • Ry

      Hey Joel, sorry for the delay. Still having issues?

      • Andy B

        I am having a similar problem to Joel. I have implemented similar code, but am receiving the error “ValueError: expected simple type, a tuple or a list.” Looking at the local variables from my debug session, I am providing a list of tuples, so not sure what I am doing wrong.

        Any help appreciated