Using a simple Python script for End-to-End Data Transformation and ETL (Part 1)
In the Data Transformation Services (DTS) / Extract Transform and Load (ETL) world these days we've got a LOT of expensive ass products. Some are good, some are marginal, and some are pieces of over-complicated (and poorly performing) java-based shit. But, hey, enough with the negativity - I digress, I just want to show you that its pretty easy to do the same types of data manipulations in pure easy-to-read python code. Why lock yourselves into an expensive proprietary ETL system that requires expensive consultants to maintain (badly)?
In this multi-part post I'm going to start with a simple data transport across data platforms (with minimal exception handling), and later mix in some fancier ETL functions and "robustness" (if that's even a word - I know I've heard Cheese-Nip Sales guys use it before). I guarantee its a hell of a lot faster and more reliable than any of the commercial products out there (I know, because I've use them all).
In my not-so-humble opinion, you should NEVER trade point-and-click usability and prettiness for rock-solid reliability and flexibility.
Note: I'm going to be doing this in a nice and easy step-by-step way - I'm not trying to insult anyone's intelligence, but I find that when people gloss over things because "everyone knows XYZ", it alienates people and makes shit downright confusing for others - so if you're a python smarty pants and already know all this, just skip that section and go buy yourself a cup of green tea (or whatever the hell you genius types drink these days). I want these posts to be a definitive guide, and the only way to do that is be thorough.
Ok, here is the task we are completing in this post - Writing a simple (non-interactive) Python script to pull data from an Oracle table and insert it into a SQL Server table (and/or another Oracle database table). You can run it by hand on the command line or schedule it and it will always work the same - all the while using very little overhead.
I spend a lot of time in an old school Oracle shop and you'd be shocked how many people simply cannot do things this without resorting to ridiculous (and idiotic) methods (writing PL/SQL out to a flat file on the server, reading flat file in as an external table, etc.) - well, enough of that madness. Think of Python as the "glue" to stick things together (WITHOUT ODBC or other shitty platform-dependent "connection technologies").
If you've ever done any programming in the past, you'll find that Python is a really easy and flexible language, which makes it very useful for doing just about any back-end task you'd want to accomplish - which (thankfully) includes grabbing extracts from virtually any database you can find a module for (and trust me, there is a lot) and then inserting that resultset into another database type.
I know I've defined the "task" above, but what my example script does specifically (in Part 1) is: Select some rows from an Oracle table (10g) put them into a list, and then use that list to insert them all into destination tables on Microsoft SQL Server (2005 / 2008) and another Oracle instance (after truncating the destination tables, of course). "Trunc-n-load!"
First we need set up the environment - for this example I'm going to be using a windows box (since we need to connect to a SQL Server instance), you could do it all with a linux machine too, but for this particular example it'll be on windows (Linux tutorials coming up in the future).
I'm using Python 2.6.2 which is the latest stable release in the 2.6 tree as of this writing. You can install it from the Python.org site using the MSI package if you want a nice and easy install.
Now we need 2 Python modules for all this shiz-nit to work properly.
cx_Oracle (for, you guessed it, connecting to Oracle) - They've even got a fancy MSI for that too. I'm impressed.
Pymssql (for connecting to Microsoft SQL Server instances) - Download the install file and run it. Its a lot easier than compiling the module from scratch (esp since doing that requires Microsoft Visual Studio 2008).
Once those 2 are installed successfully - lets make sure they work.
Open up a command line Python interpreter... (you know python.exe)
>>> import pymssql, cx_Oracle
You shouldn't get any errors or response of any kind, just a new '>>>' line below it. If so VIOLA! Its installed.
Now on to building the actual script.
I'm going to go through the code chunk by chunk and comment on each section as we go, just in case you're new to Python, but don't worry I'll have the entire file at the end for download (with comments included in the code).
## Define Oracle connection - format ("username/password@TNSNAME")
ora_conn = cx_Oracle.connect("xxxx/xxxx@XXXXX")
ora_cursor = ora_conn.cursor() #Allocate a cursor
## Define the MSSQL server connection - format is self-explanatory
mssql_conn = pymssql.connect(host='xxxx', user='xx',
password='xxx', database='XXX')
mssql_cursor = mssql_conn.cursor() #Allocate a cursor
Ok, first we've got our modules from above, and we're using 'string' just for some output stuff that we're doing later (just to check if its working) (update: I didn't realize that I didn't use the string module when I edited the script, I'll still leave it in though - since we'll need it in later more complicated posts). Then we're defining the connection string and then defining a cursor that uses each connection - so we're ready for hot SQL action.
ora_cursor.execute("truncate table test")
mssql_cursor.execute("truncate table test")
Truncate our destination tables (no mystery there). Notes: In this example I'm writing data to BOTH an Oracle table and a MS SQL table just to show you the different type of "insert many" that we have to use.
ora_cursor.execute("""SELECT DBMS_RANDOM.STRING('P',40) field1,
DBMS_RANDOM.STRING('X',30) field2,
ROUND(DBMS_RANDOM.VALUE(1000, 9999)) field3,
DBMS_RANDOM.STRING('A',20) field4
FROM DUAL CONNECT BY LEVEL<=3000""")
Ok, here is a simple select query into Oracle - I'm fetching some source rows to play with. This is just using some Oracle functions that generate random (crap) data - just for the sake of example.
I find it helpful in testing sometimes because using DBMS_RANDOM.STRING('P',40) generates all kinds of characters that tend to break crappy apps and webapps due to un-escaped characters.
ResultSet_Py_List = [] #Create an empty list
for field1, field2, field3, field4 in ora_cursor:
try: #Then populate it with the cursor results
ResultSet_Py_List.append((field1, field2, field3, field4))
except AttributeError:
pass
Basically, here I create an empty list, called 'ResultSet_Py_List', and then iterate through all the fields and rows in the cursor (which we filled up with data earlier) and add them to it.
Why? Well, first of all its easier to handle than the raw cursor object which can get tricky (esp if we're going to access the data twice), and secondly because if we're going to do any data transformation later, it'll be a fun party trick stunt as opposed to tragic.
When the cursor is empty, we have to do an except: pass so that the interpreter doesn't freak out.
print str(len(ResultSet_Py_List)) + ' Records from Source'
This is pretty simple. I'm doing a count of all the rows in our list, just to make sure that its all loaded and ready to go. When I write command line apps (even simple ones), I like to throw in logging and verbose stuff like this so I can better troubleshoot later. In this case we should see 3,000 rows in that baby.
ora_cursor.prepare("""INSERT INTO test(field1,field2,field3,field4)
VALUES (:field1, :field2, :field3, :field4)""")
ora_cursor.executemany(None, ResultSet_Py_List)
ora_conn.commit() #COMMIT that shit before it gets away!
Ok, time for the INSERT into our Oracle DB table. You'll see that we need to run a prepare with our parametrized query - then we can run an executemany with the list to kick it off. Then, bingo - commit it.
mssql_cursor.executemany("""INSERT INTO
test (field1, field2, field3, field4)
VALUES (%s, %s, %s, %s)""", ResultSet_Py_List)
mssql_conn.commit()
Now its time for the MSSQL insertion, as you can see its a little different. Then we commit, and as they say - "Bob's your Uncle".
ora_row_count = ora_cursor.fetchone()
print str(ora_row_count[0]) + ' Records INSERTED into Oracle table'
mssql_cursor.execute("select count(*) from test")
mssql_row_count = mssql_cursor.fetchone()
print str(mssql_row_count[0]) + ' Records INSERTED into MSSQL table'
## All done, Lars. Now lets get a drink.
## We will even be nice and close the connections.
## Its like tipping your hat to the Database engine.
mssql_conn.close()
ora_conn.close()
All I'm doing here is counting the rows in each destination table and printing them on the screen, just so I'm SURE that everything went where it was supposed to be.
Now this was exceeding simplistic on purpose. I was people to be able to cut-n-paste properly and modify for their own liking. In the next post in the series, its going to get a bit more complicated, but this script is the "base" we're going to build on for our Python-based ETL empire. So when people ask you what "ETL Tool" you use, you can say.."Umm, a text editor, bitch!"
Here is the entire commented script file for downloading (who wants to cut and paste over and over again from that shit above, I wouldn't)!
Pingback: Using XLWT and Python to export an Oracle dataset to Excel (Python Simple ETL Part 2) | Ryan Robitaille()
Pingback: Business Intelligence, Tools, Dirty Caveman Sex, Open-Source - Part 1()
Pingback: How to run your Python scripts as a Windows Service | Ryan Robitaille()