4 December 2012 15 Comments

SQL Server Query to Tableau Data Extract LIKE A BOSS – Some more TDE API fun with Python & Tableau 8

Coming off the excitement of my last post about writing a simple bare-bones python usage of Tableau's brand-new Data Extracts API from version 8.0 - I figured that it was time to build on that. Let's take a step forward and get a little more complicated - AND a little more useful.I'm encapsulated in a proprietary binary data file OF EMOTION!

Oh, noes!

I'm encapsulated in a proprietary binary data file OF EMOTION!

Save me, Cringer!

Wait. This could be MORE useful then my last post?

You mean your business doesn't run on a billion rows of dummy shit data? Weird.

Anyways, this time we're going to extract data from SQL Server with a single SQL query and populate a new TDE file.

This script has 3 main features:

  • Can run any query you want (even 'select * from' queries)
  • You don't have to pre-define the field names
  • You don't have to pre-define the data types

I got you, shawty. I GOT YOU.

All of which we had to explicitly state in the last script (even as a 'proof of concept'), and as you can imagine - with a wide result set, defining every little thing and then inserting every little thing can be a huge pain in the ass (not to mention ugly). With my approach, it's literally plug and play - err, paste and play. You can even run queries like "select * from " it's not even necessary to use the field names (assuming they are all unique, that is - more on that later) - and yes, I mentioned that twice for EMPHASIS!

Ye gods! No scripting out my lame_ass_field_names? Huzzah!

Basically, the script works like this:

  • Connect to SQL Server
  • Execute your SQL query
  • Look at the field names and first row data
  • Have Python try to guess what data types they are (based on that first row)
  • Map those data types and field names to a TDE schema definition (TableDefinition())
  • Loop through the entire result set, calling each field by name and type so the correct insert method is used (setDouble, setInteger, setNull, setDateTime, setBoolean, etc)
  • Do a little headbanging as you watch the text fly by!
  • Close the file (write the TDE to disk)
  • Close the SQL connection
  • Wipe hands on pants

Booya. Done.

And you didn't have to do a damn thing except give it a server connection and a SQL query. Now you can open that fresh TDE file and have yourself a little data party. { Pants optional }

Requires: Python 2.7.X+, the DataExtracts module, and pymssql module (pre-compiled binaries here). Oh, and it would help to have a SQL Server and Tableau Desktop as well. ;)

Here it is! Cut, Paste, and give it a try, gosh darnint!

UPDATE: Added var (rowoutput = True / False) for turning off the debug row / column output - good for looking into errors but slowed the script down like 10x
# ryan robitaille 12/6/2012
# simple Tableau Data Extract creation from a single microsoft sql server sql statement

import dataextract as tde # saves some typing, cause i'm a lazy fucker
import os, time, pymssql # for file manipulation, script timing (not necc), database access!

###################### FOR YOUR PARAMETERS, SON! ######################
tdefilename = 'ufo_datas.tde'
sql = "select * from UFOdata.dbo.Sightings" # whatever
sqlserverhost = 'localhost'
sqlusername = 'sa'
sqlpassword = 'passy'
sqldatabase = 'UFOdata'
rowoutput = False # for DEBUGGING data errors / slows shit down 10X however
###################### FOR YOUR PARAMETERS, SON! ######################

dotsevery = 75

start_time = time.time() # simple timing for test purposes

mssql_db = pymssql.connect(host=sqlserverhost, user=sqlusername, password=sqlpassword, database=sqldatabase, as_dict=True) # as_dict very important
mssql_cursor = mssql_db.cursor()

print ' '
print '[ Note: Each . = ' +str(dotsevery)+ ' rows processed ]'

fieldnameslist = [] # define our empty list

#go through the first row to TRY to set fieldnames and datatypes
for row in mssql_cursor:
        itemz = len(row.keys())/2 # because the dict rowset includes BOTH number keys and fieldname keys
        for k in row.keys():
            fieldnameslist.append(str(k) + '|' + str(type(row[k])).replace("<type '","").replace("'>","").replace("<class '","").replace('NoneType','str').replace('uuid.UUID','str') )
        break # after the first row, we SHOULD have a decent idea of the datatypes
# ^ a bit inelegant, but it gets the job done

fieldnameslist.sort() # sort them out so the integer keys are first (we're gonna whack em)
del fieldnameslist[0:itemz] # remove first x amount of keys (should be all integers instead of dict literals)

try:  # Just in case the file exists already, we don't want to bomb out
    tdefile = tde.Extract(tdefilename) # in CWD
    os.system('del '+tdefilename)
    os.system('del DataExtract.log') #might as well erase this bitch too
    tdefile = tde.Extract(tdefilename)

# ok lets build the table definition in TDE with our list of names and types first
# replacing literals with TDE datatype integers, etc
tableDef = tde.TableDefinition() #create a new table def

if rowoutput == True:
    print '*** field names list ***' # debug
for t in fieldnameslist:
    fieldtype = t.split('|')[1]
    fieldname = t.split('|')[0]
    fieldtype = str(fieldtype).replace("str","15").replace("datetime.datetime","13").replace("int","7").replace("decimal.Decimal","10").replace("float","10").replace("uuid.UUID","15").replace("bool","11")
    if rowoutput == True:
        print fieldname + '  (looks like ' + t.split('|')[1] +', TDE datatype ' + fieldtype + ')'  # debug
        tableDef.addColumn(fieldname, int(fieldtype)) # if we pass a non-int to fieldtype, it'll fail
        tableDef.addColumn(fieldname, 15) # if we get a weird type we don't recognize, just make it a string
if rowoutput == True:
    print '***'
    time.sleep(5) # wait 5 seconds so you can actually read shit!

if rowoutput == True:
    # ok, lets print out the table def we just made, for shits and giggles
    print '################## TDE table definition created ######################'
    for c in range(0,tableDef.getColumnCount()):
        print 'Column: ' + str(tableDef.getColumnName(c)) + ' Type: ' + str(tableDef.getColumnType(c))
    time.sleep(5) # wait 5 seconds so you can actually read shit!

# ok lets add the new def as a table in the extract
tabletran = tdefile.addTable("Extract",tableDef)
# why table NEEDS to be called 'Extract' is beyond me

rowsinserted = 1 # we need to count stuff, dude! Robots start at 0, I START AT 1!

# ok, for each row in the result set, we iterate through all the fields and insert based on datatype
for row in mssql_cursor:
    if rowoutput == True:
        print '************** INSERTING ROW NUMBER: ' + str(rowsinserted) + '**************' # debug output
    else: # only print dot every 50 records
        if (rowsinserted%dotsevery) == 0:
            print '.',

    columnposition = 0
    newrow = tde.Row(tableDef)
    for t in fieldnameslist:
        fieldtype = t.split('|')[1]
        fieldname = t.split('|')[0]

        if rowoutput == True:
            print str(columnposition) + ' ' + fieldname + ':   ' + str(row[fieldname]) + ' (' + str(fieldtype).split('.')[0] + ')' # debug output
        if fieldtype == 'str':
            if row[fieldname] != None: # we don't want no None!
                newrow.setCharString(columnposition, str(row[fieldname]))
                newrow.setNull(columnposition) # ok, put that None here

        if fieldtype == 'int':
            if row[fieldname] != None:
                newrow.setInteger(columnposition, row[fieldname])

        if fieldtype == 'bool':
            if row[fieldname] != None:
                newrow.setBoolean(columnposition, row[fieldname])

        if fieldtype == 'decimal.Decimal':
            if row[fieldname] != None:
                newrow.setDouble(columnposition, row[fieldname])

        if fieldtype == 'datetime.datetime': # sexy datetime splitting
            if row[fieldname] != None:
                strippeddate = str(row[fieldname]).split('.')[0] # just in case we get microseconds (not all datetime uses them)
                timechunks = time.strptime(str(strippeddate), "%Y-%m-%d %H:%M:%S") # chunky style!
                newrow.setDateTime(columnposition, timechunks[0], timechunks[1], timechunks[2], timechunks[3], timechunks[4], timechunks[5], 0000)
        columnposition = columnposition + 1 # we gots to know what column number we're working on!
    tabletran.insert(newrow) # finally insert buffered row into TDE 'table'
    rowsinserted = rowsinserted + 1

# ok let's write out that file and get back to making dinner

# timing purposes for debugging / optimizing / FUN! This is FUN, Lars.
timetaken = time.time() - start_time
print str(rowsinserted) + ' rows inserted in ' + str(timetaken) + ' seconds'
print '    (' + str(rowsinserted/timetaken) + ' rows per second)'
# woo, let's have a drink!

I threw this together in a few hours, and will probably be adding to it, fixing it down the line. It has very little exception handling and is only around 130 lines; including my asinine comments and whitespace lines. Parts of it are a bit 'inelegant', lots of list building and string mangling - but it works. Quite well actually.

Love to get some feedback so I can make it better. I tested it on many different tables I had laying around, including that crazy MS AdventureWorks2008R2 Data Warehouse DB which uses a lot of strange data types and calculated fields. All worked splendidly.

Current Known Issues:

1) I'm only looking at the first row for data type guessing. I really should be taking a much larger sample of the data.

2) If you do a bunch of joins, don't explicitly define the field names in the query, and have duplicate field names in the result set - the file will miss some fields, or even bomb out completely.

i.e. - Don't do a:

select a.*, b.*, c.* from table1 a, table2 b, table3 c 
            where a.id = b.id and a.id = c.id

But you wouldn't do that fucking nonsense anyways - right? ;)

I'm also thinking of making a version of this with command line parameters instead of file editing.

i.e. Python TDE_from_mssql.py 
--sql="select * from bellybutton"

What's next for TDE file population posts? Web scraping, MySQL, Oracle, flat files? Might as well hit 'em all.

  • Martin

    Keep up the good work Ryan ! Your posts have already been very useful to me ! Best, Martin

    • http://ryrobes.com/ Ryan Robitaille

      Thanks Martin. Any requests, you know, since I’m going down the path of external datasources? : )

      • Martin

        I am sure massive web scraping with Beautiful Soup for instance will delight your audience ;)

  • srikanth sakhamuri

    This seems good, but I am having trouble finding the Tde file it generated….
    I have non to basic knowledge on Python

  • Jura

    Hi Ryan, if I have a bunch of joins, how do you suggest to work with it?
    Any suggestion?

    • Drew

      Any progress on the multiple table joins? I have a similar issue with multiple left joins to a fact table with 7 dim tables. May or may not have duplicate names in the resulting table.

  • Jura

    I try to run it on a oracle database and got a error “AttributeError: ‘tuple’ object has no attribute ‘keys’ ”

    On line 28 with :

    itemz = len(row.keys())/2 # because the dict rowset includes BOTH number keys and fieldname keys

    • nadia slaimia

      did you find an issu for this. i’m facing the same problem :( plez need help

  • Nat

    Hi Ryan,
    Fantastic script but I am new to python and I need to do the same thing but only I need to get a view from lotus notes and create tableau tde.Just wondering if you can give me some ideas. Many thanks Nat

  • CSeelig

    Nice post(s).

    I’m using the Extract API for a nightly refresh, and looking for a way to (programmatically) validate the success of a run. I trap for errors thrown by the API, but I’d like to get a row count, column summation over rows, that sort of thing. Thoughts? Is this possible?

    Thank you

  • Bojan

    Great! Does this mean that I can programmaticaly refresh my tableau extracts used for various reports ( created with Tableau Desktop 9.0 and viewed with Tableau Reader)?

  • Chris McClellan

    Is there an easy way to extract from TDE to CSV as well ?

  • http://batman-news.com Jostna

    I am working the similar way to db2.
    I am not knowng how to insert rows into created TDE. can you please help me out

  • ahuyvam

    Ryan, is it possible to convert TDE to MDB using this process or at all?

  • Sarah Chu

    Hi Ryan, great work! I use it to connect to oracle database and create tde file. It works for small dataset (<130000 records), when there are more records, the program will complete with no error, but the new tde file is un-usablable. I think it is related to the memory issue. Is there a way to flush the memory in python for certain records, or can we append more rows into the existing tde? Thank you!