4 December 2012

SQL Server Query to Tableau Data Extract LIKE A BOSS – Some more TDE API fun with Python & Tableau 8

Coming off the excitement of my last post about writing a simple bare-bones Python usage of Tableau's brand-new Data Extract API from version 8.0 - I figured that it was time to build on that. Let's take a step forward and get a little more complicated - AND a little more useful.

Oh, noes!

I'm encapsulated in a proprietary binary data file OF EMOTION!

Save me, Cringer!

Wait. This could be MORE useful than my last post?

You mean your business doesn't run on a billion rows of dummy shit data? Weird.

Anyways, this time we're going to extract data from SQL Server with a single SQL query and populate a new TDE file.

This script has 3 main features:

  • Can run any query you want (even 'select * from' queries)
  • You don't have to pre-define the field names
  • You don't have to pre-define the data types

I got you, shawty. I GOT YOU.

All of which we had to explicitly state in the last script (even as a 'proof of concept') - and as you can imagine, with a wide result set, defining every little thing and then inserting every little thing can be a huge pain in the ass (not to mention ugly). With my approach, it's literally plug and play - err, paste and play. You can even run 'select * from' queries without ever typing out the field names (assuming they're all unique, that is - more on that later). And yes, I mentioned that twice for EMPHASIS!

Ye gods! No scripting out my lame_ass_field_names? Huzzah!

Basically, the script works like this:

  • Connect to SQL Server
  • Execute your SQL query
  • Look at the field names and first row data
  • Have Python try to guess what data types they are (based on that first row)
  • Map those data types and field names to a TDE schema definition (TableDefinition()) - see the mapping sketch just after this list
  • Loop through the entire result set, calling each field by name and type so the correct insert method is used (setDouble, setInteger, setNull, setDateTime, setBoolean, etc)
  • Do a little headbanging as you watch the text fly by!
  • Close the file (write the TDE to disk)
  • Close the SQL connection
  • Wipe hands on pants
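
Speaking of that type mapping - the script below does it with a big chain of string replaces, which works but ain't pretty. A cleaner way to express the same idea is a plain dict lookup (just a sketch of the concept - a hypothetical guess_tde_type() helper, not what the script literally runs):

def guess_tde_type(typename):
    # map a python type name (as guessed from the first row) to a TDE type code
    type_map = {
        'str': 15,               # CHAR_STRING
        'int': 7,                # INTEGER
        'long': 7,               # bigints show up as python longs
        'bool': 11,              # BOOLEAN
        'float': 10,             # DOUBLE
        'decimal.Decimal': 10,   # DOUBLE
        'datetime.datetime': 13, # DATETIME
    }
    return type_map.get(typename, 15) # anything weird just becomes a string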

Booya. Done.

And you didn't have to do a damn thing except give it a server connection and a SQL query. Now you can open that fresh TDE file and have yourself a little data party. { Pants optional }

Requires: Python 2.7.X+, the dataextract module, and the pymssql module (pre-compiled binaries here). Oh, and it would help to have a SQL Server and Tableau Desktop as well. ;)

Here it is! Cut, paste, and give it a try, gosh darnit!

UPDATE: Added a var (rowoutput = True / False) for turning off the debug row / column output - good for looking into errors, but it slows the script down like 10x
# ryan robitaille 12/6/2012
# simple Tableau Data Extract creation from a single microsoft sql server sql statement

import dataextract as tde # saves some typing, cause i'm a lazy fucker
import os, time, pymssql # for file manipulation, script timing (not necc), database access!

###################### FOR YOUR PARAMETERS, SON! ######################
tdefilename = 'ufo_datas.tde'
sql = "select * from UFOdata.dbo.Sightings" # whatever
sqlserverhost = 'localhost'
sqlusername = 'sa'
sqlpassword = 'passy'
sqldatabase = 'UFOdata'
rowoutput = False # for DEBUGGING data errors / slows shit down 10X however
###################### FOR YOUR PARAMETERS, SON! ######################

dotsevery = 75

start_time = time.time() # simple timing for test purposes

mssql_db = pymssql.connect(host=sqlserverhost, user=sqlusername, password=sqlpassword, database=sqldatabase, as_dict=True) # as_dict very important
mssql_cursor = mssql_db.cursor()
mssql_cursor.execute(sql)

print ' '
print '[ Note: Each . = ' +str(dotsevery)+ ' rows processed ]'

fieldnameslist = [] # define our empty list

#go through the first row to TRY to set fieldnames and datatypes
for row in mssql_cursor:
        itemz = len(row.keys())/2 # because the dict rowset includes BOTH number keys and fieldname keys
        for k in row.keys():
            fieldnameslist.append(str(k) + '|' + str(type(row[k])).replace("<type '","").replace("'>","").replace("<class '","").replace('NoneType','str').replace('uuid.UUID','str') )
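            # each list entry ends up like 'somefield|datetime.datetime' - fieldname and guessed type, pipe-delimited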
        break # after the first row, we SHOULD have a decent idea of the datatypes
# ^ a bit inelegant, but it gets the job done

fieldnameslist.sort() # sort them out so the integer keys are first (we're gonna whack em)
del fieldnameslist[0:itemz] # remove first x amount of keys (should be all integers instead of dict literals)


try:  # Just in case the file exists already, we don't want to bomb out
    tdefile = tde.Extract(tdefilename) # in CWD
except:
    os.system('del '+tdefilename) # windows 'del' - swap in 'rm -f' if you're on mac/linux
    os.system('del DataExtract.log') # might as well erase this bitch too
    tdefile = tde.Extract(tdefilename)

# ok lets build the table definition in TDE with our list of names and types first
# replacing literals with TDE datatype integers, etc
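# type codes per the Extract API: 7 = INTEGER, 10 = DOUBLE, 11 = BOOLEAN, 13 = DATETIME, 15 = CHAR_STRING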
tableDef = tde.TableDefinition() #create a new table def

if rowoutput == True:
    print '*** field names list ***' # debug
for t in fieldnameslist:
    fieldtype = t.split('|')[1]
    fieldname = t.split('|')[0]
    fieldtype = str(fieldtype).replace("str","15").replace("datetime.datetime","13").replace("int","7").replace("long","7").replace("decimal.Decimal","10").replace("float","10").replace("uuid.UUID","15").replace("bool","11") # python type name -> TDE type code (added 'long' so bigints don't fall through)
    if rowoutput == True:
        print fieldname + '  (looks like ' + t.split('|')[1] +', TDE datatype ' + fieldtype + ')'  # debug
    try:
        tableDef.addColumn(fieldname, int(fieldtype)) # if we pass a non-int to fieldtype, it'll fail
    except:
        tableDef.addColumn(fieldname, 15) # if we get a weird type we don't recognize, just make it a string
if rowoutput == True:
    print '***'
    time.sleep(5) # wait 5 seconds so you can actually read shit!

if rowoutput == True:
    # ok, lets print out the table def we just made, for shits and giggles
    print '################## TDE table definition created ######################'
    for c in range(0,tableDef.getColumnCount()):
        print 'Column: ' + str(tableDef.getColumnName(c)) + ' Type: ' + str(tableDef.getColumnType(c))
    time.sleep(5) # wait 5 seconds so you can actually read shit!

# ok lets add the new def as a table in the extract
tabletran = tdefile.addTable("Extract",tableDef)
# the TDE format only supports a single table, and it has to be named 'Extract' - them's the rules

rowsinserted = 0 # we need to count stuff, dude! (bumped at the top of the loop, so the final tally isn't off by one)

# ok, for each row in the result set, we iterate through all the fields and insert based on datatype
for row in mssql_cursor:
    rowsinserted = rowsinserted + 1
    if rowoutput == True:
        print '************** INSERTING ROW NUMBER: ' + str(rowsinserted) + ' **************' # debug output
    else: # only print a dot every (dotsevery) rows
        if (rowsinserted%dotsevery) == 0:
            print '.',

    columnposition = 0
    newrow = tde.Row(tableDef)
    for t in fieldnameslist:
        fieldtype = t.split('|')[1]
        fieldname = t.split('|')[0]

        if rowoutput == True:
            print str(columnposition) + ' ' + fieldname + ':   ' + str(row[fieldname]) + ' (' + str(fieldtype).split('.')[0] + ')' # debug output
       
        if fieldtype == 'str':
            if row[fieldname] != None: # we don't want no None!
                newrow.setCharString(columnposition, str(row[fieldname]))
            else:
                newrow.setNull(columnposition) # ok, put that None here

        if fieldtype == 'int' or fieldtype == 'long': # bigints come through as python longs
            if row[fieldname] != None:
                newrow.setInteger(columnposition, row[fieldname])
            else:
                newrow.setNull(columnposition)

        if fieldtype == 'bool':
            if row[fieldname] != None:
                newrow.setBoolean(columnposition, row[fieldname])
            else:
                newrow.setNull(columnposition)

        if fieldtype == 'decimal.Decimal' or fieldtype == 'float': # both got mapped to TDE doubles up top - floats used to fall through here
            if row[fieldname] != None:
                newrow.setDouble(columnposition, float(row[fieldname])) # float() cast covers Decimal too
            else:
                newrow.setNull(columnposition)

        if fieldtype == 'datetime.datetime': # sexy datetime splitting
            if row[fieldname] != None:
                strippeddate = str(row[fieldname]).split('.')[0] # just in case we get microseconds (not all datetime uses them)
                timechunks = time.strptime(str(strippeddate), "%Y-%m-%d %H:%M:%S") # chunky style!
                newrow.setDateTime(columnposition, timechunks[0], timechunks[1], timechunks[2], timechunks[3], timechunks[4], timechunks[5], 0000)
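                # (last arg of setDateTime is fractional seconds in 1/10000s - we stripped microseconds above, so zero)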
            else:
                newrow.setNull(columnposition)
   
        columnposition = columnposition + 1 # we gots to know what column number we're working on!
    tabletran.insert(newrow) # finally insert buffered row into TDE 'table'
    newrow.close()

# ok let's write out that file and get back to making dinner
tdefile.close()
mssql_db.close()

# timing purposes for debugging / optimizing / FUN! This is FUN, Lars.
timetaken = time.time() - start_time
print str(rowsinserted) + ' rows inserted in ' + str(timetaken) + ' seconds'
print '    (' + str(rowsinserted/timetaken) + ' rows per second)'
# woo, let's have a drink!

I threw this together in a few hours, and I'll probably be adding to it and fixing it down the line. It has very little exception handling and is only around 130 lines, including my asinine comments and whitespace. Parts of it are a bit 'inelegant' - lots of list building and string mangling - but it works. Quite well, actually.

I'd love to get some feedback so I can make it better. I tested it on many different tables I had lying around, including that crazy MS AdventureWorks2008R2 Data Warehouse DB, which uses a lot of strange data types and calculated fields. All worked splendidly.

Current Known Issues:

1) I'm only looking at the first row for data type guessing. I really should be taking a much larger sample of the data.
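
A fix would probably look something like this - buffer the first chunk of rows and let the first non-NULL value in each column decide the type. A hypothetical, untested sketch (guess_types_from_sample() and samplesize are made-up names):

# hypothetical sketch - guess column types from the first N rows instead of just row one
def guess_types_from_sample(cursor, samplesize=100):
    sample = cursor.fetchmany(samplesize) # standard DBAPI call - grabs up to N rows
    guesses = {}
    for row in sample:
        for name, value in row.items():
            if isinstance(name, int):
                continue # as_dict rows carry index keys AND name keys - skip the ints
            if value is not None and name not in guesses:
                guesses[name] = type(value).__name__ # first non-NULL value wins
    return sample, guesses # hand the buffered rows back so they still get inserted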

2) If you do a bunch of joins, don't explicitly define the field names in the query, and end up with duplicate field names in the result set, the file will miss some fields or even bomb out completely.

e.g. - Don't do a:

select a.*, b.*, c.* from table1 a, table2 b, table3 c 
            where a.id = b.id and a.id = c.id


But you wouldn't do that fucking nonsense anyways - right? ;)
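
If you really do need those joins, the workaround is to alias the collisions so every column in the result set has a unique name - something like this, reusing the tables from above:

select a.id as a_id, b.id as b_id, c.id as c_id
  from table1 a, table2 b, table3 c
 where a.id = b.id and a.id = c.id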

I'm also thinking of making a version of this with command line parameters instead of file editing.

e.g.

python TDE_from_mssql.py --file=belly.tde --server=servername --username=sa
    --database=db45 --sql="select * from bellybutton"

What's next for TDE file population posts? Web scraping, MySQL, Oracle, flat files? Might as well hit 'em all.

  • Martin

    Keep up the good work, Ryan! Your posts have already been very useful to me! Best, Martin

    • http://ryrobes.com/ Ryan Robitaille

      Thanks Martin. Any requests, you know, since I’m going down the path of external datasources? : )
      Cheers!

      • Martin

        I am sure massive web scraping with Beautiful Soup for instance will delight your audience ;)

  • srikanth sakhamuri

    This seems good, but I am having trouble finding the TDE file it generated…
    I have none-to-basic knowledge of Python.

  • Jura

    Hi Ryan, if I have a bunch of joins, how do you suggest to work with it?
    Any suggestion?

    • Drew

      Any progress on the multiple table joins? I have a similar issue with multiple left joins to a fact table with 7 dim tables. May or may not have duplicate names in the resulting table.

  • Jura

    I tried to run it on an Oracle database and got an error “AttributeError: ‘tuple’ object has no attribute ‘keys’”

    On line 28 with:

    itemz = len(row.keys())/2 # because the dict rowset includes BOTH number keys and fieldname keys

    • nadia slaimia

      Did you find a solution for this? I'm facing the same problem :( please help

  • Nat

    Hi Ryan,
    Fantastic script, but I am new to Python and I need to do the same thing - only I need to get a view from Lotus Notes and create a Tableau TDE. Just wondering if you can give me some ideas. Many thanks, Nat

  • CSeelig

    Nice post(s).

    I’m using the Extract API for a nightly refresh, and looking for a way to (programmatically) validate the success of a run. I trap for errors thrown by the API, but I’d like to get a row count, column summation over rows, that sort of thing. Thoughts? Is this possible?

    Thank you

  • Bojan

    Great! Does this mean that I can programmatically refresh my Tableau extracts used for various reports (created with Tableau Desktop 9.0 and viewed with Tableau Reader)?

  • Chris McClellan

    Is there an easy way to extract from TDE to CSV as well?

  • http://batman-news.com Jostna

    Hi,
    I am working on something similar with DB2.
    I don't know how to insert rows into the created TDE. Can you please help me out?

  • ahuyvam

    Ryan, is it possible to convert TDE to MDB using this process or at all?

  • Sarah Chu

    Hi Ryan, great work! I use it to connect to an Oracle database and create a tde file. It works for small datasets (<130,000 records); with more records, the program completes with no error, but the new tde file is unusable. I think it is related to a memory issue. Is there a way to flush the memory in Python every so many records, or can we append more rows into an existing tde? Thank you!

  • David Howell

    I realise this is old, but for new viewers I will respond to “Love to get some feedback so I can make it better”

    Comments apply equally to the new hyper format for extracts.

    In pymssql and other DBAPI compliant python libraries, cursor.description will tell you both name and a type_code for each column so no need for the for loop to read the first row and the str(type()) shenanigans.

    Alternatively you can query information_schema tables to discover the SQL datatypes and base your mapping off those metadata tables. Most relational databases have those tables ( MSSQL, Oracle, Postgres, MySQL, DB2).

    You can also do what Tableau does internally, and issue the SQL query twice, the first time wrapping it in “SELECT TOP 0 * FROM (” + yourquery + “)” or on other databases appending “LIMIT 0”, which will return a dataset with no rows, and do it very quickly, but will still allow you to get the schema metadata so you can prepare for receiving the real data.

    The main loop runs for each row in the cursor, then for each field in fieldlist.
    Firstly it does the split function twice, when it could be done once using tuple unpacking, e.g. `fieldname, fieldtype = t.split(‘|’)`, but that would still be doing it redundantly for each cell in each row, i.e. it calls the split function (2 * columncount * rowcount) times when instead of a list of pipe-separated strings it could have been a list of tuples up front, e.g. fieldnameslist.append((fieldname, fieldtype)), then referenced as t[0] and t[1]. That’s still not much better though, because string manipulation in python is cheap.

    Looping over that fieldlist is done as a series of `if` statements, and it will check all types for every cell, even if it hits a match on the first test for str and has already set the field, it will continue to check if the value is an int. Put common datatypes at the top and use elif to short-circuit the loop.

    A better approach is to avoid testing types since they are known. You could generate a function dynamically based on the schema. You can use a factory function to loop over the schema by column ordinal and return a function that works on a whole cursor row at a time, with the appropriate set{type} functions for the exact set of columns in your dataset, instead of testing every field in every row. Apart from the network time taken to download data, this looping and redundant datatype checking is the biggest performance bottleneck.

    Calling the print function to show dots is also useful for development but should be removed for unattended usage. Function calls are slow, especially IO like print.

    Instead of using a counter for each row, just use the API to connect to the final extract file created and use the inbuilt function that tells you rowcount.

    For large rowcounts, flush the file every 1 gig or so by closing & reopening the extract file. The API will append data, so it’s fine to do this. Note, equally it won’t protect you from duplicating data, so be careful not to append the same data. You can guess when it’s 1 gig by estimating from a rowcount or some other method. If you don’t do this you’ll run out of memory and the extract won’t be written.

    If you’re copying & pasting this into your environment, remove the comments where things are referred to a b*** and all the sh** and f*** swearing, it’s juvenile at the least and offensive at worst; no need to add to the BRO-grammer problem. A bit of humour in a tutorial blog is great but this just detracts from the usefulness of this to the community.