SQL Server Query to Tableau Data Extract LIKE A BOSS – Some more TDE API fun with Python & Tableau 8
Coming off the excitement of my last post about writing a simple bare-bones python usage of Tableau's brand-new Data Extracts API from version 8.0 - I figured that it was time to build on that. Let's take a step forward and get a little more complicated - AND a little more useful.
Oh, noes!
I'm encapsulated in a proprietary binary data file OF EMOTION!
Save me, Cringer!
Wait. This could be MORE useful then my last post?
You mean your business doesn't run on a billion rows of dummy shit data? Weird.
Anyways, this time we're going to extract data from SQL Server with a single SQL query and populate a new TDE file.
This script has 3 main features:
- Can run any query you want (even 'select * from' queries)
- You don't have to pre-define the field names
- You don't have to pre-define the data types
I got you, shawty. I GOT YOU.
All of which we had to explicitly state in the last script (even as a 'proof of concept'), and as you can imagine - with a wide result set, defining every little thing and then inserting every little thing can be a huge pain in the ass (not to mention ugly). With my approach, it's literally plug and play - err, paste and play. You can even run queries like "select * from " it's not even necessary to use the field names (assuming they are all unique, that is - more on that later) - and yes, I mentioned that twice for EMPHASIS!
Ye gods! No scripting out my lame_ass_field_names? Huzzah!
Basically, the script works like this:
- Connect to SQL Server
- Execute your SQL query
- Look at the field names and first row data
- Have Python try to guess what data types they are (based on that first row)
- Map those data types and field names to a TDE schema definition (TableDefinition())
- Loop through the entire result set, calling each field by name and type so the correct insert method is used (setDouble, setInteger, setNull, setDateTime, setBoolean, etc)
- Do a little headbanging as you watch the text fly by!
- Close the file (write the TDE to disk)
- Close the SQL connection
- Wipe hands on pants
Booya. Done.
And you didn't have to do a damn thing except give it a server connection and a SQL query. Now you can open that fresh TDE file and have yourself a little data party. { Pants optional }
Here it is! Cut, Paste, and give it a try, gosh darnint!
# simple Tableau Data Extract creation from a single microsoft sql server sql statement
import dataextract as tde # saves some typing, cause i'm a lazy fucker
import os, time, pymssql # for file manipulation, script timing (not necc), database access!
###################### FOR YOUR PARAMETERS, SON! ######################
tdefilename = 'ufo_datas.tde'
sql = "select * from UFOdata.dbo.Sightings" # whatever
sqlserverhost = 'localhost'
sqlusername = 'sa'
sqlpassword = 'passy'
sqldatabase = 'UFOdata'
rowoutput = False # for DEBUGGING data errors / slows shit down 10X however
###################### FOR YOUR PARAMETERS, SON! ######################
dotsevery = 75
start_time = time.time() # simple timing for test purposes
mssql_db = pymssql.connect(host=sqlserverhost, user=sqlusername, password=sqlpassword, database=sqldatabase, as_dict=True) # as_dict very important
mssql_cursor = mssql_db.cursor()
mssql_cursor.execute(sql)
print ' '
print '[ Note: Each . = ' +str(dotsevery)+ ' rows processed ]'
fieldnameslist = [] # define our empty list
#go through the first row to TRY to set fieldnames and datatypes
for row in mssql_cursor:
itemz = len(row.keys())/2 # because the dict rowset includes BOTH number keys and fieldname keys
for k in row.keys():
fieldnameslist.append(str(k) + '|' + str(type(row[k])).replace("<type '","").replace("'>","").replace("<class '","").replace('NoneType','str').replace('uuid.UUID','str') )
break # after the first row, we SHOULD have a decent idea of the datatypes
# ^ a bit inelegant, but it gets the job done
fieldnameslist.sort() # sort them out so the integer keys are first (we're gonna whack em)
del fieldnameslist[0:itemz] # remove first x amount of keys (should be all integers instead of dict literals)
try: # Just in case the file exists already, we don't want to bomb out
tdefile = tde.Extract(tdefilename) # in CWD
except:
os.system('del '+tdefilename)
os.system('del DataExtract.log') #might as well erase this bitch too
tdefile = tde.Extract(tdefilename)
# ok lets build the table definition in TDE with our list of names and types first
# replacing literals with TDE datatype integers, etc
tableDef = tde.TableDefinition() #create a new table def
if rowoutput == True:
print '*** field names list ***' # debug
for t in fieldnameslist:
fieldtype = t.split('|')[1]
fieldname = t.split('|')[0]
fieldtype = str(fieldtype).replace("str","15").replace("datetime.datetime","13").replace("int","7").replace("decimal.Decimal","10").replace("float","10").replace("uuid.UUID","15").replace("bool","11")
if rowoutput == True:
print fieldname + ' (looks like ' + t.split('|')[1] +', TDE datatype ' + fieldtype + ')' # debug
try:
tableDef.addColumn(fieldname, int(fieldtype)) # if we pass a non-int to fieldtype, it'll fail
except:
tableDef.addColumn(fieldname, 15) # if we get a weird type we don't recognize, just make it a string
if rowoutput == True:
print '***'
time.sleep(5) # wait 5 seconds so you can actually read shit!
if rowoutput == True:
# ok, lets print out the table def we just made, for shits and giggles
print '################## TDE table definition created ######################'
for c in range(0,tableDef.getColumnCount()):
print 'Column: ' + str(tableDef.getColumnName(c)) + ' Type: ' + str(tableDef.getColumnType(c))
time.sleep(5) # wait 5 seconds so you can actually read shit!
# ok lets add the new def as a table in the extract
tabletran = tdefile.addTable("Extract",tableDef)
# why table NEEDS to be called 'Extract' is beyond me
rowsinserted = 1 # we need to count stuff, dude! Robots start at 0, I START AT 1!
# ok, for each row in the result set, we iterate through all the fields and insert based on datatype
for row in mssql_cursor:
if rowoutput == True:
print '************** INSERTING ROW NUMBER: ' + str(rowsinserted) + '**************' # debug output
else: # only print dot every 50 records
if (rowsinserted%dotsevery) == 0:
print '.',
columnposition = 0
newrow = tde.Row(tableDef)
for t in fieldnameslist:
fieldtype = t.split('|')[1]
fieldname = t.split('|')[0]
if rowoutput == True:
print str(columnposition) + ' ' + fieldname + ': ' + str(row[fieldname]) + ' (' + str(fieldtype).split('.')[0] + ')' # debug output
if fieldtype == 'str':
if row[fieldname] != None: # we don't want no None!
newrow.setCharString(columnposition, str(row[fieldname]))
else:
newrow.setNull(columnposition) # ok, put that None here
if fieldtype == 'int':
if row[fieldname] != None:
newrow.setInteger(columnposition, row[fieldname])
else:
newrow.setNull(columnposition)
if fieldtype == 'bool':
if row[fieldname] != None:
newrow.setBoolean(columnposition, row[fieldname])
else:
newrow.setNull(columnposition)
if fieldtype == 'decimal.Decimal':
if row[fieldname] != None:
newrow.setDouble(columnposition, row[fieldname])
else:
newrow.setNull(columnposition)
if fieldtype == 'datetime.datetime': # sexy datetime splitting
if row[fieldname] != None:
strippeddate = str(row[fieldname]).split('.')[0] # just in case we get microseconds (not all datetime uses them)
timechunks = time.strptime(str(strippeddate), "%Y-%m-%d %H:%M:%S") # chunky style!
newrow.setDateTime(columnposition, timechunks[0], timechunks[1], timechunks[2], timechunks[3], timechunks[4], timechunks[5], 0000)
else:
newrow.setNull(columnposition)
columnposition = columnposition + 1 # we gots to know what column number we're working on!
tabletran.insert(newrow) # finally insert buffered row into TDE 'table'
newrow.close()
rowsinserted = rowsinserted + 1
# ok let's write out that file and get back to making dinner
tdefile.close()
mssql_db.close()
# timing purposes for debugging / optimizing / FUN! This is FUN, Lars.
timetaken = time.time() - start_time
print str(rowsinserted) + ' rows inserted in ' + str(timetaken) + ' seconds'
print ' (' + str(rowsinserted/timetaken) + ' rows per second)'
# woo, let's have a drink!
I threw this together in a few hours, and will probably be adding to it, fixing it down the line. It has very little exception handling and is only around 130 lines; including my asinine comments and whitespace lines. Parts of it are a bit 'inelegant', lots of list building and string mangling - but it works. Quite well actually.
Love to get some feedback so I can make it better. I tested it on many different tables I had laying around, including that crazy MS AdventureWorks2008R2 Data Warehouse DB which uses a lot of strange data types and calculated fields. All worked splendidly.
1) I'm only looking at the first row for data type guessing. I really should be taking a much larger sample of the data.
2) If you do a bunch of joins, don't explicitly define the field names in the query, and have duplicate field names in the result set - the file will miss some fields, or even bomb out completely.
i.e. - Don't do a:
select a.*, b.*, c.* from table1 a, table2 b, table3 c where a.id = b.id and a.id = c.id
But you wouldn't do that fucking nonsense anyways - right? ;)
I'm also thinking of making a version of this with command line parameters instead of file editing.
i.e. Python TDE_from_mssql.py --file=belly.tde --server=servername --username=sa --database=db45 --sql="select * from bellybutton"
What's next for TDE file population posts? Web scraping, MySQL, Oracle, flat files? Might as well hit 'em all.