Building Tableau Data Extract files with Python in Tableau 8 – Sample Usage
See also: Tableau Data Extracts from SQL Server or from CSV files
If you were learning (and / or partying) with us at the Tableau Customer Conference 2012 in Beautiful San Diego (where me and 6 other individuals were also crowned 'Tableau Zen Masters', ahem) - then you've probably been playing around with the Tableau 8 Beta over the past few weeks. Tableau 8 brings a whole load of new features and performance improvements, but it also includes 2 new APIs for us tinkering / hack-y peeps.
One of these is the 'Tableau Data Extract API' (also called the 'TDE API' in acronym land, where I have a beach house). TDE will allow us to programatically create Data Extracts for Tableau Desktop and Server using our favorite data-munching language. That is, assuming that your favorite language is Python, C, C++, or Java (barf). This is going to be huge, and while the libraries seem to be in their infancy - I fully expect for them to evolve and add features as we move along.
Python is wicked flexible and ideal for data-centric workflow / runstreams (in my humble opinion, at least). It's my weapon of choice. In essence, these libraries give us one more output format for our scripts / programs...
A "NATIVE Tableau Data Extract".
What *IS* a Tableau Data Extract you ask?
You know, those super fast binary .tde files that Tableau creates behind the scenes to mitigate your shitty datasource performance? Yes, THOSE.
If you're a fan of this blog you know that I revel in pulling down bizarre datasources from various places and throwing them in Tableau for making my own brand of 'data visualization moonshine'. Usually, I need to write them out to CSV files, Excel files, or a local database first. But with TDE I can write them directly to a usable Tableau file in the same script, saving a step and enabling a whole bunch of other integrations - most of which I haven't even thought of yet.
First things first. Since we're talking about beta stuff, I can't actually tell you where to GET IT. But I can do everything else from there. Besides if you attended TCC12, you probably already have access to it - otherwise, hold out until Q1 2013 - it'll be SO worth it.
It comes in 2 Python versions - 32-bit and 64-bit (both v8-beta2 as of this writing) - for these examples I'm using the 32-bit versions, since most of my Python modules are 32-bit for now.
Installation is straightforward, the old 'python setup.py install' in the module directory. You'll also see that it also comes with a number of Tableau Server DLLs and a tdeserver.exe in the BIN directory (so, sorry folks - Windows only) - but don't worry about the PATH of those files, they'll all get installed in Python's site-packages directory upon installation.
For the most basic and simple usage - there are no other required modules, just the standard ones. I'm using Python 2.7.3 currently.
I'm going to step through creating a sample file (as of this writing, sample usage files and documentation are not included in the packages) but, I'll show more advanced uses in the future, as I expand on the possibilities (custom functions, integration with other data-handy python modules, revamping old scripts, etc).
Simple Usage of TDE Python Module
(a decent starting point, but with no exception handling except built-in with the module)
Annotated anywhere possible, including code comments - full script below.
import os, time # for file manipulation / script timing (not necc)
start_time = time.time() # simple timing for test purposes
try: # Just for testing purposes and re-running
tdefile = tde.Extract('test.tde') #in CWD
except:
os.system('del test.tde')
os.system('del DataExtract.log')
tdefile = tde.Extract('test.tde')
tableDef = tde.TableDefinition() #create a new table def
As you can see I'm simply importing the 'dataextract' module plus a few others. The only reason I included the try/except block is because during testing I was running this bitch over and over again - and it got really annoying to delete the last created file.
- tdefile = tde.Extract('test.tde') names my TDE file and creates 'tdefile' for later usage
- Then tableDef = tde.TableDefinition() gets me ready to define my file 'schema' (for lack of a better word)
tableDef.addColumn("field_type7", 7) #INTEGER
tableDef.addColumn("field_type10", 10) #DOUBLE
tableDef.addColumn("field_type11", 11) #BOOLEAN
tableDef.addColumn("field_type12", 12) #DATE
tableDef.addColumn("field_type13", 13) #DATETIME
tableDef.addColumn("field_type14", 14) #DURATION
tableDef.addColumn("field_type15", 15) #CHAR_STRING
tableDef.addColumn("field_type16", 16) #UNICODE_STRING
As you can see, I'm defining my field name and field data type here.
I couldn't find any place to define the data-types with literal words (like the C and C++ examples show), for now I'm going to stick with the numbering. Besides, 'addColumn' requires an integer for the datatype anyways. Thankfully there are only 8 data types.
for c in range(0,tableDef.getColumnCount()):
print 'Column: ' + str(tableDef.getColumnName(c)) + ' Type: ' + str(tableDef.getColumnType(c))
This is unnecessary, but you may find it useful. Basically, after we define our table layout (in tableDef) we echo it out on the screen. I like to add stuff like this so I have something to look at while it crunches away. So shoot me.
In my case, the output looks like this:
Column: field_type7 Type: 7 Column: field_type10 Type: 10 Column: field_type11 Type: 11 Column: field_type12 Type: 12 Column: field_type13 Type: 13 Column: field_type14 Type: 14 Column: field_type15 Type: 15 Column: field_type16 Type: 16
tabletran = tdefile.addTable("Extract",tableDef)
# why table NEEDS to be called 'Extract' is beyond me
rowsinserted = 1
Ok, now that we've defined the table schema (tableDef) - we need to create that 'table' in our TDE file we defined. As of now, you are (seemingly) required to name the table "Extract", not that it really matters at this time.
"rowsinserted" is just so I can count all the rows and echo the number out at the end.
for i in range(1,1000000):
newrow = tde.Row(tableDef)
#newrow.setNull(0) #column
newrow.setInteger(0,i) #column, value
newrow.setDouble(1,i*1.4) #column, value
newrow.setBoolean(2,1) #column, value (1/0)
newrow.setDate(3, 2012, 12, 2) #column, y,m,d
newrow.setDateTime(4, 2012, 12, 2, 18, 25, 55, 0000) #column, y,m,d,h,m,s,frac
newrow.setDuration(5, 6, 18, 25, 55, 0000) #column, d,h,m,s,frac
newrow.setCharString(6, 'Character String') #column, 'value'
newrow.setString(7, 'Unicode String!') #column, 'value'
rowsinserted = rowsinserted + 1 #count this row!
tabletran.insert(newrow) #insert row into TDE table
newrow.close()
tdefile.close()
Here is my main "insert" loops. Let's go step by step (the 'for i in range(x,x)' is just to run this block multiple times and generate some row depth for sample purposes).
print str(rowsinserted) + ' rows inserted in ' + str(timetaken) + ' seconds'
print ' (' + str(rowsinserted/timetaken) + ' rows per second)'
#booya!
Just printing some info about the execution - not needed - BUT IT'S FUN!
1000000 rows inserted in 34.0269999504 seconds (29388.4268803 rows per second)
Here's the whole shebang, Cut-n-Paste Away!
# really simple TDE creation test
import dataextract as tde #saves some typing
import os, time # for file manipulation / script timing (not necc)
start_time = time.time() # simple timing for test purposes
try: # Just for testing purposes and re-running
tdefile = tde.Extract('test.tde') #in CWD
except:
os.system('del test.tde')
os.system('del DataExtract.log')
tdefile = tde.Extract('test.tde')
tableDef = tde.TableDefinition() #create a new table def
# using integers for now because the literal defs are not in the python module code
tableDef.addColumn("field_type7", 7) #INTEGER
tableDef.addColumn("field_type10", 10) #DOUBLE
tableDef.addColumn("field_type11", 11) #BOOLEAN
tableDef.addColumn("field_type12", 12) #DATE
tableDef.addColumn("field_type13", 13) #DATETIME
tableDef.addColumn("field_type14", 14) #DURATION
tableDef.addColumn("field_type15", 15) #CHAR_STRING
tableDef.addColumn("field_type16", 16) #UNICODE_STRING
# ok, lets print out the table def, just for shits and giggles
for c in range(0,tableDef.getColumnCount()):
print 'Column: ' + str(tableDef.getColumnName(c)) + ' Type: ' + str(tableDef.getColumnType(c))
# lets add the new def as a table in the extract
tabletran = tdefile.addTable("Extract",tableDef)
# why table NEEDS to be called 'Extract' is beyond me
rowsinserted = 1
# let's create a bunch of rows! wheeeeee!
for i in range(1,1000000):
newrow = tde.Row(tableDef)
#newrow.setNull(0) #column
newrow.setInteger(0,i) #column, value
newrow.setDouble(1,i*1.4) #column, value
newrow.setBoolean(2,1) #column, value (1/0)
newrow.setDate(3, 2012, 12, 2) #column, y,m,d
newrow.setDateTime(4, 2012, 12, 2, 18, 25, 55, 0000) #column, y,m,d,h,m,s,frac
newrow.setDuration(5, 6, 18, 25, 55, 0000) #column, d,h,m,s,frac
newrow.setCharString(6, 'Character String') #column, 'value'
newrow.setString(7, 'Unicode String!') #column, 'value'
rowsinserted = rowsinserted + 1 #count this row!
tabletran.insert(newrow) #insert row into TDE table
newrow.close()
tdefile.close()
timetaken = time.time() - start_time #just for timing and fun
print str(rowsinserted) + ' rows inserted in ' + str(timetaken) + ' seconds'
print ' (' + str(rowsinserted/timetaken) + ' rows per second)'
What's next? Building TDEs from web screen-scraping as well as using the MessyTables python module for turning ugly CSV and XLS into sexified TDEs (hopefully). Stay tuned - this topic is somewhat of a moving target...
Want more? Here is a follow-up post creating TDE files from SQL Server queries!