2 December 2012 15 Comments

Building Tableau Data Extract files with Python in Tableau 8 – Sample Usage

See also: Tableau Data Extracts from SQL Server or from CSV files
If you were learning (and / or partying) with us at the Tableau Customer Conference 2012 in Beautiful San Diego (where me and 6 other individuals were also crowned 'Tableau Zen Masters', ahem) - then you've probably been playing around with the Tableau 8 Beta over the past few weeks. Tableau 8 brings a whole load of new features and performance improvements, but it also includes 2 new APIs for us tinkering / hack-y peeps.

UNLEASH THE KRAKEN!

One of these is the 'Tableau Data Extract API' (also called the 'TDE API' in acronym land, where I have a beach house). TDE will allow us to programatically create Data Extracts for Tableau Desktop and Server using our favorite data-munching language. That is, assuming that your favorite language is Python, C, C++, or Java (barf). This is going to be huge, and while the libraries seem to be in their infancy - I fully expect for them to evolve and add features as we move along.

Python is wicked flexible and ideal for data-centric workflow / runstreams (in my humble opinion, at least). It's my weapon of choice. In essence, these libraries give us one more output format for our scripts / programs...

A "NATIVE Tableau Data Extract".

What *IS* a Tableau Data Extract you ask?

You know, those super fast binary .tde files that Tableau creates behind the scenes to mitigate your shitty datasource performance? Yes, THOSE.

If you're a fan of this blog you know that I revel in pulling down bizarre datasources from various places and throwing them in Tableau for making my own brand of 'data visualization moonshine'. Usually, I need to write them out to CSV files, Excel files, or a local database first. But with TDE I can write them directly to a usable Tableau file in the same script, saving a step and enabling a whole bunch of other integrations - most of which I haven't even thought of yet.

First things first. Since we're talking about beta stuff, I can't actually tell you where to GET IT. But I can do everything else from there. Besides if you attended TCC12, you probably already have access to it - otherwise, hold out until Q1 2013 - it'll be SO worth it.

It comes in 2 Python versions - 32-bit and 64-bit (both v8-beta2 as of this writing) - for these examples I'm using the 32-bit versions, since most of my Python modules are 32-bit for now.

Installation is straightforward, the old 'python setup.py install' in the module directory. You'll also see that it also comes with a number of Tableau Server DLLs and a tdeserver.exe in the BIN directory (so, sorry folks - Windows only) - but don't worry about the PATH of those files, they'll all get installed in Python's site-packages directory upon installation.

For the most basic and simple usage - there are no other required modules, just the standard ones. I'm using Python 2.7.3 currently.

I'm going to step through creating a sample file (as of this writing, sample usage files and documentation are not included in the packages) but, I'll show more advanced uses in the future, as I expand on the possibilities (custom functions, integration with other data-handy python modules, revamping old scripts, etc).

Simple Usage of TDE Python Module

(a decent starting point, but with no exception handling except built-in with the module)

Annotated anywhere possible, including code comments - full script below.

import dataextract as tde #saves some typing
import os, time # for file manipulation / script timing (not necc)

start_time = time.time() # simple timing for test purposes

try:  # Just for testing purposes and re-running
    tdefile = tde.Extract('test.tde') #in CWD
except:
    os.system('del test.tde')
    os.system('del DataExtract.log')
    tdefile = tde.Extract('test.tde')

tableDef = tde.TableDefinition() #create a new table def

As you can see I'm simply importing the 'dataextract' module plus a few others. The only reason I included the try/except block is because during testing I was running this bitch over and over again - and it got really annoying to delete the last created file.

  • tdefile = tde.Extract('test.tde') names my TDE file and creates 'tdefile' for later usage
  • Then tableDef = tde.TableDefinition() gets me ready to define my file 'schema' (for lack of a better word)
# using integers for now because the literal defs are not in the python module code
tableDef.addColumn("field_type7", 7)   #INTEGER
tableDef.addColumn("field_type10", 10) #DOUBLE
tableDef.addColumn("field_type11", 11) #BOOLEAN
tableDef.addColumn("field_type12", 12) #DATE
tableDef.addColumn("field_type13", 13) #DATETIME
tableDef.addColumn("field_type14", 14) #DURATION
tableDef.addColumn("field_type15", 15) #CHAR_STRING
tableDef.addColumn("field_type16", 16) #UNICODE_STRING

As you can see, I'm defining my field name and field data type here.

I couldn't find any place to define the data-types with literal words (like the C and C++ examples show), for now I'm going to stick with the numbering. Besides, 'addColumn' requires an integer for the datatype anyways. Thankfully there are only 8 data types.

# ok, lets print out the table def, just for shits and giggles
for c in range(0,tableDef.getColumnCount()):
    print 'Column: ' + str(tableDef.getColumnName(c)) + ' Type: ' + str(tableDef.getColumnType(c))

This is unnecessary, but you may find it useful. Basically, after we define our table layout (in tableDef) we echo it out on the screen. I like to add stuff like this so I have something to look at while it crunches away. So shoot me.

In my case, the output looks like this:

Column: field_type7 Type: 7
Column: field_type10 Type: 10
Column: field_type11 Type: 11
Column: field_type12 Type: 12
Column: field_type13 Type: 13
Column: field_type14 Type: 14
Column: field_type15 Type: 15
Column: field_type16 Type: 16

# lets add the new def as a table in the extract
tabletran = tdefile.addTable("Extract",tableDef)
# why table NEEDS to be called 'Extract' is beyond me

rowsinserted = 1

Ok, now that we've defined the table schema (tableDef) - we need to create that 'table' in our TDE file we defined. As of now, you are (seemingly) required to name the table "Extract", not that it really matters at this time.

"rowsinserted" is just so I can count all the rows and echo the number out at the end.

# let's create a bunch of rows! wheeeeee!
for i in range(1,1000000):

    newrow = tde.Row(tableDef)
    #newrow.setNull(0) #column
    newrow.setInteger(0,i) #column, value
    newrow.setDouble(1,i*1.4) #column, value
    newrow.setBoolean(2,1) #column, value (1/0)
    newrow.setDate(3, 2012, 12, 2) #column, y,m,d
    newrow.setDateTime(4, 2012, 12, 2, 18, 25, 55, 0000) #column, y,m,d,h,m,s,frac
    newrow.setDuration(5, 6, 18, 25, 55, 0000) #column, d,h,m,s,frac
    newrow.setCharString(6, 'Character String') #column, 'value'
    newrow.setString(7, 'Unicode String!') #column, 'value'
    rowsinserted = rowsinserted + 1 #count this row!

    tabletran.insert(newrow) #insert row into TDE table
    newrow.close()

tdefile.close()

Here is my main "insert" loops. Let's go step by step (the 'for i in range(x,x)' is just to run this block multiple times and generate some row depth for sample purposes).

  • Define our "newrow" as (our previously defined schema) tableDef
  • Insert each field in our new row - based on the different datatype we defined it as (for validation purposes) Note for each particular datatype we have to use a different format after defining the column number (based on the order they exist in the schema)
  • Once done, we insert that defined row (schema and data) into our file object
  • Close that row object since it will be re-opened next row (not sure if this is required, just keeping it clean)
  • Finally, when all rows are inserted - tdefile.close() writes it all to disk
  • timetaken = time.time() - start_time #just for timing and fun
    print str(rowsinserted) + ' rows inserted in ' + str(timetaken) + ' seconds'
    print '    (' + str(rowsinserted/timetaken) + ' rows per second)'
    #booya!

    Just printing some info about the execution - not needed - BUT IT'S FUN!

    1000000 rows inserted in 34.0269999504 seconds
        (29388.4268803 rows per second)
    
    

    Here's the whole shebang, Cut-n-Paste Away!

    # ryan robitaille 12/2/2012
    # really simple TDE creation test

    import dataextract as tde #saves some typing
    import os, time # for file manipulation / script timing (not necc)

    start_time = time.time() # simple timing for test purposes

    try:  # Just for testing purposes and re-running
        tdefile = tde.Extract('test.tde') #in CWD
    except:
        os.system('del test.tde')
        os.system('del DataExtract.log')
        tdefile = tde.Extract('test.tde')

    tableDef = tde.TableDefinition() #create a new table def

    # using integers for now because the literal defs are not in the python module code
    tableDef.addColumn("field_type7", 7)   #INTEGER
    tableDef.addColumn("field_type10", 10) #DOUBLE
    tableDef.addColumn("field_type11", 11) #BOOLEAN
    tableDef.addColumn("field_type12", 12) #DATE
    tableDef.addColumn("field_type13", 13) #DATETIME
    tableDef.addColumn("field_type14", 14) #DURATION
    tableDef.addColumn("field_type15", 15) #CHAR_STRING
    tableDef.addColumn("field_type16", 16) #UNICODE_STRING

    # ok, lets print out the table def, just for shits and giggles
    for c in range(0,tableDef.getColumnCount()):
        print 'Column: ' + str(tableDef.getColumnName(c)) + ' Type: ' + str(tableDef.getColumnType(c))

    # lets add the new def as a table in the extract
    tabletran = tdefile.addTable("Extract",tableDef)
    # why table NEEDS to be called 'Extract' is beyond me

    rowsinserted = 1

    # let's create a bunch of rows! wheeeeee!
    for i in range(1,1000000):

        newrow = tde.Row(tableDef)
        #newrow.setNull(0) #column
        newrow.setInteger(0,i) #column, value
        newrow.setDouble(1,i*1.4) #column, value
        newrow.setBoolean(2,1) #column, value (1/0)
        newrow.setDate(3, 2012, 12, 2) #column, y,m,d
        newrow.setDateTime(4, 2012, 12, 2, 18, 25, 55, 0000) #column, y,m,d,h,m,s,frac
        newrow.setDuration(5, 6, 18, 25, 55, 0000) #column, d,h,m,s,frac
        newrow.setCharString(6, 'Character String') #column, 'value'
        newrow.setString(7, 'Unicode String!') #column, 'value'
        rowsinserted = rowsinserted + 1 #count this row!

        tabletran.insert(newrow) #insert row into TDE table
        newrow.close()

    tdefile.close()

    timetaken = time.time() - start_time #just for timing and fun
    print str(rowsinserted) + ' rows inserted in ' + str(timetaken) + ' seconds'
    print '    (' + str(rowsinserted/timetaken) + ' rows per second)'

    What's next? Building TDEs from web screen-scraping as well as using the MessyTables python module for turning ugly CSV and XLS into sexified TDEs (hopefully). Stay tuned - this topic is somewhat of a moving target...

    UPDATE: Changed the code loop to a million rows.. instead of a BILLION. Oops. For the record, on my desktop, generating a billion rows took 9.5 hours - approx 33,924 rows per second (using 64-bit Python and Tableau libs)

    Want more? Here is a follow-up post creating TDE files from SQL Server queries!

    • Martin

      Excellent post ! Thanks for sharing !

    • Alex Kerin

      This is really cool – we can do things like combine python scripts to harvest Twitter (http://community.tableausoftware.com/thread/118609 – or your version) to write directly to a tde that server or desktop can access – very nice.

      • http://ryrobes.com/ Ryan Robitaille

        Thanks Alex.
        Exactly. And the TDE creation doesn’t have to happen on the Tableau Server either, since all the binaries are included with the module! Cheers!

    • http://twitter.com/Kovner Michael B. Kovner

      Best line: “…enabling a whole bunch of other integrations – most of which I haven’t even thought of yet.” There are a handful of core, basic, intended uses of the TDEAPI (“Tee-dee-pie”?), another handful of grandiose ideas I’ve daydreamed about, and an unlimited amount of crazy uses that guys like you are going to come up with. I’m pumped! The picture at the top is perfect: The Kraken is beginning to reveal itself, but we have not yet seen its full glory.

      • http://ryrobes.com/ Ryan Robitaille

        Haha, so true. I figure that I might as well post about the basics before I start doing any crazy shit with it. :) Hopefully it’ll all make more sense that way.

    • http://www.facebook.com/karel.v.sedlacek Karel V Sedlacek

      Where do we get the api distribution? I do not see it in my beta 5 desktop oor server folder structure…Best, Karel

    • disqus_i4nPiWX0Mt

      Ryan –

      I have the anacondapro python stack installed. the dataextract package install appears to be fine, but the import fails with “can’t import name Exceptions” Searching for answers, I included the directory in PYTHONPATH, but no luck. I’m sure it’s something simple. Any ideas?

      • http://ryrobes.com/ Ryan Robitaille

        Hmm – let me play around with the anacondapro stack and try to replicate, then I can give a better answer. Sorry for the delay, was traveling last week.

        Cheers!

    • ratnesh

      Can we delete the rows as well , Using these api ?

    • kpfeif

      This was really helpful. I was able to migrate the result of a bunch of simulations within python away from matplotlib to Tableu and all of its pretty gooey visualization goodness.

      It’s your ability to weave just the right amount of profanity into your writing and code that I really respect. Your comments remind me of my own, but yours are in a completely different creativity league.

    • MJ

      Appreciate this so much for this!!! had never done anything with python till 3 days ago!!! though a fairly active tableau user.

      And now got through reading a csv file directly with help from your site!!!

      Now just a teeny step further – once we’ve finished testing what’s the way to append instead of delete/recreate a tdefile in the except section??? I’m referring to this section:
      *********quote*********
      import dataextract as tde #saves some typing
      import os, time # for file manipulation / script timing (not necc)

      start_time = time.time() # simple timing for test purposes

      try: # Just for testing purposes and re-running
      tdefile = tde.Extract(‘test.tde’) #in CWD
      except:
      os.system(‘del test.tde’)
      os.system(‘del DataExtract.log’)
      tdefile = tde.Extract(‘test.tde’)

      tableDef = tde.TableDefinition() #create a new table def
      **********end quote********

    • Ekta Grover

      Excellent post, @ryan@ryrobes:disqus – quick question though – I do need some processing with Python, while loading data to tableau, basically I have a bunch of metrics(say, A, B, C) that are pre-loaded in mysql db – and while creating a view, I want to do publish confidence intervals, z score etc. (A’, B’,C’) – this part I do with python. I can’t store this data in the another table in same DB – since the user may want to change the date range for aggregation on the fly – any thoughts on how do I approach this use-case ?

      Also looked through your others posts, http://ryrobes.com/featured-articles/using-a-simple-python-script-for-end-to-end-data-transformation-and-etl-part-1/ .

    • ubbu

      Thank for sharing. How to convert Excle(xlsx) file to tde?? Did any one created a Tableau tde from Excel(xlsx) file using Python??

    • DENA SHIPE

      my husband was looking for DD 2384-1 recently and used an online service that has 6 million forms . If others want DD 2384-1 also , here’s

      https://goo.gl/nbtcxq
    • JOHNNY HANNING

      Good article, Thanks!