7 December 2012 6 Comments

Build Tableau Data Extracts out of CSV Files? More Python TDE API madness!

So here we have the 3rd in a series about using Tableau 8's 'Data Extract' API to automatically create TDE files from various data sources without using the desktop client. This time we're focusing on good ole' Comma Separated files.

Much like my SQL Server script, this one tries to guess field names and field data types in order to produce as good an extract as possible - I've also added a little command line progress bar for your FILE WATCHING ENJOYMENT.

You know, in case you're into shit like that.
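That type guessing is really just a cascade of try/except casts - here's a minimal standalone sketch of the idea (the `guess_type` name is made up; the script's actual function is further down):

```python
from datetime import datetime

def guess_type(raw):
    # try numeric casts first, then a couple of common date formats,
    # then fall back to string (an empty string becomes None)
    for caster in (int, float):
        try:
            return caster(raw)
        except ValueError:
            pass
    for fmt in ('%m/%d/%Y', '%Y-%m-%d'):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            pass
    return raw if raw else None

print(guess_type('42'))          # 42 (an int)
print(guess_type('3.14'))        # 3.14 (a float)
print(guess_type('2012-12-07'))  # datetime for Dec 7, 2012
```

Order matters: `int` has to come before `float`, or every whole number would land as a float.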

Plus, I report on file sizes and row counts! OMG! You ALSO have the ability to run MULTIPLE files in a directory - in case you have a bunch that need to be crunched into their own TDE files - that's right, Christmas come early! (You're welcome)

I used some census data files for testing (as well as Metallica, Bigfoot, and UFO data - but that's a given) as you can see below. I gotta say, processing CSV files is decently speedy - even though I haven't tried to process any HUGE files yet (2GB+).

Bit of explanation about the script parameters:

  • cvsfilenamemask = '.csv' It's basically an "ends with" string - so you can use something like '.csv' to process ALL CSV files in a directory, or simply change it to something like 'input.csv' to ONLY process that file from that dir.
  • sourcedir = 'C:\\Python27\\' Fairly obvious - this is the folder that the script will look in for any files matching 'cvsfilenamemask' above. You can also use UNC file paths like '\\\\ryrobesxps\\d$\\'. Remember to double up your slashes accordingly!
  • targetdir = 'C:\\Python27\\' Same as above, except this is the target dir that the resulting TDE file will be written to.
  • csvdelimiter = ',' You should know this one.
  • csvquotechar = '"' AND this one, but who knows.
  • rowoutput = False Set to 'True' to have all the row and column info echoed into the command line window - it's good for debugging data and script errors but it slows things down like 10X (literally). Who knew that printing out thousands of things would be so expensive! ;)
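The mask is just an "ends with" check against everything in the source folder - roughly like this (file names below are made up):

```python
cvsfilenamemask = '.csv'  # same "ends with" semantics as the script's parameter

# pretend this list came back from os.listdir(sourcedir)
listing = ['cbp10co.csv', 'readme.txt', 'input.csv', 'data.csv.bak']
matches = [f for f in listing if f.endswith(cvsfilenamemask)]
print(matches)  # ['cbp10co.csv', 'input.csv']
```

Note that 'data.csv.bak' doesn't match - the mask checks the end of the name, not whether '.csv' appears anywhere in it.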

Here's a sample output from some tests I did earlier

C:\Python27>python CSV_to_TDE.py

###########################################################################
  Now working on cbp10co.csv (2,155,390) -> cbp10co.tde (59,871 rows per =)
###########################################################################
[ = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ]
(2,155,390 rows)

###########################################################################
  Now working on cbp10st.csv (456,410) -> cbp10st.tde (12,678 rows per =)
###########################################################################
[ = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ]
(456,410 rows)

###########################################################################
  Now working on PlayerStats.csv (23) -> PlayerStats.tde (10 rows per =)
###########################################################################
[ = = ]
(23 rows)

###########################################################################
  Now working on simple.csv (32) -> simple.tde (10 rows per =)
###########################################################################
[ = = = ]
(31 rows)

cbp10co.csv   383 seconds   2,155,390 rows. TDE file is 172.9MB (source was 42.2MB)
cbp10st.csv   431 seconds   456,410 rows. TDE file is 114.6MB (source was 43.5MB)
PlayerStats.csv 4 seconds   23 rows. TDE file is 3.5KB (source was 57.4KB)
simple.csv      1 seconds   31 rows. TDE file is 659.0bytes (source was 34.4KB)

TOTAL RUN     820 seconds   2,611,854 rows - 85.8MB of text into 287.5MB of data sex!

C:\Python27>
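If you're wondering where figures like "59,871 rows per =" come from: the progress bar is roughly 36 segments wide, so the script just divides the line count by 36 (the helper name below is made up, but the math matches the script's dotsevery logic):

```python
def rows_per_tick(line_count, bar_width=36):
    # one '=' printed every (line_count / bar_width) rows; tiny files just use 10
    if line_count > bar_width:
        return line_count // bar_width
    return 10

print(rows_per_tick(2155390))  # 59871 - matches the cbp10co.csv run above
print(rows_per_tick(456410))   # 12678 - matches cbp10st.csv
print(rows_per_tick(23))       # 10
```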

Sexy, right? Indeed. Ok, here it is, it's a whopper, so get that Ctrl-C hand ready...

Wh3r3's th3 b33f? Oh, nevermind.

Note: if you have fewer than 20 rows or so, my "fancy" header checking can get confused. Why you would have a file like that, I have no idea - just sayin'.
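That "fancy" header checking is just the standard library's csv.Sniffer, which decides by comparing the first row against the rows beneath it - which is exactly why tiny files can confuse it:

```python
import csv

# a text column under a text header plus a numeric column under a text header:
# the 'age' header is a string while the values below it are numbers -> header detected
sample = "name,age\nalice,30\nbob,25\nclaire,41\n"
print(csv.Sniffer().has_header(sample))  # True

# every row (including the first) is numeric -> no header detected
no_header = "1,2\n3,4\n5,6\n"
print(csv.Sniffer().has_header(no_header))  # False
```

With only a handful of rows there isn't much for the sniffer to compare against, so it guesses - sometimes badly.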
# ryan robitaille (12/6/2012)
# creating Tableau Data Extracts via CSV files

import csv, os, time
from datetime import datetime
import dataextract as tde #saves some typing, cause i'm a lazy fucker

################ PARAMETERS FOR YOU, CODE MONKEY! ##########################
cvsfilenamemask = 's.csv' # can be explicit 'thisfile.csv' for one file - or open '.csv' for all that match
sourcedir = 'C:\\Python27\\' # need to double up the \\s | windows shares look like '\\\\ryrobesxps\\d$\\' etc
targetdir = 'C:\\Python27\\' # can't be a share or UNC path
csvdelimiter = ',' # obvious!
csvquotechar = '"' # obvious!
rowoutput = False # useful for debugging data errors / slows shit down a lot however
################ PARAMETERS FOR YOU, CODE MONKEY! ##########################

# Note: if you have less than a few thousand rows, the progress bar will be a bit fucked looking.

fileperf = dict() # for saving each files execution times

# since the CSV module imports all fields as strings regardless of what they are..
def datatyper(n):    # force some data types to figure shit out
    try:             # kind of lame.... BUT IT WORKS
        return int(n)
    except:
        pass
    try:
        return float(n)
    except:
        pass
    try:
        return datetime.strptime(n, '%m/%d/%Y')
    except:
        pass
    try:
        return datetime.strptime(n, '%Y-%m-%d')
    except:
        pass
    if n == 'NULL': # just in case, don't want any literal NULLs in there
        return None
    elif len(n) > 0:
        return str(n)
    else: # no need to return an empty string, let's NULL that shit out
        return None
# end ugly data types function

def showhumanfilesize(num):
    for x in ['bytes','KB','MB','GB']:
        if num < 1024.0:
            return "%3.1f%s" % (num, x)
        num /= 1024.0
    return "%3.1f%s" % (num, 'TB')

def intWithCommas(x):
    if type(x) not in [type(0), type(0L)]:
        raise TypeError("Parameter must be an integer.")
    if x < 0:
        return '-' + intWithCommas(-x)
    result = ''
    while x >= 1000:
        x, r = divmod(x, 1000)
        result = ",%03d%s" % (r, result)
    return "%d%s" % (x, result)

def file_lines(fname):
    i = -1 # so a totally empty file returns 0 instead of blowing up
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

#print ' '
#print '[ Note: Each . = ' +str(dotsevery)+ ' rows processed ]'

os.chdir(sourcedir)
for csvfilename in os.listdir("."):
    if csvfilename.endswith(cvsfilenamemask):
        tdefilename = os.path.splitext(csvfilename)[0] + '.tde' # splitext won't mangle filenames with extra dots in them
        linez = file_lines(sourcedir + csvfilename)
        if linez > 36:
            dotsevery = linez/36
        else:
            dotsevery = 10
        print ' '
        print '###########################################################################'
        print '  Now working on ' + csvfilename + ' ('+str(intWithCommas(linez))+') -> ' + tdefilename + ' (' + str(intWithCommas(dotsevery)) + ' rows per =)'
        print '###########################################################################'
        #print dotsevery
        time.sleep(5) # so you can read it.

# BEGIN MULTI FILE LOOP
        start_time = time.time() # simple timing for test purposes

        csvfile = open(sourcedir + csvfilename, 'rb') # full path, since we chdir to targetdir later
        try:
            # taking a sample from the top of the file for header sniffing
            toplines = csvfile.readlines()
            filebuffer = ''.join(toplines[:dotsevery]) # slicing won't blow up on short files
            hasheader = csv.Sniffer().has_header(filebuffer)
        except:
            hasheader = False


        # ok lets go
        csvfile.seek(0) # YOU WILL DO, WHAT I SAY, WHEN I SAY! BACK TO THE FRONT!
        csvreader = csv.DictReader(csvfile, delimiter=csvdelimiter, quotechar=csvquotechar)

        dfields = []
        dtypes = []

        if hasheader == True:
            for f in csvreader.fieldnames:
                dfields.append(f)
           
        else: # WTF? No header? JERK.
            fieldnum = 0
            #print 'If you don\'t have a header, how the fuck will you recognize the fields in Tableau?'
            for f in csvreader.fieldnames:
                dfields.append('field'+str(fieldnum))
                fieldnum = fieldnum + 1
            csvreader = csv.DictReader(csvfile, delimiter=csvdelimiter, quotechar=csvquotechar, fieldnames=dfields)
            # we have to make our own field names
           
        for row in csvreader:
            for i in dfields:
                dtypes.append(str(type(datatyper(row[i]))))
            break # got shit, we're out

        csvfile.seek(0) # BACK TO THE FRONT! (AGAIN!)



        os.chdir(targetdir)
        # tde.Extract() happily re-opens an existing TDE file, then addTable() bombs
        # later with 'duplicate table name' - so clear out any leftovers first
        if os.path.isfile(targetdir + tdefilename):
            os.remove(targetdir + tdefilename)
        if os.path.isfile(targetdir + 'DataExtract.log'):
            os.remove(targetdir + 'DataExtract.log') # might as well erase this bitch too
        tdefile = tde.Extract(targetdir + tdefilename)

        # ok lets build the table definition in TDE with our list of names and types first
        # replacing literals with TDE datatype integers, etc
        tableDef = tde.TableDefinition() #create a new table def

        numfields = len(dfields)
        #print numfields

        if rowoutput == True:
            print '*** field names list ***' # debug
        for t in range(numfields):
            fieldtypeo = dtypes[t].replace("<type '","").replace("'>","").replace("<class '","").replace('NoneType','str').replace('uuid.UUID','str')
            fieldname = dfields[t]
            fieldtype = str(fieldtypeo).replace("str","15").replace("datetime.datetime","13").replace("int","7").replace("decimal.Decimal","10").replace("float","10").replace("uuid.UUID","15").replace("bool","11")
            if rowoutput == True:
                print fieldname + '  (looks like ' + fieldtypeo +', TDE datatype ' + fieldtype + ')'  # debug
            try:
                tableDef.addColumn(fieldname, int(fieldtype)) # if we pass a non-int to fieldtype, it'll fail
            except:
                tableDef.addColumn(fieldname, 15) # if we get a weird type we don't recognize, just make it a string
        if rowoutput == True:
            print '***'
            time.sleep(5) # wait 5 seconds so you can actually read shit!

        # ok, lets print out the table def we just made, for shits and giggles
        if rowoutput == True:
            print '################## TDE table definition created ######################'
            for c in range(0,tableDef.getColumnCount()):
                print 'Column: ' + str(tableDef.getColumnName(c)) + ' Type: ' + str(tableDef.getColumnType(c))
            time.sleep(5) # wait 5 seconds so you can actually read shit!

        # ok lets add the new def as a table in the extract
        tabletran = tdefile.addTable("Extract",tableDef)

        # time to start pumping rows!

        rowsinserted = 0 # count rows actually inserted (starting at 1 overcounted by one)

        # if we have a header, we don't want to try and process it
        if hasheader == True:
            csvreader.next()
        print '[',
        for row in csvreader:

            if rowoutput == True: # row deets, else just '.'
                print '************** INSERTING ROW NUMBER: ' + str(rowsinserted) + '**************' # debug output
            else: # only print dot every 50 records
                if (rowsinserted%dotsevery) == 0:
                    print '=',

            columnposition = 0
            newrow = tde.Row(tableDef)

            for t in range(numfields):
                fieldtype = dtypes[t].replace("<type '","").replace("'>","").replace("<class '","").replace('NoneType','str').replace('uuid.UUID','str')
                fieldname = dfields[t]
               
                if rowoutput == True: # column deets
                    print str(columnposition) + ' ' + fieldname + ':   ' + str(row[fieldname]) + ' (' + str(fieldtype).split('.')[0] + ')' # debug output

                if fieldtype == 'str':
                    if row[fieldname] != None: # we don't want no None!
                        newrow.setCharString(columnposition, str(row[fieldname]))
                    else:
                        newrow.setNull(columnposition) # ok, put that None here

                if fieldtype == 'int':
                    if row[fieldname] != None:
                        newrow.setInteger(columnposition, datatyper(row[fieldname]))
                    else:
                        newrow.setNull(columnposition)

                if fieldtype == 'bool':
                    if row[fieldname] != None:
                        newrow.setBoolean(columnposition, datatyper(row[fieldname]))
                    else:
                        newrow.setNull(columnposition)

                if fieldtype == 'float' or fieldtype == 'decimal.Decimal': # plain floats go in as doubles too
                    if row[fieldname] != None:
                        newrow.setDouble(columnposition, datatyper(row[fieldname]))
                    else:
                        newrow.setNull(columnposition)

                if fieldtype == 'datetime.datetime': # sexy datetime splitting
                    if row[fieldname] != None:
                        strippeddate = str(datatyper(row[fieldname])).split('.')[0] # just in case we get microseconds (not all datetime uses them)
                        timechunks = time.strptime(str(strippeddate), "%Y-%m-%d %H:%M:%S") # chunky style!
                        newrow.setDateTime(columnposition, timechunks[0], timechunks[1], timechunks[2], timechunks[3], timechunks[4], timechunks[5], 0000)
                    else:
                        newrow.setNull(columnposition)
           
                columnposition = columnposition + 1 # we gots to know what column number we're working on!
            tabletran.insert(newrow) # finally insert buffered row into TDE 'table'
            newrow.close()
            rowsinserted = rowsinserted + 1

        # ok let's write out that file and get back to making dinner
        tdefile.close()
        csvfile.close()
        print '] '
        print '('+str(intWithCommas(rowsinserted))+' rows)' # to clear out the row on command line

        plist = []

        # timing purposes for debugging / optimizing / FUN! This is FUN, Lars.
        timetaken = time.time() - start_time
        plist.append(timetaken)
        plist.append(rowsinserted)
        plist.append(os.path.getsize(sourcedir + csvfilename))
        plist.append(os.path.getsize(targetdir + tdefilename)) # the TDE landed in targetdir, not sourcedir
        fileperf[str(csvfilename)] = plist

# just for our "result time"
totaltime = 0
totalrecords = 0
totalpresize = 0
totalpostsize = 0

print ' '

for p in fileperf:
    print p + '     ' + str(intWithCommas(int(fileperf[p][0]))) + ' seconds     processed ' + str(intWithCommas(fileperf[p][1])) + ' records. Resulting TDE file is ' + str(showhumanfilesize(fileperf[p][3])) + ' (source was ' + str(showhumanfilesize(fileperf[p][2])) + ')'
    totaltime = fileperf[p][0] + totaltime
    totalrecords = fileperf[p][1] + totalrecords
    totalpresize = fileperf[p][2] + totalpresize
    totalpostsize = fileperf[p][3] + totalpostsize

if len(fileperf) > 1:
    print ' '
    print 'TOTAL RUN        ' + str(intWithCommas(int(totaltime))) + ' seconds      processed ' + str(intWithCommas(totalrecords)) + ' records - crunched ' + str(showhumanfilesize(totalpresize)) + ' of text into ' + str(showhumanfilesize(totalpostsize)) + ' of binary sexiness'

Issues? Let me know below and I can fix them! Maybe.

  • Srini

    I keep getting an error

    newrow.setInteger(columnposition, datatyper(row[fieldname]))

    File "C:\Python27\lib\site-packages\dataextract\Base.py", line 220, in setInteger, c_int(value)

    TypeError: an integer is required

    • Chad Kovac

      same

  • Jura

    Hi Ryan, thank you for sharing this python script with us, very nice code lines.

    Ryan, I copied and pasted your code and the first time I ran it, it was OK, but when I try to run it again I'm getting the error you can see below.
    Can we get a fix or a workaround?

    Traceback (most recent call last):

    File “C:/Users/olivejur/Documents/Extract API/csv2TDE_v2.py”, line 173, in

    tabletran = tdefile.addTable("Extract",tableDef)

    File "C:\Python27\lib\site-packages\dataextract\Base.py", line 484, in addTable

    raise Exceptions.TableauException(result, wstring_at(tablib.TabGetLastErrorMessage()))

    TableauException: TableauException (303): duplicate table name

    • Dan

      Hi Jura, not sure if you figured this out yet, but if you’re getting this error try manually deleting the tde file you already created before running it again.

  • Dan

    Hey Ryan,

    You’re the man. Really appreciate your posts and explanation of using the TDE API.

    Do you know of any way to multiprocess with Python to speed these scripts up? I have a decent size CSV (~400,000 observations of ~300 variables), and Python defaults to only using 1 core to do the processing. Takes about 15 minutes to process what Tableau Desktop can do in 1-1.5 minutes, and I need a script that runs comparably.

    So far all of my attempts have led to pickling / ctypes pointer errors. Do you know of any workarounds to get better CPU utilization? I posted threads below as well:

    http://stackoverflow.com/questions/24533207/issues-with-python-pickling-multiprocessing-and-tableau-data-extract-api/24534579#24534579

    http://community.tableausoftware.com/thread/145286

    Any ideas are much appreciated!

    -Dan

  • manish jain

    Hi Ryan

    Can we read a TDE extract file using the Python API?
    I want to implement a data backload into the extract