Build Tableau Data Extracts out of CSV Files? More Python TDE API madness!

7 December 2012 14 Comments

Build Tableau Data Extracts out of CSV Files? More Python TDE API madness!

So here we have the 3rd in a series about using Tableau 8's 'Data Extract' API to automatically create TDE files from various data sources without using the desktop client. This time we're focusing on good ole' Comma Separated files.

Much like my SQL Server script, this one tries to guess field names and field data types in order to produce as good of an extract as possible - I've also added a little command line progress bar for your FILE WATCHING ENJOYMENT.

You know, in case you're into shit like that.

Plus, I report on file sizes and row counts! OMG! You ALSO have the ability to run MULTIPLE files in a directory - in case you have a bunch that need to be crunched into their own TDE files - that's right, Christmas come early! (You're welcome)

I used some census data files for testing (as well as Metallica, Bigfoot, and UFO data - but that's a given) as you can see below. I gotta, say processing CSV files is decent speedy - even though I haven't tried to process any HUGE files yet (2GB+).

Bit of explanation about the script parameters:

cvsfilenamemask = '.csv' It's basically an "ends with" string - so you can use something like '.csv' to process ALL CSV files in a directory, or simply change it to something like 'input.csv' to ONLY process that file from that dir.
sourcedir = 'C:\\Python27\\' Fairly obvious - this is the folder that the script will look in for any files matching 'csvfilenamemask' above. You can also use UNC file paths like ''\\\\ryrobesxps\d$\\'. Remember to double up your slashes accordingly!
targetdir = 'C:\\Python27\\' Same as above, except this it the target dir that the resulting TDE file will be written to.
csvdelimiter = ',' You should know this one.
csvquotechar = '"' AND this one, but who knows.
rowoutput = False Set to 'True' to have all the row and column info echoed into the command line window - it's good for debugging data and script errors but it slows things down like 10X (literally). Who knew that printing out thousands of things would be so expensive! ;)

Here's a sample output from some tests I did earlier

C:\Python27>python CSV_to_TDE.py

###########################################################################
Now working on cbp10co.csv (2,155,390) -> cbp10co.tde (59,871 rows per =)
###########################################################################
[ = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ]
(2,155,390 rows)

###########################################################################
Now working on cbp10st.csv (456,410) -> cbp10st.tde (12,678 rows per =)
###########################################################################
[ = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ]
(456,410 rows)

###########################################################################
Now working on PlayerStats.csv (23) -> PlayerStats.tde (10 rows per =)
###########################################################################
[ = = ]
(23 rows)

###########################################################################
Now working on simple.csv (32) -> simple.tde (10 rows per =)
###########################################################################
[ = = = ]
(31 rows)

cbp10co.csv 383 seconds 2,155,390 rows. TDE file is 172.9MB (source was 42.2MB)
cbp10st.csv 431 seconds 456,410 rows. TDE file is 114.6MB (source was 43.5MB)
PlayerStats.csv 4 seconds 23 rows. TDE file is 3.5KB (source was 57.4KB)
simple.csv 1 seconds 31 rows. TDE file is 659.0bytes (source was 34.4KB)

TOTAL RUN 820 seconds 2,611,854 rows - 287.5MB of text into 85.8MB of data sex!

C:\Python27>

Sexy, right? Indeed. Ok, here it is, it's a whopper, so get that Cntrl-C hand ready...

Wh3r3's th3 b33f? Oh, nevermind.

Note: if you have less than 20 rows or so, my "fancy" header checking could get confused. Why you would have a file like that, I have no idea - just sayin'.

# ryan robitialle (12/6/2012)
# creating Tableau Data Extracts via CSV files

import csv, os, time
from datetime import datetime
import dataextract as tde #saves some typing, cause i'm a lazy fucker

################ PARAMETERS FOR YOU, CODE MONKEY! ##########################
cvsfilenamemask = 's.csv' # can be explicit 'thisfile.csv' for one file - or open '.csv' for all that match
sourcedir = 'C:\\Python27\\' # need to double up the \\s | windows shares use like this '\\\\ryrobesxps\d$\' etc
targetdir = 'C:\\Python27\\' # can't be a share or UNC path
csvdelimiter = ',' # obvious!
csvquotechar = '"' # obvious!
rowoutput = False # useful for debugging data errors / slows shit down a lot however
################ PARAMETERS FOR YOU, CODE MONKEY! ##########################

# Note: if you have less than a few thousand rows, the progress bar will be a bit fucked looking.

fileperf = dict() # for saving each files execution times

# since the CSV module imports all fields as strings regardless of what they are..
def datatyper(n): # force some data types to figure shit out
try: # kind of lame.... BUT IT WORKS
x = int(n)
return int(n)
except:
try:
x = float(n)
return float(n)
except:
try:
date_object = datetime.strptime(n, '%m/%d/%Y')
return date_object
except:
try:
date_object = datetime.strptime(n, '%Y-%m-%d')
return date_object
except:
if n == 'NULL': # just in case, don't want any literal NULLs in there
return None
elif len(n) > 0:
return str(n)
else: # no need to return an empty string, let's NULL that shit out
return None
# end ugly data types function

def showhumanfilesize(num):
for x in ['bytes','KB','MB','GB']:
if num < 1024.0:
return "%3.1f%s" % (num, x)
num /= 1024.0
return "%3.1f%s" % (num, 'TB')

def intWithCommas(x):
if type(x) not in [type(0), type(0L)]:
raise TypeError("Parameter must be an integer.")
if x < 0:
return '-' + intWithCommas(-x)
result = ''
while x >= 1000:
x, r = divmod(x, 1000)
result = ",%03d%s" % (r, result)
return "%d%s" % (x, result)

def file_lines(fname):
with open(fname) as f:
for i, l in enumerate(f):
pass
return i + 1

#print ' '
#print '[ Note: Each . = ' +str(dotsevery)+ ' rows processed ]'

os.chdir(sourcedir)
for csvfilename in os.listdir("."):
if csvfilename.endswith(cvsfilenamemask):
tdefilename = csvfilename.split('.')[0]+'.tde'
linez = file_lines(sourcedir + csvfilename)
if linez > 36:
dotsevery = linez/36
else:
dotsevery = 10
print ' '
print '###########################################################################'
print ' Now working on ' + csvfilename + ' ('+str(intWithCommas(linez))+') -> ' + tdefilename + ' (' + str(intWithCommas(dotsevery)) + ' rows per =)'
print '###########################################################################'
#print dotsevery
time.sleep(5) # so you can read it.

# BEGIN MULTI FILE LOOP
start_time = time.time() # simple timing for test purposes

try:
# taking a sample of the file
csvfile = open(csvfilename, 'rb')
toplines = csvfile.readlines()
filebuffer = '' # empty string
for i in range(dotsevery):
filebuffer = filebuffer + toplines[i]
hasheader = csv.Sniffer().has_header(filebuffer) # csvfile.read() / filebuffer
except:
hasheader = False

# ok lets go
csvfile.seek(0) # YOU WILL DO, WHAT I SAY, WHEN I SAY! BACK TO THE FRONT!
csvreader = csv.DictReader(csvfile, delimiter=csvdelimiter, quotechar=csvquotechar)

dfields = []
dtypes = []

if hasheader == True:
for f in csvreader.fieldnames:
dfields.append(f)

else: # WTF? No header? JERK.
fieldnum = 0
#print 'If you don\'t have a header, how the fuck will you recognize the fields in Tableau?'
for f in csvreader.fieldnames:
dfields.append('field'+str(fieldnum))
fieldnum = fieldnum + 1
csvreader = csv.DictReader(csvfile, delimiter=csvdelimiter, quotechar=csvquotechar, fieldnames=dfields)
# we have to make our own field names

for row in csvreader:
for i in dfields:
dtypes.append(str(type(datatyper(row[i]))))
break # got shit, we're out

csvfile.seek(0) # BACK TO THE FRONT! (AGAIN!)

os.chdir(targetdir)
try: # Just in case the file exists already, we don't want to bomb out
tdefile = tde.Extract(tdefilename) # in CWD
except:
os.system('del ' + targetdir + tdefilename)
os.system('del DataExtract.log') #might as well erase this bitch too
tdefile = tde.Extract(targetdir + tdefilename)

# ok lets build the table definition in TDE with our list of names and types first
# replacing literals with TDE datatype integers, etc
tableDef = tde.TableDefinition() #create a new table def

numfields = len(dfields)
#print numfields

if rowoutput == True:
print '*** field names list ***' # debug
for t in range(numfields):
fieldtypeo = dtypes[t].replace("<type '","").replace("'>","").replace("<class '","").replace('NoneType','str').replace('uuid.UUID','str')
fieldname = dfields[t]
fieldtype = str(fieldtypeo).replace("str","15").replace("datetime.datetime","13").replace("int","7").replace("decimal.Decimal","10").replace("float","10").replace("uuid.UUID","15").replace("bool","11")
if rowoutput == True:
print fieldname + ' (looks like ' + fieldtypeo +', TDE datatype ' + fieldtype + ')' # debug
try:
tableDef.addColumn(fieldname, int(fieldtype)) # if we pass a non-int to fieldtype, it'll fail
except:
tableDef.addColumn(fieldname, 15) # if we get a weird type we don't recognize, just make it a string
if rowoutput == True:
print '***'
time.sleep(5) # wait 5 seconds so you can actually read shit!

# ok, lets print out the table def we just made, for shits and giggles
if rowoutput == True:
print '################## TDE table definition created ######################'
for c in range(0,tableDef.getColumnCount()):
print 'Column: ' + str(tableDef.getColumnName(c)) + ' Type: ' + str(tableDef.getColumnType(c))
time.sleep(5) # wait 5 seconds so you can actually read shit!

# ok lets add the new def as a table in the extract
tabletran = tdefile.addTable("Extract",tableDef)

# time to start pumping rows!

rowsinserted = 1

# if we have a header, we don't want to try and process it
if hasheader == True:
csvreader.next()
print '[',
for row in csvreader:

if rowoutput == True: # row deets, else just '.'
print '************** INSERTING ROW NUMBER: ' + str(rowsinserted) + '**************' # debug output
else: # only print dot every 50 records
if (rowsinserted%dotsevery) == 0:
print '=',

columnposition = 0
newrow = tde.Row(tableDef)

for t in range(numfields):
fieldtype = dtypes[t].replace("<type '","").replace("'>","").replace("<class '","").replace('NoneType','str').replace('uuid.UUID','str')
fieldname = dfields[t]

if rowoutput == True: # column deets
print str(columnposition) + ' ' + fieldname + ': ' + str(row[fieldname]) + ' (' + str(fieldtype).split('.')[0] + ')' # debug output

if fieldtype == 'str':
if row[fieldname] != None: # we don't want no None!
newrow.setCharString(columnposition, str(row[fieldname]))
else:
newrow.setNull(columnposition) # ok, put that None here

if fieldtype == 'int':
if row[fieldname] != None:
newrow.setInteger(columnposition, datatyper(row[fieldname]))
else:
newrow.setNull(columnposition)

if fieldtype == 'bool':
if row[fieldname] != None:
newrow.setBoolean(columnposition, datatyper(row[fieldname]))
else:
newrow.setNull(columnposition)

if fieldtype == 'decimal.Decimal':
if row[fieldname] != None:
newrow.setDouble(columnposition, datatyper(row[fieldname]))
else:
newrow.setNull(columnposition)

if fieldtype == 'datetime.datetime': # sexy datetime splitting
if row[fieldname] != None:
strippeddate = str(datatyper(row[fieldname])).split('.')[0] # just in case we get microseconds (not all datetime uses them)
timechunks = time.strptime(str(strippeddate), "%Y-%m-%d %H:%M:%S") # chunky style!
newrow.setDateTime(columnposition, timechunks[0], timechunks[1], timechunks[2], timechunks[3], timechunks[4], timechunks[5], 0000)
else:
newrow.setNull(columnposition)

columnposition = columnposition + 1 # we gots to know what column number we're working on!
tabletran.insert(newrow) # finally insert buffered row into TDE 'table'
newrow.close()
rowsinserted = rowsinserted + 1

# ok let's write out that file and get back to making dinner
tdefile.close()
csvfile.close()
print '] '
print '('+str(intWithCommas(rowsinserted))+' rows)' # to clear out the row on command line

plist = []

# timing purposes for debugging / optimizing / FUN! This is FUN, Lars.
timetaken = time.time() - start_time
plist.append(timetaken)
plist.append(rowsinserted)
plist.append(os.path.getsize(sourcedir + csvfilename))
plist.append(os.path.getsize(sourcedir + tdefilename))
fileperf[str(csvfilename)] = plist

# just for our "result time"
totaltime = 0
totalrecords = 0
totalpresize = 0
totalpostsize = 0

print ' '

for p in fileperf:
print p + ' ' + str(intWithCommas(int(fileperf[p][0]))) + ' seconds processed ' + str(intWithCommas(fileperf[p][1])) + ' records. Resulting TDE file is ' + str(showhumanfilesize(fileperf[p][3])) + ' (source was ' + str(showhumanfilesize(fileperf[p][2])) + ')'
totaltime = fileperf[p][0] + totaltime
totalrecords = fileperf[p][1] + totalrecords
totalpresize = fileperf[p][2] + totalpresize
totalpostsize = fileperf[p][3] + totalpostsize

if len(fileperf) > 1:
print ' '
print 'TOTAL RUN ' + str(intWithCommas(int(totaltime))) + ' seconds processed ' + str(intWithCommas(totalrecords)) + ' records - crunched ' + str(showhumanfilesize(totalpresize)) + ' of text into ' + str(showhumanfilesize(totalpostsize)) + ' of binary sexiness'

Issues? Let me know below and I can fix them! Maybe.

Tags: csv, Python, tableau 8, tableau data extracts

Srini

I keep getting an error

newrow.setInteger(columnposition, datatyper(row[fieldname]))

File “C:Python27libsite-packagesdataextractBase.py”, line 220, in setInteger, c_int(value)

TypeError: an integer is required
- Chad Kovac
  
  same
- Earththing
  
  I got the same error. Worked first time, then not again. Not sure how it worked, but I figured out WHY this error occured.
  
  In my case the sniffer function thought a column in my data was an int, but buried in a few rows were strings. I don’t yet have a better sniffer solution. For testing the rest of this script, I basically disabled the sniffer by having it say everything was a string. Not ideal…
Jura

Hi Ryan, thank you for share this python script with us, very nice code lines.

Ryan, I copy and past your code and for the first time I ran it was ok, but when I try do run again , I’m getting a error as you can see below:
Can we get a fix or a workaround?

Traceback (most recent call last):

File “C:/Users/olivejur/Documents/Extract API/csv2TDE_v2.py”, line 173, in

tabletran = tdefile.addTable(“Extract”,tableDef)

File “C:Python27libsite-packagesdataextractBase.py”, line 484, in addTable

raise Exceptions.TableauException(result, wstring_at(tablib.TabGetLastErrorMessage()))

TableauException: TableauException (303): duplicate table name
- Dan
  
  Hi Jura, not sure if you figured this out yet, but if you’re getting this error try manually deleting the tde file you already created before running it again.
Dan

Hey Ryan,

You’re the man. Really appreciate your posts and explanation of using the TDE API.

Do you know of any way to multiprocess with Python to speed these scripts up? I have a decent size CSV (~400,000 observations of ~300 variables), and Python defaults to only using 1 core to do the processing. Takes about 15 minutes to process what Tableau Desktop can do in 1-1.5 minutes, and I need a script that runs comparably.

So far all of my attempts have led to pickling / ctypes pointer errors. Do you know of any workarounds to get better CPU utilization? I posted threads below as well:

http://stackoverflow.com/questions/24533207/issues-with-python-pickling-multiprocessing-and-tableau-data-extract-api/24534579#24534579

http://community.tableausoftware.com/thread/145286

Any ideas are much appreciated!

-Dan
manish jain

Hi Ryan

Can we read a tde extract file using python API ?
I want to implement data backload in the extract
Taylor

What is the final TDE file called? I can’t find where the codes saves it to.
Mike

You can also try our Advanced ETL Processor it supports creating tableau tde files and It works with 30 different datasources such as Text, XML, Excel, Access,
DBF, Foxpro, Paradox, ODBC, BDE, OLE DB, MS Sql Server, Oracle, MySql,
MariaDB, PostgreSQL, Greenplum, Firebird, Interbase, SQLite, POP3,
IMAP4, HL7, HTTP, File System, Unicode, RSS, Windows Event Log, Google
Spreadsheets, SalesForce, BrightPearl and QlikView.
rushi

I am getting the same error as mentioned by Srini:

CSV_to_TDE.py”, line 208, in

newrow.setInteger(columnposition, datatyper(row[fieldname]))

File “C:Python27libsite-packagesdataextractBase.py”, line 248, in setInteger

, c_int(value)

TypeError: an integer is required

Ryan, Could you please help me in fixing this error ?
Peder

Hi Ryan
Thank you for sharing your fantastic code. It was just what I am looking for. It is working fine, but I get an error, when there are Danish letters as ‘æ’, ‘ø’ and å in the csv file. Do you know how I can make it work? It must be something with the encoding I think.
I get this error:
“Traceback (most recent call last):
File “C:Python27PFO_programmerConvertCsvToTde_pfo1.py”, line 209, in
newrow.setInteger(columnposition, datatyper(row[fieldname]))
File “C:Python27dataextractBase.py”, line 248, in setInteger
, c_int(value)
TypeError: an integer is required
>>>”
ubbu

Thank for sharing. Did any one created a Tableau tde from Excel(xlsx) file using Python??How to convert Excle(xlsx) file to tde??
DENA SHIPE

Simply amazing article! You touched on a topical issue. I would appreciate if you’d written about how to fill a form online. You’ll forget about paperwork when you try PDFfiller. CA EJ-130 can be filled out in 5-10 mins here

https://goo.gl/yynP99
EVELYNN HEIDEN

Good article, Thanks!

Ryan Robitaille

Business Technologist, Data Artist, Problem Solver

Build Tableau Data Extracts out of CSV Files? More Python TDE API madness!

Here's a sample output from some tests I did earlier

Wh3r3's th3 b33f? Oh, nevermind.

Fun Data Visualizations, Analytics

Sweet Data Visualizations

Categories

Fun Data Visualizations, Analytics

Links

Search

Recent Comments

Ryan Robitaille

Business Technologist, Data Artist, Problem Solver

Build Tableau Data Extracts out of CSV Files? More Python TDE API madness!

Here's a sample output from some tests I did earlier

Wh3r3's th3 b33f? Oh, nevermind.

Related Posts

Fun Data Visualizations, Analytics

Sweet Data Visualizations

Categories

Fun Data Visualizations, Analytics

Links

Search

Recent Comments

Crazy-Ass Category Cloud