Build Tableau Data Extracts out of CSV Files? More Python TDE API madness!
So here we have the 3rd in a series about using Tableau 8's 'Data Extract' API to automatically create TDE files from various data sources without using the desktop client. This time we're focusing on good ole' Comma Separated files.
Much like my SQL Server script, this one tries to guess field names and field data types in order to produce as good of an extract as possible - I've also added a little command line progress bar for your FILE WATCHING ENJOYMENT.
You know, in case you're into shit like that.
Plus, I report on file sizes and row counts! OMG! You ALSO have the ability to run MULTIPLE files in a directory - in case you have a bunch that need to be crunched into their own TDE files - that's right, Christmas come early! (You're welcome)
I used some census data files for testing (as well as Metallica, Bigfoot, and UFO data - but that's a given) as you can see below. I gotta, say processing CSV files is decent speedy - even though I haven't tried to process any HUGE files yet (2GB+).
Bit of explanation about the script parameters:
- cvsfilenamemask = '.csv' It's basically an "ends with" string - so you can use something like '.csv' to process ALL CSV files in a directory, or simply change it to something like 'input.csv' to ONLY process that file from that dir.
- sourcedir = 'C:\\Python27\\' Fairly obvious - this is the folder that the script will look in for any files matching 'csvfilenamemask' above. You can also use UNC file paths like ''\\\\ryrobesxps\d$\\'. Remember to double up your slashes accordingly!
- targetdir = 'C:\\Python27\\' Same as above, except this it the target dir that the resulting TDE file will be written to.
- csvdelimiter = ',' You should know this one.
- csvquotechar = '"' AND this one, but who knows.
- rowoutput = False Set to 'True' to have all the row and column info echoed into the command line window - it's good for debugging data and script errors but it slows things down like 10X (literally). Who knew that printing out thousands of things would be so expensive! ;)
Here's a sample output from some tests I did earlier
###########################################################################
Now working on cbp10co.csv (2,155,390) -> cbp10co.tde (59,871 rows per =)
###########################################################################
[ = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ]
(2,155,390 rows)
###########################################################################
Now working on cbp10st.csv (456,410) -> cbp10st.tde (12,678 rows per =)
###########################################################################
[ = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = ]
(456,410 rows)
###########################################################################
Now working on PlayerStats.csv (23) -> PlayerStats.tde (10 rows per =)
###########################################################################
[ = = ]
(23 rows)
###########################################################################
Now working on simple.csv (32) -> simple.tde (10 rows per =)
###########################################################################
[ = = = ]
(31 rows)
cbp10co.csv 383 seconds 2,155,390 rows. TDE file is 172.9MB (source was 42.2MB)
cbp10st.csv 431 seconds 456,410 rows. TDE file is 114.6MB (source was 43.5MB)
PlayerStats.csv 4 seconds 23 rows. TDE file is 3.5KB (source was 57.4KB)
simple.csv 1 seconds 31 rows. TDE file is 659.0bytes (source was 34.4KB)
TOTAL RUN 820 seconds 2,611,854 rows - 287.5MB of text into 85.8MB of data sex!
C:\Python27>
Sexy, right? Indeed. Ok, here it is, it's a whopper, so get that Cntrl-C hand ready...
Wh3r3's th3 b33f? Oh, nevermind.
# creating Tableau Data Extracts via CSV files
import csv, os, time
from datetime import datetime
import dataextract as tde #saves some typing, cause i'm a lazy fucker
################ PARAMETERS FOR YOU, CODE MONKEY! ##########################
cvsfilenamemask = 's.csv' # can be explicit 'thisfile.csv' for one file - or open '.csv' for all that match
sourcedir = 'C:\\Python27\\' # need to double up the \\s | windows shares use like this '\\\\ryrobesxps\d$\' etc
targetdir = 'C:\\Python27\\' # can't be a share or UNC path
csvdelimiter = ',' # obvious!
csvquotechar = '"' # obvious!
rowoutput = False # useful for debugging data errors / slows shit down a lot however
################ PARAMETERS FOR YOU, CODE MONKEY! ##########################
# Note: if you have less than a few thousand rows, the progress bar will be a bit fucked looking.
fileperf = dict() # for saving each files execution times
# since the CSV module imports all fields as strings regardless of what they are..
def datatyper(n): # force some data types to figure shit out
try: # kind of lame.... BUT IT WORKS
x = int(n)
return int(n)
except:
try:
x = float(n)
return float(n)
except:
try:
date_object = datetime.strptime(n, '%m/%d/%Y')
return date_object
except:
try:
date_object = datetime.strptime(n, '%Y-%m-%d')
return date_object
except:
if n == 'NULL': # just in case, don't want any literal NULLs in there
return None
elif len(n) > 0:
return str(n)
else: # no need to return an empty string, let's NULL that shit out
return None
# end ugly data types function
def showhumanfilesize(num):
for x in ['bytes','KB','MB','GB']:
if num < 1024.0:
return "%3.1f%s" % (num, x)
num /= 1024.0
return "%3.1f%s" % (num, 'TB')
def intWithCommas(x):
if type(x) not in [type(0), type(0L)]:
raise TypeError("Parameter must be an integer.")
if x < 0:
return '-' + intWithCommas(-x)
result = ''
while x >= 1000:
x, r = divmod(x, 1000)
result = ",%03d%s" % (r, result)
return "%d%s" % (x, result)
def file_lines(fname):
with open(fname) as f:
for i, l in enumerate(f):
pass
return i + 1
#print ' '
#print '[ Note: Each . = ' +str(dotsevery)+ ' rows processed ]'
os.chdir(sourcedir)
for csvfilename in os.listdir("."):
if csvfilename.endswith(cvsfilenamemask):
tdefilename = csvfilename.split('.')[0]+'.tde'
linez = file_lines(sourcedir + csvfilename)
if linez > 36:
dotsevery = linez/36
else:
dotsevery = 10
print ' '
print '###########################################################################'
print ' Now working on ' + csvfilename + ' ('+str(intWithCommas(linez))+') -> ' + tdefilename + ' (' + str(intWithCommas(dotsevery)) + ' rows per =)'
print '###########################################################################'
#print dotsevery
time.sleep(5) # so you can read it.
# BEGIN MULTI FILE LOOP
start_time = time.time() # simple timing for test purposes
try:
# taking a sample of the file
csvfile = open(csvfilename, 'rb')
toplines = csvfile.readlines()
filebuffer = '' # empty string
for i in range(dotsevery):
filebuffer = filebuffer + toplines[i]
hasheader = csv.Sniffer().has_header(filebuffer) # csvfile.read() / filebuffer
except:
hasheader = False
# ok lets go
csvfile.seek(0) # YOU WILL DO, WHAT I SAY, WHEN I SAY! BACK TO THE FRONT!
csvreader = csv.DictReader(csvfile, delimiter=csvdelimiter, quotechar=csvquotechar)
dfields = []
dtypes = []
if hasheader == True:
for f in csvreader.fieldnames:
dfields.append(f)
else: # WTF? No header? JERK.
fieldnum = 0
#print 'If you don\'t have a header, how the fuck will you recognize the fields in Tableau?'
for f in csvreader.fieldnames:
dfields.append('field'+str(fieldnum))
fieldnum = fieldnum + 1
csvreader = csv.DictReader(csvfile, delimiter=csvdelimiter, quotechar=csvquotechar, fieldnames=dfields)
# we have to make our own field names
for row in csvreader:
for i in dfields:
dtypes.append(str(type(datatyper(row[i]))))
break # got shit, we're out
csvfile.seek(0) # BACK TO THE FRONT! (AGAIN!)
os.chdir(targetdir)
try: # Just in case the file exists already, we don't want to bomb out
tdefile = tde.Extract(tdefilename) # in CWD
except:
os.system('del ' + targetdir + tdefilename)
os.system('del DataExtract.log') #might as well erase this bitch too
tdefile = tde.Extract(targetdir + tdefilename)
# ok lets build the table definition in TDE with our list of names and types first
# replacing literals with TDE datatype integers, etc
tableDef = tde.TableDefinition() #create a new table def
numfields = len(dfields)
#print numfields
if rowoutput == True:
print '*** field names list ***' # debug
for t in range(numfields):
fieldtypeo = dtypes[t].replace("<type '","").replace("'>","").replace("<class '","").replace('NoneType','str').replace('uuid.UUID','str')
fieldname = dfields[t]
fieldtype = str(fieldtypeo).replace("str","15").replace("datetime.datetime","13").replace("int","7").replace("decimal.Decimal","10").replace("float","10").replace("uuid.UUID","15").replace("bool","11")
if rowoutput == True:
print fieldname + ' (looks like ' + fieldtypeo +', TDE datatype ' + fieldtype + ')' # debug
try:
tableDef.addColumn(fieldname, int(fieldtype)) # if we pass a non-int to fieldtype, it'll fail
except:
tableDef.addColumn(fieldname, 15) # if we get a weird type we don't recognize, just make it a string
if rowoutput == True:
print '***'
time.sleep(5) # wait 5 seconds so you can actually read shit!
# ok, lets print out the table def we just made, for shits and giggles
if rowoutput == True:
print '################## TDE table definition created ######################'
for c in range(0,tableDef.getColumnCount()):
print 'Column: ' + str(tableDef.getColumnName(c)) + ' Type: ' + str(tableDef.getColumnType(c))
time.sleep(5) # wait 5 seconds so you can actually read shit!
# ok lets add the new def as a table in the extract
tabletran = tdefile.addTable("Extract",tableDef)
# time to start pumping rows!
rowsinserted = 1
# if we have a header, we don't want to try and process it
if hasheader == True:
csvreader.next()
print '[',
for row in csvreader:
if rowoutput == True: # row deets, else just '.'
print '************** INSERTING ROW NUMBER: ' + str(rowsinserted) + '**************' # debug output
else: # only print dot every 50 records
if (rowsinserted%dotsevery) == 0:
print '=',
columnposition = 0
newrow = tde.Row(tableDef)
for t in range(numfields):
fieldtype = dtypes[t].replace("<type '","").replace("'>","").replace("<class '","").replace('NoneType','str').replace('uuid.UUID','str')
fieldname = dfields[t]
if rowoutput == True: # column deets
print str(columnposition) + ' ' + fieldname + ': ' + str(row[fieldname]) + ' (' + str(fieldtype).split('.')[0] + ')' # debug output
if fieldtype == 'str':
if row[fieldname] != None: # we don't want no None!
newrow.setCharString(columnposition, str(row[fieldname]))
else:
newrow.setNull(columnposition) # ok, put that None here
if fieldtype == 'int':
if row[fieldname] != None:
newrow.setInteger(columnposition, datatyper(row[fieldname]))
else:
newrow.setNull(columnposition)
if fieldtype == 'bool':
if row[fieldname] != None:
newrow.setBoolean(columnposition, datatyper(row[fieldname]))
else:
newrow.setNull(columnposition)
if fieldtype == 'decimal.Decimal':
if row[fieldname] != None:
newrow.setDouble(columnposition, datatyper(row[fieldname]))
else:
newrow.setNull(columnposition)
if fieldtype == 'datetime.datetime': # sexy datetime splitting
if row[fieldname] != None:
strippeddate = str(datatyper(row[fieldname])).split('.')[0] # just in case we get microseconds (not all datetime uses them)
timechunks = time.strptime(str(strippeddate), "%Y-%m-%d %H:%M:%S") # chunky style!
newrow.setDateTime(columnposition, timechunks[0], timechunks[1], timechunks[2], timechunks[3], timechunks[4], timechunks[5], 0000)
else:
newrow.setNull(columnposition)
columnposition = columnposition + 1 # we gots to know what column number we're working on!
tabletran.insert(newrow) # finally insert buffered row into TDE 'table'
newrow.close()
rowsinserted = rowsinserted + 1
# ok let's write out that file and get back to making dinner
tdefile.close()
csvfile.close()
print '] '
print '('+str(intWithCommas(rowsinserted))+' rows)' # to clear out the row on command line
plist = []
# timing purposes for debugging / optimizing / FUN! This is FUN, Lars.
timetaken = time.time() - start_time
plist.append(timetaken)
plist.append(rowsinserted)
plist.append(os.path.getsize(sourcedir + csvfilename))
plist.append(os.path.getsize(sourcedir + tdefilename))
fileperf[str(csvfilename)] = plist
# just for our "result time"
totaltime = 0
totalrecords = 0
totalpresize = 0
totalpostsize = 0
print ' '
for p in fileperf:
print p + ' ' + str(intWithCommas(int(fileperf[p][0]))) + ' seconds processed ' + str(intWithCommas(fileperf[p][1])) + ' records. Resulting TDE file is ' + str(showhumanfilesize(fileperf[p][3])) + ' (source was ' + str(showhumanfilesize(fileperf[p][2])) + ')'
totaltime = fileperf[p][0] + totaltime
totalrecords = fileperf[p][1] + totalrecords
totalpresize = fileperf[p][2] + totalpresize
totalpostsize = fileperf[p][3] + totalpostsize
if len(fileperf) > 1:
print ' '
print 'TOTAL RUN ' + str(intWithCommas(int(totaltime))) + ' seconds processed ' + str(intWithCommas(totalrecords)) + ' records - crunched ' + str(showhumanfilesize(totalpresize)) + ' of text into ' + str(showhumanfilesize(totalpostsize)) + ' of binary sexiness'
Issues? Let me know below and I can fix them! Maybe.