RiffBank – Parsing arbitrary Text-based Guitar Tab into an Indexable and Queryable “RiffCode for ElasticSearch

29 August 2013 0 Comments

RiffBank – Parsing arbitrary Text-based Guitar Tab into an Indexable and Queryable “RiffCode for ElasticSearch

Quick Links: This post is on the approach taken to get guitar tab into a normalized and data-structured form. See previous posts for more context.

Part 0: Why Guitar Tab?
Part 1: Text Tab to "RiffCode" (this post)
Part 2: Riff Storage and Querying in ElasticSearch
Part 3: Simple UI display with PHP-FatFree and Twitter Bootstrap

Guitar tablature is meant for human readability...
not for machine consumption.

Granted it's "procedural" and "linear" already, but it's also column-based AND row-based at the same time (readers read down a short row and then over) - you are dealing with text chunks that are easily understandable by a human, but require a lot of "context" and rules for a machine to decipher. Not to mention the fact that it's hand-written by humans, which is another error waiting to happen.

Aside from the "how to do this" aspect, I also had to create a system to 'normalize' tab into a consistent format that lent itself to being queried properly.

The solution? "RiffCode"

My initial implementation goes like this:

encode single notes and chords into "pseudo-words"
turn those riff sections into "sentences"
capture note meta-data when possible (palm-muting, etc)

By storing the data this way I can use full-text search technology to try and gleam results from (which, with ElasticSearch, worked quite well).

In practice it looks / works like this:

input

E-|----------------------------------------------------------------|
B-|----------------------------------------------------------------|
G-|*--------------------------------------------------------------*|
D-|*-----------------------------------------------5--------------*|
A-|---7-7-5-7---------7-7-5-----------7-7-5-7------5---------------|
E-|-0---------------0-------7-6-5---0--------------3---6-5-0-3-5---|

output

6a 5h 5h 5f 5h 6a 5h 5h 5f 6h 6g 6f 6a 5h 5h 5f 5h 6d5f4f 6g 6f 6a 6d 6f

The concept is to pivot each "column line" of tab and keep it as "lossless" as possible (including extraneous spaces) using a basic letter system for fret number and number for strings. The system has some shortcomings (over 26 frets), but is adequate (if not damn good) for 95% of tab.

Ok, now how do we do this a million times over? - Python.

Since I wasn't even sure this was going to work - I wrote fast and carelessly. So what we have is a very inelegant solution that iterates over the text several times and creates several dictionaries - and then re-constructs it at the end.

It's one cluster fuck of a text-parsing function, but it works, and it's fast enough.

"No time for love, Dr. Jones!"

def GenerateRiffCodeFromText(x): # x being the giant text string...

riff_number = 0
commonchar = None
string_line = None
lineresults = {} # or dict()

for lineno, linestr in enumerate(x):

linestr = linestr.rstrip('\r\n') #.strip()
first3char = linestr[0:3]
if len(linestr) > 0: # not counting some common accent symbols in case tab author was crazy w solo accents
prev_commonchar = commonchar
mod_linestr = linestr.replace('~',' ').replace('\\',' ').replace('^',' ' )
commonchar = (collections.Counter(mod_linestr).most_common(1)[0])[0] # [0] is digit, [1] is freq
if commonchar == ' ':
try:
commonchar = (collections.Counter(mod_linestr).most_common(2)[1])[0] # [0] is digit, [1] is freq
except:
pass # ? not sure see 2x4

else:
prev_commonchar = commonchar
commonchar = 'DIVIDER'

# find (probable) string lines and label them
if (commonchar == '-' and linestr.find('P') < 0) or (commonchar.isdigit() and linestr.find('-') > 0): # so we don't grab the Palm Mute line...
if string_line == None:
prev_string_line = string_line
string_line = 1
else:
prev_string_line = string_line
string_line = string_line + 1
#lineresults[lineno] = string_line
else:
prev_string_line = string_line
string_line = None

# find (probable) meta / PM line ?
if linestr.find('P') > 2 and string_line == None:
prev_string_line = string_line
string_line = 0

# find (possible) section headers
if len(linestr)<30 and commonchar.isalpha() and string_line == None and commonchar <> 'DIVIDER':
riff_name = 'riff name?'
else:
riff_name = None

#print "line number: " + str(lineno) + ": " + linestr.rstrip()+' ',
#print '{' + commonchar + ' ' + str(string_line) + ' ' + str(riff_name) +'} Rnum' + str(riff_number) #+ 'prevst'+str(prev_string_line)

if prev_string_line == 6:
riff_number = riff_number + 1
elif (commonchar == 'DIVIDER' and prev_string_line < 6 and prev_string_line > None):
riff_number = riff_number + 1

# add all this shit to a dick(t)
if not lineresults.has_key(riff_number):
lineresults[riff_number] = {}
lineresults[riff_number][string_line] = {'linestr':linestr.rstrip(), 'commonchar':commonchar, 'string_line':string_line, 'riff_name':riff_name, 'riff_number':riff_number}
#lineresults[riff_number] = {'linestr':linestr.rstrip(), 'lineno':lineno, 'string_line':string_line, 'riff_name':riff_name, 'riff_number':riff_number}

#print 'done: ' + str(len(lineresults)) + ' lines'
#pp.pprint(lineresults)

linelengths = {}

# get longest line ?
for rnum in lineresults:
linelengths[rnum] = 999
for ln in lineresults[rnum]:
if ln > 0:
if len(lineresults[rnum][ln]['linestr']) < linelengths[rnum]:
linelengths[rnum] = len(lineresults[rnum][ln]['linestr'])
if linelengths[rnum] == 999:
linelengths[rnum] = 0

result_dict = {}

for rnum in lineresults:
result_dict[rnum] = {}

raw_lines = ''

try:
raw_lines += str(lineresults[rnum][None]['linestr']) + "\n"
riff_name = str(lineresults[rnum][None]['linestr'])
except:
pass
try:
raw_lines += str(lineresults[rnum][0]['linestr']) + "\n"
except:
pass
try:
raw_lines += str(lineresults[rnum][1]['linestr']) + "\n"
except:
pass
try:
raw_lines += str(lineresults[rnum][2]['linestr']) + "\n"
except:
pass
try:
raw_lines += str(lineresults[rnum][3]['linestr']) + "\n"
except:
pass
try:
raw_lines += str(lineresults[rnum][4]['linestr']) + "\n"
except:
pass
try:
raw_lines += str(lineresults[rnum][5]['linestr']) + "\n"
except:
pass
try:
raw_lines += str(lineresults[rnum][6]['linestr']) + "\n"
except:
pass

result_dict[rnum]['raw_lines'] = raw_lines

# alphabetize fret nums
for i in lineresults[rnum]:
lineresults[rnum][i]['linestr_a'] = AlphabatizeFrets(lineresults[rnum][i]['linestr'])

s = ''

for column in range(linelengths[rnum]):
try:

# first attempt at simple PM recording..
try:
##print lineresults[rnum][6]['linestr'][column],
if lineresults[rnum][0]['linestr_a'][column] <> ' ':
s += '#'
else:
s += '0' # record nothing?
except:
##print '?',
s += '0'

try:
##print lineresults[rnum][6]['linestr'][column],
s += lineresults[rnum][6]['linestr_a'][column].replace(' ','-')
except:
##print '?',
s += '?'
try:
##print lineresults[rnum][5]['linestr'][column],
s += lineresults[rnum][5]['linestr_a'][column].replace(' ','-')
except:
##print '?',
s += '?'
try:
##print lineresults[rnum][4]['linestr'][column],
s += lineresults[rnum][4]['linestr_a'][column].replace(' ','-')
except:
##print '?',
s += '?'
try:
##print lineresults[rnum][3]['linestr'][column],
s += lineresults[rnum][3]['linestr_a'][column].replace(' ','-')
except:
##print '?',
s += '?'
try:
##print lineresults[rnum][2]['linestr'][column],
s += lineresults[rnum][2]['linestr_a'][column].replace(' ','-')
except:
##print '?',
s += '?'
try:
##print lineresults[rnum][1]['linestr'][column],
s += lineresults[rnum][1]['linestr_a'][column].replace(' ','-')
###sz = lineresults[rnum][1]['linestr_a'][column].replace(' ','-')
except:
##print '?',
s += '?'
#print ',',
s += ' '
#print s,

except:
pass #test

#changing some chars for a test indexing run
# constructing the "LONG CODE"
#print s # original s code

# check first "note" for bar notes
if ("##" in s[:8] or "B" in s[:8] or "E" in s[:8] or "A" in s[:8] or "?" in s[:8] or "||" in s[:8] or "::" in s[:8]) and "-" not in s[:8]:
s = s[8:]
if ("##" in s[:8] or "B" in s[:8] or "E" in s[:8] or "A" in s[:8] or "?" in s[:8] or "||" in s[:8] or "::" in s[:8]) and "-" not in s[:8]:
s = s[8:]

# check late "note" for bar notes
if ("##" in s[-8:] or "E" in s[-8:] or "A" in s[-8:] or "?" in s[-8:] or "||" in s[-8:]) and "-" not in s[-8:]:
s = s[:-8]
if ("##" in s[-8:] or "E" in s[-8:] or "A" in s[-8:] or "?" in s[-8:] or "||" in s[-8:]) and "-" not in s[-8:]:
s = s[:-8]

#s = s.replace('0EADGBE','').replace('0||||||','').replace('0::::::','')
#s = s.replace('|||','')

s = s.replace('0||||||','|') # riff bar seperators
s = s.replace('#||||||','|') # riff bar seperators
s = s.replace('?','-') # temp - work on it later (mark out missing tabbed strings)

if ("| " in s[:2]):
s = s[2:]

# now to change LONG CODE to SHORT CODE

long_code = s
long_code_list = long_code.split()
short_code = ''

for w in long_code_list:
#print w
if w == '|':
w = '|'
ww = '|'
elif w == '0------':
w = '.'
ww = '.'
else:
wd = {}
cnt = 0
cntf= 0
for l in w:

if l == '#' or l == '0' or l == '-' or l == '/' or l == '(' or l == ')' or l == '\\':
wd[cnt] = str(l)
else:
wd[cnt] = str(7-cnt)+str(l)

cnt = cnt+1
#elif w[0:1] in [a-z]:
# print 'tt'
#pp.pprint(wd)
ww = ''
for i in range(0,7):
try: # ? changed last minute
ww += wd[i]
except:
pass

#short_code += str(w)+' '
wdee = str(ww)
wdee = wdee.replace('0-----','')
wdee = wdee.replace('0----','')
wdee = wdee.replace('0---','')
wdee = wdee.replace('0--','')
wdee = wdee.replace('0-','')
wdee = wdee.replace('0','')
wdee = wdee.replace('#-----','*')
wdee = wdee.replace('#----','*')
wdee = wdee.replace('#---','*')
wdee = wdee.replace('#--','*')
wdee = wdee.replace('#-','*')
wdee = wdee.replace('#','*')

if wdee[-5:] == '-----':
wdee = wdee[:-5]
elif wdee[-4:] == '----':
wdee = wdee[:-4]
elif wdee[-3:] == '---':
wdee = wdee[:-3]
elif wdee[-2:] == '--':
wdee = wdee[:-2]
elif wdee[-1:] == '-':
wdee = wdee[:-1]

if "H" in wdee:
wdee = '>'
if "P" in wdee:
wdee = '<'

short_code += str(wdee)+' '

result_dict[rnum]['short_code'] = short_code
result_dict[rnum]['long_code'] = long_code
result_dict[rnum]['riff_name'] = riff_name

#return short_code
return result_dict

Text parsing from hell!

Anyways, that function is part of the module that I use to read the tab file into a JSON format that I can then insert into my "riff" ElasticSearch index - that is, of course, AFTER I pull the raw tab out of my "scraping" index. :)

Note: I'm using the amazing Requests module instead of a specific ES python module.

# ryan robitaille
# prototype [riffbank / riffwords / riffml / riffql / riffjson] "encoder" script
# 7/3/2013

import requests, pprint, json, urllib
import time, os, string, sys, time, collections, hashlib
import riff_coder # custom
from random import choice

pp = pprint.PrettyPrinter(indent=3)

# get a random ES box each time
es_boxes = ['192.168.xxx.234','192.168.xxx.115','192.168.xxx.47','192.168.xxx.241','192.168.xxx.191','localhost']

payload = { 'query': { 'bool': { 'must': [ { 'match_all': { } } ], 'must_not': [ { 'term': { 'incoming.riff_indexed': 3 } } ], 'should': [ ] } } }
rrr = requests.get("http://localhost:9200/scraper/incoming/_search?from=0&size=5000", data=json.dumps(payload))
resp = rrr.json
rr = resp()

for i in rr['hits']['hits']:
print i['_id']
#pp.pprint(i)

es_box = choice(es_boxes)

x = i['_source']['raw_text'].splitlines(True)
riff_coder_dict = riff_coder.GenerateRiffCodeFromText(x)

#for each in riff_Coder_dict, insert ALL above fields plus rnum, raw_riff, short_code (no need for long code)
#then update scraper record as riff_indexed
print '----------------',i['_source']['artist_name'],' - ',i['_source']['song_name'],'---------------- ',es_box
#pp.pprint(riff_coder_dict)

try:
spotify_album_released = i['_source']['spotify_album_released']
spotify_album_href = i['_source']['spotify_album_href']
spotify_album_name = i['_source']['spotify_album_name']
spotify_track_href = i['_source']['spotify_track_href']
spotify_track_popularity = i['_source']['spotify_track_popularity']
except:
spotify_album_released = ''
spotify_album_href = ''
spotify_album_name = ''
spotify_track_href = ''
spotify_track_popularity = ''

try:
tabversion = i['_source']['tab_version']
except:
tabversion = 0

for r in riff_coder_dict:

url_hash = hashlib.sha1(i['_source']['source_url']).hexdigest()+'__'+str(r) # add domain of URL?
payload = { 'artist_id':i['_source']['artist_id'], \
'artist_name':i['_source']['artist_name'], \
'artist_terms':i['_source']['artist_terms'], \
'audio_summary':i['_source']['audio_summary'], \
'images':i['_source']['images'], \
'location':i['_source']['location'], \
'lookup_name':i['_source']['lookup_name'], \
'lookup_song_name':i['_source']['lookup_song_name'], \

'riff_num':r, \
'riff_name':riff_coder_dict[r]['riff_name'], \
'riff_code':riff_coder_dict[r]['short_code'], \
'riff_text':riff_coder_dict[r]['raw_lines'], \
'long_code':riff_coder_dict[r]['long_code'], \

'similar_artists':i['_source']['similar_artists'], \
'song_id':i['_source']['song_id'], \
'song_name':i['_source']['song_name'], \
'source_url':i['_source']['source_url'], \
'tab_version':tabversion, \
'text_hash':hashlib.sha1(i['_source']['raw_text']).hexdigest(), \

'track_data_7digital':i['_source']['track_data_7digital'], \
'track_data_spotify':i['_source']['track_data_spotify'], \

'spotify_data': { 'album_released':spotify_album_released, \
'album_href':spotify_album_href, \
'album_name':spotify_album_name, \
'track_href':spotify_track_href, \
'track_popularity':spotify_track_popularity } }
inr = requests.put("http://"+es_box+":9200/riffs/single/"+str(url_hash), data=json.dumps(payload)) # no need for response
print inr.text
print '----------------',i['_source']['artist_name'],' - ',i['_source']['song_name'],'----------------'

updpayload = { 'script': { "script" : "ctx._source.riff_indexed = 3" } }
upd = requests.post("http://"+es_box+":9200/scraper/incoming/"+i['_id']+"/_update", data=json.dumps(updpayload))
print upd.text
print ''
print ''

So now I've got each tab file (song) split into many "riff-based" JSON documents in my ElasticSearch system... (with a lot of extra meta-data picked up along the way - I'll write another post on searching Spotify and Echonest)

What's next? Getting it OUT in a meaningful way...

Tags: elasticsearch, Python, riffbank, tablature

Ryan Robitaille

Business Technologist, Data Artist, Problem Solver

RiffBank – Parsing arbitrary Text-based Guitar Tab into an Indexable and Queryable “RiffCode for ElasticSearch

Guitar tablature is meant for human readability...
not for machine consumption.

Aside from the "how to do this" aspect, I also had to create a system to 'normalize' tab into a consistent format that lent itself to being queried properly.

The solution? "RiffCode"

"No time for love, Dr. Jones!"

Text parsing from hell!

What's next? Getting it OUT in a meaningful way...

Fun Data Visualizations, Analytics

Sweet Data Visualizations

Categories

Fun Data Visualizations, Analytics

Links

Search

Recent Comments

Ryan Robitaille

Business Technologist, Data Artist, Problem Solver

RiffBank – Parsing arbitrary Text-based Guitar Tab into an Indexable and Queryable “RiffCode for ElasticSearch

Guitar tablature is meant for human readability...not for machine consumption.

Aside from the "how to do this" aspect, I also had to create a system to 'normalize' tab into a consistent format that lent itself to being queried properly.

The solution? "RiffCode"

"No time for love, Dr. Jones!"

Text parsing from hell!

What's next? Getting it OUT in a meaningful way...

Related Posts

Fun Data Visualizations, Analytics

Sweet Data Visualizations

Categories

Fun Data Visualizations, Analytics

Links

Search

Recent Comments

Crazy-Ass Category Cloud

Guitar tablature is meant for human readability...
not for machine consumption.