MoneyBall (for Real): Business Intelligence, Data Mining, SQL Server, and Major League Baseball – Part 1
Let's be honest - I'm a data guy. "Geek" to be more exact. But, I'm also a sports guy - and if there is ANY sport out there that prides itself as being massively data-centric... its Baseball. Data goes with baseball like Barry Bonds and... um, Home Runs (yeah, that's it). Data collecting is an integral part of baseball culture and has been going on way before the internet or relational databases were even invented.
"Business Intelligence" is a term that's been kicked around around for some time now, and basically just means "analyzing data from your past, in order to BETTER make better decisions in the future (or to better steer it in realtime)". Its all about strategy, learning from mistakes (and successes), and being able to actively monitor / measure the health of your business. We might use this data to ask important business related questions: "Is Suzi Salesperson performing up to par this quarter?" or "What line of products have the biggest margin this time of year?".
However, its not too far of a stretch to imagine someone who works for a Major League Baseball team sitting in front of a computer at the head office thinking: "What is the ROI of our pitching staff this year? Are they performing to expectations?" Its really no different then the business and reporting scenarios many of us encounter every single day. The exciting part is not only being able to read what has happened out on the ball field yesterday - but what might happen tomorrow. Boom. "Bizball Intelligence" anyone?
In this multi-part post, I'll be taking you through all the steps needed to get up and running with your own historical AND current Major League Baseball statistics database (an operational / transactional DB), staging it out in a more "reportable" fashion (a data warehouse DB), and then finally building some cubes, calculated measures, and choice "Player Key Performance Indicators" (KPIs) based on the data. Who knows, maybe you'll be so good at it that you'll get hired by the San Diego Padres front-office, like this guy did, but more on him later.
So grab some sunflower seeds and a calculator - adjust your jock straps and get ready, 'cause here we go.
[ Part I: DATA ]
First up. We need some data (actually, a TON of data), and (more importantly) we need a place to store and easily access it. Its database creation time.
Here at C&C, we are big fans of the Microsoft SQL Server 2008 stack of products (Database, Analysis Services, & Reporting Services), and I'll be using SQL Server 2008 R2 for this tutorial. You can get a 6 month eval copy, (you can also install the free SQL Server Express product to get up and running with a database at no cost - but you'll be missing out on the Analysis Services and Data Mining pieces).
There are lots of places with baseball data on the web, but sometimes its incomplete, unofficial, not granular enough, and organized badly for what we need it for. Besides, we can't do as much with stats that are ALREADY calculated - we need to calculate them ourselves in order to Data Mine properly (although those "other" data sources come in handy to verify that our own calculation formulas are correct). So in my mind, the grand-poobah of MLB data collection is (shockingly enough) MLB.com.
MLB.com's "GameDay" application uses a XML structure to read info about every game, inning, pitch, hit and player. The "full-monty" of data only goes back to about 2006 or so (pitch location, coordinates, speeds, etc also known as "Pitch/FX Data"), but less granular data is available for prior years (and even more from other methods, I'll post about later).
Browse around their backend, and you'll see what I mean: http://gd2.mlb.com/components/game/mlb/. From there it goes into year/ month/ day/ game/ inning, linescore, batters, pitchers, etc. etc. all packaged up into XML files for the reading. Now before you think this is some kind of illegal backdoor in MLB's servers, its not - well, not really. MLB has been aware that people have been using this data for about 4 years now. If they REALLY wanted to lock people out of it, they would have done it long ago. So feel free to browse it 100% guilt-free. The data is near realtime too! Pretty slick, MLB.
linescore.xml Example
<game id="2011/04/26/sfnmlb-pitmlb-1" venue="PNC Park" game_pk="287242" time="7:05"
time_zone="ET"
ampm="PM"
away_time="4:05"
away_time_zone="PT"
away_ampm="PM"
home_time="7:05"
home_time_zone="ET"
home_ampm="PM"
game_type="R"
time_aw_lg="7:05"
aw_lg_ampm="PM"
tz_aw_lg_gen="ET"
time_hm_lg="7:05"
hm_lg_ampm="PM"
tz_hm_lg_gen="ET"
venue_id="31"
scheduled_innings="9"
away_name_abbrev="SF"
home_name_abbrev="PIT"
away_code="sfn"
away_file_code="sf"
away_team_id="137"
away_team_city="San Francisco"
away_team_name="Giants"
away_division="W"
away_league_id="104"
away_sport_code="mlb"
home_code="pit"
home_file_code="pit"
home_team_id="134"
home_team_city="Pittsburgh"
home_team_name="Pirates"
home_division="C"
home_league_id="104"
home_sport_code="mlb"
day="TUE"
gameday_sw="P"
venue_w_chan_loc="USPA1290"
gameday_link="2011_04_26_sfnmlb_pitmlb_1"
away_win="11"
away_loss="11"
home_win="10"
home_loss="13"
league="NN"
top_inning="N"
inning_state=""
status="Final"
ind="F"
inning="10"
outs="3"
away_team_runs="3"
home_team_runs="2"
away_team_hits="7"
home_team_hits="5"
away_team_errors="1"
home_team_errors="1"
wrapup_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&mode=wrap&c_id=mlb"
home_preview_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&mode=preview&c_id=mlb"
away_preview_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&mode=preview&c_id=mlb"
preview="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&mode=preview&c_id=mlb"
tv_station="ROOT SPORTS"
home_recap_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&mode=recap&c_id=pit"
away_recap_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&mode=recap&c_id=sf"
photos_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&mode=photos">
<linescore inning="1" home_inning_runs="0" away_inning_runs="0"/>
<linescore inning="2" home_inning_runs="1" away_inning_runs="0"/>
<linescore inning="3" home_inning_runs="0" away_inning_runs="0"/>
<linescore inning="4" home_inning_runs="0" away_inning_runs="0"/>
<linescore inning="5" home_inning_runs="0" away_inning_runs="0"/>
<linescore inning="6" home_inning_runs="1" away_inning_runs="1"/>
<linescore inning="7" home_inning_runs="0" away_inning_runs="0"/>
<linescore inning="8" home_inning_runs="0" away_inning_runs="1"/>
<linescore inning="9" home_inning_runs="0" away_inning_runs="0"/>
<linescore inning="10" home_inning_runs="0" away_inning_runs="1"/>
<winning_pitcher first_name="Sergio" first="Sergio" id="489265" last_name="Romo" last="Romo"
name_display_roster="Romo"
wins="1"
losses="0"
era="1.04"
s_wins=""
s_losses=""
s_era=""/>
<losing_pitcher first_name="Joel" first="Joel" id="430629" last_name="Hanrahan" last="Hanrahan"
name_display_roster="Hanrahan"
wins="0"
losses="1"
era="2.25"
s_wins=""
s_losses=""
s_era=""/>
<save_pitcher first_name="Brian" first="Brian" id="451216" last_name="Wilson" last="Wilson"
name_display_roster="Wilson, B"
wins="0"
losses="1"
era="8.64"
s_wins=""
s_losses=""
s_era=""
saves="6"/>
<game_media>
<media type="game" calendar_event_id="14-287242-2011-04-26"
start="2011-04-26T19:05:00-0400"
title="SF @ PIT"
has_mlbtv="true"
free="NO"
media_state="media_archive"
thumbnail="http://mediadownloads.mlb.com/mlbam/preview/sfnpit_287242_preview.jpg"/>
</game_media>
</game>
So how do we get this mountain of XML files into our database?
Well, that's easy. In 2010 Wells Oliver created a python script and associated libraries that reads the XML Gameday data and inserts it into a relational database format. Wells wrote the script to work with MySQL and (also) hasn't updated it in some time (the MLB data format is always changing, and expanding, especially in 2011). So I took his fine work, rewrote it to use Microsoft SQL Server and added / changed / enhanced portions of its logic and the database schema. Anyways, I have "forked" his project, as they say in software development circles.
Download the Python code here: gameday_mssql.zip
Download the SQL Server Create table script here: gameday_mssql.sql
It requires:
- Python 2.6 (probably works with any 2.X ABOVE 2.6, but haven't tried yet)
- Pymssql Python Library
- BeautifulSoup Python Library
Download it, expand it - (I use "C:\Gameday" myself), and then modify the 'db.ini' file in the root folder to configure your database instance.
I'm working on a development SQL instance, so my db.ini looks like this:
user=sa
password=*******
database=mlbgameday
host=myhostname
(If you need to use Windows Authentication / Trusted Auth, stay tuned because I'll be adding that shortly)
Obviously you're going to want to create a 'mlbgameday' (or whatever you desire to call it) database on the server first AND you're going to have to create the schema frist.
Once you're all set, the script is used like this: (make sure that the python directory is in your PATH)
The only required argument is "year", but I would shy away from trying to fetch a whole year at one time for now. The multiple threads tend to overwhelm the pymssql library and max out some internal max connections setting (the TDS lib MAX_CONNECTIONS). For "type" you can fetch 'mlb' data or 'aaa' data, which might be useful in our analysis, but not a neccessity.
In order to gather a couple years worth of data painlessly, there is ANOTHER script I put together in the package called build_all.py - simply open this file with a text editor and modify these variables near the top:
end_date = date(2011,4,30)
type = 'mlb'
Now, running "python build_all.py" will crawl through each date in your specified date range gathering data from all games on each day. Crawling through 6 years of MLB data might take a day or so - but once you have it, you have it - and updating it each day to get the previous days games should only take a minute or so (more on that in the next post).
Sample Output:
python gameday_mssql.py --year=2006 --month=09 --day=22 --type=mlb
Working on: 2006/09/22/anamlb-oakmlb-1
Working on: 2006/09/22/arimlb-lanmlb-1
Working on: 2006/09/22/atlmlb-colmlb-1
Working on: 2006/09/22/bosmlb-tormlb-1
Working on: 2006/09/22/clemlb-texmlb-1
Working on: 2006/09/22/detmlb-kcamlb-1
Working on: 2006/09/22/flomlb-phimlb-1
Working on: 2006/09/22/minmlb-balmlb-1
Working on: 2006/09/22/nyamlb-tbamlb-1
Working on: 2006/09/22/seamlb-chamlb-1
Working on: 2006/09/22/sfnmlb-milmlb-1
Working on: 2006/09/22/slnmlb-houmlb-1
Working on: 2006/09/22/wasmlb-nynmlb-1
Finished. Total time: 00:08:23
Now building: 2006-09-23
python gameday_mssql.py --year=2006 --month=09 --day=23 --type=mlb
Working on: 2006/09/23/anamlb-oakmlb-1
Working on: 2006/09/23/arimlb-lanmlb-1
Etc... etc...
Fire it up and you'll have more data then you know what to do with - what DO we do with it? That's next...