16 September 2011 5 Comments

MoneyBall (for Real): Business Intelligence, Data Mining, SQL Server, and Major League Baseball – Part 1

Let's be honest - I'm a data guy. "Geek" to be more exact. But, I'm also a sports guy - and if there is ANY sport out there that prides itself as being massively data-centric... its Baseball. Data goes with baseball like Barry Bonds and... um, Home Runs (yeah, that's it). Data collecting is an integral part of baseball culture and has been going on way before the internet or relational databases were even invented.

"Business Intelligence" is a term that's been kicked around around for some time now, and basically just means "analyzing data from your past, in order to BETTER make better decisions in the future (or to better steer it in realtime)". Its all about strategy, learning from mistakes (and successes), and being able to actively monitor / measure the health of your business. We might use this data to ask important business related questions: "Is Suzi Salesperson performing up to par this quarter?" or "What line of products have the biggest margin this time of year?".

However, its not too far of a stretch to imagine someone who works for a Major League Baseball team sitting in front of a computer at the head office thinking: "What is the ROI of our pitching staff this year? Are they performing to expectations?" Its really no different then the business and reporting scenarios many of us encounter every single day. The exciting part is not only being able to read what has happened out on the ball field yesterday - but what might happen tomorrow. Boom. "Bizball Intelligence" anyone?

In this multi-part post, I'll be taking you through all the steps needed to get up and running with your own historical AND current Major League Baseball statistics database (an operational / transactional DB), staging it out in a more "reportable" fashion (a data warehouse DB), and then finally building some cubes, calculated measures, and choice "Player Key Performance Indicators" (KPIs) based on the data. Who knows, maybe you'll be so good at it that you'll get hired by the San Diego Padres front-office, like this guy did, but more on him later.

So grab some sunflower seeds and a calculator - adjust your jock straps and get ready, 'cause here we go.

[ Part I: DATA ]

First up. We need some data (actually, a TON of data), and (more importantly) we need a place to store and easily access it. Its database creation time.

Here at C&C, we are big fans of the Microsoft SQL Server 2008 stack of products (Database, Analysis Services, & Reporting Services), and I'll be using SQL Server 2008 R2 for this tutorial. You can get a 6 month eval copy, (you can also install the free SQL Server Express product to get up and running with a database at no cost - but you'll be missing out on the Analysis Services and Data Mining pieces).

There are lots of places with baseball data on the web, but sometimes its incomplete, unofficial, not granular enough, and organized badly for what we need it for. Besides, we can't do as much with stats that are ALREADY calculated - we need to calculate them ourselves in order to Data Mine properly (although those "other" data sources come in handy to verify that our own calculation formulas are correct). So in my mind, the grand-poobah of MLB data collection is (shockingly enough) MLB.com.

MLB.com's "GameDay" application uses a XML structure to read info about every game, inning, pitch, hit and player. The "full-monty" of data only goes back to about 2006 or so (pitch location, coordinates, speeds, etc also known as "Pitch/FX Data"), but less granular data is available for prior years (and even more from other methods, I'll post about later).

Browse around their backend, and you'll see what I mean: http://gd2.mlb.com/components/game/mlb/. From there it goes into year/ month/ day/ game/ inning, linescore, batters, pitchers, etc. etc. all packaged up into XML files for the reading. Now before you think this is some kind of illegal backdoor in MLB's servers, its not - well, not really. MLB has been aware that people have been using this data for about 4 years now. If they REALLY wanted to lock people out of it, they would have done it long ago. So feel free to browse it 100% guilt-free. The data is near realtime too! Pretty slick, MLB.

linescore.xml Example

<?xml version="1.0" encoding="UTF-8"?>
<game id="2011/04/26/sfnmlb-pitmlb-1" venue="PNC Park" game_pk="287242" time="7:05"
     time_zone="ET"
     ampm="PM"
     away_time="4:05"
     away_time_zone="PT"
     away_ampm="PM"
     home_time="7:05"
     home_time_zone="ET"
     home_ampm="PM"
     game_type="R"
     time_aw_lg="7:05"
     aw_lg_ampm="PM"
     tz_aw_lg_gen="ET"
     time_hm_lg="7:05"
     hm_lg_ampm="PM"
     tz_hm_lg_gen="ET"
     venue_id="31"
     scheduled_innings="9"
     away_name_abbrev="SF"
     home_name_abbrev="PIT"
     away_code="sfn"
     away_file_code="sf"
     away_team_id="137"
     away_team_city="San Francisco"
     away_team_name="Giants"
     away_division="W"
     away_league_id="104"
     away_sport_code="mlb"
     home_code="pit"
     home_file_code="pit"
     home_team_id="134"
     home_team_city="Pittsburgh"
     home_team_name="Pirates"
     home_division="C"
     home_league_id="104"
     home_sport_code="mlb"
     day="TUE"
     gameday_sw="P"
     venue_w_chan_loc="USPA1290"
     gameday_link="2011_04_26_sfnmlb_pitmlb_1"
     away_win="11"
     away_loss="11"
     home_win="10"
     home_loss="13"
     league="NN"
     top_inning="N"
     inning_state=""
     status="Final"
     ind="F"
     inning="10"
     outs="3"
     away_team_runs="3"
     home_team_runs="2"
     away_team_hits="7"
     home_team_hits="5"
     away_team_errors="1"
     home_team_errors="1"
     wrapup_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&amp;mode=wrap&amp;c_id=mlb"
     home_preview_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&amp;mode=preview&amp;c_id=mlb"
     away_preview_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&amp;mode=preview&amp;c_id=mlb"
     preview="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&amp;mode=preview&amp;c_id=mlb"
     tv_station="ROOT SPORTS"
     home_recap_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&amp;mode=recap&amp;c_id=pit"
     away_recap_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&amp;mode=recap&amp;c_id=sf"
     photos_link="/mlb/gameday/index.jsp?gid=2011_04_26_sfnmlb_pitmlb_1&amp;mode=photos">
   <linescore inning="1" home_inning_runs="0" away_inning_runs="0"/>
   <linescore inning="2" home_inning_runs="1" away_inning_runs="0"/>
   <linescore inning="3" home_inning_runs="0" away_inning_runs="0"/>
   <linescore inning="4" home_inning_runs="0" away_inning_runs="0"/>
   <linescore inning="5" home_inning_runs="0" away_inning_runs="0"/>
   <linescore inning="6" home_inning_runs="1" away_inning_runs="1"/>
   <linescore inning="7" home_inning_runs="0" away_inning_runs="0"/>
   <linescore inning="8" home_inning_runs="0" away_inning_runs="1"/>
   <linescore inning="9" home_inning_runs="0" away_inning_runs="0"/>
   <linescore inning="10" home_inning_runs="0" away_inning_runs="1"/>
   <winning_pitcher first_name="Sergio" first="Sergio" id="489265" last_name="Romo" last="Romo"
                   name_display_roster="Romo"
                   wins="1"
                   losses="0"
                   era="1.04"
                   s_wins=""
                   s_losses=""
                   s_era=""/>
   <losing_pitcher first_name="Joel" first="Joel" id="430629" last_name="Hanrahan" last="Hanrahan"
                  name_display_roster="Hanrahan"
                  wins="0"
                  losses="1"
                  era="2.25"
                  s_wins=""
                  s_losses=""
                  s_era=""/>
   <save_pitcher first_name="Brian" first="Brian" id="451216" last_name="Wilson" last="Wilson"
                name_display_roster="Wilson, B"
                wins="0"
                losses="1"
                era="8.64"
                s_wins=""
                s_losses=""
                s_era=""
                saves="6"/>
   <game_media>
      <media type="game" calendar_event_id="14-287242-2011-04-26"
            start="2011-04-26T19:05:00-0400"
            title="SF @ PIT"
            has_mlbtv="true"
            free="NO"
            media_state="media_archive"
            thumbnail="http://mediadownloads.mlb.com/mlbam/preview/sfnpit_287242_preview.jpg"/>
   </game_media>
</game>

So how do we get this mountain of XML files into our database?

Well, that's easy. In 2010 Wells Oliver created a python script and associated libraries that reads the XML Gameday data and inserts it into a relational database format. Wells wrote the script to work with MySQL and (also) hasn't updated it in some time (the MLB data format is always changing, and expanding, especially in 2011). So I took his fine work, rewrote it to use Microsoft SQL Server and added / changed / enhanced portions of its logic and the database schema. Anyways, I have "forked" his project, as they say in software development circles.

Download the Python code here: gameday_mssql.zip

Download the SQL Server Create table script here: gameday_mssql.sql

It requires:

Download it, expand it - (I use "C:\Gameday" myself), and then modify the 'db.ini' file in the root folder to configure your database instance.

I'm working on a development SQL instance, so my db.ini looks like this:

[db]
user=sa
password=*******
database=mlbgameday
host=myhostname

(If you need to use Windows Authentication / Trusted Auth, stay tuned because I'll be adding that shortly)

Obviously you're going to want to create a 'mlbgameday' (or whatever you desire to call it) database on the server first AND you're going to have to create the schema frist.

Once you're all set, the script is used like this: (make sure that the python directory is in your PATH)

C:\gameday> python gameday_mssql.py --year=2011 --month=4 --day=28 --verbose --type=mlb

The only required argument is "year", but I would shy away from trying to fetch a whole year at one time for now. The multiple threads tend to overwhelm the pymssql library and max out some internal max connections setting (the TDS lib MAX_CONNECTIONS). For "type" you can fetch 'mlb' data or 'aaa' data, which might be useful in our analysis, but not a neccessity.

In order to gather a couple years worth of data painlessly, there is ANOTHER script I put together in the package called build_all.py - simply open this file with a text editor and modify these variables near the top:

start_date = date(2005,1,1)
end_date = date(2011,4,30)
type = 'mlb'

Now, running "python build_all.py" will crawl through each date in your specified date range gathering data from all games on each day. Crawling through 6 years of MLB data might take a day or so - but once you have it, you have it - and updating it each day to get the previous days games should only take a minute or so (more on that in the next post).

Sample Output:

Now building: 2006-09-22
python gameday_mssql.py --year=2006 --month=09 --day=22 --type=mlb
Working on: 2006/09/22/anamlb-oakmlb-1
Working on: 2006/09/22/arimlb-lanmlb-1
Working on: 2006/09/22/atlmlb-colmlb-1
Working on: 2006/09/22/bosmlb-tormlb-1
Working on: 2006/09/22/clemlb-texmlb-1
Working on: 2006/09/22/detmlb-kcamlb-1
Working on: 2006/09/22/flomlb-phimlb-1
Working on: 2006/09/22/minmlb-balmlb-1
Working on: 2006/09/22/nyamlb-tbamlb-1
Working on: 2006/09/22/seamlb-chamlb-1
Working on: 2006/09/22/sfnmlb-milmlb-1
Working on: 2006/09/22/slnmlb-houmlb-1
Working on: 2006/09/22/wasmlb-nynmlb-1
Finished. Total time: 00:08:23

Now building: 2006-09-23
python gameday_mssql.py --year=2006 --month=09 --day=23 --type=mlb
Working on: 2006/09/23/anamlb-oakmlb-1
Working on: 2006/09/23/arimlb-lanmlb-1

Etc... etc...

Fire it up and you'll have more data then you know what to do with - what DO we do with it? That's next...

  • Buzzer66

    Hi Ryan, I’m a Nats fan – a Brit living in UK and found the game only recently on a DC trip, but I’m completely hooked. Would love to get / maintain a pitch F/X DB as I keep wanting to know stuff the F/X sites don’t analyse. Think your article and idea is great. Only problem is knowledge and technology on my part…I’ve spent two weeks mired in guides from 2010 which go to tech hell. Finally got a mySQL DB – great!! – but not updated since early 2011 (Zimmerman). Yours seems do-able, but I’m struggling as a non techie with Windows 7. Got MS SQL server., Python (2.7), got Pymssql installed in the end with a binary I found (I think !), struggling with Beautiful Soup on Win7. Cant get gameday_mssql.py working – not sure how you’re running that. Have i got a hope on Win 7? Do I need Ubuntu or some other alternative operating instance? Can you dumb it down even further for whining non tecchie Win plebs like me please? :)
    Mark

    • http://ryrobes.com/ Ryan Robitaille

      Hey Mark! I’m actually putting together a Virtual Machine that is running all this stuff that I’m going to post on the site in a few days.

      You’ll be able to download it run it in a windows and it’ll have a lot of these scripts running on it already. (GameDay MLB and BaseballReference data included)

      You’ll be free to mess with any of it, but if you just want the data – it’ll be easy to get! :)

      Cheers!

      • Marksemail247

        Great – will look forward to it.
        Mark

      • Brent Goldman

        Hi Ryan, I just found this great article and wanted to use it build an up-to-date baseball database for statistical calculations. I saw your previous comment about putting together a Virtual Machine, but I don’t know if you got around to actually posting it on your site. Please let me know if you have any more information / utilities, beyond what’s in this article, thanks!

        Brent

  • Scott Parrish

    Ryan, I am across your site after a WHOLE lot of searching and started using your code. I was curious if you are still using this. The reason I ask is when I try to download 2016 games into the database I’m getting an error, but your example from 2006 works just fine. I’m thinking it maybe something that has changed in the XML on the gameday website that is causing the issues.