{ Programmer, Data Hacker, Tech Integrator } for lack of a better term:
Business Technologist
"Its simple. I make shit work."
Impatient? Skip down to the bottom for a brief screencast demonstration.
We, over here at Linium, love Tableau Server. Big time.
It takes quite the tool to be able to wrap around all my procedural data hacking and make it look like it came that way. Custom internal apps + Salesforce data + consultant time-tracking data + Quickbooks data etc – it all fits, works, comes together and looks BEA-utiful thanks to my sick “Data Artist” skills (cough, cough, kidding).
Sounds good, right? Gee, you could almost (gasp!) run your entire freaking business on Tableau! Perish the thought! :)
When doing a multi-department integration (data automation) client project, even when you’re front-ending it with a product as good as Tableau, there comes a time when your end-users, practice leads, etc, ask for something that isn’t exactly inside Tableau’s wheelhouse – say, “data entry”.
Let’s say you’re trying to consolidate a mess of “resource planning” Excel spreadsheets so they can be reported on in your Tableau datasource – WITH the rest of the company data.
1) You COULD write an SSIS / DTS package in SQL Server to allow you to pull in this data (from various Excel files) on a semi-daily basis, but anyone with experience doing THAT knows that they’d rather take a boot to the head instead of constantly chasing after arbitrary document structure changes and user error.
2) You COULD write a simple custom web app on the intranet and then import that data into the OLAP cubes that already exist (if you’ve got the time), but then we’re sending business peeps all over the damn intranet for different reasons for the same data. Not to mention another “app” to maintain.
Besides, we’re trying to force them to use their fancy Tableau server login for EVERYTHING company-wide, so why would we want them going to yet ANOTHER internal web property. People tend get all pissy about shit like that.
Use a cool package like PhpGrid and embedded web parts in Tableau dashboards to allow end-users to change the data on the fly.
POC of Tableau and PHPgrid enabling a user-updatable resource plan report
Not really.
Probably.
This can be VERY useful for anyone who needs to report on a dataset that needs to have a bit of end-user entry that is BEYOND the tool-data we are reporting on. Is that clear? Clear as mud, eh?
Ok. Let me try and illustrate a use-case. For example, you could be reporting on a proprietary employee time-tracking system. We create our vizzes, reports, and all that hootenanny off the stock, (virtually) untouchable database. Then – after the initial roll-out and consumption, the various Project Managers want certain added fields and metrics implemented into the report that our time-tracking system does not support.
How about these for starters:
Now we’re back to the choices I mentioned earlier. Which all take quite a bit of time and effort, and will STILL be less-than-ideal. What can we do?
Here are our needs for this use case:
Fastest and quickest way to deploy some data-editing features? A cool like package called PhpGrid. It’s a fully customizable HTML5 “editable datagrid” implementation using ADODB5 (so it has lots of datasource support) and PHP. Also, its easy to use, and its damn cheap.
You can get an editable SQL table up in a matter of minutes – customized, prettified and formatted for your use in only hours. The best part is, even with all the AJAX-y magic of HTML5 and jQuery going on – it won’t interfere with the rest of your Tableau dashboard at all.
Set up a secondary Apache/PHP webserver on your Tableau server (or where ever) using a different port and get it running. Build your tables for the “new” data (keep in mind that you’re going to have to report on these tables and tie them into the “main” tables at some point – so design them with a smidgen of care).
Set up a PHP page to display and make that table editable (create, update, delete) via PhpGrid.
Now depending on the level of interactivity required in the initial dashboard – you may not have to do a whole lot more than that on the PHP side. For my use, I needed to allow the user to select what practice they were editing within the dashboard and refresh the PhpGrid data accordingly. I did this with some simple php $_GET['var'] variables and then populated those variables with a Tableau Dashboard Action.
Now users are free to add / change / delete resource plans for particular employees – and with a quick refresh, they will be able to see it reflected in the viz (as well as all the other vizzes – since I set THIS particular datasource to read directly from the database and not be made into an extract)
A fun little Viz I did about Bigfoot sightings.
I guess you could file this under “Too Much Time on my Hands”, but hey, people said that about my Metallica shit too, and that was pure genius. :)
Basically, I scraped all the sighting info off of The Bigfoot Field Research Organization site with Python and the Scrapemark module – inserted it into a local SQL Server database, cleaned it up (good lord, what a f*cking mess of data in some places) and threw up a pretty nice looking Viz using Tableau Public.
I’m pretty happy with it – its quite interactive and hopefully someone finds it fun or interesting.
Not too shabby for a couple hours worth of work that was spawned from sitting on the couch watch a re-run of “Finding Bigfoot” on the Animal Planet – after a few beers – and laughing to myself,
“Hah, it’d be funny to map out some freaking Bigfoot sightings”
On another note – after doing my research and harvesting the info… I wish that they’d re-vamp their site and not only display this “data” in a more visual / analytical way – BUT, more importantly, they should be collecting this data in a better way – validating fields, trying to make inferences and not just allowing anything to fly. Maybe its because I’m a data guy – and a bit of a purist at times, but still. It would be cool.
Oh well.
It took about an hour to run, looping through all the pages and sub-pages until you are scraping the main detail page – I DID have some time.sleep(2) commands sprinkled in there to not overwhelm their website, but, as it turns out, their site is so slow it didn’t really matter. Check out the Bigfoot Viz and enjoy!
Twitter. It’s informative. It’s time-wasting.
It’s useful. It’s entertaining. It’s important.
It’s drinking-from-the-fire-hose. It’s everywhere.
A service unlike any we’ve ever had before, and yes, very different and much faster than the traditional internet. We’ve seen it used as a tool for social activism, a platform for change, as well as a real-time news stream in a way that old-school media just can’t play.
The thing about drinking from the firehose is, well, just that. Overwhelming data and no real easy way to capture it all in a format that you can easily use. There is simply too much shit happening every second to utilize properly.
So, if you want to do any real trending, data mining, post-processing, or simple analysis on it – you really need to be making a historical record of what you’re looking for and store it in a usable way. The Twitter search function only goes back so far (at least from what I’ve found), so for the vast majority of data (aka anything you’d want to do some long-term trending on) – when its gone, its gone (for easy public consumption, at least).
Anyone who has been reading this blog over the past month or so, has seen my Metallica stuff – and yes, honestly, I wrote this script in order to add some data to that project. I wanted to chart mentions of the bands name, and member names – analyze them at a later date and try to draw some conclusions (as well as pretty visualizations).
So the question is, “How do we grab stuff from the timeline in the first place?” My answer, as it always is – Python, the world’s best Swiss army knife for data.
Enter my trusty Python 2.7 (32-bit to not break all sorts of modules). If you don’t already have it, you can download it here. They have flavors for whatever you favorite OS is, but in this case I’m going to be using Windows and SQL Server, if you use OS X or Linux, you’d have to rewrite the SQL statements and connect syntax to use MySQL (which isn’t a big deal, the rest would work fine as-is).
Once python is installed you need 2 more modules for this to work:
Open a windows command shell prompt (in Administrator Mode if in Win 7 / Vista) go to the C:\Python27\Scripts\ directory and type “easy_install twython” (no quotes). Easy.
Now that our Python installation is all sexed up like a fresh bowl of Kraft Macaroni and Cheese, we can worry about the database.
Did I say “worry”? Nah, no worries – keep your pants on and keep you wallet in them, muchacho. The free SQL Server Express Edition works wonders for this type of shiz – and you can download it here. The 32-bit or 64-bit version should work fine for this.
When you DO install it, just make sure that you use “mixed-mode” authentication – since we’re going to use a hardcoded user – none of that pesky “Windows Autentication” shit – AND install it as the “Default Instance” NOT a “Named Instance”. You see what I mean as it walks you through the process.
now we’ve got a fresh and spanky SQL Server installation – we need to set up some database tables to hold all our Tweets. “A table to hold my tweets”, Hmm, sounds like a sad new-wave children’s story… Anyways…
First we need to make a new database. Log into your installation by using the SQL Server Management Studio app (as the ‘sa’ user, who you created a password for earlier), right-click on the “Databases” folder and go to “New Database”. Name whatever you want, just remember that you’ll have change my script to match. The defaults are fine in the creation.
Once its created, click “New Query” in the upper left bar.
Go to the query window and type “user [TweetHarvester]” (or [PinkyWinky], or whatever the hell you called your database)
Now we are ready to make some tables, here is the structure that I used for my script.
Go ahead and Download this if you aren’t the Cut-n-Paste type.
In my case, as I said earlier, I’m tracking Metallica tweets (which includes the names of all past and present members), so my script has an array of all of those keywords and will iterate over them one at a time.
Ok, enough foreplay – here is the actual script that does all the work. Notice that I commented the shit out of it, hope that helps.
Go ahead and Download this if you aren’t the Cut-n-Paste type.
Probably could have been done MUCH more efficiently – but I’m all about results, and for a project like this. Quick wins. Set it up to run every 5 minutes in the background using Task Scheduler and suddenly we’re off to the races gathering tweets.
What do we DO with this data? Well, that’ll have to wait a bit while I gather it – but I sense some Tableau in the future…
Would love feedback, especially if you got it working (or if you didn’t, maybe I can help).
As I’ve been very vocal about all over Twitter and elsewhere – I’ve been spending a lot of time playing with Tableau Public over the past few weeks. Lots of learning, experimentation, tweaking, and some full blown data / analysis presentations have occurred (enter my Metallica Visualization pages or as I cheesily call it, “MetallicAnalysis“). During that time I learned a lot about the way the product works and continue to try and “jazz” up my Tableau “vizes” (that’s the current lingo, kids) in any way I can.
Not that a Tableau Public viz isn’t already “jazzy” as it is – it’s probably the single sexiest presentation layer I’ve ever encountered in my days of reporting (including some very expensive OLAP systems paired up with full blown dashboard products) and, trust me, I’ve worked with a lot of stuff.
I’ll go over some Tableau Public basics on another day, in a screencast perhaps – but for now, I wanted to highlight a couple of things that I stumbled upon during my own customizations. Just a couple of little tweaks that I wasn’t really able to find instructions for on the web. Hopefully, you’ll find this useful for your next “viz”, and make something so damn pretty it’ll put all mine to shame.
Tableau has its own tricks up its sleeve, it supplements a (slightly greyed out) cached image of your viz while all the requisite AJAX javascript libraries are loading in the background. In my mind, this is great because instead of looking at a useless box with just a “loading gif spinner” on it – at least the user gets a sense of what it is they are waiting for.
Now, this image isn’t 100% accurate, in my experience, it is often rendered at a different size or slightly off center (possibly due to my odd hardcoded sizes in order to maximize how they look on the blog itself) – but that doesn’t matter, it works incredibly well at keeping your user staring for those few precious seconds until everything is fully loaded and the interactive viz is working.
It gives them a shot of patience while they furrow their brows and try to decipher what the hell is happening in that magic box in front of them.
That being said, it works really well, but since I can’t seem to leave well enough alone, I decided to monkey with it and see what happened.
Here is my (mostly) unmolested embed code for one of my MetallicAnalysis dashboards (it’s actual the ‘small’ dashboard on the main index page), I just added a bunch of carriage returns in order to make it more readable on my narrow ass blog theme.
Now, a lot of these variables are pretty self-explanatory. Show Tabs, Animate Transition, Display Spinner, Overlay, (View) Count, etc. I’m sure that some of you have been altering these already.
But you WILL see an option or “display_static_image” as well as a value for “static_image” above. Let’s check out that auto-generated URL, shall we (resized to fit here, that is).
Like I said earlier, its a screenshot of the fully rendered report (albeit a bit tightly cropped). Its the first thing you see. So I decided to replace it with a goofy “Load” screen since I’m doing Metallica vizes. Load being the name of their 1997 album, which included the (often mentioned) picture of the peak of post-Black Album opulence that you see on the back cover.
Anyways, a few Photoshop minutes later, I have my little gag “Loading” screen.
Utterly hilarious AND incredibly clever? Well, I thought so. (Kidding) But it works, just make sure that the dimensions are the same as your viz is, at least width-wise, or else it might look a bit wonky. The rest will be filled with a light grey color.
Remember, this has to load while the user waits for everything ELSE to load – so making a big ass pre-loading-loading pic would probably hamper the user experience a bit. Just sayin’.
Check out how it works on a viz page here, or the screencast below.
[See post to watch Flash video]Does this make sense for all vizes? Of course not, but depending on the subject matter and how “complete” and “artsy” you want to get – it’s a cool little branding option for sure.
I do this stuff all the time in building custom reporting apps on the web (PHP, SQL Server, etc), it’s barebones functionality to get any type of drill down and primitive level of report interactivity. But, I didn’t even know this was possible with Tableau Public (at least as far as the documentation is concerned). I had assumed that the embedded vizzes were a “walled garden” affair – but a week or so ago I had an idea…
That particular day I had a lot of visitors from France & Poland (literally 70% of my, already spiked, traffic) due to some links on some European message boards. I thought to myself,
“Great! But I’ve got a bunch of US centric stuff on there, and they’d probably enjoy it more if they could immediately interact with their own country’s shows instead of having to zoom and pan all around the damn earth to find them. Since I’m sure that’s what they probably care about anyways”
I already knew that I could get WordPress Geo-Location Plug-in up and running that could estimate their country name via their ip address (using shortcodes even, wut wut!). The real trick was getting that to Tableau and getting it rendered on first page load (no pun intended).
Let’s look at my Tableau designer screen for the Shows in YOUR Country viz (the very one I was working on at the time).
For this usage, I created a parameter called “Visitor Country” and set the dashboard up so that changing the parameter value (a static list of countries pulled from the Country field in the data) would refresh the report. So now I had my dashboard with the static pre-populated drop-down of “data relevant” countries – and changing it triggered an action filter to refocus the map – ok, almost there, now for the hook.
Now, let’s go back to our embed code, but this time we will focus on only 3 lines.
You’ll see that those 3 values literally make up the source link for my particular report. Turns out that passing a parameter to that report is as easy as passing in URL $_GET vars in PHP. Depending on your platform you will be populating this differently (like my geo-location ip country).
Basically, just adding this string to the end of the url in the “name” param.
?Visitor Country=Denmark
Replace “Visitor Country” with the name of your Tableau parameter and Denmark with your value, obviously.
Scroll to the right to see the difference.
Just like the first tip – Simple, but it works. You can even pile up parameters by using the standard URL?param1=value1¶m2=value2¶m3=value3 pattern.
…We can start to do some more crazy things. The next, slightly harder usage was putting it all together in a seemless manner, at least to the user. I’m talking about building WordPress page(s) that pass a variable from the WordPress URL to Tableau, while still using pretty permalinks, not breaking WordPress, and all using only ONE Tableau report and ONE WordPress page. Take this link list for example.
Shows in Argentina, Australia, Austria, Belgium, Brazil, Bulgaria, Canada, Chile, Colombia, Costa Rica, Croatia, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Guatemala, Hungary, Iceland, India, Indonesia, Ireland, Israel, Italy, Japan, Latvia, Lithuania, Mexico, New Zealand, Norway, Panama, Peru, Philippines, Poland, Portugal, Puerto Rico, Romania, Russia, Serbia, Singapore, Slovakia, Slovenia, South Africa, South Korea, Spain, Sweden, Switzerland, Thailand, The Netherlands, Turkey, UK, USA, Venezuela, etc.
Each of those links references a good-looking and “normal” URL, for example:
http://ryrobes.com/metallicanalysis/metallica-shows-in-denmark
What am I doing in this example to keep everything working well together?
Booya, dynamically populated viz via beautifully “hidden” URL parameters (not really hidden, but not ugly-ass obvious ones). They look like they are 50+ individual pages, not a thinly veiled dynamic URL query structure.
What’s next? Doing the same thing, but having layers of detail. Building dynamic URLs within Tableau that link to another more detailed and parametized page and Tableau report – therefore creating a multi-layered Tableau “drill down” for the end user (for all intents and purposes).
I’m sure I skipped or missed something, let me know so I can add / fix it.
It’s interesting, the types of data conversations you have with people when discussing a unique project like MetallicAnalysis. You’re trying to show things in a different way and get as deep as possible, but never want to get too “tech-y” or else it loses contact with the actual subject matter and begins to make less sense. Numbers are cool, but numbers that ACTUALLY FUCKING MEAN SOMETHING to the average fan / person are way cooler. Like, Miles Davis level of cool.
When I work with clients on “regular” business intelligence projects, we end up talking about lots of things: reports, metrics, trends, percentages, key performance indicators, gauges, etc. But it always seems to boil down to a few simple questions:
and sometimes in the case of heavily operational systems,
People get so mired in the day to day banal minutiae of their business, they cease to SEE their business and the story it is telling them. Numbers in a vacuum don’t tell too much of a story. Which brings us back to the “Live Metallica Data Story”…
If you’re a big music fan like me, you’ve been to tons of shows; big ones, small ones, in dive bars, in stadiums, in people’s basements, and everything in between. So you know what its like to follow a band (like, you know, actually follow, remember there was a world before Twitter and Facebook and ‘follow’ meant something different), know their body of work and be informed about the composition of the live show. Hell, maybe you’re a hardcore fan and have even seen a few shows on different tours from different album release cycles (albums; another artifact for the times). You know, that whole thing called “Touring”?
In the case of my little MetallicAnalysis play, I have data on the setlist level to go down to and show (at the most granular) – which is cool, but in a vacuum it loses something. Staring at a list of songs from days, weeks, or decades ago doesn’t mean that much without context. It loses that little ‘ohh, this is special’ thrill that I got when they randomly played ‘Outlaw Torn’ in Albany, NY in ’04 on the St Anger Tour. As a hardcore fan you KNOW that that shit does NOT HAPPEN on regular basis.
After thinking about it for awhile, I realized that as much ‘heart’ as is in that, there is some ‘math’ too.
I was in the middle of building out the “Last Show Played” report and was trying to figure out some calculations that would help show the “uninformed individual” what was so special about that particular setlist, in this case it was the final Metallica Club Fillmore Show. So, Enter Math. I started thinking about frequency, ranking, and percentiles. I thought, would that ‘tell the story’ somehow?
Came up with something like this.
Which gives me (showing only the Fillmore songs, not all 169 possible songs):
| SongName | Frequency | RowNum | Rnk | DenseRnk | NTile10 |
|---|---|---|---|---|---|
| Orion | 35 | 90 | 89 | 22 | 6 |
| Through the Never | 199 | 133 | 133 | 58 | 8 |
| Ride the Lightning | 282 | 142 | 142 | 66 | 9 |
| The God That Failed | 71 | 102 | 101 | 30 | 6 |
| Welcome Home (Sanitarium) | 843 | 159 | 159 | 82 | 10 |
| Rebel of Babylon | 1 | 2 | 1 | 1 | 1 |
| Blackened | 426 | 150 | 150 | 73 | 9 |
| Dirty Window | 31 | 86 | 86 | 20 | 6 |
| Frantic | 173 | 130 | 130 | 55 | 8 |
| Sabbra Cadabra | 6 | 66 | 66 | 6 | 4 |
| Iron Man | 2 | 41 | 37 | 2 | 3 |
| Paranoid | 2 | 38 | 37 | 2 | 3 |
| King Nothing | 335 | 147 | 147 | 71 | 9 |
| Whiplash | 856 | 160 | 160 | 83 | 10 |
| Motorbreath | 253 | 140 | 140 | 64 | 9 |
| Phantom Lord | 148 | 123 | 123 | 49 | 8 |
| Jump in the Fire | 87 | 107 | 107 | 34 | 7 |
| Metal Militia | 76 | 103 | 103 | 31 | 7 |
| Hit the Lights | 146 | 121 | 121 | 47 | 8 |
| Seek & Destroy | 1267 | 167 | 167 | 90 | 10 |
Makes sense, right? “Rebel of Babylon” had never been played before (shit, no one had even heard it before), so thus it is in the “1″ (top) percentile (DenseRank in this case), and it also belongs in the 1 group for NTile (grouping of percentiles into a lower more general rank, in this case 1-10), so NTile 1 should equate to “Rarest Shit Ever” and NTile 10 is pretty much “Very Commonly Played”, with 2-9 being all the layers in-between. Make sense? To me it does, that pretty much tells the story I was looking for.
So, I ran the calculation (or pre-calculation in this case – the Tableau visual layer can’t do this, and honestly I wouldn’t want it trying), put that data in the report and all was right with the world, right? Nope. What about the other 1,714 shows in the database? Oh, no problem, we can just use the numbers for all the songs I just ran and join them to the…. Oh. Shit. No we can’t. The Fillmore numbers work because we’re looking at the LAST THING that happened and calculating it using ALL THE DATA WE HAVE, which is pretty much our working set at this point.
The data is valid, but can I use that same probability to judge the impact of, let’s say, a setlist from 1986? Fuck no. That was a 1,500 shows and 6 albums ago (not to mention 2 bass players – but that’s another story). I needed to calculate the probability and frequency of all songs played only up to the date in question. So In essence, recalculating the entire universe of song possibilities for every single show including all previous shows.
Ended up using a cursor in a stored procedure to get a proper historical calculation for each show:
Hey, works for me. Check out the results on my visualizations over here.
I’ve been fucking around with Tableau for a few nights, and I must say that even the free “Public” version is damned impressive, and I’ve used a LOT of reporting / business intelligence tools, hell, I pretty much make 80% of my living building solutions around that exact business need.
The question was – what kind of data do I use to give it a proper test drive…
Anyone who knows me knows that I’m a huge Metallica fan, hell for as long as I can remember, that’s just been “one of my things” (as many current, ex-girlfriends, and ex-wives can attest to)
so, to make a ling story short, I ended up writing a python script to scrape all the setlist data off of the official site (with the easy-peazy scrapemark module and pymssql), clean it up a bit, another python script to geocode all the shitty venue / location data, inserted it into a local SQL Server database and viola.
Is it ground breaking? No.
Is Lars going to call me and praise my analysis? No.
Is anyone even going to give a shit at all? Probably not.
But, hey, it was a fun little exercise. Hell, I might even be adding song and show lengths to it in the future – I’d be amused with a metric that says that James Hetfield has spent 1.25 years of his life on stage singing Master of Puppets.
(UPDATE: It only turns out to be 8.3 days straight playing Puppets. Fuck)
I’ll probably post the Python scraping scripts I used in the next few days. They might be of some help to someone – I know I get lots of hits from people using my use a Python script as a windows service post.
UPDATE: I’m too lazy to fix my theme to make this fully visible (aka, it looks like shit), so I’m going to post it on a seperate page here.
I’ve take out the embedding since this data has its own set of pages on this site now. See above “MetallicAnalysis“.
Let’s be honest – I’m a data guy. “Geek” to be more exact. But, I’m also a sports guy – and if there is ANY sport out there that prides itself as being massively data-centric… its Baseball. Data goes with baseball like Barry Bonds and… um, Home Runs (yeah, that’s it). Data collecting is an integral part of baseball culture and has been going on way before the internet or relational databases were even invented.
“Business Intelligence” is a term that’s been kicked around around for some time now, and basically just means “analyzing data from your past, in order to BETTER make better decisions in the future (or to better steer it in realtime)“. Its all about strategy, learning from mistakes (and successes), and being able to actively monitor / measure the health of your business. We might use this data to ask important business related questions: “Is Suzi Salesperson performing up to par this quarter?” or “What line of products have the biggest margin this time of year?”.
However, its not too far of a stretch to imagine someone who works for a Major League Baseball team sitting in front of a computer at the head office thinking: “What is the ROI of our pitching staff this year? Are they performing to expectations?” Its really no different then the business and reporting scenarios many of us encounter every single day. The exciting part is not only being able to read what has happened out on the ball field yesterday – but what might happen tomorrow. Boom. “Bizball Intelligence” anyone?
In this multi-part post, I’ll be taking you through all the steps needed to get up and running with your own historical AND current Major League Baseball statistics database (an operational / transactional DB), staging it out in a more “reportable” fashion (a data warehouse DB), and then finally building some cubes, calculated measures, and choice “Player Key Performance Indicators” (KPIs) based on the data. Who knows, maybe you’ll be so good at it that you’ll get hired by the San Diego Padres front-office, like this guy did, but more on him later.
First up. We need some data (actually, a TON of data), and (more importantly) we need a place to store and easily access it. Its database creation time.
Here at C&C, we are big fans of the Microsoft SQL Server 2008 stack of products (Database, Analysis Services, & Reporting Services), and I’ll be using SQL Server 2008 R2 for this tutorial. You can get a 6 month eval copy, (you can also install the free SQL Server Express product to get up and running with a database at no cost – but you’ll be missing out on the Analysis Services and Data Mining pieces).
There are lots of places with baseball data on the web, but sometimes its incomplete, unofficial, not granular enough, and organized badly for what we need it for. Besides, we can’t do as much with stats that are ALREADY calculated – we need to calculate them ourselves in order to Data Mine properly (although those “other” data sources come in handy to verify that our own calculation formulas are correct). So in my mind, the grand-poobah of MLB data collection is (shockingly enough) MLB.com.
MLB.com’s “GameDay” application uses a XML structure to read info about every game, inning, pitch, hit and player. The “full-monty” of data only goes back to about 2006 or so (pitch location, coordinates, speeds, etc also known as “Pitch/FX Data”), but less granular data is available for prior years (and even more from other methods, I’ll post about later).
Browse around their backend, and you’ll see what I mean: http://gd2.mlb.com/components/game/mlb/. From there it goes into year/ month/ day/ game/ inning, linescore, batters, pitchers, etc. etc. all packaged up into XML files for the reading. Now before you think this is some kind of illegal backdoor in MLB’s servers, its not – well, not really. MLB has been aware that people have been using this data for about 4 years now. If they REALLY wanted to lock people out of it, they would have done it long ago. So feel free to browse it 100% guilt-free. The data is near realtime too! Pretty slick, MLB.
linescore.xml Example
Well, that’s easy. In 2010 Wells Oliver created a python script and associated libraries that reads the XML Gameday data and inserts it into a relational database format. Wells wrote the script to work with MySQL and (also) hasn’t updated it in some time (the MLB data format is always changing, and expanding, especially in 2011). So I took his fine work, rewrote it to use Microsoft SQL Server and added / changed / enhanced portions of its logic and the database schema. Anyways, I have “forked” his project, as they say in software development circles.
It requires:
Download it, expand it – (I use “C:\Gameday” myself), and then modify the ‘db.ini’ file in the root folder to configure your database instance.
I’m working on a development SQL instance, so my db.ini looks like this:
(If you need to use Windows Authentication / Trusted Auth, stay tuned because I’ll be adding that shortly)
Obviously you’re going to want to create a ‘mlbgameday’ (or whatever you desire to call it) database on the server first AND you’re going to have to create the schema frist.
Once you’re all set, the script is used like this: (make sure that the python directory is in your PATH)
The only required argument is “year”, but I would shy away from trying to fetch a whole year at one time for now. The multiple threads tend to overwhelm the pymssql library and max out some internal max connections setting (the TDS lib MAX_CONNECTIONS). For “type” you can fetch ‘mlb’ data or ‘aaa’ data, which might be useful in our analysis, but not a neccessity.
In order to gather a couple years worth of data painlessly, there is ANOTHER script I put together in the package called build_all.py – simply open this file with a text editor and modify these variables near the top:
Now, running “python build_all.py” will crawl through each date in your specified date range gathering data from all games on each day. Crawling through 6 years of MLB data might take a day or so – but once you have it, you have it – and updating it each day to get the previous days games should only take a minute or so (more on that in the next post).
Sample Output:
Etc… etc…
Fire it up and you’ll have more data then you know what to do with – what DO we do with it? That’s next…
Just a recent article I wrote for the C&C Computer Solutions site / blog. Great stuff. Here’s a bit of the intro, but for the MEAT of it, you’ll have to go to the C&C site and read it yourself…
Let’s be honest – we’re data guys. “Geeks” to be more exact. But, we’re also sports guys – and if there is ANY sport out there that prides itself as being massively data-centric… its Baseball. Data goes with baseball like Barry Bonds and… um, Home Runs (yeah, that’s it). Data collecting is an integral part of baseball culture and has been going on way before the internet or relational databases were even invented.
“Business Intelligence” is a term that’s been kicked around around for some time now, and basically just means “analyzing data from your past, in order to BETTER make better decisions in the future (or to better steer it in realtime)“. Its all about strategy, learning from mistakes (and successes), and being able to actively monitor / measure the health of your business. We might use this data to ask important business related questions: “Is Suzi Salesperson performing up to par this quarter?” or “What line of products have the biggest margin this time of year?”.
However, its not too far of a stretch to imagine someone who works for a Major League Baseball team sitting in front of a computer at the head office thinking: “What is the ROI of our pitching staff this year? Are they performing to expectations?” Its really no different then the business and reporting scenarios many of us encounter every single day. The exciting part is not only being able to read what has happened out on the ball field yesterday – but what might happen tomorrow. Boom. “Bizball Intelligence” anyone?
In this multi-part post, I’ll be taking you through all the steps needed to get up and running with your own historical AND current Major League Baseball statistics database (an operational / transactional DB), staging it out in a more “reportable” fashion (a data warehouse DB), and then finally building some cubes, calculated measures, and choice “Player Key Performance Indicators” (KPIs) based on the data. Who knows, maybe you’ll be so good at it that you’ll get hired by the San Diego Padres front-office, like this guy did, but more on him later.
Sometimes you get stuck in a rut, hey, we all do – but how you deal with it is what separates us from common wildebeests. Most people? They do nothing – they are totally fine with having an unfulfilled albeit mostly “comfortable” existence.
Sounds crazy? Nope.
Follow the rules, color in the lines, do what youre told and one day (if you’re REALLY obedient) the retirement fairy will take you away to Utopian bliss, well that is, if you consider driving the Hoveround around the Grand Canyon in search of a diaper vending machine, Utopian bliss.
Don’t rock the boat. Just go with the flow. Come in to work on time. Don’t fight with your sister. Keep your hands inside the bus at all times. (Ok, maybe those last 2 are reasonable)
The only way to be truly exceptional is to set out to BE truly exceptional – and you can’t do that by waiting in line for your turn to come around.
Almost three weeks ago I quit my job as a programmer for a pretty cool New York State agency. Honestly, it was a fine job, the pay was good, the benefits were great. All that was REALLY expected of me was to come in on time and follow (all) the rules. Basically be a cubicle droid for the next 30 years. Nice.
It wasn’t for me. So what did I do? I left and didn’t look back. I don’t even have a job to land safely into. But you know what? Who cares. I am now 100% in charge of my own destiny and direction – my accomplishments will be mine alone – and my failures will be judged only by me.
What really shocked me were the responses from many other state workers upon hearing about my resignation.
“How will you pay your mortgage and all your bills?”
“In this economy?”
People seem so trained to think of life as this ultra linear amusement park ride. Birth at the beginning, death at the end – and hitting all the “white-bread common-status-quo” lumps in between. Just like the car ahead of you, and him, and her, etc.
Fuck it. I’ve never been a conventional guy – so its about time I started embracing that fact.
Whats next? Freelancing, business-building, & muse-finding.
One from the “Random-Shit-Pulled-Out-of-the-Junk-Drawer” Department… as I sit here and watch my beloved Red Sox cement their way OUT of playoffs… {sigh}
Here’s a little function I use when building sexy little graphs in my PHP Reporting / Web Interfaces. Its good to calculate a MAX value for the graph axis. Since we never know what the values are going to be until we coax that data out of the DB – it could be 10, could be 10,000 or 10,000,000 (sometimes its even “apple”, but fruit to integer conversions will be another day)
I’m sure there is a WAY more elegant way to do this using the CEIL function, but for my needs – this fit the bill and I couldn’t find anything out there on the interwebs that worked exactly like I wanted for this purpose (aka “I never want to round down, and still need to keep it close to the original number”).
If you were wondering about the “sexy little graph” thing I mentioned – check out Open Flash Chart, its quick, pretty and just-plain-works.
If anyone wants to correct me with a ONE LINE version to embarrass me – go for it! But remember:
I keed! I keed! :)