{ Programmer, Data Hacker, Tech Integrator } for lack of a better term:
Business Technologist
"It's simple. I make shit work."
Let’s be honest – I’m a data guy. “Geek” to be more exact. But I’m also a sports guy – and if there is ANY sport out there that prides itself on being massively data-centric… it’s baseball. Data goes with baseball like Barry Bonds and… um, Home Runs (yeah, that’s it). Data collecting is an integral part of baseball culture and was going on long before the internet or relational databases were even invented.
“Business Intelligence” is a term that’s been kicked around for some time now, and basically just means “analyzing data from your past in order to make BETTER decisions in the future (or to better steer things in realtime)”. It’s all about strategy, learning from mistakes (and successes), and being able to actively monitor / measure the health of your business. We might use this data to ask important business-related questions: “Is Suzi Salesperson performing up to par this quarter?” or “Which line of products has the biggest margin this time of year?”.
However, it’s not too far of a stretch to imagine someone who works for a Major League Baseball team sitting in front of a computer at the head office thinking: “What is the ROI of our pitching staff this year? Are they performing to expectations?” It’s really no different than the business and reporting scenarios many of us encounter every single day. The exciting part is not only being able to read what happened out on the ball field yesterday – but what might happen tomorrow. Boom. “Bizball Intelligence” anyone?
In this multi-part post, I’ll be taking you through all the steps needed to get up and running with your own historical AND current Major League Baseball statistics database (an operational / transactional DB), staging it out in a more “reportable” fashion (a data warehouse DB), and then finally building some cubes, calculated measures, and choice “Player Key Performance Indicators” (KPIs) based on the data. Who knows, maybe you’ll be so good at it that you’ll get hired by the San Diego Padres front-office, like this guy did, but more on him later.
First up. We need some data (actually, a TON of data), and (more importantly) we need a place to store and easily access it. It’s database creation time.
Here at C&C, we are big fans of the Microsoft SQL Server 2008 stack of products (Database, Analysis Services, & Reporting Services), and I’ll be using SQL Server 2008 R2 for this tutorial. You can get a 6-month eval copy, or you can install the free SQL Server Express product to get up and running with a database at no cost – but you’ll be missing out on the Analysis Services and Data Mining pieces.
There are lots of places with baseball data on the web, but sometimes it’s incomplete, unofficial, not granular enough, or badly organized for what we need. Besides, we can’t do as much with stats that are ALREADY calculated – we need to calculate them ourselves in order to Data Mine properly (although those “other” data sources come in handy to verify that our own calculation formulas are correct). So in my mind, the grand-poobah of MLB data collection is (shockingly enough) MLB.com.
MLB.com’s “GameDay” application uses an XML structure to read info about every game, inning, pitch, hit and player. The “full-monty” of data (pitch location, coordinates, speeds, etc., also known as “PITCHf/x” data) only goes back to about 2006 or so, but less granular data is available for prior years (and even more from other methods that I’ll post about later).
Browse around their backend, and you’ll see what I mean: http://gd2.mlb.com/components/game/mlb/. From there it goes into year / month / day / game / inning, linescore, batters, pitchers, etc., all packaged up into XML files for the reading. Now before you think this is some kind of illegal backdoor into MLB’s servers, it’s not – well, not really. MLB has been aware that people have been using this data for about 4 years now. If they REALLY wanted to lock people out of it, they would have done it long ago. So feel free to browse it 100% guilt-free. The data is near realtime too! Pretty slick, MLB.
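To make that year / month / day drill-down concrete, here is a minimal Python sketch that builds the folder URL for a single day of games. The base URL is the one quoted above; the exact folder naming (`year_YYYY/month_MM/day_DD/`) and the `day_url` helper name are my assumptions for illustration, so browse the backend yourself to confirm the layout before relying on it.

```python
from datetime import date

# Base URL quoted in the post.
BASE_URL = "http://gd2.mlb.com/components/game/mlb/"


def day_url(day: date, base: str = BASE_URL) -> str:
    """Build the folder URL for one day of games.

    The 'year_YYYY/month_MM/day_DD/' naming is an assumed layout,
    not something confirmed by the post.
    """
    return f"{base}year_{day.year:04d}/month_{day.month:02d}/day_{day.day:02d}/"


# Example: the folder for April 5th, 2011.
print(day_url(date(2011, 4, 5)))
```

Zero-padding the month and day keeps the folders sorting correctly as plain strings, which is handy when you start looping over them.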
linescore.xml Example
Well, that’s easy. In 2010 Wells Oliver created a Python script and associated libraries that read the XML Gameday data and insert it into a relational database format. Wells wrote the script to work with MySQL and hasn’t updated it in some time (the MLB data format is always changing and expanding, especially in 2011). So I took his fine work, rewrote it to use Microsoft SQL Server, and added / changed / enhanced portions of its logic and the database schema. In other words, I have “forked” his project, as they say in software development circles.
It requires:
Download it, expand it – (I use “C:\Gameday” myself), and then modify the ‘db.ini’ file in the root folder to configure your database instance.
I’m working on a development SQL instance, so my db.ini looks like this:
(If you need to use Windows Authentication / Trusted Auth, stay tuned because I’ll be adding that shortly)
Obviously you’re going to want to create a ‘mlbgameday’ (or whatever you desire to call it) database on the server first AND you’re going to have to create the schema first.
Once you’re all set, the script is used like this: (make sure that the Python directory is in your PATH)
The only required argument is “year”, but I would shy away from trying to fetch a whole year at one time for now. The multiple threads tend to overwhelm the pymssql library and hit an internal connection limit (the TDS lib MAX_CONNECTIONS). For “type” you can fetch ‘mlb’ data or ‘aaa’ data, which might be useful in our analysis, but is not a necessity.
In order to gather a couple of years’ worth of data painlessly, there is ANOTHER script I put together in the package called build_all.py – simply open this file with a text editor and modify these variables near the top:
Now, running “python build_all.py” will crawl through each date in your specified date range, gathering data from all games on each day. Crawling through 6 years of MLB data might take a day or so – but once you have it, you have it – and updating it each day to get the previous day’s games should only take a minute or so (more on that in the next post).
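If you are curious what “crawl through each date in your specified date range” boils down to, here is a rough Python sketch of the idea. This is NOT the actual build_all.py code; `each_day`, `crawl` and the `fetch_day` callback are hypothetical names standing in for whatever does the real per-day fetching.

```python
from datetime import date, timedelta


def each_day(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def crawl(start: date, end: date, fetch_day) -> int:
    """Call fetch_day(day) once per date and return how many days were visited.

    fetch_day is a hypothetical callback; in a real run it would
    download and insert that day's games.
    """
    visited = 0
    for day in each_day(start, end):
        fetch_day(day)
        visited += 1
    return visited


# Example: just print the dates instead of fetching anything.
crawl(date(2011, 4, 1), date(2011, 4, 3), print)
```

Going one day at a time (instead of a whole year in one shot) also sidesteps the connection-limit headache mentioned earlier.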
Sample Output:
Etc… etc…
Fire it up and you’ll have more data than you know what to do with – what DO we do with it? That’s next…
Just a recent article I wrote for the C&C Computer Solutions site / blog. Great stuff. Here’s a bit of the intro, but for the MEAT of it, you’ll have to go to the C&C site and read it yourself…
Let’s be honest – we’re data guys. “Geeks” to be more exact. But we’re also sports guys – and if there is ANY sport out there that prides itself on being massively data-centric… it’s baseball. Data goes with baseball like Barry Bonds and… um, Home Runs (yeah, that’s it). Data collecting is an integral part of baseball culture and was going on long before the internet or relational databases were even invented.
“Business Intelligence” is a term that’s been kicked around for some time now, and basically just means “analyzing data from your past in order to make BETTER decisions in the future (or to better steer things in realtime)”. It’s all about strategy, learning from mistakes (and successes), and being able to actively monitor / measure the health of your business. We might use this data to ask important business-related questions: “Is Suzi Salesperson performing up to par this quarter?” or “Which line of products has the biggest margin this time of year?”.
However, it’s not too far of a stretch to imagine someone who works for a Major League Baseball team sitting in front of a computer at the head office thinking: “What is the ROI of our pitching staff this year? Are they performing to expectations?” It’s really no different than the business and reporting scenarios many of us encounter every single day. The exciting part is not only being able to read what happened out on the ball field yesterday – but what might happen tomorrow. Boom. “Bizball Intelligence” anyone?
In this multi-part post, I’ll be taking you through all the steps needed to get up and running with your own historical AND current Major League Baseball statistics database (an operational / transactional DB), staging it out in a more “reportable” fashion (a data warehouse DB), and then finally building some cubes, calculated measures, and choice “Player Key Performance Indicators” (KPIs) based on the data. Who knows, maybe you’ll be so good at it that you’ll get hired by the San Diego Padres front-office, like this guy did, but more on him later.
Sometimes you get stuck in a rut, hey, we all do – but how you deal with it is what separates us from common wildebeests. Most people? They do nothing – they are totally fine with having an unfulfilled albeit mostly “comfortable” existence.
Sounds crazy? Nope.
Follow the rules, color in the lines, do what you’re told and one day (if you’re REALLY obedient) the retirement fairy will take you away to Utopian bliss – well, that is, if you consider driving the Hoveround around the Grand Canyon in search of a diaper vending machine Utopian bliss.
Don’t rock the boat. Just go with the flow. Come in to work on time. Don’t fight with your sister. Keep your hands inside the bus at all times. (Ok, maybe those last 2 are reasonable)
The only way to be truly exceptional is to set out to BE truly exceptional – and you can’t do that by waiting in line for your turn to come around.
Almost three weeks ago I quit my job as a programmer for a pretty cool New York State agency. Honestly, it was a fine job, the pay was good, the benefits were great. All that was REALLY expected of me was to come in on time and follow (all) the rules. Basically be a cubicle droid for the next 30 years. Nice.
It wasn’t for me. So what did I do? I left and didn’t look back. I don’t even have a job to land safely into. But you know what? Who cares. I am now 100% in charge of my own destiny and direction – my accomplishments will be mine alone – and my failures will be judged only by me.
What really shocked me were the responses from many other state workers upon hearing about my resignation.
“How will you pay your mortgage and all your bills?”
“In this economy?”
People seem so trained to think of life as this ultra linear amusement park ride. Birth at the beginning, death at the end – and hitting all the “white-bread common-status-quo” lumps in between. Just like the car ahead of you, and him, and her, etc.
Fuck it. I’ve never been a conventional guy – so it’s about time I started embracing that fact.
What’s next? Freelancing, business-building, & muse-finding.
One from the “Random-Shit-Pulled-Out-of-the-Junk-Drawer” Department… as I sit here and watch my beloved Red Sox cement their way OUT of the playoffs… {sigh}
Here’s a little function I use when building sexy little graphs in my PHP Reporting / Web Interfaces. It’s good for calculating a MAX value for the graph axis, since we never know what the values are going to be until we coax that data out of the DB – it could be 10, could be 10,000 or 10,000,000 (sometimes it’s even “apple”, but fruit to integer conversions will be another day).
I’m sure there is a WAY more elegant way to do this using the CEIL function, but for my needs – this fit the bill and I couldn’t find anything out there on the interwebs that worked exactly like I wanted for this purpose (aka “I never want to round down, and still need to keep it close to the original number”).
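For anyone who wants to play with the idea outside of PHP, here is a small Python sketch of the same goal: never round down, but stay close to the original number. This is not the function from my reporting code, just one way to do it; the `axis_max` name and the “keep two leading digits” rule are my own choices for illustration.

```python
import math


def axis_max(value: float) -> int:
    """Round a value UP to a tidy axis maximum, keeping two leading digits.

    Never rounds down. Zero or negative input returns 0.
    """
    if value <= 0:
        return 0
    whole = math.ceil(value)                # work in whole units
    digits = len(str(whole))                # e.g. 8734 has 4 digits
    step = 10 ** max(digits - 2, 0)         # e.g. 100 for a 4-digit value
    return -(-whole // step) * step         # integer ceiling to a multiple of step


for sample in (10, 8734, 1234567, 10_000_000):
    print(sample, "->", axis_max(sample))
```

Sticking to integer math for the final round-up avoids the floating-point surprises you can get from dividing and calling a ceiling function on huge values.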
If you were wondering about the “sexy little graph” thing I mentioned – check out Open Flash Chart. It’s quick, pretty, and just-plain-works.
If anyone wants to correct me with a ONE LINE version to embarrass me – go for it! But remember:
I keed! I keed! :)
I’m sorry – but who the fuck cares that you can recite X, Y, and Z off the top of your head? Is it relevant and useful? Generally, no. Are you helping the situation, project or issue? No.
There is a word for out-of-context over-information like that…
I just don’t get why so many IT / tech-types are obsessed with specific pointless bullshit in their particular development niche or language – instead of just focusing on SOLVING PROBLEMS? Why spend your whole career “stuck in the weeds”?
Anything I talk about tech-wise (even if it’s lots of Python lately) – is used as a TOOL to solve a PROBLEM. We all know that next year there will be different tools. We might solve them differently. Marrying yourself to one technology, language or approach is just crippling your usefulness as far as I’m concerned.
End rant.
Anyone who has ever dealt with the mortgage software package “Encompass” from Ellie Mae from any kind of support / IT standpoint knows the pain. It just… well, doesn’t behave as you would assume. Seemingly simple things (from an integration / workflow standpoint) just don’t play nice, help in the web forums is laughable, it’s slow, ugly, and so on. Yet, Encompass (360 even) is still an indispensable tool for many mortgage shops.
From an outside integration standpoint I had 2 initial hurdles – importing fresh data and decent reporting. I’ll tackle the first one today. In terms of getting new contacts and prospects imported into the system – there have to be built-in ways to do that, right? Sure, but there are (at the very least) 3 serious problems with them “out-of-the-box”:
If all of your Loan Officers got their perfect contacts / prospects over the phone and typed them in by hand into your beautiful Encompass Software – AND they always did it properly and consistently (chuckle) – well then that’s fine. However, we all know that isn’t how things work. Let’s say you get 90% of your incoming apps via your webpage… now unless you have the SDK and build some custom apps in-house to take care of these applications (or buy some RIDICULOUSLY expensive 3rd party ones to do it) it’s going to be quite an effort – most likely of manual (and seemingly endless) importing…
Thankfully, there IS another easier and quicker way. Encompass is built on a SQL Server database backend. We can safely insert all new apps DIRECTLY into the Encompass backend Database as contacts.
This is what *I* do – your mileage may vary – it works with the latest version of Encompass (as of this writing – Aug 2010). ALWAYS TAKE A SQL SERVER backup before fucking around with the database (READ: I am not responsible for you hosing your installation). That said, I’ve never had a problem doing this – even DELETING contact records didn’t cause any damage.
Disclaimer: This may look like a “dirty” solution, but you’re only seeing a fraction of the whole “integrated solution” – this is just a teaser to solve a simple problem that I faced many months ago and found very little help on the web for.
Step 1 – The INITIAL insert into the Borrower table via a built-in Encompass stored procedure.
NOTE: For this example, I’m just using static data for the Application info – this SQL would have to be dynamically generated with each new app obviously (from PHP, Python, ASP, .NET, etc. etc. etc). Also, note that not ALL of these fields are required (obviously), but some WILL be – give it a few tests first. You might have to put in dummy NULLs or empty strings for fields that Encompass requires, but your web app does not (as I did).
Step 2 – Now that we’ve started the contact process, we’ve got a unique contact ID (kept in a variable within one whole SQL script as I did, or saved in an external application to be re-used throughout the transaction) and it’s time to put a bit more info in. The ‘Opportunity’ table has important things like Loan Amount Requested, Credit Rating, etc. Pretty important stuff.
Step 3 – (optional) This next one I do just for audit-trail. It basically adds a contact “history” record telling the LOs where the record came from – this is especially important if you’ve got apps coming from many sources or multiple websites.
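To show the overall flow of the three steps (insert the contact, re-use the new ID for the opportunity, then log a history note), here is a rough Python sketch. Big caveat: it runs against an in-memory SQLite stand-in, NOT the real Encompass database, and it uses plain INSERTs instead of the built-in stored procedure. The ‘Borrower’ and ‘Opportunity’ table names come from the steps above; the ‘ContactHistory’ table, every column name, and both function names are made up for illustration.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in database (NOT the Encompass schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Borrower (
            ContactID INTEGER PRIMARY KEY AUTOINCREMENT,
            FirstName TEXT, LastName TEXT, Email TEXT
        );
        CREATE TABLE Opportunity (ContactID INTEGER, LoanAmount REAL, CreditRating TEXT);
        CREATE TABLE ContactHistory (ContactID INTEGER, Note TEXT);
        """
    )
    return conn


def import_application(conn: sqlite3.Connection, app: dict, source: str) -> int:
    """Insert one web application as a contact and return the new contact ID."""
    with conn:  # commits on success, rolls back if any step fails
        # Step 1: create the contact; empty strings fill fields the web form lacks.
        cur = conn.execute(
            "INSERT INTO Borrower (FirstName, LastName, Email) VALUES (?, ?, ?)",
            (app["first_name"], app["last_name"], app.get("email", "")),
        )
        contact_id = cur.lastrowid
        # Step 2: re-use the new ID for the opportunity details.
        conn.execute(
            "INSERT INTO Opportunity (ContactID, LoanAmount, CreditRating) VALUES (?, ?, ?)",
            (contact_id, app.get("loan_amount"), app.get("credit_rating", "")),
        )
        # Step 3 (optional): audit-trail note saying where the record came from.
        conn.execute(
            "INSERT INTO ContactHistory (ContactID, Note) VALUES (?, ?)",
            (contact_id, f"Imported from {source}"),
        )
    return contact_id
```

Passing the values as parameters (the `?` placeholders) instead of gluing them into the SQL string matters a lot here, since this data comes straight off a public web form.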
Incorporating this type of automation in your mortgage workflow could save you TONS of time, help increase conversions, and reduce wasted administrative effort. Obviously this is just the SQL Server side of the import – it would FIRST have to come from its original source and THEN be inserted… that small journey is up to your IT staff, a data consultant, or web team – but for the value it delivers… it’s well worth it.
Now any Python duct-taper integrate-anything junkie like me has a need to schedule their things (in production) every once in a while. Usually this is not a problem – Unix / Linux cron jobs handle this nicely – but for a client or job that runs on a Windows server – the built-in “Scheduled Tasks” just never really cut it for me – in fact, I’ve always thought that it was pretty Mickey-Mouse and not super reliable (just my opinion – don’t flame me, you wily Windows Dudes – I know that there are 1,000 ways to skin a cat, here’s mine!)…
It honestly gave me the creeps to put production stuff on there that NEEDED to be run every X minutes, hours, etc. Besides, making a batch file to fire off a Python script feels so… well, unpolished – and pretty much like a lazy hack that made it into prod…
Disclaimer: Hey, I love hacks – they make the IT world go around, but I don’t like delivering shit with loose-feeling triggers like that to clients. Especially if some smart ass IT dickface (who COULDN’T pull off what I did, or else I wouldn’t have been paid to do it) is going to dig into them one day and bad mouth me about it (rightfully so).
Besides, I’ll find you! :)
Enough blather – here is the whole script. Modify and distribute where you see fit – parts of it were written by others that I can’t recall (so if it’s you – please accept my apology and an expired gift card).
As you can see above between the [actual service code between rests] text – THAT’S the real meat of this script – since that is the ACTION that will be executed within the service with each iteration (regardless of the timing).
So it doesn’t matter if you’re:
Hell, it can be and do virtually anything! Hey, it’s YOUR service after all.
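Stripped of all the Windows service plumbing, the “do the ACTION, rest, repeat” idea looks roughly like this in plain Python. This is only a sketch of the loop, not the service script itself; `run_every`, `stop_event` and `max_runs` are names I made up here, and a real service would set the stop event from its stop handler.

```python
import threading


def run_every(interval_seconds, action, stop_event, max_runs=None):
    """Call action() repeatedly, resting interval_seconds between runs.

    Stops when stop_event is set, or after max_runs runs (handy for testing).
    Returns the number of times action() was called.
    """
    runs = 0
    while not stop_event.is_set():
        action()  # the real work goes here
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Rest between runs, but wake up immediately if asked to stop.
        stop_event.wait(interval_seconds)
    return runs


# Example: run a tiny action three times, a tenth of a second apart.
stop = threading.Event()
run_every(0.1, lambda: print("working..."), stop, max_runs=3)
```

Waiting on the event instead of calling a plain sleep means a stop request doesn’t have to sit around until the current rest period finishes.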
Don’t want to Cut-N-Paste? I don’t blame you – download the whole she-bang here.
(especially corrections – sometimes things get mangled in cut-n-paste)
Part 1 - Before we all start examining tools like second-rate doctors – let’s cover the Business Intelligence (Wikipedia for you skeptics) basics, shall we?
Business Intelligence. Business Intelligence Solution. Business Intelligence Data Analysis Tool. BI-this. BI-That. God Damn.
How often do we have to have that phrase shoved down our throats by faceless corporate product snake-oil salesmen, unscrupulous consultants, random shithead “experts” and brainless middle managers who just read about it in last month’s issue of Wired? I hate the term as an IT buzzword, but the underlying concept is essential these days in any industry and at every step on the big-money biz ladder. Hate it or not, “Business Intelligence” isn’t going anywhere – in fact, it’s been here in one way, shape, or form since the first caveman traded one of his stone axe blades for some otherwise unsanctioned cave-nookie.
Since those pioneering days of seeking more profitable cave transactions, the amount of data that has become available and stored about almost every aspect of a company’s business is just freaking staggering. Now it’s more important than ever to try and figure your shit out – and FAST. No time to wait for some slacker to come in to slowly generate those quarterly TPS reports. I mean, gee, who DOESN’T want to be able to constantly make sense of their data, identify trends, optimize their approach, and just generally decode the data hieroglyphics of yesterday to kick more ass tomorrow?
Doesn’t matter if we’re talking about sales, crimes, advertising campaign conversions, seasonal product performance, spice potency, or @Ed’s “re-tweetability”. Whatever it is that you measure / Whatever is important to you – make sure you’re always making the best possible decisions to get there based on the historical knowledge at hand. A good Business Intelligence solution (Ok, read that as “Business Intelligence IMPLEMENTATION” since, much to my dismay, nothing ever configures itself) can put you on the road to finally getting the big picture of your business data at any given moment and therefore being able to respond quicker, be more proactive, and generally avoid becoming putrid bizdev roadkill (or getting fired by some monopoly guy from the 10th floor).
“In 2010 – not having a solid Business Intelligence reporting framework in place is like driving blindfolded with a skunk on your lap and a trunk full of dead hookers.
You may be able to stall off disaster for a while by swinging the wheel wildly and honking the horn but you’re bound to get totally douched in the end.”
I could go on and on like a self-indulgent geek-asshole about the importance of meta-data in BI, post flow charts about “Business Intelligence Implementation Best Practices“, the importance of proper data warehousing, star schema, and other bullshit that won’t ever get you a date – but on a high-level all you REALLY need to know is this…
Word. Actually, in THIS economy we might as well say that a solid view of your data can mean the difference between being IN business and being OUT of business. Ouch. The truth hurts, don’t it? Anyways, let’s put that nasty thought behind us for now…
Moving on…
Most small to medium size companies (regardless of industry) have these 3 things in common:
Assuming that you’ve been living under a rock and vacationing in a tree for a few years – just take a look at a quick Google search for business intelligence or business intelligence tools. The top paid and unpaid results are pretty much what you’d expect: IBM, Oracle, Microsoft (who is actually doing some REALLY cool stuff these days), SAP / BusinessObjects, Cognos, etc. You get the idea. What do all these tools and “solutions” have in common? They all cost many thousands (if not millions) of dollars to license – and THEN you still have to figure out how to use and implement the damn things anyways.
I have to mention – you don’t really need any fancy product (free or otherwise) to help you with your BI woes – in fact, I’ve done it several times before using some custom (and clever – if I do say so myself) PHP, Python, and a variety of databases – but that was after I already had an intimate familiarity with the data, business logic, and desired results. Meaning that I had already mapped out what I wanted (aka “the questions I wanted to ask the system”) / what was there (“where to find my answers”) and THEN built it to these specs. So if you’ve got some rogue IT genius / data hacker on your staff – throw him a month’s supply of RedBull and go for it – OR you could just hire me (besides, I’m affordable, funny, and probably sexier than your IT staffers), but I digress. :) Barring that, a decent “BI Tool” can usually make the exploration / building / troubleshooting process much easier.
Oh, that’s right – I just dropped a feel-good knowledge bomb on you of EPIC proportions. I guarantee that if you look people in the eyes and smile at them – regardless of whether it’s a cashier, toll booth worker, random ass-pony on the street, cute girls (or dudes) at a bar, or even the fuggin’ mailman – they will smile back in a genuine fashion probably 92.5% of the time (the remaining 7.5% are probably just miserable douchebags). Try it. It’s an easy experiment that has a really “human” payoff.
Think about that the next time you’re in a meeting where the talking heads are trying to figure out how to “make big money on this social media thing”. If you’re still thinking like that you’ve (almost) already lost. Drink a Diet Coke and reboot. Just communicate genuinely with people and they’ll communicate back. Think small-town hardware store, except on an online, global scale. THAT’S how business is going to be done in 2010 and beyond.
That’s real fucking shit. I’m sorry @garyvee – but it’s true.
What do you think?
TrueBlood. Killer show. Let’s face it. It really is a weird-ass perfect storm of timing, promotion, and just straight up quality storytelling. Don’t even dare to compare it with those Justin-Bieber-esque-Saved-by-the-Bell-type-Vampire-Twilight-Movies that are currently sweeping the minds and loins of young teens these days. They are an inferior product – vampires (fictional or not) do not “sparkle like diamonds” in the sunlight (no matter HOW blatantly gay they might be), it’s just Nickelodeon brand bullshit. (Damn kids these days!)
Anyways, the highly acclaimed HBO show is in its 3rd season now and has been a huge success (and hasn’t really even “jumped the shark” yet). The wicked popularity and ratings can be attributed to many reasons I think – but I’ll just boil them down to a few simple ones…
It’s not exactly the most politically correct “safe” choice for Sunday night viewing. Honestly, that’s why we LOVE the hell out of it. I’ll admit, it took a few episodes for me to “get into” it – but once you’ve acclimated yourself to their world and particular brand of southern Louisiana bestiality – you’re HOOKED.
How does that relate to your blogging (or lack thereof)? Who are you talking TO, who are you talking LIKE? What are you afraid of?
People don’t have to be cute, blonde telepaths to be able to smell bullshit.
No sir. You’ll be much more interesting to your target audience when you stop posing as something that you probably are not. We’re more like the characters in the show than you think (well, minus most of the murdering and mythical abilities anyways) – each of us is unique, troubled, and has a different story to tell. We are taught growing up to tell people what they want to hear, say what we’re supposed to say, and not color outside the lines. Bound by Status quo – Social contract – and Baby Boomer expectations. All this is supposed to be that glue that holds “society” together. Bullshit. It’s social manipulation. It’s the glue that keeps us from moving forward.
All of Generation X and Generation Y – we’re all a bit “off” to begin with. Us Gen-Xers are slowly taking the reins away from the retiring Baby Boomers, and we don’t need to get ‘pitched to’ the same way. Be a damn human, talk to us, and if your shit is good and we like you – we buy. It’s as simple as that. No need for 3-piece suits, dazzlingly clean copywriting full of $13 adjectives, or Hard Sell verbiage.
Give it a try – let your fans get to know you, trust you, and (ultimately) maybe even buy from you. Feed them like little baby birds. Mmmm freshly regurgitated worms…