The Best Team in Baseball

On Saturday the Montreal tech community came together at Notman House for Montreal Baseball Hackday. The event was organized by Plank Design to celebrate Montreal’s baseball heritage and the love of the game. The fact that baseball junkies love data and stats makes baseball and hacking a natural pair.

I showed up at the event with a few ideas for using fuzzy logic in a project. My friend Aran Rasmussen suggested we form a team to take on one of the challenge projects: “Prove that the 1994 Expos were the best team in baseball.” We were joined by Reda Lofti, who helped us out with the HTML and CSS for the project. It was Reda’s first hackday ever, and he’d only learned HTML the month before.

For Montreal baseball fans, 1994 is the championship year that never was. The season ended in August due to the crippling baseball strike that led to the first cancellation of the World Series in almost a century. The Montreal Expos, who’d shown strong results for the previous few years, led the league at the time of the shutdown. Many people think they would have been world champions if the season hadn’t ended prematurely.

But a lot happens in baseball between August and October. Wins and losses mid-season don’t really count. If you’re trying to argue that the Expos were the best team in baseball in 1994, how do you do it?

Fuzzy scoring

We took on the challenge by reframing it this way: what combination of statistics and weights, when presented to a fuzzy logic agent, ranks the Expos as the number 1 team? While Aran scoured the Web for stats from 1994, I started putting together a fuzzy agent strategy that would work.

If you’re unfamiliar with fuzzy logic, here’s the short description: a fuzzy agent accepts a number of input variables and maps them onto fuzzy sets — intuitive terms from the problem domain. It then uses a set of fuzzy rules to reason about the input variables and produce output fuzzy memberships. The output fuzzy values are then defuzzified into a single crisp score.
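To make those three steps concrete, here is a minimal sketch in Python. It is not the code from our project: the three input sets, the run-differential range of -150 to 150, and the output peaks at 0, 5, and 10 are all assumptions chosen for illustration, and the defuzzifier is a simple weighted average of output peaks rather than a full centroid calculation.

```python
def triangle(x, left, peak, right):
    """Degree of membership (0..1) of x in a triangular fuzzy set."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical fuzzy sets for one input (run differential, assumed range -150..150).
INPUT_SETS = {
    "low": (-300, -150, 0),
    "medium": (-150, 0, 150),
    "high": (0, 150, 300),
}

# Fuzzy rules: IF input IS <key> THEN score IS <value>.
RULES = {"low": "bad", "medium": "average", "high": "good"}

# Assumed peak of each output set on the 0..10 score scale.
OUTPUT_PEAKS = {"bad": 0.0, "average": 5.0, "good": 10.0}

def score(run_differential):
    # Clamp to the assumed range so at least one set always fires.
    x = max(-150, min(150, run_differential))
    # 1. Fuzzify: map the crisp input onto the input sets.
    memberships = {name: triangle(x, *shape) for name, shape in INPUT_SETS.items()}
    # 2. Apply rules: each rule passes its input membership to an output set.
    fired = {}
    for in_set, out_set in RULES.items():
        fired[out_set] = max(fired.get(out_set, 0.0), memberships[in_set])
    # 3. Defuzzify: weighted average of the output peaks.
    total = sum(fired.values())
    return sum(OUTPUT_PEAKS[s] * m for s, m in fired.items()) / total
```

With these made-up sets, a run differential of 75 is half "medium" and half "high", so it lands halfway between the "average" and "good" peaks.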

For our project, I decided to use as an output a score between 0 and 10, showing how “good” a team in 1994 was. We’ve found a few problem domains where this kind of unitless output is helpful.

For input values, Aran managed to come up with seven important stats that baseball aficionados use to compare teams and predict future performance:

  • Run differential
  • ERA
  • OPS
  • Speed score
  • Strikeout-to-walk percentage
  • WHIP
  • RA9-WAR

Some of these are familiar to any baseball fan; others are only relevant to the most hardcore SABR fanatic. But we wanted to pick numbers that were commonly used to say who’s the best team.

Aran boiled down the stats to a single table that we used for input to the fuzzy agent. I then broke down each input variable into 5 fuzzy sets: “veryLow”, “low”, “medium”, “high”, and “veryHigh”. A casual review showed that each statistic varied linearly, so I just divided each stat’s range into 5 sets of equal size.
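One way to build five equal-size sets over a statistic's range looks roughly like this. It is a sketch under assumptions: I'm treating each set as a triangle with evenly spaced peaks, and the ERA range of 3.0 to 5.0 in the example is invented, not taken from the 1994 table.

```python
SET_NAMES = ["veryLow", "low", "medium", "high", "veryHigh"]

def five_sets(lowest, highest):
    """Return five triangular sets as (left, peak, right) tuples.

    Peaks are evenly spaced from the lowest to the highest observed value,
    and each triangle reaches down to its neighbours' peaks.
    """
    step = (highest - lowest) / 4
    return {
        name: (lowest + (i - 1) * step, lowest + i * step, lowest + (i + 1) * step)
        for i, name in enumerate(SET_NAMES)
    }

# Hypothetical ERA range, for illustration only.
era_sets = five_sets(3.0, 5.0)
```

Here `era_sets["medium"]` peaks at 4.0, the midpoint of the assumed range, and `era_sets["veryLow"]` peaks at the low end, 3.0.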

Some of the inputs, like ERA, varied inversely with our output score. A low ERA shows a better team, and a high ERA shows a worse team. But most of them varied proportionally: a higher run differential shows a better team. So I mapped each of the inputs to an output using simple rules.
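The rule mapping can be sketched as a lookup from input set to output set, flipped for the stats where lower is better. I'm assuming here that the output score uses the same five set names as the inputs; the function name is hypothetical.

```python
LEVELS = ["veryLow", "low", "medium", "high", "veryHigh"]

def rules_for(higher_is_better):
    """Map each input set to an output score set.

    Proportional stats map straight across; inverse stats map in reverse.
    """
    outputs = LEVELS if higher_is_better else list(reversed(LEVELS))
    return dict(zip(LEVELS, outputs))

run_differential_rules = rules_for(higher_is_better=True)
era_rules = rules_for(higher_is_better=False)
```

So a "veryHigh" run differential gives a "veryHigh" score, while a "veryHigh" ERA gives a "veryLow" one.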

Varying the weights

The point of our project, however, was to help a user pick their argument points to show that the ’94 Expos were the best. There are techniques for optimizing the weights of your fuzzy rules so that the agent produces expected scores from training data. However, we wanted to give the user an interface to vary their own weights.

So we put together some radio buttons labelled: “Ignore”, “Low”, “Medium”, and “High”. (We changed them later, but you get the idea.) Each button represented a relative weight for rules based on that input: 0%, 25%, 50%, and 100%. When the user changes an input, we post the new weights to the back end; it then makes a new fuzzy agent with those rule weights, and scores each of the teams from the 1994 season — with corresponding data. It returns the values to the front-end, which then shows them in a sorted table.
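Here is a simplified sketch of that weighting and ranking step. It doesn't call the fuzzy.io API. Instead it assumes each stat has already been turned into its own 0 to 10 sub-score and combines them with a weighted average, which is only an approximation of what weighted fuzzy rules do. The team names and numbers are placeholders, not 1994 data.

```python
# Relative rule weight for each radio button.
WEIGHTS = {"Ignore": 0.0, "Low": 0.25, "Medium": 0.5, "High": 1.0}

def overall(sub_scores, choices):
    """Weighted average of per-stat sub-scores (0..10) for one team."""
    total_weight = sum(WEIGHTS[choices[stat]] for stat in sub_scores)
    if total_weight == 0:
        return 0.0  # every stat ignored: nothing to score
    weighted = sum(WEIGHTS[choices[stat]] * value for stat, value in sub_scores.items())
    return weighted / total_weight

def rank(teams, choices):
    """Score every team and return (name, score) pairs, best first."""
    scored = [(name, overall(subs, choices)) for name, subs in teams.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder sub-scores, invented for illustration.
teams = {
    "Team A": {"runDifferential": 8.0, "era": 4.0},
    "Team B": {"runDifferential": 5.0, "era": 9.0},
}

by_runs = rank(teams, {"runDifferential": "High", "era": "Ignore"})
by_era = rank(teams, {"runDifferential": "Ignore", "era": "High"})
```

With these placeholder numbers, Team A comes out on top when only run differential counts and Team B when only ERA counts, which is the whole trick: the user chooses the weights that make their argument.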

You can see the results here: http://best-team-in-baseball.com/ . All the code is on GitHub at https://github.com/evanp/the-best-team. You’ll need your own fuzzy.io API key to make the code work.

I had a lot of fun doing this project. Aran and Reda were fun to work with, and the baseball hackday was a blast. (A bag of Cracker Jack in the 9th inning was what I needed to get through the day!) Baseball is a good example of a mix between hard data and user wisdom, which is an area where fuzzy logic shines. I’m looking forward to seeing what other ways we can apply fuzzy logic to baseball stats.
