Examining the Run Differential in Baseball

‘Enjoying success requires the ability to adapt. Only by being open to change will you have a true opportunity to get the most from your talent.’

Nolan Ryan

Major League Baseball defines run differential as a cumulative team statistic that combines offensive and defensive performance. A team’s run differential is determined by subtracting the total number of runs (both earned and unearned) it has allowed from the number of runs it has scored. Source

This performance measure helps identify teams that are overachieving or underachieving over the course of a season. Teams that produce a high run differential usually have a favorable chance of making the playoffs at the end of the season. While the measure does have its limitations, it serves as a good barometer of a team’s productivity.

For example, a team with an average or slightly negative run differential may still be at the top of its division or in the playoff hunt. A team that wins many one-run games will post a low cumulative run differential yet can still sit at the top of the division. The Texas Rangers proved that very scenario: in 2016 they did very well in one-run games, finishing the season with a run differential of +8 (14th in the Majors) but with the best record in the American League.
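To make the one-run-game effect concrete, here is a minimal Python sketch. The scores are hypothetical and chosen only for illustration; they are not real game data.

```python
# Hypothetical (runs_for, runs_against) results for one team; not real game data.
games = [(3, 2), (5, 4), (2, 1), (4, 3), (1, 9), (6, 5)]

# Count wins and losses game by game.
wins = sum(1 for runs_for, runs_against in games if runs_for > runs_against)
losses = sum(1 for runs_for, runs_against in games if runs_for < runs_against)

# Run differential is total runs scored minus total runs allowed.
run_diff = sum(runs_for - runs_against for runs_for, runs_against in games)

print(f"Record: {wins}-{losses}, run differential: {run_diff:+d}")
```

Five narrow wins and a single blowout loss leave this made-up team with a winning record and a negative run differential, which is the pattern described above.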


The Dataset

MLB standings data was acquired from The Baseball Cube online data store (thebaseballcube.com).

The standings dataset lists MLB team performance going back to 1957. For every team and every day of the season, you know the final score of each game, which gives you the necessary elements to calculate each team’s run differential.

The following data elements are found in the main dataset:

  • Game Date
  • Year
  • TeamID
  • League
  • Division
  • Wins
  • Losses
  • DivisionRank
  • LeagueRank
  • MLBRank
  • Games Back
  • Home Record
  • Away Record
  • Runs For (*)
  • Runs Against (*)
  • Playoffs
  • Attendance
  • Streak

Below is a data snip of what that data looks like from a csv file:

The day-to-day data gives us a great opportunity to plot run differential over a longer period of time. This means that we can see the performance of every MLB team throughout the whole season. We should be able to see the teams on a ‘hot’ streak and those that collapse towards the end of their seasons. Since the game date is listed in YYYYMMDD format, we will need to use Tableau to express every game day in numerical form, as in Game 1, Game 2 and so on. Most teams play their third, fourth or fortieth game on different days, so it’s imperative to convert dates to game numbers to establish a proper comparison across teams. (More on this later in this page)
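Outside of Tableau, the date-to-game-number idea can be sketched in Python. This is only an illustration, not the calculation used in the dashboard: it assumes one row per team per game played, and the dictionary keys (GameDate, Year, TeamID, GameNumber) and the sample dates are hypothetical stand-ins for the dataset’s columns.

```python
from collections import defaultdict
from datetime import datetime


def add_game_numbers(rows):
    """Return copies of rows with a 1-based GameNumber per (Year, TeamID)."""
    by_team_season = defaultdict(list)
    for row in rows:
        by_team_season[(row["Year"], row["TeamID"])].append(row)

    numbered = []
    for season_rows in by_team_season.values():
        # Parse YYYYMMDD so a malformed date raises instead of sorting silently.
        ordered = sorted(
            season_rows,
            key=lambda r: datetime.strptime(r["GameDate"], "%Y%m%d"),
        )
        for game_number, row in enumerate(ordered, start=1):
            numbered.append({**row, "GameNumber": game_number})
    return numbered


# Hypothetical sample rows for illustration only.
sample = [
    {"Year": 2016, "TeamID": "ATL", "GameDate": "20160406"},
    {"Year": 2016, "TeamID": "ATL", "GameDate": "20160404"},
    {"Year": 2016, "TeamID": "LAN", "GameDate": "20160405"},
]
for row in add_game_numbers(sample):
    print(row["TeamID"], row["GameDate"], "-> Game", row["GameNumber"])
```

Numbering by order within each team and season, rather than by calendar date, is what lets Game 40 for one team line up with Game 40 for another.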

Other Datasets Used:
  • Current Team Name Mapping
  • List of World Series Champions by Year

I added these two mapping files for a few reasons:

I noticed that our historical dataset from the Baseball Cube lists all teams by TEAMID strings that are specific to the year in question. So you will see Atlanta Braves historical data under the TEAMID ‘ATL’, but you will also see the franchise’s earlier data under the TEAMID ‘MLN’ for the Milwaukee Braves, before they moved to Atlanta, Georgia. Another example is the Los Angeles Dodgers (TEAMID ‘LAN’), who are also identified under TEAMID ‘BRO’ for the Brooklyn Dodgers. I created a team mapping file that groups all of these past teams under their current active team name. This data decision simply makes it easier to compare and examine team performance over time.

I also anticipate that reviewing run differential over time will raise more questions when analyzing cohorts of teams. So I created a dataset listing the World Series champions by year, to review the run differential measure for the teams that won the World Series.
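As a rough illustration of the team mapping idea, here is a small Python sketch. It covers only the two examples mentioned above; the real mapping file holds more franchises, and the names CURRENT_TEAM and current_team_id are hypothetical.

```python
# Partial, illustrative mapping from a historical TEAMID to the TEAMID of the
# current franchise. Only the two examples from the text are included.
CURRENT_TEAM = {
    "MLN": "ATL",  # Milwaukee Braves -> Atlanta Braves
    "BRO": "LAN",  # Brooklyn Dodgers -> Los Angeles Dodgers
}


def current_team_id(team_id: str) -> str:
    """Return the current franchise TEAMID, or the input if it is unmapped."""
    return CURRENT_TEAM.get(team_id, team_id)


print(current_team_id("MLN"))
print(current_team_id("ATL"))
```

Falling back to the original TEAMID keeps teams that never relocated unchanged.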

Using Tableau, I joined all three datasets. I left joined the standings dataset (which we will refer to as our activity dataset) to the team mapping list. I then used a relationship (the Tableau default) to the WS Champions dataset to create a flag for the teams that won the World Series.

Above is a quick glimpse of the data screen from the Tableau desktop file where I joined the data sets. This will allow me to have one complete file for Tableau to work with while preserving the number of records in the original base dataset file (the standings).
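For readers who want to reproduce the join logic outside Tableau, here is a minimal Python sketch under some assumptions. The row keys (Year, TeamID, WSChamp) and the function name flag_champions are hypothetical, and the lookup holds a single year for illustration. The point is that every standings row is kept, as with a left join, and only the flag is added.

```python
def flag_champions(standings, champions_by_year):
    """Return a copy of every standings row with a 'Y'/'N' WSChamp flag."""
    flagged = []
    for row in standings:
        # A year missing from the lookup simply yields 'N' for every team.
        is_champion = champions_by_year.get(row["Year"]) == row["TeamID"]
        flagged.append({**row, "WSChamp": "Y" if is_champion else "N"})
    return flagged


# The Milwaukee Braves (TEAMID 'MLN') won the 1957 World Series.
champions_by_year = {1957: "MLN"}

# Hypothetical, trimmed-down standings rows for illustration.
standings = [
    {"Year": 1957, "TeamID": "MLN"},
    {"Year": 1957, "TeamID": "BRO"},
]

flagged = flag_champions(standings, champions_by_year)
for row in flagged:
    print(row["Year"], row["TeamID"], row["WSChamp"])
```

Because the output has exactly one row per input row, the record count of the base standings file is preserved.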

Strategize it and then Viz it!

One thing I like to do before building in Tableau is to understand the questions I would like answered. Even in the workplace, if you are developing a reporting visualization for a business team, it is very important to strategize how stakeholders will use what you create. As a data steward, you want to make sure that visualizations lead to resolutions, not confusion. So in this case, what are the types of questions we want answered? We are going to visualize MLB statistics, and more specifically team run differentials since 1957. Here are a few questions I would like to answer:

  • Which season was the best season for each team?
  • Do all past World Series champions have positive run differentials?
  • Combining past seasons, which teams have the best run differentials?
  • Which season/team combinations had the largest increase or decrease in run differential?
  • Are there instances where a team’s run differential contrasted with its division standing?

What I envision is a line chart over a period of time, where each mark/line represents an MLB team and the period of time is the games in a season. As the games go by, you will see the talented teams rise while those that can’t keep up fall below them, sometimes into negative territory.


Tableau: Time to Build

In order to create the run differential visualization I am looking for, I created the following calculations on top of the joined dataset explained above:

  • Run Differential: a simple calc of runsfor minus runsagainst
  • Game Number: a DateDiff calculation comparing the first game date of the year with each game date as the season elapses
  • Season End Division Rank: this isolates the division rank of each team at its final game. I used a combination of a few LOD calcs to find the division rank at season end and assign that rank to every row for that team and year
  • WS Champ Flag: this identifies the World Series champion for each year. If a row matches the champions list via the join in the data pane, an indicator of ‘Y’ is applied, and ‘N’ is applied to the rest of the teams that year
  • I also downloaded some png files from the MLB website. These are team logos for the current active teams, which I uploaded as shapes to be used in the visualization
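The Run Differential and Season End Division Rank calculations above can be approximated in Python as follows. This is a sketch of the same idea, not a translation of the actual Tableau LOD expressions. The column names (RunsFor, RunsAgainst, DivisionRank, GameDate, Year, TeamID) are assumed stand-ins for the dataset’s fields, and the sample values are made up.

```python
def add_calculations(rows):
    """Add RunDiff and SeasonEndDivisionRank to a copy of every row."""
    last_date = {}
    final_rank = {}
    for row in rows:
        key = (row["Year"], row["TeamID"])
        # YYYYMMDD strings compare correctly in plain string order.
        if key not in last_date or row["GameDate"] > last_date[key]:
            last_date[key] = row["GameDate"]
            final_rank[key] = row["DivisionRank"]

    return [
        {
            **row,
            "RunDiff": row["RunsFor"] - row["RunsAgainst"],
            # Same value on every row of a team's season, like a FIXED LOD.
            "SeasonEndDivisionRank": final_rank[(row["Year"], row["TeamID"])],
        }
        for row in rows
    ]


# Hypothetical rows for one team-season, for illustration only.
sample = [
    {"Year": 2016, "TeamID": "AAA", "GameDate": "20160404",
     "RunsFor": 3, "RunsAgainst": 5, "DivisionRank": 4},
    {"Year": 2016, "TeamID": "AAA", "GameDate": "20161002",
     "RunsFor": 700, "RunsAgainst": 650, "DivisionRank": 1},
]
for row in add_calculations(sample):
    print(row["GameDate"], row["RunDiff"], row["SeasonEndDivisionRank"])
```

Stamping the final rank on every row is what makes it possible to color or filter a whole season’s line by where the team finished.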

You can see the full build in the video below as well as some questions I answer with the dashboard.

MLB Run Differential (Blog Version)

Feel free to play with the filters and download it to recreate the build yourself.

Things to follow:

  • Using the same data to create a story dashboard
  • Recreating these Tableau table calcs in Alteryx

Thank you very much for your attention and please feel free to drop me a comment or any feedback. I appreciate you and look forward to networking with you in the future.

-Ralph
