Sabermetrics and Math: How sports can teach statistics



Mental arithmetic.

Do those words scare you? If they do, you’re in good company. Mathematical anxiety is a well-studied phenomenon that manifests for a number of different reasons. It’s an issue I’ve talked about before at length, and something that frustrates me no end. In my opinion, though, one of the biggest culprits is how math alienates people. Let’s try an example:

If the average of three distinct positive integers is 22, what is the largest possible value of any one of these three integers?
A: 64
B: 63
C: 33
D: 42
E: 48

Too easy? How about this one:

The average of the integers 24, 6, 12, x and y is 11. What is the value of the sum x + y?

A: 11
B: 17
C: 13
D: 15

I do statistics regularly, and I find these tricky. Not because the underlying math is hard or because they’re fundamentally “difficult,” but because you have to read each question three or four times just to figure out what it’s asking. This gets worse at higher levels, where you first need to understand the problem and then understand the math.*
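For the record, once you untangle the wording, both questions come down to the same step: turn the average back into a total. Here is a minimal Python sketch of that arithmetic. The variable names are mine and are not part of the original questions.

```python
# Question 1: three distinct positive integers have an average of 22.
total_1 = 3 * 22                  # so the three integers must sum to 66
smallest_two = [1, 2]             # the two smallest distinct positive integers
largest = total_1 - sum(smallest_two)
print(largest)                    # 63, i.e. option B

# Question 2: 24, 6, 12, x and y have an average of 11.
total_2 = 5 * 11                  # so all five integers must sum to 55
x_plus_y = total_2 - (24 + 6 + 12)
print(x_plus_y)                   # 13, i.e. option C
```

Neither calculation is hard. The work is in spotting that "largest possible" means making the other two integers as small as the rules allow.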

Last week, my colleague Cristina Russo discussed how sports can be used to teach biology. Today I’m going to discuss a personal example, and how I use sports to explain statistics.

One of my main objectives as a statistics instructor is to take “fear” out of the equation (math joke!), and make my students comfortable with the underlying mathematical concepts. I’m not looking for everyone to become a statistician, but I do want them to be able to understand statistics in everyday life. Once they have mastered the underlying concepts, we can then apply them to new situations. Given that most of my students are athletically minded or have a basic understanding of sports, this is a logical place to start.

Hi, I'm Chris Neil and I'll be your instructor today
The mean number of teeth in adults is 32. The mean number of teeth among hockey players is considerably less | Chris Neil picture source: NHLPA

First, a little backstory. The world of sports has undergone a major shift in the past 20 years. While in the 50s and 60s it was a much smaller enterprise, it is now a multimillion-dollar business where player performance is vitally important. When every dollar counts, you use every tool at your disposal to maximise your assets – including recording everything you can (as documented in the book and film Moneyball). Shots, goals, assists, batting averages, yards gained, completions: you name it, there are stats available. But it’s not just owners, management and staff who use this information – armchair fans now use it to help them draft the best fantasy team possible, as there is a large amount of money to be won by competing in these leagues. As a result, a lot of data is freely available online.

Let me illustrate this with an example. One of the first concepts people learn about is the difference between the mean, the median and the mode.

To reiterate: the mean is the average value (the sum of the values divided by how many there are), the median is the middle value (which is useful if your data are very skewed), and the mode is the most common value. Typically, this is accompanied by an example of birth weight, or something somewhat relatable. However, it’s hard to understand why there is a difference between these numbers when they typically come out the same: much of the “example” data we use is either close to normally distributed, or skewed for some other, usually more convoluted, reason. But not so in the case of sports.
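To make the definitions concrete, here is a minimal Python sketch using the standard library's `statistics` module. The birth weights are made-up numbers chosen to be symmetric, not real data, and they show the problem with textbook examples: all three measures land on the same value.

```python
from statistics import mean, median, mode

# Made-up, symmetric "birth weights" in grams (illustrative only, not real data).
weights = [2900, 3100, 3300, 3400, 3400, 3400, 3500, 3700, 3900]

print(mean(weights))    # sum of the values divided by how many there are
print(median(weights))  # middle value once the data are sorted
print(mode(weights))    # most common value
```

With data like these, a student has no way to see why we bother with three different measures.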

Note: All examples use data on all players from the 2010-2011 NHL season. They were taken from Hockey-Reference, which has a great list of stats on the NHL going all the way back to 1917 (!).

Let’s start with age and look at the mean, median and modal values. The mean is 26.6, the median is 26.0 and the mode is 26. In other words, the mean age of players in the NHL is 26.6, the “middle value” for age is 26, and the most common age is 26. Graphically, it looks like this:

The ages of players in the 2010-11 NHL Season | Data from Hockey-Reference

Those are all very similar, which makes it difficult to see the difference between the values. However, all students have an intuitive understanding of age – they see most players are 20 to 30 years old, and there are very few who continue to play into their late 30s (except Teemu Selanne, who is actually Benjamin Button).

This changes when we look at another important statistic in hockey – goals. In this case, the mean is 7.5, the median is 4.0 and the mode is 0. This is interesting: the “average” number of goals scored in the NHL is 7.5 and the median, or “middle value,” is 4.0, but the most common value is 0, i.e. a large number of players in the NHL didn’t score any goals. The data are highly skewed and, more importantly, students can understand why, so they can dedicate their energy to understanding what that skew “means” in statistical terms.

The distribution of goals scored in the 2010-11 NHL season | Source: Hockey-Reference

Here, the concept of “skew” is very clear, and you can see that the most common number of goals scored in the NHL is 0, i.e. many players didn’t score any goals at all! This is considerably easier to understand than an example on blood pressure, birth weights, or mileage on cars, and takes the intimidation factor out of statistics.
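The same pattern is easy to reproduce on a toy data set. The sketch below uses made-up goal totals for 15 hypothetical players (not the Hockey-Reference data) and adds one simple way to put a number on skew, Pearson's second skewness coefficient, which I am introducing here only as an illustration.

```python
from statistics import mean, median, mode, stdev

# Made-up goal totals for 15 hypothetical players (not the Hockey-Reference data).
goals = [0, 0, 0, 0, 0, 1, 2, 3, 4, 6, 8, 11, 15, 22, 33]

print(mode(goals))    # 0: the most common outcome is no goals at all
print(median(goals))  # 3: half the players scored 3 or fewer
print(mean(goals))    # 7: a few high scorers pull the average upwards

# Pearson's second skewness coefficient: 3 * (mean - median) / standard deviation.
# A positive value means a long tail to the right (a handful of big scorers).
skew = 3 * (mean(goals) - median(goals)) / stdev(goals)
print(skew > 0)       # True
```

The ordering mode < median < mean is the signature of right-skewed data, which is the shape the real goals histogram shows.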

This is one example of how sports can be used to highlight a statistical concept that I find students struggle with. However, here’s where the real power of sports stats comes in: you can scale this up to cover advanced concepts. Want to compare means between groups (i.e. t-tests)? You can calculate the mean number of goals scored by forwards and defencemen and compare them (forwards score more goals). Need to do a chi-square test? Look at the number of forwards and defencemen on each team and test whether different teams have different numbers (they don’t). Need to talk about regression? Why not model goals scored against time on ice to see if more time results in more goals? The possibilities keep going from there.**
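As a rough sketch of two of these ideas, the Python below computes a Welch's t statistic (forwards versus defencemen) and a least-squares slope (goals against ice time) using only the standard library. Every number is made up for illustration, and the helper names `welch_t` and `ols_slope` are my own; this is not an analysis of the real NHL data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Made-up season goal totals for hypothetical players (illustrative only).
forward_goals = [12, 25, 8, 30, 17, 21, 5, 14]
defence_goals = [3, 7, 1, 10, 4, 6, 2, 5]
t_stat = welch_t(forward_goals, defence_goals)
print(t_stat > 0)   # True: the forwards' mean is higher in this toy sample

# Made-up average minutes on ice per game, paired with made-up goal totals.
minutes_per_game = [8, 11, 14, 16, 18, 20, 22]
season_goals = [2, 4, 7, 9, 15, 18, 26]
slope = ols_slope(minutes_per_game, season_goals)
print(slope > 0)    # True: more ice time goes with more goals in this toy sample
```

This stops at the test statistic and the slope. In a real class you would also want a p-value, and a library routine such as `scipy.stats.ttest_ind` (with `equal_var=False` for the Welch version) provides one.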

The thing I like the most about this approach is how accessible it makes things. Take away the intimidating part of math, and all of a sudden it’s not nearly as scary. You can change sports to pretty much anything else – baseball, football (association or gridiron), or even other widely available databases – movie revenue by genre, number of albums sold by pop artists, voter turnout in recent elections, whatever connects with your students. Once you’ve made the example relatable and have removed the “fear” part of the statistics equation, math can suddenly become much more interesting and engaging to students. And once they’re engaged, learning will become that much easier.


*I should point out: I’m not against difficult problems, as comprehension is an important skill to develop in order to apply statistics to new situations. But let’s leave that for another day, and not start there. The way we teach statistics and math now is like asking a toddler to do cartwheels on a balance beam above a lake of hungry alligators before they can walk.

**If you would like me to provide webinars/slideshares on statistical concepts in future posts, let me know in the comments.

Author: Mike Klymkowsky

I am a Professor of Molecular, Cellular, and Developmental Biology at the University of Colorado Boulder. Growing up in Pennsylvania, I earned a bachelor’s degree in biophysics from Penn State then moved to California and earned a Ph.D. from CalTech (working for a time at UCSF and the Haight-Ashbury Free Clinic). I was a Muscular Dystrophy Association post-doctoral fellow at University College London and the Rockefeller University before moving to Boulder. My research has involved a number of topics, including neurotransmitter receptor structure, cytoskeletal organization and ciliary function, neural crest formation, and signaling systems in the context of the clawed frog Xenopus laevis as well as biology education research, leading to the development of the Biological Concepts Instrument (BCI), a suite of virtual laboratory activities, and biofundamentals, a re-designed introductory molecular biology course. I have a close collaboration with Melanie Cooper (@Michigan State) that has resulted in transformed (and demonstrably effective and engaging) course materials in general and organic chemistry known as CLUE: Chemistry, Life, the Universe & Everything. I was in the first class of Pew Biomedical Scholars and am a Fellow of the American Association for the Advancement of Science.
