# September Book Club – Statistics, Beer and t-tests

In September, the Skeptics book club met to discuss a favorite topic of mine: statistics. We read “The Lady Tasting Tea” by David Salsburg and “What is a *p*-value Anyway?” by Andrew Vickers. The Salsburg book traces the historical evolution of statistics and spotlights the men and women who made modern statistics what it is today. In “What is a *p*-value Anyway?” Vickers takes readers through short examples of the use of statistics. Members of the group found these examples interesting and accessible.

For this blog entry, I’d like to discuss Statistics and Beer. The discerning reader may be questioning this relationship but, in fact, one of the most commonly used statistical tests, Student’s *t*-test, was developed in the quest to make a better beer.

**The Story of Student’s t-test**

In 1899, a man named William Sealy Gosset was hired to work at the Guinness brewery in Dublin, Ireland. At the time, Guinness was in the habit of hiring the leading chemists from Oxford. Fortunately for Guinness, Gosset had also earned a degree in mathematics from the university. Gosset proved to be an excellent administrator and eventually became head of greater London operations. Although Gosset was initially hired for his chemistry background, he quickly realized the value mathematics and statistics could bring to the beer making process. In his first published paper, he demonstrated the use of the *Poisson probability distribution* to solve the problem of counting yeast cells for consistent measurement in beer making (for those readers unfamiliar with yeast, it is constantly multiplying and dividing, so an accurate count of the number of yeast cells in a jar is nearly impossible). Gosset’s method of modeling yeast cells using a probability distribution allowed the factory to make more accurate assessments of the concentration of yeast cells, ultimately producing a much more consistent brew (so when your Guinness tastes just as refreshingly delicious as the time before, be thankful that Gosset was so smart!).

As impressive as that first paper was, it was not Gosset’s biggest contribution to science. In a 1908 paper entitled “The Probable Error of a Mean,” Gosset addressed the issue of making statistical inferences from small samples (as Gosset needed to do for small samples of beer). At the time, the leading statistical methods relied on very large samples of data. Gosset’s experience at Guinness led him to conclude that large sample sizes were not the norm in science; he wanted to know how to draw conclusions about the population in question when sample sizes are small. As a result, Gosset developed the *t*-test to test hypotheses about mean differences using small samples. This remarkable contribution brought about “small sample theory,” and Gosset’s test became the foundation for modern tests of statistical significance.

If Gosset developed the method, why is it called *Student’s* *t*-test?

At the time, Guinness did not allow employees to publish papers. Some statistical historians (including Salsburg) claim the reason for the publishing ban was that a former Guinness employee had published some trade secrets, prompting a rule banning all publications. Others argue that Guinness imposed this rule so that consumers would think of beer making as an artistic craft rather than a scientific process. Regardless of the reason, Gosset was forced to publish under a pen name, Student, and consequently, Gosset’s innovative method became known as “Student’s t-test”.

At the time of its initial publication, Gosset’s seminal article, “The Probable Error of a Mean,” was not celebrated or appreciated. It took the efforts of another famous statistician, Sir Ronald Fisher, to recognize the importance of the work and bring the *t*-test into the modern statistical paradigm. Fisher made three important advancements to Gosset’s work; he (1) proved the *t*-test, (2) embedded the *t*-test into a unified framework for testing statistical significance, and (3) transformed Gosset’s “*z*-score” (with Gosset’s input) into the version of the “*t*-score” we use today.

**So what exactly is Student’s t-test?**

Before delving into Student’s *t*-test, it is important to understand a few essential terms:

**Mean:** The mean is more commonly known as an arithmetic average. It is calculated by adding all the scores in a sample or population and dividing by the number of scores.

**Variance and Standard Deviation:** At the most basic level, variance and standard deviation are measures of dispersion. They answer the question: On average, how far away from the mean are the data? The standard deviation is simply the square root of the variance.

**Population:** The set of all individuals of interest in a particular study.

**Sample:** The set of individuals selected from the population, intended to represent the population in a research study.

**Standard Error:** The notion of the standard error can be difficult to understand. The standard error is a measure of how different estimates would be if we completed a study an infinite number of times. For example, imagine we were interested in determining the average height of residents of the state of Arizona. Our population would be residents of the state of Arizona. We could randomly sample 200 residents and compute their average height. If we did it again, and randomly sampled another 200 people, and computed their average height, we’d get a slightly different mean estimate. Because each time we are getting a slightly different sample from the same population, we wouldn’t necessarily expect any given estimate of the average to *exactly* equal the mean of the population. If we sampled repeatedly and estimated the mean each time, we would end up with a distribution of mean estimates. Even though we don’t expect any two sampled means to be exactly the same, the mean of all the estimates would equal the mean of the population. The standard error is just the standard deviation of that distribution of mean estimates, the one we would theoretically get if we sampled an infinite number of times. In simple terms, the standard error represents how stable we expect an estimate to be with repeated sampling and estimation. Because we never actually observe the standard error, we must estimate it. For mean estimates, the standard error is estimated as the sample standard deviation divided by the square root of the number of observations.
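To make the repeated-sampling idea concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption for illustration: the "population" is 100,000 simulated heights (normal, mean 170 cm, standard deviation 10 cm), not real Arizona data, and the names `sample_mean`, `empirical_se` and `estimated_se` are my own. It compares the standard deviation of many sampled means with the standard error estimated from a single sample of 200.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated heights in cm (an assumption).
population = [random.gauss(170, 10) for _ in range(100_000)]

def sample_mean(n):
    """Mean height of one random sample of n people from the population."""
    return statistics.fmean(random.sample(population, n))

n = 200

# Repeat the "study" many times and collect the mean from each one.
means = [sample_mean(n) for _ in range(2000)]

# The spread of those means is what the standard error describes.
empirical_se = statistics.stdev(means)

# In practice we only have one sample, so we estimate the standard error
# from it: sample standard deviation divided by the square root of n.
one_sample = random.sample(population, n)
estimated_se = statistics.stdev(one_sample) / math.sqrt(n)

print(f"SD of 2000 sample means:      {empirical_se:.3f}")
print(f"SE estimated from one sample: {estimated_se:.3f}")
```

The two printed numbers should land close to each other, which is the point: a single sample is enough to approximate how much the mean would bounce around across repeated studies.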

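Now, let’s return to Student’s *t*-test. Student’s *t*-test has a wide variety of applications. Most generally, the *t*-test is a statistic used to test hypotheses about mean differences. In the most basic test, we ask whether the population our sample came from has a mean different from a hypothesized value when the population standard deviation is unknown. We calculate the *t*-statistic as follows:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized population mean, $s$ is the sample standard deviation, and $n$ is the number of observations. The denominator, $s / \sqrt{n}$, is the estimated standard error of the mean.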

In the above expression, we are often testing the hypothesis that the population mean is zero (this is the test I will assume for the remainder of the example). In other words, we are testing the hypothesis that the sample mean differs from zero.

Thus, assuming we are testing whether our observed mean differs from a population with a mean of zero, the *t*-statistic becomes the sample mean divided by its estimated standard error:

$$t = \frac{\bar{x}}{s / \sqrt{n}}$$
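For readers who prefer code to formulas, the same calculation can be sketched in a few lines of Python. The function name `one_sample_t` and the ten data values are hypothetical and made up purely for illustration; the arithmetic is just the sample mean (minus the hypothesized mean, zero by default) divided by the sample standard deviation over the square root of the sample size.

```python
import math
import statistics

def one_sample_t(sample, mu0=0.0):
    """Return (t, degrees of freedom) for a one-sample t-test against mu0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)      # sample SD (n - 1 in the denominator)
    standard_error = sd / math.sqrt(n)
    t = (mean - mu0) / standard_error
    return t, n - 1

# Made-up measurements, for illustration only.
differences = [1.2, -0.4, 0.8, 2.1, 0.3, 1.5, -0.2, 0.9, 1.1, 0.6]

t, df = one_sample_t(differences)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

Note that the degrees of freedom for this test are simply the number of observations minus one.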

Now that we have our *t*-statistic, what do we do? We use the t-distribution (example shown below) to judge whether our sample plausibly came from a population with a mean of zero.

Each t-distribution is specific to the number of *degrees of freedom* (i.e. the distribution changes based on the number of participants in the sample; for this test, the degrees of freedom are the sample size minus one). We use the appropriate *t*-distribution, based on the number of people sampled, to determine which values of *t* we would expect to see if mean differences are simply due to random sampling (chance). What do I mean by “differences due to random sampling”? Recall from the definition of the standard error that the mean of any given sample may not exactly match the population mean, due to error inherent in the fact that our sample is only a subset of the population. In fact, there is a reasonable range of mean estimates that we expect. We use the *t*-test to determine this reasonable range.

The horizontal axis of the t-distribution shows a range of *t*-values. The vertical axis is the likelihood of observing a particular *t*-score (technically called the “probability density”). The area under the curve can be used to compute the probability of observing *t*-values in any given range. Values of *t* close to the center of the distribution, where the curve is the highest, represent very likely *t*-values that we would expect from a population with a mean of zero, whereas values of *t* near the tails are less likely. Thus, we can use our observed value of *t* to determine how probable a value at least that extreme would be if the sample came from a population with a mean of zero. We compute this probability by determining the area under the curve to the left of negative *t* and to the right of positive *t* (shown in red above).

As an aside, this is an example of a *two-tailed* probability test. In layman’s terms, we are just testing whether the sample is different from zero without any preconceived notion of whether the difference will be higher or lower than zero. Sometimes you will see reference to a *one-tailed test*, where we only consider probability on one end of the distribution.

In a two-tailed test, the area under the curve in red represents the probability of obtaining a *t*-value at least as extreme as ours if the sample really did come from the hypothesized population with a mean of zero, and we often call this value the “*p*-value”. Note that it is not the probability that the sample came from that population. As a general rule of thumb, researchers will often declare a test statistically significant when the red area under the curve is less than or equal to .05 (Note: I hate this rule of thumb and will be blogging about this in a future entry). In other words, assuming requisite assumptions are met\*, we can find the value of *t* beyond which there is only a 5% probability of obtaining a mean estimate as extreme as the one observed, assuming the null hypothesis is true (in this case the null hypothesis was that the population has a mean of zero). When the probability is very low, we reject the null hypothesis, which provides evidence for the alternative hypothesis (that the sample’s population mean is not zero).
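If you would like to see where that red area comes from, here is a rough sketch in Python that computes a two-tailed *p*-value directly from the t-distribution’s density curve. The function names `t_density` and `two_tailed_p` are my own, and the approach (adding up the area numerically with Simpson’s rule) is chosen for transparency; it is an assumption of this sketch, not how any particular statistics package does it.

```python
import math

def t_density(t, df):
    """Height of the t-distribution curve at t, for df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def two_tailed_p(t, df, steps=2000):
    """Area in both tails beyond -|t| and +|t| (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    # Simpson's rule for the area under the curve between 0 and |t|.
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    half_center = total * h / 3
    # The curve is symmetric, so the middle area is twice that; whatever
    # is left over out of a total area of 1 sits in the two tails.
    return max(0.0, 1.0 - 2.0 * half_center)

# Hypothetical example: an observed t of 2.5 from a sample of 10 people.
p = two_tailed_p(2.5, df=9)
print(f"two-tailed p-value: {p:.4f}")
print("significant at .05" if p <= 0.05 else "not significant at .05")
```

With 9 degrees of freedom, a *t* of about 2.262 puts almost exactly 5% of the area in the two tails, which is why that number shows up as the cutoff in printed *t*-tables for samples of ten.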

And that, my friends, is one example of how researchers might apply Student’s *t*-test.

\*Assumptions of the *t*-test include independence of observations and normally distributed data.