Statistics

Statistics is the set of mathematical tools and techniques that are used to analyze data. In genetics, statistical tests are crucial for determining if a particular chromosomal region is likely to contain a disease gene, for instance, or for expressing the certainty with which a treatment can be said to be effective.

Statistics is a relatively new science, with most of the important developments occurring within the last 100 years. Motivation for statistics as a formal scientific discipline came from a need to summarize and draw conclusions from experimental data. Sir Ronald Aylmer Fisher, Karl Pearson, and Sir Francis Galton, for example, each made significant contributions to early statistics in response to their need to analyze experimental agricultural and biological data. One of Fisher's interests was whether crop yield could be predicted from meteorological readings. This problem was one of several that motivated Fisher to develop some of the early methods of data analysis. Much of modern statistics can be categorized as exploratory data analysis, point estimation, or hypothesis testing.

The goal of exploratory data analysis is to summarize and visualize data in a way that facilitates the identification of trends or interesting patterns that are relevant to the question at hand. A fundamental exploratory data-analysis tool is the histogram, which describes the frequency with which various outcomes occur. Histograms summarize the distribution of the outcomes and facilitate the comparison of outcomes from different experiments. Histograms are usually drawn as bar plots, with the range of outcomes along the x-axis and the frequency of each outcome shown by the height of its bar along the y-axis. For instance, one might use a histogram to describe the number of people in a population with each of the different genotypes for the ApoE alleles, which influence the risk of Alzheimer's disease.
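
As a rough illustration of the histogram just described, the following Python sketch tallies how often each genotype occurs and prints a simple text bar plot. The genotype list and the helper names (histogram, print_histogram) are invented for this example and do not reflect real ApoE frequencies.

```python
from collections import Counter

# Hypothetical ApoE genotypes for 20 people (invented data, not real frequencies).
genotypes = [
    "e3/e3", "e3/e4", "e3/e3", "e2/e3", "e3/e3",
    "e3/e4", "e3/e3", "e4/e4", "e3/e3", "e2/e3",
    "e3/e3", "e3/e4", "e3/e3", "e3/e3", "e2/e4",
    "e3/e4", "e3/e3", "e3/e3", "e2/e3", "e3/e3",
]


def histogram(outcomes):
    """Return a mapping from each outcome to the number of times it occurs."""
    return Counter(outcomes)


def print_histogram(counts):
    """Print one bar per outcome; the bar length equals the frequency."""
    for outcome in sorted(counts):
        print(f"{outcome:6s} {'#' * counts[outcome]} ({counts[outcome]})")


counts = histogram(genotypes)
print_histogram(counts)
```

Each printed row plays the role of one bar: the outcome is the label and the number of # marks is its frequency.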

The range of outcomes from an experiment is also described mathematically by its central tendency and its dispersion. Central tendency is a measure of the center of the distribution. This can be characterized by the mean (the arithmetic average) of the outcomes or by the median, which is the value above and below which the number of outcomes is the same. The mean of 3, 4, and 8 is 5, whereas the median is 4. The median length of response in a gene therapy trial might be 30 days, meaning as many people had less than 30 days' benefit as had more than that. The mean might be considerably more if, for instance, one person benefited for 180 days.
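
The arithmetic above can be checked with Python's standard statistics module. The first list is the example from the text; the list of response durations is invented to show how a single long response pulls the mean above the median.

```python
from statistics import mean, median

# The worked example from the text.
values = [3, 4, 8]
print(mean(values))    # 5
print(median(values))  # 4

# Hypothetical response durations in days for five trial participants
# (invented data): one long responder raises the mean but not the median.
durations = [10, 20, 30, 40, 180]
print(median(durations))  # 30
print(mean(durations))    # 56
```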

Dispersion is a measure of how spread out the outcomes of the random variable are from their mean. It is characterized by the variance or standard deviation. The spread of the data can often be as important as the central tendency in estimating the value of the results. For instance, suppose the median number of errors in a gene-sequencing procedure was 3 per 10,000 bases sequenced. This error rate might be acceptable if the range that was found in 100 trials was between 0 and 5 errors, but it would be unacceptable if the range was between 0 and 150 errors. The occasional large number of errors makes the data from any particular procedure suspect.
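
To make the point about spread concrete, the sketch below compares two invented sets of sequencing runs that share a median of 3 errors but differ greatly in dispersion. The data and the summarize helper are assumptions made for illustration only.

```python
from statistics import median, stdev, variance

# Hypothetical error counts per 10,000 bases in seven runs (invented data).
consistent_runs = [1, 2, 3, 3, 3, 4, 5]
erratic_runs = [0, 1, 3, 3, 3, 4, 150]


def summarize(errors):
    """Return the median, range, sample variance, and sample standard deviation."""
    return {
        "median": median(errors),
        "range": (min(errors), max(errors)),
        "variance": variance(errors),
        "std_dev": stdev(errors),
    }


for name, runs in [("consistent", consistent_runs), ("erratic", erratic_runs)]:
    print(name, summarize(runs))
```

Both sets report the same median, so only the range, variance, and standard deviation reveal that the second procedure occasionally fails badly.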

Another important concept in statistics is that of populations and samples. The population represents every possible experimental unit that could be measured. For example, every zebra on the continent of Africa might represent a population. If we were interested in the mean genetic diversity of zebras in Africa, it would be nearly impossible to actually analyze the DNA of every single zebra; neither can we sequence the entire DNA of any individual. Therefore we must take a random selection of some smaller number of zebras and some smaller amount of DNA, and then use the mean differences among these zebras to make inferences about the mean diversity in the entire population.

Any summary measure of the data, such as the mean or variance in a subset of the population, is called a sample statistic. The corresponding summary measure of the entire population is called a population parameter. Therefore, we use statistics to estimate parameters. Much of statistics is concerned with the accuracy of parameter estimates. This is the statistical science of point estimation.
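
The relationship between a sample statistic and a population parameter can be sketched with a small simulation. Everything here is hypothetical: the population is a list of made-up diversity scores, and in a real study the population mean would be unknown, which is exactly why it has to be estimated from a sample.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: a diversity score for each of 10,000 zebras.
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]
population_mean = mean(population)  # the population parameter

# A simple random sample of 100 zebras and its mean (the sample statistic).
sample = rng.sample(population, 100)
sample_mean = mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The sample mean is a point estimate of the population mean; how close it tends to fall depends on the sample size and on the spread of the population.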

The final major discipline of statistics is hypothesis testing. All scientific investigations begin with a motivating question. For example, do identical twins have a higher likelihood than fraternal twins of both developing alcoholism?

From the question, two types of hypotheses are derived. The first is called the null hypothesis. This is generally a theory about the value of one or more population parameters and is the status quo, or what is commonly believed or accepted. In the case of the twins, the null hypothesis might be that the rates of concordance (i.e., both twins are or are not alcoholic) are the same for identical and fraternal twins. The alternate hypothesis is generally what you are trying to show. This might be that identical twins have a higher concordance rate for alcoholism, supporting a genetic basis for this disorder. It is important to note that statistics cannot prove one or the other hypothesis. Rather, statistics provides evidence from the data that supports one hypothesis or the other.

Much of hypothesis testing is concerned with making decisions about the null and alternate hypotheses. You collect the data, estimate the parameter, calculate a test statistic that summarizes the value of the parameter estimate, and then decide whether the value of the test statistic would be expected if the null hypothesis were true or the alternate hypothesis were true. In our case, we collect data on alcoholism in a limited number of twins (which we hope accurately represent the entire twin population) and decide whether the results we obtain better match the null hypothesis (no difference in rates) or the alternate hypothesis (higher rate in identical twins).

Of course, there is always a chance that you have made the wrong decision, that is, that you have interpreted your data incorrectly. In statistics, two types of errors can be made. A type I error occurs when the conclusion favors the alternate hypothesis although the null hypothesis is really true. A type II error is the converse: the conclusion favors the null hypothesis although the alternate hypothesis is really true. Thus a type I error is when you see something that is not there, and a type II error is when you do not see something that is really there. In general, type I errors are thought to be worse than type II errors, since you do not want to spend time and resources following up on a finding that is not true.
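
The four possible combinations of truth and decision can be written out as a small lookup. This is only a restatement of the definitions above; the function name classify_decision is invented for the example.

```python
def classify_decision(null_is_true, rejected_null):
    """Label the outcome of a test given the (usually unknown) truth."""
    if null_is_true and rejected_null:
        return "type I error"   # saw something that is not there
    if not null_is_true and not rejected_null:
        return "type II error"  # missed something that is really there
    return "correct decision"


for null_is_true in (True, False):
    for rejected_null in (True, False):
        print(null_is_true, rejected_null, classify_decision(null_is_true, rejected_null))
```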

How can we decide if we have made the right choice about accepting or rejecting our null hypothesis? These statistical decisions are often made by calculating a probability value, or p-value. P-values for many test statistics are easily calculated using a computer, thanks to the theoretical work of mathematical statisticians such as Jerzy Neyman.

A p-value is simply the probability of observing a test statistic at least as large as the one observed from your data, if the null hypothesis were really true. It is common in many statistical analyses to accept a type I error rate of one in twenty, or 0.05. This means that, when the null hypothesis is true, there is at most a one-in-twenty chance of rejecting it and so making a type I error.

To see what this means, let us imagine that our data show that identical twins have a 10 percent greater likelihood of being concordant for alcoholism than fraternal twins. Is this a significant enough difference that we should reject the null hypothesis of no difference between twin types? By examining the number of individuals tested and the variance in the data, we can come up with an estimate of the probability that we could obtain this difference by chance alone, even if the null hypothesis were true. If this probability is less than 0.05—if the likelihood of obtaining this difference by chance is less than one in twenty—then we reject the null hypothesis in favor of the alternate hypothesis.
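
One way to estimate the probability of obtaining such a difference by chance alone is a permutation test, sketched below. This is an illustrative technique chosen for the example, not necessarily the test a twin study would use. The counts are invented, the 10 percent difference is read as ten percentage points (40 percent versus 30 percent concordance), and the function name permutation_p_value is hypothetical.

```python
import random


def permutation_p_value(concordant_a, concordant_b, n_pairs, n_permutations=5000, seed=0):
    """One-sided p-value for 'group A has a higher concordance rate than group B'.

    Both groups contain n_pairs twin pairs. With equal group sizes, comparing
    the count of concordant pairs in group A is equivalent to comparing the
    difference in rates between the groups.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total_concordant = concordant_a + concordant_b
    # Under the null hypothesis the group labels are arbitrary, so pool all
    # pairs (1 = concordant, 0 = discordant) and repeatedly deal n_pairs of
    # them at random to group A.
    pooled = [1] * total_concordant + [0] * (2 * n_pairs - total_concordant)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        if sum(rng.sample(pooled, n_pairs)) >= concordant_a:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Invented data: 40% versus 30% concordance in studies of two different sizes.
p_small_study = permutation_p_value(40, 30, n_pairs=100)
p_large_study = permutation_p_value(160, 120, n_pairs=400)
print(f"100 pairs per group: p = {p_small_study:.3f}")
print(f"400 pairs per group: p = {p_large_study:.3f}")
```

With these invented counts, the same ten-point difference falls short of the 0.05 threshold in the smaller study and passes it in the larger one, which is why the number of individuals tested matters as much as the size of the difference.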

Prior to carrying out a scientific investigation and a statistical analysis of the resulting data, it is possible to get a feel for your chances of seeing something if it is really there to see. This is referred to as the power of a study and is simply one minus the probability of making a type II error. A commonly accepted power for a study is 80 percent or greater. That is, you would like to know that you have at least an 80 percent chance of seeing something if it is really there. Increasing the size of the random sample from the population is perhaps the best way to improve the power of a study. The closer your sample is to the true population size, the more likely you are to see something if it is really there.
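
The effect of sample size on power can be explored by simulation before any data are collected. The sketch below assumes true concordance rates of 40 percent and 30 percent and uses a pooled two-proportion z-test as the decision rule. The rates, the test, and the function names are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(successes_a, successes_b, n):
    """Approximate p-value from a pooled two-proportion z-test (equal group sizes)."""
    pooled = (successes_a + successes_b) / (2 * n)
    standard_error = sqrt(2 * pooled * (1 - pooled) / n)
    if standard_error == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (successes_a - successes_b) / n / standard_error
    return 1 - NormalDist().cdf(z)


def estimate_power(n, rate_a=0.4, rate_b=0.3, alpha=0.05, n_studies=2000, seed=0):
    """Fraction of simulated studies that reject the null when a real difference exists."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rejections = 0
    for _ in range(n_studies):
        a = sum(rng.random() < rate_a for _ in range(n))
        b = sum(rng.random() < rate_b for _ in range(n))
        if one_sided_p_value(a, b, n) < alpha:
            rejections += 1
    return rejections / n_studies


# Power is one minus the type II error rate; estimate it for several sample sizes.
power_by_n = {n: estimate_power(n) for n in (50, 100, 200, 400)}
for n, power in power_by_n.items():
    print(f"n = {n:3d} per group: estimated power = {power:.2f}")
```

Under these assumptions the estimated power rises with the number of pairs per group, and only the largest sample size shown clears the commonly used 80 percent threshold.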

Thus, statistics is a relatively new scientific discipline that uses both mathematics and philosophy for exploratory data analysis, point estimation, and hypothesis testing. Its ultimate utility lies in making decisions about hypotheses and thereby drawing inferences about the answers to scientific questions.

Jason H. Moore

Bibliography

Gonick, Larry, and Woollcott Smith. The Cartoon Guide to Statistics. New York: HarperCollins, 1993.

Jaisingh, Lloyd R. Statistics for the Utterly Confused. New York: McGraw-Hill, 2000.

Salsburg, David. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. New York: W. H. Freeman, 2001.

Internet Resource

HyperStat Online: An Introductory Statistics Book and Online Tutorial for Help in Statistics Courses. David M. Lane, ed. <http://davidmlane.com/hyperstat/>.