
Probability

The Statistical Geneticist And The Chi-square Test

Researchers often want to know whether one particular gene occurs in a population more or less frequently than another. This may help them determine, for example, whether the gene in question causes a particular disease. For a dominant gene, such as the one that causes Huntington's disease, the frequency of the disease can be used to determine the frequency of the gene, since everyone who has the gene will eventually develop the disease. However, finding every case of Huntington's disease is practically impossible, because doing so would require knowing the medical condition of every person in a population. Instead, genetic researchers sample a small subset of the population that they believe is representative of the whole. (The same technique is used in political polling.)

Whenever a sample is used, the possibility exists that it is unrepresentative and generates misleading data. Statisticians have a variety of methods to minimize sampling error, including sampling at random and using large samples. But sampling error cannot be eliminated entirely, so data from a sample must be reported not as a single number but as a range, known as a confidence interval, that conveys the precision and possible error of the estimate. Instead of saying the prevalence of Huntington's disease in a population is 10 per 100,000 people, a researcher would say the prevalence is 7.8 to 12.1 per 100,000 people.
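
As a rough illustration of how such a range can be computed, the sketch below uses the normal approximation for a proportion, which is only one of several possible methods. The counts (80 cases among 800,000 people sampled) and the function name are hypothetical, chosen for illustration, and are not the data behind the figures quoted above.

```python
import math

def prevalence_interval(cases, sample_size, z=1.96):
    """Approximate 95% range for a prevalence, per 100,000 people.

    Uses the normal approximation: p +/- z * sqrt(p * (1 - p) / n).
    z = 1.96 corresponds to a 95% confidence level.
    """
    p = cases / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return (p - margin) * 100_000, (p + margin) * 100_000

# Hypothetical sample: 80 cases found among 800,000 people.
low, high = prevalence_interval(80, 800_000)
print(f"Prevalence: {low:.1f} to {high:.1f} per 100,000")
```

With these assumed counts the point estimate is 10 per 100,000 and the range comes out at roughly 7.8 to 12.2. A larger sample would narrow the range, and a smaller one would widen it.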

The potential for errors in sampling also means that statistical tests must be conducted to determine if two numbers are close enough to be considered the same. When we take two samples, even if they are both from exactly the same population, there will always be slight differences in the samples that will make the results differ.
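
A small simulation can make this concrete. The sketch below repeatedly samples from one imaginary population with a fixed true prevalence and prints the prevalence each sample would report. The true prevalence, the sample size, and the random seed are all assumptions chosen for illustration.

```python
import random

def count_cases(rng, sample_size, true_prevalence):
    """Count how many people in one random sample have the disease."""
    return sum(1 for _ in range(sample_size) if rng.random() < true_prevalence)

rng = random.Random(42)          # fixed seed so the run is repeatable
TRUE_PREVALENCE = 10 / 100_000   # assumed true value for the whole population
SAMPLE_SIZE = 200_000            # assumed number of people in each sample

# Draw five independent samples from the same population.
sample_counts = [count_cases(rng, SAMPLE_SIZE, TRUE_PREVALENCE) for _ in range(5)]
for count in sample_counts:
    per_100k = count / SAMPLE_SIZE * 100_000
    print(f"{count} cases -> {per_100k:.1f} per 100,000")
```

Every sample comes from the same population, yet the reported prevalences will generally not all match. This spread is the sampling error that a statistical test has to account for.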

A researcher might want to determine if the prevalence of Huntington's disease is the same in the United States as it is in Japan, for example. The population samples might indicate that the prevalences, ignoring ranges, are 10 per 100,000 in the United States and 11 per 100,000 in Japan. Are these numbers close enough to be considered the same? This is where the Chi-square test is useful.

First we state the "null hypothesis," which is that the two prevalences are the same and that the difference in the numbers is due to sampling error alone. Then we use the Chi-square test, which is a mathematical formula, to test the hypothesis.

The test generates a measure of probability, called a p value, that can range from 0 percent to 100 percent. The p value is the probability of seeing a difference at least as large as the one observed if the null hypothesis were true. If the p value is close to 100 percent, the difference in the two numbers is entirely consistent with sampling error alone. The lower the p value, the harder it is to explain the difference by chance alone.

By convention, scientists use a cutoff value of 5 percent for most purposes. If the p value is less than 5 percent, the two numbers are said to be significantly different, the null hypothesis is rejected, and a cause other than sampling error must be sought. There are many statistical tests and measures of significance in addition to the Chi-square test, each adapted for particular circumstances.
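
The United States and Japan comparison can be sketched as a Chi-square test on a two-by-two table of cases and non-cases. The sample sizes below (one million people per country) are hypothetical, since the text gives only prevalences, and the function name is illustrative. For a table of this shape the test has one degree of freedom, so the p value can be computed with the standard library's `math.erfc` rather than a statistics package.

```python
import math

def chi_square_2x2(cases_a, n_a, cases_b, n_b):
    """Chi-square test comparing two prevalences (no continuity correction).

    Returns (chi-square statistic, p value) for a 2x2 table with 1 degree
    of freedom.
    """
    observed = [[cases_a, n_a - cases_a],
                [cases_b, n_b - cases_b]]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [cases_a + cases_b, total - cases_a - cases_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected under the null hypothesis of equal prevalence.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical samples: 10 and 11 cases per 100,000, one million people each.
chi2, p_value = chi_square_2x2(100, 1_000_000, 110, 1_000_000)
print(f"chi-square = {chi2:.3f}, p = {p_value:.2f}")
print("significantly different" if p_value < 0.05 else "not significantly different")
```

Under these assumed sample sizes the p value is about 0.49, well above the 5 percent cutoff, so the null hypothesis would not be rejected. The same prevalences measured in much larger samples could give a different verdict, because the test depends on the counts and not only on the rates.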

Another application of the Chi-square test in genetics is to test whether a particular genotype is more or less common in a population than would be expected. The expected frequencies can be calculated from population data and the Hardy-Weinberg Equilibrium formula. These expected frequencies can then be compared to observed frequencies, and a p value can be calculated. A significant difference between observed and expected frequencies would indicate that some factor, such as natural selection or migration, is at work in the population, acting on allele frequencies. Population geneticists use this information to plan further studies to find these factors.
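
A minimal sketch of this comparison for a gene with two alleles, here called A and a, follows. The genotype counts are hypothetical and the function name is illustrative. Under Hardy-Weinberg Equilibrium, if the allele frequencies are p and q = 1 - p, the expected genotype frequencies are p², 2pq, and q². Because p is estimated from the same sample, the test has one degree of freedom.

```python
import math

def hardy_weinberg_test(n_AA, n_Aa, n_aa):
    """Compare observed genotype counts with Hardy-Weinberg expectations.

    Returns (expected counts, chi-square statistic, p value).
    Assumes both alleles are present, so no expected count is zero.
    """
    total = n_AA + n_Aa + n_aa
    # Frequency of allele A, estimated from the observed genotypes.
    p = (2 * n_AA + n_Aa) / (2 * total)
    q = 1 - p

    observed = [n_AA, n_Aa, n_aa]
    expected = [p * p * total, 2 * p * q * total, q * q * total]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # 3 genotype classes - 1 - 1 estimated allele frequency = 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value

# Hypothetical observed counts of AA, Aa, and aa in a sample of 1,000 people.
expected, chi2, p_value = hardy_weinberg_test(400, 400, 200)
print("expected counts:", [round(e, 1) for e in expected])
print(f"chi-square = {chi2:.2f}, p = {p_value:.2g}")
```

With these assumed counts the expected numbers are 360, 480, and 160, the Chi-square statistic is about 27.8, and the p value falls far below 5 percent, so the sample would be judged to depart from equilibrium. Counts of 360, 480, and 160 would instead match the expectation exactly.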
