Testing for "Near-Normality": The Probability Plot


Robert F. Hart, Ph.D.
Marilyn K. Hart, Ph.D.

Many of the procedures used in statistics require that the data be normally distributed. In the realm of statistical process control, the control chart for individuals and the control chart for averages have an underlying normality assumption in the calculation of the control limits. If the data are not by nature normally distributed and a control chart for individuals or for averages is used, points outside the control limits may be due to the skew of the data and not due to special-cause variation.

These control charts are fairly robust in that the underlying distribution of the data need not be exactly normal. In fact, no data set is exactly normally distributed. The normal distribution yields values ranging from minus infinity to plus infinity, a range no real-life data ever span. As Shapiro [1990, p. 5] warns,

Any distribution, the normal for example, is a mathematical concept. Geary (1947) once suggested that in front of all statistical texts should be printed, "Normality is a myth. There never was and will never be, a normal distribution."

Instead, it is only necessary for the data to be "near-normal," defined here as departing from normality too little to have any detrimental effect on the control chart. So if the data need only be near-normal, how does one test for that?

A common test for normality is the chi-squared goodness-of-fit test. Besides being difficult to carry out, it is too strict a test for this purpose: data can depart from normality by more than a chi-squared test will allow and still produce useful control charts. A good (and easy) test for the needed near-normality is the normal probability plot.

A probability plot is a graph of the relative cumulative frequencies of the data, using a specific plotting convention [Hart and Hart, 2002]. To illustrate the calculations involved in making a probability plot, consider a very small data sample consisting of just five time-ordered values: 7, 3, 4, 11, and 9. In Table 1, the observations are arranged from the smallest (3) to the largest (11). Each of the ordered values 3, 4, 7, 9, and 11 occurs only once, so each has a frequency of one. The frequencies are then accumulated, giving the cumulative frequency of the sample. The cumulative frequency of 5 for the observation of 11 means that all five observations, 5 out of 5 or 100%, are of value 11 or smaller. Similarly, the relative cumulative frequencies show that 1 out of 5 (20%) of the observations are of value 3 or smaller, 2 out of 5 (40%) are 4 or smaller, 3 out of 5 (60%) are 7 or smaller, and 4 out of 5 (80%) are 9 or smaller.

Table 1 Cumulative Frequency Calculations for the Sample (n = 5)

Data    Frequency    Cumulative Frequency    Relative Cumulative Frequency
  3         1                 1                      1/5 = 0.20
  4         1                 2                      2/5 = 0.40
  7         1                 3                      3/5 = 0.60
  9         1                 4                      4/5 = 0.80
 11         1                 5                      5/5 = 1.00
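
As a minimal sketch of these calculations (in Python, with the sample values and names used purely for illustration), the columns of Table 1 can be reproduced as follows:

    from collections import Counter

    # The five time-ordered values from the example
    sample = [7, 3, 4, 11, 9]
    n = len(sample)

    # Frequency of each distinct value
    freq = Counter(sample)

    # Walk the distinct values in ascending order, accumulating frequencies
    cumulative = 0
    print("Data  Frequency  Cumulative  Relative cumulative")
    for value in sorted(freq):
        cumulative += freq[value]
        relative = cumulative / n
        print(f"{value:>4}  {freq[value]:>9}  {cumulative:>10}  {relative:>19.2f}")

Running this prints the same rows as Table 1.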

Note that these cumulative frequencies are for the sample, but what is really desired are estimates of the relative cumulative frequencies (or cumulative probabilities) for the total population, not just the sample. A plotting convention is needed for the population cumulative percentage, one that accomplishes three objectives:

1. After putting data in ascending order, the middle value will be at 50%.
2. The largest observation will not be plotted at 100% but at some lower percentage. This will allow for some future observations from the process to be larger than the largest observation obtained from the sample.
3. The smallest observation will be plotted symmetrically to the largest (i.e., if the largest observation is plotted at x%, the smallest observation will be plotted at (100 - x)%).

There are many plotting conventions that accomplish these three objectives. The convention used here is y = i/(n + 1), where y is the relative cumulative frequency, i is the order number of the data point, and n is the total number of observations. Another popular convention is (i - 0.5)/n. Using i/(n + 1), the results are shown in Table 2.

Table 2 Probability Plot Calculations (n = 5)

Data    Frequency    Cumulative Frequency (i)    Relative Cumulative Frequency i/(n + 1)
  3         1                 1                          1/6 = 0.167
  4         1                 2                          2/6 = 0.333
  7         1                 3                          3/6 = 0.500
  9         1                 4                          4/6 = 0.667
 11         1                 5                          5/6 = 0.833

Note that this plotting convention accomplishes the three objectives mentioned earlier:

1. The middle value (7), called the median, is at 50%.
2. The largest value (11) is at 83%, leaving room for 17% of the population to be larger.
3. The smallest value (3) is at 17%, leaving room for 17% of the population to be smaller.
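
A short sketch of this convention (again in Python, for illustration only) computes the i/(n + 1) positions of Table 2 and checks the three objectives for this sample:

    # Ordered sample and its i/(n + 1) plotting positions (Table 2)
    sample = [7, 3, 4, 11, 9]
    ordered = sorted(sample)                      # [3, 4, 7, 9, 11]
    n = len(ordered)
    positions = [i / (n + 1) for i in range(1, n + 1)]

    for value, p in zip(ordered, positions):
        print(f"{value:>4}  {p:.3f}")

    # Objective 1: the middle (median) value is plotted at 50% (n is odd here)
    assert abs(positions[n // 2] - 0.5) < 1e-9
    # Objective 2: the largest value is plotted below 100%
    assert positions[-1] < 1.0
    # Objective 3: the smallest and largest positions are symmetric about 50%
    assert abs(positions[0] - (1.0 - positions[-1])) < 1e-9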

For the probability plot (also called the cumulative or normal probability plot), the relative cumulative frequencies are plotted on special graph paper (called normal probability paper). This paper has a special y-axis (vertical) scale that has been chosen so that data that are normally distributed will tend to yield a straight line. The x-axis (horizontal) scale is just a regular linear scale. Each ordered pair of the data point and its cumulative probability is plotted.

Plotting the data from the five-value example, the lowest observation (3) is plotted at 17%, the next (4) at 33%, and so on, as shown in Figure 1. Note that the plotted points can reasonably be approximated by a straight line.


Figure 1 Probability Plot of Data from Short Example
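
A plot like Figure 1 can also be produced in software rather than on physical probability paper. The sketch below (Python, assuming the matplotlib and SciPy libraries are available) plots each ordered value against its i/(n + 1) position, stretching the vertical axis by the standard normal quantile function so that near-normal data tend to fall on a straight line:

    import matplotlib.pyplot as plt
    from scipy.stats import norm

    sample = [7, 3, 4, 11, 9]
    ordered = sorted(sample)
    n = len(ordered)

    # Cumulative probabilities from the i/(n + 1) plotting convention
    probs = [i / (n + 1) for i in range(1, n + 1)]

    # Emulate normal probability paper: the vertical position of each point
    # is the standard normal quantile of its cumulative probability
    z = [norm.ppf(p) for p in probs]

    fig, ax = plt.subplots()
    ax.plot(ordered, z, "o")

    # Label the vertical axis in cumulative percent, as on probability paper
    tick_probs = [0.05, 0.25, 0.50, 0.75, 0.95]
    ax.set_yticks([norm.ppf(p) for p in tick_probs])
    ax.set_yticklabels([f"{100 * p:.0f}%" for p in tick_probs])
    ax.set_xlabel("Observed value")
    ax.set_ylabel("Cumulative percent")
    ax.set_title("Normal probability plot")
    plt.show()

SciPy's scipy.stats.probplot produces a similar display, though with the axes interchanged and a slightly different plotting convention.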

Figures 2 and 3 are the histogram and probability plot of 200 measurements that are near-normal in their distribution.


Figure 2 Histogram of Near-Normal Data, n = 200


Figure 3 Probability Plot of Near-Normal Data, n = 200

Figures 4 and 5 illustrate the histogram and probability plot of data that are severely skewed.


Figure 4 Histogram of Severely Skewed Data


Figure 5 Probability Plot of Severely Skewed Data

The probability plot is a better graphical test for near-normality than the histogram. The shape of a histogram may be highly dependent upon the size and number of its cells, a problem the probability plot does not have. But how straight does the line have to be for the data to be considered near-normal? Shapiro [1990, p. 9] points out:

If the model is appropriate then the plotted points will tend to fall on a straight line. If it is not appropriate the points will deviate from a straight line, generally in some systematic manner. The decision whether or not to reject the hypothesized model is subjective and two people looking at the same data might come to different conclusions, but with some experience a reasonably good assessment can be made.
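
As a rough numeric complement to this visual judgment (not part of the authors' procedure, but the idea behind probability-plot correlation tests), one can compute the correlation between the ordered data and the normal quantiles of their i/(n + 1) positions; values very close to 1 go with a nearly straight plot. A minimal sketch, assuming NumPy and SciPy:

    import numpy as np
    from scipy.stats import norm

    def probability_plot_correlation(data):
        """Correlation between ordered data and normal quantiles at i/(n + 1).

        Values near 1 suggest a nearly straight probability plot; noticeably
        smaller values suggest systematic departure from normality.
        """
        ordered = np.sort(np.asarray(data, dtype=float))
        n = len(ordered)
        probs = np.arange(1, n + 1) / (n + 1)
        quantiles = norm.ppf(probs)
        return float(np.corrcoef(ordered, quantiles)[0, 1])

    print(probability_plot_correlation([7, 3, 4, 11, 9]))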

So what does one do if the data are not near-normal? An excellent way of handling such data is to transform the data set to make it near-normal. This will be the subject of a future paper.

References:

Geary, R. C. "Testing for Normality." Biometrika, vol. 34, pp. 209-242, 1947.
Hart, Marilyn and Hart, Robert. Statistical Process Control for Health Care. Pacific Grove, CA: Duxbury Press, 2002.
Shapiro, Samuel. How to Test Normality and Other Distribution Assumptions. Milwaukee, WI: American Society for Quality, 1990.

For more information, contact Drs. Robert and Marilyn Hart at robthart@aol.com or (541)412-0425.
