Robert F. Hart, Ph.D.
Marilyn K. Hart, Ph.D.
Many of the procedures used in statistics require
that the data be normally distributed. In the
realm of statistical process control, the control
chart for individuals and the control chart
for averages have an underlying normality assumption
in the calculation of the control limits. If
the data are not by nature normally distributed
and a control chart for individuals or for averages
is used, points outside the control limits may
be due to the skew of the data and not due to
special-cause variation.
These control charts are fairly robust in that
the underlying distribution of the data need
not be exactly normally distributed. In fact,
no data set is exactly normally distributed.
The normal distribution yields values that range
from minus infinity to plus infinity, which
never reflects real life. As Shapiro [1990,
p. 5] warns,
Instead, it is only necessary for the data
to be "near-normal," defined here
as having no detrimental effect on the control
chart. So if data need to be near-normal, how
does one test for that?
A common test for normality of data is
chi-squared analysis. Besides being difficult
to carry out, chi-squared analysis is too
stringent a test for this purpose: the data can
depart from normality more than chi-squared analysis
will allow and the control charts will still be
useful. A good (and easy) test for the needed
near-normality is the normal probability plot.
A probability plot is a graph of the relative
cumulative frequencies of the data, using a
specific plotting convention [Hart and Hart,
2002]. To illustrate the calculations involved
in making a probability plot, consider a very
small data sample consisting of just five time-ordered
values: 7, 3, 4, 11, and 9. In Table 1, the
observations are arranged from the smallest
(3) to the largest (11). Each of the values
3, 4, 7, 9, and 11 occurs only once, so each
has a frequency of one.
The frequencies may then be accumulated, giving
the cumulative frequency of the sample. The
cumulative frequency of 5 for the observation
of 11 means that all five observations from
the sample are of value 11 or smaller. Dividing
by the sample size gives the relative cumulative
frequency: 5 out of 5, or 100%, of the observations
from the sample are of value 11 or smaller. Likewise,
1 out of 5 (20%) of the observations from the
sample are of value 3 or smaller, 2 out of
5 (40%) are 4 or smaller, 3 out of 5 (60%)
are 7 or smaller, and 4 out of 5 (80%) are
9 or smaller.
Table 1 Cumulative Frequency Calculations for the Sample (n = 5)

| Data | Frequency | Cumulative Frequency | Relative Cumulative Frequency |
|------|-----------|----------------------|-------------------------------|
| 3    | 1         | 1                    | 1/5 = 0.20                    |
| 4    | 1         | 2                    | 2/5 = 0.40                    |
| 7    | 1         | 3                    | 3/5 = 0.60                    |
| 9    | 1         | 4                    | 4/5 = 0.80                    |
| 11   | 1         | 5                    | 5/5 = 1.00                    |
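The Table 1 calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name `cumulative_frequency_table` is a hypothetical helper and does not come from the original article.

```python
from collections import Counter
from itertools import accumulate


def cumulative_frequency_table(sample):
    """Return rows of (value, frequency, cumulative frequency,
    relative cumulative frequency) for the sample, smallest value first.

    Hypothetical helper written for illustration only.
    """
    counts = Counter(sample)                 # frequency of each distinct value
    values = sorted(counts)                  # ascending order, as in Table 1
    freqs = [counts[v] for v in values]
    cum_freqs = list(accumulate(freqs))      # running total of the frequencies
    n = len(sample)
    return [(v, f, c, c / n) for v, f, c in zip(values, freqs, cum_freqs)]


# The five time-ordered values from the short example.
rows = cumulative_frequency_table([7, 3, 4, 11, 9])
for value, freq, cum_freq, rel_cum_freq in rows:
    print(f"{value:>4} {freq:>4} {cum_freq:>4} {rel_cum_freq:>6.2f}")
```

Note that the largest value always receives a relative cumulative frequency of 1.00 under this sample-based calculation.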
Note that these cumulative frequencies are
for the sample, but what is really desired are
estimates of the relative cumulative frequencies
(or cumulative probabilities) for the total
population (not just the sample). A plotting
convention is needed for the population cumulative
percentage, which will accomplish three objectives:
1. After putting data in ascending order, the middle value will be at 50%.
2. The largest observation will not be plotted at 100% but at some lower percentage. This will allow for some future observations from the process to be larger than the largest observation obtained from the sample.
3. The smallest observation will be plotted symmetrically to the largest (i.e., if the largest observation is plotted at x%, the smallest observation will be plotted at (100 - x)%).
There are many plotting conventions that accomplish
these three objectives. The commonly used convention
adopted here is y = i/(n + 1), where y is the
estimated relative cumulative frequency, i is
the order number of the data point, and n is
the total number of observations. Another popular
convention is (i - 0.5)/n. Using the convention
i/(n + 1), the results are in Table 2.
Table 2 Probability Plot Calculations (n = 5)

| Data | Frequency | Cumulative Frequency, i | Relative Cumulative Frequency, i/(n+1) |
|------|-----------|-------------------------|----------------------------------------|
| 3    | 1         | 1                       | 1/6 = 0.167                            |
| 4    | 1         | 2                       | 2/6 = 0.333                            |
| 7    | 1         | 3                       | 3/6 = 0.500                            |
| 9    | 1         | 4                       | 4/6 = 0.667                            |
| 11   | 1         | 5                       | 5/6 = 0.833                            |
Note that this plotting convention accomplishes
the three objectives mentioned earlier:
1. The middle value (7), called the median, is at 50%.
2. The largest value (11) is at 83%, leaving room for 17% of the population to be larger.
3. The smallest value (3) is at 17%, leaving room for 17% of the population to be smaller.
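As a check on the arithmetic, the two plotting conventions mentioned above can be computed and tested against the three objectives. The following is a minimal sketch; `plotting_positions` and its `convention` argument are hypothetical names chosen for illustration, not part of any library.

```python
import math


def plotting_positions(n, convention="i/(n+1)"):
    """Estimated cumulative probabilities for order numbers i = 1..n.

    Hypothetical helper; supports only the two conventions named in the text.
    """
    if convention == "i/(n+1)":
        return [i / (n + 1) for i in range(1, n + 1)]
    if convention == "(i-0.5)/n":
        return [(i - 0.5) / n for i in range(1, n + 1)]
    raise ValueError(f"unknown convention: {convention}")


data = sorted([7, 3, 4, 11, 9])              # 3, 4, 7, 9, 11
positions = plotting_positions(len(data))    # i/(n+1), as in Table 2
alt = plotting_positions(len(data), "(i-0.5)/n")

for x, y in zip(data, positions):
    print(f"{x:>4} {y:.3f}")

# Objective 1: the middle value is plotted at 50%.
middle_at_half = math.isclose(positions[len(data) // 2], 0.5)
# Objective 2: the largest observation is plotted below 100%.
largest_below_one = positions[-1] < 1.0
# Objective 3: the smallest and largest are symmetric about 50%.
symmetric = math.isclose(positions[0] + positions[-1], 1.0)
print(middle_at_half, largest_below_one, symmetric)
```

For n = 5 the (i - 0.5)/n convention places the points at 10%, 30%, 50%, 70%, and 90%, so it satisfies the same three objectives while leaving less room in each tail.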
For the probability plot (also called the cumulative
or normal probability plot), the relative cumulative
frequencies are plotted on special graph paper
(called normal probability paper). This paper
has a special y-axis (vertical) scale that has
been chosen so that data that are normally distributed
will tend to yield a straight line. The x-axis
(horizontal) scale is just a regular linear
scale. Each ordered pair of the data point and
its cumulative probability is plotted.
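Without normal probability paper, the same effect can be obtained by converting each cumulative probability to a standard normal quantile (z-score) and plotting it on an ordinary linear axis. The sketch below assumes Python 3.8 or later for `statistics.NormalDist`; the helper name `normal_scores` is hypothetical and is not taken from the original article.

```python
from statistics import NormalDist


def normal_scores(probabilities):
    """Map cumulative probabilities (strictly between 0 and 1) to
    standard normal quantiles. This transformation is what the special
    y-axis scale of normal probability paper performs graphically.
    """
    std_normal = NormalDist()                # mean 0, standard deviation 1
    return [std_normal.inv_cdf(p) for p in probabilities]


data = sorted([7, 3, 4, 11, 9])
n = len(data)
probs = [i / (n + 1) for i in range(1, n + 1)]   # i/(n+1) convention
z = normal_scores(probs)

# Each (x, z) pair is one point of the probability plot on linear axes.
for x, p, score in zip(data, probs, z):
    print(f"x = {x:>2}   p = {p:.3f}   z = {score:+.3f}")
```

Plotting the data values against these z-scores with any charting tool gives a graph in which normally distributed data tend to fall along a straight line, just as on the special paper.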
Plotting the data from the short example of
five values, the lowest observation of 3 gets
plotted at 17%, 4 gets plotted at 33%, and so
on, as shown in Figure 1. Note that the plot
of the data can reasonably be approximated by
a straight line.

Figure 1 Probability Plot of Data from Short
Example
Figures 2 and 3 are the histogram and probability
plot of 200 measurements that are near-normal
in their distribution.

Figure 2 Histogram of Near-Normal Data, n =
200

Figure 3 Probability Plot of Near-Normal Data,
n = 200
Figures 4 and 5 illustrate the histogram and
probability plot of data that are severely skewed.

Figure 4 Histogram of Severely Skewed Data

Figure 5 Probability Plot of Severely Skewed
Data
The probability plot makes for a better graphical
test for near-normality than does the histogram.
The shape of the histogram may be highly dependent
upon the size and number of cells, which does
not happen with the probability plot. But how
straight does the line have to be to consider
it near-normal? Shapiro [1990, p. 9] points
out:
Geary, R. C. "Testing for Normality." Biometrika, vol. 34, pp. 209-242, 1947.
Hart, Marilyn and Hart, Robert. Statistical Process Control for Health Care. Pacific Grove, CA: Duxbury Press, 2002.
Shapiro, Samuel. How to Test Normality and Other Distribution Assumptions. Milwaukee, WI: American Society for Quality, 1990.