Normal vs. NonNormal Distributed Data—Comparing Results
To generate a control chart, the user must know
whether the collected data is variable or attribute
data. This distinction is important because the
control limits are calculated from different
assumptions about the data. Variable control
charts expect the input data to be normally
distributed. The calculations employed in attribute
charts are based on other distributions, most
commonly the binomial distribution and the
Poisson distribution.
This discussion will focus on the normal distribution.
Normally distributed data exhibit predictable
traits and probabilities. These characteristics
are used to define rules that identify control
violations. The most common rules define conditions
that would only be expected to occur by chance
about 0.3% of the time, provided the data are
normally distributed.
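The 0.3% figure can be checked directly from the
normal distribution. The sketch below assumes the
rule in question is a single point more than three
standard deviations from the mean; the function
names are illustrative and not part of any charting
software.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_tail(k: float) -> float:
    """Probability that a normal value lies more than k sigma from the mean."""
    return 2.0 * (1.0 - normal_cdf(k))

# Chance of a point beyond 3 sigma on either side, shown as a percentage.
print(f"{two_sided_tail(3.0):.2%}")
```

For three standard deviations the result is roughly
0.27%, which is usually rounded to 0.3%.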
In practice, we are frequently confronted with
data that is not normal. It is useful to understand
how nonnormal data behaves when it is analyzed
by tools that are based on the normal distribution.
This discussion will compare the results of
two data sets with similar means and standard
deviations but different distributions. The
data are total minutes spent in the emergency
room (ER) per person, measured from the moment
the patient enters the ER to the time recorded
when the person is discharged from the ER.
The first step is to look at how the data is
distributed. This will be done with a process
capability chart. There are no specifications
for this data, so some of the statistics will
be unavailable. However, the value of the chart
here is to show the shape of the distribution
and to verify the mean and variation of the data.
The first chart shows that the data is approximately
normally distributed. The mean, mode and median
are very close to being equal. The data show
very little skewness. The mean is 166.9 and the
standard deviation is 76.1, with 24 cases.
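The summary statistics read off the capability
chart can also be computed by hand. The following
is a minimal sketch; the `describe` helper and the
sample values are hypothetical and are not the ER
data discussed here. Skewness is computed as the
moment-based coefficient, which is an assumption
about how it is defined.

```python
import statistics

def describe(values):
    """Return mean, median, sample standard deviation and skewness."""
    n = len(values)
    mean = statistics.fmean(values)
    # Second and third central moments (population form) for skewness.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample form, divides by n - 1
        "skewness": m3 / m2 ** 1.5,
    }

# Hypothetical minutes-in-ER values, for illustration only.
sample = [95, 120, 140, 155, 165, 170, 185, 200, 230, 260]
for name, value in describe(sample).items():
    print(name, round(value, 2))
```

A mean close to the median and a skewness near zero
are consistent with a roughly symmetric distribution.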
The second process capability chart shows a
very different picture. The mean for this set
of data is 167.2 with a standard deviation of
82.5 and 24 cases. One of the obvious features
of this distribution is that it is bimodal.
A normal distribution has a single mode, and
its mean, median and mode coincide. A distribution
with two modes clearly violates this relationship.
Bimodal distributions are very common in certain
settings. Bimodal or multimodal data may be
unavoidable for many reasons, such as demand
for services, which may have natural peaks or
modes at certain times of the day or on certain
days of the week.
Although these data sets visually appear to
be normal and nonnormal respectively, the next
examples provide further evidence. The following
plots are probability plots. A probability plot
draws the theoretical normal line through the
data points so that one can judge how closely
the actual data points adhere to the theoretical
normal distribution. The plot is augmented by
the p-value. When the p-value is smaller than
a critical value, 0.05 in this discussion, we
reject the hypothesis that the distribution is
normal and conclude that the distribution is
nonnormal. If the p-value is greater than 0.05,
we do not have enough evidence to reject the
hypothesis that the distribution is normal.
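The mechanics of a probability plot and the
decision rule can be sketched as follows. This
discussion does not say which normality test
produces the p-value, so the sketch does not compute
one; it only builds the plot coordinates, measures
how straight they are, and applies the 0.05 rule
to a p-value supplied by the user. The function
names and the Blom plotting positions are
assumptions made for illustration.

```python
import math
from statistics import NormalDist

def probability_plot_points(values):
    """Pair each theoretical standard normal quantile with a sorted value."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # Blom plotting positions: an assumed, commonly used convention.
    quantiles = [std_normal.inv_cdf((i - 0.375) / (n + 0.25))
                 for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

def plot_correlation(points):
    """Pearson correlation of the plot; values near 1 mean a straight line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def decide(p_value, alpha=0.05):
    """Apply the decision rule described in the text."""
    if p_value < alpha:
        return "reject normality"
    return "cannot reject normality"
```

Points that fall close to a straight line give a
correlation near 1. Systematic curvature or clusters
away from the line, as with bimodal data, lower it.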
The following graph is for the same data set
that appears visually to be normally distributed
in the process capability plot.
The p-value for this data is 0.467. To reiterate,
since 0.467 > 0.05, we cannot reject the hypothesis
that the data is normal. This is consistent with
the visual evidence that this data is normally
distributed.
In contrast, the second set of data follows
a very different pattern in the probability
plot as seen in the next graph.
The data points are not scattered randomly
about the theoretical line. There is also wider
divergence from the line than is shown with
the normal set of data. The p-value for this
data is 0.046. Since 0.046 < 0.05, we reject
the hypothesis that the data is normally distributed.
Now that we have both visual and statistical
evidence that one set of data is approximately
normally distributed and one is not, we will
proceed to see how the different data sets behave
in a variable control chart. The data points
are individual values, so the most appropriate
chart is the I-chart (individuals chart). The
control chart for the first data set does not
violate any control rules. The control chart
for the nonnormal data does have rule violations.
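This discussion does not state how the I-chart
limits are calculated. A conventional approach,
assumed here, estimates sigma from the average
moving range between consecutive points divided
by the constant 1.128 and places the limits three
sigma from the mean. The function name and the
sample values below are hypothetical.

```python
def i_chart_limits(values):
    """Return (LCL, center line, UCL) for an individuals chart."""
    center = sum(values) / len(values)
    # Moving range: absolute difference between consecutive points.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    # 1.128 is the d2 constant for ranges of two observations.
    sigma = mr_bar / 1.128
    return center - 3 * sigma, center, center + 3 * sigma

# Hypothetical values, not the ER data from this discussion.
lcl, center, ucl = i_chart_limits([10, 12, 11, 13, 12])
print(round(lcl, 2), round(center, 2), round(ucl, 2))
```

Because sigma comes from point-to-point differences,
the limits reflect short-term variation. Data with
clusters or shifts can therefore sit inside the
limits and still trigger run-based rules.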
Notice that the rule violations are not caused
by points beyond the control limits. Mousing
over the control violations shows that they
are for too many points in zone B or beyond
and too many points in zone C or beyond. These
types of violations are consistent with the
properties of the data. Since these data are
not normally distributed, they would not be
expected to exhibit a random scatter above and
below the center line. The lack of randomness
makes the data prone to these types of control
violations. However, the chart still provides
valuable information regarding the time spent
in the ER. The data appear to indicate a gradual
decrease in variability over time as well as
an overall reduction in time spent in the later
time periods.
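The zone rules mentioned above can be sketched in
a few lines. The exact counts used by the software
are not given in this discussion; the sketch assumes
the widely used convention of four out of five
consecutive points beyond one sigma on the same side
(zone B or beyond) and eight consecutive points on
the same side of the center line (zone C or beyond).
The function names are hypothetical.

```python
def zone_b_violations(values, center, sigma, needed=4, window=5):
    """Indices ending a window with `needed` points beyond 1 sigma, same side."""
    hits = []
    for end in range(window - 1, len(values)):
        chunk = values[end - window + 1 : end + 1]
        above = sum(v > center + sigma for v in chunk)
        below = sum(v < center - sigma for v in chunk)
        if above >= needed or below >= needed:
            hits.append(end)
    return hits

def zone_c_violations(values, center, run=8):
    """Indices ending a run of `run` points on one side of the center line."""
    hits = []
    for end in range(run - 1, len(values)):
        chunk = values[end - run + 1 : end + 1]
        if all(v > center for v in chunk) or all(v < center for v in chunk):
            hits.append(end)
    return hits
```

Neither rule requires a point outside the control
limits, which is why clustered or bimodal data can
trigger them while every point stays within limits.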
Both sets of data produce informative charts.
There is enough information in the charts to
consider conducting further analysis. The control
violations present in the nonnormal data could
be avoided by using a different rule set. It
may be appropriate to use a rule set that only
evaluates control limit violations and upward
or downward trends. However, this decision should
only be made with a good understanding of the
data and of which types of violations the nonnormal
characteristics would be expected to increase.
For additional information on behavior of nonnormal
data in control charts see Discussions
on Normality.
