Normal vs. Non-Normal Distributed Data: Comparing Results


To generate a control chart, the user must know whether the collected data is variable or attribute data. This distinction is important because the control limits are calculated under different distributional assumptions. Variable control charts expect the input data to be normally distributed. The calculations used in attribute charts are based on other distributions, most commonly the binomial distribution (often through its normal approximation) and the Poisson distribution. This discussion will focus on the normal distribution.

Normally distributed data exhibit predictable traits and probabilities. These characteristics are used to define rules that identify control violations. The most common rules define conditions that would be expected to occur by chance only about 0.3% of the time, provided the data are normally distributed.
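As a quick check on where a figure of roughly 0.3% comes from, the sketch below computes the chance that a normally distributed value falls more than three standard deviations from its mean, which is the usual basis for 3-sigma control limits. The function name is illustrative and not part of any charting package.

```python
import math

def two_sided_tail_probability(k: float) -> float:
    """Probability that a normal value lies more than k standard deviations from its mean."""
    # For a standard normal Z, P(|Z| > k) = erfc(k / sqrt(2)).
    return math.erfc(k / math.sqrt(2))

# Chance of a point beyond 3-sigma limits when the data really are normal.
p_beyond_3_sigma = two_sided_tail_probability(3.0)
print(f"{p_beyond_3_sigma:.4%}")  # about 0.27%
```

Other run rules are generally designed so that their false-alarm probability is of a similar order, but only under the normality assumption.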

In practice, we are frequently confronted with data that is not normal. It is useful to understand how non-normal data behaves when it is analyzed with tools that are based on the normal distribution. This discussion will compare the results for two data sets with similar means and standard deviations but different distributions. The data are the total minutes each person spent in the emergency room (ER), measured from the moment the patient enters the ER to the time recorded when the person is discharged from the ER.

The first step is to look at how the data is distributed. This will be done with a process capability chart. There are no specifications for this data, so some of the capability statistics will be unavailable. The value of the chart here is that it shows the shape of the distribution and lets us verify the mean and variation of the data.

The first chart shows that the data is approximately normally distributed. The mean, mode and median are very close to being equal, and the data show very little skewness. The mean is 166.9 and the standard deviation is 76.1, with 24 cases.

The second process capability chart shows a very different picture. The mean for this set of data is 167.2 with a standard deviation of 82.5 and 24 cases. One of the obvious features of this distribution is that it is bimodal. In normally distributed data, the mean, median and mode are equal, and there is only one mode. A bimodal distribution clearly violates this property of normal data. Bimodal distributions are very common in certain settings. There are many reasons why bimodal or multimodal data may be unavoidable, such as fluctuating demand for services. There may be natural peaks, or modes, during certain times of the day or certain days of the week.

Although the data sets visually appear to be normal and non-normal respectively, the next examples provide further evidence. The following plots are probability plots. A probability plot draws a theoretical line through the data points and shows how closely the actual data points adhere to the theoretical normal distribution. The plot is augmented by a p-value. When the p-value is smaller than a critical value, .05 in this discussion, we reject the hypothesis that the distribution is normal and conclude that the distribution is non-normal. If the p-value is greater than .05, we do not have enough evidence to reject the hypothesis that the distribution is normal.
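The idea behind a normal probability plot can be sketched with the standard library alone. The example below pairs each sorted observation with a theoretical normal quantile and measures how straight the resulting line is with a correlation coefficient. It does not reproduce the p-value shown on the plots, which comes from a formal normality test. The sample values are hypothetical and are not the ER data discussed here, and the plotting position (i - 0.5) / n is only one common convention.

```python
from statistics import NormalDist, mean

def normal_plot_points(values):
    """Pair each sorted observation with a theoretical standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting position (i - 0.5) / n is an assumed convention; others exist.
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return quantiles, ordered

def pearson_r(xs, ys):
    """Pearson correlation: values near 1 mean the points lie close to a straight line."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical times in minutes, for illustration only.
samples = {
    "roughly normal": [60, 95, 110, 125, 140, 150, 160, 170, 180, 195, 210, 230, 255, 290],
    "two peaks": [55, 60, 65, 70, 75, 80, 85, 250, 255, 260, 265, 270, 275, 280],
}

results = {}
for label, sample in samples.items():
    quantiles, ordered = normal_plot_points(sample)
    results[label] = pearson_r(quantiles, ordered)
    print(f"{label}: r = {results[label]:.3f}")
```

A sample with two separated peaks bends away from the line in the middle of the plot, so its correlation is noticeably lower than that of the roughly normal sample.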

The following graph is for the same data set that appears visually to be normally distributed in the process capability plot.

The p-value for this data is .467. To reiterate, since .467 > .05, we cannot reject the hypothesis that the data is normal. This result is consistent with the visual impression that this data is approximately normally distributed.

In contrast, the second set of data follows a very different pattern in the probability plot as seen in the next graph.

The data points are not scattered randomly about the theoretical line. There is also wider divergence from the line than is shown with the normal set of data. The p-value for this data is .046. Since .046 < .05, we reject the hypothesis that the data is normally distributed.

Now that we have both visual and statistical evidence that one set of data is approximately normally distributed and one is not, we will see how the two data sets behave in a variable control chart. Because the data points are individual values, the most appropriate chart is the I-chart (individuals chart). The first data set does not violate any control rules.
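For readers who want to see how I-chart limits are typically obtained, the following is a minimal sketch of the common textbook calculation: sigma is estimated from the average moving range divided by the constant 1.128, and the limits are placed three estimated sigmas from the mean. Whether the charts shown here were produced with exactly this method is an assumption, and the input values are hypothetical.

```python
from statistics import mean

def individuals_chart_limits(values):
    """Return (lower limit, center line, upper limit) for an individuals chart."""
    center = mean(values)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = mean(moving_ranges)
    # 1.128 is the d2 constant for ranges of two observations.
    sigma_hat = mr_bar / 1.128
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat

# Hypothetical measurements, for illustration only.
lcl, center, ucl = individuals_chart_limits([10, 12, 11, 13, 12])
print(f"LCL={lcl:.2f}  CL={center:.2f}  UCL={ucl:.2f}")
```

For a measurement such as elapsed time, a computed lower limit below zero is usually truncated at zero or ignored.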


The control chart for the non-normal data does have rule violations.


Notice that the rule violations are not caused by points beyond the control limits. The flagged points are violations for too many points in zone B or beyond and too many points in zone C or beyond. These types of violations are consistent with the properties of the data. Since these data are not normally distributed, they would not be expected to exhibit a random scatter above and below the center line. The lack of randomness makes the data prone to these types of control violations. However, the chart still provides valuable information regarding the time spent in the ER. The data appear to indicate a gradual decrease in variability over time as well as an overall reduction in the later time periods.
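The zone rules mentioned above can be sketched as follows. The thresholds used here are the common textbook versions: four of five consecutive points more than one sigma from the center line on the same side (zone B or beyond), and eight consecutive points on the same side of the center line (zone C or beyond). The exact counts used by a particular rule set may differ, so treat them as assumptions. The function name and data are hypothetical.

```python
def zone_rule_violations(values, center, sigma):
    """Return (index, rule) pairs where a zone B or zone C run rule is completed."""
    z = [(v - center) / sigma for v in values]
    violations = []
    for i in range(len(z)):
        last5 = z[max(0, i - 4): i + 1]
        last8 = z[max(0, i - 7): i + 1]
        for side in (1, -1):  # check above the center line, then below it
            # Zone B or beyond: more than 1 sigma from center on this side.
            # The current point must itself be in the zone to complete the pattern.
            in_zone_b = sum(side * x > 1 for x in last5)
            if len(last5) == 5 and in_zone_b >= 4 and side * z[i] > 1:
                violations.append((i, "4 of 5 in zone B or beyond"))
            # Zone C or beyond: anywhere on this side of the center line.
            if len(last8) == 8 and all(side * x > 0 for x in last8):
                violations.append((i, "8 in a row in zone C or beyond"))
    return violations

# Hypothetical values with center 0 and sigma 1; no point is beyond 3 sigma.
run_above_center = [0.5, 0.2, 0.3, 0.4, 0.1, 0.6, 0.2, 0.3]
print(zone_rule_violations(run_above_center, center=0.0, sigma=1.0))
```

Data that cluster around two levels tend to stay on one side of the center line for long stretches, which is why rules of this kind fire even when no single point is extreme.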

Both sets of data produce informative charts. There is enough information in the charts to consider conducting further analysis. The control violations present in the non-normal data could be avoided by using a different rule set. It may be appropriate to use a rule set that only evaluates control limit violations and upward or downward trends. However, this decision should only be made with a good understanding of the data and of the types of violations that its non-normal characteristics would be expected to increase.

For additional information on the behavior of non-normal data in control charts, see Discussions on Normality.

If you would like additional information, please send email to statit.support@acs-inc.com.