Solutions Company Statit Training Home

Designing Spreadsheets for Data Import

At Statit we get a lot of questions having to do with importing spreadsheets into Statit. Statit does a good job of importing these files, however that data may be in a format that can make it difficult to analyze or chart the data. What I would like to do is give a few tips on up-front formatting or designing of the spreadsheet to make your life easier. A few minutes thinking about how you will use the data can make all the difference in the world.

A primary idea to consider is the way Statit sees data. Statit views columns of data as variables and rows of data as cases or observations. Remembering this will help you to design your spreadsheets so it is easier to do the analysis. Thinking of each row as an individual observation in time will save you time. I will illustrate this in a few examples.

Statit QC procedures take one data variable as input for the measures. Each case of this variable is one observation. In Variable charts, you might group the observations based on another variable such as month or doctor to create subgroups, but each case is still a single observation taken at that point in time. Other variables can be used for X axis id and things like comments or phases.

Spreadsheet formats

A problem we often see is measures by month, with each month in a separate column. In this case each row has several observations of the same measure, one for each month. This violates the single observation rule and requires some work either in the spreadsheet or in Statit to concatenate the different columns into a single variable so that we have a proper input variable to QC procs. We can alleviate this by remembering the one row/one observation rule. Figure 1 gives an example of a spreadsheet that will make analysis unnecessarily difficult.

Figure 1

A good rule of thumb is to have one column for a date, one column for the observation on that date. Depending on the data you may also need a second measure column such as for subgroup size or denominator. The spreadsheet in Figure 2 illustrates this.

Figure 2

Another situation that occurs is maintaining the same type of measure in one spreadsheet but with separate columns for different groups such as location. In this case, you would not expect to chart the different locations on the same chart but would chart each group separately.

Figure 3

If you did find you needed to plot all the data in one chart, you will need to manipulate the data in some way. In the case illustrated above you would probably want the sum of the measure. In Statit Express QC, you would create these sums in the spreadsheet by creating two new columns, one with the sum of the numerators and one with the sum of the denominators. In other Statit products you also have the option of creating the variables in the Statit workspace as you need them by summing the Num variables into Tot_Num and summing the Den variables into Tot_Den.

As another example, suppose you had p chart data in a spreadsheet such as the following:

Figure 4

If you wanted a p chart of Measure = Floor2 and were using Statit Express QC, you would need to rearrange your spreadsheet to something similar to Figure 3. Express does not have the select capability to choose which subset of measures you want to graph. However, the other Statit products allow either a global selection or a local selection. This format of spreadsheet is workable in these situations, but it still makes it a bit more difficult to aggregate the data so that you can produce a p chart of both Floor2 and Floor3 data.

Data Formatting

Notice that in both the sample spreadsheets above, the Date Column is formatted as a date and the other data variables are entered as numeric. This is very important. For the data variable, Statit requires numeric data. If the spreadsheet does not have the measure column formatted as numeric, you would need to redo the spreadsheet for Statit Express QC or manipulate the data in the other Statit products.

It should be pointed out that changing the column format in a spreadsheet, does not ensure that the column contains numeric values. Be sure to check this as well.

For any type of sorting on date, the date variable needs to be numeric. Otherwise you may have April coming before January. This is a simple thing, but something that can be forgotten.

Statit has several date formats you can use to display the date as you would like. Since the data above is taken on a monthly basis, you would probably want to format it in a Mon-YY (Jan_05) type format. Notice also that the actual date (the first of the month) will not make much difference if the date is formatted as Mon-YY.

Additionally, it is always a good idea to have the spreadsheet sorted in the manner that you would use the data. Although this step can be done in Statit, it will actually help in the process of entering data in the spreadsheet.

We often see spreadsheets with these issues. It is important, when designing a spreadsheet to consider how you expect to use the spreadsheet and the data within it. This is true whether the data comes from a spreadsheet or from an industrial strength database. If you can't easily get the information you need and use it effectively, no amount of normalized database design will help you.