Bill Farrell, Ph.D.
Senior Analyst
Sutter Health
Sutter Health is a family of not-for-profit
hospitals and physician organizations that share
resources and expertise to advance health care
quality. Serving more than 100 communities in
Northern California, Sutter Health is a regional
leader in cardiac care, cancer treatment, orthopedics,
obstetrics, and newborn intensive care, and
is a pioneer in advanced patient safety technology.
In this brief article we take a look at the
legendary t test, the test used when making
inferences with measured (as opposed to counted)
data.
Suppose we administer an IQ test to two groups
of people. The mean score for the first group
is 100, and for the second group it’s 105. Is
this 5-point difference statistically significant?
To answer that, we’d want to know a little more
about the variability of scores within
the groups. Consider two scenarios:
Scenario 1: The range of scores for
the first group of people is 99-101, and the
range of scores for the second group is 104-106.
In this scenario, the difference is very likely
to be significant. The scores are tightly clustered
around their means, and the distributions do
not overlap (that is, the highest score in the
first group is 101, while the lowest score in
the second group is 104).
Scenario 2: The range of scores for
the first group is 65-170, and for the second
group it’s 67-171. In this scenario it is highly
unlikely that a 5-point difference would turn
out to be significant. There is huge variability
of scores around their means, and the distributions
overlap almost completely.
This is the logic behind the t test: we calculate
the difference between the mean scores of two groups
(the difference is 5 points in the example above)
and then look at the variability of the scores
to determine whether the difference is statistically
significant.
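As a rough sketch of this logic, the Python snippet below computes a t statistic for two sets of made-up scores, one tightly clustered and one widely spread. The scores are hypothetical, chosen only so that the means (100 and 105) and ranges match the two scenarios above. The formula is Welch's unequal-variance version of the t statistic, which is one common form and an assumption here, not something this article specifies.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error

# Hypothetical scores, invented for illustration only.
tight_1 = [99, 100, 100, 100, 101]   # mean 100, range 99-101
tight_2 = [104, 105, 105, 105, 106]  # mean 105, range 104-106
wide_1 = [65, 80, 90, 95, 170]       # mean 100, range 65-170
wide_2 = [67, 85, 97, 105, 171]      # mean 105, range 67-171

print(round(t_statistic(tight_1, tight_2), 2))  # 11.18
print(round(t_statistic(wide_1, wide_2), 2))    # 0.2
```

The 5-point difference is the same in both cases, but the t statistic is large when the scores are tightly clustered and close to zero when they are widely spread.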
We used the range above to describe
the variability of the scores (for example, the range
for one group was 65-170). Statisticians (of
course) have decided that the standard deviation
is a much more useful measure of variability.
You can think of the standard deviation as
(sort of) the average distance of a group of
scores from its mean. If we have scores of 1,
2, 3, 4, and 5, the mean is 3 and the distances
from the mean (ignoring sign) are 2, 1, 0, 1,
and 2. Thus the average distance from the mean
is 6/5 = 1.2. This is not the actual standard
deviation, but it’s close and it nicely illustrates
the concept.
What would our pseudo-standard deviation be
if the scores were 1, 2, 3, 4, and 10? (Hint:
it’s 2.4. The mean is now 4, so the distances
are 3, 2, 1, 0, and 6, and 12/5 = 2.4.) Just by
changing a 5 to a 10 we’ve doubled the
variability of our scores.
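The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name mean_abs_distance is made up for illustration, and the comparison with the actual sample standard deviation (statistics.stdev from the standard library) is added here for reference.

```python
from statistics import mean, stdev

def mean_abs_distance(scores):
    """Average distance of the scores from their mean, ignoring sign."""
    m = mean(scores)
    return sum(abs(x - m) for x in scores) / len(scores)

print(mean_abs_distance([1, 2, 3, 4, 5]))    # 1.2
print(mean_abs_distance([1, 2, 3, 4, 10]))   # 2.4

# The actual (sample) standard deviation, for comparison.
print(round(stdev([1, 2, 3, 4, 5]), 2))      # 1.58
print(round(stdev([1, 2, 3, 4, 10]), 2))     # 3.54
```

The actual standard deviation squares each distance before averaging, so the single extreme score of 10 raises it even more than it raises the simple average distance.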
So the t test tells us whether a difference
between two groups is statistically significant
by looking at the size of the difference and
the variability of the scores, where variability
can be viewed (roughly) as how different scores
are from each other or from the mean. The greater
the variability, the greater the difference
must be to achieve statistical significance.
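To make that last point concrete, here is a minimal sketch of how the required difference grows with the standard deviation. It assumes two groups of equal size that share the same standard deviation, and it uses 2.0 as a rough stand-in for the critical t value (about right for a two-sided test at the 5% level with moderate sample sizes). The function name, the group size of 30, and the standard deviations tried are all hypothetical.

```python
from math import sqrt

def min_significant_difference(sd, n, t_crit=2.0):
    """Smallest difference in means whose t statistic reaches t_crit,
    for two groups of size n that share the standard deviation sd."""
    return t_crit * sd * sqrt(2 / n)

for sd in (1, 5, 15, 30):
    print(sd, round(min_significant_difference(sd, n=30), 2))
# 1 0.52
# 5 2.58
# 15 7.75
# 30 15.49
```

Under these assumptions, a 5-point difference would clear the bar when the standard deviation is 5 but not when it is 15.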
This should provide some insight as to why
researchers take such pains to match treatment
and control groups on age, gender, health status,
and the like. Homogeneous groups are likely
to be low-variability groups, and low-variability
groups give us our best shot at discerning subtle
differences between them.