Treatment of Confounding Factors in an Ecologic Study
A copy of this paper is appended below for those who requested it.
The figure (not included here) is Fig. 1 from my paper in Health Physics
68:157-174;1995
Bernard L. Cohen
Physics Dept.
University of Pittsburgh
Pittsburgh, PA 15260
Tel: (412)624-9245
Fax: (412)624-9163
e-mail: blc+@pitt.edu
TREATMENT OF CONFOUNDING FACTORS IN AN ECOLOGIC STUDY
Bernard L. Cohen
University of Pittsburgh
It is often said that an ecologic study cannot give adequate
treatment for confounding factors because the average value of a variable
does not truly represent its confounding effects. For example, the average
age for groups cannot elucidate an effect which depends only on the
fraction of the groups that are very young, or that are very old. Several
arguments to that effect have been published (Refs. 1-3), and Lubin
(Ref. 4) has given a mathematical proof that this problem can lead to an
unbounded error. Lagarde and Pershagen (Ref. 5) have recently given an
actual example in which this problem produced a wrong result. An
associated problem is that an
ecologic study may not have data on important details; for example, it may
have data on average levels of a pollutant in homes of the groups, but
this does not give information on how much time is spent in their homes
which is required to determine actual exposures. This is a special case of
the problem that confounding may not be represented by a single parameter,
such as the fraction of the population that smokes when intensity of
smoking is an issue. Still another problem is that average values of
potential confounders may be unreliable for some reason.
While all of these problems are very serious and even fatal in
nearly all ecologic studies, the purpose of this paper is to describe
methods of addressing them in at least one particular ecologic study.
A. The Problem
In a study of individuals, these problems can be solved by asking
appropriate questions of each individual, but that is not possible in an
ecologic study. The particular ecologic study (Ref. 6) considered here was
designed to test the validity of the linear-no threshold theory
[hereafter, LNT] of radiation carcinogenesis, as applied to the radon vs
lung cancer relationship in the BEIR-IV Report (Ref. 7).
According to BEIR-IV, the lung cancer mortality risk to an
individual, m, is related to his accumulated radon exposure, r, by
     m = a_n (1 + b r)     for a non-smoker
     m = a_s (1 + b r)     for a smoker
where a_n, a_s, and b are given constants. Using these and adding up the
experiences of all males in a county and dividing by the male population
(all previous papers give parallel analyses for females, always with
similar results), gives
     m = [ (1 - S) a_n + S a_s ] (1 + b r)
where m is the lung cancer mortality rate for the county, r is the average
radon exposure, and S is the fraction of the adult males that smoke. The
derivation of this equation, carried out elaborately in Ref. 6, is
mathematically rigorous and is not affected by the ecologic fallacy.
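The aggregation step can be illustrated numerically. The following is only a sketch under stated assumptions: the population is synthetic, radon exposure is drawn independently of smoking status (so smokers and non-smokers have the same average exposure; unequal averages are the subject of Section D), and the constants are illustrative values chosen so that the smoking term matches the bracket of Eq. (1). None of the names below come from Ref. 6.

```python
import numpy as np

# Illustrative constants (assumptions): A_N and A_S are chosen so that
# (1 - S) * A_N + S * A_S = 9 + 99 S, the bracket of Eq. (1); B_LNT = 0.073.
A_N, A_S, B_LNT = 9.0, 108.0, 0.073

def individual_risk(is_smoker, radon):
    """Risk of each individual, m = a (1 + b r), with a set by smoking status."""
    a = np.where(is_smoker, A_S, A_N)
    return a * (1.0 + B_LNT * radon)

def county_rate_formula(S, r):
    """County rate from smoking prevalence S and average radon exposure r."""
    return ((1.0 - S) * A_N + S * A_S) * (1.0 + B_LNT * r)

rng = np.random.default_rng(0)
n = 200_000                                          # synthetic county population
is_smoker = rng.random(n) < 0.45                     # hypothetical prevalence
radon = rng.lognormal(mean=0.0, sigma=0.6, size=n)   # independent of smoking

m_from_individuals = individual_risk(is_smoker, radon).mean()
m_from_formula = county_rate_formula(is_smoker.mean(), radon.mean())
print(m_from_individuals, m_from_formula)
```

The two numbers agree up to sampling differences between the average exposures of the two groups, which is the sense in which the county-level equation follows from the individual-level ones.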
Ref. 6 presents a test of this relationship involving data for
over half of all U.S. counties, containing 90% of the U.S. population.
With corrections for movement of people between counties, the above
relationship with numerical values becomes
     M = m / [9 + 99 S] = A + B r                                   (1)
where the bracketed term may be thought of as the correction for smoking
prevalence, A is a number close to 1.00, and B, in percent per pCi/L of
radon (percent per 37 Bq/m3), is
B = +7.3 LNT prediction
Ref. 6 tests the validity of LNT by fitting the data for 1601 counties to
Eq. (1) to obtain a value of B, to be compared with this LNT prediction.
While all fitting is done mathematically to the 1601 data points, the
results can be appreciated from Fig. 1, whose presentation is explained
in its caption. The result from data fitting is B = -7.3 +/- 0.56,
discrepant from the LNT prediction by 26 standard deviations; we refer to
this as our discrepancy. The remainder of this paper is about efforts to
explain our discrepancy without abandoning LNT by invoking possible
confounding factors (CF).
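The size of the discrepancy in standard deviations follows directly from the three numbers quoted above. A minimal arithmetic check (variable names are ours):

```python
b_lnt = 7.3      # LNT prediction for B, percent per pCi/L
b_fit = -7.3     # value of B obtained from the fit
sigma = 0.56     # quoted uncertainty of the fitted B

n_sigma = (b_lnt - b_fit) / sigma
print(f"discrepancy = {n_sigma:.1f} standard deviations")
```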
B. Treatment of Confounding Factors (CF) by Stratification
For a given CF for which data are available for each county, the
method used was to stratify the data file on that CF into several
sub-files. An example, for county population density (PD) as a CF, is
exhibited in Table 1, where the data are divided into 10 separate
sub-files of about 160 counties each, and each sub-file is analyzed
independently to obtain a
value of B. Since the variation in population density in any one of these
sub-files is much less than in the total data file, its confounding
effects are much reduced. This is shown in Table 1 by including results of
a double regression of M on r and PD, noting that B-values from single and
double regression are essentially the same. Thus an average of the 10 B
values obtained gives a value of B largely free of confounding by
population density. In Table 1 we include results for males and females as
an example of typical differences, but hereafter for brevity we give
results for males only.
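The stratification procedure can be sketched as follows. This is an illustration on synthetic data, not the analysis of Ref. 6: the confounder, its relation to r and M, and the "true" slope are all invented for the example, and the function names are ours.

```python
import numpy as np

def slope(x, y):
    """Least-squares slope of y on x (the B of a fit to y = A + B x)."""
    return np.polyfit(x, y, 1)[0]

def stratified_slopes(r, M, cf, n_strata=10):
    """Sort counties by the confounder cf, split them into n_strata
    sub-files, and fit M = A + B r separately in each sub-file."""
    order = np.argsort(cf)
    return np.array([slope(r[idx], M[idx])
                     for idx in np.array_split(order, n_strata)])

# Synthetic counties (assumptions): a confounder that lowers r and raises M.
rng = np.random.default_rng(1)
n = 1601
TRUE_B = -0.05
log_cf = rng.normal(0.0, 1.0, n)
r = 1.5 - 0.3 * log_cf + rng.normal(0.0, 0.5, n)
M = 1.0 + TRUE_B * r + 0.10 * log_cf + rng.normal(0.0, 0.05, n)

B_raw = slope(r, M)                                  # confounded estimate
B_strat = stratified_slopes(r, M, log_cf).mean()     # average over sub-files
print(f"unstratified B = {B_raw:.3f}, stratified average B = {B_strat:.3f}")
```

In this synthetic case the unstratified slope is pulled well away from the true value, while the average over the ten sub-files lies close to it, because the confounder varies much less within a sub-file than over the whole file.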
This procedure may not eliminate effects of a CF because the
average value of a CF does not necessarily represent its confounding
effects, a problem sometimes discussed as cross-level bias. For example,
average annual income may not represent the confounding effects of income
because those effects may depend on the fraction of the population that
is very poor, or very rich. To cover this problem, we consider separately
as CF the fraction of the population with income <$5000, $5000 to
$10,000, ..., >$150,000 (10 brackets in all), plus various combinations
of adjacent brackets (Ref. 8). With regard to the example
cited in the Introduction where average age does not represent the effects
of age distribution, we consider as CF the fraction of the population in
age group, <1 year, 1-2 years, .....,>85 years (31 groups in all) plus
various combinations of adjacent brackets.
These examples afford us an opportunity to discuss Lubin's
mathematical proof that using average values for a CF can lead to
unbounded errors (Ref. 4). Let's assume that men with incomes that are
integral multiples of $737 per year have N times higher lung cancer rates
and N times higher radon exposure than the average. We have no data to
show that this is not correct, and it would cause a very large error in
our result if it were true, in accordance with Lubin's mathematical proof.
However, the problem with this example is that it is completely
implausible. Lubin's proof must include a corollary involving
plausibility, without which it is not applicable. To invalidate our
study, one must
concoct a CF with a not implausible behavior and examine its effects on
the results. That process is the principal topic of this paper.
In Ref. 6, and subsequent papers, well over 100 potential CF were
treated by the stratification method with no progress in resolving our
discrepancy. But stratification is a tedious process and the number of
potential confounding factors is very large, so a simple screening
procedure was developed, described in the next section.
C. Plausibility Requirements for Confounding Factors
The way a confounding factor, X, can affect the value of B derived
from fitting data with Eq. (1) is by systematically causing counties with
low M to have high (or low) r, and vice versa. This would be evidenced by
the rankings of counties by M, R(M), and (for our case) the inverse
rankings by r, R(r), both being highly correlated, for unrelated reasons,
with the rankings of counties by X, R(X). The most effective R(X) is
     R0(X) = 0.5 R(M) + 0.5 R(r)
which leads to a coefficient of correlation by rankings, CoRR(X,r) =
CoRR(X,M) = 0.82. Lesser correlations can be generated by taking
     R(X) = p R0(X) + (1 - p) R-random
where R-random is a random rearrangement of the rankings, and p can be
varied between 0 and 1.0 to obtain varying CoRR(X,M) and CoRR(X,r). For
each value of p we use the stratification method to obtain a value of B
from fitting the data to Eq. (1). The results for three different sets of
R-random (generated by the MINITAB statistical package) are listed in
Table 2. We see there that the different sets of R-random give consistent
results and indicate that for a CF, X, to shift the value of B from its
original value B = -7.3, to the LNT prediction B = +7.3, requires
CoRR(X,r) and CoRR(X,M) of about 0.75, and even to change the sign of B
from - to + , accounting for half of our discrepancy, requires these
correlations to be about 0.6.
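The construction of R0(X) and of the diluted rankings R(X) can be sketched as below. The county data are synthetic and the helper names are ours; as a simplifying assumption, correlations are computed directly on the mixed scores rather than on a re-ranked version of them. Because R(M) and R(r) are both rearrangements of 1..n, the correlation of R0 with either one equals sqrt((1 + rho)/2), where rho is the correlation between R(M) and R(r); with rho = 0.35 this gives 0.82, consistent with the value quoted above.

```python
import math
import numpy as np

def ranks(v):
    """Ranks 1..n of the values in v (ties are not expected here)."""
    out = np.empty(len(v))
    out[np.argsort(v)] = np.arange(1, len(v) + 1)
    return out

def corr(a, b):
    """Pearson correlation coefficient of two equal-length arrays."""
    return np.corrcoef(a, b)[0, 1]

# Synthetic counties (assumption): M falls with r, plus noise.
rng = np.random.default_rng(2)
n = 1601
r = rng.lognormal(0.5, 0.5, n)
M = 1.0 - 0.073 * r + rng.normal(0.0, 0.2, n)

R_M = ranks(M)
R_r = ranks(-r)                      # inverse ranking by r
rho = corr(R_M, R_r)
R0 = 0.5 * R_M + 0.5 * R_r           # most effective confounder ranking

print(f"rho = {rho:.2f}, corr(R0, R_M) = {corr(R0, R_M):.2f}")
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    R_random = rng.permutation(n) + 1.0          # random rearrangement of ranks
    R_X = p * R0 + (1.0 - p) * R_random
    print(f"p = {p:.2f}: corr with R(M) = {corr(R_X, R_M):.2f}, "
          f"corr with R(r) = {corr(R_X, R_r):.2f}")
```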
How plausible are such correlations? The factors affecting radon
exposure, r, are geology and house construction characteristics, while the
factors affecting M, lung cancer rates corrected for smoking, are health
related problems, so it is almost impossible to imagine a direct causal
relationship between them. By far the most likely source of confounding is
through socioeconomic variables, SEV. In Ref. 6, 54 SEV were studied and
among them for CoRR(SEV,r) there were 3 between 0.35 and 0.30, 2 between
0.30 and 0.25, and the remaining 47 were <0.25; for CoRR(SEV,M) there were
4 between 0.31 and 0.25 and the remaining 50 were <0.25. It thus seems
highly implausible for correlations by ranking like 0.6 to 0.75 to occur.
It is interesting to note here that the correlation by rankings between M
and r, CoRR(M,r) = 0.35, is higher than for any of our SEV; this is the
strong correlation evident in Fig. 1.
D. Confounding by Uncertainties in Smoking and Radon Exposure
Up to this point, we have not discussed smoking as a CF; it has a
very special place because of its importance in causation of lung cancer,
leading to its explicit inclusion, represented by the term S in Eq. (1).
This means that the distribution of S-values, not just their ranking for
various counties, affects results for B. While several different sources of
data have been used to generate S-values for each county (these are
reviewed in Ref. 9) and all lead to similar B-values, the uncertainty in
S-values is still a matter of some concern.
If the width of our distribution of S-values is maintained but
S-values for counties are reassigned so as to give CoRR(S,r) = 1.0, a
perfect correlation by rank, B is reduced in magnitude from its original
value, -7.3, to zero, still leaving half of our discrepancy unexplained.
But the effects are increased if the distribution of S-values is wider.
The maximum plausible width for the S-distribution is the width of the
lung cancer mortality rate (m) distribution, since other factors influence
m in ways that, statistically, would increase the width. With this
increased S-distribution width, centered on the well established national
average for S, S-values were assigned to each county to give CoRR(S,r) =
1.0; we call this S-perfect. At the other extreme, these same S-values
were randomly
assigned to each county to obtain S-random. Calculations were then done
with
S = q S-perfect + (1-q) S-random
where q is various numbers between 0 and 1.0 chosen to obtain various
coefficients of correlation with r (not correlations by rank), Corr(S,r).
The results for three different S-random are shown in Table 3. We see
there that the Corr(S,r) required to change B to the LNT prediction, B =
+7.3, is about 0.9, and just to reduce B down to zero, eliminating half of
our discrepancy, is about 0.62, even with this substantially increased
width of the S-distribution.
How plausible are these required Corr(S,r)? Since there is no
direct causal relationship between S and r, the most probable source of a
correlation is through socioeconomic variables, SEV. It therefore seems
reasonable to assume that Corr(S,r) should be in the same range as
Corr(SEV,r). For the 54 SEV considered in Ref. 6, the largest absolute
value of Corr(SEV,r) is 0.37, the next largest is 0.30, and for 49 of the
54 SEV, it is less than 0.23. It thus seems highly implausible for
Corr(S,r) to approach the values required to help explain our
discrepancy.
Another type of problem arises if there is a systematic difference
(Ref. 10) in average radon exposures for smokers, r_s, and non-smokers,
r_n; this is a confounding factor on the level of individuals rather than
on the level of the entire group, which is often cited as a fatal flaw in
ecologic studies, but we treat it here. Since smokers are at 12 times
higher risk from radon than non-smokers (Ref. 7), the effective radon
level, r_e, for the county as a whole for causing lung cancer is
     r_e = [12 S r_s + (1 - S) r_n] / [12 S + (1 - S)]
where the denominator is the sum of the weights. This differs from the
measured average radon level, r,
     r = S r_s + (1 - S) r_n
If we define x = r_s / r_n, the relationship between the effective and
measured radon levels becomes (Ref. 10)
     r_e = r (12 S x + 1 - S) / [(x S + 1 - S) (11 S + 1)]
and we use r_e instead of r in fitting the data to determine values of B.
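The closed form for the effective radon level can be checked against its definition as a weighted average. The sketch below uses hypothetical values of S, x, and the non-smoker exposure; the function names and the weight parameter w are ours.

```python
def effective_radon_direct(r_s, r_n, S, w=12.0):
    """Weighted average of smoker and non-smoker exposures, with smokers
    weighted w times more heavily than non-smokers."""
    return (w * S * r_s + (1.0 - S) * r_n) / (w * S + (1.0 - S))

def effective_radon(r, S, x, w=12.0):
    """Same quantity written in terms of the measured county average r and
    the ratio x of smoker to non-smoker exposure."""
    return r * (w * S * x + 1.0 - S) / ((x * S + 1.0 - S) * ((w - 1.0) * S + 1.0))

# Hypothetical county: 40% smokers, smokers exposed to 0.9 times as much radon.
S, x, r_n = 0.4, 0.9, 1.0
r_s = x * r_n
r = S * r_s + (1.0 - S) * r_n        # measured average radon level

print(effective_radon_direct(r_s, r_n, S), effective_radon(r, S, x))
```

With x = 1 the two groups have the same exposure and the effective level reduces to the measured average r, as expected.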
In doing this, the parameters that may be varied are the average value of
x (x-average), the width of the distribution of x-values, and Corr(x,r).
Our studies (Ref. 11) show that the national average for x is 0.9, but we
give some results for other values. The distributions of county averages
of our socioeconomic variables (SEV) that are not proportional to
population and do not include counties with values less than 25% of the
mean have standard deviations between 7% and 21% of the mean. If we
include all SEV for which there are no zero values, the average width is
26% of the mean, and for one SEV it is above 50% (55% for percent of
income from government, which is an understandable special case). We
consider distributions of x-values with widths of 57% of the mean, which
severely stretches the limits of plausibility, and 28% of the mean, which
is quite high but perhaps not implausible.
Some results are listed in Table 4. The first five entries
explore the effect of x-average for the most favorable assumptions about
the other factors. The remaining entries use the known value of x-average
and explore the effects of the width of the distribution and of Corr(x,r).
The results in Table 4 make it evident that the effect under investigation
can do very little to change our B-value from -7.3 to the LNT prediction,
B = +7.3.
E. Plausibility of Correlation Applied to Other Factors
The above discussion is based on the assumption that Eq. (1)
represents the true prediction of LNT. But the BEIR-VI Report (Ref. 12)
suggested that Eq. (1) is deficient in that it ignores intensity of
smoking, and proposed that this be treated by dividing smokers into two
categories, 2 pack/day and 1 pack/day. To study this (Ref. 13), we define
     k = ratio of 2 pack/day to 1 pack/day smokers in a county
     f = ratio of lung cancer risk for 2 pack/day to 1 pack/day
Analysis of available data indicates (Ref. 12) that the plausible values
most favorable for the BEIR-VI suggestion are f = 2.0 and national
average for k = 0.4. Using these converts Eq. (1) to
     M = m / [9 - 9 S + 84 S {(1 + 2 k)/(1 + k)}] = A + B r          (2)
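As a consistency check between the two equations, the smoking correction in Eq. (2) should reduce to the bracket of Eq. (1) when k takes its national average value of 0.4, since 84 (1 + 0.8)/(1 + 0.4) = 108 and 9 - 9 S + 108 S = 9 + 99 S. A short sketch (function names are ours):

```python
import math

def smoking_term_eq1(S):
    """Bracketed smoking correction of Eq. (1)."""
    return 9.0 + 99.0 * S

def smoking_term_eq2(S, k):
    """Bracketed smoking correction of Eq. (2), with f = 2.0 built in."""
    return 9.0 - 9.0 * S + 84.0 * S * (1.0 + 2.0 * k) / (1.0 + k)

for S in (0.0, 0.3, 0.5, 1.0):
    print(S, smoking_term_eq1(S), smoking_term_eq2(S, k=0.4))
```

For k above 0.4 the Eq. (2) correction is larger than that of Eq. (1), and for k below 0.4 it is smaller, which is how a county's smoking intensity enters the fit.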
Different distributions of k-values were tried, but the most promising was
a level distribution between 0 and 0.8, to be consistent with the national
average of 0.4. As in Section D, we assign k-values to counties so as to
define k-perfect as the assignment for which CoRR(k,r) = 1.0, and k-random
as one where k-values are assigned randomly. We then generate k-values to
be used in fitting Eq. (2) as
     k = g k-perfect + (1 - g) k-random
where g is given various values between 0 and 1.0 to obtain different
Corr(k,r). The results for one set of k-random are
     Corr(k,r)     0     -0.37   -0.52   -0.78   -0.91   -0.93
     B           -7.6    -5.0    -4.2    -2.5    -0.7    +1.4
In view of our previous discussion on plausibility of correlations, it is
clear that including intensity of smoking as a confounder can do little to
reduce our discrepancy. In Ref. 13, consideration was also given to
possible correlations with r for both S and k, using the above method. For
cases where Corr(k,r) = Corr(S,r), as these vary from zero to -0.8, B
increases roughly linearly from -10.0 to +1.3; for example, for Corr(k,r)
= Corr(S,r) = -0.4, B = -4.3. Again it is apparent that plausible values
of these correlations can do little to bring B close to the LNT
prediction, B = +7.3.
F. Other Problems with Confounding
In our extensive studies (Ref. 11) of how radon levels vary with
socioeconomic factors, house characteristics, geography, etc., it was
found that rural houses average about 25% higher radon levels than urban
houses, whereas urban males smoke about 25% more frequently than rural
males. This is a variation on the level of individuals which cannot be
taken into account properly by use of county average radon level and
smoking prevalence. This problem was treated in Ref. 6 using a model with
the above percentages as parameters, by modifying the derivation of
Eq. (1) to consider not just the two categories, smokers and non-smokers,
but four categories, urban and rural smokers and urban and rural
non-smokers, each category having its own percentage of the population,
lung cancer rate, and average radon level. These are related by the
percent of the population that lives in urban areas, a known quantity for
each county, and m, r, and S for the county. It was found that the changes
in B caused by various plausible values of the parameters were only a few
percent.
Actually the slope of a regression of lung cancer rate, m, on r is rather
strongly affected, but nearly all of these effects are compensated by the
correction for smoking given by the term in square brackets in Eq. (1).
There have been suggestions that effective radon exposure,
r-effective, may not be the same as the measured radon level in the home,
r-measured; for example, time spent in the home may vary, or exposures
outside the home may be important. We represent this as
r-effective = (1+f) r-measured
This can make a difference if there is a strong correlation between f and
r. This problem was studied by the methods outlined above, testing various
correlations and their plausibility (Ref. 9), and it was found not to be
important. This can be easily understood as follows. A positive
correlation stretches out the abscissa in Fig. 1, which can reduce the
B-value, but not change it from negative to positive as required to
explain our discrepancy. A negative correlation contracts the abscissa,
thus increasing the negative value of B. Small correlations only spread
the data more, without changing B. Thus this approach can do little for
explaining our discrepancy.
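The effect of a positive correlation between f and r can be illustrated with a small synthetic example. Everything here is an assumption made for illustration: the county data are invented, and f is taken to be exactly proportional to r, the extreme case of a perfect positive correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1601
r_measured = rng.uniform(0.5, 6.0, n)                       # synthetic radon levels
M = 1.3 - 0.073 * r_measured + rng.normal(0.0, 0.1, n)      # synthetic negative slope

f = 0.2 * r_measured                 # f perfectly and positively correlated with r
r_effective = (1.0 + f) * r_measured

B_measured = np.polyfit(r_measured, M, 1)[0]
B_effective = np.polyfit(r_effective, M, 1)[0]
print(f"B using r-measured  = {B_measured:.3f}")
print(f"B using r-effective = {B_effective:.3f}")
```

Stretching the abscissa in this way makes the fitted slope smaller in magnitude, but since r-effective still increases with r-measured the slope remains negative.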
Another issue is the effect of confounding by combinations of CF.
This is a complex issue, but it is treated in Ref. 6, where it is
concluded that combinations of the 54 socioeconomic factors used there
can account for only about 10% of our discrepancy.
G. Conclusion
All potential confounding factors that we or others have suggested
have been shown to be incapable of explaining our discrepancy. In view of
the fact that there is no other evidence supporting the theory in this low
dose region, this may be interpreted to mean that the linear-no threshold
theory fails the experimental test. However, further suggestions for
explaining our discrepancy are always welcome and will be carefully
analyzed.
Table 1: Treatment of county population density (PD) as a confounding
factor by the stratification method. Results are for B using single
regression of M on r as in Eq. (1), and double regression of M on r and
PD, fitting the data to
     M = A + B r + E PD
where E is a fitting parameter.
County Rank   PD range        Single Regression    Double Regression
by PD         (x100/sq.mi)    B-male   B-female    B-male   B-female
___________   ____________    ______   ________    ______   ________
   1- 160     0.003-0.094      -3.7      -6.6       -3.7      -6.4
 161- 320     0.095-0.22       -8.0      -7.8       -8.0      -7.9
 321- 480     0.22-0.35        -7.0      -8.5       -7.0      -8.5
 481- 640     0.35-0.50        -6.4      -9.7       -6.4      -9.8
 641- 800     0.50-0.67        -8.9      -8.7       -8.9      -8.7
 801- 960     0.67-0.92        -4.3      -4.4       -4.3      -4.4
 961-1120     0.93-1.29        -9.2      -6.0       -9.3      -6.0
1121-1280     1.30-2.05        -5.9      -8.1       -5.9      -8.1
1281-1440     2.05-4.11        -0.5      -2.7       -0.5      -2.8
1441-1601     4.12-671.8       -4.5      -7.4       -3.9      -6.2
                              ______   ________    ______   ________
AVERAGE                        -5.8      -7.0       -5.8      -6.9
Table 2: B-values obtained if a confounding factor, X, has various
correlations by ranking with M and r, CoRR(X,M) and CoRR(X,r). The three
sets of results are for three different R-random.
CoRR(X,r) CoRR(X,M)    B     CoRR(X,r) CoRR(X,M)    B     CoRR(X,r) CoRR(X,M)    B
  0.09      0.09      -7.2     0.07      0.12      -7.2     0.08      0.08      -7.2
  0.18      0.18      -7.0     0.16      0.21      -6.9     0.16      0.17      -7.0
  0.34      0.34      -5.2     0.32      0.37      -5.3     0.33      0.33      -5.1
  0.53      0.53      -2.1     0.51      0.55      -2.3     0.52      0.52      -2.1
  0.69      0.69      +3.2     0.68      0.70      +2.9     0.69      0.69      +3.3
  0.79      0.79     +10.5     0.78      0.79     +10.4     0.79      0.79     +10.8
  0.81      0.81     +13.7     0.81      0.82     +13.7     0.81      0.81     +13.8
Table 3: B-values obtained if smoking prevalence, S, has various
Corr(S,r), assuming the maximum plausible width for the S-distribution.
The three sets of results are for three different S-random.
Corr(S,r)    B       Corr(S,r)    B       Corr(S,r)    B
  -0.17     -7.1       -0.24     -5.5       -0.23     -6.0
  -0.33     -4.7       -0.39     -3.2       -0.38     -3.7
  -0.41     -3.5       -0.47     -2.0       -0.45     -2.6
  -0.49     -2.3       -0.54     -0.9       -0.53     -1.4
  -0.57     -1.1       -0.62     +0.3       -0.60     -0.2
  -0.65     +0.1       -0.68     +1.4       -0.68     +0.9
  -0.78     +2.7       -0.81     +3.8       -0.80     +3.3
  -0.88     +5.5       -0.89     +6.4       -0.88     +6.0
  -0.93     +8.6       -0.93     +9.3       -0.93     +8.9
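The correlations quoted in Section D as needed to bring B to zero and to the LNT prediction can be read off Table 3 by linear interpolation. The sketch below simply interpolates the tabulated values for each of the three S-random sets; the choice of linear interpolation is ours and is only a convenient way of reading the table.

```python
import numpy as np

# (Corr(S,r), B) columns of Table 3, one pair of arrays per S-random set.
table3 = [
    ([-0.17, -0.33, -0.41, -0.49, -0.57, -0.65, -0.78, -0.88, -0.93],
     [-7.1, -4.7, -3.5, -2.3, -1.1, 0.1, 2.7, 5.5, 8.6]),
    ([-0.24, -0.39, -0.47, -0.54, -0.62, -0.68, -0.81, -0.89, -0.93],
     [-5.5, -3.2, -2.0, -0.9, 0.3, 1.4, 3.8, 6.4, 9.3]),
    ([-0.23, -0.38, -0.45, -0.53, -0.60, -0.68, -0.80, -0.88, -0.93],
     [-6.0, -3.7, -2.6, -1.4, -0.2, 0.9, 3.3, 6.0, 8.9]),
]

def corr_needed(corr, B, target_B):
    """Corr(S,r) at which B reaches target_B, by linear interpolation.
    np.interp needs increasing x-values; B increases down each column."""
    return float(np.interp(target_B, B, corr))

corr_for_zero = np.mean([corr_needed(c, b, 0.0) for c, b in table3])
corr_for_lnt = np.mean([corr_needed(c, b, 7.3) for c, b in table3])
print(f"Corr(S,r) for B = 0:    {corr_for_zero:.2f}")
print(f"Corr(S,r) for B = +7.3: {corr_for_lnt:.2f}")
```

The interpolated magnitudes come out near 0.62 and 0.9, in line with the values stated in the text.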
Table 4: Effects of difference in radon exposure for smokers and
non-smokers, with x = smoker/non-smoker exposures in each county. Table
gives value of B for various choices of the distribution of x-values and
Corr(x,r).
x-average    SD/mean of x    Corr(x,r)     B
_________    ____________    _________    ____
   0.8           0.57           1.0       -4.9
   0.9           0.57           1.0       -4.8
   1.0           0.57           1.0       -4.7
   1.2           0.57           1.0       -4.5
   1.5           0.57           1.0       -4.3
   0.9           0.57           0         -6.5
   0.9           0.57           0.4       -5.9
   0.9           0.57           0.7       -5.5
   0.9           0.57           1.0       -4.8
   0.9           0.28           0         -7.3
   0.9           0.28           0.4       -6.7
   0.9           0.28           1.0       -5.6
CAPTION FOR FIGURE
Fig. 1: Lung cancer mortality rates corrected for smoking prevalence with
the bracketed term in Eq. (1), vs average radon levels in homes, for 1601
U.S. counties. Data points shown are the average of ordinates for all
counties within the range of r-values shown on the base-line of the upper
left figure; the number of counties within that range is also shown there.
Error bars are one standard deviation of the mean, and the first and third
quartiles of the distributions are also shown. Theory lines are
arbitrarily normalized lines increasing at a rate of +7.3% per pCi/L. The
left and right figures are calculated with lung cancer mortality rates for
1970-1979 and 1979-1994 respectively. These figures are used only for
presentation; all analyses, including the straight line fit to the data
shown here, use the 1601 actual data points.
REFERENCES
1. Greenland, S. and Robins, J. Ecologic studies -- biases,
   misconceptions, and counterexamples. Am. J. Epidemiol.
   139:747-760;1994
2. Stidley, C.A. and Samet, J.M. Assessment of ecologic regression in the
   study of lung cancer and indoor radon. Am. J. Radiol. 139:312-322;1994
3. Morgenstern, H. Ecologic studies in epidemiology: concepts, principles,
   and methods. Annual Rev. Public Health 16:61-81;1995
4. Lubin, J.H. On the discrepancy between epidemiologic studies in
   individuals of lung cancer and residential radon and Cohen's ecologic
   regression. Health Phys. 75:4-10;1998
5. Lagarde, F. and Pershagen, G. Parallel analyses of individual and
   ecologic data on residential radon, cofactors, and lung cancer in
   Sweden. Am. J. Epidemiol. 149:268-274;1999
6. Cohen, B.L. Test of the linear-no threshold theory of radiation
   carcinogenesis for inhaled radon decay products. Health Phys.
   68:157-174;1995
7. National Academy of Sciences Committee on Biological Effects of Low
   Level Radiation: Health risks of radon and other internally deposited
   alpha emitters (BEIR-IV). National Academy Press, Washington, DC;
   1988
8. Cohen, B.L. Updates and extensions to tests of the linear-no threshold
   theory. Technology 7:657-672;2000
9. Cohen, B.L. Response to criticisms of Smith et al. Health Phys.
   75:23-28;1998
10. Cohen, B.L. Response to Lubin's proposed explanations of our
    discrepancy. Health Phys. 75:18-22;1998
11. Cohen, B.L. Variation of radon levels in U.S. homes correlated with
    house characteristics, location, and socioeconomic factors. Health
    Phys. 60:631-642;1991
12. National Academy of Sciences Committee on Health Effects of Ionizing
    Radiation: Health effects of exposure to radon (BEIR-VI). National
    Academy Press, Washington, DC; 1999
13. Cohen, B.L. Testing a BEIR-VI suggestion for explaining the lung
    cancer vs radon relationship for U.S. counties. Health Phys.
    78:522-527;2000