
Treatment of Confounding Factors in an Ecologic Study



	A copy of this paper is appended below for those who requested it.

The figure (not included here) is Fig. 1 from my paper in Health Physics

68:157-174;1995



Bernard L. Cohen

Physics Dept.

University of Pittsburgh

Pittsburgh, PA 15260

Tel: (412)624-9245

Fax: (412)624-9163

e-mail: blc+@pitt.edu



TREATMENT OF CONFOUNDING FACTORS IN AN ECOLOGIC STUDY



Bernard L. Cohen

University of Pittsburgh



	It is often said that an ecologic study cannot give adequate

treatment for confounding factors because the average value of a variable

does not truly represent its confounding effects. For example, the average

age for groups cannot elucidate an effect which depends only on the

fraction of the groups that are very young, or that are very old. Several

arguments to that effect have been published (Refs. 1-3), and Lubin (Ref. 4) has given a

mathematical proof that this problem can lead to an unbounded error.

Lagarde and Pershagen (Ref. 5) have recently given an actual example where this

problem led to a wrong result. An associated problem is that an

ecologic study may not have data on important details; for example, it may

have data on average levels of a pollutant in homes of the groups, but

this does not give information on how much time is spent in those homes,

which is required to determine actual exposures. This is a special case of

the problem that confounding may not be represented by a single parameter,

such as the fraction of the population that smokes when intensity of

smoking is an issue. Still another problem is that average values of

potential confounders may be unreliable for some reason. 

	While all of these problems are very serious and even fatal in

nearly all ecologic studies, the purpose of this paper is to describe

methods of addressing them in at least one particular ecologic study.





A. The Problem



	In a study of individuals, these problems can be solved by asking

appropriate questions of each individual, but that is not possible in an

ecologic study. The particular ecologic study (Ref. 6) considered here was

designed to test the validity of the linear-no threshold theory

[hereafter, LNT] of radiation carcinogenesis, as applied to the radon vs

lung cancer relationship in the BEIR-IV Report (Ref. 7).

	According to BEIR-IV, the lung cancer mortality risk to an

individual, m, is related to his accumulated radon exposure, r, by

		m = a_n (1 + b r)		for a non-smoker

		m = a_s (1 + b r)		for a smoker

where a_n, a_s, and b are given constants. Using these and adding up the

experiences of all males in a county and dividing by the male population

(all previous papers give parallel analyses for females, always with

similar results), gives

		m = [(1 - S) a_n + S a_s] (1 + b r)

where m is the lung cancer mortality rate for the county, r is the average

radon exposure, and S is the fraction of the adult males that smoke. The

derivation of this equation, carried out elaborately in Ref. 6,  is

mathematically rigorous and is not affected by the ecologic fallacy.
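
The aggregation step can be checked numerically. The following is a minimal sketch, not the derivation of Ref. 6: the constants are illustrative assumptions (chosen only so that the smoking bracket has the form 9 + 99 S), and the toy county is hypothetical. It averages the individual risks directly and compares the result with the county-level formula. The two agree exactly here because smokers and non-smokers in the toy county have the same mean exposure; the case where they differ is the subject of Section D.

```python
# Illustrative constants only (assumptions, not values taken from BEIR-IV).
A_N = 9.0    # baseline risk for a non-smoker, arbitrary units
A_S = 108.0  # baseline risk for a smoker, arbitrary units
B = 0.073    # fractional risk increase per unit of radon exposure


def individual_risk(smoker, r):
    """Risk m = a (1 + b r) for one person."""
    a = A_S if smoker else A_N
    return a * (1.0 + B * r)


def county_rate_from_individuals(people):
    """Average the individual risks over (smoker, exposure) pairs."""
    return sum(individual_risk(s, r) for s, r in people) / len(people)


def county_rate_from_averages(S, r_avg):
    """County-level formula m = [(1 - S) a_n + S a_s] (1 + b r)."""
    return ((1.0 - S) * A_N + S * A_S) * (1.0 + B * r_avg)


# Hypothetical county: both groups have the same mean exposure (2.0).
people = [(True, 1.0), (True, 3.0),
          (False, 0.5), (False, 3.5), (False, 2.0), (False, 2.0)]
S = sum(1 for s, _ in people if s) / len(people)
r_avg = sum(r for _, r in people) / len(people)

direct = county_rate_from_individuals(people)
formula = county_rate_from_averages(S, r_avg)
print(direct, formula)
```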

	Ref. 6 presents a test of this relationship involving data for

over half of all U.S. counties, containing 90% of the U.S. population.

With corrections for movement of people between counties, the above

relationship with numerical values becomes

		M = m / [9 + 99 S] = A + B r			(1)

where the bracketed term may be thought of as the correction for smoking

prevalence, A is a number close to 1.00, and B, in percent per pCi/L of

radon (percent per 37 Bq/m3), is

			B = +7.3			LNT prediction		

Ref. 6 tests the validity of LNT by fitting the data for 1601 counties to

Eq. (1) to obtain a value of B, to be compared with this LNT prediction.

While all fitting is done mathematically to the 1601 data points, the

results can be appreciated from Fig. 1, which is a presentation explained

in its caption. The result from data fitting is B = -7.3 +/- 0.56,

discrepant from the LNT prediction by 26 standard deviations; we refer to

this as our discrepancy. The remainder of this paper is about efforts to

explain our discrepancy without abandoning LNT by invoking possible

confounding factors (CF).
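
As a quick arithmetic check of the size of the discrepancy, the sketch below divides the gap between the LNT prediction and the fitted slope by the quoted uncertainty. The variable names are ours; the numbers are those given above.

```python
b_lnt = 7.3    # LNT prediction for B, percent per pCi/L
b_fit = -7.3   # fitted value of B, percent per pCi/L
sigma = 0.56   # quoted standard deviation of the fitted B

# Separation between prediction and fit, in units of the standard deviation.
z = (b_lnt - b_fit) / sigma
print(round(z, 1))
```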



B. Treatment of Confounding Factors (CF) by Stratification



	For a given CF, for which data are available for each county, the

method used was to stratify the data file on that CF into several

sub-files. An example, for county population density (PD) as a CF, is exhibited

in Table 1, where the data are divided into 10 separate sub-files of 160

counties each, and each sub-file is analyzed independently to obtain a

value of B. Since the variation in population density in any one of these

sub-files is much less than in the total data file, its confounding

effects are much reduced. This is shown in Table 1 by including results of

a double regression of M on r and PD, noting that B-values from single and

double regression are essentially the same.  Thus an average of the 10 B

values obtained gives a value of B largely free of confounding by

population density. In Table 1 we include results for males and females as

an example of typical differences, but hereafter for brevity we give

results for males only. 
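
The stratification procedure can be sketched as follows. This is a minimal illustration on synthetic data, not the code used for Ref. 6: the function names, the toy data, and the choice of an ordinary least-squares fit are assumptions. The toy confounder is constant within each stratum, so the within-stratum slopes recover the true B exactly, while the pooled fit is badly confounded.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def stratified_slope(r, M, cf, n_strata=10):
    """Sort counties by the confounder, split into strata, fit each, average B."""
    order = sorted(range(len(cf)), key=lambda i: cf[i])
    size = len(order) // n_strata
    slopes = []
    for k in range(n_strata):
        # The last stratum absorbs any remainder (e.g. 161 of 1601 counties).
        stop = (k + 1) * size if k < n_strata - 1 else len(order)
        idx = order[k * size:stop]
        _, b = fit_line([r[i] for i in idx], [M[i] for i in idx])
        slopes.append(b)
    return sum(slopes) / n_strata, slopes


# Synthetic toy data (assumption): true B = -0.073, and a confounder that
# raises M and is strongly correlated with r across strata.
n = 100
cf = [float(i // 10) for i in range(n)]
r = [0.2 * (i % 10) + cf[i] for i in range(n)]
M = [1.0 - 0.073 * r[i] + 0.5 * cf[i] for i in range(n)]

_, pooled_b = fit_line(r, M)
avg_b, per_stratum = stratified_slope(r, M, cf)
print(pooled_b, avg_b)
```

In real data the confounder still varies somewhat within each stratum, which is why the text compares single and double regression within the sub-files.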

	This procedure may not eliminate effects of a CF because the

average value of a CF does not necessarily represent its confounding

effects, a problem sometimes discussed as cross-level bias. For example,

average annual income may not represent the confounding effects of income

because its confounding effects may depend on the fraction of the

population that is very poor, or very rich. To cover this problem, we

consider separately as CF the fraction of the population with income

<$5000, $5000 to $10,000, ..., >$150,000 (10 brackets in all), plus

various combinations of adjacent brackets (Ref. 8). With regard to the example

cited in the Introduction where average age does not represent the effects

of age distribution, we consider as CF the fraction of the population in

each age group, <1 year, 1-2 years, ..., >85 years (31 groups in all), plus

various combinations of adjacent brackets.

	These examples afford us an opportunity to discuss Lubin's

mathematical proof that using average values for a CF can lead to

unbounded errors (Ref. 4). Let's assume that men with incomes that are integral

multiples of $737 per year have N times higher lung cancer rates and N

times higher radon exposure than the average. We have no data to show

that this is not correct, and it would cause a very large error in our

result if it were true, in accordance with Lubin's mathematical proof.

However, the problem with this example is that it is completely

implausible. Lubin's proof must include a corollary involving plausibility,

without which it is not applicable. To invalidate our study, one must

concoct a CF with a not implausible behavior and examine its effects on

the results. That process is the principal topic of this paper.

	In Ref. 6, and subsequent papers, well over 100 potential CF were

treated by the stratification method with no progress in resolving our

discrepancy. But stratification is a tedious process and the number of

potential confounding factors is very large, so a simple screening

procedure was developed, described in the next section.



C. Plausibility Requirements for Confounding Factors



	The way a confounding factor, X, can affect the value of B derived

from fitting data with Eq. (1) is by systematically causing counties with

low M to have high (or low) r, and vice versa. This would be evidenced by

the rankings of counties by M, R(M), and (for our case) the inverse

rankings by r, R(r), both being highly correlated for unrelated reasons

with the rankings of counties by X, R(X). The most effective R(X) is

		R0(X) = 0.5 R(M) + 0.5 R(r)

which leads to a coefficient of correlation by rankings, CoRR(X,r) =

CoRR(X,M) = 0.82. Lesser correlations can be generated by taking

		R(X) = p R0(X) + (1 - p) R-random

where R-random is a random rearrangement of the rankings, and p can be

varied between 0 and 1.0 to obtain varying CoRR(X,M) and CoRR(X,r). For

each value of p we use the stratification method to obtain a value of B

from fitting the data to Eq. (1). The results for three different sets of

R-random (generated by the MINITAB statistical package) are listed in

Table 2. We see there that the different sets of R-random give consistent

results and indicate that for a CF, X, to shift the value of B from its

original value B = -7.3, to the LNT prediction B = +7.3, requires

CoRR(X,r) and CoRR(X,M) of about 0.75, and even to change the sign of B

from - to + , accounting for half of our discrepancy, requires these

correlations to be about 0.6.
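
The construction of R(X) can be sketched in a few lines. This is an illustration under stated assumptions, not the MINITAB procedure used for Table 2: the toy rankings are random permutations, the function names are ours, and CoRR is taken to be the ordinary correlation coefficient between two ranking vectors.

```python
import random


def pearson(xs, ys):
    """Correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def base_ranking(rank_M, rank_r_inv):
    """R0(X) = 0.5 R(M) + 0.5 R(r), with R(r) the inverse ranking by r."""
    return [0.5 * a + 0.5 * b for a, b in zip(rank_M, rank_r_inv)]


def mixed_ranking(r0, p, rng):
    """R(X) = p R0(X) + (1 - p) R-random."""
    r_random = list(range(1, len(r0) + 1))
    rng.shuffle(r_random)
    return [p * a + (1 - p) * b for a, b in zip(r0, r_random)]


# Toy rankings (assumption): two independent random permutations.
rng = random.Random(0)
n = 1000
rank_M = list(range(1, n + 1))
rng.shuffle(rank_M)
rank_r_inv = list(range(1, n + 1))
rng.shuffle(rank_r_inv)

r0 = base_ranking(rank_M, rank_r_inv)
rho = pearson(rank_M, rank_r_inv)
for p in (0.0, 0.5, 1.0):
    rx = mixed_ranking(r0, p, rng)
    print(p, round(pearson(rx, rank_M), 2), round(pearson(rx, rank_r_inv), 2))
```

For p = 1, the correlation of R0(X) with either ranking equals sqrt((1 + rho)/2), where rho is the correlation between R(M) and the inverse R(r). With rho = 0.35 this gives about 0.82, in line with the maximum value quoted above.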

	How plausible are such correlations? The factors affecting radon

exposure, r, are geology and house construction characteristics, while the

factors affecting M, lung cancer rates corrected for smoking, are health

related problems, so it is almost impossible to imagine a direct causal

relationship between them. By far the most likely source of confounding is

through socioeconomic variables, SEV. In Ref. 6, 54 SEV were studied and

among them for CoRR(SEV,r) there were 3 between 0.35 and 0.30, 2 between

0.30 and 0.25, and the remaining 47 were <0.25; for CoRR(SEV,M) there were

4 between 0.31 and 0.25 and the remaining 50 were <0.25. It thus seems

highly implausible for correlations by ranking like 0.6 to 0.75 to occur.

It is interesting to note here that the correlation by rankings between M

and r, CoRR(M,r), is 0.35, higher than for any of our SEV; this is the

strong correlation evident in Fig. 1.



D. Confounding by Uncertainties in Smoking and Radon Exposure



	Up to this point, we have not discussed smoking as a CF; it has a

very special place because of its importance in causation of lung cancer,

leading to its explicit inclusion, represented by the term S in Eq. (1).

This means that the distribution of S-values, not just its ranking for

various counties, affects results for B. While several different sources of

data have been used to generate S-values for each county (these are

reviewed in Ref. 9) and all lead to similar B-values, the uncertainty in

S-values is still a matter of some concern. 

	If the width of our distribution of S-values is maintained but

S-values for counties are reassigned so as to give CoRR(S,r) = 1.0, a

perfect correlation by rank, B is reduced from its original value, -7.3,

to zero, still leaving half of our discrepancy unexplained. But the

effects are increased if the distribution of S-values is wider. The

maximum plausible width for the S-distribution is the width of the lung

cancer mortality rate (m) distribution, since other factors influence m in

ways that, statistically, would increase the width. With this increased

S-distribution width, centered on the well established national average for

S, S-values were assigned to each county to give CoRR(S,r) = 1.0; we call

this S-perfect. At the other extreme, these same S-values were randomly

assigned to each county to obtain S-random. Calculations were then done

with

		S = q S-perfect + (1-q) S-random

where q is various numbers between 0 and 1.0 chosen to obtain various

coefficients of correlation with r (not correlations by rank), Corr(S,r).

The results for three different S-random are shown in Table 3. We see

there that the Corr(S,r) required to change B to the LNT prediction, B =

+7.3, is about 0.9, and just to reduce B down to zero, eliminating half of

our discrepancy, is about 0.62, even with this substantially increased

width of the S-distribution.

	How plausible are these required Corr(S,r)? Since there is no

direct causal relationship between S and r, the most probable source of a

correlation is through socioeconomic variables, SEV. It therefore seems

reasonable to assume that Corr(S,r) should be in the same range as

Corr(SEV,r). For the 54 SEV considered in Ref. 6, the largest absolute

value Corr(SEV,r) is 0.37, the next largest is 0.30, and for 49 of the 54

SEV, it is less than 0.23. It thus seems highly implausible for

Corr(S,r) to approach the values required to help explain our

discrepancy. 

	Another type of problem arises if there is a systematic difference (Ref. 10)

in average radon exposures for smokers, rs, and non-smokers, rn; this is a

confounding factor on the level of individuals rather than on the level of

the entire group, which is often cited as a fatal flaw in ecologic

studies, but we treat it here. Since smokers are 12 times more at relative

risk from radon than non-smokers (Ref. 7), the effective radon level, re, for the

county as a whole for causing lung cancer is

		re = [12 S rs + (1 - S) rn ] / [12 S + (1 - S)]

where the denominator is the sum of the weights. This differs from the

measured average radon level, r,

		r = S rs + (1-S) rn

If we define x = rs / rn, the relationship between the effective and

measured radon levels becomes (Ref. 10)

		re = r (12 S x + 1 - S) / [(x S + 1 - S) (11 S + 1)]

and we use re instead of r in fitting the data to determine values of B.

In doing this, the parameters that may be varied are the average value of

x (x-average), the width of the distribution of x-values, and Corr(x,r).

Our studies (Ref. 11) show that the national average for x is 0.9, but we give

some results for other values. The width of distributions of county

averages of our socioeconomic variables (SEV) that are not proportional to

population and do not include counties with values less than 25% of the

mean have standard deviations between 7% and 21% of the mean. If we

include all SEV for which there are no zero values, the average width is

26% of the mean, and for one SEV it is above 50%, 55% for percent of

income from government, which is an understandable special case. We

consider distributions of x-values with widths of 57% of the mean, which severely

stretches the limits of plausibility, and 28% of the mean, which is quite

high but perhaps not implausible.
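
The closed-form expression for re can be checked against its definition. The sketch below is only a numerical consistency check: the function names and the example county values (S = 0.4, rn = 2.0) are assumptions, while the weight of 12 and x = 0.9 are taken from the text.

```python
SMOKER_WEIGHT = 12.0  # relative-risk weight for smokers quoted in the text


def effective_radon_direct(S, r_s, r_n):
    """re as the risk-weighted average of smoker and non-smoker exposures."""
    return (SMOKER_WEIGHT * S * r_s + (1 - S) * r_n) / (SMOKER_WEIGHT * S + (1 - S))


def measured_radon(S, r_s, r_n):
    """Measured county average r = S rs + (1 - S) rn."""
    return S * r_s + (1 - S) * r_n


def effective_radon(r, S, x):
    """Closed form in terms of measured r and the ratio x = rs / rn."""
    return r * (12 * S * x + 1 - S) / ((x * S + 1 - S) * (11 * S + 1))


# Hypothetical county values (assumptions for illustration).
S, r_n, x = 0.4, 2.0, 0.9
r_s = x * r_n
r = measured_radon(S, r_s, r_n)

re_direct = effective_radon_direct(S, r_s, r_n)
re_closed = effective_radon(r, S, x)
print(r, re_direct, re_closed)
```

When x = 1 the correction factor is exactly 1, so re equals the measured r.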

	 Some results are listed in Table 4. The first five entries

explore the effect of x-average for the most favorable assumptions about

the other factors. The remaining entries use the known value of x-average

and explore the effects of the width of the distribution and of Corr(x,r).

The results in Table 4 make it evident that the effect under investigation

can do very little to change our B-value from -7.3 to the LNT prediction,

B = +7.3.



E. Plausibility of Correlation Applied to Other Factors



	The above discussion is based on the assumption that Eq. (1)

represents the true prediction of LNT. But the BEIR-VI Report (Ref. 12) suggested

that Eq. (1) is deficient in that it ignores intensity of smoking, and

proposes that this be treated by dividing smokers into two categories, 2

pack/day and 1 pack/day. To study this (Ref. 13), we define

		k = ratio of 2 pack/day to 1 pack/day smokers in a county

		f = ratio of lung cancer risk for 2 pack/day to 1 pack/day

Analysis of available data indicates (Ref. 12) that the plausible values most favorable

for the BEIR-VI suggestion are f = 2.0 and national average for k = 0.4.

Using these converts Eq. (1) to 

		M = m / [9 - 9 S + 84 S (1 + 2 k)/(1 + k)] = A + B r		(2)

Different distributions of k-values were tried but the most promising was

a level distribution between 0 and 0.8, to be consistent with the national

average of 0.4. As in Section D, we assign k-values to counties so as to

define k-perfect as the assignment for which CoRR(k,r) = 1.0, and k-random

as one where k-values are assigned randomly. We then generate k-values to

be used in fitting Eq. (2) as

		k = g k-perfect + (1 - g) k-random

where g is given various values between 0 and 1.0 to obtain different

Corr(k,r). The results for one set of k-random are



Corr(k,r)      0      -0.37    -0.52    -0.78    -0.91    -0.93

    B        -7.6     -5.0     -4.2     -2.5     -0.7     +1.4



In view of our previous discussion on plausibility of correlations, it is

clear that including intensity of smoking as a confounder can do little to

reduce our discrepancy. In Ref. 13, consideration was also given to

possible correlations with r for both S and k, using the above method. For

cases where Corr(k,r) = Corr(S,r), as these vary from zero to -0.8, B

increases roughly linearly from -10.0 to +1.3; for example, for Corr(k,r)

= Corr(S,r) = -0.4, B = -4.3. Again it is apparent that plausible values of

these correlations can do little to bring B close to the LNT prediction, B

= +7.3.
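
One consistency check on the smoking bracket in Eq. (2): with f = 2.0 and k equal to the national average of 0.4, it should reduce to the bracket 9 + 99 S of Eq. (1). The sketch below confirms this numerically; the function names are ours.

```python
def smoking_bracket_eq1(S):
    """Smoking correction in Eq. (1)."""
    return 9 + 99 * S


def smoking_bracket_eq2(S, k):
    """Smoking correction in Eq. (2), for f = 2.0."""
    return 9 - 9 * S + 84 * S * (1 + 2 * k) / (1 + k)


# At k = 0.4 the factor 84 (1 + 2k)/(1 + k) equals 108, so the two brackets agree.
for S in (0.0, 0.25, 0.5, 1.0):
    print(S, smoking_bracket_eq1(S), smoking_bracket_eq2(S, 0.4))
```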



F. Other Problems with Confounding

	

	In our extensive studies (Ref. 11) of how radon levels vary with

socioeconomic factors, house characteristics, geography, etc., it was

found that rural houses average about 25% higher radon levels than urban

houses, whereas urban males smoke about 25% more frequently than rural

males. This is a variation on the level of individuals which cannot be

taken into account properly by use of county average radon level and

smoking prevalence. This problem was treated in Ref. 6 using a model with

the above percentages as parameters by modifying the derivation of Eq. (1)

to consider not just the two categories, smokers and non-smokers, but

four categories, urban and rural smokers and urban and rural non-smokers,

each category having its own percentage of the population, lung cancer

rate, and average radon level. These are related by the percent of the

population that lives in urban areas, a known quantity for each county,

and m, r, and S for the county. It was found that the changes in B caused

by various plausible values of the parameters were only a few percent.

Actually the slope of a regression of lung cancer rate, m, on r is rather

strongly affected, but nearly all of these effects are compensated by the

correction for smoking given by the term in square brackets in Eq. (1). 

	There have been suggestions that effective radon exposure,

r-effective, may not be the same as the measured radon level in the home,

r-measured; for example, time spent in the home may vary, or exposures

outside the home may be important. We represent this as

		r-effective = (1+f) r-measured

This can make a difference if there is a strong correlation between f and

r. This problem was studied by the methods outlined above, testing various

correlations and their plausibility (Ref. 9), and it was found not to be

important. This can be easily understood as follows. A positive

correlation stretches out the abscissa in Fig. 1, which can reduce the

B-value, but not change it from negative to positive as required to

explain our discrepancy. A negative correlation contracts the abscissa,

thus increasing the negative value of B. Small correlations only spread

the data more, without changing B. Thus this approach can do little for

explaining our discrepancy.

	Another issue is the effect of confounding by combinations of CF.

This is a complex issue, but it is treated in Ref. 6, where it is

concluded 

that combinations of the 54 socioeconomic factors used there can account

for only about 10% of our discrepancy.











G. Conclusion



	All potential confounding factors that we or others have suggested

have been shown to be incapable of explaining our discrepancy. In view of

the fact that there is no other evidence supporting the theory in this low

dose region, this may be interpreted to mean that the linear-no threshold

theory fails the experimental test. However, further suggestions for

explaining our discrepancy are always welcome and will be carefully

analyzed.











Table 1: Treatment of County Population Density (PD) as a confounding

factor by the stratification method. Results are for B using single

regression of M on r as in Eq. (1), and double regression of M on r and

PD, fitting the data to

		M = A + B r + E PD

where E is a fitting parameter. 



County Rank    PD range          Single Regression      Double Regression
   by PD       (x100/sq.mi)      B-male    B-female     B-male    B-female
___________    ____________      ______    ________     ______    ________
   1 - 160     0.003-0.094        -3.7       -6.6        -3.7       -6.4
 161 - 320     0.095-0.22         -8.0       -7.8        -8.0       -7.9
 321 - 480     0.22-0.35          -7.0       -8.5        -7.0       -8.5
 481 - 640     0.35-0.50          -6.4       -9.7        -6.4       -9.8
 641 - 800     0.50-0.67          -8.9       -8.7        -8.9       -8.7
 801 - 960     0.67-0.92          -4.3       -4.4        -4.3       -4.4
 961 - 1120    0.93-1.29          -9.2       -6.0        -9.3       -6.0
1121 - 1280    1.30-2.05          -5.9       -8.1        -5.9       -8.1
1281 - 1440    2.05-4.11          -0.5       -2.7        -0.5       -2.8
1441 - 1601    4.12-671.8         -4.5       -7.4        -3.9       -6.2
               ____________      ______    ________     ______    ________
               AVERAGE            -5.8       -7.0        -5.8       -6.9

	 







Table 2: B values obtained if a confounding factor, X, has various

correlations by ranking with M and r, CoRR(X,M) and CoRR(X,r). The three

sets of results are for three different R-random.



CoRR(X,r)  CoRR(X,M)    B     |  CoRR(X,r)  CoRR(X,M)    B     |  CoRR(X,r)  CoRR(X,M)    B
  0.09       0.09      -7.2   |    0.07       0.12      -7.2   |    0.08       0.08      -7.2
  0.18       0.18      -7.0   |    0.16       0.21      -6.9   |    0.16       0.17      -7.0
  0.34       0.34      -5.2   |    0.32       0.37      -5.3   |    0.33       0.33      -5.1
  0.53       0.53      -2.1   |    0.51       0.55      -2.3   |    0.52       0.52      -2.1
  0.69       0.69      +3.2   |    0.68       0.70      +2.9   |    0.69       0.69      +3.3
  0.79       0.79     +10.5   |    0.78       0.79     +10.4   |    0.79       0.79     +10.8
  0.81       0.81     +13.7   |    0.81       0.82     +13.7   |    0.81       0.81     +13.8





Table 3: B-values obtained if smoking prevalence, S, has various

Corr(S,r), assuming the maximum plausible width for the S-distribution.

The three sets of results are for three different S-random.

  

Corr(S,r)     B     |  Corr(S,r)     B     |  Corr(S,r)     B
  -0.17     -7.1    |    -0.24     -5.5    |    -0.23     -6.0
  -0.33     -4.7    |    -0.39     -3.2    |    -0.38     -3.7
  -0.41     -3.5    |    -0.47     -2.0    |    -0.45     -2.6
  -0.49     -2.3    |    -0.54     -0.9    |    -0.53     -1.4
  -0.57     -1.1    |    -0.62     +0.3    |    -0.60     -0.2
  -0.65     +0.1    |    -0.68     +1.4    |    -0.68     +0.9
  -0.78     +2.7    |    -0.81     +3.8    |    -0.80     +3.3
  -0.88     +5.5    |    -0.89     +6.4    |    -0.88     +6.0
  -0.93     +8.6    |    -0.93     +9.3    |    -0.93     +8.9

















Table 4: Effects of difference in radon exposure for smokers and

non-smokers, with x = smoker/non-smoker exposures in each county. Table gives

value of B for various choices of the distribution of x-values and

Corr(x,r).



x-average    SD/mean of x    Corr(x,r)      B
_________    ____________    _________    ____
   0.8           0.57           1.0       -4.9
   0.9           0.57           1.0       -4.8
   1.0           0.57           1.0       -4.7
   1.2           0.57           1.0       -4.5
   1.5           0.57           1.0       -4.3
   0.9           0.57           0         -6.5
   0.9           0.57           0.4       -5.9
   0.9           0.57           0.7       -5.5
   0.9           0.57           1.0       -4.8
   0.9           0.28           0         -7.3
   0.9           0.28           0.4       -6.7
   0.9           0.28           1.0       -5.6







CAPTION FOR FIGURE



Fig. 1: Lung cancer mortality rates corrected for smoking prevalence with

the bracketed term in Eq. (1), vs average radon levels in homes, for 1601

U.S. counties. Data points shown are the average of ordinates for all

counties within the range of r-values shown on the base-line of the upper

left figure; the number of counties within that range is also shown there.

Error bars are one standard deviation of the mean, and the first and third

quartiles of the distributions are also shown. Theory lines are

arbitrarily normalized lines increasing at a rate of +7.3% per pCi/L. The

left and right figures are calculated with lung cancer mortality rates for

1970-1979 and 1979-1994 respectively. These figures are used only for

presentation; all analyses, including the straight line fit to the data

shown here, use the 1601 actual data points.





REFERENCES



1. Greenland, S. and Robins, J. Ecologic studies -- biases,
misconceptions, and counterexamples. Am. J. Epidemiol. 139:747-760;1994

2. Stidley, C.A. and Samet, J.M. Assessment of ecologic regression in the
study of lung cancer and indoor radon. Am. J. Epidemiol. 139:312-322;1994

3. Morgenstern, H. Ecologic studies in epidemiology: concepts, principles,
and methods. Annual Rev. Public Health 16:61-81;1995

4. Lubin, J.H. On the discrepancy between epidemiologic studies in
individuals of lung cancer and residential radon and Cohen's ecologic
regression. Health Phys. 75:4-10;1998

5. Lagarde, F. and Pershagen, G. Parallel analyses of individual and
ecologic data on residential radon, cofactors, and lung cancer in
Sweden. Am. J. Epidemiol. 149:268-274;1999

6. Cohen, B.L. Test of the linear-no threshold theory of radiation
carcinogenesis for inhaled radon decay products. Health Phys.
68:157-174;1995

7. National Academy of Sciences Committee on Biological Effects of Low
Level Radiation: Health risks of radon and other internally deposited
alpha emitters (BEIR-IV). National Academy Press, Washington, DC;
1988

8. Cohen, B.L. Updates and extensions to tests of the linear-no threshold
theory. Technology 7:657-672;2000

9. Cohen, B.L. Response to criticisms of Smith et al. Health Phys.
75:23-28;1998

10. Cohen, B.L. Response to Lubin's proposed explanations of our
discrepancy. Health Phys. 75:18-22;1998

11. Cohen, B.L. Variation of radon levels in U.S. homes correlated with
house characteristics, location, and socioeconomic factors. Health
Phys. 60:631-642;1991

12. National Academy of Sciences Committee on Health Effects of Ionizing
Radiation: Health effects of exposure to radon (BEIR-VI). National
Academy Press, Washington, DC; 1999

13. Cohen, B.L. Testing a BEIR-VI suggestion for explaining the lung
cancer vs radon relationship for U.S. counties. Health Phys.
78:522-527;2000






