The confounding effect of cryptic relatedness for environmental risks of systolic blood pressure on cohort studies

The impact of cryptic relatedness (CR) on genomic association studies is well studied and known to inflate false-positive rates as reported by several groups. In contrast, conventional epidemiological studies for environmental risks, the confounding effect of CR is still uninvestigated. In this study, we investigated the confounding effect of unadjusted CR among a rural cohort in the relationship between environmental risk factors (body mass index, smoking status, alcohol consumption) and systolic blood pressure. We applied the methods of population-based whole-genome association studies for the analysis of the genome-wide single nucleotide polymorphism data in 1622 subjects, and detected 20.2% CR in this cohort population. In the case of the sample size, approximately 1000, the ratio of CR to the population was 20.2%, the population prevalence 25%, the prevalence in the CR 26%, heritability for liability 14.3% and prevalence in the subpopulation without CR 26%, the difference of estimated regression coefficient between samples with and without CR was not significant (P-value = 0.55). On the other hand, in another case with approximately >20% heritability for liability, we showed that confounding due to CR biased the estimation of exposure effects.


Introduction
Cryptic relatedness (CR) is well known as a confounding factor in genome wide association study (GWAS) (Yu et al. 2006;Kang et al. 2008;Price et al. 2010), which inflates the false-positive rate. Voight and Pritchard (2005) developed a formal model of CR and studied its impact on genomic case-control association studies. They showed that the degree of confounding due to CR would usually be negligible. However, in contrast, they also reported on studies with sampling biases toward collecting relatives may indeed suffer from excessive rates of false positives. Typically, epidemiological designs in which individuals are ascertained nonrandomly from a closed population, the effect of the influence of close relatives might not be negligible. It is therefore important to correct or account for the confounding effect of CR in epidemiological cohort studies that have collected data from a limited or smallsized sample. However, the knowledge of the confounding effect of CR in epidemiological association studies is still unknown. Here, we examined the confounding effect of CR in the relationship between systolic blood pressure (SBP) and several environmental risk factors (body mass index [BMI], smoking status [daily smoker vs. nonsmoking], and alcohol consumption [drinking vs. abstention]).
It is also well known that being overweight and obese increases the risk of high blood pressure (Kannel 1967;World Health Organization 2000). However, interpreting the blood pressure-BMI relationship is further complicated by data from other studies, in which there appears to be no correlation between these variables (Roche and Siervogel 1991;Spiegelman et al. 1992;Bunker et al. 1995;Gallagher et al. 1996). In this article, we examined whether the confounding effect of CR might involve in the relationship between blood pressure and BMI. Additional goals of the study were to assess the confounding effect of CR in any potential relationship between blood pressure and the risk of other environmental factors; for example, smoking and alcohol consumption. There are several studies that examined the relationship between alcohol consumption or smoking and blood pressure in a Japanese population (Kiyohara et al. 1995;Minami et al. 2002;Ohmori et al. 2002). Ohira et al. (2009) looked into the effect of habitual alcohol intake on ambulatory blood pressure among Japanese men, which was associated with increased BP in the morning. Minami et al.(1999) studied the effects of smoking cessation on blood pressure in habitual smokers. However, there were no studies which examined the confounding effect of CR in the relationship between blood pressure and the risk of environmental factors. We aim to address the question of whether CR is likely to be a serious issue for inferring epidemiological relationship between these factors using the cohort study of Takahata residents. First, using the techniques to detect and correct for unrecognized population structure in GWAS, we examined how CR was presented in the sampling. Next, we tested the assumption of parallel regressions to examine whether its confounding effect as a covariate affected on environmental risk factors in difference setting (sample size, ratio of CR to the population, prevalence in CR, the population prevalence). Then, we applied multiple regression analysis to these data with and without CR in order to examine the differences obtained in estimating the regression coefficient.

Analysis of real data
We used the genome-wide 657,366 single nucleotide polymorphism (SNP) data and SBP as a phenotype in the cohort study of Takahata. We selected BMI, variables for smoking status (1: nonsmoking vs. 2: daily smoker), and alcohol consumption (1: abstention vs. 2: drinking) as environmental risk factors, and gender (1: male; 2: female) and age as covariates. Weight, height, and SBP were measured and standardized in the Takahata cohort design. We examined the relationship between individuals genetic background by the PLINK (Purcell et al. 2007) and the EIGENSTRAT methods (Price et al. 2006). We detected relationships between subjects using identity by descent (IBD) probability as a measurement.

Multiple regression analyses
We sampled our cohort population in a difference setting; sample size (approximately 1000, 400, or 500), the ratio of CR to the population (approximately 20%, 40%, or 50%), the prevalence in the subpopulation without CR (26%, 50%, or 76%), the prevalence in CR (26% fixed), the population prevalence (25%, 40%, or 50%), and heritability for liability (approximately 14%, 22%, or 32%) in our cohort population. Here, we estimated heritability for liability using the formula {(x p À x q )/a p }/q, where q denotes the expected proportion of alleles shared IBD (i.e., q = 2 ÀR , where R denotes the degree of relationship), x p denotes the difference between mean value in the subpopulation without CR and threshold, x q denotes the difference between mean value in the subpopulation without CR and mean value in CR, and a p denotes the difference between population mean and mean value in the group of affected individuals (Yasuda 2007). In our context, q can be defined as the sum of expected proportion of allele shared IBD for all of the degree of relationship in our detected CR (i.e., R = 0, 1, 2). First, we tested the assumption of parallel regressions to examine whether the confounding effect of CR as a covariate affected on environmental risk factors; that is, the following test was performed, y k ¼ a k þ b nk x n ðk ¼ 1; 2; n ¼ 1; 2. . .; 6Þ where k is the number of population (i.e., population with and without CR), and n is the number of independent variables and covariates. Then, we applied multiple regression analysis to these data with and without CR in order to examine the differences obtained in estimating the regression coefficient.

Cohort description
We performed an analysis of the data collected within a closed, small prospective study, concerned with various risk factors for common diseases.

Subject recruitment
The Takahata cohort was established for a baseline survey in a small rural town, Takahata in Yamagata Prefecture from 2006 to 2008. The total population size has been constant, approximately 25,000 throughout this period. The Takahata cohort has become part of our large genomic cohort initiative, the Yamagata cohort, which is now ongoing in the urban prefectural capital, Yamagata City, having approximately 250,000 residents. We used genomic DNAs from 1622 individuals who completed the questionnaire for environmental exposures and informed consent for our modern prospective genomic cohort study. This cohort study was performed under the approval by the Committee on Ethics at Yamagata University and all other institutions involved.

Genotyping
Using genomic DNAs from the Takahata population, we carried out genotyping for 657,366 SNPs using the Infinium Assay with Human660W-Quad BeadChip (Illumina, San Diego, CA) according to the standard procedure provided by Illumina. Figure 1 shows the relationship between total 1622 subjects with an IBD probability with regard to an identity by distance. First step, we removed individuals for low genotyping (P 0.05) from total 1622 sample using the PLINK method. By the PLINK and the EIGENSTRAT methods, we detected a relationship between subjects with an IBD probability >1/4 as a CR of 326 subjects (i.e., monozygotic twins, dizygotic twins, full-sibs, parent-offspring, half-siblings, grandparent, grandchild, aunt/uncle, and niece/nephew) in the sample of 1617 subjects. Figure 2 shows the relationship between 1291 subjects with an IBD probability with regard to an identity by distance after removed a CR of 326 subjects. Next, we removed subjects medically treated for blood pressure from samples with and without CR, respectively. Using the sample with and without CR, the sample sizes were 1039 and 829 individuals, respectively. We analyzed the data as described above. In this case, the heritability for liability was 14.3%. In the multiple regression analysis, the regression model found from the sampling data with CR was as follows:  . Plot of the relationship between 1291 subjects with an identity by descent (IBD) probability with regard to an identity by state (IBS) distance after which is removed a cryptic relatedness of 326 subjects with an IBD probability >1/4 (i.e., monozygotic twins, dizygotic twins, full-sibs, parent-offspring, half-siblings, grandparent, grandchild, aunt/uncle, and niece/nephew); y-axis and x-axis describe IBD probability and IBS distance, respectively.

Results
where "Systolic" denotes systolic blood pressure and * indicates significant, P-values <0.05. Note that adjusted R-squared for equation (1) was 0.14. The regression model found from the sampling data without CR was as follows: Note that adjusted R-squared for equation (2) was 0.15. Compared with equations (1) and (2), the results showed that the presence of CR is apparently not affected by estimating regression coefficient in regression modeling (Table 1). We tested the assumption of parallel regressions for equations (1) and (2). The difference between equations (1) and (2) was not significant (F-value = 0.35, P-value = 0.55).
Furthermore, as described in Methods, we analyzed under the following difference setting, conditional on the fixed prevalence in CR of 26%. The regression model found from the sampling data in the sample size of 400, the ratio of CR to the population 52.5%, prevalence in the population 40%, prevalence in the subpopulation without CR 55%, prevalence in CR 26%, the heritability for liability 24.2% was as follows: Systolic ¼ 141:04 Ã À 0:12 Â BMI À 6:11 Ã Â Alcohol þ 5:34 Ã Â Smoking À 0:04 Â Age À 1:24 Â Gender (3) Note that adjusted R-squared for equation (3) was 0.25. In contrast, the regression model found from the sampling without CR (sample size 190) was as follows: Note that adjusted R-squared for equation (4) was 0.16. Compared with equations (3) and (4), the results showed that the presence of CR is apparently affected by estimating regression coefficient in regression modeling (Table 2). By testing the assumption of parallel regressions for equations (3) and (4), the difference between equations (3) and (4) was significant (F-value = 41.83, Pvalue = 2.103e À10 ). The regression model found based on the sampling data in the sample size of 400, the ratio of CR to the population 52.5%, prevalence in the population 50%, prevalence in the subpopulation without CR 76.3%, prevalence in CR 26%, the heritability for liability 32.4% was as follows: Note that adjusted R-squared for equation (5) was 0.24. In contrast, the regression model found from the sampling without CR (sample size 190) was as follows: Note that adjusted R-squared for equation (6) was 0.42. Compared with equations (5) and (6), the results showed that the presence of CR is apparently affected by estimating regression coefficient in regression modeling (Table 3). By testing the assumption of parallel regressions for equations (5) and (6), the difference between equations (5) and (6) was significant (F-value = 39.74, Pvalue = 5.720e À10 ). The regression model found from the sampling data in the sample size of 500, the ratio of CR to the population 42%, prevalence in the population 40%, prevalence in the subpopulation without CR 50%, prevalence in CR 26%, the heritability for liability 22.1% was as follows: Systolic ¼ 137:18 Ã þ 0:04 Â BMI À 9:47 Ã Â Alcohol þ 3:47 Ã Â Smoking þ 0:10 Â Age À 0:95 Â Gender (7) Note that adjusted R-squared for equation (7) was 0.19. In contrast, the regression model found from the sampling without CR (sample size 290) was as follows: Systolic ¼ 141:55 Ã þ 0:09 Â BMI À 5:34 Ã Â Alcohol À 2:30 Â Smoking þ 0:11 Ã Â Age þ 1:10 Â Gender Note that adjusted R-squared for equation (8) was 0.16. Compared with equations (7) and (8), the results showed that the presence of CR is apparently affected by estimating regression coefficient in regression modeling (Table 4). By testing the assumption of parallel regressions for equations (7) and (8), the difference between equations (7) and (8) was significant (F-value = 24.96 P-value = 7.219e À07 ).
The regression model found from the sampling data in the sample size of 500, the ratio of CR to the population 42%, prevalence in the population 50%, prevalence in the subpopulation without CR 67.2%, prevalence in CR 26%, the heritability for liability 31.7% was as follows: Note that size of sample with CR was 500 subjects and adjusted R-squared for equation (9) Note that adjusted R-squared for equation (10) was 0.24. Compared with equations (9) and (10), the results showed that the presence of CR is apparently affected by estimating regression coefficient in regression modeling (Table 5). By testing the assumption of parallel regres-  sions for equations (9) and (10), the difference between equations (9) and (10) was significant (F-value = 35.46 P-value = 3.915e À09 ).

Discussion
We detected 20.2 % CR of the sample in Takahata cohort study. In our multiple regression models using sample size (N = 1000), there is no significant difference of regression coefficients in the sample with and without CR. In contrast, in the case that the population prevalence of SBP 40-50%, the prevalence in the subpopulation without CR 50-76%, the ratio of CR to the population 42-52%, sample size 400-500 and the heritability for liability 22-32%, the confounding effect of CR in the relationship between SBP and environmental risk factors is not negligible. In general, confounding is a major concern in causal studies because it results in biased estimation of exposure effects.
In this respect, our study showed that confounding due to CR biased the estimation of exposure effects in the case of the heritability for liability by approximately >20%. On the other hand, although the number of predictors in the models were included enough (i.e., using independent variables for BMI, alcohol consumption, and smoking status, which were significantly correlated with blood pressure in several research groups (Kannel 1967;Minami et al. 1999;World Health Organization 2000;Ohira et al. 2009), adjusted R-square values of our regression equation models were not high. A possible explanation of low adjusted R-square values for our models is that other independent variables due to genetic factors might contribute to the SBP phenotype. Genetic factors that confer susceptibility to hypertension were identified in several populations (Jeunemaitre et al. 1992;Hata et al. 1994;Lifton 1996;Cusi et al. 1997). Theoretically, Fisher (1918) indicated that the impact of the effect on the phenotype was evaluated by comparing variances. Falconer and Mackay (1996) showed how the phenotypic variance can be partitioned into causal components of variance using the equation V P = V G + V E . In this sense, V P is the total phenotypic variance, V G is the total genetic variance consistent with the additive variance (V A ), the dominance variance (V D ), the interaction variance (V I ) and V E as the Size of sample without CR was 290 subjects. From the sampling data without CR, equation (8) in Results were found. Adjusted R-squared = 0.16. environmental variance consistent with the special environmental variance (V Es ) that refers to the within individual variance arising from localized circumstances, and general environmental variance V Eg refers to the environmental variance contributing to between-individual component in origin. Note that the ratio V G = V P is the heritability of the character. Moreover, Falconer and Mackay (1996) revealed the existence of two coefficients for genetic variances; one is the coefficient r of the additive variance (V A ) which called the coefficient of relationship between the relatives in question, and the other is the coefficient u of the dominance variance (V D ) which represents the probability of the relatives having the same genotype through IBD. Using these two coefficients, the total genetic variance is given by where cov AD is the covariance of breeding values with dominance deviations (Falconer and Mackay 1996).
According to this mathematical model, it is easily understood how factors associated with genetically close relatives in a sample of ostensibly unrelated individuals contribute to the effect of phenotype; that is, the phenotype is composed of both environmental and genetic elements that contribute to the relationship between relatives. Thus, some of the differences in the estimates of regression coefficients might be because of the adjustment strategies for concomitant confounding effect of CR. Rotimi et al. (1999) examined a familial pattern of blood pressure in a population of Nigerian families and clarified that heritability of <50% for both SBP and diastolic blood pressure (DBP) reinforced the importance of the nonshared familial environmental effects. Thus, one of the approaches to select the best model for the response variable using collected cohort data from a limited or small-sized sample is that the heritability of the blood pressure phenotype might be worth considering. Historically, the most common statistical approach for dealing with confounding in epidemiology was based on stratification. Typically, given the importance of confounding in epidemiology, statistical methods recommend the removal of significantly confounding samples. However, the resulting removal of samples with confounding factors, the sample size is reduced. As with another possible approach, we are now extensively analyzing this issue by a method incorporating principal components of a large subset of GWAS SNPs as regression covariates. This approach does not waste resources; that is we can use the entire sample. There are some similarities between the approach by Price et al. (2006) and our method; however, in contrast we examine principal components of a large subset of GWAS SNPs to adjust the confounding effect of CR. On the other hand, they examined them only to adjust the population structure. Generally, principal components typically reflect genome-wide factors attributable to the demographic history of the populations studied (Price et al. 2006). In this respect, it still remains to be clarified whether principal components reflect genome-wide factors attributable to CR. We are now extensively analyzing under what condition such approach is plausible.
In conclusion, we found a confounding effect of CR in the relationship between SBP and environmental risk factors was not negligible. In our study, we showed that heritability for liability might reflect on the estimation of regression coefficients between SBP and environmental risk factors, because they vary with environmental risk factors that differ across some unsuspected relatedness. For the genetic case-control studies, test statistics are generally inflated relative to the expectation under the association of an independent sample and without genetic association to the disease. These false positives often are attributed to CR (Devlin and Roeder 1999). Thus, more or less in any other epidemiological investigations that were performed previously, a true effect might be hidden due to confounding arising from CR. In our study, we presented a simply modeling to illustrate the effect of CR on the estimation of coefficients. The size of the CR would have a big impact on the precision of the resulting estimates of coefficients. We are now extensively analyzing this issue in different settings. Various statistical methods have been proposed to take into account confounding factors such as linear mixed-effect models (Demidenko 2004) or methods that adjust data based on a principal components analysis (Price et al. 2006). Sturmer et al. (2005) proposed a method of adjusting for multiple unmeasured confounders in a cohort study. The amount of residual confounding due to unmeasured and poorly measured covariates was important enough to qualitatively change the association between NSAID (nonsteroidal anti-inflammatory drug) use and mortality (Sturmer et al. 2005). After data collection, using these techniques in an epidemiological association study, it might be important to adjust the cryptic relative pairs based on genetic data in the relationship between environmental risk factors and phenotypes.