A gene based approach to test genetic association based on an optimally weighted combination of multiple traits

Jianjun Zhang; Qiuying Sha; Guanfu Liu; Xuexia Wang

doi:10.1371/journal.pone.0220914

Abstract

There is increasing evidence showing that pleiotropy is a widespread phenomenon in complex diseases for which multiple correlated traits are often measured. Joint analysis of multiple traits could increase statistical power by aggregating multiple weak effects. Existing methods for multiple trait association tests usually study each of the multiple traits separately and then combine the univariate test statistics or combine p-values of the univariate tests for identifying disease associated genetic variants. However, ignoring correlation between phenotypes may cause power loss. Additionally, the genetic variants in one gene (including common and rare variants) are often viewed as a whole that affects the underlying disease since the basic functional unit of inheritance is a gene rather than a genetic variant. Thus, results from gene level association tests can be more readily integrated with downstream functional and pathogenic investigation, whereas many existing methods for multiple trait association tests only focus on testing a single common variant rather than a gene. In this article, we propose a statistical method by Testing an Optimally Weighted Combination of Multiple traits (TOW-CM) to test the association between multiple traits and multiple variants in a genomic region (a gene or pathway). We investigate the performance of the proposed method through extensive simulation studies. Our simulation studies show that the proposed method has correct type I error rates and is either the most powerful test or comparable with the most powerful tests. Additionally, we illustrate the usefulness of TOW-CM based on a COPDGene study.

Citation: Zhang J, Sha Q, Liu G, Wang X (2019) A gene based approach to test genetic association based on an optimally weighted combination of multiple traits. PLoS ONE 14(8): e0220914. https://doi.org/10.1371/journal.pone.0220914

Editor: Heming Wang, Brigham and Women’s Hospital and Harvard Medical School, UNITED STATES

Received: April 4, 2019; Accepted: July 25, 2019; Published: August 9, 2019

Copyright: © 2019 Zhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data underlying the results presented in the study are available from COPDGene study (phs000179/HMB and 257 phs000179 /DS-CS-RD).

Funding: Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R15HG008209 to QS. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. X Wang was supported by the University of North Texas Foundation which was contributed by Dr. Linda Truitt Creagh. The content is solely the responsibility of the authors and does not necessarily represent the views of the University of North Texas Foundation and Dr. Linda Truitt Creagh. The Genetic Analysis workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. Preparation of the Genetic Analysis Workshop 17 Simulated Exome Data Set was supported in part by NIH R01 MH059490 and used sequencing data from the 1000 Genomes Project (www.1000genomes.org). This research used data generated by the COPDGene study, which was supported by NIH grants U01HL089856 and U01HL089897. The COPDGene project is also supported by the COPD Foundation through contributions made by an Industry Advisory Board comprised of Pfizer, AstraZeneca, Boehringer Ingelheim, Novartis, and Sunovion. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist. X Wang was supported by the University of North Texas Foundation which was contributed by Dr. Linda Truitt Creagh. The content is solely the responsibility of the authors and does not necessarily represent the views of the University of North Texas Foundation and Dr. Linda Truitt Creagh. This does not alter our adherence to PLOS ONE policies on sharing data and materials. Q Sha was supported by the National Human Genome Research Institute of the National Institutes of Health under Award Number R15HG008209. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This does not alter our adherence to PLOS ONE policies on sharing data and materials. The Genetic Analysis workshops are supported by NIH grant R01 GM031575 from the National nstitute of General Medical Sciences. Preparation of the Genetic Analysis Workshop 17 Simulated Exome Data Set was supported in part by NIH R01 MH059490 and used sequencing data from the 1000 Genomes Project (www.1000genomes.org). This research used data generated by the COPDGene study, which was supported by NIH grants U01HL089856 and U01HL089897. The COPDGene project is also supported by the COPD Foundation through contributions made by an Industry Advisory Board comprised of Pfizer, AstraZeneca, Boehringer Ingelheim, Novartis, and Sunovion. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

Introduction

Complex diseases are often characterized by many correlated phenotypes which can better reflect their underlying mechanism. For example, hypertension can be characterized by systolic and diastolic blood pressure [1]; metabolic syndrome is evaluated by four component traits: high-density lipoprotein (HDL) cholesterol, plasma glucose and Type 2 diabetes, abdominal obesity, and diastolic blood pressure [2]; and a person’s cognitive ability is usually measured by tests in domains including memory, intelligence, language, executive function, and visual-spatial function [3]. Also, more and more large cohort studies have collected or are collecting a broad array of correlated phenotypes to reveal the genetic components of many complex human diseases. Therefore, by jointly analyzing these correlated traits, we can not only gain more power by aggregating multiple weak effects, but also understand the genetic architecture of the disease of interest [4].

Even though genome-wide association studies (GWASs) have been remarkably successful in identifying genetic variants associated with complex traits and diseases, the majority of the identified genetic variants only explain a small fraction of total heritability [5]. Furthuer, a gene is the basic functional unit of inheritance whereas the GWAS are primarily focused on the paradigm of single common variant. However, most published GWASs only analyzed each individual phenotype separately, although results on related phenotypes may be reported together. Large-scale GWAS of complex traits have consistently demonstrated that, with few exceptions, common variants have moderate-to-small effects. Therefore, it is important to identify appropriate methods that fully utilize information in multivariate phenotypes to detect novel genes in genetic association studies.

In GWAS, several methods have been developed for multivariate phenotypes association analysis [3] to test association between multivariate continuous phenotypes and a single common variant. To our knowledge, current multivariate phenotypes association methods can be roughly classified into two categories: univariate analysis and multivariate analysis. Univariate analysis methods perform an association test for each trait individually and then combine the univariate test statistics or combine the p-values of the univariate tests [6–9]. Even though such methods are computationally efficient, they neglect the omnipresent correlation between individual phenotypes and may reduce the power compared to multivariate analysis. Multivariate analysis methods jointly analyze more than one phenotype in a unified framework and test for the association between multiple phenotypes and genetic variants. Multivariate analysis methods include multivariate analysis of variance (MANOVA) [10], linear mixed effect models (LMM) [11], and generalized estimating equations (GEE) [12]. Another special approach is to consider reducing the dimension of the multivariate phenotypes by using dimension reduction techniques. The common method for dimensionality reduction is principal component analysis (PCA) [13] which essentially finds the combination of these phenotypes and assumes that the transformed phenotypes are independent. The limitation of this method is that it can not properly account for the variation of phenotypes or genotypes. It is also hard to interpret the meaning of principle components of the multivariate phenotypes, especially in practice.

Recent studies show that complex diseases are caused by both common and rare variants [14–20]. Gene-based analysis requires statistical methods that are fundamentally different from association statistics used for testing common variants. It is essential to develop a novel statistical method to test the association between multiple traits and multiple variants (common and/or rare variants). In this article, we develop a statistical method to test the association between multiple traits and genetic variants (rare and/or common) in a genomic region by Testing the association between an Optimally Weighted combination of Multiple traits (TOW-CM) and the genomic region. TOW-CM is based on the score test under a linear model, in which the weighted combination of phenotypes is obtained by maximizing the score test statistic over weights. The weights at which the score test statistic reaches its maximum are called the optimal weights. We also use extensive simulation studies to compare the performance of TOW-CM with MANOVA [10], multi-trait sequence kernel association test MSKAT [21] and minimum p-value [22]. Simulation studies demonstrate that, in all the simulation scenarios, TOW-CM is either the most powerful test or comparable to the most powerful test among the four tests. We also illustrate the usefulness of TOW-CM by analyzing a real COPDGene study.

Methods

We consider a sample with n unrelated individuals. Each individual has K (potentially correlated) traits and has been genotyped at M variants in a considered region (a gene or a pathway). Denote y_ik as the k^th trait value of the i^th individual and x_im as the genotype score in additive coding of the i^th individual at the m^th variant. Let Y = (Y₁, ⋯, Y_K) denote the random vector of K traits and X = (X₁, ⋯, X_M) denote the random variable of the genotype score at M variants for these n individuals where Y_k = (y_1k, ⋯, y_nk)^T and X_m = (x_1m, ⋯, x_nm)^T. Consider a linear combination of Y denoted as , where w = (w₁, ⋯, w_K)^T.

We model the relationship between the combination of multiple continuous traits with the M genetic variants in the considered region using the linear model (1) where β₀ is the intercept and β = (β₁, ⋯, β_M)^T is the corresponding vector of coefficients. To test the association between the combination of the multiple traits and the M genetic variants is equivalent to test the null hypothesis H₀: β = 0 under Eq (1). We use the score test statistic to test H₀: β = 0 under Eq (1). Let and then the test statistic is: (2) where U = (PX)′ PYw and . The score test can be rewritten as a function of w: (3) where P = P′ and PP′ = P. We propose to maximize S(w) to get the optimal weight and then define the statistic to evaluate the association between the optimally weighted combination of the target traits and test genetic variants.

When D = Y′ PY is positive definite, maximizing S(w) is equivalent to maximizing (4) where L is the lower triangular matrix obtained from the Cholesky decomposition of D = LL^T. However, the matrix of D is usually not full rank because of existing correlation between multiple traits. If the matrix D is semi-positive define matrix, we introduce a ridge parameter λ₀, for which we suggest the choice , where n is the number of individuals in the testing data, and modify the adjustment to mitigate the effect of the non-positive matrix D in order to avoid the instability: D = Y′ PY + λ₀I. Let C = L⁻¹ Y′ PX(X′ PX)⁻¹ X′ PYL^−T and c be the eigenvector corresponding to the largest eigenvalue of the matrix C, then S(w) is maximized when L′(w) equals c. Hence Eq (4) is maximized when w^o = L^−T c. In a special case, if all the traits we consider are independent and M = 1, we can get an analytical weight referred to [22]: (5) for the k^th phenotype, k = 1, 2, 3, …, K. The Eq (5) is equivalent to where the numerator is the correlation coefficient between the k^th phenotype Y_k and the genotypic variant X and the denominator can be viewed as the variance of the k^th phenotype Y_k. It means that w_k has same direction with the correlation between the phenotype Y_k and the genotypic variant X, and puts big weight to the k^th trait when it has strong association with the genotypic variant and/or it has low variance.

We define the statistic to test an optimally weighted combination of multiple traits (TOW-CM), , as (6)

We use permutation methods to evaluate P-values of T. The TOW-CM method can also be extended to incorporate covariates. Suppose that there are p covariates. Let z_il denote l^th covariate of the i^th individual. We adjust both trait value y_ik and genotypic score x_im for the covariates by applying linear regressions. That is,

Let and denote the residuals of y_ik and x_im, respectively. We incorporate the covariate effects in TOW-CM by replacing y_ik and x_im in Eq (6) by and . With covariates, the statistic of TOW-CM is defined as:

Comparison of tests

We compared the performance of our method (TOW-CM) with the following methods: 1) Multivariate Analysis of Variance (MANOVA) [10]; 2) Multi-trait Sequence Kernel Association Test (MSKAT) [21]; 3) Minimum p-value based on the p-values of the individual trait TOW [22] (denoted as minP).

Simulation

In simulation studies, we use the empirical Mini-Exome genotype data including genotypes of 697 unrelated individuals on 3205 genes obtained from Genetic Analysis Workshop 17 (GAW17). Two differen type of variants (Common variants: minor allele frequency (MAF)>0.05 and Rare variants: MAF<0.05) are chosen from a super gene (Sgene) including four genes: ELAVL4 (gene1), MSH4 (gene2), PDE4B (gene3), and ADAMTS4 (gene4). The pattern of the allele frequency distribution of the Sgene is similar as the 3205 genes’ [22]. In our simulation studies, we generate genotypes based on the genotypes of 697 individuals in these four genes. The genotypes are extracted from the sequence alignment files provided by the 1,000 Genomes Project for their pilot3 study (http://www.1000genomes.org). To generate the genotype of an individual, we generate two haplotypes according to the haplotype frequencies.

We test K = 4 related traits with a compound-symmetry correlation matrix and consider two covariates: a standard normal covariate z₁ and a binary covariate z₂ with P(z₂ = 1) = 0.5. We generate trait values based on genotypes by using the following models: where ϵ = (ϵ₁, ϵ₂, ϵ₃, ϵ₄) is zero-mean normal with variances 1 and correlation ρ. We set the magnitude of correlation |ρ| to 0.2, 0.5, and 0.8, and the signs of symmetric location of covariate matrix are randomly chosen from (-1,1). η = (η₁, η₂, η₃, η₄) are contributions from a set of genotypic variants, which are simulated as follows.

For type I error, phenotypes are generated under the null model i.e. η = 0. To evaluate power, we randomly choose one common variant and n_c (20%) rare variants as casual variants. We assume that all the n_c rare causal variants have the same heritability and the heritability of the common causal variant is twice of the heritability of rare causal variants. That is, we model the genotypic variants’ contribution to disease risk as where x_c and x_j denote the common variant and rare variant, respectively. β_c and β_kj represent the corresponding effect size. Let h and h_k denote the heritability of all the causal variants for all the K traits and for the k^th trait, respectively. We generate K random numbers t₁, ⋯, t_K from a uniform distribution between 0 and 1, and the heritability of k^th trait denotes . For the k^th trait, we assign the effect size of common variants (7) and the magnitude of the effect of rare variants (8) where R denotes the ratio of the heritability of rare causal variants to the heritability of the common causal variant.

For power comparisons, we conducted simulations under the four scenarios: each time only the first L traits are associated with the set of variants, L = 1, 2, 3, 4, respectively. Intuitively, in the first scenario (L = 1), when only the first trait is associated with the variants set, the minP method (it equals to test the first trait alone) may have good performance. However, we will show that by simultaneously testing correlated null traits, our proposed method (TOW-CM) could actually improve the detection power compared to test the first trait alone. When there are multiple correlated traits that are associated with the rare variants set, the proposed TOW-CM would offer vastly improved detection power than the minimum p-value based approach. In each scenario, we also consider different percentage of risk variants for rare variants.

Simulation results

Table 1 summarizes the estimated type I error rates of our method TOW-CM with other three comparable methods under different significance levels and different magnitude of trait correlation |ρ|. The type I error rates are evaluated using 10000 replicated samples and the P-values are estimated using 10000 permutations for TOW-CM and minP. For the 10000 replicated samples, the 95% confidence intervals (CIs) for the estimated type I error rates of nominal levels 0.05, 0.01, and 0.001 are (0.046, 0.054), (0.008, 0.012), and (0.0004, 0.0016), respectively. From this table, we can see that all of the estimated type I error rates are either within 95% CIs or close to the bound of the corresponding 95% CIs, which indicate that the type I error rates of all methods are controlled under all considered scenarios.

Download:

Table 1. The estimated type I error rates for TOW-CM, minP, MANOVA and MSKAT.

https://doi.org/10.1371/journal.pone.0220914.t001

In power comparisons, the P-values of TOW-CM, minP are calculated using 1000 permutations, while the P-values of MANOVA and MSKAT are calculated by asymptotic distributions. The powers of all the four tests are evaluated using 1000 replicated samples at a nominal significance level of 0.05. Figs 1–6 present the power under significance level 0.05 for L = 4, 3, 2, 1 respectively.

Download:

Fig 1. Power comparison of four tests as a function of heritability for four continuous traits with the magnitude of correlation at 0.2, 0.5 and 0.8, respectively.

All four traits are associated with the gene for the left panel and only the first three traits are associated with the gene for the right panel. Sample size is 1,000 and 20% of rare variants are causal. All causal variants are risk variants. The powers are evaluated at a significance level of 0.05.

https://doi.org/10.1371/journal.pone.0220914.g001

Download:

Fig 2. Power comparison of four tests as a function of heritability for four continuous traits with the magnitude of correlation at 0.2, 0.5 and 0.8, respectively.

All four traits are associated with the gene for the left panel and only the first three traits are associated with the gene for the right panel. Sample size is 1,000 and 20% of rare variants are causal variants among which 90% of causal variants are risk variants and 10% of causal variants are protective variants. The powers are evaluated at a significance level of 0.05.

https://doi.org/10.1371/journal.pone.0220914.g002

Download:

Fig 3. Power comparison of four tests as a function of heritability for four continuous traits with the magnitude of correlation at 0.2, 0.5 and 0.8, respectively.

All four traits are associated with the gene for the left panel and only the first three traits are associated with the gene for the right panel. Sample size is 1,000 and 20% of rare variants are causal among which 80% of causal variants are risk variants and 20% of causal variants are protective variants. The powers are evaluated at a significance level of 0.05.

https://doi.org/10.1371/journal.pone.0220914.g003

Download:

Fig 4. Power comparison of four tests as a function of heritability for four continuous traits with the magnitude of correlation at 0.2, 0.5 and 0.8, respectively.

Only the first two traits are associated with the gene for left panel and only the first traits are associated with the gene for right panel. Sample size is 1,000 and 20% of rare variants are causal variants. All causal are risk variants. The powers are evaluated at a significance level of 0.05.

https://doi.org/10.1371/journal.pone.0220914.g004

Download:

Fig 5. Power comparison of four tests as a function of heritability for four continuous traits with the magnitude of correlation at 0.2, 0.5 and 0.8, respectively.

Only the first two traits are associated with the gene for left panel and only the first traits are associated with the gene for right panel. Sample size is 1,000 and 20% of rare variants are causal. 90% of causal are risk variants and 10% of causal are protective variants. The powers are evaluated at a significance level of 0.05.

https://doi.org/10.1371/journal.pone.0220914.g005

Download:

Fig 6. Power comparison of four tests as a function of heritability for four continuous traits with the magnitude of correlation at 0.2, 0.5 and 0.8, respectively.

Only the first two traits are associated with the gene for left panel and only the first traits are associated with the gene for right panel. Sample size is 1,000 and 20% of rare variants are causal among which 80% of causal variants are risk variants and 20% of causal variants are protective variants. The powers are evaluated at a significance level of 0.05.

https://doi.org/10.1371/journal.pone.0220914.g006

These figures show the power comparisons of the four tests (TOW-CM, MANOVA, MSKAT and minP). Power is a function of the total heritability based on three cases (all causal are risk variants, 90% causal are risk variants, and 80% causal are risk variants) for each specific scenario L. These figures show that TOW-CM is consistently the most powerful test among the four tests, and MANOVA is the second most powerful test when genotypes of genetic variants have impact on more than 1 traits. MSKAT is consistently less powerful than the other two multivariate tests (TOW-CM and MANOVA) most likely because there are only 8% variants with MAF in the range of (0.01,0.035) in Sgene which the simulations are based on. Similar to SKAT, MSKAT will lose power when the MAF of causal variants are not in the range (0.01,0.035) [23]. The minP method is consistently less powerful than TOW-CM and MANOVA because they ignore the traits’ dependence by directly using minimum of the P-values of testing the four single traits. Overall, we can see that they suffer power loss when the correlations among traits increase.

An interesting scenario is one in which only the first trait is associated with the variants set and all the others are null traits (L = 1). Stephens [24] and Wu et al. [25] have reported that joint testing by incorporating a correlated null trait could improve the power for testing association of a common variant. When only the first trait is associated with the variants set, minP is either the most powerful test or has similar power to the most powerful test especially in the case of both causal variants under weak traits correlation (|ρ| = 0.2). The TOW-CM and MANOVA statistic could benefit from increased traits correlations, and offer vastly improved power by incorporating strongly correlated null traits. Thus, our results verify the conclusion of [24] and [25].

Overall, we can see that the proposed TOW-CM is an attractive approach that provides good power in most of the scenarios.

Application to the COPDGene

Chronic obstructive pulmonary disease (COPD) is one of the most common lung diseases characterized by long term poor airflow and is a major public health problem [26]. The COPDGene Study is a multi-center genetic and epidemiologic investigation dedicated to studying COPD [27]. Participants in the COPDGene Study gave consent for the use of data collected during the study in downstream analyses. This study is sufficiently large and appropriately designed for analysis of COPD. In this study, we consider more than 5000 non-Hispanic Whites (NHW) participants where the participants have completed a detailed protocol, including questionnaires, pre- and post-bronchodilator spirometry, high-resolution CT scanning of the chest, exercise capacity (assessed by six-minute walk distance), and blood samples for genotyping. The participants were genotyped using the Illumina OmniExpress platform. The genotype data have gone through standard quality-control procedures for genome-wide association analysis detailed at http://www.copdgene.org/sites/default/files/GWAS_QC_Methodology_20121115.pdf.

Based on the literature studies of COPD [28, 29], we selected 7 key quantitative COPD-related phenotypes, including FEV1 (% predicted FEV1), Emphysema (Emph), Emphysema Distribution (EmphDist), Gas Trapping (GasTrap), Airway Wall Area (Pi10), Exacerbation frequency (ExacerFreq), Six-minute walk distance (6MWD), and 4 covariates, including BMI, Age, Pack-Years (PackYear) and Sex. EmphDist is the ratio of emphysema at -950 HU in the upper 1/3 of lung fields compared to the lower 1/3 of lung fields where we did a log transformation on EmphDist in the following analysis, referred to [28]. In the analysis, participants with missing data in any of these phenotypes were excluded.

To evaluate the performance of our proposed method on a real data set, we applied all of the 4 methods (TOW-CM, MANOVA, MSKAT and minP) to the COPD associated genes or genes containing significant single-nucleotide polymorphisms (SNPs) in NHW population with COPD-related phenotypes [30]. In the analysis, we first removed the missing data in any genotypic variants and then adjusted each of the 7 phenotypes for the 4 covariates using linear models. In the analysis, participants with missing data in any of the 11 variables were excluded. Therefore, a complete set of 5,430 individuals across 50 genes were used in the following analyses. In order to compare these methods, we adopted the commonly used 10⁷ permutations for TOW-CM and minP methods. For this verification study, we use 0.05 as the significance level for MANOVA, MSKAT and TOW-CM methods and use Bonferroni corrected significance level 0.05/7 = 7.14 × 10⁻³ for minP methods since this method perform association tests across each trait, respectively. The results are summarized in Table 2. From Table 2, we can see that TOW-CM identified 14 genes, minP identified 14 genes, MANOVA identified 12 genes and MSKAT identified 4 genes. Among these four methods, TOW-CM identified the most significant genes where all of these 14 genes had previously been reported to be in association with COPD by eligible studies [7, 30], among which 5 genes (LOC105377462,CHRNA3, CHRNA5,HYKK,IREB2) are statistically significant if we use a more stringent cut-off 1.00 × 10⁻³ for a multiple testing issue with 50 genes in total. Because the MAFs of most variants are not in the range of (0.01,0.035) which is a range favoring MSKAT, MSKAT performs worse than the other three comparable methods (Yang et al. 2017). TOW-CM and minP perform better than MANOVA, which is perhaps because only a proportion of phenotypes are associated with COPD. The method minP missed some genes in comparision to our method TOW-CM, it may because the method minP ignores the correlation between these seven phenotypes.

Download:

Table 2. The p-values of significant genes in the genetic association analysis for COPD using these four different methods.

https://doi.org/10.1371/journal.pone.0220914.t002

Discussion

GWAS have identified many variants with each variant affecting multiple phenotypes, which suggests that pleiotropic effects on human complex phenotypes may be widespread. Also, recent studies have shown that complex diseases are caused by both common and rare variants [14, 16, 19]. Therefore, statistical methods that can jointly analyze multiple phenotypes for common or/and rare variants have advantages over analyzing each phenotype individually or only considering for common variants (GWAS). In this article, we propose TOW-CM method to perform multivariate analysis for multiple phenotypes in association studies based on the following reasons: (1) complex diseases are usually measured by multiple correlated phenotypes in genetic association studies; (2) there is increasing evidence showing that studying multiple correlated phenotypes jointly may increase power for detecting disease associated genetic variants, and (3) there is a shortage of gene-based approaches for multiple traits. Simulation results show that TOW-CM has correct type I error rates and is consistently more powerful in comparision to the other three tests. The real data analysis results show that TOW-CM has excellent performance in identifying genes associated with complex disease with multiple correlated phenotypes such as COPD.

One disadvantage of TOW-CM is that the test statistic does not have an asymptotic distribution and a permutation procedure is needed to estimate its P-value, which is time consuming compared to the methods whose test statistics have asymptotic distributions. To save time when applying TOW-CM to genetic association studies, we can use the “step-up” procedure [31] to determine the number of permutations, which can show evidence of association based on a small number of permutations first (e.g. 1,000) and then a large number of permutations are used to test the selected potentially significant genes. Specifically, for the analysis of real data, the computation time of p-value estimation of TOW-CM for all genes was about three days using our R program on 50 Dell PowerEdge C6320 servers. Each server has two 2.4GHz Intel Xeon E5-2680 v4 fourteen-core processors and 600 MB average memory. We also uploaded the R program on GitHub, https://github.com/Jianjun-CN/TOW-CM/blob/master/R%20Code Furthermore, TOW-CM method can not only be used for gene-based association studies, but also can be extended to transcriptome-wide association study (TWAS), which needs further investigations.

Acknowledgments

The Genetic Analysis workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. Preparation of the Genetic Analysis Workshop 17 Simulated Exome Data Set was supported in part by NIH R01 MH059490 and used sequencing data from the 1000 Genomes Project (www.1000genomes.org).

This research used data generated by the COPDGene study, which was supported by NIH grants U01HL089856 and U01HL089897. The COPDGene project is also supported by the COPD Foundation through contributions made by an Industry Advisory Board comprised of Pfizer, AstraZeneca, Boehringer Ingelheim, Novartis, and Sunovion.

A superior high-performance computing infrastructure at University of North Texas, was used in obtaining results presented in this publication.

References

1. Newton-Cheh C, Johnson T, Gateva V, Tobin MD, Bochud M, Coin L, et al. Genome-wide association study identifies eight loci associated with blood pressure. Nature Genetics. 2009;41(6):666. pmid:19430483
- View Article
- PubMed/NCBI
- Google Scholar
2. Zabaneh D, Balding DJ. A genome-wide association study of the metabolic syndrome in Indian Asian men. PloS One. 2010;5(8):e11961. pmid:20694148
- View Article
- PubMed/NCBI
- Google Scholar
3. Yang Q, Wang Y. Methods for analyzing multivariate phenotypes in genetic association studies. Journal of Probability and Statistics. 2012. pmid:24748889
- View Article
- PubMed/NCBI
- Google Scholar
4. Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW. Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics. 2013;14(7):483. pmid:23752797
- View Article
- PubMed/NCBI
- Google Scholar
5. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747. pmid:19812666
- View Article
- PubMed/NCBI
- Google Scholar
6. Kim J, Bai Y, Pan W. An adaptive association test for multiple phenotypes with GWAS summary statistics. Genetic Epidemiology. 2015;39(8):651–663. pmid:26493956
- View Article
- PubMed/NCBI
- Google Scholar
7. Liang X, Wang Z, Sha Q, Zhang S. An Adaptive Fisher’s Combination Method for Joint Analysis of Multiple Phenotypes in Association Studies. Scientific Reports. 2016;6,34323. pmid:27694844
- View Article
- PubMed/NCBI
- Google Scholar
8. O’Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984:1079–1087. pmid:6534410
- View Article
- PubMed/NCBI
- Google Scholar
9. Yang Q, Wu H, Guo C Y., Fox CS. Analyze multivariate phenotypes in genetic association studies by combining univariate association tests. Genetic Epidemiology. 2010;34(5):444–454. pmid:20583287
- View Article
- PubMed/NCBI
- Google Scholar
10. Cole DA, Maxwell SE, Arvey R, Salas E. How the power of MANOVA can both increase and decrease as a function of the intercorrelations among the dependent variables. Psychological Bulletin. 1994;115(3),465.
- View Article
- Google Scholar
11. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982:963–974. pmid:7168798
- View Article
- PubMed/NCBI
- Google Scholar
12. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
- View Article
- Google Scholar
13. Ott J, Rabinowitz D. A principal-components approach based on heritability for combining phenotype information. Human heredity. 1999;49(2):106–111. pmid:10077732
- View Article
- PubMed/NCBI
- Google Scholar
14. Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nature Genetics. 2008;40(6),695. pmid:18509313
- View Article
- PubMed/NCBI
- Google Scholar
15. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nature Genetics. 2010;42(4),348. pmid:20208533
- View Article
- PubMed/NCBI
- Google Scholar
16. Pritchard JK. Are rare variants responsible for susceptibility to complex diseases?. The American Journal of Human Genetics. 2001;69(1):124–137. pmid:11404818
- View Article
- PubMed/NCBI
- Google Scholar
17. Pritchard J. K, Cox N J. The allelic architecture of human disease genes: common disease–common variant or not?. Human molecular genetics. 2002;11(20):2417–2423. pmid:12351577
- View Article
- PubMed/NCBI
- Google Scholar
18. Stratton MR, Rahman N. The emerging landscape of breast cancer susceptibility. Nature Genetics. 2008;40(1),17. pmid:18163131
- View Article
- PubMed/NCBI
- Google Scholar
19. Teer JK, Mullikin JC. Exome sequencing: the sweet spot before whole genomes. Human Molecular Genetics. 2010;19(R2):R145–R151. pmid:20705737
- View Article
- PubMed/NCBI
- Google Scholar
20. Walsh T, King MC. Ten genes for inherited breast cancer. Cancer Cell. 2007;11(2):103–105. pmid:17292821
- View Article
- PubMed/NCBI
- Google Scholar
21. Wu B, Pankow JS. Sequence kernel association test of multiple continuous phenotypes. Genetic Epidemiology. 2016;40(2):91–100. pmid:26782911
- View Article
- PubMed/NCBI
- Google Scholar
22. Sha Q, Wang X, Wang X, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genetic Epidemiology. 2012;36(6):561–571. pmid:22714994
- View Article
- PubMed/NCBI
- Google Scholar
23. Yang X, Wang S, Zhang S, Sha Q. Detecting association of rare and common variants based on cross-validation prediction error. Genetic Epidemiology. 2017;41(3):233–243. pmid:28176359
- View Article
- PubMed/NCBI
- Google Scholar
24. Stephens M. A unified framework for association analysis with multiple related phenotypes. PloS One. 2013;8(7),e65245. pmid:23861737
- View Article
- PubMed/NCBI
- Google Scholar
25. Wu B, Pankow JS. Statistical methods for association tests of multiple continuous traits in genome-wide association studies. Annals of Human Genetics. 2015;79(4):282–293. pmid:25857693
- View Article
- PubMed/NCBI
- Google Scholar
26. Murphy TF, Sethi S. Chronic obstructive pulmonary disease. Aging. 2002;19(10):761–775.
- View Article
- Google Scholar
27. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, et al. Genetic epidemiology of COPD (COPDGene) study design. COPD: Journal of Chronic Obstructive Pulmonary Disease. 2011;7(1):32–43.
- View Article
- Google Scholar
28. Chu JH, Hersh CP, Castaldi PJ, Cho MH, Raby BA, Laird N, et al. Analyzing networks of phenotypes in complex diseases: methodology and applications in COPD. BMC Systems Biology. 2014;8(1):78. pmid:24964944
- View Article
- PubMed/NCBI
- Google Scholar
29. Han MK, Kazerooni EA, Lynch DA, Liu LX, Murray S, Curtis JL, et al. Chronic obstructive pulmonary disease exacerbations in the COPDGene study: associated radiologic phenotypes. Radiology. 2011;26(1):274–282.
- View Article
- Google Scholar
30. Berndt A, Leme AS, Shapiro SD. Emerging genetics of COPD. EMBO Molecular Medicine. 2012;4(11):1144–1155. pmid:23090857
- View Article
- PubMed/NCBI
- Google Scholar
31. Pan W, Kim J, Zhang Y, Shen X, Wei P. A powerful and adaptive association test for rare variants. Genetics. 2014,genetics–114.
- View Article
- Google Scholar

[ref1] 1. Newton-Cheh C, Johnson T, Gateva V, Tobin MD, Bochud M, Coin L, et al. Genome-wide association study identifies eight loci associated with blood pressure. Nature Genetics. 2009;41(6):666. pmid:19430483
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Zabaneh D, Balding DJ. A genome-wide association study of the metabolic syndrome in Indian Asian men. PloS One. 2010;5(8):e11961. pmid:20694148
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Yang Q, Wang Y. Methods for analyzing multivariate phenotypes in genetic association studies. Journal of Probability and Statistics. 2012. pmid:24748889
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW. Pleiotropy in complex traits: challenges and strategies. Nature Reviews Genetics. 2013;14(7):483. pmid:23752797
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747. pmid:19812666
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Kim J, Bai Y, Pan W. An adaptive association test for multiple phenotypes with GWAS summary statistics. Genetic Epidemiology. 2015;39(8):651–663. pmid:26493956
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Liang X, Wang Z, Sha Q, Zhang S. An Adaptive Fisher’s Combination Method for Joint Analysis of Multiple Phenotypes in Association Studies. Scientific Reports. 2016;6,34323. pmid:27694844
View Article
PubMed/NCBI
Google Scholar

[26] View Article

[27] PubMed/NCBI

[28] Google Scholar

[ref8] 8. O’Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984:1079–1087. pmid:6534410
View Article
PubMed/NCBI
Google Scholar

[30] View Article

[31] PubMed/NCBI

[32] Google Scholar

[ref9] 9. Yang Q, Wu H, Guo C Y., Fox CS. Analyze multivariate phenotypes in genetic association studies by combining univariate association tests. Genetic Epidemiology. 2010;34(5):444–454. pmid:20583287
View Article
PubMed/NCBI
Google Scholar

[34] View Article

[35] PubMed/NCBI

[36] Google Scholar

[ref10] 10. Cole DA, Maxwell SE, Arvey R, Salas E. How the power of MANOVA can both increase and decrease as a function of the intercorrelations among the dependent variables. Psychological Bulletin. 1994;115(3),465.
View Article
Google Scholar

[38] View Article

[39] Google Scholar

[ref11] 11. Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics. 1982:963–974. pmid:7168798
View Article
PubMed/NCBI
Google Scholar

[41] View Article

[42] PubMed/NCBI

[43] Google Scholar

[ref12] 12. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22.
View Article
Google Scholar

[45] View Article

[46] Google Scholar

[ref13] 13. Ott J, Rabinowitz D. A principal-components approach based on heritability for combining phenotype information. Human heredity. 1999;49(2):106–111. pmid:10077732
View Article
PubMed/NCBI
Google Scholar

[48] View Article

[49] PubMed/NCBI

[50] Google Scholar

[ref14] 14. Bodmer W, Bonilla C. Common and rare variants in multifactorial susceptibility to common diseases. Nature Genetics. 2008;40(6),695. pmid:18509313
View Article
PubMed/NCBI
Google Scholar

[52] View Article

[53] PubMed/NCBI

[54] Google Scholar

[ref15] 15. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nature Genetics. 2010;42(4),348. pmid:20208533
View Article
PubMed/NCBI
Google Scholar

[56] View Article

[57] PubMed/NCBI

[58] Google Scholar

[ref16] 16. Pritchard JK. Are rare variants responsible for susceptibility to complex diseases?. The American Journal of Human Genetics. 2001;69(1):124–137. pmid:11404818
View Article
PubMed/NCBI
Google Scholar

[60] View Article

[61] PubMed/NCBI

[62] Google Scholar

[ref17] 17. Pritchard J. K, Cox N J. The allelic architecture of human disease genes: common disease–common variant or not?. Human molecular genetics. 2002;11(20):2417–2423. pmid:12351577
View Article
PubMed/NCBI
Google Scholar

[64] View Article

[65] PubMed/NCBI

[66] Google Scholar

[ref18] 18. Stratton MR, Rahman N. The emerging landscape of breast cancer susceptibility. Nature Genetics. 2008;40(1),17. pmid:18163131
View Article
PubMed/NCBI
Google Scholar

[68] View Article

[69] PubMed/NCBI

[70] Google Scholar

[ref19] 19. Teer JK, Mullikin JC. Exome sequencing: the sweet spot before whole genomes. Human Molecular Genetics. 2010;19(R2):R145–R151. pmid:20705737
View Article
PubMed/NCBI
Google Scholar

[72] View Article

[73] PubMed/NCBI

[74] Google Scholar

[ref20] 20. Walsh T, King MC. Ten genes for inherited breast cancer. Cancer Cell. 2007;11(2):103–105. pmid:17292821
View Article
PubMed/NCBI
Google Scholar

[76] View Article

[77] PubMed/NCBI

[78] Google Scholar

[ref21] 21. Wu B, Pankow JS. Sequence kernel association test of multiple continuous phenotypes. Genetic Epidemiology. 2016;40(2):91–100. pmid:26782911
View Article
PubMed/NCBI
Google Scholar

[80] View Article

[81] PubMed/NCBI

[82] Google Scholar

[ref22] 22. Sha Q, Wang X, Wang X, Zhang S. Detecting association of rare and common variants by testing an optimally weighted combination of variants. Genetic Epidemiology. 2012;36(6):561–571. pmid:22714994
View Article
PubMed/NCBI
Google Scholar

[84] View Article

[85] PubMed/NCBI

[86] Google Scholar

[ref23] 23. Yang X, Wang S, Zhang S, Sha Q. Detecting association of rare and common variants based on cross-validation prediction error. Genetic Epidemiology. 2017;41(3):233–243. pmid:28176359
View Article
PubMed/NCBI
Google Scholar

[88] View Article

[89] PubMed/NCBI

[90] Google Scholar

[ref24] 24. Stephens M. A unified framework for association analysis with multiple related phenotypes. PloS One. 2013;8(7),e65245. pmid:23861737
View Article
PubMed/NCBI
Google Scholar

[92] View Article

[93] PubMed/NCBI

[94] Google Scholar

[ref25] 25. Wu B, Pankow JS. Statistical methods for association tests of multiple continuous traits in genome-wide association studies. Annals of Human Genetics. 2015;79(4):282–293. pmid:25857693
View Article
PubMed/NCBI
Google Scholar

[96] View Article

[97] PubMed/NCBI

[98] Google Scholar

[ref26] 26. Murphy TF, Sethi S. Chronic obstructive pulmonary disease. Aging. 2002;19(10):761–775.
View Article
Google Scholar

[100] View Article

[101] Google Scholar

[ref27] 27. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, et al. Genetic epidemiology of COPD (COPDGene) study design. COPD: Journal of Chronic Obstructive Pulmonary Disease. 2011;7(1):32–43.
View Article
Google Scholar

[103] View Article

[104] Google Scholar

[ref28] 28. Chu JH, Hersh CP, Castaldi PJ, Cho MH, Raby BA, Laird N, et al. Analyzing networks of phenotypes in complex diseases: methodology and applications in COPD. BMC Systems Biology. 2014;8(1):78. pmid:24964944
View Article
PubMed/NCBI
Google Scholar

[106] View Article

[107] PubMed/NCBI

[108] Google Scholar

[ref29] 29. Han MK, Kazerooni EA, Lynch DA, Liu LX, Murray S, Curtis JL, et al. Chronic obstructive pulmonary disease exacerbations in the COPDGene study: associated radiologic phenotypes. Radiology. 2011;26(1):274–282.
View Article
Google Scholar

[110] View Article

[111] Google Scholar

[ref30] 30. Berndt A, Leme AS, Shapiro SD. Emerging genetics of COPD. EMBO Molecular Medicine. 2012;4(11):1144–1155. pmid:23090857
View Article
PubMed/NCBI
Google Scholar

[113] View Article

[114] PubMed/NCBI

[115] Google Scholar

[ref31] 31. Pan W, Kim J, Zhang Y, Shen X, Wei P. A powerful and adaptive association test for rare variants. Genetics. 2014,genetics–114.
View Article
Google Scholar

[117] View Article

[118] Google Scholar

Figures

Abstract

Introduction

Methods

Comparison of tests

Simulation

Simulation results

Application to the COPDGene

Discussion

Acknowledgments

References