GWAS data analysis

The analyses were also repeated with the gene annotation extended to include a 10 kilobase window around each gene, with the comparison in S11 Fig showing a considerable impact on the results. We can also examine the interaction of SNPs with covariates such as sex.
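As a minimal sketch of what the extended annotation means, the Python below assigns a SNP to a gene when its position falls inside the gene boundaries widened by a window on both sides. The function name, the tuple layout and the example coordinates are illustrative assumptions and are not taken from MAGMA.

```python
def snps_in_gene(snps, gene_start, gene_stop, window=0):
    """Return IDs of SNPs located within [gene_start - window, gene_stop + window].

    snps: iterable of (snp_id, base_pair_position) tuples on the gene's chromosome.
    """
    lo = gene_start - window
    hi = gene_stop + window
    return [snp_id for snp_id, pos in snps if lo <= pos <= hi]


# Hypothetical SNP positions and gene boundaries, for illustration only.
example_snps = [("snpA", 95_000), ("snpB", 100_500), ("snpC", 112_000), ("snpD", 125_000)]

strict = snps_in_gene(example_snps, 100_000, 110_000)                     # no window
extended = snps_in_gene(example_snps, 100_000, 110_000, window=10_000)   # 10 kb window
print(strict)
print(extended)
```

With the window, SNPs just outside the transcribed region are also counted towards the gene, which is one way the two annotations can lead to different gene-level results.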

This gene-set analysis also uses a regression structure to allow generalization to analysis of continuous properties of genes and simultaneous analysis of multiple gene sets and other gene properties. This model first projects the SNP matrix for a gene onto its principal components (PCs), pruning away PCs with very small eigenvalues, and then uses those PCs as predictors for the phenotype in the linear regression model.

P-values below $10^{-8}$ are truncated to $10^{-8}$ (grey points) to preserve the visibility of the other points.

The --covar-name option gives the names of the variables that we want to adjust for.
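The truncation of very small p-values for plotting can be sketched as follows. This is an illustrative helper, not code from the source; the function name and the example p-values are assumptions.

```python
import math


def neg_log10_truncated(p_values, floor=1e-8):
    """Return -log10(p) for each p-value, clamping values below `floor` to `floor`."""
    return [-math.log10(max(p, floor)) for p in p_values]


# Illustrative p-values: the last one lies below the floor and is clamped.
example = [0.05, 1e-8, 3e-12]
scores = neg_log10_truncated(example)
print(scores)
```

Clamped points end up at the same height on the plot, which is why they are usually drawn in a different colour.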

Similarly, the self-contained analysis is equivalent to a one-sided single-sample t-test comparing the mean association of gene-set genes to 0.

Gene and gene-set analysis are statistical methods for analysing multiple genetic markers simultaneously to determine their joint effect. Simulations and an analysis of Crohn’s Disease data are used to evaluate the performance of MAGMA and to compare it to a number of other gene and gene-set analysis tools. Our results show that MAGMA detects more associated genes and gene sets than other methods, and is also considerably faster.

Affiliation: Department of Complex Trait Genetics, Center for Neurogenomics and Cognitive Research, VU University Amsterdam, Amsterdam, The Netherlands

To perform the gene-set analysis, for each gene $g$ the gene p-value $p_g$ computed with the gene analysis is converted to a Z-value $z_g = \Phi^{-1}(1 - p_g)$, where $\Phi^{-1}$ is the probit function.
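A minimal Python sketch of the p-value to Z-value conversion, together with the t statistic of the self-contained test, is given below. It uses only the standard library; the function names and the example p-values are assumptions made for illustration. The conversion is written as the negative quantile of p, which equals the probit of 1 - p by symmetry of the normal distribution and avoids rounding 1 - p to 1 for very small p-values.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def p_to_z(p):
    """Convert a gene p-value to a Z-value: z = probit(1 - p) = -probit(p)."""
    return -NormalDist().inv_cdf(p)


def one_sample_t(values):
    """t statistic for a one-sample test of mean(values) against 0."""
    n = len(values)
    return mean(values) / (stdev(values) / sqrt(n))


# Hypothetical gene p-values for the genes in one gene set.
gene_set_p = [0.01, 0.04, 0.20, 0.50, 0.03]
gene_set_z = [p_to_z(p) for p in gene_set_p]
t_stat = one_sample_t(gene_set_z)
print(gene_set_z)
print(t_stat)
```

A one-sided p-value would follow from comparing `t_stat` to a t distribution with n - 1 degrees of freedom; that step is left out because the standard library has no t distribution. The sketch also treats the genes as independent, so it does not include any correction for correlation between genes.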

The gene-set analysis is built as a separate layer around the gene analysis for additional flexibility. Because the Likelihood Ratio test appeared to have significantly more power than both the Score test and the MAGMA F-test, empirical p-values for the Likelihood Ratio test were computed by generating up to 10,000 permutations of the Likelihood Ratio statistic.
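An empirical p-value from permuted statistics can be sketched as below. The source does not state which estimator was used; the +1 correction in numerator and denominator is a common convention and is an assumption here, as are the function name and the example values.

```python
def empirical_p_value(observed, permuted):
    """Empirical p-value: (1 + number of permuted statistics >= observed) / (1 + N)."""
    exceed = sum(1 for stat in permuted if stat >= observed)
    return (1 + exceed) / (1 + len(permuted))


# Illustrative permuted test statistics; a real analysis would use thousands.
permuted_stats = [0.5, 1.2, 3.4, 0.9, 2.2, 4.1, 0.3, 1.8, 2.9]
p_emp = empirical_p_value(3.0, permuted_stats)
print(p_emp)
```

With this estimator the smallest attainable p-value for N permutations is 1 / (N + 1), so 10,000 permutations cannot resolve p-values much below 1e-4.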

Preparing the data for rare variant analysis

    # To compute Hardy-Weinberg equilibrium (HWE) test statistics for each SNP
    plink --bfile data --hardy --out qc_hwe_data
    # To remove SNPs with deviation from HWE (p value < 0.001)
    plink --bfile data --hwe 1E-3 --make-bed --out qc_hwe_data
    # To check the number of individuals with missing genotype data
    plink --bfile MYL2 --missing --out qc_missing
    # To remove the individuals with more than 10% missing genotypes
    plink --bfile data --mind 0.1 --make-bed --out qc_missing_data
    # To remove SNPs with more than 5% missing genotypes
    plink --bfile data --geno 0.05 --make-bed --out qc_geno_data
    # To remove the individuals listed in outlier.txt
    # There is a white space between the family and individual id
    plink --bfile data --remove outlier.txt --make-bed --out No_outlier_data

    # Logistic regression under the additive model
    plink --bfile clean_data --logistic --maf 0.01 --out base_logistic
    # Logistic regression under the dominant model
    plink --bfile clean_data --logistic dominant --maf 0.01 --out dom_logistic
    # Additive logistic regression adjusted for sex and age
    plink --bfile clean_data --logistic --covar covariates.pheno --covar-name sex,age --maf 0.01 --out add_adj
    # Logistic regression with a SNP x covariate interaction term
    # (the continuation of this command is cut off in the source)
    plink --file data --logistic --covar example.cov --covar-number 1 --interaction \

    # Linear regression for a quantitative trait (glucose)
    plink --bfile clean_data --linear --pheno example.pheno --pheno-name glucose --maf 0.01 --out base_linear
    # Linear regression adjusted for sex and age
    plink --bfile clean_data --linear --pheno example.pheno --pheno-name glucose \
      --covar example.pheno --covar-name sex,age --maf 0.01 --out linear_adj

The results are then read into R:

    # PLINK writes --linear results to <out>.assoc.linear
    linear.logistic <- read.table("linear_adj.assoc.linear", header = TRUE)
    n_test <- nrow(linear.logistic)  # Number of independent tests
    bonferroni <- 0.05/n_test        # Bonferroni Correction
    # Find SNPs within 100 kb of the peak SNP "rs11065773" to highlight
    # (assumes the columns SNP, CHR and BP are available by name and that pos holds the position of the peak SNP)
    peaksnps <- SNP[CHR == 12 & BP >= pos - 100000 & BP <= pos + 100000]

Further details can be found in ‘Supplemental Methods - SNP-wise gene analysis’.

GWAS is known as a phenotype-first approach, in which the participants are classified first by their clinical manifestation, as opposed to genotype-first.
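The Bonferroni step and the selection of SNPs around a peak from the R fragment can also be sketched in Python. The record layout mirrors the SNP, CHR, BP and P columns of a PLINK association file, but the SNP names, positions and p-values below are invented for illustration.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def snps_near_peak(results, chrom, peak_bp, window=100_000):
    """IDs of SNPs on `chrom` within `window` base pairs of the peak position."""
    return [r["SNP"] for r in results
            if r["CHR"] == chrom and abs(r["BP"] - peak_bp) <= window]


# Hypothetical association results; a real table has one row per tested SNP.
results = [
    {"SNP": "snp1", "CHR": 12, "BP": 111_000_000, "P": 2e-9},
    {"SNP": "snp2", "CHR": 12, "BP": 111_080_000, "P": 4e-4},
    {"SNP": "snp3", "CHR": 12, "BP": 111_250_000, "P": 0.3},
    {"SNP": "snp4", "CHR": 7, "BP": 111_000_000, "P": 0.6},
]

threshold = bonferroni_threshold(len(results))
significant = [r["SNP"] for r in results if r["P"] < threshold]
highlight = snps_near_peak(results, chrom=12, peak_bp=111_000_000)
print(threshold, significant, highlight)
```

Dividing by the number of rows treats every SNP as an independent test. Because SNPs in linkage disequilibrium are correlated, this threshold is conservative.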
For INRICH the results are strongly dependent on the SNP p-value cut-off used, with three significant gene sets at the 0.0001 cut-off but none at the higher ones, further emphasizing the problem of choosing the correct cut-off.

PLINK operates on raw genotype data, whereas all three competitive methods require only SNP p-values as input. A post-hoc power simulation indeed indicates that multi-marker effects with weak marginals are the most probable explanation (see ‘Supplemental Methods - Simulation Studies’). An additional PLINK analysis using the mean SNP statistic with pruning set to its default (PLINK-prune) was performed as well.

[Figure: Gene -log10 p-values from the CD data gene analysis in MAGMA for three different gene test-statistics, comparing analyses using (A) the mean χ² statistic with the top χ² statistic, (B) the mean χ² statistic and the PC regression model and (C) the top χ² statistic and the PC regression model.]

The gene-set analysis is divided into two distinct and largely independent parts.

The regression slope gives an estimate of the average proportion of PCs to SNPs. Grey points correspond to genes not covered by the reference data-set.

The results of the gene analyses of the CD data are summarized in Table 2, which shows the number of significant genes at a number of different p-value thresholds.

However, due to LD, neighbouring genes will generally be correlated, violating the assumption of independence between genes. The gene p-values and gene correlation matrix are then used in the second part to perform the actual gene-set analysis.

At present no comprehensive evaluation of the differences between existing gene-set analysis methods exists, leaving the causes and implications of these differences unclear. To address these issues we have developed MAGMA, a novel tool for gene and gene-set analysis.

This was compared to the asymptotic Likelihood Ratio test p-values (C), revealing a downward bias in the asymptotic p-values.

Although INRICH and ALIGATOR show comparable computation times at their lowest SNP p-value cut-off, the need to repeat the analysis at multiple cut-offs means the total analysis for both takes considerably longer. It is beyond the scope of this paper to perform such an evaluation, but the degree of discordance between most methods strongly suggests a need for future research in this direction. Aside from differences between methods, Table 3 also shows a clear difference between self-contained and competitive gene-set analysis.

Similarly, the effect of genetic variants on a quantitative trait may be confounded by covariates such as sex or age, in which case we need to adjust for those covariates.

Analyzed the data: CAdL.

Affiliation: Institute for Computing and Information Sciences, Radboud University Nijmegen, Nijmegen, The Netherlands
Two types of gene test statistics have been implemented in MAGMA: the mean of the χ² statistic for the SNPs in a gene, and the top χ² statistic among the SNPs in a gene. These two gene annotations were used for all analyses, to ensure that differences in default annotation settings did not cloud the comparison between tools.
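The mean and top χ² gene statistics can be illustrated with a short Python sketch. It assumes that each SNP has a two-sided p-value from a 1-degree-of-freedom test, so that the χ² statistic can be recovered by squaring the corresponding normal quantile; the source does not describe this step, and the function names and p-values are illustrative.

```python
from statistics import NormalDist, mean


def p_to_chi2(p):
    """1-df chi-square statistic corresponding to a two-sided SNP p-value."""
    z = NormalDist().inv_cdf(p / 2.0)  # lower-tail quantile; its sign is lost by squaring
    return z * z


def gene_statistics(snp_p_values):
    """Return (mean chi-square, top chi-square) over the SNPs in a gene."""
    chi2 = [p_to_chi2(p) for p in snp_p_values]
    return mean(chi2), max(chi2)


# Hypothetical SNP p-values for a single gene.
mean_chi2, top_chi2 = gene_statistics([0.05, 0.5, 0.001])
print(mean_chi2, top_chi2)
```

Turning either statistic into a gene p-value requires accounting for LD between the SNPs, which this sketch does not attempt.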
https://doi.org/10.1371/journal.pcbi.1004219.s006

A genome-wide association study (GWAS) is an observational study that searches the genome for small variations, called single nucleotide polymorphisms (SNPs), which are associated with a particular disease.

Conversely, at lower p-value cut-offs the latter three should become more sensitive to gene sets containing a small number of more strongly associated genes. Of note is also that PLINK-prune does considerably better than PLINK-avg, and that its p-values are somewhat more strongly correlated with those of the MAGMA analysis (Fig 2).

[Figure: Gene set -log10 p-values from the CD data competitive gene-set analysis for MAGMA, ALIGATOR, INRICH and MAGENTA.]

An additional caveat is that it is unknown to what extent the observed differences in power between methods may depend on the specific genetic architecture of Crohn’s disease, and as such generalizing the results to other genetic architectures must be done with caution.

These correlations reflect the LD between genes, and are needed in order to compensate for the dependencies between genes during the gene-set analysis. Moreover, gene-set analysis can provide additional insight into functional and biological mechanisms underlying the genetic component of a trait. There is also considerable discordance between different p-value cut-offs for the same methods (Fig 4).

[Figure: The gene test-statistics used are (A) the mean χ² statistic in MAGMA and PLINK, (B) the top χ² statistic in MAGMA and PLINK, (C) the mean χ² statistic in MAGMA and VEGAS with analysis based on SNP p-values and HapMap 3 reference data and (D) the mean χ² statistic in MAGMA on raw data and with analysis based on SNP p-values and HapMap 3 reference data.]

The required number of permutations is determined adaptively for each gene during the analysis, to increase computational efficiency. Since raw genotype data may not always be available for analysis, MAGMA also provides more traditional SNP-wise gene analysis models of the type implemented in PLINK and VEGAS.

These are considerably inflated because self-contained gene-set analysis by its definition is not designed to correct for polygenicity, illustrating the risk of performing self-contained analysis on polygenic phenotypes. Moreover, gene-set analysis can provide insight into the involvement of specific biological pathways or cellular functions in the genetic etiology of a phenotype. For the competitive test an additional simulation using a polygenic null model was performed, with effects explaining a combined 50% of the phenotypic variance assigned to randomly selected SNPs.

Statistical analysis of genome-wide association (GWAS) data. Jim Stankovich, Menzies Research Institute, University of Tasmania, J.Stankovich@utas.edu.au

In this chapter, we will analyze basic GWAS using gPLINK and HaploView, which are visual interfaces of the PLINK software. Alternatively, dominant logistic regression can be performed so that we can compare the results of the additive and dominant models. To check for and remove individuals with missing genotype data, the --mind option is used.
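What a per-individual missingness filter does can be sketched in Python as follows. This is a simplified illustration of the idea behind a 10% threshold, not PLINK's implementation; the data layout, the use of `None` for a missing call and the individual IDs are assumptions.

```python
def individual_missing_rates(genotypes):
    """Fraction of missing genotype calls per individual.

    genotypes: dict mapping individual ID -> list of calls, with None marking a missing call.
    """
    return {iid: sum(call is None for call in calls) / len(calls)
            for iid, calls in genotypes.items()}


def filter_individuals(genotypes, max_missing=0.1):
    """Keep individuals whose missing rate does not exceed `max_missing`."""
    rates = individual_missing_rates(genotypes)
    return {iid: calls for iid, calls in genotypes.items() if rates[iid] <= max_missing}


# Hypothetical data: 10 SNPs per individual, genotypes coded as allele counts 0/1/2.
example = {
    "ind1": [0, 1, 2, 0, 1, 1, 0, 2, 1, 0],           # 0% missing
    "ind2": [0, None, 2, 0, 1, 1, 0, 2, 1, 0],        # 10% missing, kept
    "ind3": [None, None, 2, 0, None, 1, 0, 2, 1, 0],  # 30% missing, removed
}
kept = filter_individuals(example, max_missing=0.1)
print(sorted(kept))
```

The same pattern applied per SNP instead of per individual corresponds to a SNP-level missingness filter.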

All analyses were performed on the Genetic Cluster Computer, which is part of the Dutch Lisa Cluster.

https://doi.org/10.1371/journal.pcbi.1004219.s001, https://doi.org/10.1371/journal.pcbi.1004219.s002, https://doi.org/10.1371/journal.pcbi.1004219.s003

The type 1 error rates for the gene-set analysis were also found to be well controlled for both the self-contained and competitive test (Table S2 in S2 File).

[Figure: Comparison of gene set -log10 p-values from the CD data competitive gene-set analysis at different SNP p-value cut-offs for ALIGATOR (top row), INRICH (middle row) and MAGENTA (bottom row).]

[Figure: The correlation matrix for chromosomes 5 and 6 was plotted, with individual pixels corresponding to a pair of genes and the color (from white to black) proportional to the absolute value of the correlation between those genes.]

An overview of all analyses is given in Table 1 (https://doi.org/10.1371/journal.pcbi.1004219.t001).

With $X_g^*$ the matrix of PCs, $Y$ the phenotype and $W$ an optional matrix of covariates, the model can thus be written as

$$Y = \alpha_{0g}\vec{1} + X_g^* \alpha_g + W \beta_g + \varepsilon_g,$$

where the parameter vector $\alpha_g$ represents the genetic effect, $\beta_g$ the effect of the optional covariates, $\alpha_{0g}$ the intercept and $\varepsilon_g$ the vector of residuals.

Gene and gene-set analysis have been suggested as potentially more powerful alternatives to the typical single-SNP analyses performed in GWAS [7].