Scholarly Works (8 results)

Thesis
Peer Reviewed

Challenges in Whole-Genome Analysis: Multilayer Omics Data and Data Encryption

Zhao, Tianjing
Advisor(s): Cheng, Hao

UC Davis Electronic Theses and Dissertations (2023)

With the development of high-throughput sequencing, whole-genome analysis, such as genomic prediction and genome-wide association studies (GWAS), plays an important role in animal and plant breeding. Under the infinitesimal model, complex traits are assumed to be affected by many genes with small additive effects, and the relationship between genotypes and phenotypes is linear. In most GWAS and genomic prediction studies, one goal is to estimate the joint effects of all SNP markers. The landmark paper by Meuwissen et al. (2001) introduced the Bayesian linear mixed model for whole-genome prediction, which has been widely used in breeding programs.

However, as the amount and diversity of omics data continue to grow, several challenges arise for the linear mixed model. First, there is a need to extend mixed models to incorporate multiple sequential layers of data as one connected network (e.g., regulatory cascades). Second, due to increasing concerns about data privacy, there is a need to adapt mixed models to encrypted data, enabling the sharing of confidential data in genome-to-phenome analyses.

This thesis proposes new methods to address these two challenges. For the first challenge, we provide a novel framework named mixed model neural network ("NNMM") that extends the mixed model ("MM") to a multilayer neural network ("NN"), thus incorporating sequential layers of data as a unified multilayer network. Nonlinear relationships between different layers of data are allowed via nonlinear activation functions in neural networks. Moreover, NNMM allows various patterns of missing data in the middle layer, and the network architecture of NNMM can be predefined to be partially connected.

For the second challenge, a homomorphic encryption method based on high-dimensional random orthogonal transformations of the raw data was proposed by Mott et al. (2020). This method is specifically suited for single-marker regression in GWAS using linear mixed models with Gaussian errors. In this thesis, we further generalize this homomorphic encryption method for genome-to-phenome analysis using mixed models.


Article
Peer Reviewed

Interpreting single-step genomic evaluation as a neural network of three layers: pedigree, genotypes, and phenotypes.

UC Davis Previously Published Works (2023)

The single-step approach has become the most widely used methodology for genomic evaluations when only a subset of phenotyped individuals in the pedigree is genotyped. In this approach, the genotypes of non-genotyped individuals are imputed from the gene contents (i.e., genotypes) of genotyped individuals through their pedigree relationships. We proposed a new method named single-step neural network with mixed models (NNMM) to represent single-step genomic evaluations as a neural network of three sequential layers: pedigree, genotypes, and phenotypes. These three sequential layers of information create a unified network instead of two separate steps, allowing the unobserved gene contents of non-genotyped individuals to be sampled based on pedigree, observed genotypes of genotyped individuals, and phenotypes. In addition to imputing genotypes using all three sources of information (phenotypes, genotypes, and pedigree), single-step NNMM provides a more flexible framework that allows nonlinear relationships between genotypes and phenotypes and allows individuals to be genotyped with different single-nucleotide polymorphism (SNP) panels. The single-step NNMM has been implemented in the software package JWAS.


Article
Peer Reviewed

Interpretable Artificial Neural Networks incorporating Bayesian Alphabet Models for Genome-wide Prediction and Association Studies

UC Davis Previously Published Works (2021)

In conventional linear models for whole-genome prediction and genome-wide association studies (GWAS), it is usually assumed that the relationship between genotypes and phenotypes is linear. Bayesian neural networks have been used to account for non-linearity such as complex genetic architectures. Here, we introduce a method named NN-Bayes, where "NN" stands for neural networks, and "Bayes" stands for Bayesian Alphabet models, including a collection of Bayesian regression models such as BayesA, BayesB, BayesC, and Bayesian LASSO. NN-Bayes incorporates Bayesian Alphabet models into non-linear neural networks via hidden layers between single-nucleotide polymorphisms (SNPs) and observed traits. Thus, NN-Bayes attempts to improve the performance of genome-wide prediction and GWAS by accommodating non-linear relationships between the hidden nodes and the observed trait, while maintaining genomic interpretability through the Bayesian regression models that connect the SNPs to the hidden nodes. For genomic interpretability, the posterior distribution of marker effects in NN-Bayes is inferred by Markov chain Monte Carlo approaches and used for inference of association through posterior inclusion probabilities and window posterior probability of association. In simulation studies with dominance and epistatic effects, performance of NN-Bayes was significantly better than conventional linear models for both GWAS and whole-genome prediction, and the differences in prediction accuracy were substantial in magnitude. In real-data analyses, for the soy dataset, NN-Bayes achieved significantly higher prediction accuracies than conventional linear models, and results from four other species showed that NN-Bayes had similar prediction performance to linear models, which is potentially due to the small sample size. Our NN-Bayes is optimized for high-dimensional genomic data and implemented in an open-source package called "JWAS." NN-Bayes can lead to greater use of Bayesian neural networks to account for non-linear relationships due to its interpretability and computational performance.
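
The posterior inclusion probability mentioned in this abstract can be illustrated with a minimal sketch. It assumes a variable-selection prior (such as BayesB or BayesC) under which a marker's sampled effect is exactly zero whenever the marker is excluded from the model at that MCMC step. The function name and the toy samples are hypothetical and are not taken from JWAS or the paper; the window posterior probability of association is not shown.

```python
def posterior_inclusion_probabilities(samples):
    """Fraction of MCMC samples in which each marker has a nonzero effect.

    `samples` is a list of MCMC iterations; each iteration is a list with
    one sampled effect per marker (hypothetical layout for illustration).
    """
    n_samples = len(samples)
    return [
        sum(1 for effect in marker_samples if effect != 0.0) / n_samples
        for marker_samples in zip(*samples)  # iterate marker by marker
    ]


# Toy chain: 4 MCMC samples for 3 markers (made-up numbers).
toy_samples = [
    [0.0, 0.5, 0.0],
    [0.0, 0.4, 0.1],
    [0.0, 0.6, 0.0],
    [0.2, 0.5, 0.0],
]

pip = posterior_inclusion_probabilities(toy_samples)
print(pip)  # [0.25, 1.0, 0.25]
```

In this toy chain, the second marker is included in every sample and would be the only candidate flagged as associated under a high inclusion-probability threshold.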


Article
Peer Reviewed

Fast parallelized sampling of Bayesian regression models for whole-genome prediction

UC Davis Previously Published Works (2020)

Background

Bayesian regression models are widely used in genomic prediction, where the effects of all markers are estimated simultaneously by combining the information from the phenotypic data with priors for the marker effects and other parameters such as variance components or membership probabilities. Inferences from most Bayesian regression models are based on Markov chain Monte Carlo methods, where statistics are computed from a Markov chain constructed to have a stationary distribution that is equal to the posterior distribution of the unknown parameters. In practice, chains of tens of thousands of steps are typically used in whole-genome Bayesian analyses, which is computationally intensive.

Methods

In this paper, we propose a fast parallelized algorithm for Bayesian regression models called independent intensive Bayesian regression models (BayesXII, "X" stands for Bayesian alphabet methods and "II" stands for "parallel") and show how the sampling of each marker effect can be made independent of samples for other marker effects within each step of the chain. This is done by augmenting the marker covariate matrix by adding p (the number of markers) new rows such that columns of the augmented marker covariate matrix are orthogonal. Ideally, the computations at each step of the MCMC chain can be accelerated by k times, where k is the number of computer processors, up to p times, where p is the number of markers.
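
The augmentation step described above can be sketched as follows. This is a minimal illustration, not the BayesXII implementation: it assumes the p new rows are chosen so that the augmented matrix has a Gram matrix equal to d times the identity, with d picked from a Gershgorin bound on X'X, and it obtains the rows from a Cholesky factor. The helper names, the choice of d, and the toy genotypes are all assumptions, and the rest of the sampler is not shown.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]


def cholesky(m):
    """Lower-triangular L with L L' = m, for symmetric positive-definite m."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def augment_to_orthogonal(x, d):
    """Return p extra rows x_tilde such that x'x + x_tilde'x_tilde = d * I."""
    p = len(x[0])
    xtx = matmul(transpose(x), x)
    gap = [[(d if i == j else 0.0) - xtx[i][j] for j in range(p)]
           for i in range(p)]
    # With x_tilde = L', x_tilde'x_tilde = L L' = gap.
    return transpose(cholesky(gap))


# Toy marker covariate matrix: 4 individuals, 3 markers (made-up 0/1/2 codes).
x = [[0, 1, 2],
     [1, 1, 0],
     [2, 0, 1],
     [1, 2, 1]]

xtx = matmul(transpose(x), x)
# Assumed choice of d: one more than the largest absolute row sum of x'x,
# which exceeds its largest eigenvalue, so d*I - x'x is positive definite.
d = max(sum(abs(v) for v in row) for row in xtx) + 1.0

x_tilde = augment_to_orthogonal(x, d)
augmented = x + x_tilde                      # n + p rows, p columns
gram = matmul(transpose(augmented), augmented)
for row in gram:
    print([round(v, 6) for v in row])        # off-diagonals are (near) zero
```

Because the columns of the augmented matrix are orthogonal, the conditional update for one marker no longer involves the current values of the other marker effects, which is what makes the per-marker sampling within a step parallelizable.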

Results

We demonstrate the BayesXII algorithm using the prior for BayesCπ, a Bayesian variable selection regression method, which is applied to simulated data with 50,000 individuals and a medium-density marker panel (~50,000 markers). Reaching about the same accuracy as the conventional samplers for BayesCπ required less than 30 min using the BayesXII algorithm on 24 nodes (computer used as a server) with 24 cores on each node. In this case, the BayesXII algorithm required one tenth of the computation time of conventional samplers for BayesCπ. Addressing the heavy computational burden associated with Bayesian methods by parallel computing will lead to greater use of these methods.


Article
Peer Reviewed

Extend mixed models to multilayer neural networks for genomic prediction including intermediate omics data.

UC Davis Previously Published Works (2022)

With the growing amount and diversity of intermediate omics data complementary to genomics (e.g. DNA methylation, gene expression, and protein abundance), there is a need to develop methods to incorporate intermediate omics data into conventional genomic evaluation. The omics data help decode the multiple layers of regulation from genotypes to phenotypes, thus naturally forming a connected multilayer network. We developed a new method named NN-MM to model the multiple layers of regulation from genotypes to intermediate omics features, then to phenotypes, by extending conventional linear mixed models ("MM") to multilayer artificial neural networks ("NN"). NN-MM incorporates intermediate omics features by adding middle layers between genotypes and phenotypes. Linear mixed models (e.g. pedigree-based BLUP, GBLUP, Bayesian Alphabet, single-step GBLUP, or single-step Bayesian Alphabet) can be used to sample marker effects or genetic values on intermediate omics features, and activation functions in neural networks are used to capture the nonlinear relationships between intermediate omics features and phenotypes. NN-MM had significantly better prediction performance than the recently proposed single-step approach for genomic prediction with intermediate omics data. Compared to the single-step approach, NN-MM can handle various patterns of missing omics measures and allows nonlinear relationships between intermediate omics features and phenotypes. NN-MM has been implemented in an open-source package called "JWAS".


Article
Peer Reviewed

Microbiome-enabled genomic selection improves prediction accuracy for nitrogen-related traits in maize.

UC Davis Previously Published Works (2024)

Root-associated microbiomes in the rhizosphere (rhizobiomes) are increasingly known to play an important role in nutrient acquisition, stress tolerance, and disease resistance of plants. However, it remains largely unclear to what extent these rhizobiomes contribute to trait variation for different genotypes and if their inclusion in the genomic selection protocol can enhance prediction accuracy. To address these questions, we developed a microbiome-enabled genomic selection method that incorporated host SNPs and amplicon sequence variants from plant rhizobiomes in a maize diversity panel under high and low nitrogen (N) field conditions. Our cross-validation results showed that the microbiome-enabled genomic selection model significantly outperformed the conventional genomic selection model for nearly all time-series traits related to plant growth and N responses, with an average relative improvement of 3.7%. The improvement was more pronounced under low N conditions (relative improvement of 8.4-40.2%), consistent with the view that some beneficial microbes can enhance N nutrient uptake, particularly in low N fields. However, our study could not definitively rule out the possibility that the observed improvement is partially due to the amplicon sequence variants being influenced by microenvironments. Using a high-dimensional mediation analysis method, our study also identified microbial mediators that establish a link between plant genotype and phenotype. Some of the detected mediator microbes were previously reported to promote plant growth. The enhanced prediction accuracy of the microbiome-enabled genomic selection models, demonstrated in a single environment, serves as a proof-of-concept for the potential application of microbiome-enabled plant breeding for sustainable agriculture.


Article
Peer Reviewed

Using encrypted genotypes and phenotypes for collaborative genomic analyses to maintain data confidentiality.

UC Davis Previously Published Works (2024)

To adhere to and capitalize on the benefits of the FAIR (findable, accessible, interoperable, and reusable) principles in agricultural genome-to-phenome studies, it is crucial to address privacy and intellectual property issues that prevent sharing and reuse of data in research and industry. Direct sharing of genotype and phenotype data is often prohibited due to intellectual property and privacy concerns. Thus, there is a pressing need for encryption methods that obscure confidential aspects of the data, without affecting the outcomes of certain statistical analyses. A homomorphic encryption method for genotypes and phenotypes (HEGP) has been proposed for single-marker regression in genome-wide association studies (GWAS) using linear mixed models with Gaussian errors. This methodology permits frequentist likelihood-based parameter estimation and inference. In this paper, we extend HEGP to broader applications in genome-to-phenome analyses. We show that HEGP is suited to commonly used linear mixed models for genetic analyses of quantitative traits including genomic best linear unbiased prediction (GBLUP) and ridge-regression best linear unbiased prediction (RR-BLUP), as well as Bayesian variable selection methods (e.g. those in Bayesian Alphabet), for genetic parameter estimation, genomic prediction, and GWAS. By advancing the capabilities of HEGP, we offer researchers and industry professionals a secure and efficient approach for collaborative genomic analyses while preserving data confidentiality.
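
The reason an orthogonal transformation can hide the data without changing certain analyses can be sketched with ordinary least squares: if P is orthogonal, then (PX)'(PX) = X'X and (PX)'(Py) = X'y, so the estimates are identical. The sketch below is an illustration only, not the HEGP procedure. It assumes a single Householder reflection as a stand-in for a random orthogonal key, fits a single-marker regression with an intercept by least squares rather than a full linear mixed model, and uses made-up data; all function names are hypothetical.

```python
import random


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def householder(n, rng):
    """Orthogonal matrix I - 2 v v' / (v'v) built from a random vector v."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    vv = dot(v, v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv
             for j in range(n)] for i in range(n)]


def apply(p, vec):
    return [dot(row, vec) for row in p]


def ols_two_columns(c1, c2, y):
    """Least-squares coefficients of y on columns c1 and c2 (2x2 normal equations)."""
    a11, a12, a22 = dot(c1, c1), dot(c1, c2), dot(c2, c2)
    r1, r2 = dot(c1, y), dot(c2, y)
    det = a11 * a22 - a12 * a12
    return ((a22 * r1 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det)


# Toy data for 6 individuals (made-up values).
genotype = [0.0, 1.0, 2.0, 1.0, 0.0, 2.0]
phenotype = [1.2, 2.1, 3.3, 1.9, 0.8, 2.9]
ones = [1.0] * len(genotype)

key = householder(len(genotype), random.Random(0))  # kept private by the data owner

# The intercept column, genotypes, and phenotypes are all transformed with the same key.
enc_ones = apply(key, ones)
enc_genotype = apply(key, genotype)
enc_phenotype = apply(key, phenotype)

plain_fit = ols_two_columns(ones, genotype, phenotype)
encrypted_fit = ols_two_columns(enc_ones, enc_genotype, enc_phenotype)
print(plain_fit)
print(encrypted_fit)  # same intercept and marker effect up to rounding error
```

The analyst who receives only the transformed columns recovers the same intercept and marker effect, while the individual genotype and phenotype values are no longer directly readable.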


Article
Peer Reviewed

Learning functional conservation between human and pig to decipher evolutionary mechanisms underlying gene expression and complex traits.

UC Davis Previously Published Works (2023)

Assessment of genomic conservation between humans and pigs at the functional level can improve the potential of pigs as a human biomedical model. To address this, we developed a deep learning-based approach to learn the genomic conservation at the functional level (DeepGCF) between species by integrating 386 and 374 functional profiles from humans and pigs, respectively. DeepGCF demonstrated better prediction performance compared with the previous method. In addition, the resulting DeepGCF score captures the functional conservation between humans and pigs by examining chromatin states, sequence ontologies, and regulatory variants. We identified a core set of genomic regions as functionally conserved that plays key roles in gene regulation and is enriched for the heritability of complex traits and diseases in humans. Our results highlight the importance of cross-species functional comparison in illustrating the genetic and evolutionary basis of complex phenotypes.
