The logic behind imputation analysis in next-generation sequencing for agrigenomics.

Introduction to imputation in agrigenomics

Imputation, in the context of genetics, refers to the statistical inference of missing or unobserved genotypic data. In next-generation sequencing (NGS), especially within the field of agrigenomics, imputation plays a pivotal role in “filling in the gaps” and maximizing the value of data derived from plants and livestock.

Agrigenomics often relies on large-scale sequencing to uncover genetic variations that drive key traits like yield, disease resistance, and drought tolerance. However, the cost and computational demands of high-coverage sequencing limit its widespread application.

This is where imputation comes in: by using computational models and reference panels, researchers can infer missing genotypes in low-coverage sequencing data (ranging from 0.1x to 10x), transforming sparse datasets into richer, more complete genetic profiles. The result is a cost-effective and resource-efficient approach to genetic discovery and improvement.
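To get a feel for how sparse low-coverage data is, a back-of-the-envelope calculation helps. The sketch below assumes that read depth at a site follows a Poisson distribution with mean equal to the average coverage. This is a common simplification that ignores effects such as GC bias and mappability, and the function names are illustrative rather than taken from any library.

```python
import math

def prob_depth(k: int, coverage: float) -> float:
    """Poisson probability of seeing exactly k reads at a site (simplifying assumption)."""
    return math.exp(-coverage) * coverage ** k / math.factorial(k)

def fraction_uncovered(coverage: float) -> float:
    """Expected fraction of sites with no reads at all under the Poisson model."""
    return prob_depth(0, coverage)

for cov in (0.1, 1.0, 10.0):
    print(f"{cov}x: {fraction_uncovered(cov):.4%} of sites expected to have no reads")
```

Under this model roughly 90% of sites have no reads at 0.1x and about 37% at 1x, which is why the remaining genotypes have to be inferred rather than called directly.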
 

Methodologies for imputation

Imputation relies on a variety of statistical and machine learning techniques to fill the gaps in genetic data with reasonable confidence. Some common methods include:

  • K-nearest neighbors (kNN): This technique imputes missing data by identifying the most genetically similar individuals (neighbors) in the dataset and averaging their genotypes. It is straightforward but may struggle with complex genomic patterns.
  • Random forest: A machine learning approach that builds multiple decision trees to predict missing genotypes and to model complex interactions between markers. When well tuned, Random Forest can be robust against noise and works well even for heterogeneous datasets. However, it is computationally intensive.
  • Bayesian approaches: Bayesian methods, such as those implemented in Beagle or IMPUTE2, incorporate prior knowledge about allele frequencies and linkage disequilibrium (LD) patterns to predict missing data. They can integrate haplotype structure and population genetics theory, often leading to highly accurate imputation, particularly when leveraging robust reference panels*. These methods are popular in studies of major crops like maize and rice, where structured reference panels are often available. This approach is also computationally intensive.
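As an illustration of the simplest method in the list, here is a minimal kNN sketch in plain Python. It assumes genotypes are coded as 0, 1, or 2 copies of the alternate allele with `None` for missing calls, and it uses mean absolute difference over shared markers as the similarity measure. Both choices, and the function names, are assumptions made for the example; real tools use more careful distance measures and weighting.

```python
from typing import List, Optional

Row = List[Optional[int]]  # 0/1/2 alternate-allele count, None = missing

def distance(a: Row, b: Row) -> float:
    """Mean absolute genotype difference over markers observed in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")  # no overlap: this sample cannot serve as a neighbor
    return sum(abs(x - y) for x, y in shared) / len(shared)

def knn_impute(genotypes: List[Row], k: int = 2) -> List[Row]:
    """Fill each missing genotype with the rounded mean of its k nearest donors."""
    imputed = [row[:] for row in genotypes]  # leave the input untouched
    for i, row in enumerate(genotypes):
        if None not in row:
            continue
        dists = {j: distance(row, other)
                 for j, other in enumerate(genotypes) if j != i}
        ranked = sorted((j for j in dists if dists[j] != float("inf")),
                        key=dists.get)
        for m, g in enumerate(row):
            if g is not None:
                continue
            # nearest neighbors that actually have a call at this marker
            donors = [genotypes[j][m] for j in ranked
                      if genotypes[j][m] is not None][:k]
            if donors:
                imputed[i][m] = int(sum(donors) / len(donors) + 0.5)  # round half up
    return imputed

panel = [
    [0, 0, 1, 2],
    [0, 0, 1, 2],
    [2, 2, 1, 0],
    [0, None, 1, 2],  # sample with one missing genotype
]
print(knn_impute(panel, k=2)[3])  # [0, 0, 1, 2]
```

The last sample is identical to the first two at every shared marker, so they become its donors and the gap is filled with their genotype.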

Recent research has shown that Bayesian haplotype-based methods often outperform simpler approaches like kNN. However, computational costs and ease of implementation may lead researchers to prefer methods like Random Forest in certain scenarios. Ongoing comparisons highlight that there is no “one-size-fits-all” solution; the choice depends on data characteristics, population structure, and available computational resources.

* A reference panel is a database of fully or deeply sequenced genomes from a representative set of individuals. These reference panels capture the genetic diversity, haplotype structures, and LD patterns prevalent in the target population. By comparing incomplete data against these known haplotypes, imputation algorithms can fill in missing variants with confidence. Strong reference panels, tailored to specific crops or livestock species, are crucial for ensuring accuracy and minimizing bias.
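The idea of comparing incomplete data against known haplotypes can be sketched with a toy hidden Markov model in which the target haplotype is treated as a mosaic copied from the reference haplotypes. This is a deliberately simplified illustration, not the algorithm of Beagle, IMPUTE2, or any other tool. The `switch` and `err` parameters, the uniform switching model, and the function name are assumptions chosen for the example.

```python
from typing import List, Optional

def impute_haplotype(target: List[Optional[int]], panel: List[List[int]],
                     switch: float = 0.1, err: float = 0.01) -> List[float]:
    """Return P(allele == 1) at every marker for a haploid target.

    Hidden state = index of the reference haplotype currently being copied.
    """
    n, m = len(panel), len(target)

    def emit(h: int, j: int) -> float:
        if target[j] is None:          # missing site carries no information
            return 1.0
        return 1.0 - err if panel[h][j] == target[j] else err

    def step(a: int, b: int) -> float:
        # stay on the same haplotype, or jump uniformly to any of the n
        return (1.0 - switch) * (a == b) + switch / n

    # Forward pass (unscaled: fine for a toy, underflows on long chromosomes)
    fwd = [[0.0] * n for _ in range(m)]
    for h in range(n):
        fwd[0][h] = emit(h, 0) / n
    for j in range(1, m):
        for h in range(n):
            fwd[j][h] = emit(h, j) * sum(fwd[j - 1][p] * step(p, h) for p in range(n))

    # Backward pass
    bwd = [[1.0] * n for _ in range(m)]
    for j in range(m - 2, -1, -1):
        for h in range(n):
            bwd[j][h] = sum(step(h, q) * emit(q, j + 1) * bwd[j + 1][q] for q in range(n))

    # Posterior over copied haplotypes, summed over those carrying allele 1
    probs = []
    for j in range(m):
        weights = [fwd[j][h] * bwd[j][h] for h in range(n)]
        total = sum(weights)
        probs.append(sum(w for w, hap in zip(weights, panel) if hap[j] == 1) / total)
    return probs

reference = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0],
]
sample = [1, 1, None, 1, 1]  # third marker was not observed
print([round(p, 3) for p in impute_haplotype(sample, reference)])
```

Because the observed alleles match the second reference haplotype on both sides of the gap, the model assigns a high probability to allele 1 at the missing marker. A panel that lacks the relevant haplotype gives the model nothing suitable to copy from, which is one way to see why panel representativeness matters.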

Applications of imputation in agrigenomics

Genomic selection

Imputation supports genomic selection by increasing the density and completeness of marker data. When breeders predict the genetic merit of plants from genomic information, a dense, high-quality marker set is crucial. Imputation thus improves the accuracy of genomic predictions, enabling breeders to make informed decisions faster, accelerate breeding cycles, and ultimately develop superior crop varieties.

Quantitative trait locus (QTL) mapping

QTL mapping aims to locate genomic regions that influence specific agronomic traits, such as drought tolerance or nutrient efficiency. Missing data can reduce the statistical power to detect these loci. Through imputation, researchers gain cleaner, more continuous datasets that strengthen association signals, making it easier to identify and validate key QTLs.

Meta-analysis of GWAS

Genome-wide association studies (GWAS) often combine data from multiple studies to increase statistical power and uncover subtle genetic associations. However, variations in genotyping platforms, sequencing depths, and populations lead to inconsistent datasets with large amounts of missing data. Imputation harmonizes these datasets, facilitating robust meta-analyses that can detect even small-effect variants associated with traits of interest. This improves the overall understanding of trait heritability and genetic complexity in important crop species.

Limitations of current methods

Despite its benefits, imputation in agrigenomics is not without challenges. The reliability of imputation depends largely on the quality and representativeness of the reference panel, the degree of genetic relatedness within the studied population, and the underlying LD structure. For diverse or underrepresented populations, accuracy may decrease, potentially excluding rare but agronomically significant alleles from downstream analysis. Low-quality or highly fragmented input data can also lead to erroneous imputations.
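One common way to check how reliable imputation is for a given population is to hide genotypes that are already known, impute them, and compare the results with the truth. The sketch below computes two widely used summary measures for such a comparison: genotype concordance and the squared correlation between true genotypes and imputed dosages. The function names and example values are hypothetical and serve only as illustration.

```python
from typing import Sequence

def concordance(truth: Sequence[int], dosages: Sequence[float]) -> float:
    """Fraction of masked genotypes whose rounded imputed dosage equals the truth."""
    calls = [int(d + 0.5) for d in dosages]  # round half up to 0, 1, or 2
    return sum(t == c for t, c in zip(truth, calls)) / len(truth)

def dosage_r2(truth: Sequence[float], dosages: Sequence[float]) -> float:
    """Squared Pearson correlation between true genotypes and imputed dosages."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_d = sum(dosages) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(truth, dosages))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    if var_t == 0 or var_d == 0:
        return float("nan")  # undefined for a monomorphic marker
    return cov * cov / (var_t * var_d)

# Hypothetical masked genotypes and the dosages an imputation tool returned
true_genotypes = [0, 1, 2, 1, 0, 2]
imputed_dosages = [0.1, 0.9, 1.8, 1.4, 0.2, 1.2]

print(f"concordance: {concordance(true_genotypes, imputed_dosages):.3f}")
print(f"dosage r2:   {dosage_r2(true_genotypes, imputed_dosages):.3f}")
```

Concordance can look deceptively high for rare variants, because most individuals are homozygous for the common allele and are easy to call correctly. The correlation-based measure is less forgiving in that situation, which makes it useful when rare alleles are of interest.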

Imputation, particularly for large-scale datasets, requires significant computational resources. This includes the time and memory needed to run algorithms, as well as the costs associated with maintaining high-performance computing infrastructure. Continuous algorithmic improvements and more efficient software implementations are necessary to keep pace with the growing scale of modern agrigenomics. 

Future directions

The future of imputation in agrigenomics is bright, with several exciting developments on the horizon.

Advancements in NGS technologies

Improvements in sequencing technology, such as ultra-long-read sequencing and single-cell sequencing, will provide richer datasets, reducing the reliance on imputation. At the same time, these technologies will enhance reference panels, making imputation more accurate where it is still needed.

Integration with multi-omics data

Integrating genomic data with phenotypic, environmental, and transcriptomic data holds the potential to create more robust predictive models. This holistic approach could revolutionize genomic selection and trait discovery.

Machine learning and AI

The application of machine learning and artificial intelligence to imputation is still in its infancy but promises to handle the complexity of agrigenomic datasets more effectively. These tools could automate and optimize imputation processes, making them more accessible to researchers worldwide.
 

Conclusion

Imputation analysis has become a cornerstone of modern agrigenomics, bridging the gap between low-cost sequencing and high-quality genetic data. By leveraging advanced statistical techniques and reference panels, researchers can unlock deeper insights into the genetic architecture of plants and livestock, paving the way for sustainable agricultural practices and improved food security. As NGS technologies and computational tools continue to evolve, the potential of imputation in agrigenomics will only grow, ensuring that the field remains at the cutting edge of genetic research.

Revvity offers a solution for low-pass whole genome sequencing and a cloud platform for imputation analysis (based on a Bayesian approach) that processes large amounts of data quickly without the need for any special infrastructure.
 
