Sequencing
- A technology that reads the entire genome.
- Expensive.
Genotyping
- A technology that probes a limited number (usually 500k to 1M) of known variants in the genome.
- Cheap.
Genotype imputation is a way to close the gap.
December 10, 2021
Genotype imputation is the process of estimating genotypes at markers that are not directly assayed on GWAS arrays by comparing each sample to a reference panel of sequenced genomes (Das, Abecasis, and Browning 2018).
Checks against the reference panel: strand, alleles, positions, ref/alt assignments, frequency differences.
Variants to filter or fix before imputation: duplicates, monomorphic sites, missing genotypes, multiallelic sites, indels, invalid alleles, variants not in the reference panel, allele mismatches to the reference.
Name | Samples | Autosomal Variants | Populations | Assembly | Access |
---|---|---|---|---|---|
1000 Genomes Phase 3 (Version 5) | 2,504 | 47,109,451 | Multiethnic | GRCh37 | Public |
HRC (Version r1.1 2016) | 32,470 | 39,131,556 | Mostly European | GRCh37 | Controlled (EGA) |
TOPMed r2 | 97,256 | 292,323,460 | Multiethnic | GRCh38 | Controlled (some from dbGaP) |
Genome Asia | 1,654 (Pilot), 6,461 (v2) | 21,494,626 | Mostly Asian | GRCh37 | Controlled (EGA) |
CAAPA | 883 | 29,842,689 | Mostly African | GRCh37 | Controlled (dbGaP) |
We have the computational resources and experience obtaining controlled-access datasets, so why not run imputation ourselves?
HRC/1000G Checking: https://www.well.ox.ac.uk/~wrayner/tools/index.html#Checking
module load HRC-1000G-check-bim
Input: PLINK fileset + allele frequencies (--freq)
Output: one PLINK fileset and one VCF file per chromosome, with strand, positions, and ref/alt assignments updated and some variants removed (A/T and G/C SNPs with MAF > 0.4, SNPs with differing alleles, SNPs with an allele frequency difference > 0.2, and SNPs not in the reference panel).
HRC-1000G-check-bim.pl \
  -b plink.bim \
  -f plink.frq \
  -r /mnt/research/quantgen/datasets/HRC-1000G-check-bim/legends/1000GP_Phase3_combined.legend \
  -g \
  -p ALL

HRC-1000G-check-bim.pl \
  -b plink.bim \
  -f plink.frq \
  -r /mnt/research/quantgen/datasets/HRC-1000G-check-bim/legends/TOPMed/freeze8/hg38/PASS.Variants.TOPMed_freeze8_hg38_dbSNP.tab \
  -h
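Two of the removal rules listed for this step (A/T and G/C SNPs with MAF > 0.4, and an allele frequency difference > 0.2) can be illustrated with a small awk filter. This is only a sketch of the logic, not the HRC-1000G-check-bim.pl implementation. The function name `flag_variants`, the five-column tab-separated input layout, and the status labels are hypothetical. It also assumes that both frequencies refer to the same allele.

```shell
# Hypothetical input (tab-separated): id, allele1, allele2, target_freq, ref_freq
# Both frequencies are assumed to describe the same allele.
flag_variants() {
  awk -F'\t' '
    {
      pair = $2 $3
      # A/T and G/C SNPs look identical on both strands
      palindromic = (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC")
      maf = ($4 > 0.5) ? 1 - $4 : $4
      diff = $4 - $5
      if (diff < 0) diff = -diff
      status = "keep"
      if (palindromic && maf > 0.4) status = "drop_ambiguous"
      else if (diff > 0.2) status = "drop_freq_diff"
      print $1 "\t" status
    }'
}

# Toy records (made-up values, not real data)
printf 'rs1\tA\tT\t0.45\t0.47\nrs2\tA\tG\t0.10\t0.45\nrs3\tC\tG\t0.10\t0.12\n' | flag_variants
```

In this toy input, rs1 is flagged as strand-ambiguous, rs2 fails the frequency check, and rs3 is kept because its MAF is low enough for the strand to be resolved from the frequencies.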
Eagle 2: https://alkesgroup.broadinstitute.org/Eagle/
module load eagle
Input: reference and target in tabix-indexed BCF (recommended) or tabix-indexed VCF format. Only bi-allelic variants present in both target and reference are used.
Output: phased VCF files.
eagle \
  --geneticMapFile /mnt/research/quantgen/datasets/Eagle/tables/genetic_map_hg19_withX.txt.gz \
  --outPrefix chr${CHR} \
  --vcfRef /mnt/research/quantgen/datasets/Minimac4/derivative/reference_panels/G1K_P3_BCF_Files/ALL.chr${CHR}.phase3_v5.shapeit2_mvncall_integrated.noSingleton.genotypes.bcf \
  --vcfTarget /mnt/research/quantgen/projects/EBDS/validation/data/VCF/all/1000G/chr${CHR}.vcf.gz \
  --allowRefAltSwap
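To get a feel for how many target variants survive the restriction to bi-allelic sites shared with the reference, the overlap can be counted before phasing. The sketch below is an illustration only and is not part of Eagle. The function name `shared_biallelic` is hypothetical. It assumes uncompressed VCF text and matches records on CHROM, POS, REF, and ALT exactly, so it does not account for ref/alt swaps.

```shell
# shared_biallelic TARGET.vcf REFERENCE.vcf
# Prints CHROM:POS:REF:ALT for bi-allelic records found in both files.
shared_biallelic() {
  awk -F'\t' '
    /^#/     { next }                      # skip VCF header lines
    $5 ~ /,/ { next }                      # skip records with several ALT alleles
    { key = $1 ":" $2 ":" $4 ":" $5 }
    NR == FNR { in_ref[key] = 1; next }    # first file read: the reference
    (key in in_ref) { print key }          # second file read: the target
  ' "$2" "$1"
}

# Toy inputs with only the first five VCF columns (made-up records)
tmp=$(mktemp -d)
printf '##fileformat=VCFv4.2\n1\t100\t.\tA\tG\n1\t200\t.\tC\tT,G\n1\t300\t.\tG\tA\n' > "$tmp/ref.vcf"
printf '##fileformat=VCFv4.2\n1\t100\t.\tA\tG\n1\t200\t.\tC\tT\n1\t400\t.\tT\tC\n' > "$tmp/target.vcf"
shared_biallelic "$tmp/target.vcf" "$tmp/ref.vcf"
rm -r "$tmp"
```

Piping the result through `wc -l` gives the number of shared sites. For compressed or BCF input, the files would first need to be streamed as text.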
Minimac4: https://genome.sph.umich.edu/wiki/Minimac4
module load Minimac4
Input: reference in M3VCF format, pre-phased target in VCF format.
Output: imputed VCF files, info files.
Minimac4 \
  --refHaps /mnt/research/quantgen/datasets/Minimac4/source/reference_panels/G1K_P3_M3VCF_FILES_WITH_ESTIMATES/${CHR}.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz \
  --haps /mnt/research/quantgen/projects/EBDS/validation/PIPELINE/1_phasing/all/1000G/chr${CHR}.vcf.gz \
  --prefix chr${CHR} \
  --format GT,DS,GP \
  --allTypedSites \
  --minRatio 0.00001 \
  --noPhoneHome
Rsq (reported in both the VCF file and the info file): a measure of confidence in the imputed genotypes (1: no uncertainty in the imputed genotypes; 0: complete uncertainty about the genotypes).

Method | Min | Median | Mean | Max | SD |
---|---|---|---|---|---|
On-site (1000G) | 0.0156 | 0.9560 | 0.9236 | 1 | 0.0987 |
MIS (1000G) | 0.0120 | 0.9599 | 0.9249 | 1 | 0.1027 |
MIS (HRC) | 0.0069 | 0.9693 | 0.9432 | 1 | 0.0833 |
TIS (TOPMed) | 0.0069 | 1 | 0.9797 | 1 | 0.0491 |
p = 910
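A per-file Rsq summary of this kind can be sketched with awk. This is not the code that produced the table above. The function name `summarize_rsq` and the default threshold of 0.3 are assumptions for illustration. The sketch also assumes a tab-separated info file whose header row contains a column named Rsq, and it locates that column by name rather than by position.

```shell
# summarize_rsq INFO_FILE [THRESHOLD]
# Prints count, min, mean, max of Rsq and the number of variants at or above THRESHOLD.
summarize_rsq() {
  awk -F'\t' -v thr="${2:-0.3}" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Rsq") col = i; next }
    col {
      x = $col + 0
      if (n == 0 || x < min) min = x
      if (n == 0 || x > max) max = x
      sum += x
      n++
      if (x >= thr) pass++
    }
    END {
      if (n) printf "n=%d min=%.4f mean=%.4f max=%.4f pass=%d\n", n, min, sum / n, max, pass
    }
  ' "$1"
}

# Toy info file with only two columns (made-up values)
tmp=$(mktemp)
printf 'SNP\tRsq\n1:100:A:G\t0.9\n1:200:C:T\t0.2\n1:300:G:A\t0.7\n' > "$tmp"
summarize_rsq "$tmp" 0.3
rm "$tmp"
```

The median and SD shown in the table would need a sort step or a second pass, which is left out to keep the sketch short.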
The references don’t all fit on this slide; see references.bib.
Browning, Sharon R., and Brian L. Browning. 2007. “Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering.” The American Journal of Human Genetics 81 (5): 1084–97. https://doi.org/10.1086/521987.
Das, Sayantan, Gonçalo R. Abecasis, and Brian L. Browning. 2018. “Genotype Imputation from Large Reference Panels.” Annual Review of Genomics and Human Genetics 19 (1): 73–96. https://doi.org/10.1146/annurev-genom-083117-021602.
Das, Sayantan, Lukas Forer, Sebastian Schönherr, Carlo Sidore, Adam E Locke, Alan Kwong, Scott I Vrieze, et al. 2016. “Next-Generation Genotype Imputation Service and Methods.” Nature Genetics 48 (10): 1284–87. https://doi.org/10.1038/ng.3656.
Delaneau, Olivier, Jonathan Marchini, and Jean-François Zagury. 2012. “A Linear Complexity Phasing Method for Thousands of Genomes.” Nature Methods 9 (2): 179–81. https://doi.org/10.1038/nmeth.1785.
Delaneau, Olivier, Jean-François Zagury, and Jonathan Marchini. 2013. “Improved Whole-Chromosome Phasing for Disease and Population Genetic Studies.” Nature Methods 10 (1): 5–6. https://doi.org/10.1038/nmeth.2307.
Delaneau, Olivier, Jean-François Zagury, Matthew R. Robinson, Jonathan L. Marchini, and Emmanouil T. Dermitzakis. 2019. “Accurate, Scalable and Integrative Haplotype Estimation.” Nature Communications 10 (1): 5436. https://doi.org/10.1038/s41467-019-13225-y.
ENIGMA Consortium, K Xia, J Zhang, M Ahn, S Jha, J J Crowley, J Szatkiewicz, et al. 2017. “Genome-Wide Association Analysis Identifies Common Variants Influencing Infant Brain Volumes.” Translational Psychiatry 7 (8): e1188. https://doi.org/10.1038/tp.2017.159.
Fuchsberger, Christian, Gonçalo R. Abecasis, and David A. Hinds. 2015. “Minimac2: Faster Genotype Imputation.” Bioinformatics 31 (5): 782–84. https://doi.org/10.1093/bioinformatics/btu704.
Howie, Bryan N., Peter Donnelly, and Jonathan Marchini. 2009. “A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies.” Edited by Nicholas J. Schork. PLoS Genetics 5 (6): e1000529. https://doi.org/10.1371/journal.pgen.1000529.
Howie, Bryan, Christian Fuchsberger, Matthew Stephens, Jonathan Marchini, and Gonçalo R Abecasis. 2012. “Fast and Accurate Genotype Imputation in Genome-Wide Association Studies Through Pre-Phasing.” Nature Genetics 44 (8): 955–59. https://doi.org/10.1038/ng.2354.
Li, Yun, Cristen J. Willer, Jun Ding, Paul Scheet, and Gonçalo R. Abecasis. 2010. “MaCH: Using Sequence and Genotype Data to Estimate Haplotypes and Unobserved Genotypes.” Genetic Epidemiology 34 (8): 816–34. https://doi.org/10.1002/gepi.20533.
Liu, Eric Yi, Mingyao Li, Wei Wang, and Yun Li. 2013. “MaCH-Admix: Genotype Imputation for Admixed Populations.” Genetic Epidemiology 37 (1): 25–37. https://doi.org/10.1002/gepi.21690.
Loh, Po-Ru, Petr Danecek, Pier Francesco Palamara, Christian Fuchsberger, Yakir A Reshef, Hilary K Finucane, Sebastian Schoenherr, et al. 2016. “Reference-Based Phasing Using the Haplotype Reference Consortium Panel.” Nature Genetics 48 (11): 1443–48. https://doi.org/10.1038/ng.3679.
Loh, Po-Ru, Pier Francesco Palamara, and Alkes L Price. 2016. “Fast and Accurate Long-Range Phasing in a UK Biobank Cohort.” Nature Genetics 48 (7): 811–16. https://doi.org/10.1038/ng.3571.
Marchini, Jonathan, and Bryan Howie. 2010. “Genotype Imputation for Genome-Wide Association Studies.” Nature Reviews Genetics 11 (7): 499–511. https://doi.org/10.1038/nrg2796.
Marchini, Jonathan, Bryan Howie, Simon Myers, Gil McVean, and Peter Donnelly. 2007. “A New Multipoint Method for Genome-Wide Association Studies by Imputation of Genotypes.” Nature Genetics 39 (7): 906–13. https://doi.org/10.1038/ng2088.
NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, Daniel Taliun, Daniel N. Harris, Michael D. Kessler, Jedidiah Carlson, Zachary A. Szpiech, Raul Torres, et al. 2021. “Sequencing of 53,831 Diverse Genomes from the NHLBI TOPMed Program.” Nature 590 (7845): 290–99. https://doi.org/10.1038/s41586-021-03205-y.
O’Connell, Jared, Kevin Sharp, Nick Shrine, Louise Wain, Ian Hall, Martin Tobin, Jean-François Zagury, Olivier Delaneau, and Jonathan Marchini. 2016. “Haplotype Estimation for Biobank-Scale Data Sets.” Nature Genetics 48 (7): 817–20. https://doi.org/10.1038/ng.3583.
Rubinacci, Simone, Olivier Delaneau, and Jonathan Marchini. 2020. “Genotype Imputation Using the Positional Burrows Wheeler Transform.” Edited by Paul Scheet. PLOS Genetics 16 (11): e1009049. https://doi.org/10.1371/journal.pgen.1009049.
Scheet, Paul, and Matthew Stephens. 2006. “A Fast and Flexible Statistical Model for Large-Scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase.” The American Journal of Human Genetics 78 (4): 629–44. https://doi.org/10.1086/502802.
Williams, Amy L., Nick Patterson, Joseph Glessner, Hakon Hakonarson, and David Reich. 2012. “Phasing of Many Thousands of Genotyped Samples.” The American Journal of Human Genetics 91 (2): 238–51. https://doi.org/10.1016/j.ajhg.2012.06.013.
Zhao, Hao, Zhifu Sun, Jing Wang, Haojie Huang, Jean-Pierre Kocher, and Liguo Wang. 2014. “CrossMap: A Versatile Tool for Coordinate Conversion Between Genome Assemblies.” Bioinformatics 30 (7): 1006–7. https://doi.org/10.1093/bioinformatics/btt730.