Genotype Imputation

December 10, 2021

Sequencing vs. Genotyping

Sequencing

  • A technology that reads the entire genome.
  • Expensive.

Genotyping

  • A technology that probes a limited number (usually 500k to 1M) of known variants in the genome.
  • Cheap.

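Genotype imputation is a way to close the gap: it infers variants that were not genotyped, approaching sequencing-level coverage at genotyping cost.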

Genotype Imputation

Why?

  • Boosting power (increasing the effective sample size by filling in the missing genotypes)
  • Fine-mapping (increasing chances of identifying a causal variant)
  • Meta-analysis (facilitating the combination of results across studies)

What?

The process of estimating genotypes at markers that are not directly assayed on GWAS arrays by comparing each sample to a reference panel of sequenced genomes.

Basic Intuition: The Li and Stephens Model

Das, Abecasis, and Browning (2018)
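
A toy illustration of the copying idea behind the Li and Stephens model: the target haplotype is treated as a copy of reference haplotypes, and untyped alleles are filled in from the reference haplotype it resembles. This sketch is a deliberate simplification (one best donor chosen by Hamming distance over the typed markers, with no recombination, no mutation parameters, and no HMM). The haplotypes are made up.

```shell
# Line 1: target haplotype ('?' = marker not on the array).
# Remaining lines: toy reference haplotypes, one allele (0/1) per marker.
imputed=$(awk '
  NR == 1 { target = $0; next }
  {
    # Count mismatches against the target at typed markers only.
    mismatches = 0
    for (i = 1; i <= length(target); i++) {
      t = substr(target, i, 1)
      if (t != "?" && t != substr($0, i, 1)) mismatches++
    }
    if (NR == 2 || mismatches < best) { best = mismatches; donor = $0 }
  }
  END {
    # Copy the donor allele wherever the target is untyped.
    out = ""
    for (i = 1; i <= length(target); i++) {
      t = substr(target, i, 1)
      out = out (t == "?" ? substr(donor, i, 1) : t)
    }
    print out
  }
' <<'EOF'
0?10?1
101001
011011
110110
EOF
)
echo "$imputed"
```

Real imputation tools let the donor switch along the chromosome and report probabilities rather than a single hard call.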

Imputation (Step 4)

Tools

  • FastPHASE (Scheet and Stephens 2006)
  • IMPUTE (Marchini et al. 2007)
  • Beagle (Browning and Browning 2007)
  • IMPUTE 2 (B. N. Howie, Donnelly, and Marchini 2009)
  • MaCH (Li et al. 2010)
  • Minimac (B. Howie et al. 2012)
  • MaCH-Admix (Liu et al. 2013)
  • Minimac2 (Fuchsberger, Abecasis, and Hinds 2015)
  • Minimac3 (Das et al. 2016)
  • IMPUTE 5 (Rubinacci, Delaneau, and Marchini 2020)

Imputation (Step 4)

Review Papers

  • Das, Abecasis, and Browning (2018)
  • Marchini and Howie (2010)

Pre-Phasing / Haplotype Estimation (Step 3)

Tools

  • Beagle (Browning and Browning 2007)
  • HAPI-UR (Williams et al. 2012)
  • SHAPEIT (Delaneau, Marchini, and Zagury 2012)
  • SHAPEIT 2 (Delaneau, Zagury, and Marchini 2013)
  • Eagle (Loh, Palamara, and Price 2016)
  • Eagle2 (Loh et al. 2016)
  • SHAPEIT 3 (O’Connell et al. 2016)
  • SHAPEIT 4 (Delaneau et al. 2019)

Liftover (Step 2)

Tools

  • CrossMap (Zhao et al. 2014)

QC (Step 1)

Checks

Strand, alleles, positions, ref/alt assignments, frequency differences.

Exclusions

Duplicates, monomorphic sites, missing genotypes, multiallelic sites, indels, invalid alleles, variants not in reference panel, allele mismatches to reference.
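
A minimal sketch of how a few of these exclusions could be flagged from a PLINK .bim file (columns: chromosome, ID, cM, position, allele 1, allele 2). The variants are made up, the reason labels are hypothetical, and indels coded as I/D would land under invalid_allele here. In practice the checking tools do this work.

```shell
flagged=$(awk '
  {
    key = $1 ":" $4                                   # chromosome:position
    reason = ""
    if (seen[key]++)                                  reason = "duplicate_position"
    else if ($5 == "0" || $6 == "0")                  reason = "monomorphic"
    else if (length($5) > 1 || length($6) > 1)        reason = "indel"
    else if ($5 !~ /^[ACGT]$/ || $6 !~ /^[ACGT]$/)    reason = "invalid_allele"
    if (reason != "") print $2, reason
  }
' <<'EOF'
1 rs1 0 1000 A G
1 rs2 0 2000 0 C
1 rs3 0 3000 AT A
1 rs4 0 1000 T C
1 rs5 0 4000 N G
1 rs6 0 5000 C T
EOF
)
echo "$flagged"
```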

Tools

Summary

Steps

  1. QC
  2. Liftover to assembly of reference panel
  3. Pre-phasing (haplotype estimation)
  4. Imputation

Reference Panels

Name                              Samples                    Autosomal Variants  Populations      Assembly  Access
1000 Genomes Phase 3 (Version 5)  2,504                      47,109,451          Multiethnic      GRCh37    Public
HRC (Version r1.1 2016)           32,470                     39,131,556          Mostly European  GRCh37    Controlled (EGA)
TOPMed r2                         97,256                     292,323,460         Multiethnic      GRCh38    Controlled (some from dbGaP)
Genome Asia                       1,654 (Pilot), 6,461 (v2)  21,494,626          Mostly Asian     GRCh37    Controlled (EGA)
CAAPA                             883                        29,842,689          Mostly African   GRCh37    Controlled (dbGaP)

Imputation Servers

Why use imputation servers?

  • Use of controlled access reference panels
  • Access to computational power
  • State-of-the-art tools and configuration

Michigan Imputation Server

TOPMed Imputation Server

Problems with Imputation Servers

  • Trust issues
  • Limited customizability
    • Phasing is done using HRC only
  • Unpredictable wait times

We have the computational resources and experience with obtaining controlled access datasets, so why not try ourselves?

EBDS (Early Brain Development Study)

Overview

  • Paper: ENIGMA Consortium et al. (2017)
  • GWAS of global brain tissue volumes in 561 infants
  • Chip: Affymetrix Axiom Genome-Wide LAT and Exome arrays, GRCh37
  • Called genotypes: 741 samples x 719,820 variants
  • Imputed genotypes: 770 samples x 9,306,407 variants
    • Imputed to 1000 Genomes (phase1_release_v3.20101123) using MaCH-Admix

1. Quality Control - HRC-1000G-check-bim (1/2)

HRC/1000G Checking: https://www.well.ox.ac.uk/~wrayner/tools/index.html#Checking

module load HRC-1000G-check-bim

Input

PLINK fileset + allele frequencies (--freq)

Output

One PLINK fileset and VCF file per chromosome, with strands, positions, ref/alt assignment updated and some variants removed (A/T & G/C SNPs if MAF > 0.4, SNPs with differing alleles, SNPs with > 0.2 allele frequency difference, SNPs not in reference panel).
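
To make two of the removal rules concrete, here is a small sketch of the palindromic-SNP rule (A/T and G/C with MAF > 0.4) and the allele frequency difference rule (> 0.2). The input layout (ID, allele 1, allele 2, MAF, allele 1 frequency in target, allele 1 frequency in reference) and the reason labels are hypothetical. The checking script itself works from the .bim, .frq, and legend files.

```shell
removed=$(awk '
  function abs(x) { return x < 0 ? -x : x }
  {
    pair = $2 $3
    # Strand cannot be resolved from alleles alone for A/T and G/C SNPs.
    palindromic = (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC")
    if (palindromic && $4 > 0.4)     print $1, "palindromic_high_maf"
    else if (abs($5 - $6) > 0.2)     print $1, "frequency_difference"
  }
' <<'EOF'
rs1 A T 0.45 0.45 0.47
rs2 A T 0.10 0.10 0.12
rs3 C T 0.30 0.30 0.55
rs4 G C 0.42 0.58 0.60
rs5 A G 0.20 0.20 0.35
EOF
)
echo "$removed"
```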

1. Quality Control - HRC-1000G-check-bim (2/2)

1000G

HRC-1000G-check-bim.pl \
  -b plink.bim \
  -f plink.frq \
  -r /mnt/research/quantgen/datasets/HRC-1000G-check-bim/legends/1000GP_Phase3_combined.legend \
  -g \
  -p ALL

HRC-format reference (here: TOPMed freeze 8)

HRC-1000G-check-bim.pl \
  -b plink.bim \
  -f plink.frq \
  -r /mnt/research/quantgen/datasets/HRC-1000G-check-bim/legends/TOPMed/freeze8/hg38/PASS.Variants.TOPMed_freeze8_hg38_dbSNP.tab \
  -h

2. Phasing - Eagle (1/2)

Eagle 2: https://alkesgroup.broadinstitute.org/Eagle/

module load eagle

Input

Reference and target in tabix-indexed BCF (recommended) or tabix-indexed VCF format. Only bi-allelic variants present in both target and reference are used.

Output

Phased VCF files.

2. Phasing - Eagle (2/2)

eagle \
  --geneticMapFile /mnt/research/quantgen/datasets/Eagle/tables/genetic_map_hg19_withX.txt.gz \
  --outPrefix chr${CHR} \
  --vcfRef /mnt/research/quantgen/datasets/Minimac4/derivative/reference_panels/G1K_P3_BCF_Files/ALL.chr${CHR}.phase3_v5.shapeit2_mvncall_integrated.noSingleton.genotypes.bcf \
  --vcfTarget /mnt/research/quantgen/projects/EBDS/validation/data/VCF/all/1000G/chr${CHR}.vcf.gz \
  --allowRefAltSwap

3. Imputation - Minimac4 (1/2)

3. Imputation - Minimac4 (2/2)

Minimac4 \
  --refHaps /mnt/research/quantgen/datasets/Minimac4/source/reference_panels/G1K_P3_M3VCF_FILES_WITH_ESTIMATES/${CHR}.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz \
  --haps /mnt/research/quantgen/projects/EBDS/validation/PIPELINE/1_phasing/all/1000G/chr${CHR}.vcf.gz \
  --prefix chr${CHR} \
  --format GT,DS,GP \
  --allTypedSites \
  --minRatio 0.00001 \
  --noPhoneHome
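
How the requested DS and GP fields relate: the dosage is the expected number of ALT alleles under the genotype probabilities, DS = P(0/1) + 2 × P(1/1). The sketch below recomputes it from made-up GP triples (sample name, then P(0/0),P(0/1),P(1/1)). It illustrates the formula only and does not show how the output files are laid out.

```shell
dosages=$(awk -F'[ ,]' '
  # Fields: sample, P(0/0), P(0/1), P(1/1)
  { printf "%s %.3f\n", $1, $3 + 2 * $4 }
' <<'EOF'
s1 0.90,0.09,0.01
s2 0.05,0.80,0.15
s3 0.00,0.02,0.98
EOF
)
echo "$dosages"
```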

4. Post-Imputation QC

  • Rsq (reported in both the VCF file and the info file): estimated squared correlation between imputed and true genotypes (1: no uncertainty in the imputed genotypes, 0: complete uncertainty)
  • Minimum Rsq for common variants: >= 0.3
  • Minimum Rsq for low-frequency / rare variants: >= 0.5
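
A sketch of applying these two thresholds to an info table. It assumes a header with columns named SNP, MAF, and Rsq, and it assumes MAF >= 0.05 as the boundary for "common"; the slide gives no boundary, so adjust as needed. The rows are made up.

```shell
kept=$(awk '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # locate columns by name
  {
    maf = $(col["MAF"]); rsq = $(col["Rsq"])
    threshold = (maf >= 0.05) ? 0.3 : 0.5                   # assumed MAF boundary
    if (rsq >= threshold) print $(col["SNP"])
  }
' <<'EOF'
SNP MAF Rsq
1:1000:A:G 0.25 0.35
1:2000:C:T 0.30 0.20
1:3000:G:A 0.01 0.45
1:4000:T:C 0.02 0.80
EOF
)
echo "$kept"
```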

Results

  • Sample size: 741 (the original study's imputed data had 770 samples)
  • Number of autosomal variants in original BED file: 696,173
  • Number of autosomal SNPs after conversion to VCF: 692,534
  • Number of SNPs after phasing: 685,916
  • Number of SNPs after imputation: 47,109,487 (compared to 49,017,216 in reference panel)
  • Number of SNPs after QC (Rsq > 0.8, MAF > 0.01): 7,492,132
  • Number of variants in the original study's imputation: 9,306,407

Validation

Protocol

  1. Randomly choose 1,000 variants [from the intersection of GRCh37 genotypes, GRCh38 genotypes, and imputed genotypes]
  2. Remove validation set from genotypes before QC
  3. Perform pre-imputation QC
  4. Submit to imputation servers and to on-site pipeline
  5. Perform post-imputation QC (keep Rsq > 0.3, MAF > 0.01)
  6. Extract validation set from imputed VCF files and convert to PLINK fileset
  7. [Get intersection of validation set and imputed genotypes and] Compute absolute correlations between calls and imputed data
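
Step 7 can be sketched as a per-variant Pearson correlation between called genotypes (0/1/2) and imputed dosages, taken in absolute value so that a swapped allele coding does not count against a variant. The long-format input (variant, called genotype, imputed dosage) and its values are made up. A variant with no variation would divide by zero here and needs to be excluded first.

```shell
correlations=$(awk '
  {
    v = $1; x = $2; y = $3
    n[v]++; sx[v] += x; sy[v] += y
    sxx[v] += x * x; syy[v] += y * y; sxy[v] += x * y
  }
  END {
    for (v in n) {
      cov  = sxy[v] - sx[v] * sy[v] / n[v]
      varx = sxx[v] - sx[v] * sx[v] / n[v]
      vary = syy[v] - sy[v] * sy[v] / n[v]
      r = cov / sqrt(varx * vary)
      printf "%s %.4f\n", v, (r < 0 ? -r : r)    # absolute correlation
    }
  }
' <<'EOF' | sort
rsA 0 0.1
rsA 1 0.9
rsA 2 1.9
rsA 1 1.1
rsB 0 2
rsB 1 1
rsB 2 0
EOF
)
echo "$correlations"
```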

Validation Results

Method           Min     Median  Mean    Max  SD
On-site (1000G)  0.0156  0.9560  0.9236  1    0.0987
MIS (1000G)      0.0120  0.9599  0.9249  1    0.1027
MIS (HRC)        0.0069  0.9693  0.9432  1    0.0833
TIS (TOPMed)     0.0069  1       0.9797  1    0.0491

MIS: Michigan Imputation Server. TIS: TOPMed Imputation Server.

p = 910 (number of validation variants compared)
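
The summary columns (Min, Median, Mean, Max, SD) can be computed from a list of per-variant correlations along these lines. The four input values are made up, and the sample standard deviation (n - 1 denominator) is an assumption, since the slide does not say which one was used.

```shell
summary=$(printf '%s\n' 0.90 1.00 0.70 0.80 | sort -n | awk '
  { x[NR] = $1; sum += $1; sumsq += $1 * $1 }
  END {
    # Input is sorted, so the median is the middle value (or mean of the two middle values).
    median = (NR % 2) ? x[(NR + 1) / 2] : (x[NR / 2] + x[NR / 2 + 1]) / 2
    mean = sum / NR
    sd = sqrt((sumsq - NR * mean * mean) / (NR - 1))
    printf "min=%.2f median=%.2f mean=%.2f max=%.2f sd=%.4f\n", x[1], median, mean, x[NR], sd
  }
')
echo "$summary"
```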

Questions?

References

They don’t all fit on this slide. Check out references.bib.

Browning, Sharon R., and Brian L. Browning. 2007. “Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering.” The American Journal of Human Genetics 81 (5): 1084–97. https://doi.org/10.1086/521987.

Das, Sayantan, Gonçalo R. Abecasis, and Brian L. Browning. 2018. “Genotype Imputation from Large Reference Panels.” Annual Review of Genomics and Human Genetics 19 (1): 73–96. https://doi.org/10.1146/annurev-genom-083117-021602.

Das, Sayantan, Lukas Forer, Sebastian Schönherr, Carlo Sidore, Adam E Locke, Alan Kwong, Scott I Vrieze, et al. 2016. “Next-Generation Genotype Imputation Service and Methods.” Nature Genetics 48 (10): 1284–87. https://doi.org/10.1038/ng.3656.

Delaneau, Olivier, Jonathan Marchini, and Jean-François Zagury. 2012. “A Linear Complexity Phasing Method for Thousands of Genomes.” Nature Methods 9 (2): 179–81. https://doi.org/10.1038/nmeth.1785.

Delaneau, Olivier, Jean-François Zagury, and Jonathan Marchini. 2013. “Improved Whole-Chromosome Phasing for Disease and Population Genetic Studies.” Nature Methods 10 (1): 5–6. https://doi.org/10.1038/nmeth.2307.

Delaneau, Olivier, Jean-François Zagury, Matthew R. Robinson, Jonathan L. Marchini, and Emmanouil T. Dermitzakis. 2019. “Accurate, Scalable and Integrative Haplotype Estimation.” Nature Communications 10 (1): 5436. https://doi.org/10.1038/s41467-019-13225-y.

ENIGMA Consortium, K Xia, J Zhang, M Ahn, S Jha, J J Crowley, J Szatkiewicz, et al. 2017. “Genome-Wide Association Analysis Identifies Common Variants Influencing Infant Brain Volumes.” Translational Psychiatry 7 (8): e1188. https://doi.org/10.1038/tp.2017.159.

Fuchsberger, Christian, Gonçalo R. Abecasis, and David A. Hinds. 2015. “Minimac2: Faster Genotype Imputation.” Bioinformatics 31 (5): 782–84. https://doi.org/10.1093/bioinformatics/btu704.

Howie, Bryan N., Peter Donnelly, and Jonathan Marchini. 2009. “A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies.” Edited by Nicholas J. Schork. PLoS Genetics 5 (6): e1000529. https://doi.org/10.1371/journal.pgen.1000529.

Howie, Bryan, Christian Fuchsberger, Matthew Stephens, Jonathan Marchini, and Gonçalo R Abecasis. 2012. “Fast and Accurate Genotype Imputation in Genome-Wide Association Studies Through Pre-Phasing.” Nature Genetics 44 (8): 955–59. https://doi.org/10.1038/ng.2354.

Li, Yun, Cristen J. Willer, Jun Ding, Paul Scheet, and Gonçalo R. Abecasis. 2010. “MaCH: Using Sequence and Genotype Data to Estimate Haplotypes and Unobserved Genotypes.” Genetic Epidemiology 34 (8): 816–34. https://doi.org/10.1002/gepi.20533.

Liu, Eric Yi, Mingyao Li, Wei Wang, and Yun Li. 2013. “MaCH-Admix: Genotype Imputation for Admixed Populations.” Genetic Epidemiology 37 (1): 25–37. https://doi.org/10.1002/gepi.21690.

Loh, Po-Ru, Petr Danecek, Pier Francesco Palamara, Christian Fuchsberger, Yakir A Reshef, Hilary K Finucane, Sebastian Schoenherr, et al. 2016. “Reference-Based Phasing Using the Haplotype Reference Consortium Panel.” Nature Genetics 48 (11): 1443–48. https://doi.org/10.1038/ng.3679.

Loh, Po-Ru, Pier Francesco Palamara, and Alkes L Price. 2016. “Fast and Accurate Long-Range Phasing in a UK Biobank Cohort.” Nature Genetics 48 (7): 811–16. https://doi.org/10.1038/ng.3571.

Marchini, Jonathan, and Bryan Howie. 2010. “Genotype Imputation for Genome-Wide Association Studies.” Nature Reviews Genetics 11 (7): 499–511. https://doi.org/10.1038/nrg2796.

Marchini, Jonathan, Bryan Howie, Simon Myers, Gil McVean, and Peter Donnelly. 2007. “A New Multipoint Method for Genome-Wide Association Studies by Imputation of Genotypes.” Nature Genetics 39 (7): 906–13. https://doi.org/10.1038/ng2088.

NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, Daniel Taliun, Daniel N. Harris, Michael D. Kessler, Jedidiah Carlson, Zachary A. Szpiech, Raul Torres, et al. 2021. “Sequencing of 53,831 Diverse Genomes from the NHLBI TOPMed Program.” Nature 590 (7845): 290–99. https://doi.org/10.1038/s41586-021-03205-y.

O’Connell, Jared, Kevin Sharp, Nick Shrine, Louise Wain, Ian Hall, Martin Tobin, Jean-François Zagury, Olivier Delaneau, and Jonathan Marchini. 2016. “Haplotype Estimation for Biobank-Scale Data Sets.” Nature Genetics 48 (7): 817–20. https://doi.org/10.1038/ng.3583.

Rubinacci, Simone, Olivier Delaneau, and Jonathan Marchini. 2020. “Genotype Imputation Using the Positional Burrows Wheeler Transform.” Edited by Paul Scheet. PLOS Genetics 16 (11): e1009049. https://doi.org/10.1371/journal.pgen.1009049.

Scheet, Paul, and Matthew Stephens. 2006. “A Fast and Flexible Statistical Model for Large-Scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase.” The American Journal of Human Genetics 78 (4): 629–44. https://doi.org/10.1086/502802.

Williams, Amy L., Nick Patterson, Joseph Glessner, Hakon Hakonarson, and David Reich. 2012. “Phasing of Many Thousands of Genotyped Samples.” The American Journal of Human Genetics 91 (2): 238–51. https://doi.org/10.1016/j.ajhg.2012.06.013.

Zhao, Hao, Zhifu Sun, Jing Wang, Haojie Huang, Jean-Pierre Kocher, and Liguo Wang. 2014. “CrossMap: A Versatile Tool for Coordinate Conversion Between Genome Assemblies.” Bioinformatics 30 (7): 1006–7. https://doi.org/10.1093/bioinformatics/btt730.