Genotype Imputation

December 10, 2021

Sequencing vs. Genotyping

Sequencing

A technology that reads the entire genome.
Expensive.

Genotyping

A technology that probes a limited number (usually 500k to 1M) of known variants in the genome.
Cheap.

Genotype imputation is a way to close the gap: it estimates the variants that a genotyping array does not probe.

Genotype Imputation

Why?

Boosting power (increasing the effective sample size by filling in the missing genotypes)
Fine-mapping (increasing chances of identifying a causal variant)
Meta-analysis (facilitating the combination of results across studies)
…

What?

The process of estimating genotypes at markers that are not directly assayed on GWAS arrays by comparing each sample to a reference panel of sequenced genomes.

Basic Intuition: The Li and Stephens Model

Das, Abecasis, and Browning (2018)

“The basic intuition behind genotype imputation is as follows: Any two individuals, even if apparently unrelated, can share short stretches of chromosome derived from a distant common ancestor. Consequently, once a study sample is genotyped on a commercial array (with mostly missing data), the observed genotypes can be used to identify DNA segments shared between the study sample and a reference panel of sequenced genomes (with no missing data). In this way, a study haplotype can be represented as a mosaic of short segments of related haplotypes found in the reference panel, enabling one to impute the sites that were not genotyped. Points where the reference haplotype template changes represent historical recombination events. Points where the observed target allele differs from the template allele represent historical mutation events, gene conversion events, genotype error, or even erroneously assigned matches. Since a study haplotype can be represented by many possible mosaics of reference haplotype segments, a probabilistic framework is needed to summarize information from all possible mosaics into imputed alleles.” (Das, Abecasis, and Browning 2018)

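If you go far enough up your ancestry tree, you will find short stretches of DNA that are inherited together, so-called haplotypes. Imputation is essentially pattern matching: find reference haplotypes that agree with the sample at the genotyped sites and copy their alleles at the sites that were not genotyped.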
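The pattern-matching intuition can be sketched with a toy example. This is only an illustration under strong assumptions: it copies untyped alleles from the single best-matching reference haplotype, whereas the tools listed later use a hidden Markov model that averages over many possible mosaics. The function name `impute_toy` and the `?` marker for untyped sites are hypothetical.

```shell
#!/bin/sh
# Toy sketch, NOT the Li and Stephens HMM: pick the reference haplotype with
# the fewest mismatches at typed sites and copy its alleles into untyped sites.
# impute_toy is a hypothetical helper; '?' marks an untyped site.
impute_toy() {
  target=$1; shift                      # $1 = target haplotype, rest = references
  printf '%s\n' "$@" | awk -v target="$target" '
    {
      # Count mismatches between this reference and the target at typed sites.
      mismatches = 0
      for (i = 1; i <= length(target); i++) {
        t = substr(target, i, 1)
        if (t != "?" && t != substr($0, i, 1)) mismatches++
      }
      if (NR == 1 || mismatches < best) { best = mismatches; template = $0 }
    }
    END {
      # Keep observed alleles; fill untyped sites from the best template.
      out = ""
      for (i = 1; i <= length(target); i++) {
        t = substr(target, i, 1)
        out = out (t == "?" ? substr(template, i, 1) : t)
      }
      print out
    }'
}

impute_toy "A?G?T" "ACGTA" "TTCAT" "GCCTA"   # prints ACGTT
```

The best template here (ACGTA) still differs from the target at the last site; the observed allele is kept, which mirrors the mutation or genotyping-error case in the quote above.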

Imputation (Step 4)

Tools

FastPHASE (Scheet and Stephens 2006)
IMPUTE (Marchini et al. 2007)
Beagle (Browning and Browning 2007)
IMPUTE 2 (B. N. Howie, Donnelly, and Marchini 2009)
MaCH (Li et al. 2010)
Minimac (B. Howie et al. 2012)
MaCH-Admix (Liu et al. 2013)
Minimac2 (Fuchsberger, Abecasis, and Hinds 2015)
Minimac3 (Das et al. 2016)
IMPUTE 5 (Rubinacci, Delaneau, and Marchini 2020)

Imputation (Step 4)

Review Papers

Das, Abecasis, and Browning (2018)
Marchini and Howie (2010)

Pre-Phasing / Haplotype Estimation (Step 3)

Tools

Beagle (Browning and Browning 2007)
HAPI-UR (Williams et al. 2012)
SHAPEIT (Delaneau, Marchini, and Zagury 2012)
SHAPEIT 2 (Delaneau, Zagury, and Marchini 2013)
Eagle (Loh, Palamara, and Price 2016)
Eagle2 (Loh et al. 2016)
SHAPEIT 3 (O’Connell et al. 2016)
SHAPEIT 4 (Delaneau et al. 2019)

Liftover (Step 2)

Tools

liftOver
CrossMap (Zhao et al. 2014)

QC (Step 1)

Checks

Strand, alleles, positions, ref/alt assignments, frequency differences.

Exclusions

Duplicates, monomorphic sites, missing genotypes, multiallelic sites, indels, invalid alleles, variants not in reference panel, allele mismatches to reference.

Tools

HRC-1000G-check-bim (https://www.well.ox.ac.uk/~wrayner/tools/index.html#Checking)

Summary

Steps

QC
Liftover to assembly of reference panel
Pre-phasing (haplotype estimation)
Imputation

Reference Panels

Populations used to compare the allele frequencies between your data and the reference panel.

Imputed TO the reference panel

“a combination of technologies including low coverage WGS (mean depth 7.4X), high coverage whole exome sequencing (mean depth 65.7X), and microarray genotyping.” (https://www.biorxiv.org/content/10.1101/2021.02.06.430068v1)

Note: the reference panel contains only the variant sites.

“Coverage refers to the number of times the sequencing machine will sequence your genome. Because you have 6 billion letters in your genome, even if the sequencing machine was incredibly accurate 99.99% of the time, the 0.01% error rate means that your genome may have 600,000 errors!” (https://sequencing.com/blog/post/what-30x-and-04x-whole-genome-sequencing)

A larger panel provides (1) a more detailed catalog of genetic variants, which increases the chance of imputing a causal variant, and (2) better imputation accuracy, which improves the power of downstream association analyses, especially for rare variants.

Name	Samples	Autosomal Variants	Populations	Assembly	Access
1000 Genomes Phase 3 (Version 5)	2,504	47,109,451	Multiethnic	GRCh37	Public
HRC (Version r1.1 2016)	32,470	39,131,556	Mostly European	GRCh37	Controlled (EGA)
TOPMed r2	97,256	292,323,460	Multiethnic	GRCh38	Controlled (some from dbGaP)
Genome Asia	1,654 (Pilot), 6,461 (v2)	21,494,626	Mostly Asian	GRCh37	Controlled (EGA)
CAAPA	883	29,842,689	Mostly African	GRCh37	Controlled (dbGaP)

Imputation Servers

Why use imputation servers?

Use of controlled access reference panels
Access to computational power
State of the art tools and configuration

Michigan Imputation Server

Paper: Das et al. (2016)
Website: https://imputationserver.sph.umich.edu
Pipeline description: https://imputationserver.readthedocs.io/en/latest/pipeline/
GitHub: https://github.com/genepi/imputationserver
Reference Panels:
- 1000 Genomes
- HRC
- Genome Asia
- CAAPA
- HapMap 2
- HLA

TOPMed Imputation Server

Paper: NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium et al. (2021)
Website: https://imputation.biodatacatalyst.nhlbi.nih.gov
Reference Panels:
- TOPMed r2 (based on TOPMed Freeze 8)

Problems with Imputation Servers

Trust issues
Limited customizability
- Phasing is done using HRC only
Unpredictable wait times

We have the computational resources and experience with obtaining controlled access datasets, so why not try ourselves?

EBDS (Early Brain Development Study)

Overview

Paper: ENIGMA Consortium et al. (2017)
GWAS of global brain tissue volumes in 561 infants
Chip: Affymetrix Axiom Genome-Wide LAT and Exome arrays, GRCh37
Called genotypes: 741 samples x 719,820 variants
Imputed genotypes: 770 samples x 9,306,407 variants
- Imputed to 1000 Genomes (phase1_release_v3.20101123) using MaCH-Admix

1. Quality Control - HRC-1000G-check-bim (1/2)

HRC/1000G Checking: https://www.well.ox.ac.uk/~wrayner/tools/index.html#Checking

module load HRC-1000G-check-bim

Input

PLINK fileset + allele frequencies (--freq)

Output

One PLINK fileset and VCF file per chromosome, with strands, positions, ref/alt assignment updated and some variants removed (A/T & G/C SNPs if MAF > 0.4, SNPs with differing alleles, SNPs with > 0.2 allele frequency difference, SNPs not in reference panel).
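One of the exclusions above can be illustrated directly on the frequency file. For A/T and G/C SNPs both strands carry the same allele pair, so the strand can only be inferred from allele frequency, and that stops working as MAF approaches 0.5. The sketch below is not part of HRC-1000G-check-bim; `flag_palindromic` is a hypothetical helper that assumes the standard PLINK 1.x `.frq` column order (CHR, SNP, A1, A2, MAF, NCHROBS).

```shell
#!/bin/sh
# Hypothetical helper: read a PLINK .frq file on stdin and print the IDs of
# A/T and G/C SNPs whose MAF exceeds the cutoff (default 0.4).
flag_palindromic() {
  awk -v maf_cutoff="${1:-0.4}" '
    NR == 1 { next }                      # skip the header line
    {
      pair = $3 $4                        # A1 followed by A2
      palindromic = (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC")
      # Non-numeric MAF values such as NA evaluate to 0 and are never flagged.
      if (palindromic && $5 + 0 > maf_cutoff + 0) print $2
    }'
}

flag_palindromic <<'EOF'
 CHR  SNP   A1   A2    MAF  NCHROBS
   1  rs1    A    T   0.45     1482
   1  rs2    A    T   0.10     1482
   1  rs3    G    C   0.48     1480
   1  rs4    A    G   0.49     1482
EOF
# prints rs1 and rs3, one per line
```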

1. Quality Control - HRC-1000G-check-bim (2/2)

1000G

HRC-1000G-check-bim.pl \
  -b plink.bim \
  -f plink.frq \
  -r /mnt/research/quantgen/datasets/HRC-1000G-check-bim/legends/1000GP_Phase3_combined.legend \
  -g \
  -p ALL

HRC (-h; note that the reference path shown points to the TOPMed freeze 8 hg38 site list)

HRC-1000G-check-bim.pl \
  -b plink.bim \
  -f plink.frq \
  -r /mnt/research/quantgen/datasets/HRC-1000G-check-bim/legends/TOPMed/freeze8/hg38/PASS.Variants.TOPMed_freeze8_hg38_dbSNP.tab \
  -h

2. Phasing - Eagle (1/2)

Eagle 2: https://alkesgroup.broadinstitute.org/Eagle/

module load eagle

Input

Reference and target in tabix-indexed BCF (recommended) or tabix-indexed VCF format. Only bi-allelic variants that are present in both target and reference are used.

Output

Phased VCF files.

2. Phasing - Eagle (2/2)

eagle \
  --geneticMapFile /mnt/research/quantgen/datasets/Eagle/tables/genetic_map_hg19_withX.txt.gz \
  --outPrefix chr${CHR} \
  --vcfRef /mnt/research/quantgen/datasets/Minimac4/derivative/reference_panels/G1K_P3_BCF_Files/ALL.chr${CHR}.phase3_v5.shapeit2_mvncall_integrated.noSingleton.genotypes.bcf \
  --vcfTarget /mnt/research/quantgen/projects/EBDS/validation/data/VCF/all/1000G/chr${CHR}.vcf.gz \
  --allowRefAltSwap

3. Imputation - Minimac4 (1/2)

Minimac4: https://genome.sph.umich.edu/wiki/Minimac4

module load Minimac4

Input

Reference in M3VCF format, pre-phased target in VCF format.

Output

Imputed VCF files, info files.

3. Imputation - Minimac4 (2/2)

Minimac4 \
  --refHaps /mnt/research/quantgen/datasets/Minimac4/source/reference_panels/G1K_P3_M3VCF_FILES_WITH_ESTIMATES/${CHR}.1000g.Phase3.v5.With.Parameter.Estimates.m3vcf.gz \
  --haps /mnt/research/quantgen/projects/EBDS/validation/PIPELINE/1_phasing/all/1000G/chr${CHR}.vcf.gz \
  --prefix chr${CHR} \
  --format GT,DS,GP \
  --allTypedSites \
  --minRatio 0.00001 \
  --noPhoneHome

4. QC

Rsq (reported in both the VCF file and the info file): estimated imputation quality, i.e. the estimated squared correlation between imputed and true genotypes (1: no uncertainty in the imputed genotypes, 0: complete uncertainty about the genotypes)
Minimum Rsq for common variants: >= 0.3
Minimum Rsq for low-frequency / rare variants: >= 0.5
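The two thresholds above could be applied to an info file roughly as follows. This is a sketch under assumptions: the file is taken to be whitespace-delimited with a header containing columns named `MAF` and `Rsq`, the helper name `filter_info` is hypothetical, and the MAF boundary between "common" and "low frequency / rare" (0.05 by default here) is not stated on the slide, so it is exposed as a parameter.

```shell
#!/bin/sh
# Hypothetical helper: read an info file on stdin, print the header plus the
# rows that pass a MAF-dependent Rsq threshold.
#   MAF >= maf_split -> Rsq >= 0.3   (common variants)
#   MAF <  maf_split -> Rsq >= 0.5   (low-frequency / rare variants)
# maf_split defaults to 0.05, which is an assumption, not a value from the slides.
filter_info() {
  awk -v maf_split="${1:-0.05}" '
    NR == 1 {
      # Locate the columns by name instead of relying on a fixed position.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("MAF" in col) || !("Rsq" in col)) {
        print "missing MAF or Rsq column" | "cat 1>&2"
        exit 1
      }
      print
      next
    }
    {
      # Non-numeric values (for example "-") evaluate to 0 and are dropped.
      threshold = ($(col["MAF"]) + 0 >= maf_split + 0) ? 0.3 : 0.5
      if ($(col["Rsq"]) + 0 >= threshold) print
    }'
}

printf 'SNP\tMAF\tRsq\n1:100\t0.20\t0.35\n1:200\t0.20\t0.25\n1:300\t0.01\t0.45\n1:400\t0.01\t0.80\n' |
  filter_info
# keeps the header, 1:100 and 1:400
```

Looking the columns up by name keeps the sketch independent of the exact column order, which can differ between tool versions.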

Results

Sample size: 741 (the original study imputed 770 samples)
Number of autosomal variants in original BED file: 696,173
Number of autosomal SNPs after conversion to VCF: 692,534
Number of SNPs after phasing: 685,916
Number of SNPs after imputation: 47,109,487 (compared to 49,017,216 in reference panel)
Number of SNPs after QC (Rsq > 0.8, MAF > 0.01): 7,492,132
Number of variants in the original study's imputation: 9,306,407

Validation

Protocol

Randomly choose 1,000 variants [from the intersection of GRCh37 genotypes, GRCh38 genotypes, and imputed genotypes]
Remove validation set from genotypes before QC
Perform pre-imputation QC
Submit to imputation servers and to on-site pipeline
Perform post-imputation QC (keep Rsq > 0.3, MAF > 0.01)
Extract validation set from imputed VCF files and convert to PLINK fileset
[Get intersection of validation set and imputed genotypes and] Compute absolute correlations between calls and imputed data
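The last step of the protocol can be sketched for a single variant as follows. The input layout is an assumption: one sample per line, with the called genotype (0, 1 or 2) in the first column and the imputed dosage in the second. `abs_cor` is a hypothetical helper, not part of the pipeline shown in this deck. Taking the absolute value makes the result insensitive to which allele is counted in each data set.

```shell
#!/bin/sh
# Hypothetical helper: read "called_genotype imputed_dosage" pairs on stdin
# and print the absolute Pearson correlation, or NA if it is undefined.
abs_cor() {
  awk '
    $1 != "NA" && $2 != "NA" {            # skip samples with a missing value
      n++
      sx += $1; sy += $2
      sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2
    }
    END {
      vx = n * sxx - sx * sx              # n^2 times the variance of the calls
      vy = n * syy - sy * sy              # n^2 times the variance of the dosages
      if (n < 2 || vx <= 0 || vy <= 0) { print "NA"; exit }
      r = (n * sxy - sx * sy) / sqrt(vx * vy)
      printf "%.4f\n", (r < 0 ? -r : r)
    }'
}

printf '0 0.1\n1 0.9\n2 1.8\n1 1.2\n' | abs_cor   # prints 0.9815
```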

Validation Results

Method	Min	Median	Mean	Max	SD
On-site (1000G)	0.0156	0.9560	0.9236	1	0.0987
MIS (1000G)	0.0120	0.9599	0.9249	1	0.1027
MIS (HRC)	0.0069	0.9693	0.9432	1	0.0833
TIS (TOPMed)	0.0069	1	0.9797	1	0.0491

p = 910 variants. MIS: Michigan Imputation Server; TIS: TOPMed Imputation Server.
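The summary columns in the table above (Min, Median, Mean, Max, SD) could be computed from a list of per-variant correlations along these lines. This is a sketch with assumptions: input is one correlation per line, `summarize` is a hypothetical helper, and the SD is the sample standard deviation (n - 1 denominator), which may or may not match how the table was produced.

```shell
#!/bin/sh
# Hypothetical helper: read one value per line on stdin and print
# min, median, mean, max and sample SD as a tab-separated row.
summarize() {
  sort -n | awk '
    { v[NR] = $1; sum += $1; sumsq += $1 * $1 }
    END {
      if (NR == 0) exit
      mean = sum / NR
      # Input is sorted, so the median is the middle value (or the mean of the two middle values).
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      var = (NR > 1) ? (sumsq - NR * mean * mean) / (NR - 1) : 0
      if (var < 0) var = 0                # guard against rounding below zero
      printf "%.4f\t%.4f\t%.4f\t%.4f\t%.4f\n", v[1], median, mean, v[NR], sqrt(var)
    }'
}

printf '0.5\n1\n0.9\n0.8\n' | summarize
# prints 0.5000  0.8500  0.8000  1.0000  0.2160 (tab-separated)
```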

Questions?

References

They don’t all fit on this slide. Check out references.bib.

Browning, Sharon R., and Brian L. Browning. 2007. “Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering.” The American Journal of Human Genetics 81 (5): 1084–97. https://doi.org/10.1086/521987.

Das, Sayantan, Gonçalo R. Abecasis, and Brian L. Browning. 2018. “Genotype Imputation from Large Reference Panels.” Annual Review of Genomics and Human Genetics 19 (1): 73–96. https://doi.org/10.1146/annurev-genom-083117-021602.

Das, Sayantan, Lukas Forer, Sebastian Schönherr, Carlo Sidore, Adam E Locke, Alan Kwong, Scott I Vrieze, et al. 2016. “Next-Generation Genotype Imputation Service and Methods.” Nature Genetics 48 (10): 1284–87. https://doi.org/10.1038/ng.3656.

Delaneau, Olivier, Jonathan Marchini, and Jean-François Zagury. 2012. “A Linear Complexity Phasing Method for Thousands of Genomes.” Nature Methods 9 (2): 179–81. https://doi.org/10.1038/nmeth.1785.

Delaneau, Olivier, Jean-François Zagury, and Jonathan Marchini. 2013. “Improved Whole-Chromosome Phasing for Disease and Population Genetic Studies.” Nature Methods 10 (1): 5–6. https://doi.org/10.1038/nmeth.2307.

Delaneau, Olivier, Jean-François Zagury, Matthew R. Robinson, Jonathan L. Marchini, and Emmanouil T. Dermitzakis. 2019. “Accurate, Scalable and Integrative Haplotype Estimation.” Nature Communications 10 (1): 5436. https://doi.org/10.1038/s41467-019-13225-y.

ENIGMA Consortium, K Xia, J Zhang, M Ahn, S Jha, J J Crowley, J Szatkiewicz, et al. 2017. “Genome-Wide Association Analysis Identifies Common Variants Influencing Infant Brain Volumes.” Translational Psychiatry 7 (8): e1188. https://doi.org/10.1038/tp.2017.159.

Fuchsberger, Christian, Gonçalo R. Abecasis, and David A. Hinds. 2015. “Minimac2: Faster Genotype Imputation.” Bioinformatics 31 (5): 782–84. https://doi.org/10.1093/bioinformatics/btu704.

Howie, Bryan N., Peter Donnelly, and Jonathan Marchini. 2009. “A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies.” Edited by Nicholas J. Schork. PLoS Genetics 5 (6): e1000529. https://doi.org/10.1371/journal.pgen.1000529.

Howie, Bryan, Christian Fuchsberger, Matthew Stephens, Jonathan Marchini, and Gonçalo R Abecasis. 2012. “Fast and Accurate Genotype Imputation in Genome-Wide Association Studies Through Pre-Phasing.” Nature Genetics 44 (8): 955–59. https://doi.org/10.1038/ng.2354.

Li, Yun, Cristen J. Willer, Jun Ding, Paul Scheet, and Gonçalo R. Abecasis. 2010. “MaCH: Using Sequence and Genotype Data to Estimate Haplotypes and Unobserved Genotypes.” Genetic Epidemiology 34 (8): 816–34. https://doi.org/10.1002/gepi.20533.

Liu, Eric Yi, Mingyao Li, Wei Wang, and Yun Li. 2013. “MaCH-Admix: Genotype Imputation for Admixed Populations.” Genetic Epidemiology 37 (1): 25–37. https://doi.org/10.1002/gepi.21690.

Loh, Po-Ru, Petr Danecek, Pier Francesco Palamara, Christian Fuchsberger, Yakir A Reshef, Hilary K Finucane, Sebastian Schoenherr, et al. 2016. “Reference-Based Phasing Using the Haplotype Reference Consortium Panel.” Nature Genetics 48 (11): 1443–48. https://doi.org/10.1038/ng.3679.

Loh, Po-Ru, Pier Francesco Palamara, and Alkes L Price. 2016. “Fast and Accurate Long-Range Phasing in a UK Biobank Cohort.” Nature Genetics 48 (7): 811–16. https://doi.org/10.1038/ng.3571.

Marchini, Jonathan, and Bryan Howie. 2010. “Genotype Imputation for Genome-Wide Association Studies.” Nature Reviews Genetics 11 (7): 499–511. https://doi.org/10.1038/nrg2796.

Marchini, Jonathan, Bryan Howie, Simon Myers, Gil McVean, and Peter Donnelly. 2007. “A New Multipoint Method for Genome-Wide Association Studies by Imputation of Genotypes.” Nature Genetics 39 (7): 906–13. https://doi.org/10.1038/ng2088.

NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium, Daniel Taliun, Daniel N. Harris, Michael D. Kessler, Jedidiah Carlson, Zachary A. Szpiech, Raul Torres, et al. 2021. “Sequencing of 53,831 Diverse Genomes from the NHLBI TOPMed Program.” Nature 590 (7845): 290–99. https://doi.org/10.1038/s41586-021-03205-y.

O’Connell, Jared, Kevin Sharp, Nick Shrine, Louise Wain, Ian Hall, Martin Tobin, Jean-François Zagury, Olivier Delaneau, and Jonathan Marchini. 2016. “Haplotype Estimation for Biobank-Scale Data Sets.” Nature Genetics 48 (7): 817–20. https://doi.org/10.1038/ng.3583.

Rubinacci, Simone, Olivier Delaneau, and Jonathan Marchini. 2020. “Genotype Imputation Using the Positional Burrows Wheeler Transform.” Edited by Paul Scheet. PLOS Genetics 16 (11): e1009049. https://doi.org/10.1371/journal.pgen.1009049.

Scheet, Paul, and Matthew Stephens. 2006. “A Fast and Flexible Statistical Model for Large-Scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase.” The American Journal of Human Genetics 78 (4): 629–44. https://doi.org/10.1086/502802.

Williams, Amy L., Nick Patterson, Joseph Glessner, Hakon Hakonarson, and David Reich. 2012. “Phasing of Many Thousands of Genotyped Samples.” The American Journal of Human Genetics 91 (2): 238–51. https://doi.org/10.1016/j.ajhg.2012.06.013.

Zhao, Hao, Zhifu Sun, Jing Wang, Haojie Huang, Jean-Pierre Kocher, and Liguo Wang. 2014. “CrossMap: A Versatile Tool for Coordinate Conversion Between Genome Assemblies.” Bioinformatics 30 (7): 1006–7. https://doi.org/10.1093/bioinformatics/btt730.