Step 4: Genotype Imputation

Cloud-based imputation via Michigan Imputation Server v2 — phasing (EAGLE v2.4) and imputation (Minimac v4.1.6) against the 1000 Genomes Phase 3 30x reference panel (GRCh38/hg38)

✓ Completed — April 2026

1. Overview

Genotype imputation statistically infers untyped variants using haplotype patterns shared between the study samples and a large reference panel. A genotyping array captures ~500,000–700,000 tag SNPs; imputation leverages linkage disequilibrium (LD) to predict the remaining ~10–40 million common variants in the genome.

The process has two stages: (1) phasing — resolving diploid genotypes into haplotypes — and (2) imputation — matching study haplotypes against reference panel haplotypes to infer missing positions. Each imputed genotype receives a dosage (0–2 continuous) and an INFO/R² quality score reflecting imputation confidence.

Why impute? Association studies gain statistical power from increased variant density. Imputation also enables meta-analysis across studies genotyped on different arrays (each array types a slightly different SNP set, but imputation fills the gaps so all studies share a common variant space).

Key Metrics

  • Typed Variants (input): 456,684
  • Matched to Reference: 448,305
  • Reference Overlap: 98.31%
  • Total Variants (output): 48.89M
  • Mean R² Score: 0.407

2. Imputation Concepts

2a. Haplotype Phasing (EAGLE2)

Genotyping arrays produce unphased diploid genotypes: at a heterozygous site we know the individual carries one copy of each allele, but not which chromosome copy carries which allele relative to the alleles at neighboring sites. Phasing resolves this ambiguity by examining patterns of co-inheritance across many individuals. EAGLE2 uses a positional Burrows-Wheeler transform (PBWT) to identify long haplotype matches and infers phase with high accuracy (>99% switch accuracy at N > 1,000).

Accurate phasing is critical because imputation operates on haplotypes, not genotypes. Phase errors propagate into imputation errors, particularly for rare variants where fewer reference haplotypes match.
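
In a VCF, the difference between unphased and phased calls shows up in the GT separator: `0/1` is an unphased heterozygote, while `0|1` and `1|0` state which haplotype carries the alternate allele. The following is a minimal sketch for counting both kinds of call in VCF text; the function name `count_phase_states` and the toy record are illustrative assumptions, not part of the pipeline.

```shell
# Count phased ("|") versus unphased ("/") genotype calls in VCF text on stdin.
# Relies on GT being the first FORMAT subfield, as the VCF specification requires.
count_phase_states() {
  awk -F'\t' '
    !/^#/ {
      for (i = 10; i <= NF; i++) {
        split($i, subfields, ":")
        gt = subfields[1]
        if (gt ~ /\|/) phased++
        else if (gt ~ /\//) unphased++
      }
    }
    END { printf "phased=%d unphased=%d\n", phased, unphased }
  '
}

# Toy record with two samples: the same heterozygous call before and after phasing.
printf 'chr1\t1000\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1\t0|1\n' | count_phase_states
```

On real data the same function could be fed a few records, for example from `bcftools view -H` piped through `head`, to confirm that an array VCF is unphased and a server output is phased.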

2b. Statistical Imputation (Minimac4)

Minimac4 takes phased study haplotypes and compares them to the reference panel haplotypes in a sliding window across each chromosome. For each untyped position, it identifies which reference haplotypes best match the flanking typed SNPs and computes a weighted average of reference alleles — producing a dosage value (continuous 0–2) rather than a hard genotype call.

Dosages preserve imputation uncertainty: because dosage = P(1) + 2·P(2), a dosage of 1.85 means "most likely 2 (homozygous alternate), with roughly a 15% probability of 1 (heterozygous)" when the probability of 0 is negligible. Downstream association tests should use dosages, not hard calls, to properly account for this uncertainty.
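
The relation between genotype probabilities (GP) and dosage (DS) can be checked with a one-line calculation. This is a small sketch; the helper name `dosage_from_gp` is hypothetical, and its three arguments are the probabilities of carrying 0, 1, and 2 alternate alleles.

```shell
# Expected alternate-allele count from genotype probabilities:
#   dosage = 0*P(0) + 1*P(1) + 2*P(2)
# $1 = P(0) (accepted for completeness, weight 0), $2 = P(1), $3 = P(2)
dosage_from_gp() {
  awk -v p1="$2" -v p2="$3" 'BEGIN { printf "%.2f\n", p1 + 2 * p2 }'
}

dosage_from_gp 0.00 0.15 0.85   # prints 1.85
dosage_from_gp 0.01 0.13 0.86   # prints 1.85 as well
```

The two calls give the same dosage from different probability triples, which is why a dosage alone does not pin down the underlying genotype probabilities.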

2c. Reference Panel: 1000 Genomes Phase 3 30x

The reference panel provides the haplotype templates against which study samples are imputed. The 1000 Genomes Phase 3 30x panel used here contains 2,504 individuals from 26 populations across 5 super-populations (AFR, AMR, EAS, EUR, SAS). It includes ~80 million biallelic variants on the GRCh38 reference.

| Parameter | Value |
|---|---|
| Panel | 1000 Genomes Phase 3 30x (GRCh38/hg38) |
| Reference samples | 2,504 |
| Populations | 26 (5 super-populations) |
| Panel variants | ~80 million biallelic sites |
| Most relevant super-pop | EUR + SAS (Central Asian samples share ancestry with both) |

Panel choice note: The 1000 Genomes panel does not include Central Asian populations. Uzbek samples are genetically intermediate between European and South Asian groups (as shown by PCA in Step 7). Imputation quality for population-specific variants may be lower than for cosmopolitan variants. TOPMed (a larger, more diverse panel) would likely improve accuracy but requires dbGaP access.

2d. INFO / R² Quality Score

The INFO score (reported by Minimac as R²) measures imputation quality per variant. It estimates the squared correlation between the true genotype and the imputed dosage:

  • INFO = 1.0: Perfect imputation (typically typed variants).
  • INFO > 0.90: High quality — suitable for all downstream analyses.
  • INFO 0.80–0.90: Moderate quality — acceptable for most analyses; increased noise in effect size estimates.
  • INFO 0.30–0.80: Low quality — imputation uncertain; use with caution.
  • INFO < 0.30: Poor quality — typically excluded.

Common practice: retain variants with INFO ≥ 0.30 (lenient) or ≥ 0.80 (stringent). The choice depends on the analysis: discovery GWAS tolerates INFO ≥ 0.30; fine-mapping or candidate gene analysis requires INFO ≥ 0.80.
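
The tiers above translate directly into a threshold lookup. The sketch below is illustrative only; the function name `r2_tier` and the tier labels are assumptions chosen to mirror the list, not names used elsewhere in the pipeline.

```shell
# Map a single R² value to the quality tier used in this section.
# Boundaries follow the list above: 0.90, 0.80, and 0.30.
r2_tier() {
  awk -v r2="$1" 'BEGIN {
    if      (r2 >= 0.90) print "high"
    else if (r2 >= 0.80) print "moderate"
    else if (r2 >= 0.30) print "low"
    else                 print "poor"
  }'
}

r2_tier 0.95   # high
r2_tier 0.85   # moderate
r2_tier 0.45   # low
r2_tier 0.02   # poor
```

A lenient discovery filter keeps everything except the "poor" tier, while a stringent filter keeps only "high" and "moderate".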

3. INFO Score Distribution

Imputation Quality (INFO / R²) Distribution
N = 48,887,364 imputed variants | Mean R² = 0.4069 | Spring 2026
| INFO Range | Variants | % of Total | Interpretation |
|---|---|---|---|
| ≥ 0.90 (high quality) | 7,295,240 | 14.9% | Suitable for all analyses |
| 0.80–0.90 | 3,788,515 | 7.7% | Acceptable; slight noise increase |
| 0.30–0.80 | 14,269,686 | 29.2% | Use with caution |
| < 0.30 | 23,533,923 | 48.1% | Typically excluded |

✓ Michigan QC passed: 448,305 of 456,684 input variants matched the 1000G Phase 3 30x reference; the server reports 98.31% reference overlap. 647 sites excluded (264 invalid alleles, 383 allele mismatches). 0 strand flips.
R² distribution is typical for raw imputation output: 48.1% of variants have R² < 0.30 and 22.7% have R² ≥ 0.80. This is expected — the raw output includes all imputed positions including rare variants (MAF < 0.5%) and poorly-tagged regions. Standard post-imputation filtering (Step 5: INFO ≥ 0.30 or ≥ 0.80) retains only well-imputed variants for downstream analysis. Mean R² = 0.407, consistent with 1000G-based imputation of a Central Asian cohort.
Variants with low INFO: These are predominantly rare variants (MAF < 1% in the reference panel) and variants in regions of low LD where the typed tag SNPs provide limited information. Centromeric regions, segmental duplications, and the HLA region on chromosome 6 typically have lower INFO scores due to complex haplotype structure.
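
The percentage column of the distribution table can be reproduced from the four bin counts. This is a small arithmetic sketch; the helper name `bin_shares` is hypothetical, and the counts are the ones reported in the table.

```shell
# Read one bin count per line and print each bin's share of the total,
# followed by the total itself.
bin_shares() {
  awk '
    { count[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) printf "%.1f%%\n", 100 * count[i] / total
      printf "total=%d\n", total
    }
  '
}

# Bin counts in table order: >= 0.90, 0.80-0.90, 0.30-0.80, < 0.30
printf '%s\n' 7295240 3788515 14269686 23533923 | bin_shares
```

The first two bins together hold 11,083,755 variants, which is the 22.7% with R² ≥ 0.80 quoted in the text.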

4. Input & Output Data

Input (from Step 3)

  • Files: chr1.vcf.gz through chr22.vcf.gz (22 per-chromosome VCFs, fixref'd)
  • Location: /staging/ALSU-analysis/spring2026/
  • Samples: 1,093
  • Variants: 456,684 SNPs (fixref + PLINK QC, autosomes only)

Michigan QC Summary

  • Reference overlap: 98.31%
  • Variants matched: 448,305
  • Allele switches: 0
  • Strand flips: 0
  • A/T, C/G genotypes: 0
  • Excluded sites: 647 (264 invalid alleles, 383 allele mismatches)
  • Typed-only sites: 7,732
  • Chunks excluded: 3 of 155 (2 low reference overlap, 1 low sample call rate)
  • Chunks remaining: 152
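
The counts in this summary can be reconciled with simple arithmetic: matched, excluded, and typed-only sites should add up to the number of input variants. The sketch below does that check and prints two candidate overlap ratios. The helper name `qc_reconcile` is hypothetical. The document does not state how the server derives its 98.31% figure, so both ratios are shown only for comparison: the plain matched/input ratio is 98.17%, while matched/(matched + typed-only) comes closer to the reported value.

```shell
# $1 = input variants, $2 = matched, $3 = excluded, $4 = typed-only
qc_reconcile() {
  awk -v input="$1" -v matched="$2" -v excluded="$3" -v typed_only="$4" 'BEGIN {
    # All three categories together should equal the input count.
    printf "accounted=%d\n", matched + excluded + typed_only
    # Two candidate definitions of "overlap" (which one the server uses is an assumption).
    printf "matched_over_input=%.2f%%\n", 100 * matched / input
    printf "matched_over_matched_plus_typed_only=%.2f%%\n", 100 * matched / (matched + typed_only)
  }'
}

qc_reconcile 456684 448305 647 7732
```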

Output (from Michigan Server)

  • Files: chr1.dose.vcf.gz through chr22.dose.vcf.gz (22 imputed VCF files)
  • Samples: 1,093 (unchanged)
  • Total variants: 48,887,364 (typed + imputed, before R² filtering)
  • Format: VCF with GT (hard call), DS (dosage), GP (genotype probabilities)
  • INFO file: chr*.info.gz (VCF format with R² in the INFO field; Minimac v4.1.6)

Intermediate Files

| File | Description |
|---|---|
| chr*.phased.vcf.gz | EAGLE2 phased haplotypes (before imputation) |
| chr*.info.gz | Minimac4 quality scores per variant (INFO/R², estimated MAF) |
| chr*.empiricalDose.vcf.gz | Leave-one-out cross-validation dosages (for typed SNPs) |
| statistics.txt | Per-chromosome imputation summary statistics |

5. Commands Executed

Step 1: Submit to Michigan Imputation Server

    # Michigan Imputation Server v2 (https://imputationserver.sph.umich.edu)
    # Upload: 22 per-chromosome VCF files from Step 3 (fixref'd, GRCh38)

    Job Configuration:
      Reference Panel:  1000G Phase 3 30x (GRCh38/hg38)
      Phasing:          EAGLE v2.4
      Imputation:       Minimac v4.1.6
      Population:       Mixed (skips the allele frequency check)
      Mode:             Quality Control & Imputation
      Input:            456,684 variants × 1,093 samples (22 per-chromosome VCFs)
      Source:           /staging/ALSU-analysis/spring2026/chr{1..22}.vcf.gz

Step 2: Server-side QC & Processing

    # Server-side pipeline (automated):
    #   1. Input Validation: strand check, allele frequency comparison
    #   2. Phasing: EAGLE v2.4 resolves diploid genotypes into haplotypes (per chromosome)
    #   3. Imputation: Minimac v4.1.6 fills untyped positions from reference
    #   4. Quality Estimation: compute INFO/R² per variant

    Michigan QC Report (spring 2026):
      Input variants:         456,684
      Reference overlap:      98.31%
      Matched variants:       448,305
      Allele switches:        0
      Strand flips:           0
      A/T, C/G genotypes:     0
      Alt allele freq > 0.5:  77,363
      Excluded sites:         647
        - Invalid alleles:    264
        - Allele mismatches:  383
      Typed-only sites:       7,732
      Chunks excluded:        3 of 155
        - chunk_14 (chr14 0-20M): ref overlap 27.7%, 1 low-callrate sample
        - chunk_15 (chr15 0-20M): 3 low-callrate samples
        - chunk_9 (chr9 40-60M): ref overlap 38.5%, 5 low-callrate samples
      Remaining chunks:       152

Step 3: Download results

    # Download encrypted results from Michigan Server
    # (download links sent via email; one password-protected, AES-encrypted zip per chromosome)
    # Michigan provides a download script with curl commands for all 27 files
    # (22 chr zips + qc_report.txt + quality-control.html + 3 statistics files)

    # Make download script resume-safe
    # (Michigan's set -e kills the script on HTTP 416 = file already complete)
    curl -sL "${MICHIGAN_DOWNLOAD_URL}" | \
      sed 's/set -e/set +e/; s/curl -L /curl -C - -L /g' | bash

    # Results: 22 encrypted zip files, 66 GB total
    #   chr_1.zip (5.2G) through chr_22.zip (1.1G)
    #   + qc_report.txt, quality-control.html, statistics/
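
Because the download is resumable and spans 22 large archives, a quick completeness check before extraction can save a failed run. The following is a minimal sketch, assuming the archives follow the chr_N.zip naming shown above; the function name `missing_chr_zips` is hypothetical. It checks only that each file exists and is non-empty, not that it is intact.

```shell
# List any chr_N.zip (N = 1..22) that is absent or empty in the given directory.
# $1 = directory to check (defaults to the current directory)
missing_chr_zips() {
  dir="${1:-.}"
  for chr in $(seq 1 22); do
    [ -s "${dir}/chr_${chr}.zip" ] || echo "chr_${chr}.zip"
  done
}

# Prints nothing when all 22 archives are present in the current directory.
missing_chr_zips .
```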

Per-chromosome download sizes

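| Chr | Size | Chr | Size | Chr | Size | Chr | Size |
|---|---|---|---|---|---|---|---|
| 1 | 5.2 GB | 7 | 3.9 GB | 13 | 2.4 GB | 19 | 1.7 GB |
| 2 | 5.4 GB | 8 | 3.6 GB | 14 | 2.1 GB | 20 | 1.6 GB |
| 3 | 4.6 GB | 9 | 2.9 GB | 15 | 2.0 GB | 21 | 1.1 GB |
| 4 | 4.9 GB | 10 | 3.3 GB | 16 | 2.2 GB | 22 | 1.1 GB |
| 5 | 4.2 GB | 11 | 3.3 GB | 17 | 2.0 GB | Total | 66 GB |
| 6 | 4.0 GB | 12 | 3.1 GB | 18 | 2.0 GB | | |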

Step 4: Extract imputed VCFs

    # Each zip is AES-encrypted; password provided by Michigan via email
    # Extract dose VCF + info file
    #   unzip -P PASSWORD: supply decryption password
    #   unzip -o: overwrite existing files without prompting
    # If unzip rejects an AES-256 archive as an unsupported method,
    # 7z x -p"$IMPUTE_PASSWORD" chr_N.zip is the usual alternative
    IMPUTE_PASSWORD="your_michigan_password"
    for chr in $(seq 1 22); do
      unzip -P "$IMPUTE_PASSWORD" -o "chr_${chr}.zip" \
        "chr${chr}.dose.vcf.gz" \
        "chr${chr}.info.gz"
    done

    # Verify extraction
    ls chr*.dose.vcf.gz | wc -l   # expect 22
    ls chr*.info.gz | wc -l       # expect 22

    # Check sample count and ID format
    bcftools query -l chr1.dose.vcf.gz | wc -l              # → 1,093
    bcftools query -l chr1.dose.vcf.gz | head -5            # 01-01, 01-02, ...
    bcftools query -l chr1.dose.vcf.gz | grep -cP '^\d+_'   # → 0 (no Michigan prefix)
✓ Extraction verified — 22 dose VCFs + 22 info.gz extracted successfully. 1,093 samples in all VCFs. No non-ASCII characters (Cyrillic fix from Step 3e confirmed effective). Total directory size: 132 GB (66 GB zips + 66 GB extracted files).
Michigan v2 quirk: chr1–9 outputs can carry a numeric sample-ID prefix (1_sampleID, 2_sampleID, …) while chr10–22 do not; the chr1 check above found none in this run. Step 5 strips any such prefixes per chromosome.
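
One way to strip such a prefix is to build an "old new" rename map and pass it to `bcftools reheader -s`, which accepts whitespace-separated pairs for the samples to rename. The sketch below builds only the map and is not the Step 5 implementation; the function name `make_rename_map` and the output file names in the comments are hypothetical. It assumes genuine sample IDs never start with digits followed by an underscore, since such IDs would be shortened as well.

```shell
# Read sample IDs (one per line) and print "old new" pairs for every ID
# that starts with a numeric prefix followed by an underscore.
make_rename_map() {
  awk '/^[0-9]+_/ { new = $0; sub(/^[0-9]+_/, "", new); print $0, new }'
}

# Toy input: two prefixed IDs and one clean ID (the clean one is skipped).
printf '1_01-01\n1_01-02\n01-03\n' | make_rename_map

# Possible use on real data (file names are placeholders):
#   bcftools query -l chr1.dose.vcf.gz | make_rename_map > chr1.rename.txt
#   bcftools reheader -s chr1.rename.txt -o chr1.renamed.vcf.gz chr1.dose.vcf.gz
```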

Per-chromosome variant counts (from info.gz)

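| Chr | Variants | Chr | Variants | Chr | Variants | Chr | Variants |
|---|---|---|---|---|---|---|---|
| 1 | 3,897,754 | 7 | 2,860,787 | 13 | 1,759,289 | 19 | 1,079,422 |
| 2 | 4,206,610 | 8 | 2,741,343 | 14 | 1,570,644 | 20 | 1,101,754 |
| 3 | 3,486,654 | 9 | 2,138,603 | 15 | 1,426,801 | 21 | 686,519 |
| 4 | 3,476,643 | 10 | 2,412,511 | 16 | 1,576,003 | 22 | 680,124 |
| 5 | 3,192,854 | 11 | 2,436,799 | 17 | 1,385,733 | Total | 48,887,364 |
| 6 | 3,045,143 | 12 | 2,332,171 | 18 | 1,393,203 | | |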
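
The per-chromosome counts can be summed to confirm the reported total. This is a small check using the values from the table; the helper name `sum_variant_counts` is hypothetical.

```shell
# Sum the second column of "chrom count" lines, ignoring thousands separators.
sum_variant_counts() {
  awk '{ gsub(/,/, "", $2); total += $2 } END { printf "%d\n", total }'
}

total=$(sum_variant_counts <<'EOF'
chr1 3,897,754
chr2 4,206,610
chr3 3,486,654
chr4 3,476,643
chr5 3,192,854
chr6 3,045,143
chr7 2,860,787
chr8 2,741,343
chr9 2,138,603
chr10 2,412,511
chr11 2,436,799
chr12 2,332,171
chr13 1,759,289
chr14 1,570,644
chr15 1,426,801
chr16 1,576,003
chr17 1,385,733
chr18 1,393,203
chr19 1,079,422
chr20 1,101,754
chr21 686,519
chr22 680,124
EOF
)
echo "$total"   # 48887364, matching the table total
```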

6. Quality Verification

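✓ Michigan QC passed: 448,305 of 456,684 variants matched the reference (server-reported overlap: 98.31%). 0 strand flips. 0 allele switches. Fixref pre-processing eliminated strand flips and allele switches; the 647 sites excluded for other allele problems are itemized below.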

Excluded Chunks (3 of 155)

| Chunk | Region | SNPs | Ref Overlap | Low-Callrate Samples | Reason |
|---|---|---|---|---|---|
| chunk_14 | chr14: 0–20 Mb | 13 | 27.7% | 1 | Reference overlap < 50% |
| chunk_15 | chr15: 0–20 Mb | 6 | 100% | 3 | Low sample call rate |
| chunk_9 | chr9: 40–60 Mb | 5 | 38.5% | 5 | Reference overlap < 50% |

Excluded chunk regions correspond to centromeric/pericentromeric areas with very few typed SNPs on the Illumina GSA array. These regions are not imputable regardless of reference panel choice.
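
The total of 155 chunks is what results from cutting each autosome into fixed 20 Mb windows. The sketch below checks that arithmetic. Two things in it are assumptions rather than facts from the server report: the 20 Mb window size, read off the region boundaries in the table above, and the chromosome lengths, which are GRCh38 primary-assembly lengths hard-coded for illustration. The helper name `count_chunks` is hypothetical.

```shell
# Count fixed-size windows needed to cover each chromosome.
# stdin: "chrom length_bp" per line; $1 = window size in bp
count_chunks() {
  awk -v win="$1" '{ total += int(($2 + win - 1) / win) } END { print total }'
}

n_chunks=$(count_chunks 20000000 <<'EOF'
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
chr8 145138636
chr9 138394717
chr10 133797422
chr11 135086622
chr12 133275309
chr13 114364328
chr14 107043718
chr15 101991189
chr16 90338345
chr17 83257441
chr18 80373285
chr19 58617616
chr20 64444167
chr21 46709983
chr22 50818468
EOF
)
echo "$n_chunks"   # 155
```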

Excluded SNPs (647 total)

| Category | Count | Description |
|---|---|---|
| Invalid alleles | 264 | Alleles not recognized by reference encoding |
| Allele mismatch | 383 | ALT allele not present in reference at that position (monomorphic in ref) |

Post-imputation R² analysis

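    # Extract R² from .info.gz VCFs (Minimac v4.1.6 format)
    # .info.gz files are VCFs with R2= in the INFO field, NOT plain-text Minimac4 info
    # Split INFO on ";" and match the R2 key exactly, so that an ER2= entry
    # (if present on typed sites) is never picked up by mistake
    zcat chr1.info.gz | awk -F'\t' '!/^#/ {
      n = split($8, kv, ";")
      for (i = 1; i <= n; i++) if (kv[i] ~ /^R2=/) print substr(kv[i], 4)
    }' | head -3
    0.00029
    0.01504
    0.00191

    # R² distribution analysis, verified from all 22 info.gz files (spring 2026)
    # Total: 48,887,364 variants | Mean R² = 0.4069 | R² ≥ 0.80: 11,083,755 (22.7%)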
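
The summary figures quoted above (variant count, mean R², count at or above a threshold) can be computed in a single pass over the info files. The following is a sketch under the assumption that each record carries an `R2=` key in the INFO column; the function name `r2_summary` and the three toy records are illustrative, not real output.

```shell
# Summarize R² from VCF-formatted info records on stdin.
# $1 = threshold for the "passing" count (defaults to 0.80)
r2_summary() {
  awk -F'\t' -v thr="${1:-0.80}" '
    !/^#/ {
      n_fields = split($8, kv, ";")
      for (i = 1; i <= n_fields; i++) {
        if (kv[i] ~ /^R2=/) {          # exact key match; ER2= does not match
          r2 = substr(kv[i], 4) + 0
          n++; sum += r2
          if (r2 >= thr) pass++
        }
      }
    }
    END { if (n) printf "n=%d mean_r2=%.4f n_ge_thr=%d\n", n, sum / n, pass }
  '
}

# Toy records: one typed site (with ER2) and two imputed sites.
{
  printf 'chr1\t100\t.\tA\tG\t.\tPASS\tAF=0.20;MAF=0.20;R2=1.00;ER2=0.90;TYPED\n'
  printf 'chr1\t200\t.\tC\tT\t.\tPASS\tAF=0.01;MAF=0.01;R2=0.45;IMPUTED\n'
  printf 'chr1\t300\t.\tG\tA\t.\tPASS\tAF=0.001;MAF=0.001;R2=0.05;IMPUTED\n'
} | r2_summary 0.80
```

On the real files the same function could be fed with `zcat chr*.info.gz | r2_summary 0.80`; header lines from each file are skipped by the `!/^#/` guard.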

7. Comparison: Winter 2025 vs Spring 2026

| Metric | Winter 2025 | Spring 2026 | Notes |
|---|---|---|---|
| Samples | 1,098 | 1,093 | Spring re-ran Step 1 with corrected F_MISS threshold |
| Typed input variants | 472,191 | 456,684 | Spring added fixref + palindromic removal in Step 3 |
| Michigan server | v1 (Minimac4) | v2 (Minimac v4.1.6) | |
| Reference panel | 1000G Phase 3 v5 | 1000G Phase 3 30x | 30x = NYGC high-coverage re-sequencing |
| Strand flips | 185,633 | 0 | Spring fixref eliminated all strand issues |
| Raw imputed variants | 58,886,952 | 48,887,364 | Different variant space in 30x panel |
| R² ≥ 0.80 | ~10.0M | 11.1M | About 10% more high-quality variants despite fewer raw variants |

Why fewer raw variants but more high-quality ones? The 1000G Phase 3 30x panel is based on 30× whole-genome sequencing (vs the original ~7× low-coverage), producing more accurate haplotype scaffolds. This means fewer spurious rare variant positions in the reference (reducing total raw output) but better imputation accuracy for the variants that remain — yielding more variants passing the R² ≥ 0.80 threshold. Additionally, the spring pipeline’s fixref pre-processing (0 strand flips vs 185K in winter) provides cleaner input, directly improving imputation quality.

8. Chronological Log

2026-04-10
Spring 2026 VCFs prepared (Step 3)
22 per-chromosome VCFs generated via fixref pipeline: 456,684 variants × 1,093 samples on GRCh38. 0 strand flips after fixref correction.
2026-04-10
Job submitted to Michigan Imputation Server v2
Reference: 1000G Phase 3 30x (GRCh38/hg38), phasing: EAGLE v2.4, imputation: Minimac v4.1.6.
2026-04-10
Server-side QC passed
448,305 variants matched reference (98.31% overlap). 647 excluded, 3 chunks excluded, 0 strand flips.
2026-04-10
Imputation completed
Results available for download. Per-chromosome encrypted archives (AES) with .dose.vcf.gz and .info.gz files.
2026-04-10
R² analysis completed
Info-only extraction of all 22 .info.gz files. 48,887,364 total variants, Mean R² = 0.4069, 22.7% ≥ 0.80.
2026-04-11
Full download completed
All 22 encrypted zip archives downloaded to /staging/ALSU-analysis/spring2026/imputation/ (66 GB total). QC report, statistics, and quality-control.html also retrieved.
2026-04-11
Extraction & verification completed
All 22 dose VCFs + 22 info.gz extracted (132 GB total). 1,093 samples confirmed, no Michigan prefix, no non-ASCII. Per-chr variant counts verified: 48,887,364 total (matches R² analysis).