Step 4: Genotype Imputation

1. Overview

Genotype imputation statistically infers untyped variants using haplotype patterns shared between the study samples and a large reference panel. A genotyping array captures ~500,000–700,000 tag SNPs; imputation leverages linkage disequilibrium (LD) to predict the remaining ~10–40 million common variants in the genome.

The process has two stages: (1) phasing, which resolves unphased diploid genotypes into haplotypes, and (2) imputation, which matches study haplotypes against reference panel haplotypes to infer missing positions. Each imputed genotype receives a dosage (continuous, 0–2), and each variant receives an INFO/R² quality score reflecting imputation confidence.

Why impute? Association studies gain statistical power from increased variant density. Imputation also enables meta-analysis across studies genotyped on different arrays (each array types a slightly different SNP set, but imputation fills the gaps so all studies share a common variant space).

Key Metrics

Metric	Value
Typed Variants (input)	456,684
Matched to Reference	448,305
Reference Overlap	98.31%
Total Variants (output)	48.89M
Mean R² Score	0.407

2. Imputation Concepts

2a. Haplotype Phasing (EAGLE2)

Genotyping arrays produce unphased diploid genotypes — at each biallelic locus we know the individual carries alleles A and B, but not which allele sits on which chromosome copy. Phasing resolves this ambiguity by examining patterns of co-inheritance across many individuals. EAGLE2 uses a positional Burrows-Wheeler transform (PBWT) to identify long haplotype matches and infers phase with high accuracy (>99% switch accuracy at N > 1,000).

Accurate phasing is critical because imputation operates on haplotypes, not genotypes. Phase errors propagate into imputation errors, particularly for rare variants where fewer reference haplotypes match.

2b. Statistical Imputation (Minimac4)

Minimac4 takes phased study haplotypes and compares them to the reference panel haplotypes in a sliding window across each chromosome. For each untyped position, it identifies which reference haplotypes best match the flanking typed SNPs and computes a weighted average of reference alleles — producing a dosage value (continuous 0–2) rather than a hard genotype call.
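The weighted-average step can be illustrated with a toy calculation. This is a minimal sketch, not Minimac4's implementation: the function name `hap_dosage` and all weights are hypothetical, standing in for the match probabilities that the imputation model would assign to each reference haplotype.

```shell
# hap_dosage: read "weight allele" pairs (one reference haplotype per line)
# and print the weight-normalized average allele, i.e. a haploid dosage (0 to 1).
hap_dosage() {
  awk '{ w += $1; s += $1 * $2 } END { if (w > 0) printf "%.3f\n", s / w }'
}

# Made-up weights for reference haplotypes carrying allele 1 or 0 at the untyped site
h1=$(printf '0.6 1\n0.3 1\n0.1 0\n' | hap_dosage)   # first study haplotype
h2=$(printf '0.2 1\n0.8 0\n' | hap_dosage)          # second study haplotype

# The diploid dosage (0 to 2) is the sum of the two haploid dosages
dosage=$(awk -v a="$h1" -v b="$h2" 'BEGIN { printf "%.3f\n", a + b }')
echo "haploid: $h1 $h2  diploid: $dosage"
```

With these toy weights the two haplotypes contribute 0.900 and 0.200, giving a diploid dosage of 1.100.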

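Dosages preserve imputation uncertainty. A dosage is the expected alternate-allele count, P(het) + 2 × P(hom alt), so a dosage of 1.85 with no weight on the homozygous-reference genotype means "most likely 2 (homozygous alternate), with ~15% probability of 1 (heterozygous)." Downstream association tests should use dosages, not hard calls, to properly account for this uncertainty.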
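The relationship between genotype probabilities (the GP field) and dosage (the DS field) can be checked with a short calculation. The helper name `dosage_from_gp` is hypothetical, and the probabilities are illustrative rather than taken from the data.

```shell
# dosage_from_gp P0 P1 P2: expected ALT-allele count given the probabilities
# of the hom-ref (0), het (1) and hom-alt (2) genotypes.
dosage_from_gp() {
  awk -v p0="$1" -v p1="$2" -v p2="$3" \
    'BEGIN { printf "%.2f\n", 0 * p0 + 1 * p1 + 2 * p2 }'
}

dosage_from_gp 0 0.15 0.85     # 15% het, 85% hom-alt -> 1.85
dosage_from_gp 0.25 0.5 0.25   # maximally uncertain het call -> 1.00
```

Note that a hard call would record both of these differently (2 and 1) while discarding how confident each call is, which is why dosages are preferred in association testing.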

2c. Reference Panel: 1000 Genomes Phase 3 30x

The reference panel provides the haplotype templates against which study samples are imputed. The spring 2026 run uses 1000 Genomes Phase 3 30x, the NYGC high-coverage re-sequencing of the 1000 Genomes samples on GRCh38 (the winter 2025 run used Phase 3 v5). The panel contains 2,504 individuals from 26 populations across 5 super-populations (AFR, AMR, EAS, EUR, SAS) and ~80 million biallelic variants.

Parameter	Value
Panel	1000 Genomes Phase 3 30x (GRCh38/hg38)
Reference samples	2,504
Populations	26 (5 super-populations)
Panel variants	~80 million biallelic sites
Most relevant super-pop	EUR + SAS (Central Asian samples share ancestry with both)

Panel choice note: The 1000 Genomes panel does not include Central Asian populations. Uzbek samples are genetically intermediate between European and South Asian groups (as shown by PCA in Step 7). Imputation quality for population-specific variants may be lower than for cosmopolitan variants. TOPMed (a larger, more diverse panel) would likely improve accuracy but requires dbGaP access.

2d. INFO / R² Quality Score

The INFO score (also called Minimac R²) measures imputation quality per variant. It estimates the squared correlation between the true genotype and the imputed dosage:

INFO = 1.0: Perfect imputation (typically typed variants).
INFO > 0.90: High quality — suitable for all downstream analyses.
INFO 0.80–0.90: Moderate quality — acceptable for most analyses; increased noise in effect size estimates.
INFO 0.30–0.80: Low quality — imputation uncertain; use with caution.
INFO < 0.30: Poor quality — typically excluded.

Common practice: retain variants with INFO ≥ 0.30 (lenient) or ≥ 0.80 (stringent). The choice depends on the analysis: discovery GWAS tolerates INFO ≥ 0.30; fine-mapping or candidate gene analysis requires INFO ≥ 0.80.
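The quality tiers listed above can be expressed as a small classifier. This is a sketch under the thresholds stated in this section; the function name `info_tier` and the tier labels are illustrative only.

```shell
# info_tier R2: map an INFO/R2 value to the quality tier used in this section.
info_tier() {
  awk -v r2="$1" 'BEGIN {
    if      (r2 >= 0.90) print "high"
    else if (r2 >= 0.80) print "moderate"
    else if (r2 >= 0.30) print "low"
    else                 print "poor"
  }'
}

for r2 in 0.95 0.85 0.50 0.10; do
  printf '%s\t%s\n' "$r2" "$(info_tier "$r2")"
done
```

A lenient filter (INFO ≥ 0.30) keeps everything except the "poor" tier, while a stringent filter (INFO ≥ 0.80) keeps only "high" and "moderate".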

3. INFO Score Distribution

Imputation Quality (INFO / R²) Distribution

N = 48,887,364 variants (typed + imputed) | Mean R² = 0.4069 | Spring 2026

INFO Range	Variants	% of Total	Interpretation
≥ 0.90 (high quality)	7,295,240	14.9%	Suitable for all analyses
0.80–0.90	3,788,515	7.7%	Acceptable; slight noise increase
0.30–0.80	14,269,686	29.2%	Use with caution
< 0.30	23,533,923	48.1%	Typically excluded

✓ Michigan QC passed: 448,305 of 456,684 input variants matched the 1000G Phase 3 30x reference (server-reported reference overlap: 98.31%). The remaining 8,379 input variants comprise 7,732 typed-only sites and 647 excluded sites (264 invalid alleles, 383 allele mismatches). 0 strand flips.

R² distribution is typical for raw imputation output: 48.1% of variants have R² < 0.30 and 22.7% have R² ≥ 0.80. This is expected — the raw output includes all imputed positions including rare variants (MAF < 0.5%) and poorly-tagged regions. Standard post-imputation filtering (Step 5: INFO ≥ 0.30 or ≥ 0.80) retains only well-imputed variants for downstream analysis. Mean R² = 0.407, consistent with 1000G-based imputation of a Central Asian cohort.
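The percentages quoted here follow directly from the per-bin counts in the distribution table. The following sketch recomputes them and confirms that the four bins add up to the reported total; the function name `bin_table` is hypothetical, and the counts are copied from the table.

```shell
# bin_table: recompute the share of each INFO bin and check the bins sum to the total.
bin_table() {
  printf '%s\n' '>=0.90 7295240' '0.80-0.90 3788515' '0.30-0.80 14269686' '<0.30 23533923' |
    awk -v total=48887364 '
      { sum += $2; if (NR <= 2) hi += $2
        printf "%-10s %10d  %5.1f%%\n", $1, $2, 100 * $2 / total }
      END {
        printf "%-10s %10d  %5.1f%%\n", ">=0.80", hi, 100 * hi / total
        printf "%-10s %10d  %s\n", "sum", sum, (sum == total ? "matches total" : "MISMATCH")
      }'
}

bin_table
```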

Variants with low INFO: These are predominantly rare variants (MAF < 1% in the reference panel) and variants in regions of low LD where the typed tag SNPs provide limited information. Centromeric regions, segmental duplications, and the HLA region on chromosome 6 typically have lower INFO scores due to complex haplotype structure.
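One way to see the relationship between allele frequency and imputation quality is to stratify mean R² by MAF bin. The sketch below is illustrative: the function name `r2_by_maf`, the bin boundaries, and the toy values are assumptions, and the commented bcftools line assumes the info VCFs carry `MAF` and `R2` tags in the INFO field (the MAF there would be the cohort estimate, not the reference panel frequency).

```shell
# r2_by_maf: read "MAF<TAB>R2" pairs on stdin; print bin, variant count, mean R2.
r2_by_maf() {
  awk '
    { if      ($1 < 0.005) bin = "rare"       # MAF < 0.5%
      else if ($1 < 0.05)  bin = "low_freq"   # 0.5% <= MAF < 5%
      else                 bin = "common"     # MAF >= 5%
      n[bin]++; s[bin] += $2 }
    END { for (b in n) printf "%s\t%d\t%.3f\n", b, n[b], s[b] / n[b] }' | sort
}

# Toy input (made-up values)
printf '0.001\t0.10\n0.002\t0.30\n0.02\t0.60\n0.30\t0.95\n0.20\t0.99\n' | r2_by_maf

# Against a real file (requires bcftools):
# bcftools query -f '%INFO/MAF\t%INFO/R2\n' chr1.info.gz | r2_by_maf
```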

4. Input & Output Data

Input (from Step 3)

Files	chr1.vcf.gz through chr22.vcf.gz (22 per-chromosome VCFs, fixref'd)
Location	`/staging/ALSU-analysis/spring2026/`
Samples	1,093
Variants	456,684 SNPs (fixref + PLINK QC, autosomes only)

Michigan QC Summary

Reference overlap	98.31%
Variants matched	448,305
Allele switches	0
Strand flips	0
A/T, C/G genotypes	0
Excluded sites	647 (264 invalid alleles, 383 allele mismatches)
Typed-only sites	7,732
Chunks excluded	3 of 155 (2 low reference overlap, 1 low sample call rate)
Chunks remaining	152
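The counts in the QC summary can be cross-checked with simple arithmetic: every input variant should be matched to the reference, typed-only (on the array but absent from the panel), or excluded. The variable names below are illustrative; the numbers are copied from the summary.

```shell
input=456684; matched=448305; typed_only=7732; excluded=647
invalid=264; mismatch=383
chunks_total=155; chunks_excluded=3

accounted=$((matched + typed_only + excluded))
echo "accounted for: ${accounted} of ${input} input variants"
echo "excluded breakdown: $((invalid + mismatch)) (reported: ${excluded})"
echo "chunks remaining: $((chunks_total - chunks_excluded))"
```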

Output (from Michigan Server)

Files	chr1.dose.vcf.gz through chr22.dose.vcf.gz (22 imputed VCF files)
Samples	1,093 (unchanged)
Total variants	48,887,364 (typed + imputed, before R² filtering)
Format	VCF with GT (hard call), DS (dosage), GP (genotype probabilities)
INFO file	chr*.info.gz — VCF format with R² in INFO field (Minimac v4.1.6)

Intermediate Files

File	Description
chr*.phased.vcf.gz	EAGLE2 phased haplotypes (before imputation)
chr*.info.gz	Minimac4 quality scores per variant (INFO/R², estimated MAF)
chr*.empiricalDose.vcf.gz	Leave-one-out cross-validation dosages (for typed SNPs)
statistics.txt	Per-chromosome imputation summary statistics

5. Commands Executed

Step 1: Submit to Michigan Imputation Server

# Michigan Imputation Server v2 (https://imputationserver.sph.umich.edu)
# Upload: 22 per-chromosome VCF files from Step 3 (fixref'd, GRCh38)

Job Configuration:
  Reference Panel: 1000G Phase 3 30x (GRCh38/hg38)
  Phasing:         EAGLE v2.4
  Imputation:      Minimac v4.1.6
  Population:      Mixed (skips the allele frequency check)
  Mode:            Quality Control & Imputation

Input: 456,684 variants × 1,093 samples (22 per-chromosome VCFs)
  Source: /staging/ALSU-analysis/spring2026/chr{1..22}.vcf.gz

Step 2: Server-side QC & Processing

# Server-side pipeline (automated):
# 1. Input Validation: strand and allele checks (allele frequency comparison is skipped when Population = Mixed)
# 2. Phasing: EAGLE v2.4 resolves unphased diploid genotypes into haplotypes (per chromosome)
# 3. Imputation: Minimac v4.1.6 fills untyped positions from reference
# 4. Quality Estimation: compute INFO/R² per variant

Michigan QC Report (spring 2026):
  Input variants:            456,684
  Reference overlap:         98.31%
  Matched variants:          448,305
  Allele switches:           0
  Strand flips:              0
  A/T, C/G genotypes:       0
  Alt allele freq > 0.5:    77,363

  Excluded sites:            647
    - Invalid alleles:       264
    - Allele mismatches:     383
  Typed-only sites:          7,732
  Chunks excluded:           3 of 155
    - chunk_14 (chr14 0-20M): ref overlap 27.7%, 1 low-callrate sample
    - chunk_15 (chr15 0-20M): 3 low-callrate samples
    - chunk_9 (chr9 40-60M):  ref overlap 38.5%, 5 low-callrate samples
  Remaining chunks:          152

Step 3: Download results

# Download encrypted results from Michigan Server
# (download links sent via email, password-protected AES-encrypted zip per chromosome)
# Michigan provides a download script with curl commands for all 27 files
# (22 chr zips + qc_report.txt + quality-control.html + 3 statistics files)

# Make download script resume-safe (Michigan's set -e kills on HTTP 416 = already complete)
curl -sL "${MICHIGAN_DOWNLOAD_URL}" | \
  sed 's/set -e/set +e/; s/curl -L /curl -C - -L /g' | bash

# Results: 22 encrypted zip files, 66 GB total
# chr_1.zip (5.2G) through chr_22.zip (1.1G)
# + qc_report.txt, quality-control.html, statistics/
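Because the download is resumable and may be interrupted, it can help to confirm that all 22 archives are present and non-empty before extraction. This is a minimal sketch; the function name `list_missing` is hypothetical, and it checks only existence and non-zero size, not archive integrity.

```shell
# list_missing DIR FORMAT: print each expected per-chromosome file that is absent
# or empty. FORMAT is a printf pattern with one %s for the chromosome number.
list_missing() (
  dir=$1; fmt=$2
  for chr in $(seq 1 22); do
    f="$dir/$(printf "$fmt" "$chr")"
    [ -s "$f" ] || printf '%s\n' "$f"
  done
)

# Prints nothing when all 22 archives are in the current directory
list_missing . 'chr_%s.zip'
```

The same helper can be pointed at the extracted files afterwards, for example `list_missing . 'chr%s.dose.vcf.gz'`.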

Per-chromosome download sizes

Chr	Size	Chr	Size	Chr	Size	Chr	Size
1	5.2 GB	7	3.9 GB	13	2.4 GB	19	1.7 GB
2	5.4 GB	8	3.6 GB	14	2.1 GB	20	1.6 GB
3	4.6 GB	9	2.9 GB	15	2.0 GB	21	1.1 GB
4	4.9 GB	10	3.3 GB	16	2.2 GB	22	1.1 GB
5	4.2 GB	11	3.3 GB	17	2.0 GB	Total: 66 GB
6	4.0 GB	12	3.1 GB	18	2.0 GB

Step 4: Extract imputed VCFs

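# Each zip is AES-encrypted; password provided by Michigan via email
# Extract dose VCF + info file
# unzip -P PASSWORD: supply decryption password
# unzip -o: overwrite existing files without prompting
# Note: stock Info-ZIP unzip 6.0 cannot read AES-encrypted entries ("unsupported
# compression method 99"); if that error appears, 7z x -p"$IMPUTE_PASSWORD" -y is an alternative

IMPUTE_PASSWORD="your_michigan_password"   # placeholder; do not commit the real password

for chr in $(seq 1 22); do
  unzip -P "$IMPUTE_PASSWORD" -o "chr_${chr}.zip" \
    "chr${chr}.dose.vcf.gz" \
    "chr${chr}.info.gz" || echo "extraction failed for chr${chr}" >&2
done

# Verify extraction
ls chr*.dose.vcf.gz | wc -l   # expect 22
ls chr*.info.gz | wc -l       # expect 22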

# Check sample count and ID format
bcftools query -l chr1.dose.vcf.gz | wc -l          # → 1,093
bcftools query -l chr1.dose.vcf.gz | head -5         # 01-01, 01-02, ...
bcftools query -l chr1.dose.vcf.gz | grep -cP '^\d+_'  # → 0 (no Michigan prefix)

✓ Extraction verified — 22 dose VCFs + 22 info.gz extracted successfully. 1,093 samples in all VCFs. No non-ASCII characters (Cyrillic fix from Step 3e confirmed effective). Total directory size: 132 GB (66 GB zips + 66 GB extracted files).
Michigan v2 quirk: chr1–9 have numeric prefix (1_sampleID, 2_sampleID, …), chr10–22 do not. Step 5 strips these per-chromosome.
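The prefix check above inspects chr1 only. Since the prefix behaviour is described per chromosome, counting prefixed IDs in every file may be a safer verification. The sketch below is an assumption-laden illustration: `count_prefixed` is a hypothetical helper, and the toy IDs are made up to mimic the formats mentioned in this section.

```shell
# count_prefixed: read sample IDs on stdin, print how many start with "<digits>_".
count_prefixed() {
  grep -cE '^[0-9]+_' || true   # grep -c exits 1 when the count is 0
}

# Toy input: two prefixed IDs and one clean ID
printf '1_01-01\n2_01-02\n01-03\n' | count_prefixed

# Against the real files (requires bcftools):
# for chr in $(seq 1 22); do
#   printf 'chr%s\t%s\n' "$chr" "$(bcftools query -l "chr${chr}.dose.vcf.gz" | count_prefixed)"
# done
```

A non-zero count for any chromosome indicates that the per-chromosome prefix stripping in Step 5 is still needed for that file.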

Per-chromosome variant counts (from info.gz)

Chr	Variants	Chr	Variants	Chr	Variants	Chr	Variants
1	3,897,754	7	2,860,787	13	1,759,289	19	1,079,422
2	4,206,610	8	2,741,343	14	1,570,644	20	1,101,754
3	3,486,654	9	2,138,603	15	1,426,801	21	686,519
4	3,476,643	10	2,412,511	16	1,576,003	22	680,124
5	3,192,854	11	2,436,799	17	1,385,733	Total: 48,887,364
6	3,045,143	12	2,332,171	18	1,393,203
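Per-chromosome counts like those in the table can be obtained by counting the non-header lines of each info file. This is a sketch of one way to do it, not necessarily how the table was produced; `count_records` is a hypothetical helper and the toy VCF lines are made up.

```shell
# count_records: count non-header lines of VCF text read from stdin.
count_records() {
  grep -vc '^#' || true   # grep -c exits 1 when the count is 0
}

# Toy VCF with two header lines and two records
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nchr1\t100\nchr1\t200\n' | count_records

# Against the real files:
# total=0
# for chr in $(seq 1 22); do
#   n=$(zcat "chr${chr}.info.gz" | count_records)
#   printf 'chr%s\t%s\n' "$chr" "$n"; total=$((total + n))
# done
# echo "total: $total"   # the table above reports 48,887,364
```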

6. Quality Verification

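✓ Michigan QC passed: 448,305 of 456,684 variants matched the reference (server-reported overlap: 98.31%). 0 strand flips. 0 allele switches. Fixref pre-processing eliminated all strand flips and allele switches; the 647 sites excluded for allele problems are itemized below.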

Excluded Chunks (3 of 155)

Chunk	Region	SNPs	Ref Overlap	Low-Callrate Samples	Reason
chunk_14	chr14: 0–20 Mb	13	27.7%	1	Reference overlap < 50%
chunk_15	chr15: 0–20 Mb	6	100%	3	Low sample call rate
chunk_9	chr9: 40–60 Mb	5	38.5%	5	Reference overlap < 50%

Excluded chunk regions correspond to centromeric/pericentromeric areas (including the acrocentric short arms of chr14 and chr15) with very few typed SNPs on the Illumina GSA array. With so few typed markers, these regions are unlikely to be imputable with any reference panel.

Excluded SNPs (647 total)

Category	Count	Description
Invalid alleles	264	Alleles not recognized by reference encoding
Allele mismatch	383	ALT allele not present in reference at that position (monomorphic in ref)

Post-imputation R² analysis

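# Extract R² from .info.gz VCFs (Minimac v4.1.6 format)
# .info.gz files are VCFs with R2= in the INFO field, NOT plain-text Minimac4 info
# Query the R2 tag by name: a greedy pattern such as sed 's/.*R2=.../' would pick up
# ER2= instead on any line that carries both tags
bcftools query -f '%INFO/R2\n' chr1.info.gz | head -3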
0.00029
0.01504
0.00191

# R² distribution analysis — verified from all 22 info.gz files (spring 2026)
# Total: 48,887,364 variants | Mean R² = 0.4069 | R² ≥ 0.80: 11,083,755 (22.7%)
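The summary figures above (variant count, mean R², share at or above 0.80) can be reproduced from a stream of R² values. The following is a sketch of that aggregation, not the exact script used for this report; `r2_summary` is a hypothetical helper and the toy values are made up.

```shell
# r2_summary: read one R2 value per line; print N, mean R2 and the count/share >= 0.80.
r2_summary() {
  awk '$1 != "." { n++; s += $1; if ($1 >= 0.80) hi++ }
       END { if (n) printf "N=%d mean_R2=%.4f R2>=0.80: %d (%.1f%%)\n", n, s / n, hi, 100 * hi / n }'
}

# Toy input (four made-up values)
printf '0.10\n0.50\n0.85\n0.95\n' | r2_summary

# Across all chromosomes (requires bcftools):
# for chr in $(seq 1 22); do
#   bcftools query -f '%INFO/R2\n' "chr${chr}.info.gz"
# done | r2_summary
```

Lines where the tag is missing (printed as `.` by bcftools) are skipped so they do not drag down the mean.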

7. Comparison: Winter 2025 vs Spring 2026

Metric	Winter 2025	Spring 2026	Notes
Samples	1,098	1,093	Spring re-ran Step 1 with corrected F_MISS threshold
Typed input variants	472,191	456,684	Spring added fixref + palindromic removal in Step 3
Michigan server	v1 (Minimac4)	v2 (Minimac v4.1.6)
Reference panel	1000G Phase 3 v5	1000G Phase 3 30x	30x = NYGC high-coverage re-sequencing
Strand flips	185,633	0	Spring fixref eliminated all strand issues
Raw imputed variants	58,886,952	48,887,364	Different variant space in 30x panel
R² ≥ 0.80	~10.0M	11.1M	+10% more high-quality variants despite fewer raw

Why fewer raw variants but more high-quality ones? The 1000G Phase 3 30x panel is based on 30× whole-genome sequencing (vs the original ~7× low-coverage), producing more accurate haplotype scaffolds. This means fewer spurious rare variant positions in the reference (reducing total raw output) but better imputation accuracy for the variants that remain — yielding more variants passing the R² ≥ 0.80 threshold. Additionally, the spring pipeline’s fixref pre-processing (0 strand flips vs 185K in winter) provides cleaner input, directly improving imputation quality.
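The size of the changes between the two runs can be quantified as relative differences. This small sketch uses the exact counts from the comparison table; the helper name `pct_change` is hypothetical, and the R² ≥ 0.80 row is left out because its winter value is only approximate.

```shell
# pct_change OLD NEW: signed percentage change from OLD to NEW.
pct_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%+.1f%%\n", 100 * (new - old) / old }'
}

pct_change 58886952 48887364   # raw imputed variants: -17.0%
pct_change 472191 456684       # typed input variants: -3.3%
pct_change 1098 1093           # samples: -0.5%
```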

8. Chronological Log

2026-04-10

Spring 2026 VCFs prepared (Step 3)
22 per-chromosome VCFs generated via fixref pipeline: 456,684 variants × 1,093 samples on GRCh38. 0 strand flips after fixref correction.

2026-04-10

Job submitted to Michigan Imputation Server v2
Reference: 1000G Phase 3 30x (GRCh38/hg38), phasing: EAGLE v2.4, imputation: Minimac v4.1.6.

2026-04-10

Server-side QC passed
448,305 variants matched reference (98.31% overlap). 647 excluded, 3 chunks excluded, 0 strand flips.

2026-04-10

Imputation completed
Results available for download. Per-chromosome encrypted archives (AES) with .dose.vcf.gz and .info.gz files.

2026-04-10

R² analysis completed
Info-only extraction of all 22 .info.gz files. 48,887,364 total variants, Mean R² = 0.4069, 22.7% ≥ 0.80.

2026-04-11

Full download completed
All 22 encrypted zip archives downloaded to /staging/ALSU-analysis/spring2026/imputation/ (66 GB total). QC report, statistics, and quality-control.html also retrieved.

2026-04-11

Extraction & verification completed
All 22 dose VCFs + 22 info.gz extracted (132 GB total). 1,093 samples confirmed, no Michigan prefix, no non-ASCII. Per-chr variant counts verified: 48,887,364 total (matches R² analysis).