2a. Haplotype Phasing (EAGLE2)
Genotyping arrays produce unphased diploid genotypes — at each biallelic locus we know the individual
carries alleles A and B, but not which allele sits on which chromosome copy. Phasing resolves this
ambiguity by examining patterns of co-inheritance across many individuals. EAGLE2 uses a positional
Burrows-Wheeler transform (PBWT) to identify long haplotype matches and infers phase with high accuracy
(>99% switch accuracy at N > 1,000).
Accurate phasing is critical because imputation operates on haplotypes, not genotypes. Phase errors
propagate into imputation errors, particularly for rare variants where fewer reference haplotypes match.
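Phasing accuracy is usually summarized by the switch error rate (switch accuracy is one minus this value). The following is a minimal sketch of that metric, assuming true and inferred phase are both known at the same heterozygous sites; `switch_error_rate` is a hypothetical helper written for illustration and is not part of EAGLE2.

```python
def switch_error_rate(truth, inferred):
    """Fraction of adjacent heterozygous-site pairs whose relative phase differs.

    Each argument lists the allele (0 or 1) carried on the first haplotype
    at every heterozygous site, in chromosomal order.
    """
    if len(truth) != len(inferred):
        raise ValueError("truth and inferred must cover the same sites")
    if len(truth) < 2:
        return 0.0
    # True where the inferred orientation agrees with the truth at that site.
    agree = [t == i for t, i in zip(truth, inferred)]
    # A switch occurs whenever agreement flips between consecutive sites.
    switches = sum(a != b for a, b in zip(agree, agree[1:]))
    return switches / (len(truth) - 1)


truth = [0, 1, 1, 0, 1, 0]
inferred = [0, 1, 0, 1, 0, 1]  # orientation flips once, after the second site
rate = switch_error_rate(truth, inferred)
print(f"switch error rate: {rate:.2f}")
```

Note that a haplotype pair that is flipped everywhere counts as zero switches, because only the relative phase between neighbouring sites matters.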
2b. Statistical Imputation (Minimac4)
Minimac4 takes phased study haplotypes and compares them to the reference panel haplotypes in a sliding
window across each chromosome. For each untyped position, it identifies which reference haplotypes best
match the flanking typed SNPs and computes a weighted average of reference alleles — producing a
dosage value (continuous 0–2) rather than a hard genotype call.
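The weighted-average step can be illustrated with a toy calculation. This is only a sketch: Minimac4 derives its weights from a hidden Markov model over the reference haplotypes, whereas the weights below are invented for illustration, and `haploid_dosage` is a hypothetical helper.

```python
def haploid_dosage(ref_alleles, weights):
    """Weighted average of reference alleles (0 = ref, 1 = alt) at one untyped site."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(a * w for a, w in zip(ref_alleles, weights)) / total


# Alleles carried by four reference haplotypes at the untyped site.
ref_alleles = [1, 1, 0, 1]
# Made-up match weights per study haplotype (larger = better match on flanking SNPs).
weights_hap1 = [0.6, 0.3, 0.1, 0.0]
weights_hap2 = [0.1, 0.1, 0.2, 0.6]

# The diploid dosage is the sum of the two haploid dosages, so it lies in [0, 2].
dosage = haploid_dosage(ref_alleles, weights_hap1) + haploid_dosage(ref_alleles, weights_hap2)
print(f"diploid dosage: {dosage:.2f}")
```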
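Dosages preserve imputation uncertainty. A dosage is the expected alternate-allele count,
P(het) + 2 × P(hom alt), so a dosage of 1.85 (with negligible probability of homozygous reference)
means "most likely 2 (homozygous alternate), with ~15% probability of 1 (heterozygous)." Downstream
association tests should use dosages, not hard calls, to properly account for this uncertainty.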
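The relationship between genotype probabilities, dosage, and hard calls can be sketched as follows. The function names are hypothetical and the probabilities are illustrative, not taken from any study sample.

```python
def dosage_from_probs(p_hom_ref, p_het, p_hom_alt):
    """Expected alternate-allele count given the three genotype probabilities."""
    total = p_hom_ref + p_het + p_hom_alt
    if abs(total - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return p_het + 2.0 * p_hom_alt


def hard_call(dosage):
    """Nearest integer genotype (0, 1, or 2); discards the uncertainty."""
    return min(2, max(0, int(dosage + 0.5)))


uncertain = dosage_from_probs(0.0, 0.15, 0.85)
certain = dosage_from_probs(0.0, 0.0, 1.0)
print(f"dosages: {uncertain:.2f} vs {certain:.2f}")
print(f"hard calls: {hard_call(uncertain)} vs {hard_call(certain)}")
```

Both samples receive the same hard call, which is why association tests that use dosages retain information that hard calls throw away.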
2c. Reference Panel: 1000 Genomes Phase 3 v5
The reference panel provides the haplotype templates against which study samples are imputed.
1000 Genomes Phase 3 v5 contains 2,504 individuals from 26 populations across 5 super-populations
(AFR, AMR, EAS, EUR, SAS). It includes ~80 million biallelic variants on the GRCh38 reference.
| Parameter | Value |
| --- | --- |
| Panel | 1000 Genomes Phase 3 30x (GRCh38/hg38) |
| Reference samples | 2,504 |
| Populations | 26 (5 super-populations) |
| Panel variants | ~80 million biallelic sites |
| Most relevant super-pop | EUR + SAS (Central Asian samples share ancestry with both) |
Panel choice note: The 1000 Genomes panel does not include Central Asian populations.
Uzbek samples are genetically intermediate between European and South Asian groups (as shown by PCA in
Step 7). Imputation quality for population-specific variants may be lower than for cosmopolitan variants.
TOPMed (a larger, more diverse panel) would likely improve accuracy but requires dbGaP access.
2d. INFO / R² Quality Score
The INFO score (reported by Minimac as R²) measures imputation quality per variant. It estimates the
squared correlation between the true genotype and the imputed dosage:
- INFO = 1.0: Perfect imputation (typically typed variants).
- INFO > 0.90: High quality — suitable for all downstream analyses.
- INFO 0.80–0.90: Moderate quality — acceptable for most analyses; increased
noise in effect size estimates.
- INFO 0.30–0.80: Low quality — imputation uncertain; use with caution.
- INFO < 0.30: Poor quality — typically excluded.
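To give some intuition for how such a score behaves, here is a sketch of the ratio-of-variances estimator commonly described for Minimac's R²: the observed variance of the imputed haploid dosages divided by the variance p(1 − p) expected if alleles were known exactly. This is an illustration under that assumption; the exact computation inside Minimac4 may differ, and `estimated_r2` is a hypothetical helper.

```python
def estimated_r2(haploid_dosages):
    """Observed variance of haploid dosages over the expected variance p * (1 - p)."""
    n = len(haploid_dosages)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p = sum(haploid_dosages) / n  # estimated alternate-allele frequency
    expected_var = p * (1.0 - p)
    if expected_var == 0.0:
        return 0.0  # monomorphic site: the ratio is undefined, report 0
    observed_var = sum((d - p) ** 2 for d in haploid_dosages) / n
    return observed_var / expected_var


confident = [0.0, 0.0, 1.0, 1.0]  # dosages near 0 or 1: alleles well resolved
uncertain = [0.4, 0.4, 0.6, 0.6]  # dosages shrunk toward the allele frequency
print(f"confident: {estimated_r2(confident):.2f}")
print(f"uncertain: {estimated_r2(uncertain):.2f}")
```

When imputation is uncertain, dosages shrink toward the allele frequency, the observed variance drops, and the score falls toward 0.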
Common practice: retain variants with INFO ≥ 0.30 (lenient) or ≥ 0.80 (stringent).
The choice depends on the analysis: discovery GWAS tolerates INFO ≥ 0.30; fine-mapping
or candidate gene analysis requires INFO ≥ 0.80.
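Applying either threshold amounts to reading the quality score from each variant's VCF INFO field. The sketch below assumes the score is stored under the key `R2`, as in Minimac4 output; the variant IDs and values are made up, and `parse_info` and `passes_r2` are hypothetical helpers.

```python
def parse_info(info_field):
    """Parse a VCF INFO string such as 'AF=0.12;R2=0.87;IMPUTED' into a dict."""
    parsed = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True  # flags such as IMPUTED have no value
    return parsed


def passes_r2(info_field, threshold):
    """True if the variant carries an R2 value at or above the threshold."""
    info = parse_info(info_field)
    if "R2" not in info:
        return False
    return float(info["R2"]) >= threshold


# Made-up (variant ID, INFO) pairs for illustration only.
records = [
    ("var_a", "AF=0.31;MAF=0.31;R2=0.95;IMPUTED"),
    ("var_b", "AF=0.02;MAF=0.02;R2=0.45;IMPUTED"),
    ("var_c", "AF=0.001;MAF=0.001;R2=0.12;IMPUTED"),
]

lenient = [vid for vid, info in records if passes_r2(info, 0.30)]
stringent = [vid for vid, info in records if passes_r2(info, 0.80)]
print("INFO >= 0.30:", lenient)
print("INFO >= 0.80:", stringent)
```

For full-size VCF files a dedicated tool or library would be more practical than line-by-line string parsing; the sketch only shows the filtering logic.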