Step 6: Post-Imputation QC & Finalization

1. Overview

Final QC step filters on imputation quality (R²≥0.80) and minor allele frequency (MAF≥0.001), converts to PLINK binary format, and performs heterozygosity outlier detection.

Spring 2026

Filter: bcftools view -i 'INFO/R2>=0.80 && INFO/MAF>=0.001' — streaming expression filter on dose VCF (no targets file needed).
Optimization: Previous -T targets approach was extremely slow (I/O-bound hash lookup per variant). Expression-based -i filter streams through VCF once.

1,093

Starting Samples

Het Outliers Removed

1,090

Final Samples

9,981,814

HQ Variants (R²≥0.80)

Winter 2025

Final Output: High-quality imputed genotype dataset ready for association studies.
24 outlier samples removed, 1,074 samples passing all QC checks.

1,098

Starting Samples

Outliers Removed

1,074

Final Samples

10,846,569

HQ Variants

2. Input Data

Spring 2026

Field	Value
Files	chr1-chr22.dose.vcf.gz (prefix-stripped in step 5)
Location	/staging/ALSU-analysis/spring2026/post_imputation/
Samples	1,093
Variants	48,887,364 across 22 chromosomes

Winter 2025

Field	Value
Files	chr1-chr22.dose_normalized.vcf.gz (22 VCF files)
Location	/staging/ALSU-analysis/winter2025/
Samples	1,098
Variants	10,846,569 SNPs

3. Output Data

Spring 2026

Field	Value
Files	UZB_imputed_HQ_clean.{bed,bim,fam} (PLINK binary)
Location	/staging/ALSU-analysis/spring2026/post_imputation/
Samples	1,090 (3 het outliers removed from 1,093)
Variants	9,981,814 (R²≥0.80 + MAF≥0.001)

✓ Completed: 9,981,814 of 48,887,364 variants (20.4%) pass R²≥0.80 + MAF≥0.001. 3 het outliers removed (|F|>0.2): 08-123, 20-05, 08-86.
Note: 2 samples had Cyrillic homoglyphs corrupted by imputation server in chr10–22 (03-25м→03-25m, 08-176Х→08-176X). Fixed via bcftools reheader before concat.

Winter 2025

Field	Value
Files	UZB_imputed_HQ_clean (PLINK binary + VCF)
Location	/staging/ALSU-analysis/winter2025/
Samples	1,074 (24 het outliers removed)
Variants	10,846,569 SNPs

✓ QC Status: All samples pass sex/ancestry validation, no sample missing >5% data

4. Commands Executed

Spring 2026 — Expression-Based Filtering

# bcftools view -i: filter with INFO field expression
#   Streams through VCF once (no index needed, no targets file)
#   R2 and MAF are embedded in Michigan dose VCF INFO field
#   Much faster than old -T targets approach
for chr in $(seq 1 22); do
  bcftools view -i 'INFO/R2>=0.80 && INFO/MAF>=0.001' \
    -Oz -o filtered/chr${chr}.clean.vcf.gz \
    chr${chr}.dose.vcf.gz
  bcftools index -f filtered/chr${chr}.clean.vcf.gz
done

# Concatenate all filtered chromosomes
$ bcftools concat filtered/chr{1..22}.clean.vcf.gz -Oz \
  -o UZB_imputed_HQ.vcf.gz
$ bcftools index UZB_imputed_HQ.vcf.gz

# Convert to PLINK binary with dosage
# plink2 --vcf: read VCF file
#   dosage=DS: use DS field for allele dosage (Michigan default)
$ plink2 --vcf UZB_imputed_HQ.vcf.gz dosage=DS \
  --make-bed \
  --out UZB_imputed_HQ

# Heterozygosity check
# plink2 --het: compute per-sample F (inbreeding coefficient)
#   Outlier threshold: |F| > 0.2
$ plink2 --bfile UZB_imputed_HQ \
  --het \
  --out het_check

# Remove outliers if any
$ awk 'NR>1 && ($6 > 0.2 || $6 < -0.2) {print $1, $2}' \
  het_check.het > het_outliers.txt

$ plink2 --bfile UZB_imputed_HQ \
  --remove het_outliers.txt \
  --make-bed \
  --out UZB_imputed_HQ_clean

Winter 2025 — Targets-File Filtering

Compute sample-level statistics

# vcftools --missing-indv: compute per-sample missingness rate
#   Outputs .imiss file with F_MISS column (fraction missing genotypes)
$ vcftools --gzvcf chr1.dose_normalized.vcf.gz \
  --missing-indv \
  --out chr1_stats

for chr in {1..22}; do
  vcftools --gzvcf chr${chr}.dose_normalized.vcf.gz \
    --missing-indv --out chr${chr}_stats
done

Sex and ancestry validation

$ plink2 --vcf chr{1..22}.dose_normalized.vcf.gz \
  --check-sex --maf 0.05 \
  --out final_qc

Remove outlier samples

Imputation quality pre-filter (Step 6.3): Before per-chromosome VCF filtering, a high-quality variant list was extracted from the Michigan Imputation Server output (UZB_all.HQ_imputed.R2ge0p8.MAFge0p001.tsv) applying: R² ≥ 0.8 (imputation quality), MAF ≥ 0.001, and IMPUTED == 1. This yielded 10,009,530 HQ variant IDs used as a whitelist in the filtering step below. See Step 6 Technical Log for full details.

$ python3 detect_outliers.py -i chr1_stats.imiss \
  --threshold 0.05 --output outliers_remove.txt

Found 24 outlier samples:
  - 8 samples with >5% missing data
  - 10 samples with sex discordance
  - 6 samples with ancestry outliers

Final filtering and export

# bcftools view: filter samples and variants from VCF
#   --samples-file ^FILE: EXCLUDE samples listed in FILE
#     The ^ prefix means "exclude" (without ^ it means "include only")
#   --include 'ID=@FILE': keep only variants whose ID matches a line in FILE
#     The @ prefix means "read values from file" (bcftools expression syntax)
#   --output-type z: compressed VCF (.vcf.gz)
#   --threads 4: use 4 compression threads (speeds up bgzf output)
$ for chr in {1..22}; do
  bcftools view chr${chr}.dose.vcf.gz \
    --samples-file ^het_outliers_vcf_format.txt \
    --include 'ID=@HQ_variant_ids.txt' \
    --output-type z \
    --output filtered_clean/chr${chr}.clean.vcf.gz \
    --threads 4
  tabix -p vcf filtered_clean/chr${chr}.clean.vcf.gz
  # tabix -p vcf: create .tbi index for VCF (required for region queries)
done

5. Full Chronological Log

2026-04-11 05:20

Spring 2026: Filtering launched
bcftools view -i expression filter on 22 dose VCFs (R²≥0.80 + MAF≥0.001)

2026-04-11 ~07:40

Spring 2026: Filtering progress — chr1–10 complete
Per-chr variant counts (R²≥0.80 + MAF≥0.001):
chr1: 775,162 | chr2: 849,269 | chr3: 726,178 | chr4: 738,749 | chr5: 656,618
chr6: 701,839 | chr7: 589,320 | chr8: 555,446 | chr9: 426,652 | chr10: 514,643
chr11: 504,631 | chr12: 489,173 | chr13: 377,108 | chr14: 330,248 | chr15: 282,841
chr16: 290,892 | chr17: 243,455 | chr18: 279,360 | chr19: 194,360 | chr20: 210,995
chr21: 122,548 | chr22: 122,327
Total: 9,981,814 variants

2026-04-11 ~09:30

Spring 2026: Sample name fix — chr10–22 reheadered
2 samples had Cyrillic homoglyphs (м/Х) corrupted to ?? by imputation server in chr10–22. Fixed via bcftools reheader -s canonical_samples.txt on 13 chromosomes. All 22 chromosomes verified matching.

2026-04-11 ~11:00

Spring 2026: Concatenation complete
bcftools concat → UZB_imputed_HQ.vcf.gz (41 GB, 22 chromosomes). Indexed with bcftools index.

2026-04-11 11:14–11:20

Spring 2026: PLINK conversion + het check
plink2 --vcf → 9,981,814 variants × 1,093 samples.
Het check: 3 outliers (|F|>0.2) → removed → 1,090 final samples.

2025-12-25 09:00

Winter 2025: Per-chromosome filtering initiated
Processing each VCF to remove het outliers and apply quality filters

2025-12-25 14:00

All chromosomes filtered
Removed 24 het outlier samples across all 22 chromosomes (~5.5 hours)

2025-12-25 20:30

Chromosome concatenation started
Merging all 22 filtered VCF files into single file

2025-12-25 20:50

Concatenation completed
UZB_imputed_HQ_clean_ALL.vcf.gz created (~26 minutes)

2025-12-25 21:10

PLINK conversion started
Converting VCF to binary PLINK format (.bed/.bim/.fam)

2025-12-26 22:20

Pipeline complete!
Final dataset: 1,074 samples × 10,846,569 variants, ready for analysis

⚠️ Optional Technical Note: VCF-to-PLINK Conversion & Allele Validation

The Problematic Command

Original VCF-to-PLINK Conversion:

plink --vcf UZB_imputed_HQ_clean_ALL.vcf.gz \

  --double-id \

  --vcf-half-call missing \

  --make-bed \

  --out UZB_imputed_HQ_clean \

  --threads 8 \

  --memory 200000

What Went Wrong

Critical Issue: PLINK's default allele assignment during VCF conversion doesn't match Michigan's reference panel encoding.

VCF Format Ambiguity: VCF files can encode alleles differently (REF/ALT conventions vary)
PLINK's Assumption: PLINK assumes REF should be the minor allele, causing allele swapping when reference alleles are major
Dosage Issue: The Michigan VCF contains dosage values (0-2) for the ALT allele, but PLINK may misinterpret which allele the dosage refers to

Evidence of Allele Swap

Before PLINK conversion (Michigan): REF=T, ALT=C

After PLINK conversion (PLINK .bim): A1=C, A2=T (SWAPPED!)

Result: Same dosage value, different biological meaning

Correct PLINK Conversion Commands

Option A: Explicit allele specification

plink --vcf UZB_imputed_HQ_clean_ALL.vcf.gz \

  --vcf-ref-alt-allele-force \

  --make-bed \

  --out UZB_imputed_HQ_clean_correct \

  --threads 8

Option B: Using bcftools (allele-preserving)

bcftools convert --vcf-to-plink UZB_imputed_HQ_clean_ALL.vcf.gz \

  --output UZB_imputed_HQ_clean_bcftools

Option C: PLINK2 with explicit reference

plink2 --vcf UZB_imputed_HQ_clean_ALL.vcf.gz \

  --ref-from-fa force \

  --make-bed \

  --out UZB_imputed_HQ_clean_plink2

Mandatory Allele Validation Script

Should be run immediately after VCF→PLINK conversion:

# Extract Michigan reference alleles
for chr in {1..22}; do
    zcat chr${chr}.info.gz | awk -F'\t' 'NR>1 {print $1"\t"$2"\t"$3"\t"$4"\t"$5}' >> michigan_alleles.tsv
done

# Compare PLINK output with Michigan reference
awk '{print $1":"$4"\t"$2"\t"$5"\t"$6}' UZB_imputed_HQ_clean.bim > uzb_alleles.tsv
join uzb_alleles.tsv michigan_alleles.tsv | awk '$2 != $4 && $3 != $5 {print $2}' > flip_list.txt

# Apply correction if needed
if [ $(wc -l < flip_list.txt) -gt 10000 ]; then
    plink --bfile UZB_imputed_HQ_clean \
        --flip flip_list.txt \
        --make-bed \
        --out UZB_imputed_HQ_clean_corrected
fi

Prevention Checklist for Future Pipelines

✓ Create allele reference file from imputation server's panel
✓ Document expected alleles for known variants (APOE, COMT, etc.)
✓ Perform immediate allele check after format conversion
✓ Validate 100+ known variants across ancestry groups
✓ Compare allele frequencies with expected population values
✓ Record which alleles are REF/ALT in documentation
✓ Store validation results in pipeline log

Key Takeaway: Always include allele validation as a mandatory step in any genotype data processing pipeline, especially after format conversions or imputation. This error occurs silently and can compromise all downstream analyses if undetected.

Step 6: Final Quality Control

1. Overview

Spring 2026

Winter 2025

2. Input Data

Spring 2026

Winter 2025

3. Output Data

Spring 2026

Winter 2025

4. Commands Executed

Spring 2026 — Expression-Based Filtering

Winter 2025 — Targets-File Filtering

Compute sample-level statistics

Sex and ancestry validation

Remove outlier samples

Final filtering and export

5. Full Chronological Log

The Problematic Command

What Went Wrong

Evidence of Allele Swap

Correct PLINK Conversion Commands

Mandatory Allele Validation Script

Prevention Checklist for Future Pipelines