← Back to Roadmap

Step 3: SNP Quality Control & VCF Export

Filter variants by allele frequency, Hardy-Weinberg equilibrium, and genotyping rate; export to VCF for imputation

✓ Completed — April 10, 2026

1. Overview

After removing low-quality samples (Steps 1–2), this step applies variant-level quality control. Three filters are applied simultaneously: minimum minor allele frequency (MAF), Hardy-Weinberg equilibrium (HWE) test, and per-SNP genotyping rate. These filters remove uninformative, technically artefactual, or poorly genotyped variants before imputation.

Accurate imputation depends critically on having a clean set of typed variants. Low-quality or artefactual typed SNPs propagate errors into imputed genotypes — a phenomenon known as garbage in, garbage out. Pre-imputation QC is therefore more stringent than general analysis QC.

Why filter before imputation? Imputation servers (Michigan, TOPMed) use typed variants as the scaffold for inferring untyped variants. If a typed SNP has incorrect allele frequencies (due to genotyping error or population stratification artefact), all imputed variants in its LD neighbourhood inherit that error. Clean typed variants = clean imputed genome.

Key Metrics

620,492
Autosomal Input
160,669
Removed by QC
459,823
After PLINK QC
456,684
Michigan-Ready
73.6%
Retention Rate

2. QC Parameters

2a. Minor Allele Frequency (MAF)

MAF is the frequency of the less common allele at a biallelic locus. Variants with very low MAF (<1%) are removed because: (1) they provide little discriminatory power in frequency-based analyses, (2) at very low MAF a single genotyping error dramatically shifts the observed frequency, and (3) imputation accuracy for rare variants in the typed panel is limited.

ParameterValueEffect
--maf 0.01 MAF ≥ 1% Removes monomorphic and rare variants. The minor allele must appear in at least ~22 of 2,186 allele copies (1,093 individuals × 2).

2b. Hardy-Weinberg Equilibrium (HWE)

Extreme departure from HWE (p < 10⁻⁶) at a given locus almost always indicates a genotyping artefact rather than biology — e.g., miscalling heterozygotes as homozygotes due to poor cluster separation, copy number variation producing excess heterozygosity, or probe-specific technical failures. The lenient 10⁻⁶ threshold avoids removing loci under genuine selection or modest population substructure effects.

ParameterValueEffect
--hwe 1e-6 HWE p > 10⁻⁶ Removes only severely violating loci (767 of 620,492 autosomal variants).

2c. Per-SNP Genotyping Rate

Per-SNP missingness measures the fraction of individuals for whom a variant could not be called. The --geno 0.10 threshold requires ≥90% call rate per SNP (≤109 missing out of 1,093). This matches the original winter 2025 pipeline parameters, recovered forensically from the data (max lmiss among retained variants = 0.099270, confirming the 0.10 cutoff).

Threshold recovery (forensic). The winter 2025 snpqc log was overwritten by a --hardy run on Dec 25. The original --geno threshold was recovered by measuring the maximum per-SNP missingness among the 473,081 retained winter variants: max F_MISS = 0.099270 (109/1098), proving --geno 0.10 was used. See data/recover_winter_out.txt for full forensic output.
ParameterValueEffect
--geno 0.10 ≥ 90% call rate per SNP Removes SNPs missing in >10% of samples (>109 of 1,093 individuals). Removed 9,345 variants.

Threshold Sensitivity (spring 2026 data, 1,093 samples)

Each row shows the combined result of --maf 0.01 --hwe 1e-6 --geno X --chr 1-22 applied to the 620,492 autosomal variants in ConvSK_mind20_dedup.

--genogeno removedHWE removedMAF removedVariants retained% of autosomal
0.005330,2894479,237210,92234.0%
0.01237,3569597,116285,92546.1%
0.0384,920388129,591405,59365.4%
0.0538,323570140,349441,25071.1%
0.109,345767150,557459,82374.1%

2d. Autosome Restriction

ParameterValueEffect
--chr 1-22 Autosomes only Excludes 33,535 non-autosomal variants (chrX, chrY, chrMT, unplaced). 654,027 → 620,492.

3. QC Metric Distributions

Distributions of the three QC metrics across all 620,492 autosomal input variants (computed on the 1,093 post-dedup samples). Dashed red lines indicate applied thresholds.

Minor Allele Frequency (MAF) Spectrum
N = 620,492 autosomal variants (1,093 samples) | Dashed line = MAF 0.01 threshold
HWE p-value Distribution
-log₁₀(p) | Dashed line = 1e-6 threshold
Per-SNP Missingness Distribution
Fraction of samples missing | Dashed line = 10%
MAF spectrum: The characteristic L-shape (excess of rare variants) is expected from population genetic theory (site frequency spectrum). The 0.00–0.01 and 0.01–0.05 bins are nearly equal (157,939 and 157,971) — the MAF filter removes the leftmost bin.
Per-SNP missingness: 62% of autosomal variants (383,136) have <1% missingness. The long right tail (4,433 variants with >30% missingness) represents probe-specific failures. The 10% threshold sits well into the tail, cutting only 9,345 variants.

4. Input & Output

Input

FilesConvSK_mind20_dedup.bed, .bim, .fam
Location/staging/ALSU-analysis/spring2026/
Samples1,093
Variants654,027 (620,492 autosomal)
Genotyping rate0.9799 (97.99%)

Output — QC-filtered PLINK

FilesConvSK_mind20_dedup_snpqc.bed, .bim, .fam
Samples1,093 (unchanged)
Variants459,823 (160,669 removed by combined filters)
Genotyping rate0.9880 (98.80%)

Output — Final VCF for imputation

Filesimpute_chr1.vcf.gz through impute_chr22.vcf.gz (22 files)
Variants458,954 (after removing 479 I/D alleles + 390 duplicate positions)
Samples1,093 (2 Cyrillic IDs fixed to ASCII)
Formatbgzip-compressed VCF with tabix index

Output — Michigan-ready VCF (after fixref)

Filesmichigan_ready/impute_chr{1..22}.vcf.gz (22 files)
Variants456,684 (after chr rename + fixref strand alignment + palindromic removal)
Samples1,093
Chromosomeschr1–chr22 (hg38/GRCh38 naming convention)
StrandAll alleles aligned to GRCh38 reference forward strand
Formatbgzip-compressed VCF with tabix index

5. Commands

Step 3a: Apply combined SNP QC filters

cd /staging/ALSU-analysis/spring2026

plink --bfile ConvSK_mind20_dedup \
  --maf 0.01 \
  --hwe 1e-6 \
  --geno 0.10 \
  --chr 1-22 \
  --make-bed \
  --out ConvSK_mind20_dedup_snpqc
PLINK log output:
620492 out of 654027 variants loaded from .bim file.
1093 people loaded from .fam.
Total genotyping rate is 0.979912.
9345 variants removed due to missing genotype data (--geno).
--hwe: 767 variants removed due to Hardy-Weinberg exact test.
150557 variants removed due to minor allele threshold(s).
459823 variants and 1093 people pass filters and QC.

Step 3b: Build VCF-safe SNP list

# 479 variants have I/D alleles (insertion/deletion shorthand)
# These are not valid VCF alleles — must be excluded
# BIM cols: $1=chr  $2=snpID  $3=cM  $4=pos  $5=allele1  $6=allele2
awk '($5!="I" && $5!="D" && $6!="I" && $6!="D") {print $2}' \
  ConvSK_mind20_dedup_snpqc.bim > vcf_ok_snps.txt
wc -l vcf_ok_snps.txt
# 459344

Step 3c: Export per-chromosome VCFs

# --recode vcf bgz: export PLINK binary as bgzip-compressed VCF (.vcf.gz)
# tabix -f -p vcf: create tabix index (-f = overwrite, -p vcf = VCF format preset)
for chr in $(seq 1 22); do
  plink --bfile ConvSK_mind20_dedup_snpqc \
    --chr $chr --extract vcf_ok_snps.txt \
    --recode vcf bgz --out chr${chr}
  tabix -f -p vcf chr${chr}.vcf.gz
done

Step 3d: Concatenate and clean VCF

# bcftools concat: join VCFs from different regions (same samples)
#   -Oz: bgzip-compressed VCF output
bcftools concat -Oz -o impute_in.autosomes.vcf.gz chr{1..22}.vcf.gz
tabix -f -p vcf impute_in.autosomes.vcf.gz
# 459,344 variants

# Keep biallelic SNPs only (no multiallelic found)
# -m2 -M2: require exactly 2 alleles (min 2, max 2 = biallelic)
# -v snps: retain only SNP variant type (exclude indels/MNPs)
bcftools view -m2 -M2 -v snps -Oz \
  -o impute_in.autosomes.clean.vcf.gz impute_in.autosomes.vcf.gz
tabix -f -p vcf impute_in.autosomes.clean.vcf.gz
# 459,344 variants

# Remove duplicate positions (390 removed)
# bcftools norm -d all: remove all types of duplicate records at same position
bcftools norm -d all -Oz \
  -o impute_in.autosomes.clean.nodup.vcf.gz \
  impute_in.autosomes.clean.vcf.gz
tabix -f -p vcf impute_in.autosomes.clean.nodup.vcf.gz
# 458,954 variants

Step 3e: Fix non-ASCII sample IDs

Universal detection: Rather than hard-coding known Cyrillic IDs, this step auto-detects all non-ASCII sample names and maps Cyrillic homoglyphs to their Latin equivalents. If your dataset has no non-ASCII IDs, the rename file will be empty and the reheader is skipped. This also eliminates the need for a separate Cyrillic fix after imputation (the fix in Step 5 becomes unnecessary — Michigan receives clean ASCII IDs).
# Auto-detect non-ASCII sample IDs and generate rename mapping
# bcftools query -l: list all sample names from VCF header
# python3 inline: map all Cyrillic homoglyphs → Latin equivalents
bcftools query -l impute_in.autosomes.clean.nodup.vcf.gz | \
  python3 -c "
import sys
# Cyrillic-to-Latin homoglyph map (visually identical characters)
CYR2LAT = {
    '\u0410':'A', '\u0412':'B', '\u0415':'E', '\u041a':'K', '\u041c':'M',
    '\u041d':'H', '\u041e':'O', '\u0420':'P', '\u0421':'C', '\u0422':'T',
    '\u0425':'X', '\u0430':'a', '\u0435':'e', '\u043e':'o', '\u0440':'p',
    '\u0441':'c', '\u0445':'x', '\u043c':'m',
}
count = 0
for line in sys.stdin:
    old = line.rstrip('\n')
    new = ''.join(CYR2LAT.get(c, c) for c in old)
    if new != old:
        remaining = [c for c in new if ord(c) > 127]
        if remaining:
            print(f'WARNING: {old} still has unmapped non-ASCII: {remaining}', file=sys.stderr)
        print(f'{old}\t{new}')
        count += 1
print(f'{count} rename mapping(s) generated', file=sys.stderr)
" > rename_samples.tsv

cat rename_samples.tsv
# Spring 2026 dataset — expected 2 mappings:
#   808_03-25м        808_03-25m
#   1038_08-176Х-00006  1038_08-176X-00006

# Apply rename only if mappings were found
if [ -s rename_samples.tsv ]; then
  # bcftools reheader -s FILE: replace sample names using old→new tsv
  bcftools reheader -s rename_samples.tsv \
    -o impute_in.final.vcf.gz \
    impute_in.autosomes.clean.nodup.vcf.gz
else
  # No non-ASCII IDs found — just rename the file
  mv impute_in.autosomes.clean.nodup.vcf.gz impute_in.final.vcf.gz
fi

# Index and verify
bcftools index impute_in.final.vcf.gz          # create CSI index for random access
bcftools index -n impute_in.final.vcf.gz       # -n: print variant count (quick check)
# 458954

# Confirm no non-ASCII IDs remain
# grep -cP '[^\x00-\x7F]': count lines containing non-ASCII bytes
bcftools query -l impute_in.final.vcf.gz | grep -cP '[^\x00-\x7F]'
# 0

Step 3f: Split back to per-chromosome VCFs

The concatenation + cleanup + reheader steps above produced a single genome-wide VCF. Michigan Imputation Server requires one VCF per chromosome, so we split impute_in.final.vcf.gz back into per-chromosome files:

# bcftools view -r CHR: extract records from region (chromosome) CHR
for chr in $(seq 1 22); do
  bcftools view -r $chr impute_in.final.vcf.gz \
    -Oz -o impute_chr${chr}.vcf.gz
  bcftools index impute_chr${chr}.vcf.gz
done

Step 3g: Chromosome rename, reference strand alignment & palindromic removal

Why these steps are needed: PLINK exports chromosomes as bare numbers (1, 2, …) but Michigan Imputation Server with hg38 requires the chr prefix (chr1, chr2, …). Additionally, PLINK stores alleles in the orientation of the genotyping array, not the reference genome forward strand — without bcftools +fixref, ~59% of variants are on the wrong strand, causing massive strand-flip rejection at the imputation server (185,633 strand flips on first upload attempt). Finally, palindromic A/T and C/G SNPs cannot be unambiguously strand-resolved and must be removed.
# Create chromosome rename file (1 → chr1, ..., 22 → chr22)
for i in $(seq 1 22); do echo "$i chr$i"; done > chr_rename.tsv

mkdir -p michigan_ready

# For each chromosome: rename → fixref → remove palindromic
for chr in $(seq 1 22); do
  # --rename-chrs FILE: 2-column mapping (old_name  new_name) for chromosome IDs
  bcftools annotate --rename-chrs chr_rename.tsv \
    impute_chr${chr}.vcf.gz -Oz -o tmp_chr${chr}.vcf.gz
  bcftools index tmp_chr${chr}.vcf.gz

  # bcftools +fixref: plugin that aligns alleles to reference genome forward strand
  #   -Ou: uncompressed BCF output (fast for piping, never touches disk)
  #   --: separator between bcftools args and plugin-specific args
  #   -f FILE: reference FASTA for strand determination
  #   -m top: strand matching mode (Illumina TOP/BOT convention)
  # bcftools view -e EXPR: exclude records matching the filter expression
  #   Removes palindromic A/T and C/G SNPs (ambiguous strand, unresolvable)
  # bcftools index -t: create tabix (.tbi) index (vs default CSI)
  bcftools +fixref tmp_chr${chr}.vcf.gz -Ou \
    -- -f /staging/Genomes/Human/chr/GRCh38.fa -m top \
    2>fixref_chr${chr}.log \
  | bcftools view \
    -e 'TYPE="snp" && ((REF="A" && ALT="T") || (REF="T" && ALT="A") || (REF="C" && ALT="G") || (REF="G" && ALT="C"))' \
    -Oz -o michigan_ready/impute_chr${chr}.vcf.gz

  bcftools index -ft michigan_ready/impute_chr${chr}.vcf.gz
done

# Clean up intermediates (tmp_chr = 3g rename intermediates)
rm -f tmp_chr*.vcf.gz tmp_chr*.vcf.gz.csi
Critical: fixref stderr must NOT enter the pipe. When running fixref in a exec > log 2>&1 context, stderr from bcftools +fixref corrupts the BCF data stream piped to bcftools view. Always redirect fixref stderr to a separate file (2>fixref_chrN.log).
fixref results (representative — chr10, 23,476 input sites):
ref match:     9,713 (41.4%)
ref mismatch: 13,749 (58.6%)
  ↳ flipped:    9,794 (41.7%)
  ↳ swapped:    1,989 (8.5%)
  ↳ flip+swap:  1,971 (8.4%)
  ↳ unresolved: 0 (0.0%)
All 22 chromosomes show the same ~41/59 ref-match/mismatch pattern.

Per-chromosome counts (after fixref + palindromic removal)

ChrVariantsChrVariantsChrVariantsChrVariants
136,006725,5321316,7021910,365
237,939823,7081415,0182011,429
331,653919,9321514,250216,582
429,3031023,3861615,327226,790
527,3861122,4781713,576Total: 456,684
633,5041222,1401813,678

6. Quality Verification

✓ All 459,823 retained variants pass: MAF ≥ 0.01, HWE p > 10⁻⁶, genotyping rate ≥ 90%, autosomes only (chr 1–22). After VCF cleanup, fixref strand alignment, and palindromic removal, 456,684 variants × 1,093 samples are Michigan-ready for imputation.
# Verify PLINK QC variant count
wc -l ConvSK_mind20_dedup_snpqc.bim
# 459823

# --freq: output allele frequencies (.frq)
# --hardy: output Hardy-Weinberg exact test p-values (.hwe)
# --missing: output per-SNP (.lmiss) and per-sample (.imiss) missingness
plink --bfile ConvSK_mind20_dedup_snpqc \
  --freq --hardy --missing \
  --out snpqc_verify

# Check MAF — expect 0
# .frq cols: CHR  SNP  A1  A2  MAF  NCHROBS → $5=MAF
awk 'NR>1 && $5+0 < 0.01' snpqc_verify.frq | wc -l
# 0

# Check HWE — expect 0
# .hwe cols: CHR SNP TEST A1 A2 GENO O(HET) E(HET) P → $9=P
awk 'NR>1 && $9+0 < 1e-6 && $9 != "NA"' snpqc_verify.hwe | wc -l
# 0

# Check per-SNP missingness — expect 0
# .lmiss cols: CHR  SNP  N_MISS  N_GENO  F_MISS → $5=F_MISS
awk 'NR>1 && $5+0 > 0.10' snpqc_verify.lmiss | wc -l
# 0

Per-chromosome variant counts

ChrVariantsChrVariantsChrVariantsChrVariants
136,175725,6461316,6941910,424
238,146823,8061415,0592011,462
331,777920,0301514,324216,593
429,4271023,4491615,384226,789
527,4981122,5731713,596Total: 458,954
634,1621222,2581813,682

7. Variant Flow Summary

StageVariantsChange
Input (all chromosomes)654,027
Autosomal only (chr 1–22)620,492−33,535
After --geno 0.10611,147−9,345
After --hwe 1e-6610,380−767
After --maf 0.01459,823−150,557
Remove I/D alleles459,344−479
Remove duplicate positions458,954−390
Chr rename + fixref strand alignment458,954
Remove palindromic A/T C/G SNPs456,684−2,270
Michigan-ready VCF456,684
Note on filter interaction: PLINK applies all filters simultaneously in a single pass. The per-filter counts above (9,345 + 767 + 150,557 = 160,669) equal the total removed (620,492 − 459,823 = 160,669) because PLINK reports each variant's primary removal reason — the first failing filter in its internal evaluation order (geno → hwe → maf).