Step 3: SNP QC & VCF Export

1. Overview

After removing low-quality samples (Steps 1–2), this step applies variant-level quality control. Three filters are applied simultaneously: minimum minor allele frequency (MAF), Hardy-Weinberg equilibrium (HWE) test, and per-SNP genotyping rate. These filters remove uninformative, technically artefactual, or poorly genotyped variants before imputation.

Accurate imputation depends critically on having a clean set of typed variants. Low-quality or artefactual typed SNPs propagate errors into imputed genotypes — a phenomenon known as garbage in, garbage out. Pre-imputation QC is therefore more stringent than general analysis QC.

Why filter before imputation? Imputation servers (Michigan, TOPMed) use typed variants as the scaffold for inferring untyped variants. If a typed SNP has incorrect allele frequencies (due to genotyping error or population stratification artefact), all imputed variants in its LD neighbourhood inherit that error. Clean typed variants = clean imputed genome.

Key Metrics

620,492

Autosomal Input

160,669

Removed by QC

459,823

After PLINK QC

456,684

Michigan-Ready

73.6%

Retention Rate

2. QC Parameters

2a. Minor Allele Frequency (MAF)

MAF is the frequency of the less common allele at a biallelic locus. Variants with very low MAF (<1%) are removed because: (1) they provide little discriminatory power in frequency-based analyses, (2) at very low MAF a single genotyping error dramatically shifts the observed frequency, and (3) imputation accuracy for rare variants in the typed panel is limited.

Parameter	Value	Effect
`--maf 0.01`	MAF ≥ 1%	Removes monomorphic and rare variants. The minor allele must appear in at least ~22 of 2,186 allele copies (1,093 individuals × 2).

2b. Hardy-Weinberg Equilibrium (HWE)

Extreme departure from HWE (p < 10⁻⁶) at a given locus almost always indicates a genotyping artefact rather than biology — e.g., miscalling heterozygotes as homozygotes due to poor cluster separation, copy number variation producing excess heterozygosity, or probe-specific technical failures. The lenient 10⁻⁶ threshold avoids removing loci under genuine selection or modest population substructure effects.

Parameter	Value	Effect
`--hwe 1e-6`	HWE p > 10⁻⁶	Removes only severely violating loci (767 of 620,492 autosomal variants).

2c. Per-SNP Genotyping Rate

Per-SNP missingness measures the fraction of individuals for whom a variant could not be called. The --geno 0.10 threshold requires ≥90% call rate per SNP (≤109 missing out of 1,093). This matches the original winter 2025 pipeline parameters, recovered forensically from the data (max lmiss among retained variants = 0.099270, confirming the 0.10 cutoff).

Threshold recovery (forensic). The winter 2025 snpqc log was overwritten by a --hardy run on Dec 25. The original --geno threshold was recovered by measuring the maximum per-SNP missingness among the 473,081 retained winter variants: max F_MISS = 0.099270 (109/1098), proving --geno 0.10 was used. See data/recover_winter_out.txt for full forensic output.

Parameter	Value	Effect
`--geno 0.10`	≥ 90% call rate per SNP	Removes SNPs missing in >10% of samples (>109 of 1,093 individuals). Removed 9,345 variants.

Threshold Sensitivity (spring 2026 data, 1,093 samples)

Each row shows the combined result of --maf 0.01 --hwe 1e-6 --geno X --chr 1-22 applied to the 620,492 autosomal variants in ConvSK_mind20_dedup.

`--geno`	geno removed	HWE removed	MAF removed	Variants retained	% of autosomal
0.005	330,289	44	79,237	210,922	34.0%
0.01	237,356	95	97,116	285,925	46.1%
0.03	84,920	388	129,591	405,593	65.4%
0.05	38,323	570	140,349	441,250	71.1%
0.10	9,345	767	150,557	459,823	74.1%

2d. Autosome Restriction

Parameter	Value	Effect
`--chr 1-22`	Autosomes only	Excludes 33,535 non-autosomal variants (chrX, chrY, chrMT, unplaced). 654,027 → 620,492.

3. QC Metric Distributions

Distributions of the three QC metrics across all 620,492 autosomal input variants (computed on the 1,093 post-dedup samples). Dashed red lines indicate applied thresholds.

Minor Allele Frequency (MAF) Spectrum

N = 620,492 autosomal variants (1,093 samples) | Dashed line = MAF 0.01 threshold

HWE p-value Distribution

-log₁₀(p) | Dashed line = 1e-6 threshold

Per-SNP Missingness Distribution

Fraction of samples missing | Dashed line = 10%

MAF spectrum: The characteristic L-shape (excess of rare variants) is expected from population genetic theory (site frequency spectrum). The 0.00–0.01 and 0.01–0.05 bins are nearly equal (157,939 and 157,971) — the MAF filter removes the leftmost bin.

Per-SNP missingness: 62% of autosomal variants (383,136) have <1% missingness. The long right tail (4,433 variants with >30% missingness) represents probe-specific failures. The 10% threshold sits well into the tail, cutting only 9,345 variants.

4. Input & Output

Input

Files	ConvSK_mind20_dedup.bed, .bim, .fam
Location	`/staging/ALSU-analysis/spring2026/`
Samples	1,093
Variants	654,027 (620,492 autosomal)
Genotyping rate	0.9799 (97.99%)

Output — QC-filtered PLINK

Files	ConvSK_mind20_dedup_snpqc.bed, .bim, .fam
Samples	1,093 (unchanged)
Variants	459,823 (160,669 removed by combined filters)
Genotyping rate	0.9880 (98.80%)

Output — Final VCF for imputation

Files	impute_chr1.vcf.gz through impute_chr22.vcf.gz (22 files)
Variants	458,954 (after removing 479 I/D alleles + 390 duplicate positions)
Samples	1,093 (2 Cyrillic IDs fixed to ASCII)
Format	bgzip-compressed VCF with tabix index

Output — Michigan-ready VCF (after fixref)

Files	michigan_ready/impute_chr{1..22}.vcf.gz (22 files)
Variants	456,684 (after chr rename + fixref strand alignment + palindromic removal)
Samples	1,093
Chromosomes	chr1–chr22 (hg38/GRCh38 naming convention)
Strand	All alleles aligned to GRCh38 reference forward strand
Format	bgzip-compressed VCF with tabix index

5. Commands

Step 3a: Apply combined SNP QC filters

cd /staging/ALSU-analysis/spring2026

plink --bfile ConvSK_mind20_dedup \
  --maf 0.01 \
  --hwe 1e-6 \
  --geno 0.10 \
  --chr 1-22 \
  --make-bed \
  --out ConvSK_mind20_dedup_snpqc

PLINK log output:
620492 out of 654027 variants loaded from .bim file.
1093 people loaded from .fam.
Total genotyping rate is 0.979912.
9345 variants removed due to missing genotype data (--geno).
--hwe: 767 variants removed due to Hardy-Weinberg exact test.
150557 variants removed due to minor allele threshold(s).
459823 variants and 1093 people pass filters and QC.

Step 3b: Build VCF-safe SNP list

# 479 variants have I/D alleles (insertion/deletion shorthand)
# These are not valid VCF alleles — must be excluded
# BIM cols: $1=chr  $2=snpID  $3=cM  $4=pos  $5=allele1  $6=allele2
awk '($5!="I" && $5!="D" && $6!="I" && $6!="D") {print $2}' \
  ConvSK_mind20_dedup_snpqc.bim > vcf_ok_snps.txt
wc -l vcf_ok_snps.txt
# 459344

Step 3c: Export per-chromosome VCFs

# --recode vcf bgz: export PLINK binary as bgzip-compressed VCF (.vcf.gz)
# tabix -f -p vcf: create tabix index (-f = overwrite, -p vcf = VCF format preset)
for chr in $(seq 1 22); do
  plink --bfile ConvSK_mind20_dedup_snpqc \
    --chr $chr --extract vcf_ok_snps.txt \
    --recode vcf bgz --out chr${chr}
  tabix -f -p vcf chr${chr}.vcf.gz
done

Step 3d: Concatenate and clean VCF

# bcftools concat: join VCFs from different regions (same samples)
#   -Oz: bgzip-compressed VCF output
bcftools concat -Oz -o impute_in.autosomes.vcf.gz chr{1..22}.vcf.gz
tabix -f -p vcf impute_in.autosomes.vcf.gz
# 459,344 variants

# Keep biallelic SNPs only (no multiallelic found)
# -m2 -M2: require exactly 2 alleles (min 2, max 2 = biallelic)
# -v snps: retain only SNP variant type (exclude indels/MNPs)
bcftools view -m2 -M2 -v snps -Oz \
  -o impute_in.autosomes.clean.vcf.gz impute_in.autosomes.vcf.gz
tabix -f -p vcf impute_in.autosomes.clean.vcf.gz
# 459,344 variants

# Remove duplicate positions (390 removed)
# bcftools norm -d all: remove all types of duplicate records at same position
bcftools norm -d all -Oz \
  -o impute_in.autosomes.clean.nodup.vcf.gz \
  impute_in.autosomes.clean.vcf.gz
tabix -f -p vcf impute_in.autosomes.clean.nodup.vcf.gz
# 458,954 variants

Step 3e: Fix non-ASCII sample IDs

Universal detection: Rather than hard-coding known Cyrillic IDs, this step auto-detects all non-ASCII sample names and maps Cyrillic homoglyphs to their Latin equivalents. If your dataset has no non-ASCII IDs, the rename file will be empty and the reheader is skipped. This also eliminates the need for a separate Cyrillic fix after imputation (the fix in Step 5 becomes unnecessary — Michigan receives clean ASCII IDs).

# Auto-detect non-ASCII sample IDs and generate rename mapping
# bcftools query -l: list all sample names from VCF header
# python3 inline: map all Cyrillic homoglyphs → Latin equivalents
bcftools query -l impute_in.autosomes.clean.nodup.vcf.gz | \
  python3 -c "
import sys
# Cyrillic-to-Latin homoglyph map (visually identical characters)
CYR2LAT = {
    '\u0410':'A', '\u0412':'B', '\u0415':'E', '\u041a':'K', '\u041c':'M',
    '\u041d':'H', '\u041e':'O', '\u0420':'P', '\u0421':'C', '\u0422':'T',
    '\u0425':'X', '\u0430':'a', '\u0435':'e', '\u043e':'o', '\u0440':'p',
    '\u0441':'c', '\u0445':'x', '\u043c':'m',
}
count = 0
for line in sys.stdin:
    old = line.rstrip('\n')
    new = ''.join(CYR2LAT.get(c, c) for c in old)
    if new != old:
        remaining = [c for c in new if ord(c) > 127]
        if remaining:
            print(f'WARNING: {old} still has unmapped non-ASCII: {remaining}', file=sys.stderr)
        print(f'{old}\t{new}')
        count += 1
print(f'{count} rename mapping(s) generated', file=sys.stderr)
" > rename_samples.tsv

cat rename_samples.tsv
# Spring 2026 dataset — expected 2 mappings:
#   808_03-25м        808_03-25m
#   1038_08-176Х-00006  1038_08-176X-00006

# Apply rename only if mappings were found
if [ -s rename_samples.tsv ]; then
  # bcftools reheader -s FILE: replace sample names using old→new tsv
  bcftools reheader -s rename_samples.tsv \
    -o impute_in.final.vcf.gz \
    impute_in.autosomes.clean.nodup.vcf.gz
else
  # No non-ASCII IDs found — just rename the file
  mv impute_in.autosomes.clean.nodup.vcf.gz impute_in.final.vcf.gz
fi

# Index and verify
bcftools index impute_in.final.vcf.gz          # create CSI index for random access
bcftools index -n impute_in.final.vcf.gz       # -n: print variant count (quick check)
# 458954

# Confirm no non-ASCII IDs remain
# grep -cP '[^\x00-\x7F]': count lines containing non-ASCII bytes
bcftools query -l impute_in.final.vcf.gz | grep -cP '[^\x00-\x7F]'
# 0

Step 3f: Split back to per-chromosome VCFs

The concatenation + cleanup + reheader steps above produced a single genome-wide VCF. Michigan Imputation Server requires one VCF per chromosome, so we split impute_in.final.vcf.gz back into per-chromosome files:

# bcftools view -r CHR: extract records from region (chromosome) CHR
for chr in $(seq 1 22); do
  bcftools view -r $chr impute_in.final.vcf.gz \
    -Oz -o impute_chr${chr}.vcf.gz
  bcftools index impute_chr${chr}.vcf.gz
done

Step 3g: Chromosome rename, reference strand alignment & palindromic removal

Why these steps are needed: PLINK exports chromosomes as bare numbers (1, 2, …) but Michigan Imputation Server with hg38 requires the chr prefix (chr1, chr2, …). Additionally, PLINK stores alleles in the orientation of the genotyping array, not the reference genome forward strand — without bcftools +fixref, ~59% of variants are on the wrong strand, causing massive strand-flip rejection at the imputation server (185,633 strand flips on first upload attempt). Finally, palindromic A/T and C/G SNPs cannot be unambiguously strand-resolved and must be removed.

# Create chromosome rename file (1 → chr1, ..., 22 → chr22)
for i in $(seq 1 22); do echo "$i chr$i"; done > chr_rename.tsv

mkdir -p michigan_ready

# For each chromosome: rename → fixref → remove palindromic
for chr in $(seq 1 22); do
  # --rename-chrs FILE: 2-column mapping (old_name  new_name) for chromosome IDs
  bcftools annotate --rename-chrs chr_rename.tsv \
    impute_chr${chr}.vcf.gz -Oz -o tmp_chr${chr}.vcf.gz
  bcftools index tmp_chr${chr}.vcf.gz

  # bcftools +fixref: plugin that aligns alleles to reference genome forward strand
  #   -Ou: uncompressed BCF output (fast for piping, never touches disk)
  #   --: separator between bcftools args and plugin-specific args
  #   -f FILE: reference FASTA for strand determination
  #   -m top: strand matching mode (Illumina TOP/BOT convention)
  # bcftools view -e EXPR: exclude records matching the filter expression
  #   Removes palindromic A/T and C/G SNPs (ambiguous strand, unresolvable)
  # bcftools index -t: create tabix (.tbi) index (vs default CSI)
  bcftools +fixref tmp_chr${chr}.vcf.gz -Ou \
    -- -f /staging/Genomes/Human/chr/GRCh38.fa -m top \
    2>fixref_chr${chr}.log \
  | bcftools view \
    -e 'TYPE="snp" && ((REF="A" && ALT="T") || (REF="T" && ALT="A") || (REF="C" && ALT="G") || (REF="G" && ALT="C"))' \
    -Oz -o michigan_ready/impute_chr${chr}.vcf.gz

  bcftools index -ft michigan_ready/impute_chr${chr}.vcf.gz
done

# Clean up intermediates (tmp_chr = 3g rename intermediates)
rm -f tmp_chr*.vcf.gz tmp_chr*.vcf.gz.csi

Critical: fixref stderr must NOT enter the pipe. When running fixref in a exec > log 2>&1 context, stderr from bcftools +fixref corrupts the BCF data stream piped to bcftools view. Always redirect fixref stderr to a separate file (2>fixref_chrN.log).

fixref results (representative — chr10, 23,476 input sites):
ref match:     9,713 (41.4%)
ref mismatch: 13,749 (58.6%)
  ↳ flipped:    9,794 (41.7%)
  ↳ swapped:    1,989 (8.5%)
  ↳ flip+swap: 1,971 (8.4%)
  ↳ unresolved: 0 (0.0%)
All 22 chromosomes show the same ~41/59 ref-match/mismatch pattern.

Per-chromosome counts (after fixref + palindromic removal)

Chr	Variants	Chr	Variants	Chr	Variants	Chr	Variants
1	36,006	7	25,532	13	16,702	19	10,365
2	37,939	8	23,708	14	15,018	20	11,429
3	31,653	9	19,932	15	14,250	21	6,582
4	29,303	10	23,386	16	15,327	22	6,790
5	27,386	11	22,478	17	13,576	Total: 456,684
6	33,504	12	22,140	18	13,678

6. Quality Verification

✓ All 459,823 retained variants pass: MAF ≥ 0.01, HWE p > 10⁻⁶, genotyping rate ≥ 90%, autosomes only (chr 1–22). After VCF cleanup, fixref strand alignment, and palindromic removal, 456,684 variants × 1,093 samples are Michigan-ready for imputation.

# Verify PLINK QC variant count
wc -l ConvSK_mind20_dedup_snpqc.bim
# 459823

# --freq: output allele frequencies (.frq)
# --hardy: output Hardy-Weinberg exact test p-values (.hwe)
# --missing: output per-SNP (.lmiss) and per-sample (.imiss) missingness
plink --bfile ConvSK_mind20_dedup_snpqc \
  --freq --hardy --missing \
  --out snpqc_verify

# Check MAF — expect 0
# .frq cols: CHR  SNP  A1  A2  MAF  NCHROBS → $5=MAF
awk 'NR>1 && $5+0 < 0.01' snpqc_verify.frq | wc -l
# 0

# Check HWE — expect 0
# .hwe cols: CHR SNP TEST A1 A2 GENO O(HET) E(HET) P → $9=P
awk 'NR>1 && $9+0 < 1e-6 && $9 != "NA"' snpqc_verify.hwe | wc -l
# 0

# Check per-SNP missingness — expect 0
# .lmiss cols: CHR  SNP  N_MISS  N_GENO  F_MISS → $5=F_MISS
awk 'NR>1 && $5+0 > 0.10' snpqc_verify.lmiss | wc -l
# 0

Per-chromosome variant counts

Chr	Variants	Chr	Variants	Chr	Variants	Chr	Variants
1	36,175	7	25,646	13	16,694	19	10,424
2	38,146	8	23,806	14	15,059	20	11,462
3	31,777	9	20,030	15	14,324	21	6,593
4	29,427	10	23,449	16	15,384	22	6,789
5	27,498	11	22,573	17	13,596	Total: 458,954
6	34,162	12	22,258	18	13,682

7. Variant Flow Summary

Stage	Variants	Change
Input (all chromosomes)	654,027	—
Autosomal only (chr 1–22)	620,492	−33,535
After --geno 0.10	611,147	−9,345
After --hwe 1e-6	610,380	−767
After --maf 0.01	459,823	−150,557
Remove I/D alleles	459,344	−479
Remove duplicate positions	458,954	−390
Chr rename + fixref strand alignment	458,954	—
Remove palindromic A/T C/G SNPs	456,684	−2,270
Michigan-ready VCF	456,684

Note on filter interaction: PLINK applies all filters simultaneously in a single pass. The per-filter counts above (9,345 + 767 + 150,557 = 160,669) equal the total removed (620,492 − 459,823 = 160,669) because PLINK reports each variant's primary removal reason — the first failing filter in its internal evaluation order (geno → hwe → maf).

Step 3: SNP Quality Control & VCF Export