Step 3a: Apply combined SNP QC filters
cd /staging/ALSU-analysis/spring2026
plink --bfile ConvSK_mind20_dedup \
  --maf 0.01 \
  --hwe 1e-6 \
  --geno 0.10 \
  --chr 1-22 \
  --make-bed \
  --out ConvSK_mind20_dedup_snpqc
PLINK log output:
620492 out of 654027 variants loaded from .bim file.
1093 people loaded from .fam.
Total genotyping rate is 0.979912.
9345 variants removed due to missing genotype data (--geno).
--hwe: 767 variants removed due to Hardy-Weinberg exact test.
150557 variants removed due to minor allele threshold(s).
459823 variants and 1093 people pass filters and QC.
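As a quick arithmetic sanity check (the numbers below are copied from the PLINK log above):

```shell
# 620,492 loaded minus the three filter removals should equal the
# 459,823 variants reported as passing QC.
loaded=620492; geno=9345; hwe=767; maf=150557
echo $((loaded - geno - hwe - maf))
# -> 459823
```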
Step 3b: Build VCF-safe SNP list
# 479 variants have I/D alleles (insertion/deletion shorthand)
# These are not valid VCF alleles — must be excluded
# BIM cols: $1=chr $2=snpID $3=cM $4=pos $5=allele1 $6=allele2
awk '($5!="I" && $5!="D" && $6!="I" && $6!="D") {print $2}' \
  ConvSK_mind20_dedup_snpqc.bim > vcf_ok_snps.txt
wc -l vcf_ok_snps.txt
# 459344
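To illustrate what this awk filter keeps and drops, a self-contained toy example (toy.bim is a made-up three-variant file, not part of the dataset; real .bim files are tab-separated, but awk's default field splitting handles spaces the same way):

```shell
# Toy .bim (columns: chr, snpID, cM, pos, allele1, allele2).
# rs2 uses the I/D indel shorthand and should be dropped; rs1 and rs3 pass.
printf '1 rs1 0 1000 A G\n1 rs2 0 2000 I D\n2 rs3 0 3000 C T\n' > toy.bim
awk '($5!="I" && $5!="D" && $6!="I" && $6!="D") {print $2}' toy.bim
# -> rs1
# -> rs3
```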
Step 3c: Export per-chromosome VCFs
# --recode vcf bgz: export PLINK binary as bgzip-compressed VCF (.vcf.gz)
# tabix -f -p vcf: create tabix index (-f = overwrite, -p vcf = VCF format preset)
for chr in $(seq 1 22); do
  plink --bfile ConvSK_mind20_dedup_snpqc \
    --chr $chr --extract vcf_ok_snps.txt \
    --recode vcf bgz --out chr${chr}
  tabix -f -p vcf chr${chr}.vcf.gz
done
Step 3d: Concatenate and clean VCF
# bcftools concat: join VCFs from different regions (same samples)
# -Oz: bgzip-compressed VCF output
bcftools concat -Oz -o impute_in.autosomes.vcf.gz chr{1..22}.vcf.gz
tabix -f -p vcf impute_in.autosomes.vcf.gz
# 459,344 variants
# Keep biallelic SNPs only (no multiallelic found)
# -m2 -M2: require exactly 2 alleles (min 2, max 2 = biallelic)
# -v snps: retain only SNP variant type (exclude indels/MNPs)
bcftools view -m2 -M2 -v snps -Oz \
  -o impute_in.autosomes.clean.vcf.gz impute_in.autosomes.vcf.gz
tabix -f -p vcf impute_in.autosomes.clean.vcf.gz
# 459,344 variants
# Remove duplicate positions (390 removed)
# bcftools norm -d all: remove all types of duplicate records at same position
bcftools norm -d all -Oz \
  -o impute_in.autosomes.clean.nodup.vcf.gz \
  impute_in.autosomes.clean.vcf.gz
tabix -f -p vcf impute_in.autosomes.clean.nodup.vcf.gz
# 458,954 variants
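The de-duplication logic can be sketched standalone with awk (keep only the first record seen at each CHROM:POS); toy_sites.txt is illustrative only, and the real pipeline should keep using bcftools norm:

```shell
# Toy site list (chrom, pos, id, ref, alt); the second line duplicates
# position 1:1000 and is dropped, mirroring what `bcftools norm -d all`
# does on the real VCF.
printf '1 1000 rs1 A G\n1 1000 rs1_dup A C\n1 2000 rs2 C T\n' > toy_sites.txt
awk '!seen[$1":"$2]++' toy_sites.txt
# -> 1 1000 rs1 A G
# -> 1 2000 rs2 C T
```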
Step 3e: Fix non-ASCII sample IDs
Universal detection: Rather than hard-coding known Cyrillic IDs, this step
auto-detects all non-ASCII sample names and maps Cyrillic homoglyphs to their
Latin equivalents. If your dataset has no non-ASCII IDs, the rename file will be empty
and the reheader is skipped. This also eliminates the need for a separate Cyrillic fix
after imputation (the fix in Step 5 becomes unnecessary — Michigan receives clean ASCII IDs).
# Auto-detect non-ASCII sample IDs and generate rename mapping
# bcftools query -l: list all sample names from VCF header
# python3 inline: map all Cyrillic homoglyphs → Latin equivalents
bcftools query -l impute_in.autosomes.clean.nodup.vcf.gz | \
python3 -c "
import sys
# Cyrillic-to-Latin homoglyph map (visually identical characters)
CYR2LAT = {
    '\u0410':'A', '\u0412':'B', '\u0415':'E', '\u041a':'K', '\u041c':'M',
    '\u041d':'H', '\u041e':'O', '\u0420':'P', '\u0421':'C', '\u0422':'T',
    '\u0425':'X', '\u0430':'a', '\u0435':'e', '\u043e':'o', '\u0440':'p',
    '\u0441':'c', '\u0445':'x', '\u043c':'m',
}
count = 0
for line in sys.stdin:
    old = line.rstrip('\n')
    new = ''.join(CYR2LAT.get(c, c) for c in old)
    if new != old:
        remaining = [c for c in new if ord(c) > 127]
        if remaining:
            print(f'WARNING: {old} still has unmapped non-ASCII: {remaining}', file=sys.stderr)
        print(f'{old}\t{new}')
        count += 1
print(f'{count} rename mapping(s) generated', file=sys.stderr)
" > rename_samples.tsv
cat rename_samples.tsv
# Spring 2026 dataset — expected 2 mappings:
# 808_03-25м 808_03-25m
# 1038_08-176Х-00006 1038_08-176X-00006
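A standalone spot check of the homoglyph mapping on these two IDs (the snippet reuses only the two relevant map entries, with the Cyrillic characters written as \u escapes):

```shell
# Map the two Spring 2026 IDs through the relevant homoglyph entries:
# 'м' (U+043C) -> 'm', 'Х' (U+0425) -> 'X'.
python3 -c "
CYR2LAT = {'\u043c': 'm', '\u0425': 'X'}
for old in ['808_03-25\u043c', '1038_08-176\u0425-00006']:
    print(old + ' -> ' + ''.join(CYR2LAT.get(c, c) for c in old))
"
# -> 808_03-25м -> 808_03-25m
# -> 1038_08-176Х-00006 -> 1038_08-176X-00006
```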
# Apply rename only if mappings were found
if [ -s rename_samples.tsv ]; then
  # bcftools reheader -s FILE: replace sample names using old→new tsv
  bcftools reheader -s rename_samples.tsv \
    -o impute_in.final.vcf.gz \
    impute_in.autosomes.clean.nodup.vcf.gz
else
  # No non-ASCII IDs found — just rename the file
  mv impute_in.autosomes.clean.nodup.vcf.gz impute_in.final.vcf.gz
fi
# Index and verify
bcftools index impute_in.final.vcf.gz # create CSI index for random access
bcftools index -n impute_in.final.vcf.gz # -n: print variant count (quick check)
# 458954
# Confirm no non-ASCII IDs remain
# grep -cP '[^\x00-\x7F]': count lines containing non-ASCII bytes
bcftools query -l impute_in.final.vcf.gz | grep -cP '[^\x00-\x7F]'
# 0
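To confirm the detector itself behaves as expected, a standalone check (assumes GNU grep built with PCRE support for -P; toy_ids.txt is a made-up two-line file):

```shell
# One clean ID and one containing Cyrillic 'С' (U+0421); the non-ASCII
# pattern should match exactly one line.
python3 -c "print('SAMPLE_001'); print('SAMPLE_00\u0421')" > toy_ids.txt
grep -cP '[^\x00-\x7F]' toy_ids.txt
# -> 1
```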
Step 3f: Split back to per-chromosome VCFs
The concatenation + cleanup + reheader steps above produced a single genome-wide VCF.
Michigan Imputation Server requires one VCF per chromosome, so we split
impute_in.final.vcf.gz back into per-chromosome files:
# bcftools view -r CHR: extract records from region (chromosome) CHR
for chr in $(seq 1 22); do
  bcftools view -r $chr impute_in.final.vcf.gz \
    -Oz -o impute_chr${chr}.vcf.gz
  bcftools index impute_chr${chr}.vcf.gz
done
Step 3g: Chromosome rename, reference strand alignment & palindromic removal
Why these steps are needed: PLINK exports chromosomes as bare numbers (1, 2, …)
but Michigan Imputation Server with hg38 requires the chr prefix (chr1, chr2, …).
Additionally, PLINK stores alleles in the orientation of the genotyping array, not the reference
genome forward strand — without bcftools +fixref, ~59% of variants are on the wrong
strand, causing massive strand-flip rejection at the imputation server (185,633 strand flips on
first upload attempt). Finally, palindromic A/T and C/G SNPs cannot be unambiguously strand-resolved
and must be removed.
# Create chromosome rename file (1 → chr1, ..., 22 → chr22)
for i in $(seq 1 22); do echo "$i chr$i"; done > chr_rename.tsv
mkdir -p michigan_ready
# For each chromosome: rename → fixref → remove palindromic
for chr in $(seq 1 22); do
  # --rename-chrs FILE: 2-column mapping (old_name new_name) for chromosome IDs
  bcftools annotate --rename-chrs chr_rename.tsv \
    impute_chr${chr}.vcf.gz -Oz -o tmp_chr${chr}.vcf.gz
  bcftools index tmp_chr${chr}.vcf.gz
  # bcftools +fixref: plugin that aligns alleles to reference genome forward strand
  #   -Ou: uncompressed BCF output (fast for piping, never touches disk)
  #   --: separator between bcftools args and plugin-specific args
  #   -f FILE: reference FASTA for strand determination
  #   -m top: strand matching mode (Illumina TOP/BOT convention)
  # bcftools view -e EXPR: exclude records matching the filter expression
  #   (removes palindromic A/T and C/G SNPs: ambiguous strand, unresolvable)
  # bcftools index -t: create tabix (.tbi) index (vs default CSI)
  bcftools +fixref tmp_chr${chr}.vcf.gz -Ou \
    -- -f /staging/Genomes/Human/chr/GRCh38.fa -m top \
    2>fixref_chr${chr}.log \
    | bcftools view \
      -e 'TYPE="snp" && ((REF="A" && ALT="T") || (REF="T" && ALT="A") || (REF="C" && ALT="G") || (REF="G" && ALT="C"))' \
      -Oz -o michigan_ready/impute_chr${chr}.vcf.gz
  bcftools index -ft michigan_ready/impute_chr${chr}.vcf.gz
done
# Clean up intermediates (tmp_chr = 3g rename intermediates)
rm -f tmp_chr*.vcf.gz tmp_chr*.vcf.gz.csi
Critical: fixref stderr must NOT enter the pipe. When the script runs under an
exec > log 2>&1 redirection, stderr from bcftools +fixref can get mixed into
the BCF data stream piped to bcftools view, corrupting it. Always redirect
fixref stderr to a separate file (2>fixref_chrN.log).
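The palindromic definition used in the -e expression above can be exercised on its own with a toy allele-pair list (toy_pairs.txt is made up for the demo):

```shell
# REF/ALT pairs A/T, T/A, C/G, G/C are strand-ambiguous and dropped;
# anything else is kept.
printf 'A T\nC G\nA G\nC T\n' > toy_pairs.txt
awk '($1=="A"&&$2=="T")||($1=="T"&&$2=="A")||($1=="C"&&$2=="G")||($1=="G"&&$2=="C") \
       {print $1"/"$2" -> drop (palindromic)"; next}
     {print $1"/"$2" -> keep"}' toy_pairs.txt
# -> A/T -> drop (palindromic)
# -> C/G -> drop (palindromic)
# -> A/G -> keep
# -> C/T -> keep
```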
fixref results (representative — chr10, 23,476 input sites):
ref match: 9,713 (41.4%)
ref mismatch: 13,749 (58.6%)
↳ flipped: 9,794 (41.7%)
↳ swapped: 1,989 (8.5%)
↳ flip+swap: 1,971 (8.4%)
↳ unresolved: 0 (0.0%)
All 22 chromosomes show the same ~41/59 ref-match/mismatch pattern.
Per-chromosome counts (after fixref + palindromic removal)
| Chr | Variants | Chr | Variants | Chr | Variants | Chr | Variants |
|-----|----------|-----|----------|-----|----------|-----|----------|
| 1 | 36,006 | 7 | 25,532 | 13 | 16,702 | 19 | 10,365 |
| 2 | 37,939 | 8 | 23,708 | 14 | 15,018 | 20 | 11,429 |
| 3 | 31,653 | 9 | 19,932 | 15 | 14,250 | 21 | 6,582 |
| 4 | 29,303 | 10 | 23,386 | 16 | 15,327 | 22 | 6,790 |
| 5 | 27,386 | 11 | 22,478 | 17 | 13,576 | Total | 456,684 |
| 6 | 33,504 | 12 | 22,140 | 18 | 13,678 | | |
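A quick check that the per-chromosome counts are internally consistent with the stated total:

```shell
# Sum the 22 post-QC per-chromosome variant counts; should print 456684.
printf '%s\n' 36006 37939 31653 29303 27386 33504 \
              25532 23708 19932 23386 22478 22140 \
              16702 15018 14250 15327 13576 13678 \
              10365 11429 6582 6790 \
  | awk '{s += $1} END {print s}'
# -> 456684
```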