
Step 5: Sample ID Normalization

Normalize sample IDs after imputation (prefix stripping / encoding fixes)

✓ Spring 2026 — April 11, 2026 ✓ Winter 2025 — December 22, 2025

1. Overview

After imputation, Michigan Imputation Server v2 may alter sample IDs in the output VCFs. This step detects and corrects any ID mismatches so that downstream tools (PLINK, ADMIXTURE) see consistent identifiers.

Spring 2026 Run

Michigan v2 Prefix Quirk: Chromosomes 1–9 received a numeric prefix (1_sampleID, 2_sampleID, …), while chromosomes 10–22 kept the original clean IDs. The prefixes were stripped via per-chromosome bcftools reheader; chr10-22 were symlinked without modification.
Total Samples: 1,093
Cyrillic Issues: 0
CHR Prefix-Stripped: 9
CHR Symlinked (clean): 13
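The prefix pattern can be illustrated on a toy sample list. This is a minimal sketch, assuming prefixed IDs have the form digits, underscore, original ID; the function name `build_strip_map` and the sample IDs are hypothetical and not taken from the run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read sample IDs on stdin and print "old<TAB>new"
# for every ID that starts with a numeric prefix followed by an underscore.
# IDs without such a prefix produce no output line.
build_strip_map() {
  awk '/^[0-9]+_/ { old = $0; sub(/^[0-9]+_/, ""); print old "\t" $0 }'
}

# Toy input (made-up IDs): two prefixed IDs and one that is already clean.
printf '%s\n' '1_A-001' '2_B-002' 'C-003' | build_strip_map
```

Only the two prefixed IDs yield a mapping line, which is the two-column "old new" shape that `bcftools reheader -s` accepts.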

Winter 2025 Run

Historical: In winter 2025, two sample IDs contained Cyrillic homoglyphs (м, Х) in place of Latin m and X, so those IDs differed between the PLINK and VCF files. Step 3e now auto-detects and corrects all Cyrillic homoglyphs before Michigan upload, so this post-imputation fix is no longer needed.
Total Samples: 1,098
IDs Fixed (м→m, Х→X): 2
Normalized: 1,098
Match Rate: 100%
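Homoglyphs are invisible to the eye but not to a byte-level search. The sketch below shows one possible way to flag IDs that contain anything outside printable ASCII; the function name `find_non_ascii_ids` and the IDs are hypothetical, and this is not necessarily how step 3e performs its detection.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "line_number:ID" for every ID on stdin that
# contains a byte outside printable ASCII (space through tilde).
# LC_ALL=C makes grep compare raw bytes; -a forces text mode.
find_non_ascii_ids() {
  LC_ALL=C grep -a -n '[^ -~]' || true
}

# Toy input: the second ID ends in Cyrillic м (U+043C, UTF-8 bytes \320\274),
# which looks like Latin m but is a different character.
printf 'S-02m\nS-02\320\274\n' | find_non_ascii_ids
```

The `|| true` keeps the function from returning a failure status when every ID is clean, which matters if the calling script runs under `set -e`.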

2. Input Data

Spring 2026

Files: chr1-chr22.dose.vcf.gz (from Michigan v2)
Location: /staging/ALSU-analysis/spring2026/imputation/
Samples: 1,093 (all ASCII; Cyrillic fixed in step 3e)
Issue: Michigan v2 adds numeric prefix to chr1-9 sample IDs; chr10-22 are clean
Total Variants: 48,887,364 across 22 chromosomes

Winter 2025

Files: chr1-chr22.dose.vcf.gz + ConvSK_mind20_dedup_snpqc.fam
Location: /staging/ALSU-analysis/winter2025/
Samples: 1,098 (2 with Cyrillic homoglyphs)
Issue: Sample IDs in VCF ≠ PLINK metadata due to encoding

3. Output Data

Spring 2026

Files: chr1-chr22.dose.vcf.gz (9 reheadered + 13 symlinked)
Location: /staging/ALSU-analysis/spring2026/post_imputation/
Samples: 1,093 (all clean ASCII, no prefix)
Variants: 48,887,364 (unchanged)
✓ Verification: All 1,093 sample IDs are clean ASCII across all 22 chromosomes. No Cyrillic issues (fixed in step 3e). No Michigan prefix remaining.
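A check of this kind could be scripted roughly as follows. This is a sketch, not the command used in the run: it assumes each chromosome's sample list has been written to a plain-text file (one ID per line, for example with `bcftools query -l`), and the function name `verify_sample_lists` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: compare each per-chromosome sample list against a
# reference list and flag any list that still contains prefixed IDs.
# Lists are plain text, one ID per line, e.g. produced with:
#   bcftools query -l chr1.dose.vcf.gz > samples_chr1.txt
verify_sample_lists() {
  local ref="$1"
  shift
  local status=0
  local f
  for f in "$@"; do
    if [ "$(cat "$ref")" != "$(cat "$f")" ]; then
      echo "MISMATCH: $f differs from $ref"
      status=1
    fi
    if grep -Eq '^[0-9]+_' "$f"; then
      echo "PREFIX: $f still contains prefixed IDs"
      status=1
    fi
  done
  return "$status"
}

# Toy demonstration with made-up IDs: chr1 matches the reference, chr2 does not.
tmp=$(mktemp -d)
printf '%s\n' 'A-001' 'B-002' > "$tmp/samples_chr10.txt"
printf '%s\n' 'A-001' 'B-002' > "$tmp/samples_chr1.txt"
printf '%s\n' '1_A-001' '2_B-002' > "$tmp/samples_chr2.txt"
verify_sample_lists "$tmp/samples_chr10.txt" \
  "$tmp/samples_chr1.txt" "$tmp/samples_chr2.txt" || echo "verification failed"
```

Using one chromosome's list as the reference also catches reordered or missing samples, not only leftover prefixes.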

Winter 2025

FieldValue
Fileschr1-chr22.dose_normalized.vcf.gz (22 VCF files)
Location/staging/ALSU-analysis/winter2025/
Samples1,098 (all ASCII IDs)
Variants10,846,569 SNPs (unchanged)
✓ Verification: All sample IDs ASCII-only, matching PLINK metadata

4. Commands Executed

Spring 2026 — Michigan Prefix Detection & Stripping

# Check first sample ID in each chromosome for numeric prefix
for chr in $(seq 1 22); do
  first=$(bcftools query -l chr${chr}.dose.vcf.gz | head -1)
  echo "chr${chr}: $first"
done
# Result: chr1-9 have prefix (1_sampleID, 2_sampleID, ...)
#         chr10-22 have clean IDs (no prefix)

# Create strip-prefix mapping from chr1's sample list
$ bcftools query -l chr1.dose.vcf.gz > sample_list.txt
$ awk -F'_' 'NF>1 && $1~/^[0-9]+$/ { old=$0; sub(/^[0-9]+_/,""); print old"\t"$0 }' \
    sample_list.txt > strip_prefix.tsv
# 1,093 prefix mappings created
# Per-chromosome: strip prefix (chr1-9) or symlink (chr10-22)
# Run from the directory that contains imputation/ and post_imputation/;
# strip_prefix.tsv is expected in the current directory
for chr in $(seq 1 22); do
  first=$(bcftools query -l imputation/chr${chr}.dose.vcf.gz | head -1)
  if echo "$first" | grep -qP '^\d+_'; then
    # Has prefix: strip it
    bcftools reheader -s strip_prefix.tsv \
      -o post_imputation/chr${chr}.dose.vcf.gz \
      imputation/chr${chr}.dose.vcf.gz
  else
    # No prefix: symlink. A relative target is resolved from the link's own
    # directory, hence the leading ../
    ln -sf ../imputation/chr${chr}.dose.vcf.gz \
      post_imputation/chr${chr}.dose.vcf.gz
  fi
done
# chr1-9: prefix stripped via bcftools reheader
# chr10-22: symlinked (already clean)
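The prefix test relies on `grep -P`, a GNU grep extension that is not available in every grep build. A portable variant of the same test can be sketched with POSIX extended regular expressions; the function name `has_numeric_prefix` and the IDs below are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the ID starts with digits + "_".
# Uses POSIX extended regex (-E) instead of the GNU-only -P.
has_numeric_prefix() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+_'
}

# Toy IDs (made up): one prefixed, two clean.
for id in '1_A-001' 'A-001' '08-B-002'; do
  if has_numeric_prefix "$id"; then
    echo "$id: prefixed"
  else
    echo "$id: clean"
  fi
done
```

An ID that begins with digits followed by a hyphen is treated as clean. An original ID that itself starts with digits and an underscore would be misclassified by this pattern, so the test is only safe when no such IDs exist in the cohort.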

Winter 2025 — Cyrillic Homoglyph Fix

Identify samples with Cyrillic homoglyphs

# comm -3: output lines unique to either file (symmetric difference)
# Compares two SORTED inputs; -3 suppresses lines common to both
# <(...): bash process substitution, feeds command output as a virtual file
# sed: strips the numeric prefix + underscore added by Michigan server to sample IDs
$ comm -3 \
    <(awk '{print $2}' ConvSK_mind20_dedup.fam | sort) \
    <(bcftools query -l chr1.dose.vcf.gz | sed 's/^[0-9]\+_//' | sort)

Found 2 samples with Cyrillic homoglyphs:
- 03-25м (Cyrillic м U+043C)
- 08-176Х-00006 (Cyrillic Х U+0425)
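How `comm -3` exposes a homoglyph can be seen on a toy pair of lists. The sketch below uses made-up IDs and hypothetical file names, and writes to temporary files instead of using process substitution so that it also runs in a plain POSIX shell. `LC_ALL=C` is set on both `sort` and `comm` so that they agree on byte-wise ordering.

```shell
#!/usr/bin/env bash
# Toy illustration with made-up IDs; file names are hypothetical.
tmpdir=$(mktemp -d)

# PLINK-side IDs: the second one ends in Cyrillic м (UTF-8 bytes \320\274).
printf 'S-01\nS-02\320\274\n' | LC_ALL=C sort > "$tmpdir/fam_ids.txt"

# VCF-side IDs carry a numeric prefix; the second one ends in Latin m.
printf '1_S-01\n2_S-02m\n' | sed -E 's/^[0-9]+_//' | LC_ALL=C sort > "$tmpdir/vcf_ids.txt"

# Column 1: IDs only in the PLINK list.
# Column 2 (tab-indented): IDs only in the VCF list.
mismatches=$(LC_ALL=C comm -3 "$tmpdir/fam_ids.txt" "$tmpdir/vcf_ids.txt")
printf '%s\n' "$mismatches"
```

The shared ID is suppressed, and the two visually identical IDs appear on separate lines, one per column, because they differ at the byte level.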

Create ID correction mapping

# gsub(regex, replacement, target): global substitution in awk
# Replace visually-identical Cyrillic characters with ASCII equivalents
$ awk 'BEGIN{OFS="\t"} {
    new=$2
    gsub(/м/,"m",new)    # Cyrillic м (U+043C) → Latin m (U+006D)
    gsub(/Х/,"X",new)    # Cyrillic Х (U+0425) → Latin X (U+0058)
    if(new!=$2) print $1,$2,$1,new    # Output: oldFID oldIID newFID newIID
  }' ConvSK_mind20_dedup.fam > update_ids_homoglyphs.txt

Result: 2 ID mappings created:
03-25м → 03-25m
08-176Х-00006 → 08-176X-00006

Apply ID normalization to PLINK and VCF

# --update-ids FILE: remap sample IDs using 4-column file
# Format: oldFID oldIID newFID newIID (tab-separated)
$ plink --bfile ConvSK_mind20_dedup \
    --update-ids update_ids_homoglyphs.txt \
    --make-bed --out ConvSK_mind20_dedup_ascii
--update-ids: 2 people updated.
1091 people retained.

# bcftools reheader -s FILE: rename samples in VCF header
# FILE format: oldID newID (one pair per line)
# Write to a new file; using the input path as the output would overwrite the file being read
For VCF files (chr1 shown; the same command applies to each chromosome):
$ bcftools reheader -s id_mapping.txt \
    -o chr1.dose_normalized.vcf.gz chr1.dose.vcf.gz
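The commands above do not show how id_mapping.txt was produced. One possible way to derive a two-column file from the four-column PLINK mapping is sketched below. It assumes the VCF sample names are identical to the PLINK IIDs (no FID or numeric prefix), which may not hold for every run; the function name `plink_ids_to_reheader_map` and the toy IDs are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a PLINK --update-ids file
# (oldFID oldIID newFID newIID) into "oldIID<TAB>newIID" pairs,
# the two-column form accepted by bcftools reheader -s.
# Assumption: VCF sample names equal the PLINK IIDs.
plink_ids_to_reheader_map() {
  awk 'BEGIN { OFS = "\t" } NF >= 4 { print $2, $4 }' "$1"
}

# Toy demonstration with made-up IDs.
tmpdir=$(mktemp -d)
printf 'F1\tOLD-1\tF1\tNEW-1\nF2\tOLD-2\tF2\tNEW-2\n' > "$tmpdir/update_ids.txt"
plink_ids_to_reheader_map "$tmpdir/update_ids.txt" > "$tmpdir/id_mapping.txt"
cat "$tmpdir/id_mapping.txt"
```

If the VCF names still carry a prefix at this point, the old-name column would need the same prefix for `bcftools reheader` to match it.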

Execution Timeline

2026-04-11 05:08
Spring 2026: Prefix detection
Discovered Michigan v2 quirk: chr1-9 have numeric prefix, chr10-22 clean
2026-04-11 05:09
Spring 2026: Prefix stripping
1,093 prefix mappings → bcftools reheader on chr1-9, symlink chr10-22
2026-04-11 05:10
Spring 2026: Verified
All 22 chromosomes have clean ASCII IDs, ready for step 6 QC
2025-12-22
Winter 2025: Cyrillic fix
2 samples fixed (м→m, Х→X) via bcftools reheader