Step 5: Sample ID Normalization

1. Overview

After imputation, Michigan Imputation Server v2 may alter sample IDs in the output VCFs. This step detects and corrects any ID mismatches so that downstream tools (PLINK, ADMIXTURE) see consistent identifiers.

Spring 2026 Run

Michigan v2 Prefix Quirk: Chromosomes 1–9 received a numeric prefix (1_sampleID, 2_sampleID, …), while chromosomes 10–22 kept the original clean IDs. The prefixes were stripped via per-chromosome bcftools reheader; chr10-22 were symlinked without modification.

Metric	Value
Total Samples	1,093
Cyrillic Issues	0
CHR Prefix-Stripped	9
CHR Symlinked (clean)	13

Winter 2025 Run

Historical: In winter 2025, two sample IDs contained Cyrillic homoglyphs (м and Х in place of Latin m and X), so the IDs in the PLINK files and the VCFs did not match. Step 3e now detects and corrects Cyrillic homoglyphs automatically before Michigan upload, so this manual fix is no longer needed.

Metric	Value
Total Samples	1,098
IDs Fixed (м→m, Х→X)	2
Normalized	1,098
Match Rate	100%

2. Input Data

Spring 2026

Field	Value
Files	chr1-chr22.dose.vcf.gz (from Michigan v2)
Location	/staging/ALSU-analysis/spring2026/imputation/
Samples	1,093 (all ASCII — Cyrillic fixed in step 3e)
Issue	Michigan v2 adds numeric prefix to chr1-9 sample IDs; chr10-22 are clean
Total Variants	48,887,364 across 22 chromosomes

Winter 2025

Field	Value
Files	chr1-chr22.dose.vcf.gz + ConvSK_mind20_dedup_snpqc.fam
Location	/staging/ALSU-analysis/winter2025/
Samples	1,098 (2 with Cyrillic homoglyphs)
Issue	Sample IDs in VCF ≠ PLINK metadata due to encoding

3. Output Data

Spring 2026

Field	Value
Files	chr1-chr22.dose.vcf.gz (9 reheadered + 13 symlinked)
Location	/staging/ALSU-analysis/spring2026/post_imputation/
Samples	1,093 (all clean ASCII, no prefix)
Variants	48,887,364 (unchanged)

✓ Verification: All 1,093 sample IDs are clean ASCII across all 22 chromosomes. No Cyrillic issues (fixed in step 3e). No Michigan prefix remaining.

Winter 2025

Field	Value
Files	chr1-chr22.dose_normalized.vcf.gz (22 VCF files)
Location	/staging/ALSU-analysis/winter2025/
Samples	1,098 (all ASCII IDs)
Variants	10,846,569 SNPs (unchanged)

✓ Verification: All sample IDs ASCII-only, matching PLINK metadata

4. Commands Executed

Spring 2026 — Michigan Prefix Detection & Stripping

# Check first sample ID in each chromosome for numeric prefix
for chr in $(seq 1 22); do
  first=$(bcftools query -l chr${chr}.dose.vcf.gz | head -1)
  echo "chr${chr}: $first"
done

# Result: chr1-9 have prefix (1_sampleID, 2_sampleID, ...)
#         chr10-22 have clean IDs (no prefix)
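The loop above inspects only the first sample ID per chromosome. A minimal sketch of a fuller check that counts every prefixed ID is shown here; the helper name count_prefixed and the demo IDs are hypothetical, and a genuine sample ID that happens to start with digits and an underscore would be counted as well.

```shell
# count_prefixed: hypothetical helper; reads sample IDs on stdin and prints
# how many start with digits followed by an underscore.
count_prefixed() {
  # grep -c exits non-zero when the count is 0, so mask the status
  grep -cE '^[0-9]+_' || true
}

# Demo on made-up IDs; in practice the input would come from
#   bcftools query -l chr${chr}.dose.vcf.gz
printf '1_S001\n1_S002\nS003\n' | count_prefixed   # prints 2
```

A count equal to the number of samples suggests the whole file is prefixed; anything in between would deserve a closer look before reheadering.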

# Create strip-prefix mapping from chr1's sample list
$ bcftools query -l chr1.dose.vcf.gz > sample_list.txt
$ awk -F'_' 'NF>1 && $1~/^[0-9]+$/ {
  old=$0; sub(/^[0-9]+_/,""); print old"\t"$0
}' sample_list.txt > strip_prefix.tsv

# 1,093 prefix mappings created
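Before the mapping is passed to bcftools reheader, it may be worth confirming that stripping the prefix did not collapse two distinct IDs onto one name. The following is a sketch only, assuming the two-column tab-separated layout of strip_prefix.tsv; the helper name check_mapping and the demo IDs are hypothetical.

```shell
# check_mapping FILE: hypothetical helper; fails if a row does not have
# exactly two tab-separated columns or if two old IDs share one new ID.
check_mapping() {
  awk -F'\t' '
    NF != 2               { printf "line %d: expected 2 columns\n", NR; bad = 1 }
    NF == 2 && seen[$2]++ { printf "line %d: duplicate new ID %s\n", NR, $2; bad = 1 }
    END                   { exit (bad ? 1 : 0) }
  ' "$1"
}

# Demo on made-up IDs (not taken from the real cohort)
demo=$(mktemp)
printf '1_S001\tS001\n1_S002\tS002\n' > "$demo"
check_mapping "$demo" && echo "mapping OK"
```

On the real data the call would be check_mapping strip_prefix.tsv, run once after the awk step.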

# Per-chromosome: strip prefix (chr1-9) or symlink (chr10-22)
for chr in $(seq 1 22); do
  first=$(bcftools query -l imputation/chr${chr}.dose.vcf.gz | head -1)
  if echo "$first" | grep -qE '^[0-9]+_'; then
    # Has prefix — strip it
    bcftools reheader -s strip_prefix.tsv \
      -o post_imputation/chr${chr}.dose.vcf.gz \
      imputation/chr${chr}.dose.vcf.gz
  else
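    # No prefix: symlink. A relative target is resolved from the link's own
    # directory (post_imputation/), so it needs the leading ../
    ln -sf ../imputation/chr${chr}.dose.vcf.gz \
      post_imputation/chr${chr}.dose.vcf.gz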
  fi
done

# chr1-9: prefix stripped via bcftools reheader
# chr10-22: symlinked (already clean)
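The verification that all 22 chromosomes carry the same clean IDs can be sketched as a file comparison. This assumes each chromosome's sample list has first been dumped to a text file with bcftools query -l; the helper name same_samples, the file names, and the demo IDs are hypothetical.

```shell
# same_samples REF FILE...: hypothetical helper; succeeds only if every FILE
# lists exactly the same IDs, in the same order, as REF.
same_samples() {
  ref=$1; shift
  for f in "$@"; do
    cmp -s "$ref" "$f" || { echo "mismatch: $f"; return 1; }
  done
}

# Demo with made-up lists; in practice each file would come from
#   bcftools query -l post_imputation/chr${chr}.dose.vcf.gz > chr${chr}.samples.txt
tmp=$(mktemp -d)
printf 'S001\nS002\n' > "$tmp/chr1.samples.txt"
printf 'S001\nS002\n' > "$tmp/chr10.samples.txt"
same_samples "$tmp/chr1.samples.txt" "$tmp/chr10.samples.txt" && echo "sample lists agree"
```

Comparing a reheadered chromosome against a symlinked one, as in the demo, covers both code paths of the loop.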

Winter 2025 — Cyrillic Homoglyph Fix

Identify samples with Cyrillic homoglyphs

# comm -3: output lines unique to either file (symmetric difference)
#   Compares two SORTED inputs; -3 suppresses lines common to both
# <(...): bash process substitution — feeds command output as a virtual file
$ comm -3 \
  <(awk '{print $2}' ConvSK_mind20_dedup.fam | sort) \
  <(bcftools query -l chr1.dose.vcf.gz | sed 's/^[0-9]\+_//' | sort)
  #                                        └── sed: strip numeric prefix + underscore
  #                                            added by Michigan server to sample IDs

Found 2 samples with Cyrillic homoglyphs:
  - 03-25м (Cyrillic м U+043C)
  - 08-176Х-00006 (Cyrillic Х U+0425)
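Homoglyphs like these can also be flagged directly, without comparing two files, by searching for any byte outside printable ASCII. This is a minimal sketch rather than the method used in step 3e; the helper name non_ascii_ids and the demo IDs are hypothetical, and it assumes one ID per line (a tab would also be reported).

```shell
# non_ascii_ids FILE: hypothetical helper; prints each line of FILE that
# contains a byte outside printable ASCII (0x20-0x7E), with its line number.
non_ascii_ids() {
  # LC_ALL=C makes the bracket range byte-based; -a forces text mode
  LC_ALL=C grep -an '[^ -~]' "$1"
}

# Demo: the second ID ends in Cyrillic м (U+043C) instead of Latin m
ids=$(mktemp)
printf '03-25m\n03-25м\n' > "$ids"
non_ascii_ids "$ids"   # prints 2:03-25м
```

For a .fam file, the IID column could be extracted first with awk '{print $2}' and the result passed to the helper.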

Create ID correction mapping

# gsub(regex, replacement, target): global substitution in awk
#   Replace visually-identical Cyrillic characters with ASCII equivalents
$ awk 'BEGIN{OFS="\t"}
{
  new=$2
  gsub(/м/,"m",new)   # Cyrillic м (U+043C) → Latin m (U+006D)
  gsub(/Х/,"X",new)   # Cyrillic Х (U+0425) → Latin X (U+0058)
  if(new!=$2) print $1,$2,$1,new  # Output: oldFID oldIID newFID newIID
}' ConvSK_mind20_dedup.fam > update_ids_homoglyphs.txt

Result:
2 ID mappings created:
  03-25м → 03-25m
  08-176Х-00006 → 08-176X-00006

Apply ID normalization to PLINK and VCF

# --update-ids FILE: remap sample IDs using 4-column file
#   Format: oldFID oldIID newFID newIID (tab-separated)
$ plink --bfile ConvSK_mind20_dedup \
  --update-ids update_ids_homoglyphs.txt \
  --make-bed --out ConvSK_mind20_dedup_ascii

--update-ids: 2 people updated.
1091 people retained.

# bcftools reheader -s FILE: rename samples in VCF header
#   FILE format: oldID newID (one pair per line)
# For VCF files (write to a new file; the output must not overwrite the input):
$ bcftools reheader -s id_mapping.txt \
  -o chr1.dose_normalized.vcf.gz chr1.dose.vcf.gz
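The source does not show how id_mapping.txt was produced. One plausible route, sketched here as an assumption, is to keep the old and new IID columns of the four-column PLINK file. This only works if the VCF sample names equal the PLINK IIDs; if the VCF names still carry a Michigan prefix, the old-name column would need it too. The FID value in the demo is made up.

```shell
# Made-up stand-in for update_ids_homoglyphs.txt
# (columns: oldFID oldIID newFID newIID, tab-separated)
tmp=$(mktemp -d)
printf 'F1\t03-25м\tF1\t03-25m\n' > "$tmp/update_ids_homoglyphs.txt"

# Keep columns 2 and 4 (old IID, new IID) as the "old new" pairs
# that bcftools reheader -s accepts
awk -F'\t' 'BEGIN{OFS="\t"} {print $2, $4}' \
  "$tmp/update_ids_homoglyphs.txt" > "$tmp/id_mapping.txt"
cat "$tmp/id_mapping.txt"
```

Deriving the VCF mapping from the PLINK one keeps the two renames in step, so a later edit to one file cannot silently diverge from the other.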

Execution Timeline

2026-04-11 05:08

Spring 2026: Prefix detection
Discovered Michigan v2 quirk: chr1-9 have numeric prefix, chr10-22 clean

2026-04-11 05:09

Spring 2026: Prefix stripping
1,093 prefix mappings → bcftools reheader on chr1-9, symlink chr10-22

2026-04-11 05:10

Spring 2026: Verified
All 22 chromosomes have clean ASCII IDs, ready for step 6 QC

2025-12-22

Winter 2025: Cyrillic fix
2 samples fixed (м→m, Х→X) via plink --update-ids and bcftools reheader