ALSU Genotyping Pipeline

Workflow Documentation with Data Flow Roadmap

Step 1: Sample Missingness. Input: 1,247 samples. Output: 1,155 samples | Dec 15
Step 2: IBD Dedup & Remove. Input: 1,155 samples. Output: 1,098 samples | Dec 16
Step 3: SNP QC & VCF Export. Input: 1,098 samples + 654K variants. Output: 473K variants (VCF format) | Dec 17
Step 4: Michigan Imputation. Input: 473K variants (VCF). Output: 10.8M variants imputed | Dec 18-22
Step 5: ID Normalization. Input: 1,098 IDs + 10.8M variants. Output: Normalized dataset | Dec 23
Step 6: Final QC & Output. Input: Normalized dataset (10.8M variants). Output: 1,047 samples, 5.4M variants | Dec 26 / Mar 2026
Step 7: Local PCA Analysis. Input: 1,047 samples, 5.4M variants. Output: PCA scores, plots | Jan 3-4
Step 8: Global PCA + 1000G. Input: UZB + 1000 Genomes reference. Output: Global PCA, ancestry inference | Jan 4
Step 9: Fst Analysis (UZB vs EUR). Input: UZB + 1000G EUR (376K SNPs). Output: Genome-wide Fst = 0.020 | Oct–Nov
Step 10: Multi-Pop & PBS. Input: UZB + EUR/EAS/SAS/AFR (376K SNPs). Output: PBS + Uzbek-specific SNPs | Feb 2026
Step 11: ADMIXTURE K=2–8. Input: UZB + 1000G (376K SNPs). Output: K=5 parsimonious (global) | Mar 2026
Step 12: PBS SNP Annotation. Input: 490 Uzbek-specific PBS SNPs. Output: VEP + GWAS + GTEx results | Mar 2026
Step 13: LD Analysis. Input: 490 PBS SNPs + UZB_only BED. Output: 401 independent loci | Mar 2026
Step 14: FST & MDS. Input: Global merged BED (5 pops). Output: 5×5 FST matrix + MDS | Mar 2026
Step 15: ROH & IBD. Input: UZB BED (1,047 × 5.41M SNPs). Output: 36.7K ROH, F_ROH, IBD | Mar 2026

All Pipeline Steps

Click on any step to view detailed documentation including technical overview and chronological activity log

1 Sample Missingness Filter Dec 15
Remove high-missingness samples (F_MISS > 0.20). Result: 1,247 → 1,155 samples.
✓ Completed
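
As a rough illustration of the Step 1 filter, the sketch below flags samples whose F_MISS exceeds 0.20. It assumes the per-sample missingness report is in the PLINK 1.9 `.imiss` layout (whitespace-delimited, with FID, IID and F_MISS columns); the function name and the toy sample IDs are hypothetical and not taken from the pipeline logs.

```python
def high_missingness_samples(imiss_lines, threshold=0.20):
    """Return (FID, IID) pairs whose F_MISS is strictly above the threshold."""
    rows = iter(imiss_lines)
    header = next(rows).split()
    fid_col, iid_col, fmiss_col = (
        header.index(name) for name in ("FID", "IID", "F_MISS")
    )
    flagged = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if float(fields[fmiss_col]) > threshold:
            flagged.append((fields[fid_col], fields[iid_col]))
    return flagged


# Toy report with made-up sample IDs (assumed .imiss column layout)
toy = """ FID IID MISS_PHENO N_MISS N_GENO F_MISS
 S1 S1 Y 1200 654000 0.001835
 S2 S2 Y 196200 654000 0.3
 S3 S3 Y 130800 654000 0.2
"""
flagged = high_missingness_samples(toy.splitlines())
print(flagged)
```

The comparison is strict, so a sample sitting exactly at 0.20 is kept; the flagged pairs could be written to a two-column file and passed to PLINK's `--remove`.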
2 IBD Deduplication & Removal Dec 16
Identify and remove IBD-related duplicates. Result: 1,155 → 1,098 samples.
✓ Completed
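
One common way to decide which member of each related pair to drop is a greedy pass over the relatedness graph. This is a sketch of that heuristic only: the source does not state the PI_HAT cutoff or the selection rule used in Step 2, so `pi_hat_min`, the function name and the toy pairs are all assumptions.

```python
from collections import defaultdict


def samples_to_remove(pairs, pi_hat_min):
    """Greedily pick samples to drop so that no related pair remains.

    pairs: iterable of (sample_a, sample_b, pi_hat) tuples.
    pi_hat_min: assumed cutoff; pairs at or above it count as related.
    """
    # Build an undirected graph linking related samples
    neighbours = defaultdict(set)
    for a, b, pi_hat in pairs:
        if pi_hat >= pi_hat_min:
            neighbours[a].add(b)
            neighbours[b].add(a)

    removed = set()
    while True:
        # Samples that still have at least one related partner
        candidates = [(len(n), s) for s, n in neighbours.items() if n]
        if not candidates:
            break
        # Drop the sample with the most remaining partners
        _, worst = max(candidates)
        removed.add(worst)
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return removed


# Toy pairs with hypothetical IDs; 0.9 is an illustrative cutoff
toy_pairs = [("A", "B", 1.0), ("B", "C", 0.99), ("D", "E", 0.12)]
to_drop = samples_to_remove(toy_pairs, pi_hat_min=0.9)
print(sorted(to_drop))
```

Removing the most connected sample first tends to keep more individuals than dropping one arbitrary member of every pair.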
3 SNP QC & VCF Export Dec 17
Variant-level QC (call rate, HWE, MAF) and conversion to VCF for imputation. 654K → 473K variants.
✓ Completed
4 Michigan Imputation Server Dec 18–22
Phasing (Eagle2) + imputation (Minimac4) against TOPMed r2. 473K → 10.8M variants.
✓ Completed
5 ID Normalization Dec 23
Harmonize sample IDs between genotype array and imputed dataset. Validate 1,098-sample merge.
✓ Completed
6 Final QC & Output Dec 26
Post-imputation QC (R²≥0.3, MAF, HWE). Final dataset: 1,074 samples, 10.1M variants.
✓ Completed
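
The R² ≥ 0.3 criterion can be sketched as a per-line check on the VCF INFO column. This assumes the imputed VCF carries the imputation quality as an `R2=` key in INFO, as Minimac4 output typically does; the MAF and HWE cutoffs are not given in the source, so they are left out. The helper names and toy records are hypothetical.

```python
def info_to_dict(info):
    """Split a VCF INFO string into a dict; flags map to an empty string."""
    parsed = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def passes_r2(vcf_line, min_r2=0.3):
    """True if the record's INFO field has R2 at or above min_r2."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = info_to_dict(fields[7])  # INFO is the 8th VCF column
    return "R2" in info and float(info["R2"]) >= min_r2


# Toy records with made-up variant IDs
toy_records = [
    "chr1\t1000\trs_toy1\tA\tG\t.\tPASS\tAF=0.12;MAF=0.12;R2=0.91;IMPUTED",
    "chr1\t2000\trs_toy2\tC\tT\t.\tPASS\tAF=0.01;MAF=0.01;R2=0.18;IMPUTED",
    "chr1\t3000\trs_toy3\tG\tA\t.\tPASS\tAF=0.30;MAF=0.30;R2=0.30;IMPUTED",
]
kept = [rec.split("\t")[2] for rec in toy_records if passes_r2(rec)]
print(kept)
```

For a full-size dataset a dedicated tool such as bcftools would be the practical choice; the sketch only makes the threshold logic explicit.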
7 Local PCA Analysis Jan 3–4
PLINK PCA on UZB cohort. PC1 vs PC2 scatter, outlier removal, population stratification assessment.
✓ Completed
8 Global PCA + 1000 Genomes Jan 4
Project UZB onto 1000G reference panel. Ancestry inference: broad Central Asian cluster between EUR and SAS/EAS.
✓ Completed
9 FST Analysis (UZB vs EUR) Oct–Nov
Genome-wide Weir & Cockerham FST UZB vs 1000G EUR. Weighted FST = 0.020. Top outlier regions identified.
✓ Completed
10 Multi-Population Analysis & PBS Feb 16
Population branch statistic (PBS) from multi-population FST, plus delta-AF against EUR/EAS/SAS/AFR. 8 Uzbek-specific SNPs identified. ADMIXTURE K=2–8 complete.
✓ Complete
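
For reference, the population branch statistic is usually computed from three pairwise FST values by converting each to a branch length T = -ln(1 - FST) and taking PBS = (T_target,ref1 + T_target,ref2 - T_ref1,ref2) / 2. The sketch below implements that standard formula; which reference pair the pipeline used for UZB is not stated in the source, and the input values are purely illustrative.

```python
import math


def branch_length(fst):
    """Convert FST to a divergence-time-like branch length, T = -ln(1 - FST)."""
    # Clamp so negative estimates and values near 1 do not break the log
    fst = min(max(fst, 0.0), 0.999999)
    return -math.log(1.0 - fst)


def pbs(fst_target_ref1, fst_target_ref2, fst_ref1_ref2):
    """Population branch statistic for the target population at one SNP."""
    return (
        branch_length(fst_target_ref1)
        + branch_length(fst_target_ref2)
        - branch_length(fst_ref1_ref2)
    ) / 2.0


# Illustrative per-SNP FST values (not taken from the pipeline results)
example = pbs(fst_target_ref1=0.30, fst_target_ref2=0.30, fst_ref1_ref2=0.05)
print(round(example, 4))
```

A large PBS means the target population's allele frequency diverged from both references more than the references diverged from each other.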
11 ADMIXTURE K=2–8 Analysis Mar 2026
Unsupervised ADMIXTURE on UZB + all 2,548 1000G. K=5 most parsimonious (CV=0.295); sNMF validated. Interactive stacked-bar plots.
✓ Complete
12 PBS SNP Functional Annotation Mar 2026
Ensembl VEP annotation of 490 Uzbek-specific PBS candidates. 264 unique genes; 5 missense variants (4 damaging incl. SPI1, SLC6A2). HLA/MHC over-represented.
✓ Complete
13 LD Clumping & Decay Analysis Mar 2026
PLINK LD clumping reduced 490 PBS candidates to 401 independent loci. LD decay from r²=0.096 at 0 kb to plateau ~0.011 at 200 kb+. Chr6/MHC: 135 loci (33.7%).
✓ Complete
14 FST Heatmap & MDS Mar 2026
Full 5×5 pairwise FST matrix (UZB, EUR, EAS, SAS, AFR). Interactive heatmap + classical MDS. UZB closest to SAS (0.018), then EUR (0.020). All values match published 1000G ranges.
✓ Complete
15 ROH & IBD Analysis Mar 2026
36,702 ROH segments across 1,047 individuals. Median F_ROH = 0.015 (comparable to outbred populations). IBD analysis finds 6,368 pairs with PI_HAT ≥ 0.05 and minimal close relatedness. These results are critical for the pregnancy loss GWAS design.
✓ Complete
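
F_ROH is conventionally the total length of an individual's ROH divided by the length of the autosomal genome. The sketch below computes it from a table in the PLINK `.hom.indiv` layout (assumed columns: FID, IID, PHE, NSEG, KB, KBAVG). The autosomal length of 2.88 Gb is an assumed round figure and should be replaced with the value matching the genome build and SNP coverage actually used; the toy rows are made up.

```python
import statistics

# Assumed autosomal length in kb (about 2.88 Gb); not taken from the source
AUTOSOME_KB = 2_880_000


def f_roh(total_roh_kb, autosome_kb=AUTOSOME_KB):
    """Fraction of the autosomal genome covered by runs of homozygosity."""
    return total_roh_kb / autosome_kb


def median_f_roh(hom_indiv_lines, autosome_kb=AUTOSOME_KB):
    """Median F_ROH over all individuals in a .hom.indiv style table."""
    rows = iter(hom_indiv_lines)
    header = next(rows).split()
    kb_col = header.index("KB")  # total ROH length per individual, in kb
    values = [
        f_roh(float(line.split()[kb_col]), autosome_kb)
        for line in rows
        if line.strip()
    ]
    return statistics.median(values)


# Toy table with hypothetical individuals
toy = """ FID IID PHE NSEG KB KBAVG
 S1 S1 -9 30 28800 960
 S2 S2 -9 35 43200 1234.29
 S3 S3 -9 60 144000 2400
"""
toy_median = median_f_roh(toy.splitlines())
print(round(toy_median, 4))
```

Under this assumed genome length, an F_ROH of 0.015 corresponds to roughly 43 Mb of total ROH per individual.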
Next Steps: Planned Analyses Mar 28
ROH, DAPC, pregnancy loss GWAS design (April 2026), covariates, case/control criteria.
📋 Roadmap

Daily Activity Logs

Daily logs with terminal command history and execution details from the Biotech2024 workserver.

Dec 15, 2025
Sample missingness analysis
Mar 9, 2026
sNMF validation, FST matrix, step 14

Workserver Connection

To sync logs from the Biotech2024 workserver automatically, configure the SSH connection in config.json.
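
The schema of config.json is not documented here, so the following is only a sketch of what such a setup might look like. Every key name (`ssh`, `host`, `user`, `port`, `identity_file`, `remote_log_dir`, `local_log_dir`), the host name and the paths are hypothetical placeholders; the sketch assumes logs are pulled with rsync over SSH.

```python
import json
import shlex

# Hypothetical config.json content; all keys and values are placeholders
EXAMPLE_CONFIG = """
{
  "ssh": {
    "host": "biotech2024.example.org",
    "user": "analyst",
    "port": 22,
    "identity_file": "~/.ssh/id_ed25519",
    "remote_log_dir": "/home/analyst/logs"
  },
  "local_log_dir": "./logs/"
}
"""


def build_rsync_command(config):
    """Build an rsync-over-SSH argument list from the (assumed) config layout."""
    ssh = config["ssh"]
    # rsync's -e option sets the remote shell command
    remote_shell = f"ssh -p {ssh['port']} -i {shlex.quote(ssh['identity_file'])}"
    source = f"{ssh['user']}@{ssh['host']}:{ssh['remote_log_dir']}/"
    return ["rsync", "-az", "-e", remote_shell, source, config["local_log_dir"]]


command = build_rsync_command(json.loads(EXAMPLE_CONFIG))
print(" ".join(shlex.quote(part) for part in command))
```

The argument list can be handed to `subprocess.run` as is, which avoids shell quoting problems; printing it first makes a dry run easy.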

PCA Viewer (Enhanced)

The enhanced PCA viewer loads in the frame below. If CORS or CDN restrictions prevent it from loading, consider vendoring the viewer's dependencies (serving local copies instead of fetching them from a CDN).