1. sNMF Independent Validation of K Selection
Goal: Validate ADMIXTURE's cross-validation K selection using sNMF (sparse Non-negative Matrix Factorization) from the LEA R package — a completely independent algorithm.
Global dataset (2,095 samples × 172,537 biallelic SNPs after vcf2geno conversion):
- Optimal K = 9 by minimum cross-entropy (0.534309)
- But K=7 → K=9 improvement is only 0.08%, i.e. plateau from K=7
- ADMIXTURE concordance: K=3 r=0.9996, K=5 r=0.9969, K=7 r=0.9921
- ✓ Confirms ADMIXTURE's K=7 optimal selection
Uzbek-only dataset (1,074 samples × 172,537 SNPs):
- Cross-entropy monotonically increases from K=2 — confirms K=2 is optimal
- ADMIXTURE concordance: K=2 r=0.9979 (excellent), K=5 r=0.6339 (collapses for higher K)
- ✓ Confirms Uzbek-only ADMIXTURE CV result (K=2 minimum)
Scripts created: scripts/run_snmf.R, scripts/run_snmf_uzbek.R, scripts/snmf_complete.R
Server outputs: /staging/.../snmf_results/, /staging/.../snmf_results_uzbek/
2. Evanno ΔK Finalization
Status: 51/70 batch runs completed (K=2–6: all 10 reps; K=7: 1/10; K=8: 0/10). K=7 and K=8 failed with exit code 127 after server reboot.
Decision: sNMF independently validates K=7 (global) and K=2 (Uzbek-only), so K=7-8 reruns are unnecessary. Killed evanno_rerun tmux session.
Data corrections applied:
- FIX Log-likelihoods were from Uzbek-only dataset (~−353M) — corrected to global values (~−250M)
- FIX CV errors had minor precision errors — updated from actual pipeline.log
Step 11 updates: §3.1 Evanno ΔK with statistical caveat (method designed for STRUCTURE's MCMC, not ADMIXTURE's ML), §3.2 sNMF global results, §3.3 Uzbek-only validation. Preliminary banner removed.
3. Full Pairwise FST Matrix
Previously computed (Feb 16, 2026): UZB–EUR, UZB–EAS, UZB–SAS, UZB–AFR, EUR–EAS (5 pairs)
Computed today: AFR–EUR, AFR–EAS, AFR–SAS, EUR–SAS, EAS–SAS (5 missing pairs)
| Pair | Weighted FST | Mean FST | Category |
|---|---|---|---|
| UZB – SAS | 0.0179 | 0.0148 | Very low |
| UZB – EUR | 0.0204 | 0.0160 | Low |
| EUR – SAS | 0.0279 | 0.0218 | Low |
| UZB – EAS | 0.0493 | 0.0369 | Moderate |
| EAS – SAS | 0.0653 | 0.0451 | Moderate |
| EUR – EAS | 0.1061 | 0.0724 | High |
| UZB – AFR | 0.1094 | 0.0729 | High |
| SAS – AFR | 0.1306 | 0.0784 | High |
| EUR – AFR | 0.1448 | 0.0912 | Very high |
| EAS – AFR | 0.1681 | 0.1161 | Very high |
Key insight: UZB is closest to SAS (0.018), confirming Indo-Iranian heritage. All values fall within published 1000 Genomes Phase 3 ranges.
Method: PLINK 1.9 --fst with Weir & Cockerham estimator on 376,208 LD-pruned SNPs. Population assignment via --within cluster files.
4. New Visualization: Step 14 — FST Heatmap & MDS
Created: steps/step14.html — comprehensive interactive page featuring:
- Interactive canvas heatmap with mouse-hover tooltips showing population pairs and FST values
- Classical MDS plot — 2D projection of the 5×5 distance matrix via eigendecomposition of doubly-centered D² matrix (implemented in pure JS)
- Bar chart — FST distance from UZB to all reference populations
- Color-coded matrix table — green (FST < 0.03) → orange → red (FST > 0.10)
- Population ranking tables — proximity to UZB + largest global divergences
- Published comparison table — validates against 1000G Phase 3 literature
Also updated: index.html (new step card + SVG pipeline node), next_steps.html (corrected FST values in existing heatmap)
5. Annotation JSON Downloaded
File: data/snp_data_full.json (242,957 bytes)
Contents: 490 PBS-selected candidate SNPs with 27 annotation fields:
- Identifiers: snp_id (chr:pos), rsid, chrom, pos, A1, A2
- Population genetics: PBS_UZB, PBS_EUR, PBS_EAS, MAF_UZB, MAF_EUR, MAF_EAS, MAF_SAS, MAF_AFR, near_private
- Functional: gene, consequence, sift, polyphen
- Database hits: gwas_hits, gwas_top, gwas_detail, clinvar_sig, clinvar_cond, gtex_hits, gtex_top, gtex_detail
Method: Base64 transfer via SSH (SCP broken due to Dragen MOTD). Marker-delimited extraction: ===START=== / ===END=== to isolate base64 from MOTD contamination.
6. Task Tracker Updates
Updated todo.html tasks:
- DONE #2: sNMF global validation
- DONE #3: Evanno 70-rep batch
- DONE #5: sNMF Uzbek-only
- DONE #10: step11 cleanup
- DONE #16: k=7-8 rerun → cancelled (sNMF confirms)
- DONE #6: FST heatmap (today) — step 14 created
- DONE #12: Download annotation JSON
- ⏳ #19-22: GWAS tasks deferred to April 2026
7. Files Modified / Created
| File | Action | Description |
|---|---|---|
| steps/step11.html | Updated | §3.2 sNMF global, §3.3 Uzbek-only, removed preliminary banner |
| steps/step14.html | NEW | FST heatmap + MDS visualization page |
| index.html | Updated | Added step 14 card + SVG pipeline node |
| steps/next_steps.html | Updated | Corrected FST matrix (old estimates → computed values) |
| data/snp_data_full.json | NEW | 490 annotated SNPs (242 KB) |
| todo.html | Updated | Tasks 2,3,5,6,10,12,16 marked done |
| scripts/run_snmf.R | NEW | Global sNMF pipeline (LEA, vcf2geno) |
| scripts/run_snmf_uzbek.R | NEW | Uzbek-only sNMF pipeline |
| daily/2026-03-09.html | NEW | This daily log |