← Back to Home

Daily Log: March 9, 2026

sNMF Validation, FST Matrix Completion, Heatmap Visualization & Data Downloads

📊 Session Summary
✓ 5 TASKS COMPLETED
Duration: ~3 hours | Server: Biotech2024 (100.104.25.22)
DONE sNMF validation — global + Uzbek-only datasets
DONE Evanno finalization — step11 updated, rerun cancelled
DONE Full 5×5 FST matrix — 5 missing pairs computed
NEW Step 14 page — interactive heatmap + MDS visualization
DONE Annotation JSON downloaded (490 SNPs, 27 fields)

1. sNMF Independent Validation of K Selection

Goal: Validate ADMIXTURE's cross-validation K selection using sNMF (sparse Non-negative Matrix Factorization) from the LEA R package — a completely independent algorithm.

Global dataset (2,095 samples × 172,537 biallelic SNPs after vcf2geno conversion):

  • Optimal K = 9 by minimum cross-entropy (0.534309)
  • But K=7 → K=9 improvement is only 0.08%, i.e. plateau from K=7
  • ADMIXTURE concordance: K=3 r=0.9996, K=5 r=0.9969, K=7 r=0.9921
  • Confirms ADMIXTURE's K=7 optimal selection

Uzbek-only dataset (1,074 samples × 172,537 SNPs):

  • Cross-entropy monotonically increases from K=2 — confirms K=2 is optimal
  • ADMIXTURE concordance: K=2 r=0.9979 (excellent), K=5 r=0.6339 (collapses for higher K)
  • Confirms Uzbek-only ADMIXTURE CV result (K=2 minimum)

Scripts created: scripts/run_snmf.R, scripts/run_snmf_uzbek.R, scripts/snmf_complete.R

Server outputs: /staging/.../snmf_results/, /staging/.../snmf_results_uzbek/

2. Evanno ΔK Finalization

Status: 51/70 batch runs completed (K=2–6: all 10 reps; K=7: 1/10; K=8: 0/10). K=7 and K=8 failed with exit code 127 after server reboot.

Decision: sNMF independently validates K=7 (global) and K=2 (Uzbek-only), so K=7-8 reruns are unnecessary. Killed evanno_rerun tmux session.

Data corrections applied:

  • FIX Log-likelihoods were from Uzbek-only dataset (~−353M) — corrected to global values (~−250M)
  • FIX CV errors had minor precision errors — updated from actual pipeline.log

Step 11 updates: §3.1 Evanno ΔK with statistical caveat (method designed for STRUCTURE's MCMC, not ADMIXTURE's ML), §3.2 sNMF global results, §3.3 Uzbek-only validation. Preliminary banner removed.

3. Full Pairwise FST Matrix

Previously computed (Feb 16, 2026): UZB–EUR, UZB–EAS, UZB–SAS, UZB–AFR, EUR–EAS (5 pairs)

Computed today: AFR–EUR, AFR–EAS, AFR–SAS, EUR–SAS, EAS–SAS (5 missing pairs)

PairWeighted FSTMean FSTCategory
UZB – SAS0.01790.0148Very low
UZB – EUR0.02040.0160Low
EUR – SAS0.02790.0218Low
UZB – EAS0.04930.0369Moderate
EAS – SAS0.06530.0451Moderate
EUR – EAS0.10610.0724High
UZB – AFR0.10940.0729High
SAS – AFR0.13060.0784High
EUR – AFR0.14480.0912Very high
EAS – AFR0.16810.1161Very high

Key insight: UZB is closest to SAS (0.018), confirming Indo-Iranian heritage. All values fall within published 1000 Genomes Phase 3 ranges.

Method: PLINK 1.9 --fst with Weir & Cockerham estimator on 376,208 LD-pruned SNPs. Population assignment via --within cluster files.

4. New Visualization: Step 14 — FST Heatmap & MDS

Created: steps/step14.html — comprehensive interactive page featuring:

  • Interactive canvas heatmap with mouse-hover tooltips showing population pairs and FST values
  • Classical MDS plot — 2D projection of the 5×5 distance matrix via eigendecomposition of doubly-centered D² matrix (implemented in pure JS)
  • Bar chart — FST distance from UZB to all reference populations
  • Color-coded matrix table — green (FST < 0.03) → orange → red (FST > 0.10)
  • Population ranking tables — proximity to UZB + largest global divergences
  • Published comparison table — validates against 1000G Phase 3 literature

Also updated: index.html (new step card + SVG pipeline node), next_steps.html (corrected FST values in existing heatmap)

5. Annotation JSON Downloaded

File: data/snp_data_full.json (242,957 bytes)

Contents: 490 PBS-selected candidate SNPs with 27 annotation fields:

  • Identifiers: snp_id (chr:pos), rsid, chrom, pos, A1, A2
  • Population genetics: PBS_UZB, PBS_EUR, PBS_EAS, MAF_UZB, MAF_EUR, MAF_EAS, MAF_SAS, MAF_AFR, near_private
  • Functional: gene, consequence, sift, polyphen
  • Database hits: gwas_hits, gwas_top, gwas_detail, clinvar_sig, clinvar_cond, gtex_hits, gtex_top, gtex_detail

Method: Base64 transfer via SSH (SCP broken due to Dragen MOTD). Marker-delimited extraction: ===START=== / ===END=== to isolate base64 from MOTD contamination.

6. Task Tracker Updates

Updated todo.html tasks:

  • DONE #2: sNMF global validation
  • DONE #3: Evanno 70-rep batch
  • DONE #5: sNMF Uzbek-only
  • DONE #10: step11 cleanup
  • DONE #16: k=7-8 rerun → cancelled (sNMF confirms)
  • DONE #6: FST heatmap (today) — step 14 created
  • DONE #12: Download annotation JSON
  • #19-22: GWAS tasks deferred to April 2026

7. Files Modified / Created

FileActionDescription
steps/step11.htmlUpdated§3.2 sNMF global, §3.3 Uzbek-only, removed preliminary banner
steps/step14.htmlNEWFST heatmap + MDS visualization page
index.htmlUpdatedAdded step 14 card + SVG pipeline node
steps/next_steps.htmlUpdatedCorrected FST matrix (old estimates → computed values)
data/snp_data_full.jsonNEW490 annotated SNPs (242 KB)
todo.htmlUpdatedTasks 2,3,5,6,10,12,16 marked done
scripts/run_snmf.RNEWGlobal sNMF pipeline (LEA, vcf2geno)
scripts/run_snmf_uzbek.RNEWUzbek-only sNMF pipeline
daily/2026-03-09.htmlNEWThis daily log