Step 11: Global ADMIXTURE with 1000 Genomes
Unsupervised ancestry decomposition of the Uzbek cohort with continental reference populations
Spring 2026 (April 11, 2026) · Completed (March 2026)
1. Overview
The Uzbek-only ADMIXTURE (Step 10) identified a West–East admixture cline within the cohort, but it could not assign those components to known continental ancestries. This step merges 1,047 Uzbek samples with 2,548 individuals from the 1000 Genomes Project (all 26 populations across 5 superpopulations). With the reference samples included, ADMIXTURE’s unsupervised algorithm can anchor its components to known African, European, South Asian, East Asian, and admixed American reference populations.
Spring 2026 ADMIXTURE Summary
2. Pipeline & QC
2.1 Reference Panel Selection
We use the complete 1000 Genomes Phase 3 reference (all 26 populations grouped into 5 superpopulations), providing broad continental representation.
| Superpopulation | Populations included | N |
|---|---|---|
| AFR | YRI, LWK, GWD, MSL, ESN, ACB, ASW | 671 |
| EUR | CEU, GBR, FIN, IBS, TSI | 522 |
| SAS | GIH, PJL, BEB, STU, ITU | 492 |
| EAS | CHB, JPT, CHS, CDX, KHV | 515 |
| AMR | MXL, PUR, CLM, PEL | 348 |
| UZB | Uzbek cohort (this study) | 1,047 |
| Total | | 3,595 |
2.2 Processing Pipeline
3. Cross-Validation Error
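ADMIXTURE’s 5-fold cross-validation error measures how well the model at each K predicts held-out genotypes. The K with the lowest CV error is the best-fitting number of ancestral populations, although differences between neighbouring K values on a plateau (here K=7 and K=8) are too small to be decisive.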
| K | CV Error Winter 2025 (77,111 SNPs) | CV Error Spring 2026 (60,279 SNPs) | Δ from prev (Spring) | Note |
|---|---|---|---|---|
| 2 | 0.31043 | 0.30003 | — | AFR vs non-AFR split |
| 3 | 0.29992 | 0.28936 | −0.01067 | EAS separates |
| 4 | 0.29703 | 0.28658 | −0.00278 | SAS separates |
| 5 | 0.29503 | 0.28476 | −0.00182 | AMR / Central Asian component |
| 6 | 0.29458 | 0.28436 | −0.00040 | Plateau begins |
| 7 | 0.29442 | 0.28418 | −0.00018 | Effectively tied with K=8 |
| 8 | 0.29422 | 0.28401 | −0.00017 | ← Minimum CV (K=7–8 plateau) |
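The Δ column and the plateau reading of the table above can be reproduced with a short Python sketch. The CV errors are the Spring 2026 values copied from the table; the helper names and the 0.0002 plateau tolerance are illustrative assumptions, not part of the original pipeline.

```python
# Spring 2026 CV errors copied from the table above (K -> CV error).
cv_error = {
    2: 0.30003, 3: 0.28936, 4: 0.28658, 5: 0.28476,
    6: 0.28436, 7: 0.28418, 8: 0.28401,
}

def cv_deltas(cv):
    """Change in CV error relative to the previous K (negative = improvement)."""
    ks = sorted(cv)
    return {k: round(cv[k] - cv[prev], 5) for prev, k in zip(ks, ks[1:])}

def smallest_k_within(cv, tolerance):
    """Smallest K whose CV error lies within `tolerance` of the minimum.

    `tolerance` is a hypothetical threshold chosen for illustration only.
    """
    best = min(cv.values())
    return min(k for k, err in cv.items() if err - best <= tolerance)

deltas = cv_deltas(cv_error)
best_k = min(cv_error, key=cv_error.get)                    # K with lowest CV error
plateau_k = smallest_k_within(cv_error, tolerance=0.0002)   # start of the plateau

for k in sorted(deltas):
    print(f"K={k}: delta={deltas[k]:+.5f}")
print("minimum CV at K =", best_k, "| plateau starts at K =", plateau_k)
```

Because the minimum sits at the largest K tested, the plateau start is arguably the more informative of the two numbers.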
3.1 Evanno Method (ΔK)
We include this analysis for comparison with published literature, but cross-validation error (Section 3) remains the statistically appropriate model selection criterion for ADMIXTURE. An independent sNMF analysis (Section 3.2) provides additional validation via cross-entropy.
The Evanno method (Evanno et al., 2005) is based on the second-order rate of change of the log-likelihood, |L″(K)| = |L′(K+1) − L′(K)|, across consecutive K values. The full ΔK statistic divides the mean of |L″(K)| over replicate runs by the standard deviation of L(K); because only a single run per K is available here, the tables report |L″(K)| alone. The K at which |L″(K)| peaks marks the most abrupt change in model fit, which corresponds to the uppermost level of hierarchical population structure. This approach was designed for STRUCTURE’s MCMC framework, so its application to ADMIXTURE output should be interpreted with caution (see caveat above).
Using the log-likelihood values L(K) from the global ADMIXTURE run at each K = 2–8:
Winter 2025 (3,595 × 77,111 SNPs)
| K | L(K) (single run) | L′(K) = L(K) − L(K−1) | \|L″(K)\| (not true ΔK) |
|---|---|---|---|
| 2 | −147,676,905 | — | — |
| 3 | −144,699,938 | 2,976,966 | 2,087,440 ← max |
| 4 | −143,810,412 | 889,526 | 241,190 |
| 5 | −143,162,075 | 648,337 | 437,625 |
| 6 | −142,951,364 | 210,711 | 84,113 |
| 7 | −142,824,765 | 126,599 | 26,085 |
| 8 | −142,672,081 | 152,684 | — |
Spring 2026 (3,595 × 60,279 SNPs)
| K | L(K) (single run) | L′(K) = L(K) − L(K−1) | \|L″(K)\| (not true ΔK) |
|---|---|---|---|
| 2 | −110,554,044 | — | — |
| 3 | −108,197,827 | 2,356,217 | 1,682,893 ← max |
| 4 | −107,524,503 | 673,324 | 201,187 |
| 5 | −107,052,366 | 472,137 | 315,425 |
| 6 | −106,895,654 | 156,712 | 60,902 |
| 7 | −106,799,844 | 95,810 | 24,454 |
| 8 | −106,679,580 | 120,264 | — |
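As a check on the arithmetic, the L′(K) and |L″(K)| columns can be recomputed from the single-run log-likelihoods. The sketch below uses the Spring 2026 values from the table above; the function names are illustrative. It does not compute Evanno's ΔK, which would also require the standard deviation of L(K) across replicate runs.

```python
# Spring 2026 single-run log-likelihoods copied from the table above.
log_lik = {
    2: -110_554_044, 3: -108_197_827, 4: -107_524_503, 5: -107_052_366,
    6: -106_895_654, 7: -106_799_844, 8: -106_679_580,
}

def first_difference(ll):
    """L'(K) = L(K) - L(K-1), defined wherever K-1 was also run."""
    return {k: ll[k] - ll[k - 1] for k in sorted(ll) if k - 1 in ll}

def abs_second_difference(ll):
    """|L''(K)| = |L'(K+1) - L'(K)|, defined wherever K-1 and K+1 were run."""
    d1 = first_difference(ll)
    return {k: abs(d1[k + 1] - d1[k]) for k in sorted(d1) if k + 1 in d1}

d1 = first_difference(log_lik)
d2 = abs_second_difference(log_lik)
peak_k = max(d2, key=d2.get)   # K with the most abrupt change in fit

for k in sorted(d2):
    print(f"K={k}: L'={d1[k]:,}  |L''|={d2[k]:,}")
print("largest |L''(K)| at K =", peak_k)
```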
3.2 sNMF (Sparse Non-negative Matrix Factorization)
To provide an independent validation of K selection, we supplement ADMIXTURE with sNMF (Frichot et al., 2014), implemented in the R package LEA (Bioconductor). sNMF differs from ADMIXTURE in several respects:
| Property | ADMIXTURE | sNMF (LEA) |
|---|---|---|
| Method | ML via block relaxation (EM optional) | Regularized least-squares NMF |
| K selection | Cross-validation error | Cross-entropy criterion |
| Speed | Hours per run | Minutes per run |
| Missing data | Accepts missing genotypes (no imputation step) | Handled natively |
| Regularization | None | L2 penalty (α) — reduces overfitting |
| Clinal structure | Forces discrete clusters | Still clusters, but regularization dampens artifacts |
sNMF uses a cross-entropy criterion for model selection: a fraction of genotypes are masked, the model is fitted on the remaining data, and cross-entropy measures how well the model predicts the masked entries. Unlike ADMIXTURE’s CV error, this does not depend on likelihood-based assumptions.
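The masking idea can be illustrated with a minimal Python sketch. It assumes the standard admixture formulation in which the probability of genotype j (0, 1 or 2) for an individual is the Q-weighted mixture of ancestral genotype frequencies. The function names and the toy matrices are hypothetical, and this is not LEA's implementation.

```python
import math

def genotype_probs(q_row, g_locus):
    """P(genotype = 0, 1, 2) for one individual at one locus.

    q_row   : K ancestry coefficients (summing to 1)
    g_locus : K rows of ancestral genotype frequencies [P(0), P(1), P(2)]
    """
    return [sum(q * g[j] for q, g in zip(q_row, g_locus)) for j in range(3)]

def masked_cross_entropy(genotypes, masked, Q, G):
    """Mean -log P(observed genotype) over the held-out (masked) entries."""
    total = 0.0
    for i, locus in masked:
        probs = genotype_probs(Q[i], G[locus])
        total -= math.log(probs[genotypes[i][locus]])
    return total / len(masked)

# Toy data (hypothetical): 2 individuals, 2 loci, K = 2.
genotypes = [[0, 2], [1, 0]]
Q = [[1.0, 0.0], [0.5, 0.5]]
G = [
    [[0.80, 0.15, 0.05], [0.10, 0.30, 0.60]],   # locus 0
    [[0.10, 0.20, 0.70], [0.60, 0.30, 0.10]],   # locus 1
]
masked = [(0, 0), (1, 0)]   # entries hidden while fitting Q and G

ce = masked_cross_entropy(genotypes, masked, Q, G)

# A model that ignores ancestry (everyone 50/50) predicts the masked
# genotypes less well, so its cross-entropy is higher.
Q_flat = [[0.5, 0.5], [0.5, 0.5]]
ce_flat = masked_cross_entropy(genotypes, masked, Q_flat, G)
print(round(ce, 4), round(ce_flat, 4))
```

Lower values mean the fitted model assigns higher probability to the genotypes it never saw, which is why the criterion can be compared across K.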
Cross-Entropy Results
| K | Mean CE | SD | Min CE | Best Run |
|---|---|---|---|---|
| 2 | 0.548159 | 0.000163 | 0.547834 | 4 |
| 3 | 0.539540 | 0.000170 | 0.539228 | 4 |
| 4 | 0.537119 | 0.000171 | 0.536827 | 4 |
| 5 | 0.535743 | 0.000171 | 0.535451 | 4 |
| 6 | 0.534949 | 0.000168 | 0.534680 | 4 |
| 7 | 0.534713 | 0.000161 | 0.534462 | 4 |
| 8 | 0.534436 | 0.000373 | 0.534062 | 10 |
| 9 | 0.534309 | 0.000420 | 0.533806 | 4 |
| 10 | 0.534314 | 0.000335 | 0.533921 | 5 |
The cross-entropy minimum is at K = 9 (CE = 0.534309), but the curve flattens markedly after K = 7: the improvement from K=7 to K=9 is only 0.08% (0.534713 → 0.534309), and K=6→7 already yields only 0.04%. This plateau, combined with the roughly doubled run-to-run SD at K≥8, suggests K = 7 is a parsimonious choice, consistent with the K=7–8 plateau in ADMIXTURE’s CV error (Section 3). The marginal gains at K=8–9 likely reflect minor substructure or noise.
sNMF vs ADMIXTURE Concordance
| K | Mean Component Correlation | Range |
|---|---|---|
| 3 | 0.9996 | 0.9994 – 0.9998 |
| 5 | 0.9969 | 0.9939 – 0.9998 |
| 7 | 0.9921 | 0.9661 – 0.9998 |
The near-perfect mean correlations (r > 0.99 at K = 3, 5, and 7) between sNMF and ADMIXTURE Q-matrices show that both methods recover essentially the same ancestry proportions, which provides strong independent validation of the ADMIXTURE results. The slight decrease at K=7 (minimum r = 0.966 for one component) is expected, as smaller ancestry components are harder to align precisely.
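Component labels are arbitrary in both methods, so columns must be matched before correlating them. The source does not state how this alignment was done. The sketch below assumes one simple option: try every column permutation and keep the one with the highest total Pearson correlation. All names and the toy matrices are hypothetical.

```python
from itertools import permutations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def column(matrix, j):
    return [row[j] for row in matrix]

def aligned_correlations(q_a, q_b):
    """Per-component correlations under the best column matching.

    Exhaustive search over K! permutations: fine for K <= 8, but an
    assignment solver would be preferable for larger K.
    """
    k = len(q_a[0])
    best = None
    for perm in permutations(range(k)):
        rs = [pearson(column(q_a, j), column(q_b, perm[j])) for j in range(k)]
        if best is None or sum(rs) > sum(best):
            best = rs
    return best

# Toy Q-matrices (hypothetical): same structure, columns swapped, small noise.
q_admixture = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]]
q_snmf = [[0.12, 0.88], [0.79, 0.21], [0.50, 0.50], [0.31, 0.69]]

rs = aligned_correlations(q_admixture, q_snmf)
print("mean r =", round(sum(rs) / len(rs), 4), "| min r =", round(min(rs), 4))
```

Reporting the minimum alongside the mean, as the Range column does, shows when a single poorly matched component is hidden by a high average.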
R Code (as executed)
Total runtime: ~3 hours for 90 runs (10 reps × K=2–10) on 16 CPU cores, compared to ~2 weeks for the equivalent ADMIXTURE batch.
3.3 Uzbek-Only Validation (ADMIXTURE + sNMF)
To assess within-Uzbek substructure independently of reference populations, we ran both ADMIXTURE (with cross-validation) and sNMF on the Uzbek-only subset. If significant internal population stratification exists, it would manifest as K > 2 being favoured.
Uzbek-Only: ADMIXTURE CV Error
| K | CV Error Winter 2025 | CV Error Spring 2026 | Note |
|---|---|---|---|
| 2 | 0.30762 | 0.26309 | ← Minimum (both) |
| 3 | 0.30811 | 0.26350 | |
| 4 | 0.30958 | running… | |
| 5 | 0.31103 | pending | |
| 6 | 0.31266 | pending | |
| 7 | 0.31445 | pending | |
| 8 | 0.31627 | pending | |
In the winter run, CV error increases monotonically from K=2 through K=8, so no model with more than two components is favoured. This pattern across all 7 K values argues against internal substructure beyond a two-way split. The spring run shows the same direction so far (K=2 < K=3), but K=4–8 are still running or pending, so the spring result remains provisional.
Uzbek-Only: sNMF Cross-Entropy
| K | Mean CE | SD | Min CE | Best Run |
|---|---|---|---|---|
| 2 | 0.425013 | 0.000322 | 0.424585 | 1 |
| 3 | 0.425046 | 0.000325 | 0.424631 | 1 |
| 4 | 0.425388 | 0.000335 | 0.424947 | 1 |
| 5 | 0.425985 | 0.000331 | 0.425570 | 9 |
| 6 | 0.426539 | 0.000436 | 0.425946 | 1 |
| 7 | 0.427069 | 0.000323 | 0.426616 | 1 |
| 8 | 0.427624 | 0.000299 | 0.427281 | 9 |
| 9 | 0.428241 | 0.000346 | 0.427853 | 9 |
| 10 | 0.428758 | 0.000330 | 0.428315 | 1 |
Cross-entropy increases monotonically from K=2 (0.4250) to K=10 (0.4288), independently confirming that Uzbeks form a relatively homogeneous group without deep ancestral subdivisions.
Uzbek-Only: sNMF vs ADMIXTURE Concordance
| K | Mean Component Correlation | Note |
|---|---|---|
| 2 | 0.997 | Near-perfect |
| 3 | 0.975 | Excellent |
| 4 | 0.923 | Good |
| 5 | 0.801 | Components diverge at noise level |
At K=2–3, sNMF and ADMIXTURE recover nearly identical ancestry proportions (r > 0.97). By K=5, concordance drops to r = 0.80, confirming that higher K values capture noise rather than real structure within this sample.
4. Population Structure (Interactive)
5. Ancestry Composition at K=5 (Most Parsimonious)
5.1 Component Assignment
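At K=5 with the superpopulation-level reference, five ancestry components resolve, each dominant in one superpopulation. Four are nearly exclusive to their reference group; the Americas-specific component (Q3) reaches only 43.0% in AMR, as expected for admixed populations.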
| Component | Color | Dominant in | Interpretation |
|---|---|---|---|
| Q1 | ■ | EAS (99.6%) | East Asian |
| Q2 | ■ | AFR (95.6%) | African |
| Q3 | ■ | AMR (43.0%) | Americas-specific (Native American-like) |
| Q4 | ■ | SAS (89.8%) | South Asian |
| Q5 | ■ | EUR (96.3%), UZB (48.8%) | European / West Eurasian |
5.2 Uzbek Cohort Ancestry Breakdown (K=5)
| Component | Ancestry | Uzbek Mean | Interpretation |
|---|---|---|---|
| Q5 | European / West Eurasian | 48.8% | Core West Eurasian ancestry (EUR + Central Asian) |
| Q1 | East Asian | 28.8% | Steppe / Turkic / Mongol heritage |
| Q4 | South Asian | 19.2% | Indo-Aryan substrate / trade contact |
| Q3 | Americas-specific | 2.9% | Shared Ancient North Eurasian ancestry |
| Q2 | African | 0.3% | Minimal — effectively absent |
6. Covariate Validation
Kruskal–Wallis tests assessed whether self-reported ethnicity or birthplace within the Uzbek cohort predicts any ADMIXTURE component (1,068 matched samples).
6.1 Ethnicity
| K | Components tested | Any significant? | Conclusion |
|---|---|---|---|
| 2 | Q1, Q2 | No (all ns) | Ethnicity is completely non-significant at every K and every component. ADMIXTURE components do not reflect ethnic self-identification. |
| 3 | Q1–Q3 | No (all ns) | |
| 4 | Q1–Q4 | No (all ns) | |
| 5 | Q1–Q5 | No (all ns) | |
| 6 | Q1–Q6 | No (all ns) | |
| 7 | Q1–Q7 | No (all ns) | |
| 8 | Q1–Q8 | No (all ns) | |
6.2 Birthplace (Geographic Origin)
| K | Most significant component | p-value | η² |
|---|---|---|---|
| 2 | Q1 / Q2 | 2.25 × 10⁻²² | 0.073 |
| 3 | Q3 | 1.47 × 10⁻²² | 0.074 |
| 4 | Q4 | 3.13 × 10⁻²² | 0.074 |
| 5 | Q3 | 8.83 × 10⁻²³ | 0.074 |
| 6 | Q6 | 1.82 × 10⁻²² | 0.073 |
| 7 | Q6 | 3.24 × 10⁻²³ | 0.068 |
| 8 | Q6 | 1.78 × 10⁻²³ | 0.063 |
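For readers who want to see how a Kruskal–Wallis statistic and an η² effect size relate, here is a minimal pure-Python sketch. The source does not say which η² formula was used; the code assumes the common rank-based estimate η² = (H − k + 1) / (n − k), where k is the number of groups and n the total sample size. The function names and toy data are hypothetical. The p-value is omitted because it needs a chi-square distribution function; a library routine such as `scipy.stats.kruskal` would supply it.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def kruskal_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    rank_term, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        rank_term += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(c ** 3 - c for c in counts.values())
    return h / (1 - ties / (n ** 3 - n))   # undefined if every value is identical

def eta_squared(h, k, n):
    """Assumed effect-size formula: (H - k + 1) / (n - k)."""
    return (h - k + 1) / (n - k)

# Toy example (hypothetical): one ancestry component in three birthplace groups.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_h(groups)
eta2 = eta_squared(h, k=len(groups), n=sum(len(g) for g in groups))
print(round(h, 3), round(eta2, 3))
```

Read this way, an η² of about 0.07 means birthplace accounts for only a modest share of the rank variance despite the very small p-values.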
6.3 Geographic Cline at K=7
Mean East Asian component (Q6) by birthplace reveals the geographic gradient:
| Region | N | Central Asian (Q2) | East Asian (Q6) | S. European (Q1) | Interpretation |
|---|---|---|---|---|---|
| Jizzakh | 84 | 73.9% | 19.6% | 1.3% | Most Eastern-shifted |
| Andijan | 18 | 71.6% | 19.9% | 2.9% | Eastern |
| Tashkent region | 160 | 70.6% | 15.5% | 3.1% | Central |
| Tashkent city | 461 | 69.5% | 11.0% | 4.8% | Central-Western |
| Karakalpakstan | 42 | 79.6% | 5.1% | 7.2% | Western-shifted |
| Fergana | 30 | 76.3% | 2.7% | 8.2% | Most Western-shifted |
7. Key Findings & Biological Interpretation
8. Output Files
| File | Description |
|---|---|
| global_for_admixture.bed/bim/fam | LD-pruned merged dataset (3,595 × 60,279; winter: 77,111) |
| admix_results/K{2-8}.Q | ADMIXTURE Q-matrices for each K |
| global_pop_labels.txt | Population and superpopulation labels |
| validation/validation_results.json | Covariate test results (JSON) |
| validation/ethnicity_q_values.tsv | Ethnicity × Q-value means per group |
| validation/birthplace_q_values.tsv | Birthplace × Q-value means per region |
| validation/per_sample_covariates.tsv | Per-sample Q-values with covariates |
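A minimal sketch of how per-population mean ancestry can be derived from a Q-matrix follows. ADMIXTURE `.Q` files are whitespace-delimited with one row per sample, in the same order as the `.fam` file, and K columns. The layout of `global_pop_labels.txt` is not described here, so the code assumes the labels are already available as a list in sample order; the function names and toy values are hypothetical.

```python
from collections import defaultdict

def parse_q(lines):
    """Parse .Q-style text: one sample per line, K whitespace-separated floats."""
    return [[float(x) for x in line.split()] for line in lines if line.strip()]

def mean_q_by_group(q_rows, labels):
    """Mean ancestry proportions per group label.

    `labels` is assumed to be in the same sample order as the Q rows.
    """
    if len(q_rows) != len(labels):
        raise ValueError("Q-matrix and label list differ in length")
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(q_rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

# Toy input (hypothetical values) standing in for a K=2 .Q file.
q_lines = ["0.9 0.1", "0.7 0.3", "0.2 0.8"]
labels = ["UZB", "UZB", "CEU"]

means = mean_q_by_group(parse_q(q_lines), labels)
for label, props in means.items():
    print(label, [round(p, 3) for p in props])
```

With real files, `parse_q` could be passed an open file handle, since iterating over a handle yields lines.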