
Step 11: Global ADMIXTURE with 1000 Genomes

Unsupervised ancestry decomposition of the Uzbek cohort with continental reference populations

✓ Spring 2026 — April 11, 2026 Completed — March 2026

1. Overview

While the Uzbek-only ADMIXTURE (Step 10) identified a West–East admixture cline within the cohort, it could not assign those components to known continental ancestries. This step merges 1,047 Uzbek samples with 2,548 individuals from the 1000 Genomes Project (all 26 populations across 5 superpopulations), enabling ADMIXTURE’s unsupervised algorithm to anchor components to known African, European, South Asian, East Asian, and admixed American reference panels.

Goal: Determine the continental ancestry composition of the Uzbek cohort and establish whether the geographic cline observed in the Uzbek-only run corresponds to a EUR↔EAS gradient.

Spring 2026 ADMIXTURE Summary

Global samples: 3,595
LD-pruned SNPs: 60,279
Min CV (K=8): 0.28401
Evanno |L″| peak: K=3
Spring 2026 global ADMIXTURE: Merged 79,767 common SNPs → QC (HWE −11,905, MAF −7,377) → 60,485 → LD pruning (50 10 0.1, −206) → 60,279 final variants. CV monotonically decreasing K=2→8. Evanno |L″(K)| peaks at K=3 (1,682,893), consistent with winter. UZB-only: 1,047 samples × 294,597 variants. K=2 CV=0.26309 (minimum), K=3 CV=0.26350 — ADMIXTURE K=4–8 still running.
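The variant accounting in the summary can be checked with a few lines of arithmetic. This is only a bookkeeping sketch using the counts quoted above; the variable names are illustrative and not part of the pipeline.

```python
# Variant counts quoted in the Spring 2026 summary.
merged_snps = 79_767
removed = {"hwe": 11_905, "maf": 7_377, "ld_pruning": 206}

# QC removes HWE and MAF failures first, then LD pruning is applied.
after_qc = merged_snps - removed["hwe"] - removed["maf"]
final_snps = after_qc - removed["ld_pruning"]
retained_fraction = final_snps / merged_snps

print(f"after QC:         {after_qc:,}")
print(f"after LD pruning: {final_snps:,}")
print(f"retained:         {retained_fraction:.1%}")
```

Both intermediate totals match the figures reported in the summary (60,485 and 60,279).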

2. Pipeline & QC

2.1 Reference Panel Selection

We use the complete 1000 Genomes Phase 3 reference (all 26 populations grouped into 5 superpopulations), providing broad continental representation.

| Superpopulation | Populations included | N |
|---|---|---|
| AFR | YRI, LWK, GWD, MSL, ESN, ACB, ASW | 671 |
| EUR | CEU, GBR, FIN, IBS, TSI | 522 |
| SAS | GIH, PJL, BEB, STU, ITU | 492 |
| EAS | CHB, JPT, CHS, CDX, KHV | 515 |
| AMR | MXL, PUR, CLM, PEL | 348 |
| UZB | Uzbek cohort (this study) | 1,047 |
| Total | | 3,595 |

2.2 Processing Pipeline

    # Per-chromosome VCF extraction from 1000G Phase 3
    # bcftools view -S FILE: keep only samples listed in FILE (one ID per line)
    #   -R FILE: restrict to regions/sites in FILE (tab-delimited CHROM, POS;
    #            requires an indexed input VCF)
    #   -Oz: output as bgzip-compressed VCF
    for chr in {1..22}; do
      bcftools view -S ref_samples.txt -R "uzbek_snps_chr${chr}.txt" \
        "1000G_chr${chr}.vcf.gz" -Oz -o "ref_chr${chr}.vcf.gz"
    done

    # Merge with Uzbek data per chromosome, then genome-wide merge
    for chr in {1..22}; do
      plink --vcf "ref_chr${chr}.vcf.gz" --bmerge "uzbek_chr${chr}" \
        --make-bed --out "merged_chr${chr}"
    done

    # --merge-list FILE: merge all filesets listed (one prefix per line)
    plink --merge-list chr_list.txt --make-bed --out global_merged

    # QC: MAF > 0.01, geno < 0.02, HWE 1e-6
    plink --bfile global_merged --maf 0.01 --geno 0.02 --hwe 1e-6 \
      --make-bed --out global_qc

    # LD pruning: window=50, step=10, r²=0.1 (matching Uzbek-only pipeline)
    plink --bfile global_qc --indep-pairwise 50 10 0.1 --out global_prune
    plink --bfile global_qc --extract global_prune.prune.in \
      --make-bed --out global_for_admixture

    # ADMIXTURE v1.3.0, K=2–8 with 32 threads and cross-validation
    #   --cv: enable 5-fold cross-validation (predictive error for model selection)
    #   -j32: use 32 parallel threads
    #   tee: copy stdout to log file while still printing to terminal
    for K in {2..8}; do
      admixture --cv -j32 global_for_admixture.bed "$K" | tee "log_K${K}.out"
    done
Final dataset (Spring 2026): 3,595 samples × 60,279 LD-pruned SNPs (Winter 2025: 3,595 × 77,111)
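Once the ADMIXTURE loop has finished, the cross-validation errors can be collected from the log files written by tee. The sketch below assumes the logs contain ADMIXTURE's usual "CV error (K=n): value" line and uses the log_K*.out naming from the pipeline; the function names are illustrative.

```python
import re
from pathlib import Path

# ADMIXTURE reports one line per run, e.g. "CV error (K=3): 0.28936".
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def parse_cv_errors(log_texts):
    """Return {K: cv_error} from an iterable of ADMIXTURE log contents."""
    errors = {}
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            errors[int(k)] = float(err)
    return errors


def read_logs(log_dir=".", pattern="log_K*.out"):
    """Read the log files produced by the tee command (assumed naming)."""
    return [p.read_text() for p in sorted(Path(log_dir).glob(pattern))]


# Small in-memory demonstration using two values from the CV table.
demo_logs = [
    "Summary:\nCV error (K=2): 0.30003\n",
    "Summary:\nCV error (K=3): 0.28936\n",
]
cv_demo = parse_cv_errors(demo_logs)
for k in sorted(cv_demo):
    print(k, cv_demo[k])
```

On the server, parse_cv_errors(read_logs()) would be called in the directory holding the logs.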

3. Cross-Validation Error

ADMIXTURE’s 5-fold cross-validation error measures predictive accuracy. The lowest CV error indicates the best-fitting number of ancestral populations.

| K | CV Error, Winter 2025 (77,111 SNPs) | CV Error, Spring 2026 (60,279 SNPs) | Δ from prev (Spring) | Note |
|---|---|---|---|---|
| 2 | 0.31043 | 0.30003 | | AFR vs non-AFR split |
| 3 | 0.29992 | 0.28936 | −0.01067 | EAS separates |
| 4 | 0.29703 | 0.28658 | −0.00278 | SAS separates |
| 5 | 0.29503 | 0.28476 | −0.00182 | AMR / Central Asian component |
| 6 | 0.29458 | 0.28436 | −0.00040 | Plateau begins |
| 7 | 0.29442 | 0.28418 | −0.00018 | Effectively tied with K=8 |
| 8 | 0.29422 | 0.28401 | −0.00017 | ← Minimum CV (K=7–8 plateau) |
Spring 2026: K=8 remains the nominal minimum (CV = 0.28401), consistent with winter (0.29422). All CV values are ~0.01 lower due to SNP reduction (77,111→60,279 after stricter QC), but the shape is identical: steep drop K=2→3 (3.6%), gradual decline through K=5, then plateau K=5–8 (<0.15% per step). K=5 remains the most parsimonious model capturing all major continental groups.
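The per-step percentages can be recomputed directly from the spring column of the CV table. The snippet below is a small check, not part of the pipeline; the helper name relative_drops is illustrative.

```python
# Spring 2026 CV errors from the table above.
cv_spring = {
    2: 0.30003, 3: 0.28936, 4: 0.28658, 5: 0.28476,
    6: 0.28436, 7: 0.28418, 8: 0.28401,
}


def relative_drops(cv):
    """Percent decrease in CV error when moving from K-1 to K."""
    ks = sorted(cv)
    return {k: 100 * (cv[prev] - cv[k]) / cv[prev] for prev, k in zip(ks, ks[1:])}


drops = relative_drops(cv_spring)
best_k = min(cv_spring, key=cv_spring.get)

for k, pct in drops.items():
    print(f"K={k - 1}->{k}: {pct:.3f}% lower CV error")
print("nominal minimum at K =", best_k)
```

Reading the output as a curve rather than picking the single minimum is what motivates treating K=5 as sufficient: the steps beyond it are an order of magnitude smaller than the K=2→3 drop.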

3.1 Evanno Method (ΔK)

Note: The log-likelihoods below are from one representative ADMIXTURE run per K on the global dataset (3,595 samples; winter: 77,111 SNPs, spring: 60,279 SNPs). The table shows single-run values rather than mean ± SD across replicates. The |L″(K)| column is therefore a raw second difference, not the full Evanno ΔK (which requires run-to-run variance in the denominator). Since the Evanno method is statistically inappropriate for ADMIXTURE (see caveat below), we rely on CV error and sNMF cross-entropy for K selection instead.
⚠️ Statistical caveat: The Evanno method was designed for STRUCTURE, which uses Bayesian MCMC sampling — the run-to-run variance in L(K) reflects stochastic convergence of the Markov chain. ADMIXTURE uses deterministic maximum-likelihood optimization (EM/block relaxation), so run-to-run variance is merely local-optimum noise from different random seeds, not meaningful MCMC variability. This means ΔK = mean|L″(K)| / sd(L(K)) is operating on the wrong type of statistic, and the resulting peak can be unreliable — especially for clinal populations like Central Asian groups where discrete ancestral populations may never have existed.

We include this analysis for comparison with published literature, but cross-validation error (Section 3) remains the statistically appropriate model selection criterion for ADMIXTURE. An independent sNMF analysis (Section 3.2) provides additional validation via cross-entropy.

The Evanno method (Evanno et al., 2005) computes the second-order rate of change of the log-likelihood, |L″(K)|, across consecutive K values. The value of K where |L″(K)| is maximized indicates the most abrupt change in model fit — corresponding to the uppermost level of hierarchical population structure. Note that this approach was designed for STRUCTURE’s MCMC framework and its application to ADMIXTURE output should be interpreted with caution (see caveat above).
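For reference, the full Evanno statistic needs several replicate runs per K, which this page does not report. The sketch below shows the computation on made-up replicate log-likelihoods (purely hypothetical numbers) so that the difference from the single-run second difference is explicit.

```python
from statistics import mean, stdev


def evanno_delta_k(loglik_runs):
    """Evanno delta-K from replicate log-likelihoods.

    loglik_runs maps consecutive K values to lists of L(K), one per replicate.
    delta-K(K) = mean(|L(K+1) - 2 L(K) + L(K-1)|) / sd(L(K)).
    Undefined for the smallest and largest K, and when sd(L(K)) is zero.
    """
    ks = sorted(loglik_runs)
    result = {}
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        second = [
            abs(l_next - 2 * l_k + l_prev)
            for l_prev, l_k, l_next in zip(
                loglik_runs[prev], loglik_runs[k], loglik_runs[nxt]
            )
        ]
        result[k] = mean(second) / stdev(loglik_runs[k])
    return result


# Hypothetical replicate values, for illustration only.
toy_runs = {
    2: [-100.0, -102.0, -101.0],
    3: [-60.0, -61.0, -59.0],
    4: [-55.0, -54.0, -56.0],
    5: [-52.0, -53.0, -51.0],
}
delta_k = evanno_delta_k(toy_runs)
print(delta_k)
```

With a deterministic optimizer the denominator can be very small, which inflates the statistic; that is one practical symptom of the mismatch described in the caveat.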

Using the log-likelihood values L(K) from the global ADMIXTURE run at each K = 2–8:

Winter 2025 (3,595 × 77,111 SNPs)

| K | L(K) (single run) | L′(K) = L(K) − L(K−1) | \|L″(K)\| (not true ΔK) |
|---|---|---|---|
| 2 | −147,676,905 | | |
| 3 | −144,699,938 | 2,976,966 | 2,087,440 ← max |
| 4 | −143,810,412 | 889,526 | 241,190 |
| 5 | −143,162,075 | 648,337 | 437,625 |
| 6 | −142,951,364 | 210,711 | 84,113 |
| 7 | −142,824,765 | 126,599 | 26,085 |
| 8 | −142,672,081 | 152,684 | |

Spring 2026 (3,595 × 60,279 SNPs)

| K | L(K) (single run) | L′(K) = L(K) − L(K−1) | \|L″(K)\| (not true ΔK) |
|---|---|---|---|
| 2 | −110,554,044 | | |
| 3 | −108,197,827 | 2,356,217 | 1,682,893 ← max |
| 4 | −107,524,503 | 673,324 | 201,187 |
| 5 | −107,052,366 | 472,137 | 315,425 |
| 6 | −106,895,654 | 156,712 | 60,902 |
| 7 | −106,799,844 | 95,810 | 24,454 |
| 8 | −106,679,580 | 120,264 | |
Evanno peak at K=3 confirmed in both runs. Spring |L″(3)| = 1,682,893 (winter: 2,087,440) — both peak at K=3, corresponding to the three continental super-groups (AFR / EUR+SAS / EAS). Secondary peak at K=5 in both (spring: 315,425; winter: 437,625). The consistent K=3 peak across different SNP sets reinforces that the deepest split is three-way continental, with finer K values adding within-group resolution.
Winter 2025 detail: |L″(K)| peaks at K = 3 (2,087,440). The most fundamental split separates three broad ancestral groups: African, West Eurasian (European + South Asian), and East Eurasian (East Asian + Central Asian). A secondary peak at K=5 (437,625) reflects emergence of the Central Asian / AMR components.
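The single-run columns in the spring table can be reproduced from the log-likelihoods alone. In this sketch the second difference at K is taken as |L′(K+1) − L′(K)|, which is the convention the table values follow; the function name is illustrative.

```python
# Spring 2026 single-run log-likelihoods from the table above.
loglik_spring = {
    2: -110_554_044, 3: -108_197_827, 4: -107_524_503, 5: -107_052_366,
    6: -106_895_654, 7: -106_799_844, 8: -106_679_580,
}


def second_differences(loglik):
    """Return (first, second) where first[K] = L(K) - L(K-1) and
    second[K] = |first[K+1] - first[K]| (raw value, no replicate variance)."""
    ks = sorted(loglik)
    first = {k: loglik[k] - loglik[prev] for prev, k in zip(ks, ks[1:])}
    fks = sorted(first)
    second = {k: abs(first[nxt] - first[k]) for k, nxt in zip(fks, fks[1:])}
    return first, second


first_diff, second_diff = second_differences(loglik_spring)
peak_k = max(second_diff, key=second_diff.get)

for k in sorted(second_diff):
    print(f"K={k}: L'={first_diff[k]:,}  |L''|={second_diff[k]:,}")
print("peak at K =", peak_k)
```

Because only one run per K is available, these are raw second differences and should not be reported as ΔK.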
Reconciling CV error (K=5–8 plateau) vs Evanno |L″(K)| (K=3): These two criteria answer different questions. The Evanno |L″(K)| detects the dominant structural break — the deepest split in the population tree. Cross-validation error measures overall predictive accuracy. K = 3 captures continental-level divergence (AFR / EUR+SAS / EAS), while the CV error plateau at K=5–8 suggests no single optimal resolution. These methods give different K values (K=3 vs K=5–8), which is expected — they measure fundamentally different properties. With superpopulation-level references, fewer K are needed to capture the main structure. K = 5 captures all major continental groups plus the Central Asian component; K=6–8 provide diminishing additional resolution. For downstream GWAS covariate adjustment, K = 5 is the most parsimonious choice.

3.2 sNMF (Sparse Non-negative Matrix Factorization)

To provide an independent, statistically rigorous validation of K selection, we supplement ADMIXTURE with sNMF (Frichot et al., 2014), implemented in the R package LEA (Bioconductor). sNMF offers several advantages over both ADMIXTURE and STRUCTURE:

| Property | ADMIXTURE | sNMF (LEA) |
|---|---|---|
| Method | ML via EM optimization | Regularized least-squares NMF |
| K selection | Cross-validation error | Cross-entropy criterion |
| Speed | Hours per run | Minutes per run |
| Missing data | Missing genotypes omitted from the likelihood (no separate imputation step) | Handled natively |
| Regularization | None | L2 penalty (α); reduces overfitting |
| Clinal structure | Forces discrete clusters | Still clusters, but regularization dampens artifacts |

sNMF uses a cross-entropy criterion for model selection: a fraction of genotypes are masked, the model is fitted on the remaining data, and cross-entropy measures how well the model predicts the masked entries. Unlike ADMIXTURE’s CV error, this does not depend on likelihood-based assumptions.

Cross-Entropy Results

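| K | Mean CE | SD | Min CE | Best Run |
|---|---|---|---|---|
| 2 | 0.548159 | 0.000163 | 0.547834 | 4 |
| 3 | 0.539540 | 0.000170 | 0.539228 | 4 |
| 4 | 0.537119 | 0.000171 | 0.536827 | 4 |
| 5 | 0.535743 | 0.000171 | 0.535451 | 4 |
| 6 | 0.534949 | 0.000168 | 0.534680 | 4 |
| 7 | 0.534713 | 0.000161 | 0.534462 | 4 |
| 8 | 0.534436 | 0.000373 | 0.534062 | 10 |
| 9 | 0.534309 | 0.000420 | 0.533806 | 4 |
| 10 | 0.534314 | 0.000335 | 0.533921 | 5 |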

The cross-entropy minimum is at K = 9 (CE = 0.534309), but the curve flattens after K = 7: the improvement from K=7 to K=9 is only 0.08% (0.534713 → 0.534309), and K=6→7 already shows only 0.04%. This plateau, combined with the larger SD at K≥8, suggests K = 7 is the most parsimonious sNMF choice, in line with the K=7–8 plateau in ADMIXTURE’s CV error (nominal minimum at K=8). The marginal gains at K=8–9 likely reflect minor substructure or noise.
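One way to make the plateau argument concrete is a one-standard-deviation rule: take the smallest K whose mean cross-entropy lies within one SD of the minimum. This rule is an assumption introduced here for illustration, not the criterion used in the analysis; the numbers are the means and SDs from the table.

```python
# Mean cross-entropy and SD per K, from the sNMF results table.
mean_ce = {
    2: 0.548159, 3: 0.539540, 4: 0.537119, 5: 0.535743, 6: 0.534949,
    7: 0.534713, 8: 0.534436, 9: 0.534309, 10: 0.534314,
}
sd_ce = {
    2: 0.000163, 3: 0.000170, 4: 0.000171, 5: 0.000171, 6: 0.000168,
    7: 0.000161, 8: 0.000373, 9: 0.000420, 10: 0.000335,
}


def pct_improvement(k_from, k_to):
    """Percent decrease in mean cross-entropy between two K values."""
    return 100 * (mean_ce[k_from] - mean_ce[k_to]) / mean_ce[k_from]


k_min = min(mean_ce, key=mean_ce.get)
threshold = mean_ce[k_min] + sd_ce[k_min]
# Smallest K whose mean CE is within one SD (at the minimum) of the minimum.
k_one_sd = min(k for k, ce in mean_ce.items() if ce <= threshold)

print("minimum at K =", k_min)
print(f"K=7->9 improvement: {pct_improvement(7, 9):.2f}%")
print(f"K=6->7 improvement: {pct_improvement(6, 7):.2f}%")
print("smallest K within one SD of the minimum:", k_one_sd)
```

On these values the rule also lands on K = 7, so it agrees with the plateau reading rather than replacing it.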

sNMF vs ADMIXTURE Concordance

| K | Mean Component Correlation | Range |
|---|---|---|
| 3 | 0.9996 | 0.9994 – 0.9998 |
| 5 | 0.9969 | 0.9939 – 0.9998 |
| 7 | 0.9921 | 0.9661 – 0.9998 |

The near-perfect mean correlations (mean r > 0.99 at every K compared) between sNMF and ADMIXTURE Q-matrices indicate that both methods recover essentially the same ancestry proportions, which provides independent support for the ADMIXTURE results. The slight decrease at K=7 (minimum r = 0.966 for one component) is expected as smaller ancestry components become harder to align precisely.

R Code (as executed)

    library(LEA)

    # Convert PLINK bed → VCF → geno format
    # plink --bfile global_for_admixture --recode vcf --out snmf_results/global_for_admixture
    geno_file <- vcf2geno("global_for_admixture.vcf")
    # 172,537 biallelic SNPs retained (of 380,376 input)

    # Run sNMF: K=2-10, 10 replicates each, with cross-entropy
    # sNMF = sparse Non-negative Matrix Factorization (Frichot et al. 2014)
    obj <- snmf(geno_file,
                K = 2:10,
                repetitions = 10,
                entropy = TRUE,     # compute cross-entropy (model selection criterion)
                alpha = 10,         # regularization strength
                iterations = 200,   # max iterations per run
                tolerance = 1e-5,   # convergence threshold
                project = "new",    # start fresh project (vs "continue" to add runs)
                CPU = 16)           # parallel threads for the 90 total runs

    # Extract Q-matrix at optimal K
    # which.min() returns a position in 2:10, so map it back to the K value
    k_values <- 2:10
    mean_ce <- sapply(k_values, function(k) mean(cross.entropy(obj, K = k)))
    K_opt <- k_values[which.min(mean_ce)]
    best_run <- which.min(cross.entropy(obj, K = K_opt))
    snmf_Q <- Q(obj, K = K_opt, run = best_run)

    # Compare with ADMIXTURE Q-matrices: greedy column matching by |cor|
    # admixture_Q must already hold the ADMIXTURE Q-matrix for the same K
    cor_mat <- abs(cor(snmf_Q, admixture_Q))  # K=7: mean r = 0.9921

Total runtime: ~3 hours for 90 runs (10 reps × K=2–10) on 16 CPU cores, compared to ~2 weeks for the equivalent ADMIXTURE batch.
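The concordance figures rely on matching sNMF columns to ADMIXTURE columns, since component order is arbitrary in both tools. The following is a minimal pure-Python sketch of greedy matching by absolute Pearson correlation, as named in the R comment above; the toy matrices are invented, and the exact matching code used for the reported values is not shown on this page.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def greedy_match(q_a, q_b):
    """Pair columns of two Q-matrices (lists of rows, samples x K).

    Repeatedly takes the unmatched column pair with the highest |r|.
    Returns a list of (column_in_a, column_in_b, abs_r) sorted by column_in_a.
    """
    cols_a, cols_b = list(zip(*q_a)), list(zip(*q_b))
    candidates = sorted(
        ((abs(pearson(ca, cb)), i, j)
         for i, ca in enumerate(cols_a)
         for j, cb in enumerate(cols_b)),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), []
    for r, i, j in candidates:
        if i not in used_a and j not in used_b:
            matches.append((i, j, r))
            used_a.add(i)
            used_b.add(j)
    return sorted(matches)


# Toy example: q_b holds the columns of q_a in the order (1, 2, 0).
q_a = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8], [0.5, 0.4, 0.1]]
q_b = [[row[1], row[2], row[0]] for row in q_a]

matches = greedy_match(q_a, q_b)
mean_r = mean(r for _, _, r in matches)
print(matches, mean_r)
```

Greedy matching is simple but not guaranteed to maximize the total correlation; an optimal assignment (for example the Hungarian algorithm) may differ when several components are similar, which is most likely at higher K.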

3.3 Uzbek-Only Validation (ADMIXTURE + sNMF)

To assess within-Uzbek substructure independently of reference populations, we ran both ADMIXTURE (with cross-validation) and sNMF on the Uzbek-only subset. If significant internal population stratification exists, it would manifest as K > 2 being favoured.

Uzbek-Only: ADMIXTURE CV Error

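| K | CV Error, Winter 2025 | CV Error, Spring 2026 | Note |
|---|---|---|---|
| 2 | 0.30762 | 0.26309 | ← Minimum (both) |
| 3 | 0.30811 | 0.26350 | |
| 4 | 0.30958 | running… | |
| 5 | 0.31103 | pending | |
| 6 | 0.31266 | pending | |
| 7 | 0.31445 | pending | |
| 8 | 0.31627 | pending | |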
Spring 2026 UZB-only: 1,047 samples × 294,597 LD-pruned variants. K=2 again shows the minimum CV (0.26309), consistent with winter (0.30762). K=4–8 still running on server.

CV error increases monotonically from K=2 through K=8 in the winter run, and the same direction holds for the two spring values available so far (K=2 < K=3). Both runs therefore favour K=2 and give no evidence of internal substructure beyond a two-way split; the spring pattern can only be confirmed once K=4–8 finish.

Uzbek-Only: sNMF Cross-Entropy

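| K | Mean CE | SD | Min CE | Best Run |
|---|---|---|---|---|
| 2 | 0.425013 | 0.000322 | 0.424585 | 1 |
| 3 | 0.425046 | 0.000325 | 0.424631 | 1 |
| 4 | 0.425388 | 0.000335 | 0.424947 | 1 |
| 5 | 0.425985 | 0.000331 | 0.425570 | 9 |
| 6 | 0.426539 | 0.000436 | 0.425946 | 1 |
| 7 | 0.427069 | 0.000323 | 0.426616 | 1 |
| 8 | 0.427624 | 0.000299 | 0.427281 | 9 |
| 9 | 0.428241 | 0.000346 | 0.427853 | 9 |
| 10 | 0.428758 | 0.000330 | 0.428315 | 1 |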

Cross-entropy increases monotonically from K=2 (0.4250) to K=10 (0.4288), independently confirming that Uzbeks form a relatively homogeneous group without deep ancestral subdivisions.

Uzbek-Only: sNMF vs ADMIXTURE Concordance

| K | Mean Component Correlation | Note |
|---|---|---|
| 2 | 0.997 | Near-perfect |
| 3 | 0.975 | Excellent |
| 4 | 0.923 | Good |
| 5 | 0.801 | Components diverge at noise level |

At K=2–3, sNMF and ADMIXTURE recover nearly identical ancestry proportions (r > 0.97). By K=5, concordance drops to r = 0.80, confirming that higher K values capture noise rather than real structure within this sample.

Interpretation: The K = 2 split within Uzbeks most likely reflects the well-documented East Eurasian / West Eurasian dual ancestry of Central Asian populations, arising from historical admixture between Turkic/Mongolic and Iranian-speaking groups along the Silk Road. This is a continuous cline rather than two discrete subpopulations, consistent with the global ADMIXTURE results where the Uzbek component is a blend of multiple ancestral sources.
On the limits of discrete-K models: Both ADMIXTURE and sNMF assume that observed genotypes can be decomposed into K discrete ancestral populations. In reality, especially across Eurasia and along the Silk Road corridor, population history involves continuous gene flow, isolation-by-distance, and admixture clines rather than discrete founding events. This is why log-likelihood tends to improve indefinitely with increasing K, and why no single “true K” may exist. The practical interpretation is that K = 5 provides the most parsimonious descriptive resolution of ancestry variation for covariate adjustment, not that exactly 5 ancestral populations existed historically.

4. Population Structure (Interactive)

[Interactive chart: mean ancestry proportions for each population at a selected K]

5. Ancestry Composition at K=5 (Most Parsimonious)

5.1 Component Assignment

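At K=5 with the superpopulation-level reference, each component is dominated by one continental group. Four resolve cleanly, while the Americas-specific component (Q3) reaches only 43.0% in AMR, as expected for admixed populations: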

| Component | Dominant in | Interpretation |
|---|---|---|
| Q1 | EAS (99.6%) | East Asian |
| Q2 | AFR (95.6%) | African |
| Q3 | AMR (43.0%) | Americas-specific (Native American-like) |
| Q4 | SAS (89.8%) | South Asian |
| Q5 | EUR (96.3%), UZB (48.8%) | European / West Eurasian |

5.2 Uzbek Cohort Ancestry Breakdown (K=5)

| Component | Ancestry | Uzbek Mean | Interpretation |
|---|---|---|---|
| Q5 | European / West Eurasian | 48.8% | Core West Eurasian ancestry (EUR + Central Asian) |
| Q1 | East Asian | 28.8% | Steppe / Turkic / Mongol heritage |
| Q4 | South Asian | 19.2% | Indo-Aryan substrate / trade contact |
| Q3 | Americas-specific | 2.9% | Shared Ancient North Eurasian ancestry |
| Q2 | African | 0.3% | Minimal; effectively absent |
Central Asian complexity: With the superpopulation-level reference, the Uzbek cohort is ~49% West Eurasian, ~29% East Asian, and ~19% South Asian. Uzbek ancestry decomposes directly into continental sources — reflecting that the Central Asian component is itself a mixture of West Eurasian, East Asian, and South Asian ancestries. The ~3% AMR-like component likely reflects shared Ancient North Eurasian (ANE) ancestry.

6. Covariate Validation

Kruskal–Wallis tests assessed whether self-reported ethnicity or birthplace within the Uzbek cohort predicts any ADMIXTURE component (1,068 matched samples).

6.1 Ethnicity

| K | Components tested | Any significant? |
|---|---|---|
| 2 | Q1, Q2 | No (all ns) |
| 3 | Q1–Q3 | No (all ns) |
| 4 | Q1–Q4 | No (all ns) |
| 5 | Q1–Q5 | No (all ns) |
| 6 | Q1–Q6 | No (all ns) |
| 7 | Q1–Q7 | No (all ns) |
| 8 | Q1–Q8 | No (all ns) |

Conclusion: Ethnicity is non-significant at every K and for every component. ADMIXTURE components do not reflect ethnic self-identification.

6.2 Birthplace (Geographic Origin)

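| K | Most significant component | p-value | η² |
|---|---|---|---|
| 2 | Q1 / Q2 | 2.25 × 10⁻²² | 0.073 |
| 3 | Q3 | 1.47 × 10⁻²² | 0.074 |
| 4 | Q4 | 3.13 × 10⁻²² | 0.074 |
| 5 | Q3 | 8.83 × 10⁻²³ | 0.074 |
| 6 | Q6 | 1.82 × 10⁻²² | 0.073 |
| 7 | Q6 | 3.24 × 10⁻²³ | 0.068 |
| 8 | Q6 | 1.78 × 10⁻²³ | 0.063 |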
Key finding: Even with 1000G references anchoring continental ancestry components, self-reported ethnicity remains non-significant across all K values, whereas birthplace (region within Uzbekistan) remains highly significant (p ≈ 10⁻²²). ADMIXTURE components therefore track geographic structure rather than ethnic self-identification, which supports a genuine, geographically driven admixture cline.
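For readers unfamiliar with the test, the sketch below computes the Kruskal–Wallis H statistic and an H-based effect size in pure Python. It assumes η² = (H − k + 1) / (n − k), a common definition that may or may not be the one used for the table; it omits the tie correction and the p-value, and the group labels and values are invented.

```python
def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def kruskal_wallis(groups):
    """Return (H, eta_squared) for {label: [values]}; no tie correction."""
    labels = list(groups)
    pooled = [v for label in labels for v in groups[label]]
    ranks = average_ranks(pooled)
    n, k = len(pooled), len(labels)
    total, start = 0.0, 0
    for label in labels:
        size = len(groups[label])
        rank_sum = sum(ranks[start:start + size])
        total += rank_sum ** 2 / size
        start += size
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    eta_squared = (h - k + 1) / (n - k)  # assumed effect-size definition
    return h, eta_squared


# Invented Q-values for three hypothetical birthplace groups.
toy_groups = {
    "region_a": [0.01, 0.02, 0.03],
    "region_b": [0.04, 0.05, 0.06],
    "region_c": [0.07, 0.08, 0.09],
}
h_stat, eta2 = kruskal_wallis(toy_groups)
print(f"H = {h_stat:.3f}, eta^2 = {eta2:.3f}")
```

In practice a library routine such as scipy.stats.kruskal would normally be used, since it also applies the tie correction and returns the p-value.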

6.3 Geographic Cline at K=7

Mean East Asian component (Q6) by birthplace reveals the geographic gradient:

| Region | N | Central Asian (Q2) | East Asian (Q6) | S. European (Q1) | Interpretation |
|---|---|---|---|---|---|
| Jizzakh | 84 | 73.9% | 19.6% | 1.3% | Most Eastern-shifted |
| Andijan | 18 | 71.6% | 19.9% | 2.9% | Eastern |
| Tashkent region | 160 | 70.6% | 15.5% | 3.1% | Central |
| Tashkent city | 461 | 69.5% | 11.0% | 4.8% | Central-Western |
| Karakalpakstan | 42 | 79.6% | 5.1% | 7.2% | Western-shifted |
| Fergana | 30 | 76.3% | 2.7% | 8.2% | Most Western-shifted |
Geographic interpretation: The East Asian component ranges from ~2.7% (Fergana) to ~19.9% (Andijan), a 7.4-fold difference across Uzbekistan. Individuals born in the Fergana region carry the highest Southern European ancestry (8.2%), consistent with the region’s historical role as a Sogdian/Greek cultural hub. The regions with the highest East Asian component (Jizzakh, Andijan) show stronger steppe/Turkic-Mongol introgression, consistent with proximity to the Kazakh steppe and historical nomadic corridors. The Andijan estimate rests on only 18 individuals and should be read with caution.

7. Key Findings & Biological Interpretation

Finding 1: At K=5 the Uzbek cohort is predominantly European (48.8%) and East Asian (28.8%), with a substantial South Asian (19.2%) component — a three-way admixture not captured by any single 1000G reference population.
Finding 2: East Asian ancestry (~28.8% mean) represents Turkic-Mongol steppe heritage, varying across regions. This variation is a potential confounder for disease association studies.
Finding 3: Western Eurasian admixture (EUR + SAS, ~68.0%) reflects the Silk Road Indo-European substrate with Sogdian and Indo-Aryan contributions, strongest in Fergana Valley.
Finding 4: Ethnicity is NOT a confounder: all 35 Kruskal–Wallis tests (2 + 3 + … + 8 components across K = 2–8) returned non-significant p-values.
Implication for RPL GWAS: Per-sample ancestry proportions at K=5, which track birthplace, should be included as covariates in any genome-wide association study to control for population stratification. Ethnic labels are uninformative.
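A covariate table can be assembled by pairing the rows of a Q-matrix with the sample IDs in the matching .fam file, since ADMIXTURE writes Q rows in .fam order. The sketch below is one possible way to do this, not the project's script: the function name, the six-decimal formatting, and the choice to drop the last component are assumptions. One component is dropped because the proportions sum to 1 and would otherwise be collinear with the intercept.

```python
def build_covariates(fam_lines, q_lines, keep_iids=None, drop_component=None):
    """Pair .fam sample IDs with Q-matrix rows and return (header, rows).

    fam_lines: lines of a PLINK .fam file (FID IID ...), same order as q_lines.
    q_lines:   lines of an ADMIXTURE .Q file (K whitespace-separated values).
    keep_iids: optional set of IIDs to retain (e.g. the Uzbek samples only).
    drop_component: 0-based index of the component to omit (default: last).
    """
    fam = [line.split() for line in fam_lines if line.strip()]
    q = [[float(x) for x in line.split()] for line in q_lines if line.strip()]
    if len(fam) != len(q):
        raise ValueError("the .fam and .Q files must have the same number of rows")
    k = len(q[0])
    drop = k - 1 if drop_component is None else drop_component
    header = ["FID", "IID"] + [f"Q{i + 1}" for i in range(k) if i != drop]
    rows = []
    for fields, props in zip(fam, q):
        fid, iid = fields[0], fields[1]
        if keep_iids is not None and iid not in keep_iids:
            continue
        rows.append([fid, iid] + [f"{p:.6f}" for i, p in enumerate(props) if i != drop])
    return header, rows


# Invented two-sample example at K=5.
toy_fam = ["F1 S1 0 0 1 -9", "F2 S2 0 0 2 -9"]
toy_q = ["0.1 0.2 0.3 0.3 0.1", "0.5 0.1 0.1 0.2 0.1"]
header, rows = build_covariates(toy_fam, toy_q)
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

For the global run, keep_iids would restrict the output to the Uzbek samples, because the Q-matrix also contains the 1000G reference individuals.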

8. Output Files

| File | Description |
|---|---|
| global_for_admixture.bed/bim/fam | LD-pruned merged dataset (3,595 × 60,279; winter: 77,111) |
| admix_results/K{2-8}.Q | ADMIXTURE Q-matrices for each K |
| global_pop_labels.txt | Population and superpopulation labels |
| validation/validation_results.json | Covariate test results (JSON) |
| validation/ethnicity_q_values.tsv | Ethnicity × Q-value means per group |
| validation/birthplace_q_values.tsv | Birthplace × Q-value means per region |
| validation/per_sample_covariates.tsv | Per-sample Q-values with covariates |
