Step 11: Global ADMIXTURE with 1000 Genomes

1. Overview

While the Uzbek-only ADMIXTURE (Step 10) identified a West–East admixture cline within the cohort, it could not assign those components to known continental ancestries. This step merges 1,047 Uzbek samples with 2,548 individuals from the 1000 Genomes Project (all 26 populations across 5 superpopulations), so that ADMIXTURE’s unsupervised algorithm can anchor components to known African, European, South Asian, East Asian, and admixed American reference panels.

Goal: Determine the continental ancestry composition of the Uzbek cohort and establish whether the geographic cline observed in the Uzbek-only run corresponds to a EUR↔EAS gradient.

Spring 2026 ADMIXTURE Summary

Metric	Value
Global samples	3,595
LD-pruned SNPs	60,279
Min CV (K=8)	0.28401
Evanno |L″| peak	K=3

Spring 2026 global ADMIXTURE: the merge kept 79,767 SNPs common to both datasets. QC removed 11,905 (HWE) and 7,377 (MAF), leaving 60,485; LD pruning (50 10 0.1) removed a further 206, giving 60,279 final variants. CV error decreases monotonically from K=2 to K=8. The Evanno |L″(K)| peaks at K=3 (1,682,893), consistent with the winter run. UZB-only: 1,047 samples × 294,597 variants; K=2 has the lowest CV so far (0.26309) versus K=3 (0.26350), and ADMIXTURE K=4–8 is still running.
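The variant accounting in the summary above can be checked with a few lines of arithmetic. This is a minimal sketch that only re-adds the counts quoted in the text; the variable names are illustrative and not part of the pipeline.

```python
# Variant counts quoted in the Spring 2026 summary.
merged_common = 79_767
removed_by_qc = {"HWE": 11_905, "MAF": 7_377}
removed_by_ld_pruning = 206

after_qc = merged_common - sum(removed_by_qc.values())
final_variants = after_qc - removed_by_ld_pruning

print(f"after QC:         {after_qc:,}")
print(f"after LD pruning: {final_variants:,}")
```

Both intermediate totals agree with the figures reported for the spring run (60,485 after QC and 60,279 after pruning).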

2. Pipeline & QC

2.1 Reference Panel Selection

We use the complete 1000 Genomes Phase 3 reference (all 26 populations grouped into 5 superpopulations), providing broad continental representation.

Superpopulation	Populations included	N
AFR	YRI, LWK, GWD, MSL, ESN, ACB, ASW	671
EUR	CEU, GBR, FIN, IBS, TSI	522
SAS	GIH, PJL, BEB, STU, ITU	492
EAS	CHB, JPT, CHS, CDX, KHV	515
AMR	MXL, PUR, CLM, PEL	348
UZB	Uzbek cohort (this study)	1,047
Total		3,595

2.2 Processing Pipeline

# Per-chromosome VCF extraction from 1000G Phase 3
# bcftools view -S FILE: keep only samples listed in FILE (one ID per line)
#              -R FILE: restrict to regions/sites in FILE (tab-delimited CHROM, POS;
#                       the input VCF must be bgzipped and indexed)
#              -Oz: output as bgzip-compressed VCF
for chr in {1..22}; do
    bcftools view -S ref_samples.txt -R uzbek_snps_chr${chr}.txt \
        1000G_chr${chr}.vcf.gz -Oz -o ref_chr${chr}.vcf.gz
done

# Merge with Uzbek data per chromosome, then genome-wide merge
for chr in {1..22}; do
    plink --vcf ref_chr${chr}.vcf.gz --bmerge uzbek_chr${chr} --make-bed --out merged_chr${chr}
done
# --merge-list FILE: merge all filesets listed in FILE (one prefix per line)
plink --merge-list chr_list.txt --make-bed --out global_merged

# QC: keep variants with MAF ≥ 0.01, missing call rate ≤ 0.02, and HWE p ≥ 1e-6
plink --bfile global_merged --maf 0.01 --geno 0.02 --hwe 1e-6 --make-bed --out global_qc

# LD pruning: window=50, step=10, r²=0.1 (matching Uzbek-only pipeline)
plink --bfile global_qc --indep-pairwise 50 10 0.1 --out global_prune
plink --bfile global_qc --extract global_prune.prune.in --make-bed --out global_for_admixture

# ADMIXTURE v1.3.0, K=2–8 with 32 threads and cross-validation
# --cv: enable 5-fold cross-validation (predictive error for model selection)
# -j32: use 32 parallel threads for the EM algorithm
# tee: copy stdout to log file while still printing to terminal
for K in {2..8}; do
    admixture --cv -j32 global_for_admixture.bed $K | tee log_K${K}.out
done

Final dataset (Spring 2026): 3,595 samples × 60,279 LD-pruned SNPs (Winter 2025: 3,595 × 77,111)
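Once the ADMIXTURE loop has finished, the CV errors have to be collected from the per-K logs. The sketch below assumes the logs are named `log_K*.out` as in the loop above and that each contains a line of the form `CV error (K=3): 0.28936`, which is what ADMIXTURE prints with `--cv`. The helper names are hypothetical.

```python
import re
from pathlib import Path

CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")

def parse_cv_errors(log_texts):
    """Return {K: cv_error} from an iterable of ADMIXTURE log contents."""
    errors = {}
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            errors[int(k)] = float(err)
    return errors

def read_logs(directory=".", pattern="log_K*.out"):
    """Read every log file matching the (assumed) naming pattern."""
    return [p.read_text() for p in sorted(Path(directory).glob(pattern))]

# Demo on two in-memory log snippets instead of real files.
demo_logs = ["...\nCV error (K=2): 0.30003\n", "...\nCV error (K=3): 0.28936\n"]
cv = parse_cv_errors(demo_logs)
best_k = min(cv, key=cv.get)
print(cv, "lowest CV at K =", best_k)
```

On real output, `parse_cv_errors(read_logs())` would replace the demo input.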

3. Cross-Validation Error

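ADMIXTURE’s 5-fold cross-validation error measures predictive accuracy: a subset of genotypes is masked, predicted from the fitted model, and scored. A lower CV error indicates a better-fitting number of ancestral populations, although small differences within a plateau are not decisive.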

K	CV Error Winter 2025 (77,111 SNPs)	CV Error Spring 2026 (60,279 SNPs)	Δ from prev (Spring)	Note
2	0.31043	0.30003	—	AFR vs non-AFR split
3	0.29992	0.28936	−0.01067	EAS separates
4	0.29703	0.28658	−0.00278	SAS separates
5	0.29503	0.28476	−0.00182	AMR (Americas-specific) component separates
6	0.29458	0.28436	−0.00040	Plateau begins
7	0.29442	0.28418	−0.00018	Effectively tied with K=8
8	0.29422	0.28401	−0.00017	← Minimum CV (K=7–8 plateau)

Spring 2026: K=8 remains the nominal minimum (CV = 0.28401), as in winter (0.29422). All spring CV values are ~0.01 lower, coinciding with the smaller SNP set (77,111 → 60,279 after stricter QC), but the shape is identical: a steep drop from K=2 to K=3 (3.6%), a gradual decline through K=5, then a plateau over K=5–8 (<0.15% per step). K=5 remains the most parsimonious model capturing all major continental groups.
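The shape of the CV curve is easiest to see as a relative change per step. The following sketch recomputes it from the spring column of the table; the function name is illustrative.

```python
spring_cv = {2: 0.30003, 3: 0.28936, 4: 0.28658, 5: 0.28476,
             6: 0.28436, 7: 0.28418, 8: 0.28401}

def relative_drops(cv):
    """Percent decrease in CV error when moving from K-1 to K."""
    ks = sorted(cv)
    return {k: 100 * (cv[prev] - cv[k]) / cv[prev] for prev, k in zip(ks, ks[1:])}

drops = relative_drops(spring_cv)
for k, pct in drops.items():
    print(f"K={k - 1}->{k}: {pct:.3f}%")
```

The K=2→3 step is about 3.6%, while every step beyond K=5 stays below 0.15%, which is what the plateau description refers to.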

3.1 Evanno Method (ΔK)

Note: The log-likelihoods below are from one representative ADMIXTURE run per K on the global dataset (3,595 samples; winter: 77,111 SNPs, spring: 60,279 SNPs). The table shows single-run values rather than mean ± SD across replicates. The |L″(K)| column is therefore a raw second difference, not the full Evanno ΔK (which requires run-to-run variance in the denominator). Since the Evanno method is statistically inappropriate for ADMIXTURE (see caveat below), we rely on CV error and sNMF cross-entropy for K selection instead.

⚠️ Statistical caveat: The Evanno method was designed for STRUCTURE, which uses Bayesian MCMC sampling — the run-to-run variance in L(K) reflects stochastic convergence of the Markov chain. ADMIXTURE uses deterministic maximum-likelihood optimization (EM/block relaxation), so run-to-run variance is merely local-optimum noise from different random seeds, not meaningful MCMC variability. This means ΔK = mean|L″(K)| / sd(L(K)) is operating on the wrong type of statistic, and the resulting peak can be unreliable — especially for clinal populations like Central Asian groups where discrete ancestral populations may never have existed.

We include this analysis for comparison with published literature, but cross-validation error (Section 3) remains the statistically appropriate model selection criterion for ADMIXTURE. An independent sNMF analysis (Section 3.2) provides additional validation via cross-entropy.

The Evanno method (Evanno et al., 2005) computes the second-order rate of change of the log-likelihood, |L″(K)|, across consecutive K values. The value of K where |L″(K)| is maximized indicates the most abrupt change in model fit — corresponding to the uppermost level of hierarchical population structure. Note that this approach was designed for STRUCTURE’s MCMC framework and its application to ADMIXTURE output should be interpreted with caution (see caveat above).

Using the log-likelihood values L(K) from the global ADMIXTURE run at each K = 2–8:

Winter 2025 (3,595 × 77,111 SNPs)

K	L(K) (single run)	L′(K) = L(K) − L(K−1)	|L″(K)| (not true ΔK)
2	−147,676,905	—	—
3	−144,699,938	2,976,966	2,087,440 ← max
4	−143,810,412	889,526	241,190
5	−143,162,075	648,337	437,625
6	−142,951,364	210,711	84,113
7	−142,824,765	126,599	26,085
8	−142,672,081	152,684	—

Spring 2026 (3,595 × 60,279 SNPs)

K	L(K) (single run)	L′(K) = L(K) − L(K−1)	|L″(K)| (not true ΔK)
2	−110,554,044	—	—
3	−108,197,827	2,356,217	1,682,893 ← max
4	−107,524,503	673,324	201,187
5	−107,052,366	472,137	315,425
6	−106,895,654	156,712	60,902
7	−106,799,844	95,810	24,454
8	−106,679,580	120,264	—

The Evanno |L″(K)| peak falls at K=3 in both runs (spring: 1,682,893; winter: 2,087,440). The most fundamental split therefore separates three broad ancestral groups: African, West Eurasian (European + South Asian), and East Eurasian (East Asian). A secondary peak appears at K=5 in both runs (spring: 315,425; winter: 437,625), reflecting the Americas-specific (AMR) component that resolves at K=5 (Section 5.1). The consistent K=3 peak across different SNP sets supports a three-way continental split as the deepest level of structure, with higher K values adding within-group resolution.
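The |L″(K)| column is a plain second difference of the single-run log-likelihoods. A minimal sketch using the spring values from the table follows. It is not the full Evanno ΔK, which would divide the mean |L″(K)| by the standard deviation of L(K) across replicate runs, and no replicates are available here.

```python
spring_loglik = {2: -110_554_044, 3: -108_197_827, 4: -107_524_503,
                 5: -107_052_366, 6: -106_895_654, 7: -106_799_844,
                 8: -106_679_580}

def second_differences(loglik):
    """|L''(K)| = |L(K+1) - 2*L(K) + L(K-1)| for each interior K."""
    ks = sorted(loglik)
    return {k: abs(loglik[k + 1] - 2 * loglik[k] + loglik[k - 1])
            for k in ks[1:-1]}

l2 = second_differences(spring_loglik)
peak_k = max(l2, key=l2.get)
for k, value in l2.items():
    print(f"K={k}: |L''| = {value:,}")
print("peak at K =", peak_k)
```

With these inputs the peak is at K=3 (1,682,893) and the next-largest value is at K=5 (315,425), matching the spring table.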

Reconciling CV error (K=5–8 plateau) vs Evanno |L″(K)| (K=3): The two criteria answer different questions, so different K values are expected. The Evanno |L″(K)| detects the dominant structural break, the deepest split in the population tree, and K=3 captures that continental-level divergence (AFR / EUR+SAS / EAS). Cross-validation error measures overall predictive accuracy, and its plateau at K=5–8 suggests no single optimal resolution. With superpopulation-level references, few components are needed to capture the main structure: K=5 resolves all five reference superpopulations, while K=6–8 provide diminishing additional resolution. For downstream GWAS covariate adjustment, K=5 is the most parsimonious choice.

3.2 sNMF (Sparse Non-negative Matrix Factorization)

To provide an independent, statistically rigorous validation of K selection, we supplement ADMIXTURE with sNMF (Frichot et al., 2014), implemented in the R package LEA (Bioconductor). sNMF offers several advantages over both ADMIXTURE and STRUCTURE:

Property	ADMIXTURE	sNMF (LEA)
Method	ML via EM optimization	Regularized least-squares NMF
K selection	Cross-validation error	Cross-entropy criterion
Speed	Hours per run	Minutes per run
Missing data	Missing genotypes skipped in the likelihood	Handled natively
Regularization	None	L2 penalty (α) — reduces overfitting
Clinal structure	Forces discrete clusters	Still clusters, but regularization dampens artifacts

sNMF uses a cross-entropy criterion for model selection: a fraction of genotypes are masked, the model is fitted on the remaining data, and cross-entropy measures how well the model predicts the masked entries. Unlike ADMIXTURE’s CV error, this does not depend on likelihood-based assumptions.
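The idea behind the criterion can be illustrated as the average negative log-probability that the fitted model assigns to each masked genotype. The sketch below is a simplified illustration under that assumption; LEA's exact estimator may differ in detail, and the probabilities are invented toy values.

```python
import math

def cross_entropy(pred_probs, observed):
    """Mean of -log p(observed genotype) over the masked entries.

    pred_probs: one [p(0), p(1), p(2)] list per masked genotype
    observed:   the true genotype code (0, 1 or 2) for each entry
    """
    total = sum(-math.log(p[g]) for p, g in zip(pred_probs, observed))
    return total / len(observed)

# Toy masked set: three genotypes with hypothetical predicted probabilities.
pred = [[0.7, 0.2, 0.1], [0.25, 0.5, 0.25], [0.1, 0.3, 0.6]]
obs = [0, 1, 2]
ce = cross_entropy(pred, obs)
print(f"cross-entropy = {ce:.4f}")
```

A model that predicts the masked genotypes with more confidence gets a lower value; an uninformative uniform prediction gives log(3) ≈ 1.0986.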

Cross-Entropy Results

K	Mean CE	SD	Min CE	Best Run
2	0.548159	0.000163	0.547834	4
3	0.539540	0.000170	0.539228	4
4	0.537119	0.000171	0.536827	4
5	0.535743	0.000171	0.535451	4
6	0.534949	0.000168	0.534680	4
7	0.534713	0.000161	0.534462	4
8	0.534436	0.000373	0.534062	10
9	0.534309	0.000420	0.533806	4
10	0.534314	0.000335	0.533921	5

The cross-entropy minimum is at K = 9 (CE = 0.534309), but the curve flattens markedly after K = 7: the improvement from K=7 to K=9 is only 0.08% (0.534713 → 0.534309), and K=6→7 already shows only 0.04% improvement. This plateau, combined with the larger SD at K≥8, suggests K = 7 is the most parsimonious choice under the cross-entropy criterion, in line with the K=7–8 plateau in ADMIXTURE’s CV error. The marginal gains at K=8–9 likely reflect minor substructure or noise.
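One way to make the plateau argument explicit is a one-standard-deviation rule: take the smallest K whose mean cross-entropy lies within one SD of the minimum. This is a common heuristic introduced here for illustration only; the source does not state that it was used. The sketch applies it to the table values.

```python
snmf_ce = {  # K: (mean cross-entropy, SD across 10 runs)
    2: (0.548159, 0.000163), 3: (0.539540, 0.000170), 4: (0.537119, 0.000171),
    5: (0.535743, 0.000171), 6: (0.534949, 0.000168), 7: (0.534713, 0.000161),
    8: (0.534436, 0.000373), 9: (0.534309, 0.000420), 10: (0.534314, 0.000335),
}

def one_sd_choice(table):
    """Return (K at the minimum mean, smallest K within one SD of that minimum)."""
    best_k = min(table, key=lambda k: table[k][0])
    best_mean, best_sd = table[best_k]
    threshold = best_mean + best_sd
    within = [k for k, (mean, _) in table.items() if mean <= threshold]
    return best_k, min(within)

best_k, parsimonious_k = one_sd_choice(snmf_ce)
print("minimum at K =", best_k, "| one-SD choice: K =", parsimonious_k)
```

With the table values the minimum is at K=9 and the one-SD rule selects K=7, because the K=7 mean (0.534713) falls just under the threshold of 0.534729 while the K=6 mean does not.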

sNMF vs ADMIXTURE Concordance

K	Mean Component Correlation	Range
3	0.9996	0.9994 – 0.9998
5	0.9969	0.9939 – 0.9998
7	0.9921	0.9661 – 0.9998

The near-perfect correlations (mean r > 0.99 at K = 3, 5, and 7) between sNMF and ADMIXTURE Q-matrices indicate that both methods recover essentially identical ancestry proportions. This provides strong independent validation of the ADMIXTURE results. The lower minimum at K=7 (r = 0.966 for one component) is expected, as smaller ancestry components become harder to align precisely.

R Code (as executed)

library(LEA)

# Convert PLINK bed → VCF → geno format
# plink --bfile global_for_admixture --recode vcf --out snmf_results/global_for_admixture
geno_file <- vcf2geno("snmf_results/global_for_admixture.vcf")  # path written by the plink command above
# 172,537 biallelic SNPs retained (of 380,376 input)

# Run sNMF: K=2-10, 10 replicates each, with cross-entropy
# sNMF = sparse Non-negative Matrix Factorization (Frichot et al. 2014)
obj <- snmf(geno_file,
            K = 2:10,
            repetitions = 10,
            entropy = TRUE,      # compute cross-entropy (model selection criterion)
            alpha = 10,          # regularization parameter α (LEA default is 10)
            iterations = 200,    # max optimizer iterations per run (200 sufficient for convergence)
            tolerance = 1e-5,    # convergence threshold (stop when improvement < 1e-5)
            project = "new",     # start fresh project (vs "continue" to add runs)
            CPU = 16)            # parallel threads for the 90 total runs

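# Extract the sNMF Q-matrix at the K with the lowest mean cross-entropy
K_values <- 2:10
mean_ce <- sapply(K_values, function(k) mean(cross.entropy(obj, K = k)))
K_opt <- K_values[which.min(mean_ce)]   # which.min() returns an index, not K itself
best_run <- which.min(cross.entropy(obj, K = K_opt))
snmf_Q <- Q(obj, K = K_opt, run = best_run)   # avoid masking the Q() function

# Compare with ADMIXTURE Q-matrices: greedy column matching by |cor|
# (admixture_Q: the ADMIXTURE .Q matrix for the same K, loaded separately)
cor_mat <- abs(cor(snmf_Q, admixture_Q))  # K=7: mean r = 0.9921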

Total runtime: ~3 hours for 90 runs (10 reps × K=2–10) on 16 CPU cores, compared to ~2 weeks for the equivalent ADMIXTURE batch.
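The greedy column matching mentioned in the R code comment can be sketched as follows: compute |r| for every pair of columns, then repeatedly accept the highest-correlated pair whose columns are both still unmatched. This is an illustrative re-implementation in plain Python, not the code that produced the concordance table, and the toy matrices are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def greedy_match(q_a, q_b):
    """Pair columns of two Q-matrices (lists of rows) by descending |r|."""
    cols_a, cols_b = list(zip(*q_a)), list(zip(*q_b))
    pairs = sorted(
        ((abs(pearson(a, b)), i, j)
         for i, a in enumerate(cols_a) for j, b in enumerate(cols_b)),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), {}
    for r, i, j in pairs:
        if i not in used_a and j not in used_b:
            matches[i] = (j, r)
            used_a.add(i)
            used_b.add(j)
    return matches

# Toy example: q_b holds the columns of q_a in a different order.
q_a = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.3, 0.3, 0.4]]
q_b = [[row[1], row[2], row[0]] for row in q_a]
matches = greedy_match(q_a, q_b)
mean_r = sum(r for _, r in matches.values()) / len(matches)
print(matches, f"mean |r| = {mean_r:.4f}")
```

Greedy matching is simple but not guaranteed to be globally optimal; when components are weakly separated, an assignment solver such as the Hungarian algorithm may pair columns differently.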

3.3 Uzbek-Only Validation (ADMIXTURE + sNMF)

To assess within-Uzbek substructure independently of reference populations, we ran both ADMIXTURE (with cross-validation) and sNMF on the Uzbek-only subset. If significant internal population stratification exists, it would manifest as K > 2 being favoured.

Uzbek-Only: ADMIXTURE CV Error

K	CV Error Winter 2025	CV Error Spring 2026	Note
2	0.30762	0.26309	← Minimum (both)
3	0.30811	0.26350
4	0.30958	running…
5	0.31103	pending
6	0.31266	pending
7	0.31445	pending
8	0.31627	pending

Spring 2026 UZB-only: 1,047 samples × 294,597 LD-pruned variants. K=2 again shows the minimum CV (0.26309), consistent with winter (0.30762). K=4–8 still running on server.

In the winter run, CV error increases monotonically from K=2 through K=8, and the spring results so far follow the same trend (K=2 < K=3). This indicates no substantial internal substructure beyond a two-way split. The conclusion rests on the complete winter series (all 7 K values); the spring series can only confirm it once K=4–8 have finished.

Uzbek-Only: sNMF Cross-Entropy

K	Mean CE	SD	Min CE	Best Run
2	0.425013	0.000322	0.424585	1
3	0.425046	0.000325	0.424631	1
4	0.425388	0.000335	0.424947	1
5	0.425985	0.000331	0.425570	9
6	0.426539	0.000436	0.425946	1
7	0.427069	0.000323	0.426616	1
8	0.427624	0.000299	0.427281	9
9	0.428241	0.000346	0.427853	9
10	0.428758	0.000330	0.428315	1

Cross-entropy increases monotonically from K=2 (0.4250) to K=10 (0.4288), independently confirming that Uzbeks form a relatively homogeneous group without deep ancestral subdivisions.
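The monotonic trend can be verified directly from the two Uzbek-only tables. The sketch below also expresses the K=2→3 cross-entropy step in units of the run-to-run SD at K=2, a quick and informal way to see how small that step is; it is not a formal test.

```python
uzb_cv_winter = [0.30762, 0.30811, 0.30958, 0.31103,
                 0.31266, 0.31445, 0.31627]              # K = 2..8
uzb_snmf_ce = [0.425013, 0.425046, 0.425388, 0.425985, 0.426539,
               0.427069, 0.427624, 0.428241, 0.428758]   # K = 2..10
sd_k2 = 0.000322                                         # SD at K=2 from the table

def strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))

step_in_sd = (uzb_snmf_ce[1] - uzb_snmf_ce[0]) / sd_k2

print("winter CV increasing:   ", strictly_increasing(uzb_cv_winter))
print("sNMF CE increasing:     ", strictly_increasing(uzb_snmf_ce))
print(f"K=2->3 step in SD units: {step_in_sd:.2f}")
```

Both series increase at every step, but the K=2→3 cross-entropy step is only about a tenth of one SD, so K=2 and K=3 are close to indistinguishable on this criterion alone.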

Uzbek-Only: sNMF vs ADMIXTURE Concordance

K	Mean Component Correlation	Note
2	0.997	Near-perfect
3	0.975	Excellent
4	0.923	Good
5	0.801	Components diverge at noise level

At K=2–3, sNMF and ADMIXTURE recover nearly identical ancestry proportions (r > 0.97). By K=5, concordance drops to r = 0.80, suggesting that higher K values capture noise rather than real structure within this sample.

Interpretation: The K = 2 split within Uzbeks most likely reflects the well-documented East Eurasian / West Eurasian dual ancestry of Central Asian populations, arising from historical admixture between Turkic/Mongolic and Iranian-speaking groups along the Silk Road. This is a continuous cline rather than two discrete subpopulations, consistent with the global ADMIXTURE results where the Uzbek component is a blend of multiple ancestral sources.

On the limits of discrete-K models: Both ADMIXTURE and sNMF assume that observed genotypes can be decomposed into K discrete ancestral populations. In reality, especially across Eurasia and along the Silk Road corridor, population history involves continuous gene flow, isolation-by-distance, and admixture clines rather than discrete founding events. This is why log-likelihood tends to improve indefinitely with increasing K, and why no single “true K” may exist. The practical interpretation is that K = 5 provides the most parsimonious descriptive resolution of ancestry variation for covariate adjustment, not that exactly 5 ancestral populations existed historically.

4. Population Structure (Interactive)

[Interactive chart: mean ancestry proportions per population, selectable by K.]

5. Ancestry Composition at K=5 (Most Parsimonious)

5.1 Component Assignment

At K=5 with the superpopulation-level reference, five continental ancestry components resolve cleanly:

Component	Dominant in	Interpretation
Q1	EAS (99.6%)	East Asian
Q2	AFR (95.6%)	African
Q3	AMR (43.0%)	Americas-specific (Native American-like)
Q4	SAS (89.8%)	South Asian
Q5	EUR (96.3%), UZB (48.8%)	European / West Eurasian

5.2 Uzbek Cohort Ancestry Breakdown (K=5)

Component	Ancestry	Uzbek Mean	Interpretation
Q5	European / West Eurasian	48.8%	Core West Eurasian ancestry (EUR + Central Asian)
Q1	East Asian	28.8%	Steppe / Turkic / Mongol heritage
Q4	South Asian	19.2%	Indo-Aryan substrate / trade contact
Q3	Americas-specific	2.9%	Shared Ancient North Eurasian ancestry
Q2	African	0.3%	Minimal — effectively absent

Central Asian complexity: With the superpopulation-level reference, the Uzbek cohort is ~49% West Eurasian, ~29% East Asian, and ~19% South Asian. At K=5 there is no dedicated Central Asian component, so Uzbek ancestry decomposes directly into continental sources; the Central Asian component that appears at higher K (Section 6.3) is itself a mixture of West Eurasian, East Asian, and South Asian ancestries. The ~3% AMR-like component likely reflects shared Ancient North Eurasian (ANE) ancestry.

6. Covariate Validation

Kruskal–Wallis tests assessed whether self-reported ethnicity or birthplace within the Uzbek cohort predicts any ADMIXTURE component (1,068 matched samples).

6.1 Ethnicity

K	Components tested	Any significant?
2	Q1, Q2	No (all ns)
3	Q1–Q3	No (all ns)
4	Q1–Q4	No (all ns)
5	Q1–Q5	No (all ns)
6	Q1–Q6	No (all ns)
7	Q1–Q7	No (all ns)
8	Q1–Q8	No (all ns)

Conclusion: Ethnicity is completely non-significant at every K and every component. ADMIXTURE components do not reflect ethnic self-identification.

6.2 Birthplace (Geographic Origin)

K	Most significant component	p-value	η²
2	Q1 / Q2	2.25 × 10⁻²²	0.073
3	Q3	1.47 × 10⁻²²	0.074
4	Q4	3.13 × 10⁻²²	0.074
5	Q3	8.83 × 10⁻²³	0.074
6	Q6	1.82 × 10⁻²²	0.073
7	Q6	3.24 × 10⁻²³	0.068
8	Q6	1.78 × 10⁻²³	0.063

Key finding: Even with 1000G references anchoring continental ancestry components, self-reported ethnicity remains non-significant across all K values, whereas birthplace (region within Uzbekistan) is highly significant at every K (p between ~10⁻²³ and ~10⁻²², η² ≈ 0.06–0.07). ADMIXTURE components therefore track geographic structure rather than ethnic self-identification, and the admixture cline appears to be geographically driven.
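For readers who want to reproduce this kind of test, the sketch below computes the Kruskal–Wallis H statistic (average ranks, standard tie correction) and an effect size η² = (H − k + 1) / (n − k). The η² formula is an assumption, since the source does not say which definition was used, and the group labels and values are toy data. The p-value is omitted to keep the example dependency-free; it follows from a χ² distribution with k − 1 degrees of freedom, which `scipy.stats.kruskal` reports directly.

```python
from collections import defaultdict

def kruskal_wallis(groups):
    """Return (H, eta_squared) for a dict {label: [values]}."""
    pooled = sorted((v, label) for label, vals in groups.items() for v in vals)
    n, k = len(pooled), len(groups)
    rank_sums = defaultdict(float)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # pooled[i:j] share one value (a tie block)
        avg_rank = (i + 1 + j) / 2      # mean of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        for _, label in pooled[i:j]:
            rank_sums[label] += avg_rank
        i = j
    h = 12 / (n * (n + 1)) * sum(
        rank_sums[g] ** 2 / len(vals) for g, vals in groups.items()
    ) - 3 * (n + 1)
    h /= 1 - tie_term / (n**3 - n)      # tie correction (equals 1 when no ties)
    return h, (h - k + 1) / (n - k)

# Toy data: one ancestry component in three hypothetical birthplace groups.
toy = {"region_A": [1, 2, 3], "region_B": [4, 5, 6], "region_C": [7, 8, 9]}
h, eta2 = kruskal_wallis(toy)
print(f"H = {h:.3f}, eta^2 = {eta2:.3f}")
```

For the toy data, which are perfectly separated by group, this gives H = 7.2 and η² ≈ 0.867.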

6.3 Geographic Cline at K=7

Mean East Asian component (Q6) by birthplace reveals the geographic gradient:

Region	N	Central Asian (Q2)	East Asian (Q6)	S. European (Q1)	Interpretation
Jizzakh	84	73.9%	19.6%	1.3%	Most Eastern-shifted
Andijan	18	71.6%	19.9%	2.9%	Eastern
Tashkent region	160	70.6%	15.5%	3.1%	Central
Tashkent city	461	69.5%	11.0%	4.8%	Central-Western
Karakalpakstan	42	79.6%	5.1%	7.2%	Western-shifted
Fergana	30	76.3%	2.7%	8.2%	Most Western-shifted

Geographic interpretation: The mean East Asian component ranges from ~2.7% (Fergana) to ~19.9% (Andijan), a 7.4-fold difference across Uzbekistan. Individuals born in Fergana region carry the highest Southern European component (8.2%), consistent with the region’s historical role as a Sogdian/Greek cultural hub. The Eastern-shifted regions (Jizzakh, Andijan) show stronger steppe/Turkic-Mongol introgression, consistent with proximity to the Kazakh steppe and historical nomadic corridors. The Andijan (N=18) and Fergana (N=30) means rest on small samples and are correspondingly less certain.

7. Key Findings & Biological Interpretation

Finding 1: At K=5 the Uzbek cohort is predominantly European (48.8%) and East Asian (28.8%), with a substantial South Asian (19.2%) component — a three-way admixture not captured by any single 1000G reference population.

Finding 2: East Asian ancestry (~28.8% mean) represents Turkic-Mongol steppe heritage, varying across regions. This variation is a potential confounder for disease association studies.

Finding 3: Western Eurasian admixture (EUR + SAS, ~68.0%) reflects the Silk Road Indo-European substrate with Sogdian and Indo-Aryan contributions, strongest in Fergana region.

Finding 4: Ethnicity is NOT a confounder: all 35 Kruskal–Wallis tests (K = 2–8, one test per component, 2 + 3 + … + 8 = 35) returned non-significant p-values.

Implication for RPL GWAS: ADMIXTURE ancestry proportions at K=5, which vary with birthplace, should be used as covariates in any genome-wide association study to control for population stratification. Ethnic labels are uninformative.
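When Q-values are used as regression covariates, one practical detail matters: each row of a Q-matrix sums to 1, so including all K columns alongside an intercept is perfectly collinear. The sketch below shows one way to build covariate rows from `.Q` lines while omitting one component. Which column to drop is an arbitrary choice made here for illustration, and the sample IDs and values are hypothetical.

```python
def q_covariates(q_lines, sample_ids, drop=-1):
    """Build (sample_id, covariates) rows from ADMIXTURE .Q lines.

    One component is omitted because the K proportions sum to 1 and
    would otherwise be collinear with the regression intercept.
    """
    rows = []
    for sid, line in zip(sample_ids, q_lines):
        q = [float(x) for x in line.split()]
        drop_index = drop % len(q)           # allow negative indices
        kept = [v for i, v in enumerate(q) if i != drop_index]
        rows.append((sid, kept))
    return rows

# Toy input: two hypothetical samples at K=5 (whitespace-delimited, as in .Q files).
q_lines = ["0.288 0.003 0.029 0.192 0.488",
           "0.100 0.000 0.050 0.250 0.600"]
covars = q_covariates(q_lines, ["UZB001", "UZB002"])
for sid, values in covars:
    print(sid, values)
```

In a real run the lines would come from the K=5 `.Q` file and the IDs from the matching `.fam` file, which share the same row order.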

8. Output Files

File	Description
`global_for_admixture.bed/bim/fam`	LD-pruned merged dataset (3,595 × 60,279; winter: 77,111)
`admix_results/K{2-8}.Q`	ADMIXTURE Q-matrices for each K
`global_pop_labels.txt`	Population and superpopulation labels
`validation/validation_results.json`	Covariate test results (JSON)
`validation/ethnicity_q_values.tsv`	Ethnicity × Q-value means per group
`validation/birthplace_q_values.tsv`	Birthplace × Q-value means per region
`validation/per_sample_covariates.tsv`	Per-sample Q-values with covariates
