Next Steps: Planned Analyses
Roadmap for Uzbek population genetics and pregnancy loss association study
Updated: March 20261. Current Status Summary
- QC pipeline: missingness → IBD dedup → SNP filtering → imputation → normalization → final QC
- PCA: Uzbek-internal + global (1000 Genomes) — Uzbek sits on EUR–EAS cline
- FST: UZB equidistant from SAS and EUR (0.014), then EAS (0.039)
- ADMIXTURE K=2–8: Uzbek-only K=2 optimal (CV monotonically increases); Global K=5–8 plateau (CV=0.294, with all 2,548 1000G)
- PBS multi-population analysis: 8 Uzbek-specific variants, all with PBS ≥ 0.3. See Step 12
- LD pruning: 5.41M → 88.7K independent SNPs for population genetics
- ROH & IBD: 36,702 ROH segments, median FROH=0.015. 6,368 related pairs (PI_HAT ≥ 0.05), 428 related (PI_HAT > 0.0884). See Step 15
CORE COMPLETE — Global ADMIXTURE complete; Evanno replicates running on DRAGEN
- Evanno ΔK: Replicate runs launched on DRAGEN: Global (3,595 × 77,111) + UZB-only (1,047 × 88,722), K=2–8 × 10 reps each. See Step 11 §3.1
- Global ADMIXTURE — COMPLETE. K=5–8 plateau (CV=0.294). 3,595 samples (2,548 1000G + 1,047 UZB). See Step 11
- sNMF validation — COMPLETE. K=2 Uzbek-only (monotonic CE increase, 0.425). See Step 11 §3.2–3.3
2. Uzbek Population Genetics Analyses
2.1. Evanno ΔK Correction (Evanno et al. 2005)
The Evanno method uses the second-order rate of change of the log-likelihood L(K) to find the “true” K. Instead of looking at CV error alone, it computes:
|L″(K)| = |L(K+1) − 2·L(K) + L(K−1)| (second derivative: where improvement plateaus)
ΔK = mean(|L″(K)|) / sd(L(K)) (normalized by variance across runs)
Uzbek-only ADMIXTURE log-likelihoods (single run per K):
| K | Log-likelihood L(K) (single run) | CV Error | L′(K) | |L″(K)| (not true ΔK) |
|---|---|---|---|---|
| 2 | -353,410,112 | 0.51335 | — | — |
| 3 | -352,954,902 | 0.51343 | +455,209 | 143,051 ← peak |
| 4 | -352,642,744 | 0.51579 | +312,158 | 5,464 |
| 5 | -352,336,050 | 0.51636 | +306,694 | 34,278 |
| 6 | -352,063,633 | 0.51682 | +272,417 | 10,310 |
| 7 | -351,780,907 | 0.51885 | +282,726 | 19,557 |
| 8 | -351,517,738 | 0.51974 | +263,169 | — |
The |L″(K)| peaks sharply at K=3 (143,051 vs. 5,464–34,278 for K=4–7). In the Evanno framework, this means the largest “elbow” in likelihood improvement occurs when going from K=2 to K=3 — the most informative split is the transition from 2 to 3 components.
Caveat — single-run limitation: The table above shows Uzbek-only log-likelihoods from a single ADMIXTURE run per K (not replicated). The proper Evanno ΔK = mean|L″(K)| / sd(L(K)) requires multiple replicates per K to compute the denominator; what is shown here is simply |L″(K)| from one run, which should be interpreted as an approximation. For the global dataset, the single-run |L″(K)| again peaks at K=3 (2,087,440). See Step 11 §3.1 for full results. CV error forms a plateau at K=5–8 with K=5 as most parsimonious.
Bottom line: CV error (minimum at K=2) and the Evanno second derivative (peak at K=3) give slightly different answers: CV favours K=2 as best-fitting, while the Evanno |L″(K)| peak at K=3 indicates the sharpest structural break occurs when adding a 3rd component. A natural reconciliation is that K=2 captures the dominant ancestry structure (Western vs. Eastern Eurasian), while K=3 adds a minor component that does not significantly improve model fit. For publication, state both K values explicitly rather than collapsing into a single answer.
2.2. Runs of Homozygosity (ROH)
✓ COMPLETE — See Step 15
ROH analysis completed on 1,047 post-QC Uzbek samples (5,405,898 SNPs). 36,702 ROH segments detected. Median FROH = 0.015. See Step 15 for full results.
IBD analysis (PLINK --genome, 1,047 post-QC samples) identified 6,368 related pairs (PI_HAT ≥ 0.05),
of which 428 are related at PI_HAT > 0.0884: 0 duplicates, 3 first-degree, 2 second-degree, and 423 third-degree relatives.
Clinical relevance: FROH will be used as a covariate in the pregnancy loss GWAS. Mixed linear models (BOLT-LMM/SAIGE) are essential given the extensive cryptic relatedness.
2.3. Covariate Validation of ADMIXTURE Components
ID mapping: genotype IDs (
PREFIX_realID) → phenotype IDs (realID) by stripping numeric prefix.
A. Ethnicity Distribution in Genotyped Samples
| Ethnicity | N | Mean Q1 (Western) | Mean Q2 (Eastern) | SD |
|---|---|---|---|---|
| Uzbek | 807 | 0.652 | 0.348 | 0.149 |
| Russian | 35 | 0.653 | 0.347 | 0.163 |
| Tajik | 31 | 0.660 | 0.340 | 0.126 |
| Tatar | 31 | 0.662 | 0.338 | 0.115 |
| Korean | 22 | 0.561 | 0.439 | 0.201 |
| Kazakh | 17 | 0.657 | 0.343 | 0.108 |
| Karakalpak | 5 | 0.689 | 0.311 | 0.154 |
| Unknown/NA | 51 | 0.632 | 0.368 | 0.189 |
| Other (n≤4 each) | 39 | Uyghur, Kyrgyz, Armenian, Ukrainian, Chinese, Afghan, etc. | ||
At K=2, all ethnic groups share the same two-component gradient — consistent with a shared admixture cline across Central Asian sub-populations.
However, at K=3: HIGHLY significant (H=68.88, p=6.67×10−8, η²=0.115) — the third component differentiates Koreans and Chinese (higher Q3) from the Uzbek/Tajik/Tatar core, matching the expected East Asian affinity.
B. Geographic Origin (Birthplace Region)
| Region | N | Q1 (Western) | Q2 (Eastern) |
|---|---|---|---|
| Fergana | 30 | 0.724 | 0.276 |
| Karakalpakstan | 42 | 0.704 | 0.296 |
| Tashkent city | 449 | 0.671 | 0.329 |
| Namangan | 18 | 0.661 | 0.339 |
| Samarkand | 49 | 0.656 | 0.345 |
| Kashkadarya | 57 | 0.642 | 0.358 |
| Navoi | 26 | 0.642 | 0.358 |
| Syrdarya | 30 | 0.629 | 0.371 |
| Tashkent region | 154 | 0.616 | 0.385 |
| Khorezm | 12 | 0.605 | 0.395 |
| Andijan | 18 | 0.572 | 0.428 |
| Jizzakh | 83 | 0.569 | 0.431 |
| Russia (born) | 3 | 0.966 | 0.034 |
The Q2 (Eastern) ancestry proportion varies from 27.6% (Fergana) to 43.1% (Jizzakh) — a 15.5 percentage-point spread across regions.
B2. Geographic Gradient: Historical Interpretation
The ancestry gradient across Uzbekistan’s regions aligns with known historical migration patterns along the Silk Road:
| Tier | Regions | Q1 (Western) | Q2 (Eastern) | Historical context |
|---|---|---|---|---|
| Most Western | Fergana (n=30) Karakalpakstan (n=42) |
72.4% 70.4% |
27.6% 29.6% |
Fergana Valley: one of Central Asia’s oldest continuously settled agricultural regions; deep Sogdian (Iranian) roots predating Turkic migrations. Ancient cities of Kokand and Margilan were major Silk Road hubs. Karakalpakstan: Aral Sea region; the individuals genotyped from here are predominantly Uzbek residents (not Karakalpak ethnic) who may trace to the settled Khorezm oasis civilization with strong Indo-Iranian substrate. |
| Central / Mixed | Tashkent city (n=449) Samarkand (n=49) Kashkadarya (n=57) Navoi (n=26) |
67.1% 65.6% 64.2% 64.2% |
32.9% 34.5% 35.8% 35.8% |
Tashkent: modern capital, cosmopolitan admixture. Samarkand: Sogdian heartland, but also seat of Timur’s Turco-Mongol empire. Kashkadarya & Navoi: southern and central steppe — balanced migration influence from both Iranian and Turkic populations. |
| Most Eastern | Andijan (n=18) Jizzakh (n=83) |
57.2% 56.9% |
42.8% 43.1% |
Andijan: easternmost Fergana Valley, bordering Kyrgyzstan — historically a corridor for Turkic and Mongol nomadic groups entering the valley. Jizzakh: steppe gateway between Tashkent and Samarkand, at the edge of the Hungry Steppe (Mirzachul) — a historically nomadic pastoralist region with greater Turkic/Mongol influence. The Jizakh Gate was a major pass for nomadic incursions. |
- 15.5% spread in Eastern ancestry (Q2: 27.6%–43.1%) across regions within a single country — a substantial intra-national genetic structure
- The gradient follows a “settled oasis vs. nomadic steppe” axis, not simply east–west geography: Fergana (far east, but settled) is the most “Western” genetically, while Jizzakh (geographically central, but steppe) is the most “Eastern”
- This distinction between geographic position and genetic ancestry reflects the different settlement histories: irrigated oasis cities retained more of the ancient Sogdian/Iranian substrate, while steppe regions absorbed more Turkic/Mongol gene flow
- η²=0.082 means birthplace explains ~8.2% of the variance in ancestry proportions — a modest but highly significant effect
- Publication-worthy: this regional genetic structure within Uzbekistan has not been previously characterized at this resolution in any dataset of this size
- GWAS implication: birthplace region should be included as a covariate (or use regional ancestry Q-values) to prevent confounding by this intra-Uzbek population structure
C. Concordance Summary (K=2)
| Group | Expected | Observed | Match? |
|---|---|---|---|
| Korean (n=22) | Higher Eastern | Q2=0.44 (highest) | ✓ Yes |
| Chinese (n=4) | Higher Eastern | Q2=0.52 | ✓ Yes |
| Russian (n=35) | Higher Western | Q1=0.65 (same as Uzbek) | ~ Weak |
| Tajik (n=31) | Higher Western (Indo-Iranian) | Q1=0.66 (same as Uzbek) | ~ Weak |
| Uzbek (n=807) | Mixed admixed | Q1=0.65, Q2=0.35 | ✓ Yes |
- At K=2, this cohort’s two components capture a geographic signal (birthplace) rather than ethnicity — consistent with a Silk Road admixture cline
- East Asian minorities (Korean, Chinese) do show higher Eastern ancestry, confirming the biological meaning of the Q2 component
- Russians/Tajiks are not distinguishable from Uzbeks at K=2 — because most are long-term residents who have admixed into the local gradient. The global ADMIXTURE with 1000G references (running separately) should separate them much more clearly
- The strong geographic signal supports using birthplace as a covariate in the GWAS model alongside ancestry PCs
D. Global ADMIXTURE with 1000 Genomes References
COMPLETE — March 2026
3,595 samples (1,047 Uzbek + 2,548 1000G Phase 3) × 77,111 LD-pruned SNPs:
| Superpopulation | Populations | N samples |
|---|---|---|
| AFR | YRI, LWK, GWD, MSL, ESN, ACB, ASW | 671 |
| EUR | CEU, GBR, FIN, IBS, TSI | 522 |
| SAS | GIH, PJL, BEB, STU, ITU | 492 |
| EAS | CHB, JPT, CHS, CDX, KHV | 515 |
| AMR | MXL, PUR, CLM, PEL | 348 |
| UZB | Uzbek cohort | 1,047 |
| Total | 3,595 |
Cross-Validation Error (CV)
| K | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|
| CV | 0.310 | 0.300 | 0.297 | 0.295 | 0.295 | 0.294 | 0.294 |
K=5–8 plateau (CV difference <0.1%). K=5 most parsimonious; K=8 nominal minimum (0.29422).
Global Covariate Validation
Re-ran validate_admixture_covariates.py --global on 1,007 matched Uzbek samples (of 1,047 genotyped; 96.2% match rate to phenotype covariates):
| Covariate | Test | K=2 | K=3 | K=4 | K=5 | K=6 | K=7 | K=8 |
|---|---|---|---|---|---|---|---|---|
| Ethnicity | Kruskal–Wallis | ns | ns | ns | ns | ns | ns | ns |
| Birthplace | Kruskal–Wallis | *** | *** | *** | *** | *** | *** | *** |
See Step 11 for full interactive visualization of global ADMIXTURE results.
2.4. Local Ancestry Inference
MEDIUM PRIORITY — Enables ancestry-specific association testing
ADMIXTURE gives genome-wide ancestry proportions. Local ancestry inference (RFMix, LAMP-LD, or ELAI) estimates ancestry at each chromosomal position: “this chunk is Western, that chunk is Eastern.”
Why it matters:
- A SNP’s effect on pregnancy loss may depend on its local ancestry background
- Enables admixture mapping: test whether cases have more Western or Eastern ancestry at specific loci
- Provides a chromosome-level “painting” of each individual’s genome
Requirements: Phased data (from Step 4 BEAGLE output) + reference panels (EUR and EAS from 1000 Genomes)
2.5. IBD Analysis Stratified by Ancestry
✓ COMPLETE — See Step 15
IBD analysis (PLINK --genome) completed on post-QC dataset (1,047 samples, 88,722 LD-pruned SNPs; 547,581 pairwise comparisons).
- 6,368 related pairs (PI_HAT ≥ 0.05); 428 at PI_HAT > 0.0884: 0 duplicates, 3 first-degree, 2 second-degree, 423 third-degree
- PI_HAT distribution: heavy tail reveals cryptic relatedness consistent with founder effects
- Cross-reference with ADMIXTURE K=2 (Q1 mean=0.650, Q2 mean=0.350) shows FROH is slightly elevated at ancestry extremes, consistent with subgroup endogamy
Remaining:
- IBDNe: reconstruct Ne trajectory over time — Silk Road expansion, Mongol bottleneck, Soviet-era recovery
- Segment-level IBD: hap-IBD or GERMLINE for haplotype-level analysis
2.6. Gene Annotation of PBS Candidates
COMPLETE — Ensembl VEP annotation retrieved
All 490 Uzbek-specific SNPs annotated via Ensembl VEP REST API (27 fields per SNP). Results in Step 12:
- 350/490 SNPs mapped to 264 unique genes
- 5 missense variants (4 potentially damaging by SIFT/PolyPhen): SPI1, TNXB, SLC6A2, ATL2
- HLA/MHC region over-represented: 135 chr6 SNPs (33.7%)
- GWAS Catalog + ClinVar + GTEx eQTL cross-referenced
Remaining:
- Pathway enrichment: KEGG/GO/Reactome — do Uzbek-specific variants cluster in immune, metabolic, or reproductive pathways?
- eQTL deep dive: GTEx tissue-specific expression in uterus, placenta, blood
2.7. Extended Haplotype Homozygosity (iHS/nSL)
LOWER PRIORITY — Confirms positive selection signals
For top PBS regions, compute iHS (integrated haplotype score) to independently confirm recent positive selection (vs. drift):
- Requires phased data (BEAGLE, Step 4)
- Software:
selscanfor iHS and nSL computation - A SNP with both high PBS and extreme iHS is a strong selection candidate
2.8. DAPC (Discriminant Analysis of Principal Components)
DAPC (Jombart et al., 2010) is a model-free alternative to ADMIXTURE that combines PCA with Linear Discriminant Analysis. Unlike ADMIXTURE (which assumes Hardy–Weinberg equilibrium within ancestral populations), DAPC makes no assumptions about the underlying population model.
How it works:
- Run PCA to reduce dimensionality (retain ~40 PCs covering ≥80% variance)
- Use K-means clustering on the PCs to find groups (using BIC to select K)
- Run Discriminant Analysis on the PCs to maximize between-group separation
Advantages over ADMIXTURE:
| Feature | ADMIXTURE | DAPC |
|---|---|---|
| Model | Parametric (HWE assumption) | Non-parametric (no HWE assumption) |
| Speed | Hours per K | Minutes total |
| Cluster assignment | Soft (ancestry proportions) | Hard (group membership) + posterior probabilities |
| K selection | CV error or ΔK | BIC curve |
| Best for | Admixed populations | Discrete clusters |
Implementation:
2.9. FST Graphical Visualization
Several ways to visualize the FST matrix (from Step 9; see Step 14 for the full interactive version):
A. Heatmap
The pairwise FST matrix as a color-coded heatmap:
B. Neighbor-Joining Tree
Convert FST to distances and build a neighbor-joining (NJ) tree using the ape package in R. This shows the phylogenetic relationships:
C. Multi-Dimensional Scaling (MDS)
Project the FST distance matrix into 2D space:
3. ADMIXTURE Bar Plots: Continuous or Divisible?
Short answer: Our data is a continuous admixture cline — not discrete clusters.
Here’s why:
- K=2 structure: Most individuals fall on a smooth gradient from ~100% Western to ~100% Eastern, with the majority around 65/35. There are no large “gaps” in the distribution that would indicate distinct groups.
- This is expected for Central Asia: The Uzbek population formed through continuous gene flow between Indo-Iranian (Western) and Turkic/Mongol (Eastern) populations along the Silk Road. Unlike, say, an African-American + European dataset (which would show a bimodal distribution), the Uzbek individuals represent a true admixture cline.
However, there are some ways to create meaningful vertical divisions:
| Method | How | Biological meaning |
|---|---|---|
| Outlier groups | Mark individuals with >95% single component | “Nearly pure” Eastern or Western individuals (likely ethnic minorities in cohort: Tajik/Russian vs. Kazakh/Kyrgyz) |
| Quartile split | Divide into 4 bins by K=2 Component 1 proportion | Arbitrary but useful for stratified analysis (e.g., compare pregnancy loss rates across ancestry quartiles) |
| K-means on Q | Cluster individuals by ancestry proportions | Data-driven grouping, but may not produce cleanly separated clusters if the cline is smooth |
| Self-reported ethnicity | If available in phenotype data, overlay ethnic labels | Most informative — shows whether genetic clusters match self-identification |
4. Pregnancy Loss Association Study
4.1. Phenotype–Genotype ID Mapping
IMMEDIATE BLOCKER
The phenotype CSV (1,815 rows × 246 columns, from the “GWAS ot 27.08” file) uses a different sample ID scheme than the genotype files. Before any association testing, we must:
- Map phenotype IDs ↔ genotype IDs (the genotype IDs are like
2_01-02,12_02-81) - Verify and fix the reportedly inverted case/control labels
- Determine which of the 246 phenotype columns are relevant covariates
4.2. Case/Control Definition
HIGH PRIORITY
Preliminary criteria for separating cases from controls:
| Cases | Controls | |
|---|---|---|
| Primary criterion | ≥2 pregnancy losses (recurrent) | ≥1 live birth, 0 losses |
| Strict criterion | ≥3 consecutive losses (RPL by ESHRE definition) | ≥2 live births, 0 losses |
Sub-phenotypes to consider:
- Early loss (<12 weeks) — most common, often chromosomal/hormonal
- Late loss (13–20 weeks) — often structural/immunological
- Stillbirth (>20 weeks) — placental/vascular causes
- Recurrent early vs. single late — different genetic architectures
4.3. Covariates for Association Testing
HIGH PRIORITY — Must include in GWAS model
| Category | Covariate | Rationale |
|---|---|---|
| Population structure | K=2 ancestry Q1 (continuous) | Primary admixture axis; prevents false positives from population stratification |
| PC1–PC3 from PCA | Captures residual structure not modeled by ADMIXTURE | |
| Demographic | Maternal age at loss/birth | Strong risk factor; aneuploidy risk increases exponentially after 35 |
| BMI | Obesity associated with miscarriage (OR ~1.3–1.7) | |
| Parity (gravidity) | More pregnancies = more opportunities for loss | |
| Clinical | Thrombophilia history | Factor V Leiden, antiphospholipid syndrome → placental thrombosis |
| Thyroid disorders | Both hypo- and hyperthyroidism increase loss risk | |
| Diabetes (pre-existing or gestational) | Uncontrolled glucose → embryotoxic | |
| Genetic | FROH (inbreeding coefficient) | Homozygosity burden — especially in a population with consanguinity |
| Total CNV burden | If available from SNP array intensity data |
4.4. Association Testing Framework
AFTER ID MAPPING
Model:
Two-tier testing strategy:
| Tier | SNP set | Threshold | Rationale |
|---|---|---|---|
| 1. Targeted | 8 PBS candidates | p < 0.00625 (Bonferroni for 8) | Hypothesis: population-specific variants may affect fitness-related phenotypes |
| 2. Genome-wide | All QC’d SNPs | p < 5 × 10−8 (standard GWAS) | Hypothesis-free scan; requires large sample size for power |
Command:
4.5. Known Pregnancy Loss Gene Screen
QUICK WIN — Can be done immediately
Check allele frequencies of known recurrent pregnancy loss (RPL) variants in the Uzbek cohort vs. published European frequencies:
| Gene | Variant | EUR frequency | RPL association |
|---|---|---|---|
| F5 (Factor V) | rs6025 (Leiden) | ~3–5% | OR 2.0–3.5 for RPL |
| F2 (Prothrombin) | rs1799963 (G20210A) | ~1–3% | OR 2.0–2.5 |
| MTHFR | rs1801133 (C677T) | ~25–35% | OR 1.2–1.5 (homozygous) |
| SERPINE1 (PAI-1) | rs1799889 (4G/5G) | ~50% | OR 1.3–1.5 |
| HLA-G | Various | Variable | Immune tolerance at maternal–fetal interface |
| SYCP3 | Rare | <1% | Meiotic segregation errors |
What’s unknown: The frequencies of these variants in Central Asian populations are poorly characterized. Even a simple frequency comparison is publication-worthy.
4.6. Ancestry-Specific Risk Analysis
EXPLORATORY
Tests unique to admixed populations:
- Dose-response: Does more Eastern/Western ancestry proportion correlate with pregnancy loss risk?
- Interaction: Does a SNP’s effect differ depending on the individual’s global ancestry?
- Admixture mapping: Using local ancestry, test whether pregnancy loss cases have excess Western or Eastern ancestry at specific chromosomal regions
5. Priority Roadmap
| # | Task | Priority | Depends on | Time estimate |
|---|---|---|---|---|
| 1 | Covariate validation of ADMIXTURE | DONE | — | Complete |
| 2 | Phenotype–genotype ID mapping | IMMEDIATE | — | 1–2 days |
| 3 | Case/control definition | IMMEDIATE | #2 | 1 day |
| 4 | ROH analysis (FROH) | DONE | — | Complete (Step 15) |
| 5 | Covariate preparation | HIGH | #2, #4 | 1 day |
| 6 | GWAS (targeted 8 PBS + genome-wide) | HIGH | #3, #5 | 1–2 days |
| 7 | Known RPL variant screening | HIGH | — | 2 hours |
| 8 | Gene annotation (Ensembl VEP) | DONE | — | Complete (Step 12) |
| 9 | IBD stratified by ancestry | DONE | — | Complete (Step 15) |
| 10 | DAPC analysis | MEDIUM | — | 0.5 day |
| 11 | Evanno ΔK replicates (sNMF complete) | IN PROGRESS | — | Replicates running on DRAGEN |
| 12 | Global ADMIXTURE with 1000G | DONE | K=5–8 plateau (CV=0.294) | Complete |
| 13 | Local ancestry inference | LATER | — | 2–3 days |
| 14 | iHS/nSL selection scans | LATER | — | 1–2 days |
| 15 | FST heatmap & MDS visualization | DONE | — | Complete (Step 14) |