Next Steps: Planned Analyses

1. Current Status Summary

Completed (Steps 1–15):

QC pipeline: missingness → IBD dedup → SNP filtering → imputation → normalization → final QC
PCA: Uzbek-internal + global (1000 Genomes) — Uzbek sits on EUR–EAS cline
F_ST: UZB equidistant from SAS and EUR (0.014), then EAS (0.039)
ADMIXTURE K=2–8: Uzbek-only K=2 optimal (CV monotonically increases); Global K=5–8 plateau (CV=0.294, with all 2,548 1000G)
PBS multi-population analysis: 8 Uzbek-specific variants, all with PBS ≥ 0.3. See Step 12
LD pruning: 5.41M → 88.7K independent SNPs for population genetics
ROH & IBD: 36,702 ROH segments, median F_ROH=0.015. 6,368 related pairs (PI_HAT ≥ 0.05), 428 related (PI_HAT > 0.0884). See Step 15

CORE COMPLETE — Global ADMIXTURE complete; Evanno replicates running on DRAGEN

Evanno ΔK: Replicate runs launched on DRAGEN: Global (3,595 × 77,111) + UZB-only (1,047 × 88,722), K=2–8 × 10 reps each. See Step 11 §3.1
Global ADMIXTURE — COMPLETE. K=5–8 plateau (CV=0.294). 3,595 samples (2,548 1000G + 1,047 UZB). See Step 11
sNMF validation — COMPLETE. K=2 Uzbek-only (monotonic CE increase, 0.425). See Step 11 §3.2–3.3

2. Uzbek Population Genetics Analyses

2.1. Evanno ΔK Correction (Evanno et al. 2005)

Reference: Evanno G, Regnaut S, Goudet J (2005). Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Molecular Ecology, 14(8), 2611–2620. doi:10.1111/j.1365-294X.2005.02553.x

The Evanno method uses the second-order rate of change of the log-likelihood L(K) to find the “true” K. Instead of looking at CV error alone, it computes:

L′(K) = L(K) − L(K−1)     (first derivative: rate of improvement)
|L″(K)| = |L(K+1) − 2·L(K) + L(K−1)|     (second derivative: where improvement plateaus)
ΔK = mean(|L″(K)|) / sd(L(K))     (normalized by variance across runs)

Uzbek-only ADMIXTURE log-likelihoods (single run per K):

K	Log-likelihood L(K) (single run)	CV Error	L′(K)	\|L″(K)\| (not true ΔK)
2	-353,410,112	0.51335	—	—
3	-352,954,902	0.51343	+455,209	143,051 ← peak
4	-352,642,744	0.51579	+312,158	5,464
5	-352,336,050	0.51636	+306,694	34,278
6	-352,063,633	0.51682	+272,417	10,310
7	-351,780,907	0.51885	+282,726	19,557
8	-351,517,738	0.51974	+263,169	—

Interpretation:

The |L″(K)| peaks sharply at K=3 (143,051 vs. 5,464–34,278 for K=4–7). In the Evanno framework, this means the largest “elbow” in likelihood improvement occurs when going from K=2 to K=3 — the most informative split is the transition from 2 to 3 components.

Caveat — single-run limitation: The table above shows Uzbek-only log-likelihoods from a single ADMIXTURE run per K (not replicated). The proper Evanno ΔK = mean|L″(K)| / sd(L(K)) requires multiple replicates per K to compute the denominator; what is shown here is simply |L″(K)| from one run, which should be interpreted as an approximation. For the global dataset, the single-run |L″(K)| again peaks at K=3 (2,087,440). See Step 11 §3.1 for full results. CV error forms a plateau at K=5–8 with K=5 as most parsimonious.

Bottom line: CV error (minimum at K=2) and the Evanno second derivative (peak at K=3) give slightly different answers: CV favours K=2 as best-fitting, while the Evanno |L″(K)| peak at K=3 indicates the sharpest structural break occurs when adding a 3rd component. A natural reconciliation is that K=2 captures the dominant ancestry structure (Western vs. Eastern Eurasian), while K=3 adds a minor component that does not significantly improve model fit. For publication, state both K values explicitly rather than collapsing into a single answer.

2.2. Runs of Homozygosity (ROH)

✓ COMPLETE — See Step 15

ROH analysis completed on 1,047 post-QC Uzbek samples (5,405,898 SNPs). 36,702 ROH segments detected. Median F_ROH = 0.015. See Step 15 for full results.

IBD analysis (PLINK --genome, 1,047 post-QC samples) identified 6,368 related pairs (PI_HAT ≥ 0.05), of which 428 are related at PI_HAT > 0.0884: 0 duplicates, 3 first-degree, 2 second-degree, and 423 third-degree relatives.

Clinical relevance: F_ROH will be used as a covariate in the pregnancy loss GWAS. Mixed linear models (BOLT-LMM/SAIGE) are essential given the extensive cryptic relatedness.

2.3. Covariate Validation of ADMIXTURE Components

Question: Do ADMIXTURE ancestry components correlate with self-reported ethnicity and geographic origin? This validates whether the inferred genetic structure reflects known demographic patterns.

✓ Validation Complete — 1,007 / 1,047 genotyped samples matched to phenotype covariates (96.2% match rate).
ID mapping: genotype IDs (PREFIX_realID) → phenotype IDs (realID) by stripping numeric prefix.

A. Ethnicity Distribution in Genotyped Samples

Ethnicity	N	Mean Q1 (Western)	Mean Q2 (Eastern)	SD
Uzbek	807	0.652	0.348	0.149
Russian	35	0.653	0.347	0.163
Tajik	31	0.660	0.340	0.126
Tatar	31	0.662	0.338	0.115
Korean	22	0.561	0.439	0.201
Kazakh	17	0.657	0.343	0.108
Karakalpak	5	0.689	0.311	0.154
Unknown/NA	51	0.632	0.368	0.189
Other (n≤4 each)	39	Uyghur, Kyrgyz, Armenian, Ukrainian, Chinese, Afghan, etc.

K=2 by ethnicity: NOT significant (Kruskal-Wallis H=14.42, p=0.345, η²=0.020).
At K=2, all ethnic groups share the same two-component gradient — consistent with a shared admixture cline across Central Asian sub-populations.

However, at K=3: HIGHLY significant (H=68.88, p=6.67×10⁻⁸, η²=0.115) — the third component differentiates Koreans and Chinese (higher Q3) from the Uzbek/Tajik/Tatar core, matching the expected East Asian affinity.

B. Geographic Origin (Birthplace Region)

Region	N	Q1 (Western)	Q2 (Eastern)
Fergana	30	0.724	0.276
Karakalpakstan	42	0.704	0.296
Tashkent city	449	0.671	0.329
Namangan	18	0.661	0.339
Samarkand	49	0.656	0.345
Kashkadarya	57	0.642	0.358
Navoi	26	0.642	0.358
Syrdarya	30	0.629	0.371
Tashkent region	154	0.616	0.385
Khorezm	12	0.605	0.395
Andijan	18	0.572	0.428
Jizzakh	83	0.569	0.431
Russia (born)	3	0.966	0.034

K=2 by birthplace: HIGHLY significant (Kruskal-Wallis H=145.41, p=1.93×10⁻²², η²=0.082).
The Q2 (Eastern) ancestry proportion varies from 27.6% (Fergana) to 43.1% (Jizzakh) — a 15.5 percentage-point spread across regions.

B2. Geographic Gradient: Historical Interpretation

The ancestry gradient across Uzbekistan’s regions aligns with known historical migration patterns along the Silk Road:

Tier	Regions	Q1 (Western)	Q2 (Eastern)	Historical context
Most Western	Fergana (n=30) Karakalpakstan (n=42)	72.4% 70.4%	27.6% 29.6%	Fergana Valley: one of Central Asia’s oldest continuously settled agricultural regions; deep Sogdian (Iranian) roots predating Turkic migrations. Ancient cities of Kokand and Margilan were major Silk Road hubs. Karakalpakstan: Aral Sea region; the individuals genotyped from here are predominantly Uzbek residents (not Karakalpak ethnic) who may trace to the settled Khorezm oasis civilization with strong Indo-Iranian substrate.
Central / Mixed	Tashkent city (n=449) Samarkand (n=49) Kashkadarya (n=57) Navoi (n=26)	67.1% 65.6% 64.2% 64.2%	32.9% 34.5% 35.8% 35.8%	Tashkent: modern capital, cosmopolitan admixture. Samarkand: Sogdian heartland, but also seat of Timur’s Turco-Mongol empire. Kashkadarya & Navoi: southern and central steppe — balanced migration influence from both Iranian and Turkic populations.
Most Eastern	Andijan (n=18) Jizzakh (n=83)	57.2% 56.9%	42.8% 43.1%	Andijan: easternmost Fergana Valley, bordering Kyrgyzstan — historically a corridor for Turkic and Mongol nomadic groups entering the valley. Jizzakh: steppe gateway between Tashkent and Samarkand, at the edge of the Hungry Steppe (Mirzachul) — a historically nomadic pastoralist region with greater Turkic/Mongol influence. The Jizakh Gate was a major pass for nomadic incursions.

Key insights:

15.5% spread in Eastern ancestry (Q2: 27.6%–43.1%) across regions within a single country — a substantial intra-national genetic structure
The gradient follows a “settled oasis vs. nomadic steppe” axis, not simply east–west geography: Fergana (far east, but settled) is the most “Western” genetically, while Jizzakh (geographically central, but steppe) is the most “Eastern”
This distinction between geographic position and genetic ancestry reflects the different settlement histories: irrigated oasis cities retained more of the ancient Sogdian/Iranian substrate, while steppe regions absorbed more Turkic/Mongol gene flow
η²=0.082 means birthplace explains ~8.2% of the variance in ancestry proportions — a modest but highly significant effect
Publication-worthy: this regional genetic structure within Uzbekistan has not been previously characterized at this resolution in any dataset of this size
GWAS implication: birthplace region should be included as a covariate (or use regional ancestry Q-values) to prevent confounding by this intra-Uzbek population structure

C. Concordance Summary (K=2)

Group	Expected	Observed	Match?
Korean (n=22)	Higher Eastern	Q2=0.44 (highest)	✓ Yes
Chinese (n=4)	Higher Eastern	Q2=0.52	✓ Yes
Russian (n=35)	Higher Western	Q1=0.65 (same as Uzbek)	~ Weak
Tajik (n=31)	Higher Western (Indo-Iranian)	Q1=0.66 (same as Uzbek)	~ Weak
Uzbek (n=807)	Mixed admixed	Q1=0.65, Q2=0.35	✓ Yes

Interpretation:

At K=2, this cohort’s two components capture a geographic signal (birthplace) rather than ethnicity — consistent with a Silk Road admixture cline
East Asian minorities (Korean, Chinese) do show higher Eastern ancestry, confirming the biological meaning of the Q2 component
Russians/Tajiks are not distinguishable from Uzbeks at K=2 — because most are long-term residents who have admixed into the local gradient. The global ADMIXTURE with 1000G references (running separately) should separate them much more clearly
The strong geographic signal supports using birthplace as a covariate in the GWAS model alongside ancestry PCs

D. Global ADMIXTURE with 1000 Genomes References

COMPLETE — March 2026

3,595 samples (1,047 Uzbek + 2,548 1000G Phase 3) × 77,111 LD-pruned SNPs:

Superpopulation	Populations	N samples
AFR	YRI, LWK, GWD, MSL, ESN, ACB, ASW	671
EUR	CEU, GBR, FIN, IBS, TSI	522
SAS	GIH, PJL, BEB, STU, ITU	492
EAS	CHB, JPT, CHS, CDX, KHV	515
AMR	MXL, PUR, CLM, PEL	348
UZB	Uzbek cohort	1,047
	Total	3,595

Cross-Validation Error (CV)

K	2	3	4	5	6	7	8
CV	0.310	0.300	0.297	0.295	0.295	0.294	0.294

K=5–8 plateau (CV difference <0.1%). K=5 most parsimonious; K=8 nominal minimum (0.29422).

Global Covariate Validation

Re-ran validate_admixture_covariates.py --global on 1,007 matched Uzbek samples (of 1,047 genotyped; 96.2% match rate to phenotype covariates):

Covariate	Test	K=2	K=3	K=4	K=5	K=6	K=7	K=8
Ethnicity	Kruskal–Wallis	ns	ns	ns	ns	ns	ns	ns
Birthplace	Kruskal–Wallis	***	***	***	***	***	***	***

Key finding: Even with 1000G references anchoring continental ancestry components, self-reported ethnicity remains completely non-significant across ALL K values and ALL components. Birthplace remains highly significant (p~10⁻²²) at every K. This definitively confirms that ADMIXTURE components capture geographic structure within Uzbekistan, not ethnic self-identification. The admixture cline is genuine and geographically driven.

See Step 11 for full interactive visualization of global ADMIXTURE results.

2.4. Local Ancestry Inference

MEDIUM PRIORITY — Enables ancestry-specific association testing

ADMIXTURE gives genome-wide ancestry proportions. Local ancestry inference (RFMix, LAMP-LD, or ELAI) estimates ancestry at each chromosomal position: “this chunk is Western, that chunk is Eastern.”

Why it matters:

A SNP’s effect on pregnancy loss may depend on its local ancestry background
Enables admixture mapping: test whether cases have more Western or Eastern ancestry at specific loci
Provides a chromosome-level “painting” of each individual’s genome

Requirements: Phased data (from Step 4 BEAGLE output) + reference panels (EUR and EAS from 1000 Genomes)

2.5. IBD Analysis Stratified by Ancestry

✓ COMPLETE — See Step 15

IBD analysis (PLINK --genome) completed on post-QC dataset (1,047 samples, 88,722 LD-pruned SNPs; 547,581 pairwise comparisons).

6,368 related pairs (PI_HAT ≥ 0.05); 428 at PI_HAT > 0.0884: 0 duplicates, 3 first-degree, 2 second-degree, 423 third-degree
PI_HAT distribution: heavy tail reveals cryptic relatedness consistent with founder effects
Cross-reference with ADMIXTURE K=2 (Q1 mean=0.650, Q2 mean=0.350) shows F_ROH is slightly elevated at ancestry extremes, consistent with subgroup endogamy

Remaining:

IBDNe: reconstruct N_e trajectory over time — Silk Road expansion, Mongol bottleneck, Soviet-era recovery
Segment-level IBD: hap-IBD or GERMLINE for haplotype-level analysis

2.6. Gene Annotation of PBS Candidates

COMPLETE — Ensembl VEP annotation retrieved

⚠ Note: The annotation below is from the initial run (490 candidates). PBS recomputation reduced the candidate set to 8 variants (see Step 12). Re-annotation of these 8 variants has not yet been performed.

All 490 Uzbek-specific SNPs annotated via Ensembl VEP REST API (27 fields per SNP). Results in Step 12:

350/490 SNPs mapped to 264 unique genes
5 missense variants (4 potentially damaging by SIFT/PolyPhen): SPI1, TNXB, SLC6A2, ATL2
HLA/MHC region over-represented: 135 chr6 SNPs (33.7%)
GWAS Catalog + ClinVar + GTEx eQTL cross-referenced

Remaining:

Pathway enrichment: KEGG/GO/Reactome — do Uzbek-specific variants cluster in immune, metabolic, or reproductive pathways?
eQTL deep dive: GTEx tissue-specific expression in uterus, placenta, blood

2.7. Extended Haplotype Homozygosity (iHS/nSL)

LOWER PRIORITY — Confirms positive selection signals

For top PBS regions, compute iHS (integrated haplotype score) to independently confirm recent positive selection (vs. drift):

Requires phased data (BEAGLE, Step 4)
Software: selscan for iHS and nSL computation
A SNP with both high PBS and extreme iHS is a strong selection candidate

2.8. DAPC (Discriminant Analysis of Principal Components)

What is DAPC?

DAPC (Jombart et al., 2010) is a model-free alternative to ADMIXTURE that combines PCA with Linear Discriminant Analysis. Unlike ADMIXTURE (which assumes Hardy–Weinberg equilibrium within ancestral populations), DAPC makes no assumptions about the underlying population model.

How it works:

Run PCA to reduce dimensionality (retain ~40 PCs covering ≥80% variance)
Use K-means clustering on the PCs to find groups (using BIC to select K)
Run Discriminant Analysis on the PCs to maximize between-group separation

Advantages over ADMIXTURE:

Feature	ADMIXTURE	DAPC
Model	Parametric (HWE assumption)	Non-parametric (no HWE assumption)
Speed	Hours per K	Minutes total
Cluster assignment	Soft (ancestry proportions)	Hard (group membership) + posterior probabilities
K selection	CV error or ΔK	BIC curve
Best for	Admixed populations	Discrete clusters

Implementation:

# R code using adegenet package
library(adegenet)
library(vcfR)

# Load data
vcf <- read.vcfR("UZB_pruned.vcf.gz")
gl <- vcfR2genlight(vcf)

# Find optimal K using BIC
grp <- find.clusters(gl, max.n.clust=10, n.pca=100)

# Run DAPC
dapc_result <- dapc(gl, grp$grp, n.pca=40, n.da=5)

# Plot
scatter(dapc_result, posi.da="bottomright")
compoplot(dapc_result)  # ancestry-like bar plot

For our data: Since the Uzbek cohort is admixed (not discrete clusters), DAPC may show a continuous cline rather than distinct groups — similar to what ADMIXTURE K=2 shows. DAPC is most informative as a complement to ADMIXTURE: if both methods agree on 2 components, this is robust evidence. If DAPC finds more structure, it may reflect LD patterns that ADMIXTURE misses.

2.9. F_ST Graphical Visualization

Several ways to visualize the F_ST matrix (from Step 9; see Step 14 for the full interactive version):

A. Heatmap

The pairwise F_ST matrix as a color-coded heatmap:

B. Neighbor-Joining Tree

Convert F_ST to distances and build a neighbor-joining (NJ) tree using the ape package in R. This shows the phylogenetic relationships:

# R code
library(ape)
fst_matrix <- matrix(c(
  0.0000, 0.0144, 0.0145, 0.0393, 0.1293,
  0.0144, 0.0000, 0.0310, 0.0564, 0.1282,
  0.0145, 0.0310, 0.0000, 0.0845, 0.1393,
  0.0393, 0.0564, 0.0845, 0.0000, 0.1650,
  0.1293, 0.1282, 0.1393, 0.1650, 0.0000
), nrow=5, dimnames=list(
  c("UZB","SAS","EUR","EAS","AFR"),
  c("UZB","SAS","EUR","EAS","AFR")))
tree <- nj(as.dist(fst_matrix))
plot(tree, type="unrooted", main="NJ Tree from Fst")

C. Multi-Dimensional Scaling (MDS)

Project the F_ST distance matrix into 2D space:

# R code
mds <- cmdscale(as.dist(fst_matrix), k=2)
plot(mds, pch=19, cex=2, xlab="Dim 1", ylab="Dim 2",
     main="MDS from Fst distances")
text(mds, labels=rownames(mds), pos=3)

3. ADMIXTURE Bar Plots: Continuous or Divisible?

Question: Can we divide the ADMIXTURE bar plots into discrete vertical groups, or is the data a continuous gradient?

Short answer: Our data is a continuous admixture cline — not discrete clusters.

Here’s why:

K=2 structure: Most individuals fall on a smooth gradient from ~100% Western to ~100% Eastern, with the majority around 65/35. There are no large “gaps” in the distribution that would indicate distinct groups.
This is expected for Central Asia: The Uzbek population formed through continuous gene flow between Indo-Iranian (Western) and Turkic/Mongol (Eastern) populations along the Silk Road. Unlike, say, an African-American + European dataset (which would show a bimodal distribution), the Uzbek individuals represent a true admixture cline.

However, there are some ways to create meaningful vertical divisions:

Method	How	Biological meaning
Outlier groups	Mark individuals with >95% single component	“Nearly pure” Eastern or Western individuals (likely ethnic minorities in cohort: Tajik/Russian vs. Kazakh/Kyrgyz)
Quartile split	Divide into 4 bins by K=2 Component 1 proportion	Arbitrary but useful for stratified analysis (e.g., compare pregnancy loss rates across ancestry quartiles)
K-means on Q	Cluster individuals by ancestry proportions	Data-driven grouping, but may not produce cleanly separated clusters if the cline is smooth
Self-reported ethnicity	If available in phenotype data, overlay ethnic labels	Most informative — shows whether genetic clusters match self-identification

Recommendation: For the pregnancy loss GWAS, do not split into discrete groups. Instead, use the continuous K=2 ancestry proportion (Q1) as a covariate in logistic regression. This preserves all information and avoids arbitrary cutoffs. If you want visual grouping, overlay self-reported ethnicity labels (if available in the phenotype data) on the sorted ADMIXTURE plot.

4. Pregnancy Loss Association Study

4.1. Phenotype–Genotype ID Mapping

IMMEDIATE BLOCKER

The phenotype CSV (1,815 rows × 246 columns, from the “GWAS ot 27.08” file) uses a different sample ID scheme than the genotype files. Before any association testing, we must:

Map phenotype IDs ↔ genotype IDs (the genotype IDs are like 2_01-02, 12_02-81)
Verify and fix the reportedly inverted case/control labels
Determine which of the 246 phenotype columns are relevant covariates

4.2. Case/Control Definition

HIGH PRIORITY

Preliminary criteria for separating cases from controls:

	Cases	Controls
Primary criterion	≥2 pregnancy losses (recurrent)	≥1 live birth, 0 losses
Strict criterion	≥3 consecutive losses (RPL by ESHRE definition)	≥2 live births, 0 losses

Sub-phenotypes to consider:

Early loss (<12 weeks) — most common, often chromosomal/hormonal
Late loss (13–20 weeks) — often structural/immunological
Stillbirth (>20 weeks) — placental/vascular causes
Recurrent early vs. single late — different genetic architectures

4.3. Covariates for Association Testing

HIGH PRIORITY — Must include in GWAS model

Category	Covariate	Rationale
Population structure	K=2 ancestry Q1 (continuous)	Primary admixture axis; prevents false positives from population stratification
Population structure	PC1–PC3 from PCA	Captures residual structure not modeled by ADMIXTURE
Demographic	Maternal age at loss/birth	Strong risk factor; aneuploidy risk increases exponentially after 35
	BMI	Obesity associated with miscarriage (OR ~1.3–1.7)
	Parity (gravidity)	More pregnancies = more opportunities for loss
Clinical	Thrombophilia history	Factor V Leiden, antiphospholipid syndrome → placental thrombosis
	Thyroid disorders	Both hypo- and hyperthyroidism increase loss risk
	Diabetes (pre-existing or gestational)	Uncontrolled glucose → embryotoxic
Genetic	F_ROH (inbreeding coefficient)	Homozygosity burden — especially in a population with consanguinity
Genetic	Total CNV burden	If available from SNP array intensity data

4.4. Association Testing Framework

AFTER ID MAPPING

Model:

logit(P[case]) = β₀ + β_SNP·genotype + β_Q1·ancestry + β_age·age + β_BMI·BMI + β_PC1·PC1 + β_PC2·PC2 + β_PC3·PC3

Two-tier testing strategy:

Tier	SNP set	Threshold	Rationale
1. Targeted	8 PBS candidates	p < 0.00625 (Bonferroni for 8)	Hypothesis: population-specific variants may affect fitness-related phenotypes
2. Genome-wide	All QC’d SNPs	p < 5 × 10⁻⁸ (standard GWAS)	Hypothesis-free scan; requires large sample size for power

Command:

plink2 --bfile UZB_final_QC \
    --pheno pregnancy_loss.pheno \
    --covar covariates.txt \
    --glm hide-covar cols=chrom,pos,ref,alt,a1freq,test,nobs,beta,se,p \
    --out UZB_GWAS_results

4.5. Known Pregnancy Loss Gene Screen

QUICK WIN — Can be done immediately

Check allele frequencies of known recurrent pregnancy loss (RPL) variants in the Uzbek cohort vs. published European frequencies:

Gene	Variant	EUR frequency	RPL association
F5 (Factor V)	rs6025 (Leiden)	~3–5%	OR 2.0–3.5 for RPL
F2 (Prothrombin)	rs1799963 (G20210A)	~1–3%	OR 2.0–2.5
MTHFR	rs1801133 (C677T)	~25–35%	OR 1.2–1.5 (homozygous)
SERPINE1 (PAI-1)	rs1799889 (4G/5G)	~50%	OR 1.3–1.5
HLA-G	Various	Variable	Immune tolerance at maternal–fetal interface
SYCP3	Rare	<1%	Meiotic segregation errors

What’s unknown: The frequencies of these variants in Central Asian populations are poorly characterized. Even a simple frequency comparison is publication-worthy.

4.6. Ancestry-Specific Risk Analysis

EXPLORATORY

Tests unique to admixed populations:

Dose-response: Does more Eastern/Western ancestry proportion correlate with pregnancy loss risk?
Interaction: Does a SNP’s effect differ depending on the individual’s global ancestry?
Admixture mapping: Using local ancestry, test whether pregnancy loss cases have excess Western or Eastern ancestry at specific chromosomal regions

5. Priority Roadmap

#	Task	Priority	Depends on	Time estimate
1	Covariate validation of ADMIXTURE	DONE	—	Complete
2	Phenotype–genotype ID mapping	IMMEDIATE	—	1–2 days
3	Case/control definition	IMMEDIATE	#2	1 day
4	ROH analysis (F_ROH)	DONE	—	Complete (Step 15)
5	Covariate preparation	HIGH	#2, #4	1 day
6	GWAS (targeted 8 PBS + genome-wide)	HIGH	#3, #5	1–2 days
7	Known RPL variant screening	HIGH	—	2 hours
8	Gene annotation (Ensembl VEP)	DONE	—	Complete (Step 12)
9	IBD stratified by ancestry	DONE	—	Complete (Step 15)
10	DAPC analysis	MEDIUM	—	0.5 day
11	Evanno ΔK replicates (sNMF complete)	IN PROGRESS	—	Replicates running on DRAGEN
12	Global ADMIXTURE with 1000G	DONE	K=5–8 plateau (CV=0.294)	Complete
13	Local ancestry inference	LATER	—	2–3 days
14	iHS/nSL selection scans	LATER	—	1–2 days
15	F_ST heatmap & MDS visualization	DONE	—	Complete (Step 14)

Next Steps: Planned Analyses

1. Current Status Summary

CORE COMPLETE — Global ADMIXTURE complete; Evanno replicates running on DRAGEN

2. Uzbek Population Genetics Analyses

2.1. Evanno ΔK Correction (Evanno et al. 2005)

Uzbek-only ADMIXTURE log-likelihoods (single run per K):

2.2. Runs of Homozygosity (ROH)

✓ COMPLETE — See Step 15

2.3. Covariate Validation of ADMIXTURE Components

A. Ethnicity Distribution in Genotyped Samples

B. Geographic Origin (Birthplace Region)

B2. Geographic Gradient: Historical Interpretation

C. Concordance Summary (K=2)

D. Global ADMIXTURE with 1000 Genomes References

COMPLETE — March 2026

Cross-Validation Error (CV)

Global Covariate Validation

2.4. Local Ancestry Inference

MEDIUM PRIORITY — Enables ancestry-specific association testing

2.5. IBD Analysis Stratified by Ancestry

✓ COMPLETE — See Step 15

2.6. Gene Annotation of PBS Candidates

COMPLETE — Ensembl VEP annotation retrieved

2.7. Extended Haplotype Homozygosity (iHS/nSL)

LOWER PRIORITY — Confirms positive selection signals

2.8. DAPC (Discriminant Analysis of Principal Components)

2.9. FST Graphical Visualization

A. Heatmap

B. Neighbor-Joining Tree

C. Multi-Dimensional Scaling (MDS)

3. ADMIXTURE Bar Plots: Continuous or Divisible?

4. Pregnancy Loss Association Study

4.1. Phenotype–Genotype ID Mapping

IMMEDIATE BLOCKER

4.2. Case/Control Definition

HIGH PRIORITY

4.3. Covariates for Association Testing

HIGH PRIORITY — Must include in GWAS model

4.4. Association Testing Framework

AFTER ID MAPPING

4.5. Known Pregnancy Loss Gene Screen

QUICK WIN — Can be done immediately

4.6. Ancestry-Specific Risk Analysis

EXPLORATORY

5. Priority Roadmap

2.9. F_ST Graphical Visualization