
Next Steps: Planned Analyses

Roadmap for Uzbek population genetics and pregnancy loss association study

Updated: March 2026

1. Current Status Summary

Completed (Steps 1–15):
  • QC pipeline: missingness → IBD dedup → SNP filtering → imputation → normalization → final QC
  • PCA: Uzbek-internal + global (1000 Genomes) — Uzbek sits on EUR–EAS cline
  • FST: UZB equidistant from SAS and EUR (0.014), then EAS (0.039)
  • ADMIXTURE K=2–8: Uzbek-only K=2 optimal (CV monotonically increases); Global K=5–8 plateau (CV=0.294, with all 2,548 1000G)
  • PBS multi-population analysis: 8 Uzbek-specific variants, all with PBS ≥ 0.3. See Step 12
  • LD pruning: 5.41M → 88.7K independent SNPs for population genetics
  • ROH & IBD: 36,702 ROH segments, median FROH=0.015. 6,368 related pairs (PI_HAT ≥ 0.05), 428 related (PI_HAT > 0.0884). See Step 15

CORE COMPLETE — Global ADMIXTURE complete; Evanno replicates running on DRAGEN

  • Evanno ΔK: Replicate runs launched on DRAGEN: Global (3,595 × 77,111) + UZB-only (1,047 × 88,722), K=2–8 × 10 reps each. See Step 11 §3.1
  • Global ADMIXTURE: COMPLETE. K=5–8 plateau (CV=0.294). 3,595 samples (2,548 1000G + 1,047 UZB). See Step 11
  • sNMF validation: COMPLETE. K=2 Uzbek-only (monotonic CE increase, 0.425). See Step 11 §3.2–3.3

2. Uzbek Population Genetics Analyses

2.1. Evanno ΔK Correction (Evanno et al. 2005)

Reference: Evanno G, Regnaut S, Goudet J (2005). Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Molecular Ecology, 14(8), 2611–2620. doi:10.1111/j.1365-294X.2005.02553.x

The Evanno method uses the second-order rate of change of the log-likelihood L(K) to find the “true” K. Instead of looking at CV error alone, it computes:

L′(K) = L(K) − L(K−1)     (first derivative: rate of improvement)
|L″(K)| = |L(K+1) − 2·L(K) + L(K−1)|     (second derivative: where improvement plateaus)
ΔK = mean(|L″(K)|) / sd(L(K))     (normalized by variance across runs)
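The three formulas can be sketched in a few lines of Python. This is a minimal illustration, not the project's pipeline: the function name `evanno_delta_k` and the replicate values are hypothetical, and |L″(K)| is taken on the replicate means, which is one common reading of the formula.

```python
from statistics import mean, stdev

def evanno_delta_k(loglik_by_k):
    """Return {K: delta_K} for every K that has both neighbours K-1 and K+1.

    loglik_by_k maps K -> list of replicate log-likelihoods (at least 2 per K).
    """
    means = {k: mean(v) for k, v in loglik_by_k.items()}
    delta = {}
    for k in sorted(loglik_by_k):
        if k - 1 in means and k + 1 in means:
            # |L''(K)| computed on the replicate means
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            sd = stdev(loglik_by_k[k])
            delta[k] = second_diff / sd if sd > 0 else float("inf")
    return delta

# Hypothetical replicate log-likelihoods (three runs per K), for illustration only.
runs = {
    2: [-1000.0, -1002.0, -998.0],
    3: [-900.0, -901.0, -899.0],
    4: [-890.0, -895.0, -885.0],
    5: [-885.0, -880.0, -890.0],
}
delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
```

ΔK is undefined for the smallest and largest K tested, because the second difference needs both neighbours. With one run per K the standard deviation cannot be computed at all, which is why replicates are required.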

Uzbek-only ADMIXTURE log-likelihoods (single run per K):

K   L(K), single run   CV error   L′(K)      |L″(K)| (not true ΔK)
2   -353,410,112       0.51335
3   -352,954,902       0.51343    +455,209   143,051  ← peak
4   -352,642,744       0.51579    +312,158   5,464
5   -352,336,050       0.51636    +306,694   34,278
6   -352,063,633       0.51682    +272,417   10,310
7   -351,780,907       0.51885    +282,726   19,557
8   -351,517,738       0.51974    +263,169

Interpretation:

The |L″(K)| peaks sharply at K=3 (143,051 vs. 5,464–34,278 for K=4–7). In the Evanno framework, this means the largest “elbow” in likelihood improvement occurs when going from K=2 to K=3 — the most informative split is the transition from 2 to 3 components.
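The differences in the table can be rechecked directly from the L(K) column. The sketch below is only an arithmetic check using the integer-rounded log-likelihoods shown above, so a result may differ by 1 from the tabulated value, which was presumably computed from unrounded output.

```python
# Single-run Uzbek-only log-likelihoods, copied from the table (rounded to integers).
loglik = {
    2: -353_410_112, 3: -352_954_902, 4: -352_642_744, 5: -352_336_050,
    6: -352_063_633, 7: -351_780_907, 8: -351_517_738,
}

# L'(K) = L(K) - L(K-1)
first_diff = {k: loglik[k] - loglik[k - 1] for k in loglik if k - 1 in loglik}

# |L''(K)| = |L(K+1) - 2*L(K) + L(K-1)|
second_diff = {
    k: abs(loglik[k + 1] - 2 * loglik[k] + loglik[k - 1])
    for k in loglik
    if k - 1 in loglik and k + 1 in loglik
}
peak_k = max(second_diff, key=second_diff.get)

for k in sorted(second_diff):
    print(k, first_diff[k], second_diff[k])
```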

Caveat (single-run limitation): The table above shows Uzbek-only log-likelihoods from a single ADMIXTURE run per K (not replicated). The proper Evanno ΔK = mean|L″(K)| / sd(L(K)) requires multiple replicates per K to compute the denominator; what is shown here is simply |L″(K)| from one run, which should be read as an approximation. For the global dataset, the single-run |L″(K)| again peaks at K=3 (2,087,440); see Step 11 §3.1 for full results. The global CV error, by contrast, forms a plateau at K=5–8, with K=5 the most parsimonious choice.

Bottom line: CV error (minimum at K=2) and the Evanno second derivative (peak at K=3) give slightly different answers: CV favours K=2 as best-fitting, while the Evanno |L″(K)| peak at K=3 indicates the sharpest structural break occurs when adding a 3rd component. A natural reconciliation is that K=2 captures the dominant ancestry structure (Western vs. Eastern Eurasian), while K=3 adds a minor component that does not improve cross-validated fit (CV error rises from 0.51335 to 0.51343). For publication, state both K values explicitly rather than collapsing into a single answer.

2.2. Runs of Homozygosity (ROH)

✓ COMPLETE — See Step 15

ROH analysis completed on 1,047 post-QC Uzbek samples (5,405,898 SNPs). 36,702 ROH segments detected. Median FROH = 0.015. See Step 15 for full results.
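For orientation, FROH is conventionally the summed ROH length divided by the autosomal genome length. The sketch below assumes that definition; the denominator actually used in Step 15 is not stated here, so `AUTOSOME_BP` and the example segments are hypothetical.

```python
AUTOSOME_BP = 2.88e9  # assumed approximate autosomal length; Step 15 may use a different value

def froh(segment_lengths_bp, genome_bp=AUTOSOME_BP):
    """Fraction of the autosomal genome covered by runs of homozygosity."""
    return sum(segment_lengths_bp) / genome_bp

# Hypothetical sample carrying 30 ROH segments of 1.5 Mb each.
example = froh([1_500_000] * 30)
```

Under these assumptions, roughly 45 Mb of total ROH per individual corresponds to an FROH close to the reported median.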

IBD analysis (PLINK --genome, 1,047 post-QC samples) identified 6,368 related pairs (PI_HAT ≥ 0.05), of which 428 are related at PI_HAT > 0.0884: 0 duplicates, 3 first-degree, 2 second-degree, and 423 third-degree relatives.

Clinical relevance: FROH will be used as a covariate in the pregnancy loss GWAS. Mixed linear models (BOLT-LMM/SAIGE) are essential given the extensive cryptic relatedness.

2.3. Covariate Validation of ADMIXTURE Components

Question: Do ADMIXTURE ancestry components correlate with self-reported ethnicity and geographic origin? This validates whether the inferred genetic structure reflects known demographic patterns.
✓ Validation Complete — 1,007 / 1,047 genotyped samples matched to phenotype covariates (96.2% match rate).
ID mapping: genotype IDs (PREFIX_realID) → phenotype IDs (realID) by stripping numeric prefix.
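The prefix-stripping rule can be sketched as follows. The function names are hypothetical and this is not the project's mapping script; the example IDs (2_01-02, 12_02-81) are the genotype ID patterns quoted in section 4.1.

```python
def genotype_to_phenotype_id(genotype_id):
    """Strip the leading numeric prefix and underscore: '12_02-81' -> '02-81'."""
    prefix, sep, real_id = genotype_id.partition("_")
    if sep and prefix.isdigit():
        return real_id
    return genotype_id  # no numeric prefix found; leave unchanged

def match_rate(genotype_ids, phenotype_ids):
    """Fraction of genotype IDs whose stripped form appears in the phenotype table."""
    pheno = set(phenotype_ids)
    matched = [g for g in genotype_ids if genotype_to_phenotype_id(g) in pheno]
    return len(matched) / len(genotype_ids)
```

Only the first underscore is treated as the separator, so real IDs that themselves contain underscores are preserved.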

A. Ethnicity Distribution in Genotyped Samples

Ethnicity          N     Mean Q1 (Western)   Mean Q2 (Eastern)   SD
Uzbek              807   0.652               0.348               0.149
Russian            35    0.653               0.347               0.163
Tajik              31    0.660               0.340               0.126
Tatar              31    0.662               0.338               0.115
Korean             22    0.561               0.439               0.201
Kazakh             17    0.657               0.343               0.108
Karakalpak         5     0.689               0.311               0.154
Unknown/NA         51    0.632               0.368               0.189
Other (n≤4 each)   39    Uyghur, Kyrgyz, Armenian, Ukrainian, Chinese, Afghan, etc.

K=2 by ethnicity: NOT significant (Kruskal-Wallis H=14.42, p=0.345, η²=0.020).
At K=2, all ethnic groups share the same two-component gradient — consistent with a shared admixture cline across Central Asian sub-populations.

However, at K=3: HIGHLY significant (H=68.88, p=6.67×10⁻⁸, η²=0.115). The third component differentiates Koreans and Chinese (higher Q3) from the Uzbek/Tajik/Tatar core, matching the expected East Asian affinity.
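For readers unfamiliar with the test, the Kruskal-Wallis H statistic compares the rank sums of the groups after pooling all values. The pure-Python sketch below (with tie correction) only illustrates how H is formed; it is not taken from validate_admixture_covariates.py, which presumably calls a library routine, and it does not reproduce the η² values, whose exact definition is not given here.

```python
from collections import defaultdict

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction.

    groups: dict mapping a group label (str) to a list of numeric values.
    """
    pooled = sorted((v, label) for label, values in groups.items() for v in values)
    n = len(pooled)
    rank_sum = defaultdict(float)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j shared by tied values
        for _, label in pooled[i:j]:
            rank_sum[label] += avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    h = 12 / (n * (n + 1)) * sum(
        rank_sum[label] ** 2 / len(values) for label, values in groups.items()
    ) - 3 * (n + 1)
    return h / (1 - tie_term / (n**3 - n))

# Toy data: three perfectly separated groups.
h_toy = kruskal_wallis_h({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
```

The p-value then comes from a chi-square distribution with (number of groups − 1) degrees of freedom. In this project the groups would be the per-ethnicity lists of Q values.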

B. Geographic Origin (Birthplace Region)

Region            N     Q1 (Western)   Q2 (Eastern)
Fergana           30    0.724          0.276
Karakalpakstan    42    0.704          0.296
Tashkent city     449   0.671          0.329
Namangan          18    0.661          0.339
Samarkand         49    0.656          0.345
Kashkadarya       57    0.642          0.358
Navoi             26    0.642          0.358
Syrdarya          30    0.629          0.371
Tashkent region   154   0.616          0.385
Khorezm           12    0.605          0.395
Andijan           18    0.572          0.428
Jizzakh           83    0.569          0.431
Russia (born)     3     0.966          0.034

K=2 by birthplace: HIGHLY significant (Kruskal-Wallis H=145.41, p=1.93×10⁻²², η²=0.082).
The Q2 (Eastern) ancestry proportion varies from 27.6% (Fergana) to 43.1% (Jizzakh) — a 15.5 percentage-point spread across regions.
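The spread can be rechecked from the regional table. The sketch below copies the table values for the 12 Uzbek regions (the Russia-born group, n=3, is left out, as in the range quoted above) and also computes a sample-size-weighted mean Q1 as a consistency check; the variable names are illustrative.

```python
# (N, Q1, Q2) per birthplace region, copied from the table above.
regions = {
    "Fergana": (30, 0.724, 0.276),
    "Karakalpakstan": (42, 0.704, 0.296),
    "Tashkent city": (449, 0.671, 0.329),
    "Namangan": (18, 0.661, 0.339),
    "Samarkand": (49, 0.656, 0.345),
    "Kashkadarya": (57, 0.642, 0.358),
    "Navoi": (26, 0.642, 0.358),
    "Syrdarya": (30, 0.629, 0.371),
    "Tashkent region": (154, 0.616, 0.385),
    "Khorezm": (12, 0.605, 0.395),
    "Andijan": (18, 0.572, 0.428),
    "Jizzakh": (83, 0.569, 0.431),
}

q2_values = {name: q2 for name, (_, _, q2) in regions.items()}
most_western = min(q2_values, key=q2_values.get)
most_eastern = max(q2_values, key=q2_values.get)
spread_pp = (q2_values[most_eastern] - q2_values[most_western]) * 100

# Sample-size-weighted mean of Q1 across regions.
total_n = sum(n for n, _, _ in regions.values())
weighted_q1 = sum(n * q1 for n, q1, _ in regions.values()) / total_n
```

The weighted mean Q1 lands close to the cohort-wide K=2 mean, so the regional table is consistent with the overall 65/35 split.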

B2. Geographic Gradient: Historical Interpretation

The ancestry gradient across Uzbekistan’s regions aligns with known historical migration patterns along the Silk Road:

Tier              Region (n)              Q1 (Western)   Q2 (Eastern)
Most Western      Fergana (n=30)          72.4%          27.6%
                  Karakalpakstan (n=42)   70.4%          29.6%
Central / Mixed   Tashkent city (n=449)   67.1%          32.9%
                  Samarkand (n=49)        65.6%          34.5%
                  Kashkadarya (n=57)      64.2%          35.8%
                  Navoi (n=26)            64.2%          35.8%
Most Eastern      Andijan (n=18)          57.2%          42.8%
                  Jizzakh (n=83)          56.9%          43.1%

Historical context:
  • Fergana Valley (Most Western): one of Central Asia’s oldest continuously settled agricultural regions, with deep Sogdian (Iranian) roots predating Turkic migrations. The ancient cities of Kokand and Margilan were major Silk Road hubs.
  • Karakalpakstan (Most Western): Aral Sea region. The individuals genotyped from here are predominantly Uzbek residents (not ethnic Karakalpaks) who may trace to the settled Khorezm oasis civilization with a strong Indo-Iranian substrate.
  • Tashkent (Central / Mixed): modern capital, cosmopolitan admixture.
  • Samarkand (Central / Mixed): Sogdian heartland, but also seat of Timur’s Turco-Mongol empire.
  • Kashkadarya & Navoi (Central / Mixed): southern and central steppe, with balanced migration influence from both Iranian and Turkic populations.
  • Andijan (Most Eastern): easternmost Fergana Valley, bordering Kyrgyzstan; historically a corridor for Turkic and Mongol nomadic groups entering the valley.
  • Jizzakh (Most Eastern): steppe gateway between Tashkent and Samarkand, at the edge of the Hungry Steppe (Mirzachul), a historically nomadic pastoralist region with greater Turkic/Mongol influence. The Jizzakh Gate was a major pass for nomadic incursions.

Key insights:
  • 15.5 percentage-point spread in Eastern ancestry (Q2: 27.6%–43.1%) across regions within a single country, indicating substantial intra-national genetic structure
  • The gradient follows a “settled oasis vs. nomadic steppe” axis, not simply east–west geography: Fergana (far east, but settled) is the most “Western” genetically, while Jizzakh (geographically central, but steppe) is the most “Eastern”
  • This distinction between geographic position and genetic ancestry reflects the different settlement histories: irrigated oasis cities retained more of the ancient Sogdian/Iranian substrate, while steppe regions absorbed more Turkic/Mongol gene flow
  • η²=0.082 means birthplace explains ~8.2% of the variance in ancestry proportions — a modest but highly significant effect
  • Publication-worthy: to our knowledge, this regional genetic structure within Uzbekistan has not previously been characterized at this resolution or sample size
  • GWAS implication: birthplace region should be included as a covariate (or use regional ancestry Q-values) to prevent confounding by this intra-Uzbek population structure

C. Concordance Summary (K=2)

Group            Expected                        Observed                                     Match?
Korean (n=22)    Higher Eastern                  Q2=0.44 (highest among groups with n ≥ 5)    ✓ Yes
Chinese (n=4)    Higher Eastern                  Q2=0.52                                      ✓ Yes
Russian (n=35)   Higher Western                  Q1=0.65 (same as Uzbek)                      ~ Weak
Tajik (n=31)     Higher Western (Indo-Iranian)   Q1=0.66 (same as Uzbek)                      ~ Weak
Uzbek (n=807)    Mixed admixed                   Q1=0.65, Q2=0.35                             ✓ Yes

Interpretation:
  • At K=2, this cohort’s two components capture a geographic signal (birthplace) rather than ethnicity — consistent with a Silk Road admixture cline
  • East Asian minorities (Korean, Chinese) do show higher Eastern ancestry, confirming the biological meaning of the Q2 component
  • Russians/Tajiks are not distinguishable from Uzbeks at K=2, plausibly because most are long-term residents who have admixed into the local gradient. The global ADMIXTURE with 1000G references (section D) did not separate them either: ethnicity remained non-significant at every K
  • The strong geographic signal supports using birthplace as a covariate in the GWAS model alongside ancestry PCs

D. Global ADMIXTURE with 1000 Genomes References

COMPLETE — March 2026

3,595 samples (1,047 Uzbek + 2,548 1000G Phase 3) × 77,111 LD-pruned SNPs:

Superpopulation   Populations                         N samples
AFR               YRI, LWK, GWD, MSL, ESN, ACB, ASW   671
EUR               CEU, GBR, FIN, IBS, TSI             522
SAS               GIH, PJL, BEB, STU, ITU             492
EAS               CHB, JPT, CHS, CDX, KHV             515
AMR               MXL, PUR, CLM, PEL                  348
UZB               Uzbek cohort                        1,047
Total                                                 3,595

Cross-Validation Error (CV)

K    2       3       4       5       6       7       8
CV   0.310   0.300   0.297   0.295   0.295   0.294   0.294

K=5–8 plateau (CV values differ by about 0.001 at most). K=5 most parsimonious; K=8 nominal minimum (0.29422).
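The "most parsimonious K on the plateau" rule can be written down explicitly. In the sketch below the CV values are copied from the table above, while the function name and the tolerance of 0.0015 are illustrative assumptions rather than project settings.

```python
# Global ADMIXTURE CV errors, copied from the table above (3 decimal places).
cv_error = {2: 0.310, 3: 0.300, 4: 0.297, 5: 0.295, 6: 0.295, 7: 0.294, 8: 0.294}

def most_parsimonious_k(cv, tolerance):
    """Smallest K whose CV error is within `tolerance` of the minimum CV error."""
    best = min(cv.values())
    return min(k for k, err in cv.items() if err - best <= tolerance)

# The tolerance is an assumed value chosen for illustration.
k_choice = most_parsimonious_k(cv_error, tolerance=0.0015)
```

A tighter tolerance moves the choice toward the nominal minimum, so the tolerance should be reported alongside the selected K.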

Global Covariate Validation

Re-ran validate_admixture_covariates.py --global on 1,007 matched Uzbek samples (of 1,047 genotyped; 96.2% match rate to phenotype covariates):

Covariate    Test             K=2   K=3   K=4   K=5   K=6   K=7   K=8
Ethnicity    Kruskal–Wallis   ns    ns    ns    ns    ns    ns    ns
Birthplace   Kruskal–Wallis   ***   ***   ***   ***   ***   ***   ***
Key finding: Even with 1000G references anchoring continental ancestry components, self-reported ethnicity remains non-significant across all K values and all components in the global run. Birthplace remains highly significant (p~10⁻²²) at every K. This confirms that the ADMIXTURE components capture geographic structure within Uzbekistan rather than ethnic self-identification; the admixture cline is geographically driven.

See Step 11 for full interactive visualization of global ADMIXTURE results.

2.4. Local Ancestry Inference

MEDIUM PRIORITY — Enables ancestry-specific association testing

ADMIXTURE gives genome-wide ancestry proportions. Local ancestry inference (RFMix, LAMP-LD, or ELAI) estimates ancestry at each chromosomal position: “this chunk is Western, that chunk is Eastern.”

Why it matters:

  • A SNP’s effect on pregnancy loss may depend on its local ancestry background
  • Enables admixture mapping: test whether cases have more Western or Eastern ancestry at specific loci
  • Provides a chromosome-level “painting” of each individual’s genome

Requirements: Phased data (from Step 4 BEAGLE output) + reference panels (EUR and EAS from 1000 Genomes)

2.5. IBD Analysis Stratified by Ancestry

✓ COMPLETE — See Step 15

IBD analysis (PLINK --genome) completed on post-QC dataset (1,047 samples, 88,722 LD-pruned SNPs; 547,581 pairwise comparisons).

  • 6,368 related pairs (PI_HAT ≥ 0.05); 428 at PI_HAT > 0.0884: 0 duplicates, 3 first-degree, 2 second-degree, 423 third-degree
  • PI_HAT distribution: heavy tail reveals cryptic relatedness consistent with founder effects
  • Cross-reference with ADMIXTURE K=2 (Q1 mean=0.650, Q2 mean=0.350) shows FROH is slightly elevated at ancestry extremes, consistent with subgroup endogamy
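The pair count and the relatedness tiers can be sketched as below. Only the 0.05 and 0.0884 thresholds come from the text above; the cut-offs separating duplicates, first-degree and second-degree pairs are illustrative assumptions, and Step 15 may have used different ones.

```python
def n_pairs(n_samples):
    """Number of unordered sample pairs compared by a pairwise IBD scan."""
    return n_samples * (n_samples - 1) // 2

def classify_pi_hat(pi_hat):
    """Label a pair by PI_HAT.

    The 0.9 / 0.4 / 0.2 cut-offs are assumed for illustration only.
    """
    if pi_hat >= 0.9:
        return "duplicate"
    if pi_hat >= 0.4:
        return "first-degree"
    if pi_hat >= 0.2:
        return "second-degree"
    if pi_hat > 0.0884:
        return "third-degree"
    if pi_hat >= 0.05:
        return "distant"
    return "unrelated"
```

For 1,047 samples, `n_pairs` reproduces the 547,581 pairwise comparisons quoted above.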

Remaining:

  • IBDNe: reconstruct Ne trajectory over time — Silk Road expansion, Mongol bottleneck, Soviet-era recovery
  • Segment-level IBD: hap-IBD or GERMLINE for haplotype-level analysis

2.6. Gene Annotation of PBS Candidates

COMPLETE — Ensembl VEP annotation retrieved

⚠ Note: The annotation below is from the initial run (490 candidates). PBS recomputation reduced the candidate set to 8 variants (see Step 12). Re-annotation of these 8 variants has not yet been performed.

All 490 Uzbek-specific SNPs annotated via Ensembl VEP REST API (27 fields per SNP). Results in Step 12:

  • 350/490 SNPs mapped to 264 unique genes
  • 5 missense variants (4 potentially damaging by SIFT/PolyPhen): SPI1, TNXB, SLC6A2, ATL2
  • HLA/MHC region over-represented: 135 chr6 SNPs (33.7%)
  • GWAS Catalog + ClinVar + GTEx eQTL cross-referenced

Remaining:

  • Pathway enrichment: KEGG/GO/Reactome — do Uzbek-specific variants cluster in immune, metabolic, or reproductive pathways?
  • eQTL deep dive: GTEx tissue-specific expression in uterus, placenta, blood

2.7. Extended Haplotype Homozygosity (iHS/nSL)

LOWER PRIORITY — Confirms positive selection signals

For top PBS regions, compute iHS (integrated haplotype score) to independently confirm recent positive selection (vs. drift):

  • Requires phased data (BEAGLE, Step 4)
  • Software: selscan for iHS and nSL computation
  • A SNP with both high PBS and extreme iHS is a strong selection candidate

2.8. DAPC (Discriminant Analysis of Principal Components)

What is DAPC?

DAPC (Jombart et al., 2010) is a model-free alternative to ADMIXTURE that combines PCA with Linear Discriminant Analysis. Unlike ADMIXTURE (which assumes Hardy–Weinberg equilibrium within ancestral populations), DAPC makes no assumptions about the underlying population model.

How it works:

  1. Run PCA to reduce dimensionality (retain ~40 PCs covering ≥80% variance)
  2. Use K-means clustering on the PCs to find groups (using BIC to select K)
  3. Run Discriminant Analysis on the PCs to maximize between-group separation

Advantages over ADMIXTURE:

Feature              ADMIXTURE                     DAPC
Model                Parametric (HWE assumption)   Non-parametric (no HWE assumption)
Speed                Hours per K                   Minutes total
Cluster assignment   Soft (ancestry proportions)   Hard (group membership) + posterior probabilities
K selection          CV error or ΔK                BIC curve
Best for             Admixed populations           Discrete clusters

Implementation:

    # R code using the adegenet package
    library(adegenet)
    library(vcfR)

    # Load data
    vcf <- read.vcfR("UZB_pruned.vcf.gz")
    gl <- vcfR2genlight(vcf)

    # Find optimal K using BIC (interactive: prompts for the number of clusters)
    grp <- find.clusters(gl, max.n.clust=10, n.pca=100)

    # Run DAPC
    dapc_result <- dapc(gl, grp$grp, n.pca=40, n.da=5)

    # Plot
    scatter(dapc_result, posi.da="bottomright")
    compoplot(dapc_result)  # ancestry-like bar plot

For our data: Since the Uzbek cohort is admixed (not discrete clusters), DAPC may show a continuous cline rather than distinct groups — similar to what ADMIXTURE K=2 shows. DAPC is most informative as a complement to ADMIXTURE: if both methods agree on 2 components, this is robust evidence. If DAPC finds more structure, it may reflect LD patterns that ADMIXTURE misses.

2.9. FST Graphical Visualization

Several ways to visualize the FST matrix (from Step 9; see Step 14 for the full interactive version):

A. Heatmap

[Figure: pairwise FST matrix as a color-coded heatmap]

B. Neighbor-Joining Tree

Convert FST to distances and build a neighbor-joining (NJ) tree using the ape package in R. This shows the phylogenetic relationships:

    # R code
    library(ape)

    # Pairwise FST matrix (symmetric)
    fst_matrix <- matrix(c(
      0.0000, 0.0144, 0.0145, 0.0393, 0.1293,
      0.0144, 0.0000, 0.0310, 0.0564, 0.1282,
      0.0145, 0.0310, 0.0000, 0.0845, 0.1393,
      0.0393, 0.0564, 0.0845, 0.0000, 0.1650,
      0.1293, 0.1282, 0.1393, 0.1650, 0.0000
    ), nrow=5, dimnames=list(
      c("UZB","SAS","EUR","EAS","AFR"),
      c("UZB","SAS","EUR","EAS","AFR")))

    # Build and plot the neighbor-joining tree
    tree <- nj(as.dist(fst_matrix))
    plot(tree, type="unrooted", main="NJ Tree from Fst")

C. Multi-Dimensional Scaling (MDS)

Project the FST distance matrix into 2D space:

    # R code (uses fst_matrix from the NJ example above)
    mds <- cmdscale(as.dist(fst_matrix), k=2)
    plot(mds, pch=19, cex=2, xlab="Dim 1", ylab="Dim 2",
         main="MDS from Fst distances")
    text(mds, labels=rownames(mds), pos=3)

3. ADMIXTURE Bar Plots: Continuous or Divisible?

Question: Can we divide the ADMIXTURE bar plots into discrete vertical groups, or is the data a continuous gradient?

Short answer: Our data is a continuous admixture cline — not discrete clusters.

Here’s why:

  • K=2 structure: Most individuals fall on a smooth gradient from ~100% Western to ~100% Eastern, with the majority around 65/35. There are no large “gaps” in the distribution that would indicate distinct groups.
  • This is expected for Central Asia: The Uzbek population formed through continuous gene flow between Indo-Iranian (Western) and Turkic/Mongol (Eastern) populations along the Silk Road. Unlike, say, an African-American + European dataset (which would show a bimodal distribution), the Uzbek individuals represent a true admixture cline.

However, there are some ways to create meaningful vertical divisions:

Method                    How                                                    Biological meaning
Outlier groups            Mark individuals with >95% single component            “Nearly pure” Eastern or Western individuals (likely ethnic minorities in cohort: Tajik/Russian vs. Kazakh/Kyrgyz)
Quartile split            Divide into 4 bins by K=2 Component 1 proportion       Arbitrary but useful for stratified analysis (e.g., compare pregnancy loss rates across ancestry quartiles)
K-means on Q              Cluster individuals by ancestry proportions            Data-driven grouping, but may not produce cleanly separated clusters if the cline is smooth
Self-reported ethnicity   If available in phenotype data, overlay ethnic labels  Most informative: shows whether genetic clusters match self-identification
Recommendation: For the pregnancy loss GWAS, do not split into discrete groups. Instead, use the continuous K=2 ancestry proportion (Q1) as a covariate in logistic regression. This preserves all information and avoids arbitrary cutoffs. If you want visual grouping, overlay self-reported ethnicity labels (if available in the phenotype data) on the sorted ADMIXTURE plot.
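If a quartile split (the second row of the table above) is still wanted for descriptive stratification, it can be sketched with the standard library as follows. The Q1 values are hypothetical and the function name is illustrative; the GWAS itself should keep Q1 continuous, as recommended.

```python
from statistics import quantiles

def ancestry_quartile(q1_values):
    """Assign each individual to quartile 1-4 of the K=2 Q1 proportion."""
    cuts = quantiles(q1_values, n=4)  # three cut points
    # An individual's bin is 1 plus the number of cut points it exceeds.
    return [1 + sum(q > c for c in cuts) for q in q1_values]

# Hypothetical Q1 proportions for eight individuals.
q1 = [0.42, 0.55, 0.61, 0.64, 0.66, 0.69, 0.73, 0.91]
bins = ancestry_quartile(q1)
```

Because the cut points depend on the cohort, quartile labels are not comparable between datasets, which is one more reason to prefer the continuous covariate in the association model.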

4. Pregnancy Loss Association Study

4.1. Phenotype–Genotype ID Mapping

IMMEDIATE BLOCKER

The phenotype CSV (1,815 rows × 246 columns, from the “GWAS ot 27.08” file) uses a different sample ID scheme than the genotype files. Before any association testing, we must:

  1. Map phenotype IDs ↔ genotype IDs (the genotype IDs are like 2_01-02, 12_02-81)
  2. Verify and fix the reportedly inverted case/control labels
  3. Determine which of the 246 phenotype columns are relevant covariates

4.2. Case/Control Definition

HIGH PRIORITY

Preliminary criteria for separating cases from controls:

                    Cases                                                               Controls
Primary criterion   ≥2 pregnancy losses (recurrent)                                     ≥1 live birth, 0 losses
Strict criterion    ≥3 consecutive losses (classical RPL definition; ESHRE uses ≥2)     ≥2 live births, 0 losses
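The two criteria in the table can be expressed as a small classification rule. This is a sketch under stated assumptions: the function and argument names are hypothetical, the phenotype columns holding loss and live-birth counts still need to be identified, and the "consecutive" requirement of the strict criterion is not modelled because it needs the order of pregnancies.

```python
def classify(losses, live_births, strict=False):
    """Return 'case', 'control', or None (excluded) under the preliminary criteria.

    Note: consecutiveness of losses is ignored in this sketch.
    """
    min_losses, min_births = (3, 2) if strict else (2, 1)
    if losses >= min_losses:
        return "case"
    if losses == 0 and live_births >= min_births:
        return "control"
    return None  # e.g. a single loss: neither case nor control
```

Women who meet neither definition are returned as `None` so they can be excluded explicitly rather than silently counted as controls.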

Sub-phenotypes to consider:

  • Early loss (<12 weeks) — most common, often chromosomal/hormonal
  • Late loss (13–20 weeks) — often structural/immunological
  • Stillbirth (>20 weeks) — placental/vascular causes
  • Recurrent early vs. single late — different genetic architectures

4.3. Covariates for Association Testing

HIGH PRIORITY — Must include in GWAS model

Category               Covariate                                Rationale
Population structure   K=2 ancestry Q1 (continuous)             Primary admixture axis; prevents false positives from population stratification
                       PC1–PC3 from PCA                         Captures residual structure not modeled by ADMIXTURE
Demographic            Maternal age at loss/birth               Strong risk factor; aneuploidy risk increases exponentially after 35
                       BMI                                      Obesity associated with miscarriage (OR ~1.3–1.7)
                       Gravidity (number of pregnancies)        More pregnancies = more opportunities for loss
Clinical               Thrombophilia history                    Factor V Leiden, antiphospholipid syndrome → placental thrombosis
                       Thyroid disorders                        Both hypo- and hyperthyroidism increase loss risk
                       Diabetes (pre-existing or gestational)   Uncontrolled glucose → embryotoxic
Genetic                FROH (inbreeding coefficient)            Homozygosity burden, especially in a population with consanguinity
                       Total CNV burden                         If available from SNP array intensity data

4.4. Association Testing Framework

AFTER ID MAPPING

Model:

logit(P[case]) = β₀ + β_SNP·genotype + β_Q1·ancestry + β_age·age + β_BMI·BMI + β_PC1·PC1 + β_PC2·PC2 + β_PC3·PC3

Two-tier testing strategy:

Tier             SNP set            Threshold                        Rationale
1. Targeted      8 PBS candidates   p < 0.00625 (Bonferroni for 8)   Hypothesis: population-specific variants may affect fitness-related phenotypes
2. Genome-wide   All QC’d SNPs      p < 5 × 10⁻⁸ (standard GWAS)     Hypothesis-free scan; requires large sample size for power
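The two thresholds follow from the same Bonferroni arithmetic. The sketch below assumes a family-wise α of 0.05, which is implied by the targeted threshold but not stated explicitly; the function name is illustrative.

```python
def bonferroni(alpha, n_tests):
    """Per-test significance threshold for a family-wise error rate `alpha`."""
    return alpha / n_tests

targeted = bonferroni(0.05, 8)   # threshold for the 8 PBS candidates
implied_tests = 0.05 / 5e-8      # independent tests implied by the genome-wide threshold
```

The genome-wide threshold corresponds to roughly one million independent tests, which is the conventional approximation rather than a count of the SNPs actually tested.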

Command:

    plink2 --bfile UZB_final_QC \
      --pheno pregnancy_loss.pheno \
      --covar covariates.txt \
      --glm hide-covar cols=chrom,pos,ref,alt,a1freq,test,nobs,beta,se,p \
      --out UZB_GWAS_results

4.5. Known Pregnancy Loss Gene Screen

QUICK WIN — Can be done immediately

Check allele frequencies of known recurrent pregnancy loss (RPL) variants in the Uzbek cohort vs. published European frequencies:

Gene               Variant               EUR frequency   RPL association
F5 (Factor V)      rs6025 (Leiden)       ~3–5%           OR 2.0–3.5 for RPL
F2 (Prothrombin)   rs1799963 (G20210A)   ~1–3%           OR 2.0–2.5
MTHFR              rs1801133 (C677T)     ~25–35%         OR 1.2–1.5 (homozygous)
SERPINE1 (PAI-1)   rs1799889 (4G/5G)     ~50%            OR 1.3–1.5
HLA-G              Various               Variable        Immune tolerance at maternal–fetal interface
SYCP3              Rare                  <1%             Meiotic segregation errors

What’s unknown: The frequencies of these variants in Central Asian populations are poorly characterized. Even a simple frequency comparison is publication-worthy.
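The frequency comparison itself is simple arithmetic on genotype counts. In the sketch below the function names and the genotype counts are hypothetical (they are not cohort results); the reference range is the European MTHFR rs1801133 range from the table above.

```python
def alt_allele_frequency(n_hom_ref, n_het, n_hom_alt):
    """Alternate-allele frequency from diploid genotype counts."""
    total_alleles = 2 * (n_hom_ref + n_het + n_hom_alt)
    return (n_het + 2 * n_hom_alt) / total_alleles

def in_range(freq, low, high):
    """True if `freq` lies inside the published reference range [low, high]."""
    return low <= freq <= high

# Hypothetical genotype counts for one variant in 1,047 samples.
freq = alt_allele_frequency(n_hom_ref=560, n_het=410, n_hom_alt=77)
within_eur_range = in_range(freq, 0.25, 0.35)
```

A formal comparison would add a confidence interval or an allele-count test against the reference panel instead of a simple range check.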

4.6. Ancestry-Specific Risk Analysis

EXPLORATORY

Tests unique to admixed populations:

  • Dose-response: Does more Eastern/Western ancestry proportion correlate with pregnancy loss risk?
  • Interaction: Does a SNP’s effect differ depending on the individual’s global ancestry?
  • Admixture mapping: Using local ancestry, test whether pregnancy loss cases have excess Western or Eastern ancestry at specific chromosomal regions

5. Priority Roadmap

#    Task                                    Priority      Depends on   Time estimate
1    Covariate validation of ADMIXTURE       DONE          –            Complete
2    Phenotype–genotype ID mapping           IMMEDIATE     –            1–2 days
3    Case/control definition                 IMMEDIATE     #2           1 day
4    ROH analysis (FROH)                     DONE          –            Complete (Step 15)
5    Covariate preparation                   HIGH          #2, #4       1 day
6    GWAS (targeted 8 PBS + genome-wide)     HIGH          #3, #5       1–2 days
7    Known RPL variant screening             HIGH          –            2 hours
8    Gene annotation (Ensembl VEP)           DONE          –            Complete (Step 12)
9    IBD stratified by ancestry              DONE          –            Complete (Step 15)
10   DAPC analysis                           MEDIUM        –            0.5 day
11   Evanno ΔK replicates (sNMF complete)    IN PROGRESS   –            Replicates running on DRAGEN
12   Global ADMIXTURE with 1000G             DONE          –            Complete; K=5–8 plateau (CV=0.294)
13   Local ancestry inference                LATER         –            2–3 days
14   iHS/nSL selection scans                 LATER         –            1–2 days
15   FST heatmap & MDS visualization         DONE          –            Complete (Step 14)