Step 15: ROH & IBD Analysis

1. Overview

Runs of Homozygosity (ROH) are contiguous stretches of homozygous genotypes that arise when an individual inherits two copies of the same ancestral haplotype — a hallmark of parental relatedness (consanguinity) or population-level founder effects. Identity-by-Descent (IBD) analysis detects pairs of individuals sharing long haplotype segments, revealing cryptic relatedness not apparent from pedigree records.

Both analyses are critical for understanding the Uzbek population's demographic history and for designing association studies (e.g., pregnancy loss GWAS) where cryptic relatedness inflates test statistics if unaccounted for.

Metric	Value
Individuals	1,047
ROH segments	36,702
Median F_ROH	0.015
Related pairs (IBD, PI_HAT ≥ 0.05)	6,368
Duplicate samples	0

Key finding: Median F_ROH = 0.015 is comparable to outbred European populations (~0.01–0.02). However, a substantial tail of consanguineous individuals (28 with F_ROH > 0.0625) indicates historical endogamy patterns in a subset of the Uzbek cohort. This is directly relevant to pregnancy loss genetics, as elevated autozygosity increases exposure to recessive disease alleles.

2. ROH Summary Statistics

PLINK 1.9 --homozyg was run on the Uzbek post-QC dataset (1,047 × 5,405,898 SNPs) with parameters tuned for detecting long ROH segments reflecting recent consanguinity:

plink --bfile UZB_v2_for_roh \
      --homozyg \
      --homozyg-window-snp 50 \
      --homozyg-snp 50 \
      --homozyg-kb 1000 \
      --homozyg-density 50 \
      --homozyg-gap 1000 \
      --homozyg-window-het 1 \
      --homozyg-window-missing 5 \
      --homozyg-window-threshold 0.05 \
      --out UZB_v2_ROH

Parameter rationale

Parameter	Value	Meaning
`--homozyg-kb`	1000	Minimum ROH length 1,000 kb — focuses on long ROH reflecting recent consanguinity
`--homozyg-snp`	50	Minimum 50 SNPs per ROH — avoids sparse-coverage artefacts
`--homozyg-window-snp`	50	Scanning window of 50 SNPs
`--homozyg-density`	50	Max inverse density of 50 kb per SNP (at least one SNP per 50 kb on average), so ROH are only called in SNP-dense regions
`--homozyg-gap`	1000	Max 1 Mb gap between consecutive SNPs within a ROH
`--homozyg-window-het`	1	Max 1 heterozygous call per scanning window (allows for genotyping error)
`--homozyg-window-missing`	5	Max 5 missing genotypes per window before excluding that region
`--homozyg-window-threshold`	0.05	A SNP is eligible for inclusion in a ROH only if at least 5% of the scanning windows covering it are homozygous

Distribution of ROH per individual

Statistic	ROH Count	Total ROH (Mb)	F_ROH
Minimum	15	22.07	0.0077
25th percentile	27	37.08	0.0129
Median	31	42.55	0.0148
Mean	35.1	54.19	0.0188
75th percentile	37	52.77	0.0183
Maximum	145	320.20	0.1111

F_ROH calculation: F_ROH = Σ(ROH length) / L_genome, where L_genome = 2,881 Mb (2,881,033 kb, the autosomal genome covered by the SNP array). This genomic inbreeding coefficient is often more informative than pedigree-based F because it measures realized autozygosity and captures parental relatedness that predates recorded pedigrees.
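The F_ROH formula can be applied directly to the per-individual PLINK output. The following is a minimal sketch, assuming the standard PLINK 1.9 `.hom.indiv` layout (FID, IID, PHE, NSEG, KB, KBAVG); the function name `froh_from_indiv`, the variable `GENOME_KB`, and the output file name in the usage comment are illustrative and not part of the original pipeline.

```shell
# Autosomal length used as the F_ROH denominator in this report, in kb.
GENOME_KB=2881033

# Read a PLINK .hom.indiv table on stdin and print "IID F_ROH" per sample.
# Column 2 is IID; column 5 (KB) is the total ROH length in kb.
froh_from_indiv() {
    awk -v genome_kb="$GENOME_KB" '
        NR == 1 { next }                          # skip the header row
        { printf "%s %.4f\n", $2, $5 / genome_kb }
    '
}

# Usage (input path from the command log; output name is hypothetical):
#   froh_from_indiv < ~/v2/roh/UZB_v2_ROH.hom.indiv > UZB_v2_FROH.txt
```

For example, a total ROH length of 320,200 kb gives 320,200 / 2,881,033 ≈ 0.1111, matching the highest value in the table above.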

Extreme individuals

🔴 Highest F_ROH

Individual	N_ROH	Total Mb	F_ROH
705_14-79m	145	320.20	0.1111
548_14-20m	123	299.87	0.1041
629_03-160	128	298.83	0.1037
70_02-35	121	280.22	0.0973
927_13-118	114	274.11	0.0951

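F_ROH > 0.0625 exceeds the expected inbreeding coefficient for offspring of 1st cousins (1/16 = 0.0625), so the parents of these individuals are likely related at the 1st-cousin level or closer.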

🟢 Lowest F_ROH

Individual	N_ROH	Total Mb	F_ROH
904_09-129	16	22.07	0.0077
463_08-442	19	22.11	0.0077

These individuals carry fewer than 20 ROH totalling about 22 Mb, which is compatible with outbred ancestry and no recent consanguinity.

3. F_ROH Distribution

The histogram below shows the distribution of genomic inbreeding coefficients across all 1,047 individuals. Reference thresholds are marked for clinical interpretation:

Interpreting F_ROH:
• F_ROH < 0.0156: no evidence of recent parental relatedness (outbred)
• 0.0156–0.0625: background consanguinity, consistent with parents related at roughly the 2nd-cousin level (expected F = 1/64 ≈ 0.0156) but more distantly than 1st cousins
• 0.0625–0.125: parents likely 1st cousins or equivalent (expected F = 1/16, i.e. 6.25% of the genome identical-by-descent)
• F_ROH > 0.125: parents closer than 1st cousins (at or beyond the level of half-siblings or double 1st cousins, expected F = 1/8)
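The interpretation bands above can be applied mechanically to a list of per-sample values. This sketch assumes a two-column text input (sample ID, F_ROH) with no header, which is a hypothetical intermediate format; the function name `froh_class_counts`, the class labels, and the choice to place values lying exactly on a cut-off in the higher band are also assumptions.

```shell
# Read "IID F_ROH" lines on stdin and print the number of samples per band.
# Values exactly equal to a cut-off are counted in the higher band (assumption).
froh_class_counts() {
    awk '
        { f = $2 + 0 }                              # force numeric comparison
        f < 0.0156 { n["outbred"]++; next }
        f < 0.0625 { n["background"]++; next }
        f < 0.125  { n["first_cousin_level"]++; next }
                   { n["closer_than_first_cousin"]++ }
        END {
            printf "outbred %d\n", n["outbred"]
            printf "background %d\n", n["background"]
            printf "first_cousin_level %d\n", n["first_cousin_level"]
            printf "closer_than_first_cousin %d\n", n["closer_than_first_cousin"]
        }
    '
}

# Usage (file name is hypothetical):
#   froh_class_counts < UZB_v2_FROH.txt
```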

F_ROH class breakdown

Class	F_ROH range	Individuals
Outbred	< 0.0156	606
Background	0.0156–0.0625	413
Consanguineous	> 0.0625	28

4. ROH by Chromosome

ROH frequency varies by chromosome, reflecting both chromosome length and regional recombination rate differences. Longer chromosomes accumulate more ROH simply because they contain more physical sequence, but recombination coldspots (e.g., pericentromeric regions) can elevate local ROH density beyond length expectations.

Notable: Chromosome 2 has the highest ROH count (3,590), consistent with its large physical size (243 Mb) and known low-recombination pericentromeric block. Chromosome 21 has the fewest ROH (237), reflecting its small size (46.7 Mb).
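Per-chromosome ROH counts of this kind can be tallied from the segment-level output. This is a sketch only, assuming the standard PLINK 1.9 `.hom` layout in which the chromosome is the fourth column (FID, IID, PHE, CHR, ...); the function name `roh_per_chrom` is illustrative.

```shell
# Read a PLINK .hom table on stdin and print "CHR N_ROH", ordered by chromosome.
roh_per_chrom() {
    awk 'NR > 1 { n[$4]++ } END { for (c in n) print c, n[c] }' | sort -n
}

# Usage (input path from the command log):
#   roh_per_chrom < ~/v2/roh/UZB_v2_ROH.hom
```

Dividing each count by the chromosome's physical length would separate the size effect from local recombination effects, though the lengths would need to come from the genome build used for genotyping.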

5. Identity-by-Descent Analysis

IBD was estimated using PLINK 1.9 --genome on the LD-pruned dataset (1,047 samples, 88,722 SNPs) with --min 0.05. The method-of-moments estimator produces PI_HAT (proportion of genome shared IBD) for every pair, along with Z0, Z1, Z2 (probabilities of sharing 0, 1, or 2 alleles IBD).
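Two quantities in this paragraph are easy to verify by hand: the number of sample pairs, n(n − 1)/2, and PI_HAT itself, which PLINK defines as P(IBD=2) + 0.5 × P(IBD=1), that is Z2 + 0.5 × Z1. The sketch below assumes the standard PLINK 1.9 `.genome` column order (FID1, IID1, FID2, IID2, RT, EZ, Z0, Z1, Z2, PI_HAT, ...); the function names `n_pairs` and `pihat_check` are illustrative.

```shell
# Number of unordered pairs among n samples: n * (n - 1) / 2.
n_pairs() {
    echo $(( $1 * ($1 - 1) / 2 ))
}

# Read a .genome table on stdin and print, per pair,
# "IID1 IID2 reported_PI_HAT recomputed_PI_HAT" using Z1 (col 8) and Z2 (col 9).
pihat_check() {
    awk 'NR > 1 { printf "%s %s %s %.4f\n", $2, $4, $10, $9 + 0.5 * $8 }'
}

# Usage (input path from the command log):
#   n_pairs 1047
#   pihat_check < ~/v2/ibd/UZB_v2_IBD.genome | head
```

Because the Z values are written with limited precision, the recomputed value may differ from the reported one in the last decimal place.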

Metric	Value
Total pairs tested	547,581
Related (π̂ > 0.125)	186
Duplicates / MZ twins	0
1st-degree relatives	3

Relatedness by degree

Category	PI_HAT threshold	Expected relationship	Pairs	% of total
Duplicates / MZ	> 0.98	Identical genotypes (lab duplicates or MZ twins)	0	0%
1st degree	> 0.354	Parent–child or full siblings	3	0.0005%
2nd degree	> 0.177	Half-siblings, avuncular, grandparent	2	0.0004%
3rd degree	> 0.0884	1st cousins or equivalent	423	0.077%
Total related	> 0.0884	All degrees combined	428	0.078%
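The degree assignment in the table can be reproduced from the pairwise output with a short filter. This is a hedged sketch: it assumes PI_HAT is the tenth column of the PLINK `.genome` file and that each pair is counted only in the closest class whose threshold it exceeds; the function name `relatedness_counts` and the class labels are illustrative.

```shell
# Read a PLINK .genome table on stdin and count pairs per relatedness class,
# using the PI_HAT cut-offs from the table above (column 10 is PI_HAT).
relatedness_counts() {
    awk '
        NR == 1 { next }                 # skip the header row
        { p = $10 + 0 }
        p > 0.98   { n["dup_mz"]++; next }
        p > 0.354  { n["first"]++;  next }
        p > 0.177  { n["second"]++; next }
        p > 0.0884 { n["third"]++;  next }
                   { n["below_third"]++ }
        END {
            printf "dup_mz %d\nfirst_degree %d\nsecond_degree %d\n",
                   n["dup_mz"], n["first"], n["second"]
            printf "third_degree %d\nbelow_third %d\n",
                   n["third"], n["below_third"]
        }
    '
}

# Usage (input path from the command log):
#   relatedness_counts < ~/v2/ibd/UZB_v2_IBD.genome
```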

Note on duplicates: The post-QC dataset has 0 duplicate/MZ pairs, as duplicates were already removed during quality control (Step 7, alongside the --mind 0.05 sample-missingness filter). Post-QC sample count: N = 1,047.

6. PI_HAT Distribution

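The vast majority of pairs (98.8%) share < 5% of their genome IBD, as expected for unrelated individuals. Because --min 0.05 writes only pairs with PI_HAT ≥ 0.05, the run on LD-pruned variants (88.7K SNPs) reported 6,368 pairs; the remaining 541,213 of the 547,581 possible pairs fall below that threshold: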

PI_HAT distribution bins

PI_HAT bin	Pairs	% of total
< 0.05	541,213	98.8%
0.05–0.10	6,128	1.12%
0.10–0.20	236	0.043%
0.20–0.50	3	0.0005%
> 0.50	1	0.0002%

Implications for GWAS: With only 5 pairs at 2nd degree or closer and 423 3rd-degree pairs, cryptic relatedness is minimal in the post-QC dataset. Standard GRM-based mixed models (BOLT-LMM, SAIGE) remain recommended for the planned pregnancy loss GWAS, but aggressive kinship filtering is unlikely to be necessary.

7. ADMIXTURE × ROH Cross-Reference

Using ADMIXTURE K=2 ancestry proportions (Q1: European-like component, mean 0.650; Q2: East Asian-like component, mean 0.350), we can examine whether autozygosity varies by ancestry proportion. In admixed populations, individuals with more homogeneous ancestry (one dominant component) may show elevated homozygosity due to assortative mating within subgroups.

Expected pattern: In a single-pulse admixture, F_ROH should be somewhat elevated at both extremes of the ancestry distribution (Q1 < 0.3 or Q1 > 0.8) where individuals are genetically more homogeneous, and lower in the middle where recombination between ancestral haplotypes breaks up long homozygous blocks. Deviations suggest ongoing substructure or community-level endogamy.
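One way to check the expected pattern is to compare mean F_ROH across ancestry bins. The sketch below assumes a hypothetical three-column input (sample ID, Q1, F_ROH) with no header, which would have to be built by joining the ADMIXTURE output to the per-sample F_ROH values; the bin edges 0.3 and 0.8 come from the paragraph above, while the function name `froh_by_ancestry_bin` and the bin labels are illustrative.

```shell
# Read "IID Q1 F_ROH" lines on stdin and print "bin n mean_F_ROH"
# for Q1 < 0.3, 0.3 <= Q1 <= 0.8, and Q1 > 0.8.
froh_by_ancestry_bin() {
    awk '
        { q = $2 + 0; f = $3 + 0 }
        q < 0.3              { bin = "low_Q1" }
        q >= 0.3 && q <= 0.8 { bin = "mid_Q1" }
        q > 0.8              { bin = "high_Q1" }
        { sum[bin] += f; n[bin]++ }
        END {
            split("low_Q1 mid_Q1 high_Q1", order, " ")
            for (i = 1; i <= 3; i++) {
                b = order[i]
                if (n[b] > 0) printf "%s %d %.4f\n", b, n[b], sum[b] / n[b]
            }
        }
    '
}

# Usage (file name is hypothetical):
#   froh_by_ancestry_bin < UZB_v2_Q1_FROH.txt
```

Bin means are a coarse summary; with few samples in the extreme bins, a single highly consanguineous individual can dominate the mean, so medians or a scatter plot may be a useful complement.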

8. Clinical Relevance for Pregnancy Loss

Clinical context: Elevated autozygosity (F_ROH) has been consistently associated with adverse reproductive outcomes including recurrent pregnancy loss (RPL), stillbirth, and congenital anomalies. The mechanisms include:

• Increased homozygosity for recessive lethal or sub-lethal alleles
• Reduced heterozygosity at HLA genes → impaired maternal–fetal immune tolerance
• Increased burden of damaging homozygous variants in developmentally critical genes

Key takeaways for the Uzbek cohort

Population-level

• Median F_ROH = 0.015, comparable to outbred European and South Asian populations
• ~2.7% of individuals (28 of 1,047) have F_ROH > 0.0625, suggesting 1st-cousin-level parental consanguinity
• Minimal cryptic relatedness (5 pairs at 2nd degree or closer in 1,047 samples)
• The consanguineous tail is consistent with a historical preference for endogamous marriages in Uzbek communities

GWAS design implications

• Covariates required: F_ROH should be included as a covariate in pregnancy loss GWAS to control for genome-wide recessive burden
• Kinship filter: Remove one individual from each pair with PI_HAT > 0.125, or use GRM-based mixed models
• ROH-enriched genes: Loci consistently within ROH across affected women may harbor recessive pregnancy loss genes
• Stratification: Consider analyzing high-F_ROH and low-F_ROH groups separately
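The kinship filter can be sketched as a greedy pass over the pairwise IBD table. This is an illustration under stated assumptions, not the pipeline's method: it assumes the PLINK `.genome` column order (FID1, IID1, FID2, IID2, ..., PI_HAT in column 10), always drops the second member of an unresolved pair, and does not try to minimise the number of samples removed. The function name `kinship_remove_list` and the file names in the usage comment are hypothetical.

```shell
# Read a .genome table on stdin. For each pair with PI_HAT above the cut-off
# (default 0.125) in which neither sample has been dropped yet, drop the
# second sample. Prints "FID IID" lines suitable for plink --remove.
kinship_remove_list() {
    cutoff="${1:-0.125}"
    awk -v cutoff="$cutoff" '
        NR == 1 { next }                 # skip the header row
        $10 + 0 > cutoff + 0 && !(($1, $2) in drop) && !(($3, $4) in drop) {
            drop[$3, $4] = 1
            print $3, $4
        }
    '
}

# Usage (output names are hypothetical):
#   kinship_remove_list 0.125 < ~/v2/ibd/UZB_v2_IBD.genome > related_to_remove.txt
#   plink --bfile ~/v2/roh/UZB_v2_for_roh --remove related_to_remove.txt \
#         --make-bed --out UZB_v2_unrelated
```

Dropping the sample with the lower call rate or the less informative phenotype from each pair may be preferable in practice; the greedy rule here is only the simplest choice.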

F_ROH vs. published populations

Population	Typical median F_ROH	Relative to UZB
UK Biobank (British)	0.008–0.012	1.3–1.9× lower
South Asian (1000G SAS)	0.015–0.025	Similar to 1.7× higher
Uzbek (this study)	0.015	Reference
Qatar / Saudi populations	0.040–0.060	2.7–4× higher
Isolated populations (e.g., Amish)	0.060–0.120	4–8× higher

9. Methods

Software: PLINK v1.9.0-b.7.7 (64-bit, 515 GB RAM workstation)
Input (ROH): UZB_v2_qc BED/BIM/FAM (1,047 × 5,405,898 SNPs, post-QC)
Input (IBD): UZB_v2_ldpruned BED/BIM/FAM (1,047 × 88,722 SNPs, LD-pruned)
F_ROH formula: Σ(ROH length in kb) / 2,881,033 kb
IBD method: PLINK method-of-moments estimator (--genome flag)
Degree thresholds: Duplicates > 0.98, 1st > 0.354, 2nd > 0.177, 3rd > 0.0884

Command log

# ROH analysis
plink --bfile ~/v2/roh/UZB_v2_for_roh \
      --homozyg \
      --homozyg-window-snp 50 \
      --homozyg-snp 50 \
      --homozyg-kb 1000 \
      --homozyg-density 50 \
      --homozyg-gap 1000 \
      --homozyg-window-het 1 \
      --homozyg-window-missing 5 \
      --homozyg-window-threshold 0.05 \
      --out ~/v2/roh/UZB_v2_ROH

# Output: 36,702 ROH segments across 1,047 individuals
# Files: UZB_v2_ROH.hom, UZB_v2_ROH.hom.indiv, UZB_v2_ROH.hom.summary

# IBD analysis
plink --bfile ~/v2/plink/UZB_v2_ldpruned \
      --genome \
      --min 0.05 \
      --out ~/v2/ibd/UZB_v2_IBD

# Output: 6,368 pairs with PI_HAT >= 0.05

Output files (on server)

File	Description	Size
`UZB_v2_ROH.hom`	All 36,702 individual ROH segments with coordinates	~4 MB
`UZB_v2_ROH.hom.indiv`	Per-individual ROH summary (N_ROH, total_KB, avg_KB)	~77 KB
`UZB_v2_ROH.hom.summary`	Per-SNP ROH frequency across individuals	~18 MB
`ConvSK_mind20_ibd.genome`	All pairwise IBD estimates	~90 MB

Step 15 of 15 • ALSU Genotyping Analysis Pipeline • March 2026
