← Back to Roadmap

Step 13: LD Analysis

Linkage disequilibrium clumping of PBS candidates and genome-wide LD decay estimation

Completed — March 2026

1. Overview

To determine whether the 8 PBS candidate SNPs from Step 10 represent independent signals (rather than being in linkage disequilibrium with each other), we perform LD clumping. Additionally, we estimate genome-wide LD decay to characterize the extent of haplotype structure in the Uzbek population.

Goal: (1) Identify how many of the 8 PBS candidates are truly independent signals, and (2) characterize the LD landscape of the Uzbek population for future fine-mapping analyses.

2. Prerequisites

InputSourceDescription
pbs_candidates.json Step 10 8 PBS candidate SNPs
UZB_1kG_merged.{bed,bim,fam} Step 8 Merged dataset for LD reference (3,595 samples × 77,111 SNPs)
UZB_v2_qc.{bed,bim,fam} Step 6 Uzbek QC-passed dataset for LD decay (1,047 samples × 5.41M SNPs)

3. Pipeline

3.1 LD Clumping (Independent Loci)

Use PLINK’s clumping algorithm to group correlated PBS candidates. SNPs within ±1000 kb with r² ≥ 0.5 are merged into the same locus, with the highest-PBS SNP retained as the lead variant:

# Create proxy p-value file (higher PBS = smaller pseudo-p-value) # This allows PLINK's clumping to retain the highest-PBS variant as lead # --clump FILE: LD-based clumping — groups correlated SNPs into independent loci # --clump-snp-field / --clump-field: column headers in FILE for SNP ID and p-value # --clump-p1 1: p-value threshold for index (lead) SNPs (1 = accept all) # --clump-p2 1: p-value threshold for proxy SNPs (1 = accept all) # --clump-r2 0.5: r² threshold — SNPs with r² ≥ 0.5 to lead are grouped into same locus # --clump-kb 1000: physical distance window (±1000 kb) for LD lookup around each lead plink --bfile UZB_1kG_merged \ --keep keep_UZB.txt \ --clump pbs_clump_pval.txt \ --clump-snp-field SNP \ --clump-field P \ --clump-p1 1 \ --clump-p2 1 \ --clump-r2 0.5 \ --clump-kb 1000 \ --allow-no-sex \ --out pbs_clumped
8 variants loaded from pbs_clump_pval.txt. 8 clumps formed from 8 top variants. Results written to pbs_clumped.clumped.

3.2 Pairwise LD Among Candidates

Compute pairwise r² among all 8 PBS candidates to verify independence. Only pairs with r² ≥ 0.1 within ±5000 kb are reported:

# --r2: compute pairwise LD (squared correlation of allele counts) # --ld-window-kb 5000: max physical distance between pairs (5 Mb) # --ld-window 99999: max SNP-index distance (effectively unlimited) # --ld-window-r2 0.1: minimum r² to include in output (filters noise) plink --bfile UZB_1kG_merged \ --keep keep_UZB.txt \ --extract pbs_candidate_snps.txt \ --r2 \ --ld-window-kb 5000 \ --ld-window 99999 \ --ld-window-r2 0.1 \ --allow-no-sex \ --out pbs_ld
Result: 0 pairs with r² ≥ 0.1 — all 8 PBS candidates are in independent loci.

3.3 Genome-Wide LD Decay

To characterize background LD in the Uzbek population, we sample 3,000 random SNPs from the full QC-passed dataset and compute pairwise r² up to ±2 Mb, then bin into 50 kb windows:

# Sample 3,000 SNPs (reproducible seed) # shuf -n 3000: randomly select 3,000 lines from stdin # --random-source=<(yes 42): deterministic pseudo-random seed for reproducibility # (process substitution feeds infinite "42\n" as entropy source to shuf) awk '{print $2}' UZB_v2_qc.bim | shuf -n 3000 --random-source=<(yes 42) > decay_snps.txt # Compute pairwise r² (all pairs within 2 Mb, report all r² including 0) # --ld-window-r2 0.0: report ALL pairs including r²=0 (needed for unbiased decay curve) plink --bfile UZB_v2_qc \ --extract decay_snps.txt \ --r2 \ --ld-window-kb 2000 \ --ld-window 99999 \ --ld-window-r2 0.0 \ --allow-no-sex \ --out ld_decay
# Bin into 50kb windows (same-chromosome pairs only) # PLINK .ld columns: $1=CHR_A $2=BP_A $3=SNP_A $4=CHR_B $5=BP_B $6=SNP_B $7=R2 awk 'NR>1 { # NR>1: skip header row dist = ($5 > $2) ? ($5 - $2) : ($2 - $5); # absolute distance in bp if ($1 == $4) { # same-chromosome pairs only bin = int(dist / 50000) * 50; # 50 kb bins, labeled in kb sum[bin] += $7; # accumulate r² values count[bin]++; } } END { for (b in sum) printf "%d\t%.6f\t%d\n", b, sum[b]/count[b], count[b] }' ld_decay.ld | sort -n > ld_decay_binned.tsv

4. Results

4.1 Independent Loci

#SNPChrPositionPBSUZBMAFUZB
Interpretation: All 8 PBS candidates reside on separate chromosomes or are far enough apart (with r² < 0.1) to represent independent loci. This rules out a single large-effect haplotype driving multiple apparent signals.

4.2 LD Decay

Mean r² by genomic distance (50 kb bins), computed from 3,000 sampled SNPs in the Uzbek QC-passed dataset (1,047 samples):

Distance (kb)Mean r²Pair Count
LD Decay Summary: r² drops from 0.175 at 0–50 kb to 0.040 at 50–100 kb (a 4.4-fold drop), reaching background levels (~0.002) by 500 kb. This rapid decay is consistent with the admixed Central Asian population history, where multiple ancestral sources contribute heterogeneous haplotype blocks.

5. Output Files

FileDescription
pbs_clumped.clumpedPLINK clump output (independent lead variants + proxy lists)
pbs_ld.ldPairwise r² among PBS candidate SNPs
ld_decay.ldRaw pairwise r² for 3,000 sampled SNPs
ld_decay_binned.tsvBinned LD decay curve (50 kb windows)
ld_data.jsonAll LD results compiled for visualization

6. Next Steps