1. Overview
To determine whether the 8 PBS candidate SNPs from Step 10 represent
independent signals (rather than being in linkage disequilibrium with each other), we perform
LD clumping. Additionally, we estimate genome-wide LD decay to characterize the extent of
haplotype structure in the Uzbek population.
Goal: (1) Identify how many of the 8 PBS candidates are truly independent signals,
and (2) characterize the LD landscape of the Uzbek population for future fine-mapping analyses.
2. Prerequisites
| Input | Source | Description |
pbs_candidates.json |
Step 10 |
8 PBS candidate SNPs |
UZB_1kG_merged.{bed,bim,fam} |
Step 8 |
Merged dataset for LD reference (3,595 samples × 77,111 SNPs) |
UZB_v2_qc.{bed,bim,fam} |
Step 6 |
Uzbek QC-passed dataset for LD decay (1,047 samples × 5.41M SNPs) |
3. Pipeline
3.1 LD Clumping (Independent Loci)
Use PLINK’s clumping algorithm to group correlated PBS candidates. SNPs within
±1000 kb with r² ≥ 0.5 are merged into the same locus, with the highest-PBS
SNP retained as the lead variant:
# Create proxy p-value file (higher PBS = smaller pseudo-p-value)
# This allows PLINK's clumping to retain the highest-PBS variant as lead
# --clump FILE: LD-based clumping — groups correlated SNPs into independent loci
# --clump-snp-field / --clump-field: column headers in FILE for SNP ID and p-value
# --clump-p1 1: p-value threshold for index (lead) SNPs (1 = accept all)
# --clump-p2 1: p-value threshold for proxy SNPs (1 = accept all)
# --clump-r2 0.5: r² threshold — SNPs with r² ≥ 0.5 to lead are grouped into same locus
# --clump-kb 1000: physical distance window (±1000 kb) for LD lookup around each lead
plink --bfile UZB_1kG_merged \
--keep keep_UZB.txt \
--clump pbs_clump_pval.txt \
--clump-snp-field SNP \
--clump-field P \
--clump-p1 1 \
--clump-p2 1 \
--clump-r2 0.5 \
--clump-kb 1000 \
--allow-no-sex \
--out pbs_clumped
8 variants loaded from pbs_clump_pval.txt.
8 clumps formed from 8 top variants.
Results written to pbs_clumped.clumped.
3.2 Pairwise LD Among Candidates
Compute pairwise r² among all 8 PBS candidates to verify independence. Only pairs with
r² ≥ 0.1 within ±5000 kb are reported:
# --r2: compute pairwise LD (squared correlation of allele counts)
# --ld-window-kb 5000: max physical distance between pairs (5 Mb)
# --ld-window 99999: max SNP-index distance (effectively unlimited)
# --ld-window-r2 0.1: minimum r² to include in output (filters noise)
plink --bfile UZB_1kG_merged \
--keep keep_UZB.txt \
--extract pbs_candidate_snps.txt \
--r2 \
--ld-window-kb 5000 \
--ld-window 99999 \
--ld-window-r2 0.1 \
--allow-no-sex \
--out pbs_ld
Result: 0 pairs with r² ≥ 0.1 — all 8 PBS candidates are in independent loci.
3.3 Genome-Wide LD Decay
To characterize background LD in the Uzbek population, we sample 3,000 random SNPs from
the full QC-passed dataset and compute pairwise r² up to ±2 Mb, then bin into
50 kb windows:
# Sample 3,000 SNPs (reproducible seed)
# shuf -n 3000: randomly select 3,000 lines from stdin
# --random-source=<(yes 42): deterministic pseudo-random seed for reproducibility
# (process substitution feeds infinite "42\n" as entropy source to shuf)
awk '{print $2}' UZB_v2_qc.bim | shuf -n 3000 --random-source=<(yes 42) > decay_snps.txt
# Compute pairwise r² (all pairs within 2 Mb, report all r² including 0)
# --ld-window-r2 0.0: report ALL pairs including r²=0 (needed for unbiased decay curve)
plink --bfile UZB_v2_qc \
--extract decay_snps.txt \
--r2 \
--ld-window-kb 2000 \
--ld-window 99999 \
--ld-window-r2 0.0 \
--allow-no-sex \
--out ld_decay
# Bin into 50kb windows (same-chromosome pairs only)
# PLINK .ld columns: $1=CHR_A $2=BP_A $3=SNP_A $4=CHR_B $5=BP_B $6=SNP_B $7=R2
awk 'NR>1 { # NR>1: skip header row
dist = ($5 > $2) ? ($5 - $2) : ($2 - $5); # absolute distance in bp
if ($1 == $4) { # same-chromosome pairs only
bin = int(dist / 50000) * 50; # 50 kb bins, labeled in kb
sum[bin] += $7; # accumulate r² values
count[bin]++;
}
} END {
for (b in sum) printf "%d\t%.6f\t%d\n", b, sum[b]/count[b], count[b]
}' ld_decay.ld | sort -n > ld_decay_binned.tsv
4. Results
4.1 Independent Loci
| # | SNP | Chr | Position | PBSUZB | MAFUZB |
Interpretation: All 8 PBS candidates reside on separate chromosomes or are
far enough apart (with r² < 0.1) to represent independent loci. This rules out a single
large-effect haplotype driving multiple apparent signals.
4.2 LD Decay
Mean r² by genomic distance (50 kb bins), computed from 3,000 sampled SNPs in the
Uzbek QC-passed dataset (1,047 samples):
| Distance (kb) | Mean r² | Pair Count |
LD Decay Summary: r² drops from 0.175 at 0–50 kb to 0.040 at 50–100 kb
(a 4.4-fold drop), reaching background levels (~0.002) by 500 kb. This rapid decay is consistent
with the admixed Central Asian population history, where multiple ancestral sources contribute
heterogeneous haplotype blocks.
5. Output Files
| File | Description |
pbs_clumped.clumped | PLINK clump output (independent lead variants + proxy lists) |
pbs_ld.ld | Pairwise r² among PBS candidate SNPs |
ld_decay.ld | Raw pairwise r² for 3,000 sampled SNPs |
ld_decay_binned.tsv | Binned LD decay curve (50 kb windows) |
ld_data.json | All LD results compiled for visualization |
6. Next Steps