Step 13: LD Analysis - ALSU Pipeline

1. Overview

To determine whether the 8 PBS candidate SNPs from Step 10 represent independent signals (rather than being in linkage disequilibrium with each other), we perform LD clumping. Additionally, we estimate genome-wide LD decay to characterize the extent of haplotype structure in the Uzbek population.

Goal: (1) Identify how many of the 8 PBS candidates are truly independent signals, and (2) characterize the LD landscape of the Uzbek population for future fine-mapping analyses.

2. Prerequisites

Input	Source	Description
`pbs_candidates.json`	Step 10	8 PBS candidate SNPs
`UZB_1kG_merged.{bed,bim,fam}`	Step 8	Merged dataset for LD reference (3,595 samples × 77,111 SNPs)
`UZB_v2_qc.{bed,bim,fam}`	Step 6	Uzbek QC-passed dataset for LD decay (1,047 samples × 5.41M SNPs)

3. Pipeline

3.1 LD Clumping (Independent Loci)

Use PLINK’s clumping algorithm to group correlated PBS candidates. SNPs within ±1000 kb with r² ≥ 0.5 are merged into the same locus, with the highest-PBS SNP retained as the lead variant:

# Create proxy p-value file (higher PBS = smaller pseudo-p-value)
# This allows PLINK's clumping to retain the highest-PBS variant as lead

# --clump FILE: LD-based clumping — groups correlated SNPs into independent loci
# --clump-snp-field / --clump-field: column headers in FILE for SNP ID and p-value
# --clump-p1 1: p-value threshold for index (lead) SNPs (1 = accept all)
# --clump-p2 1: p-value threshold for proxy SNPs (1 = accept all)
# --clump-r2 0.5: r² threshold — SNPs with r² ≥ 0.5 to lead are grouped into same locus
# --clump-kb 1000: physical distance window (±1000 kb) for LD lookup around each lead
plink --bfile UZB_1kG_merged \
      --keep keep_UZB.txt \
      --clump pbs_clump_pval.txt \
      --clump-snp-field SNP \
      --clump-field P \
      --clump-p1 1 \
      --clump-p2 1 \
      --clump-r2 0.5 \
      --clump-kb 1000 \
      --allow-no-sex \
      --out pbs_clumped

8 variants loaded from pbs_clump_pval.txt. 8 clumps formed from 8 top variants. Results written to pbs_clumped.clumped.

3.2 Pairwise LD Among Candidates

Compute pairwise r² among all 8 PBS candidates to verify independence. Only pairs with r² ≥ 0.1 within ±5000 kb are reported:

# --r2: compute pairwise LD (squared correlation of allele counts)
# --ld-window-kb 5000: max physical distance between pairs (5 Mb)
# --ld-window 99999: max SNP-index distance (effectively unlimited)
# --ld-window-r2 0.1: minimum r² to include in output (filters noise)
plink --bfile UZB_1kG_merged \
      --keep keep_UZB.txt \
      --extract pbs_candidate_snps.txt \
      --r2 \
      --ld-window-kb 5000 \
      --ld-window 99999 \
      --ld-window-r2 0.1 \
      --allow-no-sex \
      --out pbs_ld

Result: 0 pairs with r² ≥ 0.1 — all 8 PBS candidates are in independent loci.

3.3 Genome-Wide LD Decay

To characterize background LD in the Uzbek population, we sample 3,000 random SNPs from the full QC-passed dataset and compute pairwise r² up to ±2 Mb, then bin into 50 kb windows:

# Sample 3,000 SNPs (reproducible seed)
# shuf -n 3000: randomly select 3,000 lines from stdin
# --random-source=<(yes 42): deterministic pseudo-random seed for reproducibility
#   (process substitution feeds infinite "42\n" as entropy source to shuf)
awk '{print $2}' UZB_v2_qc.bim | shuf -n 3000 --random-source=<(yes 42) > decay_snps.txt

# Compute pairwise r² (all pairs within 2 Mb, report all r² including 0)
# --ld-window-r2 0.0: report ALL pairs including r²=0 (needed for unbiased decay curve)
plink --bfile UZB_v2_qc \
      --extract decay_snps.txt \
      --r2 \
      --ld-window-kb 2000 \
      --ld-window 99999 \
      --ld-window-r2 0.0 \
      --allow-no-sex \
      --out ld_decay

# Bin into 50kb windows (same-chromosome pairs only)
# PLINK .ld columns: $1=CHR_A  $2=BP_A  $3=SNP_A  $4=CHR_B  $5=BP_B  $6=SNP_B  $7=R2
awk 'NR>1 {                                    # NR>1: skip header row
    dist = ($5 > $2) ? ($5 - $2) : ($2 - $5);  # absolute distance in bp
    if ($1 == $4) {                             # same-chromosome pairs only
        bin = int(dist / 50000) * 50;           # 50 kb bins, labeled in kb
        sum[bin] += $7;                         # accumulate r² values
        count[bin]++;
    }
} END {
    for (b in sum) printf "%d\t%.6f\t%d\n", b, sum[b]/count[b], count[b]
}' ld_decay.ld | sort -n > ld_decay_binned.tsv

4. Results

4.1 Independent Loci

#	SNP	Chr	Position	PBS_UZB	MAF_UZB

Interpretation: All 8 PBS candidates reside on separate chromosomes or are far enough apart (with r² < 0.1) to represent independent loci. This rules out a single large-effect haplotype driving multiple apparent signals.

4.2 LD Decay

Mean r² by genomic distance (50 kb bins), computed from 3,000 sampled SNPs in the Uzbek QC-passed dataset (1,047 samples):

Distance (kb)	Mean r²	Pair Count

LD Decay Summary: r² drops from 0.175 at 0–50 kb to 0.040 at 50–100 kb (a 4.4-fold drop), reaching background levels (~0.002) by 500 kb. This rapid decay is consistent with the admixed Central Asian population history, where multiple ancestral sources contribute heterogeneous haplotype blocks.

5. Output Files

File	Description
`pbs_clumped.clumped`	PLINK clump output (independent lead variants + proxy lists)
`pbs_ld.ld`	Pairwise r² among PBS candidate SNPs
`ld_decay.ld`	Raw pairwise r² for 3,000 sampled SNPs
`ld_decay_binned.tsv`	Binned LD decay curve (50 kb windows)
`ld_data.json`	All LD results compiled for visualization

6. Next Steps

Step 14: Fst & MDS — Pairwise F_ST matrix and multidimensional scaling across all populations.
Step 15: ROH & IBD — Runs of homozygosity and identity-by-descent analysis.