
Step 10: Multi-Population PBS Analysis

Population Branch Statistic to identify Uzbek-specific allele frequency shifts

✓ Spring 2026 — April 11, 2026 Completed — March 2026

1. Overview

The Population Branch Statistic (PBS) measures allele frequency divergence along a specific population lineage relative to two outgroup populations. By computing PBS for the Uzbek branch of a three-population tree (UZB–EUR–EAS), we identify SNPs where the Uzbek population has experienced unusually large frequency shifts — potential signatures of local adaptation or genetic drift.

Spring 2026 PBS Results

79,767 SNPs analyzed (vs 77,111 in winter). Pop sizes: UZB=1,047, EUR=522, EAS=515, SAS=492, AFR=671.
Metric | Spring 2026 | Winter 2025
SNPs analyzed | 79,767 | 77,111
Mean PBS_UZB | −0.01001 | −0.00979
Median PBS_UZB | −0.00604 | −0.00601
Stdev | 0.02677 | 0.02648
Tier 1 (PBS ≥ 0.3) | 8 | 8
Tier 2 (significant) | 4,995 | 1
Spring vs Winter: Core PBS statistics are nearly identical. Same 8 Tier 1 SNPs detected. Tier 2 difference reflects different tier criteria between runs (spring uses ΔAF≥0.3 threshold).

Goal: Identify SNPs with elevated PBS scores (Uzbek-specific allele frequency changes) and classify them by tier: high PBS (≥0.3), large absolute frequency difference (ΔAF ≥0.3 vs all populations), or near-private alleles (UZB MAF ≥5%, all others ≤1%).
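The three tiers can be sketched as a small classifier. This is a minimal illustration, not the project's script: the function name classify_tiers, the dictionary layout, and the reading of "ΔAF vs all populations" as the smallest absolute gap between UZB and each other population are assumptions. Frequencies are taken to refer to the same allele in every population, and the near-private test is applied to that allele and to its complement.

```python
from typing import Dict, List


def classify_tiers(pbs: float, af: Dict[str, float], target: str = "UZB") -> List[int]:
    """Return the tiers (1, 2, 3) a SNP falls into.

    af maps population code -> frequency of one shared allele.
    """
    others = [p for p in af if p != target]

    # Assumed definition: smallest absolute gap between target and any other population
    min_delta_af = min(abs(af[target] - af[p]) for p in others)

    def near_private(freqs: Dict[str, float]) -> bool:
        # Common in the target (>= 5%), rare (<= 1%) in every other population
        return freqs[target] >= 0.05 and all(freqs[p] <= 0.01 for p in others)

    complement = {p: 1.0 - f for p, f in af.items()}

    tiers = []
    if pbs >= 0.3:
        tiers.append(1)
    if min_delta_af >= 0.3:
        tiers.append(2)
    if near_private(af) or near_private(complement):
        tiers.append(3)
    return tiers


# Made-up frequencies, for illustration only
example_divergent = classify_tiers(
    0.40, {"UZB": 0.48, "EUR": 0.10, "EAS": 0.12, "SAS": 0.15, "AFR": 0.05}
)
example_private = classify_tiers(
    0.02, {"UZB": 0.06, "EUR": 0.005, "EAS": 0.0, "SAS": 0.01, "AFR": 0.0}
)
```

A SNP can satisfy more than one tier, so the sketch returns a list rather than a single label.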

2. Prerequisites

Source | File(s) | Description
Merged Dataset | UZB_1kG_merged.{bed,bim,fam} | 3,595 samples × 77,111 LD-pruned SNPs (from Step 8)
Population Mapping | pop_mapping.txt | Sample-to-superpopulation assignments (from Step 8)

Population Panel

Population | Code | N | Source
Uzbek cohort | UZB | 1,047 | ALSU QC-passed set
European | EUR | 522 | 1000 Genomes Phase 3
East Asian | EAS | 515 | 1000 Genomes Phase 3
South Asian | SAS | 492 | 1000 Genomes Phase 3
African | AFR | 671 | 1000 Genomes Phase 3

3. Pipeline

3.1 Build Population Cluster File

Create PLINK cluster assignments from the population mapping and the merged FAM file:

# Build PLINK cluster file: FID IID CLUSTER
awk 'NR==FNR {pop[$1]=$2; next}
{
  fid=$1; iid=$2;
  if (iid in pop) p=pop[iid];
  else if (fid in pop) p=pop[fid];
  else p="UZB";
  print fid, iid, p
}' pop_mapping.txt UZB_1kG_merged.fam > clusters.txt

# Verify population counts
awk '{print $3}' clusters.txt | sort | uniq -c | sort -rn

1047 UZB
 671 AFR
 522 EUR
 515 EAS
 492 SAS
 348 AMR
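The awk script labels any sample missing from pop_mapping.txt as UZB, so a typo in the mapping would silently inflate the UZB count. The sketch below is intended to mirror the same lookup order (IID first, then FID) while recording every sample that fell through to the default. The function name assign_clusters and the sample IDs are hypothetical, and it assumes the mapping file holds the sample ID in column 1 and the population in column 2, as the awk script does. The 348 AMR samples shown in the counts are present in the merged set but are not used by the later steps.

```python
from collections import Counter


def assign_clusters(mapping_lines, fam_lines, default="UZB"):
    """Assign a population to each FAM row and list the rows that used the default."""
    pop = {}
    for line in mapping_lines:
        fields = line.split()
        if len(fields) >= 2:
            pop[fields[0]] = fields[1]

    rows, defaulted = [], []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        fid, iid = fields[0], fields[1]
        if iid in pop:
            label = pop[iid]
        elif fid in pop:
            label = pop[fid]
        else:
            label = default
            defaulted.append((fid, iid))  # candidates for manual review
        rows.append((fid, iid, label))
    return rows, defaulted


# Made-up sample IDs, for illustration only
mapping = ["S1 EUR", "S2 EAS"]
fam = ["F1 S1 0 0 0 -9", "S2 X9 0 0 0 -9", "F3 S3 0 0 0 -9"]
rows, defaulted = assign_clusters(mapping, fam)
counts = Counter(label for _, _, label in rows)
```

Comparing len(defaulted) with the expected cohort size (1,047) is a quick way to confirm that only genuine Uzbek samples took the fallback label.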

3.2 Extract Per-Population Sample Lists

for POP in UZB EUR EAS SAS AFR; do
  awk -v p="$POP" '$3==p {print $1, $2}' clusters.txt > keep_${POP}.txt
  echo "$POP: $(wc -l < keep_${POP}.txt) samples"
done

3.3 Compute Per-Population Allele Frequencies

for POP in UZB EUR EAS SAS AFR; do
  plink --bfile UZB_1kG_merged \
    --keep keep_${POP}.txt \
    --freq \
    --out freq_${POP} \
    --allow-no-sex --silent
done
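When the .frq files are compared across populations, one point worth checking is allele orientation: as far as I understand PLINK 1.9's default behaviour, A1 is set to the minor allele of the samples kept in each run, so the same SNP can report A1/A2 swapped between populations unless an option such as --keep-allele-order is used. The sketch below assumes the usual .frq column order (CHR SNP A1 A2 MAF NCHROBS) and re-expresses another population's frequency in terms of the reference population's A1 before taking ΔAF. The helper names parse_frq and a1_freq_in and the example rows are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Freq = Tuple[str, str, float]  # (A1, A2, frequency of A1)


def parse_frq(lines: Iterable[str]) -> Dict[str, Freq]:
    """Parse whitespace-delimited .frq rows into {SNP: (A1, A2, freq of A1)}."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 6 or fields[0] == "CHR":
            continue  # header or malformed row
        _chrom, snp, a1, a2, maf = fields[:5]
        if maf == "NA":
            continue  # no frequency available for this SNP
        table[snp] = (a1, a2, float(maf))
    return table


def a1_freq_in(ref: Freq, other: Freq) -> Optional[float]:
    """Frequency in `other` of the allele that `ref` calls A1 (None if alleles disagree)."""
    ref_a1, ref_a2, _ = ref
    a1, a2, freq = other
    if (a1, a2) == (ref_a1, ref_a2):
        return freq
    if (a1, a2) == (ref_a2, ref_a1):
        return 1.0 - freq  # same SNP, A1/A2 swapped
    return None


# Made-up rows, for illustration only
uzb = parse_frq(["CHR SNP A1 A2 MAF NCHROBS", "1 rs1 A G 0.48 2094"])
eur = parse_frq(["CHR SNP A1 A2 MAF NCHROBS", "1 rs1 G A 0.40 1044"])
eur_freq_of_uzb_a1 = a1_freq_in(uzb["rs1"], eur["rs1"])
delta_af = abs(uzb["rs1"][2] - eur_freq_of_uzb_a1)
```

In this made-up case the raw MAF columns differ by 0.08, while the aligned frequencies differ by 0.12, which is why the alignment step matters for a ΔAF threshold.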

3.4 Pairwise Per-SNP FST

Compute per-SNP Weir & Cockerham FST for the PBS triangle (UZB–EUR, UZB–EAS, EUR–EAS) plus two additional pairs (UZB–SAS, UZB–AFR) for ΔAF context:

# Function to compute pairwise FST
compute_fst() {
  POP1=$1; POP2=$2
  cat keep_${POP1}.txt keep_${POP2}.txt > keep_${POP1}_${POP2}.txt
  awk -v p="$POP1" '{print $1, $2, p}' keep_${POP1}.txt > within_${POP1}_${POP2}.txt
  awk -v p="$POP2" '{print $1, $2, p}' keep_${POP2}.txt >> within_${POP1}_${POP2}.txt
  plink --bfile UZB_1kG_merged \
    --keep keep_${POP1}_${POP2}.txt \
    --fst \
    --within within_${POP1}_${POP2}.txt \
    --out fst_${POP1}_${POP2} \
    --allow-no-sex --silent
}

# PBS triangle
compute_fst UZB EUR
compute_fst UZB EAS
compute_fst EUR EAS

# Extra pairs for delta-AF context
compute_fst UZB SAS
compute_fst UZB AFR
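Before PBS can be computed, the three per-SNP .fst files have to be loaded and restricted to SNPs with a usable value in all of them. The sketch below assumes the PLINK 1.9 .fst column order CHR SNP POS NMISS FST and that undefined estimates are written as nan; both should be checked against the actual files. The helper names parse_fst and shared_snps and the example rows are hypothetical.

```python
import math
from typing import Dict, Iterable, List


def parse_fst(lines: Iterable[str]) -> Dict[str, float]:
    """Parse per-SNP .fst rows into {SNP: FST}, dropping undefined values."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or fields[0] == "CHR":
            continue  # header or malformed row
        try:
            value = float(fields[4])
        except ValueError:
            continue  # non-numeric FST field
        if math.isnan(value):
            continue  # FST undefined, e.g. monomorphic in both populations
        table[fields[1]] = value
    return table


def shared_snps(*tables: Dict[str, float]) -> List[str]:
    """SNP IDs present in every table, sorted for reproducible output."""
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)
    return sorted(common)


# Made-up rows, for illustration only
fst_ue = parse_fst([
    "CHR SNP POS NMISS FST",
    "1 rs1 100 1569 0.02",
    "1 rs2 200 1569 nan",
])
fst_ua = {"rs1": 0.10, "rs3": 0.20}
common = shared_snps(fst_ue, fst_ua)
```

Negative per-SNP estimates are kept here on purpose; they are a normal outcome of the estimator and are clamped later, at the FST-to-T conversion.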

3.5 PBS Calculation (Python)

PBS is derived from pairwise FST values by converting each to a divergence time T = −ln(1 − FST), then computing the branch length for the target (UZB) population:

PBS formula: $\mathrm{PBS}_{\mathrm{UZB}} = \dfrac{T_{\text{UZB-EUR}} + T_{\text{UZB-EAS}} - T_{\text{EUR-EAS}}}{2}$
# Core PBS computation (from 02_calculate_pbs.py)
import math

def fst_to_T(fst_val):
    """Convert FST to divergence time, clamping FST to [0, 0.999]"""
    f = max(0.0, min(fst_val, 0.999))
    return -math.log(1.0 - f)

# For each SNP present in all three FST files:
T_ue = fst_to_T(fst_UZB_EUR[snp])
T_ua = fst_to_T(fst_UZB_EAS[snp])
T_ea = fst_to_T(fst_EUR_EAS[snp])
pbs_uzb = (T_ue + T_ua - T_ea) / 2.0

# Tier classification (maf_others: MAFs of the same SNP in the other populations)
tier1 = pbs_uzb >= 0.3        # Strong PBS
tier2 = min_delta_af >= 0.3   # High delta-AF vs ALL populations
tier3 = (maf_uzb >= 0.05 and                    # Near-private: common in UZB,
         all(m <= 0.01 for m in maf_others))    # rare everywhere else
python3 02_calculate_pbs.py --outdir ./pbs_results

Loading Fst data...
  UZB-EUR: 77,111 SNPs
  UZB-EAS: 77,111 SNPs
  EUR-EAS: 77,111 SNPs
Computing PBS...
Total SNPs analyzed: 77,111

=== PBS SUMMARY ===
n_snps: 77111
mean: -0.009794
median: -0.006012
stdev: 0.026484
min: -0.362111
max: 2.988450
n_pbs_ge_03: 8
n_pbs_ge_01: 13
n_negative: 63414
n_tier1: 8
n_tier2: 1
n_tier3: 0
n_candidates: 8
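For readers who want to reproduce the summary block on their own data, the per-SNP loop and the summary fields can be sketched as follows. This is a self-contained approximation, not the contents of 02_calculate_pbs.py: the function names pbs_for_snps and summarize are hypothetical, the input is assumed to be three {SNP: FST} dictionaries, and the use of the sample standard deviation is a guess at what the script reports.

```python
import math
import statistics
from typing import Dict


def fst_to_T(fst: float, cap: float = 0.999) -> float:
    """Clamp FST to [0, cap], then convert to a divergence time."""
    f = max(0.0, min(fst, cap))
    return -math.log(1.0 - f)


def pbs_for_snps(fst_ue: Dict[str, float],
                 fst_ua: Dict[str, float],
                 fst_ea: Dict[str, float]) -> Dict[str, float]:
    """PBS for the UZB branch at every SNP present in all three inputs."""
    shared = fst_ue.keys() & fst_ua.keys() & fst_ea.keys()
    return {
        snp: (fst_to_T(fst_ue[snp]) + fst_to_T(fst_ua[snp]) - fst_to_T(fst_ea[snp])) / 2.0
        for snp in shared
    }


def summarize(pbs: Dict[str, float]) -> Dict[str, float]:
    """Summary fields named after the script's log output (needs at least 2 SNPs)."""
    values = list(pbs.values())
    return {
        "n_snps": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample stdev; an assumption
        "min": min(values),
        "max": max(values),
        "n_pbs_ge_03": sum(v >= 0.3 for v in values),
        "n_pbs_ge_01": sum(v >= 0.1 for v in values),
        "n_negative": sum(v < 0 for v in values),
    }


# Made-up FST values, for illustration only
toy_ue = {"a": 0.5, "b": 0.0, "c": -0.1}
toy_ua = {"a": 0.5, "b": 0.0, "c": 0.0}
toy_ea = {"a": 0.0, "b": 0.5, "c": 0.0, "d": 0.1}
toy_pbs = pbs_for_snps(toy_ue, toy_ua, toy_ea)
toy_summary = summarize(toy_pbs)
```

With the 0.999 cap, a single T value cannot exceed −ln(0.001) ≈ 6.91, which bounds the PBS any one SNP can reach.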

4. Results

4.1 Pairwise FST (Weighted)

Comparison | Weighted FST | Mean FST
UZB vs EUR | 0.01448 | 0.00997
UZB vs EAS | 0.03929 | 0.02357
EUR vs EAS | 0.08448 | 0.05023
UZB vs SAS | 0.01441 | 0.00980
UZB vs AFR | 0.12930 | 0.06096
Interpretation: UZB is genetically closest to EUR (FST=0.014) and SAS (FST=0.014), moderately distant from EAS (0.039), and most distant from AFR (0.129). This is consistent with the PCA positioning (Step 8) showing Uzbeks intermediate between EUR and SAS.

4.2 PBS Summary Statistics

Metric | Value
SNPs analyzed | 77,111
Mean PBS_UZB | −0.00979
Median PBS_UZB | −0.00601
Standard deviation | 0.02648
Min | −0.36211
Max | 2.98845
PBS ≥ 0.3 (Tier 1) | 8 SNPs
PBS ≥ 0.1 | 13 SNPs
Negative PBS | 63,414 (82.2%)
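The predominance of negative values can be illustrated by plugging the genome-wide weighted FST values from section 4.1 into the PBS formula. This is only a worked example: PBS in this step is computed per SNP, not from genome-wide averages, and the variable names below are hypothetical.

```python
import math

# Weighted FST values taken from the pairwise FST table (section 4.1)
weighted_fst = {
    ("UZB", "EUR"): 0.01448,
    ("UZB", "EAS"): 0.03929,
    ("EUR", "EAS"): 0.08448,
}

# Same transform as the per-SNP pipeline: T = -ln(1 - FST)
T = {pair: -math.log(1.0 - fst) for pair, fst in weighted_fst.items()}

pbs_genomewide = (T[("UZB", "EUR")] + T[("UZB", "EAS")] - T[("EUR", "EAS")]) / 2.0
print(f"T_UZB-EUR + T_UZB-EAS = {T[('UZB', 'EUR')] + T[('UZB', 'EAS')]:.4f}")
print(f"T_EUR-EAS             = {T[('EUR', 'EAS')]:.4f}")
print(f"PBS_UZB (genome-wide) = {pbs_genomewide:.4f}")
```

The EUR–EAS distance exceeds the sum of the two UZB distances, so the UZB branch length comes out at roughly −0.017, the same sign as the per-SNP mean and median. One plausible reading, offered here as an interpretation rather than a result of this step, is that UZB sits between EUR and EAS, so a simple three-branch tree fits poorly and the baseline PBS is slightly below zero.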

4.3 Top PBS Candidates

# | SNP | Chr | Position | PBS_UZB | MAF_UZB | MAF_EUR | MAF_EAS | MAF_SAS | MAF_AFR
Observation: The top 3 SNPs (all on chromosome 12) show extremely high PBS values (>2.5) with very low UZB MAF (~2–3%) compared to high AFR MAF (~49%). These likely reflect ancestral allele retention rather than positive selection. The more biologically interesting candidates are SNPs 5–8 (PBS 0.32–0.53), where UZB shows high MAF (~48%) diverging from other populations.

5. Output Files

File | Description
clusters.txt | Population cluster file (FID IID POP)
freq_{UZB,EUR,EAS,SAS,AFR}.frq | Per-population allele frequencies
fst_{POP1}_{POP2}.fst | Per-SNP pairwise FST (5 pairs)
pbs_all.tsv | PBS scores for all 77,111 SNPs
pbs_candidates.json | Filtered candidate SNPs (Tier 1/2/3)
pbs_stats.json | Summary statistics
pbs_histogram.json | PBS distribution for plotting

6. Next Steps