Step 10: Multi-Population PBS Analysis

1. Overview

The Population Branch Statistic (PBS) measures allele frequency divergence along a specific population lineage relative to two reference populations. By computing PBS for the Uzbek branch of a three-population tree (UZB–EUR–EAS), we identify SNPs where the Uzbek population has experienced unusually large frequency shifts, which are potential signatures of local adaptation or genetic drift.

Spring 2026 PBS Results

79,767 SNPs analyzed (vs 77,111 in the Winter 2025 run). Population sizes: UZB=1,047, EUR=522, EAS=515, SAS=492, AFR=671.

Metric	Spring 2026	Winter 2025
SNPs analyzed	79,767	77,111
Mean PBS_UZB	−0.01001	−0.00979
Median PBS_UZB	−0.00604	−0.00601
Stdev	0.02677	0.02648
Tier 1 (PBS≥0.3)	8	8
Tier 2 (significant)	4,995	1

Spring vs Winter: Core PBS statistics are nearly identical, and the same 8 Tier 1 SNPs are detected in both runs. The Tier 2 difference (4,995 vs 1) reflects different tier criteria between runs (the spring run uses a ΔAF ≥ 0.3 threshold).


Goal: Identify SNPs with elevated PBS scores (Uzbek-specific allele frequency changes) and classify them by tier: high PBS (≥0.3), large absolute frequency difference (ΔAF ≥0.3 vs all populations), or near-private alleles (UZB MAF ≥5%, all others ≤1%).
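The three tiers can be sketched as a small function. This is only an illustration under stated assumptions: the function name `classify_tiers`, its parameters, and the choice to derive MAF from a single allele frequency are hypothetical and are not taken from `02_calculate_pbs.py`. All frequencies are assumed to refer to the same allele in every population.

```python
def classify_tiers(pbs_uzb, af_uzb, af_others,
                   pbs_min=0.3, delta_min=0.3, private_maf=0.05, rare_maf=0.01):
    """Return the set of tiers (1, 2, 3) that a single SNP satisfies.

    af_uzb    -- frequency of one chosen allele in UZB
    af_others -- dict mapping population code to the frequency of the SAME allele
    Thresholds default to the values quoted in the Goal paragraph.
    """
    def maf(af):
        # Minor allele frequency derived from the frequency of either allele.
        return min(af, 1.0 - af)

    tiers = set()

    # Tier 1: high PBS on the UZB branch.
    if pbs_uzb >= pbs_min:
        tiers.add(1)

    # Tier 2: UZB differs from EVERY comparison population, so the smallest
    # absolute frequency difference must clear the threshold.
    min_delta_af = min(abs(af_uzb - af) for af in af_others.values())
    if min_delta_af >= delta_min:
        tiers.add(2)

    # Tier 3: near-private allele, common in UZB and rare everywhere else.
    if maf(af_uzb) >= private_maf and all(maf(af) <= rare_maf
                                          for af in af_others.values()):
        tiers.add(3)

    return tiers


# Illustrative values only (not real SNPs from this analysis).
example = classify_tiers(0.45, 0.48, {"EUR": 0.10, "EAS": 0.12, "SAS": 0.15, "AFR": 0.05})
print(sorted(example))
```

A SNP can fall into more than one tier, which is why the sketch returns a set rather than a single label.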

2. Prerequisites

Source	File(s)	Description
Merged Dataset	`UZB_1kG_merged.{bed,bim,fam}`	3,595 samples × 77,111 LD-pruned SNPs (from Step 8)
Population Mapping	`pop_mapping.txt`	Sample-to-superpopulation assignments (from Step 8)

Population Panel

Population	Code	N	Source
Uzbek cohort	UZB	1,047	ALSU QC-passed set
European	EUR	522	1000 Genomes Phase 3
East Asian	EAS	515	1000 Genomes Phase 3
South Asian	SAS	492	1000 Genomes Phase 3
African	AFR	671	1000 Genomes Phase 3

3. Pipeline

3.1 Build Population Cluster File

Create PLINK cluster assignments from the population mapping and the merged FAM file:

# Build PLINK cluster file: FID IID CLUSTER
awk 'NR==FNR {pop[$1]=$2; next} {
  fid=$1; iid=$2;
  if (iid in pop) p=pop[iid];
  else if (fid in pop) p=pop[fid];
  else p="UZB";
  print fid, iid, p
}' pop_mapping.txt UZB_1kG_merged.fam > clusters.txt

# Verify population counts
awk '{print $3}' clusters.txt | sort | uniq -c | sort -rn

   1047 UZB
    671 AFR
    522 EUR
    515 EAS
    492 SAS
    348 AMR
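Because the awk script labels any sample missing from `pop_mapping.txt` as UZB, a mapping problem would silently inflate the UZB count. A minimal sketch of an automated check against the Population Panel table follows; the helper names `count_clusters` and `check_counts` are hypothetical.

```python
from collections import Counter

# Expected sizes copied from the Population Panel table.
EXPECTED = {"UZB": 1047, "EUR": 522, "EAS": 515, "SAS": 492, "AFR": 671}


def count_clusters(lines):
    """Count samples per cluster label (third whitespace-separated field)."""
    return Counter(line.split()[2] for line in lines if line.strip())


def check_counts(counts, expected=EXPECTED):
    """Return {pop: (observed, expected)} for every population whose count is off."""
    return {pop: (counts.get(pop, 0), n)
            for pop, n in expected.items()
            if counts.get(pop, 0) != n}


if __name__ == "__main__":
    try:
        with open("clusters.txt") as handle:
            mismatches = check_counts(count_clusters(handle))
        print("all counts match" if not mismatches else f"mismatches: {mismatches}")
    except FileNotFoundError:
        print("clusters.txt not found; run the awk step first")
```

An empty result means every analyzed population has the expected number of samples. Populations that are present in the cluster file but not listed in `EXPECTED` (such as AMR) are ignored by this check.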

3.2 Extract Per-Population Sample Lists

for POP in UZB EUR EAS SAS AFR; do
    awk -v p="$POP" '$3==p {print $1, $2}' clusters.txt > keep_${POP}.txt
    echo "$POP: $(wc -l < keep_${POP}.txt) samples"
done

3.3 Compute Per-Population Allele Frequencies

for POP in UZB EUR EAS SAS AFR; do
    plink --bfile UZB_1kG_merged \
          --keep keep_${POP}.txt \
          --freq \
          --out freq_${POP} \
          --allow-no-sex --silent
done
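When the per-population `.frq` files are compared to obtain ΔAF, it is worth checking that the frequencies refer to the same allele: PLINK 1.9 reports the frequency of A1 in the `MAF` column, and A1 may differ between sample subsets unless allele order is pinned (for example with `--keep-allele-order`). The sketch below assumes the `.frq` header `CHR SNP A1 A2 MAF NCHROBS`; the helper names are hypothetical and this is not the logic of `02_calculate_pbs.py`.

```python
import math


def parse_frq(lines):
    """Parse PLINK .frq lines into {snp: (a1, a2, freq_of_a1)}."""
    rows = iter(lines)
    col = {name: i for i, name in enumerate(next(rows).split())}
    freqs = {}
    for line in rows:
        fields = line.split()
        if not fields:
            continue
        try:
            freq = float(fields[col["MAF"]])
        except ValueError:  # e.g. "NA" when no genotypes were observed
            continue
        if math.isnan(freq):
            continue
        freqs[fields[col["SNP"]]] = (fields[col["A1"]], fields[col["A2"]], freq)
    return freqs


def aligned_freq(target, other):
    """Frequency in `other` of the allele that `target` calls A1.

    Returns None when the two records do not describe the same allele pair.
    """
    t_a1, t_a2, _ = target
    o_a1, o_a2, o_freq = other
    if (o_a1, o_a2) == (t_a1, t_a2):
        return o_freq
    if (o_a1, o_a2) == (t_a2, t_a1):  # A1/A2 swapped in the other subset
        return 1.0 - o_freq
    return None


def min_delta_af(snp, target_freqs, other_freqs_by_pop):
    """Smallest |dAF| between the target and every comparison population."""
    target = target_freqs.get(snp)
    if target is None:
        return None
    deltas = []
    for freqs in other_freqs_by_pop.values():
        other = freqs.get(snp)
        af = aligned_freq(target, other) if other is not None else None
        if af is None:
            return None  # refuse to guess when any population is unusable
        deltas.append(abs(target[2] - af))
    return min(deltas) if deltas else None
```

Typical use would be `parse_frq(open("freq_UZB.frq"))` for the target and one parsed file per comparison population. SNPs whose allele pairs do not match, including sites that are monomorphic in one subset, return `None` instead of a guessed value, so they need an explicit decision before tier classification.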

3.4 Pairwise Per-SNP F_ST

Compute per-SNP Weir & Cockerham F_ST for the PBS triangle (UZB–EUR, UZB–EAS, EUR–EAS) plus two additional pairs (UZB–SAS, UZB–AFR) for ΔAF context:

# Function to compute pairwise FST
compute_fst() {
    POP1=$1; POP2=$2
    cat keep_${POP1}.txt keep_${POP2}.txt > keep_${POP1}_${POP2}.txt
    awk -v p="$POP1" '{print $1, $2, p}' keep_${POP1}.txt  > within_${POP1}_${POP2}.txt
    awk -v p="$POP2" '{print $1, $2, p}' keep_${POP2}.txt >> within_${POP1}_${POP2}.txt

    plink --bfile UZB_1kG_merged \
          --keep keep_${POP1}_${POP2}.txt \
          --fst \
          --within within_${POP1}_${POP2}.txt \
          --out fst_${POP1}_${POP2} \
          --allow-no-sex --silent
}

# PBS triangle
compute_fst UZB EUR
compute_fst UZB EAS
compute_fst EUR EAS

# Extra pairs for delta-AF context
compute_fst UZB SAS
compute_fst UZB AFR

3.5 PBS Calculation (Python)

PBS is derived from pairwise F_ST values by converting each to a divergence time T = −ln(1 − F_ST), then computing the branch length for the target (UZB) population:

PBS formula: PBS_UZB = (T_UZB-EUR + T_UZB-EAS − T_EUR-EAS) / 2

# Core PBS computation (from 02_calculate_pbs.py)
import math

def fst_to_T(fst_val):
    """Convert FST to divergence time T, clamping FST to [0, 0.999]."""
    f = max(0.0, min(fst_val, 0.999))
    return -math.log(1.0 - f)

# For each SNP present in all three FST files:
T_ue = fst_to_T(fst_UZB_EUR[snp])
T_ua = fst_to_T(fst_UZB_EAS[snp])
T_ea = fst_to_T(fst_EUR_EAS[snp])
pbs_uzb = (T_ue + T_ua - T_ea) / 2.0

# Tier classification
tier1 = pbs_uzb >= 0.3                    # Strong PBS
tier2 = min_delta_af >= 0.3               # High delta-AF vs ALL populations
tier3 = (maf_uzb >= 0.05 and              # Near-private: common in UZB,
         all(m <= 0.01 for m in maf_other))  # rare in every other population

python3 02_calculate_pbs.py --outdir ./pbs_results

Loading Fst data...
UZB-EUR: 77,111 SNPs
UZB-EAS: 77,111 SNPs
EUR-EAS: 77,111 SNPs
Computing PBS...
Total SNPs analyzed: 77,111

=== PBS SUMMARY ===
n_snps: 77111
mean: -0.009794
median: -0.006012
stdev: 0.026484
min: -0.362111
max: 2.988450
n_pbs_ge_03: 8
n_pbs_ge_01: 13
n_negative: 63414
n_tier1: 8
n_tier2: 1
n_tier3: 0
n_candidates: 8
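The excerpt in 3.5 leaves out how the three F_ST tables are loaded and intersected. The following is a minimal sketch of that step, assuming the PLINK 1.9 `--fst` output header `CHR SNP POS NMISS FST`; `parse_fst` and `pbs_per_snp` are hypothetical names and the actual script may differ.

```python
import math


def parse_fst(lines):
    """Parse PLINK --fst output lines into {snp: fst}, skipping missing values."""
    rows = iter(lines)
    header = next(rows).split()
    snp_i, fst_i = header.index("SNP"), header.index("FST")
    values = {}
    for line in rows:
        fields = line.split()
        if not fields:
            continue
        try:
            fst = float(fields[fst_i])
        except ValueError:  # non-numeric placeholder such as "NA"
            continue
        if math.isnan(fst):  # "nan" parses as a float, so drop it explicitly
            continue
        values[fields[snp_i]] = fst
    return values


def fst_to_T(fst, cap=0.999):
    """Clamp FST to [0, cap], then convert to T = -ln(1 - FST)."""
    return -math.log(1.0 - max(0.0, min(fst, cap)))


def pbs_per_snp(fst_ue, fst_ua, fst_ea):
    """PBS for the first population (UZB) at every SNP present in all three tables."""
    shared = fst_ue.keys() & fst_ua.keys() & fst_ea.keys()
    return {
        snp: (fst_to_T(fst_ue[snp]) + fst_to_T(fst_ua[snp]) - fst_to_T(fst_ea[snp])) / 2.0
        for snp in shared
    }
```

With the files from 3.4 this would be called roughly as `pbs_per_snp(parse_fst(open("fst_UZB_EUR.fst")), parse_fst(open("fst_UZB_EAS.fst")), parse_fst(open("fst_EUR_EAS.fst")))`. Clamping negative F_ST estimates to 0 keeps the logarithm defined and treats them as no measurable divergence.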

4. Results

4.1 Pairwise F_ST (Weighted)

Comparison	Weighted F_ST	Mean F_ST
UZB vs EUR	0.01448	0.00997
UZB vs EAS	0.03929	0.02357
EUR vs EAS	0.08448	0.05023
UZB vs SAS	0.01441	0.00980
UZB vs AFR	0.12930	0.06096

Interpretation: UZB is genetically closest to EUR (F_ST=0.014) and SAS (F_ST=0.014), moderately distant from EAS (0.039), and most distant from AFR (0.129). This is consistent with the PCA positioning (Step 8) showing Uzbeks intermediate between EUR and SAS.
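As a worked check, the PBS formula from 3.5 can be applied to the genome-wide weighted F_ST values in the table above. This is an illustration only: it treats the three weighted estimates as if they were a single locus, which is not how the per-SNP pipeline operates.

```python
import math

# Weighted F_ST values copied from the Section 4.1 table.
weighted_fst = {
    ("UZB", "EUR"): 0.01448,
    ("UZB", "EAS"): 0.03929,
    ("EUR", "EAS"): 0.08448,
}


def fst_to_T(fst):
    """Divergence-time transform T = -ln(1 - FST)."""
    return -math.log(1.0 - fst)


T_ue = fst_to_T(weighted_fst[("UZB", "EUR")])
T_ua = fst_to_T(weighted_fst[("UZB", "EAS")])
T_ea = fst_to_T(weighted_fst[("EUR", "EAS")])

# Branch length assigned to UZB in the three-population tree.
pbs_uzb = (T_ue + T_ua - T_ea) / 2.0

print(f"T_UZB-EUR={T_ue:.5f}  T_UZB-EAS={T_ua:.5f}  T_EUR-EAS={T_ea:.5f}")
print(f"PBS_UZB={pbs_uzb:.5f}")
```

The result is about −0.017. It is negative because T_UZB-EUR + T_UZB-EAS is smaller than T_EUR-EAS, which has the same sign as the per-SNP mean and median PBS reported for this dataset.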

4.2 PBS Summary Statistics

Metric	Value
SNPs analyzed	77,111
Mean PBS_UZB	−0.00979
Median PBS_UZB	−0.00601
Standard deviation	0.02648
Min	−0.36211
Max	2.98845
PBS ≥ 0.3 (Tier 1)	8 SNPs
PBS ≥ 0.1	13 SNPs
Negative PBS	63,414 (82.2%)
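The summary metrics above could be reproduced from the per-SNP PBS values along the following lines. The key names mirror the `=== PBS SUMMARY ===` log; the function name `summarize_pbs` is hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption about the original script.

```python
import statistics


def summarize_pbs(values):
    """Summary statistics for a list of per-SNP PBS values."""
    n = len(values)
    n_negative = sum(v < 0 for v in values)
    return {
        "n_snps": n,
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # assumption: sample stdev (n - 1)
        "min": min(values),
        "max": max(values),
        "n_pbs_ge_03": sum(v >= 0.3 for v in values),
        "n_pbs_ge_01": sum(v >= 0.1 for v in values),
        "n_negative": n_negative,
        "pct_negative": 100.0 * n_negative / n,
    }


# Toy values only; the real input would be the PBS column of pbs_all.tsv.
print(summarize_pbs([-0.10, 0.00, 0.35, 0.15]))
```

For the reported run, 63,414 negative values out of 77,111 SNPs corresponds to the 82.2% shown in the table.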

4.3 Top PBS Candidates

#	SNP	Chr	Position	PBS_UZB	MAF_UZB	MAF_EUR	MAF_EAS	MAF_SAS	MAF_AFR

Observation: The top 3 SNPs (all on chromosome 12) show extremely high PBS values (>2.5) with very low UZB MAF (~2–3%) compared to high AFR MAF (~49%). These likely reflect ancestral allele retention rather than positive selection. The more biologically interesting candidates are SNPs 5–8 (PBS 0.32–0.53), where UZB shows high MAF (~48%) diverging from other populations.

5. Output Files

File	Description
`clusters.txt`	Population cluster file (FID IID POP)
`freq_{UZB,EUR,EAS,SAS,AFR}.frq`	Per-population allele frequencies
`fst_{POP1}_{POP2}.fst`	Per-SNP pairwise F_ST (5 pairs)
`pbs_all.tsv`	PBS scores for all 77,111 SNPs
`pbs_candidates.json`	Filtered candidate SNPs (Tier 1/2/3)
`pbs_stats.json`	Summary statistics
`pbs_histogram.json`	PBS distribution for plotting

6. Next Steps

Step 11: ADMIXTURE — Continental ancestry decomposition using the same merged dataset.
Step 12: PBS Annotation — Functional annotation of the 8 PBS candidates.
Step 13: LD Analysis — LD clumping and decay analysis of PBS candidates.