Step 9: Genome-Wide FST Analysis (Uzbek vs EUR)

Quantify population differentiation between the Uzbek cohort and the 1000 Genomes European reference

✓ Spring 2026: April 11, 2026 · Completed: October 23 (pilot) & November 17 (genome-wide), 2025

1. Overview

FST (fixation index) measures allele frequency differentiation between populations. Values range from 0 (no differentiation — identical allele frequencies) to 1 (complete fixation of different alleles). This step computes genome-wide Weir & Cockerham FST between our Uzbek cohort and 1000 Genomes Phase 3 European (EUR) samples to:

  • Quantify genetic distance — how differentiated are Uzbek samples from Europeans?
  • Identify highly differentiated loci — SNPs with extreme FST may be under population-specific selection or linked to ancestry-informative markers
  • Contextualize GWAS findings — if a GWAS hit has high FST, it may reflect population stratification rather than true disease association
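To make the 0-to-1 scale concrete, the sketch below computes Wright's textbook FST, (H_T − H_S) / H_T, for a single biallelic SNP from two allele frequencies. It is for intuition only: it assumes two equally sized populations and is not the Weir & Cockerham estimator that PLINK reports in the rest of this step. The function name wright_fst is hypothetical.

```shell
# Wright's FST for one biallelic SNP in two equally sized populations.
# $1, $2: frequency of the same allele in population 1 and population 2.
wright_fst() {
  awk -v p1="$1" -v p2="$2" 'BEGIN {
    pbar = (p1 + p2) / 2                               # pooled allele frequency
    ht   = 2 * pbar * (1 - pbar)                       # expected heterozygosity, pooled
    hs   = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2 # mean within-population heterozygosity
    if (ht == 0) { print "nan"; exit }                 # monomorphic in both: undefined
    printf "%.4f\n", (ht - hs) / ht
  }'
}
```

For example, `wright_fst 0.5 0.5` prints 0.0000 (identical frequencies), `wright_fst 0.4 0.6` prints 0.0400, and `wright_fst 1 0` prints 1.0000 (fixed for different alleles).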

Interpretation Guide

| FST Range | Interpretation | Example |
| --- | --- | --- |
| 0.00–0.05 | Little differentiation | Populations within same continental group |
| 0.05–0.15 | Moderate differentiation | EUR vs SAS (~0.05), EUR vs AMR (~0.06), EUR vs EAS (~0.11) |
| 0.15–0.25 | Great differentiation | EUR vs AFR (~0.15) |
| >0.25 | Very great differentiation | Extreme outlier loci under selection |

Spring 2026 FST Results

Multi-population FST computed on 79,767 merged SNPs using UZB (1,047), EUR (522), EAS (515), SAS (492), and AFR (671) samples.

| Comparison | Weighted FST | Mean FST |
| --- | --- | --- |
| UZB vs EUR | 0.01468 | 0.01002 |
| UZB vs EAS | 0.04010 | 0.02393 |
| UZB vs SAS | 0.01445 | 0.00988 |
| UZB vs AFR | 0.12907 | 0.06153 |
| EUR vs EAS | 0.08631 | 0.05090 |
Consistent with the winter run: UZB is closest to EUR (0.015) and SAS (0.014), moderately distant from EAS (0.040), and most distant from AFR (0.129). The values are nearly identical to those of the winter run, which supports reproducibility.

2. Input Data

| Source | File(s) | Description |
| --- | --- | --- |
| Uzbek Data | uzbek_data.ped/.map | 1,199 Uzbek samples, 650,181 genotyped variants (hg38/GRCh38) |
| 1000G Reference | ALL.chr*.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz | Phase 3 VCFs (hg19/GRCh37), all 2,504 samples |
| EUR Sample List | european_samples.txt | 503 EUR samples (CEU, GBR, FIN, TSI, IBS) |
| LiftOver Chain | hg38ToHg19.over.chain | UCSC chain file for hg38 → hg19 coordinate conversion |
Note on Sample Count: This analysis used 1,199 Uzbek samples from the pre-QC genotyped dataset (uzbek_data.ped), not the final post-imputation dataset of 1,047 samples. The FST analysis was performed in October–November 2025, concurrently with the QC pipeline (Steps 1–6), so it includes samples that were later removed for missingness, IBD duplicates, and other QC filters. For downstream analyses that must be consistent with the GWAS dataset, use FST recalculated on the final 1,047 samples (see Spring 2026 FST Results above).

3. Coordinate Liftover (hg38 → hg19)

The 1000 Genomes Phase 3 reference data uses hg19 (GRCh37) coordinates, while our Uzbek genotyping data was called on hg38 (GRCh38). To merge the datasets, we must convert Uzbek coordinates to hg19.

3.1 Convert PED/MAP to Binary PLINK

cd /staging/ALSU-analysis/Fst_analysis

# Convert PED/MAP to binary format
plink --file uzbek_data --make-bed --out uzbek_data_hg38

650181 variants loaded from .bim file.
1199 people (0 males, 0 females, 1199 ambiguous) loaded from .fam.
Warning: Variant 1 (post-sort) triallelic; setting rarest alleles missing.
[multiple triallelic warnings omitted]
Total genotyping rate is 0.96741.
--make-bed to uzbek_data_hg38.bed + uzbek_data_hg38.bim + uzbek_data_hg38.fam ... done.

3.2 Run UCSC LiftOver

Extract BED-format positions from the BIM file and convert coordinates using the UCSC liftOver tool:

# Create BED file from BIM (chr, start, end, SNP_ID)
# BIM columns: $1=chr $2=snpID $3=cM $4=bp_position $5=A1 $6=A2
# BED format requires 0-based start → $4-1; end stays 1-based → $4
awk '{print "chr"$1, $4-1, $4, $2}' OFS='\t' uzbek_data_hg38.bim > uzbek_hg38_positions.bed

# liftOver args: input.bed chain_file mapped_output.bed unmapped_output.bed
liftOver uzbek_hg38_positions.bed hg38ToHg19.over.chain uzbek_hg19_lifted.bed unlifted.bed

# Create position update file (old_ID → new_position)
# BED $4=SNP_ID, $3=end coord (= 1-based position for single-bp SNPs)
awk '{print $4, $3}' uzbek_hg19_lifted.bed > update_positions.txt

# List of SNPs that successfully lifted
awk '{print $4}' uzbek_hg19_lifted.bed > snps_to_keep.txt
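The position update file built above carries only the new base-pair position, so a variant that liftOver places on a different chromosome or contig in hg19 would keep its hg38 chromosome code. The sketch below lists such variants so they can be reviewed or excluded. It assumes the file layouts produced by the commands above; the function name liftover_chr_changes is hypothetical.

```shell
# List SNPs whose chromosome differs between the hg38 .bim and the lifted BED.
# $1 = original .bim (col1 = chromosome code, col2 = SNP ID)
# $2 = liftOver output BED (col1 = "chr"-prefixed chromosome, col4 = SNP ID)
# Output: SNP_ID  old_chr  new_chr
liftover_chr_changes() {
  awk 'NR == FNR { chr[$2] = "chr" $1; next }
       ($4 in chr) && chr[$4] != $1 { print $4, chr[$4], $1 }' "$1" "$2"
}
```

A possible use is `liftover_chr_changes uzbek_data_hg38.bim uzbek_hg19_lifted.bed`. If the list is non-empty, the affected IDs can be dropped from snps_to_keep.txt, or their chromosome codes updated with PLINK's --update-chr.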

3.3 Apply New Coordinates

# Extract only liftable SNPs
plink --bfile uzbek_data_hg38 \
  --extract snps_to_keep.txt \
  --make-bed \
  --out uzbek_data_hg19_temp

650181 variants loaded from .bim file.
1199 people loaded from .fam.
--extract: 647854 variants remaining.
--make-bed to uzbek_data_hg19_temp.bed + uzbek_data_hg19_temp.bim + uzbek_data_hg19_temp.fam ... done.

# --update-map FILE: 2-column file (SNP_ID new_bp_position)
# Replaces genomic coordinates in .bim for each matched variant
plink --bfile uzbek_data_hg19_temp \
  --update-map update_positions.txt \
  --make-bed \
  --out uzbek_data_hg19

647854 variants loaded from .bim file.
1199 people loaded from .fam.
--update-map: 647854 values updated.
Warning: Base-pair positions are now unsorted!
--make-bed to uzbek_data_hg19.bed + uzbek_data_hg19.bim + uzbek_data_hg19.fam ... done.
Variants Lifted: 647,854 / 650,181 (99.6% success rate)

Variants Lost: 2,327 (unmappable between assemblies)
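The lifted and lost counts can be recomputed directly from the files rather than read off the PLINK log. This is a minimal sketch that assumes one variant per line in both inputs; the function name liftover_summary is hypothetical. It derives the lost count from the difference instead of counting unlifted.bed, because liftOver's unmapped output also contains comment lines.

```shell
# Summarise liftOver yield.
# $1 = original .bim (one variant per line)
# $2 = lifted BED (one successfully mapped variant per line)
liftover_summary() {
  awk -v t="$(wc -l < "$1")" -v l="$(wc -l < "$2")" 'BEGIN {
    printf "lifted=%d lost=%d rate=%.1f%%\n", l, t - l, (t > 0 ? 100 * l / t : 0)
  }'
}
```

With the counts reported above (650,181 input variants, 647,854 lifted), this arithmetic gives 2,327 lost and a rate of 99.6%.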

4. Chromosome 22 Pilot (October 23, 2025)

Before processing all 22 chromosomes, a pilot was run on chr22 to validate the pipeline. The per-chromosome workflow has 7 steps:

4.1 Per-Chromosome Pipeline

# [1/7] Convert 1000G VCF to PLINK (EUR samples only)
# --keep FILE: 2-column file (FID IID); retain only listed samples
plink --vcf 1000G_data/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
  --keep 1000G_data/european_samples_fixed.txt \
  --make-bed \
  --out 1000G_data/1000G_EUR_chr22

# [2/7] Extract Uzbek chr22
plink --bfile uzbek_data_hg19 --chr 22 --make-bed --out uzbek_chr22

# [3/7] Rename SNPs to chr:pos format (for matching across datasets)
# awk: overwrite $2 (SNP ID) with chr:pos so both datasets share naming scheme
awk '{$2=$1":"$4; print}' OFS='\t' uzbek_chr22.bim > uzbek_chr22_renamed.bim
cp uzbek_chr22.bed uzbek_chr22_renamed.bed
cp uzbek_chr22.fam uzbek_chr22_renamed.fam

awk '{$2=$1":"$4; print}' OFS='\t' 1000G_data/1000G_EUR_chr22.bim > 1000G_EUR_chr22_newnames.bim
cp 1000G_data/1000G_EUR_chr22.bed 1000G_EUR_chr22_newnames.bed
cp 1000G_data/1000G_EUR_chr22.fam 1000G_EUR_chr22_newnames.fam

# [4/7] Find overlapping positions
awk '{print $1":"$4}' uzbek_chr22_renamed.bim | sort > uzbek_chr22_pos.txt
awk '{print $1":"$4}' 1000G_EUR_chr22_newnames.bim | sort > 1000G_chr22_pos.txt
# comm -12: output only lines present in BOTH sorted files (set intersection)
comm -12 uzbek_chr22_pos.txt 1000G_chr22_pos.txt > positions_chr22.txt

# [5/7] Extract overlapping SNPs from both datasets
plink --bfile uzbek_chr22_renamed --extract positions_chr22.txt --make-bed --out uzbek_chr22_matched
plink --bfile 1000G_EUR_chr22_newnames --extract positions_chr22.txt --make-bed --out 1000G_EUR_chr22_matched

# [6/7] Merge
# --bmerge PREFIX: merge current --bfile dataset with a second PLINK fileset
# PLINK 1.9 does not flip strands on its own: variants that would end up with
# 3+ alleles abort the merge and are listed in a .missnp file. Strand-ambiguous
# A/T and C/G SNPs never trigger this check, so a flipped strand there goes unnoticed.
plink --bfile uzbek_chr22_matched --bmerge 1000G_EUR_chr22_matched \
  --make-bed --out merged_chr22_temp

# If 3+ allele conflicts arise, exclude problematic SNPs and re-merge:
plink --bfile uzbek_chr22_matched --exclude merged_chr22_temp-merge.missnp \
  --make-bed --out uzbek_chr22_clean
plink --bfile 1000G_EUR_chr22_matched --exclude merged_chr22_temp-merge.missnp \
  --make-bed --out 1000G_EUR_chr22_clean

# [7/7] Final merge
plink --bfile uzbek_chr22_clean --bmerge 1000G_EUR_chr22_clean \
  --make-bed --out merged_uzbek_EUR_chr22
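Excluding the .missnp variants only removes conflicts that produce three or more alleles. A/T and C/G SNPs look identical on either strand, so a strand mismatch between the array and 1000G at those sites passes the merge silently and can create spurious allele frequency differences. The sketch below lists such SNPs from a .bim file so they can be excluded before merging. It is an optional extra check, not part of the pipeline as run; the function name list_ambiguous_snps is hypothetical.

```shell
# Print IDs of strand-ambiguous SNPs (A/T or C/G allele pairs) from a PLINK .bim.
# $1 = .bim file (col2 = SNP ID, col5 = A1, col6 = A2)
list_ambiguous_snps() {
  awk '{
    pair = toupper($5) toupper($6)
    if (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC") print $2
  }' "$1"
}
```

One way to use it: write the output to a file (for example `list_ambiguous_snps uzbek_chr22_matched.bim > ambiguous_chr22.txt`, a hypothetical file name) and pass that file to plink --exclude for both datasets before the merge.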

4.2 Pilot FST

# Create population assignment file
# Format: FID IID GROUP (tab-separated)
#   1,199 Uzbek samples labeled "UZBEK"
#   503 EUR samples labeled "EUR"
#   (populations.txt covers all 1,702 merged samples)
plink --bfile merged_uzbek_EUR_chr22 \
  --fst \
  --within populations.txt \
  --out uzbek_vs_eur_fst

5762 variants loaded from .bim file.
1702 people (0 males, 0 females, 1702 ambiguous) loaded from .fam.
--within: 2 clusters loaded, covering a total of 1702 people.
5762 markers with valid Fst estimates.
Mean Fst estimate: 0.0148483
Weighted Fst estimate: 0.0181685
Chr22 Pilot Successful: 5,762 overlapping SNPs between Uzbek and EUR, Mean FST = 0.0148, Weighted FST = 0.0182. Pipeline validated.

5. Genome-Wide Processing (November 17, 2025)

The validated per-chromosome pipeline was automated across the autosomes using two shell scripts. The first script (process_all_chromosomes.sh) loops over chr1–22 and skips chr22, which was already done in the pilot; in this run it processed chr1–14. The second script (continue_processing.sh) processed chr15–21.

5.1 Automated Pipeline Script

#!/bin/bash
set -e

echo "===== Processing all chromosomes for genome-wide Fst ====="
echo "Start time: $(date)"

for chr in {1..22}; do
  echo ""
  echo "===== CHROMOSOME ${chr} ====="

  # Skip chr22 (already done in pilot)
  if [ $chr -eq 22 ]; then
    echo " Chr22 already processed, using existing files"
    continue
  fi

  echo " [1/7] Converting 1000G VCF to PLINK..."
  plink --vcf 1000G_data/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
    --keep 1000G_data/european_samples_fixed.txt \
    --make-bed \
    --out 1000G_data/1000G_EUR_chr${chr} \
    --silent

  echo " [2/7] Extracting Uzbek chr${chr}..."
  plink --bfile uzbek_data_hg19 \
    --chr ${chr} \
    --make-bed \
    --out uzbek_chr${chr} \
    --silent

  echo " [3/7] Renaming SNPs to chr:pos format..."
  awk '{$2=$1":"$4; print}' OFS='\t' uzbek_chr${chr}.bim > uzbek_chr${chr}_renamed.bim
  cp uzbek_chr${chr}.bed uzbek_chr${chr}_renamed.bed
  cp uzbek_chr${chr}.fam uzbek_chr${chr}_renamed.fam

  awk '{$2=$1":"$4; print}' OFS='\t' 1000G_data/1000G_EUR_chr${chr}.bim > 1000G_EUR_chr${chr}_newnames.bim
  cp 1000G_data/1000G_EUR_chr${chr}.bed 1000G_EUR_chr${chr}_newnames.bed
  cp 1000G_data/1000G_EUR_chr${chr}.fam 1000G_EUR_chr${chr}_newnames.fam

  echo " [4/7] Finding overlapping SNPs..."
  awk '{print $1":"$4}' uzbek_chr${chr}_renamed.bim | sort > uzbek_chr${chr}_pos.txt
  awk '{print $1":"$4}' 1000G_EUR_chr${chr}_newnames.bim | sort > 1000G_chr${chr}_pos.txt
  comm -12 uzbek_chr${chr}_pos.txt 1000G_chr${chr}_pos.txt > positions_chr${chr}.txt
  overlap=$(wc -l < positions_chr${chr}.txt)
  echo " Found ${overlap} overlapping SNPs"

  echo " [5/7] Extracting overlapping SNPs..."
  plink --bfile uzbek_chr${chr}_renamed \
    --extract positions_chr${chr}.txt \
    --make-bed \
    --out uzbek_chr${chr}_matched \
    --silent

  plink --bfile 1000G_EUR_chr${chr}_newnames \
    --extract positions_chr${chr}.txt \
    --make-bed \
    --out 1000G_EUR_chr${chr}_matched \
    --silent

  echo " [6/7] Attempting merge..."
  plink --bfile uzbek_chr${chr}_matched \
    --bmerge 1000G_EUR_chr${chr}_matched \
    --make-bed \
    --out merged_chr${chr}_temp \
    --silent 2>&1 | grep -q "3+ alleles" && {
    echo " Merge conflicts detected, excluding problem SNPs..."
    plink --bfile uzbek_chr${chr}_matched \
      --exclude merged_chr${chr}_temp-merge.missnp \
      --make-bed \
      --out uzbek_chr${chr}_clean \
      --silent

    plink --bfile 1000G_EUR_chr${chr}_matched \
      --exclude merged_chr${chr}_temp-merge.missnp \
      --make-bed \
      --out 1000G_EUR_chr${chr}_clean \
      --silent

    echo " [7/7] Final merge..."
    plink --bfile uzbek_chr${chr}_clean \
      --bmerge 1000G_EUR_chr${chr}_clean \
      --make-bed \
      --out merged_chr${chr} \
      --silent
  } || {
    echo " [7/7] Merge successful!"
    mv merged_chr${chr}_temp.bed merged_chr${chr}.bed
    mv merged_chr${chr}_temp.bim merged_chr${chr}.bim
    mv merged_chr${chr}_temp.fam merged_chr${chr}.fam
  }

  snps=$(wc -l < merged_chr${chr}.bim)
  echo " ✓ Chr${chr} complete: ${snps} SNPs"

  # Cleanup intermediate files
  rm -f uzbek_chr${chr}.* uzbek_chr${chr}_renamed.* uzbek_chr${chr}_matched.* uzbek_chr${chr}_clean.*
  rm -f 1000G_EUR_chr${chr}_newnames.* 1000G_EUR_chr${chr}_matched.* 1000G_EUR_chr${chr}_clean.*
  rm -f uzbek_chr${chr}_pos.txt 1000G_chr${chr}_pos.txt positions_chr${chr}.txt
  rm -f merged_chr${chr}_temp.*
done

echo ""
echo "===== All chromosomes processed! ====="
echo "End time: $(date)"

5.2 Merge All Chromosomes

# Create merge list (chr2-chr22, using chr1 as base)
cat > merge_list.txt << 'EOF'
merged_chr2
merged_chr3
merged_chr4
merged_chr5
merged_chr6
merged_chr7
merged_chr8
merged_chr9
merged_chr10
merged_chr11
merged_chr12
merged_chr13
merged_chr14
merged_chr15
merged_chr16
merged_chr17
merged_chr18
merged_chr19
merged_chr20
merged_chr21
merged_chr22
EOF

# Merge all chromosomes into one dataset
plink --bfile merged_chr1 \
  --merge-list merge_list.txt \
  --make-bed \
  --out merged_all_chrs

376208 variants and 1702 people pass filters and QC.
--make-bed to merged_all_chrs.bed + merged_all_chrs.bim + merged_all_chrs.fam ... done.

6. Genome-Wide FST Calculation

6.1 Population Assignment File

The populations.txt file assigns each sample to either UZBEK or EUR:

# Format: FID IID GROUP
# Example entries:
01-02    01-02    UZBEK
01-03    01-03    UZBEK
01-07    01-07    UZBEK
...
HG00096  HG00096  EUR
HG00097  HG00097  EUR
NA20832  NA20832  EUR
UZBEK samples: 1,199

EUR samples (1000G): 503 (CEU=99, GBR=91, FIN=99, TSI=107, IBS=107)

Total merged: 1,702
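The source does not show how populations.txt was generated. One way to build a file in this format is to label every sample in the merged .fam as EUR if its IID appears in the EUR sample list and as UZBEK otherwise. The sketch below assumes the sample list has the IID in its last column (which covers both a one-column list and a two-column FID IID list); the function name make_populations is hypothetical.

```shell
# Build a PLINK --within cluster file (FID IID GROUP, tab-separated).
# $1 = merged .fam (col1 = FID, col2 = IID)
# $2 = EUR sample list, IID in the last column (assumed layout)
make_populations() {
  awk 'NR == FNR { eur[$NF] = 1; next }
       { print $1, $2, (($2 in eur) ? "EUR" : "UZBEK") }' OFS='\t' "$2" "$1"
}
```

A possible invocation is `make_populations merged_all_chrs.fam 1000G_data/european_samples_fixed.txt > populations.txt`, after which counting the third column should reproduce the 1,199 / 503 split listed above.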

6.2 Run FST

# --fst: Weir & Cockerham (1984) Fst estimator; measures allele frequency
#   divergence between populations; can produce slightly negative values
#   when within-group variance exceeds between-group variance (≈ zero)
# --within FILE: 3-column cluster file (FID IID GROUP_LABEL)
#   assigns each sample to a comparison group
plink --bfile merged_all_chrs \
  --fst \
  --within populations.txt \
  --out genomewide_uzbek_vs_eur_fst

PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)
Options in effect:
  --bfile merged_all_chrs
  --fst
  --within populations.txt
  --out genomewide_uzbek_vs_eur_fst

376208 variants loaded from .bim file.
1702 people (0 males, 0 females, 1702 ambiguous) loaded from .fam.
--within: 2 clusters loaded, covering a total of 1702 people.
Total genotyping rate is 0.978817.
376208 variants and 1702 people pass filters and QC.
Writing --fst report (2 populations) to genomewide_uzbek_vs_eur_fst.fst ... done.
376206 markers with valid Fst estimates (2 excluded).
Mean Fst estimate: 0.0160089
Weighted Fst estimate: 0.0204295
Genome-wide FST Complete:
  • Mean FST = 0.0160 (unweighted average across all SNPs)
  • Weighted FST = 0.0204 (weighted by expected heterozygosity — more robust)
  • Valid markers: 376,206 (2 excluded due to invariant alleles)
Updated results (1,047 post-QC samples, 77,111 LD-pruned SNPs): UZB–EUR weighted FST = 0.0145 (mean 0.0100). The lower value reflects both the stricter QC (1,047 vs 1,199 samples) and the more aggressively LD-pruned SNP set. See Step 14 for the full 5-population matrix.
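The two summary values differ because they aggregate per-SNP estimates differently. Each per-SNP FST is a ratio of a between-population variance component to a total variance component. The mean averages those ratios; the weighted estimate, as we understand PLINK's calculation, divides the summed numerators by the summed denominators, so SNPs with more total variance count for more. The sketch below illustrates the difference on made-up numerator/denominator pairs; these components are not columns of the .fst output, and the function name mean_vs_weighted is hypothetical.

```shell
# Compare mean-of-ratios with ratio-of-sums.
# stdin: one SNP per line, "numerator denominator" (illustrative values only)
mean_vs_weighted() {
  awk '$2 > 0 { n++; ratio_sum += $1 / $2; num += $1; den += $2 }
       END { if (n) printf "mean=%.4f weighted=%.4f\n", ratio_sum / n, num / den }'
}
```

For instance, `printf '0.001 0.1\n0.05 0.5\n' | mean_vs_weighted` prints mean=0.0550 weighted=0.0850: the second SNP has the larger denominator, so it pulls the weighted value toward its own ratio of 0.1.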

7. Results

7.1 FST Distribution

| FST Range | Number of SNPs | Percentage | Interpretation |
| --- | --- | --- | --- |
| ≥ 0.50 | 256 | 0.07% | Extreme differentiation; potential selection targets |
| 0.30 – 0.50 | 193 | 0.05% | Very high differentiation |
| 0.10 – 0.30 | 4,094 | 1.09% | Moderate to high differentiation |
| 0.05 – 0.10 | 24,194 | 6.43% | Moderate differentiation |
| 0.00 – 0.05 | 285,433 | 75.86% | Low differentiation (background level) |
| < 0.00 | 62,038 | 16.49% | Negative FST (sampling noise; treated as ~0) |
About Negative FST: Weir & Cockerham's estimator can produce slightly negative values when within-population variance exceeds between-population variance — this reflects sampling noise at loci with very similar allele frequencies. The 16.5% of negative values are expected for closely related populations and are biologically equivalent to zero.
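A distribution table like the one above can be derived from the per-SNP .fst output with a single awk pass. This is a sketch rather than the script actually used: it assumes the five-column layout CHR SNP POS NMISS FST with a header line, skips nan entries, and places a value of exactly 0 in the 0.00-0.05 bin. The function name fst_bins is hypothetical.

```shell
# Bin per-SNP FST values into the ranges used in the distribution table.
# $1 = PLINK .fst file (header line, FST in column 5)
# Output: range <TAB> count <TAB> percentage of valid markers
fst_bins() {
  awk 'NR > 1 && tolower($5) != "nan" {
         f = $5 + 0
         if      (f >= 0.50) i = 1
         else if (f >= 0.30) i = 2
         else if (f >= 0.10) i = 3
         else if (f >= 0.05) i = 4
         else if (f >= 0)    i = 5
         else                i = 6
         n[i]++; total++
       }
       END {
         split(">=0.50 0.30-0.50 0.10-0.30 0.05-0.10 0.00-0.05 <0.00", label, " ")
         for (i = 1; i <= 6; i++)
           printf "%s\t%d\t%.2f%%\n", label[i], n[i], (total ? 100 * n[i] / total : 0)
       }' "$1"
}
```

Running `fst_bins genomewide_uzbek_vs_eur_fst.fst` prints one line per range, in the same order as the table.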

7.2 Top 30 Most Differentiated Loci

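| Rank | Chr | Position (hg19) | SNP ID | FST | N (non-missing) |
| --- | --- | --- | --- | --- | --- |
| 1 | 9 | 14,776,145 | 9:14776145 | 0.9982 | 1,701 |
| 2 | 1 | 40,954,877 | 1:40954877 | 0.9982 | 1,701 |
| 3 | 14 | 23,893,148 | 14:23893148 | 0.9877 | 1,696 |
| 4 | 21 | 19,169,133 | 21:19169133 | 0.9870 | 1,691 |
| 5 | 9 | 104,189,856 | 9:104189856 | 0.9869 | 1,699 |
| 6 | 8 | 24,813,391 | 8:24813391 | 0.9853 | 1,695 |
| 7 | 14 | 50,100,242 | 14:50100242 | 0.9847 | 1,694 |
| 8 | 17 | 48,275,339 | 17:48275339 | 0.9787 | 1,692 |
| 9 | 4 | 69,530,098 | 4:69530098 | 0.9782 | 1,684 |
| 10 | 11 | 47,355,475 | 11:47355475 | 0.9728 | 1,681 |
| 11 | 2 | 179,642,589 | 2:179642589 | 0.9711 | 1,683 |
| 12 | 6 | 30,585,771 | 6:30585771 | 0.9686 | 1,693 |
| 13 | 1 | 36,934,805 | 1:36934805 | 0.9668 | 1,688 |
| 14 | 4 | 187,003,729 | 4:187003729 | 0.9629 | 1,675 |
| 15 | 13 | 95,672,265 | 13:95672265 | 0.9602 | 1,670 |
| 16 | 1 | 161,199,431 | 1:161199431 | 0.9588 | 1,688 |
| 17 | 6 | 29,640,785 | 6:29640785 | 0.9584 | 1,696 |
| 18 | 8 | 48,701,786 | 8:48701786 | 0.9580 | 1,671 |
| 19 | 1 | 146,661,814 | 1:146661814 | 0.9552 | 1,676 |
| 20 | 4 | 70,354,534 | 4:70354534 | 0.9532 | 1,670 |
| 21 | 2 | 152,580,815 | 2:152580815 | 0.9521 | 1,691 |
| 22 | 16 | 89,833,576 | 16:89833576 | 0.9518 | 1,695 |
| 23 | 11 | 34,999,682 | 11:34999682 | 0.9494 | 1,672 |
| 24 | 6 | 31,059,825 | 6:31059825 | 0.9490 | 1,695 |
| 25 | 2 | 21,238,413 | 2:21238413 | 0.9486 | 1,673 |
| 26 | 5 | 162,896,759 | 5:162896759 | 0.9418 | 1,675 |
| 27 | 8 | 145,639,681 | 8:145639681 | 0.9417 | 1,659 |
| 28 | 10 | 96,818,119 | 10:96818119 | 0.9406 | 1,698 |
| 29 | 6 | 32,704,884 | 6:32704884 | 0.9388 | 1,676 |
| 30 | 11 | 17,476,460 | 11:17476460 | 0.9387 | 1,693 |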
Notable observation about chr6 hits: Four of the top 30 hits (ranks 12, 17, 24, 29) are on chromosome 6 at positions 29.6–32.7 Mb. This region corresponds to the HLA/MHC complex (6p21.3), which is one of the most polymorphic regions of the human genome and is well known for extreme population differentiation. The HLA region is also of potential relevance to pregnancy loss (immune tolerance of fetal alloantigens).
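A ranked list like the top-30 table can be pulled from the .fst file by sorting on the FST column. The sketch below assumes the five-column PLINK layout with a header line and drops nan entries before sorting; the function name top_fst is hypothetical.

```shell
# Print the N most differentiated loci from a PLINK .fst file.
# $1 = .fst file (header line, FST in column 5); $2 = number of rows (default 30)
top_fst() {
  awk 'NR > 1 && tolower($5) != "nan"' "$1" | sort -k5,5gr | head -n "${2:-30}"
}
```

The general-numeric sort key (`g`) is used because the file contains values in scientific notation such as -3.48052e-05, which a plain numeric sort would misorder.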

7.3 Per-Chromosome Summary

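| Chromosome | Overlapping SNPs | Mean FST |
| --- | --- | --- |
| Chr 1 | 29,115 | 0.0163 |
| Chr 2 | 31,144 | 0.0166 |
| Chr 3 | 26,057 | 0.0157 |
| Chr 4 | 24,077 | 0.0159 |
| Chr 5 | 21,920 | 0.0151 |
| Chr 6 | 26,972 | 0.0192 |
| Chr 7 | 21,175 | 0.0160 |
| Chr 8 | 19,717 | 0.0157 |
| Chr 9 | 17,831 | 0.0162 |
| Chr 10 | 19,386 | 0.0153 |
| Chr 11 | 18,107 | 0.0158 |
| Chr 12 | 18,472 | 0.0163 |
| Chr 13 | 13,812 | 0.0155 |
| Chr 14 | 12,418 | 0.0154 |
| Chr 15 | 11,874 | 0.0164 |
| Chr 16 | 12,777 | 0.0159 |
| Chr 17 | 10,945 | 0.0159 |
| Chr 18 | 11,173 | 0.0143 |
| Chr 19 | 8,244 | 0.0143 |
| Chr 20 | 9,704 | 0.0155 |
| Chr 21 | 5,526 | 0.0149 |
| Chr 22 | 5,762 | 0.0148 |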

Chr 6 stands out with the highest per-chromosome mean FST (0.0192), driven by the HLA/MHC region. Chromosomes 18 and 19 have the lowest mean FST (0.0143).
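Per-chromosome counts and means of this kind can be computed from the genome-wide .fst file by grouping on the chromosome column. This sketch assumes the means in the table are unweighted averages of per-SNP FST over valid (non-nan) markers, which the source does not state explicitly; the function name per_chr_mean_fst is hypothetical.

```shell
# Per-chromosome SNP count and unweighted mean FST.
# $1 = PLINK .fst file (header line, CHR in column 1, FST in column 5)
# Output: chr <TAB> n_snps <TAB> mean_fst, sorted by chromosome number
per_chr_mean_fst() {
  awk 'NR > 1 && tolower($5) != "nan" { n[$1]++; s[$1] += $5 }
       END { for (c in n) printf "%s\t%d\t%.4f\n", c, n[c], s[c] / n[c] }' "$1" |
    sort -k1,1n
}
```

Usage would be `per_chr_mean_fst genomewide_uzbek_vs_eur_fst.fst`.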

8. Interpretation

8.1 Population Context

The weighted FST of 0.0204 between Uzbek and EUR populations indicates low but non-trivial genetic differentiation. For context:

| Population Pair | Typical FST | Source |
| --- | --- | --- |
| Within-European (e.g., CEU vs TSI) | 0.002–0.006 | 1000 Genomes |
| Uzbek vs EUR (pre-QC, 376K SNPs) | 0.020 | This analysis |
| Uzbek vs EUR (post-QC, 77K SNPs) | 0.015 | Recalculation |
| EUR vs SAS (South Asian) | ~0.020–0.040 | 1000 Genomes |
| EUR vs EAS (East Asian) | ~0.10–0.11 | 1000 Genomes |
| EUR vs AFR (African) | ~0.12–0.15 | 1000 Genomes |

The Uzbek cohort is genetically closer to Europeans than South Asians are, consistent with Central Asia's geographic and historical position as a crossroads between Europe and Asia. This aligns with the global PCA results (Step 8) where Uzbek samples cluster between EUR and SAS.

8.2 Implications for GWAS

  • Population stratification is modest but real: FST = 0.02 means ~2% of allele frequency variance is between populations. This is enough to inflate GWAS test statistics if not corrected.
  • PCA covariates essential: Include global PCA components (from Step 8) as covariates in association models to control for this differentiation.
  • High-FST SNPs require caution: The 449 loci with FST > 0.3 are ancestry-informative markers. If these appear in GWAS results, they likely reflect population structure rather than disease biology.
  • HLA region needs special handling: The chr6 HLA hits are both population-differentiated AND biologically relevant to pregnancy. Cross-reference with GWAS hits carefully.
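The cross-referencing described above can be sketched as a lookup of GWAS hit IDs against the per-SNP .fst file. The hit list is a hypothetical input: one ID per line in the same chr:pos form used in the .fst file, which is in hg19 coordinates, so hits reported in hg38 would need converting first. The function name flag_high_fst_hits and the default threshold of 0.3 are illustrative choices.

```shell
# Report GWAS hits whose per-SNP FST exceeds a threshold.
# $1 = PLINK .fst file (header line, SNP ID in column 2, FST in column 5)
# $2 = GWAS hit IDs, one chr:pos ID per line (hypothetical file, hg19)
# $3 = FST threshold (default 0.3)
flag_high_fst_hits() {
  awk -v thr="${3:-0.3}" '
    NR == FNR { hit[$1] = 1; next }
    FNR > 1 && ($2 in hit) && tolower($5) != "nan" && $5 + 0 > thr + 0 { print $2, $5 }
  ' "$2" "$1"
}
```

Any ID printed by `flag_high_fst_hits genomewide_uzbek_vs_eur_fst.fst gwas_hits.txt` (gwas_hits.txt being the hypothetical hit list) is a candidate for closer inspection for stratification effects.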

8.3 Highly Differentiated Loci — Potential Biology

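SNPs with FST near 1.0 are essentially fixed for different alleles in Uzbek vs EUR. Differentiation this extreme is unexpected when the genome-wide FST is about 0.02, so technical causes should be ruled out before any biological interpretation: strand or allele-coding mismatches between the genotyping array and 1000G (especially at A/T and C/G SNPs) and liftover errors can both produce apparent fixed differences. Among loci that pass those checks, many will reflect genetic drift and founder effects, and some may overlap known selection signals. The top loci on chr9 (14.8 Mb) and chr1 (40.9 Mb) with FST = 0.998 warrant gene annotation to identify nearby genes and potential functional consequences.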

Action Required: Annotate the top FST hits (FST > 0.5) with nearest gene(s) and cross-reference against known selection scans in Central Asian populations. Tools: ANNOVAR, Ensembl VEP, or UCSC Genome Browser.

9. Output Files

| File | Location | Description |
| --- | --- | --- |
| uzbek_data_hg19.{bed,bim,fam} | /staging/ALSU-analysis/Fst_analysis/ | Uzbek genotypes in hg19 coordinates (647,854 variants) |
| merged_all_chrs.{bed,bim,fam} | | Merged Uzbek + EUR dataset (376,208 variants, 1,702 people) |
| populations.txt | | Population assignments (FID IID GROUP): 1,199 UZBEK + 503 EUR |
| genomewide_uzbek_vs_eur_fst.fst | | Per-SNP FST values (CHR SNP POS NMISS FST) |
| genomewide_uzbek_vs_eur_fst.log | | PLINK log with summary statistics |
| uzbek_vs_eur_fst.fst | | Chr22 pilot FST results |
| process_all_chromosomes.sh | | Automated pipeline script (chr1–14) |
| continue_processing.sh | | Continuation script (chr15–21) |

FST Output File Format

# genomewide_uzbek_vs_eur_fst.fst
# Columns: CHR SNP POS NMISS FST
CHR  SNP         POS       NMISS  FST
1    1:727841    727841    1673   -0.000559429
1    1:846808    846808    1669   -3.48052e-05
...
9    9:14776145  14776145  1701   0.998236
1    1:40954877  40954877  1701   0.998236

10. Key Findings

| Finding | Value | Significance |
| --- | --- | --- |
| Genome-wide Weighted FST | 0.0145 | Uzbek are equidistant from EUR and SAS (77K SNPs, 1,047 samples) |
| Extreme outliers (FST > 0.5) | 256 loci | Potential selection targets or ancestry-informative markers to flag in GWAS |
| Chr 6 enrichment | Highest per-chr mean (0.0192) | HLA/MHC complex drives elevated FST; relevant to immune-mediated pregnancy loss |
| 376,206 overlapping markers | Genome-wide coverage | Sufficient density for population genetics analysis from genotyped (non-imputed) variants |
| 16.5% negative FST | 62,038 loci | Expected for closely related populations; reflects sampling noise |
Summary: The Uzbek cohort shows low population differentiation from Europeans (weighted FST ≈ 0.02 on the pre-QC set of 1,199 samples, ≈ 0.015 on the post-QC set of 1,047 samples), consistent with Central Asian geography. Approximately 450 loci show extreme differentiation (FST > 0.3), concentrated in the HLA region and scattered across the genome. These results inform population stratification correction and highlight regions requiring careful interpretation in downstream GWAS.

11. Next Steps

  • Gene Annotation: Annotate the 256 extreme FST loci (FST > 0.5) with nearest genes using ANNOVAR/VEP
  • FST Manhattan Plot: Create a genome-wide Manhattan plot of FST values for visual identification of differentiation peaks
  • Recalculate on Final Dataset: Re-run FST using the post-QC 1,047 samples for consistency with GWAS (summary values from this re-run are reported under Spring 2026 FST Results)
  • Cross-reference with GWAS: Flag any GWAS hits that overlap with high-FST regions
  • Compare with EAS/SAS: Calculate FST against East Asian and South Asian 1000G populations to decompose Uzbek ancestry components (genome-wide values for UZB vs EAS, SAS, and AFR are reported under Spring 2026 FST Results)
  • ADMIXTURE Integration: Combine FST with ADMIXTURE results (Step 10) for comprehensive ancestry characterization