Step 9: Fst Analysis (Uzbek vs EUR)

1. Overview

F_ST (fixation index) measures allele frequency differentiation between populations. Values range from 0 (no differentiation — identical allele frequencies) to 1 (complete fixation of different alleles). This step computes genome-wide Weir & Cockerham F_ST between our Uzbek cohort and 1000 Genomes Phase 3 European (EUR) samples to:

Quantify genetic distance — how differentiated are Uzbek samples from Europeans?
Identify highly differentiated loci — SNPs with extreme F_ST may be under population-specific selection or linked to ancestry-informative markers
Contextualize GWAS findings — if a GWAS hit has high F_ST, it may reflect population stratification rather than true disease association

Interpretation Guide

F_ST Range	Interpretation	Example
0.00–0.05	Little differentiation	Populations within same continental group
0.05–0.15	Moderate differentiation	EUR vs SAS (~0.05), EUR vs AMR (~0.06), EUR vs EAS (~0.11)
0.15–0.25	Great differentiation	EUR vs AFR (~0.15)
>0.25	Very great differentiation	Extreme outlier loci under selection

Spring 2026 F_ST Results

Multi-population F_ST computed on 79,767 merged SNPs using UZB (1,047), EUR (522), EAS (515), SAS (492), AFR (671) samples.

Comparison	Weighted F_ST	Mean F_ST
UZB vs EUR	0.01468	0.01002
UZB vs EAS	0.04010	0.02393
UZB vs SAS	0.01445	0.00988
UZB vs AFR	0.12907	0.06153
EUR vs EAS	0.08631	0.05090

Consistent with the Winter run: UZB is closest to EUR (0.015) and SAS (0.014), moderately distant from EAS (0.040), and most distant from AFR (0.129). The values are nearly identical to the Winter run, which supports reproducibility.

2. Input Data

Source	File(s)	Description
Uzbek Data	`uzbek_data.ped/.map`	1,199 Uzbek samples, 650,181 genotyped variants (hg38/GRCh38)
1000G Reference	`ALL.chr*.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz`	Phase 3 VCFs (hg19/GRCh37), all 2,504 samples
EUR Sample List	`european_samples.txt`	503 EUR samples (CEU, GBR, FIN, TSI, IBS)
LiftOver Chain	`hg38ToHg19.over.chain`	UCSC chain file for hg38 → hg19 coordinate conversion

Note on Sample Count: This analysis used 1,199 Uzbek samples from the pre-QC genotyped dataset (uzbek_data.ped), not the 1,047 post-imputation final dataset. The Fst analysis was performed in October–November 2025, concurrent with the QC pipeline (Steps 1–6). The additional samples include those later removed for missingness, IBD duplicates, and other QC filters. For downstream analyses requiring consistency with the GWAS dataset, F_ST should be recalculated on the final 1,047 samples.

3. Coordinate Liftover (hg38 → hg19)

The 1000 Genomes Phase 3 reference data uses hg19 (GRCh37) coordinates, while our Uzbek genotyping data was called on hg38 (GRCh38). To merge the datasets, we must convert Uzbek coordinates to hg19.

3.1 Convert PED/MAP to Binary PLINK

cd /staging/ALSU-analysis/Fst_analysis

# Convert PED/MAP to binary format
plink --file uzbek_data --make-bed --out uzbek_data_hg38

650181 variants loaded from .bim file.
1199 people (0 males, 0 females, 1199 ambiguous) loaded from .fam.
Warning: Variant 1 (post-sort) triallelic; setting rarest alleles missing.
[multiple triallelic warnings omitted]
Total genotyping rate is 0.96741.
--make-bed to uzbek_data_hg38.bed + uzbek_data_hg38.bim + uzbek_data_hg38.fam ... done.

3.2 Run UCSC LiftOver

Extract BED-format positions from the BIM file and convert coordinates using the UCSC liftOver tool:

# Create BED file from BIM (chr, start, end, SNP_ID)
# BIM columns: $1=chr  $2=snpID  $3=cM  $4=bp_position  $5=A1  $6=A2
# BED format requires 0-based start → $4-1; end stays 1-based → $4
awk '{print "chr"$1, $4-1, $4, $2}' OFS='\t' uzbek_data_hg38.bim > uzbek_hg38_positions.bed

# liftOver args: input.bed  chain_file  mapped_output.bed  unmapped_output.bed
liftOver uzbek_hg38_positions.bed hg38ToHg19.over.chain uzbek_hg19_lifted.bed unlifted.bed

# Create position update file (old_ID → new_position)
# BED $4=SNP_ID, $3=end coord (= 1-based position for single-bp SNPs)
awk '{print $4, $3}' uzbek_hg19_lifted.bed > update_positions.txt

# List of SNPs that successfully lifted
awk '{print $4}' uzbek_hg19_lifted.bed > snps_to_keep.txt
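
liftOver can place a variant on a different chromosome or on an alternate contig in the target assembly, while the update file built above carries only the new base-pair position. The sketch below lists such variants so they can be excluded or handled separately (for example with PLINK's --update-chr). The function name chr_changed_snps is hypothetical and not part of the original pipeline; it assumes both BED files are tab-separated with the SNP ID in column 4.

```shell
# Hypothetical helper (not part of the original pipeline).
# Prints: SNP_ID, chromosome before liftOver, chromosome after liftOver
# for every SNP whose chromosome label changed.
chr_changed_snps() {
    local input_bed=$1 lifted_bed=$2
    awk -F'\t' '
        NR == FNR { orig_chr[$4] = $1; next }      # first file: remember the input chromosome
        ($4 in orig_chr) && orig_chr[$4] != $1 {   # second file: chromosome differs after liftOver
            print $4 "\t" orig_chr[$4] "\t" $1
        }
    ' "$input_bed" "$lifted_bed"
}
```

A possible call is `chr_changed_snps uzbek_hg38_positions.bed uzbek_hg19_lifted.bed > chr_changed.txt`; an empty output file means no chromosome changed.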

3.3 Apply New Coordinates

# Extract only liftable SNPs
plink --bfile uzbek_data_hg38 \
    --extract snps_to_keep.txt \
    --make-bed \
    --out uzbek_data_hg19_temp

650181 variants loaded from .bim file.
1199 people loaded from .fam.
--extract: 647854 variants remaining.
--make-bed to uzbek_data_hg19_temp.bed + uzbek_data_hg19_temp.bim + uzbek_data_hg19_temp.fam ... done.

# --update-map FILE: 2-column file (SNP_ID  new_bp_position)
#   Replaces genomic coordinates in .bim for each matched variant
plink --bfile uzbek_data_hg19_temp \
    --update-map update_positions.txt \
    --make-bed \
    --out uzbek_data_hg19

647854 variants loaded from .bim file.
1199 people loaded from .fam.
--update-map: 647854 values updated.
Warning: Base-pair positions are now unsorted!
--make-bed to uzbek_data_hg19.bed + uzbek_data_hg19.bim + uzbek_data_hg19.fam ... done.

Variants Lifted: 647,854 / 650,181 (99.6% success rate)

Variants Lost: 2,327 (unmappable between assemblies)

4. Chromosome 22 Pilot (October 23, 2025)

Before processing all 22 chromosomes, a pilot was run on chr22 to validate the pipeline. The per-chromosome workflow has 7 steps:

4.1 Per-Chromosome Pipeline

# [1/7] Convert 1000G VCF to PLINK (EUR samples only)
# --keep FILE: 2-column file (FID IID) — retain only listed samples
plink --vcf 1000G_data/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
    --keep 1000G_data/european_samples_fixed.txt \
    --make-bed \
    --out 1000G_data/1000G_EUR_chr22

# [2/7] Extract Uzbek chr22
plink --bfile uzbek_data_hg19 --chr 22 --make-bed --out uzbek_chr22

# [3/7] Rename SNPs to chr:pos format (for matching across datasets)
# awk: overwrite $2 (SNP ID) with chr:pos so both datasets share naming scheme
awk '{$2=$1":"$4; print}' OFS='\t' uzbek_chr22.bim > uzbek_chr22_renamed.bim
cp uzbek_chr22.bed uzbek_chr22_renamed.bed
cp uzbek_chr22.fam uzbek_chr22_renamed.fam

awk '{$2=$1":"$4; print}' OFS='\t' 1000G_data/1000G_EUR_chr22.bim > 1000G_EUR_chr22_newnames.bim
cp 1000G_data/1000G_EUR_chr22.bed 1000G_EUR_chr22_newnames.bed
cp 1000G_data/1000G_EUR_chr22.fam 1000G_EUR_chr22_newnames.fam

# [4/7] Find overlapping positions
awk '{print $1":"$4}' uzbek_chr22_renamed.bim | sort > uzbek_chr22_pos.txt
awk '{print $1":"$4}' 1000G_EUR_chr22_newnames.bim | sort > 1000G_chr22_pos.txt
# comm -12: output only lines present in BOTH sorted files (set intersection)
comm -12 uzbek_chr22_pos.txt 1000G_chr22_pos.txt > positions_chr22.txt

# [5/7] Extract overlapping SNPs from both datasets
plink --bfile uzbek_chr22_renamed --extract positions_chr22.txt --make-bed --out uzbek_chr22_matched
plink --bfile 1000G_EUR_chr22_newnames --extract positions_chr22.txt --make-bed --out 1000G_EUR_chr22_matched

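# [6/7] Merge (first attempt, to detect allele conflicts)
# --bmerge PREFIX: merge current --bfile dataset with a second PLINK fileset
#   PLINK 1.9 does not flip strands automatically: variants that would end up with 3+ alleles
#   (often strand mismatches) stop the merge and are listed in <out>-merge.missnp.
#   Palindromic A/T and C/G SNPs on opposite strands are not detected this way.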
plink --bfile uzbek_chr22_matched --bmerge 1000G_EUR_chr22_matched \
    --make-bed --out merged_chr22_temp

# If 3+ allele conflicts arise, exclude problematic SNPs and re-merge:
plink --bfile uzbek_chr22_matched --exclude merged_chr22_temp-merge.missnp \
    --make-bed --out uzbek_chr22_clean
plink --bfile 1000G_EUR_chr22_matched --exclude merged_chr22_temp-merge.missnp \
    --make-bed --out 1000G_EUR_chr22_clean

# [7/7] Final merge
plink --bfile uzbek_chr22_clean --bmerge 1000G_EUR_chr22_clean \
    --make-bed --out merged_uzbek_EUR_chr22
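
Renaming variants to chr:pos in step 3 can create duplicate IDs when several records share a position, as happens at multiallelic sites or where an indel and a SNP start at the same base in the 1000 Genomes VCFs. A minimal sketch for spotting them before the merge follows; dup_chrpos_ids is a hypothetical helper name, and whether duplicates occurred in this dataset is not recorded in the source.

```shell
# Hypothetical helper (not part of the original pipeline).
# Prints each chr:pos ID that occurs more than once in a .bim file, once per ID.
dup_chrpos_ids() {
    awk '{ id = $1 ":" $4; if (++seen[id] == 2) print id }' "$1"
}
```

For example, `dup_chrpos_ids 1000G_data/1000G_EUR_chr22.bim > dup_chr22.txt` gives a list that could be passed to `plink --exclude` on the renamed filesets.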

4.2 Pilot F_ST

# Create population assignment file
# Format: FID IID GROUP (tab-separated)
# 1,199 Uzbek samples labeled "UZBEK"
# 503 EUR samples labeled "EUR"
# (populations.txt covers all 1,702 merged samples)

plink --bfile merged_uzbek_EUR_chr22 \
    --fst \
    --within populations.txt \
    --out uzbek_vs_eur_fst

5762 variants loaded from .bim file.
1702 people (0 males, 0 females, 1702 ambiguous) loaded from .fam.
--within: 2 clusters loaded, covering a total of 1702 people.
5762 markers with valid Fst estimates.
Mean Fst estimate: 0.0148483
Weighted Fst estimate: 0.0181685

Chr22 Pilot Successful: 5,762 overlapping SNPs between Uzbek and EUR, Mean F_ST = 0.0148, Weighted F_ST = 0.0182. Pipeline validated.

5. Genome-Wide Processing (November 17, 2025)

The validated per-chromosome pipeline was automated across all 22 autosomes using two shell scripts. The first script (process_all_chromosomes.sh) processed chr1–14 (skipping chr22 which was already done), and the second (continue_processing.sh) processed chr15–21.

5.1 Automated Pipeline Script

#!/bin/bash
set -e

echo "===== Processing all chromosomes for genome-wide Fst ====="
echo "Start time: $(date)"

for chr in {1..22}; do
    echo ""
    echo "===== CHROMOSOME ${chr} ====="

    # Skip chr22 (already done in pilot)
    if [ $chr -eq 22 ]; then
        echo "  Chr22 already processed, using existing files"
        continue
    fi

    echo "  [1/7] Converting 1000G VCF to PLINK..."
    plink --vcf 1000G_data/ALL.chr${chr}.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
          --keep 1000G_data/european_samples_fixed.txt \
          --make-bed \
          --out 1000G_data/1000G_EUR_chr${chr} \
          --silent

    echo "  [2/7] Extracting Uzbek chr${chr}..."
    plink --bfile uzbek_data_hg19 \
          --chr ${chr} \
          --make-bed \
          --out uzbek_chr${chr} \
          --silent

    echo "  [3/7] Renaming SNPs to chr:pos format..."
    awk '{$2=$1":"$4; print}' OFS='\t' uzbek_chr${chr}.bim > uzbek_chr${chr}_renamed.bim
    cp uzbek_chr${chr}.bed uzbek_chr${chr}_renamed.bed
    cp uzbek_chr${chr}.fam uzbek_chr${chr}_renamed.fam

    awk '{$2=$1":"$4; print}' OFS='\t' 1000G_data/1000G_EUR_chr${chr}.bim > 1000G_EUR_chr${chr}_newnames.bim
    cp 1000G_data/1000G_EUR_chr${chr}.bed 1000G_EUR_chr${chr}_newnames.bed
    cp 1000G_data/1000G_EUR_chr${chr}.fam 1000G_EUR_chr${chr}_newnames.fam

    echo "  [4/7] Finding overlapping SNPs..."
    awk '{print $1":"$4}' uzbek_chr${chr}_renamed.bim | sort > uzbek_chr${chr}_pos.txt
    awk '{print $1":"$4}' 1000G_EUR_chr${chr}_newnames.bim | sort > 1000G_chr${chr}_pos.txt
    comm -12 uzbek_chr${chr}_pos.txt 1000G_chr${chr}_pos.txt > positions_chr${chr}.txt

    overlap=$(wc -l < positions_chr${chr}.txt)
    echo "    Found ${overlap} overlapping SNPs"

    echo "  [5/7] Extracting overlapping SNPs..."
    plink --bfile uzbek_chr${chr}_renamed \
          --extract positions_chr${chr}.txt \
          --make-bed \
          --out uzbek_chr${chr}_matched \
          --silent

    plink --bfile 1000G_EUR_chr${chr}_newnames \
          --extract positions_chr${chr}.txt \
          --make-bed \
          --out 1000G_EUR_chr${chr}_matched \
          --silent

    echo "  [6/7] Attempting merge..."
    # PLINK exits non-zero and writes <out>-merge.missnp when variants have 3+ alleles.
    # Testing for that file is sturdier than grepping the silenced console output,
    # and "|| true" stops set -e from aborting on this expected failure.
    rm -f merged_chr${chr}_temp-merge.missnp
    plink --bfile uzbek_chr${chr}_matched \
          --bmerge 1000G_EUR_chr${chr}_matched \
          --make-bed \
          --out merged_chr${chr}_temp \
          --silent || true

    if [ -s merged_chr${chr}_temp-merge.missnp ]; then
        echo "    Merge conflicts detected, excluding problem SNPs..."
        plink --bfile uzbek_chr${chr}_matched \
              --exclude merged_chr${chr}_temp-merge.missnp \
              --make-bed \
              --out uzbek_chr${chr}_clean \
              --silent

        plink --bfile 1000G_EUR_chr${chr}_matched \
              --exclude merged_chr${chr}_temp-merge.missnp \
              --make-bed \
              --out 1000G_EUR_chr${chr}_clean \
              --silent

        echo "  [7/7] Final merge..."
        plink --bfile uzbek_chr${chr}_clean \
              --bmerge 1000G_EUR_chr${chr}_clean \
              --make-bed \
              --out merged_chr${chr} \
              --silent
    else
        # No conflict list: the first merge is final (mv fails, and set -e aborts, if it did not run)
        echo "  [7/7] Merge successful!"
        mv merged_chr${chr}_temp.bed merged_chr${chr}.bed
        mv merged_chr${chr}_temp.bim merged_chr${chr}.bim
        mv merged_chr${chr}_temp.fam merged_chr${chr}.fam
    fi

    snps=$(wc -l < merged_chr${chr}.bim)
    echo "  ✓ Chr${chr} complete: ${snps} SNPs"

    # Cleanup intermediate files
    rm -f uzbek_chr${chr}.* uzbek_chr${chr}_renamed.* uzbek_chr${chr}_matched.* uzbek_chr${chr}_clean.*
    rm -f 1000G_EUR_chr${chr}_newnames.* 1000G_EUR_chr${chr}_matched.* 1000G_EUR_chr${chr}_clean.*
    rm -f uzbek_chr${chr}_pos.txt 1000G_chr${chr}_pos.txt positions_chr${chr}.txt
    rm -f merged_chr${chr}_temp.*
done

echo ""
echo "===== All chromosomes processed! ====="
echo "End time: $(date)"

5.2 Merge All Chromosomes

# Create merge list (chr2-chr22, using chr1 as base)
cat > merge_list.txt << 'EOF'
merged_chr2
merged_chr3
merged_chr4
merged_chr5
merged_chr6
merged_chr7
merged_chr8
merged_chr9
merged_chr10
merged_chr11
merged_chr12
merged_chr13
merged_chr14
merged_chr15
merged_chr16
merged_chr17
merged_chr18
merged_chr19
merged_chr20
merged_chr21
merged_chr22
EOF

# Merge all chromosomes into one dataset
plink --bfile merged_chr1 \
    --merge-list merge_list.txt \
    --make-bed \
    --out merged_all_chrs

376208 variants and 1702 people pass filters and QC.
--make-bed to merged_all_chrs.bed + merged_all_chrs.bim + merged_all_chrs.fam ... done.
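
The merge list can also be generated rather than typed out. A minimal sketch, where make_merge_list is a hypothetical helper name and the merged_chrN prefixes are the ones produced in Section 5.1:

```shell
# Hypothetical helper (not part of the original pipeline).
# Prints one PLINK fileset prefix per line for chromosomes first..last.
make_merge_list() {
    local first=$1 last=$2 chr
    for ((chr = first; chr <= last; chr++)); do
        printf 'merged_chr%d\n' "$chr"
    done
}
```

Calling `make_merge_list 2 22 > merge_list.txt` would reproduce the 21-line list above, with chr1 still serving as the base fileset passed to --bfile.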

6. Genome-Wide F_ST Calculation

6.1 Population Assignment File

The populations.txt file assigns each sample to either UZBEK or EUR:

# Format: FID  IID  GROUP
# Example entries:
01-02   01-02   UZBEK
01-03   01-03   UZBEK
01-07   01-07   UZBEK
...
HG00096 HG00096 EUR
HG00097 HG00097 EUR
NA20832 NA20832 EUR

UZBEK samples: 1,199

EUR samples (1000G): 503 (CEU=99, GBR=91, FIN=99, TSI=107, IBS=107)

Total merged: 1,702
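
The source does not show how populations.txt was built. One way to produce a file in this format, sketched under the assumption that the Uzbek and EUR .fam files are still available separately (make_populations is a hypothetical helper name):

```shell
# Hypothetical helper (not part of the original pipeline).
# Emits FID, IID, GROUP (tab-separated) for every sample in the two .fam files.
make_populations() {
    local uzbek_fam=$1 eur_fam=$2
    awk -v OFS='\t' '{ print $1, $2, "UZBEK" }' "$uzbek_fam"
    awk -v OFS='\t' '{ print $1, $2, "EUR" }' "$eur_fam"
}
```

For example, `make_populations uzbek_data_hg19.fam 1000G_data/1000G_EUR_chr22.fam > populations.txt`, followed by `wc -l populations.txt` to confirm the expected 1,702 lines.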

6.2 Run F_ST

# --fst: Weir & Cockerham (1984) Fst estimator — measures allele frequency
#   divergence between populations; can produce slightly negative values
#   when within-group variance exceeds between-group variance (≈ zero)
# --within FILE: 3-column cluster file (FID  IID  GROUP_LABEL)
#   assigns each sample to a comparison group
plink --bfile merged_all_chrs \
    --fst \
    --within populations.txt \
    --out genomewide_uzbek_vs_eur_fst

PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)
Options in effect:
  --bfile merged_all_chrs
  --fst
  --within populations.txt
  --out genomewide_uzbek_vs_eur_fst
376208 variants loaded from .bim file.
1702 people (0 males, 0 females, 1702 ambiguous) loaded from .fam.
--within: 2 clusters loaded, covering a total of 1702 people.
Total genotyping rate is 0.978817.
376208 variants and 1702 people pass filters and QC.
Writing --fst report (2 populations) to genomewide_uzbek_vs_eur_fst.fst ... done.
376206 markers with valid Fst estimates (2 excluded).
Mean Fst estimate: 0.0160089
Weighted Fst estimate: 0.0204295

Genome-wide F_ST Complete:

Mean F_ST = 0.0160 (unweighted average across all SNPs)
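Weighted F_ST = 0.0204 (ratio of summed variance components across SNPs, which effectively weights each SNP by its expected heterozygosity; generally more robust than the unweighted mean)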
Valid markers: 376,206 (2 excluded due to invariant alleles)
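
The two summaries differ only in how per-SNP variance components are combined. As a sketch in Weir & Cockerham (1984) notation, and assuming PLINK's "weighted" estimate is the ratio-of-sums form (an assumption; the log above does not say), let a_l be the between-population component and b_l, c_l the within-population components at SNP l, over L valid SNPs:

```latex
\hat{\theta}_l = \frac{a_l}{a_l + b_l + c_l}, \qquad
\hat{F}_{ST}^{\mathrm{mean}} = \frac{1}{L} \sum_{l=1}^{L} \hat{\theta}_l, \qquad
\hat{F}_{ST}^{\mathrm{weighted}} = \frac{\sum_{l=1}^{L} a_l}{\sum_{l=1}^{L} \left( a_l + b_l + c_l \right)}
```

In the mean form every SNP counts equally, including noisy low-frequency ones. In the weighted form SNPs with larger total variance contribute more, and a negative per-SNP estimate simply corresponds to a_l < 0.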

Updated results (1,047 post-QC samples, 77,111 LD-pruned SNPs): UZB–EUR weighted F_ST = 0.0145 (mean 0.0100). The lower value reflects both the stricter QC (1,047 vs 1,199 samples) and the more aggressively LD-pruned SNP set. See Step 14 for the full 5-population matrix.

7. Results

7.1 F_ST Distribution

F_ST Range	Number of SNPs	Percentage	Interpretation
≥ 0.50	256	0.07%	Extreme differentiation — potential selection targets
0.30 – 0.50	193	0.05%	Very high differentiation
0.10 – 0.30	4,094	1.09%	Moderate to high differentiation
0.05 – 0.10	24,194	6.43%	Moderate differentiation
0.00 – 0.05	285,433	75.86%	Low differentiation (background level)
< 0.00	62,038	16.49%	Negative F_ST (sampling noise; treated as ~0)

About Negative F_ST: Weir & Cockerham's estimator can produce slightly negative values when within-population variance exceeds between-population variance — this reflects sampling noise at loci with very similar allele frequencies. The 16.5% of negative values are expected for closely related populations and are biologically equivalent to zero.
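
A table like the one in 7.1 can be tabulated directly from the .fst file. This is a minimal sketch, not the script used for the table: fst_bins is a hypothetical helper name, the bin edges are treated as lower-inclusive (an assumption), and rows with a nan estimate are skipped.

```shell
# Hypothetical helper (not part of the original pipeline).
# Input: PLINK .fst file (header line, then CHR SNP POS NMISS FST).
# Output: bin label, SNP count, percentage of valid SNPs (tab-separated).
fst_bins() {
    LC_ALL=C awk 'NR > 1 && $5 !~ /nan/ {
        f = $5 + 0
        if      (f >= 0.50) bin = "ge_0.50"
        else if (f >= 0.30) bin = "0.30-0.50"
        else if (f >= 0.10) bin = "0.10-0.30"
        else if (f >= 0.05) bin = "0.05-0.10"
        else if (f >= 0)    bin = "0.00-0.05"
        else                bin = "lt_0.00"
        n[bin]++; total++
    }
    END {
        split("ge_0.50 0.30-0.50 0.10-0.30 0.05-0.10 0.00-0.05 lt_0.00", order, " ")
        for (i = 1; i <= 6; i++)
            printf "%s\t%d\t%.2f%%\n", order[i], n[order[i]], (total ? 100 * n[order[i]] / total : 0)
    }' "$1"
}
```

A possible call is `fst_bins genomewide_uzbek_vs_eur_fst.fst`.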

7.2 Top 30 Most Differentiated Loci

Rank	Chr	Position (hg19)	SNP ID	F_ST	N (non-missing)
1	9	14,776,145	9:14776145	0.9982	1,701
2	1	40,954,877	1:40954877	0.9982	1,701
3	14	23,893,148	14:23893148	0.9877	1,696
4	21	19,169,133	21:19169133	0.9870	1,691
5	9	104,189,856	9:104189856	0.9869	1,699
6	8	24,813,391	8:24813391	0.9853	1,695
7	14	50,100,242	14:50100242	0.9847	1,694
8	17	48,275,339	17:48275339	0.9787	1,692
9	4	69,530,098	4:69530098	0.9782	1,684
10	11	47,355,475	11:47355475	0.9728	1,681
11	2	179,642,589	2:179642589	0.9711	1,683
12	6	30,585,771	6:30585771	0.9686	1,693
13	1	36,934,805	1:36934805	0.9668	1,688
14	4	187,003,729	4:187003729	0.9629	1,675
15	13	95,672,265	13:95672265	0.9602	1,670
16	1	161,199,431	1:161199431	0.9588	1,688
17	6	29,640,785	6:29640785	0.9584	1,696
18	8	48,701,786	8:48701786	0.9580	1,671
19	1	146,661,814	1:146661814	0.9552	1,676
20	4	70,354,534	4:70354534	0.9532	1,670
21	2	152,580,815	2:152580815	0.9521	1,691
22	16	89,833,576	16:89833576	0.9518	1,695
23	11	34,999,682	11:34999682	0.9494	1,672
24	6	31,059,825	6:31059825	0.9490	1,695
25	2	21,238,413	2:21238413	0.9486	1,673
26	5	162,896,759	5:162896759	0.9418	1,675
27	8	145,639,681	8:145639681	0.9417	1,659
28	10	96,818,119	10:96818119	0.9406	1,698
29	6	32,704,884	6:32704884	0.9388	1,676
30	11	17,476,460	11:17476460	0.9387	1,693

Notable observation about chr6 hits: Four of the top 30 hits (ranks 12, 17, 24, 29) are on chromosome 6 at positions 29.6–32.7 Mb. This region corresponds to the HLA/MHC complex (6p21.3), which is one of the most polymorphic regions of the human genome and is well-known for extreme population differentiation. The HLA region is also of potential relevance to pregnancy loss (immune tolerance of fetal alloantigens).
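
A ranking like the table in 7.2 can be pulled from the .fst file with awk and sort. A minimal sketch, where top_fst is a hypothetical helper name; F_ST is printed to six decimals so that values PLINK writes in scientific notation sort correctly.

```shell
# Hypothetical helper (not part of the original pipeline).
# Output columns: CHR, POS, SNP, FST, NMISS (tab-separated), highest F_ST first.
top_fst() {
    local fst_file=$1 n=${2:-30}
    LC_ALL=C awk 'NR > 1 && $5 !~ /nan/ {
        printf "%s\t%s\t%s\t%.6f\t%s\n", $1, $3, $2, $5, $4
    }' "$fst_file" |
        LC_ALL=C sort -t $'\t' -k4,4nr |
        head -n "$n"
}
```

For example, `top_fst genomewide_uzbek_vs_eur_fst.fst 30` would list the 30 most differentiated loci.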

7.3 Per-Chromosome Summary

Chromosome	Overlapping SNPs	Mean F_ST
Chr 1	29,115	0.0163
Chr 2	31,144	0.0166
Chr 3	26,057	0.0157
Chr 4	24,077	0.0159
Chr 5	21,920	0.0151
Chr 6	26,972	0.0192
Chr 7	21,175	0.0160
Chr 8	19,717	0.0157
Chr 9	17,831	0.0162
Chr 10	19,386	0.0153
Chr 11	18,107	0.0158
Chr 12	18,472	0.0163
Chr 13	13,812	0.0155
Chr 14	12,418	0.0154
Chr 15	11,874	0.0164
Chr 16	12,777	0.0159
Chr 17	10,945	0.0159
Chr 18	11,173	0.0143
Chr 19	8,244	0.0143
Chr 20	9,704	0.0155
Chr 21	5,526	0.0149
Chr 22	5,762	0.0148

Chr 6 stands out with the highest per-chromosome mean F_ST (0.0192), driven by the HLA/MHC region. Chromosomes 18 and 19 have the lowest mean F_ST (0.0143).
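
The per-chromosome figures can be recomputed from the genome-wide .fst file. This sketch assumes the table's "Mean F_ST" is the unweighted mean of per-SNP estimates (the source does not say), and per_chr_mean_fst is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the original pipeline).
# Output columns: chromosome, number of valid SNPs, unweighted mean F_ST.
per_chr_mean_fst() {
    LC_ALL=C awk 'NR > 1 && $5 !~ /nan/ { sum[$1] += $5; n[$1]++ }
        END { for (c in n) printf "%s\t%d\t%.4f\n", c, n[c], sum[c] / n[c] }' "$1" |
        LC_ALL=C sort -k1,1n
}
```

A possible call is `per_chr_mean_fst genomewide_uzbek_vs_eur_fst.fst`.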

8. Interpretation

8.1 Population Context

The weighted F_ST of 0.0204 between Uzbek and EUR populations indicates low but non-trivial genetic differentiation. For context:

Population Pair	Typical F_ST	Source
Within-European (e.g., CEU vs TSI)	0.002–0.006	1000 Genomes
Uzbek vs EUR (pre-QC, 376K SNPs)	0.020	This analysis
Uzbek vs EUR (post-QC, 77K SNPs)	0.015	Recalculation
EUR vs SAS (South Asian)	~0.020–0.040	1000 Genomes
EUR vs EAS (East Asian)	~0.10–0.11	1000 Genomes
EUR vs AFR (African)	~0.12–0.15	1000 Genomes

The Uzbek cohort is genetically closer to Europeans than South Asians are, consistent with Central Asia's geographic and historical position as a crossroads between Europe and Asia. This aligns with the global PCA results (Step 8) where Uzbek samples cluster between EUR and SAS.

8.2 Implications for GWAS

Population stratification is modest but real: F_ST = 0.02 means ~2% of allele frequency variance is between populations. This is enough to inflate GWAS test statistics if not corrected.
PCA covariates essential: Include global PCA components (from Step 8) as covariates in association models to control for this differentiation.
High-F_ST SNPs require caution: The 449 loci with F_ST > 0.3 are ancestry-informative markers. If these appear in GWAS results, they likely reflect population structure rather than disease biology.
HLA region needs special handling: The chr6 HLA hits are both population-differentiated AND biologically relevant to pregnancy. Cross-reference with GWAS hits carefully.
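
Flagging GWAS hits that coincide with high-F_ST SNPs can be sketched as a lookup against the .fst file. The helper name flag_high_fst_hits and the hits file (one chr:pos ID per line) are hypothetical, and the IDs must be on the same build as the .fst file (hg19 here), which the GWAS dataset may not be.

```shell
# Hypothetical helper (not part of the original pipeline).
# Prints: hit ID, F_ST for every hit whose F_ST exceeds the threshold (default 0.3).
flag_high_fst_hits() {
    local fst_file=$1 hits_file=$2 threshold=${3:-0.3}
    LC_ALL=C awk -v thr="$threshold" '
        NR == FNR {                                   # first file: collect high-F_ST SNP IDs
            if (FNR > 1 && $5 !~ /nan/ && $5 + 0 > thr + 0) high[$2] = $5
            next
        }
        ($1 in high) { print $1 "\t" high[$1] }       # second file: report matching hits
    ' "$fst_file" "$hits_file"
}
```

For example, `flag_high_fst_hits genomewide_uzbek_vs_eur_fst.fst gwas_hits_hg19.txt 0.3`, where gwas_hits_hg19.txt is a placeholder name.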

8.3 Highly Differentiated Loci — Potential Biology

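SNPs with F_ST near 1.0 are essentially fixed for different alleles in Uzbek vs EUR. With a genome-wide F_ST of only about 0.02, values this extreme are unexpected, so strand or allele-coding mismatches between the array data and 1000G should be ruled out before any biological interpretation. Of the loci that survive that check, many will reflect genetic drift and founder effects, and some may overlap known selection signals. The top loci on chr9 (14.8 Mb) and chr1 (40.9 Mb) with F_ST = 0.998 warrant gene annotation to identify nearby genes and potential functional consequences.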

Action Required: Annotate the top F_ST hits (F_ST > 0.5) with nearest gene(s) and cross-reference against known selection scans in Central Asian populations. Tools: ANNOVAR, Ensembl VEP, or UCSC Genome Browser.
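
Before annotation, it may be worth checking whether the top hits are palindromic (A/T or C/G) SNPs, since a strand mismatch at such a site goes unnoticed in a merge and can produce F_ST values near 1. A minimal sketch, where palindromic_snps is a hypothetical helper name; whether any top hit is palindromic is not recorded in the source.

```shell
# Hypothetical helper (not part of the original pipeline).
# Prints the IDs of A/T and C/G SNPs in a .bim file (columns 5 and 6 hold the alleles).
palindromic_snps() {
    awk '{
        a = toupper($5) toupper($6)
        if (a == "AT" || a == "TA" || a == "CG" || a == "GC") print $2
    }' "$1"
}
```

For example, `palindromic_snps merged_all_chrs.bim > palindromic.txt` gives a list to compare against the F_ST > 0.5 loci.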

9. Output Files

File	Location	Description
`uzbek_data_hg19.{bed,bim,fam}`	`/staging/ALSU-analysis/Fst_analysis/`	Uzbek genotypes in hg19 coordinates (647,854 variants)
`merged_all_chrs.{bed,bim,fam}`		Merged Uzbek + EUR dataset (376,208 variants, 1,702 people)
`populations.txt`		Population assignments (FID IID GROUP): 1,199 UZBEK + 503 EUR
`genomewide_uzbek_vs_eur_fst.fst`		Per-SNP F_ST values (CHR SNP POS NMISS FST)
`genomewide_uzbek_vs_eur_fst.log`		PLINK log with summary statistics
`uzbek_vs_eur_fst.fst`		Chr22 pilot F_ST results
`process_all_chromosomes.sh`		Automated pipeline script (chr1–14)
`continue_processing.sh`		Continuation script (chr15–21)

F_ST Output File Format

# genomewide_uzbek_vs_eur_fst.fst
# Columns: CHR  SNP  POS  NMISS  FST
CHR     SNP     POS     NMISS   FST
1       1:727841        727841  1673    -0.000559429
1       1:846808        846808  1669    -3.48052e-05
...
9       9:14776145      14776145        1701    0.998236
1       1:40954877      40954877        1701    0.998236

10. Key Findings

Finding	Value	Significance
Genome-wide Weighted F_ST	0.0145	Uzbek are equidistant from EUR and SAS (77K SNPs, 1,047 samples)
Extreme outliers (F_ST > 0.5)	256 loci	Potential selection targets or ancestry-informative markers to flag in GWAS
Chr 6 enrichment	Highest per-chr mean (0.0192)	HLA/MHC complex drives elevated F_ST; relevant to immune-mediated pregnancy loss
376,206 overlapping markers	Genome-wide coverage	Sufficient density for population genetics analysis from genotyped (non-imputed) variants
16.5% negative F_ST	62,038 loci	Expected for closely related populations; reflects sampling noise

Summary: The Uzbek cohort shows low population differentiation from Europeans (weighted F_ST ≈ 0.02 pre-QC, 0.015 post-QC), consistent with Central Asian geography. Approximately 450 loci show extreme differentiation (F_ST > 0.3); several cluster in the HLA region and the rest are scattered across the genome. These results inform population stratification correction and highlight regions requiring careful interpretation in downstream GWAS.

11. Next Steps

Gene Annotation: Annotate the 256 extreme F_ST loci (F_ST > 0.5) with nearest genes using ANNOVAR/VEP
Fst Manhattan Plot: Create a genome-wide Manhattan plot of F_ST values for visual identification of differentiation peaks
Recalculate on Final Dataset: Re-run F_ST using the post-QC 1,047 samples for consistency with GWAS (done: see the updated results in Section 6.2 and the Spring 2026 F_ST Results)
Cross-reference with GWAS: Flag any GWAS hits that overlap with high-F_ST regions
Compare with EAS/SAS: Calculate F_ST against East Asian and South Asian 1000G populations to decompose Uzbek ancestry components (done for EAS, SAS, and AFR: see the Spring 2026 F_ST Results)
ADMIXTURE Integration: Combine F_ST with ADMIXTURE results (Step 10) for comprehensive ancestry characterization
