Step 1: Sample Missingness Filter

1. Overview

Per-sample missingness (F_MISS) quantifies the fraction of genotype calls that failed for a given individual. Elevated missingness typically indicates technical failures during genotyping — degraded DNA, low input concentration, partial hybridisation on the array, or scanning artefacts. Retaining such samples introduces differential missingness bias: variants appear to differ in frequency between high-quality and low-quality samples even when no true genetic difference exists, producing spurious associations and inflated test statistics.

Rationale: Sample-level missingness filtering is the first QC step because it prevents high-missingness samples from influencing SNP-level statistics computed later (e.g., HWE p-values, allele frequency estimates). A relaxed 20% threshold was chosen for this clinical biobank array dataset to remove only catastrophic failures while retaining as much of the Central Asian cohort as possible. See Section 2 for threshold comparison with the literature.

Key Metrics

Metric	Value
Starting Samples	1,247
Samples Removed	99
Final Samples	1,148
Removal Rate	7.9%

★ Spring 2026 Re-analysis

This page documents the corrected Step 1 (99 removed → 1,148 retained). The original winter 2025 pipeline removed only 92 of these 99 samples — 7 samples with F_MISS > 0.20 were incorrectly retained and propagated through all downstream analyses. The root cause of the 92-vs-99 discrepancy was never identified.

Overlap with Step 0: All 99 removed samples were independently flagged by the chip-level investigation (84 chip failures, 12 high-missingness outliers, 3 contaminated). See Section 4 for the full cross-reference.

2. QC Parameter: Per-Sample Missingness

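PLINK computes per-sample missingness as F_MISS = N_MISS / N_GENO, where N_MISS is the number of genotype calls that failed (coded as missing) for the individual and N_GENO is the total number of attempted genotype calls. The --mind flag removes all individuals with F_MISS above a specified threshold. This step applies the same 0.20 cutoff explicitly: samples are listed from the --missing report and passed to --remove (Section 6).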
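As a minimal sketch of the formula above, the following recomputes F_MISS from the N_MISS and N_GENO columns of an .imiss-style table and prints it next to the reported value. The two rows are made-up illustrative values, not samples from this dataset; the six-column layout (FID, IID, MISS_PHENO, N_MISS, N_GENO, F_MISS) is the usual PLINK 1.x .imiss layout.

```shell
export LC_ALL=C

# Toy .imiss-style table (hypothetical values, not from the real dataset).
imiss=$(mktemp)
cat > "$imiss" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
F1 S1 N 5000 654027 0.007645
F2 S2 N 170000 654027 0.2599
EOF

# Recompute F_MISS = N_MISS / N_GENO (columns 4 and 5) and show it beside
# the value reported in column 6, both rounded to four decimals.
recomputed=$(awk 'NR>1 {printf "%s %.4f %.4f\n", $2, $4/$5, $6}' "$imiss")
echo "$recomputed"

rm -f "$imiss"
```

If the two numeric columns disagree for any row, the report was probably parsed with the wrong column indices.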

Parameter	Value Used	Typical Range	Biological Interpretation
`--mind`	0.20 (20%)	0.01–0.10	Samples with >20% missing genotypes are removed. This corresponds to more than 130,805 failed calls out of 654,027 (0.20 × 654,027 ≈ 130,805).
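The call-count equivalent of the threshold can be reproduced with a one-line calculation. This sketch assumes N_GENO equals the full variant count for every sample (no obligatory-missing genotypes); the variable names are illustrative.

```shell
n_variants=654027   # variant count of the input dataset
threshold=0.20      # F_MISS cutoff used in this step

# Largest whole number of missing calls that still satisfies F_MISS <= threshold;
# a sample with more missing calls than this exceeds the cutoff.
max_missing=$(awk -v n="$n_variants" -v t="$threshold" 'BEGIN {printf "%d", n * t}')
echo "$max_missing"
```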

Common Thresholds in the Literature

Threshold	Stringency	Use Case
1% (--mind 0.01)	Very strict	Large GWAS with high genotyping quality (e.g., UK Biobank)
2% (--mind 0.02)	Strict	Anderson et al. 2010 (Nature Protocols) recommended threshold
5% (--mind 0.05)	Moderate	Typical for population genetics studies
10% (--mind 0.10)	Relaxed	Biobank or clinical cohorts with variable DNA quality
20% (--mind 0.20)	Very relaxed	Used here — removes only catastrophic failures
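To see how the choice of threshold changes the number of excluded samples, one could count the rows of an .imiss report that exceed each cutoff in the table above. The sketch below uses a six-row toy table with invented values; on real data the file would be the --missing output.

```shell
export LC_ALL=C

# Toy .imiss-style table (hypothetical values, N_GENO fixed at 1000 for readability).
imiss=$(mktemp)
cat > "$imiss" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
F1 S1 N 8 1000 0.008
F2 S2 N 15 1000 0.015
F3 S3 N 30 1000 0.030
F4 S4 N 70 1000 0.070
F5 S5 N 150 1000 0.150
F6 S6 N 260 1000 0.260
EOF

# For each candidate threshold, count samples whose F_MISS (column 6) exceeds it.
counts=$(
  for t in 0.01 0.02 0.05 0.10 0.20; do
    n=$(awk -v t="$t" 'NR>1 && $6+0 > t+0 {n++} END {print n+0}' "$imiss")
    printf '%s %s\n' "$t" "$n"
  done
)
echo "$counts"

rm -f "$imiss"
```

Each output line is a threshold followed by the number of samples that would be removed at that threshold.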

3. Per-Sample Missingness Distribution

The histogram below shows the distribution of per-sample missingness rates (F_MISS) across all 1,247 input samples using 1%-wide bins. The vast majority of samples cluster below 2% missingness, with a long right tail representing the 99 samples exceeding the 20% threshold.

[Figure: Per-Sample Missingness Distribution (F_MISS). N = 1,247 samples; 1%-wide bins; red dashed line = 20% removal threshold.]

The first bin (0.00–0.01) contains 815 samples, dominating the scale. The zoomed chart below shows the distribution of the 99 removed samples (F_MISS > 0.20) in detail.
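The 1%-wide binning described above can be sketched with awk. The input rows are invented for illustration, and the small epsilon added before truncation is an assumption of this sketch: it guards against floating-point products such as 0.29 × 100 landing just below the bin edge.

```shell
export LC_ALL=C

# Toy .imiss-style table (hypothetical values, not from the real dataset).
imiss=$(mktemp)
cat > "$imiss" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
F1 S1 N 78 10000 0.0078
F2 S2 N 92 10000 0.0092
F3 S3 N 150 10000 0.0150
F4 S4 N 2085 10000 0.2085
F5 S5 N 2113 10000 0.2113
F6 S6 N 4156 10000 0.4156
EOF

# Assign each sample to a 1%-wide bin [b/100, (b+1)/100) and count per bin.
hist=$(awk 'NR>1 {bin = int($6 * 100 + 1e-9); count[bin]++}
  END {for (b in count) printf "%.2f-%.2f %d\n", b/100, (b+1)/100, count[b]}' \
  "$imiss" | sort -n)
echo "$hist"

rm -f "$imiss"
```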

[Figure: Removed Samples, Zoomed View (F_MISS > 0.20). N = 99 removed samples; 1%-wide bins.]

Missingness Summary Statistics

Source: ConvSK_raw_miss.imiss on Biotech2024 (/staging/ALSU-analysis/spring2026/)

Statistic	All 1,247	Retained 1,148	Removed 99
Mean F_MISS	0.0406	0.0198	not reported (range 0.2085–0.4156)
Std Dev	0.0781	0.0320	—
Minimum	0.0078	0.0078	0.2085 (08-507)
Maximum	0.4156	0.1979	0.4156 (08-621)

Interpretation: The retained samples have a mean missingness of ~2%, indicating excellent overall genotyping quality. The 99 removed samples form a distinct cluster with F_MISS ranging from 20.9% to 41.6%, far from the bulk of the distribution. This bimodal pattern (most samples near 1%, a small group at 20–42%) is consistent with plate-specific or batch-specific genotyping failures rather than gradual degradation.
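Group-wise statistics of this kind can be computed directly from an .imiss report. The following is a sketch on invented values; it uses the sample standard deviation (n − 1 denominator), which is an assumption, since the report does not state which estimator was used.

```shell
export LC_ALL=C

# Toy .imiss-style table (hypothetical values, not from the real dataset).
imiss=$(mktemp)
cat > "$imiss" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
F1 S1 N 1 100 0.01
F2 S2 N 2 100 0.02
F3 S3 N 3 100 0.03
F4 S4 N 25 100 0.25
F5 S5 N 35 100 0.35
EOF

# Split samples at the 0.20 cutoff and accumulate n, sum, sum of squares, min, max.
stats=$(awk 'NR>1 {
    x = $6 + 0
    g = (x > 0.20) ? "removed" : "retained"
    n[g]++; s[g] += x; ss[g] += x * x
    if (!(g in mn) || x < mn[g]) mn[g] = x
    if (!(g in mx) || x > mx[g]) mx[g] = x
  }
  END {
    for (g in n) {
      mean = s[g] / n[g]
      var = (n[g] > 1) ? (ss[g] - n[g] * mean * mean) / (n[g] - 1) : 0
      if (var < 0) var = 0   # guard against tiny negative rounding error
      printf "%s n=%d mean=%.4f sd=%.4f min=%.4f max=%.4f\n", g, n[g], mean, sqrt(var), mn[g], mx[g]
    }
  }' "$imiss" | sort)
echo "$stats"

rm -f "$imiss"
```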

4. Cross-Reference with Step 0 Investigation

The Step 0 chip-level investigation independently categorised all 1,247 samples by examining chip barcodes, raw scan metrics, and contamination indicators. Every one of the 99 samples removed by the F_MISS > 0.20 filter had already been flagged by the investigation. This 100% overlap indicates that the filter removed only samples with an independently documented technical failure.

Breakdown by Investigation Category

Step 0 Category	Count	Description
`chip_failure`	84	Samples on catastrophically failed chip barcodes (208993030112, 208993030034, 208993030080, 208993030109)
`high_fmiss`	12	Individual high-missingness outliers on otherwise functional chips
`contaminated`	3	Samples flagged for DNA contamination (08-25, 08-495, 08-701)
Total	99	100% overlap with missingness filter
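A breakdown like the one above can be produced by joining the removal list against the Step 0 verdicts. The sketch below assumes a hypothetical tab-separated verdict file with columns FID, IID, category; the actual Step 0 output format is not shown in this report, and all IDs here are invented.

```shell
export LC_ALL=C

step0=$(mktemp)     # hypothetical Step 0 verdicts: FID, IID, category
removed=$(mktemp)   # removal list: FID, IID
printf 'A1\tS1\tchip_failure\nA2\tS2\thigh_fmiss\nA3\tS3\tcontaminated\nA4\tS4\tpass\n' > "$step0"
printf 'A1\tS1\nA2\tS2\nA3\tS3\n' > "$removed"

# Load the verdicts, then tally the category of every removed sample.
# Samples absent from the verdict file are counted as "unflagged".
overlap=$(awk -F'\t' 'NR==FNR {category[$1 FS $2] = $3; next}
  {k = $1 FS $2; c = (k in category) ? category[k] : "unflagged"; n[c]++}
  END {for (c in n) print c, n[c]}' "$step0" "$removed" | sort)
echo "$overlap"

rm -f "$step0" "$removed"
```

A full overlap shows up as the absence of an "unflagged" line in the output.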

The 7-Sample Discrepancy (Winter 2025 vs Spring 2026)

The original winter 2025 pipeline removed only 92 samples despite the same 20% threshold. Seven samples with F_MISS well above 0.20 were incorrectly retained. The root cause was never identified. All 7 have F_MISS ranging from 0.2605 to 0.3732 — these are not borderline cases.

FID	IID	F_MISS	Step 0 Category
840	08-25	0.3308	contaminated
458	08-365	0.3732	high_fmiss
862	08-495	0.3258	contaminated
499	08-701	0.2908	contaminated
886	08-77	0.2605	high_fmiss
898	08-825	0.3605	chip_failure
910	12-11	0.3719	high_fmiss

Implication: These 7 samples (3 contaminated, 3 high-missingness outliers, 1 chip failure) propagated through all winter 2025 downstream analyses (Steps 2–15). The spring 2026 re-analysis corrects this by using the verified 99-sample removal list.
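The samples missing from an older removal list can be isolated by comparing the two lists line by line. This is a sketch with invented IDs; the two temporary files stand in for the winter 2025 and spring 2026 removal lists, whose actual paths are not assumed here.

```shell
export LC_ALL=C

winter=$(mktemp)   # stand-in for the older removal list
spring=$(mktemp)   # stand-in for the corrected removal list
printf 'A1\tS1\nA3\tS3\n' > "$winter"
printf 'A1\tS1\nA2\tS2\nA3\tS3\n' > "$spring"

# comm requires sorted input; -13 keeps only lines unique to the second file,
# i.e. samples in the corrected list that the older list missed.
sort "$winter" -o "$winter"
sort "$spring" -o "$spring"
missed=$(comm -13 "$winter" "$spring")
n_missed=$(printf '%s\n' "$missed" | grep -c .)
echo "$missed"
echo "missed: $n_missed"

rm -f "$winter" "$spring"
```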

5. Input & Output Data

Input

Files	ConvSK_raw.bed, ConvSK_raw.bim, ConvSK_raw.fam (symlinks to winter 2025 originals)
Location	`/staging/ALSU-analysis/spring2026/`
Format	PLINK binary format (.bed/.bim/.fam)
Samples	1,247 individuals
Variants	654,027 SNPs (Illumina GSA-24v3-0_A2 array)

Output

Files	ConvSK_mind20.bed, ConvSK_mind20.bim, ConvSK_mind20.fam
Location	`/staging/ALSU-analysis/spring2026/`
Samples	1,148 individuals (99 removed)
Variants	654,027 SNPs (unchanged — sample filter does not affect SNP count)

Intermediate Files

`ConvSK_raw_miss.imiss`	Per-individual missingness (1,247 rows)
`ConvSK_raw_miss.lmiss`	Per-SNP missingness (654,027 rows)
`remove_miss20.txt`	FID/IID pairs for 99 samples with F_MISS > 0.20

6. Commands Executed

Step 1a: Calculate per-sample missingness

cd /staging/ALSU-analysis/spring2026/

plink --bfile ConvSK_raw \
  --missing \
  --out ConvSK_raw_miss

# Produces:
#   ConvSK_raw_miss.imiss  (per-individual missingness)
#   ConvSK_raw_miss.lmiss  (per-SNP missingness)

Step 1b: Extract samples with F_MISS > 0.20

# Column 6 of .imiss = F_MISS; NR>1 skips the header row
awk 'NR>1 && $6+0 > 0.20 {print $1"\t"$2}' \
  ConvSK_raw_miss.imiss > remove_miss20.txt

wc -l remove_miss20.txt
# 99 remove_miss20.txt

Step 1c: Remove high-missingness samples

plink --bfile ConvSK_raw \
  --remove remove_miss20.txt \
  --make-bed \
  --out ConvSK_mind20

# Expected output: 99 people removed, 1148 remaining.
# Result: ConvSK_mind20.bed/bim/fam (654,027 variants x 1,148 samples)
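After a removal step like this, a quick consistency check is to confirm that the sample counts add up and that no listed ID survives in the output .fam. The sketch below runs on three small stand-in files with invented IDs, so it does not call PLINK itself.

```shell
raw_fam=$(mktemp); out_fam=$(mktemp); remove=$(mktemp)

# Stand-in for the input .fam (FID IID PID MID SEX PHENO).
cat > "$raw_fam" <<'EOF'
A1 S1 0 0 0 -9
A2 S2 0 0 0 -9
A3 S3 0 0 0 -9
A4 S4 0 0 0 -9
EOF
printf 'A2\tS2\n' > "$remove"
# Stand-in for the .fam that plink --remove would write.
cat > "$out_fam" <<'EOF'
A1 S1 0 0 0 -9
A3 S3 0 0 0 -9
A4 S4 0 0 0 -9
EOF

n_raw=$(( $(wc -l < "$raw_fam") ))
n_rm=$(( $(wc -l < "$remove") ))
n_out=$(( $(wc -l < "$out_fam") ))

# Count removal-list IDs that are still present in the output .fam.
leftover=$(awk 'NR==FNR {rm[$1 FS $2]; next} ($1 FS $2) in rm' "$remove" "$out_fam" | wc -l)
leftover=$(( leftover ))

if [ $(( n_raw - n_rm )) -eq "$n_out" ] && [ "$leftover" -eq 0 ]; then
  status=OK
else
  status=MISMATCH
fi
echo "$n_raw - $n_rm = $n_out, leftover = $leftover: $status"

rm -f "$raw_fam" "$out_fam" "$remove"
```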

7. Removed Samples

99 samples were removed for exceeding the 20% missingness threshold. Their F_MISS values range from 0.2085 (08-507) to 0.4156 (08-621).

All 99 removed samples with F_MISS values

Source: ConvSK_raw_miss.imiss + remove_miss20.txt on Biotech2024 (/staging/ALSU-analysis/spring2026/). ★ = not in winter 2025 removal list (7 extra samples)

FID      IID        F_MISS
    08-365     0.3732  ★
    08-701     0.2908  ★
    08-25      0.3308  ★
    08-495     0.3258  ★
    08-77      0.2605  ★
    08-825     0.3605  ★
    12-11      0.3719  ★
   08-664     0.2487
   08-438     0.2362
   02-55      0.2564
   02-99      0.2748
   04-51      0.3115
   08-68      0.2113
   02-96      0.3713
   08-48      0.2461
   02-103     0.3567
   03-328     0.2810
   03-72      0.2843
   03-73      0.2856
   03-85      0.2896
   07-42      0.2883
   07-79      0.2913
   07-96      0.3048
   08-290     0.2913
   08-291     0.2928
   08-292     0.2807
   08-296     0.2927
   08-297     0.3020
   08-298     0.2762
   08-305     0.2821
   08-306     0.2785
   08-309     0.2981
   08-310     0.2948
   08-311     0.3063
   08-314     0.3037
   08-317     0.2933
   08-321     0.2943
   08-322     0.2819
   08-323     0.2936
   08-326     0.3144
   08-331     0.3196
   08-332     0.2567
   08-336     0.2709
   08-341     0.2582
   08-352     0.2438
   08-356     0.2285
   08-360     0.2553
   08-374     0.2637
   08-386     0.3078
   08-390     0.2908
   08-391     0.3182
   08-392     0.3345
   08-399     0.2698
   08-407     0.2579
   08-410     0.2514
   08-411     0.2525
   08-416     0.2499
   08-419     0.2308
   08-421     0.2432
   08-430     0.2638
   08-433     0.3217
   08-487     0.3156
   08-490     0.3166
   08-505     0.2807
   08-507     0.2085
   08-568     0.2482
   08-582     0.2853
   08-59      0.3112
   08-598     0.3664
   08-621     0.4156
   08-747     0.2309
   08-749     0.2858
   08-754     0.3210
   08-761     0.3477
   08-762     0.4032
   08-766     0.2315
   08-768     0.2513
   08-778     0.2385
   08-779     0.2554
   08-780     0.2513
   08-781     0.2580
   08-782     0.2562
   08-783     0.2540
   08-784     0.2511
   08-786     0.2575
   08-787     0.2661
   08-788     0.2621
   08-789     0.2303
   08-792     0.2555
   08-793     0.2534
   08-794     0.2608
   08-825d    0.2573
   15-12M     0.2577
   20-07      0.2616
   08-425     0.2585
   08-454     0.2447
   08-412     0.2548
   08-492     0.2622
   08-403     0.2669

8. Quality Verification

✓ Post-filter verification:

remove_miss20.txt contains exactly 99 lines (FID/IID pairs, tab-separated)
All 99 have F_MISS > 0.20 (min = 0.2085, max = 0.4156)
All 1,148 retained samples have F_MISS ≤ 0.1979
No variants removed — 654,027 SNPs carried forward unchanged
100% overlap with Step 0 investigation verdicts

wc -l remove_miss20.txt
# 99 remove_miss20.txt

# Count samples above the threshold in the raw report (should match the removal list):
awk 'NR>1 && $6+0 > 0.20' ConvSK_raw_miss.imiss | wc -l
# 99

# Highest F_MISS among samples at or below the threshold (the retained set):
awk 'NR>1 && $6+0 <= 0.20 && $6+0 > max {max = $6+0} END {print max}' \
  ConvSK_raw_miss.imiss
# 0.1979
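A stricter check is to test the filtered dataset itself: join the output .fam against the raw .imiss report and count retained samples whose F_MISS exceeds the cutoff. This is a sketch on invented stand-in files; on real data the inputs would be the filtered .fam and the --missing report.

```shell
fam=$(mktemp); imiss=$(mktemp)

# Stand-in for the filtered .fam (retained samples only).
cat > "$fam" <<'EOF'
A1 S1 0 0 0 -9
A2 S2 0 0 0 -9
EOF
# Stand-in for the raw per-sample missingness report.
cat > "$imiss" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
A1 S1 N 1 100 0.01
A2 S2 N 15 100 0.15
A3 S3 N 30 100 0.30
EOF

# Read retained IDs from the .fam, then count report rows (header skipped)
# that belong to a retained sample and exceed the 0.20 cutoff.
violations=$(awk 'NR==FNR {kept[$1 FS $2]; next}
  FNR>1 && (($1 FS $2) in kept) && $6+0 > 0.20' "$fam" "$imiss" | wc -l)
violations=$(( violations ))
echo "retained samples above threshold: $violations"

rm -f "$fam" "$imiss"
```

A result of zero means that every sample carried forward is at or below the cutoff.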

9. Chronological Log

Winter 2025 (original run)

Initial pipeline
Missingness filter applied at /staging/ALSU-analysis/winter2025/PLINK_301125_0312/. The removal list contained 92 samples; 7 samples with F_MISS > 0.20 were incorrectly retained. All downstream analyses (Steps 2–15) used this incorrectly filtered dataset.

Spring 2026 (re-analysis)

Workspace set up
Created /staging/ALSU-analysis/spring2026/ with symlinks to unchanged raw PLINK files.

Spring 2026

Missingness recomputed
plink --missing on ConvSK_raw: .imiss (1,247 samples) and .lmiss (654,027 SNPs) produced.

Spring 2026

Correct removal list generated
99 samples with F_MISS > 0.20 extracted to remove_miss20.txt. Cross-referenced with Step 0: 84 chip_failure + 12 high_fmiss + 3 contaminated = 99.

Spring 2026

Samples removed
plink --remove: 1,247 → 1,148 samples. Output: ConvSK_mind20.