
Step 1: Sample Missingness Filter

Remove samples with excessive missing genotypes to ensure downstream analytical reliability

✓ Completed — Spring 2026 (re-analysis)

1. Overview

Per-sample missingness (F_MISS) quantifies the fraction of genotype calls that failed for a given individual. Elevated missingness typically indicates technical failures during genotyping — degraded DNA, low input concentration, partial hybridisation on the array, or scanning artefacts. Retaining such samples introduces differential missingness bias: variants appear to differ in frequency between high-quality and low-quality samples even when no true genetic difference exists, producing spurious associations and inflated test statistics.

Rationale: Sample-level missingness filtering is the first QC step because it prevents high-missingness samples from influencing SNP-level statistics computed later (e.g., HWE p-values, allele frequency estimates). A relaxed 20% threshold was chosen for this clinical biobank array dataset to remove only catastrophic failures while retaining as much of the Central Asian cohort as possible. See Section 2 for threshold comparison with the literature.

Key Metrics

Starting Samples: 1,247
Samples Removed: 99
Final Samples: 1,148
Removal Rate: 7.9%

★ Spring 2026 Re-analysis
This page documents the corrected Step 1 (99 removed → 1,148 retained). The original winter 2025 pipeline removed only 92 of these 99 samples — 7 samples with F_MISS > 0.20 were incorrectly retained and propagated through all downstream analyses. The root cause of the 92-vs-99 discrepancy was never identified.

Overlap with Step 0: All 99 removed samples were independently flagged by the chip-level investigation (84 chip failures, 12 high-missingness outliers, 3 contaminated). See Section 4 for the full cross-reference.

2. QC Parameter: Per-Sample Missingness

PLINK computes per-sample missingness as F_MISS = N_MISS / N_GENO, where N_MISS is the number of genotype calls that failed (coded as missing) and N_GENO is the total number of attempted genotype calls. The --mind flag removes all individuals with F_MISS above a specified threshold.

| Parameter | Value Used | Typical Range | Biological Interpretation |
|---|---|---|---|
| --mind | 0.20 (20%) | 0.01–0.10 | Samples with >20% missing genotypes are removed. With 654,027 attempted calls per sample, this corresponds to more than 130,805 failed calls (0.20 × 654,027 ≈ 130,805). |
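
The call count quoted for the 20% threshold can be reproduced with a line of arithmetic. This is a minimal sketch using only the two figures from the table; the variable names are illustrative, and it assumes every sample was attempted at all 654,027 variants.

```shell
#!/bin/sh
# Figures taken from the parameter table; variable names are illustrative.
n_variants=654027   # genotype calls attempted per sample (assumed equal for all samples)
threshold=0.20      # per-sample missingness threshold

# Largest number of missing calls a sample can have while still satisfying
# F_MISS <= threshold, i.e. floor(threshold * n_variants).
max_missing=$(awk -v n="$n_variants" -v t="$threshold" 'BEGIN { printf "%d", int(n * t) }')

echo "Samples with more than $max_missing missing calls exceed the threshold"
```

In practice N_GENO can differ slightly between samples, so the exact cut-off in calls may vary by individual; F_MISS itself is the quantity the filter compares.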

Common Thresholds in the Literature

| Threshold | Stringency | Use Case |
|---|---|---|
| 1% (--mind 0.01) | Very strict | Large GWAS with high genotyping quality (e.g., UK Biobank) |
| 2% (--mind 0.02) | Strict | Anderson et al. 2010 (Nature Protocols) recommended threshold |
| 5% (--mind 0.05) | Moderate | Typical for population genetics studies |
| 10% (--mind 0.10) | Relaxed | Biobank or clinical cohorts with variable DNA quality |
| 20% (--mind 0.20) | Very relaxed | Used here; removes only catastrophic failures |
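
To see how the choice of threshold changes the number of removed samples, all five thresholds can be applied to an .imiss file in a single pass. The sketch below runs on a made-up six-sample file that follows the PLINK 1.9 .imiss column layout (F_MISS in column 6); the data and variable names are illustrative and not taken from this cohort.

```shell
#!/bin/sh
# Toy .imiss-style file: FID IID MISS_PHENO N_MISS N_GENO F_MISS (rows are invented).
imiss=$(mktemp)
cat > "$imiss" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
1 s1 N 5 1000 0.005
2 s2 N 15 1000 0.015
3 s3 N 40 1000 0.04
4 s4 N 80 1000 0.08
5 s5 N 150 1000 0.15
6 s6 N 300 1000 0.3
EOF

# For each candidate threshold, count samples with F_MISS strictly above it.
counts=$(awk 'BEGIN { n = split("0.01 0.02 0.05 0.10 0.20", thr, " ") }
NR > 1 { for (i = 1; i <= n; i++) if ($6 + 0 > thr[i] + 0) removed[i]++ }
END { for (i = 1; i <= n; i++) printf "%s %d\n", thr[i], removed[i] }' "$imiss")

echo "$counts"   # one line per threshold: "<threshold> <samples removed>"
rm -f "$imiss"
```

Pointing the same awk program at the real .imiss file would show how many samples each stricter threshold would cost.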

3. Per-Sample Missingness Distribution

The histogram below shows the distribution of per-sample missingness rates (F_MISS) across all 1,247 input samples using 1%-wide bins. The vast majority of samples cluster below 2% missingness, with a long right tail representing the 99 samples exceeding the 20% threshold.

[Figure: Per-Sample Missingness Distribution (F_MISS). N = 1,247 samples · 1%-wide bins · red dashed line = 20% removal threshold]

The first bin (0.00–0.01) contains 815 samples, dominating the scale. The zoomed chart below shows the distribution of the 99 removed samples (F_MISS > 0.20) in detail.

[Figure: Removed Samples, Zoomed View (F_MISS > 0.20). N = 99 removed samples · 1%-wide bins]
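
The 1%-wide binning used for both histograms can be approximated directly from an .imiss file. This is a sketch under assumptions: the input rows are invented, the bin edges are taken as [k/100, (k+1)/100), and the small epsilon is only there to guard against floating-point values such as 0.29 × 100 landing just below an integer.

```shell
#!/bin/sh
# Toy .imiss-style file (invented rows; F_MISS is column 6).
imiss=$(mktemp)
cat > "$imiss" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
1 s1 N 8 1000 0.008
2 s2 N 9 1000 0.009
3 s3 N 12 1000 0.012
4 s4 N 215 1000 0.215
5 s5 N 219 1000 0.219
EOF

# Assign each sample to a 1%-wide bin and print the non-empty bins in order.
hist=$(awk 'NR > 1 { bin = int(($6 + 0) * 100 + 1e-9); count[bin]++ }
END {
    for (b = 0; b <= 100; b++)
        if (b in count) printf "%.2f-%.2f %d\n", b / 100, (b + 1) / 100, count[b]
}' "$imiss")

echo "$hist"
rm -f "$imiss"
```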

Missingness Summary Statistics

Source: ConvSK_raw_miss.imiss on Biotech2024 (/staging/ALSU-analysis/spring2026/)

| Statistic | All 1,247 | Retained 1,148 | Removed 99 |
|---|---|---|---|
| Mean F_MISS | 0.0406 | 0.0198 | 0.2085–0.4156 (range) |
| Std Dev | 0.0781 | 0.0320 | |
| Minimum | 0.0078 | 0.0078 | 0.2085 (08-507) |
| Maximum | 0.4156 | 0.1979 | 0.4156 (08-621) |

Interpretation: The retained samples have a mean missingness of ~2%, indicating excellent overall genotyping quality. The 99 removed samples form a distinct cluster with F_MISS ranging from 20.9% to 41.6%, far from the bulk of the distribution. This bimodal pattern (most samples near 1%, a small group at 20–42%) is consistent with plate-specific or batch-specific genotyping failures rather than gradual degradation.
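
Per-group summary statistics of this kind can be derived from the .imiss file in one awk pass. The sketch below is illustrative only: it uses four invented rows, splits samples at the 0.20 threshold, and reports count, mean, minimum, and maximum (standard deviation is left out because the page does not say which estimator was used).

```shell
#!/bin/sh
# Toy .imiss-style file (invented rows; F_MISS is column 6).
imiss=$(mktemp)
cat > "$imiss" <<'EOF'
FID IID MISS_PHENO N_MISS N_GENO F_MISS
1 s1 N 10 1000 0.01
2 s2 N 30 1000 0.03
3 s3 N 250 1000 0.25
4 s4 N 350 1000 0.35
EOF

# Group each sample as retained (F_MISS <= t) or removed (F_MISS > t),
# then accumulate count, sum, min, and max per group.
stats=$(awk -v t=0.20 'NR > 1 {
    f = $6 + 0
    g = (f > t) ? "removed" : "retained"
    n[g]++; sum[g] += f
    if (!(g in lo) || f < lo[g]) lo[g] = f
    if (!(g in hi) || f > hi[g]) hi[g] = f
}
END {
    split("retained removed", order, " ")
    for (i = 1; i <= 2; i++) {
        g = order[i]
        if (n[g] > 0)
            printf "%s n=%d mean=%.4f min=%.4f max=%.4f\n", g, n[g], sum[g] / n[g], lo[g], hi[g]
    }
}' "$imiss")

echo "$stats"
rm -f "$imiss"
```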

4. Cross-Reference with Step 0 Investigation

The Step 0 chip-level investigation independently categorised all 1,247 samples by examining chip barcodes, raw scan metrics, and contamination indicators. Every one of the 99 samples removed by the F_MISS > 0.20 filter had already been flagged by the investigation. This 100% overlap confirms that each sample removed by the missingness threshold has an independently documented technical failure.

Breakdown by Investigation Category

| Step 0 Category | Count | Description |
|---|---|---|
| chip_failure | 84 | Samples on catastrophically failed chip barcodes (208993030112, 208993030034, 208993030080, 208993030109) |
| high_fmiss | 12 | Individual high-missingness outliers on otherwise functional chips |
| contaminated | 3 | Samples flagged for DNA contamination (08-25, 08-495, 08-701) |
| Total | 99 | 100% overlap with missingness filter |
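
A per-category tally like the one above could be produced by joining the removal list against the Step 0 verdicts. The format of the Step 0 output is not described on this page, so the three-column layout (FID, IID, category, tab-separated) and both file names below are assumptions; the three flagged rows reuse IDs and categories from the discrepancy table, and the fourth row is invented.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Toy removal list: FID and IID, tab-separated.
printf '840\t08-25\n898\t08-825\n910\t12-11\n' > "$workdir/remove_list.txt"

# Hypothetical Step 0 verdict file: FID, IID, category (layout is an assumption).
printf '840\t08-25\tcontaminated\n898\t08-825\tchip_failure\n910\t12-11\thigh_fmiss\n1\t01-01\tpass\n' \
    > "$workdir/step0_verdicts.txt"

# Load the verdicts, then tally the category of every sample in the removal list.
# Samples missing from the verdict file are counted as "unflagged".
tally=$(awk -F'\t' 'NR == FNR { cat[$1 FS $2] = $3; next }
{ key = $1 FS $2; c = (key in cat) ? cat[key] : "unflagged"; count[c]++ }
END { for (c in count) printf "%s %d\n", c, count[c] }' \
    "$workdir/step0_verdicts.txt" "$workdir/remove_list.txt" | sort)

echo "$tally"
rm -rf "$workdir"
```

An "unflagged" line in the output would indicate a removed sample that Step 0 did not flag, which is the case the 100% overlap rules out.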

The 7-Sample Discrepancy (Winter 2025 vs Spring 2026)

The original winter 2025 pipeline removed only 92 samples despite the same 20% threshold. Seven samples with F_MISS well above 0.20 were incorrectly retained. The root cause was never identified. All 7 have F_MISS ranging from 0.2605 to 0.3732 — these are not borderline cases.

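| FID | IID | F_MISS | Step 0 Category |
|---|---|---|---|
| 840 | 08-25 | 0.3308 | contaminated |
| 458 | 08-365 | 0.3732 | high_fmiss |
| 862 | 08-495 | 0.3258 | contaminated |
| 499 | 08-701 | 0.2908 | contaminated |
| 886 | 08-77 | 0.2605 | high_fmiss |
| 898 | 08-825 | 0.3605 | chip_failure |
| 910 | 12-11 | 0.3719 | high_fmiss |
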
Implication: These 7 samples (3 contaminated, 3 high-missingness outliers, 1 chip failure) propagated through all winter 2025 downstream analyses (Steps 2–15). The spring 2026 re-analysis corrects this by using the verified 99-sample removal list.
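
Samples present in one removal list but absent from another can be isolated with a two-file awk pass. This sketch uses two small invented lists; the file names (including the one standing in for the winter 2025 list) are hypothetical, and it says nothing about why the original list was short.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical older removal list (FID, IID, tab-separated).
printf '1003\t08-664\n1030\t08-438\n' > "$workdir/old_remove.txt"

# Hypothetical corrected removal list containing two extra samples.
printf '458\t08-365\n1003\t08-664\n1030\t08-438\n840\t08-25\n' > "$workdir/new_remove.txt"

# Remember every FID/IID pair in the old list, then print pairs that appear
# only in the new list.
missed=$(awk 'NR == FNR { seen[$1 "\t" $2] = 1; next }
!(($1 "\t" $2) in seen)' "$workdir/old_remove.txt" "$workdir/new_remove.txt")

# Count the non-empty lines in the result.
n_missed=$(printf '%s\n' "$missed" | awk 'NF { n++ } END { print n + 0 }')

echo "$n_missed sample(s) only in the corrected list:"
echo "$missed"
rm -rf "$workdir"
```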

5. Input & Output Data

Input

Files: ConvSK_raw.bed, ConvSK_raw.bim, ConvSK_raw.fam (symlinks to winter 2025 originals)
Location: /staging/ALSU-analysis/spring2026/
Format: PLINK binary format (.bed/.bim/.fam)
Samples: 1,247 individuals
Variants: 654,027 SNPs (Illumina GSA-24v3-0_A2 array)

Output

Files: ConvSK_mind20.bed, ConvSK_mind20.bim, ConvSK_mind20.fam
Location: /staging/ALSU-analysis/spring2026/
Samples: 1,148 individuals (99 removed)
Variants: 654,027 SNPs (unchanged; a sample filter does not affect the SNP count)

Intermediate Files

ConvSK_raw_miss.imiss: Per-individual missingness (1,247 rows)
ConvSK_raw_miss.lmiss: Per-SNP missingness (654,027 rows)
remove_miss20.txt: FID/IID pairs for 99 samples with F_MISS > 0.20

6. Commands Executed

Step 1a: Calculate per-sample missingness

$ cd /staging/ALSU-analysis/spring2026/
$ plink --bfile ConvSK_raw \
    --missing \
    --out ConvSK_raw_miss

# Produces:
#   ConvSK_raw_miss.imiss (per-individual missingness)
#   ConvSK_raw_miss.lmiss (per-SNP missingness)

Step 1b: Extract samples with F_MISS > 0.20

# Column 6 of .imiss = F_MISS
$ awk 'NR>1 && $6+0 > 0.20 {print $1"\t"$2}' \
    ConvSK_raw_miss.imiss > remove_miss20.txt
$ wc -l remove_miss20.txt
99 remove_miss20.txt

Step 1c: Remove high-missingness samples

$ plink --bfile ConvSK_raw \
    --remove remove_miss20.txt \
    --make-bed \
    --out ConvSK_mind20

# Expected output: 99 people removed, 1148 remaining.
# Result: ConvSK_mind20.bed/bim/fam (654,027 variants x 1,148 samples)
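
After a removal step like Step 1c, the output .fam can be checked against the input .fam and the removal list without rerunning PLINK. The sketch below is an assumption-laden illustration: it builds three tiny invented files in a temporary directory (file names are hypothetical) and checks that the output has exactly input-minus-removed rows and that no removed FID/IID pair survives.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Toy input .fam (FID IID PAT MAT SEX PHENO), removal list, and filtered .fam.
cat > "$workdir/raw.fam" <<'EOF'
1 s1 0 0 1 -9
2 s2 0 0 2 -9
3 s3 0 0 1 -9
EOF
printf '2\ts2\n' > "$workdir/remove.txt"
cat > "$workdir/filtered.fam" <<'EOF'
1 s1 0 0 1 -9
3 s3 0 0 1 -9
EOF

# Row counts of each file.
n_raw=$(awk 'END { print NR }' "$workdir/raw.fam")
n_remove=$(awk 'END { print NR }' "$workdir/remove.txt")
n_out=$(awk 'END { print NR }' "$workdir/filtered.fam")

# Number of removal-list samples that are still present in the filtered .fam.
leftover=$(awk 'NR == FNR { drop[$1 "\t" $2] = 1; next }
(($1 "\t" $2) in drop) { n++ }
END { print n + 0 }' "$workdir/remove.txt" "$workdir/filtered.fam")

if [ "$n_out" -eq $((n_raw - n_remove)) ] && [ "$leftover" -eq 0 ]; then
    status="ok"
else
    status="mismatch"
fi

echo "$status: $n_raw in, $n_remove listed for removal, $n_out out, $leftover leftover"
rm -rf "$workdir"
```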

7. Removed Samples

99 samples were removed for exceeding the 20% missingness threshold. Their F_MISS values range from 0.2085 (08-507) to 0.4156 (08-621).

All 99 removed samples with F_MISS values

Source: ConvSK_raw_miss.imiss + remove_miss20.txt on Biotech2024 (/staging/ALSU-analysis/spring2026/). ★ = not in winter 2025 removal list (7 extra samples)

FID IID F_MISS
458 08-365 0.3732 ★
499 08-701 0.2908 ★
840 08-25 0.3308 ★
862 08-495 0.3258 ★
886 08-77 0.2605 ★
898 08-825 0.3605 ★
910 12-11 0.3719 ★
1003 08-664 0.2487
1030 08-438 0.2362
1042 02-55 0.2564
1043 02-99 0.2748
1054 04-51 0.3115
1055 08-68 0.2113
1065 02-96 0.3713
1068 08-48 0.2461
1081 02-103 0.3567
1104 03-328 0.2810
1105 03-72 0.2843
1106 03-73 0.2856
1107 03-85 0.2896
1108 07-42 0.2883
1109 07-79 0.2913
1110 07-96 0.3048
1111 08-290 0.2913
1112 08-291 0.2928
1113 08-292 0.2807
1114 08-296 0.2927
1115 08-297 0.3020
1116 08-298 0.2762
1117 08-305 0.2821
1118 08-306 0.2785
1119 08-309 0.2981
1120 08-310 0.2948
1121 08-311 0.3063
1122 08-314 0.3037
1123 08-317 0.2933
1124 08-321 0.2943
1125 08-322 0.2819
1126 08-323 0.2936
1127 08-326 0.3144
1128 08-331 0.3196
1129 08-332 0.2567
1130 08-336 0.2709
1131 08-341 0.2582
1132 08-352 0.2438
1133 08-356 0.2285
1134 08-360 0.2553
1135 08-374 0.2637
1136 08-386 0.3078
1137 08-390 0.2908
1138 08-391 0.3182
1139 08-392 0.3345
1140 08-399 0.2698
1141 08-407 0.2579
1142 08-410 0.2514
1143 08-411 0.2525
1144 08-416 0.2499
1145 08-419 0.2308
1146 08-421 0.2432
1147 08-430 0.2638
1148 08-433 0.3217
1149 08-487 0.3156
1150 08-490 0.3166
1151 08-505 0.2807
1152 08-507 0.2085
1159 08-568 0.2482
1160 08-582 0.2853
1161 08-59 0.3112
1162 08-598 0.3664
1163 08-621 0.4156
1171 08-747 0.2309
1172 08-749 0.2858
1173 08-754 0.3210
1174 08-761 0.3477
1175 08-762 0.4032
1176 08-766 0.2315
1177 08-768 0.2513
1178 08-778 0.2385
1179 08-779 0.2554
1180 08-780 0.2513
1181 08-781 0.2580
1182 08-782 0.2562
1183 08-783 0.2540
1184 08-784 0.2511
1185 08-786 0.2575
1186 08-787 0.2661
1187 08-788 0.2621
1188 08-789 0.2303
1189 08-792 0.2555
1190 08-793 0.2534
1191 08-794 0.2608
1192 08-825d 0.2573
1193 15-12M 0.2577
1194 20-07 0.2616
1195 08-425 0.2585
1196 08-454 0.2447
1197 08-412 0.2548
1198 08-492 0.2622
1199 08-403 0.2669

8. Quality Verification

✓ Post-filter verification:
  • remove_miss20.txt contains exactly 99 lines (FID ↔ IID pairs, tab-separated)
  • All 99 have F_MISS > 0.20 (min = 0.2085, max = 0.4156)
  • All 1,148 retained samples have F_MISS ≤ 0.1979
  • No variants removed — 654,027 SNPs carried forward unchanged
  • 100% overlap with Step 0 investigation verdicts

$ wc -l remove_miss20.txt
99 remove_miss20.txt

# Count samples above the threshold in the raw .imiss (must equal the removal list size):
$ awk 'NR>1 && $6+0 > 0.20' ConvSK_raw_miss.imiss | wc -l
99

# Highest F_MISS among samples at or below the threshold (the retained set):
$ awk 'NR>1 && $6+0 <= 0.20' ConvSK_raw_miss.imiss | \
    awk '{if($6+0 > max) max=$6+0} END{print max}'
0.1979

9. Chronological Log

Winter 2025 (original run)
Initial pipeline
Missingness filter applied at /staging/ALSU-analysis/winter2025/PLINK_301125_0312/. Removal list contained 92 samples — 7 samples with F_MISS > 0.20 were incorrectly retained. All downstream analyses (Steps 2–15) used this buggy dataset.
Spring 2026 (re-analysis)
Workspace set up
Created /staging/ALSU-analysis/spring2026/ with symlinks to unchanged raw PLINK files.
Spring 2026
Missingness recomputed
plink --missing on ConvSK_raw: .imiss (1,247 samples) and .lmiss (654,027 SNPs) produced.
Spring 2026
Correct removal list generated
99 samples with F_MISS > 0.20 extracted to remove_miss20.txt. Cross-referenced with Step 0: 84 chip_failure + 12 high_fmiss + 3 contaminated = 99.
Spring 2026
Samples removed
plink --remove: 1,247 → 1,148 samples. Output: ConvSK_mind20.