
Step 0: Pre-QC Sample Investigation

Chip-level and sample-level quality audit of the raw 1,247-sample dataset before any filtering

Investigation — April 2026

Purpose

This page summarizes the findings from a comprehensive pre-QC investigation of the raw PLINK files (ConvSK_raw: 654,027 variants × 1,247 samples, Illumina GSA-24v3-0_A2 array). The investigation examined chip-level quality, sample identity (IBD), d/t replicate integrity, heterozygosity, sex check, and contamination indicators before any QC filtering was applied.

Full interactive dashboard: sample_investigation_v2.html (17 sections with Plotly charts, sortable tables, and per-sample verdicts). Data source: data/investigation_data_v2.json.

This page has two objectives:

  1. Cross-check Steps 1–15 — verify that all QC decisions are consistent with investigation findings
  2. GWAS sample filtering — definitive reference for which samples to keep, remove, and flag for future phenotype-association analyses

1. Raw Dataset

Total Samples: 1,247
Variants: 654,027
Physical Chips: 52
Samples per Chip: 24

Property | Value
Array | Illumina GSA-24v3-0_A2 (Infinium Global Screening Array)
Source files | ConvSK_raw.bed/.bim/.fam
Server path | /staging/ALSU-analysis/winter2025/PLINK_301125_0312/
Genome build | GRCh37 (hg19)
Sample sheet | 1,248 rows → 1,200 unique Sample_IDs (48 duplicated IDs with d/t suffixes)
Global het rate | Mean = 0.193, SD = 0.067

2. Sample Verdicts — Summary

KEEP: 1,056
REMOVE: 191
KEEP with unverified identity: 30 (a subset of the 1,056 KEEP samples)

Removal Breakdown (191 samples)

Category | Count | Description
chip_failure | 98 | On one of 4 catastrophic chips (208993030xxx series); mean F_MISS 24–29%, all positions failed
dt_artifact | 45 | d/t replicate entries (suffix samples removed, base kept where possible)
ibd_duplicate | 33 | IBD deduplication: one member of each unexpected identical pair removed
high_fmiss | 12 | F_MISS > 0.20 on non-catastrophic chips
contaminated | 3 | 08-495, 08-25, 08-701: extreme het (z = +5 to +7), F_MISS 29–33%, pattern consistent with two-person DNA mixture

Identity Status (1,056 KEEP samples)

Status | Count | GWAS Eligibility
verified | 1,026 | OK for all analyses (PCA, ADMIXTURE, FST, GWAS)
unverified | 30 | OK for PCA / ADMIXTURE / FST. Must be excluded from GWAS because the genotype cannot be matched to the correct phenotype record

3. Catastrophic Chips

4 chips in the 208993030xxx series failed entirely. All 24 positions on each chip are affected — the failure is chip-wide.

Chip Barcode | Samples | Mean F_MISS | Max F_MISS
208993030034 | 25 | 0.290 | 0.409
208993030039 | 25 | 0.273 | 0.372
208993030044 | 25 | 0.236 | 0.380
208993030046 | 24 | 0.281 | 0.420
Step 1 cross-check: The bulk of Step 1 removals (F_MISS > 0.20) come from these 4 chips. Samples on other chips with F_MISS > 0.20 are in the high_fmiss category (12 samples).
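The per-chip summary and the split between chip_failure and high_fmiss can be sketched as follows. This is a minimal illustration, not the investigation's actual code: the function names, the sample IDs, the `CHIP_A` barcode, and the toy F_MISS values are all hypothetical, and it assumes a sample-to-chip mapping is available from the sample sheet.

```python
from collections import defaultdict
from statistics import mean


def summarize_chips(sample_fmiss, sample_chip):
    """Return {chip: (n_samples, mean_fmiss, max_fmiss)} for every chip."""
    by_chip = defaultdict(list)
    for sample, fmiss in sample_fmiss.items():
        by_chip[sample_chip[sample]].append(fmiss)
    return {chip: (len(v), mean(v), max(v)) for chip, v in by_chip.items()}


def split_removals(sample_fmiss, sample_chip, bad_chips, threshold=0.20):
    """Split removals into chip_failure (any sample on a listed bad chip)
    and high_fmiss (F_MISS above the threshold on any other chip)."""
    chip_failure = sorted(s for s in sample_fmiss if sample_chip[s] in bad_chips)
    high_fmiss = sorted(
        s for s, f in sample_fmiss.items()
        if f > threshold and sample_chip[s] not in bad_chips
    )
    return chip_failure, high_fmiss


# Toy data (hypothetical sample IDs and values; only the first barcode is real).
fmiss = {"S1": 0.30, "S2": 0.25, "S3": 0.01, "S4": 0.22}
chip = {"S1": "208993030034", "S2": "208993030034", "S3": "CHIP_A", "S4": "CHIP_A"}
bad_chips = {"208993030034"}

summary = summarize_chips(fmiss, chip)
chip_failure, high_fmiss = split_removals(fmiss, chip, bad_chips)
```

In this sketch a sample on a bad chip is assigned to chip_failure regardless of its own F_MISS, which mirrors the chip-wide treatment described above.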

4. Identity Pairs & Clusters

Total Pairs (PI_HAT ≥ 0.98): 65
Clusters: 49
Samples Involved: 106

Pair Category | Count | Description
Expected d/t matches | 10 | d/t sample correctly matches its base; expected QC replicates
Unexpected, same chip | 18 | Two different sample IDs on the same chip with identical DNA
Unexpected, cross-chip | 37 | Two different sample IDs on different chips with identical DNA
Total unexpected | 55 | These 55 pairs involve identity problems that cannot be resolved without external verification
Step 2 cross-check: Step 2 uses PI_HAT ≥ 0.98 to identify 65 pairs → 49 clusters → removes 57 samples via graph-based deduplication. The 55 unexpected pairs mean that for each such pair, we kept one sample but cannot verify whose DNA it actually contains. This produces the 30 "unverified identity" KEEP samples (some pairs share a kept member across multiple clusters).
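Graph-based deduplication of this kind can be sketched as connected components over the identical pairs, keeping one sample per component. The sketch below is an illustration under assumptions: the rule for choosing which member to keep (lowest F_MISS) is hypothetical and may differ from what Step 2 actually does, and the sample names and values are toy data.

```python
def cluster_pairs(pairs):
    """Group samples into clusters (connected components) with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for sample in list(parent):
        clusters.setdefault(find(sample), set()).add(sample)
    return list(clusters.values())


def dedup(pairs, fmiss):
    """Keep one sample per cluster and remove the rest.
    Assumption: the kept member is the one with the lowest F_MISS."""
    keep, remove = [], []
    for cluster in cluster_pairs(pairs):
        best = min(sorted(cluster), key=lambda s: fmiss[s])
        keep.append(best)
        remove.extend(cluster - {best})
    return sorted(keep), sorted(remove)


# Toy data: A-B-C form one cluster, D-E another.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
fmiss = {"A": 0.02, "B": 0.01, "C": 0.03, "D": 0.05, "E": 0.04}

clusters = cluster_pairs(pairs)
keep, remove = dedup(pairs, fmiss)
```

Keeping one sample per cluster always removes (samples involved − clusters) samples: 5 − 2 = 3 in the toy data, and 106 − 49 = 57 for the figures reported above.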
⚠ GWAS implication: The 30 unverified-identity samples are genotypically clean (they pass all QC thresholds) but their sample labels may be wrong. For PCA / ADMIXTURE / FST, this doesn't matter — the DNA is from a real Uzbek individual. For GWAS / phenotype association, these 30 samples must be excluded because we cannot match genotype to the correct phenotype record.

5. d/t Replicate Verification

48 samples carry a "d" or "t" suffix (45 base IDs: 42 pairs + 3 triplets = 48 entries). Each d/t is on a different physical chip than its base — these are independent replicate experiments, not software artifacts. The pipeline operator added d/t suffixes within GenomeStudio to disambiguate duplicate Sample_IDs before PLINK export.

Legend: ✓ Match Base · ✗ Match Wrong Person · ? No Match at ≥0.98
Disposition: All 45 d-suffix entries and all 3 t-suffix entries are on the REMOVE list (status dt_artifact). The base sample is kept when it passes QC. This ensures no duplicate individuals in downstream analyses.
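The three verification outcomes for a replicate (matches its base, matches a different sample, or has no match at PI_HAT ≥ 0.98) can be assigned with a small lookup. This is a sketch only: the function name, the outcome labels, and the sample IDs are hypothetical, and the replicate-to-base mapping is passed in explicitly so that no assumption about the suffix format is needed.

```python
def classify_replicates(replicate_to_base, identical_pairs):
    """Classify each d/t replicate from the list of identical pairs.

    replicate_to_base: {replicate_id: base_id}
    identical_pairs:   iterable of (id_a, id_b) with PI_HAT >= 0.98
    """
    partners = {}
    for a, b in identical_pairs:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    result = {}
    for rep, base in replicate_to_base.items():
        matched = partners.get(rep, set())
        if base in matched:
            result[rep] = "match_base"
        elif matched:
            result[rep] = "match_wrong_person"
        else:
            result[rep] = "no_match"
    return result


# Toy data with hypothetical IDs.
replicate_to_base = {"X1d": "X1", "X2d": "X2", "X3d": "X3"}
identical_pairs = [("X1d", "X1"), ("X2d", "X9")]

outcomes = classify_replicates(replicate_to_base, identical_pairs)
```

A replicate that matches both its base and another sample is reported as match_base here; such cases would need a separate look, since the extra match still indicates a duplicate elsewhere.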

6. Possible Contamination

3 samples show a statistical pattern consistent with two-person DNA mixture:

Sample | Het Rate | z-score | F_MISS | F(X)
08-495 | 0.624 | +6.4 | 0.326 | -1.47
08-25 | 0.668 | +7.1 | 0.331 | -1.18
08-701 | 0.550 | +5.3 | 0.291 | -0.80

Evidence: Het rates 5–7σ above the mean, high missingness (29–33%), extreme negative F(X). This pattern is consistent with a two-person DNA mixture but has not been independently confirmed.
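The z-scores in the table follow directly from the global het rate (mean = 0.193, SD = 0.067) reported in Section 1. A minimal check, with a hypothetical function name:

```python
MEAN_HET = 0.193  # global mean het rate (Section 1)
SD_HET = 0.067    # global SD of het rate (Section 1)

# Het rates of the three flagged samples, taken from the table above.
het_rates = {"08-495": 0.624, "08-25": 0.668, "08-701": 0.550}


def het_z(rate, mean=MEAN_HET, sd=SD_HET):
    """Standard score of a sample's het rate against the global distribution."""
    return (rate - mean) / sd


z_scores = {sample: round(het_z(rate), 1) for sample, rate in het_rates.items()}
# z_scores == {"08-495": 6.4, "08-25": 7.1, "08-701": 5.3}
```

Note that the global mean and SD here include the outliers themselves, so the z-scores are, if anything, slightly conservative.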

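Verification needed: Examine BAF (B-Allele Frequency) plots in GenomeStudio for these 3 samples. A contaminated sample shows extra bands between the normal three (BAF ≈ 0, 0.50, 1.0); for a roughly equal two-person mixture this is a characteristic 5-band pattern (BAF ≈ 0, 0.25, 0.50, 0.75, 1.0), and the intermediate bands shift toward 0 and 1 as the mixture becomes more unequal.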

7. Step-by-Step Cross-Check Reference

Use this table to verify that each pipeline step's inputs and outputs are consistent with the investigation findings.

Step | Operation | Expected Input | Expected Output | Investigation Notes
Step 1 | F_MISS > 0.20 filter | 654,027 × 1,247 | 1,155 samples retained (92 removed) | Investigation found 99 should be removed (the pipeline used a slightly different removal list; the 7-sample discrepancy is documented in Step 1)
Step 2 | IBD dedup (PI_HAT ≥ 0.98) | 1,155 samples | 1,098 samples (57 removed from 49 clusters) | Investigation confirms 65 pairs, 49 clusters. 55 of 65 are unexpected. Dedup resolves duplicate genotypes but not identity ambiguity.
Step 3 | SNP QC + VCF export | 654,027 variants × 1,098 | 472,191 variants in VCF | No sample-level issues. SNP QC removes MAF/HWE/geno failures, I/D alleles, duplicate positions.
Step 4 | Imputation (TOPMed) | 472,191 variants | 10,846,569 variants | Quality depends on input sample quality. Samples from bad chips or with contamination may have unreliable imputed genotypes.
Step 5–6 | ID normalize + final QC | 1,098 samples | Post-imputation clean set | Verify that the 3 contaminated samples (if still present) are flagged or removed at this stage.
Step 7–8 | PCA (local + global) | Post-imputation set | PCA coordinates | 30 unverified-identity samples are acceptable for PCA (DNA is from a real Uzbek individual regardless of label).
Step 9–14 | FST, ADMIXTURE, PBS, LD, MDS | Post-imputation set | Population structure results | Same as PCA: unverified identity has minimal effect on population-level analyses.
Step 15 | ROH & IBD | Post-imputation set | ROH segments, IBD sharing | ROH is per-sample, so unverified identity doesn't affect ROH detection. IBD sharing between unverified samples may be artifactual.
Future GWAS | Phenotype–genotype association | Post-imputation set | Association statistics | Must exclude the 30 unverified-identity samples in addition to all 191 REMOVE samples. Effective GWAS N = 1,026.

8. GWAS Sample Filtering Criteria

For phenotype–genotype association (GWAS), apply the following exclusions in order:

# | Filter | Samples Removed | Cumulative Remaining
1 | chip_failure: catastrophic chips (208993030xxx) | 98 | 1,149
2 | dt_artifact: d/t replicate suffixes | 45 | 1,104
3 | ibd_duplicate: one member of each unexpected identical pair | 33 | 1,071
4 | high_fmiss: F_MISS > 0.20 on non-catastrophic chips | 12 | 1,059
5 | contaminated: pattern consistent with DNA mixture | 3 | 1,056
6 | unverified: identity ambiguous (cannot match to phenotype) | 30 | 1,026
Population-structure analyses (PCA, ADMIXTURE, FST, PBS, MDS): Filters 1–5 only → N = 1,056.
GWAS / phenotype association: Filters 1–6 → N = 1,026.
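The two sample sizes can be re-derived from the per-filter counts, which is a cheap consistency check whenever the verdict table changes. A minimal sketch using only the counts stated on this page (the function name is hypothetical):

```python
TOTAL_SAMPLES = 1247

# (filter name, samples removed), in the order the filters are applied.
FILTERS = [
    ("chip_failure", 98),
    ("dt_artifact", 45),
    ("ibd_duplicate", 33),
    ("high_fmiss", 12),
    ("contaminated", 3),
    ("unverified", 30),
]


def cascade(total, filters):
    """Return [(name, removed, remaining)] after applying each filter in turn."""
    rows = []
    remaining = total
    for name, removed in filters:
        remaining -= removed
        rows.append((name, removed, remaining))
    return rows


rows = cascade(TOTAL_SAMPLES, FILTERS)
n_structure = rows[4][2]  # after filters 1-5
n_gwas = rows[5][2]       # after filters 1-6
```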
Generating the removal list: The full per-sample verdict table is in the interactive dashboard (Section 14). Each sample has an action (KEEP / REMOVE), status (reason category), and identity_status (verified / unverified). Filter data/investigation_data_v2.json → sample_verdicts where action == "REMOVE" or (for GWAS) additionally where identity_status == "unverified".
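A sketch of that filter in Python is shown below. It assumes `sample_verdicts` is a list of per-sample records; the `sample_id` key name is hypothetical (the fields `action`, `status`, and `identity_status` are the ones named above), so adjust it to the actual JSON layout.

```python
import json


def load_verdicts(path="data/investigation_data_v2.json"):
    """Load the per-sample verdict records (assumed to be a list of dicts)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["sample_verdicts"]


def removal_lists(verdicts, id_key="sample_id"):
    """Return (structure_remove, gwas_remove).

    structure_remove: samples with action == "REMOVE"
    gwas_remove:      the same list plus KEEP samples with unverified identity
    """
    structure_remove = [v[id_key] for v in verdicts if v["action"] == "REMOVE"]
    unverified = [
        v[id_key] for v in verdicts
        if v["action"] == "KEEP" and v.get("identity_status") == "unverified"
    ]
    return structure_remove, structure_remove + unverified


# In-memory toy records with hypothetical sample IDs.
toy_verdicts = [
    {"sample_id": "S1", "action": "REMOVE", "status": "chip_failure", "identity_status": "verified"},
    {"sample_id": "S2", "action": "KEEP", "status": "ok", "identity_status": "unverified"},
    {"sample_id": "S3", "action": "KEEP", "status": "ok", "identity_status": "verified"},
]

structure_remove, gwas_remove = removal_lists(toy_verdicts)
```

On the real file, `removal_lists(load_verdicts())` should return lists of length 191 and 221 if the JSON matches the counts on this page. Writing the IDs out as a two-column FID/IID file for PLINK's `--remove` is a further step not shown here.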

9. Open Items

  • BAF verification of 3 contaminated samples — check GenomeStudio BAF plots for 5-band pattern (08-495, 08-25, 08-701)
  • SNP fingerprinting for identity resolution — 96-SNP panel on ~69 original DNA tubes to resolve the 55 unexpected identical pairs (see dashboard Section 17)
  • Request original plate-loading manifest from the lab (physical tube-to-well mapping) to help resolve identity mismatches
  • 7-sample discrepancy — Step 1 in production removed 92 samples instead of the correct 99. All downstream steps (2–15) used 1,155 → 1,098 cascade instead of 1,148 → corrected value. Re-execution needed for publication-quality results.
  • Scanner QC results — raw IDAT control probe analysis for the 4 bad chips (see dashboard Section 13)