Step 0: Pre-QC Sample Investigation

Purpose

This page summarizes the findings from a comprehensive pre-QC investigation of the raw PLINK files (ConvSK_raw: 654,027 variants × 1,247 samples, Illumina GSA-24v3-0_A2 array). The investigation examined chip-level quality, sample identity (IBD), d/t replicate integrity, heterozygosity, sex check, and contamination indicators before any QC filtering was applied.

Full interactive dashboard: sample_investigation_v2.html (17 sections with Plotly charts, sortable tables, and per-sample verdicts). Data source: data/investigation_data_v2.json.

This page has two objectives:

Cross-check Steps 1–15 — verify that all QC decisions are consistent with investigation findings
GWAS sample filtering — definitive reference for which samples to keep, remove, and flag for future phenotype-association analyses

1. Raw Dataset

Metric	Value
Total samples	1,247
Variants	654,027
Physical chips	52
Samples per chip	24

Property	Value
Array	Illumina GSA-24v3-0_A2 (Infinium Global Screening Array)
Source files	`ConvSK_raw.bed/.bim/.fam`
Server path	`/staging/ALSU-analysis/winter2025/PLINK_301125_0312/`
Genome build	GRCh37 (hg19)
Sample sheet	1,248 rows → 1,200 unique Sample_IDs (48 duplicated IDs with d/t suffixes)
Global het rate	Mean = 0.193, SD = 0.067

2. Sample Verdicts — Summary

Verdict	Samples
KEEP	1,056
REMOVE	191
KEEP, unverified identity (subset of KEEP)	30

Removal Breakdown (191 samples)

Category	Count	Description
chip_failure	98	On one of 4 catastrophic chips (208993030xxx series) — mean F_MISS 24–29%, all positions failed
dt_artifact	45	d/t replicate entries (suffix samples removed, base kept where possible)
ibd_duplicate	33	IBD deduplication — one member of each unexpected identical pair removed
high_fmiss	12	F_MISS > 0.20 on non-catastrophic chips
contaminated	3	08-495, 08-25, 08-701 — extreme het (z = +5 to +7), F_MISS 29–33%, pattern consistent with two-person DNA mixture

Identity Status (1,056 KEEP samples)

Status	Count	GWAS Eligibility
verified	1,026	OK for all analyses (PCA, ADMIXTURE, FST, GWAS)
unverified	30	OK for PCA / ADMIXTURE / FST. Must exclude from GWAS — cannot match genotype to correct phenotype record

3. Catastrophic Chips

4 chips in the 208993030xxx series failed entirely. All 24 positions on each chip are affected — the failure is chip-wide.

Chip Barcode	Samples	Mean F_MISS	Max F_MISS
`208993030034`	25	0.290	0.409
`208993030039`	25	0.273	0.372
`208993030044`	25	0.236	0.380
`208993030046`	24	0.281	0.420

Step 1 cross-check: The bulk of Step 1 removals (F_MISS > 0.20) come from these 4 chips. Samples on other chips with F_MISS > 0.20 are in the high_fmiss category (12 samples).
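
As a rough sketch of how per-chip figures like those in the table above could be reproduced, the snippet below groups per-sample F_MISS values by chip barcode and reports the sample count, mean, and maximum. The input layout (a list of `(sample_id, chip_barcode, f_miss)` tuples), the function name `summarize_chips`, and the toy values are assumptions for illustration; in practice F_MISS would come from a PLINK missingness report and the barcode from the sample sheet.

```python
from collections import defaultdict
from statistics import mean


def summarize_chips(records):
    """Group (sample_id, chip_barcode, f_miss) tuples by chip barcode.

    Returns {chip_barcode: {"n": count, "mean": mean F_MISS, "max": max F_MISS}}.
    """
    by_chip = defaultdict(list)
    for _sample_id, chip, f_miss in records:
        by_chip[chip].append(f_miss)
    return {
        chip: {"n": len(vals), "mean": mean(vals), "max": max(vals)}
        for chip, vals in by_chip.items()
    }


# Toy input: hypothetical sample IDs, barcodes, and values (not from the dataset).
records = [
    ("S1", "chipA", 0.30), ("S2", "chipA", 0.20),
    ("S3", "chipB", 0.01), ("S4", "chipB", 0.03),
]
summary = summarize_chips(records)

# Flag chips whose mean F_MISS exceeds 0.20. Reusing the Step 1 sample-level
# threshold at chip level is an assumption of this sketch.
flagged = sorted(chip for chip, s in summary.items() if s["mean"] > 0.20)
print(flagged)  # ['chipA']
```

A chip-level summary of this kind separates chip-wide failures from isolated high-missingness samples, which is the distinction between the chip_failure and high_fmiss categories.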

4. Identity Pairs & Clusters

Metric	Value
Total pairs (PI_HAT ≥ 0.98)	65
Clusters	49
Samples involved	106

Pair Category	Count	Description
Expected d/t matches	10	d/t sample correctly matches its base — expected QC replicates
Unexpected — same chip	18	Two different sample IDs on the same chip with identical DNA
Unexpected — cross-chip	37	Two different sample IDs on different chips with identical DNA
Total unexpected	55	These 55 pairs involve identity problems that cannot be resolved without external verification

Step 2 cross-check: Step 2 applies PI_HAT ≥ 0.98 to identify 65 pairs, groups them into 49 clusters, and removes 57 samples via graph-based deduplication. This is consistent with keeping one sample per cluster: 106 samples involved − 49 clusters = 57 removed. For each of the 55 unexpected pairs, one sample was kept, but we cannot verify whose DNA it actually contains. This produces the 30 "unverified identity" KEEP samples (some pairs within a cluster share a kept member).
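
The graph-based deduplication described above can be sketched as a connected-components problem: each PI_HAT ≥ 0.98 pair is an edge, each cluster is a component, and one sample per cluster is kept. The sketch below uses a union-find structure. The rule for choosing which member to keep (lowest F_MISS, ties broken by ID) is an assumption, since this page does not state the rule the pipeline used; the function names and toy data are likewise hypothetical.

```python
def cluster_pairs(pairs):
    """Group sample pairs into clusters (connected components) via union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for sample in parent:
        clusters.setdefault(find(sample), set()).add(sample)
    return list(clusters.values())


def deduplicate(pairs, f_miss):
    """Keep one sample per cluster; return (keep, remove) sets.

    Assumed rule: keep the member with the lowest F_MISS (ties broken by ID).
    """
    keep, remove = set(), set()
    for cluster in cluster_pairs(pairs):
        best = min(cluster, key=lambda s: (f_miss[s], s))
        keep.add(best)
        remove |= cluster - {best}
    return keep, remove


# Toy input: hypothetical IDs and missingness values.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
f_miss = {"A": 0.02, "B": 0.01, "C": 0.05, "D": 0.03, "E": 0.04}
keep, remove = deduplicate(pairs, f_miss)
print(sorted(keep), sorted(remove))  # ['B', 'D'] ['A', 'C', 'E']
```

With one sample kept per cluster, the number removed equals samples involved minus clusters (5 − 2 = 3 in the toy data), the same relation as 106 − 49 = 57 in the figures on this page.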

⚠ GWAS implication: The 30 unverified-identity samples are genotypically clean (they pass all QC thresholds) but their sample labels may be wrong. For PCA / ADMIXTURE / FST, this doesn't matter — the DNA is from a real Uzbek individual. For GWAS / phenotype association, these 30 samples must be excluded because we cannot match genotype to the correct phenotype record.

5. d/t Replicate Verification

48 samples carry a "d" or "t" suffix. They belong to 45 base IDs: 42 pairs (base + d) and 3 triplets (base + d + t), giving 42 + 6 = 48 suffixed entries. Each d/t entry sits on a different physical chip than its base, so these are independent replicate experiments, not software artifacts. The pipeline operator added the d/t suffixes within GenomeStudio to disambiguate duplicate Sample_IDs before PLINK export.

Outcome categories for d/t replicates (counts are not rendered on this static page): ✓ Match Base; ✗ Match Wrong Person; ? No Match at ≥0.98

Disposition: All 45 d/t suffix entries (and 3 t-suffix entries) are on the REMOVE list (status dt_artifact). The base sample is kept when it passes QC, so no individual appears twice in downstream analyses.
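
A minimal sketch of how each suffixed replicate could be sorted into the three outcome categories of this section (matches its base, matches a different sample, or has no match at PI_HAT ≥ 0.98). It assumes the suffix is a single trailing `d` or `t` character on the Sample_ID and that the identical pairs are available as a list of ID tuples; the function name and toy IDs are hypothetical. As a simplification, a `t` replicate that matches only its sibling `d` replicate is reported as a wrong-person match here.

```python
def classify_replicate(sample_id, identical_pairs):
    """Classify one d/t-suffixed sample against the PI_HAT >= 0.98 pair list.

    Returns "match_base", "match_wrong_person", or "no_match".
    Assumes the suffix is one trailing "d" or "t" character.
    """
    if not sample_id.endswith(("d", "t")):
        raise ValueError(f"{sample_id!r} has no d/t suffix")
    base = sample_id[:-1]

    # Collect every sample that is genotypically identical to this replicate.
    partners = set()
    for a, b in identical_pairs:
        if a == sample_id:
            partners.add(b)
        elif b == sample_id:
            partners.add(a)

    if base in partners:
        return "match_base"
    if partners:
        return "match_wrong_person"
    return "no_match"


# Toy pair list with hypothetical sample IDs.
pairs = [("X1", "X1d"), ("X2d", "X9")]
results = {s: classify_replicate(s, pairs) for s in ("X1d", "X2d", "X3d")}
print(results)
```

In the toy data, `X1d` matches its base `X1`, `X2d` matches the unrelated ID `X9`, and `X3d` has no identical partner.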

6. Possible Contamination

Three samples show a statistical pattern consistent with a two-person DNA mixture:

Sample	Het Rate	z-score	F_MISS	F(X)
`08-495`	0.624	+6.4	0.326	-1.47
`08-25`	0.668	+7.1	0.331	-1.18
`08-701`	0.550	+5.3	0.291	-0.80

Evidence: het rates 5–7σ above the global mean (0.193), high missingness (29–33%), and strongly negative F(X) (−0.80 to −1.47), which indicates excess heterozygosity on the X chromosome. This pattern is consistent with a two-person DNA mixture but has not been independently confirmed.
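
The z-scores in the table can be checked against the global heterozygosity figures from Section 1 (mean = 0.193, SD = 0.067). The sketch below assumes the z-scores were computed as (het rate − global mean) / global SD; the page does not state the formula explicitly, but the values reproduce under that assumption. The function name `het_z` is illustrative.

```python
GLOBAL_HET_MEAN = 0.193  # from the Raw Dataset table
GLOBAL_HET_SD = 0.067    # from the Raw Dataset table


def het_z(het_rate, mean=GLOBAL_HET_MEAN, sd=GLOBAL_HET_SD):
    """Standard score of a sample's heterozygosity rate against the global distribution."""
    return (het_rate - mean) / sd


# Het rates of the three flagged samples, as listed in the table above.
flagged = {"08-495": 0.624, "08-25": 0.668, "08-701": 0.550}
z_scores = {sample: round(het_z(rate), 1) for sample, rate in flagged.items()}
print(z_scores)  # {'08-495': 6.4, '08-25': 7.1, '08-701': 5.3}
```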

Verification needed: Examine BAF (B-Allele Frequency) plots in GenomeStudio for these 3 samples. A contaminated sample shows a characteristic 5-band pattern (BAF ≈ 0, 0.25, 0.50, 0.75, 1.0) instead of the normal 3-band pattern.

7. Step-by-Step Cross-Check Reference

Use this table to verify that each pipeline step's inputs and outputs are consistent with the investigation findings.

Step	Operation	Expected Input	Expected Output	Investigation Notes
Step 1	F_MISS > 0.20 filter	654,027 × 1,247	1,155 samples retained (92 removed)	Investigation found 99 should be removed (the pipeline used a slightly different removal list — 7-sample discrepancy documented in Step 1)
Step 2	IBD dedup (PI_HAT ≥ 0.98)	1,155 samples	1,098 samples (57 removed from 49 clusters)	Investigation confirms 65 pairs, 49 clusters. 55 of 65 are unexpected. Dedup resolves duplicate genotypes but not identity ambiguity.
Step 3	SNP QC + VCF export	654,027 variants × 1,098	472,191 variants in VCF	No sample-level issues. SNP QC removes MAF/HWE/geno failures, I/D alleles, duplicate positions.
Step 4	Imputation (TOPMed)	472,191 variants	10,846,569 variants	Quality depends on input sample quality. Samples from bad chips or with contamination may have unreliable imputed genotypes.
Step 5–6	ID normalize + final QC	1,098 samples	Post-imputation clean set	Verify that the 3 contaminated samples (if still present) are flagged or removed at this stage.
Step 7–8	PCA (local + global)	Post-imputation set	PCA coordinates	30 unverified-identity samples are acceptable for PCA (DNA is from a real Uzbek individual regardless of label).
Step 9–14	FST, ADMIXTURE, PBS, LD, MDS	Post-imputation set	Population structure results	Same as PCA — unverified identity has minimal effect on population-level analyses.
Step 15	ROH & IBD	Post-imputation set	ROH segments, IBD sharing	ROH is per-sample — unverified identity doesn't affect ROH detection. IBD sharing between unverified samples may be artifactual.
Future GWAS	Phenotype–genotype association	Post-imputation set	Association statistics	Must exclude the 30 unverified-identity samples in addition to all 191 REMOVE samples. Effective GWAS N = 1,026.

8. GWAS Sample Filtering Criteria

For phenotype–genotype association (GWAS), apply the following exclusions in order:

#	Filter	Samples Removed	Cumulative Remaining
1	chip_failure — catastrophic chips (208993030xxx)	98	1,149
2	dt_artifact — d/t replicate suffixes	45	1,104
3	ibd_duplicate — one member of each unexpected identical pair	33	1,071
4	high_fmiss — F_MISS > 0.20 on non-catastrophic chips	12	1,059
5	contaminated — pattern consistent with DNA mixture	3	1,056
6	unverified — identity ambiguous (cannot match to phenotype)	30	1,026

Population-structure analyses (PCA, ADMIXTURE, FST, PBS, MDS): Filters 1–5 only → N = 1,056.
GWAS / phenotype association: Filters 1–6 → N = 1,026.
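
The cumulative counts in the table above follow from simple subtraction, which the sketch below reproduces as a sanity check. The filter names and counts are taken from the table; the helper name `cascade` is illustrative.

```python
TOTAL_SAMPLES = 1247

# (filter name, samples removed), in the order given in the table.
FILTERS = [
    ("chip_failure", 98),
    ("dt_artifact", 45),
    ("ibd_duplicate", 33),
    ("high_fmiss", 12),
    ("contaminated", 3),
    ("unverified", 30),
]


def cascade(total, filters):
    """Return (name, removed, cumulative remaining) for each filter in order."""
    rows, remaining = [], total
    for name, removed in filters:
        remaining -= removed
        rows.append((name, removed, remaining))
    return rows


rows = cascade(TOTAL_SAMPLES, FILTERS)
for name, removed, remaining in rows:
    print(f"{name:14s} -{removed:3d} -> {remaining:,}")

n_structure = rows[4][2]  # after filters 1-5
n_gwas = rows[5][2]       # after filters 1-6
print(n_structure, n_gwas)  # 1056 1026
```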

Generating the removal list: The full per-sample verdict table is in the interactive dashboard (Section 14). Each sample has an action (KEEP / REMOVE), status (reason category), and identity_status (verified / unverified). Filter data/investigation_data_v2.json → sample_verdicts where action == "REMOVE" or (for GWAS) additionally where identity_status == "unverified".
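
A minimal sketch of the filter described above. The exact layout of `data/investigation_data_v2.json` is not shown on this page, so the sketch assumes `sample_verdicts` is a list of objects carrying the `action`, `status`, and `identity_status` fields named above, plus a hypothetical `sample_id` field for the identifier. The `"ok"` status value in the toy data is also invented. Adjust the field names to the real file.

```python
import json


def removal_ids(data, for_gwas=False):
    """Return the sorted sample IDs to exclude.

    Always excludes action == "REMOVE"; with for_gwas=True, additionally
    excludes samples whose identity_status is "unverified".
    """
    ids = set()
    for rec in data["sample_verdicts"]:
        if rec["action"] == "REMOVE":
            ids.add(rec["sample_id"])  # "sample_id" is an assumed field name
        elif for_gwas and rec.get("identity_status") == "unverified":
            ids.add(rec["sample_id"])
    return sorted(ids)


# Toy stand-in for the real file. With the real data one would instead do:
#   with open("data/investigation_data_v2.json") as fh:
#       data = json.load(fh)
data = json.loads("""
{"sample_verdicts": [
  {"sample_id": "S1", "action": "KEEP", "status": "ok", "identity_status": "verified"},
  {"sample_id": "S2", "action": "KEEP", "status": "ok", "identity_status": "unverified"},
  {"sample_id": "S3", "action": "REMOVE", "status": "chip_failure", "identity_status": "verified"}
]}
""")

structure_remove = removal_ids(data)
gwas_remove = removal_ids(data, for_gwas=True)
print(structure_remove, gwas_remove)  # ['S3'] ['S2', 'S3']
```

Against the real file, the two lists would be expected to contain 191 and 221 IDs respectively, matching the counts in the filtering table. Note that PLINK's `--remove` expects a two-column file (FID and IID), so the IDs would need to be paired with the family IDs from the `.fam` file before use.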

9. Open Items

⬜ BAF verification of 3 contaminated samples — check GenomeStudio BAF plots for 5-band pattern (08-495, 08-25, 08-701)
⬜ SNP fingerprinting for identity resolution — 96-SNP panel on ~69 original DNA tubes to resolve the 55 unexpected identical pairs (see dashboard Section 17)
⬜ Request original plate-loading manifest from the lab (physical tube-to-well mapping) to help resolve identity mismatches
⬜ 7-sample discrepancy — Step 1 in production removed 92 samples instead of the correct 99. All downstream steps (2–15) used 1,155 → 1,098 cascade instead of 1,148 → corrected value. Re-execution needed for publication-quality results.
⬜ Scanner QC results — raw IDAT control probe analysis for the 4 bad chips (see dashboard Section 13)