ALSU Sample Investigation v2 — Chip-Level Forensics

654,027 variants × 1,247 samples across 52 physical chips — All data from real server computations

1. Summary Statistics

Overview of the dataset and the three anomalies discovered. Red numbers indicate problems that need action. Each number card corresponds to a detailed section below, which can be reached from the table of contents.

2. Sample Quality vs Position

Each dot is one sample. The vertical axis shows the missingness rate (F_MISS) — higher means more data is missing. Think of it as a "failure score": a dot near the top means that sample's genotyping mostly failed.
The horizontal axis groups samples by physical chip (sorted worst→best), so all samples from the same chip cluster together.
The chart is split into two panels: the top panel shows the worst samples (>6% missing) and the bottom panel zooms into the normal samples (≤6% missing).
Red dots = samples on the 4 catastrophic chips. Orange dots = other degraded samples. In the bottom panel, different colors represent different physical chips.

3. Chip Quality Ranking — The Smoking Gun

Each bar is one physical chip (52 total), sorted from worst to best. The bar height shows the average missingness of all ~24 samples on that chip. The diamond markers show the single worst sample on each chip.
Red bars = catastrophic chips (>20% average missing — nearly useless). Orange = degraded (10–20%). Blue = moderate. Green = good.
The dashed red line marks the 20% removal threshold. Everything above it should be discarded.
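The ranking and color bands described above can be sketched in a few lines of Python. This is a minimal sketch, not the report's actual pipeline: the input layout (one chip ID and one F_MISS value per sample) and the name rank_chips are assumptions. The text gives no numeric cut-off between blue and green, so those two bands are left merged here.

```python
from collections import defaultdict
from statistics import mean

# Thresholds taken from the text: >20% mean missing = catastrophic, 10-20% = degraded.
CATASTROPHIC = 0.20
DEGRADED = 0.10

def rank_chips(samples):
    """samples: iterable of (chip_id, f_miss). Returns rows sorted worst to best.

    Each row is (chip_id, mean_f_miss, worst_sample_f_miss, label).
    """
    by_chip = defaultdict(list)
    for chip_id, f_miss in samples:
        by_chip[chip_id].append(f_miss)

    rows = []
    for chip_id, values in by_chip.items():
        avg = mean(values)
        if avg > CATASTROPHIC:
            label = "catastrophic"
        elif avg >= DEGRADED:
            label = "degraded"
        else:
            label = "moderate/good"  # blue/green boundary is not stated in the text
        rows.append((chip_id, avg, max(values), label))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

The third field of each row plays the role of the diamond marker: the single worst sample on that chip.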

4. Chip Position Heatmaps

Each heatmap shows one physical chip as a grid of its physical layout: 12 rows (R01–R12) × 2 columns (C01–C02) = 24 sample positions. Color shows missingness: dark blue = good (low missing), orange = degraded, bright red = failed.
Look for patterns: if an entire chip is red, the chip itself failed. If only certain rows are red, the problem was position-specific (e.g., a particular section of the chip wasn't scanned properly).
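One way to build such a grid and check for row-level failures is sketched below. It assumes each sample carries a position label ending in the form "R01C01" (row and column concatenated), which the text implies but does not spell out; chip_grid and failed_rows are hypothetical helper names.

```python
import re

# Assumed label format: row and column concatenated, e.g. "R07C02".
POSITION = re.compile(r"R(\d{2})C(\d{2})$")

def chip_grid(samples, n_rows=12, n_cols=2):
    """samples: iterable of (position_label, f_miss) for ONE chip.

    Returns an n_rows x n_cols list of lists; empty positions stay None.
    """
    grid = [[None] * n_cols for _ in range(n_rows)]
    for label, f_miss in samples:
        match = POSITION.search(label)
        if match is None:
            raise ValueError(f"unrecognized position label: {label!r}")
        row, col = int(match.group(1)), int(match.group(2))
        if not (1 <= row <= n_rows and 1 <= col <= n_cols):
            raise ValueError(f"position out of range: {label!r}")
        grid[row - 1][col - 1] = f_miss
    return grid

def failed_rows(grid, threshold=0.20):
    """1-based row numbers in which every filled position exceeds the threshold."""
    result = []
    for number, row in enumerate(grid, start=1):
        filled = [value for value in row if value is not None]
        if filled and all(value > threshold for value in filled):
            result.append(number)
    return result
```

If failed_rows returns all 12 rows, the whole chip failed; a short list points to a position-specific problem.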

5. Missingness Distribution by ID Prefix

Sample IDs start with a number prefix (e.g., "08-123" has prefix "08"). This box plot shows the distribution of missingness grouped by prefix.
Each box shows the middle 50% of values; the line inside is the median. Dots above the box are outliers — individual samples with unusually high missingness. The red dashed line is the 20% removal threshold.
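The numbers behind each box can be reproduced roughly as follows. This is a sketch under assumptions: the prefix is taken as everything before the first hyphen, and outliers use the common 1.5 × IQR rule, which the text does not confirm as the rule used by the chart.

```python
from collections import defaultdict
from statistics import median, quantiles

def box_stats_by_prefix(samples):
    """samples: iterable of (sample_id, f_miss) -> {prefix: summary dict}."""
    groups = defaultdict(list)
    for sample_id, f_miss in samples:
        prefix = sample_id.split("-", 1)[0]  # "08-123" -> "08"
        groups[prefix].append(f_miss)

    summary = {}
    for prefix, values in groups.items():
        if len(values) >= 2:
            q1, _, q3 = quantiles(values, n=4, method="inclusive")
        else:
            q1 = q3 = values[0]  # a single sample has no spread
        upper_fence = q3 + 1.5 * (q3 - q1)  # assumed outlier rule
        summary[prefix] = {
            "median": median(values),
            "q1": q1,
            "q3": q3,
            "outliers": sorted(v for v in values if v > upper_fence),
        }
    return summary
```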

6. Relatedness Network — Who Matches Whom?

Each dot is a sample, and a line between two dots means those samples have nearly identical DNA (PI_HAT ≥ 0.98 — effectively the same person). Dots are grouped into clusters; a cluster of 2 means a pair of identical samples, a cluster of 3 means a triplet, etc.
Why is this a problem? In a normal study, each person should appear only once. Finding 106 samples in 49 identity clusters means there are identity problems in the dataset that need to be resolved.
Colors: ● Blue = original sample, ● Orange = d-suffix duplicate, ● Purple = t-suffix triplicate, ● Red = sample from a catastrophic chip.
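Grouping pairs into clusters is a connected-components problem. The sketch below uses a small union-find over (id1, id2, PI_HAT) tuples; the tuple layout and the name identity_clusters are assumptions for illustration, and the report's own clustering code may differ.

```python
def identity_clusters(pairs, threshold=0.98):
    """pairs: iterable of (id1, id2, pi_hat).

    Returns a sorted list of clusters (each a sorted list of sample IDs)
    formed by linking every pair with pi_hat >= threshold.
    """
    parent = {}

    def find(sample):
        parent.setdefault(sample, sample)
        while parent[sample] != sample:
            parent[sample] = parent[parent[sample]]  # path halving
            sample = parent[sample]
        return sample

    for id1, id2, pi_hat in pairs:
        if pi_hat >= threshold:
            parent[find(id1)] = find(id2)  # merge the two components

    clusters = {}
    for sample in list(parent):
        clusters.setdefault(find(sample), []).append(sample)
    return sorted(sorted(members) for members in clusters.values())
```

A cluster of length 2 is a pair, length 3 a triplet, and so on. Samples with no link at or above the threshold never enter a cluster.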

7. Identity Pairs & d/t Verification

PLINK IBD analysis found 65 pairs of samples with near-identical DNA (PI_HAT ≥ 0.98). Of these, 10 are expected d/t replicates (a d/t sample correctly matching its base). The other 55 are unexpected: two samples with different IDs that contain the same person's DNA.

d/t replicates: 48 samples were re-processed with a "d" or "t" suffix (e.g., "08-25d" should match "08-25"). Each d/t is on a different physical chip than its base — these are independent experiments, not software artifacts. We tested whether each d/t actually matches its expected base:
✓ d/t Match = matches base (expected). ✗ d/t Mismatch = matches a different person. ? d/t No match = doesn't match anyone at ≥98%.
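The three outcomes can be expressed as a small check. This sketch assumes a replicate ID is simply the base ID with a trailing "d" or "t", as in the "08-25d" example, and that the set of partners at PI_HAT ≥ 0.98 has already been collected for each sample; the function names are hypothetical.

```python
import re

# Assumed naming rule: base ID followed by a single trailing "d" or "t".
REPLICATE = re.compile(r"^(?P<base>.+?)[dt]$")

def expected_base(sample_id):
    """Return the base ID for a d/t replicate, or None for an ordinary sample."""
    match = REPLICATE.match(sample_id)
    return match.group("base") if match else None

def verify_replicate(sample_id, partners):
    """partners: set of IDs this replicate matches at PI_HAT >= 0.98."""
    base = expected_base(sample_id)
    if base is None:
        raise ValueError(f"{sample_id!r} is not a d/t replicate")
    if base in partners:
        return "match"       # matches its expected base
    if partners:
        return "mismatch"    # matches someone, but not its base
    return "no match"        # matches nobody at the threshold
```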

Pie chart: how many of the 55 unexpected pairs are on the same chip vs different chips.

8. Sex Check Results

Genetic sex can be determined from the X chromosome. Males have one X (highly homozygous, F near 1.0), females have two X's (more heterozygous, F near 0). Since this is an all-female pregnancy cohort, nearly all samples should cluster near F = 0.
The left chart plots each sample's X-chromosome F value. Samples far below 0 (extremely negative F) are likely contaminated: a mix of two people's DNA produces far more heterozygous calls than a single person's genome plausibly would. Extreme outliers are clamped (shown as ✕) to keep the chart readable.
The right chart is a histogram of F values — you can see the main peak near 0 (normal females) and a small tail of extreme negatives (contaminated).
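The F statistic is the usual inbreeding estimate, F = (O − E) / (N − E), where O is the observed number of homozygous X-chromosome calls, E the number expected from allele frequencies, and N the number of non-missing sites. The sketch below applies it and sorts samples into bands. The 0.2 and 0.8 defaults are PLINK's standard --check-sex thresholds; the contamination cut-off is not given in the text, so it is left as a required argument rather than guessed.

```python
def x_inbreeding_f(observed_hom, expected_hom, n_sites):
    """F = (O - E) / (N - E) over non-missing X-chromosome sites."""
    return (observed_hom - expected_hom) / (n_sites - expected_hom)

def classify_x_f(f, contaminated_below, female_max=0.2, male_min=0.8):
    """Band a sample by its X-chromosome F.

    contaminated_below has no default: the report does not state the value used.
    """
    if f < contaminated_below:
        return "possible contamination"
    if f < female_max:
        return "female"
    if f > male_min:
        return "male"
    return "ambiguous"
```

Fewer homozygous calls than expected (O < E) gives a negative F, which is why a two-person mixture lands far below 0.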

9. Heterozygosity Analysis

Heterozygosity is the fraction of DNA positions where a person has two different versions. In a healthy population, this rate is fairly consistent across individuals (~19%).
The left histogram shows the distribution — the tall peak near 0.19 is normal. The long right tail (high het rates) corresponds to samples from failed chips or contaminated tubes where the equipment misread signals as heterozygous.
The right scatter plot compares heterozygosity (vertical) vs missingness (horizontal). Normal samples cluster in the lower-left. Degraded samples drift to the upper-right — high missingness AND high heterozygosity together is the fingerprint of systematic genotyping failure. The red dots are samples with het rates more than 3 standard deviations above normal.
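The 3-standard-deviation flag can be sketched as below. One assumption to note: the mean and standard deviation here are computed over all samples, outliers included, which inflates the standard deviation somewhat; the report may have used a different reference set. The name het_outliers is hypothetical.

```python
from statistics import mean, stdev

def het_outliers(het_by_sample, z_cut=3.0):
    """het_by_sample: {sample_id: heterozygosity rate}.

    Returns {sample_id: z_score} for samples more than z_cut SDs above the mean.
    """
    values = list(het_by_sample.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs at least 2 samples
    flagged = {}
    for sample_id, het in het_by_sample.items():
        z = (het - mu) / sigma
        if z > z_cut:
            flagged[sample_id] = z
    return flagged
```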

10. Hyper-Connected Samples — Contamination or Dropout?

"Hyper-connected" means a sample appears genetically identical to many other samples simultaneously. One explanation is that a DNA mixture shares alleles from multiple donors, inflating apparent relatedness across many comparisons.
The blue bars show each suspect sample's heterozygosity rate, with the z-score (how many standard deviations from normal) labeled above. Bars above the red dashed line (+3σ) are statistically extreme. The orange diamonds show missingness on the right axis.
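Interpretation: Samples with both high het (z > 3) AND high missingness are the likeliest contamination candidates, although section 9 shows the same combination on failed chips, so chip status should be checked first. Samples with normal het that still match many others may simply be population outliers.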
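Finding hyper-connected samples amounts to counting, for each sample, how many partners it has at or above the identity threshold. In the sketch below the minimum partner count is a required argument because the text does not say how many matches count as "many"; the pair layout (id1, id2, PI_HAT) is also an assumption.

```python
from collections import Counter

def partner_counts(pairs, threshold=0.98):
    """pairs: iterable of (id1, id2, pi_hat) -> Counter of partners per sample."""
    counts = Counter()
    for id1, id2, pi_hat in pairs:
        if pi_hat >= threshold:
            counts[id1] += 1
            counts[id2] += 1
    return counts

def hyper_connected(pairs, min_partners, threshold=0.98):
    """Sample IDs with at least min_partners matches at the threshold."""
    counts = partner_counts(pairs, threshold)
    return sorted(s for s, n in counts.items() if n >= min_partners)
```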

11. Root Cause Analysis & Recommendations

★ 12. Practical Action Plan

13. Scanner QC & Control Probe Analysis

This section goes deeper than genotyping results — it examines the raw scanner hardware data from the IDAT files (the binary output of the Illumina iScan laser scanner) to determine why the 4 bad chips failed.
Every BeadChip contains 23 built-in control probes that test each step of the chemistry protocol independently: Did the dyes attach (staining)? Did the DNA bind to the chip (hybridization)? Did the enzymatic step work (extension)? By comparing control probe signals on bad chips vs good chips, we can pinpoint exactly which step failed.
Key stat cards: "S/N" = signal-to-noise ratio (higher is better). "Grn median" = median green-channel intensity across all 704K probes (higher is better). "B/G ratio" = Bad/Good ratio for control probes (1.0 = same as good chips, below 1.0 = weaker signal on bad chips).
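A B/G ratio per control type could be computed along these lines. This is a sketch, assuming control intensities have already been extracted from the IDAT files into plain lists keyed by control type, and assuming the ratio compares medians; the text does not say whether the dashboard uses medians or means.

```python
from statistics import median

def control_ratios(bad, good):
    """bad, good: {control_type: [intensity, ...]} -> {control_type: B/G ratio}.

    A ratio near 1.0 means bad chips look like good chips for that step;
    well below 1.0 means a weaker signal on the bad chips.
    """
    return {
        name: median(bad[name]) / median(good[name])
        for name in good
        if name in bad
    }

def weakest_control(ratios):
    """Control type with the lowest B/G ratio, i.e. the most suspect step."""
    return min(ratios, key=ratios.get)
```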

★ 14. Full Sample Verdict Table — Every Sample, Concrete Status

The definitive table. Every one of the 1,247 samples with its status (KEEP or REMOVE), the reason, and key QC metrics. Use the filter buttons to show specific categories. This is the actionable output of the entire investigation.

★★★ 15. Identity Audit — Samples with Unverifiable Identity

The problem: When two samples with different names produce identical DNA (PI_HAT ≥ 0.98), keeping one and removing the other does NOT tell us whose DNA is in the kept well. For example: if "08-265" and "08-267" are genetically identical, we cannot know whether the DNA belongs to person 265 or person 267. We kept 08-265 only because it has marginally lower missingness — but that says nothing about whose DNA it is.

This is NOT an edge case. ? of our KEEP samples are involved in such unexpected pairs and have UNVERIFIED identity. These samples have excellent genotyping quality but we cannot confirm they represent the person named on the label.

Impact:

For PCA / ADMIXTURE / FST: If all samples come from the same Uzbek population, the identity ambiguity has minimal effect — the DNA is still from a real Uzbek individual, we just don't know which one.
For GWAS / phenotype association: These samples MUST be excluded because we cannot match genotype to the correct phenotype record.
For publication: This must be documented as a sample-tracking limitation.
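Listing the affected samples is a simple set operation once the unexpected pairs and the KEEP/REMOVE statuses are available. The sketch below assumes both are held in memory as shown; unverified_keeps is a hypothetical name.

```python
def unverified_keeps(unexpected_pairs, status):
    """unexpected_pairs: iterable of (id1, id2) with different names, same DNA.
    status: {sample_id: "KEEP" or "REMOVE"}.

    Returns the KEEP samples whose identity cannot be confirmed.
    """
    involved = {sample for pair in unexpected_pairs for sample in pair}
    return sorted(s for s in involved if status.get(s) == "KEEP")
```

The returned list is the exclusion set for GWAS or phenotype association; for population-level analyses the same samples can stay in, as described above.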

★★ 16. KING-Robust Independent Verification

PLINK's PI_HAT (Method of Moments) and KING-robust (Manichaikul et al. 2010) use different algorithms to estimate relatedness. If both methods flag the same pairs, the result is independently confirmed — not a statistical artifact of one method.

Result: KING confirms every one of the 65 PLINK pairs; zero pairs were refuted. The 55 unexpected identity pairs are therefore not an artifact of the PI_HAT estimator.
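The cross-check between the two methods can be sketched as a comparison of pair sets. The 0.354 kinship cut-off is the duplicate/MZ-twin boundary from the KING documentation; whether the report used exactly this value is an assumption, as are the input layouts.

```python
def compare_pairs(plink_pairs, king_rows, king_cut=0.354):
    """plink_pairs: iterable of (id1, id2) flagged at PI_HAT >= 0.98.
    king_rows: iterable of (id1, id2, kinship) from KING-robust.

    Pairs are stored as frozensets so that (a, b) and (b, a) compare equal.
    """
    plink = {frozenset(pair) for pair in plink_pairs}
    king = {frozenset((id1, id2)) for id1, id2, kinship in king_rows
            if kinship > king_cut}
    return {
        "confirmed": plink & king,   # flagged by both methods
        "refuted": plink - king,     # PLINK only
        "king_only": king - plink,   # KING only
    }
```

The result reported above corresponds to an empty "refuted" set.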

★★★ 17. Resolution Plan: Targeted SNP Fingerprinting

The only way to resolve identity ambiguity is to go back to the original DNA tubes. Full re-genotyping is not needed — a cheap 96-SNP identity fingerprint panel ($8–25/sample) is the standard method for sample identity QC. One round of fingerprinting on 69 samples will make the entire dataset publication-ready.

How it works: The lab runs a small SNP panel on the original tube for each sample. We compare the fingerprint to our existing GSA genotypes. If the tube matches the genotype → identity confirmed. If it doesn't → the identity is wrong and needs further investigation.

What to request from the lab: "Run a 96-SNP identity fingerprint panel on these DNA tubes. Compare each fingerprint to the existing GSA genotype data (we provide the PLINK .bed). Return which sample matches and any mismatches."
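On the analysis side, the comparison reduces to a genotype concordance rate over the SNPs shared by the fingerprint panel and the array. The sketch below assumes genotypes are two-letter strings on the same strand with None for missing calls; the acceptance thresholds are required arguments because the text does not specify them.

```python
def fingerprint_concordance(fingerprint, array_calls):
    """fingerprint, array_calls: {snp_id: genotype such as "AG", or None}.

    Returns (concordance_rate, n_compared); rate is None if nothing overlaps.
    Allele order is ignored, so "AG" and "GA" count as the same genotype.
    """
    shared = [
        snp for snp, call in fingerprint.items()
        if call is not None and array_calls.get(snp) is not None
    ]
    if not shared:
        return None, 0
    same = sum(sorted(fingerprint[snp]) == sorted(array_calls[snp]) for snp in shared)
    return same / len(shared), len(shared)

def identity_confirmed(rate, n_compared, min_rate, min_snps):
    """True only if enough SNPs were compared and enough of them agree."""
    return rate is not None and n_compared >= min_snps and rate >= min_rate
```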
