
Step 2: IBD-Based Duplicate Removal

Identify and remove duplicate or near-identical samples using genome-wide identity-by-descent

✓ Completed — Spring 2026 (re-analysis)

1. Overview

Identity-by-descent (IBD) analysis estimates the proportion of the genome shared between every pair of individuals. The statistic PI_HAT (π̂) summarises this: PI_HAT = P(IBD=2) + 0.5 × P(IBD=1). For unrelated individuals, PI_HAT ≈ 0; for parent–offspring or full siblings, PI_HAT ≈ 0.50; for monozygotic twins or duplicate samples, PI_HAT ≈ 1.0.
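As a quick illustration of the formula above, the following sketch computes PI_HAT from the IBD-sharing probabilities. The function name `pi_hat` is hypothetical, and the inputs are idealised expectations rather than estimates from this dataset.

```python
def pi_hat(p_ibd1: float, p_ibd2: float) -> float:
    """Proportion of the genome shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return p_ibd2 + 0.5 * p_ibd1


# Idealised (P(IBD=1), P(IBD=2)) per relationship; real estimates are noisier.
expected = {
    "unrelated": (0.0, 0.0),
    "parent-offspring": (1.0, 0.0),
    "full siblings": (0.5, 0.25),
    "duplicate / MZ twin": (0.0, 1.0),
}

for relationship, (p1, p2) in expected.items():
    print(f"{relationship}: PI_HAT = {pi_hat(p1, p2):.2f}")
```

Parent–offspring and full-sibling pairs reach the same PI_HAT of 0.50 through different sharing patterns, which is why the individual probabilities are reported alongside the summary statistic.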

Duplicate or near-identical samples (PI_HAT ≥ 0.98) violate the independence assumption required by virtually all downstream statistical tests. Retaining them inflates test statistics, biases allele frequency estimates, and produces artefactual signals in PCA and ADMIXTURE analyses.

Rationale: A PI_HAT threshold of ≥ 0.98 identifies samples that are near-genetically-identical — either the same individual genotyped twice, aliquots from the same DNA extraction, or monozygotic twins. This is well above the expected PI_HAT for first-degree relatives (~0.50), ensuring that biological relatives are retained. Relatedness at lower thresholds is assessed separately in Step 15 (ROH & IBD).

Key Metrics

Starting Samples: 1,148
Duplicates Removed: 55
Final Samples: 1,093
Removal Rate: 4.8%

Deduplication summary: 63 sample pairs had PI_HAT ≥ 0.98, involving 102 unique samples that formed 47 connected clusters (some individuals appeared in multiple pairs, indicating the same DNA was genotyped three times under different IDs). The earliest-registered sample per cluster was retained (lowest FID), and the remaining 55 were removed.
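The headline numbers can be cross-checked with a few lines of arithmetic. This is a sketch that assumes the cluster sizes reported in Section 4 (39 pairs and 8 triples); the variable names are illustrative only.

```python
# Cluster size -> number of clusters, as reported in Section 4.
cluster_counts = {2: 39, 3: 8}
start = 1148

n_clusters = sum(cluster_counts.values())
n_involved = sum(size * n for size, n in cluster_counts.items())
# One sample is kept per cluster, so size - 1 are removed.
n_removed = sum((size - 1) * n for size, n in cluster_counts.items())
# Pairs expected if every cluster is fully connected: size * (size - 1) / 2 each.
n_pairs_if_complete = sum(size * (size - 1) // 2 * n for size, n in cluster_counts.items())

print(n_clusters, n_involved, n_removed, n_pairs_if_complete)
print(f"final = {start - n_removed}, removal rate = {n_removed / start:.1%}")
```

The 63 pairs implied by fully connected clusters match the 63 pairs observed, which is consistent with every triple being linked by all three of its pairwise comparisons.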

2. QC Parameter: PI_HAT & Kinship Categories

PLINK's --genome command computes pairwise IBD estimates for all sample pairs. The key outputs are Z0, Z1, Z2 (probabilities of sharing 0, 1, or 2 alleles IBD) and PI_HAT = Z2 + 0.5 × Z1.

The --min 0.98 flag restricts output to pairs exceeding the duplicate threshold, avoiding a massive output file for all ~658k sample pairs when only near-identical duplicates are of interest. IBD was computed on 621,580 autosomal variants (654,027 total minus 32,447 non-autosomal).
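The two counts quoted above follow from simple arithmetic, sketched here with illustrative variable names.

```python
from math import comb

n_samples = 1148
n_pairs = comb(n_samples, 2)       # n * (n - 1) / 2 unordered sample pairs
n_autosomal = 654_027 - 32_447     # total variants minus non-autosomal variants

print(f"Sample pairs: {n_pairs:,}")            # 658,378
print(f"Autosomal variants: {n_autosomal:,}")  # 621,580
```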

Expected PI_HAT by Relationship

| Relationship | Expected PI_HAT | Typical Z0, Z1, Z2 | Action in This Pipeline |
|---|---|---|---|
| Duplicate / MZ twin | ~1.00 | 0, 0, 1 | Removed (≥ 0.98) |
| Parent–offspring | ~0.50 | 0, 1, 0 | Retained |
| Full sibling | ~0.50 | 0.25, 0.50, 0.25 | Retained |
| Half-sibling / Avuncular | ~0.25 | 0.50, 0.50, 0 | Retained |
| First cousin | ~0.125 | 0.75, 0.25, 0 | Retained |
| Unrelated | ~0.00 | 1, 0, 0 | Retained |

3. IBD Analysis Results

63 pairs met the PI_HAT ≥ 0.98 threshold. All are clearly technical duplicates (same individual genotyped under different IDs), as evidenced by Z2 ≥ 0.9692 and Z0 ≈ 0.0 in every pair. PI_HAT values range from 0.9846 to 1.0000, with the vast majority at 0.9998–1.0000.

[Figure: PI_HAT distribution of the 63 duplicate pairs (all with PI_HAT ≥ 0.98), grouped by PI_HAT value]

The lowest PI_HAT is 0.9846 (08-799 ↔ 06-41d) — still far above first-degree relatives (~0.50). All 63 pairs are unambiguous technical duplicates; none are biological relatives. 32 “d”-suffix and 3 “t”-suffix samples are all in the removal list, confirming the laboratory’s duplicate labelling. The remaining 20 removed samples had different IID roots (same individual registered under different codes).
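The suffix tally can be reproduced by classifying each removed IID by its trailing character. The sketch below uses a handful of IIDs copied from this page rather than the full list of 55, and the helper name `suffix_class` is hypothetical.

```python
from collections import Counter


def suffix_class(iid: str) -> str:
    """Classify a sample ID by its trailing duplicate marker, if any."""
    if iid.endswith("d"):
        return "d-suffix"
    if iid.endswith("t"):
        return "t-suffix"
    return "no suffix"


# A few removed IIDs from this page (not the complete removal list).
removed_iids = ["01-29", "01-29t", "08-131d", "06-30", "01-17d", "06-41d"]

counts = Counter(suffix_class(iid) for iid in removed_iids)
for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

Run against the second column of duplicates_pihat098.txt, the same tally should give 32, 3 and 20 for the three classes if the counts stated above hold.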

4. Cluster Analysis & Deduplication Strategy

The 63 pairs were modelled as an undirected graph (samples = nodes, PI_HAT ≥ 0.98 = edges). Connected components identify clusters of multiply-duplicated individuals. A total of 47 clusters were found: 39 of size 2 (simple pairs) and 8 of size 3 (one individual genotyped three times).
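The graph step can be sketched on a toy edge list built from clusters C1 and C2. The clique check is an extra sanity test added here for illustration and is not part of the original pipeline; both function names are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def connected_components(edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), set()
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            comp.add(node)
            for nb in adj[node] - seen:
                seen.add(nb)
                queue.append(nb)
        components.append(comp)
    return components


def is_clique(component, edges):
    """True if every pair of members in the component is directly linked."""
    linked = {frozenset(e) for e in edges}
    return all(frozenset(p) in linked for p in combinations(component, 2))


# Edges for cluster C1 (a triple) and cluster C2 (a simple pair).
edges = [("02-29", "01-29"), ("02-29", "01-29t"), ("01-29", "01-29t"),
         ("08-131", "08-131d")]
for comp in connected_components(edges):
    print(sorted(comp), "clique" if is_clique(comp, edges) else "chain")
```

A component that is connected but not a clique (A–B and B–C without A–C) would suggest a borderline pair rather than a clean triplicate, so it may be worth flagging such cases for manual review.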

Deduplication Rule

Keep the sample with the lowest FID (earliest registered) per cluster. This rule is deterministic and reproducible, and it does not require missingness data. Because all cluster members are near-identical (PI_HAT ≥ 0.98), genotyping quality differences between them are treated as negligible.
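A minimal sketch of the rule, applied to cluster C10. The helper name `split_cluster` is hypothetical. FIDs are compared as integers because a string comparison would order "291" before "90".

```python
def split_cluster(members):
    """Split a cluster into (kept, removed) by lowest numeric FID.

    `members` is an iterable of (fid, iid) tuples. FIDs may arrive as strings
    when read from a text file, so they are converted to int for sorting.
    """
    ordered = sorted(members, key=lambda m: int(m[0]))
    return ordered[0], ordered[1:]


# Cluster C10 from this page, deliberately listed out of order.
cluster_c10 = [("325", "06-43d"), ("154", "02-45"), ("237", "08-107")]
kept, removed = split_cluster(cluster_c10)
print("keep:", kept)
print("remove:", removed)
```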

Cluster Size Distribution

| Cluster Size | Clusters | Samples Involved | Samples Removed |
|---|---|---|---|
| 2 (pair) | 39 | 78 | 39 |
| 3 (triple) | 8 | 24 | 16 |
| Total | 47 | 102 | 55 |

All 47 clusters with kept/removed members

Source: ConvSK_mind20.genome on Biotech2024 (/staging/ALSU-analysis/spring2026/). Strategy: keep lowest FID per cluster.

| Cluster | Size | Keep | Remove |
|---|---|---|---|
| C1 | 3 | 02-29 (FID:11) | 01-29 (FID:52), 01-29t (FID:338) |
| C2 | 2 | 08-131 (FID:42) | 08-131d (FID:381) |
| C3 | 2 | 02-39 (FID:71) | 06-30 (FID:206) |
| C4 | 2 | 01-17 (FID:90) | 01-17d (FID:291) |
| C5 | 2 | 09-37 (FID:91) | 02-104 (FID:150) |
| C6 | 2 | 03-154 (FID:129) | 03-155d (FID:352) |
| C7 | 3 | 03-155 (FID:130) | 03-156 (FID:131), 03-156d (FID:353) |
| C8 | 2 | 01-50 (FID:145) | 01-50d (FID:327) |
| C9 | 2 | 02-36 (FID:152) | 07-19 (FID:270) |
| C10 | 3 | 02-45 (FID:154) | 08-107 (FID:237), 06-43d (FID:325) |
| C11 | 2 | 02-52 (FID:157) | 01-53 (FID:290) |
| C12 | 3 | 02-59 (FID:158) | 02-45d (FID:287), 02-45t (FID:347) |
| C13 | 3 | 04-07 (FID:165) | 06-34d (FID:273), 06-15d (FID:276) |
| C14 | 2 | 04-08 (FID:166) | 07-01d (FID:280) |
| C15 | 2 | 08-493 (FID:168) | 08-744 (FID:422) |
| C16 | 2 | 04-13 (FID:169) | 01-59d (FID:284) |
| C17 | 2 | 04-14 (FID:170) | 02-49d (FID:282) |
| C18 | 2 | 04-22 (FID:176) | 06-23d (FID:266) |
| C19 | 2 | 04-25 (FID:178) | 07-02d (FID:263) |
| C20 | 2 | 04-45 (FID:187) | 06-06d (FID:262) |
| C21 | 2 | 06-04 (FID:194) | 02-52d (FID:293) |
| C22 | 2 | 06-28 (FID:204) | 08-107d (FID:380) |
| C23 | 3 | 06-29 (FID:205) | 02-104d (FID:292), 02-104t (FID:342) |
| C24 | 2 | 06-38 (FID:213) | 06-42d (FID:268) |
| C25 | 2 | 06-39 (FID:214) | 07-10d (FID:278) |
| C26 | 2 | 07-01 (FID:218) | 02-64d (FID:283) |
| C27 | 2 | 07-04 (FID:220) | 04-20d (FID:297) |
| C28 | 2 | 07-10 (FID:222) | 04-54d (FID:279) |
| C29 | 2 | 08-799 (FID:225) | 06-41d (FID:300) |
| C30 | 2 | 08-816 (FID:229) | 02-36d (FID:326) |
| C31 | 2 | 08-129 (FID:231) | 08-493d (FID:415) |
| C32 | 2 | 08-541 (FID:232) | 04-22d (FID:294) |
| C33 | 3 | 08-509 (FID:233) | 04-36 (FID:296), 04-40 (FID:369) |
| C34 | 2 | 08-774 (FID:234) | 04-55d (FID:281) |
| C35 | 2 | 07-15 (FID:243) | 08-160 (FID:244) |
| C36 | 2 | 07-16 (FID:245) | 02-90 (FID:247) |
| C37 | 2 | 08-436 (FID:246) | 07-17 (FID:248) |
| C38 | 3 | 08-179 (FID:249) | 08-194 (FID:261), 03-37 (FID:265) |
| C39 | 2 | 08-498 (FID:269) | 08-795d (FID:272) |
| C40 | 2 | 08-128 (FID:271) | 08-124 (FID:275) |
| C41 | 2 | 08-817 (FID:277) | 08-81d (FID:306) |
| C42 | 2 | 04-23 (FID:295) | 04-23d (FID:329) |
| C43 | 2 | 08-770 (FID:335) | 08-770d (FID:519) |
| C44 | 2 | 01-18 (FID:580) | 09-76 (FID:581) |
| C45 | 2 | 08-265 (FID:757) | 08-267 (FID:758) |
| C46 | 2 | 08-181 (FID:832) | 08-45 (FID:849) |
| C47 | 2 | 12-04 (FID:908) | 12-05 (FID:909) |

All 63 pairs with PI_HAT values

Source: ConvSK_mind20.genome — all pairs with PI_HAT ≥ 0.98.

| FID1 | IID1 | FID2 | IID2 | PI_HAT | Z2 |
|---|---|---|---|---|---|
| 11 | 02-29 | 52 | 01-29 | 0.9999 | 0.9997 |
| 11 | 02-29 | 338 | 01-29t | 0.9998 | 0.9997 |
| 42 | 08-131 | 381 | 08-131d | 0.9998 | 0.9995 |
| 52 | 01-29 | 338 | 01-29t | 0.9999 | 0.9998 |
| 71 | 02-39 | 206 | 06-30 | 0.9999 | 0.9998 |
| 90 | 01-17 | 291 | 01-17d | 0.9999 | 0.9999 |
| 91 | 09-37 | 150 | 02-104 | 0.9999 | 0.9999 |
| 129 | 03-154 | 352 | 03-155d | 0.9999 | 0.9999 |
| 130 | 03-155 | 131 | 03-156 | 0.9998 | 0.9997 |
| 130 | 03-155 | 353 | 03-156d | 0.9999 | 0.9999 |
| 131 | 03-156 | 353 | 03-156d | 0.9998 | 0.9997 |
| 145 | 01-50 | 327 | 01-50d | 0.9999 | 0.9999 |
| 152 | 02-36 | 270 | 07-19 | 0.9999 | 0.9999 |
| 154 | 02-45 | 237 | 08-107 | 0.9999 | 0.9999 |
| 154 | 02-45 | 325 | 06-43d | 1.0000 | 0.9999 |
| 157 | 02-52 | 290 | 01-53 | 0.9998 | 0.9998 |
| 158 | 02-59 | 287 | 02-45d | 0.9999 | 0.9999 |
| 158 | 02-59 | 347 | 02-45t | 0.9988 | 0.9977 |
| 165 | 04-07 | 273 | 06-34d | 0.9998 | 0.9997 |
| 165 | 04-07 | 276 | 06-15d | 0.9999 | 0.9998 |
| 166 | 04-08 | 280 | 07-01d | 0.9997 | 0.9994 |
| 168 | 08-493 | 422 | 08-744 | 0.9993 | 0.9986 |
| 169 | 04-13 | 284 | 01-59d | 0.9999 | 0.9999 |
| 170 | 04-14 | 282 | 02-49d | 1.0000 | 1.0000 |
| 176 | 04-22 | 266 | 06-23d | 0.9999 | 0.9999 |
| 178 | 04-25 | 263 | 07-02d | 0.9991 | 0.9981 |
| 187 | 04-45 | 262 | 06-06d | 0.9982 | 0.9964 |
| 194 | 06-04 | 293 | 02-52d | 1.0000 | 1.0000 |
| 204 | 06-28 | 380 | 08-107d | 0.9997 | 0.9994 |
| 205 | 06-29 | 292 | 02-104d | 0.9999 | 0.9998 |
| 205 | 06-29 | 342 | 02-104t | 0.9999 | 0.9998 |
| 213 | 06-38 | 268 | 06-42d | 0.9999 | 0.9999 |
| 214 | 06-39 | 278 | 07-10d | 0.9999 | 0.9999 |
| 218 | 07-01 | 283 | 02-64d | 1.0000 | 0.9999 |
| 220 | 07-04 | 297 | 04-20d | 1.0000 | 0.9999 |
| 222 | 07-10 | 279 | 04-54d | 1.0000 | 1.0000 |
| 225 | 08-799 | 300 | 06-41d | 0.9846 | 0.9692 |
| 229 | 08-816 | 326 | 02-36d | 0.9995 | 0.9991 |
| 231 | 08-129 | 415 | 08-493d | 1.0000 | 1.0000 |
| 232 | 08-541 | 294 | 04-22d | 0.9999 | 0.9999 |
| 233 | 08-509 | 296 | 04-36 | 1.0000 | 1.0000 |
| 233 | 08-509 | 369 | 04-40 | 0.9999 | 0.9999 |
| 234 | 08-774 | 281 | 04-55d | 1.0000 | 1.0000 |
| 237 | 08-107 | 325 | 06-43d | 1.0000 | 1.0000 |
| 243 | 07-15 | 244 | 08-160 | 1.0000 | 0.9999 |
| 245 | 07-16 | 247 | 02-90 | 0.9998 | 0.9997 |
| 246 | 08-436 | 248 | 07-17 | 0.9998 | 0.9997 |
| 249 | 08-179 | 261 | 08-194 | 0.9992 | 0.9984 |
| 249 | 08-179 | 265 | 03-37 | 0.9992 | 0.9984 |
| 261 | 08-194 | 265 | 03-37 | 0.9991 | 0.9982 |
| 269 | 08-498 | 272 | 08-795d | 1.0000 | 1.0000 |
| 271 | 08-128 | 275 | 08-124 | 0.9994 | 0.9989 |
| 273 | 06-34d | 276 | 06-15d | 0.9999 | 0.9997 |
| 277 | 08-817 | 306 | 08-81d | 0.9999 | 0.9999 |
| 287 | 02-45d | 347 | 02-45t | 0.9989 | 0.9978 |
| 292 | 02-104d | 342 | 02-104t | 0.9999 | 0.9999 |
| 295 | 04-23 | 329 | 04-23d | 0.9999 | 0.9999 |
| 296 | 04-36 | 369 | 04-40 | 0.9999 | 0.9999 |
| 335 | 08-770 | 519 | 08-770d | 1.0000 | 0.9999 |
| 580 | 01-18 | 581 | 09-76 | 1.0000 | 0.9999 |
| 757 | 08-265 | 758 | 08-267 | 1.0000 | 1.0000 |
| 832 | 08-181 | 849 | 08-45 | 1.0000 | 0.9999 |
| 908 | 12-04 | 909 | 12-05 | 0.9998 | 0.9995 |

5. Comparison with Winter 2025

Winter 2025 found 65 pairs / 49 clusters / 57 removed (from 1,155 input). Spring 2026 finds 63 pairs / 47 clusters / 55 removed (from 1,148 input). The 2 extra winter pairs were formed by 2 of the 7 samples that Step 1 incorrectly retained (see Step 1 §4).

Methodological note: Winter 2025 computed IBD on LD-pruned variants (44,782 SNPs after --indep-pairwise 50 5 0.1), whereas Spring 2026 uses all 621,580 autosomal variants. At the duplicate threshold (PI_HAT ≥ 0.98), the choice of variant set has a negligible effect, so the 2-pair difference is attributable to the sample correction.

6. Input & Output Data

Input

Files: ConvSK_mind20.bed, ConvSK_mind20.bim, ConvSK_mind20.fam
Location: /staging/ALSU-analysis/spring2026/
Samples: 1,148 (from Step 1)
Variants: 654,027 SNPs

Output

Files: ConvSK_mind20_dedup.bed, ConvSK_mind20_dedup.bim, ConvSK_mind20_dedup.fam
Location: /staging/ALSU-analysis/spring2026/
Samples: 1,093 (55 removed from 47 duplicate clusters)
Variants: 654,027 SNPs (unchanged)

Intermediate Files

| File | Description |
|---|---|
| ConvSK_mind20.genome | Pairwise IBD estimates for all pairs with PI_HAT ≥ 0.98 (63 pairs + header) |
| duplicates_pihat098.txt | 55 sample IDs to remove (FID and IID, tab-separated) |
| dup_clusters_summary.tsv | 47 clusters with kept/removed members |

7. Commands Executed

Step 2a: Compute pairwise IBD

$ cd /staging/ALSU-analysis/spring2026/
$ plink --bfile ConvSK_mind20 \
    --genome \
    --min 0.98 \
    --out ConvSK_mind20

PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)
654027 variants loaded from .bim file.
1148 people (0 males, 0 females, 1148 ambiguous) loaded from .fam.
Total genotyping rate is 0.980199.
654027 variants and 1148 people pass filters and QC.
Excluding 32447 variants on non-autosomes from IBD calculation.
IBD calculations complete.
Finished writing ConvSK_mind20.genome .
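When inspecting the resulting file, looking columns up by header name is less fragile than relying on position. The sketch below assumes the standard PLINK 1.9 .genome header (FID1, IID1, FID2, IID2, RT, EZ, Z0, Z1, Z2, PI_HAT, PHE, DST, PPC, RATIO). The single data row uses PI_HAT and Z2 from this page for pair 04-14 / 02-49d, while the remaining fields are placeholder values and not actual output. The function name `read_genome` is hypothetical.

```python
import io

# Illustrative excerpt: only IDs, Z2 and PI_HAT come from this page;
# Z0, Z1, RT, EZ, PHE, DST, PPC and RATIO are placeholders.
sample_genome = io.StringIO(
    " FID1   IID1 FID2   IID2 RT EZ     Z0     Z1     Z2 PI_HAT PHE      DST    PPC RATIO\n"
    "  170  04-14  282 02-49d UN NA 0.0000 0.0000 1.0000 1.0000  -1 1.000000 1.0000    NA\n"
)


def read_genome(handle, min_pi_hat=0.98):
    """Yield one dict per pair with PI_HAT >= min_pi_hat, keyed by column name."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        row = dict(zip(header, fields))
        if float(row["PI_HAT"]) >= min_pi_hat:
            yield row


for row in read_genome(sample_genome):
    print(row["IID1"], row["IID2"], row["PI_HAT"], row["Z2"])
```

With the real file, an open handle to ConvSK_mind20.genome can be passed in place of the in-memory excerpt. The threshold filter is redundant when --min 0.98 was used, but it guards against reading an unfiltered file by mistake.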

Step 2b: Build clusters and generate removal list

# Save as step2_dedup_graph.py in the working directory, then run:
#   $ python3 step2_dedup_graph.py
#
# Reads:  ConvSK_mind20.genome (output of Step 2a)
# Writes: duplicates_pihat098.txt (removal list for Step 2c)
#         dup_clusters_summary.tsv (cluster audit log)
from collections import defaultdict, deque

# Samples are keyed by IID alone, which assumes IIDs are unique across FIDs.
pairs = []
fid_of = {}
with open('ConvSK_mind20.genome') as f:
    f.readline()  # skip the header line
    for line in f:
        parts = line.split()
        if not parts:  # tolerate blank lines
            continue
        fid1, iid1 = int(parts[0]), parts[1]
        fid2, iid2 = int(parts[2]), parts[3]
        pairs.append((iid1, iid2))
        fid_of[iid1] = fid1
        fid_of[iid2] = fid2

# Build adjacency graph
adj = defaultdict(set)
for s1, s2 in pairs:
    adj[s1].add(s2)
    adj[s2].add(s1)

# Find connected components (BFS)
visited = set()
clusters = []
for node in sorted(adj.keys()):
    if node in visited:
        continue
    comp = set()
    q = deque([node])
    while q:
        n = q.popleft()
        if n in visited:
            continue
        visited.add(n)
        comp.add(n)
        q.extend(nb for nb in adj[n] if nb not in visited)
    clusters.append(comp)

# Strategy: keep the sample with the lowest FID (earliest registered) per cluster
to_remove = []
with open('dup_clusters_summary.tsv', 'w') as cf:
    cf.write('cluster_id\tsize\tkept_fid\tkept_iid\tremoved\n')
    ordered = sorted(clusters, key=lambda c: min(fid_of[s] for s in c))
    for i, cluster in enumerate(ordered, 1):
        members = sorted(cluster, key=lambda s: fid_of[s])
        keep = members[0]
        remove = members[1:]
        to_remove.extend(remove)
        removed_str = ','.join(f'{fid_of[s]}:{s}' for s in remove)
        cf.write(f'{i}\t{len(cluster)}\t{fid_of[keep]}\t{keep}\t{removed_str}\n')

# Removal list in the FID<TAB>IID format expected by plink --remove
with open('duplicates_pihat098.txt', 'w') as f:
    for iid in sorted(to_remove, key=lambda s: fid_of[s]):
        f.write(f'{fid_of[iid]}\t{iid}\n')

print(f'Pairs: {len(pairs)}')
print(f'Clusters: {len(clusters)}')
print(f'Samples involved: {len(adj)}')
print(f'To remove: {len(to_remove)}')
print(f'To keep: {len(clusters)}')
print(f'Expected final: 1148 - {len(to_remove)} = {1148 - len(to_remove)}')

# Expected output:
# Pairs: 63
# Clusters: 47
# Samples involved: 102
# To remove: 55
# To keep: 47
# Expected final: 1148 - 55 = 1093

Step 2c: Remove duplicates

$ plink --bfile ConvSK_mind20 \
    --remove duplicates_pihat098.txt \
    --make-bed \
    --out ConvSK_mind20_dedup

# Expected: 55 people removed, 1093 remaining.
# Result: ConvSK_mind20_dedup.bed/bim/fam (654,027 variants x 1,093 samples)

8. Removed Samples

55 samples were removed from 47 duplicate clusters. All 32 “d”-suffix and 3 “t”-suffix samples are in the removal list, confirming that the laboratory’s duplicate labelling was correct.

All 55 removed samples

Source: duplicates_pihat098.txt on Biotech2024 (/staging/ALSU-analysis/spring2026/).

| FID | IID | Cluster | Partner Kept |
|---|---|---|---|
| 52 | 01-29 | C1 | 02-29 (FID:11) |
| 131 | 03-156 | C7 | 03-155 (FID:130) |
| 150 | 02-104 | C5 | 09-37 (FID:91) |
| 206 | 06-30 | C3 | 02-39 (FID:71) |
| 237 | 08-107 | C10 | 02-45 (FID:154) |
| 244 | 08-160 | C35 | 07-15 (FID:243) |
| 247 | 02-90 | C36 | 07-16 (FID:245) |
| 248 | 07-17 | C37 | 08-436 (FID:246) |
| 261 | 08-194 | C38 | 08-179 (FID:249) |
| 262 | 06-06d | C20 | 04-45 (FID:187) |
| 263 | 07-02d | C19 | 04-25 (FID:178) |
| 265 | 03-37 | C38 | 08-179 (FID:249) |
| 266 | 06-23d | C18 | 04-22 (FID:176) |
| 268 | 06-42d | C24 | 06-38 (FID:213) |
| 270 | 07-19 | C9 | 02-36 (FID:152) |
| 272 | 08-795d | C39 | 08-498 (FID:269) |
| 273 | 06-34d | C13 | 04-07 (FID:165) |
| 275 | 08-124 | C40 | 08-128 (FID:271) |
| 276 | 06-15d | C13 | 04-07 (FID:165) |
| 278 | 07-10d | C25 | 06-39 (FID:214) |
| 279 | 04-54d | C28 | 07-10 (FID:222) |
| 280 | 07-01d | C14 | 04-08 (FID:166) |
| 281 | 04-55d | C34 | 08-774 (FID:234) |
| 282 | 02-49d | C17 | 04-14 (FID:170) |
| 283 | 02-64d | C26 | 07-01 (FID:218) |
| 284 | 01-59d | C16 | 04-13 (FID:169) |
| 287 | 02-45d | C12 | 02-59 (FID:158) |
| 290 | 01-53 | C11 | 02-52 (FID:157) |
| 291 | 01-17d | C4 | 01-17 (FID:90) |
| 292 | 02-104d | C23 | 06-29 (FID:205) |
| 293 | 02-52d | C21 | 06-04 (FID:194) |
| 294 | 04-22d | C32 | 08-541 (FID:232) |
| 296 | 04-36 | C33 | 08-509 (FID:233) |
| 297 | 04-20d | C27 | 07-04 (FID:220) |
| 300 | 06-41d | C29 | 08-799 (FID:225) |
| 306 | 08-81d | C41 | 08-817 (FID:277) |
| 325 | 06-43d | C10 | 02-45 (FID:154) |
| 326 | 02-36d | C30 | 08-816 (FID:229) |
| 327 | 01-50d | C8 | 01-50 (FID:145) |
| 329 | 04-23d | C42 | 04-23 (FID:295) |
| 338 | 01-29t | C1 | 02-29 (FID:11) |
| 342 | 02-104t | C23 | 06-29 (FID:205) |
| 347 | 02-45t | C12 | 02-59 (FID:158) |
| 352 | 03-155d | C6 | 03-154 (FID:129) |
| 353 | 03-156d | C7 | 03-155 (FID:130) |
| 369 | 04-40 | C33 | 08-509 (FID:233) |
| 380 | 08-107d | C22 | 06-28 (FID:204) |
| 381 | 08-131d | C2 | 08-131 (FID:42) |
| 415 | 08-493d | C31 | 08-129 (FID:231) |
| 422 | 08-744 | C15 | 08-493 (FID:168) |
| 519 | 08-770d | C43 | 08-770 (FID:335) |
| 581 | 09-76 | C44 | 01-18 (FID:580) |
| 758 | 08-267 | C45 | 08-265 (FID:757) |
| 849 | 08-45 | C46 | 08-181 (FID:832) |
| 909 | 12-05 | C47 | 12-04 (FID:908) |

9. Quality Verification

✓ Verification checks:
  • duplicates_pihat098.txt contains exactly 55 lines
  • All 32 “d”-suffix and 3 “t”-suffix samples are in the removal list
  • No cluster has more than one sample kept
  • No variants removed — 654,027 SNPs carried forward
  • Expected: ConvSK_mind20_dedup.fam has 1,093 lines
$ wc -l duplicates_pihat098.txt
55 duplicates_pihat098.txt
$ wc -l ConvSK_mind20_dedup.fam ConvSK_mind20_dedup.bim
1093 ConvSK_mind20_dedup.fam
654027 ConvSK_mind20_dedup.bim
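Line counts alone do not show that the right samples were dropped. A stricter check, sketched here, compares sample IDs directly. It assumes that the first two whitespace-separated fields of each line are FID and IID, and that PLINK preserved the input sample order when writing the output. The function name `verify_removal` and the sample `999 XX-01` are hypothetical.

```python
def verify_removal(original_fam, removal_list, dedup_fam):
    """Check that dedup_fam equals original_fam minus removal_list.

    Each argument is an iterable of lines whose first two whitespace-separated
    fields are FID and IID. Returns (n_original, n_removed, n_kept).
    """
    def ids(lines):
        return [tuple(line.split()[:2]) for line in lines if line.strip()]

    original, kept = ids(original_fam), ids(dedup_fam)
    removed = set(ids(removal_list))
    if not removed <= set(original):
        raise ValueError("removal list names samples absent from the input")
    if kept != [s for s in original if s not in removed]:
        raise ValueError("output is not exactly input minus removal list")
    return len(original), len(removed), len(kept)


# Toy data: cluster C2 from this page plus one hypothetical unaffected sample.
original_fam = ["42 08-131 0 0 0 -9", "381 08-131d 0 0 0 -9", "999 XX-01 0 0 0 -9"]
removal_list = ["381\t08-131d"]
dedup_fam = ["42 08-131 0 0 0 -9", "999 XX-01 0 0 0 -9"]
print(verify_removal(original_fam, removal_list, dedup_fam))
```

For the real data, open file handles to ConvSK_mind20.fam, duplicates_pihat098.txt and ConvSK_mind20_dedup.fam can be passed in the same order, and the returned tuple should then read (1148, 55, 1093).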

10. Chronological Log

Winter 2025 (original run): Initial IBD deduplication
Input: 1,155 samples (from buggy Step 1). Found 65 pairs in 49 clusters. Removed 57 → 1,098 retained. Used LD pruning + full genome computation.

Spring 2026 (re-analysis): IBD computed on corrected input
plink --genome --min 0.98 on ConvSK_mind20 (1,148 samples). IBD computed on 621,580 autosomal variants. Found 63 pairs.

Spring 2026: Cluster analysis
63 pairs → 47 connected components (102 unique samples). Strategy: keep lowest FID per cluster.

Spring 2026: Duplicates removed
plink --remove: 1,148 → 1,093 samples. Output: ConvSK_mind20_dedup.