Step 2: IBD Deduplication

1. Overview

Identity-by-descent (IBD) analysis estimates the proportion of the genome shared between every pair of individuals. The statistic PI_HAT (π̂) summarises this: PI_HAT = P(IBD=2) + 0.5 × P(IBD=1). For unrelated individuals, PI_HAT ≈ 0; for parent–offspring or full siblings, PI_HAT ≈ 0.50; for monozygotic twins or duplicate samples, PI_HAT ≈ 1.0.

Duplicate or near-identical samples (PI_HAT ≥ 0.98) violate the independence assumption required by virtually all downstream statistical tests. Retaining them inflates test statistics, biases allele frequency estimates, and produces artefactual signals in PCA and ADMIXTURE analyses.

Rationale: A PI_HAT threshold of ≥ 0.98 identifies samples that are genetically near-identical: the same individual genotyped twice, aliquots from the same DNA extraction, or monozygotic twins. This is well above the expected PI_HAT for first-degree relatives (~0.50), so biological relatives are retained. Relatedness at lower thresholds is assessed separately in Step 15 (ROH & IBD).

Key Metrics

Metric	Value
Starting Samples	1,148
Duplicates Removed	55
Final Samples	1,093
Removal Rate	4.8%

Deduplication summary: 63 sample pairs had PI_HAT ≥ 0.98. They involve 102 unique samples forming 47 connected clusters; 8 clusters contain three samples, meaning the same DNA was genotyped three times under different IDs. The sample with the lowest FID (the earliest registered) was retained in each cluster, and the remaining 55 were removed.

2. QC Parameter: PI_HAT & Kinship Categories

PLINK's --genome command computes pairwise IBD estimates for all sample pairs. The key outputs are Z0, Z1, Z2 (probabilities of sharing 0, 1, or 2 alleles IBD) and PI_HAT = Z2 + 0.5 × Z1.

The --min 0.98 flag restricts output to pairs exceeding the duplicate threshold, avoiding a massive output file for all ~658k sample pairs when only near-identical duplicates are of interest. IBD was computed on 621,580 autosomal variants (654,027 total minus 32,447 non-autosomal).
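The two counts quoted above can be reproduced with simple arithmetic. The sketch below uses only figures from this section; the variable names are illustrative.

```python
# Arithmetic behind the figures quoted above (values taken from the text).
n_samples = 1148
n_pairs = n_samples * (n_samples - 1) // 2  # unordered pairs: n choose 2

total_variants = 654_027
non_autosomal = 32_447
autosomal = total_variants - non_autosomal  # variants used for the IBD estimate

print(f"Sample pairs: {n_pairs:,}")
print(f"Autosomal variants used for IBD: {autosomal:,}")
```

This prints 658,378 sample pairs (the "~658k" above) and 621,580 autosomal variants.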

Expected PI_HAT by Relationship

Relationship	Expected PI_HAT	Typical Z0, Z1, Z2	Action in This Pipeline
Duplicate / MZ twin	~1.00	0, 0, 1	Removed (≥ 0.98)
Parent–offspring	~0.50	0, 1, 0	Retained
Full sibling	~0.50	0.25, 0.50, 0.25	Retained
Half-sibling / Avuncular	~0.25	0.50, 0.50, 0	Retained
First cousin	~0.125	0.75, 0.25, 0	Retained
Unrelated	~0.00	1, 0, 0	Retained
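
As a quick check on the table above, each expected PI_HAT follows from the Z-probabilities via PI_HAT = Z2 + 0.5 × Z1. The sketch below recomputes the column and applies the 0.98 threshold; the function `pi_hat` and the dictionary `EXPECTED_Z` are hypothetical helpers for illustration, not part of the pipeline.

```python
# Illustrative check of the table: PI_HAT = Z2 + 0.5 * Z1.
# pi_hat() and EXPECTED_Z are hypothetical names, not pipeline code.

def pi_hat(z0, z1, z2):
    """Return the proportion of the genome shared IBD given (Z0, Z1, Z2)."""
    if abs(z0 + z1 + z2 - 1.0) > 1e-9:
        raise ValueError("Z0 + Z1 + Z2 must sum to 1")
    return z2 + 0.5 * z1

EXPECTED_Z = {
    "Duplicate / MZ twin": (0.0, 0.0, 1.0),
    "Parent-offspring": (0.0, 1.0, 0.0),
    "Full sibling": (0.25, 0.50, 0.25),
    "Half-sibling / Avuncular": (0.50, 0.50, 0.0),
    "First cousin": (0.75, 0.25, 0.0),
    "Unrelated": (1.0, 0.0, 0.0),
}

for relationship, z in EXPECTED_Z.items():
    value = pi_hat(*z)
    action = "remove" if value >= 0.98 else "retain"
    print(f"{relationship:26s} PI_HAT = {value:.3f} -> {action}")
```

Only the first row reaches the 0.98 threshold. Parent-offspring and full siblings share the same PI_HAT (0.50) and differ only in how it is split between Z1 and Z2.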

3. IBD Analysis Results

63 pairs had PI_HAT ≥ 0.98. All are technical duplicates (the same individual genotyped under different IDs), as shown by Z2 ≈ 1.0 and Z0 ≈ 0.0 in every pair. PI_HAT values range from 0.9846 to 1.0000, and 50 of the 63 pairs fall in the range 0.9998–1.0000.
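
The pair table in this report lists only PI_HAT and Z2, but Z1 and Z0 can be recovered from them because PI_HAT = Z2 + 0.5 × Z1 and Z0 + Z1 + Z2 = 1. The sketch below does this for the lowest-scoring pair; `z_from_pihat` is a hypothetical helper, and because the inputs are rounded to four decimals the result is approximate.

```python
def z_from_pihat(pi_hat, z2):
    """Recover (Z0, Z1) from PI_HAT and Z2, using PI_HAT = Z2 + 0.5 * Z1."""
    z1 = 2.0 * (pi_hat - z2)
    # Clamp at zero: rounded inputs can yield a tiny negative remainder.
    z0 = max(0.0, 1.0 - z1 - z2)
    return z0, z1

# Lowest-scoring pair in this dataset: PI_HAT = 0.9846, Z2 = 0.9692.
z0, z1 = z_from_pihat(0.9846, 0.9692)
print(f"Z1 = {z1:.4f}, Z0 = {z0:.4f}")
```

Even for this pair the implied Z0 is essentially zero, which is what separates a noisy duplicate from a full sibling (expected Z0 = 0.25).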

[Figure: PI_HAT distribution of the 63 duplicate pairs, grouped by PI_HAT value; all pairs have PI_HAT ≥ 0.98.]

The lowest PI_HAT is 0.9846 (08-799 ↔ 06-41d) — still far above first-degree relatives (~0.50). All 63 pairs are unambiguous technical duplicates; none are biological relatives. 32 “d”-suffix and 3 “t”-suffix samples are all in the removal list, confirming the laboratory’s duplicate labelling. The remaining 20 removed samples had different IID roots (same individual registered under different codes).
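
The suffix breakdown above can be tallied with a few lines of Python. This is a minimal sketch that assumes the laboratory suffix is always the final character of the IID; `label_kind` and the short sample list are illustrative only.

```python
from collections import Counter

def label_kind(iid):
    """Classify an IID by the suffix convention described above."""
    if iid.endswith("d"):
        return "d-suffix"
    if iid.endswith("t"):
        return "t-suffix"
    return "other"

# A few IIDs taken from the removal list, for illustration only.
sample_iids = ["01-29", "01-29t", "08-131d", "06-30", "03-156d", "02-45t"]
counts = Counter(label_kind(iid) for iid in sample_iids)
print(dict(counts))
```

Applied to the IID column of duplicates_pihat098.txt, the same tally should give 32 "d", 3 "t", and 20 other, provided the suffix assumption holds.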

4. Cluster Analysis & Deduplication Strategy

The 63 pairs were modelled as an undirected graph (samples = nodes, PI_HAT ≥ 0.98 = edges). Connected components identify clusters of multiply-duplicated individuals. A total of 47 clusters were found: 39 of size 2 (simple pairs) and 8 of size 3 (one individual genotyped three times).

Deduplication Rule

Keep the sample with the lowest FID (earliest registered) per cluster. This is deterministic, reproducible, and does not require missingness data. Because all cluster members are near-identical (PI_HAT ≥ 0.98), differences in genotyping quality between them are treated as negligible.
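
To illustrate the rule on a small input, the sketch below groups samples with a union-find (disjoint-set) structure and keeps the lowest FID per group. This is an alternative to the BFS used in Step 2b, shown only as an example; the function names `find` and `dedup` are hypothetical, and the data are clusters C1 and C2 from this report.

```python
def find(parent, x):
    """Return the root of x, compressing the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def dedup(pairs, fid_of):
    """Group samples linked by duplicate pairs; keep the lowest FID per group."""
    parent = {s: s for pair in pairs for s in pair}
    for a, b in pairs:
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for s in parent:
        groups.setdefault(find(parent, s), []).append(s)
    result = []
    for members in groups.values():
        members.sort(key=lambda s: fid_of[s])
        result.append((members[0], members[1:]))  # (kept, removed)
    return sorted(result, key=lambda kr: fid_of[kr[0]])

# Clusters C1 and C2 from this dataset (IID -> FID).
fid_of = {"02-29": 11, "01-29": 52, "01-29t": 338, "08-131": 42, "08-131d": 381}
pairs = [("02-29", "01-29"), ("02-29", "01-29t"), ("01-29", "01-29t"),
         ("08-131", "08-131d")]
for keep, remove in dedup(pairs, fid_of):
    print(f"keep {keep} (FID {fid_of[keep]}), remove {remove}")
```

The triple collapses to 02-29 (FID 11) and the pair to 08-131 (FID 42), matching the cluster table.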

Cluster Size Distribution

Cluster Size	Clusters	Samples Involved	Samples Removed
2 (pair)	39	78	39
3 (triple)	8	24	16
Total	47	102	55
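
The totals in the table follow directly from the cluster-size counts, since each cluster of size k keeps one sample and removes k − 1. A short check, with illustrative variable names:

```python
# Cluster-size counts from the table above: {cluster size: number of clusters}.
size_counts = {2: 39, 3: 8}

clusters = sum(size_counts.values())
samples_involved = sum(size * n for size, n in size_counts.items())
samples_removed = sum((size - 1) * n for size, n in size_counts.items())
# A fully connected cluster of size k contributes k * (k - 1) / 2 pairs.
max_pairs = sum(size * (size - 1) // 2 * n for size, n in size_counts.items())

print(clusters, samples_involved, samples_removed, max_pairs)
```

The maximum possible pair count (63) equals the observed number of pairs, which implies that every size-3 cluster is fully connected: all three pairwise comparisons meet the threshold.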

All 47 clusters with kept/removed members

Source: ConvSK_mind20.genome on Biotech2024 (/staging/ALSU-analysis/spring2026/). Strategy: keep lowest FID per cluster.

Cluster  Size  Keep             Remove
C1       3     02-29 (FID:11)   01-29 (FID:52), 01-29t (FID:338)
C2       2     08-131 (FID:42)  08-131d (FID:381)
C3       2     02-39 (FID:71)   06-30 (FID:206)
C4       2     01-17 (FID:90)   01-17d (FID:291)
C5       2     09-37 (FID:91)   02-104 (FID:150)
C6       2     03-154 (FID:129) 03-155d (FID:352)
C7       3     03-155 (FID:130) 03-156 (FID:131), 03-156d (FID:353)
C8       2     01-50 (FID:145)  01-50d (FID:327)
C9       2     02-36 (FID:152)  07-19 (FID:270)
C10      3     02-45 (FID:154)  08-107 (FID:237), 06-43d (FID:325)
C11      2     02-52 (FID:157)  01-53 (FID:290)
C12      3     02-59 (FID:158)  02-45d (FID:287), 02-45t (FID:347)
C13      3     04-07 (FID:165)  06-34d (FID:273), 06-15d (FID:276)
C14      2     04-08 (FID:166)  07-01d (FID:280)
C15      2     08-493 (FID:168) 08-744 (FID:422)
C16      2     04-13 (FID:169)  01-59d (FID:284)
C17      2     04-14 (FID:170)  02-49d (FID:282)
C18      2     04-22 (FID:176)  06-23d (FID:266)
C19      2     04-25 (FID:178)  07-02d (FID:263)
C20      2     04-45 (FID:187)  06-06d (FID:262)
C21      2     06-04 (FID:194)  02-52d (FID:293)
C22      2     06-28 (FID:204)  08-107d (FID:380)
C23      3     06-29 (FID:205)  02-104d (FID:292), 02-104t (FID:342)
C24      2     06-38 (FID:213)  06-42d (FID:268)
C25      2     06-39 (FID:214)  07-10d (FID:278)
C26      2     07-01 (FID:218)  02-64d (FID:283)
C27      2     07-04 (FID:220)  04-20d (FID:297)
C28      2     07-10 (FID:222)  04-54d (FID:279)
C29      2     08-799 (FID:225) 06-41d (FID:300)
C30      2     08-816 (FID:229) 02-36d (FID:326)
C31      2     08-129 (FID:231) 08-493d (FID:415)
C32      2     08-541 (FID:232) 04-22d (FID:294)
C33      3     08-509 (FID:233) 04-36 (FID:296), 04-40 (FID:369)
C34      2     08-774 (FID:234) 04-55d (FID:281)
C35      2     07-15 (FID:243)  08-160 (FID:244)
C36      2     07-16 (FID:245)  02-90 (FID:247)
C37      2     08-436 (FID:246) 07-17 (FID:248)
C38      3     08-179 (FID:249) 08-194 (FID:261), 03-37 (FID:265)
C39      2     08-498 (FID:269) 08-795d (FID:272)
C40      2     08-128 (FID:271) 08-124 (FID:275)
C41      2     08-817 (FID:277) 08-81d (FID:306)
C42      2     04-23 (FID:295)  04-23d (FID:329)
C43      2     08-770 (FID:335) 08-770d (FID:519)
C44      2     01-18 (FID:580)  09-76 (FID:581)
C45      2     08-265 (FID:757) 08-267 (FID:758)
C46      2     08-181 (FID:832) 08-45 (FID:849)
C47      2     12-04 (FID:908)  12-05 (FID:909)

All 63 pairs with PI_HAT values

Source: ConvSK_mind20.genome — all pairs with PI_HAT ≥ 0.98.

FID1  IID1      FID2  IID2       PI_HAT   Z2
  11  02-29       52  01-29      0.9999   0.9997
  11  02-29      338  01-29t     0.9998   0.9997
  42  08-131     381  08-131d    0.9998   0.9995
  52  01-29      338  01-29t     0.9999   0.9998
  71  02-39      206  06-30      0.9999   0.9998
  90  01-17      291  01-17d     0.9999   0.9999
  91  09-37      150  02-104     0.9999   0.9999
 129  03-154     352  03-155d    0.9999   0.9999
 130  03-155     131  03-156     0.9998   0.9997
 130  03-155     353  03-156d    0.9999   0.9999
 131  03-156     353  03-156d    0.9998   0.9997
 145  01-50      327  01-50d     0.9999   0.9999
 152  02-36      270  07-19      0.9999   0.9999
 154  02-45      237  08-107     0.9999   0.9999
 154  02-45      325  06-43d     1.0000   0.9999
 157  02-52      290  01-53      0.9998   0.9998
 158  02-59      287  02-45d     0.9999   0.9999
 158  02-59      347  02-45t     0.9988   0.9977
 165  04-07      273  06-34d     0.9998   0.9997
 165  04-07      276  06-15d     0.9999   0.9998
 166  04-08      280  07-01d     0.9997   0.9994
 168  08-493     422  08-744     0.9993   0.9986
 169  04-13      284  01-59d     0.9999   0.9999
 170  04-14      282  02-49d     1.0000   1.0000
 176  04-22      266  06-23d     0.9999   0.9999
 178  04-25      263  07-02d     0.9991   0.9981
 187  04-45      262  06-06d     0.9982   0.9964
 194  06-04      293  02-52d     1.0000   1.0000
 204  06-28      380  08-107d    0.9997   0.9994
 205  06-29      292  02-104d    0.9999   0.9998
 205  06-29      342  02-104t    0.9999   0.9998
 213  06-38      268  06-42d     0.9999   0.9999
 214  06-39      278  07-10d     0.9999   0.9999
 218  07-01      283  02-64d     1.0000   0.9999
 220  07-04      297  04-20d     1.0000   0.9999
 222  07-10      279  04-54d     1.0000   1.0000
 225  08-799     300  06-41d     0.9846   0.9692
 229  08-816     326  02-36d     0.9995   0.9991
 231  08-129     415  08-493d    1.0000   1.0000
 232  08-541     294  04-22d     0.9999   0.9999
 233  08-509     296  04-36      1.0000   1.0000
 233  08-509     369  04-40      0.9999   0.9999
 234  08-774     281  04-55d     1.0000   1.0000
 237  08-107     325  06-43d     1.0000   1.0000
 243  07-15      244  08-160     1.0000   0.9999
 245  07-16      247  02-90      0.9998   0.9997
 246  08-436     248  07-17      0.9998   0.9997
 249  08-179     261  08-194     0.9992   0.9984
 249  08-179     265  03-37      0.9992   0.9984
 261  08-194     265  03-37      0.9991   0.9982
 269  08-498     272  08-795d    1.0000   1.0000
 271  08-128     275  08-124     0.9994   0.9989
 273  06-34d     276  06-15d     0.9999   0.9997
 277  08-817     306  08-81d     0.9999   0.9999
 287  02-45d     347  02-45t     0.9989   0.9978
 292  02-104d    342  02-104t    0.9999   0.9999
 295  04-23      329  04-23d     0.9999   0.9999
 296  04-36      369  04-40      0.9999   0.9999
 335  08-770     519  08-770d    1.0000   0.9999
 580  01-18      581  09-76      1.0000   0.9999
 757  08-265     758  08-267     1.0000   1.0000
 832  08-181     849  08-45      1.0000   0.9999
 908  12-04      909  12-05      0.9998   0.9995

5. Comparison with Winter 2025

Winter 2025 found 65 pairs / 49 clusters / 57 removed (from 1,155 input). Spring 2026 finds 63 pairs / 47 clusters / 55 removed (from 1,148 input). The 2 extra winter pairs were formed by 2 of the 7 samples that Step 1 incorrectly retained (see Step 1 §4).
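
The figures for the two runs can be reconciled with a short calculation. The sketch below uses only the numbers quoted above; the dictionary layout is illustrative.

```python
# Counts quoted in the text for each run.
runs = {
    "winter_2025": {"input": 1155, "pairs": 65, "clusters": 49, "removed": 57},
    "spring_2026": {"input": 1148, "pairs": 63, "clusters": 47, "removed": 55},
}
for name, r in runs.items():
    r["retained"] = r["input"] - r["removed"]
    print(f"{name}: {r['retained']} retained")

# Winter-minus-spring difference for each count.
w, s = runs["winter_2025"], runs["spring_2026"]
diff = {k: w[k] - s[k] for k in ("input", "pairs", "clusters", "removed")}
print(diff)
```

The winter run has 7 more input samples and 2 more pairs, clusters, and removals than the spring run.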

Methodological note: winter 2025 computed IBD on LD-pruned variants (44,782 SNPs after --indep-pairwise 50 5 0.1), while spring 2026 uses all 621,580 autosomal variants. At PI_HAT ≥ 0.98 (duplicates), the variant set has negligible effect — the 2-pair difference is attributable to the sample correction.

6. Input & Output Data

Input

Files	ConvSK_mind20.bed, ConvSK_mind20.bim, ConvSK_mind20.fam
Location	`/staging/ALSU-analysis/spring2026/`
Samples	1,148 (from Step 1)
Variants	654,027 SNPs

Output

Files	ConvSK_mind20_dedup.bed, ConvSK_mind20_dedup.bim, ConvSK_mind20_dedup.fam
Location	`/staging/ALSU-analysis/spring2026/`
Samples	1,093 (55 removed from 47 duplicate clusters)
Variants	654,027 SNPs (unchanged)

Intermediate Files

File	Description
`ConvSK_mind20.genome`	Pairwise IBD estimates for pairs with PI_HAT ≥ 0.98 (63 pairs + header)
`duplicates_pihat098.txt`	55 sample IDs to remove (FID and IID, tab-separated)
`dup_clusters_summary.tsv`	47 clusters with kept/removed members

7. Commands Executed

Step 2a: Compute pairwise IBD

$ cd /staging/ALSU-analysis/spring2026/

$ plink --bfile ConvSK_mind20 \
  --genome \
  --min 0.98 \
  --out ConvSK_mind20

PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)
654027 variants loaded from .bim file.
1148 people (0 males, 0 females, 1148 ambiguous) loaded from .fam.
Total genotyping rate is 0.980199.
654027 variants and 1148 people pass filters and QC.
Excluding 32447 variants on non-autosomes from IBD calculation.
IBD calculations complete.
Finished writing ConvSK_mind20.genome .

Step 2b: Build clusters and generate removal list

# Save as step2_dedup_graph.py in the working directory, then run:
# $ python3 step2_dedup_graph.py
#
# Reads: ConvSK_mind20.genome (output of Step 2a)
# Writes: duplicates_pihat098.txt (removal list for Step 2c)
#         dup_clusters_summary.tsv (cluster audit log)

from collections import defaultdict, deque

# Parse the .genome file: the first four columns are FID1, IID1, FID2, IID2.
# IIDs are used as node keys, so they are assumed to be unique in the dataset.
pairs = []
fid_of = {}
with open('ConvSK_mind20.genome') as f:
    f.readline()  # skip the header line
    for line in f:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        fid1, iid1 = int(parts[0]), parts[1]
        fid2, iid2 = int(parts[2]), parts[3]
        pairs.append((iid1, iid2))
        fid_of[iid1] = fid1
        fid_of[iid2] = fid2

# Build adjacency graph (one node per sample, one edge per duplicate pair)
adj = defaultdict(set)
for s1, s2 in pairs:
    adj[s1].add(s2)
    adj[s2].add(s1)

# Find connected components (BFS)
visited = set()
clusters = []
for node in sorted(adj.keys()):
    if node in visited:
        continue
    comp = set()
    q = deque([node])
    while q:
        n = q.popleft()
        if n in visited:
            continue
        visited.add(n)
        comp.add(n)
        q.extend(nb for nb in adj[n] if nb not in visited)
    clusters.append(comp)

# Strategy: keep the sample with the lowest FID (earliest registered) per cluster
to_remove = []
with open('dup_clusters_summary.tsv', 'w') as cf:
    cf.write('cluster_id\tsize\tkept_fid\tkept_iid\tremoved\n')
    for i, cluster in enumerate(sorted(clusters, key=lambda c: min(fid_of[s] for s in c)), 1):
        members = sorted(cluster, key=lambda s: fid_of[s])
        keep = members[0]
        remove = members[1:]
        to_remove.extend(remove)
        removed_str = ','.join(f'{fid_of[s]}:{s}' for s in remove)
        cf.write(f'{i}\t{len(cluster)}\t{fid_of[keep]}\t{keep}\t{removed_str}\n')

with open('duplicates_pihat098.txt', 'w') as f:
    for iid in sorted(to_remove, key=lambda s: fid_of[s]):
        f.write(f'{fid_of[iid]}\t{iid}\n')

print(f'Pairs: {len(pairs)}')
print(f'Clusters: {len(clusters)}')
print(f'Samples involved: {len(adj)}')
print(f'To remove: {len(to_remove)}')
print(f'To keep: {len(clusters)}')
print(f'Expected final: 1148 - {len(to_remove)} = {1148 - len(to_remove)}')

# Expected output:
Pairs: 63
Clusters: 47
Samples involved: 102
To remove: 55
To keep: 47
Expected final: 1148 - 55 = 1093

Step 2c: Remove duplicates

$ plink --bfile ConvSK_mind20 \
  --remove duplicates_pihat098.txt \
  --make-bed \
  --out ConvSK_mind20_dedup

# Expected: 55 people removed, 1093 remaining.
# Result: ConvSK_mind20_dedup.bed/bim/fam (654,027 variants x 1,093 samples)
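
Before running Step 2c, the effect of the removal list can be previewed in Python. PLINK's --remove matches samples on the FID and IID in the first two columns of the list; the sketch below mimics that matching on .fam rows. `apply_remove` is a hypothetical helper, and the third toy row (FID 500, IID 99-01) is invented for illustration.

```python
def apply_remove(fam_lines, remove_lines):
    """Drop .fam rows whose (FID, IID) pair appears in the removal list."""
    to_remove = {tuple(line.split()[:2]) for line in remove_lines if line.strip()}
    return [line for line in fam_lines
            if line.strip() and tuple(line.split()[:2]) not in to_remove]

# Toy .fam rows (FID IID PAT MAT SEX PHENO); the first two mirror cluster C2,
# the third is a made-up unaffected sample.
fam = ["42 08-131 0 0 0 -9", "381 08-131d 0 0 0 -9", "500 99-01 0 0 0 -9"]
remove = ["381\t08-131d"]
kept = apply_remove(fam, remove)
print(len(fam) - len(kept), "removed;", len(kept), "remaining")
```

With the real ConvSK_mind20.fam and duplicates_pihat098.txt read in as lists of lines, the same call should report 55 removed and 1,093 remaining if every listed ID is present in the .fam file.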

8. Removed Samples

55 samples were removed from 47 duplicate clusters. All 32 “d”-suffix and 3 “t”-suffix samples are in the removal list, confirming that the laboratory’s duplicate labelling was correct.

All 55 removed samples

Source: duplicates_pihat098.txt on Biotech2024 (/staging/ALSU-analysis/spring2026/).

FID      IID         Cluster  Partner(s) Kept
52       01-29       C1       02-29 (FID:11)
131      03-156      C7       03-155 (FID:130)
150      02-104      C5       09-37 (FID:91)
206      06-30       C3       02-39 (FID:71)
237      08-107      C10      02-45 (FID:154)
244      08-160      C35      07-15 (FID:243)
247      02-90       C36      07-16 (FID:245)
248      07-17       C37      08-436 (FID:246)
261      08-194      C38      08-179 (FID:249)
262      06-06d      C20      04-45 (FID:187)
263      07-02d      C19      04-25 (FID:178)
265      03-37       C38      08-179 (FID:249)
266      06-23d      C18      04-22 (FID:176)
268      06-42d      C24      06-38 (FID:213)
270      07-19       C9       02-36 (FID:152)
272      08-795d     C39      08-498 (FID:269)
273      06-34d      C13      04-07 (FID:165)
275      08-124      C40      08-128 (FID:271)
276      06-15d      C13      04-07 (FID:165)
278      07-10d      C25      06-39 (FID:214)
279      04-54d      C28      07-10 (FID:222)
280      07-01d      C14      04-08 (FID:166)
281      04-55d      C34      08-774 (FID:234)
282      02-49d      C17      04-14 (FID:170)
283      02-64d      C26      07-01 (FID:218)
284      01-59d      C16      04-13 (FID:169)
287      02-45d      C12      02-59 (FID:158)
290      01-53       C11      02-52 (FID:157)
291      01-17d      C4       01-17 (FID:90)
292      02-104d     C23      06-29 (FID:205)
293      02-52d      C21      06-04 (FID:194)
294      04-22d      C32      08-541 (FID:232)
296      04-36       C33      08-509 (FID:233)
297      04-20d      C27      07-04 (FID:220)
300      06-41d      C29      08-799 (FID:225)
306      08-81d      C41      08-817 (FID:277)
325      06-43d      C10      02-45 (FID:154)
326      02-36d      C30      08-816 (FID:229)
327      01-50d      C8       01-50 (FID:145)
329      04-23d      C42      04-23 (FID:295)
338      01-29t      C1       02-29 (FID:11)
342      02-104t     C23      06-29 (FID:205)
347      02-45t      C12      02-59 (FID:158)
352      03-155d     C6       03-154 (FID:129)
353      03-156d     C7       03-155 (FID:130)
369      04-40       C33      08-509 (FID:233)
380      08-107d     C22      06-28 (FID:204)
381      08-131d     C2       08-131 (FID:42)
415      08-493d     C31      08-129 (FID:231)
422      08-744      C15      08-493 (FID:168)
519      08-770d     C43      08-770 (FID:335)
581      09-76       C44      01-18 (FID:580)
758      08-267      C45      08-265 (FID:757)
849      08-45       C46      08-181 (FID:832)
909      12-05       C47      12-04 (FID:908)

9. Quality Verification

✓ Verification checks:

duplicates_pihat098.txt contains exactly 55 lines
All 32 “d”-suffix and 3 “t”-suffix samples are in the removal list
No cluster has more than one sample kept
No variants removed — 654,027 SNPs carried forward
Expected: ConvSK_mind20_dedup.fam has 1,093 lines

$ wc -l duplicates_pihat098.txt
55 duplicates_pihat098.txt

$ wc -l ConvSK_mind20_dedup.fam ConvSK_mind20_dedup.bim
   1093 ConvSK_mind20_dedup.fam
 654027 ConvSK_mind20_dedup.bim
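
The cluster-level checks in the list above can also be scripted against dup_clusters_summary.tsv. The sketch below assumes the column layout written in Step 2b (cluster_id, size, kept_fid, kept_iid, and a comma-separated removed list of FID:IID entries); `check_summary` is a hypothetical helper, and the example rows are clusters C1 and C2.

```python
def check_summary(tsv_lines):
    """Validate cluster summary rows; return (n_clusters, n_removed)."""
    kept, removed = set(), set()
    rows = [l.rstrip("\n").split("\t") for l in tsv_lines[1:] if l.strip()]
    for cluster_id, size, kept_fid, kept_iid, removed_str in rows:
        members = [tuple(m.split(":", 1)) for m in removed_str.split(",")]
        # Each cluster keeps exactly one sample, so size - 1 are removed.
        assert len(members) == int(size) - 1, f"cluster {cluster_id}: size mismatch"
        # The kept sample must carry the lowest FID in its cluster.
        assert all(int(kept_fid) <= int(fid) for fid, _ in members), cluster_id
        kept.add((kept_fid, kept_iid))
        removed.update(members)
    assert kept.isdisjoint(removed), "a sample is both kept and removed"
    return len(rows), len(removed)

example = [
    "cluster_id\tsize\tkept_fid\tkept_iid\tremoved\n",
    "1\t3\t11\t02-29\t52:01-29,338:01-29t\n",
    "2\t2\t42\t08-131\t381:08-131d\n",
]
print(check_summary(example))
```

On the full summary file the expected return value is (47, 55).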

10. Chronological Log

Winter 2025 (original run)

Initial IBD deduplication
Input: 1,155 samples (from buggy Step 1). Found 65 pairs in 49 clusters. Removed 57 → 1,098 retained. Used LD pruning + full genome computation.

Spring 2026 (re-analysis)

IBD computed on corrected input
plink --genome --min 0.98 on ConvSK_mind20 (1,148 samples). IBD computed on 621,580 autosomal variants. Found 63 pairs.

Spring 2026

Cluster analysis
63 pairs → 47 connected components (102 unique samples). Strategy: keep lowest FID per cluster.