Step 2a: Compute pairwise IBD
$ cd /staging/ALSU-analysis/spring2026/
plink --bfile ConvSK_mind20 \
--genome \
--min 0.98 \
--out ConvSK_mind20
PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)
654027 variants loaded from .bim file.
1148 people (0 males, 0 females, 1148 ambiguous) loaded from .fam.
Total genotyping rate is 0.980199.
654027 variants and 1148 people pass filters and QC.
Excluding 32447 variants on non-autosomes from IBD calculation.
IBD calculations complete.
Finished writing ConvSK_mind20.genome .
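Before clustering, it is worth sanity-checking that every pair PLINK wrote actually clears the --min cutoff. A minimal sketch (the helper name is ours; the PI_HAT column is located by name from the .genome header rather than by position):

```python
def pihat_violations(genome_path, threshold=0.98):
    """Return (IID1, IID2, PI_HAT) rows that fall below the threshold."""
    bad = []
    with open(genome_path) as f:
        header = f.readline().split()
        col = header.index('PI_HAT')  # standard column in plink --genome output
        for line in f:
            parts = line.split()
            if not parts:
                continue
            pihat = float(parts[col])
            if pihat < threshold:
                bad.append((parts[1], parts[3], pihat))
    return bad
```

An empty return (e.g. `pihat_violations('ConvSK_mind20.genome')`) means the --min 0.98 filter was applied as intended.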
Step 2b: Build clusters and generate removal list
# Save as step2_dedup_graph.py in the working directory, then run:
# $ python3 step2_dedup_graph.py
#
# Reads: ConvSK_mind20.genome (output of Step 2a)
# Writes: duplicates_pihat098.txt (removal list for Step 2c)
# dup_clusters_summary.tsv (cluster audit log)
from collections import defaultdict, deque

pairs = []
fid_of = {}
with open('ConvSK_mind20.genome') as f:
    f.readline()  # skip header
    for line in f:
        parts = line.split()
        if not parts:
            continue
        fid1, iid1 = int(parts[0]), parts[1]
        fid2, iid2 = int(parts[2]), parts[3]
        pairs.append((iid1, iid2))
        fid_of[iid1] = fid1
        fid_of[iid2] = fid2

# Build adjacency graph: one edge per PI_HAT >= 0.98 pair
adj = defaultdict(set)
for s1, s2 in pairs:
    adj[s1].add(s2)
    adj[s2].add(s1)

# Find connected components (BFS)
visited = set()
clusters = []
for node in sorted(adj):
    if node in visited:
        continue
    comp = set()
    q = deque([node])
    while q:
        n = q.popleft()
        if n in visited:
            continue
        visited.add(n)
        comp.add(n)
        q.extend(nb for nb in adj[n] if nb not in visited)
    clusters.append(comp)

# Strategy: keep the sample with the lowest FID (earliest registered) per cluster
to_remove = []
with open('dup_clusters_summary.tsv', 'w') as cf:
    cf.write('cluster_id\tsize\tkept_fid\tkept_iid\tremoved\n')
    for i, cluster in enumerate(sorted(clusters, key=lambda c: min(fid_of[s] for s in c)), 1):
        members = sorted(cluster, key=lambda s: (fid_of[s], s))  # IID breaks FID ties
        keep = members[0]
        remove = members[1:]
        to_remove.extend(remove)
        removed_str = ','.join(f'{fid_of[s]}:{s}' for s in remove)
        cf.write(f'{i}\t{len(cluster)}\t{fid_of[keep]}\t{keep}\t{removed_str}\n')

with open('duplicates_pihat098.txt', 'w') as f:
    for iid in sorted(to_remove, key=lambda s: fid_of[s]):
        f.write(f'{fid_of[iid]}\t{iid}\n')

print(f'Pairs: {len(pairs)}')
print(f'Clusters: {len(clusters)}')
print(f'Samples involved: {len(adj)}')
print(f'To remove: {len(to_remove)}')
print(f'To keep: {len(clusters)}')
print(f'Expected final: 1148 - {len(to_remove)} = {1148 - len(to_remove)}')
# Expected output:
Pairs: 63
Clusters: 47
Samples involved: 102
To remove: 55
To keep: 47
Expected final: 1148 - 55 = 1093
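These counts are internally consistent and can be checked by hand: exactly one sample is kept per cluster, so the number removed equals samples involved minus clusters. (63 pairs over 47 clusters also tells us some clusters are larger than a single pair, since 102 nodes in 47 components need at least 55 edges.)

```python
samples_involved = 102   # nodes in the PI_HAT >= 0.98 graph
clusters = 47            # connected components, one sample kept from each
to_remove = samples_involved - clusters
assert to_remove == 55
assert 1148 - to_remove == 1093   # matches the expected final sample count
```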
Step 2c: Remove duplicates
$ plink --bfile ConvSK_mind20 \
--remove duplicates_pihat098.txt \
--make-bed \
--out ConvSK_mind20_dedup
# Expected: 55 people removed, 1093 remaining.
# Result: ConvSK_mind20_dedup.bed/bim/fam (654,027 variants x 1,093 samples)
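As a final check, the deduplicated .fam can be compared against the removal list. A sketch (the helper name is ours; paths are the files produced in Steps 2b and 2c):

```python
def verify_dedup(fam_path, remove_path, expected_n=1093):
    """Confirm no removed (FID, IID) survives and the final count matches."""
    with open(remove_path) as f:
        removed = {tuple(line.split()[:2]) for line in f if line.strip()}
    with open(fam_path) as f:
        kept = [tuple(line.split()[:2]) for line in f if line.strip()]
    assert not removed & set(kept), 'removed sample still present in .fam'
    assert len(kept) == expected_n, f'expected {expected_n}, got {len(kept)}'
    return len(kept)

# verify_dedup('ConvSK_mind20_dedup.fam', 'duplicates_pihat098.txt')
```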