Finding duplicate & similar code¶
Veska clusters similar code across your repo (or across every registered repo) so you can triage it for de-duplication. It is a query over the same graph everything else uses - no separate scan. One command returns the groups, ranked, each shaped so you can open a verify-and-dedupe task per cluster.
veska duplicates # one repo: ranked clusters, all tiers
veska duplicates --all-repos # across every registered repo
veska duplicates --tiers structural --path internal/ --json
The same view is available to your agent over MCP as eng_find_clusters
(no seed required).
Three tiers¶
A cluster is a group of two or more symbols that resemble each other. Every cluster carries a tier that says how they resemble each other, tightest first. A symbol appears at most once, at its tightest tier.
- exact - byte-for-byte identical source (literal copy-paste), by
content_hashequality. Deterministic, no embeddings. - structural - the same shape after a consistent rename of variables,
parameters, and literals (Type-2 clones). Two functions with identical bodies
but different names - or
a + bvsx + y- land here even though their text differs. Computed from an identifier-normalized hash of the syntax tree at parse time. - near - semantically similar above the elected embedder's calibrated
threshold (the
SIMILAR_TOedges auto-link already stores). The loosest tier: renamed-and-drifted copies, parallel implementations. Threshold-sensitive - lower--min-scorefor more recall, accept more noise.
Which tier should I act on first?
exact and structural are high-precision de-dupe candidates - the code is
genuinely the same, just copied or renamed. near is for review: it casts
a wider net and benefits from a human (or agent) deciding whether the
similarity is worth consolidating.
Scope¶
- One repo (default) - clusters within the repo you point at (
--repo, or the repo resolved from your cwd). - Cross-repo (
--all-repos) - clusters across every registered repo, so a function copy-pasted from one project into another shows up as one cluster whose members span both. Cross-repo covers the exact and structural tiers; the near tier stays within a single repo for now. - Path (
--path internal/infrastructure/mcp) - restrict the sweep to a file-path prefix for a focused pass.
Container and sub-symbol kinds (package, file, module, field, import, and the synthetic chunk nodes) are excluded so boilerplate doesn't flood the results.
Before the structural & near tiers work¶
The structural and near tiers read data populated at parse/promotion time:
the structural_hash on each node, and the similarity score on each
SIMILAR_TO edge. A repository indexed before those landed carries neither
until you re-index it:
After a re-index the exact tier is unchanged, and structural / near start
returning clusters. The exact tier needs nothing special - it works on any
promoted graph.
Related¶
- For "what else looks like this one symbol?" use
veska similar <symbol>(eng_search_similar) - the per-symbol pivot, no group-wide scan. veska clonesis the older, single-tier view (exact, or--near);veska duplicatessupersedes it with the unified, ranked, cross-repo output.- See The code graph for how nodes and edges are built, and Semantic search & embeddings for what powers the near tier.