Repository Deduplication Tool

A Python-based code analysis project that detects exact and near-duplicate logic in a repository and prepares structured, reviewable context for LLM-assisted refactoring.

Project goal

This project was built to explore a practical software maintenance problem: how to find repeated logic before it turns into long-term engineering cost. The goal was not only to detect copy-pasted code, but to design a workflow that combines static analysis with LLM-assisted refactoring in a more controlled and reviewable way.

Why this project matters

Duplicate code is easy to introduce and difficult to manage at scale. It increases maintenance effort, raises the risk of inconsistent bug fixes, and makes future refactoring slower and more error-prone. This project focuses on surfacing duplicated logic early, organizing it into meaningful groups, and turning those findings into structured refactor tasks instead of leaving them as raw scanner output.

Why the LLM component matters

The most important idea behind this project is that a language model is most useful when the problem has already been narrowed and structured. Rather than asking an LLM to reason over an entire repository, this tool first performs the analysis work: it scans the codebase, extracts functions and methods, detects exact and near-duplicate logic, and packages only the relevant context for the next step.
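The extraction step described above can be sketched with the standard library's ast module. This is a minimal illustration, not the project's actual code: the Snippet record and the extract_functions name are hypothetical, and the real tool may track different fields.

```python
import ast
from dataclasses import dataclass
from typing import List


@dataclass
class Snippet:
    """One extracted function or method (hypothetical record type)."""
    qualname: str
    lineno: int
    end_lineno: int
    source: str


def extract_functions(source: str) -> List[Snippet]:
    """Collect every function and method defined in a module's source text."""
    tree = ast.parse(source)
    snippets: List[Snippet] = []

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                name = prefix + child.name
                snippets.append(Snippet(
                    qualname=name,
                    lineno=child.lineno,
                    end_lineno=child.end_lineno,
                    # Source text of the definition, without decorators.
                    source=ast.get_source_segment(source, child) or "",
                ))
                # Descend so nested functions are found as well.
                visit(child, name + ".")
            elif isinstance(child, ast.ClassDef):
                # Methods are qualified with their class name.
                visit(child, prefix + child.name + ".")
            else:
                visit(child, prefix)

    visit(tree, "")
    return snippets
```

Keeping a qualified name and line range for each snippet means later stages can point a reviewer at the exact location without re-reading the whole file.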

That design turns the LLM from a general code generator into a constrained refactor assistant. Each prompt is built around a canonical snippet, the duplicate members in the cluster, and explicit instructions to preserve behavior, keep signatures stable when possible, prefer shared helpers, and return a unified diff. This makes the workflow safer, more focused, and more realistic for smaller models with limited context windows.
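A prompt of this shape might be assembled roughly as follows. The function name, the RULES wording, and the (location, source) pair format are assumptions made for illustration; only the four constraints themselves come from the description above.

```python
from typing import List, Tuple

# The four constraints named in the paragraph above; the wording is illustrative.
RULES = [
    "Preserve existing behavior.",
    "Keep function signatures stable when possible.",
    "Prefer extracting a shared helper over rewriting callers.",
    "Return the proposed change as a unified diff.",
]


def build_refactor_prompt(canonical: str, members: List[Tuple[str, str]]) -> str:
    """Assemble one bounded prompt for a single duplicate cluster.

    `members` is a list of (location, source) pairs, one per duplicate.
    """
    parts = ["You are reviewing one cluster of duplicated Python code.", ""]
    parts.append("Canonical snippet:")
    parts.append(canonical.rstrip())
    parts.append("")
    for location, source in members:
        parts.append(f"Duplicate at {location}:")
        parts.append(source.rstrip())
        parts.append("")
    parts.append("Instructions:")
    parts.extend(f"- {rule}" for rule in RULES)
    return "\n".join(parts)
```

Because the prompt contains only one cluster, its size depends on the size of the duplicated snippets rather than on the size of the repository.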

What the program does

Scans Python repositories while respecting ignore patterns.
Extracts functions and methods for structural analysis.
Detects exact duplicates through normalized AST-based hashing.
Detects near duplicates through embedding-based similarity search.
Produces a machine-readable plan and a human-readable report.
Generates focused prompt files for LLM-assisted refactor proposals.
Supports an approval-first patch workflow rather than automatic code rewriting.
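As an illustration of the exact-duplicate step, the sketch below hashes a normalized AST dump so that formatting, comments, docstrings, and the function's own name do not affect the result. The normalization rules shown here are assumptions; the project's actual rules are not documented in this README.

```python
import ast
import hashlib


class _Normalizer(ast.NodeTransformer):
    """Remove details that should not affect exact-duplicate matching."""

    def _normalize_def(self, node):
        self.generic_visit(node)
        # Replace the function name with a placeholder (assumed rule).
        node.name = "_fn"
        body = node.body
        # Drop a leading docstring when other statements remain.
        if (len(body) > 1 and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            node.body = body[1:]
        return node

    visit_FunctionDef = _normalize_def
    visit_AsyncFunctionDef = _normalize_def


def structural_hash(func_source: str) -> str:
    """Return a SHA-256 digest of the normalized AST of one function."""
    tree = _Normalizer().visit(ast.parse(func_source))
    # include_attributes=False leaves line and column numbers out of the dump.
    dumped = ast.dump(tree, include_attributes=False)
    return hashlib.sha256(dumped.encode("utf-8")).hexdigest()
```

Two snippets with the same digest are structurally identical under these rules. Parameter and local variable names are still compared, so renamed variables fall through to near-duplicate detection instead.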

Engineering challenges addressed

A significant part of this project involved improving result quality rather than simply making the scanner run. During development, exact duplicate detection initially failed to match functions that were identical apart from their names, because the normalized AST output still preserved each function name. Near-duplicate clustering also had to be reworked so that the same snippets were not repeatedly grouped into overlapping clusters.

Those issues were addressed by refining normalization rules, improving clustering behavior, and tuning thresholds so the output better distinguished exact duplicates, true near duplicates, and helper-based implementations that should not all collapse into the same group.
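One common way to guarantee non-overlapping clusters is to merge similar pairs with a union-find structure, so every snippet ends up in exactly one group. The sketch below shows that technique under assumed inputs (integer snippet indices and precomputed similarity scores); it is not necessarily the approach this project uses.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_pairs(
    n: int,
    scored_pairs: Iterable[Tuple[int, int, float]],
    threshold: float,
) -> List[List[int]]:
    """Group snippet indices 0..n-1 into disjoint clusters.

    Pairs scoring at or above `threshold` are merged; singletons are dropped.
    """
    parent = list(range(n))

    def find(x: int) -> int:
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, score in scored_pairs:
        if score >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the smaller index as the root for stable output.
                parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[int]] = {}
    for idx in range(n):
        groups.setdefault(find(idx), []).append(idx)
    return [members for members in groups.values() if len(members) > 1]
```

Merging is transitive, so a chain of pairwise-similar snippets collapses into one cluster. That is one reason threshold choice matters: too low a value can pull helper-based implementations into a group where they do not belong.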

Results and workflow value

The final workflow can distinguish between exact duplicate functions that are candidates for direct consolidation and near-duplicate logic that may be better handled through helper extraction or a shared abstraction. Instead of stopping at detection, the tool produces structured findings that support a practical human-in-the-loop refactor process.

That is where the LLM workflow becomes meaningful. Rather than giving a model an entire repository and asking for broad refactors, the tool hands it a small, relevant, bounded task. This improves clarity, reduces noise, and makes the generated output easier to validate before any patch is applied.

Skills demonstrated

Python package design with a src layout
CLI development
Abstract syntax tree analysis
Static code analysis workflows
Exact and near-duplicate detection strategy
Embedding-based similarity search with FAISS
Prompt construction for constrained LLM tasks
Patch-oriented refactor workflows
Debugging, validation, and iterative tool improvement
Developer tooling and code-maintenance automation

Example outputs

The project produces structured outputs that support both inspection and downstream refactor workflows:

plan.json for machine-readable duplicate metadata
report.md for readable duplicate summaries
prompt files for cluster-by-cluster LLM review
patch files for approved code changes
