A Python-based code analysis project that detects exact and near-duplicate logic in a repository and prepares structured, reviewable context for LLM-assisted refactoring.
This project was built to explore a practical software maintenance problem: how to find repeated logic before it turns into long-term engineering cost. The goal was not only to detect copy-pasted code, but to design a workflow that combines static analysis with LLM-assisted refactoring in a more controlled and reviewable way.
Duplicate code is easy to introduce and difficult to manage at scale. It increases maintenance effort, raises the risk of inconsistent bug fixes, and makes future refactoring slower and more error-prone. This project focuses on surfacing duplicated logic early, organizing it into meaningful groups, and turning those findings into structured refactor tasks instead of leaving them as raw scanner output.
The most important idea behind this project is that a language model is most useful when the problem has already been narrowed and structured. Rather than asking an LLM to reason over an entire repository, this tool first performs the analysis work: it scans the codebase, extracts functions and methods, detects exact and near-duplicate logic, and packages only the relevant context for the next step.
That design turns the LLM from a general code generator into a constrained refactor assistant. Each prompt is built around a canonical snippet, the duplicate members in the cluster, and explicit instructions to preserve behavior, keep signatures stable when possible, prefer shared helpers, and return a unified diff. This makes the workflow safer, more focused, and more realistic for smaller models with limited context windows.
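As a rough illustration of the prompt structure described above, the sketch below assembles one prompt from a canonical snippet and its duplicates. The `Snippet` class, the `build_prompt` function, and the exact instruction wording are hypothetical names chosen for this example, not the project's actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snippet:
    """Hypothetical container for one extracted function or method."""
    path: str
    name: str
    source: str


# Instruction wording is illustrative; it mirrors the constraints listed above.
INSTRUCTIONS = (
    "Preserve existing behavior.",
    "Keep function signatures stable when possible.",
    "Prefer a shared helper over repeated logic.",
    "Return the proposed change as a unified diff.",
)


def _format_snippet(label: str, snippet: Snippet) -> str:
    # One labeled section per snippet so the model can refer to each by name.
    return f"### {label}: {snippet.path}::{snippet.name}\n{snippet.source.rstrip()}\n"


def build_prompt(canonical: Snippet, duplicates: List[Snippet]) -> str:
    """Build a bounded prompt for a single duplicate cluster."""
    lines = ["Refactor the duplicated logic in this cluster.", ""]
    lines.append(_format_snippet("Canonical", canonical))
    for index, duplicate in enumerate(duplicates, start=1):
        lines.append(_format_snippet(f"Duplicate {index}", duplicate))
    lines.append("Instructions:")
    lines.extend(f"- {rule}" for rule in INSTRUCTIONS)
    return "\n".join(lines)
```

Because each prompt carries only one cluster, its size grows with the cluster rather than with the repository, which is what keeps it usable with small context windows.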
- Scans Python repositories while respecting ignore patterns.
- Extracts functions and methods for structural analysis.
- Detects exact duplicates through normalized AST-based hashing.
- Detects near duplicates through embedding-based similarity search.
- Produces a machine-readable plan and a human-readable report.
- Generates focused prompt files for LLM-assisted refactor proposals.
- Supports an approval-first patch workflow rather than automatic code rewriting.
A significant part of this project involved improving result quality rather than simply making the scanner run. During development, exact duplicate detection initially failed to classify otherwise identical functions as exact duplicates, because function names were still preserved in the normalized AST output. Near-duplicate clustering also had to be improved so that the same snippets were not repeatedly grouped into overlapping clusters.
Those issues were addressed by refining normalization rules, improving clustering behavior, and tuning thresholds so the output better distinguished exact duplicates, true near duplicates, and helper-based implementations that should not all collapse into the same group.
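One way to keep near-duplicate clusters from overlapping is to merge similar pairs with a union-find structure, so every snippet ends up in at most one cluster. The sketch below is an assumption about how such clustering could work, not the project's code: it uses brute-force cosine similarity in place of a FAISS index, and the `threshold` default of 0.9 is a placeholder rather than the project's tuned value.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_near_duplicates(embeddings, threshold=0.9):
    """Return disjoint clusters of indices whose embeddings are similar."""
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force pairwise comparison; an index such as FAISS would replace this.
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[max(root_i, root_j)] = min(root_i, root_j)

    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    # Singletons are not duplicates, so only multi-member clusters are kept.
    return [members for members in clusters.values() if len(members) > 1]
```

Merging by pairs can chain loosely related snippets into one cluster, which is one reason threshold tuning matters with this approach.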
The final workflow can distinguish between exact duplicate functions that are candidates for direct consolidation and near-duplicate logic that may be better handled through helper extraction or a shared abstraction. Instead of stopping at detection, the tool produces structured findings that support a practical human-in-the-loop refactor process.
That is where the LLM workflow becomes meaningful. Rather than giving a model an entire repository and asking for broad refactors, the tool hands it a small, relevant, bounded task. This improves clarity, reduces noise, and makes the generated output easier to validate before any patch is applied.
- Python package design with a `src` layout
- CLI development
- Abstract syntax tree analysis
- Static code analysis workflows
- Exact and near-duplicate detection strategy
- Embedding-based similarity search with FAISS
- Prompt construction for constrained LLM tasks
- Patch-oriented refactor workflows
- Debugging, validation, and iterative tool improvement
- Developer tooling and code-maintenance automation
The project produces structured outputs that support both inspection and downstream refactor workflows:
- `plan.json` for machine-readable duplicate metadata
- `report.md` for readable duplicate summaries
- prompt files for cluster-by-cluster LLM review
- patch files for approved code changes