This directory contains benchmark results and analysis of popular open-source projects.
- Validate InsightCode's scoring algorithm against real-world codebases
- Demonstrate performance and accuracy on diverse project types
- Provide reference scores for comparison and calibration
- Build credibility through transparent, reproducible analysis
The most recent benchmark results are generated by running:
npm run benchmark # Full codebase analysis
npm run benchmark:production # Production code only
Latest Benchmark: July 21, 2025 (v0.7.0)
- 9 projects analyzed (Angular, Chalk, ESLint, Express, Jest, Lodash, TypeScript, UUID, Vue)
- 677,099 lines processed in 70.31 seconds
- Analysis speed: 9,630 lines/second
- Success rate: 100% (9/9 projects)
Results are automatically saved to benchmarks/ with detailed individual reports.
InsightCode applies progressive, uncapped penalties, following the Pareto Principle:
Health Score = 100 - Σ(penalties)
Final Score = Math.max(0, Health Score)
Complexity Penalties (McCabe-based):
- ≤10: No penalty (100 points) - McCabe "excellent"
- 11-20: Linear penalty (100→70 points) - 3 points per unit
- 21-50: Quadratic penalty (70→30 points) - rapidly growing maintenance burden
- 51+: Exponential penalty (30→0 points) - extreme complexity gets extreme penalties
- 100+: Additional catastrophic penalties - no artificial caps
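As a rough illustration of the tiers above, the sketch below maps a file's cyclomatic complexity to a score. Only the breakpoints (100, 70, 30) and the 3-points-per-unit linear slope come from this document. The exact quadratic and exponential curve shapes, the decay constant, and the name `complexityScore` are assumptions made for illustration, and the additional 100+ penalties are not modeled.

```javascript
// Hypothetical sketch: piecewise complexity score anchored at the documented breakpoints.
function complexityScore(complexity) {
  if (complexity <= 10) return 100; // no penalty
  if (complexity <= 20) return 100 - 3 * (complexity - 10); // linear: 3 points per unit
  if (complexity <= 50) {
    // Assumed quadratic shape: 70 at complexity 20, falling to 30 at complexity 50.
    const t = (complexity - 20) / 30;
    return 70 - 40 * t * t;
  }
  // Assumed exponential decay from 30 toward 0; the decay constant 25 is illustrative.
  return Math.max(0, 30 * Math.exp(-(complexity - 50) / 25));
}

console.log(complexityScore(8));    // no penalty tier
console.log(complexityScore(15));   // linear tier
console.log(complexityScore(35));   // quadratic tier
console.log(complexityScore(1818)); // effectively zero
```

Under these assumptions the score is continuous at each documented boundary, so a file does not lose a large number of points by crossing from one tier into the next.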
Duplication Penalties (Mode-aware):
- Legacy Mode (default): ≤15% excellent, ≤30% acceptable, ≤50% critical
- Strict Mode (--strict-duplication): ≤3% excellent, ≤8% acceptable, ≤15% critical
- Progressive penalties without caps for extreme duplication
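The two threshold sets can be expressed as a small lookup, sketched below. The threshold values come from the list above. The function name `classifyDuplication`, the mode keys, and the `beyond-critical` label for values past the last threshold are hypothetical.

```javascript
// Thresholds in percent, taken from the list above.
const DUPLICATION_THRESHOLDS = {
  legacy: { excellent: 15, acceptable: 30, critical: 50 },
  strict: { excellent: 3, acceptable: 8, critical: 15 },
};

// Hypothetical helper: classify a duplication percentage for a given mode.
function classifyDuplication(percent, mode = "legacy") {
  const thresholds = DUPLICATION_THRESHOLDS[mode];
  if (!thresholds) throw new Error(`Unknown duplication mode: ${mode}`);
  if (percent <= thresholds.excellent) return "excellent";
  if (percent <= thresholds.acceptable) return "acceptable";
  if (percent <= thresholds.critical) return "critical";
  return "beyond-critical"; // assumed label; penalties keep growing without a cap
}

console.log(classifyDuplication(6));           // legacy mode
console.log(classifyDuplication(6, "strict")); // strict mode
```

The same 6% rate lands in a different band depending on the mode, which is why reports should state which mode produced them.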
Size Penalties (Clean Code inspired):
- ≤200 LOC: No penalty - optimal file size
- 201-500 LOC: Linear penalty - gentle degradation
- 500+ LOC: Exponential penalty - massive files severely penalized
Issue Penalties (Severity-weighted):
- Critical: 20 points each
- High: 12 points each
- Medium: 6 points each
- Low: 2 points each
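The severity weights above, combined with the `Health Score = 100 - Σ(penalties)` formula and its floor at 0, can be sketched as follows. The point values and the formula come from this document. The names `issuePenalty` and `finalScore` are illustrative and not taken from the InsightCode source.

```javascript
// Points deducted per issue, by severity (from the list above).
const ISSUE_PENALTY = { critical: 20, high: 12, medium: 6, low: 2 };

// Hypothetical helper: total penalty for a set of issue counts.
function issuePenalty(counts) {
  return Object.entries(ISSUE_PENALTY).reduce(
    (total, [severity, points]) => total + points * (counts[severity] ?? 0),
    0
  );
}

// Health Score = 100 - sum of penalties, floored at 0.
function finalScore(penalties) {
  const healthScore = 100 - penalties.reduce((sum, p) => sum + p, 0);
  return Math.max(0, healthScore);
}

const penalty = issuePenalty({ critical: 1, high: 2, low: 3 });
console.log(penalty);                   // 20 + 24 + 6 = 50
console.log(finalScore([penalty, 15])); // 100 - 65 = 35
console.log(finalScore([120]));         // floored at 0
```

Because penalties are uncapped, a single file can exhaust its entire score from one dimension alone; the floor only prevents negative results.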
Weighted Aggregation (internal hypothesis requiring validation):
- 45% Complexity - Primary defect predictor
- 30% Maintainability - Development velocity impact
- 25% Duplication - Technical debt indicator
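A minimal sketch of this weighting, assuming each dimension has already been scored on a 0-100 scale. The weights come from the list above. The function name `aggregateHealthScore` and the input shape are hypothetical.

```javascript
// Weights from the list above; they sum to 1.
const WEIGHTS = { complexity: 0.45, maintainability: 0.3, duplication: 0.25 };

// Hypothetical helper: combine per-dimension scores (each 0-100) into one score.
function aggregateHealthScore(scores) {
  return (
    WEIGHTS.complexity * scores.complexity +
    WEIGHTS.maintainability * scores.maintainability +
    WEIGHTS.duplication * scores.duplication
  );
}

// 0.45 * 80 + 0.30 * 90 + 0.25 * 100 = 36 + 27 + 25 = 88 (up to floating-point rounding)
console.log(aggregateHealthScore({ complexity: 80, maintainability: 90, duplication: 100 }));
```

With complexity carrying almost half the weight, a weak complexity score cannot be fully offset by perfect scores in the other two dimensions.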
Architectural Criticality Weighting: Each file receives a "criticism score" determining its weight in final project scores:
CriticismScore = (Dependencies × 2.0) + (WeightedIssues × 0.5) + 1
Where:
- Dependencies = incomingDeps + outgoingDeps + (isInCycle ? 5 : 0)
- WeightedIssues = (critical×4) + (high×3) + (medium×2) + (low×1)
Note: Complexity is NOT included to avoid double-counting since it's already weighted at 45% in health score calculation.
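The formula above translates directly into code. The sketch below uses only the terms defined in this section; the function name `criticismScore` and the shape of the input object are assumptions for illustration.

```javascript
// Hypothetical helper implementing the CriticismScore formula above.
function criticismScore({ incomingDeps, outgoingDeps, isInCycle, issues }) {
  const dependencies = incomingDeps + outgoingDeps + (isInCycle ? 5 : 0);
  const weightedIssues =
    issues.critical * 4 + issues.high * 3 + issues.medium * 2 + issues.low * 1;
  return dependencies * 2.0 + weightedIssues * 0.5 + 1;
}

// Dependencies = 3 + 2 + 5 = 10; WeightedIssues = 4 + 0 + 4 + 1 = 9
// CriticismScore = 10 * 2.0 + 9 * 0.5 + 1 = 25.5
console.log(
  criticismScore({
    incomingDeps: 3,
    outgoingDeps: 2,
    isInCycle: true,
    issues: { critical: 1, high: 0, medium: 2, low: 1 },
  })
);
```

The constant `+ 1` ensures that a file with no dependencies and no issues still has a non-zero weight.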
Grade Thresholds (Academic standard):
- A: 90-100 points
- B: 80-89 points
- C: 70-79 points
- D: 60-69 points
- F: 0-59 points
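The thresholds above map to a simple lookup, sketched here. The name `gradeFor` is hypothetical, and treating fractional scores such as 89.5 as falling in the lower band is an assumption, since this document only lists integer ranges.

```javascript
// Hypothetical helper: map a 0-100 score to a letter grade using the thresholds above.
function gradeFor(score) {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}

console.log(gradeFor(97)); // "A"
console.log(gradeFor(86)); // "B"
console.log(gradeFor(7));  // "F"
```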
- Method: Extended Cyclomatic Complexity (McCabe, 1976)
- Base: Every file starts at complexity 1
- +1 for each decision point:
  - `if`, `else if` (but NOT `else` alone)
  - `for`, `while`, `do-while`, `for-in`, `for-of`
  - `case` in switch (but NOT `default`)
  - `catch` in try-catch
  - `&&`, `||` (logical operators as implicit branching)
  - `? :` (ternary operator)
- Validation: 100% accurate with comprehensive test suite
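To make the counting rules concrete, here is a rough text-based approximation. It is not InsightCode's implementation: an accurate counter has to walk the AST, whereas this sketch uses regular expressions and will miscount keywords or operators that appear inside strings, comments, or regex literals. The function name `estimateCyclomaticComplexity` is hypothetical.

```javascript
// Rough approximation of the counting rules above, assuming plain JavaScript input.
function estimateCyclomaticComplexity(source) {
  // `else if` is covered by `if`; `do-while`, `for-in`, and `for-of` by `while` and `for`.
  // `else` and `default` are deliberately not counted.
  const keywords = source.match(/\b(?:if|for|while|case|catch)\b/g) ?? [];
  const logicalOperators = source.match(/&&|\|\|/g) ?? [];
  // A lone `?` is treated as a ternary; `?.` and `??` are excluded.
  const ternaries = source.match(/(?<!\?)\?(?![.?])/g) ?? [];
  return 1 + keywords.length + logicalOperators.length + ternaries.length;
}

const sample = `
function pick(a, b) {
  if (a && b) {
    return 1;
  } else if (a) {
    return 2;
  } else {
    return b ? 3 : 4;
  }
}`;

// Base 1 + two `if` + one `&&` + one ternary = 5
console.log(estimateCyclomaticComplexity(sample));
```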
- Method: 8-line sliding window with content-based analysis
- Philosophy: Focuses on actionable copy-paste, not structural similarity
- Accuracy: ~85%, deliberately conservative (prefers false negatives over false positives)
- Block size: 8 lines minimum for significance
- Token threshold: 8+ tokens to filter trivial matches
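The sliding-window idea can be sketched as below. This is a simplified illustration of the technique described in the list, not InsightCode's detector: the whitespace-only normalization, the `\W+` tokenization, and the name `findDuplicateWindows` are all assumptions. Only the 8-line window and the 8-token minimum come from this document.

```javascript
// Simplified sketch: report 8-line windows whose trimmed content appears more than once.
// `files` is assumed to be an object mapping file paths to source text.
function findDuplicateWindows(files, windowSize = 8, minTokens = 8) {
  const firstSeen = new Map(); // window text -> first location
  const duplicates = [];
  for (const [path, source] of Object.entries(files)) {
    const lines = source.split("\n").map((line) => line.trim());
    for (let i = 0; i + windowSize <= lines.length; i++) {
      const text = lines.slice(i, i + windowSize).join("\n");
      // Skip trivial windows (for example, runs of braces or blank lines).
      const tokenCount = text.split(/\W+/).filter(Boolean).length;
      if (tokenCount < minTokens) continue;
      const first = firstSeen.get(text);
      if (first) {
        duplicates.push({ first, second: { path, line: i + 1 } });
      } else {
        firstSeen.set(text, { path, line: i + 1 });
      }
    }
  }
  return duplicates;
}

// Eight distinct lines copied verbatim into a second file.
const block = Array.from({ length: 8 }, (_, n) => `const value${n} = compute(${n});`).join("\n");
const report = findDuplicateWindows({
  "a.js": block,
  "b.js": `${block}\nconsole.log(value0);`,
});
console.log(report);
```

Because windows are compared by content rather than token structure, two blocks that share a shape but differ in identifiers or literals do not match, which is the conservative behavior described above.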
| Grade | Projects | Percentage |
|---|---|---|
| A | 3 projects | 33% |
| B | 4 projects | 44% |
| C | 2 projects | 22% |
| D | 0 projects | 0% |
| F | 0 projects | 0% |
- UUID (Grade A, 97/100)
  - Excellent: Low complexity, minimal duplication
  - 978 LOC across 29 files
  - Average complexity: 4.6
- Express (Grade A, 94/100)
  - Excellent: Well-structured web framework
  - 1,135 LOC across 7 files
  - Average complexity: 7.4
- Chalk (Grade A, 93/100)
  - Excellent: Clean terminal styling library
  - 475 LOC across 5 files
  - Average complexity: 8.0

- Analysis speed: 9,630 lines/second
- Average code quality: 86/100
- Average duplication rate: 1.7%
- Most critical issues: Deep nesting (2,829 issues found)
- Largest project: TypeScript (316,214 LOC, 697 files)
Small Projects (lodash, chalk, uuid):
- Average score: 91/100
- Average complexity: 37.0
- Observation: Small projects achieving excellent quality
Medium Projects (express, vue, jest):
- Average score: 84/100
- Average complexity: 18.3
- Observation: Best balance of features vs maintainability
Large Projects (angular, eslint, typescript):
- Average score: 82/100
- Average complexity: 43.4
- Observation: Large projects maintaining high quality standards
Current Analysis (July 2025):
- File Complexity: 1,818 (extreme)
- Individual Health Score: 0/100 (F grade)
- Project Complexity Score: 7/100 (weighted by architectural criticality)
- Critical Issues: 25
Historical Context: Lodash's monolithic design was deliberate:
- Pre-ES6 Era Constraints:
  - Single file compatibility (browsers, Node.js, Rhino)
  - No module bundlers or tree-shaking
  - CDN distribution priority
- User Experience Trade-offs:
  - ✅ Zero configuration for developers
  - ✅ Universal compatibility maintained
  - ❌ Extreme complexity penalty (1,818)
  - ❌ Maintenance burden
- Modern Context:

// Legacy usage (still dominant)
<script src="lodash.js"></script>
// Modern modular usage available
import map from 'lodash/map'
import filter from 'lodash/filter'

Key Insight: F grades can reflect legitimate architectural trade-offs, not just poor engineering. Consider business constraints and historical context when interpreting scores.
InsightCode analyzes itself for:
- ✅ Feature validation and testing
- ✅ Documentation examples
- ✅ Output format demonstration
- ❌ NOT for quality judgment
Analysis tools inherently have high complexity due to AST manipulation and parsing logic. Self-analysis scores demonstrate capabilities, not code quality assessment.
| Aspect | SonarQube | InsightCode |
|---|---|---|
| Method | Token/Structure-based | Content-based |
| Philosophy | "Repetitive structure" | "Actionable copy-paste" |
| Lodash Result | 70.4% duplication | ~6% duplication |
| Focus | Structural patterns | Actual code copying |
Example: Benchmark suites with repetitive structure:
// SonarQube flags this as duplication
suites.push(Benchmark.Suite('`_.map`'))
suites.push(Benchmark.Suite('`_.filter`'))
// InsightCode: Different methods = different code
// Structure similarity ≠ actionable duplication

We prioritize actionable insights over structural pattern detection.
- `benchmark-report-prod-2025-07-21.md` - Latest v0.7.0 production analysis
- `benchmark-summary-prod-2025-07-21.json` - Machine-readable summary
- `benchmark-report-full-2025-07-21.md` - Complete codebase analysis (including tests)
- McCabe (1976): Complexity thresholds ≤10 excellent, >50 critical
- Martin Clean Code (2008): File size recommendations ≤200 LOC optimal
- NASA/SEL Standards: Current NPR 7150.2D requires ≤15 complexity for critical software
- ISO/IEC 25010: Maintainability quality model compliance
- SonarQube Quality Gates: Strict duplication mode aligns with 3% threshold
- Google Code Practices: Duplication tolerance ~2-3% maintained
- NIST Guidelines: >20 complexity = high defect probability
- 677K+ lines analyzed across diverse project types
- 100% success rate on complex, real-world codebases
- Consistent results across multiple benchmark runs
- Performance proven at 9K+ lines/second analysis speed with v0.7.0 enhancements
For complete methodology details, see `docs/FILE_HEALTH_SCORE_METHODOLOGY.md` and `docs/SCORING_ARCHITECTURE.md`.