InsightCode Benchmarks Documentation

This directory contains benchmark results and analysis of popular open-source projects.

🎯 Purpose

  • Validate InsightCode's scoring algorithm against real-world codebases
  • Demonstrate performance and accuracy on diverse project types
  • Provide reference scores for comparison and calibration
  • Build credibility through transparent, reproducible analysis

📊 Latest Results

The most recent benchmark results are generated by running:

npm run benchmark              # Full codebase analysis
npm run benchmark:production    # Production code only

Latest Benchmark: July 21, 2025 (v0.7.0)

  • 9 projects analyzed (Angular, Chalk, ESLint, Express, Jest, Lodash, TypeScript, UUID, Vue)
  • 677,099 lines processed in 70.31 seconds
  • Analysis speed: 9,630 lines/second
  • Success rate: 100% (9/9 projects)

Results are automatically saved to benchmarks/ with detailed individual reports.

📐 Methodology (v0.7.0+)

Health Score System

InsightCode applies progressive, uncapped penalties, following the Pareto Principle. Individual penalties have no upper limit; only the final score is clamped at 0:

Core Formula

Health Score = 100 - Σ(penalties)
Final Score = Math.max(0, Health Score)

Progressive Penalty System

Complexity Penalties (McCabe-based):

  • ≤10: No penalty (100 points) - McCabe "excellent"
  • 11-20: Linear penalty (100→70 points) - 3 points per unit
  • 21-50: Quadratic penalty (70→30 points) - maintenance burden grows faster than complexity
  • 51+: Exponential penalty (30→0 points) - extreme complexity gets extreme penalties
  • 100+: Additional catastrophic penalties - no artificial caps
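
The bands above can be sketched as a piecewise function. This is a minimal illustration, not InsightCode's actual implementation: the function name `complexityScore` is hypothetical, only the linear slope (3 points per unit) comes from the text, and the quadratic curve and the exponential decay constant are assumptions chosen so the band endpoints (100, 70, 30) line up. The additional penalties for 100+ are not modeled.

```typescript
// Hypothetical helper: maps a file's cyclomatic complexity to a 0-100 score.
function complexityScore(complexity: number): number {
  if (complexity <= 10) {
    return 100; // no penalty
  }
  if (complexity <= 20) {
    return 100 - 3 * (complexity - 10); // linear: 3 points per unit, 100 -> 70
  }
  if (complexity <= 50) {
    const t = (complexity - 20) / 30; // 0 at complexity 20, 1 at complexity 50
    return 70 - 40 * t * t; // ASSUMED quadratic curve, 70 -> 30
  }
  // ASSUMED exponential decay from 30 toward 0; the constant 50 is illustrative.
  return 30 * Math.exp(-(complexity - 50) / 50);
}
```

Under these assumptions a complexity of 15 scores 85 and a complexity of 35 scores 60; anything above 50 falls below 30 and keeps shrinking toward 0.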

Duplication Penalties (Mode-aware):

  • Legacy Mode (default): ≤15% excellent, ≤30% acceptable, ≤50% critical
  • Strict Mode (--strict-duplication): ≤3% excellent, ≤8% acceptable, ≤15% critical
  • Progressive penalties without caps for extreme duplication
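
As a rough sketch of how the two modes differ, the thresholds above can be put in a lookup table. The names `classifyDuplication` and `DUPLICATION_THRESHOLDS`, and the `"beyond-critical"` label for values past the last threshold, are hypothetical and only serve the illustration.

```typescript
type DuplicationMode = "legacy" | "strict";
type DuplicationBand = "excellent" | "acceptable" | "critical" | "beyond-critical";

// Upper bounds (in percent) for the excellent / acceptable / critical bands.
const DUPLICATION_THRESHOLDS: Record<DuplicationMode, [number, number, number]> = {
  legacy: [15, 30, 50],
  strict: [3, 8, 15],
};

// Hypothetical helper: names the band a duplication percentage falls into.
function classifyDuplication(
  percent: number,
  mode: DuplicationMode = "legacy"
): DuplicationBand {
  const [excellent, acceptable, critical] = DUPLICATION_THRESHOLDS[mode];
  if (percent <= excellent) return "excellent";
  if (percent <= acceptable) return "acceptable";
  if (percent <= critical) return "critical";
  return "beyond-critical"; // ASSUMED label; the text only says penalties are uncapped here
}
```

The same 6% duplication rate is "excellent" in legacy mode and "acceptable" in strict mode, which is why the mode should be stated whenever scores are compared.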

Size Penalties (Clean Code inspired):

  • ≤200 LOC: No penalty - optimal file size
  • 201-500 LOC: Linear penalty - gentle degradation
  • 500+ LOC: Exponential penalty - massive files severely penalized

Issue Penalties (Severity-weighted):

  • Critical: 20 points each
  • High: 12 points each
  • Medium: 6 points each
  • Low: 2 points each
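
The severity weights combine with the core formula as follows. This is a minimal sketch: `issuePenalty` and `healthScore` are hypothetical names, and the other penalties (complexity, duplication, size) are passed in as plain numbers rather than computed.

```typescript
interface IssueCounts {
  critical: number;
  high: number;
  medium: number;
  low: number;
}

// Points deducted per issue, as listed above.
const ISSUE_PENALTY: IssueCounts = { critical: 20, high: 12, medium: 6, low: 2 };

// Hypothetical helper: total issue penalty for one file.
function issuePenalty(issues: IssueCounts): number {
  return (
    issues.critical * ISSUE_PENALTY.critical +
    issues.high * ISSUE_PENALTY.high +
    issues.medium * ISSUE_PENALTY.medium +
    issues.low * ISSUE_PENALTY.low
  );
}

// Core formula: Health Score = 100 - Σ(penalties), floored at 0.
function healthScore(penalties: number[]): number {
  const total = penalties.reduce((sum, p) => sum + p, 0);
  return Math.max(0, 100 - total);
}
```

A file with 1 critical, 2 high, and 3 low issues loses 20 + 24 + 6 = 50 points from issues alone; if its other penalties exceed 50 points, the final score is clamped to 0 rather than going negative.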

Project-Level Scoring

Weighted Aggregation (internal hypothesis requiring validation):

  • 45% Complexity - Primary defect predictor
  • 30% Maintainability - Development velocity impact
  • 25% Duplication - Technical debt indicator
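
A minimal sketch of the weighted aggregation, assuming each dimension has already been reduced to a 0-100 score. The names `projectScore` and `DimensionScores` are hypothetical.

```typescript
interface DimensionScores {
  complexity: number; // 0-100
  maintainability: number; // 0-100
  duplication: number; // 0-100
}

// Weights from the list above; they sum to 1.
const WEIGHTS: DimensionScores = {
  complexity: 0.45,
  maintainability: 0.3,
  duplication: 0.25,
};

// Hypothetical helper: weighted sum of the three dimension scores.
function projectScore(scores: DimensionScores): number {
  return (
    scores.complexity * WEIGHTS.complexity +
    scores.maintainability * WEIGHTS.maintainability +
    scores.duplication * WEIGHTS.duplication
  );
}
```

For example, scores of 80 (complexity), 90 (maintainability), and 100 (duplication) give 36 + 27 + 25 = 88.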

Architectural Criticality Weighting: Each file receives a "criticism score" determining its weight in final project scores:

CriticismScore = (Dependencies × 2.0) + (WeightedIssues × 0.5) + 1

Where:

  • Dependencies = incomingDeps + outgoingDeps + (isInCycle ? 5 : 0)
  • WeightedIssues = (critical×4) + (high×3) + (medium×2) + (low×1)

Note: Complexity is NOT included to avoid double-counting since it's already weighted at 45% in health score calculation.
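
The formula translates directly into code. In the sketch below, `criticismScore` follows the definitions above; `weightedProjectScore` assumes the score is used as the weight in a weighted mean of per-file scores, which is one plausible reading of "determining its weight" and not confirmed by this document. All type and function names are hypothetical.

```typescript
interface FileMetrics {
  incomingDeps: number;
  outgoingDeps: number;
  isInCycle: boolean;
  issues: { critical: number; high: number; medium: number; low: number };
  healthScore: number; // 0-100 per-file score
}

// CriticismScore = (Dependencies × 2.0) + (WeightedIssues × 0.5) + 1
function criticismScore(file: FileMetrics): number {
  const dependencies =
    file.incomingDeps + file.outgoingDeps + (file.isInCycle ? 5 : 0);
  const weightedIssues =
    file.issues.critical * 4 +
    file.issues.high * 3 +
    file.issues.medium * 2 +
    file.issues.low * 1;
  return dependencies * 2.0 + weightedIssues * 0.5 + 1;
}

// ASSUMPTION: project score is the criticism-weighted mean of file scores.
function weightedProjectScore(files: FileMetrics[]): number {
  let weightSum = 0;
  let weightedTotal = 0;
  for (const file of files) {
    const weight = criticismScore(file);
    weightSum += weight;
    weightedTotal += weight * file.healthScore;
  }
  return weightSum === 0 ? 0 : weightedTotal / weightSum;
}
```

The trailing `+ 1` guarantees every file a non-zero weight, so an isolated file with no issues still counts. A file with 3 incoming and 2 outgoing dependencies that sits in a cycle, with 1 critical, 2 medium, and 1 low issue, gets (10 × 2.0) + (9 × 0.5) + 1 = 25.5.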

Grade Thresholds (Academic standard):

  • A: 90-100 points
  • B: 80-89 points
  • C: 70-79 points
  • D: 60-69 points
  • F: 0-59 points
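
The grade bands map to a simple threshold lookup. The function name `gradeFor` is hypothetical, and treating fractional scores by their lower bound (for example 89.5 is a B) is an assumption, since the table only lists whole-number ranges.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

// Hypothetical helper: converts a 0-100 score to a letter grade.
function gradeFor(score: number): Grade {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}
```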

Measurement Accuracy

Complexity Calculation

  • Method: Extended Cyclomatic Complexity (McCabe, 1976)
  • Base: Every file starts at complexity 1
  • +1 for each decision point:
    • if, else if (but NOT else alone)
    • for, while, do-while, for-in, for-of
    • case in switch (but NOT default)
    • catch in try-catch
    • &&, || (logical operators as implicit branching)
    • ? : (ternary operator)
  • Validation: Verified against a comprehensive test suite, with 100% of cases matching the expected complexity
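
To make the counting rules concrete, here is a deliberately naive token-based approximation. It is not how InsightCode computes complexity: a real implementation would walk the AST, whereas this sketch matches keywords and operators with regular expressions and will miscount tokens inside strings, comments, or method names such as `.catch()`. The function name `approximateComplexity` is hypothetical.

```typescript
// Naive approximation of the decision points listed above.
function approximateComplexity(source: string): number {
  const decisionPoints: RegExp[] = [
    /\bif\b/g, // `if` and `else if`; a bare `else` contains no `if`
    /\bfor\b/g, // for, for-in, for-of
    /\bwhile\b/g, // while and do-while
    /\bcase\b/g, // `default` is intentionally not matched
    /\bcatch\b/g,
    /&&/g,
    /\|\|/g,
    /(^|[^?])\?(?![?.:])/g, // ternary `?`, skipping `??`, `?.` and `?:`
  ];
  let complexity = 1; // base complexity
  for (let i = 0; i < decisionPoints.length; i++) {
    const matches = source.match(decisionPoints[i]) || [];
    complexity += matches.length;
  }
  return complexity;
}

const sample = [
  "function pick(a, b) {",
  "  if (a && b) { return 1; }",
  "  else if (a || b) { return 2; }",
  "  else { return b ? 3 : 4; }",
  "}",
].join("\n");

console.log(approximateComplexity(sample)); // 6
```

The sample scores 6: base 1, plus two `if` keywords, one `&&`, one `||`, and one ternary. The final bare `else` adds nothing.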

Duplication Detection

  • Method: 8-line sliding window with content-based analysis
  • Philosophy: Focuses on actionable copy-paste, not structural similarity
  • Accuracy: ~85% conservative (prefers false negatives over false positives)
  • Block size: 8 lines minimum for significance
  • Token threshold: 8+ tokens to filter trivial matches
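
A sliding-window detector along these lines can be sketched as below. This is an illustration under assumptions, not InsightCode's code: it compares whitespace-trimmed window text verbatim, counts tokens by splitting on non-word characters, and reports every overlapping window separately instead of merging them into larger blocks. The name `findDuplicateBlocks` is hypothetical.

```typescript
const WINDOW_LINES = 8; // block size from the list above
const MIN_TOKENS = 8; // token threshold from the list above

// Returns window text -> list of "file:line" locations, for windows seen 2+ times.
function findDuplicateBlocks(
  files: Record<string, string>
): Record<string, string[]> {
  const seen: Record<string, string[]> = {};
  for (const name of Object.keys(files)) {
    const lines = files[name].split("\n").map((line) => line.trim());
    for (let i = 0; i + WINDOW_LINES <= lines.length; i++) {
      const block = lines.slice(i, i + WINDOW_LINES).join("\n");
      const tokens = block.split(/\W+/).filter((t) => t.length > 0);
      if (tokens.length < MIN_TOKENS) continue; // skip trivial windows
      if (!seen[block]) seen[block] = [];
      seen[block].push(name + ":" + (i + 1)); // 1-based start line
    }
  }
  const duplicates: Record<string, string[]> = {};
  for (const block of Object.keys(seen)) {
    if (seen[block].length > 1) duplicates[block] = seen[block];
  }
  return duplicates;
}
```

Because the comparison is on content, two blocks that share a structure but differ in identifiers or literals do not match, which is consistent with the conservative bias toward false negatives described above.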

📈 Latest Benchmark Results (July 21, 2025)

Production Code Analysis - Grade Distribution

| Grade | Projects | Percentage |
|-------|----------|------------|
| A     | 3        | 33%        |
| B     | 4        | 44%        |
| C     | 2        | 22%        |
| D     | 0        | 0%         |
| F     | 0        | 0%         |

Top Performers

  1. UUID (Grade A, 97/100)

    • Excellent: Low complexity, minimal duplication
    • 978 LOC across 29 files
    • Average complexity: 4.6
  2. Express (Grade A, 94/100)

    • Excellent: Well-structured web framework
    • 1,135 LOC across 7 files
    • Average complexity: 7.4
  3. Chalk (Grade A, 93/100)

    • Excellent: Clean terminal styling library
    • 475 LOC across 5 files
    • Average complexity: 8.0

Performance Statistics

  • Analysis speed: 9,630 lines/second
  • Average code quality: 86/100
  • Average duplication rate: 1.7%
  • Most critical issues: Deep nesting (2,829 issues found)
  • Largest project: TypeScript (316,214 LOC, 697 files)

Project Size Analysis

Small Projects (lodash, chalk, uuid):

  • Average score: 91/100
  • Average complexity: 37.0
  • Observation: Small projects achieving excellent quality

Medium Projects (express, vue, jest):

  • Average score: 84/100
  • Average complexity: 18.3
  • Observation: Best balance of features vs maintainability

Large Projects (angular, eslint, typescript):

  • Average score: 82/100
  • Average complexity: 43.4
  • Observation: Large projects maintaining high quality standards

🔍 Understanding the Results: Context Matters

Case Study: Lodash's Deliberate Architecture

Current Analysis (July 2025):

  • File Complexity: 1,818 (extreme)
  • Individual Health Score: 0/100 (F grade)
  • Project Complexity Score: 7/100 (weighted by architectural criticality)
  • Critical Issues: 25

Historical Context: Lodash's monolithic design was deliberate:

  1. Pre-ES6 Era Constraints:

    • Single file compatibility (browsers, Node.js, Rhino)
    • No module bundlers or tree-shaking
    • CDN distribution priority
  2. User Experience Trade-offs:

    • ✅ Zero configuration for developers
    • ✅ Universal compatibility maintained
    • ❌ Extreme complexity (1,818) and the penalty it incurs
    • ❌ Maintenance burden
  3. Modern Context:

    // Legacy usage (still dominant)
    <script src="lodash.js"></script>
    
    // Modern modular usage available
    import map from 'lodash/map'
    import filter from 'lodash/filter'

Key Insight: F grades can reflect legitimate architectural trade-offs, not just poor engineering. Consider business constraints and historical context when interpreting scores.

🧪 Self-Analysis Disclaimer

InsightCode analyzes itself for:

  • ✅ Feature validation and testing
  • ✅ Documentation examples
  • ✅ Output format demonstration
  • ❌ NOT for quality judgment

Analysis tools inherently have high complexity due to AST manipulation and parsing logic. Self-analysis scores demonstrate capabilities, not code quality assessment.

🔄 Duplication Detection Philosophy

InsightCode vs SonarQube: Different Approaches

| Aspect        | SonarQube              | InsightCode             |
|---------------|------------------------|-------------------------|
| Method        | Token/Structure-based  | Content-based           |
| Philosophy    | "Repetitive structure" | "Actionable copy-paste" |
| Lodash Result | 70.4% duplication      | ~6% duplication         |
| Focus         | Structural patterns    | Actual code copying     |

Example: Benchmark suites with repetitive structure:

// SonarQube flags this as duplication
suites.push(Benchmark.Suite('`_.map`'))
suites.push(Benchmark.Suite('`_.filter`'))

// InsightCode: Different methods = different code
// Structure similarity ≠ actionable duplication

We prioritize actionable insights over structural pattern detection.

📁 Benchmark History

Recent Benchmarks

  • benchmark-report-prod-2025-07-21.md - Latest v0.7.0 production analysis
  • benchmark-summary-prod-2025-07-21.json - Machine-readable summary
  • benchmark-report-full-2025-07-21.md - Complete codebase analysis (including tests)

🎯 Benchmark Validation

Academic Foundation

  • McCabe (1976): Complexity thresholds ≤10 excellent, >50 critical
  • Martin Clean Code (2008): File size recommendations ≤200 LOC optimal
  • NASA/SEL Standards: Current NPR 7150.2D requires ≤15 complexity for critical software
  • ISO/IEC 25010: Maintainability quality model compliance

Industry Alignment

  • SonarQube Quality Gates: Strict duplication mode aligns with 3% threshold
  • Google Code Practices: Duplication tolerance ~2-3% maintained
  • NIST Guidelines: >20 complexity = high defect probability

Statistical Validation

  • 677K+ lines analyzed across diverse project types
  • 100% success rate on complex, real-world codebases
  • Consistent results across multiple benchmark runs
  • Measured throughput of 9,630 lines/second (677,099 lines in 70.31 seconds) with v0.7.0 enhancements

For complete methodology details, see docs/FILE_HEALTH_SCORE_METHODOLOGY.md and docs/SCORING_ARCHITECTURE.md.