A comprehensive system for automatically classifying programming problems into difficulty categories (Easy, Medium, Hard) and predicting numerical difficulty scores on a 1-10 scale.
Watch Live Demo - Complete system demonstration showing real-time predictions and technical implementation.
AutoJudge analyzes programming problem descriptions using natural language processing and statistical learning techniques to predict their difficulty. The system is designed for competitive programming platforms, educational institutions, and coding assessment services.
## Classification Performance

- Overall Accuracy: 55.0%
- Dataset Size: 4,112 programming problems
- Training/Test Split: 3,289 / 823 samples
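
### Confusion Matrix

Rows are actual classes; columns are predicted classes.

| Actual | Easy | Medium | Hard | Total |
|---|---|---|---|---|
| Easy | 69 | 49 | 35 | 153 |
| Medium | 36 | 107 | 138 | 281 |
| Hard | 25 | 87 | 277 | 389 |
| Total | 130 | 243 | 450 | 823 |

### Per-Class Metrics
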
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Easy | 0.531 | 0.451 | 0.488 | 153 |
| Medium | 0.440 | 0.381 | 0.408 | 281 |
| Hard | 0.616 | 0.712 | 0.660 | 389 |
Weighted Average: Precision=0.540, Recall=0.550, F1-Score=0.542
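
The per-class figures can be re-derived from the confusion matrix reported above. The following is a minimal sketch in plain Python; the function names `per_class_metrics` and `accuracy` are illustrative and not part of the AutoJudge codebase.

```python
# Confusion matrix from the test set: rows are actual, columns are predicted.
LABELS = ["Easy", "Medium", "Hard"]
MATRIX = [
    [69, 49, 35],    # actual Easy
    [36, 107, 138],  # actual Medium
    [25, 87, 277],   # actual Hard
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1, support)} for a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        support = sum(matrix[i])                   # row total
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1, support)
    return metrics


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


metrics = per_class_metrics(MATRIX, LABELS)
for label, (p, r, f1, n) in metrics.items():
    print(f"{label:<6} precision={p:.3f} recall={r:.3f} f1={f1:.3f} support={n}")
print(f"accuracy={accuracy(MATRIX):.3f}")
```

The diagonal holds 453 of 823 test samples, which is where the 55.0% accuracy comes from.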

## Regression Performance

- Mean Absolute Error (MAE): 1.735 points
- Root Mean Square Error (RMSE): 2.071 points
- R² Score: 0.116
| Class | MAE | Actual Score | Predicted Score |
|---|---|---|---|
| Easy | 2.267 | 1.99 ± 0.43 | 4.25 ± 0.54 |
| Medium | 0.817 | 4.13 ± 0.75 | 4.71 ± 0.52 |
| Hard | 2.189 | 7.11 ± 1.13 | 4.92 ± 0.48 |
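
For reference, the three regression metrics can be computed as in the sketch below. The scores are hypothetical toy values chosen to mimic predictions that cluster near the middle of the scale; they are not AutoJudge data, and `regression_metrics` is an illustrative name.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of scores."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical toy scores: one easy, one medium, one hard problem.
y_true = [2.0, 4.0, 7.0]
y_pred = [4.0, 5.0, 5.0]  # predictions pulled toward the center

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```

In this toy case the middle problem has the smallest error and the extremes have the largest, the same pattern the per-class table shows for Easy and Hard.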

## Dataset Statistics

- Total Problems: 4,112
- Class Distribution:
  - Hard: 1,941 problems (47.2%)
  - Medium: 1,405 problems (34.2%)
  - Easy: 766 problems (18.6%)
- Score Range: 1.1 - 9.7
- Mean Score: 5.11 ± 2.18
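
The counts reported in this README can be cross-checked with a few lines of arithmetic. This sketch only uses the numbers stated here and does not read `problems_data.jsonl`, whose field names are not documented in this file.

```python
# Figures copied from this README.
CLASS_COUNTS = {"Hard": 1941, "Medium": 1405, "Easy": 766}
TRAIN_SIZE, TEST_SIZE = 3289, 823

total = sum(CLASS_COUNTS.values())
shares = {label: count / total for label, count in CLASS_COUNTS.items()}
test_fraction = TEST_SIZE / (TRAIN_SIZE + TEST_SIZE)

print(f"total problems: {total}")
for label, share in shares.items():
    print(f"{label:<6} {share:.1%}")
print(f"test fraction: {test_fraction:.1%}")
```

The test set is roughly one fifth of the data, that is, an approximate 80/20 split.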

## Feature Engineering

- Total Features: 3,015
- TF-IDF Features: 3,000 (selected from 4,000 using chi-square feature selection)
- Custom Features: 15 domain-specific features
  - Text Metrics: Length, word count, vocabulary richness
  - Algorithm Indicators: Graph algorithms, dynamic programming, data structures
  - Complexity Markers: Sorting/searching, string processing, mathematical content
  - Problem Characteristics: Constraints, optimization keywords, test case patterns
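
A minimal sketch of how keyword-based custom features of this kind could be computed is shown below. The keyword lists, the `vocabularyRichness` name, and the `extract_custom_features` function are assumptions made for illustration; the project's actual feature definitions may differ. The other feature names follow the `features` object returned by the API.

```python
import re

# Hypothetical keyword lists; the real lists used by AutoJudge are not documented here.
KEYWORD_GROUPS = {
    "graphAlgorithms": ("graph", "shortest path", "dijkstra", "bfs", "dfs", "tree"),
    "dynamicProgramming": ("dynamic programming", "dp", "memoization", "subsequence"),
    "dataStructures": ("stack", "queue", "heap", "hash", "linked list"),
    "sortingSearching": ("sort", "binary search", "search"),
    "stringProcessing": ("string", "substring", "palindrome"),
}


def extract_custom_features(text):
    """Return text metrics plus 0/1 indicators for each keyword group."""
    lowered = text.lower()
    words = re.findall(r"[a-z0-9']+", lowered)
    features = {
        "textLength": float(len(text)),
        "wordCount": float(len(words)),
        # Ratio of distinct words to total words (assumed definition).
        "vocabularyRichness": len(set(words)) / len(words) if words else 0.0,
    }
    for name, keywords in KEYWORD_GROUPS.items():
        hit = any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in keywords)
        features[name] = 1.0 if hit else 0.0
    return features


features = extract_custom_features("Find the shortest path in a weighted graph")
for name, value in features.items():
    print(f"{name}: {value}")
```

Whole-word matching (`\b`) avoids false hits such as "dp" inside a longer token.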

## Classification Model

- Type: Voting Classifier (Ensemble)
- Components:
  - Logistic Regression (C=2.0, balanced class weights)
  - Random Forest Classifier (400 estimators, max_depth=35)
  - Gradient Boosting Classifier (300 estimators, max_depth=12)
- Voting Strategy: Soft voting
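
An ensemble with these settings could be assembled in scikit-learn roughly as follows. This is a sketch rather than the project's training code: the estimator names, `max_iter=1000`, and `random_state` are assumptions added for illustration, and all other hyperparameters are left at scikit-learn defaults.

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression


def build_classifier(random_state=42):
    """Soft-voting ensemble using the hyperparameters listed above."""
    estimators = [
        # max_iter is an assumption to help convergence on sparse TF-IDF input.
        ("logreg", LogisticRegression(C=2.0, class_weight="balanced", max_iter=1000)),
        ("forest", RandomForestClassifier(
            n_estimators=400, max_depth=35, random_state=random_state)),
        ("boosting", GradientBoostingClassifier(
            n_estimators=300, max_depth=12, random_state=random_state)),
    ]
    # Soft voting averages the predicted class probabilities of the three models.
    return VotingClassifier(estimators=estimators, voting="soft")


classifier = build_classifier()
print([name for name, _ in classifier.estimators])
```

After `classifier.fit(X, y)`, `predict_proba` returns the averaged probabilities, which is one plausible source for a confidence value like the one in the API response.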

## Regression Model

- Type: Random Forest Regressor
- Parameters:
  - n_estimators: 350
  - max_depth: 30
  - min_samples_split: 5
  - min_samples_leaf: 2
  - max_features: sqrt
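
The regressor parameters map directly onto scikit-learn's `RandomForestRegressor`. The sketch below fits it on a tiny synthetic dataset purely to show the call pattern; the data, `random_state`, and the `build_regressor` name are assumptions and not part of the project.

```python
from sklearn.ensemble import RandomForestRegressor


def build_regressor(random_state=42):
    """Random forest regressor with the hyperparameters listed above."""
    return RandomForestRegressor(
        n_estimators=350,
        max_depth=30,
        min_samples_split=5,
        min_samples_leaf=2,
        max_features="sqrt",
        random_state=random_state,
    )


# Synthetic stand-in data: two numeric features, scores from 1.0 to 9.7.
X = [[float(i), float(i % 3)] for i in range(30)]
y = [1.0 + 0.3 * i for i in range(30)]

regressor = build_regressor().fit(X, y)
prediction = regressor.predict([[10.0, 1.0]])[0]
print(f"predicted score: {prediction:.2f}")
```

Because a forest averages training targets, its predictions always stay within the range of scores seen during training.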

## Processing Pipeline

- Preprocessing: Lowercasing, whitespace normalization, abbreviation expansion
- TF-IDF Vectorization:
  - Max features: 4,000
  - N-gram range: (1, 3)
  - Stop words: English
  - Min/Max document frequency: 2 / 0.85
- Feature Selection: Chi-square test (k=3,000)
- Feature Scaling: StandardScaler for custom features
- Feature Combination: Sparse matrix concatenation
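
The pipeline steps above can be sketched with scikit-learn and SciPy as follows. The abbreviation map and the function names are hypothetical, and the defaults mirror the values listed above; the toy call overrides `k`, `min_df`, and `max_df` only because four documents are too few for the production settings.

```python
import re

from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler

# Hypothetical abbreviation map; the project's actual expansions are not documented here.
ABBREVIATIONS = {"dp": "dynamic programming", "bfs": "breadth first search"}


def preprocess(text):
    """Lowercase, expand abbreviations, and collapse runs of whitespace."""
    text = text.lower()
    for short, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    return re.sub(r"\s+", " ", text).strip()


def build_feature_matrix(texts, custom_features, labels,
                         max_features=4000, k=3000, min_df=2, max_df=0.85):
    """TF-IDF -> chi-square selection, scaled custom features, sparse concatenation."""
    vectorizer = TfidfVectorizer(max_features=max_features, ngram_range=(1, 3),
                                 stop_words="english", min_df=min_df, max_df=max_df)
    tfidf = vectorizer.fit_transform([preprocess(t) for t in texts])
    # Never ask for more features than the vocabulary provides.
    selector = SelectKBest(chi2, k=min(k, tfidf.shape[1]))
    selected = selector.fit_transform(tfidf, labels)
    scaled = StandardScaler().fit_transform(custom_features)
    return hstack([selected, csr_matrix(scaled)]).tocsr()


texts = [
    "Sort an array of integers",
    "Sort the list of numbers in order",
    "Shortest path in a weighted graph using DP",
    "Count paths in a graph with BFS",
]
labels = ["Easy", "Easy", "Hard", "Hard"]
custom = [[25.0, 5.0], [33.0, 7.0], [42.0, 8.0], [31.0, 7.0]]  # e.g. length, word count

X = build_feature_matrix(texts, custom, labels, k=5, min_df=1, max_df=1.0)
print(X.shape)
```

With the documented settings, 3,000 selected TF-IDF columns plus 15 custom columns give the 3,015 total features.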

## API Endpoints

### POST /predict

```http
POST /predict
Content-Type: application/json

{
  "description": "Find the maximum sum of a subarray using dynamic programming"
}
```

Response:

```json
{
  "class": "medium",
  "score": 5.2,
  "confidence": 0.678,
  "reliable": true,
  "features": {
    "textLength": 65,
    "wordCount": 11,
    "dynamicProgramming": 1.0,
    "graphAlgorithms": 0.0,
    "dataStructures": 0.0,
    "sortingSearching": 0.0,
    "stringProcessing": 0.0,
    "basicMath": 0.0,
    "advancedMath": 0.0,
    "complexityNotation": 0.0
  }
}
```

### POST /predict/structured

```http
POST /predict/structured
Content-Type: application/json

{
  "description": "Find the shortest path in a weighted graph",
  "input_desc": "Graph represented as adjacency matrix with weights",
  "output_desc": "Array of distances from source to all vertices"
}
```

### GET /health

```http
GET /health
```

Returns system status and component health information.

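
A client will usually want to turn the prediction response into something readable. The helper below is an illustrative sketch, not part of the API: `summarize_prediction` is a hypothetical name, and it assumes the response has the shape shown for `POST /predict` (whether `/predict/structured` returns the same fields is not stated here).

```python
def summarize_prediction(result):
    """Format a /predict response dict as a one-line summary."""
    # Indicator features are those other than the raw text metrics.
    indicators = sorted(
        name for name, value in result.get("features", {}).items()
        if name not in ("textLength", "wordCount") and value > 0
    )
    summary = (
        f"{result['class']} ({result['score']}/10, "
        f"confidence {result['confidence']:.3f})"
    )
    if not result.get("reliable", True):
        summary += " [flagged unreliable]"
    if indicators:
        summary += " indicators: " + ", ".join(indicators)
    return summary


# Sample response copied from the POST /predict example (features abbreviated).
sample = {
    "class": "medium",
    "score": 5.2,
    "confidence": 0.678,
    "reliable": True,
    "features": {
        "textLength": 65,
        "wordCount": 11,
        "dynamicProgramming": 1.0,
        "graphAlgorithms": 0.0,
    },
}
print(summarize_prediction(sample))
```

Checking the `reliable` flag before acting on a prediction is a reasonable precaution given the reported accuracy.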
## Requirements

- Python 3.8+
- Flask 2.0+
- scikit-learn 1.0+
- pandas, numpy, scipy

## Installation

```bash
# Clone the repository
git clone https://github.com/oldhero07/Autojudge.git
cd Autojudge

# Install dependencies
pip install -r flask_app/requirements.txt

# Run the application
cd flask_app
python app.py
```

The application will start on http://localhost:5000

## Docker Deployment

```bash
# Build the image
docker build -t autojudge .

# Run the container
docker run -p 5000:5000 autojudge
```

Using Docker Compose:

```bash
# Start all services
docker-compose up -d
```

## Usage

### Web Interface

Navigate to http://localhost:5000 to access the web interface for problem submission and prediction.

### Python Client

```python
import requests

# Predict problem difficulty
response = requests.post('http://localhost:5000/predict', json={
    'description': 'Implement a binary search algorithm to find an element in a sorted array'
})
result = response.json()
print(f"Class: {result['class']}")
print(f"Score: {result['score']}/10")
print(f"Confidence: {result['confidence']:.3f}")
```

### Batch Prediction

```python
import requests

problems = [
    "Sort an array of integers",
    "Find shortest path using Dijkstra's algorithm",
    "Implement suffix array with linear time complexity"
]

# Send one request per problem and print a one-line summary for each
for problem in problems:
    response = requests.post('http://localhost:5000/predict',
                             json={'description': problem})
    result = response.json()
    print(f"{problem[:50]}... -> {result['class']} ({result['score']}/10)")
```

## Limitations

- Class Imbalance: The system is biased toward predicting "Hard" because of dataset imbalance (47.2% hard problems); on the test set it predicts Hard for 450 of 823 samples (54.7%), while 389 (47.3%) are actually Hard
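- Score Regression: The low R² score (0.116) means the regressor explains little of the score variance; predicted class means (4.25 to 4.92) cluster near the middle of the scale, while actual class means range from 1.99 to 7.11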
- Domain Specificity: Optimized for competitive programming problems
- Language Dependency: Designed for English problem descriptions
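
To make the imbalance concrete, the sketch below computes the weights that scikit-learn's `class_weight="balanced"` heuristic, `n_samples / (n_classes * class_count)`, would assign. It uses the full-dataset counts from this README as an approximation; in practice the weights would be derived from the training split, whose per-class counts are not listed here.

```python
# Full-dataset class counts from this README (the training split may differ slightly).
CLASS_COUNTS = {"Easy": 766, "Medium": 1405, "Hard": 1941}


def balanced_class_weights(counts):
    """weight = n_samples / (n_classes * class_count), the 'balanced' heuristic."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


weights = balanced_class_weights(CLASS_COUNTS)
for label, weight in weights.items():
    print(f"{label:<6} weight={weight:.3f}")

# Share of "Hard" among test-set predictions versus actual labels
# (column and row totals of the confusion matrix).
predicted_hard_share = 450 / 823
actual_hard_share = 389 / 823
print(f"predicted Hard: {predicted_hard_share:.1%}, actual Hard: {actual_hard_share:.1%}")
```

Easy problems receive the largest weight because they are the rarest class. Of the three ensemble components, only the logistic regression is listed with balanced weights.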

## Runtime Performance

- Inference Time: ~50ms per prediction
- Memory Usage: ~200MB for loaded components
- Throughput: ~20 requests/second on standard hardware
- Storage: ~15MB serialized components

## Project Structure

```
AutoJudge/
├── README.md                  # Project documentation
├── PROJECT_REPORT.md          # Comprehensive technical report
├── DEPLOYMENT.md              # Deployment guide
├── Dockerfile                 # Container configuration
├── docker-compose.yml         # Multi-service setup
├── flask_app/                 # Flask application
│   ├── app.py                 # Main application (1,393 lines)
│   ├── requirements.txt       # Python dependencies
│   ├── models/                # Trained ML models
│   ├── templates/             # HTML templates
│   └── static/                # Static assets
├── problems_data.jsonl        # Training dataset (4,112 problems)
├── test_api.py                # API validation tests
├── scripts/                   # Utility scripts
└── docs/                      # Additional documentation
    ├── PROJECT_STRUCTURE.md   # Architecture documentation
```

## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/improvement`)
3. Commit your changes (`git commit -am 'Add new feature'`)
4. Push to the branch (`git push origin feature/improvement`)
5. Create a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Citation
If you use AutoJudge in your research or applications, please cite:
```bibtex
@software{autojudge2026,
title={AutoJudge: Programming Problem Difficulty Prediction System},
author={oldhero07},
year={2026},
url={https://github.com/oldhero07/Autojudge}
}
```

## Acknowledgments

- Dataset sourced from competitive programming platforms
- Built with scikit-learn, Flask, and modern NLP techniques
- Inspired by the need for automated problem categorization in educational technology