A robust Python-based system to combat fraudulent reviews on e-commerce platforms by automatically scraping, classifying, flagging, and facilitating the removal of fake product reviews.
- Web Scraper: Extracts reviews from Amazon, Flipkart, and other platforms
- NLP Pipeline: Advanced text preprocessing with BERT embeddings
- ML Classifier: Ensemble models (Random Forest, XGBoost, SVM) for fake review detection
- Automated Flagging: Trust scoring, pattern detection, and suspicious behavior clustering
- API Integration: Automated deletion requests with approval workflow
- Admin Dashboard: Streamlit-based interface for review management
- Sentiment analysis with rating correlation
- IP/User behavior tracking and clustering
- Explainable AI with reason codes for predictions
- Real-time review validation API
- Comprehensive test coverage
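To make the flagging and explainability features above concrete, here is a minimal rule-based sketch of how reason codes could feed a trust score. The function name, rules, and thresholds are illustrative assumptions, not the project's actual implementation (which uses learned models):

```python
# Hypothetical sketch of rule-based reason codes feeding a trust score.
# Rules and thresholds are illustrative, not the project's real values.

def score_review(text: str, rating: int, sentiment: float) -> dict:
    """Return a trust score in [0, 1] and human-readable reason codes."""
    reasons = []

    # Very short reviews carry little signal and are easy to mass-produce.
    if len(text.split()) < 5:
        reasons.append("TOO_SHORT")

    # Excessive exclamation marks often correlate with promotional spam.
    if text.count("!") >= 3:
        reasons.append("EXCESSIVE_PUNCTUATION")

    # A high rating paired with negative sentiment (or vice versa) suggests
    # the text and the star rating were produced independently.
    if (rating >= 4 and sentiment < -0.3) or (rating <= 2 and sentiment > 0.3):
        reasons.append("RATING_SENTIMENT_MISMATCH")

    # Each triggered rule reduces trust; a real system would learn weights.
    trust = max(0.0, 1.0 - 0.3 * len(reasons))
    return {"trust_score": trust, "reasons": reasons}


print(score_review("Great!!! Best!!!", 5, -0.5))
```

The returned `reasons` list is what makes a prediction auditable: a moderator sees *why* a review was flagged, not just a probability.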
- Backend: FastAPI
- ML/NLP: scikit-learn, XGBoost, transformers (BERT)
- Web Scraping: Selenium, BeautifulSoup, Playwright
- Database: PostgreSQL with SQLAlchemy ORM
- Dashboard: Streamlit
- Deployment: Docker, Docker Compose
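A compose file for this stack typically wires the API, dashboard, and database together. The sketch below is illustrative only (service names, ports, and image tags are assumptions); see `docker/docker-compose.yml` for the real file:

```yaml
# Illustrative sketch only -- see docker/docker-compose.yml for the real file.
services:
  api:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - db
  dashboard:
    build: .
    command: streamlit run dashboard/app.py --server.port 8501
    ports:
      - "8501:8501"
  db:
    image: postgres:13
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    ports:
      - "5432:5432"
```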
- Python 3.9+
- PostgreSQL 13+
- Chrome/Chromium (for Selenium)
```bash
# Clone repository
git clone https://github.com/yourusername/fake-review-detector.git
cd fake-review-detector

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env with your configuration

# Initialize database
python scripts/init_db.py

# Run migrations
alembic upgrade head
```

```bash
# Start the API server
uvicorn app.main:app --reload --port 8000

# Start the Streamlit dashboard (separate terminal)
streamlit run dashboard/app.py

# Run scraper
python -m app.scraper.main --platform amazon --product-url "..."
```

```bash
# Build and run all services
docker-compose up -d

# Access services
# API: http://localhost:8000
# Dashboard: http://localhost:8501
# PostgreSQL: localhost:5432
```

The dashboard can be deployed to Streamlit Cloud with the following configuration:
- GitHub repository with your code
- Streamlit Cloud account (https://streamlit.io/cloud)
- Backend API deployed and publicly accessible
- Configure Secrets: In your Streamlit Cloud app settings, add the following secrets:

```toml
# .streamlit/secrets.toml
API_URL = "https://your-backend-api.example.com/api"
```

- Verify Requirements: Ensure `requirements.txt` includes:

```text
streamlit>=1.28.0
httpx>=0.25.0
tenacity>=8.2.0
nest-asyncio>=1.5.0
pandas>=2.1.0
plotly>=5.18.0
loguru>=0.7.0
```

- Runtime Configuration: Create `.streamlit/runtime.txt`:

```text
python-3.11.16
```

- Deploy:
  - Go to https://share.streamlit.io/
  - Click "New app"
  - Select your repository
  - Set the main file path: `dashboard/app.py`
  - Click "Deploy"
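Inside the dashboard, the `API_URL` secret can be resolved with a small helper along these lines. This is a hedged sketch: `get_api_url` is a hypothetical name, and the real `dashboard/app.py` may resolve the URL differently. In the real app the first argument would be `st.secrets`; accepting any mapping keeps the helper easy to test:

```python
import os

# Sketch: prefer Streamlit secrets, fall back to an environment variable,
# then to a local development default.
def get_api_url(secrets, default="http://localhost:8000/api"):
    if "API_URL" in secrets:
        return secrets["API_URL"].rstrip("/")
    return os.environ.get("API_URL", default).rstrip("/")


print(get_api_url({"API_URL": "https://example.com/api/"}))  # → https://example.com/api
```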
- API URL: The dashboard requires a publicly accessible backend API. Set the `API_URL` secret to your production API endpoint.
- Health Check: The dashboard performs a health check on startup. If the backend is unreachable, a banner is displayed with a retry button.
- CPU-Only Torch: The requirements are configured for CPU-only PyTorch to ensure compatibility with Streamlit Cloud's environment.
- No Browser Automation: Selenium and Playwright are disabled in the cloud requirements as they require system-level browser binaries.
- File Size Limits: CSV uploads are limited to 50 MB. Adjust `MAX_UPLOAD_SIZE_MB` in `app/utils.py` if needed.
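A size guard like the one described above can be as simple as the following sketch. The constant mirrors the documented limit, but the helper name and error message are illustrative, not the actual contents of `app/utils.py`:

```python
MAX_UPLOAD_SIZE_MB = 50  # mirrors the documented upload limit


def validate_upload_size(num_bytes: int, limit_mb: int = MAX_UPLOAD_SIZE_MB) -> None:
    """Raise ValueError if an uploaded file exceeds the configured limit."""
    limit_bytes = limit_mb * 1024 * 1024
    if num_bytes > limit_bytes:
        raise ValueError(
            f"Upload is {num_bytes / 1024 / 1024:.1f} MB; limit is {limit_mb} MB"
        )


validate_upload_size(10 * 1024 * 1024)  # 10 MB: accepted silently
```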
Backend Connection Issues:
- Verify the `API_URL` secret is set correctly
- Ensure the backend API is publicly accessible and not blocked by CORS
- Check the backend health endpoint: `https://your-api.example.com/api/admin/health`
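The startup health check with retries can be approximated by a small wrapper like this. It is a simplified stand-in for the dashboard's tenacity-based logic, with an illustrative function name; `fetch` would be, for example, an `httpx` GET against the health endpoint:

```python
import time

# Simplified retry sketch: call `fetch` up to `attempts` times before
# giving up, sleeping `delay` seconds between attempts.
def check_backend_health(fetch, attempts=3, delay=0.0):
    last_error = None
    for _ in range(attempts):
        try:
            return fetch()  # expected to raise on connection failure
        except ConnectionError as exc:
            last_error = exc
            time.sleep(delay)  # back off before retrying
    raise last_error


# Fake backend that fails twice, then recovers.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("backend unreachable")
    return {"status": "ok"}

print(check_backend_health(flaky_fetch))  # → {'status': 'ok'}
```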
Dependency Errors:
- Ensure `runtime.txt` specifies Python 3.11.16
- Verify all packages in `requirements.txt` are compatible with that Python version
- Check Streamlit Cloud build logs for specific error messages
Memory Issues:
- Disable BERT embeddings in Settings if experiencing memory errors
- Consider using the minimal requirements file: `requirements-streamlit-minimal.txt`
For detailed deployment instructions, see STREAMLIT_DEPLOYMENT.md.
```python
from app.scraper import AmazonScraper

scraper = AmazonScraper()
reviews = scraper.scrape_product("https://www.amazon.com/product/...")
```

```python
from app.classifier import FakeReviewClassifier

classifier = FakeReviewClassifier()
result = classifier.predict(review_text)
print(f"Fake probability: {result['fake_probability']}")
print(f"Reasons: {result['reasons']}")
```

```bash
# Check single review
curl -X POST "http://localhost:8000/api/reviews/check" \
  -H "Content-Type: application/json" \
  -d '{"text": "Amazing product!", "rating": 5}'

# Batch processing (multipart upload; curl sets the Content-Type header itself)
curl -X POST "http://localhost:8000/api/reviews/batch" \
  -F "file=@reviews.csv"
```

```text
fake-review-detector/
├── app/
│   ├── __init__.py
│   ├── main.py              # FastAPI application
│   ├── config.py            # Configuration management
│   ├── database.py          # Database connection
│   ├── models/              # SQLAlchemy models
│   ├── scraper/             # Web scraping modules
│   ├── preprocessing/       # Text preprocessing pipeline
│   ├── classifier/          # ML models and training
│   ├── flagging/            # Flagging and alert system
│   ├── api_integration/     # Platform API integration
│   └── routers/             # API endpoints
├── dashboard/
│   └── app.py               # Streamlit dashboard
├── tests/
│   ├── test_scraper.py
│   ├── test_classifier.py
│   └── test_api.py
├── scripts/
│   ├── init_db.py           # Database initialization
│   └── train_model.py       # Model training script
├── data/
│   ├── raw/                 # Scraped reviews
│   ├── processed/           # Cleaned data
│   └── models/              # Trained model artifacts
├── docker/
│   ├── Dockerfile
│   └── docker-compose.yml
├── requirements.txt
├── .env.example
├── .gitignore
└── README.md
```
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Random Forest | 94.2% | 93.8% | 94.5% | 94.1% |
| XGBoost | 95.1% | 94.9% | 95.3% | 95.1% |
| SVM | 92.7% | 92.1% | 93.2% | 92.6% |
| Ensemble | 96.3% | 96.1% | 96.5% | 96.3% |
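The ensemble row reflects combining the three base models; one common way to do that is soft voting over per-model probabilities. The sketch below illustrates the idea only; the function name and weights are assumptions, not the trained values:

```python
# Toy soft-voting sketch: weighted average of each model's fake-review
# probability. Weights here are illustrative only.
def ensemble_probability(probs, weights=None):
    if weights is None:
        weights = [1.0] * len(probs)
    total = sum(weights)
    return sum(p * w for p, w in zip(probs, weights)) / total


# e.g. Random Forest, XGBoost, and SVM scores for one review:
p = ensemble_probability([0.91, 0.88, 0.79], weights=[1.0, 1.2, 0.8])
print(round(p, 3))  # → 0.866
```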
Once the server is running, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=html

# Run specific test suite
pytest tests/test_classifier.py -v
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
@Mayank-iitj
- Dataset: Amazon Review Dataset, Fake Review Corpus
- Pre-trained models: HuggingFace Transformers
- Libraries: scikit-learn, XGBoost, Selenium
Project Link: https://github.com/yourusername/fake-review-detector