Dual News Recommendation System Using MIND Dataset

This project implements and compares two personalized news recommendation systems based on the Microsoft News Dataset (MIND). The two approaches used are:

Collaborative Filtering (ALS) - Implemented using the Alternating Least Squares algorithm in Apache Spark.
Content-Based Filtering (FAISS) - Utilizes BERT embeddings and FAISS for efficient vector similarity search.

The project explores key challenges such as scalability, implicit feedback handling, and embedding-based recommendations, and provides insights into the trade-offs between the two approaches.

Introduction

The goal of this project is to compare the strengths and weaknesses of Collaborative Filtering and Content-Based Filtering in the context of personalized news recommendations. The collaborative approach models user-item interactions through matrix factorization, while the content-based approach leverages high-dimensional embeddings to recommend articles based on similarity.

The Microsoft News Dataset (MIND) provides the foundation for this analysis, containing:

News articles categorized by topic and subtopic.
User behavior logs including article clicks, impressions, and interaction histories.
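
As a rough illustration of what the behavior logs look like, the sketch below parses one row of MIND's behaviors.tsv, assuming the labelled train/dev layout (impression ID, user ID, timestamp, click history, impressions with a -1/-0 click label). The function name parse_behavior_line and the sample row are made up for this example and are not part of the repository.

```python
from typing import List, Tuple


def parse_behavior_line(line: str) -> Tuple[str, List[str], List[Tuple[str, int]]]:
    """Split one behaviors.tsv row into (user_id, history, labelled impressions)."""
    # Assumed columns: impression id, user id, timestamp, click history, impressions.
    _imp_id, user_id, _time, history, impressions = line.rstrip("\n").split("\t")
    history_ids = history.split()  # empty list for users without history
    labelled = []
    for token in impressions.split():
        # Each token looks like "N55689-1": news id, then 1 (clicked) or 0 (not clicked).
        news_id, label = token.rsplit("-", 1)
        labelled.append((news_id, int(label)))
    return user_id, history_ids, labelled


# Hypothetical sample row, for illustration only.
sample = "1\tU13740\t11/11/2019 9:05:58 AM\tN55189 N42782\tN55689-1 N35729-0"
user, history, impressions = parse_behavior_line(sample)
clicks = [news_id for news_id, label in impressions if label == 1]
print(user, history, clicks)
```

The clicked impressions are the implicit feedback used by the collaborative model, while the history column is what a content-based model would typically use to build a user profile.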

Key Highlights

ALS Model: Handles large-scale, sparse user-item matrices with Spark's distributed architecture.
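FAISS Integration: Employs BERT embeddings and FAISS for scalable nearest-neighbor search. Because articles are matched on content, new articles with no click history can still be recommended, which mitigates the item cold-start problem.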

Project Structure

The repository is organized as follows:

.
├── Dockerfile
├── EDA.ipynb
├── LICENSE
├── MIND_Recommender_Results.pbix
├── README.md
├── docker-compose.yml
├── experiments
│   ├── cbrs_spark.py
│   └── newsapi
│       ├── cleaning.py
│       ├── embed.py
│       ├── prova_fetching.py
│       └── provamongo.py
├── requirements
│   ├── requirements_als.txt
│   ├── requirements_cbrs.txt
│   ├── requirements_clustering.txt
│   └── requirements_fetching.txt
├── requirements.txt
├── saved_models
├── src
│   ├── __init__.py
│   ├── algorithms
│   │   ├── als
│   │   │   ├── als_utils.py
│   │   │   ├── run_train_als.py
│   │   │   └── train_als.py
│   │   ├── cbrs
│   │   │   ├── __init__.py
│   │   │   ├── cbrs_utils_pandas.py
│   │   │   ├── clean_embed.py
│   │   │   ├── info.md
│   │   │   └── run_cbrs_pandas.py
│   │   └── clustering
│   │       └── clustering.py
│   ├── configs
│   │   ├── config.yaml
│   │   └── setup.py
│   ├── data_management
│   │   ├── __init__.py
│   │   ├── fetch_mind.py
│   │   └── mind.py
│   ├── training
│   │   ├── ALS_hyperparam_optimization.ipynb
│   │   ├── __init__.py
│   │   ├── evaluation.py
│   │   └── evaluation_metrics.py
│   └── utilities
│       ├── __init__.py
│       └── data_utils.py
└── start.sh

Key Directories

src/: Core codebase including algorithms, data management, and utilities.
experiments/: Exploratory scripts for embedding generation and news API fetching.
outputs/: Contains visualizations and analysis outputs (e.g., cluster visualizations).
requirements/: Separate requirements files for different modules (ALS, content-based filtering, clustering, and data fetching).

Setup Instructions

Steps

Clone the Repository:

git clone https://github.com/pippotek/Dual-Recommendation-System.git
cd Dual-Recommendation-System

Install Dependencies: Install the packages listed in requirements.txt.
Modify Configuration: Update the config.yaml file to set your preferred options and hyperparameters for the ALS model, and add your Weights & Biases (wandb) API key.
Start the App:
```
bash start.sh
```

Tip

Make sure Docker is allocated at least 6 GB of memory so that all containers run smoothly.

Disclaimer: This project has been tested on Ubuntu and macOS. Compatibility with Windows has not been verified.

Algorithms

Collaborative Filtering

Collaborative Filtering uses the Alternating Least Squares (ALS) algorithm implemented with Apache Spark for scalability.

Workflow:
1. Preprocess the user-item interaction matrix using implicit feedback (clicks).
2. Tune hyperparameters to find the best number of latent factors, regularization parameter, and number of iterations.
3. Train the ALS model on the interaction matrix to identify latent factors for users and articles.
4. Generate recommendations for users by predicting their preferences for unseen articles.
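
To make the workflow above concrete, here is a minimal NumPy sketch of confidence-weighted ALS for implicit feedback, the formulation that Spark's ALS follows when implicit preferences are enabled. It is a toy stand-in, not the project's Spark code: the function names, the click matrix, and the values of factors, reg, and alpha are all assumptions chosen for illustration.

```python
import numpy as np


def implicit_als(clicks, factors=2, reg=0.1, alpha=10.0, iterations=15, seed=0):
    """Factorize a binary click matrix with confidence-weighted ALS."""
    rng = np.random.default_rng(seed)
    n_users, n_items = clicks.shape
    users = rng.normal(scale=0.1, size=(n_users, factors))
    items = rng.normal(scale=0.1, size=(n_items, factors))
    preference = (clicks > 0).astype(float)  # p_ui: 1 if the user clicked
    confidence = 1.0 + alpha * clicks        # c_ui: clicks count more than non-clicks

    def solve_side(fixed, pref, conf):
        # For each row, solve (F^T C F + reg * I) x = F^T C p in closed form.
        solved = np.empty((pref.shape[0], factors))
        for row in range(pref.shape[0]):
            weighted = fixed * conf[row][:, None]  # C F
            lhs = fixed.T @ weighted + reg * np.eye(factors)
            rhs = weighted.T @ pref[row]
            solved[row] = np.linalg.solve(lhs, rhs)
        return solved

    for _ in range(iterations):
        users = solve_side(items, preference, confidence)      # items held fixed
        items = solve_side(users, preference.T, confidence.T)  # users held fixed
    return users, items


def recommend(user, users, items, clicks, top_n=1):
    """Rank articles the user has not clicked by predicted preference."""
    scores = items @ users[user]
    scores[clicks[user] > 0] = -np.inf  # only unseen articles are candidates
    return np.argsort(-scores)[:top_n].tolist()


# Toy data: rows are users, columns are articles, 1 means a click.
clicks = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],  # user 2 clicked only article 0
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)
users, items = implicit_als(clicks)
top_for_user_2 = recommend(2, users, items, clicks)
print(top_for_user_2)
```

In this toy matrix, user 2 shares a click with users 0 and 1, so the article those users also clicked is the natural candidate. On real MIND data the same idea would run through Spark so that the per-row solves are distributed.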

Content-Based Filtering

Content-Based Filtering leverages BERT embeddings to represent news articles and FAISS for approximate nearest neighbor search.

Workflow:
1. Generate embeddings for news articles using a pretrained BERT model.
2. Index embeddings with FAISS for efficient similarity search.
3. Retrieve similar articles based on a user’s reading history using cosine similarity.
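
The retrieval step can be sketched without FAISS by brute force: after L2-normalizing the vectors, the inner product equals cosine similarity, which is also the usual way to get cosine ranking from a FAISS inner-product index. The code below is an assumption-laden toy: the 3-dimensional vectors stand in for BERT embeddings, the user profile is taken as the mean of the read articles (the repository may build it differently), and recommend_similar is a hypothetical name.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length, guarding against zero vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def recommend_similar(article_vectors, history, top_k=2):
    """Rank unread articles by cosine similarity to the mean of the history."""
    unit = l2_normalize(np.asarray(article_vectors, dtype=np.float32))
    profile = l2_normalize(unit[history].mean(axis=0, keepdims=True))[0]
    scores = unit @ profile    # cosine similarity, both sides are unit length
    scores[history] = -np.inf  # never recommend what was already read
    return np.argsort(-scores)[:top_k].tolist()


# Toy vectors standing in for article embeddings.
embeddings = [
    [1.0, 0.0, 0.0],  # 0: already read by the user
    [0.9, 0.1, 0.0],  # 1: very close to article 0
    [0.0, 1.0, 0.0],  # 2: unrelated
    [0.0, 0.0, 1.0],  # 3: unrelated
    [0.8, 0.0, 0.2],  # 4: fairly close to article 0
]
similar = recommend_similar(embeddings, history=[0])
print(similar)
```

Brute force scores every article for every query, which is fine for a toy set. An index such as FAISS serves the same purpose at the scale of the full article catalogue.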

Clustering

To validate the embeddings generated for content-based filtering, K-means clustering (k=3) was performed on the news article embeddings. The goal was to ensure that similar articles were grouped together.
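
The validation idea can be illustrated with a small K-means sketch (Lloyd's algorithm) in NumPy. This is not the project's clustering.py: the farthest-point initialization is chosen here only to keep the example deterministic, and the 2-dimensional points are hypothetical stand-ins for article embeddings.

```python
import numpy as np


def kmeans(points, k=3, iterations=20):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    points = np.asarray(points, dtype=float)
    centroids = [points[0]]
    for _ in range(k - 1):
        # Pick the point farthest from all centroids chosen so far.
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(dists)])
    centroids = np.array(centroids)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids


# Three well-separated toy groups standing in for topical groups of articles.
toy = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [0.0, 9.9], [0.1, 10.0]]
labels, centroids = kmeans(toy, k=3)
print(labels.tolist())
```

With real embeddings, one way to read the result is to compare each cluster's members against the MIND topic categories and check that articles on the same topic tend to share a cluster.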

Results

More about the results can be found in our report. An example of the Power BI dashboard is shown below:
