labd2m/ExSmell-Gold

ExSmell-Gold

Introduction

ExSmell-Gold is a curated dataset of Elixir code smells designed to support research, benchmarking, and experimentation with automated code smell detection techniques.

The dataset is based on the code smell catalog maintained by Lucas Vegi:

https://github.com/lucasvegi/Elixir-Code-Smells

ExSmell-Gold contains 3,500 Elixir source code examples, including:

  • 1,750 examples containing code smells
  • 1,750 examples without intentionally introduced code smells

The smelly examples cover 35 code smells, with 50 examples per smell.

The examples were initially generated using Claude Sonnet 4.6 and subsequently reviewed, refined, and validated by researchers to ensure alignment with the definitions provided in the original catalog.

The goal of ExSmell-Gold is to provide a publicly available benchmark for:

  • Code smell detection
  • Multi-class smell classification
  • Multi-label smell classification
  • Smell localization
  • Evaluation of static analysis tools
  • Evaluation of machine learning models
  • Evaluation of Large Language Models (LLMs)
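As an illustration of the detection use case, the sketch below scores a binary smell detector against gold labels using precision, recall, and F1. It is a minimal example under stated assumptions: the function and class names are hypothetical, the label lists are toy data rather than dataset contents, and the repository does not prescribe any particular metric.

```python
from dataclasses import dataclass


@dataclass
class DetectionScores:
    precision: float
    recall: float
    f1: float


def score_detection(gold: list[bool], predicted: list[bool]) -> DetectionScores:
    """Score binary smell detection; True means 'this example is smelly'."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_pos = sum(1 for g, p in pairs if not g and p)
    false_neg = sum(1 for g, p in pairs if g and not p)

    # Guard against division by zero when a class is never predicted or present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return DetectionScores(precision, recall, f1)


# Toy labels (hypothetical, not taken from the dataset).
toy_gold = [True, True, False, False]
toy_predicted = [True, False, True, False]
toy_scores = score_detection(toy_gold, toy_predicted)
```

Because the dataset is balanced between smelly and non-smelly examples, accuracy would also be informative here; precision and recall are shown because they separate false alarms from missed smells.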

Dataset Quality

A representative subset of the dataset was manually reviewed by two researchers.

The validation process confirmed a high level of agreement regarding the presence and classification of smells.

For the non-smelly subset, a statistical validation was performed on a representative sample. After the identified issues were corrected, the subset reached an estimated correctness of approximately 97.4%, at a 90% confidence level with a 5% margin of error.
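The sample size used for this validation is not stated here. As a rough guide to what a 90% confidence level and a 5% margin of error imply, the sketch below applies Cochran's formula with a finite-population correction. This is an assumption about one common way to size such a sample, not a description of the procedure the researchers followed; the function name and the conservative proportion p = 0.5 are also assumptions.

```python
import math


def sample_size(
    population: int,
    z: float = 1.645,    # two-sided critical value for a 90% confidence level
    margin: float = 0.05,
    p: float = 0.5,      # assumed proportion; 0.5 gives the largest sample
) -> int:
    """Cochran's sample size with finite-population correction, rounded up."""
    n_infinite = (z ** 2) * p * (1 - p) / margin ** 2
    n_corrected = n_infinite / (1 + (n_infinite - 1) / population)
    return math.ceil(n_corrected)


# 1,750 is the size of the non-smelly subset stated in the Introduction.
required = sample_size(1750)
```

Under these assumptions, `required` evaluates to 235 examples.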

Additional details regarding dataset construction and validation will be presented in a forthcoming research publication.

Repository Organization

The dataset is organized according to the categories defined in the original catalog.

Design-related smells

Examples for each smell can be found in the following directories:

Low-level concerns smells

Examples for each smell can be found in the following directories:

Traditional Smells

Traditional code smells adapted to the Elixir ecosystem can be found in:

Non-Smelly Examples

Examples intentionally created without code smells can be found in:

Intended Usage

ExSmell-Gold can be used as:

  • A benchmark dataset for code smell detection.
  • A ground-truth dataset for evaluating LLMs.
  • A training resource for machine learning models.
  • A reference collection of code smell examples in Elixir.
  • Educational material for discussing software quality and maintainability in Elixir projects.
