MacSQL-Logic-Tuner is a Python-based automated pipeline designed for fine-tuning the Qwen 2.5 Coder 7B (4-bit) model on Apple Silicon (M-series) hardware using the MLX Framework.
This project provides end-to-end tooling to teach the model custom database schemas and complex business logic, optimized for devices with 16GB Unified Memory.
- Automated Merging: Seamlessly combines historical data (`all_data.jsonl`) with incremental updates (`new_data.jsonl`).
- Smart Deduplication: Automatically detects and removes duplicate entries based on content to ensure higher data quality.
- Auto-Splitting: Performs an 80/20 Train/Validation split to ensure proper model evaluation during training.
- Apple Silicon Native: Built on top of the MLX Framework by Apple Machine Learning Research.
- Resource Efficient: Optimized for 16GB Unified Memory systems using:
  - 4-bit Quantization: Reduces memory footprint without significant accuracy loss.
  - LoRA (Low-Rank Adaptation): Fine-tunes only a small subset of parameters for efficiency (see the toy sketch after this list).
  - Batch Size 1: Ensures stability on consumer hardware.
- Automatic Fusion: Merges the learned LoRA adapters back into the base model.
- Ready-to-Deploy: Generates a standalone `my_custom_db_model` directory that can be used directly for inference, with no need to load the original base model or adapter weights separately.
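For intuition, here is a toy MLX sketch (illustrative only, not code from this repo) of the idea behind LoRA: the frozen weight `W` gains a trainable low-rank update `B @ A`, so training touches roughly `2*d*r` parameters instead of `d*d`.

```python
import mlx.core as mx

# Toy LoRA illustration: y = x @ W.T + (x @ A.T) @ B.T.
# W stays frozen; only the low-rank factors A and B would be trained.
d, r = 4096, 8                       # hidden size and LoRA rank (example values)
W = mx.random.normal((d, d))         # frozen pretrained weight
A = mx.random.normal((r, d)) * 0.01  # trainable down-projection
B = mx.zeros((d, r))                 # trainable up-projection; zero-init, so training starts at W

def lora_linear(x: mx.array) -> mx.array:
    return x @ W.T + (x @ A.T) @ B.T

print(lora_linear(mx.random.normal((1, d))).shape)  # (1, 4096)
```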
Tech stack:

- Language: Python 3.10+
- Framework: `mlx-lm` (Apple MLX for Language Models)
- Base Model: `mlx-community/Qwen2.5-Coder-7B-Instruct-4bit`
- Supported Hardware: Apple M1/M2/M3/M4 chips (Pro/Max/Ultra recommended, but works on base 16GB models)
To get started:

- Clone the repository:

  ```bash
  git clone https://github.com/your-username/MacSQL-Logic-Tuner.git
  cd MacSQL-Logic-Tuner
  ```

- Set up the environment. Using `uv` (recommended):

  ```bash
  # Creates a virtualenv and installs dependencies from uv.lock/pyproject.toml
  uv sync
  source .venv/bin/activate
  ```

  Alternative (standard pip):

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```
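To verify the environment before committing to a full training run, you can load the base model and generate a few tokens. This is a sanity check I'd suggest, assuming a recent `mlx-lm`; the first run downloads roughly 4 GB from Hugging Face:

```python
# Sanity check: load the 4-bit base model and generate a few tokens.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")
print(generate(model, tokenizer, prompt="SELECT 1;", max_tokens=8))
```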
Key parameters in `main.py` can be adjusted for your specific needs:
| Parameter | Default Value | Description |
|---|---|---|
| `BASE_MODEL` | `mlx-community/Qwen2.5-Coder-7B-Instruct-4bit` | The HF repo ID of the base model. |
| `ITERS` | `1000` | Number of training iterations. Increase for better learning. |
| `DATA_DIR` | `data` | Directory where the processed `train.jsonl` and `valid.jsonl` are stored. |
| `FINAL_MODEL_DIR` | `my_custom_db_model` | Output directory for the fused, ready-to-use model. |
To start, place your initial dataset in `all_data.jsonl`.
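The exact JSONL schema depends on what your data scripts emit; as an assumption, a record might look like this in the chat-style layout that `mlx-lm` accepts (table name and content here are hypothetical):

```python
import json

# Hypothetical training record; adjust roles/fields to whatever schema your scripts expect.
record = {
    "messages": [
        {"role": "system", "content": "You are a SQL assistant for the orders database."},
        {"role": "user", "content": "List the ten most recent orders."},
        {"role": "assistant", "content": "SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;"},
    ]
}

# Each line of all_data.jsonl is one JSON object like the above.
with open("all_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```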
If you have new data to add, place it in `new_data.jsonl` and run the merge script:

```bash
# Merges new_data.jsonl into all_data.jsonl and removes duplicates
python data_merge.py
```
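For a rough idea of how content-based deduplication can work, here is a minimal sketch (not necessarily what `data_merge.py` does): canonicalize each JSON record and keep first occurrences.

```python
import json

def dedupe_jsonl(path: str) -> None:
    """Drop duplicate records from a JSONL file, keeping first occurrences."""
    seen: set[str] = set()
    kept: list[str] = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            # Canonicalize so key order and whitespace don't mask duplicates
            key = json.dumps(json.loads(line), sort_keys=True)
            if key not in seen:
                seen.add(key)
                kept.append(key)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(kept) + "\n")

dedupe_jsonl("all_data.jsonl")
```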
Then, split the merged data into training and validation sets:

```bash
# Generates data/train.jsonl and data/valid.jsonl
python auto_split_data_set.py
```
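The 80/20 split itself is simple; a minimal sketch, assuming one JSON record per line (the real `auto_split_data_set.py` may shuffle or seed differently):

```python
import os
import random

# Shuffle and split all_data.jsonl into data/train.jsonl (80%) and data/valid.jsonl (20%).
with open("all_data.jsonl", encoding="utf-8") as f:
    records = [line for line in f if line.strip()]

random.seed(42)  # fixed seed for a reproducible split
random.shuffle(records)
cut = int(len(records) * 0.8)

os.makedirs("data", exist_ok=True)
with open("data/train.jsonl", "w", encoding="utf-8") as f:
    f.writelines(records[:cut])
with open("data/valid.jsonl", "w", encoding="utf-8") as f:
    f.writelines(records[cut:])
```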
Run the main pipeline to fine-tune the model and export the final result. This script performs two major steps:

- Fine-Tuning: Trains adapters using LoRA.
- Fusion: Merges the learned adapters into the base model.

```bash
python main.py
```

Look for "Success! Your custom model is saved in: ..." at the end of the output.
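Under the hood, these two stages map onto the `mlx-lm` CLI. A hypothetical sketch of how `main.py` might drive them; the actual script may call the Python APIs directly instead:

```python
import subprocess

# Values mirror the configuration table above
BASE_MODEL = "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit"
DATA_DIR = "data"
FINAL_MODEL_DIR = "my_custom_db_model"
ITERS = 1000

# Stage 1: LoRA fine-tuning; batch size 1 keeps 16GB systems stable
subprocess.run(
    ["mlx_lm.lora", "--model", BASE_MODEL, "--train",
     "--data", DATA_DIR, "--iters", str(ITERS), "--batch-size", "1"],
    check=True,
)

# Stage 2: fuse the trained adapters (written to ./adapters by default)
# back into the base model and save a standalone copy
subprocess.run(
    ["mlx_lm.fuse", "--model", BASE_MODEL,
     "--adapter-path", "adapters", "--save-path", FINAL_MODEL_DIR],
    check=True,
)

print(f"Success! Your custom model is saved in: {FINAL_MODEL_DIR}")
```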
Test your new custom model using the MLX CLI or Python.

Using the CLI:

```bash
mlx_lm.generate --model my_custom_db_model --prompt "SELECT * FROM users WHERE" --max-tokens 100
```

Using Python:
```python
from mlx_lm import load, generate

# Load the fused model directly from its output directory
model, tokenizer = load("my_custom_db_model")

response = generate(model, tokenizer, prompt="Generate a SQL query for...", verbose=True)
print(response)
```

Tips:

- Ensure no other heavy applications (Docker, Chrome tabs, IDEs) are running during training.
- If you encounter Out-Of-Memory (OOM) errors, double-check that `--batch-size` is set to `1` in `main.py`.