diff --git a/AIM_INTEGRATION_PLAN.md b/AIM_INTEGRATION_PLAN.md new file mode 100644 index 0000000..7b49187 --- /dev/null +++ b/AIM_INTEGRATION_PLAN.md @@ -0,0 +1,1070 @@ +# Aim Experiment Tracking Integration for tinyLab + +**Status:** Design Complete - Ready for Implementation +**Date:** 2025-11-18 +**Purpose:** Add comprehensive experiment tracking with web UI for mechanistic interpretability research + +**Note:** This plan now includes Stage-1A developmental interpretability metrics (VDI, circularity, induction head emergence) alongside the core suppressor analysis work. + +--- + +## Overview + +This document outlines the integration of [Aim](https://aimstack.io/) experiment tracking into tinyLab. Aim will provide an interactive web UI to browse experiments, compare runs, and visualize mechanistic interpretability metrics in real-time. + +**Coverage:** Supports both the main suppressor paper experiments AND the Stage-1A preregistered pilot on early-layer synchronization control. + +### Why Aim? 
+ +- **Self-hosted** - No cloud dependencies, works offline +- **Python-native** - Easy integration with existing codebase +- **Rich visualizations** - Interactive plots, comparisons, filtering +- **Flexible** - Supports custom metrics, images, text, distributions +- **Fast** - Efficient storage and querying +- **Open source** - MIT license, no vendor lock-in + +### What Gets Tracked + +``` +Run Metadata Core Metrics MI Metrics Artifacts +├── model_name ├── logit_diff ├── ov_fidelity_by_layer ├── attention_heatmaps +├── condition ├── accuracy ├── qk_pattern_strength ├── ov_projections +├── probe_type ├── p_drop ├── activation_entropy ├── calibration_curves +├── layers ├── kl_divergence ├── geometric_curvature ├── confusion_matrices +├── heads ├── calibration_ece ├── pca_rank_by_layer ├── token_clouds +├── seed ├── mediation_fraction ├── path_patching_effects └── trajectory_plots +├── git_commit └── bootstrap_ci └── emergence_curves +├── timestamp +└── device +``` + +--- + +## Architecture + +### Directory Structure + +``` +tinyLab/ +├── .aim/ # Aim storage (gitignored) +│ ├── meta/ # Metadata index +│ ├── runs/ # Run data (metrics, logs) +│ └── seqs/ # Sequence storage +│ +├── lab/ +│ ├── tracking/ # NEW: Aim integration code +│ │ ├── __init__.py +│ │ ├── tracker.py # Main tracking class +│ │ ├── metrics.py # Metric definitions +│ │ ├── visualizations.py # Custom plots +│ │ └── migrate.py # Import existing results +│ │ +│ ├── harness.py # MODIFIED: Add tracking hooks +│ └── configs/ # EXISTING: Experiment configs +│ +├── scripts/ +│ ├── import_to_aim.py # Import historical results +│ └── launch_aim_ui.sh # Start Aim web UI +│ +└── docs/ + └── AIM_USAGE.md # User guide for Aim UI +``` + +### Data Flow + +``` +Experiment Run (harness.py) + ↓ +TinyLabTracker.log_metrics() + ↓ +Aim Run Storage (.aim/) + ↓ +Aim UI (http://localhost:43800) + ↓ +Interactive Visualizations +``` + +--- + +## Implementation Plan + +### Phase 1: Core Integration (30 min) + +**Goal:** Basic 
tracking of runs with metadata and core metrics + +#### 1. Install Aim + +```bash +pip install aim +``` + +#### 2. Create Tracking Module + +**`lab/tracking/__init__.py`:** +```python +"""Experiment tracking with Aim.""" +from .tracker import TinyLabTracker + +__all__ = ['TinyLabTracker'] +``` + +**`lab/tracking/tracker.py`:** +```python +"""Main tracking class for tinyLab experiments.""" +from aim import Run +from pathlib import Path +from typing import Dict, Any, Optional +import json + +class TinyLabTracker: + """ + Wrapper around Aim for mechanistic interpretability experiments. + + Example usage: + tracker = TinyLabTracker( + experiment_name="h1_suppressor_sweep", + config=config_dict, + tags=["gpt2-medium", "facts", "layer0"] + ) + + # Log metrics + tracker.log_metric("logit_diff", 2.45, step=0) + tracker.log_metric("accuracy", 0.89, step=0) + + # Log custom MI metrics + tracker.log_ov_fidelity(ov_scores_by_layer, step=0) + + # Log artifacts (layer and head are separate arguments) + tracker.log_attention_pattern(attn_matrix, layer=0, head=2) + + # Finish + tracker.finish() + """ + + def __init__( + self, + experiment_name: str, + config: Dict[str, Any], + tags: Optional[list] = None, + repo_path: Optional[str] = None + ): + """ + Initialize tracker for an experiment run. 
+ + Args: + experiment_name: Name of experiment (e.g., "h1_cross_condition") + config: Full experiment configuration dict + tags: List of tags for filtering (e.g., ["gpt2-medium", "facts"]) + repo_path: Path to .aim directory (default: project root) + """ + self.experiment_name = experiment_name + self.config = config + + # Initialize Aim run (Run() takes no `tags` argument; tags are attached below) + self.run = Run( + repo=repo_path, + experiment=experiment_name + ) + for tag in tags or []: + self.run.add_tag(tag) + + # Log all config as hyperparameters + self.run['hparams'] = config + + # Log key metadata + self.run['model_name'] = config.get('model_name', 'unknown') + self.run['condition'] = config.get('tag', 'unknown') + self.run['probe_type'] = config.get('probe', 'unknown') + self.run['device'] = config.get('device', 'unknown') + self.run['seed'] = config.get('seed', None) + + # Git info (optional: skipped if GitPython is missing, this is not a git checkout, or HEAD is detached) + try: + import git + repo = git.Repo(search_parent_directories=True) + self.run['git_commit'] = repo.head.commit.hexsha[:8] + self.run['git_branch'] = repo.active_branch.name + except Exception: + pass + + def log_metric(self, name: str, value: float, step: int = 0, context: Optional[Dict] = None): + """ + Log a scalar metric. + + Args: + name: Metric name (e.g., "logit_diff") + value: Metric value + step: Step/iteration (0 for final metrics) + context: Additional context (e.g., {"head": "0:2"}) + """ + self.run.track(value, name=name, step=step, context=context or {}) + + def log_metrics_dict(self, metrics: Dict[str, float], step: int = 0, prefix: str = ""): + """ + Log multiple metrics at once. 
+ + Args: + metrics: Dict of {metric_name: value} + step: Step/iteration + prefix: Prefix to add to all metric names + """ + for name, value in metrics.items(): + full_name = f"{prefix}/{name}" if prefix else name + self.log_metric(full_name, value, step=step) + + def log_head_metrics(self, head: tuple, metrics: Dict[str, float], step: int = 0): + """ + Log metrics for a specific attention head. 
+ + Args: + head: Tuple of (layer, head_idx) + metrics: Dict of {metric_name: value} + step: Step/iteration + """ + context = {"layer": head[0], "head": head[1]} + for name, value in metrics.items(): + self.run.track(value, name=name, step=step, context=context) + + def log_layer_metrics(self, layer: int, metrics: Dict[str, float], step: int = 0): + """ + Log metrics for a specific layer. + + Args: + layer: Layer index + metrics: Dict of {metric_name: value} + step: Step/iteration + """ + context = {"layer": layer} + for name, value in metrics.items(): + self.run.track(value, name=name, step=step, context=context) + + def log_ov_fidelity(self, fidelity_by_layer: Dict[int, float], step: int = 0): + """ + Log OV circuit fidelity across layers. + + Args: + fidelity_by_layer: {layer_idx: fidelity_score} + step: Step/iteration + """ + for layer, fidelity in fidelity_by_layer.items(): + self.run.track( + fidelity, + name="ov_fidelity", + step=step, + context={"layer": layer} + ) + + def log_activation_entropy( + self, + layer: int, + entropy: float, + entropy_type: str = "subspace", + step: int = 0 + ): + """ + Log activation entropy for a layer. + + Args: + layer: Layer index + entropy: Entropy value + entropy_type: Type of entropy ("subspace", "diagonal", "per_token") + step: Step/iteration + """ + self.run.track( + entropy, + name=f"activation_entropy_{entropy_type}", + step=step, + context={"layer": layer} + ) + + def log_geometric_metrics( + self, + curvature: float, + output_entropy: float, + step: int = 0, + phase: str = "final" + ): + """ + Log geometric signature metrics. 
+ + Args: + curvature: Trajectory curvature + output_entropy: Output distribution entropy + step: Step/iteration + phase: Phase of trajectory ("early", "mid", "final") + """ + self.run.track(curvature, name="curvature", step=step, context={"phase": phase}) + self.run.track(output_entropy, name="output_entropy", step=step, context={"phase": phase}) + + def log_image(self, name: str, image, step: int = 0, context: Optional[Dict] = None): + """ + Log an image (attention pattern, plot, etc.). + + Args: + name: Image name + image: PIL Image, numpy array, or matplotlib figure + step: Step/iteration + context: Additional context + """ + from aim import Image + self.run.track(Image(image), name=name, step=step, context=context or {}) + + def log_attention_pattern(self, pattern, layer: int, head: int, step: int = 0): + """ + Log attention pattern heatmap. + + Args: + pattern: Attention matrix (numpy array or matplotlib figure) + layer: Layer index + head: Head index + step: Step/iteration + """ + import matplotlib.pyplot as plt + from aim import Image + + # If pattern is numpy array, create heatmap + if hasattr(pattern, 'shape'): + fig, ax = plt.subplots(figsize=(8, 6)) + im = ax.imshow(pattern, cmap='viridis', aspect='auto') + ax.set_title(f'Attention Pattern L{layer}H{head}') + ax.set_xlabel('Key Position') + ax.set_ylabel('Query Position') + plt.colorbar(im, ax=ax) + self.run.track( + Image(fig), + name="attention_pattern", + step=step, + context={"layer": layer, "head": head} + ) + plt.close(fig) + else: + # Assume it's already a figure + self.run.track( + Image(pattern), + name="attention_pattern", + step=step, + context={"layer": layer, "head": head} + ) + + def log_distribution(self, name: str, values, step: int = 0, context: Optional[Dict] = None): + """ + Log a distribution of values. 
+ + Args: + name: Distribution name + values: Array of values + step: Step/iteration + context: Additional context + """ + from aim import Distribution + self.run.track( + Distribution(values), + name=name, + step=step, + context=context or {} + ) + + def log_text(self, name: str, text: str, step: int = 0): + """ + Log text (e.g., model output, errors). + + Args: + name: Text identifier + text: Text content + step: Step/iteration + """ + from aim import Text + self.run.track(Text(text), name=name, step=step) + + def log_artifact(self, name: str, artifact: Any): + """ + Log arbitrary Python object as artifact. + + Args: + name: Artifact name + artifact: Any JSON-serializable object + """ + self.run[name] = artifact + + def finish(self, final_metrics: Optional[Dict[str, float]] = None): + """ + Finalize the run. + + Args: + final_metrics: Optional final metrics to log + """ + if final_metrics: + self.log_metrics_dict(final_metrics, prefix="final") + + self.run.close() +``` + +#### 3. Integrate with Harness + +**Modify `lab/harness.py`:** + +```python +# At top of file +import numpy as np # needed for the aggregate metrics below +from lab.tracking import TinyLabTracker + +# In the main experiment function: +def run_experiment(config_path: str): + # Load config + config = load_config(config_path) + + # Initialize tracker + tracker = TinyLabTracker( + experiment_name=config.get('experiment', 'unnamed'), + config=config, + tags=[ + config['model_name'], + config.get('tag', 'unknown'), + f"layer{config.get('target_layer', 0)}" + ] + ) + + try: + # Run experiment + results = run_ablation_sweep(config) + + # Log results + for head, metrics in results.items(): + tracker.log_head_metrics( + head=head, + metrics={ + 'logit_diff': metrics['ld'], + 'accuracy': metrics['acc'], + 'p_drop': metrics['p_drop'], + 'kl_divergence': metrics['kl'] + } + ) + + # Log aggregate metrics + tracker.log_metrics_dict({ + 'mean_logit_diff': np.mean([m['ld'] for m in results.values()]), + 
'max_logit_diff': np.max([m['ld'] for m in results.values()]), + 
'top_head_ld': sorted(results.items(), key=lambda x: x[1]['ld'])[-1][1]['ld'] + }) + + finally: + tracker.finish() +``` + +#### 4. Start Aim UI + +```bash +# Launch web UI +aim up + +# Opens at http://localhost:43800 +``` + +--- + +### Phase 2: Historical Data Import (1 hour) + +**Goal:** Import existing results from `reports/` into Aim + +**`scripts/import_to_aim.py`:** +```python +#!/usr/bin/env python3 +""" +Import historical tinyLab results into Aim. + +Usage: + python scripts/import_to_aim.py + python scripts/import_to_aim.py --reports-dir reports/ +""" +import argparse +import json +from pathlib import Path +from aim import Run +import re + +def parse_filename(filename: Path): + """Extract metadata from a results file path (uses `filename.stem`).""" + # Examples: + # gpt2m_facts_ranking.csv + # mistral_cf_l0_ranking.csv + # h1_head_rank_stats.json + + parts = filename.stem.split('_') + metadata = {} + + # Extract model + if 'gpt2' in filename.stem: + if 'gpt2m' in filename.stem: + metadata['model'] = 'gpt2-medium' + elif 'gpt2l' in filename.stem: + metadata['model'] = 'gpt2-large' + else: + metadata['model'] = 'gpt2' + elif 'mistral' in filename.stem: + metadata['model'] = 'mistral-7b' + elif 'pythia' in filename.stem: + metadata['model'] = 'pythia' + + # Extract condition + conditions = ['facts', 'cf', 'logic', 'neg', 'counterfactual', 'negation', 'logical'] + for cond in conditions: + if cond in filename.stem: + metadata['condition'] = cond + break + + # Extract hypothesis + h_match = re.search(r'h(\d+)', filename.stem) + if h_match: + metadata['hypothesis'] = f"H{h_match.group(1)}" + + return metadata + +def import_head_rankings(csv_path: Path, repo_path: str = None): + """Import head ranking CSV.""" + import pandas as pd + + metadata = parse_filename(csv_path) + + # Run() takes no `tags` argument; tags are attached with add_tag() + run = Run( + repo=repo_path, + experiment=f"imported_{metadata.get('hypothesis', 'ranking')}" + ) + for tag in ['imported', 
'historical'] + list(metadata.values()): + run.add_tag(tag) + + # Log metadata + run['source_file'] = str(csv_path) + 
run['imported'] = True + for k, v in metadata.items(): + run[k] = v + + # Load and log data + df = pd.read_csv(csv_path) + + for idx, row in df.iterrows(): + layer = row.get('layer', row.get('Layer', 0)) + head = row.get('head', row.get('Head', idx)) + + context = {"layer": int(layer), "head": int(head)} + + # Log available metrics + for col in df.columns: + if col.lower() in ['layer', 'head', 'rank']: + continue + try: + value = float(row[col]) + run.track(value, name=col.lower(), step=0, context=context) + except (ValueError, TypeError): + pass # skip non-numeric cells + + run.close() + print(f"✓ Imported {csv_path.name}") + +def import_json_metrics(json_path: Path, repo_path: str = None): + """Import JSON metric file.""" + metadata = parse_filename(json_path) + + # Run() takes no `tags` argument; tags are attached with add_tag() + run = Run( + repo=repo_path, + experiment=f"imported_{metadata.get('hypothesis', 'metrics')}" + ) + for tag in ['imported', 'historical'] + list(metadata.values()): + run.add_tag(tag) + + # Log metadata + run['source_file'] = str(json_path) + run['imported'] = True + for k, v in metadata.items(): + run[k] = v + + # Load data + with open(json_path) as f: + data = json.load(f) + + # Log all metrics + def log_nested(obj, prefix=""): + """Recursively log nested dict.""" + if isinstance(obj, dict): + for key, val in obj.items(): + new_prefix = f"{prefix}/{key}" if prefix else key + log_nested(val, new_prefix) + elif isinstance(obj, (int, float)): + run.track(float(obj), name=prefix, step=0) + elif isinstance(obj, list) and all(isinstance(x, (int, float)) for x in obj): + # Log as distribution + from aim import Distribution + run.track(Distribution(obj), name=prefix, step=0) + + log_nested(data) + + run.close() + print(f"✓ Imported {json_path.name}") + +def main(): + parser = argparse.ArgumentParser(description="Import tinyLab results to Aim") + parser.add_argument('--reports-dir', default='reports/', help='Path to reports directory') + 
parser.add_argument('--repo', default=None, help='Path to .aim directory') + args = parser.parse_args() + + reports_dir = Path(args.reports_dir) + 
+ # Import CSVs + print("Importing CSV files...") + for csv_file in reports_dir.glob('**/*.csv'): + try: + import_head_rankings(csv_file, repo_path=args.repo) + except Exception as e: + print(f"✗ Failed to import {csv_file.name}: {e}") + + # Import JSONs + print("\nImporting JSON files...") + for json_file in reports_dir.glob('**/*.json'): + # Skip manifest files + if 'manifest' in json_file.name.lower(): + continue + try: + import_json_metrics(json_file, repo_path=args.repo) + except Exception as e: + print(f"✗ Failed to import {json_file.name}: {e}") + + print("\n✓ Import complete! Launch UI with: aim up") + +if __name__ == '__main__': + main() +``` + +**Run import:** +```bash +python scripts/import_to_aim.py +``` + +--- + +### Phase 3: Custom Visualizations (2 hours) + +**Goal:** Add tinyLab-specific visualizations to Aim UI + +**`lab/tracking/visualizations.py`:** +```python +"""Custom visualizations for Aim UI.""" +import matplotlib.pyplot as plt +import numpy as np +from typing import List, Dict, Any + +class MIVisualizations: + """Mechanistic interpretability visualizations.""" + + @staticmethod + def plot_layer_metrics(metrics_by_layer: Dict[int, Dict[str, float]], metric_name: str): + """ + Plot metric evolution across layers. + + Args: + metrics_by_layer: {layer_idx: {metric_name: value}} + metric_name: Which metric to plot + + Returns: + matplotlib.Figure + """ + layers = sorted(metrics_by_layer.keys()) + values = [metrics_by_layer[l][metric_name] for l in layers] + + fig, ax = plt.subplots(figsize=(10, 6)) + ax.plot(layers, values, marker='o', linewidth=2, markersize=8) + ax.set_xlabel('Layer', fontsize=12) + ax.set_ylabel(metric_name.replace('_', ' ').title(), fontsize=12) + ax.set_title(f'{metric_name.replace("_", " ").title()} Across Layers', fontsize=14) + ax.grid(True, alpha=0.3) + + return fig + + @staticmethod + def plot_head_heatmap(head_metrics: Dict[tuple, float], n_layers: int, n_heads: int): + """ + Plot heatmap of head-level metrics. 
+ + Args: + head_metrics: {(layer, head): metric_value} + n_layers: Number of layers + n_heads: Number of heads per layer + + Returns: + matplotlib.Figure + """ + # Create matrix + matrix = np.zeros((n_layers, n_heads)) + for (layer, head), value in head_metrics.items(): + matrix[layer, head] = value + + fig, ax = plt.subplots(figsize=(12, 8)) + im = ax.imshow(matrix, cmap='RdYlGn', aspect='auto') + + ax.set_xlabel('Head', fontsize=12) + ax.set_ylabel('Layer', fontsize=12) + ax.set_title('Head Ablation Effects (ΔLD)', fontsize=14) + + # Add colorbar + cbar = plt.colorbar(im, ax=ax) + cbar.set_label('Logit Difference', fontsize=12) + + # Add grid + ax.set_xticks(np.arange(n_heads)) + ax.set_yticks(np.arange(n_layers)) + ax.grid(which='major', color='white', linewidth=0.5) + + return fig + + @staticmethod + def plot_emergence_curve( + checkpoint_metrics: Dict[int, float], + checkpoint_steps: List[int], + metric_name: str = "logit_diff" + ): + """ + Plot metric emergence across training checkpoints (for Pythia). 
+ + Args: + checkpoint_metrics: {checkpoint_step: metric_value} + checkpoint_steps: List of checkpoint steps + metric_name: Metric to plot + + Returns: + matplotlib.Figure + """ + steps = sorted(checkpoint_steps) + values = [checkpoint_metrics.get(s, 0) for s in steps] + + fig, ax = plt.subplots(figsize=(10, 6)) + ax.plot(steps, values, marker='o', linewidth=2, markersize=8, color='#2E86AB') + ax.set_xlabel('Training Steps', fontsize=12) + ax.set_ylabel(metric_name.replace('_', ' ').title(), fontsize=12) + ax.set_title(f'{metric_name.replace("_", " ").title()} Emergence', fontsize=14) + ax.set_xscale('log') + ax.grid(True, alpha=0.3) + + # Add shaded region for crystallization + if len(values) > 2: + # Find inflection point (simple heuristic) + diffs = np.diff(values) + inflection = np.argmax(diffs) + 1 + ax.axvspan(steps[0], steps[inflection], alpha=0.1, color='red', label='Pre-crystallization') + ax.axvspan(steps[inflection], steps[-1], alpha=0.1, color='green', label='Post-crystallization') + ax.legend() + + return fig + + @staticmethod + def plot_ov_token_projection( + token_embeddings: np.ndarray, + token_labels: List[str], + title: str = "OV Circuit Token Projection" + ): + """ + Plot 2D projection of OV-projected tokens. 
+ + Args: + token_embeddings: (n_tokens, embedding_dim) array + token_labels: List of token strings + title: Plot title + + Returns: + matplotlib.Figure + """ + from sklearn.decomposition import PCA + + # Project to 2D + pca = PCA(n_components=2) + embeddings_2d = pca.fit_transform(token_embeddings) + + fig, ax = plt.subplots(figsize=(12, 8)) + + # Scatter plot + scatter = ax.scatter( + embeddings_2d[:, 0], + embeddings_2d[:, 1], + c=range(len(token_labels)), + cmap='viridis', + s=100, + alpha=0.6 + ) + + # Add labels + for i, label in enumerate(token_labels): + ax.annotate( + label, + (embeddings_2d[i, 0], embeddings_2d[i, 1]), + fontsize=9, + alpha=0.8 + ) + + ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} var)', fontsize=12) + ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} var)', fontsize=12) + ax.set_title(title, fontsize=14) + ax.grid(True, alpha=0.3) + + return fig +``` + +--- + +### Phase 4: DVC Integration (30 min) + +**Goal:** Ensure Aim storage works with DVC + +**Update `.gitignore`:** +```gitignore +# Aim tracking (local only, regenerate from DVC data) +/.aim/ +``` + +**Optional:** Track Aim exports with DVC: + +```python +# scripts/export_aim_reports.py +"""Export Aim runs to static JSON for DVC tracking.""" +from aim import Repo +import json + +repo = Repo('.') + +# Export all runs metadata +runs_data = [] +for run in repo.iter_runs(): + runs_data.append({ + 'hash': run.hash, + 'name': run.name, + 'experiment': run.experiment, + 'creation_time': run.creation_time.isoformat(), + 'params': run.get('hparams', {}), + 'metrics': { + track.name: track.values.last_value() + for track in run.metrics() + } + }) + +# Save to reports/ +with open('reports/aim_runs_export.json', 'w') as f: + json.dump(runs_data, f, indent=2) + +# Track with DVC +# dvc add reports/aim_runs_export.json +``` + +--- + +## Usage Guide + +### Running Experiments with Tracking + +**Before (no tracking):** +```bash +python -m lab.battery --config 
lab/configs/run_h1_cross_condition_balanced.json +``` + +**After (with Aim tracking):** +```bash +# Tracking is automatic! Just run as before +python -m lab.battery --config lab/configs/run_h1_cross_condition_balanced.json + +# View in UI +aim up +``` + +### Browsing Experiments + +**Launch UI:** +```bash +aim up +# Opens http://localhost:43800 +``` + +**UI Features:** + +1. **Runs Table** - View all runs with metadata, hyperparameters, metrics +2. **Metrics Explorer** - Compare metrics across runs with interactive plots +3. **Images** - Browse attention patterns, OV projections, calibration curves +4. **Text Logs** - View model outputs, errors, notes +5. **Params** - Filter and group by hyperparameters +6. **Custom Dashboards** - Create saved views for specific analyses + +### Filtering and Grouping + +**In UI:** +- Filter by model: `run.model_name == "gpt2-medium"` +- Filter by condition: `run.condition == "facts"` +- Group by hypothesis: Group by `run.hypothesis` +- Compare top heads: Filter by `logit_diff > 2.0` with context `{layer: 0}` + +**Programmatically:** +```python +from aim import Repo + +repo = Repo('.') + +# Find all GPT-2 Medium facts runs +runs = repo.query_runs( + "run.model_name == 'gpt2-medium' and run.condition == 'facts'" +).iter() + +for run in runs: + print(f"Run {run.hash}: LD = {run.metrics()['logit_diff'].last_value()}") +``` + +### Creating Custom Dashboards + +**Example: Suppressor Analysis Dashboard** + +```python +# In Aim UI → Metrics → Create new dashboard +# Add charts: +# 1. Logit Diff by Layer (line plot, group by layer) +# 2. Head Heatmap (table, context: {layer, head}) +# 3. OV Fidelity Over Time (line plot, group by checkpoint) +# 4. Attention Patterns (image grid, filter: layer == 0) + +# Save as "Suppressor Analysis" +``` + +--- + +## Advanced Features + +### 1. 
Compare Runs Side-by-Side + +```python +from aim import Repo + +repo = Repo('.') + +# Get two runs +run1 = repo.get_run('abc123') # GPT-2 Medium facts +run2 = repo.get_run('def456') # Mistral facts + +# Compare metrics +for metric_name in ['logit_diff', 'accuracy', 'calibration_ece']: + val1 = run1.metrics()[metric_name].last_value() + val2 = run2.metrics()[metric_name].last_value() + print(f"{metric_name}: GPT-2={val1:.3f}, Mistral={val2:.3f}") +``` + +### 2. Export for Paper + +```python +# Export specific metrics for LaTeX table +from aim import Repo +import pandas as pd + +repo = Repo('.') + +# Query runs +runs = repo.query_runs("run.condition == 'facts'").iter() + +# Build dataframe +data = [] +for run in runs: + data.append({ + 'Model': run['model_name'], + 'ΔLD': run.metrics()['logit_diff'].last_value(), + 'Accuracy': run.metrics()['accuracy'].last_value(), + 'ECE': run.metrics()['calibration_ece'].last_value(), + }) + +df = pd.DataFrame(data) +print(df.to_latex(index=False, float_format='%.3f')) +``` + +### 3. 
Automated Analysis Pipelines + +```python +# scripts/analyze_latest_run.py +"""Analyze most recent run and generate report.""" +from aim import Repo + +repo = Repo('.') + +# Get latest run +run = sorted(repo.iter_runs(), key=lambda r: r.creation_time, reverse=True)[0] + +print(f"Latest Run: {run.hash}") +print(f"Experiment: {run.experiment}") +print(f"Model: {run['model_name']}") +print(f"\nTop Metrics:") +for name in ['logit_diff', 'accuracy', 'p_drop']: + print(f" {name}: {run.metrics()[name].last_value():.3f}") + +# Find top suppressor heads +head_metrics = {} +for track in run.metrics(): + if track.name == 'logit_diff' and track.context.get('layer') == 0: + head = track.context['head'] + head_metrics[head] = track.values.last_value() + +top_heads = sorted(head_metrics.items(), key=lambda x: x[1], reverse=True)[:5] +print(f"\nTop 5 Suppressor Heads (L0):") +for head, ld in top_heads: + print(f" Head {head}: ΔLD = {ld:.3f}") +``` + +--- + +## Migration Checklist + +- [ ] Install Aim: `pip install aim` +- [ ] Create `lab/tracking/` module +- [ ] Add `TinyLabTracker` class +- [ ] Integrate tracking into `lab/harness.py` +- [ ] Test with single experiment run +- [ ] Import historical results: `python scripts/import_to_aim.py` +- [ ] Launch UI: `aim up` +- [ ] Verify metrics, images, distributions appear correctly +- [ ] Create custom dashboards for key analyses +- [ ] Add `.aim/` to `.gitignore` +- [ ] Update documentation (DVC_SETUP.md, README.md) +- [ ] Train team on Aim UI usage + +--- + +## FAQ + +**Q: How does Aim compare to MLflow?** +A: Aim has a more modern UI, better metric comparison, and is specifically designed for ML/DL experiments. MLflow is more general-purpose with deployment features we don't need. + +**Q: Will this slow down experiments?** +A: Minimal overhead (<1% for typical runs). Logging is asynchronous. + +**Q: Can I disable tracking?** +A: Yes, just don't initialize `TinyLabTracker`. Or use env var: `TINYLAB_DISABLE_TRACKING=1`. 
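
The `TinyLabTracker` shown in Phase 1 does not itself read `TINYLAB_DISABLE_TRACKING`, so the environment-variable switch needs a small amount of glue in the harness. The following is a minimal sketch of one way to do that, not part of the plan as written: `NullTracker`, `tracking_disabled`, and `make_tracker` are hypothetical names introduced here for illustration, and the set of accepted "on" values is an assumption.

```python
import os


class NullTracker:
    """Hypothetical no-op stand-in: accepts any tracker method call and does nothing."""

    def __getattr__(self, name):
        def _noop(*args, **kwargs):
            return None
        return _noop


def tracking_disabled(environ=None) -> bool:
    """Return True when TINYLAB_DISABLE_TRACKING is set to a truthy value."""
    env = os.environ if environ is None else environ
    value = env.get("TINYLAB_DISABLE_TRACKING", "")
    # Assumption: "1", "true", and "yes" (any case) all mean "disabled".
    return value.strip().lower() in {"1", "true", "yes"}


def make_tracker(factory, environ=None):
    """Call factory() to build the real tracker unless tracking is disabled."""
    if tracking_disabled(environ):
        return NullTracker()
    return factory()
```

In the harness, the tracker would then be created with something like `make_tracker(lambda: TinyLabTracker(experiment_name=..., config=config))`, so the rest of `run_experiment` can call `log_*` and `finish()` unchanged whether or not Aim is active. Passing a factory rather than an instance keeps Aim from opening a run at all when tracking is off.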
+ +**Q: How much storage does Aim use?** +A: ~1-5MB per run for metrics/metadata. Images/distributions increase this. Use `aim runs rm RUN_HASH` to remove old runs. + +**Q: Can I query Aim from notebooks?** +A: Yes! See examples above. Full Python API available. + +**Q: How do I back up Aim data?** +A: The `.aim/` directory contains everything. You can export runs to JSON for DVC tracking or copy the entire directory. + +--- + +## Next Steps + +1. **Implement Phase 1** - Basic tracking (30 min) +2. **Test with one experiment** - Verify tracking works (15 min) +3. **Import historical data** - Run import script (30 min) +4. **Explore UI** - Familiarize with Aim interface (30 min) +5. **Add custom visualizations** - Implement MI-specific plots (2 hours) +6. **Create dashboards** - Build saved views for analyses (1 hour) +7. **Document for team** - Write usage guide (1 hour) + +**Total time:** ~5-6 hours for complete integration + +--- + +## Resources + +- **Aim Docs:** https://aimstack.readthedocs.io/ +- **Aim GitHub:** https://github.com/aimhubio/aim +- **Aim Discord:** https://community.aimstack.io/ +- **Examples:** https://github.com/aimhubio/aim/tree/main/examples + +--- + +**Document Version:** 1.0 +**Author:** Claude +**Status:** Ready for Implementation diff --git a/DVC_IMPLEMENTATION_GUIDE.md b/DVC_IMPLEMENTATION_GUIDE.md new file mode 100644 index 0000000..779abe7 --- /dev/null +++ b/DVC_IMPLEMENTATION_GUIDE.md @@ -0,0 +1,520 @@ +# DVC Implementation Guide for tinyLab + +**Status:** Ready for Implementation +**Date:** 2025-11-18 +**Branch:** `claude/migrate-dvc-tracking-013aAqNvWvh6CwHntNnxvhjo` + +## Overview + +This guide provides the complete implementation plan for migrating tinyLab to DVC (Data Version Control). All design work, documentation, and automation scripts are complete and ready for execution. + +## What Has Been Prepared + +### 1. 
Data Inventory ✅ +- Comprehensive catalog of all data files (355+ files, ~7.4 MB) +- Classification of what needs DVC tracking vs what stays in Git +- Size analysis and growth projections + +### 2. Architecture Design ✅ +- Two proposed structures (Option A: Minimal, Option B: Full) +- **Recommendation:** Option A (Minimal Restructure) for low risk and fast migration +- Future-proof design ready for S3/GCS/Azure backends + +### 3. Documentation ✅ +- **DVC_MIGRATION_DESIGN.md** - Complete architecture and design decisions +- **DVC_SETUP.md** - User guide for setup and daily workflows +- **DVC_TROUBLESHOOTING.md** - Comprehensive troubleshooting reference +- **README.md** - Updated with DVC quick start + +### 4. Automation Scripts ✅ +- **scripts/migrate_to_dvc.sh** - Fully automated migration script +- Dry-run support for safe testing +- Backup creation capability +- Comprehensive error checking + +## Implementation Steps + +### Prerequisites + +Before starting, ensure: +- [ ] You have a clean working directory (`git status` shows clean) +- [ ] You're on the correct branch (`claude/migrate-dvc-tracking-013aAqNvWvh6CwHntNnxvhjo`) +- [ ] You've reviewed the design document (DVC_MIGRATION_DESIGN.md) +- [ ] You have a backup (optional but recommended) + +### Step 1: Install DVC + +```bash +# Using pip +pip install dvc + +# Verify installation +dvc version +# Should output: 3.x.x or higher +``` + +**Troubleshooting:** If installation fails, see [DVC_TROUBLESHOOTING.md](DVC_TROUBLESHOOTING.md#installation-issues) + +### Step 2: Run Migration Script (Dry Run) + +Test the migration without making changes: + +```bash +# Dry run - see what would happen +./scripts/migrate_to_dvc.sh --dry-run + +# Dry run with backup preparation +./scripts/migrate_to_dvc.sh --dry-run --backup +``` + +Review the output carefully. 
The script will show: +- Which directories will be tracked +- What .dvc files will be created +- What changes will be made to .gitignore +- Git staging operations + +### Step 3: Create Backup (Recommended) + +```bash +# Create backup of all data +./scripts/migrate_to_dvc.sh --backup --dry-run + +# Or manually: +mkdir -p backups +tar czf backups/tinylab_pre_dvc_$(date +%Y%m%d_%H%M%S).tar.gz \ + lab/data/corpora \ + lab/data/splits \ + data/lexicons \ + reports \ + paper/supplement +``` + +### Step 4: Execute Migration + +Run the actual migration: + +```bash +# Execute migration +./scripts/migrate_to_dvc.sh + +# Or with backup +./scripts/migrate_to_dvc.sh --backup +``` + +**What happens:** +1. DVC is initialized (`.dvc/` directory created) +2. Local remote configured (`.dvcstore/`) +3. `.gitignore` updated with DVC patterns +4. Data directories tracked with DVC +5. `.dvc` pointer files created +6. Changes staged in Git + +### Step 5: Verify Migration + +Check that everything worked: + +```bash +# Check DVC status +dvc status +# Should output: "Data and pipelines are up to date." + +# List .dvc files created +find . -name "*.dvc" +# Should show: +# lab/data/corpora.dvc +# lab/data/splits.dvc +# data/lexicons/hedge_booster.json.dvc +# reports.dvc +# paper/supplement.dvc + +# Check .dvcstore size +du -sh .dvcstore +# Should be ~7-8 MB + +# Verify git status +git status +# Should show staged .dvc files and .gitignore +``` + +### Step 6: Test Data Retrieval + +Simulate a fresh clone: + +```bash +# In a temporary directory (don't do this in main repo!) +cd /tmp +git clone /home/user/tinyLab tinylab-test +cd tinylab-test + +# Install DVC +pip install dvc + +# Pull data +dvc pull + +# Verify files +ls -lh lab/data/corpora/ +ls -lh reports/ + +# Run smoke test +python smoke_test.py + +# Clean up +cd .. 
+rm -rf tinylab-test +``` + +### Step 7: Commit Changes + +If everything looks good: + +```bash +cd /home/user/tinyLab + +# Review what will be committed +git status +git diff --cached .gitignore +cat lab/data/corpora.dvc +cat reports.dvc + +# Commit DVC migration +git commit -m "Add DVC tracking for datasets, results, and artifacts + +- Initialize DVC with local remote (.dvcstore) +- Track lab/data/corpora (18 JSONL files, ~370K) +- Track lab/data/splits (18 JSON files, ~29K) +- Track data/lexicons/hedge_booster.json +- Track reports/ (298 CSV/JSON files, ~7.4MB) +- Track paper/supplement/ (20+ files, ~150K) +- Update .gitignore with DVC patterns + +All data moved to .dvcstore, .dvc pointers tracked in git. +Total tracked: ~7.4 MB across 355+ files. + +See DVC_MIGRATION_DESIGN.md for architecture details. +See DVC_SETUP.md for usage instructions." +``` + +### Step 8: Push to Remote + +Push both code and data: + +```bash +# Push code changes to GitHub +git push -u origin claude/migrate-dvc-tracking-013aAqNvWvh6CwHntNnxvhjo + +# Data is already in .dvcstore (local remote) +# When ready for S3/GCS, add remote and push: +# dvc remote add s3store s3://tinylab-data/dvc-cache +# dvc push -r s3store +``` + +### Step 9: Test Cross-Machine Reproducibility + +On a different machine (or fresh clone): + +```bash +# Clone repository +git clone tinylab-fresh +cd tinylab-fresh + +# Checkout DVC branch +git checkout claude/migrate-dvc-tracking-013aAqNvWvh6CwHntNnxvhjo + +# Install DVC +pip install dvc + +# Pull data +dvc pull + +# Verify +ls lab/data/corpora/ +ls reports/ + +# Run tests +python smoke_test.py +make postprocess +cd paper && make +``` + +## Manual Migration (If Script Fails) + +If the automated script fails, follow these manual steps: + +### 1. Initialize DVC +```bash +dvc init +``` + +### 2. Configure Local Remote +```bash +dvc remote add localstore .dvcstore --local +dvc remote default localstore +``` + +### 3. 
Update .gitignore + +Add to `.gitignore`: +```gitignore +# DVC +/.dvcstore/ +/reports/*.csv +/reports/*.json +/reports/layer_sweep_* +/reports/appendices +/reports/pythia_layer*_vdi_drift* +/lab/data/corpora +/lab/data/splits +/data/lexicons/*.json +/paper/supplement/*.json +/paper/supplement/*.csv +/paper/supplement/cuda_validation +``` + +### 4. Track Data with DVC +```bash +dvc add lab/data/corpora +dvc add lab/data/splits +dvc add data/lexicons/hedge_booster.json +dvc add reports +dvc add paper/supplement +``` + +### 5. Stage Git Changes +```bash +git add .dvc/.gitignore .dvc/config +git add .gitignore +git add lab/data/corpora.dvc +git add lab/data/splits.dvc +git add data/lexicons/hedge_booster.json.dvc +git add reports.dvc +git add paper/supplement.dvc +``` + +### 6. Commit +```bash +git commit -m "Add DVC tracking for datasets and results" +``` + +## Post-Migration Tasks + +### Update Documentation + +1. **Update REPLICATION.md** + - Add DVC installation step + - Add `dvc pull` before running experiments + +2. **Update QUICKSTART.md** + - Mention DVC setup after environment setup + +3. **Update CI/CD** (if applicable) + - Add DVC installation to CI workflows + - Add `dvc pull` before running tests + +### Team Onboarding + +Share with team: +1. Link to [DVC_SETUP.md](DVC_SETUP.md) +2. Quick start: `pip install dvc && dvc pull` +3. When to use DVC: "Always `dvc pull` after `git pull`" + +### Monitor and Maintain + +1. **Check .dvcstore size regularly:** + ```bash + du -sh .dvcstore + ``` + +2. **Garbage collect old versions:** + ```bash + dvc gc -w # Remove unused cached data + ``` + +3. 
**Monitor Git repo size:** + ```bash + du -sh .git + # Should stay small (only .dvc pointer files) + ``` + +## Migration to Cloud Storage (Future) + +When ready to migrate to S3/GCS/Azure: + +### Option 1: AWS S3 + +```bash +# Add S3 remote +dvc remote add s3store s3://tinylab-data/dvc-cache +dvc remote modify s3store region us-west-2 + +# Configure credentials (use environment variables) +export AWS_ACCESS_KEY_ID=xxx +export AWS_SECRET_ACCESS_KEY=yyy + +# Push data to S3 +dvc push -r s3store + +# Set as default remote +dvc remote default s3store + +# Update .dvc/config in git +git add .dvc/config +git commit -m "Set S3 as default DVC remote" +``` + +### Option 2: Google Cloud Storage + +```bash +# Add GCS remote +dvc remote add gcsstore gs://tinylab-data/dvc-cache + +# Authenticate +gcloud auth application-default login + +# Push data +dvc push -r gcsstore + +# Set as default +dvc remote default gcsstore +``` + +### Option 3: Azure Blob Storage + +```bash +# Add Azure remote +dvc remote add azurestore azure://tinylab-data/dvc-cache +dvc remote modify azurestore account_name + +# Set credentials +export AZURE_STORAGE_ACCOUNT= +export AZURE_STORAGE_KEY= + +# Push data +dvc push -r azurestore +``` + +## Rollback Procedure + +If you need to undo the migration: + +### Option 1: Git Reset (Before Push) + +```bash +# Reset to before DVC commit +git reset HEAD~1 + +# Remove DVC initialization +rm -rf .dvc .dvcstore + +# Restore .gitignore +git checkout HEAD .gitignore + +# Data files should still be present +ls lab/data/corpora/ +``` + +### Option 2: Restore from Backup + +```bash +# Extract backup +tar xzf backups/tinylab_pre_dvc_YYYYMMDD_HHMMSS.tar.gz + +# Remove DVC metadata and pointer files at every depth +rm -rf .dvc .dvcstore +find . -name "*.dvc" -type f -delete + +# Reset .gitignore +git checkout origin/main .gitignore +``` + +### Option 3: Revert Commit (After Push) + +```bash +# Revert the DVC migration commit +git revert + +# Remove DVC files +rm -rf .dvc .dvcstore +``` + +## Success Criteria + +Migration is successful when: 
+ +- ✅ `dvc status` shows "Data and pipelines are up to date" +- ✅ All `.dvc` files created and tracked in Git +- ✅ `.dvcstore/` directory created and gitignored +- ✅ Data files gitignored (CSV, JSON in reports/, etc.) +- ✅ `dvc pull` works in fresh clone +- ✅ `python smoke_test.py` passes +- ✅ `make postprocess` completes successfully +- ✅ `cd paper && make` generates PDF +- ✅ Git repository size reasonable (<50MB) +- ✅ `.dvcstore` size matches expected (~7-8MB) + +## Troubleshooting + +For issues during migration, see [DVC_TROUBLESHOOTING.md](DVC_TROUBLESHOOTING.md). + +Common issues: +- **DVC installation fails** → See [Installation Issues](DVC_TROUBLESHOOTING.md#installation-issues) +- **`dvc pull` fails** → See [Data Retrieval Problems](DVC_TROUBLESHOOTING.md#data-retrieval-problems) +- **Git repo too large** → See [Git Integration Issues](DVC_TROUBLESHOOTING.md#git-integration-issues) + +## Support + +- **Documentation:** See all `DVC_*.md` files in repository root +- **DVC Docs:** https://dvc.org/doc +- **Issues:** File issues on GitHub with `[DVC]` prefix +- **Questions:** Check [DVC_TROUBLESHOOTING.md](DVC_TROUBLESHOOTING.md) first + +## Files Created + +This migration preparation includes: + +| File | Purpose | +|------|---------| +| `DVC_MIGRATION_DESIGN.md` | Architecture and design decisions | +| `DVC_SETUP.md` | User guide for setup and workflows | +| `DVC_TROUBLESHOOTING.md` | Troubleshooting reference | +| `DVC_IMPLEMENTATION_GUIDE.md` | This file - step-by-step implementation | +| `scripts/migrate_to_dvc.sh` | Automated migration script | +| `README.md` | Updated with DVC quick start | + +## Timeline Estimate + +- **Preparation (Review):** 30 minutes +- **Migration Execution:** 10 minutes +- **Verification:** 15 minutes +- **Testing:** 20 minutes +- **Documentation Updates:** 15 minutes +- **Total:** ~1.5 hours + +## Next Steps + +1. **Review** this guide and [DVC_MIGRATION_DESIGN.md](DVC_MIGRATION_DESIGN.md) +2. 
**Install** DVC: `pip install dvc` +3. **Test** migration: `./scripts/migrate_to_dvc.sh --dry-run` +4. **Execute** migration: `./scripts/migrate_to_dvc.sh --backup` +5. **Verify** and commit changes +6. **Push** to remote: `git push` +7. **Test** on fresh clone +8. **Celebrate** 🎉 - Your data is now version controlled! + +--- + +**Questions or Issues?** + +1. Check [DVC_TROUBLESHOOTING.md](DVC_TROUBLESHOOTING.md) +2. Review [DVC_SETUP.md](DVC_SETUP.md) +3. See DVC documentation: https://dvc.org/doc +4. File a GitHub issue with `[DVC]` prefix + +**Ready to proceed?** Follow the steps above to implement DVC tracking. + +--- + +**Document Version:** 1.0 +**Last Updated:** 2025-11-18 +**Author:** Claude +**Status:** Ready for Implementation diff --git a/DVC_MIGRATION_DESIGN.md b/DVC_MIGRATION_DESIGN.md new file mode 100644 index 0000000..1618914 --- /dev/null +++ b/DVC_MIGRATION_DESIGN.md @@ -0,0 +1,703 @@ +# DVC Migration Design for tinyLab + +## Executive Summary + +This document outlines the design for migrating tinyLab to use DVC (Data Version Control) for all datasets, checkpoints, logs, and artifacts. The design prioritizes: +1. **Minimal code changes** - preserve existing workflows +2. **Clear organization** - logical grouping by purpose +3. **Future-proof** - ready for S3/GCS/Azure backends +4. 
**Reversibility** - all changes staged behind git branches + +## Current State Inventory + +### Data Currently in Git (to be moved to DVC) + +| Category | Location | Files | Size | Purpose | +|----------|----------|-------|------|---------| +| Raw Data | `lab/data/corpora/` | 18 | ~370K | JSONL datasets (facts, counterfactual, logical, negation) | +| Data Splits | `lab/data/splits/` | 18 | ~29K | Train/val/test indices | +| Lexicons | `data/lexicons/` | 1 | 949B | Hedge/booster word lists | +| Stage-1A Data | `lab/data/task_b_weekdays.jsonl` | 1 | ~10K | Task-B weekday modular addition data | +| Results (CSV) | `reports/` | 161+ | ~4.5MB | Head rankings, layer sweeps, summaries | +| Results (JSON) | `reports/` | 137+ | ~2.8MB | Metrics, analyses, manifests | +| Stage-1A Results | `reports/task_b_circularity_*.json`, `reports/pilot_stage1a/` | 5+ | ~50K | Circularity summaries, VDI runs | +| Paper Supplements | `paper/supplement/` | 20+ | ~150K | Bootstrap CI, calibration, validation data | + +**Total data to track with DVC: ~7.5 MB across 360+ files** (including Stage-1A pilot artifacts) + +### Data Already Gitignored (stays ignored) + +- `lab/runs/*` - Empty (only .gitkeep) +- `mlruns/*` - Empty (only .gitkeep) +- `*.png`, `*.html`, `*.pdf` - Generated plots +- `*.log` - Generated logs +- `*.ipynb` - Jupyter notebooks + +### Code/Config (stays in Git) + +- `lab/configs/*.json` - 100 experiment configs (~124K) +- All Python scripts (analysis, training, figures) +- LaTeX source files +- Documentation and makefiles + +--- + +## Proposed Directory Structure + +### Option A: Minimal Restructure (RECOMMENDED) + +Keep existing paths but organize DVC tracking by purpose. Minimal code changes required. 
+ +``` +tinyLab/ +├── .dvc/ # DVC configuration +├── .dvcstore/ # Local DVC cache (gitignored) +│ +├── data/ # Raw data - DVC tracked +│ ├── lexicons/ # [DVC] Lexicon files +│ │ └── hedge_booster.json +│ └── README.md # Documents data sources +│ +├── lab/ +│ ├── data/ # Lab datasets - DVC tracked +│ │ ├── corpora/ # [DVC] Raw experimental corpora +│ │ │ ├── facts_*.jsonl +│ │ │ ├── counterfactual_*.jsonl +│ │ │ ├── logical_*.jsonl +│ │ │ └── negation_*.jsonl +│ │ └── splits/ # [DVC] Processed train/test splits +│ │ └── *.split.json +│ │ +│ ├── configs/ # [GIT] Experiment configurations +│ ├── analysis/ # [GIT] Analysis scripts +│ ├── runs/ # [IGNORED] Generated training runs +│ └── tests/ # [GIT] Test files +│ +├── reports/ # Results - DVC tracked +│ ├── *.csv # [DVC] All ranking CSVs +│ ├── *.json # [DVC] All metric JSONs +│ ├── layer_sweep_*/ # [DVC] Layer sweep subdirs +│ ├── appendices/ # [DVC] Additional analyses +│ ├── RESULTS_MANIFEST.json # [DVC] Master results index +│ └── README.md # Documents results structure +│ +├── paper/ +│ ├── sections/ # [GIT] LaTeX source +│ ├── scripts/ # [GIT] Figure generation scripts +│ ├── supplement/ # Paper supplement data - DVC tracked +│ │ ├── *.json # [DVC] Supplement metrics +│ │ └── cuda_validation/ # [DVC] CUDA validation results +│ └── generated/ # [IGNORED] Auto-generated content +│ +├── mlruns/ # [IGNORED] MLflow tracking +├── figs/ # [GIT] Figure descriptions (md) +│ # [IGNORED] Rendered plots (png/pdf) +├── docs/ # [GIT] Documentation +└── devlog/ # [GIT] Development logs +``` + +**DVC Tracking Scheme:** +- `data/lexicons/*.json` → Track individual files +- `lab/data/corpora/` → Track entire directory (includes task_b_weekdays.jsonl) +- `lab/data/splits/` → Track entire directory +- `reports/` → Track entire directory (includes all CSV/JSON, Stage-1A circularity/VDI results) +- `paper/supplement/` → Track entire directory + +--- + +### Option B: Full Reorganization (More disruptive) + +Complete restructure 
following canonical data science layout. Requires updating all import paths. + +``` +tinyLab/ +├── .dvc/ +├── .dvcstore/ # Local DVC cache +│ +├── data/ # ALL data - DVC tracked +│ ├── raw/ # Raw immutable data +│ │ ├── corpora/ # [DVC] Moved from lab/data/corpora/ +│ │ │ ├── facts/ +│ │ │ ├── counterfactual/ +│ │ │ ├── logical/ +│ │ │ └── negation/ +│ │ └── lexicons/ # [DVC] Moved from data/lexicons/ +│ │ +│ └── processed/ # Derived/transformed data +│ └── splits/ # [DVC] Moved from lab/data/splits/ +│ +├── results/ # Renamed from reports/ - DVC tracked +│ ├── metrics/ # [DVC] JSON metric files +│ ├── rankings/ # [DVC] CSV ranking files +│ ├── analyses/ # [DVC] Specialized analyses +│ │ ├── layer_sweep/ +│ │ ├── vdi_drift/ +│ │ ├── entropy/ +│ │ └── ov_reports/ +│ └── MANIFEST.json # Master index +│ +├── models/ # For future model artifacts +│ └── checkpoints/ # [DVC] Training checkpoints (currently empty) +│ +├── lab/ # Experimental code +│ ├── configs/ # [GIT] Experiment configs +│ ├── analysis/ # [GIT] Analysis scripts +│ └── tests/ # [GIT] Test files +│ +├── paper/ +│ ├── supplement/ # [DVC] Paper supplement data +│ └── sections/ # [GIT] LaTeX source +│ +├── logs/ # Execution logs +│ ├── mlruns/ # [IGNORED] MLflow runs +│ └── training/ # [IGNORED] Training logs +│ +└── notebooks/ # [IGNORED] Jupyter notebooks +``` + +**Migration Impact:** +- Requires updating ~30 analysis scripts +- Need to update Makefile paths +- Configuration files need path updates +- More maintenance but cleaner long-term + +--- + +## Recommendation: Option A (Minimal Restructure) + +**Rationale:** +1. **Low risk** - existing code continues to work +2. **Fast migration** - can be completed in hours, not days +3. **Reversible** - easy to rollback if needed +4. 
**Sufficient** - achieves all DVC goals without unnecessary complexity + +The current structure is already reasonably well-organized: +- `lab/data/` clearly separates experimental data +- `reports/` is an established convention +- `paper/supplement/` is logically placed + +We can achieve clean DVC tracking without restructuring. + +--- + +## DVC Configuration + +### DVC Remote Structure + +```bash +# Local remote inside repository (git-ignored) +.dvcstore/ + ├── files/ + │ └── md5/ # Content-addressable storage + │ ├── ab/ + │ │ └── cdef123... + │ └── ... + └── tmp/ +``` + +**Configuration:** +```bash +# .dvc/config.local +[core] + remote = localstore + +[remote "localstore"] + url = .dvcstore +``` + +**Future S3 Migration:** +```bash +# Just add remote and push +dvc remote add s3store s3://tinylab-data/ +dvc remote default s3store +dvc push +``` + +### .gitignore Updates + +Add to `.gitignore`: +```gitignore +# DVC +/reports/*.csv +/reports/*.json +/reports/layer_sweep_* +/reports/appendices +/lab/data/corpora +/lab/data/splits +/data/lexicons +/paper/supplement/*.json +/paper/supplement/*.csv +/paper/supplement/cuda_validation +.dvcstore/ +``` + +Keep tracking: +- `*.dvc` files (DVC pointers) +- `.dvc/config` (DVC configuration) +- `.dvc/.gitignore` + +--- + +## DVC Tracking Strategy + +### Granularity Decision Matrix + +| Directory | Strategy | Rationale | +|-----------|----------|-----------| +| `lab/data/corpora/` | Single `.dvc` for entire dir | Files change together, versioned as unit | +| `lab/data/splits/` | Single `.dvc` for entire dir | Derived from corpora, versioned together | +| `data/lexicons/` | Individual `.dvc` per file | Small, independent files | +| `reports/` | Single `.dvc` for entire dir | Results regenerated together, large file count | +| `paper/supplement/` | Single `.dvc` for entire dir | Small, versioned with paper | + +### Directory-Level Tracking + +```bash +# Track entire directories +dvc add lab/data/corpora +dvc add lab/data/splits 
+dvc add reports +dvc add paper/supplement + +# Track individual files +dvc add data/lexicons/hedge_booster.json +``` + +**Generated artifacts:** +``` +lab/data/corpora.dvc # Pointer file (goes in git) +lab/data/splits.dvc # Pointer file (goes in git) +reports.dvc # Pointer file (goes in git) +paper/supplement.dvc # Pointer file (goes in git) +data/lexicons/hedge_booster.json.dvc # Pointer file (goes in git) +``` + +--- + +## Migration Workflow + +### Phase 1: Preparation (No changes to data files) + +1. Create branch: `git checkout -b dvc-migration` +2. Install DVC: `pip install dvc` +3. Initialize DVC: `dvc init` +4. Configure local remote: + ```bash + dvc remote add localstore .dvcstore --local + dvc remote default localstore + ``` +5. Update `.gitignore` with DVC patterns + +### Phase 2: Add DVC Tracking + +**Track data directories:** +```bash +# Add DVC tracking (data cached under .dvc/cache, .dvc pointers created) +dvc add lab/data/corpora +dvc add lab/data/splits +dvc add data/lexicons/hedge_booster.json +dvc add reports +dvc add paper/supplement + +# Check what was created +ls -la lab/data/*.dvc +ls -la *.dvc +ls -la paper/*.dvc +``` + +**Commit DVC pointers:** +```bash +git add lab/data/corpora.dvc lab/data/splits.dvc +git add data/lexicons/hedge_booster.json.dvc +git add reports.dvc paper/supplement.dvc +git add .gitignore .dvc/config .dvc/.gitignore +git commit -m "Add DVC tracking for datasets, results, and supplements" +``` + +### Phase 3: Verification + +**Test data retrieval:** +```bash +# Remove data (simulate fresh clone) +rm -rf lab/data/corpora lab/data/splits reports paper/supplement +rm -f data/lexicons/hedge_booster.json + +# Restore from DVC +dvc pull + +# Verify all files restored +ls lab/data/corpora/*.jsonl +ls lab/data/splits/*.json +ls reports/*.csv +ls paper/supplement/*.json +``` + +**Test reproducibility:** +```bash +# Run smoke test +python smoke_test.py + +# Run single analysis +python lab/analysis/export_head_rankings.py + +# Verify 
outputs match +``` + +### Phase 4: Documentation and Push + +```bash +# Create comprehensive docs +# (see Documentation section below) + +# Push to remote +git push -u origin dvc-migration + +# Create pull request for review +``` + +--- + +## Data Flows and Dependencies + +### Data Generation Pipeline + +``` +Raw Data (DVC) + ↓ +lab/data/corpora/*.jsonl + ↓ +[scripts/facts_make_split.py] + ↓ +lab/data/splits/*.json (DVC) + ↓ +[lab/analysis/*.py scripts] + ↓ +reports/*.csv + *.json (DVC) + ↓ +[paper/scripts/*.py] + ↓ +paper/supplement/*.json (DVC) + ↓ +[pdflatex] + ↓ +paper/main.pdf (IGNORED) +``` + +### Reproducibility Requirements + +To regenerate all results from scratch: + +```bash +# 1. Clone repository +git clone && cd tinyLab + +# 2. Restore data +dvc pull + +# 3. Install dependencies +pip install -e . + +# 4. Run analyses +make postprocess + +# 5. Generate paper +cd paper && make +``` + +**Critical insight:** Only raw data and splits need DVC tracking. Results can be regenerated via `make postprocess`, but we track them anyway for: +- **Speed** - Avoid re-running expensive analyses +- **Reproducibility** - Preserve exact results for papers +- **Collaboration** - Share results without re-computation + +--- + +## Documentation Requirements + +### 1. DVC_SETUP.md (New file) + +```markdown +# DVC Setup Guide for tinyLab + +## Installation + +# Prerequisites +- Python 3.11+ +- Git + +# Install DVC +pip install dvc + +## First-time Setup (after cloning) + +# Pull all data +dvc pull + +# Verify +ls lab/data/corpora/*.jsonl +ls reports/*.csv + +## Adding New Data + +# Track new dataset +dvc add data/new_dataset.csv +git add data/new_dataset.csv.dvc +git commit -m "Add new dataset" + +## Updating Tracked Data + +# Modify data, then update tracking +dvc add reports/ +git add reports.dvc +git commit -m "Update results after experiment X" + +## Troubleshooting + +See docs/DVC_TROUBLESHOOTING.md +``` + +### 2. 
Update README.md + +Add DVC section: +```markdown +## Data Management with DVC + +This project uses DVC to manage datasets and results. After cloning: + +\`\`\`bash +pip install dvc +dvc pull +\`\`\` + +See [DVC_SETUP.md](DVC_SETUP.md) for detailed instructions. +``` + +### 3. Update REPLICATION.md + +Add DVC step: +```markdown +## Replication Steps + +1. Clone repository +2. **Pull data with DVC**: `dvc pull` +3. Install dependencies: `pip install -e .` +4. Run experiments: `make postprocess` +``` + +--- + +## Migration Risks and Mitigations + +### Risk 1: Large file count in single .dvc file + +**Issue:** `reports/` has 298 files behind one pointer. Any change to a single file changes the directory hash, so `reports.dvc` is rewritten on every results update (DVC still transfers only the files whose content changed). + +**Mitigation:** +- Acceptable for ~7MB total size +- Can split later if needed: `reports/csv.dvc` + `reports/json.dvc` +- Monitor with `dvc status` + +### Risk 2: Git repository growth + +**Issue:** Multiple versions of `.dvc` files increase git repo size. + +**Mitigation:** +- `.dvc` files are tiny (~100 bytes each) +- Only 5 `.dvc` files total +- Git handles small text files efficiently + +### Risk 3: Accidental data loss + +**Issue:** Once the data is gitignored, it exists only in the workspace, the DVC cache (`.dvc/cache`), and, after `dvc push`, `.dvcstore`. All three are local, so deleting them loses the data. + +**Mitigation:** +- Create backup before migration: `tar czf tinylab-backup.tar.gz reports/ lab/data/` +- Test `dvc pull` restoration before deleting original data +- Keep branch protection on main/master + +### Risk 4: Path breakage + +**Issue:** Scripts might hardcode paths that DVC changes. + +**Mitigation:** +- Option A (recommended) doesn't change any paths +- DVC links or copies cached files back to their original locations, so paths remain valid +- Test suite runs before/after migration + +### Risk 5: Merge conflicts with .dvc files + +**Issue:** Two branches updating the same data create conflicts in `.dvc` files. 
+ +**Mitigation:** +- `.dvc` files are small YAML files, easy to merge +- Use `dvc diff` to understand changes +- Document conflict resolution in DVC_SETUP.md + +--- + +## Testing Checklist + +Before considering migration complete: + +- [ ] `dvc status` reports data and pipelines up to date +- [ ] `dvc push` succeeds to localstore +- [ ] `dvc pull` restores all files correctly +- [ ] `python smoke_test.py` passes +- [ ] `make postprocess` completes without errors +- [ ] `cd paper && make` generates PDF +- [ ] All analysis scripts run successfully +- [ ] Git repository size reasonable (<50MB) +- [ ] `.dvcstore` size matches expected (~7-8MB) +- [ ] Fresh clone + `dvc pull` works on different machine +- [ ] Documentation clear and complete + +--- + +## Future Enhancements + +### Phase 2: Cloud Storage (S3/GCS) + +```bash +# Add S3 remote +dvc remote add s3store s3://tinylab-data/dvc-cache +dvc remote default s3store + +# Configure access (--local keeps credentials out of the committed .dvc/config) +dvc remote modify --local s3store access_key_id XXX +dvc remote modify --local s3store secret_access_key YYY + +# Push to S3 +dvc push +``` + +### Phase 3: Data Versioning + +```bash +# Tag dataset versions +git tag -a data-v1.0 -m "Initial dataset release" +git tag -a data-v1.1 -m "Added balanced variants" + +# Checkout specific version +git checkout data-v1.0 +dvc checkout +``` + +### Phase 4: Pipelines (Optional) + +Define data pipelines in `dvc.yaml`: +```yaml +stages: + split_data: + cmd: python scripts/facts_make_split.py + deps: + - lab/data/corpora/ + outs: + - lab/data/splits/ + + analyze: + cmd: python lab/analysis/head_rank_stats.py + deps: + - lab/data/splits/ + outs: + - reports/h1_head_rank_stats.json +``` + +Run with: `dvc repro` + +--- + +## Appendix: File Size Analysis + +### Files by Size Category + +| Size Range | Count | Category | DVC Strategy | +|------------|-------|----------|--------------| +| < 1KB | 45 | Config JSON, small JSONs | Track individually or as dir | +| 1-10KB | 89 | Data splits, small metrics | Track as directory | +| 
10-50KB | 156 | Corpora, CSVs, metric JSONs | Track as directory | +| 50-100KB | 48 | Large CSVs, result manifests | Track as directory | +| 100KB-1MB | 15 | Large result files | Track as directory | +| > 1MB | 2 | Comprehensive reports | Track as directory | + +**Total:** 355 files, ~7.4 MB + +### Growth Projections + +**Conservative (1 year):** +- New experiments: 10 runs/month × 12 months = 120 runs +- New results: ~200KB per run = 24MB +- New checkpoints: 0 (using pretrained models) +- **Total:** ~31MB + +**Aggressive (1 year):** +- New experiments: 50 runs/month × 12 months = 600 runs +- New results: ~200KB per run = 120MB +- Model fine-tuning: 5 checkpoints × 500MB = 2.5GB +- **Total:** ~2.6GB + +**Conclusion:** Even aggressive growth is manageable with S3/GCS backends. + +--- + +## Appendix: DVC Commands Reference + +### Essential Commands + +```bash +# Initialize +dvc init + +# Track data +dvc add + +# Save changes +git add .dvc .gitignore +git commit -m "Track with DVC" + +# Push/pull data +dvc push # Upload to remote +dvc pull # Download from remote + +# Status +dvc status # Check for changes +dvc diff # Compare versions + +# Restore data +dvc checkout # Restore to committed version +dvc fetch # Download without checking out +``` + +### Advanced Commands + +```bash +# Remote management +dvc remote add +dvc remote modify