ranabir/SuperAdditive_Vector_Collusion


Activation Steering Safety Pilot

Research Motivation

Recent safety research investigates whether AI systems could exhibit heightened risks when multiple individually benign failure modes or steering vectors combine. This project explores whether "two individually low-risk steering vectors can compose into a superadditive safety failure, even when neither vector is risky alone."

Pilot Hypothesis

The baseline hypothesis evaluated in this pilot: composing two harmless "stylistic" or "behavioral" vectors (e.g., conciseness + direct-answer-first) might produce an unexpectedly strong capability degradation or increase in harmful compliance, exceeding the additive risk profile.
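The superadditivity notion above can be made concrete as a simple arithmetic check: a pair is superadditive when its effect exceeds the sum of each vector's individual shift over the unsteered baseline. A minimal sketch of that check (the function name and the example scores are illustrative, not from the repository):

```python
def superadditive_gain(base: float, single_a: float, single_b: float, pair: float) -> float:
    """Gain of the composed pair over the additive expectation.

    The additive expectation assumes each vector shifts the metric
    independently from the no-steering baseline.
    """
    expected = base + (single_a - base) + (single_b - base)
    return pair - expected

# Illustrative numbers: each vector alone barely moves a risk score,
# but the pair moves it far beyond the sum of the individual shifts.
gain = superadditive_gain(base=0.10, single_a=0.12, single_b=0.13, pair=0.40)
# gain ≈ 0.25 → superadditive
```

A gain near zero means the pair behaves additively; a clearly positive gain is the failure mode this pilot probes for.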

What This Repository Provides (Current Support)

  • Contrastive Vector Extraction: Derives steering vectors from local model activations on contrastive example pairs.
  • Hook-based Steering: PyTorch forward hooks over a Hugging Face CausalLM pipeline.
  • Modular Configs: Run experiments (extraction, single-vector validation, and pairwise composition) via YAML config files.
  • Reporting: CSV tabular output and Matplotlib bar charts for superadditive-gain checks.

What Is Not Yet Implemented

  • Full benchmark suites (e.g., StrongREJECT, HarmBench) are not yet fully wrapped; the framework relies on sample eval schemas for quick starts.
  • Automated thresholding and native LLM-as-a-judge scoring (basic placeholders are included).
  • Native multi-layer vector application (composition currently assumes a single target layer).

Installation Instructions

  1. Clone and enter the directory:
    git clone <repo-url>
    cd TwoVectorCollusion
  2. Install requirements:
    pip install -r requirements.txt
  3. Local module linking (Optional):
    pip install -e .
    (Note: Ensure your PYTHONPATH includes src if not using pip install)

Repository Layout

project_root/
  configs/           # YAML experiment settings
  data/samples/      # Toy local dataset examples (jsonl)
  outputs/           # Saved run logs and figures
  src/               # Core packages
    activation_steering_safety/
        experiments/ # CLI hooks and logic blocks
        eval/        # Evaluation routines
        steering/    # Composition logic
  tests/             # Unit tests (to be populated)

Data Format

Vector Extraction (Contrastive)

Records are expected in JSONL format (one JSON object per line):

{
  "id": "c_1",
  "behavior": "concise",
  "base_prompt": "Explain overfitting in machine learning.",
  "positive_response": "Overfitting happens when a model...",
  "negative_response": "Overfitting is a fundamental concept..."
}
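A record like the one above can be loaded and validated with the standard library. The field names follow the schema shown; the loader itself is a sketch, not the repository's code:

```python
import json

REQUIRED_FIELDS = {"id", "behavior", "base_prompt",
                   "positive_response", "negative_response"}

def load_contrastive_records(path):
    """Yield validated contrastive records from a JSONL file (one object per line)."""
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
            yield record
```

Validating up front keeps extraction runs from failing midway through a dataset.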

End-to-End Workflow

  1. Build Data: Verify samples exist in data/samples/.
  2. Extract Vectors: Extract the single-behavior offsets.
  3. Validate Singles: Run inference with each single vector attached to confirm base capabilities remain intact.
  4. Run Pairs: Dynamically compose the vectors and capture superadditivity metrics.
  5. Inspect: Summarize scores and run plots.
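The composition in step 4 can be as simple as a scaled elementwise sum of the two vectors. A sketch under that assumption (the function name and coefficients are illustrative; the repository's steering/ package holds the actual logic):

```python
def compose_vectors(vec_a, vec_b, alpha_a=1.0, alpha_b=1.0):
    """Linearly combine two steering vectors as an elementwise scaled sum."""
    if len(vec_a) != len(vec_b):
        raise ValueError("steering vectors must share the hidden dimension")
    return [alpha_a * a + alpha_b * b for a, b in zip(vec_a, vec_b)]

combined = compose_vectors([0.1, -0.2, 0.0], [0.3, 0.1, -0.1])
# ≈ [0.4, -0.1, -0.1] up to float rounding
```

Sweeping alpha_a and alpha_b is one natural way to probe where the superadditive regime begins.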

Example Commands

You must prefix commands with PYTHONPATH=src if you didn't install the module via pip:

  1. Extract vectors

    PYTHONPATH=src python3 -m activation_steering_safety.cli.main extract-vectors --config configs/experiments/extract_pilot.yaml
  2. Validate single vectors

    PYTHONPATH=src python3 -m activation_steering_safety.cli.main validate-singles --config configs/experiments/validate_pilot.yaml
  3. Run pairwise experiments

    PYTHONPATH=src python3 -m activation_steering_safety.cli.main run-pairs --config configs/experiments/pairs_pilot.yaml
  4. Summarize and plot

    PYTHONPATH=src python3 -m activation_steering_safety.cli.main summarize --run-dir outputs/runs/pairs/

Output Files

  • outputs/vectors/*.pt: Saved vector extraction tensors.
  • *_meta.json: Extraction metadata (layer, source model, strategy).
  • outputs/runs/pairs/superadditivity_summary.csv: Core comparison output: the observed pairwise score (rAB) minus the expected additive baseline.
  • plots/*.png: Visualized gains.
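The summary CSV can be inspected with the standard library before plotting. This sketch assumes a `gain` column holding the superadditive gain per pair; the repository's actual column names may differ, so check superadditivity_summary.csv first:

```python
import csv

def top_gains(csv_path, n=5):
    """Return the n rows with the largest superadditive gain.

    Assumes a 'gain' column (an assumption about the summary schema,
    not confirmed by the repository).
    """
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    return sorted(rows, key=lambda r: float(r["gain"]), reverse=True)[:n]
```

Sorting by gain surfaces the vector pairs most worth manual inspection.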

Future Extensions

  • Model Families: Hooks for Qwen, Gemma, and Mixtral.
  • Judge Interfaces: Plugging in an external judge model to replace the heuristic evaluation logic.
  • Optimization: Token-aggregation strategies beyond last_token, plus batched implementations.
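The last_token strategy mentioned above reads the activation at the final position; a mean strategy would average over positions. A sketch over a toy (seq_len x hidden) activation matrix, with plain lists standing in for tensors:

```python
def aggregate(activations, strategy="last_token"):
    """Collapse per-token activations (seq_len x hidden) into one vector."""
    if strategy == "last_token":
        return list(activations[-1])
    if strategy == "mean":
        hidden = len(activations[0])
        seq_len = len(activations)
        return [sum(tok[i] for tok in activations) / seq_len for i in range(hidden)]
    raise ValueError(f"unknown strategy: {strategy}")

acts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three tokens, hidden size 2
assert aggregate(acts) == [5.0, 6.0]          # last token only
assert aggregate(acts, "mean") == [3.0, 4.0]  # mean over tokens
```

Mean aggregation tends to be less sensitive to the final token's content, which matters when prompts end in boilerplate.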

Troubleshooting

  • Memory Errors: Try explicitly setting device: "mps" instead of auto when scaling up on Apple Silicon, or drastically reduce max_new_tokens during generation.
  • Imports Missing: export PYTHONPATH=src in your terminal.
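The device setting above can be resolved in code with a small helper. This is a generic sketch, not the repository's loader; the "auto" fallback order is an assumption:

```python
import torch

def pick_device(preferred="auto"):
    """Resolve a device string; 'auto' prefers CUDA, then Apple MPS, then CPU."""
    if preferred != "auto":
        return preferred  # honor an explicit setting like "mps" or "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

Pinning "mps" explicitly skips the auto-detection and avoids surprises on Apple Silicon.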
