Recent safety research investigates whether AI systems could exhibit heightened risks when multiple individually benign failure modes or steering vectors combine. This project explores whether "two individually low-risk steering vectors can compose into a superadditive safety failure, even when neither vector is risky alone."
The baseline hypothesis evaluated in this pilot: Composing two harmless "stylistic" or "behavioral" vectors (e.g., conciseness + direct answer first) might produce an unexpectedly strong capability degradation or harmful compliance increase, exceeding the additive risk profile.
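The additive-risk comparison above can be sketched as a tiny scoring helper. The function and the scalar "risk score" interface are illustrative assumptions, not the framework's actual API:

```python
def superadditive_gain(r_base, r_a, r_b, r_ab):
    """Gain of the composed run over the additive expectation.

    Hypothetical metric: each r_* is a scalar risk score (e.g. a
    harmful-compliance rate) for the unsteered model, each single
    vector, and the two-vector composition. The additive expectation
    credits each vector only its own marginal effect over baseline.
    """
    expected = r_base + (r_a - r_base) + (r_b - r_base)
    return r_ab - expected

# A pair is flagged superadditive when the gain is clearly positive,
# e.g. superadditive_gain(0.02, 0.04, 0.05, 0.15) is approximately 0.08.
```

Any consistent risk metric can be plugged in; the point is only that the composed score is compared against the sum of marginal effects, not against either single-vector score alone.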
- Contrastive Vector Extraction: Local activation targeting from a set of contrastive examples.
- Hook-based Steering: PyTorch hooks over a Hugging Face CausalLM pipeline.
- Modular Configs: Run experiments (extraction, single-vector validation, and pairwise composition) easily via YAML config passing.
- Reporting: Basic CSV tabular output and Matplotlib bar charts for superadditive gain checks.
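Hook-based steering of the kind listed above can be sketched in plain PyTorch. The toy `nn.Linear` stands in for a transformer block (in the real pipeline the hook would attach to a Hugging Face CausalLM layer); the vector names and coefficients are illustrative, not the package's API:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer block's output; in practice the hook
# would be registered on a CausalLM decoder layer at the chosen depth.
block = nn.Linear(16, 16)

# Two steering vectors (illustrative) plus per-vector scaling coefficients.
v_a = torch.randn(16)
v_b = torch.randn(16)

def make_steering_hook(vectors, coeffs):
    """Return a forward hook that adds the scaled vectors to the output."""
    def hook(module, inputs, output):
        for v, c in zip(vectors, coeffs):
            output = output + c * v
        return output  # returning a value overrides the module's output
    return hook

x = torch.randn(2, 16)
handle = block.register_forward_hook(make_steering_hook([v_a, v_b], [1.0, 1.0]))
steered = block(x)
handle.remove()  # always detach hooks after the steered generation
baseline = block(x)
```

Composition here is simply summing the scaled vectors inside one hook at a single layer, which matches the single-layer baseline described below.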
- Full benchmark suites (e.g., StrongREJECT, HarmBench) are not yet fully wrapped; the framework instead relies on sample eval schemas for quick starts.
- Refined automated thresholding or native LLM-as-a-judge score matching (basic placeholders are included).
- Native multi-layer distributed vector application (the composition baseline currently assumes a single layer).
- Clone and enter the directory:

```bash
git clone <repo-url>
cd TwoVectorCollusion
```

- Install requirements:

```bash
pip install -r requirements.txt
```

- Local module linking (optional):

```bash
pip install -e .
```

(Note: Ensure your `PYTHONPATH` includes `src` if not using `pip install -e .`.)
```
project_root/
  configs/            # YAML experiment settings
  data/samples/       # Toy local dataset examples (jsonl)
  outputs/            # Saved run logs and figures
  src/                # Core packages
    activation_steering_safety/
      experiments/    # CLI hooks and logic blocks
      eval/           # Evaluation routines
      steering/       # Composition logic
  tests/              # Unit tests (to be populated)
```
Expected in JSONL format:

```json
{
  "id": "c_1",
  "behavior": "concise",
  "base_prompt": "Explain overfitting in machine learning.",
  "positive_response": "Overfitting happens when a model...",
  "negative_response": "Overfitting is a fundamental concept..."
}
```

- Build Data: Verify samples exist in `data/samples/`.
- Extract Vectors: Extract the single-behavior offsets.
- Validate Singles: Run inference with each single vector attached to confirm base capabilities remain intact.
- Run Pairs: Dynamically compose the vectors and capture superadditivity metrics.
- Inspect: Summarize scores and run plots.
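The extraction step can be sketched as a difference-in-means over the contrastive JSONL records. The field names come from the sample above; the `get_activation` interface is an illustrative assumption standing in for the hook-based capture the framework actually performs:

```python
import torch

def extract_contrastive_vector(records, get_activation):
    """Difference-in-means steering vector from contrastive records.

    `get_activation(prompt, response)` is assumed to return the hidden
    state at the target layer for the last token of `response`
    (illustrative interface; the real framework captures this via hooks).
    """
    pos, neg = [], []
    for rec in records:
        pos.append(get_activation(rec["base_prompt"], rec["positive_response"]))
        neg.append(get_activation(rec["base_prompt"], rec["negative_response"]))
    return torch.stack(pos).mean(0) - torch.stack(neg).mean(0)

# Toy usage: a fake activation function stands in for a real model pass.
records = [{
    "id": "c_1",
    "behavior": "concise",
    "base_prompt": "Explain overfitting in machine learning.",
    "positive_response": "Overfitting happens when a model...",
    "negative_response": "Overfitting is a fundamental concept...",
}]
fake_activation = lambda p, r: torch.full((8,), float(len(r)))
vec = extract_contrastive_vector(records, fake_activation)
```

With a real model, `get_activation` would run the prompt-plus-response through the network once and read the residual stream at the extraction layer.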
You must prefix commands with `PYTHONPATH=src` if you didn't install the module via pip:

- Extract vectors:

```bash
PYTHONPATH=src python3 -m activation_steering_safety.cli.main extract-vectors --config configs/experiments/extract_pilot.yaml
```

- Validate single vectors:

```bash
PYTHONPATH=src python3 -m activation_steering_safety.cli.main validate-singles --config configs/experiments/validate_pilot.yaml
```

- Run pairwise experiments:

```bash
PYTHONPATH=src python3 -m activation_steering_safety.cli.main run-pairs --config configs/experiments/pairs_pilot.yaml
```

- Summarize and plot:

```bash
PYTHONPATH=src python3 -m activation_steering_safety.cli.main summarize --run-dir outputs/runs/pairs/
```
- `outputs/vectors/*.pt`: Saved vector extraction tensors.
- `*_meta.json`: Tracked origin metrics (layer, source model, strategy).
- `outputs/runs/pairs/superadditivity_summary.csv`: Core comparison output showing raw `rAB - expected` variations.
- `plots/*.png`: Visualized gains.
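Flagging superadditive pairs from the summary CSV can be done with the standard library alone. The column names (`pair`, `r_ab`, `expected`) and the inline sample rows are assumptions about the summary schema, not the framework's guaranteed output format:

```python
import csv
import io

# Inline stand-in for outputs/runs/pairs/superadditivity_summary.csv;
# column names are assumed for illustration.
sample = io.StringIO(
    "pair,r_ab,expected\n"
    "concise+direct,0.15,0.07\n"
    "concise+formal,0.05,0.06\n"
)

# Keep only pairs whose composed score exceeds the additive expectation,
# paired with the raw gain r_ab - expected.
superadditive = [
    (row["pair"], float(row["r_ab"]) - float(row["expected"]))
    for row in csv.DictReader(sample)
    if float(row["r_ab"]) > float(row["expected"])
]
```

The same loop works on the real file by replacing the `StringIO` with `open(...)` on the summary path.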
- Scaling Families: Qwen, Gemma, Mixtral hooks.
- Judge Interfaces: Plugging in an external model for true evaluation logic, replacing the heuristics.
- Optimization: Batch token strategy implementations beyond just `last_token`.
- Memory Errors: Try explicitly setting `device: "mps"` instead of auto when scaling up on Apple Silicon, or drastically cut `max_new_tokens` during generation.
- Imports Missing: `export PYTHONPATH=src` in your terminal.