Recent safety research investigates whether AI systems could exhibit heightened risks when multiple individually benign failure modes or steering vectors combine. This project explores whether "two individually low-risk steering vectors can compose into a superadditive safety failure, even when neither vector is risky alone."
The baseline hypothesis evaluated in this pilot: Composing two harmless "stylistic" or "behavioral" vectors (e.g., conciseness + direct answer first) might produce an unexpectedly strong capability degradation or harmful compliance increase, exceeding the additive risk profile.
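The additive-risk comparison above can be sketched as a tiny scoring helper. The function and the scalar "risk score" interface are illustrative assumptions, not the framework's actual API:

```python
def superadditive_gain(r_base, r_a, r_b, r_ab):
    """Gain of the composed run over the additive expectation.

    Hypothetical metric: each r_* is a scalar risk score (e.g. a
    harmful-compliance rate) for the unsteered model, each single
    vector, and the two-vector composition. The additive expectation
    credits each vector only its own marginal effect over baseline.
    """
    expected = r_base + (r_a - r_base) + (r_b - r_base)
    return r_ab - expected

# A pair is flagged superadditive when the gain is clearly positive,
# e.g. superadditive_gain(0.02, 0.04, 0.05, 0.15) is approximately 0.08.
```

Any consistent risk metric can be plugged in; the point is only that the composed score is compared against the sum of marginal effects, not against either single-vector score alone.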
- Contrastive Vector Extraction: Local activation targeting from a set of contrastive examples.
- Hook-based Steering: PyTorch hooks over a Hugging Face CausalLM pipeline.
- Modular Configs: Run experiments (extraction, single-vector validation, and pairwise composition) easily via YAML config passing.
- Reporting: Basic CSV tabular output and Matplotlib bar charts for superadditive gain checks.
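Hook-based steering of the kind listed above can be sketched in plain PyTorch. The toy `nn.Linear` stands in for a transformer block (in the real pipeline the hook would attach to a Hugging Face CausalLM layer); the vector names and coefficients are illustrative, not the package's API:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one transformer block's output; in practice the hook
# would be registered on a CausalLM decoder layer at the chosen depth.
block = nn.Linear(16, 16)

# Two steering vectors (illustrative) plus per-vector scaling coefficients.
v_a = torch.randn(16)
v_b = torch.randn(16)

def make_steering_hook(vectors, coeffs):
    """Return a forward hook that adds the scaled vectors to the output."""
    def hook(module, inputs, output):
        for v, c in zip(vectors, coeffs):
            output = output + c * v
        return output  # returning a value overrides the module's output
    return hook

x = torch.randn(2, 16)
handle = block.register_forward_hook(make_steering_hook([v_a, v_b], [1.0, 1.0]))
steered = block(x)
handle.remove()  # always detach hooks after the steered generation
baseline = block(x)
```

Composition here is simply summing the scaled vectors inside one hook at a single layer, which matches the single-layer baseline described below.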
- Full benchmark suites (e.g., StrongREJECT, HarmBench) are not yet fully wrapped; the framework instead relies on sample eval schemas for quick starts.
- Refined automated thresholding or native LLM-as-a-judge score matching (basic placeholders are included).
- Native multi-layer distributed vector application (the composition baseline currently assumes a single layer).
- Clone and enter the directory:

```bash
git clone <repo-url>
cd TwoVectorCollusion
```

- Install requirements:

```bash
pip install -r requirements.txt
```

- Local module linking (optional):

```bash
pip install -e .
```

(Note: Ensure your `PYTHONPATH` includes `src` if not using `pip install -e .`.)
```
project_root/
  configs/            # YAML experiment settings
  data/samples/       # Toy local dataset examples (jsonl)
  outputs/            # Saved run logs and figures
  src/                # Core packages
    activation_steering_safety/
      experiments/    # CLI hooks and logic blocks
      eval/           # Evaluation routines
      steering/       # Composition logic
  tests/              # Unit tests (to be populated)
```
Expected in JSONL format:

```json
{
  "id": "c_1",
  "behavior": "concise",
  "base_prompt": "Explain overfitting in machine learning.",
  "positive_response": "Overfitting happens when a model...",
  "negative_response": "Overfitting is a fundamental concept..."
}
```

- Build Data: Verify samples exist in `data/samples/`.
- Extract Vectors: Extract the single-behavior offsets.
- Validate Singles: Run inference with each single vector attached to confirm base capabilities remain intact.
- Run Pairs: Dynamically compose the vectors and capture superadditivity metrics.
- Inspect: Summarize scores and run plots.
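The extraction step can be sketched as a difference-in-means over the contrastive JSONL records. The field names come from the sample above; the `get_activation` interface is an illustrative assumption standing in for the hook-based capture the framework actually performs:

```python
import torch

def extract_contrastive_vector(records, get_activation):
    """Difference-in-means steering vector from contrastive records.

    `get_activation(prompt, response)` is assumed to return the hidden
    state at the target layer for the last token of `response`
    (illustrative interface; the real framework captures this via hooks).
    """
    pos, neg = [], []
    for rec in records:
        pos.append(get_activation(rec["base_prompt"], rec["positive_response"]))
        neg.append(get_activation(rec["base_prompt"], rec["negative_response"]))
    return torch.stack(pos).mean(0) - torch.stack(neg).mean(0)

# Toy usage: a fake activation function stands in for a real model pass.
records = [{
    "id": "c_1",
    "behavior": "concise",
    "base_prompt": "Explain overfitting in machine learning.",
    "positive_response": "Overfitting happens when a model...",
    "negative_response": "Overfitting is a fundamental concept...",
}]
fake_activation = lambda p, r: torch.full((8,), float(len(r)))
vec = extract_contrastive_vector(records, fake_activation)
```

With a real model, `get_activation` would run the prompt-plus-response through the network once and read the residual stream at the extraction layer.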
You must prefix commands with `PYTHONPATH=src` if you didn't install the module via pip:

- Extract vectors:

```bash
PYTHONPATH=src python3 -m activation_steering_safety.cli.main extract-vectors --config configs/experiments/extract_pilot.yaml
```

- Validate single vectors:

```bash
PYTHONPATH=src python3 -m activation_steering_safety.cli.main validate-singles --config configs/experiments/validate_pilot.yaml
```

- Run pairwise experiments:

```bash
PYTHONPATH=src python3 -m activation_steering_safety.cli.main run-pairs --config configs/experiments/pairs_pilot.yaml
```

- Summarize and plot:

```bash
PYTHONPATH=src python3 -m activation_steering_safety.cli.main summarize --run-dir outputs/runs/pairs/
```
- `outputs/vectors/*.pt`: Saved vector extraction tensors.
- `*_meta.json`: Tracked origin metrics (layer, source model, strategy).
- `outputs/runs/pairs/superadditivity_summary.csv`: Core comparison output showing raw `rAB - expected` variations.
- `plots/*.png`: Visualized gains.
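Flagging superadditive pairs from the summary CSV can be done with the standard library alone. The column names (`pair`, `r_ab`, `expected`) and the inline sample rows are assumptions about the summary schema, not the framework's guaranteed output format:

```python
import csv
import io

# Inline stand-in for outputs/runs/pairs/superadditivity_summary.csv;
# column names are assumed for illustration.
sample = io.StringIO(
    "pair,r_ab,expected\n"
    "concise+direct,0.15,0.07\n"
    "concise+formal,0.05,0.06\n"
)

# Keep only pairs whose composed score exceeds the additive expectation,
# paired with the raw gain r_ab - expected.
superadditive = [
    (row["pair"], float(row["r_ab"]) - float(row["expected"]))
    for row in csv.DictReader(sample)
    if float(row["r_ab"]) > float(row["expected"])
]
```

The same loop works on the real file by replacing the `StringIO` with `open(...)` on the summary path.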
- Scaling Families: Qwen, Gemma, Mixtral hooks.
- Judge Interfaces: Plugging in an external model for true evaluation logic, replacing the heuristics.
- Optimization: Batch token strategy implementations beyond just `last_token`.
- Memory Errors: Try explicitly setting `device: "mps"` instead of auto when scaling up on Apple Silicon, or drastically cut `max_new_tokens` during generation.
- Imports Missing: `export PYTHONPATH=src` in your terminal.