Basic usage¶
This library assumes you’re using PyTorch with a decoder-only generative language model (e.g. GPT, LLaMa, etc…), and a tokenizer from Huggingface.
To begin, collect tuples of positive and negative training prompts in a list, and run train_steering_vector()
:
from transformers import AutoModelForCausalLM, AutoTokenizer
from steering_vectors import train_steering_vector
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# training samples are tuples of (positive_prompt, negative_prompt)
training_samples = [
(
"The capital of England is London",
"The capital of England is Beijing"
),
(
"The capital of France is Paris",
"The capital of France is Berlin"
)
# ...
]
steering_vector = train_steering_vector(
model,
tokenizer,
training_samples,
show_progress=True,
)
Then, you can use the steering vector to “steer” the model’s behavior, for example to make it more truthful:
with steering_vector.apply(model):
prompt = "Is it true that crystals have magic healing properties?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs)
Using specific layers¶
By default, train_steering_vector()
will train a vector at all layers of the model.
However, often you may want to only train a steering vector for a limited set of layers.
You can customize the layers to train by passing a list of layer numbers to the layers
argument:
steering_vec = train_steering_vector(
model,
tokenizer,
training_samples,
layers=[1, 2, 3],
)
This also works with negative indices to count from the end of the model:
steering_vec = train_steering_vector(
model,
tokenizer,
training_samples,
layers=[-1, -2, -3],
)
Batch training¶
By default, train_steering_vector()
will use a batch size of 1. If your
GPU has enough memory, you can increase the batch size to train faster by
setting the batch_size
argument:
steering_vec = train_steering_vector(
model,
tokenizer,
training_samples,
batch_size=8,
)
Magnitude scaling¶
By default, the steering vector will be applied at full magnitude. However, sometimes it’s useful
to apply the steering vector at a lower or higher magnitude, depending on the application. This
can be done by passing a multiplier
argument to steering_vector.apply()
or steering_vector.patch()
:
with steering_vec.apply(model, multiplier=0.5):
# the steering vector will be applied at half magnitude
model.forward(...)
with steering_vec.apply(model, multiplier=2.0):
# the steering vector will be applied at double magnitude
model.forward(...)
with steering_vec.apply(model, multiplier=-1.0):
# the steering vector will be inverted
model.forward(...)