Advanced usage¶
Only apply steering to later tokens¶
By default, the steering vector is applied to every token in the input. However, it is sometimes useful to apply the steering vector only to later tokens and leave the beginning tokens untouched, for instance to steer only the part of the sequence where the model is responding to a prompt. This can be done by passing a min_token_index argument to steering_vector.apply() or steering_vector.patch():
with steering_vec.apply(model, min_token_index=10):
    # only tokens 10 and later will be affected by the steering vector
    model.forward(...)
Custom operators¶
By default, the steering vector is applied by adding it to the model's activations at runtime. However, if you prefer something fancier, you can pass a custom function to the operator argument when calling steering_vector.apply() or steering_vector.patch(), like below:
# the result of this function will be added to the activation
def my_operator(activation, steering_vec):
    # remove the component of the activation that is aligned with the steering vector
    denom = torch.norm(steering_vec) ** 2
    return -1 * torch.dot(activation, steering_vec) * steering_vec / denom

with steering_vec.apply(model, operator=my_operator):
    # do something with the model
    model.forward(...)
There are also some built-in operators to help with common steering scenarios. To ablate the steering vector entirely from the activation, you can use the ablation_operator(). This ensures that the projection of the activation onto the steering vector is fully erased from the activation.
from steering_vectors import ablation_operator

with steering_vec.apply(model, operator=ablation_operator()):
    # do something with the model
    model.forward(...)
If you want to first ablate the steering vector from the activation and then add it back in with a set multiplier, you can use the ablation_then_addition_operator(). This guarantees that the projection of the activation along the steering direction is exactly equal to the steering vector.
from steering_vectors import ablation_then_addition_operator

with steering_vec.apply(model, operator=ablation_then_addition_operator()):
    # do something with the model
    model.forward(...)
Custom aggregators¶
By default, the steering vector is trained by taking the mean of the differences between positive and negative activations. If you need different behavior, for example PCA, you can pass a custom function to the aggregator argument when calling train_steering_vector(). This function takes two arguments, pos_activations and neg_activations, each of shape (num_samples, hidden_dim), and returns a 1-d tensor of shape (hidden_dim,). This is demonstrated below:
def norm_mean_aggregator(pos_activations, neg_activations):
    mean_act = torch.mean(pos_activations - neg_activations, dim=0)
    return mean_act / torch.norm(mean_act)

vec = train_steering_vector(model, tokenizer, data, aggregator=norm_mean_aggregator)
For the common use case of PCA, you can use the built-in pca_aggregator function. This will find a steering vector by taking the first principal component of the deltas between positive and negative activations. Unlike the default mean aggregator, the steering vector from PCA will always have a norm of 1.
from steering_vectors import train_steering_vector, pca_aggregator
vec = train_steering_vector(model, tokenizer, data, aggregator=pca_aggregator())
There is also a built-in logistic regression aggregator, which will find a steering vector using scikit-learn's logistic regression model.
from steering_vectors import train_steering_vector, logistic_aggregator
vec = train_steering_vector(model, tokenizer, data, aggregator=logistic_aggregator())
Manually patching and unpatching¶
The steering_vector.apply() context manager is a convenient way to apply the steering vector while ensuring that the model is returned to its original state after the context manager exits. However, if you need to manually patch and unpatch the model, you can do so by calling steering_vector.patch():
# patch the model
handle = steering_vec.patch(model)
# do something with the model
model.forward(...)
# unpatch the model
handle.remove()
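Since patch() does not restore the model automatically, it can be worth wrapping the patched section in a try/finally so the hooks are removed even if an exception is raised. This is a small sketch using only the patch() and remove() calls shown above:
handle = steering_vec.patch(model)
try:
    # run the patched model; an error here would otherwise leave the patch in place
    model.forward(...)
finally:
    # always restore the model to its original state
    handle.remove()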
Using MLP, attention, or other layers¶
By default, the steering vector will be trained on the output of each transformer block. However, it's also possible to train on other parts of the transformer block, for instance the attention or MLP layers, or even layernorms inside the transformer block. This can be configured by passing a layer_type argument to train_steering_vector():
# train on decoder block output (default behavior)
vec = train_steering_vector(model, tokenizer, data, layer_type="decoder_block")
# train on the attention layers
vec = train_steering_vector(model, tokenizer, data, layer_type="self_attn")
# train on the MLP layers
vec = train_steering_vector(model, tokenizer, data, layer_type="mlp")
# train on the input layernorm
vec = train_steering_vector(model, tokenizer, data, layer_type="input_layernorm")
# train on the post attention layernorm
vec = train_steering_vector(model, tokenizer, data, layer_type="post_attention_layernorm")
Whichever layer type you choose during training, the same layer type will be used by the steering vector at runtime. For instance, if you train on the attention layers, the steering vector will be applied to the attention layers at runtime.
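For example, using only train_steering_vector() and apply() as shown above:
# the vector is trained on the MLP layers...
vec = train_steering_vector(model, tokenizer, data, layer_type="mlp")

# ...so it is automatically applied to the MLP layers here as well
with vec.apply(model):
    model.forward(...)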
Custom layer mapping¶
The library will automatically guess the layer selectors for most Huggingface language models as long as the layers are named in a conventional way (e.g. MLP layers called mlp). However, if you need to customize how layer matching works, or if the library is not able to guess the correct layers, you can pass a custom layer_config parameter to any function in this library. The layer_config is a dictionary which maps layer types to layer selectors. A layer selector is a template string containing the special placeholder {num}, which gets replaced by the layer number at runtime, and corresponds to the layer's name inside the PyTorch module. You can find a list of all layers in a model by calling model.named_modules().
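For example, to list the module names in your model (this is plain PyTorch, not part of this library):
# print every module name so you can pick the right selector templates
for name, _module in model.named_modules():
    print(name)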
For instance, the layer config for GPT2 looks like this:
gpt_layer_config = {
    "decoder_block": "transformer.h.{num}",
    "self_attn": "transformer.h.{num}.attn",
    "mlp": "transformer.h.{num}.mlp",
    "input_layernorm": "transformer.h.{num}.ln_1",
    "post_attention_layernorm": "transformer.h.{num}.ln_2",
}

vec = train_steering_vector(model, tokenizer, data, layer_config=gpt_layer_config)
For most cases, using a string is sufficient, but if you want to customize the layer matcher further, you can pass in a function which takes the layer number as an int and returns the name of the layer in the model as a string. For instance, for GPT models, the decoder block selector could be provided as lambda num: f"transformer.h.{num}".
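As a sketch, the GPT2 config above could specify the decoder block with a callable instead of a template string (assuming here that a partial config covering only the layer types you actually use is sufficient):
# a callable selector: given the layer number, return the module name as a string
gpt_layer_config = {
    "decoder_block": lambda num: f"transformer.h.{num}",
}

vec = train_steering_vector(model, tokenizer, data, layer_config=gpt_layer_config)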
Extracting activations and aggregating manually¶
If you need to extract the model's activations explicitly without running a full training loop, you can use the extract_activations() function. This function takes all the same parameters as train_steering_vector() (excluding aggregator), and returns dictionaries mapping each layer to its positive and negative activation tensors. You can then aggregate these activations yourself using aggregate_activations() and manually create a steering vector.
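A rough sketch of how the pieces might fit together is shown below. The exact return structure of extract_activations(), the signature of aggregate_activations(), the mean_aggregator helper, and the SteeringVector constructor arguments are all assumptions here, so check the API reference for the real interfaces.
from steering_vectors import (
    SteeringVector,
    aggregate_activations,
    extract_activations,
    mean_aggregator,
)

# extract per-layer positive and negative activations
# (assumed here to come back as two dicts mapping layer number -> tensor)
pos_acts, neg_acts = extract_activations(model, tokenizer, data)

# reduce each layer's activations to a single vector, using the same mean
# aggregation described in "Custom aggregators" above (assumed helper)
layer_vectors = aggregate_activations(pos_acts, neg_acts, mean_aggregator())

# build the steering vector by hand (constructor arguments are an assumption)
vec = SteeringVector(layer_activations=layer_vectors, layer_type="decoder_block")

with vec.apply(model):
    model.forward(...)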