SteeringVector

class steering_vectors.SteeringVector(layer_activations, layer_type='decoder_block')[source]

A steering vector that can be applied to a model.

apply(model, layer_config=None, operator=None, multiplier=1.0, min_token_index=0, token_indices=None)[source]

Apply this steering vector to the given model. Tokens to patch can be selected using either min_token_index or token_indices, but not both. If neither is provided, all tokens will be patched.

Return type:

Generator[None, None, None]

Parameters:
  • model – The model to patch

  • layer_config – A dictionary mapping layer types to layer matching functions. If not provided, this will be inferred automatically.

  • operator – A function that takes the original activation and the steering vector and returns a modified vector that is added to the original activation.

  • multiplier – A multiplier to scale the patch activations. Default is 1.0.

  • min_token_index – The minimum token index to apply the patch to. Default is 0.

  • token_indices – Either a list of token indices to apply the patch to, a slice, or a mask tensor. Default is None.
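
As a rough illustration of the operator contract described above, the sketch below uses plain lists of floats in place of tensors. The names remove_projection and add_delta are hypothetical and not part of the library. A real operator would receive and return tensors, and how multiplier interacts with a custom operator is not shown here.

```python
from typing import List

Vector = List[float]


def remove_projection(original: Vector, steering: Vector) -> Vector:
    """Hypothetical operator: the delta that cancels the component of
    `original` along `steering` (assumes `steering` is non-zero)."""
    dot = sum(o * s for o, s in zip(original, steering))
    norm_sq = sum(s * s for s in steering)
    scale = dot / norm_sq
    return [-scale * s for s in steering]


def add_delta(original: Vector, delta: Vector) -> Vector:
    """Mimic the documented behavior: the operator's output is added
    to the original activation."""
    return [o + d for o, d in zip(original, delta)]


original = [3.0, 4.0]
steering = [1.0, 0.0]
patched = add_delta(original, remove_projection(original, steering))
print(patched)  # [0.0, 4.0]
```

The operator returns a delta rather than the final activation, so the same function shape covers plain addition, projection removal, or any other adjustment that can be expressed as "original plus something".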

Example

>>> model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
>>> steering_vector = SteeringVector(...)
>>> with steering_vector.apply(model):
...     model.forward(...)
layer_activations
layer_type = 'decoder_block'
patch_activations(model, layer_config=None, operator=None, multiplier=1.0, min_token_index=None, token_indices=None)[source]

Patch the activations of the given model with this steering vector. This modifies the model in place and returns a handle that can be used to undo the patching. This method does the same thing as apply, but the patch must be removed manually to restore the model to its original state. For most cases, apply is easier to use.

Tokens to patch can be selected using either min_token_index or token_indices, but not both. If neither is provided, all tokens will be patched.

Return type:

SteeringPatchHandle

Parameters:
  • model – The model to patch

  • layer_config – A dictionary mapping layer types to layer matching functions. If not provided, this will be inferred automatically.

  • operator – A function that takes the original activation and the steering vector and returns a modified vector that is added to the original activation.

  • multiplier – A multiplier to scale the patch activations. Default is 1.0.

  • min_token_index – The minimum token index to apply the patch to. Default is None.

  • token_indices – Either a list of token indices to apply the patch to, a slice, or a mask tensor. Default is None.
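
The token-selection rules above can be illustrated with a small sketch in plain Python. It is not the library's implementation: select_positions is a hypothetical helper, the mask is a list of booleans rather than a tensor, and it assumes that min_token_index selects every position at or after that index.

```python
from typing import List, Optional, Sequence, Union

TokenIndices = Union[Sequence[int], Sequence[bool], slice]


def select_positions(
    seq_len: int,
    min_token_index: Optional[int] = None,
    token_indices: Optional[TokenIndices] = None,
) -> List[bool]:
    """Return one flag per position: True where the patch would apply."""
    if min_token_index is not None and token_indices is not None:
        # The two selectors are documented as mutually exclusive.
        raise ValueError("pass min_token_index or token_indices, not both")
    if token_indices is None:
        # With neither selector, start defaults to 0 and every position is patched.
        start = 0 if min_token_index is None else min_token_index
        return [pos >= start for pos in range(seq_len)]
    if isinstance(token_indices, slice):
        chosen = set(range(seq_len)[token_indices])
        return [pos in chosen for pos in range(seq_len)]
    items = list(token_indices)
    if items and all(isinstance(item, bool) for item in items):
        # A boolean mask is used as-is, one entry per position.
        if len(items) != seq_len:
            raise ValueError("a mask needs one entry per position")
        return items
    # Otherwise treat the values as explicit position indices.
    chosen = set(items)
    return [pos in chosen for pos in range(seq_len)]


print(select_positions(5, min_token_index=2))         # [False, False, True, True, True]
print(select_positions(5, token_indices=[0, 4]))      # [True, False, False, False, True]
print(select_positions(5, token_indices=slice(1, 3))) # [False, True, True, False, False]
```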

Example

>>> model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
>>> steering_vector = SteeringVector(...)
>>> handle = steering_vector.patch_activations(model)
>>> model.forward(...)
>>> handle.remove()
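
Because the patch stays in place until handle.remove() is called, one cautious pattern is to wrap the forward pass in try/finally so the patch is removed even if the forward pass raises. The sketch below shows that pattern with stand-ins so it runs without a model: FakeHandle, FakeVector, and applied are hypothetical names that only mimic the patch_activations/remove shape described above, and the "model" is just a list of active patch names.

```python
from contextlib import contextmanager
from typing import Iterator, List


class FakeHandle:
    """Stand-in for a patch handle: registers a patch and can remove it."""

    def __init__(self, active_patches: List[str]) -> None:
        self._active_patches = active_patches
        self._active_patches.append("steering")

    def remove(self) -> None:
        self._active_patches.remove("steering")


class FakeVector:
    """Stand-in for a steering vector exposing patch_activations()."""

    def patch_activations(self, model: List[str]) -> FakeHandle:
        return FakeHandle(model)


@contextmanager
def applied(vector: FakeVector, model: List[str]) -> Iterator[None]:
    """Patch on entry, and always remove the patch on exit."""
    handle = vector.patch_activations(model)
    try:
        yield
    finally:
        handle.remove()


model: List[str] = []
with applied(FakeVector(), model):
    print(model)  # ['steering']
print(model)  # []
```

This is the kind of bookkeeping that a context manager such as apply handles for the caller, which is why the manual handle is mainly useful when the patch must outlive a single block of code.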
to(*args, **kwargs)[source]

Return a new steering vector moved to the given device/dtype.

This method calls torch.Tensor.to on each of the layer activations.

Return type:

SteeringVector

class steering_vectors.SteeringPatchHandle(model_hooks)[source]

A handle that can be used to remove a steering patch from a model after running steering_vector.patch_activations().

model_hooks
remove()[source]

Remove the steering patch from the model.

Return type:

None