About

Acknowledgements

This library is inspired by the following excellent papers:

How this works

To apply steering vectors to a model, we need some way of modifying the model’s forward pass. The official codebases for Representation Engineering and Contrastive Activation Addition do this by modifying the underlying model and replacing decoder blocks with custom wrappers. While this is conceptually simple, it has two major drawbacks. First, it won’t work for arbitrary models, since a new wrapper must be built for every model and every type of layer. Second, it changes the model’s architecture, which can lead to unexpected behavior if the end user isn’t aware of the changes.

The steering_vectors library uses PyTorch hooks instead of custom layer wrappers to modify the underlying model’s forward pass. You can read more about hooks in the official PyTorch documentation. Hooks let us modify the forward pass of any model without changing the model’s architecture, which makes it possible to apply steering to arbitrary models from Hugging Face without any custom wrapping code.
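
As a rough illustration of the hook mechanism, the sketch below adds a vector to one layer's output using PyTorch's `register_forward_hook`. It is not the library's own implementation: the toy two-layer model, the `steering_vector` tensor, and the `add_steering` function are all hypothetical names chosen for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a real model (assumption: two plain linear layers).
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))

# Hypothetical steering vector, sized to match the first layer's output.
steering_vector = torch.ones(4)


def add_steering(module, inputs, output):
    # A forward hook that returns a value replaces the module's output.
    return output + steering_vector


x = torch.randn(1, 4)

with torch.no_grad():
    baseline = model(x)

    # Attach the hook to the first layer; the model itself is not modified.
    handle = model[0].register_forward_hook(add_steering)
    steered = model(x)

    # Removing the hook restores the original forward pass.
    handle.remove()
    restored = model(x)

    # What we expect the hooked forward pass to have computed.
    expected = model[1](model[0](x) + steering_vector)
```

Because the hook is attached and removed through a handle, the model's modules stay exactly as they were, which is the property the wrapper-based approach gives up. In real decoder blocks the layer output is often a tuple rather than a single tensor, so a practical hook would need to handle that case.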

Contributing

Any contributions to improve this project are welcome! Please open an issue or pull request in the GitHub repo with bug fixes, changes, and improvements.

License

Steering Vectors is released under the MIT license.