train_steering_vector
- steering_vectors.train_steering_vector(model, tokenizer, training_samples, layers=None, layer_type='decoder_block', layer_config=None, move_to_cpu=False, read_token_index=-1, show_progress=False, aggregator=<function mean_aggregator.<locals>._mean_aggregator>, batch_size=1, tqdm_desc='Training steering vector')
Train a steering vector for the given model.
- Return type:
- Parameters:
model – The model to train the steering vector for
tokenizer – The tokenizer to use
training_samples – A list of training samples, where each sample is a tuple of (positive_str, negative_str). The steering vector approximates the difference between the positive-prompt and negative-prompt activations.
layers – A list of layer numbers to train the steering vector on. If None, train on all layers.
layer_type – The type of layer to train the steering vector on. Default is “decoder_block”.
layer_config – A dictionary mapping layer types to layer matching functions. If not provided, this will be inferred automatically.
move_to_cpu – If True, move the activations to the CPU before training. Default False.
read_token_index – The index of the token to read the activations from. Default -1, meaning final token.
show_progress – If True, show a progress bar. Default False.
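aggregator – A function that takes the positive and negative activations for a layer and returns a single vector. Default is mean_aggregator.
batch_size – The batch size to use. Default 1.
tqdm_desc – The description to use for the progress bar. Default “Training steering vector”.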
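As a rough illustration of the training_samples format described above, the sketch below builds (positive_str, negative_str) pairs from a prompt template and hands them to train_steering_vector. The helper names build_training_samples and train_on_layers, the template text, and the example layer numbers are hypothetical; only keyword arguments that appear in the signature above are used. The library import sits inside the wrapper so the pair-building part runs without a model loaded.

```python
def build_training_samples(template, contrast_pairs):
    """Fill a prompt template with each (positive, negative) completion.

    Returns a list of (positive_str, negative_str) tuples, the shape that
    training_samples is documented to take.
    """
    samples = []
    for positive, negative in contrast_pairs:
        samples.append(
            (template.format(answer=positive), template.format(answer=negative))
        )
    return samples


def train_on_layers(model, tokenizer, samples, layers):
    """Hypothetical wrapper: train on selected layers, reading the final token."""
    # Imported lazily so build_training_samples works without the library.
    from steering_vectors import train_steering_vector

    return train_steering_vector(
        model,
        tokenizer,
        samples,
        layers=layers,         # e.g. [2, 3]; None trains on all layers
        read_token_index=-1,   # final token, the documented default
        show_progress=True,
        batch_size=1,
    )


# Illustrative contrastive pairs (assumed data, not from the library).
samples = build_training_samples(
    "Q: Is the sky blue? A: {answer}",
    [("Yes", "No"), ("Certainly", "Never")],
)
print(samples[0])
```

Calling train_on_layers(model, tokenizer, samples, layers=[2, 3]) would then need a loaded model and its tokenizer; the layer choice is only an example.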
- steering_vectors.extract_activations(model, tokenizer, training_samples, layers=None, layer_type='decoder_block', layer_config=None, move_to_cpu=False, read_token_index=-1, show_progress=False, batch_size=1, tqdm_desc='Extracting activations')
Extract activations from the model for the given training samples.
- Return type:
tuple[dict[int, list[Tensor]], dict[int, list[Tensor]]]
- Parameters:
model – The model to extract activations from
tokenizer – The tokenizer to use
training_samples – A list of training samples, where each sample is a tuple of (positive_str, negative_str). The steering vector approximates the difference between the positive-prompt and negative-prompt activations.
layers – A list of layer numbers to extract activations from. If None, extract from all layers.
layer_type – The type of layer to extract activations from. Default is “decoder_block”.
layer_config – A dictionary mapping layer types to layer matching functions. If not provided, this will be inferred automatically.
move_to_cpu – If True, move the extracted activations to the CPU. Default False.
read_token_index – The index of the token to read the activations from. Default -1, meaning final token.
show_progress – If True, show a progress bar. Default False.
batch_size – The batch size to use. Default 1.
tqdm_desc – The description to use for the progress bar. Default “Extracting activations”.
- Returns:
A tuple of two dictionaries. The first dictionary maps layer numbers to lists of positive activations, and the second dictionary maps layer numbers to lists of negative activations.
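To make the returned structure and the effect of read_token_index concrete, here is a minimal sketch that uses plain Python lists in place of Tensors. The function names read_token_activations and collect_by_layer and the toy numbers are assumptions for illustration; this is not the library's implementation, only the same dictionary shape (layer number to a list with one activation per sample).

```python
def read_token_activations(hidden_states_by_layer, read_token_index=-1):
    """Pick one token's activation vector from each layer.

    hidden_states_by_layer maps layer number -> list of per-token vectors
    for a single prompt (plain lists stand in for Tensors here).
    """
    return {
        layer: tokens[read_token_index]
        for layer, tokens in hidden_states_by_layer.items()
    }


def collect_by_layer(per_sample_hidden_states, read_token_index=-1):
    """Group the selected activations into {layer: [one vector per sample]}."""
    acts_by_layer = {}
    for hidden_states in per_sample_hidden_states:
        picked = read_token_activations(hidden_states, read_token_index)
        for layer, vector in picked.items():
            acts_by_layer.setdefault(layer, []).append(vector)
    return acts_by_layer


# Two toy prompts, two layers, 2-dimensional activations per token.
positive_runs = [
    {0: [[0.0, 1.0], [2.0, 3.0]], 1: [[1.0, 1.0], [4.0, 5.0]]},
    {0: [[9.0, 9.0], [0.5, 0.5], [6.0, 7.0]],
     1: [[0.0, 0.0], [1.0, 2.0], [8.0, 9.0]]},
]

pos_acts_by_layer = collect_by_layer(positive_runs)           # final token
first_token_acts = collect_by_layer(positive_runs, read_token_index=0)
print(pos_acts_by_layer)
```

The negative prompts would be collected the same way into a second dictionary, giving the documented pair of dictionaries.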
- steering_vectors.aggregate_activations(pos_acts_by_layer, neg_acts_by_layer, aggregator=<function mean_aggregator.<locals>._mean_aggregator>)
Apply the aggregator to the positive and negative activations for each layer.
- Return type:
dict[int, Tensor]
- Parameters:
pos_acts_by_layer – A dictionary mapping layer numbers to lists of positive activations.
neg_acts_by_layer – A dictionary mapping layer numbers to lists of negative activations.
aggregator – A function that takes the positive and negative activations for a layer and returns a single vector.
- Returns:
A dictionary mapping layer numbers to the aggregated activations.
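The per-layer aggregation can be sketched as follows, assuming a mean-style aggregator that averages the positive-minus-negative differences over samples. Plain lists stand in for Tensors, and mean_difference and aggregate_by_layer are hypothetical names; the library's default aggregator may differ in detail.

```python
def mean_difference(pos_acts, neg_acts):
    """Average of (positive - negative) over samples, element by element."""
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [
        sum(p[i] - q[i] for p, q in zip(pos_acts, neg_acts)) / n
        for i in range(dim)
    ]


def aggregate_by_layer(pos_acts_by_layer, neg_acts_by_layer,
                       aggregator=mean_difference):
    """Apply the aggregator to each layer's positive and negative activations."""
    return {
        layer: aggregator(pos_acts_by_layer[layer], neg_acts_by_layer[layer])
        for layer in pos_acts_by_layer
    }


# Toy activations: two samples per layer, 2-dimensional vectors.
pos = {0: [[2.0, 3.0], [6.0, 7.0]], 1: [[4.0, 5.0], [8.0, 9.0]]}
neg = {0: [[1.0, 1.0], [3.0, 3.0]], 1: [[4.0, 1.0], [4.0, 1.0]]}

vectors = aggregate_by_layer(pos, neg)
print(vectors)
```

Any function with the same two-argument shape could be passed as aggregator, which is the extension point the parameter description refers to.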