train_steering_vector

steering_vectors.train_steering_vector(model, tokenizer, training_samples, layers=None, layer_type='decoder_block', layer_config=None, move_to_cpu=False, read_token_index=-1, show_progress=False, aggregator=<function mean_aggregator.<locals>._mean_aggregator>, batch_size=1, tqdm_desc='Training steering vector')

Train a steering vector for the given model.

Return type:

SteeringVector

Parameters:

model – The model to train the steering vector for
tokenizer – The tokenizer to use
training_samples – A list of training samples, where each sample is a tuple of (positive_str, negative_str). The steering vector approximates the difference between the activations for the positive prompts and those for the negative prompts.
layers – A list of layer numbers to train the steering vector on. If None, train on all layers.
layer_type – The type of layer to train the steering vector on. Default is “decoder_block”.
layer_config – A dictionary mapping layer types to layer matching functions. If not provided, this will be inferred automatically.
move_to_cpu – If True, move the activations to the CPU before training. Default False.
read_token_index – The index of the token to read the activations from. Default -1, meaning final token.
show_progress – If True, show a progress bar. Default False.
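aggregator – A function that takes the positive and negative activations for a layer and returns a single vector. Default is mean_aggregator.
batch_size – The batch size to use. Default 1.
tqdm_desc – The description to use for the progress bar. Default “Training steering vector”.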
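
The parameters above can be combined into a call along the following lines. This is a minimal sketch rather than an official example: the helper name train_example_vector, the sample sentences, and the layer numbers [4, 5] are illustrative assumptions, and model and tokenizer are assumed to be an already-loaded language model and its matching tokenizer.

```python
# Contrastive pairs in the documented (positive_str, negative_str) form.
# The sentences themselves are illustrative only.
training_samples = [
    ("I love this movie.", "I hate this movie."),
    ("The food was wonderful.", "The food was terrible."),
]


def train_example_vector(model, tokenizer):
    """Train a steering vector on two layers (hypothetical helper).

    `model` and `tokenizer` are assumed to be loaded by the caller.
    """
    # Imported lazily so the sketch can be read without the package installed.
    from steering_vectors import train_steering_vector

    return train_steering_vector(
        model,
        tokenizer,
        training_samples,
        layers=[4, 5],              # assumed layer numbers; None means all layers
        layer_type="decoder_block",
        read_token_index=-1,        # read activations at the final token
        show_progress=True,
        batch_size=2,
    )
```

Restricting layers to a small list, as assumed here, limits which layers are trained on; leaving it as None trains on all layers.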

steering_vectors.extract_activations(model, tokenizer, training_samples, layers=None, layer_type='decoder_block', layer_config=None, move_to_cpu=False, read_token_index=-1, show_progress=False, batch_size=1, tqdm_desc='Extracting activations')

Extract activations from the model for the given training samples.

Return type:

tuple[dict[int, list[Tensor]], dict[int, list[Tensor]]]

Parameters:

model – The model to extract activations from
tokenizer – The tokenizer to use
training_samples – A list of training samples, where each sample is a tuple of (positive_str, negative_str). The steering vector approximates the difference between the activations for the positive prompts and those for the negative prompts.
layers – A list of layer numbers to extract activations from. If None, extract from all layers.
layer_type – The type of layer to extract activations from. Default is “decoder_block”.
layer_config – A dictionary mapping layer types to layer matching functions. If not provided, this will be inferred automatically.
move_to_cpu – If True, move the activations to the CPU before training. Default False.
read_token_index – The index of the token to read the activations from. Default -1, meaning final token.
show_progress – If True, show a progress bar. Default False.
batch_size – The batch size to use. Default 1.
tqdm_desc – The description to use for the progress bar. Default “Extracting activations”.

Returns:

A tuple of two dictionaries. The first dictionary maps layer numbers to lists of positive activations, and the second dictionary maps layer numbers to lists of negative activations.
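
To make the shape of the return value concrete, the sketch below builds the same two-dictionary structure from toy data. It is not the library's implementation: plain Python lists stand in for Tensor objects, and read_token and toy_hidden_states are hypothetical names introduced for illustration. It also shows how read_token_index=-1 selects the final token's activation.

```python
def read_token(per_token_activations, read_token_index=-1):
    """Pick one token's activation vector from a sequence of vectors."""
    return per_token_activations[read_token_index]


# Toy stand-ins for hidden states of one (positive, negative) pair.
# Keys are layer numbers; each prompt has one vector per token.
toy_hidden_states = {
    0: {"pos": [[0.1, 0.2], [0.3, 0.4]], "neg": [[0.0, 0.1], [0.2, 0.1]]},
    1: {"pos": [[1.0, 1.0], [2.0, 3.0]], "neg": [[1.0, 0.0], [1.0, 1.0]]},
}

pos_acts_by_layer = {}
neg_acts_by_layer = {}
for layer, states in toy_hidden_states.items():
    # One entry is appended per training sample; only one sample here.
    pos_acts_by_layer.setdefault(layer, []).append(read_token(states["pos"]))
    neg_acts_by_layer.setdefault(layer, []).append(read_token(states["neg"]))

# Same layout as the documented return value: (positive dict, negative dict).
result = (pos_acts_by_layer, neg_acts_by_layer)
```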

steering_vectors.aggregate_activations(pos_acts_by_layer, neg_acts_by_layer, aggregator=<function mean_aggregator.<locals>._mean_aggregator>)

Apply the aggregator to the positive and negative activations for each layer.

Return type:

dict[int, Tensor]

Parameters:

pos_acts_by_layer – A dictionary mapping layer numbers to lists of positive activations.
neg_acts_by_layer – A dictionary mapping layer numbers to lists of negative activations.
aggregator – A function that takes the positive and negative activations for a layer and returns a single vector.

Returns:

A dictionary mapping layer numbers to the aggregated activations.
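
The per-layer aggregation described above can be sketched in plain Python. The names mean_difference and aggregate_by_layer are hypothetical, lists stand in for Tensor objects, and the assumption that the default aggregator averages the positive-minus-negative differences is an illustration rather than a statement about the library's internals.

```python
def mean_difference(pos_acts, neg_acts):
    """Mean of (positive - negative) over samples, per dimension.

    Assumed behaviour for illustration; the library's default
    aggregator may differ in detail.
    """
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [
        sum(p[d] - q[d] for p, q in zip(pos_acts, neg_acts)) / n
        for d in range(dim)
    ]


def aggregate_by_layer(pos_acts_by_layer, neg_acts_by_layer, aggregator=mean_difference):
    """Apply `aggregator` to each layer's positive and negative activations."""
    return {
        layer: aggregator(pos_acts, neg_acts_by_layer[layer])
        for layer, pos_acts in pos_acts_by_layer.items()
    }


# Two samples per layer, two dimensions per activation vector.
pos = {0: [[1.0, 2.0], [3.0, 4.0]], 1: [[0.0, 0.0], [2.0, 2.0]]}
neg = {0: [[0.0, 1.0], [1.0, 2.0]], 1: [[1.0, 1.0], [1.0, 1.0]]}
vectors = aggregate_by_layer(pos, neg)
```

Any function with the same two-argument shape can be passed as the aggregator, which is what makes the aggregation step swappable.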