train_steering_vector

steering_vectors.train_steering_vector(model, tokenizer, training_samples, layers=None, layer_type='decoder_block', layer_config=None, move_to_cpu=False, read_token_index=-1, show_progress=False, aggregator=<function mean_aggregator.<locals>._mean_aggregator>, batch_size=1, tqdm_desc='Training steering vector')

Train a steering vector for the given model.

Return type:

SteeringVector

Parameters:
  • model – The model to train the steering vector for

  • tokenizer – The tokenizer to use

  • training_samples – A list of training samples, where each sample is a tuple of (positive_str, negative_str). The steering vector approximates the difference between the activations for the positive prompt and those for the negative prompt.

  • layers – A list of layer numbers to train the steering vector on. If None, train on all layers.

  • layer_type – The type of layer to train the steering vector on. Default is “decoder_block”.

  • layer_config – A dictionary mapping layer types to layer matching functions. If not provided, this will be inferred automatically.

  • move_to_cpu – If True, move the activations to the CPU before training. Default False.

  • read_token_index – The index of the token to read the activations from. Default -1, meaning final token.

  • show_progress – If True, show a progress bar. Default False.

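  • aggregator – A function that takes the positive and negative activations for a layer and returns a single vector. Default is mean_aggregator.

  • batch_size – The batch size to use. Default 1.

  • tqdm_desc – The description to use for the progress bar. Default “Training steering vector”.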
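The parameters above can be combined as in the following minimal sketch. The prompts, the layer indices, and the wrapper name train_on_samples are illustrative assumptions, and model and tokenizer are whatever compatible objects the caller has already loaded. The keyword arguments are taken from the signature shown above.

```python
from typing import Any

# Contrastive pairs: each tuple is (positive_str, negative_str).
# The prompts are made up for illustration.
training_samples: list[tuple[str, str]] = [
    ("The movie was wonderful.", "The movie was terrible."),
    ("I love this restaurant.", "I hate this restaurant."),
]


def train_on_samples(model: Any, tokenizer: Any, samples: list[tuple[str, str]]) -> Any:
    """Hypothetical wrapper showing one way to call train_steering_vector."""
    # Imported lazily so this snippet still loads where the package is absent.
    from steering_vectors import train_steering_vector

    return train_steering_vector(
        model,
        tokenizer,
        samples,
        layers=[10, 11],      # assumed layer numbers; None trains on all layers
        read_token_index=-1,  # read activations at the final token
        move_to_cpu=True,     # move activations to the CPU before training
        show_progress=True,   # show a progress bar
        batch_size=2,
    )
```

The call returns a SteeringVector. Restricting layers to a small list is a common way to keep memory use down, but which layers work best depends on the model and is not specified here.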

steering_vectors.extract_activations(model, tokenizer, training_samples, layers=None, layer_type='decoder_block', layer_config=None, move_to_cpu=False, read_token_index=-1, show_progress=False, batch_size=1, tqdm_desc='Extracting activations')

Extract activations from the model for the given training samples.

Return type:

tuple[dict[int, list[Tensor]], dict[int, list[Tensor]]]

Parameters:
  • model – The model to extract activations from

  • tokenizer – The tokenizer to use

  • training_samples – A list of training samples, where each sample is a tuple of (positive_str, negative_str). Activations are extracted separately for the positive prompt and the negative prompt of each sample.

  • layers – A list of layer numbers to extract activations from. If None, extract from all layers.

  • layer_type – The type of layer to extract activations from. Default is “decoder_block”.

  • layer_config – A dictionary mapping layer types to layer matching functions. If not provided, this will be inferred automatically.

  • move_to_cpu – If True, move the extracted activations to the CPU. Default False.

  • read_token_index – The index of the token to read the activations from. Default -1, meaning final token.

  • show_progress – If True, show a progress bar. Default False.

  • batch_size – The batch size to use. Default 1.

  • tqdm_desc – The description to use for the progress bar. Default “Extracting activations”.

Returns:

A tuple of two dictionaries. The first dictionary maps layer numbers to lists of positive activations, and the second dictionary maps layer numbers to lists of negative activations.
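The shape of this return value, and the role of read_token_index, can be illustrated with a toy sketch. It is not the library's implementation: plain lists of floats stand in for Tensors, and collect_by_layer and the sample data are hypothetical names introduced only for this example.

```python
from collections import defaultdict

# Toy per-token activations for one prompt: layer number -> one vector per token.
# Real activations are Tensors produced by the model; lists stand in here.
LayerActs = dict[int, list[list[float]]]


def collect_by_layer(
    pairs: list[tuple[LayerActs, LayerActs]],
    layers: list[int],
    read_token_index: int = -1,
) -> tuple[dict[int, list[list[float]]], dict[int, list[list[float]]]]:
    """Group the activation at read_token_index by layer, for each side of the pair."""
    pos_by_layer: dict[int, list[list[float]]] = defaultdict(list)
    neg_by_layer: dict[int, list[list[float]]] = defaultdict(list)
    for pos_acts, neg_acts in pairs:
        for layer in layers:
            # Index -1 selects the final token's activation vector.
            pos_by_layer[layer].append(pos_acts[layer][read_token_index])
            neg_by_layer[layer].append(neg_acts[layer][read_token_index])
    return dict(pos_by_layer), dict(neg_by_layer)


# One training sample with two layers and two tokens per prompt.
pairs = [
    (
        {0: [[0.0, 0.0], [1.0, 2.0]], 1: [[0.5, 0.5], [3.0, 4.0]]},
        {0: [[0.0, 0.0], [0.0, 1.0]], 1: [[0.5, 0.5], [1.0, 1.0]]},
    ),
]
pos, neg = collect_by_layer(pairs, layers=[0, 1])
```

Each dictionary ends up with one list entry per training sample for every requested layer, which is the structure that aggregate_activations consumes.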

steering_vectors.aggregate_activations(pos_acts_by_layer, neg_acts_by_layer, aggregator=<function mean_aggregator.<locals>._mean_aggregator>)

Apply the aggregator to the positive and negative activations for each layer.

Return type:

dict[int, Tensor]

Parameters:
  • pos_acts_by_layer – A dictionary mapping layer numbers to lists of positive activations.

  • neg_acts_by_layer – A dictionary mapping layer numbers to lists of negative activations.

  • aggregator – A function that takes the positive and negative activations for a layer and returns a single vector.

Returns:

A dictionary mapping layer numbers to the aggregated activations.
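A per-layer aggregation of this kind might look like the sketch below. It assumes the aggregator averages the positive-minus-negative differences across samples, which matches the description of the steering vector as approximating that difference. The function names and the use of plain lists in place of Tensors are assumptions made for the example, not the library's code.

```python
Vector = list[float]


def mean_difference(pos_acts: list[Vector], neg_acts: list[Vector]) -> Vector:
    """Average (positive - negative) across samples, one component at a time."""
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [
        sum(p[i] - q[i] for p, q in zip(pos_acts, neg_acts)) / n
        for i in range(dim)
    ]


def aggregate_by_layer(
    pos_by_layer: dict[int, list[Vector]],
    neg_by_layer: dict[int, list[Vector]],
    aggregator=mean_difference,
) -> dict[int, Vector]:
    """Apply the aggregator to each layer's positive and negative activations."""
    return {
        layer: aggregator(pos_by_layer[layer], neg_by_layer[layer])
        for layer in pos_by_layer
    }


# Two samples at layer 0; the differences are [1, 1] and [2, 3].
pos_by_layer = {0: [[1.0, 2.0], [3.0, 4.0]]}
neg_by_layer = {0: [[0.0, 1.0], [1.0, 1.0]]}
result = aggregate_by_layer(pos_by_layer, neg_by_layer)
```

Passing a different callable as aggregator swaps the reduction without changing the per-layer loop.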