updown.modules.attention

class updown.modules.attention.BottomUpTopDownAttention(query_size: int, image_feature_size: int, projection_size: int)[source]

Bases: torch.nn.modules.module.Module

A PyTorch module to compute bottom-up top-down attention (Anderson et al. 2017). Used in UpDownCell.

Parameters
query_size: int

Size of the query vector, typically the output of the Attention LSTM in UpDownCell.

image_feature_size: int

Size of the bottom-up image features.

projection_size: int

Size of the projected image and textual features before computing bottom-up top-down attention weights.
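A minimal usage sketch (the sizes below are illustrative assumptions, not library defaults):

    import torch
    from updown.modules.attention import BottomUpTopDownAttention

    # Illustrative sizes (assumptions, not the library's defaults).
    attention = BottomUpTopDownAttention(
        query_size=1000, image_feature_size=2048, projection_size=512
    )

    query_vector = torch.randn(16, 1000)        # (batch_size, query_size)
    image_features = torch.randn(16, 36, 2048)  # (batch_size, num_boxes, image_feature_size)

    attention_weights = attention(query_vector, image_features)
    print(attention_weights.shape)  # torch.Size([16, 36])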

forward(self, query_vector: torch.Tensor, image_features: torch.Tensor, image_features_mask: Optional[torch.Tensor] = None) → torch.Tensor[source]

Compute attention weights over image features by applying bottom-up top-down attention, using the query vector. The query vector is typically the output of the Attention LSTM in UpDownCell. Both the image features and the query vector are first projected to a common dimension, namely projection_size.

Parameters
query_vector: torch.Tensor

A tensor of shape (batch_size, query_size) used for attending the image features.

image_features: torch.Tensor

A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes may differ across instances in a batch; instances with fewer boxes are zero-padded up to num_boxes.

image_features_mask: torch.Tensor

A mask over image features when num_boxes differs across instances in a batch. Elements where the mask is zero are not attended over.

Returns
torch.Tensor

A tensor of shape (batch_size, num_boxes) containing attention weights over the image features of each instance in the batch. If image_features_mask is provided (for adaptive features), weights are zero wherever the mask is zero.
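The described behavior matches standard additive (concat-style) attention with a masked softmax; the following is a minimal sketch under that assumption, with hypothetical layer names that need not match the module's actual attributes:

    import torch
    from torch import nn

    class AdditiveAttentionSketch(nn.Module):
        # Illustrative re-implementation, not the library's actual internals.
        def __init__(self, query_size: int, image_feature_size: int, projection_size: int):
            super().__init__()
            self._query_projection = nn.Linear(query_size, projection_size)
            self._image_projection = nn.Linear(image_feature_size, projection_size)
            self._score = nn.Linear(projection_size, 1)

        def forward(self, query_vector, image_features, image_features_mask=None):
            # Project both inputs to the shared projection_size.
            projected_query = self._query_projection(query_vector).unsqueeze(1)  # (batch_size, 1, projection_size)
            projected_image = self._image_projection(image_features)             # (batch_size, num_boxes, projection_size)

            # Additive attention: tanh of the sum, then one scalar score per box.
            scores = self._score(torch.tanh(projected_query + projected_image)).squeeze(-1)

            if image_features_mask is not None:
                # Padded boxes get -inf scores before normalization.
                scores = scores.masked_fill(image_features_mask == 0, float("-inf"))

            return torch.softmax(scores, dim=-1)  # (batch_size, num_boxes)

Setting masked scores to -inf before the softmax gives padded boxes exactly zero weight while the remaining weights still sum to one.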

_project_image_features(self, image_features: torch.Tensor) → torch.Tensor

Project image features to a common dimension for applying attention.

For a single training/evaluation instance, the image features remain the same from the first time-step up to the maximum decoding steps. To keep a clean API, we use an LRU cache, which maintains the last 10 return values keyed on the call signature and skips re-execution when called with image features seen at least once in the last 10 calls. This saves some computation; see the sketch at the end of this entry.

Parameters
image_features: torch.Tensor

A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes may differ across instances in a batch; instances with fewer boxes are zero-padded up to num_boxes.

Returns
torch.Tensor

Projected image features of shape (batch_size, num_boxes, projection_size).
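A hedged sketch of this caching pattern using functools.lru_cache (the projection layer is a hypothetical stand-in for the module's internals):

    import functools
    import torch
    from torch import nn

    class CachedProjectionSketch(nn.Module):
        def __init__(self, image_feature_size: int, projection_size: int):
            super().__init__()
            self._image_features_projection = nn.Linear(image_feature_size, projection_size)

        # Keep the last 10 return values, keyed on the call arguments.
        @functools.lru_cache(maxsize=10)
        def _project_image_features(self, image_features: torch.Tensor) -> torch.Tensor:
            return self._image_features_projection(image_features)

Note that torch.Tensor hashes by object identity, so the cache only hits when the caller passes the same tensor object at every decoding time-step, which is exactly the situation described above.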