updown.modules.attention

class updown.modules.attention.BottomUpTopDownAttention(query_size: int, image_feature_size: int, projection_size: int)[source]

Bases: torch.nn.modules.module.Module

A PyTorch module to compute bottom-up top-down attention (Anderson et al. 2017). Used in UpDownCell.

Parameters
query_size: int

Size of the query vector, typically the output of the Attention LSTM in UpDownCell.

image_feature_size: int

Size of the bottom-up image features.

projection_size: int

Size of the projected image and textual features before computing bottom-up top-down attention weights.
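A minimal usage sketch (the sizes below are illustrative assumptions, not library defaults):

    import torch
    from updown.modules.attention import BottomUpTopDownAttention

    # Illustrative sizes (assumptions, not the library's defaults).
    attention = BottomUpTopDownAttention(
        query_size=1000, image_feature_size=2048, projection_size=512
    )

    query_vector = torch.randn(16, 1000)        # (batch_size, query_size)
    image_features = torch.randn(16, 36, 2048)  # (batch_size, num_boxes, image_feature_size)

    attention_weights = attention(query_vector, image_features)
    print(attention_weights.shape)  # torch.Size([16, 36])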

forward(self, query_vector: torch.Tensor, image_features: torch.Tensor, image_features_mask: Optional[torch.Tensor] = None) → torch.Tensor[source]

Compute attention weights over image features by applying bottom-up top-down attention, using the query vector. The query vector is typically the output of the Attention LSTM in UpDownCell. Both the image features and the query vector are first projected to a common dimension, namely projection_size.

Parameters
query_vector: torch.Tensor

A tensor of shape (batch_size, query_size) used for attending the image features.

image_features: torch.Tensor

A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes may differ across instances in a batch; instances with fewer boxes are zero-padded up to num_boxes.

image_features_mask: torch.Tensor

A mask over image features when num_boxes differs across instances in a batch. Elements where the mask is zero are not attended over.

Returns
torch.Tensor

A tensor of shape (batch_size, num_boxes) containing attention weights over the image features of each instance in the batch. If image_features_mask is provided (for adaptive features), weights are zero wherever the mask is zero.
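The described behavior matches standard additive (concat-style) attention with a masked softmax; the following is a minimal sketch under that assumption, with hypothetical layer names that need not match the module's actual attributes:

    import torch
    from torch import nn

    class AdditiveAttentionSketch(nn.Module):
        # Illustrative re-implementation, not the library's actual internals.
        def __init__(self, query_size: int, image_feature_size: int, projection_size: int):
            super().__init__()
            self._query_projection = nn.Linear(query_size, projection_size)
            self._image_projection = nn.Linear(image_feature_size, projection_size)
            self._score = nn.Linear(projection_size, 1)

        def forward(self, query_vector, image_features, image_features_mask=None):
            # Project both inputs to the shared projection_size.
            projected_query = self._query_projection(query_vector).unsqueeze(1)  # (batch_size, 1, projection_size)
            projected_image = self._image_projection(image_features)             # (batch_size, num_boxes, projection_size)

            # Additive attention: tanh of the sum, then one scalar score per box.
            scores = self._score(torch.tanh(projected_query + projected_image)).squeeze(-1)

            if image_features_mask is not None:
                # Padded boxes get -inf scores before normalization.
                scores = scores.masked_fill(image_features_mask == 0, float("-inf"))

            return torch.softmax(scores, dim=-1)  # (batch_size, num_boxes)

Setting masked scores to -inf before the softmax gives padded boxes exactly zero weight while the remaining weights still sum to one.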

_project_image_features(self, image_features: torch.Tensor) → torch.Tensor

Project image features to a common dimension for applying attention.

For a single training/evaluation instance, the image features remain the same from the first time-step up to the maximum decoding steps. To keep a clean API, we use an LRU cache, which maintains the last 10 return values keyed on the call signature and skips re-execution when called with image features seen at least once in the last 10 calls. This saves some computation; see the sketch at the end of this entry.

Parameters
image_features: torch.Tensor

A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes may differ across instances in a batch; instances with fewer boxes are zero-padded up to num_boxes.

Returns
torch.Tensor

Projected image features of shape (batch_size, num_boxes, projection_size).
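A hedged sketch of this caching pattern using functools.lru_cache (the projection layer is a hypothetical stand-in for the module's internals):

    import functools
    import torch
    from torch import nn

    class CachedProjectionSketch(nn.Module):
        def __init__(self, image_feature_size: int, projection_size: int):
            super().__init__()
            self._image_features_projection = nn.Linear(image_feature_size, projection_size)

        # Keep the last 10 return values, keyed on the call arguments.
        @functools.lru_cache(maxsize=10)
        def _project_image_features(self, image_features: torch.Tensor) -> torch.Tensor:
            return self._image_features_projection(image_features)

Note that torch.Tensor hashes by object identity, so the cache only hits when the caller passes the same tensor object at every decoding time-step, which is exactly the situation described above.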