updown.modules.attention
class updown.modules.attention.BottomUpTopDownAttention(query_size: int, image_feature_size: int, projection_size: int)

Bases: torch.nn.modules.module.Module

A PyTorch module to compute bottom-up top-down attention (Anderson et al. 2017). Used in UpDownCell.
Parameters
- query_size: int
  Size of the query vector, typically the output of the Attention LSTM in UpDownCell.
- image_feature_size: int
  Size of the bottom-up image features.
- projection_size: int
  Size of the projected image and textual features before computing bottom-up top-down attention weights.
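The shape of such a module can be sketched as follows. This is a minimal illustration of the bottom-up top-down attention recipe (project both inputs to `projection_size`, combine with tanh, score each box), not the library's exact implementation; the layer names are assumptions.

```python
import torch
from torch import nn


class BottomUpTopDownAttentionSketch(nn.Module):
    # Sketch only: project query and image features to a common
    # `projection_size`, combine additively with tanh, and produce one
    # scalar attention score per box, normalized with a softmax.
    def __init__(self, query_size: int, image_feature_size: int, projection_size: int):
        super().__init__()
        self._query_projection = nn.Linear(query_size, projection_size)
        self._image_projection = nn.Linear(image_feature_size, projection_size)
        self._attention_layer = nn.Linear(projection_size, 1)

    def forward(self, query_vector, image_features, image_features_mask=None):
        # (batch_size, 1, projection_size) broadcasts against
        # (batch_size, num_boxes, projection_size).
        projected_query = self._query_projection(query_vector).unsqueeze(1)
        projected_image = self._image_projection(image_features)
        # One scalar score per box: (batch_size, num_boxes)
        scores = self._attention_layer(
            torch.tanh(projected_query + projected_image)
        ).squeeze(-1)
        if image_features_mask is not None:
            # Padded boxes get -inf so the softmax assigns them zero weight.
            scores = scores.masked_fill(image_features_mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1)
```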
forward(self, query_vector: torch.Tensor, image_features: torch.Tensor, image_features_mask: Optional[torch.Tensor] = None) → torch.Tensor

Compute attention weights over image features by applying bottom-up top-down attention, using the query vector. The query vector is typically the output of the Attention LSTM in UpDownCell. Both the image features and query vectors are first projected to a common dimension, projection_size.
Parameters
- query_vector: torch.Tensor
  A tensor of shape (batch_size, query_size) used for attending over the image features.
- image_features: torch.Tensor
  A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes may differ for each instance in a batch; instances with fewer boxes are padded with zeros up to num_boxes.
- image_features_mask: torch.Tensor
  A mask over the image features if num_boxes differs across instances. Elements where the mask is zero are not attended over.
Returns
- torch.Tensor
  A tensor of shape (batch_size, num_boxes) containing attention weights for the image features of each instance in the batch. If image_features_mask is provided (for adaptive features), the weights where the mask is zero will be zero.
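The masking behaviour described above can be demonstrated in a few lines of plain PyTorch: filling padded positions with -inf before the softmax guarantees their attention weight is exactly zero. The tensors here are illustrative, not taken from the library.

```python
import torch

batch_size, num_boxes = 2, 4
# Raw (unnormalized) attention scores, one per box.
scores = torch.randn(batch_size, num_boxes)
# The second instance has only 2 real boxes; the rest are zero padding.
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])

# exp(-inf) = 0, so padded boxes receive exactly zero weight.
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)  # (batch_size, num_boxes)
```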
_project_image_features(self, image_features: torch.Tensor) → torch.Tensor

Project image features to a common dimension for applying attention.

For a single training/evaluation instance, the image features remain the same from the first time-step up to the maximum decoding steps. To keep a clean API, we use an LRU cache, which maintains a cache of the last 10 return values keyed on the call signature, so the projection is not recomputed when the method is called with the same image features seen at least once in the last 10 calls. This saves some computation.
Parameters
- image_features: torch.Tensor
  A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes may differ for each instance in a batch; instances with fewer boxes are padded with zeros up to num_boxes.
Returns
- torch.Tensor
  Projected image features of shape (batch_size, num_boxes, projection_size).