updown.modules.updown_cell

class updown.modules.updown_cell.UpDownCell(image_feature_size: int, embedding_size: int, hidden_size: int, attention_projection_size: int)[source]

Bases: torch.nn.modules.module.Module

The basic computation unit of UpDownCaptioner.

The architecture (Anderson et al. 2017, Fig. 3) is as follows:

                                h2 (t)
                                   ^
                                   |
                   +--------------------------------+
    h2 (t-1) ----> |         Language LSTM          | ----> h2 (t)
                   +--------------------------------+
                       ^           ^
                       |           |
bottom-up     +----------------+   |
features  --> | BUTD Attention |   |
              +----------------+   |
                       ^           |
                       |___________|
                                   |
                   +--------------------------------+
    h1 (t-1) ----> |         Attention LSTM         | ----> h1 (t)
                   +--------------------------------+
                                   ^
                 __________________|__________________
                 |                 |                  |
                 |            mean pooled        input token
              h2 (t-1)         features           embedding

If UpDownCaptioner is analogous to an LSTM, then this class would be analogous to LSTMCell.
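The following is a minimal sketch of the data flow in the diagram above, written with two torch.nn.LSTMCell modules and a simple additive (BUTD-style) attention. The module names and the exact attention formulation are assumptions for illustration only, not the actual UpDownCell internals.

    import torch
    from torch import nn

    class UpDownCellSketch(nn.Module):
        """Illustrative sketch only; not the actual UpDownCell implementation."""

        def __init__(self, image_feature_size, embedding_size, hidden_size,
                     attention_projection_size):
            super().__init__()
            # Attention LSTM input: h2 (t-1) + mean pooled features + token embedding.
            self.attention_lstm = nn.LSTMCell(
                hidden_size + image_feature_size + embedding_size, hidden_size
            )
            # Language LSTM input: attended image features + h1 (t).
            self.language_lstm = nn.LSTMCell(
                image_feature_size + hidden_size, hidden_size
            )
            # Additive attention over bottom-up features, queried by h1 (t).
            self.features_proj = nn.Linear(image_feature_size, attention_projection_size)
            self.query_proj = nn.Linear(hidden_size, attention_projection_size)
            self.attention_scores = nn.Linear(attention_projection_size, 1)

        def forward(self, image_features, token_embedding, h1, c1, h2, c2):
            # Mean pooled features (padding ignored here; see _average_image_features).
            mean_features = image_features.mean(dim=1)

            # Attention LSTM step.
            h1, c1 = self.attention_lstm(
                torch.cat([h2, mean_features, token_embedding], dim=1), (h1, c1)
            )

            # BUTD attention: score each box against h1, then take a weighted sum.
            scores = self.attention_scores(torch.tanh(
                self.features_proj(image_features) + self.query_proj(h1).unsqueeze(1)
            )).squeeze(-1)                                  # (batch_size, num_boxes)
            weights = torch.softmax(scores, dim=-1)
            attended = (weights.unsqueeze(-1) * image_features).sum(dim=1)

            # Language LSTM step; h2 (t) is the output token embedding for this step.
            h2, c2 = self.language_lstm(torch.cat([attended, h1], dim=1), (h2, c2))
            return h2, {"h1": h1, "c1": c1, "h2": h2, "c2": c2}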

Parameters
image_feature_size: int

Size of the bottom-up image features.

embedding_size: int

Size of the word embedding input to the captioner.

hidden_size: int

Size of the hidden / cell states of attention LSTM and language LSTM of the captioner.

attention_projection_size: int

Size of the projected image and textual features before computing bottom-up top-down attention weights.
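For orientation, a hypothetical instantiation; the sizes below are illustrative and do not imply library defaults.

    from updown.modules.updown_cell import UpDownCell

    cell = UpDownCell(
        image_feature_size=2048,        # e.g. bottom-up features from an object detector
        embedding_size=300,
        hidden_size=1200,
        attention_projection_size=768,
    )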

forward(self, image_features:torch.Tensor, token_embedding:torch.Tensor, states:Union[Dict[str, torch.Tensor], NoneType]=None) → Tuple[torch.Tensor, Dict[str, torch.Tensor]][source]

Given image features, the input token embedding for the current time-step, and LSTM states, predict the output token embedding for the next time-step and update the states. This behaves very similarly to LSTMCell.

Parameters
image_features: torch.Tensor

A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes may differ across instances in a batch; instances with fewer boxes are padded with zeros up to num_boxes.

token_embedding: torch.Tensor

A tensor of shape (batch_size, embedding_size) containing token embeddings for a particular time-step.

states: Dict[str, torch.Tensor], optional (default = None)

A dict with keys {"h1", "c1", "h2", "c2"} of LSTM states: (h1, c1) for the Attention LSTM and (h2, c2) for the Language LSTM. If not provided (at the first time-step), these are initialized as zeros.

Returns
Tuple[torch.Tensor, Dict[str, torch.Tensor]]

A tensor of shape (batch_size, hidden_size) containing the output token embedding, which is the updated state "h2", along with a dict of updated states (h1, c1) and (h2, c2).
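A sketch of typical usage across decoding time-steps, feeding the returned states back in; the tensors below are random placeholders, and the cell is the one instantiated in the constructor example above.

    import torch

    batch_size, num_boxes = 4, 36
    max_decoding_steps = 20

    # Placeholder inputs; in practice these come from a detector and an embedding layer.
    image_features = torch.randn(batch_size, num_boxes, 2048)
    token_embedding = torch.randn(batch_size, 300)

    states = None  # zero-initialized inside the cell at the first time-step
    for t in range(max_decoding_steps):
        output, states = cell(image_features, token_embedding, states)
        # `output` has shape (batch_size, hidden_size); project it to vocabulary logits
        # and embed the predicted token to obtain `token_embedding` for the next step.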

_average_image_features(self, image_features:torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]

Perform mean pooling of bottom-up image features, while taking care of variable num_boxes in case of adaptive features.

For a single training/evaluation instance, the image features remain the same from the first time-step to the maximum decoding step. To keep a clean API, we use an LRU cache, which holds the last 10 return values keyed on the call signature and skips execution when called with image features already seen in the last 10 calls. This saves some computation.

Parameters
image_features: torch.Tensor

A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes may differ across instances in a batch; instances with fewer boxes are padded with zeros up to num_boxes.

Returns
Tuple[torch.Tensor, torch.Tensor]

Averaged image features of shape (batch_size, image_feature_size) and a binary mask of shape (batch_size, num_boxes) which is zero for padded features.
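A minimal sketch of masked mean pooling combined with functools.lru_cache, matching the description above; the cache size of 10 comes from the text, while the body is an assumption. Since torch.Tensor hashes by object identity, the cache only helps when the very same tensor object is passed again, which is the case across decoding time-steps.

    import functools
    from typing import Tuple

    import torch


    class PoolingSketch:
        @functools.lru_cache(maxsize=10)
        def _average_image_features(
            self, image_features: torch.Tensor
        ) -> Tuple[torch.Tensor, torch.Tensor]:
            # Padded boxes are all-zero rows: mark real boxes with 1, padding with 0.
            image_features_mask = (image_features.abs().sum(dim=-1) > 0).float()

            # Sum over real boxes and divide by the true number of boxes per instance.
            averaged_image_features = image_features.sum(dim=1) / image_features_mask.sum(
                dim=1, keepdim=True
            ).clamp(min=1)
            return averaged_image_features, image_features_mask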