updown.modules.updown_cell¶

class updown.modules.updown_cell.UpDownCell(image_feature_size: int, embedding_size: int, hidden_size: int, attention_projection_size: int)[source]¶

Bases: torch.nn.modules.module.Module

The basic computation unit of UpDownCaptioner. The architecture (Anderson et al. 2017 (Fig. 3)) is as follows:

                                               h2 (t)
                                                .^.
                                                 |
                      +--------------------------------+
        h2 (t-1) ---->|          Language LSTM         |----> h2 (t)
                      +--------------------------------+
                           .^.               .^.
                            |                 |
        bottom-up     +----------------+      |
        features ---->| BUTD Attention |      |
                      +----------------+      |
                           .^.                |
                            |_________________|
                                     |
                      +--------------------------------+
        h1 (t-1) ---->|         Attention LSTM         |----> h1 (t)
                      +--------------------------------+
                                    .^.
                   __________________|__________________
                  |                  |                  |
             mean pooled        input token         h2 (t-1)
              features           embedding
If UpDownCaptioner is analogous to an LSTM, then this class would be analogous to LSTMCell.

- Parameters
- image_feature_size: int
Size of the bottom-up image features.
- embedding_size: int
Size of the word embedding input to the captioner.
- hidden_size: int
Size of the hidden / cell states of attention LSTM and language LSTM of the captioner.
- attention_projection_size: int
Size of the projected image and textual features before computing bottom-up top-down attention weights.
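As a quick orientation, the cell can be constructed directly from these four sizes. Below is a minimal sketch; the concrete sizes are illustrative assumptions, not values mandated by the library:

    from updown.modules.updown_cell import UpDownCell

    # Illustrative sizes (assumptions, not prescribed by the library):
    # 2048-d bottom-up features, 1000-d word embeddings, 1200-d LSTM states,
    # 768-d attention projection.
    cell = UpDownCell(
        image_feature_size=2048,
        embedding_size=1000,
        hidden_size=1200,
        attention_projection_size=768,
    )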
forward(self, image_features: torch.Tensor, token_embedding: torch.Tensor, states: Union[Dict[str, torch.Tensor], NoneType] = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor]][source]¶

Given image features, the input token embedding of the current time-step, and LSTM states, predict the output token embedding for the next time-step and update the states. This behaves very similarly to LSTMCell.

- Parameters
- image_features: torch.Tensor
A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes may differ for each instance in a batch; instances with fewer boxes are padded with zeros up to num_boxes.
- token_embedding: torch.Tensor
A tensor of shape (batch_size, embedding_size) containing token embeddings for a particular time-step.
- states: Dict[str, torch.Tensor], optional (default = None)
A dict with keys {"h1", "c1", "h2", "c2"} of LSTM states: (h1, c1) for the Attention LSTM and (h2, c2) for the Language LSTM. If not provided (at the first time-step), these are initialized as zeros.
- Returns
- Tuple[torch.Tensor, Dict[str, torch.Tensor]]
A tensor of shape (batch_size, hidden_size) with the output token embedding, which is the updated state "h2", and the updated states (h1, c1), (h2, c2).
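A minimal decoding-loop sketch using the forward signature above; the tensors are random stand-ins and the loop length is arbitrary, so treat this as illustrative rather than the captioner's actual decoding logic:

    import torch

    batch_size, num_boxes = 4, 36
    image_features = torch.randn(batch_size, num_boxes, 2048)  # (batch_size, num_boxes, image_feature_size)
    token_embedding = torch.randn(batch_size, 1000)            # stand-in for a start-token embedding

    states = None  # initialized as zeros inside the cell at the first time-step
    for t in range(20):  # arbitrary maximum decoding steps
        # `cell` is the UpDownCell constructed earlier; calling it invokes
        # forward() and returns (output, updated states).
        output, states = cell(image_features, token_embedding, states)
        # `output` has shape (batch_size, hidden_size) and is the updated "h2";
        # a captioner would project it to vocabulary logits and embed the
        # predicted token as the next `token_embedding`.
        token_embedding = torch.randn(batch_size, 1000)  # stand-in for the next token embedding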
_average_image_features(self, image_features: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor]¶

Perform mean pooling of bottom-up image features, while taking care of variable num_boxes in the case of adaptive features.

For a single training/evaluation instance, the image features remain the same from the first time-step to the maximum decoding steps. To keep a clean API, we use an LRU cache, which maintains a cache of the last 10 return values based on the call signature, and does not actually execute if it is called with the same image features seen at least once in the last 10 calls. This saves some computation.
- Parameters
- image_features: torch.Tensor
A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes may differ for each instance in a batch; instances with fewer boxes are padded with zeros up to num_boxes.
- Returns
- Tuple[torch.Tensor, torch.Tensor]
Averaged image features of shape (batch_size, image_feature_size) and a binary mask of shape (batch_size, num_boxes) which is zero for padded features.
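The masked averaging described here can be sketched as follows. This is a hypothetical, self-contained implementation that assumes padded boxes are all-zero rows; it illustrates the technique and is not necessarily the library's exact code:

    import torch
    from typing import Tuple

    def average_image_features(image_features: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        # Padded boxes are all-zero feature vectors, so any row with a
        # non-zero entry counts as a real box.
        image_features_mask = (image_features.abs().sum(dim=-1) > 0).float()  # (batch_size, num_boxes)

        # Sum only the real features and divide by the per-instance box count
        # (clamped to avoid division by zero for a fully padded instance).
        summed_features = (image_features * image_features_mask.unsqueeze(-1)).sum(dim=1)
        num_real_boxes = image_features_mask.sum(dim=1, keepdim=True).clamp(min=1)
        averaged_image_features = summed_features / num_real_boxes  # (batch_size, image_feature_size)

        return averaged_image_features, image_features_mask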