updown.models.updown_captioner

class updown.models.updown_captioner.UpDownCaptioner(vocabulary: allennlp.data.vocabulary.Vocabulary, image_feature_size: int, embedding_size: int, hidden_size: int, attention_projection_size: int, max_caption_length: int = 20, beam_size: int = 1, use_cbs: bool = False, min_constraints_to_satisfy: int = 2)[source]

Bases: torch.nn.modules.module.Module

Image captioning model using bottom-up top-down attention, as in Anderson et al. 2017. At training time, this model maximizes the likelihood of the ground truth caption given image features. At inference time, given image features, captions are decoded using beam search.

This captioner is essentially a recurrent language model for caption sequences. Internally, it runs an UpDownCell for multiple time-steps. If this class is analogous to an LSTM, then UpDownCell is analogous to LSTMCell. An instantiation sketch follows the parameter list below.

Parameters
vocabulary: allennlp.data.Vocabulary

AllenNLP's vocabulary containing the token-to-index mapping for the caption vocabulary.

image_feature_size: int

Size of the bottom-up image features.

embedding_size: int

Size of the word embedding input to the captioner.

hidden_size: int

Size of the hidden and cell states of the attention LSTM and language LSTM of the captioner.

attention_projection_size: int

Size of the projected image and textual features before computing bottom-up top-down attention weights.

max_caption_length: int, optional (default = 20)

Maximum length of caption sequences for language modeling. Captions longer than this are truncated to this maximum length.

beam_size: int, optional (default = 1)

Beam size for finding the most likely caption during decoding time (evaluation).

use_cbs: bool, optional (default = False)

Whether to use ConstrainedBeamSearch for decoding.

min_constraints_to_satisfy: int, optional (default = 2)

Minimum number of constraints to satisfy for CBS, used for selecting the best beam. This will be ignored when use_cbs is False.
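
A minimal instantiation sketch, assuming a vocabulary serialized with AllenNLP; the vocabulary directory and all hyperparameter values below are illustrative placeholders, not values mandated by this class:

    from allennlp.data import Vocabulary
    from updown.models.updown_captioner import UpDownCaptioner

    # The vocabulary directory and the sizes below are placeholder assumptions.
    vocabulary = Vocabulary.from_files("data/vocabulary")
    captioner = UpDownCaptioner(
        vocabulary,
        image_feature_size=2048,
        embedding_size=300,
        hidden_size=1200,
        attention_projection_size=768,
        max_caption_length=20,
        beam_size=5,
    )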

classmethod from_config(config: updown.config.Config, **kwargs)[source]

Instantiate this class directly from a Config.
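
A hedged usage sketch of from_config; the YAML and vocabulary paths are placeholders, and passing the vocabulary through **kwargs is an assumption made here only for illustration:

    from allennlp.data import Vocabulary
    from updown.config import Config
    from updown.models.updown_captioner import UpDownCaptioner

    # "configs/updown.yaml" and "data/vocabulary" are placeholder paths; Config is
    # assumed to accept a YAML config path.
    _C = Config("configs/updown.yaml")
    vocabulary = Vocabulary.from_files("data/vocabulary")
    captioner = UpDownCaptioner.from_config(_C, vocabulary=vocabulary)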

_initialize_glove(self) → torch.Tensor[source]

Initialize embeddings of all the tokens in a given Vocabulary by their GloVe vectors.

It is recommended to train an UpDownCaptioner with frozen word embeddings when one wishes to perform Constrained Beam Search decoding during inference. This is because the constraint words may not appear in the caption vocabulary (they are out of domain), so their embeddings would never be updated during training. Initializing with frozen GloVe embeddings is helpful because they capture more meaningful semantics than randomly initialized embeddings.

Returns
torch.Tensor

GloVe embeddings corresponding to tokens in the vocabulary.
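
The freezing itself is not performed by this method. A minimal sketch of freezing the word embeddings after GloVe initialization, assuming an embedding layer attribute named _embedding_layer (that attribute name is a guess used only for illustration):

    # Hypothetical illustration: freeze GloVe-initialized word embeddings so they
    # are not updated during training. `_embedding_layer` is an assumed attribute name.
    for param in captioner._embedding_layer.parameters():
        param.requires_grad = False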

forward(self, image_features: torch.Tensor, caption_tokens: Optional[torch.Tensor] = None, fsm: torch.Tensor = None, num_constraints: torch.Tensor = None) → Dict[str, torch.Tensor][source]

Given bottom-up image features, maximize the likelihood of paired captions during training. During evaluation, decode captions given image features using beam search.

Parameters
image_features: torch.Tensor

A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes for each instance in a batch might be different. Instances with fewer boxes are padded with zeros up to num_boxes.

caption_tokens: torch.Tensor, optional (default = None)

A tensor of shape (batch_size, max_caption_length) of tokenized captions. This tensor does not contain @@BOUNDARY@@ tokens yet. Captions are not provided during evaluation.

fsm: torch.Tensor, optional (default = None)

A tensor of shape (batch_size, num_states, num_states, vocab_size): per-instance finite state machines, each represented as an adjacency matrix. For a particular instance, fsm[_, s1, s2, v] = 1 indicates a transition from state s1 to state s2 when token v (a constraint word) is decoded. This is None for regular beam search decoding.

num_constraints: torch.Tensor, optional (default = None)

A tensor of shape (batch_size, ) containing the total number of given constraints for CBS. This is None for regular beam search decoding.

Returns
Dict[str, torch.Tensor]

Decoded captions and/or per-instance cross entropy loss: a dict with key "predictions" and/or "loss".
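
A hedged sketch of both call modes, continuing from the instantiation sketch above; the batch size, number of boxes, and random tensors are placeholders:

    import torch

    batch_size, num_boxes, image_feature_size = 2, 36, 2048
    image_features = torch.randn(batch_size, num_boxes, image_feature_size)

    # Training mode: provide caption tokens and read the per-instance loss.
    caption_tokens = torch.randint(0, vocabulary.get_vocab_size(), (batch_size, 20))
    captioner.train()
    loss = captioner(image_features, caption_tokens)["loss"].mean()  # reduce the (batch_size, ) losses

    # Evaluation mode: omit caption tokens and read beam-searched predictions.
    captioner.eval()
    with torch.no_grad():
        predictions = captioner(image_features)["predictions"]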

_decode_step(self, image_features: torch.Tensor, previous_predictions: torch.Tensor, states: Optional[Dict[str, torch.Tensor]] = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor]][source]

Given image features, the tokens predicted at the previous time-step, and the LSTM states of the UpDownCell, take a decoding step. This is also called by the beam search class.

Parameters
image_features: torch.Tensor

A tensor of shape (batch_size, num_boxes, image_feature_size).

previous_predictions: torch.Tensor

A tensor of shape (batch_size * net_beam_size, ) containing tokens predicted at the previous time-step, one per beam, per instance in a batch. net_beam_size is 1 during teacher forcing (training), beam_size for regular allennlp.nn.beam_search.BeamSearch, and beam_size * num_states for updown.modules.cbs.ConstrainedBeamSearch.

states: Dict[str, torch.Tensor], optional (default = None)

LSTM states of the UpDownCell. These are initialized as zero tensors if not provided (at first time-step).
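
A hedged sketch of a manual greedy decoding loop built on this step function, continuing from the sketches above. It assumes the first returned tensor holds per-token log-probabilities of shape (batch_size * net_beam_size, vocab_size), that decoding starts from the @@BOUNDARY@@ token, and that this token lives in the default "tokens" namespace; none of these details are guaranteed here:

    import torch

    # Greedy decoding for illustration only; the class itself decodes with beam search.
    boundary_index = vocabulary.get_token_index("@@BOUNDARY@@", namespace="tokens")
    predictions = torch.full(
        (image_features.size(0),), boundary_index, dtype=torch.long
    )

    states, decoded = None, []
    for _ in range(20):  # max_caption_length
        logprobs, states = captioner._decode_step(image_features, predictions, states)
        predictions = logprobs.argmax(dim=-1)  # most likely token per instance
        decoded.append(predictions)
    captions = torch.stack(decoded, dim=1)  # (batch_size, max_caption_length)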

_get_loss(self, logits: torch.Tensor, targets: torch.Tensor, target_mask: torch.Tensor) → torch.Tensor[source]

Compute the cross entropy loss of the predicted caption (logits) w.r.t. the target caption. The loss of a caption is the cross entropy at each time-step, summed over time-steps.

Parameters
logits: torch.Tensor

A tensor of shape (batch_size, max_caption_length - 1, vocab_size) containing unnormalized log-probabilities of predicted captions.

targets: torch.Tensor

A tensor of shape (batch_size, max_caption_length - 1) of tokenized target captions.

target_mask: torch.Tensor

A mask over target captions; elements where the mask is zero are excluded from the loss computation. Here, the @@UNKNOWN@@ token is ignored (and hence padding tokens too, because they are treated the same).

Returns
torch.Tensor

A tensor of shape (batch_size, ) containing cross entropy loss of captions, summed across time-steps.
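
A minimal standalone sketch of the masked, per-time-step cross entropy summed over time, as described above (an illustration of the computation, not the method's actual implementation):

    import torch
    import torch.nn.functional as F

    def summed_masked_cross_entropy(
        logits: torch.Tensor,       # (batch_size, max_caption_length - 1, vocab_size)
        targets: torch.Tensor,      # (batch_size, max_caption_length - 1)
        target_mask: torch.Tensor,  # same shape as targets, 0 where ignored
    ) -> torch.Tensor:
        # Per-token cross entropy without reduction, reshaped back to (batch, time).
        per_token_loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
        ).reshape(targets.shape)
        # Zero out ignored positions, then sum across time-steps: one value per caption.
        return (per_token_loss * target_mask.float()).sum(dim=-1)  # (batch_size, )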