updown.models.updown_captioner

class updown.models.updown_captioner.UpDownCaptioner(vocabulary: allennlp.data.vocabulary.Vocabulary, image_feature_size: int, embedding_size: int, hidden_size: int, attention_projection_size: int, max_caption_length: int = 20, beam_size: int = 1, use_cbs: bool = False, min_constraints_to_satisfy: int = 2)

Bases: torch.nn.modules.module.Module

Image captioning model using bottom-up top-down attention, as in Anderson et al. 2017. At training time, this model maximizes the likelihood of the ground truth caption given image features. At inference time, given image features, captions are decoded using beam search.

This captioner is essentially a recurrent language model for caption sequences. Internally, it runs UpDownCell for multiple time-steps. If this class is analogous to an LSTM, then UpDownCell would be analogous to LSTMCell. A construction sketch follows the parameter list below.

- Parameters
    - vocabulary: allennlp.data.Vocabulary
      AllenNLP's vocabulary containing the token-to-index mapping for the caption vocabulary.
    - image_feature_size: int
      Size of the bottom-up image features.
    - embedding_size: int
      Size of the word embedding input to the captioner.
    - hidden_size: int
      Size of the hidden / cell states of the attention LSTM and language LSTM of the captioner.
    - attention_projection_size: int
      Size of the projected image and textual features before computing bottom-up top-down attention weights.
    - max_caption_length: int, optional (default = 20)
      Maximum length of caption sequences for language modeling. Captions longer than this are truncated to the maximum length.
    - beam_size: int, optional (default = 1)
      Beam size for finding the most likely caption during decoding time (evaluation).
    - use_cbs: bool, optional (default = False)
      Whether to use ConstrainedBeamSearch for decoding.
    - min_constraints_to_satisfy: int, optional (default = 2)
      Minimum number of constraints to satisfy for CBS, used for selecting the best beam. This is ignored when use_cbs is False.
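A minimal construction sketch, assuming the caption vocabulary has already been built; the hyperparameter values below are illustrative, not prescribed by this class:

```python
from allennlp.data.vocabulary import Vocabulary
from updown.models.updown_captioner import UpDownCaptioner

# In practice, load the serialized caption vocabulary produced during
# preprocessing; an empty Vocabulary() is used here only to keep the sketch short.
vocabulary = Vocabulary()

model = UpDownCaptioner(
    vocabulary=vocabulary,
    image_feature_size=2048,        # dimensionality of each bottom-up box feature
    embedding_size=300,             # word embedding size
    hidden_size=1200,               # attention LSTM / language LSTM state size
    attention_projection_size=768,  # projection size for attention scoring
    max_caption_length=20,
    beam_size=5,
)
```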
classmethod from_config(config: updown.config.Config, **kwargs)

    Instantiate this class directly from a Config.
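A hedged sketch of this entry point; the YAML path is hypothetical, and passing the vocabulary through **kwargs is an assumption rather than documented behaviour:

```python
from allennlp.data.vocabulary import Vocabulary
from updown.config import Config
from updown.models.updown_captioner import UpDownCaptioner

# Hypothetical experiment config; keyword arguments not covered by the config
# (here the vocabulary) are assumed to be forwarded to the constructor.
config = Config("configs/updown_example.yaml")
model = UpDownCaptioner.from_config(config, vocabulary=Vocabulary())
```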
_initialize_glove(self) → torch.Tensor

    Initialize embeddings of all the tokens in a given Vocabulary by their GloVe vectors.

    It is recommended to train an UpDownCaptioner with frozen word embeddings when one wishes to perform Constrained Beam Search decoding during inference. This is because the constraint words may not appear in the caption vocabulary (out of domain), so their embeddings would never be updated during training. Initializing with frozen GloVe embeddings helps, because they capture more meaningful semantics than randomly initialized embeddings. A plain-PyTorch sketch of this idea follows the Returns block.

    - Returns
        - torch.Tensor
          GloVe embeddings corresponding to tokens.
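A minimal sketch of the frozen-embedding idea in plain PyTorch (not this class's internal code); the random tensor stands in for the (vocab_size, embedding_size) matrix that _initialize_glove returns:

```python
import torch
import torch.nn as nn

# Placeholder for the GloVe matrix returned by _initialize_glove; random values
# are used here only so the sketch is runnable on its own.
glove_vectors = torch.randn(10000, 300)

# Build an embedding layer whose weights stay fixed during training, so that
# out-of-domain constraint words keep their pretrained semantics.
embedding = nn.Embedding.from_pretrained(glove_vectors, freeze=True)
```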
forward(self, image_features: torch.Tensor, caption_tokens: Union[torch.Tensor, NoneType] = None, fsm: torch.Tensor = None, num_constraints: torch.Tensor = None) → Dict[str, torch.Tensor]

    Given bottom-up image features, maximize the likelihood of paired captions during training. During evaluation, decode captions given image features using beam search. A usage sketch follows the Returns block.

    - Parameters
        - image_features: torch.Tensor
          A tensor of shape (batch_size, num_boxes * image_feature_size). num_boxes might differ for each instance in a batch; instances with fewer boxes are padded with zeros up to num_boxes.
        - caption_tokens: torch.Tensor, optional (default = None)
          A tensor of shape (batch_size, max_caption_length) of tokenized captions. This tensor does not contain @@BOUNDARY@@ tokens yet. Captions are not provided during evaluation.
        - fsm: torch.Tensor, optional (default = None)
          A tensor of shape (batch_size, num_states, num_states, vocab_size): finite state machines per instance, represented as adjacency matrices. For a particular instance, [_, s1, s2, v] = 1 denotes a transition from state s1 to state s2 on decoding token v (a constraint). Would be None for regular beam search decoding.
        - num_constraints: torch.Tensor, optional (default = None)
          A tensor of shape (batch_size, ) containing the total number of given constraints for CBS. Would be None for regular beam search decoding.
    - Returns
        - Dict[str, torch.Tensor]
          Decoded captions and/or per-instance cross entropy loss, a dict with keys either {"predictions"} or {"loss"}.
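A usage sketch with random tensors in place of real data; model and vocabulary are assumed to be set up as in the construction sketch above, the explicit box dimension mirrors the shape described for _decode_step, and reducing the per-instance loss with a mean is an assumption about the training loop:

```python
import torch

batch_size, num_boxes, image_feature_size = 2, 36, 2048
image_features = torch.randn(batch_size, num_boxes, image_feature_size)

# Training: captions are provided, so the output dict carries "loss"
# (per-instance values, reduced here with a simple mean).
caption_tokens = torch.randint(0, vocabulary.get_vocab_size(), (batch_size, 20))
model.train()
loss = model(image_features, caption_tokens)["loss"].mean()

# Evaluation: no captions, so beam search decoding produces "predictions".
model.eval()
with torch.no_grad():
    predictions = model(image_features)["predictions"]
```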
_decode_step(self, image_features: torch.Tensor, previous_predictions: torch.Tensor, states: Union[Dict[str, torch.Tensor], NoneType] = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor]]

    Given image features, the tokens predicted at the previous time-step, and the LSTM states of the UpDownCell, take a decoding step. This is also called by the beam search class. A hedged sketch of the step interface follows the parameter list.

    - Parameters
        - image_features: torch.Tensor
          A tensor of shape (batch_size, num_boxes, image_feature_size).
        - previous_predictions: torch.Tensor
          A tensor of shape (batch_size * net_beam_size, ) containing tokens predicted at the previous time-step, one for each beam, for each instance in a batch. net_beam_size is 1 during teacher forcing (training), beam_size for regular allennlp.nn.beam_search.BeamSearch, and beam_size * num_states for updown.modules.cbs.ConstrainedBeamSearch.
        - states: Dict[str, torch.Tensor], optional (default = None)
          LSTM states of the UpDownCell. These are initialized as zero tensors if not provided (at the first time-step).
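A hedged sketch of how a step function with this signature plugs into beam search; the return order (log-probabilities first, then updated states) is inferred from the type annotation rather than stated here, and the start-token indices are placeholders:

```python
from functools import partial

import torch

# Bind the image features so the resulting callable only takes the per-beam
# predictions and states, the kind of step interface beam search expects.
step = partial(model._decode_step, image_features)

# First time-step: placeholder start-token indices, and states=None so the
# cell initializes zero states internally.
previous_predictions = torch.zeros(image_features.size(0), dtype=torch.long)
log_probabilities, states = step(previous_predictions, states=None)
```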
_get_loss(self, logits: torch.Tensor, targets: torch.Tensor, target_mask: torch.Tensor) → torch.Tensor

    Compute cross entropy loss of the predicted caption (logits) w.r.t. the target caption. The cross entropy loss of a caption is the cross entropy loss at each time-step, summed. A generic sketch of this reduction follows the Returns block.

    - Parameters
        - logits: torch.Tensor
          A tensor of shape (batch_size, max_caption_length - 1, vocab_size) containing unnormalized log-probabilities of predicted captions.
        - targets: torch.Tensor
          A tensor of shape (batch_size, max_caption_length - 1) of tokenized target captions.
        - target_mask: torch.Tensor
          A mask over target captions; elements where the mask is zero are ignored in the loss computation. Here, we ignore the @@UNKNOWN@@ token (and hence padding tokens too, because they are essentially the same).
    - Returns
        - torch.Tensor
          A tensor of shape (batch_size, ) containing the cross entropy loss of captions, summed across time-steps.
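A generic, minimal sketch of time-summed masked cross entropy, to make the reduction explicit; it is not necessarily this method's exact implementation:

```python
import torch
import torch.nn.functional as F

def masked_caption_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        target_mask: torch.Tensor) -> torch.Tensor:
    # logits: (batch_size, T, vocab_size); targets, target_mask: (batch_size, T)
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).reshape(targets.shape)
    # Zero out masked positions, then sum over time-steps -> (batch_size, )
    return (per_token * target_mask.float()).sum(dim=1)
```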