Bases: torch.nn.modules.module.Module
Image captioning model using bottom-up top-down attention, as in Anderson et al. 2017. At training time, this model maximizes the likelihood of the ground truth caption given image features. At inference time, given image features, captions are decoded using beam search.
This captioner is essentially a recurrent language model for caption sequences. Internally, it runs an UpDownCell for multiple time-steps. If this class is analogous to an LSTM, then UpDownCell would be analogous to an LSTMCell.
- Parameters
- vocabulary: allennlp.data.Vocabulary
AllenNLP's vocabulary containing the token-to-index mapping for the caption vocabulary.
- image_feature_size: int
Size of the bottom-up image features.
- embedding_size: int
Size of the word embedding input to the captioner.
- hidden_size: int
Size of the hidden / cell states of the attention LSTM and the language LSTM of the captioner.
- attention_projection_size: int
Size of the projected image and textual features before computing bottom-up top-down
attention weights.
- max_caption_length: int, optional (default = 20)
Maximum length of caption sequences for language modeling. Captions longer than this will be truncated to this maximum length.
- beam_size: int, optional (default = 1)
Beam size for finding the most likely caption during decoding time (evaluation).
- use_cbs: bool, optional (default = False)
Whether to use ConstrainedBeamSearch for decoding.
- min_constraints_to_satisfy: int, optional (default = 2)
Minimum number of constraints to satisfy for CBS, used for selecting the best beam. This will be ignored when use_cbs is False.
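The parameters above map directly onto constructor arguments. A minimal instantiation sketch follows; the module path, vocabulary location, and all concrete sizes are illustrative assumptions rather than values fixed by this documentation.

```python
from allennlp.data import Vocabulary

# Module path is assumed; adjust it to wherever UpDownCaptioner lives in your checkout.
from updown.models.updown_captioner import UpDownCaptioner

# Hypothetical serialized vocabulary directory.
vocabulary = Vocabulary.from_files("data/vocabulary")

model = UpDownCaptioner(
    vocabulary=vocabulary,
    image_feature_size=2048,        # bottom-up feature size (assumed)
    embedding_size=300,             # word embedding size (assumed, 300-d to match GloVe)
    hidden_size=1200,               # attention / language LSTM state size (assumed)
    attention_projection_size=768,  # attention projection size (assumed)
    max_caption_length=20,
    beam_size=5,
)
```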
- classmethod from_config(config: updown.config.Config, **kwargs)[source]
Instantiate this class directly from a Config.
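A hedged sketch of the config-driven path, continuing the instantiation example above. It assumes Config can be constructed from a YAML file path and that arguments not covered by the config (here, the vocabulary) are forwarded through **kwargs; the file name is hypothetical.

```python
from updown.config import Config

config = Config("configs/updown_nocaps_val.yaml")  # hypothetical config file

# Keyword arguments are assumed to be forwarded to the constructor.
model = UpDownCaptioner.from_config(config, vocabulary=vocabulary)
```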
- _initialize_glove(self) → torch.Tensor[source]
Initialize the embeddings of all tokens in a given Vocabulary with their GloVe vectors.
It is recommended to train an UpDownCaptioner with frozen word embeddings when one wishes to perform Constrained Beam Search decoding during inference. This is because the constraint words may not appear in the caption vocabulary (they are out of domain), and their embeddings will never be updated during training. Initializing with frozen GloVe embeddings is helpful because they capture more meaningful semantics than randomly initialized embeddings.
- Returns
- torch.Tensor
GloVe embeddings corresponding to the vocabulary tokens.
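To make the described behavior concrete, here is a minimal sketch of assembling such an embedding tensor from a GloVe text file. The helper name, the file path, the "tokens" namespace, and zero-initializing tokens missing from GloVe are assumptions for illustration, not the method's actual implementation.

```python
import torch
from allennlp.data import Vocabulary

def glove_embeddings(vocabulary: Vocabulary, glove_path: str, embedding_size: int = 300) -> torch.Tensor:
    # Parse "word v1 v2 ... vN" lines from a GloVe text file (path assumed).
    glove = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            glove[word] = torch.tensor([float(v) for v in values])

    # One row per vocabulary token; tokens absent from GloVe stay zero-initialized (assumed).
    token_to_index = vocabulary.get_token_to_index_vocabulary("tokens")  # namespace assumed
    embeddings = torch.zeros(len(token_to_index), embedding_size)
    for token, index in token_to_index.items():
        if token in glove:
            embeddings[index] = glove[token]
    return embeddings
```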
- forward(self, image_features: torch.Tensor, caption_tokens: Optional[torch.Tensor] = None, fsm: torch.Tensor = None, num_constraints: torch.Tensor = None) → Dict[str, torch.Tensor][source]
Given bottom-up image features, maximize the likelihood of paired captions during
training. During evaluation, decode captions given image features using beam search.
- Parameters
- image_features: torch.Tensor
A tensor of shape (batch_size, num_boxes, image_feature_size). num_boxes for each instance in a batch might be different. Instances with fewer boxes are padded with zeros up to num_boxes.
- caption_tokens: torch.Tensor, optional (default = None)
A tensor of shape (batch_size, max_caption_length) of tokenized captions. This tensor does not contain @@BOUNDARY@@ tokens yet. Captions are not provided during evaluation.
- fsm: torch.Tensor, optional (default = None)
A tensor of shape (batch_size, num_states, num_states, vocab_size): finite state machines per instance, represented as adjacency matrices. For a particular instance, [_, s1, s2, v] = 1 indicates a transition from state s1 to state s2 on decoding token v (a constraint). Would be None for regular beam search decoding.
- num_constraints: torch.Tensor, optional (default = None)
A tensor of shape (batch_size, ) containing the total number of given constraints for CBS. Would be None for regular beam search decoding.
- Returns
- Dict[str, torch.Tensor]
Decoded captions and/or per-instance cross entropy loss, a dict with keys either {"predictions"} or {"loss"}.
- _decode_step(self, image_features: torch.Tensor, previous_predictions: torch.Tensor, states: Optional[Dict[str, torch.Tensor]] = None) → Tuple[torch.Tensor, Dict[str, torch.Tensor]][source]
Given image features, tokens predicted at the previous time-step, and LSTM states of the UpDownCell, take a decoding step. This is also called by the beam search class.
- Parameters
- image_features: torch.Tensor
A tensor of shape (batch_size, num_boxes, image_feature_size).
- previous_predictions: torch.Tensor
A tensor of shape (batch_size * net_beam_size, ) containing tokens predicted at the previous time-step, one per beam for each instance in a batch. net_beam_size is 1 during teacher forcing (training), beam_size for regular allennlp.nn.beam_search.BeamSearch, and beam_size * num_states for updown.modules.cbs.ConstrainedBeamSearch.
- states: Dict[str, torch.Tensor], optional (default = None)
LSTM states of the UpDownCell. These are initialized as zero tensors if not provided (at the first time-step).
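A hypothetical single decoding step, continuing the forward sketch above, shown only to illustrate the calling convention; treating the first returned tensor as (log-)probabilities over the vocabulary and starting from the @@BOUNDARY@@ token are assumptions.

```python
import torch

# Start every instance from the boundary token (name taken from the forward docs above).
start_index = vocabulary.get_token_index("@@BOUNDARY@@")
previous_predictions = torch.full((batch_size,), start_index, dtype=torch.long)

# First step: states=None, so the UpDownCell states are zero-initialized.
step_scores, states = model._decode_step(image_features, previous_predictions, states=None)
next_tokens = step_scores.argmax(dim=-1)   # greedy choice; beam search keeps top-k instead

# Subsequent steps reuse the returned states.
step_scores, states = model._decode_step(image_features, next_tokens, states)
```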
- _get_loss(self, logits: torch.Tensor, targets: torch.Tensor, target_mask: torch.Tensor) → torch.Tensor[source]
Compute the cross entropy loss of the predicted caption (logits) w.r.t. the target caption. The cross entropy loss of a caption is the cross entropy loss at each time-step, summed across time-steps.
- Parameters
- logits: torch.Tensor
A tensor of shape (batch_size, max_caption_length - 1, vocab_size) containing unnormalized log-probabilities of predicted captions.
- targets: torch.Tensor
A tensor of shape (batch_size, max_caption_length - 1) of tokenized target captions.
- target_mask: torch.Tensor
A mask over target captions; elements where the mask is zero are ignored during loss computation. Here, we ignore the @@UNKNOWN@@ token (and hence padding tokens too, because they are essentially the same).
- Returns
- torch.Tensor
A tensor of shape (batch_size, ) containing the cross entropy loss of captions, summed across time-steps.
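For reference, a minimal re-implementation of the loss described above, under the shapes listed in the parameters; this is a sketch with an illustrative function name, not the class's exact code.

```python
import torch
import torch.nn.functional as F

def summed_caption_cross_entropy(
    logits: torch.Tensor, targets: torch.Tensor, target_mask: torch.Tensor
) -> torch.Tensor:
    # logits: (batch_size, T, vocab_size); targets and target_mask: (batch_size, T).
    log_probs = F.log_softmax(logits, dim=-1)
    # Negative log-likelihood of each target token at each time-step.
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Zero out masked positions, then sum across time-steps for a per-instance loss.
    return (nll * target_mask.float()).sum(dim=-1)
```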