
Browsing by Author "Polis, Arturs"


  • Polis, Arturs (2019)
    Recently, neural-network-based approaches to automatically generating image descriptions have become popular. Originally introduced as neural image captioning, the term refers to a family of models in which several neural network components are connected end-to-end to infer the most likely caption for a given input image. Neural image captioning models usually comprise a Convolutional Neural Network (CNN) image encoder and a Recurrent Neural Network (RNN) language model that generates image captions based on the output of the CNN. Generating long image captions – commonly referred to as paragraph captions – is more challenging than producing shorter, sentence-length captions: when generating paragraph captions, the model has more degrees of freedom, because the total number of possible sentence combinations is far larger.

    In this thesis, we describe a combination of two approaches to improve paragraph captioning: using a hierarchical RNN model that adds a top-level RNN to keep track of the sentence context, and using richer visual features obtained from dense captioning networks. In addition to the standard MS-COCO Captions dataset used for image captioning, we also utilize the Stanford-Paragraph dataset, which is specifically designed for paragraph captioning.

    This thesis describes experiments performed on three variants of RNNs for generating paragraph captions: the flat model uses a non-hierarchical RNN, the hierarchical model implements a two-level hierarchical RNN, and the hierarchical-coherent model improves on the hierarchical model by optimizing the coherence between sentences. In the experiments, the flat model outperforms the published non-hierarchical baseline and reaches results similar to those of our hierarchical model. The hierarchical model performs similarly to the corresponding published model, thus validating our implementation.
    The hierarchical-coherent model gives inconclusive results – it outperforms our hierarchical model but does not reach the scores of the corresponding published model. With our flat model implementation, we have shown that minor improvements to a simple image captioning model can yield much higher scores on standard metrics than previously reported. However, it remains unclear whether a hierarchical RNN is required to model paragraph captions, or whether a single RNN layer on its own can be powerful enough. Our initial human evaluation indicates that the captions produced by a hierarchical RNN may in fact be more fluent; however, the standard automatic evaluation metrics do not capture this.
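
    The two-level design mentioned in the abstract – a sentence-level RNN that produces a "topic" state per sentence, and a word-level RNN that decodes each sentence from that topic – can be sketched as below. This is a minimal illustrative toy with vanilla tanh RNN cells, random untrained weights, and invented dimension names; it is not the thesis's actual model or code.

    ```python
    import numpy as np

    # Minimal sketch of a two-level hierarchical RNN for paragraph captioning.
    # All dimensions, names, and hyperparameters are illustrative assumptions.
    rng = np.random.default_rng(0)
    D_IMG, D_HID, D_EMB, VOCAB = 16, 8, 8, 20
    MAX_SENTS, MAX_WORDS = 3, 5

    def rnn_step(x, h, Wx, Wh, b):
        """One vanilla (tanh) RNN step."""
        return np.tanh(x @ Wx + h @ Wh + b)

    # Sentence-level RNN: consumes the CNN image feature at every step
    # and emits one "topic" hidden state per sentence.
    Wx_s = rng.normal(0, 0.1, (D_IMG, D_HID))
    Wh_s = rng.normal(0, 0.1, (D_HID, D_HID))
    b_s = np.zeros(D_HID)

    # Word-level RNN: generates the words of one sentence, initialized
    # from the sentence topic produced above.
    Wx_w = rng.normal(0, 0.1, (D_EMB, D_HID))
    Wh_w = rng.normal(0, 0.1, (D_HID, D_HID))
    b_w = np.zeros(D_HID)
    W_out = rng.normal(0, 0.1, (D_HID, VOCAB))   # hidden -> vocabulary logits
    embed = rng.normal(0, 0.1, (VOCAB, D_EMB))   # word-embedding table

    def generate_paragraph(img_feat):
        """Greedily decode MAX_SENTS sentences of MAX_WORDS word ids each."""
        paragraph = []
        h_sent = np.zeros(D_HID)
        for _ in range(MAX_SENTS):
            h_sent = rnn_step(img_feat, h_sent, Wx_s, Wh_s, b_s)  # sentence topic
            h_word = h_sent.copy()       # word RNN starts from the topic state
            word, sentence = 0, []       # word id 0 plays the role of <start>
            for _ in range(MAX_WORDS):
                h_word = rnn_step(embed[word], h_word, Wx_w, Wh_w, b_w)
                word = int(np.argmax(h_word @ W_out))  # greedy decoding
                sentence.append(word)
            paragraph.append(sentence)
        return paragraph

    paragraph = generate_paragraph(rng.normal(0, 1, D_IMG))
    ```

    The flat model discussed in the abstract corresponds to dropping the sentence-level RNN and letting a single word-level RNN generate the whole paragraph as one long token sequence; the hierarchical-coherent variant would additionally couple consecutive topic states through a coherence objective during training.
    
    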