Compositional Generalization in Image Captioning

Mitja Nikolaus; Mostafa Abdou; Matthew Lamm; Rahul Aralikatte; Desmond Elliott

doi:10.18653/v1/k19-1009

4 Citations (Scopus)

Abstract

Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. To address this poor performance, we propose a multi-task model that combines caption generation and image-sentence ranking, and uses a decoding mechanism that re-ranks the captions according to their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.
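
The re-ranking step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that the image and each candidate caption have already been mapped to vectors in a shared embedding space, and it uses cosine similarity as the scoring function, which is an assumption of this example. The names `cosine_similarity` and `rerank_captions` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank_captions(
    image_embedding: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Sort candidate captions by similarity to the image, most similar first.

    `candidates` holds (caption, caption_embedding) pairs. How the embeddings
    are produced is outside this sketch (assumed to come from a ranking model).
    """
    scored = [
        (caption, cosine_similarity(image_embedding, embedding))
        for caption, embedding in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In use, a caption generator would first produce several candidates (for example via beam search), and the candidate ranked first by `rerank_captions` would be emitted as the final caption.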

Original language	Undefined/Unknown
Title	Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Number of pages	12
Place of publication	Hong Kong, China
Publisher	Association for Computational Linguistics (ACL)
Publication date	1 Nov 2019
Pages	87-98
DOI	https://doi.org/10.18653/v1/k19-1009
Status	Published - 1 Nov 2019

Access to document

10.18653/v1/k19-1009

K19-1009.pdf: Publisher's published version, 807 KB, License: CC BY

https://www.aclweb.org/anthology/K19-1009

Citation formats

Compositional Generalization in Image Captioning. / Nikolaus, Mitja; Abdou, Mostafa; Lamm, Matthew et al.
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). Hong Kong, China: Association for Computational Linguistics (ACL), 2019. pp. 87-98.

Publication: Contribution to book/anthology/report › Conference contribution in proceedings › Research › peer-reviewed

@inproceedings{2f18fd294bcb4b66b1fa9decac2c8b6c,

title = "Compositional Generalization in Image Captioning",

abstract = "Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. To address this poor performance, we propose a multi-task model that combines caption generation and image--sentence ranking, and uses a decoding mechanism that re-ranks the captions according to their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.",

author = "Mitja Nikolaus and Mostafa Abdou and Matthew Lamm and Rahul Aralikatte and Desmond Elliott",

year = "2019",

month = nov,

day = "1",

doi = "10.18653/v1/k19-1009",

language = "Undefined/Unknown",

pages = "87--98",

booktitle = "Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)",

publisher = "Association for Computational Linguistics (ACL)",

address = "Hong Kong, China",

}

TY - GEN

T1 - Compositional Generalization in Image Captioning

AU - Nikolaus, Mitja

AU - Abdou, Mostafa

AU - Lamm, Matthew

AU - Aralikatte, Rahul

AU - Elliott, Desmond

PY - 2019/11/1

Y1 - 2019/11/1

N2 - Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. To address this poor performance, we propose a multi-task model that combines caption generation and image-sentence ranking, and uses a decoding mechanism that re-ranks the captions according to their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.

AB - Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. To address this poor performance, we propose a multi-task model that combines caption generation and image-sentence ranking, and uses a decoding mechanism that re-ranks the captions according to their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts compared to state-of-the-art captioning models.

U2 - 10.18653/v1/k19-1009

DO - 10.18653/v1/k19-1009

M3 - Conference contribution in proceedings

SP - 87

EP - 98

BT - Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

PB - Association for Computational Linguistics (ACL)

CY - Hong Kong, China

ER -