Describing Images using Inferred Visual Dependency Representations.

Desmond Elliott; Arjen de Vries

Describing Images using Inferred Visual Dependency Representations.

29 Citationer (Scopus)

Abstract

The Visual Dependency Representation (VDR) is an explicit model of the spatial relationships between objects in an image. In this paper we present an approach to training a VDR Parsing Model without the extensive human supervision used in previous work. Our approach is to find the objects mentioned in a given description using a state-of-The-Art object detector, and to use successful detections to produce training data. The description of an unseen image is produced by first predicting its VDR over automatically detected objects, and then generating the text with a template-based generation model using the predicted VDR. The performance of our approach is comparable to a state-ofthe-Art multimodal deep neural network in images depicting actions.

Originalsprog	Udefineret/Ukendt
Titel	Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing
Antal sider	11
Publikationsdato	2015
Sider	42-52
Status	Udgivet - 2015

Citationsformater

Describing Images using Inferred Visual Dependency Representations. / Elliott, Desmond; de Vries, Arjen.

Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. 2015. s. 42-52.

Publikation: Bidrag til bog/antologi/rapport › Konferencebidrag i proceedings › Forskning › peer review

@inproceedings{d2dcb34fb57b44d2a70a554e5aca639c,

title = "Describing Images using Inferred Visual Dependency Representations.",

abstract = "The Visual Dependency Representation (VDR) is an explicit model of the spatial relationships between objects in an image. In this paper we present an approach to training a VDR Parsing Model without the extensive human supervision used in previous work. Our approach is to find the objects mentioned in a given description using a state-of-The-Art object detector, and to use successful detections to produce training data. The description of an unseen image is produced by first predicting its VDR over automatically detected objects, and then generating the text with a template-based generation model using the predicted VDR. The performance of our approach is comparable to a state-ofthe-Art multimodal deep neural network in images depicting actions.",

author = "Desmond Elliott and {de Vries}, Arjen",

year = "2015",

language = "Udefineret/Ukendt",

pages = "42--52",

booktitle = "Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing",

}

TY - GEN

T1 - Describing Images using Inferred Visual Dependency Representations.

AU - Elliott, Desmond

AU - de Vries, Arjen

PY - 2015

Y1 - 2015

N2 - The Visual Dependency Representation (VDR) is an explicit model of the spatial relationships between objects in an image. In this paper we present an approach to training a VDR Parsing Model without the extensive human supervision used in previous work. Our approach is to find the objects mentioned in a given description using a state-of-The-Art object detector, and to use successful detections to produce training data. The description of an unseen image is produced by first predicting its VDR over automatically detected objects, and then generating the text with a template-based generation model using the predicted VDR. The performance of our approach is comparable to a state-ofthe-Art multimodal deep neural network in images depicting actions.

AB - The Visual Dependency Representation (VDR) is an explicit model of the spatial relationships between objects in an image. In this paper we present an approach to training a VDR Parsing Model without the extensive human supervision used in previous work. Our approach is to find the objects mentioned in a given description using a state-of-The-Art object detector, and to use successful detections to produce training data. The description of an unseen image is produced by first predicting its VDR over automatically detected objects, and then generating the text with a template-based generation model using the predicted VDR. The performance of our approach is comparable to a state-ofthe-Art multimodal deep neural network in images depicting actions.

M3 - Konferencebidrag i proceedings

SP - 42

EP - 52

BT - Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing

ER -