
dc.contributor.author: He, S
dc.contributor.author: Tavakoli, HR
dc.contributor.author: Borji, A
dc.contributor.author: Pugeault, N
dc.date.accessioned: 2019-08-12T08:29:20Z
dc.date.issued: 2020-02-27
dc.description.abstract: In this work, we present a novel dataset consisting of eye movements and verbal descriptions recorded synchronously over images. Using this data, we study the differences in human attention during free-viewing and image captioning tasks. We examine the relationship between human attention and language constructs during perception and sentence articulation. We also analyse attention deployment mechanisms in the top-down soft-attention approach that is argued to mimic human attention in captioning tasks, and investigate whether visual saliency can help image captioning. Our study reveals that (1) human attention behaviour differs between free-viewing and image description tasks; humans tend to fixate on a greater variety of regions in the latter task; (2) there is a strong relationship between described objects and attended objects (97% of the described objects are attended); (3) a convolutional neural network used as the feature encoder accounts for human-attended regions during image captioning to a great extent (around 78%); (4) the soft-attention mechanism differs from human attention, both spatially and temporally, and there is low correlation between caption scores and attention consistency scores, indicating a large gap between humans and machines with regard to top-down attention; and (5) by integrating the soft-attention model with image saliency, we can significantly improve the model's performance on the Flickr30k and MSCOCO benchmarks. [en_GB]
dc.description.sponsorship: Engineering and Physical Sciences Research Council (EPSRC) [en_GB]
dc.description.sponsorship: Alan Turing Institute [en_GB]
dc.identifier.citation: ICCV 2019: IEEE International Conference on Computer Vision, 27 October - 2 November 2019, Seoul, Korea, pp. 8528-8537. [en_GB]
dc.identifier.doi: 10.1109/ICCV.2019.00862
dc.identifier.grantnumber: EP/N035399/1 [en_GB]
dc.identifier.grantnumber: EP/N510129/1 [en_GB]
dc.identifier.uri: http://hdl.handle.net/10871/38302
dc.language.iso: en [en_GB]
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE) [en_GB]
dc.relation.url: https://github.com/SenHe/Human-Attention-in-Image-Captioning [en_GB]
dc.rights: © 2019 IEEE.
dc.title: Human Attention in Image Captioning: Dataset and Analysis [en_GB]
dc.type: Conference paper [en_GB]
dc.date.available: 2019-08-12T08:29:20Z
dc.identifier.issn: 2380-7504
dc.description: This is the author accepted manuscript. The final version is available from IEEE via the DOI in this record. [en_GB]
dc.description: Data availability: the dataset can be found at https://github.com/SenHe/Human-Attention-in-Image-Captioning [en_GB]
dc.rights.uri: http://www.rioxx.net/licenses/all-rights-reserved [en_GB]
pubs.funder-ackownledgement: Yes [en_GB]
dcterms.dateAccepted: 2019-07-22
exeter.funder: Engineering and Physical Sciences Research Council (EPSRC) [en_GB]
exeter.funder: Alan Turing Institute [en_GB]
rioxxterms.version: AM [en_GB]
rioxxterms.licenseref.startdate: 2019-07-22
rioxxterms.type: Conference Paper/Proceeding/Abstract [en_GB]
refterms.dateFCD: 2019-08-09T10:39:43Z
refterms.versionFCD: AM
refterms.dateFOA: 2020-03-20T14:38:50Z
refterms.panel: B [en_GB]

