dc.contributor.author: He, S
dc.date.accessioned: 2021-07-28T08:41:09Z
dc.date.issued: 2021-08-02
dc.description.abstract: Thanks to deep learning, computer vision has advanced considerably. The attention mechanism, inspired by the human visual system, is a versatile module that is widely applied in current deep computer vision models and strengthens their power. However, most attention models are trained end-to-end. Why and how do these attention models work? How similar is the trained attention to the human attention that inspired it? These questions remain open, and this hinders the design of better attention models, architectures, and algorithms that could further advance computer vision. In this thesis, we aim to unravel these mysteries by studying attention mechanisms in computer vision in the deep learning era. In the first part of this thesis, we study bottom-up attention. Under the umbrella of saliency prediction, bottom-up attention has progressed substantially with the help of deep learning. However, deep saliency models remain a black box, and their performance has reached a ceiling. The first part of this thesis therefore aims to understand what happens inside a deep model when it is trained for saliency prediction. Concretely, we dissect each individual unit inside a deep model trained for saliency prediction. Our analysis discloses how deep models predict saliency, exposes their limitations, and gives new insights for future saliency modelling. In the second part, we study top-down attention in computer vision. Top-down attention, a mechanism that usually builds on top of bottom-up attention, has achieved great success in many computer vision tasks. However, this success raises an interesting question: is learned top-down attention similar to human attention under the same task? To answer this question, we collected a dataset that records human attention during the image captioning task. Using this dataset, we analyse how the attention exploited by a deep image captioning model differs from human attention on the same task. Our research shows that the widely used soft attention mechanism differs from human attention on this task. Meanwhile, we use human attention as prior knowledge to help a machine perform better at image captioning. In the third part, we study contextual attention. It is complementary to both bottom-up and top-down attention, contextualizing each informative region with attention. Prior contextual attention methods either adopt contextual modules from natural language processing, which are only suitable for 1-D sequential inputs, or rely on complex two-stream graph neural networks. Motivated by the difference in semantic units between sentences and images, we design a transformer-based architecture for image captioning. Our design widens the original transformer layer by using 2-D spatial relationships and achieves competitive performance on image captioning. [en_GB]
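For context on the "soft attention" that the abstract contrasts with human attention, below is a minimal sketch of additive soft attention over image regions, in the spirit of Show, Attend and Tell; the function name, shapes, and random parameters are illustrative assumptions, not the implementation studied in the thesis.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_attention(regions, hidden, W_r, W_h, v):
    """Additive soft attention over image regions (illustrative sketch).

    regions: (k, d) region feature vectors.
    hidden:  (h,)   decoder hidden state.
    W_r (d, a), W_h (h, a), v (a,): projection parameters,
    random here purely for illustration.

    Returns (alpha, context): non-negative weights over the k regions
    that sum to 1, and their weighted sum.
    """
    scores = np.tanh(regions @ W_r + hidden @ W_h) @ v  # (k,) relevance scores
    alpha = softmax(scores)                             # (k,) attention weights
    context = alpha @ regions                           # (d,) context vector
    return alpha, context

# Toy usage with made-up dimensions.
rng = np.random.default_rng(0)
k, d, h, a = 5, 8, 6, 4  # regions, feature dim, hidden dim, attention dim
alpha, context = soft_attention(
    rng.normal(size=(k, d)),   # region features
    rng.normal(size=h),        # decoder state
    rng.normal(size=(d, a)),
    rng.normal(size=(h, a)),
    rng.normal(size=a),
)
print(alpha.round(3), alpha.sum())  # weights over the 5 regions, summing to 1
```

In a trained captioning model the parameters W_r, W_h, and v are learned end-to-end, which is exactly why the resulting attention maps need not resemble human attention, the question the second part of the thesis investigates.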
dc.identifier.uri: http://hdl.handle.net/10871/126588
dc.publisher: University of Exeter [en_GB]
dc.rights.embargoreason: Some extended works are under review. [en_GB]
dc.title: Attention in Computer Vision [en_GB]
dc.type: Thesis or dissertation [en_GB]
dc.date.available: 2021-07-28T08:41:09Z
dc.contributor.advisor: Pugeault, N [en_GB]
dc.publisher.department: Computer Sciences [en_GB]
dc.rights.uri: http://www.rioxx.net/licenses/all-rights-reserved [en_GB]
dc.type.degreetitle: PhD in Computer Sciences [en_GB]
dc.type.qualificationlevel: Doctoral [en_GB]
dc.type.qualificationname: Doctoral Thesis [en_GB]
rioxxterms.version: NA [en_GB]
rioxxterms.licenseref.startdate: 2021-07-27
rioxxterms.type: Thesis [en_GB]
refterms.dateFOA: 2021-07-28T08:44:34Z

