dc.description.abstract | Representations of data obtained from deep neural networks automatically encode structures in a data distribution that are helpful for solving arbitrary downstream tasks such as classification and retrieval. To achieve this, design patterns for deep neural networks, as well as their training schemes, rely on a fundamental assumption about the completeness of the input data source. Specifically, they assume that each datum, consumed in its original form (an image at a certain scale, or from a certain domain), contains all the information needed to predict its label. However, this completeness assumption may be violated when the data distribution is ambiguous, noisy, or incomplete. This observation led to the development of multi-view representation learning, which posits that a complete concept can only be described holistically as a combination of multiple views, with each sample (data point) providing only one of the many required views. This thesis studies, from both theoretical and empirical perspectives, the conditions under which various problems in computer vision become better or worse candidates for multi-view representation learning.
We begin by studying how relationships between the different views of an object can uniquely encode semantic information. We develop a rigorous theoretical framework that formalizes this idea and demonstrate its benefits in the context of fine-grained visual categorization and zero-shot learning. We further study how relational representation learning can be made more interpretable by expressing the abstract ways in which different views combine within a deep neural network as transformations over a graph of image views.

In the second part of this thesis, we explore view multiplicity in the context of multi-modal representation learning. We focus primarily on cross-modal image retrieval, for which we develop state-of-the-art algorithms that mine complementary information across views to efficiently learn unified multi-modal representations, as well as algorithms that operate in data- and model-constrained environments.

In the final part of this thesis, we study various properties of conditional invariance learning in the context of domain adaptation. We present a novel perspective on invariance learning by viewing it through the lens of operators learned over domains, and we show that certain properties of the underlying operator dictate the nature of the invariance learned. We find that a simple and computationally efficient way of learning conditional invariances is to optimize the corresponding operator to non-commutatively direct the domain mapping towards the target. A common theme running throughout this thesis is a characterization of how the distribution shifts across different views influence the representation spaces of neural networks, which helps in understanding the generalization properties of various learning paradigms. | en_GB |