Wenping Wang
Adv. Artif. Intell. Mach. Learn., 3 (3):1369–1388
Wenping Wang: Individual Researcher
DOI: 10.54364/AAIML.2023.1181
Article History: Received on: 05-Jun-23, Accepted on: 23-Aug-23, Published on: 31-Aug-23
Corresponding Author: Wenping Wang
Email: wenpingw@alumni.cmu.edu
Citation: Tong Chen, Sicong Liu, Zhiran Chen, Wenyan Hu, Dachi Chen, Yuanxin Wang, Qi Lyu, Cindy X. Le, Wenping Wang (2023). Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks. Adv. Artif. Intell. Mach. Learn., 3 (3):1369–1388
Multi-layered transformer architectures have lately come to dominate vision-language tasks. However, massive transformer models are often inaccessible to many researchers due to their sheer size, and they are often treated as black boxes with poor interpretability. In this paper, we examine the weaknesses of such architectures and propose our own solutions. In particular, we select one of the state-of-the-art models, Oscar (Li et al., 2020), and apply knowledge distillation and attention visualization to address the aforementioned issues. Moreover, we attempt to improve the overall effectiveness of the Oscar model by making its inferred object tags more useful. Through detailed experiments, we show that we can both improve performance on vision-language tasks and make the underlying models more transparent and accessible to all researchers. We analyze our findings in depth, including the effects of tags and confidence as well as the training behavior of distillation, and conclude by pointing out directions for future work.