Leveraging Transformer Models To Enhance Temporal Grounding
Open Access
Author:
Wang, Jason
Area of Honors:
Computer Science
Degree:
Bachelor of Science
Document Type:
Thesis
Thesis Supervisors:
Rui Zhang, Thesis Supervisor
John Joseph Hannan, Thesis Honors Advisor
Keywords:
Natural Language Processing; Computer Vision; Transformers; Vision and Language
Abstract:
Due to the rapid growth of video in everyday life, the goal of understanding the details and actions within videos has never been more important. When words and sentences are grounded in images and videos, they become far more meaningful. Combining Computer Vision (CV) and Natural Language Processing (NLP), the task of temporal grounding aims to predict the time range during which an event occurs within a video. Specifically, temporal grounding takes a natural language query and an untrimmed video as input. Tackling and ultimately optimizing this task can open up a wide range of applications in both NLP and CV, from detecting actions and objects in live video to generating captions without supervision. In this thesis, I first review the innovations presented in EVOQUER, a temporal grounding framework we created that incorporates an existing text-to-video grounding model and a video-assisted query generation network. Afterwards, I present and analyze the potential benefits of leveraging transformer models, along with our current attempts to replicate previous performance results.