Extending Video Action Recognition in the Compressed Domain

Open Access
- Author:
- Abrams, Samuel B
- Area of Honors:
- Computer Engineering
- Degree:
- Bachelor of Science
- Document Type:
- Thesis
- Thesis Supervisors:
- Vijaykrishnan Narayanan, Thesis Supervisor
Vijaykrishnan Narayanan, Thesis Honors Advisor
John Morgan Sampson, Faculty Reader - Keywords:
- Video Compression
Action Recognition - Abstract:
- As the internet continues to take root in every facet of society, video is becoming one of the most common mediums for the communication of ideas and information. As of 2021, video traffic makes up 81% of all internet traffic worldwide. It is, therefore, clear that being able to understand video autonomously and on a mass scale provides huge potential for technologies that will shape the future. This could immediately impact the entertainment industry, surveillance, data collection, human-computer interaction, and more. Much work has been done in an attempt to take the successful strategies that convolutional neural networks have had on image classification and apply them to video understanding. One instrumental approach leverages the inherent redundancy in video by utilizing the existing format in which video is commonly stored: its compressed state. By learning directly on compressed video, models are exposed only to the most pertinent spatial information as well as motion information that is present without any computation. While the present methods generate near-SOTA results, there are many areas for improvement. All notable works in this topic have been performed on video compressed with the MPEG-4 part-2 codec which is dated in part due to its nonoptimized compression ratio. An important contribution made to this discipline was showing that using this non-optimized codec the abundance of uncompressed frames allowed for complete disregard for the motion information in the stream using a process of temporal shifting between the spatial frames. This generated improved accuracy with system energy savings of over 11x over the traditional method. Another contribution has been modifying the methods of compressed recognition on the archaic aforementioned codec and applying them to the modern, highly-optimized H.264. With this codec the standard for platforms such as YouTube and Twitch, developing a method for performing recognition on H.264 directly is of immense importance for applying video understanding in the real world. Through my exploration of strategies for processing video in this format, I have developed a method that provides comparable accuracy to the methods on MPEG4 part-2 even given the constraints of utilizing H.264 directly.