Emotion recognition is an important part of day-to-day life for almost everyone; without it, most human communication would be far less effective. Computer vision tools attempt to reproduce this human ability in order to solve problems that people frequently encounter, and automated recognition of human emotion could lead to advances in fields such as policing and teaching. Body language is a key channel for expressing emotion and is therefore an area of intense focus for automated emotion recognition tools. When classifying a person's emotions in a video, it is important to consider information from the entire video before making a final prediction. Based on this principle, this work proposes Temporal Segment Long Short-Term Memory networks for emotion recognition. This approach outperforms current two-stream network-based methods and identifies the specific emotions that benefit most from the context retained by a Long Short-Term Memory (LSTM) unit. Pairing convolutional neural networks with LSTMs allows these networks to behave more like human observers and improves accuracy over existing emotion recognition methods.
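
To make the described architecture concrete, the following is a minimal sketch of a Temporal Segment LSTM-style model: a CNN extracts features from one frame sampled per segment, and an LSTM aggregates the segment features so the final prediction reflects context from the whole video. PyTorch, the ResNet-18 backbone, and the segment, hidden-size, and class counts are illustrative assumptions, not the paper's exact configuration.

```python
# A hypothetical Temporal Segment LSTM sketch (assumed PyTorch / ResNet-18 setup).
import torch
import torch.nn as nn
from torchvision import models


class TemporalSegmentLSTM(nn.Module):
    def __init__(self, num_classes=7, num_segments=3, hidden_size=256):
        super().__init__()
        self.num_segments = num_segments
        # CNN backbone extracts per-frame spatial features.
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()
        self.cnn = backbone
        # LSTM aggregates segment features so the final prediction
        # retains context from across the video.
        self.lstm = nn.LSTM(feat_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, num_segments, 3, H, W), one sampled frame per segment.
        b, s, c, h, w = clips.shape
        feats = self.cnn(clips.view(b * s, c, h, w)).view(b, s, -1)
        out, _ = self.lstm(feats)
        # Classify from the last hidden state, i.e. after seeing every segment.
        return self.classifier(out[:, -1])


# Example usage: a batch of 2 videos, 3 segments each, 224x224 RGB frames.
model = TemporalSegmentLSTM()
logits = model(torch.randn(2, 3, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 7])
```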