This thesis examines the impact of integrating a structural task graph into a visual recognition network in order to identify and segment errors in the assembly of toy cars. We introduced enhancements to two baseline networks that explicitly encode the structural and sequential properties of assembly processes. These enhancements achieve state-of-the-art performance on the visual-only mistake recognition task, with a 3.7% increase in F1-score over prior results on the Assembly101 dataset. Moreover, our work pioneers the temporal mistake segmentation task, which does not rely on ground-truth action segments at test time. On this task, our models improve substantially over the baselines, with gains of 5% in F1@10, 3.8% in F1@25, and 1.8% in F1@50. Our results indicate that graph construction and attention-based mechanisms play a significant role in improving both mistake recognition and temporal mistake segmentation, establishing a new reference point for future research in mistake detection.
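For readers unfamiliar with the segmentation metric, the following is a minimal sketch of how a segmental F1 score at an overlap threshold can be computed. It assumes that F1@10, F1@25, and F1@50 follow the convention common in temporal action segmentation, where a predicted segment counts as a true positive if its intersection-over-union (IoU) with an unmatched ground-truth segment of the same label reaches 0.10, 0.25, or 0.50. The function names `to_segments` and `segmental_f1` and the frame labels are hypothetical and are not taken from the evaluation code used in this thesis.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (label, start frame, end frame), end exclusive


def to_segments(frames: List[str]) -> List[Segment]:
    """Collapse a per-frame label sequence into contiguous segments."""
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(frames) + 1):
        # Close the current segment at the end of the sequence or on a label change.
        if i == len(frames) or frames[i] != frames[start]:
            segments.append((frames[start], start, i))
            start = i
    return segments


def segmental_f1(pred_frames: List[str], gt_frames: List[str],
                 iou_threshold: float) -> float:
    """Segmental F1 at a given IoU threshold (assumed convention, see lead-in)."""
    pred = to_segments(pred_frames)
    gt = to_segments(gt_frames)
    matched = [False] * len(gt)
    tp = 0
    fp = 0
    for label, p_start, p_end in pred:
        best_iou, best_j = 0.0, -1
        for j, (g_label, g_start, g_end) in enumerate(gt):
            if g_label != label:
                continue
            inter = max(0, min(p_end, g_end) - max(p_start, g_start))
            union = max(p_end, g_end) - min(p_start, g_start)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        # Each ground-truth segment may be matched by at most one prediction.
        if best_j >= 0 and best_iou >= iou_threshold and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = matched.count(False)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


if __name__ == "__main__":
    # Toy example: the predicted boundary is two frames late.
    gt = ["correct"] * 10 + ["mistake"] * 10
    pred = ["correct"] * 12 + ["mistake"] * 8
    for k in (0.10, 0.25, 0.50):
        print(f"F1@{int(k * 100)} = {segmental_f1(pred, gt, k):.2f}")
```

Under this convention, lower thresholds tolerate boundary errors while F1@50 demands tighter temporal alignment. That is consistent with the gains reported above shrinking as the threshold increases.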