Hand gestures play a crucial role in communication, especially in social interactions. This research investigates the viability of using coding schemes to describe hand gestures and how accurately they can be classified in crowded environments using fine-tuned visual transformers such as VideoMAE. The training dataset is derived from the ConfLab dataset and contains top-view video recordings of social interactions in a crowded social setting. The videos are manually annotated for gesture phases (preparation, hold, stroke, recovery) and gesture units. Both classifiers achieve high accuracy after fine-tuning: 95% overall for gesture-phase classification and 93% for classifying whether a clip contains a gesture unit. These findings suggest that the proposed approach is effective in crowded environments and can be adapted for real-time applications.