Abstract:
The choice of input image size can have a significant impact on the performance of state-of-the-art algorithms. Algorithms can always be customized by training and fine-tuning them on our own datasets, but this is time-consuming. Although foundation models are increasingly used, in our application of monitoring patients, hand detection, object detection, and hand-object interaction detection all yielded mediocre performance. This study investigated the significance of input size for detecting hand-object interaction in two datasets: the patient dataset (captured in super view mode) and the EpicKitchen dataset (captured in normal view mode). The results showed that using different input sizes with the same foundation model can lead to a significant improvement in performance. In the patient dataset, using frames with input sizes of 300 × 300 pixels (px) and 256 × 256 px, obtained by cropping and resizing the original images, led to more successful hand detection. Furthermore, using video-processing tools such as FFmpeg to resize the frames, rather than passing the original images to the MediaPipe model for internal resizing, resulted in a 33% improvement. In the EpicKitchen dataset (normal view mode), successful hand detection was obtained by padding and cropping the original images and then resizing the frames to 256 px and 300 px. Overall, the study emphasizes the significance of input size for hand-object interaction detection for the purpose of monitoring patients with upper-limb impairment. The combination analysis within each dataset showed that the most effective hand-object interaction detection is achieved by applying the MediaPipe model to an input image size of 300 × 300 px (for super view mode) or 256 × 256 px (for normal view mode), combined with the result of the YOLOv7 model at an input image size of 1920 × 1920 px. With this combination, a 100% success rate was achieved on both datasets.
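As a minimal sketch of the preprocessing step described above, the snippet below crops and resizes frames to 300 × 300 px with FFmpeg before passing them to MediaPipe Hands, instead of letting MediaPipe rescale the full-resolution images itself. The file names, crop geometry, and output directory are illustrative assumptions and are not taken from the study.

```python
# Hypothetical sketch: pre-resize frames with FFmpeg, then run MediaPipe Hands.
# All paths and crop values below are assumptions for illustration only.
import subprocess
from pathlib import Path

import cv2
import mediapipe as mp

VIDEO = "patient_clip.mp4"        # hypothetical input video (super view mode)
FRAME_DIR = Path("frames_300px")  # output directory for pre-resized frames
FRAME_DIR.mkdir(exist_ok=True)

# Crop a region of interest and resize to 300x300 px with FFmpeg.
# The crop=w:h:x:y values are placeholders; they depend on the camera setup.
subprocess.run(
    [
        "ffmpeg", "-i", VIDEO,
        "-vf", "crop=1080:1080:420:0,scale=300:300",
        str(FRAME_DIR / "frame_%05d.png"),
    ],
    check=True,
)

# Run MediaPipe Hands on the already-resized frames.
hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)
frames = sorted(FRAME_DIR.glob("frame_*.png"))
detected = 0
for frame_path in frames:
    bgr = cv2.imread(str(frame_path))
    result = hands.process(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
    if result.multi_hand_landmarks:
        detected += 1
hands.close()

print(f"Hands detected in {detected}/{len(frames)} frames")
```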