To address the difficulty of manually counting watermelons that are unevenly distributed and heavily occluded in natural environments, this study used drones and smartphones to collect videos and images, which were manually annotated to build a dataset of Sanbai melons and Ningxia selenium sand melons. An automatic video counting method for watermelons is proposed, based on the YOLOv7-GCSF model and an improved DeepSORT algorithm. To make YOLOv7 lightweight, GhostConv is used to build GBS, G-ELAN, and G-SPPCSPC modules, which increases detection speed. Some ELAN modules are replaced with the C2f module from YOLOv8 to reduce redundant information. The SimAM attention mechanism is introduced into the MP module of the feature fusion layer to form the MP-SimAM module, which strengthens the model's feature extraction capability. The CIoU loss function is replaced with the faster-converging, lower-loss Focal EIoU loss function to speed up convergence. For video tracking and counting, a mask collision line mechanism is proposed to count Sanbai melons and Ningxia selenium sand melons more accurately. In terms of object detection, the results show that each of the four improvements in the YOLOv7-GCSF model enhanced performance to some extent. Specifically, compared with the YOLOv7 model, the MP-SimAM module increased accuracy by 1.5 percentage points, indicating that the model focuses more closely on Sanbai melons and Ningxia selenium sand melons. The addition of GhostConv reduced the model size by 28.1 MB, demonstrating that the GBS, G-ELAN, and G-SPPCSPC modules effectively reduced model size and improved detection speed. The C2f module reduced the model's floating-point operations (FLOPs) by 77.5 billion, indicating that the model has eliminated most of the redundant information.
The Focal EIoU loss function significantly increased the model's convergence speed, indicating a further improvement in the model's learning ability. The improved YOLOv7-GCSF model achieved an accuracy (P) of 94.2% and a mean average precision (mAP0.5) of 98.2%. Compared with YOLOv5, YOLOv7, Faster RCNN, and SSD, accuracy is higher by 5.0, 2.3, 21.9, and 14.9 percentage points, respectively, and mean average precision is higher by 3.7, 0.3, 4.6, and 9.3 percentage points, respectively. In terms of model lightweighting, the YOLOv7-GCSF model has 1.18 M and 0.11 M fewer parameters than the YOLOv4-Ghostnet and YOLOv7-Slimneck models, respectively. Compared with the original YOLOv7, the YOLOv7-GCSF model reduces the parameter count by 0.57 M and the model size by 18.88 MB. In terms of object tracking, the improved DeepSORT achieves a multi-object tracking accuracy of 91.2% and a multi-object tracking precision of 89.6%. Compared with Tracktor and SORT, tracking accuracy is higher by 5.0 and 13.7 percentage points, respectively, and tracking precision is higher by 3.7 and 13.1 percentage points, respectively. When the improved model's counts of Sanbai melons and Ningxia selenium sand melons are compared with manual counts, the coefficient of determination is 0.93, the average counting accuracy is 96.3%, and the mean absolute error is 0.77, indicating that the error between the improved model and manual counting is small. By enabling effective counting of watermelons in agricultural fields, this approach provides a technical method for forecasting watermelon yields. [ABSTRACT FROM AUTHOR]
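
The agreement measures used above to compare automatic and manual counts (coefficient of determination, average counting accuracy, mean absolute error) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. The function name `counting_metrics`, the per-video accuracy definition 1 - |error| / manual count, and the sample counts are all assumptions, since the abstract does not state the exact formulas. The coefficient of determination in the study may instead have come from a regression fit.

```python
import numpy as np


def counting_metrics(manual, predicted):
    """Compare automatic counts with manual counts (hypothetical helper).

    manual, predicted: one count per video or plot, in the same order.
    Manual counts must be non-zero and must not all be identical.
    """
    manual = np.asarray(manual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)

    abs_err = np.abs(predicted - manual)

    # Mean absolute error, in melons per video.
    mae = abs_err.mean()

    # Assumed definition: per-video accuracy = 1 - |error| / manual count,
    # averaged over all videos.
    accuracy = (1.0 - abs_err / manual).mean()

    # Coefficient of determination with manual counts as the reference.
    ss_res = np.sum((manual - predicted) ** 2)
    ss_tot = np.sum((manual - manual.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"r2": float(r2), "accuracy": float(accuracy), "mae": float(mae)}


if __name__ == "__main__":
    # Illustrative counts only; these are not data from the study.
    manual_counts = [10, 20, 30, 40]
    model_counts = [10, 21, 29, 40]
    print(counting_metrics(manual_counts, model_counts))
```

Taking the manual count as the reference in both the accuracy and the coefficient of determination treats human counting as ground truth, which matches how the comparison is described above.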