ObjectiveThe detection of lameness in dairy cows is an important issue that needs to be solved urgently in the process of large-scale dairy farming. Timely detection and effective intervention can reduce the culling rate of young dairy cows, which has important practical significance for increasing the milk production of dairy cows and improving the economic benefits of pastures. Due to the low efficiency and low degree of automation of traditional manual detection and contact sensor detection, the mainstream cow lameness detection method is mainly based on computer vision. The detection perspective of existing computer vision-based cow lameness detection methods is mainly side view, but the side view perspective has limitations that are difficult to eliminate. In the actual detection process, there are problems such as cows blocking each other and difficulty in deployment. The cow lameness detection method from the top view will not be difficult to use on the farm due to occlusion problems. The aim is to solve the occlusion problem under the side view.MethodsIn order to fully explore the movement undulations of the trunk of the cow and the movement information in the time dimension during the walking process of the cow, a cow lameness detection method was proposed from a top view based on fused spatiotemporal flow features. By analyzing the height changes of the lame cow in the depth video stream during movement, a spatial stream feature image sequence was constructed. By analyzing the instantaneous speed of the lame cow's body moving forward and swaying left and right when walking, optical flow was used to capture the instantaneous speed of the cow's movement, and a time flow characteristic image sequence was constructed. The spatial flow and time flow features were combined to construct a fused spatiotemporal flow feature image sequence. Different from traditional image classification tasks, the image sequence of cows walking includes features in both time and space dimensions. There would be a certain distinction between lame cows and non-lame cows due to their related postures and walking speeds when walking, so using video information analysis was feasible to characterize lameness as a behavior. The video action classification network could effectively model the spatiotemporal information in the input image sequence and output the corresponding category in the predicted result. The attention module Convolutional Block Attention Module (CBAM) was used to improve the PP-TSMv2 video action classification network and build the Cow-TSM cow lameness detection model. The CBAM module could perform channel weighting on different modes of cows, while paying attention to the weights between pixels to improve the model's feature extraction capabilities. Finally, cow lameness experiments were conducted on different modalities, different attention mechanisms, different video action classification networks and comparison of existing methods. The data was used for cow lameness included a total of 180 video streams of cows walking. Each video was decomposed into 100‒400 frames. The ratio of the number of video segments of lame cows and normal cows was 1:1. For the feature extraction of cow lameness from the top view, RGB images had less extractable information, so this work mainly used depth video streams.Results and DiscussionsIn this study, a total of 180 segments of cow image sequence data were acquired and processed, including 90 lame cows and 90 non-lame cows with a 1:1 ratio of video segments, and the prediction accuracy of automatic detection method for dairy cow lameness based on fusion of spatiotemporal stream features reaches 88.7%, the model size was 22 M, and the offline inference time was 0.046 s. The prediction accuracy of the common mainstream video action classification models TSM, PP-TSM, SlowFast and TimesFormer models on the data set of automatic detection method for dairy cow lameness based on fusion of spatiotemporal stream features reached 66.7%, 84.8%, 87.1% and 85.7%, respectively. The comprehensive performance of the improved Cow-TSM model in this paper was the most. At the same time, the recognition accuracy of the fused spatiotemporal flow feature image was improved by 12% and 4.1%, respectively, compared with the temporal mode and spatial mode, which proved the effectiveness of spatiotemporal flow fusion in this method. By conducting ablation experiments on different attention mechanisms of SE, SK, CA and CBAM, it was proved that the CBAM attention mechanism used has the best effect on the data of automatic detection method for dairy cow lameness based on fusion of spatiotemporal stream features. The channel attention in CBAM had a better effect on fused spatiotemporal flow data, and the spatial attention could also focus on the key spatial information in cow images. Finally, comparisons were made with existing lameness detection methods, including different methods from side view and top view. Compared with existing methods in the side-view perspective, the prediction accuracy of automatic detection method for dairy cow lameness based on fusion of spatiotemporal stream features was slightly lower, because the side-view perspective had more effective cow lameness characteristics. Compared with the method from the top view, a novel fused spatiotemporal flow feature detection method with better performance and practicability was proposed.ConclusionsThis method can avoid the occlusion problem of detecting lame cows from the side view, and at the same time improves the prediction accuracy of the detection method from the top view. It is of great significance for reducing the incidence of lameness in cows and improving the economic benefits of the pasture, and meets the needs of large-scale construction of the pasture.