Humans can easily infer the motivations behind human actions from visual data alone by comprehensively analyzing complex contextual information and drawing on abundant life experience. Inspired by this reasoning ability, existing motivation prediction methods have improved image-based deep classification models with commonsense knowledge learned by pre-trained language models. However, knowledge learned from public text corpora may be incompatible with the task-specific data of motivation prediction, which can degrade model performance. To address this problem, this paper proposes a dual scene graph convolutional network (dual-SGCN) that comprehensively exploits complex visual information and the semantic context prior in image data for motivation prediction. The proposed dual-SGCN has a visual branch and a semantic branch. In the visual branch, we build a visual graph based on the scene graph, in which object nodes and relation edges are represented by visual features. In the semantic branch, we build a semantic graph whose nodes and edges are directly represented by the word embeddings of the object and relation labels. In each branch, node-oriented and edge-oriented message passing propagates interaction information among nodes and edges. In addition, a multi-modal interactive attention mechanism cooperatively attends to and fuses the visual and semantic information. The proposed dual-SGCN is trained end-to-end with a multi-task co-training scheme. At inference, Total Direct Effect is adopted to alleviate the bias caused by the semantic context prior. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance.
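
As a concrete reference, the sketch below illustrates the dual-branch structure described above in PyTorch. It is only a minimal, assumption-laden illustration, not the authors' implementation: the class names (GraphBranch, DualSGCN), the feature dimension, the use of nn.MultiheadAttention as a stand-in for the interactive attention, the shared edge-index layout across the two branches, and the mean-pooled fusion head are all assumptions, and scene-graph extraction, the multi-task co-training losses, and the Total Direct Effect step at inference are omitted.

```python
# Minimal sketch of a dual-branch scene-graph network (assumptions noted above).
import torch
import torch.nn as nn


class GraphBranch(nn.Module):
    """One branch (visual or semantic) with node- and edge-oriented message passing."""

    def __init__(self, dim):
        super().__init__()
        # node update: combine a node with the aggregate of its incoming edge messages
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        # edge update: combine an edge with its subject and object node features
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())

    def forward(self, nodes, edges, src_idx, dst_idx):
        # nodes: (N, d), edges: (E, d); src_idx/dst_idx: (E,) node indices per edge
        # edge-oriented message passing: refresh each edge from its two endpoints
        edges = self.edge_mlp(
            torch.cat([nodes[src_idx], edges, nodes[dst_idx]], dim=-1)
        )
        # node-oriented message passing: sum incoming edge messages into each node
        agg = torch.zeros_like(nodes).index_add_(0, dst_idx, edges)
        nodes = self.node_mlp(torch.cat([nodes, agg], dim=-1))
        return nodes, edges


class DualSGCN(nn.Module):
    """Visual branch + semantic branch, fused by cross-modal attention."""

    def __init__(self, dim=512, num_motivations=100):
        super().__init__()
        self.visual_branch = GraphBranch(dim)
        self.semantic_branch = GraphBranch(dim)
        # interactive attention: each modality attends to the other one
        self.v2s_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.s2v_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_motivations)

    def forward(self, v_nodes, v_edges, s_nodes, s_edges, src_idx, dst_idx):
        v_nodes, _ = self.visual_branch(v_nodes, v_edges, src_idx, dst_idx)
        s_nodes, _ = self.semantic_branch(s_nodes, s_edges, src_idx, dst_idx)
        v, s = v_nodes.unsqueeze(0), s_nodes.unsqueeze(0)  # add batch dimension
        v_ctx, _ = self.v2s_attn(v, s, s)  # visual queries attend to semantic nodes
        s_ctx, _ = self.s2v_attn(s, v, v)  # semantic queries attend to visual nodes
        fused = torch.cat([v_ctx.mean(1), s_ctx.mean(1)], dim=-1)
        return self.classifier(fused)      # motivation logits
```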