Objectives: Multimodal services that combine audio, video and haptics, such as mixed reality, digital twins and the metaverse, are widely expected to become killer applications in the 6G era. However, the large volume of multimodal data generated by such services is likely to overburden the signal processing, transmission and storage capabilities of existing communication systems. A cross-modal signal reconstruction scheme is therefore urgently needed to reduce the amount of transmitted data, so that 6G immersive multimodal services can meet users' immersive-experience requirements while guaranteeing low-latency, high-reliability and high-capacity communication.

Methods: First, by controlling a robot to touch various materials, a dataset containing audio, visual and haptic signals, VisTouch, is constructed to lay the foundation for subsequent research on a range of cross-modal problems. Second, by exploiting the semantic correlation between multimodal signals, a universal and robust end-to-end cross-modal signal reconstruction architecture is designed, comprising three parts: a feature extraction module, a reconstruction module and an evaluation module. The feature extraction module maps the source-modality signal to a semantic feature vector in a common semantic space, and the reconstruction module inverse-transforms this vector into the target-modality signal. The evaluation module assesses reconstruction quality in the semantic and spatio-temporal dimensions and, during training, feeds the resulting optimization information back to the feature extraction and reconstruction modules, forming a closed loop that achieves accurate signal reconstruction through continuous iteration. Furthermore, a teleoperation platform is designed, and the constructed haptic reconstruction model is deployed in its codec to verify the model's operational efficiency in practice. Finally, the reliability of the cross-modal signal reconstruction architecture and the accuracy of the haptic reconstruction model are verified experimentally.

Results: The constructed VisTouch dataset covers three modalities (audio, video and haptics) and contains 47 samples of materials common in daily life. The video-assisted haptic reconstruction model achieves a mean absolute error of 0.0135 and an accuracy of 0.78 on the VisTouch dataset. To bring the proposed cross-modal signal reconstruction framework into a practical application, a teleoperation platform targeting an industrial scenario was further built using a robot and an NVIDIA development board. Running on this platform, the model achieves an actual mean absolute error of 0.0126, a total end-to-end delay of 127 ms and a reconstruction-model delay of 98 ms. A questionnaire was also used to assess user satisfaction: the mean haptic-realism satisfaction score is 4.43 with a variance of 0.72, and the mean delay satisfaction score is 3.87 with a variance of 1.07.

Conclusions: The dataset results fully demonstrate the practicality of the constructed VisTouch dataset and the accuracy of the video-assisted haptic reconstruction model, while the field tests on the teleoperation platform indicate that users consider the haptic signals generated by the model close to the real signals but are only moderately satisfied with the algorithm's running time; that is, the complexity of the model needs further optimization.
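
To make the three-module architecture described in Methods concrete, the following is a minimal PyTorch sketch of one training iteration. The layer sizes, the loss weighting alpha, and the target-side encoder used to obtain a reference semantic vector are illustrative assumptions, not the paper's published design.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Source-side encoder: maps a source-modality signal (e.g. video
    features) to a semantic vector in the common semantic space."""
    def __init__(self, in_dim=512, sem_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, sem_dim))

    def forward(self, x):
        return self.net(x)

class Reconstructor(nn.Module):
    """Inverse-transforms a semantic vector into the target-modality
    signal (e.g. a haptic sample)."""
    def __init__(self, sem_dim=128, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(sem_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, z):
        return self.net(z)

# Hypothetical target-side encoder, used only to obtain a reference
# semantic vector for the semantic term of the evaluation loss; the
# abstract does not specify how the semantic dimension is scored.
target_encoder = FeatureExtractor(in_dim=64, sem_dim=128)

def evaluation_loss(pred, target, z_src, alpha=0.5):
    """Evaluation module: scores reconstruction quality in the
    spatio-temporal dimension (signal-level MAE) and the semantic
    dimension (distance in the common semantic space). Its gradient is
    the 'optimization information' fed back to both modules."""
    spatio_temporal = torch.mean(torch.abs(pred - target))  # MAE
    semantic = torch.mean((z_src - target_encoder(target)) ** 2)
    return spatio_temporal + alpha * semantic

extractor, reconstructor = FeatureExtractor(), Reconstructor()
opt = torch.optim.Adam(list(extractor.parameters())
                       + list(reconstructor.parameters()), lr=1e-4)

video_feat = torch.randn(8, 512)  # placeholder source batch
haptic_gt = torch.randn(8, 64)    # placeholder target ground truth

z = extractor(video_feat)                     # source -> semantic space
haptic_pred = reconstructor(z)                # semantic space -> target
loss = evaluation_loss(haptic_pred, haptic_gt, z)
opt.zero_grad(); loss.backward(); opt.step()  # closed-loop update
```

Because the evaluation loss is differentiable, a single backward pass delivers the optimization information to both the feature extraction and reconstruction modules at once, which is what closes the training loop described above.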
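The platform results report per-inference latency; the sketch below shows one common way to measure such model delay on a CUDA device such as an NVIDIA development board. The warm-up and run counts are arbitrary choices, and this is not the paper's measurement procedure.

```python
import time
import torch

def measure_model_delay(model, sample, device="cuda", warmup=10, runs=100):
    """Average forward-pass latency in milliseconds.
    torch.cuda.synchronize() ensures the GPU has finished all queued
    work before each timestamp is taken."""
    model = model.to(device).eval()
    sample = sample.to(device)
    with torch.no_grad():
        for _ in range(warmup):      # warm-up to exclude one-off costs
            model(sample)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(sample)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000 / runs
```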