To address the low accuracy, poor generalization ability, and large number of parameters of deep-learning-based COVID-19 diagnosis models, this paper proposes SMViT (siamese masked vision transformer), a lightweight siamese network for COVID-19 diagnosis built on ViT (vision transformer). First, a lightweight cyclic-substructure strategy is proposed, which builds the diagnosis network from multiple subnets with the same structure, thereby reducing the number of network parameters. Second, a masked self-supervised pre-training model based on ViT is proposed to enhance the model's ability to express latent features. Third, the siamese network SMViT is constructed to improve diagnostic accuracy and to mitigate the poor generalization of the model when only small samples are available. Finally, ablation experiments are used to determine and verify the structure of the model, and comparative experiments verify its diagnostic performance and lightweight capacity. Experimental results show that, compared with the most competitive ViT-based diagnostic model, SMViT improves the Accuracy, Specificity, Sensitivity, and F1 score by 1.42%, 4.62%, 0.40%, and 2.80%, respectively, on the X-ray dataset, and by 2.16%, 2.17%, 2.05%, and 2.06%, respectively, on the CT image dataset. The SMViT model generalizes well to small-sample datasets. Compared with ViT, SMViT has fewer parameters and higher diagnostic performance. [ABSTRACT FROM AUTHOR]
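For readers unfamiliar with the four reported metrics, the following is a minimal sketch of their standard definitions, assuming a binary setting (COVID-19 positive versus negative). The abstract does not state the number of classes or how the authors averaged their scores, so the function name `diagnostic_metrics` and the example counts are hypothetical and not taken from the paper.

```python
def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard binary diagnostic metrics from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives,
    true negatives, false negatives (hypothetical inputs).
    """
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    specificity = _ratio(tn, tn + fp)   # true negative rate
    sensitivity = _ratio(tp, tp + fn)   # true positive rate (recall)
    precision = _ratio(tp, tp + fp)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the paper.
    for name, value in diagnostic_metrics(tp=90, fp=5, tn=95, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Specificity and sensitivity are reported separately because a diagnostic model can score well on one while doing poorly on the other; accuracy alone hides that trade-off when the classes are imbalanced.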