This article presents a method to automatically learn the potentials of a stochastic model, in particular a conditional random field (CRF), in a non-parametric fashion. The proposed model is based on a neural architecture, in order to leverage the modeling capabilities of deep learning (DL) approaches to learn semantic and spatial information directly from the input data. Specifically, the methodology combines fully convolutional networks and fully connected neural networks. The idea is to access the multiscale information intrinsically extracted in the intermediate layers of a fully convolutional network by integrating fully connected neural networks at different scales, while favoring the interpretability of the hidden layers as posterior probabilities. The potentials of the CRF are learned through an additional convolutional layer, whose kernel models the considered local spatial interactions. The loss function is computed as a linear combination of cross-entropy losses, accounting for both the multiscale and the spatial information. To evaluate the capabilities of the proposed approach for the semantic segmentation of remote sensing images, the experimental validation was conducted on the ISPRS 2-D semantic labeling challenge Vaihingen and Potsdam datasets and on the IEEE GRSS data fusion contest Zeebruges dataset. As the ground truths of these benchmark datasets are spatially exhaustive, they were modified to approximate the spatially sparse ground truths common in real remote sensing applications. The results are significant: the proposed approach obtains higher average classification accuracies than the recent state-of-the-art techniques considered in this article. The code is available at https://github.com/Ayana-Inria/CRFNet-RS.
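To make the overall idea concrete, the following is a minimal PyTorch sketch, not the released CRFNet-RS code, of the ingredients named above: intermediate feature maps of a small FCN-style encoder are mapped to per-scale class posteriors by 1x1 convolutions (fully connected layers applied pixel-wise), an additional convolution over the class scores stands in for the learned local CRF potentials, and the loss is a linear combination of cross-entropy terms. All class names, layer widths, loss weights, and the use of an ignore index for unlabeled (sparse) pixels are illustrative assumptions.

```python
# Hypothetical sketch of the described architecture and loss; the actual
# implementation is available at https://github.com/Ayana-Inria/CRFNet-RS.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiscaleCRFSketch(nn.Module):
    def __init__(self, in_channels=3, num_classes=6, widths=(32, 64, 128), crf_kernel=3):
        super().__init__()
        blocks, heads, prev = [], [], in_channels
        for w in widths:
            # Generic FCN encoder block: convolution + downsampling.
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, w, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2)))
            # Per-scale 1x1 convolution = fully connected network applied at
            # each pixel, so the hidden layer can be read as class posteriors.
            heads.append(nn.Conv2d(w, num_classes, 1))
            prev = w
        self.blocks = nn.ModuleList(blocks)
        self.heads = nn.ModuleList(heads)
        # Convolution over the class scores: its kernel models the local
        # spatial interactions, i.e., the learned CRF potentials.
        self.crf_conv = nn.Conv2d(num_classes, num_classes, crf_kernel,
                                  padding=crf_kernel // 2)

    def forward(self, x):
        scale_logits = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            scale_logits.append(head(x))
        # Spatially refined prediction from the coarsest scale's posteriors.
        refined = self.crf_conv(F.softmax(scale_logits[-1], dim=1))
        return scale_logits, refined


def multiscale_loss(scale_logits, refined, target, scale_weights, spatial_weight):
    """Linear combination of cross-entropy terms over the scales plus the
    spatially refined output; `target` holds per-pixel class indices and
    255 marks unlabeled pixels (to mimic spatially sparse ground truth)."""
    loss = 0.0
    for w, logits in zip(scale_weights, scale_logits):
        up = F.interpolate(logits, size=target.shape[-2:], mode="bilinear",
                           align_corners=False)
        loss = loss + w * F.cross_entropy(up, target, ignore_index=255)
    up = F.interpolate(refined, size=target.shape[-2:], mode="bilinear",
                       align_corners=False)
    return loss + spatial_weight * F.cross_entropy(up, target, ignore_index=255)


if __name__ == "__main__":
    model = MultiscaleCRFSketch()
    images = torch.randn(2, 3, 128, 128)
    labels = torch.randint(0, 6, (2, 128, 128))
    scale_logits, refined = model(images)
    print(multiscale_loss(scale_logits, refined, labels,
                          scale_weights=(0.2, 0.3, 0.5), spatial_weight=1.0))
```

In this sketch the per-scale predictions are upsampled to the label resolution before the cross-entropy terms are combined; the relative weighting of the scales and of the spatial term is a free design choice standing in for the linear combination described above.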