194 results for "Xu, Zhi-Qin John"
Search Results
2. Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers
- Author
Zhang, Zhongwang, Lin, Pengxiao, Wang, Zhiwei, Zhang, Yaoyu, and Xu, Zhi-Qin John
- Subjects
Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
Transformers have demonstrated impressive capabilities across various tasks, yet their performance on compositional problems remains a subject of debate. In this study, we investigate the internal mechanisms underlying Transformers' behavior in compositional tasks. We find that complexity control strategies significantly influence whether the model learns primitive-level rules that generalize out-of-distribution (reasoning-based solutions) or relies solely on memorized mappings (memory-based solutions). By applying masking strategies to the model's information circuits and employing multiple complexity metrics, we reveal distinct internal working mechanisms associated with different solution types. Further analysis reveals that reasoning-based solutions exhibit a lower complexity bias, which aligns with the well-studied neuron condensation phenomenon. This lower complexity bias is hypothesized to be the key factor enabling these solutions to learn reasoning rules. We validate these conclusions across multiple real-world datasets, including image generation and natural language processing tasks, confirming the broad applicability of our findings., Comment: Mistakenly submitted as a replacement to 2405.05409v4
- Published
- 2025
3. A rationale from frequency perspective for grokking in training neural network
- Author
Zhou, Zhangchen, Zhang, Yaoyu, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning
- Abstract
Grokking is the phenomenon where neural networks (NNs) initially fit the training data and later generalize to the test data during training. In this paper, we empirically provide a frequency perspective to explain the emergence of this phenomenon in NNs. The core insight is that the networks initially learn the less salient frequency components present in the test data. We observe this phenomenon across both synthetic and real datasets, offering a novel viewpoint for elucidating the grokking phenomenon by characterizing it through the lens of frequency dynamics during training. Our empirical frequency-based analysis sheds new light on the grokking phenomenon and its underlying mechanisms.
- Published
- 2024
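To make the frequency perspective above concrete, here is a minimal sketch (assuming PyTorch and a toy two-frequency 1D target; all names and settings are illustrative, not the paper's setup) that tracks how a small network fits a low and a high Fourier mode of the target during training:

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.linspace(0, 1, 256).unsqueeze(1)
y = torch.sin(2 * np.pi * x) + 0.3 * torch.sin(2 * np.pi * 10 * x)  # low + high frequency

model = nn.Sequential(nn.Linear(1, 128), nn.Tanh(), nn.Linear(128, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def mode_error(pred, target, k):
    # Relative error of the k-th Fourier mode of the prediction.
    P = np.fft.rfft(pred.detach().numpy().ravel())
    T = np.fft.rfft(target.numpy().ravel())
    return abs(P[k] - T[k]) / (abs(T[k]) + 1e-12)

for step in range(5001):
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:
        print(step, mode_error(model(x), y, 1), mode_error(model(x), y, 10))
```

Under the abstract's reading, one would watch when the error of each mode starts to drop relative to when training and test accuracy move.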
4. The Buffer Mechanism for Multi-Step Information Reasoning in Language Models
- Author
Wang, Zhiwei, Wang, Yunji, Zhang, Zhongwang, Zhou, Zhangchen, Jin, Hui, Hu, Tianyang, Sun, Jiacheng, Li, Zhenguo, Zhang, Yaoyu, and Xu, Zhi-Qin John
- Subjects
Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
- Abstract
Large language models have consistently struggled with complex reasoning tasks, such as mathematical problem-solving. Investigating the internal reasoning mechanisms of these models can help us design better model architectures and training strategies, ultimately enhancing their reasoning capability. In this study, we constructed a symbolic dataset to investigate the mechanisms by which Transformer models employ a vertical thinking strategy, based on their inherent structure, and a horizontal thinking strategy, based on Chain of Thought, to achieve multi-step reasoning. We introduced the concept of a buffer mechanism: the model stores various kinds of information in distinct buffers and selectively extracts them through the query-key matrix. We proposed a random-matrix-based algorithm to enhance the model's reasoning ability, resulting in a 75% reduction in the training time required for the GPT-2 model to achieve generalization capability on the PrOntoQA dataset. These findings provide new insights into understanding the mechanisms of large language models.
- Published
- 2024
5. Initialization is Critical to Whether Transformers Fit Composite Functions by Reasoning or Memorizing
- Author
Zhang, Zhongwang, Lin, Pengxiao, Wang, Zhiwei, Zhang, Yaoyu, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning
- Abstract
Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate. In this work, we investigate how transformers behave on unseen compositional tasks. We discover that the parameter initialization scale plays a critical role in determining whether the model learns inferential (reasoning-based) solutions, which capture the underlying compositional primitives, or symmetric (memory-based) solutions, which simply memorize mappings without understanding the compositional structure. By analyzing the information flow and vector representations within the model, we reveal the distinct mechanisms underlying these solution types. We further find that inferential (reasoning-based) solutions exhibit a low complexity bias, which we hypothesize is a key factor enabling them to learn individual mappings for single anchors. We validate our conclusions on various real-world datasets. Our findings provide valuable insights into the role of the initialization scale in tuning between reasoning and memorization, and we propose the initialization rate $\gamma$ as a convenient tunable hyper-parameter in common deep learning frameworks, where $1/d_{\mathrm{in}}^\gamma$ is the standard deviation of the parameters of a layer with $d_{\mathrm{in}}$ input neurons.
- Published
- 2024
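The initialization rate $\gamma$ defined above translates directly into an initializer. A minimal sketch, assuming PyTorch; the helper name and the use of a normal distribution for biases are assumptions:

```python
import torch.nn as nn

def init_with_rate(model, gamma=1.0):
    # Set each linear layer's parameter std to 1 / d_in**gamma,
    # where d_in is the number of input neurons of that layer.
    for m in model.modules():
        if isinstance(m, nn.Linear):
            std = m.in_features ** (-gamma)
            nn.init.normal_(m.weight, mean=0.0, std=std)
            if m.bias is not None:
                nn.init.normal_(m.bias, mean=0.0, std=std)

net = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
init_with_rate(net, gamma=1.5)  # larger gamma -> smaller initialization scale
```

In the abstract's terms, sweeping $\gamma$ is the knob that tunes between the memory-based and reasoning-based regimes.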
6. Loss Jump During Loss Switch in Solving PDEs with Neural Networks
- Author
Wang, Zhiwei, Zhang, Lulu, Zhang, Zhongwang, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Mathematical Physics
- Abstract
Using neural networks to solve partial differential equations (PDEs) is gaining popularity as an alternative approach in the scientific computing community. Neural networks can integrate different types of information into the loss function, including observation data, governing equations, and variational forms. These loss functions can be broadly categorized into two types: the observation-data loss directly constrains and measures the model output, while the other loss functions constrain the network output indirectly and can be classified as model loss. However, this alternative approach still lacks a thorough understanding of its underlying mechanisms, including theoretical foundations and rigorous characterization of various phenomena. This work investigates how different loss functions impact the training of neural networks for solving PDEs. We discover a stable loss-jump phenomenon: when the loss function is switched from the data loss to the model loss, which includes derivative information of different orders, the neural network solution immediately deviates significantly from the exact solution. Further experiments reveal that this phenomenon arises from the different frequency preferences of neural networks under different loss functions. We theoretically analyze the frequency preference of neural networks under the model loss. This loss-jump phenomenon provides a valuable perspective for examining the underlying mechanisms of neural networks in solving PDEs.
- Published
- 2024
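The data-loss/model-loss split described above can be made concrete for a toy 1D problem $u''(x) = f(x)$. A schematic sketch in PyTorch; the function names and the specific equation are illustrative, not the paper's experiments:

```python
import torch

def data_loss(u_net, x_obs, u_obs):
    # Observation-data loss: directly constrains the network output.
    return ((u_net(x_obs) - u_obs) ** 2).mean()

def model_loss(u_net, x, f):
    # Model loss: constrains the PDE residual u'' - f, so it carries
    # derivative information about the network output.
    x = x.clone().requires_grad_(True)
    u = u_net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return ((d2u - f(x)) ** 2).mean()
```

The "loss switch" of the title corresponds to changing the training objective from the first function to the second mid-training.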
7. Efficient and Flexible Method for Reducing Moderate-size Deep Neural Networks with Condensation
- Author
Chen, Tianyi and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning
- Abstract
Neural networks have been extensively applied to a variety of tasks, achieving astounding results. Applying neural networks in the scientific field is an important research direction that is gaining increasing attention. In scientific applications, neural networks are generally of moderate size, mainly to ensure the speed of inference during deployment. Additionally, comparisons between neural networks and traditional algorithms are inevitable in scientific applications. These applications often require rapid computation, making the reduction of neural network size increasingly important. Existing work has found that the powerful capabilities of neural networks are primarily due to their non-linearity. Theoretical work has discovered that under strong non-linearity, neurons in the same layer tend to behave similarly, a phenomenon known as condensation. Condensation offers an opportunity to reduce a neural network to a smaller subnetwork with similar performance. In this article, we propose a condensation reduction algorithm to verify the feasibility of this idea on practical problems. Our reduction method can currently be applied to both fully connected and convolutional networks, achieving positive results. In a complex combustion-acceleration task, we reduced the neural network to 41.7% of its original size while maintaining prediction accuracy. In the CIFAR10 image classification task, we reduced the network to 11.5% of its original size while still maintaining a satisfactory validation accuracy. Our method can be applied to most trained neural networks, reducing computational load and improving inference speed.
- Published
- 2024
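A schematic of the reduction idea above: since condensed neurons in one layer share nearly the same input-weight direction, they can be merged into a single neuron whose outgoing weights absorb the rest. The sketch assumes a positively homogeneous (ReLU-like) activation and is only an illustration of the idea, not the paper's algorithm:

```python
import torch
import torch.nn.functional as F

def merge_condensed_neurons(W_in, b, W_out, cos_thresh=0.99):
    # W_in: (h, d) input weights of a hidden layer, b: (h,) biases,
    # W_out: (o, h) outgoing weights. For a positively homogeneous
    # activation, neurons whose (weight, bias) vectors are nearly parallel
    # compute nearly proportional outputs, so neuron j can be folded into
    # neuron i with a norm-ratio rescaling of its outgoing weights.
    v = torch.cat([W_in, b.unsqueeze(1)], dim=1)  # (h, d+1)
    dirs = F.normalize(v, dim=1)
    keep, merged = [], set()
    for i in range(v.shape[0]):
        if i in merged:
            continue
        keep.append(i)
        for j in range(i + 1, v.shape[0]):
            if j not in merged and torch.dot(dirs[i], dirs[j]) > cos_thresh:
                scale = v[j].norm() / (v[i].norm() + 1e-12)
                W_out[:, i] = W_out[:, i] + scale * W_out[:, j]
                merged.add(j)
    idx = torch.tensor(keep)
    return W_in[idx], b[idx], W_out[:, idx]
```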
8. Input gradient annealing neural network for solving low-temperature Fokker-Planck equations
- Author
Hang, Liangkai, Hu, Dan, and Xu, Zhi-Qin John
- Subjects
Mathematics - Dynamical Systems, Physics - Computational Physics
- Abstract
We present a novel yet simple deep learning approach, called input gradient annealing neural network (IGANN), for solving stationary Fokker-Planck equations. Traditional methods, such as finite difference and finite elements, suffer from the curse of dimensionality. Neural network based algorithms are meshless methods, which can avoid the curse of dimensionality. However, at low temperature, when directly solving a stationary Fokker-Planck equation with more than two metastable states in the generalized potential landscape, the small eigenvalue introduces numerical difficulties due to a large condition number. To overcome these problems, we introduce the IGANN method, which uses a penalty of negative input gradient annealing during the training. We demonstrate that the IGANN method can effectively solve high-dimensional and low-temperature Fokker-Planck equations through our numerical experiments.
- Published
- 2024
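The abstract does not spell out the form of the annealed penalty, so the following is only one schematic reading, with all names, the penalty form, and the linear annealing schedule as assumptions: a decaying penalty on negative input gradients of the network output is added to the PDE residual loss.

```python
import torch

def igann_step_loss(p_net, x, pde_residual, step, total_steps, lam0=1.0):
    # p_net: network for the stationary density; pde_residual(p, x):
    # Fokker-Planck residual loss (both are placeholders, not the paper's API).
    x = x.clone().requires_grad_(True)
    p = p_net(x)
    grad_x = torch.autograd.grad(p.sum(), x, create_graph=True)[0]
    lam = lam0 * (1.0 - step / total_steps)   # annealed penalty weight
    penalty = torch.relu(-grad_x).mean()      # penalize negative input gradients
    return pde_residual(p, x) + lam * penalty
```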
9. Overview Frequency Principle/Spectral Bias in Deep Learning
- Author
Xu, Zhi-Qin John, Zhang, Yaoyu, and Luo, Tao
- Published
- 2024
10. Understanding Time Series Anomaly State Detection through One-Class Classification
- Author
Zhou, Hanxu, Zhang, Yuan, Leng, Guangjie, Wang, Ruofan, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning
- Abstract
For a long time, research on time series anomaly detection has mainly focused on finding outliers within a given time series. Admittedly, this is consistent with some practical problems, but in other practical application scenarios the concern is different: given a standard time series, how do we judge whether another test time series deviates from it? This setting is closer to the problem discussed in one-class classification (OCC). Therefore, in this article, we try to re-understand and define the time series anomaly detection problem through OCC, which we call the 'time series anomaly state detection' problem. We first use stochastic processes and hypothesis testing to strictly define this problem and its corresponding anomalies. We then use time series classification datasets to construct an artificial dataset corresponding to the problem. We compile 38 anomaly detection algorithms and adapt some of them to handle this problem. Finally, through a large number of experiments, we fairly compare the actual performance of various time series anomaly detection algorithms, providing insights and directions for future research.
- Published
- 2024
11. Solving multiscale dynamical systems by deep learning
- Author
Yao, Junjie, Yi, Yuxiao, Hang, Liangkai, E, Weinan, Wang, Weizong, Zhang, Yaoyu, Zhang, Tianhan, and Xu, Zhi-Qin John
- Subjects
Mathematics - Numerical Analysis
- Abstract
Multiscale dynamical systems, modeled by high-dimensional stiff ordinary differential equations (ODEs) with wide-ranging characteristic timescales, arise across diverse fields of science and engineering, but their numerical solvers often encounter severe efficiency bottlenecks. This paper introduces a novel DeePODE method, which combines an Evolutionary Monte Carlo Sampling method (EMCS) with an efficient end-to-end deep neural network (DNN) to predict multiscale dynamical systems. We validate this method across dynamical systems from ecological systems to reactive flows, including a predator-prey model, a power system oscillation, a battery electrolyte thermal runaway, and turbulent reaction-diffusion systems with complex chemical kinetics. The method demonstrates robust generalization, allowing pre-trained DNN models to accurately predict behavior in previously unseen scenarios, largely due to the delicately constructed dataset. While theoretical guarantees remain to be established, empirical evidence shows that DeePODE achieves the accuracy of implicit numerical schemes while maintaining the computational efficiency of explicit schemes. This work underscores the crucial relationship between training data distribution and neural network generalization performance and demonstrates the potential of deep learning approaches for modeling complex dynamical systems across scientific and engineering domains., Comment: 18 pages, 6 figures
- Published
- 2024
12. An Unsupervised Deep Learning Approach for the Wave Equation Inverse Problem
- Author
Yan, Xiong-Bin, Wu, Keke, Xu, Zhi-Qin John, and Ma, Zheng
- Subjects
Mathematics - Numerical Analysis, Computer Science - Machine Learning
- Abstract
Full-waveform inversion (FWI) is a powerful geophysical imaging technique that infers high-resolution subsurface physical parameters by solving a non-convex optimization problem. However, due to limitations in observation, e.g., limited shots or receivers, and random noise, conventional inversion methods are confronted with numerous challenges, such as the local-minimum problem. In recent years, a substantial body of work has demonstrated that integrating deep neural networks and partial differential equations for solving full-waveform inversion problems yields promising performance. In this work, drawing inspiration from the expressive capacity of neural networks, we provide an unsupervised learning approach aimed at accurately reconstructing subsurface physical velocity parameters. This method is founded on a re-parametrization technique for Bayesian inference, achieved through a deep neural network with random weights. Notably, our proposed approach does not require a labeled training dataset, rendering it exceedingly versatile and adaptable to diverse subsurface models. Extensive experiments show that the proposed approach performs noticeably better than existing conventional inversion methods., Comment: 32 pages, 22 figures, 3 tables
- Published
- 2023
13. Optimistic Estimate Uncovers the Potential of Nonlinear Models
- Author
Zhang, Yaoyu, Zhang, Zhongwang, Zhang, Leyang, Bai, Zhiwei, Luo, Tao, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
We propose an optimistic estimate to evaluate the best possible fitting performance of nonlinear models. It yields an optimistic sample size that quantifies the smallest possible sample size to fit/recover a target function using a nonlinear model. We estimate the optimistic sample sizes for matrix factorization models, deep models, and deep neural networks (DNNs) with fully-connected or convolutional architecture. For each nonlinear model, our estimates predict a specific subset of targets that can be fitted at overparameterization, which are confirmed by our experiments. Our optimistic estimate reveals two special properties of the DNN models -- free expressiveness in width and costly expressiveness in connection. These properties suggest the following architecture design principles of DNNs: (i) feel free to add neurons/kernels; (ii) restrain from connecting neurons. Overall, our optimistic estimate theoretically unveils the vast potential of nonlinear models in fitting at overparameterization. Based on this framework, we anticipate gaining a deeper understanding of how and why numerous nonlinear models such as DNNs can effectively realize their potential in practice in the near future.
- Published
- 2023
14. Stochastic Modified Equations and Dynamics of Dropout Algorithm
- Author
Zhang, Zhongwang, Li, Yuqing, Luo, Tao, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning
- Abstract
Dropout is a widely utilized regularization technique in the training of neural networks; nevertheless, its underlying mechanism and its impact on achieving good generalization remain poorly understood. In this work, we derive the stochastic modified equations for analyzing the dynamics of dropout, whereby its discrete iteration process is approximated by a class of stochastic differential equations. To investigate the underlying mechanism by which dropout facilitates the identification of flatter minima, we study the noise structure of the derived stochastic modified equation for dropout. By drawing upon the structural resemblance between the Hessian and the covariance through several intuitive approximations, we empirically demonstrate the universal presence of the inverse variance-flatness relation and the Hessian-variance relation throughout the training process of dropout. These theoretical and empirical findings make a substantial contribution to our understanding of the inherent tendency of dropout to locate flatter minima.
- Published
- 2023
15. Loss Spike in Training Neural Networks
- Author
Li, Xiaolong, Xu, Zhi-Qin John, and Zhang, Zhongwang
- Subjects
Computer Science - Machine Learning
- Abstract
In this work, we investigate the mechanism underlying loss spikes observed during neural network training. When the training enters a region with a lower-loss-as-sharper (LLAS) structure, the training becomes unstable, and the loss grows exponentially once the loss landscape is too sharp, producing the rapid ascent of a loss spike. The training stabilizes when it finds a flat region. From a frequency perspective, we explain the subsequent rapid descent in loss as being primarily driven by low-frequency components. We observe a deviation in the first eigendirection, which can be reasonably explained by the frequency principle: low-frequency information is captured rapidly, leading to the rapid descent. Inspired by our analysis of loss spikes, we revisit the link between the maximum eigenvalue of the loss Hessian ($\lambda_{\mathrm{max}}$), flatness, and generalization. We suggest that $\lambda_{\mathrm{max}}$ is a good measure of sharpness but not a good measure of generalization. Furthermore, we experimentally observe that loss spikes can facilitate condensation, causing input weights to evolve towards the same direction, and our experiments show a correlation (similar trend) between $\lambda_{\mathrm{max}}$ and condensation. This observation may provide valuable insights for further theoretical research on the relationship between loss spikes, $\lambda_{\mathrm{max}}$, and generalization.
- Published
- 2023
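The $\lambda_{\mathrm{max}}$ referenced above can be estimated without ever forming the Hessian, via power iteration with Hessian-vector products. This is the standard diagnostic, not necessarily the paper's exact procedure:

```python
import torch

def estimate_lambda_max(loss, params, iters=50):
    # Power iteration using Hessian-vector products from double backprop.
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    lam = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        Hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        lam = sum((h * vi).sum() for h, vi in zip(Hv, v)).item()  # Rayleigh quotient v^T H v
        v = [h.detach() for h in Hv]
    return lam
```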
16. Understanding the Initial Condensation of Convolutional Neural Networks
- Author
Zhou, Zhangchen, Zhou, Hanxu, Li, Yuqing, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
- Abstract
Previous research has shown that fully-connected networks with small initialization and gradient-based training methods exhibit a phenomenon known as condensation during training. This phenomenon refers to the input weights of hidden neurons condensing into isolated orientations during training, revealing an implicit bias towards simple solutions in the parameter space. However, the impact of neural network structure on condensation has not been investigated yet. In this study, we focus on the investigation of convolutional neural networks (CNNs). Our experiments suggest that when subjected to small initialization and gradient-based training methods, kernel weights within the same CNN layer also cluster together during training, demonstrating a significant degree of condensation. Theoretically, we demonstrate that in a finite training period, kernels of a two-layer CNN with small initialization will converge to one or a few directions. This work represents a step towards a better understanding of the non-linear training behavior exhibited by neural networks with specialized structures.
- Published
- 2023
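A small diagnostic in the spirit of the study above: the degree to which kernels in one convolutional layer condense can be read off the pairwise cosine similarities of their flattened weights (an illustrative measurement, not the paper's code):

```python
import torch
import torch.nn.functional as F

def kernel_cosine_matrix(conv):
    # Flatten each output kernel and compare directions; entries near +1 or -1
    # indicate kernels clustering onto the same (or opposite) orientation.
    W = conv.weight.detach().flatten(1)   # (out_channels, in_channels*kh*kw)
    W = F.normalize(W, dim=1)
    return W @ W.T

conv = torch.nn.Conv2d(3, 16, kernel_size=3)
print(kernel_cosine_matrix(conv))  # near-random directions at default init
```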
17. Laplace-fPINNs: Laplace-based fractional physics-informed neural networks for solving forward and inverse problems of subdiffusion
- Author
Yan, Xiong-Bin, Xu, Zhi-Qin John, and Ma, Zheng
- Subjects
Mathematics - Numerical Analysis, Computer Science - Machine Learning
- Abstract
Physics-informed neural networks (PINNs) have shown promise in solving forward and inverse problems of fractional diffusion equations. However, because automatic differentiation is not applicable to fractional derivatives, solving fractional diffusion equations with PINNs requires addressing additional challenges. To address this issue, this paper proposes an extension to PINNs called Laplace-based fractional physics-informed neural networks (Laplace-fPINNs), which can effectively solve forward and inverse problems of fractional diffusion equations. This approach avoids introducing a large number of auxiliary points and simplifies the loss function. We validate the effectiveness of the Laplace-fPINNs approach with several examples. Our numerical results demonstrate that the Laplace-fPINNs method can effectively solve both forward and inverse problems of high-dimensional fractional diffusion equations.
- Published
- 2023
18. Phase Diagram of Initial Condensation for Two-layer Neural Networks
- Author
Chen, Zhengan, Li, Yuqing, Luo, Tao, Zhou, Zhangchen, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Condensed Matter - Disordered Systems and Neural Networks, Mathematics - Optimization and Control, Statistics - Machine Learning, 68U99, 90C26, 34A45
- Abstract
The phenomenon of distinct behaviors exhibited by neural networks under varying scales of initialization remains an enigma in deep learning research. In this paper, based on the earlier work by Luo et al.~\cite{luo2021phase}, we present a phase diagram of initial condensation for two-layer neural networks. Condensation is a phenomenon wherein the weight vectors of neural networks concentrate on isolated orientations during training; it is a feature of the non-linear learning process that enables neural networks to achieve better generalization. Our phase diagram provides a comprehensive understanding of the dynamical regimes of neural networks and their dependence on the choice of initialization-related hyperparameters. Furthermore, we demonstrate in detail the underlying mechanisms by which small initialization leads to condensation at the initial training stage.
- Published
- 2023
19. Linear Stability Hypothesis and Rank Stratification for Nonlinear Models
- Author
Zhang, Yaoyu, Zhang, Zhongwang, Zhang, Leyang, Bai, Zhiwei, Luo, Tao, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
Models with nonlinear architectures/parameterizations such as deep neural networks (DNNs) are well known for their mysteriously good generalization performance at overparameterization. In this work, we tackle this mystery from a novel perspective focusing on the transition of the target recovery/fitting accuracy as a function of the training data size. We propose a rank stratification for general nonlinear models to uncover a model rank as an "effective size of parameters" for each function in the function space of the corresponding model. Moreover, we establish a linear stability theory proving that a target function almost surely becomes linearly stable when the training data size equals its model rank. Supported by our experiments, we propose a linear stability hypothesis that linearly stable functions are preferred by nonlinear training. By these results, model rank of a target function predicts a minimal training data size for its successful recovery. Specifically for the matrix factorization model and DNNs of fully-connected or convolutional architectures, our rank stratification shows that the model rank for specific target functions can be much lower than the size of model parameters. This result predicts the target recovery capability even at heavy overparameterization for these nonlinear models as demonstrated quantitatively by our experiments. Overall, our work provides a unified framework with quantitative prediction power to understand the mysterious target recovery behavior at overparameterization for general nonlinear models.
- Published
- 2022
20. Bayesian Inversion with Neural Operator (BINO) for Modeling Subdiffusion: Forward and Inverse Problems
- Author
Yan, Xiong-bin, Xu, Zhi-Qin John, and Ma, Zheng
- Subjects
Mathematics - Numerical Analysis, Computer Science - Machine Learning
- Abstract
Fractional diffusion equations have been an effective tool for modeling anomalous diffusion in complicated systems. However, traditional numerical methods incur expensive computation and storage costs because of the memory effect brought by the convolution integral of the time-fractional derivative. We propose a Bayesian Inversion with Neural Operator (BINO) approach to overcome these difficulties as follows. We employ a deep operator network to learn the solution operators for fractional diffusion equations, allowing us to swiftly and precisely solve a forward problem for given inputs (including the fractional order, diffusion coefficient, source terms, etc.). In addition, we integrate the deep operator network with a Bayesian inversion method for modeling problems governed by subdiffusion processes and for solving inverse subdiffusion problems, which significantly reduces time costs without demanding overwhelming storage resources. A large number of numerical experiments demonstrate that the operator learning method proposed in this work can efficiently solve forward problems and Bayesian inverse problems of the subdiffusion equation.
- Published
- 2022
21. DeepFlame: A deep learning empowered open-source platform for reacting flow simulations
- Author
Mao, Runze, Lin, Minqi, Zhang, Yan, Zhang, Tianhan, Xu, Zhi-Qin John, and Chen, Zhi X.
- Subjects
Physics - Fluid Dynamics
- Abstract
In this work, we introduce DeepFlame, an open-source C++ platform with the capabilities of utilising machine learning algorithms and pre-trained models to solve for reactive flows. We combine the individual strengths of the computational fluid dynamics library OpenFOAM, machine learning framework Torch, and chemical kinetics program Cantera. The complexity of cross-library function and data interfacing (the core of DeepFlame) is minimised to achieve a simple and clear workflow for code maintenance, extension and upgrading. As a demonstration, we apply our recent work on deep learning for predicting chemical kinetics (Zhang et al. Combust. Flame vol. 245 pp. 112319, 2022) to highlight the potential of machine learning in accelerating reacting flow simulation. A thorough code validation is conducted via a broad range of canonical cases to assess its accuracy and efficiency. The results demonstrate that the convection-diffusion-reaction algorithms implemented in DeepFlame are robust and accurate for both steady-state and transient processes. In addition, a number of methods aiming to further improve the computational efficiency, e.g. dynamic load balancing and adaptive mesh refinement, are explored. Their performances are also evaluated and reported. With the deep learning method implemented in this work, a speed-up of two orders of magnitude is achieved in a simple hydrogen ignition case when performed on a medium-end graphics processing unit (GPU). Further gain in computational efficiency is expected for hydrocarbon and other complex fuels. A similar level of acceleration is obtained on an AI-specific chip - deep computing unit (DCU), highlighting the potential of DeepFlame in leveraging the next-generation computing architecture and hardware.
- Published
- 2022
22. Implicit regularization of dropout
- Author
Zhang, Zhongwang and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning
- Abstract
It is important to understand how dropout, a popular regularization method, aids in achieving a good generalization solution during neural network training. In this work, we present a theoretical derivation of an implicit regularization of dropout, which is validated by a series of experiments. Additionally, we numerically study two implications of the implicit regularization, which intuitively rationalize why dropout helps generalization. Firstly, we find that the input weights of hidden neurons tend to condense on isolated orientations when trained with dropout. Condensation is a feature of the non-linear learning process that makes the network less complex. Secondly, we experimentally find that training with dropout leads to a neural network with a flatter minimum compared with standard gradient descent training, and that the implicit regularization is the key to finding flat solutions. Although our theory mainly focuses on dropout used in the last hidden layer, our experiments apply to general dropout in training neural networks. This work points out a distinct characteristic of dropout compared with stochastic gradient descent and serves as an important basis for fully understanding dropout., Comment: arXiv admin note: text overlap with arXiv:2111.01022
- Published
- 2022
23. Embedding Principle in Depth for the Loss Landscape Analysis of Deep Neural Networks
- Author
Bai, Zhiwei, Luo, Tao, Xu, Zhi-Qin John, and Zhang, Yaoyu
- Subjects
Computer Science - Machine Learning
- Abstract
Understanding the relation between deep and shallow neural networks is extremely important for the theoretical study of deep learning. In this work, we discover an embedding principle in depth: the loss landscape of an NN "contains" all critical points of the loss landscapes of shallower NNs. The key tool for our discovery is the critical lifting operator proposed in this work, which maps any critical point of a network to critical manifolds of any deeper network while preserving the outputs. This principle provides new insights into many widely observed behaviors of DNNs. Regarding the easy training of deep networks, we show that a local minimum of an NN can be lifted to strict saddle points of a deeper NN. Regarding the acceleration effect of batch normalization, we demonstrate that batch normalization helps avoid the critical manifolds lifted from shallower NNs by suppressing layer linearization. We also prove that increasing training data shrinks the lifted critical manifolds, which can accelerate training, as demonstrated in experiments. Overall, our discovery of the embedding principle in depth uncovers the depth-wise hierarchical structure of the deep learning loss landscape, which serves as a solid foundation for further study of the role of depth in DNNs.
- Published
- 2022
24. An Experimental Comparison Between Temporal Difference and Residual Gradient with Neural Network Approximation
- Author
Yin, Shuyu, Luo, Tao, Liu, Peilin, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence
- Abstract
Gradient descent and its variants are popular in training neural networks. However, in deep Q-learning with neural network approximation, a type of reinforcement learning, gradient descent (also known as Residual Gradient (RG)) is barely used to solve the Bellman residual minimization problem; instead, Temporal Difference (TD), an incomplete gradient descent method, prevails. In this work, we perform extensive experiments showing that TD outperforms RG: when the training leads to a small Bellman residual error, the solution found by TD has a better policy and is more robust against perturbation of the neural network parameters. We further use experiments to reveal a key difference between reinforcement learning and supervised learning: a small Bellman residual error can correspond to a bad policy in reinforcement learning, whereas in supervised learning the test loss is a standard index of performance. We also empirically show that the term missing in TD is a key reason why RG performs badly. Our work shows that the performance of a deep Q-learning solution is closely related to the training dynamics, and how an incomplete gradient descent method can find a good policy is an interesting question for future study.
- Published
- 2022
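The TD/RG distinction above comes down to whether gradients flow through the bootstrapped target. A minimal sketch in autograd terms (names and shapes are illustrative):

```python
import torch

def bellman_losses(q_net, s, a, r, s_next, gamma=0.99):
    # q_net maps states to Q-values over actions; a has shape (batch, 1), dtype int64.
    q = q_net(s).gather(1, a)
    target = r + gamma * q_net(s_next).max(dim=1, keepdim=True).values
    td_loss = ((target.detach() - q) ** 2).mean()  # TD: target treated as a constant
    rg_loss = ((target - q) ** 2).mean()           # RG: full (residual) gradient
    return td_loss, rg_loss
```

The gradient contribution dropped by detach() is exactly the "missing term" the abstract identifies as the key difference between the two methods.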
25. Empirical Phase Diagram for Three-layer Neural Networks with Infinite Width
- Author
Zhou, Hanxu, Zhou, Qixuan, Jin, Zhenyuan, Luo, Tao, Zhang, Yaoyu, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, 68T07
- Abstract
Substantial work indicates that the dynamics of neural networks (NNs) are closely related to the initialization of their parameters. Inspired by the phase diagram for two-layer ReLU NNs with infinite width (Luo et al., 2021), we make a step towards drawing a phase diagram for three-layer ReLU NNs with infinite width. First, we derive a normalized gradient flow for three-layer ReLU NNs and obtain two key independent quantities that distinguish the dynamical regimes of common initialization methods. With carefully designed experiments and a large computational cost, on both synthetic and real datasets, we find that the dynamics of each layer can likewise be divided into a linear regime and a condensed regime, separated by a critical regime. The criterion is the relative change of input weights (the input weight of a hidden neuron consists of the weight from its input layer to the hidden neuron and its bias term) as the width approaches infinity during training, which tends to $0$, $+\infty$, and $O(1)$, respectively. In addition, we demonstrate that different layers can lie in different dynamical regimes within a single training process of a deep NN. In the condensed regime, we also observe the condensation of weights in isolated orientations with low complexity. Through experiments in the three-layer setting, our phase diagram suggests complicated dynamical regimes for deep NNs, consisting of three possible regimes together with their mixtures, and provides guidance for studying deep NNs in different initialization regimes, revealing the possibility of completely different dynamics emerging within a deep NN across its layers., Comment: arXiv admin note: text overlap with arXiv:2007.07497
- Published
- 2022
26. Limitation of Characterizing Implicit Regularization by Data-independent Functions
- Author
Zhang, Leyang, Xu, Zhi-Qin John, Luo, Tao, and Zhang, Yaoyu
- Subjects
Computer Science - Machine Learning
- Abstract
In recent years, understanding the implicit regularization of neural networks (NNs) has become a central task in deep learning theory. However, implicit regularization itself is not completely defined or well understood. In this work, we attempt to mathematically define and study implicit regularization. Importantly, we explore the limitations of a common approach: characterizing implicit regularization by data-independent functions. We propose two dynamical mechanisms, i.e., Two-point and One-point Overlapping mechanisms, based on which we provide two recipes for producing classes of one-hidden-neuron NNs that provably cannot be fully characterized by one type of, or even all, data-independent functions. Following previous works, our results further emphasize the profound data dependency of implicit regularization in general, inspiring us to study the data dependency of NN implicit regularization in detail in the future., Comment: Revised the structure of the paper and added results about implicit regularization in training two-layer networks or even more general "activation function-related" models
- Published
- 2022
27. Overview frequency principle/spectral bias in deep learning
- Author
Xu, Zhi-Qin John, Zhang, Yaoyu, and Luo, Tao
- Subjects
Computer Science - Machine Learning
- Abstract
Understanding deep learning is increasingly important as it penetrates more and more into industry and science. In recent years, a research line from Fourier analysis has shed light on this magical "black box" by showing a Frequency Principle (F-Principle or spectral bias) in the training behavior of deep neural networks (DNNs) -- DNNs often fit functions from low to high frequency during training. The F-Principle was first demonstrated on one-dimensional synthetic data, followed by verification on high-dimensional real datasets. A series of subsequent works enhanced the validity of the F-Principle. This low-frequency implicit bias reveals the strength of neural networks in learning low-frequency functions as well as their deficiency in learning high-frequency functions. Such understanding inspires the design of DNN-based algorithms for practical problems, explains experimental phenomena emerging in various scenarios, and further advances the study of deep learning from the frequency perspective. Although incomplete, we provide an overview of the F-Principle and propose some open problems for future research.
- Published
- 2022
28. A multi-scale sampling method for accurate and robust deep neural network to predict combustion chemical kinetics
- Author
Zhang, Tianhan, Yi, Yuxiao, Xu, Yifan, Chen, Zhi X., Zhang, Yaoyu, E, Weinan, and Xu, Zhi-Qin John
- Subjects
Physics - Chemical Physics, Computer Science - Machine Learning, Mathematics - Numerical Analysis, Physics - Computational Physics, Physics - Fluid Dynamics
- Abstract
Machine learning has long been considered a black box for predicting combustion chemical kinetics, due to the extremely large number of parameters and the lack of evaluation standards and reproducibility. The current work aims to understand two basic questions regarding the deep neural network (DNN) method: what data the DNN needs and how general the DNN method can be. Sampling and preprocessing determine the DNN training dataset and further affect the DNN's prediction ability. The current work proposes using the Box-Cox transformation (BCT) to preprocess the combustion data. In addition, this work compares different sampling methods with and without preprocessing, including the Monte Carlo method, manifold sampling, a generative neural network method (cycle-GAN), and the newly proposed multi-scale sampling. Our results reveal that a DNN trained on manifold data can capture the chemical kinetics in limited configurations but is not robust to perturbation, which is inevitable for a DNN coupled with a flow field. The Monte Carlo and cycle-GAN samplings can cover a wider phase space but fail to capture small-scale intermediate species, producing poor prediction results. A three-hidden-layer DNN based on the multi-scale method, without specific flame simulation data, can predict chemical kinetics in various scenarios and remains stable during temporal evolution. This single DNN is readily implemented with several CFD codes and is validated in various combustors, including (1) zero-dimensional autoignition, (2) a one-dimensional freely propagating flame, (3) a two-dimensional jet flame with triple-flame structure, and (4) three-dimensional turbulent lifted flames. The results demonstrate the satisfying accuracy and generalization ability of the pre-trained DNN. Fortran and Python versions of the DNN and example code are attached in the supplementary material for reproducibility.
- Published
- 2022
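The Box-Cox transformation (BCT) proposed above is a standard power transform; a small sketch of how it compresses the many orders of magnitude spanned by combustion states (the $\lambda$ value and the zero-guard are assumptions, not the paper's settings):

```python
import numpy as np

def box_cox(x, lam=0.1, eps=1e-30):
    # BCT: (x**lam - 1)/lam for lam != 0, log(x) for lam == 0.
    # eps guards species mass fractions that can be exactly zero.
    x = np.asarray(x, dtype=np.float64) + eps
    if lam == 0.0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

print(box_cox(np.array([1e-12, 1e-6, 1e-2, 1.0])))  # widely spread scales become comparable
```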
29. A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics
- Author
Wang, Zhiwei, Zhang, Yaoyu, Zhao, Enhan, Ju, Yiguang, E, Weinan, Xu, Zhi-Qin John, and Zhang, Tianhan
- Subjects
Computer Science - Machine Learning, Mathematics - Optimization and Control
- Abstract
A deep learning-based model reduction (DeePMR) method for simplifying chemical kinetics is proposed and validated using high-temperature auto-ignitions, perfectly stirred reactors (PSR), and one-dimensional freely propagating flames of n-heptane/air mixtures. Mechanism reduction is modeled as an optimization problem on Boolean space, where a Boolean vector, each entry corresponding to a species, represents a reduced mechanism. The optimization goal is to minimize the size of the reduced mechanism given an error tolerance on a group of pre-selected benchmark quantities. The key idea of DeePMR is to employ a deep neural network (DNN) to formulate the objective function of the optimization problem. To explore the high-dimensional Boolean space efficiently, an iterative DNN-assisted data sampling and DNN training procedure is implemented. The results show that DNN assistance improves sampling efficiency significantly, selecting only $10^5$ samples out of $10^{34}$ possible samples for the DNN to achieve sufficient accuracy. The results demonstrate the capability of the DNN to recognize key species and reasonably predict reduced mechanism performance. The well-trained DNN guarantees the optimal reduced mechanism by solving an inverse optimization problem. Comparing ignition delay times, laminar flame speeds, and temperatures in PSRs, the resulting skeletal mechanism has fewer species (45 species) but the same level of accuracy as the skeletal mechanism (56 species) obtained by the Path Flux Analysis (PFA) method. In addition, the skeletal mechanism can be further reduced to 28 species if only atmospheric, near-stoichiometric conditions (equivalence ratio between 0.6 and 1.2) are considered. DeePMR provides an innovative way to perform model reduction and demonstrates the great potential of data-driven methods in the combustion area.
- Published
- 2022
30. Subspace Decomposition based DNN algorithm for elliptic type multi-scale PDEs
- Author
Li, Xi-An, Xu, Zhi-Qin John, and Zhang, Lei
- Subjects
Computer Science - Machine Learning
- Abstract
While deep learning algorithms demonstrate great potential in scientific computing, their application to multi-scale problems remains a big challenge. This is manifested by the "frequency principle": neural networks tend to learn low-frequency components first. Novel architectures such as the multi-scale deep neural network (MscaleDNN) were proposed to alleviate this problem to some extent. In this paper, we construct a subspace decomposition based DNN (dubbed SD$^2$NN) architecture for a class of multi-scale problems by combining traditional numerical analysis ideas with MscaleDNN algorithms. The proposed architecture comprises one low-frequency normal DNN submodule and one (or a few) high-frequency MscaleDNN submodule(s), designed to capture the smooth part and the oscillatory part of the multi-scale solution, respectively. In addition, a novel trigonometric activation function is incorporated in the SD$^2$NN model. We demonstrate the performance of the SD$^2$NN architecture on several benchmark multi-scale problems in regular and irregular geometric domains. Numerical results show that the SD$^2$NN model is superior to existing models such as MscaleDNN., Comment: 19 pages, 11 figures
- Published
- 2021
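A schematic of the architecture described above: one plain low-frequency submodule plus MscaleDNN-style high-frequency submodules whose inputs are multiplied by scale factors, summed at the output. Sizes and scales are illustrative, and tanh stands in for the paper's trigonometric activation:

```python
import torch
import torch.nn as nn

class SD2NN(nn.Module):
    def __init__(self, scales=(8.0, 16.0, 32.0)):
        super().__init__()
        # Low-frequency submodule for the smooth part of the solution.
        self.low = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
        # High-frequency MscaleDNN-style submodules for the oscillatory part.
        self.high = nn.ModuleList(
            nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
            for _ in scales)
        self.scales = scales

    def forward(self, x):
        out = self.low(x)
        for a, net in zip(self.scales, self.high):
            out = out + net(a * x)  # input scaling shifts learning to higher frequencies
        return out
```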
31. Embedding Principle: a hierarchical structure of loss landscape of deep neural networks
- Author
Zhang, Yaoyu, Li, Yuqing, Zhang, Zhongwang, Luo, Tao, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence
- Abstract
We prove a general Embedding Principle of the loss landscape of deep neural networks (NNs), unraveling a hierarchical structure: the loss landscape of an NN contains all critical points of all narrower NNs. This result is obtained by constructing a class of critical embeddings that map any critical point of a narrower NN to a critical point of the target NN with the same output function. By discovering a wide class of general compatible critical embeddings, we provide a gross estimate of the dimension of critical submanifolds embedded from critical points of narrower NNs. We further prove an irreversibility property of any critical embedding: the number of negative/zero/positive eigenvalues of the Hessian matrix of a critical point may increase but never decrease as an NN becomes wider through the embedding. Using a special realization of general compatible critical embedding, we prove a stringent necessary condition for a "truly-bad" critical point that never becomes a strict-saddle point through any critical embedding. This result implies that strict-saddle points are commonplace in wide NNs, which may be an important reason underlying the easy optimization of wide NNs widely observed in practice.
- Published
- 2021
32. Dropout in Training Neural Networks: Flatness of Solution and Noise Structure
- Author
Zhang, Zhongwang, Zhou, Hanxu, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning
- Abstract
It is important to understand how the popular regularization method dropout helps neural network training find a good generalization solution. In this work, we show that training with dropout finds a neural network with a flatter minimum compared with standard gradient descent training. We further find, through experiments on various datasets, i.e., MNIST, CIFAR-10, CIFAR-100 and Multi30k, and various structures, i.e., fully-connected networks, large residual convolutional networks and transformers, that the variance of the noise induced by dropout is larger along the sharper directions of the loss landscape and that the Hessian of the loss landscape at the found minima aligns with the noise covariance matrix. For networks with piece-wise linear activation functions and dropout applied only at the last hidden layer, we theoretically derive the Hessian and the covariance of the dropout randomness, and these two quantities are very similar. This similarity may be a key reason accounting for the effectiveness of dropout.
- Published
- 2021
33. Data-informed Deep Optimization
- Author
Zhang, Lulu, Xu, Zhi-Qin John, and Zhang, Yaoyu
- Subjects
Mathematics - Optimization and Control, Computer Science - Machine Learning
- Abstract
Complex design problems are common in the scientific and industrial fields. In practice, objective functions or constraints of these problems often have no explicit formulas and can be estimated only at a set of sampling points, through experiments or simulations. Such optimization problems are especially challenging when the design parameters are high-dimensional, due to the curse of dimensionality. In this work, we propose a data-informed deep optimization (DiDo) approach as follows: first, we use a deep neural network (DNN) classifier to learn the feasible region; second, we sample feasible points based on the DNN classifier for fitting the objective function; finally, we find optimal points of the DNN-surrogate optimization problem by gradient descent. To demonstrate the effectiveness of our DiDo approach, we consider a practical design case from industry, in which our approach yields good solutions using a limited amount of training data. We further use a 100-dimensional toy example to show the effectiveness of our model for higher dimensional problems. Our results indicate that the DiDo approach empowered by DNNs is flexible and promising for solving general high-dimensional design problems in practice.
- Published
- 2021
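The three DiDo stages read naturally as a small optimization loop. Everything below (names, penalty weight, the soft feasibility term) is an illustrative guess at the shape of the approach, not the paper's code:

```python
import torch

def dido_optimize(classifier, surrogate, x0, steps=200, lr=1e-2, penalty=10.0):
    # Stage 3 of DiDo: gradient descent on the DNN-surrogate objective,
    # kept inside the feasible region learned by the DNN classifier.
    x = x0.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        infeas = torch.relu(0.5 - classifier(x)).mean()  # 0 when x is deemed feasible
        loss = surrogate(x).mean() + penalty * infeas    # minimize surrogate objective
        opt.zero_grad(); loss.backward(); opt.step()
    return x.detach()
```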
34. Force-in-domain GAN inversion
- Author
Leng, Guangjie, Zhu, Yekun, and Xu, Zhi-Qin John
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
- Abstract
Empirical works suggest that various semantics emerge in the latent space of Generative Adversarial Networks (GANs) when trained to generate images. To perform real image editing, an accurate mapping from the real image to the latent space is required in order to leverage these learned semantics, which is important yet difficult. An in-domain GAN inversion approach was recently proposed to constrain the inverted code to the latent space by forcing the reconstructed image obtained from the inverted code to lie within the real image space. Empirically, we find that the code inverted by the in-domain GAN can still deviate from the latent space significantly. To solve this problem, we propose a force-in-domain GAN, built upon the in-domain GAN, which utilizes a discriminator to force the inverted code to stay within the latent space. The force-in-domain GAN can also be interpreted as a cycle-GAN with slight modification. Extensive experiments show that our force-in-domain GAN not only reconstructs the target image at the pixel level, but also aligns the inverted code well with the latent space for semantic editing.
- Published
- 2021
35. MOD-Net: A Machine Learning Approach via Model-Operator-Data Network for Solving PDEs
- Author
Zhang, Lulu, Luo, Tao, Zhang, Yaoyu, E, Weinan, Xu, Zhi-Qin John, and Ma, Zheng
- Subjects
Mathematics - Numerical Analysis, Computer Science - Machine Learning
- Abstract
In this paper, we propose a machine learning approach via a model-operator-data network (MOD-Net) for solving PDEs. A MOD-Net is driven by a model to solve PDEs based on an operator representation, with regularization from data. For linear PDEs, we use a DNN to parameterize the Green's function and obtain a neural operator that approximates the solution according to Green's method. To train the DNN, the empirical risk consists of the mean squared loss with the least squares formulation or the variational formulation of the governing equation and boundary conditions. For complicated problems, the empirical risk also includes a few labels, which are computed on coarse grid points with cheap computation cost and significantly improve the model accuracy. Intuitively, the labeled dataset works as a regularization in addition to the model constraints. The MOD-Net solves a family of PDEs rather than a specific one and is much more efficient than the original neural operator approach because only a few expensive labels are required. We numerically show that MOD-Net is very efficient in solving the Poisson equation and the one-dimensional radiative transfer equation. For nonlinear PDEs, the nonlinear MOD-Net can similarly be used as an ansatz, as exemplified by solving several nonlinear PDE problems, such as the Burgers equation.
- Published
- 2021
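For the linear case above, the Green's-method ansatz that MOD-Net parameterizes can be written schematically as $u(x) \approx \int_{\Omega} G_{\theta}(x,y) f(y)\, dy \approx \frac{|\Omega|}{N} \sum_{j=1}^{N} G_{\theta}(x, y_j) f(y_j)$, where $G_{\theta}$ is the DNN parameterizing the Green's function, $f$ is the source term, and the Monte Carlo quadrature over sample points $y_j$ in a domain $\Omega$ is an assumed discretization, not necessarily the paper's.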
36. Embedding Principle of Loss Landscape of Deep Neural Networks
- Author
Zhang, Yaoyu, Zhang, Zhongwang, Luo, Tao, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
Understanding the structure of the loss landscape of deep neural networks (DNNs) is obviously important. In this work, we prove an embedding principle: the loss landscape of a DNN "contains" all the critical points of all narrower DNNs. More precisely, we propose a critical embedding such that any critical point, e.g., a local or global minimum, of a narrower DNN can be embedded into a critical point/hyperplane of the target DNN with higher degeneracy while preserving the DNN output function. The embedding structure of critical points is independent of loss function and training data, showing a stark difference from other nonconvex problems such as protein folding. Empirically, we find that a wide DNN is often attracted by highly-degenerate critical points that are embedded from narrow DNNs. The embedding principle provides an explanation for the general easy optimization of wide DNNs and unravels a potential implicit low-complexity regularization during training. Overall, our work provides a skeleton for the study of the loss landscape of DNNs and its implications, by which a more exact and comprehensive understanding can be anticipated in the near future.
- Published
- 2021
37. An Upper Limit of Decaying Rate with Respect to Frequency in Deep Neural Network
- Author
Luo, Tao, Ma, Zheng, Wang, Zhiwei, Xu, Zhi-Qin John, and Zhang, Yaoyu
- Subjects
Computer Science - Machine Learning, Mathematics - Numerical Analysis
- Abstract
A deep neural network (DNN) usually learns the target function from low to high frequency, which is called the frequency principle or spectral bias. This frequency principle sheds light on a high-frequency curse of DNNs: they find it difficult to learn high-frequency information. Inspired by the frequency principle, a series of works has been devoted to developing algorithms for overcoming the high-frequency curse. A natural question arises: what is the upper limit of the decaying rate w.r.t. frequency when one trains a DNN? In this work, our theory, confirmed by numerical experiments, suggests that there is a critical decaying rate w.r.t. frequency in DNN training. Below the upper limit of the decaying rate, the DNN interpolates the training data by a function with a certain regularity. Above the upper limit, however, the DNN interpolates the training data by a trivial function, i.e., a function that is non-zero only at the training data points. Our results indicate that a better way to overcome the high-frequency curse is to design a proper pre-conditioning approach that shifts high-frequency information to low frequencies, which coincides with several previously developed algorithms for fast learning of high-frequency information. More importantly, this work rigorously proves that the high-frequency curse is an intrinsic difficulty of DNNs., Comment: arXiv admin note: substantial text overlap with arXiv:2012.03238
- Published
- 2021
38. Towards Understanding the Condensation of Neural Networks at Initial Training
- Author
Zhou, Hanxu, Zhou, Qixuan, Luo, Tao, Zhang, Yaoyu, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence, 68T07
- Abstract
Empirical works show that for ReLU neural networks (NNs) with small initialization, the input weights of hidden neurons (the input weight of a hidden neuron consists of the weight from its input layer to the hidden neuron and its bias term) condense on isolated orientations. The condensation dynamics implies that training implicitly regularizes an NN towards one with a much smaller effective size. In this work, we illustrate the formation of condensation in multi-layer fully connected NNs and show that the maximal number of condensed orientations in the initial training stage is twice the multiplicity of the activation function, where "multiplicity" refers to the multiplicity of the root of the activation function at the origin. Our theoretical analysis confirms experiments in two cases: one for activation functions of multiplicity one with arbitrary-dimensional input, which covers many common activation functions, and the other for layers with one-dimensional input and arbitrary multiplicity. This work makes a step towards understanding how small initialization leads NNs to condensation at the initial training stage.
- Published
- 2021
39. Deep mechanism reduction (DeePMR) method for fuel chemical kinetics
- Author
Wang, Zhiwei, Zhang, Yaoyu, Lin, Pengxiao, Zhao, Enhan, E, Weinan, Zhang, Tianhan, and Xu, Zhi-Qin John
- Published
- 2024
40. Linear Frequency Principle Model to Understand the Absence of Overfitting in Neural Networks
- Author
Zhang, Yaoyu, Luo, Tao, Ma, Zheng, and Xu, Zhi-Qin John
- Subjects
Computer Science - Machine Learning, Physics - Data Analysis, Statistics and Probability
- Abstract
Why heavily parameterized neural networks (NNs) do not overfit the data is an important long-standing open question. We propose a phenomenological model of NN training to explain this non-overfitting puzzle. Our linear frequency principle (LFP) model accounts for a key dynamical feature of NNs: they learn low frequencies first, irrespective of microscopic details. Theory based on our LFP model shows that low-frequency dominance of target functions is the key condition for the non-overfitting of NNs, which is verified by experiments. Furthermore, through an ideal two-layer NN, we unravel how detailed microscopic NN training dynamics statistically give rise to an LFP model with quantitative prediction power., Comment: to appear in Chinese Physics Letters
- Published
- 2021
- Full Text
- View/download PDF
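The LFP picture can be caricatured with a toy mode-wise dynamics. In the sketch below (the rate function and the two spectra are my assumptions), each Fourier mode relaxes independently toward its target value at a frequency-dependent rate, so a low-frequency-dominant target is fit far earlier than a high-frequency-dominant one, which is the non-overfitting condition the abstract identifies.

```python
# Minimal sketch of an LFP-style model (rate function is my assumption):
# each Fourier mode relaxes to its target independently, with a rate that
# decays in frequency, so low frequencies are learned first.
import numpy as np

k = np.arange(1, 51)                      # frequencies
gamma = 1.0 / (1.0 + k) ** 4              # assumed per-mode learning rate
f_low = 1.0 / k ** 2                      # low-frequency-dominant spectrum
f_high = np.flip(f_low)                   # high-frequency-dominant spectrum

for name, f in [("low-freq dominant", f_low), ("high-freq dominant", f_high)]:
    for t in (1e2, 1e6):
        u = f * (1 - np.exp(-gamma * t))  # closed-form mode-wise dynamics
        err = np.linalg.norm(u - f) / np.linalg.norm(f)
        print(f"{name}, t={t:.0e}: relative error {err:.3f}")
```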
41. Frequency Principle in Deep Learning Beyond Gradient-descent-based Training
- Author
-
Ma, Yuheng, Xu, Zhi-Qin John, and Zhang, Jiwei
- Subjects
Computer Science - Machine Learning - Abstract
The frequency perspective has recently made progress in understanding deep learning. It has been widely verified in both empirical and theoretical studies that deep neural networks (DNNs) often fit the target function from low to high frequency, a behavior called the Frequency Principle (F-Principle). The F-Principle sheds light on the strengths and weaknesses of DNNs and has inspired a series of subsequent works, including theoretical studies, empirical studies, and the design of efficient DNN structures. Previous works examine the F-Principle in gradient-descent-based training; it remains unclear whether gradient-descent-based training is a necessary condition for the F-Principle. In this paper, we show that the F-Principle exists stably in the training of DNNs under non-gradient-descent-based methods, including optimization algorithms that use gradient information, such as conjugate gradient and BFGS, and algorithms that do not, such as Powell's method and Particle Swarm Optimization. These empirical studies show the universality of the F-Principle and provide hints for further study of the F-Principle.
- Published
- 2021
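A minimal way to reproduce this kind of experiment (network size, target, and diagnostics are my choices): optimize a tiny network with SciPy's derivative-free Powell method and track how quickly the residual at a low versus a high target frequency shrinks.

```python
# Minimal sketch (architecture and target are my choices): train a tiny network
# with Powell's method (no gradients) and check which frequency is fit first.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + np.sin(2 * np.pi * 10 * x)  # low + high frequency

H = 20  # hidden width of a 1-H-1 tanh network, parameters packed in one vector
def net(p, x):
    w1, b1, w2 = p[:H], p[H:2*H], p[2*H:3*H]
    return np.tanh(np.outer(x, w1) + b1) @ w2

def loss(p):
    return np.mean((net(p, x) - y) ** 2)

def freq_errors(p):
    r = net(p, x) - y
    # Projection of the residual onto the two target frequencies.
    return [abs(r @ np.sin(2 * np.pi * k * x)) / len(x) for k in (1, 10)]

history = []
minimize(loss, 0.1 * rng.standard_normal(3 * H), method="Powell",
         callback=lambda p: history.append(freq_errors(p)),
         options={"maxiter": 50})
for i, (e1, e10) in enumerate(history[:10]):
    print(f"iter {i}: residual at k=1: {e1:.3f}, at k=10: {e10:.3f}")
```

If the F-Principle holds here, the residual at k=1 should shrink well before the residual at k=10.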
42. Fourier-domain Variational Formulation and Its Well-posedness for Supervised Learning
- Author
-
Luo, Tao, Ma, Zheng, Wang, Zhiwei, Xu, Zhi-Qin John, and Zhang, Yaoyu
- Subjects
Mathematics - Numerical Analysis ,Computer Science - Machine Learning ,65K10(Primary), 65N20(Secondary) - Abstract
A supervised learning problem is to find a function in a hypothesis function space given its values on isolated data points. Inspired by the frequency principle of neural networks, we propose a Fourier-domain variational formulation for supervised learning problems. This formulation circumvents the difficulty of imposing constraints at isolated data points in continuum modelling. Within our unified framework, we establish the well-posedness of the Fourier-domain variational problem under a necessary and sufficient condition, by identifying a critical exponent depending on the data dimension. In practice, a neural network is a convenient way to implement our formulation, and it automatically satisfies the well-posedness condition., Comment: 18 pages, 5 figures
- Published
- 2020
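A discrete analogue of such a formulation can be written down directly (the weight function and its exponent are my assumptions): minimize a weighted Fourier norm $\sum_k w(k)\,|\hat{u}(k)|^2$ subject to exactly matching the labels at the isolated sample points, which has a closed-form least-weighted-norm solution via Lagrange multipliers.

```python
# Minimal sketch of a discrete analogue (weight function is my assumption):
# minimize sum_k w(k)|u_hat(k)|^2 subject to exactly matching the labels at
# isolated sample points, solved in closed form via Lagrange multipliers.
import numpy as np

N = 256                                   # grid size for the reconstruction
grid = np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)          # integer frequencies
s = 2.0                                   # regularity exponent (assumption)
w = (1.0 + np.abs(k) ** 2) ** s           # weight: penalize high frequencies

idx = np.sort(np.random.default_rng(0).choice(N, size=12, replace=False))
y = np.sin(2 * np.pi * grid[idx])         # labels at isolated points

# A[i, j] = exp(2*pi*i*k_j*x_i)/N maps Fourier coefficients to point values.
A = np.exp(2j * np.pi * np.outer(grid[idx], k)) / N
Winv = 1.0 / w
# Least-weighted-norm interpolant: u_hat = W^{-1} A^H (A W^{-1} A^H)^{-1} y.
G = (A * Winv) @ A.conj().T
u_hat = Winv * (A.conj().T @ np.linalg.solve(G, y))
u = np.real(np.fft.ifft(u_hat))           # recover grid values
print("max constraint violation:", np.abs(u[idx] - y).max())
```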
43. On the exact computation of linear frequency principle dynamics and its generalization
- Author
-
Luo, Tao, Ma, Zheng, Xu, Zhi-Qin John, and Zhang, Yaoyu
- Subjects
Computer Science - Machine Learning - Abstract
Recent works show an intriguing phenomenon, the Frequency Principle (F-Principle), that deep neural networks (DNNs) fit the target function from low to high frequency during training, which provides insight into the training and generalization behavior of DNNs in complex tasks. In this paper, through analysis of an infinite-width two-layer NN in the neural tangent kernel (NTK) regime, we derive the exact differential equation, namely the Linear Frequency-Principle (LFP) model, governing the evolution of the NN output function in the frequency domain during training. Our exact computation applies to general activation functions, with no assumptions on the size or distribution of the training data. This LFP model unravels that higher frequencies evolve polynomially or exponentially slower than lower frequencies, depending on the smoothness/regularity of the activation function. We further bridge the gap between training dynamics and generalization by proving that the LFP model implicitly minimizes a Frequency-Principle norm (FP-norm) of the learned function, by which higher frequencies are penalized in proportion to the inverse of their evolution rate. Finally, we derive an \textit{a priori} generalization error bound controlled by the FP-norm of the target function, which provides a theoretical justification for the empirical observation that DNNs often generalize well for low-frequency functions., Comment: Authors are listed alphabetically. arXiv admin note: substantial text overlap with arXiv:1905.10264
- Published
- 2020
44. A multi-scale DNN algorithm for nonlinear elliptic equations with multiple scales
- Author
-
Li, Xi-An, Xu, Zhi-Qin John, and Zhang, Lei
- Subjects
Physics - Computational Physics ,Mathematics - Analysis of PDEs - Abstract
Algorithms based on deep neural networks (DNNs) have attracted increasing attention from the scientific computing community. DNN-based algorithms are easy to implement, natural for nonlinear problems, and have shown great potential for overcoming the curse of dimensionality. In this work, we utilize the multi-scale DNN algorithm (MscaleDNN) proposed by Liu, Cai and Xu (2020) to solve multi-scale elliptic problems with possible nonlinearity, for example, the p-Laplacian problem. We improve the MscaleDNN algorithm by using a smooth and localized activation function. Several numerical examples of multi-scale elliptic problems with separable or non-separable scales, in low-dimensional and high-dimensional Euclidean spaces, are used to demonstrate the effectiveness and accuracy of the MscaleDNN numerical scheme., Comment: 23 pages, 11 figures
- Published
- 2020
- Full Text
- View/download PDF
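For the p-Laplacian example mentioned above, a deep-Ritz-style loss is one natural way to set up a DNN solver. The sketch below is not the paper's MscaleDNN (it uses a plain MLP in 1D, with a source term and penalty weight of my choosing); it only shows the variational loss $E(u)=\int \frac{1}{p}|u'|^p - fu$ with a boundary penalty.

```python
# Minimal sketch (plain MLP, 1D, my choices; not the paper's MscaleDNN):
# a deep-Ritz-style loss for the p-Laplacian -(|u'|^{p-2} u')' = f on (0,1).
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 4.0
f = lambda x: torch.ones_like(x)          # simple source term (assumption)

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(3000):
    x = torch.rand(512, 1, requires_grad=True)   # Monte Carlo points in (0,1)
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    energy = ((du.abs() ** p) / p - f(x) * u).mean()
    xb = torch.tensor([[0.0], [1.0]])
    bc = (net(xb) ** 2).mean()                   # penalize u(0), u(1) != 0
    loss = energy + 100.0 * bc
    opt.zero_grad(); loss.backward(); opt.step()
print("final energy estimate:", energy.item())
```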
45. A regularized deep matrix factorized model of matrix completion for image restoration
- Author
-
Li, Zhemin, Xu, Zhi-Qin John, Luo, Tao, and Wang, Hongxia
- Subjects
Computer Science - Machine Learning ,Computer Science - Computer Vision and Pattern Recognition ,Statistics - Machine Learning - Abstract
Matrix completion is an important approach to image restoration. Most previous works on matrix completion focus on the low-rank property by imposing explicit constraints on the recovered matrix, such as a nuclear-norm constraint or a limit on the dimension of the matrix factorization components. Recently, theoretical works suggest that deep linear neural networks have an implicit bias towards low rank in matrix completion. However, low rank alone does not adequately reflect the intrinsic characteristics of a natural image, so algorithms constrained only by low rank are insufficient for good image restoration. In this work, we propose a Regularized Deep Matrix Factorized (RDMF) model for image restoration, which combines the implicit low-rank bias of deep neural networks with the explicit bias of total variation. We demonstrate the effectiveness of our RDMF model with extensive experiments, in which our method surpasses state-of-the-art models on common examples, especially for restoration from very few observations. Our work sheds light on a more general framework for solving other inverse problems by combining the implicit bias of deep learning with explicit regularization.
- Published
- 2020
- Full Text
- View/download PDF
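A minimal sketch of the idea (factor depth, sizes, initialization scale, and penalty weights are my assumptions): a deep factorization $M = ABC$ is fit on the observed entries only, with an explicit total-variation penalty added to the masked data-fitting loss.

```python
# Minimal sketch in the spirit of the RDMF model (all scales are my choices):
# deep matrix factorization M = A @ B @ C fitted on observed entries, plus an
# explicit total-variation penalty.
import torch

torch.manual_seed(0)
n, r = 64, 64
truth = torch.outer(torch.linspace(0, 1, n), torch.linspace(1, 0, n))  # smooth image
mask = (torch.rand(n, n) < 0.2).float()                                # 20% observed

A = (0.1 * torch.randn(n, r)).requires_grad_(True)
B = (0.1 * torch.randn(r, r)).requires_grad_(True)
C = (0.1 * torch.randn(r, n)).requires_grad_(True)

def tv(M):  # anisotropic total variation
    return (M[1:, :] - M[:-1, :]).abs().mean() + (M[:, 1:] - M[:, :-1]).abs().mean()

opt = torch.optim.Adam([A, B, C], lr=1e-3)
for _ in range(5000):
    M = A @ B @ C
    loss = ((M - truth) ** 2 * mask).sum() / mask.sum() + 0.01 * tv(M)
    opt.zero_grad(); loss.backward(); opt.step()
print("RMSE on unobserved entries:",
      (M.detach() - truth)[mask == 0].pow(2).mean().sqrt().item())
```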
46. Deep frequency principle towards understanding why deeper learning is faster
- Author
-
Xu, Zhi-Qin John and Zhou, Hanxu
- Subjects
Computer Science - Machine Learning ,Statistics - Machine Learning - Abstract
Understanding the effect of depth in deep learning is a critical problem. In this work, we utilize Fourier analysis to empirically provide a promising mechanism for understanding why deeper feedforward learning is faster. To this end, during analysis we separate a deep neural network, trained by ordinary stochastic gradient descent, into two parts: a pre-conditioning component and a learning component, where the output of the former is the input of the latter. We use a filtering method to characterize the frequency distribution of a high-dimensional function. Based on experiments with deep networks and real datasets, we propose a deep frequency principle: the effective target function for a deeper hidden layer biases towards lower frequency during training. Therefore, the learning component effectively learns a lower-frequency function when the pre-conditioning component has more layers. By the well-studied frequency principle, i.e., that deep neural networks learn lower-frequency functions faster, the deep frequency principle provides a reasonable explanation for why deeper learning is faster. We believe these empirical studies will be valuable for future theoretical study of the effect of depth in deep learning.
- Published
- 2020
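The filtering characterization mentioned above can be illustrated in one dimension (the Gaussian filter and its width are my choices; the paper works with high-dimensional functions): split a sampled function into low- and high-frequency parts in the Fourier domain and report the fraction of energy in the low-frequency part.

```python
# Minimal sketch of a filtering characterization (cutoff is my choice): split
# a sampled function with a Fourier-domain Gaussian low-pass filter and report
# the low-frequency energy ratio.
import numpy as np

def low_freq_ratio(y, sigma=5.0):
    yhat = np.fft.fft(y)
    k = np.fft.fftfreq(len(y), d=1.0 / len(y))
    low = np.fft.ifft(yhat * np.exp(-k ** 2 / (2 * sigma ** 2)))
    return np.sum(np.abs(low) ** 2) / np.sum(np.abs(y) ** 2)

x = np.linspace(0, 1, 512, endpoint=False)
print(low_freq_ratio(np.sin(2 * np.pi * x)))        # ~1: purely low frequency
print(low_freq_ratio(np.sin(2 * np.pi * 40 * x)))   # ~0: purely high frequency
```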
47. Multi-scale Deep Neural Network (MscaleDNN) for Solving Poisson-Boltzmann Equation in Complex Domains
- Author
-
Liu, Ziqi, Cai, Wei, and Xu, Zhi-Qin John
- Subjects
Physics - Computational Physics ,Computer Science - Machine Learning ,Mathematics - Numerical Analysis - Abstract
In this paper, we propose multi-scale deep neural networks (MscaleDNNs) using the idea of radial scaling in the frequency domain and activation functions with compact support. The radial scaling converts the problem of approximating the high-frequency contents of PDE solutions into a problem of learning lower-frequency functions, and the compact-support activation functions facilitate the separation of the frequency contents of the target function so that each part can be approximated by a corresponding DNN. As a result, the MscaleDNNs achieve fast, uniform convergence over multiple scales. The proposed MscaleDNNs are shown to be superior to traditional fully connected DNNs and to serve as an effective mesh-less numerical method for Poisson-Boltzmann equations with ample frequency contents over complex and singular domains.
- Published
- 2020
- Full Text
- View/download PDF
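A minimal architectural sketch in the spirit of MscaleDNN (the specific scales, widths, and the compact-support activation sReLU(x) = ReLU(x)·ReLU(1−x) are assumptions based on the description above): parallel subnetworks receive radially scaled copies of the input and their outputs are summed, so each subnetwork only needs to represent a low-frequency function of its own scaled variable.

```python
# Minimal sketch of a multi-scale network in the spirit of MscaleDNN
# (scales, widths, and the sReLU activation are assumptions).
import torch
import torch.nn as nn

class SReLU(nn.Module):
    def forward(self, x):
        return torch.relu(x) * torch.relu(1 - x)  # compact support on [0, 1]

class MscaleDNN(nn.Module):
    def __init__(self, dim_in, width=64, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.subnets = nn.ModuleList([
            nn.Sequential(nn.Linear(dim_in, width), SReLU(),
                          nn.Linear(width, width), SReLU(),
                          nn.Linear(width, 1))
            for _ in scales])

    def forward(self, x):
        # Each subnet sees a*x, so a low-frequency function of its scaled
        # input contributes frequency content roughly a times higher in x.
        return sum(net(a * x) for a, net in zip(self.scales, self.subnets))

model = MscaleDNN(dim_in=2)
print(model(torch.rand(5, 2)).shape)  # torch.Size([5, 1])
```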
48. Phase diagram for two-layer ReLU neural networks at infinite-width limit
- Author
-
Luo, Tao, Xu, Zhi-Qin John, Ma, Zheng, and Zhang, Yaoyu
- Subjects
Computer Science - Machine Learning ,Statistics - Machine Learning - Abstract
How a neural network behaves during training under different choices of hyperparameters is an important question in the study of neural networks. In this work, inspired by the phase diagram in statistical mechanics, we draw the phase diagram for the two-layer ReLU neural network at the infinite-width limit, giving a complete characterization of its dynamical regimes and their dependence on the hyperparameters related to initialization. Through both experimental and theoretical approaches, we identify three regimes in the phase diagram, i.e., the linear regime, the critical regime, and the condensed regime, based on the relative change of input weights as the width approaches infinity, which tends to $0$, $O(1)$ and $+\infty$, respectively. In the linear regime, the NN training dynamics is approximately linear, similar to a random feature model, with an exponential loss decay. In the condensed regime, we demonstrate through experiments that active neurons condense at several discrete orientations. The critical regime serves as the boundary between the above two regimes and exhibits an intermediate nonlinear behavior, with the mean-field model as a typical example. Overall, our phase diagram for the two-layer ReLU NN serves as a map for future studies and is a first step towards a more systematic investigation of the training behavior and implicit regularization of NNs of different structures.
- Published
- 2020
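The diagnostic quantity in this abstract, the relative change of input weights, is simple to measure. The sketch below (width, optimizer, scaling, and task are my choices; the paper's theory concerns gradient flow with specific width scalings, so this is only suggestive) contrasts a large-initialization run, where input weights barely move, with a small-initialization run, where they change drastically.

```python
# Minimal sketch (my choices throughout): track the relative change of input
# weights after training for two initialization scales, the quantity used in
# the abstract to separate the linear and condensed regimes.
import torch

torch.manual_seed(0)
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = torch.sin(3 * x)

for init_scale, name in [(1.0, "large init (toward linear regime)"),
                         (1e-3, "small init (toward condensed regime)")]:
    m = 1000  # width
    W = (init_scale * torch.randn(m, 1) / m ** 0.5).requires_grad_(True)
    b = (init_scale * torch.randn(m) / m ** 0.5).requires_grad_(True)
    a = (init_scale * torch.randn(m) / m ** 0.5).requires_grad_(True)
    W0 = W.detach().clone()
    opt = torch.optim.Adam([W, b, a], lr=1e-3)
    for _ in range(2000):
        pred = torch.relu(x @ W.t() + b) @ a
        loss = ((pred.unsqueeze(1) - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    rel = ((W.detach() - W0).norm() / W0.norm()).item()
    print(f"{name}: relative change of input weights = {rel:.2f}")
```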
49. Implicit bias with Ritz-Galerkin method in understanding deep learning for solving PDEs
- Author
-
Wang, Jihong, Xu, Zhi-Qin John, Zhang, Jiwei, and Zhang, Yaoyu
- Subjects
Mathematics - Numerical Analysis ,35Q68, 65N30, 65N35 - Abstract
This paper studies the difference between the Ritz-Galerkin (R-G) method and the deep neural network (DNN) method in solving partial differential equations (PDEs), to better understand deep learning. To this end, we consider solving a particular Poisson problem in which the right-hand side f is known only at n sample points. Through both theoretical and numerical studies, we show that the solution of the R-G method converges to a piecewise linear function for the one-dimensional (1D) problem, or to functions of lower regularity for higher-dimensional problems. With the same setting, DNNs instead learn a relatively smooth solution regardless of the dimension; that is, DNNs are implicitly biased towards functions with more low-frequency components among all functions that fit the equation at the available data points. This bias is explained by the recent study of the frequency principle (Xu et al. (2019) [17]; Zhang et al. (2019) [11, 19]). In addition to the similarity between traditional numerical methods and DNNs from the approximation perspective, our work shows that the implicit bias in the learning process, which differs from that of traditional numerical methods, could help us better understand the characteristics of DNNs., Comment: 16 pages, 34 figures
- Published
- 2020
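The 1D Ritz-Galerkin side of the comparison is short enough to sketch (mesh, quadrature, and f are my choices): with hat-function basis elements, the R-G solution of $-u''=f$ is piecewise linear by construction, consistent with the convergence behavior stated above.

```python
# Minimal sketch of the 1D Ritz-Galerkin side (mesh and f are my choices):
# solve -u'' = f on (0,1), u(0)=u(1)=0, with a hat-function basis; the
# solution is piecewise linear by construction.
import numpy as np

n = 16                                    # number of sample points / interior nodes
h = 1.0 / (n + 1)
xs = np.linspace(h, 1 - h, n)
f = np.sin(np.pi * xs)                    # f known only at the sample points

# Stiffness matrix for hat functions: tridiagonal (2, -1) / h.
K = (np.diag(2 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1)) / h
rhs = f * h                               # lumped quadrature of int f * phi_i
u = np.linalg.solve(K, rhs)
print("max nodal error vs exact sin(pi x)/pi^2:",
      np.abs(u - np.sin(np.pi * xs) / np.pi ** 2).max())
```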
50. Subspace decomposition based DNN algorithm for elliptic type multi-scale PDEs
- Author
-
Li, Xi-An, Xu, Zhi-Qin John, and Zhang, Lei
- Published
- 2023
- Full Text
- View/download PDF