Author: "Hoffer, Elad" - Searchworks@Jio Institute Digital Library Search Results

30 results on '"Hoffer, Elad"'

1. DropCompute: simple and more robust distributed synchronous training via compute variance reduction

Author: Giladi, Niv, Gottlieb, Shahar, Shkolnik, Moran, Karnieli, Asaf, Banner, Ron, Hoffer, Elad, Levy, Kfir Yehuda, and Soudry, Daniel
Subjects: Computer Science - Machine Learning
Abstract: Background: Distributed training is essential for large scale training of deep neural networks (DNNs). The dominant methods for large scale DNN training are synchronous (e.g. All-Reduce), but these require waiting for all workers in each step. Thus, these methods are limited by the delays caused by straggling workers. Results: We study a typical scenario in which workers are straggling due to variability in compute time. We find an analytical relation between compute time properties and scalability limitations, caused by such straggling workers. With these findings, we propose a simple yet effective decentralized method to reduce the variation among workers and thus improve the robustness of synchronous training. This method can be integrated with the widely used All-Reduce. Our findings are validated on large-scale training tasks using 200 Gaudi Accelerators. Comment: https://github.com/paper-submissions/dropcompute
Published: 2023

2. Energy awareness in low precision neural networks

Author: Eliezer, Nurit Spingarn, Banner, Ron, Hoffer, Elad, Ben-Yaakov, Hilla, and Michaeli, Tomer
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Power consumption is a major obstacle in the deployment of deep neural networks (DNNs) on end devices. Existing approaches for reducing power consumption rely on quite general principles, including avoidance of multiplication operations and aggressive quantization of weights and activations. However, these methods do not take into account the precise power consumed by each module in the network, and are therefore not optimal. In this paper we develop accurate power consumption models for all arithmetic operations in the DNN, under various working conditions. We reveal several important factors that have been overlooked to date. Based on our analysis, we present PANN (power-aware neural network), a simple approach for approximating any full-precision network by a low-power fixed-precision variant. Our method can be applied to a pre-trained network, and can also be used during training to achieve improved performance. In contrast to previous methods, PANN incurs only a minor degradation in accuracy w.r.t. the full-precision version of the network, even when working at the power-budget of a 2-bit quantized variant. In addition, our scheme makes it possible to seamlessly traverse the power-accuracy trade-off at deployment time, which is a major advantage over existing quantization methods that are constrained to specific bit widths.
Published: 2022

3. Accurate Neural Training with 4-bit Matrix Multiplications at Standard Formats

Author: Chmiel, Brian, Banner, Ron, Hoffer, Elad, Yaacov, Hilla Ben, and Soudry, Daniel
Subjects: Computer Science - Machine Learning
Abstract: Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Networks (DNNs) training. Current methods enable 4-bit quantization of the forward phase. However, this constitutes only a third of the training process. Reducing the computational footprint of the entire training process requires the quantization of the neural gradients, i.e., the loss gradients with respect to the outputs of intermediate neural layers. Previous works separately showed that accurate 4-bit quantization of the neural gradients needs to (1) be unbiased and (2) have a log scale. However, no previous work aimed to combine both ideas, as we do in this work. Specifically, we examine the importance of having unbiased quantization in quantized neural network training, where to maintain it, and how to combine it with logarithmic quantization. Based on this, we suggest a logarithmic unbiased quantization (LUQ) method to quantize both the forward and backward phases to 4-bit, achieving state-of-the-art results in 4-bit training without the overhead. For example, in ResNet50 on ImageNet, we achieved a degradation of 1.1%. We further improve this to a degradation of only 0.32% after three epochs of high precision fine-tuning, combined with a variance reduction method -- where both these methods add overhead comparable to previously suggested methods.
Published: 2021
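
Illustration for record 3: the abstract asks for gradient quantization that is both unbiased and on a log scale. A minimal sketch of that combination is stochastic rounding between adjacent powers of two, with the round-up probability chosen so that the expected value equals the input. The function name `log_stochastic_round` is hypothetical, and this is only a sketch of the general idea, not the paper's LUQ method, which involves further details the abstract does not give.

```python
import math
import random


def log_stochastic_round(x, rng):
    """Round x to a signed power of two such that E[result] == x.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if x == 0.0:
        return 0.0
    sign = 1.0 if x > 0 else -1.0
    mag = abs(x)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so 2**(e-1) <= mag < 2**e.
    _, e = math.frexp(mag)
    lower = math.ldexp(1.0, e - 1)
    # The gap to the next power of two equals `lower`, so this probability
    # makes lower*(1 - p_up) + 2*lower*p_up equal to mag.
    p_up = (mag - lower) / lower
    rounded = 2.0 * lower if rng.random() < p_up else lower
    return sign * rounded


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [log_stochastic_round(3.0, rng) for _ in range(20000)]
    print(sorted(set(draws)), sum(draws) / len(draws))
```

Each draw is either 2.0 or 4.0, and the sample mean stays close to 3.0, which is the unbiasedness property the abstract refers to.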

4. Power Awareness in Low Precision Neural Networks

Author: Eliezer, Nurit Spingarn, Banner, Ron, Ben-Yaakov, Hilla, Hoffer, Elad, Michaeli, Tomer, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Karlinsky, Leonid, editor, Michaeli, Tomer, editor, and Nishino, Ko, editor
Published: 2023

5. Task Agnostic Continual Learning Using Online Variational Bayes with Fixed-Point Updates

Author: Zeno, Chen, Golan, Itay, Hoffer, Elad, and Soudry, Daniel
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Background: Catastrophic forgetting is the notorious vulnerability of neural networks to the changes in the data distribution during learning. This phenomenon has long been considered a major obstacle for using learning agents in realistic continual learning settings. A large body of continual learning research assumes that task boundaries are known during training. However, only a few works consider scenarios in which task boundaries are unknown or not well defined -- task agnostic scenarios. The optimal Bayesian solution for this requires an intractable online Bayes update to the weights posterior. Contributions: We aim to approximate the online Bayes update as accurately as possible. To do so, we derive novel fixed-point equations for the online variational Bayes optimization problem, for multivariate Gaussian parametric distributions. By iterating the posterior through these fixed-point equations, we obtain an algorithm (FOO-VB) for continual learning which can handle non-stationary data distribution using a fixed architecture and without using external memory (i.e. without access to previous data). We demonstrate that our method (FOO-VB) outperforms existing methods in task agnostic scenarios. FOO-VB PyTorch implementation will be available online. Comment: The arXiv paper "Task Agnostic Continual Learning Using Online Variational Bayes" is a preliminary pre-print of this paper. The main differences between the versions are: 1. We develop a new algorithmic framework (FOO-VB). 2. We add multivariate Gaussian and matrix variate Gaussian versions of the algorithm. 3. We demonstrate the new algorithm's performance in task agnostic scenarios.
Published: 2020

6. Neural gradients are near-lognormal: improved quantized and sparse training

Author: Chmiel, Brian, Ben-Uri, Liad, Shkolnik, Moran, Hoffer, Elad, Banner, Ron, and Soudry, Daniel
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: While training can mostly be accelerated by reducing the time needed to propagate neural gradients back throughout the model, most previous works focus on the quantization/pruning of weights and activations. These methods are often not applicable to neural gradients, which have very different statistical properties. Distinguished from weights and activations, we find that the distribution of neural gradients is approximately lognormal. Considering this, we suggest two closed-form analytical methods to reduce the computational and memory burdens of neural gradients. The first method optimizes the floating-point format and scale of the gradients. The second method accurately sets sparsity thresholds for gradient pruning. Each method achieves state-of-the-art results on ImageNet. To the best of our knowledge, this paper is the first to (1) quantize the gradients to 6-bit floating-point formats, or (2) achieve up to 85% gradient sparsity -- in each case without accuracy degradation. Reference implementation accompanies the paper.
Published: 2020

7. The Knowledge Within: Methods for Data-Free Model Compression

Author: Haroush, Matan, Hubara, Itay, Hoffer, Elad, and Soudry, Daniel
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Recently, an extensive amount of research has been focused on compressing and accelerating Deep Neural Networks (DNN). So far, high compression rate algorithms require part of the training dataset for a low precision calibration, or a fine-tuning process. However, this requirement is unacceptable when the data is unavailable or contains sensitive information, as in medical and biometric use-cases. We present three methods for generating synthetic samples from trained models. Then, we demonstrate how these samples can be used to calibrate and fine-tune quantized models without using any real data in the process. Our best performing method has a negligible accuracy degradation compared to the original training set. This method, which leverages intrinsic batch normalization layers' statistics of the trained model, can be used to evaluate data similarity. Our approach opens a path towards genuine data-free model compression, alleviating the need for training data during model deployment.
Published: 2019

8. At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?

Author: Giladi, Niv, Nacson, Mor Shpigel, Hoffer, Elad, and Soudry, Daniel
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Background: Recent developments have made it possible to accelerate neural networks training significantly using large batch sizes and data parallelism. Training in an asynchronous fashion, where delay occurs, can make training even more scalable. However, asynchronous training has its pitfalls, mainly a degradation in generalization, even after convergence of the algorithm. This gap remains not well understood, as theoretical analysis so far mainly focused on the convergence rate of asynchronous methods. Contributions: We examine asynchronous training from the perspective of dynamical stability. We find that the degree of delay interacts with the learning rate, to change the set of minima accessible by an asynchronous stochastic gradient descent algorithm. We derive closed-form rules on how the learning rate could be changed, while keeping the accessible set the same. Specifically, for high delay values, we find that the learning rate should be kept inversely proportional to the delay. We then extend this analysis to include momentum. We find momentum should be either turned off, or modified to improve training stability. We provide empirical experiments to validate our theoretical findings. Comment: ICLR 2020 Camera ready version
Published: 2019
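
Illustration for record 8: the abstract states that, for high delay values, the learning rate should be kept inversely proportional to the delay. A minimal sketch of that single rule follows. The function name and the reference pair (`base_lr`, `base_delay`) are assumptions introduced here; the paper's closed-form rules, including the low-delay regime and momentum, are not reproduced.

```python
def scaled_learning_rate(base_lr, base_delay, delay):
    """Keep lr * delay constant relative to a reference (base_lr, base_delay).

    Hypothetical helper; only meaningful in the high-delay regime that the
    abstract describes.
    """
    if base_delay <= 0 or delay <= 0:
        raise ValueError("delays must be positive")
    return base_lr * base_delay / delay


if __name__ == "__main__":
    # Doubling the delay halves the learning rate under this rule.
    print(scaled_learning_rate(0.1, base_delay=4, delay=8))
```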

9. Mix & Match: training convnets with mixed image sizes for improved accuracy, speed and scale resiliency

Author: Hoffer, Elad, Weinstein, Berry, Hubara, Itay, Ben-Nun, Tal, Hoefler, Torsten, and Soudry, Daniel
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Convolutional neural networks (CNNs) are commonly trained using a fixed spatial image size predetermined for a given model. Although trained on images of a specific size, it is well established that CNNs can be used to evaluate a wide range of image sizes at test time, by adjusting the size of intermediate feature maps. In this work, we describe and evaluate a novel mixed-size training regime that mixes several image sizes at training time. We demonstrate that models trained using our method are more resilient to image size changes and generalize well even on small images. This allows faster inference by using smaller images at test time. For instance, we receive a 76.43% top-1 accuracy using ResNet50 with an image size of 160, which matches the accuracy of the baseline model with 2x fewer computations. Furthermore, for a given image size used at test time, we show this method can be exploited either to accelerate training or to improve the final test accuracy. For example, we are able to reach a 79.27% accuracy with a model evaluated at a 288 spatial size for a relative improvement of 14% over the baseline.
Published: 2019

10. Augment your batch: better training with larger batches

Author: Hoffer, Elad, Ben-Nun, Tal, Hubara, Itay, Giladi, Niv, Hoefler, Torsten, and Soudry, Daniel
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Large-batch SGD is important for scaling training of deep neural networks. However, without fine-tuning hyperparameter schedules, the generalization of the model may be hampered. We propose to use batch augmentation: replicating instances of samples within the same batch with different data augmentations. Batch augmentation acts as a regularizer and an accelerator, increasing both generalization and performance scaling. We analyze the effect of batch augmentation on gradient variance and show that it empirically improves convergence for a wide variety of deep neural networks and datasets. Our results show that batch augmentation reduces the number of necessary SGD updates to achieve the same accuracy as the state-of-the-art. Overall, this simple yet effective method enables faster training and better generalization by allowing more computational resources to be used concurrently.
Published: 2019
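
Illustration for record 10: the abstract describes batch augmentation as replicating instances of samples within the same batch with different data augmentations. A minimal sketch of that batch construction is shown here. The names `augment_batch` and `jitter`, and the use of scalar samples with additive noise, are assumptions made for illustration only.

```python
import random


def augment_batch(batch, augment, copies, rng):
    """Return a batch where each sample appears `copies` times,
    each copy produced by an independent call to `augment`."""
    return [augment(sample, rng) for sample in batch for _ in range(copies)]


def jitter(sample, rng):
    # Hypothetical augmentation: add small uniform noise to a scalar sample.
    return sample + rng.uniform(-0.1, 0.1)


if __name__ == "__main__":
    rng = random.Random(0)
    # Two original samples become six entries, three variants of each.
    print(augment_batch([1.0, 2.0], jitter, copies=3, rng=rng))
```

The enlarged batch would then be fed to a single SGD step, so the number of updates stays the same while each update sees several augmented views of every sample.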

11. Post-training 4-bit quantization of convolution networks for rapid-deployment

Author: Banner, Ron, Nahshan, Yury, Hoffer, Elad, and Soudry, Daniel
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Convolutional neural networks require significant memory bandwidth and storage for intermediate computations, apart from substantial computing resources. Neural network quantization has significant benefits in reducing the amount of intermediate results, but it often requires the full datasets and time-consuming fine tuning to recover the accuracy lost after quantization. This paper introduces the first practical 4-bit post training quantization approach: it does not involve training the quantized model (fine-tuning), nor does it require the availability of the full dataset. We target the quantization of both activations and weights and suggest three complementary methods for minimizing quantization error at the tensor level, two of which obtain a closed-form analytical solution. Combining these methods, our approach achieves accuracy that is just a few percent less than the state-of-the-art baseline across a wide range of convolutional models. The source code to replicate all experiments is available on GitHub: https://github.com/submission2019/cnn-quantization.
Published: 2018

12. Scalable Methods for 8-bit Training of Neural Networks

Author: Banner, Ron, Hubara, Itay, Hoffer, Elad, and Soudry, Daniel
Subjects: Computer Science - Learning, Statistics - Machine Learning
Abstract: Quantized Neural Networks (QNNs) are often used to improve network efficiency during the inference phase, i.e. after the network has been trained. Extensive research in the field suggests many different quantization schemes. Still, the number of bits required, as well as the best quantization scheme, are yet unknown. Our theoretical analysis suggests that most of the training process is robust to substantial precision reduction, and points to only a few specific operations that require higher precision. Armed with this knowledge, we quantize the model parameters, activations and layer gradients to 8-bit, leaving at a higher precision only the final step in the computation of the weight gradients. Additionally, as QNNs require batch-normalization to be trained at high precision, we introduce Range Batch-Normalization (BN) which has significantly higher tolerance to quantization noise and improved computational complexity. Our simulations show that Range BN is equivalent to the traditional batch norm if a precise scale adjustment, which can be approximated analytically, is applied. To the best of the authors' knowledge, this work is the first to quantize the weights, activations, as well as a substantial volume of the gradients stream, in all layers (including batch normalization) to 8-bit while showing state-of-the-art results over the ImageNet-1K dataset.
Published: 2018

13. Task Agnostic Continual Learning Using Online Variational Bayes

Author: Zeno, Chen, Golan, Itay, Hoffer, Elad, and Soudry, Daniel
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Catastrophic forgetting is the notorious vulnerability of neural networks to the change of the data distribution while learning. This phenomenon has long been considered a major obstacle for allowing the use of learning agents in realistic continual learning settings. A large body of continual learning research assumes that task boundaries are known during training. However, research for scenarios in which task boundaries are unknown during training has been lacking. In this paper we present, for the first time, a method for preventing catastrophic forgetting (BGD) for scenarios with task boundaries that are unknown during training --- task-agnostic continual learning. Code of our algorithm is available at https://github.com/igolan/bgd.
Published: 2018

14. Norm matters: efficient and accurate normalization schemes in deep networks

Author: Hoffer, Elad, Banner, Ron, Golan, Itay, and Soudry, Daniel
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Over the past few years, Batch-Normalization has been commonly used in deep networks, allowing faster training and high performance for a wide variety of applications. However, the reasons behind its merits remained unanswered, with several shortcomings that hindered its use for certain tasks. In this work, we present a novel view on the purpose and function of normalization methods and weight-decay, as tools to decouple weights' norm from the underlying optimized objective. This property highlights the connection between practices such as normalization, weight decay and learning-rate adjustments. We suggest several alternatives to the widely used $L^2$ batch-norm, using normalization in $L^1$ and $L^\infty$ spaces that can substantially improve numerical stability in low-precision implementations as well as provide computational and memory benefits. We demonstrate that such methods enable the first batch-norm alternative to work for half-precision implementations. Finally, we suggest a modification to weight-normalization, which improves its performance on large-scale tasks. Comment: http://papers.nips.cc/paper/7485-norm-matters-efficient-and-accurate-normalization-schemes-in-deep-networks
Published: 2018

15. On the Blindspots of Convolutional Networks

Author: Hoffer, Elad, Fine, Shai, and Soudry, Daniel
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Deep convolutional networks have been the state-of-the-art approach for a wide variety of tasks over the last few years. Their successes have, in many cases, turned them into the default model in quite a few domains. In this work, we will demonstrate that convolutional networks have limitations that may, in some cases, hinder them from learning properties of the data, which are easily recognizable by traditional, less demanding, models. To this end, we present a series of competitive analysis studies on image recognition and text analysis tasks, for which convolutional networks are known to provide state-of-the-art results. In our studies, we inject a truth-revealing signal, indiscernible for the network, thus hitting time and again the network's blind spots. The signal does not impair the network's existing performance, but it does provide an opportunity for a significant performance boost by models that can capture it. The various forms of the carefully designed signals shed light on the strengths and weaknesses of convolutional networks, which may provide insights for both theoreticians who study the power of deep architectures, and for practitioners who consider applying convolutional networks to the task at hand.
Published: 2018

16. Fix your classifier: the marginal value of training the last weight layer

Author: Hoffer, Elad, Hubara, Itay, and Soudry, Daniel
Subjects: Computer Science - Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Neural networks are commonly used as models for classification for a wide variety of tasks. Typically, a learned affine transformation is placed at the end of such models, yielding a per-class value used for classification. This classifier can have a vast number of parameters, which grows linearly with the number of possible classes, thus requiring increasingly more resources. In this work we argue that this classifier can be fixed, up to a global scale constant, with little or no loss of accuracy for most tasks, allowing memory and computational benefits. Moreover, we show that by initializing the classifier with a Hadamard matrix we can speed up inference as well. We discuss the implications for current understanding of neural network models. Comment: https://openreview.net/forum?id=S1Dh8Tg0-
Published: 2018
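
Illustration for record 16: the abstract mentions fixing the final classifier up to a global scale constant and initializing it with a Hadamard matrix. The sketch below builds a Hadamard matrix with the standard Sylvester construction and uses its first rows as a fixed, untrained projection. How the paper maps features to classes is not given in the abstract, so `fixed_classifier_logits`, its `scale` parameter, and the choice of taking the first `num_classes` rows are assumptions.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Block form [[H, H], [H, -H]] doubles the size at each step.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def fixed_classifier_logits(features, num_classes, scale=1.0):
    """Project features onto the first `num_classes` Hadamard rows.

    Hypothetical helper: the weights are never trained, only `scale` would be.
    Requires len(features) to be a power of two and >= num_classes.
    """
    h = hadamard(len(features))
    return [scale * sum(w * f for w, f in zip(row, features))
            for row in h[:num_classes]]


if __name__ == "__main__":
    print(hadamard(4))
    print(fixed_classifier_logits([1.0, 2.0, 3.0, 4.0], num_classes=2))
```

Because every entry is +1 or -1, the projection needs only additions and subtractions, which is one plausible reading of the inference speed-up the abstract mentions.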

17. The Implicit Bias of Gradient Descent on Separable Data

Author: Soudry, Daniel, Hoffer, Elad, Nacson, Mor Shpigel, Gunasekar, Suriya, and Srebro, Nathan
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods. Comment: Added a missing assumption to Theorem 7 (multi-class case) and a discussion of this assumption after the Theorem
Published: 2017

18. Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Author: Hoffer, Elad, Hubara, Itay, and Soudry, Daniel
Subjects: Statistics - Machine Learning, Computer Science - Learning
Abstract: Background: Deep learning models are typically trained using stochastic gradient descent or one of its variants. These methods update the weights using their gradient, estimated from a small fraction of the training data. It has been observed that when using large batch sizes there is a persistent degradation in generalization performance - known as the "generalization gap" phenomenon. Identifying the origin of this gap and closing it had remained an open problem. Contributions: We examine the initial high learning rate training phase. We find that the weight distance from its initialization grows logarithmically with the number of weight updates. We therefore propose a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior. Following this hypothesis we conducted experiments to show empirically that the "generalization gap" stems from the relatively small number of updates rather than the batch size, and can be completely eliminated by adapting the training regime used. We further investigate different techniques to train models in the large-batch regime and present a novel algorithm named "Ghost Batch Normalization" which enables a significant decrease in the generalization gap without increasing the number of updates. To validate our findings we conduct several additional experiments on MNIST, CIFAR-10, CIFAR-100 and ImageNet. Finally, we reassess common practices and beliefs concerning training of deep models and suggest they may not be optimal to achieve good generalization.
Published: 2017
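
Illustration for record 18: the abstract names "Ghost Batch Normalization" without describing it. The sketch below assumes the commonly described form of the idea, in which normalization statistics are computed over small virtual ("ghost") batches inside one large batch rather than over the whole batch. The function name, the scalar inputs, and the omission of learned scale, shift and running statistics are all simplifications made here, not details from the abstract.

```python
import math


def ghost_batch_norm(batch, ghost_size, eps=1e-5):
    """Normalize each consecutive chunk of `ghost_size` values separately.

    Hypothetical, simplified helper operating on a list of scalars.
    """
    out = []
    for start in range(0, len(batch), ghost_size):
        chunk = batch[start:start + ghost_size]
        mean = sum(chunk) / len(chunk)
        var = sum((v - mean) ** 2 for v in chunk) / len(chunk)
        out.extend((v - mean) / math.sqrt(var + eps) for v in chunk)
    return out


if __name__ == "__main__":
    # A large batch of 4 values normalized as two ghost batches of 2.
    print(ghost_batch_norm([1.0, 3.0, 10.0, 30.0], ghost_size=2))
```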

19. Exponentially vanishing sub-optimal local minima in multilayer neural networks

Author: Soudry, Daniel and Hoffer, Elad
Subjects: Statistics - Machine Learning
Abstract: Background: Statistical mechanics results (Dauphin et al. (2014); Choromanska et al. (2015)) suggest that local minima with high error are exponentially rare in high dimensions. However, to prove low error guarantees for Multilayer Neural Networks (MNNs), previous works so far required either a heavily modified MNN model or training method, strong assumptions on the labels (e.g., "near" linear separability), or an unrealistic hidden layer with $\Omega\left(N\right)$ units. Results: We examine a MNN with one hidden layer of piecewise linear units, a single output, and a quadratic loss. We prove that, with high probability in the limit of $N\rightarrow\infty$ datapoints, the volume of differentiable regions of the empiric loss containing sub-optimal differentiable local minima is exponentially vanishing in comparison with the same volume of global minima, given standard normal input of dimension $d_{0}=\tilde{\Omega}\left(\sqrt{N}\right)$, and a more realistic number of $d_{1}=\tilde{\Omega}\left(N/d_{0}\right)$ hidden units. We demonstrate our results numerically: for example, $0\%$ binary classification training error on CIFAR with only $N/d_{0}\approx 16$ hidden neurons.
Published: 2017

20. Spatial contrasting for deep unsupervised learning

Author: Hoffer, Elad, Hubara, Itay, and Ailon, Nir
Subjects: Statistics - Machine Learning, Computer Science - Learning
Abstract: Convolutional networks have marked their place over the last few years as the best performing model for various visual tasks. They are, however, most suited for supervised learning from large amounts of labeled data. Previous attempts have been made to use unlabeled data to improve model performance by applying unsupervised techniques. These attempts require different architectures and training methods. In this work we present a novel approach for unsupervised training of Convolutional networks that is based on contrasting between spatial regions within images. This criterion can be employed within conventional neural networks and trained using standard techniques such as SGD and back-propagation, thus complementing supervised methods. Comment: Presented at NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems
Published: 2016

21. Semi-supervised deep learning by metric embedding

Author: Hoffer, Elad and Ailon, Nir
Subjects: Computer Science - Machine Learning
Abstract: Deep networks are successfully used as classification models yielding state-of-the-art results when trained on a large number of labeled samples. These models, however, are usually much less suited for semi-supervised problems because of their tendency to overfit easily when trained on small amounts of data. In this work we will explore a new training objective that is targeting a semi-supervised regime with only a small subset of labeled data. This criterion is based on a deep metric embedding over distance relations within the set of labeled samples, together with constraints over the embeddings of the unlabeled set. The final learned representations are discriminative in euclidean space, and hence can be used with subsequent nearest-neighbor classification using the labeled samples.
Published: 2016

22. Deep unsupervised learning through spatial contrasting

Author: Hoffer, Elad, Hubara, Itay, and Ailon, Nir
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: Convolutional networks have marked their place over the last few years as the best performing model for various visual tasks. They are, however, most suited for supervised learning from large amounts of labeled data. Previous attempts have been made to use unlabeled data to improve model performance by applying unsupervised techniques. These attempts require different architectures and training methods. In this work we present a novel approach for unsupervised training of Convolutional networks that is based on contrasting between spatial regions within images. This criterion can be employed within conventional neural networks and trained using standard techniques such as SGD and back-propagation, thus complementing supervised methods.
Published: 2016

23. Deep metric learning using Triplet network

Author: Hoffer, Elad and Ailon, Nir
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Deep learning has proven itself as a successful set of models for learning useful semantic representations of data. These, however, are mostly implicitly learned as part of a classification task. In this paper we propose the triplet network model, which aims to learn useful representations by distance comparisons. A similar model was defined by Wang et al. (2014), tailor made for learning a ranking for image information retrieval. Here we demonstrate using various datasets that our model learns a better representation than that of its immediate competitor, the Siamese network. We also discuss future possible usage as a framework for unsupervised learning.
Published: 2014
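
Illustration for record 23: the abstract says the triplet network learns representations by distance comparisons. The sketch below compares the distance from an anchor embedding to a same-class (positive) embedding with its distance to a different-class (negative) embedding. The specific loss, a softmax over the two distances that penalizes the share assigned to the positive distance, is an assumption chosen for illustration, since the abstract does not spell out the loss. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_comparison_loss(anchor, positive, negative):
    """Loss in [0, 1) that shrinks as the positive gets closer than the negative."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    # Softmax over the two distances; subtracting the max avoids overflow.
    m = max(d_pos, d_neg)
    e_pos = math.exp(d_pos - m)
    e_neg = math.exp(d_neg - m)
    share_pos = e_pos / (e_pos + e_neg)
    return share_pos ** 2


if __name__ == "__main__":
    # Positive coincides with the anchor, negative is far away: loss near 0.
    print(triplet_comparison_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0]))
    # Roles swapped: loss near 1.
    print(triplet_comparison_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0]))
```

In a real triplet network the three embeddings would come from the same network with shared weights; here they are passed in directly to keep the comparison itself in focus.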

24. Deep Metric Learning Using Triplet Network

Author: Hoffer, Elad, Ailon, Nir, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Feragen, Aasa, editor, Pelillo, Marcello, editor, and Loog, Marco, editor
Published: 2015

25. Logarithmic Unbiased Quantization: Simple 4-bit Training in Deep Learning

Author: Chmiel, Brian, Banner, Ron, Hoffer, Elad, Yaacov, Hilla Ben, and Soudry, Daniel
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Quantization of the weights and activations is one of the main methods to reduce the computational footprint of Deep Neural Networks (DNNs) training. Current methods enable 4-bit quantization of the forward phase. However, this constitutes only a third of the training process. Reducing the computational footprint of the entire training process requires the quantization of the neural gradients, i.e., the loss gradients with respect to the outputs of intermediate neural layers. In this work, we examine the importance of having unbiased quantization in quantized neural network training, where to maintain it, and how. Based on this, we suggest a logarithmic unbiased quantization (LUQ) method to quantize both the forward and backward phases to 4-bit, achieving state-of-the-art results in 4-bit training without overhead. For example, in ResNet50 on ImageNet, we achieved a degradation of 1.1%. We further improve this to a degradation of only 0.32% after three epochs of high precision fine-tuning, combined with a variance reduction method -- where both these methods add overhead comparable to previously suggested methods. Main Changes: 1) FNT learning rate (sec 4.2) 2) Implementation details (sec 4.3), including solving data movement bottleneck. 3) Additional experiments
Published: 2021

26. Deep Metric Learning Using Triplet Network

Author: Hoffer, Elad, primary and Ailon, Nir, additional
Published: 2015
Full Text: View/download PDF

27. Task-Agnostic Continual Learning Using Online Variational Bayes with Fixed-Point Updates

Author: Zeno, Chen, primary, Golan, Itay, additional, Hoffer, Elad, additional, and Soudry, Daniel, additional
Published: 2021
Full Text: View/download PDF

28. Augment Your Batch: Improving Generalization Through Instance Repetition

Author: Hoffer, Elad, primary, Ben-Nun, Tal, additional, Hubara, Itay, additional, Giladi, Niv, additional, Hoefler, Torsten, additional, and Soudry, Daniel, additional
Published: 2020
Full Text: View/download PDF

29. The Knowledge Within: Methods for Data-Free Model Compression

Author: Haroush, Matan, primary, Hubara, Itay, additional, Hoffer, Elad, additional, and Soudry, Daniel, additional
Published: 2020
Full Text: View/download PDF

30. The Implicit Bias of Gradient Descent on Separable Data.

Author: Soudry, Daniel, Hoffer, Elad, Nacson, Mor Shpigel, Gunasekar, Suriya, and Srebro, Nathan
Subjects: *IMPLICIT functions, *LOSS functions (Statistics), *MATHEMATICAL regularization, *STOCHASTIC convergence, *REGRESSION analysis
Abstract: We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods. [ABSTRACT FROM AUTHOR]
Published: 2018


