1. Rethinking binary neural network design for FPGA implementation
- Author
Wang, Erwei and Cheung, Peter
- Abstract
Research has shown that deep neural networks contain significant redundancy, and that high classification accuracy can be achieved even when weights and activations are quantised down to binary values. Network binarisation on FPGAs greatly increases area efficiency by replacing resource-hungry multipliers with lightweight XNOR gates during inference. However, an FPGA's fundamental building block, the K-LUT, is capable of implementing far more than an XNOR: it can perform any K-input Boolean operation. Inefficiency also exists in BNN training: high-precision gradients and intermediate activations are redundant because only the weights' signs matter. My PhD focuses on increasing the efficiency of BNN inference and training on FPGAs. For inference, I propose expanding the BNN's inference operator to exploit the LUT's full expressiveness. I also identified various redundancies in the standard BNN training method and proposed improvements to reduce them. With the promising improvements in area, energy and memory efficiency demonstrated in my works, my research makes the BNN a more promising architecture for resource-constrained AI deployment.

To make BNNs embrace the full capabilities of the LUT, I propose LUTNet, an end-to-end hardware-software framework for constructing area-efficient FPGA-based neural network accelerators that use the native LUTs as inference operators. I demonstrate that exploiting the LUTs' flexibility allows for far heavier pruning than was possible in prior works, resulting in significant area savings while achieving comparable accuracy when implemented on a single fully unrolled layer. Against the state-of-the-art binarised neural network implementation, I achieve twice the area efficiency for several standard network models when inferencing popular datasets, and I demonstrate that even greater energy efficiency improvements are obtainable.

Although implementing a single network layer using the unrolled LUTNet architecture yields significant area efficiency gains, the complexity of modern DNNs makes whole-network unrolled LUTNet implementation infeasible. Given a fixed-size FPGA, tiling allows us to trade off throughput and efficiency for additional accuracy by enabling the architecture to implement a greater proportion of the target network, up to all of it. I therefore extend LUTNet's training program to natively support network tiling, allowing inference nodes to be shared between operations both within and across channels. In this new architecture, each physical K-LUT performs inference as one of many (K-P)-input logical LUTs, selected by P runtime selection bits streamed from BRAMs. This tiled architecture, (K, P)-LUTNet, facilitates whole-network LUTNet deployment on current-generation FPGAs. I comprehensively explore the tiling factor space it offers, finding that (K, P)-LUTNet can achieve up to 1.28x area savings and 1.57x energy efficiency gains against the BNN baseline.
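As an illustration (not drawn from the works above), a minimal NumPy sketch of the XNOR-popcount arithmetic that standard BNN inference reduces to; the function name and the {0, 1} encoding of {-1, +1} are illustrative choices.

```python
import numpy as np

def xnor_popcount_dot(w_bits: np.ndarray, a_bits: np.ndarray) -> int:
    """Binary dot product via XNOR + popcount.

    Bits encode {-1, +1} as {0, 1}. For +/-1 vectors of length K,
    w . a == 2 * popcount(XNOR(w, a)) - K.
    """
    k = w_bits.size
    agree = (~(w_bits ^ a_bits)) & 1   # XNOR: 1 where the signs agree
    return 2 * int(agree.sum()) - k

rng = np.random.default_rng(0)
w, a = rng.integers(0, 2, 8), rng.integers(0, 2, 8)
# Cross-check against the dense dot product of the decoded +/-1 vectors.
assert xnor_popcount_dot(w, a) == int(((2 * w - 1) * (2 * a - 1)).sum())
```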
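Similarly illustrative: why a K-LUT is a strictly more expressive inference operator than an XNOR gate. This sketch only shows deployment-time lookup behaviour with a fixed truth table, not LUTNet's training procedure; names are hypothetical.

```python
import numpy as np

def lut_forward(theta: np.ndarray, x_bits: np.ndarray) -> int:
    """Evaluate one K-input LUT whose 2^K truth-table entries are theta.

    The inputs simply form an address into the table, so a single LUT
    realises any of the 2^(2^K) K-input Boolean functions -- far more
    than the 2-input XNOR used as the standard BNN operator.
    """
    address = int(np.dot(x_bits, 1 << np.arange(x_bits.size)))
    return int(theta[address])

k = 4
rng = np.random.default_rng(1)
theta = rng.integers(0, 2, 2 ** k)   # an arbitrary learned truth table
x = rng.integers(0, 2, k)
print(lut_forward(theta, x))
```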
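A sketch of the tiling idea under the (K, P) scheme described above: one physical K-LUT time-shared among 2^P logical (K-P)-input LUTs, with the P select bits choosing which logical LUT the data inputs address. Again, names are hypothetical.

```python
import numpy as np

def tiled_lut_forward(theta: np.ndarray, data_bits: np.ndarray,
                      select_bits: np.ndarray) -> int:
    """One physical K-LUT shared as 2^P logical (K-P)-input LUTs.

    The P select bits (streamed from BRAM at runtime in the tiled
    architecture) pick which logical LUT the K-P data inputs address.
    """
    bits = np.concatenate([data_bits, select_bits])   # K address bits
    address = int(np.dot(bits, 1 << np.arange(bits.size)))
    return int(theta[address])

k, p = 6, 2
rng = np.random.default_rng(2)
theta = rng.integers(0, 2, 2 ** k)
data = rng.integers(0, 2, k - p)
select = rng.integers(0, 2, p)        # selects one of 2^P logical LUTs
print(tiled_lut_forward(theta, data, select))
```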
With logic expansion, the significant increase in network complexity brings greater expressiveness, but at the cost of longer training time, so manually fine-tuning K for each individual LUT is infeasible. Moreover, the LUT inputs are randomly connected, which does not guarantee a good choice of network topology. In addition to logic expansion, I therefore propose logic shrinkage, which allows the network to learn its choice of K and input connections for each LUT via fine-grained activation pruning: the saliency of each LUT input is evaluated and low-importance connections are removed, improving the efficiency of the resultant LUTNet netlist. With logic shrinkage, I achieve 1.54x the area efficiency and 1.31x the energy efficiency of LUTNet for the CNV network classifying CIFAR-10.

The works above target the BNN's redundancies in inference. I also identified redundancies in the BNN training process, which served as starting points for LUTNet. The ever-growing computational demands of increasingly complex machine learning models frequently necessitate powerful cloud-based infrastructure for their training. I introduce a low-cost binary neural network training strategy exhibiting sizable memory footprint reductions and energy savings versus Courbariaux & Bengio's standard approach. Against the latter, my method reduces coincident memory requirements and energy consumption by a factor of 2-6x while reaching similar test accuracy in comparable time, across a range of small-scale models trained to classify popular datasets. I also showcase ImageNet training of ResNetE-18, achieving a 3.12x memory reduction over the aforementioned standard. Such savings allow unnecessary cloud offloading to be avoided, reducing latency, increasing energy efficiency and safeguarding privacy.
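To make the logic-shrinkage pruning step concrete, a toy stand-in (the thesis learns its saliency scores during training; the scores and names here are hypothetical): keep each LUT's most salient input connections and drop the rest.

```python
import numpy as np

def shrink_lut_inputs(saliency: np.ndarray, keep: int) -> np.ndarray:
    """Toy logic-shrinkage step: retain the `keep` most salient inputs.

    saliency holds one importance score per candidate LUT input; the
    lowest-scoring connections are pruned from the netlist, shrinking
    the LUT from len(saliency) inputs down to `keep`.
    """
    return np.sort(np.argsort(saliency)[-keep:])

scores = np.array([0.90, 0.05, 0.40, 0.70])   # hypothetical per-input scores
print(shrink_lut_inputs(scores, keep=2))       # -> [0 3]
```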
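Finally, a hedged sketch of the training baseline being improved upon: sign binarisation with the usual straight-through estimator, plus a crude float16 cast standing in for the reduced-precision gradient and activation buffers. This does not reproduce the thesis's actual quantisation scheme.

```python
import numpy as np

def binarise(w: np.ndarray) -> np.ndarray:
    """Forward pass: keep only the weights' signs, in {-1, +1}."""
    return np.where(w >= 0.0, 1.0, -1.0)

def ste_grad(grad_out: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Straight-through estimator with a crude precision cut.

    The gradient passes through where |w| <= 1 (the standard STE clip);
    the float16 round-trip merely stands in for the reduced-precision
    buffers from which the memory and energy savings arise.
    """
    grad = grad_out * (np.abs(w) <= 1.0)
    return grad.astype(np.float16).astype(np.float32)

w = np.array([0.3, -1.5, 0.7], dtype=np.float32)
g = np.array([0.11, 0.22, -0.33], dtype=np.float32)
print(binarise(w), ste_grad(g, w))
```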
- Published
2021