
Accelerating Convolutional Neural Network With FFT on Embedded Hardware.

Authors :
Abtahi, Tahmid
Shea, Colin
Kulkarni, Amey
Mohsenin, Tinoosh
Source :
IEEE Transactions on Very Large Scale Integration (VLSI) Systems; Sep 2018, Vol. 26 Issue 9, p1737-1749, 13p
Publication Year :
2018

Abstract

Fueled by the ImageNet Large Scale Visual Recognition Challenge and the Common Objects in Context competitions, the convolutional neural network (CNN) has become important in computer vision and natural language processing. However, state-of-the-art CNNs are computationally and memory-intensive, so energy-efficient implementation on embedded platforms is challenging. Recently, VGGNet and ResNet showed that deep neural networks with more convolution layers and few fully connected layers can achieve lower error rates, so reducing the complexity of the convolution layers is of utmost importance. In this paper, we evaluate the computational complexity and memory storage requirements of three convolution variants for popular CNNs on embedded hardware: direct convolution (Direct-Conv), fast Fourier transform (FFT)-based convolution (FFT-Conv), and FFT overlap-and-add convolution (FFT-OVA-Conv). We implemented all three techniques for ResNet-20 on the CIFAR-10 data set across a low-power domain-specific many-core architecture called power-efficient nanoclusters (PENC), the NVIDIA Jetson TX1 graphics processing unit (GPU), the ARM Cortex-A53 CPU, and the SPARse Convolutional NETwork (SPARCNet) accelerator on a Zynq 7020 FPGA, to explore the tradeoffs between software and hardware implementation, domain-specific logic and instructions, and the varying parallelism across architectures. Results are evaluated and compared with respect to throughput per layer, energy consumption, and execution time for the three methods. SPARCNet deployed on the Zynq FPGA achieved a 42-ms runtime with 135-mJ energy consumption and 10.8-MB/s throughput per layer using FFT-Conv for ResNet-20. Using the built-in FFT instruction in PENC, FFT-OVA-Conv runs 2.9x and 1.65x faster and achieves 6.8x and 2.5x higher throughput per watt than Direct-Conv and FFT-Conv, respectively.
On the ARM A53 CPU, FFT-OVA-Conv achieves 3.36x and 1.38x shorter execution time and 2.72x and 1.32x higher throughput than Direct-Conv and FFT-Conv, respectively. On the TX1 GPU, FFT-Conv is 1.9x faster, 2.2x more energy-efficient, and achieves 5.6x higher throughput per layer than Direct-Conv. PENC is 10,916x and 1.8x faster, 5053x and 4.3x more energy-efficient, and achieves 7.5x and 1.2x higher throughput per layer than the ARM A53 CPU and TX1 GPU, respectively. [ABSTRACT FROM AUTHOR]
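The three convolution variants named in the abstract can be illustrated in one dimension. The sketch below (not the authors' implementation; a minimal NumPy illustration of the general techniques) shows direct convolution, FFT-based convolution via the convolution theorem, and overlap-and-add, which splits a long input into blocks, FFT-convolves each block with the kernel, and sums the overlapping tails. The block size of 64 is an arbitrary assumption for demonstration.

```python
import numpy as np

def direct_conv(x, h):
    # Direct (naive) linear convolution: O(N*K) multiply-accumulates.
    y = np.zeros(len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        y[i:i + len(h)] += xi * h
    return y

def fft_conv(x, h):
    # FFT-based convolution (convolution theorem): zero-pad both
    # signals to the full output length, multiply their spectra,
    # then inverse-transform. Cost is O(N log N) instead of O(N*K).
    n = len(x) + len(h) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

def fft_ova_conv(x, h, block=64):
    # FFT overlap-and-add: process the input in fixed-size blocks so
    # each FFT stays small, then add the overlapping block outputs.
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = x[start:start + block]
        y[start:start + len(seg) + len(h) - 1] += fft_conv(seg, h)
    return y
```

All three produce the same linear convolution (up to floating-point error); they differ only in arithmetic cost and working-memory footprint, which is the tradeoff the paper measures across the four platforms.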

Details

Language :
English
ISSN :
1063-8210
Volume :
26
Issue :
9
Database :
Complementary Index
Journal :
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Publication Type :
Academic Journal
Accession number :
131487497
Full Text :
https://doi.org/10.1109/TVLSI.2018.2825145