Harnessing reconfigurable hardware to design heterogeneous systems
- Author
- Iordanou, Konstantinos and Kotselidis, Christos-Efthymios
- Abstract
A typical Machine Learning (ML) development cycle for edge computing seeks to maximise performance during model training and then minimise the memory/area footprint of the trained model for deployment on edge devices, targeting CPUs, GPUs, microcontrollers, or custom hardware accelerators. A reasonable question to pose is: could we develop a supervised learning technique that takes data as input and generates a circuit representation for classification that behaves like an ML model? This thesis proposes a methodology for automatically generating predictor circuits for the classification of tabular data. In contrast to image and text data, tabular data can combine numerical and categorical fields. The proposed approach provides prediction performance comparable to conventional ML techniques, whilst using substantially fewer hardware resources and less power. The methodology uses an evolutionary algorithm to search over the space of logic gates and automatically generates a classifier circuit that maximises training prediction accuracy. These classifier circuits are called "Tiny Classifiers", since they consist of no more than 400 logic gates, and can be efficiently implemented as ASIC blocks or FPGA accelerators. The Auto Tiny Classifiers methodology (AutoTiC) is evaluated on a wide range of tabular datasets and compared against conventional ML techniques, such as Amazon's AutoGluon, Google's TabNet, and a neural architecture search over Multi-Layer Perceptrons. When implemented as ASICs, Tiny Classifiers use 10-75x less area/power and can be clocked 2-3x faster than the corresponding ML baselines; when implemented on an FPGA, they use 3-11x fewer resources.

The slowing of Moore's law and the breakdown of Dennard scaling have pushed computing systems towards increased specialisation, and novel architectures are required to provide greater performance scaling than traditional approaches.
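The gate-level search described above can be illustrated with a toy sketch. The circuit encoding, the four-gate vocabulary, and the simple (1+1) mutation loop below are illustrative assumptions for exposition, not the thesis's actual AutoTiC implementation:

```python
import random

# Toy gate vocabulary (illustrative; the real search space may differ).
GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "XOR":  lambda a, b: a ^ b,
    "NAND": lambda a, b: 1 - (a & b),
}

def random_circuit(n_inputs, n_gates, rng):
    """A circuit is a list of (gate_name, src1, src2); sources index
    input bits or outputs of earlier gates, so the netlist is acyclic."""
    circuit = []
    for g in range(n_gates):
        fan_in = n_inputs + g  # may only read earlier wires
        circuit.append((rng.choice(list(GATES)),
                        rng.randrange(fan_in), rng.randrange(fan_in)))
    return circuit

def evaluate(circuit, bits):
    """Simulate the netlist on one binary input vector; the last gate's
    output is taken as the predicted class."""
    wires = list(bits)
    for name, s1, s2 in circuit:
        wires.append(GATES[name](wires[s1], wires[s2]))
    return wires[-1]

def accuracy(circuit, X, y):
    return sum(evaluate(circuit, x) == t for x, t in zip(X, y)) / len(y)

def evolve(X, y, n_gates=8, generations=300, seed=0):
    """(1+1) evolutionary loop: mutate one gate, keep the child if its
    training accuracy does not drop (neutral drift allowed)."""
    rng = random.Random(seed)
    n_inputs = len(X[0])
    best = random_circuit(n_inputs, n_gates, rng)
    best_acc = accuracy(best, X, y)
    for _ in range(generations):
        child = list(best)
        g = rng.randrange(n_gates)
        fan_in = n_inputs + g
        child[g] = (rng.choice(list(GATES)),
                    rng.randrange(fan_in), rng.randrange(fan_in))
        child_acc = accuracy(child, X, y)
        if child_acc >= best_acc:
            best, best_acc = child, child_acc
    return best, best_acc

# Toy task: evolve a circuit for XOR of two binarised features.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]
circuit, acc = evolve(X, y)
```

In this toy form the fitness is plain training accuracy over binarised inputs; a circuit of a few hundred such gates maps directly to an ASIC block or FPGA netlist, which is what keeps the area and power footprint small.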
Heterogeneity was introduced as an alternative to sidestep the performance wall of multi-core processors. In contrast to homogeneous systems, heterogeneous systems use a mixture of dedicated cores specialised for specific tasks, and cloud providers are trying to gain a competitive advantage with heterogeneous systems that combine powerful CPUs, GPUs, FPGAs, and TPUs. An FPGA-oriented SoC includes application-specific custom hardware kernels on the die. Improving the performance of SoCs by including specialised hardware requires a deep understanding of the computationally significant kernels of the applications under consideration. However, the design of SoCs does not only require the deployment of custom hardware kernels within the system; the study of system-level bottlenecks is also an important part of the development process for high-performance specialised software/hardware systems. For these systems, a key research problem needs to be addressed: how does the interaction of custom compute kernels with processors affect the overall performance of a system, and what is the optimal integration of a hardware kernel within the cache memory hierarchy of an SoC to extract better performance? This thesis describes a methodology for microarchitectural simulation of SoCs that offers the flexibility to identify and alleviate system-level bottlenecks by studying the effect of custom compute kernels on the cache memory hierarchy of an SoC. Hardware designers can perform a timing simulation of SoCs while tuning the microarchitecture of the computing cores and custom hardware kernels. The proposed methodology offers the novel capability to place Register-Transfer Level (RTL) compute kernels in a simulation environment and perform a timing analysis of their interaction with the cache memory hierarchy.
Application binaries are instrumented dynamically to generate processor load/store and program-counter events, as well as any memory access generated by the hardware kernels; these events are sent to hardware-based timing models of the processors and memory hierarchies. Key features of the proposed simulation methodology include the ability to operate exclusively at the user level, the dynamic discovery and use of the available hardware models at execution time, and the transparent testing and optimisation of custom compute kernels against the cache memory hierarchy of a heterogeneous system. The final part of this work focuses on the deployment of Tiny Classifier circuits as custom compute kernels in an SoC under test. In addition, custom hardware RTL kernels from a wide range of benchmark suites are explored as parts of an SoC. Different scenarios for integrating the hardware kernels into the cache memory hierarchy of an SoC are analysed using the proposed simulation framework.
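The event flow described above, instrumentation events driving a timing model, can be sketched as follows. The event format, the direct-mapped geometry, and the latency numbers are illustrative assumptions rather than the framework's actual models:

```python
from collections import namedtuple

# One memory event as dynamic instrumentation might emit it
# (field names are illustrative placeholders).
MemEvent = namedtuple("MemEvent", "pc addr is_write")

class DirectMappedCache:
    """Minimal timing model: a hit costs HIT_CYCLES, a miss adds
    MISS_PENALTY. Latencies are illustrative, not measured values."""
    HIT_CYCLES = 1
    MISS_PENALTY = 20

    def __init__(self, n_lines=64, line_bytes=64):
        self.n_lines = n_lines
        self.line_bytes = line_bytes
        self.tags = [None] * n_lines   # one tag per set (direct-mapped)
        self.hits = self.misses = self.cycles = 0

    def access(self, event):
        # Write policy is ignored in this sketch; loads and stores
        # are costed identically.
        line = event.addr // self.line_bytes
        idx = line % self.n_lines
        tag = line // self.n_lines
        if self.tags[idx] == tag:
            self.hits += 1
            self.cycles += self.HIT_CYCLES
        else:
            self.misses += 1
            self.cycles += self.HIT_CYCLES + self.MISS_PENALTY
            self.tags[idx] = tag  # allocate the line on a miss

# A CPU (or RTL kernel) streaming sequentially through a small buffer:
# 16 loads of 8 bytes each, spanning two 64-byte cache lines.
cache = DirectMappedCache()
trace = [MemEvent(pc=0x400000 + 4 * i, addr=0x1000 + 8 * i, is_write=False)
         for i in range(16)]
for ev in trace:
    cache.access(ev)
```

Because CPU-side and kernel-side events can be fed through the same shared model, contention between them shows up directly in the hit/miss and cycle counts, which is the kind of system-level bottleneck the methodology is designed to expose.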
- Published
- 2023