HiKonv: High Throughput Quantized Convolution on Conventional Multipliers with Novel Bit-wise Management and Computation
The world of edge AI with constrained hardware resources has recently seen significant progress in applying quantization to deep Convolutional Neural Networks (CNNs) to reduce computation and storage costs with low-bitwidth data. There are, however, no systematic studies on how existing full-bitwidth processing units, such as CPUs and DSPs, can be better utilized to deliver significantly higher computation throughput for convolution under various quantized bitwidths. In this study, we propose HiKonv, a unified solution framework that maximizes the computation throughput of a given underlying processing unit when processing low-bitwidth quantized CNNs. The key idea behind HiKonv is novel bit-wise management and parallel computation. We establish theoretical performance bounds on using a full-bitwidth multiplier for highly parallelized low-bitwidth convolution, and demonstrate new breakthroughs for high-performance CNN computation in this critical domain. For example, we show that a single 32-bit processing unit can deliver 128 binarized convolution operations (multiplications and additions) under one CPU instruction, and a single 27x18 DSP core can deliver eight convolution operations with 4-bit inputs in one cycle. We demonstrate the effectiveness of HiKonv on CPU and FPGA for both convolutional layers and a complete DNN model. For a convolutional layer quantized to 4-bit, HiKonv achieves a 3.17x latency improvement over the baseline implementation using C++ on CPU. Compared to the DAC-SDC 2020 champion model for FPGA, HiKonv achieves a 2.37x throughput improvement and a 2.61x DSP efficiency improvement.
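The general idea of computing several low-bitwidth products with one wide multiplication can be illustrated with a minimal Python sketch. It is not the HiKonv formulation itself: it assumes unsigned inputs only, and the function names (`pack`, `conv1d_via_one_multiply`, `conv1d_reference`) and the slice-width rule are illustrative assumptions. Each operand vector is packed into one integer with a fixed slice width; a single integer multiply then yields every output of a full 1-D convolution, one per slice, provided each slice is wide enough to hold the largest possible accumulated sum.

```python
def pack(values, slice_bits):
    """Pack unsigned integers into one word, values[i] at bit offset i * slice_bits."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * slice_bits)
    return word


def conv1d_via_one_multiply(f, g, p, q):
    """Full 1-D convolution of unsigned p-bit f and q-bit g with one integer multiply.

    Assumption for this sketch: inputs are unsigned. Signed inputs would need
    extra handling that is not shown here.
    """
    assert all(0 <= v < (1 << p) for v in f)
    assert all(0 <= v < (1 << q) for v in g)

    # Each output sums at most min(len(f), len(g)) products of p + q bits,
    # so ceil(log2(terms)) guard bits per slice prevent overflow into the
    # neighbouring slice.
    terms = min(len(f), len(g))
    slice_bits = p + q + (terms - 1).bit_length()

    # One wide multiplication produces all partial sums at once.
    product = pack(f, slice_bits) * pack(g, slice_bits)

    # Slice k of the product holds sum of f[i] * g[j] over i + j == k.
    mask = (1 << slice_bits) - 1
    n_out = len(f) + len(g) - 1
    return [(product >> (k * slice_bits)) & mask for k in range(n_out)]


def conv1d_reference(f, g):
    """Plain nested-loop full convolution, used only to check the packed version."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out


if __name__ == "__main__":
    f = [3, 15, 7]   # 4-bit unsigned features
    g = [15, 1]      # 4-bit unsigned kernel
    print(conv1d_via_one_multiply(f, g, p=4, q=4))
    print(conv1d_reference(f, g))
```

Python integers have arbitrary precision, so this sketch does not model the fixed operand widths of a real CPU or DSP multiplier; on real hardware the packed operands and the product must also fit within the multiplier's input and output widths, which limits how many values can be packed.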