
【Paper Reading】Post-reading summary: "DaDianNao: A Machine-Learning Supercomputer" (2014)




DaDianNao: A Machine-Learning Supercomputer [2014] (read 2021-04-29)

-1 Reflections

My reflection: I have come to accept that I am a garbage-producing machine… T _ T…

0 ABSTRACT

As applications of AI algorithms keep appearing at an increasing pace, a number of neural network accelerators have been proposed to improve the computational-capacity/area ratio, but they remain limited by memory accesses. This paper proposes a custom multi-chip architecture for machine learning that, on a 64-chip system running some of the largest known neural network layers, achieves an average speedup of 450.65x over a GPU while reducing energy by 150.31x.

1 RELATED WORK

Temam [2] proposed a neural network accelerator for multi-layer perceptrons, but not for DNNs. Esmaeilzadeh et al. [3] proposed an NPU that approximates program functions with a hardware neural network, which is not dedicated to machine learning. Chen et al. [4] proposed an accelerator for DNNs. However, these accelerators are all limited by the size of the neural network and by the storage needed for the network's computed values. Moreover, Chen et al. [4] confirmed that memory access is the bottleneck of neural network accelerators.

2 STATE-OF-THE-ART MACHINE-LEARNING TECHNIQUES

2.1 Main Layer Types

Both CNNs and DNNs are composed of four main layer types, which together produce an effective classification at the output: convolutional layers (CONV), pooling layers (POOL), local response normalization layers (LRN) and classifier layers (CLASS).

  • CONV: a convolutional layer maps input feature maps to output feature maps by applying sets of filters (kernels), as illustrated in the sketch after this list.

  • POOL: a pooling layer takes the max or the average over each region of its input.

  • LRN: a local response normalization layer intensifies competition between neurons, strengthening the dominant responses.

  • CLASS: a classifier layer usually consists of a multi-layer perceptron and produces the output categories for classification.
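
To make these layer computations concrete, here is a minimal NumPy sketch of the CONV and POOL operations (the shapes, names and loop structure are illustrative, not taken from the paper):

```python
import numpy as np

def conv_layer(x, kernels, stride=1):
    """Naive CONV layer: x is (C_in, H, W), kernels is (C_out, C_in, K, K)."""
    c_out, c_in, k, _ = kernels.shape
    h_out = (x.shape[1] - k) // stride + 1
    w_out = (x.shape[2] - k) // stride + 1
    y = np.zeros((c_out, h_out, w_out))
    for o in range(c_out):
        for i in range(h_out):
            for j in range(w_out):
                patch = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
                y[o, i, j] = np.sum(patch * kernels[o])  # one filter applied to one window
    return y

def pool_layer(x, k=2, mode="max"):
    """Naive POOL layer: reduce each k x k window of x (C, H, W) to its max or average."""
    c, h, w = x.shape
    y = np.zeros((c, h // k, w // k))
    for i in range(h // k):
        for j in range(w // k):
            window = x[:, i*k:(i+1)*k, j*k:(j+1)*k]
            y[:, i, j] = window.max(axis=(1, 2)) if mode == "max" else window.mean(axis=(1, 2))
    return y
```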

2.2 Benchmarks

As benchmarks, this article uses 10 of the largest known layers of each type, plus a full CNN from the ImageNet 2012 competition. The detailed configuration of each layer is given in the paper.

3 THE GPU OPTION

The paper evaluates the layer types described above, implemented in CUDA, on a GPU (NVIDIA K20M, 5 GB GDDR5, 208 GB/s memory bandwidth, 3.52 TFlops peak, 28nm technology) and on a 256-bit SIMD CPU (Intel Xeon E5-4620 Sandy Bridge-EP, 2.2 GHz, 1 TB memory). The analysis shows that GPUs are highly efficient on LRN layers thanks to their SIMD nature. However, the drawbacks of GPUs are also clear: high cost, a poor fit for industrial applications, and only moderate energy efficiency.

4 THE ACCELERATOR OPTION

Chen et al. [4] proposed the DianNao accelerator for faster and more energy-efficient computation of large CNNs and DNNs; it consists of buffers for input/output neurons and synapses plus an NFU (neural functional unit). Reproducing that work, this paper finds that the main limitation of the architecture is the memory-bandwidth bottleneck in the convolutional and classifier layers, which is therefore the optimization target of this article.

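As a rough functional sketch of the DianNao-style dataflow described above (input-neuron, synapse and output-neuron buffers feeding an NFU), the following assumes a single fully-connected layer; the buffer names, sizes and activation function are illustrative rather than the paper's exact pipeline:

```python
import numpy as np

def nfu_fully_connected(nb_in, sb, activation=np.tanh):
    """Sketch of the accelerator dataflow: input neurons (nb_in) and synaptic
    weights (sb) are multiplied and accumulated, then an activation function
    produces the output neurons that would fill the output buffer."""
    partial_sums = sb @ nb_in          # multiply-and-accumulate stage
    return activation(partial_sums)    # activation stage

# Hypothetical sizes: 256 input neurons, 64 output neurons
nb_in = np.random.rand(256)
sb = np.random.rand(64, 256)
print(nfu_fully_connected(nb_in, sb).shape)  # (64,)
```
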
5 A MACHINE-LEARNING SUPERCOMPUTER

In this part, the paper proposes an architecture that reaches high machine-learning performance with multiple chips that are cheaper than a typical GPU and whose on-chip storage is large enough for the memory needs of DNNs and CNNs.

5.1 Overview

To meet the memory-storage and bandwidth requirements, the paper makes the following design decisions:

  • Store synapses close to the neurons that use them, so that moving data costs less time and energy. The architecture has no main memory; storage is fully distributed.
  • The silicon is deliberately biased towards storage rather than computation.
  • Transfer neuron values rather than synapse values, which requires far less bandwidth (see the sketch after this list).
  • Split the local storage into many tiles for higher internal bandwidth.
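
A back-of-the-envelope calculation (the layer size below is illustrative, not from the paper) shows why moving neuron values instead of synapse values saves bandwidth, as referenced in the third bullet:

```python
# For a fully-connected layer with n_in inputs and n_out outputs:
# synapses = n_in * n_out values, neurons = n_in + n_out values.
n_in, n_out = 4096, 4096                # illustrative layer size
synapse_values = n_in * n_out           # 16,777,216 weights would have to move
neuron_values = n_in + n_out            # only 8,192 activations actually move
print(synapse_values // neuron_values)  # 2048x less traffic when synapses stay local
```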

5.2 Node

  • Synapses Close to Neurons: by placing synapse storage next to the neurons that consume it, only neuron values need to move, which keeps data-transfer overhead low and internal bandwidth high.
  • High Internal Bandwidth: the paper uses a tile-based design to avoid congestion; the outputs are generated in different tiles.
  • Configurability: the tiles and the NFU pipeline can be reconfigured for different layer types and execution modes.

5.3 Interconnect

  • Because heavy inter-node communication appears in only a few layers, thanks to the considerable reuse of neuron values, the paper adopts commercially available high-performance interfaces rather than a custom interconnect. The router is implemented with wormhole routing.

5.4 Programming, Code Generation and Multi-Node Mapping

  • This architecture can be viewed as a system ASIC. At the start, the input data is partitioned across the nodes and stored in central eDRAM. The neural network is then deployed as node-level instructions that drive the control of each tile (see the sketch after this list).
  • The output neuron values produced at the end of each layer, which become the input neurons of the next layer, are stored back into the central eDRAM.
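
The following is a high-level sketch of this execution model, with each node's tiles collapsed into a single local matrix product and the central eDRAM modeled as an ordinary array; the partitioning scheme and function names are illustrative rather than the paper's actual code generation:

```python
import numpy as np

def run_network(weights_per_layer, num_nodes, input_neurons):
    """Sketch: for every layer, the work is partitioned across nodes, each node
    computes its slice locally (standing in for its tiles), and the gathered
    output neurons (standing in for the central eDRAM) feed the next layer."""
    neurons = input_neurons
    for w in weights_per_layer:                           # w: (n_out, n_in) for this layer
        blocks = np.array_split(w, num_nodes, axis=0)     # partition the layer across nodes
        partials = [block @ neurons for block in blocks]  # each "node" computes locally
        neurons = np.tanh(np.concatenate(partials))       # store outputs for the next layer
    return neurons

# Hypothetical usage: a 3-layer network mapped onto 4 nodes
layers = [np.random.rand(128, 256), np.random.rand(64, 128), np.random.rand(10, 64)]
print(run_network(layers, num_nodes=4, input_neurons=np.random.rand(256)).shape)  # (10,)
```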

6 METHODOLOGY

6.1 Measurements

  • The paper uses the ST 28nm Low Power (LP) technology (0.9V), Synopsys Design Compiler for synthesis, IC Compiler for layout, and Synopsys PrimeTime PX to estimate power consumption.
  • Synopsys VCS is used to simulate the node RTL.
  • The GPU baseline in the article is the NVIDIA K20M.

6.2 Baseline

The paper uses tuned open-source CUDA implementations as the GPU baseline, to make the baseline as strong as possible.

7 EXPERIMENTAL RESULTS

  • Nearly half of the chip area is occupied by the 16 tiles, and roughly half of the chip is occupied by memory cells.

  • The power of the proposed chip is only about 5-10% of that of the state-of-the-art GPU.

  • The 1-node, 4-node, 16-node and 64-node architectures achieve speedups of 21.38x, 79.81x, 216.72x and 450.65x over the GPU baseline, respectively. The strong single-node performance comes from the large number of operators and from the bandwidth supplied by the on-chip eDRAM.

  • In addition, the 1-node, 4-node, 16-node and 64-node architectures reduce energy by 330.56x, 323.74x, 276.04x and 150.31x compared with the GPU baseline, respectively, so the energy advantage degrades only gradually as the number of nodes grows.

8 REFERENCES

[1] Y. Chen et al., "DaDianNao: A Machine-Learning Supercomputer," 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 609-622, doi: 10.1109/MICRO.2014.58.

[2] O. Temam. A Defect-Tolerant Accelerator for Emerging High-Performance Applications. In International Symposium on Computer Architecture, Portland, Oregon, 2012.

[3] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural Acceleration for General-Purpose Approximate Programs. In International Symposium on Microarchitecture, number 3, pages 1–6, 2012.

[4] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
