## Algebraic Enhancements for Systolic Arrays

#### **Trevor Pogue**

Department of Electrical and Computer Engineering McMaster University

November 22 2024

- Backround
- Deep Learning Accelerator System Architecture
- Fast Inner-Product Algorithms and Hardware Architectures
- Karatsuba Matrix Multiplication Algorithm and Hardware Architectures
- Strassen Hardware Architectures
- Conclusion and Future Work

#### Backround

- Deep Learning Accelerator System Architecture
- Fast Inner-Product Algorithms and Hardware Architectures
- Karatsuba Matrix Multiplication Algorithm and Hardware Architectures
- Strassen Hardware Architectures
- Conclusion and Future Work

- Why use hardware acceleration for deep learning (DL):
  - Increasing demand in environments like real-time and datacenters alike, requires efficiency
  - Requires billions of operations
  - Breaks down into same parallelizable compute patterns great fit for hardware acceleration

- Why use hardware acceleration for deep learning (DL):
  - Increasing demand in environments like real-time and datacenters alike, requires efficiency
  - Requires billions of operations
  - Breaks down into same parallelizable compute patterns great fit for hardware acceleration
- Previous approaches for deep learning hardware acceleration:
  - Quantization
  - Sparse/pruned NNs
  - · Hardware architecture design automation
  - Hardware-oriented NN model design automation

- Why use hardware acceleration for deep learning (DL):
  - Increasing demand in environments like real-time and datacenters alike, requires efficiency
  - Requires billions of operations
  - Breaks down into same parallelizable compute patterns great fit for hardware acceleration
- Previous approaches for deep learning hardware acceleration:
  - Quantization
  - Sparse/pruned NNs
  - · Hardware architecture design automation
  - · Hardware-oriented NN model design automation
- Chosen under-explored approach: Advancements and application of fast matrix multiplication in custom hardware designs
  - Expensive portion of most neural networks (NN) decomposes to GEMM
  - NN's algebra is performed using reduced complexity GEMM
  - Less explored route for continuing progress





<sup>[1]</sup> Norman P. Jouppi et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit". In: SIGARCH Comput. Archit. News 45.2 (June 2017), 1–12

• 2D array of multiply-accumulate (MAC) units



#### Benefits

- Reduced reads from memory
  - Reduced memory bandwidth and power consumption
- Local and regular interconnections between processors
  - Increases max clock frequency

<sup>[1]</sup> Norman P. Jouppi et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit". In: SIGARCH Comput. Archit. News 45.2 (June 2017), 1–12





#### Benefits

- Reduced reads from memory
  - Reduced memory bandwidth and power consumption
- Local and regular interconnections between processors
  - Increases max clock frequency
- Has been implemented commercially in Google's Tensor Processing Unit (TPU) [1]

<sup>[1]</sup> Norman P. Jouppi et al. "In-Datacenter Performance Analysis of a Tensor Processing Unit". In: SIGARCH Comput. Archit. News 45.2 (June 2017), 1–12

- Backround
- Deep Learning Accelerator System Architecture
- Fast Inner-Product Algorithms and Hardware Architectures
- Karatsuba Matrix Multiplication Algorithm and Hardware Architectures
- Strassen Hardware Architectures
- Conclusion and Future Work



#### System overview

- Matrix Multiply Unit (MXU) Systolic Array
- Post-GEMM Unit NN-specific operations
- Memory Unit Memory access control, On-chip memory
- Weight DRAM (external memory)
- RxTx Unit PCIe interface to host
- Instruction Unit

### System Design - Used FPGA Platform



https://rocketboards.org/foswiki/Documentation/Arria10SoCGSRD

- Backround
- Deep Learning Accelerator System Architecture
- Fast Inner-Product Algorithms and Hardware Architectures
- Karatsuba Matrix Multiplication Algorithm and Hardware Architectures
- Strassen Hardware Architectures
- Conclusion and Future Work

Winograd's Fast Inner Product (FIP) [2]:



[2] S. Winograd. "A New Algorithm for Inner Product". In: IEEE Trans. Comput. C-17.7 (1968), pp. 693–694
[3] Trevor E. Pogue and Nicola Nicolici. "Fast Inner-Product Algorithms and Architectures for Deep Neural Network Accelerators". In: IEEE Trans. Comput. 73.2 (2024), pp. 495–509

Trevor Pogue (McMaster University)

Algebraic Enhancements for Systolic Arrays

#### Fast Inner-Product Algorithms and Hardware Architectures

Winograd's Fast Inner Product (FIP) [2]:



 [2] S. Winograd. "A New Algorithm for Inner Product". In: IEEE Trans. Comput. C-17.7 (1968), pp. 693–694
[3] Trevor E. Pogue and Nicola Nicolici. "Fast Inner-Product Algorithms and Architectures for Deep Neural Network Accelerators". In: IEEE Trans. Comput. 73.2 (2024), pp. 495–509

Trevor Pogue (McMaster University)

Algebraic Enhancements for Systolic Arrays

November 22 2024

Proposed Free-pipeline FIP (FFIP) [3]:



Resource Usage and Performance vs MXU width and height

- \*LUT/ALM resources share a similar curve as registers
- \*Memory resources are equivalent for all designs

- Backround
- Deep Learning Accelerator System Architecture
- Fast Inner-Product Algorithms and Hardware Architectures
- Karatsuba Matrix Multiplication Algorithm and Hardware Architectures
- Strassen Hardware Architectures
- Conclusion and Future Work

2-Digit Karatsuba Scalar Multiplication (KSM<sub>2</sub>)

 $\begin{array}{c} a \times b \\ = \left(a_1 \ll w/2 \, + \, a_0\right) \times \left(b_1 \ll w/2 \, + \, b_0\right) \end{array}$ 



- [4] Requires 3 single-digit mults instead of 4
- But requires 3 extra additions, increasing overall #operation

<sup>[4]</sup> Anatolii Alekseevich Karatsuba and Yu P Ofman. "Multiplication of many-digital numbers by automatic computers". In: Proc. Doklady Akademii Nauk. Vol. 145. 2. Russian Academy of Sciences. 1962, pp. 293–294

## Proposed Karatsuba Matrix Multiplication (KMM)

2-Digit Karatsuba Scalar Multiplication (KSM<sub>2</sub>)

 $a \times b = (a_1 \ll w/2 + a_0) \times (b_1 \ll w/2 + b_0)$ 



- [4] Requires 3 single-digit mults instead of 4
- But requires 3 extra additions, increasing overall #operation



- *d* = matrix height/width
- Increase in number of additions with complexity  $\mathcal{O}(d^2)$  is now insignificant relative to the reduction of 3 instead of 4 single-digit MM of complexity  $\mathcal{O}(d^3)$

<sup>[4]</sup> Anatolii Alekseevich Karatsuba and Yu P Ofman. "Multiplication of many-digital numbers by automatic computers". In: Proc. Doklady Akademii Nauk. Vol. 145. 2. Russian Academy of Sciences. 1962, pp. 293–294

## Proposed KMM Hardware Architecture



# 2-Digit Karatsuba Matrix Multiplication (KMM<sub>2</sub>) $[\mathbf{A}] \times [\mathbf{B}]$ $= \left( [\mathbf{A}_1] \ll w/2 + [\mathbf{A}_0] \right) \times \left( [\mathbf{B}_1] \ll w/2 + [\mathbf{B}_0] \right)$ $[\mathbf{A}_1] \quad \mathcal{O}(d^2) \quad \rightarrow ([\mathbf{A}_1] + [\mathbf{A}_0]) \qquad [\mathbf{A}_0]$ $\begin{array}{c} (1-2) \\ \otimes \leftarrow \mathcal{O}(d^3) \\ \end{array} \xrightarrow{(1-2)} \\ \otimes \\ (B_1) \\ \mathcal{O}(d^2) \\ \end{array} \xrightarrow{(1-2)} \\ (B_1) \\ (B_0) \\ \end{array}$ $= [\mathbf{A}_{1}\mathbf{B}_{1}] \ll w + \begin{pmatrix} [\mathbf{A}_{1}\mathbf{B}_{0}] \\ + [\mathbf{A}_{0}\mathbf{B}_{1}] \end{pmatrix} \ll w/2 + [\mathbf{A}_{0}\mathbf{B}_{0}]$ $\left|+\frac{[\mathbf{A_0B_0}]}{[\mathbf{A_1B_1}]}\right|$

- d = matrix height/width
- Increase in number of additions with complexity  $\mathcal{O}(d^2)$  is now insignificant relative to the reduction of 3 instead of 4 single-digit MM of complexity  $\mathcal{O}(d^3)$

$$Area(ADD^{[w]}) = w AU$$
  
 $Area(FF^{[w]}) = 0.7 w AU$   
 $Area(MULT^{[w]}) = w^2 AU$ 



Figure (1) Maximum achievable AU compute efficiencies for the fixed-precision  $MM_1$ ,  $KSMM_n$ , and  $KMM_n$  architectures.

- Backround
- Deep Learning Accelerator System Architecture
- Fast Inner-Product Algorithms and Hardware Architectures
- Karatsuba Matrix Multiplication Algorithm and Hardware Architectures
- Strassen Hardware Architectures
- Conclusion and Future Work

Traditional 4-tile MM requires 8 tile MMs:

$$\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix} = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{21} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{bmatrix}$$

Strassen [5] requires 7 tile MMs:

[5] Volker Strassen. "Gaussian elimination is not optimal". In: Numer. Math. 13.4 (1969), pp. 354-356



Figure (2) SMM<sub>r</sub> systolic-array architecture for implementing r levels of Strassen recursion in hardware.

- Unlike CPU/GPU implementations, extra additions & data movements are performed in parallel with the MMs
- Eliminates extra execution time needed for these steps

| Table (1) | FFIP+SMM <sub>r</sub> | architectures in | a DNN | accelerator | compared | with | prior | state-of-the-art accelerators. |
|-----------|-----------------------|------------------|-------|-------------|----------|------|-------|--------------------------------|
|-----------|-----------------------|------------------|-------|-------------|----------|------|-------|--------------------------------|

|                                   | TNNLS '22 [6]    |           |                  | TCAD '22 [7]     |                  | Entropy '22 [8] |                  | 24 [9]         | Proposed FFIP+SMM <sub>1</sub> $32 \times 32$ |                | $M_1 32 \times 32$ |
|-----------------------------------|------------------|-----------|------------------|------------------|------------------|-----------------|------------------|----------------|-----------------------------------------------|----------------|--------------------|
| FPGA                              | Arria 10 GX 1150 |           | Arria 10 GX 1150 |                  | Arria 10 GX 1150 |                 | Stratix 10 GX650 |                | Arria 10 GX 1150                              |                |                    |
| ALMs                              | 304K             |           | 304K             |                  | 303K             |                 | 152K             |                |                                               | 216K           |                    |
| Registers                         | 889K             |           | 890K             |                  | -                |                 | 567K             |                | 627K                                          |                |                    |
| Memories                          | 2334             |           | 2334             |                  | 1953             |                 | 2056             |                | 2713                                          |                |                    |
| DSPs                              | 1473             |           | 1473             |                  | 1503             |                 | 102              | 24             | 1518                                          |                |                    |
| Frequency (MHz)                   | 200              |           | 220              |                  | 172              |                 | 200              |                | 313                                           |                |                    |
| Input bitwidth<br>(fixed-point)   | 8                | 8         | 8                | 8                | 8                | 8               | 8                | 8              | 8                                             | 8              | 8                  |
| Model                             | ResNet-<br>50    | VGG<br>16 | Bayes<br>ResNet1 | Bayes<br>8 VGG11 | RCNN<br>ResNet50 | RCNN<br>VGG16   | ResNet-<br>50    | ResNet-<br>152 | ResNet-<br>50                                 | ResNet-<br>101 | ResNet-<br>152     |
| Throughput<br>(GOPS)              | 1519             | 1295      | 1590             | 534              | 719              | 865             | 800              | 794            | 4006                                          | 4397           | 4568               |
| mults/multiplier 1<br>clock cycle | 0.645            | 0.550     | 0.639            | 0.206            | 0.696            | 0.837           | 0.977            | 0.969          | 1.674                                         | 1.837          | 1.908              |

1 Multiplier compute efficiency, measures how efficiently multipliers are utilized. It can surpass 1 in the proposed designs due to the algebraic enhancements.

[6] Shuanglong Liu et al. "Toward full-stack acceleration of deep convolutional neural networks on FPGAs". In: IEEE Trans. Neural Netw. Learn. Syst. 33.8 (2022), pp. 3974–3987

[7] Hongxiang Fan et al. "FPGA-based Acceleration for Bayesian Convolutional Neural Networks". In: IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. 41.12 (2022), pp. 5343–5356

[8] Jianjing An et al. "An OpenCL-Based FPGA Accelerator for Faster R-CNN". In: Entropy 24.10 (2022), p. 1346

[9] Kui Dai et al. "DCP-CNN: Efficient Acceleration of CNNs With Dynamic Computing Parallelism on FPGA". In: IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. (2024)

- Backround
- Deep Learning Accelerator System Architecture
- Fast Inner-Product Algorithms and Hardware Architectures
- Karatsuba Matrix Multiplication Algorithm and Hardware Architectures
- Strassen Hardware Architectures
- Conclusion and Future Work

Future Work

- Floating-point algorithms and architectures
- Non-systolic-array architectures
- Toom-Cook Matrix Multiplication
- Transformer acceleration

Future Work

- Floating-point algorithms and architectures
- Non-systolic-array architectures
- Toom-Cook Matrix Multiplication
- Transformer acceleration

Conclusion

- Contributes to the field of DL and MM acceleration through under-explored direction
- Proposes new efficient MM algorithms and/or their systolic-array hardware architectures
- Increases performance-per-area capabilities of hardware accelerators

Future Work

- Floating-point algorithms and architectures
- Non-systolic-array architectures
- Toom-Cook Matrix Multiplication
- Transformer acceleration

Conclusion

- Contributes to the field of DL and MM acceleration through under-explored direction
- Proposes new efficient MM algorithms and/or their systolic-array hardware architectures
- Increases performance-per-area capabilities of hardware accelerators

## Thank You! Questions?