# Design Trade-offs for Machine Learning Solutions on Reconfigurable Devices

Michaela Blott, Principal Engineer, Xilinx Research, July 2018







Background – Xilinx Research

**Machine Learning** 

**Research Efforts** 

Summary & Outlook





#### Background – Xilinx Research




## Xilinx Research - Ireland

- > Established 13 years ago
- > Part of the worldwide CTO organization (8 out of 36)
- > AI Lab expansion part-financed through **IDA** Ireland
- > Increasingly external funding (H2020)

Kees Vissers, Fellow

Ivo Bolsens, CTO






## **Current Xlabs Dublin Team**

> Yaman Umuroglu, Ken O'Brien, Nick Fraser, Giulio Gambardella, Alessandro Pappalardo, Peter Ogden, Lucian Petrica, me (from left to right)

>> More faces to be added soon



> Plus 2 in Xilinx University Program (Cathal McCabe, Katy Hurley)





# **Plus a Very Active Internship Program**

#### > On average 4-6 interns at any given time

- >> From top universities all over the world
- >> We are always looking for talent ;-)

#### > Overall

- >> 67 interns since 2007
- >> Many collaborations have come from this
- >> Many found employment





## **Mission: Application-driven technology development**



- > Identify strategic applications
- > Derisk emerging technologies
- > In partnership with universities, customers, and partners
- > Current Focus:
  - >> **Quantifying value proposition for FPGAs in Machine Learning**
  - >> Prototyping, test-driving, benchmarking



#### Machine Learning

### New York Times: "The Great A.I. Awakening" (Dec 2016)

**Elon Musk's** Billion-Dollar AI Plan Is About Far More Than Saving the World

The Race For AI: Google, Twitter, Intel, Apple In A Rush To Grab Artificial Intelligence Startups

World's Largest Hedge Fund to Replace Managers with an AI System

Drones Can Defeat Humans Using Artificial Intelligence

## ELON MUSK'S BILLION-DOLLAR CRUSADE TO STOP THE A.I. Apocalypse

Elon Musk is famous for his futuristic gambles, but Silicon Valley's latest rush to embrace artificial intelligence scares him. And he thinks you should be frightened too. Inside his efforts to influence the rapidly advancing field and its proponents, and to save humanity from machine-learning overlords.

BY MAUREEN DOWD

# **Convolutional Neural Networks (CNNs)**

#### > CNNs are the predominant ML algorithm used

- >> Mimics the human brain
- >> Works very well for image classification, speech recognition

#### > NNs are the "universal approximation function"

- >> If you make it big enough and train it with enough data
- >> Can outperform humans on specific tasks

#### > Requires zero domain expertise

#### > Will increasingly replace other algorithms and solve previously unsolved problems

- >> unless, for example, simple rules can describe the problem



## Machine Learning will help address the Grand Engineering Challenges of the 21st Century (NAE)

- > Make solar energy economical
- > Reverse-engineer the human brain
- > Secure cyberspace

> ...

- > Restore & improve urban infrastructure
- > Engineer better medicines
- > Advance health informatics



Jeff Dean, Google @ Strata Data Conference, 2018

"I actually think machine learning is going to help with all of these," the legendary computer scientist said. "I think there are actually going to be significant breakthroughs in some of these Grand Challenges that are at least in part fueled by the fact that we now have machine learning at scale with many of these techniques that can really push us forward in the areas of computer vision, language understanding, speech recognition, and automating and solving engineering problems."

# What is the Challenge?





## Challenges

#### > Challenge 1:

- >> The predominant CNN computation is simple linear algebra
- >> Yet a huge amount of compute and memory is required





## **Example Inference**



For ResNet50:

- 70 layers
- 7.7 billion operations
- 25.5 Mbytes of weight storage\*
- 10.1 Mbytes for tensors\*



# **Training – 1 Image**



For ResNet50:

- 23 billion operations
- Weights, weight gradients, updates: 303 Mbytes of storage (3-5x)
- Tensors, gradients: 80 Mbytes for tensors

© Copyright 2018 Xilinx

# Training – 1.2 Million Images for 1 epoch



# Training – 100 Epochs



For inference: billions of operations and 10s of megabytes

For training: quintillions of operations and 100s of megabytes
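The jump from billions to quintillions follows from multiplying the per-image training cost by the dataset size and the number of epochs. The sketch below reproduces that arithmetic with the figures quoted on these slides (23 billion operations per image, 1.2 million images, 100 epochs). The function name `training_ops` is illustrative, and the assumption that every image costs the same in every epoch is a simplification.

```cpp
#include <cassert>

// Rough operation count for a full training run.
// Assumption: every image costs the same number of operations in every epoch.
inline double training_ops(double ops_per_image, double images_per_epoch, int epochs) {
    assert(ops_per_image >= 0.0 && images_per_epoch >= 0.0 && epochs > 0);
    return ops_per_image * images_per_epoch * epochs;
}
```

With the slide figures, `training_ops(23e9, 1.2e6, 100)` comes to roughly 2.76e18 operations, which is where "quintillions" comes from; a single epoch is about 2.76e16.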

# On a Collision Course with the End of Moore's Law



#### > Compute performance is no longer scaling and is becoming more expensive


## Challenges

#### > Challenge 1:

>> Challenging compute and memory requirements

#### > Challenge 2:

- >> Complicated design space
- >> Huge variation in applications, requirements and design targets



# **C2: Many Applications Require Different Networks**

ADAS




# **C2: Huge Variation in Memory and Compute**



## C2: Different Use Cases, Different Design Targets

Accuracy, speed, power, latency, cost



#### > ADAS:

- >> Accuracy
- >> High throughput



#### > Hearing aids:

- >> Low power
- >> Very low latency
- >> Low throughput



#### > AR:

- >> High throughput
- >> Low latency
- >> Low power



#### > 3D reconstruction of HR images:

- >> High throughput
- >> Offline



## Challenges

#### > Challenge 1:

>> Challenging compute and memory requirements

#### > Challenge 2:

>> Huge variation in applications, requirements and design targets

#### > Challenge 3:

>> Neural Networks Change @ Increasing Rate





# C3: Neural Networks Change @ Increasing Rate

#### > Graph connectivity, number and types of layers are changing







Ce Zhang, ETH Zurich, Systems Retreat 2018


## **Challenges in Summary**

#### > Machine Learning is a very demanding use case: compute and memory intensive

>> High variation

#### > Complicated design space

- >> Different applications
- >> Different and changing algorithms
- >> Different figures of merit

#### > Changing requirements



#### > Need to be addressed through architectural and algorithmic innovation

# Spectrum of New Architectures for Deep Learning

**Exciting Times in Computer Architecture Research!** 




## **Spectrum of New Architectures for Deep Learning: Efficiency vs Flexibility**







#### Research Efforts

## **Our Research Effort**

#### > Changing the neural network algorithm by reducing data type precision to provide performance scalability and compute efficiency

>> Numerical representations, precision, quantization

#### > Customizing architecture to hit specific design targets

>> On micro and macro level

#### > Providing ease of use through an automated tool flow (FINN) and open source platforms (PYNQ and AWS)
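To make "reducing precision in data types" concrete, here is a minimal sketch of symmetric uniform quantization to a signed n-bit integer. The slides do not say which quantization scheme was used, so the scheme, the function names `quantize` and `dequantize`, and the `max_abs` clipping range are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Map a real value to a signed n-bit integer code (symmetric, uniform steps).
// Values outside [-max_abs, max_abs] are clipped to the nearest end of the range.
inline int32_t quantize(float x, float max_abs, int bits) {
    assert(bits >= 2 && bits <= 16 && max_abs > 0.0f);
    const int32_t qmax = (1 << (bits - 1)) - 1;            // e.g. 127 for 8 bits
    const float scale = max_abs / static_cast<float>(qmax); // size of one step
    const int32_t q = static_cast<int32_t>(std::lround(x / scale));
    return std::min(qmax, std::max(-qmax, q));
}

// Map an integer code back to the real value it represents.
inline float dequantize(int32_t q, float max_abs, int bits) {
    assert(bits >= 2 && bits <= 16 && max_abs > 0.0f);
    const int32_t qmax = (1 << (bits - 1)) - 1;
    return static_cast<float>(q) * max_abs / static_cast<float>(qmax);
}
```

Fewer bits mean fewer distinct codes and a coarser step, which is the source of both the hardware savings and the accuracy loss discussed on the following slides.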



# **Reducing Precision** *Scales Performance & Reduces Memory*

#### > Reducing precision shrinks LUT cost

>> Instantiate **100x** more compute within the same fabric

#### > Potential to reduce memory footprint

>> NN model can stay on-chip => no memory bottlenecks

| Precision | Model size [MB] (ResNet50) |
|-----------|----------------------------|
| 1b        | 3.2                        |
| 8b        | 25.5                       |
| 32b       | 102.5                      |
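The table scales linearly with the number of bits per weight. A small sketch of that relationship follows, assuming roughly 25.5 million weights for ResNet50 (inferred from the 8b row at one byte per weight) and 1 MB = 10^6 bytes. The function name `model_size_mb` is illustrative.

```cpp
#include <cassert>

// Weight storage in megabytes (1 MB = 1e6 bytes) for a given precision.
inline double model_size_mb(double num_weights, int bits_per_weight) {
    assert(num_weights >= 0.0 && bits_per_weight > 0);
    return num_weights * bits_per_weight / 8.0 / 1e6;
}
```

With 25.5e6 weights this gives about 3.19 MB at 1b, 25.5 MB at 8b and 102.0 MB at 32b. The 32b value is close to, but not exactly, the 102.5 in the table, presumably because the exact parameter count is rounded here.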



## Reducing Precision Provides Performance Scalability

Example: ResNet50, ResNet152 and TinyYolo




## **Reduced Precision Inherently Saves Power**



Target device: ZU7EV • Ambient temperature: 25 °C • Toggle rate: 12.5% • Static probability: 0.5 • Power reported for PL accelerated block only



Source: Bill Dally (Stanford), Cadence Embedded Neural Network Summit, February 1, 2017



# What are the downsides of reduced precision?




# **RPNNs: Closing the Accuracy Gap**



## Retraining: From Floating Point to Reduced Precision NNs




## How to recuperate accuracy?



- > Recuperate accuracy by increasing network size
- > Topological changes
- > New training techniques
  - >> Knowledge distillation



## Automation & Customization





## **FINN: Custom-Tailored Hardware Architectures**

- > Customized feed-forward dataflow architecture to match network topology
- > Customized to meet design requirements
- > Customized data types (n-bit)
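At the 1-bit end of the customized data types, a multiply-accumulate reduces to bitwise logic. The sketch below shows the XNOR-popcount formulation commonly used for binarized networks, where each weight and activation is -1 or +1 and is stored as a single bit. The encoding (bit 1 for +1, bit 0 for -1), the 64-element limit and the name `binary_dot` are assumptions for illustration, not the FINN hardware implementation.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Dot product of two {-1,+1} vectors of length n (n <= 64), each packed into
// a 64-bit word with element i in bit i: bit 1 encodes +1, bit 0 encodes -1.
inline int binary_dot(uint64_t a, uint64_t b, int n) {
    assert(n >= 1 && n <= 64);
    // Keep only the n bits that hold vector elements.
    const uint64_t mask = (n == 64) ? ~uint64_t{0} : ((uint64_t{1} << n) - 1);
    // XNOR: a bit is 1 wherever the two elements have the same sign.
    const uint64_t agree = ~(a ^ b) & mask;
    const int matches = static_cast<int>(std::bitset<64>(agree).count());
    // Each match contributes +1 and each mismatch -1.
    return 2 * matches - n;
}
```

For example, `binary_dot(0xB, 0x9, 4)` compares (+1, +1, -1, +1) with (+1, -1, -1, +1), finds three matching positions and returns 2. On an FPGA the XNOR and the popcount map to lookup tables rather than multipliers, which is why low precision shrinks the logic cost.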



## Automatically generated from CNN description

- > Uses a synthesizable C++ NN description
- > Enables flexibility & scalability and supports portability, rapid exploration

#### Synthesizable CNN Description

```cpp
// Top-level function: each call below becomes one stage of a streaming pipeline.
void DoCompute(ap_uint<64> * in, ap_uint<64> * out) {
#pragma HLS DATAFLOW
  stream<ap_uint<64> > memInStrm("memInStrm");
  stream<ap_uint<64> > InStrm("InStrm");
  ...
  stream<ap_uint<64> > memOutStrm("memOutStrm");
  // Read the input from memory into a stream.
  Mem2Stream<64, inBytesPadded>(in, memInStrm);
  // One matrix-vector stage per layer, each with its own template parameters.
  StreamingMatrixVector<L0_SIMD, L0_PE, 16, L0_MW, L0_MH, L0_WMEM, L0_TMEM>
      (InStrm, inter0, weightMem0, thresMem0);
  StreamingMatrixVector<L1_SIMD, L1_PE, 16, L1_MW, L1_MH, L1_WMEM, L1_TMEM>
      (inter0, inter1, weightMem1, thresMem1);
  StreamingMatrixVector<L2_SIMD, L2_PE, 16, L2_MW, L2_MH, L2_WMEM, L2_TMEM>
      (inter1, inter2, weightMem2, thresMem2);
  StreamingMatrixVector<L3_SIMD, L3_PE, 16, L3_MW, L3_MH, L3_WMEM, L3_TMEM>
      (inter2, outstream, weightMem3, thresMem3);
  StreamingCast<ap_uint<16>, ap_uint<64> >(outstream, memOutStrm);
  // Write the output stream back to memory.
  Stream2Mem<64, outBytesPadded>(memOutStrm, out);
}
```



## Numerous Platforms – From Embedded to Cloud



### **Numerous Platforms**




## **Numerous Test Networks**

- > Multilayer Perceptron (1b weights, 1b act) (MNIST)
  - >> Up to 5.8 MOPS/frame
- > VGG-16 derivative (1b weights, 1b act) (SVHN, CIFAR-10, traffic signs, playing cards)
  - >> Up to 1.2 GOPS/frame
- > DorefaNet AlexNet derivative (mostly 1b weights, 2b act) (ImageNet)
  - >> Up to 3.9 GOPS/frame
- > YoloV2, Yolo9000, TinyYolo (1b weights, 8b act) (VOC, COCO)
  - >> 34.9, 19 and 7.0 GOPS/frame
- > LSTM for OCR on Fraktur





## **Design Trade-offs with Reduced Precision NNs**





#### > Performance

- >> VOC object recognition: quantized TinyYolo @ 55 fps @ 7 W (batch=1) for embedded (ZU3EG)
- >> ImageNet classification: DorefaNet @ 11 TOPS on AWS F1 instance
- >> Binary operations scaled to 51 TOPS on AWS F1 and 5.2 TOPS on ZU3EG, 1000x over Raspberry Pi

#### > Energy efficiency: measured 433 GOPS/W

#### > Flexibility & Scalability

- >> Different platforms can easily be targeted from embedded to cloud
- >> Different use cases, networks & training data sets

#### > While being sufficiently accurate

>> <10% top-5 error for ImageNet classification




#### Summary & Outlook



- > ML has the potential to address many of the grand engineering challenges of this century
- > However, compute & memory requirements are huge and flexibility and scalability are key
- > New, customized computer architectures are emerging
- > FPGAs can play an important role here, in particular in conjunction with reduced precision and customized macro architectures
  - >> Orders of magnitude improvement in performance, resources and power consumption



## Exciting Times for our Community: Finding Optimal Solutions within a Complex Design Space








## Architecture Exploration

• Help understand the choices!






- FPGA 2017: FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. https://arxiv.org/abs/1612.07119
- PARMA-DITAM 2017: Scaling Binarized Neural Networks on Reconfigurable Logic. https://arxiv.org/abs/1701.03400
- ICCD 2017: Scaling Neural Network Performance through Customized Hardware Architectures on Reconfigurable Logic. https://ieeexplore.ieee.org/abstract/document/8119246
- H2RC 2016: A C++ Library for Rapid Exploration of Binary Neural Networks on Reconfigurable Logic. https://h2rc.cse.sc.edu/2016/papers/paper\_25.pdf
- ICONIP'2017: Compressing Low Precision Deep Neural Networks Using Sparsity-Induced Regularization in Ternary Networks. https://arxiv.org/abs/1709.06262
- CVPR'2018: SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks
- DATE 2018: Inference of quantized neural networks on heterogeneous all-programmable devices. https://ieeexplore.ieee.org/abstract/document/8342121/
- ARC'2018: Accuracy Throughput Tradeoffs for Reduced Precision Neural Networks
