Conference Paper/Proceeding/Abstract
Using Runahead Execution to Hide Memory Latency in High Level Synthesis
Shane Fleming,
David B. Thomas
2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Swansea University Author: Shane Fleming
PDF | Accepted Manuscript (2.72MB)
DOI (Published version): 10.1109/fccm.2017.33
Published in: 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
ISBN: 978-1-5386-4038-8, 978-1-5386-4037-1
Published: IEEE, 2017
URI: https://cronfa.swan.ac.uk/Record/cronfa57993
Abstract: Reads and writes to global data in off-chip RAM can limit the performance achieved with HLS tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be combated with data prefetchers, which hide access time by predicting the next memory access and loading it into a cache before it is required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking arrays, and are ineffective for those that use irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even when they have irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code and automatically constructs an application-specific prefetcher. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open-source HLS tool. We also create a theoretical model showing that the speedup must lie between 1x and 2x, and we evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.
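The slicing idea in the abstract can be illustrated with a hand-written sketch. The paper's tool derives the prefetcher automatically; here the "memory slice" is written by hand for a linked-list walk, an irregular access pattern a stride prefetcher cannot predict. All names (`node_t`, `sum_list`, `address_slice`) are illustrative assumptions, not taken from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical HLS-style kernel: sums values along a linked list whose
 * nodes may sit anywhere in off-chip RAM, so the next address is only
 * known after the current load completes. */
typedef struct node { int value; struct node *next; } node_t;

int sum_list(const node_t *head) {
    int sum = 0;
    for (const node_t *p = head; p != NULL; p = p->next)
        sum += p->value;   /* each dependent load blocks the state machine */
    return sum;
}

/* Hand-written "memory slice" of the kernel above: the minimal subset of
 * the program needed to reproduce its address stream (the pointer chase
 * only, no summation). A slicing-based prefetcher runs this ahead of the
 * main kernel to fill a cache; here it just records the addresses it
 * would issue. */
size_t address_slice(const node_t *head, const node_t **addrs, size_t max) {
    size_t n = 0;
    for (const node_t *p = head; p != NULL && n < max; p = p->next)
        addrs[n++] = p;    /* runs ahead, issuing prefetch addresses */
    return n;
}
```

Because the slice contains only the address-generating computation, it can run several iterations ahead of the full kernel, which is what hides the memory latency.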
College: Faculty of Science and Engineering
Funders: EPSRC