Conference Paper/Proceeding/Abstract

General hardware multicasting for fine-grained message-passing architectures

Matthew Naylor, Simon W. Moore, David Thomas, Jonathan R. Beaumont, Shane Fleming, Mark Vousden, A. Theodore Markettos, Thomas Bytheway, Andrew Brown

2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Pages: 126 - 133

Swansea University Author: Shane Fleming

Abstract

Manycore architectures are increasingly favouring message-passing or partitioned global address spaces (PGAS) over cache coherency for reasons of power efficiency and scalability. However, in the absence of cache coherency, there can be a lack of hardware support for one-to-many communication patter...

Published in: 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
ISBN: 978-1-6654-4764-5 (print), 978-1-6654-1455-5 (electronic)
ISSN: 1066-6192 (print), 2377-5750 (electronic)
Published: IEEE 2021

URI: https://cronfa.swan.ac.uk/Record/cronfa56452
first_indexed 2021-03-16T10:02:35Z
last_indexed 2025-10-14T07:00:38Z
id cronfa56452
recordtype SURis
fullrecord <?xml version="1.0"?><rfc1807><datestamp>2025-10-13T16:50:04.1954130</datestamp><bib-version>v2</bib-version><id>56452</id><entry>2021-03-16</entry><title>General hardware multicasting for fine-grained message-passing architectures</title><swanseaauthors><author><sid>fe23ad3ebacc194b4f4c480fdde55b95</sid><firstname>Shane</firstname><surname>Fleming</surname><name>Shane Fleming</name><active>true</active><ethesisStudent>false</ethesisStudent></author></swanseaauthors><date>2021-03-16</date><deptcode>MACS</deptcode><abstract>Manycore architectures are increasingly favouring message-passing or partitioned global address spaces (PGAS) over cache coherency for reasons of power efficiency and scalability. However, in the absence of cache coherency, there can be a lack of hardware support for one-to-many communication patterns, which are prevalent in some application domains. To address this, we present new hardware primitives for multicast communication in rack-scale manycore systems. These primitives guarantee delivery to both colocated and distributed destinations, and can capture large unstructured communication patterns precisely. As a result, reliable multicast transfers among any number of software tasks, connected in any topology, can be fully offloaded to hardware.
We implement the new primitives in a research platform consisting of 50K RISC-V threads distributed over 48 FPGAs, and demonstrate significant performance benefits on a range of applications expressed using a high-level vertex-centric programming model.</abstract><type>Conference Paper/Proceeding/Abstract</type><journal>2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)</journal><volume/><journalNumber/><paginationStart>126</paginationStart><paginationEnd>133</paginationEnd><publisher>IEEE</publisher><placeOfPublication/><isbnPrint>978-1-6654-4764-5</isbnPrint><isbnElectronic>978-1-6654-1455-5</isbnElectronic><issnPrint>1066-6192</issnPrint><issnElectronic>2377-5750</issnElectronic><keywords>Scalability, Computer architecture, Multicast communication, System recovery, Hardware, Software, Topology</keywords><publishedDay>1</publishedDay><publishedMonth>3</publishedMonth><publishedYear>2021</publishedYear><publishedDate>2021-03-01</publishedDate><doi>10.1109/pdp52278.2021.00028</doi><url/><notes/><college>COLLEGE NANME</college><department>Mathematics and Computer Science School</department><CollegeCode>COLLEGE CODE</CollegeCode><DepartmentCode>MACS</DepartmentCode><institution>Swansea University</institution><apcterm>Not Required</apcterm><funders>This work was supported by UK EPSRC grant EP/N031768/1 (POETS project).</funders><projectreference/><lastEdited>2025-10-13T16:50:04.1954130</lastEdited><Created>2021-03-16T09:57:33.4741858</Created><path><level id="1">Faculty of Science and Engineering</level><level id="2">School of Mathematics and Computer Science - Computer Science</level></path><authors><author><firstname>Matthew</firstname><surname>Naylor</surname><order>1</order></author><author><firstname>Simon W.</firstname><surname>Moore</surname><order>2</order></author><author><firstname>David</firstname><surname>Thomas</surname><order>3</order></author><author><firstname>Jonathan R.</firstname><surname>Beaumont</surname><order>4</order></author><author><firstname>Shane</firstname><surname>Fleming</surname><order>5</order></author><author><firstname>Mark</firstname><surname>Vousden</surname><order>6</order></author><author><firstname>A. Theodore</firstname><surname>Markettos</surname><order>7</order></author><author><firstname>Thomas</firstname><surname>Bytheway</surname><order>8</order></author><author><firstname>Andrew</firstname><surname>Brown</surname><order>9</order></author></authors><documents><document><filename>56452__19490__b5dc3b7b98bd4556a097a04cab8d08ce.pdf</filename><originalFilename>pdp2021-mcast-draft.pdf</originalFilename><uploaded>2021-03-16T10:01:48.8252523</uploaded><type>Output</type><contentLength>235400</contentLength><contentType>application/pdf</contentType><version>Accepted Manuscript</version><cronfaStatus>true</cronfaStatus><copyrightCorrect>true</copyrightCorrect><language>eng</language></document></documents><OutputDurs/></rfc1807>
spelling 2025-10-13T16:50:04.1954130 v2 56452 2021-03-16 General hardware multicasting for fine-grained message-passing architectures fe23ad3ebacc194b4f4c480fdde55b95 Shane Fleming Shane Fleming true false 2021-03-16 MACS Manycore architectures are increasingly favouring message-passing or partitioned global address spaces (PGAS) over cache coherency for reasons of power efficiency and scalability. However, in the absence of cache coherency, there can be a lack of hardware support for one-to-many communication patterns, which are prevalent in some application domains. To address this, we present new hardware primitives for multicast communication in rack-scale manycore systems. These primitives guarantee delivery to both colocated and distributed destinations, and can capture large unstructured communication patterns precisely. As a result, reliable multicast transfers among any number of software tasks, connected in any topology, can be fully offloaded to hardware. We implement the new primitives in a research platform consisting of 50K RISC-V threads distributed over 48 FPGAs, and demonstrate significant performance benefits on a range of applications expressed using a high-level vertex-centric programming model. Conference Paper/Proceeding/Abstract 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) 126 133 IEEE 978-1-6654-4764-5 978-1-6654-1455-5 1066-6192 2377-5750 Scalability, Computer architecture, Multicast communication, System recovery, Hardware, Software, Topology 1 3 2021 2021-03-01 10.1109/pdp52278.2021.00028 COLLEGE NANME Mathematics and Computer Science School COLLEGE CODE MACS Swansea University Not Required This work was supported by UK EPSRC grant EP/N031768/1 (POETS project). 2025-10-13T16:50:04.1954130 2021-03-16T09:57:33.4741858 Faculty of Science and Engineering School of Mathematics and Computer Science - Computer Science Matthew Naylor 1 Simon W. Moore 2 David Thomas 3 Jonathan R. Beaumont 4 Shane Fleming 5 Mark Vousden 6 A. Theodore Markettos 7 Thomas Bytheway 8 Andrew Brown 9 56452__19490__b5dc3b7b98bd4556a097a04cab8d08ce.pdf pdp2021-mcast-draft.pdf 2021-03-16T10:01:48.8252523 Output 235400 application/pdf Accepted Manuscript true true eng
title General hardware multicasting for fine-grained message-passing architectures
spellingShingle General hardware multicasting for fine-grained message-passing architectures
Shane Fleming
title_short General hardware multicasting for fine-grained message-passing architectures
title_full General hardware multicasting for fine-grained message-passing architectures
title_fullStr General hardware multicasting for fine-grained message-passing architectures
title_full_unstemmed General hardware multicasting for fine-grained message-passing architectures
title_sort General hardware multicasting for fine-grained message-passing architectures
author_id_str_mv fe23ad3ebacc194b4f4c480fdde55b95
author_id_fullname_str_mv fe23ad3ebacc194b4f4c480fdde55b95_***_Shane Fleming
author Shane Fleming
author2 Matthew Naylor
Simon W. Moore
David Thomas
Jonathan R. Beaumont
Shane Fleming
Mark Vousden
A. Theodore Markettos
Thomas Bytheway
Andrew Brown
format Conference Paper/Proceeding/Abstract
container_title 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
container_start_page 126
publishDate 2021
institution Swansea University
isbn 978-1-6654-4764-5
978-1-6654-1455-5
issn 1066-6192
2377-5750
doi_str_mv 10.1109/pdp52278.2021.00028
publisher IEEE
college_str Faculty of Science and Engineering
hierarchytype
hierarchy_top_id facultyofscienceandengineering
hierarchy_top_title Faculty of Science and Engineering
hierarchy_parent_id facultyofscienceandengineering
hierarchy_parent_title Faculty of Science and Engineering
department_str School of Mathematics and Computer Science - Computer Science{{{_:::_}}}Faculty of Science and Engineering{{{_:::_}}}School of Mathematics and Computer Science - Computer Science
document_store_str 1
active_str 0
description Manycore architectures are increasingly favouring message-passing or partitioned global address spaces (PGAS) over cache coherency for reasons of power efficiency and scalability. However, in the absence of cache coherency, there can be a lack of hardware support for one-to-many communication patterns, which are prevalent in some application domains. To address this, we present new hardware primitives for multicast communication in rack-scale manycore systems. These primitives guarantee delivery to both colocated and distributed destinations, and can capture large unstructured communication patterns precisely. As a result, reliable multicast transfers among any number of software tasks, connected in any topology, can be fully offloaded to hardware. We implement the new primitives in a research platform consisting of 50K RISC-V threads distributed over 48 FPGAs, and demonstrate significant performance benefits on a range of applications expressed using a high-level vertex-centric programming model.
published_date 2021-03-01T04:49:41Z
_version_ 1851457830820773888
score 11.089572