Programmable and Energy Efficient Extreme-Scale Processors

Overview:

Today, integrating 12-16 state-of-the-art cores or tens of smaller cores on a single chip is commonplace. Since Moore's Law scaling is expected to continue for the foreseeable future, processors with 1000+ cores will eventually become possible. This project is investigating techniques for supporting such extreme-scale processors. A major focus of the project is to develop new evaluation methodologies, such as multicore reuse distance (RD) analysis, for rapidly assessing extreme-scale processors. (Click here for more details on multicore RD analysis.) Another focus of the project is to develop software and architectural support for extreme-scale processors, such as cache management and reconfiguration techniques, locality optimizations, and implicit synchronization techniques. Recently, the project has also begun looking at heterogeneous microprocessors in which both CPU cores and GPU cores are integrated on the same chip.

People:

Faculty

Donald Yeung

Alumni

Mike Badamo

Abdel-Hameed A. Badawy

Jeff Casarona

Inseok Choi

Daniel Gerzhoy

Wanli Liu

Lisa Stechschulte

Xiaowu Sun

Meng-Ju Wu

Xu Yang

Minshu Zhao

Mike Zuzak

Publications:

Daniel Gerzhoy and Donald Yeung. Pipelined CPU-GPU Scheduling to Reduce Main Memory Accesses. Appears in Proceedings of the 7th International Symposium on Memory Systems. Virtual Conference. October-December 2021. (pdf)

Earlier Related Tech Report:
Daniel Gerzhoy and Donald Yeung. Pipelined CPU-GPU Scheduling for Caches. University of Maryland Institute for Advanced Computer Studies Technical Report, UMIACS-TR-2021-01. March 2021. (pdf)

Daniel Gerzhoy, Xiaowu Sun, Michael Zuzak, and Donald Yeung. Nested MIMD-SIMD Parallelization for Heterogeneous Microprocessors. ACM Transactions on Architecture and Code Optimization. Vol. 16, No. 4, Article 48. December 2019.
(ACM digital library distribution)

Earlier Related Workshop Paper:
Michael Zuzak and Donald Yeung. Exploiting Multi-Loop Parallelism on Heterogeneous Microprocessors. In Proceedings of the 10th International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2017), held in conjunction with HiPEAC-12. Stockholm, Sweden. January 2017. Best paper award. (pdf)

Earlier Related Tech Report:
Michael Zuzak and Donald Yeung. Exploiting Multi-Loop Parallelism on Heterogeneous Microprocessors. University of Maryland Institute for Advanced Computer Studies Technical Report, UMIACS-TR-2016-01. (pdf)

Minshu Zhao and Donald Yeung. Using Multicore Reuse Distance to Study Coherence Directories. ACM Transactions on Computer Systems. Vol. 35, No. 2. Article 4. October 2017.
(ACM digital library distribution)

Earlier Related Conference Paper:
Minshu Zhao and Donald Yeung. Studying the Impact of Multicore Processor Scaling on Directory Techniques via Reuse Distance Analysis. In Proceedings of the 21st International Symposium on High Performance Computer Architecture (HPCA-XXI). San Francisco Bay Area, CA. February 2015. (pdf, gzip'd postscript)

Earlier Related Tech Report:
Minshu Zhao and Donald Yeung. Studying Directory Access Patterns via Reuse Distance Analysis and Evaluating Their Impact on Multi-Level Directory Caches. University of Maryland Institute for Advanced Computer Studies Technical Report, UMIACS-TR-2014-01. January 2014. (pdf)

Abdel-Hameed A. Badawy and Donald Yeung. Optimizing Locality in Graph Computations using Reuse Distance Profiles. In Proceedings of the 36th International Performance Computing and Communications Conference. San Diego, CA. December 2017.
(IEEE digital library distribution)

Earlier Related Journal Paper:
Abdel-Hameed A. Badawy and Donald Yeung. Guiding Locality Optimizations for Graph Computations via Reuse Distance Analysis. IEEE Computer Architecture Letters. Vol. 16, Issue 2. pp. 119-122. July-December 2017.
(IEEE digital library distribution)

I. Stephen Choi and Donald Yeung. Multi-Cache Resizing via Greedy Coordinate Descent. Journal of Supercomputing. Vol. 73, No. 6. pp. 2402-2429. June 2017.
(Springer digital library distribution)

Earlier Related Tech Report:
Inseok Choi and Donald Yeung. Symbiotic Cache Resizing for CMPs with Shared LLC. University of Maryland Institute for Advanced Computer Studies Technical Report, UMIACS-TR-2013-02. September 2013. (pdf)

Mike Badamo, Jeff Casarona, Minshu Zhao, and Donald Yeung. Identifying Power Efficient Multicore Cache Hierarchies via Reuse Distance Analysis. ACM Transactions on Computer Systems. Vol. 34, No. 1. Article 3. pp. 1-30. April 2016. (pdf)

Meng-Ju Wu, Minshu Zhao, and Donald Yeung. Studying Multicore Processor Scaling via Reuse Distance Analysis. In Proceedings of the 40th International Symposium on Computer Architecture (ISCA-XL). Tel-Aviv, Israel. June 2013. (pdf, gzip'd postscript)

Meng-Ju Wu and Donald Yeung. Identifying Optimal Multicore Cache Hierarchies for Loop-based Parallel Programs via Reuse Distance Analysis. In Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC-2012). Beijing, China. June 2012. (pdf)

Earlier Related Tech Report:
Meng-Ju Wu and Donald Yeung. Understanding Multicore Cache Behavior of Loop-based Parallel Programs via Reuse Distance Analysis. University of Maryland Institute for Advanced Computer Studies Technical Report, UMIACS-TR-2012-01. January 2012. (pdf)

Meng-Ju Wu and Donald Yeung. Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs. ACM Transactions on Computer Systems. Vol. 31, No. 1. Article 1. pp. 1-37. February 2013. (pdf)

Earlier Related Conference Paper:
Meng-Ju Wu and Donald Yeung. Coherent Profiles: Enabling Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs. In Proceedings of the 20th International Conference on Parallel Architectures and Compilation Techniques. Galveston Island, TX. October 2011. (pdf, gzip'd postscript)

Earlier Related Tech Report:
Meng-Ju Wu and Donald Yeung. Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance. University of Maryland Institute for Advanced Computer Studies Technical Report, UMIACS-TR-2010-10. October 2010. (pdf)

Eric Lau, Jason Miller, Inseok Choi, Donald Yeung, Saman Amarasinghe, and Anant Agarwal. Multicore Performance Optimization Using Partner Cores. In Proceedings of the 3rd USENIX Workshop on Hot Topics in Parallelism (HotPar '11). Berkeley, CA. May 2011. (pdf)

Inseok Choi, Minshu Zhao, Xu Yang, and Donald Yeung. Experience with Improving Distributed Shared Cache Performance on Tilera's Tile Processor. IEEE Computer Architecture Letters. Vol. 10, No. 2. July-December 2011. (pdf, gzip'd postscript)

Earlier Related Workshop Paper:
Inseok Choi, Minshu Zhao, Xu Yang, and Donald Yeung. Early Experience with Profiling and Optimizing Distributed Shared Cache Performance on Tilera's Tile Processor. In Proceedings of the 6th International Workshop on Unique Chips and Systems. Atlanta, GA. December 2010. One of 2 best papers out of 12 papers appearing in the workshop. (pdf, gzip'd postscript)

Wanli Liu and Donald Yeung. Using Aggressor Thread Information to Improve Shared Cache Management for CMPs. In Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques. Raleigh, NC. September 2009. (pdf, gzip'd postscript)

Earlier Related Tech Report:
Wanli Liu and Donald Yeung. Probabilistic Replacement: Enabling Flexible Use of Shared Caches for CMPs. University of Maryland Institute for Advanced Computer Studies Technical Report, UMIACS-TR-2008-13. July 2008. (pdf)

ACM permission notice:
The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

ACM copyright notice:
Copyright © 2013 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page in print or the first screen in digital media. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Send written requests for republication to ACM Publications, Copyright & Permissions at the address above or fax +1 (212) 869-0481 or email permissions@acm.org. For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Funding:

This project is funded in part by the National Science Foundation under grants #CCF-1117042 and #CCF-1618963, in part by the Defense Advanced Research Projects Agency under contracts #HR0011-10-9-0009 and #HR0011-13-2-0005, and in part by the National Reconnaissance Office.

Last updated: August 2021 by Donald Yeung (yeung@umd.edu)