Monolithic 3D Integration of CPU and Main Memory Systems

Overview:

Certain emerging non-volatile memory technologies, such as resistive RAM (ReRAM), are compatible with standard CMOS logic processes, so it may be possible to integrate them directly into the die of a CPU. This project investigates such monolithically integrated CPU-main memory chips. Like stacking DRAM dies over a logic die, monolithic 3D integration of main memory uses the vertical dimension to place main memory in close physical proximity to compute logic. But because there are no die crossings between the cores and an on-die main memory system, much higher wiring density can be achieved, enabling a massively parallel connection to main memory. The result is higher memory bandwidth and less data movement.

People:

Faculty

Donald Yeung

Bruce Jacob

Martin Peckerar

Students

Rachid Jamil

Yinuo Wang

Hung-Yu Yeh

Alumni

Meenatchi Jagasivamani

Luyi Kang

Shang Li

Xiangyu Mao

Brendan Sheehy

Devesh Singh

Candace Walden

We are collaborating with IBM Research. Our contacts at IBM include Dirk Pfeiffer and Takashi Ando.

We are also collaborating with GlobalFoundries. Our contact at GF is Claudia Kretzschmar.

Publications:

Devesh Singh and Donald Yeung. MORSE: Memory Overwrite Time Guided Soft Writes to Improve ReRAM Energy and Endurance. In Proceedings of the 33rd International Conference on Parallel Architectures and Compilation Techniques. Long Beach, CA. October 2024.
(pdf)

Earlier Related Tech Report:
Devesh Singh and Donald Yeung. SRTP: Predicting Store Reuse Time to Improve ReRAM Energy and Endurance. University of Maryland Institute for Advanced Computer Studies Technical Report, UMIACS-TR-2022-01. May 2022.
(pdf)

Martin Peckerar, Po-Chun Huang, Rachid Ahmad Jamil, Bruce Jacob, and Donald Yeung. Critical Issues in Advanced ReRAM Development. In Proceedings of the 9th International Symposium on Memory Systems. Alexandria, VA. October 2023.
(pdf)

Candace Walden, Devesh Singh, Meenatchi Jagasivamani, Shang Li, Luyi Kang, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung. Monolithically Integrating Non-Volatile Main Memory Over the Last-Level Cache. ACM Transactions on Architecture and Code Optimization. Vol. 18, No. 4, Article 48. July 2021.
(ACM digital library distribution)

Earlier Related Tech Report:
Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Shang Li, Luyi Kang, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung. Design and Evaluation of Monolithic Computers Implemented Using Crossbar ReRAM. University of Maryland Institute for Advanced Computer Studies Technical Report, UMIACS-TR-2019-01. July 2019.
(pdf)

Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Luyi Kang, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung. Tileable Monolithic ReRAM Memory Design. In Proceedings of the IEEE Symposium on Low-Power and High-Speed Chips and Systems. Tokyo, Japan. April 2020.
(pdf)

Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Luyi Kang, Shang Li, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung. Analyzing the Monolithic Integration of a ReRAM-based Main Memory into a CPU's Die. IEEE Micro (Special Issue on Monolithic 3D Architectures). Vol. 39, Issue 6. November/December 2019.
(IEEE digital library distribution)

Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Luyi Kang, Shang Li, Mehdi Asnaashari, Sylvain Dubois, Donald Yeung, and Bruce Jacob. Design for ReRAM-based Main-Memory Architectures. In Proceedings of the 5th International Symposium on Memory Systems. Washington, D.C. September 2019.
(pdf)

Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Luyi Kang, Shang Li, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung. Memory Systems Challenges in Realizing Monolithic Computers. In Proceedings of the 4th International Symposium on Memory Systems. National Harbor, MD. October 2018.
(pdf)

Research Summary:

This project is multi-disciplinary, involving research in devices, circuits, physical design, architecture, and applications.

Physical Design Research

ReRAM is fabricated in the upper metal layers of the die as part of back-end-of-line (BEOL) processing steps. Like 3D XPoint, ReRAM uses a "crosspoint architecture" in which selector devices, rather than per-cell access transistors, provide inter-cell isolation. While peripheral access circuitry, such as decoders and sense amplifiers, does require logic transistors, the majority of the area underneath ReRAM memory arrays is vacant. This presents an opportunity for a new form of 3D integration in which memory cells are fabricated directly over CPU logic in the same die. However, not all CPU logic is suitable for fine-grain integration with ReRAM memory arrays. The peripheral access circuitry associated with each ReRAM memory array can disrupt the layout of the CPU, especially the random logic comprising much of the CPU's datapath circuitry.

A potentially promising approach is to integrate the ReRAM memory system over the CPU's last-level cache. Like ReRAM, SRAM caches also consist of numerous memory arrays. It is natural to co-design the SRAM and ReRAM arrays such that one fits neatly underneath the other. For example, a 3D memory building block is illustrated below in which a cache mat consisting of two SRAM sub-arrays is physically integrated underneath two ReRAM sub-arrays. Routing of the address and data busses into and out of the co-designed arrays is still required, so the layout of the ReRAM peripheral access circuitry needs to accommodate those routing tracks. But the resulting routing congestion is considerably less than what would be incurred if the ReRAM were integrated over random logic.

Architecture Research

One of the research goals is to develop a CPU architecture that can make use of the massive memory-level parallelism afforded by monolithically integrated main memory systems. Currently, a large tiled CPU (illustrated below) is being considered. To exploit the memory-level parallelism capabilities of the on-die ReRAM, the tiled CPU is equipped with multithreaded cores and wide SIMD instructions. (Specifically, the CPU employs AVX-512, which can execute 8 double- or 16 single-precision FP operations at a time. AVX-512 also supports scatter-gather, exposing massive memory-level parallelism for irregular memory access patterns.)
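The lane counts quoted above follow from the 512-bit vector width, and a gather with irregular indices can touch many cache lines at once. The following is a minimal Python sketch of that arithmetic, not code from the project: the 64-byte cache-line size and the helper names (`lanes`, `gather`, `lines_touched`) are assumptions introduced here for illustration.

```python
# Illustrative sketch only; not code from the project.
VECTOR_BITS = 512        # AVX-512 register width in bits
CACHE_LINE_BYTES = 64    # assumed cache-line size (not stated in the text)


def lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one 512-bit vector."""
    return VECTOR_BITS // element_bits


def gather(memory: list, indices: list) -> list:
    """Emulate a gather: each lane loads from its own, independent index."""
    return [memory[i] for i in indices]


def lines_touched(indices: list, element_bytes: int) -> int:
    """Count the distinct cache lines that one gather touches."""
    return len({(i * element_bytes) // CACHE_LINE_BYTES for i in indices})


memory = [float(i) for i in range(1024)]
dense = list(range(8))                 # 8 consecutive doubles
sparse = [i * 100 for i in range(8)]   # 8 widely scattered doubles

print(lanes(64), lanes(32))            # lanes for double and single precision
print(gather(memory, sparse))          # values loaded by the irregular gather
print(lines_touched(dense, 8), lines_touched(sparse, 8))
```

In this sketch the eight consecutive doubles fall in a single cache line, while the eight scattered doubles fall in eight different lines. Each of those lines is a separate memory request that a highly parallel main memory could, in principle, service concurrently.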

Given a tiled CPU, it is natural to distribute the main memory system across tiles by integrating a portion of the ReRAM over each tile's local L2 slice, as shown below. (Each L2/main memory module is implemented using the 3D memory structure discussed above.) Moreover, in addition to the per-tile core, L2/main memory slice, and NOC router, each compute tile also includes its own ReRAM memory controller, providing the local core with a dedicated channel to a local portion of main memory. Each core can still access remote main memory modules across the on-chip network; however, data movement is virtually eliminated when the application's data can be partitioned across the main memory modules such that the cores' memory accesses are destined primarily for the local portion of main memory.
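The effect of data partitioning on locality can be sketched with a toy model. The Python below is only an illustration under stated assumptions: the tile count, the per-tile memory size, the contiguous address-to-tile mapping, and the function names `home_tile` and `local_fraction` are all hypothetical and are not taken from the project's design.

```python
# Toy model of per-tile main memory; all parameters are hypothetical.
NUM_TILES = 4            # assumed number of compute tiles
BYTES_PER_TILE = 1024    # assumed size of each tile's local memory portion
LINE = 64                # assumed access granularity in bytes


def home_tile(addr: int) -> int:
    """Tile whose local memory module holds this address (contiguous mapping)."""
    return (addr // BYTES_PER_TILE) % NUM_TILES


def local_fraction(accesses) -> float:
    """Fraction of (core_tile, addr) accesses served by the core's own tile."""
    accesses = list(accesses)
    if not accesses:
        return 0.0
    local = sum(1 for tile, addr in accesses if home_tile(addr) == tile)
    return local / len(accesses)


# Partitioned: each core touches only addresses in its own tile's memory.
partitioned = [
    (t, t * BYTES_PER_TILE + off)
    for t in range(NUM_TILES)
    for off in range(0, BYTES_PER_TILE, LINE)
]

# Shared: every core touches the whole address space.
shared = [
    (t, addr)
    for t in range(NUM_TILES)
    for addr in range(0, NUM_TILES * BYTES_PER_TILE, LINE)
]

print(local_fraction(partitioned))
print(local_fraction(shared))
```

With the partitioned pattern every access stays on the local tile, whereas with the shared pattern only one access in `NUM_TILES` does; the rest would have to cross the on-chip network.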

Circuits and Devices Research

Current commercial ReRAM technology is targeted at storage applications. An important research direction is to re-target the technology so that it is better suited for CPU main memory. One approach under consideration is to trade off retention, normally an important characteristic for non-volatile memories, in order to improve other characteristics that are crucial for CPU main memory, such as latency, energy, and endurance. While storage devices require high retention, CPU main memory does not. (For example, Crossbar's current ReRAM has a retention of 10 years, which is overkill for main memory.) The lowered requirement on retention permits reducing the strength and duration of writes, which will decrease the write energy and increase endurance.

While ReRAM is often considered a non-volatile memory technology, that characterization assumes that a current above a certain threshold is supplied during write cycles. As the yellow line in the graph below illustrates, when such "hard writes" are applied, a low resistance (SET) state can be retained for a long time after the write cycle. Unfortunately, such hard writes not only increase power consumption and delay but also greatly reduce the lifetime of the ReRAM devices. Alternatively, it is possible to limit the current during write cycles and still reach an acceptable state (red and blue lines). But in this case, the ReRAM is no longer non-volatile: the data is retained only for minutes before the device resets itself. An interesting possibility is to perform such "soft writes" for data that is written frequently, lessening the stress on those hot memory locations. Data that is written once, or written infrequently, can use "hard writes" to ensure the stability of the data over long windows of time.
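One way to picture the choice between soft and hard writes is as a threshold test on how soon the data is expected to be overwritten. The sketch below is a hypothetical illustration of that idea, not the policy used in this project or its publications: the class name `WritePolicy`, the 120-second retention figure, and the safety margin are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WritePolicy:
    """Hypothetical soft/hard write selector (illustration only)."""

    soft_retention_s: float      # assumed time a soft write reliably holds data
    safety_margin: float = 0.5   # assumed fraction of that retention to trust

    def choose(self, predicted_overwrite_s: float) -> str:
        """Pick a write mode from the predicted time until the next overwrite."""
        # A soft write is only safe if the data is expected to be rewritten
        # well before the weakly written state decays.
        if predicted_overwrite_s <= self.soft_retention_s * self.safety_margin:
            return "soft"
        return "hard"


# Assume soft writes hold data for about two minutes (a made-up figure).
policy = WritePolicy(soft_retention_s=120.0)

print(policy.choose(1.0))      # hot location, rewritten within a second
print(policy.choose(3600.0))   # cold location, rewritten after an hour
```

In this sketch a frequently rewritten location receives a soft write and a rarely rewritten one receives a hard write. A real policy would also need a way to predict overwrite times and to handle mispredictions, for example by refreshing soft-written data before it decays.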

Fabrication of test devices at the UMD Nano Fabrication Laboratory is underway to explore the possibilities of using ReRAM as a pseudo-volatile memory. Detailed characterization of single devices will be performed to determine the optimal write trajectory, as well as the behavior of the device under insufficient write current. ReRAM array structures will be scanned under X-ray photoelectron spectroscopy (XPS) after heavy cycling to study the causes of and possible remedies for low endurance.

ACM permission notice:
The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

ACM copyright notice:
Copyright © 2013 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page in print or the first screen in digital media. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Send written requests for republication to ACM Publications, Copyright & Permissions at the address above or fax +1 (212) 869-0481 or email permissions@acm.org. For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Funding:

This project is funded in part by a Defense Technical Information Center (DTIC) contract.

Last updated: October 2024 by Donald Yeung (yeung@umd.edu)