Monolithic 3D Integration of Accelerators and Main Memory Systems

Overview:

Certain emerging non-volatile memory technologies, such as resistive RAM (ReRAM), are compatible with standard CMOS logic processes. This means it may be possible to integrate them directly into the die of a compute chip, such as a GPU or an accelerator. This project investigates such monolithically integrated accelerator-main memory chips. Similar to stacking DRAM dies over a logic die, monolithic 3D integration of main memory uses the vertical dimension to enable close physical proximity between main memory and compute logic. But because there are no die crossings between the compute logic and an on-die main memory system, much higher wiring density can be achieved, resulting in a massively parallel connection to main memory. This in turn yields higher memory bandwidth, less data movement, and lower power consumption.

People:

Faculty

  • Donald Yeung
  • Martin Peckerar
  • Bruce Jacob

Students

  • Suhwan Hong
  • Harold Park
  • Yinuo Wang

Alumni

  • Meenatchi Jagasivamani
  • Rachid Jamil
  • Luyi Kang
  • Shang Li
  • Xiangyu Mao
  • Brendan Sheehy
  • Devesh Singh
  • Candace Walden
  • Hung-Yu Yeh

We are collaborating with IBM Research; our contacts at IBM include Dirk Pfeiffer and Takashi Ando. We are also collaborating with Northrop Grumman; our contacts at NG include Louise Sengupta and Isidoros Doxas.

We are also a member of the University Partnership Program at Global Foundries; our contact at GF is Claudia Kretzschmar.

Publications:

  • Devesh Singh and Donald Yeung. MORSE: Memory Overwrite Time Guided Soft Writes to Improve ReRAM Energy and Endurance. In Proceedings of the 33rd International Conference on Parallel Architectures and Compilation Techniques. Long Beach, CA. October 2024.
    (pdf)
  • Martin Peckerar, Po-Chun Huang, Rachid Ahmad Jamil, Bruce Jacob, and Donald Yeung. Critical Issues in Advanced ReRAM Development. In Proceedings of the 9th International Symposium on Memory Systems. Alexandria, VA. October 2023.
    (pdf)
  • Candace Walden, Devesh Singh, Meenatchi Jagasivamani, Shang Li, Luyi Kang, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung. Monolithically Integrating Non-Volatile Main Memory Over the Last-Level Cache. ACM Transactions on Architecture and Code Optimization. Vol. 18, No. 4, Article 48. July 2021.
    (ACM digital library distribution)
  • Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Luyi Kang, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung. Tileable Monolithic ReRAM Memory Design. In Proceedings of the IEEE Symposium on Low-Power and High-Speed Chips and Systems. Tokyo, Japan. April 2020.
    (pdf)
  • Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Luyi Kang, Shang Li, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung. Analyzing the Monolithic Integration of a ReRAM-based Main Memory into a CPU's Die. IEEE Micro (Special Issue on Monolithic 3D Architectures). Vol. 39, Issue 6. November/December 2019.
    (IEEE digital library distribution)
  • Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Luyi Kang, Shang Li, Mehdi Asnaashari, Sylvain Dubois, Donald Yeung, and Bruce Jacob. Design for ReRAM-based Main-Memory Architectures. In Proceedings of the 5th International Symposium on Memory Systems. Washington, D.C. September 2019.
    (pdf)
  • Meenatchi Jagasivamani, Candace Walden, Devesh Singh, Luyi Kang, Shang Li, Mehdi Asnaashari, Sylvain Dubois, Bruce Jacob, and Donald Yeung. Memory Systems Challenges in Realizing Monolithic Computers. In Proceedings of the 4th International Symposium on Memory Systems. National Harbor, MD. October 2018.
    (pdf)

Research Summary:

    This project is multi-disciplinary, involving research in computer architecture, circuits and devices, physical design, and applications.

    Architecture Research

    Our project is undertaking computer architecture research to investigate different organizations for on-die memory systems. The goal is to create and exploit the massive memory-level parallelism that on-die integration makes possible. One architecture we have considered in the past is a large tiled CPU (illustrated below) with integrated ReRAM. In this architecture, it is natural to distribute the main memory system across tiles by integrating a portion of the ReRAM into each compute tile. (The ReRAM is physically integrated over the tile's local L2 slice to form a 3D L2 / main memory module; see Physical Design Research below.) In essence, this is a distributed memory architecture integrated entirely on a single die. In addition to the per-tile ReRAM, each compute tile includes its own memory controller, providing the local core with a dedicated channel to a local portion of the ReRAM main memory. Each core can still access remote main memory modules across the on-chip network; however, data movement is virtually eliminated when the application's data can be partitioned across the main memory modules such that each core's memory accesses are destined primarily for its local portion of main memory.
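
    The effect of data partitioning on locality can be illustrated with a minimal sketch. It assumes a simple block-partitioned address map in which each tile owns one contiguous address range; the function names, tile count, and per-tile capacity are hypothetical and chosen only for illustration, not taken from the project's designs.

```python
def home_tile(addr, num_tiles, bytes_per_tile):
    """Return the tile whose local memory holds addr (assumed block partitioning)."""
    tile = addr // bytes_per_tile
    if not 0 <= tile < num_tiles:
        raise ValueError("address outside the on-die main memory")
    return tile


def local_fraction(accesses, num_tiles, bytes_per_tile):
    """Fraction of (core_tile, addr) accesses served by the core's own tile."""
    local = 0
    total = 0
    for core_tile, addr in accesses:
        total += 1
        if home_tile(addr, num_tiles, bytes_per_tile) == core_tile:
            local += 1
    return local / total if total else 0.0


# Hypothetical configuration: 4 tiles, 1024 bytes of main memory per tile.
NUM_TILES = 4
BYTES_PER_TILE = 1024

# Partitioned data: each core touches only addresses in its own tile's range.
partitioned = [(t, t * BYTES_PER_TILE + i * 64)
               for t in range(NUM_TILES) for i in range(16)]

# Shared data: every core sweeps the whole address space.
shared = [(t, i * 64) for t in range(NUM_TILES) for i in range(64)]

print(local_fraction(partitioned, NUM_TILES, BYTES_PER_TILE))
print(local_fraction(shared, NUM_TILES, BYTES_PER_TILE))
```

    In this toy setup every partitioned access is local, while only one in four of the shared accesses is; the remainder would cross the on-chip network.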

    Currently, we are considering ReRAM integration into modern architectures, such as GPUs and specialized accelerators for machine learning. The intent is to architect the distributed memory in a way that maximizes the achievable memory bandwidth. As with the tiled CPU above, we also seek to exploit the physical locality of the ReRAM memory and compute units to improve efficiency.

    Circuits and Devices Research

    One of the challenges of using ReRAM as a main memory technology is its limited write endurance. (ReRAM's write endurance is currently in the range of 10^6 to 10^12 write cycles, which is far below that of DRAM.) The left half of the figure below shows a conventional ReRAM device, illustrating the endurance problem. In this device, the formation of oxygen vacancies within a metal oxide material supports current flow vertically between a top and bottom electrode. Unfortunately, the oxygen vacancies form in pyramidal shapes with sharp peaks from which very high electric fields emanate. Furthermore, as oxygen vacancies begin to bridge the two electrodes and the device transitions from the high-resistance state to the low-resistance state, significant currents can flow that dissipate heat. The combination of high electric fields and thermal stress contributes to the low endurance of the device.
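
    To give a feel for why endurance matters at main-memory write rates, here is a back-of-the-envelope sketch. It assumes ideal wear leveling (writes spread perfectly evenly over all cells) and uses a hypothetical capacity and write bandwidth; none of these figures come from the project, and real lifetimes would be shorter because write traffic is never perfectly uniform.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def lifetime_years(endurance_cycles, capacity_bytes, write_bytes_per_sec):
    """Idealized lifetime: total writable bytes divided by the write rate.

    Assumes perfect wear leveling, so every cell absorbs the same number
    of writes before reaching its endurance limit.
    """
    total_writable_bytes = endurance_cycles * capacity_bytes
    return total_writable_bytes / write_bytes_per_sec / SECONDS_PER_YEAR


# Hypothetical system: 64 GiB of ReRAM written at a sustained 10 GB/s.
CAPACITY = 64 * 2**30
WRITE_RATE = 10e9

low = lifetime_years(1e6, CAPACITY, WRITE_RATE)    # low end of the endurance range
high = lifetime_years(1e12, CAPACITY, WRITE_RATE)  # high end of the endurance range
print(f"endurance 1e6:  {low:.2f} years")
print(f"endurance 1e12: {high:.0f} years")
```

    Under these assumptions the low end of the endurance range wears out in roughly 0.22 years, while the high end lasts far longer than any practical deployment. This is why the position of a device within that range matters so much.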

    Our research is exploring new ReRAM device architectures that address the material stress experienced in conventional ReRAM. The right half of the figure below shows one of our approaches, called horizontally-transported ReRAM, or H-ReRAM. In an H-ReRAM device, writes still form oxygen vacancies via an electric field applied across a top and bottom electrode. However, there is a silicon nitride barrier that blocks current flow in the vertical direction; hence, there is almost no power dissipation during writes. To sense the presence or absence of the oxygen vacancies, a much smaller read voltage is applied transversely across source and drain terminals of the device, resulting in a horizontal current flow. In essence, this device separates the read and write terminals to solve the write endurance problem. Although this increases planar area, the device is still 3D stackable, and can exhibit high density.

    In addition to developing new ReRAM devices, we are also trying to improve endurance for existing ReRAM. One idea we are pursuing is soft writes, which perform writes using lower voltages and/or currents to reduce the material stress. Because softly written cells do not retain their states for as long, this technique sacrifices the non-volatility characteristic (the ReRAM becomes "pseudo-non-volatile"), trading off retention time to get back some endurance. Rather than supporting only soft writes, we envision that memory systems will support both soft and traditional ("hard") writes. We are developing hardware support for deciding when soft writes are beneficial compared to hard writes, and then dynamically selecting the best type of write to employ based on the usage patterns of the written data.
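
    One way such a dynamic selection could work is sketched below. This is an illustrative policy only, not the project's hardware mechanism: it assumes that a block's next overwrite interval resembles its last one, and it picks a soft write only when that predicted interval fits comfortably within the soft-write retention time. The class name, the safety margin, and the time units are all hypothetical.

```python
class WriteSelector:
    """Toy soft/hard write chooser based on each block's last overwrite interval."""

    def __init__(self, soft_retention, safety_margin=2.0):
        self.soft_retention = soft_retention  # how long a soft write is assumed to hold
        self.safety_margin = safety_margin    # hypothetical guard band on the prediction
        self.last_write = {}                  # block -> time of most recent write
        self.last_interval = {}               # block -> most recent overwrite interval

    def choose(self, block, now):
        """Record a write to block at time now and return 'soft' or 'hard'."""
        prev = self.last_write.get(block)
        if prev is not None:
            self.last_interval[block] = now - prev
        self.last_write[block] = now

        interval = self.last_interval.get(block)
        # With no history, fall back to a hard write so the data is retained.
        if interval is None:
            return "hard"
        # Soft write only if the data is expected to be overwritten well
        # before a softly written cell would lose its state.
        if interval * self.safety_margin <= self.soft_retention:
            return "soft"
        return "hard"


selector = WriteSelector(soft_retention=100)
print(selector.choose("a", now=0))    # first write to the block
print(selector.choose("a", now=10))   # overwritten after a short interval
print(selector.choose("a", now=500))  # overwritten after a long interval
```

    The sketch does not model what happens when a softly written block outlives its retention time; a real design would need some way to preserve such data, for example by rewriting it before it decays.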

    Fabrication of test devices at the UMD Nano Fabrication Center is underway to explore both H-ReRAM and pseudo-non-volatile ReRAM. We are also interested in creating ReRAM array structures, especially for H-ReRAM devices since their additional terminals

    Physical Design Research

    ReRAM is fabricated in the upper metal layers of the die as part of back-end-of-line (BEOL) processing steps. Like Intel's 3D XPoint, ReRAM employs a "crosspoint architecture" that uses selector devices to provide inter-cell isolation rather than per-cell access transistors. While peripheral access circuitry, such as decoders and sense amplifiers, does require logic transistors, the majority of the area underneath ReRAM memory arrays is vacant. This presents an opportunity for a new form of 3D integration in which memory cells are fabricated directly over compute logic in the same die. However, not all logic is suitable for fine-grain integration with ReRAM memory arrays. The peripheral access circuitry associated with each ReRAM memory array can be highly disruptive to the layout of other circuits, especially the random logic that makes up much of any compute architecture's datapath circuitry.

    A potentially promising approach is to integrate the ReRAM memory system over the last-level cache. Like ReRAM, SRAM caches also consist of numerous memory arrays. It is natural to co-design the SRAM and ReRAM arrays such that one fits neatly underneath the other. For example, a 3D memory building block is illustrated below in which a cache mat consisting of two SRAM sub-arrays is physically integrated underneath two ReRAM sub-arrays. Routing of the address and data busses into and out of the co-designed arrays is still required, so layout of the ReRAM peripheral access circuitry needs to accommodate those routing tracks. But the resulting routing congestion is considerably less than what would be incurred if the ReRAM is integrated over random logic.

    Applications Research

    In general, we are interested in investigating how monolithically integrated accelerator-main memory chips can benefit applications. We believe an especially promising application domain for our technology is ML inference. In this workload, a model with a large number of weights is typically read over and over again across batches of input data upon which inference is performed. While results are produced (mainly activations), the volume of written data is typically far less than the volume of read data. This mostly-read memory access pattern is a good match for existing ReRAM, which supports reads efficiently but exhibits relatively costly writes.
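
    The read-dominated nature of inference can be illustrated with a rough traffic estimate for a hypothetical fully connected network. The sketch assumes the weights are read from main memory once per batch and that every layer's output activations are written back to main memory, which is a pessimistic assumption for writes. The layer widths, batch size, and value size are made up for illustration.

```python
def inference_traffic(layer_widths, batch_size, bytes_per_value):
    """Estimate per-batch main memory traffic for a fully connected network.

    Returns (weight_bytes_read, activation_bytes_written). Input and
    activation reads are ignored to keep the estimate simple.
    """
    layers = list(zip(layer_widths, layer_widths[1:]))  # (inputs, outputs) per layer
    weight_bytes_read = sum(n_in * n_out for n_in, n_out in layers) * bytes_per_value
    activation_bytes_written = (
        batch_size * sum(n_out for _, n_out in layers) * bytes_per_value
    )
    return weight_bytes_read, activation_bytes_written


# Hypothetical model: two 4096x4096 layers, batch of 8, 2-byte values.
reads, writes = inference_traffic([4096, 4096, 4096], batch_size=8, bytes_per_value=2)
print(reads, writes, reads / writes)
```

    For this made-up configuration the weights account for 512 times as many bytes read as the activations account for bytes written. The ratio shrinks as the batch size grows, since activation traffic scales with the batch while weight traffic does not.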

    ACM permission notice:
    The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

    ACM copyright notice:
    Copyright © 2013 by the Association for Computing Machinery, Inc. (ACM). Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page in print or the first screen in digital media. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Send written requests for republication to ACM Publications, Copyright & Permissions at the address above or fax +1 (212) 869-0481 or email permissions@acm.org. For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Funding:

This project is funded in part by a Defense Technical Information Center (DTIC) contract.

Last updated: October 2025 by Donald Yeung (yeung@umd.edu)