Certain emerging non-volatile memory technologies, such as resistive RAM (ReRAM), are compatible with standard CMOS logic processes. This means it may be possible to integrate them directly into the die of a compute chip, such as a GPU or an accelerator. This project investigates such monolithically integrated accelerator-main memory chips. Similar to stacking DRAM dies over a logic die, monolithic 3D integration of main memory uses the vertical dimension to place main memory in close physical proximity to compute logic. But because there are no die crossings between the compute logic and an on-die main memory system, much higher wiring density can be achieved, enabling a massively parallel connection to main memory. The expected benefits are higher memory bandwidth, less data movement, and lower power consumption.
We are collaborating with IBM Research; our contacts at IBM include Dirk Pfeiffer and Takashi Ando. We are also collaborating with Northrop Grumman; our contacts at NG include Louise Sengupta and Isidoros Doxas.
We are also a member of the University Partnership Program at GlobalFoundries; our contact at GF is Claudia Kretzschmar.
This project is multi-disciplinary, involving research in computer architecture, circuits and devices, physical design, and applications.
Our project is undertaking computer architecture research to investigate different organizations for on-die memory systems. The goal is to create and exploit the massive memory-level parallelism that on-die integration makes possible. One architecture we have considered in the past is a large tiled CPU (illustrated below) with integrated ReRAM. In this architecture, it is natural to distribute the main memory system across tiles by integrating a portion of the ReRAM into each compute tile. (The ReRAM is physically integrated over the tile's local L2 slice to form a 3D L2 / main memory module; see Physical Design research below.) In essence, this is a distributed memory architecture integrated entirely on a single die. In addition to the per-tile ReRAM, each compute tile also includes its own memory controller, providing the local core with a dedicated channel to a local portion of the ReRAM main memory. Each core can still access remote main memory modules across the on-chip network; however, data movement is virtually eliminated when the application's data can be partitioned across the main memory modules such that each core's memory accesses are destined primarily to its local portion of main memory.
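To make the locality argument concrete, the following is a minimal sketch of a block-partitioned address map across tiles. It is purely illustrative: the `TiledMemory` class, the even split of memory across tiles, and the tile and module sizes are assumptions made for this example, not details of the actual design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TiledMemory:
    """Hypothetical model: main memory split evenly across compute tiles."""
    num_tiles: int
    bytes_per_tile: int

    def home_tile(self, addr: int) -> int:
        """Tile whose local ReRAM module holds this physical address."""
        if not 0 <= addr < self.num_tiles * self.bytes_per_tile:
            raise ValueError("address out of range")
        return addr // self.bytes_per_tile

    def is_local(self, core_tile: int, addr: int) -> bool:
        """True if the access is served by the core's own memory module."""
        return self.home_tile(addr) == core_tile


def local_fraction(mem: TiledMemory, accesses) -> float:
    """Fraction of (core_tile, addr) accesses that stay within the tile."""
    accesses = list(accesses)
    if not accesses:
        return 0.0
    local = sum(mem.is_local(tile, addr) for tile, addr in accesses)
    return local / len(accesses)


# Assumed toy configuration: 4 tiles, 1 KiB of memory per tile.
mem = TiledMemory(num_tiles=4, bytes_per_tile=1024)

# Partitioned data: each core touches only its own slice of memory.
partitioned = [(t, t * 1024 + off) for t in range(4) for off in range(0, 1024, 64)]

# Unpartitioned data: each core strides over the whole address space.
interleaved = [(t, a) for t in range(4) for a in range(0, 4096, 64)]

print(local_fraction(mem, partitioned), local_fraction(mem, interleaved))
```

In this toy setup, the partitioned access pattern never leaves the local module, whereas the unpartitioned pattern sends most accesses across the on-chip network.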

Currently, we are considering ReRAM integration into modern architectures, such as GPUs and specialized accelerators for machine learning. The intent is to architect the distributed memory in such a way that maximizes the achievable memory bandwidth. Like the tiled CPU above, we also seek to exploit physical locality of the ReRAM memory and compute units to improve efficiency.
Our research is exploring new ReRAM device architectures that address the material stress experienced in conventional ReRAM. The right half of the figure below shows one of our approaches, called horizontally-transported ReRAM, or H-ReRAM. In an H-ReRAM device, writes still form oxygen vacancies via an electric field applied across a top and bottom electrode. However, there is a silicon nitride barrier that blocks current flow in the vertical direction; hence, there is almost no power dissipation during writes. To sense the presence or absence of the oxygen vacancies, a much smaller read voltage is applied transversely across source and drain terminals of the device, resulting in a horizontal current flow. In essence, this device separates the read and write terminals to solve the write endurance problem. Although this increases planar area, the device is still 3D stackable, and can exhibit high density.

In addition to developing new ReRAM devices, we are also trying to improve endurance for existing ReRAM. One idea we are pursuing is soft writes, which perform writes using lower voltages and/or currents to reduce the material stress. Because softly written cells do not retain their states for as long, this technique sacrifices the non-volatility characteristic (the ReRAM becomes "pseudo-non-volatile"), trading off retention time to get back some endurance. Rather than supporting only soft writes, we envision memory systems that support both soft and traditional ("hard") writes. We are developing hardware support for deciding when soft writes are beneficial compared to hard writes, and then dynamically selecting the best type of write to employ based on the usage patterns of the written data.
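One way to picture such a selection mechanism is sketched below. This is a hypothetical heuristic, not the project's actual hardware policy: it assumes that data which tends to be overwritten well within the soft-write retention time is a good candidate for soft writes. The `WriteModeSelector` class and its `soft_retention`, `margin`, and `alpha` parameters are invented for illustration.

```python
from enum import Enum


class WriteMode(Enum):
    SOFT = "soft"
    HARD = "hard"


class WriteModeSelector:
    """Hypothetical per-block policy based on observed rewrite intervals."""

    def __init__(self, soft_retention: float, margin: float = 0.5, alpha: float = 0.5):
        self.soft_retention = soft_retention  # assumed retention of a soft write
        self.margin = margin                  # safety factor on the retention time
        self.alpha = alpha                    # weight of the newest interval
        self._last_write = {}                 # block -> time of previous write
        self._avg_interval = {}               # block -> smoothed rewrite interval

    def on_write(self, block: int, now: float) -> WriteMode:
        """Record a write to `block` at time `now` and pick a write mode."""
        last = self._last_write.get(block)
        if last is not None:
            interval = now - last
            prev = self._avg_interval.get(block)
            if prev is None:
                self._avg_interval[block] = interval
            else:
                self._avg_interval[block] = (
                    self.alpha * interval + (1 - self.alpha) * prev
                )
        self._last_write[block] = now

        predicted = self._avg_interval.get(block)
        # With no history, fall back to a hard write to preserve retention.
        if predicted is not None and predicted <= self.margin * self.soft_retention:
            return WriteMode.SOFT
        return WriteMode.HARD


selector = WriteModeSelector(soft_retention=100.0)
hot = [selector.on_write(block=0, now=t) for t in (0.0, 10.0, 20.0)]
cold = [selector.on_write(block=1, now=t) for t in (0.0, 1000.0)]
print([m.value for m in hot], [m.value for m in cold])
```

Here the frequently rewritten block switches to soft writes once a short rewrite interval has been observed, while the rarely rewritten block stays on hard writes. A real design would also have to handle softly written data that ends up living longer than predicted, which this sketch ignores.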
Fabrication of test devices at the UMD Nano Fabrication Center is underway to explore both H-ReRAM and pseudo-non-volatile ReRAM. We are also interested in creating ReRAM array structures, especially for H-ReRAM devices since their additional terminals
ReRAM is fabricated in the upper metal layers of the die as part of back-end-of-line (BEOL) processing steps. Like Intel's 3D XPoint, ReRAM employs a "crosspoint architecture" that uses selector devices to provide inter-cell isolation rather than per-cell access transistors. While peripheral access circuitry, such as decoders and sense amplifiers, does require logic transistors, the majority of the area underneath ReRAM memory arrays is vacant. This presents an opportunity for a new form of 3D integration in which memory cells are fabricated directly over compute logic in the same die. However, not all logic is suitable for fine-grain integration with ReRAM memory arrays. The peripheral access circuitry associated with each ReRAM memory array can be highly disruptive to the layout of other circuits, especially the random logic that makes up much of any compute architecture's datapath circuitry.
A potentially promising approach is to integrate the ReRAM memory system over the last-level cache. Like ReRAM, SRAM caches also consist of numerous memory arrays. It is natural to co-design the SRAM and ReRAM arrays such that one fits neatly underneath the other. For example, a 3D memory building block is illustrated below in which a cache mat consisting of two SRAM sub-arrays is physically integrated underneath two ReRAM sub-arrays. Routing of the address and data buses into and out of the co-designed arrays is still required, so layout of the ReRAM peripheral access circuitry needs to accommodate those routing tracks. But the resulting routing congestion is considerably less than what would be incurred if the ReRAM were integrated over random logic.

In general, we are interested in investigating how monolithically integrated accelerator-main memory chips can benefit applications. We believe an especially promising application domain for our technology is ML inference. In this workload, a model with a large number of weights is typically read over and over again across the batches of input data on which inference is performed. While results are produced (mainly activations), the volume of written data is typically far less than the volume of read data. This mostly-read memory access pattern is a good match for existing ReRAM, which supports reads efficiently but exhibits relatively costly writes.
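The read-dominated nature of inference can be illustrated with a back-of-the-envelope traffic estimate. The sketch below assumes a hypothetical fully connected network in which every weight is read from main memory once per batch and every layer's output activations are written back to main memory; input reads and activation re-reads are ignored. The layer sizes, batch size, and the `inference_traffic` helper are illustrative assumptions, not measurements from this project.

```python
def inference_traffic(layer_dims, batch_size, num_batches, bytes_per_value=1):
    """Estimate main-memory bytes read (weights) and written (activations).

    layer_dims: widths of a hypothetical fully connected network,
                e.g. [in, hidden, ..., out].
    """
    # One weight per (input, output) pair of each adjacent layer pair.
    num_weights = sum(i * o for i, o in zip(layer_dims, layer_dims[1:]))
    # Each sample produces one activation per neuron in every non-input layer.
    activations_per_sample = sum(layer_dims[1:])

    read_bytes = num_weights * bytes_per_value * num_batches
    write_bytes = activations_per_sample * batch_size * num_batches * bytes_per_value
    return read_bytes, write_bytes


reads, writes = inference_traffic(
    layer_dims=[1024, 4096, 4096, 1024], batch_size=8, num_batches=10
)
print(f"read {reads} B, wrote {writes} B, ratio {reads / writes:.1f}")
```

Under these assumptions, the bytes read for weights exceed the bytes written for activations by more than two orders of magnitude. The gap narrows as the batch size grows, since the weights are then amortized over more samples.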