Microchip Announces DRAM Controller For OpenCAPI Memory Interface
by Billy Tallis on August 5, 2019 8:00 AM EST

Microchip's subsidiary Microsemi is entering a new market with the introduction of the SMC 1000 8x25G Serial Memory Controller. This is a DDR4 DRAM controller that connects to host processors using the OpenCAPI-derived Open Memory Interface (OMI), a high-speed differential serial link running at 25Gbps per lane. The purpose is to enable servers to scale to much higher memory capacities by attaching DRAM through serial links with much lower pin counts than traditional parallel DDR interfaces.
OpenCAPI is one of several competing high-speed interconnect standards that seek to go beyond the performance and feature set of PCI Express. The first two CAPI standards were built atop PCIe 3.0 and 4.0 and offered a lower-latency, cache-coherent protocol. Version 3 gained the Open- prefix by moving control of the spec from IBM to a new consortium, and OpenCAPI 3.0 abandons its PCIe underpinnings in favor of a new 25Gbps link. A subset of OpenCAPI 3.1 has been dubbed Open Memory Interface, and provides a media-agnostic but low-latency protocol for accessing memory. There's open IP available for implementing the host or target side of this interface, and a growing ecosystem of commercial tools for design verification.
The Microchip SMC 1000 8x25G unsurprisingly uses an 8-lane Open Memory Interface connection to the host, and on the downstream side it has a single-channel DDR4-3200 controller with ECC and support for four ranks of memory. At its heart, the SMC 1000 is a SERDES with a few extra features, letting a CPU use an 84-pin connection in place of a 288-pin DIMM interface without sacrificing bandwidth, while incurring only an extra 4ns of latency compared to LRDIMMs attached to an on-CPU memory controller. The chip itself is a 17x17 mm package with typical power consumption below 1.7W, and it supports dynamically dropping down to four or two lanes on the OMI link to save power when the full 25GB/s isn't needed.
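To put those numbers in perspective, here is a back-of-the-envelope comparison of the raw bandwidth and pin counts involved, a minimal sketch using only the figures quoted above (protocol and encoding overhead on the serial link is not modelled):

```python
# Rough arithmetic behind the pin/bandwidth trade-off described above.
# Uses only figures quoted in the article; serial-link protocol and
# encoding overhead are not modelled.

omi_lanes = 8
omi_lane_gbps = 25                              # Gbps per lane, per direction
omi_gbytes = omi_lanes * omi_lane_gbps / 8      # 25 GB/s, matching the quoted figure

ddr4_mts = 3200                                 # DDR4-3200: mega-transfers per second
ddr4_bus_bytes = 8                              # 64-bit data bus (ECC bits excluded)
ddr4_gbytes = ddr4_mts * ddr4_bus_bytes / 1000  # 25.6 GB/s

print(f"OMI 8x25G:         ~{omi_gbytes:.1f} GB/s over an ~84-pin connector")
print(f"DDR4-3200 channel: ~{ddr4_gbytes:.1f} GB/s over a 288-pin DIMM slot")
```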
In principle, the DRAM interface of the SMC 1000 could fan out to traditional DIMM slots, but the preferred way to use the chip will be to put the controller and a fixed amount of DRAM together onto a module called a Differential DIMM. These DDIMMs will use the same SFF-TA-1002 connector as EDSFF/Ruler SSDs, and the modules will be 85mm long, compared to 133mm for LRDIMMs. Both 1U and 2U height DDIMM form factors are in the process of being standardized. Microchip already has Samsung, Micron and SMART Modular on board to manufacture DDIMMs using the SMC 1000 controller, with initial capacities ranging from 16GB to 256GB per module.
On the host side, the first platforms to support Open Memory Interface will be POWER9 processors from IBM, and they are expected to announce more details later this month at their OpenPOWER Summit. From IBM's perspective, supporting Open Memory Interface allows them to include more memory channels on the same size die, and provides a forward-compatible upgrade path to DDR5 and NVDIMMs or other memory technologies since the details of those interfaces are now handled on the DDIMM instead of on the CPU.
Microchip will be showing off the SMC 1000 8x25G at Flash Memory Summit this week, and will be giving a keynote presentation Wednesday morning.
17 Comments
Sahrin - Monday, August 5, 2019 - link
Didn’t Intel already try to do this and it failed miserably?

ats - Monday, August 5, 2019 - link
Yes, FB-DIMM, with lower latency and less power...

Kevin G - Monday, August 5, 2019 - link
FB-DIMM was not low power, and the latency was pretty bad (especially with two or more FB-DIMMs per channel). Though we'll have to see what the real-world latencies of this IBM technology are.

However, it was a JEDEC standard that leveraged standard parallel DRAM chip configurations with a parallel-to-serial buffer, which is what this IBM technology is also doing.
ats - Monday, August 5, 2019 - link
The quoted device power and the quoted latency numbers are both higher than AMB chips achieved. I'm not arguing that FB-DIMM was low power or low latency, just that it was lower power and latency than this. And yes, if you chained FB-DIMMs the latency went up, but this device doesn't even support that functionality, so it's appropriate to only compare to single AMB channels.

SarahKerrigan - Monday, August 5, 2019 - link
I don't think so. This is basically the successor to IBM Centaur - a combination of the buffered memory that many scale-up servers already use and a fast generic expansion interconnect; I'm not aware of anything else quite like it.

ats - Monday, August 5, 2019 - link
It's basically a perfect analog for FB-DIMM.

close - Monday, August 5, 2019 - link
Are you talking about Optane? Because they're quite different in what they're trying to achieve. This is a way of cramming more RAM into a server and allowing for easy generational upgrades by decoupling the RAM type from the IMC in the CPU.
Kevin G - Monday, August 5, 2019 - link
Intel has done this... twice. As has IBM.

The first, as already pointed out, is the FB-DIMM standard. Intel was the popular advocate for this, but the standard itself was part of JEDEC and a handful of smaller players leveraged it as well. The DIMMs ran hot and had significantly higher latency than their traditional DDR2 counterparts of that era. Technically they could have also used DDR3-based DRAM with an appropriate buffer chip, but no such configuration ever existed to my knowledge.
The FB2-DIMM spec was proposed but never adopted by JEDEC. Both Intel and IBM leveraged the concepts from this design for their high-end systems (Xeon E7 and POWER7, respectively). Instead of putting the memory buffer on the DIMM itself, the serial-to-parallel conversion chip was placed on the motherboard or a daughter card, which then backed traditional DDR3 DIMMs in most cases (IBM still did the proprietary DIMM format for their really, really high-end systems, which had features like chipkill etc.).
IBM followed up their initial memory buffer design in POWER8 by incorporating a massive amount of eDRAM (32 MB) on the buffer chip to act as an L4 cache. This bulk caching effectively hid much of the memory buffer latency, as the cache's contents could only exist on that memory channel. Since it had such a large cache, the buffer chip also did a few other clever things, like re-ordering read/write operations for more sustained bursts to the DRAM.
name99 - Monday, August 5, 2019 - link
"The buffer chip here did a few other clever things since it had a massive cache like re-ordering read/write operations for more continued burst operations on the DRAM."This feature is called "Virtual Write Queue". It's described in this paper:
https://lca.ece.utexas.edu/pubs/ISCA_2010.pdf
It seems like IBM tried to patent it, but the patent is abandoned?
https://patents.google.com/patent/US20150143059
The technique should be feasible on any SoC where the memory controller and LLC are integrated and designed together, so basically anything modern (including, e.g., phones). Whether it's that valuable in different environments (e.g. on phones, with the very different traffic patterns of GPUs), well, who knows? But certainly everyone should be trying to reduce time wasted on DRAM bus turnaround whenever you have to switch from read to write and back.
My guess is that a company like Apple, that's already engaged in every performance trick known, probably does something similar, though perhaps more sophisticated to distinguish between different types of traffic.
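As a side note on the scheduling idea described in the comment above, the sketch below is a toy model of why draining writes in batches helps: it counts bus-turnaround penalties for interleaved versus batched writes. It is not IBM's actual Virtual Write Queue implementation, and the cycle costs are assumed purely for illustration.

```python
# Toy model of the read/write turnaround problem discussed above.
# The cycle costs are made-up assumptions; this only illustrates why
# draining writes in batches beats interleaving them with reads.

import random

TURNAROUND_CYCLES = 8   # assumed penalty each time the DRAM bus changes direction
ACCESS_CYCLES = 4       # assumed cost of a single read or write burst

def bus_cycles(ops):
    """Count cycles for a sequence of 'R'/'W' bursts, charging a
    turnaround penalty whenever the bus switches direction."""
    cycles, last = 0, None
    for op in ops:
        if last is not None and op != last:
            cycles += TURNAROUND_CYCLES
        cycles += ACCESS_CYCLES
        last = op
    return cycles

random.seed(0)
mixed = [random.choice("RW") for _ in range(64)]      # writes issued as they arrive
batched = [op for op in mixed if op == "R"] + \
          [op for op in mixed if op == "W"]           # same work, writes drained in one burst

print("interleaved:", bus_cycles(mixed), "cycles")
print("batched:    ", bus_cycles(batched), "cycles")
```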
azfacea - Monday, August 5, 2019 - link
Yes and no. It was many years ago and targeted at different use cases. I think the latencies were bad and the IMC ran hot at the time.
Now this could be useful for very large densities for data science and analytics. I don't recall Intel making server CPUs with 6- or 8-channel IMCs back then, am I wrong? Was there demand for quad-socket servers just for the extra IMCs and not actually that high core counts? LTT had a video of a supercomputer in Canada with such servers.