ISCA, the International Symposium on Computer Architecture, is an ACM/IEEE conference that we don’t tend to hear from all that often in public. The main reason for this is that most sessions and papers tend to be more academically oriented, and thus generally quite a bit further removed from what we see in actual products. This year, the conference changed its format by adding an industry track of sessions, with presentations and papers from various companies covering real commercial products out in the wild.

Amongst the sessions, Samsung’s SARC (Samsung Austin R&D Centre) CPU development team presented a paper titled “Evolution of the Samsung Exynos CPU Microarchitecture”, detailing the team’s efforts over its 8-year existence and laying out some key characteristics of its custom Arm CPU cores, ranging from the Exynos M1 through the most recent Exynos M5, as well as the unreleased M6 design.

As a bit of background, Samsung’s SARC CPU team was established in 2011 to develop custom CPU cores that Samsung LSI would then deploy in its Exynos SoCs, ranging from the first-generation Exynos 8890 in 2016’s Galaxy S7, up to the most recent Exynos 990 with its M5 cores in the Galaxy S20. SARC had completed the M6 microarchitecture before the CPU team got news of its disbandment in October 2019, effective last December.

The ISCA paper is a result of Samsung’s willingness to publish some of the development team’s ideas that were considered worth preserving in public, essentially representing a high-level run-through of 8 years of development.

From M1 to M6: A continuously morphing CPU µarch

The paper presents a broad overview table of the microarchitectural differences between Samsung’s custom CPU cores:

The disclosure covers some of the well-known characteristics of the designs that Samsung had already presented, from its initial M1 CPU microarchitecture deep dive at Hot Chips 2016 to the more recent M3 deep dive at Hot Chips 2018. It also gives us insight into the newer M4 and M5 microarchitectures that we had measured in our Galaxy S10 and S20 reviews, as well as a glimpse of what the M6 would have been.

One key characteristic of Samsung’s design over the years was that it was based on the same blueprint RTL that the team started with for the M1 core in 2011, with continuous improvements to the cores’ functional blocks from generation to generation. The M3 had been a big change in the design, widening the core substantially in several aspects, such as going from a 4-wide to a 6-wide mid-core.

The new disclosures that weren’t public before concern the M5 and M6 cores. For the M5, Samsung made bigger changes to the cores’ cache hierarchy, such as replacing the private L2 caches with a new, larger shared cache, and the paper also discloses a change in the L3 structure from a 3-bank to a 2-bank design with lower latency.

The unreleased M6 core that had been in development would seemingly have been a bigger jump in terms of microarchitecture. The SARC team had prepared large improvements, such as doubling the L1 instruction and data caches from 64KB to 128KB – a design choice that to date has only been implemented by Apple’s CPU cores, starting with the A12.

The L2 is said to have had its bandwidth doubled, to up to 64B/cycle, and the L3 would have grown from 3MB to 4MB.
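
To put the per-cycle figure into absolute terms, bandwidth is simply bytes per cycle multiplied by the core clock. A minimal sketch, using the Exynos 990’s 2.73GHz M5 peak clock purely as an illustrative assumption, since the M6’s target frequency was never disclosed:

```python
# Convert the disclosed per-cycle L2 bandwidth into GB/s.
# The 2.73 GHz clock is an assumption borrowed from the M5 in the Exynos 990,
# not a figure from the paper.
bytes_per_cycle = 64
clock_ghz = 2.73
print(f"{bytes_per_cycle * clock_ghz:.0f} GB/s")  # ~175 GB/s at that clock
```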

The M6 would have been an 8-wide decode core, which as far as we’re aware would have made it the widest commercial microarchitecture to date – at least on the decode side of things.

Interestingly, even though the core would have been much wider, the integer execution units wouldn’t have changed all that much, with just one complex pipeline gaining a second integer division capability, whilst the load/store pipelines would have remained the same as on the M5, with 1 load unit, 1 store unit, and 1 load/store unit.

On the floating-point/SIMD pipelines we would have seen an additional fourth unit with FMAC capabilities.

The TLBs would have seen some large changes, such as the L1 DTLB growing from 48 to 128 entries, and the main TLB doubling from 4K to 8K entries (32MB of address coverage with 4KB pages).

The M6 would also have been the first time since the M3 that the out-of-order window of the core was increased, with larger integer and floating-point physical register files, and an increase in the ROB (reorder buffer) from 228 to 256 entries.

One key weakness of the SARC cores seems to still have been present in the M5 and the upcoming M6: the deeper pipeline results in a relatively expensive 16-cycle mispredict penalty, considerably higher than Arm’s more recent designs, which come in at 11 cycles.
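
To put that penalty into perspective, the CPI added by branch mispredicts is roughly the mispredict rate per instruction multiplied by the penalty. A quick back-of-the-envelope sketch, using a made-up MPKI value rather than any figure from the paper:

```python
# Added CPI from branch mispredicts ~ (MPKI / 1000) * penalty in cycles.
def mispredict_cpi(mpki: float, penalty_cycles: int) -> float:
    return mpki / 1000.0 * penalty_cycles

example_mpki = 5.0  # hypothetical workload, not a disclosed figure
print(f"{mispredict_cpi(example_mpki, 16):.3f}")  # 0.080 CPI added at a 16-cycle penalty
print(f"{mispredict_cpi(example_mpki, 11):.3f}")  # 0.055 CPI added at an 11-cycle penalty
```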

The paper goes into more depth on the branch predictor design, showcasing the core’s scaled hashed perceptron based predictor. The design had been improved continuously over the years and implementations, raising branch prediction accuracy and thus steadily reducing the MPKI (mispredicts per kilo-instructions).
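
For readers unfamiliar with the technique, a hashed perceptron predictor keeps several small tables of signed weights, indexes each table with a hash of the branch address and a slice of global history, sums the selected weights, and predicts taken if the sum is non-negative; training nudges the weights whenever the prediction was wrong or the sum fell below a confidence threshold. Below is a minimal Python sketch of that idea; the table count, sizes, and threshold are invented for illustration and have nothing to do with SARC’s actual configuration, and the per-table scaling that gives the “scaled” variant its name is omitted:

```python
# Minimal sketch of a hashed perceptron branch predictor (illustrative only).
class HashedPerceptronPredictor:
    def __init__(self, n_tables=8, table_size=1024, history_bits=64, threshold=32):
        self.n_tables = n_tables
        self.table_size = table_size
        self.history_bits = history_bits
        self.threshold = threshold          # confidence threshold for training
        self.ghist = 0                      # global branch history register
        self.tables = [[0] * table_size for _ in range(n_tables)]  # signed weights

    def _index(self, pc: int, t: int) -> int:
        # Hash the branch PC with a slice of global history; real designs use
        # carefully chosen folded hashes, this one is purely illustrative.
        seg_len = self.history_bits // self.n_tables
        seg = (self.ghist >> (t * seg_len)) & ((1 << seg_len) - 1)
        return (pc ^ seg ^ (seg >> 3) ^ t) % self.table_size

    def predict(self, pc: int):
        # Sum the selected weight from each table; the sign is the prediction.
        s = sum(self.tables[t][self._index(pc, t)] for t in range(self.n_tables))
        return s, s >= 0

    def update(self, pc: int, taken: bool) -> None:
        s, predicted_taken = self.predict(pc)
        # Train on a mispredict, or when the sum was below the threshold.
        if predicted_taken != taken or abs(s) < self.threshold:
            for t in range(self.n_tables):
                i = self._index(pc, t)
                w = self.tables[t][i] + (1 if taken else -1)
                self.tables[t][i] = max(-128, min(127, w))  # saturate weights
        # Shift the branch outcome into global history.
        self.ghist = ((self.ghist << 1) | int(taken)) & ((1 << self.history_bits) - 1)
```

In hardware all of this happens within the front-end in a cycle or two, and the weight tables and history registers are exactly the sort of storage that the table further down quantifies in KB.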

An interesting table that’s showcased is the amount of storage the branch predictor structures take up within the front-end, in KB:

We’re not aware of any other vendor ever having disclosed such figures, so it’s interesting to put into context just how much storage a modern front-end has to house (and this is *just* the branch predictor).

The paper goes on to further detail the core’s prefetching methodologies, covering the introduction of a µOP cache in the M5 generation, as well as the team’s efforts in hardening the core against security vulnerabilities such as Spectre.

Generational IPC Improvements - 20% per year - IPC up from 1.06 to 2.71

The paper further describes the SARC team’s efforts to improve memory latency over the generations. In the M4 core, the team included a load-load cascade mechanism that reduced the effective L1 latency from 4 cycles to 3 on subsequent loads. The M4 also introduced a path bypass with a new interface from the CPU cores directly to the memory controllers, avoiding traffic through the interconnect, which explains some of the bigger latency improvements that we saw in the Exynos 9820. The M5 introduced speculative cache lookup bypasses, issuing a request to both the interconnect and the cache tags simultaneously, potentially saving latency in the case of a cache miss, as the memory request is already underway by the time the miss is determined. The average load latency improved continuously over the generations, from 14.9 cycles on the M1 down to 8.3 cycles on the M6.
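
The principle behind the speculative lookup bypass is straightforward to model: if the request towards the interconnect is launched in parallel with the cache-tag check, a miss no longer pays for the tag check on top of the memory access, at the cost of some wasted interconnect traffic on hits. The sketch below uses entirely invented latencies and hit rates just to show the shape of the trade-off:

```python
# Toy model of a speculative cache lookup bypass; all numbers are invented
# placeholders, not figures from the paper.
def avg_load_latency(hit_rate, hit_cycles, tag_check_cycles, memory_cycles,
                     speculative):
    if speculative:
        # The memory request is already in flight when the miss is detected,
        # so the tag check no longer adds to the miss path.
        miss_cycles = memory_cycles
    else:
        miss_cycles = tag_check_cycles + memory_cycles
    return hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles

# Hypothetical last-level-cache scenario: 70% hit rate, 40-cycle hit,
# 30-cycle tag check, 150-cycle memory access.
print(f"{avg_load_latency(0.7, 40, 30, 150, speculative=False):.1f}")  # ~82 cycles
print(f"{avg_load_latency(0.7, 40, 30, 150, speculative=True):.1f}")   # ~73 cycles
```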

In terms of IPC improvements, the SARC team managed an average of 20% annual gains over the 8 years of development. The M3 in particular had been a big jump in IPC, as seen in the graph. The M5’s figure roughly correlates with what we saw in our benchmarks, at around a 15-17% improvement. IPC for the M6 is disclosed as having ended up at an average of 2.71, versus 1.06 for the M1, and the graph generally seems to indicate a 20% improvement over the M5.
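
As a quick sanity check of those figures: going from an IPC of 1.06 on the M1 to 2.71 on the M6 across the five generational steps works out to roughly 21% per step, which lines up well with the stated ~20% per year:

```python
# Geometric-mean per-generation IPC gain from the disclosed endpoints.
m1_ipc, m6_ipc, steps = 1.06, 2.71, 5  # five steps: M1 -> M2 -> ... -> M6
per_step_gain = (m6_ipc / m1_ipc) ** (1 / steps)
print(f"{per_step_gain:.2f}")  # 1.21, i.e. roughly 21% per generation
```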

During the session’s Q&A, the paper’s presenter, Brian Grayson, answered questions about the program’s cancellation. He disclosed that the team had always been on-target and on-schedule with performance and efficiency improvements for each generation. The team’s biggest difficulty was said to have been the need to be extremely careful with future design changes, as it never had the resources to start completely from scratch or fully rewrite a block. With hindsight, the team would have made different choices in some of its past design directions. This serial design methodology stands in contrast to Arm’s position of having multiple leapfrogging design centres and CPU teams, allowing it to do ground-up re-designs such as the Cortex-A76.

The team had plenty of ideas for improvements in upcoming cores such as the M7, but the decision to cancel the program was said to have come from very high up at Samsung. The SARC CPU cores were never really all that competitive, falling behind Arm’s designs in power efficiency, performance, and area usage. With Arm’s latest Cortex-X1, divulged last week, going for all-out performance, it looks to me as if SARC’s M6 design would have had trouble competing against it.

The paper's authors are extremely thankful for Samsung’s graciousness in allowing the publication of the piece, and thank the SARC leadership for their management over the years of this “moonshot” CPU project. SARC currently still designs custom interconnects and memory controllers, and is also working on custom GPU architectures.


Source: ISCA 2020: Evolution of the Samsung Exynos CPU Microarchitecture

Comments

  • eSyr - Wednesday, June 3, 2020 - link

    "8-wide decode core, which as far as we know would have been the widest commercial microarchitecture that we know of" — POWER 8 has 8-wide decode, you have written about it in 2016[1].

    [1] https://www.anandtech.com/show/10435/assessing-ibm...
  • Tabalan - Wednesday, June 3, 2020 - link

    I think POWER8 is a purely professional product, while the M6 core would have been installed in purely commercial devices (Galaxy S30, Note 30). At least that's my point of view.
  • eSyr - Wednesday, June 3, 2020 - link

    What do you mean by "commercial devices"? Those which sell freely? One can buy a POWER machine the same way one can buy a Samsung device, on the Internet[1][2][3].

    [1] https://www.raptorcs.com/TALOSII/
    [2] https://www.alancomputech.com/cpus-other-cpus.html
    [3] https://www.amazon.com/dp/B077T2WMGD
  • Tabalan - Wednesday, June 3, 2020 - link

    The 1st link is rack units/workstations and the 3rd link is a server rack unit on an Intel Xeon. I'd say commercial vs. professional is more about purpose. With Nvidia you have GTX/RTX for commercial use and Tesla/Quadro for professional use. Pretty much the same with many tech companies, like AMD, Intel, Samsung,...
  • cjl - Wednesday, June 3, 2020 - link

    Professional is still commercial. The only non-commercial stuff out there would be military/government/etc. A more accurate term for what you're looking for is consumer.
  • soresu - Wednesday, June 3, 2020 - link

    I think they meant consumer rather than commercial, as opposed to pro/server components.
  • Andrei Frumusanu - Wednesday, June 3, 2020 - link

    That's not exactly correct: the POWER8 has a 6+2 or 2x(3+1) decoder, where the 6/3 slots would be instruction-type agnostic and the +2/1 would be just branches. In effect, Samsung's design here would be wider in terms of being a traditional 8-wide decoder.
  • eSyr - Wednesday, June 24, 2020 - link

    I've missed that aspect of P8 decoder, thank you.
  • dotjaz - Wednesday, June 3, 2020 - link

    8 is not wider than 8 in any case.
  • soresu - Wednesday, June 3, 2020 - link

    Er, Alpha EV8?
