ARM server status update / reality check

Things are getting more and more interesting.
I can't recommend Jeff Geerling's YouTube channel enough.




NVMe booting is currently in beta, but the Raspberry Pi Foundation will be adding it to the main feature set soon.
Jeff's video shows that it is pretty quick to get it up and running, though, and ~400MB/s is pretty good for such a low-power SBC.
 
Wow, two fucking years after release, and they finally fixed the feature-list to make it competitive with every other x86 SBC that includes NVMe out there?

If you want to know why Broadcom has zero customers for their SoCs (outside of the Pi), their complete lack of support here is a pretty easy answer.
 
Wow, two fucking years after release, and they finally fixed the feature-list to make it competitive with every other x86 SBC that includes NVMe out there?

If you want to know why Broadcom has zero customers for their SoCs (outside of the Pi), their complete lack of support here is a pretty easy answer.

Because it's silly for Broadcom to waste time and effort on getting NVMe up on the single PCIe 2.0 lane for I/O speeds that are slower than SATA3?

I'm not sure why RPi are bothering. The PCIe connectivity is only available from RPi 4 Compute Modules, on the 4B it's dedicated to the USB ports. Any application that really needs such I/O would probably be better off with a much higher-end ARM SoC (like Snapdragon or Apple M1 level (I can never keep all of the various ARM architectures straight)) or x86-64.

AFAICT Broadcom's SoC business is doing fine. Their chips are embedded in tons of devices from manufacturers willing to pay for proper support and docs that the cheap Chinese SoCs don't offer (e.g., you're not going to find AllWinner in your car).
 
Because it's silly for Broadcom to waste time and effort on getting NVMe up on the single PCIe 2.0 lane for I/O speeds that are slower than SATA3?
NVMe has a much greater queue depth and far more command queues than SATA.
This makes it a low-cost boon for ARM developers looking to test high-queue-depth workloads such as databases, where raw transfer rates aren't as important, especially compared to the SATA, eMMC, or microSD flash storage options that are natively available today.
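If you want to see that difference on real hardware, here's a minimal sketch (Python, reading Linux sysfs; the device names nvme0n1 and sda are just assumptions, use whatever lsblk shows on your board) that prints the queue depth and hardware queue count per drive:

```python
#!/usr/bin/env python3
# Minimal sketch: compare block-layer queue parameters of an NVMe vs SATA drive via sysfs.
# Device names are assumptions; substitute whatever `lsblk` shows on your system.
from pathlib import Path

def queue_info(dev: str) -> dict:
    base = Path("/sys/block") / dev
    mq = base / "mq"  # one subdirectory per hardware dispatch queue on blk-mq devices
    return {
        "device": dev,
        # per-queue request depth the block layer will allow
        "nr_requests": (base / "queue" / "nr_requests").read_text().strip(),
        # NVMe typically gets one hardware queue per CPU; SATA/AHCI effectively one
        "hw_queues": len(list(mq.iterdir())) if mq.is_dir() else 1,
    }

for dev in ("nvme0n1", "sda"):
    try:
        print(queue_info(dev))
    except FileNotFoundError:
        print(f"{dev}: not present on this system")
```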

I'm not sure why RPi are bothering. The PCIe connectivity is only available from RPi 4 Compute Modules, on the 4B it's dedicated to the USB ports. Any application that really needs such I/O would probably be better off with a much higher-end ARM SoC (like Snapdragon or Apple M1 level (I can never keep all of the various ARM architectures straight)) or x86-64.
It depends on the cost and the task; other ARM-based solutions that natively offer more PCIe lanes come with much higher costs attached.
x86-64 solutions don't really apply unless the ISA doesn't matter to the end-user.
 
NVMe has a much greater queue depth and far more command queues than SATA.
This makes it a low-cost boon for ARM developers looking to test high-queue-depth workloads such as databases, where raw transfer rates aren't as important, especially compared to the SATA, eMMC, or microSD flash storage options that are natively available today.

True, NVMe is technically more capable. But I doubt the RPi SoC or similar is going to be a top choice for anyone who needs to process a ton of IOPS. I'd be concerned that the CPU wouldn't be able to keep up and load average would go through the roof. Initial dev, yeah, sure.


It depends on the cost and the task; other ARM-based solutions that natively offer more PCIe lanes come with much higher costs attached.
x86-64 solutions don't really apply unless the ISA doesn't matter to the end-user.

Well yeah, that's a given, more capable == more $$$.
 

NVIDIA Announces CPU for Giant AI and High Performance Computing Workloads

Credit goes to Lakados


https://nvidianews.nvidia.com/news/...t-ai-and-high-performance-computing-workloads
Underlying Grace’s performance is fourth-generation NVIDIA NVLink® interconnect technology, which provides a record 900 GB/s connection between Grace and NVIDIA GPUs to enable 30x higher aggregate bandwidth compared to today’s leading servers.

Grace will also utilize an innovative LPDDR5x memory subsystem that will deliver twice the bandwidth and 10x better energy efficiency compared with DDR4 memory. In addition, the new architecture provides unified cache coherence with a single memory address space, combining system and HBM GPU memory to simplify programmability.

https://www.anandtech.com/show/1661...formance-arm-server-cpu-for-use-in-ai-systems
The company isn’t directly gunning for the Intel Xeon or AMD EPYC server market, but instead they are building their own chip to complement their GPU offerings, creating a specialized chip that can directly connect to their GPUs and help handle enormous, trillion parameter AI models.


Old design infrastructure with x86-64 and PCIE:
PCIe_575px.jpg

New design infrastructure with AArch64 and NVLINK:
NVLink_575px.jpg
 

NVIDIA Announces CPU for Giant AI and High Performance Computing Workloads

Credit goes to Lakados


https://nvidianews.nvidia.com/news/...t-ai-and-high-performance-computing-workloads


https://www.anandtech.com/show/1661...formance-arm-server-cpu-for-use-in-ai-systems


View attachment 347381

Old design infrastructure with x86-64 and PCIE:
View attachment 347382

New design infrastructure with AArch64 and NVLINK:
View attachment 347383
The reason I didn't post it is that it's still a mystery. I assume this thing is built to accept as many GPUs as you have slots for (just like the existing AMD servers)?

They can't even tell you any fucking details about the CPU.

So yeah, this is a Pointless Press Release (tm), and we will have to wait a goddamned year to find out that NVIDIA added a tiny tweak to the standard N2 design (plus the obvious addition of on-chip custom AI communicating with that CPU).

This feels like another empty Orin Press Release (along with an 18-month delay before specs and boards were shown)

https://www.anandtech.com/show/12598/nvidia-arm-soc-roadmap-updated-after-xavier-comes-orin

Fuck this preannounce shit, man. You're not putting these in cars (so you don't have to give car designers two years of empty PR notice to design these in).
 
The reason I didn't post it is that it's still a mystery. I assume this thing is built to accept as many GPUs as you have slots for (just like the existing AMD servers)?
The full thing is rendered in the post by Red Falcon.

There are no slots, because slots are slow.
They can't even tell you any fucking details about the CPU.
It is based on a future microarchitecture yet to be announced by ARM. But Nvidia has already announced over 300 points on SPECrate2017_int_base.
So yeah, this is a Pointless Press Release (tm), and we will have to wait a goddamned year to find out that NVIDIA added a tiny tweak to the standard N2 design (plus the obvious addition of on-chip custom AI communicating with that CPU).

This feels like another empty Orin Press Release (along with an 18-month delay before specs and boards were shown)

https://www.anandtech.com/show/12598/nvidia-arm-soc-roadmap-updated-after-xavier-comes-orin

Fuck this preannounce shit, man. You're not putting these in cars (so you don't have to give car designers two years of empty PR notice to design these in).
The Swiss National Supercomputing Center and the Los Alamos National Laboratory will build supercomputers based on this.
 
The full thing is rendered in the post by Red Falcon.

There are no slots, because slots are slow.

It is based on a future microarchitecture yet to be announced by ARM. But Nvidia has already announced over 300 points on SPECrate2017_int_base.

The Swiss National Supercomputing Center and the Los Alamos National Laboratory will build supercomputers based on this.
TLDR: an empty press release that shows tons of potential but is, in reality, purely hype. I'm more pissed off because Tegra has made this "normal" for NVIDIA.
 
Ampere moving to custom cores - Anandtech Link

Interesting to see them jump to custom cores given their success with the Neoverse cores. I find it absolutely exciting to see a bunch of custom ARM stuff popping up in both the server and consumer spaces.

Now just need to see some more RISC-V movement.

They own X-Gene, so it will be interesting to see what revision 4 brings!

Will it be faster than N2, or is ARM upping the license costs after the surprise success of N1?
 
The Ampere Altra Max Review: Pushing it to 128 Cores per Socket

Very unique. Moar Cores / Moar Problems.

TLDR: less L3, cache coherency rears its head, throughput for some things is amazing (as is compiling, though not linking), transactional Java sux.
 
The Ampere Altra Max Review: Pushing it to 128 Cores per Socket

Very unique. Moar Cores / Moar Problems.

TLDR: less L3, cache coherency rears its head, throughput for some things is amazing (as is compiling, though not linking), transactional Java sux.


Well, we knew that pathetic refresh was coming, even while they do some major work to make X-Gene 4 faster.
 

Nvidia Unveils 144-core Grace CPU Superchip, Claims Arm Chip 1.5X Faster Than AMD's EPYC Rome



The Grace CPU Superchip memory subsystem provides up to 1TB/s of bandwidth, which Nvidia says is a first for CPUs and more than twice that of other data center processors that will support DDR5 memory. The LPDDR5X comes spread out in 16 packages that provide 1TB of capacity. In addition, Nvidia notes that Grace uses the first ECC implementation of LPDDR5X.

This brings us to benchmarks. Nvidia claims the Grace CPU Superchip is 1.5X faster in the SPECrate_2017_int_base benchmark than the two previous-gen 64-core EPYC Rome 7742 processors it uses in its DGX A100 systems. Nvidia based this claim on a pre-silicon simulation that predicts the Grace CPU at a score of 740+ (370 per chip). AMD's current-gen EPYC Milan chips, the current performance leader in the data center, have posted SPEC results ranging from 382 to 424 apiece, meaning the highest-end x86 chips will still hold the lead. However, Nvidia's solution will have many other advantages, such as power efficiency and a more GPU-friendly design.
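Just to sanity-check the arithmetic in that paragraph (all numbers taken from the quoted article; this is only back-of-the-envelope):

```python
# Back-of-the-envelope check of the quoted SPECrate2017_int_base figures.
grace_superchip = 740                      # Nvidia's pre-silicon estimate, two Grace dies
grace_per_die = grace_superchip / 2        # -> 370 per die, as stated
implied_dual_rome = grace_superchip / 1.5  # ~493 for the 2x EPYC 7742 baseline
milan_per_chip = (382, 424)                # published EPYC Milan results cited above

print(f"Grace per die:        {grace_per_die:.0f}")
print(f"Implied 2x Rome 7742: {implied_dual_rome:.0f}")
print(f"Milan single chip:    {milan_per_chip[0]}-{milan_per_chip[1]} (still ahead of one Grace die)")
```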
 
I'm glad to hear they finally gave up on the VLIW distractions; there's no way in hell they could make it work as a general-purpose server chip!

I guess once they released SVE2, they figured they could get the same compute throughput without hacking the rest of the core with clunky designs?
 
It looks really neat.

The new mesh

CMN-700
12 x 12 = 144

CMN-600 (old)
6 x 6 = 36

Which I guess implies each 72-core chip is a 6 x 6 mesh with 2 cores per crosspoint: 2 x 6 x 6 = 72.

Further, a 4-chip module would be 4 x 2 x 6 x 6 = 288.

So CMN-700 is 4 x CMN-600; how nicely incremental.
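A throwaway sketch of the same crosspoint arithmetic, in case anyone wants to fiddle with the assumptions (2 cores per crosspoint being the big one):

```python
# Crosspoint arithmetic from the mesh sizes above; 2 cores per crosspoint is the assumption.
def mesh_cores(rows: int, cols: int, cores_per_xp: int = 2) -> int:
    return rows * cols * cores_per_xp

cmn600_xps = 6 * 6     # 36 crosspoints (old mesh)
cmn700_xps = 12 * 12   # 144 crosspoints (new mesh)

print(mesh_cores(6, 6))            # 72  -> one 72-core chip as a 6 x 6 mesh
print(4 * mesh_cores(6, 6))        # 288 -> a hypothetical 4-chip module
print(cmn700_xps // cmn600_xps)    # 4   -> CMN-700 mesh is 4x the CMN-600 mesh
```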
 
Things are starting to get interesting. 🍎:penguin:




Of course it is.

macOS will always be hindered by its microkernel. This is really sad this generation, because there is no longer any officially supported kick-in-the-ass for Apple from other OS options.
 
At long last, GPU functionality on the Raspberry Pi. (y)




Executive summary: the drivers plus I/O architecture are so bad on the ARM platform that the only folks who can get GPU compute working on ARM servers are supercomputer vendors like NVIDIA.
 
Executive summary: the drivers plus I/O architecture are so bad on the ARM platform that the only folks who can get GPU compute working on ARM servers are supercomputer vendors like NVIDIA.
It is a start, and that's how everything begins: very small.
As explained in the video, ARM spans many different platforms with different PCIe features enabled or disabled, so GPU functionality may very well be a case-by-case thing.
 


From the comments:

AMD EPYC 7601 = 3.2 GHz Ryzen Zen1 | 894 @ 3.2 GHz
Ampere Q80-30 = 3 GHz Neoverse N1 | 882 @ 3.0 GHz

To compare, N1 has +5% higher 1T GB5 IPC vs Zen1.
Arm claims the 2022 N2 has +40% IPC gains. Iso-power, +10% perf (though with 8MB vs 4MB L3). Iso-perf, -30% power (both are iso-process, but presumably N2 will be on some 5-nm nodes).
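The +5% figure falls straight out of score-per-GHz on the numbers quoted above; quick check:

```python
# Score per GHz from the quoted Geekbench 5 single-thread results.
zen1 = 894 / 3.2   # EPYC 7601    -> ~279 points/GHz
n1 = 882 / 3.0     # Altra Q80-30 -> ~294 points/GHz

print(f"N1 vs Zen1 per clock: {n1 / zen1:.1%}")  # ~105%, i.e. roughly +5% IPC in GB5
```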
 

Raspberry Pi Cluster Versus Ampere Altra Max Supermicro Arm Server



The conclusion is about what you would expect, especially if you saw our AoA Analysis Marvell ThunderX2 Equals 190 Raspberry Pi 4. I also discuss the Ampere Altra Max’s HPL performance versus the newly released AMD EPYC Genoa as in the floating point workload, there is a massive gap between the efficiency of AMD and Ampere parts. There is a reason that the Arm server CPUs typically have integer-focused performance figures for things like web serving and SPEC CPU2017 integer rates, not floating point.
 
This is actually pretty cool.


Will be doing just this on my Raspberry Pi CM4 very soon.
VMware ESXi would have been painful on a micro SD card, but should be much better over native NVMe, even at PCIe 2.0 1x (~500MB/s).
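For reference, the ~500MB/s number is just the raw PCIe 2.0 x1 link rate (5 GT/s with 8b/10b line coding), before any protocol overhead:

```python
# Raw PCIe 2.0 x1 bandwidth: 5 GT/s per lane, 8b/10b encoding.
transfers_per_s = 5e9
payload_fraction = 8 / 10   # 8b/10b line coding
lanes = 1                   # the CM4 exposes a single lane

bytes_per_s = transfers_per_s * payload_fraction / 8 * lanes
print(f"{bytes_per_s / 1e6:.0f} MB/s theoretical")  # 500 MB/s; real NVMe throughput lands lower
```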



System Specs:
Broadcom BCM2711 OC'ed @ 2.0GHz AArch64 SoC
8GB 3200MT/s LPDDR4 SDRAM
Broadcom VideoCore VI GPU
512GB Samsung PM9A1 M.2 NVMe SSD
Raspberry Pi OS 64-bit (Debian 11)
GeeekPi Aluminum Alloy SoC Heatsink
Noctua NF-A4x20 PWM Fan (40x20mm)
ineo Copper Alloy M.2 2280 Heatsink
Integrated ARM SoC 1000Base-T NIC
InnoMaker HiFi DAC PCM5122
 
Will be doing just this on my Raspberry Pi CM4 very soon.
VMware ESXi would have been painful on a micro SD card, but should be much better over native NVMe, even at PCIe 2.0 1x (~500MB/s).

View attachment 536568

System Specs:
Broadcom BCM2711 OC'ed @ 2.0GHz AArch64 SoC
8GB 3200MT/s LPDDR4 SDRAM
Broadcom VideoCore VI GPU
512GB Samsung PM9A1 M.2 NVMe SSD
Raspberry Pi OS 64-bit (Debian 11)
GeeekPi Aluminum Alloy SoC Heatsink
Noctua NF-A4x20 PWM Fan (40x20mm)
ineo Copper Alloy M.2 2280 Heatsink
Integrated ARM SoC 1000Base-T NIC
InnoMaker HiFi DAC PCM5122

Just homelabbing/tinkering, or do you do something specific with it?
 

NVIDIA Grace Superchip Features 144 Cores, 960GB of RAM, and 128 PCIe Gen5 Lanes - Posted by STH


NVIDIA-GTC-2022-Grace-CPU-Superchip-696x484.jpg


NVIDIA’s Grace Superchip is a 500W part, but that includes the LPDDR5X memory. We have been seeing roughly 5W/ DDR5 RDIMM. So for AMD EPYC 9004 Genoa with 12 channels of memory, we would add ~60W to the CPU’s TDP. That makes NVIDIA’s chip slightly more powerful, but with what should be a bit over twice the memory bandwidth than a single Genoa CPU.

That single CPU comparison will become important. The AMD EPYC Genoa more than doubled memory bandwidth, and it has much higher compute performance due to adding more cores and a faster microarchitecture. Our sense is that in HPC workloads, Grace will compete with dual-socket Genoa. On the integer side, AMD should be ahead based on what we have seen with existing Arm architectures.
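Rough power math on those numbers (note: the Genoa TDP below is my assumption, not from the article; ~360W is the stock figure for the top SKU):

```python
# Rough comparison of package+memory power from the STH figures above.
grace_superchip_w = 500      # NVIDIA's figure, includes the LPDDR5X
dimm_w = 5                   # STH's ~5 W per DDR5 RDIMM estimate
channels = 12                # one RDIMM per channel assumed
genoa_tdp_w = 360            # assumed stock TDP for a top-bin EPYC 9004; not from the article

genoa_plus_memory = genoa_tdp_w + channels * dimm_w   # ~420 W
print(grace_superchip_w, genoa_plus_memory)           # 500 vs ~420 for one Genoa socket + RDIMMs
```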
 