128GB on 7950X3D for local LLMs

Has anyone tried maxing out RAM to run LLMs?
I currently have LM Studio and 64GB RAM, which allows me to run fairly large models, but it's not enough for the bigger ones.
Do you have any experience and feedback on performance?

Thank you very much!
 
Regardless of the application, running 128GB will also depend on the motherboard and whether you want to run two or four DIMMs.

If running 128GB you may simply need to play with the timings and be prepared for higher CAS latency, but again that depends on the motherboard as well.

Quite a lot of people run 128GB with the 7950X3D, and there are plenty of articles on Google, but as you can see, loosening timings to pass memtest sessions and maintain stability is common.

Is there a kit you are considering, and what motherboard?
 
It'll work, but dual channel DDR5 bandwidth is going to keep it slow.
I'm aware of that; 64GB models with reasoning already take about 20 minutes to generate an answer. What I'm looking for is a way to replicate Strix Halo on the cheap.
 
If you don't mind glacially slow, it absolutely will work. I have an X99 Xeon and 256GB of quad-channel memory for the same purpose.
It's a "go away and come back much later" affair. More to have the capability than to actually use it. :D
 
But isn't 128GB the max supported by 7950X3D?
I upgraded last year after 11 years; I might consider a RAM upgrade, but there's no way I'm going to switch CPU and motherboard as well.
 
If you have spare PCIe bandwidth, what about getting some budget GPUs like the Tesla P40 24GB? They're not great, but certainly better than using just RAM. Then just fill the VRAM and offload the rest to the CPU and RAM. If you bought two P40s, you could probably run a 4-bit quant 70B model at ~5 tok/sec in just VRAM.
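If you went that route, the split is just a layer count you hand to the loader. A minimal sketch with llama-cpp-python (the file name and layer count are placeholders you'd tune for your model and VRAM):

```python
# Sketch of a GPU/CPU split with llama-cpp-python (needs a CUDA-enabled build).
from llama_cpp import Llama

llm = Llama(
    model_path="llama-70b-q4_k_m.gguf",  # hypothetical 4-bit quant file
    n_gpu_layers=60,   # layers pushed to the P40s; the rest stay in system RAM
    n_ctx=4096,        # context length; a bigger window costs more VRAM for the KV cache
)

out = llm("Why offload layers to the GPU?", max_tokens=128)
print(out["choices"][0]["text"])
```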

My LLM server has a Threadripper Pro 3945WX and 128GB of 8-channel DDR4 3200MHz, and it is glacially slow for CPU inference. I honestly wouldn't recommend anyone investing more money to create a "better" CPU inference setup. You're better off scouring eBay and the MM for e-waste GPUs.
 
But isn't 128GB the max supported by 7950X3D?
When AM5 was released, the maximum was 128GB because the highest capacity stick was 32GB. Since then, I think Crucial/Micron was the first to do 48GB sticks and later Samsung manufactured dies for 64GB.

If you look in the BIOS updates for most boards (including 1st gen), you should see evidence of support for these capacities, even if the board's spec still says 128GB.

For example (TUF B650-Plus):
Version 3208, 2025/02/27
...
3. Added support for up to 5000MT/s when four 64GB memory modules (total 256GB) are installed. The exclusive AEMP option will appear when compatible models are populated.

Version 1616, 2023/05/16
...
2. Support 48/24GB high-density DDR5 memory module.

The CPU support pages weren't updated for AM5 1st gen or Intel 12th gen when these sticks came out, even though they work (well, "work" might be pushing it; "work slowly" is more accurate).
 
Looks like you're right, 4x48GB sticks seem to be supported on my ASUS ROG STRIX B650E-E GAMING WIFI (AMD B650).
And yes, I would expect around 1 token per second or less according to my LM Studio benchmarks (1.4 t/s for a 72B model).
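That figure lines up with a simple back-of-the-envelope check: CPU generation has to stream the whole set of weights out of RAM for every token, so speed is roughly memory bandwidth divided by model size. A rough sketch (the bandwidth, quant size and efficiency numbers are assumptions, not measurements):

```python
# Rough CPU-inference ceiling: each generated token reads every weight once,
# so tokens/sec is bounded by memory bandwidth / model size in bytes.
bandwidth_gb_s = 96.0    # assumed dual-channel DDR5-6000 peak (2 x 48 GB/s)
model_size_gb = 42.0     # assumed ~72B parameters at a 4-bit quant, plus overhead

ceiling = bandwidth_gb_s / model_size_gb   # theoretical upper bound, ~2.3 tok/s
realistic = ceiling * 0.6                  # real runs rarely hit peak bandwidth
print(f"upper bound ~{ceiling:.1f} tok/s, realistic ~{realistic:.1f} tok/s")
```

Which comes out right around the 1.4 t/s you're seeing, so more RAM mostly buys capacity rather than speed.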
 
I expect that DDR6, along with optimized LLMs, will bridge the gap between CPU and GPU performance. Dedicated hardware will always be faster, of course, but once CPU inference gets good enough (and you don't need a big LLM to run tools or summarize web searches) it won't really matter.
 
If you have spare PCIe bandwidth, what about getting some budget GPUs like the Tesla P40 24GB? They're not great, but certainly better than using just RAM. Then just fill the VRAM and offload the rest to the CPU and RAM. If you bought two P40s, you could probably run a 4-bit quant 70B model at ~5 tok/sec in just VRAM.

My LLM server has a Threadripper Pro 3945WX and 128GB of 8-channel DDR4 3200MHz, and it is glacially slow for CPU inference. I honestly wouldn't recommend anyone investing more money to create a "better" CPU inference setup. You're better off scouring eBay and the MM for e-waste GPUs.

Cards like the P40 need a ton of attention in a desktop system. It might be doable with some chassis and power supply configurations, but the added cost and complexity would make dropping in 256GB the better option for most, I think, especially if the model fits in RAM.
 
Cards like the P40 need a ton of attention in a desktop system. It might be doable with some chassis and power supply configurations, but the added cost and complexity would make dropping in 256GB the better option for most, I think, especially if the model fits in RAM.
I've seen some pretty funky setups with zip ties, fans, BIOS modding and a dream. To me the added performance would still be worth it. You point out an important consideration though.
 
I've seen some pretty funky setups with zip ties, fans, BIOS modding and a dream. To me the added performance would still be worth it. You point out an important consideration though.

A pair of NVLink-capable cards with a decent amount of VRAM would be one way to go, but you're into another realm of expense and power consumption and still have the issue of fitting the model into memory.
 
In the end, the art of LLMs is finding the smallest model capable of solving a task, and for RAG small models (<8B) can be surprisingly good.
Also, in my experience there are diminishing returns above 32B, and I've read experts argue that 70-80B is the point where performance gains per billion parameters stop being linear.
The MoE architecture adopted by many recent models might be pointing the same way: the most efficient setup might indeed be one "topic detector" model alongside N "topic expert" models.
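A toy sketch of that idea (the model names and the keyword "detector" are placeholders; a real router would use an actual small model for the topic step):

```python
# Toy "topic detector + N experts" router: a tiny step picks the topic,
# then the query goes to a small specialist instead of one huge generalist.
EXPERTS = {                     # hypothetical small models, one per topic
    "code": "qwen2.5-coder-7b",
    "maths": "mathstral-7b",
    "general": "llama-3.1-8b",
}

def detect_topic(prompt: str) -> str:
    # Stand-in for the topic-detector model: keyword matching instead of inference.
    text = prompt.lower()
    if any(w in text for w in ("def ", "class ", "compile", "traceback")):
        return "code"
    if any(w in text for w in ("integral", "equation", "prove")):
        return "maths"
    return "general"

def route(prompt: str) -> str:
    expert = EXPERTS[detect_topic(prompt)]
    return f"send '{prompt}' to {expert}"   # here you'd call the chosen model

print(route("Why does this def my_function() raise a TypeError?"))
```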
 
I've seen some pretty funky setups with zip ties, fans, BIOS modding and a dream. To me the added performance would still be worth it. You point out an important consideration though.
I have a 3D-printed fan mount for a 90mm fan on mine. It just needs an adapter for the 8-pin cables.
Previously I had a 1080 Ti cooler on it, but I found the back of the card was getting pretty damn hot, even with good case airflow.
 