128GB on 7950X3D for local LLMs

Has anyone tried maxing out RAM to run LLMs?
I currently have LM Studio and 64GB RAM, which allows me to run fairly large models, but it's not enough for the bigger ones.
Do you have any experience and feedback on performance?

Thank you very much!
 
Regardless of the application, running 128GB will also depend on the motherboard and whether you want to run two or four DIMMs.

If running 128GB you may simply need to play with the timings and be prepared for higher CAS latency, but again that depends on the motherboard as well.

Quite a lot of people run 128GB with the 7950X3D, and there are plenty of articles on Google, but as you can see, loosening timings to pass memtest sessions and maintain stability is common.

Is there a kit you are considering, and what motherboard?
 
It'll work, but dual channel DDR5 bandwidth is going to keep it slow.
I'm aware of that; 64GB models with reasoning already take about 20 minutes to generate an answer. What I'm looking for is a way to replicate Strix Halo on the cheap.
 
If you don't mind glacially slow, it absolutely will work. I have an X99 Xeon and 256GB of quad-channel memory for the same purpose.
It's a "go away and come back much later" affair. More to have the capability than to actually use it. :D
 
But isn't 128GB the max supported by 7950X3D?
I upgraded last year after 11 years; I might consider a RAM upgrade, but there's no way I'm going to switch CPU and motherboard as well.
 
If you have spare PCIe bandwidth, what about getting some budget GPUs like the Tesla P40 24GB? They're not great, but certainly better than using just RAM. Then just fill the VRAM and offload the rest to the CPU and RAM. If you bought two P40s, you could probably run a 4-bit quant 70B model at ~5 tok/sec in just VRAM.
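If you went that route, the split is just a layer count you hand to the loader. A minimal sketch with llama-cpp-python (the file name and layer count are placeholders you'd tune for your model and VRAM):

```python
# Sketch of a GPU/CPU split with llama-cpp-python (needs a CUDA-enabled build).
from llama_cpp import Llama

llm = Llama(
    model_path="llama-70b-q4_k_m.gguf",  # hypothetical 4-bit quant file
    n_gpu_layers=60,   # layers pushed to the P40s; the rest stay in system RAM
    n_ctx=4096,        # context length; a bigger window costs more VRAM for the KV cache
)

out = llm("Why offload layers to the GPU?", max_tokens=128)
print(out["choices"][0]["text"])
```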

My LLM server has a Threadripper Pro 3945WX and 128GB of 8-channel DDR4 3200MHz, and it is glacially slow for CPU inference. I honestly wouldn't recommend anyone investing more money to create a "better" CPU inference setup. You're better off scouring eBay and the MM for e-waste GPUs.
 
But isn't 128GB the max supported by 7950X3D?
When AM5 was released, the maximum was 128GB because the highest capacity stick was 32GB. Since then, I think Crucial/Micron was the first to do 48GB sticks and later Samsung manufactured dies for 64GB.

If you look in the BIOS updates for most boards (including 1st gen), you should see evidence of support for these capacities, even if the board's spec still says 128GB.

For example (TUF B650-Plus):
Version 3208, 2025/02/27
...
3. Added support for up to 5000MT/s when four 64GB memory modules (total 256GB) are installed. The exclusive AEMP option will appear when compatible models are populated.

Version 1616, 2023/05/16
...
2. Support 48/24GB high-density DDR5 memory module.

The CPU support pages weren't updated for AM5 1st gen or Intel 12th gen when these sticks came out, even though they work (well, "work" might be pushing it; "work slowly" is more accurate).
 
Looks like you're right, 4x48GB sticks seem to be supported on my ASUS ROG STRIX B650E-E GAMING WIFI (AMD B650).
And yes, I would expect around 1 token per second or less according to my LM Studio benchmarks (1.4 t/s for a 72B model).
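That figure lines up with a simple back-of-the-envelope check: CPU generation has to stream the whole set of weights out of RAM for every token, so speed is roughly memory bandwidth divided by model size. A rough sketch (the bandwidth, quant size and efficiency numbers are assumptions, not measurements):

```python
# Rough CPU-inference ceiling: each generated token reads every weight once,
# so tokens/sec is bounded by memory bandwidth / model size in bytes.
bandwidth_gb_s = 96.0    # assumed dual-channel DDR5-6000 peak (2 x 48 GB/s)
model_size_gb = 42.0     # assumed ~72B parameters at a 4-bit quant, plus overhead

ceiling = bandwidth_gb_s / model_size_gb   # theoretical upper bound, ~2.3 tok/s
realistic = ceiling * 0.6                  # real runs rarely hit peak bandwidth
print(f"upper bound ~{ceiling:.1f} tok/s, realistic ~{realistic:.1f} tok/s")
```

Which comes out right around the 1.4 t/s you're seeing, so more RAM mostly buys capacity rather than speed.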
 
I expect that DDR6, along with optimized LLMs, will bridge the gap between CPU and GPU performance. Dedicated hardware will always be faster, of course, but once CPU inference gets good enough (and you don't need a big LLM to run tools or summarize web searches) it won't really matter.
 
If you have spare PCIe bandwidth, what about getting some budget GPUs like the Tesla P40 24GB? They're not great, but certainly better than using just RAM. Then just fill the VRAM and offload the rest to the CPU and RAM. If you bought two P40s, you could probably run a 4-bit quant 70B model at ~5 tok/sec in just VRAM.

My LLM server has a Threadripper Pro 3945WX and 128GB of 8-channel DDR4 3200MHz, and it is glacially slow for CPU inference. I honestly wouldn't recommend anyone investing more money to create a "better" CPU inference setup. You're better off scouring eBay and the MM for e-waste GPUs.

Cards like the P40 need a ton of attention in a desktop system. It might be doable with some chassis and power supply configurations, but the added cost and complexity would make dropping in 256GB the better option for most, I think, especially if the model fits in RAM.
 
Cards like the P40 need a ton of attention in a desktop system. It might be doable with some chassis and power supply configurations, but the added cost and complexity would make dropping in 256GB the better option for most, I think, especially if the model fits in RAM.
I've seen some pretty funky setups with zip ties, fans, BIOS modding and a dream. To me the added performance would still be worth it. You point out an important consideration though.
 
I've seen some pretty funky setups with zip ties, fans, BIOS modding and a dream. To me the added performance would still be worth it. You point out an important consideration though.

A pair of NVLink-capable cards with a decent amount of VRAM would be one way to go, but you're into another realm of expense and power consumption and still have the issue of fitting the model into memory.
 
In the end, the art of LLMs is finding the smallest model capable of solving a task, and for RAG small models (<8B) can be surprisingly good.
Also, in my experience there are diminishing returns above 32B, and I've read experts argue that 70-80B is the point where performance gains per billion parameters stop being linear.
The MoE architecture adopted by many recent models might be pointing the same way: the most efficient setup might indeed be one "topic detector" model alongside N "topic expert" models.
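A toy sketch of that idea (the model names and the keyword "detector" are placeholders; a real router would use an actual small model for the topic step):

```python
# Toy "topic detector + N experts" router: a tiny step picks the topic,
# then the query goes to a small specialist instead of one huge generalist.
EXPERTS = {                     # hypothetical small models, one per topic
    "code": "qwen2.5-coder-7b",
    "maths": "mathstral-7b",
    "general": "llama-3.1-8b",
}

def detect_topic(prompt: str) -> str:
    # Stand-in for the topic-detector model: keyword matching instead of inference.
    text = prompt.lower()
    if any(w in text for w in ("def ", "class ", "compile", "traceback")):
        return "code"
    if any(w in text for w in ("integral", "equation", "prove")):
        return "maths"
    return "general"

def route(prompt: str) -> str:
    expert = EXPERTS[detect_topic(prompt)]
    return f"send '{prompt}' to {expert}"   # here you'd call the chosen model

print(route("Why does this def my_function() raise a TypeError?"))
```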
 
I've seen some pretty funky setups with zip ties, fans, BIOS modding and a dream. To me the added performance would still be worth it. You point out an important consideration though.
I have a 3D-printed fan mount for a 90mm fan on mine. It just needs an adapter for the 8-pin cables.
Previously I had a 1080 Ti cooler on it, but I found the back of the card was getting pretty damn hot, even with good case airflow.
 