VRAM Bandwidth Requirements for Local LLM Code Completion?

I'm wondering if anyone can tell me what kind of VRAM bandwidth is suitable for LLM code completion (I'm talking about something in the style of Cursor Tab here, but locally hosted, if such a thing exists).

I'm looking at picking up a new GPU primarily for gaming, but I'd also like to try out LLM code completion features (I'm absolutely not interested in any kind of interactive/chat/agent-mode code generation).

Budget is tight so I'll likely be looking at a 9060XT 16GB or a 5060Ti 16GB.

ROCm/CUDA and compatibility issues aside, is the 322GB/s of bandwidth on the 9060XT suitable compared to the 448GB/s on the 5060Ti? Is there some kind of rule of thumb for how much bandwidth is needed to run these models quickly?

Thanks!
 
I’ve been using local models on a 4070 with no problems. Just install Ollama and try different models.

More parameters = more VRAM and a longer time to respond. I’ve had useful results from qwen2.5 1.5b, which is tiny.
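As a rough rule of thumb for the bandwidth question: generating each token means streaming essentially all of the weights out of VRAM, so tokens/sec is capped at roughly bandwidth divided by the (quantized) model size. Just a back-of-envelope sketch, ignoring compute, prompt processing and KV cache, and the 8.5 GB is an assumed size for a ~14B model at 4-bit:

Code:
# Rough tokens/sec ceiling: each generated token reads all weights from VRAM,
# so speed is capped at about bandwidth / model size. Ignores compute, the KV
# cache and prompt processing, so real numbers come out lower than this.
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

# Assumed example: a ~14B model quantized to 4-bit is roughly 8-9 GB of weights.
for name, bw in [("9060XT (322 GB/s)", 322), ("5060Ti (448 GB/s)", 448)]:
    print(f"{name}: ~{max_tokens_per_sec(bw, 8.5):.0f} tok/s ceiling")

Real throughput will be some fraction of that ceiling, but it gives you a feel for how the two cards compare, and for tab-style completion you'd normally be running a much smaller model anyway.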

There are plenty of Ollama benchmarks around; have a look at the LocalLlama subreddit.
 
I'm running Ollama on a 7800XT totally fine, and although that card has more bandwidth than those two, I doubt you'll have an issue with a 9060XT/5060Ti 16GB if those are the options your budget allows.
Gemma3:12b and Qwen3:14b run with no issues and respond much quicker than I can process the answer, and you can always try the quantized models as well.
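If you want to sanity-check whether a model will fit in 16GB before pulling it, a rough weights-only estimate is enough. A sketch with approximate bytes-per-weight figures (you still need a couple of GB of headroom for the KV cache and runtime on top):

Code:
# Very rough VRAM estimate for the weights alone. The bytes-per-weight values
# are approximations; leave headroom for the KV cache and runtime overhead.
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8_0": 1.06, "q4_K_M": 0.56}

def weights_gb(params_billion: float, quant: str) -> float:
    # billions of parameters x bytes per weight = gigabytes of weights
    return params_billion * BYTES_PER_WEIGHT[quant]

for quant in BYTES_PER_WEIGHT:
    print(f"14B @ {quant}: ~{weights_gb(14, quant):.0f} GB")

So a 14B model at Q4 sits comfortably inside 16GB, while FP16 wouldn't come close.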
 
Hi.
I'm running Ollama with Mistral in a VM with no GPU. You get a warning that it's in CPU-only mode, and it's still pretty quick considering.

Code:
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
$ ollama run mistral
pulling manifest
pulling ff82381e2bea: 100% ▕████████████████████████████████████████████████████████████████████████████████████████▏ 4.1 GB                        
pulling 43070e2d4e53: 100% ▕████████████████████████████████████████████████████████████████████████████████████████▏  11 KB                        
pulling 491dfa501e59: 100% ▕████████████████████████████████████████████████████████████████████████████████████████▏  801 B                        
pulling ed11eda7790d: 100% ▕████████████████████████████████████████████████████████████████████████████████████████▏   30 B                        
pulling 42347cd80dc8: 100% ▕████████████████████████████████████████████████████████████████████████████████████████▏  485 B                        
verifying sha256 digest
writing manifest
success
>>>
>>> test
 Hello! It seems like you'd like to test something. How can I assist you with that? Please provide more details about what exactly you want to
test, and I'll do my best to help.

For instance, if you want to test a piece of code, I can help by providing an input or explaining the expected output. If it's for a different
purpose, just let me know!

The VM has 16GB RAM and 8 cores. The video is VMSVGA with 16MB :D
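Since the installer says the API is listening on 127.0.0.1:11434, you can also talk to it directly rather than through the CLI; it's the same endpoint an editor completion plugin would point at. A minimal sketch against Ollama's /api/generate endpoint, assuming the mistral model from above is already pulled:

Code:
# Minimal request to the local Ollama HTTP API (default 127.0.0.1:11434),
# standard library only. Assumes "mistral" has already been pulled.
import json
import urllib.request

payload = {
    "model": "mistral",
    "prompt": "Write a Python function that reverses a string.",
    "stream": False,  # return a single JSON object instead of a stream
}
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])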
 
You know, it didn't even occur to me to try it in CPU mode and see how it does. I'm going to install it all at the weekend.
 