Anyone else use AI to learn Linux?

I have been using AI constantly over the past year to help me learn Linux and it is truly amazing. I mainly use the MS Copilot Linux app or DuckDuckGo's AI features; between them they can usually point me in the right direction whenever I hit a wall.

I wanted to move away from online AI as my questions and answers are probably being data mined by Microsoft as they do with everything these days.

Since I have an RTX 4090 with tensor cores that have been doing nothing for years, I dipped my toes into running an LLM locally on my PC.

It turns out that it is really easy to do. Download the programme LM Studio, choose the model you want to use from a huge list, and you are done; just ask away. I chose Meta Llama 3.1 70B Instruct as it fits perfectly within the VRAM of my 4090.
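If you'd rather script your questions than type them into the chat window, LM Studio can also run a local server that speaks the usual OpenAI-style API. A minimal sketch, assuming the default port 1234 and using a placeholder model identifier (adjust both to whatever LM Studio shows for your setup):

Code:
# Minimal sketch: ask a question via LM Studio's local OpenAI-compatible server.
# Assumes the server has been started in LM Studio and is on the default port;
# the model identifier below is a placeholder.
import json
import urllib.request

URL = "http://localhost:1234/v1/chat/completions"
payload = {
    "model": "meta-llama-3.1-70b-instruct",  # use the identifier LM Studio shows for your loaded model
    "messages": [
        {"role": "system", "content": "You are a helpful Linux tutor."},
        {"role": "user", "content": "How do I see which process is listening on port 80?"},
    ],
}
req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)
print(reply["choices"][0]["message"]["content"])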

While it's not as fast as the online variants, it has so far been able to answer all my questions with a high level of technical accuracy, and it is excellent at suggesting possible fixes for issues I have encountered.

Best of all, it's completely free!
 
An RTX 4090 has 24GB of VRAM, and Llama 3.1 70B at 4-bit quantization is around 40GB, so it doesn't fit entirely unless you drop to 2-bit quantization (which comes in at around 19-27GB depending on how the quantization is done), and at that point the quality suffers a lot. The model is also almost two years old now, so it's a bit outdated on some things.
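The sizes are easy to sanity-check with a rough rule of thumb: file size ≈ parameters × bits per weight / 8. A sketch, where the bits-per-weight figures are approximate effective values for common GGUF quants rather than exact numbers:

Code:
# Rough rule of thumb for quantized model size: parameters x bits-per-weight / 8.
# The bits-per-weight values below are approximate effective figures for common
# GGUF K-quants (mixed precision plus scales), so treat the output as ballpark.
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

for name, bpw in [("Q2_K", 2.6), ("Q3_K_M", 3.9), ("Q4_K_M", 4.8), ("Q6_K", 6.6)]:
    print(f"Llama 3.1 70B at {name} (~{bpw} bpw): ~{quantized_size_gb(70, bpw):.0f} GB")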

What tokens per second are you getting with that? I would assume 10-15 t/s with the huge bandwidth of the 4090's VRAM. I tried giving it a test run on a Framework Desktop and I wasn't satisfied; it ran a bit too slow for my liking, coming in at about 4 t/s.

Models around 30B parameters at 4-bit released in the past year should be on par with, if not better than, Llama 3.1 70B at 2-bit, so try giving those a go. Because they're smaller they'll run much faster, a lot more so if you're using MoE/sparse models, and you'll be able to fit them entirely within the VRAM without needing to go below 4-bit quantization. You can even go up to 6-bit with some of these models and still have everything fit in VRAM if the context length isn't too big.
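To see why fitting everything in VRAM matters so much: for a dense model, generating each token means streaming roughly the whole set of weights from memory, so the slowest memory tier dominates. A back-of-the-envelope sketch, where the system RAM bandwidth and the model splits are illustrative assumptions (real speeds come in lower because compute, KV cache and PCIe traffic are ignored):

Code:
# Back-of-the-envelope token rate for a dense model: each generated token reads
# roughly the full weight set once, so time per token is the sum of the time
# spent streaming weights from each memory tier. Illustrative numbers only.
VRAM_BW_GBS = 1008.0   # RTX 4090 spec memory bandwidth
RAM_BW_GBS = 80.0      # assumed typical desktop system RAM bandwidth

def est_tokens_per_sec(gb_in_vram: float, gb_in_ram: float) -> float:
    seconds_per_token = gb_in_vram / VRAM_BW_GBS + gb_in_ram / RAM_BW_GBS
    return 1.0 / seconds_per_token

# A ~40 GB 4-bit 70B split across 22 GB of VRAM + 18 GB of system RAM:
print(f"70B, partially offloaded: ~{est_tokens_per_sec(22, 18):.1f} t/s")
# A ~18 GB 4-bit 30B-class model held entirely in VRAM:
print(f"30B, fully in VRAM:       ~{est_tokens_per_sec(18, 0):.0f} t/s")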

With all that said, having the ability to run large models locally and not having to rely on things being run online on someone else's computer is a wonderful thing.
 
This is what I downloaded:

Tensorblock: Meta Llama 3.1 70B Instruct GGUF, Q3_K_M, 34.27 GB (last modified 10th Feb 26).

When loaded I am seeing 20GB of VRAM being used.

I am very new to this self-hosting stuff.

No idea how to see tokens per second.
 
Ah okay, so you're fitting a good amount of it in VRAM with the rest being offloaded to system RAM; that slows it down a lot, especially with a dense 70B model.

You can find tokens per second at the bottom of the responses you get, it should be next to a small stopwatch/timer icon.

You can run even bigger models like Qwen3 Next or GPT-OSS 120B and still get good speed; give those a try and see how they run.
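If you'd rather log the speed from a script than read the stopwatch figure in the UI, a rough sketch against LM Studio's local OpenAI-compatible server (assuming the default port, a placeholder model name, and that the response includes the usual usage field) is to time the request and divide by the completion token count:

Code:
# Rough tokens-per-second measurement against LM Studio's local server.
# Measures the whole request (including prompt processing), so it slightly
# underestimates pure generation speed. Port and model name are placeholders.
import json
import time
import urllib.request

URL = "http://localhost:1234/v1/chat/completions"
payload = {
    "model": "openai/gpt-oss-120b",  # placeholder; use your loaded model's identifier
    "messages": [{"role": "user", "content": "Explain systemd units in two sentences."}],
}
req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
start = time.perf_counter()
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)
elapsed = time.perf_counter() - start

tokens = data["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} t/s")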
 
That is really slow, but kind of usable depending on your use case I guess.

With the models I mentioned in previous posts, you should get anywhere from 20 to 80+ tokens per second, which is significantly faster.
 
I've been trying to use Copilot recently and it has been generally useless for getting answers to anything code-related. Even when asking about specific versions, it gives odd, outdated answers.
 
That is really slow, but kind of usable depending on your use case I guess.

With the models I mentioned in previous posts, you should get anywhere from 20 to 80+ tokens per second, which is significantly faster.
Thank you!

Qwen3 Next got me up to 6.4 T/S.

GPT-OSS 120B: 13.8 T/S at low reasoning, a massive difference.

It feels as fast as what I was getting using online AI.
 
Finally found a use for all these CPU cores and memory :-)

I have assigned 28 logical cores to LM Studio. It didn't appear to boost performance, but stuff is happening on all 18 physical cores.

Is there anything in particular that I should change in the model settings other than CPU thread pool size?

I would like to upload an image of my settings but I am not sure what picture upload sites still work in the UK.
 
I usually leave CPU thread pool size alone. You can adjust the GPU offload amount to try and put a bit more of the model inside the GPU's VRAM and speed it up a bit; with 24GB of VRAM you can push the slider to around 22-23GB.
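However your version of LM Studio expresses the offload setting, the arithmetic behind it is the same: divide the VRAM budget by the size of one layer. A rough sketch, where the equal-layer assumption, the headroom figure and the ~80-layer count for Llama 3.1 70B are my assumptions:

Code:
# Rough sketch: how many transformer layers of a quantized model fit in a given
# VRAM budget, assuming layers are roughly equal in size and leaving headroom
# for the KV cache and other GPU overhead.
def layers_that_fit(model_file_gb: float, total_layers: int,
                    vram_gb: float = 24.0, headroom_gb: float = 2.5) -> int:
    gb_per_layer = model_file_gb / total_layers
    return min(total_layers, int((vram_gb - headroom_gb) / gb_per_layer))

# The Q3_K_M 70B file from earlier in the thread (34.27 GB, ~80 layers):
print(layers_that_fit(34.27, 80))   # -> around 50 of the 80 layers on the GPU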

I'm assuming you're running this stuff on the system with the specs in your signature? With DDR4 RAM, offloading the remainder to system RAM slows things down a bit more compared to DDR5.
 
Yeh, the one in my signature. It's DDR4 but quad channel, so I think that means it has the equivalent throughput of ~5333MT/s dual-channel memory, although granted that's at the low end compared to most DDR5.
 
I loaded the 20B parameter GPT model and I am getting 140T/S :-)

The thought had occurred to me that it could be a PCIe Gen3 x16 limitation.

edit:
Code:
Spec                             Rough bandwidth
Quad-channel DDR4-2667 MT/s      ≈85 GB/s
PCIe Gen 3 x16                   ≈15.75 GB/s per direction (a bit less after overheads)
RTX 4090 VRAM (384-bit GDDR6X)   ≈1008 GB/s

I am not 100% sure this maths works, but it probably explains the poor performance: anything sitting in system RAM or crossing the PCIe bus is read far more slowly than the weights in VRAM.
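For anyone who wants to reproduce the table, the figures fall out of simple products (a sketch using spec numbers, so real-world figures come in somewhat lower); it also confirms the earlier point that quad-channel DDR4-2667 lands in the same ballpark as dual-channel DDR5-5333:

Code:
# Peak bandwidth back-of-the-envelope: channels/lanes x data rate x bytes per transfer.
def ddr_bandwidth_gbs(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000                  # 64-bit (8-byte) channels

def pcie3_x16_gbs(lanes: int = 16) -> float:
    return lanes * 8 * (128 / 130) / 8                     # 8 GT/s per lane, 128b/130b encoding, per direction

def gddr6x_gbs(bus_bits: int = 384, gbps_per_pin: float = 21.0) -> float:
    return bus_bits / 8 * gbps_per_pin                     # RTX 4090: 384-bit bus at 21 Gbps

print(f"Quad-channel DDR4-2667:   {ddr_bandwidth_gbs(4, 2667):.0f} GB/s")
print(f"Dual-channel DDR5-5333:   {ddr_bandwidth_gbs(2, 5333):.0f} GB/s")
print(f"PCIe Gen 3 x16 (per dir): {pcie3_x16_gbs():.1f} GB/s")
print(f"RTX 4090 VRAM:            {gddr6x_gbs():.0f} GB/s")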
 
Unless you're gonna be running LLMs 24/7 and constantly asking them things, I'm almost certain it's cheaper compared to spending hours gaming each day, and certainly a lot cheaper than mining for whatever cryptocurrency (if GPU mining is even viable nowadays, which I don't think it is, but I haven't been involved with any of that in years).

Edit: I just realized you quoted the part where it said "completely free", you're not even wrong. :D
 
My thoughts on the matter (feel free to correct me if I am wrong)

I checked HWiNFO and the 4090 with LM Studio running idle is at 65W; when generating a response it momentarily hits 120W before settling down to 75W until done. The card doesn't even get hot, staying at around 38C. I imagine the bottlenecking of my PCIe Gen3 bus is slowing things down when running gpt-oss-120b across VRAM and system RAM @ 15T/S.

When using the gpt-oss-20b GGUF in VRAM only, things get real, hitting 270W, but only for a few seconds as the response generates @ 181 T/S.

Overall both models probably use a similar amount of energy over time; one just gives out an answer faster. I'm not sure if either is more efficient as they are different models, but at idle there is not much overhead when not running a query.

My 4090 idles at 62W with no apps running, as reported by HWiNFO.
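Putting rough numbers on the energy question (a sketch using the generation wattages and token rates above; CPU and system power for the offloaded layers of the 120B model isn't captured by the GPU sensor, and the answer length is an assumption, so treat it as very approximate):

Code:
# Very rough GPU energy per response: watts x (tokens / tokens-per-second).
# Uses the generation-phase wattages and token rates reported above; CPU power
# for the offloaded layers isn't measured, so this flatters the bigger model.
def joules_per_response(tokens: int, tok_per_sec: float, gpu_watts: float) -> float:
    return tokens / tok_per_sec * gpu_watts

RESPONSE_TOKENS = 500   # assumed typical answer length
for name, tps, watts in [("gpt-oss-120b (partly offloaded)", 15.0, 75.0),
                         ("gpt-oss-20b (VRAM only)", 181.0, 270.0)]:
    joules = joules_per_response(RESPONSE_TOKENS, tps, watts)
    print(f"{name}: ~{joules:.0f} J (~{joules / 3600:.2f} Wh) per {RESPONSE_TOKENS}-token answer")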
 
MoE/sparse models like GPT-OSS are much more efficient since not every parameter is activated for each token; it's dense models that push GPUs hard, since all parameters being loaded and activated, plus slower token generation, means the GPU is working harder for longer.
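To make the dense-vs-MoE difference concrete, a toy comparison of how many bytes of weights actually get touched per generated token; the 5% active fraction is an illustrative assumption rather than the exact figure for any particular model:

Code:
# Toy comparison: weight bytes streamed per generated token for a dense model
# vs a sparse/MoE model where only a fraction of the parameters are active.
BYTES_PER_WEIGHT = 0.5   # ~4-bit quantization

def gb_touched_per_token(total_params_billions: float, active_fraction: float = 1.0) -> float:
    return total_params_billions * active_fraction * BYTES_PER_WEIGHT

print(f"Dense 70B:            {gb_touched_per_token(70):.1f} GB per token")
print(f"MoE 120B (5% active): {gb_touched_per_token(120, 0.05):.1f} GB per token")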
 