Anyone else use AI to learn Linux?

RavenLunatic · 15 Feb 2026 at 12:43

I have been using AI constantly over the past year helping me learn Linux and it is truly amazing. I mainly use the MS Coplot Linux app or Duck Duck Go's AI features between them and can usually point me in the right direction whenever I hit a wall.

I wanted to move away from online AI as my questions and answers are probably being data mined by Microsoft as they do with everything these days.

Since I have a RTX4090 with tensor cores doing nothing for years I dipped my toes into running a LLM locally on my PC.

It turns out that it is really easy to do. Download the programme LM Studio choose the model you what to use from a huge list, and you are done, just ask away. I chose Meta Llama 3.1 70B Instruct as it fits perfectly within the VRAM of my 4090

While it’s not as fast as the online variants it has so far been able to answer all my questions with a high level of technical accuracy and excellent at suggesting possible fixes for issues I encountered.

Best of all its completely free!

InfiniteNoise · 15 Feb 2026 at 18:50

RTX 4090 has 24GB of VRAM, Llama 3.1 70B at 4 bit quantization is around 40GB, it doesn't fit in it entirely unless you're using 2 bit quantization (which comes in at around 19-27GB depending on how the quantization is done), at that point the quality suffers a lot, it's almost about 2 years old now so it's a bit outdated on some things.

What's the tokens per second you're getting with that? I would assume 10-15 t/s with the huge bandwidth on 4090's VRAM, I tried giving it a test run on Framework Desktop and I wasn't satsifited, it ran a bit too slow for my liking, came in at about 4 t/s.

Models around 30B parameters at 4 bit released in the past year should be on par, if not better, than Llama 3.1 70B (if it's at 2 bit), try giving these a go, because they're smaller, it'll run much faster, a lot more so if you're using MoE/sparse models and you'll be able to fit these entirely within the VRAM without needing to go below 4 bit quantization, can even go up to 6 bit with some of these models and still have everything fit in the VRAM if the context length isn't too big.

WIth all that said, having the ability to run large models locally and not having to rely on things being run online on someone else's computer is a wonderful thing.

RavenLunatic · 15 Feb 2026 at 19:08

This is what I downloaded :

Tensorblock : Meta Llama 3.1 70B Instruct GGUF Q3_K_M Download • 34.27 GB last modified 10th Feb 26.

When loaded I am seeing 20GB VRAM being used.

I am very new to the self hosting stuff

No idea how to see tokens per second

InfiniteNoise · 15 Feb 2026 at 19:12

Ah okay so you're fitting a good amount of it in the VRAM with the rest being offloaded to system RAM, so that slows it down a lot, especially with a dense 70B model.

You can find tokens per second at the bottom of the responses you get, it should be next to a small stopwatch/timer icon.

You can run even bigger models like Qwen3 Next or GPT-OSS 120B and still get good speed, try giving these a go and see how it goes.

RavenLunatic · 15 Feb 2026 at 19:19

2.6 Tokens per second, I did say it was slower :-)

I need to dive a bit deeper

InfiniteNoise · 15 Feb 2026 at 19:34

That is really slow, but kind of usable depending on your use case I guess.

With the models I mentioned in previous posts, you should get anywhere from 20 to 80+ tokens per second, which is significantly faster.

img · 15 Feb 2026 at 20:49

Been trying to use copilot and generally useless to get answers to anything code related recently. Even asking for specific versions its giving odd old answers

RavenLunatic · 15 Feb 2026 at 21:00

InfiniteNoise said:
That is really slow, but kind of usable depending on your use case I guess.

With the models I mentioned in previous posts, you should get anywhere from 20 to 80+ tokens per second, which is significantly faster.

Thank you!

Qwen3 Next got me up to 6.4T/S

GPT-OSS 120B 13.8T/S at low reasoning a massive difference.

Its feels as fast as I was getting using online AI.

RavenLunatic · 15 Feb 2026 at 21:20

Finally found a use for all these CPU cores and memory :-)

I have assigned 28 logical cores to LM Studio didn't appear to have boosted performance, but stuff is happening on all 18 physical cores.

Is there anything in particular that I should change on the model settings other than CPU thread pool size.

I would like to upload an image of my settings but I am not sure what picture upload sites still work in the UK.

InfiniteNoise · 15 Feb 2026 at 21:46

I usually leave CPU thread pool size alone, you can adjust GPU offload amount so you can try and put a bit more of the model inside the GPU's VRAM to speed it up a bit, with 24GB VRAM, you can adjust it to 22-23GB on the slider.

I'm assuming you're running these stuff on the system with the specs in your signature? With DDR4 RAM it does slow it down a bit more when offloading the remaining to system RAM compared to DDR5.

RavenLunatic · 15 Feb 2026 at 22:07

Yeh, the one in my signature. Its DDR4 but quad channel, so I think it means it has the equivalent throughput of 5332Mz dual channel memory, although granted it's at the low end than most DDR5.

RavenLunatic · 16 Feb 2026 at 07:41

I loaded the 20B parameter GPT model and I am getting 140T/S :-)

The thought had occurred to me that it could be a PCIe Gen3 x16 limitation.

edit:

Code:

Spec                            Rough bandwidth  
Quad‑channel DDR4‑2667 MT/s     ≈85 GB/s
PCIe Gen 3 ×16 (effective)      =25–28 GB/s (~15.6 GB/s less overheads each way)
RTX 4090 (384‑bit bus)            =117 GB/s

I am not 100% sure this math's work but it probably explains the poor performance.

Cromulent · 20 Feb 2026 at 17:00

RavenLunatic said:
Best of all its completely free!

Say that again when you get your next electricity bill.

InfiniteNoise · 20 Feb 2026 at 18:51

Unless you're gonna be running LLMs 24/7 and constantly asking them things, I'm almost certain it's cheaper compared to spending hours gaming each day, and certainly a lot cheaper than mining for whatever cryptocurrency (if GPU mining is even viable nowadays, which I don't think it is, but I haven't been involved with any of that in years).

Edit: I just realized you quoted the part where it said "completely free", you're not even wrong.

RavenLunatic · 20 Feb 2026 at 20:42

My thoughts on the matter (feel free to correct me if I am wrong)

I checked HWINFO and the 4090 with LM Studio running idle is 65W, when generating a response momentarily hits 120W before settling down to 75W until done. Card doesn't even get hot, stays at around 38C. I imagine the bottle necking of my PCIe Gen3 Bus is slowing things down when running gpt-oss-120b over VRAM and system RAM.@ 15T/S

When using the gpt-oss-20b-GGUF in VRAM only things get real, hitting 270W but only for a few seconds as the response generating @ 181 T/S

Overall both models probably use a similar amount of energy over time just one gives out an answer faster, not sure if either us more efficient as they are reading from different models, but at Idle there is not much overhead when not running a query.

My 4090 Idles at 62W with no apps running as reported by HWINFO.

InfiniteNoise · 20 Feb 2026 at 21:17

MoE/sparse models like GPT-OSS are much more efficient since not every parameters are activated, it's dense models where they can push GPUs hard, all parameters being loaded + activated and slower token generation means GPU is working harder for longer.

Buffalo2102 · 22 Feb 2026 at 09:20

I have had Ollama set up on an old system of mine for some time now. It's only a lowly AMD 3600G withg 32GB RAM and a RX 6900 XT with 16GB VRAM. It's not exactly fast but it's still very usable with some of the 7b or 8b models.

I have also set up ComfyUI and have been playing with image generation and editing.

It's a bit more difficult to set up for an AMD GPU on Linux but it's just the sort of tinkering I enjoy.

keyser van someone · 23 Feb 2026 at 21:57

I was on Gemini trial and it spat out a random image I didn't ask for but starting playing it with it (maybe that was the idea), I can see the appeal with this type of stuff: type something and get a result nearly instantly, but quickly run into it's limits and just plain ignoring me.

This is where all the worlds RAM is going (damn). Local AI is a lot of fun tinkering but limited to what I can run on 16GB VRAM.

but yeah learning Linux on Gemini and learning not to trust everything it says!

BogEyes · 1 Mar 2026 at 22:44

I'm using a combination of Gemini and LM studio to try to learn linux/ubuntu (among other things) too

I'm finding it really engaging, and just a generally useful tool.

It's certainly helping me navigate several ubuntu installations as a beginner!