RTX 3090 FE HELP! GPU overheating.

Devprotim Das · 10 Apr 2024 at 23:34

Pre-amble: My 3090 FE has been running perfectly up until Easter Sunday. I have done no OC on it, left everything at stock settings since I bought it on release day. Average temps used to be 75°C, maybe 82°C on really hot summer days (UK summer averages around 25-30°C), under heavy gaming load. I played Dragon's Dogma 2 that Sunday with DLSS 'balanced' at 4K and all other settings cranked to max, no issues. Solid 80+fps for those who care.

Actual Issue: On Easter Monday the fans on the GPU started to go nuts as soon as I started up a game. A quick look at the hardware monitor showed that the temps were hitting 85°C during gaming. Right now the UK is in springtime, with ambient temps averaging 10-15°C. This is happening with pretty much all the games I tried (Dragon's Dogma 2, Jedi Survivor, BG3, Fortnite). The PC was cleaned back in December, but I gave it another thorough cleaning with the compressor. I tried turning all the settings down in games and even running them in windowed mode at 1080p resolution. Still the same thing. In fact the temps climbed up to 88°C at one point. All the while the fans at stock curve settings are ramping up to 2200+RPM and sound like jet engines. I downloaded MSI Afterburner for the first time and undervolted the card. Still no change.

EDIT: I rolled back the Windows updates and Nvidia driver versions before the next couple of steps. Needless to say, those steps didn't work.

Since it's been over three and a half years and Nvidia's warranty has expired, I repadded and repasted the GPU. There was the thinnest layer of dust on the board so I cleaned that and the fins+fans out (since I had it all apart anyway, it made sense). There was no change. Temps didn't get worse, they didn't get any better. Now the PC case itself didn't feel that much hotter under load so I borrowed a thermal imaging camera to get an idea of what was happening. I've linked the Google Drive folder that has screencaps of the hardware monitor compared to the thermal images I took.
https://drive.google.com/drive/folders/1ooHTvUY-GTGxWdBb9xXxxgD2F1n2M1xF?usp=sharing

NOTE 1: On the thermal images, the green crosshair reading is top left. Red crosshair is maximum temp reading (on the image somewhere), shown at the bottom left.
NOTE 2: My case is inverted. Just in case the photos throw people off. And yes, I did wonder if the upside down layout could have damaged the fan bearings, but surely that's a moot point since even the right way up one of the fans is facing the bottom anyway?

Basically what I'm seeing on the thermal camera is that there is a drastic temperature difference between what the hardware monitor is reading and what the camera is showing me (under load). Like GPU temp is 86°C but hottest point reading on thermal is 61°C.
When idling the two readings are fairly similar.

I'm hoping someone more knowledgeable can shed some light. Is it a faulty temp sensor? Did the fan bearings get damaged overnight? Did a solder joint fail somewhere on the board?

The suddenness with which this has happened is worrying. If these are the early signs of an inevitable GPU death then I need to start making plans for what to do if it fails and for an unplanned system upgrade for the 50 series (first world problem I know).

Thank you to anyone who's read up to here.

Devprotim Das · 10 Apr 2024 at 23:46

Journey said:
Assume you have tried a different OS build with various driver revisions to rule software out?

Ah yes forgot to say that in the original post but I did try that before repasting+repadding.

Devprotim Das · 11 Apr 2024 at 10:02

Journey said:
Does anything change if you try an undervolt curve in Afterburner, so bring the card down to 900-925mV? Try it if you have not, see if it remains hot and loud, check the power draw as well if you can.

No it doesn't. It's currently undervolted to 900mV and there's been no difference. I haven't paid attention to the power draw before the issues but right now I've seen that it consistently draws 330W+. Not sure if that's higher than usual.
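For anyone wanting to compare power draw before and after a change like an undervolt, a small logging sketch may help. It assumes `nvidia-smi` (installed with the Nvidia driver) is on the PATH; `parse_sample`, `read_sample` and `log_samples` are hypothetical helper names, and the numbers in the usage note are made-up sample values. As far as I know `nvidia-smi` does not report the hotspot temperature, so that still needs GPU-Z or a similar monitor.

```python
import csv
import io
import subprocess
import time

# Standard nvidia-smi --query-gpu properties (core temp in °C, board power in W,
# fan speed in %, SM clock in MHz).
QUERY_FIELDS = ["temperature.gpu", "power.draw", "fan.speed", "clocks.sm"]


def parse_sample(line):
    """Parse one line produced with --format=csv,noheader,nounits."""
    values = [v.strip() for v in next(csv.reader(io.StringIO(line)))]
    # Note: float() will raise if the driver reports a field as "[N/A]".
    return dict(zip(QUERY_FIELDS, (float(v) for v in values)))


def read_sample():
    """Query the first GPU once and return its readings as a dict."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])


def log_samples(count=10, interval_s=1.0):
    """Print `count` readings, one every `interval_s` seconds."""
    for _ in range(count):
        print(read_sample())
        time.sleep(interval_s)
```

Calling `log_samples()` while a game is running prints one reading per second; `parse_sample("85, 331.52, 100, 1850")` shows the shape of a single reading without needing a GPU.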

Devprotim Das · 11 Apr 2024 at 15:33

Thanks for replying in detail

Tetras said:
Depending on where the temperature sensor is located, that's not unusual, because you can't see the hottest part of the die and the heatsink and fan assembly should cool what you can see.

Yeah that makes sense. It's just that the delta seemed quite high (more than what I was expecting) and I've never been too attentive to GPU designs and thermal emissions.

Tetras said:
You definitely used pads that were the same size, right? There can be big temperature problems if you're out even a mm, due to poor contact.

I used pads that are 1.5mm thick and a slightly modified layout (adding pads to places that originally didn't have pads). Original pads were 1mm. The re-padding guides I saw all mentioned that .5mm extra gives a better contact and that the FE's pad layout didn't cover some memory modules that benefit from it. My temps didn't get worse since re-padding. They just stayed the same.

Tetras said:
Are you sure that the games you used to play are similarly demanding and you were using the same settings?

Fortnite is the one constant measuring stick. Since getting the 3090 the graphics settings have always stayed the same, and there was never any loud fan noise or temp spikes in the past three and a half years. It is there now whenever I play Fortnite, even on lowest settings, 1080p windowed and framecapped to 30fps.
I had also been playing Dragon's Dogma 2 just fine a week prior to the issues(?) commencing.

Tetras said:
Fan bearings can deteriorate faster due to orientation, but running upside down shouldn't be an issue and it doesn't sound like your bearings are the problem. Heat pipes and vapour chambers can also be affected, though if your case always ran this way up then I would assume it doesn't matter.

Yes interesting, vapour chambers+heat pipes makes sense and I didn't think about that at all. Though I don't know how to diagnose them to be faulty so I can't rule them out for sure.

Tetras said:
Do all the fans in your case still work, you haven't enabled some kind of quiet/silent mode?

Yes they still do. I have five (I know that quantity is at the limit of diminishing returns). They were the first things I checked when I heard the GPU fans ramp up. They were on adjusted fan curves to run slightly quieter, which I turned up to turbo mode to see if the extra airflow would affect the GPU temps, but it didn't. I have two for intake and three for exhaust.

Devprotim Das · 11 Apr 2024 at 18:11

Tetras said:
You could try laying the case on the side and see if it makes any difference, though it wouldn't identify if they were faulty, just if the orientation matters.

Skilid said:
I was also going to suggest trying a different orientation for the case. Some GPUs have way better thermals when oriented differently due to the way the heatpipes are facing. Might be worth a shot.

Thank you for the suggestions. I turned the case upside down and gave that a try. I don't think it made an appreciable difference. Everything (bar hotspot temps which got hotter at 103.8°C) was 1°C cooler. But that could just as well be due to me not playing something long enough for it to build up enough heat.

Screen cap and thermal captures are uploaded in a new folder named "case upside down".
https://drive.google.com/drive/folders/1ooHTvUY-GTGxWdBb9xXxxgD2F1n2M1xF?usp=sharing

I'll see how defeated I feel after dinner and decide whether to run something with the case on its side.

Skilid said:
Would it be possible to find a broken 3090 or just a 3090 fe cooler (without the PCB) to put on it and compare?

Perhaps, I have a colleague who has a 3090 FE so I guess I could invade his place and see what his card does under load. Though he's more into VR and flight sims, I'd expect very similar GPU behaviour. Or at least similar to mine under normal circumstances.

Devprotim Das · 11 Apr 2024 at 20:01

Kirby Wurm said:
Are the clocks throttling? I'm not sure what the limits are on the 3090 but if it's reaching the max allowed then the clock behaviour could help confirm whether it's a sensor issue or not at least.

Edit: on second thoughts if the sensor is faulty then the card would behave accordingly I imagine, so not a very helpful suggestion.

I've not noticed any gameplay signs (like the game hitching or stuttering). But since I've never experienced throttling myself I don't really know what to look out for in a game.

In terms of the clock speed, I've not paid much attention to it. But GPU-Z (the Google Drive folder has screencaps) does show that performance is being capped for thermal reasons (under load).

Second thoughts or not I appreciate the input. I'm the 'I can put it together' kind of guy, sadly not a 'I can fix it' kind :cry:

Devprotim Das · 11 Apr 2024 at 21:00

Finners said:
Have you tried running it with the side of the case off entirely? The temps all look normalish apart from the fan speeds needed to achieve them. It's hard to tell but your CPU looks on the hot side too. On that GPU-Z screenshot it's showing 60°C and looks to be a good bit lower than the rest of the graph, so you might be hitting 90°C plus on that.

Did you ever change the thermal pads/paste before this happened or only after to try and fix it?

I did do a couple of tests with the side panel off (before I borrowed the thermal camera) and didn't see any difference in temps.

And no, the thermal pads/paste have been whatever it shipped with until last Wednesday.

Finners said:
Another thing on that Afterburner skin: how do you control the temp limit/power limit priority? On the Cyborg skin I use there is a little option between the two sliders to prioritise power limit over the 83°C limit, and this is the default. Yours seems to be trying its hardest to keep 83°C, like it's been switched round.

I didn't even know I could set a temp limit O_O. Though thinking about it now it makes sense. I only downloaded Afterburner to try to diagnose the issue and then tried to undervolt the card to reduce high temps (and adjust GPU fan curves too). I'll try setting a temp limit and see how I get on, thank you!

Kirby Wurm said:
The clock speed (1850ish if I was looking at the right screenshot?) looks pretty normal from what I know of the FE (somewhat regret selling the one I snagged at MSRP a couple of years ago).

Good to know that the clock speed is operating within a normal range. I think you probably were looking at the correct one, but I had three different hardware monitors running so I'm not entirely sure :cry:

Devprotim Das · 11 Apr 2024 at 21:14

Dg834man said:
You might not be stuttering, but your HWMonitor screen is indicating you're thermal throttling. Might be worthwhile trying ptm7950. I just slapped it on my 3070 FE as I was fed up of repasting due to pump out, and my card was still thermal throttling with fresh paste despite my GPU and hotspot temps being under the limit. It doesn't thermal throttle with ptm7950 and temps are improved. Pretty sure my card was cooler when it was newer; maybe heat pipes lose liquid over time and become less effective?

Interesting. Again, not an expert on GPU cooling systems, but I would have expected some liquid loss in the vapor chamber over time; three years seems rather quick though. I didn't see any damage on the pipes myself but that is not to say it couldn't be there.

It is tempting to try ptm7950 given what you are saying. Though my current position is that I don't want to take the GPU apart a second time unless I really have to. If somehow through everyone's help I can identify the root cause, then depending on what it is I may want to sell the card. So I'd rather not risk putting marks on the screws or snapping a ribbon cable :eek:

Devprotim Das · 12 Apr 2024 at 00:41

Apologies to all three for the late night notifications (if you are Europe/UK based)

@Tetras @Skilid I tried turning the case sideways. Again not any appreciable difference on the readings and thermals. Thank you for the suggestions though as I wouldn't have come up with it on my own!

@Finners Thank you for pointing out that I can prioritise temp vs power in Afterburner. All my tests following people's suggestions from the thread were done without forcing user presets, but just to be sure I uninstalled Afterburner and restarted before taking a base test. GPU-Z and CPUID both gave me the same base results (85°C, crazy fans, the lot).
Then I reinstalled afterburner, set priority to temperature and experimented with temp limits.

I had to bring the temperature limit down to 70°C, which also brought the power limit down to 48%. Only then did the GPU fans start to sound like they normally do (around 1500RPM). All other temperatures started to look normal too.
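As a rough sense of scale for that 48% figure, the arithmetic can be sketched as below. The 350W value is an assumption (the commonly quoted stock board power limit for the 3090 FE), not something read off this card, and the function names are hypothetical.

```python
# Assumption: 350 W is the stock (100%) power limit of a 3090 FE.
STOCK_POWER_LIMIT_W = 350.0


def watts_at_limit(percent, stock_w=STOCK_POWER_LIMIT_W):
    """Convert an Afterburner-style power limit percentage to watts."""
    return stock_w * percent / 100.0


def percent_of_limit(watts, stock_w=STOCK_POWER_LIMIT_W):
    """Express a measured power draw as a percentage of the stock limit."""
    return watts / stock_w * 100.0


print(watts_at_limit(48))                # 168.0
print(round(percent_of_limit(330), 1))   # 94.3
```

Under that assumption, the 48% limit works out to roughly half the 330W+ seen earlier in the thread, which would fit with the lower frame rates.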

Though I still don't know what the root cause is, at least this will allow me to continue gaming. 40-60FPS may not be pushing the card but it's reliable and plenty good for the types of games I play.

Temps and thermals for any who're interested ("Temp Restricted" folder)
https://drive.google.com/drive/folders/1ooHTvUY-GTGxWdBb9xXxxgD2F1n2M1xF?usp=sharing

I'm still going to try and figure this out for about another week. So any input or questions are very welcome. Current plan is that if I can't identify the root cause and/or fix the issue, I'll sell the card and get a 3080/3070 with the money. That should see me through until the next gen GPUs are released.

Devprotim Das · 12 Apr 2024 at 09:23

cjgardens said:
Them extra pads might make it sit not right. I would remove them and put the right size pads in also.

I did consider that. Before screwing everything together I did a pre-fit test to see what the pressure did to the pads. All the pads got squashed and spread out like the original pads. My reasoning is that it means everything is making good contact with the vapor chambers/plates so I think I'm ok on that front(?). Given that they didn't make the recorded temps worse, my thoughts are that the pads and paste may not be the issue.

Sadly I didn't take thermal images before repadding so I can't be confident in that. I don't want to be constantly taking the GPU apart at this point since it seems quite likely I'll sell it. I don't want to risk damaging the delicate fan cables. Basically I want to rule everything else out before opening it up again. But if I do I'll re-pad with 1mm pads.

Devprotim Das · 12 Apr 2024 at 11:28

@Finners Yes I think I understood that from your original post. And yes it is just masking the issue, which is why I'm still keen on figuring out the root cause.

That is very useful information on GPU core vs GPU hotspot delta and die shape!

Just to clarify for my benefit, what's the correlation between the pad heights for memory modules and the paste on the GPU die? I assume you mean that my pad heights are probably fine but the die might pump out the paste (diagnosed by the aforementioned temperature delta exceeding 14-15°C?)

Devprotim Das · 15 Apr 2024 at 18:50

Skilid said:
Would it be possible to find a broken 3090 or just a 3090 fe cooler (without the PCB) to put on it and compare?

So I went over to my colleague's to take a look at his system. Luckily we have a fairly similar set up (identical CPU, GPU with similar fan layout). Maybe a tad more dirt in there than mine.

Given where he had his system I couldn't get a proper read on the fins with the thermal camera.

Quick summary is that his memory runs hotter than mine (but mine is recently repadded), whilst his GPU and hotspots run cooler than mine. His fans occasionally ramp to a degree comparable to mine but only momentarily.

So I think the issue for me is on the GPU die somewhere.

Temps & thermals ("Normal 3090 FE" folder)
https://drive.google.com/drive/folders/1ooHTvUY-GTGxWdBb9xXxxgD2F1n2M1xF?usp=sharing

@Dg834man @Finners where did you source your 'ptm7950' from?

Devprotim Das · 15 Apr 2024 at 19:26

tamzzy said:
fans are ramping because of the hotspot temperatures (my 3080 did the same) especially once the hotspot temps cross 100c - tjmax for these cards are 105c
question of why the hotspot temps are well in excess of your gpu max temp (85c vs 103.8c) would be a mounting issue, thermal paste issue or cooler issue
the usual delta is 10-12c

To the point and very clear thank you!

I'm hoping it's paste or mounting since I'm not sure I can address the cooler issue. Fans at high RPM sounded the same as non-malfunctioning 3090's fans (we tested it by ramping it with afterburner), so if it's vapor chamber/heat pipes I don't think I can fix it.
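To put numbers on the hotspot delta being discussed, here is a minimal sketch. The thresholds are simply the figures quoted in this thread (a usual delta of 10-12°C, and 14-15°C mentioned as the point of concern), not an Nvidia specification, and the function names are hypothetical.

```python
def hotspot_delta(core_c, hotspot_c):
    """Difference between the hotspot and core temperature readings, in °C."""
    return hotspot_c - core_c


def classify_delta(delta_c, usual_max_c=12.0, concern_c=15.0):
    """Label a delta using the rule-of-thumb thresholds quoted in the thread."""
    if delta_c <= usual_max_c:
        return "usual"
    if delta_c <= concern_c:
        return "borderline"
    return "elevated"


# Readings quoted above: 85°C core vs 103.8°C hotspot.
delta = round(hotspot_delta(85.0, 103.8), 1)
print(delta, classify_delta(delta))  # 18.8 elevated
```

An "elevated" result only says the die is not shedding heat evenly; it does not distinguish between paste, mounting pressure and the cooler itself.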

Devprotim Das · 15 Apr 2024 at 21:52

Dg834man said:
I also got mine from ali-express. Your hotspot delta was 11-12 degrees normal and nearly 20 degrees when you inverted your case.

Yes. Though since my case layout is inverted, upside down was the right way up to most people :cry:

Devprotim Das · 16 Apr 2024 at 09:49

D1craig said:
It could be the mount. Just try giving all the screws a little extra turn. Otherwise I'd suggest it could be the pads you put on.

Maybe the paste you used is one meant for benchmarking and needs replacing a lot more often than normal mx5 etc.

The memory with the pads I put on is actually running 16°C cooler than my colleague's untouched FE.

As for the screws, I think I'm at the limit of what I can do without risking stripping the threads. But I will definitely keep it in mind for when I apply ptm7950 in a couple of days.

And I used Noctua NT-H2 for the paste

Devprotim Das · 16 Apr 2024 at 14:54

Hewligan said:
I have a 3080 Ti that was displaying a hotspot of 105 degrees C whilst gaming last night. This is with a huge external watercooling setup. I dismantled the waterblock today and sure enough the mounting pressure was incorrect and you could see that clearly with the paste pattern on the die. It may have been the pads being ever so slightly too thick, but they are soft pads so i reapplied paste and went for a tighter mounting. Hotspot is now 55 degrees C in the same game. At a guess i would say my mounting pressure was too light to begin with, and that was probably not squishing the thermal pads sufficiently to get a clear, consistent pressure across the die. Took 20 minutes to fix.

Very good to know! I was going to wait till the phase change pads arrived but will give this a try in the next hour or so

Devprotim Das · 16 Apr 2024 at 18:06

@Hewligan There were signs of pump out when I took the GPU apart just now. The mounting screws were done up as tight as they could be (I checked if they could be done up more before undoing them, and I could feel the 'end' of the threads). There was a good amount of 'adhesive' tension to get the PCB off the heatsink too.

I repasted it and put everything back, so unless I'm totally messing up the remounting screws (which is very possible) I'm not sure mounting pressure is/was the issue. I'm tightening them crosswise, a little at a time.

Right now there doesn't appear to be any improvements but I'll see what it is like for a day or two. By then ptm7950 would be here too so I have one last thing to try.

@Skilid I thought maybe thermal images of his system would help highlight any cooler issues but I couldn't get an angle on the cooling fins. I did ask him if he'd be comfortable with me swapping out the cooler, but right now he's doing a lot of modelling work for our company (we're both engineers), so he just couldn't afford to be GPU-less for a bit or even risk me or him taking his working one apart.

Devprotim Das · 19 Apr 2024 at 13:44

D1craig said:
Did all the thermal pads look squashed down enough when you removed the cooler? I can't remember where I saw it or what GPU it was for, but I saw somebody mention they needed to add them springs you sometimes get on GPU screws to add/increase tension.

They were squashed down to comparable levels. And the memory is running around 20°C cooler compared to a stock 3090 FE so I'm certain the pads are doing their job.

Devprotim Das · 19 Apr 2024 at 16:10

heatonpkmassive said:
The pads cooling well isn't so relevant as if they're slightly too big, they'll still cool the memory very well but they could still keep the heatsink a little too high above the core to be able to cool that sufficiently.

Yes that's been mentioned a couple of times. The mounting screws are bottomed out, they can't be mounted any tighter. I do not think the pads are inhibiting proper contact between the heat sink and the GPU die. The GPU die still has sufficient pressure to pump out the paste.

EDIT: Also it doesn't answer the question why the overheating became an issue overnight.

Devprotim Das · 19 Apr 2024 at 17:24

heatonpkmassive said:
Not sure on why it just became an issue overnight, but the paste squeezing out also can't be taken as a sign that there's good contact. Only the correct thickness pads will guarantee that.

My bad, I didn't clarify that I meant due to thermal pump out. You're right that the paste squeezing out is not enough evidence to suggest a good mounting pressure. But barring torque settings for all the screws on the PCB alongside the exact batch of stock pads used by Nvidia (and probably a step by step guide on disassembly and reassembly), I'm not sure there's any way to guarantee that I've done everything correctly (and even that might not be enough because following instructions also requires some skill).

This entire thread started because I suddenly got temperature spikes and after I repasted and repadded nothing changed. By my measurements the stock pads were approximately 1mm. Given their plasticity I assumed they were thicker 'out of the package'. A general internet search suggested that 1.5mm is the stock thickness used by Nvidia and is what is recommended for replacement pads on the FE (thermalpad.eu, nicehash, general internet teardowns, several modders, including Jayztwocents if that holds any water with anyone). So that's the thickness I used, albeit with Arctic TP-3 pads instead.

That self attempted maintenance did not lower or raise the temps on GPU hotspots. So my query was more to get help to try and diagnose the issue, or apply a suggested solution that lowered the temps without addressing root cause.

The new pads have squashed themselves down to the same level as the stock pads.
And the GPU has been repasted a couple of times. It is possible I messed up the installation and not put on enough pressure the first time and then on second&third attempt I tightened it too much.

Due to seeing no improvement or deterioration of the temps on CPUID and on the thermal camera, I'm not entirely sure that lack of mounting pressure is the cause of the initial spike or of the continuing high temps plaguing the GPU, whether that's because I'm being too ginger or because the pads are too thick (even though general information seems to suggest that they are not).

I applied ptm7950 on Wednesday (it was an eBay purchase so could be a poor clone for all I know). I'm going to give it a week to cycle through. So far it has lowered the temps by 2°C, which is still approximately 15°C higher than what the card typically used to run at.