GTX295 F@H - Help Please

Soldato | Joined: 23 Nov 2002 | Posts: 9,762 | Location: Near Bristol
Hi Guys,

Basically, I sold a BFG GTX295 (single PCB) to someone who uses them for folding. The card had of course been tested and used for gaming, Vantage etc. with zero problems.

The guy got back to me saying: "GPU 0 keeps on generating application errors, but GPU 1 is fine. The application errors are accompanied by a message from Windows that the video card is 'not responding, but has recovered', and a thin black horizontal line which travels quickly vertically down the screen. The card then trashes the unit it was working on, and then carries on. This is an intermittent fault, which may not be visible when playing games."

The card was returned to me, but before I RMA it to BFG I wanted to check that the card was indeed at fault. Being rather new to GPU folding, I followed the instructions here: http://folding.stanford.edu/English/WinGPUGuide, especially the part at the bottom. I connected two monitors to the card to make sure it would use both GPUs.

But I get this issue here: http://www.markljlewis.com/folding295pic1.jpg

It is not the same issue as he described, but similar. One of the cores appears to be folding fine; the other states that the core is not running now. Its progress appears to reset every time I press display, and its speed seems to drop from the value shown to next to nothing. Both GPUs appear to be loaded, although the load values don't tell me much as they could be for either one.

Is there anything I can try before RMAing it? Just in case I've missed something obvious.

Thanks,

Mark
 
Well from following your instructions it appears to be working.

I first tried Furmark in multi-gpu mode with no problems along with a few loops of Vantage.

I then uninstalled the GUI client and disabled SLI, which I think the drivers call multi-GPU mode or something. I installed the console client using your instructions and it appears to be working. I've done two work units, one for each core.

I think I need to do a fair few to be certain.
 
What does EUE stand for?

Is HFM.NET still the best monitoring software? I seem to recall not getting on with it when I last used it.
 
Thanks.

Right I left it running while I was at Kemble this afternoon.

http://www.markljlewis.com/hfm.jpg

Strangely, however, HFM and the Folding@home stats don't match the logs. HFM states 4 WUs completed per core, but going through the logs for both cores using the 100% lines as a reference, it seems that 5 WUs have been completed.

Could someone check my logs to make sure I'm not miscounting or missing something obvious: http://www.markljlewis.com/fahlogs.zip

Thanks guys!
 
The version I'm using at the moment has this:

Add Toggle Switch (F10) to change between 'ClientTotal' or 'CurrentClientRun' Completed Units Count.

If you see my logs, it has done three since the last client start, so that can't be right. It would appear it is set to ClientTotal and it is missing one unit for some reason. Unless of course it counts units after they have been sent, in which case you may be right.

Either way it isn't hugely important, I just wanted to be sure that one of the WU's wasn't failing for some reason. Hopefully it will show the correct number when Folding@Home stats update...
 
Well the Folding@home website is showing 11 WUs, so god knows what's going on:

http://fah-web.stanford.edu/cgi-bin/main.py?qtype=userpage&username=Speed&teamnum=10

Should be no more than 10 according to the logs, but I guess that can't be a bad thing and might be related to when the internet went down or something.

I'm putting it in the storage and download PC in the loft to see how it goes for a longer run. It is hot up there as well, so it should be a good test of the card.
 
Have a look in the View drop-down in HFM; it'll give you an option for the number of completed units either since started or for the life of the client.

Unfortunately the card is installed elsewhere at the moment, but I'll give that a go tomorrow.

As long as it is not showing less than the logs, I think it should be fine.

I'd imagine what you mentioned is also changed by the F10 hotkey as I stated earlier, not that I tried it mind.
 
Ah right, didn't know that. Pretty neat feature for remote viewing, I'll have to set it up.

The loft was 37C at midday today, so I doubt I'll run it for 24 hours, just until the morning or when the temps start getting silly.

Assuming it lasts that long! lol
 
It's really handy. I work on an oil rig for 4 weeks at a time and it's really handy for keeping an eye on things while I'm away. Here's my page so you can see what the output looks like. Free webspace 4TW :D

Be interesting to see how it copes at that sort of temperature.

Nice, wouldn't want to see your electricity bill! lol

I woke up this morning to find that HFM stated that 5 WUs per core had failed; only after I changed the completed setting (F10) did it state that 1 was completed per core. At least it is even and not one core causing a problem, as far as I can tell.

Looking at the logs I have this:

[00:30:21] Folding@home Core Shutdown: UNSTABLE_MACHINE x5

[00:30:28] EUE limit exceeded. Pausing 24 hours.

Full logs can be found here: http://www.markljlewis.com/fahlogs2.zip ; I'd appreciate it if someone knowledgeable could take a look at both.

I'm guessing that temps were not an issue, as they didn't really get above 75C. If I had to guess, I'd say it was down to the latest nVidia drivers, the power supplies or some other problem related to this new setup. It is a completely different system from the one I used previously.
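For anyone wanting to check logs like these without counting by eye, here is a minimal Python sketch. It assumes the log lines look like the two quoted above (a `[hh:mm:ss]` timestamp followed by `Folding@home Core Shutdown: STATUS` or `EUE limit exceeded`). The function name `summarise_log` is made up for illustration, and the sample text is just the two lines from this post, not a full log.

```python
import re
from collections import Counter

# Matches lines such as "[00:30:21] Folding@home Core Shutdown: UNSTABLE_MACHINE"
SHUTDOWN_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\]\s+Folding@home Core Shutdown:\s+(\w+)"
)
# Matches lines such as "[00:30:28] EUE limit exceeded. Pausing 24 hours."
EUE_PAUSE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s+EUE limit exceeded")


def summarise_log(text):
    """Tally core shutdown statuses and collect the times of any EUE pauses.

    Hypothetical helper: the line format is assumed from the excerpts
    quoted in this thread, not taken from any client documentation.
    """
    statuses = Counter()
    pauses = []
    for line in text.splitlines():
        line = line.strip()
        match = SHUTDOWN_RE.match(line)
        if match:
            statuses[match.group(2)] += 1
            continue
        match = EUE_PAUSE_RE.match(line)
        if match:
            pauses.append(match.group(1))
    return statuses, pauses


sample = """\
[00:30:21] Folding@home Core Shutdown: UNSTABLE_MACHINE
[00:30:28] EUE limit exceeded. Pausing 24 hours.
"""

statuses, pauses = summarise_log(sample)
print(dict(statuses))  # shutdown status -> how many times it was logged
print(pauses)          # timestamps at which the client paused itself
```

Running it over the text of each core's log separately would show whether one GPU is logging more failed shutdowns than the other.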

What I don't get is why it did one WU perfectly fine and then had this problem.

In stranger news, the stats page now has me at 23 WUs when I've only done 2 more than last time. :confused:

http://fah-web.stanford.edu/cgi-bin/main.py?qtype=userpage&username=Speed&teamnum=10
 
The stats only update every hour/every three hours and have a time delay of 6 hours.

That doesn't explain why I'm being credited with WUs I never completed.

Everything is at stock, currently running it using a different PSU (instead of the two I used previously), although ambient temps are getting high.
 
Right, had the same issue happen again. It completed one, then had 5 failures, the same as in the previous log. So it isn't the PSU. Temps seem fine: GPU 0 never went over 70C and GPU 1 never went over 63C.

Maybe it is the new drivers, 257.21.
 
Tried the older 197.45 drivers, and it gave an UNSTABLE_MACHINE error before I even got to set up the second GPU console client. So I've removed it from the download PC in the loft; it appears to be creating more questions than answers, so to speak.

I'll install it in my main PC as before, likely tomorrow, and run it for most of the day to see how things go. I'd like to leave it on overnight, but to be honest the card makes so much noise that I wouldn't be able to sleep. Guess I could move it elsewhere.... hmmm
 
That is true, but it doesn't really explain why it happens in one system but not the other. Although I guess it could just be luck of the draw with the WUs it is getting.

I'll try to run it overnight in my main system and see how many WUs it can get through without an error.
 
Right, I ran it overnight until about 30 minutes ago. I had a few issues getting set up, as for some reason the OS wouldn't start properly. A quick reinstall sorted that, and I finally got it folding at about 22:30 after a minor issue with it going to sleep due to the default power settings lol. :o :rolleyes:

Anyways, it appears to have done 20 WUs (10 per core) without any issues! :D
 
lol, Yeah! :D

Unfortunately I can't afford the electric costs to run it on a regular basis.

Might do one more run tonight as it is still set up downstairs.
 
Yeah that is true, going to plug in my cheap-o energy monitor. Would be interesting to see how much it would cost me to run it for 12 hours.
 
Well the CPU wasn't being used for folding, just the GTX295. It worked out at about 330W max and 7.63kWh in 24 hours. So about 87p a day at 11.39p per unit. I'm sure it could be a fair bit less with a different CPU/setup.
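The sums behind that figure, as a quick Python sketch using only the numbers from the energy monitor above. The function name `running_cost_pence` is made up for illustration, and the 12-hour figure assumes the same average draw as over the full 24 hours.

```python
def running_cost_pence(energy_kwh, price_pence_per_kwh):
    """Cost in pence of using energy_kwh units at the given unit price."""
    return energy_kwh * price_pence_per_kwh


energy_kwh = 7.63            # measured over 24 hours
hours = 24
price_pence_per_kwh = 11.39  # price per unit (1 unit = 1 kWh)

daily_cost = running_cost_pence(energy_kwh, price_pence_per_kwh)
average_watts = energy_kwh / hours * 1000

# Assumption: a 12-hour run draws the same average power as the 24-hour run.
twelve_hour_cost = daily_cost * 12 / hours

print(f"Daily cost: {daily_cost:.1f}p")              # about 86.9p, i.e. roughly 87p
print(f"Average draw: {average_watts:.0f}W")         # about 318W, under the 330W peak
print(f"12-hour cost: {twelve_hour_cost:.1f}p")
```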

Anyways, I did 18 hours of folding and got 30 WUs done (15 per core) with 1 failure on GPU 1, but instead of looping the error 5 times it just carried on. The UNSTABLE_MACHINE was in between WUs, so no harm done really.

[06:54:34] mdrun_gpu returned
[06:54:34] Going to send back what have done -- stepsTotalG=0
[06:54:34] Work fraction=0.0000 steps=0.
[06:54:38] logfile size=0 infoLength=0 edr=0 trr=25
[06:54:38] + Opened results file
[06:54:38] - Writing 637 bytes of core data to disk...
[06:54:38] Done: 125 -> 123 (compressed to 98.4 percent)
[06:54:38] ... Done.
[06:54:38] DeleteFrameFiles: successfully deleted file=work/wudata_00.ckp
[06:54:38]
[06:54:38] Folding@home Core Shutdown: UNSTABLE_MACHINE
[06:54:41] CoreStatus = 7A (122)
[06:54:41] Sending work to server
[06:54:41] Project: 5768 (Run 11, Clone 161, Gen 154)

I'm fairly happy that it isn't really a problem and the card can be sold on. I have to say I really got a taste for folding, so maybe when money isn't as much of an issue I'll build a folding rig.
 
Yeah, it would take some serious time to catch many of you.

Might give my 5970 a go at some point, although I hear it is much harder to set up and the PPD is pretty poor.
 