GTX295 F@H - Help Please

Right, had the same issue happen again. It completed one WU, then had 5 failures the same as in the previous log, so it isn't the PSU. Temps seem fine: GPU 0 never went over 70°C and GPU 1 never went over 63°C.

Maybe it is the new drivers, 257.21.
 
Tried the older 197.45 drivers and got an UNSTABLE_MACHINE error before I even got to set up the second GPU console client. So I've removed the card from the download PC in the loft; it seems to be creating more questions than answers, so to speak.

I'll install it in my main PC as before, likely tomorrow, and run it for most of the day to see how things go. I'd like to leave it on overnight, but to be honest the card makes so much noise that I wouldn't be able to sleep. Guess I could move it elsewhere... hmmm
 
Unfortunately, just because it's at stock doesn't mean the errors can't be got rid of by downclocking - it would just mean the graphics card is a little borked :(
However, quite a few people on here, myself included, have had issues with specific types of WU bricking for no apparent reason when the hardware is fine. I can't quite remember why they went away for me, might just have happened mysteriously!
 
That is true, but it doesn't really explain why it happens in one system but not the other. Although I guess it could just be luck of the draw with the WUs it is getting.

I'll try and run it overnight in my main system and see how many WUs it can get through without an error.
 
Right, I ran it overnight until about 30 minutes ago. Had a few issues getting it set up, as for some reason the OS wouldn't start properly. A quick reinstall sorted that, and I finally got it folding at about 22:30 after a minor issue with the PC going to sleep due to the default power settings lol. :o :rolleyes:

Anyways, it appears to have done 20 WUs (10 per core) without any issues! :D
 
lol, Yeah! :D

Unfortunately I can't afford the electricity costs to run it on a regular basis.

Might do one more run tonight as it is still set up downstairs.
 
Yeah, that is true - going to plug in my cheap-o energy monitor. It would be interesting to see how much it would cost me to run it for 12 hours.
 
I think it'll use less than you expect, actually. IIRC an i7 and SLI 280s pull around 600W at full load, which would equate to 6-8 pence per hour.
 
Well, the CPU wasn't being used for folding, just the GTX295. It worked out at about 330W max and 7.63 kWh in 24 hours, so about 87p a day with the price at 11.39p per unit. I'm sure it could be a fair bit less with a different CPU/setup.
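For anyone who wants to sanity-check those numbers, here's a rough sketch of the cost sum in Python (the wattages, hours and the 11.39p tariff are just the figures quoted in this thread, nothing official):

# Rough electricity-cost check for the figures quoted above (illustrative only).
def folding_cost(avg_watts, hours, pence_per_kwh):
    # kWh used, multiplied by the tariff, gives the cost in pence
    kwh = avg_watts / 1000.0 * hours
    return kwh * pence_per_kwh

# 7.63 kWh over 24 h works out to roughly a 318 W average draw
print("24 h run: %.0fp" % folding_cost(318, 24, 11.39))          # ~87p
# The earlier 600 W full-load estimate at the same tariff
print("Per hour at 600 W: %.1fp" % folding_cost(600, 1, 11.39))  # ~6.8p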

Anyways, I did 18 hours of folding and got 30 WUs done (15 per core), with 1 failure on GPU 1 - but instead of looping the error 5 times it just carried on. The UNSTABLE_MACHINE was in between WUs, so no harm done really.

[06:54:34] mdrun_gpu returned
[06:54:34] Going to send back what have done -- stepsTotalG=0
[06:54:34] Work fraction=0.0000 steps=0.
[06:54:38] logfile size=0 infoLength=0 edr=0 trr=25
[06:54:38] + Opened results file
[06:54:38] - Writing 637 bytes of core data to disk...
[06:54:38] Done: 125 -> 123 (compressed to 98.4 percent)
[06:54:38] ... Done.
[06:54:38] DeleteFrameFiles: successfully deleted file=work/wudata_00.ckp
[06:54:38]
[06:54:38] Folding@home Core Shutdown: UNSTABLE_MACHINE
[06:54:41] CoreStatus = 7A (122)
[06:54:41] Sending work to server
[06:54:41] Project: 5768 (Run 11, Clone 161, Gen 154)

I'm fairly happy that it isn't really a problem and the card can be sold on. I have to say I really got a taste for folding, so maybe when money isn't as much of an issue I'll build a folding rig.
 
Another with the bug then. I still have no idea why we enjoy it all when half the time it's such a PITA. Good job you're not keeping it up for now - you were approaching in my rear view mirror, albeit slowly.
 
Yeah, it would take some serious time to catch many of you.

Might give my 5970 a go at some point, although I hear it is much harder to set up and the PPD is pretty poor.
 
Right, different GTX295, different system. I'm getting this strange problem after completing a WU: it just loops from "Attempting to send results" until it hits too many Unstable Machine errors to continue. It happens on both cores - driver issue maybe? Log below:

[03:26:35] Completed 98%
[03:27:25] Completed 99%
[03:28:14] Completed 100%
[03:28:14] Successful run
[03:28:14] DynamicWrapper: Finished Work Unit: sleep=10000
[03:28:25] Reserved 84576 bytes for xtc file; Cosm status=0
[03:28:25] Allocated 84576 bytes for xtc file
[03:28:25] - Reading up to 84576 from "work/wudata_01.xtc": Read 84576
[03:28:25] Read 84576 bytes from xtc file; available packet space=786345888
[03:28:25] xtc file hash check passed.
[03:28:25] Reserved 25248 25248 786345888 bytes for arc file=<work/wudata_01.trr> Cosm status=0
[03:28:25] Allocated 25248 bytes for arc file
[03:28:25] - Reading up to 25248 from "work/wudata_01.trr": Read 25248
[03:28:25] Read 25248 bytes from arc file; available packet space=786320640
[03:28:25] trr file hash check passed.
[03:28:25] Allocated 560 bytes for edr file
[03:28:25] Read bedfile
[03:28:25] edr file hash check passed.
[03:28:25] Allocated 0 bytes for logfile
[03:28:25] Could not open/read logfile=<work/wudata_01.log>; Cosm status=-1
[03:28:25] GuardedRun: success in DynamicWrapper
[03:28:25] GuardedRun: done
[03:28:25] Run: GuardedRun completed.
[03:28:27] + Opened results file
[03:28:27] - Writing 110896 bytes of core data to disk...
[03:28:27] Done: 110384 -> 109756 (compressed to 99.4 percent)
[03:28:27] ... Done.
[03:28:27] DeleteFrameFiles: successfully deleted file=work/wudata_01.ckp
[03:28:27] Shutting down core
[03:28:27]
[03:28:27] Folding@home Core Shutdown: FINISHED_UNIT
[03:28:31] CoreStatus = 64 (100)
[03:28:31] Unit 1 finished with 99 percent of time to deadline remaining.
[03:28:31] Updated performance fraction: 0.993553
[03:28:31] Sending work to server
[03:28:31] Project: 6601 (Run 9, Clone 693, Gen 205)
[03:28:31] + Attempting to send results [July 25 03:28:31 UTC]
[03:28:31] - Reading file work/wuresults_01.dat from core
[03:28:31] (Read 110268 bytes from disk)
[03:28:31] Connecting to http://171.64.65.61:8080/
[03:28:36] Posted data.
[03:28:36] Initial: 0000; - Uploaded at ~21 kB/s
[03:28:36] - Averaged speed for that direction ~21 kB/s
[03:28:36] + Results successfully sent
[03:28:36] Thank you for your contribution to Folding@Home.
[03:28:36] + Starting local stats count at 1
[03:28:40] Trying to send all finished work units
[03:28:40] + No unsent completed units remaining.
[03:28:40] - Preparing to get new work unit...
[03:28:40] + Attempting to get work packet
[03:28:40] - Will indicate memory of 6135 MB
[03:28:40] - Connecting to assignment server
[03:28:40] Connecting to http://assign-GPU.stanford.edu:8080/
[03:28:42] Posted data.
[03:28:42] Initial: 40AB; - Successful: assigned to (171.64.65.61).
[03:28:42] + News From Folding@Home: Welcome to Folding@Home
[03:28:42] Loaded queue successfully.
[03:28:42] Connecting to http://171.64.65.61:8080/
[03:28:43] Posted data.
[03:28:43] Initial: 0000; - Receiving payload (expected size: 74336)
[03:28:44] - Downloaded at ~72 kB/s
[03:28:44] - Averaged speed for that direction ~72 kB/s
[03:28:44] + Received work.
[03:28:44] Trying to send all finished work units
[03:28:44] + No unsent completed units remaining.
[03:28:44] + Closed connections
[03:28:44]
[03:28:44] + Processing work unit
[03:28:44] Core required: FahCore_11.exe
[03:28:44] Core found.
[03:28:44] Working on queue slot 02 [July 25 03:28:44 UTC]
[03:28:44] + Working ...
[03:28:44] - Calling '.\FahCore_11.exe -dir work/ -suffix 02 -checkpoint 15 -verbose -lifeline 3432 -version 623'

[03:28:44]
[03:28:44] *------------------------------*
[03:28:44] Folding@Home GPU Core
[03:28:44] Version 1.31 (Tue Sep 15 10:57:42 PDT 2009)
[03:28:44]
[03:28:44] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[03:28:44] Build host: amoeba
[03:28:44] Board Type: Nvidia
[03:28:44] Core :
[03:28:44] Preparing to commence simulation
[03:28:44] - Looking at optimizations...
[03:28:44] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[03:28:44] - Created dyn
[03:28:44] - Files status OK
[03:28:44] - Expanded 73824 -> 383588 (decompressed 519.5 percent)
[03:28:44] Called DecompressByteArray: compressed_data_size=73824 data_size=383588, decompressed_data_size=383588 diff=0
[03:28:44] - Digital signature verified
[03:28:44]
[03:28:44] Project: 6600 (Run 10, Clone 922, Gen 6)
[03:28:44]
[03:28:44] Assembly optimizations on if available.
[03:28:44] Entering M.D.
[03:28:51] Tpr hash work/wudata_02.tpr: 1542977918 4078650811 266603496 1470683992 346959629
[03:28:51]
[03:28:51] Calling fah_main args: 14 usage=100
[03:28:51]
[03:28:51] mdrun_gpu returned
[03:28:51] Going to send back what have done -- stepsTotalG=0
[03:28:51] Work fraction=0.0000 steps=0.
[03:28:55] logfile size=0 infoLength=0 edr=0 trr=25
[03:28:55] + Opened results file
[03:28:55] - Writing 637 bytes of core data to disk...
[03:28:55] Done: 125 -> 124 (compressed to 99.2 percent)
[03:28:55] ... Done.
[03:28:55] DeleteFrameFiles: successfully deleted file=work/wudata_02.ckp
[03:28:55]
[03:28:55] Folding@home Core Shutdown: UNSTABLE_MACHINE
[03:28:59] CoreStatus = 7A (122)
[03:28:59] Sending work to server
[03:28:59] Project: 6600 (Run 10, Clone 922, Gen 6)

[03:29:31] + Attempting to send results [July 25 03:29:31 UTC]
 
Very odd. I've tried loads of solutions - new drivers, downclocking - and nothing helps. It does one WU fine, then UNSTABLE_MACHINE until it reaches the EUE limit.

Even stranger: I switched to a GTX260 (removing the GTX295) with a fresh driver install, and the same issue happens again! :confused:

Any suggestions? Both cards appear to be fine under Vantage and Furmark.

Board or PSU issue maybe? Both are good models - an Asus P6T V2 and a BeQuiet 650W PSU.
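In case anyone wants to check their own logs for the same pattern (one finished unit followed by a string of failures), here's a minimal sketch that tallies how each unit ended - it assumes the console client's FAHlog.txt, so adjust the filename/path for your install:

# Tally core shutdown reasons in a Folding@home client log.
# Assumes the console client's FAHlog.txt; adjust the path for your setup.
import re
from collections import Counter

shutdowns = Counter()
with open("FAHlog.txt", errors="replace") as log:
    for line in log:
        match = re.search(r"Folding@home Core Shutdown: (\w+)", line)
        if match:
            shutdowns[match.group(1)] += 1

for reason, count in shutdowns.most_common():
    print(reason, count)
# A single FINISHED_UNIT followed by a run of UNSTABLE_MACHINE entries
# would match the loop described above.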
 
The joy of the challenge, justification for spending money on the kit, and the warm glow that we are contributing to a noble cause, plus the added bonus of tiffies & stompage :)

Speed: I had an unstable machine unit recently, when my card (a 260) has been rock solid for a very long time. It was at 70% & was killed by a reboot!
Maybe you have got some bad WUs - have you checked the FAH forum?
Have you tried a different PCI-e slot?
 
Just tried a different slot, similar issue:

[17:10:57] Completed 28%
[17:12:02] Run: exception thrown during GuardedRun
[17:12:02] Run: exception thrown in GuardedRun -- Gromacs cannot continue further.
[17:12:02] Going to send back what have done -- stepsTotalG=15000000
[17:12:02] Work fraction=0.2890 steps=15000000.
[17:12:06] logfile size=0 infoLength=0 edr=0 trr=23
[17:12:06] + Opened results file
[17:12:06] - Writing 642 bytes of core data to disk...
[17:12:06] Done: 130 -> 129 (compressed to 99.2 percent)
[17:12:06] ... Done.
[17:12:06] DeleteFrameFiles: successfully deleted file=work/wudata_07.ckp
[17:12:06]
[17:12:06] Folding@home Core Shutdown: UNSTABLE_MACHINE

In Windows Event Viewer I found this: "Display driver nvlddmkm stopped responding and has successfully recovered."

The timestamp is 18:12, which matches.
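As I understand it, that nvlddmkm message is Windows' Timeout Detection and Recovery (TDR) resetting the display driver. If anyone wants to check whether the TDR timeout has been changed from the default on their machine, here's a minimal read-only sketch (assumes Python on the affected Windows box; it only reads the standard GraphicsDrivers registry key and changes nothing):

# Read the TDR timeout override (TdrDelay) from the registry, if present.
# The nvlddmkm "stopped responding" events are Windows' Timeout Detection
# and Recovery resetting the GPU driver.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
try:
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        value, _ = winreg.QueryValueEx(key, "TdrDelay")
        print("TdrDelay is set to", value, "seconds")
except FileNotFoundError:
    # No override present: Windows falls back to its default timeout.
    print("TdrDelay not set; the Windows default applies")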
 
Right, I uninstalled the latest drivers and, unlike the previous driver changes, also uninstalled the PhysX driver. Having gone back to 197.45, it has just completed its first WU and has finally moved on to a second.

So the problem could well be solved, hopefully I'm not speaking too soon!
 
This is a puzzle :confused:
When I had loads of unstable machine errors it was a card + driver issue.
The only other thing to try is to uninstall FAH [make sure you delete any folders/files left behind] & reinstall.
Is the card running hot? Are you using MSI Afterburner or another program to up the fan speed?
Make a note of the WU number, i.e. Project: 6605 (Run 9, Clone 277, Gen 210). In the past it would try to do the same WU 3 times before getting a new one, hence the multiple failures.
Just had a quick look at your logs & it seems to be different WUs failing :confused: (there's a quick way to pull those project numbers out of the log sketched after this post).

That's all I can think to try at the moment.
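Following on from the point about noting WU numbers, here's a minimal sketch that pulls out the Project (Run, Clone, Gen) line following each UNSTABLE_MACHINE shutdown, so you can see at a glance whether the same WU keeps failing or different ones do (again assuming the client's FAHlog.txt):

# List which WUs ended in UNSTABLE_MACHINE by grabbing the Project line
# that follows each shutdown. Assumes the client's FAHlog.txt.
import re
from collections import Counter

failed = Counter()
pending = False
with open("FAHlog.txt", errors="replace") as log:
    for line in log:
        if "Core Shutdown: UNSTABLE_MACHINE" in line:
            pending = True
        elif pending:
            match = re.search(r"Project: \d+ \(Run \d+, Clone \d+, Gen \d+\)", line)
            if match:
                failed[match.group(0)] += 1
                pending = False

for wu, count in failed.most_common():
    print(count, "x", wu)
# The same Project/Run/Clone/Gen repeating points at a bad WU; lots of
# different ones failing points back at the card, drivers or system.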
 