Any VM folders updated their A2 core?

ChrissyT88 · 27 Feb 2009 at 11:01

Hi,

About a week ago, Stanford posted on the official forum that a new client was available for Linux that made use of a newer A2 core, with some improvements to stop WUs being generated with the wrong number of steps. I downloaded it and have been pretty much unable to complete an A2 WU since, with one or two exceptions. I don't leave the client running 24/7, but when stopping and resuming the A2 units I get checkpoint resume errors, and the client throws a wobbly and doesn't know what to do. I have reported this on the official forum, along with others, but Dr Kasson has replied saying they are finding it difficult to replicate the error. Has anyone here had this problem and would be willing to help out? The thread is here for more info:

http://foldingforum.org/viewtopic.php?f=44&t=8356

It's quite frustrating really and has cost me quite a lot of points. Until now the Linux VMs have been rock solid, and this is the first problem I have really encountered with them. Anyone contemplating the core upgrade, think twice or be prepared for some (potential) trouble...

Mattus · 28 Feb 2009 at 17:24

I updated too. It folds OK, but I have to start the client about ten times before it works. I keep getting this:

ChrissyT88 · 28 Feb 2009 at 17:47

Strange! I have reinstalled the clients completely (not the VM) and one picked up an A2, and so far it has not had the checkpointing error. Unfortunately I don't get through too many SMP WUs so it's difficult to identify the cause, and the servers keep throwing A1s my way!

What version of Ubuntu are you using, Mattus? That actually looks a lot like an error I was getting with some code for one of our uni projects on an Ubuntu 8.10 VM. Switching to 8.04 resulted in no problems with identical code.

EDIT: Does ubuntuServer6 indicate you are using Ubuntu 6 by any chance?

Mattus · 28 Feb 2009 at 18:32

Yeah, it's Ubuntu Server 6.06. Ubuntu 7 causes my clients to hang at the end of each WU, and downgrading fixed that problem. I tried 8.10 but for some reason each VM only used one core :confused:

So version 6 is the newest version that works OK with my VMs.

Just remembered that I lost a WU to your SaveRestoreState bug the other day too. I put it down to typical SMP bugginess but when I come to think about it, I hadn't seen that error before I upgraded the core...

Mattus · 28 Feb 2009 at 18:36

Hmm. I just checked my clients, and what do you know:

Code:

[17:25:16] Will resume from checkpoint file
[17:25:18] te: I/O failed dir=0, var=0000000000A0AA20, varsize=51024
[17:25:18] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore state.

So it looks like I have this problem too. I'll post on the Folding forum in a bit. My VMs only have 2GB hard drives, so it might be feasible for me to upload one to my server overnight for Dr Kasson to look at.
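For anyone else wanting to check their own clients, a rough sketch of a log scan is below. It only looks for the fcCheckPointResume failure line quoted above and returns the timestamps; the function name is made up for illustration, and the assumption is that your client log has the same `[HH:MM:SS] message` layout as the paste.

```python
import re

# Matches client log lines of the form quoted above, e.g.
# [17:25:18] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore state.
RESUME_FAILURE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s+fcCheckPointResume: failure")


def find_resume_failures(log_text):
    """Return the timestamp of every checkpoint-resume failure in the log text.

    Hypothetical helper: assumes each log line starts with [HH:MM:SS].
    """
    timestamps = []
    for line in log_text.splitlines():
        match = RESUME_FAILURE.match(line.strip())
        if match:
            timestamps.append(match.group(1))
    return timestamps
```

To use it, read the client's log file into a string (for example with `open(path).read()`) and pass that in. An empty list means the failure line was not found in that log.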

ChrissyT88 · 28 Feb 2009 at 18:40

Interesting - as I mentioned in the thread, I made the mistake of letting VMware set up most of the default options, so my images are around 8GB each. Still, at least I know it's not just me! I'm not sure it's the case for every WU: as I mentioned, I have completed a few A2 units, but I would say about half have failed to resume from checkpoints. It's interesting you mention that if you try to start them lots of times they start eventually, and I'll give that a go if I ever see an A2 unit again!

Mattus · 28 Feb 2009 at 18:43

That's a separate problem

I get the gibberish in the screenshot above practically every time I start the client, and have to try about ten times to get past that. The SaveRestoreState thing is a separate issue that I've only had a couple of times so far, and it seems no amount of restarting will get past that. I just have to delete the WU.

ChrissyT88 · 28 Feb 2009 at 18:59

OK, gotcha! I won't waste time restarting them!

WoZZeR · 1 Mar 2009 at 17:35

Hey up chaps.

Mine has been doing a similar thing recently, and I've produced practically no work at all:

Code:

Launch directory: /opt/foldingathome/1
Executable: /opt/foldingathome/1/fah6
Arguments: -smp 

[17:18:28] - Ask before connecting: No
[17:18:28] - User name: WoZZeR (Team 10)
[17:18:28] - User ID: 155C79523F88F1F3
[17:18:28] - Machine ID: 1
[17:18:28] 
[17:18:28] Loaded queue successfully.
[17:18:28] 
[17:18:28] + Processing work unit
[17:18:28] Core required: FahCore_a2.exe
[17:18:28] Core found.
[17:18:28] Working on Unit 09 [March 1 17:18:28]
[17:18:28] + Working ...
[17:18:28] 
[17:18:28] *------------------------------*
[17:18:28] Folding@Home Gromacs SMP Core
[17:18:28] Version 2.04 (Thu Jan 29 16:43:57 PST 2009)
[17:18:28] 
[17:18:28] Preparing to commence simulation
[17:18:28] - Ensuring status. Please wait.
[17:18:38] - Looking at optimizations...
[17:18:38] - Working with standard loops on this execution.
[17:18:38] - Files status OK
[17:18:38] - Expanded 229168 -> 1114469 (decompressed 486.3 percent)
[17:18:38] Called DecompressByteArray: compressed_data_size=229168 data_size=1114469, decompressed_data_size=1114469 diff=0
[17:18:38] - Digital signature verified
[17:18:38] 
[17:18:38] Project: 4433 (Run 84, Clone 10, Gen 4)
[17:18:38] 
[17:18:38] Entering M.D.
[17:18:44] Will resume from checkpoint file
[17:18:45] Resuming from checkpoint
[17:18:45] fcSaveRestoreState: I/O failed dir=0, var=0000000000A501C0, varsize=50712
[17:18:45] fcCheckPointResume: failure in call to fcSaveRestoreState() to restore state.

I am running native 8.04 Ubuntu.

I've tried stopping it in the terminal before powering off the PC, but it still does it occasionally.

It's starting to hack me off and to cap it all I've forgotten my user name for the FaH forums.

ChrissyT88 · 1 Mar 2009 at 17:42

I'll post the logs for you if you like...? It's also very interesting to know that it is not just a VM-related issue, given you are running a native client.

WoZZeR · 1 Mar 2009 at 19:46

Cheers, that would be great.

I've just deleted the 'work' folder and the cores and restarted it. It downloaded the A2 core again, so I'll keep an eye on it. It just seems a pain to lose a day's work so easily.

ChrissyT88 · 1 Mar 2009 at 20:34

WoZZeR said:
It just seems a pain to lose a day's work so easily.

I uploaded the log - Dr Kasson has said that they have managed to isolate the bug, but that a fix may take a little while to surface. I agree fully, it's a real pain. I don't get a lot of points from the SMP client, but it has already bricked 5 units for me, all at above 50% completion. I have started to close the clients after they have only completed a few % of each A2 WU. If they don't fail then, they don't seem to fail later on, but it is a little inconvenient at times.

WoZZeR · 1 Mar 2009 at 22:13

Thanks for your help.

I'll try restarting the machine fairly early on in the WUs too, it may work for me as well.

ChrissyT88 · 29 Mar 2009 at 20:33

I just posted on the Folding forum regarding this issue, which has sort of surfaced again for me now that there are more A2 units out and about. Dr Kasson indicated that they have a fix, and that it could be rolled out early (although no indication of a timescale was given). However, I did get some particularly useful advice from another member, dnamechanic (thanks!).

This only applies to VMware users unfortunately, but instead of ctrl+c to close the client, use VMware to take a snapshot of the guest operating system. This just dumps the contents of the RAM to disk, and then the client can be shut down however you like. Instead of restarting the VM and restarting the client, you simply restore from the snapshot, and it carries on from the time of the snapshot. Apparently this can take some time on a low-RAM machine, but on my system (4GB of RAM) it's pretty damn quick. It's also much quicker to resume the VM from the snapshot than it is to manually boot up the VM each time. Not sure if this is news to people or indeed if others have always been doing this (I think from now on I will!), but it's certainly a workaround for the problem. The only downside is that it puts the VM's clock out of sync with the host, so you either have to correct it by hand or set the VM to sync clocks automatically to avoid FahMon problems.
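If you would rather script the snapshot than click through the GUI, here is a rough sketch. It assumes your VMware product ships the `vmrun` command-line tool and that `vmrun` is on your PATH, which may not be true for every product or version. The helper names, the .vmx path and the snapshot name are all made up for illustration, and I have not tested this against the A2 problem itself.

```python
import subprocess

# Only the two vmrun actions needed for the snapshot workaround.
SNAPSHOT_ACTIONS = ("snapshot", "revertToSnapshot")


def vmrun_command(action, vmx_path, snapshot_name):
    """Build the argument list for a vmrun snapshot action without running it.

    Hypothetical helper: vmx_path and snapshot_name are whatever you use locally.
    """
    if action not in SNAPSHOT_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["vmrun", action, vmx_path, snapshot_name]


def run_vmrun(action, vmx_path, snapshot_name):
    """Run the command; raises CalledProcessError if vmrun reports failure."""
    subprocess.run(vmrun_command(action, vmx_path, snapshot_name), check=True)
```

So `run_vmrun("snapshot", "/vms/fah1/fah1.vmx", "folding")` would be run before shutting down, and `run_vmrun("revertToSnapshot", "/vms/fah1/fah1.vmx", "folding")` the next day. Depending on the product, the VM may still need starting or resuming after the revert.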

Hope this helps.

WoZZeR · 31 Mar 2009 at 22:45

Ta for the info. Although I'm not running via VMware, this recent batch of 1920s hasn't messed up yet despite several assorted reboots.