Errors in Test WU's V6.41

Author Message
The Gas Giant
Joined: Mar 7 06
Posts: 1214
Credit: 3,625,990
RAC: 2,644

I'm seeing quite a few errors in test version V6.41.

See here from June 25. Looks like others are as well.

RandyC
Joined: Jun 23 06
Posts: 2940
Credit: 926,890
RAC: 1,261

About 20-30% of my WUs are dying due to max disk usage exceeded. Many of my wingmen on the same WUs get the same results.

mikey
Joined: Mar 23 07
Posts: 4384
Credit: 5,361,744
RAC: 1,110

Hmmm, I guess we will have to give BOINC more space then! I have mine set so BOINC only uses a max of 10 GB of space. I wonder if that is why I didn't get any units this time around?

P.S. Silly me, I did have one. I upped the disk usage on this PC to 15 GB, but even after updating I got no more units.

Krunchin-Keith [USA]
Volunteer moderator
Volunteer tester
Joined: Nov 10 05
Posts: 3219
Credit: 5,501,570
RAC: 3,668

Could you guys that have max disk usage exceeded tell us what your settings are?

I'm confused by this issue as I have no failures so far, but I have only returned 3 results for 6.41, so it may be too soon to tell.

My settings, though, limit disk usage to 3.21 GB.

BOINC shows about half of that, or less, used on both my systems. This includes about 9 active projects out of about 20 attached.

For mc.n it also shows low usage: on my two systems here it reads 116 MB and 281 MB, although each is only running 1 mc task.

For admin: can you tell us approximately how much disk space should be needed per running task?

John Clark
Joined: Feb 10 08
Posts: 2149
Credit: 1,194,032
RAC: 1,520

On my older PCs the new work seems to be going through OK, except that the majority are in pending.

The newer PCs are getting from 40% to 65% with "Error while computing"
____________
Go away, I was asleep

Said a Russell, 3 Shih-Tzus & a Bischeon Frize

hardy
Volunteer moderator
Project administrator
Project developer
Joined: Feb 18 09
Posts: 141
Credit: 54,421
RAC: 128

Well, I had a look at some of John Clark's failed workunits -- all errors say out of disk usage. John, could you please tell us how much disk space is currently available for BOINC apps on each of these computers?

We just saw an openmalaria checkpoint of size x=100 MB (the size will vary a bit).
The current openmalaria applications alternate between two checkpoint files (to make sure, should the process be killed while checkpointing, the old checkpoint file is still usable -- maybe in the future we'll make it delete the old checkpoint as soon as it's no longer needed though). Then, for example, if you run openmalaria on two CPUs at once, you could easily reach 2*2*x = 400 MB of disk usage, which if it exceeds what the BOINC client makes available will cause at least one of the running work-units to be killed.

I just increased the required host disk space for BOINC to half a gigabyte, but as you can see, this almost certainly won't be enough for quad-cores that run 4 big openmalaria work-units at once.

For those of you finding this a problem, you can also increase the disk space available to BOINC on your computers.
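
The checkpoint arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming the figures quoted in this post (two alternating checkpoint files of roughly 100 MB each per task); the function name `estimated_checkpoint_usage_mb` is made up for the example and is not part of openmalaria or BOINC.

```python
def estimated_checkpoint_usage_mb(running_tasks, checkpoint_mb=100, files_per_task=2):
    """Rough disk use from checkpoints alone: tasks x files per task x file size.

    checkpoint_mb=100 and files_per_task=2 are the approximate figures
    quoted in the post; real checkpoint sizes vary between work-units.
    """
    return running_tasks * files_per_task * checkpoint_mb


# Estimate for single-, dual- and quad-core hosts running one task per core.
for tasks in (1, 2, 4):
    print(f"{tasks} task(s): ~{estimated_checkpoint_usage_mb(tasks)} MB")
```

On these assumptions four simultaneous tasks already need about 800 MB for checkpoints alone, which is why half a gigabyte is unlikely to be enough on quad-cores.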

glaesum
Joined: Nov 29 07
Posts: 7
Credit: 110,498
RAC: 14

it's not going well is it...

on a five year old xp laptop my experience is 50:50, i.e. 5 successes and 5 fails with 3 in progress; (two other machines only have one each in progress so can't comment yet).

the failures pack up quickly in 5 minutes or so, at least they are not wasting much time.

/pg

RandyC
Joined: Jun 23 06
Posts: 2940
Credit: 926,890
RAC: 1,261

My venue for the system getting disk usage exceeded is 'home'. According to my account settings here, the values are all '--' which I assume means 'use the default settings'. The default venue shows:

Use at most 6GB (there is 99GB of free space on the drive)
Leave at least 0.2GB free space
Use at most 95% of total disk space

At Einstein, the values for this system are:
Use at most 4GB
Leave at least 0.5GB freespace
Use at most 95%

u.dgl.
Joined: Mar 8 06
Posts: 26
Credit: 1,170,815
RAC: 453

Hi,

I have got a lot (17) of WUs with "error while computing" on four different hosts.

My settings are:
Use at most 100 GB disk space
Leave at least 0.1 GB disk space free
Use at most 50% of total disk space

And there is a huge amount of free space on the disks.

So, for me, it seems not to be a problem with the hosts!



Krunchin-Keith [USA]
Volunteer moderator
Volunteer tester
Joined: Nov 10 05
Posts: 3219
Credit: 5,501,570
RAC: 3,668

OK, I've had two errors now.

But what confuses me is this: my two systems are each running only 1 CPU task (limited by me in BOINC), so according to hardy the checkpoint files should be about 100 MB each x 2 files = 200 MB, or 1/5 of a GB. Both systems have at least 1 GB free for BOINC, so why do they error if there is plenty of free space?

John Clark
Joined: Feb 10 08
Posts: 2149
Credit: 1,194,032
RAC: 1,520

Both in the account "Computing Preferences" and in BOINC Manager all PCs are set to "Use at Most" = 15GB disk space. I have upped them all - global and local to 30 GB.

Hope this helps with the erroring.

mikey
Joined: Mar 23 07
Posts: 4384
Credit: 5,361,744
RAC: 1,110

I get a bad unit and then the very next unit from the same machine finishes just fine! I am not sure the error message accurately describes what is going on. How can one unit not have enough hard drive space and the very next unit work just fine?
https://malariacontrol.net/results.php?hostid=143197

Kimegi Tepeex
Joined: Jun 20 06
Posts: 33
Credit: 5,539,615
RAC: 848

I have had many errors, on several hosts.
I believe the cause was in the init_data.xml file containing :

<rsc_disk_bound>400000000.000000</rsc_disk_bound>


But something has changed recently: the WUs received after approximately 21:20 UTC today now specify this limit as 512M instead of 400M.

Let's crunch and see...
____________
This text was made up exclusively with recycled electrons.

swiftmallard
Joined: Jul 24 09
Posts: 651
Credit: 1,130,259
RAC: 0

Use at most 100GB disk space
Leave at least 0 GB free
Use at most 50% total disk space


John Clark
Joined: Feb 10 08
Posts: 2149
Credit: 1,194,032
RAC: 1,520

Despite my change I am getting the same as Kimegi.

mikey
Joined: Mar 23 07
Posts: 4384
Credit: 5,361,744
RAC: 1,110

I just looked at the error message on 2 of my pc's and saw this...
"- Virtual Memory Usage -
VirtualSize: 0, PeakVirtualSize: 0

- Pagefile Usage -
PagefileUsage: 0, PeakPagefileUsage: 0"

How can both virtual memory and page file usages be zero if the unit is running? I run with Windows and there is ALWAYS a page file in use for something! Very strange.

John Neale
Joined: Feb 21 10
Posts: 83
Credit: 89,364
RAC: 2

I have also had a couple of errors using V6.41. The latest gave this message:

2010/06/26 11:33:26 AM malariacontrol.net Aborting task wu_736_24_218916_0_1277433741_4: exceeded disk limit: 389.25MB > 381.47MB

However, my BOINC Manager Preferences settings (and my mc.net computing preferences) both allow 100 GB and a maximum of 50 % of disk space to be used. There is over 104 GB of space available, so I suppose that implies that about 52 GB of space is available to BOINC.

These settings don't seem to be consistent with the error message.

I've also made two observations when processing these units: (1) memory usage cycles up and down, and (2) my system (a dual core laptop) is very sluggish.

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,234,026
RAC: 1,392

I've had 3 "Maximum disk use exceeded error" failures in the past 24 hours.

The problem has nothing to do with the disk space settings in global preferences. As Kimegi previously mentioned, the project sets a disk space limit for workunits in <rsc_disk_bound>.

The first one generated the error:

25-Jun-2010 17:31:24 [malariacontrol.net] Aborting task wu_736_35_218884_0_1277387415_4: exceeded disk limit: 577.49MB > 362.40MB

That task had <rsc_disk_bound> set to 380000000.000000 (which corresponds with the error message).

The other 2 had a larger disk space limit but still generated too much data:
25-Jun-2010 18:32:53 [malariacontrol.net] Aborting task wu_736_326_218931_0_1277462535_0: exceeded disk limit: 402.55MB > 381.47MB
26-Jun-2010 13:53:14 [malariacontrol.net] Aborting task wu_736_426_218892_0_1277396899_3: exceeded disk limit: 402.51MB > 381.47MB

Those tasks had <rsc_disk_bound> increased to 400000000.000000 (which also corresponds with the error messages).

My current tasks have <rsc_disk_bound> increased to 512000000.000000. That's 488.28MB, which would have allowed the last 2 to run for a bit longer but would still have caused problems for the first one.

Edit: I have a task at 52% after 70 minutes, checkpointing every 10 minutes. It's using 45.8MB in the project directory and 213MB in the slot directory. The size of the checkpoint files seems to be increasing (checkpoint1 is 85.3MB, checkpoint0 is 128MB).
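
The correspondence between the per-task limit values quoted above and the figures in the error messages can be checked with a short Python sketch. It assumes the limit is expressed in bytes and that the client log reports binary megabytes (1 MB = 1024 x 1024 bytes), which is what the numbers in this thread suggest; `bound_to_mb` is a name invented for the example.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: the log's "MB" means binary megabytes


def bound_to_mb(bound_bytes):
    """Convert a per-task disk limit in bytes to the MB figure shown in the log."""
    return bound_bytes / BYTES_PER_MB


# The three limit values mentioned in this thread.
for bound in (380_000_000, 400_000_000, 512_000_000):
    print(f"{bound} bytes = {bound_to_mb(bound):.2f} MB")
```

Under that assumption the three values come out within rounding of the 362.40MB, 381.47MB and 488.28MB limits shown in the "exceeded disk limit" messages.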
____________
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer

PinkPenguin
Joined: Jul 24 09
Posts: 4
Credit: 270,599
RAC: 0

This is from a Mac OS X 10.6.4 machine running the malariacontrol v6.41 32-bit app, but I am getting the same on Windows and Linux machines as well.

Current <rsc_disk_bound> value: 512000000.000000, which only applies to the WUs from 26 June.


24-Jun-2010 23:12:53 [malariacontrol.net] Aborting task wu_736_416_218903_0_1277408057_0: exceeded disk limit: 397.56MB > 381.47MB
25-Jun-2010 11:41:38 [malariacontrol.net] Aborting task wu_736_233_218915_0_1277430858_4: exceeded disk limit: 430.82MB > 381.47MB
25-Jun-2010 22:05:54 [malariacontrol.net] Aborting task wu_736_24_218919_0_1277441177_5: exceeded disk limit: 431.97MB > 381.47MB
26-Jun-2010 01:06:12 [malariacontrol.net] Aborting task wu_736_34_218931_0_1277462535_3: exceeded disk limit: 695.32MB > 381.47MB
26-Jun-2010 03:11:26 [malariacontrol.net] Aborting task wu_736_34_218937_0_1277472131_4: exceeded disk limit: 653.69MB > 488.28MB
26-Jun-2010 12:42:30 [malariacontrol.net] Aborting task wu_736_24_218975_0_1277530821_1: exceeded disk limit: 533.75MB > 488.28MB


This just confirms what the others are reporting in this thread.
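
For anyone wanting to tally these messages from their own client log, here is a minimal Python sketch. It assumes log lines shaped like the ones quoted above; the regular expression and the `parse_abort` helper are written for this example and are not part of BOINC.

```python
import re

# Matches the "Aborting task ...: exceeded disk limit: X MB > Y MB" part of a line.
LINE_RE = re.compile(
    r"Aborting task (?P<task>\S+): exceeded disk limit: "
    r"(?P<used>[\d.]+)MB > (?P<limit>[\d.]+)MB"
)


def parse_abort(line):
    """Return (task, used_mb, limit_mb), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return m.group("task"), float(m.group("used")), float(m.group("limit"))


# Two of the lines posted above, used as sample input.
sample = [
    "24-Jun-2010 23:12:53 [malariacontrol.net] Aborting task wu_736_416_218903_0_1277408057_0: exceeded disk limit: 397.56MB > 381.47MB",
    "26-Jun-2010 01:06:12 [malariacontrol.net] Aborting task wu_736_34_218931_0_1277462535_3: exceeded disk limit: 695.32MB > 381.47MB",
]
for line in sample:
    task, used, limit = parse_abort(line)
    print(f"{task}: over the limit by {used - limit:.2f} MB")
```

The pattern is searched for anywhere in the line, so the differing timestamp formats seen in this thread should not matter.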

John Clark
Joined: Feb 10 08
Posts: 2149
Credit: 1,194,032
RAC: 1,520

Still quite a few WUs erroring

PinkPenguin
Joined: Jul 24 09
Posts: 4
Credit: 270,599
RAC: 0

Interestingly enough, I had a work unit which completed OK but was flagged "Completed, can't validate" because all six wingmen encountered the max disk space error.... ;)

https://malariacontrol.net/workunit.php?wuid=21857423

All machines would have had <rsc_disk_bound> set to the old value (see Thyme Lawn above), as the change in value was made early in the morning on the 26th (see my previous post).

The change may have reduced the frequency of the problem but it does not seem by much as I have had three errors of the same type since then (more or less the same as in the previous 24 hours under the old value):

26-Jun-2010 03:11:26 [malariacontrol.net] Aborting task wu_736_34_218937_0_1277472131_4: exceeded disk limit: 653.69MB > 488.28MB
26-Jun-2010 12:42:30 [malariacontrol.net] Aborting task wu_736_24_218975_0_1277530821_1: exceeded disk limit: 533.75MB > 488.28MB
26-Jun-2010 20:00:53 [malariacontrol.net] Aborting task wu_736_34_218986_0_1277551101_4: exceeded disk limit: 881.28MB > 488.28MB


From the disk usage values registered in the log it seems that another increase in the value is in order and, judging by the values above, it needs to be of the order of 1 GB, or double the current value... it looks like actual disk space requirements vary a lot depending on the WU.

The Gas Giant
Joined: Mar 7 06
Posts: 1214
Credit: 3,625,990
RAC: 2,644

Used by BOINC: 471.49MB
Free, available to BOINC: 19.62GB
Free, not available to BOINC: 17.66GB
Used by other programs: 36.78GB.

And yet for this wu
27/06/2010 4:09:14 PM malariacontrol.net Aborting task wu_736_35_219017_0_1277610618_1: exceeded disk limit: 606.86MB > 488.28MB

PinkPenguin
Joined: Jul 24 09
Posts: 4
Credit: 270,599
RAC: 0

Since the increase in <rsc_disk_bound> to 488 MB there has been a reduction in the number of times "Maximum disk usage exceeded" is encountered, but on each machine (Linux, Mac OS X or Windows) it still occurs regularly.

Here's the log from a Windows machine (in addition to the examples from OS X previously posted):

25-Jun-2010 00:57:22 [malariacontrol.net] Aborting task wu_736_234_218909_0_1277417173_1: exceeded disk limit: 548.55MB > 381.47MB
25-Jun-2010 04:52:51 [malariacontrol.net] Aborting task wu_736_30_218913_0_1277425702_2: exceeded disk limit: 386.89MB > 381.47MB
25-Jun-2010 05:27:56 [malariacontrol.net] Aborting task wu_736_34_218910_0_1277418742_2: exceeded disk limit: 555.44MB > 381.47MB
25-Jun-2010 12:43:46 [malariacontrol.net] Aborting task wu_736_316_218919_0_1277441178_4: exceeded disk limit: 411.54MB > 381.47MB
25-Jun-2010 13:08:50 [malariacontrol.net] Aborting task wu_736_24_218931_0_1277462535_1: exceeded disk limit: 516.96MB > 381.47MB
26-Jun-2010 04:30:47 [malariacontrol.net] Aborting task wu_736_35_218967_0_1277513537_1: exceeded disk limit: 690.51MB > 488.28MB
26-Jun-2010 09:31:25 [malariacontrol.net] Aborting task wu_736_233_218959_0_1277504781_3: exceeded disk limit: 571.87MB > 488.28MB
26-Jun-2010 14:37:01 [malariacontrol.net] Aborting task wu_736_35_218987_0_1277552902_0: exceeded disk limit: 733.85MB > 488.28MB
28-Jun-2010 07:51:49 [malariacontrol.net] Aborting task wu_736_233_219056_0_1277702291_1: exceeded disk limit: 524.00MB > 488.28MB


If the first figure in each log line is anything to go by, it looks like the upper limit encountered so far is a little less than 1 GB (judging from the logs from all my machines).
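
As a rough check on that estimate, the Python sketch below counts how many of the nine aborted tasks in the Windows log above would still have exceeded a given limit. The 1024 MB candidate is a hypothetical value chosen for illustration, not a figure announced by the project, and the logged usage is only what the task had reached when it was aborted, so the true requirement may be higher.

```python
# Disk usage (MB) at the moment of abort, copied from the Windows log above.
observed_mb = [548.55, 386.89, 555.44, 411.54, 516.96, 690.51, 571.87, 733.85, 524.00]


def still_failing(observed, limit_mb):
    """Count tasks whose logged usage is above the given limit."""
    return sum(1 for used in observed if used > limit_mb)


# 381.47 and 488.28 are the limits seen in the log; 1024.0 is a hypothetical ~1 GB limit.
for limit in (381.47, 488.28, 1024.0):
    print(f"limit {limit:.2f} MB: {still_failing(observed_mb, limit)} of {len(observed_mb)} would still abort")

print(f"largest logged usage: {max(observed_mb):.2f} MB")
```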

The Gas Giant
Joined: Mar 7 06
Posts: 1214
Credit: 3,625,990
RAC: 2,644

So this is a Malaria Control issue not a BOINC issue?

A few more...
28/06/2010 11:10:30 AM malariacontrol.net Aborting task wu_736_35_218932_0_1277464333_3: exceeded disk limit: 462.11MB > 381.47MB
28/06/2010 12:45:59 PM malariacontrol.net Aborting task wu_736_234_219043_0_1277667492_3: exceeded disk limit: 603.96MB > 488.28MB
28/06/2010 12:56:02 PM malariacontrol.net Aborting task wu_736_35_219044_0_1277670614_2: exceeded disk limit: 627.62MB > 488.28MB

Thyme Lawn
Joined: Jun 20 06
Posts: 181
Credit: 1,234,026
RAC: 1,392

So this is a Malaria Control issue not a BOINC issue?

Yes. The majority of the disk space used by an MC task is in the 2 checkpoint files maintained in the slot directory. Their combined size can push the total disk space used for some tasks above the limit set in <rsc_disk_bound> for workunits. That setting has been adjusted at least twice (from 380000000.000000 to 400000000.000000 to 512000000.000000), with the number of errors decreasing each time as more tasks manage to keep within the limit.

Edit: The only way I can think of to get one of the problem tasks to complete would be to set the disk write interval longer than any MC task takes on your system, which prevents the application from checkpointing. That would mean you'd have to restart the task from scratch if you stopped BOINC, and your task would probably be the only successful completion from a WU.

hardy
Volunteer moderator
Project administrator
Project developer
Joined: Feb 18 09
Posts: 141
Credit: 54,421
RAC: 128

Well, yesterday Guillaume & I managed to nearly halve the size of checkpoints. This means we'll have to make a new app release; there are also a few other improvements we've managed to make since the last release.

I'll let you know the details when we do; hopefully it'll be some point later today.

GGnaegi
Volunteer moderator
Joined: Mar 4 10
Posts: 98
Credit: 40,023
RAC: 10

We have now updated the science application (version 6.42). This release will increase the performance and reduce the size of checkpoints.
____________
Guillaume Gnaegi
Swiss Tropical and Public Health Institute
http://www.swisstph.ch

John Clark
Joined: Feb 10 08
Posts: 2149
Credit: 1,194,032
RAC: 1,520

Starting to get a load of 6.42 WUs, but 1 rig cannot download work (says none available) and the old rig has some 6.41s to complete before it gets on with its 6.42s.

I suppose my RAC makes me really evil?

The Gas Giant
Joined: Mar 7 06
Posts: 1214
Credit: 3,625,990
RAC: 2,644

...

I suppose my RAC makes me really evil?

Well, that, the mustache, the lecherous smile and the BIG honkin' nose... a RAC of 666 just confirms it all! :p

I look forward to seeing how the 6.42 WUs run.


Copyright © 2013 africa@home