Unusual result (there IS a problem now!)

Message boards : Number crunching : Unusual result (there IS a problem now!)

Ananas
Joined: Mar 7 06
Posts: 58
Credit: 752,054
RAC: 408

wu_2952_417_742309_0_1337981624_0 is by far the longest running openMalaria result I have ever seen.

67% after 14 hours, still using CPU time, and there's still progress. I just hope that no rsc_fpops_bound setting will kill it.

TheFiend
Joined: Nov 5 09
Posts: 3
Credit: 338,058
RAC: 1

I have one just like that..... currently 42% after 9.5 hours running on my 1090T

Ananas
Joined: Mar 7 06
Posts: 58
Credit: 752,054
RAC: 408

No trouble with any runtime limits :-)

CPU time 85,487.20 (not validated yet) - not a fast CPU, running at 2.27GHz + HT

TheFiend
Joined: Nov 5 09
Posts: 3
Credit: 338,058
RAC: 1

My 1090T is clocked to 3.75GHz......

Now 57% after 16 hours... All other WUs have been running as normal

Ananas
Joined: Mar 7 06
Posts: 58
Credit: 752,054
RAC: 408

I have just finished one more of those:

wu_2952_317_742484_0_1337992814_0

CPU time 67,505.73
Validate state Valid
Credit 0.00

What kind of nasty crap is this again? A new misconception of the Berkeley despot? Some idiotic anti-cheating failure?

Setting to NNW + aborting all unstarted results :-(

TheFiend
Joined: Nov 5 09
Posts: 3
Credit: 338,058
RAC: 1

I also had a second one...... was 40% after 10 hours...

Decided to abort both.

:(

mikey
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

I had one and it went for almost 24 hours! I ONLY do the A units as this is an older laptop and the A units are supposed to be SMALLER!!!

Sparky_140
Joined: May 25 11
Posts: 1
Credit: 82,812
RAC: 90

Have a WU that's been plugging away for 15hrs but the estimated completion time keeps on going UP -- now estimated at an additional 24hrs and climbing ... is this reasonable (delivery deadline is May 31st -- unlikely that I'll make that if the remaining time continues to climb).
For the record it's wu_2953_743526_0_1338070808_0

Ananas
Joined: Mar 7 06
Posts: 58
Credit: 752,054
RAC: 408

My longest has been 48 hours on a slightly OC'ed C2Q 9450, it's currently "inconclusive" (there seems to be a Linux vs. Windows issue but some misguided weirdo doesn't let me see the wingmen).

All other long ones have been valid and, no matter whether there was a co-victim or I had the task to myself, received 0.00 credits. A total of about 6 CPU days lost. Fortunately I had aborted the rest.

I guess that this is an anti-cheating algorithm that is definitely in the wrong place here.


p.s.: I had quite a hard time keeping my finger away from the "detach" button :-/

RandyC
Joined: Jun 23 06
Posts: 2940
Credit: 926,800
RAC: 1,262

Since MCN for me is normally a set-and-forget project, I don't normally monitor my WUs. But after seeing a post in another thread I got to checking my systems... https://malariacontrol.net/workunit.php?wuid=69287156 ran for 141,555.30 CPU secs, no errors; validated; credit = ZERO!!!

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Regarding the zero-credit problem:

It seems that this is due to an issue with the validator: there is a MAX_GRANTED_CREDIT parameter which, in theory, should grant MAX_GRANTED_CREDIT if WU_CREDIT > MAX_GRANTED_CREDIT (this prevents cheating through excessive credit requests), but in our case it granted 0 credit ... :(

Some of the workunits granted 0 credit have already been purged, but we managed to recover all the hosts and the average credits for them. So for those who didn't get credit before, it's fixed now.

We increased MAX_GRANTED_CREDIT, so this should not be a problem anymore. But let me know if it happens again.
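
A minimal Python sketch of the capping behaviour described in this post. It is not the project's validator code: the function names and the example numbers are hypothetical, and only the MAX_GRANTED_CREDIT idea comes from the post itself.

```python
def capped_credit(claimed: float, max_granted: float) -> float:
    """Intended behaviour: an over-limit claim is cut down to the cap."""
    return min(claimed, max_granted)


def zeroing_credit(claimed: float, max_granted: float) -> float:
    """Behaviour reported in this thread: an over-limit claim gets nothing."""
    return claimed if claimed <= max_granted else 0.0


if __name__ == "__main__":
    cap = 300.0  # hypothetical cap, not the project's real value
    for claimed in (50.0, 300.0, 900.0):
        # Columns: claimed credit, intended grant, grant seen in this thread
        print(claimed, capped_credit(claimed, cap), zeroing_credit(claimed, cap))
```

With either function, raising the cap above the largest legitimate claim means honest long-running results never reach the over-limit branch, which is consistent with the fix described here.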

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Thank you for your reply!

But what about the extreme runtimes of 100+ hours?
WUs are to be returned within 3 days; that's not possible when they run so extremely long.
The linked one I aborted after 106 hours with progress at about 90%.

But as we only get 3 days to process tasks, 5 is stretching it a bit too far.
Sadly I had to set NNW until either the running time is greatly decreased or the time allowed greatly increased.

Max execution time on the given system was around 2 hours; these run for 5 days (120 hours). That is about 60 times the running time!
Please either make tasks so that they once again finish within 3 hours or increase the time allowed for processing to at least a month.

(a queue can build up when boinc thinks the average time is around 2 hours but instead each task runs for 5 days!)
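
The arithmetic in this post can be checked with a short Python sketch. It assumes the host crunches the task around the clock on one core, and the helper names are made up for illustration.

```python
HOURS_PER_DAY = 24


def slowdown_factor(actual_hours: float, usual_hours: float) -> float:
    """How many times longer a task runs than the usual runtime."""
    return actual_hours / usual_hours


def fits_deadline(runtime_hours: float, deadline_days: float) -> bool:
    """True if a task crunched non-stop finishes before the report deadline."""
    return runtime_hours <= deadline_days * HOURS_PER_DAY


if __name__ == "__main__":
    long_task = 5 * HOURS_PER_DAY  # the 5-day (120 hour) tasks from the post
    print(slowdown_factor(long_task, 2))  # compared with the usual 2 hours
    print(fits_deadline(long_task, 3))    # against the 3-day deadline
    print(fits_deadline(long_task, 30))   # against a one-month deadline
```

Any host that is not crunching 24 hours a day has even less margin than this sketch suggests.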

The Knighty Ni
Joined: Aug 1 11
Posts: 10
Credit: 198,432
RAC: 0

@jdvb

(a queue can build up when boinc thinks the average time is around 2 hours but instead each task runs for 5 days!)


I know exactly what you mean. There are other projects where this happens. The funny thing is, to me it feels like they are spamming, because eventually you become flooded with WUs which are all running at High Priority. Not a happy situation when you are trying to be fair to each project with your resources. Just doesn't give other projects a chance.

Usually manage this by setting No new tasks for a period of time depending on the WU run times for each project.

Don't you just hate having to Micro Manage in this way :)
____________
The Art of Flying is Throwing Yourself at the Ground and Missing. Douglas Adams ... Hitchhikers Guide to the Galaxy.

jdvb
Joined: Apr 4 11
Posts: 19
Credit: 3,950,827
RAC: 6,298

Don't you just hate having to Micro Manage in this way :)
Then do tell me, how do I manage a 5 day workunit to fit into a 3 day period?
Yes, I do find WUs that are simply ignored due to being turned in too late.

I generally only run one project on any PC.
No managing at all except when tasks last longer than the max time allowed to execute.
The managing then involves aborting and moving to a different project, as I see no other option.

This is not being flooded with WUs when one WU is too much on a multicore system.
I have stopped all malariacontrol on all machines slower than i7s as they get WUs longer than 50 hours.
Not going to choose to waste time on WUs that I can't turn in before the deadline anyway. This needs fixing ASAP.

Strat
Joined: Apr 8 12
Posts: 6
Credit: 10,841
RAC: 0

So I guess these are the last jobs I do for Malariacontrol.net, seeing how I got ZERO credit.


Name wu_2952_24_741957_0_1337959442_2
Workunit 69267532
Created 30 May 2012 23:24:01 UTC
Sent 30 May 2012 23:26:25 UTC
Received 2 Jun 2012 21:45:32 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -177 (0xffffff4f)
Computer ID 554761
Report deadline 3 Jun 2012 10:46:25 UTC
Run time 240,684.84
CPU time 226,948.60
Validate state Invalid
Credit 0.00
Application version openMalaria: A simulator of malaria epidemology and control (Branch A) v6.58


Name wu_2953_232_743346_0_1338056246_2
Workunit 69362442
Created 30 May 2012 5:54:35 UTC
Sent 30 May 2012 6:01:51 UTC
Received 2 Jun 2012 10:03:54 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -177 (0xffffff4f)
Computer ID 554761
Report deadline 2 Jun 2012 17:21:51 UTC
Run time 249,103.84
CPU time 230,974.40
Validate state Invalid
Credit 0.00
Application version openMalaria: A simulator of malaria epidemology and control (Branch A) v6.58
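
The two spellings of the exit status in these listings are the same number: 0xffffff4f is -177 written as an unsigned 32-bit value. The Python sketch below shows the conversion in both directions; the helper names are illustrative only. As far as I know, -177 is ERR_RSC_LIMIT_EXCEEDED in BOINC's error_numbers.h, i.e. the task ran past a resource bound such as the rsc_fpops_bound mentioned in the first post; treat that reading as an assumption rather than something confirmed in this thread.

```python
def to_unsigned32_hex(code: int) -> str:
    """Render a signed exit code the way the result page prints it."""
    return format(code & 0xFFFFFFFF, "#010x")


def from_unsigned32(value: int) -> int:
    """Interpret an unsigned 32-bit value as a signed (two's complement) int."""
    return value - 0x100000000 if value & 0x80000000 else value


if __name__ == "__main__":
    print(to_unsigned32_hex(-177))
    print(from_unsigned32(0xFFFFFF4F))
```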

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

First of all, Strat, this is a public forum, so be gentle and don't be coarse (your previous post is now hidden).
Then, for your credit: as you can see, the Validate state field of your workunits is Invalid, which means that you won't get any credit for those workunits.

Then, as I explained in a previous post, we're sorry for the mess-up which happened last week. We tried hard to fix the problems and find where it went wrong. We also granted credit to people who didn't get it because their claimed credit was too high, and cancelled the workunits we identified as corrupted.

So again, sorry for the mess and thanks for your understanding.

mikey
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

MichaelT, I too am STILL getting the REALLY long units. I aborted one just this morning that had run for over 24 hours and was still at 50%. My normal time frame is 1.5 hours or less. The units in question seem to be the ones starting with 2952.

michaelT
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: Jul 20 10
Posts: 47
Credit: 16,359
RAC: 0

Yes, mikey, all the long workunits are the wu_2952_* and wu_2953_* ones. You can abort them; they have been cancelled.

Strat
Joined: Apr 8 12
Posts: 6
Credit: 10,841
RAC: 0

Well thats just fine by me cause I'm outa here!

mikey
Joined: Mar 23 07
Posts: 4382
Credit: 5,361,193
RAC: 1,084

Yes, mikey, all the long workunits are the wu_2952_* and wu_2953_* ones. You can abort them; they have been cancelled.


THANK YOU for cancelling them! That means all units I now crunch are ones that will bring my rac back UP!!


Copyright © 2013 africa@home