Thread 'Does the new scheduler suspend jobs silently?'

Message boards : Questions and problems : Does the new scheduler suspend jobs silently?

rvp_lan
Joined: 30 Dec 08
Posts: 24
France
Message 44127 - Posted: 13 May 2012, 20:24:49 UTC
Last modified: 13 May 2012, 20:29:46 UTC

Hello,

There has already been a lot of talk about the new scheduler, and I've tried to read as much of it as I could.

-- I have several hosts at home running Windows XP, 32-bit or 64-bit.
-- All are upgraded to 7.0.27.
-- I crunch for more than 20 projects. Not all of them have work.
-- The host that gives me the worst results with the new version has a quad-core CPU and one NVIDIA GPU, and is always up and connected.

I started a thread at Einstein because I repeatedly received a special message in the log from their server: http://einstein.phys.uwm.edu/forum_thread.php?id=9438&nowrap=true#117085

The conclusion of that thread, still unresolved for me, is that I can't figure out the right settings for "minimum work" and "reserve work" on a host that is connected 24 hours a day, with a quad-core CPU and an NVIDIA GPU, so that the new scheduler behaves almost the same way the old one did. I know it is different code with different behavior, but "almost" the same would do!

In that thread, I explain that, as advised, I put debug flags in cc_config.xml, tried disabling the GPU, re-enabled it, and so on. My feeling is that the new scheduler suspends projects when it reaches some limit. What limit? I don't understand.
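For readers who want to reproduce the debug setup, here is a minimal sketch of how such a cc_config.xml could be generated. The two flag names (work_fetch_debug and sched_op_debug) are an assumption about which log flags are useful here; the post does not say which ones were actually enabled.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


# Assumed choice of flags; the post does not name the ones actually used.
DEBUG_FLAGS = ["work_fetch_debug", "sched_op_debug"]

config_text = build_cc_config(DEBUG_FLAGS)
print(config_text)
```

The printed text would be saved as cc_config.xml in the BOINC data directory, after which the client has to re-read its config files (or be restarted) for the flags to take effect.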

This was the chain of events:
-- upgraded from 7.0.25 to 7.0.27 (following the advice of Albert@Home)
-- reset ALL inactive projects (those with no WU pending or crunching)
-- the client received too many WUs at once on a single request for some projects
-- let the scheduler do its job (touched nothing and waited more than a week)
-- after that, it was as if the projects no longer updated themselves
-- no more work was requested for the CPU, only for the NVIDIA GPU
-- I suspended the project that had too many WUs
-- instantly, the other projects started their reset (after a week!!!) and asked for work

With the previous scheduler, reset requests for all (or almost all) projects completed within 15 to 30 minutes, whether or not there was work to crunch.

What I don't get is why ONE project receiving too much work should saturate the client and stop all the other projects, even stopping them from polling and updating (with or without asking for work).

Einstein has always been the most regular project, one from which I always received work. Because of this new message from their server, and to test Einstein's response (it had been denied work for a week), I suspended all other projects. Einstein then instantly asked for work and got 4 WUs, with a request for 138240 secs = 4 cores * (0.2d min + 0.2d res) * 24 * 3600. I'm OK with that: with all the others suspended, it is the only project running, so it asks for all cores.
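The request size quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic as written in this post, assuming the request is simply cores times the total buffer in days, converted to seconds; it is not a description of how the 7.x client actually computes work requests, and the function and parameter names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 3600


def work_request_seconds(cores, min_buffer_days, extra_buffer_days):
    """Seconds of work needed to fill both buffers on every core."""
    return cores * (min_buffer_days + extra_buffer_days) * SECONDS_PER_DAY


# Quad-core host with 0.2d minimum work and 0.2d reserve work.
request = work_request_seconds(4, 0.2, 0.2)
print(round(request))
```

With 4 cores and 0.2d + 0.2d this gives the 138240 seconds seen in the log. Under the same assumption, other buffer values scale the request linearly.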

But after that, if I resume all the other projects, why do they all seem to be denied polling again? The previous scheduler was able to deal with these 20 projects on four cores, and all projects seemed to receive work fairly and regularly.

I found at GPUGRID that the "best" values for a host connected 24/24 would be 0.2d min + 0.2d res. But obviously, my 2 hosts connected 24/24 do not behave as they did under the 6.x scheduler.

Unless there is a bug in 7.0.27, what would be the best values for such a host? Or what would be the best values for a nice distribution of work between ALL projects?

Thanx. Cheers.
TRuEQ & TuVaLu
Joined: 23 May 11
Posts: 108
Sweden
Message 44134 - Posted: 14 May 2012, 10:27:01 UTC - in response to Message 44127.  
Last modified: 14 May 2012, 10:28:51 UTC

Cheers.


http://boinc.berkeley.edu/dev/forum_thread.php?id=7554&sort=6

That is my thread. I am also trying to figure out how it is supposed to work and why changes are so slow.

I think it works, though very slowly. A minor change can take days to take effect.
rvp_lan
Joined: 30 Dec 08
Posts: 24
France
Message 44194 - Posted: 20 May 2012, 1:16:47 UTC - in response to Message 44134.  

Hi,

Thanks for pointing me to your thread. Bravo for your patience in observing and noting all this!

As I said in my previous post, I am trying to let the new scheduler do its job for as long as possible; after that I can decide whether all projects are running smoothly together. In fact, my RAC is dropping, because the hosts are not getting as many tasks as before. I don't care about the score, though, only about fair crunching for all projects.

In fact, with the work buffer at 0.2d min and 0.2d res, one of my 24/24 hosts no longer receives any work for any project??? It is just computing one task for Climate... I hope this is not due to Climate's very long deadline...

My good old Linux 24/24 host running 6.12.22 is a good basis for comparison, because it does receive work continuously...

Wait and see. To be continued!

Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.