Thread 'Does the new scheduler suspend jobs silently?'

Author	Message
rvp_lan Joined: 30 Dec 08 Posts: 24	Message 44127 - Posted: 13 May 2012, 20:24:49 UTC Last modified: 13 May 2012, 20:29:46 UTC

Hello,

There is already a lot of talk about the new scheduler, and I've tried to read as much of it as I could.

-- I have several hosts at home running Windows XP, 32-bit or 64-bit.
-- All are upgraded to 7.0.27.
-- I crunch for more than 20 projects. Not all of them have work.
-- The host that gives me the worst results with the new version has a quad-core CPU and one NVIDIA GPU, and is always up and connected.

I started a thread at Einstein because I repeatedly received a particular message in the log from their server: http://einstein.phys.uwm.edu/forum_thread.php?id=9438&nowrap=true#117085

The conclusion of that thread, still unresolved for me, is that I can't figure out the most appropriate "minimum work" and "reserve work" settings for a host connected 24 hours a day with a quad-core CPU and an NVIDIA GPU, such that the scheduler behaves almost the same way the older one did. (I know: different code, different behavior.) "Almost" the same way!

In that thread I explain that, as advised, I put debug booleans in cc_config.xml, tried disabling the GPU, re-enabled it, etc.

Actually, I have the feeling that the new scheduler suspends projects when it reaches a limit. What limit? I don't understand.

This was the chain of events:
-- upgraded from 7.0.25 to 7.0.27 (following the advice of Albert@Home)
-- reset ALL inactive projects (those with no WUs pending or crunching)
-- the client received too many WUs at once on a single request for some projects
-- let the scheduler do its job (didn't touch anything and waited more than a week)
-- after that, it was as if the projects no longer updated themselves
-- no more work was requested for the CPU, only for NVIDIA
-- I suspended the job with too many WUs
-- instantly, the other jobs started their reset (after a week!!!) and asked for work

With the previous scheduler, reset orders for all (or almost all) projects completed within 15~30 minutes, whether or not there was work to crunch.

What I don't get is why, because ONE project received too much work, it saturates and stops all the other projects. It even stops them from polling and updating (with or without asking for jobs).

Einstein has always been the most regular project, one from which I always received work. Because of this new message from their server, and to test Einstein's response (it had been denied work for a week), I suspended all other projects. Einstein then instantly asked for work and got 4 WUs, with a request for 138240 secs = 4 cores * (0.2d min + 0.2d res) * 24 * 3600. I'm OK with that: with all the others suspended, it's the only one working, so it asks for all cores. But after that, if I resume all the other projects, why do they all again seem to be denied polling?

The previous scheduler was able to deal with these 20 projects on four cores, and all projects seemed to receive jobs fairly and regularly.

I found at GPUGRID that the "best" values for a host connected 24/24 would be 0.2d min + 0.2d res. But obviously, my 2 hosts connected 24/24 do not behave as they did under the 6.x scheduler.

Unless there's a bug in 7.0.27, what would be the best values for such a host? Or what would be the best values for a nice distribution of work between ALL projects?

Thanx.
Cheers.
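
The request size quoted in the post above (138240 secs) can be checked with a few lines of arithmetic. This is only a minimal sketch, assuming the request is the core count multiplied by the sum of the two buffer settings converted to seconds, as the post writes it. The function name `requested_seconds` is hypothetical and this is not BOINC's actual work-fetch code.

```python
SECONDS_PER_DAY = 24 * 3600


def requested_seconds(cores: int, min_days: float, reserve_days: float) -> int:
    """Return cores * (min + reserve buffer), expressed in seconds.

    Hypothetical helper: it only reproduces the arithmetic from the post,
    not the real BOINC work-fetch logic.
    """
    return round(cores * (min_days + reserve_days) * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Quad-core host with 0.2d min + 0.2d res, as described in the post.
    print(requested_seconds(4, 0.2, 0.2))
```

With 4 cores and 0.2d + 0.2d this gives 138240, matching the figure reported from the Einstein request. Under the same assumption, the request scales linearly with either the core count or the buffer sizes.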

TRuEQ & TuVaLu Joined: 23 May 11 Posts: 108	Message 44134 - Posted: 14 May 2012, 10:27:01 UTC - in response to Message 44127. Last modified: 14 May 2012, 10:28:51 UTC

Cheers.

Here is my thread: http://boinc.berkeley.edu/dev/forum_thread.php?id=7554&sort=6

I am also trying to figure out how it is supposed to work and why changes are so slow..... I think it works, though, but very slowly..... A minor change can take days to take effect.

rvp_lan Joined: 30 Dec 08 Posts: 24	Message 44194 - Posted: 20 May 2012, 1:16:47 UTC - in response to Message 44134.

Hi,

Thanks for pointing me to your thread. Bravo for your patience in observing and noting all this!

As I said in my previous post, I'm trying to let the new scheduler do its job for as long as possible; after that I can decide whether all projects are running smoothly together.

Actually, my RAC is dropping because the hosts are not getting as many tasks as before. I don't care about the score, though, only about fair crunching for all projects.

Actually, with the work buffer at 0.2d min and 0.2d res, one of my 24/24 hosts no longer receives any work for any project??? It is just computing a task for Climate... I'm hoping it's not due to Climate's very long deadline...

My good old 24/24 Linux host on 6.12.22 is a good basis for comparison, because it does receive jobs continuously...

Wait and see. To be continued!

Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.