strange error: garbage_collect "cannot collect"
Author | Message |
---|---|
Joined: 27 Jun 08 Posts: 642 |
Two of the GPUs on my 10-GPU mining rig are stuck: 0% utilization, with the work unit showing 100% done.

Error messages:

7209 SETI@home 8/10/2019 3:34:45 PM [error] garbage_collect(); still have active task for acked result blc32_2bit_guppi_58643_76143_HIP73005_0101.26078.409.23.46.97.vlar_0; state 5

10233 SETI@home 8/10/2019 4:20:49 PM [error] garbage_collect(); still have active task for acked result blc33_2bit_guppi_58643_86349_HIP33332_0131.3725.0.23.46.188.vlar_0; state 5

What's happening? Googling, I found a previous report dated 2010 over at SETI.

[EDIT] I cannot even kill boinc. I tried sudo kill -9 8109 (boinc) and plain kill 8109, and task 8109 never disappears from top or htop. The arguments column shows boinc with the command line --detectgpu, so it (7.16.1) seems stuck trying to detect the GPU and is not accepting the kill signal. This was after using /etc/init.d/boinc-client stop to try to stop the client. Going to reboot.

[EDIT 2] Suspended, set NNT, and rebooted. The two "stuck" tasks were assigned GPUs 0 and 1 and finished in under minutes. Resumed, and the rest of the tasks look back to normal. Maybe I ran out of memory with only 8 GB and 10 GPUs. |
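For anyone collecting these messages from a saved event log, here is a minimal Python sketch that pulls the result name and state number out of a line like the two quoted above. The function and type names (`parse_garbage_collect`, `StuckResult`) are made up for illustration, and the pattern assumes the message text is exactly as it appears in this post.

```python
import re
from typing import NamedTuple, Optional


class StuckResult(NamedTuple):
    """Result name and task state taken from one garbage_collect() message."""
    name: str
    state: int


# Assumption: the message text matches the wording quoted in this thread.
_PATTERN = re.compile(
    r"garbage_collect\(\); still have active task for acked result "
    r"([^;\s]+); state (\d+)"
)


def parse_garbage_collect(line: str) -> Optional[StuckResult]:
    """Return the result name and state, or None if the line does not match."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return StuckResult(match.group(1), int(match.group(2)))
```

Running it over every line of a saved log and counting the non-None results would show how often the error occurs and which work units are involved.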
Joined: 27 Jun 08 Posts: 642 |
I am trying to debug the problem, as it is happening once or twice a day. It would appear that memory is not the problem. Looking here:

    if (rp->got_server_ack) {
        // see if - for some reason - there's an active task
        // for this result. don't want to create dangling ptr.
        //
        ACTIVE_TASK* atp = active_tasks.lookup_result(rp);
        if (atp) {
            msg_printf(rp->project, MSG_INTERNAL_ERROR,
                "garbage_collect(); still have active task for acked result %s; state %d",
                rp->name, atp->task_state()

State 5 means finished OK, from what I understand. It looks like the Linux SETI app does not realize it finished. In my BOINC manager, under status, I see the following typical behavior: ....running....uploading....ready-to-report

(1) At what point is the status set to 5? Is it after the upload? After the "ready to report"? I am guessing the error occurs because the 5 is generated just after "running" finishes, but "uploading" does not take place for some reason. So it has got the server ack but is marked as still running, i.e. a dangling "active task".

(2) What exactly does "uploading" mean?

(3) What exactly does "reporting" mean?

Could there be a timing problem in the app when looking for the ack from the server? Who handles the ack: boinc or the app? Even if this is not a boinc problem, I would like to know the answers to 1, 2 and 3 before going over to SETI and stirring the pot.

==============some other observations=============

kill and kill -9 do not kill the "dangling" task, even under sudo. I am not an expert, but kill -9 has always worked for me. I do see that "boinc" is the owner of the dangling task. Is that what is keeping me from being able to kill it? I would rather kill it than reboot.

boinccmd --quit stops boinc but not that dangling task. A restart of the service fails: I see the task with the command "boinc --detectgpu xx" (don't remember exactly), and the task disappears and reappears as the service keeps trying to start, but boinc never gets past that detectgpu.

I end up rebooting the system, and often have to power it off and on because it never shuts down completely. |
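One thing that can be checked before rebooting is the kernel's view of the process state. On Linux, a process in uninterruptible sleep (state D, typically blocked inside a driver or I/O call) does not act on any signal, including SIGKILL, until that call returns, regardless of who owns it. Whether that is what happened here is only an assumption. The sketch below reads the state letter from `/proc/<pid>/stat`; the helper names are made up for illustration.

```python
from pathlib import Path

# Common state letters reported by the Linux kernel in /proc/<pid>/stat.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep",
    "Z": "zombie",
    "T": "stopped",
}


def parse_proc_state(stat_text: str) -> str:
    """Extract the one-letter state from the text of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' rather than on whitespace.
    """
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def proc_state(pid: int) -> str:
    """Read the state letter of a live process (Linux only)."""
    return parse_proc_state(Path(f"/proc/{pid}/stat").read_text())


def describe(state: str) -> str:
    """Map a state letter to a short description."""
    return STATE_NAMES.get(state, "other")
```

For example, `describe(proc_state(8109))` could be run while the process is stuck. A result of "uninterruptible sleep" would point at whatever the process is blocked on (for a GPU detection step, possibly the GPU driver) rather than at permissions.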
Joined: 28 Jun 10 Posts: 2842 |
(2) what exactly does "uploading" mean?

My understanding is that uploading is sending the zip file(s) with the data back to the server, and reporting is telling the project whether the task succeeded or failed. At least, that is what it means with CPDN. |
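To make the distinction concrete, here is a toy Python model of the sequence described in this thread (running, uploading, ready to report). It is only a sketch of the explanation above, not BOINC's actual data structures; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToyResult:
    """Toy model of one task's life cycle; all names here are hypothetical."""
    finished: bool = False        # the science app has exited
    files_uploaded: bool = False  # output files were sent to the server
    reported: bool = False        # the project was told success or failure

    def status(self) -> str:
        """Return a status label like the ones shown in the manager."""
        if not self.finished:
            return "running"
        if not self.files_uploaded:
            return "uploading"
        if not self.reported:
            return "ready to report"
        return "reported"
```

In this model, uploading and reporting are separate steps, so a task can have all of its files on the server and still be waiting to be reported.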
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.