BOINC client crashes immediately at startup (macOS 10.12.6)
Author | Message |
---|---|
Joined: 5 Oct 06 · Posts: 5149 |
Got it, thanks, and opened OK. I'll start looking, but I may be some time...

Well, the first obvious clue is from stdoutdae.txt: the client has been restarting continuously. It seems normal, until this happens:

    18-Sep-2017 15:16:39 [---] (to change preferences, visit a project web site or select Preferences in the Manager)
    18-Sep-2017 15:16:39 [---] Using account manager BOINCstatsBAM!
    18-Sep-2017 15:16:39 Initialization completed
    18-Sep-2017 15:16:39 [---] Running CPU benchmarks
    18-Sep-2017 15:16:39 [---] Starting BOINC client version 7.8.2 for x86_64-apple-darwin

about 120 times in the snippet you sent me. So: init completed, start running benchmarks, crash. The next line, on my v7.8.2 for Windows (and many previous versions), is

    09-Sep-2017 21:35:39 [---] Suspending computation - CPU benchmarks in progress

so the crash is very quick.

BTW, BAM! is not implicated - about half the crashes happened before BAM! was attached. |
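The restart count described above can be reproduced with a few lines of Python. This is only a sketch: the two marker strings are copied from the log excerpt in this post, while the function name and the idea of feeding it the lines of stdoutdae.txt are illustrative assumptions, not part of BOINC.

```python
# Marker strings copied from the log excerpt quoted in the thread.
START_MARKER = "Starting BOINC client version"
BENCHMARK_MARKER = "Running CPU benchmarks"


def summarize_restarts(log_lines):
    """Count client starts, and how many directly follow a benchmark line.

    Hypothetical helper: returns (starts, starts_after_benchmark).
    A start that immediately follows "Running CPU benchmarks" suggests the
    previous run died as soon as benchmarking began.
    """
    starts = 0
    starts_after_benchmark = 0
    previous = ""
    for raw in log_lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if START_MARKER in line:
            starts += 1
            if BENCHMARK_MARKER in previous:
                starts_after_benchmark += 1
        previous = line
    return starts, starts_after_benchmark
```

To try it on a real log, pass the result of `open(path).readlines()` for your own copy of stdoutdae.txt; a large second number relative to the first matches the pattern seen here.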
Joined: 8 Mar 12 · Posts: 7 |
> about 120 times in the snippet you sent me.

Yes, I may have pressed "Yes, try again" a couple of times :) On my machine, successfully running CPU benchmarks takes around 30 seconds. It does not get much done before crashing. |
Joined: 20 Nov 12 · Posts: 801 |
In "*** buffer overflow detected ***: boinc_client terminated", the problem was a CPDN task which failed, and the client's error reporting wasn't prepared to deal with 100+ missing output files. pbro in message 81364 shows that even 50 files is too many. The error message is so long that it overflows the buffer allocated for it.

For several years now Debian/Ubuntu, and probably Fedora as well, have used compiler settings that detect such buffer overflows. I think a similar compiler setting is used on Windows too. So we should have seen people having this problem much earlier.

So, a question for those with more in-depth knowledge of CPDN: are these tasks with tens of output files something new?

The fix should be in "client: eliminate possible buffer overflow in reporting result errors" and "client: use snprintf() instead of sprintf() in a few places", which didn't make it into 7.8.2.

Edit: with a message like this per file:

    <file_xfer_error>
      <file_name>wah2_wus25_ti5c_200309_25_583_011070828_1_r1095091230_5.zip</file_name>
      <error_code>-161 (not found)</error_code>
    </file_xfer_error>

the safe limit on the number of output files per task is about 25. The "r1095091230" is a relatively new addition to the server software, but having it drops the limit by only about one file. |
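The arithmetic behind that limit can be sketched as follows. The per-file entry is rebuilt from the example in this post; the exact whitespace and the 4096-byte buffer are assumptions made for illustration (the thread does not state the real buffer size), and all function names are hypothetical.

```python
# File name taken from the example entry quoted in the post.
SAMPLE_NAME = "wah2_wus25_ti5c_200309_25_583_011070828_1_r1095091230_5.zip"

# Assumption for illustration only: the thread does not give the buffer size.
ASSUMED_BUFFER_SIZE = 4096


def file_error_entry(file_name, code=-161, reason="not found"):
    """Build one per-file error entry (whitespace layout is an assumption)."""
    return (
        "<file_xfer_error>\n"
        f"  <file_name>{file_name}</file_name>\n"
        f"  <error_code>{code} ({reason})</error_code>\n"
        "</file_xfer_error>\n"
    )


def report_size(file_count, file_name=SAMPLE_NAME):
    """Total characters needed to report file_count missing files."""
    return file_count * len(file_error_entry(file_name))


def files_that_fit(buffer_size, file_name=SAMPLE_NAME):
    """How many entries fit, keeping one byte for a C string terminator."""
    return (buffer_size - 1) // len(file_error_entry(file_name))


if __name__ == "__main__":
    print("bytes per entry:", len(file_error_entry(SAMPLE_NAME)))
    print("entries that fit in the assumed buffer:", files_that_fit(ASSUMED_BUFFER_SIZE))
    print("bytes needed for 51 files:", report_size(51))
```

Under these assumptions a task with around 50 output files needs roughly twice the assumed buffer, which is the situation an unbounded sprintf() cannot survive and a bounded snprintf() would merely truncate.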
Joined: 5 Oct 06 · Posts: 5149 |
We might have something there. I'm looking through pbro's client_state.xml, and a second user - MR, you know who you are - has also sent me a set of files including client_state.xml. Both users have a failed CPDN WAH2 PNW task in their client_state, showing 50 files (51 actually - zips 1 to 49, restart, out) plus a crash dump - about 14 KB in total for the <result> section. MR also mentions suffering the "attached to Einstein@home twice" problem due to the changed master url - that's another complication we could do without. Edit - I've posted in Jim1348's thread at CPDN, asking if users have failures/successes with the 51-file WAH2 PNW workunits. Most Linux CPDN failure reports tend to come from bad 32-bit library installation - I haven't found any discussion specific to these tasks yet. |
Joined: 8 Mar 12 · Posts: 7 |
Richard, a shot into the blue, but the site this work was done at was having internet connection issues this weekend. Might this be related, i.e. only a temporarily failed upload triggers the client suicide? Secondly, is there anything but a clean reset that I can do to get my installation running again? |
Joined: 5 Oct 06 · Posts: 5149 |
One other small anomaly in pbro's bundle of files is a zero-length file called 'all_projects_list_temp.xml' (there's a full-size 'all_projects_list.xml' as well). We had problems with a zero-length RPC (different file) recently - could this indicate another problem causing or caused by the client crashing at startup? File is dated 18/09/2017 10:21, which is in the middle of the sequence of crashes with v7.8.2 - in fact the file coincided with "Version change (7.6.34 -> 7.8.2)", so it's part of the testing just before I got the files. |
Joined: 8 Mar 12 · Posts: 7 |
I did attempt to see if a BOINC upgrade resolves the issues, before contacting the forum. That will explain the version change. |
Joined: 5 Oct 06 · Posts: 5149 |
> Richard, a shot into the blue, but the site this work was done at was having internet connection issues this weekend. Might this be related, i.e. only a temporarily failed upload triggers the client suicide?

I think it's unlikely - BOINC is designed to cope with that sort of thing. My ISP resets my line once a week, and that doesn't cause any problems while it's down. But thanks for adding it to the report - we're probably going to need every scrap of information we can get.

> Secondly, is there anything but a clean reset that I can do to get my installation running again?

The reset will be the quickest and easiest, but if you feel up to performing a little experiment first, it would be very helpful. Could you open the file 'client_state.xml', please, with a plain text editor - not a fancy XML editor. Then search for the line

    <name>wah2_pnw25_c4ci_190312_49_658_011241397_0</name>

It will be in a block of XML that looks like

    ...
        </file_ref>
        <file_ref>
            <file_name>ozone_hist_N96_1899_1910v2.gz</file_name>
            <open_name>ozone_hist_N96_1899_1910v2.gz</open_name>
        </file_ref>
    </workunit>
    <result>
        <name>wah2_pnw25_c4ci_190312_49_658_011241397_0</name>
        <final_cpu_time>15880.410000</final_cpu_time>
        <final_elapsed_time>16233.776360</final_elapsed_time>
        <exit_status>0</exit_status>
        <state>3</state>
    ...

Then please go back to the beginning of the `<workunit>` section, and down to the end of the `</result>` section, and remove absolutely everything - the whole

    <workunit> ... </workunit> <result> ... </result>

segment, including those lines. Then save the file, and try BOINC one more time. If it starts normally, we have our smoking gun. Remember to stop work fetch from CPDN as soon as you get control again!

If that doesn't work, just reset things - I doubt I'm going to come up with any more ideas tonight. |
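For anyone who would rather script the edit than do it by hand, here is a minimal sketch of the same removal done on plain text lines. It is not a BOINC tool: the function name is hypothetical, and it assumes the layout shown in the excerpt (the `<name>` line directly after `<result>`, and the matching `</workunit>` directly before it). It also assumes the workunit name is a prefix of the result name, and refuses to touch the file contents when any of that does not hold.

```python
def remove_task_block(lines, result_name):
    """Return a copy of lines without the <workunit>...</result> span.

    Hypothetical helper. Raises ValueError instead of guessing when the
    layout differs from the excerpt quoted in the thread.
    """
    stripped = [line.strip() for line in lines]
    name_tag = f"<name>{result_name}</name>"

    # Find the <result> block whose first child is the requested name.
    name_idx = None
    for i, text in enumerate(stripped):
        if text == name_tag and i > 0 and stripped[i - 1] == "<result>":
            name_idx = i
            break
    if name_idx is None:
        raise ValueError(f"no <result> block named {result_name!r}")
    result_start = name_idx - 1
    result_end = stripped.index("</result>", name_idx)  # ValueError if missing

    # Assumption: the workunit closes on the line just before <result>.
    if result_start == 0 or stripped[result_start - 1] != "</workunit>":
        raise ValueError("no </workunit> directly before the <result> block")

    # Walk back to the opening <workunit> line.
    wu_start = result_start - 1
    while wu_start >= 0 and stripped[wu_start] != "<workunit>":
        wu_start -= 1
    if wu_start < 0:
        raise ValueError("opening <workunit> not found")

    # Sanity check (assumption): workunit name is a prefix of the result name.
    wu_name_line = stripped[wu_start + 1]
    if not (wu_name_line.startswith("<name>") and wu_name_line.endswith("</name>")):
        raise ValueError("workunit block does not start with a <name> line")
    wu_name = wu_name_line[len("<name>"):-len("</name>")]
    if not result_name.startswith(wu_name):
        raise ValueError("preceding workunit does not match the result name")

    return lines[:wu_start] + lines[result_end + 1:]
```

Stop the client and keep a backup copy of client_state.xml before writing the returned lines back, and check by eye that the removed span is the one you meant.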
Joined: 29 Aug 05 · Posts: 15632 |
I've asked CPDN to come take a look in this thread and advise, or at least tell if they think it's probable or coincidence. |
Joined: 5 Oct 06 · Posts: 5149 |
I've re-activated my CPDN account, and with any luck I should get a WAH2 PNW in about 10 minutes (end of 1 hour backoff). If I disappear off the face of the earth, you'll know where I've gone... Otherwise, I might be able to see how it works under v7.8.2 for Windows.

Edit - well, that worked better than I expected:

    18/09/2017 18:05:52 | climateprediction.net | [sched_op] CPU work request: 10726.27 seconds; 4.00 devices
    18/09/2017 18:05:54 | climateprediction.net | Scheduler request completed: got 4 new tasks
    18/09/2017 18:05:54 | climateprediction.net | [sched_op] estimated total CPU task duration: 2309063 seconds

One and three spares. Why does the client request work for 4 devices, when all cores are busy with three CPU tasks and one OpenCL support job?

Edit2 - actually, handy. One CAM model with 12 upload files, two AFR with 14 files, and one PNW with the 51 files. I'll start with an easy one. |
Joined: 8 Mar 12 · Posts: 7 |
That indeed fixed it. Back to happily crunching numbers (not CPDN though). Thanks for your help! |
Joined: 5 Oct 06 · Posts: 5149 |
And thank you for yours. We know what we need now, and the fix is already available - we just need a new build so we can test it properly. |
Joined: 29 Aug 05 · Posts: 15632 |
Answer from the CPDN moderators list:

gdp wrote:

> I've seen this in Linux, and I'm not running 7.8.2. I've been running 7.4.22 forever. It's been happening in the last month or two when a task crashes and something gets corrupted. I don't know whether some OS update changed something that boinc utilizes, or what (I'm running on Ubuntu 16.04 or higher). The only way I've been able to get the installation to work again without removing and reinstalling boinc is to remove all traces of the crashed task from client_state.xml. I can then start boinc back up and it will continue with whatever other tasks were running. Unfortunately that doesn't resolve why the problem is happening and why it's only been happening for the last couple months. I couldn't see a corruption in client_state.xml when I looked at it, but I am not an expert in what that file looks like at all times. |
Joined: 5 Oct 06 · Posts: 5149 |
That's more-or-less what we expected. Ask him to check batch 658 - that's the PNW that I've got. Data, not application. Edit - note on the CPDN front page that batch 658 was submitted on 15 September. That matches - and my first one crashed, though that was a 651. |
Joined: 18 Sep 17 · Posts: 2 |
FYI, I am a moderator at cpdn and run mostly linux.

Certain wah2 science app batches of cpdn tasks crash on otherwise stable Mac and Linux PCs. These batches will mostly crash after 1 model month, on Jan 1 of the next model year, as the regional worker takes over after the global worker finishes that day. These batches run okay on Windows PCs. There is some problem that happens on these batches at that point. The common theme for these batches is "naturalized" parameter sets. This dates back to April.

However, in the last couple months, sometimes when this type of crash occurs, the boinc client can no longer communicate, and restarting boinc results in errors similar to the ones posted in this thread. The only way to recover from this for me has been to edit client_state.xml and remove all entries related to the crashed task. I'm thinking some OS update occurred in the last couple months that changed some files in how boinc works with the science app, or writes something, or who knows... I'm not a programmer or system person, just an IT enthusiast.

The cpdn programmers know of the crash problem with the naturalized parameter sets, but have not been able to isolate the cause as to why it only occurs on Linux and Mac. The input files should be the same for the Windows app.

Richard, 658 tasks will crash on Mac and Linux after one month, no matter which boinc client they use. Whether the boinc "corruption" problem occurs may depend on the OS distribution, version, and what updates have been run on it.

Edit... I see you got AFR and CAM tasks as well. The CAM tasks should work fine. Not sure about the AFR ones.

George |
Joined: 5 Oct 06 · Posts: 5149 |
Thanks George. There seem to be two problems there:

1) Why do Linux and Mac CPDN tasks fail after one month? Dunno, but you might talk to the CPDN programmers about vsyscall - mentioned in this thread, I think, else search recent threads. Certainly that function is being removed from Linux: if it's being called during the month-end file shuffle, that might cause the failure on recently updated Linux kernels.

2) Why does BOINC crash when the CPDN app crashes? We seem to have established via this thread (and the final test was exactly a repeat of your own procedure) that BOINC can't run when it has a huge stderr_txt and a huge pile of failed uploads for an unreported result. That seems to be the result of old and neglected code, perhaps dating back several versions. There is a fix in the pipeline, but no date for a test build yet: it was omitted from v7.8.2, despite being available then. |
Joined: 20 Nov 12 · Posts: 801 |
An alternative to editing client_state.xml would be deleting account_climateprediction.net.xml from BOINC's data directory. This should be approximately equivalent to using BOINC Manager to remove CPDN. All CPDN tasks would be lost, but it doesn't carry the risk of losing tasks for other projects. |
Joined: 5 Oct 06 · Posts: 5149 |
Good idea.

My other correspondent has sent me some log files. That client is also failing immediately after starting the 'run benchmark' process, but with one extra line:

    17-Sep-2017 14:17:10 [---] Running CPU benchmarks
    17-Sep-2017 14:17:10 [---] Received signal 15
    17-Sep-2017 14:17:09 [---] cc_config.xml not found - using defaults
    17-Sep-2017 14:17:09 [---] Starting BOINC client version 7.6.34 for x86_64-apple-darwin

Note the extra 'Received signal 15' line, if that helps anyone. |
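As a side note on that extra line: on macOS and Linux, signal number 15 is SIGTERM, the ordinary termination request, as opposed to a fault signal such as SIGSEGV. What sent it in this case is not something the log excerpt shows. A small sketch for looking up the name (the helper name is made up for illustration):

```python
import signal


def describe_signal(number):
    """Return the symbolic name for a signal number, such as the 15 in the log."""
    return signal.Signals(number).name


if __name__ == "__main__":
    print(describe_signal(15))
```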
Joined: 18 Sep 17 · Posts: 2 |
> 2) Why does BOINC crash when the CPDN app crashes? We seem to have established via this thread (and the final test was exactly a repeat of your own procedure) that BOINC can't run when it has a huge stderr_txt and a huge pile of failed uploads for an unreported result. That seems to be the result of old and neglected code, perhaps dating back several versions. There is a fix in the pipeline, but no date for a test build yet: it was omitted from v7.8.2, despite being available then.

Now I have thoroughly read this thread and can see why it has only been happening recently. The wah2 batches with naturalized parameters have only recently started to regularly have numerous months in them. Prior to that, 1, 3, 10, 12, 13 and 18 months were the norm. Now a number of batches have more than 18 months in them, and some of those have the naturalized parameters. Those die early, and thus stderr and client_state are large, with a long list of upload files that were never sent.

I thought that because these boinc problems were recent, some OS change was the culprit. Obviously not. Thanks Richard and those others troubleshooting for identifying the main problem with boinc under this scenario. |
Joined: 16 Sep 17 · Posts: 3 |
I finally came back to check the thread, and tried out the "edit the client_state.xml" suggestion -- success! Thanks! |
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.