
Message boards : Number crunching : ACEMD3: strange hanging at the end of crunching

goldfinch
Joined: 5 May 19
Posts: 36
Credit: 675,942,021
RAC: 228,828
Message 61990 - Posted: 3 Dec 2024 | 10:21:50 UTC

Laptop: Win11, Intel i7 11th gen, discrete RTX 3070, 16 GB RAM. ACEMD3 tasks are processed quickly, but every time BOINC finishes the calculations, the computer hangs. I have noticed this behaviour only with ACEMD3. Temperature doesn't seem to be the cause: the ATM and ATMML apps are more compute-intensive and run at higher temperatures for longer, yet they never hang. Once the computer is restarted, BOINC starts normally and goes on to upload the results to the server.

Question: how do I troubleshoot such a case? For now, I have excluded the ACEMD3 app in my preferences, but I'm curious what could be wrong here. I suspect my GPU, as my older laptop with a GTX 1060 never hangs, even with the temperature rising to 101 C (I saw that only once, though). On the other hand, the computations themselves don't cause the hang; it seems that freeing resources at the end of the task does, so the cause may be not solely my GPU but, at least partially, the app...

goldfinch
Joined: 5 May 19
Posts: 36
Credit: 675,942,021
RAC: 228,828
Message 61991 - Posted: 3 Dec 2024 | 10:24:14 UTC - in response to Message 61990.

Actually, I started noticing this behaviour only quite recently. I'm not sure whether it's related to the new version of ACEMD3, whose tasks take some 20-25 minutes to finish; tasks for previous versions of the app ran for 2-3 hours and wouldn't hang.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2362
Credit: 16,498,277,983
RAC: 4,556,499
Message 61992 - Posted: 3 Dec 2024 | 10:58:30 UTC - in response to Message 61991.

The app hasn't changed, just the data processed by the app has changed. So I suspect that it might be some conflicting app on your laptop (for example a GPU monitoring app). Is there anything about this error in the reliability history?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1081
Credit: 40,247,837,595
RAC: 1,557,196
Message 61993 - Posted: 3 Dec 2024 | 13:15:40 UTC - in response to Message 61992.

> The app hasn't changed, just the data processed by the app has changed. So I suspect that it might be some conflicting app on your laptop (for example a GPU monitoring app). Is there anything about this error in the reliability history?

The app actually has changed. They used to just distribute a single binary file for the app, and now they distribute it as an archive containing the binary and many other things (probably some libraries or dependencies), and it's now called via the BOINC wrapper instead of directly. The most recent versions of the acemd3 app were released in September of this year.

goldfinch
Joined: 5 May 19
Posts: 36
Credit: 675,942,021
RAC: 228,828
Message 61994 - Posted: 3 Dec 2024 | 21:14:03 UTC - in response to Message 61993.

> The app actually has changed. They used to just distribute a single binary file for the app, and now they distribute it as an archive containing the binary and many other things (probably some libraries or dependencies), and it's now called via the BOINC wrapper instead of directly. The most recent versions of the acemd3 app were released in September of this year.

Thanks, Ian&Steve, for confirming. I think this weird behaviour started manifesting around that time. Before that, the laptop would occasionally hang, but that didn't depend on the app - more on the duration and complexity of a task.

> The app hasn't changed, just the data processed by the app has changed. So I suspect that it might be some conflicting app on your laptop (for example a GPU monitoring app). Is there anything about this error in the reliability history?

Short-run ACEMD3 tasks never hung before. They were my favourites for 2 reasons: proper checkpointing and stability on my laptop. Now things have changed. Even when running BOINC alone, without any other user app, each ACEMD3 task ends with a hang. It could be something specific about the way resources are freed or, maybe, requested, combined with the AV software (which I can't change). But, again, that never happened before in such a consistent manner. ATMML tasks, which run much longer, have become far more reliable than ACEMD3 because I can set and forget them, whereas each ACEMD3 task requires a hard restart of my laptop...

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2362
Credit: 16,498,277,983
RAC: 4,556,499
Message 61995 - Posted: 3 Dec 2024 | 22:36:16 UTC - in response to Message 61994.

I thought that your host had been able to finish ACEMD3 (v2.32) tasks since September, but now I understand that you've processed only ATMML tasks in that period; that's why you've noticed only quite recently that your computer hangs with the ACEMD3 v2.32 app.
I've added GPUGrid to my Windows 11 PC to test the ACEMD3 v2.32 app. It can finish ACEMD3 v2.32 tasks without hanging (just like many other Windows-based hosts).
What AV do you use?
Can you add the BOINC / GPUGrid work folders to the exceptions in your AV software?

c:\ProgramData\BOINC\slots\
c:\ProgramData\BOINC\projects\www.gpugrid.net\

Have you tried to reset the GPUGrid project in BOINC?
Can you check your PC's reliability history? (Click on Start and type "reliability" in the search field at the top.)

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2362
Credit: 16,498,277,983
RAC: 4,556,499
Message 61996 - Posted: 4 Dec 2024 | 0:02:43 UTC

The ACEMD3 v2.32 Windows app throws 195 (0xc3) EXIT_CHILD_FAILED errors quite frequently, not only on my Windows host. Luckily it happens very early.

goldfinch
Joined: 5 May 19
Posts: 36
Credit: 675,942,021
RAC: 228,828
Message 61997 - Posted: 4 Dec 2024 | 0:36:56 UTC - in response to Message 61995.

AV: WithSecure

It's managed by our security team, so I can't add exceptions.

I have 2 computers - please check the one that ends with *35. You'll see that it has 4 finished ACEMD3 tasks this month, but every one of them ended with the laptop needing a hard restart. That's why I'm asking how I can trace what is going wrong here. To make things harder, the slot is deleted as soon as the task has been uploaded, so it's really tricky to get the logs; and even the logs may help only indirectly, if the same log line appears before each hang.
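
One possible way around the slot being deleted is to copy files out of the slot directories while the task is still running. Below is a minimal Python sketch of that idea. It assumes the slots live under c:\ProgramData\BOINC\slots\ (the path mentioned earlier in this thread) and that the task writes its output to a file named stderr.txt in its slot; the function names, the backup folder and the polling interval are hypothetical and only there for illustration.

```python
import shutil
import time
from pathlib import Path


def snapshot_slots(slots_dir, backup_dir, patterns=("stderr.txt",)):
    """Copy files matching `patterns` from every slot into backup_dir/<slot>/.

    Returns the list of destination paths written in this pass.
    The file name "stderr.txt" is an assumption about what the task writes.
    """
    slots_dir, backup_dir = Path(slots_dir), Path(backup_dir)
    copied = []
    if not slots_dir.is_dir():
        return copied
    for slot in sorted(p for p in slots_dir.iterdir() if p.is_dir()):
        for pattern in patterns:
            for src in slot.glob(pattern):
                if not src.is_file():
                    continue
                dest = backup_dir / slot.name / src.name
                dest.parent.mkdir(parents=True, exist_ok=True)
                try:
                    shutil.copy2(src, dest)  # overwrites the previous copy
                except OSError:
                    continue  # file locked or already deleted; try next pass
                copied.append(dest)
    return copied


def watch(slots_dir, backup_dir, interval_s=30.0, rounds=None):
    """Take a snapshot every interval_s seconds; rounds=None runs until stopped."""
    done = 0
    while rounds is None or done < rounds:
        snapshot_slots(slots_dir, backup_dir)
        done += 1
        time.sleep(interval_s)
```

Calling something like watch(r"c:\ProgramData\BOINC\slots", r"c:\slot_backup") in a separate console before the task finishes would leave the last copied stderr.txt in the backup folder after the reboot. The copy can lag the hang by up to one interval, and a hard freeze may lose the most recent write, so this is a best-effort aid rather than a guaranteed capture.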

BTW, should I increase the log level to debug?

As for the reliability report, I have Gigabyte's ControlCenter.exe failing all the time, but that has been happening for about a year or longer. Previously it didn't cause issues, at least not consistently. Apart from that, the report typically shows the "Windows was not properly shut down" error, which is expected given the need for a hard reset.

I disabled the Control Center service; I'll test in that setup and update here. However, I'll have to do it tomorrow, so that the reliability report gives a clear picture, not tainted by today's issues.

homer__simpsons
Joined: 17 Nov 15
Posts: 14
Credit: 136,767,025
RAC: 141,114
Message 61999 - Posted: 4 Dec 2024 | 21:44:37 UTC
Last modified: 4 Dec 2024 | 21:45:40 UTC

For the record, I do not have this issue on my host. I see that your host uses BOINC 8.0.4, while I'm still on 8.0.2; could the issue come from there?

8.0.4 is flagged as a pre-release on GitHub: https://github.com/BOINC/boinc/releases

goldfinch
Joined: 5 May 19
Posts: 36
Credit: 675,942,021
RAC: 228,828
Message 62000 - Posted: 5 Dec 2024 | 1:11:58 UTC - in response to Message 61997.

Disabled the Gigabyte Control Center service, restarted the laptop, started BOINC: it hung again after finishing the calculations. I checked various files and found this in stderrdae.txt:

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF7F17F60FA read attempt from address 0x0000000000000020

Engaging BOINC Windows Runtime Debugger...
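
For what it's worth, the record above can be picked apart mechanically. Here is a small Python sketch that parses the "Reason:" line. The regular expression is a guess at the format based on this single sample (the "write attempt to address" variant in particular is an assumption), the function name is hypothetical, and the 64 KB threshold reflects the low address range that Windows keeps unmapped so that null-pointer use faults.

```python
import re

# Pattern inferred from the one sample line in this thread; other
# BOINC builds may format the record differently.
REASON_RE = re.compile(
    r"Reason: (?P<reason>.+?) \((?P<code>0x[0-9a-fA-F]+)\)"
    r" at address (?P<ip>0x[0-9a-fA-F]+)"
    r"(?: (?P<op>read|write) attempt (?:from|to) address"
    r" (?P<target>0x[0-9a-fA-F]+))?"
)

NULL_PAGE_LIMIT = 0x10000  # lowest 64 KB of the address space


def parse_exception_reason(line):
    """Return the fields of a 'Reason:' line as a dict, or None if no match."""
    match = REASON_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    target = info["target"]
    # A faulting address just above zero usually means a null pointer
    # was used with a small offset added to it.
    info["near_null"] = target is not None and int(target, 16) < NULL_PAGE_LIMIT
    return info
```

Applied to the line above, the faulting access is a read from address 0x20, which falls in the near-null range. That is consistent with code dereferencing a null pointer plus a small field offset, although the record alone doesn't show which module is responsible.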

As for the version, I had an issue where BOINC Manager suddenly became unable to connect to the client. I tried reinstalling 8.0 and 8.0.2, but nothing helped. 8.0.4 worked, so I'm staying on it for now. But I'll try reinstalling BOINC and test again - thanks for the suggestion.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2362
Credit: 16,498,277,983
RAC: 4,556,499
Message 62001 - Posted: 5 Dec 2024 | 11:10:32 UTC - in response to Message 62000.
Last modified: 5 Dec 2024 | 11:16:06 UTC

I'm using the 8.0.4 BOINC Manager on my Windows host; ACEMD3 v2.32 runs fine with that version, when it doesn't throw the "195 (0xc3) EXIT_CHILD_FAILED" error.
I'd ask the security team to exclude the BOINC slot and GPUGrid folders from the AV (provided that your company policy allows you to run BOINC on the given hardware).
Edit:
Is the timestamp of that "Unhandled Exception Record" recent?
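
If the dump itself carries no timestamp, a rough check is still possible. The Python sketch below counts how many times the marker text appears in the log, how many lines were written after the last occurrence, and when the file was last modified. The function name is hypothetical, and the log location is an assumption (stderrdae.txt is expected in the BOINC data directory, e.g. under c:\ProgramData\BOINC\). Note that the modification time only says when the file was last written, not when the exception happened.

```python
from datetime import datetime
from pathlib import Path

MARKER = "Unhandled Exception Detected"  # text seen in the excerpt above


def exception_summary(log_path, marker=MARKER):
    """Summarise occurrences of `marker` in a text log.

    Returns a dict with the number of occurrences, the number of lines
    following the last occurrence (None if there is none), and the
    file's last-modified time.
    """
    log_path = Path(log_path)
    lines = log_path.read_text(encoding="utf-8", errors="replace").splitlines()
    hits = [i for i, text in enumerate(lines) if marker in text]
    return {
        "occurrences": len(hits),
        "lines_after_last": (len(lines) - 1 - hits[-1]) if hits else None,
        "file_modified": datetime.fromtimestamp(log_path.stat().st_mtime),
    }
```

Running it right after a hang and again after the next one would show whether the occurrence count grows with each hang; if it doesn't, the record is probably an old one and unrelated.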
