Message boards : Number crunching : ATMML work units erroring out with "Illegal instruction"

William Albert
Joined: 22 Sep 24
Posts: 2
Credit: 100,400,000
RAC: 2,010,931
Message 61830 - Posted: 26 Sep 2024 | 12:03:10 UTC

I have an older computer crunching full-time for WCG and GPUGRID.

The computer in question: https://www.gpugrid.net/show_host_detail.php?hostid=626036

This computer seems to be unable to complete any ATMML work units, and receives the following error when it tries:


+ python bin/rbfe_explicit_sync.py QB_A24_A36_asyncre.cntl
run.sh: line 24: 12561 Illegal instruction (core dumped) python bin/rbfe_explicit_sync.py $CONFIG_FILE
2024-09-25 15:43:12 (12505): bin/bash exited; CPU time 19.896937
2024-09-25 15:43:12 (12505): app exit status: 0x84
2024-09-25 15:43:12 (12505): called boinc_finish(195)


The full output of an example failed WU is here: https://www.gpugrid.net/result.php?resultid=36014912
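As a side note on the log above: the exit status 0x84 is decimal 132, which matches the shell convention of 128 plus the signal number for a process killed by a signal. The sketch below decodes it, assuming the wrapper passes through bash's exit status unchanged; the helper name `decode_shell_status` is made up for illustration.

```python
import signal


def decode_shell_status(status):
    """Return the signal implied by a shell exit status above 128, else None.

    Assumption: the status follows the common shell convention of
    128 + signal number for a child terminated by a signal.
    """
    if status > 128:
        try:
            return signal.Signals(status - 128)
        except ValueError:
            return None
    return None


if __name__ == "__main__":
    sig = decode_shell_status(0x84)  # value taken from the log above
    print(sig.name if sig else "not a signal exit")
```

For 0x84 this resolves to signal 4, SIGILL, which is consistent with the "Illegal instruction" message from run.sh.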

Looking around online, this appears to be a common error with Python ML frameworks that include pre-compiled binaries built for newer CPU targets with support for SSE4, AVX, etc., where the user attempts to run it on a processor that doesn't support those instruction set extensions.
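One way to check which of these extensions a Linux host reports is to read the "flags" line of /proc/cpuinfo. The following is a minimal sketch; the list in `REQUIRED` is an assumption picked for illustration and is not a documented ATMML requirement, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed set of extensions to look for; NOT a published ATMML requirement.
REQUIRED = ("sse4_1", "sse4_2", "avx", "avx2")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(flags, required=REQUIRED):
    """Return the required flags that are absent, in the order given."""
    return [name for name in required if name not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # Linux only
        missing = missing_flags(parse_cpu_flags(cpuinfo.read_text()))
        print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the listed flags are present; it does not show which instructions a given binary actually uses.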

Looking at the GPUGRID apps list (https://www.gpugrid.net/apps.php), the only CPU requirement listed is a 64-bit version of Windows or Linux running on an x86-64 processor. And in case that listing is machine-generated in a way that omits the actual CPU requirements, GPUGRID's "Join Us" page (https://www.gpugrid.net/join.php) similarly lists the CPU requirement as "64-bit" with at least one core, which literally any x86-64 CPU should be able to meet.

If support for additional instruction set extensions beyond the base x86-64 spec is a requirement of ATMML, then fair enough -- this computer's processor is admittedly quite old (although the GPU is much newer and AFAIK supports all the GPUGRID apps currently running).

However, if the CPU is only being used to run a CUDA app that does the actual heavy lifting, then ATMML's CPU requirements could be unnecessarily limiting the users who are able to run these jobs.

For the time being, I've opted out of ATMML work units in my preferences.

William Albert
Message 61831 - Posted: 26 Sep 2024 | 12:17:07 UTC

I found this related entry in the Linux kernel log, if it's helpful:


traps: python[12561] trap invalid opcode ip:73ddfba85876 sp:7ffc7f5a2a90 error:0 in libOpenMMPME.so[73ddfba85000+3d000]
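
The trap line gives both the faulting instruction pointer and the start of the mapped region, so the offset of the bad instruction inside libOpenMMPME.so is simple arithmetic: 0x73ddfba85876 - 0x73ddfba85000 = 0x876. A rough sketch of that calculation follows; the regex and the name `parse_trap_line` are illustrative assumptions, and the offset is relative to the mapping the kernel reports, which may not be the start of the file.

```python
import re

# Matches the fields used here from a kernel "traps:" line (assumed layout,
# based on the single example in this thread).
TRAP_RE = re.compile(
    r"ip:(?P<ip>[0-9a-f]+) .* in (?P<lib>[^\[\s]+)"
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_trap_line(line):
    """Return (library name, offset of ip within the reported mapping), or None."""
    match = TRAP_RE.search(line)
    if match is None:
        return None
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    return match.group("lib"), ip - base


if __name__ == "__main__":
    line = (
        "traps: python[12561] trap invalid opcode ip:73ddfba85876 "
        "sp:7ffc7f5a2a90 error:0 in libOpenMMPME.so[73ddfba85000+3d000]"
    )
    lib, offset = parse_trap_line(line)
    print(lib, hex(offset))
```

That offset could then be looked up in a disassembly of the library to see which instruction the CPU rejected.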

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1067
Credit: 40,231,533,983
RAC: 1,418
Message 61832 - Posted: 26 Sep 2024 | 12:17:19 UTC - in response to Message 61830.

my hunch is that you're right about the CPU being the issue. it lacks SSE4.1, SSE4.2, AVX, AVX2, and everything newer. I wouldn't be surprised at all if SSE4 or AVX were used, since those features have been present in basically all x86_64 CPUs from the last decade.

it's very common that an application is not 100% GPU based and still needs the CPU to do some work. I believe the ATM/ATMML apps do this, and some Einstein apps do as well.

opting out is the right decision for you. if you want to run these tasks, you'll have to upgrade the system to something more modern.

pututu
Joined: 8 Oct 16
Posts: 25
Credit: 4,153,801,869
RAC: 31,918,017
Message 61833 - Posted: 26 Sep 2024 | 17:19:31 UTC

At least your 6GB 1060 card can crunch quantum chemistry tasks successfully with the CPU that you have.
