Message boards : Number crunching : many faulty ATMMLs recently

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,207,216,611
RAC: 12,208,486
Message 62024 - Posted: 14 Dec 2024 | 19:09:40 UTC

Within the past few hours, I've received several ATMML tasks which errored out after about 1 1/2 minutes, e.g.:
https://www.gpugrid.net/result.php?resultid=37138619
Viewing the workunit reveals that these tasks failed on other hosts, too.
What's happening (besides the upload problem due to "disk full")?

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1627
Credit: 9,583,103,923
RAC: 9,123,029
Message 62025 - Posted: 14 Dec 2024 | 19:36:40 UTC - in response to Message 62024.

If you look down towards the end of that task report, you can read:

bzip2: Compressed file ends unexpectedly;
perhaps it is corrupted? *Possible* reason follows.
bzip2: No error
Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

which is pretty self-explanatory.

It may be an effect of the lack of disk space: to make an archive, you need to have enough space for both the source file, and the compressed version at the same time.
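The bzip2 message above suggests testing the archive's integrity. A minimal Python sketch of such a check, assuming the input is a bzip2-compressed tar archive (the function name `archive_is_intact` and the file name in the usage line are hypothetical, not taken from the project's code):

```python
import tarfile


def archive_is_intact(path):
    """Return True if every file in a .tar.bz2 archive can be read to the end.

    Assumption: the archive is a bzip2-compressed tar file. A truncated
    bzip2 stream surfaces as tarfile.ReadError or EOFError while reading.
    """
    try:
        with tarfile.open(path, mode="r:bz2") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                extracted = archive.extractfile(member)
                # Read in 1 MiB chunks so large members do not fill memory.
                while extracted.read(1024 * 1024):
                    pass
        return True
    except (tarfile.TarError, EOFError, OSError):
        return False
```

For example, `archive_is_intact("input.tar.bz2")` would return False for a file cut off partway through the compressed stream. It only detects the damage; as with the tar error above, a truncated archive cannot be repaired on the receiving side.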

The depressing thing about this sequence of errors is that this is a project which has said many times that their research results are important and cutting-edge, and speed is of the essence. They use many tools to encourage us to send the results back to them as quickly as possible: the 24-hour return bonus, manipulating task sizes in a way which discourages large (idle) caches, and so on. Some of the applications process a sequence of tasks based on a single dataset: later tasks are generated from the results of the early runs. But not if we can't return them, so work supply grinds to a halt too.

And as others have pointed out, this has happened before: this outage was entirely predictable and manageable. They need to keep an eye on administration, as well as research.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,207,216,611
RAC: 12,208,486
Message 62026 - Posted: 14 Dec 2024 | 20:17:08 UTC - in response to Message 62025.

And as others have pointed out, this has happened before: this outage was entirely predictable and manageable. They need to keep an eye on administration, as well as research.

+ 1

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,207,216,611
RAC: 12,208,486
Message 62035 - Posted: 15 Dec 2024 | 11:32:34 UTC - in response to Message 62025.

It may be an effect of the lack of disk space: to make an archive, you need to have enough space for both the source file, and the compressed version at the same time.

I investigated, and my finding is that the problem is NOT caused by lack of disk space.
As I noticed a long time ago, these ATMMLs use up to ~14.4 GB of disk space in the first minutes after start. Right after the compressed file has been decompressed, it is deleted, and from then on disk usage stays slightly above 9 GB for the remaining task processing time.

The host on which these tasks error out after a few minutes has about 45GB disk space left for GPUGRID, so there cannot possibly be a disk space problem.
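The arithmetic behind that conclusion can be sketched in a few lines of Python. The figures are the approximate ones quoted above (about 14.4 GB peak usage, about 45 GB free); the helper names `has_headroom` and `free_space_bytes` are hypothetical, and the sketch assumes decimal gigabytes:

```python
import shutil

GB = 10 ** 9  # decimal gigabyte, an assumption about the units quoted above


def has_headroom(free_bytes, peak_bytes, safety_margin_bytes=0):
    """Return True if the free space covers the peak usage plus a margin."""
    return free_bytes >= peak_bytes + safety_margin_bytes


def free_space_bytes(path):
    """Free space on the volume containing path, as reported by the OS."""
    return shutil.disk_usage(path).free


peak = int(14.4 * GB)  # approximate peak usage while unpacking
free = 45 * GB         # approximate space left for GPUGRID on this host
enough = has_headroom(free, peak)
```

With these numbers `enough` is True, with roughly 30 GB to spare. Note that `free_space_bytes` reports what the operating system sees, which may differ from the limit a BOINC client is configured to allow.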

Furthermore, a look at the workunit of such a task reveals that it has failed on other hosts as well, see here:
https://www.gpugrid.net/workunit.php?wuid=30372600

Hence, my finding is that these tasks are most probably misconfigured :-(

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1627
Credit: 9,583,103,923
RAC: 9,123,029
Message 62039 - Posted: 15 Dec 2024 | 13:14:34 UTC - in response to Message 62035.

I was suggesting that the server might be struggling when creating the archive, not that your machine introduced the errors when unpacking it.

But the effect's the same - it doesn't work, and we can't fix it 'in the field'.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,207,216,611
RAC: 12,208,486
Message 62047 - Posted: 17 Dec 2024 | 16:45:31 UTC

Although the server should have been okay since yesterday, I still get tasks which error out after a few minutes, like:
https://www.gpugrid.net/result.php?resultid=37154558

Excerpt from stderr:

  File "C:\ProgramData\BOINC\slots\0\Lib\site-packages\sync\worker.py", line 124, in run
    raise RuntimeError(f"Simulation failed {ntry} times!")
RuntimeError: Simulation failed 5 times!


Any explanation for this behaviour?
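The traceback is consistent with a retry loop that gives up after a fixed number of failed attempts. The following Python sketch shows that general pattern only. It is not the actual `worker.py` code, and `run_with_retries`, `simulate` and `max_tries` are hypothetical names; only the final error message is copied from the stderr excerpt:

```python
def run_with_retries(simulate, max_tries=5):
    """Call simulate() until it succeeds or max_tries attempts have failed.

    Hypothetical illustration of a retry loop, not the project's code.
    """
    ntry = 0
    while ntry < max_tries:
        try:
            return simulate()
        except Exception as error:
            ntry += 1
            print(f"Attempt {ntry} failed: {error}")
    # Every attempt failed, so report the total as in the stderr excerpt.
    raise RuntimeError(f"Simulation failed {ntry} times!")
```

With a loop of this shape, a task whose input is bad fails the same way on every attempt, so the retries only delay the final error.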

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1627
Credit: 9,583,103,923
RAC: 9,123,029
Message 62048 - Posted: 17 Dec 2024 | 17:36:22 UTC - in response to Message 62047.

GIGO - Garbage In, Garbage Out.

KeithBriggs
Joined: 29 Aug 24
Posts: 33
Credit: 1,397,690,047
RAC: 6,400,388
Message 62049 - Posted: 18 Dec 2024 | 6:57:08 UTC - in response to Message 62048.

True, but at least the garbage comes out pretty quickly. I'm thankful.

Erich56
Joined: 1 Jan 15
Posts: 1146
Credit: 11,207,216,611
RAC: 12,208,486
Message 62063 - Posted: 20 Dec 2024 | 12:59:19 UTC - in response to Message 62049.

True, but at least the garbage comes out pretty quickly. I'm thankful.

If such faulty tasks error out a few minutes after they start, it's okay with me.
However, if a task runs for about 3 1/2 hours and then crashes:
https://www.gpugrid.net/result.php?resultid=37190122
it's a real waste of resources :-(

KeithBriggs
Joined: 29 Aug 24
Posts: 33
Credit: 1,397,690,047
RAC: 6,400,388
Message 62064 - Posted: 20 Dec 2024 | 14:23:12 UTC - in response to Message 62063.

After I sent the previous message, I also got a few tasks that failed 5 times after running for a long time. Sigh.
