many faulty ATMMLs recently


Author	Message
Erich56 Joined: 1 Jan 15 Posts: 1146 Credit: 11,207,216,611 RAC: 12,208,486	Message 62024 - Posted: 14 Dec 2024 | 19:09:40 UTC
	Within the past few hours, I've been receiving several ATMML tasks which errored out after about 1 1/2 minutes, e.g.: https://www.gpugrid.net/result.php?resultid=37138619 Viewing the workunit reveals that these tasks failed on other hosts, too. What's happening (besides the upload problem due to "disk full")?

Richard Haselgrove Joined: 11 Jul 09 Posts: 1627 Credit: 9,583,103,923 RAC: 9,123,029	Message 62025 - Posted: 14 Dec 2024 | 19:36:40 UTC - in response to Message 62024.
	If you look down towards the end of that task report, you can read:
		bzip2: Compressed file ends unexpectedly; perhaps it is corrupted? Possible reason follows.
		bzip2: No error
		Input file = (stdin), output file = (stdout)
		It is possible that the compressed file(s) have become corrupted.
		You can use the -tvv option to test integrity of such files.
		You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.
		tar: Unexpected EOF in archive
		tar: Unexpected EOF in archive
		tar: Error is not recoverable: exiting now
	which is pretty self-explanatory. It may be an effect of the lack of disk space: to make an archive, you need to have enough space for both the source file and the compressed version at the same time.
	The depressing thing about this sequence of errors is that this is a project which has said many times that their research results are important and cutting-edge, and speed is of the essence. They use many tools to encourage us to send the results back to them as quickly as possible: the 24-hour return bonus, manipulating task sizes in a way which discourages large (idle) caches, and so on. Some of the applications process a sequence of tasks based on a single dataset: later tasks are generated from the results of the early runs. But not if we can't return them, so work supply grinds to a halt too.
	And as others have pointed out, this has happened before: this outage was entirely predictable and manageable. They need to keep an eye on administration, as well as research.
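For anyone who wants to test a suspect archive locally, here is a minimal Python sketch in the spirit of the `-tvv` integrity test mentioned in the error text. It reads every member of a bzip2-compressed tar archive to the end and reports the first failure. The function name `archive_is_intact` and the `.tar.bz2` layout are assumptions for illustration; the thread does not show how the ATMML application actually unpacks its input.

```python
import tarfile


def archive_is_intact(path):
    """Return (ok, detail) after reading every member of a .tar.bz2 archive.

    Hypothetical helper, not part of BOINC or the ATMML application.
    """
    try:
        with tarfile.open(path, mode="r:bz2") as archive:
            for member in archive:
                if member.isfile():
                    stream = archive.extractfile(member)
                    # Read in 1 MiB chunks so large members do not fill memory.
                    while stream.read(1024 * 1024):
                        pass
    except (tarfile.TarError, EOFError, OSError) as exc:
        # A truncated bzip2 stream typically surfaces as EOFError or ReadError.
        return False, f"{type(exc).__name__}: {exc}"
    return True, "all members read to the end"
```

A truncated download or a half-written archive on the server should both come back as `(False, ...)`, which matches the point that the failure can originate on either side.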

Erich56 Joined: 1 Jan 15 Posts: 1146 Credit: 11,207,216,611 RAC: 12,208,486	Message 62026 - Posted: 14 Dec 2024 | 20:17:08 UTC - in response to Message 62025.
	"And as others have pointed out, this has happened before: this outage was entirely predictable and manageable. They need to keep an eye on administration, as well as research."
	+1

Erich56 Joined: 1 Jan 15 Posts: 1146 Credit: 11,207,216,611 RAC: 12,208,486	Message 62035 - Posted: 15 Dec 2024 | 11:32:34 UTC - in response to Message 62025.
	"If you look down towards the end of that task report, you can read:
		bzip2: Compressed file ends unexpectedly; perhaps it is corrupted? Possible reason follows.
		bzip2: No error
		Input file = (stdin), output file = (stdout)
		It is possible that the compressed file(s) have become corrupted.
		You can use the -tvv option to test integrity of such files.
		You can use the `bzip2recover' program to attempt to recover data from undamaged sections of corrupted files.
		tar: Unexpected EOF in archive
		tar: Unexpected EOF in archive
		tar: Error is not recoverable: exiting now
	which is pretty self-explanatory. It may be an effect of the lack of disk space: to make an archive, you need to have enough space for both the source file, and the compressed version at the same time."
	I investigated, and my finding is that the problem is NOT caused by lack of disk space. As I noticed a long time ago, these ATMMLs use up to ~14.4 GB of disk space in the first minutes after start; right after the compressed file has been decompressed, it is deleted, and from then on disk usage stays slightly above 9 GB for the remaining processing time. The host on which these tasks error out after a few minutes has about 45 GB of disk space left for GPUGRID, so there cannot possibly be a disk space problem.
	Furthermore, a look at the workunit of such a task reveals that it has failed on other hosts as well, see here: https://www.gpugrid.net/workunit.php?wuid=30372600
	Hence, my finding is that these tasks are most probably misconfigured :-(
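The disk-space argument in this post comes down to simple arithmetic, sketched below in Python. The 14.4 GB peak and the 45 GB allowance are the figures reported in the post; the function names are hypothetical. Note that `shutil.disk_usage` reports free space on the whole volume, which is an assumption-laden stand-in: it is not the same thing as the per-project allowance the BOINC client enforces.

```python
import shutil

PEAK_GB = 14.4  # peak usage reported in the post, while the archive is unpacked


def headroom_gb(available_gb, peak_gb=PEAK_GB):
    """Space left over at the moment of peak usage (negative means too little)."""
    return available_gb - peak_gb


def volume_free_gb(path="."):
    """Free space on the volume holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    print(f"headroom with 45 GB available: {headroom_gb(45.0):.1f} GB")
    print(f"free on this volume: {volume_free_gb():.1f} GB")
```

With the reported figures the headroom is roughly 30 GB, which supports the conclusion that the client host is not the one running out of space.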

Richard Haselgrove Joined: 11 Jul 09 Posts: 1627 Credit: 9,583,103,923 RAC: 9,123,029	Message 62039 - Posted: 15 Dec 2024 | 13:14:34 UTC - in response to Message 62035.
	I was suggesting that the server might be struggling when creating the archive, not that your machine introduced the errors when unpacking it. But the effect's the same - it doesn't work, and we can't fix it 'in the field'.

Erich56 Joined: 1 Jan 15 Posts: 1146 Credit: 11,207,216,611 RAC: 12,208,486	Message 62047 - Posted: 17 Dec 2024 | 16:45:31 UTC
	Although the server should have been okay since yesterday, I still get tasks which error out after a few minutes, like: https://www.gpugrid.net/result.php?resultid=37154558
	Excerpt from stderr:
		File "C:\ProgramData\BOINC\slots\0\Lib\site-packages\sync\worker.py", line 124, in run
		raise RuntimeError(f"Simulation failed {ntry} times!")
		RuntimeError: Simulation failed 5 times!
	Any explanation for this behaviour?
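The traceback suggests a retry loop that gives up after a fixed number of attempts. The actual `worker.py` is not shown in this thread, so the following Python sketch is only a guess at the general pattern: `run_with_retries`, `simulate`, and `max_tries` are hypothetical names, and only the final error message is taken from the stderr excerpt.

```python
def run_with_retries(simulate, max_tries=5):
    """Call simulate() until it succeeds or max_tries attempts have failed.

    Hypothetical sketch; not the project's real worker code.
    """
    ntry = 0
    while ntry < max_tries:
        try:
            return simulate()
        except Exception as exc:
            ntry += 1
            print(f"attempt {ntry} of {max_tries} failed: {exc}")
    # Same wording as the message in the stderr excerpt.
    raise RuntimeError(f"Simulation failed {ntry} times!")
```

If the loop looks anything like this, a task with bad input fails on every attempt no matter which host runs it, and the time before the final error depends on how long each attempt runs before it breaks.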

Richard Haselgrove Joined: 11 Jul 09 Posts: 1627 Credit: 9,583,103,923 RAC: 9,123,029	Message 62048 - Posted: 17 Dec 2024 | 17:36:22 UTC - in response to Message 62047.
	GIGO - Garbage In, Garbage Out.

KeithBriggs Joined: 29 Aug 24 Posts: 33 Credit: 1,397,690,047 RAC: 6,400,388	Message 62049 - Posted: 18 Dec 2024 | 6:57:08 UTC - in response to Message 62048.
	True, but at least the garbage comes out pretty quickly. I'm thankful.

Erich56 Joined: 1 Jan 15 Posts: 1146 Credit: 11,207,216,611 RAC: 12,208,486	Message 62063 - Posted: 20 Dec 2024 | 12:59:19 UTC - in response to Message 62049.
	"True, but at least the garbage comes out pretty quickly. I'm thankful."
	If such faulty tasks error out a few minutes after they start, it's okay with me. However, if a task runs for about 3 1/2 hours and then crashes: https://www.gpugrid.net/result.php?resultid=37190122 it's a real waste of resources :-(

KeithBriggs Joined: 29 Aug 24 Posts: 33 Credit: 1,397,690,047 RAC: 6,400,388	Message 62064 - Posted: 20 Dec 2024 | 14:23:12 UTC - in response to Message 62063.
	After I sent the previous message, I also got a few tasks that failed with "Simulation failed 5 times!" after running for a long time. Sigh.
