Message boards : Number crunching : many faulty ATMMLs recently
Author | Message |
---|---|
Within the past few hours, I've received several ATMML tasks which errored out after about 1 1/2 minutes, e.g.: | |
ID: 62024 | |
If you look down towards the end of that task report, you can read: "bzip2: Compressed file ends unexpectedly", which is pretty self-explanatory. It may be an effect of the lack of disk space: to make an archive, you need to have enough space for both the source file and the compressed version at the same time. The depressing thing about this sequence of errors is that this is a project which has said many times that their research results are important and cutting-edge, and speed is of the essence. They use many tools to encourage us to send the results back to them as quickly as possible: the 24-hour return bonus, manipulating task sizes in a way which discourages large (idle) caches, and so on. Some of the applications process a sequence of tasks based on a single dataset: later tasks are generated from the results of the early runs. But not if we can't return them, so work supply grinds to a halt too. And as others have pointed out, this has happened before: this outage was entirely predictable and manageable. They need to keep an eye on administration, as well as research. | |
ID: 62025 | |
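
For anyone who wants to check an archive before a task starts working on it, here is a minimal sketch in Python. It assumes only the standard `bz2` module; the function name `bz2_is_complete` is made up for illustration and has nothing to do with how the GPUGRID application itself unpacks its input. A truncated stream is the condition that the bzip2 tool reports as "Compressed file ends unexpectedly".

```python
import bz2


def bz2_is_complete(path, chunk_size=1 << 20):
    """Return True if the bzip2 stream at `path` decompresses to its end.

    Hypothetical helper, not part of BOINC or GPUGRID. The data is read in
    chunks and discarded, so large archives are never held in memory.
    """
    with bz2.open(path, "rb") as fh:
        try:
            while fh.read(chunk_size):
                pass
        except (EOFError, OSError):
            # EOFError: the file ended before the end-of-stream marker.
            # OSError: the data is not a valid bzip2 stream at all.
            return False
    return True
```

A missing file still raises `FileNotFoundError` here, because the file is opened outside the `try` block; only damaged or truncated content is reported as `False`.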
"And as others have pointed out, this has happened before: this outage was entirely predictable and manageable. They need to keep an eye on administration, as well as research." +1 | |
ID: 62026 | |
"If you look down towards the end of that task report, you can read:" I investigated, and my finding is that the problem is NOT caused by a lack of disk space. As I noticed a long time ago, these ATMMLs use up to ~14.4 GB of disk space in the first minutes after start; right after the compressed file has been decompressed, it is deleted, and from then on disk usage stays slightly above 9 GB for the remaining task processing time. The host on which these tasks error out after a few minutes has about 45 GB of disk space left for GPUGRID, so there cannot possibly be a disk space problem. Furthermore, a look at the workunit of such a task reveals that it has failed on other hosts as well, see here: https://www.gpugrid.net/workunit.php?wuid=30372600 Hence, my finding is that these tasks are most probably misconfigured :-( | |
ID: 62035 | |
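
The disk-space argument above can be written down as a quick check. This is only a sketch: the ~14.4 GB peak and ~45 GB free figures are the ones reported in the post, the helper names `free_gb_at` and `has_headroom` are invented for illustration, and the check looks at the raw filesystem only, ignoring any disk limit configured in the BOINC client.

```python
import shutil

# Figure taken from the post above; an observation, not a project specification.
PEAK_GB = 14.4  # approximate peak usage while the archive is being unpacked


def free_gb_at(path):
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_headroom(free_gb, peak_gb=PEAK_GB):
    """True if the free space covers the peak usage of one task."""
    return free_gb >= peak_gb


# The host described in the post: about 45 GB free against a ~14.4 GB peak.
print(has_headroom(45.0))
```

With roughly three times the peak requirement available, a local shortage of disk space does not explain the failures, which is the point the post makes.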
I was suggesting that the server might be struggling when creating the archive, not that your machine introduced the errors when unpacking it. | |
ID: 62039 | |
Although the server should have been okay since yesterday, I still get tasks which error out after a few minutes, like: | |
ID: 62047 | |
GIGO - Garbage In, Garbage Out. | |
ID: 62048 | |
True, but at least the garbage is out pretty quickly. I'm thankful. | |
ID: 62049 | |
"True, but at least the garbage is out pretty quickly. I'm thankful." If such faulty tasks error out a few minutes after they start, it's okay with me. However, if a task runs for about 3 1/2 hours and then crashes: https://www.gpugrid.net/result.php?resultid=37190122 it's a real waste of resources :-( | |
ID: 62063 | |
After I sent the previous message, I also got a few that had been tried 5 times and failed after running for a long time as well. Sigh. | |
ID: 62064 | |