Message boards : Number crunching : ATM: Free Energy Calculations new application
Author | Message |
---|---|
Just starting the thread for discussion of this new application. ATM = AToM. | |
ID: 59751 | Rating: 0 | rate: / Reply Quote | |
about the restart failure: it looks like it fails trying to create a directory that already exists. the error is "mkdir: cannot create directory 'atm_tmp': File exists". the restart logic needs some work to allow for that. | |
ID: 59752 | Rating: 0 | rate: / Reply Quote | |
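The restart failure reported above is the usual symptom of a non-idempotent directory creation step. How the ATM task actually creates `atm_tmp` is not shown in this thread, so the following is only a minimal sketch of the general fix, assuming the step could be done from Python; `ensure_dir` is a hypothetical helper name. In a shell script the equivalent would be `mkdir -p atm_tmp`.

```python
import os
import tempfile


def ensure_dir(path: str) -> str:
    """Create `path` if it is missing; do nothing if it already exists."""
    # exist_ok=True is what makes a second call (i.e. a restart) harmless.
    os.makedirs(path, exist_ok=True)
    return path


# Simulate a first start followed by a restart in the same slot directory.
with tempfile.TemporaryDirectory() as slot_dir:
    work_dir = os.path.join(slot_dir, "atm_tmp")
    ensure_dir(work_dir)  # first start: directory is created
    ensure_dir(work_dir)  # restart: no "File exists" error
    created = os.path.isdir(work_dir)
```

With this pattern a task resumed from a checkpoint reuses the existing directory instead of aborting.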
another quality of life improvement would be adding a <weight> line to the main task in the job.xml file. right now, with 2 tasks in the file and no weights defined, I'm guessing it splits the progress 50/50, so it thinks the task is 50% done once the extraction phase is complete. | |
ID: 59754 | Rating: 0 | rate: / Reply Quote | |
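The 50/50 guess above can be checked with a little arithmetic. The sketch below assumes the wrapper reports overall progress as a weight-proportional sum over the tasks listed in job.xml, with a default weight of 1 per task. `fraction_done` is a hypothetical helper, not a BOINC function, and the weight of 99 is only an illustrative value.

```python
def fraction_done(weights, tasks_finished, current_fraction=0.0):
    """Overall progress for a sequence of tasks run one after another.

    weights          -- one weight per task, in execution order
    tasks_finished   -- how many tasks have fully completed
    current_fraction -- progress (0..1) of the task currently running
    """
    total = sum(weights)
    done = sum(weights[:tasks_finished])
    if tasks_finished < len(weights):
        done += weights[tasks_finished] * current_fraction
    return done / total


# Two tasks, no weights defined (default 1 each): extraction finished.
equal = fraction_done([1, 1], tasks_finished=1)

# Same two tasks, but the main task carries almost all of the weight.
weighted = fraction_done([1, 99], tasks_finished=1)

print(equal, weighted)
```

With equal weights the bar jumps to 50% as soon as the short extraction step ends. Giving the main task a large weight keeps the reported progress close to the real share of the work done.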
task ran to completion in about an hour, but hit an error and threw it all away because the file size is too big. upload failure: <file_xfer_error> <file_name>T11_4-RAIMIS_TEST_ATM-0-1-RND7054_2_0</file_name> <error_code>-131 (file size too big)</error_code> </file_xfer_error> what a waste. | |
ID: 59755 | Rating: 0 | rate: / Reply Quote | |
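Error -131 means an output file was larger than the size limit declared for it. The actual limit for these tasks is not given in this thread, so the sketch below uses made-up numbers purely to illustrate the comparison. `exceeds_limit` and both sizes are hypothetical.

```python
import os
import tempfile


def exceeds_limit(path: str, max_nbytes: int) -> bool:
    """Return True if the file at `path` is larger than `max_nbytes`."""
    return os.path.getsize(path) > max_nbytes


# Illustrative values only: a 2048-byte "result" against a 1024-byte limit.
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh.write(b"\0" * 2048)
    result_path = fh.name

too_big = exceeds_limit(result_path, max_nbytes=1024)
fits = exceeds_limit(result_path, max_nbytes=4096)
os.remove(result_path)
```

A check like this at the end of a run would flag an oversized result before an hour of computation is discarded at upload time.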
I have over a dozen quick-failing ATM tasks. | |
ID: 59771 | Rating: 0 | rate: / Reply Quote | |
looks like the small batch of tasks that went out today is set up better. they ran for about an hour and completed successfully, without the file size issue at the end. | |
ID: 59792 | Rating: 0 | rate: / Reply Quote | |
This one https://www.gpugrid.net/workunit.php?wuid=27399736 has been running for about 11 hours and has been stuck at 66.666% for at least 4 hours now. There is almost no load on the GPU, just a few percent (3-5) once in a while, but there is constantly some load on the memory controller (10-30). Hope it will finish some day :) | |
ID: 59899 | Rating: 0 | rate: / Reply Quote | |
still no official communication from the project about these tasks. | |
ID: 59936 | Rating: 0 | rate: / Reply Quote | |
Just been sent a TL4 from WU 27405970. I see you've aborted two previous tasks from the same WU, Ian, on two different machines. Did you get any CPU usage figures from previous runs? I think I'll start it up with the GTX 1660 plus one core, but I'll probably abort it myself if it doesn't show much response. | |
ID: 59937 | Rating: 0 | rate: / Reply Quote | |
they spin up multiple processes like the Python tasks do, but I didn't catch them at the very beginning to see if they spike in use or anything like that. | |
ID: 59938 | Rating: 0 | rate: / Reply Quote | |
OK, I've set 3 CPUs for continuity from the current Python task, and I've put weights of 1-1-1-97 in the job file so I can see what's happening. | |
ID: 59939 | Rating: 0 | rate: / Reply Quote | |
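The 1-1-1-97 edit described above can be sketched as a small script. The real job.xml for these tasks is not shown in the thread, so the sample file and its application names (`step_a` to `step_d`) are placeholders. Only the `<job_desc>`, `<task>` and `<weight>` element names follow the wrapper's job file layout.

```python
import xml.etree.ElementTree as ET

# Placeholder job file with four tasks and no weights defined.
SAMPLE_JOB_XML = """<job_desc>
  <task><application>step_a</application></task>
  <task><application>step_b</application></task>
  <task><application>step_c</application></task>
  <task><application>step_d</application></task>
</job_desc>"""


def set_weights(job_xml: str, weights) -> str:
    """Return job_xml with one <weight> per <task>, in document order."""
    root = ET.fromstring(job_xml)
    tasks = root.findall("task")
    if len(tasks) != len(weights):
        raise ValueError("need exactly one weight per task")
    for task, weight in zip(tasks, weights):
        node = task.find("weight")
        if node is None:
            # No weight defined yet: add the element.
            node = ET.SubElement(task, "weight")
        node.text = str(weight)
    return ET.tostring(root, encoding="unicode")


updated = set_weights(SAMPLE_JOB_XML, [1, 1, 1, 97])
print(updated)
```

With these weights the three short steps together account for 3% of the progress bar, so the display mostly reflects the long final step.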
I see what you mean. Nearly half an hour in, CPU usage is showing around 25% of a single core, and GPU usage spiked once, to 41%, after about a quarter of an hour. It's one way of saving electricity, but I'd rather be doing something useful. Aborting. | |
ID: 59941 | Rating: 0 | rate: / Reply Quote | |
1.13 ATM running fine for me. | |
ID: 59961 | Rating: 0 | rate: / Reply Quote | |
FWIW, the first task I received completed successfully. | |
ID: 59964 | Rating: 0 | rate: / Reply Quote | |
I've also finished one: | |
ID: 59967 | Rating: 0 | rate: / Reply Quote | |
Overnight, I had 4 of these tasks cancelled by the server. | |
ID: 59968 | Rating: 0 | rate: / Reply Quote | |
1.13 ATM running fine for me. _______________ Same here. I quite enjoy completing these WUs. There should be a way to analyse why these WUs fail on certain machines when we are mostly running the same hardware and OS. It would be fun to see the results. | |
ID: 59969 | Rating: 0 | rate: / Reply Quote | |
keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week, and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out. the tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? who knows. no official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful. | |
ID: 59970 | Rating: 0 | rate: / Reply Quote | |
keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week, and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out. the tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? who knows. no official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful. _______________________ Well, most of us know that Abouh reads every word written on these threads and, without much song and dance, makes changes. He is the only admin across all the projects who diligently attends. Maybe, quite possibly. No arguments with your tweaking statement. | |
ID: 59971 | Rating: 0 | rate: / Reply Quote | |
Well, most of us know that Abouh reads every word written on these threads and, without much song and dance, makes changes. He is the only admin across all the projects who diligently attends. Maybe, quite possibly. No arguments with your tweaking statement. well, Abouh is the only one from the project team who actively communicates with us volunteers, which is great. All the others obviously don't care, and it has been like this for years, unfortunately. For example: 9 days ago I asked in the ACEMD 4 thread when new ACEMD 4 tasks will be around, or whether this subproject is dead. No reply so far, although a reply could be very simple, no longer than a single line :-( You know what I want to say ... it's kind of disappointing at times :-( | |
ID: 59972 | Rating: 0 | rate: / Reply Quote | |
keep in mind these are beta tasks, and the batch being sent NOW is not necessarily the same as the batch sent last week, and won't be the same as whatever is sent sometime in the future, until they get all the bugs worked out. the tasks last week basically ran with no perceived use of the GPU or CPU, so what were they doing? who knows. no official word from the project about these tasks at all. I wasn't willing to let the GPU/CPU be occupied for hours on end with the task spinning its wheels when they could be doing something more useful. that's great and all, but abouh is not the researcher working with this application. Abouh deals with the research on the Python RL tasks. These ATM tasks appear to be run by Raimis. (the researcher names are in the filenames of the WUs) | |
ID: 59973 | Rating: 0 | rate: / Reply Quote | |
https://gpugrid.net/result.php?resultid=33321222 | |
ID: 59974 | Rating: 0 | rate: / Reply Quote | |
... failed due to file size limit I am just trying to remember with which other application we've had the same problem some time ago - last year or 2 years ago ??? | |
ID: 59975 | Rating: 0 | rate: / Reply Quote | |
... failed due to file size limit it's happened a few times in the past with acemd3 tasks. see here from July 2021: https://www.gpugrid.net/forum_thread.php?id=5239#57117 ____________ | |
ID: 59976 | Rating: 0 | rate: / Reply Quote | |
Yea, I got my first ATM checkpoint :-) | |
ID: 59977 | Rating: 0 | rate: / Reply Quote | |
Yea, I got my first ATM checkpoint :-) the uploads are nearly 700MB in size, and this is likely the same problem from my link that we saw over a year ago. their server can't accept something that big. I don't think they ever figured out how to adjust the settings of their file server; they just tried to keep the file sizes below the limit, which they seem to have forgotten about. nothing you do will get them to upload. I've disabled ATM until they get it sorted out. | |
ID: 59978 | Rating: 0 | rate: / Reply Quote | |
On past chance, I bet and lost. | |
ID: 59979 | Rating: 0 | rate: / Reply Quote | |
GDF, Should I Abort these 12 completed ATM WUs that won't upload or is there a reasonable chance you'll fix it? | |
ID: 59980 | Rating: 0 | rate: / Reply Quote | |
Well, I just achieved my 100 hours, which was my 1st priority. I will abort and reset (if necessary) the completed tasks I have. If/when the project gets its act together, I'll be back. | |
ID: 59981 | Rating: 0 | rate: / Reply Quote | |
For me it's just this:
So 26 Feb 2023 11:57:00 CET | GPUGRID | Started upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0
So 26 Feb 2023 11:57:02 CET | GPUGRID | Backing off 04:12:16 on upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0
So 26 Feb 2023 11:57:19 CET | GPUGRID | Started upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0
So 26 Feb 2023 11:57:22 CET | GPUGRID | Backing off 05:10:06 on upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0
No message about the size, just about backing off. Hooray! ____________ - - - - - - - - - - Greetings, Jens | |
ID: 59986 | Rating: 0 | rate: / Reply Quote | |
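The event log excerpt above shows only "Started upload" and "Backing off" messages, with the backoff delay growing between attempts. To track this across many stuck uploads, the delays can be pulled out of the log text. The following is a minimal sketch that assumes the message format matches the four lines quoted in the post. `parse_backoffs` is a hypothetical helper.

```python
import re
from datetime import timedelta

# The four event log lines quoted in the post above.
LOG_LINES = [
    "So 26 Feb 2023 11:57:00 CET | GPUGRID | Started upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0",
    "So 26 Feb 2023 11:57:02 CET | GPUGRID | Backing off 04:12:16 on upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0",
    "So 26 Feb 2023 11:57:19 CET | GPUGRID | Started upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0",
    "So 26 Feb 2023 11:57:22 CET | GPUGRID | Backing off 05:10:06 on upload of TL9_55-RAIMIS_TEST_ATM-0-1-RND1804_0_0",
]

# Captures hours, minutes, seconds and the file name of a backoff message.
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on upload of (\S+)")


def parse_backoffs(lines):
    """Return (file_name, delay) pairs for every backoff message found."""
    found = []
    for line in lines:
        match = BACKOFF_RE.search(line)
        if match:
            hours, minutes, seconds = (int(match.group(i)) for i in (1, 2, 3))
            delay = timedelta(hours=hours, minutes=minutes, seconds=seconds)
            found.append((match.group(4), delay))
    return found


backoffs = parse_backoffs(LOG_LINES)
for file_name, delay in backoffs:
    print(file_name, delay)
```

This only summarises what the client already logs. It does not reveal why the upload failed, because the default log level does not record the reason.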
I just aborted the upload (not the workunit) and then it was reported as valid. | |
ID: 59988 | Rating: 0 | rate: / Reply Quote | |
I just aborted the upload (not the workunit) and then it was reported as valid. Indeed, this worked out for me as well. But is there a result that can be used? ____________ - - - - - - - - - - Greetings, Jens | |
ID: 59989 | Rating: 0 | rate: / Reply Quote | |
For me it's just this: There won’t be any message about why it failed unless you enable debugging messages. See the previous link I posted about when this issue happened 1.5 years ago. | |
ID: 59990 | Rating: 0 | rate: / Reply Quote | |
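Debugging messages in the BOINC client are switched on through log flags in cc_config.xml. Which flags the linked post used is not stated here, so the choice of `file_xfer_debug` and `http_debug` below is an assumption; both are standard client log flags related to transfers. The sketch only builds the XML text, and `build_cc_config` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags) -> str:
    """Build a cc_config.xml document that enables the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # A value of 1 turns the corresponding debug output on.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


# Assumed flag choice: transfer-related debug output only.
config_xml = build_cc_config(["file_xfer_debug", "http_debug"])
print(config_xml)
```

The resulting text would go into cc_config.xml in the BOINC data directory. The client has to re-read its config files (or be restarted) before the extra messages appear in the event log.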
I just aborted the upload (not the workunit) and then it was reported as valid. Partially successful for me. I attempted this with two tasks: one ended up as "Upload failed" while the other ended up as "Completed and validated". | |
ID: 59991 | Rating: 0 | rate: / Reply Quote | |
I just aborted the upload (not the workunit) and then it was reported as valid. Indeed, this worked out for me as well. | |
ID: 59992 | Rating: 0 | rate: / Reply Quote | |
I just aborted the upload (not the workunit) and then it was reported as valid. It worked on multiple PCs for me too. | |
ID: 59993 | Rating: 0 | rate: / Reply Quote | |
task ran to completion in about an hour. but hit an error and threw it all away because the file size is too big. No it's not a waste in my opinion because you found something out. You found that "the file size was too big" so it can be corrected so it doesn't happen again hopefully. :-) | |
ID: 60008 | Rating: 0 | rate: / Reply Quote | |
this now is a topic also on this thread: | |
ID: 60010 | Rating: 0 | rate: / Reply Quote | |
How can I get ATM ? | |
ID: 60178 | Rating: 0 | rate: / Reply Quote | |
you need to enable beta/test applications in your project preferences | |
ID: 60179 | Rating: 0 | rate: / Reply Quote | |
Ah, thanks. I had missed the "test application" setting. | |
ID: 60180 | Rating: 0 | rate: / Reply Quote | |
So far, I have noticed abnormal progress reporting on ATM tasks. | |
ID: 60313 | Rating: 0 | rate: / Reply Quote | |
There is still a mix of old, broken progress tasks along with fixed progress tasks in rotation. | |
ID: 60314 | Rating: 0 | rate: / Reply Quote | |
No, it's not the replication number. | |
ID: 60315 | Rating: 0 | rate: / Reply Quote | |
Nice explanation. | |
ID: 60317 | Rating: 0 | rate: / Reply Quote | |