Message boards : Graphics cards (GPUs) : can't get more than a single WU?
hi there!
ID: 9749
> a per-cpu-limit of 1 makes no sense for me

The project agrees, but so far has not been able to tell the BOINC server software about this.

MrS
____________
Scanning for our furry friends since Jan 2002
ID: 9781
> a per-cpu-limit of 1 makes no sense for me

Do you mean that standing in front of the server and waggling your finger did not work? What a shock!
ID: 9793
As far as I'm informed, they also tried excessive shouting, to no avail. Scandalous!
ID: 9810
Can't they set a per-cpu-limit of at least two WUs?
ID: 9822
This exponential timeout is a BOINC standard setting, so it would likely have to be changed by Berkeley. I think it does not quite make sense: if you don't get WUs, and after a second request one minute later you still don't get work, it's very likely that you won't get work anytime soon. However, the timeouts shouldn't reach obscene values of more than 24 h (which happens), especially when the machine is dry. Good timeout values could be: 1 min, 5 min, 10 min, 30 min, and from there a constant 1 h.

Of course, it would be even better if the BOINC client understood why the server is not issuing new work: if the server says the 24 h limit has been reached, the client should know when it makes sense to ask again, and if the maximum number of concurrent WUs has been reached, the client should know that it has to report a finished WU to get new ones. That would save us quite some traffic, headaches and downtime.
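To make the suggestion concrete, here is a minimal sketch in plain C++ (not actual BOINC client code; the function names and the doubling formula are illustrative assumptions) comparing an uncapped exponential back-off with the capped schedule proposed above:

```cpp
// Sketch only -- not BOINC client code. It contrasts an uncapped doubling
// back-off (which can grow past 24 h) with the proposed schedule of
// 1, 5, 10, 30 minutes, then a constant 1 hour.
#include <algorithm>
#include <cstdio>

// Illustrative uncapped back-off: doubles on every failed work request.
double exponential_backoff_sec(int failed_requests) {
    return 60.0 * (1 << std::min(failed_requests, 20));  // 1 min, 2 min, 4 min, ...
}

// Proposed schedule: a few short retries, then a flat one-hour interval.
double proposed_backoff_sec(int failed_requests) {
    static const double schedule_min[] = {1, 5, 10, 30};
    const int n = sizeof(schedule_min) / sizeof(schedule_min[0]);
    if (failed_requests < n) return schedule_min[failed_requests] * 60.0;
    return 3600.0;  // constant 1 h from then on
}

int main() {
    for (int i = 0; i < 12; ++i)
        std::printf("request %2d: uncapped = %8.0f s, proposed = %5.0f s\n",
                    i, exponential_backoff_sec(i), proposed_backoff_sec(i));
}
```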
ID: 9831
> Paul, would you like to suggest this to the alpha mailing list? :)

Um, well, yes and no ... :) Firstly, none of my suggestions seem to be taken too kindly ... but more importantly, we are mixing metaphors and mechanisms here. I am a little scatter-minded at the moment, so bear with me and ask questions if what I say does not make sense.

Firstly, the radical changes made in the 6.6.x series are not even close to being debugged. The good news is that, at least for some long-standing problems, we are finding out what they are and pinpointing some causes. For example, just recently I found a long-standing issue which had been reactivated by a new project, DD, which builds a huge directory structure in the slot when running the task; at the end it has to delete 4,000 files, which can cause a long pause for the BOINC client, which in turn causes the tasks it is running to fold up and die. So, progress of sorts. But it is slow ...

As to work fetch and scheduling, I will admit that we have focused on the resource scheduling issues, as they seem more critical. There are still issues. Like the one above, the tasks die silently and then are brought back to life silently ... the only time you know of trouble is when you see a lot of them dying with "no heartbeat" messages or too many restarts ... And the long-running task issue. Another sore point, and I don't know that it has been solved for sure. We think we have it ... but that is not at all clear ... I think I have seen it again ... in 6.6.25 or 28 ... Richard has seen some other CUDA scheduling issues that I have not yet tried to dig into ...

OK, all that is prelude to the question you asked about the exponential back-off ... and work fetch ... Well, on the list ... There are a number of issues with work fetch, not the least of which is the poor mechanisms, particularly on projects that have resource limits on the number of tasks that you can get, and what to do when those limits are hit or when the work is not steady. You can see more bad examples of what happens with this on MW. I suggested we think about another mechanism than resource share, to little effect ... only time will tell if I can make progress ... However, the attitude at UCB is likely to be: if you don't get work, exponential back-off is a good way to go, and so what if you run dry ...

I guess what I am trying to say is that this is kinda on my list, guys ... but Sisyphus had it easy ... all he had to do was push some rocks ... he never had to deal with UCB ...
ID: 9838
You're right, they have more pressing issues at hand, so any suggestion like "just change the back-off formula" will likely be ignored.

> I suggested we think about another mechanism than resource share

I think it may be easy: set the resource share per computing resource. If a project supports multiple ones, then let users set the share for each one individually. And drop the concept of debts, or at least give it such a low priority that it doesn't cause the scheduler to make strange moves like running dry. Really, debts shouldn't be that important!

MrS
____________
Scanning for our furry friends since Jan 2002
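A minimal sketch of what "resource share per computing resource" could look like (hypothetical data structures and made-up share numbers, not BOINC's actual scheduler code): each project carries one share per resource type, and the client normalizes within each resource independently:

```cpp
// Hypothetical sketch of per-resource shares; not BOINC's real implementation.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

enum Resource { CPU, CUDA };

struct Project {
    std::string name;
    std::map<Resource, double> share;  // user-set share per resource type
};

// Fraction of one resource a project should get: its share divided by the
// total share all attached projects declared for that resource.
double resource_fraction(const std::vector<Project>& projects,
                         const Project& p, Resource r) {
    double total = 0;
    for (const auto& q : projects) {
        auto it = q.share.find(r);
        if (it != q.share.end()) total += it->second;
    }
    auto it = p.share.find(r);
    if (total == 0 || it == p.share.end()) return 0;
    return it->second / total;
}

int main() {
    // Example shares are invented for illustration.
    std::vector<Project> projects = {
        {"GPUGRID",   {{CUDA, 100.0}, {CPU, 10.0}}},
        {"SETI@home", {{CUDA, 50.0},  {CPU, 100.0}}},
    };
    for (const auto& p : projects)
        std::printf("%s: %.0f%% of the GPU, %.0f%% of the CPU\n", p.name.c_str(),
                    100 * resource_fraction(projects, p, CUDA),
                    100 * resource_fraction(projects, p, CPU));
}
```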
ID: 9851
Maybe someone has a Phenom or Intel quad for me ... It's not really nice to keep such options "set by hardware", especially non-GPU-related hardware. What if someone has a GX2 graphics card and a single-core processor?

I don't know if your server software can do it, but many projects are using per-host limits. Can't you set a per-host limit of approx. 4-6 WUs? This should be enough (and not too much) for every machine, except "over-the-top" machines running something like four GX2 cards ...
ID: 9852
> I think they're wasting their time trying to debug 6.6.x. The system is fundamentally flawed; they need a clean differentiation between different co-processors. Otherwise all they can do is try to catch strange errors which are "impossible". But whom am I telling that ;)

Sadly, it is not quite that simple either. For one thing, what about mixed-mode tasks? The Lattice Project is planning tasks that will use a high amount of CPU while also using a GPU at the same time ... SaH is moving in that direction too, though right now they are working on a fall-back option so that tasks which don't run well on the GPU can be run on the CPU ... For example, many people are killing VLAR tasks rather than run them on CUDA, because they take about 4-6 times longer than other "normal" tasks. But that bollixes up the idea of separate per-resource shares per project.
ID: 9857
Well, factor the "1.0 CUDA, 0.03 CPU" in. Ignore the 0.x CPU part regarding resource share. If a project mixes tasks with high and low CPU usage, it will be difficult to maintain the correct cache size ... but the current system has no way of dealing with such a situation either.
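As a sketch of that accounting rule (hypothetical names and threshold, not BOINC code): charge a task entirely to the resource it uses as a full device and ignore the fractional usage of the other one:

```cpp
// Hypothetical sketch: attribute a task to its dominant resource only.
#include <cstdio>

struct TaskUsage {
    double cpu;   // e.g. 0.03 for a GPU task, 1.0 for a plain CPU task
    double cuda;  // e.g. 1.0 for a GPU task, 0.0 for a plain CPU task
};

// The resource a task counts against for share accounting; the 0.x CPU
// remainder of a GPU task is ignored. The 0.5 cut-off is an assumption.
const char* dominant_resource(const TaskUsage& t) {
    return (t.cuda >= 0.5) ? "CUDA" : "CPU";
}

int main() {
    TaskUsage gpugrid_task = {0.03, 1.0};  // "1.0 CUDA, 0.03 CPU"
    TaskUsage cpu_task     = {1.0, 0.0};
    std::printf("GPU task charged to: %s\n", dominant_resource(gpugrid_task));
    std::printf("CPU task charged to: %s\n", dominant_resource(cpu_task));
}
```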
ID: 9859
> Well, factor the "1.0 CUDA, 0.03 CPU" in. Ignore the 0.x CPU part regarding resource share. If a project mixes tasks with high and low CPU usage, it will be difficult to maintain the correct cache size ... but the current system has no way of dealing with such a situation either.

Well, I started a discussion of this, but the reception wasn't even warm enough to call it tepid.
ID: 9862