I'm trying to maintain some code that automatically selects some compilation parameters by querying the device for maximum work group size, maximum local memory size and so on. It's not a true auto-tuning framework, but rather an attempt to pick one set of sensible parameters.
This has worked okay on Fermi GPUs, CPUs and at least one AMD GPU so far, but when trying it on a Tesla C1060 I end up with this error:
ptxas error : Entry function 'scanExclusive' uses too much shared data (0x4010 bytes + 0x10 bytes system, 0x4000 max)
The declared __local data comes to exactly 0x4000 bytes, but it seems the compiler needs some internal state in __local memory as well. Is there any way to tell how much it needs, so that I can factor that into the parameter estimation? Failing an a priori way to tell, is there any way the code can determine from the compilation failure that it was caused by local memory capacity rather than something else (preferably without looking for magic strings in the error message, which is neither future-proof nor portable)?
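To make the numbers concrete, here is a minimal sketch in C of the arithmetic implied by the error message above. The helper names (`implicit_overhead`, `local_budget`) are hypothetical and only for illustration. The 0x20-byte overhead is just what this particular kernel, driver and device reported, so it should be treated as an observation rather than a documented constant.

```c
#include <assert.h>
#include <stddef.h>

/* Figures copied from the ptxas message for 'scanExclusive'. */
enum {
    SHARED_MAX     = 0x4000, /* "0x4000 max" */
    SHARED_USED    = 0x4010, /* "uses too much shared data (0x4010 bytes" */
    SHARED_SYSTEM  = 0x10,   /* "+ 0x10 bytes system" */
    DECLARED_LOCAL = 0x4000  /* __local data declared by the kernel */
};

/* Bytes the toolchain added on top of the declared __local data:
 * the part of "used" that was not declared, plus the "system" part. */
size_t implicit_overhead(size_t used, size_t system, size_t declared)
{
    assert(used >= declared); /* the reported usage includes the declared data */
    return (used - declared) + system;
}

/* Largest __local declaration that still fits once `reserve` bytes
 * are held back for the toolchain; 0 if nothing is left. */
size_t local_budget(size_t capacity, size_t reserve)
{
    return reserve < capacity ? capacity - reserve : 0;
}
```

With the values from the message, `implicit_overhead(SHARED_USED, SHARED_SYSTEM, DECLARED_LOCAL)` comes to 0x20, and `local_budget(SHARED_MAX, 0x20)` leaves 0x3FE0 bytes for declared __local data. What I am missing is a way to obtain that reserve before compiling, instead of reading it off a failed build.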