Reduce the number of threads and use a loop in the kernel to cover the work
