I tried both the blocking and non-blocking versions of clEnqueueWriteBuffer. It seems that the non-blocking version is not faster than the blocking one.

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

long long...