Just seeking some advice for an implementation where I'm batching kernel runs on a device, and there are particular read-only buffers I'd like to reuse across those runs.
Each batch consists of 1,000,000 kernel threads. I need to batch them because I process an array of structs where each struct holds the values its kernel thread writes to. If I didn't batch, the full array would require 39 GB of device memory.
So in my host loop I build an array of 1,000,000 structs, "h_clmodels", and fire it off to the kernel like this:
Code:
// 1,000,000 clmodels batched.
cl::Buffer d_clmodels(context, CL_MEM_READ_WRITE, h_clmodels.size()*sizeof(clmodel_t));
cl::Buffer d_clvar(context, CL_MEM_READ_ONLY, sizeof(clvar_t));
// Note: h_clmodels is a std::vector, so pass .data(), not &h_clmodels
// (the latter is the address of the vector object itself).
queue.enqueueWriteBuffer(d_clmodels, CL_TRUE, 0, h_clmodels.size()*sizeof(clmodel_t), h_clmodels.data());
queue.enqueueWriteBuffer(d_clvar, CL_TRUE, 0, sizeof(clvar_t), &h_clvar);

cl::Kernel kernel(program_, "compute", &err);
kernel.setArg(0, d_clmodels);
kernel.setArg(1, (unsigned int)h_clmodels.size());
kernel.setArg(2, d_clvar);

// Round the global size up to a multiple of the work-group size.
cl::NDRange localSize(64);
cl::NDRange globalSize((size_t)(ceil(h_clmodels.size()/64.0)*64));

cl::Event event;
queue.enqueueNDRangeKernel(kernel, cl::NullRange, globalSize, localSize, NULL, &event);
event.wait();

queue.enqueueReadBuffer(d_clmodels, CL_TRUE, 0, h_clmodels.size()*sizeof(clmodel_t), h_clmodels.data());
// Loop for next 1,000,000 batch.
That additional read-only buffer "d_clvar" is the one I want to reuse. It holds a struct of variables that are read in once by the host program and never changed again.
So my question: how can I create that d_clvar buffer so I can reuse it across my batched host loops, without calling enqueueWriteBuffer (and hence paying for a host-to-device copy) every iteration? Basically, I want to write it once into device memory and use it for each new kernel run.