Hi, I don't know if I landed on the right forum, but I would like to ask whether there is a way to speed up my code some more. My code is written with JOCL, but since there isn't much of a community for that, I came here.

The code works on a large array of image pixels. The input array (a[] in the kernel) contains (x images * 300 width * 300 height) values, so it is a one-dimensional array holding all the pixels of the different images. The purpose of my code is to compute the sum of the intensities PER IMAGE. This means that if the input contains (100 x 300 x 300) values, the output (c[] in the kernel) should be 100 values (100 sums). This is my code:

Code :
package PAR;
 
/*
 * JOCL - Java bindings for OpenCL
 * 
 * Copyright 2009 Marco Hutter - http://www.jocl.org/
 */
import IMAGE_IO.ImageReader;
import IMAGE_IO.Input_Folder;
import static org.jocl.CL.*;
 
import org.jocl.*;
 
/**
 * A small JOCL sample.
 */
public class IPPARA {
 
    /**
     * The source code of the OpenCL program to execute
     */
    private static String programSource =
            "__kernel void "
            + "sampleKernel(__global uint *a,"
            + "             __global uint *c)"
            + "{"
            + "__private uint intensity_core=0;"
            + "      uint i = get_global_id(0);"
            + "      for(uint j=i*90000; j < (i+1)*90000; j++){ "
            + "              intensity_core += a[j];"
            + "     }"
            + "c[i]=intensity_core;" 
            + "}";
 
    /**
     * The entry point of this sample
     *
     * @param args Not used
     */
    public static void main(String args[]) {
        long numBytes[] = new long[1];
 
        ImageReader imagereader = new ImageReader() ;
        int srcArrayA[]  = imagereader.readImages();
 
        int size[] = new int[1];
        size[0] = srcArrayA.length;
        long before = System.nanoTime();
        int dstArray[] = new int[size[0]/90000];
 
 
        Pointer srcA = Pointer.to(srcArrayA);
        Pointer dst = Pointer.to(dstArray);
 
 
        // Obtain the platform IDs and initialize the context properties
        System.out.println("Obtaining platform...");
        cl_platform_id platforms[] = new cl_platform_id[1];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
 
        // Create an OpenCL context on a CPU device
        // (use CL_DEVICE_TYPE_GPU here instead to run on the GPU)
        cl_context context = clCreateContextFromType(
                contextProperties, CL_DEVICE_TYPE_CPU, null, null, null);
        if (context == null) {
            System.out.println("Unable to create a context");
            return;
        }
 
        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);
 
        // Get the size in bytes of the list of devices associated with the context
        clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, null, numBytes);
 
        // Obtain the cl_device_id for the first device
        int numDevices = (int) numBytes[0] / Sizeof.cl_device_id;
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes[0],
                Pointer.to(devices), null);
 
        // Create a command-queue
        cl_command_queue commandQueue =
                clCreateCommandQueue(context, devices[0], 0, null);
 
        // Allocate the memory objects for the input and output data.
        // The kernel reads and writes uint values, so sizes use Sizeof.cl_uint.
        cl_mem memObjects[] = new cl_mem[2];
        memObjects[0] = clCreateBuffer(context,
                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                (long) Sizeof.cl_uint * srcArrayA.length, srcA, null);
        memObjects[1] = clCreateBuffer(context,
                CL_MEM_READ_WRITE,
                (long) Sizeof.cl_uint * (srcArrayA.length/90000), null, null);
 
        // Create the program from the source code
        cl_program program = clCreateProgramWithSource(context,
                1, new String[]{programSource}, null, null);
 
        // Build the program
        clBuildProgram(program, 0, null, null, null, null);
 
        // Create the kernel
        cl_kernel kernel = clCreateKernel(program, "sampleKernel", null);
 
        // Set the arguments for the kernel
        clSetKernelArg(kernel, 0,
                Sizeof.cl_mem, Pointer.to(memObjects[0]));
        clSetKernelArg(kernel, 1,
                Sizeof.cl_mem, Pointer.to(memObjects[1]));
 
        // Set the work-item dimensions
        long local_work_size[] = new long[]{1};
        long global_work_size[] = new long[]{(srcArrayA.length/90000)*local_work_size[0]};
 
 
        // Execute the kernel
        clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
                global_work_size, local_work_size, 0, null, null);
 
        // Read the output data
        clEnqueueReadBuffer(commandQueue, memObjects[1], CL_TRUE, 0,
                (long) (srcArrayA.length/90000) * Sizeof.cl_uint, dst, 0, null, null);
 
        // Release memory objects, kernel, program, command queue, and context
        clReleaseMemObject(memObjects[0]);
        clReleaseMemObject(memObjects[1]);
        clReleaseKernel(kernel);
        clReleaseProgram(program);
        clReleaseCommandQueue(commandQueue);
        clReleaseContext(context);
 
 
        long after = System.nanoTime();
 
        System.out.println("Time: " + (after - before) / 1e9);
 
    }
}


At the moment, the sequential code and the JOCL code running in parallel on the CPU take almost the same time, though the parallel version is a bit slower. Running it on the GPU is a lot slower.
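
For reference, a sequential baseline for this kind of per-image sum could look roughly like the sketch below. The class name PerImageSum and the method sumPerImage are hypothetical and only for illustration; they are not part of JOCL or of the program above. For the images described here, pixelsPerImage would be 90000 (300 x 300).

```java
// Hypothetical helper for illustration; not part of JOCL or the program above.
final class PerImageSum {

    // Sums each consecutive block of pixelsPerImage values into one output entry,
    // mirroring what a single work-item of the kernel does for its image.
    static int[] sumPerImage(int[] pixels, int pixelsPerImage) {
        if (pixelsPerImage <= 0 || pixels.length % pixelsPerImage != 0) {
            throw new IllegalArgumentException(
                    "pixels.length must be a positive multiple of pixelsPerImage");
        }
        int[] sums = new int[pixels.length / pixelsPerImage];
        for (int image = 0; image < sums.length; image++) {
            int start = image * pixelsPerImage;
            int sum = 0;
            for (int j = start; j < start + pixelsPerImage; j++) {
                sum += pixels[j];
            }
            sums[image] = sum;
        }
        return sums;
    }
}
```

Calling PerImageSum.sumPerImage(srcArrayA, 90000) on the same input gives the values that dstArray should contain, which also makes it usable as a correctness check for the kernel output.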

So my question is: is there a way to speed up this code some more?

My specs:
Graphics: AMD Radeon HD 6490M, 256 MB
Processor: Intel(R) Core(TM) i7-2635QM CPU @ 2.00GHz

Regards