I have a solver written in FORTRAN that I would like to port to OpenCL. Inside an outer iteration loop, the solver contains several inter-dependent loops at the same nesting level, separated by serial operations (see pseudo code).
Code :
for( iter )
{
   serial operation first
   for( i ) { generates value A }
   serial operation second
   for( j ) { uses value A & generates value B }
   serial operation third
   for( k ) { uses value B }
   serial operation fourth
}
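
To make the dependencies concrete, here is a minimal serial C sketch of one pass through the outer loop. The function names and the arithmetic are made up for illustration, treating the k loop as a sum over B is an assumption, and the serial operations are left out. Only the chain "i produces A, j consumes A and produces B, k consumes B" comes from the real solver.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real solver loops: the arithmetic is
   invented, only the data dependencies mirror the pseudo code. */

/* i loop: generates A. Iterations are independent of each other. */
void loop_i(const double *in, double *A, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        A[i] = 2.0 * in[i];
}

/* j loop: uses A and generates B. Must not start before loop_i is done. */
void loop_j(const double *A, double *B, size_t n)
{
    for (size_t j = 0; j < n; ++j)
        B[j] = A[j] + 1.0;
}

/* k loop: uses B. Modelled here as a sum (an assumption). */
double loop_k(const double *B, size_t n)
{
    double sum = 0.0;
    for (size_t k = 0; k < n; ++k)
        sum += B[k];
    return sum;
}

/* One pass of the outer loop body, serial operations omitted. */
double solver_iteration(const double *in, double *A, double *B, size_t n)
{
    loop_i(in, A, n);
    loop_j(A, B, n);
    return loop_k(B, n);
}
```

Each inner loop would become one kernel, and A and B are the arrays I would like to avoid copying between the host and the device.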

Would it be better to create a kernel for the outer loop that performs the serial operations and calls separate kernels for the 3 inner loops (see pseudo code below), rather than pass data to and from the host at the start and end of each parallel section?

The serial operations are compute-light, but running them on the host would mean copying a large amount of data to and from the host around each parallel loop. I am assuming that the I/O overhead of these copies would be significantly greater than the cost of the serial operations, so keeping the data on the device would be more efficient.
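
As a rough back-of-the-envelope check of that assumption, the sketch below only counts bytes moved between host and device. It assumes, hypothetically, that in the copy-heavy variant every parallel section uploads and downloads the full data set, and that in the resident variant the data is uploaded once before the first iteration and downloaded once after the last. The function names are mine, and actual transfer times would depend on the hardware.

```c
/* Bytes moved when every parallel section copies the full data set
   to the device and back, in every outer iteration (assumed model). */
unsigned long long bytes_copy_every_section(unsigned long long data_bytes,
                                            unsigned long long iterations,
                                            unsigned long long sections)
{
    return 2ULL * data_bytes * sections * iterations;
}

/* Bytes moved when the data is uploaded once, stays on the device for
   all iterations, and is downloaded once at the end (assumed model). */
unsigned long long bytes_resident(unsigned long long data_bytes)
{
    return 2ULL * data_bytes;
}
```

Under this model, with 3 parallel sections and N outer iterations, the copy-heavy variant moves 3N times as much data as the resident one.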

Code :
kernel outer_loop(in_data, out_data)
{
    first_serial;
    call kernel parallel_i_loop(in_data);
    second_serial;
    call kernel parallel_j_loop(in_data);
    third_serial;
    call kernel parallel_k_loop(in_data);
    fourth_serial;
    transfer_to_host(out_data);
}

Or would it be better to run the outer loop on the host, calling the kernels for the inner loops and copying the data to and from the host several times?
Code :
host outer_loop(in_data, out_data)
{
    first;
    call kernel parallel_i_loop(in_data, out_data);
    in_data = out_data;
    second;
    call kernel parallel_j_loop(in_data, out_data);
    in_data = out_data;
    third;
    call kernel parallel_k_loop(in_data, out_data);
    in_data = out_data;
    fourth;
}

David