Reduction within one work item

Hi,

How would the more experienced devs approach this problem?

I want to calculate a financial math problem called "Ichimoku" on the GPU.

The actual problem boils down to:

- you have a price series array, let's say an array of 10,000 doubles, indexed 0 to 9,999

Calculating Ichimoku basically involves running the following task 2-3 times with different widths, plus a few minor challenges. The calculation for each array index is independent of the previous / next one, so the outer loop is perfectly parallel. The inner loop is a min/max reduction over the _X_ previous values:

Perfectly parallel outer loop:

- do the inner loop (kernel) for each array value, independently of the previous / next value

Inner loop:

(int) argument _X_ = 26

calculating the result of array index _I_ for width _X_:

- find the low of index _I_ to index (_I_ - _X_) = _LOW_

- find the high of index _I_ to index (_I_ - _X_) = _HIGH_

- result for _I_ = (_LOW_ + _HIGH_) / 2.0

So for _X_ = 26 and array index _I_ = 100:

- find the low of array[100] down to array[100-26] (inclusive)

- find the high of array[100] down to array[100-26] (inclusive)

- global result[100]= (low+high)/2.0

- of course, only calculate for index values > the _X_ argument
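The steps above can be sketched in plain C as a sequential reference. This is only a minimal sketch: the names `window_midpoint` and `midpoint_series` are made up for illustration, and it assumes the window runs from index _I_ - _X_ to index _I_, both ends inclusive, as in the definition of the inner loop.

```c
#include <stddef.h>

/* Hypothetical helper: midpoint of the lowest and highest price in the
   window prices[i - x] .. prices[i], both ends inclusive.
   Assumes i >= x so the window stays inside the array. */
static double window_midpoint(const double *prices, size_t i, size_t x)
{
    double low = prices[i];
    double high = prices[i];
    for (size_t k = 1; k <= x; ++k) {
        double v = prices[i - k];
        if (v < low)  low = v;
        if (v > high) high = v;
    }
    return (low + high) / 2.0;
}

/* Outer loop: every iteration is independent of the others, so on a GPU
   each value of i could be handled by its own work item.
   The loop starts at i = x, the first index whose window fits; start at
   x + 1 instead if indices must be strictly greater than x.
   Entries of result below the start index are left untouched. */
static void midpoint_series(const double *prices, double *result,
                            size_t n, size_t x)
{
    for (size_t i = x; i < n; ++i)
        result[i] = window_midpoint(prices, i, x);
}
```

A reference like this is mainly useful for checking whatever the GPU version produces against known-good values.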

I could simply write a kernel that gets invoked once per array element and calculates the high/low sequentially inside the kernel. I would gain over a traditional CPU implementation because I can run that kernel for every array value perfectly in parallel, but the main workload, the inner loop, would still be sequential.

How could I do a min/max reduction within the kernel? Launch array_size * _X_ work items and keep track of which work items are supposed to take part in a min/max local memory reduction at a given stage and which do nothing at the later stages?
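To make the scheme in question concrete, here is a minimal plain-C emulation of what a single work-group would do for one array index. It is a sketch under assumptions, not a tested kernel: `group_midpoint`, `next_pow2` and `MAX_GROUP` are hypothetical names, the two stack arrays stand in for local memory, the inner `lid` loops stand in for work items running in parallel, and the window is again taken as index _I_ - _X_ to index _I_ inclusive. The group size is rounded up to a power of two and the extra lanes repeat `prices[i]`, which cannot change the min or the max.

```c
#include <stddef.h>

#define MAX_GROUP 256  /* hypothetical upper bound on the work-group size */

/* Smallest power of two >= n (for n >= 1). */
static size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

/* Serial emulation of ONE work-group computing the result for index i.
   Returns 0 on success, -1 if the window or the group does not fit. */
static int group_midpoint(const double *prices, size_t i, size_t x,
                          double *out)
{
    double lmin[MAX_GROUP], lmax[MAX_GROUP]; /* stand-in for local memory */
    size_t group = next_pow2(x + 1);

    if (i < x || group > MAX_GROUP)
        return -1;

    /* Load stage: each work item lid fetches one value of the window.
       Padding lanes (lid > x) repeat prices[i]. */
    for (size_t lid = 0; lid < group; ++lid) {
        double v = (lid <= x) ? prices[i - lid] : prices[i];
        lmin[lid] = v;
        lmax[lid] = v;
    }
    /* On a GPU a local memory barrier would go here. */

    /* Tree reduction: the stride halves every stage. Only work items
       with lid < stride are active; the rest do nothing. */
    for (size_t stride = group / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid) {
            if (lmin[lid + stride] < lmin[lid]) lmin[lid] = lmin[lid + stride];
            if (lmax[lid + stride] > lmax[lid]) lmax[lid] = lmax[lid + stride];
        }
        /* On a GPU a barrier would separate the stages. */
    }

    *out = (lmin[0] + lmax[0]) / 2.0;
    return 0;
}
```

In this picture the bookkeeping of "who works at which stage" reduces to the single test `lid < stride`: for _X_ = 26 the group has 32 lanes and the reduction takes 5 stages, with 16, 8, 4, 2 and 1 active work items.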

Help is very much appreciated.