It would be useful if the spec for clEnqueueNDKernel had some wording to the following effect:
For all implementations of clEnqueueNDKernel, a call that creates N workgroups
will ensure that workgroup K is launched either before or contemporaneously
with workgroup J for all K < J. (Alternatively, one might say that the launch
of workgroup K must not depend on the completion of the workitems in workgroup
J if K < J. This is more specific and sufficient for the purposes described
Why is this useful? For collective or reduction operators, we have a chance of
using atomic operations (perhaps in the future, perhaps now...) to perform the
global reduction step. This is particularly useful in things like the
substitution step of triangular systems where the inter-workgroup dependency on
the solution vector need not delay the fetch of a workgroup's chunk of the
matrix. Workgroup J could test (poll) a location in __global memory for a
signal from Workgroup K. (Of course, the polling interval needs to be tuned
to avoid screwing up the memory traffic for other threads.)
This does not break the fundamental assumption of OpenCL that no concurrency
is guaranteed outside the workgroup: a compliant implementation could launch
all workgroups sequentially.
My guess is that all implementations are already compliant with this
restriction, but without having it in the spec, application writers should
not rely on the feature.
I do not believe the Apple OpenCL CPU implementation will do them in order. Even if they started in order, any device capable of preemption would certainly not be able to guarantee that they finish in order, at least without a substantial reduction in performance. If you want them to start in order, you can assign your own workgroup and workitem IDs using an atomic counter, and ignore the one OpenCL hands you.
Generally speaking, data dependencies are best modeled around kernel launch boundaries -- All workgroups in a launch must be able to run concurrently without interfering with one another. So if for example you need the computations above and to the left of you to finish before you can do your bit, you might enqueue a separate kernel for each diagonal stripe.
I am not actually requiring that kernels finish in order, only that the launch of kernel K < J does not require that J finish before K launches. This in fact, doesn't seem like all that onerous a requirement: Ian Ollmann's comment even lays out the alternative solution, though it requires some rather awkward
coordination on the part of the application. In any case, I can't imagine that there is an implementation out there that doesn't allocate groupIDs sequentially as the tasks are placed into the run queues. Did I miss something?
*** see below
Generally speaking, data dependencies are best modeled around kernel launch
boundaries -- All workgroups in a launch must be able to run concurrently
without interfering with one another. So if for example you need the
computations above and to the left of you to finish before you can do your
bit, you might enqueue a separate kernel for each diagonal stripe.
As for pushing this back to the host runtime scheduler, that doesn't allow overlap of the fetch for the matrix among kernels. This is a potential lose: note that the solution to the system is entirely memory bound: the reuse in the matrix is nearly zero. So what I really want to do is launch as many threads as it takes to saturate the memory system, then let them wait for the single input
on which they depend. If I know something about their order of launch, then I can actually construct the dependence resolution chain with atomic ops. And note that since the dependence chain propagates in one direction only, there is no dependence on simultaneity, just a simple ordering.
And of course, when I say K < J, I really mean that K and J should be viewed as ND specifications.
In any case, the OpenCL specification should say <something> about ordering of the tasks in a launch -- if you don't people will use the implementations as a spec. Even if the spec says "you can't assume anything about the launch order
of threads vs. their global, local or group ids, beyond the guarantee that all threads in a workgroup are contemporaneous."
*** Not to wail the tar out of a dead horse, but imagine a scheme where
clEnqueueNDKernel dumped work groups into P parallel execution queues. Imagine
further that the queues are completely decoupled, sharing only the global
memory of the target device. Therefore, if kernel workgroup K was placed
in queue 0 and J was in queue 1, J might well launch before K. But note that
K will launch at sometime in the future regardless of J's state of completion.
So, my more particular requirement is satisfied. The only additional
rule I would impose here is that if K and J were dumped into the SAME queue,
the implementation must ensure that K could launch without requiring J to
complete if K < J. (But note, no such guarantee applies to J vs. K -- if
K does not terminate, J may never issue.)
> I can't imagine that there is an implementation out there that doesn't allocate groupIDs
> sequentially as the tasks are placed into the run queues.
Imagine no more! Apple's CPU implementation aggregates multiple consecutive workgroups into a uberworkgroup. It then doles the uberworkgroups out to various cores as they become available. Therefore the uberworkgroups start in order, but the individual workgroups do not. If you put a printf() in there to print out the current workgroup id, you'll see something like this
It is far better to aggregate in this way than to make specific guarantees about the launch time of specific workgroups, because it is assumed that workgroups access memory in order and you want a single core to be walking linearly through memory.
Khronos should at least document this feature so we don't assume the wrong behavior.
I'll find an alternate solution.
There are numerous other areas of the spec that are fuzzy or mute. This is
going to cause problems for future implementations if they aren't cleared up.
Thanks for saving me some time. Now lets figure out how to get somebody
to fix the spec.
Why don't you propose specific, actionable fixes to language in the spec.
I will chime in as well. I do not think we should require that "workgroup K is launched either before or contemporaneously with workgroup J for all K < J." We want device to schedule work-groups as efficiently as possible. Adding this requirement IMO can impact the overall performance of the kernel being executed. And I believe you can handle this in the kernel code. For example you do not have to use the global IDs given by get_global_id. You can generate you own global IDs using atomic functions to meet this requirement.
Matt, can you explain why you cannot do this in the kernel itself?
re: particular language (with reference to 1.0.48 section 5.6)
There are THREE points that need to be made:
1. Workitems within a single work group appear to the programmer
to execute concurrently.
2. Workitems from different workgroups may or may not appear to the
programmer as concurrent.
3. Subject to the requirements of points #1 and #2, a compliant
implementation my initiate execution of a particular workgroup
in a set of workgroups in any order without regard to workgroup
I'd need help in figuring out how to make those three points in
language that is consistent with the rest of the specification. I'll
work on it some more in the next couple of days and would appreciate
pointers... The key issue is the introduction of a notion of time, sequence,
and concurrence. These same concepts are at the heart of my concerns
around the description of cl_mem objects and the related operations.
However, it is clear that in section 5.6 "Executing Kernels" the
following paragraph should be deleted.
These work-group instances are executed in parallel across multiple compute units or concurrently on the same compute unit.
< As it is incorrect -- the work-group instances may also be executed
sequentially across multiple or a single compute unit >
re: Aaftab's question -- why not do the assignment in the kernel itself?
It can be done. This means that the dependence chain will require TWO global
atomic ops -- one to assign the sequential group id, one to check the "go bit".
Right now, I have no idea what the performance of the atomic ops will be. In
the original design I was working on, the atomic op (the go bit) was a
SET on the producer side and a TEST on the consumer side where I could create
multiple copies of the GO bit to limit contention for a single lock.
With the less restrictive assumption on temporal relationships for group
ids I need to either build a hierarchical counter (to limit contention)
or eat the 20-way contention I'm likely to see at the start of each
group launch. (Groups can't start work until they know their group
ID so at least one thread from each group will bang on the counter-lock
at workgroup launch time while everybody else waits for the pileup to
sort itself out... sigh.)
> Workitems within a single work group appear to the programmer to execute concurrently.
Concurrently is probably too strong a word to use here. Imagine for a moment, a compute unit built out of a 8-wide vector unit that has a workgroup size of 32. Obviously in order to do a 32-wide work group on an 8-wide vector unit, we have to split each operation up into four instructions. So, at minimum the work is not happening atomically. Lets also guess that rather than simply unrolling by four and doing it all in a single instruction stream, we use four way SMT on that compute unit to handle the widening.
Alternatively, we'll imagine a AVX-equipped x86 compatible machine. (8-wide vector unit, but not enough registers to hold 4x state concurrently.) Here we might want to do our 4x expansion by simply wrapping short sequences of instructions up in a loop, and use the reorder buffer to generate parallelism. SMT might work here too, but we might worry that one or more of the threads would hop to another core, which would be bad for the caches.
With either hypothetical machine, it shouldn't be hard to see how some parts of the workgroup could "get ahead" of other parts of the work group. Indeed some parts of the results could be committed before other parts of the workgroup even got started. This should help illustrate why you need to issue barriers to synchronize data exchange through global or local memory, and why concurrently is much too strong a term here. "concurrently" would seem to imply that the work from each workitem is happening in lock step, which clearly it may not be, and probably isn't.
The only thing we can probably commit to here is that all the workitems will do their work during the time that the workgroup is live on the compute unit. Even then, there may be some cases, such as async_workgroup_copy without a terminating wait_group_events that might extend past the lifetime of the workgroup. Whether or not this can happen is not defined by the spec. Happily, there is a simple thing the user can do to get defined behavior, behavior her.
All we can assert here is that local memory is expensive to keep live -- there usually isn't much of it, and we'd like to use it for other work too. This should act to keep workgroup lifespans as short as possible and consequently should help make sure that all workitems execute with good temporal locality.
I understand the concern with the word "concurrently" -- The specification
needs to assert what it means as well. That's what makes writing specifications
both difficult and important.
I had originally used the word "contemporaneous" but that too has some
Among the folks I work with "concurrent" does not imply lock-step, but implies
"overlapping execution lifetimes" -- that is, if two tasks are not executed
concurrently, I cannot expect two-way communication between them to be
possible. Conversely, if I require two-way communication between two tasks,
I must ensure that they execute concurrently.
So, I believe your concerns are addressed by asserting a definition for
Concurrent Execution: Two tasks, process, procedures, or kernels (A and B)
are said to execute concurrently if and only if "A" begins executing before
"B" has terminated and "B" begins executing before "A" has terminated.
From this building block, you can begin to make assertions about workitems
within a workgroup. And from there you can start making assertions about
what is NOT guaranteed to be concurrent: such as multiple workgroups created
by a single call to clEnqueueNDKernel.
And note that this definition creates a pattern for talking about temporal
relationships between events. The word "before" can pretty much cover
all temporal requirements that you'd want to make. (e.g. "An update to
host memory buffer X made BEFORE a call to clEnqueueWriteBuffer(X -> gpu_X)
will be visible to a kernel if and only if the call to clEnqueueWriteBuffer
is made BEFORE the corresponding call to clEnqueueNDKernel." By the way,
is the correct statement "if and only if" or "if" ? Consider the possibility
of out-of-order execution.)
I believe it is the dearth of such statements in the spec that account
for the confusion I have encountered in some users over the lack of
synchronization primitives between threads in separate workgroups.
While we understand the execution model sufficiently to be able
to reason forward from the implied concurrent execution rules, some
new adopters of OpenCL will not. Not without some help.
This kind of demand for detail may sound obsessive, but OpenCL is not
just a project for compiler writers and run-time library developers in
three companies. Its consumers are well outside your circle and they
need a specification that describes commitments by the vendors (you)
to them (that is, "me"). The spec doesn't do that yet.
With that definition of concurrency, we can't guarantee that all workitems in a workgroup execute concurrently, unless a barrier() is used. Even here, there may be problems depending on how you defined "begins" and "terminated".
There is obviously room for a user manual that holds your hand a bit more and spends more time on broad concepts rather than a forest of devilish details. We've talked about it, but haven't done much due to our time being otherwise engaged largely delivering OpenCL.
A third party book was recently announced. I haven't had the time to read it yet, so can't comment on quality / coverage.
Given that even here, between just two of us (and I suspect we're both
reasonably sophisticated and experienced professionals) we've not been
able to easily define the commitment, I believe that this is a conclusive
and compelling argument that the work needs to be done.
So, how do we do that? What happens if you and I come up with something?
Who gets it next?
And note that this isn't an issue for a user manual and it surely should not
be left outside the architectural specification. This is a commitment that
the language and run-time developers must make to the application programmers.
Textbooks and user manuals have their places -- but there has to be an
authoritative foundation for them. That's what a language spec is for
and what it does.
Personally, I think there is no commitment that can be made in the area of (1). We can merely guide users to develop as if the workitems run at roughly the same time. They may run at the extremes of not at all the same time, or in lock step, or somewhere in between. Then guide them to use barrier, wait_group_events, or mem_fence as appropriate to ensure memory consistency when there is intra-workgroup data cross talk.
That is, I believe the nebulousness here is intended.