I'm fairly new to heterogeneous computing and I'm trying to understand its general hierarchy, given that the terminology varies between manufacturers.

- So a many-core GPU has many cores, and these cores are also known as compute units and/or streaming multiprocessors?

- Each compute unit has a series of SIMD engines, with registers/L1 caches specific to each SIMD engine, and each compute unit has an L2 cache...?

- Each compute unit executes a work-group/thread block. A thread block/work-group can be a 1D, 2D, or even 3D collection of work-items/threads, and can belong to a 1D, 2D, or even 3D grid of thread blocks/work-groups...? But depending on the processor, only a certain number of threads can be executed in parallel at any given moment. For NVIDIA CUDA this is 32 threads (warp size), and for AMD devices it is 64 threads (wavefront size)...?
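
To make that last point concrete, here is a minimal plain-Python sketch of the arithmetic as I understand it (not actual CUDA or OpenCL code). The function names and the 16x16 block size are hypothetical, chosen only for illustration. The x-fastest flattening matches how CUDA documents thread IDs within a block; I'm assuming other vendors group work-items into wavefronts in a similar way.

```python
import math

def linear_thread_id(tx, ty, tz, block_dim):
    """Flatten a 3D thread index within a block, with x varying fastest."""
    bx, by, _bz = block_dim
    return tx + ty * bx + tz * bx * by

def warps_per_block(block_dim, warp_size):
    """Number of warps/wavefronts needed to cover every thread in one block."""
    bx, by, bz = block_dim
    return math.ceil(bx * by * bz / warp_size)

# Hypothetical 2D work-group / thread block of 16 x 16 = 256 work-items.
block_dim = (16, 16, 1)

print(warps_per_block(block_dim, 32))  # warp size 32 -> 8
print(warps_per_block(block_dim, 64))  # wavefront size 64 -> 4

# Thread (5, 2, 0) has linear ID 5 + 2*16 = 37, so it lands in warp 37 // 32.
print(linear_thread_id(5, 2, 0, block_dim) // 32)  # -> 1
```

So, if I have this right, a 256-thread block is executed as 8 groups of 32 lock-step threads on one kind of device and as 4 groups of 64 on the other.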

Now my big question concerns Intel's latest Ivy Bridge processors and their integrated graphics processing unit, Intel HD Graphics 4000. This GPU boasts 16 execution units (8 threads per execution unit).

How does this translate to the above model? Is an execution unit the same as a compute unit or is it a collection of compute units? How does this work?