
Thread: confusion compute unit/stream proc/warp/work gp/threads

  1. #1
    Junior Member
    Join Date
    Mar 2010
    Posts
    3

    confusion compute unit/stream proc/warp/work gp/threads

    ok, i am just getting into OpenCL and there is a lot of confusion i am having about translating between hardware and software groupings.

    i have a GTX 285 and it has 240 CUDA streaming processors. But when i run the device query program in the NVIDIA GPU Computing SDK for OpenCL it shows

    CL_DEVICE_MAX_COMPUTE_UNITS: 30
    CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
    CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
    CL_NV_DEVICE_WARP_SIZE: 32

    1. So why is OpenCL showing 30 compute units when my card has 240 streaming processors? I guess a compute unit is not a streaming processor? So then what is a compute unit?

    2. The max work-group size is 512, so it means a work group can have at most 512 threads/work-items? But can i have any number of work groups of any dimension?

    3. A work group is obviously a logical abstraction, so does a work group span multiple stream processors? e.g. a work group having more than a warp's worth of work-items

    4. what is the logic of the 16 kB of local memory assigned to a work group? if a work group is logical and not hardware, how can it get a fast 16 kB of local memory from a streaming processor? (this one drives me nuts)

    5. can 2 CPU threads have their own kernels, doing the same task in parallel? obviously their data is local to them, so there is no need to synch them.

    6. can i pre-allocate a 1D array of size n for a kernel at program load and use this same kernel, but a different instance, for each CPU thread?

    i know, too many questions, but a noob's gotta know all this before getting his hands dirty, or stick to SSE

  2. #2
    Senior Member
    Join Date
    Jul 2009
    Location
    Northern Europe
    Posts
    311

    Re: confusion compute unit/stream proc/warp/work gp/threads

    1. A streaming multiprocessor (compute unit) has 8 streaming processors. Therefore 30*8=240.
    2. Yes, as long as the product of the work-group dimensions is <= 512 (and the kernel doesn't need too many registers).
    3. Yes, it can have more than a warp, but no, work-groups do not span multiple compute units. A warp uses multiple stream processors already.
    4. A work group executes entirely on the same compute unit. The compute unit physically has only 16 kB of local memory, so for whatever work-group size you choose, the group can only access 16 kB of shared memory.
    5. Yes, if you have a global-size of 2 and a work-group size of 1, you will get one thread on each CPU.
    6. Not sure I really follow here. If you run a kernel on the CPU device, it will run across all CPU cores in the same way a kernel on the GPU runs across all GPU stream processors at once. You only need one kernel, and it will be run data-parallel on all cores. This is why you only see one CPU device on machines with multiple cores.
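    Point 5 above (global size 2, work-group size 1 giving one thread per core) is just integer division of the global size by the work-group size. A minimal C sketch of that arithmetic; the function name and the second example are mine, not from the SDK:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Number of work-groups the runtime dispatches in one dimension.
     * OpenCL 1.x requires the global size to be a multiple of the
     * work-group size; each group lands on one compute unit (one CPU
     * core, for the CPU device). */
    unsigned num_work_groups(unsigned global_size, unsigned local_size)
    {
        assert(local_size != 0 && global_size % local_size == 0);
        return global_size / local_size;
    }

    int main(void)
    {
        /* Global size 2, work-group size 1: two groups, one per core
         * on a dual-core machine. */
        printf("%u\n", num_work_groups(2, 1));
        /* A GPU-style launch: 7680 items in groups of 512 -> 15 groups. */
        printf("%u\n", num_work_groups(7680, 512));
        return 0;
    }
    ```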

  3. #3

    Re: confusion compute unit/stream proc/warp/work gp/threads

    Quote Originally Posted by Dr.Synth
    ok, i am just getting into OpenCL and there is a lot of confusion i am having about translating between hardware and software groupings.

    i have a GTX 285 and it has 240 CUDA streaming processors. But when i run the device query program in the NVIDIA GPU Computing SDK for OpenCL it shows

    CL_DEVICE_MAX_COMPUTE_UNITS: 30
    CL_DEVICE_MAX_WORK_ITEM_SIZES: 512 / 512 / 64
    CL_DEVICE_MAX_WORK_GROUP_SIZE: 512
    CL_NV_DEVICE_WARP_SIZE: 32
    Quote Originally Posted by Dr.Synth
    1. So why is OpenCL showing 30 compute units when my card has 240 streaming processors? I guess a compute unit is not a streaming processor? So then what is a compute unit?
    A 'compute unit' is a lump of hardware that executes 'work groups'. A work group is, as you know, a collection of 'work items'. On NVIDIA's hardware, each work item is mapped to a 'CUDA thread'. A thread executes on a streaming processor (SP), and the collection of SPs that handles all the threads for a work group is called a 'streaming multiprocessor' (SM). This is NVIDIA-speak for 'compute unit'. Remember that several threads can be assigned to the same SP, so a single SM can handle work groups larger than the number of SPs it contains. Or rather, each SM can hold several work groups; at each clock cycle (or every fourth, or whatever) it selects one of the work groups it holds, then a set of work items (threads) from that group, and finally lets its SPs execute instructions for those selected work items. The NVIDIA programming guide describes this rather elegantly in terms of 'warps'.

    A quick division gives that, in your case, there are #SPs / CL_DEVICE_MAX_COMPUTE_UNITS = 240/30 = 8 SPs per SM, which can be verified in the specifications for your card.

    Quote Originally Posted by Dr.Synth
    2. The max work-group size is 512, so it means a work group can have at most 512 threads/work-items? But can i have any number of work groups of any dimension?
    It means that you can have at most 512 work items per work group, organized according to the limits imposed by CL_DEVICE_MAX_WORK_ITEM_SIZES. So in your case you can have work groups that are 1x512 or 2x128 or 4x8x16, but not 8x6x16 (8*6*16 = 768 > 512) and not 1x1x512 (Z dimension > 64). The global work size, a multiple of your chosen group size, can be as large as your problem requires. At least that is the intention. I believe NVIDIA has a limit on that as well; can someone confirm this?
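    Those two constraints (per-dimension limits plus the total product limit) can be checked mechanically. A small C sketch using the GTX 285 limits quoted above; the helper name is mine, and a real program would read these limits from clGetDeviceInfo rather than hard-coding them:

    ```c
    #include <assert.h>

    /* Limits reported for the GTX 285 in the post above. */
    #define MAX_WORK_GROUP_SIZE 512u
    static const unsigned max_item_sizes[3] = { 512u, 512u, 64u };

    /* Hypothetical helper: is a 3D work-group shape legal on this device?
     * Use 1 for unused dimensions. */
    int group_shape_ok(unsigned x, unsigned y, unsigned z)
    {
        if (x > max_item_sizes[0] || y > max_item_sizes[1] || z > max_item_sizes[2])
            return 0;                             /* per-dimension limit exceeded */
        return x * y * z <= MAX_WORK_GROUP_SIZE;  /* total work-item limit */
    }

    int main(void)
    {
        assert(group_shape_ok(1, 512, 1));   /* 1x512        -> ok */
        assert(group_shape_ok(4, 8, 16));    /* 4x8x16 = 512 -> ok */
        assert(!group_shape_ok(8, 6, 16));   /* 768 > 512    -> no */
        assert(!group_shape_ok(1, 1, 512));  /* Z dim > 64   -> no */
        return 0;
    }
    ```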


    Quote Originally Posted by Dr.Synth
    3. A work group is obviously a logical abstraction, so does a work group span multiple stream processors? e.g. a work group having more than a warp's worth of work-items
    This question is a bit oddly formulated, and the answer is hinted at in my answer to 1. above. The quick answer is no: a work group gets assigned to an SM when dispatched and stays on that SM until all its work items have finished. The oddity is the second part of your question. A work group can certainly contain more work items than the warp size. The effect is that, as the hardware steps one warp at a time, threads from different warps will progress differently, or out of sync, through the instruction stream (the .cl source). This is one of the reasons for the barrier() intrinsic.
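    The warp count for a given group size is just a ceiling division by the warp size. A C sketch of that arithmetic, using the CL_NV_DEVICE_WARP_SIZE of 32 from the query above (the function name is mine):

    ```c
    #include <assert.h>

    #define WARP_SIZE 32u  /* CL_NV_DEVICE_WARP_SIZE from the device query */

    /* Warps needed to cover one work-group: ceiling division by the warp
     * size. Any group larger than one warp runs as several warps that can
     * drift out of sync between barrier() calls. */
    unsigned warps_per_group(unsigned group_size)
    {
        return (group_size + WARP_SIZE - 1) / WARP_SIZE;
    }

    int main(void)
    {
        assert(warps_per_group(512) == 16); /* a full-size group          */
        assert(warps_per_group(33) == 2);   /* one extra, mostly idle warp */
        assert(warps_per_group(32) == 1);   /* one warp: implicit lock-step */
        return 0;
    }
    ```

    The last case is why some CUDA code of this era skipped barriers for groups no larger than a warp, though relying on that is fragile.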

    Quote Originally Posted by Dr.Synth
    4. what is the logic of the 16 kB of local memory assigned to a work group? if a work group is logical and not hardware, how can it get a fast 16 kB of local memory from a streaming processor? (this one drives me nuts)
    A side effect of the fixed assignment of a work group to a specific SM is that local memory can be physically implemented inside the SM. Each work group is associated with a fixed amount of resources (registers, local memory) required for executing that group. When the kernel is launched, each SM accepts as many work groups as the hardware has resources for and DEDICATES those resources to each group. Therefore, each work group gets access to a subset of the 16 kB of high-performance (SRAM?) local memory available on its housing SM.
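    That "as many work groups as the hardware has resources for" can be sketched in C. This considers only local memory, ignoring the register cap, and assumes GT200's limit of 8 resident blocks per SM (a CUDA hardware detail I am adding, not something stated in this thread):

    ```c
    #include <assert.h>

    #define LOCAL_MEM_PER_SM (16u * 1024u)  /* 16 kB per compute unit      */
    #define MAX_RESIDENT_GROUPS 8u          /* assumed GT200 per-SM cap    */

    /* Hypothetical sketch: how many work-groups one SM can host at once,
     * limited by local memory (registers impose a similar, separate cap). */
    unsigned groups_per_sm(unsigned local_bytes_per_group)
    {
        if (local_bytes_per_group == 0)
            return MAX_RESIDENT_GROUPS;     /* only the hardware cap applies */
        unsigned n = LOCAL_MEM_PER_SM / local_bytes_per_group;
        return n < MAX_RESIDENT_GROUPS ? n : MAX_RESIDENT_GROUPS;
    }

    int main(void)
    {
        assert(groups_per_sm(16u * 1024u) == 1); /* one group takes it all */
        assert(groups_per_sm(4u * 1024u) == 4);  /* four 4 kB groups fit   */
        assert(groups_per_sm(1u * 1024u) == 8);  /* hardware cap kicks in  */
        return 0;
    }
    ```

    This is the trade-off usually called occupancy: the more local memory a kernel asks for per group, the fewer groups each SM can hold to hide memory latency.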

    Quote Originally Posted by Dr.Synth
    5. can 2 CPU threads have their own kernels, doing the same task in parallel? obviously their data is local to them, so there is no need to synch them.
    This I cannot answer. See this thread for a discussion of what I believe to be the same question:
    http://forums.nvidia.com/index.php?showtopic=163678

    Quote Originally Posted by Dr.Synth
    6. can i pre-allocate a 1D array of size n for a kernel at program load and use this same kernel, but a different instance, for each CPU thread?
    This I don't understand either. Arrays are not allocated on kernels, but in 'contexts', and in the end on 'devices'. You can launch the same kernel twice, giving it the same buffer as a parameter, but on current NVIDIA hardware the launches will just run serially (the same kernel twice, in this case). On Fermi-like architectures, where several kernel invocations can run concurrently, you would probably get undefined behavior.

    I don't think I understand your question. What are you trying to achieve?

  4. #4
    Junior Member
    Join Date
    Mar 2010
    Posts
    3

    Re: confusion compute unit/stream proc/warp/work gp/threads

    first of all, a big thanks to dbs2 & ibbles. a lot of my doubts are now cleared. it feels like i can breathe some air.

    - so basically a compute unit is a multiprocessor that contains 8 stream processors, and on these stream processors you can have more than one thread running

    - a logical work group binds to a hardware multiprocessor and never spans more than one multiprocessor, even if it has more threads than stream processors

    - a work group can have more threads than a warp, but internally the multiprocessor works on one warp at a time per clock cycle?

    about questions 5+6: what i wanted to know was if i can run concurrent kernels of the same type on the gpu.
    @ibbles => what i am trying to achieve is this: my project is connected to multiple cameras, and each camera is processed by a different cpu thread. sometimes i need to do some ops on a frame, so i send it to the gpu. but the same op is also done by the other thread on a different camera frame. so my question was if a kernel can run concurrently on the gpu, and i guess your answer was no.

    thanks again!

  5. #5

    Re: confusion compute unit/stream proc/warp/work gp/threads

    Quote Originally Posted by Dr.Synth
    first of all, a big thanks to dbs2 & ibbles. a lot of my doubts are now cleared. it feels like i can breathe some air.
    Quote Originally Posted by Dr.Synth
    - so basically a compute unit is a multiprocessor that contains 8 stream processors, and on these stream processors you can have more than one thread running
    In your case, as with most NVIDIA GPUs of that generation, there are 8 stream processors per multiprocessor. But there is nothing saying that there will always be eight of them, or even stream processors at all; they are an implementation detail of NVIDIA's OpenCL implementation. It may work differently on ATI cards or ordinary processors.

    Quote Originally Posted by Dr.Synth
    about questions 5+6: what i wanted to know was if i can run concurrent kernels of the same type on the gpu.
    @ibbles => what i am trying to achieve is this: my project is connected to multiple cameras, and each camera is processed by a different cpu thread. sometimes i need to do some ops on a frame, so i send it to the gpu. but the same op is also done by the other thread on a different camera frame. so my question was if a kernel can run concurrently on the gpu, and i guess your answer was no.
    Yes, your guess is correct: kernel launches will not run concurrently, at least for the time being.

