
Thread: Vector Vector addition

  1. #1
    Newbie
    Join Date
    Aug 2013
    Posts
    3

    Vector Vector addition

    Hi,

I have a basic question; I'm not sure whether this is the right place to post it.
Why might the following be true?

The forum isn't letting me keep the URL/image, so please add http:// in front:

    Code :
    i41.tinypic.com/24dkne0.png

I think it's memory, but I'm not sure how to explain it in detail.
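Since I can't attach the actual code either, here is a rough sketch of the kind of multithreaded vector-vector addition I'm timing (the vector length and thread counts are placeholders; my real code may differ):

Code :
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1u << 24;  // placeholder vector length
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // Repeat the same element-wise addition with 1, 2, ... threads
    // and record how the runtime changes.
    for (unsigned threads = 1; threads <= 16; ++threads) {
        auto t0 = std::chrono::steady_clock::now();

        std::vector<std::thread> pool;
        for (unsigned t = 0; t < threads; ++t) {
            pool.emplace_back([&, t] {
                // Each thread adds its own contiguous chunk of the vectors.
                std::size_t begin = n * t / threads;
                std::size_t end   = n * (t + 1) / threads;
                for (std::size_t i = begin; i < end; ++i)
                    c[i] = a[i] + b[i];
            });
        }
        for (auto& th : pool) th.join();

        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        std::printf("%2u threads: %.4f s\n", threads, dt.count());
    }
}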

    Thanks.
    Jennifer

  2. #2
    Senior Member
    Join Date
    Oct 2012
    Posts
    115
In this simple case, memory bandwidth is most probably the limiting factor.

Let's say, for example, that you have memory with a peak bandwidth of 12 GB/s and an Intel i7 processor clocked at 2.40 GHz.

    For each addition, the memory will have to access 3 (single-precision) float values (2 reads and 1 write), that is 12 bytes.
    Memory thus limits the number of additions to 12/12=1 Goperation/s.

    On the CPU side, let's suppose that the for loop has been unrolled so that we can neglect index incrementing and conditional branching of the for loop.
    The only instructions for an add operation are two loads from memory to registers, one addition and a store from register to memory. Supposing that memory accesses are done to and from the L1 cache and neglecting latency, a full operation needs:

    - 1st load: 1 cycle for the instruction + 1 cycle for memory throughput
    - 2nd load: idem
    - floating-point addition: 2 cycles
    - store: 1 cycle for the instruction + 1 cycle for memory throughput

    So a full add operation will take 8 cycles, and the maximum number of operations is 2.40/8=0.3 Goperation/s.

(Of course, this is a very coarse estimate, but it is enough to highlight the origin of the limitation.)

So if four threads are computing concurrently, they will execute 1.2 Goperation/s, which already exceeds the number of operations the memory can sustain (1 Gop/s). The load/store instructions will stall, waiting for data from/to memory.
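If you want to play with these numbers, here is a tiny sketch of the same back-of-the-envelope arithmetic (the 12 GB/s bandwidth and 2.40 GHz clock are the assumed figures above, not measurements):

Code :
#include <algorithm>
#include <cstdio>

int main() {
    const double bandwidth_gbs  = 12.0;  // assumed peak memory bandwidth (GB/s)
    const double clock_ghz      = 2.40;  // assumed CPU clock (GHz)
    const double bytes_per_add  = 12.0;  // 2 float reads + 1 float write
    const double cycles_per_add = 8.0;   // 2+2+2+2 cycles per addition, as counted above

    const double memory_bound_gops = bandwidth_gbs / bytes_per_add;  // 1.0 Goperation/s
    const double per_core_gops     = clock_ghz / cycles_per_add;     // 0.3 Goperation/s

    for (int threads = 1; threads <= 8; ++threads) {
        double cpu_bound = threads * per_core_gops;
        std::printf("%d thread(s): CPU can issue %.2f Gop/s, memory sustains %.2f Gop/s\n",
                    threads, cpu_bound, std::min(cpu_bound, memory_bound_gops));
    }
}

From four threads on, the sustained rate is pinned at the memory limit of 1 Gop/s, no matter how many additions the cores could issue.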

  3. #3
    Newbie
    Join Date
    Aug 2013
    Posts
    3
    Quote Originally Posted by utnapishtim View Post

Thanks, this is very helpful.
As I am a beginner to this kind of computing, I was thinking along the following lines. Please advise.

"This might be related to caching: in the beginning the processor doesn't have any information about the computation; then, after a few computations, the previous results are stored in the cache, and the cache is reused for future/parallel calculations, which decreases the time needed to retrieve the output.
The cache is used for recalculation and parallel computation, which decreases the time the processor takes to recalculate if the output is already available. So, once it goes beyond 10, the time didn't continue to decrease."

I'm not sure whether both of us are coming to the same conclusion.

Thanks for your reply!

  4. #4
    Senior Member
    Join Date
    Oct 2012
    Posts
    115
Modern CPUs have prefetchers that can detect sequential accesses like the ones here and efficiently fill the cache before the computations take place.
The problem here is that the computation is so simple and fast that the memory cannot fill the cache quickly enough to sustain the rate at which the FP computations are done.

    Cache can improve access latency but it won't improve RAM throughput.
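One way to see this in practice is to turn a measured runtime into a sustained-bandwidth figure and compare it with the DRAM peak. A sketch, with the element count and timing as placeholders to be replaced by real measurements:

Code :
#include <cstdio>

int main() {
    const double n       = 16.0 * 1024 * 1024;  // placeholder: number of elements
    const double seconds = 0.0175;              // placeholder: measured time for one pass
    const double bytes   = 12.0 * n;            // 2 float reads + 1 float write per element

    std::printf("sustained bandwidth: %.2f GB/s\n", bytes / seconds / 1e9);
    // If this figure is already close to the DRAM peak (e.g. ~12 GB/s),
    // adding threads cannot help: the cache hides latency, but every
    // element still has to cross the memory bus once.
}

If that number already sits near the peak, the loop is bandwidth-bound, and neither more threads nor better caching will speed it up.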

  5. #5
    Newbie
    Join Date
    Aug 2013
    Posts
    3
    Quote Originally Posted by utnapishtim View Post
Thanks, you are extremely helpful and I appreciate it!
