Should I take the caches of a single core into account?

The input data is two 3D matrices, each containing 16x256x16 elements.
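To put a number on the working set, here is a rough sketch of the footprint arithmetic. The element type is not stated above, so the 4-byte `float` is an assumption, and `footprint_bytes` and `kTotalBytes` are hypothetical names used only for illustration.

```cpp
#include <cstddef>

// Bytes occupied by `count` dense 3D arrays of nx * ny * nz elements.
// elem_size is an assumption: the element type is not given in the post.
constexpr std::size_t footprint_bytes(std::size_t count,
                                      std::size_t nx,
                                      std::size_t ny,
                                      std::size_t nz,
                                      std::size_t elem_size) {
    return count * nx * ny * nz * elem_size;
}

// Two 16x256x16 matrices, assuming float elements.
constexpr std::size_t kTotalBytes =
    footprint_bytes(2, 16, 256, 16, sizeof(float));
```

With 4-byte elements each matrix holds 65536 elements (256 KiB), so the two together come to 524288 bytes (512 KiB). Whether that fits in cache depends on the cache sizes asked about below.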

When a core accesses the data, it does so slowly.

So I guess I am causing a lot of cache misses.

Where can I find information about the L1 and L2 cache sizes of a graphics card?

I'm using NVIDIA's GeForce 9400 GT: http://www.geforce.com/hardware/desk...specifications

The spec does not contain this information.