Post Transform Cache
The Post Transform Cache (sometimes called the "post-T&L cache") is a hardware feature that modern GPUs have to improve rendering performance. It is part of the rendering pipeline. It is a memory buffer containing vertex data that has passed through the vertex processing stage, but has not yet been converted into primitives.
Vertex processing with vertex shaders is a very strict process. A single set of Vertex Attributes enters the vertex shader, and a single set of post-transformed data comes out. For any particular rendering call, the output of this stage is based solely on the inputs (since Uniforms cannot change within a rendering call). Therefore, if the system can detect that you have passed the same vertex attribute inputs, it can avoid executing the vertex shader again: if the outputs for that input attribute set are in the cache, the cached data is used instead. This saves vertex processing.
In the absolute best case, you never have to process the same vertex more than once.
The test for whether a vertex is the same as a previous one is somewhat indirect. It would be impractical to compare all of the user-defined attributes for equality, so a different means is used.
Two vertices are considered equal (within a single rendering command) if their vertex index and instance number are the same (gl_VertexID and gl_InstanceID in the shader). Since the vertex index in non-indexed rendering always increases, no vertex is ever repeated, so it is not possible to use the post transform cache with non-indexed rendering.
If the vertex is in the post transform cache, then that vertex data is not necessarily even read from the input vertex arrays again. The process skips the read and vertex shader execution steps, and simply adds another copy of that vertex's post-transform data to the output stream.
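The lookup described above can be sketched as a small simulation. This is only an illustration: the function name `run_vertex_stage`, the FIFO replacement policy, and the placeholder "transformed" tuples are assumptions, since real hardware does not document its replacement policy.

```python
from collections import OrderedDict

def run_vertex_stage(indices, cache_size, instance_id=0):
    """Simulate vertex processing with a cache keyed by (index, instance)."""
    cache = OrderedDict()   # key -> post-transform data, oldest entry first
    shader_runs = 0
    output_stream = []
    for index in indices:
        key = (index, instance_id)
        if key in cache:
            # Cache hit: no attribute fetch, no vertex shader execution.
            result = cache[key]
        else:
            # Cache miss: stand-in for fetching attributes and running the shader.
            shader_runs += 1
            result = ("transformed", index, instance_id)
            cache[key] = result
            if len(cache) > cache_size:
                cache.popitem(last=False)  # assumed FIFO eviction
        output_stream.append(result)
    return shader_runs, output_stream

# Two triangles sharing an edge: six indices, four distinct vertices.
quad = [0, 1, 2, 2, 1, 3]
runs, stream = run_vertex_stage(quad, cache_size=4)
```

With a cache that holds all four vertices, the shader runs once per distinct vertex while the output stream still receives six entries.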
As with any memory buffer, there is a maximum size to the post transform cache. In the early days of post transform caches, when GPUs used fixed-function pipelines and not generic vertex attributes, the size of the cache was measured in the number of vertices it could store. Today, since the format of a vertex is generic and variable, the caches are more traditional memory buffers. Thus the number of vertices that can be stored in the post transform cache depends on how many outputs you write from your vertex shader.
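As a rough illustration of that dependency, the arithmetic might look like the following. The buffer size of 2048 bytes and the helper name `cache_capacity_in_vertices` are made up for the example; actual cache sizes are hardware-specific and generally not published.

```python
def cache_capacity_in_vertices(cache_bytes, output_components, bytes_per_component=4):
    """Vertices that fit in a cache of cache_bytes, given the shader's output size."""
    bytes_per_vertex = output_components * bytes_per_component
    return cache_bytes // bytes_per_vertex

HYPOTHETICAL_CACHE_BYTES = 2048  # assumption, not a real hardware figure

# Shader writing a vec4 position, vec3 normal and vec2 texcoord: 9 floats.
lean_shader = cache_capacity_in_vertices(HYPOTHETICAL_CACHE_BYTES, 9)
# Shader writing 16 floats of outputs per vertex.
heavy_shader = cache_capacity_in_vertices(HYPOTHETICAL_CACHE_BYTES, 16)
```

Under these assumed numbers, the leaner shader fits noticeably more vertices in the same buffer than the heavier one.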
Even so, you can expect vertex shader-based hardware to allow for a fairly large number of vertices, on the order of 20+ at least.
The size of the post transform cache can have an effect on how you optimize your triangles. If you optimize your mesh for a large number of vertices in the cache, this mesh may get poor caching behavior if the cache cannot contain as many vertices. However, some mesh optimization algorithms can work without regard to the cache size.
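A small experiment can make this concrete. The sketch below measures the average number of vertex shader runs per triangle for a grid drawn row by row, an ordering that only reuses the previous row's vertices if the cache can hold a whole row. The FIFO cache model and the names `acmr` and `grid_indices` are assumptions for illustration.

```python
from collections import deque

def grid_indices(quads_x, quads_y):
    """Triangle-list indices for a grid of quads, emitted row by row."""
    row = quads_x + 1  # vertices per row
    indices = []
    for y in range(quads_y):
        for x in range(quads_x):
            top_left = y * row + x
            bottom_left = top_left + row
            indices += [top_left, bottom_left, top_left + 1,
                        top_left + 1, bottom_left, bottom_left + 1]
    return indices

def acmr(indices, cache_size):
    """Vertex shader runs per triangle under an assumed FIFO cache."""
    cache = deque(maxlen=cache_size)  # oldest entry drops out automatically
    misses = 0
    for index in indices:
        if index not in cache:
            misses += 1
            cache.append(index)
    return misses / (len(indices) // 3)

wide_grid = grid_indices(16, 4)        # 85 vertices, 128 triangles
large_cache = acmr(wide_grid, 64)      # cache holds several rows
small_cache = acmr(wide_grid, 8)       # cache holds less than one row
```

With the larger cache every vertex is processed once; with the smaller one, each row's shared vertices have been evicted by the time the next row needs them, so they are processed again.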
Using the cache
As long as you do indexed vertex rendering, you will have some chance of hitting the cache. However, there are ways to optimize the order in which vertices are submitted, to maximize the number of cache hits when rendering a mesh.
Tom Forsyth's algorithm is one of the more modern ones. Unlike NVTriStrip, it creates an ordered triangle list (GL_TRIANGLES), not a strip. Thus, the index data may be larger than for a triangle strip.
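If a mesh currently exists only as a non-indexed list of vertices, it first has to be converted to indexed form before any of this applies. A minimal sketch of that conversion, assuming each vertex is a hashable tuple of all its attributes (the name `build_index_buffer` is hypothetical):

```python
def build_index_buffer(vertices):
    """Deduplicate a non-indexed vertex list into (unique_vertices, indices)."""
    unique_vertices = []
    index_of = {}   # vertex tuple -> position in unique_vertices
    indices = []
    for vertex in vertices:
        if vertex not in index_of:
            index_of[vertex] = len(unique_vertices)
            unique_vertices.append(vertex)
        indices.append(index_of[vertex])
    return unique_vertices, indices

# A quad as two non-indexed triangles; two corners are repeated.
soup = [(0, 0), (1, 0), (0, 1),
        (0, 1), (1, 0), (1, 1)]
unique, indices = build_index_buffer(soup)
```

Only vertices whose attributes all match are merged, so a corner that needs two different normals or texture coordinates stays as two separate vertices.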
Unlike most other algorithms, it does not care about the size of the cache. It is generally useful, able to get quite good performance in both small and large cache situations for most arbitrary meshes.
Details can be found here.
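The core of Forsyth's approach is a per-vertex score; the optimizer repeatedly emits the triangle whose three vertices have the highest total score. The sketch below reproduces that scoring function as best recalled from Forsyth's published description. The constant values and the 32-entry modeled LRU cache should be treated as assumptions and checked against the original before use.

```python
# Tuning constants, as recalled from Forsyth's description (assumptions).
CACHE_DECAY_POWER = 1.5
LAST_TRI_SCORE = 0.75
VALENCE_BOOST_SCALE = 2.0
VALENCE_BOOST_POWER = 0.5
MODELED_CACHE_SIZE = 32  # size of the internal LRU model, not of real hardware

def vertex_score(cache_position, remaining_triangles):
    """Score one vertex; cache_position is -1 if it is not in the modeled cache."""
    if remaining_triangles == 0:
        return -1.0  # no unemitted triangle uses this vertex
    score = 0.0
    if cache_position >= 0:
        if cache_position < 3:
            # Used by the most recent triangle: fixed score, which avoids
            # favoring long thin strips.
            score = LAST_TRI_SCORE
        else:
            # Score decays from 1 toward 0 as the vertex ages in the cache.
            scaler = 1.0 / (MODELED_CACHE_SIZE - 3)
            score = (1.0 - (cache_position - 3) * scaler) ** CACHE_DECAY_POWER
    # Vertices with few triangles left get a boost so they are finished off
    # early instead of being left behind as isolated triangles.
    score += VALENCE_BOOST_SCALE * remaining_triangles ** -VALENCE_BOOST_POWER
    return score
```

Because the score depends only on this fixed internal model, the resulting order is not tuned to any particular hardware cache size.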
A regular grid is, topologically, a regular grid of vertices with triangles between adjacent vertices. Note that this is only topologically speaking; the actual positions of the vertices can be anywhere. So a landscape could be a regular grid.
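To illustrate the distinction between topology and position, the sketch below builds the index list of a regular grid separately from its vertex positions. The function names and the height function are hypothetical.

```python
import math

def grid_indices(quads_x, quads_y):
    """Topology only: two triangles per quad, vertices numbered row by row."""
    row = quads_x + 1  # vertices per row
    indices = []
    for y in range(quads_y):
        for x in range(quads_x):
            top_left = y * row + x
            bottom_left = top_left + row
            indices += [top_left, bottom_left, top_left + 1,
                        top_left + 1, bottom_left, bottom_left + 1]
    return indices

def landscape_positions(quads_x, quads_y, height):
    """Positions are free: here the grid is displaced by a height function."""
    return [(float(x), height(x, y), float(y))
            for y in range(quads_y + 1)
            for x in range(quads_x + 1)]

indices = grid_indices(16, 4)
positions = landscape_positions(16, 4, lambda x, y: math.sin(x * 0.5) + math.cos(y * 0.5))
```

Changing the height function moves every vertex, but the index list, and therefore the cache behavior, stays exactly the same.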
Optimizing a regular grid for a vertex cache is somewhat easier than for an arbitrary mesh. Details for an algorithm to do this are found here. This algorithm beats Forsyth's, but is far less general, as it only works for regular grids.
Hugues Hoppe proposed a local optimization algorithm that improves on the 'greedy triangle-strip' algorithm. An implementation of this algorithm was shipped in the Direct3D 9 Utility library (D3DX9) for mesh optimization.
The original paper can be found here.
DirectXMesh is an open-source implementation of the original D3DX9 algorithm. It does not interact with graphics hardware in any way, so it can be used to prepare content for any graphics API.
NVTriStrip is a small library that NVIDIA developed quite a while ago. It takes a list of triangles that defines the topology (just the vertex indices) and returns either one large triangle strip or a set of triangle strips. Even though it is an NVIDIA library, it works just fine for non-NVIDIA hardware, as the functions take a parameter specifying the size of the post transform cache (in number of vertices). It does not interact with OpenGL or the graphics hardware in any way.
The library does have some problems. It cannot handle a set of triangles where more than 2 triangles share the same edge.
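A mesh can be checked for this condition before it is handed to the library. The following is a minimal sketch that is independent of the library itself; the function name `overshared_edges` is hypothetical.

```python
from collections import Counter

def overshared_edges(indices):
    """Return the edges of a triangle list that are used by more than 2 triangles."""
    edge_use = Counter()
    for i in range(0, len(indices) - 2, 3):
        a, b, c = indices[i:i + 3]
        for u, v in ((a, b), (b, c), (c, a)):
            # Store each edge with its smaller index first so that
            # (u, v) and (v, u) count as the same edge.
            edge_use[(min(u, v), max(u, v))] += 1
    return sorted(edge for edge, count in edge_use.items() if count > 2)

# Three triangles fanning out from the shared edge 0-1.
fin = [0, 1, 2,  0, 1, 3,  0, 1, 4]
problem_edges = overshared_edges(fin)
```

An empty result means every edge is shared by at most two triangles.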