Difference between revisions of "Shader"
Latest revision as of 15:06, 9 October 2019
A Shader is a user-defined program designed to run on some stage of a graphics processor. Shaders provide the code for certain programmable stages of the rendering pipeline. They can also be used in a slightly more limited form for general, on-GPU computation.
The rendering pipeline defines certain sections to be programmable. Each of these sections, or stages, represents a particular type of programmable processing. Each stage has a set of inputs and outputs, which are passed from prior stages and on to subsequent stages (whether programmable or not).
Shaders are written in the OpenGL Shading Language. The OpenGL rendering pipeline defines the following shader stages, with their enumerator name:
- Vertex Shaders: GL_VERTEX_SHADER
- Tessellation Control and Evaluation Shaders: GL_TESS_CONTROL_SHADER and GL_TESS_EVALUATION_SHADER. (requires GL 4.0 or ARB_tessellation_shader)
- Geometry Shaders: GL_GEOMETRY_SHADER
- Fragment Shaders: GL_FRAGMENT_SHADER
- Compute Shaders: GL_COMPUTE_SHADER. (requires GL 4.3 or ARB_compute_shader)
A program object can combine multiple shader stages (built from shader objects) into a single, linked whole. A program pipeline object can combine programs that contain individual shader stages into a whole pipeline.
While shader stages do use the same language, each stage has a separate set of inputs and outputs, as well as built-in variables. As such, shader objects are built for a specific shader stage. So while program objects can contain multiple stages, shader objects only contain code for a single stage.
Execution and invocations
When a drawing command is executed, the currently bound Program Object, or Program Pipeline Object, will be used in the rendering operation. The programmable portions of the Rendering Pipeline will execute the shader code stored in the currently used program(s).
Each shader stage that has code executes one or more times, based on exactly what is rendering. Each shader stage defines the frequency at which it executes.
Each execution of a shader stage within a Rendering Command is called an "invocation". With a rare few exceptions, shader stage invocations cannot interact with one another. Exactly how many invocations execute for most shader stages depends on the amount of stuff being rendered and the nature of that shader stage.
- Vertex Shaders: Approximately once per input vertex in the vertex stream. This may be less than once per vertex if indexed rendering is used, due to the Post Transform Cache, but it will be at least once for each unique set of vertex attributes.
- Tessellation Control Shaders: Precisely once per output vertex per patch. Invocations operating on the same input patch can intercommunicate through their output variables.
- Tessellation Evaluation Shaders: Approximately once per vertex in the tessellation of the abstract patch. A unique vertex in the patch may be processed more than once. The minimum number of TES invocations is once per unique vertex in the patch; the maximum is one for each vertex for each primitive generated by a patch.
- Geometry Shaders: Once per primitive reaching this stage. Geometry shader instancing allows the GS to be invoked multiple times for each input primitive.
- Fragment Shaders: Once per Fragment generated by the rasterizer. It may be executed more than this, as "helper" fragment shader instances may be used by the implementation. These instances however cannot write data (in any way, whether fragment shader outputs, Image Load Store or anything else). They exist mostly to compute Implicit Derivatives to make many texture sampling functions work.
- Compute Shaders: The number of invocations is defined by the number of work groups requested by the dispatch operation multiplied by the compute shader's local size. Compute shader invocations within a work group have some limited intercommunication functionality.
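The compute shader rule in the list above is simple arithmetic, and a small sketch may make it concrete. This is a plain Python model, not OpenGL code; the function name `compute_invocations` and its tuple parameters are hypothetical and chosen only for illustration.

```python
from math import prod

def compute_invocations(num_groups, local_size):
    """Total compute shader invocations for one dispatch.

    num_groups: (x, y, z) work group counts requested by the dispatch.
    local_size: (x, y, z) local size declared by the compute shader.
    """
    return prod(num_groups) * prod(local_size)

# 4 x 2 x 1 work groups, each with a local size of 8 x 8 x 1:
# 8 work groups * 64 invocations per group = 512 invocations.
total = compute_invocations((4, 2, 1), (8, 8, 1))
```

Only the invocations inside one work group (64 in this example) have the limited intercommunication mentioned above.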
Execution model and divergence
Let us hypothetically divide up a computational unit of a CPU or GPU into a command processor and mathematical unit (note that this division isn't really true, but it is useful for this discussion).
The command processor is responsible for reading a command and telling the mathematical unit what to do, then reading the next command. If that command is a branch operation, it is the command processor that figures out where to branch to, to pick which command to execute next. For conditional branches, the command can be broken down into the math that computes the condition (executed on the math unit), followed by the command processor reading the result of the condition and picking the next command to execute based on that.
One of the things that gives GPUs their significant performance advantages over CPUs is that GPUs are able to perform many more computations in parallel. But the specifics of GPU parallelism are quite different from standard CPU threads.
Let's say that we have a set of command sequences we want to execute. Each sequence has its own set of input data, and will write to its own output values. But each command sequence is completely independent of each other; none of them can communicate with each other in any way.
If you want to execute this set of commands completely in parallel, you need one command processor and one mathematical unit for each command sequence.
However, let's alter our scenario. What if all of the above command sequences were actually the same sequence of commands, simply acting on different input values and writing to different output locations? If we were to design a processor for such an operation, it could have a single command processor which feeds multiple mathematical units. It reads one command, and tells the math units to execute that command on each unit's own variables. Each math unit has its own temporary values as well, to serve as intermediates.
This all works... right up until you need to execute a conditional branch (if, for, etc). What happens if the math computations that lead to a conditional branch result in different values for different command sequences? The desired result would be that these sequences would execute different commands. But we only have one command processor, so it can't pick different commands to execute. How do you deal with this?
When a conditional branch diverges like this, you have to do something unpleasant. The branch can be broken down into path A and path B. The command processor basically has to freeze all of the math units that need to execute path B commands. Then it executes the path A commands. Once that's done, it goes back to the branch, freezes the path A units instead, and executes path B's commands. The whole group therefore pays for both paths, with some of its math units sitting idle during each one.
As such, in actual GPUs, the number of mathematical units relative to the number of command processors tends to be kept relatively small. 1 CP to 32 MUs is about as large as you might want to go.
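The cost of divergence can be sketched with a toy model of one command processor driving several math units. This is an illustrative Python simulation under simplifying assumptions, not a description of any real GPU; `run_branch` and the two example paths are hypothetical names. It also assumes that a path which no unit needs is skipped entirely.

```python
def run_branch(conditions, path_a, path_b):
    """Model one command processor driving len(conditions) math units.

    conditions: per-unit result of the branch condition.
    path_a, path_b: lists of commands; each command maps a value to a value.
    Returns (per-unit results, number of commands issued).
    """
    values = [0] * len(conditions)
    issued = 0
    for path, wanted in ((path_a, True), (path_b, False)):
        # Units whose condition does not select this path are frozen.
        active = [cond == wanted for cond in conditions]
        if not any(active):
            continue  # No unit takes this path, so it is skipped.
        for command in path:
            issued += 1  # One command is issued for all active units at once.
            values = [command(v) if on else v for v, on in zip(values, active)]
    return values, issued

path_a = [lambda v: v + 1, lambda v: v * 10]                   # 2 commands
path_b = [lambda v: v - 1, lambda v: v * 2, lambda v: v - 3]   # 3 commands

# All units agree: only path A is issued (2 commands).
uniform = run_branch([True, True, True, True], path_a, path_b)
# Units disagree: path A and then path B are issued (2 + 3 = 5 commands).
diverged = run_branch([True, False, True, False], path_a, path_b)
```

In the diverged case every unit waits through five commands even though it only needs the results of two or three of them.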
Also, in real GPUs the mathematical units aren't really separate processing elements. After all, all of them are going to execute the same command at the same time, so they're doing the same math, just on different values.
What happens in real GPUs is that you have one "mathematical unit" which can perform the same action on multiple pieces of input data, writing to multiple intermediate data locations. That is, if you want to do x = 2 + 3 and y = 6 + 9, you're conceptually executing the same command: addition. So what a GPU does is bundle them together. You're not doing two separate addition commands; you're doing (x, y) = (2, 6) + (3, 9) as a single operation.
This is the essence of Single Instruction, Multiple Data (SIMD): a single instruction that performs its operation on multiple independent pieces of data.
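As a minimal sketch, the bundled addition can be modeled as one command applied across lanes. The helper `simd_add` is a hypothetical name, and a Python list comprehension only stands in for what hardware does in a single instruction.

```python
def simd_add(lhs, rhs):
    """Apply one 'add' command to every lane of two equally sized operands."""
    if len(lhs) != len(rhs):
        raise ValueError("both operands need the same number of lanes")
    return [a + b for a, b in zip(lhs, rhs)]

# Lane 0 computes x = 2 + 3, lane 1 computes y = 6 + 9.
x, y = simd_add((2, 6), (3, 9))
```

Each lane holds the operands of one independent computation; the left operands of both additions travel together, as do the right operands.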
Under this execution model, a shader invocation is just one of the elements executing in a SIMD processing system. As such, multiple shader invocations are bundled together and executed on the same SIMD core.
The big problem, as outlined above, is divergence upon conditional branching. But not all conditional branches are the same. There are basically two kinds of conditional branches: those where you can tell at compile time that invocations in the same SIMD group will never diverge, and those where you cannot tell.
The compiler can only be certain that there will be no divergence in a branch if the expression causing the branch is derived from constants, uniform values, and other expressions which themselves are derived only from constants and uniforms. We can call this a "statically uniform expression".
Note that this is different from Dynamically Uniform Expressions. All statically uniform expressions are dynamically uniform too, but not vice-versa. If a non-statically uniform value happens to be constant throughout a shader operation (all shader invocations get the same input value, for example), then expressions based on that value are dynamically uniform. But because the compiler cannot know this, it must assume that divergence is possible.
So, you can loop over a range of values defined by a uniform variable with full confidence that there will be no divergence.
Non-statically uniform expressions may cause runtime divergence. But on modern GPUs, they don't cause a performance issue unless they actually do diverge at runtime.
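The distinction between the two kinds of uniformity can be sketched with a tiny expression model. This is a simplification for illustration only: the tuple encoding of expressions and both function names are hypothetical, and the check on lane values ignores the finer points of how the shading language scopes dynamic uniformity.

```python
def is_statically_uniform(expr):
    """expr is ("constant", value), ("uniform", name), ("input", name),
    or ("op", symbol, [sub-expressions])."""
    kind = expr[0]
    if kind in ("constant", "uniform"):
        return True
    if kind == "op":
        return all(is_statically_uniform(arg) for arg in expr[2])
    return False  # Stage inputs and anything else the compiler cannot prove.

def is_dynamically_uniform(lane_values):
    """True if every invocation in the group ended up with the same value."""
    return len(set(lane_values)) <= 1

# A loop bound built only from a uniform and a constant.
loop_bound = ("op", "*", [("uniform", "lightCount"), ("constant", 2)])
# An expression that depends on a per-invocation input.
per_vertex = ("op", "+", [("input", "weight"), ("constant", 1)])
```

Here `loop_bound` is statically uniform, so the compiler can rule out divergence. `per_vertex` is not, even if every invocation happens to receive the same `weight` at runtime, in which case the resulting lane values would still pass `is_dynamically_uniform`.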
Even with potentially divergent branches, compilers attempt to avoid the pain of true execution divergence wherever possible. For example, a simple ? : expression will rarely cause SIMD divergence; the compiler will usually generate code that evaluates *both expressions*, and each individual invocation will discard one or the other based on the condition. If those expressions involve calling complex functions, the compiler may allow true divergence, but if it's simple math like value = condition ? X + 5 : Y - 20, the compiler will almost certainly execute both.
This extends to regular if statements as well, depending on how clever the compiler is. If the code in those conditions is pretty small, then it will try to execute both wherever possible. But again, it will only do this if the condition is not statically determinable to be non-divergent.
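The "execute both and discard one" strategy can be sketched as follows. This is a Python model of what a compiler might generate for `value = condition ? X + 5 : Y - 20`, not actual compiler output; `select_per_lane` is a hypothetical name.

```python
def select_per_lane(conditions, xs, ys):
    """Model `value = condition ? X + 5 : Y - 20` without divergence."""
    # Both expressions are evaluated for every lane, in lockstep.
    if_true = [x + 5 for x in xs]
    if_false = [y - 20 for y in ys]
    # Each lane then keeps one result and discards the other.
    return [t if c else f for c, t, f in zip(conditions, if_true, if_false)]

# Lane 0 takes the "true" side, lane 1 the "false" side.
values = select_per_lane([True, False], [1, 2], [100, 200])
```

No lane is ever frozen here. The price is that every lane computes both sides, which is why this approach only pays off when both sides are cheap.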
Resource limitations
Shaders have access to a wide variety of resources. They can access Textures, uniforms, uniform blocks, image variables, atomic counters, shader storage buffers, and potentially other information. There are limits however on exactly how much stuff each shader stage can access. Each resource has a query-able maximum count of accessible resources for each stage.
Note that each stage may also have limitations on stage-specific resources; vertex shaders have a hard limit on the number of Vertex Attributes, for example. This section will discuss resources that are general to all shaders.
The query-able limits, and their associated meanings, are as follows. Note that the "*" is the stage name. It can be VERTEX, TESS_CONTROL, TESS_EVALUATION, COMPUTE, GEOMETRY or FRAGMENT. These values also have an OpenGL-required minimum; OpenGL implementations (of a certain version) will support at least this many of that resource.
- This is the number of active components of uniform variables that can be defined outside of a uniform block. The term "component" is meant as the basic component of a vector/matrix. So a vec3 takes up 3 components. The minimum value here is 1024, enough room for 256 vec4s.
- The maximum number of uniform blocks that this shader stage can access. The OpenGL-required minimum is 12 in GL 3.3, and 14 in GL 4.3.
- The maximum number of components that this stage can take as input. The required minimum value differs from shader stage to shader stage. Note that Vertex Shaders do not have this value, as they use a different input mechanic based on Vertex Attributes. Their limit is GL_MAX_VERTEX_ATTRIBUTES, where each attribute can be at most 4 components.
- The maximum number of components that this stage can output. The required minimum value differs from shader stage to shader stage. Note that Fragment Shaders do not have this value, as they use a different output mechanic based on draw buffers. Their limit is GL_MAX_DRAW_BUFFERS, where each draw buffer output can be at most 4 components.
- The maximum number of texture image units that the sampler in this shader can access. The OpenGL-required minimum value is 16 for each stage.
Note: For legacy reasons, the enumerator for the fragment shader equivalent is called GL_MAX_TEXTURE_IMAGE_UNITS. No "FRAGMENT".
- GL_MAX_*_IMAGE_UNIFORMS (requires GL 4.2/ARB_shader_image_load_store)
- The maximum number of image variables for this shader stage. The OpenGL-required minimum is 8 for fragment and compute shaders, and 0 for the rest. This means implementations may not allow you to use image variables in stages other than fragment and compute.
- GL_MAX_*_ATOMIC_COUNTERS (requires GL 4.2/ARB_shader_atomic_counters)
- The maximum number of Atomic Counter variables that this stage can define. The OpenGL-required minimum is 8 for fragment and compute shaders, and 0 for the rest.
- GL_MAX_*_ATOMIC_COUNTER_BUFFERS (requires GL 4.2/ARB_shader_atomic_counters)
- The maximum number of different buffers that the atomic counter variables can come from. The OpenGL-required minimum is 1 for fragment shaders, 8 for compute shaders (note: possible spec typo), and again 0 for the rest.
- GL_MAX_*_SHADER_STORAGE_BLOCKS (requires GL 4.3/ARB_shader_storage_buffer_object)
- The maximum number of different shader storage blocks that a stage can use. For fragment and compute shaders, the OpenGL-required minimum is 8; for the rest, it is 0.
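The component counting used for the uniform limit in the list above can be sketched as a small calculation. This is only a model: the `COMPONENTS` table and the `uniform_components` function are hypothetical, and the sketch assumes each type costs exactly its basic component count. A real implementation may pad some types and count more.

```python
# Basic component counts for a few GLSL types.
COMPONENTS = {"float": 1, "vec2": 2, "vec3": 3, "vec4": 4, "mat3": 9, "mat4": 16}

def uniform_components(declarations):
    """declarations: list of (glsl_type, array_length) pairs.

    Use an array_length of 1 for variables that are not arrays.
    """
    return sum(COMPONENTS[glsl_type] * count for glsl_type, count in declarations)

# 256 vec4s exactly fill the required minimum of 1024 components.
full = uniform_components([("vec4", 256)])
# One vec3 plus an array of two mat4s: 3 + 32 = 35 components.
small = uniform_components([("vec3", 1), ("mat4", 2)])
```

Comparing such a total against the value queried for the stage gives a rough idea of how much headroom a shader has.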
While these define the resources usable to a shader stage, there are some limits beyond this, which cover all shader stages in total.
- The limit on the number of uniform buffer binding points. This is the limit for glBindBufferRange when using GL_UNIFORM_BUFFER. In GL 3.3, this value is a minimum of 36 (3 shader stages, with a minimum of 12 blocks per stage). In 4.3, this value is a minimum of 72.
- The maximum number of uniform blocks that all of the active programs can use. If two (or more) shader stages use the same block, they count separately towards this limit. In GL 3.3, this was 36; in 4.3, it is 70.
- The total number of texture units that can be used from all active programs. This is the limit on glActiveTexture(GL_TEXTURE0 + i) and glBindSampler. In GL 3.3, this was 48; in 4.3, it is 96.
- When doing separate mode Transform Feedback, this is the maximum number of varying variables that can be captured. This has a minimum of 4.
- When doing separate mode Transform Feedback, this is the maximum number of components for a single varying variable (note that varyings can be arrays or structs) that can be captured. This has a minimum of 4.
- When doing interleaved Transform Feedback, this is the total number of components that can be captured within a single buffer. This has a minimum of 64.
- GL_MAX_TRANSFORM_FEEDBACK_BUFFERS (requires GL 4.0/ARB_transform_feedback3)
- The maximum number of buffers that can be written to in transform feedback operations. This has a minimum of 4.
- GL_MAX_ATOMIC_COUNTER_BUFFER_BINDINGS (requires GL 4.2/ARB_shader_atomic_counters)
- The total number of atomic counter buffer binding points. This is the limit for glBindBufferRange when using GL_ATOMIC_COUNTER_BUFFER. This value has a minimum of 1.
- GL_MAX_COMBINED_ATOMIC_COUNTER_BUFFERS (requires GL 4.2/ARB_shader_atomic_counters)
- The maximum number of atomic counter buffers across all active programs. This value has a minimum of 1.
- GL_MAX_COMBINED_ATOMIC_COUNTERS (requires GL 4.2/ARB_shader_atomic_counters)
- The maximum number of atomic counter variables across all active programs. This value has a minimum of 8.
- GL_MAX_SHADER_STORAGE_BUFFER_BINDINGS (requires GL 4.3/ARB_shader_storage_buffer_object)
- The total number of shader storage buffer binding points. This is the limit for glBindBufferRange when using GL_SHADER_STORAGE_BUFFER. This value has a minimum of 8.
- GL_MAX_COMBINED_SHADER_STORAGE_BLOCKS (requires GL 4.3/ARB_shader_storage_buffer_object)
- The maximum number of shader storage blocks across all active programs. As with UBOs, blocks that are the same between stages are counted for each stage. This value has a minimum of 8.
- GL_MAX_IMAGE_UNITS (requires GL 4.2/ARB_shader_image_load_store)
- The total number of image units that can be used for image variables from all active programs. This is the limit on glBindImageTexture. This value has a minimum of 8.
- GL_MAX_COMBINED_SHADER_OUTPUT_RESOURCES (requires GL 4.3/ARB_shader_storage_buffer_object)
- The total number of shader storage blocks, image variables, and fragment shader outputs across all active programs cannot exceed this number. This is the "amount of stuff" that a sequence of shaders can write to (barring Transform Feedback). This value has a minimum of 8.
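The rule that a uniform block shared between stages counts once per stage toward the combined limit is easy to get wrong, so here is a minimal sketch of the counting. The function name, the stage keys, and the block names are all hypothetical; the limit of 36 is the GL 3.3 minimum quoted above (3 stages with 12 blocks each).

```python
def combined_uniform_blocks(blocks_per_stage):
    """blocks_per_stage: dict mapping a stage name to the set of uniform
    block names that stage uses. Shared blocks count once per stage."""
    return sum(len(blocks) for blocks in blocks_per_stage.values())

usage = combined_uniform_blocks({
    "vertex": {"Camera", "Model"},
    "fragment": {"Camera", "Lights"},
})

# "Camera" is counted twice, so the usage is 4 rather than 3.
GL33_MIN_COMBINED_UNIFORM_BLOCKS = 3 * 12  # 36, per the GL 3.3 minimum
fits = usage <= GL33_MIN_COMBINED_UNIFORM_BLOCKS
```

The same per-stage counting applies to shader storage blocks, as noted in the list above.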