OpenVX User Kernel Tiling Extension  r29628
User Kernel Tiling

The User Kernel Tiling facility enables optimizations of the user nodes (e.g. locality of execution or parallelism) when performing computation on the image data. Modern processors have a diverse memory hierarchy that varies from relatively small but fast and expensive memory to relatively large but slow and inexpensive memory. Image data is typically too large to fit into the fast but small memory. The ability to break the image data into smaller-sized rectangular units allows for optimized computation on these smaller units with fast memory access or parallel execution of a user node on multiple image tiles'' simultaneously. The OpenVX Graph Manager possesses the knowledge about the memory hierarchy of the target and is hence in a position to break the image data into smaller units for memory optimization. Knowledge of the memory access pattern of an algorithm is key for the graph manager to enable optimizations.

As with conventional User Kernels, Tiling User Kernels will typically be loaded and executed on HLOS/CPU-compatible targets, not on remote processors or other accelerators — the intent of this extension is to allow efficient scheduling and memory transfer between accelerators and user functions, not to provide code to run on an accelerator, since this capability is necessarily vendor-specific. This specification does not mandate what constitutes compatible platforms.

# Purpose

The purpose of having User Kernel Tiling Functions is to:

• Provide a mechanism for user-kernel programmers to communicate enough information about their computation to enable OpenVX implementations to perform some optimizations that take advantage of locality in the user-node computation, and enable parallelism of execution across local data sets.
• Define a portable method to enable tiling on compatible platforms.
• Define a simple way to enable users to write image processing functions.

# Basic Design

The user kernel tiling facility is largely composed of two components:

• User Node informs Graph Manager — The author of the tilable user kernel informs the graph manager about the data requirements of the algorithm implemented by the kernel. This is primarily done by setting attributes on the kernel (for all nodes of this type), or on the node (for that particular node). This is usually done at kernel definition time (for kernel attributes) or at node creation time (for node attributes). Only single-plane images are supported as input or output arguments to tiled user nodes in this version of the specification.
• Graph Manager informs User Node — The graph manager informs the user kernel code about the rectangular sub-image, or “tile”, on which the user kernel code will run. This is primarily done via macros that the user kernel code can call to get this information at run time for a particular invocation of the user code. In this version of the tiling extension, the user functions are expected to generate the same output pixels in each output image, and to rely on the corresponding pixels of the input image, potentially with the addition of some nearby “neighborhood” pixels, to perform this calculation. The effect of a tiling user kernel that writes to its input images is implementation-defined.

# Two Kernel Functions

The creator of the tiled user kernel must provide a kernel function to perform the actual computation. Optionally, the kernel creator may provide a second function with different edge-condition checks.

• One of the functions is referred to as the “flexible” function, and it may be called with irregular tile sizes and near the edge of the tile's parent image, so it must fully check all edge conditions to handle such issues.
• The second function is referred to as the “fast” function, and it will only be called with tile sizes and pixel alignments that are multiples of the user-specified “tile block size”, and will not be called in a location that would cause references to the neighborhood region to fall outside the image.

If the kernel creator only provides the fast function, some of the output image (near the edges) will not be computed if the fast function cannot be safely called at the edge of these images. If the kernel creator only provides the flexible function, it will be called at all appropriate locations of the image, including near the edges. The OpenVX implementation should still attempt to call the flexible function with regular tile sizes and alignments where possible. If both functions are provided, the fast function should be used as much as possible given its restrictions, and the flexible function will be called elsewhere.

# Glossary

Terms and Definitions of the Extension:

• Tile — The rectangle of destination pixels to be computed by the user function, whose size is less than or equal to the rectangle of the entire image. This value is determined by the OpenVX runtime. Every tile that is provided to the user node will be fully contained within the image, and no pixel of the image will be included in more than one tile. The user cannot rely on the order or concurrency of tile processing.
• Output tile block — The output tile block size imposes a coordinate grid over the image that is aligned with the top-left corner of the output image. The fast function will always be called on a tile that is aligned to this grid, and the dimensions of this tile are integer multiples of the tile block size. That is, the fast function is invoked on a tile consisting of a whole number of tile blocks. There are no restrictions on the size and alignment of a tile given to the flexible function. It is expected that the tile block size be small — potentially a single pixel — in order to maximize opportunities to use a “fast” function. The tile size should generally be much larger, and chosen by the graph manager in an implementation-specific way to make good use of cache sizes and memory transfer and to limit function call overhead, while still offering the potential for reduced memory footprint and parallel execution.
• Input tile block — The input tile block defines the dimension of a tile block in the input image corresponding to an output tile block. In OpenVX 1.0, this value is unconditionally regarded to be same as the output tile block size.
• Fast function — A function provided by the user which will be invoked only with tiles whose coordinates are aligned to, and whose dimensions are integer multiples of, the tile block size, and which will never attempt to access pixels that lie beyond the bounds of the input images (due to neighborhood). The expectation is that the user may be able to provide a more efficient implementation if the function is known to be invoked on coordinate-aligned data. For this reason, the implementation may be expected to invoke the fast function preferentially over the flexible function. If only the fast function is provided and the flexible function is not, the regions of the image which do not meet the requirements of running the fast function will not be processed. Note that coordinate alignment does not imply memory alignment, which is determined in part by image stride and is implementation-defined, though many implementations will define a relationship between coordinate and memory alignment. There is no guarantee of evaluation order of the functions, and on some implementations multiple invocations may occur concurrently. The fast function is a function pointer of type vx_tiling_kernel_f.
• Flexible function — A function provided by the user which can be used to process tiles that are not constrained by tile block size, and — with VX_BORDER_MODE_SELF — which may rely on a neighborhood that corresponds to pixels beyond the bounds of the input images. The user function should be capable of running at arbitrary coordinate alignment. If only the flexible function is provided, this will be invoked on tile-block aligned tiles as well as unaligned tiles. The user can expect aligned regions to be invoked preferentially even when only a flexible function is provided. At least one of the fast and flexible functions must be provided. The flexible function is a function pointer of type vx_tiling_kernel_f.
• Input neighborhood — The additional pixels surrounding the tile in the input images that are required in order to compute the output tile. The top, bottom, left and right values correspond to the offset of the edges of the neighborhood defined relative to the tile being processed. For example, a Sobel kernel requiring 3 by 3 pixels centered on the output would have an input neighborhood of (-1,1,-1,1). Functions which operate on only a single pixel have an input neighborhood of (0,0,0,0).
• Output neighborhood — The additional (previously computed) pixels surrounding the tile in the output image needed to compute the output tile. In OpenVX 1.0, this value is unconditionally regarded as zero in all four directions (top, bottom, left, right).
• Border mode — Defines the behavior of the OpenVX runtime against the determination of tiles near the image edge for non-zero neighborhood sizes. If the border mode is VX_BORDER_MODE_SELF, the flexible function will be invoked on tiles near the image border that would require neighborhood access beyond the image bounds, and this function is responsible for accessing any pixels whose coordinates are invalid. If the border mode is VX_BORDER_MODE_UNDEFINED, the graph manager will ensure that the user functions are not invoked on coordinates that, because of neighborhood, require access outside the image area.
• Tile memory — A facility for per-tile memory allocation. The amount allocated will be the amount requested via the VX_KERNEL_ATTRIBUTE_TILE_MEMORY_SIZE attribute (which may be zero) multiplied by the number of tile blocks in the tile being processed, and the allocation will be passed to the user-provided functions through the tile_memory parameter. The size allocated for that function invocation is provided via the tile_memory_size parameter. The valid lifetime of the given pointer is per invocation of the user function that processes a tile. This allocation is provided per tile — and therefore two parallel invocations of the user tiling functions will receive different allocations. This is distinct from the per-node memory managed via the VX_KERNEL_ATTRIBUTE_LOCAL_DATA_SIZE and VX_KERNEL_ATTRIBUTE_LOCAL_DATA_PTR kernel attributes.

# Summary of user function calls

The following tables describe the circumstances under which each user function will be invoked, categorized by the border mode settings of the node.

## User function invocation according to which functions are defined

Functions provided VX_BORDER_MODE_UNDEFINED VX_BORDER_MODE_SELF
If only the FAST function is defined...The Fast function is invoked on the largest aligned image subset that avoids out-of-bounds accesses. Other pixels are not processed.Prohibited (graph validation fails).
If only the FLEXIBLE function is defined...The flexible function is invoked on the largest image subset that ensures neighborhood accesses are within bounds, irrespective of alignment. Other pixels are not processed.The flexible function is invoked for all pixels in the output image irrespective of alignment, including pixels that correspond to neighborhood accesses beyond the image bounds.
If both the FAST and FLEXIBLE functions are defined...The fast function is invoked on the largest aligned image subset that avoids out-of-bounds neighborhood accesses. The flexible function is invoked for the remaining pixels that do not require out-of-bounds accesses.The fast function is invoked on the largest aligned image subset that avoids out-of-bounds neighborhood accesses. The flexible function is invoked on remaining pixels.

Note that a missing flexible function when using VX_BORDER_MODE_SELF is treated as an error even if the fast function might never be expected to be called (for example, if the image size is divisible by the tile block size and the neighborhood is 0).

## User function invocation according to which pixels are being processed

Pixels to process VX_BORDER_MODE_UNDEFINED VX_BORDER_MODE_SELF
Pixels that require access beyond the image bounds (due to neighborhood) Not processed by any function Processed by the flexible function, which must be provided
Pixels that require access only within the image, but in an unaligned tile Processed by the flexible function (if provided) and not processed otherwise Processed by the flexible function, which must be provided
Pixels within an aligned tile that do not access beyond the image bounds Processed by the fast function if provided, otherwise by the flexible function Processed by the fast function if provided, otherwise by the flexible function

# Adding a Tiling Kernel Function

vx_status status = VX_SUCCESS;
vx_uint32 k = 0;
for (k = 0; k < dimof(tiling_kernels); k++)
{
tiling_kernels[k].name,
tiling_kernels[k].enumeration,
tiling_kernels[k].flexible_function,
tiling_kernels[k].fast_function,
tiling_kernels[k].num_params,
tiling_kernels[k].input_validator,
tiling_kernels[k].output_validator);
if (kernel)
{
vx_uint32 p = 0;
for (p = 0; p < tiling_kernels[k].num_params; p++)
{
tiling_kernels[k].parameters[p].direction,
tiling_kernels[k].parameters[p].data_type,
tiling_kernels[k].parameters[p].state);
}
status |= vxSetKernelAttribute(kernel, VX_KERNEL_ATTRIBUTE_INPUT_NEIGHBORHOOD, &tiling_kernels[k].nbhd, sizeof(vx_neighborhood_size_t));
status |= vxSetKernelAttribute(kernel, VX_KERNEL_ATTRIBUTE_OUTPUT_TILE_BLOCK_SIZE, &tiling_kernels[k].block, sizeof(vx_tile_block_size_t));
status |= vxSetKernelAttribute(kernel, VX_KERNEL_ATTRIBUTE_BORDER, &tiling_kernels[k].border, sizeof(vx_border_mode_t));
if (status != VX_SUCCESS)
{
vxRemoveKernel(kernel);
}
else
{
status = vxFinalizeKernel(kernel);
}
if (status != VX_SUCCESS)
{
printf("Failed to publish kernel %s\n", tiling_kernels[k].name);
break;
}
}
}

# Example Kernels

Users can add simple filter-style kernels similar to this example of a Gaussian Blur.

void gaussian_image_tiling_fast(void * VX_RESTRICT parameters[VX_RESTRICT],
void * VX_RESTRICT tile_memory,
vx_size tile_memory_size)
{
vx_uint32 x, y;
vx_tile_t *in = (vx_tile_t *)parameters[0];
vx_tile_t *out = (vx_tile_t *)parameters[1];
for (y = 0; y < vxTileHeight(out, 0); y += vxTileBlockHeight(out))
{
for (x = 0; x < vxTileWidth(out, 0); x += vxTileBlockWidth(out))
{
vx_uint32 sum = 0;
/* this math covers a {-1,1},{-1,1} neighborhood and a block of 1x1.
* an internal set of for loops could have been placed here instead.
*/
sum += vxImagePixel(vx_uint8, in, 0, x, y, -1, -1);
sum += vxImagePixel(vx_uint8, in, 0, x, y, 0, -1) << 1;
sum += vxImagePixel(vx_uint8, in, 0, x, y, +1, -1);
sum += vxImagePixel(vx_uint8, in, 0, x, y, -1, 0) << 1;
sum += vxImagePixel(vx_uint8, in, 0, x, y, 0, 0) << 2;
sum += vxImagePixel(vx_uint8, in, 0, x, y, +1, 0) << 1;
sum += vxImagePixel(vx_uint8, in, 0, x, y, -1, +1);
sum += vxImagePixel(vx_uint8, in, 0, x, y, 0, +1) << 1;
sum += vxImagePixel(vx_uint8, in, 0, x, y, +1, +1);
sum >>= 4;
if (sum > 255)
sum = 255;
vxImagePixel(vx_uint8, out, 0, x, y, 0, 0) = (vx_uint8)sum;
}
}
}

Note that this function has a non-zero neighborhood, and performs no special checking for running at the image borders. Therefore it must be used as a “fast” function and, unless a corresponding “flexible” function is provided, must be used with VX_BORDER_MODE_UNDEFINED.

Users can also add more complex addressing functionality in order to optimize their code in this example which uses two inputs:

void * VX_RESTRICT tile_memory,
vx_size tile_memory_size)
{
vx_uint32 i,j;
vx_tile_t *in0 = (vx_tile_t *)parameters[0];
vx_tile_t *in1 = (vx_tile_t *)parameters[1];
vx_tile_t *out = (vx_tile_t *)parameters[2];
for (j = 0u; j < vxTileHeight(out,0); j+=vxTileBlockHeight(out))
{
for (i = 0u; i < vxTileWidth(out,0); i+=vxTileBlockWidth(out))
{
/* this math covers a 1x1 block and has no neighborhood */
vx_uint16 pixel = vxImagePixel(vx_uint8, in0, 0, i, j, 0, 0) +
vxImagePixel(vx_uint8, in1, 0, i, j, 0, 0);
if (pixel > INT16_MAX)
pixel = INT16_MAX;
vxImagePixel(vx_int16, out, 0, i, j, 0, 0) = (vx_int16)pixel;
}
}
}

With a tile block width and height of one, combined with the zero neighborhood, this function can be used as a “flexible” function. With a larger tile block size, this is a “fast” function.

For better performance, the user might split this into two functions. Firstly, a “fast” function:

void * VX_RESTRICT tile_memory,
vx_size tile_memory_size)
{
vx_uint32 i,j,ii,jj;
vx_tile_t *in0 = (vx_tile_t *)parameters[0];
vx_tile_t *in1 = (vx_tile_t *)parameters[1];
vx_tile_t *out = (vx_tile_t *)parameters[2];
for (j = 0u; j < vxTileHeight(out,0); j+=16)
{
for (jj = 0u; jj < 16u; ++jj)
{
for (i = 0u; i < vxTileWidth(out,0); i+=16)
{
for (ii = 0u; ii < 16u; ++ii) {
/* this math covers a 1x1 block and has no neighborhood */
vx_uint16 pixel = vxImagePixel(vx_uint8, in0, 0, i+ii, j+jj, 0, 0) +
vxImagePixel(vx_uint8, in1, 0, i+ii, j+jj, 0, 0);
if (pixel > INT16_MAX)
pixel = INT16_MAX;
vxImagePixel(vx_int16, out, 0, i+ii, j+jj, 0, 0) = (vx_int16)pixel;
}
}
}
}
}

This fast function has a fixed 16 by 16 tile block size, the loops of which the compiler may be able to unroll or otherwise optimize — the pixel access macro is designed to give the compiler the best chance of doing this. Less portable code may be able to rely on byte alignment and fast memory accesses, and perform SIMD operations in the fast function to maximize the host CPU performance.

In addition to performance optimization, many algorithms have tile block size requirements due to spatial effects on pixels. For example, ordered dithering, UV upsampling and debayering all require pixels to be processed differently according to location within a repeating block.

This would be complemented by a corresponding “flexible” function:

void * VX_RESTRICT tile_memory,
vx_size tile_memory_size)
{
vx_uint32 i,j;
vx_tile_t *in0 = (vx_tile_t *)parameters[0];
vx_tile_t *in1 = (vx_tile_t *)parameters[1];
vx_tile_t *out = (vx_tile_t *)parameters[2];
for (j = 0u; j < vxTileHeight(out,0); ++j)
{
for (i = 0u; i < vxTileWidth(out,0); ++i)
{
/* this math covers a 1x1 block and has no neighborhood */
vx_uint16 pixel = vxImagePixel(vx_uint8, in0, 0, i, j, 0, 0) +
vxImagePixel(vx_uint8, in1, 0, i, j, 0, 0);
if (pixel > INT16_MAX)
pixel = INT16_MAX;
vxImagePixel(vx_int16, out, 0, i, j, 0, 0) = (vx_int16)pixel;
}
}
}

This function ignores the tile block and processes the tile given to it one pixel at a time. It is expected that the flexible function should be called for much less of the image than the fast function, so this overhead should be considered acceptable. This applies whether the flexible function is invoked because of tile alignment or in order to support out-of-bounds accesses, and users writing portable OpenVX code should be able to devote effort to the fast function. However, it is legal for an implementation with minimal tiling support to process an entire image in a single, possibly unaligned, tile for portability reasons.

# Example diagrams

## A trivial kernel

The simplest user functions (that depend on any inputs at all) will read only the pixel in the input image(s) that correspond(s) to the pixel in the output image(s). Such a function requires a neighborhood size of 0 in each direction. If only a single image is processed at a time, the tile block size should be set to 1 by 1. Here we show the tile block — and therefore minimum tile size — in blue (trivially).

## A user function with a neighborhood

A more complex user function may require source pixels surrounding the coordinate of the output pixel in order to work. For example, a Sobel filter requires the a three by three block centered on the current pixel. Here we show the neighborhood in cyan around the single pixel to be processed, in blue — each square represents one pixel. This neighborhood would be specified as “-1,1,-1,1”.

## A faster function with a neighborhood

The user may choose to optimize the same function by making it work on blocks of 4 by 4 pixels, rather than one pixel at a time. In this case, the tile block should be set to 4 by 4. Depending on the implementation, the neighborhood may be unmodified. If a fast function is defined that uses this 4 by 4 tile block, the user should still offer a flexible function that can process tiles that are not aligned to this 4 by 4 grid, since otherwise some portions of the image area may be left unprocessed. Here we show the 4 by 4 tile block with a single pixel neighborhood, specified in the same way as before.

Tiles processed using the “fast” function will be aligned to a 4x4 grid, and some multiple of this tile block size; the fast function will be able to access valid pixels from the neighborhood area (which will surround the complete tile by a single pixel).

## Tiles and tile blocks

Consider a simple function that has no neighborhood - let us say a function which posterizes the image by replicating the top four bits of an 8-bit input into the lower four bits:

out(x,y) = (in(x,y) & 0xF0) | (in(x,y) >> 4)

If the implementation guarantees 32-bit alignment for the start of each row — not something guaranteed by OpenVX, but likely common — then the user could write a “fast” function which processes four pixels at once, using a 32-bit integer. This function would then have a 4 by 1 “tile block”.

In processing a 12 by 6 image using this function, the graph manager might choose to break the image into three rectangular “tiles”. An example in shown in different colors below. The fast function can be called for each of these tiles. It is up to a graph manager to decide different arrangements, and to decide the order in which the fast function would be invoked for these tiles — including whether the tiles are evaluated in parallel — but the tile block size guarantees that the tiles will be multiples of 4 by 1 pixels.

Next, consider what happens if the image is 14 by 6 pixels:

In this case, the 2-by-6 strip at the right edge of the image cannot be processed by tile-block-aligned tiles, and therefore the fast function cannot process these pixels. If the user has provided a “flexible” function, that function will be used to process these pixels (still provided with rectangular tiles of data).

## Example of borders

We might imagine a function that is similar to the above, but which requires access to all the source pixels surrounding the coordinate of the destination pixel in order to operate. While it may process a single pixel, the user may choose to perform a similar optimization to that discussed in the previous section, resulting in a 4-by-1 tile block with a one-pixel border neighborhood around it.

With the user node's border mode set to VX_BORDER_MODE_UNDEFINED, the graph manager might choose to break the same 14 by 8 image into the following tiles for the user functions to process:

In this case, the white pixels at the image edge are not processed at all. The red tile is aligned to tile blocks, and can be processed with a fast function, if one is provided (otherwise the flexible function is used). The blue and green tiles do not start on (blue tile) or end on (green tile) a 4 by 1 tile block grid, so these pixels must be processed by the flexible function. If no flexible function is provided, these pixels will not be processed.

With the user node's border mode set to VX_BORDER_MODE_SELF, the graph manager might choose to break the same 14 by 8 image into the following tiles for the user functions to process:

With VX_BORDER_MODE_SELF, the flexible function must be provided. In this example, only the red tile can be processed by the fast function (if provided), since only this tile is aligned to a tile block boundary and has no requirements on pixels beyond the image borders. The blue, cyan and yellow regions are aligned to the tile block boundary, but must be processed by the flexible function because the neighborhood of these pixels extends beyond the image boundary. The green tile is additionally not aligned to the tile block size.

# Informative Coding Guidelines

Tiled user kernel functions should be coded in a very specific manner to maintain the requirement of data independence in that all blocks should be computed in a functionally-independent manner. This allows the implementation greater flexibility in optimization choices. These informal requirements include:

• Avoid using global shared memory.
• Avoid using blocking access protected mechanisms (semaphores, etc).
• Avoid using dynamic memory allocation functions at runtime.
• Avoid calling other functions internally in the tile function.
• Do try to write the your loops in an efficient manner for your compiler environment.