OpenVX Graph Pipelining Extension
Introduction

Purpose

Enable multiple initiations of a given graph with different inputs and outputs. Additionally, this extension provides a mechanism to execute a graph such that the application does not need to be involved in reconfiguring data and restarting graph processing for each set of input/output data.

Acknowledgements

This specification would not be possible without the contributions of the following individuals (a partial list) from the Khronos Working Group and the companies they represented at the time:

  • Kedar Chitnis - Texas Instruments, Inc.
  • Jesse Villareal - Texas Instruments, Inc.
  • Radhakrishna Giduthuri - AMD
  • Tomer Schwartz - Intel
  • Frank Brill - Cadence Design Systems
  • Thierry Lepley - Cadence Design Systems

Background and Terminology

This section introduces the concepts of graph pipelining, streaming and batch processing before getting into the details of how OpenVX is extended to support these features.

Graph Pipelining

To demonstrate what is meant by pipelined execution, consider the following example system, which executes a simple graph in a distributed manner:

pipe_soc.png

In this example, there are three compute units: an Image Signal Processor (ISP) HWA, a Digital Signal Processor (DSP), and a CPU. The example graph likewise has three nodes, generically labelled Node 0, Node 1, and Node 2. There could be more or fewer nodes than compute units, but here the number of nodes happens to be equal to the number of compute units. In this graph, Node 0 is executed on the ISP, Node 1 is executed on the DSP, and Node 2 is executed on the CPU. Without pipelining enabled, the execution timeline of this graph is shown below:

pipe_nopipelining.png

Assuming each node takes 33ms to execute, the full graph takes 99ms to execute. Without this extension, OpenVX requires that a second frame cannot start graph execution on this same graph until the first graph execution has completed. This means that the maximum throughput of this example is one frame completing every 99ms. However, in this example, you can see that each compute unit is utilized no more than one-third of the time. Furthermore, if the camera input produced a frame every 33ms, then two out of every three frames would need to be "dropped" by the system, since this OpenVX graph implementation cannot keep up with the input frame rate of the camera.
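The arithmetic above can be checked with a short calculation. The constants below simply restate the example's assumptions (three 33ms nodes, a camera delivering a frame every 33ms); they are illustrative numbers, not part of the extension API.

```c
/* Illustrative numbers from the example above. */
#define NUM_NODES        3
#define NODE_TIME_MS     33
#define CAMERA_PERIOD_MS 33

/* Without pipelining, the graph must finish before the next execution
 * can start, so the frame interval is the sum of all node times. */
static int sequential_interval_ms(void)
{
    return NUM_NODES * NODE_TIME_MS;
}

/* Camera frames that arrive while the graph is still busy are dropped. */
static int frames_dropped_per_execution(void)
{
    return sequential_interval_ms() / CAMERA_PERIOD_MS - 1;
}
```

With these numbers, the sequential interval is 99ms and two of every three camera frames are dropped, matching the text.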

Pipelining the graph execution increases both the hardware utilization and the throughput of the OpenVX implementation. These effects can be seen in the timeline of a pipelined execution of the graph below:

pipe_pipelining.png

Here, the latency of the graph is still 99ms, but the throughput has been increased to one frame completing every 33ms, allowing the graph to run in real-time with the camera frame-rate.

Now, in this simple example, a number of assumptions were made in order to illustrate the concept. We assumed that each node took the same amount of time, so pipelining appeared to raise core utilization from 33% to 100%. In practice, this ideal is almost never achieved: processing times vary across both kernels and cores. So although pipelining may bring increased utilization and throughput, the actual frame rate will be determined by the execution time of the pipeline stage with the longest execution time.
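The last point can be expressed directly: once the pipeline is full, the steady-state frame interval equals the time of the slowest stage. The stage times below are hypothetical, chosen only to illustrate the rule.

```c
/* Steady-state frame interval of a full pipeline: the slowest stage
 * gates the throughput. Stage times are illustrative only. */
static int pipelined_interval_ms(const int stage_ms[], int num_stages)
{
    int worst = 0;
    for (int i = 0; i < num_stages; i++) {
        if (stage_ms[i] > worst)
            worst = stage_ms[i];
    }
    return worst;
}
```

With equal 33ms stages the interval is 33ms, as in the idealized example; with uneven stages such as 20ms, 45ms, and 30ms, the interval is 45ms regardless of how lightly the other cores are loaded.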

In order to enable pipelining, the implementation must provide a way for the application to update the input and output data for future executions of the graph while previously scheduled graph executions are still in the executing state. Likewise, the implementation must allow scheduling and starting of graph executions while previously scheduled executions are still in progress. The Pipelining and Batch Processing section introduces new APIs and gives code examples for how this extension enables this basic pipelining support. The Event Handling section extends this with events that give the application control over when to exchange frames and schedule new frames.
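As a preview of the APIs covered in the Pipelining and Batch Processing section, the following fragment sketches the typical enqueue/dequeue pattern. It is a minimal, non-compilable sketch: `graph`, `NUM_BUF`, `input_images`, `output_images`, and the graph-parameter indices (0 = input, 1 = output) are assumed to have been set up by the application, and error checking is omitted.

```c
/* Sketch only: describe the buffer queues for two graph parameters. */
vx_graph_parameter_queue_params_t q[2];
q[0].graph_parameter_index = 0;
q[0].refs_list_size        = NUM_BUF;
q[0].refs_list             = (vx_reference *)input_images;
q[1].graph_parameter_index = 1;
q[1].refs_list_size        = NUM_BUF;
q[1].refs_list             = (vx_reference *)output_images;

/* AUTO mode: enqueuing ready references also schedules the graph. */
vxSetGraphScheduleConfig(graph, VX_GRAPH_SCHEDULE_MODE_QUEUE_AUTO, 2, q);
vxVerifyGraph(graph);

/* Prime the pipeline with several buffers. */
for (i = 0; i < NUM_BUF; i++) {
    vxGraphParameterEnqueueReadyRef(graph, 0, (vx_reference *)&input_images[i], 1);
    vxGraphParameterEnqueueReadyRef(graph, 1, (vx_reference *)&output_images[i], 1);
}

/* Steady state: dequeue completed buffers, reuse them, enqueue again. */
while (running) {
    vx_image  in, out;
    vx_uint32 num;
    vxGraphParameterDequeueDoneRef(graph, 0, (vx_reference *)&in, 1, &num);
    vxGraphParameterDequeueDoneRef(graph, 1, (vx_reference *)&out, 1, &num);
    /* ... refill 'in' with new data, consume 'out' ... */
    vxGraphParameterEnqueueReadyRef(graph, 0, (vx_reference *)&in, 1);
    vxGraphParameterEnqueueReadyRef(graph, 1, (vx_reference *)&out, 1);
}
vxWaitGraph(graph);
```

The key property is that enqueue and dequeue calls may be made while earlier executions are still in flight, which is exactly what sequential vxProcessGraph usage disallows.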

Graph Batch Processing

Batch processing refers to the ability to execute a graph on a group, or batch, of input and output references. The user provides a list of input and output references, and a single graph schedule call processes all of the data without further intervention from the user application. Providing a batch of input and output references allows the implementation to potentially parallelize the graph executions across the input/output references, achieving higher overall throughput and performance than sequentially executing the graph for each input/output reference.

graph_batch_processing.png

The Pipelining and Batch Processing section introduces new APIs and gives code examples for how this extension enables batch processing support.
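As a preview, the fragment below sketches batch processing with the manual queueing mode, where one schedule call covers every enqueued input/output pair. It is a non-compilable sketch: `graph`, `q`, `BATCH_SIZE`, `input_images`, `output_images`, and `num_done` are assumed to be set up by the application, with graph parameter 0 as input and 1 as output, and error checking is omitted.

```c
/* Sketch only: MANUAL mode defers execution until vxScheduleGraph. */
vxSetGraphScheduleConfig(graph, VX_GRAPH_SCHEDULE_MODE_QUEUE_MANUAL, 2, q);
vxVerifyGraph(graph);

/* Enqueue the whole batch up front. */
vxGraphParameterEnqueueReadyRef(graph, 0, (vx_reference *)input_images, BATCH_SIZE);
vxGraphParameterEnqueueReadyRef(graph, 1, (vx_reference *)output_images, BATCH_SIZE);

/* A single schedule call processes every enqueued input/output pair. */
vxScheduleGraph(graph);
vxWaitGraph(graph);

/* Retrieve the completed output references. */
vxGraphParameterDequeueDoneRef(graph, 1, (vx_reference *)output_images,
                               BATCH_SIZE, &num_done);
```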

Graph Streaming

Graph streaming refers to the ability of the OpenVX implementation to automatically handle graph input and output updates and re-schedule each frame without intervention from the application. The concept of graph streaming is orthogonal to graph pipelining. Pipelining can be enabled or disabled on a graph which has streaming enabled or disabled, and vice-versa.

In order to enable graph streaming, the implementation must provide a way for the application to enter and exit this streaming mode. Additionally, the implementation must manage the exchange of inputs and outputs with upstream and downstream components outside of the OpenVX implementation. This can be handled with the concept of SOURCE nodes and SINK nodes.

A SOURCE node is a node which coordinates the supply of input into the graph from upstream components (such as a camera), and a SINK node is a node which coordinates the handoff of output from the graph to downstream components (such as a display).

pipe_sourcesink.png

The Streaming section introduces new APIs and gives code examples for how this extension enables this basic streaming support.
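As a preview of those APIs, the fragment below sketches the streaming life cycle. It is a non-compilable sketch: `graph` and `source_node` are assumed to have been created by the application, the choice of the source node as the trigger node is an assumption for illustration, and error checking is omitted.

```c
/* Sketch only: enable streaming before graph verification, naming the
 * node whose completion triggers each new graph execution. */
vxEnableGraphStreaming(graph, source_node);
vxVerifyGraph(graph);

/* The graph now re-executes continuously without application involvement;
 * the source node pulls each new input, the sink node delivers output. */
vxStartGraphStreaming(graph);

/* ... application performs other work while the graph streams ... */

/* Blocks until the currently executing iteration completes. */
vxStopGraphStreaming(graph);
```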