
Re: [Public WebGL] WEBGL_get_buffer_sub_data_async



In the WebGL working group meeting today, we were able to agree to move forward on prototyping Jeff's proposal. Some of the performance concerns have been tentatively addressed:

* We should eventually be able to eliminate a memcpy: (a) in multi-process browsers, by using SharedArrayBuffers backed by IPC shared memory, and (b) in single-process browsers, by using SharedArrayBuffers backed by glMapBuffer memory. This might be exposed to WebGL as essentially a read-only MapBuffer operation, perhaps for GL_*_READ buffers only. Ideally, in both cases the SharedArrayBuffer would be read-only, which is not a concept that currently exists.
* The WEBGL_promises proposal can still be used in the same way as before, with the resulting Sync objects.

On Wed, Dec 20, 2017 at 8:21 PM Jeff Gilbert <jgilbert@mozilla.com> wrote:
A) An app demonstrates intent by successfully polling any
newer-than-enqueued-write fence, which indicates that the app may want
to read back from any now-known-completed write. This only adds memory
traffic if the user dispatches copies they later realize they can
cancel. This would be uncommon, but if it showed up in the wild,
BufferData(null) would serve as an invalidation hint, allowing for
cancellation. I don't think I would bother with adding this
invalidation unless someone showed up with a symptomatic application.
It's also just good practice to use a rotating queue of readback
buffers, to allow for pipelining. (that or 'dropping frames' of
readback)
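
A rotating queue of readback buffers might look roughly like the sketch below. It is only an illustration: `ReadbackOps`, `ReadbackRing`, and the three callbacks are hypothetical names, not part of any proposal. In a real app, `begin` would enqueue the write and call fenceSync, `isSignaled` would do a zero-timeout clientWaitSync, and `read` would call getBufferSubData on that slot's buffer.

```typescript
// Hypothetical helper; the callbacks are placeholders for the app's own GL code.
interface ReadbackOps<F, T> {
  begin(slot: number): F;        // enqueue a write into this slot's buffer, return its fence
  isSignaled(fence: F): boolean; // non-blocking poll of that fence
  read(slot: number): T;         // copy the data out of this slot's buffer
}

class ReadbackRing<F, T> {
  private fences: (F | null)[] = [];
  private head = 0;  // index of the oldest in-flight slot
  private count = 0; // number of slots currently in flight

  constructor(private ops: ReadbackOps<F, T>, private size: number) {
    for (let i = 0; i < size; i++) this.fences.push(null);
  }

  // Start a readback if a slot is free; otherwise "drop the frame" of readback.
  tryBegin(): boolean {
    if (this.count === this.size) return false;
    const slot = (this.head + this.count) % this.size;
    this.fences[slot] = this.ops.begin(slot);
    this.count += 1;
    return true;
  }

  // Return the oldest result if its fence has signaled, else null. Never blocks.
  tryRead(): T | null {
    if (this.count === 0) return null;
    const fence = this.fences[this.head] as F;
    if (!this.ops.isSignaled(fence)) return null;
    const result = this.ops.read(this.head);
    this.fences[this.head] = null;
    this.head = (this.head + 1) % this.size;
    this.count -= 1;
    return result;
  }
}
```

Calling tryBegin once per frame and tryRead until it returns null keeps several readbacks in flight without ever waiting on the GPU.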

B) It's really not a lot of tracking. The local copy is either valid
(past a previous fence) or pending (not yet hit new enough fence). The
patch(es) that track and warn about misuse in Firefox are pretty
simple: https://bugzilla.mozilla.org/show_bug.cgi?id=1425488 It's an
extra u64 for the context, each fence object, and each buffer.
Implementing this in a client-server arch just means adding a
refcounted SharedArrayBuffer to each buffer, and having the fence
manager (similar to what you must already have for query management)
handle map+memcpy when a buffer becomes up-to-date with respect to
now-past fences.
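
To make the valid/pending distinction concrete, here is one way the counters could fit together. This is a guess at the shape of the bookkeeping, not a transcription of the Firefox patch: all names are hypothetical, and a plain number stands in for the u64.

```typescript
// Hypothetical model of the tracking; names are illustrative only.
interface TrackedBuffer { lastWriteId: number; } // id of the latest write enqueued into it
interface TrackedFence { fenceId: number; }      // newest write id the fence covers

class TrackedContext {
  writeCounter = 0;     // id of the most recently enqueued write
  completedThrough = 0; // newest write id known complete via a polled fence

  createBuffer(): TrackedBuffer {
    return { lastWriteId: 0 };
  }

  // Call when a command that writes into `buf` is enqueued (e.g. readPixels into a PBO).
  enqueueWrite(buf: TrackedBuffer): void {
    this.writeCounter += 1;
    buf.lastWriteId = this.writeCounter;
  }

  // A new fence covers every write enqueued so far.
  createFence(): TrackedFence {
    return { fenceId: this.writeCounter };
  }

  // Call when the app successfully polls `fence` as signaled.
  onFenceSignaled(fence: TrackedFence): void {
    if (fence.fenceId > this.completedThrough) {
      this.completedThrough = fence.fenceId;
    }
  }

  // Valid: the buffer's latest write is behind a completed fence.
  // Pending otherwise, so a read now could stall and deserves a warning.
  isLocalCopyValid(buf: TrackedBuffer): boolean {
    return buf.lastWriteId <= this.completedThrough;
  }
}
```

Under this model, a write enqueued after the fence was created makes the buffer pending again, even if that older fence has already been polled.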

C) This only applies to a portion of the proposal marked as optional
and not blocking discussion of the rest of the proposal. Reviving
MapBufferRange also satisfies this, as well as giving us reliable spec
language for free, and exposing it in a form largely familiar to
existing graphics developers.

Our underlying APIs already support all this, so I would really prefer
to stick close to our parent specs.

On Wed, Dec 20, 2017 at 7:32 PM, Ken Russell <kbr@google.com> wrote:
> It's true that it is possible to optimize the current synchronous
> GetBufferSubData API so that in the ideal case it runs much more quickly.
> Jeff demonstrated on the working group's internal mailing list how this
> could work:
>
> 1) The browser maintains a CPU-side shadow copy of all buffers that were
> allocated with GL_*_READ usage.
>
> 2) The user calls FenceSync after any operation that modifies one of these
> buffers. For example, ReadPixels calls targeting a pixel buffer object
> (PBO), or draw calls performing transform feedback into one or more buffers.
>
> 3) The browser uses that FenceSync call, combined with the usage parameter
> of that buffer, as a hint that it should begin asynchronously polling for
> the completion of that fence. Once completed, it internally calls MapBuffer,
> memcpys the result into the shadow copy, and then unmaps the buffer. At that
> point it signals the user-visible fence as completed.
>
> 4) The user polls that fence until it's completed. At that point, the user
> calls GetBufferSubData, which memcpys from the shadow copy to user-visible
> memory without blocking.
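
From the application's side, steps 2 and 4 might look roughly like the sketch below. The gl calls are the standard WebGL 2 ones; `GlReadback`, `beginReadback`, and `tryFinishReadback` are hypothetical names, and the interface is a hand-written subset so the example type-checks without DOM typings. It assumes `pbo` was allocated with a GL_*_READ usage such as STREAM_READ, and that the poll is retried on later frames rather than in a tight loop.

```typescript
// Hand-written subset of WebGL2RenderingContext, for illustration only.
interface GlReadback {
  readonly PIXEL_PACK_BUFFER: number;
  readonly SYNC_GPU_COMMANDS_COMPLETE: number;
  readonly ALREADY_SIGNALED: number;
  readonly CONDITION_SATISFIED: number;
  readonly RGBA: number;
  readonly UNSIGNED_BYTE: number;
  bindBuffer(target: number, buffer: unknown): void;
  readPixels(x: number, y: number, width: number, height: number,
             format: number, type: number, offset: number): void;
  fenceSync(condition: number, flags: number): unknown;
  clientWaitSync(sync: unknown, flags: number, timeout: number): number;
  deleteSync(sync: unknown): void;
  getBufferSubData(target: number, srcByteOffset: number, dstData: ArrayBufferView): void;
}

// Step 2: enqueue a write into the PBO, then insert a fence after it.
function beginReadback(gl: GlReadback, pbo: unknown, width: number, height: number): unknown {
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, pbo);
  gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, 0);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
  return gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
}

// Step 4: poll with a zero timeout; copy out only once the fence has signaled.
// Returns false while the readback is still pending.
function tryFinishReadback(gl: GlReadback, pbo: unknown, sync: unknown, dst: Uint8Array): boolean {
  const status = gl.clientWaitSync(sync, 0, 0);
  if (status !== gl.ALREADY_SIGNALED && status !== gl.CONDITION_SATISFIED) {
    return false;
  }
  gl.deleteSync(sync);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, pbo);
  gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, dst);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
  return true;
}
```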
>
> It's an excellent point that it's possible (at all) to make this
> synchronous, blocking API much faster; Jeff, thanks for showing that it is.
> The advantage of optimizing the current API is that carefully written code
> will get good performance while still following the existing OpenGL ES 3.0
> APIs with no additions.
>
> There are however some pitfalls with this approach.
>
> A) The user doesn't demonstrate their intent to read back from the buffer
> until they call GetBufferSubData. In order to make that call fast, the
> entire contents of these GL_*_READ buffers have to be mirrored back to the
> CPU any time the buffer is modified and a FenceSync is inserted afterward.
> Depending on how the user allocates buffers and how much they read back from
> those buffers, this may significantly increase memory traffic, and slow down
> applications.
>
> B) A lot of tracking has to be added in order to invalidate the shadow copy
> if the user modifies the buffer between their FenceSync and calling
> GetBufferSubData. Doing this will at most result in a warning as well as
> degraded performance. In Kai's extension proposal, it's an error to modify
> the buffer while an async readback is pending.
>
> C) Because the shadow copy is hidden in the WebGL implementation, it's not
> possible to bypass it and eliminate one copy. Kai's extension proposal
> actually supports this, because the asynchronous intent is expressed
> directly by the user, as is the destination buffer, in the form of a
> SharedArrayBuffer.
>
> Fundamentally, readback from the GPU needs to be asynchronous at some level
> in order to be efficient. I think I speak for the Chrome team in saying that
> we think it's best to express the asynchronous primitive directly, rather
> than try to optimize the existing synchronous primitive using asynchronous
> ones under the hood. We recognize that it'll add complexity to introduce new
> APIs and are concerned about this, too. Still, Kai's latest proposal is
> pretty minimal, and directly expresses the user's intent.
>
>
>
> On Wed, Dec 20, 2017 at 4:03 PM, Jeff Gilbert <jgilbert@mozilla.com> wrote:
>>
>>
>> With the mechanisms we already have in WebGL 2[1], we can support
>> no-stall polled-async readback from the GPU. Even in the case of
>> poorly written content, this cannot incur pipeline stalls any worse
>> than we already allow for in WebGL via non-PBO ReadPixels and
>> getBufferSubData. Note also that checking for stall-less behavior is
>> fairly (though not entirely) deterministic, since apps must explicitly
>> poll/wait on a fence before accessing the potentially-in-flight data.
>> This is what I am implementing in Firefox, since it applies to all
>> implementations, regardless of whether the implementation remotes
>> calls.
>>
>> The key to this is the understanding that buffers with usage=GL_*_READ
>> are directly mappable client-side buffers, into which (primarily)
>> copyBufferSubData and readPixels enqueue writes. After the writes are
>> known to be complete (via FenceSync(GPU_COMMANDS_COMPLETE)), since
>> these are client-side buffers, they may be immediately mapped and read
>> from.
>>
>> I do think there is room for a more ergonomic helper for handling this
>> behavior, though it's not that complicated for a library to implement
>> it.
>>
>> There is room to investigate a solution to eliminating a copy.
>> MapBufferRange does this, but the naive implementation does create
>> garbage ArrayBuffer wrappers. Note however, that if you want to copy
>> the data into some existing ArrayBuffer (like the wasm heap),
>> getBufferSubData is already copy-optimal. There is only room for
>> improvement if you want to process the data in-place, which may be
>> able to save a copy with MapBufferRange or similar in both types of
>> implementation.
>>
>> [1]: Since this is only available in WebGL 2, I have proposed
>> extensions to expose these mechanisms from WebGL 2 to WebGL 1:
>> https://github.com/KhronosGroup/WebGL/pull/2563
>>
>> On Wed, Dec 20, 2017 at 11:06 AM, Kai Ninomiya <kainino@google.com> wrote:
>> > Good point Mark, I'll add that.
>> >
>> >
>> > On Wed, Dec 20, 2017, 10:57 AM Mark Callow <khronos@callow.im> wrote:
>> >>
>> >>
>> >>
>> >> On Dec 19, 2017, at 17:38, Kai Ninomiya <kainino@google.com> wrote:
>> >>
>> >> All,
>> >>
>> >> Our new proposal for WEBGL_get_buffer_sub_data_async is finally ready.
>> >> Please take a look and send along your comments and suggestions. Feel
>> >> free to request comment access if you want to comment on the doc itself.
>> >>
>> >> Note this is a design doc and not a spec, so it will hopefully be
>> >> easier to read but may not be explicit about every edge case yet.
>> >>
>> >>
>> >> https://docs.google.com/document/d/1f65cGlfLHbKLOuvRSqTvrakNi60Swk6GCyS54v1ImKo/edit?usp=sharing
>> >>
>> >>
>> >> This doc should mention the core reason for this extension: the
>> >> inability of some WebGL implementations to support glMapBufferRange.
>> >> It should also describe how that led to gl.getBufferSubData() and then
>> >> to this proposal. As far as I can see, all the use cases listed would
>> >> be solved if glMapBufferRange were supported.
>> >>
>> >> Regards
>> >>
>> >>
>> >>     -Mark
>>
>>
>
