It's true that the current synchronous GetBufferSubData API can be optimized so that it runs much more quickly in the ideal case. Jeff demonstrated on the working group's internal mailing list how this could work:
1) The browser maintains a CPU-side shadow copy of all buffers that were allocated with GL_*_READ usage.
2) The user calls FenceSync after any operation that modifies one of these buffers. Examples include ReadPixels calls targeting a pixel buffer object (PBO) and draw calls performing transform feedback into one or more buffers.
3) The browser uses that FenceSync call, combined with the usage parameter of that buffer, as a hint that it should begin asynchronously polling for the completion of that fence. Once completed, it internally calls MapBuffer, memcpys the result into the shadow copy, and then unmaps the buffer. At that point it signals the user-visible fence as completed.
4) The user polls that fence until it's completed. At that point, the user calls GetBufferSubData, which memcpys from the shadow copy to user-visible memory without blocking.
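From the user's side, steps 2 and 4 might look roughly like the sketch below. The names `ReadbackGL` and `tryReadback` are hypothetical and exist only for illustration; the interface is meant to mirror a small subset of the WebGL 2 context (`getSyncParameter`, `getBufferSubData`, `bindBuffer`, `deleteSync`), and the sketch assumes the fence was created with `gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0)` right after the operation that wrote the buffer.

```typescript
// Hypothetical minimal subset of the WebGL 2 context used by this sketch.
interface ReadbackGL {
  readonly PIXEL_PACK_BUFFER: number;
  readonly SYNC_STATUS: number;
  readonly SIGNALED: number;
  bindBuffer(target: number, buffer: unknown): void;
  getSyncParameter(sync: unknown, pname: number): unknown;
  getBufferSubData(target: number, srcByteOffset: number, dstBuffer: ArrayBufferView): void;
  deleteSync(sync: unknown): void;
}

// Step 4, one polling attempt: check the fence once without blocking.
// Returns false if the fence has not signaled yet (caller retries later,
// for example on the next animation frame). Returns true after copying
// the buffer contents into dst.
function tryReadback(
  gl: ReadbackGL,
  sync: unknown,
  buffer: unknown,
  dst: ArrayBufferView,
): boolean {
  if (gl.getSyncParameter(sync, gl.SYNC_STATUS) !== gl.SIGNALED) {
    return false; // not ready; do not call getBufferSubData yet
  }
  gl.deleteSync(sync);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, buffer);
  // In the optimized implementation described above, this would be a
  // plain memcpy from the shadow copy rather than a blocking GPU readback.
  gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, dst);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
  return true;
}
```

The key property is that GetBufferSubData is only ever called after the fence reports completion, so a well-behaved application never hits the slow, blocking path.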
Jeff, thanks for showing that this synchronous, blocking API can be made much faster at all; it's an excellent point. The advantage of optimizing the current API is that carefully written code will get good performance while still following the existing OpenGL ES 3.0 APIs with no additions.
There are, however, some pitfalls with this approach.
A) The user doesn't demonstrate their intent to read back from the buffer until they call GetBufferSubData. In order to make that call fast, the entire contents of these GL_*_READ buffers have to be mirrored back to the CPU any time the buffer is modified and a FenceSync is inserted afterward. Depending on how the user allocates buffers and how much they read back from those buffers, this may significantly increase memory traffic and slow down applications.
B) A lot of tracking has to be added in order to invalidate the shadow copy if the user modifies the buffer between their FenceSync and their call to GetBufferSubData. Such a modification will at most result in a warning as well as degraded performance. In Kai's extension proposal, by contrast, it's an error to modify the buffer while an async readback is pending.
C) Because the shadow copy is hidden in the WebGL implementation, it's not possible to bypass it and eliminate one copy. Kai's extension proposal does allow that copy to be eliminated, because the asynchronous intent is expressed directly by the user, as is the destination buffer, in the form of a SharedArrayBuffer.
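To make pitfalls A through C concrete, here is a toy model of the shadow-copy bookkeeping. It is only a sketch under stated assumptions: `ShadowedReadBuffer` and all of its members are hypothetical, a `Uint8Array` stands in for GPU memory, and no real browser is claimed to work this way. It simply counts how many bytes get mirrored versus how many the user actually reads.

```typescript
// Toy model of a GL_*_READ buffer with a hidden CPU-side shadow copy.
// All names are hypothetical; a Uint8Array stands in for GPU memory.
class ShadowedReadBuffer {
  private gpu: Uint8Array;
  private shadow: Uint8Array;
  private shadowValid = false;
  bytesMirrored = 0; // total bytes copied GPU -> shadow
  bytesRead = 0;     // total bytes the user actually asked for
  warnings = 0;      // slow-path hits (stale shadow copy)

  constructor(size: number) {
    this.gpu = new Uint8Array(size);
    this.shadow = new Uint8Array(size);
  }

  // Any GPU-side write (ReadPixels into a PBO, transform feedback, ...).
  // Pitfall B: every write path must remember to invalidate the shadow.
  gpuWrite(offset: number, data: Uint8Array): void {
    this.gpu.set(data, offset);
    this.shadowValid = false;
  }

  // Called when the fence inserted after the write has signaled.
  // Pitfall A: the whole buffer is mirrored, since the implementation
  // cannot know which range the user will read, if any.
  onFenceSignaled(): void {
    this.mirror();
  }

  // Pitfall C: even on the fast path this is a second copy
  // (GPU -> shadow earlier, shadow -> dst now).
  getBufferSubData(offset: number, dst: Uint8Array): void {
    if (!this.shadowValid) {
      // Stale shadow: stands in for the warning plus blocking readback.
      this.warnings++;
      this.mirror();
    }
    dst.set(this.shadow.subarray(offset, offset + dst.length));
    this.bytesRead += dst.length;
  }

  private mirror(): void {
    this.shadow.set(this.gpu);
    this.bytesMirrored += this.gpu.byteLength;
    this.shadowValid = true;
  }
}
```

In this model, a 1024-byte buffer that is written once, fenced, and then read back 16 bytes at a time still costs a full 1024-byte mirror per fence, and a write that is not followed by a fence sends the next GetBufferSubData call down the slow path.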
Fundamentally, readback from the GPU needs to be asynchronous at some level in order to be efficient. I think I speak for the Chrome team in saying that we think it's best to express the asynchronous primitive directly, rather than try to optimize the existing synchronous primitive using asynchronous ones under the hood. We recognize that introducing new APIs adds complexity, and we are concerned about this too. Still, Kai's latest proposal is pretty minimal, and directly expresses the user's intent.