
Re: [Public WebGL] Fwd: Shared Resources revisited

Chris, can you elaborate on what calls exactly you are timing? Is it the shader compile plus the program link? For Cloud Party I have found that the shader compile times dominate on Mac, while the program link times dominate on Windows/ANGLE. I'm guessing because of the extra shader optimizations done by DirectX.

I also noticed on Mac/Chrome that one of the shader compiles will always take dramatically longer than the others (730ms instead of 10ms), but it is not consistent which shader that happens to. It is always one of the later shaders to get compiled though, suggesting some sort of queue flushing or garbage collecting is going on.
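
One way to tell which call is actually expensive is to time the compile and link steps separately. The sketch below is only an illustration: `GlSubset`, `ShaderStage` and `timeProgramBuild` are invented names, the interface is declared locally so the example does not depend on DOM typings, and it assumes that reading COMPILE_STATUS or LINK_STATUS forces any work the browser deferred to complete.

```typescript
// Minimal subset of the WebGL API used by this sketch (hypothetical name),
// declared locally so the example compiles without DOM typings.
interface GlSubset<S extends object, P extends object> {
  readonly COMPILE_STATUS: number;
  readonly LINK_STATUS: number;
  createShader(type: number): S | null;
  shaderSource(shader: S, source: string): void;
  compileShader(shader: S): void;
  getShaderParameter(shader: S, pname: number): unknown;
  createProgram(): P | null;
  attachShader(program: P, shader: S): void;
  linkProgram(program: P): void;
  getProgramParameter(program: P, pname: number): unknown;
}

interface ShaderStage { type: number; source: string; }
interface BuildTimings { compileMs: number; linkMs: number; }

// Builds one program and reports compile time and link time separately.
// `now` is injectable so the measurement can use any clock.
function timeProgramBuild<S extends object, P extends object>(
  gl: GlSubset<S, P>,
  stages: ShaderStage[],
  now: () => number = Date.now
): BuildTimings {
  const program = gl.createProgram();
  if (program === null) throw new Error("createProgram failed");

  const compileStart = now();
  for (const stage of stages) {
    const shader = gl.createShader(stage.type);
    if (shader === null) throw new Error("createShader failed");
    gl.shaderSource(shader, stage.source);
    gl.compileShader(shader);
    // Assumption: querying the status blocks until the compile has finished.
    if (!gl.getShaderParameter(shader, gl.COMPILE_STATUS)) {
      throw new Error("shader compile failed");
    }
    gl.attachShader(program, shader);
  }

  const linkStart = now();
  gl.linkProgram(program);
  // Assumption: querying the status blocks until the link has finished.
  if (!gl.getProgramParameter(program, gl.LINK_STATUS)) {
    throw new Error("program link failed");
  }
  const linkEnd = now();

  return { compileMs: linkStart - compileStart, linkMs: linkEnd - linkStart };
}
```

Logging the result per program would also show whether a single slow outlier falls in the compile phase or in the link phase.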

On Tue, Jun 25, 2013 at 8:18 AM, Chris Endicott <chris.endicott@jagex.com> wrote:

Here are some benchmarks of our shader compile times in the RuneScape engine. RuneScape uses a couple of hundred shader permutations at run time to handle various combinations of effects, lighting, shadows and so on. It does this to remove control flow statements from the shader source itself, as far as possible.
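
As a rough illustration of how such permutations are often produced, the sketch below prepends a different set of `#define` lines to one base source for every combination of feature flags. This is an assumption about the general technique, not RuneScape's actual code; the flag names, the base shader and `buildPermutations` are all invented for the example.

```typescript
interface ShaderPermutation { key: string; source: string; }

// Generates every on/off combination of the given feature flags and
// prepends the matching #define lines to the base shader source.
function buildPermutations(features: string[], baseSource: string): ShaderPermutation[] {
  const result: ShaderPermutation[] = [];
  const total = 1 << features.length; // 2^n combinations
  for (let mask = 0; mask < total; mask++) {
    const enabled = features.filter((_, i) => (mask & (1 << i)) !== 0);
    const defines = enabled.map((f) => "#define " + f + " 1\n").join("");
    result.push({
      key: enabled.length > 0 ? enabled.join("+") : "NONE",
      source: defines + baseSource,
    });
  }
  return result;
}

// Hypothetical base shader: the branch is resolved by the preprocessor,
// so no run-time control flow remains in the compiled shader.
const baseFragmentSource = [
  "precision mediump float;",
  "void main() {",
  "  vec3 colour = vec3(1.0);",
  "#ifdef USE_FOG",
  "  colour *= 0.5; // placeholder for a fog term",
  "#endif",
  "  gl_FragColor = vec4(colour, 1.0);",
  "}",
].join("\n");

const permutations = buildPermutations(
  ["USE_FOG", "USE_SHADOWS", "USE_SPECULAR"],
  baseFragmentSource
);
```

Three flags give 8 variants and eight flags give 256, so the number of shaders to compile grows quickly with this approach.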


Here's a table of compile times from Windows 7/Chrome 27/ANGLE:


Compile times for 131 shaders:
Total time: 20607ms
Average time: 157.3053435114504ms
Per shader type average:
Type: 0, count: 105, average time: 157.41904761904763ms
Type: 1, count: 4, average time: 53.75ms
Type: 2, count: 10, average time: 259.3ms
Type: 3, count: 4, average time: 286.25ms
Type: 4, count: 2, average time: 30ms
Type: 5, count: 2, average time: 7ms
Type: 6, count: 1, average time: 5ms
Type: 8, count: 1, average time: 31ms
Type: 11, count: 1, average time: 11ms
Type: 12, count: 1, average time: 4ms


And from Mac OSX 10.7.2/Chrome 27/OpenGL:


Compile times for 131 shaders:
Total time: 1735ms
Average time: 13.244274809160306ms
Per shader type average:
Type: 0, count: 105, average time: 13.152380952380952ms
Type: 1, count: 4, average time: 6ms
Type: 2, count: 10, average time: 21.9ms
Type: 3, count: 4, average time: 22.25ms
Type: 4, count: 2, average time: 5ms
Type: 5, count: 2, average time: 1.5ms
Type: 6, count: 1, average time: 1ms
Type: 8, count: 1, average time: 4ms
Type: 11, count: 1, average time: 2ms
Type: 12, count: 1, average time: 2ms


These two machines have similar hardware (Core i7s, recent GPUs). As you can see, it looks like the ANGLE path is potentially responsible for much of the slowness, but I guess there are plenty of other differences between the Windows and OSX graphics pipelines. (Unfortunately the desktop GL option on Windows doesn't seem to work for us, hence the different machine to test on Mac OSX).
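
To put the two sets of figures side by side, the short calculation below rebuilds each total from the per-type counts and averages quoted above and then takes the ratio. The names `TypeStat`, `totalMs`, `angleStats` and `macStats` are only for this example; the numbers are copied from the two listings.

```typescript
interface TypeStat { count: number; avgMs: number; }

// Total time is the sum over shader types of count * average time.
function totalMs(stats: TypeStat[]): number {
  let sum = 0;
  for (const s of stats) sum += s.count * s.avgMs;
  return sum;
}

// Windows 7 / Chrome 27 / ANGLE figures.
const angleStats: TypeStat[] = [
  { count: 105, avgMs: 157.41904761904763 },
  { count: 4, avgMs: 53.75 },
  { count: 10, avgMs: 259.3 },
  { count: 4, avgMs: 286.25 },
  { count: 2, avgMs: 30 },
  { count: 2, avgMs: 7 },
  { count: 1, avgMs: 5 },
  { count: 1, avgMs: 31 },
  { count: 1, avgMs: 11 },
  { count: 1, avgMs: 4 },
];

// Mac OSX 10.7.2 / Chrome 27 / OpenGL figures.
const macStats: TypeStat[] = [
  { count: 105, avgMs: 13.152380952380952 },
  { count: 4, avgMs: 6 },
  { count: 10, avgMs: 21.9 },
  { count: 4, avgMs: 22.25 },
  { count: 2, avgMs: 5 },
  { count: 2, avgMs: 1.5 },
  { count: 1, avgMs: 1 },
  { count: 1, avgMs: 4 },
  { count: 1, avgMs: 2 },
  { count: 1, avgMs: 2 },
];

// Rounded to whole milliseconds to absorb floating-point noise.
const angleTotal = Math.round(totalMs(angleStats));
const macTotal = Math.round(totalMs(macStats));
const slowdown = angleTotal / macTotal;
```

The rebuilt totals match the reported 20607ms and 1735ms, and `slowdown.toFixed(1)` gives 11.9, so the ANGLE run took roughly twelve times as long for the same 131 shaders.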


Interestingly, some shaders seem to be much more seriously affected than others - here's a snippet from a common portion of one which seems to cause some of the worst delays:




This snippet does the shadow map lookups when doing the normal render pass. As you can see it unrolls a GLSL loop in JavaScript. It doesn't seem to make much difference to compile times whether we do this or just let the compiler attempt the unrolling itself. Taking this section out of the shader seems to significantly reduce the compile times we get.
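
For readers unfamiliar with the technique, unrolling a GLSL loop from JavaScript usually means generating the repeated statements as text before the source is handed to the compiler. The sketch below is a made-up illustration of that idea and is not the RuneScape snippet: `unrollShadowLookups`, `sampleShadow`, `uShadowMap`, `vShadowCoord` and `uShadowOffsets` are all hypothetical names.

```typescript
// Emits one shadow lookup statement per sample instead of a GLSL for-loop,
// then averages the accumulated result.
function unrollShadowLookups(sampleCount: number): string {
  if (sampleCount < 1) throw new Error("sampleCount must be at least 1");
  const lines: string[] = ["float shadow = 0.0;"];
  for (let i = 0; i < sampleCount; i++) {
    lines.push(
      "shadow += sampleShadow(uShadowMap, vShadowCoord, uShadowOffsets[" + i + "]);"
    );
  }
  // toFixed(1) keeps the divisor a float literal, e.g. "4.0".
  lines.push("shadow /= " + sampleCount.toFixed(1) + ";");
  return lines.join("\n");
}

const unrolledShadowBlock = unrollShadowLookups(4);
```

The generated text would then be spliced into the fragment shader source in place of the loop.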


RuneScape's HTML5 client is currently in closed beta, but if anyone would like a RuneScape account with access to the beta for testing, please get in touch with me directly.




From: Brandon Jones [mailto:bajones@google.com]
Sent: 24 June 2013 18:16
To: Chris Endicott
Cc: Ben Vanik; public_webgl@khronos.org

Subject: Re: [Public WebGL] Fwd: Shared Resources revisited


On Mon, Jun 24, 2013 at 9:51 AM, Chris Endicott <chris.endicott@jagex.com> wrote:

Thanks for the explanation Ben, and sorry for not being clear - what I meant was that WEBGL_shared_resources, without access to WebGL on web workers, is not very useful to us on its own. We're not looking to move our main rendering onto a web worker, just the resource uploads/shader compiles which slow us down in the main thread. From the sounds of what you've written below, what we really need is both of them completed before we'll see any real opportunities for performance improvements in this area.


This is (unfortunately) correct. Enabling WebGL in a worker would allow you to compile shaders off the main thread, but give you no way to transfer them back to the main thread for use. There may be a fringe benefit in some browsers that cache shader binaries, where you could compile the shader in a worker and then recompile it in the main thread, allowing the main thread to quickly reload the cached program, but that makes massive assumptions about the browser internals and should not be relied on for, well, anything.


(If an asynchronous shader compilation extension were a simpler, quick win, on this front - I can't emphasise enough how detrimental synchronous shader compilation is to RuneScape currently - is that something that could be considered as an interim solution?)


First off, thanks for providing a real-world bottleneck for us to focus on! In fact, any further data you can provide in this direction would be useful: What kind of compile times are you seeing? Do you notice a difference between Desktop GL and ANGLE? Would it be possible to provide an example shader for internal performance testing?


When you talk about asynchronous shader compilation here, do you mean an explicit async function (i.e. gl.compileShaderAsync(shader, callback))? We've discussed the idea before (http://www.khronos.org/webgl/public-mailing-list/archives/1304/msg00140.html) and it seems the general consensus was to build a worker-centric solution that allowed finer-grained control over all resource types.





From: Ben Vanik [mailto:benvanik@google.com]
Sent: 24 June 2013 17:03
To: Chris Endicott
Cc: public_webgl@khronos.org

Subject: Re: [Public WebGL] Fwd: Shared Resources revisited


Chris: did you mean to say that shared resources would be *more interesting* than WebGL on workers? The things you listed are all benefits of shared resources *and* WebGL on workers, *not* WebGL on workers alone. Without share groups even if running in a worker you'd still have your rendering blocked on shader compilation/linking, long uploads would still steal time from your rendering frames, and since the number of workers you can practically create is limited (resource constraints are very real - workers are very expensive) you wouldn't be able to create many more for doing your texture work. You'd see very little benefit for the engineering overhead of building a transferable command buffer and would add a bunch of negatives like additional latency, data duplication, non-trivial additional memory consumption, etc.


Let me rephrase: WebGL on workers *without shared resources* only gains you framerate independence from other main-thread JavaScript tasks (xhr completes, etc) but it does not remove the blocking/variable call times of WebGL methods. Any WebGL action that makes your frames run long today will still make them run long in a worker. *Unless* you have shared resources and can split them up.


On the Google Maps side, having shared_resources would be an immediate win for us in a big way, whereas moving our WebGL rendering to a single worker will likely never be possible. We have too much dependence on the DOM matching the WebGL rendering, and synchronizing a worker and the DOM will prove impossible. We also have a tremendous amount of state that is non-trivial to serialize to typed arrays, and re-parsing/loading it in workers would use way more processing time/power (don't forget that the computation done on another thread is still computation that consumes power). Using shared resources we could shift off our texture and buffer uploads (and all of the processing that handles that) to a worker (or two) and keep our time spent on the main thread down -- that's the biggest possible win. During startup we could use a worker (or several) to do our program compiling/linking while our main thread was loading more JavaScript/etc -- a clever trick, but not much different than what Chrome does today with async compiles/links (except we'd have some determinism) and something we'd rather the browser did under the covers than make us do with expensive workers. At the end of the day what matters is short predictable frame times - we can get a lot of that win by moving the sources of extreme variability (texture/buffer generation and uploading) to workers.


Having designed several rendering engines that used GL share groups in the past, I can say that WebGL in a worker alone is not what most people want (though they don't know that, yet). The additional latency of moving input events to the background thread (we all aren't just spinning cubes, you know) and the inability to properly synchronize DOM elements makes primary rendering in a worker a non-starter in most cases, and in others just makes it significantly more difficult to write code. In real apps it will be required to support both the non-worker case for months/years after this feature starts shipping, and even then maybe for much longer depending on devices (on a 2 core machine running WebGL in a worker would likely not be a gain and may even end up hurting overall system performance) --- keeping those two radically different code paths going is not an interesting prospect for those writing non-trivial code.


So: shared resources doesn't make sense without being able to run WebGL on workers (as who would you share it with?), but WebGL on workers on its own without shared resources is largely useless for large apps targeting low-latency, high performance, resource-intensive applications. I clearly see the implementation dependence between the two features, but I'm fighting against the idea that shared resources isn't useful.


As for implementation of resource sharing, we don't really care so long as it's efficient (accepts batches of objects, etc). Limiting things to just textures/buffers is probably fine as the program compile/link stuff can be done by the browser under the covers, maybe with an additional API for that to enable deterministic completion checking.  It's important that objects are not transferred during the sharing, as otherwise there would be a ton of data duplication when trying to fan out rendering with shared buffers or textures and in most cases apps trying to do this stuff are already GPU memory constrained.



On Mon, Jun 24, 2013 at 1:19 AM, Chris Endicott <chris.endicott@jagex.com> wrote:

Whilst I think that WEBGL_shared_resources is an interesting feature, as far as our project (RuneScape) goes, availability of WebGL on web workers would be significantly more useful in terms of helping us solve some of our performance issues. In particular:


  • Being able to compile and link shaders on web workers would be the single biggest performance gain we could achieve

  • Being able to upload resources from web workers direct to WebGL, without serialising them back to the main thread, would also be a significant win for us as we construct most of our geometry on web workers

  • We would also be able to consider re-architecting other parts of our graphics engine (for example, texture atlasing, sprite management) onto workers to further improve performance


I don't know what the implementation status of either of these features is in Chrome/Firefox right now, but if there were a choice, I would vote strongly in favour of web workers over shared resources in terms of development priority.



Chris (Long time lurker on this mailing list)


From: owner-public_webgl@khronos.org [mailto:owner-public_webgl@khronos.org] On Behalf Of Benoit Jacob
Sent: 22 June 2013 20:42
To: public_webgl@khronos.org
Subject: Re: [Public WebGL] Fwd: Shared Resources revisited


On 13-06-21 01:53 PM, Gregg Tavares wrote:

So I was just about ready to land a WEBGL_shared_resources patch in Chrome


But asm.js got me thinking that there will probably be a push for making shared resources work exactly the same as OpenGL so that porting will be easier.

That, however, only applies if real-world applications that one would want to port, are already using OpenGL share groups. Are they?

Here is why I am not in a rush to start implementing WEBGL_shared_resources:

I have never heard of applications actually using share groups, at least not in ways that fundamentally require them as opposed to more fine-grained sharing mechanisms. Most of the time, what applications want to share across contexts is only a small minority of their OpenGL objects, and specifically a minority of their *textures*. For that, there exists a variety of mechanisms to allow sharing textures, which is also what browsers have to rely on anyway to composite WebGL frames in a GPU-accelerated browser compositor.

The value of focusing specifically on sharing only specific textures on a per-object opt-in basis, as opposed to sharing everything (as OpenGL share groups do) is that there are subtle performance caveats associated with OpenGL object sharing. In particular, having two OpenGL contexts in the same share group simultaneously current on two different threads is known to be a severe performance issue, as it requires certain drivers to turn on inefficient locking mechanisms.

For that reason, I expect that WEBGL_shared_resources is on a head-on collision course with other WebGL features that we're likely to implement in the near future, like WebGL-on-Web-Workers.

If I have to choose one --- I care far, far more for WebGL-on-Web-Workers than I do for WEBGL_shared_resources. It also seems that the majority of WEBGL_shared_resources use cases can be addressed by other means (e.g. targeting a single WebGL context to render to multiple canvases, as has been discussed).

Even if there was a strong need for OpenGL-like sharing of objects, I expect it'd still be mostly for sharing *textures* in which case we could make a spec allowing only to share textures with semantics that map well onto efficient texture-sharing mechanisms. But again, we haven't seen a lot of evidence for even that.


That got me thinking about what it would take to make WebGL enforce the sharing rules in the OpenGL spec.


The spec says that, to be guaranteed to see a change, you must call glFinish (or use a sync object) in the context modifying a resource and then bind or attach that resource in the context that needs to see the modification.


The problem is that many drivers require far less. Some work with glFlush/glBind, some work with just glFlush and no bind, and others require the spec-compliant glFinish/glBind. If we could figure out a way to enforce that you must use glFinish (or a sync object) and glBind in WebGL, maybe it would be okay to expose that stuff as is rather than through an extension. For WebGL 1 we'd probably also need a SYNC extension for sync objects, so that you have an alternative to the slow glFinish.


Thinking about how to implement enforcing the rules. 


  • Each context has a vector<WebGLSharedObject*> modified_objects.
  • Each WebGLSharedObject has a per context safe_to_use flag. 
  • Each WebGLSharedObject has a finished flag
  • Each WebGLSharedObject has a fence_inserted flag
  • Each WebGLSharedObject has a fence_id field
  • Each context has a newest_fence_id_waited_for field
  • There's a global fence_count
  • Anytime a resource is modified
    • if its safe_to_use flag for this context is false
      • generate INVALID_OPERATION 
      • exit these steps
    • set its safe_to_use flag to false for all contexts but the context making the modification.
    • set its finished flag to false.
    • set its fence_inserted flag to false.
    • add it to this context's modified_objects vector.
  • Anytime glFinish is called
    • set the finished flag to true for every object in the context's modified_objects vector.
    • clear the modified_objects vector.
  • Anytime a FenceSync is called
    • increment the global fence_count
    • associate the fence_count with the sync object created by FenceSync
    • assign the fence_count to every object's fence_id field in the context's modified_objects vector
    • set the fence_inserted flag to true of every object in the context's modified_objects vector
    • clear the modified_objects vector.
  • Anytime a resource is used
    • if the object's safe_to_use flag for this context is false
      • generate INVALID_OPERATION
  • Anytime a resource is bound/attached
    • if the finished flag is true
      • mark the object's safe_to_use flag for this context as true.
    • if the fence_inserted flag is true and this context's newest_fence_id_waited_for is >= the object's fence_id field
      • mark the object's safe_to_use flag for this context as true.
    • if the object's safe_to_use flag for this context is still false
      • generate INVALID_OPERATION
  • Anytime WaitSync or ClientWaitSync is called
    • set the context's newest_fence_id_waited_for to the fence_count associated with the given sync object.
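
The bookkeeping in the steps above can be sketched as a small state machine. This is only a model of the proposal, not browser code. It assumes that using or binding an object that is not safe for the calling context reports INVALID_OPERATION, that a new object starts out safe only for the context that created it, and that the field names map onto the list as noted in the comments. `ShareGroup`, `SharedObject`, `Context` and `GLResult` are names invented for the example.

```typescript
type GLResult = "NO_ERROR" | "INVALID_OPERATION";

class ShareGroup {
  fenceCount = 0;     // the global fence_count
  nextContextId = 0;  // helper for handing out context ids (not in the list)
}

class SharedObject {
  finished = false;       // finished flag
  fenceInserted = false;  // fence_inserted flag
  fenceId = 0;            // fence_id field
  safeFor: number[];      // ids of contexts whose safe_to_use flag is true
  constructor(creator: Context) { this.safeFor = [creator.id]; } // assumption
  isSafeFor(ctx: Context): boolean { return this.safeFor.indexOf(ctx.id) !== -1; }
  markSafeFor(ctx: Context): void { if (!this.isSafeFor(ctx)) this.safeFor.push(ctx.id); }
}

class Context {
  readonly id: number;
  private group: ShareGroup;
  private modifiedObjects: SharedObject[] = [];  // modified_objects
  private newestFenceIdWaitedFor = 0;            // newest_fence_id_waited_for
  constructor(group: ShareGroup) {
    this.group = group;
    this.id = group.nextContextId++;
  }

  modify(obj: SharedObject): GLResult {
    if (!obj.isSafeFor(this)) return "INVALID_OPERATION";
    obj.safeFor = [this.id];  // unsafe for every other context
    obj.finished = false;
    obj.fenceInserted = false;
    this.modifiedObjects.push(obj);
    return "NO_ERROR";
  }

  finish(): void {
    for (const obj of this.modifiedObjects) obj.finished = true;
    this.modifiedObjects = [];
  }

  // Returns the fence id associated with the new sync object.
  fenceSync(): number {
    const id = ++this.group.fenceCount;
    for (const obj of this.modifiedObjects) {
      obj.fenceId = id;
      obj.fenceInserted = true;
    }
    this.modifiedObjects = [];
    return id;
  }

  // Models both WaitSync and ClientWaitSync.
  waitSync(fenceId: number): void { this.newestFenceIdWaitedFor = fenceId; }

  use(obj: SharedObject): GLResult {
    return obj.isSafeFor(this) ? "NO_ERROR" : "INVALID_OPERATION";
  }

  bind(obj: SharedObject): GLResult {
    if (obj.finished) obj.markSafeFor(this);
    if (obj.fenceInserted && this.newestFenceIdWaitedFor >= obj.fenceId) {
      obj.markSafeFor(this);
    }
    return obj.isSafeFor(this) ? "NO_ERROR" : "INVALID_OPERATION";
  }
}
```

In this model a second context can bind an object only after the modifying context has called `finish()`, or after it has called `fenceSync()` and the second context has waited on the returned fence.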

This seems like it would work? The problems off the top of my head are


1) you can observe behavior in a worker since you can spin call BindXXX or attach until you get success as well as call GetSyncParameter


It seems like it used to be a rule that you shouldn't be able to observe the behavior of a worker but it doesn't seem like people care about that anymore?


2) You can assume your timing works? 


One of the things WEBGL_shared_resources was trying to prevent was a situation where you're running at 60fps (or 30fps) and you assume that some resource you modified in one thread will be ready in another by the next frame, so it's okay to bind. I don't think this proposal has that problem though. If you called glFinish you can assume bind will work in the other context. If you called FenceSync in one context, bind in another context will not work unless you call WaitSync or ClientWaitSync.


I suppose this proposal is not "exactly like OpenGL" but it's spec compliant OpenGL. Native apps that get errors when ported to this code were just getting lucky on the platforms they were running on.