
Re: [Public WebGL] String/ArrayBuffer encoding/decoding API (follow-up)

I find the semantics a bit awkward. Let's look at how Python does it:

unicodestring = somearray.decode('utf-8')
otherarray = unicodestring.encode('utf-8')

Frequently what you have is an array (a file, a network protocol, what have you) that contains strings, so you'd do something along these lines:

unicodestring = somearray[123:456].decode('utf-8')

For decoding, in this specification this would work like this:

var bytes = new Uint8Array(buffer, 123, 333);
var decoder = new TextDecoder('utf-8');
var unicodestring = decoder.decode(bytes);

Or, if you prefer a one-liner:

var unicodestring = (new TextDecoder('utf-8')).decode(new Uint8Array(buffer, 123, 333));

Encoding, where you put some piece of data into bytes, would go like this in Python:

result += unicode.encode('utf-8')

But that's not quite a fair comparison, because typed arrays in JS don't have concatenation (why not?). The Python equivalent using typed arrays (borrowing from ctypes) would be something like this (pseudocode):

result = (c_byte*1024)();
result[123:456] = (c_byte*333)(unicodestring.encode('utf-8'))

In this specification this would work like this:

result = new Uint8Array(1234);
result.set((new TextEncoder('utf-8')).encode(unicodestring));
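
To make the comparison with the Python slice assignment concrete, here is a minimal sketch of the encode-and-copy step with an explicit destination offset, followed by a decode of the same byte range as a check. The sample string, the offset of 123 and the variable names are illustrative assumptions, and TextEncoder is constructed without an argument since it produces UTF-8.

```javascript
// Sample text (an assumption for illustration); contains two non-ASCII characters.
const sampleText = 'héllo wörld';

// TextEncoder produces UTF-8 bytes as a Uint8Array.
const encodedBytes = new TextEncoder().encode(sampleText);

// Destination buffer, same size as in the example above.
const target = new Uint8Array(1234);

// Copy the encoded bytes into the destination starting at byte 123.
const writeOffset = 123;
target.set(encodedBytes, writeOffset);

// Decode the same range again to confirm the round trip.
const range = new Uint8Array(target.buffer, writeOffset, encodedBytes.length);
const decodedText = new TextDecoder('utf-8').decode(range);
console.log(decodedText === sampleText);
```

Note that the encoded length is only known after encoding, so the caller has to check that it fits into the destination before calling set().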

That's almost as concise as Python, but you still need to allocate those encoder objects.

So some changes to the API could simplify this nicely, therefore I propose:

With that, the examples from above would simplify to:

unicodestring = new Uint8Array(buffer, 123, 333).getString('utf-8');
unicodestring = someview.getString('utf-8', 123, 333);

result.set(unicodestring.encode('utf-8'), 123)
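
The simplified calls above are not part of any shipping API. As a rough sketch of how they could be approximated today, the free functions below wrap TextDecoder and TextEncoder. The names getString and encodeString, their parameter order and their defaults are hypothetical, chosen only to mirror the examples, and nothing is added to built-in prototypes.

```javascript
// Hypothetical helper: decode a byte range of any ArrayBuffer view
// (typed array or DataView) into a string.
function getString(view, encoding = 'utf-8', byteOffset = 0,
                   byteLength = view.byteLength - byteOffset) {
  // Offsets are relative to the view, so add the view's own offset.
  const bytes = new Uint8Array(view.buffer, view.byteOffset + byteOffset, byteLength);
  return new TextDecoder(encoding).decode(bytes);
}

// Hypothetical helper: one-shot encode to UTF-8 bytes.
function encodeString(str) {
  return new TextEncoder().encode(str);
}

// Usage: write a string at byte 123, then read it back two ways.
const backing = new ArrayBuffer(1024);
const out = new Uint8Array(backing);
const payload = encodeString('WebGL');
out.set(payload, 123);

const fromTypedArray = getString(out, 'utf-8', 123, payload.length);
const fromDataView = getString(new DataView(backing), 'utf-8', 123, payload.length);
console.log(fromTypedArray, fromDataView);
```

The helpers still allocate a decoder or encoder per call internally. The sketch only shows the shape of the call sites, not a cheaper implementation.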

On a technical note, there's no reason to have a TextDecoder/TextEncoder instance. String decoders/encoders are not stateful objects: they hold no resources, and they operate in one shot with at most a 4-byte lookahead on a string of data. A switch that selects a code path based on the encoding name is entirely sufficient for every case of converting bytes to strings and vice versa.
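
As an illustration of the "switch on the encoding name" idea, here is a minimal sketch of a one-shot decode function. The name decodeBytes and the choice of supported encodings are assumptions. The UTF-16LE path is written by hand to show a separate code path, while the UTF-8 path simply delegates to the existing TextDecoder.

```javascript
// Hypothetical one-shot decoder that picks a code path by encoding name.
function decodeBytes(bytes, encoding) {
  switch (encoding.toLowerCase()) {
    case 'utf-16le': {
      // Combine each little-endian byte pair into one UTF-16 code unit.
      // A trailing odd byte is ignored in this sketch.
      let text = '';
      for (let i = 0; i + 1 < bytes.length; i += 2) {
        text += String.fromCharCode(bytes[i] | (bytes[i + 1] << 8));
      }
      return text;
    }
    case 'utf-8':
      // Delegate to the platform decoder rather than reimplementing UTF-8.
      return new TextDecoder('utf-8').decode(bytes);
    default:
      throw new RangeError('unsupported encoding: ' + encoding);
  }
}

const utf16Text = decodeBytes(new Uint8Array([0x68, 0x00, 0x69, 0x00]), 'utf-16le');
const utf8Text = decodeBytes(new Uint8Array([0xc3, 0xa9]), 'utf-8');
console.log(utf16Text, utf8Text);
```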

On Thu, Jan 29, 2015 at 12:26 AM, Joshua Bell <jsbell@google.com> wrote:
In the very distant past [1] there was discussion about APIs for encoding/decoding string data from ArrayBuffers/DataViews. This resulted in an API being defined as part of the Encoding Living Standard [2]. 

Chrome, Firefox and Opera have been shipping this API for about a year now. A polyfill is also available [3].

The important stuff is inter-operable. A few new attributes/flags have been specified but not yet implemented in all browsers. Browsers have also not all converged with the spec for handling every code point of all legacy encodings identically, but we're working on it.

There are links at the top of the spec for feedback, but since discussion started here I wanted to close the loop.

[2] https://encoding.spec.whatwg.org
[3] https://github.com/inexorabletash/text-encoding