Re: [Public WebGL] String support in DataView

On Fri, Nov 4, 2011 at 12:15 PM, John Tamplin <jat@google.com> wrote:
BTW, in re-reading the decode description, it mentions throwing an exception when writing past the end of the buffer -- cut and paste error?

Yep, fixed.
Fixed length is definitely supported. Padding isn't, but Typed Array buffers are born zero-filled. Is that adequate?

A couple of pieces that seem missing:
 - if I have a fixed 20-byte area to contain a UTF-8 string, how do I decode it?  If I pass the length as 20, I will get all the padding bytes and have to manually remove them.  If I give the length as -1 and all 20 bytes are in use, decoding will run past the end of the area.  That is why it seems useful to have a padded vs. unpadded mode separate from the buffer length.

Personally, I'd manually remove them (replace \x00 with your padding character of choice):

var str = stringEncoding.decode(buffer, 0, 20, "UTF-8").replace(/\x00+$/, '');

Glenn's suggestion of stringLength (see below) might be a good place to hang a way to specify an arbitrary termination character for the padding case. (I can imagine edge cases with e.g. UTF-16 encoding in odd-length buffers, though.)
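As a rough sketch of the "decode the full field, then strip the padding" approach: the snippet below uses TextDecoder (from the later Encoding Standard) as a stand-in for the StringEncoding.decode proposed in this thread. The helper name decodePaddedField and its padChar parameter are hypothetical and are not part of any proposal here.

```javascript
// Sketch only: TextDecoder stands in for the proposed StringEncoding.decode.
// decodePaddedField and padChar are illustrative names, not a specced API.
function decodePaddedField(buffer, byteOffset, byteLength, padChar = '\0') {
  // View exactly the fixed-size field, so decoding cannot run past its end.
  const bytes = new Uint8Array(buffer, byteOffset, byteLength);
  const text = new TextDecoder('utf-8').decode(bytes);
  // Drop trailing padding characters only; padding inside the text is kept.
  let end = text.length;
  while (end > 0 && text[end - 1] === padChar) end--;
  return text.slice(0, end);
}

// A 20-byte field at offset 4 holding "hi" followed by zero padding.
const buffer = new ArrayBuffer(32);
new Uint8Array(buffer, 4, 20).set([0x68, 0x69]);
const field = decodePaddedField(buffer, 4, 20);
```

Because the view is bounded to the field, a completely full field decodes to all 20 bytes and nothing beyond them.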

 - when writing to a fixed buffer, how do I truncate the string?  Given this API, it looks like I would have to loop through encodedLength, chopping the string until it fits.

Yes. There's an issue in the spec suggesting "partial fill" support, which would encode as much as possible and yield both bytesWritten and charactersWritten. Thoughts? (UTF-16 surrogate pairs make me cringe a bit here.)
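For illustration, the "partial fill" idea can be approximated with TextEncoder.encodeInto from the later Encoding Standard, which reports how many UTF-16 code units were read and how many bytes were written. This is not the API under discussion in this thread; the wrapper name encodePartial and the result field names are assumptions chosen to mirror the bytesWritten/charactersWritten wording above.

```javascript
// Sketch only: encodeInto (UTF-8 only) stands in for a "partial fill" encode.
// encodePartial and its result field names are hypothetical.
function encodePartial(value, buffer, byteOffset, byteLength) {
  const target = new Uint8Array(buffer, byteOffset, byteLength);
  // encodeInto stops before any character that would not fit completely.
  const { read, written } = new TextEncoder().encodeInto(value, target);
  return { charactersWritten: read, bytesWritten: written };
}

// "h\u00e9llo w" takes 8 bytes; the following "\u00f6" needs 2 more, but only 1 is left.
const result = encodePartial('h\u00e9llo w\u00f6rld', new ArrayBuffer(9), 0, 9);

// A surrogate pair (U+1F600, 4 bytes in UTF-8) is either written whole or not at all.
const tooSmall = encodePartial('a\u{1F600}', new ArrayBuffer(4), 0, 4);
const justFits = encodePartial('a\u{1F600}', new ArrayBuffer(5), 0, 5);
```

Note that charactersWritten counts UTF-16 code units here, so a character outside the BMP contributes 2.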

John A. Tamplin
Software Engineer (GWT), Google

On Fri, Nov 4, 2011 at 1:06 PM, Glenn Maynard <glenn@zewt.org> wrote:
Note that Blob already supports decoding strings from named encodings, and encoding to UTF-8 Blobs (the latter isn't yet widely implemented and the BlobBuilder API may change, but it's specced).  It'd be good to have parity between these APIs, e.g. in how invalid codepoints are handled via U+FFFD replacement.

Should we instead focus on Blob/TypedArray interop and extending the encoding support, and keep the encoding/decoding API in one place (i.e. Blob), rather than adding yet another Web API?

If not, I completely agree regarding parity.
I'd suggest the following signatures:

> static DOMString decode(in any array, optional unsigned long byteOffset, optional unsigned long byteLength, DOMString encoding) raises(DOMException);
> static unsigned long encode(DOMString value, in any array, optional unsigned long byteOffset, optional DOMString encoding) raises(DOMException);
> static unsigned long encodedLength(DOMString value, optional DOMString encoding) raises(DOMException);

If encoding is omitted, default to UTF-8.  This also consistently puts the input as the first argument, which I think is more intuitive than always putting the ArrayBuffer first.  This simplifies common cases:

> var s1 = StringEncoding.decode(array1);
> StringEncoding.encode("string", array2);

Earlier versions of this spec had the methods on DataView; the current placement of the view/buffer as the first argument is a legacy of that. I agree the above makes more sense.
In order to expose the amount of data consumed when decoding null-terminated strings, null-termination could be moved to a separate method:

> static int stringLength(in any array, optional unsigned long byteOffset, optional unsigned long byteLength, optional DOMString encoding) raises(DOMException);

> var len = StringEncoding.stringLength(array3);
> var s3 = StringEncoding.decode(array3, 0, len);

decode's byteOffset would no longer accept negative numbers.
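A minimal sketch of that two-step pattern, assuming UTF-8 (where a 0x00 byte can only be the terminator) and assuming the input is a typed array view. The stringLength function below is a hypothetical stand-in for the proposed method, and TextDecoder stands in for StringEncoding.decode.

```javascript
// Hypothetical stand-in for the proposed stringLength: the number of bytes
// before the first 0x00, or the whole range if there is no terminator.
// Only meaningful for encodings such as UTF-8 where 0x00 is never part of
// another character; UTF-16 would need a code-unit-aware scan.
function stringLength(array, byteOffset = 0, byteLength = array.byteLength - byteOffset) {
  const bytes = new Uint8Array(array.buffer, array.byteOffset + byteOffset, byteLength);
  const nul = bytes.indexOf(0);
  return nul === -1 ? byteLength : nul;
}

// Two null-terminated strings packed back to back: "abc\0de\0".
const array3 = new Uint8Array([0x61, 0x62, 0x63, 0x00, 0x64, 0x65, 0x00]);
const len = stringLength(array3);
const s3 = new TextDecoder('utf-8').decode(array3.subarray(0, len));
// The caller now knows how much data was consumed: len bytes plus the terminator.
const nextLen = stringLength(array3, len + 1);
```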

There will be an objection raised that this requires decoding twice. (The reason for the outstanding "partial fill" issue is that there are objections to having to encode twice, once to determine the length and a second time to fill the buffer.) I like this proposal, though.

Given this change, omitting byteLength (or passing undefined) to decode would then mean "decode the entire buffer" (consistent with the Typed Array APIs), rather than null termination.
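The resulting decode semantics might look roughly like the sketch below: input first, UTF-8 by default, and an omitted byteLength meaning "the rest of the buffer". This is an illustration built on TextDecoder, not the specced behavior. It assumes the input is a typed array view, and it throws RangeError where the proposed IDL says DOMException.

```javascript
// Sketch of the proposed argument order and defaults; not a specced API.
function decode(array, byteOffset = 0, byteLength, encoding = 'utf-8') {
  const available = array.byteLength - byteOffset;
  // Omitted/undefined byteLength means "decode to the end", not "stop at null".
  const length = byteLength === undefined ? available : byteLength;
  if (byteOffset < 0 || length < 0 || length > available) {
    throw new RangeError('byteOffset/byteLength out of range');
  }
  const bytes = new Uint8Array(array.buffer, array.byteOffset + byteOffset, length);
  return new TextDecoder(encoding).decode(bytes);
}

const array1 = new Uint8Array([0x61, 0x62, 0x63, 0x00, 0x64, 0x65]);
const whole = decode(array1);       // entire buffer, embedded null included
const tail = decode(array1, 4);     // from offset 4 to the end
const head = decode(array1, 0, 3);  // explicit length
```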

On Fri, Nov 4, 2011 at 2:53 PM, Joshua Bell <jsbell@chromium.org> wrote:
> However, some potential users have expressed the desire to use this API for parsing data files that have poorly defined string encodings, and they make use of encoding guessing. The example given is ID3 tags in MP3 files, which are often e.g. in Big5 encoding.

You mention HTML's encoding sniffing algorithm [1], but I don't think that's useful here.  That algorithm reads the beginning of the file and looks for various <meta> tags; it's very HTML-specific.  (The main other thing it does is default to encodings based on the user's locale, which is purely for backwards-compatibility with legacy HTML; that definitely shouldn't be done here.  It also handles UTF-8/16LE/16BE detection, as does Blob's FileReader.readAsText.)

Encoding detection for broken ID3 tags is probably heuristic: figuring out which encodings the tags can be cleanly decoded with, and linguistically examining the results to resolve ties.  That's orders of magnitude more difficult to specify, and then wouldn't be as useful, since it would be a set-in-stone heuristic that could never be improved.

I think this should be left out.  Instead, just make sure that the conversion API is good enough to allow implementing these heuristics in JavaScript efficiently.  For example, it already lets you decode the strings from many encodings quickly, examine the results to see how many invalid U+FFFD codepoints exist in each result, and so on.
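A minimal sketch of such a script-level heuristic, using TextDecoder as a stand-in for the conversion API discussed here: decode with each candidate encoding and prefer the result with the fewest U+FFFD replacement characters. The function names and the candidate list are illustrative assumptions. The candidates are limited to UTF-8 and UTF-16LE because the set of supported labels (e.g. Big5) depends on the runtime.

```javascript
// Count U+FFFD characters, which the decoder emits for invalid byte sequences.
// A genuine U+FFFD in the input would be counted too, so this is only a hint.
function countReplacements(text) {
  let count = 0;
  for (const ch of text) {
    if (ch === '\uFFFD') count++;
  }
  return count;
}

// Hypothetical helper: return the candidate that decodes with the fewest errors.
// Ties go to the earlier candidate; a real heuristic would have to resolve them.
function guessEncoding(bytes, candidates) {
  let best = null;
  for (const encoding of candidates) {
    const text = new TextDecoder(encoding).decode(bytes);
    const errors = countReplacements(text);
    if (best === null || errors < best.errors) {
      best = { encoding, text, errors };
    }
  }
  return best;
}

// "h\u00e9llo" stored as UTF-16LE; decoding it as UTF-8 hits an invalid 0xE9 byte.
const sample = new Uint8Array([0x68, 0, 0xE9, 0, 0x6C, 0, 0x6C, 0, 0x6F, 0]);
const guess = guessEncoding(sample, ['utf-8', 'utf-16le']);
```

Counting replacement characters only narrows the field: many byte sequences decode cleanly under several encodings, which is where the tie-breaking mentioned above comes in.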

[1] http://dev.w3.org/html5/spec/parsing.html#encoding-sniffing-algorithm

The -1s (myself included, when taking off the editor's hat) outnumber the +1s, so I'll remove it.