
Re: [Public WebGL] String support in DataView

Note that Blob already supports decoding strings from named encodings, and encoding to UTF-8 Blobs (the latter isn't yet widely implemented and the BlobBuilder API may change, but it's specified). It'd be good to have parity between these APIs, e.g. in how invalid code points are handled via U+FFFD replacement.

I'd suggest the following signatures:

> static DOMString decode(in any array, optional unsigned long byteOffset, optional unsigned long byteLength, optional DOMString encoding) raises(DOMException);
> static unsigned long encode(DOMString value, in any array, optional unsigned long byteOffset, optional DOMString encoding) raises(DOMException);
> static unsigned long encodedLength(DOMString value, optional DOMString encoding) raises(DOMException);

If encoding is omitted, default to UTF-8. This also consistently puts the input as the first argument, which I think is more intuitive than always putting the ArrayBuffer first. This simplifies common cases:

> var s1 = StringEncoding.decode(array1);
> StringEncoding.encode("string", array2);
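
To make the proposed semantics concrete, here's a minimal sketch of that surface layered on the standard TextEncoder/TextDecoder. It handles UTF-8 only (the optional encoding argument is dropped), and it assumes `array` is a Uint8Array rather than "any array"; the method names come from the proposal, but the bodies are illustrative, not spec:

```javascript
const StringEncoding = {
  // decode(array, byteOffset, byteLength) -> string
  decode(bytes, byteOffset = 0, byteLength = bytes.length - byteOffset) {
    return new TextDecoder("utf-8").decode(
      bytes.subarray(byteOffset, byteOffset + byteLength));
  },
  // encode(value, array, byteOffset) -> number of bytes written
  encode(value, bytes, byteOffset = 0) {
    const encoded = new TextEncoder().encode(value);
    bytes.set(encoded, byteOffset); // throws RangeError if it doesn't fit
    return encoded.length;
  },
  // encodedLength(value) -> bytes needed to encode value
  encodedLength(value) {
    return new TextEncoder().encode(value).length;
  },
};

const buf = new Uint8Array(16);
const written = StringEncoding.encode("héllo", buf); // 6: "é" is two bytes
const s1 = StringEncoding.decode(buf, 0, written);   // "héllo"
```

Returning the byte count from encode is what lets callers pack several strings into one buffer by advancing the offset.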

In order to expose the amount of data consumed when decoding null-terminated strings, null-termination could be moved to a separate method:

> static int stringLength(in any array, optional unsigned long byteOffset, optional unsigned long byteLength, optional DOMString encoding) raises(DOMException);

> int len = StringEncoding.stringLength(array3);
> var s3 = StringEncoding.decode(array3, 0, len);

decode's byteOffset would no longer accept negative numbers.
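
A sketch of what stringLength might do for encodings with a single-byte NUL terminator (the name is from the proposal; the body and the Uint8Array assumption are mine, and for UTF-16 the terminator is two zero bytes, which is exactly why the method needs the encoding argument):

```javascript
// Find the byte length of a NUL-terminated string within [byteOffset,
// byteOffset + byteLength). Single-byte terminator only (UTF-8, ASCII).
function stringLength(bytes, byteOffset = 0,
                      byteLength = bytes.length - byteOffset) {
  const end = byteOffset + byteLength;
  for (let i = byteOffset; i < end; i++) {
    if (bytes[i] === 0) return i - byteOffset; // bytes before the NUL
  }
  return byteLength; // no terminator: the whole span
}

// A C-style string embedded in a larger buffer:
const packet = new Uint8Array([0x61, 0x62, 0x63, 0x00, 0x7a]); // "abc\0z"
const len = stringLength(packet);                             // 3
const s3 = new TextDecoder().decode(packet.subarray(0, len)); // "abc"
```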

On Fri, Nov 4, 2011 at 2:53 PM, Joshua Bell <jsbell@chromium.org> wrote:
> However, some potential users have expressed the desire to use this API for parsing data files that have poorly defined string encodings, and they make use of encoding guessing. The example given is ID3 tags in MP3 files, which are often e.g. in Big5 encoding.

You mention HTML's encoding sniffing algorithm [1], but I don't think that's useful here. That algorithm reads the beginning of the file and looks for various <meta> tags; it's very HTML-specific. (The main other thing it does is default to encodings based on the user's locale, which is purely for backwards-compatibility with legacy HTML; that definitely shouldn't be done here. It also handles UTF-8/16LE/16BE detection, as does Blob's FileReader.readAsText.)

Encoding detection for broken ID3 tags is probably heuristic: figure out which encodings can cleanly decode the tags, then linguistically examine the results to resolve ties. That's orders of magnitude harder to specify, and the result wouldn't be as useful, since it would be a set-in-stone heuristic; it could never be improved.

I think this should be left out. Instead, just make sure the conversion API is good enough to implement these heuristics efficiently in JavaScript. For example, it already lets you quickly decode the bytes with many candidate encodings, examine the results to see how many U+FFFD replacement characters each contains, and so on.
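
The U+FFFD-counting part of such a heuristic can be sketched with the standard TextDecoder standing in for the proposed decode; guessEncoding and its candidate list are illustrative names, not part of any proposal:

```javascript
// Rank candidate encodings by how many U+FFFD replacement characters
// decoding produces; fewer replacements suggests a better guess.
function guessEncoding(bytes, candidateLabels) {
  let best = null;
  for (const label of candidateLabels) {
    let decoder;
    try {
      decoder = new TextDecoder(label);
    } catch (e) {
      continue; // label unsupported in this runtime
    }
    const text = decoder.decode(bytes);
    const bad = [...text].filter((c) => c === "\uFFFD").length;
    if (best === null || bad < best.bad) best = { label, text, bad };
  }
  return best;
}

// UTF-8 bytes for "hé": valid UTF-8, but an odd byte count, so a
// UTF-16LE decode ends on a lone trailing byte and yields a U+FFFD.
const tagBytes = new TextEncoder().encode("hé");
const guess = guessEncoding(tagBytes, ["utf-8", "utf-16le"]);
console.log(guess.label); // "utf-8"
```

Encodings where every byte sequence decodes to something (windows-1252, say) never produce U+FFFD, so ties are common; that's where the linguistic examination comes in, and why this belongs in script rather than in the spec.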

[1] http://dev.w3.org/html5/spec/parsing.html#encoding-sniffing-algorithm

Glenn Maynard