On Fri, Nov 4, 2011 at 11:33 AM, John Tamplin <email@example.com>
On Fri, Nov 4, 2011 at 1:56 PM, Joshua Bell <firstname.lastname@example.org>
After much (justifiable!) procrastination, I've started putting a draft spec together for this functionality, with input from Ken Russell and some other willing reviewers. I've just posted it to:
The biggest outstanding issue raised (so far) is the behavior when encoding to a buffer that isn't large enough for the string.
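To make the too-small-buffer question concrete, here is a sketch of one possible resolution: report how much was consumed and how much was written rather than throwing. (This uses the modern standard `TextEncoder.encodeInto`, not the draft API under discussion, purely as an illustration of that option.)

```javascript
// Sketch: encoding into a buffer that is too small for the string.
const encoder = new TextEncoder();
const dest = new Uint8Array(4);                       // deliberately too small
const { read, written } = encoder.encodeInto("héllo", dest);
// read    = UTF-16 code units consumed from the source string
// written = bytes written to dest; characters are never split mid-sequence
console.log(read, written); // 3 4  ("h" + "é" + "l" fills the 4 bytes)
```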
Feedback appreciated, although we will probably redirect the discussion to a more appropriate list.
Can you give use cases for detecting encodings? It seems dangerous to try to guess the encoding.
The uses I see for this functionality are things like encoding protobufs or extracting strings from mixed binary/text payloads (such as an XHR response or a WebSockets message) -- in either case, the encoding would be known by definition or you would likely identify the encoding in metadata for the message.
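A sketch of the second use case, pulling a string out of a mixed binary/text payload where the layout tells you the encoding and length. The length-prefixed layout below is an assumption for illustration, and the decode uses the standard `TextDecoder`:

```javascript
// Sketch: reading a length-prefixed UTF-8 string from a binary message
// (e.g. an ArrayBuffer from XHR or WebSockets). Assumed layout:
// [uint32 big-endian byte length][UTF-8 bytes][...rest of payload].
function readString(buffer, offset) {
  const view = new DataView(buffer);
  const byteLength = view.getUint32(offset);           // length prefix
  const bytes = new Uint8Array(buffer, offset + 4, byteLength);
  const text = new TextDecoder("utf-8").decode(bytes);
  return { text, next: offset + 4 + byteLength };      // next read position
}

// Build a sample payload to exercise it: "hello" plus a trailing binary byte.
const payload = new Uint8Array([0, 0, 0, 5, 104, 101, 108, 108, 111, 0xff]);
console.log(readString(payload.buffer, 0).text); // "hello"
```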
I absolutely agree that for new protocols that is what should be done. However, some potential users have expressed a desire to use this API for parsing data files with poorly defined string encodings, and those users rely on encoding guessing. The example given is ID3 tags in MP3 files, which are often in legacy encodings such as Big5. Current applications rely on native libraries to do encoding detection.
Guessing makes me queasy, but it's been pointed out that browsers already implement encoding detection logic, although this is usually under the end-user's control. I'm not averse to removing this from the spec, but there is demand for it.
I don't like that the NULL character cannot be sent using this mechanism -- it seems that any valid character sequence should be supported.
The spec may be unclear (and I just restructured that part, which may have made it worse).
The spec's intent is that U+0000 characters have no special meaning when encoding. U+0000 only has a special meaning when decoding, and only if explicitly requested (via byteLength < 0): you can decode with a null terminator, or decode with an explicit length. IMHO, newly designed protocols and formats should use an explicit length rather than rely on a special character, but that can't be dictated for older formats. Supporting null termination was explicitly requested to support older file formats.
Can you point out where in the spec you see that U+0000 / NULL characters can't be sent, and I'll reword it.
To do that, you could use 0xFF as the terminator for UTF-8/ASCII or U+FFFF for UTF-16 (though there is no better choice than 0x00 for Latin-1). It might also be useful for the API to support padding to a fixed length.
Fixed length is definitely supported. Padding isn't, but Typed Array buffers are born zero-filled. Is that adequate?
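That zero-fill covers the common padded-field case, sketched here with a hypothetical helper:

```javascript
// Sketch: writing a string into a fixed-length field. A freshly created
// Uint8Array is zero-filled, so the unused tail is NUL padding for free.
function writeFixedAscii(str, fieldLength) {
  const field = new Uint8Array(fieldLength);   // born zero-filled
  for (let i = 0; i < str.length && i < fieldLength; i++) {
    field[i] = str.charCodeAt(i) & 0x7f;       // ASCII only, for illustration
  }
  return field;
}

writeFixedAscii("hi", 4); // Uint8Array [ 104, 105, 0, 0 ]
```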
Mostly tangential: Every time I use DataView, I keep wanting something like perl's pack/unpack built on top of it, so I think of what would be needed to support those sorts of operations.
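A minimal version of that idea, sketched as a hypothetical `unpack` helper on DataView (format characters borrowed from perl: C = uint8, n = uint16 big-endian, N = uint32 big-endian):

```javascript
// Sketch of a perl-style unpack built on DataView.
function unpack(format, buffer, offset = 0) {
  const view = new DataView(buffer);
  const out = [];
  for (const ch of format) {
    switch (ch) {
      case "C": out.push(view.getUint8(offset));  offset += 1; break;
      case "n": out.push(view.getUint16(offset)); offset += 2; break; // BE
      case "N": out.push(view.getUint32(offset)); offset += 4; break; // BE
      default:  throw new Error(`unknown format char: ${ch}`);
    }
  }
  return out;
}

const buf = new Uint8Array([1, 0, 2, 0, 0, 0, 3]).buffer;
unpack("CnN", buf); // [1, 2, 3]
```

Strings are the missing piece that would make this genuinely useful, which is where the proposed encoding API would slot in.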
John A. Tamplin
Software Engineer (GWT), Google