Re: [Public WebGL] String support in DataView

On Wed, Nov 9, 2011 at 12:42 PM, Joshua Bell <jsbell@chromium.org> wrote:
> I've updated the doc - http://wiki.whatwg.org/wiki/StringEncoding - to reflect the discussion on this thread, most notably:

I'd recommend against leaving the endianness of the UTF-16 encoding unspecified.  That's building interop failures right into the spec.  This feels like a premature optimization; byte swapping is quick.  Just pick a "default" endianness.  In the unlikely case where performance turns out to matter here, a property can be added later, e.g. "fastUTF16", which is a string equal to "UTF-16LE" or "UTF-16BE".

There are a fair number of differences with File API.  File API never throws depending on the input text.  Don't throw when decoding invalid sequences; replace them with U+FFFD [1].  The UTF-16 BOM handling and detection could line up better too, I think.  UTF-8 handling should probably match HTML's (link further down).

Similarly, don't throw if UTF-16 input is missing a byte; just replace the last byte with U+FFFD.
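A minimal sketch of the replace-don't-throw behavior suggested above.  The UTF-8 half assumes a runtime with TextDecoder, an API standardized after this thread and not the stringEncoding object from the wiki draft; decodeUtf16LE is a hypothetical helper written only for this illustration.

```javascript
// Assumption: TextDecoder (WHATWG Encoding API) is available; it is not the
// stringEncoding object discussed in this thread.
// 0xFF can never appear in UTF-8, so a non-fatal decoder emits U+FFFD for it.
const fromUtf8 = new TextDecoder("utf-8").decode(new Uint8Array([0x61, 0xff, 0x62]));

// Hypothetical helper: decode UTF-16LE code units without throwing when the
// input has an odd number of bytes. Lone surrogates are passed through as-is.
function decodeUtf16LE(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let out = "";
  let i = 0;
  for (; i + 1 < bytes.length; i += 2) {
    out += String.fromCharCode(view.getUint16(i, true)); // true = little-endian
  }
  if (i < bytes.length) {
    out += "\uFFFD"; // dangling final byte is replaced, not reported as an error
  }
  return out;
}

// "A" followed by one stray byte.
const fromUtf16 = decodeUtf16LE(new Uint8Array([0x41, 0x00, 0x42]));
```

Neither call throws; the caller gets a string with U+FFFD marking the damaged spot.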

> If byteLength bytes are processed without a U+0000 character decoded, -1 is returned.

I'd suggest returning byteLength instead of -1.  This removes the need to validate the return value of stringLength in the common case:

var len = stringEncoding.stringLength(array);
stringEncoding.decode(array, 0, len);

If the caller really cares whether a null terminator exists, they can check for it, but with fixed-size string fields you often don't.  (You only really care about null termination when you don't know the buffer size, as with C strings.)
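To make the suggestion concrete, here is a sketch of a stringLength that returns the buffer length when no terminator is found.  The function below is a hypothetical stand-in for the draft's method, handles only a single-byte terminator, and uses TextDecoder (a later API, assumed available) in place of the draft's decode().

```javascript
// Hypothetical stand-in for the draft's stringLength: index of the first
// 0x00 byte, or the full byte length if there is no terminator.
function stringLength(bytes) {
  const i = bytes.indexOf(0);
  return i === -1 ? bytes.length : i;
}

// A fixed-size 4-byte field that is completely full, so it has no terminator.
const field = new Uint8Array([0x68, 0x69, 0x21, 0x21]); // "hi!!"

// No special case needed: the returned length is always usable as a bound.
const text = new TextDecoder("utf-8").decode(field.subarray(0, stringLength(field)));
```

With a -1 return value, the same unchecked call would need an extra branch, and in this sketch subarray(0, -1) would quietly drop the last byte rather than fail.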

> encode: BOM is not written. Exception (TBD) thrown when there is no valid UTF-8 encoding of the string (e.g. "abc\uD800def" which contains a UTF-16 "surrogate half")

I don't like throwing an exception here.  It's always going to be possible to get these in your strings, e.g. when users paste text into an input.  This forces everyone to validate their input in obscure ways, or else have their application fail with an exception that they never ordinarily see.  In practice, nobody would do that sort of validation, so pasting blocks of text from external sources would occasionally break in mysterious ways.

I think that replacement makes sense in general for characters that can't be represented in the target encoding.  For Unicode targets, U+FFFD; for most others, a question mark.
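For illustration, this is how replacement on encode looks with TextEncoder, an API that was standardized later and is assumed to be available here; it is not the encode() from the wiki draft.

```javascript
// Assumption: TextEncoder (WHATWG Encoding API) is available.
// The input contains a lone surrogate, which has no valid UTF-8 encoding.
const encoded = new TextEncoder().encode("abc\uD800def");

// The surrogate half is encoded as U+FFFD (bytes EF BF BD) instead of throwing.
const replacement = Array.from(encoded.subarray(3, 6));

// Decoding the bytes again yields the string with U+FFFD in place of the surrogate.
const decodedAgain = new TextDecoder("utf-8").decode(encoded);
```

The pasted text survives with one visible replacement character, and the application never sees an exception.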

I don't know if there's a need for it here, but a useful reference is Win32's MultiByteToWideChar call.  It has the MB_ERR_INVALID_CHARS flag; unconvertible characters only cause a fatal error if that flag is set, otherwise they're replaced. [2]

[1] http://dev.w3.org/2006/webapi/FileAPI/#enctype "Replace bytes or sequences of bytes that are not valid according to the charset with a single U+FFFD character".

[2] http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx

On Wed, Nov 9, 2011 at 1:34 PM, John Tamplin <jat@google.com> wrote:
> Looks better.  My major concern is about using U+0000 as a terminator -- that still ensures that I can't send strings that might contain U+0000 unmolested, and either I must do my own quoting or filter it out.

I don't follow: U+0000 is only special to stringLength.  If you want to encode string lengths in some other way (with a different terminator, storing it out-of-line, etc.), just don't use stringLength.  decode() and encode() don't care.
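One way to store the length out-of-line is a length prefix, sketched below.  The function names and the 4-byte little-endian prefix are assumptions made for this example, and TextEncoder/TextDecoder (later APIs, assumed available) stand in for the draft's encode() and decode().

```javascript
// Hypothetical framing: 4-byte little-endian byte count, then UTF-8 payload.
function writeLengthPrefixed(str) {
  const payload = new TextEncoder().encode(str);
  const out = new Uint8Array(4 + payload.length);
  new DataView(out.buffer).setUint32(0, payload.length, true);
  out.set(payload, 4);
  return out;
}

function readLengthPrefixed(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const len = view.getUint32(0, true);
  // No terminator search: the prefix says exactly how many bytes to decode.
  return new TextDecoder("utf-8").decode(bytes.subarray(4, 4 + len));
}

// A string with an embedded U+0000 survives the round trip untouched.
const roundTripped = readLengthPrefixed(writeLengthPrefixed("a\u0000b"));
```

Nothing in this path ever calls stringLength, so U+0000 is just another character.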

> Other minor issues:
> - do we want to say anything about canonical forms?  For example, are over-long UTF8 sequences allowed?  How are combining marks/etc represented?

For decoding UTF-8, reference this algorithm: http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#utf-8
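As a rough check of how a web-platform decoder treats an over-long sequence, the sketch below feeds the two-byte over-long form of U+0000 to TextDecoder.  TextDecoder follows the later Encoding Standard, so treating it as equivalent to the algorithm linked above is an assumption.

```javascript
// C0 80 is the over-long (non-canonical) two-byte encoding of U+0000.
const overlongNul = new Uint8Array([0xc0, 0x80]);

// Default (non-fatal) mode: the bytes are replaced, never decoded as U+0000.
const lenient = new TextDecoder("utf-8").decode(overlongNul);

// Opt-in strict mode: the same input is rejected with an exception.
let strictThrew = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(overlongNul);
} catch (e) {
  strictThrew = true;
}
```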

Combining marks are a normalization issue.  For Unicode-to-Unicode conversions, leave the string as-is.  Converting other encodings to Unicode on the web platform normally uses NFC, I believe, though I can't find a reference for this in the spec off-hand...
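To show what is at stake with combining marks, here is a small sketch using String.prototype.normalize.  It only illustrates the difference between leaving a string as-is and applying NFC; it does not confirm what any particular decoder does.

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
const decomposed = "e\u0301";

// NFC composes the pair into the single code point U+00E9.
const composed = decomposed.normalize("NFC");

// Left as-is, the two spellings are different strings even though they render alike.
const sameWithoutNormalizing = decomposed === "\u00E9";
```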

> - decode should specify the behavior if byteLength stops inside a multi-byte sequence for a character

A related use case that this API currently doesn't handle is streamed decoding.  See iconv(3) for an example of an API that handles this.
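For comparison, streamed decoding as it later appeared in TextDecoder looks roughly like this.  The sketch assumes a runtime with TextDecoder and its stream option; it is not part of the draft under discussion.

```javascript
// U+20AC EURO SIGN is three bytes in UTF-8: E2 82 AC.
const euro = [0xe2, 0x82, 0xac];

// Streaming: the decoder buffers the incomplete sequence between calls.
const decoder = new TextDecoder("utf-8");
const first = decoder.decode(new Uint8Array(euro.slice(0, 2)), { stream: true });
const second = decoder.decode(new Uint8Array(euro.slice(2)), { stream: true });
const tail = decoder.decode(); // flush; nothing is pending here

// One-shot: a buffer that stops inside the sequence is replaced, not thrown on.
const cutOff = new TextDecoder("utf-8").decode(new Uint8Array(euro.slice(0, 2)));
```

The first call returns an empty string and the character only appears once its last byte arrives, which is the same contract iconv(3) offers through its incomplete-input handling.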

Glenn Maynard