
Re: [Public WebGL] String support in DataView

Something that needs careful consideration is how to handle BOMs.  FileReader.readAsText strips the BOM, which is convenient but causes Unicode strings to not round-trip.  I don't have a strong intuition about how to handle this.  I'd sort of sooner have clean, raw conversion (keep the BOM), but that would be inconsistent with FileReader, which is probably a worse problem.
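
To make the round-trip concern concrete, here is a minimal sketch. It uses TextDecoder/TextEncoder, which were standardized after this thread and are not the API under discussion here; they are assumed only because they expose both behaviours (strip vs. keep the BOM) side by side.

```javascript
// UTF-8 bytes for U+FEFF (the BOM) followed by "hi".
const bytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x68, 0x69]);

// Default: the leading BOM is stripped from the decoded string.
const stripped = new TextDecoder("utf-8").decode(bytes);

// ignoreBOM: true keeps U+FEFF as the first character of the result.
const raw = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bytes);

// Re-encode both strings to see which one reproduces the input bytes.
const encoder = new TextEncoder();
const strippedRoundTrip = encoder.encode(stripped); // BOM is gone
const rawRoundTrip = encoder.encode(raw);           // BOM is preserved

console.log(strippedRoundTrip.length, rawRoundTrip.length);
```

The stripped variant is the convenient one for display, but only the raw variant re-encodes to the same bytes it was decoded from.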

On Fri, Nov 4, 2011 at 4:54 PM, Joshua Bell <jsbell@chromium.org> wrote:
A couple of pieces that seem missing:
 - if I have a fixed 20 byte area to contain a UTF8 string, how do I decode it?  If I pass the length as 20, I will get all the padding bytes and have to manually remove them.  If I give length as -1, then if the 20 bytes are full, decoding will run past the end of the area.  That is why it seems useful to have a padded vs unpadded mode separate from the buffer length.

Personally, I'd manually remove them (replace \x00 with your padding character of choice):

var str = StringEncoding.decode(buffer, 0, 20, "UTF-8").replace(/\x00+$/, '');

stringLength would handle this directly, without having to strip afterwards:

var len = StringEncoding.stringLength(buffer, offset, 20);
var s = StringEncoding.decode(buffer, offset, len);
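
A rough sketch of how that could behave for the fixed 20-byte case. The stringLength and decodeField functions below are hypothetical stand-ins written for illustration, not the proposed StringEncoding methods, and the decode step assumes the later-standardized TextDecoder.

```javascript
// Hypothetical helper: number of bytes before the first NUL,
// never looking past maxLength bytes or the end of the array.
function stringLength(bytes, offset, maxLength) {
  const end = Math.min(offset + maxLength, bytes.length);
  for (let i = offset; i < end; i++) {
    if (bytes[i] === 0) return i - offset;
  }
  return end - offset;
}

// Hypothetical helper: decode a NUL-padded, fixed-width UTF-8 field.
function decodeField(bytes, offset, fieldLength) {
  const len = stringLength(bytes, offset, fieldLength);
  return new TextDecoder("utf-8").decode(bytes.subarray(offset, offset + len));
}

// Case 1: "hello" followed by 15 bytes of NUL padding.
const paddedField = new Uint8Array(20);
paddedField.set(new TextEncoder().encode("hello"));
const short = decodeField(paddedField, 0, 20);

// Case 2: the field is completely full and followed by unrelated data.
const record = new Uint8Array(24).fill(0x62); // 'b' bytes after the field
record.fill(0x61, 0, 20);                     // 20 'a' bytes, no terminator
const full = decodeField(record, 0, 20);

console.log(short, full.length);
```

Because the scan is bounded by the field width, a full field stops at 20 bytes rather than running into the data that follows it.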

Yes. There's an issue in the spec suggesting "partial fill" support, which would encode as much as possible and yield both bytesWritten and charactersWritten. Thoughts? (UTF-16 surrogate pairs make me cringe a bit here.)

Most Web APIs just ignore the existence of surrogate pairs (passing them through directly), leaving users to figure them out if they really need to.  Given that this is a broad problem on the platform and other APIs don't try to fix this one-by-one, I'd suggest doing the same.
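
For reference, the "partial fill" shape described above (encode as much as fits, report both counts) can be sketched with TextEncoder.encodeInto. That method was standardized well after this thread and is not the API proposed here; it is used only to show what the two counts look like when a surrogate pair sits at the boundary.

```javascript
const encoder = new TextEncoder();

// "a" is 1 byte in UTF-8; U+1F600 is a surrogate pair in the
// JavaScript string (2 code units) and 4 bytes in UTF-8.
const text = "a\u{1F600}";

// Only 3 bytes available: the 4-byte character does not fit,
// so encoding stops after "a" instead of splitting the pair.
const small = new Uint8Array(3);
const partial = encoder.encodeInto(text, small);

// 5 bytes available: everything fits.
const large = new Uint8Array(5);
const complete = encoder.encodeInto(text, large);

// read = UTF-16 code units consumed, written = bytes stored.
console.log(partial.read, partial.written, complete.read, complete.written);
```

Note that the count of input consumed is in UTF-16 code units, so a caller resuming from that offset never lands in the middle of a surrogate pair.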

On Fri, Nov 4, 2011 at 1:06 PM, Glenn Maynard <glenn@zewt.org> wrote:
Note that Blob already supports decoding strings from named encodings, and encoding to UTF-8 Blobs (the latter isn't yet widely implemented and the BlobBuilder API may change, but it's specced).  It'd be good to have parity between these APIs, e.g. in how invalid codepoints are handled via U+FFFD replacement.

Should we instead focus on Blob/TypedArray interop and extending the encoding support, and keep the encoding/decoding API in one place (i.e. Blob), rather than adding yet another Web API?

My initial impression was that this felt like duplicated functionality vs. Blob's string encoding support.  I'm not really sure, though.  Parsing, say, filenames from a ZIP index using Blob's APIs could be somewhat cumbersome, because Blobs are fundamentally asynchronous; to decode a list of filenames, you'll need to make an async call for each one.  Not at all unworkable, but not terribly convenient, when you're working with data that easily fits in memory and the async interface really isn't needed.

(That's ignoring workers, where blobs can be accessed synchronously.)

> var len = StringEncoding.stringLength(array3);
> var s3 = StringEncoding.decode(array3, 0, len);

There will be an objection raised that this requires decoding twice. (The reason for the outstanding "partial fill" issue is that there are objections to having encode twice, once to determine length and a second time to fill the buffer.) I like this proposal, though.

It doesn't need to decode twice; it only needs to pass over the data twice (strnlen or wcsnlen, effectively), which is much cheaper and more easily optimized than a full decode.
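
As an illustration of the wcsnlen-style pass, here is a sketch for UTF-16LE data. The stringLength16 function is a hypothetical helper, and the decode step assumes the later-standardized TextDecoder. The scan has to step by 16-bit code unit: a byte-wise scan would stop at the zero high byte of the first ASCII character.

```javascript
// Hypothetical helper: number of 16-bit code units before the first
// zero code unit, examining at most maxUnits units (little-endian).
function stringLength16(view, offset, maxUnits) {
  const available = Math.floor((view.byteLength - offset) / 2);
  const limit = Math.min(maxUnits, available);
  for (let i = 0; i < limit; i++) {
    if (view.getUint16(offset + 2 * i, true) === 0) return i;
  }
  return limit;
}

// A 16-byte field holding "hi" in UTF-16LE, zero-padded.
const buf = new ArrayBuffer(16);
const view = new DataView(buf);
view.setUint16(0, 0x68, true); // 'h' -> bytes 68 00
view.setUint16(2, 0x69, true); // 'i' -> bytes 69 00

// First pass: cheap scan for the terminator.
const units = stringLength16(view, 0, 8);

// Second pass: the only actual decode, over exactly units * 2 bytes.
const decoded = new TextDecoder("utf-16le").decode(new Uint8Array(buf, 0, units * 2));

console.log(units, decoded);
```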

Given this change, for decode, omitting byteLength (or passing undefined) would then mean "decode the entire buffer" (consistent w/ Typed Array APIs), rather than decoding up to a null terminator.

Right.  decode() would no longer care about null termination at all--that would be isolated to stringLength().

On Fri, Nov 4, 2011 at 5:23 PM, John Tamplin <jat@google.com> wrote:
I don't pretend to know the reasons for having both Blob and typed arrays (not having followed the discussion), but having both seems to necessarily mean some duplication.  The alternative would be to factor out common functionality to a new place that works with either, but that also seems unlikely 

FYI, the two fundamental differences between Blob and ArrayBuffer are that Blob is asynchronous (in order to allow very large blocks of data, which may be swapped to disk), and immutable.  ArrayBuffer is synchronous, which is more convenient but can't scratch data to disk, and mutable.

Glenn Maynard