On Fri, Nov 4, 2011 at 4:54 PM, Joshua Bell <email@example.com> wrote:
Personally, I'd manually remove them (replace \x00 with your padding character of choice):
var str = stringEncoding.decode(buffer, 0, 20, "UTF-8").replace(/\x00+$/, '');
Glenn's suggestion of stringLength (see below) might be a good place to hang a way to specify an arbitrary termination character for the padding case. (I can imagine edge cases with e.g. UTF-16 encoding in odd-length buffers, though.)
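For reference, here is a runnable version of the strip-the-padding pattern, with TextEncoder/TextDecoder used purely as stand-ins for the proposed stringEncoding functions:

```javascript
// Simulate a fixed 8-byte field: write "abc", leave the rest as 0x00 padding.
var field = new Uint8Array(8);
field.set(new TextEncoder().encode("abc"));

// Decode the whole field, then strip the trailing NUL padding.
var str = new TextDecoder("utf-8").decode(field).replace(/\x00+$/, "");
// str === "abc"
```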
- when writing to a fixed buffer, how do I truncate the string? Given this API, it looks like I would have to loop, calling encodedLength and chopping the string until it fits.
Yes. There's an issue in the spec suggesting "partial fill" support, which would encode as much as possible and yield both bytesWritten and charactersWritten. Thoughts? (UTF-16 surrogate pairs make me cringe a bit here.)
I would simply say that if truncation is allowed, only complete characters are stored (including surrogate pairs).
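As a sketch of what that looks like in user code, here is a truncation loop that never splits a surrogate pair. The utf8Length helper is hypothetical, standing in for the spec's encodedLength:

```javascript
// Count the UTF-8 byte length of a JS string (stand-in for encodedLength).
function utf8Length(str) {
  var bytes = 0;
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    if (c < 0x80) bytes += 1;
    else if (c < 0x800) bytes += 2;
    else if (c >= 0xD800 && c <= 0xDBFF) { bytes += 4; i++; } // surrogate pair
    else bytes += 3;
  }
  return bytes;
}

// Chop the string until its encoding fits in maxBytes; if the last code
// unit completes a surrogate pair, drop both halves together.
function truncateToFit(str, maxBytes) {
  while (str.length > 0 && utf8Length(str) > maxBytes) {
    var drop = 1;
    if (str.length >= 2) {
      var c = str.charCodeAt(str.length - 2);
      if (c >= 0xD800 && c <= 0xDBFF) drop = 2; // keep the pair whole
    }
    str = str.slice(0, str.length - drop);
  }
  return str;
}
```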
Should we instead focus on Blob/TypedArray interop and extending the encoding support, and keep the encoding/decoding API in one place (i.e. Blob), rather than adding yet another Web API?
So if I am reading a payload into an ArrayBuffer and I want to extract a JS string from the middle of it, I would have to convert to a blob first? That seems less than ideal.
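For comparison, decoding a region out of the middle of a buffer directly (TextDecoder is used here only as a stand-in for the proposed decode(buffer, offset, length)) avoids any Blob round trip:

```javascript
// Pull a string out of the middle of a larger payload without copying
// through a Blob: decode only the subarray you care about.
var bytes = new TextEncoder().encode("head--body--tail");
var mid = new TextDecoder("utf-8").decode(bytes.subarray(6, 10));
// mid === "body"
```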
I don't pretend to know the reasons for having both Blob and typed arrays (not having followed the discussion), but having both seems to necessarily mean some duplication. The alternative would be to factor out the common functionality to a new place that works with either, but that also seems unlikely.
Earlier versions of this spec had the methods on DataView, the current placement of the view/buffer as first argument is legacy of that. I agree the above makes more sense.
BTW, why isn't this on DataView? That seems the correct place for it, given the other methods on DataView. If you want to keep it a separate class, then maybe it would be better to create an instance for the character set you want to use and reuse that object to convert to/from array buffers. I.e.:
var ce = CharacterEncoding.for("UTF-8"); // or static instances like CharacterEncoding.UTF8
ce.encode("Hello", buf, 10, 42, true); // padding=true
var str = ce.decode(buf, 10, 42, true);
Personally, coming from a C background having the output first makes more sense to me, but YMMV :).
There will be an objection raised that this requires decoding twice. (The reason for the outstanding "partial fill" issue is that there are objections to having to encode twice, once to determine the length and a second time to fill the buffer.) I like this proposal, though.
The current spec essentially already requires encoding twice in the variable-length buffer case -- once to collect the lengths, and once to write the data after allocating the buffer. Short of having extensible buffers, it isn't clear how to avoid that.
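A sketch of that two-pass shape, with TextEncoder standing in for the proposed encoder. Note that caching the encoded chunks trades the second encode for temporary allocations, which is exactly the tradeoff in question:

```javascript
// Pass 1: encode each string to learn its byte length.
var enc = new TextEncoder(); // UTF-8
var strings = ["foo", "naïve", "bar"];
var parts = strings.map(function (s) { return enc.encode(s); });
var total = parts.reduce(function (n, p) { return n + p.length; }, 0);

// Pass 2: allocate once, then fill. (Here the fill is a copy of the
// cached bytes; a pure two-pass API would re-encode into the buffer.)
var buf = new Uint8Array(total);
var ofs = 0;
parts.forEach(function (p) { buf.set(p, ofs); ofs += p.length; });
```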
You also have the problem that in the variable-length case, you need the number of bytes consumed so you know how to advance the offset for the next item, while in the fixed-length case you don't need it. Maybe that argues for having different methods entirely:
var result = ce.decodeTerminated(buf, ofs, len, terminator); // use no more than len bytes; stop at terminator
var str = result.string;
var len = result.byteLength; // bytes consumed, so the caller can advance its offset
var str = ce.decodeFixed(buf, ofs, len);
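Those decodeTerminated semantics can be sketched in plain JS today (TextDecoder as a stand-in; the result shape with string and byteLength properties is an assumption, not spec'd):

```javascript
// Decode up to (and consume) the terminator byte, reporting how many
// bytes were used so the caller can advance its offset.
function decodeTerminated(buf, ofs, len, terminator) {
  var end = ofs;
  while (end < ofs + len && buf[end] !== terminator) end++;
  var found = end < ofs + len;
  return {
    string: new TextDecoder("utf-8").decode(buf.subarray(ofs, end)),
    byteLength: (end - ofs) + (found ? 1 : 0) // include the terminator if hit
  };
}
```

Walking a packed sequence of NUL-terminated strings then becomes: decode, advance the offset by result.byteLength, repeat.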
Given this change, it would then imply that for decode, if byteLength is omitted/undefined, the implication would be to "decode the entire buffer" (consistent w/ Typed Array APIs), rather than null termination.
Sure, though it isn't clear how useful that functionality is -- if the entire buffer is known to be a string, you aren't likely to be using ArrayBuffer in the first place.