
Re: [Public WebGL] Typed Arrays in W3C Specifications | Fwd: Updates to File API

Just to follow up -- my proposal assumes that Blobs and Files are decoupled, and that Blobs just represent a chunk of data that has already been read/received, and not something from which you would read or receive.  The operations on a Blob in my email are all synchronous because there's no IO -- the data is already there, it's already been read.  You're just choosing the format to receive it in.

FileReader behaves identically to how it's specified now -- the only difference is that you don't choose how to read the data ahead of time; you always read a Blob object and then you can perform operations on that blob to access the data.  You'll note that FileReader in the draft has a DOMString result, because DOMString is the only type that you could read, even if you generated that DOMString in different ways.  To shoehorn ArrayBuffers in there, you'd have to add "ArrayBuffer resultBuffer;" as well, and make "result" invalid when "readAsArrayBuffer" is called.

Part of the confusion here is a naming and usage issue; my fault there, as to me a Blob is an arbitrary chunk of unstructured data... thinking of a file as a blob was just not connecting in my head.  So, change all instances of Blob to something like DataChunk in my proposal.  Then you have:


interface DataChunk {
   readonly attribute long offset;  // offset from the start of the stream, -1 if not available
   readonly attribute unsigned long length;

   // if the implementation has Typed Arrays:
   readonly attribute ArrayBuffer buffer;

   // for all implementations
   readonly attribute DOMString binaryString;
   readonly attribute DOMString text;
   readonly attribute DOMString dataURL;
   DOMString getAsTextWithEncoding(in DOMString encoding);

   // the above could also be "ArrayBuffer asArrayBuffer();", "DOMString asBinaryString();", etc. so
   // that we don't need the encoding bit as a separate function

   // slice to access sub-chunks of this data, when access as text or dataURL is desired
   DataChunk slice(in unsigned long startIndex, in unsigned long endIndex);
};
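To make that concrete, here's a rough TypeScript-ish sketch of how script would consume one of these.  Everything below is hypothetical -- the DataChunk shape just mirrors the IDL above, nothing like it ships anywhere today:

// Hypothetical mirror of the DataChunk IDL above -- nothing here is a real API.
interface DataChunk {
  readonly offset: number;        // -1 if not available
  readonly length: number;
  readonly buffer?: ArrayBuffer;  // present only if the implementation has Typed Arrays
  readonly binaryString: string;
  readonly text: string;
  readonly dataURL: string;
  getAsTextWithEncoding(encoding: string): string;
  slice(startIndex: number, endIndex: number): DataChunk;
}

// The point of the design: the read has already happened, so picking a
// representation is synchronous -- no second async round-trip is needed.
function consume(chunk: DataChunk): void {
  if (chunk.buffer !== undefined) {
    // Typed Arrays available: hand the raw bytes to WebGL, a parser, etc.
    const bytes = new Uint8Array(chunk.buffer);
    console.log("first byte:", bytes[0]);
  } else {
    // Fallback path for implementations without Typed Arrays.
    console.log("as text:", chunk.text.slice(0, 80));
  }
  // Re-interpreting the same data differently requires no further IO.
  console.log("header as data: URL:", chunk.slice(0, 16).dataURL);
}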

With the renaming, a File can inherit from a Blob interface if that's really desired (it still doesn't make sense to me, especially if you want to reuse Blob for other things -- you can't represent a network stream as a Blob, for example, because it's not a finite thing -- but it doesn't affect this proposal).

interface File {
   readonly attribute DOMString type;
   readonly attribute DOMString uri;
   readonly attribute DOMString name;
   readonly attribute unsigned long long length;
};
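For completeness, here's roughly where a File would come from in script -- input.files on <input type="file"> exists today, but the "uri" and "length" members read at the end are from this proposal only:

const input = document.querySelector<HTMLInputElement>('input[type="file"]')!;
input.addEventListener("change", () => {
  // Cast to any because the shape below is this proposal's File, not the shipping one.
  const file = input.files![0] as any;
  console.log(file.name, file.type, file.length, file.uri);
});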

The FileReader is greatly simplified -- note that originally I omitted all the event stuff for clarity, but that probably muddled things more -- that model is still identical.  With only a single read() call, you can also get rid of the awkward bits around only the last "read..." method actually taking effect.

interface FileReader {
  void read(in File file,
            [optional] in unsigned long long startOffset,
            [optional] in unsigned long long length);

  readonly attribute DataChunk result;

  //
  // all stuff below is identical to what's already in the
  // proposed spec and behaves the same
  //

  void abort();

  // states
  const unsigned short EMPTY = 0;
  const unsigned short LOADING = 1;
  const unsigned short DONE = 2;

  readonly attribute unsigned short readyState;

  readonly attribute FileError error;

  // event handler attributes
  attribute Function onloadstart;
  attribute Function onprogress;
  attribute Function onload;
  attribute Function onabort;
  attribute Function onerror;
  attribute Function onloadend;
};
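A quick sketch of the async flow, to show how little changes from the current event model.  Again hypothetical -- today's FileReader has readAsText()/readAsDataURL()/etc. and a DOMString result instead of a single read() and a DataChunk:

// Stand-ins for the proposed API; neither exists as written anywhere.
declare function getProposedFileReader(): any;
declare const someFile: any;  // a File from <input>, say

const reader = getProposedFileReader();
reader.onload = () => {
  const chunk = reader.result;              // a DataChunk, per the IDL above
  // The format decision happens *after* the IO has completed:
  const magic = new Uint8Array(chunk.slice(0, 4).buffer);
  const looksBinary = magic[0] === 0x89;    // e.g. the PNG signature byte
  console.log(looksBinary ? chunk.dataURL : chunk.getAsTextWithEncoding("utf-8"));
};
reader.onerror = () => console.error(reader.error);
reader.read(someFile);                      // read the whole file...
// reader.read(someFile, 0, 1024);          // ...or just the first 1 KB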

interface FileReaderSync {
    DataChunk read(in File file,
                   [optional] in unsigned long long startOffset,
                   [optional] in unsigned long long length);
};
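And the worker-side equivalent, which just returns the chunk directly (the real FileReaderSync exposes readAsText()-style methods instead; the single read() here is this proposal's):

declare const syncReader: any;  // stand-in for this proposal's FileReaderSync
declare const aFile: any;       // a File handed to the worker

const chunk = syncReader.read(aFile, 0, 512);   // blocks until those 512 bytes are read
const text = chunk.text;                        // then pick whatever representation you need
const bytes = chunk.buffer ? new Uint8Array(chunk.buffer) : null;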


Does that make more sense?  There's no weirdness about how to stick an ArrayBuffer on Blob -- now that I understand the original Blob usage more, I don't think that makes sense given that reads have to be async.  Also, the XHR case becomes a matter of adding a "responseDataChunk" property, and WebSockets can easily also add a DataChunk to the message received events.
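Roughly, and for illustration only -- neither "responseDataChunk" nor a DataChunk on WebSocket message events exists anywhere, they're exactly what this email is proposing:

declare const xhr: any;  // an XMLHttpRequest, hypothetically extended
xhr.onload = () => {
  const chunk = xhr.responseDataChunk;            // proposed, not real
  console.log(chunk.text === xhr.responseText);   // expected to overlap with responseText
};

declare const socket: any;  // a WebSocket, hypothetically extended
socket.onmessage = (ev: any) => {
  const chunk = ev.dataChunk;                     // proposed, not real
  console.log("received", new Uint8Array(chunk.buffer).length, "bytes");
};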

In the future, if a new data representation were to be added, adding it to DataChunk immediately makes it available in all APIs that use DataChunk -- I don't think we'll need one for raw data, since ArrayBuffer is as low level as it gets... but you can imagine a crazy world where, for example, you can read some kind of raw bytecode that a VM inside the browser can interpret; adding an "executableByteCodeFunction" property to DataChunk would immediately make it possible to read and execute that bytecode from a file, network socket, or XHR without any changes to those APIs.  (Yes, I realize it's a crazy example -- please don't focus on the example itself too much, but on the idea of adding a new data type :-)

    - Vlad

----- "Arun Ranganathan" <arun@mozilla.com> wrote:
On 5/13/10 12:51 PM, Vladimir Vukicevic wrote:
> So in thinking about this more, here's a few comments/problems. A number of these are really comments about the File API itself; let me know if you want me to forward this elsewhere (or feel free to do so).
>
> I don't think Blob makes sense as a base class for a File -- a Blob isn't a File, especially once we can talk about slicing blobs and whatnot.

The idea behind Blob is that it represents binary data too big to be
read synchronously, and thus only asynchronous operations were
suitable.  Blob's goal was to be used asynchronously on the platform.  I
agree that decoupling File and Blob *may* make sense, but I'll note that:

* Slice operations
* Obtaining a URL
* Obtaining a type

were all desired use cases both for binary data -- Blobs -- *and*
for Files.  You can find justification for this here [1][2], but in a
nutshell the use cases are:

* Pretty much anywhere you have a blob of data, you might want to hand
it off to the browser, even if it wasn't a user supplied file.
** Viewing a single chapter of a book in a frame
** Slicing one episode out of a media format (DVD) and handing it to the
video element; player controls start and end at episode boundaries
** Pack a number of small files to speed download (with compression),
then parse them apart.

In order for the URLs on these blobs to be useful, they'd have to have
mediaType.

See also FileWriter and BlobBuilder [3].
> But continuing that thought, then it doesn't make sense for a Blob to have a type or url -- what does it mean to have a "type" for a fragment of data read from the network? Or a URL? I think as Chris was getting at on the call this morning, we're really just talking about a bare ArrayBuffer when we talk about such a chunk.

The WebGL use case may not be served by Blobs, actually, but by "bare
ArrayBuffers."  There's no need for Blobs to intermediate here, but
hopefully I've represented that "types" might be useful.  Google wants
Content-Dispositions on Blobs as well, but I disagree with this (perhaps
because the Chrome download manager may be distinct from other browsers,
and "triggering" Blob download with a Content-Disposition might be a
valid use case here).
>   But, a Blob can be useful when allowing access using different data types.
>
> In my thinking, a blob would look like this:
>
> interface Blob {
> readonly long offset; // offset from the start of the stream, -1 if not available
> readonly unsigned long size;
>    

Fine so far, but:
> // if the implementation has Typed Arrays:
> readonly ArrayBuffer buffer;
>    

Stipulation: Blobs *must* be accessed asynchronously!  Do you disagree
that Blobs should only behave asynchronously?  If you want *synchronous*
Blobs, along with *asynchronous* Files, then we should separate the two
and not have File inherit from Blob.
> // for all implementations
> readonly DOMString binaryString;
> readonly DOMString text;
> readonly DOMString dataURL;
> DOMString getAsTextWithEncoding(in DOMString encoding);
>    

Asynchronous!
> // the above could also be "ArrayBuffer asArrayBuffer();" "DOMString asBinaryString();" etc. so
> // that we don't need the encoding bit as a separate function
>
> Blob slice(unsigned long startIndex, unsigned long endIndex);
> };
>
> Now -- /if/ we were to make Typed Arrays a requirement for File API (which I don't think we can), then we could consider adding ways to convert from an ArrayBuffer to a binary string, data URL, etc. and not need any of the above, though even then having the offset would be handy when you have a lot of reads in flight.
>
> A File would look like:
>
> interface File {
> readonly DOMString type;
> readonly DOMString uri;
> readonly DOMString name;
> readonly unsigned long long length;
>    
Yes, this is what File *does* look like, but it follows an inheritance
model.

> };
>
> with no attachment to a Blob; just an object that represents a File, obtained from an <input> or other element.
>
> FileReader (which, I'll be honest, doesn't really make much sense to me -- why do we need a separate object to read from Files, as opposed to reading using the File directly? But, ok, that's not relevant here, and I guess it does isolate reading.)
>    

FileReader exists in order to separate Files from reading from a file
directly.  This was an API choice, born out of LOTS of discussion (see
for example [4] and follow the thread -- the model changed
substantially).  In an early draft, File objects fired asynchronous
*callback* based read methods which existed *on* the File object.  This
changed to FileReader + Events after enormous discussion on the
public-webapps WG listserv.
> interface FileReader {
> void read(in File file, [optional] in unsigned long long startOffset, [optional] in unsigned long long length);
>
> readonly attribute Blob result;
> };
>    

How does the read behave?  Is there an event model associated with
FileReader on the main thread?
> interface FileReaderSync {
> Blob read(in File file, [optional] in unsigned long long startOffset, [optional] in unsigned long long length);
> };
>
> Note that the above explicitly takes File elements as input -- it's a FileReader after all -- and has Blobs as the result from the read operation.
>
> That seems like a much cleaner separation to me -- you have File objects that have associated name/uri/type/etc. You use a FileReader to read a Blob from a file; then you can ask that Blob to give you the data that was read in one of a number of representations. No need to decide up front how you want to read a file -- with the current API it would be hard to do something like charset detection... you wouldn't be able to read something as a binary string first, try to guess a charset, and then ask for it again with an encoding without doing another FileReader, even though you already have the data.
>    

What you say above is true.  But actually, what's the use case for
charset detection if you *don't* want the File read as text?  In the
existing model,  you can do charset detection if you read as text.
> Given interfaces like the above, putting Blob to work for XHR and even structured storage seems very straightforward -- for XHR, you'd have a responseBlob property in the result. There'd be some overlap, for example responseBlob.text is likely to be the same as responseText (I don't know the details, so don't know if they're specified differently), but that's not an issue.
>    

(responseBlob is still a proposal being hashed out on XHR).  Under the
current design, it *MUST* be accessed asynchronously.  We can revisit
that if need be, and I'm amenable to changing the inheritance model.

-- A*
[1] http://lists.w3.org/Archives/Public/public-webapps/2010AprJun/0659.html
[2] http://www.mail-archive.com/public-webapps@w3.org/msg06137.html
[3] http://dev.w3.org/2009/dap/file-system/file-writer.html
[4] http://lists.w3.org/Archives/Public/public-webapps/2009JulSep/0576.html