Khronos Public Bugzilla
Bug 760 - Incorrect and incomplete GLSL character set specification
Status: RESOLVED FIXED
Product: Khronos (general)
Classification: Unclassified
Component: Documentation
Version: unspecified
Hardware: All All
Importance: P3 normal
Assigned To: Jon Leech
Reported: 2013-01-04 07:51 PST by Eivind Midtgård
Modified: 2013-06-26 00:39 PDT

Description Eivind Midtgård 2013-01-04 07:51:14 PST
The GLSL specification from 4.20 onwards specifies that the character set used is a subset of UTF-8. This is a bug. UTF-8 is not a character set. It is an encoding of the Unicode character set.
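
The distinction can be made concrete with a small Python sketch. It is not part of the original report; the helper name `describe` and the sample characters are only illustrative. Unicode assigns each character a code point, and UTF-8 is one of several ways of turning that code point into bytes.

```python
def describe(ch):
    """Return the code point of ch and two of its byte encodings (as hex)."""
    return {
        # The code point belongs to the character set (Unicode).
        "code_point": f"U+{ord(ch):04X}",
        # The byte sequences belong to the encodings.
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
    }


# "A", "e with acute accent" and the euro sign, written as escapes so the
# example does not depend on the encoding of this file.
for ch in ("A", "\u00e9", "\u20ac"):
    print(describe(ch))
```

The same code point yields different byte sequences under different encodings, which is why "a subset of UTF-8" does not by itself name a set of characters.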

In addition, the description of the character set omits several details necessary to interpret UTF-8 encoded data. For example, is there a BOM? Is it allowed or disallowed? Are the UTF-8 bytes in big-endian or little-endian order? Are both allowed?

Since the GLSL language seems to require nothing more than ASCII to represent the code, I think it would be good enough to specify that every character must be ASCII. Since the specification actually lists all characters allowed, and since they are all in the ASCII character set, the best solution would be to simply say that the character set is ASCII, encoded in the standard way (i.e. as ASCII).

The only problem I see with this is comments: users may wish to write comments in their own language, which may be impossible within the limits of ASCII. Still, I think it is better to keep the language as simple as possible and specify that comments must also be in ASCII.

If GLSL is planned to incorporate string or character data types, then going for Unicode encoded as UTF-8 is the way to go. The details should be fully specified.
Comment 1 Alfonse 2013-01-05 13:43:35 PST
UTF-8 cannot be in big-endian or little-endian order. UTF-8 is defined as a sequence of bytes (code units), and the order is defined by that specification. A single byte has no endianness, so UTF-8 is completely endian-independent. That is one of the nice things about it compared to UTF-16.
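
As a quick illustration of this point (a sketch added for clarity, with an arbitrary sample string), Python's standard codecs show that UTF-16 has two possible byte orders while UTF-8 has only one byte sequence for the same text:

```python
text = "GLSL"

# UTF-8 code units are single bytes, so there is only one possible order.
utf8 = text.encode("utf-8")

# UTF-16 code units are two bytes wide, so the byte order must be chosen.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

print(utf8.hex(" "))      # 47 4c 53 4c
print(utf16_le.hex(" "))  # 47 00 4c 00 53 00 4c 00
print(utf16_be.hex(" "))  # 00 47 00 4c 00 53 00 4c
```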

As for a BOM, the UTF-8 encoding does not require one. Unicode has a byte order mark (U+FEFF), but UTF-8 has no byte order for it to mark. Section 3.1 states:

> A compile-time error will be given if any other character is used outside a comment.

The Unicode Byte Order Mark code point is not listed in the accepted "characters". It is therefore "any other character", so it is not allowed outside of comments.
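
The rule quoted above can be sketched as a small checker. This is an illustration only, not how any GLSL compiler is implemented: the names `GLSL_ALLOWED`, `strip_comments` and `disallowed_chars` are hypothetical, and the allowed set is my transcription of the character list in section 3.1 of the 4.20 specification, which should be checked against the specification itself. The sketch also ignores line continuation inside `//` comments.

```python
import string

# Assumed transcription of the GLSL 4.20 section 3.1 character list.
GLSL_ALLOWED = set(
    string.ascii_letters + string.digits + "_"
    + ".+-/*%<>[](){}^|&~=!:;,?"
    + "#\\"            # preprocessor sign and line-continuation backslash
    + " \t\v\f\r\n"    # white space
)


def strip_comments(source):
    """Blank out // and /* */ comments, keeping newlines for line numbers."""
    out = []
    i, n = 0, len(source)
    while i < n:
        if source.startswith("//", i):
            # Line comment: skip up to, but not including, the newline.
            while i < n and source[i] != "\n":
                i += 1
        elif source.startswith("/*", i):
            end = source.find("*/", i + 2)
            end = n if end == -1 else end + 2
            out.append("".join("\n" if c == "\n" else " " for c in source[i:end]))
            i = end
        else:
            out.append(source[i])
            i += 1
    return "".join(out)


def disallowed_chars(source):
    """Return (line number, character) pairs found outside comments."""
    found = []
    for line_no, line in enumerate(strip_comments(source).split("\n"), start=1):
        for ch in line:
            if ch not in GLSL_ALLOWED:
                found.append((line_no, ch))
    return found


# A leading U+FEFF and a stray "$" are reported; the accented letter inside
# the comment is not.
shader = "\ufeff#version 420\n// caf\u00e9 in a comment\nfloat x = 1.0; $\n"
print(disallowed_chars(shader))
```

Under these assumptions a BOM at the start of the source is reported like any other unlisted character, which matches the reading given above.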
Comment 2 Eivind Midtgård 2013-01-07 16:13:18 PST
You are right, there is no BOM or endianness in UTF-8, so that part of the bug report should be ignored.
Comment 3 Jon Leech 2013-06-18 15:47:13 PDT
(In reply to comment #0)
> The GLSL specification from 4.20 onwards specifies that the character set
> used is a subset of UTF-8. This is a bug. UTF-8 is not a character set. It
> is an encoding of the Unicode character set.

Perhaps we could say "... is encoded in UTF-8 and includes the following
characters"?
 
> Since the GLSL language seems to require nothing more than ASCII to
> represent the code, I think it would be good enough to specify that every
> character must be ASCII. Since the specification actually lists all
> characters allowed, and since they are all in the ASCII character set, the
> best solution would be to simply say that the character set is ASCII,
> encoded in the standard way (i.e. as ASCII).

I'm unclear why it's better to say that the allowed character set is a subset
of ASCII than that it's a subset of UTF-8 encoded Unicode. Both are true.

We are trying to move towards UTF-8 being used consistently throughout
OpenGL, and while it's true that you can't use all of Unicode in shader
source, you can't use all of ASCII (or US-ASCII / ISO 646, if we're really
getting pedantic), either.
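
Both statements can be checked with a short Python sketch. It is illustrative only: `GLSL_ALLOWED` is an assumed transcription of the character list in section 3.1 of the 4.20 specification, not an authoritative copy.

```python
import string

# Assumed transcription of the GLSL 4.20 section 3.1 character list.
GLSL_ALLOWED = set(
    string.ascii_letters + string.digits + "_"
    + ".+-/*%<>[](){}^|&~=!:;,?"
    + "#\\"
    + " \t\v\f\r\n"
)
ASCII = {chr(i) for i in range(128)}

# Every listed character is ASCII, but not every ASCII character is listed.
is_strict_subset = GLSL_ALLOWED < ASCII

# For these characters the ASCII and UTF-8 byte encodings are identical,
# so "subset of ASCII" and "subset of UTF-8 encoded Unicode" both hold.
same_bytes = all(ch.encode("ascii") == ch.encode("utf-8") for ch in GLSL_ALLOWED)

# Printable ASCII punctuation that the assumed list leaves out.
unused_punctuation = sorted(set(string.punctuation) - GLSL_ALLOWED)

print(is_strict_subset, same_bytes, unused_punctuation)
```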

> If GLSL is planned to incorporate string or character data types, then going
> for Unicode encoded as UTF-8 is the way to go. The details should be fully
> specified.

That might happen in the future but there's no string support in GLSL planned
now, AFAIK.
Comment 4 Eivind Midtgård 2013-06-19 13:45:47 PDT
Since you are moving to UTF-8, I think your suggestion "... is encoded in UTF-8 and includes the following characters" is the best one. I also agree that it is better to describe the character set as a subset of UTF-8 encoded Unicode. I suggested ASCII because that was used in earlier specifications and I mistakenly believed that you would continue using it.
Comment 5 Jon Leech 2013-06-19 15:13:20 PDT
I've asked the ARB to accept a clarification to this effect in the
next GLSL spec release, and will update here when we have decided about
that. Thanks.
Comment 6 johnk 2013-06-26 00:39:26 PDT
The ARB has agreed that this change will be made in the next released version of the GLSL specification.