POV-Ray: Newsgroups: povray.programming: Re: Unicode for POVRay: Re: Unicode for POVRay

From: Jon A Cruz
Date: 5 Jun 1999 03:30:36
Message: <3758C3EF.87A09B39@geocities.com>
Ron Parker wrote:

> On Fri, 04 Jun 1999 08:15:43 -0700, Jon A. Cruz wrote:
> >If you start to allow arbitrary encodings, which do you use?
> [...]
> >Probably the only way to keep the .pov files portable and generating identical
> >results on any platform (which I think is one of the design goals of POV-Ray)
> >would be to include the encoding support in POV-Ray. But, you can't include
> >everything, so where do you draw the line?
>
> Presumably, you add things if they're needed.  Unicode might be sufficient
> for most people, so it usually won't be a problem, but what if the font you
> want to use doesn't have a Unicode encoding table?  Personally, I'd probably
> try to write a Perl script to add a Unicode table to the font, but the best
> solution would probably be to support that other encoding in POV (preferably
> without resorting to the locale stuff that was in the CJK patch, and without
> causing serious problems with plain ASCII text using the default NT5 Truetype
> fonts as the CJK patch did.)

The problem is that there is no single 'other' encoding. That's the biggest issue. I
think we can count on most modern fonts to have a TrueType index in them. For the
others it can be a problem, but the alternative is including the entire set of mapping
tables: Unicode to Japanese, Unicode to Big5, Unicode to GB...

Ouch.

> >But then... what happens to all the text manipulation routines? Are end-user
> >scripts dependent on one character=one byte? What compatibility issues could
> >this cause? Hmmm....
>
> This is what I thought we were talking about.  Things like substr would have to
> be changed to count UTF-8 characters instead of bytes.  This is relatively easy,
> because you only have to look at the first byte to determine the size of the
> character, but it would take some work nonetheless.  Either that, or convert
> all strings to UCS-2 or UCS-4 at parse time and modify the rest of the code
> accordingly.  Or, you leave them the way they are modulo a few bugfixes and
> expect the users to deal with the encoding-related issues themselves.  The
> real question would be whether we expect substr and its ilk to work on
> characters or on bytes.  The documentation says characters, in case that
> matters.  It never mentions bytes except implicitly in a few discussions of
> the range of arguments or return values from some functions.
>
> I'd have to say that at the bare minimum, it'd be nice if the chr function
> could take the UCS-4 codepoint as an argument and return the corresponding
> UTF-8 string.  Its behaviour with arguments between 128 and 255 would still
> have to be encoding-dependent, though.  Whether the asc function should do
> the inverse is a topic for debate, too, and the slope starts getting quite
> slippery at that point.

I was thinking that UTF-8 would not really be accessible to the users. Make them do
everything in UCS-2, maybe with a few UCS-4 functions. Internally we might keep
things in UTF-8 or UCS-2. I don't think users really want to do the tricks needed to
figure out whether they are dealing with several bytes.
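
To make those tricks concrete, here is a minimal sketch of a character-based substr
over raw UTF-8 bytes. It relies on the point quoted above: the first byte of a
sequence is enough to tell how long the character is. The names utf8_char_len and
utf8_substr are made up for illustration and are not POV-Ray functions. The sketch
assumes well-formed input, a 1-based start position, and sequences of at most four
bytes.

```python
def utf8_char_len(lead):
    """Return the byte length of a UTF-8 sequence, given only its first byte."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if 0xC0 <= lead < 0xE0:
        return 2  # 110xxxxx
    if 0xE0 <= lead < 0xF0:
        return 3  # 1110xxxx
    if 0xF0 <= lead < 0xF8:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError("not a UTF-8 lead byte: %#x" % lead)


def utf8_substr(data, start, count):
    """Return `count` characters of UTF-8 `data`, starting at 1-based `start`."""
    pos = 0
    # Skip the characters before the requested start position.
    for _ in range(start - 1):
        if pos >= len(data):
            break
        pos += utf8_char_len(data[pos])
    begin = pos
    # Walk over the requested number of characters, stopping at end of data.
    for _ in range(count):
        if pos >= len(data):
            break
        pos += utf8_char_len(data[pos])
    return data[begin:pos]
```

Nothing here needs a conversion to UCS-2 or UCS-4, but every string function that
counts characters would have to walk the bytes this way.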

The route that Sun took with Java is that it deals with things internally as UTF-8,
but that is hidden. To a Java program and programmer, everything is in UCS-2. Of
course, that may cause problems when you want to get at things that are outside the
standard range for Unicode characters.
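
As an illustration of that range problem: a code point above U+FFFF does not fit in
one 16-bit unit, so a 16-bit representation has to split it into a surrogate pair
(the UTF-16 scheme). The sketch below shows the arithmetic. The function name
to_utf16_units is hypothetical, chosen only for this example.

```python
def to_utf16_units(cp):
    """Encode one Unicode code point as a list of 16-bit units."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %#x" % cp)
    if cp <= 0xFFFF:
        return [cp]  # fits in a single 16-bit unit
    v = cp - 0x10000  # 20 bits remain to be split in two
    high = 0xD800 | (v >> 10)   # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)  # bottom 10 bits go in the low surrogate
    return [high, low]
```

So any string function that assumes one 16-bit unit per character has the same
counting problem as one that assumes one byte per character, only more rarely.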

Guess I'd better go over all the string functions and take stock of the state of
things.


Bottom line is that I think .pov files themselves might be in a few different
encodings (very limited), but that POV-Ray would convert them to Unicode at parse
time. Then all the dealings would be in Unicode. The only question would be UCS-2
or UCS-4?

Then again, this might require a lot of changes to the parsing code, as one byte in
the file would no longer equal one character.
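
A rough sketch of what that parse-time conversion could look like, using Python's
built-in codecs as a stand-in for whatever tables POV-Ray would carry. The function
name decode_scene_source and the short SUPPORTED list are assumptions made for this
example, not anything taken from POV-Ray.

```python
# Hypothetical short list of encodings a scene file might be allowed to use.
SUPPORTED = ("ascii", "latin-1", "utf-8", "shift_jis", "big5", "gb2312")


def decode_scene_source(raw, encoding="ascii"):
    """Convert raw file bytes to a list of Unicode code points."""
    if encoding not in SUPPORTED:
        raise ValueError("unsupported scene encoding: %s" % encoding)
    # After this step the rest of the parser only ever sees code points.
    return [ord(ch) for ch in raw.decode(encoding)]
```

For example, the four Shift-JIS bytes 93 FA 96 7B come out as two code points,
U+65E5 and U+672C, which is exactly the case where one byte in the file no longer
equals one character.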