POV-Ray: Newsgroups: povray.programming: Re: Unicode for POVRay: Re: Unicode for POVRay

From: Ron Parker
Date: 4 Jun 1999 12:40:44
Message: <3757f37c.0@news.povray.org>

On Fri, 04 Jun 1999 08:15:43 -0700, Jon A. Cruz wrote:
>If you start to allow arbitrary encodings, which do you use? 
[...]
>Probably the only way to keep the .pov files portable and generating identical
>results on any platform (which I think is one of the design goals of POV-Ray)
>would be to include the encoding support in POV-Ray. But, you can't include
>everything, so where do you draw the line?

Presumably, you add things if they're needed.  Unicode might be sufficient
for most people, so it usually won't be a problem, but what if the font you
want to use doesn't have a Unicode encoding table?  Personally, I'd probably
try to write a Perl script to add a Unicode table to the font, but the best 
solution would probably be to support that other encoding in POV (preferably 
without resorting to the locale stuff that was in the CJK patch, and without 
causing serious problems with plain ASCII text using the default NT5 TrueType 
fonts as the CJK patch did.)

>But then... what happens to all the text manipulation routines? Are end-user
>scripts dependent on one character=one byte? What compatibility issues could
>this cause? Hmmm....

This is what I thought we were talking about.  Things like substr would have to
be changed to count UTF-8 characters instead of bytes.  This is relatively easy,
because you only have to look at the first byte to determine the size of the
character, but it would take some work nonetheless.  Either that, or convert 
all strings to UCS-2 or UCS-4 at parse time and modify the rest of the code 
accordingly.  Or, you leave them the way they are modulo a few bugfixes and
expect the users to deal with the encoding-related issues themselves.  The 
real question would be whether we expect substr and its ilk to work on 
characters or on bytes.  The documentation says characters, in case that 
matters.  It never mentions bytes except implicitly in a few discussions of 
the range of arguments or return values from some functions.
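
To make the "count characters instead of bytes" idea concrete, here is a rough sketch in Python (chosen only for brevity; it is not POV-Ray code). The names utf8_char_len and utf8_substr are made up for illustration. It assumes the input is well-formed UTF-8, that positions are 1-based, and that sequences are at most four bytes long as in current UTF-8; the original UTF-8 definition for UCS-4 allowed up to six.

```python
def utf8_char_len(lead: int) -> int:
    """Return the byte length of a UTF-8 sequence, given only its first byte."""
    if lead < 0x80:
        return 1              # 0xxxxxxx: plain ASCII
    if 0xC0 <= lead < 0xE0:
        return 2              # 110xxxxx
    if 0xE0 <= lead < 0xF0:
        return 3              # 1110xxxx
    if 0xF0 <= lead < 0xF8:
        return 4              # 11110xxx
    # 10xxxxxx is a continuation byte; 0xF8 and above are not handled here.
    raise ValueError(f"not a UTF-8 lead byte: {lead:#04x}")


def utf8_substr(data: bytes, start: int, count: int) -> bytes:
    """Return `count` characters of `data`, beginning at 1-based character `start`.

    Hypothetical helper: positions are counted in characters, not bytes.
    """
    pos = 0
    # Skip the characters that come before the requested start position.
    for _ in range(start - 1):
        if pos >= len(data):
            break
        pos += utf8_char_len(data[pos])
    begin = pos
    # Advance over the requested number of characters.
    for _ in range(count):
        if pos >= len(data):
            break
        pos += utf8_char_len(data[pos])
    return data[begin:pos]


text = "naïve café".encode("utf-8")
middle = utf8_substr(text, 3, 3)   # the characters "ïve", which occupy four bytes
```

A byte-based slice over the same range, text[2:5], would drop the final "e" because "ï" takes two bytes, which is exactly the compatibility question raised above.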

I'd have to say that at the bare minimum, it'd be nice if the chr function 
could take the UCS-4 codepoint as an argument and return the corresponding
UTF-8 string.  Its behaviour with arguments between 128 and 255 would still
have to be encoding-dependent, though.  Whether the asc function should do 
the inverse is a topic for debate, too, and the slope starts getting quite
slippery at that point.
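
For illustration only, here is roughly what a code-point-aware chr and its inverse could look like, again sketched in Python rather than in POV-Ray's own source. The names utf8_chr and utf8_asc are hypothetical. The sketch treats every argument as a code point, so it deliberately ignores the encoding-dependent handling of 128 to 255 mentioned above; it also caps code points at 0x10FFFF and does not reject surrogates or malformed input.

```python
def utf8_chr(cp: int) -> bytes:
    """Encode one code point as UTF-8 bytes (hypothetical chr replacement)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError(f"code point out of range: {cp}")
    if cp < 0x80:
        return bytes([cp])                          # one byte, same as ASCII
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_asc(data: bytes) -> int:
    """Decode the first character of `data` (hypothetical asc replacement)."""
    lead = data[0]
    if lead < 0x80:
        return lead
    if lead >> 5 == 0b110:
        extra, cp = 1, lead & 0x1F
    elif lead >> 4 == 0b1110:
        extra, cp = 2, lead & 0x0F
    elif lead >> 3 == 0b11110:
        extra, cp = 3, lead & 0x07
    else:
        raise ValueError(f"not a UTF-8 lead byte: {lead:#04x}")
    # Each continuation byte contributes its low six bits.
    for b in data[1:1 + extra]:
        cp = (cp << 6) | (b & 0x3F)
    return cp


e_acute = utf8_chr(0xE9)    # two bytes in this sketch, not the single byte 0xE9
euro = utf8_chr(0x20AC)     # three bytes
```

Note that utf8_chr(0xE9) yields a two-byte string here, so a scene that relied on chr(233) producing a single byte in some 8-bit encoding would see different results.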
