POV-Ray : Newsgroups : povray.programming : Re: Unicode for POVRay Server Time
Re: Unicode for POVRay (Messages 1 to 8 of 8)
From: Ron Parker
Subject: Re: Unicode for POVRay
Date: 4 Jun 1999 12:40:44
Message: <3757f37c.0@news.povray.org>
On Fri, 04 Jun 1999 08:15:43 -0700, Jon A. Cruz wrote:
>If you start to allow arbitrary encodings, which do you use? 
[...]
>Probably the only way to keep the .pov files portable and generating identical
>results on any platform (which I think is one of the design goals of POV-Ray)
>would be to include the encoding support in POV-Ray. But, you can't include
>everything, so where do you draw the line?

Presumably, you add things if they're needed.  Unicode might be sufficient
for most people, so it usually won't be a problem, but what if the font you
want to use doesn't have a Unicode encoding table?  Personally, I'd probably
try to write a Perl script to add a Unicode table to the font, but the best 
solution would probably be to support that other encoding in POV (preferably 
without resorting to the locale stuff that was in the CJK patch, and without 
causing serious problems with plain ASCII text using the default NT5 TrueType 
fonts as the CJK patch did.)

>But then... what happens to all the text manipulation routines? Are end-user
>scripts dependent on one character=one byte? What compatibility issues could
>this cause? Hmmm....

This is what I thought we were talking about.  Things like substr would have to
be changed to count UTF-8 characters instead of bytes.  This is relatively easy,
because you only have to look at the first byte to determine the size of the
character, but it would take some work nonetheless.  Either that, or convert 
all strings to UCS-2 or UCS-4 at parse time and modify the rest of the code 
accordingly.  Or, you leave them the way they are modulo a few bugfixes and
expect the users to deal with the encoding-related issues themselves.  The 
real question would be whether we expect substr and its ilk to work on 
characters or on bytes.  The documentation says characters, in case that 
matters.  It never mentions bytes except implicitly in a few discussions of 
the range of arguments or return values from some functions.
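That first-byte trick can be sketched in C (hypothetical helper names; this is not actual POV-Ray source):

```c
#include <stddef.h>

/* Number of bytes in a UTF-8 sequence, judged from its lead byte alone.
   Returns 0 for an invalid lead byte (e.g. a stray continuation byte). */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)          return 1;  /* 0xxxxxxx: ASCII           */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: 2-byte sequence */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx: 3-byte sequence */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx: 4-byte sequence */
    return 0;                            /* continuation or invalid   */
}

/* Count characters (not bytes) in a NUL-terminated UTF-8 string;
   this is the kind of loop substr would need instead of plain indexing. */
size_t utf8_strlen(const char *s)
{
    size_t n = 0;
    while (*s) {
        size_t len = utf8_seq_len((unsigned char)*s);
        s += (len ? len : 1); /* skip one byte on invalid input */
        n++;
    }
    return n;
}
```

So a character-counting substr only pays an extra table lookup per character, at the cost of losing O(1) indexing into the string.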

I'd have to say that at the bare minimum, it'd be nice if the chr function 
could take the UCS-4 codepoint as an argument and return the corresponding
UTF-8 string.  Its behaviour with arguments between 128 and 255 would still
have to be encoding-dependent, though.  Whether the asc function should do 
the inverse is a topic for debate, too, and the slope starts getting quite
slippery at that point.
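A minimal sketch of such a chr in C, ignoring the encoding-dependent 128-255 question (the function name is hypothetical, not an existing POV-Ray routine):

```c
#include <stddef.h>

/* Encode a UCS-4 codepoint as a NUL-terminated UTF-8 string into buf,
   which must hold at least 5 bytes.  Returns the number of bytes
   written, or 0 if the codepoint is beyond the Unicode range. */
size_t ucs4_to_utf8(unsigned long cp, char buf[5])
{
    if (cp < 0x80) {                       /* 1 byte: plain ASCII */
        buf[0] = (char)cp;
        buf[1] = 0; return 1;
    } else if (cp < 0x800) {               /* 2 bytes */
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        buf[2] = 0; return 2;
    } else if (cp < 0x10000) {             /* 3 bytes */
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        buf[3] = 0; return 3;
    } else if (cp < 0x110000) {            /* 4 bytes */
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        buf[4] = 0; return 4;
    }
    buf[0] = 0;
    return 0;                              /* out of range */
}
```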



From: Jon A. Cruz
Subject: Re: Unicode for POVRay
Date: 5 Jun 1999 03:30:36
Message: <3758C3EF.87A09B39@geocities.com>
Ron Parker wrote:

> On Fri, 04 Jun 1999 08:15:43 -0700, Jon A. Cruz wrote:
> >If you start to allow arbitrary encodings, which do you use?
> [...]
> >Probably the only way to keep the .pov files portable and generating identical
> >results on any platform (which I think is one of the design goals of POV-Ray)
> >would be to include the encoding support in POV-Ray. But, you can't include
> >everything, so where do you draw the line?
>
> Presumably, you add things if they're needed.  Unicode might be sufficient
> for most people, so it usually won't be a problem, but what if the font you
> want to use doesn't have a Unicode encoding table?  Personally, I'd probably
> try to write a Perl script to add a Unicode table to the font, but the best
> solution would probably be to support that other encoding in POV (preferably
> without resorting to the locale stuff that was in the CJK patch, and without
> causing serious problems with plain ASCII text using the default NT5 Truetype
> fonts as the CJK patch did.)

The problem is that there is no single 'other' encoding; that's the biggest problem. I think
we can count on most modern fonts to have a TrueType Unicode index in them. The others
can be a problem, but the alternative is including the entire Unicode-to-Japanese,
Unicode-to-Big5, Unicode-to-GB... mapping tables.

Ouch.

> >But then... what happens to all the text manipulation routines? Are end-user
> >scripts dependent on one character=one byte? What compatibility issues could
> >this cause? Hmmm....
>
> This is what I thought we were talking about.  Things like substr would have to
> be changed to count UTF-8 characters instead of bytes.  This is relatively easy,
> because you only have to look at the first byte to determine the size of the
> character, but it would take some work nonetheless.  Either that, or convert
> all strings to UCS-2 or UCS-4 at parse time and modify the rest of the code
> accordingly.  Or, you leave them the way they are modulo a few bugfixes and
> expect the users to deal with the encoding-related issues themselves.  The
> real question would be whether we expect substr and its ilk to work on
> characters or on bytes.  The documentation says characters, in case that
> matters.  It never mentions bytes except implicitly in a few discussions of
> the range of arguments or return values from some functions.
>
> I'd have to say that at the bare minimum, it'd be nice if the chr function
> could take the UCS-4 codepoint as an argument and return the corresponding
> UTF-8 string.  Its behaviour with arguments between 128 and 255 would still
> have to be encoding-dependent, though.  Whether the asc function should do
> the inverse is a topic for debate, too, and the slope starts getting quite
> slippery at that point.

I was thinking that UTF-8 would not really be accessible to the users. Make them do
everything in UCS-2, maybe with a few UCS-4 functions. Internally we might keep things in
UTF-8 or UCS-2. I don't think users really want to do the tricks needed to figure out
whether they have to deal with several bytes per character.

The route Sun took with Java is to deal with things internally as UTF-8, but that is
hidden: to a Java program and programmer, everything is in UCS-2. Of course, that may
cause problems when you want to get at characters that are not in the standard range of
Unicode characters.

Guess I'd better go over all the string functions and take stock of the state of
things.


Bottom line is that I think .pov files themselves might be in a few different
encodings (a very limited set), but POV-Ray would convert them to Unicode at parse
time. Then all the dealings would be in Unicode. The only question would be UCS-2
or UCS-4?

Then again, this might require a lot of changes to the parsing code, as one byte in
the file would no longer equal one character.
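A sketch of what such a parse-time conversion could look like for the simplest case, Latin-1, where every byte already is its codepoint (the function name is hypothetical; other encodings would need real per-encoding tables):

```c
#include <stdlib.h>
#include <stddef.h>

/* Decode a Latin-1 source buffer into a NUL-terminated UCS-4 array at
   parse time, so everything downstream only ever sees codepoints.
   Latin-1 is the trivial case: bytes 0x00..0xFF map directly to
   codepoints U+0000..U+00FF.  Caller frees the result. */
unsigned long *latin1_to_ucs4(const unsigned char *src, size_t n)
{
    unsigned long *out = malloc((n + 1) * sizeof *out);
    size_t i;
    if (!out)
        return NULL;
    for (i = 0; i < n; i++)
        out[i] = src[i];  /* identity mapping for Latin-1 */
    out[n] = 0;
    return out;
}
```

With a front end like this, the tokenizer works on `unsigned long` codepoints and never has to care how many bytes a character occupied in the file.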



From: Ronald L. Parker
Subject: Re: Unicode for POVRay
Date: 6 Jun 1999 13:55:43
Message: <375bb556.31956559@news.povray.org>
On Fri, 04 Jun 1999 23:30:07 -0700, "Jon A. Cruz"
<jon### [at] geocitiescom> wrote:

>Bottom line is that I think .pov files themselves might be in a few different
>encodings (very limited), but POV-Ray would convert them to Unicode at parse
>time. Then all the dealings would be in Unicode. The only question would be UCS-2
>or UCS-4?

Isn't UCS-2 a subset of UCS-4?  If that is the case, I'd just go with
UCS-4 for everything.  It's not like POV is terribly memory-conscious
in any other part of its life. :)

>Then again, this might require a lot of changes to the parsing code, as one byte in
>the file would no longer equal one character.

I don't think the parse code cares about such things, but I might be
wrong.  It's already true that one byte in the file isn't one
character: a CRLF pair in the file looks like an LF to the parser.

BTW, I have a minor bone to pick with you.  Please don't use // to
comment things in your code.  GCC doesn't like it, and many other C
compilers don't either.



From: Alain CULOS
Subject: GCC
Date: 13 Jun 1999 17:12:07
Message: <3762F502.13A3C4C9@bigfoot.com>
"Ronald L. Parker" wrote:

> BTW, I have a minor bone to pick with you.  Please don't use // to
> comment things in your code.  GCC doesn't like it, and many other C
> compilers don't either.

Ron,
I do not understand this tip you're giving; could you expand a bit more, please?
I use GCC myself, although not proficiently yet, and have found no problem with //. Maybe
that is exactly where I am getting problems I am not able to analyse properly.
By the way, // is meant to be C++ only and not C. Is this what you meant?

Cheers,
Al.

--
ANTI SPAM / ANTI ARROSAGE COMMERCIAL :

To answer me, please take out the Z from my address.



From: Ron Parker
Subject: Re: GCC
Date: 13 Jun 1999 17:17:45
Message: <37651f2f.5520693@news.povray.org>
On Sun, 13 Jun 1999 01:02:10 +0100, Alain CULOS
<ZAl### [at] bigfootcom> wrote:

>"Ronald L. Parker" wrote:
>
>> BTW, I have a minor bone to pick with you.  Please don't use // to
>> comment things in your code.  GCC doesn't like it, and many other C
>> compilers don't either.
>
>Ron,
>I do not understand this tip you're giving; could you expand a bit more, please?
>I use GCC myself, although not proficiently yet, and have found no problem with //. Maybe
>that is exactly where I am getting problems I am not able to analyse properly.
>By the way, // is meant to be C++ only and not C. Is this what you meant?

Yes.  I was giving the tip specifically to Jon, because his unipatch
has a few lines commented with // in what is supposed to be a .c file.
egcs 1.1.1 doesn't like that (it reports an error), so I had to change
all his comments to compile on Linux.



From: Nieminen Mika
Subject: Re: Unicode for POVRay
Date: 14 Jun 1999 01:34:36
Message: <3764946c@news.povray.org>
Ronald L. Parker <par### [at] mailfwicom> wrote:
: BTW, I have a minor bone to pick with you.  Please don't use // to
: comment things in your code.  GCC doesn't like it, and many other C
: compilers don't either.

  Tip: If you want to automatically enclose those comments with /* ... */,
you can try this sed command:
  sed "s/\/\/.*$/\/*&*\//" file.c

  (Cryptic, uh? :) )

-- 
main(i,_){for(_?--i,main(i+2,"FhhQHFIJD|FQTITFN]zRFHhhTBFHhhTBFysdB"[i]
):5;i&&_>1;printf("%s",_-70?_&1?"[]":" ":(_=0,"\n")),_/=2);} /*- Warp -*/



From: Ron Parker
Subject: Re: Unicode for POVRay
Date: 14 Jun 1999 10:33:10
Message: <376512a6@news.povray.org>
On 14 Jun 1999 01:34:36 -0400, Nieminen Mika wrote:
>Ronald L. Parker <par### [at] mailfwicom> wrote:
>: BTW, I have a minor bone to pick with you.  Please don't use // to
>: comment things in your code.  GCC doesn't like it, and many other C
>: compilers don't either.
>
>  Tip: If you want to automatically enclose those comments with /* ... */,
>you can try this sed command:
>  sed "s/\/\/.*$/\/*&*\//" file.c
>
>  (Cryptic, uh? :) )

less cryptic, and with a backup:
  mv file.c file.c.bak; perl -pe "s#//(.*)$#/* $1 */#" <file.c.bak >file.c

Both methods would fail if the // comment happened to contain the '*/'
sequence of characters, but that's rare.



From: Jon A. Cruz
Subject: Re: Unicode for POVRay
Date: 15 Jun 1999 01:14:35
Message: <3765E13E.E9612B04@geocities.com>
Ron Parker wrote:

> On 14 Jun 1999 01:34:36 -0400, Nieminen Mika wrote:
> >Ronald L. Parker <par### [at] mailfwicom> wrote:
> >: BTW, I have a minor bone to pick with you.  Please don't use // to
> >: comment things in your code.  GCC doesn't like it, and many other C
> >: compilers don't either.
> >
> >  Tip: If you want to automatically enclose those comments with /* ... */,
> >you can try this sed command:
> >  sed "s/\/\/.*$/\/*&*\//" file.c
> >
> >  (Cryptic, uh? :) )
>
> less cryptic, and with a backup:
>   mv file.c file.c.bak; perl -pe "s#//(.*)$#/* $1 */#" <file.c.bak >file.c
>
> Both methods would fail if the // comment happened to contain the '*/'
> sequence of characters, but that's rare.

Yes, but not so rare with me. :-)
If I mix things, it's often because of the nested stuff. I know, I'm a pain.
But I'm working on it.



Copyright 2003-2023 Persistence of Vision Raytracer Pty. Ltd.