The internal reworking of the parser will make it necessary to overhaul
the handling of character sets, in a way that will not be entirely
compatible with earlier versions. Here are my current plans - comments
welcome:
(1) The current `global_settings { charset FOO }` mechanism (apparently
introduced with v3.5) will be disabled.
For backward compatibility it will still be accepted, but produce a
warning. (Not sure yet if we want the warning for pre-v3.5 scenes,
and/or an outright error for v3.8 and later scenes.)
(2) Internal character handling will use UCS (essentially Unicode minus
a truckload of semantics rules) throughout.
For example, `chr(X)` will return the character corresponding to UCS
codepoint X.
Specifically, for now the internal character set will be UCS-2 (i.e. the
16-bit subset of UCS, aka the Basic Multilingual Plane; not to be
confused with UTF-16, which is a 16-bit-based encoding of the full UCS-4
character set).
Note that the choice of UCS-2 will make the internal handling compatible
with both Latin-1 and ASCII (but not Windows-1252, which is frequently
misidentified as Latin-1).
In later versions, the internal character handling may be extended to
fully support UCS-4.
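The proposed `chr(X)` semantics can be illustrated with Python's built-in `chr`/`ord`, which already map UCS codepoints the same way (Python here is purely illustrative, not POV-Ray syntax):

```python
# chr(X) maps a UCS codepoint to its character; ord() is the inverse.
euro = chr(0x20AC)           # U+20AC, the Euro sign
print(euro)                  # €
print(ord(euro) == 0x20AC)   # True

# UCS-2 covers only the Basic Multilingual Plane (U+0000..U+FFFF);
# codepoints above that would need the later UCS-4 extension.
print(ord(euro) <= 0xFFFF)   # True: the Euro sign fits in UCS-2
```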
(3) Character encoding of input files will be auto-detected.
Full support will be provided for UTF-8 (and, by extension, ASCII),
which will be auto-detected reliably.
Reasonable support will be provided for Windows-1252 (and, by extension,
Latin-1) except for a few fringe cases that will be misidentified as UTF-8.
For now, it will be recommended that any other encodings (or
misidentified fringe cases) be converted to UTF-8.
In later versions, a mechanism may be added to specify encoding on a
per-file basis, to add support for additional encodings or work around
misidentified fringe cases.
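A detection scheme along these lines can be sketched as follows (a Python illustration of the heuristic, not the actual parser code): attempt a strict UTF-8 decode first, and fall back to Windows-1252 only if that fails. The fringe cases mentioned above are Windows-1252 byte sequences that also happen to be valid UTF-8.

```python
def detect_and_decode(data: bytes) -> tuple[str, str]:
    """Decode file contents, auto-detecting UTF-8 vs. Windows-1252."""
    try:
        # Strict UTF-8 decode; ASCII is a subset of UTF-8, so
        # pure-ASCII input also takes this path.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: assume Windows-1252 (which agrees with
        # Latin-1 everywhere outside the 0x80..0x9F range).
        return data.decode("cp1252"), "cp1252"

print(detect_and_decode(b"plain ascii"))          # ('plain ascii', 'utf-8')
print(detect_and_decode(b"a\x80b"))               # ('a€b', 'cp1252')
print(detect_and_decode("a€b".encode("utf-8")))   # ('a€b', 'utf-8')

# Fringe case: "Ã©" saved as Windows-1252 yields bytes C3 A9, which is
# also valid UTF-8 (for "é") -- so it gets misidentified, matching the
# caveat above.
print(detect_and_decode("Ã©".encode("cp1252")))   # ('é', 'utf-8')
```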
(4) Character encoding of output (created via `#write` or `#debug`) will
be left undefined for now except for being a superset of ASCII.
In later versions, output may be more rigorously defined as either
UTF-8, Latin-1 or Windows-1252, or mechanisms may be added to choose
between these (and maybe other) encodings.
(5) Text primitives will use UCS encoding unless specified otherwise.
If the current syntax is used, the specified text will be interpreted as
a sequence of UCS codepoints (conformant with (2)), and looked up in one
of the Unicode CMAP tables in the font file. If the font file has no
Unicode CMAP tables, the character will first be converted from UCS to
the character set of an available CMAP table, and then looked up via
that table. If this also fails, the character will be rendered using the
replacement glyph (often a square).
To simplify conversion of old scenes, the text primitive syntax will be
extended to allow more control over the lookup process:
#declare MyText = "a\u20ACb";
text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }
This example will do the following:
- The sequence '\u20AC' in a string literal instructs POV-Ray to place
the UCS character U+20AC there (this syntax already exists), in this
case resulting in a string that contains `a` (U+0061) followed by a Euro
sign (U+20AC) followed by `b` (U+0062).
- Specifying `cmap { ... charset windows1252 }` instructs POV-Ray to
first convert each character to the Windows-1252 character set, in this
case giving character codes of 97 (`a`), 128 (the Euro sign) and 98
(`b`), respectively.
- Specifying `cmap { 3,0 ... }` instructs POV-Ray to use the CMAP table
with the specified PlatformID (in this case 3, identifying the platform
as "Microsoft") and PlatformSpecificID (in this case 0, identifying the
encoding as "Symbol").
- POV-Ray will take the character codes as per the charset conversion,
and use those directly as indices into the specified CMAP table to look
up the glyph (character shape), without any further conversion.
The `charset` setting will be optional; omitting it will instruct
POV-Ray to not do a character conversion, but rather use the UCS
character codes directly as indices into the CMAP table.
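The conversion step described above can be reproduced with any Windows-1252 codec; this Python snippet (illustrative only, POV-Ray would do the equivalent internally) yields exactly the character codes quoted:

```python
# `charset windows1252`: convert each UCS character to its
# Windows-1252 code before the CMAP lookup.
my_text = "a\u20acb"                      # a, Euro sign, b
codes = list(my_text.encode("cp1252"))    # cp1252 == Windows-1252
print(codes)                              # [97, 128, 98]

# With `charset` omitted, the UCS codepoints are used directly
# as CMAP indices:
ucs_codes = [ord(c) for c in my_text]
print(ucs_codes)                          # [97, 8364, 98]
```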
For now, the set of supported charsets will be limited to the following:
- Windows-1252
- Mac OS Roman
- UCS (implicit via omission)
Additional charsets will be supported by virtue of being subsets of the
above, e.g.:
- Latin-1 (being a subset of both Windows-1252 and UCS)
- ASCII (being a subset of all of the above)
(Note that all of the above refer to the respective character sets, not
encodings thereof.)
The syntax for the `charset` setting is not finalized yet. Instead of a
keyword as a parameter (e.g. `charset windows1252`) I'm also considering
using a string parameter (e.g. `charset "Windows-1252"`), or even
Windows codepage numbers (e.g. `charset 1252` for Windows-1252 and
`charset 10000` for Mac OS Roman). Comments highly welcome.
For scenes using `global_settings { charset utf8 }`, as well as scenes
using only ASCII characters, the above rules should result in unchanged
behaviour (except for the effects of a few bugs in the TrueType handling
code that I intend to eliminate along the way).
in news:5c2e6746@news.povray.org clipka wrote:
Sorry, can't comment on the whole character sets thing, it's beyond me,
> backward compatibility
but this got my attention.
I know that backwards compatibility always has been a big deal with POV-
Ray, but is breaking it and keeping all previous versions around as binary
and/or source not an easier way to do that? Maybe even as a single
download.
I would not mind having several versions on my system and having the IDE
automatically switch to the right engine based on the #version directive
in the scene.
Ingo
hi,
ingo <ing### [at] tagpovrayorg> wrote:
> in news:5c2e6746@news.povray.org clipka wrote:
> > backward compatibility
> I would not mind having several versions on my system and having the IDE
> automatically switch to the right engine based on the #version directive
> in the scene.
neat. and it could be done via a shell script/batch file, not needing any
modification to the command-line parser; in the script grep the version number
from the scene or ini file + run the corresponding program.
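That dispatch idea could be sketched as follows (Python for portability; the engine names and the regex are assumptions, and a real scene may set `#version` anywhere in the file):

```python
import re

# Hypothetical mapping from #version values to installed engine binaries.
ENGINES = {"3.6": "povray36", "3.7": "povray37", "3.8": "povray38"}

def pick_engine(scene_text: str, default: str = "povray38") -> str:
    """Return the engine name matching the scene's #version directive."""
    m = re.search(r"#version\s+(\d+\.\d+)", scene_text)
    return ENGINES.get(m.group(1), default) if m else default
    # The caller would then launch the returned binary with the scene file.

print(pick_engine("#version 3.7;\nsphere { 0, 1 }"))  # povray37
print(pick_engine("sphere { 0, 1 }"))                 # povray38 (no directive)
```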
regards ,jr.
Le 19-01-03 à 14:49, clipka a écrit :
> The internal reworking of the parser will make it necessary to overhaul
> the handling of character sets, in a way that will not be entirely
> compatible with earlier versions. Here are my current plans - comments
> welcome:
>
>
> (1) The current `global_settings { charset FOO }` mechanism (apparently
> introduced with v3.5) will be disabled.
>
> For backward compatibility it will still be accepted, but produce a
> warning. (Not sure yet if we want the warning for pre-v3.5 scenes,
> and/or an outright error for v3.8 and later scenes.)
maybe only a warning and ignore.
>
Will it be possible to directly use UTF-8 characters ?
After all, if you can directly enter characters like à é è ô ç (direct
access) or easily like €(altchar+e) ñ(altchar+ç,n) from your keyboard as
I just did, you should be able to use them instead of the cumbersome codes.
Alain
Am 04.01.2019 um 14:27 schrieb ingo:
> I know that backwards compatibility always has been a big deal with POV-
> Ray, but is breaking it and keeping all previous versions around as binary
> and/or source not an easier way to do that? Maybe even as a single
> download.
>
> I would not mind having several versions on my system and having the IDE
> automatically switch to the right engine based on the #version directive
> in the scene.
That would require a strict separation and stable interface between IDE
and render engine - which we currently don't really have.
Architecturally, the IDE and render engine are currently a rather
monolithic thing.
Even if we had such an interface, what you suggest would still require
at least some baseline maintenance on all the old versions we want to
support, so that they keep working as runtime and build environments
change over time. Cases in point: v3.6 had to be modified to play nice
with Windows Vista; and v3.7.0 had to be modified to play nice with
modern versions of the boost library, compilers conforming to modern C++
language standards, and modern versions of the Unix Automake toolset.
And then there is the issue of how to deal with legacy include files
used in new scenes. Switching parsers mid-scene is not a viable option.
Am 04.01.2019 um 19:18 schrieb Alain:
> Will it be possible to directly use UTF-8 characters ?
> After all, if you can directly enter characters like à é è ô ç (direct
> access) or easily like €(altchar+e) ñ(altchar+ç,n) from your keyboard as
> I just did, you should be able to use them instead of the cumbersome codes.
Short answer: The `\uXXXX` notation won't be necessary. I just used it
to avoid non-ASCII characters in my post.
Looooong answer:
It depends on what you're talking about.
First, let's get an elephant - or should I say mammoth - out of the
room: The editor component of the Windows GUI. It's old and crappy, and
doesn't support UTF-8 at all. It does support Windows-1252 though (at
least on my system; I guess it may depend on what locale you have
configured in Windows), which has all the characters you mentioned.
Now if you are using a different editor, using verbatim "UTF-8
characters" should be no problem: Enter the characters, save the file as
UTF-8, done.
The characters will be encoded directly as UTF-8, and the parser will
work with them just fine (provided you're only using them in string
literals or comments); no need for `\uXXXX` notation.
Alternatively, you could enter the same characters in the same editor,
and save the file as "Windows-1252" (or maybe called "ANSI" or
"Latin-1"), or enter them in POV-Ray for Windows and just save the file
without specifying a particular encoding (because you can't).
In that case the characters will be encoded as Windows-1252, and in most
cases the parser will also work with them just fine (again, string
literals or comments only); again no need for `\uXXXX` notation.
What the parser will do in such a case is first convert the
Windows-1252-encoded characters to Unicode, and then proceed in just the
same way.
For example:
#declare MyText = "a€b"; // a Euro sign between `a` and `b`
will create a string containing `a` (U+0061) followed by a Euro sign
(U+20AC) followed by `b` (U+0062), no matter whether the file uses UTF-8
encoding or Windows-1252 encoding. In both cases, the parser will
interpret the thing between `a` and `b` as U+20AC, even though in a
UTF-8 encoded file that thing is represented by the byte sequence hex
E2,82,AC while in a Windows-1252 encoded file it is represented by the
single byte hex 80.
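That byte-level difference is easy to verify (Python shown purely to illustrate the two encodings; the parser behaviour described above is the point):

```python
text = "a€b"

utf8_bytes = text.encode("utf-8")
cp1252_bytes = text.encode("cp1252")   # cp1252 == Windows-1252

print(utf8_bytes)    # b'a\xe2\x82\xacb' -> Euro sign is E2 82 AC
print(cp1252_bytes)  # b'a\x80b'         -> Euro sign is the single byte 80

# Both byte sequences decode back to the identical UCS string,
# which is what the parser ends up working with either way.
print(utf8_bytes.decode("utf-8") == cp1252_bytes.decode("cp1252"))  # True
```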
in news:5c3003db$1@news.povray.org clipka wrote:
> what you suggest would still require
> at least some baseline maintenance on all the old versions we want to
> support
Where lies the "break-even point"? How much does staying backwards
compatible cost, versus this other maintenance, with regard to the
ability / possibility to take bigger / different development steps? Now,
I know you can't put a percentage on that ;) Just me wondering, looking
at what happened in the Python world with 2 & 3. Yesterday I 'broke' my
mesh macros that are also in 3.7 by adding a dictionary and by changing
the way resolution is set...
Ingo
Le 19-01-04 à 21:04, clipka a écrit :
> Am 04.01.2019 um 19:18 schrieb Alain:
>> Will it be possible to directly use UTF-8 characters ?
>
> Short answer: The `\uXXXX` notation won't be necessary. I just used it
> to avoid non-ASCII characters in my post.
> [...]
Nice.
Am 03.01.2019 um 20:49 schrieb clipka:
> (5) Text primitives will use UCS encoding unless specified otherwise.
...
> To simplify conversion of old scenes, the text primitive syntax will be
> extended with a syntax allowing for more control over the lookup process:
>
> #declare MyText = "a\u20ACb";
> text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }
I think I will change that as follows:
text { ttf "sym.ttf" cmap { 3,0 charset 1252 } MyText }
with a few select charset numbers defined; most notably:
1252 Windows code page 1252
(for obvious reasons)
10000 Mac OS Roman
(because Windows supports this as code page 10000)
61440 MS legacy symbol font (Wingdings etc.) remapping to
Unicode Private Use Area U+F000..U+F0FF
(because 61440 = hex F000)
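The 61440 choice lines up with the Private Use Area convention used by legacy Microsoft symbol fonts; a small Python illustration (the remapping rule shown is my reading of the proposal, not shipped POV-Ray behaviour):

```python
# 61440 is hex F000 -- the base of the Unicode Private Use Area
# range U+F000..U+F0FF that MS legacy symbol fonts map into.
print(61440 == 0xF000)  # True

def symbol_remap(char_code: int) -> int:
    """Remap a legacy symbol-font code (0..255) into U+F000..U+F0FF."""
    return 0xF000 + (char_code & 0xFF)

# e.g. symbol code 0x41 would look up the glyph at U+F041 in the
# font's (3,0) "Symbol" CMAP table.
print(hex(symbol_remap(0x41)))  # 0xf041
```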
hi,
clipka <ano### [at] anonymousorg> wrote:
> To simplify conversion of old scenes, the text primitive syntax will be
> extended with a syntax allowing for more control over the lookup process:
>
> #declare MyText = "a\u20ACb";
> text { ttf "sym.ttf" cmap { 3,0 charset windows1252 } MyText }
with alpha.10008988, and same code as in other thread modified to read:
#version 3.8;
.....
text { ttf "arialbd.ttf" cmap { 1,0 charset utf8 S }
.....
I get the following error:
File 'pav-patt.pov' line 61: Parse Warning: Text primitive 'cmap' extension is
experimental and may be subject to future changes.
File 'pav-patt.pov' line 61: Parse Error: Expected 'numeric expression', utf8
found instead
Fatal error in parser: Cannot parse input.
Render failed
same for 'ascii'.
regards, jr.