POV-Ray : Newsgroups : povray.unofficial.patches : jr's large csv and dictionary segfault in the povr fork. (Message 1 to 9 of 9)
From: William F Pokorny
Subject: jr's large csv and dictionary segfault in the povr fork.
Date: 26 Mar 2023 02:46:56
Message: <641fea60$1@news.povray.org>
A status update.

---
First, let me put on the table that the segfault which happens here
should not be a segfault during a non-debug run! Someone had already
added code comments in the cylinder and cone code to this effect.

It's good fortune - of sorts - we get the segfaults in povr, but it's 
only happening because no POV-Ray source coder got around to changing 
that throw into a warning or possible error message. ;-)

---
I've had no luck finding the core problem.

The segfault comes and goes depending seemingly on a lot of things - 
which makes me wonder if it isn't sitting there in official POV-Ray 
releases too and we've just not tripped it.

The really odd fails seen, I've been unable to reproduce even once. At 
the moment I'm chalking those up to my fatigue and potential 
configuration and set up issues as I tried to run through just parts of 
the animation.

---
I now have a changed foreach.inc file where I've added checks for
duplicate macro-to-be-executed strings and duplicate i_ indexes, and it
has been working for me without fail for a while!

I suspect it's because I've slowed the whole animation down to almost
exactly the time it takes p380 beta 2 to do it - and I've never seen a
fail there either.

The code changed is now:

#macro fore_voidRun(a_)
   #for (i_, 0, dict_.ttl_ - 1)

     // With this check in, I have not seen a fail myself.
     #ifndef (Last_i_)
         #declare Last_i_ = i_;
     #else
         #if (Last_i_=i_)
             #local DummyID = f_boom(9,8,7,6,5,4);
         #end
         #declare Last_i_ = i_;
     #end

     #if (i_)
       fore_cmdNext(dict_)
     #end
     #local cmd_ = fore_cmdStr();

     // Below code found a duplicate once - and only once.
     #ifndef (LastCmdStr)
         #declare LastCmdStr=cmd_
     #else
         #if (strcmp(LastCmdStr,cmd_)=0)
             #local DummyID = f_boom(1,2,3,4,5,6);
         #end
         #declare LastCmdStr=cmd_
     #end

     fore_exec(cmd_,"parse_fore_void.tmp")
     #if (fore_debug)
       #debug concat("called '",cmd_,"'.\n")
     #end
   #end
#end

So it feels like a race issue of some kind, but I cannot see how it can
even come about... Each frame is indeed a new parser thread, but nothing
from the prior frame's thread should carry forward. I thought for a
while it was a file I/O issue, but one fail on the macro string checking
suggests it does have something to do with the dictionaries or the
dictionary set up as you, jr, suspected.

---
On the animation itself. We are basically re-building the scene for each 
frame of the animation with early frames having very few objects and the 
last ones almost 9000. A curiosity I don't understand is the 
usaf__mkItems(4130,a_[4130],tmp_) package indexing goes well to values 
over 10000 on occasion. These often peak at those levels and then are 
not seen again - maybe these all come from the partial runs where the 
animation re-starts? Done so many of these things my head is mush on 
what's what.

Anyway, the whole code set up gets slower and slower as we get into the
later frames. Some is because the parsing just gets big/long, but it's 
also because the dictionary gets slower. The dictionary is built on top 
of the parser symbol table mechanism and, while I've not dug to be 
certain, my bet is some of the chains hanging off the hash table entries 
are getting long and slow.
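
To illustrate the long-chain suspicion, here is a minimal C++ sketch of a chained hash table with a fixed number of buckets. It is not POV-Ray's symbol table code; `ChainedTable`, its bucket count and its methods are hypothetical names, used only to show why lookups slow down as entries pile up behind the same buckets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical chained hash table with a fixed bucket count.
// An illustration only - not the POV-Ray symbol table implementation.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t bucketCount) : buckets_(bucketCount) {
        assert(bucketCount > 0);
    }

    // Insert or update; walks the whole chain of the key's bucket.
    void set(const std::string& key, double value) {
        auto& chain = buckets_[std::hash<std::string>{}(key) % buckets_.size()];
        for (auto& entry : chain) {
            if (entry.first == key) { entry.second = value; return; }
        }
        chain.emplace_back(key, value);
        ++size_;
    }

    std::size_t size() const { return size_; }

    // Longest chain: the worst-case number of comparisons for one lookup.
    std::size_t longestChain() const {
        std::size_t longest = 0;
        for (const auto& chain : buckets_) longest = std::max(longest, chain.size());
        return longest;
    }

private:
    std::vector<std::list<std::pair<std::string, double>>> buckets_;
    std::size_t size_ = 0;
};
```

With, say, 16 buckets and 1600 distinct identifiers, at least one chain must hold 100 or more entries, so every lookup landing in that bucket degrades toward a linear scan.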

I don't know enough about what is changing in the animation to suggest 
it, but, if there are larger fixed portions on which we are building and 
building, it would almost certainly be faster to keep those fixed parts 
as larger include files we bring in all at once - as raw SDL.

I suppose the suggestion is neither here nor there with respect to the
problem - which is certainly real.

I'll keep playing with it as I have time - and I feel like it - but 
given that even starting at say frame 3000 isn't saving me all that much 
time for each animation turn - this is a very painful bug to chase.

My plan at the moment is to kick off the full hour plus long animation 
(tiny image sizes) with the added delay in place - as I think to do it. 
At days end perhaps.

It's not failed in a long while for me, but it's slow even at tiny image 
sizes so I haven't really run that many full passes.

If it doesn't fail again for me over time - the delay / race condition 
idea perhaps holds enough water for me to start putting in random locks 
as I can maybe see places to insert them. The trouble has been that I
perturb things and the problem blinks in and out for I do not know
what cause - if any of it is really my doing. I cannot come up with a solid way
to come at this problem! It's too flaky - and it takes a painful amount 
of time to try anything.

Yes, I'm whining now and should stop. :-)

Bill P.



From: jr
Subject: Re: jr's large csv and dictionary segfault in the povr fork.
Date: 26 Mar 2023 06:00:00
Message: <web.6420164299c558654301edef6cde94f1@news.povray.org>
hi,

will (have to) re-read the whole post later.

William F Pokorny <ano### [at] anonymousorg> wrote:
> ...
> The segfault comes and goes ...
> The really odd fails seen, ...
> thought for a while it was file I/O issue, but one fail on the macro
> string checking suggests it does have something to do with the
> dictionaries or the dictionary set up as you, jr, suspected.

wondering whether it's possible we're "out-running" the (C++) garbage collector,
and when the macro is called, it then sees "inappropriate" memory on occasion?


regards, jr.



From: William F Pokorny
Subject: Re: jr's large csv and dictionary segfault in the povr fork.
Date: 26 Mar 2023 09:47:08
Message: <64204cdc$1@news.povray.org>
On 3/26/23 05:55, jr wrote:
> wondering whether it's possible we're "out-running" the (C++) garbage collector,
> and when the macro is called, it then sees "inappropriate" memory on occasion?

There isn't a garbage collector in C++/C. Memory is allocated when 
needed and released when no longer needed.
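
As a small illustration of that point, the C++ sketch below shows memory being released at a fixed, predictable place (scope exit) rather than whenever a collector gets around to it. `Tracked` and `liveInsideScope` are hypothetical names invented for this example.

```cpp
#include <memory>

// Hypothetical stand-in for any heap-allocated object.
// It bumps a counter while alive so the release point can be observed.
struct Tracked {
    explicit Tracked(int& liveCount) : live(liveCount) { ++live; }
    ~Tracked() { --live; }
    Tracked(const Tracked&) = delete;
    Tracked& operator=(const Tracked&) = delete;
    int& live;
};

// Returns the number of live objects seen while inside the scope.
int liveInsideScope(int& liveCount) {
    std::unique_ptr<Tracked> owner(new Tracked(liveCount));  // allocated when needed
    return liveCount;  // the value is copied out before 'owner' is destroyed
}   // 'owner' leaves scope here: the destructor runs and the memory is freed
```

Calling `liveInsideScope(count)` with `count` starting at zero returns 1, and `count` is back at 0 as soon as the call returns. There is no deferred collection step to out-run.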

Your question, though, is still a good one. It takes considerable time
to walk through all the allocations and free them, more or less
unwinding all the allocations(a). Nothing new parser related should
proceed until the memory is freed, but...

Today the memory free up and re-allocation happens in a big way when we 
move frame to frame in an animation because one parsing thread goes 
away(a) and another gets created for the next frame. Parsing itself is 
always single threaded, unlike most other parts of POV-Ray, so we should 
not see multi-threading issues per-se.

What I too suspect is that we are perhaps sometimes seeing not quite (or
perhaps incorrectly) initialized new parser memory that still contains
data from the previous parser thread. This could explain why once we see 
fail points, they sometimes repeat that fail signature for a while.
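
A minimal C++ sketch of that suspicion follows. It is only a model: `Slot` and `RecyclingPool` are hypothetical and nothing here is taken from the POV-Ray sources. It shows how recycled storage that is handed out without being reset still carries the previous user's data.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record standing in for per-identifier parser data.
struct Slot {
    int typeTag = 0;     // e.g. 0 = unset, 1 = float, 3 = vector
    double value = 0.0;
};

// Hypothetical pool that recycles slots from one "frame" to the next.
class RecyclingPool {
public:
    explicit RecyclingPool(std::size_t count) : slots_(count) {
        for (std::size_t i = 0; i < slots_.size(); ++i) free_.push_back(&slots_[i]);
    }
    RecyclingPool(const RecyclingPool&) = delete;
    RecyclingPool& operator=(const RecyclingPool&) = delete;

    // Hand out a recycled slot. With reset == false the old contents are
    // kept, which is the mistake being illustrated.
    Slot* acquire(bool reset) {
        if (free_.empty()) return nullptr;
        Slot* slot = free_.back();
        free_.pop_back();
        if (reset) *slot = Slot{};
        return slot;
    }

    // Return a slot to the pool; its contents are left untouched.
    void release(Slot* slot) { free_.push_back(slot); }

private:
    std::vector<Slot> slots_;
    std::vector<Slot*> free_;
};
```

If one user sets `typeTag = 3`, releases the slot, and the next user acquires it with `reset == false`, the stale tag is still there. That would fit a fail signature repeating for a while.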

Aside: I've gotten another two complete povr animation passes through 
with those changes to foreach.inc. Magic, but still real magic! FWIW. :-)

Bill P.

(a) - Back in my working years we were using a large, internally 
developed, interactive tool. On its conversion to C++ we got frustrated
because it took forever to exit the application as the memory was 
painstakingly released bit by bit. The developers solved the problem by 
intentionally crashing out of the application and letting the OS clean 
up the process related memory! ;-)

Anyhow. There is a performance cost to maintaining a, sort of, minimum
memory footprint over time (as there is too for garbage collection
memory management when it kicks in). I've wondered how much time we are 
burning doing memory management alone. Plus C++, because it tends to 
allocate as needed, ends up with bits and pieces of things all over the 
place in physical memory where it would be much better for performance 
if related memory were allocated (or re-allocated) in big contiguous 
blocks. Newer C++ versions have features intended to help with this 
memory fragmentation issue. Ah, whatever I guess. All still well down on 
my todo / toplaywith list.



From: jr
Subject: Re: jr's large csv and dictionary segfault in the povr fork.
Date: 26 Mar 2023 13:40:00
Message: <web.6420829499c558654301edef6cde94f1@news.povray.org>
hi,

William F Pokorny <ano### [at] anonymousorg> wrote:
> ...
> Aside: I've gotten another two complete povr animation passes through
> with those changes to foreach.inc. Magic, but still real magic! FWIW. :-)

quick <blush> in public..  :-)


regards, jr.



From: William F Pokorny
Subject: Re: jr's large csv and dictionary segfault in the povr fork.
Date: 4 Oct 2023 06:50:02
Message: <651d435a$1@news.povray.org>
On 3/26/23 02:46, William F Pokorny wrote:
> A status update.

FWIW. Another status update ahead of some time away.

---
As I was running test cases ahead of another povr tarball release, I 
started getting a parse error from inside the HF_Torus() macro on a test 
case running cleanly for years.

The failing behavior, and the tendency to run cleanly again on most any
SDL change slowing down the parsing, is similar to the issue in this thread.

The new bit I see is an existing, macro local, identifier changing type 
from a 3D vector to a float - while the internal macro loops run. This, 
of course, shouldn't happen and I still don't know why it is.

There's a good deal of expression parsing going on in the loops -
especially '.' vector element accesses. Most of that parsing is in
the parser code, but some of it is happening in the VM too due to a
function call. Thinking aloud...

Bill P.



From: jr
Subject: Re: jr's large csv and dictionary segfault in the povr fork.
Date: 4 Oct 2023 11:40:00
Message: <web.651d868399c55865b180e2cc6cde94f1@news.povray.org>
hi,

William F Pokorny <ano### [at] anonymousorg> wrote:
> On 3/26/23 02:46, William F Pokorny wrote:
> > A status update.
> ...
> The new bit I see is an existing, macro local, identifier changing type
> from a 3D vector to a float - while the internal macro loops run. This,
> of course, shouldn't happen and I still don't know why it is.

still, (much) more information than before.  nice.


> There's a good deal of expression parsing going on in the loops -
> especially '.' vector element accesses. Most of that parsing being in
> the parser code, but some of it is happening in the VM too due a
> function call. Thinking aloud...

out of interest, do macros and their respective local storage form units
("objects"), or are they married up "on demand" ?


regards, jr.



From: William F Pokorny
Subject: Re: jr's large csv and dictionary segfault in the povr fork.
Date: 4 Oct 2023 14:12:02
Message: <651daaf2$1@news.povray.org>
On 10/4/23 11:36, jr wrote:
> out of interest, do macros and their respective local storage form units
> ("objects"), or are they married up "on demand" ?

Suppose more the latter. There isn't really local macro storage / or a 
local macro stack (excepting where VM functions are used with macros).

My current understanding, hopefully not too badly described:

There is a, local to each macro when running, symbol table for #local 
declared things (*,**) and parameters (always true..?). The table 
entries point to created / stored things which might or might not 
persist beyond the macro call depending on whether they are assigned to 
an identifier in a calling level of hierarchy(b).

---
Function calls, whether inside or outside macros, are different.

For inbuilt functions like f_sphere() there is a virtual machine (VM) 
stack for passed and returned variables for each function call - and 
another stack used by the compiler for C++ variables within the inbuilt 
code.

For user functions (compiled at parse time and run on the VM) there is
just the (VM) stack (lies and more lies... I know) from the SDL user's
perspective.

Bill P.

(*) - Something I noticed on starting this recent debugging and that 
I've fixed in my povr copies of the HF* macros! These old HF* macros 
switch from using #local to using #declare for some variables near the 
bottom of each macro for reasons unknown...

...
-            #declare PArr[J][K] = P + H*Dir*Depth;
+            #local PArr[J][K] = P + H*Dir*Depth;

-            #declare K = K+1;
+            #local K = K+1;
          #end
-        #declare J = J+1;
+        #local J = J+1;
      #end

      HFCreate_()
...

This makes for confusing code, but it doesn't break things because the 
identifier is seen as already defined locally... In other words, those 
#declares don't create new 'global' identifiers, but rather, they 
redefine the locally defined identifier.

What this means more generally is we cannot arbitrarily create an 
identifier in the global name space with #declare where it has first 
been defined (added to the symbol table) with #local in the local macro 
space.
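
The behavior described in (*) can be modeled with a stack of symbol tables. The C++ sketch below is a hypothetical model, not the parser's actual data structures; `SymbolTables` and its methods are names invented here. It assumes #declare first looks for an existing identifier, innermost table first, and only creates a global one when none is found.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model of a stack of symbol tables - not POV-Ray's parser code.
// tables_[0] plays the role of the global table; each running macro pushes one.
class SymbolTables {
public:
    SymbolTables() : tables_(1) {}

    void enterMacro() { tables_.emplace_back(); }
    void leaveMacro() { if (tables_.size() > 1) tables_.pop_back(); }

    // #local: define or redefine in the innermost (current) table.
    void local(const std::string& id, double value) { tables_.back()[id] = value; }

    // #declare: if the identifier already exists in a visible table,
    // innermost first, redefine it there; otherwise create it globally.
    void declare(const std::string& id, double value) {
        for (auto it = tables_.rbegin(); it != tables_.rend(); ++it) {
            auto found = it->find(id);
            if (found != it->end()) { found->second = value; return; }
        }
        tables_.front()[id] = value;
    }

    bool isGlobal(const std::string& id) const { return tables_.front().count(id) != 0; }

    // Innermost-first lookup; returns false if the identifier is not visible.
    bool lookup(const std::string& id, double& out) const {
        for (auto it = tables_.rbegin(); it != tables_.rend(); ++it) {
            auto found = it->find(id);
            if (found != it->end()) { out = found->second; return true; }
        }
        return false;
    }

private:
    std::vector<std::map<std::string, double>> tables_;
};
```

In this model, `local("K", 0)` followed by `declare("K", 1)` inside a macro updates the local K and leaves the global table without a K, and K disappears when the macro's table is popped. With no macro running, `local` writes to the only table there is, which matches the guess in (**) but is equally untested.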

(**) - Something foggy still for me. I 'believe' it is still true that 
#local definitions sitting unwrapped by a macro in the top level scene 
file act as global #declares, but, I've not tested this with the new 
local and global dictionary access qualifiers in v3.8. Maybe at the top 
scene file the local and global dictionaries become the same thing?



From: jr
Subject: Re: jr's large csv and dictionary segfault in the povr fork.
Date: 6 Oct 2023 13:10:00
Message: <web.65203e6899c55865b180e2cc6cde94f1@news.povray.org>
William F Pokorny <ano### [at] anonymousorg> wrote:
> On 10/4/23 11:36, jr wrote:
> > out of interest, do macros and their respective local storage form units
> Suppose more the latter. There isn't really local macro storage / or a
> local macro stack (excepting where VM functions are used with macros).

hey, thanks.

makes one speculate that sometimes then perhaps it's something "silly", like
some pointer not updated (or incorrectly).


> ...
> identifier is seen as already defined locally... In other words, those
> #declares don't create new 'global' identifiers, but rather, they
> redefine the locally defined identifier.
> What this means more generally is we cannot arbitrarily create an
> identifier in the global name space with #declare where it has first
> been defined (added to the symbol table) with #local in the local macro
> space.

interesting, thanks.  (I get much good info from your "musings" :-))


regards, jr.



From: William F Pokorny
Subject: Re: jr's large csv and dictionary segfault in the povr fork.
Date: 7 Oct 2023 08:22:55
Message: <65214d9f$1@news.povray.org>
On 10/4/23 06:50, William F Pokorny wrote:
> The new bit I see is an existing, macro local, identifier changing type 
> from a 3D vector to a float - while the internal macro loops run. This, 
> of course, shouldn't happen and I still don't know why it is.

OK. I finally ran this bug down!

For any POV-Ray fork, or official release, using clipka's newer 
parser(*), change the line

	int Temp_Count=3000000;

to 	
	
	POV_LONG Temp_Count=9223372036854775807-1;

in the file parser.cpp and within Parser::Parse_RValue(...).


--- More detail for those interested.

There is code which counts the delta in the number of tokens found after
seeing callable Identifiers / macro parameters. It involves the variable
Temp_Count, a variable I suspect was long ago initialized to a value
thought to be many times larger than probable configuration values for
TOKEN_OVERFLOW_RESET_COUNT.

The original coders made use of the token counter used for periodic
parser status messages, rather than the global token counter, when
calculating the delta in tokens parsed. This approach is a little ugly
in that it requires extra code for handling the reset/wrap cases which
will frequently happen when TOKEN_OVERFLOW_RESET_COUNT is relatively
small.
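
To make the reset/wrap point concrete, here is a minimal C++ sketch comparing the two ways of taking a token delta. It is an assumption-laden model, not the parser's code: it supposes the status counter runs from 0 to `period - 1` and then wraps, and that fewer than `period` tokens pass between the two samples. `wrappedDelta` and `plainDelta` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Delta for a counter that wraps back to 0 after reaching period - 1.
// Assumes fewer than 'period' tokens were parsed between the two samples.
std::int64_t wrappedDelta(std::int64_t start, std::int64_t now, std::int64_t period) {
    assert(period > 0);
    assert(start >= 0 && start < period);
    assert(now >= 0 && now < period);
    return (now - start + period) % period;
}

// Delta for a counter that only ever increases: no special cases needed.
std::int64_t plainDelta(std::int64_t start, std::int64_t now) {
    return now - start;
}
```

For example, with a period of 2500 a sample of 2499 followed by 0 is a delta of one token, which the wrapping version has to recover explicitly. The monotonic counter needs no such handling, but its values grow without bound.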

Christoph, while otherwise nicely re-factoring / cleaning up the older
parser code, switched to using the global token counter straight up.
This global counter has values which often run over/past the default
Temp_Count initialization of three million.

So... Once in a thousand blue moons, we call the 
Parser::Parse_RValue(...) code when the global token count is exactly 
three million and the delta in tokens found happens to be one.

On stepping on that landmine, code which should not run, does. This 
almost always results in an update to the wrong identifier type and a 
corruption of identifier associated data too.
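
The landmine can be modeled in a few lines of C++. This is a simplified, hypothetical sketch and not the actual Parser::Parse_RValue(...) code: `treatedAsSingleIdentifier` and the two constants are names made up here, and the exact comparison in the real parser may differ. The idea is that an untouched initial value is indistinguishable from a real counter snapshot once the global counter can reach it.

```cpp
#include <cstdint>

// Simplified, hypothetical model of the check - not the real parser code.
// 'start' is either a snapshot of the token counter, or the untouched
// initial value when no snapshot was taken on this call. A difference of
// exactly one is read as "a single identifier token was parsed".
bool treatedAsSingleIdentifier(std::int64_t start, std::int64_t tokenCountNow) {
    return tokenCountNow - start == 1;
}

// Initial values before and after the suggested change.
const std::int64_t kOldInitialValue = 3000000;
const std::int64_t kNewInitialValue = 9223372036854775807LL - 1;
```

In this model `treatedAsSingleIdentifier(kOldInitialValue, 3000001)` is true even though no snapshot was ever taken, while the same call with `kNewInitialValue` is false for any token count a real parse could reach. A genuine snapshot, such as 2999 followed by 3000, still reports a single token as intended.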

A hideous bug. It's likely we didn't always know we'd tripped it. Only
when the parser core dumped or stopped on some parser error was it clear
something had gone wrong. I expect we all too often got weird parsing
behavior or an odd image result instead. Thinking it was something we
did, we'd twiddle with the SDL. With the updated SDL the bug would go
away - or, worse, perhaps move to another identifier with different
end effects.

Further, where the parser didn't stop, the issue would often self heal 
the next time the identifier was redefined / updated (as in a loop) 
because the parser would realize the assigned value was indeed, say a 
vector, and not a float or whatever...

Animations - especially ones growing / changing frame to frame - are 
more likely to trip this bug simply for having more chances at it.

I was wrong that slowing down the parsing helped with this bug. I was 
simply changing the SDL parser's token counts enough to avoid it.

Bill P.


(*) - The v3.8 beta code backed off to an older v3.7 / v3.8 version of 
the parser which still used the token counter used for parser message 
updates regarding how many tokens have been parsed thus far. It should 
be OK with default builds excepting a couple VERY narrow exposures.

That said, I'd recommend all official code make the update above too! 
The v3.8 beta 1&2 code (and I'd bet most official POV-Ray versions...) 
are narrowly exposed:

- Should builders twiddle with the configuration variable 
TOKEN_OVERFLOW_RESET_COUNT in unlucky ways.

- Should the token delta count align in on an unfortunate harmonic with 
the relatively low default value of 2500 for TOKEN_OVERFLOW_RESET_COUNT. 
I think this not likely to happen in typical SDL.

- Should the type cast from the master POV_LONG (long long int) token
count to 'int' itself cause another problem - or still trigger this
one in some roundabout way.



Copyright 2003-2023 Persistence of Vision Raytracer Pty. Ltd.