POV-Ray : Newsgroups : povray.programming
  SIMD implementation of dot-product in POV-Ray??? (Message 6 to 15 of 15)  
From: Chris Huff
Subject: Re: SIMD implementation of dot-product in POV-Ray???
Date: 27 Nov 1999 07:14:11
Message: <271119990714159449%chrishuff_99@yahoo.com>
In article <383FC8FF.5C5AC270@tidax.se>, Goran Begicevic
<gor### [at] tidaxse> wrote:

> > Wouldn't that make the program processor dependent? I thought that
> > the teams wanted their work to be portable.

> 
> Who cares as long as it's faster.

A lot of people, actually. And the precision loss would create other
problems.


> Most of new POV-patches are processor dependent anyway.

Which ones? I can only think of the #system patch, which is actually OS
dependent, not processor dependent. I don't think there has been a single
processor-dependent patch (except maybe PVMPOV).

-- 
Chris Huff
e-mail: chr### [at] yahoocom
Web page: http://chrishuff.dhs.org/



From: Goran Begicevic
Subject: Re: SIMD implementation of dot-product in POV-Ray???
Date: 27 Nov 1999 07:16:28
Message: <383FCB6A.B6ABEC10@tidax.se>
> 
> This issue comes up every month or so; search a bit back through the
> newsgroups and you will find your question answered.


What was the conclusion on this issue in older threads?

> I doubt that just improving the dot product will speed things up in any
> noticeable range at all.

Well, run POV in a profiler and take a look at where it's spending most of
its time.
 
> 
> By default double uses 64 bits on x86. And there are good reasons to have
> this precision.

Yes, I'm sorry, I mixed it up with 'long double'. It has been a long time
since I last programmed.
 
> This is taken from the AMD 3DNow SDK matrix (thus it is AMD's SIMD FPU
> extension, not Intel's), but for this purpose it will be enough:
> 
> ALIGN   32
> PUBLIC  _a_dot_vect
> _a_dot_vect PROC
>         movq        mm0,[eax]
>         movq        mm3,[edx]
>         movd        mm1,[eax+8]
>         movd        mm2,[edx+8]
>         pfmul       mm0,mm3
>         pfmul       mm1,mm2
>         pfacc       mm0,mm0
>         pfadd       mm0,mm1
>         ret
> _a_dot_vect ENDP

Neat, thanks. Unfortunately, I don't own an AMD processor. I'll try to get
one of those Athlons, though.

Now, I'm not very skilled in assembler. How wide are the mm0-mm3 registers?
Is this done on 32-bit 'float' variables?

As far as I've heard, Intel's implementation of the dot product is even more
'automated', so you don't need to multiply registers 'by hand'. It's all
done in one instruction.

> As you can see, making this change is rather trivial.  The problem is that you will
> need two versions of POV-Ray, one for AMD's extension and one for Intel's.

Ah... that's the smallest problem.

> You do.  Define DBL as float and watch POV-Ray "hang" in several functions
> because of the missing precision.

Note that this is not my idea of how this should be done. I would keep
all calculations as they are and just rewrite the dot-product function.

'double' would be converted into float prior to calculations and then
converted back.
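A minimal sketch of what "convert to float, compute, convert back" does to precision. The `DBL` and `VECTOR` types are modelled on POV-Ray's and redeclared here as an assumption so the example compiles on its own; `dot_via_float` is a hypothetical name, and it does the float arithmetic in plain C rather than with SIMD instructions, which is enough to show what the conversion throws away:

```c
typedef double DBL;        /* POV-Ray's default scalar type (assumed) */
typedef DBL VECTOR[3];

/* Ordinary double-precision dot product. */
static DBL dot_double(const VECTOR a, const VECTOR b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Hypothetical variant: narrow the inputs to 32-bit float (what a
   3DNow!/SSE unit would operate on), multiply and add in float, then
   widen the result back to double. */
static DBL dot_via_float(const VECTOR a, const VECTOR b)
{
    float fa[3], fb[3];
    for (int i = 0; i < 3; i++) {
        fa[i] = (float)a[i];
        fb[i] = (float)b[i];
    }
    float r = fa[0] * fb[0] + fa[1] * fb[1] + fa[2] * fb[2];
    return (DBL)r;
}
```

With `a = {1.0 + 1e-9, 0, 0}` and `b = {1, 0, 0}`, `dot_double` keeps the tiny offset while `dot_via_float` returns exactly `1.0`, because a float carries only about 7 significant decimal digits. The conversions themselves also cost time, which eats into any SIMD gain.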

Well, we'll never know if we never try, right?



From: Thorsten Froehlich
Subject: Re: SIMD implementation of dot-product in POV-Ray???
Date: 27 Nov 1999 11:24:50
Message: <384005d2@news.povray.org>
In article <383FCB6A.B6ABEC10@tidax.se> , Goran Begicevic <gor### [at] tidaxse>
wrote:

>>
>> This issue comes up every month or so, serach a bit back through the
>> newsgroups and you will find your question answered.
>

> What was the conclusion on this issue in older threads?

In short: the precision is not good enough. In addition, improving high-level
algorithms usually gives a more significant speedup without having to use
assembler.

>> I doubt that just improving the dot product will speed things up in any
>> noticeable range at all.
>
> Well, run POV in a profiler and take a look at where it's spending most of
> its time.

Hmm, did you ever do that?  A profiler will show you in which functions the
time is spent, but all vector operations in POV-Ray are macros, so they never
show up as functions of their own.
Whenever I profiled, I found that POV-Ray spends a lot of time doing memory
allocations...
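To illustrate why a profiler cannot point at the dot product directly: a macro is expanded by the preprocessor into each caller, so its cost is charged to that caller. The sketch below is illustrative only; `VDOT_MACRO`, `vdot_function` and `project_length` are hypothetical names, not POV-Ray's actual vector macros, and the `DBL`/`VECTOR` types are assumed:

```c
typedef double DBL;
typedef DBL VECTOR[3];

/* Macro form: pasted into every call site, so a profile has no
   separate "dot product" entry to report. */
#define VDOT_MACRO(r, a, b) \
    { (r) = (a)[0] * (b)[0] + (a)[1] * (b)[1] + (a)[2] * (b)[2]; }

/* Function form: gets its own profile entry (unless the compiler
   decides to inline it). */
static DBL vdot_function(const VECTOR a, const VECTOR b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* A caller using the macro: in a flat profile, the dot product's
   cost is counted as time spent in this function. */
static DBL project_length(const VECTOR v, const VECTOR axis)
{
    DBL d;
    VDOT_MACRO(d, v, axis);
    return d;
}
```

A flat profile of code written this way lists `project_length` (and every other caller) but nothing for `VDOT_MACRO`, so the total time spent in dot products has to be inferred rather than read off.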

> Now, I'm not very skilled in assembler. How wide are the mm0-mm3 registers?
> Is this done on 32-bit 'float' variables?

Yes, all the SIMD FPU instructions operate on 32-bit floats; there are no
64-bit float SIMD instructions.
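To make the quoted 3DNow! routine easier to follow, here is a plain C sketch that mimics its data flow. 3DNow! reuses the 64-bit MMX registers, each holding two 32-bit floats, so each register is modelled as a pair of floats. The struct and function names are hypothetical, and `a`/`b` stand in for the vectors the routine reads through `eax` and `edx`:

```c
/* One 64-bit MMX register as used by 3DNow!: two packed 32-bit floats. */
typedef struct {
    float lo;   /* bits 0-31  */
    float hi;   /* bits 32-63 */
} mmx_reg;

/* Hypothetical C model of the quoted _a_dot_vect routine. */
static float a_dot_vect_model(const float a[3], const float b[3])
{
    mmx_reg mm0 = { a[0], a[1] };   /* movq  mm0,[eax]   : x, y          */
    mmx_reg mm3 = { b[0], b[1] };   /* movq  mm3,[edx]                   */
    mmx_reg mm1 = { a[2], 0.0f };   /* movd  mm1,[eax+8] : z, high = 0   */
    mmx_reg mm2 = { b[2], 0.0f };   /* movd  mm2,[edx+8]                 */

    /* pfmul: multiply both halves independently */
    mm0.lo *= mm3.lo;  mm0.hi *= mm3.hi;   /* (ax*bx, ay*by) */
    mm1.lo *= mm2.lo;  mm1.hi *= mm2.hi;   /* (az*bz, 0)     */

    /* pfacc mm0,mm0: each half becomes the sum of mm0's two halves */
    float xy = mm0.lo + mm0.hi;
    mm0.lo = xy;  mm0.hi = xy;

    /* pfadd mm0,mm1: add the halves pairwise */
    mm0.lo += mm1.lo;  mm0.hi += mm1.hi;

    return mm0.lo;   /* the routine leaves the result in mm0's low half */
}
```

For `a = {1, 2, 3}` and `b = {4, 5, 6}` this yields `32.0f`, the same as a scalar dot product. Every step stays in 32-bit floats, which is where the precision concern in this thread comes from.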

> As far as I've heard, Intel's implementation of the dot product is even more
> 'automated', so you don't need to multiply registers 'by hand'. It's all
> done in one instruction.

I am not very familiar with x86 assembler.

>> You do.  Define DBL as float and watch POV-Ray "hang" in several functions
>> because of the missing precision.
>
> Note that this is not my idea of how this should be done. I would keep
> all calculations as they are and just rewrite the dot-product function.
>
> 'double' would be converted into float prior to calculations and then
> converted back.

I am not sure you can easily move data from the SISD FPU registers to the
SIMD FPU registers; that might take more time than the actual SISD calculation.

> Well, we'll never know if we never try, right?

Well, of course nothing is keeping you from trying it.  Just don't be
too disappointed if you don't see any speedup.


       Thorsten



From: Mark Gordon
Subject: Re: SIMD implementation of dot-product in POV-Ray???
Date: 27 Nov 1999 11:40:40
Message: <38400964.320004BA@mailbag.com>
Chris Huff wrote:
>  
> Which ones? I can only think of the #system patch, which is actually OS
> dependent, not processor dependent. I don't think there has been a single
> processor-dependent patch (except maybe PVMPOV).

I'm pretty sure PVMPOV is more OS-dependent (it needs some flavor of Unix)
than CPU-dependent.  I've only used it on x86-Linux myself, but I'd be
surprised if it didn't work on SPARC-Solaris, for instance, assuming you
can get PVM for SPARC-Solaris.

-Mark Gordon



From: Thomas Willhalm
Subject: Re: SIMD implementation of dot-product in POV-Ray???
Date: 29 Nov 1999 05:07:53
Message: <qqmn1rx626f.fsf@goldach.fmi.uni-konstanz.de>
Mark Gordon <mtg### [at] mailbagcom> writes:

> Chris Huff wrote:
> >  
> > Which ones? I can only think of the #system patch, which is actually OS
> > dependent, not processor dependent. I don't think there has been a single
> > processor-dependent patch (except maybe PVMPOV).
> 
> I'm pretty sure PVMPOV is more OS-dependent (needs some flavor of Unix)
> than CPU-dependent.  I've only used it on x86-Linux myself, but I'd be
> surprised if it didn't work on SPARC-Solaris, for instance, assuming you
> can get PVM for SPARC-Solaris.

You can get it from ftp://netlib2.cs.utk.edu/pvm3/ and it works with PVMPOV.
(I tried it myself some time ago.)

Thomas 

-- 
http://thomas.willhalm.de/ (includes pgp key)



From: Thomas Willhalm
Subject: Memory allocation (was: Re: SIMD implementation of dot-product in POV-Ray???)
Date: 29 Nov 1999 05:59:32
Message: <qqmiu2l5zsb.fsf_-_@goldach.fmi.uni-konstanz.de>
DISCLAIMER:
I'm not too familiar with the way POV-Ray handles its memory.

"Thorsten Froehlich" <tho### [at] trfde> writes:
>
> Whenever I profiled, I found that POV-Ray spends a lot of time doing memory
> allocations...

Couldn't it be useful to have two methods for allocating memory?
1) the POV_MALLOC used so far
2) a new method whose memory can only be freed when the program exits.

With method 2), POV-Ray can allocate large parts of memory instead of little
chunks. The size of the "little chunks" doesn't matter anymore and therefore
needn't be stored. This decreases memory consumption and the amount of memory
management the OS has to do. The "large parts" could be handled by a linked
list. Only the last part, the current one, needs a pointer to the last byte
(word) that is used.

However, I don't know whether:
a) it's really faster
b) the scenario for 2) occurs often enough in POV-Ray to justify the effort
of implementing this.

Thomas

-- 
http://thomas.willhalm.de/ (includes pgp key)



From: Mark Wagner
Subject: Re: Memory allocation (was: Re: SIMD implementation of dot-product in POV-Ray???)
Date: 30 Nov 1999 00:29:36
Message: <384360c0@news.povray.org>
Thomas Willhalm wrote in message ...
>Couldn't it be useful to have two methods for allocating memory?
>1) the POV_MALLOC used so far
>2) a new method whose memory can only be freed when the program exits.
>
>With method 2), POV-Ray can allocate large parts of memory instead of little
>chunks. The size of the "little chunks" doesn't matter anymore and therefore
>needn't be stored. This decreases memory consumption and the amount of memory
>management the OS has to do. The "large parts" could be handled by a linked
>list. Only the last part, the current one, needs a pointer to the last byte
>(word) that is used.


With a linked list, you still have the overhead of remembering which parts
of the large chunk of memory are in use.

Mark



From: Thorsten Froehlich
Subject: Re: Memory allocation (was: Re: SIMD implementation of dot-product in POV-Ray???)
Date: 30 Nov 1999 01:06:58
Message: <38436982@news.povray.org>
In article <qqmiu2l5zsb.fsf_-_@goldach.fmi.uni-konstanz.de> , Thomas 
Willhalm <tho### [at] willhalmde>  wrote:

> I'm not too familiar with the way POV-Ray handles its memory.

OK :-)

> "Thorsten Froehlich" <tho### [at] trfde> writes:
>>
>> Whenever I profiled, I found that POV-Ray spends a lot of time doing memory
>> allocations...
>
> Couldn't it be useful to have two methods for allocating memory?
> 1) the POV_MALLOC used so far
> 2) a new method whose memory can only be freed when the program exits.

Well, that would work for the command line versions, but it would create
various problems for the Windows and Macintosh GUI versions.

> With method 2), POV-Ray can allocate large parts of memory instead of little
> chunks. The size of the "little chunks" doesn't matter anymore and therefore
> needn't be stored. This decreases memory consumption and the amount of memory
> management the OS has to do.

The Mac version is currently doing this. This way you can get a memory
allocation down to about 100 cycles (plus one allocation from the system
every 16th call on average, in the case of the current Mac method), but
there are easily ten million or more allocations even for simple renders.
That makes a billion cycles, or about 2 seconds even on fast processors, just
for the cached allocations, plus about 100000 cycles for the system memory
allocation functions (Mac OS). A simple example is pyramid2.pov (a sample
scene that comes with POV-Ray 3.1). Rendering it at its default recursion
level (six) results in 23437 objects. Rendering those at 640 * 480 with
anti-aliasing (Method 1, Threshold 0.300, Depth 5, Jitter 0.00) results in
about five million memory allocations. Changing the whole thing to a glass
spheres
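The arithmetic behind the "billion cycles" figure. The clock rate is not stated in the post; the 500 MHz below is only what the 2-second figure implies:

```latex
10^{7}\ \text{allocations} \times 100\ \frac{\text{cycles}}{\text{allocation}}
  = 10^{9}\ \text{cycles},
\qquad
\frac{10^{9}\ \text{cycles}}{2\ \text{s}} = 5 \times 10^{8}\ \text{Hz} = 500\ \text{MHz}
```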


> The "large parts" could be handled by a linked list. Only the last part,
> the current one, needs a pointer to the last byte (word) that is used.

You don't want to walk through lists; it is easier and more efficient to use
a bitmap (not in the image sense, but in the computer science sense): a simple
array of bits that mark whether memory at a particular location is used or
not. You can then use a few "bit tricks" to find an empty cell. In order not
to have to mess around with different cell sizes, you just group cells by
size, e.g. if a block of memory with 47 bytes is allocated, you allocate
47 * 32 bytes and manage those yourself. The next allocation of 47 bytes
will then be much faster (and POV-Ray uses a lot of blocks of the same
sizes). If you limit yourself to "caching" only the lower range of
allocations, e.g. 1 to 4096 bytes, you can manage the whole "cache" of
allocated memory easily.
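A minimal sketch of this per-size cache for a single cell size, assuming 32 cells per pool so that one 32-bit word holds all the "in use" bits. `~w & (w + 1)` is one such bit trick: it isolates the lowest clear bit. `Pool`, `pool_alloc` and the other names are hypothetical, and this is not the PowerMac code; a full version would keep one chain like this for each size from 1 to 4096 bytes:

```c
#include <stdint.h>
#include <stdlib.h>

#define CELLS_PER_POOL 32   /* one 32-bit word of "in use" bits */

/* A pool of CELLS_PER_POOL equally sized cells, e.g. 32 cells of 47 bytes. */
typedef struct Pool {
    size_t         cell_size;  /* bytes per cell */
    uint32_t       used;       /* bit i set => cell i is allocated */
    unsigned char *memory;     /* cell_size * CELLS_PER_POOL bytes */
    struct Pool   *next;       /* next pool with the same cell size */
} Pool;

static Pool *pool_create(size_t cell_size)
{
    Pool *p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->memory = malloc(cell_size * CELLS_PER_POOL);
    if (p->memory == NULL) {
        free(p);
        return NULL;
    }
    p->cell_size = cell_size;
    p->used = 0;
    p->next = NULL;
    return p;
}

/* Bit trick: ~w & (w + 1) keeps only the lowest clear bit of w. */
static int lowest_free_cell(uint32_t w)
{
    if (w == UINT32_MAX)
        return -1;                      /* every cell is in use */
    uint32_t bit = ~w & (w + 1u);
    int index = 0;
    while (bit >>= 1)
        index++;                        /* turn the mask into an index */
    return index;
}

/* Use the first free cell in the chain; ask the system for memory only
   when every pool of this size is full. */
static void *pool_alloc(Pool *head)
{
    for (Pool *p = head;; p = p->next) {
        int i = lowest_free_cell(p->used);
        if (i >= 0) {
            p->used |= (uint32_t)1 << i;
            return p->memory + (size_t)i * p->cell_size;
        }
        if (p->next == NULL && (p->next = pool_create(head->cell_size)) == NULL)
            return NULL;
    }
}

/* Give a cell back by clearing its bit in the pool that owns it. */
static void pool_free(Pool *head, void *ptr)
{
    uintptr_t addr = (uintptr_t)ptr;
    for (Pool *p = head; p != NULL; p = p->next) {
        uintptr_t base = (uintptr_t)p->memory;
        if (addr >= base && addr < base + p->cell_size * CELLS_PER_POOL) {
            size_t i = (size_t)(addr - base) / p->cell_size;
            p->used &= ~((uint32_t)1 << i);
            return;
        }
    }
}

static void pool_destroy(Pool *head)
{
    while (head != NULL) {
        Pool *next = head->next;
        free(head->memory);
        free(head);
        head = next;
    }
}
```

With `pool_create(47)`, the first 32 calls to `pool_alloc` fill the first pool and the 33rd links in a second one. Freeing a cell clears its bit, so the next `pool_alloc` hands the same cell back. Real code would also round the cell size up to the platform's alignment, since 47-byte cells are misaligned for anything but byte data.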

> However, I don't know whether:
> a) it's really faster

It is, but eliminating memory allocations in "strategic" places altogether
would speed things up even more. However, that would require quite a few
modifications to the POV-Ray source code, while changing the allocation
functions is simpler because they are external.

> b) the scenario for 2) occurs often enough in POV-Ray to justify the effort
> of implementing this.

It does, and doing it is neither very difficult nor a lot of work: the current
implementation in the PowerMac version of POV-Ray is just a few hundred
lines of code. (However, because it uses a single inline assembler instruction
to find a zero bit in a 32-bit word, it is not yet fully portable at the same
speed, but that is a very, very long story.)
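The portability issue is about finding a zero bit quickly. The PowerMac code is not shown in the thread, and this is not it. Here is a sketch of one common approach, assuming a GCC- or Clang-compatible compiler for the fast path: use the `__builtin_ctz` builtin, which compilers typically lower to a single instruction where the CPU has one, and fall back to a plain loop elsewhere. `first_zero_bit` is a hypothetical name:

```c
#include <stdint.h>

/* Index (0-31) of the lowest zero bit in w, or -1 if all bits are set. */
static int first_zero_bit(uint32_t w)
{
    if (w == UINT32_MAX)
        return -1;
#if defined(__GNUC__)
    /* Count trailing zeros of ~w; undefined for 0, which the check
       above rules out. */
    return __builtin_ctz(~w);
#else
    /* Portable fallback: skip the run of low one-bits. */
    int i = 0;
    while (w & 1u) {
        w >>= 1;
        i++;
    }
    return i;
#endif
}
```

For example, `first_zero_bit(0x7)` is 3, and `first_zero_bit(UINT32_MAX)` is -1 because the word is full.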


  Thorsten


____________________________________________________
Thorsten Froehlich, Duisburg, Germany
e-mail: tho### [at] trfde

Visit POV-Ray on the web: http://mac.povray.org



From: Thomas Willhalm
Subject: Re: Memory allocation (was: Re: SIMD implementation of dot-product in POV-Ray???)
Date: 30 Nov 1999 06:11:25
Message: <qqmemd85j4y.fsf@goldach.fmi.uni-konstanz.de>
Sorry for replying to my own post, but since Thorsten and Mark didn't get
my point (which is probably due to my limited English), I will give a more
detailed description of my idea.

Thomas Willhalm <tho### [at] willhalmde> writes:
> 
> "Thorsten Froehlich" <tho### [at] trfde> writes:
> >
> > Whenever I profiled, I found that POV-Ray spends a lot of time doing memory
> > allocations...
> 
> Couldn't it be useful to have two methods for allocating memory?
> 1) the POV_MALLOC used so far
> 2) a new method whose memory can only be freed when the program exits.
> 
> With method 2), POV-Ray can allocate large parts of memory instead of little
> chunks. The size of the "little chunks" doesn't matter anymore and therefore
> needn't be stored. This decreases memory consumption and the amount of memory
> management the OS has to do. The "large parts" could be handled by a linked
> list. Only the last part, the current one, needs a pointer to the last byte
> (word) that is used.

I imagine that there are a lot of objects in POV-Ray for which memory is
allocated once and used until the end of the rendering. To give a concrete
example, imagine a box being parsed. The corresponding memory is allocated
at that time and will be used until rendering finishes.

Now, I make the following assumption: for a lot of objects (in the
programming sense) we can guarantee that we will use them until the end of
the rendering. This might be the case for objects (in the POV-Ray sense),
textures, density maps and so on.

Once this assumption is accepted, my idea comes into play: why should I store
the information needed to free the memory of every single object when
I will free them all at once? So, let us reserve a large part of memory
(e.g. 30 MB). When an object is created (i.e. memory is allocated), we put it
on top of the memory used so far. All we need is to store a pointer to the
last address that has been used, because we will free our large part of
memory all at once (when the rendering is finished).

The problem, of course, is that at the beginning of parsing we don't know
how much memory we will need. Thus, we have to split the memory into
smaller parts (e.g. 64 KB). This is where I want to use a linked list:
to connect the parts of memory. We will walk this linked list only once,
when freeing the memory at the end of the rendering. That's why this method
is still O(n). (Allocating memory for a new object takes constant time
and should be faster than the standard way.)
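A minimal sketch of the scheme just described, using the 64 KB part size from the text and assuming 8-byte alignment is enough for every object. `Arena`, `Part`, `arena_alloc` and `arena_free_all` are hypothetical names, not POV-Ray functions. Allocation only bumps an offset in the newest part, and the linked list of parts is walked once, when everything is freed:

```c
#include <stddef.h>
#include <stdlib.h>

#define PART_SIZE   (64u * 1024u)  /* "smaller parts (e.g. 64 KB)" */
#define ARENA_ALIGN 8u             /* assumed alignment for every object */

/* One "large part". The list link is only followed when freeing. */
typedef struct Part {
    struct Part   *next;
    unsigned char *data;
    size_t         size;   /* usable bytes in data */
    size_t         used;   /* offset just past the last used byte */
} Part;

typedef struct {
    Part *current;         /* only the newest part is allocated from */
} Arena;

/* O(1): bump the offset. No per-object size or owner is recorded. */
static void *arena_alloc(Arena *a, size_t size)
{
    size_t rounded = (size + ARENA_ALIGN - 1) & ~(size_t)(ARENA_ALIGN - 1);
    Part *p = a->current;
    if (p == NULL || p->size - p->used < rounded) {
        /* Request does not fit: start a new part. Whatever is left in
           the old part stays unused until arena_free_all(). */
        size_t cap = rounded > PART_SIZE ? rounded : PART_SIZE;
        Part *np = malloc(sizeof *np);
        if (np == NULL)
            return NULL;
        np->data = malloc(cap);
        if (np->data == NULL) {
            free(np);
            return NULL;
        }
        np->size = cap;
        np->used = 0;
        np->next = p;
        a->current = p = np;
    }
    void *mem = p->data + p->used;
    p->used += rounded;
    return mem;
}

/* O(number of parts): walk the list once and release everything. */
static void arena_free_all(Arena *a)
{
    Part *p = a->current;
    while (p != NULL) {
        Part *next = p->next;
        free(p->data);
        free(p);
        p = next;
    }
    a->current = NULL;
}
```

Requests larger than a part get a part of their own. When a request does not fit, the unused tail of the current part is simply abandoned. That is the memory-versus-speed trade-off of this design.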

What Thorsten writes about the memory management in POV-Ray for Mac is very
similar to what I suggest, except that I want to "forget" which block the
memory belongs to and the size of the corresponding object. As mentioned in
the introduction, I expect this method to work only for some (perhaps even
most) objects in POV-Ray.

I hope that my description is much clearer now.
Thomas

-- 
http://thomas.willhalm.de/ (includes pgp key)



From: Ron Parker
Subject: Re: Memory allocation (was: Re: SIMD implementation of dot-product in POV-Ray???)
Date: 30 Nov 1999 08:23:58
Message: <3843cfee@news.povray.org>
On 30 Nov 1999 12:11:25 +0100, Thomas Willhalm wrote:
>
>Sorry for replying to my own post, but since Thorsten and Mark didn't get
>my point (which is probably due to my limited English), I will give a more
>detailed description of my idea.
[...]

I think Thorsten got your point, in that he mentioned it would cause problems
for the GUI versions because they don't terminate.  Of course, they do stop 
rendering at some point, and you could easily clean up the large chunks o' 
memory at that point.  Just make sure that GUI stuff never allocates memory
from there. 

I think this idea has some merit, but I am concerned somewhat about the 
possibility of wasting some large blocks of memory.  Consider what happens
when the current chunk o' memory has 31K left available and we ask to allocate
a 32K block of memory.  Not that 31K of wasted memory is horribly significant
these days, but those blocks could add up.  I suppose it's a tradeoff for the
expected time savings.

Also... it should be made a little more flexible, somehow.  If your scene has
lots of #declared objects, textures, etc. there are also a fairly large number 
of allocations that get freed at the end of parsing.  It might be nice to free
them all at once too.
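Freeing a whole group of allocations at once is often done with a "mark and release" arena. Below is a minimal sketch, assuming a single fixed-size buffer (no growth) and 8-byte alignment. `ScopedArena` and the `scoped_*` functions are hypothetical names:

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned char *base;      /* one buffer obtained from malloc */
    size_t         capacity;  /* its size in bytes */
    size_t         top;       /* offset of the first free byte */
} ScopedArena;

static int scoped_init(ScopedArena *a, size_t capacity)
{
    a->base = malloc(capacity);
    a->capacity = capacity;
    a->top = 0;
    return a->base != NULL;
}

/* Bump allocation, rounded up to 8 bytes (assumed alignment). */
static void *scoped_alloc(ScopedArena *a, size_t size)
{
    size_t rounded = (size + 7u) & ~(size_t)7u;
    if (rounded > a->capacity - a->top)
        return NULL;                      /* this sketch never grows */
    void *p = a->base + a->top;
    a->top += rounded;
    return p;
}

/* A mark is just the current top; releasing to it frees, in O(1),
   everything allocated after the mark was taken. */
static size_t scoped_mark(const ScopedArena *a) { return a->top; }
static void scoped_release(ScopedArena *a, size_t mark) { a->top = mark; }

static void scoped_destroy(ScopedArena *a)
{
    free(a->base);
    a->base = NULL;
    a->capacity = a->top = 0;
}
```

Everything allocated before the mark survives a `scoped_release`; everything after it is discarded in one step. Since parsing interleaves long-lived scene objects with temporaries such as `#declare`d values, a single mark would not separate them cleanly. In practice, each lifetime would more likely get its own arena that is released as a whole.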

-- 
These are my opinions.  I do NOT speak for the POV-Team.
The superpatch: http://www2.fwi.com/~parkerr/superpatch/
My other stuff: http://www2.fwi.com/~parkerr/traces.html




Copyright 2003-2023 Persistence of Vision Raytracer Pty. Ltd.