Hi guys, I am Francois Piednoel, working in the Performance Enabling group
at Intel.
As you have probably noticed, PovRay uses a lot of processing power.
Version 3.5 of PovRay for Windows is a big step for PovRay; the only
step still missing is to make full use of the multiprocessing power of modern
processors, such as the new Xeon.
HyperThreading is a real plus for modern computers, and you can already
see its efficiency in a few 3D ray tracers.
So, I am placing a call: anybody interested in working with us on moving
PovRay to a real multithreading kernel is welcome to email me at
fra### [at] intel com
we are already working closely with Chris Cason, and we hope to improve
again the Pentium(R) 4 user experience, via some more improvement.
Francois
--------------------------------------------------------------
Francois Piednoel
Senior Performance Analyst
Pentium 4, Prescott, Tejas.
Intel Corporation

"Francois Piednoel" <fra### [at] intel com> wrote in message
news:3d55aa5d$1@news.povray.org...
> we are already working closely with Chris Cason, and we hope to improve
> again the Pentium(R) 4 user experience, via some more improvement.
And I would like to take this opportunity to publicly thank Francois for his
hard work on the P4 optimizations already in 3.5. I am quite looking forward
to seeing his hyperthreading implementation.
-- Chris

From: Theo Gottwald *
Subject: Re: MultiThreaded version of PovRay 3.5
Date: 11 Aug 2002 10:42:05
Message: <3d5677bd@news.povray.org>

Sounds really good.
However, the speed increase from Hyperthreading is different from that of true
"SMP multiprocessing systems".
While in multiprocessing systems you have a REAL SECOND FPU, in a
"Hyperthreading system" you have no physical second CPU, just two
register sets and caches.
So the speed win comes if one process is halted or used, I doubt we can have
real 45% increase like with a true SMP-System. After what I have read till now,
Floating-point calculations may be one of the things that get least speed
improvements from "jackson technology hyperthreading". 25% ?
However, the Intel approach of helping to enable multiprocessing in standard
applications is a good idea, and
I welcome it at any time because it's the way to the PCs of the future.
Also, Intel has always had very good SMP systems.
BTW ...
Some more thoughts ...
- What do we think about a command-line switch "-use_SSE" (Intel Streaming
SIMD Extensions)?
 :-) Is it usable, and could it speed up this sort of calculation?
NVIDIA is just running a study on doing raytracing in hardware.
- Could we use DirectX 9 under Windows to get speed improvements in future
versions?
--Theo Gottwald
PS: While you all work on "the real thing", I've been working on a stopgap
solution.
     I've put up a beta-test version which supports as many processors as
anyone may have,
     in one PC or in ten.
     It currently does not support animations and it's in German. But maybe
in a few days ...
     The link is http://www.it-berater.org/smpov.htm

I'd be of no help about this stuff. Just surprised there was already some
code worked in for Pentium 4, since I thought it was only up to PIII.
This concept reminded me of when I discussed data compression with my
Dad, who worked at Boeing Aerospace at the time, before I knew such things
might already be being done. He said they were. This was the mid-80s;
I hadn't read about such things yet, and not being in the software business I
couldn't have known much. But I had envisioned on my own a scheme whereby
ASCII character changes would be related to the last, with every new layer of
data overlaid onto the last. I was using a floppy-based computer at the
time, with no HDD, so it was a natural thing to think about.
Anyway, I was thinking maybe this Hyperthreading might be a similar thing.
From what I read at that web site about it though, looks more like a buffer
exchange system.  Hey, I'm not technically knowledgeable so please forgive
my ignorance. Thanks. :-)
Anyway, if it is not a kind of compressed data thing then is there a reason
that wouldn't be good? Maybe too much overhead to be tracking it?

From: Thorsten Froehlich
Subject: Re: MultiThreaded version of PovRay 3.5
Date: 11 Aug 2002 11:41:55
Message: <3d5685c3@news.povray.org>

In article <3d5677bd@news.povray.org> , "Theo Gottwald *"
<The### [at] t-online de> wrote:
> So the speed win comes if one process is halted or used, I doubt we can have
> real 45% increase like with a true SMP-System. After what I have read till now,
> Floating-point calculations may be one of the things that get least speed
> improvements from "jackson technology hyperthreading". 25% ?
It comes down to _testing_ it.  Making guesses will get you nowhere with
modern microprocessor architectures.
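As a rough illustration of what such a test might look like, here is a minimal C++ sketch that times a floating-point loop on one thread and again on several. The workload `fp_work`, the helper `time_with_threads`, and the iteration counts are made up for this example; they are not POV-Ray code, and the numbers will vary from machine to machine.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a render kernel: a pure floating-point loop.
double fp_work(std::size_t iterations) {
    double acc = 0.0;
    for (std::size_t i = 1; i <= iterations; ++i) {
        acc += std::sqrt(static_cast<double>(i)) / static_cast<double>(i);
    }
    return acc;
}

// Runs fp_work on n_threads threads at once (the same amount of work per
// thread) and returns the elapsed wall-clock time in seconds.
double time_with_threads(unsigned n_threads, std::size_t iterations) {
    std::vector<double> results(n_threads, 0.0);
    std::vector<std::thread> workers;
    auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&results, t, iterations] {
            results[t] = fp_work(iterations);
        });
    }
    for (auto& w : workers) w.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Comparing `time_with_threads(1, n)` with `time_with_threads(2, n)` gives a first impression: if two threads finish in about the same time as one, throughput has roughly doubled; if they take about twice as long, the second logical CPU gained nothing for this workload.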
    Thorsten
____________________________________________________
Thorsten Froehlich
e-mail: mac### [at] povray  org
I am a member of the POV-Ray Team.
Visit POV-Ray on the web: http://mac.povray.org

Theo Gottwald * <The### [at] t-online de> wrote:
> NVIDIA has just running a study to make Raytracing in Hardware.
> - could we use direct X 9 under windows to get speed improvements in future
> versions ?
  I don't know anything about that, but I suppose that it will only raytrace
triangles, and thus be of little use for povray (except perhaps for mesh
raytracing optimization).
-- 
#macro N(D)#if(D>99)cylinder{M()#local D=div(D,104);M().5,2pigment{rgb M()}}
N(D)#end#end#macro M()<mod(D,13)-6,mod(div(D,13),8)-3,10>#end blob{
N(11117333955)N(4254934330)N(3900569407)N(7382340)N(3358)N(970)}//  - Warp -

Of course it will depend on the scene etc. however if you look at the
architecture you can see whether it is technically like an SMP system or
not.
True numbers will always depend on many factors, but the good thing is that
if many people have those "2-in-1" chips (even if they are only
"2+1/2 in one"), the software will be updated for "2 in one", and this will
help sell more multiprocessor boards.
And with an increased number of SMP applications, the consumer market of
the future may also see a multiprocessing race (you know it from CD-ROM
speeds: 2 times, 4 times, 75 times ...).
We want to see that with SMP systems.
cu
--Theo

nope ...
DX9 is completely programmable ...
you could theoretically send some "mini-programs" to the DX9 chip
and run them there...
the only question is whether this is really faster, because the
communication overhead could make it worse.
OTOH ATI's Radeon 9700 has several such pipelines in parallel (IIRC 6-8),
so if all the data needed fits in the memory of the card and the "processors"
support all the needed operations, it could be a nice idea ...
But to be honest: I don't believe this is worth it for POV, because the code will
become a mess

From: Francois Piednoel
Subject: Re: MultiThreaded version of PovRay 3.5
Date: 12 Aug 2002 17:16:42
Message: <3d5825ba@news.povray.org>

It is a very interesting point of view.
From a practical point of view, a system with two REAL physical processors
will be faster than a HyperThreading system with a single physical processor
in many cases, I agree. And I will be very happy if you buy two processors
instead of one ;-)
On the other hand, we have seen a lot of funky things with multithreaded
applications, like video decoding or 3D pipelines using multiple threads,
because sometimes CPU2 asks for a write combine from CPU1 and that
collides with the regular memory traffic (the L2 caches of CPU1 and CPU2
dispute the ownership of the data). In the case of Hyperthreading, we do not
see this. The return on investment, from a processing power point of view, is
much better with Hyperthreading.
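One common way such an ownership dispute shows up on a two-processor system is when two threads write to variables that happen to share a cache line. Below is a minimal C++ sketch of the usual workaround, which gives each thread's data its own cache line. The 64-byte line size and the names `PaddedCounter` and `count_in_parallel` are assumptions for illustration; the real line size depends on the processor, and C++17 or later is assumed so that the vector honors the alignment.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is common on x86 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Each counter is aligned to, and therefore padded out to, a full cache
// line, so two threads never write to the same line.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long value = 0;
};

// Every thread increments only its own counter; returns the grand total.
unsigned long count_in_parallel(unsigned n_threads, unsigned long per_thread) {
    std::vector<PaddedCounter> counters(n_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&counters, t, per_thread] {
            for (unsigned long i = 0; i < per_thread; ++i) {
                counters[t].value += 1;
            }
        });
    }
    for (auto& w : workers) w.join();
    unsigned long total = 0;
    for (const auto& c : counters) total += c.value;
    return total;
}
```

Removing the `alignas` and timing both variants on a dual-processor machine is one way to see whether the effect matters for a given workload.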
Hyperthreading was designed to increase the decoding capability of your
processor; right now, the decoder is the bottleneck in your processor.
Your execution units are not fully used, because your rendering
places a lot of memory requests and waits for the answers (300 to 400 cycles).
In a regular CPU, you will be waiting for the cache line to arrive via the
front side bus and do NOTHING; with hyperthreading, the second logical CPU
will have 300 cycles to do whatever it wants, from reading from L1 or L2
cache to crunching data with int or float. There is huge room for
improvement, especially for programs like PovRay that use a lot of memory.
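One simple way to give a second logical CPU something to do while the first waits on memory is to split the image into rows and let each thread trace its own share. The sketch below assumes a hypothetical `trace_pixel` function and an interleaved row assignment; it is not how POV-Ray 3.5 is structured, only an outline of the idea.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-pixel function standing in for a real ray tracer.
float trace_pixel(std::size_t x, std::size_t y) {
    return static_cast<float>(x + y);
}

// Renders a width x height image using n_threads threads (at least 1).
// Thread t handles rows t, t + n_threads, t + 2 * n_threads, ...
std::vector<float> render(std::size_t width, std::size_t height,
                          unsigned n_threads) {
    std::vector<float> image(width * height, 0.0f);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t) {
        workers.emplace_back([&image, width, height, n_threads, t] {
            for (std::size_t y = t; y < height; y += n_threads) {
                for (std::size_t x = 0; x < width; ++x) {
                    // Each pixel is written by exactly one thread.
                    image[y * width + x] = trace_pixel(x, y);
                }
            }
        });
    }
    for (auto& w : workers) w.join();
    return image;
}
```

Interleaving rows rather than handing each thread one contiguous block tends to balance the load when some parts of a scene are more expensive than others, though a real renderer would need to check that against its own scenes.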
By the way, I speak a little German, so I tried to understand how you
implemented the multithreading, and I was not able to figure it out. Can you
explain?
Can you share the source code?
Remember, Hyperthreading just improves the return on investment of each
transistor in the processor, and that is a big deal if you are looking for
processing power.
Francois

From: Thorsten Froehlich
Subject: Re: MultiThreaded version of PovRay 3.5
Date: 13 Aug 2002 02:58:37
Message: <3d58ae1d@news.povray.org>

In article <3d5825ba@news.povray.org> , "Francois Piednoel"
<fra### [at] intel com> wrote:
> On the other side, we saw a lot of funky thinks when you have multithreaded
> applications, like video decoding, or 3D pipelines using multiple threads,
> because some time, the CPU2 ask a write combine from the CPU1 and that is
> colliding with the regular memory traffic.
A bit off-topic:  Recently someone suggested to me that something like this
would be efficient* on x86 processors in a multiprocessor environment. This
assumes that two (or more) threads access the same data structure and one
thread is writing to it:
volatile int flag;        // toggle between matrix 1 and 2 to
struct {                  // allow one thread to modify matrix1
    double matrix1[4][4]; // and one or more others to read
    double matrix2[4][4]; // the other matrix2 or vice versa
} shared;                 // ("shared" is only a placeholder name)
Is there any information for programmers available from Intel (to the public)
regarding code like the above compared to more abstract ways of doing
parallel access?  Or, more generally, about parallel read/write access pitfalls
and performance suggestions on x86 systems?
    Thorsten
* At least more efficient than a simple lock and using a single data
structure assuming there are many (at least tens of thousands or more per
second) locks/unlocks.
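For readers who want to experiment with the idea, a minimal C++11 sketch of the two-matrix scheme follows. It swaps `volatile` for `std::atomic`, since `volatile` alone gives no ordering guarantees between threads; that substitution, along with the names `DoubleBuffer`, `publish`, and `read`, is an assumption of this example rather than the code that was suggested. It also assumes a single writer, and that readers finish copying before the writer comes around to the same matrix again; without that, a reader could see a half-written matrix.

```cpp
#include <array>
#include <atomic>

using Matrix = std::array<std::array<double, 4>, 4>;

struct DoubleBuffer {
    Matrix matrices[2] = {};       // both matrices start out as zeros
    std::atomic<int> active{0};    // index of the matrix readers should use

    // Single writer only: fill the inactive matrix, then flip the index.
    void publish(const Matrix& m) {
        int next = 1 - active.load(std::memory_order_relaxed);
        matrices[next] = m;
        // Release: the matrix contents become visible before the new index.
        active.store(next, std::memory_order_release);
    }

    // Readers: copy whichever matrix is currently active.
    Matrix read() const {
        return matrices[active.load(std::memory_order_acquire)];
    }
};
```

Whether this beats a plain lock at tens of thousands of updates per second is exactly the kind of thing that would have to be measured on the target hardware.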
____________________________________________________
Thorsten Froehlich
e-mail: mac### [at] povray  org
I am a member of the POV-Ray Team.
Visit POV-Ray on the web: http://mac.povray.org