> Look in the gcc or icc standard library implementations how they do it. Or
> compile and disassemble the compiled code.
Looks sufficiently ugly for my taste :)
Thorsten Froehlich <tho### [at] trfde> wrote:
> For two decades now the CPU and FPU have been the same thing on x86. It is
> not like they are two different processors. They are *one* processor. The
> terminology is just a leftover from times when the logic we nowadays call
> FPU did not fit on the same die as the integer unit called CPU back then.
... and yet, all this time, the FPU has been doing its business in parallel
with the CPU, just like in the very first days.
On the other hand, given how much stuff is happening in parallel in a CPU
nowadays, this special status may not be really special anymore.
Thorsten Froehlich <tho### [at] trfde> wrote:
> For two decades now the CPU and FPU have been the same thing on x86. It is
> not like they are two different processors. They are *one* processor. The
> terminology is just a leftover from times when the logic we nowadays call
> FPU did not fit on the same die as the integer unit called CPU back then.
That doesn't matter. The CPU part does not stop if the FPU is doing
something (the only situation where the CPU will wait for the FPU is
when it tries to retrieve some value from it).
This means that if the program executes an FPU opcode which takes dozens
of clock cycles for the FPU to perform, the CPU part will continue executing
CPU opcodes until a new FPU opcode (eg. fst) is encountered.
The original Quake engine was rather famous for using this to its
advantage on the 486 and Pentium processors: while the FPU was calculating
a heavy division (heavier in those days than today), the CPU was
linearly interpolating and drawing the next 15 pixels of texture. This
made the division operation almost free (at the cost of the perspective
correction of the texture not being completely perfect).
--
- Warp
Thorsten Froehlich <tho### [at] trfde> wrote:
> Yes, but what use are instructions you won't be able to use in the future
> and you are already recommended not to use now?
As long as the hardware supports x87, I see absolutely no rational reason
why an OS would drop support for 99% of programs just because it doesn't
want the FPU to be used.
The OS would, in fact, have to go to great lengths in order to detect
that a program is using the FPU and deliberately stop it (rather than
allow it to simply malfunction, which would be stupid).
"Sorry, your program uses the FPU, and while this computer does have
an FPU, and it could run your program just perfectly, I'm not going to
allow it. Tough luck."
--
- Warp
clipka <nomail@nomail> wrote:
> Warp <war### [at] tagpovrayorg> wrote:
> > > Yes, by not saving and restoring the x87 "register" stack when switching
> > > threads or making operating system calls. You need OS support for that.
> >
> > That would be a rather broken OS.
> Not if this was part of the OS specification.
So the hardware would be perfectly able to run the software, but the
OS deliberately stops the software from being run if it uses the FPU.
And this makes sense?
--
- Warp
Thorsten Froehlich <tho### [at] trfde> wrote:
> Look in the gcc or icc standard library implementations how they do it. Or
> compile and disassemble the compiled code.
Maybe it's different on the x86-64 architecture, but at least on my P4,
according to my tests, if the compiler uses the FPU opcodes the result is
faster than anything else.
I made a little function to try to test the speed of trigonometric
functions with different compiler options:
#include <cmath>

double foo(double d)
{
    double retval = 0;
    for (double angle = 0; angle <= d; angle += 0.0001)
        retval += std::sin(angle) + std::cos(angle);
    return retval;
}
Then I call it with "foo(10000);"
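For what it's worth, a minimal timing harness along these lines is enough to reproduce this kind of measurement. This is my own sketch using std::chrono; the original post doesn't say how the timing was done, so the time_foo helper is made up:

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

// The test function, as posted.
double foo(double d)
{
    double retval = 0;
    for (double angle = 0; angle <= d; angle += 0.0001)
        retval += std::sin(angle) + std::cos(angle);
    return retval;
}

// Time one call of foo(d) and return the elapsed wall-clock seconds.
double time_foo(double d)
{
    auto t0 = std::chrono::steady_clock::now();
    double r = foo(d);  // r is printed below, so the call cannot be optimized away
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("foo(%g) = %f, %.3f s\n", d, r, secs);
    return secs;
}
```

Calling time_foo(10000) and averaging a few runs reproduces the kind of numbers listed below; smaller arguments are handy for a quick sanity check.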
It performs quite a few other operations as well, so the trigonometric
functions get a bit buried among the others. However, it still produces
measurable differences in execution speed with different options.
I use the optimization options "-O3 -march=native" for all tests. For
some reason if I use the option "-ffast-math", gcc will produce a direct
fsincos opcode, but if I don't specify it, it will produce a library call
instead. I don't really understand why, but that suits me just fine for
this test. Here are some results (average of 4 runs, rounded to 1 decimal):
-O3 -march=native : 8.2 seconds
-O3 -march=native -ffast-math : 7.1 seconds
-O3 -march=native -mfpmath=sse : 8.1 seconds
It could be possible to run the test with pure software FP calculations
by using the -msoft-float option, but apparently my gcc (or, more precisely,
libgcc) has not been compiled with the support for it, so I can't test it.
It's a pity. It would have been interesting to see how much slower it would
have been.
I haven't really looked at what the gcc sincos library call is doing,
but it might well be that it just executes an fsincos opcode, and that
the time difference is coming from the overhead of the function call.
Anyways, whatever the reason, at least on 32-bit x86 it just seems to
be faster to execute an fsincos directly.
--
- Warp
From: Thorsten Froehlich
Subject: Re: Radiosity Status: Giving Up...
Date: 1 Jan 2009 07:22:51
Message: <495cb59b@news.povray.org>
Warp wrote:
> Thorsten Froehlich <tho### [at] trfde> wrote:
>> For two decades now the CPU and FPU have been the same thing on x86. It is
>> not like they are two different processors. They are *one* processor. The
>> terminology is just a leftover from times when the logic we nowadays call
>> FPU did not fit on the same die as the integer unit called CPU back then.
>
> That doesn't matter.
When did I say it does? You asserted it would:
> The CPU part does not stop if the FPU is doing
> something (the only situation where the CPU will wait for the FPU is
> when it tries to retrieve some value from it).
"at least in theory, have the FPU calculating your operation while the CPU
does other (non-FPU) operations at the same time. I don't know if any
compiler is able to optimize like this, though."
I am asserting that (both of) your statements are incorrect because you
continue to view the FPU as a separate entity from the CPU in your
statements. What you refer to as CPU is the combination of ALU and LSU. That
is not the whole CPU. The FPU is just one other component of the CPU, it is
in no way distinct, especially not in x86 processors.
In fact, Intel's "Core" architecture fuses the ALU and FPU like no other CPU
design currently around. Go look for a Core i7 (Nehalem) block diagram in
the IDF videos on the Intel site, or look at e.g. this redrawn one:
<http://pc.watch.impress.co.jp/docs/2008/0403/kaigai_nehalem.pdf>
Notice something? - Where is your FPU, where is your "CPU"? There are six
separate execution units, each with some unique and some common features...
Thorsten
Warp wrote:
> Thorsten Froehlich <tho### [at] trfde> wrote:
>> Yes, but what use are instructions you won't be able to use in the future
>> and you are already recommended not to use now?
>
> As long as the hardware supports x87, I see absolutely no rational reason
> why an OS would drop support for 99% of programs just because it doesn't
> want the FPU to be used.
Tell that Microsoft, Apple and the Linux community.
Thorsten
Warp wrote:
> Thorsten Froehlich <tho### [at] trfde> wrote:
>> Look in the gcc or icc standard library implementations how they do it. Or
>> compile and disassemble the compiled code.
<snip>
> I haven't really looked at what the gcc sincos library call is doing,
> but it might well be that it just executes an fsincos opcode, and that
> the time difference is coming from the overhead of the function call.
You asked what a fast SSE trigonometry implementation would look like, not
what code your compiler generates when targeting a P4. So clearly you should
not be looking at the x87 implementation using the fsincos opcode when you
want to know what the SSE code would look like!?!
Thorsten
Thorsten Froehlich <tho### [at] trfde> wrote:
> Warp wrote:
> > Thorsten Froehlich <tho### [at] trfde> wrote:
> >> For two decades now the CPU and FPU have been the same thing on x86. It is
> >> not like they are two different processors. They are *one* processor. The
> >> terminology is just a leftover from times when the logic we nowadays call
> >> FPU did not fit on the same die as the integer unit called CPU back then.
> >
> > That doesn't matter.
> When did I say it does? You asserted it would:
> > The CPU part does not stop if the FPU is doing
> > something (the only situation where the CPU will wait for the FPU is
> > when it tries to retrieve some value from it).
> "at least in theory, have the FPU calculating your operation while the CPU
> does other (non-FPU) operations at the same time. I don't know if any
> compiler is able to optimize like this, though."
> I am asserting that (both of) your statements are incorrect because you
> continue to view the FPU as a separate entity from the CPU in your
> statements. What you refer to as CPU is the combination of ALU and LSU. That
> is not the whole CPU. The FPU is just one other component of the CPU, it is
> in no way distinct, especially not in x86 processors.
> In fact, Intel's "Core" architecture fuses the ALU and FPU like no other CPU
> design currently around. Go look for a Core i7 (Nehalem) block diagram in
> the IDF videos on the Intel site, or look at e.g. this redrawn one:
> <http://pc.watch.impress.co.jp/docs/2008/0403/kaigai_nehalem.pdf>
> Notice something? - Where is your FPU, where is your "CPU"? There are six
> separate execution units, each with some unique and some common features...
OMG. You blame me for nitpicking about semantics, and now you are doing
that exact same thing yourself.
I never said the "FPU" would be a separate piece of circuitry from the CPU.
When I say "FPU" I mean, rather obviously, "the part of the processor which
performs the floating point calculations". It doesn't matter how it's
physically distributed inside the processor, I was talking about its
behavior.
--
- Warp