Newsgroup: povray.programming
  Batch intersection processing  
From: Vic
Date: 30 Apr 2002 16:48:05
Message: <3ccf0305@news.povray.org>
Dear developers,

I've recently read about the C++ rewrite of POV. This will be a lot of work,
but it could lead to the best raytracer ever written. Preview rendering is
important when building a big and/or complex scene. Separating the objects
can be a solution, but not in all cases. I have years of experience in x86
assembly and SIMD processing, so I can take part in the development of a
"preview renderer" mode for the x86 architecture.

Intel and AMD processors supporting SSE and SSE2 have already become popular,
so these instruction sets are widely available. SSE and SSE2 give outstanding
performance in single precision floating point arithmetic. I know that this
is not enough for raytracing in general, but it can definitely be enough for
color processing and "preview" quality rendering. Parsing and transforming
can be done at full precision. Then the whole scene can be rounded to single
precision and stored as aligned SSE2 (128-bit) words. There are instructions
supporting this.
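
As an illustration of the rounding step, here is a minimal sketch that packs four double precision values into one 16-byte aligned single precision word. The `Float4` type and the `round_to_float4` function are hypothetical names, not part of POV-Ray. The SSE2 path uses the `_mm_cvtpd_ps` intrinsic; a plain scalar fallback is included for compilers that do not target SSE2.

```cpp
#include <cassert>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE ones)
#define HAVE_SSE2 1
#endif

// Hypothetical 16-byte aligned packet of four single precision values.
struct alignas(16) Float4 {
    float v[4];
};

// Round four doubles to floats and store them as one aligned 128-bit word.
inline Float4 round_to_float4(const double* src) {
    assert(src != nullptr);
    Float4 out;
#ifdef HAVE_SSE2
    // Each conversion yields two floats in the lower lanes (upper lanes zero).
    __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(src));      // src[0], src[1]
    __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(src + 2));  // src[2], src[3]
    // Combine the two lower halves into [lo0, lo1, hi0, hi1].
    _mm_store_ps(out.v, _mm_movelh_ps(lo, hi));
#else
    for (int i = 0; i < 4; ++i) out.v[i] = static_cast<float>(src[i]);
#endif
    return out;
}
```

Calling `round_to_float4` on consecutive groups of four coordinates would convert a whole array of scene data after parsing. How the real scene structures would be laid out is left open here.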

Multiple rays can be started on the image plane at the same time. For
example, start 5000 rays from a 50x100 rectangle of the image plane.
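
A rough sketch of how such a batch of primary rays might be generated for one tile. Everything here is an assumption made for illustration: the `RayBatch` layout, the `make_tile_rays` name, and a pinhole camera at the origin looking down +z with the image plane at z = 1 spanning [-1, 1] in both axes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical struct-of-arrays ray batch: one array per component, so that
// four consecutive rays can later be loaded as one SIMD word.
struct RayBatch {
    std::vector<float> ox, oy, oz;  // ray origins
    std::vector<float> dx, dy, dz;  // ray directions (not normalized here)
    std::size_t size() const { return ox.size(); }
};

// Build one primary ray per pixel of a tile_w x tile_h tile whose top-left
// pixel is (x0, y0). The camera model is an assumption, see the text above.
RayBatch make_tile_rays(int x0, int y0, int tile_w, int tile_h,
                        int image_w, int image_h) {
    assert(tile_w > 0 && tile_h > 0 && image_w > 0 && image_h > 0);
    RayBatch b;
    for (int y = y0; y < y0 + tile_h; ++y) {
        for (int x = x0; x < x0 + tile_w; ++x) {
            // Map the pixel centre to [-1, 1] on the image plane.
            float u = (x + 0.5f) / image_w * 2.0f - 1.0f;
            float v = 1.0f - (y + 0.5f) / image_h * 2.0f;
            b.ox.push_back(0.0f); b.oy.push_back(0.0f); b.oz.push_back(0.0f);
            b.dx.push_back(u);    b.dy.push_back(v);    b.dz.push_back(1.0f);
        }
    }
    return b;
}
```

For instance, `make_tile_rays(0, 0, 50, 100, 640, 480)` produces the 5000 rays of the 50x100 rectangle mentioned above.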

Each elementary object class (not instance) has its own processing queue.
Rays are processed in a loop, and "intersection requests" are posted to the
queue of the class of each object concerned (bounding trees require multiple
passes). After a batch of rays has been processed, all intersection queues
are executed, feeding the rays with intersection points and normals. Objects
in each elementary object class can be processed in batch using SSE or SSE2
instructions. SIMD instructions can be utilized with full bandwidth, because
multiple objects (such as 4 spheres) can be intersected with different rays
at the same time. For example, 4000 ray-sphere intersections can be
calculated in one batch without a single effective cache miss (thanks to
prefetch instructions). Queue length and the number of parallel rays can be
optimized according to the measured cache characteristics of the specific
CPU.
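
The queue idea for one object class (spheres) could look roughly like the sketch below. `SphereQueue`, `post`, `intersect_one` and `execute` are hypothetical names, ray directions are assumed to be normalized, and only the nearest hit in front of the ray origin is reported (rays starting inside a sphere count as misses). Normals and prefetching are left out. The SSE path handles four independent ray/sphere pairs per iteration; the scalar routine covers the tail and non-SSE builds.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics
#define USE_SSE 1
#endif

// Hypothetical request queue for the sphere class, stored as struct-of-arrays.
struct SphereQueue {
    std::vector<float> ox, oy, oz;     // ray origin
    std::vector<float> dx, dy, dz;     // ray direction, assumed normalized
    std::vector<float> cx, cy, cz, r;  // sphere centre and radius
    std::vector<float> t;              // result: hit distance, -1 for a miss

    void post(const float o[3], const float d[3], const float c[3], float radius) {
        ox.push_back(o[0]); oy.push_back(o[1]); oz.push_back(o[2]);
        dx.push_back(d[0]); dy.push_back(d[1]); dz.push_back(d[2]);
        cx.push_back(c[0]); cy.push_back(c[1]); cz.push_back(c[2]);
        r.push_back(radius);
        t.push_back(-1.0f);
    }
};

// Scalar reference: solve |o + t*d - c|^2 = r^2 for the nearer root.
static float intersect_one(const SphereQueue& q, std::size_t i) {
    float x = q.ox[i] - q.cx[i], y = q.oy[i] - q.cy[i], z = q.oz[i] - q.cz[i];
    float b = x * q.dx[i] + y * q.dy[i] + z * q.dz[i];
    float c = x * x + y * y + z * z - q.r[i] * q.r[i];
    float disc = b * b - c;
    if (disc < 0.0f) return -1.0f;
    float t = -b - std::sqrt(disc);
    return t >= 0.0f ? t : -1.0f;
}

// Execute every queued request and write the results into q.t.
void execute(SphereQueue& q) {
    std::size_t i = 0, n = q.t.size();
    assert(q.ox.size() == n && q.r.size() == n);
#ifdef USE_SSE
    const __m128 zero = _mm_setzero_ps(), miss = _mm_set1_ps(-1.0f);
    // Unaligned loads, because std::vector does not promise 16-byte alignment.
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_sub_ps(_mm_loadu_ps(&q.ox[i]), _mm_loadu_ps(&q.cx[i]));
        __m128 y = _mm_sub_ps(_mm_loadu_ps(&q.oy[i]), _mm_loadu_ps(&q.cy[i]));
        __m128 z = _mm_sub_ps(_mm_loadu_ps(&q.oz[i]), _mm_loadu_ps(&q.cz[i]));
        __m128 rad = _mm_loadu_ps(&q.r[i]);
        __m128 b = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(x, _mm_loadu_ps(&q.dx[i])),
                       _mm_mul_ps(y, _mm_loadu_ps(&q.dy[i]))),
            _mm_mul_ps(z, _mm_loadu_ps(&q.dz[i])));
        __m128 c = _mm_sub_ps(
            _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                       _mm_mul_ps(z, z)),
            _mm_mul_ps(rad, rad));
        __m128 disc = _mm_sub_ps(_mm_mul_ps(b, b), c);
        __m128 root = _mm_sqrt_ps(_mm_max_ps(disc, zero));
        __m128 tt = _mm_sub_ps(_mm_sub_ps(zero, b), root);
        // Lane mask: real root exists and lies in front of the origin.
        __m128 hit = _mm_and_ps(_mm_cmpge_ps(disc, zero), _mm_cmpge_ps(tt, zero));
        _mm_storeu_ps(&q.t[i],
                      _mm_or_ps(_mm_and_ps(hit, tt), _mm_andnot_ps(hit, miss)));
    }
#endif
    for (; i < n; ++i) q.t[i] = intersect_one(q, i);  // tail or non-SSE build
}
```

The ray loop would call `post` for every candidate sphere and then call `execute` once per batch, so the inner loop touches only contiguous arrays.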

Groups of object queues can be adaptively delegated to multiple processors
(in multiprocessor machines) or networked machines (in network rendering
mode). Time-consuming intersection operations (such as sturm) can be
delegated to other machines running a "povray service". Because only the
intersection calculations are delegated, large amounts of texture data do
not have to be copied. Cache usage can be optimized with "prefetch"
instructions, because the queue locations are known in advance. The number
of effective cache misses causing wait states can be minimized this way.
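
One possible way to split queues between processors, sketched under loud assumptions: the post does not say how the delegation would be decided, so the greedy "largest job to the least loaded worker" heuristic below is a stand-in, and `QueueLoad`, `cost_per_test` and `assign_queues` are hypothetical names. The per-test cost would have to come from measurement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical summary of one object queue after a batch of rays.
struct QueueLoad {
    std::size_t pending;   // number of queued intersection requests
    double cost_per_test;  // assumed (measured) relative cost of one test
    double total() const { return static_cast<double>(pending) * cost_per_test; }
};

// Returns, for each queue, the index of the worker that should execute it.
// Greedy heuristic: take queues from most to least expensive and give each
// one to the worker with the smallest load so far.
std::vector<std::size_t> assign_queues(const std::vector<QueueLoad>& queues,
                                       std::size_t workers) {
    assert(workers > 0);
    std::vector<std::size_t> order(queues.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return queues[a].total() > queues[b].total();
    });

    std::vector<double> load(workers, 0.0);
    std::vector<std::size_t> owner(queues.size(), 0);
    for (std::size_t q : order) {
        std::size_t w = static_cast<std::size_t>(
            std::min_element(load.begin(), load.end()) - load.begin());
        owner[q] = w;
        load[w] += queues[q].total();
    }
    return owner;
}
```

Re-running the assignment after every batch with freshly measured costs would make it adaptive; a worker index could just as well stand for a networked machine.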

Quality rendering with full precision (anti-aliasing, a large number of
reflected rays, ...) should be done in double precision, but can be queued
with the same mechanism.
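
To illustrate how one queue mechanism might serve both precisions, the sketch below parameterises a request and its batch routine on the scalar type. `SphereRequest` and `execute_queue` are hypothetical names, directions are assumed normalized, and the loop is plain scalar code rather than SIMD.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical intersection request, parameterised on the scalar type so the
// same queue code can run in float (preview) or double (final quality).
template <typename Real>
struct SphereRequest {
    Real o[3], d[3];  // ray origin and normalized direction
    Real c[3], r;     // sphere centre and radius
    Real t;           // result: nearest hit distance, or -1 for a miss
};

// Execute every request in the queue and store the result in its t field.
template <typename Real>
void execute_queue(std::vector<SphereRequest<Real>>& queue) {
    for (auto& q : queue) {
        Real x = q.o[0] - q.c[0], y = q.o[1] - q.c[1], z = q.o[2] - q.c[2];
        Real b = x * q.d[0] + y * q.d[1] + z * q.d[2];
        Real c = x * x + y * y + z * z - q.r * q.r;
        Real disc = b * b - c;
        q.t = Real(-1);
        if (disc >= Real(0)) {
            Real t = -b - std::sqrt(disc);
            if (t >= Real(0)) q.t = t;
        }
    }
}
```

Instantiating `execute_queue<float>` gives the preview path and `execute_queue<double>` the quality path, with the posting logic unchanged.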

I have some experience in raytracing acceleration using octree auto-bounding
lookups with SIMD (MMX) technology. This can reduce the number of
unnecessary intersection calculations for expensive objects. Infinite
objects and simple ones (spheres, etc.) should not be bounded by this method.

I'll be able to do some research/work in this field after the POV-Ray 3.5
source code is released.

Best regards, Vic



Copyright 2003-2023 Persistence of Vision Raytracer Pty. Ltd.