Objects in the 3D scene and the scene itself are sequentially converted, or
transformed, through five spaces when proceeding through the 3D pipeline. A
brief overview of these spaces follows:
Model Space: where each model is in its own coordinate system, whose origin is some point on the model, such as the right foot of a soccer player model. The model will also typically have a control point or "handle". To move the model, the 3D renderer only has to move the control point, because the model space coordinates of the object remain constant relative to its control point. The object can also be rotated by using that same "handle".
World Space: where models are placed in the actual 3D world, in a unified world coordinate system. Many 3D programs skip world space and instead go directly to view or clip space. The OpenGL API, for example, doesn't really have a separate world space.
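The relationship between model space and world space can be sketched in a few lines of Python. This is a minimal illustration, not how any particular renderer does it: the function name `model_to_world`, the `handle` parameter, and the choice of a single rotation about the y-axis are all assumptions made for the example (real APIs use 4x4 matrices and their own sign conventions).

```python
import math

def model_to_world(vertex, handle, yaw_radians=0.0):
    """Rotate a model-space vertex about the y-axis, then translate it
    by the world-space position of the model's handle (hypothetical names)."""
    x, y, z = vertex
    c, s = math.cos(yaw_radians), math.sin(yaw_radians)
    # Rotation about the y-axis; the sign convention here is an assumption.
    rx = c * x + s * z
    rz = -s * x + c * z
    hx, hy, hz = handle
    return (rx + hx, y + hy, rz + hz)

# The model-space vertices never change...
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]

# ...only the handle moves, and the world-space positions follow it.
placed = [model_to_world(v, handle=(10.0, 0.0, 5.0)) for v in model]
moved = [model_to_world(v, handle=(12.0, 0.0, 5.0)) for v in model]
```

Moving the handle two units along x shifts every world-space vertex by the same amount, while the `model` list itself is untouched.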
View Space: in this space, the view camera is positioned by the application
(through the graphics API) at some point in the 3D world coordinate system,
if world space is being used. The world space coordinate system is then transformed,
such that the camera (your eye point) is now at the origin of the coordinate
system, looking straight down the z-axis into the scene. If world space is bypassed,
then the scene is transformed directly into view space, with the camera similarly
placed at the origin and looking straight down the z-axis. Whether z values
are increasing or decreasing as you move forward away from the camera into the
scene is up to the programmer, but for now assume that z values are increasing
as you look into the scene down the z-axis. Note that culling, back-face culling,
and lighting operations can be done in view space.
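As a rough sketch of the view transform just described, the following Python function moves the camera to the origin and turns the scene so the camera looks straight down the z-axis, with z increasing into the scene. The name `world_to_view`, the `cam_pos` and `cam_yaw` parameters, and the restriction to a single rotation about the y-axis are assumptions for illustration; a real camera also has pitch and roll and is usually expressed as a 4x4 matrix.

```python
import math

def world_to_view(point, cam_pos, cam_yaw=0.0):
    """Express a world-space point in view space (hypothetical helper).

    cam_yaw is the camera's rotation about the y-axis, in radians;
    a yaw of 0 means the camera already looks down +z.
    """
    # Step 1: translate so that the camera sits at the origin.
    x = point[0] - cam_pos[0]
    y = point[1] - cam_pos[1]
    z = point[2] - cam_pos[2]
    # Step 2: apply the inverse of the camera's rotation, so that the
    # camera's viewing direction becomes the +z axis.
    c, s = math.cos(-cam_yaw), math.sin(-cam_yaw)
    return (c * x + s * z, y, -s * x + c * z)

# A camera five units "behind" the world origin sees the origin
# five units straight ahead.
ahead = world_to_view((0.0, 0.0, 0.0), cam_pos=(0.0, 0.0, -5.0))
```

Note that the scene moves opposite to the camera: shifting the camera back by five units is the same as shifting every point forward by five.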
The view volume is actually created by a projection, which as the name suggests,
"projects the scene" in front of the camera. In this sense, it's a
kind of role reversal in that the camera now becomes a projector, and the scene's
view volume is defined in relation to the camera. Think of the camera as a kind
of holographic projector, but instead of projecting a 3D image into air, it
instead projects the 3D scene "into" your monitor. The shape of this
view volume is either rectangular (called a parallel projection), or
pyramidal (called a perspective projection), and this latter volume is
called a view frustum (also commonly misspelled frustrum; frustum
is the correct term).
The view volume defines what the camera will see, but just as importantly, it
defines what the camera won't see, and in so doing, many object models and
parts of the world can be discarded, sparing both 3D chip cycles and memory
bandwidth.
The frustum actually looks like a pyramid with its top cut off. The top of
the inverted pyramid projection is closest to the camera's viewpoint and radiates
outward. The top of the frustum is called the near (or front) clipping
plane and the back is called the far (or back) clipping plane. The entire
rendered 3D scene must fit between the near and far clipping planes, and also
be bounded by the sides, top, and bottom of the frustum. If triangles of the
model (or parts of the world space) fall outside the frustum, they won't be processed.
Similarly, if a triangle is partly inside and partly outside the frustum, the
external portion will be clipped off at the frustum boundary, hence the term
clipping. Though the view space frustum has clipping planes, clipping
is actually performed when the frustum is transformed to clip space.
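The three outcomes described above (keep, discard, or clip) can be sketched as a simple classification. The code below tests view-space vertices against the six planes of a symmetric perspective frustum purely for illustration; as noted, real hardware does this work in clip space. The function names, the default `near`, `far`, `fov_y_deg`, and `aspect` values, and the assumption that z increases into the scene are all choices made for this example.

```python
import math

def outside_planes(p, near, far, half_w, half_h):
    """Return the set of frustum planes that view-space point p lies outside."""
    x, y, z = p
    out = set()
    if z < near:
        out.add("near")
    if z > far:
        out.add("far")
    # The side planes pass through the camera and widen with distance z.
    if x < -z * half_w:
        out.add("left")
    if x > z * half_w:
        out.add("right")
    if y < -z * half_h:
        out.add("bottom")
    if y > z * half_h:
        out.add("top")
    return out

def classify_triangle(tri, near=1.0, far=100.0, fov_y_deg=90.0, aspect=1.0):
    """Classify a triangle as 'inside', 'outside', or 'clip'."""
    half_h = math.tan(math.radians(fov_y_deg) / 2.0)
    half_w = half_h * aspect
    codes = [outside_planes(p, near, far, half_w, half_h) for p in tri]
    if not any(codes):
        return "inside"   # every vertex is within all six planes
    if set.intersection(*codes):
        return "outside"  # all vertices are beyond one common plane
    return "clip"         # conservative: may straddle a boundary

visible = classify_triangle([(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0)])
too_far = classify_triangle([(0.0, 0.0, 200.0), (1.0, 0.0, 200.0), (0.0, 1.0, 200.0)])
partial = classify_triangle([(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 0.0, 0.5)])
```

The "clip" result is deliberately conservative: a triangle whose vertices lie outside different planes is passed on for clipping even though it might turn out to be entirely invisible.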
Clip Space: similar to View Space, but the frustum is now "squished" into a unit cube, with the x and y coordinates normalized to a range between -1 and 1, and z between 0 and 1, which simplifies clipping calculations. The "perspective divide" performs the normalization feat, by dividing all x, y, and z vertex coordinates by a special "w" value, which is a scaling factor that we'll soon discuss in more detail. The perspective divide makes nearer objects larger and farther objects smaller, as you would expect when viewing a scene in reality.
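To make the perspective divide concrete, here is a small Python sketch. It assumes a symmetric frustum and a Direct3D-style depth range of 0 to 1, matching the ranges given above; the function names and the `near`, `far`, `fov_y_deg`, and `aspect` parameters are hypothetical, and other APIs use different conventions.

```python
import math

def project(view_point, near=1.0, far=100.0, fov_y_deg=90.0, aspect=1.0):
    """Turn a view-space point into (x, y, z, w) clip coordinates."""
    x, y, z = view_point
    y_scale = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    x_scale = y_scale / aspect
    q = far / (far - near)
    # w carries the view-space depth, which the divide will use.
    return (x * x_scale, y * y_scale, q * (z - near), z)

def perspective_divide(clip):
    """Divide x, y, and z by w to get normalized coordinates."""
    x, y, z, w = clip
    return (x / w, y / w, z / w)

# The same sideways offset of one unit, seen at two different depths.
near_obj = perspective_divide(project((1.0, 0.0, 2.0)))
far_obj = perspective_divide(project((1.0, 0.0, 10.0)))
```

With these settings the nearer point lands at about x = 0.5 and the farther one at about x = 0.1, so the same offset covers less of the screen as depth increases. A point on the near plane gets z = 0 and a point on the far plane gets z close to 1.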
Screen Space: where the 3D image is converted into x and y 2D screen coordinates for 2D display. Note that z and w coordinates are still retained by the graphics system for depth/Z-buffering (see Z-buffering section below) and back-face culling before the final render. Note that the conversion of the scene to pixels, called rasterization, has not yet occurred.
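The conversion to screen coordinates can be sketched as a simple scale and offset. This example assumes normalized x and y in the range -1 to 1 and a screen origin at the top-left corner with y increasing downward, which is a common but not universal convention; the function name `ndc_to_screen` and its parameters are hypothetical.

```python
def ndc_to_screen(ndc, width, height):
    """Map normalized (x, y, z) to screen x and y, keeping z for depth tests."""
    x, y, z = ndc
    sx = (x + 1.0) * 0.5 * width    # -1..1 becomes 0..width
    sy = (1.0 - y) * 0.5 * height   # flip y so +1 is the top row
    return (sx, sy, z)

# The center of the normalized cube maps to the center of a 640x480 screen.
center = ndc_to_screen((0.0, 0.0, 0.5), 640, 480)
```

The z value passes through unchanged, reflecting the point above that depth is retained for Z-buffering even though only x and y determine where the vertex lands on screen.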
Because so many of the conversions involved in transforming through these different
spaces essentially change the frame of reference, it's easy to get confused.
Part of what makes the 3D pipeline confusing is that there isn't one "definitive"
way to perform all of these operations, since researchers and programmers have
discovered different tricks and optimizations that work for them, and because
there are often multiple viable ways to solve a given 3D/mathematical problem.
But, in general, the space conversion process follows the order we just described.
To get an idea about how these different spaces interact, consider this example:
Take several pieces of Lego, and snap them together to make some object. Think
of the individual pieces of Lego as the object's edges, with vertices existing
where the Legos interconnect (while Lego construction does not form triangles,
the most popular primitive in 3D modeling, but rather quadrilaterals, our example
will still work). Placing the object in front of you, the origin of the model
space coordinates could be the lower left near corner of the object, and all
other model coordinates would be measured from there. The origin can actually
be any part of the model, but the lower left near corner is often used. As you
move this object around a room (the 3D world space or view space, depending
on the 3D system), the Lego pieces' positions relative to one another remain
constant (model space), although their coordinates change in relation to the
room (world or view spaces).
In some sense, 3D chips have become physical incarnations
of the pipeline, where data flows "downstream" from stage to stage.
It is useful to note that most operations in the application/scene stage and
the early geometry stage of the pipeline are done per vertex, whereas culling
and clipping are done per triangle, and rendering operations are done per pixel.
Computations in various stages of the pipeline can be overlapped, for improved
performance. For example, because vertices and pixels are mutually independent
of one another in both Direct3D and OpenGL, one triangle can be in the geometry
stage while another is in the rasterization stage. Furthermore, computations
on two or more vertices in the geometry stage and two or more pixels (from the
same triangle) in the rasterization stage can be performed at the same time.
Another advantage of pipelining is that because no data is passed from one vertex
to another in the geometry stage or from one pixel to another in the rendering
stage, chipmakers have been able to implement multiple pixel pipes and gain
considerable performance boosts using parallel processing of these independent
entities. It's also useful to note that the use of pipelining for real-time
rendering, though it has many advantages, is not without downsides. For instance,
once a triangle is sent down the pipeline, the programmer has pretty much waved
goodbye to it. Getting status or color/alpha information about that triangle once
it's in the pipe is very expensive in terms of performance, and can cause pipeline
stalls, a definite no-no.
ExtremeTech 3D Pipeline Tutorial
June, 2001
By: Dave Salvator
extract from http://www.extremetech.com/