The Raster Graphics Pipeline

Figure 1: Composite output from the provided sample rasterizer code[1] (color rendering at left, depth rendering at right).

This tutorial is intended to be a crash course in raster graphics theory: a complete, yet concise, explanation of the graphics pipeline. It should give you enough context to know what the operations you're doing mean when using a graphics API like OpenGL or Direct3D (we won't focus on either API's details, but everything is described in an OpenGL-centric way). Jargon to know is highlighted like so.

Since code is worth a thousand words, I have written a sample 3D CPU-based rasterizer that demonstrates the basics (see Figure 1). It doesn't do all operations in a full graphics pipeline, but it does most of the key steps in a conceptually correct way, and importantly it does them simply, so you can follow and learn from it. This tutorial does not refer to that code specifically, but it may be helpful for reference as you read.


Contextual Survey

Figure 2: Photo of an NVIDIA GPU, apparently a GTX 275 (very dated now, but pretty).

The GPU is a highly optimized chip or chip component designed to draw stuff, often combined with powerful support hardware as a circuit-board-plus-chip assembly called a graphics card. When given data about a 2D or 3D scene, the GPU's job is to render it into an image. In a basic workflow, this means drawing 3D triangles as 2D pixels in a framebuffer (the contents of the screen in the GPU's memory).

Figure 3: Logos for the OpenGL and Vulkan graphics APIs (images' source).

There are also graphics APIs, like OpenGL, Vulkan, and Direct3D, that let you, the programmer, talk to the GPU. These APIs let you control how the GPU turns the triangles into pixels. The GPU is, broadly speaking, a state machine: you configure it a certain way, and then do some draw operations. The internal state that the GPU stores affects what the GPU does, and how it does it.

The rest of this tutorial presents an idealized version of the graphics pipeline (which describes the process of drawing with a GPU). In its classic, simplest form, it consists, in order, of the Application Stage, the Geometric Stage(s) (primarily and minimally the Vertex Stage), the Rasterization Stage, the Fragment Stage, the Framebuffer Stage, and the Screen Stage:

  1. Application Stage: the programmer (you!) talks to the graphics API, controlling the following steps and providing data to draw.
  2. Geometric Stage(s): transformation of the geometric data to pixel coordinates.
  3. Rasterization Stage: conversion of geometry from vectors into pixels.
  4. Fragment Stage: assigning color/depth to each pixel.
  5. Framebuffer Stage: testing the incoming pixel to see whether it should be visible.
  6. Screen Stage: the memory on the GPU is displayed onto the screen.

Application Stage

This is where your program runs, handling everything your GPU doesn't (like your application's logic, AI, maybe some physics, etc.). Graphics-wise, you talk to the GPU through a series of API commands (graphics calls). These calls affect the GPU's state, load or retrieve data, or tell the GPU to draw something using its current state. Everything you do in your program technically falls under the application stage, though it's also where you configure the rest of the pipeline.

The first thing you have to do is set up a graphics context (e.g. an OpenGL context). The context is how the GPU knows about the application trying to talk to it, and encapsulates all resources used (models, textures, framebuffers, etc.). Setting up the OpenGL context is usually the job of an off-the-shelf windowing system, which means most programmers will not need to worry about it[2]. Two examples of such windowing systems are GLFW and SDL[3].

You must also specify your geometric data to the GPU so that it has something to actually draw. Essentially, you upload buffer(s) of data called vertex buffer objects (VBOs). You will also likely want to upload buffer(s) called index buffer objects (IBOs). The idea is that chunks of data selected from the VBOs (usually indicated by vertex indices from an IBO) get interpreted as vertices and then geometric objects. You explain the details of how this is to be done with a vertex array object (VAO). There were other, simpler but less-efficient, ways in the past[4].
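Since this is easier to see than to describe, here is a minimal, hedged sketch of that setup in modern OpenGL (it assumes a context and function loader are already initialized, and uses a toy layout of one position attribute per vertex):

    // Sketch: upload one triangle (positions only) as a VBO + IBO, described by a VAO.
    // Assumes an OpenGL 3.3+ context and a loader (e.g. GLAD/GLEW) are already set up.
    float positions[] = {
        -0.5f, -0.5f, 0.0f,
         0.5f, -0.5f, 0.0f,
         0.0f,  0.5f, 0.0f
    };
    unsigned int indices[] = { 0, 1, 2 };

    GLuint vao, vbo, ibo;
    glGenVertexArrays(1, &vao);
    glBindVertexArray(vao);                      // the VAO records the layout set below

    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);          // VBO: the raw vertex data
    glBufferData(GL_ARRAY_BUFFER, sizeof(positions), positions, GL_STATIC_DRAW);

    glGenBuffers(1, &ibo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);  // IBO: which vertices form each triangle
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);

    // Tell the VAO how to interpret the VBO: attribute 0 is three floats per vertex.
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float), (void*)0);
    glEnableVertexAttribArray(0);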

You must also set up the GPU's internal state to configure how it renders, for example with calls like (OpenGL) glEnable(⋯). Nowadays[5], you must also describe how (at minimum) the vertex stage works, in the form of a shader. (Almost always, you provide a shader for the fragment stage too, though this is technically optional.) A shader is a program the GPU executes to implement a given stage. The collection of shaders used for all stages forms a shader program.

Finally, you tell the graphics API to draw using the GPU's current state, e.g. (with OpenGL) glDrawArrays(⋯) (VBOs only) or glDrawElements(⋯) (VBOs and an IBO). These calls essentially tell the GPU to start rendering a bunch of primitives. A primitive is usually something like a triangle strip (a list of triangles, specified efficiently to share vertices)[6]. The individual vertices of the primitives get transformed by the vertex stage (and possibly additional geometric stage(s)), reassembled into base primitives (such as just individual triangles), and then drawn to the screen by subsequent stages.
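Continuing the sketch above, and assuming a compiled-and-linked shader program (creation not shown), issuing the draw might look like:

    // Sketch: draw the triangle uploaded earlier using the GPU's current state.
    glUseProgram(shaderProgram);   // 'shaderProgram' is an assumed, already-linked program
    glBindVertexArray(vao);
    glDrawElements(GL_TRIANGLES, 3, GL_UNSIGNED_INT, 0);   // 3 indices -> 1 triangle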


Geometric Stage(s)

The GPU has been told to draw something, and it has a bunch of geometric vertex data that's supposed to be connected up somehow. The geometry you provide is specified in some object-space, a localized space that defines objects with respect to a common origin. For example, a 'person' model might have its polygons defined with the origin being directly under its feet. This is ideal for modeling, but it has no relationship to the screen-space that the pixel coordinates are defined in.

The job of the geometric stage(s) is fundamentally to convert from object-space to screen-space. The quintessential, required, geometric stage is the vertex stage. However, one may also optionally have the tessellation-control stage and tessellation-evaluation stage, and/or optionally the geometry stage—both much less commonly used. We'll elaborate a bit more below, after explaining what it is we're actually trying to accomplish, but these different stages basically offer different granularities at which to run the operations.

There are actually a bunch of different spaces we should consider. Most transformations between spaces are done by a transformation matrix: you multiply a vertex (or a bunch of vertices) by the matrix to get the result in the new space. These are generally \(4\times 4\) matrices, which might seem a little weird since we're in (at most) 3D. However, using 4D matrices is convenient for technical reasons (e.g. it lets us express translation as a matrix multiplication).
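For example, a translation by \(\langle t_x, t_y, t_z \rangle\), which is not a linear operation on 3D points, becomes a single matrix multiplication on homogeneous points:

\[ \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix} = \begin{bmatrix} x + t_x \\ y + t_y \\ z + t_z \\ 1 \end{bmatrix} \]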


Object-, World-, and Eye-Space

First, there's the aforementioned object-space, the space the 3D modeler defined the object in.

Next, there's the world-space, which is a common reference frame that all the objects share. Continuing our example, imagine two 'person' models standing next to each other. It is clear that the polygons of one, although analogous to the polygons of the other, are in physically different locations. By multiplying one copy of the person model's vertices by one matrix and the other copy by a different matrix, the two objects' polygons will be transformed to different places. That per-object transformation matrix is called the model matrix[7].

Next comes eye-space, which is the scene defined in coordinates relative to the camera. Eye space can be imagined as the eyeball's coordinate system, with the positive \(x\)-axis to the right, the positive \(y\)-axis upwards, and looking down the negative \(z\)-axis[8]. To transform to eye-space, we multiply by the view matrix[7].


Clip-Space, NDC, and Perspective

Next, we distort eye-space according to e.g. 3D perspective. There are three key steps to this: (1) the (perspective-)projection matrix, which transforms from eye-space into clip-space, so named because it is the precursor to (2) clipping, which crops the scene to only the visible geometry, and (3) the perspective divide, which finishes the perspective transformation. Taken together, these accomplish perspective distortion. Importantly, don't confuse the projection matrix with the perspective transformation as a whole!—it's just the first step.

Figure 4: The viewing frustum implied by a 3D perspective projection matrix (image source).

Imagine a 3D pyramid with the top cut off (see Figure 4), a shape called a frustum. The apex of the former pyramid is your eye, and the sides of the frustum correspond to the edges of your screen. The base is the far clipping plane, and the top is the near clipping plane. The volume enclosed by the frustum is the volume you can see, and its details are given by the projection matrix.

The overall perspective transform distorts this frustum into a rectangular solid, which accomplishes the effect of perspective (note: for 2D rendering, the frustum is actually a rectangular solid to start with, so the transform just rescales it).
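For concreteness, here is the classic OpenGL-style perspective projection matrix (the glFrustum form, where \(l,r,b,t\) are the left/right/bottom/top edges of the near plane and \(n,f\) are the near and far plane distances). You don't need to memorize it, but keep an eye on its last two rows; they will come back when we discuss depth:

\[ P = \begin{bmatrix} \frac{2n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\ 0 & \frac{2n}{t-b} & \frac{t+b}{t-b} & 0 \\ 0 & 0 & -\frac{f+n}{f-n} & -\frac{2fn}{f-n} \\ 0 & 0 & -1 & 0 \end{bmatrix} \]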

After we multiply an eye-space vertex (which is of the form \(\langle x,y,z,1 \rangle\)) by the \(4\times 4\) projection matrix, we'll get a new \(4\times 1\) vertex in clip-space. However, the final component (the \(w\)-component) will no longer be \(1\). This is because the clip-space vertex is in homogeneous coordinates. You can think of it as a 4D coordinate system where the last component scales the remaining 3D coordinate through the origin. (If this seems a little weird, that's because it is.)

The final step will be to recover the 3D world by dividing by the \(w\)-component, something like \(\langle ~x/w,~y/w,~z/w,~w/w~ \rangle\), the perspective divide.

However, we cannot go ahead and divide just yet! One reason why is that a vertex behind the camera will turn out to have a negative \(w\)-component. When we divide by \(w\), the sign of the \(\langle x,y,z\rangle\) coordinates will flip, effectively putting that vertex in front of the camera! Another reason is that vertices that are outside the frustum shouldn't be drawn, because they wouldn't show up on the screen, and we shouldn't waste time drawing them!

Thus, the GPU first performs clipping, wherein triangles that extend outside of the clip-space volume are subdivided and culled until only geometry inside it remains. Most sources are vague or wrong about this, so to be clear: this is clipping in the 4D homogeneous space, and the key idea is to clip such that after the perspective divide, we are in normalized device coordinates (NDC)—i.e. \([-1,+1]\) on all axes (for OpenGL). Any vertex behind the camera (negative \(w\)) will be clipped, as will any whose \(x\)-, \(y\)-, or \(z\)-components exceed their \(w\)-components in magnitude. For example, for \(x\), if we want \(-1 \leq x/w \leq 1\), we need to clip against the two planes given by \(-w \leq x \leq w\), and similarly for \(y\) and \(z\). In total there are six planes[9]. At the end of clipping, only triangles that lie inside the clip volume will remain, and furthermore, all these triangles' vertices will have nonnegative \(w\)-components.
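As a sanity check of what clipping must guarantee, here is a sketch (in the spirit of the sample rasterizer, but not taken from it) of just the 'is this clip-space vertex inside the clip volume?' test; the actual clipping of triangles against the six planes is more involved:

    // Sketch: test whether a clip-space vertex lies inside the OpenGL clip volume.
    // Triangles with a vertex outside must be clipped against the offending plane(s).
    struct Vec4 { float x, y, z, w; };

    bool insideClipVolume(const Vec4& v) {
        return -v.w <= v.x && v.x <= v.w &&
               -v.w <= v.y && v.y <= v.w &&
               -v.w <= v.z && v.z <= v.w &&
                v.w > 0.0f;   // also rejects the degenerate w <= 0 case (behind the camera)
    }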

At this point, the GPU can then safely and logically divide all vertices by their \(w\)-components. Again, this is called the perspective divide, and it results in coordinates \([-1,+1]\) on all axes (for OpenGL), which is called normalized device coordinates (NDC) or the canonical view volume.

Note that the \(w\)-component of all vertices will now naturally be \(1\) (because \(w/w=1\)). We can thus throw away the fourth component, returning back to our 3D universe. However, in the rasterization stage, it will turn out we'll need \(1/w\) from each vertex, so we do still need to save that somewhere[10].


Viewport

From here, we apply the viewport transform, which is just a simple scale and bias of NDC to a different rectangular solid, window-space. For \(x\) and \(y\), window-space is basically just pixel coordinates (note that pixels don't exist yet though!). The \(z\)-component is the screen-space depth, and defines how 'far away' the pixel is from the eye. However, depth is not physical distance, even along the \(z\)-axis alone; it is nonlinear in a way we'll discuss below. The depth value will be important in the framebuffer stage.
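Concretely, for a viewport whose lower-left corner is \((x_0, y_0)\) with size \(W \times H\) (and the default \([0,1]\) depth range), the viewport transform is just:

\[ x_\text{win} = \frac{W}{2}\left(x_\text{ndc} + 1\right) + x_0, \qquad y_\text{win} = \frac{H}{2}\left(y_\text{ndc} + 1\right) + y_0, \qquad z_\text{screen} = \frac{1}{2} z_\text{ndc} + \frac{1}{2} \]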


Programming and Stages

Everything up through multiplication by the projection matrix, you need to specify in a shader[11]. The clipping, perspective divide, and viewport transform are (thankfully) done for you, though it's still useful to know about them.

At minimum, you must define a vertex shader, which runs on each vertex in the vertex stage. In a simple workflow, this does everything, multiplying by the model matrix, then the view matrix, then the projection matrix (in practice, you may want to concatenate some or all of these matrices together for efficiency). The output of the vertex shader is then assembled into primitives and sent off to the next stage, rasterization.
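In symbols, with \(M\), \(V\), and \(P\) the model, view, and projection matrices, the simple vertex shader's essential output for each object-space vertex \(\vec{p}_\text{object}\) is:

\[ \vec{p}_\text{clip} = P \, V \, M \, \vec{p}_\text{object} \]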

There are, however, additional, optional stages you can tack on after the vertex stage. You can enable tessellation, which adds two new stages, the tessellation-control stage and the tessellation-evaluation stage, with corresponding shaders. This basically subdivides the assembled primitives, outputting (generally many) more primitives. There can also (independently) be the geometry stage, which operates on the granularity of a single primitive. At the end, the resulting geometric objects are called base primitives, and are sent off to the next stage, rasterization.


Rasterization Stage

At this point, the base primitives are now defined in terms of coordinates on the screen. The \(x\)- and \(y\)-components define the \(x\)- and \(y\)-position of the vertices on the screen, while the \(z\)-component is related to the depth of the vertex (how 'far in' the vertex is, in \(z\), though again as we'll see it no longer has quite that meaning).

The general (and in some implementations, only) case is that the base primitive is a triangle, so we'll assume that that's what we're working with in the following.

Figure 5: A triangle being converted into a triangular region of pixels by the rasterizer (image source).

Triangles to Fragments

A triangle is still defined by its vertices, three infinitely small mathematical points, but we want to have pixels! In this stage, the rasterizer does the first step of that, converting the three points into a set of fragments. A fragment is a "pixel in flight", a pixel-that-could-be. It is the information required to draw one pixel, but not the pixel itself, just yet. The information usually includes the depth, and often also information like interpolated vertex attribute(s), such as the texture coordinate (texcoord), which maps an image onto the 3D triangle, or normal vector, which is a bivector notionally perpendicular to the surface.

The rasterizer finds the set of pixels the triangle covers, and creates a fragment at each such pixel's location[12]. This is accomplished by sophisticated 'edge-walking' algorithms that minimize wasted computational effort. Many rasterization algorithms also try to output fragments that are physically close together at the same time to maximize cache locality. The simplest way this can be done is by rasterizing in screen-space tiles (you divide the framebuffer into small rectangles, and then rasterize all the pixels in that rectangle at the same time). The purest expression of this is called tile-based rendering, and is widely used on mobile devices.
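For intuition only (real rasterizers are much cleverer), here is a minimal bounding-box rasterization sketch using edge functions: a pixel whose center lies on the inner side of all three edges is covered, and a fragment is emitted there. The emitFragment call is hypothetical, standing in for handing the fragment to the next stage.

    #include <algorithm>
    #include <cmath>

    // Sketch: naive bounding-box rasterization of one screen-space triangle.
    struct Vec2 { float x, y; };

    // Signed edge function: > 0 if point p lies to the left of the directed edge a->b.
    float edge(const Vec2& a, const Vec2& b, const Vec2& p) {
        return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    }

    void rasterizeTriangle(const Vec2& v0, const Vec2& v1, const Vec2& v2,
                           int width, int height) {
        int xmin = std::max(0,          (int)std::floor(std::min({v0.x, v1.x, v2.x})));
        int xmax = std::min(width  - 1, (int)std::ceil (std::max({v0.x, v1.x, v2.x})));
        int ymin = std::max(0,          (int)std::floor(std::min({v0.y, v1.y, v2.y})));
        int ymax = std::min(height - 1, (int)std::ceil (std::max({v0.y, v1.y, v2.y})));

        for (int y = ymin; y <= ymax; ++y)
        for (int x = xmin; x <= xmax; ++x) {
            Vec2 p = { x + 0.5f, y + 0.5f };         // sample at the pixel center
            float w0 = edge(v1, v2, p);
            float w1 = edge(v2, v0, p);
            float w2 = edge(v0, v1, p);
            if (w0 >= 0 && w1 >= 0 && w2 >= 0) {     // assumes counterclockwise winding
                // emitFragment(x, y, w0, w1, w2);   // hypothetical: pass to the fragment stage
            }
        }
    }

Conveniently, dividing \(w_0\), \(w_1\), and \(w_2\) by their sum gives exactly the barycentric coordinates used for interpolation in the next subsection.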

After rasterization, the fragments go on to the fragment stage, where their final attribute(s) are computed.


Interpolating Attributes

The vertex attribute(s) stored for each vertex must be interpolated to each fragment. For example, the vertices could have texture coordinate and normal attributes, which need to be interpolated onto the intermediate locations of the pixels subtended by the triangle. Also, so that triangles tilted relative to the screen display correctly, we must do this with perspective-correct interpolation[13].

Doing this is a little tricky, and it's easiest to express with actual math. Let's say we have a screen-space triangle with vertex positions \(\vec{p}_a\), \(\vec{p}_b\), and \(\vec{p}_c\). Every point in the triangle can be expressed as a linear combination of these positions:

\[ \vec{p}(\alpha,\beta,\gamma) = \alpha \,\vec{p}_a + \beta \,\vec{p}_b + \gamma \,\vec{p}_c \]

The weights \(\langle\alpha,\beta,\gamma\rangle\) are called barycentric coordinates. Intuitively, e.g. \(\alpha\) says how much to weight the quantity at vertex \(a\), such as \(a\)'s position \(\vec{p}_a\) or one of its vertex attributes \(\vec{h}_a\). And so on with \(\beta\) for \(b\)'s quantities and \(\gamma\) for \(c\)'s quantities.

Remember how we divided by the clip-space \(w\)-component, but I said we needed to keep it around? This is why. Ordinary attributes \(\vec{h}\), such as the texcoord or normal, get interpolated using the following formula:

\[ \vec{h} = \frac{ \alpha \,\vec{h}_a / w_{a,\text{clip}} + \beta \,\vec{h}_b / w_{b,\text{clip}} + \gamma \,\vec{h}_c / w_{c,\text{clip}} }{ \alpha / w_{a,\text{clip}} + \beta / w_{b,\text{clip}} + \gamma / w_{c,\text{clip}} } \]

That is, we interpolate \(\vec{h}/w_\text{clip}\) and \(1/w_\text{clip}\) from each vertex to the fragment location, then divide the former by the latter to get the value at the fragment. This results in perspective-correct interpolation[14]. We can factor out the common factors to make it more efficient:

\[ \vec{h} = \left( \frac{1}{ \alpha / w_{a,\text{clip}} + \beta / w_{b,\text{clip}} + \gamma / w_{c,\text{clip}} } \begin{bmatrix} \alpha / w_{a,\text{clip}} \\ \beta / w_{b,\text{clip}} \\ \gamma / w_{c,\text{clip}} \end{bmatrix} \right) \cdot \begin{bmatrix} \vec{h}_a \\ \vec{h}_b \\ \vec{h}_c \end{bmatrix} \]

This is better because the parenthesized vector can be reused for any number of vertex attributes. You can think of this as calculating perspective-corrected barycentric coordinates, and then doing the interpolation in eye-space.
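A minimal sketch of that interpolation for a single scalar attribute, assuming the fragment's barycentric coordinates and each vertex's clip-space \(w\) are already known (vector attributes work the same way, componentwise):

    // Sketch: perspective-correct interpolation of one scalar attribute at a fragment.
    // alpha/beta/gamma: the fragment's barycentric coordinates in screen space.
    // ha/hb/hc: the attribute's value at vertices a, b, c.
    // wa/wb/wc: the vertices' clip-space w components (positive after clipping).
    float interpolatePerspective(float alpha, float beta, float gamma,
                                 float ha, float hb, float hc,
                                 float wa, float wb, float wc) {
        float a = alpha / wa, b = beta / wb, c = gamma / wc;
        float norm = 1.0f / (a + b + c);   // the parenthesized factor from the formula above
        // (a*norm, b*norm, c*norm) are the perspective-corrected barycentric coordinates,
        // and can be reused for every attribute of this fragment.
        return (a * ha + b * hb + c * hc) * norm;
    }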


Understanding Depth

Depth needs to be interpolated too, but the method is different[14]. Before we do that, we need to first take a step back and understand what's going on. The viewport transform from NDC to screen-space for depth is just:

\[ z_\text{screen} = \frac{1}{2} z_\text{ndc} + \frac{1}{2} \]

This just remaps NDC \([-1,+1]\) to screen-space \([0,1]\). In turn, NDC is calculated from clip-space by the perspective divide, as we saw in the previous section:

\[ z_\text{ndc} = \frac{z_\text{clip}}{w_\text{clip}} \]

In turn, clip-space \(z_\text{clip}\) and \(w_\text{clip}\) are calculated from the eye-space \(z_\text{eye}\) and \(w_\text{eye}\) (\(=1\)) in the following way (this follows immediately from multiplying an eye-space vertex by the definition of the projection matrix). Here, \(n\) and \(f\) are the near and far clipping planes, and we're assuming a perspective projection:

\begin{align*} z_\text{clip} &= -\frac{f+n}{f-n} z_\text{eye} - \frac{2 f n}{f - n} \\ w_\text{clip} &= -z_\text{eye} \end{align*}
Figure 6: Screen-space depth (\(z_\text{screen}\)) vs. eye-space \(z\)-component (\(z_\text{eye}\)). Distance ahead of the camera increases from right to left. Notice the nonlinear distortion; depth is not a spatial coordinate!

Thus, putting it all together, the screen-space depth can be related to eye-space depth as (again, for a perspective projection):

\begin{align*} z_\text{screen} &= \frac{1}{2}\left(\frac{2 n f}{f-n} \cdot \frac{1}{z_\text{eye}} + \frac{f+n}{f-n}\right) + \frac{1}{2} \\ &= \frac{f}{f-n} \left( \frac{n}{z_\text{eye}} + 1 \right) \end{align*}

You can check this—if you put in \(z_\text{eye}=-n\) (remember that in eye-space we're looking down the negative \(z\)-axis, so negative numbers are in front of the camera), then you can calculate that you get out \(z_\text{screen}=0\) for screen-space, the nearest value. Similarly, if you put in \(z_\text{eye}=-f\), you get out \(z_\text{screen}=1\), the farthest. See Figure 6.

Pay attention to the form of the equation, though. The screen-space depth \(z_\text{screen}\) is inversely related to the eye-space depth \(z_\text{eye}\)! This is very counterintuitive. However, there is a good reason for it: the precision of the depth buffer gets 'concentrated' nearer to the eye. This is useful because closer objects get more detail[15].


Interpolating Depth

Another reason is that interpolating depth becomes very simple. Although eye-space depth does not vary linearly across the screen, its reciprocal does, and screen-space depth is an affine function of \(1/z_\text{eye}\), so screen-space depth itself can be interpolated linearly. Interpolating the depth (again, vertices \(a\), \(b\), and \(c\), with corresponding barycentric coordinates \(\langle\alpha,\beta,\gamma\rangle\)) is thus as simple as:

\[ z_\text{screen} = \alpha \,z_{a,\text{screen}} + \beta \,z_{b,\text{screen}} + \gamma \,z_{c,\text{screen}}\hspace{1cm}\text{(correct interpolation)} \]

This usually causes some confusion, which should hopefully already be somewhat alleviated by the above, but let's address it explicitly. It is quite common[16] to say that to get depth \(z^*\), you should interpolate \(1/z\) and then take the reciprocal (where it's not clear what space \(z\) and \(z^*\) are in):

\[ z^* = \frac{1}{\text{interpolate}( 1 / z )}\hspace{1cm}\text{(common)} \]

This is somewhere between a simplification and a misunderstanding. With a simpler projection matrix, we might have that \(z_\text{screen}=1/z_\text{eye}\) exactly. This is actually part of the legitimate derivation of the interpolation for depth[14]. Then, if we take the reciprocal of this, we get physical (eye-space) depth back. That is, depth is the reciprocal of the interpolated reciprocal of \(z_\text{eye}\). However, screen-space depth \(z_\text{screen}\) is not defined to be physical depth; it is actually inversely related (on purpose, for better precision), and it's also not an exact reciprocal relationship. So we could say:

\[ \text{depth} \approx \frac{1}{z_\text{screen}} \approx \frac{1}{\text{interpolate}( 1 / z_\text{eye} )}\hspace{1cm}\text{(sortof)} \]

Figure 7: Lighting components (top to bottom: ambient, diffuse, specular, combined) used in a basic lighting model. Better lighting models exist, but this one is simple and time-honored. (Adapted from image source.)

Fragment Stage

Here, the fragments are prepared for entry into the framebuffer. This basically involves computing the final values for the fragment's attributes.

The most obvious fragment stage operation is to determine a color for the fragment. Normally, the fragment shader you provided in the application stage runs here, once for each fragment[17]. Computing the color usually involves running a local lighting model (i.e., a virtual material, defined through textures, lit by one or more virtual lights).

The output of this stage is a fragment with more limited information—intuitively, just a color, depth, and location. You can have multiple colors, alter the depth, and mess with the stencil buffer or GPU memory, but most of the time your fragment shader only cares about computing one color (given the complexity of some material models, that alone can be difficult enough!).


A basic and common model, popular since the early days of computer graphics[18], is to break the lighting model into three terms: ambient, diffuse, and specular (see Figure 7).

To understand ambient, imagine standing in a dark room, shining a flashlight at the floor. If you look up, you will still see the ceiling, even though no light from the flashlight is directly hitting it. This indirect light turns out to be very difficult to compute. The ambient term is basically just a constant that gets added to try to fake it instead. Usually, it's computed as an ambient color (per light) times an ambient color (per object).

To understand diffuse, imagine shining your flashlight at a sphere. The place on the sphere that's facing the flashlight will be brightest. The places farther away are tilted away from the light, meaning the light is more spread out in those areas, and so darker[19]. This is called the diffuse term, and the simplest model is Lambertian, which you compute as the dot-product of the surface normal and the light direction.

To understand specular, imagine shining your flashlight at a plastic sphere. You'll notice a bright spot of light on the surface of the sphere. If you keep the flashlight and sphere in one place, but move your head around, you'll see that the location of the spot changes. This spot is at the location where light from the flashlight bounces directly, or nearly directly, off the surface and into your eye. Intuitively, you're looking at an imperfect reflection of the light in the sphere. The spot is called a specular highlight, and its presence is accounted for by the specular term.
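A hedged sketch of such a per-fragment computation for one light (this is essentially the classic Blinn-Phong formulation; the Vec3 type with dot(), normalize(), and componentwise arithmetic is assumed, and all direction vectors are unit length in a common space):

    #include <algorithm>
    #include <cmath>

    // Sketch: ambient + diffuse + specular shading for one light (Blinn-Phong style).
    Vec3 shade(Vec3 N,                // surface normal
               Vec3 L,                // direction from the surface point toward the light
               Vec3 V,                // direction from the surface point toward the eye
               Vec3 lightColor,
               Vec3 ambientColor, Vec3 diffuseColor, Vec3 specularColor,
               float shininess) {
        Vec3 ambient = ambientColor * lightColor;          // constant fake of indirect light

        float ndotl  = std::max(0.0f, dot(N, L));          // Lambertian term, clamped
        Vec3 diffuse = diffuseColor * lightColor * ndotl;

        Vec3 H        = normalize(L + V);                  // half-vector
        float ndoth   = std::max(0.0f, dot(N, H));
        Vec3 specular = specularColor * lightColor * std::pow(ndoth, shininess);
        // (A fuller version would also zero the specular term when ndotl == 0.)

        return ambient + diffuse + specular;
    }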


Framebuffer Stage

The new fragment must now determine whether (and if so, how) its information should be combined with the data at the corresponding pixel in the framebuffer.

This is accomplished by a series of tests the fragment may have to pass. Most tests are based on comparing the fragment's information to the corresponding values stored in the framebuffer. I'll describe how these tests usually work, though it's worth noting they can be configured to act differently. If a fragment passes all tests, its data is written into the framebuffer. Fragment data can also be combined with the existing data, for example when a partially transparent color alpha-blends over the pre-existing pixel.
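For example, with the most common blending configuration, the color written back is a weighted mix of the incoming fragment color \(C_\text{src}\) (with alpha \(\alpha_\text{src}\)) and the color already in the framebuffer, \(C_\text{dst}\):

\[ C_\text{out} = \alpha_\text{src} \, C_\text{src} + (1 - \alpha_\text{src}) \, C_\text{dst} \]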

There are a number of different buffers within the framebuffer. Their names simply refer to the data they store. For example, the color buffer stores the colors of each pixel. The depth buffer (or z-buffer) stores depths. It is the collection of these individual buffers that comprises a framebuffer.


First comes the pixel ownership test. Intuitively, this just checks whether the GPU should draw here. For example, it shouldn't draw a pixel if it's in a window that's behind another window.

Next comes the less common scissor test, which basically allows you to crop your rendering to a certain region.

Next comes the less common stencil buffer test. The stencil buffer is a buffer that stores a per-pixel integer (usually 8-bit) that can be both manipulated by rendering (e.g. if a fragment renders here, increment) and tested (e.g. if the value is less than 2 here, reject the fragment).

In OpenGL 3.0 and below, next came a test called the alpha test, which discarded the fragment if the fragment's color's alpha was below a given threshold. This has been removed in more recent APIs (it's easy to implement yourself in the fragment shader), but is mentioned here since it was not uncommonly used.

Next comes the extremely important and common depth test, which compares the screen-space depth of the fragment to the depth currently stored at the pixel, thus solving the painter's problem of determining which surface is visible. The depth buffer is initially cleared to the maximum value (i.e. \(1\)). Fragments from new polygons come in, reducing the stored value as they pass. When a fragment comes in with a depth that is greater than the depth already stored in the depth buffer, the new fragment is farther away than (and so behind) the surface that's already drawn at that pixel, and so is rejected.
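Here is a minimal sketch of this default-style depth test as it might appear in a software rasterizer (not taken verbatim from the sample code):

    // Sketch: default (less-than) depth test for one incoming fragment.
    // depthBuffer is cleared to 1.0f (the farthest value) at the start of each frame.
    bool depthTestAndWrite(float* depthBuffer, int width, int x, int y, float fragDepth) {
        float& stored = depthBuffer[y * width + x];
        if (fragDepth < stored) {      // closer than what is already there?
            stored = fragDepth;        // pass: record the new, nearer depth
            return true;               // the fragment's color may now be written/blended
        }
        return false;                  // fail: the fragment is hidden; discard it
    }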


Although these tests all semantically run after the fragment shader executes, some of them can be run before, because the fragment shader does not alter them. These early fragment tests are an important optimization because fragment shading is often expensive, and if any test fails, we don't have to run the shader.

The pixel ownership test always runs before. The stencil test can run before if the shader is known not to change the stencil value, and the same goes for the depth test and the depth value. If the alpha test exists and is enabled, it can force subsequent early fragment tests to be disabled, because the alpha test really does depend on the fragment shader having been executed.


Screen Stage

This is where we finally get to see what happened. The framebuffer contains color values, and these are drawn to the screen. At its most basic, this is the most straightforward stage, because it just involves scanning out the data from the framebuffer to the display hardware.

The display hardware refreshes at a certain rate, usually around 60 Hz for most monitors. The GPU may render faster or slower than this, which introduces some issues we need to resolve.

The first is that we can't try to display a framebuffer that the GPU is rendering to. Suppose you're rendering a teapot in a room. First the GPU clears the framebuffer, then draws the polygons of the room, then the polygons of the teapot. However, if we tried to display that while the GPU was working on it, some pixels might be black (because the framebuffer has just been cleared), some pixels might show the room without a teapot, some might show the room or teapot partially drawn, and some might actually be right.

To address this, most renderers set up double-buffered rendering[20]. Instead of one framebuffer, you have one framebuffer (the front buffer) that the display can read, and another framebuffer (the back buffer) that the GPU is rendering to. When the GPU finishes, the framebuffers can be swapped (a buffer swap). The display will start reading from the former back buffer, which now has a fully completed image, and the GPU can start rendering the next frame onto the former front buffer, starting the cycle anew.

Although this is significantly better in that the display only ever reads from a completed frame, the buffer swap can still happen partway through the display scanning out the image, meaning that the display will show part of one frame and part of the next. If the frame has a significant amount of motion, you can get an objectionable effect called screen tearing.

We can solve this by making the buffer swap wait for the display to finish drawing the image; this is called vertical synchronization (VSync). However, because the buffer swap is delayed, the GPU can't start rendering the next frame right away.
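As a concrete (hedged) example using the GLFW windowing library mentioned earlier, a double-buffered render loop typically ends each frame with a buffer swap, and VSync is requested with a swap interval of 1:

    // Sketch: double-buffered render loop with GLFW ('window' is a GLFWwindow*, creation omitted).
    glfwSwapInterval(1);                       // request VSync: swaps wait for the display refresh

    while (!glfwWindowShouldClose(window)) {
        // ... issue draw calls into the back buffer ...

        glfwSwapBuffers(window);               // buffer swap: back buffer becomes the front buffer
        glfwPollEvents();
    }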

There are more sophisticated approaches such as triple buffering (the GPU can keep working on new frames) and there are nuances to VSync.


Conclusion

At this point, you should know basically how the graphics pipeline produces an image from a bunch of triangles. However, we've glossed over some practical details and said almost nothing about actually commanding the GPU's hardware graphics pipeline using a graphics API like OpenGL.

For the 'practice' counterpart to 'theory', I again offer a sample CPU rasterizer. It produces a rasterized image (Figure 1) using just simple C++ code. You can follow the whole pipeline through in the code.

For the graphics API side, which is how you would write a practical renderer today, there are lots of resources and approaches—the best of which is hard work and experimentation. Please go explore the world of OpenGL tutorials available on the internet! I provide a few tutorials myself on the parent Resources Page. And, although it's not a tutorial, my PyGame+PyOpenGL Starter Code will get an OpenGL-powered triangle or two on any Python programmer's screen in a jiffy.

I hope you found this useful, and best of luck in your graphics journey!


Notes

Figure 8: A quadrilateral can be triangulated one of two ways: although the vertices are the same, the rendered shape changes (image source).