This blog post complements a presentation I gave at IGDATC. If you'd prefer to consume this in video form, you can watch it on YouTube or view my slides here.
🏗️ Under Construction! 👷
If you followed the link from my presentation, I haven't quite finished this blog post yet. Sorry! The conclusion isn't quite finished, and some of the diagrams are just a picture of my floatplane. Subscribe to my RSS feed to get notifications when it's fully done. :)
3D graphics requires a lot of math. Much of it is quite difficult, and is often considered to be black magic. Transform matrices are one of the most fundamental concepts, and I'm going to try to convince you that they are not black magic. In fact, I'm going to convince you that if you've seen a transform gizmo in a 3D engine before, then you already intuitively know what a transform matrix is, and even what each and every one of the numbers in it means! Neat!
Now compare that to the raw, 4x4 grid of numbers that make up a 3D transform matrix. It's... a little intimidating right? I mean it's pretty reasonable to think that this grid of numbers really is some sort of black magic.
On the other hand, the transform gizmo is an intuitive visual representation of a matrix, but it really is the same thing. When I "do matrix math" I think about the colorful gizmos or draw them on paper. I don't think about grids of numbers. That's the trick!
Why Bother?
It's certainly possible to do graphics without understanding transform matrices, so why bother? You might even think "I don't need to know transforms, my engine does." Perhaps you are just making a simple 2D game, so surely you don't need to know that, but humor me and suppose you are considering adding sprite shadows, reflections, or even some basic single point perspective to your "simple 2D game". Your fancy engine probably doesn't do any of that for you out of the box!
I think with a few visualizations you'll realize you understood transforms all along, and I think that will give you a gamedev superpower.
For now, let's consider the simple scene in the following image. To render the cube you need to perform a series of transformations to figure out what the cube looks like from the camera's point of view. GPUs work by drawing triangles, so that means we need to figure out where each vertex on the cube appears on the screen. Sure, they do fancy pants ray-tracing now too, but you still need geometry to shoot rays against.
The first step is to figure out where the cube is relative to some "absolute" coordinate system shared by both the cube and camera (represented by the gizmo in the lower left). The cube is moved to a specific place (translation is just addition), and maybe it's a certain size (scaling is just multiplication), and finally it's oriented a certain way (rotation is... more complicated). Now you know where the cube is relative to the absolute coordinate system. To see it from the camera's POV, you need to undo the camera's translation and rotation, but you also need a projection to map it to the screen. For example, is it an orthographic or perspective projection? (Or isometric, axonometric, single point perspective, etc...)
So maybe each one of those steps isn't so bad in isolation, but each one you add makes it more complicated. Then you have to repeat this entire chain of transforms for each and every vertex. This can get really expensive for 3D models with lots of vertexes. Since it's common to make transforms hierarchical, it only gets worse. Imagine placing a decoration on the cube's surface. You want it to move with the cube, but making it relative means you need to add even more transforms to the chain. This can get out of hand really quickly!
Ooops! You forgot about input...
If this already sounds complicated, consider that we've only talked about half of the problem so far. I've rarely worked on a game that didn't need to translate mouse input in screen coordinates back into the scene. To do that, you need to not only run the sequence of transforms in reverse, but also do the opposite of each individual transform.
What if there was a magic abstraction that we could use for all of these different transforms? Even better, what if it allowed us to combine a sequence of them into a single step, and even easily reverse the whole thing?
What is a Transform Matrix?
A transform matrix is a magic abstraction that you can use for many different types of transforms such as translation, rotation, scaling, skewing, projection, and more. Even better, it allows you to combine a sequence of them into a single step, and even easily reverse the whole thing. It doesn't even use black magic. Remember the colored arrows on that transform gizmo? It just stores those in (nearly) the most obvious way you can imagine. Let's jump right to a 2D example so you can see it in action.
(Don't worry if you don't fully understand this demo yet. Keep reading.)
If you drag the base of the arrows around, you can translate the sprite. If you drag the tips, then you can rotate, scale, stretch, skew, etc. Did you figure out what the numbers in the matrix mean? The first column is just the x/y measurements of the red arrow, the second column is the green arrow, and the final column is the position of the gizmo. There's also a final row that has some zeros and a 1 in it. I'll explain that later. (Most 2D graphics systems such as the JavaScript Canvas or CSS APIs ignore the last row.) That's pretty straightforward, right? 3D transforms have an extra row and column for the z-values, but otherwise are just as boring.
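Here's a minimal sketch of that layout in code. The helper name `gizmo_to_matrix` and the nested-array representation are my own illustration, not part of any particular API, and I'm assuming the y axis points up.

```javascript
// Hypothetical helper: pack a gizmo's arrows into the 3x3 layout described above.
// Each arrow becomes a column, and the bottom row is the constant [0, 0, 1].
function gizmo_to_matrix(x_axis, y_axis, origin){
  return [
    [x_axis.x, y_axis.x, origin.x],
    [x_axis.y, y_axis.y, origin.y],
    [0, 0, 1],
  ];
}

// A gizmo scaled 2x, rotated a quarter turn counter-clockwise, sitting at (5, 3).
const red_arrow = {x:0, y:2};    // the x axis now points up
const green_arrow = {x:-2, y:0}; // the y axis now points left
const m = gizmo_to_matrix(red_arrow, green_arrow, {x:5, y:3});
console.log(m);
```

Reading the result column by column gives you back the red arrow, the green arrow, and the gizmo's position.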
Visualizing Vector Arithmetic
Let's start with a crash course in vectors. For the purpose of gamedev, they are a list of numbers that you use to store 2D [x, y] or 3D [x, y, z] coordinates.
In order to talk about transforms, we need to know how to add and multiply vectors. Fortunately, this is extremely visual, so I can just make pictures instead of showing you a bunch of arithmetic.
Vector Addition
To add vectors visually, you just lay them down tip to tail. It doesn't matter which order you do it in. You can think of them forming sides of a box, and then the diagonal is the result.
If you want to do the raw arithmetic, just add the x values together and the y values together (and the z values too in 3D). So that's pretty straightforward too.
// Add vectors A and B to get C.
C.x = A.x + B.x
C.y = A.y + B.y
C.z = A.z + B.z
Vector Multiplication (by a scalar)
To multiply a vector by a scalar (a regular number), just duplicate the arrow that many times and line them all up. For fractional values, just make the last arrow shorter. Alternatively, you can think of it as stretching out the vector without changing its direction.
Similar to addition, when you want to do the raw arithmetic just multiply each of the x/y/z values. There are ways to interpret "multiplying" vectors together too, but that's a topic for another tutorial. ;)
// Multiply A by b to get C.
C.x = A.x*b
C.y = A.y*b
C.z = A.z*b
Coordinate Spaces
Now that you know how to do basic arithmetic on vectors, think about how you measure them. The easiest way to explain it is to "think about the grid". When you have graph paper, you count the x-value to the right, and then count the y-value up. This isn't the only way though. Pixel coordinates often measure the y-value in a downward direction instead, and this makes a lot of sense if you are working with text. You can also rotate a grid to line it up with something and still use it to measure something. Think of that grid as a visual representation of the space the coordinates exist in.
Try playing with the widget below. Can you flip it so it measures y-values downwards from the top left corner? Does it make sense to measure coordinates when the axes aren't at right angles? Can you set it up to look like isometric coordinates?
The point is that no matter how much you rotate, skew, or scale the grid, it's still perfectly valid to measure coordinates against. Follow the arrows and count the lines.
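"Follow the arrows and count the lines" is just the vector addition and scalar multiplication from earlier. Here's a rough sketch, where `grid_to_page` is a made-up name and the grids are example values I picked for illustration:

```javascript
// Walk coord.x steps along the grid's x axis, then coord.y steps along its y axis.
function grid_to_page(x_axis, y_axis, coord){
  return {
    x: coord.x*x_axis.x + coord.y*y_axis.x,
    y: coord.x*x_axis.y + coord.y*y_axis.y,
  };
}

const coord = {x:3, y:2};
// A pixel-style grid: x points right, y points down.
const on_pixel_grid = grid_to_page({x:1, y:0}, {x:0, y:-1}, coord);
// A skewed grid: the y axis leans to the right.
const on_skewed_grid = grid_to_page({x:1, y:0}, {x:0.5, y:1}, coord);
console.log(on_pixel_grid, on_skewed_grid);
```

The same coordinate [3, 2] lands in a different place on each grid, but the counting procedure never changes.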
Homogeneous Coordinates
There is a missing ingredient though. You can rotate, scale and skew the grid. This is fine if you just want to handle vectors that have directions in them. If you want to work with vectors that store positions, you need to be able to pan the grid to move it around. In the first demo, there was a handle that let you drag the whole gizmo around. Now we have a problem though: how do we know if a vector is storing a position or not? If we don't add the offset, our positions are stuck near the origin. On the other hand, adding the offset to our directions makes no sense.
The math trick to deal with this is to add an extra "w" coordinate on your vectors. When you want it to represent a direction, you set w = 0. When it represents a position, you set w = 1. You can think of it like a boolean flag for now, but it's actually a bit more than that. This is referred to as homogeneous coordinates. (I have no idea why...)
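To see the w "flag" at work, here's a small sketch. The function name `transform_vec3` is hypothetical, and I'm assuming a grid stored as an x axis `{a, b}`, a y axis `{c, d}`, and an offset `{x, y}`.

```javascript
// Multiply the grid's offset by w, so it only applies when w is 1.
function transform_vec3(m, v){
  return {
    x: v.x*m.a + v.y*m.c + v.w*m.x,
    y: v.x*m.b + v.y*m.d + v.w*m.y,
    w: v.w,
  };
}

// A grid that is scaled 2x and panned over to (10, 20).
const grid = {a:2, c:0, x:10, b:0, d:2, y:20};

const position = transform_vec3(grid, {x:1, y:1, w:1});  // picks up the offset
const direction = transform_vec3(grid, {x:1, y:1, w:0}); // ignores the offset
console.log(position, direction);
```

The position gets scaled and moved, while the direction only gets scaled. No special cases needed, just a multiplication by 0 or 1.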
Transforms
We now have all the ingredients we need to convert between coordinate spaces. That's a transform!
To be clear, there are two coordinate spaces in this diagram, but drawn in different ways. The grid is visualizing one coordinate space, while the gizmo is a second. We could draw gizmos and/or grids for both, but that would be super confusing! Another thing to consider is that we are sort of implicitly measuring the gizmo's vectors by the grid's coordinate space. Transforms only operate in one direction, and the transform we are portraying here converts from the gizmo to the grid. Think of this like how nesting hierarchical gizmos in your 3D engine or modeling program works. The gizmo is a child transform, and the grid is the parent.
Transform Matrices
So we can visualize a transform by drawing some arrows, and the raw arithmetic is just a bunch of additions and multiplications. Why do we need matrices if it's so basic? Abstraction! Let's look at a specific example and show the vector arithmetic.
There isn't any complex math happening here, but notice how the coordinate values are all mixed up with the gizmo vectors. It would be a little easier to think about if the transform vectors were all wrapped up in one place, and the vector it's transforming was in another. That's what the matrix form is for, to wrap the gizmo vectors into a single idea.
The matrix just separates the vectors for the transform from the vector you want to apply it to. Now instead of a transform being a series of operations, it's a thing by itself. You can reason about it, or even give it a name now! It's more about organization than math.
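Written out for the 2D case, the two forms look something like this. The symbols are my own labels for illustration: X and Y are the gizmo's axis arrows, O is its origin, and [x, y, w] is the vector being transformed.

```latex
\[
x \begin{bmatrix} X_x \\ X_y \\ 0 \end{bmatrix}
+ y \begin{bmatrix} Y_x \\ Y_y \\ 0 \end{bmatrix}
+ w \begin{bmatrix} O_x \\ O_y \\ 1 \end{bmatrix}
=
\begin{bmatrix} X_x & Y_x & O_x \\ X_y & Y_y & O_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ w \end{bmatrix}
\]
```

The left side is the vector arithmetic with everything mixed together. The right side is the same computation with the gizmo's arrows gathered into the matrix and the coordinates kept in their own vector.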
Linear Transforms
There are many kinds of "transforms" you can do, but the kind you do with a matrix like this is called a linear transform. That has a pretty specific meaning, but the short version is simple: Lines will stay lines, and parallel lines will stay parallel.
Examples of linear transforms include translation, scaling, rotation and skewing. (Strictly speaking, translation only counts as linear because of the extra w coordinate. Without it, mathematicians call the combination an affine transform.) These aren't even mutually exclusive. You can skew an object by rotating and then scaling it. In fact any sequence of linear transforms will also be a linear transform. The straight line requirement does rule out anything that causes an object to bend, twist, warp, curve, curl, etc.
Simplified Transforms
The visual representation of a transform can only carry you so far once you need to start writing code for it. I personally find it extremely valuable to break a transform down into a series of simpler transforms. Let the computer do some of your arithmetic for you. After all, the point of using matrices is that computing them is really cheap compared to applying them to a model with lots of vertexes.
For example, usually when you scale or rotate an object, you have a particular pivot point in mind. Maybe it's the anchor point of a sprite, or the point where an object sits on the ground. The first transform you want to do should move that point to the origin. Then you can apply your rotation and/or scaling, and translate the object back into place. It makes it much easier to visualize how the object will react, and the matrix forms of those kinds of transforms are trivial to write out too.
A really good example of this is map zooming with a mouse. Ideally you want to let the user zoom into the mouse's position, or at least the center of the screen. It seems like this shouldn't be hard, but I've struggled on multiple occasions to implement this with only raw arithmetic. Do you multiply in the scale before or after adding the offset... or both?! I've broken down every time and just implemented transforms on top of whatever interface was given to me. Instead, translate the position of the mouse to the origin, zoom, and then translate back. Much easier!
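Here's roughly what that looks like as code. This is a sketch rather than what any particular engine does: `mat2_mul` and `mat2_point` are the same helpers listed later in this post, while `mat2_translate`, `mat2_scale`, and `zoom_about` are names I made up for the example.

```javascript
function mat2_point(m, v){
  return {x: v.x*m.a + v.y*m.c + m.x, y: v.x*m.b + v.y*m.d + m.y};
}
function mat2_mul(m1, m2){
  return {
    a:m1.a*m2.a + m1.c*m2.b, c:m1.a*m2.c + m1.c*m2.d, x:m1.a*m2.x + m1.c*m2.y + m1.x,
    b:m1.b*m2.a + m1.d*m2.b, d:m1.b*m2.c + m1.d*m2.d, y:m1.b*m2.x + m1.d*m2.y + m1.y,
  };
}
function mat2_translate(tx, ty){ return {a:1, c:0, x:tx, b:0, d:1, y:ty}; }
function mat2_scale(sx, sy){ return {a:sx, c:0, x:0, b:0, d:sy, y:0}; }

// Zoom by `factor` while keeping the point under the mouse fixed.
function zoom_about(mouse, factor){
  const to_origin = mat2_translate(-mouse.x, -mouse.y);
  const zoom = mat2_scale(factor, factor);
  const back = mat2_translate(mouse.x, mouse.y);
  // Read right to left: move the mouse to the origin, zoom, then move back.
  return mat2_mul(back, mat2_mul(zoom, to_origin));
}

const mouse = {x:100, y:50};
const zoom_matrix = zoom_about(mouse, 2);
const fixed = mat2_point(zoom_matrix, mouse);            // stays under the mouse
const nearby = mat2_point(zoom_matrix, {x:110, y:50});   // moves twice as far away
console.log(fixed, nearby);
```

No head scratching over whether the scale goes before or after the offset. Each step is trivial, and the multiplication sorts out the rest.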
Order Matters!
The other benefit of breaking something into simpler transforms is that it makes the order explicit, and the order you apply them in matters!
In the example above, you can see that rotating and then translating gives you a different result when you swap the order. Remember that you are pivoting around the origin, not whatever you think might be the center of the ship. You can reformulate a different transform that rotates around the center of the ship in a single step, but it makes it more difficult to write down as a matrix. Do yourself a favor!
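You can check this numerically with a quick sketch. The matrices below are example values I chose (a quarter turn and a move of 10 units to the right), and the helpers are the same `mat2_mul` and `mat2_point` listed later in this post.

```javascript
function mat2_point(m, v){
  return {x: v.x*m.a + v.y*m.c + m.x, y: v.x*m.b + v.y*m.d + m.y};
}
function mat2_mul(m1, m2){
  return {
    a:m1.a*m2.a + m1.c*m2.b, c:m1.a*m2.c + m1.c*m2.d, x:m1.a*m2.x + m1.c*m2.y + m1.x,
    b:m1.b*m2.a + m1.d*m2.b, d:m1.b*m2.c + m1.d*m2.d, y:m1.b*m2.x + m1.d*m2.y + m1.y,
  };
}

const rotate = {a:0, c:-1, x:0, b:1, d:0, y:0};    // quarter turn counter-clockwise
const translate = {a:1, c:0, x:10, b:0, d:1, y:0}; // move 10 units to the right

// The transform applied first goes on the right.
const rotate_then_translate = mat2_mul(translate, rotate);
const translate_then_rotate = mat2_mul(rotate, translate);

const nose = {x:1, y:0}; // a point on the front of the ship
const result_a = mat2_point(rotate_then_translate, nose);
const result_b = mat2_point(translate_then_rotate, nose);
console.log(result_a, result_b);
```

Rotating first spins the ship in place and then moves it. Translating first moves the ship and then swings it around the origin in a big arc.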
Confusingly, sometimes the order doesn't matter. For example, if you step forward, then back you end up where you started. If you reverse the order, the path you take is different, but the result is the same. This can happen when you are working with transforms too, especially when you start with a simple example, and find out a more complex one doesn't work. My best advice to deal with this is mostly the same. Break it down into simpler transforms and make sure each one does what you think.
Matrix Multiplication
The main benefit of encoding transforms as a matrix is that you can combine a sequence of them into a single matrix. How does that work? When you put a point or a direction in a vector, you can transform it with a matrix. A transform gizmo is just an origin point and a direction for each axis. When you put a gizmo into a matrix, all you are doing is writing down its origin and axes as column vectors. So you can combine multiple transforms into one just by multiplying their matrix forms: transform each column of one matrix by the other. This doesn't really involve anything you haven't already seen.
Like with a sequence of transforms, the order you write the matrices down is very important. With gizmos you can imagine transforming from the gizmo's coordinate space to its parent's space. In matrix form, you write the matrix on the left, and the vector on the right. I like to visualize this as the vector moving to the left through the matrix to get to the result. Then when you have a sequence of them to apply, you can just follow the same "motion" through the equation.
For example: result = matrix * vector
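To make the "a matrix is just a gizmo written as columns" idea concrete, here's a sketch of matrix multiplication built only from transforming vectors. The name `mat2_mul_by_columns` is mine, and it should give the same answer as the `mat2_mul` helper listed later in this post.

```javascript
function mat2_direction(m, v){
  return {x: v.x*m.a + v.y*m.c, y: v.x*m.b + v.y*m.d};
}
function mat2_point(m, v){
  return {x: v.x*m.a + v.y*m.c + m.x, y: v.x*m.b + v.y*m.d + m.y};
}

// Treat the child matrix as a gizmo: transform its two axes as directions and
// its origin as a point, then write the results back down as columns.
function mat2_mul_by_columns(parent, child){
  const x_axis = mat2_direction(parent, {x:child.a, y:child.b});
  const y_axis = mat2_direction(parent, {x:child.c, y:child.d});
  const origin = mat2_point(parent, {x:child.x, y:child.y});
  return {
    a:x_axis.x, c:y_axis.x, x:origin.x,
    b:x_axis.y, d:y_axis.y, y:origin.y,
  };
}

const parent = {a:2, c:0, x:5, b:0, d:2, y:0};  // scale 2x, then move right by 5
const child = {a:0, c:-1, x:1, b:1, d:0, y:1};  // quarter turn, then move to (1, 1)
const combined = mat2_mul_by_columns(parent, child);

const v = {x:3, y:4};
const one_step = mat2_point(combined, v);
const two_steps = mat2_point(parent, mat2_point(child, v));
console.log(one_step, two_steps);
```

Applying the combined matrix once gives the same point as applying the child and then the parent.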
Rendering to the Screen
As a concrete example, let's go back to rendering a model to the screen. First you need to transform the model into absolute coordinates. (Let's call this the model transform.) Then we need to transform those coordinates relative to the camera. (Let's call this the view transform.) Finally we need to figure out how objects near the camera map to the screen. (Let's call this the projection transform.) This pattern used to be practically baked into rendering APIs like OpenGL and DirectX, and modern engines basically all still use it unchanged.
screen_pos = Projection*(View*(Model*vertex))
I added parentheses because I wanted you to remember that the vertex position moves to the left as it goes through the transforms and finally makes it to the screen position. This is inefficient though. The main selling point of matrices was supposed to be to avoid a long sequence of math per vertex, remember? While we can't change the order of the transforms, we are allowed to move the parentheses around and group all of the matrices together.
screen_pos = (Projection*View*Model)*vertex
Now we've precalculated the matrix from the model's coordinates all the way to screen coordinates, and can apply it all in a single step! Even better, we only have to calculate this matrix once per object and can apply it to thousands of vertexes. Just to reiterate one more time... even though the transform sequence looks backwards, remember that the vector moves to the left through the sequence.
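Here's a 2D stand-in for that idea. The model, view, and projection values are made up for illustration (a real 3D pipeline would use 4x4 matrices), and `mat2_mul` and `mat2_point` are the helpers listed later in this post.

```javascript
function mat2_point(m, v){
  return {x: v.x*m.a + v.y*m.c + m.x, y: v.x*m.b + v.y*m.d + m.y};
}
function mat2_mul(m1, m2){
  return {
    a:m1.a*m2.a + m1.c*m2.b, c:m1.a*m2.c + m1.c*m2.d, x:m1.a*m2.x + m1.c*m2.y + m1.x,
    b:m1.b*m2.a + m1.d*m2.b, d:m1.b*m2.c + m1.d*m2.d, y:m1.b*m2.x + m1.d*m2.y + m1.y,
  };
}

const model = {a:2, c:0, x:10, b:0, d:2, y:5};          // scale 2x, place at (10, 5)
const view = {a:1, c:0, x:-8, b:0, d:1, y:-4};          // camera at (8, 4), so shift the other way
const projection = {a:0.1, c:0, x:0, b:0, d:0.1, y:0};  // squash a 20x20 area into [-1, 1]

// Calculated once per object...
const mvp = mat2_mul(projection, mat2_mul(view, model));

// ...and applied in a single step to every vertex.
const vertexes = [{x:0, y:0}, {x:1, y:0}, {x:1, y:1}, {x:0, y:1}];
const screen_positions = vertexes.map(v => mat2_point(mvp, v));

// The slow way, for comparison: three transforms per vertex.
const slow_positions = vertexes.map(v =>
  mat2_point(projection, mat2_point(view, mat2_point(model, v))));
console.log(screen_positions, slow_positions);
```

Both lists come out the same (give or take floating point rounding), but the first one does a third of the work per vertex.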
There is one problem though...
Matrix Inverse
Did you notice that the arrow for the view transform points from the absolute coordinates to the camera? You can interpret that as the absolute coordinate space being measured by the camera, or alternatively that the absolute gizmo is a child of the camera gizmo. How would that work?!
Instead, how you would really implement this would be to have a camera transform. This would be measured relative to the absolute coordinates, or set up as a child gizmo. The camera transform would be the inverse of the view transform. In other words, if the camera's transform is the sequence where you rotate and then translate, then the view transform would untranslate and then unrotate. For a relatively simple sequence, this isn't so bad if you remember all the variables. In general you need a more generic solution though.
The solution requires algebra, and I promised I wouldn't talk about any complex math here. ;) If you took an algebra class in school, you kind of know how to solve this actually. "Two equations, two unknowns." Inverting a 3x3 matrix for 2D math actually isn't too bad, but by the time you get to a 4x4 matrix for 3D math there are quite a few steps. I don't think most people would benefit from knowing how this works. Just use the functionality from your engine, or get a math library to do it.
view = matrix.inverse(camera)
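As a 2D sanity check, here's a sketch comparing the "untranslate, then unrotate" version against a generic inverse. The camera values are made up, and `mat2_mul`, `mat2_point`, `mat2_det`, and `mat2_inv` are the helpers listed later in this post.

```javascript
function mat2_point(m, v){
  return {x: v.x*m.a + v.y*m.c + m.x, y: v.x*m.b + v.y*m.d + m.y};
}
function mat2_mul(m1, m2){
  return {
    a:m1.a*m2.a + m1.c*m2.b, c:m1.a*m2.c + m1.c*m2.d, x:m1.a*m2.x + m1.c*m2.y + m1.x,
    b:m1.b*m2.a + m1.d*m2.b, d:m1.b*m2.c + m1.d*m2.d, y:m1.b*m2.x + m1.d*m2.y + m1.y,
  };
}
function mat2_det(m){ return m.a*m.d - m.c*m.b; }
function mat2_inv(m){
  const inv_det = 1/mat2_det(m);
  return {
    a: m.d*inv_det, c:-m.c*inv_det, x:(m.c*m.y - m.d*m.x)*inv_det,
    b:-m.b*inv_det, d: m.a*inv_det, y:(m.b*m.x - m.a*m.y)*inv_det,
  };
}

// The camera transform: rotate a quarter turn, then translate to (8, 4).
const rotate = {a:0, c:-1, x:0, b:1, d:0, y:0};
const translate = {a:1, c:0, x:8, b:0, d:1, y:4};
const camera = mat2_mul(translate, rotate);

// By hand: untranslate, then unrotate.
const untranslate = {a:1, c:0, x:-8, b:0, d:1, y:-4};
const unrotate = {a:0, c:1, x:0, b:-1, d:0, y:0};
const view_by_hand = mat2_mul(unrotate, untranslate);

// The generic solution.
const view = mat2_inv(camera);

// The camera's own position should end up at the view's origin.
const camera_in_view = mat2_point(view, {x:8, y:4});
console.log(view_by_hand, view, camera_in_view);
```

The same inverse is what takes you the other way for input. If you have a matrix that goes from the scene to the screen, its inverse takes a mouse position on the screen back into the scene.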
Easy Mode: 2D Transforms
Almost every 2D drawing API I've ever seen tends to just expose raw 2D matrices to the programmer. Even the CSS and Canvas APIs in the web tech-stack do it. If my goal was to convince you that transforms and matrices aren't so bad, I'm especially here to convince you that 2D transforms are particularly easy.
In all the examples so far, I've been drawing out a full 3x3 matrix for 2D transforms, and the bottom row is always [0, 0, 1]. Hopefully this makes sense now, as the first two columns are the directions of the x and y axes. Directions have a w coordinate of 0. The last column is the position of the origin. Positions have a w coordinate of 1.
The matrix code I used on this webpage for example is nothing more than this:
With the exception of perspective projection matrices in 3D, the last row will always follow this pattern. Since perspective isn't even used in 2D rendering, it's entirely common to just drop it and bake it into code instead. Then you are left with the x axis as [a, b], the y axis as [c, d], and the origin as [x, y].
All of the basic transforms become very simple. As JavaScript code, translate is just:
{
a:1, c:0, x:translate_x,
b:0, d:1, y:translate_y,
}
Scaling is just:
{
a:scale_x, c: 0, x:0,
b: 0, d:scale_y, y:0,
}
Leaving rotation as the only mildly complicated one.
{
a:cos(radians), c:-sin(radians), x:0,
b:sin(radians), d: cos(radians), y:0,
}
Usually I find that I don't even need to use any trig functions. Remember, that's all just to get the direction the axes point in. For instance, if you are drawing an arrow moving along an arc, its rotation will just follow the direction of its velocity. So you can just normalize the velocity and use that instead! No trig required.
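A sketch of that trick might look like the following. The function name `mat2_from_velocity` is hypothetical, and it assumes the velocity is not zero (otherwise there is no direction to normalize).

```javascript
// Build a transform whose x axis points along the velocity. No trig required.
// Assumes velocity is non-zero, otherwise the division below produces NaN.
function mat2_from_velocity(velocity, position){
  const length = Math.hypot(velocity.x, velocity.y);
  const dir = {x: velocity.x/length, y: velocity.y/length};
  return {
    // The y axis is just the x axis turned a quarter turn counter-clockwise.
    a:dir.x, c:-dir.y, x:position.x,
    b:dir.y, d: dir.x, y:position.y,
  };
}

const arrow = mat2_from_velocity({x:3, y:4}, {x:10, y:20});
console.log(arrow);
```

Compare the columns with the rotation matrix above. The normalized direction is standing in for cos(radians) and sin(radians).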
I do keep one complex transform in mind for 2D though. The quintessential transform is when you scale a sprite, rotate it, then translate it into position. The matrix form of this is quite simple, and so I just use it directly:
{
a:cos(radians)*scale_x, c:-sin(radians)*scale_y, x:translate_x,
b:sin(radians)*scale_x, d: cos(radians)*scale_y, y:translate_y,
}
It makes sense to me as a rotation matrix, with the axes multiplied by the scale factors, and then translated. The rest of the code I use is just a couple dozen lines to apply, multiply, and invert the matrices:
function mat2_direction(m, v){
  return {
    x: v.x*m.a + v.y*m.c,
    y: v.x*m.b + v.y*m.d,
  };
}

function mat2_point(m, v){
  return {
    x: v.x*m.a + v.y*m.c + m.x,
    y: v.x*m.b + v.y*m.d + m.y,
  };
}

function mat2_mul(m1, m2){
  return {
    a:m1.a*m2.a + m1.c*m2.b, c:m1.a*m2.c + m1.c*m2.d, x:m1.a*m2.x + m1.c*m2.y + m1.x,
    b:m1.b*m2.a + m1.d*m2.b, d:m1.b*m2.c + m1.d*m2.d, y:m1.b*m2.x + m1.d*m2.y + m1.y,
  };
}

function mat2_det(m){return m.a*m.d - m.c*m.b;}

function mat2_inv(m){
  const inv_det = 1/mat2_det(m);
  return {
    a: m.d*inv_det, c:-m.c*inv_det, x:(m.c*m.y - m.d*m.x)*inv_det,
    b:-m.b*inv_det, d: m.a*inv_det, y:(m.b*m.x - m.a*m.y)*inv_det
  };
}
Other than the inverse function, there shouldn't be anything surprising here. Honestly, I didn't even bother to copy paste this, and just typed it out by visualizing the gizmos. That's it! That's all the code you need to do fancy 2D rendering from scratch!
Transform Tricks
Over the past 20 years, I've worked on several games where I found it invaluable to know how transforms worked. Hopefully this will inspire you to come up with some good tricks of your own.
Sprite Tricks
There's a bit of an (in?)famous interview with Iwata about the development of Zelda: A Link Between Worlds where he describes how they achieved the game's top down view while also including a perspective effect. He says that they tilted everything in the game to match the camera.
I see this image get referenced all the time in indie circles, but I have a hard time believing Nintendo actually created the game this way! What an absolute pain! Like it makes for a fun "how the sausage is made" story, but lots of other games have used a similar perspective too. Maybe you can see where I'm going with this, but all you need is a tiny bit of transform knowledge.
This is exactly what we did for our game, Verdant Skies. We stood all the sprites up normally, and then just modified the camera's view transform so the upward axis of the world matched the camera's. That's it! We even posted an interactive demo you can play with here.
We took it even further to draw the sprite shadows and reflections in the world too. For the reflections we just rendered the scene again, but told the camera to flip the axis upside down. To render shadows we rendered again and told the camera to lay the axis down on the ground and move it to match the sun's angle. We got all of these effects just by changing a column in a matrix.
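To give a feel for the "changing a column" idea, here's a simplified 2D sketch. To be clear, this is not the actual Verdant Skies code. The helper `with_y_axis`, the sprite values, and the shadow direction are all assumptions I made up for illustration.

```javascript
function mat2_point(m, v){
  return {x: v.x*m.a + v.y*m.c + m.x, y: v.x*m.b + v.y*m.d + m.y};
}

// Hypothetical helper: copy a matrix, but swap in a different y axis column.
function with_y_axis(m, y_axis){
  return {...m, c:y_axis.x, d:y_axis.y};
}

// A sprite standing upright with its feet at (10, 5).
const upright = {a:1, c:0, x:10, b:0, d:1, y:5};
// Reflection: the up axis is flipped upside down.
const reflection = with_y_axis(upright, {x:0, y:-1});
// Shadow: the up axis is laid over along the ground (an assumed sun direction).
const shadow = with_y_axis(upright, {x:0.5, y:0.25});

// Where does the top of a 2 unit tall sprite end up in each case?
const head = {x:0, y:2};
const head_upright = mat2_point(upright, head);
const head_reflected = mat2_point(reflection, head);
const head_shadow = mat2_point(shadow, head);
console.log(head_upright, head_reflected, head_shadow);
```

In all three cases the feet stay planted at (10, 5), because the origin column never changes. Only the column for the up axis is different.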
Portals
Around 2010, we worked on Phineas and Ferb: Transport-inators of Doooom! as a webgame that featured portals. To render them, it was just a matter of transforms. We used a second camera with a view matrix that mirrored the original camera's. We applied one portal's inverse model matrix, flipped the transform across the x-axis, and then applied the other portal's regular model matrix. Tada!
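Here's a 2D sketch of that sequence. It is not the original game's code. The portal placements are invented, and I'm reading "flipped across the x-axis" as negating the y coordinate, which is an assumption. The helpers are the same ones from the 2D section above.

```javascript
function mat2_point(m, v){
  return {x: v.x*m.a + v.y*m.c + m.x, y: v.x*m.b + v.y*m.d + m.y};
}
function mat2_mul(m1, m2){
  return {
    a:m1.a*m2.a + m1.c*m2.b, c:m1.a*m2.c + m1.c*m2.d, x:m1.a*m2.x + m1.c*m2.y + m1.x,
    b:m1.b*m2.a + m1.d*m2.b, d:m1.b*m2.c + m1.d*m2.d, y:m1.b*m2.x + m1.d*m2.y + m1.y,
  };
}
function mat2_det(m){ return m.a*m.d - m.c*m.b; }
function mat2_inv(m){
  const inv_det = 1/mat2_det(m);
  return {
    a: m.d*inv_det, c:-m.c*inv_det, x:(m.c*m.y - m.d*m.x)*inv_det,
    b:-m.b*inv_det, d: m.a*inv_det, y:(m.b*m.x - m.a*m.y)*inv_det,
  };
}

// Model matrices for two portals (made-up placements).
const portal_a = {a:1, c:0, x:-5, b:0, d:1, y:0};  // at (-5, 0), not rotated
const portal_b = {a:0, c:-1, x:5, b:1, d:0, y:2};  // at (5, 2), rotated a quarter turn
// Assumed meaning of the flip: mirror across the x axis by negating y.
const flip = {a:1, c:0, x:0, b:0, d:-1, y:0};

// Read right to left: into portal A's space, flip, then out through portal B.
const through_portal = mat2_mul(portal_b, mat2_mul(flip, mat2_inv(portal_a)));

const at_portal_a = mat2_point(through_portal, {x:-5, y:0});
const beside_portal_a = mat2_point(through_portal, {x:-5, y:1});
console.log(at_portal_a, beside_portal_a);
```

Anything sitting right on portal A comes out right on portal B. Multiply this into the original camera's matrix and you have a second camera that looks out of the other portal.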
2D Shadows
In 2009, we released a game for the original iPhone that featured realtime 2D shadows. This was before mobile phones had GPUs that could run shaders. How did we do that? Transforms of course! I actually have a whole blog series on that topic if you want to know more.
🏗️ Under Construction! 👷
More Transforms!
Color Transforms
Reprojection
Inverse Kinematics
Going Further
Perspective Divide
Shadow Mapping
Infinite Geometry
Stereo Rendering
Computer Vision