Thursday, March 14, 2013

Understanding texture compression - 01 - History & overview

I have achieved significant progress on the engine part and even some gameplay, progress that I'll be slowly showing using brand new higher production value videos, but I'm far too lazy to create such a video during the week. I'll try to do a first one during the weekend. Also, I am kind of overworking myself and I should take it easy and make sure not to burn out on development.

In the meantime, I'll write a short series of articles on texture compression. This domain of computer graphics can be incredibly complex and confusing and I'll be writing these articles as I am learning the ropes myself, so please excuse any mistakes I might make or poorly researched information.

So what is texture compression? Before we answer that, let's go more general: since textures are images in a format that is meant for very specific hardware to access (in our case PC GPUs), what is image compression? Traditionally you use image compression to reduce the size of an image when saved on disk. That's it. That's the primary motivation. Current storage hardware is bigger and faster than ever, but you still can't ignore image compression. Back in the day, a 32 bit 640x480 (VGA) image occupied 1.17 MiB, which was quite a lot of memory for the hardware that was available then. A 1920x1080 (1080p) image occupies 7.91 MiB. The width is 3 times larger and the height is 2.25 times larger, so the area is 6.75 times larger, and if you do the math the sizes match. Today, in early 2013, a Samsung Galaxy S II is a quite common and still great phone, but definitely last gen. It takes pictures with a resolution of 3264x2448, so an uncompressed picture takes up 30.48 MiB. This would eat up the relatively small on board storage quite quickly. But this doesn't happen because no one uses uncompressed images. The above mentioned images occupy hundreds of KiB or a few MiB, depending on compression and quality settings. Image compression is even more important in the case of video, especially with upcoming technology. A 4K 3D video running at 48 FPS that would allow you to see The Hobbit part 3 on your future tech TV would have serious problems today, because it can't fit comfortably on a single optical disc and has a bitrate so high that you can't stream it over the Internet (like on Netflix or something).
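If you want to check those sizes yourself, here is a minimal sketch of the arithmetic. The function name is just something I made up for illustration, and it assumes 32 bits per pixel with no padding or headers.

```python
def uncompressed_mib(width, height, bits_per_pixel=32):
    """Size of a raw, unpadded image in MiB (1 MiB = 1024 * 1024 bytes)."""
    size_in_bytes = width * height * (bits_per_pixel // 8)
    return size_in_bytes / (1024 * 1024)

resolutions = {
    "VGA": (640, 480),
    "1080p": (1920, 1080),
    "8 megapixel photo": (3264, 2448),
}

for name, (w, h) in resolutions.items():
    print(f"{name}: {uncompressed_mib(w, h):.2f} MiB")
```

This should reproduce the 1.17, 7.91 and 30.48 MiB figures from the paragraph.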

That was a fully unnecessary, overlong and yet too basic introduction. Back to the subject. In the case of texture compression you don't care about the space the file occupies on disk (but you get reduced disk space as a bonus). Instead, you care about occupied video memory. But there is an even more important benefit to compression: reduced bandwidth usage and better cache behavior. If you compress your 4 MiB image to 1 MiB, your GPU will access it faster, even if some form of decompression is needed for each individual access.

Taking this into consideration, several things that can be considered compression in general terms are not texture compression. Here are a few important conditions that must be satisfied and behavior the GPU will have:
  • Texture compression is GPU oriented, so the GPU must receive the raw compressed data. If you decompress your image on the CPU before sending it to the GPU, you may be using less disk space, but you get no benefit from texture compression.
  • The GPU will not decompress and cache the full data once it has received it. Doing that would void the reduced memory consumption and bandwidth advantages. So the GPU stores the texture in the compressed format and uses it like that when sampling.
  • Texture decompression must be fast. Since the GPU accesses the raw compressed data when it samples a texture, this process must be fast.
  • The GPU needs fast constant time random access to any point of the texture. So streaming compression, where on each access the image must be decompressed into a temporary memory location until the desired pixel is reached, is out of the question.

So it seems that creating such a compression format is no easy task. Something like a simpler JPEG compression can't be used. The scheme must be a lot simpler but still give good compression results. A 10% reduction in size is not good enough to outweigh the cost of decompression.

A long time ago, in a galaxy far away, a company called S3 Graphics that used to produce graphics chips laid the foundation of a compression scheme that is both in use today and was the foundation for other techniques developed in the meantime. They developed the S3 Texture Compression algorithm (S3TC for short) and presumably only their graphics chips could decompress from this format. It was a block based algorithm. The image was split into chunks of 4x4 pixels. There were multiple variants of the method, each suited to a different purpose, but the main idea was the same: you would store two key colors with a high bit depth, and each pixel in the block would be approximated by a low bit depth value that places it relative to those key colors. This was based on the observation that over small surfaces there is generally a small change in color in most images. But why does this give a high compression ratio? A 4x4 block has 16 unique pixels, so at 32 bits per pixel it would consume 64 bytes. How do you encode 64 bytes in far fewer bytes, using an algorithm that is fast to decompress? Well, you can't. Not unless you use lossy compression. And S3 chose a very lossy scheme. I'll detail all the schemes soon, but for now it is enough to mention that this compression would always result in a fixed compression ratio of 1:8 or 1:4. That's right, the 64 byte block would be compressed as an 8 byte block (or a 16 byte one in the 1:4 case). Needless to say, this worked on some images better than others and there are tons of cases where you shouldn't use this compression.

The block structure nicely satisfies the conditions for GPU decompression. It has fast constant cost random access because for a coordinate you can easily compute the block location. Decompression is very fast because for a block the decompression algorithm is a fixed set of arithmetic operations without any branching. It also takes advantage of a very common scenario: when rendering a textured polygon, a texture sample operation will almost always be followed by another texture sample operation for a nearby texel (a texel is a texture "pixel"). This meant that decompressing the block and storing the result in cache would greatly improve performance and would have a very low rate of cache misses.
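To make the "easily compute the block location" part concrete, here is a minimal sketch. It assumes the blocks are stored one after another, row by row, which is my simplification: real GPUs may rearrange blocks in memory in their own way. The function names are made up for illustration.

```python
BLOCK_DIM = 4  # blocks are 4x4 texels

def block_offset(x, y, width, block_bytes=8):
    """Byte offset of the block holding texel (x, y), assuming row by row block order."""
    blocks_per_row = (width + BLOCK_DIM - 1) // BLOCK_DIM  # round up for odd widths
    block_index = (y // BLOCK_DIM) * blocks_per_row + (x // BLOCK_DIM)
    return block_index * block_bytes

def texel_in_block(x, y):
    """Position of texel (x, y) inside its own 4x4 block."""
    return x % BLOCK_DIM, y % BLOCK_DIM

# Texel (5, 0) of a 256 texel wide texture lives in the second block of the first row.
print(block_offset(5, 0, 256), texel_in_block(5, 0))
```

No matter how large the texture is, finding the block takes the same handful of integer operations, which is exactly the constant time random access the GPU needs.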

So texture compression seems very advantageous, even with the reduced visual quality. Especially since video memory was very scarce when it was introduced. Today you can easily buy a GPU with 2 GiB of on board GDDR5 RAM, so memory consumption is less of an issue, but memory bandwidth is still as important as ever. Probably even more important than it was, because RAM is falling behind and, when compared to the instruction execution speed on modern CPUs/GPUs, memory access is a performance bottleneck.

But what use was this method if it only worked on S3 chips? Especially since S3 is no longer producing such chips? Well, other chips/APIs started adding reliable support for texture compression and paying royalties to S3. And this is the last thing I'll mention about S3 because I probably got the entire history part messed up and S3 will try and sue me.

OpenGL added general support for compressed textures in version 1.3, and S3TC itself is exposed through the "EXT_texture_compression_s3tc" extension, which kept the S3TC name. I am not targeting OpenGL so I won't talk about it anymore. DirectX also adopted the technique starting with DirectX 6.0. Ahhh, I remember DirectX 5.0. It sucked :P! In a move completely atypical for Microsoft, they renamed it to DXT.

DXT came in 5 variants: DXT1, DXT2, DXT3, DXT4 and DXT5! I'll summarize the differences between them in the following table:

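Method | Components | Encodes | As | Premultiplied | Bytes
DXT1 | 3/4 | RGB, optional A | RGB(5:6:5), A(0)/A(1) | No | 8
DXT2 | 4 | RGBA | RGB(5:6:5), explicit A(4) | Yes | 16
DXT3 | 4 | RGBA | RGB(5:6:5), explicit A(4) | No | 16
DXT4 | 4 | RGBA | RGB(5:6:5), interpolated A(8) | Yes | 16
DXT5 | 4 | RGBA | RGB(5:6:5), interpolated A(8) | No | 16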

I love making HTML tables!

OK, now let's try to understand the table. In part two I will go into a lot of detail regarding the structure and implementation of each method, but really the information in the table is all you need.

DXT1 is the base of all methods and is the simplest, while DXT2-DXT5 are very similar in structure and build upon DXT1. The last column of the table gives the size of the block in bytes. Since an uncompressed block takes up 64 bytes, this means that DXT1 provides a 1:8 compression and uses 4bpp (bits per pixel). The rest of the methods provide a compression of 1:4 and use 8bpp. This is why compression gained traction: you are compressing a normally 32 bits per pixel image to a 4 or 8 bits per pixel image. In the case of 24bpp images that don't have an alpha channel, when compressed with DXT1 the ratio is 1:6.
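The numbers in that paragraph all follow from the block size, so here is a minimal sketch of the arithmetic. The helper names are mine and exist only for this example.

```python
TEXELS_PER_BLOCK = 4 * 4

def bits_per_pixel(block_bytes):
    """Bits spent per texel for a compressed 4x4 block of the given size."""
    return block_bytes * 8 / TEXELS_PER_BLOCK

def compression_ratio(block_bytes, source_bpp=32):
    """How many times smaller the compressed block is than the uncompressed one."""
    return source_bpp / bits_per_pixel(block_bytes)

print(bits_per_pixel(8), compression_ratio(8))     # DXT1 against a 32bpp source
print(bits_per_pixel(16), compression_ratio(16))   # DXT2-5 against a 32bpp source
print(compression_ratio(8, source_bpp=24))         # DXT1 against a 24bpp source
```

An 8 byte block works out to 4bpp and a 16 byte block to 8bpp, which gives the 1:8, 1:4 and 1:6 ratios mentioned in the text.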

Now that we understand the basic size difference, let's see what we actually encode. Images can have several channels and we traditionally work with images encoded in the RGB format, which has 3 channels: one for red, one for green and one for blue. These channels normally use 8 bits each, but in very specialized graphics processing they can use more. You can also have a fourth channel specifying the transparency of the pixels, the alpha channel. This fourth 8 bit channel creates the very common RGBA 32bpp pixel format. All DXT formats are four channel formats that encode RGBA, with the exception of DXT1, which is either opaque, having an alpha of 100% and encoding only 3 channels, or it can optionally encode RGBA, but then the alpha channel of a given pixel can only be 0% (fully transparent) or 100% (fully opaque).

Now that we know what is encoded, the question of how it is encoded remains: the fourth column in the table. All DXT methods encode the RGB components in the 5:6:5 format, meaning that green uses 6 bits, while red and blue use only 5. DXT1 uses 1 bit for alpha to signal 0%/100%. DXT2 and DXT3 use 4 bits per alpha, while DXT4 and DXT5 use 8 bits. This is where the interesting part starts: since DXT5 uses twice as many bits for alpha as DXT3, it should consume more memory. But if you look at the final column, they both use the same memory. This is because they store alpha differently. DXT2/3 use explicit alpha, each pixel having its own 4 bit component to store the value. DXT4/5 use interpolated alpha, using a scheme similar to DXT1 RGB compression: two key alpha values are stored at a high bit depth and each pixel stores only a low bit depth value that picks an alpha interpolated between them. So even though DXT5 has more bits per alpha, these values are not explicit. Each pixel does not have its own alpha, but an interpolated value.
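To get a feeling for what 5:6:5 does to a color, here is a minimal sketch of packing an 8 bit per channel color into 16 bits and expanding it back. Two things here are my assumptions: the packing simply drops the low bits (real encoders may round instead) and the expansion replicates the high bits into the low ones, which is a common way to do it but not the only one.

```python
def pack_565(r, g, b):
    """Pack 8 bit R, G, B into one 16 bit value: 5 bits red, 6 green, 5 blue."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def unpack_565(color):
    """Expand a 16 bit 5:6:5 value back to 8 bits per channel."""
    r5 = (color >> 11) & 0x1F
    g6 = (color >> 5) & 0x3F
    b5 = color & 0x1F
    # Copy the top bits into the freed low bits so that full white stays 255.
    return (r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2)

original = (200, 100, 50)
print(original, "->", unpack_565(pack_565(*original)))
```

The color that comes back is close to the original but not identical, and green survives a bit better than red and blue because it gets the extra bit.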

Let's skip what premultiplied means for now and give a few key guidelines and observations about these methods and which you should choose.

One key observation is that all 5 methods encode RGB data the same way and provide the same quality. So if you don't care about alpha values, you should always use DXT1 because it has the best compression ratio. This also has a downside: if your RGB only image looks like crap with DXT1, you can't switch over to DXT2-5 to get a better quality. The RGB encoding is the same across all methods. With one exception. Say you care about alpha, but one bit is enough and you use DXT1: the RGB encoding will look worse than DXT1 without alpha or DXT2-5. The extra alpha encoding reduces the RGB color space. So if your DXT1 looked bad, your DXT1 with alpha will look even worse.

Now let's start caring about alpha. If 1 bit is enough, consider DXT1. You will probably need to apply an alpha threshold in the pixel shader to compensate for some unwanted black borders, but it will work. But if the RGB quality drops in a disturbing way by adding the alpha bit, you can consider DXT2-5. And you must consider DXT2-5 if you need more than 1 bit.

And the rule here is very simple: DXT3 is good at images with sharp alpha changes while DXT5 is good at images with smooth alpha changes, like alpha gradients.
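Here is a rough sketch of why that rule holds. It is a simplified model and not a real encoder: the explicit version quantizes every alpha to 4 bits on its own, while the interpolated version only allows 8 evenly spaced levels between the smallest and largest alpha in the block. Picking the minimum and maximum as key values is my assumption (real encoders search for better ones, and the real format has more details), and all names are made up for the example.

```python
def explicit_4bit(alphas):
    """Quantize each 8 bit alpha independently to 4 bits, then expand back to 8 bits."""
    return [round(a / 17) * 17 for a in alphas]

def interpolated_8_levels(alphas):
    """Snap each alpha to one of 8 evenly spaced levels between the block's min and max."""
    lo, hi = min(alphas), max(alphas)
    palette = [round(lo + (hi - lo) * i / 7) for i in range(8)]
    return [min(palette, key=lambda level: abs(level - a)) for a in alphas]

def max_error(original, decoded):
    return max(abs(o - d) for o, d in zip(original, decoded))

smooth = [100 + i for i in range(16)]  # a gentle gradient inside one block
sharp = [0, 255, 20, 240, 60, 200, 100, 150, 10, 250, 30, 230, 70, 190, 110, 140]

for name, block in (("smooth", smooth), ("sharp", sharp)):
    print(name,
          "explicit:", max_error(block, explicit_4bit(block)),
          "interpolated:", max_error(block, interpolated_8_levels(block)))
```

In this toy model the interpolated scheme wins clearly on the smooth block, because its 8 levels are squeezed into a small range, while the explicit scheme wins on the sharp block, because every pixel gets its own value no matter what its neighbors do.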

And finally let's address the elephant in the room: premultiplied alpha. DXT2 and DXT4 use premultiplied alpha. This means that the alpha channel is encoded as is (like in DXT3 and DXT5 respectively), but the RGB data is considered to have been premultiplied with the alpha before encoding. So choosing DXT2 over DXT3 changes only the values of the RGB components. In practice it turned out that there was not a lot of use for premultiplied alpha. So little in fact that when the DXT reform occurred, these two methods were left out. So don't use DXT2/4 unless you have really good reasons for it.
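In case premultiplied sounds mysterious, here is a minimal sketch of what happens to a single 8 bit RGBA pixel before it would be encoded. The function name and the exact rounding are my own choices for illustration.

```python
def premultiply(r, g, b, a):
    """Scale 8 bit RGB by 8 bit alpha (rounded); the alpha itself is left as is."""
    def scale(channel):
        return (channel * a + 127) // 255
    return scale(r), scale(g), scale(b), a

print(premultiply(255, 128, 0, 128))  # half transparent orange
print(premultiply(10, 20, 30, 255))   # fully opaque pixels are unchanged
```

Only the RGB values change, which matches the point that DXT2 and DXT3 differ only in what the RGB components mean.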

The DXT reform renamed some methods and added a few more to solve some common problems.

DXT, while pretty good, is not 100% general. It gives poor visual results when used with a lot of photographic materials, very detailed textures, smooth gradients, large color variation, diagonal detail, a few very specific images where the block encoding aligns very badly with another blocky pattern resulting from the content of the image and... normal maps! Normal maps look absolutely horrible when compressed with DXT and give rise to a typical blocky bump mapping effect. Newer compression methods address some of these issues.

But people are clever! Long before the new methods were created and incorporated into newer consumer level hardware, people came up with ways to fix, at least partially, the shortcomings of DXT.

Let's take normal maps as an example. DXT is generally a 4 channel compression, but not enough bit depth is available for the 3 channels of a normal map, which needs very smooth transitions between normals that are meant to follow a surface. One clever trick is the so called DXT5n(m) (I'm not 100% sure if DXT5n and DXT5nm are the same format). What is DXT5n? It is DXT5! There are absolutely no differences between the two formats. Except for what you store in them. Instead of writing the 3 components of the normal into the RGB channels of the image, you move the red channel to the alpha, you keep the green in place and fill the now unused red and blue channels with the same color. The alpha and green channels have a higher bit depth, so saving becomes less lossy. Since DXT is based on saving differences from two key colors, filling red and blue with the same value minimizes unnecessary differences and creates better detail precision. The final component of the normal is computed in the pixel shader, since normals have unit length. The benefit of texture compression generally outweighs the extra cost of the third component calculation. This is a clever trick that can make more normal maps compressible with good results than DXT1, which generally fails to give good results. But we are saving only 2 channels in a format created for saving 4 channels. This method would greatly benefit from a compression format optimized to store only two channels with greater bit depth than DXT. Foreshadowing!
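The third component calculation is short enough to sketch. In a real engine this lives in the pixel shader, so the Python here only illustrates the math. It assumes the usual mapping of 0..255 to -1..1 and a tangent space normal whose third component is never negative, both of which are my assumptions.

```python
import math

def decode_dxt5n(alpha, green):
    """Rebuild a unit normal from the two stored channels (x in alpha, y in green)."""
    x = alpha / 255 * 2 - 1
    y = green / 255 * 2 - 1
    # Unit length means x*x + y*y + z*z = 1; clamp in case compression pushed x, y too far.
    z = math.sqrt(max(0.0, 1 - x * x - y * y))
    return x, y, z

print(decode_dxt5n(128, 128))  # almost straight up
print(decode_dxt5n(255, 128))  # tilted all the way toward +x
```

The clamp matters because the stored values are lossy, so x and y can end up slightly outside the unit circle.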

But normals are not the only thing that can be improved. How about plain RGB images? What do you do when DXT1 (and thus DXT2-5) gives poor results, full of artifacts and what not? You use another clever trick! Normal DXT1 is a 4bpp format and we want a comparable size but with greater visual quality. For this we first convert the image to YCbCr format: a luma component followed by blue difference and red difference chroma components. We save the luma in the green channel of a DXT1 texture. We encode the Cb and Cr into the alpha and green channels of another texture saved as DXT5. The first texture will already use the same storage space as our entire DXT1 image, that is, it will have 4bpp. And we still have a second texture that will be stored at 8bpp, for a total of 12bpp! Not to mention another sampling cost! Not a good idea. The trick here is to down-sample the CbCr texture so that under the new resolution it is effectively 2bpp, giving a total of 6bpp. We can even do another trick, sampling the second image at a lower mip-map level. While the memory consumption is still 6bpp, this will behave more like 4.5bpp. This improves quality a lot over DXT1 but is still not as great when DXT1 really doesn't like your input image. How great it would be if we could use a format optimized for saving 1 channel images and one for 2 channel images! More foreshadowing!
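Here is a minimal sketch of the two pieces of math in this trick: the color conversion and the bits per pixel bookkeeping. The conversion uses the full range coefficients known from JPEG, which is an assumption on my part since there are several YCbCr variants. The 4.5bpp figure is modeled as a chroma texture that is 4 times smaller on each axis, which is just one way to read it.

```python
def rgb_to_ycbcr(r, g, b):
    """Full range YCbCr (JPEG style coefficients) from 8 bit RGB."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def effective_bpp(luma_bpp=4, chroma_bpp=8, chroma_downscale=2):
    """Total bits per full resolution pixel when chroma is shrunk on both axes."""
    return luma_bpp + chroma_bpp / (chroma_downscale ** 2)

print(rgb_to_ycbcr(255, 0, 0))            # pure red: low Cb, high Cr
print(effective_bpp())                    # chroma at half resolution
print(effective_bpp(chroma_downscale=4))  # chroma one mip level lower still
print(effective_bpp(chroma_downscale=1))  # no down-sampling at all
```

With these defaults the totals come out as 6, 4.5 and 12 bits per pixel, the same numbers as in the paragraph.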

As you can see, DXT is not that hard to understand and master. The real challenge is to compensate for its weaknesses with all sorts of tricks!

As a final point, let's go over that DXT reform I mentioned earlier. More precisely, a DirectX change. DirectX is actually dead. For quite some time now! Out of inertia/misinformation it is still commonly referred to as DirectX, but the part that is actually evolving is Direct3D. Initially a sub-API of DirectX, Direct3D is the only rendering part that gets attention. The DirectX SDK hasn't been updated in quite a while, causing some unnecessary panic. How do you get the new versions of the Direct3D SDK? Well, the Direct3D SDK has been more or less silently incorporated into the Windows SDK. So anyway, Direct3D is evolving, and Direct3D changed a few things in the domain of compression.

It renamed DXT1 as BC1, DXT3 as BC2 and DXT5 as BC3. DXT2 and DXT4 were left out because of their low use.

From DirectX 6 to DirectX/Direct3D 10 new formats were introduced by different manufacturers. 3Dc+/ATI1 was created and is a block format very similar to DXT, but it only encodes 1 channel. 3Dc/ATI2 encodes 2 channels using a similar method. ATI1 became BC4 and ATI2 became BC5. Using BC5 for the normals encoding trick described a few paragraphs above gives the best quality compressed normal maps available, and BC4 and BC5 can be used for the two image YCbCr trick, again with great results.

Direct3D 11 added BC6 and BC7, two formats that are very complicated, but when used correctly they give extremely good results. Better than BC4/5. I will ignore them, especially since XNA is Direct3D 9.

So let's summarize in a new table:

Method | Components | Encodes | As | Old name | Bytes
BC1 | 3/4 | RGB, optional A | RGB(5:6:5), A(0)/A(1) | DXT1 | 8
BC2 | 4 | RGBA | RGB(5:6:5), explicit A(4) | DXT3 | 16
BC3 | 4 | RGBA | RGB(5:6:5), interpolated A(8) | DXT5 | 16
BC4 | 1 | 1 channel | (8) | ATI1/3Dc+ | 8
BC5 | 2 | 2 channels | (8:8) | ATI2/3Dc | 16

This article really didn't turn out the way I planned, but I'll go with it anyway. Part two will go into more detail regarding BC1-5.
