{"id":901,"date":"2024-11-11T14:27:46","date_gmt":"2024-11-11T22:27:46","guid":{"rendered":"https:\/\/tellusim.com\/?p=901"},"modified":"2024-11-11T15:28:21","modified_gmt":"2024-11-11T23:28:21","slug":"gpu-encoder","status":"publish","type":"post","link":"https:\/\/tellusim.com\/gpu-encoder\/","title":{"rendered":"GPU texture encoder"},"content":{"rendered":"<p>Creating fast, real-time 3D applications always involves balancing quality and performance, especially when targeting platforms without top-tier GPUs. One of the main bottlenecks in these scenarios is memory throughput, which can significantly impact performance. The amount of texture data used directly affects memory bandwidth, and hardware texture compression helps alleviate this issue by reducing the required memory bandwidth and footprint.<\/p>\r\n\r\n<p>All GPUs support block compression formats; however, there is no universal standard that works seamlessly across all platforms. Currently, different GPUs support three major compression formats:<\/p>\r\n\r\n<ul>\r\n<li><b>BC1-5<\/b> (also known as DXT or S3TC) &#8211; Supported by all desktop GPUs.<\/li>\r\n<li><b>BC6-7<\/b> &#8211; Supported by D3D11+ desktop GPUs.<\/li>\r\n<li><b>ASTC<\/b> &#8211; Supported by modern mobile GPUs.<\/li>\r\n<\/ul>\r\n\r\n<p>There are also older mobile compression formats that are still in use today, such as ETC, ETC2, EAC, ATC, and PVRTC. Unfortunately, using BC formats on mobile devices and ASTC on desktops is not feasible, necessitating different data packs for various platforms.<\/p>\r\n\r\n<p>Compressing textures to BC1-5 formats was relatively straightforward using either the CPU or GPU, as the encoding algorithm was simple. However, the introduction of BC6-7 increased the complexity due to additional compression modes, making the encoding process significantly slower. 
ASTC formats further increased complexity due to the vast number of modes and the use of integer coding with trits and quints.<\/p>\r\n\r\n<p>Compressing textures offline and shipping them with the project is common but often suboptimal, especially for dynamic or procedural textures. For instance, glTF and USD resources typically include embedded JPEG images to reduce asset size, while some algorithms generate procedural textures at runtime. In such cases, fast, real-time compression is necessary.<\/p>\r\n\r\n<p>Tellusim SDK provides real-time compression for BC1-5, BC6-7, and ASTC formats on all platforms. BC1 texture compression remains a viable option for PCs because BC formats do not have variable block sizes and BC1 is twice as compact as BC7, with excellent compression speed. A practical use case for BC1 compression is in real-time applications like <a href=\"https:\/\/youtu.be\/BXLtj0Wdl3Y\" target=\"_blank\" rel=\"noopener\">Google Maps<\/a> or <a href=\"https:\/\/youtu.be\/Da_rtamWIw4\" target=\"_blank\" rel=\"noopener\">XYZ tile<\/a> compression, where it helps minimize memory overhead and reduce compression stalls.<\/p>\r\n\r\n<p>Our SDK provides GPU encoders via the <i>EncoderBC15<\/i>, <i>EncoderBC67<\/i>, and <i>EncoderASTC<\/i> interfaces. Each encoder has its own flags and can be initialized for only the required formats, which matters because initial kernel compilation can take some time.<\/p>\r\n\r\n<p>The encoder input is a standard texture, and the output is an integer texture in <i>RGBAu16<\/i>\/<i>RGBAu32<\/i> format, with one pixel per compressed block. The application must copy this intermediate integer texture into the final block texture because writing directly to block-compressed formats from a compute kernel is typically unsupported. <i>RGBAu16<\/i> is required only for the BC1 and BC4 formats, whose blocks are 64 bits; all other formats use 128-bit blocks and <i>RGBAu32<\/i>.<\/p>\r\n\r\n<p>Integer textures cannot fully represent all required mipmap levels due to size truncation. 
The application must handle this itself, either by reducing the number of mipmaps being compressed or by increasing the size of the integer texture. The truncation occurs because the final 1&#215;1 mipmap level in the integer texture represents a 4&#215;4 (or 5&#215;4, 5&#215;5) pixel block, leaving no space for the smallest mipmaps (2&#215;2 and 1&#215;1).<\/p>\r\n\r\n<p>Below is an example of GPU ASTC 4&#215;4 texture compression using the Tellusim SDK:<\/p>\r\n\r\n<div class=\"ts-scroll mb-2\"><pre style=\"background-color: #000000\"><font color=\"#ffffff\"><font color=\"#80a0ff\">\/\/ texture format<\/font>\r\nFormat format = FormatASTC44RGBAu8n;\r\n\r\n<font color=\"#80a0ff\">\/\/ create intermediate image<\/font>\r\n<font color=\"#60ff60\"><b>uint32_t<\/b><\/font> width = src_texture.getWidth();\r\n<font color=\"#60ff60\"><b>uint32_t<\/b><\/font> height = src_texture.getHeight();\r\n<font color=\"#60ff60\"><b>uint32_t<\/b><\/font> block_width = getFormatBlockWidth(format);\r\n<font color=\"#60ff60\"><b>uint32_t<\/b><\/font> block_height = getFormatBlockHeight(format);\r\nImage dest_image = Image(Image::Type2D, FormatRGBAu32, Size(udiv(width, block_width), udiv(height, block_height)));\r\n\r\n<font color=\"#80a0ff\">\/\/ create intermediate texture<\/font>\r\nTexture dest_texture = device.createTexture(dest_image, Texture::FlagSurface | Texture::FlagSource);\r\n<font color=\"#ffff60\"><b>if<\/b><\/font>(!dest_texture) <font color=\"#ffff60\"><b>return<\/b><\/font> <span style=\"background-color: #0d0d0d\"><font color=\"#ffa0a0\">1<\/font><\/span>;\r\n\r\n<font color=\"#80a0ff\">\/\/ dispatch encoder<\/font>\r\n{\r\n    Compute compute = device.createCompute();\r\n    encoder.dispatch(compute, EncoderASTC::ModeASTC44RGBAu8n, dest_texture, src_texture);\r\n}\r\n\r\n<font color=\"#80a0ff\">\/\/ flush context<\/font>\r\ncontext.flush();\r\n\r\n<font color=\"#80a0ff\">\/\/ get intermediate image data<\/font>\r\n<font 
color=\"#ffff60\"><b>if<\/b><\/font>(!device.getTexture(dest_texture, dest_image)) <font color=\"#ffff60\"><b>return<\/b><\/font> <span style=\"background-color: #0d0d0d\"><font color=\"#ffa0a0\">1<\/font><\/span>;\r\n\r\n<font color=\"#80a0ff\">\/\/ copy image data<\/font>\r\nImage image = Image(Image::Type2D, format, Size(width, height));\r\nmemcpy(image.getData(), dest_image.getData(), min(image.getDataSize(), dest_image.getDataSize()));\r\n\r\n<font color=\"#80a0ff\">\/\/ save encoded image<\/font>\r\nimage.save(<span style=\"background-color: #0d0d0d\"><font color=\"#ffa0a0\">&quot;texture.astc&quot;<\/font><\/span>);\r\n<\/font><\/pre><\/div>\r\n\r\n<p>Additionally, the SDK includes a fast GPU JPEG decompression interface, which significantly accelerates JPEG to BC or ASTC conversions.<\/p>\r\n\r\n<p>Real-time compression speed comes at the cost of some texture quality. The tables below show PSNR and encoding time for compressing a test <a href=\"https:\/\/tellusim.com\/images\/blog\/gpu-encoder\/texture.png\" target=\"_blank\" rel=\"noopener\">1024&#215;512 RGB image<\/a> on an Apple M1 Max:<\/p>\r\n\r\n<div class=\"ts-scroll mt-4\">\r\n<table class=\"ts-blog-table table table-dark table-striped\">\r\n<thead><tr>\r\n<th scope=\"col\">PSNR RGB (dB)<\/th>\r\n<th scope=\"col\">CPU Fast<\/th>\r\n<th scope=\"col\">CPU Default<\/th>\r\n<th scope=\"col\">CPU Best<\/th>\r\n<th scope=\"col\">GPU<\/th>\r\n<\/tr><\/thead>\r\n<tbody>\r\n\r\n<tr><th scope=\"row\"><b>BC1<\/b><\/th>\r\n<td>39.83<\/td>\r\n<td>39.85<\/td>\r\n<td><\/td>\r\n<td>39.69<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>BC7<\/b><\/th>\r\n<td>48.27<\/td>\r\n<td>48.53<\/td>\r\n<td><\/td>\r\n<td>44.85<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 4&#215;4<\/b><\/th>\r\n<td>48.13<\/td>\r\n<td>48.29<\/td>\r\n<td>48.50<\/td>\r\n<td>44.97<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 
5&#215;4<\/b><\/th>\r\n<td>46.18<\/td>\r\n<td>46.34<\/td>\r\n<td>46.50<\/td>\r\n<td>42.74<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 5&#215;5<\/b><\/th>\r\n<td>44.48<\/td>\r\n<td>44.61<\/td>\r\n<td>44.73<\/td>\r\n<td>40.96<\/td>\r\n<\/tr>\r\n\r\n<\/tbody>\r\n<\/table>\r\n<\/div>\r\n\r\n<div class=\"ts-scroll mt-4\">\r\n<table class=\"ts-blog-table table table-dark table-striped\">\r\n<thead><tr>\r\n<th scope=\"col\">PSNR RG (dB)<\/th>\r\n<th scope=\"col\">CPU Fast<\/th>\r\n<th scope=\"col\">CPU Default<\/th>\r\n<th scope=\"col\">CPU Best<\/th>\r\n<th scope=\"col\">GPU<\/th>\r\n<\/tr><\/thead>\r\n<tbody>\r\n\r\n<tr><th scope=\"row\"><b>BC1<\/b><\/th>\r\n<td>49.52<\/td>\r\n<td>49.62<\/td>\r\n<td><\/td>\r\n<td>40.11<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>BC7<\/b><\/th>\r\n<td>50.14<\/td>\r\n<td>50.38<\/td>\r\n<td><\/td>\r\n<td>46.28<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 4&#215;4<\/b><\/th>\r\n<td>51.07<\/td>\r\n<td>51.18<\/td>\r\n<td>51.37<\/td>\r\n<td>47.43<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 5&#215;4<\/b><\/th>\r\n<td>48.65<\/td>\r\n<td>48.77<\/td>\r\n<td>48.92<\/td>\r\n<td>44.02<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 5&#215;5<\/b><\/th>\r\n<td>46.41<\/td>\r\n<td>46.54<\/td>\r\n<td>46.67<\/td>\r\n<td>41.85<\/td>\r\n<\/tr>\r\n\r\n<\/tbody>\r\n<\/table>\r\n<\/div>\r\n\r\n<div class=\"ts-scroll mt-4\">\r\n<table class=\"ts-blog-table table table-dark table-striped\">\r\n<thead><tr>\r\n<th scope=\"col\">Time RGB (ms)<\/th>\r\n<th scope=\"col\">CPU Fast<\/th>\r\n<th scope=\"col\">CPU Default<\/th>\r\n<th scope=\"col\">CPU Best<\/th>\r\n<th scope=\"col\">GPU<\/th>\r\n<\/tr><\/thead>\r\n<tbody>\r\n\r\n<tr><th scope=\"row\"><b>BC1<\/b><\/th>\r\n<td>28<\/td>\r\n<td>43<\/td>\r\n<td><\/td>\r\n<td>0.4<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>BC7<\/b><\/th>\r\n<td>105<\/td>\r\n<td>186<\/td>\r\n<td><\/td>\r\n<td>1.0<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 
4&#215;4<\/b><\/th>\r\n<td>100<\/td>\r\n<td>157<\/td>\r\n<td>386<\/td>\r\n<td>2.2<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 5&#215;4<\/b><\/th>\r\n<td>117<\/td>\r\n<td>175<\/td>\r\n<td>464<\/td>\r\n<td>4.8<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 5&#215;5<\/b><\/th>\r\n<td>138<\/td>\r\n<td>212<\/td>\r\n<td>542<\/td>\r\n<td>3.1<\/td>\r\n<\/tr>\r\n\r\n<\/tbody>\r\n<\/table>\r\n<\/div>\r\n\r\n<p>Reducing the number of input texture components improves PSNR values, which is beneficial for normal maps and luminance-only textures. ASTC encoding performance can be further optimized by limiting the number of compression modes if needed. However, the current performance is satisfactory for applications using JPEG input textures.<\/p>\r\n\r\n<p>The latest version of the reference <a href=\"https:\/\/github.com\/ARM-software\/astc-encoder\" target=\"_blank\" rel=\"noopener\">astcenc<\/a> compressor demonstrates excellent CPU encoding performance, whereas we stopped optimizing our CPU ASTC encoder once it reached the BC7 performance level:<\/p>\r\n\r\n<div class=\"ts-scroll mt-4\">\r\n<table class=\"ts-blog-table table table-dark table-striped\">\r\n<thead><tr>\r\n<th scope=\"col\">PSNR RGB (dB)<\/th>\r\n<th scope=\"col\">Fast<\/th>\r\n<th scope=\"col\">Medium<\/th>\r\n<th scope=\"col\">Thorough<\/th>\r\n<\/tr><\/thead>\r\n<tbody>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 4&#215;4<\/b><\/th>\r\n<td>47.31<\/td>\r\n<td>48.19<\/td>\r\n<td>48.50<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 5&#215;4<\/b><\/th>\r\n<td>45.58<\/td>\r\n<td>46.34<\/td>\r\n<td>46.63<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 5&#215;5<\/b><\/th>\r\n<td>43.56<\/td>\r\n<td>44.61<\/td>\r\n<td>44.97<\/td>\r\n<\/tr>\r\n\r\n<\/tbody>\r\n<\/table>\r\n<\/div>\r\n\r\n<div class=\"ts-scroll mt-4\">\r\n<table class=\"ts-blog-table table table-dark table-striped\">\r\n<thead><tr>\r\n<th scope=\"col\">Time RGB (ms)<\/th>\r\n<th scope=\"col\">Fast<\/th>\r\n<th scope=\"col\">Medium<\/th>\r\n<th 
scope=\"col\">Thorough<\/th>\r\n<\/tr><\/thead>\r\n<tbody>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 4&#215;4<\/b><\/th>\r\n<td>22<\/td>\r\n<td>28<\/td>\r\n<td>65<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 5&#215;4<\/b><\/th>\r\n<td>20<\/td>\r\n<td>26<\/td>\r\n<td>64<\/td>\r\n<\/tr>\r\n\r\n<tr><th scope=\"row\"><b>ASTC 5&#215;5<\/b><\/th>\r\n<td>21<\/td>\r\n<td>25<\/td>\r\n<td>67<\/td>\r\n<\/tr>\r\n\r\n<\/tbody>\r\n<\/table>\r\n<\/div>\r\n\r\n<p>All textures and metrics were produced with the Tellusim Image Processing Tool from the Core SDK using this script:<\/p>\r\n\r\n<ul>\r\n<li><a href=\"https:\/\/tellusim.com\/images\/blog\/gpu-encoder\/encoder.sh\" target=\"_blank\" rel=\"noopener\">encoder.sh<\/a><\/li>\r\n<\/ul>","protected":false},"excerpt":{"rendered":"Creating fast, real-time 3D applications always involves balancing quality and performance, especially when targeting platforms without top-tier GPUs. One of the main bottlenecks in these scenarios is memory throughput, which can significantly impact performance. 
The amount of texture data used directly affects memory bandwidth, and hardware texture compression helps alleviate this issue by reducing the [&hellip;]","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/tellusim.com\/wp-json\/wp\/v2\/posts\/901"}],"collection":[{"href":"https:\/\/tellusim.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tellusim.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tellusim.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tellusim.com\/wp-json\/wp\/v2\/comments?post=901"}],"version-history":[{"count":24,"href":"https:\/\/tellusim.com\/wp-json\/wp\/v2\/posts\/901\/revisions"}],"predecessor-version":[{"id":925,"href":"https:\/\/tellusim.com\/wp-json\/wp\/v2\/posts\/901\/revisions\/925"}],"wp:attachment":[{"href":"https:\/\/tellusim.com\/wp-json\/wp\/v2\/media?parent=901"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tellusim.com\/wp-json\/wp\/v2\/categories?post=901"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tellusim.com\/wp-json\/wp\/v2\/tags?post=901"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}