An In-Depth Look at MSAA

Date: 2019-11-10


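This article gives an in-depth explanation of MSAA (multisample anti-aliasing), covering the basic principles and a comparison of implementations on different platforms (mainly PC and mobile). I wrote it to build a better understanding of MSAA myself. It will inevitably contain mistakes; if you find any, please point them out so that other readers are not misled. With that out of the way, let's get started.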

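How MSAA Works

Aliasing

Before describing how MSAA works, here is a brief introduction to aliasing. In signal processing and related fields, aliasing is the effect that makes different signals indistinguishable once they are sampled. It also refers to the artifacts that appear when a signal reconstructed from its samples does not match the original signal. Aliasing can be temporal (for example in digital audio, or wheels appearing to spin backwards in films) or spatial (moiré patterns). We will not go into further detail here.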
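To make the definition concrete, here is a minimal Python sketch of two different sine waves that become indistinguishable once sampled. The frequencies and the sample rate are illustrative values chosen for this example, not taken from the article.

```python
import math

def sample(freq_hz, sample_rate_hz, count):
    """Sample sin(2*pi*f*t) at `count` evenly spaced instants."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]

SAMPLE_RATE = 10.0                      # samples per second (illustrative)
low = sample(1.0, SAMPLE_RATE, 20)      # a 1 Hz signal
high = sample(11.0, SAMPLE_RATE, 20)    # an 11 Hz signal (1 Hz + sample rate)

# Largest difference between the two sample sequences; it is only
# floating-point noise, so the sampled signals cannot be told apart.
max_diff = max(abs(a - b) for a, b in zip(low, high))
print(max_diff)
```

Raising the sample rate above twice the highest frequency present is what removes the ambiguity, which is the intuition behind the supersampling approaches discussed in this article.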

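In real-time rendering specifically, there are three kinds of aliasing:

Geometric aliasing (jagged edges on geometry), caused by undersampling geometric edges.
Shading aliasing, caused by undersampling the shading function (the rendering equation) evaluated in the shader. The most visible symptom is flickering specular highlights.

The upper image shows the shading aliasing produced when a high-frequency specular BRDF driven by a high-frequency normal map is undersampled. The lower image shows the result with 4x supersampling.
Temporal aliasing, caused mainly by undersampling fast-moving objects, for example animations in a game appearing to jump.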

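SSAA (Supersampling Anti-Aliasing)

As the name suggests, supersampling renders the scene at a higher resolution and then filters neighboring pixel values (for example by averaging them) to obtain the final image (the resolve). Because it raises the sampling rate, it is effective against both the geometric and the shading aliasing described above. As the figure below shows, n sub-sample points are first taken for each pixel, shading is computed for every sub-sample, and the final image is then composed from the sub-sample values.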
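The supersampling procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming an ordered-grid sub-sample layout and a box filter; the `scene` function is a hypothetical stand-in for the real shading work.

```python
def scene(x, y):
    """Hypothetical scene: white (1.0) on one side of the edge 2x + y = 5."""
    return 1.0 if 2.0 * x + y < 5.0 else 0.0

def ssaa_render(width, height, factor, shade):
    """Shade factor*factor ordered-grid sub-samples per pixel, then average."""
    image = []
    shade_calls = 0
    for py in range(height):
        row = []
        for px in range(width):
            total = 0.0
            for sy in range(factor):
                for sx in range(factor):
                    # Sub-sample position inside the pixel.
                    x = px + (sx + 0.5) / factor
                    y = py + (sy + 0.5) / factor
                    total += shade(x, y)      # every sub-sample is shaded
                    shade_calls += 1
            row.append(total / (factor * factor))  # box-filter resolve
        image.append(row)
    return image, shade_calls

image, shade_calls = ssaa_render(4, 4, 2, scene)
for row in image:
    print(row)
print("shader invocations:", shade_calls)
```

Pixels crossed by the edge end up with intermediate values instead of a hard 0 or 1 step, at the price of one shader invocation per sub-sample.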

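Although SSAA handles geometric and shading aliasing well, it needs more video memory and more shading work (lighting has to be computed for every sub-sample), so it is rarely used in practice. Following the same line of thought, if we shade once per final pixel rather than once per sub-sample, the memory footprint stays the same but the number of shader invocations drops, which improves efficiency considerably. That is MSAA, the main subject of this article.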
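A rough back-of-the-envelope comparison of the two approaches is sketched below. The resolution and the 8 bytes per sample (4 bytes of color plus 4 bytes of depth/stencil) are assumptions for illustration, and the MSAA figure is a lower bound that assumes every pixel is covered by a single opaque layer.

```python
def aa_cost(width, height, samples, bytes_per_sample=8):
    """Rough per-frame cost model for n-sample anti-aliasing.

    bytes_per_sample assumes 4 bytes of color + 4 bytes of depth/stencil.
    """
    pixels = width * height
    return {
        # SSAA and MSAA both store color and depth for every sub-sample.
        "buffer_bytes": pixels * samples * bytes_per_sample,
        # SSAA runs the fragment shader for every sub-sample.
        "ssaa_shader_invocations": pixels * samples,
        # MSAA runs it once per covered pixel (lower bound: edge pixels
        # touched by several primitives add extra invocations).
        "msaa_shader_invocations": pixels,
    }

cost = aa_cost(1920, 1080, 4)
for name, value in cost.items():
    print(name, value)
```

Under these assumptions the buffers are the same size for both techniques, while MSAA needs roughly a quarter of the shader invocations at 4x.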

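MSAA (Multisample Anti-Aliasing)

With SSAA, every sub-sample is shaded separately, which is still expensive when the fragment (pixel) shader is complex. Could we instead compute one color per pixel and, for the sub-samples, compute only coverage and occlusion information that decides which sub-samples receive that pixel color? The target image is then produced by downsampling the sub-sample colors with some reconstruction filter. That is the principle behind MSAA. One point is important here: every sub-sample has its own color, depth and stencil values, and every sub-sample must pass the depth and stencil tests before the pixel color is written to it; a coverage test alone is not enough. Several sources quoted later in this article support this. For now, let's get the principle straight.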

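Coverage and Occlusion

A D3D11-capable GPU renders points, lines and triangles through rasterization. The rasterization pipeline takes the primitive's vertices as input, with positions given in the homogeneous clip space produced by a perspective projection. These positions determine which pixels of the current render target the triangle lands on. Whether a pixel is visible is decided by two factors:

Coverage. Coverage is determined by testing whether a primitive overlaps a given pixel. On the GPU, this is done by testing whether the primitive covers a sample point located at the center of the pixel. The following image illustrates the process.

Coverage of a triangle. The blue dots are sample points, each at the center of a pixel. The red dots are the sample points covered by the triangle.

Occlusion. Occlusion tells us whether a pixel covered by a primitive is hidden by another primitive. This is the familiar z-buffer depth test.

Together, coverage and occlusion determine the visibility of a primitive.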
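The two tests can be sketched for the non-MSAA case as follows. This is a simplified Python illustration, not how any particular GPU implements it: it assumes a constant depth per triangle, uses edge functions for the coverage test, and ignores the fill rules that real rasterizers apply to samples lying exactly on an edge.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed test: which side of the directed edge A->B the point P is on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def covers(tri, x, y):
    """True if (x, y) is inside the triangle (either winding order)."""
    a, b, c = tri
    w0 = edge(b[0], b[1], c[0], c[1], x, y)
    w1 = edge(c[0], c[1], a[0], a[1], x, y)
    w2 = edge(a[0], a[1], b[0], b[1], x, y)
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
           (w0 <= 0 and w1 <= 0 and w2 <= 0)

def pixel_visible(tri, tri_depth, zbuffer, px, py):
    """Coverage at the pixel center, then the z-buffer test (smaller = closer)."""
    return covers(tri, px + 0.5, py + 0.5) and tri_depth < zbuffer[py][px]

tri = ((0, 0), (4, 0), (0, 4))
zbuffer = [[1.0] * 4 for _ in range(4)]
zbuffer[2][0] = 0.25                    # something closer already drawn here

print(pixel_visible(tri, 0.5, zbuffer, 1, 1))  # covered and closer
print(pixel_visible(tri, 0.5, zbuffer, 3, 3))  # center not covered
print(pixel_visible(tri, 0.5, zbuffer, 0, 2))  # covered but occluded
```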

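As far as rasterization is concerned, MSAA works much like SSAA: coverage and occlusion are both evaluated at a higher resolution. For coverage, the hardware generates n sub-sample points per pixel according to a sampling pattern. The following image shows the sub-sample positions of a rotated-grid pattern.

The triangle is tested for coverage against every sub-sample of the pixel, producing a binary coverage mask that records which part of the pixel the triangle covers. For occlusion, the triangle's depth is interpolated at each covered sub-sample position and compared with the value in the z-buffer. Because the depth test runs per sub-sample rather than per pixel, the depth buffer must grow to hold the extra depth values. In practice, the depth buffer is n times the size of the non-MSAA one.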
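A coverage mask can be computed along the following lines. The sketch assumes a commonly used 4x rotated-grid pattern (offsets from the pixel's top-left corner); actual sample positions depend on the API and the hardware, and the helper names are made up for this example.

```python
# Assumed 4x rotated-grid sub-sample offsets inside a unit pixel.
ROTATED_GRID_4X = [(0.375, 0.125), (0.875, 0.375),
                   (0.125, 0.625), (0.625, 0.875)]

def edge(ax, ay, bx, by, px, py):
    """Signed test: which side of the directed edge A->B the point P is on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def inside(tri, x, y):
    """True if (x, y) is inside the triangle (either winding order)."""
    a, b, c = tri
    w0 = edge(b[0], b[1], c[0], c[1], x, y)
    w1 = edge(c[0], c[1], a[0], a[1], x, y)
    w2 = edge(a[0], a[1], b[0], b[1], x, y)
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
           (w0 <= 0 and w1 <= 0 and w2 <= 0)

def coverage_mask(tri, px, py, pattern=ROTATED_GRID_4X):
    """Bit i is set when sub-sample i of pixel (px, py) is covered."""
    mask = 0
    for i, (ox, oy) in enumerate(pattern):
        if inside(tri, px + ox, py + oy):
            mask |= 1 << i
    return mask

tri = ((0, 0), (4, 0), (0, 4))
print(format(coverage_mask(tri, 0, 0), "04b"))  # interior pixel
print(format(coverage_mask(tri, 2, 1), "04b"))  # pixel crossed by an edge
print(format(coverage_mask(tri, 3, 3), "04b"))  # pixel outside the triangle
```

The number of set bits divided by n gives the fraction of the pixel that the triangle is considered to cover.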

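Where MSAA differs from SSAA is that SSAA shades every sub-sample, while MSAA runs the fragment shader only once per pixel, and only for pixels whose coverage mask is non-zero. Vertex attributes are interpolated at the pixel center for use in the fragment shader. This is MSAA's biggest advantage over SSAA.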
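Interpolating an attribute at the pixel center can be sketched with barycentric weights. This Python example is a simplified illustration that skips perspective correction; the triangle and the sample points are made-up values. It also shows a side effect worth knowing: for a partially covered pixel the center may lie outside the triangle, in which case a weight goes negative and the attribute is extrapolated.

```python
def barycentric(tri, x, y):
    """Barycentric weights of point (x, y) with respect to triangle ABC."""
    (ax, ay), (bx, by), (cx, cy) = tri
    area = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax)
    wa = ((bx - x) * (cy - y) - (by - y) * (cx - x)) / area
    wb = ((cx - x) * (ay - y) - (cy - y) * (ax - x)) / area
    return wa, wb, 1.0 - wa - wb

def interpolate(tri, values, x, y):
    """Interpolate one per-vertex attribute at (x, y)."""
    wa, wb, wc = barycentric(tri, x, y)
    return wa * values[0] + wb * values[1] + wc * values[2]

tri = ((0.0, 0.0), (4.0, 0.0), (0.0, 3.0))
vertex_x = [0.0, 4.0, 0.0]   # use the vertex x coordinate as the attribute

# Pixel (1, 0): its center (1.5, 0.5) is inside the triangle.
inside_value = interpolate(tri, vertex_x, 1.5, 0.5)

# Pixel (2, 1): partially covered, but its center (2.5, 1.5) is outside.
outside_weights = barycentric(tri, 2.5, 1.5)

print(inside_value)
print(outside_weights)
```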

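Although shading is done only once per pixel, that does not mean a single color value is stored. A color has to be stored for every sub-sample, which takes extra space, so the color buffer is also n times the non-MSAA size. When the fragment shader outputs a value, only the sub-samples that passed both the coverage test and the occlusion test are written. So if a triangle covers half of the sub-samples of a 4x pattern, half of them receive the new value; if all sub-samples are covered, all of them receive it. The following image illustrates the idea: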
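The per-pixel shading and per-sample write can be sketched as below. This is a minimal Python model of a single 4x pixel; the masks, depths and colors are made-up values, and real hardware would typically also skip the shader when every covered sub-sample fails an early depth test.

```python
RED, BLUE, BLACK = (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)

def msaa_write_pixel(colors, depths, mask, frag_depths, shade):
    """Update one pixel's sub-samples in place.

    Returns the number of fragment shader invocations (0 or 1).
    """
    if mask == 0:
        return 0                      # the primitive misses this pixel
    color = shade()                   # shaded once for the whole pixel
    for i in range(len(colors)):
        covered = (mask >> i) & 1
        # Per-sample depth test: smaller depth is closer.
        if covered and frag_depths[i] < depths[i]:
            colors[i] = color
            depths[i] = frag_depths[i]
    return 1

colors = [BLACK] * 4                  # per-sample color storage
depths = [1.0] * 4                    # per-sample depth storage
calls = 0

# Triangle A covers sub-samples 0 and 1 at depth 0.5.
calls += msaa_write_pixel(colors, depths, 0b0011, [0.5] * 4, lambda: RED)
# Triangle B covers sub-samples 1 and 2 at depth 0.7; it loses at sample 1.
calls += msaa_write_pixel(colors, depths, 0b0110, [0.7] * 4, lambda: BLUE)

print(colors)
print(depths)
print("shader invocations:", calls)
```

The pixel ends up holding colors from two triangles plus the background, while the shader ran only once per triangle.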

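Because the coverage mask decides which sub-samples are updated, a pixel can end up holding up to n different values from n triangles that each partially cover it. The following image shows the rasterization process with 4x MSAA.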

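MSAA Resolve

As with supersampling, the oversampled signal has to be resampled to the output resolution before it can be displayed.

This process is called resolving. In the earliest implementations, the resolve was done by fixed-function hardware on the GPU, typically with a one-pixel-wide box filter. For fully covered pixels, this filter gives the same result as rendering without MSAA. Whether that is good or bad depends on your point of view (good, because no detail is lost to blurring; bad, because a box filter introduces postaliasing). For pixels on triangle edges, you get the characteristic color gradient, with a number of steps equal to the number of sub-samples. The following image shows this: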
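A one-pixel-wide box filter simply averages the sub-samples of each pixel. The sketch below is a minimal Python illustration with made-up colors; it also reproduces the stepped gradient on an edge between a white triangle and a black background at 4x.

```python
WHITE, BLACK = (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)

def resolve_box(samples):
    """Average all sub-sample colors of one pixel, channel by channel."""
    n = len(samples)
    channels = len(samples[0])
    return tuple(sum(s[ch] for s in samples) / n for ch in range(channels))

# A fully covered pixel resolves to the same color as without MSAA.
full = resolve_box([WHITE] * 4)

# Edge pixels with 0..4 covered sub-samples give a stepped gradient.
gradient = [resolve_box([WHITE] * k + [BLACK] * (4 - k))[0]
            for k in range(5)]

print(full)
print(gradient)
```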

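Hardware vendors may of course use different algorithms, such as NVIDIA's "Quincunx" AA. As GPUs have evolved, the MSAA resolve can now also be performed by a custom shader.

Summary

As the explanation above shows, MSAA is not completed in the rasterization stage alone; that stage only generates coverage information. The pixel color is then computed, and the coverage and depth information decides which sub-samples it is written to. Once everything has been rendered, a filter downsamples the result to the final image. The overall flow is shown below: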

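PC vs. Mobile

We have covered the basic principle of MSAA. How do implementations differ across GPU vendors and platforms? Here is a brief comparison. Since the algorithm itself is fixed, the differences mostly lie in how details are handled and in what the different GPU architectures imply.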

API version	MSAA supported	Custom shader resolve	Larger color and depth buffers required
Direct3D 9	Yes	No	Yes
Direct3D 11	Yes	Yes	Yes
Direct3D 12	Yes	Yes	Yes
OpenGL ES 2.0	(Multisample rasterization cannot be enabled or disabled after a GL context is created. It is enabled if the value of SAMPLE_BUFFERS is one, and disabled otherwise) Multisample texture: use the GL_EXT_multisampled_render_to_texture extension. Apple: APPLE_framebuffer_multisample. Android: use EGL.	No	Depends on the GPU architecture. TBR (Mali, Qualcomm Adreno before the 300 series) and TBDR (PowerVR): not required. IMR (NVIDIA Tegra; Qualcomm Adreno 300 series and later, which can switch between IMR and TBR): required. Also required when using GL_EXT_multisampled_render_to_texture (depends on the hardware implementation; see "enabling MSAA the right way in OpenGL ES").
OpenGL ES 3.0	Yes (The technique is to sample all primitives multiple times at each pixel. The color sample values are resolved to a single, displayable color. For window system-provided framebuffers, this occurs each time a pixel is updated, so the antialiasing appears to be automatic at the application level. For application-created framebuffers, this must be requested by calling the BlitFramebuffer command (see section 4.3.3).) When rendering textures, emphasis is placed on multisample anti-aliasing (MSAA), which earlier hardware generations could only run against the framebuffer. OpenGL ES 3.0 can presently support MSAA-type rendering for a texture.	No	For a system-provided framebuffer, same as OpenGL ES 2.0. For an application-created framebuffer, extra video memory is required (I am not sure whether this depends on the hardware implementation).
OpenGL ES 3.1	Yes	Yes (sampler2DMS)	For a system-provided framebuffer, same as OpenGL ES 2.0. For an application-created framebuffer, extra video memory is required (I am not sure whether this depends on the hardware implementation).

IMR vs TBR vs TBDR

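IMR (Immediate Mode Rendering)

Today's PC GPUs are essentially all immediate-mode renderers: the CPU submits render data and render commands, and the GPU starts executing them. What it does depends very little on what has already been drawn or what will be drawn later (early-Z aside). The flow is shown below: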

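TBR (Tile-Based Rendering)

TBR divides the screen into a series of small tiles and processes each one separately, so tiles can be processed in parallel. Because at any moment the GPU needs only a portion of the scene data to do its work, that data (color, depth and so on) is small enough to fit in on-chip memory, which effectively reduces the number of accesses to system memory. The benefits are lower power consumption and lower bandwidth usage, and hence higher performance.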
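The tiling step can be sketched as a binning pass that records, for every tile, which triangles might touch it. The Python below is a simplified illustration that bins by bounding box (so it is conservative); the screen size, tile size and triangles are made-up values, and real hardware uses more precise tests and compact per-tile lists.

```python
import math

def bin_triangles(triangles, width, height, tile_size):
    """Map each tile (tx, ty) to the indices of triangles whose
    bounding box overlaps it."""
    tiles_x = math.ceil(width / tile_size)
    tiles_y = math.ceil(height / tile_size)
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the bounding box to the tile grid.
        tx0 = max(0, math.floor(min(xs)) // tile_size)
        tx1 = min(tiles_x - 1, math.floor(max(xs)) // tile_size)
        ty0 = max(0, math.floor(min(ys)) // tile_size)
        ty1 = min(tiles_y - 1, math.floor(max(ys)) // tile_size)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return bins

triangles = [
    ((2, 2), (20, 4), (5, 25)),       # fits inside the top-left tile
    ((10, 10), (50, 12), (12, 50)),   # bounding box spans all four tiles
]
bins = bin_triangles(triangles, 64, 64, 32)
for tile, indices in sorted(bins.items()):
    print(tile, indices)
```

Each tile can then be rendered on its own using only the triangles in its bin, with color and depth kept in on-chip memory until the tile is finished.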

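Tiling

TBDR (Tile-Based Deferred Rendering)

TBDR is similar to TBR: it also tiles the screen and keeps data (color, depth and so on) in on-chip memory. In addition, it uses a deferral technique called Hidden Surface Removal (HSR), which postpones texturing and shading until the visibility of every pixel in the tile has been determined, so only the pixels that will actually be seen consume processing resources. Unnecessary work on hidden pixels is eliminated, which keeps per-frame bandwidth and processing cycles as low as possible and yields higher performance and lower power consumption.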
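The saving from deferring shading can be sketched by counting shader invocations for a single pixel. This is a toy Python model that assumes opaque fragments only and an immediate-mode renderer with a plain depth test; the depth values are made up.

```python
def imr_shaded_fragments(depths_in_submission_order):
    """Immediate mode: every fragment that passes the depth test at the
    moment it arrives is shaded, even if it gets overdrawn later."""
    nearest = float("inf")
    shaded = 0
    for depth in depths_in_submission_order:
        if depth < nearest:
            nearest = depth
            shaded += 1
    return shaded

def tbdr_shaded_fragments(depths_in_submission_order):
    """Deferred: visibility is resolved first, so only the nearest
    opaque fragment of the pixel is shaded."""
    return 1 if depths_in_submission_order else 0

back_to_front = [0.9, 0.6, 0.3]   # worst case for immediate mode
front_to_back = [0.3, 0.6, 0.9]   # best case for immediate mode

print(imr_shaded_fragments(back_to_front), tbdr_shaded_fragments(back_to_front))
print(imr_shaded_fragments(front_to_back), tbdr_shaded_fragments(front_to_back))
```

In this model the deferred approach shades one fragment per pixel regardless of submission order, while the immediate-mode cost depends on how well the application sorts its draws.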

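A simple comparison of a traditional GPU and TBDR

MSAA on Mobile Platforms

With this brief overview of mobile GPU architectures in mind, let's look at how MSAA is handled on mobile platforms, as shown below:

As you can see, compared with an IMR GPU, MSAA on a TBR or TBDR GPU is much cheaper, because most of the work is done on-chip. There are still two costs:

4x MSAA needs four times the tile buffer memory. Because on-chip tile memory is very expensive, the GPU works around this by reducing the tile size. Smaller tiles do affect performance, but halving the tile size does not halve performance; workloads bottlenecked in the fragment shader are affected only slightly.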
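The effect on tile size and tile count can be estimated with a small calculation. The Python sketch below assumes a fixed on-chip budget, so each doubling of the sample count halves one tile dimension; the halving order is an assumption chosen to match the tile sizes mentioned in the PowerVR documentation quoted in this article (32 x 32 -> 32 x 16 -> 16 x 16, and 16 x 16 -> 8 x 8 at 4x), and the screen resolution is an illustrative value.

```python
import math

def msaa_tile_size(width, height, samples):
    """Shrink a tile so that width * height * samples stays constant.

    Assumed rule: for each doubling of the sample count, halve the
    larger dimension (the height when both are equal).
    """
    current = 1
    while current < samples:
        if height >= width:
            height //= 2
        else:
            width //= 2
        current *= 2
    return width, height

def tile_count(screen_w, screen_h, tile_w, tile_h):
    """Number of tiles needed to cover the screen."""
    return math.ceil(screen_w / tile_w) * math.ceil(screen_h / tile_h)

for samples in (1, 2, 4):
    tw, th = msaa_tile_size(32, 32, samples)
    print(samples, (tw, th), tile_count(1920, 1080, tw, th))
```

More tiles mean more per-tile overhead in the binning and geometry stages, which is where the cost of the smaller tiles shows up.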

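The second cost is that more fragments are generated along object edges, which also happens on IMR GPUs. Each polygon touches more pixels, as shown below. In addition, where background and foreground primitives both contribute to the same pixel, both fragments must be shaded, so hardware hidden surface removal culls fewer pixels. The cost of these extra fragments depends on how many edges the scene contains, but 10% is a reasonable guess.

Implementation Details on Mainstream Mobile GPUs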

Mali:

JUST22 - Multisampled resolve on-tile is supported in hardware with no bandwidth hit Mali GPUs support resolving multisampled framebuffers on-tile. Combined with tile-buffer support for full throughput in 4x MSAA makes 4x MSAA a very compelling way of improving quality with minimal speed hit.

In GLES on Mali GPUs, the simplest case for 4xMSAA would be to render directly to the window surface (FB0), having set EGL_SAMPLES to 4. This will do all multisampling and resolving in the GPU registers, and will only flush the resolved buffer to memory. This is the most efficient way to implement MSAA on a Mali GPU, and comes at almost no performance cost compared to rendering to a normal window surface. Note that this does not expose the sample buffers themselves to you, and does not require an explicit resolve.

Qualcomm Adreno:

Anti-aliasing is an important technique for improving the quality of generated images. It reduces the visual artifacts of rendering into discrete pixels.

Among the various techniques for reducing aliasing effects, multisampling is efficiently supported by Adreno 4xx. Multisampling divides every pixel into a set of samples, each of which is treated like a "mini-pixel" during rasterization. Each sample has its own color, depth, and stencil value. And those values are preserved until the image is ready for display. When it is time to compose the final image, the samples are resolved into the final pixel color. Adreno 4xx supports the use of two or four samples per pixel.

PowerVR:

Another benefit of the SGX and SGX-MP architecture is the ability to perform efficient 4x Multi-Sample Anti-Aliasing (MSAA). MSAA is performed entirely on-chip, which keeps performance high without introducing a system memory bandwidth overhead (as would be seen when performing anti-aliasing in some other architectures). To achieve this, the tile size is effectively quartered and 4 sample positions are taken for each fragment (e.g., if the tile size is 16x16, an 8x8 tile will be processed when MSAA is enabled). The reduction in tile size ensures the hardware has sufficient memory to process and store colour, depth and stencil data for all of the sample positions. When the ISP operates on each tile, HSR and depth tests are performed for all sample positions. Additionally, the ISP uses a 1 bit flag to indicate if a fragment contains an edge. This flag is used to optimize blending operations later in the render. When the subsamples are submitted to the TSP, texturing and shading operations are executed on a per-fragment basis, and the resultant colour is set for all visible subsamples. This means that the fragment workload will only slightly increase when MSAA is enabled, as the subsamples within a given fragment may be coloured by different primitives when the fragment contains an edge. When performing blending, the edge flag set by the ISP indicates if the standard blend path needs to be taken, or if the optimized path can be used. If the destination fragment contains an edge, then the blend needs to be performed individually for each visible subsample to give the correct resultant colour (standard blend). If the destination fragment does not contain an edge, then the blend operation is performed once and the colour is set for all visible subsamples (optimized blend). Once a tile has been rendered, the Pixel Back End (PBE) combines the subsample colours for each fragment into a single colour value that can be written to the frame buffer in system memory. 
As this combination is done on the hardware before the colour data is sent, the system memory bandwidth required for the tile flush is identical to the amount that would be required when MSAA is not enabled.

On PowerVR hardware Multi-Sampled Anti-Aliasing (MSAA) can be performed directly in on-chip memory before being written out to system memory, which saves valuable memory bandwidth. In general, MSAA is considered to cost relatively little performance. This is true for typical games and UIs, which have low geometry counts but very complex shaders. The complex shaders typically hide the cost of MSAA and have a reduced blend workload. 2x MSAA is virtually free on most PowerVR graphics cores (Rogue onwards), while 4x MSAA+ will noticeably impact performance. This is partly due to the increased on-chip memory footprint, which results in a reduction in tile dimensions (for instance 32 x 32 -> 32 x 16 -> 16 x 16 pixels) as the number of samples taken increases. This in turn results in an increased number of tiles that need to be processed by the tile accelerator hardware, which then increases the vertex stages overall processing cost. The concept of "good enough" should be followed in determining how much anti-aliasing is enough. An application may only require 2x MSAA to look "good enough", while performing comfortably at a consistent 60 FPS. In some cases there may be no need for anti-aliasing to be used at all e.g. when the target device's display has high PPI (pixels per-inch). Performing MSAA becomes more costly when there is an alpha blended edge, resulting in the graphics core marking the pixels on the edge to "on edge blend". On edge blend is a costly operation, as the blending is performed for each sample by a shader (i.e. in software). In contrast, on opaque edge is performed by dedicated hardware, and is a much cheaper operation as a result. On edge blend is also "sticky", which means that once an on-screen pixel is marked, all subsequent blended pixels are blended by a shader, rather than by dedicated hardware. In order to mitigate these costs, submit all opaque geometry first, which keeps the pixels "off edge" for as long as possible.
Also, developers should be extremely reserved with the use of blending, as blending has lots of performance implications, not just for MSAA.

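Conclusion

We have looked at how MSAA works and at how the implementation details differ between PC and mobile platforms because of their architectures. MSAA affects the rasterization, fragment shading and raster operations stages of the GPU pipeline. Every sub-sample has its own color and depth storage, and every sub-sample is depth tested. On mobile platforms, whether extra memory is needed for color and depth depends on the OpenGL ES version and the specific hardware implementation. In the common case (no extra memory for color and depth, sub-sample work done entirely on-chip, then resolved directly to the framebuffer), MSAA on mobile is more efficient than on PC because it avoids the large bandwidth cost. Since hardware implementations vary widely, however, it is best to rely on measurements. My knowledge is limited and there are bound to be mistakes; if you find any, please point them out so that others are not misled.

References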

https://en.wikipedia.org/wiki/Aliasing
https://en.wikipedia.org/wiki/Moir%C3%A9_pattern
https://mynameismjp.wordpress.com/2012/10/21/applying-sampling-theory-to-real-time-graphics/
https://en.wikipedia.org/wiki/Supersampling
https://mynameismjp.wordpress.com/2012/10/24/msaa-overview/
https://mynameismjp.wordpress.com/2012/10/28/msaa-resolve-filters/
http://graphics.stanford.edu/courses/cs248-07/lectures/2007.10.11%20CS248-06%20Multisample%20Antialiasing/2007.10.11%20CS248-06%20Multisample%20Antialiasing.ppt
https://msdn.microsoft.com/en-us/library/windows/desktop/cc627092(v=vs.85).aspx
https://www.khronos.org/registry/OpenGL/specs/es/2.0/es_full_spec_2.0.pdf
https://www.khronos.org/registry/OpenGL/extensions/EXT/EXT_multisampled_render_to_texture.txt
https://developer.apple.com/library/content/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/WorkingwithEAGLContexts/WorkingwithEAGLContexts.html#//apple_ref/doc/uid/TP40008793-CH103-SW4
https://stackoverflow.com/questions/27035893/antialiasing-in-opengl-es-2-0
https://www.imgtec.com/blog/a-look-at-the-powervr-graphics-architecture-tile-based-rendering/
https://www.imgtec.com/blog/understanding-powervr-series5xt-powervr-tbdr-and-architecture-efficiency-part-4/
https://en.wikipedia.org/wiki/Tiled_rendering
https://www.qualcomm.com/media/documents/files/the-rise-of-mobile-gaming-on-android-qualcomm-snapdragon-technology-leadership.pdf
https://static.docs.arm.com/100019/0100/arm_mali_application_developer_best_practices_developer_guide_100019_0100_00_en2.pdf
https://www.imgtec.com/blog/introducing-the-brand-new-opengl-es-3-0/
https://www.khronos.org/assets/uploads/developers/library/2014-gdc/Khronos-OpenGL-ES-GDC-Mar14.pdf
https://android.googlesource.com/platform/external/deqp/+/193f598/modules/gles31/functional/es31fMultisampleShaderRenderCase.cpp
https://www.anandtech.com/show/4686/samsung-galaxy-s-2-international-review-the-best-redefined/15
http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-TileBasedArchitectures.pdf
https://community.arm.com/graphics/f/discussions/4426/multisample-antialiasing-using-multisample-fbo
http://cdn.imgtec.com/sdk-documentation/PowerVR+Series5.Architecture+Guide+for+Developers.pdf
http://cdn.imgtec.com/sdk-documentation/PowerVR.Performance+Recommendations.pdf