Making the UI Backend Faster

In Unity 4.6 / 5.0, generating the batches used to render the UI system is very slow. This is due to a few factors, but ultimately our deadlines kept us from dedicating time to polishing this part of the UI; we focused instead on usability and the API side of things.

In the final sprints of finishing the UI, we were lucky enough to have some help with optimisation. After we shipped, we decided to take a step back and analyse exactly why things were slow and how we could fix them.

If you want the quick and dirty version: we managed to move everything (apart from job scheduling) off the main thread, and we drastically improved some of the algorithms we were using in the batch sorting.

Performance Project

We developed a few UI performance test scenes to get a good baseline to work with when testing the performance changes. They stress the UI in a variety of ways. In the test most applicable to sorting / batch generation, the canvas is completely filled with ‘buttons’. There are overlaps between the text on the button and the button background, so there will always be some overhead in calculating what can batch with what. The test constantly modifies the UI elements so that rebatching is required every frame.

The test can be configured to place UI elements in an ordered way (taking advantage of spatial closeness), or a random way (potentially stressing the sorting algorithms more). It was clear to us that batch sorting needed to be fast in both scenarios. In 4.6 / 5.0 it is fast in neither.

It should be noted that the performance test tends to have ~10k UI elements. This is not something we would expect to see in a ‘real’ UI; most UIs we’ve seen have ~300 items per canvas.

All performance measurement and profiling was done on my MacBook Air (13-inch, Mid 2013).

Portion of the test scene:

Original (pre 4.6, no stats)

During the 4.6 betas we were getting feedback that batch sorting was very slow when there were many elements on the canvas. This was because there was basically NO smartness in how we figured out batch draw order. We would simply iterate the elements on the canvas, see what each one collided with, and then assign a depth based on some rules. This meant that as we added more elements to the scene, things would get much slower, scaling as O(N^2). This is ‘bad vibes’ in terms of performance.
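
To make the quadratic cost concrete, here is a minimal sketch of that kind of brute-force pass. It is written in Python purely for illustration (the engine code is C++), and the `Rect` type, the function names and the depth rule ("one above the deepest earlier element it overlaps") are all assumptions; the real rules are not described here.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def overlaps(a: Rect, b: Rect) -> bool:
    """Axis-aligned overlap test; rects that only share an edge do not overlap."""
    return (a.x_min < b.x_max and b.x_min < a.x_max
            and a.y_min < b.y_max and b.y_min < a.y_max)


def assign_depths_naive(rects: List[Rect]) -> List[int]:
    """Assumed rule: depth is one above the deepest earlier element overlapped.

    Every element is tested against every earlier element, so N elements
    need N * (N - 1) / 2 overlap tests, which is O(N^2).
    """
    depths: List[int] = []
    for i, rect in enumerate(rects):
        depth = 0
        for j in range(i):
            if overlaps(rect, rects[j]):
                depth = max(depth, depths[j] + 1)
        depths.append(depth)
    return depths


background = Rect(0, 0, 100, 30)
label = Rect(10, 5, 90, 25)            # sits on top of the background
other_button = Rect(200, 0, 300, 30)   # far away, overlaps nothing
print(assign_depths_naive([background, label, other_button]))  # [0, 1, 0]
```

At the ~10k elements used in the test scene, a pass like this would perform 10,000 × 9,999 / 2 = 49,995,000 overlap tests per rebatch.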

4.6 / 5.0 release (baseline)

We did some work on the sorting that took advantage of the idea that elements drawn in order would normally be in a similar location on the screen. From this, a bounding box was built (per group of n elements), and new elements were collided with this group box before being collided with the individual elements. This led to a decent performance increase in scenes that had locality between UI elements, but in randomly ordered scenes, or scenes where elements were spaced far apart, the improvements were only marginal.
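
The group idea can be sketched roughly as follows. This is an illustrative Python sketch, not the shipped C++ code; `build_groups`, `query` and the group size of 4 are hypothetical names and values chosen for the example.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float


Group = Tuple[Rect, List[Rect]]  # (bounding box of the group, its members)


def overlaps(a: Rect, b: Rect) -> bool:
    return (a.x_min < b.x_max and b.x_min < a.x_max
            and a.y_min < b.y_max and b.y_min < a.y_max)


def union(a: Rect, b: Rect) -> Rect:
    return Rect(min(a.x_min, b.x_min), min(a.y_min, b.y_min),
                max(a.x_max, b.x_max), max(a.y_max, b.y_max))


def build_groups(rects: List[Rect], group_size: int) -> List[Group]:
    """Chunk elements in draw order and store each chunk with its bounding box."""
    groups: List[Group] = []
    for start in range(0, len(rects), group_size):
        members = rects[start:start + group_size]
        bounds = members[0]
        for member in members[1:]:
            bounds = union(bounds, member)
        groups.append((bounds, members))
    return groups


def query(groups: List[Group], rect: Rect) -> Tuple[List[Rect], int]:
    """Return the overlapping elements and how many rect tests were needed."""
    hits: List[Rect] = []
    tests = 0
    for bounds, members in groups:
        tests += 1
        if not overlaps(rect, bounds):
            continue  # the whole group is rejected with a single test
        for member in members:
            tests += 1
            if overlaps(rect, member):
                hits.append(member)
    return hits, tests


# A row of 8 buttons, each 10 units wide, laid out left to right in draw order.
row = [Rect(i * 10, 0, i * 10 + 10, 10) for i in range(8)]
groups = build_groups(row, group_size=4)
hits, tests = query(groups, Rect(62, 2, 68, 8))
print(hits, tests)  # one hit, found with 6 tests instead of 8
```

When consecutive elements are scattered across the canvas, each group's bounding box grows to cover most of the screen and rejects almost nothing, which is consistent with the marginal gains described above for random layouts.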

If we take a look at this version, you can see that when placing random elements the batch performance massively breaks down, taking roughly 100ms to sort and populate a scene… that’s for reals slow.

Looking at this in the timeline profiler also reveals another worrying situation… we are completely blocking anything else from happening. The batch generation is run just before the UI is rendered, which is after LateUpdate and often after scene cameras are rendered. It would make sense to move the batch generation to right after LateUpdate so that it can happen while the scene would normally be rendering.

Improved sorting (Take 1)

We did a first pass on improving sorting. It was still based on the idea of element locality, but with a few more smarts. It would try and keep groups ‘batchable’, so we could include / exclude batchability on a whole group level. It was faster, but still fell down when given very spatially separate scenes and did not scale well with the number of renderable elements.

Non spatially grouped input

Spatially grouped input

This is pretty poor. It was clear that we needed a new approach.

Improved sorting (Take 2)

As mentioned earlier, sorting tends to break down and be slow in larger UI scenes where elements are spread out. We took a step back and thought about what might be a better approach. In the end we decided to implement a canvas grid structure. Each grid square becomes a ‘bucket’ and any UI element that touches a square gets added to that bucket. This means that when adding a new UI element we only need to look into the buckets that the element touches to find what it can / can’t batch with. This led to significant performance improvements when the scene was ordered randomly.
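
A minimal sketch of such a grid, assuming a uniform cell size and axis-aligned rects. The `CanvasGrid` class, its method names and the cell size of 100 are hypothetical, and Python is used only to keep the example short.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, NamedTuple, Set, Tuple


class Rect(NamedTuple):
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def overlaps(a: Rect, b: Rect) -> bool:
    return (a.x_min < b.x_max and b.x_min < a.x_max
            and a.y_min < b.y_max and b.y_min < a.y_max)


class CanvasGrid:
    """Uniform grid; each cell is a bucket of indices of elements touching it."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self.rects: List[Rect] = []
        self.buckets: Dict[Tuple[int, int], List[int]] = defaultdict(list)

    def _cells(self, rect: Rect) -> Iterator[Tuple[int, int]]:
        size = self.cell_size
        for cx in range(int(rect.x_min // size), int(rect.x_max // size) + 1):
            for cy in range(int(rect.y_min // size), int(rect.y_max // size) + 1):
                yield (cx, cy)

    def candidates(self, rect: Rect) -> Set[int]:
        """Indices of elements sharing at least one bucket with rect."""
        found: Set[int] = set()
        for cell in self._cells(rect):
            found.update(self.buckets.get(cell, ()))
        return found

    def add(self, rect: Rect) -> List[int]:
        """Insert rect and return the indices of earlier elements it overlaps."""
        hits = sorted(i for i in self.candidates(rect)
                      if overlaps(rect, self.rects[i]))
        index = len(self.rects)
        self.rects.append(rect)
        for cell in self._cells(rect):
            self.buckets[cell].append(index)
        return hits


grid = CanvasGrid(cell_size=100)
first = grid.add(Rect(0, 0, 80, 30))          # button background
second = grid.add(Rect(10, 5, 70, 25))        # its label, same bucket
third = grid.add(Rect(950, 950, 990, 980))    # far corner, empty bucket
print(first, second, third)  # [] [0] []
```

The third element never gets compared with the first two, no matter where it falls in draw order, which is why this kind of structure does not depend on the input being spatially grouped.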

Non spatially grouped input

Spatially grouped input

Comparable performance between setups!

Geometry Job

We took the first step on the path to pulling the UI off the main thread by using the new Geometry Job system, which was introduced in Unity 5. This is an internal feature that can be used to populate vertex / index buffers in a threaded way. The changes made here allowed us to move a whole bunch of code off the main thread, as the timeline below shows. There is some small overhead in managing the geometry job (we have to create the job and job instructions, for example, which requires some memory), but this is negligible compared to the previous main thread cost.

Simplifying the batch sort

During the optimisation process, we did a bunch of smaller, profiler-guided optimisations. The biggest gain was probably when we vectorised a bunch of our rectangle overlap checks in the sorting. Basically, we get our data into a super nice data-oriented design (DOD) layout ready for overlap checking, then check with one call… this removed the overlap checks as a hot spot in the C++ profiler, where before they accounted for ~60% of the sort time. As you can see, doing this really helped our sort performance a bunch. But there was still a ways to go, and that was taking the remaining 7ms off the main thread.

vectorised overlap code
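
As a rough illustration of the layout idea only (this is not the engine's C++ code), the sketch below stores rectangles as four parallel arrays and tests one rect against all of them in a single call. The `RectSoA` class and its methods are hypothetical. Python has no SIMD, so this shows the data layout and the one-call interface rather than the speed-up; a vectorised implementation would run the same comparisons over several array lanes at once.

```python
from array import array
from typing import List


class RectSoA:
    """Rectangles stored as four parallel float arrays (structure of arrays)."""

    def __init__(self) -> None:
        self.x_min = array("f")
        self.y_min = array("f")
        self.x_max = array("f")
        self.y_max = array("f")

    def append(self, x_min: float, y_min: float,
               x_max: float, y_max: float) -> None:
        self.x_min.append(x_min)
        self.y_min.append(y_min)
        self.x_max.append(x_max)
        self.y_max.append(y_max)

    def overlap_mask(self, x_min: float, y_min: float,
                     x_max: float, y_max: float) -> List[bool]:
        """Test one rect against every stored rect in a single call."""
        return [
            ax0 < x_max and x_min < ax1 and ay0 < y_max and y_min < ay1
            for ax0, ay0, ax1, ay1 in zip(self.x_min, self.y_min,
                                          self.x_max, self.y_max)
        ]


stored = RectSoA()
stored.append(0, 0, 10, 10)
stored.append(20, 0, 30, 10)
stored.append(5, 5, 15, 15)
mask = stored.overlap_mask(8, 8, 12, 12)
print(mask)  # [True, False, True]
```

Keeping each coordinate in its own contiguous array means the inner loop reads memory linearly and contains no per-object pointer chasing, which is the property a vectorised check relies on.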

Taking it all off (the main thread)

The next logical step for us was to remove UI generation from the main thread. For this, we used the internal Job system to schedule a number of tasks. Some of them are serial, others are able to go wide and execute in parallel. Here is the breakdown:

1) Split incoming UI instructions into renderable instructions (1 UI instruction can contain many draw calls due to submeshes and multiple materials). This task goes wide. It allocates memory to accommodate the maximum possible number of renderable instructions. The incoming instructions are then processed in parallel and placed into the output array. This array is then ‘compressed’ down in a combine job into a contiguous section of memory just containing the valid instructions.

2) Sort the renderable instructions. Compare depths, overlaps, etc. Basically, sort for a command buffer that requires the LEAST amount of state change when rendering.

3) Batch Generation

- Generate the render command buffer. Create draw calls (batches / sub batches).
- Generate the transform instructions that the geometry job can use.
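
Step 1 follows a common "write wide, then compact" pattern, sketched below. This is an assumption-laden illustration rather than the internal Job system: `UIInstruction`, `split_and_compact` and the use of a Python thread pool are all hypothetical stand-ins, and Python threads show the structure of the work, not the parallel speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, NamedTuple, Optional, Tuple


class UIInstruction(NamedTuple):
    element: str
    materials: List[str]  # hypothetical: one renderable per material / submesh


Renderable = Tuple[str, str]  # (element, material)


def split_and_compact(instructions: List[UIInstruction]) -> List[Renderable]:
    # Allocate for the worst case: every instruction expands to the maximum.
    max_per_instruction = max((len(i.materials) for i in instructions), default=0)
    output: List[Optional[Renderable]] = (
        [None] * (len(instructions) * max_per_instruction)
    )

    def split(index: int) -> None:
        # Each task writes only into its own slice, so no locking is needed.
        base = index * max_per_instruction
        instruction = instructions[index]
        for slot, material in enumerate(instruction.materials):
            output[base + slot] = (instruction.element, material)

    with ThreadPoolExecutor() as pool:
        list(pool.map(split, range(len(instructions))))  # the "wide" step

    # The combine step: squeeze out unused slots, keeping the original order.
    return [entry for entry in output if entry is not None]


result = split_and_compact([
    UIInstruction("button", ["sprite"]),
    UIInstruction("label", ["font", "outline"]),
    UIInstruction("hidden", []),
])
print(result)  # [('button', 'sprite'), ('label', 'font'), ('label', 'outline')]
```

Over-allocating lets every task compute its output position from its own index alone, so the parallel phase needs no shared counter; the price is the extra serial pass that removes the gaps.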

The jobs are scheduled right after LateUpdate. This allows them to execute while a normal scene would be rendering and before the UI would be displayed. When these jobs are scheduled a fence is held by the main thread. It will be waited on by both the call for the canvas rendering and the Geometry job until all the required data has been generated.
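
The fence arrangement can be approximated with futures, as in the sketch below. This is only an analogy for the internal Job system: `generate_batches`, `geometry_job` and `frame` are hypothetical names, and a `concurrent.futures.Future` stands in for the fence.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Tuple


def generate_batches(elements: List[str]) -> List[str]:
    """Stand-in for the split / sort / batch generation jobs."""
    return [f"batch({element})" for element in sorted(elements)]


def geometry_job(fence: Future) -> int:
    """Stand-in for the geometry job: it cannot start until batches exist."""
    batches = fence.result()  # blocks here until batch generation is done
    return len(batches)


def frame(elements: List[str]) -> Tuple[List[str], int]:
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Scheduled right after LateUpdate; the main thread keeps the fence.
        fence = pool.submit(generate_batches, elements)
        geometry = pool.submit(geometry_job, fence)

        # ... the main thread would render the normal scene here ...

        batches = fence.result()          # canvas rendering waits on the fence
        batch_count = geometry.result()
    return batches, batch_count


print(frame(["b", "a"]))  # (['batch(a)', 'batch(b)'], 2)
```

If the scene takes longer to render than the batches take to generate, both waits return immediately; if not, whichever consumer arrives first simply blocks until the data is ready.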

In the example below, you can see the geometry job ‘stall’ as it waits for the batch generation to be completed. We need to do more testing around this, but as these scenes do not have any renderable elements aside from UI, this issue should decrease as the complexity of the scene increases.

Executing on a machine that has a few more cores than my MacBook Air

So there we have it: 0.4ms on the main thread for a very expensive UI.

Other performance things we did

* 2D Rect clipping (it turns out most UIs don’t really need the stencil buffer, and this reduces draw calls and state changes).
* 2D Rect culling (if your element is out of render bounds… cull it).
* Smarter canvas command buffer
  - Allow text / normal elements to share the same shaders / materials
  - Massively reduce set pass calls
  - Push a lot of UI specific data into material property blocks
  - Normally 1 set pass call for a UI, then multiple draw calls
* Combine UI into 1 mesh / index buffer
  - Use DrawIndexRange for rendering
  - One VBO / index buffer that resizes as needed
  - Splits to a new draw call when > 2^16 indices
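
The last point, splitting a shared index buffer into draw ranges, might look something like the sketch below. The exact splitting rule is an assumption (here: never let one draw exceed 2^16 indices, and never split a single element across draws), and `split_draw_calls` is a hypothetical helper, not a Unity API.

```python
from typing import List, Tuple

MAX_INDICES_PER_DRAW = 2 ** 16  # limit taken from the list above


def split_draw_calls(index_counts: List[int]) -> List[Tuple[int, int]]:
    """Return (first_index, index_count) ranges into one shared index buffer.

    index_counts holds the number of indices each element contributes,
    in draw order.
    """
    draws: List[Tuple[int, int]] = []
    start = 0
    count = 0
    for n in index_counts:
        if n > MAX_INDICES_PER_DRAW:
            raise ValueError("a single element exceeds the per-draw limit")
        if count + n > MAX_INDICES_PER_DRAW:
            draws.append((start, count))  # close the current range
            start += count
            count = 0
        count += n
    if count:
        draws.append((start, count))
    return draws


# 12,000 quads at 6 indices each = 72,000 indices in total.
ranges = split_draw_calls([6] * 12000)
print(ranges)  # [(0, 65532), (65532, 6468)]
```

With 6 indices per quad, 10,922 quads fit in the first range (65,532 indices) and the remaining 1,078 quads (6,468 indices) spill into a second draw call.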

Next Steps

Right now, the sorting / batch generation is behaving acceptably; there are, of course, things we can do to make it faster, but the biggest issue is the time it takes to process the geometry job. As it’s now off the main thread and an isolated job, it’s a good candidate for tidying and speeding up. I’m fairly certain we are still doing some dumb things (is that branching in a tight inner loop?), and it’s also using a bunch of slow maths that could be vectorised very nicely. At a higher level it is also worth looking at the situations that lead to a rebatch happening and attempting to minimise those. As always there is more work to do, but what is described here is in Unity 5.2 and is already a significant improvement.

Take Away

Many of the new features in Unity 5.2 are pretty great. They allowed us to completely minimise the cost of the UI system on the main thread, as well as optimise the batching in general. When we were working, we used a strongly profiler guided approach to find out where the issues were; in one or two places, we decided to completely step back and try again when we realised the old solution was inadequate. Internally at Unity we are doing a lot more of this kind of work, really trying to address pain points and issues that you are reporting to us in a way that makes Unity better for everyone. Thank you for reporting bugs and real projects that have issues for us to investigate. –UI team

Original article: https://blogs.unity3d.com/2015/09/07/making-the-ui-backend-faster/
