On Hunting the Uncommon Elephant
Hunting bugs is fun. And every now and then you get away alive with a story to bore your grandkids with (“In my day, we still hunted bugs with sticks and stones” and all).
First bug ever found. Taped into log for evidence.
GDC 2014 had another such trophy-worthy hunting safari in store for us. We were five days away from presenting Unity 5 to the world when we “spotted” (well, it was kinda hard to miss) an ugly little elephant of a bug: our shiny new 64-bit editor was randomly crashing on OSX to the point of being completely unusable. There’s just nothing like being up on stage to showcase how awesome your bug reporter is every couple of minutes.
So, Levi, Jonathan and I dropped all the awesome stuff we were working on (more stories we want to bore our grandkids with) and went stalking. All we knew at that point was that it crashed somewhere in the native code that Mono generates at run-time.
As every programmer knows, when you’re faced with a bug that isn’t obvious, you simply start by gathering evidence. Once you’ve learned enough about the bug’s behavioral patterns, you’ll eventually get a shot at it. And with the clock ticking, we were ready to shoot at pretty much anything.
But we were stumped. For an elephant, the bug turned out to be surprisingly agile and sneaky.
It seemed to happen only on OSX 10.9, although Kim saw something that looked markedly similar on Windows with his heavy-duty memory debugger branch. And if you enabled Guard Malloc on earlier versions of OSX, you got something that looked fairly similar as well. However, as it was crashing in random script code at arbitrary depths in the call hierarchy, it was difficult to say with certainty what was the same crash and what wasn’t. And the crash could be consistent for ten consecutive runs only to be totally different for the next five.
So while Kim and I waded knee-high through memory and thigh-high through assembly code, Levi ran an extensive trace on all of Mono’s secret and not so secret activities to generate a gigabyte log and an editor that ran at the speed of my grandma. This yielded the first interesting insight: apparently we were always compiling the method we crashed in right before things got ugly.
But what made it crash? The immediate cause was that we were trying to execute code from an invalid address. How did we get there? A bug in Mono’s signal handling where we don’t resume properly? A bug in Mono’s JIT compiler that won’t jump back properly to the compiled code? A different thread corrupting stack memory on the main thread? Fairies and grumkins? (For a bit, the last one seemed the most likely.)
After two days of hunting, the elephant was still well alive and out and about.
So, Saturday night I equipped myself with a notebook, four different colored pens and an ample supply of beer from our trademark Unity fridge (carefully making sure not to touch the awful canned Christmas beer we still have stuck in its crevices). Then I spun up Unity instances until I had four different crashes frozen in the debugger, labeled them “Red Crash”, “Blue Crash”, “Green Crash”, and “Black Crash” and went to work with my respectively colored pens to take notes and draw some not-so-pretty diagrams of everything I found.
Here are my notes for Blue Crash:
And that’s when I made my first discovery: in every case, the stack was 16 bytes larger than it should be!
That then led to the next discovery: for all crashes, looking at those extra 16 bytes turned up a return address back into the function we crashed in. From a trace it was clear that in all cases we had already executed some calls from the same method, and at first I thought the address was from the last call we had traced. However, closer inspection revealed that it was actually the return address for a call whose method had not been compiled yet!
This puzzled me for a moment, as in some cases there were several calls in between the last traced method and this call that hadn’t been compiled yet either. Looking closer, however, revealed that we had always jumped around them.
So, then I looked at that function we apparently were supposed to return from…
And there we have it (highlighted in blue): We were jumping in the wrong direction!
What Mono does here is create little “trampoline” functions that contain only a call to the JIT compiler and some data encoded into the instruction stream after the call (used by the JIT compiler to know which method to compile). Once the JIT compiler has done its work, it will delete those trampolines and erase every trace of having hooked into the method call.
However, the call instruction you see there is what is called a “near call” which incidentally uses a signed 32-bit offset to jump relative to the next instruction.
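To make the “relative to the next instruction” part concrete, here is a minimal Python sketch of how an x86-64 near call (opcode `E8` followed by a signed 32-bit displacement) could be encoded and decoded. The function names and the addresses in the usage note are made up for illustration; this is not Mono’s emitter, just the arithmetic the instruction format implies.

```python
import struct

CALL_REL32_OPCODE = 0xE8  # x86/x86-64 near call with a 32-bit relative displacement
CALL_REL32_LENGTH = 5     # 1 opcode byte + 4 displacement bytes


def encode_near_call(call_site: int, target: int) -> bytes:
    """Encode a near call located at call_site that should reach target.

    Hypothetical helper for illustration only.
    """
    # The displacement is measured from the instruction *after* the call.
    next_instruction = call_site + CALL_REL32_LENGTH
    displacement = target - next_instruction
    # struct.pack("<i", ...) raises struct.error when the displacement
    # does not fit into a signed 32-bit integer.
    return bytes([CALL_REL32_OPCODE]) + struct.pack("<i", displacement)


def decode_near_call_target(call_site: int, encoded: bytes) -> int:
    """Compute the address a near call at call_site transfers control to."""
    (displacement,) = struct.unpack("<i", encoded[1:5])
    return call_site + CALL_REL32_LENGTH + displacement
```

For example, `encode_near_call(0x1000, 0x2000)` stores a displacement of `0x2000 - 0x1005`, and decoding those bytes at call site `0x1000` gives back `0x2000`. A target more than about 2 GB away makes `struct.pack` raise instead of silently producing a wrong displacement.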
And since a signed 32-bit number can reach only 2GB in either direction and we’re running 64-bit here, we suddenly knew why heap memory layout played such a crucial role in reproducing the bug: once Mono’s trampolines were more than 2GB away from the JIT compiler, offsets would no longer fit into 32 bits and would get truncated when emitting the call instruction.
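The effect of that truncation can be sketched in a few lines of Python. The addresses below are hypothetical, and the sketch assumes the emitter simply keeps the low 32 bits of the offset (as a plain C cast would). How Mono’s code actually truncated is not shown in this post.

```python
CALL_LENGTH = 5  # opcode byte + 4-byte displacement of a near call


def fits_in_signed32(value: int) -> bool:
    """True if value can be stored in a signed 32-bit field."""
    return -(1 << 31) <= value < (1 << 31)


def to_signed32(value: int) -> int:
    """Keep the low 32 bits and reinterpret them as two's complement."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def truncated_call_target(call_site: int, intended_target: int) -> int:
    """Where the call really lands if the offset is silently truncated."""
    next_instruction = call_site + CALL_LENGTH
    displacement = intended_target - next_instruction
    return next_instruction + to_signed32(displacement)


# Hypothetical layout: two trampolines that both call back into the JIT.
jit_entry = 0x0000_0001_0000_0000
near_trampoline = jit_entry + 0x1000_0000  # 256 MB above the JIT entry
far_trampoline = jit_entry + 0xC000_0000   # 3 GB above the JIT entry

for name, site in (("near", near_trampoline), ("far", far_trampoline)):
    wanted = jit_entry - (site + CALL_LENGTH)
    landed = truncated_call_target(site, jit_entry)
    print(f"{name}: fits={fits_in_signed32(wanted)} "
          f"lands at {landed:#x}, wanted {jit_entry:#x}")
```

With this layout the near trampoline lands on the JIT entry as intended. The far one needs a backward offset of about 3 GB, which truncates to a positive value, so the call goes forward and lands 4 GB above the intended target, which is consistent with the jump in the wrong direction seen in the crash.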
At that point, Jonathan quickly pinpointed the right fix and by the time his Sunday was over, we had a stable working build ready in time for GDC.
You all know the history from there. We successfully demoed Unity 5 at GDC 2014 to rave reviews and after launch, it quickly became the most beloved piece of software ever. Oh wait, that bit is yet to come…
Before that launch, there are a whole lot more black and blue crashes to fix :).
Translated from: https://blogs.unity3d.com/2014/04/22/on-hunting-the-uncommon-elephant/