Descriptive Problem Summary:
Players playing AI on our SS13 server have been reporting freezes for a while now. These freezes happen whenever the AI player's eye teleports from one place to another (specifically, more than one screen away or to another Z-level). This got worse when we swapped to the new map in the summer.
We suspect that the map data pushed during rapid jumping (which triggers large updates to what the client sees) is somehow causing an infinite loop.
To confirm: this happens whenever the player uses an in-game chat control, or whenever they use verbs to teleport their viewport up and down Z-levels or from one area to another.
There are two similar issues:
http://www.byond.com/forum/?post=2008870&hl=DMTextPrinter
http://www.byond.com/forum/?post=2082440&hl=DMTextPrinter
(I found them based on searching the name of the function that was at the top of the call stacks whenever I was debugging the freezes.)
Numbered Steps to Reproduce Problem:
There is a very reliable way to reproduce this:
* Set up the Aurora.3 master branch https://github.com/Aurorastation/Aurora.3
* Start the round
* Join as or make yourself into the AI
* Use jump-to-network to teleport yourself to the Research Outpost camera network
* Look around for the Research Outpost elevator (somewhere left and down, it's a 3x3 space)
* Cycle between using move-upwards and move-downwards commands until you crash!
Code Snippet (if applicable) to Reproduce Problem:
N/A
Expected Results:
Serverside runtime or no crash.
Actual Results:
DreamSeeker freezes for the AI player. They are still connected to the server but cannot interact with DreamSeeker beyond closing it.
Does the problem occur:
Every time? Or how often? Multiple times every round for a majority of our AI playerbase.
In other games? Unknown. Related issues were linked up above.
In other user accounts? Yes.
On other computers? Yes.
When does the problem NOT occur?
When you're not playing as the AI. The AI is not the only role that can move rapidly between Z-levels and areas like this. Ghosts can too, but we've not received reports of ghosts crashing. One theory is that the AI additionally has to deal with static (one image per unseen tile) whenever it loads new map data.
Did the problem NOT occur in any earlier versions? If so, what was the last version that worked? Tested with the earliest and latest versions of 511; it still crashes. The server depends on 511, so there's no point testing anything lower.
Workarounds:
Not known.
Debug Data:
When testing, I found that Windows was running BYOND in compatibility mode for Windows 7. I disabled that, but it still didn't fix anything.
I have 3 dumps available upon request, and I followed the stack-trace -> gu method described by Lummox in one of the linked threads. The results were as follows:
# Child-SP RetAddr Call Site
00 00000000`00ffd4c8 00ffd4e8`00008afd byondcore!DMTextPrinter::SetLinkStart+0x248af
01 00000000`00ffd4d0 00008afd`0f932661 0x00ffd4e8`00008afd
02 00000000`00ffd4d8 00000000`0dd45b70 0x00008afd`0f932661
03 00000000`00ffd4e0 00000040`00000000 0xdd45b70
04 00000000`00ffd4e8 0f932781`00ffd65c 0x00000040`00000000
05 00000000`00ffd4f0 5971f706`00000000 0x0f932781`00ffd65c
06 00000000`00ffd4f8 00000040`00000050 0x5971f706`00000000
07 00000000`00ffd500 0000ffff`00000002 0x00000040`00000050
08 00000000`00ffd508 0000ffff`0000ffff 0x0000ffff`00000002
09 00000000`00ffd510 0000ffff`0000ffff 0x0000ffff`0000ffff
0a 00000000`00ffd518 00000000`00000000 0x0000ffff`0000ffff
Gu here produces this:
0:000> gu
Unable to insert breakpoint 10000 at 00ffd4e8`00008afd, Win32 error 0n998
"Invalid access to memory location."
The breakpoint was set with BP. If you want breakpoints
to track module load/unload state you must use BU.
go bp10000 at 00ffd4e8`00008afd failed
WaitForEvent failed
ntdll!DbgBreakPoint+0x1:
00007ffe`ef138d61 c3 ret
However, the stack changes. We're now here:
# Child-SP RetAddr Call Site
00 00000000`03e8f688 00007ffe`ef16536b ntdll!DbgBreakPoint+0x1
01 00000000`03e8f690 00007ffe`ef100d5f ntdll!DbgUiRemoteBreakin+0x4b
02 00000000`03e8f6c0 00000000`00000000 ntdll!RtlUserThreadStart+0x2f
Another gu gets past the breakpoint and the code starts executing again. If I break again, the stack trace is the same as the one at the start. byondcore!DMTextPrinter::SetLinkStart always ends up at the top.
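One thing the first trace itself shows: below frame 00, each Child-SP is 8 bytes above the previous one, and each call site is simply the previous frame's return address. That pattern is typical of a debugger reading consecutive stack slots rather than real frames, so frames 01 onward probably carry no call information. The Python sketch below only illustrates that check. The helper names `parse` and `looks_like_raw_stack_walk` are made up here, and the first trace is abbreviated to its first four frames.

```python
FIRST_TRACE = """\
00 00000000`00ffd4c8 00ffd4e8`00008afd byondcore!DMTextPrinter::SetLinkStart+0x248af
01 00000000`00ffd4d0 00008afd`0f932661 0x00ffd4e8`00008afd
02 00000000`00ffd4d8 00000000`0dd45b70 0x00008afd`0f932661
03 00000000`00ffd4e0 00000040`00000000 0xdd45b70
"""

SECOND_TRACE = """\
00 00000000`03e8f688 00007ffe`ef16536b ntdll!DbgBreakPoint+0x1
01 00000000`03e8f690 00007ffe`ef100d5f ntdll!DbgUiRemoteBreakin+0x4b
02 00000000`03e8f6c0 00000000`00000000 ntdll!RtlUserThreadStart+0x2f
"""


def to_int(text):
    """Convert a debugger address such as 00000000`00ffd4c8 or 0xdd45b70 to an int."""
    return int(text.replace("`", ""), 16)


def parse(trace):
    """Return a list of (child_sp, ret_addr, call_site) tuples, one per frame."""
    frames = []
    for line in trace.splitlines():
        _, child_sp, ret_addr, site = line.split(maxsplit=3)
        frames.append((to_int(child_sp), to_int(ret_addr), site))
    return frames


def looks_like_raw_stack_walk(frames, slot=8):
    """True if every frame after the first is just the next stack slot.

    That means: Child-SP advances by one slot, the call site has no symbol,
    and the call site equals the previous frame's return address.
    """
    for prev, cur in zip(frames, frames[1:]):
        if cur[0] - prev[0] != slot:
            return False
        if not cur[2].startswith("0x"):
            return False
        if to_int(cur[2]) != prev[1]:
            return False
    return True


first_is_raw = looks_like_raw_stack_walk(parse(FIRST_TRACE))
second_is_raw = looks_like_raw_stack_walk(parse(SECOND_TRACE))
```

On this data the check holds for the first trace and not for the ntdll one, which fits the observation that byondcore!DMTextPrinter::SetLinkStart is the only frame that can be relied on.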
ID:2294466
Sep 16 2017, 9:54 am
Sep 16 2017, 3:24 pm
Thanks for the good info. I'll take a look at this as soon as I can, although I'm not sure it'll be prior to the 512 release. Since you're able to reproduce this reliably, that's actually a pretty big help. If I can get it to happen at my end then I feel pretty confident in finding a fix.
|
I was about to report a similar bug, but because I know that AI movement in Space Station 13 involves a lot of image manipulation and the symptoms I get are exactly the same, this is very likely the same bug I am experiencing.
Say you have two images, imageA and imageB, both on the same atom. imageA is not pixel shifted at all and imageB is very heavily pixel shifted, with at least pixel_x = 320 and pixel_y = 320. (I do not know the exact values that cause this to happen, but it does not seem to happen with low values such as pixel_x = 32 and pixel_y = 32.)
If you have at least one of the images in your client.images and your eye is positioned such that imageB would be on your screen but imageA (and the atom that both are attached to) would not be, your DreamSeeker will crash as described above if, in the same tick, you remove at least one of the images from your client.images and your perspective moves very far away (at least 75 turfs).
Note that you do not have to have both images in your client.images for this to happen. You can:
* have imageA in your client.images and remove it,
* have imageB in your client.images and remove it, or
* have both images in your client.images and remove either one.
The crash does not occur if you put a sleep(1) between the perspective change and the removal of the image.
I have managed to reproduce this bug both by changing the client's perspective using its eye var and by just teleporting the mob. I have not managed to reproduce it without the large amount of pixel shifting on one of the images.
Doing a binary search with BYOND versions, I found that the bug appeared between releases 510.1344 and 510.1345.
This is the code I wrote to reproduce this bug in a non-SS13 environment. (You will need to have an icon file called 'icons.dmi', or change the filename to something that exists. It does not matter whether the file contains the icon_states "blue" and "red".)
var/image/imageA
To reproduce this bug with these verbs:
* Load a map that is at least 100x by 100y
* Use the "Cause crash" verb
Or:
* Load a map that is at least 100x by 100y
* Use the following series of verbs:
  * Teleport to 1,1
  * Create Image A
  * Create Image B
  * Show Image A
  * Teleport to 15,15
  * Teleport to 100,100 and hide A
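A rough geometric model of the layout described above, written in Python rather than DM since it only does arithmetic. Every number in it is an assumption for illustration: 32-pixel tiles, a 15x15 view (range 7), and both images attached to an atom at tile 1,1. It does not reproduce the crash. It only checks whether a shifted image would be drawn on screen while its atom is off screen.

```python
from dataclasses import dataclass

TILE = 32        # assumed icon size in pixels
VIEW_RANGE = 7   # assumed view range, i.e. a 15x15 tile screen


@dataclass
class ShiftedImage:
    atom_x: int        # tile coordinates of the atom the image is attached to
    atom_y: int
    pixel_x: int = 0   # pixel offsets applied to the image
    pixel_y: int = 0


def sprite_on_screen(px, py, eye_x, eye_y):
    """True if a tile-sized sprite whose lower-left pixel is (px, py)
    overlaps the viewport centred on tile (eye_x, eye_y)."""
    left = (eye_x - VIEW_RANGE) * TILE
    bottom = (eye_y - VIEW_RANGE) * TILE
    right = (eye_x + VIEW_RANGE + 1) * TILE
    top = (eye_y + VIEW_RANGE + 1) * TILE
    return px < right and px + TILE > left and py < top and py + TILE > bottom


def atom_on_screen(img, eye_x, eye_y):
    """Visibility of the atom itself, ignoring any pixel offsets."""
    return sprite_on_screen(img.atom_x * TILE, img.atom_y * TILE, eye_x, eye_y)


def image_on_screen(img, eye_x, eye_y):
    """Visibility of the image after its pixel offsets are applied."""
    return sprite_on_screen(
        img.atom_x * TILE + img.pixel_x,
        img.atom_y * TILE + img.pixel_y,
        eye_x,
        eye_y,
    )


def matches_trigger_layout(img, eye_x, eye_y):
    """Image visible while its atom is not: the layout the report describes."""
    return image_on_screen(img, eye_x, eye_y) and not atom_on_screen(img, eye_x, eye_y)


image_a = ShiftedImage(1, 1)
image_b = ShiftedImage(1, 1, pixel_x=320, pixel_y=320)
small_shift = ShiftedImage(1, 1, pixel_x=32, pixel_y=32)
```

Under these assumptions, with the eye at 15,15 only the 320-pixel shift puts the image on screen while the atom at 1,1 is off screen. The unshifted image and the 32-pixel shift do not, which is consistent with the report that small offsets do not trigger the problem.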
In response to Cruix
Can confirm, images off screen are a nuisance in general. The addition of a snek race, which had a long enough tail to stretch off screen, would cause regular crashing. Though I've not bothered to diagnose it, as said snek people were a meme.
Also, thank you for the research, Cruix! |
In response to Cruix
I tried your code out, but I wasn't able to get a crash except just once. When it happened it was only because I had added showImageB() right after showImageA(), and the problem only happened when closing the world. So I can see there's something wrong, but I can't pinpoint it just yet because the problem is highly intermittent.
I tried also adding a sleep(10) after the setEye() call, and that didn't make a difference. Nor did setting the perspective to EYE_PERSPECTIVE. Do you have a test project that shows this in action every time? [edit] Also, please note I tested in 512. I would recommend you retest in 512 as well because the code for handling turfs and images has changed a great deal. |
As an update, I was able to find that images didn't set their visual bounds correctly in regards to pixel offsets, which is a bug--but not a crashing bug. I am still unable to replicate a crash.
|
This does not occur for me in 512, only in the latest stable release and earlier.
My test project is just the above verbs in a project with a 100x100 map and some (bad) sprites for /mob, /turf and the red and blue images. |
In response to Lummox JR
As said, my crash still works pretty reliably, even in 512. Though I do understand that setting up an SS13 codebase for local testing can be a bigger butt than using the small file Cruix provided.
|
Thanks for the info, Cruix. I was afraid of that; it sounds like other fixes in the interim have obviated the specific issue you saw. Chances are you could probably narrow down which 512 build fixed it.
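Narrowing down a build like this is a plain bisection, the same approach as the binary search that placed the original regression between 510.1344 and 510.1345. A minimal sketch in Python, assuming the behaviour flips exactly once across the sorted build list. `first_build_where` and the build range are hypothetical, and `crashes` stands in for a manual test run on each build.

```python
def first_build_where(builds, predicate):
    """Return the first build in sorted `builds` for which predicate is True.

    Assumes predicate is False for every build before the flip point and
    True from there on. Returns None if it is never True.
    """
    lo, hi = 0, len(builds)
    while lo < hi:
        mid = (lo + hi) // 2
        if predicate(builds[mid]):
            hi = mid          # flip point is at mid or earlier
        else:
            lo = mid + 1      # flip point is after mid
    return builds[lo] if lo < len(builds) else None


# Hypothetical range of 510 build numbers around the reported regression.
builds = list(range(1340, 1350))
tested = []


def crashes(build):
    """Stand-in for running the test project on a given build by hand."""
    tested.append(build)
    return build >= 1345


first_bad = first_build_where(builds, crashes)
```

The same function finds the build that fixed a bug if the predicate is inverted, for example "does not crash on this build". Each probe is one manual test, so ten candidate builds need at most four runs.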
Skull, is there any way you can think of to pare down the codebase you're working with to something minimalist? Or if not, can you at least set it up so that I can get to the crashy bit with a minimum of effort? (If it's something big like /tg the size will tend to be a big problem for debugging, but I've run some other builds successfully.) |
In response to Lummox JR
I can set up Aurora's test server and give you enough flags to spawn yourself as AI. It would basically involve you logging onto the IP I send, and pasting a few instructions.
Cutting it down would take a metric tonne of time, as I'm not really sure how much can be removed without "fixing" the bug. I'm hoping that it'll be less effort to investigate via attaching a debugger onto the client, and figuring out what's wrong. Considering it's a loop, you might be able to figure out what function loops exactly and how, without directly messing with the server. Alternatively, running a decent SS13 codebase isn't that difficult: download zip (our master for example, dev is a little ahead), compile, copy-paste config/example into config, and run with DD. That's it. |
Remote debugging is simply not an option. There's no way for that to work. I'd need to run a server.
If it's not /tg then there's a good chance I can run it in the debugger, if you can get me a current master and the exact instructions I'll need to get to where the crash occurs. However of course in a codebase as complex as SS13, the problem might not be easy to diagnose even if I can make it happen reliably. |
Actually, it appears that bumping both server and client to 512 may have fixed it. When I reported, I had only bumped the client. But after a bit of testing on a version of the server compiled and run on 512, I am unable to reproduce the crash.
So, I suppose this is solved until I get players moaning again, after 512 goes live? |