Descriptive Problem Summary:
Server hits some odd state where it refuses connections (at the tcp level), drops all current connections, and just idles using up a minor amount of cpu (5 to 15% of a core). Memory usage suggests the world is running, but I can't really know for sure.
I went and got some stack traces from process explorer after this triggered in case they help. It changed on refresh so i just took a few snapshots:
https://gist.github.com/MrStonedOne/ 25d86fb82643c0f4378b21d01a2aae56
And here is the cpu and memory usage log from one minute before The Event uptill I restarted the server. The last log line from in game was written at 2:04:06 so thats as close of a guestimate I can give for an exact moment the issue triggered:
https://gist.github.com/MrStonedOne/ 7d8b88f85a93df2207c13a0ca3e11334
I've seen similar hangs since 1397, but had only now gotten around to poking it enough to make sure it wasn't user error.
This happens often enough to be annoying, see: https://status.tgstation13.org/779067706
ID:2329701
Dec 20 2017, 8:07 pm
|
|||||||||||||
Resolved
| |||||||||||||
Its enabled in DD (the script passes -webclient to dreamdaemon.exe) but disabled in the config meaning that it disconnects people connecting over webclient that aren't admins during client/New()
|
In response to MrStonedOne
|
|
MrStonedOne wrote:
Its enabled in DD (the script passes -webclient to dreamdaemon.exe) but disabled in the config meaning that it disconnects people connecting over webclient that aren't admins during client/New() I can't help but wonder if that's significant. I've experienced something like that on rare occasions when testing the webclient and I was never able to reproduce it reliably. Although some server code changed under the hood in anticipation of adding stuff to the webclient, none of that should have an impact on connections, so I doubt the server version is specifically related. But since I haven't been able to get reliable info on that, who knows. |
So I did some cross referencing with logs. The 511 server hasn't had this issue, its only been the 512 server. (both have the web client enabled)
The first time was a few hours after we updated to 512.1397(from 512.1395). When we reverted back to 512.1395 it didn't happen, and it happened shortly after updating to 512.1399. If you'll notice, I mentioned in my previous that there was hangs during restart. I had been attributing these hangs to that bug because one of the hangs had happened shortly after a server reboot, but looking back, it seems that was a coincidence, and I was likely lumping this bug with that one. edit: anywho, I disabled the webclient on the 512 server. so we shall see if that helps. |
Shoot. Then chances are it's related to id:2329742.
I have to assume the proc queue is in some way messing up, so I'll take a closer look at that and see what I can find. |
In response to MrStonedOne
|
|
MrStonedOne wrote:
Happened on 512.1401 It wasn't the proc queue. |
I did some digging and found an issue in the ProcData struct optimization code that could well account for this. The only trouble is, I think you would have seen evidence in your logs.
Specifically the issue I found happens with one of two messages: "bad arg name '...'" and "positional parameters must precede all named args". The error string buffer doesn't get cleared when it's used for a crash, so all subsequent proc calls would also crash. I'll definitely fix that in 1402. The reason I think this is iffy, however, is that the errors arise from bad calls in soft code, and you would have had log entries for them not only in the past, but also lots and lots of them in the newer builds that have the soft-hanging issue. You didn't mention seeing anything like this in your logs, which would say either you didn't check them (unlikely) or you didn't see them there. I'll keep looking. |
My look through the 1395-1397 diff is complete, and I can find nothing else besides the issue I mentioned above. Again I would have expected that to show up in your server logs, since it's a runtime error message.
The proc queue was the most obvious place to look for a problem, but since turning that off in 1401 didn't fix the issue I'm 99.9% sure that's not it either. To be safe I looked over the code once more to be sure nothing changed in the code when the #define flag governing the queue is turned off, and it all looks good to me. I did see one questionable thing, but it would have been an issue in older versions too if it was relevant. |
logging is done in world/Error so that we can retain the usefulness of the message window in DD on windows.
This likely kept that from getting called. I'll re-add the config option to set world.log to a file so we can catch those kinds of errors in our runtime logs on the server without destroying the usefulness of the dd window during dev. |
Oh, cool. If you can confirm that there was indeed a runtime error, that would totally explain the issue.
Of course if that's the case, it also means there's almost certainly a problem in your soft code that's gone undetected or ignored, since it takes a certain kind of runtime error to trigger this issue. |
it also means there's almost certainly a problem in your soft code https://tgstation13.org/parsed-logs/basil/data/logs/2017/12/ 2017-12.runtime.condensed.txt I think that might be a forgone conclusion =P Anywho I did some testing with this information and the symptoms match up. And nothing gets to the log because world/Error() runtimes for the exact same reason. edit: And I found 7 unique cases of that runtime on the 511 server's december runtime logs. |
I think we're also experiencing this, since we freeze once or twice a day and it doesn't spit out a crash log. We tried updating to the beta rather than the stable but then we're freezing even more often.
https://pastebin.com/rvfuu7ES I'm not sure if any of this is useful/relevant since runtime logs lack time stamps. Tracking down the causes of crashed and freezes in general seems like it's trying to find a needle in a haystack, due to the lack of debugging tools. |
Lummox JR resolved issue with message:
Two specific runtime errors, "bad arg name" and "positional parameters must precede all named args", caused all further proc calls to crash. This is the bug that was being sought out when the proc queue was reverted to its old form, so the proc queue has also been returned to the newer binary insertion form. |
Out of curiosity is your server webclient-enabled?