When you say revert, which version are you returning to? Is one 499 more stable than the others, or do you have to go back to 498, 496, what?
|
I know it's relatively recent, and ATHK will be able to post the version we've been using, since he manages the server (BYONDpanel). I'm fairly certain it's 499, but I can't be sure without his confirmation.
|
Sorry for the late reply! It's on 499.1197. If you view the log file supplied: before 1201 we were using 1197, which is what the server was reverted to since it was actually working. Even then, 1158 (I think) has been the most stable so far...
The log file is in the last post of page 4. |
Server code has barely changed in 499; I have trouble believing any freezes/crashes are brand new to that version. There was a fix to several functions like winget() that were doing bad things to the proc stack when the client was invalid, but I strongly doubt that's the cause of any new bad behavior--I have complete confidence in that fix.
This bug (specifically, the problem with the string tree that causes a freeze) was originally reported on 498 anyway. Some routine in these projects is causing heap corruption that's messing up the string tree. I believe there's no inherent difference in stability among the 499 builds, at least where server code is concerned, with the exception that the winget() family fix stopped some proc stack corruption.

Heap corruption, however, which has obviously been ongoing for the projects in question, is the kind of thing that only shows up intermittently. It's the nature of intermittent bugs to sometimes appear frequently, sometimes lie dormant for a while, which can give the illusion that one version is more stable than others. Whatever's causing it here is obviously an infrequent case to begin with; it's likely related to some kind of player-oriented action, be it login/logout, savefile usage, or a specific command.

Bottom line: this is known to be present in at least 498, and I can't imagine anything in 499's server code that would actually make it worse. Based on the nature of when and how this bug crops up, I think the trigger for it (whatever's causing the heap corruption) is the same it always was, and reverting to an earlier 499 will do absolutely nothing to help. Also, we know this issue already crops up on 499.1197, and on 499.1193 and earlier builds.

As I've mentioned before, the best bet for catching this bug is going to be to get a server running valgrind so it can detect buffer overruns (a sample command line is at the end of this post). I understand there are some projects that just run too slow that way, but surely there must be one that can handle it. Last I knew, Stephen001 was still working on getting results from Hazordhu II in valgrind.

[edit] I should add that the refcount issue with clients (that's the 5:xxx you're seeing) could well be related to this, or it could be separate. I'm apt to believe it's connected, but I've searched repeatedly for the source of that and never found anything that would make sense. I believe there must be some commonality between the projects that see this, perhaps some function like winset(), winget(), winshow(), anything like that. Again, the fix I did would help with "bad client" cases, but there might be something lingering in one of those that I haven't managed to suss out. I wouldn't rule out other procs either, especially anything with client interaction like icon uploads, ftp(), client.Import(), what have you. If you can do anything to break down the code and separate out stuff by function, maybe it could help highlight the commonalities.
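Going back to the valgrind suggestion: a typical invocation would be something along these lines (the .dmb name and port here are placeholders; adjust to your setup):

    valgrind --error-limit=no --track-origins=yes --log-file=valgrind.log \
        DreamDaemon mygame.dmb 5000 -trusted

--error-limit=no keeps valgrind reporting past its default error cutoff, and --track-origins=yes makes uninitialised-value reports far more useful, at the cost of extra slowdown.
|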
http://awesomeware.org/valgrind.log
A first valgrind log up to the point of freeze is here. Sadly, due to the lag and prior warnings, we hit the error limit on valgrind too early. I'm re-running with the limit disabled. |
The rgbaToPalette() errors in that log are complete crap. They're complaining about a var called idx, which is absolutely initialized.
Is there a way you can tell valgrind to ignore those and only look at buffer overruns?
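I'd guess a suppressions file passed with --suppressions=byond.supp would do it (byond.supp being whatever name you give the file); a rough sketch, assuming those complaints show up as conditional-jump/uninitialised-value errors:

    {
       rgbaToPalette-uninit-cond
       Memcheck:Cond
       ...
       fun:*rgbaToPalette*
    }
    {
       rgbaToPalette-uninit-value
       Memcheck:Value4
       ...
       fun:*rgbaToPalette*
    }

(Running with --gen-suppressions=all will print ready-made entries to paste in.)
|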
I can say that I was experiencing similar issues with Angel Falls when I was running 499.1197+ on Ubuntu as well. I tried 499.1197 and then 499.1201 on both CentOS 6.4 and Ubuntu 13.04; the game would start and then crash within 30 seconds, with an error in the world log that was reported in another thread.
I have since reverted to BYOND 498 on Ubuntu 13.04 and I am no longer experiencing the issue; Angel Falls has been up and running again for a week now on 498 with no crashes. I don't know what it is with 499, but something doesn't agree with Linux. I also know that Xirre's host server utility works in 499, but some of the games he is hosting with it have experienced the same issue. |
Did you have to do anything to trigger the crash in Angel Falls (e.g., login), or did it just crash sometime after starting up?
|
The game would just crash, with an error just like this one:
http://www.byond.com/forum/?post=1323538&hl=Crashing%20due%20to%20an%20illegal%20operation
I thought I had a copy of the exact log, but it seems I no longer do. If you would like, when I get some time, I can set up a separate install of 499 on my Ubuntu server to reproduce the error. The public host files, though, are the same files that I experienced the 499 error in. |
This is only a preliminary post. I'll leave it running on valgrind for the next 5 hours or so to try to catch it.
But here's the log (ignoring the undefined-variable jumps):
http://sebsauvage.net/paste/?746f859fc8e50d4b#OiJ79/08mnAwW7REqOwq+r9AhGGAD+u+6Dlh7jG5bfI= (easier to read)

==17341== Warning: invalid file descriptor -1 in syscall write()
    [the above warning repeats 24 times in a row]
==17341== Invalid write of size 4
==17341==    at 0x434455F: DantomClient::SendWaitingMsg() (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x434BEDE: DantomClient::HandleHelloMsg(NetMsg*) (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x43502F3: DantomClient::HandleMsg(NetMsg*) (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x43581D5: ClientSocket::ReadMsg() (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x4357DB7: ??? (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x4359D69: SocketLib::Event_io() (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x435ACC4: SocketLib::WaitForSocketIO(long, unsigned char) (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x804A41D: ??? (in /usr/local/byond/bin/DreamDaemon)
==17341==    by 0x48F7BD5: (below main) (libc-start.c:226)
==17341==  Address 0x1d5d7f44 is 1,100 bytes inside a block of size 1,116 free'd
==17341==    at 0x4024851: operator delete(void*) (vg_replace_malloc.c:387)
==17341==    by 0x4344926: DantomClient::SendCallback(signed char, char const*) (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x43451AE: DantomClient::Event_NetDown() (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x43569FB: ClientSocket::SendBuf(char const*, long) (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x43574A3: ClientSocket::WriteMsg(NetMsg*) (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x4344552: DantomClient::SendWaitingMsg() (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x434BEDE: DantomClient::HandleHelloMsg(NetMsg*) (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x43502F3: DantomClient::HandleMsg(NetMsg*) (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x43581D5: ClientSocket::ReadMsg() (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x4357DB7: ??? (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x4359D69: SocketLib::Event_io() (in /usr/local/byond/bin/libbyond.so)
==17341==    by 0x435ACC4: SocketLib::WaitForSocketIO(long, unsigned char) (in /usr/local/byond/bin/libbyond.so) |
A newer version of the log, including some other corruption. Seems mostly network-related:
http://sebsauvage.net/paste/?64d7934adabd4855#f52Bjet060cNw/JvwB0azNRJOfMX1+7PcZqsIQI6B78=
Oh yeah, that's on 499.1201. |
Another, more updated version:
http://sebsauvage.net/paste/?00d4a1e6232cdfc3#bFQjshPG8UqRkg3NkxfRPLxukkNmgz4urlOjvy6VGdE= |
The server finally crashed (although it might be from OOM, since valgrind tends to use a LOT of memory). Still, here it is: the whole valgrind log, plus the crash data and the code around the crash.
http://sebsauvage.net/paste/?3916275c00da36bd#4l09b9SzmFXCTYn12cncD8HCiak5ZVgRttocyZM2+Mc= |
Any fixes planned for the next version, or are things still a bit of a mystery at the moment?
|
In response to Writing A New One:
I'm on the track of the problem. (I just got back from vacation, which is why there were no updates last week.) After much head-banging and searching, I finally found a couple of problem areas with the assistance of the valgrind log.
The first problem is related to hub communication and may have changed in 499; however, I suspect that specific part isn't actually new. The second problem appears to be that the map-sending routines are relying on data structures that might be cleared after a message is sent, due to the link in question being deleted. When this happens, it is important for the map-sending routine to bail out, which is not being done. I think this is the bigger issue and is more likely to be the major cause of heap corruption.
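To illustrate that second problem in the abstract (all names below are made up; this is just a sketch of the failure pattern, not actual engine code): sending a message can fail, run the disconnect path, and free the very structures the map-sending loop is still walking. The missing piece is a bail-out check after each send.

    #include <vector>

    struct MapChunk { int data; };

    struct ClientLink {
        bool alive = true;
        std::vector<MapChunk> chunks;    // per-client map state
    };

    // Sending may detect a dead socket and tear the link down as a side
    // effect, destroying 'chunks' while the caller is still iterating over it.
    bool SendMsg(ClientLink *link, const MapChunk &) {
        bool socket_ok = false;          // pretend the peer just disconnected
        if (!socket_ok) {
            link->alive = false;
            link->chunks.clear();        // teardown wipes the map data mid-send
        }
        return socket_ok;
    }

    void SendMap(ClientLink *link) {
        for (size_t i = 0; i < link->chunks.size(); ++i) {
            if (!SendMsg(link, link->chunks[i]) || !link->alive)
                return;                  // bail out: the link went down mid-send
        }
    }

Without that bail-out (e.g., a loop that keeps its iterators across the send), the routine keeps touching memory the teardown already released, which is exactly the kind of corruption the valgrind log shows.
|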
When can we expect a new release that fixes this? I hate to ask, but switching from Linux crashes to home-hosting and then back to another computer isn't working out very well for me.
|
I can grant Lummox SVN access if he wants to schedule a debugging session or tinker with the source to hopefully pinpoint what's causing the crashes, if the information supplied isn't enough. I'm sure a bunch of people from Eternia's community wouldn't mind signing in and doing as instructed, if it's something that'll require a good deal of fiddling.