Dream Daemon Hang Up

Jan 6 2017, 11:14 pm

In response to Lummox JR

Yeah, that looks pretty good.

Any chance you would be willing to hook Hikato up with debug builds of libbyond.so and DreamDaemon (-g flag for debug symbols, no -O flags in case you're not familiar -- sorry, not sure where your knowledge baseline is for gcc flags)? That would give us something to work with in the meantime. Here are some things we can accomplish with a debug build:

1.) Attach with gdb after it starts hanging
2.) Generate a core dump, which includes memory/stack/register information at the time of generation
3.) Useful backtraces, to include function names, line numbers, and values of local variables in case something's getting hunkerdoodled.

#3 would prove to be the most useful -- generate a good number of them, see what's really happening while it's clearly not accepting DS connections.

Jan 6 2017, 11:18 pm
Lummox JR	I think the build process may generate debug builds automatically. I'll have to take a look and see. If I find that there are debug builds available, I can shoot one over. I'm not really expecting to throw any weekend time at this though.

Jan 6 2017, 11:20 pm
Audeuro	Yeah, that's fair. Also, I did have him attach to a non-hanging DreamDaemon in the same fashion as he did this last one, just to rule out any funkiness like in the link you posted. That one was definitely attaching properly and outputting normal DD syscalls.

Jan 6 2017, 11:28 pm
Hikato	It looks like (per that post) passing -f on the strace will also trace forked processes, so on the next hang up, I'll try that as well and see if it comes up with anything useful. No worries. If you have a debug build, I'll run it on one of the servers. Although, so long as it is a debug build of the latest stable (510.1346), I'll run it on all three.

Jan 7 2017, 11:01 am

Hikato

Hung up again, I tried strace with -f, but nothing extra. I do have a bunch of hung zombie processes that only get killed when I kill the DD process.

tinycos+ 12236 0.0 0.0 0 0 pts/10 Z 04:55 0:00 [sh] [defunct]
tinycos+ 12244 0.0 0.0 0 0 pts/10 Z 04:55 0:00 [sh] [defunct]
tinycos+ 12245 0.0 0.0 0 0 pts/10 Z 04:55 0:00 [sh] [defunct]
tinycos+ 12251 0.0 0.0 0 0 pts/10 Z 04:55 0:00 [sh] [defunct]
tinycos+ 12252 0.0 0.0 0 0 pts/10 Z 04:55 0:00 [sh] [defunct]
tinycos+ 12253 0.0 0.0 0 0 pts/10 Z 04:55 0:00 [sh] [defunct]
tinycos+ 12254 0.0 0.0 0 0 pts/10 Z 04:55 0:00 [sh] [defunct]
tinycos+ 12259 0.0 0.0 0 0 pts/10 Z 04:55 0:00 [sh] [defunct]
tinycos+ 12260 0.0 0.0 0 0 pts/10 Z 04:55 0:00 [sh] [defunct]

I wasn't able to run anything on the zombie processes before confirming they were killed by killing the parent DD process, but I'll check on those next time it hangs up.

Jan 7 2017, 11:18 am (Edited on Jan 7 2017, 1:57 pm)

Audeuro

So, something else I just thought of:

In your SIGCHLD handler, you're calling signal_srv->CommandFinished -- what all does signal_srv->CommandFinished do?

Signal handlers are like interrupt handlers: You really want them to end quickly, and you don't want to do much of any work in them or you'll run into problems. Ideally, CommandFinished would simply indicate that it finished, and then that gets processed by normal process flow. Also, you definitely don't want to do anything blocking inside a signal handler. See signal(7)

Edited for wording, now that I'm awake.

Jan 11 2017, 8:53 am
Audeuro	To add to this, Hikato has since noticed that there is a very noticeable delay between his child processes turning into zombies and being reaped. This would further point to something odd happening in SIGCHLD that is blocking or taking a relatively long time. It will be interesting to see/hear about what's happening in signal_srv->CommandFinished that may be causing this.

Jan 11 2017, 9:58 am

Lummox JR

When shell() is called, it puts the proc to sleep as a background proc, requiring a specific wait signal. When CommandFinished() is called, it immediately calls the waiting background proc so that it can complete.

There might be some value in putting the background proc back on the normal sleep queue instead so it runs at a more opportune time, but I can't see why any processing during this time would cause a problem.

I do have to wonder if maybe ChildHandler() should call break after the CommandFinished() call. I mean once CommandFinished() has been called, there's no need to continue the loop, is there?

Jan 11 2017, 11:13 am

In response to Lummox JR

Audeuro

Lummox JR wrote:

When shell() is called, it puts the proc to sleep as a background proc, requiring a specific wait signal. When CommandFinished() is called, it immediately calls the waiting background proc so that it can complete.

You don't really want to be doing this because of the way signal handlers are. When the signal comes in, like an interrupt, it basically saves the current instruction pointer and register state then jumps to your signal handler, which can also be interrupted at any time. Once the signal handler is done, previous instruction pointer and register state is restored and your normal control is resumed.

As you change state within this signal handler, you're also potentially changing state that your normal thread of execution may rely upon and delaying any other children that may need to be reaped. This opens up potential for a lot of unique kinds of problems that may be difficult to envision, but changing state when you're not necessarily expecting it is generally not a good thing.

There might be some value in putting the background proc back on the normal sleep queue instead so it runs at a more opportune time, but I can't see why any processing during this time would cause a problem.

I think this is a good idea, mostly for the above reasons.

I do have to wonder if maybe ChildHandler() should call break after the CommandFinished() call. I mean once CommandFinished() has been called, there's no need to continue the loop, is there?

There actually is- in cases like Hikato's and likely SS13's. SIGCHLD may only be queued up once, so if multiple child processes terminate around the same time, you'll have exactly one SIGCHLD that's expected to reap all of them.

The way you're doing it now is the way to go so that you can reap anything that's pending at that time. Anything terminated after your signal handler is done will then generate another SIGCHLD.

Jan 11 2017, 12:12 pm

Lummox JR

Also relevant to this discussion: There's a routine called TickBgProcs() that looks specifically for shell() cases and if there's a background proc for it, it calls IsCommandFinished() and checks on that. IsCommandFinished() is implemented on the server as follows:

int status;
if(pid != waitpid(pid,&status,WNOHANG)) return false;
return GetExitCode(status,ret_code);

Is there, in your opinion, any chance this could interact badly with the loop in ChildHandler()?

I know this code is rather important for Windows BTW, but for Linux I don't know that we really need two methods checking on the same thing.

Jan 11 2017, 12:19 pm

In response to Lummox JR

Audeuro

Yeah, this should be safe but not likely to do much for the Linux case. You're more likely to end up reaping the child from the SIGCHLD handler than from here, since SIGCHLD will come in as soon as it ends.

One thing to consider, though, if you're going to change the SIGCHLD handler to move the proc back onto the normal sleep queue: this (SIGCHLD handler) *could* happen while you're in the middle of TickBgProcs() or this isCommandFinished(), so you'll need to be prepared for that possibility as you do that.

It might be ideal to have the SIGCHLD handler not actually move the proc into the normal queue but rather flag it as finished in some other method so that TickBgProcs() or isCommandFinished() can naturally resume execution on it or move it around as needed. Invalidating the queue you're iterating over could be bad, depending on how the queue is designed.

Jan 23 2017, 7:46 pm (Edited on Jan 23 2017, 7:59 pm)
Hikato	So from what I gather, a fix is in the open, just need to write and test? At least a Verified tag. :(

Jan 26 2017, 12:42 pm In response to Hikato
Audeuro	Hikato wrote: So from what I gather, a fix is in the open, just need to write and test? That's what I'm assuming, given the radio silence. =O

Jan 26 2017, 4:30 pm (Edited on Jan 26 2017, 4:36 pm)
Hikato	Or I'm not hosting SS13 so it don't matter. :)

Jan 27 2017, 3:46 pm
Lummox JR	I haven't made the proposed change yet, but it's on my list. I've kept this open as a tab so I can remind myself of it.

Jan 27 2017, 5:21 pm
Hikato	Do you have an ETA to attempt to get a fix out? This is a year(s) old bug, and pretty serious. I'd hoped it'd be a little higher in priority than some graphical things.

Jan 27 2017, 9:15 pm
Lummox JR	I'll get to it very soon. There was actually more happening under the hood this week than went into this release, and I didn't have a chance to sit down and put this change through properly.

Feb 2 2017, 2:56 pm
Lummox JR	Well, shoot. It turns out my great idea of spawning the proc call instead of handling it immediately is so great that it's what we're already doing. I was wrong about the return proc being processed right away. The return is in fact spawned, always.

Feb 2 2017, 5:41 pm (Edited on Feb 2 2017, 5:53 pm)

In response to Lummox JR

Audeuro

Lummox JR wrote:

Well, shoot. It turns out my great idea of spawning the proc call instead of handling it immediately is so great that it's what we're already doing. I was wrong about the return proc being processed right away. The return is in fact spawned, always.

This means that you're shuffling queues around in a signal handler, or no? Because that's still not a good idea-- you could be using the queue in normal execution flow. The only thing that you can consider safe, for this purpose, is to write to a variable that gets periodically read solely for signalling that the shell() ended.

Feb 2 2017, 8:14 pm
Lummox JR	Hrm. I wonder if the problem is more that the child handler is interrupting the regular timer event or vice-versa.