RE: Excessive threads in opendkim-2.2.2 on Solaris 10

From: Murray S. Kucherawy <msk_at_cloudmark.com>
Date: Wed, 26 Jan 2011 10:17:35 -0800

> -----Original Message-----
> From: opendkim-dev-bounce_at_lists.opendkim.org [mailto:opendkim-dev-bounce_at_lists.opendkim.org] On Behalf Of Gary Mills
> Sent: Wednesday, January 26, 2011 5:48 AM
> To: Murray S. Kucherawy
> Cc: opendkim-dev_at_lists.opendkim.org
> Subject: Re: Excessive threads in opendkim-2.2.2 on Solaris 10
>
> This is interesting. This morning, the thread count of the same
> process is back to normal:
>
> PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
> 1731 daemon 2699M 2180M sleep 59 0 17:07:58 0.0% dccd/1
> 464 daemon 78M 75M sleep 59 0 0:21:47 0.0% opendkim/62
> 1451 daemon 43M 39M sleep 58 0 94:36:26 0.0% dccm/61
> 1732 daemon 782M 106M sleep 59 0 3:42:52 0.0% dccd/1

This might be a result of the fact that all the pending DKIM instances waiting for DNS replies finally gave up. Your logs from opendkim should tell you whether that's the case. You're back at a tolerable load here from the looks of things.

> Thanks for all the analysis in your previous message. Yes, select()
> calls pollsys(), but the damage has been done by then. poll() also
> calls pollsys() but poll() is able to handle any number of file
> descriptors. That's the difference.

That's true, the setup for select() could be trying to create fd_sets using descriptors select() can't handle, thus messing with adjacent memory. That would only happen though if the descriptor use got that high, which would definitely not be the case in typical operation as opendkim doesn't need that may descriptors on an ongoing basis.

I think my first plan of attack will be to add compile-time support for poll() instead of select(), and you should be able to see if that helps or not. Second will be more robust handling of a partial writev() and doing more frequent checking to see if the descriptor can't accept any more data.

Another possible test would be to start your asynchronous resolver in TCP mode so that it never needs to go through the upgrade-and-resend-everything process. I'll add a switch for that in the next Beta as well.

> ----------------- lwp# 2 / thread# 2 --------------------
> feafa795 pollsys (fe94ec90, 2, fe94ed18, 0)
> feaa3e36 pselect (7, fe94ef3c, fe94eebc, feb70300, fe94ed18, 0) + 18e
> feaa412c select (7, fe94ef3c, fe94eebc, 0, fe94eeb0, 0) + 82
> fef53543 ar_dispatcher (80ad880) + 453
> feaf73b7 _thr_setup (fe840200) + 4e
> feaf76a0 _lwp_start (fe840200, 0, 0, fe94eff8, feaf76a0, fe840200)
> ----------------- lwp# 3 / thread# 3 --------------------
> feafa795 pollsys (fe83ec70, 2, fe83ed18, 0)
> feaa3e36 pselect (a, fe83ef3c, fe83eebc, feb70300, fe83ed18, 0) + 18e
> feaa412c select (a, fe83ef3c, fe83eebc, 0, fe83eeb0, 0) + 82
> fef53543 ar_dispatcher (80be9c8) + 453
> feaf73b7 _thr_setup (fe840a00) + 4e
> feaf76a0 _lwp_start (fe840a00, 0, 0, fe83eff8, feaf76a0, fe840a00)

Again, this is odd. I don't know why you're seeing two different threads in select() inside ar_dispatcher() when opendkim only creates one instance.

Versions prior to to 2.1.3 could instantiate libar twice, once in TCP mode and once in UDP mode. But if this is from 2.2.2, that doesn't seem to be possible.

> ----------------- lwp# 301785 / thread# 301785 --------------------
> feafa795 pollsys (fdb4bef0, 1, fdb4bec0, 0)
> fea9f8be poll (fdb4bef0, 1, 6e0410, 0) + 52
> 08079563 mi_rd_cmd (2b, fdb4bf68, fdb4bf63, fdb4bf64, 80abeb0, 2b) +
> 47
> 0807788a mi_engine (9b36240, 9b36240, fdb4bfb8, 807613f) + 82
> 0807615a mi_handle_session (9b36240, 0, e0840a00, feb6e000, feaf5d23,
> feb6e000) + 2e
> 08075752 mi_thread_handle_wrapper (9b36240) + e
> feaf73b7 _thr_setup (e0840a00) + 4e
> feaf76a0 _lwp_start (e0840a00, 0, 0, fdb4bff8, feaf76a0, e0840a00)

All but one of the remaining threads looked like this. That call stack means the thread is sitting inside poll() waiting for an MTA to say something. So you have a lot of connections open but none of them are stuck waiting on the nameserver. (There was only one of those in this report.)

Note the call to mi_rd_cmd(); the first parameter is the descriptor. Here it's 2b, which I presume is just a hex value so that's 43, far below 1024. In your report yesterday the highest descriptor I saw was 9, but it wasn't a complete listing. If the next time you see that you get a pstack output and grep for mi_rd_cmd, perhaps you'll see much higher descriptor numbers, maybe even approaching 1024. That would lend evidence to the theory that fd_set handling for select() is at least part of the problem.

-MSK
Received on Wed Jan 26 2011 - 18:17:44 PST

This archive was generated by hypermail 2.3.0 : Mon Oct 29 2012 - 23:33:08 PST