Excessive threads in opendkim-2.2.2 on Solaris 10

From: Gary Mills <mills_at_cc.umanitoba.ca>
Date: Tue, 25 Jan 2011 23:07:38 -0600

About a month ago, I upgraded our main e-mail server from
opendkim-2.0.2 to opendkim-2.2.2. At that time, it seemed fine. The
thread count of opendkim was tracking the thread count of the other
milter, dccm. Now with a higher load, I've noticed a problem. Here
are two snippets of `prstat -u daemon' taken about 1/2 hour apart:

   PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
   464 daemon 99M 96M sleep 59 0 0:04:13 0.2% opendkim/2089
  1731 daemon 2699M 2368M sleep 59 0 17:04:58 0.1% dccd/1
  1451 daemon 47M 43M sleep 58 0 94:25:03 0.0% dccm/577

   PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
   464 daemon 170M 167M sleep 59 0 0:05:22 0.3% opendkim/3816
  1731 daemon 2699M 2370M sleep 59 0 17:05:04 0.1% dccd/1
  1451 daemon 43M 40M sleep 58 0 94:25:31 0.0% dccm/90

The thread count of opendkim is growing but never decreasing. I see
from the logs for the SMF service that the processes is exiting every
day or two, and being restarted by SMF...

  [ Jan 25 15:55:24 Stopping because all processes in service exited. ]
  [ Jan 25 15:55:24 Executing stop method (:kill) ]
  [ Jan 25 15:55:24 Executing start method ("/lib/svc/method/site/opendkim") ]
  starting opendkim
  started opendkim
  [ Jan 25 15:55:24 Method "start" exited with status 0 ]

Here are the syslog messages from that event:

  Jan 25 15:54:53 electra opendkim[4975]: [ID 909925 mail.error] OpenDKIM Filter: thread_create() failed: 11, try again
  Jan 25 15:55:08 electra opendkim[4975]: [ID 909925 mail.error] OpenDKIM Filter: thread_create() failed: 11, abort
  Jan 25 15:55:16 electra opendkim[4975]: [ID 109917 mail.error] OpenDKIM Filter, mi_rd_cmd: read returned -1: Connection reset by peer
  Jan 25 15:55:16 electra opendkim[4975]: [ID 699540 mail.debug] p0PLOMoG008746: no signature data
  Jan 25 15:55:21 electra opendkim[4975]: [ID 109917 mail.error] OpenDKIM Filter, mi_rd_cmd: read returned -1: Connection reset by peer
  Jan 25 15:55:24 electra opendkim[4975]: [ID 555540 mail.info] OpenDKIM Filter v2.2.2 terminating with status -1, errno = 12
  Jan 25 15:55:24 electra opendkim[464]: [ID 482974 mail.info] OpenDKIM Filter v2.2.2 starting (args: -x /etc/mail/opendkim.conf)

I also have a partial stack trace of the existing process:

  # pstack 464 | head -44
  464: opendkim -x /etc/mail/opendkim.conf
  ----------------- lwp# 1 / thread# 1 --------------------
   feafa795 pollsys (8036ffc, 1, 8036fb0, 0)
   fea9f8be poll (8036ffc, 1, 1388, 8075f11) + 52
   0807598b mi_listener (80ade00, 0, 80acef8, 1c2a, 80, 1) + cf
   08074d4b smfi_main (8047d74, feffb800, 0, 0, 0, 0) + 67
   0806a885 main (3, 8047db8, 8047dc8, 8047dac) + 2b95
   0805a00d _start (3, 8047e60, 8047e69, 8047e6c, 0, 8047e84) + 7d
  ----------------- lwp# 2 / thread# 2 --------------------
   feafb1d5 writev (5, fe94ed04, 2)
   fef52be9 ar_sendquery (80ad880, 8c93b00) + 279
   fef54252 ar_dispatcher (80ad880) + 1162
   feaf73b7 _thr_setup (fe840200) + 4e
   feaf76a0 _lwp_start (fe840200, 0, 0, fe94eff8, feaf76a0, fe840200)
  ----------------- lwp# 3 / thread# 3 --------------------
   feafa795 pollsys (fe83ec70, 2, fe83ed18, 0)
   feaa3e36 pselect (a, fe83ef3c, fe83eebc, feb70300, fe83ed18, 0) + 18e
   feaa412c select (a, fe83ef3c, fe83eebc, 0, fe83eeb0, 0) + 82
   fef53543 ar_dispatcher (80be9c8) + 453
   feaf73b7 _thr_setup (fe840a00) + 4e
   feaf76a0 _lwp_start (fe840a00, 0, 0, fe83eff8, feaf76a0, fe840a00)
  ----------------- lwp# 4 / thread# 4 --------------------
   feaf9b75 sigtimedwait (fe73fef0, fe73ff00, 0)
   feaec71d sigwait (fe73ffc0) + 1c
   feae4b82 __posix_sigwait (fe73ffc0, fe73ffd0) + 2e
   0805c43b dkimf_reloader (0) + 4b
   feaf73b7 _thr_setup (fe841200) + 4e
   feaf76a0 _lwp_start (fe841200, 0, 0, fe73fff8, feaf76a0, fe841200)
  ----------------- lwp# 5 / thread# 5 --------------------
   feaf9b75 sigtimedwait (fe640f00, fe640f10, 0)
   feaec71d sigwait (fe640fc0, 2, fe640fdc, feaf6ba7) + 1c
   0807705b mi_signal_thread (80abeb0) + 4b
   feaf73b7 _thr_setup (fe841a00) + 4e
   feaf76a0 _lwp_start (fe841a00, 0, 0, fe640ff8, feaf76a0, fe841a00)
  ----------------- lwp# 96452 / thread# 96452 --------------------
   feaf7719 lwp_park (0, e3abec80, 0)
   feaf1efe cond_wait_queue (8ac7868, 8ac7878, e3abec80) + 5e
   feaf227b cond_wait_common (8ac7868, 8ac7878, e3abec80) + 1db
   feaf24ad _cond_timedwait (8ac7868, 8ac7878, e3abecf8) + 51
   feaf2518 cond_timedwait (8ac7868, 8ac7878, e3abecf8) + 24
   feaf2554 pthread_cond_timedwait (8ac7868, 8ac7878, e3abecf8) + 1e
   fef5590b ar_waitreply (80ad880, 8ac7840, e3abed3c, e3abee90, 0, 0) + 2db
   0806fcf8 dkimf_ar_waitreply (80ad880, 8ac7840, e3abee90, e3ac13b8, 0, 8b4a110) + a8
   fef7b21b dkim_get_policy_dns (8b4a0e0, e3ac1c7f, 0, e3ac1833, 401, e3ac1c40) + 2bb
  
I notice from this that thread #3 is doing select(3C). The man page
states:

     The poll(2) function is preferred over this function. It
     must be used when the number of file descriptors exceeds
     FD_SETSIZE.

`dccm' does use poll(), as does sendmail. I believe that FD_SETSIZE
is defined to be 1024 for a 32-bit process. This could be the cause
of the problem, although perhaps select() only fails when something
else causes the thread count to grow.

-- 
-Gary Mills-        -Unix Group-        -Computer and Network Services-
Received on Wed Jan 26 2011 - 05:07:49 PST

This archive was generated by hypermail 2.3.0 : Mon Oct 29 2012 - 23:33:08 PST