About a month ago, I upgraded our main e-mail server from
opendkim-2.0.2 to opendkim-2.2.2. At that time, it seemed fine. The
thread count of opendkim was tracking the thread count of the other
milter, dccm. Now with a higher load, I've noticed a problem. Here
are two snippets of `prstat -u daemon' taken about 1/2 hour apart:
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
464 daemon 99M 96M sleep 59 0 0:04:13 0.2% opendkim/2089
1731 daemon 2699M 2368M sleep 59 0 17:04:58 0.1% dccd/1
1451 daemon 47M 43M sleep 58 0 94:25:03 0.0% dccm/577
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
464 daemon 170M 167M sleep 59 0 0:05:22 0.3% opendkim/3816
1731 daemon 2699M 2370M sleep 59 0 17:05:04 0.1% dccd/1
1451 daemon 43M 40M sleep 58 0 94:25:31 0.0% dccm/90
The thread count of opendkim is growing but never decreasing. I see
from the logs for the SMF service that the processes is exiting every
day or two, and being restarted by SMF...
[ Jan 25 15:55:24 Stopping because all processes in service exited. ]
[ Jan 25 15:55:24 Executing stop method (:kill) ]
[ Jan 25 15:55:24 Executing start method ("/lib/svc/method/site/opendkim") ]
starting opendkim
started opendkim
[ Jan 25 15:55:24 Method "start" exited with status 0 ]
Here are the syslog messages from that event:
Jan 25 15:54:53 electra opendkim[4975]: [ID 909925 mail.error] OpenDKIM Filter: thread_create() failed: 11, try again
Jan 25 15:55:08 electra opendkim[4975]: [ID 909925 mail.error] OpenDKIM Filter: thread_create() failed: 11, abort
Jan 25 15:55:16 electra opendkim[4975]: [ID 109917 mail.error] OpenDKIM Filter, mi_rd_cmd: read returned -1: Connection reset by peer
Jan 25 15:55:16 electra opendkim[4975]: [ID 699540 mail.debug] p0PLOMoG008746: no signature data
Jan 25 15:55:21 electra opendkim[4975]: [ID 109917 mail.error] OpenDKIM Filter, mi_rd_cmd: read returned -1: Connection reset by peer
Jan 25 15:55:24 electra opendkim[4975]: [ID 555540 mail.info] OpenDKIM Filter v2.2.2 terminating with status -1, errno = 12
Jan 25 15:55:24 electra opendkim[464]: [ID 482974 mail.info] OpenDKIM Filter v2.2.2 starting (args: -x /etc/mail/opendkim.conf)
I also have a partial stack trace of the existing process:
# pstack 464 | head -44
464: opendkim -x /etc/mail/opendkim.conf
----------------- lwp# 1 / thread# 1 --------------------
feafa795 pollsys (8036ffc, 1, 8036fb0, 0)
fea9f8be poll (8036ffc, 1, 1388, 8075f11) + 52
0807598b mi_listener (80ade00, 0, 80acef8, 1c2a, 80, 1) + cf
08074d4b smfi_main (8047d74, feffb800, 0, 0, 0, 0) + 67
0806a885 main (3, 8047db8, 8047dc8, 8047dac) + 2b95
0805a00d _start (3, 8047e60, 8047e69, 8047e6c, 0, 8047e84) + 7d
----------------- lwp# 2 / thread# 2 --------------------
feafb1d5 writev (5, fe94ed04, 2)
fef52be9 ar_sendquery (80ad880, 8c93b00) + 279
fef54252 ar_dispatcher (80ad880) + 1162
feaf73b7 _thr_setup (fe840200) + 4e
feaf76a0 _lwp_start (fe840200, 0, 0, fe94eff8, feaf76a0, fe840200)
----------------- lwp# 3 / thread# 3 --------------------
feafa795 pollsys (fe83ec70, 2, fe83ed18, 0)
feaa3e36 pselect (a, fe83ef3c, fe83eebc, feb70300, fe83ed18, 0) + 18e
feaa412c select (a, fe83ef3c, fe83eebc, 0, fe83eeb0, 0) + 82
fef53543 ar_dispatcher (80be9c8) + 453
feaf73b7 _thr_setup (fe840a00) + 4e
feaf76a0 _lwp_start (fe840a00, 0, 0, fe83eff8, feaf76a0, fe840a00)
----------------- lwp# 4 / thread# 4 --------------------
feaf9b75 sigtimedwait (fe73fef0, fe73ff00, 0)
feaec71d sigwait (fe73ffc0) + 1c
feae4b82 __posix_sigwait (fe73ffc0, fe73ffd0) + 2e
0805c43b dkimf_reloader (0) + 4b
feaf73b7 _thr_setup (fe841200) + 4e
feaf76a0 _lwp_start (fe841200, 0, 0, fe73fff8, feaf76a0, fe841200)
----------------- lwp# 5 / thread# 5 --------------------
feaf9b75 sigtimedwait (fe640f00, fe640f10, 0)
feaec71d sigwait (fe640fc0, 2, fe640fdc, feaf6ba7) + 1c
0807705b mi_signal_thread (80abeb0) + 4b
feaf73b7 _thr_setup (fe841a00) + 4e
feaf76a0 _lwp_start (fe841a00, 0, 0, fe640ff8, feaf76a0, fe841a00)
----------------- lwp# 96452 / thread# 96452 --------------------
feaf7719 lwp_park (0, e3abec80, 0)
feaf1efe cond_wait_queue (8ac7868, 8ac7878, e3abec80) + 5e
feaf227b cond_wait_common (8ac7868, 8ac7878, e3abec80) + 1db
feaf24ad _cond_timedwait (8ac7868, 8ac7878, e3abecf8) + 51
feaf2518 cond_timedwait (8ac7868, 8ac7878, e3abecf8) + 24
feaf2554 pthread_cond_timedwait (8ac7868, 8ac7878, e3abecf8) + 1e
fef5590b ar_waitreply (80ad880, 8ac7840, e3abed3c, e3abee90, 0, 0) + 2db
0806fcf8 dkimf_ar_waitreply (80ad880, 8ac7840, e3abee90, e3ac13b8, 0, 8b4a110) + a8
fef7b21b dkim_get_policy_dns (8b4a0e0, e3ac1c7f, 0, e3ac1833, 401, e3ac1c40) + 2bb
I notice from this that thread #3 is doing select(3C). The man page
states:
The poll(2) function is preferred over this function. It
must be used when the number of file descriptors exceeds
FD_SETSIZE.
`dccm' does use poll(), as does sendmail. I believe that FD_SETSIZE
is defined to be 1024 for a 32-bit process. This could be the cause
of the problem, although perhaps select() only fails when something
else causes the thread count to grow.
--
-Gary Mills- -Unix Group- -Computer and Network Services-
Received on Wed Jan 26 2011 - 05:07:49 PST