Runaway threads with opendkim-2.3.2

From: Gary Mills <mills_at_cc.umanitoba.ca>
Date: Mon, 9 May 2011 13:09:25 -0500

As I noted in my earlier messages to this mailing list, I ran
opendkim-2.3.2 on our production e-mail server for two weekends with
no ill effects. This time I decided to leave it running on Monday,
our busiest day for e-mail. Today, I noticed a few incidents of
runaway threads. At 09:49:53, I noticed this from `prstat -u daemon':

  PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
 16720 daemon 258M 236M cpu10 59 0 51:01:57 0.1% j-chkmail/258
  1451 daemon 45M 41M sleep 59 0 160:08:46 0.0% dccm/253
 28715 daemon 39M 36M sleep 59 0 0:08:48 0.0% opendkim/712

It shows our three sendmail filters, with opendkim having more than
twice as many threads as the others. Later, at 10:48:57, it seemed to
have recovered:

  PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
 16720 daemon 263M 241M sleep 59 0 51:04:08 0.1% j-chkmail/468
 28715 daemon 26M 23M sleep 59 0 0:09:38 0.0% opendkim/206
  1451 daemon 47M 43M sleep 58 0 160:12:25 0.0% dccm/464

Then, at 11:38:08, the thread count was over 2000:

   PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
 28715 daemon 114M 111M cpu15 0 0 0:23:40 6.3% opendkim/2373
 16720 daemon 259M 237M sleep 59 0 51:05:59 0.1% j-chkmail/298
  1451 daemon 45M 41M sleep 59 0 160:14:46 0.0% dccm/293

The first incident seems to have stemmed from this e-mail message:

    May 9 09:46:21 electra sm-mta[13129]: [ID 801593 mail.info] p49EkEdT013129: from=<XXXX_at_serc.mb.ca>, size=26112, class=0, nrcpts=1, msgid=<a3ca5816.1cc0e58.75150c27.4823_at_serc.mb.ca>, proto=ESMTP, daemon=MTA, relay=mail.serc.mb.ca [205.200.66.55]
    May 9 09:47:59 electra sm-mta[13129]: [ID 801593 mail.error] p49EkEdT013129: Milter (dkim): timeout before data read, where=eom
    May 9 09:47:59 electra sm-mta[13129]: [ID 801593 mail.info] p49EkEdT013129: Milter (dkim): to error state
    May 9 09:48:00 electra sm-mta[13129]: [ID 801593 mail.info] p49EkEdT013129: Milter add: header: X-DCC-UofM-Metrics: electra 1033; Body=1 Fuz1=1 Fuz2=1
    May 9 09:49:03 electra sm-mta[14344]: [ID 801593 mail.info] p49EkEdT013129: to=<YYYY_at_cc.umanitoba.ca>, delay=00:02:48, xdelay=00:01:03, mailer=local, pri=56366, relay=localhost, dsn=2.0.0, stat=Sent
    May 9 09:49:41 electra opendkim[28715]: [ID 184845 mail.error] p49EkEdT013129: Authentication-Results header add failed

At the time of the last incident, this error showed up for the first time:

    May 9 11:46:17 electra opendkim[28715]: [ID 909925 mail.error] OpenDKIM Filter: thread_create() failed: 11, try again

I got a core with `gcore' and restarted the service. It's behaviour
returned to normal. This is a stack trace from the core file:

core 'core.28715' of 28715: opendkim -x /etc/mail/opendkim.conf
----------------- lwp# 1 / thread# 1 --------------------
 feada795 pollsys (8036ef0, 0, 8036f40, 0)
 fea83e36 pselect (0, feb50300, feb50300, feb50300, 8036f40, 0) + 18e
 fea8412c select (0, 0, 0, 0, 8037018, 0) + 82
 fef366ce mi_listener (809dd60, 0, 80a2240, 1c2a, 80, 1) + 4be
 fef33671 smfi_main (8047d74, feffb800, 0, 0, 0, 0) + 79
 08068581 main (3, 8047db8, 8047dc8, 8047dac) + 2c51
 0805766d _start (3, 8047e60, 8047e69, 8047e6c, 0, 8047e84) + 7d
----------------- lwp# 2 / thread# 2 --------------------
 fead9b75 sigtimedwait (fe90eef0, fe90ef00, 0)
 feacc71d sigwait (fe90efc0) + 1c
 feac4b82 __posix_sigwait (fe90efc0, fe90efd0) + 2e
 08059b3b dkimf_reloader (0) + 4b
 fead73b7 _thr_setup (fe950200) + 4e
 fead76a0 _lwp_start (fe950200, 0, 0, fe90eff8, fead76a0, fe950200)
----------------- lwp# 3 / thread# 3 --------------------
 fead9b75 sigtimedwait (fe80ff00, fe80ff10, 0)
 feacc71d sigwait (fe80ffc0, 2, fe80ffdc, fef382a9) + 1c
 fef382f7 mi_signal_thread (80a0738) + 5b
 fead73b7 _thr_setup (fe950a00) + 4e
 fead76a0 _lwp_start (fe950a00, 0, 0, fe80fff8, fead76a0, fe950a00)
----------------- lwp# 10 / thread# 10 --------------------
 fead7719 lwp_park (0, 0, 0)
 fead0e9f mutex_lock_impl (87a0af0, 0) + 102
 fead0f78 mutex_lock (87a0af0, 0) + 1a
 fef54840 ar_dispatcher (80a1220) + f70
 fead73b7 _thr_setup (fe953200) + 4e
 fead76a0 _lwp_start (fe953200, 0, 0, fe314ff8, fead76a0, fe953200)
----------------- lwp# 288851 / thread# 288851 --------------------
 fead7719 lwp_park (0, 0, 0)
 fead0e9f mutex_lock_impl (80a126c, 0) + 102
 fead0f78 mutex_lock (80a126c, 0) + 1a
 fef559a6 ar_addquery (80a1220, a1f7630, 1, 1, 3, f5e221c4) + 1d6
 0806df9c dkimf_ar_query (80a1220, 1, a1f7630, f5e221c4, 2000, f5e241d4) + 12c
 fef7b415 dkim_get_policy_dns_excheck (a1fa650, a1f7630, f5e26be8) + 185
 fef7bb7b dkim_get_policy_dns (a1fa650, a1f7630, 1, f5e267db, 401, f5e26be8) + 1ab
 fef83b36 dkim_get_policy (a1fa650, a1f7630, 1, f5e26d34, f5e26d28, f5e26d30) + 286
 fef8877d dkim_policy (a1fa650, 81a9e20, 81a9e2c, 0, 0, 0) + 11d
 08064268 mlfi_eom (91dadd8, 10003, 1000000, fef3523e) + dc8
 fef352b8 st_bodyend (f5e28f70, a, f5e28f98, fef33787) + 88
 fef33a24 mi_engine (91dadd8, fef49ea4, f5e28fb8, fef36c2d) + 35c
 fef36c48 mi_handle_session (91dadd8, 0, fc829200, fef36090, fead5d23) + 38
 fef360a2 mi_thread_handle_wrapper (91dadd8) + 1e
 fead73b7 _thr_setup (fc829200) + 4e
 fead76a0 _lwp_start (fc829200, 0, 0, f5e28ff8, fead76a0, fc829200)
----------------- lwp# 289401 / thread# 289401 --------------------
 feada795 pollsys (fd027ed0, 1, fd027ea0, 0)
 fea7f8be poll (fd027ed0, 1, 6e0410, fef36ca6) + 52
 fef36cee mi_rd_cmd (38, fd027f68, fd027f63, fd027f64, 80a0738, fef49ea4) + 56
 fef337ac mi_engine (83d49e0, fef49ea4, fd027fb8, fef36c2d) + e4
 fef36c48 mi_handle_session (83d49e0, 0, f0801200, fef36090, fead5d23) + 38
 fef360a2 mi_thread_handle_wrapper (83d49e0) + 1e
 fead73b7 _thr_setup (f0801200) + 4e
 fead76a0 _lwp_start (f0801200, 0, 0, fd027ff8, fead76a0, f0801200)
...
----------------- lwp# 299233 / thread# 299233 --------------------
 fead7719 lwp_park (0, 0, 0)
 fead0e9f mutex_lock_impl (80a126c, 0) + 102
 fead0f78 mutex_lock (80a126c, 0) + 1a
 fef559a6 ar_addquery (80a1220, be8fa00, 1, 1, 3, 2f1101c4) + 1d6
 0806df9c dkimf_ar_query (80a1220, 1, be8fa00, 2f1101c4, 2000, 2f1121d4) + 12c
 fef7b415 dkim_get_policy_dns_excheck (be8ffa0, be8fa00, 2f114be8) + 185
 fef7bb7b dkim_get_policy_dns (be8ffa0, be8fa00, 1, 2f1147db, 401, 2f114be8) + 1ab
 fef83b36 dkim_get_policy (be8ffa0, be8fa00, 1, 2f114d34, 2f114d28, 2f114d30) + 286
 fef8877d dkim_policy (be8ffa0, be8de08, be8de14, 0, 0, 0) + 11d
 08064268 mlfi_eom (922bb00, 10003, 1000000, fef3523e) + dc8
 fef352b8 st_bodyend (2f116f70, a, 2f116f98, fef33787) + 88
 fef33a24 mi_engine (922bb00, fef49ea4, 2f116fb8, fef36c2d) + 35c
 fef36c48 mi_handle_session (922bb00, 0, 8068ca00, fef36090, fead5d23) + 38
 fef360a2 mi_thread_handle_wrapper (922bb00) + 1e
 fead73b7 _thr_setup (8068ca00) + 4e
 fead76a0 _lwp_start (8068ca00, 0, 0, 2f116ff8, fead76a0, 8068ca00)
----------------- lwp# 299238 / thread# 299238 --------------------
 feada795 pollsys (c46afed0, 1, c46afea0, 0)
 fea7f8be poll (c46afed0, 1, 6e0410, fef36ca6) + 52
 fef36cee mi_rd_cmd (390, c46aff68, c46aff63, c46aff64, 80a0738, fef49ea4) + 56
 fef337ac mi_engine (b5ba2c0, fef49ea4, c46affb8, fef36c2d) + e4
 fef36c48 mi_handle_session (b5ba2c0, 0, e065ca00, fef36090, fead5d23) + 38
 fef360a2 mi_thread_handle_wrapper (b5ba2c0) + 1e
 fead73b7 _thr_setup (e065ca00) + 4e
 fead76a0 _lwp_start (e065ca00, 0, 0, c46afff8, fead76a0, e065ca00)

The two types of traceback seem to alternate. What could be happening?

This was run on a 16-core Intel server. At busy times it was handling
over 10,000 SMTP sessions per hour. I suppose a cunningly-crafted e-mail
message could have caused this mayhem. An unprotected variable shared
between threads could also do it. Or, maybe it's a bug?

-- 
-Gary Mills-        -Unix Group-        -Computer and Network Services-
Received on Mon May 09 2011 - 18:09:36 PST

This archive was generated by hypermail 2.3.0 : Mon Oct 29 2012 - 23:33:10 PST