Re: Excessive threads in opendkim-2.2.2 on Solaris 10

From: Gary Mills <mills_at_cc.umanitoba.ca>
Date: Wed, 26 Jan 2011 18:49:36 -0600

On Wed, Jan 26, 2011 at 10:17:35AM -0800, Murray S. Kucherawy wrote:
> > -----Original Message-----
> > From: opendkim-dev-bounce_at_lists.opendkim.org [mailto:opendkim-dev-bounce_at_lists.opendkim.org] On Behalf Of Gary Mills
> > Sent: Wednesday, January 26, 2011 5:48 AM
> > To: Murray S. Kucherawy
> > Cc: opendkim-dev_at_lists.opendkim.org
> > Subject: Re: Excessive threads in opendkim-2.2.2 on Solaris 10
> >
> > This is interesting. This morning, the thread count of the same
> > process is back to normal:
> >
> >    PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
> >   1731 daemon   2699M 2180M sleep   59    0  17:07:58 0.0% dccd/1
> >    464 daemon     78M   75M sleep   59    0   0:21:47 0.0% opendkim/62
> >   1451 daemon     43M   39M sleep   58    0  94:36:26 0.0% dccm/61
> >   1732 daemon    782M  106M sleep   59    0   3:42:52 0.0% dccd/1
>
> I think my first plan of attack will be to add compile-time support
> for poll() instead of select(), and you should be able to see if
> that helps or not. Second will be more robust handling of a partial
> writev() and doing more frequent checking to see if the descriptor
> can't accept any more data.

That sounds good to me. I don't know why the nameserver would stop
accepting queries, but I suppose it could be overloaded. This happens
occasionally when our Internet connection is down, so that the
nameserver can't resolve recursive queries. They eventually time out,
as do the clients. Any query will need a timeout of some sort.
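
As a rough illustration of the two points above (poll() in place of
select(), and a timeout on every wait), here is a minimal sketch. The
helper name wait_readable() is hypothetical and is not taken from the
opendkim source; it only shows that poll() takes the descriptor as a
plain int, so it does not depend on the size of an fd_set, and that
the timeout is passed in milliseconds.

```c
#include <errno.h>
#include <poll.h>

/*
 * Hypothetical helper (not from opendkim): wait until fd is readable or
 * timeout_ms milliseconds have passed.
 * Returns 1 if the descriptor woke up, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the wait.  Note that each retry
     * restarts the full timeout; a stricter version would track a
     * deadline instead. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc == -1 && errno == EINTR);

    if (rc <= 0)
        return rc;              /* 0: timed out, -1: poll() failed */

    if (pfd.revents & POLLNVAL)
        return -1;              /* fd is not an open descriptor */

    /* POLLIN, POLLHUP and POLLERR all end the wait; a following
     * read() on the descriptor reports which case it was. */
    return 1;
}
```

A caller would treat a return of 0 as "the nameserver did not answer
in time" and give up on or retry the query rather than block forever.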

> Another possible test would be to start your asynchronous resolver
> in TCP mode so that it never needs to go through the
> upgrade-and-resend-everything process. I'll add a switch for that
> in the next Beta as well.

I'm not sure that's a good idea. Here's a look at the nameserver
running on our e-mail machine:

    # rndc status
    version: 9.6.1-P3
    CPUs found: 16
    worker threads: 16
    number of zones: 86
    debug level: 0
    xfers running: 1
    xfers deferred: 0
    soa queries in progress: 1
    query logging is OFF
    recursive clients: 3/3900/4000
    tcp clients: 2/100
    server is up and running

It seems to allow only 100 simultaneous TCP clients. Almost all queries
will use UDP, and UDP should still be used for the first attempt.
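
For reference, the usual DNS convention is to send the query over UDP
and retry over TCP only when the reply comes back with the TC
(truncated) bit set in its header. The sketch below shows just that
check. The function name dns_reply_truncated() is hypothetical, and
this is not a description of how the resolver used by opendkim
actually decides to upgrade.

```c
#include <stddef.h>

#define DNS_HEADER_LEN 12      /* fixed DNS header size (RFC 1035) */
#define DNS_FLAG_TC    0x02    /* TC bit within the third header octet */

/*
 * Hypothetical helper: inspect a raw DNS reply received over UDP.
 * Returns 1 if the reply is truncated (retry over TCP), 0 if it is
 * complete, -1 if the buffer is too short to hold a DNS header.
 */
int dns_reply_truncated(const unsigned char *msg, size_t len)
{
    if (msg == NULL || len < DNS_HEADER_LEN)
        return -1;

    /* Octet 2 holds QR, Opcode, AA, TC and RD, in that bit order. */
    return (msg[2] & DNS_FLAG_TC) ? 1 : 0;
}
```

With this approach the TCP client slots on the nameserver are only
used for the few replies that do not fit in a UDP datagram.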
 
> Note the call to mi_rd_cmd(); the first parameter is the descriptor.
> Here it's 2b, which I presume is just a hex value so that's 43, far
> below 1024. In your report yesterday the highest descriptor I saw
> was 9, but it wasn't a complete listing. If the next time you see
> that you get a pstack output and grep for mi_rd_cmd, perhaps you'll
> see much higher descriptor numbers, maybe even approaching 1024.
> That would lend evidence to the theory that fd_set handling for
> select() is at least part of the problem.

In the one from this morning, there were 99 instances of mi_rd_cmd().
The highest file descriptor was 0x23b, or 571 decimal. That's close
enough to the 1024 limit to be worrisome.
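
To make the concern concrete: FD_SET() on a descriptor numbered at or
above FD_SETSIZE writes outside the fd_set, so code that stays with
select() needs a range check first. The sketch below is a hypothetical
guard, not code from opendkim or libmilter. The 1024 figure is the one
mentioned in this thread; the real FD_SETSIZE is whatever the platform
headers define at compile time.

```c
#include <sys/select.h>

/*
 * Hypothetical helper: add fd to set only if it fits in an fd_set.
 * Returns 0 on success, -1 if fd is negative or >= FD_SETSIZE.
 */
int fd_set_checked(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;      /* select() cannot watch this descriptor */

    FD_SET(fd, set);
    return 0;
}
```

A caller that gets -1 back has to refuse the connection or use another
mechanism, since select() cannot be made to watch that descriptor.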

I also checked the open file descriptors of the same process just now.
The current rlimit is 65536 file descriptors. The highest is 521.
There are a couple of TCP connections to the nameserver:

   5: S_IFSOCK mode:0666 dev:287,0 ino:2144 uid:0 gid:0 size:0
      O_RDWR
      SOCK_STREAM
      SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.192.0.0)
      sockname: AF_INET 127.0.0.1 port: 60980
      peername: AF_INET 127.0.0.1 port: 53
   8: S_IFSOCK mode:0666 dev:287,0 ino:49858 uid:0 gid:0 size:0
      O_RDWR
      SOCK_STREAM
      SO_SNDBUF(49152),SO_RCVBUF(49152),IP_NEXTHOP(0.192.0.0)
      sockname: AF_INET 127.0.0.1 port: 37772
      peername: AF_INET 127.0.0.1 port: 53

I didn't see any UDP connections, but they would be transient.
Are the TCP connections persistent, or did they just happen to get
captured in the output?

-- 
-Gary Mills-        -Unix Group-        -Computer and Network Services-
Received on Thu Jan 27 2011 - 00:49:46 PST
