RE: Revised stats gathering approach from Murray S. Kucherawy on 2010-08-16 (OpenDKIM Developers mailing list)

From: Murray S. Kucherawy <msk_at_cloudmark.com>
Date: Mon, 16 Aug 2010 09:51:02 -0700

> -----Original Message-----
> From: opendkim-dev-bounce_at_lists.opendkim.org [mailto:opendkim-dev-
> bounce_at_lists.opendkim.org] On Behalf Of Alessandro Vesely
> Sent: Sunday, August 15, 2010 4:53 AM
> To: Murray S. Kucherawy
> Cc: opendkim-dev_at_lists.opendkim.org
> Subject: Re: Revised stats gathering approach
>
> On 14/Aug/10 16:52, Murray S. Kucherawy wrote:
> > [Besides envelope IDs], unfortunately, the 5322.Message-ID cannot be
> trusted to be unique.
>
> How are we going to stuff multiple messages, possibly being logged
> concurrently, onto the same log file, then? I think a "best effort"
> ID would suffice, for statistical purposes, since a conservative sort
> that groups log lines by equal ID would almost always yield the
> correct result. In that case, fopen(log_file, "a") and line-by-line
> logging should suffice.

Yes, that's my plan. Part of the record is also a timestamp, and there's apparently a one-in-2^15 chance that a Postfix job ID will be recycled within the same second, so a composite key of the reporting host, job ID and timestamp should prevent any ambiguity once the record gets inserted into an SQL table.
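To make the composite key concrete, here is a minimal sketch using Python's sqlite3 module. The table name, column names and the sigcount payload column are placeholders invented for illustration; they are not the actual schema, which is not shown in this message.

```python
import sqlite3

# In-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        reporter TEXT    NOT NULL,  -- reporting host
        job_id   TEXT    NOT NULL,  -- MTA job (queue) ID
        msgtime  INTEGER NOT NULL,  -- timestamp, seconds since the epoch
        sigcount INTEGER,           -- hypothetical payload column
        PRIMARY KEY (reporter, job_id, msgtime)
    )
    """
)


def insert_record(reporter, job_id, msgtime, sigcount):
    """Insert one record; return False if the composite key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO messages VALUES (?, ?, ?, ?)",
                (reporter, job_id, msgtime, sigcount),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With this key, a job ID recycled on the same host in a later second, or the same job ID reported by a different host, is stored as a separate row; only an exact (host, job ID, timestamp) repeat is rejected.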

With any luck I'll have the revised schema and corresponding documentation visible via CVS later today.

> The latter part, breakage diagnosis, may or may not be available.
> When it is available, its format probably varies widely. In case the
> log consists of (part of) a message's header, diagnoses may reside in
> comments given in the A-R fields.

A-R information is not complete in terms of failure diagnosis. It only reports results, not causes. There's a bunch of metadata available through the libopendkim interface that's not recorded in the original headers or in the A-R fields, and those data are part of what's currently recorded in the tabular form.

> It would still be useful to know whether message rewriting took place
> before verification, if there were any ">From" in the body, etcetera.

That would probably require a level of participation from the MTA we're unlikely to get without expensive amounts of patching. I'm hoping to include this in a release in just over a month, so there's probably no bandwidth for such an undertaking.

> Ordered columns save space at the cost of making the logs less
> flexible. The log-spec may say what header fields and what field
> tags are not needed, but a tiny site may not care and just dump all,
> in order to save the hassle of parsing.

Ultimately the log files get translated into SQL inserts, so there seems to be little point in collecting data the SQL schema doesn't accommodate.
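As a rough sketch of that translation step, assuming a tab-separated log line whose columns appear in a fixed order: the column list and table name below are hypothetical stand-ins, not the real schema.

```python
# Hypothetical column order; the real log format is not defined in this thread.
COLUMNS = ("reporter", "job_id", "msgtime", "sigcount")


def line_to_insert(line, table="messages"):
    """Turn one ordered, tab-separated log line into a parameterized INSERT.

    Returns (sql, params). Lines with the wrong number of columns are
    rejected, since the schema has nowhere to put extra data.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(
            "expected %d columns, got %d" % (len(COLUMNS), len(fields))
        )
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table,
        ", ".join(COLUMNS),
        placeholders,
    )
    return sql, tuple(fields)
```

The returned pair can be passed straight to a DB-API cursor's execute() method. Anything the schema has no column for fails at parse time instead of being carried along in the log.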

Of course a local admin interested in collecting stuff for a locally-extended schema would find such logging useful, but that would also require extension of the data set parsed from such a log, and a sysadmin with such demands is already doing the work of hacking the logging and the schema anyway.

> A simple format has the advantage of allowing log contributions from
> sites that neither use opendkim nor libopendkim. Although we can
> provide scripts or C functions that write logs, someone may want to
> roll their own loggers.

I agree, and I think it's sufficient to provide the log format in a README or such.
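For anyone rolling their own logger, the append-mode, line-by-line approach discussed above might look something like the following. This is only a sketch: the tab-separated layout and the function name append_record are assumptions, since the actual format has yet to be written up.

```python
def append_record(path, fields):
    """Append one record to the log as a single tab-separated line.

    The file is opened in append mode for each record, so every write
    lands at the current end of the file.
    """
    values = [str(f) for f in fields]
    for value in values:
        # A tab or newline inside a value would corrupt the line format.
        if "\t" in value or "\n" in value:
            raise ValueError("field contains a tab or newline: %r" % value)
    line = "\t".join(values) + "\n"
    with open(path, "a", encoding="utf-8") as log:
        log.write(line)
```

Each record is assembled in full and handed to a single write() call, which keeps lines whole in the common case. A site with many concurrent writers may still want file locking, which this sketch leaves out.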

> I can't help fantasizing about specific fields... E.g., Received,
> Subject, List-*, and possibly more, can probably be deprived of any
> value, leaving just the keyword as a placeholder. For From, To, Cc,
> it might be useful to know how many mailboxes were present, or what
> kind of format they had; however, for statistical purposes, it may be
> enough to know the number of bytes or wrapped lines that they
> consisted of. Date fields are useful in full, for graphing and
> grouping. It would also be interesting to know the value of some
> not-always-used fields, such as Content-Language. Boundary values of
> Content-Type fields can always be dropped...

We need to distill any such data set down to a schema. The advantage of writing entire headers prior to processing would be that such a data set can be re-run through an updated parser when the schema changes, but there's so much data coming in that it's not clear what the value of doing so would be; the older data set has no value that the newer one lacks.

> What would be the relationship between such log format and the various
> MARF/OMA/DKIM-reporting formats currently on-air?

None, as far as I can tell. Essentially this is all private usage data shared among the users of a single project, and there's no immediate need for standardization.
Received on Mon Aug 16 2010 - 16:51:12 PST