Re: Revised stats gathering approach

From: Alessandro Vesely <vesely_at_tana.it>
Date: Sun, 15 Aug 2010 13:52:34 +0200

On 14/Aug/10 16:52, Murray S. Kucherawy wrote:
> [Besides envelope IDs], unfortunately, the 5322.Message-ID cannot be trusted to be unique.

How are we going to stuff multiple messages, possibly being logged
concurrently, into the same log file, then? I think a "best effort"
ID would suffice for statistical purposes, since a stable sort that
groups log lines by equal ID would almost always yield the correct
result. In that case, fopen(log_file, "a") and line-by-line logging
should suffice.
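
As a rough illustration of the append-and-group idea, here is a
minimal Python sketch. The tab-separated "ID, then text" layout and
the names log_line, group_by_id and demo are assumptions made up for
this example; nothing here is an agreed format. It only shows that a
stable sort keyed on the ID regroups interleaved lines while keeping
each message's lines in the order they were logged.

```python
import os
import tempfile


def log_line(path, msg_id, text):
    # Append one record per line, the Python analogue of fopen(log_file, "a").
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{msg_id}\t{text}\n")


def group_by_id(path):
    with open(path, encoding="utf-8") as log:
        lines = [line.rstrip("\n") for line in log]
    # sorted() is stable: lines with equal IDs keep their logged order.
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


def demo():
    path = os.path.join(tempfile.mkdtemp(), "stats.log")
    # Two messages whose log lines end up interleaved in the file.
    log_line(path, "id-B", "sig 1")
    log_line(path, "id-A", "sig 1")
    log_line(path, "id-B", "sig 2")
    log_line(path, "id-A", "sig 2")
    return group_by_id(path)
```

Whether concurrent writers can interleave within a single line depends
on the platform and on how each record is written; the sketch does not
address that.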

> [...]
> Since a single message can have more than one signature, and since
> each signature could sign a completely different set of header fields,
> I believe this is necessary in order to report the kinds of statistics
> we'd like to have. I don't plan to record the contents of signed
> header fields, but only their names, to answer questions about which
> fields are typically signed and which field typically gets changed to
> break signatures.

The latter part, breakage diagnosis, may or may not be available.
When it is available, its format probably varies widely. If the log
consists of (part of) a message's header, diagnoses may reside in
comments in the Authentication-Results (A-R) fields.

It would still be useful to know whether message rewriting took place
before verification, whether there was any ">From" in the body, etc.

>> Why not dump the complete signature, unwrapped into a single line
>> and with truncated b= and bh= tags to save space?
>
> Further compression of things like canonicalization and algorithm to
> one-byte integers is possible as well, plus tabular ordered columns
> saves the space of repeated tag names and equal signs. The file
> doesn't need to be human-readable; we can have opendkim-stats do that
> translation.

Ordered columns save space at the cost of making the logs less
flexible. The log spec may say which header fields and which tags
are not needed, but a tiny site may not care and just dump
everything, in order to save the hassle of parsing.
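
For comparison, dumping a signature as a single unwrapped line with
shortened b= and bh= tags takes very little code. The following
Python sketch is only an illustration: the function name
compact_signature and the 8-character cut-off are arbitrary choices
of mine, and the parsing is deliberately naive (it splits on ";" and
does not validate the tag list).

```python
import re


def compact_signature(raw_value, keep=8):
    """Unfold a DKIM-Signature value and shorten its b= and bh= tags."""
    # Unfold: a folded header continues on lines that start with whitespace.
    unfolded = re.sub(r"\r?\n[ \t]+", " ", raw_value).strip()
    tags = []
    for item in unfolded.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split at the first "=" only; base64 padding may contain more.
        name, _, value = item.partition("=")
        name = name.strip()
        value = value.strip()
        if name in ("b", "bh"):
            # Base64 data may contain folding whitespace: drop it, then cut.
            value = re.sub(r"\s+", "", value)[:keep]
        tags.append(f"{name}={value}")
    return "; ".join(tags)
```

All other tags are passed through untouched, so a site that does not
want to decide what is needed can still log everything this way.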

A simple format has the advantage of allowing log contributions from
sites that use neither opendkim nor libopendkim. Although we can
provide scripts or C functions that write logs, someone may want to
roll their own logger.

I can't help fantasizing about specific fields... E.g., Received,
Subject, List-*, and possibly more, can probably be stripped of their
values, leaving just the field name as a placeholder. For From, To,
Cc, it might be useful to know how many mailboxes were present, or
what kind of format they had; however, for statistical purposes, it
may be enough to know the number of bytes or wrapped lines that they
consisted of. Date fields are useful in full, for graphing and
grouping. It would also be interesting to know the value of some
not-always-used fields, such as Content-Language. Boundary values of
Content-Type fields can always be dropped...
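
To make the per-field treatment concrete, here is a small Python
sketch under stated assumptions. The function name reduce_field, the
"bytes=... lines=..." output layout, and the decision to keep every
unlisted field in full are all invented for the example; the actual
field list would have to come from the log spec.

```python
import re

VALUE_FREE = ("received", "subject")   # keep only the field name
ADDRESS_FIELDS = ("from", "to", "cc")  # keep only size figures


def reduce_field(name, value):
    """Return a one-line log entry for one header field (hypothetical format)."""
    lower = name.lower()
    if lower in VALUE_FREE or lower.startswith("list-"):
        return f"{name}:"
    if lower in ADDRESS_FIELDS:
        n_bytes = len(value.encode("utf-8"))
        n_lines = value.count("\n") + 1  # wrapped lines in the raw field
        return f"{name}: bytes={n_bytes} lines={n_lines}"
    if lower == "content-type":
        # Drop the boundary parameter, quoted or not.
        value = re.sub(r';\s*boundary\s*=\s*("[^"]*"|[^;\s]*)', "", value,
                       flags=re.I)
    # Date, Content-Language and anything else: keep the value, unfolded.
    return f"{name}: " + re.sub(r"\r?\n[ \t]+", " ", value).strip()
```

Counting mailboxes or classifying address formats would need a real
address parser and is left out here.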

What would be the relationship between such a log format and the
various MARF/OMA/DKIM-reporting formats currently in use?
Received on Sun Aug 15 2010 - 11:52:46 PST
