install Bundle::CPAN
CPAN then automatically installs a set of modules, including ones that handle dates, HTML parsing, file compression, and much more. When it asks, I instruct CPAN to follow all prerequisites and download everything it needs.
After that's complete, I can install the latest version of SpamAssassin with the command:
install Mail::SpamAssassin
CPAN automatically downloads the package into my chosen CPAN build directory and begins to install it. I am prompted to authorize the installation of a couple of prerequisites, HTML::Parser and HTML::Tagset, which I agree to. I provide a postmaster contact address when asked. The build process takes a while. When it's done, I quit CPAN.
Next, I want to install the collaborative filtering tool Vipul's Razor; but to do so, I need to download it separately, because it's not available via CPAN. Additionally, as of Razor version 2.36, I need to apply a small patch to the Razor code in order to make it work correctly with SpamAssassin. I download razor-agents and razor-agents-sdk from razor.sourceforge.net and unzip them both.
The patch is part of the SpamAssassin package. To apply it, I run the following command (as the superuser):
patch -p0 -d /Users/paul/source/razor-agents-2.36/lib/Razor2 < /Users/paul/.cpan/build/Mail-SpamAssassin-2.61/Razor2.patch
where the first directory is the one where I've unpacked the razor-agents tarball, and the second is my CPAN build directory (where the SpamAssassin source was automatically downloaded).
After applying the patch, I first cd to the razor-agents-sdk directory, and run:
perl Makefile.PL
make
make test
make install
This builds and installs all the prerequisites for Razor. I then cd to the razor-agents directory and issue the same series of commands, to build the actual tool.
Finally, I need to run a few scripts (as the superuser) to initialize Razor, create a user account, and so forth. I type:
razor-client
razor-admin -create
razor-admin -register
This phase of the installation is complete. Next, we set up SpamAssassin.
Configuration for SpamAssassin is done on the site-wide level in a file at /etc/mail/spamassassin/local.cf. Individual users can optionally have their own settings in ~/.spamassassin/user_prefs as well. There are dozens of configuration options, which can be listed by typing at the command line:
perldoc Mail::SpamAssassin::Conf
The basic settings I like are as follows:
rewrite_subject 0       # don't change subject lines of spam
always_add_headers 1    # add SpamAssassin headers to both spam and nonspam
use_bayes 1             # enable the statistical filter
auto_learn 1            # allow the statistical filter to learn as it goes
rbl_timeout 4
razor_timeout 4         # set short timeouts for remote database checks
The auto-learn setting means that the statistical filter will train itself. After 200 messages each of spam and non-spam are received by any given user, statistical filtering will be sufficiently trained, and kick in automatically from that point on.
As an alternative, the filter can be trained using old messages. For maximum accuracy, each user's database should be trained with that user's own mail. Failing that, though, I can do training just based on my own mail, copy the database files thus created to each user's home directory, and let auto-learning use that as a starting point.
I have a big file full of spam messages on hand: there's always plenty of spam. Or a ready-made zipped corpus can be found at Spam Archive. Exiting the superuser shell, I feed my spam corpus file to SpamAssassin's sa-learn tool:
sa-learn --showdots --spam --mbox ./spam-corpus
It works through the file for a long while, and then tells me something like, "Learned from 1273 message(s) (1293 message(s) examined)."
I then take a sizeable archive of my non-spam incoming mail of various representative types (SpamAssassin calls non-spam "ham") and feed that in as well:
sa-learn --showdots --ham --mbox ./saved-mail-2003-10
I now have a database of words for the statistical filter to use. If I choose the quick-and-dirty approach, I can simply copy the ~/.spamassassin/bayes* files just created to each user's .spamassassin directory as a starting point, and let the auto-learn feature take over from here.
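The quick-and-dirty copy described above could be scripted along the following lines. This is only a sketch: the function name seed_bayes and its two arguments are made up for illustration, and it assumes that every subdirectory of the users directory is a home directory. It deliberately skips any database file a user already has, so that nobody's own training is overwritten.

```shell
#!/bin/sh
# seed_bayes SRC_DIR USERS_ROOT   (hypothetical helper, not part of SpamAssassin)
#   SRC_DIR     directory holding the trained bayes* files, e.g. ~/.spamassassin
#   USERS_ROOT  directory with one home directory per user, e.g. /Users
seed_bayes() {
    src=$1
    root=$2
    for home in "$root"/*/; do
        [ -d "$home" ] || continue              # nothing matched the glob
        dest="${home}.spamassassin"
        mkdir -p "$dest" || return 1
        for f in "$src"/bayes*; do
            [ -f "$f" ] || continue             # no trained database found
            target="$dest/$(basename "$f")"
            # never overwrite a database that a user has already started
            [ -e "$target" ] || cp -p "$f" "$target"
        done
    done
}
```

It would be called as, for example, seed_bayes /Users/paul/.spamassassin /Users. When run as the superuser, the copies end up owned by root, so each user's files still need a chown before that user's auto-learning can update them.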
I can test SpamAssassin with a sample spam message as follows:
spamassassin -tD < spam-sample
It outputs the processed message, as well as a report:
Content analysis details:   (4.4 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 4.4 DATE_SPAMWARE_Y2K      Date header uses unusual Y2K formatting
The final step in setting up SpamAssassin is to get it to run automatically. I will use the daemonized version, called spamd, which runs in the background, and processes messages sent to it by the client spamc.
To run spamd automatically, I place the following line at the end of my /etc/watchdog.conf file:
spamd:respawn:/usr/bin/spamd -m 30 -a -d # SpamAssassin daemon
Now I need to send all incoming mail through SpamAssassin. There are a number of ways to hook SpamAssassin into Postfix for processing of incoming mail. Intermediaries that provide the connection include Amavisd-new and Spampd. Procmail is another option, and one that I've found simple to set up and use. Procmail is already installed on Mac OS X Server, so all that's necessary is to route messages to it through Postfix's filter system.
NOTE: the following manual changes to two of Postfix's text-based configuration files will be ignored, and possibly overwritten, by the Server Admin tool. The best way to handle this is to configure mail setup via Server Admin completely before making the modifications below. If it's necessary to make additional changes to the mail setup, they can either be done manually, or, if the Server Admin tool is used, the Procmail changes must be added to the Postfix files again afterward.
First, I turn off mail service, so no messages are misrouted during setup. Then I start making changes. The first file to edit is /etc/postfix/main.cf. Working as the superuser, I will make two changes. First, there's a line (it's line 438 in my version) that says:
#mailbox_command = /some/where/procmail -a "$EXTENSION"
I uncomment this line by removing the initial #, and replace "/some/where" with the actual path to Procmail: /usr/bin. Also, I change -a to -ta, which provides a level of fallback security.
A few lines down is:
#mailbox_transport = cyrus
I uncomment this as well, and change "cyrus" to "procmail". Then, I save the file.
This has told Postfix to use Procmail as its mailbox_transport. Next, I have to create that transport.
In order to do that, I edit a second file, /etc/postfix/master.cf. At the end, in the final "Interfaces to non-Postfix software" section, I add two lines:
procmail  unix  -  n  n  -  -  pipe
  user=cyrus argv=/usr/bin/procmail -t -m /etc/procmailrc USER=${user} EXTENSION=${extension}
This creates a Procmail relay that Postfix will pass mail to. Now I need to configure Procmail to handle the messages.
I create a general configuration file for Procmail, called /etc/procmailrc. This file can contain any sort of Procmail recipe to process all incoming mail.
For more flexibility, each user can have his or her own Procmail recipes as well, simply by creating a .procmail file in his or her home directory. Procmail will automatically include these files in its processing path. A starter .procmail file can be created in /System/Library/User Template/English.lproj/, the directory that serves as a template for the home directories of newly created users. The file will then be copied into the home directory of any new user.
Depending on the particular maintenance requirements of the server setup, the burden of making mail-filtering decisions can be given to the users to a greater or lesser extent. I find it convenient to tag mail with SpamAssassin from a central Procmail recipe, and then leave it up to the users to add filters to their mail clients to filter mail that contains the "X-Spam-Status: Yes" header into a spam folder. The recipe below automatically deletes any message that gets a SpamAssassin score higher than 12, on the assumption that such messages are assuredly spam.
Some administrators may think this is a draconian method; I find that it weeds out about half of the incoming mail to my site, and, in the six months that I was dumping all such high-scoring mail to a "worst-spam" folder, found zero false positives.
In the /etc/procmailrc file I place the following:
LOGFILE=/var/log/procmail   # this can be deleted after testing
VERBOSE=no
HOME=/Users/$USER
DROPPRIVS=yes

:0fiw
| spamc   # call spamassassin

INCLUDERC=$HOME/.procmail   # allow users to create their own recipes

:0:
* ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*
/dev/null   #trash all messages with a very high spam score

:0w
| /usr/bin/cyrus/bin/deliver -a $USER -m user/$USER   # if not told otherwise, deliver all messages to the user's inbox
Deliver is the Cyrus POP/IMAP server's tool for delivering a message to a mailbox. In the last line of the procmailrc file, I send the filtered messages to deliver, specifying the authorizing username and the destination mailbox.
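The X-Spam-Level condition in the recipe works because SpamAssassin writes one asterisk into that header for each whole point of a message's score, so a pattern of N escaped asterisks matches any message that scored N or more. The same test can be tried outside Procmail with a small shell function. This is a sketch only: the name spam_level_at_least is invented for illustration, and it reads a single message on standard input.

```shell
#!/bin/sh
# spam_level_at_least N < message   (hypothetical helper for experimenting)
# Succeeds when the message on stdin has an X-Spam-Level header containing
# at least N asterisks, which is what the Procmail condition checks for.
spam_level_at_least() {
    n=$1
    # Look only at the header block: sed stops at the first blank line,
    # so text in the message body cannot trigger a match.
    sed '/^$/q' | grep -Eq "^X-Spam-Level: \*{$n}"
}
```

For example, spam_level_at_least 12 < spam-sample exits with status 0 only if the tagged sample scored 12 or more. Running it against a few saved messages is a cheap way to check a threshold before trusting it to discard mail.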
SpamAssassin requires very little administrative intervention once it's set up. The SpamAssassin site includes a useful page of advice for running a site-wide installation.
If statistical filtering is used, and auto-learn is turned on in the configuration, the filter will gradually train itself, based on the result of the non-statistical filter. It's good to have a system set up for users to report mis-categorized messages: either spam that finds its way into their inbox, or—worse—legitimate mail that SpamAssassin tags as spam.
SpamAssassin has two command-line switches for dealing with mis-filed mail. spamassassin -r reports that a message is actually spam, despite being marked as non-spam. spamassassin -k does the reverse, categorizing the input message as non-spam. A shell script can use these to recategorize messages that users report. There are a few possible ways of setting up user reporting, depending on how you want to organize your mail service:
-- In an IMAP environment, the mail administrator can create a pair of write-only shared folders: Is-Spam and Not-Spam. (Creation of shared folders is explained in the Mail Service Administration manual.) IMAP users can place mis-filed messages in these folders. Since the folders are write-only, users can't peek at each other's mail. A shell script, running hourly or daily, can then go through the folders and pass the messages to SpamAssassin for correction. This method doesn't work well if a separate statistical database is set up for each user; since all the messages are lumped together, it is best for an installation that uses just one central database for the statistical filter. The critical line in such a script would look something like:
cat $message | formail -n 4 -ds | sudo -u spam /usr/bin/spamassassin -r -d -a -x > /dev/null 2>&1
-- Another option for an IMAP environment is to create copies of the Is-Spam and Not-Spam folders in each mail user's account. Then each user can move mis-filed messages to the appropriate folder, and the shell script can update each user's database individually. This requires changing the default Cyrus setup, to provide the additional folders for each user. For scalability, it's best to stagger the shell scripts for each user so they don't all run at once.
-- A third option, and one which can be used with POP mail, is to set up a pair of mail accounts: isspam@mailserver.com and notspam@mailserver.com, and instruct users to redirect—not forward—messages to the accounts for re-categorization. The shell script would then operate on incoming mail to those accounts. Unfortunately for this technique, only some mail clients support message redirection, better known as "bouncing." Forwarding removes the headers from the original message, which are an important aspect of the message's statistical profile. It's possible to set up a pair of accounts for each user: isspam-username@mailserver.com and notspam-username@mailserver.com, to allow for maintenance of individual databases.
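Whichever reporting route is chosen, the script behind it has the same shape: walk through the reported messages and hand each one to SpamAssassin with -r or -k. The following is a minimal sketch of that loop, not a finished tool. It assumes the reported messages have already been copied into a plain staging directory, one message per file; the function name relearn_folder and the SA_CMD override are invented for illustration. It should not be pointed directly at a live Cyrus mailbox directory, which also holds Cyrus's own index files.

```shell
#!/bin/sh
# relearn_folder DIR FLAG   (hypothetical helper)
#   DIR   staging directory holding one reported message per file
#   FLAG  -r to report the messages as spam, -k to mark them as non-spam
# SA_CMD may be set to another command for a dry run; by default the
# function calls spamassassin.
relearn_folder() {
    dir=$1
    flag=$2
    count=0
    for message in "$dir"/*; do
        [ -f "$message" ] || continue       # empty directory or not a file
        if ${SA_CMD:-spamassassin} "$flag" < "$message" > /dev/null 2>&1; then
            rm -f "$message"                # handled, so don't process it twice
            count=$((count + 1))
        fi
    done
    echo "$count"                           # number of messages processed
}
```

An hourly job might then run relearn_folder /var/spool/reports/is-spam -r followed by relearn_folder /var/spool/reports/not-spam -k, where both paths are placeholders. Messages that SpamAssassin fails on are left in place, so they are retried on the next run.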
Posted: 2004-02-09