Apple Developer Connection

Fighting Spam on Mac OS X Server

Spam—that ever-increasing flood of unwanted email—presents more and more of a problem to email users and administrators alike. It consumes bandwidth and CPU resources, eats up your time, and costs businesses and governments billions of dollars each year in the effort to reduce it.

At present (early 2004), the best solution to the problem of overwhelming spam traffic is simply to filter your incoming mail. Many mail clients, including Mac OS X Mail, offer built-in filtering options, which help, but these do little to reduce bandwidth consumption, as the filtering happens only after the spam has reached its final destination. Additionally, client-side solutions require attention and maintenance from the users.

A more centralized solution—blocking spam at the server level—is fairly easy to implement, and has clear advantages.

In this article, I explain the installation and use of a popular third-party server-side spam-filtering tool, SpamAssassin (there are others you could use as well). I cover how it works, and how to use Procmail to hook it into the Mac OS X Server mail server software so that it can filter incoming mail for users on the server side. Then I explain configuring a site-wide installation, training the statistical filter, and finally, adjusting configuration settings.

Mac OS X 10.3 uses Postfix as its mail server. Postfix can be elaborately configured with rules and policies to detect and block spam. AFP548.com's Anti-spam Measures is a good place to learn how to do that. In keeping with Unix's philosophy of modularity, though, some prefer to use a separate tool just for filtering spam. Such a tool can be upgraded, modified, swapped out, and turned on and off without affecting the function of the mail server.

Spam Fighting Techniques

There are numerous server-side spam-fighting products on the market now, which work in a variety of ways. One of the popular emerging technologies is a rather elegant one: statistical filtering, a novel way to look at email messages and determine which ones are unwanted.

Statistical filtering works by first analyzing a corpus of real email, or rather two corpora: one of spam and one of non-spam. The filtering tool compiles a database of all the words that occur in each corpus, and assigns each word a score denoting how frequently it appears in spam messages versus non-spam messages. A word like "viagra" will tend to appear primarily in spam, whereas other words (colleagues' names and the like) will appear much more often in non-spam mail than in spam.

The filter is then able to analyze incoming messages according to this database, and draw conclusions as to whether a new message is spam or not based on the words it contains. It can also be trained on the fly and corrected when it makes mistakes, so its accuracy keeps improving. Especially when "words" can include senders' IP addresses and other header information, statistical filtering can be an extremely effective method. For maximal effectiveness, such a filter should maintain a separate database of words for each user, so that filtering is tuned for the specific content of each inbox.
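
As a rough illustration of the word-scoring idea, the sketch below counts how often a word appears in a spam file versus a non-spam file and prints the share of its occurrences that fall in spam. It is a toy, not SpamAssassin's actual algorithm: the function name word_score and the two corpus files are hypothetical, and a real filter works with per-message frequencies and combines the scores of many words.

```shell
# Toy illustration only; this is not how SpamAssassin computes its scores.
# word_score SPAM_FILE HAM_FILE WORD
# Prints the fraction (0.00 to 1.00) of WORD's occurrences found in SPAM_FILE.
word_score() {
    # Split each corpus into one word per line, then count exact,
    # case-insensitive matches. grep exits non-zero when the count is 0,
    # so "|| true" keeps that from being treated as a failure.
    spam_count=$(tr -cs '[:alpha:]' '\n' < "$1" | grep -cix -- "$3" || true)
    ham_count=$(tr -cs '[:alpha:]' '\n' < "$2" | grep -cix -- "$3" || true)
    awk -v s="$spam_count" -v h="$ham_count" 'BEGIN {
        if (s + h == 0) print "0.50"      # unseen word: no evidence either way
        else printf "%.2f\n", s / (s + h)
    }'
}
```

A word that prints a value near 1.00 occurs almost only in the spam file, and one near 0.00 almost only in the non-spam file. Words close to 0.50 carry little information either way.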

Another way of weeding out spam from inboxes is rule-based filtering. This technique uses a set of rules to detect spam-like messages that match telltale patterns, like excessive capitalization, false message ID strings, list removal information, and such.

SpamAssassin is a free, open-source tool that uses both of these methods, in combination with checks against server blacklists and the Razor hash database, to filter spam on the server side, with very good results. It can be integrated nicely with Postfix, the Mac OS X 10.3 Server mail software. The combination of techniques makes it very efficacious, as the different methods not only complement but actively reinforce each other.

Installing SpamAssassin

Here is how I installed and set up SpamAssassin on a Mac OS X Server v10.3 machine.

Before you install SpamAssassin on Mac OS X Server, you need to have the Xcode Tools installed, as SpamAssassin must be built from the source code. Having installed Xcode, the easiest way to install SpamAssassin is through CPAN, the Perl module repository interface.

Working as the superuser, I type cpan at the command line, which opens a CPAN shell.

Since this is the first time I'm running CPAN on this machine, I need to go through various initialization steps: namely, choosing servers and protocols that CPAN will use to download modules. (If you have already configured CPAN, you can skip this phase.) CPAN walks me through the configuration step by step; if you haven't done it before, rest assured that the default answers are almost always sufficient. When CPAN wants to know the location of a certain executable (say, gzip), the answer can be found with the which command, for example by typing which gzip in a second terminal window.

After CPAN is configured, I want to install some basic Perl modules that are prerequisites for SpamAssassin to run. Most of these are conveniently contained in a bundle called Bundle::CPAN. At the CPAN prompt, I type:

install Bundle::CPAN

CPAN proceeds to automatically install the various modules, including ones for handling various routines having to do with dates, HTML parsing, file compression, and much more. When it asks, I instruct CPAN to follow all prerequisites and download everything it desires.

After that's complete, I can install the latest version of SpamAssassin with the command:

install Mail::SpamAssassin

CPAN automatically downloads the package into my chosen CPAN build directory and begins to install it. I am prompted to authorize the installation of a couple of prerequisites, HTML::Parser and HTML::Tagset, which I agree to. I provide a postmaster contact address when asked. The build process takes a while. When it's done, I quit CPAN.

Next, I want to install the collaborative filtering tool Vipul's Razor; but to do so, I need to download it separately, because it's not available via CPAN. Additionally, as of Razor version 2.36, I need to apply a small patch to the Razor code in order to make it work correctly with SpamAssassin. I download razor-agents and razor-agents-sdk from razor.sourceforge.net and unzip them both.

The patch is part of the SpamAssassin package. To apply it, I run the following command (as the superuser):

patch -p0 -d /Users/paul/source/razor-agents-2.36/lib/Razor2 < \
    /Users/paul/.cpan/build/Mail-SpamAssassin-2.61/Razor2.patch

where the first directory is the one where I've unpacked the razor-agents tarball, and the second is my CPAN build directory (where the SpamAssassin source was automatically downloaded).

After applying the patch, I first cd to the razor-agents-sdk directory, and run:

 perl Makefile.PL
 make
 make test
 make install

This builds and installs all the prerequisites for Razor. I then cd to the razor-agents directory and issue the same series of commands, to build the actual tool.

Finally, I need to run a few scripts (as the superuser) to initialize Razor, create a user account, and so forth. I type:

 razor-client
 razor-admin -create
 razor-admin -register

This phase of the installation is complete. Next, we set up SpamAssassin.

Setting Up SpamAssassin

Configuration for SpamAssassin is done on the site-wide level in a file at /etc/mail/spamassassin/local.cf. Individual users can optionally have their own ~/.spamassassin/user_prefs files as well. There are dozens of configuration options, which can be seen by typing at the command line:

perldoc Mail::SpamAssassin::Conf

The basic settings I like are as follows:

 rewrite_subject 0
 # don't change subject lines of spam
 always_add_headers 1
 # add SpamAssassin headers to both spam and nonspam
 use_bayes               1
 # enable the statistical filter
 auto_learn              1
 # allow the statistical filter to learn as it goes
 rbl_timeout 4
 razor_timeout 4
 # set short timeouts for remote database checks

The auto-learn setting means that the statistical filter will train itself. After 200 messages each of spam and non-spam are received by any given user, statistical filtering will be sufficiently trained, and kick in automatically from that point on.

As an alternative, the filter can be trained using old messages. For maximum accuracy, each user's database should be trained with that user's own mail. Failing that, though, I can do training just based on my own mail, copy the database files thus created to each user's home directory, and let auto-learning use that as a starting point.

I have a big file full of spam messages on hand: there's always plenty of spam. Or a ready-made zipped corpus can be found at Spam Archive. Exiting the superuser shell, I feed my spam corpus file to SpamAssassin's sa-learn tool:

 sa-learn --showdots --spam --mbox ./spam-corpus

It works through the file for a long while, and then tells me something like, "Learned from 1273 message(s) (1293 message(s) examined)."

I then take a sizeable archive of my non-spam incoming mail of various representative types (SpamAssassin calls non-spam "ham") and feed that in as well:

 sa-learn --showdots --ham --mbox ./saved-mail-2003-10

I now have a database of words for the statistical filter to use. If I choose the quick-and-dirty approach, I can simply copy the ~/.spamassassin/bayes* files just created to each user's .spamassassin directory as a starting point, and let the auto-learn feature take over from here.
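
The quick-and-dirty seeding step can be scripted. The sketch below copies the trained bayes* files into a .spamassassin directory under every home directory beneath a given root. The function name seed_bayes and its arguments are hypothetical. It assumes every directory directly under the root (such as /Users) is a user's home, and it leaves file ownership alone, so each copy still has to be handed to its user (for example with chown) before auto-learning can update it.

```shell
# seed_bayes SOURCE_DIR HOMES_ROOT
# Copies SOURCE_DIR/bayes* into HOMES_ROOT/<user>/.spamassassin/ for every
# home directory under HOMES_ROOT, skipping users who already have a database.
seed_bayes() {
    src=$1
    root=$2
    for home in "$root"/*/; do
        [ -d "$home" ] || continue
        dest="${home}.spamassassin"
        # Never overwrite a database the user has already started training.
        if ls "$dest"/bayes* > /dev/null 2>&1; then
            continue
        fi
        mkdir -p "$dest"
        cp "$src"/bayes* "$dest"/
        echo "seeded ${dest}"
    done
}
```

Run as the superuser, a call such as seed_bayes ~paul/.spamassassin /Users would seed every account except those that already hold bayes files, which includes the account the database was trained in.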

I can test SpamAssassin with a sample spam message as follows:

 spamassassin -tD < spam-sample

It outputs the processed message, as well as a report:

 Content analysis details:   (4.4 points, 5.0 required)
 
  pts rule name              description
 ---- ---------------------- --------------------------------------------------
  4.4 DATE_SPAMWARE_Y2K      Date header uses unusual Y2K formatting
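
When testing a batch of sample messages, it can be handy to pull just the two numbers out of that summary line. The helper below is a sketch that assumes the summary has exactly the form shown in the report above; the name report_score is hypothetical.

```shell
# report_score
# Reads a SpamAssassin test report on standard input and prints
# "<points> <required>" taken from the "Content analysis details" line.
# Prints nothing if no such line is present.
report_score() {
    sed -n 's/^ *Content analysis details: *(\([0-9.-]*\) points, \([0-9.]*\) required).*/\1 \2/p'
}
```

For the report above, piping the test output through report_score would print "4.4 5.0".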
 

The final step in setting up SpamAssassin is to get it to run automatically. I will use the daemonized version, called spamd, which runs in the background, and processes messages sent to it by the client spamc.

To run spamd automatically, I place the following line at the end of my /etc/watchdog.conf file:

 spamd:respawn:/usr/bin/spamd -m 30 -a -d        # SpamAssassin daemon
 

Using Procmail

Now I need to send all incoming mail through SpamAssassin. There are a number of ways to hook SpamAssassin into Postfix for processing of incoming mail. Intermediaries that provide the connection include Amavisd-new and Spampd. Procmail is another option, and one that I've found simple to set up and use. Procmail is already installed on Mac OS X Server, so all that's necessary is to route messages to it through Postfix's filter system.

NOTE: the following manual changes to two of Postfix's text-based configuration files will be ignored, and possibly overwritten, by the Server Admin tool. The best way to handle this is to configure mail setup via Server Admin completely before making the modifications below. If it's necessary to make additional changes to the mail setup, they can either be done manually, or, if the Server Admin tool is used, the Procmail changes must be added to the Postfix files again afterward.

First, I turn off mail service, so no messages are misrouted during setup. Then I start making changes. The first file to edit is /etc/postfix/main.cf. Working as the superuser, I will make two changes. First, there's a line (it's line 438 in my version) that says:

 #mailbox_command = /some/where/procmail -a "$EXTENSION"

I uncomment this line by removing the initial #, and replace "/some/where" with the actual path to Procmail: /usr/bin. Also, I change -a to -ta, which provides a level of fallback security.

A few lines down is:

#mailbox_transport = cyrus

I uncomment this as well, and change "cyrus" to "procmail". Then, I save the file.
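
The two edits can also be expressed as a sed filter, which is convenient when the changes have to be reapplied after Server Admin rewrites the file. This is a sketch that assumes the commented-out lines read exactly as quoted above. The function name enable_procmail is hypothetical, and it writes the result to standard output rather than modifying main.cf in place.

```shell
# enable_procmail MAIN_CF
# Prints MAIN_CF with the two Procmail-related lines uncommented and adjusted.
# All other lines pass through unchanged.
enable_procmail() {
    sed -e 's|^#mailbox_command = /some/where/procmail -a "\$EXTENSION"|mailbox_command = /usr/bin/procmail -ta "$EXTENSION"|' \
        -e 's|^#mailbox_transport = cyrus|mailbox_transport = procmail|' \
        "$1"
}
```

Writing to a scratch file first and comparing it against the original with diff makes it easy to confirm that only those two lines changed before copying the result into place.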

This has told Postfix to use Procmail as its mailbox_transport. Next, I have to create that transport.

In order to do that, I edit a second file, /etc/postfix/master.cf. At the end, in the final "Interfaces to non-Postfix software" section, I add two lines:

 procmail  unix  -       n       n       -       -       pipe
   user=cyrus argv=/usr/bin/procmail -t -m /etc/procmailrc USER=${user} EXTENSION=${extension}
 

This creates a Procmail relay that Postfix will pass mail to. Now I need to configure Procmail to handle the messages.

Bringing It All Together

I create a general configuration file for Procmail, called /etc/procmailrc. This file can contain any sort of Procmail recipe to process all incoming mail.

For more flexibility, each user can have his or her own Procmail recipes as well, simply by creating a .procmail file in his or her home directory. The INCLUDERC line in the central recipe file shown below pulls these files into Procmail's processing path. A starter .procmail file can be placed in /System/Library/User Template/English.lproj/, the directory that serves as a template for the home directories of newly created users, so that it is copied into the home directory of any new user.

Depending on the particular maintenance requirements of the server setup, the burden of making mail-filtering decisions can be given to the users to a greater or lesser extent. I find it convenient to tag mail with SpamAssassin from a central Procmail recipe, and then leave it up to the users to add filters to their mail clients to filter mail that contains the "X-Spam-Status: Yes" header into a spam folder. The recipe below automatically deletes any message that gets a SpamAssassin score higher than 12, on the assumption that such messages are assuredly spam.

Some administrators may think this is a draconian method; I find that it weeds out about half of the incoming mail to my site, and, in the six months that I was dumping all such high-scoring mail to a "worst-spam" folder, found zero false positives.

In the /etc/procmailrc file I place the following:

 LOGFILE=/var/log/procmail  # this can be deleted after testing
 VERBOSE=no
 HOME=/Users/$USER
 DROPPRIVS=yes
 
 :0fiw
 | spamc   # call spamassassin
 
 INCLUDERC=$HOME/.procmail  # allow users to create their own recipes
 
 :0
 * ^X-Spam-Level: \*\*\*\*\*\*\*\*\*\*\*\*\*\*
 /dev/null
 #trash all messages with a very high spam score
 
 :0w
 | /usr/bin/cyrus/bin/deliver -a $USER -m user/$USER
 # if not told otherwise, deliver all messages to the user's inbox
 

Deliver is the Cyrus POP/IMAP server's tool for delivering a message to a mailbox. In the last line of the procmailrc file, I send the filtered messages to deliver, specifying the authorizing username and the destination mailbox.
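
The condition in the high-score recipe works because SpamAssassin writes one asterisk into the X-Spam-Level header for each whole point of a message's score, so a pattern of N escaped asterisks matches any message scoring N or more. To try a different cutoff without counting asterisks by hand, a small helper can generate the condition line. This is a sketch; the name level_pattern is hypothetical.

```shell
# level_pattern N
# Prints a Procmail condition line that matches messages whose
# X-Spam-Level header contains at least N asterisks.
level_pattern() {
    n=$1
    stars=''
    while [ "$n" -gt 0 ]; do
        stars="${stars}\\*"   # appends a backslash and an asterisk
        n=$((n - 1))
    done
    printf '%s\n' "* ^X-Spam-Level: ${stars}"
}
```

For instance, level_pattern 3 prints a condition with three escaped asterisks, which would match any message scoring 3 or more.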

Maintenance and Training

SpamAssassin requires very little administrative intervention once it's set up. The SpamAssassin site includes a useful page of advice for running a site-wide installation.

If statistical filtering is used, and auto-learn is turned on in the configuration, the filter will gradually train itself, based on the result of the non-statistical filter. It's good to have a system set up for users to report mis-categorized messages: either spam that finds its way into their inbox, or—worse—legitimate mail that SpamAssassin tags as spam.

SpamAssassin has two command-line switches for dealing with mis-filed mail. spamassassin -r reports that a message is actually spam, despite being marked as non-spam. spamassassin -k does the reverse, categorizing the input message as non-spam. A shell script can use these to recategorize messages that users report. There are a few possible ways of setting up user reporting, depending on how you want to organize your mail service:

-- In an IMAP environment, the mail administrator can create a pair of write-only shared folders: Is-Spam and Not-Spam. (Creation of shared folders is explained in the Mail Service Administration manual.) IMAP users can place mis-filed messages in these folders. Since the folders are write-only, users can't peek at each others' mail. A shell script, running hourly or daily, can then go through the folders and pass the messages to SpamAssassin for correction. This method doesn't work well if a separate statistical database is set up for each user; since all the messages are lumped together, it is best for an installation that uses just one central database for the statistical filter. The critical line in such a script would look something like:

 cat "$message" | formail -n 4 -ds | sudo -u spam \
 	/usr/bin/spamassassin -r -d -a -x > /dev/null 2>&1
 

-- Another option for an IMAP environment is to create copies of the Is-Spam and Not-Spam folders in each mail user's account. Then each user can move mis-filed messages to the appropriate folder, and the shell script can update each user's database individually. This requires changing the default Cyrus setup, to provide the additional folders for each user. For scalability, it's best to stagger the shell scripts for each user so they don't all run at once.

-- A third option, and one which can be used with POP mail, is to set up a pair of mail accounts: isspam@mailserver.com and notspam@mailserver.com, and instruct users to redirect—not forward—messages to the accounts for re-categorization. The shell script would then operate on incoming mail to those accounts. Unfortunately for this technique, only some mail clients support message redirection, better known as "bouncing." Forwarding removes the headers from the original message, which are an important aspect of the message's statistical profile. It's possible to set up a pair of accounts for each user: isspam-username@mailserver.com and notspam-username@mailserver.com, to allow for maintenance of individual databases.
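
Whichever option is chosen, the periodic script has the same shape: feed each reported message to a correction command, and discard the message once it has been processed. The sketch below shows that loop with the command passed in as arguments. The function name process_reports is hypothetical, and it assumes a plain directory holding one reported message per file that the script is free to delete from (for example, a directory of exported copies), not a live Cyrus mail spool.

```shell
# process_reports DIR COMMAND [ARGS...]
# Feeds each regular file in DIR to COMMAND on standard input and deletes
# the file only if COMMAND succeeds. Prints the number of files processed.
process_reports() {
    dir=$1
    shift
    count=0
    for message in "$dir"/*; do
        [ -f "$message" ] || continue
        if "$@" < "$message" > /dev/null 2>&1; then
            rm -f "$message"
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

A call along the lines of process_reports /path/to/is-spam sudo -u spam /usr/bin/spamassassin -r -d -a -x (with a placeholder path) would report each file as spam, and the same call with -k in place of -r would handle the not-spam directory. Messages whose command fails are left in place for the next run.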


Posted: 2004-02-09