Here Come the Spambots...

Paul R. Brown @ 2008-01-27T00:59:56Z

It didn't take long for someone (or something) to send the first comment spam:

[Screenshot: some comment spam in the review display]

It's interesting that the spambot appears to choose the same pages here as it does on other blogs. Either someone wrote a quick plugin for their spambot platform, or the bot can figure out on its own what links and form fields mean.


Delete Me

Paul Brown @ 2007-06-01T01:55:00Z

Earlier in the week, I received an email from Ancestry.com informing me of how to change my password. This was of interest, since I'd never signed up on their site or even heard of them, and the company at least looks reputable enough that I doubt that they're artificially boosting their membership stats by shanghaiing people with guessable gmail addresses.

I poked around enough to find out how to turn off their marketing spam, but what I really want is a way to say "delete me" — remove any and all records of my personal information (including otherwise innocuous information like name and email address). Every app should have that functionality, but very few do.


Trackback Spam and iptables Recipes

Paul Brown @ 2007-03-23T02:46:06Z

It was interesting to read the conclusions in the Chen-Ma-Niu-Wang paper "Spam Double-Funnel: Connecting Web Spammers with Advertisers", specifically about how most of the page spam involves permissive content posting sites and a relatively small group of IP addresses. (There is also a NYT article about the paper.) It's similar to my experiences with trackback spam.

The comment system in the version of typo that I'm using (for the time being) relies on JavaScript, and it's apparently not widespread enough for anyone to have bothered cracking it. (In fact, given that most commenters post several copies of the same comment, I suspect that it either doesn't work well or is confusing for human users...) I get almost no comment spam. On the other hand, the trackback system is an unadorned HTTP POST, so it's trivial to automate trackback spamming. Akismet's great service correctly marks most of the spam, which is good, given that the ratio of spam to ham trackbacks is something like 5000:1, but it's still a pain for me to wade through gobs of spam looking for possible ham. Even worse, it's a safe assumption that the spambots are gobbling up bandwidth and CPU when they spider through my content looking for trackback links. Nonetheless, I'm not willing to give up and turn the feature off, even for old posts.

With the server logs from the last several months of traffic, I used grep, awk, sort, uniq, and column to get a listing, sorted by count, of the originating IP addresses for trackbacks, with rolled-up counts for class B and class C subnets. (It wasn't worth filtering out the few ham trackbacks, since they're insignificant in the overall numbers.) The results were useful: most of the spam trackback POST traffic was coming from a few individual IP addresses and a couple of subnets. From there, a little iptables magic from the httpd tools that accompany the (very good) book Apache Security by Ivan Ristic, and the gush of 2000-3000 trackback spams/day was down to a trickle of <100/day. (Ivan's blacklist tool doesn't support IP ranges yet, so I added rules manually to block the 72.232 class B subnet and a couple of class C subnets.)
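A rough sketch of that kind of roll-up, written as shell functions. It assumes a combined-format access log with the client address in the first field and trackback URLs that contain /trackback, and the function names are made up for illustration. The last helper only prints an iptables rule for review rather than installing it.

```shell
# Sketch only: the log format, the "/trackback" URL pattern, and the
# function names are assumptions, not the commands actually used.

# Print the client address of every trackback POST in an access log.
trackback_ips() {
    grep '"POST [^"]*/trackback' "$1" | awk '{ print $1 }'
}

# rollup N: reduce addresses on stdin to their first N octets and print
# "count prefix" lines, most frequent first (N=4 host, 3 class C, 2 class B).
rollup() {
    cut -d. -f1-"$1" | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Print (do not run) a rule that drops HTTP traffic from a host or CIDR range.
block_rule() {
    printf 'iptables -I INPUT -s %s -p tcp --dport 80 -j DROP\n' "$1"
}
```

For example, `trackback_ips access.log | rollup 2 | head` would list the busiest class B prefixes, and `block_rule 72.232.0.0/16` prints the matching rule so it can be checked before being run as root.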

The heuristic is satisfactory for the time being, but it does seem like trackbacks are doomed without some kind of authority or authentication system in place. As alternatives, the Technorati linkcount and the Yahoo! Site Explorer API and badge (Y! ID required) look promising.


The Eye of the Spam Storm

Paul Brown @ 2007-01-01T14:40:30Z

On Thursday last week (2006-12-28), for no apparent reason, the otherwise incessant flow of comment and trackback spam stopped completely for about six hours and then started up again at the usual pace (~100 spams/hour). Maybe it was an echo of the Internet outage in Asia? Not that I'm complaining.


Unclear on the Concept

Paul Brown @ 2006-09-05T03:56:00Z

I got a spam this weekend for "Blogger and Podcaster Magazine". In all likelihood, this is just a case of someone identifying a demographic where the advertising revenues from a magazine could exceed the cost of printing and mailing a bunch of free copies, but it strikes me as a case of being unclear on the concept. A hard-copy, snail-mailed magazine targeted at people who are deeply involved in blogging, podcasting, and other purely online activities? That's just the kind of content that you can't find online... [sic.]

I don't get it, but maybe that's just me.


Ph.D. Holding Me Down

Paul Brown @ 2006-02-13T22:07:00Z

From some spam today:

Have you ever thought that the only thing stopping you from a great job and better pay was a few letters behind your name?

How to Integrate an External Spam Filter with Mail.app

Paul Brown @ 2004-01-17T08:00:00Z

This entry outlines how to integrate procmail and bogofilter with Mail.app.

For motivation, the recent rash of word-spew spam rendered the junk filter in Mail.app ineffective (for me). With no way to upgrade the filter functionality built into Mail.app, I decided to get an external, upgradeable, extensible filter working. Once all was said and done, I had procmail and bogofilter working and word list management AppleScripts integrated into Mail.app.

Mail Processing Pipeline Construction and Configuration

Mail.app has a lot going for it, like great search capabilities, a multi-threaded front-end, and integration with AddressBook, among other things. However, there are a couple of gotchas for using Mail.app with standard MDAs like procmail that expect to operate on streams and spool files:

  • Mail.app, as far as I can tell, can only pull mail from a POP3, IMAP, or Exchange server. (It may be possible to simply use one of the mbox files inside Mail.app's folder hierarchy as a spool, but without documentation about file locking and caching behavior, I'm not going to touch it.)
  • Mail.app's rule processing is side-effect-free, so any in-client message processing would have to be done in one or more independent AppleScript scripts that did not alter the message. (This would require, for example, that each AppleScript move the message, which is undesirable.)

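To get around these issues, I settled on a mail processing pipeline that still presents a POP3 interface to Mail.app: fetchmail pulls mail from the remote server and hands each message to procmail, procmail pipes it through bogofilter and delivers it to a local mailbox, and a local POP3 server serves that mailbox to Mail.app.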

Mac OS X 10.3 comes with procmail and fetchmail, so that much is easy. For the local POP3 server, I used qpopper, and the setup instructions in Adriaan Tijsseling's blog are straightforward.

With BerkeleyDB built and installed from scratch (the version from darwinports puts files in strange locations that are incompatible with the bogofilter build), a fresh download of bogofilter 0.16.1 compiles cleanly with:

./configure --with-libdb-prefix=/usr/local/BerkeleyDB.4.2

(I do have the BerkeleyDB lib and include directories declared in the CPATH and C_INCLUDE_PATH environment variables.)

To get bogofilter to insert an X-Bogosity header into each mail, the .procmailrc that I'm using is:

MAILDIR=/Users/prb

:0fw
| bogofilter -u -e -p

:0:
Mailbox
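Before wiring the filter into procmail, it can help to confirm that bogofilter's passthrough mode really tags a message. A minimal sketch of such a check follows. The helper name is hypothetical, and it only looks for the header name, not for any particular header value.

```shell
# has_bogosity_header (hypothetical helper): succeed if the message on
# stdin carries an X-Bogosity header in its header block.
has_bogosity_header() {
    # sed stops at the first empty line, so the body is never searched.
    sed '/^$/q' | grep -qi '^X-Bogosity:'
}
```

With a saved message in a file (the name message.txt is just an example), `bogofilter -e -p < message.txt | has_bogosity_header && echo tagged` should print "tagged" once bogofilter and its word list are set up.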

The .fetchmailrc (with the host, login, and password changed...) is:

set daemon 120

defaults no rewrite

poll mailhost.com with protocol POP3:
  user login password secret
  mda "procmail -d %T"
  ssl, no keep, no forcecr

Integration with Mail.app

With the mail processing pipeline working, I shut off Mail.app's junk mail processing and added a rule that triggers on the X-Bogosity header of messages that bogofilter marks as spam.

The more difficult piece of integration is a way to manage the bogofilter word list from within Mail.app. Because I'm using the -u option, which automatically adds to the counts in the word list, I need to rerun bogofilter with either -Ns (decrement the ham counts, increment the spam counts) or -Sn (decrement the spam counts, increment the ham counts), as appropriate, to reclassify a message.
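Outside of Mail.app, the same choice between the two flag pairs can be sketched as a small shell wrapper. The function names here are invented for illustration, and the wrapper assumes the message has already been saved to a file and that bogofilter is on the PATH.

```shell
# Sketch only: function names are hypothetical.

# Map the corrected class to the flags described above.
bogo_reclassify_flags() {
    case "$1" in
        spam) printf '%s\n' '-Ns' ;;  # was counted as ham: undo ham, add spam
        ham)  printf '%s\n' '-Sn' ;;  # was counted as spam: undo spam, add ham
        *)    echo "usage: bogo_reclassify_flags spam|ham" >&2; return 2 ;;
    esac
}

# reclassify CLASS FILE: retrain bogofilter on one saved message.
reclassify() {
    flags=$(bogo_reclassify_flags "$1") || return
    bogofilter "$flags" < "$2"
}
```

For example, `reclassify spam message.txt` would rerun bogofilter with -Ns on a message that slipped through as ham.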

Here is a script that writes the content of a selected message or messages out to a file and then runs the appropriate shell commands (in this case, to reclassify the messages as spam):

using terms from application "Mail"
  on perform mail action with messages msgs
    if (count of msgs) is not equal to 0 then
      repeat with msg in msgs
        -- Scratch file in the temporary items folder: nam is the HFS path
        -- for AppleScript file I/O, Pnam the shell-quoted POSIX path.
        set t to path to temporary items
        set posixT to POSIX path of t
        set nam to (t as string) & "bogotemp"
        set Pnam to quoted form of ((posixT as string) & "bogotemp")
        -- Start from a clean file, then dump the raw message source into it.
        do shell script "rm -f " & Pnam
        set f to open for access file nam with write permission
        write ((source of msg) as string) to f
        close access f
        -- Unregister the message as ham and register it as spam.
        do shell script "/usr/local/bin/bogofilter -Ns < " & Pnam
        do shell script "rm -f " & Pnam
      end repeat
    end if
  end perform mail action with messages
end using terms from

This is the equivalent of the much more compact

bogofilter -Ns < msg

on the command line. It is also possible to get message statistics from bogofilter in Mail.app. For example, this script populates a new message with the words and statistics from a selected message using the same temporary file trick:

using terms from application "Mail"
  on perform mail action with messages msgs
    set bogodirectory to "/usr/local/bin/"
    set bogodb to "/Users/prb/.bogofilter/wordlist.db"
    if (count of msgs) is not equal to 0 then
      repeat with msg in msgs
        set t to path to temporary items
        set posixT to POSIX path of t
        set nam to (t as string) & "bogotemp"
        set Pnam to quoted form of ((posixT as string) & "bogotemp")
        do shell script "rm -f " & Pnam
        set f to open for access file nam with write permission
        write ((source of msg) as string) to f
        close access f
        set spamicity to bogodirectory & "bogofilter -e -t  < " & Pnam
        set score to "spamicity: " & (do shell script spamicity)
        set summary to bogodirectory & "bogolexer -p -I " & Pnam
        set summary to summary & " | " & bogodirectory
        set summary to summary & "bogoutil -v -p " & bogodb
        set summary to summary & " | sort | uniq | sort -k 4 -r"
        set lst to paragraphs of (do shell script summary)
        set text item delimiters to return
        set wordList to score & return & return
        set wordList to wordList & (end of lst & return & rest of lst)
        tell application "Mail" ¬
               to make new outgoing message ¬
               with properties {subject:score, ¬
               content:wordList, visible:true}
        do shell script "rm -f " & Pnam
      end repeat
    end if
  end perform mail action with messages
end using terms from

If you're lazy or want something to use as a starting point, here are the scripts I'm using. To install, unpack in ~/Library/Scripts/Mail\ Scripts.

I'm a relative newbie to AppleScript and the internals of inter-application communication on Mac OS X, so exploring the different ways to integrate procmail and bogofilter with Mail.app was educational. At the very high end, it is apparently possible to build native plug-in bundles for Mail.app. (I couldn't even get started because Xcode kept crashing on me...) Example source code (here and here) is available, but I wasn't able to locate any official documentation. I also looked into what MacPython and Mac::Glue could do.

Notes

Thanks to Lucas Bergman, FiveSight's guru of all things mail, for helping with the rc files and various mucking around.

Actually writing AppleScript is awful, and the unpalatable syntax is compounded by the lack of any useful debugging aid other than popping up dialogs through the course of a script. I tried to avoid it, but it appears to have been the shortest path. (If you must, it looks like AppleScript: The Definitive Guide is a good source of information and examples.)

An incomplete list of alternatives to the do-it-yourself approach with procmail:

  • Ben Han has a nice (free) package JunkMatcher that can perform regular expression-based filtering along with a bunch of other tests. His approach uses a callback to the scripting interface of Mail.app from a Python application to do the work.
  • SpamSieve looks like a great deal at $25. (It took more than $25 worth of my time to get this all working...) Nonetheless, it doesn't provide for arbitrary extensibility, and in its current incarnation, it can only mark, not move, messages from a POP3 account. (This is, as the SpamSieve FAQ points out, Mail.app's fault, not SpamSieve's.)