Reduce, Reuse, and as a last resort, Recycle

Paul Brown @ 2007-04-08T15:52:43Z

The bagel in a CD spindle case image floated by twice last week, once in TreeHugger and once in Assaf's new tumblog.

The mantra "reduce, reuse, recycle" is in my head from someplace, and maybe it comes from growing up in a house where we put out a couple of cans of garbage per year. Recycling has a green halo, but it's not as important as reduce or reuse. The best possible thing is simply not to consume in the first place other than to meet your realistic needs, and reuse comes second. Recycling is a distant third because it consumes resources — items must be collected and processed before being used as raw materials for something new. Not everyone can be no-impact-man (NYT, registration required), but I'm not even going to list "just throwing things away" as a choice. (Here in the Emerald City, it's illegal.)

Any health considerations of the CD spindle case aside (see thread on reddit, or read elsewhere on why plastic is the new lead), packaging is wasteful, and there should be a way to reuse it.

For example, I've wondered about the economics of an old school glass milk bottle-style exchange system applied to restaurant take-out containers. (For younger readers, the milk bottle system worked like this: pay $1 for a heavy glass milk bottle once, bring it back to the store when empty, pay for new milk and trade for a new bottle; the dairy cleans and refills the bottles, etc.) The consumer would pay a non-refundable deposit (e.g., $15) on the take-out containers, and old containers could be traded for new. The little details include what the containers would be made of, how they could be cleaned, what inventories would have to be like for a restaurant in terms of different sizes and form factors, and of course, whether consumers would go for it. There is also the question of whether or not reuse of glass containers (with the initial cost of manufacturing and ongoing impact of water use, detergents, transportation, etc.) would be lower-impact than single-use paper containers. The lightweight DIY version would be to bring empty containers to the restaurant to be filled, but that doesn't work with phone or delivery orders.

(comment bubbles) 0 comments

Trackback Spam and iptables Recipes

Paul Brown @ 2007-03-23T02:46:06Z

It was interesting to read the conclusions in the Chen-Ma-Niu-Wang paper "Spam Double-Funnel: Connecting Web Spammers with Advertisers", specifically about how most of the page spam involves permissive content posting sites and a relatively small group of IP addresses. (There is also a NYT article about the paper.) It's similar to my experiences with trackback spam.

The comment system in the version of typo that I'm using (for the time being) relies on Javascript, and it's apparently not widespread enough that someone's decided to crack it. (In fact, based on the fact that most commenters post several of the same comment, I expect that it either doesn't work well or is confusing for human users...) I get almost no comment spam. On the other hand, the trackback system is an unadorned HTTP POST, so it's trivial to automate trackback spamming. Akismet's great service correctly marks most of the spam, which is good, given that the ratio between spam and ham trackbacks is something like 5000:1, but it's still a pain for me to wade through gobs of spam looking for possible ham. Even worse, it's a safe assumption that the spambots are gobbling up bandwidth and CPU when they spider through my content looking for trackback links. Nonetheless, I'm not willing to give up and turn the feature off, even for old posts.

With the server logs from the last several months of traffic, I used grep, awk, sort, uniq, and column to get a listing, sorted by count, of the originating IP numbers for trackbacks with rolled-up counts for B and C subnets. (It wouldn't be useful to eliminate the few ham trackbacks, since they'll be insignificant in the overall numbers.) The results were useful: most of the spam trackback POST traffic was coming from a few individual IP addresses and a couple of subnets. From there, a little iptables magic from the httpd tools that accompany the (very good) book Apache Security by Ivan Rustic, and the gush of 2000-3000 trackback spam/day was down to a trickle of <100/day. (Ivan's blacklist tool doesn't support IP ranges yet, so I added a rule manually to block the 72.232 class B subnet and a couple of class C subnets.)

The heuristic is satisfactory for the time being, but it does seem like trackbacks are doomed without some kind of authority or authentication system in place. As alternatives, the Technorati linkcount and the Yahoo! Site Explorer API and badge (Y! ID required) look promising.

(comment bubbles) 0 comments

Not Quite No Transactions

Paul Brown @ 2007-03-20T02:56:00Z

I enjoyed Dan Pritchett's presentation on The EBay Architecture, but now it seems like people are getting carried away, where the intersection of people with my blogroll contains Martin Fowler. Martin is well-known and rightly regarded as a smart guy, but this fragment from his post on "no transactions" isn't quite right:

The rationale for not using transactions was that they harm performance at the sort of scale that eBay deals with. This effect is exacerbated by the fact that eBay heavily partitions its data into many, many physical databases. As a result using transactions would mean using distributed transactions, which is a common thing to be wary of.

(Unfortunately, Martin's bliki doesn't support comments, or I would be posting this over there.) The relevant bullet from Dan's presentation at SDForum (not the one at QCon) is:

Absolutely no client-side transactions.
∗ Single database transactions managed through anonymous PL/SQL blocks.
∗ No distributed transactions.

This is very different than "no transactions", but the game of telephone is already underway.

The reality is that attempts at being "for sure, for sure" (multi-master replication, two-phase commit, locking, aggressive retry, etc.) can and do inhibit scale. There are also plenty of stupid things you can do to hobble an application without leaving a single database, usually by inadvertently abusing locking. (One classic is a SELECT ... FOR UPDATE that includes a join across a code table.)

In terms of things that can be done to scale, it's a good idea to take the shortest path to a working system. It's a good idea to be intimately familiar with the performance characteristics of a system in production, and that means really, truly understanding how all of the pieces (load balancer, web server, application tier, database, physical network, hardware, etc.) work individually and collectively. It's important to be able to disentangle design choices from business constraints, and this is more or less the point that Assaf made in the original round of discussion about Dan Pritchett's presentation. As Dare Obasanjo and Charlie Wood point out, life without transactions isn't necessarily a picnic, either. Thinking about "what can I live without" in the abstract will lead down the path toward something like what Pat Helland describes in his Life Beyond Distributed Transactions paper, but that journey should be taken in pragmatic steps driven by experiment and experience. (And now that you're out of the jungle, Pat, can you hurry up and go back to blogging?)

(comment bubbles) 3 comments

Java Brainteaser on Regular Expressions

Paul Brown @ 2007-03-20T00:32:11Z

Suppose that you have a Java web application where regular expressions are used deep down in the implementation to do some work, but you observe that the an array index exception is occurring sporadically where the regular expressions are being used.

What's causing the exceptions? What's the solution?

(comment bubbles) 1 comment

STM and IO Redux

Paul Brown @ 2007-03-13T13:05:00Z

Pepe Iborra left a comment on my entry on STM and IO about the use unsafeIOToSTM that spurred me to do some more reading and ask a few questions by email. (Better yet, people who knew the answers were kind enough to respond.)

Better without unsafeIOToSTM

The consensus was to avoid the use of unsafeIOToSTM and just combine the IO actions in the IO monad. This changes things around a bit but in a good way; refactoring (if the word applies in this scenario) only took about 15 minutes.

Disregarding the suggestion to use TMVar for the moment, here are some revisions. (If you look at TMVar, the source is more informative than the documentation.)

First, check_out and check_in need to change, and storing an entry to disk can get simpler:

check_out :: TVar Holder -> IO Holder
check_out h = atomically ( do { h' <- readTVar h
                              ; if locked h'
                                then retry
                                else writeTVar h (lock h')
                              ; return h' } )

check_in :: TVar Holder -> Holder -> IO ()
check_in h' h = atomically ( do { h'' <- readTVar h'
                                ; if (locked h'')
                                  then writeTVar h' (unlock h)
                                  else error "Internal error." } )

store :: Holder -> IO Holder
store h = do { let e = entry h
              ; do writeFile (fname e) ((show e) ++ "\n")
              ; return h }

And two additional utility functions:

onHolders :: (Entry -> Entry) -> (Holder -> Holder)
onHolders f = \ (Holder e l) -> Holder (f e) l

s_apply :: (Entry -> Entry) -> TVar Holder -> IO ()
s_apply f h' = check_out h'
               >>= (store' . onHolders f)
               >>= (check_in h')

(The ">>=" in IO is sequencing where the output of each step is passed to the next as input.) The flow of s_apply is as follows:

  1. "Check out" the entry by setting the locked field to True and then pass the Holder to the next step.
  2. Apply the function f to the Entry wrapped in the Holder, write the result to disk, and pass it along.
  3. "Check in" the entry by setting the locked field to True and writing the new value into the TVar.

Publishing and unpublishing a persistent Entry now has the appealingly simple form:

s_publish :: TVar Holder -> IO ()
s_publish = s_apply publish

s_unpublish :: TVar Holder -> IO ()
s_unpublish = s_apply unpublish

Even Better with bracket

An anonymous commenter pointed out bracket, a function that has the same semantics as try { ... } finally { ... } in Java. The bracket function has the signature:

bracket :: IO a -> (a -> IO b) -> (a -> IO c) -> IO c

In the analogy with try/finally from Java, the IO a would occur before the try, like lock'ing a Lock in the usual idiom. The result of the initial computation is passed both to the inner computation and to the final computation, so the application of the function (e.g., publish) would need to be grouped with the check_out operation if the published Entry was to be the one checked back in. For my purposes, bracketOnError is preferable, since it only executes the fallback action if the inner action (i.e., the last argument) fails. With bracketOnError added and a little more clean-up from another pass over the code, everything gets a little simpler yet:

store :: Entry -> IO Entry
store e = do { writeFile (fname e) ((show e) ++ "\n")
             ; return e }

check_out :: TVar Holder -> IO Entry
check_out h = atomically ( do { h' <- readTVar h
                              ; if locked h'
                                then retry
                                else writeTVar h (lock h')
                              ; return $ entry h' } )

check_in :: TVar Holder -> Entry -> IO ()
check_in h' e = atomically ( do { h'' <- readTVar h'
                                ; if (locked h'')
                                  then writeTVar h' (Holder e False)
                                  else error "Programmer error." } )

s_apply :: (Entry -> Entry) -> TVar Holder -> IO ()
s_apply f h' = bracketOnError (check_out h')
               (\e -> (store.f) e >>= (check_in h'))
               (check_in h')

With s_publish as before, this does the right thing in the event of an error while writing:

*Main> th <- atomically ( newTVar (Holder (Entry "foo" False) False ))
*Main> s_show th
"Holder {entry = Entry {entry_id = \"foo\", published = False}, locked = False}"

*Main> :! chmod -w entry-foo.hb
*Main> do s_publish th
*** Exception: entry-foo.hb: openFile: permission denied (Permission denied)
*Main> s_show th
"Holder {entry = Entry {entry_id = \"foo\", published = False}, locked = False}"

*Main> :! chmod +w entry-foo.hb
*Main> do s_publish th
*Main> s_show th
"Holder {entry = Entry {entry_id = \"foo\", published = True}, locked = False}"

Whither TMVar?

From the TMVar source code comments:

A 'TMVar' is a synchronising variable, used for communication between concurrent threads. It can be thought of as a a box, which may be empty or full.

The idea would be that when an Entry was locked, its TMVar "box" would be empty, to be filled with a TVar wrapping the new value after the operation was completed. Other threads (e.g., threads rendering page views or feeds) need to be able to read values while output is being performed, so I don't think that a TMVar is what I'm after in this case.

(comment bubbles) 0 comments

OK... Who put the JUnit JAR in my WAR?

Paul Brown @ 2007-03-07T02:26:15Z

My odyssey with Maven continues. This entry is spurred by having a WAR built with Maven come out to be three times the size of the one built with the original Ant build. JUnit, JMock, a couple of different Log4J's, and other assorted goodies. With multiple modules and liberal use of open source components, the question is not whether someone did but who peed in which POM?

Open source reminds me of college. I had the opportunity to enjoy some eclectic people during my education at Reed and Berkeley. Rent a room and then sublet the closet? That's cool. Eat what others would otherwise throw away in the dining commons? That's cool. (Off topic, at least one former "scrounger" has done just fine...) These sort of situations came with their own etiquette, e.g., tell a "scrounger" if you have a cold when you drop off your tray and leave items intact and relatively unmolested. The bohemian analogy cuts both ways with open source — you can probably find whatever you are looking for, but it may not be in quite the state that you'd like.

Some shell scripting (find, grep, xargs, and friends) identified commons-httpclient as the likely culprit, and sure enough, it's there plain as day:

<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>3.8.1</version>
</dependency>

There should be a <scope>test</scope>, but there isn't. Since he helps steward the commons, I pinged Henri about it, and it looks like the issue was already fixed for versions 3.1 and on. This was only part of the battle, however, because commons-httpclient wasn't an explicit dependency; it was only getting included as a transitive dependency of some other dependency of one of the modules that the web application used, and the module hierarchy was already four levels deep. Surely someone else has already experienced issues with dependencies of unknown provenance and come up with a way to navigate the graph, and it turns out that there are (at least) two solutions.

First up, for playing Heracles to my Odysseus or Anchises to my Aeneas or Virgil to my Dante or Laurel to my Hardy or whatever in JAR hell, Henri gets a hat-tip for pointing me at the pomtools plugin, which provides an interactive interface for navigating the graph of dependencies and can alter and serialize the underlying model of the project to fix version conflicts. I didn't end up trying it, but I will, since I have a soft spot for anything with a terminal interface.

Instead, since I also have a soft spot for GraphViz, I used the depgraph plugin from the EL4J project, which I found via Philipp Oser's blog. In my case, the plugin produced the following graph:

The graph showed commons-httpclient referenced by a variety of XFire components, and some exclusions got me out of JAR purgatory for the moment. (I ate a couple of whole pomegranates down there, so I'm sure I'll be headed back sometime soon...) This isn't just a Jakarta Commons issue. XFire has a little of the same kind of POM-rot as of 1.2.3, but that will disappear in the forthcoming 1.2.5. For those keeping score at home, AXIS2 has some (xmlunit should be <scope>'d to test), too. This makes me wish for a Maven "lint" that would flag common errors like test libraries listed as runtime dependencies or dependencies not referenced from runtime source code.

Getting the depgraph plugin wired-up was straightforward. I just added a plugin repository to the master POM:

<pluginRepositories>
  <pluginRepository>
    <id>elca-services</id>
    <url>http://el4.elca-services.ch/el4j/maven2repository</url>
    <releases>
      <enabled>true</enabled>
    </releases>
  </pluginRepository>
</pluginRepositories>

Then the plugin to the build:

<build>
[...]
  <plugins>
    [...]
    <plugin>
      <groupId>ch.elca.el4j.maven.plugins</groupId>
      <artifactId>maven-depgraph-plugin</artifactId>
      <configuration>
        <outDir>target/site/images</outDir>
        <outFile>${pom.artifactId}.png</outFile>
      </configuration>
    </plugin>
  </plugins>
</build>

And then it's just a mvn depgraph:depgraph to get a view of the dependency graph. The real lesson here is to aggressively scope your dependencies as a service to the community.

(comment bubbles) 5 comments

STM, IO, and a Simple Persistence Model

Paul Brown @ 2007-03-04T17:57:00Z

Herein post 5 of n on my hobby project to rewrite my personal publishing software in Haskell. (In strict terms, the project is to write it, since I didn't write the current system.) This post covers persistence and concurrency using the filesystem and Haskell's software transactional memory implementation.

Exploiting Commutativity and Choosing Locking Granularity

As I imagine things working, the basic operations that I want to be able to perform against the persistent form of the blog are something like:

  • Create an entry (and by extension, a comment).
  • Change the metadata on an entry, e.g., publish/unpublish or add/remove tags.
  • Add a comment to an existing entry.

From an end-user perspective, these all commute with each other — it doesn't matter whether a comment is added before or after a tag is changed — so it's reasonable to let the system take care of ordering the operations to be performed. Moreover, because creation commutes with linking, locking granularity can be limited to an individual entry. (There is no reason to lock both the newly created comment and its parent entry simultaneously.)

Without further ado, here's a locking scheme implemented at the granularity of an entry. This would be used only for writes. First, a wrapper type to hold the lock status for an entry:

data Holder = Holder { entry :: Entry,
                       locked :: Bool }
            deriving (Show)

And then the lock/unlock code:

lock :: Holder -> Holder
lock (Holder e False) = Holder e True
lock (Holder e True) = error "Already locked."

unlock :: Holder -> Holder
unlock (Holder e True) = Holder e False
unlock (Holder e False) = error "Already unlocked."

It's worth stopping to observe a common construct in functional programming. A lock function that locks a Holder can't exist because all values are immutable. Instead, lock creates a new Holder that has locked set to True but is otherwise identical to the original, and we can use the STM mechanics to create actions to be applied to a TVar:

check_out :: TVar Holder -> STM ()
check_out h = do { h' < readTVar h 
                 ; if locked h'
                   then retry
                   else writeTVar h (lock h') }

check_in :: TVar Holder -> STM ()
check_in h = do { h' < readTVar h
                ; if locked h'
                  then writeTVar h (unlock h')
                  else error "Already unlocked." }

The retry above will cause check_out to block until the entry is checked back in, while check_in signals an error if it is asked to release an already free entry.

By the way, the following one-liner to print the showable value wrapped in a TVar is useful for experimenting with STM in ghci:
s_show :: Show a => TVar a -> IO String
s_show = atomically.(liftM show).readTVar

Operating on Entries

To integrate operations on entries, I'm going to take the minimal use case of publishing and unpublishing, so my Entry data structure is almost trivial:

data Entry = Entry { entry_id :: String,
                     published :: Bool }
             deriving (Show)

publish :: Entry -> Entry
publish (Entry i _) = Entry i True

unpublish :: Entry -> Entry
unpublish (Entry i _) = Entry i False

Add in a function to convert an Entry -> Entry function to a Holder -> Holder function:

liftH :: (Entry -> Entry) -> (TVar Holder -> STM ())
liftH f = \ h -> do { h' < readTVar h
                    ; writeTVar h ((holderize f) h')
                    ; return () }
          where holderize f = \ (Holder e l) -> Holder (f e) l

Combining a publish with check_in/check_out is straightforward in the STM monoid. Here's some scratch work in ghci that shows this in action:

$ ghci -package stm
   ___         ___ _
  / _ \ /\  /\/ __(_)
 / /_\// /_/ / /  | |      GHC Interactive, version 6.6, for Haskell 98.
/ /_\\/ __  / /___| |      http://www.haskell.org/ghc/
\____/\/ /_/\____/|_|      Type :? for help.

Loading package base ... linking ... done.
Loading package stm-2.0 ... linking ... done.
:Prelude> :load experiment.hs
[1 of 1] Compiling Main             ( experiment.hs, interpreted )
Ok, modules loaded: Main.
*Main> let h = Holder (Entry "foo" False) False
*Main> th < atomically ( newTVar h )
*Main> s_show th
"Holder {entry = Entry {entry_id = \"foo\", published = False}, locked = False}"
*Main> let co = check_out th
*Main> let pub = (liftH publish) th
*Main> let ci = check_in th
*Main> atomically ( co >> pub >> ci)
*Main> s_show th
"Holder {entry = Entry {entry_id = \"foo\", published = True}, locked = False}"

Integrating Persistence via IO

One of my working design assumptions is that the data for the system will reside entirely in memory, being updated as changes are made and reloaded (lazily) in the event of a system crash or system start-up. (As I commented previously, four years of blogging has produced around 500kb of content, mark-up included, so this isn't an unreasonable assumption.) Comments from spammers could produce a lot more data, but I plan to save every item but only load published items into memory. (So spammers are just going to burn disk space.) I'm going to aim for one file per entry, for the sake of the current discussion, named by the entry_id of the Entry. Conveniently, STM includes the unsafeIOToSTM function for composing STM actions and IO actions. (The other way around is not permitted by design.)

Attention: I've gotten some public and private comments that unsafeIOToSTM is not the right thing to use in this scenario, so I've written a revision to this entry.

Writing an entry to a file is straightforward:

store :: TVar Holder -> STM ()
store h = do { h' < readTVar h
             ; let e = entry h'
             ; unsafeIOToSTM (writeFile (fname e) ((show e) ++ "\n")) }

Continuing the same ghci session from above:

*Main> let out = store th
*Main> :! cat entry-foo.hb
cat: entry-foo.hb: No such file or directory
*Main> atomically (  co >> pub >> out >> ci )
*Main> :! cat entry-foo.hb
Entry {entry_id = "foo", published = True}
*Main> let unpub = (liftH unpublish) th
*Main> atomically (  co >> unpub >> out >> ci )
*Main> :! cat entry-foo.hb
Entry {entry_id = "foo", published = False}

This could (and probably should) be made a bit fancier with regard to recovering from errors while writing the file, but I'm happy with the basic ergonomics so far.

(comment bubbles) 1 comment

All Posts contains 397 items in 57 pages of 7 items each:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57