Splitting XML Well with XSLT 2

Paul R. Brown @ 2009-09-30T18:25:32Z

I recently needed to split a result set from a Solr query into a collection of smaller groups of add requests for POSTing into a different core. There are ways to make the split work with text processing tools (split and friends), but it's always an open question whether an ad hoc approach will trip over some markup, so it's just better to use XML tooling. By no coincidence (based on features missing from XSLT 1), XSLT 2 makes it easy to do the right thing.

First up is grouping in chunks of 2000 records:

<xsl:for-each-group select="/response/result/doc"
                    group-by="(position() - 1) idiv 2000">
...
</xsl:for-each-group>

Outputting each hunk to a file named for the index of the group is also a one-liner:

<xsl:result-document href="{current-grouping-key()}_out.xml">
  <add>
    <xsl:for-each select="current-group()">
      <doc>
        <xsl:apply-templates />
      </doc>
    </xsl:for-each>
  </add>
</xsl:result-document>

And that's it. The only trick is choosing an XSLT 2 processor, and the superlative Saxon (from Saxonica) is my default choice.
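To check which output file a given record lands in, the key arithmetic can be reproduced outside of XSLT. The sketch below is not part of the stylesheet; it assumes the key is the zero-based record position divided by the chunk size with the remainder discarded, and `chunk_key` is a hypothetical helper name. Rounding instead of truncating would shift every boundary by half a chunk.

```shell
#!/bin/bash
# Hypothetical helper: compute the chunk index for a 1-based record position.
# The chunk size defaults to 2000 to match the stylesheet above.
chunk_key() {
  local position=$1 size=${2:-2000}
  # Shell arithmetic truncates, so records 1..2000 map to 0, 2001..4000 to 1, etc.
  echo $(( (position - 1) / size ))
}

# Print the output file name for a few positions around the chunk boundaries.
for p in 1 2000 2001 4000 4001; do
  echo "record ${p} -> $(chunk_key "$p")_out.xml"
done
```

Records 2000 and 2001 should land in different files, which is a quick way to confirm the boundaries fall where expected.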


Commandline Puzzler

Paul R. Brown @ 2009-09-25T19:39:39Z

Suppose that you have two files that consist of records, one per line, and you want to ensure that none of the records in the second file appear in the first. How do you do it with only the text processing commandline tools commonly available on *nix systems?
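One possible answer, offered as a sketch rather than the definitive solution: treat the second file as a list of fixed-string patterns and let grep print only the lines of the first file that match none of them. The function name `remove_records` and the throwaway file names are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the records of FIRST that do not occur in SECOND.
# -F: patterns are fixed strings, -x: match whole lines only,
# -v: invert the match, -f: read the patterns from a file.
remove_records() {
  local first=$1 second=$2
  grep -v -x -F -f "$second" "$first"
}

# Usage with throwaway files.
tmp=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$tmp/first"
printf 'b\nd\nz\n' > "$tmp/second"
remove_records "$tmp/first" "$tmp/second"
rm -rf "$tmp"
```

This keeps the first file's original order. If order doesn't matter, sorting both files and comparing them with comm is another route.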


NoNoSQL

Paul R. Brown @ 2009-09-24T05:38:08Z

Ben Black, one of the organizers of the no:sql(east). conference, tweeted, and I twote:

my current vote for renaming #nosql is #altdb. what are your ideas?

Chris Williams, another no:sql(east). organizer, has had similar sentiments, but what really needs to happen is for people to stop using the "NoSQL" term. I originally proposed "dbng" for next-generation database (and with an intended allusion to RELAX NG), but I'm warming up to Ben Black's suggestion of "altdb" for the hint of Usenet alt.* if nothing else.

I propose a new movement called the NoNoSQL movement. It is a movement for those interested in alternative and next-generation databases but not in the inaccurate "NoSQL" neologism.

Seems like some cool altdb schwag (t-shirts, mugs, etc.) is in order — "Not your daddy's database." or "Joiners need not apply." or...


Voldemort-Based Twitter Clone Talk at OSCON

Paul R. Brown @ 2009-07-24T00:08:00Z

Dan and I just finished up our talk at OSCON. You can download the slides or view them on Slideshare. I'll probably take it down at some point in the near future, but the sample system from the presentation is up and running for the moment.

We got started on the material for the talk several months back with the Twitter one-to-many publishing problem as a motivating problem to play with various non-relational data stores, and after some dabbling with Cassandra and HBase, we ended up focusing on Voldemort as an initial backend for the system. It is very likely that we'll craft some additional backends, and I'd particularly like to get to a more forgiving model for storing lists. (I'm already part way there on Dynomite with Osmos as the storage engine.)

The system described in the talk uses a small (two nodes) Voldemort cluster and a small cluster of web nodes (JAX-RS with a jQuery front-end) to implement enough microblogging functionality to be interesting — users, follow/followed, publishing — along with a simple dashboard implemented with Cacti and rudimentary deployment automation. The source is out on GitHub if you want to take a look. (Feel free to fork with it...)

[dashboard snapshot]

Dan's blog entry on the presentation is here.


If you have nothing to say, say nothing

Paul R. Brown @ 2009-06-05T20:18:16Z

There is never a good reason to announce that you're going to make an announcement. This rule came to mind when I saw this tweet scroll by this morning:

[screenshot of tweet]

This belongs in the same category of non-actions as a blog post to say you haven't been blogging, telling people about your "stealth" startup, or a statement like "with all due respect".


Speaking at OSCON 2009

Paul R. Brown @ 2009-05-29T05:08:55Z

With Dan Diephouse, I'll be speaking at OSCON on July 23.

Taking the abstract literally, the talk looks like it is about building a Twitter clone with open source components, but it is not at all intended to be armchair quarterbacking about Twitter's early problems with availability. (We should all have these problems!) Rather, the talk is intended to be about some of the current crop of interesting open source distributed storage technologies — Cassandra, Voldemort, Redis (where the folks have already done some thinking about Twitter-like apps), CouchDB, HBase, Dynomite — as well as how to attack some of the operational problems (e.g., deployment, instrumentation, application updates) that come with using new tools in multi-node environments.

That's obviously quite a bit to fit into a relatively short speaking slot, but Dan and I plan to blog or otherwise publish material that won't fit.


Integrating Github and Redmine

Paul R. Brown @ 2009-05-27T05:37:21Z

I've been a fan and user of Atlassian's excellent Jira since the company was founded back in 2002, but I needed the ability to set up some quick-hit bug/task/wiki sites for smaller consulting projects and neither the month-to-month hosted model nor the enterprise license made good economic sense. I opted for an install of Redmine, and while it's no Jira, I've been reasonably happy with it. (The one big headache was getting SMTP over TLS working.)

Redmine supports integration with Git repositories on a per-project basis and will link commits to issues based on the presence of keywords and issue identifiers (e.g., "refs #123"). The way the integration is implemented works well if the Git repository is hosted on the same machine as the Redmine instance, but I host all customer and internal work on github. Here's a quick recipe to bridge the gap.

First, add an ssh key for the redmine user to your github account.

Next, create a home for the following shell script, e.g., /opt/redmine_extras/bin and a home for Git repositories on the server, e.g., /var/redmine/git_repositories and ensure that the redmine user has write privileges for the repositories. Here's the pull_git script:

#!/bin/bash
# Fetch every bare Git repository under REPOS, then have Redmine import the new changesets.
export REPOS=/var/redmine/git_repositories
export REDMINE_HOME=/opt/redmine-0.8.2
export LOGFILE=/var/log/redmine_extras.log

# Print a timestamp and this script's PID, without a trailing newline.
function log_prefix {
        echo -n "$(date '+%Y/%m/%d %H:%M:%S') [$$] "
}

# Glob instead of parsing ls output; skip any repository we cannot enter.
for i in "${REPOS}"/*.git; do
  cd "$i" || continue
  log_prefix && echo "Processing git repository from ${i}..."
  /usr/local/bin/git --bare fetch origin :master
done

cd "${REDMINE_HOME}" || exit 1
log_prefix && echo 'Updating Redmine...'
/usr/local/bin/ruby script/runner "Repository.fetch_changesets" -e production

Then (I'm logged in as root) add the command to the redmine user's crontab:

# echo '*/10 * * * *    /opt/redmine_extras/bin/pull_git >> /var/log/redmine_extras.log 2>&1'\
 | crontab -u redmine -

Now, for each repository that you will track from Redmine (say the repository is foo and your github user is bar), do:

# cd /var/redmine/git_repositories
# sudo -u redmine -H git clone --bare git@github.com:bar/foo.git
# cd foo.git
# sudo -u redmine -H git --bare remote add origin git@github.com:bar/foo.git

Ensure that the Redmine project points to the local copy of the Git repository, and the revisions should start getting synchronized every ten minutes.
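Once the sync is running, only commits whose messages carry a keyword plus an issue identifier will be linked. As a rough illustration of what such a message looks like, the sketch below pulls issue numbers out of a commit message with grep. The function name `referenced_issues` and the keyword list are assumptions for illustration; this is not Redmine's own matcher, and Redmine's keyword list is configurable.

```shell
#!/bin/bash
# Hypothetical helper: list the issue numbers referenced in a commit message on stdin.
# The keyword list is an assumption for illustration, not Redmine's implementation.
referenced_issues() {
  # First grep isolates "keyword #N" fragments; the second keeps only the number.
  grep -o -i -E '(refs|fixes|closes) #[0-9]+' | grep -o -E '[0-9]+$'
}

echo 'Tighten input validation, refs #123' | referenced_issues
```

A message with no keyword produces no output, which is the case where no link between commit and issue would be expected.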

