Splitting XML Well with XSLT 2

Paul R. Brown @ 2009-09-30T18:25:32Z

I recently had the need to split up a result set from a Solr query into a collection of smaller groups of add requests for POSTing into a different core. There are some ways to make the split work with text processing tools (split and friends), but it's always an open question whether an ad hoc approach will trip over some markup — it's just better to use XML tooling. By no coincidence (based on features missing from ), XSLT 2 makes it easy to do the right thing.

First up is grouping in chunks of 2000 records:

<xsl:for-each-group select="/response/result/doc"
                    group-by="(position() - 1) idiv 2000">
...
</xsl:for-each-group>
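
A position-based grouping key is worth a quick sanity check, because the choice of arithmetic decides whether the first chunk is full-sized. The following is a minimal Java sketch, separate from the stylesheet, that counts how many positions land in each bucket for a rounding key and for an integer-division key on position minus one. The class and method names are hypothetical, and `Math.floor(x + 0.5)` is used on the assumption that it mirrors XPath's round-half-up behavior for positive values.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of the original stylesheet.
final class GroupingKeys {
    // Mimics round(position div size); for positive inputs, halves round up.
    static long roundKey(int position, int size) {
        return (long) Math.floor((double) position / size + 0.5);
    }

    // Mimics (position - 1) idiv size.
    static long idivKey(int position, int size) {
        return (position - 1) / size;
    }

    // Counts how many of the positions 1..total fall under each key.
    static Map<Long, Integer> bucketSizes(int total, int size, boolean useRound) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (int p = 1; p <= total; p++) {
            long key = useRound ? roundKey(p, size) : idivKey(p, size);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(bucketSizes(5000, 2000, true));   // {0=999, 1=2000, 2=2000, 3=1}
        System.out.println(bucketSizes(5000, 2000, false));  // {0=2000, 1=2000, 2=1000}
    }
}
```

With 5000 records and a chunk size of 2000, the rounding key yields a short first bucket and a stray final one, while the integer-division key yields full chunks followed by the remainder.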

Outputting each hunk to a file named for the index of the group is also a one-liner:

<xsl:result-document href="{current-grouping-key()}_out.xml">
  <add>
    <xsl:for-each select="current-group()">
      <doc>
        <xsl:apply-templates />
      </doc>
    </xsl:for-each>
  </add>
</xsl:result-document>

And that's it. The only trick is choosing an XSLT processor, and the superlative Saxon (from Saxonica) is my default choice.


ElementTraversal == Pig Lipstick

Paul Brown @ 2007-08-28T16:10:00Z

Elliotte's right — the ElementTraversal spec is lipstick on a very ugly pig that's already wearing a good amount of makeup (serialization) and a ridiculous hat (XML namespaces "support"). (I've already given the DOM a few deserved kicks.)

So what does it take to deprecate the DOM? It takes a better API on equivalent licensing terms, as more liberal licenses will tend to trump better software and many of the customers of a better XML API are at the more liberal end of the licensing spectrum, i.e., Apache. I'll second Dan's call to get XOM — or something that sucks as little as XOM does — packaged as a DOM killer, and at least from my perspective, that does not include a JSR or the JCP.

Do I smell bacon? It is, after all, the Year of the Pig...


ANTLR Tooling

Paul Brown @ 2006-07-05T21:00:00Z

I've been tinkering on a little language in Java with ANTLR v3 (still in beta, but desirable because it supports incremental parsing) as the parser generator, and so far ANTLRWorks is quite convenient as an editing environment for the grammar. (I'm also a fan and paid licensee of Prashant Deva's very nice ANTLR Studio Eclipse plugin, but it only supports ANTLR v2 for the time being.)

More approachable tooling for ANTLR will motivate fewer people to create XML languages, and the possibility of out-of-the-box support for incremental parsing in a generic editing environment (e.g., Eclipse, NetBeans, IDEA, JEdit) should make writing a real parser a hands-down winner over using an XML parser and a ream of glue code as a crutch.


Populating a Java Object Model from XML

Paul Brown @ 2006-02-05T01:02:00Z

This post describes an approach to populating a Java object model from an XML document. It's an approach that I came up with when working on a particular parsing problem.

Updates: A couple of people have mentioned XStream and XMLBeans, but those fail my tests below. XStream is a serialization tool (as its docs say), and XMLBeans was chubby in terms of the size of the libraries. For what it's worth, if I were willing to suffer a large dependency, XMLBeans version 2 looks pretty good in that it provides a token-oriented interface and location information (via XmlLineNumber).

A closet full of clothes, and not a thing to wear...?

My self-imposed requirements were as follows:

  • Populate a pre-existing Java object model from SAX events.
  • Support multiple XML dialects mapping directly to a single object model.
  • Both the XML dialects and the object model are specified a priori.
  • Impose zero additional dependencies beyond SAX; ideally, the implementation will be just a fancy ContentHandler.
  • Expose SAX location information from the parse (e.g., line and column) to the target object model during construction.
  • Expose namespace context from the parse so that expressions like QNames and XPaths in attribute values can be properly post-processed.
  • Use programmatic configuration, not properties files or XML or annotations to the schema.
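
The location and namespace requirements in the list above both map onto standard SAX callbacks. The following is a minimal sketch, assuming a namespace-aware JAXP parser that supplies a `Locator` (parsers are encouraged but not required to do so). The handler name, the `ref` attribute, and the flat prefix map are hypothetical simplifications for illustration, not PXE's code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration; the names are assumptions, not PXE's API.
final class LocatingHandler extends DefaultHandler {
    private Locator locator;                                       // supplied by the parser
    private final Map<String, String> prefixes = new HashMap<>();  // in-scope prefix -> URI
    final List<String> elements = new ArrayList<>();               // entries like "name@line"
    final List<String> resolvedRefs = new ArrayList<>();           // entries like "{uri}local"

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    // Simplification: a flat map ignores prefix shadowing in nested scopes.
    @Override
    public void startPrefixMapping(String prefix, String uri) {
        prefixes.put(prefix, uri);
    }

    @Override
    public void endPrefixMapping(String prefix) {
        prefixes.remove(prefix);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Record where the parser was when it reported this element.
        elements.add(localName + "@" + locator.getLineNumber());
        // "ref" is an assumed QName-valued attribute, resolved against in-scope prefixes.
        String ref = atts.getValue("", "ref");
        if (ref != null && ref.contains(":")) {
            String[] parts = ref.split(":", 2);
            resolvedRefs.add("{" + prefixes.get(parts[0]) + "}" + parts[1]);
        }
    }

    static LocatingHandler parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            LocatingHandler handler = new LocatingHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

A real implementation would hand the line, column, and prefix bindings to the target object as it is constructed rather than collecting strings.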

For my particular application, the XML documents would be BPEL processes in either flavor (1.1 or 2.0), and the target object model would be PXE's BPEL Object Model or “BOM”. There were additional requirements around handling extensions, but those aren't directly relevant to the approach that I settled on.

Now, surely someone else has had the same or a similar set of requirements, more or less the same sensibilities, and the altruism to post it as open source...

Of the various approaches to XML binding (the bindmark project has a good list), I didn't find any that fit the requirements. Many tools generate an object model from a schema, and while the generated models don't usually meet my taste for API ergonomics (JAXB 1.0 had a particularly rank code smell when applied to the BPEL4WS 1.1 schema...), that would be one way to go if additional dependencies were acceptable. (The idea would be to use the generated object models as data transfer objects and then maintain multiple mappings onto the internal object model as domain objects.) JiBX looked particularly interesting, but it requires XPP3 and uses bytecode enhancement, which would rule out the simultaneous support for multiple XML dialects without intermediate object models. Digester has approximately the right flavor, but the target object model wasn't particularly JavaBean-ish and location information wasn't exposed.

One of the flaws in schema-driven bindings is that XML schema rarely (if ever) encapsulates all of the semantics of the XML language that it can be used to (loosely) validate, so automated or generated bindings do at most a partial job.

The Idea and Outcomes

So I came up with a different approach. The basic idea was to construct a graph of event consumers that closely resembles the grammar for the XML document and use SAX events to walk the graph. Each edge of the graph is decorated with a function that accepts a single SAX event and returns true or false, e.g., a QName with or without an attribute mask, or a non-whitespace characters event. The edges incident to a vertex are ordered, and events are matched (or not) according to the ordering.

From another perspective, this uses the XML parser like a lexer and the graph like a parser.
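
To make the graph idea concrete, here is a minimal sketch of a `ContentHandler` that walks a small graph whose edges are guarded by predicates on events. Everything in it is an assumption for illustration: the class names, the reduction of SAX events to a kind plus a local name, and the toy grammar (a `process` element containing zero or more `invoke` elements). It is not the bpel-parser implementation, and it only records the vertices visited instead of building objects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; none of these names come from PXE's bpel-parser.
final class GraphWalker extends DefaultHandler {
    /** Simplified event: "start" or "end" plus the element's local name. */
    static final class Event {
        final String kind;
        final String name;
        Event(String kind, String name) { this.kind = kind; this.name = name; }
    }

    /** A vertex with an ordered list of predicate-guarded edges. */
    static final class Vertex {
        final String label;
        private final List<Predicate<Event>> tests = new ArrayList<>();
        private final List<Vertex> targets = new ArrayList<>();
        Vertex(String label) { this.label = label; }
        Vertex edge(Predicate<Event> test, Vertex target) {
            tests.add(test);
            targets.add(target);
            return this;
        }
        /** Returns the target of the first edge that accepts the event, or null. */
        Vertex step(Event e) {
            for (int i = 0; i < tests.size(); i++) {
                if (tests.get(i).test(e)) return targets.get(i);
            }
            return null;
        }
    }

    static Predicate<Event> start(String name) {
        return e -> e.kind.equals("start") && e.name.equals(name);
    }

    static Predicate<Event> end(String name) {
        return e -> e.kind.equals("end") && e.name.equals(name);
    }

    private Vertex current;
    final List<String> trace = new ArrayList<>();

    GraphWalker(Vertex root) { this.current = root; }

    private void consume(Event e) throws SAXException {
        Vertex next = current.step(e);
        if (next == null) {
            throw new SAXException("unexpected " + e.kind + " of " + e.name + " at " + current.label);
        }
        current = next;
        trace.add(next.label);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) throws SAXException {
        consume(new Event("start", local));
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        consume(new Event("end", local));
    }

    /** Walks the toy grammar over the given document and returns the vertices visited. */
    static List<String> parse(String xml) {
        Vertex begin = new Vertex("begin");
        Vertex inProcess = new Vertex("in-process");
        Vertex inInvoke = new Vertex("in-invoke");
        Vertex done = new Vertex("done");
        begin.edge(start("process"), inProcess);
        inProcess.edge(start("invoke"), inInvoke).edge(end("process"), done);
        inInvoke.edge(end("invoke"), inProcess);
        try {
            GraphWalker walker = new GraphWalker(begin);
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), walker);
            return walker.trace;
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

An event that no outgoing edge accepts aborts the parse, which is how a document outside the grammar gets rejected. Supporting a second XML dialect would mean building a second graph over the same target objects.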

From yet another perspective, the idea is rather like Haskell's pattern matching, in which case the whole thing could be looked at as a collection of functions that accept a list of SAX events and return an object. Each function consumes the head event from the list, selects another function to pass the tail of the list to, and adds the result of the call to the current object. (The presumption is that objects know how to add various kinds of children or metadata to themselves.) Of course, Haskell wasn't an option. (And of the two Jaskells, Jaskell has a few too many moving parts in the toolchain for my taste, and Jaskell doesn't have pattern matching.)

My first-pass implementation in Java (PXE's bpel-parser module) did the job nicely but wasn't quite as pretty at the code level as I might have liked, as it required a good amount of boilerplate. That said, and in line with the lexer/parser observation above, the boilerplate and transition set could easily be generated from a RELAX NG grammar.

Considering that JAXB 2 looks slick, has a non-regressive license, will be part of both Java EE 5 and Java SE 6, and supports passing through some XML fragments in raw form, it would be a difficult call if I faced the same problem at present. (Like JAXB 1, JAXB 2 doesn't expose location information; it can be added to the XML document as content using some SAX tricks, but that's a hack.) That said, the need to support semantics beyond those present in the schema might very well drive me down the same path again.


XML Languages and Sins of Syntax

Paul Brown @ 2006-01-15T14:49:00Z

Tim Bray wrote a post about unnecessary reinvention in XML languages, arguing that the use cases are more or less covered by the “big five”: XHTML, DocBook, OpenDocument (so, presumably also Dublin Core, MathML, SVG, and SMIL by reference), UBL, and Atom. (I don't recall his take on the MySDL meme, e.g., NSDL.) Uche Ogbuji takes a slightly looser stance and makes the excellent point that RELAX NG plus Schematron can get pretty darn good (but still not great) portable validation in excess of what a DTD or («gag» «cough») XML Schema would provide.

That said, why create an XML language at all, particularly if humans will directly create, modify, and consume the documents? How will documents be created? What is their purpose? Who will consume them? In what form? What is the difference between a schema-valid document and a correct one? (That is, can all of the constraints be expressed by the schema?) Of course, before I throw too many stones, if I may paraphrase Barabas (and Stallings), I have created an XML language, but that was in another country; and besides, no one else knew better at the time. It might be just as much trouble to specify a non-XML language that's easier for humans to compose and avoids the various pitfalls of correctly processing XML with generally available tooling, to say nothing of versioning, differencing, or patching.

Take the RELAX NG compact syntax as a case in point for creating non-XML languages instead of XML languages. The difference between

<element name="addressBook" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="card">
      <choice>
        <attribute name="name"/>
        <group>
          <attribute name="givenName"/>
          <attribute name="familyName"/>
        </group>
      </choice>
      <attribute name="email"/>
    </element>
  </zeroOrMore>
</element>

and

element addressBook {
  element card {
    (attribute name { text }
     | (attribute givenName { text },
        attribute familyName { text })),
    attribute email { text }
  }*
}

should be obvious, and coding a few grammars in the compact and verbose formats will probably convince most people of the utility of the compact representation. This is purely subjective, of course, but it is precisely subjective utility and aesthetics that I'm arguing.

On the same topic, BPEL4WS 1.1 and WS-BPEL 2.0 are examples of one XML language sin committed and one in progress. Both are programming languages, and both have an XML syntax that's just painful to type, even with a decent XML editor, and impossible to get right without additional sugar on top to ensure that namespace prefixes and WSDL component names used in expressions tie back properly. (Worse, some folks use lossy visual “editors”. It's one thing to use a design tool that translates a specific visual representation of a process into BPEL, but it's quite another to try to do fine-grained visual editing of BPEL at the detail level.) Why not something along the lines of what Brian McAllister proposed, i.e., something that looks and feels like a programming language? It would be straightforward to tie an ANTLR grammar into PXE's compiler pipeline in place of an XML parser...


SAX Events as an Alphabet

Paul Brown @ 2003-11-22T08:00:00Z

Applying XPath to a SAX stream for filtering or selective construction of heavier-weight objects (like a DOM) is one of the PAQ (perennially asked questions) on xml-dev.

Almost two years ago, I was fooling around with an approach to applying XPath expressions to SAX streams by looking at the SAX events as the letters in an alphabet and passing them through a collection of automata. If the automaton hit an accept state, the expression had a match in the document. My plan was to use the concept for high-performance XPath-based routing of streaming XML documents, and initial indications were that a SAX parse with queries was substantially faster than building a DOM (with no queries) for smallish (~5kb) XML documents.
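
The idea of treating SAX events as letters fed to an automaton can be sketched for a very small slice of XPath. The following assumes an absolute path made only of child steps with plain element names (written here as `a/b`, without a leading slash); the class name and the depth-counter design are assumptions for illustration, not the original weekend project. Predicates, wildcards, descendant and reverse axes, and the merging of several machines are all left out.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a single matcher for child-axis-only paths.
final class PathMatcher extends DefaultHandler {
    private final String[] steps;
    private int depth = 0;    // number of currently open elements
    private int matched = 0;  // how many leading steps the open ancestors satisfy
    int hits = 0;             // how many times the accept state was reached

    PathMatcher(String path) {
        this.steps = path.split("/");
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Advance only when every ancestor matched and this element matches the next step.
        if (matched == depth && matched < steps.length && steps[matched].equals(qName)) {
            matched++;
            if (matched == steps.length) hits++;
        }
        depth++;
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        depth--;
        // Leaving an element that contributed to the match rolls the state back.
        if (matched > depth) matched = depth;
    }

    /** Counts the elements in the document selected by the given path. */
    static int count(String path, String xml) {
        try {
            PathMatcher matcher = new PathMatcher(path);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), matcher);
            return matcher.hits;
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

For example, `PathMatcher.count("a/b", "<a><b/><c><b/></c><b/></a>")` returns 2, since the `b` nested under `c` is not a child of `a`. No tree is ever built, which is where the speed advantage over constructing a DOM would come from.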

Like most weekend projects, it eventually went by the wayside, and even then I had skipped some of the nastier bits (e.g., reverse axes) of XPath and only got as far as implementing the necessary operations to merge multiple machines together on AND and OR. (Anders Moeller's dk.brics.automaton does a nice job for plain old regular expressions.)

And I would have been late to the game anyway… Apropos of a recent posting on Slashdot, if I'd thought to type the query XML and event and “finite state” into CiteSeer, I would have seen that other people were already looking at the problem in similar ways. Specifically, I would have found a paper from 2000 by Mehmet Altinel and Michael Franklin in which they discuss a scalable, event-driven, finite state machine-based approach to executing XPath queries.

This is top-of-mind right now because I ran across some recent work from Feng Peng and Sudarshan Chawathe that uses a hierarchical arrangement of pushdown transducers to model an XPath expression, and it will actually draw the hierarchy! The project is called XSQ, and there is a poster that gives a nice summary and an example.

(Among others, Dan Suciu's XMLTK and the YFilter project are also worth a visit.)


The Case for XML Databases

Paul Brown @ 2003-03-30T00:00:00Z

On May 8, I will give a talk in the Web/XML portion of the CampIT Expo in Chicago on "The Case for XML Databases". Here is the abstract:

Unfortunately, the answer to the question, "So, why do I need an XML database?" is usually "Because it is better for XML storage." The real answer is that flexible storage for XML enables a host of interesting opportunities for managing and re-purposing enterprise content, and the rich array of XML standards — XSLT, XSL-FO, SVG — provide ways to deliver accessible presentations to a wide variety of corporate users.

If you meet the qualifications, you can register to go to the conference for free; alternatively, I'm willing to send out copies of the slides after the presentation.

