If you work with lots of XML, you’re likely to have the occasional (or constant) need to extract, merge, manipulate or otherwise transform it with XSL. For the last year or so I’ve been using Apache Ant to manage a couple of large and complex projects of this kind. This posting describes how to unleash the power of Ant. Mercifully, reference to rubber tree plants or other members of the genus Ficus will not be necessary.

Ant is great for this because:

  • it’s self-documenting (in the sense that you specify in a build file the transformations you want to enable). No more coming back to an old project and wondering which stylesheet should be applied to which XML file.
  • it’s smart about directory structures: you can easily specify that every XML file in this directory should be transformed, with the output going to that directory, and with the output file name determined like this.
  • complicated chained transformations are easy to specify; and since Ant looks at file timestamps to determine whether the input or the stylesheet is newer than the output, it doesn’t redo work unnecessarily.
  • Ant does lots of other stuff that may come in useful: basic things like copying and deleting, advanced things like querying a database.

Getting started: install Ant. If you’re using a Windows box, save yourself some aggro and install the CmdHere UI tweak. It lets you open a command window in a directory by right-clicking in Explorer. Since Ant is a command-line tool, it’s handy to be able to get to the command line quickly.

Ant uses Xalan as its XSLT engine, but if you want to use a different one (provided it’s TRaX-compliant) you can just put it in your classpath ahead of Xalan. If you want to pick and choose, you might want to install the Multi-XSLT task.

Let’s suppose you have a project directory, with the source XML files in a subdirectory “input”, and a subdirectory “output” ready to receive the transformed product. Ant needs a build file, which is an XML file containing a “project” element as the root, containing any number of “target” elements (representing distinct actions you might want to perform), each of which contains any number of tasks needed to complete that action. A simple XSLT target would have a build file like this:

<project name="Test project" default="build">
  <target name="build" description="Run the transformation">
    <xslt basedir="input" includes="*.xml" destdir="output" style="mytransform.xsl"/>
  </target>
</project>

That applies the specified stylesheet to all the XML files in “input” and saves the transformed files in “output”. When you run Ant, it looks for “build.xml” in the current directory and executes the default target, in this case “build”. If your build file contains more than one target, you can specify the desired target name as a parameter: “ant compile”, etc.
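As a sketch of a build file with more than one target, the following adds a hypothetical “clean” target alongside “build” (the target name and what it deletes are assumptions for illustration, not part of the original project):

```xml
<project name="Test project" default="build">
  <!-- Default target: transform input/*.xml into output/ -->
  <target name="build" description="Run the transformation">
    <xslt basedir="input" includes="*.xml" destdir="output" style="mytransform.xsl"/>
  </target>
  <!-- Hypothetical extra target: remove whatever "build" generated -->
  <target name="clean" description="Delete the transformed files">
    <delete>
      <fileset dir="output" includes="**/*"/>
    </delete>
  </target>
</project>
```

Running “ant” on its own executes “build”; “ant clean” runs only the second target.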

The xslt task has several options, mainly providing different ways to specify the input and output; look at the examples in the Ant manual for some idea of the range of possibilities.

You can pass parameters to the XSLT engine using a “param” element as a child of the xslt task:

<param name="plant" expression="Ficus elastica"/>

(OK, I was wrong.) Using Ant’s properties, you can set the value of parameters at runtime.
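A minimal sketch of how that might look: the build file declares a property with a default value and hands it to the stylesheet through “param”. The property name “plant” is reused from the example above, and the stylesheet is assumed to declare a matching top-level xsl:param.

```xml
<project name="Test project" default="build">
  <!-- Default value; a -D option on the command line takes precedence -->
  <property name="plant" value="Ficus elastica"/>
  <target name="build" description="Run the transformation">
    <xslt basedir="input" includes="*.xml" destdir="output" style="mytransform.xsl">
      <!-- Assumes mytransform.xsl declares a top-level xsl:param named "plant" -->
      <param name="plant" expression="${plant}"/>
    </xslt>
  </target>
</project>
```

Running “ant -Dplant="Ficus benjamina"” would then override the default. Bear in mind that the up-to-date check is based on timestamps, so a changed parameter value alone will not trigger a re-run; deleting the old output, or setting the task’s “force” attribute, gets around that.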

[update: Dan McCreary pointed out in the comments that filesets don’t work with the xslt task. Sorry about that. Use the “includes” attribute to get the same effect: e.g.

<xslt style="${docbookxsllocation}/html/docbook.xsl" includes="*.docbook.xml" destdir="doc/docbook/" basedir="${martinihome}/docbook/">
  <mapper type="glob" from="*.docbook.xml" to="*.docbook.html"/>
</xslt>

Apologies for the error, and thanks to Dan for pointing it out.]

In tasks that accept one, a fileset gives you finer-grained control over which input files are selected, for example:

<fileset dir="input"><include name="**/*.xml"/><exclude name="**/*test*"/></fileset>

This will recurse through the directory tree under the “input” subdirectory and include all the XML files it finds, except those whose filenames contain the string “test”.
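Given the update above about nested filesets and the xslt task, the same selection can be sketched with the task’s own “includes” and “excludes” attributes, since the task behaves as an implicit fileset rooted at “basedir”. The directory and stylesheet names are carried over from the earlier example:

```xml
<!-- Same patterns as the fileset, expressed as attributes of the task itself -->
<xslt basedir="input" destdir="output" style="mytransform.xsl"
      includes="**/*.xml" excludes="**/*test*"/>
```

The relative directory structure under “input” is reproduced under “output”.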

You can use a different naming structure for the output files using a mapper:

<mapper type="glob" from="*.xml.en" to="*.html.en"/>

This will transform files ending with “.xml.en” to files ending with “.html.en”. For more complex manipulations, there is a regexp mapper that gives you the full power of regular expressions.
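As a rough illustration of the regexp mapper, the sketch below handles any lowercase language suffix rather than just “.en”. The file name in the comment is hypothetical, and the directory and stylesheet names are reused from the earlier example:

```xml
<xslt basedir="input" destdir="output" style="mytransform.xsl" includes="**/*.xml.*">
  <!-- \1 is the path and base name, \2 the language suffix:
       guide.xml.en becomes guide.html.en, guide.xml.fr becomes guide.html.fr -->
  <mapper type="regexp" from="^(.*)\.xml\.([a-z]+)$" to="\1.html.\2"/>
</xslt>
```

Files whose names do not match the “from” expression are simply not mapped, and so are not transformed.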

The xslt task is smart enough to compare the timestamp of the source file and the stylesheet with the output file (if it already exists), and it only runs the transformation if the output is out of date. This is very handy in more complex multi-step pipelines. You might, for example, apply one stylesheet to transform the contents of “input” into a subdirectory “step1”, another to transform “step1” into “step2”, and so on. If you modify the last stylesheet in the chain, Ant will automatically skip the earlier steps when you run the process again. Note, however, that Ant only checks the stylesheet named in the task: changes to stylesheets it pulls in with xsl:include or xsl:import are not detected, so the output will not be rebuilt when only those change.
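A two-step chain of this kind might be sketched as follows. The stylesheet names “first.xsl” and “second.xsl” are placeholders, and the intermediate files are assumed to stay XML until the final step:

```xml
<project name="Pipeline sketch" default="step2">
  <!-- Step 1: input/*.xml to step1/*.xml -->
  <target name="step1" description="First transformation">
    <mkdir dir="step1"/>
    <xslt basedir="input" destdir="step1" style="first.xsl" includes="*.xml">
      <!-- Keep the .xml extension; without a mapper the output would be .html -->
      <mapper type="glob" from="*.xml" to="*.xml"/>
    </xslt>
  </target>
  <!-- Step 2: step1/*.xml to step2/*.html; "depends" runs step1 first -->
  <target name="step2" depends="step1" description="Second transformation">
    <mkdir dir="step2"/>
    <xslt basedir="step1" destdir="step2" style="second.xsl" includes="*.xml">
      <mapper type="glob" from="*.xml" to="*.html"/>
    </xslt>
  </target>
</project>
```

Because “step2” depends on “step1”, running “ant” always visits both targets, but each xslt task only reprocesses files whose output is older than the source or the stylesheet. Editing only “second.xsl” therefore leaves “step1” untouched.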

There’s lots more in Ant; it’s a tool worth getting to know, and not just for compiling code.