Managing large XSLT projects with Ant
If you work with lots of XML, you’re likely to have the occasional (or constant) need to extract, merge, manipulate or otherwise transform it with XSL. For the last year or so I’ve been using Apache Ant to manage a couple of large, complex XSLT projects. This posting describes how to unleash the power of Ant. Mercifully, reference to rubber tree plants or other members of the genus Ficus will not be necessary.
Ant is great for this because:
- it’s self-documenting (in the sense that you specify in a build file the transformations you want to enable). No more coming back to an old project and wondering which stylesheet should be applied to which XML file.
- it’s smart about directory structures: you can easily specify that every XML file in this directory should be transformed, with the output going to that directory, and with the output file name determined like this.
- complicated chained transformations are easy to specify; and since Ant looks at file timestamps to determine whether the input or the stylesheet is newer than the output, it doesn’t redo work unnecessarily.
- Ant does lots of other stuff that may come in useful: basic things like copying and deleting, advanced things like querying a database.
Getting started: install Ant. If you’re using a Windows box, save yourself some aggro and install the CmdHere UI tweak. It lets you open a command window in a directory by right-clicking in Explorer. Since Ant is a command-line tool, it’s handy to be able to get to the command line quickly.
Ant uses Xalan as its XSLT engine, but if you want to use a different one (provided it’s TRaX-compliant) you can just put it in your classpath ahead of Xalan. If you want to pick and choose, you might want to install the Multi-XSLT task.
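As a rough sketch of switching engines from inside the build file (the build file structure itself is explained in the next paragraph): the xslt task accepts a nested “factory” element naming the TRaX transformer factory class, plus a nested “classpath”. The Saxon class name is the one Saxon ships; the jar location lib/saxon.jar and the stylesheet name are assumptions made for illustration.

```xml
<project name="Saxon demo" default="build">
  <target name="build" description="Transform with an explicitly chosen XSLT engine">
    <xslt basedir="input" includes="*.xml" destdir="output" style="mytransform.xsl">
      <!-- Fully qualified name of the TRaX factory to use instead of the default -->
      <factory name="net.sf.saxon.TransformerFactoryImpl"/>
      <!-- Hypothetical jar location; point this at wherever your engine lives -->
      <classpath location="lib/saxon.jar"/>
    </xslt>
  </target>
</project>
```

This keeps the engine choice in the build file rather than in the environment, which fits the self-documenting idea.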
Let’s suppose you have a project directory, with the source XML files in a subdirectory “input”, and a subdirectory “output” ready to receive the transformed product. Ant needs a build file, which is an XML file containing a “project” element as the root, containing any number of “target” elements (representing distinct actions you might want to perform), each of which contains any number of tasks needed to complete that action. A simple XSLT target would have a build file like this:
<project name="Test project" default="build">
  <target name="build" description="Run the transformation">
    <xslt basedir="input" includes="*.xml" destdir="output" style="mytransform.xsl"/>
  </target>
</project>
That applies the specified stylesheet to all the XML files in “input” and saves the transformed files in “output”. When you run Ant, it looks for “build.xml” in the current directory and executes the default target, in this case “build”. If your build file contains more than one target, you can specify the desired target name as a parameter: “ant compile”, etc.
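Here is a minimal sketch of a build file with more than one target. The “clean” target and its contents are my own illustration, not part of the example above.

```xml
<project name="Test project" default="build">
  <!-- Runs when you type plain "ant", because it is the default target -->
  <target name="build" description="Run the transformation">
    <xslt basedir="input" includes="*.xml" destdir="output" style="mytransform.xsl"/>
  </target>

  <!-- Runs only when named explicitly: "ant clean" -->
  <target name="clean" description="Delete generated files">
    <delete>
      <fileset dir="output" includes="**/*"/>
    </delete>
  </target>
</project>
```

With this file, “ant” runs the transformation, “ant clean” empties the output directory, and “ant clean build” does both in that order. “ant -projecthelp” lists the targets along with their descriptions.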
The xslt task has several options, mainly providing different ways to specify the input and output; look at the examples in the Ant manual for some idea of the range of possibilities.
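Two of those variations, sketched with made-up file names: the “in” and “out” attributes transform a single named file, and the “extension” attribute changes the suffix given to output files in directory mode (the default is “.html”).

```xml
<project name="Input-output demo" default="build">
  <target name="build">
    <!-- One specific file in, one specific file out (file names are hypothetical) -->
    <xslt in="input/report.xml" out="output/report.html" style="mytransform.xsl"/>

    <!-- Directory mode, but writing .txt files instead of the default .html -->
    <xslt basedir="input" includes="*.xml" destdir="output"
          extension=".txt" style="totext.xsl"/>
  </target>
</project>
```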
You can pass parameters to the XSLT engine using a “param” element as a child of the xslt task:
<param name="plant" expression="Ficus elastica"/>
(OK, I was wrong.) Using Ant’s properties, you can set the value of parameters at runtime.
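A small sketch of how that might look, assuming the stylesheet declares a top-level xsl:param named “plant”. The property name and default value are just for illustration.

```xml
<project name="Param demo" default="build">
  <!-- Default value; a -Dplant=... on the command line takes precedence -->
  <property name="plant" value="Ficus elastica"/>

  <target name="build">
    <xslt basedir="input" includes="*.xml" destdir="output" style="mytransform.xsl">
      <!-- Hands the Ant property to the stylesheet as the parameter "plant" -->
      <param name="plant" expression="${plant}"/>
    </xslt>
  </target>
</project>
```

Running “ant -Dplant="Ficus benjamina"” overrides the default without touching the build file.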
[update: Dan McCreary pointed out in the comments that filesets don’t work with the xslt task. Sorry about that. Use the “includes” attribute to get the same effect: e.g.
<xslt style="${docbookxsllocation}/html/docbook.xsl" includes="*.docbook.xml" destdir="doc/docbook/" basedir="${martinihome}/docbook/">
<mapper type="glob" from="*.docbook.xml" to="*.docbook.html"/>
</xslt>
Apologies for the error, and thanks to Dan for pointing it out.]
You can gain more fine-grained control of the selection of input files using a fileset, such as:
<fileset dir="input"><include name="**/*.xml"/><exclude name="**/*test*"/></fileset>
This will recurse through the directory tree under the “input” subdirectory and include all the XML files it finds, except those that contain the word “test” in their filenames.
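Given the update above, here is a sketch of the same selection written with the xslt task’s own “includes” and “excludes” attributes, which should work in versions where a nested fileset is rejected. The stylesheet and directory names are the ones from the earlier example.

```xml
<project name="Selection demo" default="build">
  <target name="build">
    <!-- Recurse under "input", take every XML file, skip anything with "test" in its name -->
    <xslt basedir="input" destdir="output" style="mytransform.xsl"
          includes="**/*.xml" excludes="**/*test*"/>
  </target>
</project>
```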
You can use a different naming structure for the output files using a mapper:
<mapper type="glob" from="*.xml.en" to="*.html.en"/>
This will transform files ending with “.xml.en” to files ending with “.html.en”. For more complex manipulations, there is a regexp mapper that gives you the full power of regular expressions.
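As a sketch of what the regexp mapper can do that the glob mapper can’t, this one moves the language suffix into a directory name. The naming scheme is an assumption for illustration; in the “to” pattern, \1 and \2 refer to the parenthesized groups in “from”.

```xml
<project name="Mapper demo" default="build">
  <target name="build">
    <xslt basedir="input" includes="*.xml.*" destdir="output" style="mytransform.xsl">
      <!-- \1 is the base name, \2 the language suffix: report.xml.en becomes en/report.html -->
      <mapper type="regexp" from="^(.*)\.xml\.([a-z]+)" to="\2/\1.html"/>
    </xslt>
  </target>
</project>
```

I have left out a trailing “$” anchor on purpose, since “$” has a special meaning in Ant build files and would need escaping.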
The xslt task is smart enough to compare the timestamps of the source file and the stylesheet with that of the output file (if it already exists), and it only runs the transformation if the output is out of date. This is very handy in more complex multi-step pipelines. You might, for example, apply one stylesheet to transform the contents of “input” into a subdirectory “step1”, another to transform “step1” into “step2”, and so on. If you modify the last stylesheet in the chain, Ant will automatically skip the earlier steps when you run the process again. Note, however, that Ant doesn’t look at any stylesheets pulled in with xsl:include or xsl:import, so a change to one of those alone won’t trigger a rebuild.
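Here is a rough sketch of such a two-step pipeline. The stylesheet names are made up, and the “dependset” part is one possible workaround for the included-stylesheet blind spot, not the only one: it deletes any step2 output that is older than common.xsl (a hypothetical stylesheet included by second.xsl), so the xslt task then regenerates it.

```xml
<project name="Pipeline demo" default="step2">
  <target name="step1" description="Transform input into step1">
    <!-- Keep the .xml suffix so the next step can pick the files up -->
    <xslt basedir="input" includes="*.xml" destdir="step1"
          extension=".xml" style="first.xsl"/>
  </target>

  <target name="step2" depends="step1" description="Transform step1 into step2">
    <mkdir dir="step2"/>
    <!-- Remove outputs that are older than the included stylesheet -->
    <dependset>
      <srcfilelist dir="." files="common.xsl"/>
      <targetfileset dir="step2" includes="*.html"/>
    </dependset>
    <xslt basedir="step1" includes="*.xml" destdir="step2" style="second.xsl"/>
  </target>
</project>
```

Running “ant” executes step1 and then step2; each xslt task only reprocesses files whose output is missing or out of date.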
There’s lots more in Ant; it’s a tool worth getting to know, and not just for compiling code.
Thanks for this, Peter. You just changed my workflow.
Interesting post. We have been using Ant & XSLT for 12 months, with great results. Coupled with other Ant tasks, it is a great tool for publishing workflows.
I tried to use a fileset with the xslt task and the error message indicated that the xslt task does not support filesets. Do you have a working example? I am using Ant 1.6.
I forgot to thank you for the post. It was helpful. How about some examples on how to use Ant with Eclipse and Saxon? ;-) - Dan
Dan: you're right, I got confused with other Ant tasks that use filesets. With the xslt task you have to use the "includes" attribute to get more or less the same effect. I'll fix the example in the posting. Thanks for letting me know. And yes, how about examples on how to use Ant with Eclipse and Saxon?
Thanks for the suggestion to use the SourceForge forum: http://sourceforge.net/mailarchive/forum.php?thread_id=5806889&forum_id=1398 I will take your suggestion to heart and try to put up some good examples on how to use Saxon and Eclipse. Here is where I am putting my examples to date: http://en.wikibooks.org/wiki/Programming:Apache_Ant Guest editors welcome! ;-)
Here is a way to check for the dependencies of an imported file: Assume you have a single XSLT file that imports several other files. You can add a section to your target. Here is an example: In the example, if ANY of the HTMLHeader, PageHeader, LeftNav or PageFooter XSL files is changed, all the output will be regenerated.
[...] This will come in very handy in my Ant-based approach to large xslt projects. [...]
We have a need for using xslt on very large input xml files - I have some experience with Ant and xslt, but it runs out of memory on such large input files. Do you have any experience with this? I've been looking for a stream-based xslt processor that can output straight text.
[...] transformations, re-writing the shell script as appropriate, of course. But I've just read Peter Binkley's post about managing large XSLT projects with Ant, and I'm intrigued. The last time Peter talked me into using Ant for a project it worked out [...]