Stripping diacritics in XSLT

Here’s a trick I’ve had to (re)invent twice, so I’ll put it here where I won’t lose it again. The problem was to strip accents and otherwise normalize text that might be in any of a couple of dozen languages, including Russian and Ukrainian. XSLT’s translate() wasn’t going to cut it: it maps characters one at a time, so every accented character would have to be listed by hand. The answer is IBM’s open-source ICU (International Components for Unicode) package, which handles all sorts of internationalization tasks. Here’s how to call it from a stylesheet as a Java extension (this is for Xalan; Saxon has similar functionality, but I’ve only tried this in XSLT 1.0):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:java="http://xml.apache.org/xalan/java" exclude-result-prefixes="java">

<!-- ICU transform rules: decompose, drop combining marks, recompose -->
<xsl:param name="stripString">NFD; [:Nonspacing Mark:] Remove; NFC</xsl:param>

<!-- Build the Transliterator once, as a global variable -->
<xsl:variable name="transliterator" select="java:com.ibm.icu.text.Transliterator.getInstance($stripString)"/>

<!-- templates go here -->

</xsl:stylesheet>

First we declare the java namespace in the xsl:stylesheet element. Then we come up with a string representing the transformation we want to perform; I’ve set this as the default value of an xsl:param to make it easy to tinker with. Finally, we create an ICU Transliterator to perform the desired transformation.

The stripString parameter contains the instructions for the transformation, written as a semicolon-separated list of ICU transforms that are applied in order. ICU’s transform syntax can do a lot more than this (case changes and script-to-script transliteration, for example). In this case, we’re doing three things: decomposing the string (i.e. normalizing it to Unicode Normalization Form D, in which each diacritic is split off from its base letter and represented as a separate non-spacing mark); removing the non-spacing marks; and normalizing back to composed characters, Form C.
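If it helps to see those three steps outside XSLT, here is a rough sketch in plain Java. It is not what the stylesheet runs: it swaps ICU’s Transliterator for the JDK’s own java.text.Normalizer plus a regular expression, which I’m assuming is close enough to show the idea. The class and method names (StripDiacritics, strip) are made up for illustration.

```java
import java.text.Normalizer;

// Illustrative only: mimics "NFD; [:Nonspacing Mark:] Remove; NFC"
// with JDK classes instead of ICU. Names here are hypothetical.
class StripDiacritics {

    static String strip(String input) {
        // Step 1 (NFD): split each accented letter into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: drop every character in Unicode category Mn (Nonspacing Mark).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // Step 3 (NFC): recompose whatever can still be composed.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        // Input is "Creme brulee" with grave, circumflex and acute accents.
        System.out.println(strip("Cr\u00e8me br\u00fbl\u00e9e")); // prints: Creme brulee
        // Cyrillic input whose third letter is U+0457 (i with diaeresis).
        System.out.println(strip("\u041a\u0438\u0457\u0432"));
    }
}
```

One thing this makes visible: Cyrillic letters such as й and ї also decompose into a base letter plus a mark, so they come out as и and і.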

With all that at the top of the stylesheet, we can use the transliterator anywhere we need to, like this:

<xsl:variable name="output" select="java:transliterate($transliterator, .)"/>

To run it, just put the ICU4J jar file where Xalan can find it on its classpath. I keep a copy in Ant’s lib directory.
