Stripping diacritics in XSLT
Here’s a trick I’ve had to (re)invent twice, so I’ll put it here where I
won’t lose it again. The problem was to strip accents and otherwise
normalize text that might be in any of a couple of dozen languages,
including Russian and Ukrainian. XSLT’s translate() wasn’t going to
cut it, since it only maps characters you list explicitly, one for one.
The answer is IBM’s open-source ICU package, which handles all sorts of
internationalization tasks. Here’s how to incorporate it into a
stylesheet as a Java extension (this is for Xalan, obviously, but
Saxon has similar functionality; I’ve only tried it in XSLT 1.0):
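<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:java="http://xml.apache.org/xalan/java" exclude-result-prefixes="java">
  <!-- ICU transform ID: decompose, drop the nonspacing marks, recompose -->
  <xsl:param name="stripString">NFD; [:Nonspacing Mark:] Remove; NFC</xsl:param>
  <!-- Static call to Transliterator.getInstance(String) via Xalan's Java extension -->
  <xsl:variable name="transliterator"
      select="java:com.ibm.icu.text.Transliterator.getInstance($stripString)"/>
  <!-- templates that use $transliterator go here -->
</xsl:stylesheet>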
First we declare the java namespace in the xsl:stylesheet element.
Then we come up with a string representing the transformation we want to
perform; I’ve set this as the default value of an xsl:param to make it
easy to tinker with. Finally, we create an ICU Transliterator to
perform the desired transformation.
The stripString parameter contains instructions for the transformation
we want to perform. There are lots of things we can do here. In this
case, we’re doing three things: decomposing the string (i.e.,
normalizing it to Unicode Normalization Form D, wherein the diacritics
are separated from their base letters and converted to non-spacing
marks); removing the non-spacing marks; and normalizing back to
composed characters, Form C.
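If it helps to see those three steps spelled out, here’s a rough sketch in plain Java. To be clear, this is not ICU: it swaps in the JDK’s java.text.Normalizer plus a regex on the Mn (nonspacing mark) category as a stand-in, and the class and method names (StripDiacritics, strip) are made up for illustration. It should agree with the ICU rule on ordinary accented text, but I haven’t compared the two exhaustively.

```java
import java.text.Normalizer;

public class StripDiacritics {
    // JDK-only approximation of the ICU rule "NFD; [:Nonspacing Mark:] Remove; NFC".
    // The class and method names are hypothetical, not part of ICU or Xalan.
    public static String strip(String input) {
        // Step 1: decompose, so that U+00E9 (e with acute) becomes "e" + U+0301.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: remove every nonspacing mark (Unicode general category Mn).
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // Step 3: recompose whatever is left into Form C.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        System.out.println(strip("cr\u00e8me br\u00fbl\u00e9e")); // prints "creme brulee"
    }
}
```

One thing this makes visible: Cyrillic й (U+0439) decomposes to и plus a combining breve, so it comes out as и. That’s the rule doing exactly what it says, but it’s worth knowing if your text includes Russian or Ukrainian.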
With all that at the top of the stylesheet, we can use the transliterator anywhere we need to, like this:
<xsl:variable name="output" select="java:transliterate($transliterator, .)"/>
To run it, just put the ICU4J jar file where Xalan can find it on its
classpath. I keep a copy in Ant’s lib directory.
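If Xalan complains that it can’t find the extension class, a quick way to see whether the jar is really on the classpath is a few lines of Java. This is only a diagnostic sketch: Icu4jCheck and isOnClasspath are names I’ve invented here, and you’d need to run it with the same classpath you give Xalan.

```java
public class Icu4jCheck {
    // Returns true if the named class can be located on the current classpath.
    // Hypothetical helper for troubleshooting; not part of ICU or Xalan.
    public static boolean isOnClasspath(String className) {
        try {
            // "false" means: find the class but don't run its static initializers.
            Class.forName(className, false, Icu4jCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class the stylesheet calls through the java: namespace.
        String icuClass = "com.ibm.icu.text.Transliterator";
        System.out.println(icuClass + (isOnClasspath(icuClass) ? " found" : " NOT found"));
    }
}
```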