Here’s a trick I’ve had to (re)invent twice, so I’ll put it here where I won’t lose it again. The problem was to strip accents and otherwise normalize text that might be in any of a couple of dozen languages, including Russian and Ukrainian. XSLT’s translate() wasn’t going to cut it. The answer is IBM’s open-source ICU package, which handles all sorts of internationalization tasks. Here’s how to incorporate it into a stylesheet as a Java extension (this is for Xalan, obviously, but Saxon has similar functionality; I’ve only tried it in XSLT 1.0):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:param name="stripString">NFD; [:Nonspacing Mark:] Remove; NFC</xsl:param>
First we declare the java namespace in the xsl:stylesheet element. Then we come up with a string representing the transformation we want to perform; I’ve set this as the default value of an xsl:param to make it easy to tinker with. Finally, we create an ICU Transliterator to perform the desired transformation.
The stripString parameter contains the instructions for the transformation. There are lots of things we can do here. In this case, we’re doing three things: decomposing the string (i.e. normalizing it to Unicode Normalization Form D, wherein the diacritics are separated from their base letters and become non-spacing marks); removing the non-spacing marks; and normalizing back to composed characters, Form C.
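If you want to see what those three steps do outside a stylesheet, here’s a rough sketch in plain Java. It doesn’t use ICU at all: it assumes the JDK’s java.text.Normalizer plus a \p{Mn} regex is a close enough stand-in for the rule string, and the StripMarks class and strip method are names I made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of ICU or Xalan: it mimics the rule
// "NFD; [:Nonspacing Mark:] Remove; NFC" using only JDK classes.
class StripMarks {

    static String strip(String input) {
        // Step 1: decompose, so each accent becomes a separate non-spacing mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Step 2: remove every character in general category Mn (Nonspacing_Mark).
        String unmarked = decomposed.replaceAll("\\p{Mn}", "");
        // Step 3: recompose whatever is left.
        return Normalizer.normalize(unmarked, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final e.
        System.out.println(strip("caf\u00e9")); // prints "cafe"
    }
}
```

One thing worth checking before using this on Cyrillic text: the breve on й and the diaeresis on ї are non-spacing marks too once decomposed, so й comes out as и and ї as і. That may or may not be what you want.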
With all that at the top of the stylesheet, we can use the transliterator anywhere we need to, like this:
To run it, just put the ICU4J jar file where Xalan can find it on its classpath. I keep a copy in Ant‘s