Stripping diacritics in XSLT « Quædam cuiusdam
Monday 22 August 2005 @ 3:39 pm

Here’s a trick I’ve had to (re)invent twice, so I’ll put it here where I won’t lose it again. The problem was to strip accents and otherwise normalize text that might be in any of a couple of dozen languages, including Russian and Ukrainian. XSLT’s translate() wasn’t going to cut it: it maps characters one for one, so every accented character would have to be listed by hand. The answer is IBM’s open-source ICU package (International Components for Unicode), which handles all sorts of internationalization tasks. Here’s how to incorporate its Java version, ICU4J, into a stylesheet as a Java extension (this is for Xalan, whose java extension namespace is used below; Saxon has similar functionality; I’ve only tried it in XSLT 1.0):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:java="http://xml.apache.org/xalan/java"
  exclude-result-prefixes="java">

<xsl:param name="stripString">NFD; [:Nonspacing Mark:] Remove; NFC</xsl:param>

<xsl:variable
  name="transliterator"
  select="java:com.ibm.icu.text.Transliterator.getInstance($stripString)"/>

First we declare the java namespace in the xsl:stylesheet element. Then we come up with a string representing the transformation we want to perform; I’ve set this as the default value of an xsl:param to make it easy to tinker with. Finally, we create an ICU Transliterator to perform the desired transformation.

The stripString parameter holds an ICU transform ID: a list of steps separated by semicolons and applied in order. There are lots of things we can do here. In this case, we’re doing three things: decomposing the string (i.e. normalizing it to Unicode Normalization Form D, NFD, wherein the diacritics are separated from their base letters and represented as non-spacing marks); removing the non-spacing marks; and normalizing back to composed characters, Form C (NFC).
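To see what those three steps do to actual characters, here is a rough Java sketch of the same pipeline. Note that it does not use ICU at all: it swaps in the JDK’s own java.text.Normalizer plus a regular expression for the Mn (Nonspacing Mark) general category, on the assumption that this matches what ICU’s [:Nonspacing Mark:] set selects. The class and method names are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of ICU or Xalan; the name is illustrative only.
final class DiacriticStripper {

    // Approximates "NFD; [:Nonspacing Mark:] Remove; NFC" with JDK classes only.
    static String strip(String input) {
        // 1. Decompose: U+00E9 (e with acute) becomes "e" followed by U+0301.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // 2. Delete every character whose general category is Mn (nonspacing mark).
        String stripped = decomposed.replaceAll("\\p{Mn}", "");
        // 3. Recompose anything that can still be composed.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // \u escapes keep the source file ASCII-only.
        System.out.println(strip("caf\u00e9"));          // prints "cafe"
        System.out.println(strip("Dvo\u0159\u00e1k"));   // prints "Dvorak"
    }
}
```

One thing worth knowing before pointing this at Russian or Ukrainian text: some Cyrillic letters also have canonical decompositions, so й (U+0439) comes out as и (U+0438) and ї (U+0457) as і (U+0456). If that isn’t what you want, the set of marks to remove needs narrowing.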

With all that at the top of the stylesheet, we can use the transliterator anywhere we need to, like this:

<xsl:variable
  name="output"
  select="java:transliterate($transliterator, .)"/>

To run it, just put the ICU4J jar file on the classpath where Xalan can find it. I keep a copy in Ant’s lib directory.






 One response to “Stripping diacritics in XSLT”

  •   Bess wrote:

    I just had to solve this exact problem again, and once again you came to my rescue. Thanks for the clear write-up of this solution.
