I recently needed a function for stripping accents from the characters in a String in Java, to make searching for words in a text simpler.

Luckily I was using Java 6, which has a new class in the java.text package that provides a really simple solution for doing this: Normalizer.

The Normalizer class is capable of performing decomposition of composite characters in Unicode. In a nutshell, this means that a character with an accent can be split into two characters: the first being the unaccented base character, and the second a combining mark that carries the accent itself. For example, [ã] (U+00E3) would be decomposed as [a,~], where the second character is the combining tilde (U+0303) rather than the ordinary ASCII tilde.

If you have a decent debugger to hand, try stepping through this Java code and examining the characters in the String before and after the call to the Normalizer is made:

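// Requires: import java.text.Normalizer;
String s = "garçon";                              // 6 chars, 'ç' is a single char
s = Normalizer.normalize(s, Normalizer.Form.NFD); // 7 chars, 'ç' is now 'c' + combining cedilla
System.out.println(s);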
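
If you don't have a debugger to hand, printing the character codes shows the same thing. The following is a minimal sketch; the class name DecompositionDemo and the helper codePoints are made up for this illustration, and the string is written with a Unicode escape only to sidestep source-file encoding problems.

```java
import java.text.Normalizer;

public class DecompositionDemo {

    // Hypothetical helper: lists each char of s as a U+XXXX code, space separated.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String composed = "gar\u00E7on"; // "garçon" with a precomposed c-cedilla
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        // Before: U+0067 U+0061 U+0072 U+00E7 U+006F U+006E
        System.out.println(codePoints(composed));
        // After:  U+0067 U+0061 U+0072 U+0063 U+0327 U+006F U+006E
        System.out.println(codePoints(decomposed));
    }
}
```

The single U+00E7 becomes the pair U+0063 (plain c) and U+0327 (combining cedilla), so the String grows from six chars to seven even though it usually looks the same when printed.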

All that remains to do is to remove the accent characters from the decomposed String using a regular expression, wrap it all up in a simple function and we’re done:

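// Requires: import java.text.Normalizer;
// Returns s with accents removed, e.g. "garçon" becomes "garcon".
public static String stripAccents(String s) {
    // Split each accented character into base character + combining mark(s).
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    // Delete the combining marks (Unicode block U+0300 to U+036F).
    s = s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    return s;
}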
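
To tie this back to the original goal of searching for words, here is one way the function might be used. It is only a sketch under a few assumptions: the class name AccentInsensitiveSearch and the helper containsIgnoringAccents are invented for this example, and lower-casing both sides is my own addition, not something the approach requires.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentInsensitiveSearch {

    // Same approach as stripAccents above: decompose, then drop combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Hypothetical helper: true if text contains word, ignoring accents and case.
    static boolean containsIgnoringAccents(String text, String word) {
        String haystack = stripAccents(text).toLowerCase(Locale.ROOT);
        String needle = stripAccents(word).toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }

    public static void main(String[] args) {
        // "Le garçon est allé au café." written with escapes
        String text = "Le gar\u00E7on est all\u00E9 au caf\u00E9.";

        System.out.println(containsIgnoringAccents(text, "garcon")); // true
        System.out.println(containsIgnoringAccents(text, "CAFE"));   // true

        // 'ø' (U+00F8) has no canonical decomposition, so it is left untouched.
        System.out.println(stripAccents("S\u00F8ren").equals("S\u00F8ren")); // true
    }
}
```

One thing to keep in mind: this only removes marks that Unicode actually separates out during decomposition. Letters such as ø or ß are characters in their own right rather than a base letter plus an accent, so they pass through unchanged and would need a separate mapping if you want them folded to ASCII as well.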