How to strip accents from Strings using Java 6
I recently needed a function for stripping accents from the characters of a String in Java, to make searching for words in a text simpler.
Luckily I was using Java 6, which adds a new class to the java.text package that provides a really simple solution for this: the Normalizer.
The Normalizer class can decompose composite Unicode characters. In a nutshell, this means that a character with an accent can be split into two characters: the first being the unaccented base character, and the second a combining mark that represents the accent itself; for example, [ã] (U+00E3) would be decomposed as [a,~], where the second character is the combining tilde (U+0303).
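To see the decomposition without a debugger, a minimal sketch along these lines prints the code points of a String before and after normalization. The class name DecompositionDemo and the helper codePoints are made up for this illustration; only Normalizer.normalize and Normalizer.Form.NFD come from the JDK.

```java
import java.text.Normalizer;

public class DecompositionDemo {

    // Hypothetical helper: lists the code points of s, e.g. "U+0061 U+0303".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A Unicode escape is used so the example does not depend on the source file encoding.
        String composed = "\u00e3"; // the single precomposed character a-with-tilde
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(codePoints(composed));   // U+00E3
        System.out.println(codePoints(decomposed)); // U+0061 U+0303
    }
}
```

The decomposed String is two chars long: the plain letter followed by the combining tilde.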
If you have a decent debugger to hand, try stepping through this Java code and examining the characters in the String before and after the call to the Normalizer is made:
String s = "garçon";
s = Normalizer.normalize(s, Normalizer.Form.NFD);
System.out.println(s);
All that remains to do is to remove the accent characters from the decomposed String using a regular expression, wrap it all up in a simple function and we’re done:
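public static String stripAccents(String s) {
    // Split each accented character into a base character plus combining marks.
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    // Remove the combining marks, leaving only the base characters.
    s = s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    return s;
}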
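As a rough sketch of how this might be used for the word search that motivated it, the class below wraps the same function together with a search helper. The class name AccentInsensitiveSearch and the method containsIgnoringAccents are assumptions made for this example, not part of any library, and lower-casing with Locale.ROOT is just one possible choice.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentInsensitiveSearch {

    // Same approach as the function above: decompose, then drop the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Hypothetical helper: true if text contains word, ignoring accents and case.
    static boolean containsIgnoringAccents(String text, String word) {
        String haystack = stripAccents(text).toLowerCase(Locale.ROOT);
        String needle = stripAccents(word).toLowerCase(Locale.ROOT);
        return haystack.contains(needle);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        String text = "Le gar\u00e7on est all\u00e9 au caf\u00e9.";

        System.out.println(stripAccents(text));                      // Le garcon est alle au cafe.
        System.out.println(containsIgnoringAccents(text, "GARCON")); // true

        // Letters with no canonical decomposition, such as o-with-stroke (U+00F8),
        // pass through unchanged.
        System.out.println(stripAccents("\u00f8").equals("\u00f8")); // true
    }
}
```

Note that this only removes marks that NFD actually separates out; letters such as ø that have no decomposition would need their own mapping if they matter for your search.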

January 28, 2010 - 11:49 am
I spent ages trying to do this; I can’t believe it can be this simple. Thanks for sharing.
May 20, 2010 - 7:07 pm
It’s really simple! Thanks!