Java: convert UTF-8 to unicode escape string 

August 03, 2010 16:23:55    Last update: October 22, 2010 15:35:20
According to the Java documentation:
The Java compiler and other Java tools can only process files which contain Latin-1 and/or Unicode-encoded (\udddd notation) characters.

This utility converts a UTF-8 encoded file to ASCII, replacing each non-ASCII character with its Unicode escape sequence.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/**
 * Reads a file in UTF-8 encoding and writes it to STDOUT in ASCII, using
 * unicode escape sequences for characters outside of ASCII.
 */
public class UTF8ToAscii {
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: java UTF8ToAscii <filename>");
            return;
        }

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(
                    new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(unicodeEscape(line));
            }
        }
    }

    private static final char[] HEX_CHARS = {
        '0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'
    };

    private static String unicodeEscape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                // outside 7-bit ASCII: emit backslash, 'u', and four hex digits
                sb.append("\\u");
                sb.append(HEX_CHARS[(c >> 12) & 0xF]); // hex digit for the left-most 4 bits
                sb.append(HEX_CHARS[(c >> 8) & 0xF]);  // second group of 4 bits from the left
                sb.append(HEX_CHARS[(c >> 4) & 0xF]);  // third group
                sb.append(HEX_CHARS[c & 0xF]);         // last group, i.e., the right-most 4 bits
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
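
To see how the four shifts pick out the hex digits, here is a minimal standalone sketch that escapes a single character (U+4E2D). The class name EscapeOneChar and its methods are made up for this illustration and are not part of the utility above; the String.format variant is only shown as a cross-check and is assumed to give the same text.

```java
public class EscapeOneChar {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Hypothetical helper: escapes one char using the same shift-and-mask steps.
    public static String escape(char c) {
        if (c <= 0x7F) {
            return String.valueOf(c);      // plain ASCII passes through unchanged
        }
        return "\\u"
            + HEX[(c >> 12) & 0xF]         // bits 15..12
            + HEX[(c >> 8) & 0xF]          // bits 11..8
            + HEX[(c >> 4) & 0xF]          // bits 7..4
            + HEX[c & 0xF];                // bits 3..0
    }

    // Alternative using the standard formatter: four upper-case hex digits, zero padded.
    public static String escapeWithFormat(char c) {
        return c <= 0x7F ? String.valueOf(c) : String.format("\\u%04X", (int) c);
    }

    public static void main(String[] args) {
        char c = 0x4E2D;                   // a CJK character, code point 4E2D
        System.out.println(escape(c));
        System.out.println(escapeWithFormat(c));
        System.out.println(escape('A'));   // ASCII input is left as-is
    }
}
```

For 0x4E2D the four groups are 4, E, 2 and D, so both methods return the six characters \u4E2D, while 'A' comes back unchanged.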


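The result is equivalent to running the standard Java native2ascii utility on the same file:
native2ascii -encoding utf-8 <filename>

The only visible difference is that native2ascii writes the hex digits in lowercase.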
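
As a quick check that the escaped form is understood by Java tools, the sketch below feeds one ASCII-only line to java.util.Properties, which decodes the \udddd notation when loading. The class name, the key "greeting" and the sample value are assumptions made up for this example.

```java
import java.io.StringReader;
import java.util.Properties;

public class EscapedPropertiesDemo {
    // Loads a single "key=value" line and returns the decoded value for "greeting".
    public static String loadGreeting(String escapedLine) throws Exception {
        Properties props = new Properties();
        props.load(new StringReader(escapedLine));
        return props.getProperty("greeting");
    }

    public static void main(String[] args) throws Exception {
        // ASCII-only text, as the converter would emit it; the doubled
        // backslash keeps the escape literal inside this Java string.
        String escaped = "greeting=caf\\u00E9";
        String value = loadGreeting(escaped);
        System.out.println(value.length());         // prints 4
        System.out.println((int) value.charAt(3));  // prints 233 (hex E9)
    }
}
```

The loaded value has four characters, the last one being the single character with code 0xE9, which shows the escape was turned back into the original character.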