Question about UTF-8

Soldato
Joined
1 Mar 2003
Posts
5,508
Location
Cotham, Bristol
I was asked a question in an interview the other day. The interviewer had a method for counting the number of words, characters and new lines. It was something along the lines of the following (this is untested, I'm just going from memory):

Code:
public void readWordCharLineCount(InputStream in, PrintStream out, PrintStream err) {
   int nw = 0;
   int nc = 0;
   int nl = 0;

   byte[] buff = new byte[4096];

   try {
      while(in.read(buff) != -1) {
         for(int i=0; i < buff.length; i++) {
            char c = (char) buff[i];
            if(c == '\n') {
               ++nl;
            } else if(Character.isWhitespace(c)) {
               ++nw;
            }
            ++nc;
         }
      }
      System.out.println(nl + " " + nw + " " + nc);
   } catch(IOException e) {
      err.print(e);
      return;
   }
}

He then asked whether I could see any problems if the input were in UTF-8. I wasn't sure what he was talking about, but apparently in UTF-8 a character can "use one to four 8-bit bytes" (pinched from Wikipedia).

So does that affect the character count in the above code snippet? Is that what he was getting at?
 
Soldato
Joined
9 Mar 2010
Posts
2,841
Yeah, you've got it. Essentially the code treats every byte as one character, so it doesn't take into account the variable number of bytes UTF-8 uses to represent a character. Any non-ASCII character gets counted two to four times.
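
For what it's worth, here is a rough sketch of one way round it: let an InputStreamReader decode the bytes as UTF-8 and count the decoded characters instead of the raw bytes. The class name Utf8Count and the count method are made up for illustration. The word counting is my own assumption too (a word starts at each switch from whitespace to non-whitespace), which is not what the snippet above does, and I'm assuming "character" means a Unicode code point.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, for illustration only.
class Utf8Count {

    // Returns {lines, words, characters} for a UTF-8 encoded stream.
    static long[] count(InputStream in) throws IOException {
        long lines = 0, words = 0, chars = 0;
        boolean inWord = false;

        // The reader decodes bytes into chars, so a multi-byte sequence
        // is handled even if it is split across two underlying reads.
        Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            // Only look at the n chars actually read, not the whole buffer.
            for (int i = 0; i < n; i++) {
                char c = buf[i];
                // Characters outside the BMP arrive as a surrogate pair;
                // skip the low half so the pair counts as one character.
                if (!Character.isLowSurrogate(c)) {
                    chars++;
                }
                if (c == '\n') {
                    lines++;
                }
                // Assumption: a word begins at each whitespace to
                // non-whitespace transition.
                if (Character.isWhitespace(c)) {
                    inWord = false;
                } else if (!inWord) {
                    inWord = true;
                    words++;
                }
            }
        }
        return new long[] {lines, words, chars};
    }

    public static void main(String[] args) throws IOException {
        // "héllo wörld\n" written with escapes to avoid source-encoding issues.
        byte[] data = "h\u00e9llo w\u00f6rld\n".getBytes(StandardCharsets.UTF_8);
        long[] r = count(new ByteArrayInputStream(data));
        System.out.println(data.length + " bytes: " + r[0] + " " + r[1] + " " + r[2]);
    }
}
```

The sample string is 14 bytes in UTF-8 but only 12 characters, because é and ö each take two bytes. A byte-by-byte loop would report 14.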
 