Question about UTF-8

PaulStat · 12 Jan 2015 at 09:12

I was asked a question the other day in an interview where the interviewer showed me a method for counting the number of words, characters and new lines. The method was something along the lines of the following (this is untested, I'm just going from memory):

Code:

public void readWordCharLineCount(InputStream in, PrintStream out, PrintStream err) {
   int nw = 0;
   int nc = 0;
   int nl = 0;

   byte[] buff = new byte[4096];

   try {
      while(in.read(buff) != -1) {
         for(int i=0; i < buff.length; i++) {
            char c = (char) buff[i];
            if(c == '\n') {
               ++nl;
            } else if(Character.isWhitespace(c)) {
               ++nw;
            }
            ++nc;
         }
      }
      System.out.println(nl + " " + nw + " " + nc);
   } catch(IOException e) {
      err.print(e);
      return;
   }
}

He then asked whether I could see any problems if the input were in UTF-8. I wasn't sure what he was talking about, but apparently in UTF-8 a character can "use one to four 8-bit bytes" (pinched from Wikipedia).

So does that affect the character count in the above code snippet? Is that what he was getting at?
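
To make the concern concrete, here is a minimal sketch comparing a per-byte count with a count taken after decoding the stream as UTF-8. The class and method names (Utf8CountDemo, countBytes, countCodePoints) are made up for illustration and are not from the interview code. It assumes the input really is UTF-8 and that "character" means Unicode code point.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original interview code.
public class Utf8CountDemo {

    // Counts raw bytes. Treating every byte as one character gives this number.
    public static int countBytes(InputStream in) {
        try {
            int count = 0;
            byte[] buff = new byte[4096];
            int n;
            while ((n = in.read(buff)) != -1) {
                count += n; // only the bytes actually read, not buff.length
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes the stream as UTF-8 (an assumption about the input) and
    // counts Unicode code points instead of bytes.
    public static int countCodePoints(InputStream in) {
        try {
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            int count = 0;
            int c;
            while ((c = reader.read()) != -1) {
                // Code points outside the BMP arrive as two Java chars
                // (a surrogate pair); skip the second half.
                if (!Character.isLowSurrogate((char) c)) {
                    count++;
                }
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "h", e-acute, "llo", space, euro sign, newline, written with
        // escapes so the source file encoding does not matter.
        byte[] data = "h\u00e9llo \u20ac\n".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes: " + countBytes(new ByteArrayInputStream(data)));
        System.out.println("code points: " + countCodePoints(new ByteArrayInputStream(data)));
    }
}
```

In the sample string the e-acute takes two bytes and the euro sign takes three, so the byte count comes out higher than the code point count. Using a Reader also avoids the case where a multi-byte character is split across two buffer reads.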
