Question about UTF-8

PaulStat · 12 Jan 2015 at 09:12

I was asked a question in an interview the other day. The interviewer showed me a method for counting the number of words, characters and new lines. It was something along these lines (this is untested, I'm just going from memory):

Code:

public void readWordCharLineCount(InputStream in, PrintStream out, PrintStream err) {
   int nw = 0; // words (approximated by counting whitespace characters)
   int nc = 0; // characters
   int nl = 0; // new lines

   byte[] buff = new byte[4096];

   try {
      int n; // number of bytes filled by the last read
      while((n = in.read(buff)) != -1) {
         for(int i=0; i < n; i++) {
            char c = (char) buff[i];
            if(c == '\n') {
               ++nl;
            } else if(Character.isWhitespace(c)) {
               ++nw;
            }
            ++nc;
         }
      }
      out.println(nl + " " + nw + " " + nc);
   } catch(IOException e) {
      err.print(e);
      return;
   }
}

He then asked whether I could see any problems if the input were in UTF-8. I wasn't sure what he was talking about, but apparently in UTF-8 a character can "use one to four 8-bit bytes" (pinched from Wikipedia).

So does that affect the character count in the above code snippet? Is that what he was getting at?

RoyMi6 · 12 Jan 2015 at 10:17

PaulStat said:
He then asked whether I could see any problems if the input were in UTF-8. I wasn't sure what he was talking about, but apparently in UTF-8 a character can "use one to four 8-bit bytes" (pinched from Wikipedia).

So does that affect the character count in the above code snippet? Is that what he was getting at?

Yeah, you've got it. Essentially the code treats every byte as one character, so it counts bytes rather than characters. With UTF-8 input, any character outside ASCII takes two to four bytes and gets counted two to four times.
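For what it's worth, here is a rough sketch of one way to avoid the problem: let an InputStreamReader decode the bytes as UTF-8 and count what comes out of the reader instead of the raw bytes. The class name Utf8Counter and the count method are made up for this illustration, it counts a "character" as one Unicode code point, and it counts words as runs of non-whitespace, which is my own choice rather than what the interview code did. Treat it as a starting point under those assumptions, not the answer the interviewer had in mind.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; the name is not from the original code.
final class Utf8Counter {

    /** Returns {lines, words, characters}; characters are Unicode code points. */
    static long[] count(InputStream in) throws IOException {
        long lines = 0;
        long words = 0;
        long chars = 0;
        boolean inWord = false;

        // The reader turns UTF-8 byte sequences into Java chars (UTF-16 code units).
        Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));

        int unit;
        while ((unit = reader.read()) != -1) {
            char c = (char) unit;

            // Characters beyond U+FFFF arrive as a surrogate pair (two chars).
            // Skipping the low half counts each pair once.
            if (!Character.isLowSurrogate(c)) {
                chars++;
            }
            if (c == '\n') {
                lines++;
            }
            // Assumption: a word is a run of non-whitespace characters.
            if (Character.isWhitespace(c)) {
                inWord = false;
            } else if (!inWord) {
                inWord = true;
                words++;
            }
        }
        return new long[] {lines, words, chars};
    }
}
```

With this, "é" (two bytes in UTF-8) and "€" (three bytes) each add one to the character count, whereas the byte-by-byte loop would add two and three. Calling Utf8Counter.count(System.in) would give the three totals for standard input.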
