Question about UTF-8

I was asked a question in an interview the other day. The interviewer showed me a method for counting the number of words, characters and new lines in some input. It was something along the lines of the following (this is untested, I'm just going from memory):

Code:
public void readWordCharLineCount(InputStream in, PrintStream out, PrintStream err) {
   int nw = 0;
   int nc = 0;
   int nl = 0;

   byte[] buff = new byte[4096];

   try {
      int n; // number of bytes actually read into buff on each pass
      while((n = in.read(buff)) != -1) {
         for(int i=0; i < n; i++) {
            char c = (char) buff[i];
            if(c == '\n') {
               ++nl;
            } else if(Character.isWhitespace(c)) {
               ++nw;
            }
            ++nc;
         }
      }
      out.println(nl + " " + nw + " " + nc);
   } catch(IOException e) {
      err.print(e);
      return;
   }
}

He then asked whether I could see any problems if the input were in UTF-8. I wasn't sure what he was talking about, but apparently in UTF-8 a character can "use one to four 8-bit bytes" (pinched from Wikipedia).

So does that affect the character count in the above code snippet, since it counts one character per byte? Is that what he was getting at?
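
To make the question concrete, here is a small sketch of what I think the difference would be. This is my own guess, not the interviewer's code: the class and method names (`Utf8CountDemo`, `countPerByte`, `countDecoded`) are made up for illustration, and it assumes the "right" count is the number of Unicode code points after decoding the bytes as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the interview code.
class Utf8CountDemo {
    // Mimics the loop in the snippet above: every byte counts as one character.
    static int countPerByte(byte[] data) {
        int nc = 0;
        for (int i = 0; i < data.length; i++) {
            nc++;
        }
        return nc;
    }

    // Decodes the bytes as UTF-8 first, then counts Unicode code points.
    static int countDecoded(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "café €" written with escapes: é takes 2 bytes in UTF-8, € takes 3.
        byte[] data = "caf\u00e9 \u20ac".getBytes(StandardCharsets.UTF_8);
        System.out.println(countPerByte(data)); // prints 9
        System.out.println(countDecoded(data)); // prints 6
    }
}
```

If that is right, the byte-by-byte version would over-count characters whenever the input contains anything outside plain ASCII, and the cast from byte to char would not give the real character for those bytes either.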
 