Best way to remove Unicode chars in text input

davetherave2 · 19 Jul 2007 at 09:26

Hi Guys,

Just wondering what the best way is to remove/strip out Unicode (non-ASCII) characters. The reason I ask is that they are causing some issues with saving information in a particular way.

Currently I am using this code:

char[] stringconvertor = new char[TextBox1.Text.Length];
string returnstring = "";
stringconvertor = TextBox1.Text.ToCharArray();

for(int loop = 0;loop < TextBox1.Text.Length;loop++)
{

if (stringconvertor[loop] <= 127)
{
returnstring = returnstring+ stringconvertor[loop].ToString();
}

}
TextBox2.Text = returnstring;

Now, I have found that running this code takes around 4 minutes to go through about 200,000 characters, which is obviously unacceptable for users. Is there a better way?

Chrisss · 19 Jul 2007 at 14:08

Code:

char[] stringconvertor = new char[TextBox1.Text.Length];
string returnstring = "";
stringconvertor = TextBox1.Text.ToCharArray();

for(int loop = 0;loop < TextBox1.Text.Length;loop++)
{

    if (stringconvertor[loop] <= 127)
    {
         returnstring = returnstring+ stringconvertor[loop].ToString();
    }

}

TextBox2.Text = returnstring;

Inquisitor · 19 Jul 2007 at 14:34

Please don't do successive string concatenation like that.

Use a StringBuilder:

Code:

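// Requires: using System.Text;
string input = TextBox1.Text;
// Pre-size the builder so it never has to grow.
StringBuilder stringBuilder = new StringBuilder(input.Length);

for (int i = 0; i < input.Length; i++)
{
    // Keep only ASCII characters (code points 0 to 127), as in the original loop.
    if (input[i] <= 127)
    {
        stringBuilder.Append(input[i]);
    }
}

TextBox2.Text = stringBuilder.ToString();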

This should fix the long execution times.

Alternatively, get .NET to do it for you:

Code:

Encoding originalEncoding = Encoding.Unicode;
Encoding targetEncoding = Encoding.ASCII;

byte[] inputBuffer = originalEncoding.GetBytes(TextBox1.Text);
byte[] outputBuffer = Encoding.Convert(originalEncoding, targetEncoding, inputBuffer);
TextBox2.Text = targetEncoding.GetString(outputBuffer);

This will result in all non-ASCII characters being replaced by question marks.
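
If the goal is to drop the non-ASCII characters rather than replace them with question marks, one option is to ask for an ASCII encoding whose encoder fallback substitutes an empty string. The following is only a sketch: the class and method names (AsciiFilter, StripNonAscii) are made up for illustration and are not from the posts above, and it assumes a silent drop is really what you want.

```csharp
using System;
using System.Text;

static class AsciiFilter
{
    // Hypothetical helper, named here for illustration only.
    public static string StripNonAscii(string input)
    {
        // ASCII encoding whose encoder fallback is an empty replacement string,
        // so characters that cannot be encoded are dropped instead of becoming '?'.
        Encoding asciiDropping = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(string.Empty),
            new DecoderExceptionFallback());

        // Encode to ASCII bytes (non-ASCII chars vanish), then decode back to a string.
        byte[] asciiBytes = asciiDropping.GetBytes(input);
        return asciiDropping.GetString(asciiBytes);
    }
}
```

Usage would then be something like TextBox2.Text = AsciiFilter.StripNonAscii(TextBox1.Text); which should behave like the StringBuilder loop, just with the filtering done by the encoder.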

davetherave2 · 19 Jul 2007 at 14:54

I wasn't aware that string concatenation was bad. I am still learning C#, and although I have learnt a lot, I am still doing things the way I probably would have when I started out learning other languages such as VB and C++.

I shall try them out and see how much of an effect they have on performance.

Edit: well, I just tried the first example you gave me and it has just flown through a 13 million character string in just over 90 seconds.

Inquisitor · 19 Jul 2007 at 15:11

The reason your original algorithm is so slow is that strings are immutable (i.e. they can't be changed). So when you concatenate the two strings, .NET has to create an entirely new string to store the result, copying everything accumulated so far into it, and the old one gets thrown out of the window. This is fine when you're doing it just a few times, but when you're doing it as many times as you are, it adversely affects performance. Also, the string gets longer each time you do it, so every copy costs more than the last and the total work grows roughly with the square of the input length.
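
To put a rough number on that, here is a small sketch that counts how many characters get written in total when a string is built one character at a time with s = s + c. It is a simplified model, not a measurement: it assumes every input character is kept and ignores allocation and garbage collection overhead, and the names ConcatCost and CharsCopied are invented for this example.

```csharp
using System;

static class ConcatCost
{
    // Hypothetical helper: total characters written when a string of the
    // given length is built by appending one character per concatenation.
    public static long CharsCopied(int length)
    {
        long total = 0;
        for (int i = 1; i <= length; i++)
        {
            // The i-th concatenation creates a brand new string of i characters.
            total += i;
        }
        return total; // equals length * (length + 1) / 2
    }
}
```

Under that model, ConcatCost.CharsCopied(200000) comes to 20,000,100,000, i.e. about 20 billion characters written to produce a 200,000 character result, whereas a pre-sized StringBuilder writes each kept character once.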

More info here:
http://www.yoda.arachsys.com/csharp/stringbuilder.html

And lots more very useful and interesting articles:
http://www.yoda.arachsys.com/csharp/index.html