C# and reading file contents

Gman · 25 Sep 2007 at 21:02

Here's a little background on what I'm trying to do.

I've got an HTML file that is split into sections, each identified by a div, like so:

Code:

<HTML>
   <HEAD><TITLE>This Page</TITLE></HEAD>
   <BODY>
       <Div id="Section1">
            lar lar lar lar lar
       </div>
       <Div id="Section2">
            lar lar lar lar lar
       </div>
       <Div id="Section3">
            lar lar lar lar lar
       </div>
       <Div id="Section4">
            lar lar lar lar lar
       </div>
   </BODY>
</HTML>

This html file is called something like test.htm

Now in C# I'm trying to read the contents of this HTML file but exclude certain sections depending on the variables I've got.
i.e. in the C# code I might have an array of booleans like:

Section1 = true
Section2 = false
Section3 = true
Section4 = false

I need to read the contents of the HTML file, but in the variable that holds the read data I need the sections that are marked false removed.

Any ideas on this ?

Inquisitor · 26 Sep 2007 at 04:46

Are you assuming the HTML to be XML compliant? This would make it much, much easier, but by the looks of your example, you're not.

Gman · 26 Sep 2007 at 11:13

yep you can assume it is

TrUz · 26 Sep 2007 at 12:03

XmlTextReader should do what you need.

TrUz

Inquisitor · 26 Sep 2007 at 16:05

Yup. .NET's XML libraries are enormously useful
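
As a rough sketch of how that could look (not tested against the real file): this version uses XmlDocument rather than XmlTextReader, since removing nodes is easier on an in-memory tree. The class name SectionFilter and the dictionary of flags are made up for illustration. It assumes the markup is well-formed XML with lowercase element names, no DOCTYPE and no default namespace. Note that the sample in the first post opens with `<Div` and closes with `</div>`; XML is case-sensitive, so a strict parser would reject that unless the case is made consistent.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, named for illustration only.
public static class SectionFilter
{
    // Removes every <div> whose id is mapped to false in 'sections'.
    // Assumes 'xhtml' is well-formed XML with lowercase <div> elements,
    // no DOCTYPE and no default namespace.
    public static string RemoveDisabledSections(string xhtml, IDictionary<string, bool> sections)
    {
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.LoadXml(xhtml);

        // Collect matches first; removing nodes while walking the
        // node list returned by SelectNodes is not safe.
        List<XmlNode> toRemove = new List<XmlNode>();
        foreach (XmlNode div in doc.SelectNodes("//div[@id]"))
        {
            string id = div.Attributes["id"].Value;
            bool enabled;
            if (sections.TryGetValue(id, out enabled) && !enabled)
                toRemove.Add(div);
        }

        foreach (XmlNode div in toRemove)
            div.ParentNode.RemoveChild(div);

        return doc.OuterXml;
    }
}
```

Sections that are missing from the dictionary are left in place, which seemed the safer default.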

Gman · 26 Sep 2007 at 18:20

thanks will take a look at that.
Cheers

Gman · 30 Sep 2007 at 16:27

Finally got round to giving this a go and had to try and do it the following way as the XmlTextReader didn't like the file contents.

This is the code, but for some reason it's not removing the divs that I don't need.

What happens is that printJobFileName contains a query string with values such as Section_0, Section_1 etc., and the HTML file has divs with the same IDs. This code should loop through and remove the divs that are set to false in the query string.

Any ideas ?

Code:

private byte [] extraxtContents(byte[] data, string printJobFileName)
    {
        Uri uri = new Uri(printJobFileName);

        string html = System.Text.ASCIIEncoding.ASCII.GetString(data);

        bool finished = false;
        int startIndex = html.IndexOf( "<div" );
        sEvent = startIndex.ToString();
        int endIndex;
        // split the query string into name=value pairs on the '&' character
        string [] queryValues = uri.Query.Split( '&' );
        // iterate thru all div tags and check id attr against query string values
        while( startIndex > 0 )
        {
            // find end of opening div tag
            endIndex = html.IndexOf(">", startIndex);
            if (endIndex > 0)
            {
                // iterate thru all query string values
                foreach (string query in queryValues)
                {
                    if (query.Split('=').Length < 2)
                        continue;
                    string sectionName = query.Split('=')[0];
                    bool enabled = bool.Parse( query.Split('=')[1] );

                    if (!enabled)   // is this section disabled?
                    {
                        // does this div's id match the query string value
                        int idIndex = html.IndexOf(sectionName, startIndex);
                        if (idIndex < endIndex) // yes
                        {
                            // remove this div tag from HTML
                            endIndex = html.IndexOf("</div>", startIndex);
                            if (endIndex > 0)
                            {
                                string divTag = html.Substring(startIndex, endIndex + 5);
                                html.Replace(divTag, "");
                                break;
                            }
                        }
                    }
                }                            
                // find next div tag
                startIndex = html.IndexOf("<div", startIndex+1 );
            }
        }

        return ASCIIEncoding.ASCII.GetBytes(html);
    }

Gman · 30 Sep 2007 at 18:22

Little update: it appears that it's the line
Uri uri = new Uri(printJobFileName);
which is the problem, because the Uri constructor is converting the '?' character in the URL to '%3F', which screws things up when the Uri tries to separate out the query string.

Has anyone had this problem before and know how to sort it? Googling is throwing up nothing.

TrUz · 1 Oct 2007 at 08:35

You will need to treat '%3F' as an encoded '?', or read the URL as a plain string instead.

TrUz

Stelly · 1 Oct 2007 at 10:14

There is a way in C# to search the string for %3F and replace it with ?. I will have a look today...

Stelly

Stelly · 1 Oct 2007 at 10:33

You need to use something like...

int searchSoft = -1;
string str = "%3F";

searchSoft = SoftName.IndexOf(str, StringComparison.OrdinalIgnoreCase);

if (searchSoft >= 0)
{
    // Replace returns a new string, so assign the result back
    SoftName = SoftName.Replace(str, "?");
}

do you get where I'm coming from?

Stelly

Inquisitor · 1 Oct 2007 at 11:30

Why initialise searchSoft to -1?

Gman · 2 Oct 2007 at 19:02

In the end I just scrapped the use of Uri, split the whole URL string on '?' and took the second element in the array.

There were also a few other problems: the result of Replace was not being assigned back into the html variable, and the Substring call was wrong, as I was passing endIndex as the second parameter when it needs to be the number of characters from startIndex.

Finally, I had to add more validation to 'if (idIndex < endIndex)': if the div had already been removed and the id was not found, IndexOf returned -1, which passed the check, so it ended up removing everything after that point.
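
For anyone finding this later, here is a minimal sketch with those three fixes applied. It is not the exact code I ended up with: the name DivStripper and the dictionary parameter are just for illustration, and it assumes the divs are not nested and that ids are written as id="...". It also uses Remove(start, length) instead of Replace, so only the div that was found gets cut, and it looks for the id inside the opening tag only, which sidesteps the -1 problem.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
public static class DivStripper
{
    // Removes each <div ...>...</div> whose opening tag contains id="name"
    // for a name mapped to false. Assumes divs are not nested.
    public static string Strip(string html, IDictionary<string, bool> sections)
    {
        // ">= 0" so a div at the very start of the string is not skipped.
        int start = html.IndexOf("<div", StringComparison.OrdinalIgnoreCase);
        while (start >= 0)
        {
            int tagEnd = html.IndexOf('>', start);
            if (tagEnd < 0) break;

            // Second argument of Substring is a length, not an end index.
            string openTag = html.Substring(start, tagEnd - start + 1);

            bool removed = false;
            foreach (KeyValuePair<string, bool> section in sections)
            {
                if (section.Value) continue; // section is enabled, keep it

                // Only look for the id inside this opening tag.
                string idAttr = "id=\"" + section.Key + "\"";
                if (openTag.IndexOf(idAttr, StringComparison.Ordinal) < 0) continue;

                int close = html.IndexOf("</div>", tagEnd, StringComparison.OrdinalIgnoreCase);
                if (close < 0) break; // unclosed div, leave it alone

                // Strings are immutable: assign the result back.
                int length = close + "</div>".Length - start;
                html = html.Remove(start, length);
                removed = true;
                break;
            }

            // After a removal the next tag may now sit at 'start' itself.
            int next = removed ? start : start + 1;
            start = html.IndexOf("<div", next, StringComparison.OrdinalIgnoreCase);
        }
        return html;
    }
}
```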

Thanks for the help guys.

Also, will there be any issues with using the split on '?' instead of using Uri?
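
Two cases look worth guarding against with a plain Split('?'): if the string has no '?' at all, taking element [1] throws an IndexOutOfRangeException, and if a second '?' ever turns up inside the query, everything after it lands in a later element and gets lost. A rough sketch that avoids both by cutting at the first '?' only is below. The name QueryFlags is made up, and it assumes the values are always "true" or "false" as in my case; anything else is skipped.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, named for illustration only.
public static class QueryFlags
{
    // Parses "path?Section_0=true&Section_1=false" into name -> flag pairs.
    public static Dictionary<string, bool> Parse(string url)
    {
        Dictionary<string, bool> flags = new Dictionary<string, bool>();

        int q = url.IndexOf('?');
        if (q < 0) return flags; // no query string at all

        // Everything after the first '?', even if more '?' follow.
        string query = url.Substring(q + 1);

        foreach (string pair in query.Split('&'))
        {
            // Split on the first '=' only, giving at most two parts.
            string[] parts = pair.Split(new char[] { '=' }, 2);

            bool enabled;
            if (parts.Length == 2 && bool.TryParse(parts[1], out enabled))
                flags[Uri.UnescapeDataString(parts[0])] = enabled;
        }
        return flags;
    }
}
```

Using bool.TryParse rather than bool.Parse means a malformed value is ignored instead of throwing.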