XML reading empty tags with no whitespace - python

I'm having trouble parsing the following XML file correctly:

Code:
<?xml version="1.0" encoding="UTF-8"?>
<!--Sample XML file generated by XML Spy v4.1 U (http://www.xmlspy.com)-->
<xml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="director_param.xsd">
	<Header hdr_filename="345678.par" hdr_filetype="Director Parameters"></Header>
	<Content>
		<JOB>
			<CUSTOMER></CUSTOMER>
			<DEPARTMENT></DEPARTMENT>
			<DOCUMENT_TYPE></DOCUMENT_TYPE>
			<JOB_RECEIVED_DATE></JOB_RECEIVED_DATE>
			<SLA_DUE_DATE></SLA_DUE_DATE>
			<SLA_WARNING_OFFSET></SLA_WARNING_OFFSET>
		</JOB>
	</Content>
</xml>

This is the test Python code. I'm using XMLParser because I need to implement some bespoke parsing behaviour for start, data and end. I'm aware of the standard ElementTree functionality, but it's not appropriate for my scenario.

Code:
import xml.etree.ElementTree as ET

class Parser:

    def __init__(self):
        self._tag_name = ""
        self._section_tagdata = {}

    def start(self, tag, attrs):
        self._tag_name = str(tag).encode('ascii','ignore')

    def end(self, tag):        
        pass

    def data(self, data):        
        if '\t' not in data and '\n' not in data:
            self._section_tagdata[self._tag_name] = data.encode('ascii','ignore')

    def close(self):
        pass

target = Parser()
parser = ET.XMLParser(target=target)
# Read the whole file in one go; the with block closes it afterwards
with open('O:/clients/concepts/workflow/test.xml','rbU') as xmlFile:
    xml = xmlFile.read()

parser.feed(xml)
parser.close()
for k,v in target._section_tagdata.iteritems():
    print k,v

In this scenario the print loop produces no output. The problem seems to occur when the XML data looks like this:

Code:
\t\t\t<CUSTOMER></CUSTOMER>\r\n
\t\t\t<DEPARTMENT></DEPARTMENT>\r\n

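When it tries to read the data for the <CUSTOMER> tag, it takes the data as being \r\n\t\t\t, i.e. the whitespace that follows the closing tag, so the check in data() throws it away and nothing is stored. Things are different if the XML file is formatted so that a tag with no data contains a single whitespace character, like so: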

Code:
<?xml version="1.0" encoding="UTF-8"?>
<!--Sample XML file generated by XML Spy v4.1 U (http://www.xmlspy.com)-->
<xml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="director_param.xsd">
	<Header hdr_filename="345678.par" hdr_filetype="Director Parameters"></Header>
	<Content>
		<JOB>
			<CUSTOMER> </CUSTOMER>
			<DEPARTMENT> </DEPARTMENT>
			<DOCUMENT_TYPE> </DOCUMENT_TYPE>
			<JOB_RECEIVED_DATE> </JOB_RECEIVED_DATE>
			<SLA_DUE_DATE> </SLA_DUE_DATE>
			<SLA_WARNING_OFFSET> </SLA_WARNING_OFFSET>
		</JOB>
	</Content>
</xml>

Then the output of the above code is:

Code:
CUSTOMER  
SLA_DUE_DATE  
JOB_RECEIVED_DATE  
SLA_WARNING_OFFSET  
DEPARTMENT  
DOCUMENT_TYPE

Which is correct. Other than altering the format of my XML, what can I change in the code to accommodate empty tags?
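
One way to check the behaviour described above is to log every callback the parser makes. The following is only a minimal sketch, written for Python 3 rather than the Python 2 used in the code above; `EventLogger` and `trace` are hypothetical names made up for the illustration. It shows that data() is not called at all between the start and end of an empty tag, so the next data() call carries the layout whitespace that follows the closing tag.

```python
import xml.etree.ElementTree as ET


class EventLogger:
    """Hypothetical target that records each parser callback in order."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrs):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, data):
        self.events.append(("data", data))

    def close(self):
        pass


def trace(xml_text):
    """Feed xml_text to an XMLParser and return the logged callbacks."""
    target = EventLogger()
    parser = ET.XMLParser(target=target)
    parser.feed(xml_text)
    parser.close()
    return target.events


sample = "<JOB>\n\t<CUSTOMER></CUSTOMER>\n\t<DEPARTMENT> </DEPARTMENT>\n</JOB>"
for event in trace(sample):
    # repr() makes tabs and newlines visible in the output
    print(repr(event))
```

Note that the parser may deliver a run of whitespace in more than one data() call, so the exact number of data events can vary.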
 
OK, fixed it. It was quite a simple change in the end.

Code:
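    def data(self, data):
        if '\t' not in data and '\n' not in data:
            # Real character data for the current tag
            self._section_tagdata[self._tag_name] = data.encode('ascii','ignore')
        elif self._tag_name not in self._section_tagdata:
            # Only layout whitespace was seen, so the tag was empty:
            # record it with an empty value without overwriting real data
            self._section_tagdata[self._tag_name] = ""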
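
For comparison, here is a different approach from the fix above, offered only as a sketch under a few assumptions: it is written for Python 3, it collects text between start() and end() and stores it when the element closes, and it only records leaf elements (those with no child elements). `BufferingParser` and `parse_leaf_text` are hypothetical names for the illustration. Because the value is stored in end(), it does not depend on whether the text contains tabs or newlines.

```python
import xml.etree.ElementTree as ET


class BufferingParser:
    """Sketch of a target that stores an element's text when it closes."""

    def __init__(self):
        self._open_tag = None   # most recently opened tag, if still a leaf
        self._chunks = []       # text pieces seen since the last start tag
        self.section_tagdata = {}

    def start(self, tag, attrs):
        self._open_tag = tag
        self._chunks = []

    def data(self, data):
        self._chunks.append(data)

    def end(self, tag):
        # Only a leaf closes while it is still the most recently opened tag
        if tag == self._open_tag:
            self.section_tagdata[tag] = "".join(self._chunks)
        self._open_tag = None

    def close(self):
        pass


def parse_leaf_text(xml_text):
    """Return a dict mapping each leaf tag to its (possibly empty) text."""
    target = BufferingParser()
    parser = ET.XMLParser(target=target)
    parser.feed(xml_text)
    parser.close()
    return target.section_tagdata


sample = "<JOB>\n\t<CUSTOMER></CUSTOMER>\n\t<DEPARTMENT>Sales</DEPARTMENT>\n</JOB>"
for tag, text in parse_leaf_text(sample).items():
    print(tag, repr(text))
```

With this sketch the text is kept exactly as it appears in the file, so `<CUSTOMER> </CUSTOMER>` would be stored as a single space; calling `.strip()` on the joined text would be one way to normalise that if desired.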
 