I am having trouble parsing the following XML file correctly.
Code:
<?xml version="1.0" encoding="UTF-8"?>
<!--Sample XML file generated by XML Spy v4.1 U (http://www.xmlspy.com)-->
<xml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="director_param.xsd">
	<Header hdr_filename="345678.par" hdr_filetype="Director Parameters"></Header>
	<Content>
		<JOB>
			<CUSTOMER></CUSTOMER>
			<DEPARTMENT></DEPARTMENT>
			<DOCUMENT_TYPE></DOCUMENT_TYPE>
			<JOB_RECEIVED_DATE></JOB_RECEIVED_DATE>
			<SLA_DUE_DATE></SLA_DUE_DATE>
			<SLA_WARNING_OFFSET></SLA_WARNING_OFFSET>
		</JOB>
	</Content>
</xml>
This is the test Python code. I'm using XMLParser with a custom target because I need to implement some bespoke parsing behaviour for start, data and end. I am aware of the standard ElementTree functionality, but it's not appropriate for my scenario.
Code:
import xml.etree.ElementTree as ET

class Parser:
    def __init__(self):
        self._tag_name = ""
        self._section_tagdata = {}

    def start(self, tag, attrs):
        self._tag_name = str(tag).encode('ascii','ignore')

    def end(self, tag):
        pass

    def data(self, data):
        if '\t' not in data and '\n' not in data:
            self._section_tagdata[self._tag_name] = data.encode('ascii','ignore')

    def close(self):
        pass

target = Parser()
parser = ET.XMLParser(target=target)
xmlFile = open('O:/clients/concepts/workflow/test.xml','rbU')
xml = ""
for line in xmlFile:
    xml += line
parser.feed(xml)
parser.close()

for k,v in target._section_tagdata.iteritems():
    print k,v
In this scenario the print loop produces no output. The problem seems to occur when the XML data looks like this:
Code:
\t\t\t<CUSTOMER></CUSTOMER>\r\n
\t\t\t<DEPARTMENT></DEPARTMENT>\r\n
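One way to check what the parser really delivers for input like this is to log every callback. The sketch below is only an illustration under stated assumptions: `TraceTarget` and the `sample` string are hypothetical names that are not part of the test code above, and it is written for Python 3 rather than the Python 2 used in the post.

```python
import xml.etree.ElementTree as ET


class TraceTarget:
    """Hypothetical helper: records every callback the parser makes."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrs):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, data):
        self.events.append(("data", data))

    def close(self):
        # XMLParser.close() hands back whatever the target's close() returns.
        return self.events


# Assumed sample: the two lines above wrapped in a single <JOB> element.
sample = (
    "<JOB>\r\n"
    "\t\t\t<CUSTOMER></CUSTOMER>\r\n"
    "\t\t\t<DEPARTMENT></DEPARTMENT>\r\n"
    "\t\t</JOB>"
)

parser = ET.XMLParser(target=TraceTarget())
parser.feed(sample)
events = parser.close()

for kind, value in events:
    print(kind, repr(value))
```

In this trace no `data` event appears between the `start` and `end` of `CUSTOMER`. The whitespace arrives after `end`, at a point where a target that only remembers the last started tag still has `CUSTOMER` recorded. The parser may split or normalise the whitespace (for example `\r\n` to `\n`), so the exact chunks can differ.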
When the parser tries to read the data for the <CUSTOMER> tag, it takes the data to be \r\n\t\t\t. If, however, the XML file is formatted so that a tag with no data contains a single whitespace character, like so:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<!--Sample XML file generated by XML Spy v4.1 U (http://www.xmlspy.com)-->
<xml xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="director_param.xsd">
	<Header hdr_filename="345678.par" hdr_filetype="Director Parameters"></Header>
	<Content>
		<JOB>
			<CUSTOMER> </CUSTOMER>
			<DEPARTMENT> </DEPARTMENT>
			<DOCUMENT_TYPE> </DOCUMENT_TYPE>
			<JOB_RECEIVED_DATE> </JOB_RECEIVED_DATE>
			<SLA_DUE_DATE> </SLA_DUE_DATE>
			<SLA_WARNING_OFFSET> </SLA_WARNING_OFFSET>
		</JOB>
	</Content>
</xml>
Then the output of the above code is:
Code:
CUSTOMER
SLA_DUE_DATE
JOB_RECEIVED_DATE
SLA_WARNING_OFFSET
DEPARTMENT
DOCUMENT_TYPE
Which is correct. Other than altering the format of my XML, what can I change in the code to accommodate the first format, where the empty tags contain no whitespace?
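One possible direction, offered as a sketch rather than a definitive fix: buffer the text per element and store it in `end()` instead of `data()`, so an element is recorded even when `data()` is never called for it. Everything here is an assumption for illustration: `LeafTextTarget`, the shortened `sample` document and the value `5` are made up, the text is stripped (so `" "` and no text both become `""`), the ASCII encoding step is left out, and the code targets Python 3.

```python
import xml.etree.ElementTree as ET


class LeafTextTarget:
    """Hypothetical target: stores the stripped text of every leaf element."""

    def __init__(self):
        self._stack = []  # one [chunks, has_child] entry per open element
        self.section_tagdata = {}

    def start(self, tag, attrs):
        if self._stack:
            self._stack[-1][1] = True  # the parent now has a child element
        self._stack.append([[], False])

    def end(self, tag):
        chunks, has_child = self._stack.pop()
        if not has_child:
            # Leaf element: record it even if data() was never called for it.
            self.section_tagdata[tag] = "".join(chunks).strip()

    def data(self, data):
        # Text may arrive in several chunks, so collect it until end().
        if self._stack:
            self._stack[-1][0].append(data)

    def close(self):
        return self.section_tagdata


# Shortened, assumed version of the file from the question.
sample = (
    '<xml>\r\n'
    '\t<Header hdr_filename="345678.par"></Header>\r\n'
    '\t<Content>\r\n'
    '\t\t<JOB>\r\n'
    '\t\t\t<CUSTOMER></CUSTOMER>\r\n'
    '\t\t\t<DEPARTMENT> </DEPARTMENT>\r\n'
    '\t\t\t<SLA_WARNING_OFFSET>5</SLA_WARNING_OFFSET>\r\n'
    '\t\t</JOB>\r\n'
    '\t</Content>\r\n'
    '</xml>'
)

parser = ET.XMLParser(target=LeafTextTarget())
parser.feed(sample)
result = parser.close()

for k, v in sorted(result.items()):
    print(k, repr(v))
```

Container elements such as `JOB` and `Content` are skipped because they have child elements, so the indentation between tags is never stored as a value. Unlike the original test code, this sketch also records `Header`, since it is an empty leaf element.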