Normally, I prefer to send CSV or JSON data to Splunk. But sometimes XML can’t be avoided. I recently needed to ingest an XML file, and through judicious use of ‘MUST_BREAK_AFTER’ and ‘BREAK_ONLY_BEFORE’ in props.conf, I was able to break the file into events that looked like this:
<ReportSection name="foo_node" category="node">
   <Long name="nodeOid">-94323972016633549</Long>
   <String name="type">Windows Server</String>
   <Integer name="passed">9</Integer>
   <Integer name="failed">1</Integer>
   <Integer name="errors">0</Integer>
   <String name="status"></String>
   <Integer name="statusPercent">90</Integer>
   <String name="statusRange"></String>
   <Integer name="noResults">0</Integer>
   <Timestamp name="lastCheckTime" displayvalue="10/26/15 1:04 AM">1445835867360</Timestamp>
</ReportSection>
The problem with this XML is that KV_MODE = xml causes Splunk to use the tag name (e.g. “String”) as each field’s name, rather than the value of the tag’s name attribute. So you end up with an event looking like this:
Since I don’t write this blog to show you problems and leave you hopeless, here’s how to extract meaningful fields from this XML:
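- Don’t set KV_MODE = xml in props.conf
- Use a transforms-based extraction instead. Note that a REPORT- setting is applied at search time, not index time (index-time extractions use TRANSFORMS-). You can use more than one extraction if necessary.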
props.conf (inside the stanza for your sourcetype):
REPORT-xml1 = xml1
transforms.conf:
[xml1]
REGEX = <\w+ name="(\w+)"(?: displayvalue.*?)*>(.*?)<\/\w+>
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true
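To check what the regular expression above captures before deploying it, here is a minimal Python sketch that runs the same pattern against the sample event. It is only an approximation: Splunk uses PCRE rather than Python's `re` module, and the names `PATTERN`, `SAMPLE_EVENT`, and `extract_fields` are made up for this illustration. Storing each value in a list loosely imitates MV_ADD = true.

```python
import re

# Same pattern as the REGEX line in transforms.conf, copied unchanged.
# Assumption: Python's re treats this pattern the same way Splunk's PCRE does.
PATTERN = re.compile(r'<\w+ name="(\w+)"(?: displayvalue.*?)*>(.*?)<\/\w+>')

# The sample event from the top of the post.
SAMPLE_EVENT = """<ReportSection name="foo_node" category="node">
   <Long name="nodeOid">-94323972016633549</Long>
   <String name="type">Windows Server</String>
   <Integer name="passed">9</Integer>
   <Integer name="failed">1</Integer>
   <Integer name="errors">0</Integer>
   <String name="status"></String>
   <Integer name="statusPercent">90</Integer>
   <String name="statusRange"></String>
   <Integer name="noResults">0</Integer>
   <Timestamp name="lastCheckTime" displayvalue="10/26/15 1:04 AM">1445835867360</Timestamp>
</ReportSection>"""


def extract_fields(event):
    """Return {field_name: [values]} using $1 as the name and $2 as the value."""
    fields = {}
    for name, value in PATTERN.findall(event):
        # Append rather than overwrite, so repeated names keep every value.
        fields.setdefault(name, []).append(value)
    return fields


if __name__ == "__main__":
    for name, values in extract_fields(SAMPLE_EVENT).items():
        print(name, values)
```

The opening ReportSection tag is skipped because its name attribute is followed by category rather than by displayvalue or a closing bracket, so only the child elements become fields.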
Originally, I was going to use a second extraction which would match the Timestamp tag and get the value of the displayvalue attribute. However, I decided instead to just grab the value for the whole Timestamp tag, which is a Unix timestamp in milliseconds. Splunk’s convert command makes it easy to work with Unix timestamps once the value has been divided by 1000 to get seconds.
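One detail worth checking is the unit of the extracted value. The sample value 1445835867360 has 13 digits, which suggests milliseconds rather than seconds since the epoch. The Python sketch below converts it under that assumption; the function name `epoch_millis_to_utc` is hypothetical. The result is 05:04 UTC on 26 October 2015, which lines up with the displayvalue of 1:04 AM if the reporting system was four hours behind UTC (also an assumption).

```python
from datetime import datetime, timedelta, timezone

# The Unix epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_millis_to_utc(raw_value):
    """Convert a string of epoch milliseconds into an aware UTC datetime.

    Assumes the value counts milliseconds; integer arithmetic via timedelta
    avoids floating-point rounding of the fractional second.
    """
    return EPOCH + timedelta(milliseconds=int(raw_value))


if __name__ == "__main__":
    # Value taken from the Timestamp element in the sample event.
    print(epoch_millis_to_utc("1445835867360").isoformat())
```

In Splunk the same adjustment would presumably be a division by 1000 in an eval before passing the field to convert.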
Here’s a screenshot of the end result in Splunk: