Skip to content

Add XML parser using XPath queries#8047

Closed
srebhan wants to merge 34 commits intoinfluxdata:masterfrom
HRI-EU:parser_xml
Closed

Add XML parser using XPath queries#8047
srebhan wants to merge 34 commits intoinfluxdata:masterfrom
HRI-EU:parser_xml

Conversation

@srebhan
Copy link
Copy Markdown
Member

@srebhan srebhan commented Aug 28, 2020

Required for all PRs:

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

This PR adds a XML parser using XPath expressions to define fields, tags, etc. To achieve this, we use the underlying antchfx/xpath library. Please check there to see what expressions are supported.

The PR closes #1758 and #6968. The configuration syntax is borrowed from suggestions made by @danielnelson in #1758 (comment).
Furthermore, the code is heavily inspired by the json parser plugin.

@ssoroka
Copy link
Copy Markdown
Contributor

ssoroka commented Aug 28, 2020

Looks like the build failure is related to shirou/gopsutil#912

@srebhan
Copy link
Copy Markdown
Member Author

srebhan commented Aug 31, 2020

This is also related to PR #7988 trying to bump gopsutils...

@srebhan srebhan force-pushed the parser_xml branch 3 times, most recently from e0f9244 to 9629438 Compare August 31, 2020 10:56
@M0rdecay
Copy link
Copy Markdown
Contributor

M0rdecay commented Aug 31, 2020

Hello!

It looks like we are trying to solve one problem in parallel (https://github.com/M0rdecay/telegraf/tree/xml_parser/plugins/parsers/xml).

I like your solution, but I also have a couple of questions:

  • with your parser, is it possible to solve the problem when I need to get all the values ​​from a node without listing each of them in fields? In the case of very large documents, this would be useful
  • a similar question about tags
  • are all child nodes of the current node retrieved recursively?
  • is it possible to limit the array of analyzed nodes across the nodes?

@M0rdecay M0rdecay mentioned this pull request Aug 31, 2020
3 tasks
@srebhan
Copy link
Copy Markdown
Member Author

srebhan commented Aug 31, 2020

Hey @M0rdecay,

indeed we are trying to solve the same problem... :-) Let me answer your questions:

* with your parser, is it possible to solve the problem when I need to get all the values ​​from a node without listing each of them in `fields`? In the case of very large documents, this would be useful
* a similar question about tags

I thought about this one and I think it can be done. I want to be able to specify something like

field_query = some_node/*    # relative to selected node
field_query_name = @name    # relative to field_query or nothing for using the node name
field_query_value = @value    # relative to field_query or nothing for using the node value

and similar for tags. When the field_query_name/_value is given we take the values of the query relative to each field_query node. This allows to use attributes as names/values. If not given we simply take the node-names as field/tag names and values.

The problems I see:

  • While this works nicely for tags, the type of the fields cannot be specified. This way everything will be a string... Is this acceptable?
  • What if the name/values are given as attribute pairs? Ok we maybe could craft a XPath for that one...
  • What do we do if both the queries above and field/tag-definitions are given?
* are all child nodes of the current node retrieved recursively?

I think you can do this as a XPath query (see descendant).

* is it possible to limit the array of analyzed nodes across the nodes?

This should also be possible with XPath syntax I think.

This being said, I first want to get this merged without adding more complexity. Let's extend this in a second round. I'm looking forward to your opinion on my thoughts above! Any help appreciated! :-)

@M0rdecay
Copy link
Copy Markdown
Contributor

M0rdecay commented Aug 31, 2020

Thank you for the quick response!

I agree, let`s continue in the second round)

I ask you to see the solution I proposed for your example - #7460 (comment) - I will wait for your opinion.

Answering your questions and suggestions:

  • This way everything will be a string... Is this acceptable? For myself, I found such a solution for converting a string - https://github.com/M0rdecay/telegraf/blob/xml_parser/plugins/parsers/xml/parser.go#L211, maybe it will work for you too
  • What if the name/values are given as attribute pairs? - Is it possible to take a key-value pair directly from a node/attribute? Does the library you use allow it? If so, it should make the task a lot easier. So far, the configuration without this feature seems to me very difficult, I would like a simpler configuration, like in json parser

It seems to me that trying to provide maximum flexibility, we may fall into the hell of configuration.
The name and value of a field/tag can be formed directly from the name and value of a node or attribute. If you need to rename them, you can use processors - fortunately, the telegraf offers a rich set of tools for this.

Upd. For example, I need to get from a document like this:

<Document>
  <Data>
    <Server_first>
      <Data>
        <Value>1</Value>
      </Data>
      <Hosts>
        <Host_1>
          <Name>Host1.local</Name>
          <Uptime>1000</Uptime>
          <Connections>
            <Total>15</Total>
            <Current>2</Current>
          </Connections>
        </Host_1>
        <Host_2>
          <Name>Host2.local</Name>
          <Uptime>1240</Uptime>
          <Connections>
            <Total>33</Total>
            <Current>4</Current>
          </Connections>
        </Host_2>
      </Hosts>
    </Server_first>
    <Server_second>
      <Data>
        <Value>1</Value>
      </Data>
      <Hosts>
        <Host_3>
          <Name>Host3.local</Name>
          <Uptime>3000</Uptime>
          <Connections>
            <Total>35</Total>
            <Current>33</Current>
          </Connections>
        </Host_3>
        <Host_4>
          <Name>Host4.local</Name>
          <Uptime>5240</Uptime>
          <Connections>
            <Total>63</Total>
            <Current>78</Current>
          </Connections>
        </Host_4>
      </Hosts>
    </Server_second>
  </Data>
</Document>

metrics like this:

file,Name=Host1.local,xml_node_name=Host_1 Current=2i,Total=15i,Uptime=1000i 1598901930000000000
file,Name=Host2.local,xml_node_name=Host_2 Current=4i,Total=33i,Uptime=1240i 1598901930000000000
file,Name=Host3.local,xml_node_name=Host_3 Current=33i,Total=35i,Uptime=3000i 1598901930000000000
file,Name=Host4.local,xml_node_name=Host_4 Current=78i,Total=63i,Uptime=5240i 1598901930000000000

I'll be glad to see how this can be done with your parser!

@srebhan
Copy link
Copy Markdown
Member Author

srebhan commented Sep 1, 2020

@M0rdecay here is the config to reproduce your output on the given XML:

[[inputs.file]]
  files = ["M0rdecay.xml"]
  data_format = "xml"

  [[inputs.file.xml]]
    selected_nodes = "//Hosts/*[starts-with(name(), 'Host_')]"

    [inputs.file.xml.tags]
      name = "Name"
      xml_node_name="name()"

    [inputs.file.xml.fields_int]
      current = "Connections/Current"
      total = "Connections/Total"
      uptime = "Uptime"

It is a bit more verbose than your config without the field/tag_queries we discussed earlier, but as I said, I'm willing to add this later.

@M0rdecay
Copy link
Copy Markdown
Contributor

M0rdecay commented Sep 8, 2020

@srebhan hello!

Please, tell me, does your parser, when parsing an array, allow you to get tags/fields from the beginning of the document (like //some_node) or relative to the current parsed node, but at a higher level (../../some_element)?

In one of my cases, I found that I was missing such functionality.

@srebhan
Copy link
Copy Markdown
Member Author

srebhan commented Sep 9, 2020

Hey @M0rdecay,

sure. The readme states that if you start the path with a / you specify an absolute path, otherwise it is relative to the selected nodes (or the root node if non selected, which is the same as an absolute path :-)).
If you have more questions please join the slack-channel (I'm srebhan :-)) as it is easier than filling the Pull-Request...

@srebhan srebhan force-pushed the parser_xml branch 2 times, most recently from 0701a75 to 3aa9608 Compare September 14, 2020 08:22
@srebhan srebhan force-pushed the parser_xml branch 2 times, most recently from 8778046 to f7a62d1 Compare October 1, 2020 16:26
@srebhan
Copy link
Copy Markdown
Member Author

srebhan commented Oct 1, 2020

@M0rdecay:
I've found some time to implement the field selection similar to what you do. The only missing feature is the automagic type-conversion. I've we can settle on this parser you could easily implement that on top of the PR...
Could you give it a try and see if it works for you so we can get this rolling!?

@M0rdecay
Copy link
Copy Markdown
Contributor

M0rdecay commented Oct 1, 2020

@M0rdecay:
I've found some time to implement the field selection similar to what you do. The only missing feature is the automagic type-conversion. I've we can settle on this parser you could easily implement that on top of the PR...
Could you give it a try and see if it works for you so we can get this rolling!?

Sure. I'll try to do it by tomorrow night.

@M0rdecay
Copy link
Copy Markdown
Contributor

M0rdecay commented Oct 1, 2020

I have checked if this works for me with one example.
We discussed a case with Sven - Unfortunately, it didn't work for me :(

For a part of document like this, information will be lost:

                <COMMAND>
                    <CMD>41</CMD>
                    <ALL>
                        <PENDING_COUNT>0</PENDING_COUNT>
                        <MIN_DURATION>0</MIN_DURATION>
                        <MAX_DURATION>0</MAX_DURATION>
                        <DURATION_SUM>0</DURATION_SUM>
                        <DURATION_SQR_SUM>0</DURATION_SQR_SUM>
                        <DET_COUNT>0</DET_COUNT>
                    </ALL>
                    <SUCC>
                        <PENDING_COUNT>0</PENDING_COUNT>
                        <MIN_DURATION>0</MIN_DURATION>
                        <MAX_DURATION>0</MAX_DURATION>
                        <DURATION_SUM>0</DURATION_SUM>
                        <DURATION_SQR_SUM>0</DURATION_SQR_SUM>
                        <DET_COUNT>0</DET_COUNT>
                    </SUCC>
                    <FAIL>
                        <PENDING_COUNT>0</PENDING_COUNT>
                        <MIN_DURATION>0</MIN_DURATION>
                        <MAX_DURATION>0</MAX_DURATION>
                        <DURATION_SUM>0</DURATION_SUM>
                        <DURATION_SQR_SUM>0</DURATION_SQR_SUM>
                        <DET_COUNT>0</DET_COUNT>
                    </FAIL>
                </COMMAND>

@M0rdecay
Copy link
Copy Markdown
Contributor

M0rdecay commented Oct 2, 2020

Okay, we've discussed the case again - it looks like it's working very well now.

It will take me a little time for other tests.

@srebhan
Copy link
Copy Markdown
Member Author

srebhan commented Oct 2, 2020

One more question from my side: Currently the configuration options have inconsistent naming schemes e.g. "selected_nodes" (this should probably be "selected_metrics" anyway) and "field_selection". There are more inconsistencies...

Which naming scheme do you prefer?
@ssoroka, @M0rdecay others?

@M0rdecay
Copy link
Copy Markdown
Contributor

M0rdecay commented Oct 2, 2020

Which naming scheme do you prefer?

I like the current. After all, we first parse XML, and only then generate metrics.

@M0rdecay
Copy link
Copy Markdown
Contributor

M0rdecay commented Oct 2, 2020

I think the parser is flexible and functional.

The only thing that scares me is the complexity of the configuration and the requests themselves. I didn't succeed the first time -_-

@sjwang90 sjwang90 mentioned this pull request Oct 23, 2020
@M0rdecay
Copy link
Copy Markdown
Contributor

Well, so far so good
Honestly, I like it

@ssoroka when is this PR expected to merge?

@reimda
Copy link
Copy Markdown
Contributor

reimda commented Mar 3, 2021

Merged in #8931. Thanks @srebhan!

@reimda reimda closed this Mar 3, 2021
@srebhan srebhan deleted the parser_xml branch March 4, 2021 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/xml new plugin plugin/parser 1. Request for new parser plugins 2. Issues/PRs that are related to parser plugins

Projects

None yet

Development

Successfully merging this pull request may close these issues.

XML input

7 participants