Extract an XML node value from an XML doc based on XPATH?

kmbailey · January 28, 2011, 4:56pm

I’m looking for a way to retrieve an XML doc (via URL), and then extract a node value from it, preferably identifying the node by XPATH.

The context for using this is either in the Content Type Properties (as a Pre-Processing exit to populate a field from the XML) or in a Velocity template (to get a value from XML for inclusion in the assembled page).

I’ve gotten part of the way there. In the Velocity template, I used $rx.doc.getDocument to hit the URL and get the XML document in a variable. But I’m not sure what methods or functions to use to parse out the XML node I’m looking for.

In the Content Type Pre-Processing, I’m not sure how to get the XML document or parse it. I have tried using the sys_xdDomToText extension, passing it the URL for the XML document, but it’s unclear what the other arguments do, or where the result goes (and I want to extract a node from the result anyway).

Has anyone got any experience in making this work?

Thanks,
Kathleen

Rushing · January 29, 2011, 12:11am

I haven’t tried this before… but you should have access to the JDOM classes that the Rhythmyx server uses. If so, you won’t need to use $rx.doc.getDocument().

To get to those classes without developing a Velocity Extension for it, you can use code similar to the following (untested):

#set( $dom = $sys.getClass().forName('org.jdom.input.SAXBuilder').newInstance().build( $url ) )
#set( $xpath = $sys.getClass().forName('org.jdom.xpath.XPath').newInstance( $expression ) )
#set( $resultNodes = $xpath.selectNodes( $dom ) )
#foreach( $node in $resultNodes )
$node.getValue()
#end

or

#set( $dom = $sys.getClass().forName('org.jdom.input.SAXBuilder').newInstance().build( $url ) )
#set( $xpath = $sys.getClass().forName('org.jdom.xpath.XPath').newInstance( $expression ) )
#set( $result = "$!{xpath.selectSingleNode( $dom ).getValue()}" )

In both of these code blocks, $url is the URL of your XML file and $expression is your XPath expression. Both are normal strings.

kmbailey · January 31, 2011, 2:27pm

Thanks for this suggestion.

I tried this, both in the Velocity code, and as template bindings, and in both cases, I got java.lang.ClassNotFoundException: org.jdom.input.SAXBuilder. This class is not available in that context in my environment (we are running Rhythmyx 6.5.2 - not sure if that makes a difference).

-Kathleen

Rushing · February 4, 2011, 12:17am

Blast… I was assuming that since jdom.jar was in my 6.5 folder (AppServer\server\rx\deploy\rxapp.ear\rxapp.war\WEB-INF\lib), it would be available for use. I’ll have to test to make sure, but in the meantime, check whether it’s in your folder. If not, download a copy from www.jdom.org.

Normally, I’d prefer to use Saxon, but JDOM’s interface is much simpler for what you’re looking to accomplish.

kmbailey · April 6, 2011, 11:43am

Thanks for this pointer.

In the meantime, I worked around this by using JSON instead of XML, but I would still like to know how to do this. I’m in the midst of planning our upgrade from Rhythmyx 6.5.2 to CM System 6.7, and once we’re on 6.7, I expect to get back to figuring this out. This is in the context of our plans for integration with an online video platform, which provides both XML and JSON integration options. I’d like to make both work with Rhythmyx, so I have a choice in different situations. I’ll try to remember to come back and post my results here.

-Kathleen

jitendra · April 6, 2011, 2:00pm

Have a look at PSXMLDomUtil (It is in the RX Public API)…that might just be what you are looking for once you have the xml doc… I think I had used it for something before, but I can’t remember if I settled for using that vs. org.w3c.dom.* classes…

Rushing · April 6, 2011, 2:24pm

Looking over a fresh install of 6.7, I see dom4j-1.6.1.jar.

Here’s some code that might work in place of what I wrote earlier for JDOM…


#set( $urlObj = $sys.getClass().forName('java.net.URL').newInstance( $url ) )
#set( $dom = $sys.getClass().forName('org.dom4j.io.SAXReader').newInstance().read( $urlObj ) )
#set( $resultNodes = $dom.selectNodes( $expression ) )
#foreach( $node in $resultNodes )
$node.getValue()
#end

or


#set( $urlObj = $sys.getClass().forName('java.net.URL').newInstance( $url ) )
#set( $dom = $sys.getClass().forName('org.dom4j.io.SAXReader').newInstance().read( $urlObj ) )
#set( $resultNode = $dom.selectSingleNode( $expression ) )
$resultNode.getValue()

Gotchas to watch out for: if the read or select method calls return null, then the variable won’t be set… use your favorite means of error checking to prevent unpredictable results.
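The null gotcha is not specific to dom4j or Velocity. As a rough illustration only, here is how the same guard might look in plain Java using the JDK's own javax.xml classes (not dom4j). The class and method names (NullSafeXPath, parse, firstValue) are made up for this sketch and are not part of Rhythmyx or any library.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not a Rhythmyx class.
final class NullSafeXPath {

    /** Parses an XML string, rethrowing parser failures as one unchecked exception. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the text of the first node the expression selects, or empty if none. */
    static Optional<String> firstValue(Document doc, String expression) {
        try {
            // With XPathConstants.NODE the result is null when nothing matches.
            Node node = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
            // Optional absorbs both a null node and a null text content.
            return Optional.ofNullable(node).map(Node::getTextContent);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }
}
```

A caller can then decide what an empty result means (skip the field, use a default, log it) instead of rendering an unset variable.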

Jitendra, I’m not familiar with using the w3c classes to do this… do they provide an implementation-independent interface to do all this?

Would the code end up looking like this?


#set( $dom = $sys.getClass().forName('javax.xml.parsers.DocumentBuilderFactory').newInstance().newDocumentBuilder().parse( $url ) )
#set( $xpath = $sys.getClass().forName('javax.xml.xpath.XPathFactory').newInstance().newXPath() )
$xpath.evaluate( $expression, $dom )
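For comparison, the same three steps (build a parser, parse the document, evaluate the expression) can be sketched in plain Java with only JDK classes. This is a sketch under assumptions, not tested against Rhythmyx: the class name JaxpXPath and its methods are hypothetical, and evaluateXml exists only so the idea can be tried without a live URL.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not a Rhythmyx class.
final class JaxpXPath {

    /** Parses XML from the given source and returns the expression's string value. */
    static String evaluate(InputSource source, String expression) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(source);
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The two-argument evaluate returns a String; it is "" when nothing matches.
            return xpath.evaluate(expression, dom);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    /** Fetches the document from a URL (passed as the InputSource system id). */
    static String evaluateUrl(String url, String expression) {
        return evaluate(new InputSource(url), expression);
    }

    /** Evaluates against XML that is already in memory as a string. */
    static String evaluateXml(String xml, String expression) {
        return evaluate(new InputSource(new StringReader(xml)), expression);
    }
}
```

For example, JaxpXPath.evaluateXml(xmlText, "/catalog/item/name") would return the text of the first matching name element, where xmlText and the path are placeholders for your own data.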

jitendra · April 6, 2011, 3:19pm

Ah…it comes back to me…Long story short, we needed a way to parse a document (say the body of a particular item) and create new content items on the fly based on what was found and also remove certain divs with ids.


...
String xmlString = itemBodyField.getValue().getValueAsString();
PSXmlDocumentBuilder psdb = new PSXmlDocumentBuilder();
Document doc = psdb.createXmlDocument(new StringReader(xmlString), false);
Element docEle = doc.getDocumentElement();
HashSet removeElements = new HashSet();
      
NodeList divNodes = docEle.getElementsByTagName("div");
// Now iterate through divNodes to check their contents and see whether we need to create a new item
...

Rushing, w3c provides the interface and Percussion has implemented it in that particular class. I believe I just settled on using PSXmlDocumentBuilder to write the XML doc back, and I didn’t need to use PSXMLDomUtil to find / select nodes. I believe that if you go through Perc’s implemented classes, it should be platform independent and you don’t have to rely on a separate jar installation. However, I don’t think it implements the latest version of w3c.dom, as I do recall a method (don’t recall which one) not existing that should have…
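For readers without access to the Percussion classes, the div-scanning part of that approach might look roughly like this with only JDK classes. This is a hedged sketch: it swaps PSXmlDocumentBuilder for the standard DocumentBuilderFactory and Transformer, and the class name DivStripper, the method stripDivs, and the idea of passing in a set of ids are all assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; uses JDK classes, not PSXmlDocumentBuilder.
final class DivStripper {

    /** Removes every div whose id is in idsToRemove and returns the remaining XML. */
    static String stripDivs(String xml, Set<String> idsToRemove) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList divNodes = doc.getDocumentElement().getElementsByTagName("div");

            // The NodeList is live, so collect matches first and remove them afterwards.
            List<Element> toRemove = new ArrayList<>();
            for (int i = 0; i < divNodes.getLength(); i++) {
                Element div = (Element) divNodes.item(i);
                if (idsToRemove.contains(div.getAttribute("id"))) {
                    toRemove.add(div);
                }
            }
            for (Element div : toRemove) {
                div.getParentNode().removeChild(div);
            }

            // Serialize the modified document back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process XML", e);
        }
    }
}
```

Collecting the elements before removing them matters because the list returned by getElementsByTagName shrinks as nodes are detached, which would otherwise cause some divs to be skipped.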

Rushing · May 2, 2011, 10:07am

I finally ran across an instance where I needed to do this. Here’s how I implemented this:


#macro(xpathEval $BASE_URL $EXPRESSION)
#set($url = $rx.link.addParams($BASE_URL,'year',$year,'make',$make,'model',$model,'style_id',$styleId,'model_id',$modelId))
#set($domBuilder = $user.debug.invoke('javax.xml.parsers.DocumentBuilderFactory','newInstance').newDocumentBuilder())
#set($source = $user.debug.create('org.xml.sax.InputSource'))
$source.setCharacterStream($user.debug.create('java.io.StringReader',$rx.doc.getDocument($url)))##
#set($dom = $domBuilder.parse($source))
#set($xpath = $user.debug.invoke('javax.xml.xpath.XPathFactory','newInstance').newXPath())
$!{xpath.evaluate($EXPRESSION, $dom)}##
#end

Dependency: AtcToolkit

Caveat: the two-argument javax.xml.xpath.XPath evaluate call used here always returns a string, so result nodes are converted to strings before the macro returns, which prevents customized traversal of subtrees.
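If node results are needed, the JDK's XPath interface also has a three-argument evaluate overload that takes a return type such as XPathConstants.NODESET. The following is a minimal Java sketch of that overload, not a drop-in replacement for the macro; the class name NodeSetXPath and the method selectNodes are hypothetical, and whether the overload can be reached conveniently from Velocity is not something this sketch establishes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Rhythmyx or AtcToolkit.
final class NodeSetXPath {

    /** Returns the nodes selected by the expression instead of their string value. */
    static NodeList selectNodes(String xml, String expression) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Asking for NODESET makes evaluate return an org.w3c.dom.NodeList.
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, dom, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }
}
```

Each item in the returned list is a regular DOM node, so attributes and child elements can be walked with the usual org.w3c.dom methods.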

Rushing · May 3, 2011, 9:05am

ah… the year, make and model params aren’t supposed to be part of the generic code… sorry. lol

ckeleher · December 21, 2011, 4:21pm

Hi Rushing,

Your macro would solve a major problem for me, but I can’t get it to work. We have installed the AtcToolkit.

I edited the macro to return each of the variables set so I could see where it broke down.
$domBuilder (returns org.apache.xerces.jaxp.DocumentBuilderImpl@1a9dd0e) and $xpath (returns com.sun.org.apache.xpath.internal.jaxp.XPathImpl@efda94) both appear to be set correctly.

$url fails, even after removing your parameters, but I can set that using #set($url = $BASE_URL) if that is acceptable.

But I can’t get $source or anything dependent on $source to work.

Any help would be much appreciated!

Cindy