ParseFile vs Parse: A Quick Comparison

VTDGen has two main methods in version 2.12 that you can call to parse XML documents.

The first one is parse(), which accepts a boolean indicating whether parsing is namespace-aware. It throws a variety of exceptions corresponding to different parsing errors, such as encoding errors, invalid entity references, or namespace qualification errors. You need to catch those exceptions in your code, and can then choose to obtain a detailed diagnostic message about the nature of the error. parse() always works in conjunction with one of a pair of setDoc() methods, which accept either a byte array containing the entire input XML, or a byte array and a pair of integers delimiting the segment of the byte array that contains the XML document. The maximum file size is 2 GB without namespace awareness, and 1 GB with it. Also remember that you need to read the file content into memory yourself, so the whole parsing operation takes about six to ten lines of code.
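The manual loading step mentioned above can be sketched in plain Java. This is only an illustration: the class and helper names (LoadForParseSketch, fitsLimit, load) are hypothetical, the size limits are simply the figures quoted in this article, and the VTD-XML calls appear only as comments so that the block compiles without the library on the classpath.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class LoadForParseSketch {
    // Limits as quoted in the article (assumed here, not read from the library).
    static final long LIMIT_WITHOUT_NS = 1L << 31; // 2 GB
    static final long LIMIT_WITH_NS = 1L << 30;    // 1 GB

    // Hypothetical helper: does a document of this size fit the quoted limit?
    static boolean fitsLimit(long sizeInBytes, boolean namespaceAware) {
        long limit = namespaceAware ? LIMIT_WITH_NS : LIMIT_WITHOUT_NS;
        return sizeInBytes < limit;
    }

    // Hypothetical helper: read the whole file into the byte array that parse() works on.
    static byte[] load(Path xmlFile, boolean namespaceAware) throws IOException {
        long size = Files.size(xmlFile);
        if (!fitsLimit(size, namespaceAware)) {
            throw new IOException("document too large: " + size + " bytes");
        }
        return Files.readAllBytes(xmlFile);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = load(Paths.get(args[0]), true);
        // With VTD-XML available, the remaining steps would look roughly like:
        //   VTDGen vg = new VTDGen();
        //   vg.setDoc(doc);                 // or vg.setDoc(doc, offset, length)
        //   try { vg.parse(true); }         // true = namespace-aware
        //   catch (ParseException e) { System.err.println(e.getMessage()); }
        System.out.println("loaded " + doc.length + " bytes");
    }
}
```

Counting the commented-out calls, this lands in the six-to-ten-line range the paragraph above mentions.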

The second one is parseFile(), which accepts the fully qualified path name of an XML document and a boolean that switches namespace awareness on or off. Built on top of setDoc() and parse(), parseFile() returns the status of parsing as a boolean. If parsing fails for any reason, whether the file does not exist or there is a well-formedness error, it only returns false and furnishes no detailed diagnostic information. So if you don’t want to concern yourself with the nitty-gritty of exception handling, or simply want to avoid the clutter, choose parseFile(): parsing then requires no more than two lines of code.

There is More

Three more purpose-built parsing methods, cast in the same mold as parseFile(), are at your disposal to simplify coding. Those methods are:

  • parseFileHttpURL(): obtains an XML document via an HTTP URL, parses it, and then returns
  • parseZIPFile(): parses a ZIP-compressed XML file; it inflates the document into an in-memory byte array before parsing
  • parseGZIPFile(): parses a GZIP-compressed XML file

They offer the same benefits and limitations for their intended use cases as parseFile() does for reading uncompressed, local XML files.
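To give a feel for what "inflating the document into an in-memory byte array before parsing" involves, here is a small sketch built on the standard java.util.zip classes. It is not VTD-XML's own implementation: the class and method names are hypothetical, and it uses GZIP rather than ZIP only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipInflateSketch {
    // Hypothetical helper: inflate a GZIP stream completely into memory.
    static byte[] inflate(InputStream gzipped) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(gzipped);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    // Compress some bytes so the demo needs no file on disk.
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<root><item/></root>".getBytes(StandardCharsets.UTF_8);
        byte[] doc = inflate(new ByteArrayInputStream(gzip(xml)));
        // doc is now an ordinary byte array that setDoc() and parse() could work on.
        System.out.println(new String(doc, StandardCharsets.UTF_8));
    }
}
```

Because the whole inflated document has to sit in memory, the size limits of parse() presumably apply to the uncompressed size, not to the size of the archive.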

Configure Parser

In addition, the following routines help you configure the run-time behavior of the parser.

  • setLcDepth() configures VTDGen to generate a Location Cache of depth 3 or 5. Before version 2.12, the default depth was 3; since then, it has been 5.
  • enableIgnoredWhiteSpace() enables the parser to collect all white space, including trivial white space. By default, trivial white space is ignored.

 

End Note

Hope this article has provided a glimpse of VTDGen’s parsing methods. As you may have noticed, there are a few more VTDGen member methods that this article has not covered. They pertain to advanced features and capabilities of VTD-XML, namely buffer reuse and loading/writing indexes. Both subjects will be covered in detail in a later article.

How to remove comment nodes from an XML document?

In this post I am going to show you how to effectively remove comments from an XML document using the combination of XMLModifier and XPath. The input XML document looks like the following.

<?xml version="1.0"?><!-- some other code here -->
<clients>
<!-- some other code here -->

<function>
</function>

<function>
</function>

<function>
<name>data_values</name>
<variables>
<variable><!-- some other code here -->
<name>temp</name>
<!-- some other code here --> <type>double</type>
</variable>
</variables><!-- some other code here -->
<block><!-- some other code here -->
<opster>temp = 1</opster>
</block>
</function>
</clients>

The code that performs the task is listed below. The key is the XPath expression “//comment()”, which selects all the comment nodes in the document. After binding the VTDNav object to the XMLModifier object, you can simply call the “remove()” method, which removes not only the content of the comment but also the surrounding delimiting text (i.e. <!-- and -->).


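import com.ximpleware.*;
import java.io.*;

public class removeNodesDemo {

    public static void main(String[] args) throws VTDException, IOException {
        VTDGen vg = new VTDGen();
        // parse without namespace awareness; parseFile() returns false on any failure
        if (!vg.parseFile("d:\\xml\\input2.xml", false))
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        XMLModifier xm = new XMLModifier(vn);
        // select every comment node in the document
        ap.selectXPath("//comment()");

        // evalXPath() moves the cursor to the next match and returns -1 when there are no more
        while (ap.evalXPath() != -1) {
            // mark the node under the cursor for removal
            xm.remove();
        }
        // write the modified document to a new file
        xm.output("d:\\xml\\output2.xml");
    }
}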

The output XML is

<clients>


<function>
</function>

<function>
</function>

<function>
<name>data_values</name>
<variables>
<variable>
<name>temp</name>
<type>double</type>
</variable>
</variables>
<block>
<opster>temp = 1</opster>
</block>
</function>
</clients>

You might ask: what if I want to remove an attribute node, a text node, a CDATA node, an element node, or a processing instruction node?

XMLModifier’s remove() method has the following effect on each type of node:

  • On a non-CDATA text node, it simply removes the text
  • On a CDATA text node, it removes the text content and the surrounding delimiting text
  • On an element node, it removes the entire element fragment
  • On an attribute node, it removes the attribute name-value pair in its entirety
  • On a processing instruction node, it removes both the content and the surrounding delimiting text

In other words, to remove all processing instruction nodes, just substitute the XPath expression above with “//processing-instruction()”.
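Removing a node "together with its delimiting text" boils down to widening a byte range before cutting it out. The plain-Java sketch below illustrates that idea only. It assumes that a node is known by the offset and length of its content, and the class and method names are hypothetical. It is not how XMLModifier is implemented, which this article does not describe.

```java
import java.nio.charset.StandardCharsets;

class DelimiterRangeSketch {
    // Hypothetical helper: cut out the content at [contentOffset, contentOffset + contentLength)
    // plus openLen bytes before it and closeLen bytes after it.
    static byte[] removeWithDelimiters(byte[] doc, int contentOffset, int contentLength,
                                       int openLen, int closeLen) {
        int start = contentOffset - openLen;
        int end = contentOffset + contentLength + closeLen;
        byte[] out = new byte[doc.length - (end - start)];
        System.arraycopy(doc, 0, out, 0, start);                 // bytes before the node
        System.arraycopy(doc, end, out, start, doc.length - end); // bytes after the node
        return out;
    }

    public static void main(String[] args) {
        String xml = "<a><!-- note --><b/></a>";
        byte[] doc = xml.getBytes(StandardCharsets.US_ASCII);
        int contentOffset = xml.indexOf("<!--") + 4;              // content starts after "<!--"
        int contentLength = xml.indexOf("-->") - contentOffset;
        // "<!--" is 4 bytes long and "-->" is 3 bytes long
        byte[] out = removeWithDelimiters(doc, contentOffset, contentLength, 4, 3);
        System.out.println(new String(out, StandardCharsets.US_ASCII)); // prints <a><b/></a>
    }
}
```

The same helper covers a CDATA section by passing 9 and 3 for the lengths of "<![CDATA[" and "]]>".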