White Space Handling in 2.13

In this post I will highlight VTD-XML v2.13’s whitespace handling capability along with some examples in Java.

Quick Review

Native to non-extractive parsing, VTD-XML’s handling of XML tokens and elements frequently revolves around the concept of byte segments. Once an XML document is parsed into VTD tokens, the byte segment enveloping the entire content of any token or element can be thought of as a pair of descriptors (i.e. offset and length) pointing into the original document.

For a large class of XML content extraction and modification operations, non-extractive parsing allows applications to circumvent the tedious, cycle-wasting tasks of de-serializing and re-serializing the byte content of elements, and thereby helps achieve the maximum possible performance.

Whitespace handling in 2.12

Version 2.12 of VTD-XML introduces two new methods that help either trim or expand the surrounding white spaces of byte
segments denoted by 64-bit integers.

  • trimWhiteSpaces(long l), of VTDNav class, accepts a byte segment descriptor, removes both the leading and trailing white spaces, and returns a new descriptor.
  • expandWhiteSpaces(long l), of the same VTDNav class, takes a segment descriptor and returns a new descriptor that includes all the leading and trailing white spaces around the input segment.

It is worth noting that both methods are greedy: they will remove/expand as many white spaces as they possibly can. Furthermore, you can make the observation that the effect of one call often negates the other.
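To make the descriptor idea concrete, here is a minimal standalone sketch in plain Java. It is not VTD-XML's implementation: the class name SegmentSketch and its helper methods are hypothetical, and the only thing borrowed from the library is the layout used later on this page, with the length in the upper 32 bits and the offset in the lower 32 bits of a long. It shows a greedy trim, a greedy expand, and how one can undo the other.

```java
import java.nio.charset.StandardCharsets;

// Standalone illustration only; NOT VTD-XML's code. All names are hypothetical.
class SegmentSketch {
    // Pack a segment: length in the upper 32 bits, offset in the lower 32 bits.
    static long pack(int offset, int length) {
        return (((long) length) << 32) | (offset & 0xFFFFFFFFL);
    }

    static int offsetOf(long seg) { return (int) seg; }

    static int lengthOf(long seg) { return (int) (seg >> 32); }

    static boolean isWs(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    // Greedy trim: shrink the segment until neither end is a white space.
    static long trim(byte[] doc, long seg) {
        int start = offsetOf(seg);
        int end = start + lengthOf(seg); // exclusive
        while (start < end && isWs(doc[start])) start++;
        while (end > start && isWs(doc[end - 1])) end--;
        return pack(start, end - start);
    }

    // Greedy expand: grow the segment over every adjacent white space.
    static long expand(byte[] doc, long seg) {
        int start = offsetOf(seg);
        int end = start + lengthOf(seg); // exclusive
        while (start > 0 && isWs(doc[start - 1])) start--;
        while (end < doc.length && isWs(doc[end])) end++;
        return pack(start, end - start);
    }

    static String text(byte[] doc, long seg) {
        return new String(doc, offsetOf(seg), lengthOf(seg), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<a>  hi  </a>".getBytes(StandardCharsets.UTF_8);
        long inner = pack(3, 6);              // the text "  hi  "
        long trimmed = trim(doc, inner);      // shrinks to "hi"
        long expanded = expand(doc, trimmed); // grows back to "  hi  "
        System.out.println("[" + text(doc, trimmed) + "]");
        System.out.println("[" + text(doc, expanded) + "]");
    }
}
```

In this sketch, expanding the trimmed segment lands back on the original six bytes, which is the "one call negates the other" observation in miniature.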

Additions in 2.13

Three static constants and three more methods are added to VTDNav class in 2.13.

Those constants are:

  1. VTDNav.WS_LEADING
  2. VTDNav.WS_TRAILING
  3. VTDNav.WS_BOTH

The three new methods are:

  • trimWhiteSpaces(long l, short actionType) still trims the white spaces around the segment and returns a new segment descriptor. But the trimming operation can now be applied to the leading white spaces, the trailing ones, or both, depending on the value of actionType.
  • trimWhiteSpaces(int index, short actionType) brings you the power and convenience of trimming a VTD record without the hassle of manually composing a 64-bit segment descriptor.
  • expandWhiteSpaces(long l, short actionType) still expands the white spaces. But the expansion can now be designated to include the leading white spaces, the trailing ones, or both.
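The effect of the action type can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the library's code: the class DirectionalTrimSketch is hypothetical, and the numeric values given to the three constants here are made up for the sketch (the real values of VTDNav.WS_LEADING, WS_TRAILING and WS_BOTH are not assumed). The sample text is the Company Name value from the whitespace-trimming post further down this page.

```java
import java.nio.charset.StandardCharsets;

// Standalone illustration only; NOT VTD-XML's code. All names are hypothetical.
class DirectionalTrimSketch {
    // Hypothetical values chosen for this sketch; VTDNav's real values are not assumed.
    static final short WS_LEADING = 1;
    static final short WS_TRAILING = 2;
    static final short WS_BOTH = 3;

    static boolean isWs(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    // Trims the chosen side(s) of the segment and returns {offset, length}.
    static int[] trim(byte[] doc, int offset, int length, short action) {
        int start = offset;
        int end = offset + length; // exclusive
        if ((action & WS_LEADING) != 0) {
            while (start < end && isWs(doc[start])) start++;
        }
        if ((action & WS_TRAILING) != 0) {
            while (end > start && isWs(doc[end - 1])) end--;
        }
        return new int[] { start, end - start };
    }

    static String show(byte[] doc, int[] seg) {
        return "[" + new String(doc, seg[0], seg[1], StandardCharsets.UTF_8) + "]";
    }

    public static void main(String[] args) {
        byte[] doc = " Company Name    ".getBytes(StandardCharsets.UTF_8);
        System.out.println(show(doc, trim(doc, 0, doc.length, WS_LEADING)));
        System.out.println(show(doc, trim(doc, 0, doc.length, WS_TRAILING)));
        System.out.println(show(doc, trim(doc, 0, doc.length, WS_BOTH)));
    }
}
```

Treating the action as a pair of bit flags keeps the two directions independent, so "both" is simply the two single-sided trims applied together.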

 

Common Use Cases

Suppose you want to remove some element fragments from the master XML document, but you want the remaining XML text to retain its original formatting, or you want to make slight, fine-grained changes to it (e.g. paragraph separation, indentation). You can also extract a segment of XML bytes without losing the line breaks or tabs that surround it.

An Example


<root>
<name>suresh</name>
<address>Address</address>
</root>

Consider the following example taken from a Q/A thread on StackOverflow web site
(http://stackoverflow.com/questions/36972163/vtd-xmlremoving-the-spaces-after-removing-the-element)

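With 2.12, if you remove the “<name>suresh</name>” fragment using the following code, expandWhiteSpaces() swallows the white spaces on both sides of the fragment, and you end up with an XML document (shown after the code) in which <address> is pulled up onto the same line as <root>.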


import com.ximpleware.*;
import java.io.*;

public class testExpandSpace {
    public static void main(String[] args) throws VTDException, IOException {
        VTDGen vg = new VTDGen();
        AutoPilot ap = new AutoPilot();
        XMLModifier xm = new XMLModifier();
        // parse without namespace awareness; give up if parsing fails
        if (!vg.parseFile("d://xml//testSuresh.xml", false))
            return;
        VTDNav vn = vg.getNav();
        ap.bind(vn);
        xm.bind(vn);
        ap.selectXPath("//name");
        int index = -1;
        while ((index = ap.evalXPath()) != -1) {
            System.out.println(" ===> " + vn.toString(index) + "===>");
            long elementFragment = vn.getElementFragment();
            // expand over both the leading and trailing white spaces, then remove
            xm.remove(vn.expandWhiteSpaces(elementFragment));
        }
        xm.output("d://xml//test1111.xml");
    }
}

<root><address>Address</address>
</root>

With 2.13, you can expand the “name” fragment to include only its trailing white space before removing it, thereby preserving the desired indentation in the output.

import com.ximpleware.*;
import java.io.*;

public class testExpandSpace {
    public static void main(String[] args) throws VTDException, IOException {
        VTDGen vg = new VTDGen();
        AutoPilot ap = new AutoPilot();
        XMLModifier xm = new XMLModifier();
        // parse without namespace awareness; give up if parsing fails
        if (!vg.parseFile("d://xml//testSuresh.xml", false))
            return;
        VTDNav vn = vg.getNav();
        ap.bind(vn);
        xm.bind(vn);
        ap.selectXPath("//name");
        int index = -1;
        while ((index = ap.evalXPath()) != -1) {
            System.out.println(" ===> " + vn.toString(index) + "===>");
            long elementFragment = vn.getElementFragment();
            // expand over the trailing white spaces only, then remove
            xm.remove(vn.expandWhiteSpaces(elementFragment, VTDNav.WS_TRAILING));
        }
        xm.output("d://xml//test1111.xml");
    }
}

Below is the output with the desired format.


<root>
<address>Address</address>
</root>
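The difference between the two outputs can be reproduced without the library. The sketch below is a plain-Java imitation, not VTD-XML code: RemoveWithExpandSketch and its constants are hypothetical names with made-up values, and it works on String characters rather than on the byte segments that VTD-XML uses. It locates the fragment, expands the range over the requested side(s), and cuts it out.

```java
// Standalone illustration only; NOT VTD-XML's code. All names are hypothetical.
class RemoveWithExpandSketch {
    // Hypothetical values chosen for this sketch; VTDNav's real values are not assumed.
    static final short WS_LEADING = 1;
    static final short WS_TRAILING = 2;
    static final short WS_BOTH = 3;

    // Removes the first occurrence of fragment, plus the adjacent white spaces
    // on the side(s) selected by action.
    static String removeFragment(String doc, String fragment, short action) {
        int start = doc.indexOf(fragment);
        if (start < 0) {
            return doc; // nothing to remove
        }
        int end = start + fragment.length(); // exclusive
        if ((action & WS_LEADING) != 0) {
            while (start > 0 && Character.isWhitespace(doc.charAt(start - 1))) start--;
        }
        if ((action & WS_TRAILING) != 0) {
            while (end < doc.length() && Character.isWhitespace(doc.charAt(end))) end++;
        }
        return doc.substring(0, start) + doc.substring(end);
    }

    public static void main(String[] args) {
        String doc = "<root>\n<name>suresh</name>\n<address>Address</address>\n</root>";
        String fragment = "<name>suresh</name>";
        System.out.println(removeFragment(doc, fragment, WS_BOTH));
        System.out.println("----");
        System.out.println(removeFragment(doc, fragment, WS_TRAILING));
    }
}
```

With WS_BOTH the line break after <root> is consumed along with the fragment, which gives the first output shown above; with WS_TRAILING only the line break after </name> goes, which gives the second.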

Maven Repository

In case you are not aware, VTD-XML 2.11 and 2.12 are also available on Maven Repository. They are available at http://mvnrepository.com/artifact/com.ximpleware/vtd-xml

Unlike SourceForge or GitHub, the Maven repository hosts only snapshots of VTD-XML taken at the point of each release, and only for the Java platform. With SourceForge’s CVS you get the complete history of the entire source base, as well as up-to-the-minute updates, for the whole range of supported platforms.

ParseFile vs Parse: A Quick Comparison

VTDGen has two main methods in version 2.12 that you can call to parse XML documents.

The first one is parse(), which accepts a boolean indicating whether parsing is namespace-aware. It throws a variety of exceptions corresponding to various parsing errors, such as encoding errors, invalid entity references, or namespace qualification errors. You need to catch those exceptions in your code, and can then choose to obtain the detailed diagnostic message about the nature of the error. parse() always works in conjunction with one of a pair of setDoc() methods, which accept either a byte array containing the entire input XML, or a byte array and a pair of integers delimiting the segment of the byte array that contains the XML document. The maximum document size is 2 GB without namespace awareness, and 1 GB with it. Also remember that you need to read the file content into memory yourself, and that the whole parsing step takes about six to ten lines of code.

The second one is parseFile(), which accepts the fully qualified path of an XML file and a boolean that switches namespace awareness on or off. Built on top of setDoc() and parse(), parseFile() returns the status of parsing as a boolean. If parsing fails for any reason, whether because the file does not exist or because of a well-formedness error, it only returns false and furnishes no detailed diagnostic information. So if you don’t want to concern yourself with the nitty-gritty of exception handling, or simply want to avoid the clutter, choose parseFile(): parsing then requires no more than two lines of code.
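The trade-off between the two styles can be sketched in plain Java. Nothing below is VTD-XML code: ThrowingParser is a hypothetical stand-in for an exception-throwing parser, TOY is a toy implementation that merely checks that the document starts with '<', and parseFileQuietly is a hypothetical convenience wrapper in the spirit of parseFile(). The point is only to show how a boolean-returning wrapper reads the file, delegates, and discards the diagnostic detail.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Standalone illustration only; NOT VTD-XML's code. All names are hypothetical.
class ParseStyleSketch {
    // Stand-in for a parser that reports every problem by throwing.
    interface ThrowingParser {
        void parse(byte[] xml, boolean namespaceAware) throws Exception;
    }

    // Toy parser: accepts any input whose first byte is '<'.
    static final ThrowingParser TOY = (xml, ns) -> {
        if (xml.length == 0 || xml[0] != '<') {
            throw new IllegalArgumentException("document must start with '<'");
        }
    };

    // Convenience style: read the file, parse it, report only success or failure.
    static boolean parseFileQuietly(Path file, boolean ns, ThrowingParser parser) {
        try {
            byte[] xml = Files.readAllBytes(file); // the manual step the wrapper hides
            parser.parse(xml, ns);
            return true;
        } catch (Exception e) {
            return false; // the reason for the failure is lost here
        }
    }

    // Runs three cases: a good file, a malformed file, and a missing file.
    static boolean[] demo() {
        try {
            Path good = Files.createTempFile("sketch-good", ".xml");
            Path bad = Files.createTempFile("sketch-bad", ".xml");
            Files.write(good, "<root/>".getBytes(StandardCharsets.UTF_8));
            Files.write(bad, "not xml".getBytes(StandardCharsets.UTF_8));
            Path missing = good.resolveSibling("sketch-missing-" + System.nanoTime() + ".xml");
            boolean[] results = {
                parseFileQuietly(good, false, TOY),
                parseFileQuietly(bad, false, TOY),
                parseFileQuietly(missing, false, TOY)
            };
            Files.deleteIfExists(good);
            Files.deleteIfExists(bad);
            return results;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("good=" + r[0] + " malformed=" + r[1] + " missing=" + r[2]);
    }
}
```

Note that the malformed file and the missing file are indistinguishable to the caller of the wrapper, which is exactly the limitation described above; calling the throwing parser directly keeps the exception and its message.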

There is More

Cast in the same mold as parseFile(), there are actually three more purpose-built parsing methods at your disposal serving to simplify coding. Those methods are:

  • parseHttpUrl() - obtains an XML document via an HTTP URL, parses it, and then returns the status
  • parseZIPFile() - parses a ZIP-compressed XML file; it inflates the document into an in-memory byte array before parsing
  • parseGZIPFile() - parses a GZIP-compressed XML file in the same manner

For their intended use cases, they come with the same benefits and limitations that parseFile() has for reading uncompressed, local XML files.

Configure Parser

In addition, the following routines  help you configure the run-time behavior of the parser.

  • selectLcDepth() configures VTDGen to generate a Location Cache of depth either 3 or 5. Before version 2.12, the default depth was 3; afterwards, it is 5.
  • enableIgnoredWhiteSpace() enables the parser to collect all white spaces, including the trivial ones. By default, trivial white spaces are ignored.

 

End Note

I hope this article has provided a glimpse of VTDGen’s parsing methods. As you may have noticed, there are a few more VTDGen member methods that this article has not covered. They pertain to advanced features and capabilities of VTD-XML, namely buffer reuse and loading/writing indexes. Both subjects will be covered in detail in a later article.

How to remove comment nodes from an XML document?

In this post I am going to show you how to effectively remove comments from an XML document using the combination of XMLModifier and XPath. The input XML document looks like the following.

<?xml version="1.0"?><!-- some other code here -->
<clients>
<!-- some other code here -->

<function>
</function>

<function>
</function>

<function>
<name>data_values</name>
<variables>
<variable><!-- some other code here -->
<name>temp</name>
<!-- some other code here --> <type>double</type>
</variable>
</variables><!-- some other code here -->
<block><!-- some other code here -->
<opster>temp = 1</opster>
</block>
</function>
</clients>

The code that performs the task is listed below. The key is the XPath expression “//comment()”, which selects all the comment nodes in the document. After binding the VTDNav object to the XMLModifier object, you can simply call the “remove()” method, which removes not only the content of the comment but also the surrounding delimiting text (i.e. <!-- and -->).


import com.ximpleware.*;
import java.io.*;

public class removeNodesDemo {

    public static void main(String[] args) throws VTDException, IOException {
        VTDGen vg = new VTDGen();
        // parse without namespace awareness; give up if parsing fails
        if (!vg.parseFile("d:\\xml\\input2.xml", false))
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        XMLModifier xm = new XMLModifier(vn);
        ap.selectXPath("//comment()");

        int i = 0;
        // remove() deletes the node at the cursor, i.e. each selected comment
        while ((i = ap.evalXPath()) != -1) {
            xm.remove();
        }
        xm.output("d:\\xml\\output2.xml");
    }
}

The output XML is

<clients>


<function>
</function>

<function>
</function>

<function>
<name>data_values</name>
<variables>
<variable>
<name>temp</name>
<type>double</type>
</variable>
</variables>
<block>
<opster>temp = 1</opster>
</block>
</function>
</clients>

You might ask: what if I want to remove an attribute node, a text node, a CDATA node, an element node, or a processing instruction node?

XMLModifier’s remove() method has the following effect on each type of node:

  • On a non-CDATA text node, it will simply remove it
  • On a CDATA typed text node, it will remove the text content and surrounding delimiting texts
  • On an element node, it will remove the entire element fragment
  • On an attribute node, it will remove the attribute name-value pair in its entirety.
  • On a processing instruction node, it will remove both the content and the surrounding delimiting text.

In other words, to remove all processing instruction nodes, just substitute the XPath expression above with “//processing-instruction().”

Whitespace Trimming in 2.12

This blog shows an example of using vtd-xml 2.12’s latest methods to remove the leading and trailing white spaces of the text nodes in an XML document.

 

This is the input document. And as you can easily see, ID’s text nodes have long trailing white spaces.

<?xml version="1.0"?>
<ns:myOrder xmlns:ns="http://w3schools.com/BusinessDocument" xmlns:ct="http://something.com/CommonTypes">
    <MessageHeader>
        <ct:ID>i7         </ct:ID>
        <ct:ID>i7         </ct:ID>
        <ct:ID>i7         </ct:ID>
        <ct:ID>i7         </ct:ID>
        <ct:Name> Company Name    </ct:Name>
    </MessageHeader>
</ns:myOrder>

This is the output XML file, with all leading and trailing white spaces of the text nodes removed.

<?xml version="1.0"?>
<ns:myOrder xmlns:ns="http://w3schools.com/BusinessDocument" xmlns:ct="http://something.com/CommonTypes">
    <MessageHeader>
        <ct:ID>i7</ct:ID>
        <ct:ID>i7</ct:ID>
        <ct:ID>i7</ct:ID>
        <ct:ID>i7</ct:ID>
        <ct:Name>Company Name</ct:Name>
    </MessageHeader>
</ns:myOrder>

Below is the Java code that removes the white spaces.


import com.ximpleware.*;
public class removeWS {

       public static void main(String[] s) throws VTDException, Exception{
             VTDGen vg = new VTDGen();
             AutoPilot ap = new AutoPilot();
             XMLModifier xm = new XMLModifier();
             if (vg.parseFile("d:\\xml2\\ws.xml", true)){
                    VTDNav vn = vg.getNav();
                    ap.bind(vn);
                    xm.bind(vn);
                    ap.selectXPath("//text()");
                    int i=-1;
                    while((i=ap.evalXPath())!=-1){
                        int offset = vn.getTokenOffset(i);
                        int len = vn.getTokenLength(i);

                        long l = vn.trimWhiteSpaces((((long)len)<<32)|offset );
                        System.out.println(" ===> "+vn.toString(i));
                        System.out.println("len ==>"+len+" new len==>"+ (l>>32));
                        int nlen = (int)(l>>32);
                        int nos= (int) l;
                        xm.updateToken(i,vn,nos,nlen);
                    }
                    xm.output("d:\\xml2\\new.xml");
             }
       }
}

VTD-XML 2.12 Released

VTD-XML 2.12 is released. To download the latest version, go to

https://sourceforge.net/projects/vtd-xml/files/vtd-xml/ximpleware_2.12/

The many ways that vtd-xml can help you optimize the performance of your applications

I was asked on StackOverflow what options VTD-XML offers for improving XML performance. Below is my answer, which I think is worth sharing with readers of this blog.

There are usually the following ways to optimize performance with VTD-XML:

  1. White space option - You can ask VTDGen to ignore or retain trivial white space characters. By default, VTDGen throws away those trivial white spaces. The difference is mainly in memory usage.
  2. Buffer reuse - You can ask VTDGen to reuse VTD buffers for the next parsing task. Otherwise, by default, VTDGen allocates a new buffer for each parsing run. This optimization technique is most useful if you are processing similarly sized XML files, so that the VTD buffer page size remains unchanged across consecutive parsing runs.
  3. Adjust LC level - By default, it is 3. But you can set it to 5. When your XML is deeply nested, setting the LC level to 5 results in better XPath performance. But it increases memory usage and parsing time very slightly.
  4. Reuse XPath - Compiling/selecting an XPath expression is a relatively slow operation, especially when you run the expression over many small files. The key is to take any AutoPilot.selectXPath() call out of loops and reuse the compiled expression by calling ap.resetXPath().
  5. Use VTD+XML indexing - Instead of parsing XML files at the time of the processing request, you can pre-index your XML into VTD+XML format and dump the indexes on disk. When the processing request commences, simply load the VTD+XML index into memory and voila, parsing is no longer needed! Read this article for a detailed description of this feature.
  6. The overwrite feature, aka data templating - Because VTD-XML retains XML in memory as is, you can actually create a template XML file (pre-indexed in VTD+XML) whose value fields are left blank and let your app fill in the blanks, thus creating XML data that never needs to be parsed.

Options 1, 2, 3 and 4 usually improve performance incrementally. Options 5 and 6 enable a paradigm shift by fundamentally changing the way XML data are generated and consumed, giving you potentially vast performance improvements over existing processing frameworks and methodologies. For one thing, you can easily figure out that the result of XPath evaluation can also be persisted along with the VTD index to bypass XPath evaluation altogether. There are just so many ways to improve your apps that I will leave this to your imagination.
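The "compile once, evaluate many times" idea behind option 4 is not specific to VTD-XML. As a rough sketch, the snippet below swaps in the JDK's own javax.xml.xpath API rather than VTD-XML's AutoPilot, purely so that it runs without any extra jar: the expression is compiled a single time outside the loop and then evaluated against several small documents. The class name XPathReuseSketch and the sample documents are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustration with the JDK's XPath API, not VTD-XML's AutoPilot.
class XPathReuseSketch {
    // Counts <name> elements in each document, compiling the XPath only once.
    static int[] countNames(String[] docs) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Compile outside the loop: this is the step worth reusing.
            XPathExpression expr = xpath.compile("count(//name)");
            int[] counts = new int[docs.length];
            for (int i = 0; i < docs.length; i++) {
                InputSource src = new InputSource(new StringReader(docs[i]));
                Double n = (Double) expr.evaluate(src, XPathConstants.NUMBER);
                counts[i] = n.intValue();
            }
            return counts;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] docs = {
            "<root><name>a</name></root>",
            "<root><name>a</name><x><name>b</name></x></root>",
            "<root/>"
        };
        for (int c : countNames(docs)) {
            System.out.println(c);
        }
    }
}
```

In VTD-XML terms, the compile step corresponds to selectXPath() and the per-document reuse to resetXPath(), as described in option 4; the exact calls and their ordering should be taken from the VTD-XML documentation rather than from this sketch.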

VTD-XML Repository Available on GitHub

The full VTD-XML source repository is now available on GitHub (http://github.com/jzhang2004/vtd-xml). Every commit log is available, and in the near future all commits and check-ins should be in sync with CVS on SourceForge.

An interesting paper on vtd-xml performance vs other XML parsers

I recently came across an interesting paper by some researchers in Portugal. The topic of the paper is “PERFORMANCE ANALYSIS OF JAVA APIS FOR XML PROCESSING.” In this paper, various XML processing APIs are thoroughly benchmarked and compared. Those APIs include various flavors of DOM, SAX, PULL, JDOM and VTD-XML. Below is the abstract of the paper.

ABSTRACT

Over time, XML markup language has acquired a considerable importance in applications development, standards definition and in the representation of large volumes of data, such as databases. Today, processing XML documents in a short period of time is a critical activity in a large range of applications, which imposes choosing the most appropriate mechanism to parse XML documents quickly and efficiently. When using a programming language for XML processing, such as Java, it becomes necessary to use effective mechanisms, e.g. APIs, which allow reading and processing of large documents in appropriated manners. This paper presents a performance study of the main existing Java APIs that deal with XML documents, in order to identify the most suitable one for processing large XML files.

The full paper can be downloaded here.

What is New in 2.11

Version 2.11, simultaneously available in C, Java, C++, and C#, is the latest release of VTD-XML. So what is new? The short answer: (1) It is more standards-compliant, conforming strictly to the XPath 1.0 spec’s notion of node(). (2) It introduces a major performance improvement for XPath expressions involving simple position indexes. (3) It introduces a major performance improvement for XPath expressions containing complex predicates that involve absolute location path expressions. (4) It also contains various bug fixes for issues reported by VTD-XML users.

Change to Node() Interpretation

Before 2.11, node() in a location step of an XPath expression was interpreted as equivalent to *, i.e., an element node with any name. With 2.11, the same node() is interpreted as any one of element(), text(), comment(), or processing-instruction(), as defined by the XPath 1.0 spec.

Performance Improvement for Simple Position Index

A quick example is “a[2]/b[1].” A simple position index is basically a constant index value in a predicate. 2.11’s XPath engine is now smart enough to detect this use case and exit the execution loop early, resulting in faster execution. The amount of improvement depends on how frequently simple indexes are used in the location steps. In some cases, a 50% to 70% execution speedup is possible.

Performance Improvement for Predicates Containing Absolute Path Expressions

A quick example is //a[//abc/@val='1']. Notice that the predicate contains //abc, which is an absolute path expression. Before 2.11, this expression triggered repeated evaluation of //abc to determine whether the predicate is true or false, and the processing cost increased rapidly with the size of the document. This release caches the evaluation result so that the corresponding XPath is evaluated only once. Please note that this feature is enabled by default, but you can turn it off (we don’t recommend it) by invoking AutoPilot’s enableCaching() method with false.

How much of an improvement can you expect to see? It depends on the size of the documents, the complexity of the predicates, and other factors. Sometimes you will achieve astonishing results. Consider the following expression.

//CDResults[../../../TargetName/@Value=”//SiteInformation[“TargetName/@Value!=//SiteInformation[1]/TargetName/@Value and TargetName/@Value!=//SiteInformation[TargetName/@Value!=//SiteInformation[1]/TargetName/@Value”][1]/TargetName/@Value][1]/TargetName/@Value]/BottomCD/@Value

Running this expression on a 22MB XML document in Java would take many hours in virtually all XPath implementations, including version 2.10 of VTD-XML. With 2.11, it took less than 5 seconds on a commodity, 3-year-old PC.

Bug fixes

There are other bug fixes, covering XMLModifier’s deletion capabilities and permissiveness of deletion of sub-nodes.