Whitespace Trimming in 2.12

This post shows an example of using the latest methods in vtd-xml 2.12 to remove the leading and trailing white space of the text nodes in an XML document.

 

This is the input document. As you can see, the text nodes of the ID elements have long runs of trailing white space, and the text node of the Name element has both leading and trailing white space.

<?xml version="1.0"?>
<ns:myOrder xmlns:ns="http://w3schools.com/BusinessDocument" xmlns:ct="http://something.com/CommonTypes">
    <MessageHeader>
        <ct:ID>i7         </ct:ID>
        <ct:ID>i7         </ct:ID>
        <ct:ID>i7         </ct:ID>
        <ct:ID>i7         </ct:ID>
        <ct:Name> Company Name    </ct:Name>
    </MessageHeader>
</ns:myOrder>

This is the output XML file, with all leading and trailing white space removed from the text nodes.

<?xml version="1.0"?>
<ns:myOrder xmlns:ns="http://w3schools.com/BusinessDocument" xmlns:ct="http://something.com/CommonTypes">
    <MessageHeader>
        <ct:ID>i7</ct:ID>
        <ct:ID>i7</ct:ID>
        <ct:ID>i7</ct:ID>
        <ct:ID>i7</ct:ID>
        <ct:Name>Company Name</ct:Name>
    </MessageHeader>
</ns:myOrder>

Below is the Java code that removes the white space.


import com.ximpleware.*;
public class removeWS {

       public static void main(String[] s) throws VTDException, Exception{
             VTDGen vg = new VTDGen();
             AutoPilot ap = new AutoPilot();
             XMLModifier xm = new XMLModifier();
             if (vg.parseFile("d:\\xml2\\ws.xml", true)){
                    VTDNav vn = vg.getNav();
                    ap.bind(vn);
                    xm.bind(vn);
                    ap.selectXPath("//text()");
                    int i=-1;
                    while((i=ap.evalXPath())!=-1){
                        int offset = vn.getTokenOffset(i);
                        int len = vn.getTokenLength(i);

                        long l = vn.trimWhiteSpaces((((long)len)<<32)|offset );
                        System.out.println(" ===> "+vn.toString(i));
                        System.out.println("len ==>"+len+" new len==>"+ (l>>32));
                        int nlen = (int)(l>>32);
                        int nos= (int) l;
                        xm.updateToken(i,vn,nos,nlen);
                    }
                    xm.output("d:\\xml2\\new.xml");
          }
      }
}
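The code above passes the token's offset and length to trimWhiteSpaces() as a single long (length in the upper 32 bits, offset in the lower 32 bits) and unpacks the result the same way. The following is a minimal, library-free sketch of that packing convention. The class PackedSpan and its trim() helper are hypothetical names made up for illustration; they are not part of VTD-XML and say nothing about how the library implements trimming internally. The sketch also assumes a single-byte encoding so that offsets can be treated as byte positions.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: mimics the (length << 32) | offset convention used in the post.
final class PackedSpan {
    // Pack an offset (lower 32 bits) and a length (upper 32 bits) into one long.
    static long pack(int offset, int length) {
        return (((long) length) << 32) | (offset & 0xFFFFFFFFL);
    }

    static int offset(long span) { return (int) span; }

    static int length(long span) { return (int) (span >> 32); }

    // Hypothetical helper: shrink a span so that it excludes leading and trailing white space.
    static long trim(byte[] doc, long span) {
        int start = offset(span);
        int end = start + length(span); // exclusive
        while (start < end && isXmlSpace(doc[start])) start++;
        while (end > start && isXmlSpace(doc[end - 1])) end--;
        return pack(start, end - start);
    }

    // The four white space characters allowed by XML 1.0.
    static boolean isXmlSpace(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    public static void main(String[] args) {
        byte[] doc = "<ct:Name> Company Name    </ct:Name>".getBytes(StandardCharsets.US_ASCII);
        long raw = pack(9, 17); // the text node starts right after the 9-byte start tag
        long trimmed = trim(doc, raw);
        System.out.println("offset " + offset(raw) + " -> " + offset(trimmed));
        System.out.println("length " + length(raw) + " -> " + length(trimmed));
    }
}
```

In the real program, the new offset and length obtained this way are what get handed to xm.updateToken().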

VTD-XML 2.12 Released

VTD-XML 2.12 is released. To download the latest version, go to

https://sourceforge.net/projects/vtd-xml/files/vtd-xml/ximpleware_2.12/

The many ways that vtd-xml can help you optimize the performance of your applications

I was asked on Stack Overflow what options VTD-XML offers for improving XML processing performance. Below is my answer, which I think is worth sharing with readers of this blog.

There are usually the following ways to optimize performance with VTD-XML:

  1. White space option- You can ask VTDGen to ignore or retain trivial white space characters. By default, VTDGen throws away those trivial white spaces. The difference is mainly in memory usage.
  2. Buffer reuse- You can ask VTDGen to reuse VTD buffers for the next parsing task. Otherwise, by default, VTDGen will allocate a new buffer for each parsing run. This optimization technique is most useful if you are processing similarly sized XML files, so that the VTD buffer page size remains unchanged across consecutive parsing runs.
  3. Adjust LC level- By default, it is 3, but you can set it to 5. When your XML is deeply nested, setting the LC level to 5 results in better XPath performance, at the cost of a very slight increase in memory usage and parsing time.
  4. Reuse XPath- Compiling/selecting XPath is a relatively slow operation, especially when you run an XPath expression over many small files. The key is to take any AutoPilot.selectXPath() call out of loops and reuse the compiled expression by calling ap.resetXPath().
  5. Use VTD+XML indexing- Instead of parsing XML files at the time of processing a request, you can pre-index your XML into the VTD+XML format and dump it on disk. When the processing request commences, simply load the VTD+XML index into memory and, voila, parsing is no longer needed!
  6. The overwrite feature, aka data templating- Because VTD-XML retains XML in memory as is, you can actually create a template XML file (pre-indexed in VTD+XML) whose value fields are left blank and let your app fill in the blanks, thus creating XML data that never needs to be parsed.

Options 1, 2, 3, and 4 usually improve performance incrementally. Options 5 and 6 enable a paradigm shift by fundamentally changing the way XML data is generated and consumed, giving you potentially vast performance improvements over existing processing frameworks and methodologies. For example, the result of an XPath evaluation can also be persisted along with the VTD index so that the XPath evaluation itself is bypassed. There are so many ways to improve your apps that I will leave the rest to your imagination.
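The "compile once, evaluate many times" idea behind item 4 can be sketched as follows. To keep the example runnable without extra jars, it uses the JDK's own javax.xml.xpath API as a stand-in rather than VTD-XML; the class name CompileOnceDemo and the sample documents are made up for illustration. With VTD-XML the analogous pattern is the one described in item 4: call selectXPath() once outside the loop and resetXPath() between evaluations.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Stand-in sketch using the JDK's XPath API (not VTD-XML).
final class CompileOnceDemo {
    // Compiles the expression a single time, then evaluates it against every document.
    static int[] countPerDocument(String xpath, String[] docs) {
        try {
            XPathExpression expr =
                XPathFactory.newInstance().newXPath().compile("count(" + xpath + ")");
            int[] counts = new int[docs.length];
            for (int i = 0; i < docs.length; i++) {
                // Only evaluation happens inside the loop; compilation stays outside.
                Double n = (Double) expr.evaluate(
                    new InputSource(new StringReader(docs[i])), XPathConstants.NUMBER);
                counts[i] = n.intValue();
            }
            return counts;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] docs = { "<h><ID>i7</ID><ID>i7</ID></h>", "<h><ID>i7</ID></h>", "<h/>" };
        for (int c : countPerDocument("//ID", docs)) {
            System.out.println(c);
        }
    }
}
```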

VTD-XML Repository Available on GitHub

The full VTD-XML source repository is now available on GitHub (http://github.com/jzhang2004/vtd-xml). Every commit log is available, and in the near future all commits and check-ins should be in sync with CVS on SourceForge.

An interesting paper on vtd-xml performance vs other XML parsers

I recently came across an interesting paper by some researchers in Portugal. The topic of the paper is “PERFORMANCE ANALYSIS OF JAVA APIS FOR XML PROCESSING.” In this paper, various XML processing APIs are thoroughly benchmarked and compared. Those APIs include various flavors of DOM, SAX, PULL, JDOM and VTD-XML. Below is the abstract of the paper.

ABSTRACT

Over time, XML markup language has acquired a considerable importance in applications development, standards definition and in the representation of large volumes of data, such as databases. Today, processing XML documents in a short period of time is a critical activity in a large range of applications, which imposes choosing the most appropriate mechanism to parse XML documents quickly and efficiently. When using a programming language for XML processing, such as Java, it becomes necessary to use effective mechanisms, e.g. APIs, which allow reading and processing of large documents in appropriated manners. This paper presents a performance study of the main existing Java APIs that deal with XML documents, in order to identify the most suitable one for processing large XML files.

The full paper can be downloaded here.

What is New in 2.11

Version 2.11, simultaneously available in C, Java, C++, and C#, is the latest release of VTD-XML. So what is new? The short answer: (1) It is more standards-compliant, conforming strictly to the XPath 1.0 spec’s notion of node(). (2) It introduces a major performance improvement for XPath expressions involving a simple position index. (3) It introduces a major performance improvement for XPath expressions containing complex predicates that involve absolute location path expressions. (4) It also contains various bug fixes for issues reported by VTD-XML users.

Change to Node() Interpretation

Before 2.11, node() in a location step of an XPath expression was interpreted as equivalent to *, i.e., an element node with any name. With 2.11, the same node() matches a node of any of the following types: element, text, comment, or processing instruction, as defined by the XPath 1.0 spec.
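To make the difference concrete, here is a small sketch that counts the children matched by * and by node() on the same element. It uses the JDK's built-in XPath 1.0 engine (javax.xml.xpath), not VTD-XML, purely to show the standard semantics that 2.11 now follows; the class name NodeTestDemo and the sample document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Shows XPath 1.0 semantics of * versus node() with the JDK's XPath engine.
final class NodeTestDemo {
    // Returns the number of nodes selected by the given location path.
    static int count(String xml, String path) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            Double n = (Double) xp.evaluate("count(" + path + ")",
                new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Children of <r>: element a, a text node, a comment, a processing instruction, element b.
        String xml = "<r><a/>text<!--c--><?pi data?><b/></r>";
        System.out.println("/r/*      selects " + count(xml, "/r/*") + " nodes");
        System.out.println("/r/node() selects " + count(xml, "/r/node()") + " nodes");
    }
}
```

Here * selects only the two element children, while node() also selects the text, comment, and processing-instruction children.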

Performance Improvement for Simple Position Index

A quick example is “a[2]/b[1].” A simple position index is basically a constant index value in a predicate. 2.11’s XPath engine is now smart enough to detect this use case and escape early from the execution loop, resulting in faster execution. The amount of improvement depends on how frequently a simple index is used in each location step. In some cases, a 50% to 70% execution speedup is possible.
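The early-escape idea can be sketched without any XML library. The snippet below is an illustration only, not VTD-XML's actual implementation: it models the children of a node as a list of element names and looks for the n-th child with a given name, stopping as soon as it is found. The class PositionIndexDemo and its fields are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustration of early exit for a constant position predicate such as a[2].
final class PositionIndexDemo {
    // Number of children inspected by the most recent call to nthChild().
    static int visited;

    // Returns the index of the n-th (1-based) child named 'name', or -1 if there is none.
    static int nthChild(List<String> children, String name, int n) {
        visited = 0;
        int seen = 0;
        for (int i = 0; i < children.size(); i++) {
            visited++;
            if (children.get(i).equals(name) && ++seen == n) {
                return i; // early exit: the remaining siblings cannot change the answer
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("a", "b", "a", "c", "a", "a");
        int index = nthChild(children, "a", 2);
        System.out.println("a[2] is child #" + index + ", found after visiting "
            + visited + " of " + children.size() + " children");
    }
}
```

Without the early return, every sibling would be visited even though the answer is already known after the second match.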

Performance Improvement for Predicates Containing Absolute Path Expressions

A quick example is //a[//abc/@val='1']. Notice that the predicate contains //abc, which is an absolute path expression. Before 2.11, this expression triggered repeated evaluation of //abc to determine whether the predicate is true or false, so the processing cost increased rapidly with the size of the document. This release caches the evaluation result so that the corresponding XPath is evaluated only once. Please note that this feature is enabled by default; you can turn it off (we don’t recommend it) by invoking AutoPilot’s enableCaching() method and passing it “false.”
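Why caching helps can be shown with a toy model. Because an absolute path such as //abc/@val does not depend on the context node, its result is the same for every candidate that the predicate is tested against. The sketch below is an illustration only and not VTD-XML's implementation: CachedPredicateDemo, evalAbsolutePath() and the counter are hypothetical stand-ins for the engine's internals.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of evaluating a predicate that contains an absolute path expression.
final class CachedPredicateDemo {
    // Counts how many times the "absolute path" was evaluated.
    static int absoluteEvaluations = 0;

    // Stand-in for evaluating an absolute path such as //abc/@val over the whole document.
    static Set<String> evalAbsolutePath(List<String> docValues) {
        absoluteEvaluations++;
        return new HashSet<>(docValues);
    }

    // Re-evaluates the absolute path for every candidate node (pre-2.11 behavior in this model).
    static int countMatchesUncached(int candidates, List<String> docValues, String wanted) {
        int matches = 0;
        for (int i = 0; i < candidates; i++) {
            if (evalAbsolutePath(docValues).contains(wanted)) matches++;
        }
        return matches;
    }

    // Evaluates the absolute path once and reuses the result for every candidate.
    static int countMatchesCached(int candidates, List<String> docValues, String wanted) {
        Set<String> cached = null;
        int matches = 0;
        for (int i = 0; i < candidates; i++) {
            if (cached == null) cached = evalAbsolutePath(docValues);
            if (cached.contains(wanted)) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("0", "1", "2");
        countMatchesUncached(1000, values, "1");
        System.out.println("uncached evaluations: " + absoluteEvaluations);
        absoluteEvaluations = 0;
        countMatchesCached(1000, values, "1");
        System.out.println("cached evaluations:   " + absoluteEvaluations);
    }
}
```

Both variants return the same matches; only the number of whole-document evaluations differs, which is where the savings come from on large documents.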

How much of an improvement can you expect to see? It depends on the size of the documents, the complexity of the predicates, and other factors. Sometimes you will achieve astonishing results. Consider the following expression.

//CDResults[../../../TargetName/@Value=”//SiteInformation[“TargetName/@Value!=//SiteInformation[1]/TargetName/@Value and TargetName/@Value!=//SiteInformation[TargetName/@Value!=//SiteInformation[1]/TargetName/@Value”][1]/TargetName/@Value][1]/TargetName/@Value]/BottomCD/@Value

Running this expression on a 22 MB XML document in Java would take many hours in virtually all XPath implementations, including version 2.10 of VTD-XML. With 2.11, it took less than 5 seconds on a commodity, 3-year-old PC.

Bug fixes

There are other bug fixes, covering XMLModifier’s deletion capabilities and permissiveness of deletion of sub-nodes.

VTD-XML 2.11 Released

VTD-XML 2.11 is now released. I will add a post about the major features and improvements in this release soon. The release can be downloaded from

http://sourceforge.net/projects/vtd-xml/files/vtd-xml/ximpleware_2.11/

VTD-XML 2.10 Released

VTD-XML 2.10 is now released under Java, C#, C and C++. It can be downloaded at
https://sourceforge.net/projects/vtd-xml/files/vtd-xml/ximpleware_2.10/. This release includes a number of new features and enhancements.

  • The core API of VTD-XML has been expanded. Users can now perform cut/paste/insert on an empty element.
  • This release also adds support for a deeper location cache for parsing and indexing. This feature is useful for tuning application performance when processing various XML documents.
  • The Java version also adds support for processing zip and gzip files. Direct processing of HTTP-URL-based XML is enhanced.
  • The extended Java version now supports the ISO-8859-10 through ISO-8859-16 encodings.
  • A full-featured C++ port is released.
  • The C version of VTD-XML now makes use of thread-local storage to address the thread-safety issue for multi-threaded applications.

A number of bugs have also been fixed. Special thanks to Jozef Aerts, John Sillers, Chris Tornau and a number of other users for their input and suggestions.

Thread Safety in C Version of VTD-XML

Before 2.10, the C version of vtd-xml made extensive use of global variables for XPath query compilation. A thread safety problem arises when multiple threads of an application perform XPath compilation at the same time. To resolve this issue, VTD-XML 2.10 replaces all global variables used for XPath compilation with thread-local variables: instead of simply declaring a variable, prepend _thread to the declaration. A thread-local variable is just like a global variable, except that each thread has its own copy, visible only within that thread. The macro for “_thread” is defined in “customTypes.h.”

How does the use of thread-local variables impact the overall design of your application? Fortunately, very little change is required. The most significant one is the global exception context declaration. The old one looks something like this:

struct exception_context the_exception_context[1];
int main(){
          exception e;
          Try {  // put the code throwing exceptions here
          } Catch (e) {  // handle exception in here
          }
}

From 2.10 onward, the app will look like this:

_thread struct exception_context the_exception_context[1];
int main(){
          exception e;
          Try {  // put the code throwing exceptions here
          } Catch (e) {  // handle exception in here
          }
}

Location Cache Depth Tuning

Before version 2.10, the location cache depth was fixed at 3. In this version, you can choose either 3 or 5 by simply calling VTDGen’s selectLcDepth() (see the example below). The benefit is that, at the cost of negligible parsing and memory overhead, the random access performance of VTDNav improves, especially for deeply nested XML documents.


   VTDGen vg = new VTDGen();

   vg.selectLcDepth(5);

Insert Text into Empty Element

In this release,  you now have the ability to insert text into an empty element  (e.g. <a/>).

  • Insert “some text” into <a/>, you get “<a>some text</a>”.
  • Insert <b/> into <a/>, you get <a><b/></a>. 

VTD-XML in C++

Below is a simple app written in VTD-XML and C++.

#include "everything.h"
//#include "bookMark.h"

using namespace com_ximpleware;

int main(){
 FILE *f = NULL;
 FILE *fo = NULL;
 int i = 0;

 Long l = 0;
 int len = 0;
 int offset = 0;

 char* filename = "c:/xml/soap2.xml";
 struct stat s;
 UByte *xml = NULL; // this is the buffer containing the XML content, UByte means unsigned byte
 //VTDGen *vg = NULL; // This is the VTDGen that parses XML
 VTDNav *vn = NULL; // This is the VTDNav that navigates the VTD records
 AutoPilot *ap = NULL;
 char *sm = "\n================\n";

 // allocate a buffer, then read in the document content
 // (the file name is given by "filename" above)
 f = fopen(filename,"rb"); // binary mode, so the byte count matches st_size
 fo = fopen("c:/xml/out.txt","w");

 stat(filename,&s);

 i = (int) s.st_size; 
 printf("size of the file is %d \n",i);
 xml = new UByte[i];
 fread(xml,sizeof(UByte),i,f);
 VTDGen vg;
 try{
  
  vg.setDoc(xml,i);
  vg.parse(true);
  vn = vg.getNav();
  AutoPilot ap;
  ap.declareXPathNameSpace(L"ns1",L"http://www.w3.org/2003/05/soap-envelope");
  //if (ap.selectXPath(L"/ns1:Envelope/ns1:Header/*[@ns1:mustUnderstand]")){
  if (ap.selectXPath(L"/ns1:Envelope/ns1:Header/*[@ns1:mustUnderstand]")){
  //if (ap.selectXPath(L"/a/b/*")){
   ap.printExprString();
   ap.bind(vn);
   int i=-1;
   while((i=ap.evalXPath())!= -1){
    //printf("\n hi ==> %d \n",i);
    l = vn->getElementFragment();
    offset = (int) l;
    len = (int) (l>>32);
    fwrite((char *)(xml+offset),sizeof(UByte),len,fo);
    fwrite((char *) sm,sizeof(UByte),strlen((char*)sm),fo);
   }
  }
  fclose(f);
  fclose(fo);
  // remember C++ has no automatic garbage collector,
  // so the VTDNav needs to be deallocated manually.
  delete(vn);
 }
 catch (ParseException &e){
  //vg.printLineNumber();
  printf(" error ===> %s \n",e.getMessage());
 }
 catch (...) {
  delete (vn);
 }
 return 0;
} 

How to Read All Attributes of an Element in VTD-XML?

There are two ways to read all the attribute values of an element node.

The first is to use the XPath expression @*, as in the example below:
ap = new AutoPilot(vn);
ap.selectXPath("@*");
int i=-1;
while((i=ap.evalXPath())!=-1){
    // i is the index of the attribute name, i+1 is the index of the attribute value
}

  

The second is lighter weight: use AutoPilot’s selectAttr() and iterateAttr() directly.
ap = new AutoPilot(vn);
ap.selectAttr("*");
int i=-1;
while((i=ap.iterateAttr())!=-1){
    // i is the index of the attribute name, i+1 is the index of the attribute value
}

60x? That sounds just right.

I came across a recent blog post in which the author benchmarks the performance of evaluating XPath using VTD-XML on a 20 MB document and compares it to JAXP. The result is a convincing 60x speedup. Surprised? Don’t be. The fact is that DOM and JAXP just have too many inherent issues (performance, memory usage, etc.). Below is the link to that post:

http://fahdshariff.blogspot.com/2010/08/faster-xpaths-with-vtd-xml.html
