White Space Handling in 2.13

In this post I will highlight VTD-XML v2.13’s whitespace handling capability along with some examples in Java.

Quick Review

Native to non-extractive parsing, VTD-XML’s handling of XML tokens and elements frequently revolves around the concept of byte segments. Once an XML document is parsed into VTD tokens, the byte segment enveloping the entire content of any token or element can be thought of as a pair of descriptors (i.e., offset and length) pointing into the original document.
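
To make the descriptor idea concrete, here is a minimal Java sketch of an (offset, length) pair packed into a single 64-bit value. It assumes the offset sits in the lower 32 bits and the length in the upper 32 bits, which is how the result of getElementFragment() is usually unpacked. The class and method names (SegmentDescriptorSketch, pack, offsetOf, lengthOf, slice) are hypothetical and not part of VTD-XML.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not VTD-XML code.
final class SegmentDescriptorSketch {
    // Assumed layout: length in the upper 32 bits, offset in the lower 32 bits.
    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    static int offsetOf(long descriptor) {
        return (int) descriptor;
    }

    static int lengthOf(long descriptor) {
        return (int) (descriptor >>> 32);
    }

    // Copies the bytes the descriptor points at; the document itself is untouched.
    static String slice(byte[] doc, long descriptor) {
        return new String(doc, offsetOf(descriptor), lengthOf(descriptor),
                StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] doc = "<root><name>suresh</name></root>"
                .getBytes(StandardCharsets.US_ASCII);
        long name = pack(6, 19); // the name element starts at byte 6 and spans 19 bytes
        System.out.println(slice(doc, name));
    }
}
```

The point of the sketch is that the descriptor carries no content of its own: it only says where in the original bytes the element lives.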

For a large class of XML content extraction and modification operations, non-extractive parsing allows applications to circumvent the tedious, cycle-wasting tasks of de-serializing and re-serializing the byte content of elements, and thereby helps achieve the maximum possible performance.

Whitespace Handling in 2.12

Version 2.12 of VTD-XML introduces two new methods that help either trim or expand the surrounding white spaces of byte
segments denoted by 64-bit integers.

  • trimWhiteSpaces(long l), of VTDNav class, accepts a byte segment descriptor, removes both the leading and trailing white spaces, and returns a new descriptor.
  • expandWhiteSpaces(long l), of the same VTDNav class, takes a segment descriptor and returns a new descriptor that includes all the leading and trailing white spaces around the input segment.

It is worth noting that both methods are greedy: they remove or include as many white spaces as they possibly can. Furthermore, the effect of one call often negates the other.
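
The greedy behavior can be illustrated with a small stand-alone sketch. It is not VTD-XML's implementation: the class GreedyWhitespaceSketch and its methods are hypothetical, the descriptor layout (offset in the lower 32 bits, length in the upper 32 bits) is an assumption, and only space, tab, line feed, and carriage return are treated as white space.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not VTD-XML code.
final class GreedyWhitespaceSketch {
    static boolean isWs(byte b) {
        return b == ' ' || b == '\t' || b == '\n' || b == '\r';
    }

    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    // Shrinks the segment until neither end is white space.
    static long trim(byte[] doc, long d) {
        int start = (int) d;
        int end = start + (int) (d >>> 32); // exclusive
        while (start < end && isWs(doc[start])) start++;
        while (end > start && isWs(doc[end - 1])) end--;
        return pack(start, end - start);
    }

    // Grows the segment over every adjacent white space byte on both sides.
    static long expand(byte[] doc, long d) {
        int start = (int) d;
        int end = start + (int) (d >>> 32); // exclusive
        while (start > 0 && isWs(doc[start - 1])) start--;
        while (end < doc.length && isWs(doc[end])) end++;
        return pack(start, end - start);
    }

    public static void main(String[] args) {
        byte[] doc = "<a>\n  <b/>\n</a>".getBytes(StandardCharsets.US_ASCII);
        long b = pack(6, 4);            // the <b/> element
        long wide = expand(doc, b);     // also covers "\n  " before and "\n" after
        long back = trim(doc, wide);    // shrinks back to <b/>
        System.out.println(back == b);
    }
}
```

Here expanding and then trimming returns the original segment, which is the sense in which one call negates the other.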

Additions in 2.13

Three static constants and three more methods are added to VTDNav class in 2.13.

Those constants are:

  1. VTDNav.WS_LEADING
  2. VTDNav.WS_TRAILING
  3. VTDNav.WS_BOTH

The three new methods are:

  • trimWhiteSpaces(long l, short actionType) still trims the white spaces around the segment and returns a new segment descriptor. But the trimming can now be applied to the leading white spaces, the trailing ones, or both, depending on the value of actionType.
  • trimWhiteSpaces(int index, short actionType) brings you the power and convenience of trimming a VTD record without the hassle of manually composing a 64-bit segment descriptor.
  • expandWhiteSpaces(long l, short actionType) still expands the segment to include surrounding white spaces. But the expansion can now be limited to the leading white spaces, the trailing ones, or both.
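
The effect of choosing a direction can be sketched without the library. In the example below, the enum Side is a hypothetical stand-in for the three VTDNav constants, and removeWithWhitespace is an illustrative helper that works on a String rather than on byte segments; neither is part of VTD-XML.

```java
// Illustrative sketch only; not VTD-XML code.
final class DirectionalExpandSketch {
    // Hypothetical stand-ins for VTDNav.WS_LEADING, WS_TRAILING, and WS_BOTH.
    enum Side { LEADING, TRAILING, BOTH }

    // Removes the first occurrence of fragment, plus the adjacent white space
    // on the chosen side(s). Returns xml unchanged if fragment is absent.
    static String removeWithWhitespace(String xml, String fragment, Side side) {
        int start = xml.indexOf(fragment);
        if (start < 0) return xml;
        int end = start + fragment.length(); // exclusive
        if (side != Side.TRAILING) {
            while (start > 0 && Character.isWhitespace(xml.charAt(start - 1))) start--;
        }
        if (side != Side.LEADING) {
            while (end < xml.length() && Character.isWhitespace(xml.charAt(end))) end++;
        }
        return xml.substring(0, start) + xml.substring(end);
    }

    public static void main(String[] args) {
        String xml = "<r>\n\t<a/>\n\t<b/>\n</r>";
        // BOTH swallows the indentation on each side, so <b/> ends up right after <r>.
        System.out.println(removeWithWhitespace(xml, "<a/>", Side.BOTH));
        // TRAILING keeps the line break and tab that preceded <a/>.
        System.out.println(removeWithWhitespace(xml, "<a/>", Side.TRAILING));
    }
}
```

Restricting the expansion to one side is what lets the surviving siblings keep their original line break and indentation.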

Common Use Cases

Suppose you want to remove some element fragments from the master XML document while the remaining XML text retains its original formatting, or you want to make slight, fine-grained changes to that formatting (e.g., paragraph separation, indentation). You can also extract a segment of XML bytes without losing its surrounding formatting, such as line breaks or tabs.

An Example

Consider the following document, taken from a Q/A thread on the StackOverflow web site
(http://stackoverflow.com/questions/36972163/vtd-xmlremoving-the-spaces-after-removing-the-element):

<root>
<name>suresh</name>
<address>Address</address>
</root>

With 2.12, if you remove the “<name>suresh</name>” fragment using the following code, which expands the fragment to include the white space on both sides, you end up with an XML document in which the line break after <root> is lost. The code is followed by its output.


import com.ximpleware.*;
import java.io.*;

public class testExpandSpace {
    public static void main(String[] args) throws VTDException, IOException {
        VTDGen vg = new VTDGen();
        AutoPilot ap = new AutoPilot();
        XMLModifier xm = new XMLModifier();
        // Parse without namespace awareness; stop if parsing fails.
        if (!vg.parseFile("d://xml//testSuresh.xml", false))
            return;
        VTDNav vn = vg.getNav();
        ap.bind(vn);
        xm.bind(vn);
        ap.selectXPath("//name");
        int index = -1;
        while ((index = ap.evalXPath()) != -1) {
            System.out.println(" ===> " + vn.toString(index) + "===>");
            // 64-bit descriptor (offset and length) of the current element.
            long elementFragment = vn.getElementFragment();
            // Greedily include the white space on both sides, then remove it all.
            xm.remove(vn.expandWhiteSpaces(elementFragment));
        }
        xm.output("d://xml//test1111.xml");
    }
}

<root><address>Address</address>
</root>

With 2.13, you can expand the “name” fragment to include only its trailing white space before removing it, thereby preserving the line break after <root> and the desired output layout.

import com.ximpleware.*;
import java.io.*;

public class testExpandSpace {
    public static void main(String[] args) throws VTDException, IOException {
        VTDGen vg = new VTDGen();
        AutoPilot ap = new AutoPilot();
        XMLModifier xm = new XMLModifier();
        // Parse without namespace awareness; stop if parsing fails.
        if (!vg.parseFile("d://xml//testSuresh.xml", false))
            return;
        VTDNav vn = vg.getNav();
        ap.bind(vn);
        xm.bind(vn);
        ap.selectXPath("//name");
        int index = -1;
        while ((index = ap.evalXPath()) != -1) {
            System.out.println(" ===> " + vn.toString(index) + "===>");
            // 64-bit descriptor (offset and length) of the current element.
            long elementFragment = vn.getElementFragment();
            // Include only the trailing white space, so the line break
            // before the element stays in the document.
            xm.remove(vn.expandWhiteSpaces(elementFragment, VTDNav.WS_TRAILING));
        }
        xm.output("d://xml//test1111.xml");
    }
}

Below is the output in the desired format.


<root>
<address>Address</address>
</root>

3 comments so far

  1. hoodaticus on

    Hi there! Thank you for this amazing parser! I have sadly found what I believe to be a bug. On some files – but not others – I am getting a ParseException at the ‘x’ character in the xmlns declaration of the root node. I would like to debug this and propose a fix if I could. The bug reproduces for me at least as far back as 2.11. Is there any way I could get source to debug with? Thank you for your time and your amazing library!

    • jimmyzhang on

      Thanks for reporting the bug I will look into it…The source code is part of the 2.11 distribution that is available for download on the VTD-XML website. Here is the URL: https://sourceforge.net/projects/vtd-xml/files/vtd-xml/ximpleware_2.11/
      the source is also available on CVS. http://vtd-xml.cvs.sourceforge.net/viewvc/vtd-xml/ximple-dev/
      I am glad you liked VTD-XML… I also think that it is just an imperfect tool that has its own strengths and weaknesses… Ultimately you are the one to determine what tools to use and for what reason.

      • hoodaticus on

        Hi jimmyzhang! Thanks for getting back to me so fast! Another engineer at my firm traced the cause to memory corruption in our own code. There is no bug here in yours. Thank you so much for your time!
