Process Huge XML Documents with Extended VTD-XML

If you have XML files larger than 2 GB and don't want to lose the benefit of full XPath support, you may be surprised at how handy and fast extended VTD-XML can be.

Introduced in version 2.3 as part of the full VTD-XML distribution, extended VTD-XML raises the maximum document size to 256 GB. Reaching that limit requires a 64-bit JVM.
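
To see why documents of this size push past ordinary Java indexing, it helps to do the arithmetic. The sketch below is plain Java with no VTD-XML dependency. The names `SizeLimits` and `bitsNeeded` are made up for illustration, and the code says nothing about how VTD-XML lays out its records internally. It only shows that a byte offset into a 256 GB document needs 38 bits, which no longer fits in a Java `int`.

```java
// Illustrative only: SizeLimits and bitsNeeded are hypothetical names,
// not part of VTD-XML.
class SizeLimits {
    static final long GB = 1L << 30;

    // Number of bits needed to address every byte offset in [0, size).
    static int bitsNeeded(long size) {
        return 64 - Long.numberOfLeadingZeros(size - 1);
    }

    public static void main(String[] args) {
        long twoGb = 2 * GB;            // the size this article starts from
        long extendedLimit = 256 * GB;  // the extended VTD-XML maximum

        System.out.println("bits for 2 GB offsets:   " + bitsNeeded(twoGb));
        System.out.println("bits for 256 GB offsets: " + bitsNeeded(extendedLimit));
        // The largest Java int is 2^31 - 1, so an int can index at most 2 GB.
        System.out.println("256 GB offsets fit in int? "
                + (extendedLimit - 1 <= Integer.MAX_VALUE));
    }
}
```

Offsets of that width have to be carried as `long` values, which is one reason a separate "Huge" API is a natural fit for very large files.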

Extended VTD-XML processes XML either in in-memory mode (as standard VTD-XML does) or in memory-mapped mode, which allows partial document loading.
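
For readers unfamiliar with memory mapping, here is a rough sketch of the general technique using only `java.nio`. It is not VTD-XML's implementation. `SegmentedMappedFile` and its methods are hypothetical names chosen for this example. The idea is that the file is mapped in fixed-size segments and the operating system pages data in on demand, so a byte can be read by `long` offset without loading the whole file onto the heap.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: a hypothetical class, not part of VTD-XML.
class SegmentedMappedFile {
    private final MappedByteBuffer[] segments;
    private final long segmentSize;
    private final long length;

    SegmentedMappedFile(Path file, int segmentSize) throws IOException {
        this.segmentSize = segmentSize;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            length = ch.size();
            int count = (int) ((length + segmentSize - 1) / segmentSize);
            segments = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long start = (long) i * segmentSize;
                long size = Math.min(segmentSize, length - start);
                // A mapping remains valid after the channel is closed.
                segments[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, size);
            }
        }
    }

    long length() {
        return length;
    }

    // Absolute read: pick the segment, then the position inside it.
    byte byteAt(long offset) {
        if (offset < 0 || offset >= length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        return segments[(int) (offset / segmentSize)].get((int) (offset % segmentSize));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".xml");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, "<root>hello</root>".getBytes(StandardCharsets.US_ASCII));

        // Tiny 4-byte segments so the demo crosses several segment boundaries.
        SegmentedMappedFile f = new SegmentedMappedFile(tmp, 4);
        System.out.println("length = " + f.length());
        System.out.println("byte at offset 6 = " + (char) f.byteAt(6));
    }
}
```

A real reader would use much larger segments (each `MappedByteBuffer` can cover up to 2 GB); the small size here only makes the boundary handling visible.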

The code examples below show how to use extended VTD-XML to process an XML document in memory-mapped mode and in in-memory mode:

Memory Mapped Mode

import com.ximpleware.extended.*;

public class mem_mapped_read {
    // Longer version: memory-map the file into a buffer, then parse it.
    public static void first_read() throws Exception {
        XMLMemMappedBuffer xb = new XMLMemMappedBuffer();
        VTDGenHuge vg = new VTDGenHuge();
        xb.readFile("test.xml");
        vg.setDoc(xb);
        vg.parse(true); // true enables namespace awareness
        VTDNavHuge vn = vg.getNav();
        int t = vn.getText(); // -1 if the current element has no text node
        if (t != -1) {
            System.out.println("text data ===>" + vn.toString(t));
        }
    }

    // Shorter version: parseFile loads and parses in a single call.
    public static void second_read() throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        if (vg.parseFile("test.xml", true, VTDGenHuge.MEM_MAPPED)) {
            VTDNavHuge vn = vg.getNav();
            int t = vn.getText(); // -1 if the current element has no text node
            if (t != -1) {
                System.out.println("text data ===>" + vn.toString(t));
            }
        }
    }

    public static void main(String[] s) throws Exception {
        first_read();
        second_read();
    }
}

In Memory Mode

/**
 * Demonstrates how to use the extended VTD parser
 * to process a large XML file.
 */
import com.ximpleware.extended.*;

public class in_mem_read {
    // Longer version: read the whole file into an in-memory buffer, then parse it.
    public static void first_read() throws Exception {
        XMLBuffer xb = new XMLBuffer();
        VTDGenHuge vg = new VTDGenHuge();
        xb.readFile("test.xml");
        vg.setDoc(xb);
        vg.parse(true); // true enables namespace awareness
        VTDNavHuge vn = vg.getNav();
        int t = vn.getText(); // -1 if the current element has no text node
        if (t != -1) {
            System.out.println("text data ===>" + vn.toString(t));
        }
    }

    // Shorter version: parseFile loads and parses in a single call.
    public static void second_read() throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        if (vg.parseFile("test.xml", true, VTDGenHuge.IN_MEMORY)) {
            VTDNavHuge vn = vg.getNav();
            int t = vn.getText(); // -1 if the current element has no text node
            if (t != -1) {
                System.out.println("text data ===>" + vn.toString(t));
            }
        }
    }

    public static void main(String[] s) throws Exception {
        first_read();
        second_read();
    }
}

22 comments so far

  1. synhershko on

    This doesn't seem to be available for C#, or is it? It's quite a bummer not having the Huge versions available in .NET…

    • jimmyzhang on

      C# doesn't yet support memory mapping. We actually have an in-house implementation of VTDGen that supports 256 GB; do you want to give it a try? What is the maximum file size in your environment?

  2. Vizz on

    I am trying to parse 7 GB of PubMed data and get the parsing exception below even when using the extended version. I am also unable to create the MEM_MAPPED instance with VTDGenHuge. I am using a 64-bit JVM. My parsing code is similar to the BioInfo Java demo from the official website.

    Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -2
    at com.ximpleware.extended.XMLBuffer.byteAt(XMLBuffer.java:100)
    at com.ximpleware.extended.VTDNavHuge.getChar(VTDNavHuge.java:556)
    at com.ximpleware.extended.VTDNavHuge.compareRawTokenString(VTDNavHuge.java:1570)
    at com.ximpleware.extended.VTDNavHuge.matchRawTokenString(VTDNavHuge.java:1641)
    at com.ximpleware.extended.VTDNavHuge.matchElement(VTDNavHuge.java:1423)
    at com.ximpleware.extended.VTDNavHuge.toElement(VTDNavHuge.java:2984)

    Thanks in advance

  3. Vijaya on

    Hi Jimmy Zhang

    I get the same parsing exception even after working with the CVS code.

    • jimmyzhang on

      Can you send me a sample file that can be used to duplicate the issue?

  4. Vizz on

    Hi Jimmy Zhang
    Thanks for the quick response.

    I am not able to send you the 7 GB file. Here are the few places where parsing breaks down.
    1. I am able to navigate with vn.toElement(int) and count the number of files. However, converting the token character data with vn.toString(vn.getText()) throws an exception. (I convert to String only when vn.getText() doesn't return -1.)

    2. Calling vn.toString(vn.getCurrentIndex()) throws an exception even when I exclude the -1 case.

    3. Calling vn.toElement(int, String) directly throws an exception.

    The file I am using is passed as an argument at runtime. It has already been tested with a SAX parser without any exceptions.

    I really appreciate your time.
    Thank you
    Vizz / Vijaya

  5. jimmyzhang on

    You don't have to send me the 7 GB file; just send me something that I can use to duplicate the issue. Otherwise it would be difficult for me to help. I can step you through the process of fixing the bug yourself, if that helps.

  6. Vizz on

    Could you please send me your mail ID?

    Thank you
    Vizz

  7. Vizz on

    Here is the sample version which i am parsing

    http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helppubmed&part=publisherhelp#publisherhelp.XML_Tag_Descriptions

    I need to pull the “ISSN” and “abstract” from that xml file

    • jimmyzhang on

      I did play with a couple of files from the links that you provided and could not duplicate the issues. I tried things far more complex, like XPath evaluation and string conversions of many kinds.

      Can you provide a complete test case with code along with sample xml?

      • Vizz on

        Could you please check your mail for details

  8. jimmyzhang on

    I sent you the CVS link from which you can retrieve the latest version/update. You might need to recompile.

    • Vizz on

      Thanks

    • Vizz on

      Amazing performance . Parsing performance 650M/s. You never get disappointed with this VTD-XML.

      Thanks Mr.Jimmy Zhang for this excellent software.

  9. Finpush on

    Does the parser support streaming mode? By streaming mode, I mean a method where the document arrives in chunks and the parsing begins before all of the document is received.

    • jimmyzhang on

      It turns out that streaming mode is a *bad* idea in general… you are better off using SAX for streaming.

  10. Sekhar on

    Hi,

    I am very interested in digging deeper into this API. I have a complex XML structure, and my job is to pick out the repeated XML tags (say 0 to 100 and so on), process each one by reading its value and computing some new values, and then add new tags or replace existing values. Finally, I have to create a new XML file.

    Please provide a sample Java implementation for this kind of requirement (getting the values of all XML elements, adding new tags, and updating existing element data).

  11. Andrew on

    I am also having an issue.
    java.lang.IndexOutOfBoundsException
    at com.ximpleware.extended.FastLongBuffer.longAt(FastLongBuffer.java:303)
    at com.ximpleware.extended.VTDNavHuge.getTokenType(VTDNavHuge.java:1104)
    at com.ximpleware.extended.VTDNavHuge.toString(VTDNavHuge.java:3356)

    when trying to do this
    String temp=vn.toString(vn.getText());

    xb.readFile(xmlPath);
    Could this be because it is not supported in Java 1.5? I am also getting a "no source attachment" message.

    • Andrew on

      Also note that VTDGen has no source attached either. Please help.

  12. Chanchal Raj on

    I need to parse the dblp.xml file, which is 1.04 GB, extract the article data, and put it in a database. But I get this error with the memory-mapped code above:

    Exception in thread "main" com.ximpleware.extended.EntityExceptionHuge: Errors in Entity: Illegal entity char
    at com.ximpleware.extended.VTDGenHuge.entityIdentifier(VTDGenHuge.java:1021)
    at com.ximpleware.extended.VTDGenHuge.parse(VTDGenHuge.java:1821)
    at mem_maped_read.first_read(mem_maped_read.java:17)
    at mem_maped_read.main(mem_maped_read.java:47)
    Java Result: 1

    Line No. 17 is vg.parse (true);
    Line No. 47 is first_read();

    Please help me.

    • jimmyzhang on

      It looks like your XML is not well-formed… try parsing it with the regular version of VTD-XML and see whether you get the same error.

  13. Chanchal Raj on

    Do you have a good example of regular VTD-XML usage? Could you send it, please? My email address is rajchanchalkhatri@yahoo.com

