Process Huge XML Documents with Extended VTD-XML

If you have XML files larger than 2 GB and don’t want to give up full XPath support, you may be surprised at how handy and fast extended VTD-XML can be.

New in version 2.3 and part of the full VTD-XML distribution, extended VTD-XML raises the maximum document size to 256 GB. Reaching that limit requires a 64-bit JVM.

Extended VTD-XML loads XML in one of two modes: in-memory mode (as in standard VTD-XML), or memory-mapped mode, which allows partial document loading.
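To give a feel for what memory mapping a file of this size involves, here is a minimal sketch that uses only the standard java.nio API. It is not VTD-XML's implementation: the class name SegmentedMappedFile and its methods are hypothetical and exist only for illustration. One fact it relies on is that a single FileChannel.map call cannot cover more than Integer.MAX_VALUE bytes, so a file beyond 2 GB has to be mapped as several segments and addressed with a long offset.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only (not part of VTD-XML).
final class SegmentedMappedFile {
    private final MappedByteBuffer[] segments;
    private final long segmentSize;
    private final long length;

    SegmentedMappedFile(Path file, long segmentSize) throws IOException {
        if (segmentSize <= 0 || segmentSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("segmentSize must be in 1..Integer.MAX_VALUE");
        }
        this.segmentSize = segmentSize;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            this.length = ch.size();
            // Number of segments needed to cover the file, rounding up.
            int count = (int) ((length + segmentSize - 1) / segmentSize);
            segments = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long start = i * segmentSize;
                long size = Math.min(segmentSize, length - start);
                // The OS pages data in on demand; nothing is read eagerly here.
                segments[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, size);
            }
        }
        // The mappings stay valid after the channel is closed.
    }

    // Returns the byte at an absolute 64-bit offset into the file.
    byte byteAt(long offset) {
        if (offset < 0 || offset >= length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        int segment = (int) (offset / segmentSize);
        int position = (int) (offset % segmentSize);
        return segments[segment].get(position);
    }

    long length() {
        return length;
    }
}
```

A caller would construct it with something like new SegmentedMappedFile(path, 1L << 30) and read bytes through byteAt(long). Because the operating system loads pages only when they are touched, the whole document never has to sit in the Java heap, which is the idea behind partial loading.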

The code examples below show how to use extended VTD-XML to process an XML document in memory-mapped mode and in in-memory mode:

Memory Mapped Mode

import com.ximpleware.extended.*;

public class mem_mapped_read {
    // first_read is the longer way to load the XML file:
    // fill a memory-mapped buffer, then hand it to the parser
    public static void first_read() throws Exception {
        XMLMemMappedBuffer xb = new XMLMemMappedBuffer();
        VTDGenHuge vg = new VTDGenHuge();
        xb.readFile("test.xml");
        vg.setDoc(xb);
        vg.parse(true); // true turns on namespace awareness
        VTDNavHuge vn = vg.getNav();
        int t = vn.getText(); // -1 if the current element has no text node
        if (t != -1)
            System.out.println("text data ===>" + vn.toString(t));
    }

    // second_read is the shorter way: parseFile loads and parses in one call
    public static void second_read() throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        if (vg.parseFile("test.xml", true, VTDGenHuge.MEM_MAPPED)) {
            VTDNavHuge vn = vg.getNav();
            int t = vn.getText(); // -1 if the current element has no text node
            if (t != -1)
                System.out.println("text data ===>" + vn.toString(t));
        }
    }

    public static void main(String[] s) throws Exception {
        first_read();
        second_read();
    }
}

In Memory Mode

import com.ximpleware.extended.*;

/**
 * This is a demonstration of how to use the extended VTD parser
 * to process a large XML file.
 */
public class in_mem_read {
    // first_read is the longer way to load the XML file:
    // read it into an in-memory buffer, then hand it to the parser
    public static void first_read() throws Exception {
        XMLBuffer xb = new XMLBuffer();
        VTDGenHuge vg = new VTDGenHuge();
        xb.readFile("test.xml");
        vg.setDoc(xb);
        vg.parse(true); // true turns on namespace awareness
        VTDNavHuge vn = vg.getNav();
        int t = vn.getText(); // -1 if the current element has no text node
        if (t != -1)
            System.out.println("text data ===>" + vn.toString(t));
    }

    // second_read is the shorter way: parseFile loads and parses in one call
    public static void second_read() throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        if (vg.parseFile("test.xml", true, VTDGenHuge.IN_MEMORY)) {
            VTDNavHuge vn = vg.getNav();
            int t = vn.getText(); // -1 if the current element has no text node
            if (t != -1)
                System.out.println("text data ===>" + vn.toString(t));
        }
    }

    public static void main(String[] s) throws Exception {
        first_read();
        second_read();
    }
}

16 comments so far

  1. synhershko on

    This doesn’t seem to be available for C#, or is it? It’s quite a bummer not having the Huge versions available in .NET…

    • jimmyzhang on

      C# doesn’t yet support memory mapping. We actually have an in-house implementation of VTDGen that supports 256 GB; do you want to give it a try? What is the max file size in your environment?

  2. Vizz on

    I am trying to parse 7 GB of PubMed data and I get the parsing exception below even when using the extended version. I am also unable to create the MEM_MAPPED instance with VTDGenHuge. I am using a 64-bit JVM. My parsing code is similar to the BioInfo Java demo from the official website.

    Exception in thread “main” java.lang.ArrayIndexOutOfBoundsException: -2
    at com.ximpleware.extended.XMLBuffer.byteAt(XMLBuffer.java:100)
    at com.ximpleware.extended.VTDNavHuge.getChar(VTDNavHuge.java:556)
    at com.ximpleware.extended.VTDNavHuge.compareRawTokenString(VTDNavHuge.java:1570)
    at com.ximpleware.extended.VTDNavHuge.matchRawTokenString(VTDNavHuge.java:1641)
    at com.ximpleware.extended.VTDNavHuge.matchElement(VTDNavHuge.java:1423)
    at com.ximpleware.extended.VTDNavHuge.toElement(VTDNavHuge.java:2984)

    Thanks in advance

  3. Vijaya on

    Hi Jimmy Zhang

    I get the same parsing exception even after working with the CVS code.

    • jimmyzhang on

      Can you send me a sample file that can be used to duplicate the issue?

  4. Vizz on

    Hi Jimmy Zhang
    Thanks for the quick response.

    I am not able to send you the 7 GB file. Here are the few places where parsing breaks down.
    1. I am able to parse with vn.toElement(int) and count the number of files. However, converting the token character data with vn.toString(vn.getText()) throws an exception. (I convert to String only when vn.getText() doesn’t return -1.)

    2. Calling vn.toString(vn.getCurrentIndex()) throws an exception, even when excluding the -1 case.

    3. Using vn.toElement(int,String) directly throws an exception.

    The file I am using is passed as an argument at runtime. It has already been tested with a SAX parser without any exceptions.

    I really appreciate your time.
    Thank you
    Vizz / Vijaya

  5. jimmyzhang on

    You don’t have to send me the 7 GB file, just send me something that I can use to duplicate the issue… otherwise it would be difficult for me to help… I can step you through the process of fixing the bug yourself, if that helps.

  6. Vizz on

    Could you please send me your mail ID?

    Thank you
    Vizz

  7. Vizz on

    Here is a sample of what I am parsing:

    http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helppubmed&part=publisherhelp#publisherhelp.XML_Tag_Descriptions

    I need to pull the “ISSN” and “abstract” from that XML file.

    • jimmyzhang on

      I did play with a couple of files from the links that you provided… could not duplicate the issues… I tried stuff far more complex, like XPath evaluation and string conversions of many kinds…

      Can you provide a complete test case with code along with sample xml?

      • Vizz on

        Could you please check your mail for details?

  8. jimmyzhang on

    I sent you the CVS link where you can retrieve the latest version/update… you might need to recompile…

    • Vizz on

      Thanks

    • Vizz on

      Amazing performance. Parsing performance is 650M/s. You will never be disappointed with VTD-XML.

      Thanks, Mr. Jimmy Zhang, for this excellent software.

  9. Finpush on

    Does the parser support streaming mode? By streaming mode, I mean a method where the document arrives in chunks and the parsing begins before all of the document is received.

    • jimmyzhang on

      It turns out that streaming mode is a *bad* idea in general… you’re better off using SAX with streaming mode.

