Process Huge XML Documents with Extended VTD-XML
If you have XML files larger than 2 GB and don’t want to lose the benefit of the full XPath set, you will be surprised by how handy and fast extended VTD-XML can be.
New since version 2.3 and part of the full VTD-XML distribution, extended VTD-XML expands the maximum document size to 256 GB; reaching that limit requires a 64-bit JVM.
Extended VTD-XML works with XML in either in-memory mode (like standard VTD-XML) or memory-mapped mode, which allows partial document loading.
The code examples below show how to use extended VTD-XML to process an XML document in memory-mapped mode and in in-memory mode:
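To give a feel for why a memory-mapped mode helps with files beyond 2 GB, here is a minimal sketch that uses only the JDK. It is not how XMLMemMappedBuffer is implemented; the class name SegmentedMappedFile and its methods are hypothetical, made up for this illustration. A single Java mapping cannot exceed Integer.MAX_VALUE bytes, so the sketch maps the file as a series of read-only segments and lets the operating system page data in on demand.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper (not part of VTD-XML): exposes a file of any size
// as a sequence of read-only memory-mapped segments addressed by a long offset.
class SegmentedMappedFile {
    private final MappedByteBuffer[] segments;
    private final long segmentSize;
    private final long length;

    private SegmentedMappedFile(MappedByteBuffer[] segments, long segmentSize, long length) {
        this.segments = segments;
        this.segmentSize = segmentSize;
        this.length = length;
    }

    // Maps the whole file in chunks of at most segmentSize bytes.
    static SegmentedMappedFile open(Path path, long segmentSize) throws IOException {
        if (segmentSize <= 0 || segmentSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("segmentSize must be in 1..Integer.MAX_VALUE");
        }
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            long length = channel.size();
            int count = (int) ((length + segmentSize - 1) / segmentSize);
            MappedByteBuffer[] segments = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long start = i * segmentSize;
                long size = Math.min(segmentSize, length - start);
                // A mapping stays valid after the channel is closed.
                segments[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, size);
            }
            return new SegmentedMappedFile(segments, segmentSize, length);
        }
    }

    // Returns the byte at an absolute file offset.
    byte byteAt(long offset) {
        if (offset < 0 || offset >= length) {
            throw new IndexOutOfBoundsException("offset " + offset);
        }
        int segment = (int) (offset / segmentSize);
        int position = (int) (offset % segmentSize);
        return segments[segment].get(position);
    }

    long length() {
        return length;
    }
}
```

With a large segment size (say 1 GB), only the pages that are actually touched need to be resident, which is the general idea behind partial document loading.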
Memory Mapped Mode
import com.ximpleware.extended.*;

public class mem_mapped_read {
    // first read is the longer version of loading the XML file
    public static void first_read() throws Exception {
        XMLMemMappedBuffer xb = new XMLMemMappedBuffer();
        VTDGenHuge vg = new VTDGenHuge();
        xb.readFile("test.xml");
        vg.setDoc(xb);
        vg.parse(true);
        VTDNavHuge vn = vg.getNav();
        System.out.println("text data ===>" + vn.toString(vn.getText()));
    }

    // second read is the shorter version of loading the XML file
    public static void second_read() throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        if (vg.parseFile("test.xml", true, VTDGenHuge.MEM_MAPPED)) {
            VTDNavHuge vn = vg.getNav();
            System.out.println("text data ===>" + vn.toString(vn.getText()));
        }
    }

    public static void main(String[] s) throws Exception {
        first_read();
        second_read();
    }
}
In Memory Mode
/**
 * This is a demonstration of how to use the extended VTD parser
 * to process a large XML file.
 */
import com.ximpleware.extended.*;

public class in_mem_read {
    // first read is the longer version of loading the XML file
    public static void first_read() throws Exception {
        XMLBuffer xb = new XMLBuffer();
        VTDGenHuge vg = new VTDGenHuge();
        xb.readFile("test.xml");
        vg.setDoc(xb);
        vg.parse(true);
        VTDNavHuge vn = vg.getNav();
        System.out.println("text data ===>" + vn.toString(vn.getText()));
    }

    // second read is the shorter version of loading the XML file
    public static void second_read() throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        if (vg.parseFile("test.xml", true, VTDGenHuge.IN_MEMORY)) {
            VTDNavHuge vn = vg.getNav();
            System.out.println("text data ===>" + vn.toString(vn.getText()));
        }
    }

    public static void main(String[] s) throws Exception {
        first_read();
        second_read();
    }
}
This doesn’t seem to be available for C#, or is it? It’s quite a bummer not having the Huge versions available in .NET…
C# doesn’t yet support memory mapping. We actually have an in-house implementation of VTDGen that supports 256 GB; do you want to give it a try? What is the maximum file size in your environment?
I am trying to parse 7 GB of PubMed data and I am getting the parsing exception below, even with the extended version. I am also unable to create the MEM_MAPPED instance with VTDGenHuge. I am using a 64-bit JVM. My parsing code is similar to that of the BioInfo Java demo from the official website.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -2
at com.ximpleware.extended.XMLBuffer.byteAt(XMLBuffer.java:100)
at com.ximpleware.extended.VTDNavHuge.getChar(VTDNavHuge.java:556)
at com.ximpleware.extended.VTDNavHuge.compareRawTokenString(VTDNavHuge.java:1570)
at com.ximpleware.extended.VTDNavHuge.matchRawTokenString(VTDNavHuge.java:1641)
at com.ximpleware.extended.VTDNavHuge.matchElement(VTDNavHuge.java:1423)
at com.ximpleware.extended.VTDNavHuge.toElement(VTDNavHuge.java:2984)
Thanks in advance
Hi Jimmy Zhang
I get the same parsing exception even after working with the CVS code.
Can you send me a sample file that can be used to duplicate the issue?
Hi Jimmy Zhang
Thanks for the quick response.
I am not able to send you the 7 GB file. Here are the few places where parsing breaks down.
1. I am able to navigate with vn.toElement(int) and count the number of files. However, converting the token character data with vn.toString(vn.getText()) throws an exception. (I convert to a String only when vn.getText() doesn’t return -1.)
2. Calling vn.toString(vn.getCurrentIndex()) throws an exception, even when guarding against the -1 case.
3. Calling vn.toElement(int, String) directly throws an exception.
The file I am using is passed as an argument at runtime. It has already been tested with a SAX parser without any exceptions.
I really appreciate your time.
Thank you
Vizz / Vijaya
You don’t have to send me the 7 GB file; just send me something that I can use to duplicate the issue. Otherwise, it would be difficult for me to help. I can step you through the process of fixing the bug yourself, if that would help.
Could you please send me your mail ID?
Thank you
Vizz
Here is a sample of the format I am parsing:
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helppubmed&part=publisherhelp#publisherhelp.XML_Tag_Descriptions
I need to pull the “ISSN” and “abstract” from that XML file.
I did play with a couple of files from the links that you provided, but could not duplicate the issue. I tried things far more complex, like XPath evaluation and string conversions of many kinds.
Can you provide a complete test case with code along with sample XML?
Could you please check your mail for details?
I sent you the CVS link from which you can retrieve the latest version/update. You might need to recompile.
Thanks
Amazing performance. Parsing performance of 650M/s. You will never be disappointed with VTD-XML.
Thanks, Mr. Jimmy Zhang, for this excellent software.
Does the parser support streaming mode? By streaming mode, I mean a method where the document arrives in chunks and the parsing begins before all of the document is received.
It turns out that streaming mode is a *bad* idea in general… you’re better off using SAX for streaming.
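For readers weighing the SAX suggestion, here is a minimal sketch of streaming extraction with the SAX parser that ships with the JDK. It is only an illustration: the class name StreamingTextCollector and its collect method are hypothetical, and the sketch assumes the target element (for example “ISSN”) is never nested inside an element of the same name. The parser pulls from the InputStream as it goes, so parsing can begin before the whole document has arrived.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of every element
// with a given name while the document is being streamed.
class StreamingTextCollector extends DefaultHandler {
    private final String target;
    private final List<String> values = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a target element

    StreamingTextCollector(String target) {
        this.target = target;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(target)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && qName.equals(target)) {
            values.add(current.toString());
            current = null;
        }
    }

    // Parses the stream and returns the collected text values in document order.
    static List<String> collect(InputStream in, String elementName) throws Exception {
        StreamingTextCollector handler = new StreamingTextCollector(elementName);
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.values;
    }
}
```

The trade-off is that SAX offers no random access and no XPath; whatever state you need must be tracked in the handler by hand.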