Re: splitting up huge (1 GB) xml documents
- From: Thomas Weidenfeller <nobody@xxxxxxxxxxxxxxxx>
- Date: Fri, 29 Apr 2005 11:26:27 +0200
twinkler wrote:
I need to split this massive document up into smaller chunks of valid XML. The structure of the XML is quite simple:
<businessHeader>Bla, bla -> only about 10 tags</businessHeader>
<businessInformation>info goes here</businessInformation>
<!-- the businessInformation tag is repeated a couple of hundred thousand times... -->
<businessInformation>info goes here</businessInformation>
<businessFooter>about 10 tags footer</businessFooter>
My current approach is to use SAX to parse the document and write the businessInformation elements into different files. The header is inserted at the start of each file and the footer at the end.
This obviously consumes quite a lot of time since the entire file is parsed sequentially.
Can you think of a way to speed this process up? I was thinking of jumping randomly into the <businessInformation> section of the file (random access file) and then starting to parse from there with SAX (potentially in parallel by using threads), but I am not sure if this works.
First, you have to read the whole file anyhow, so randomly jumping around doesn't make much sense. Threads shouldn't gain you much either. If the job is I/O bound (which is likely), your threads would sit idle waiting for their next chunk of input data.
I would not use threads. I would not use random access. I would not use SAX, I would not use any kind of XML parser. In fact I would not even use Java.
I would give the XML a very intensive look. Assuming that it is machine generated, it should have a regular layout. Based on that layout I would write a Perl script. That script would use regular expressions (the simplest ones that could possibly work) to identify the different parts in the file and break it up. XML is not too well suited to being processed with pattern matching, but machine-generated XML is usually regular enough to do so. And maybe 20 or 40 lines of Perl are enough to process the file.
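To illustrate the pattern-matching idea, here is a rough sketch. It is written in Python rather than Perl, purely as a stand-in, and it is not a tested tool. It assumes the tag names from the sample in the question, that the tags carry no attributes, and that businessInformation elements are never nested. The function name split_stream, the chunk_NNNNN.xml file names and the records_per_file parameter are made up for the example. Since the footer only shows up at the very end of the input, the sketch appends it to every chunk file after the last record has been written.

```python
import io
import os
import re

# Assumed tag names, taken from the sample in the question (no attributes, no nesting).
HEADER_RE = re.compile(r"<businessHeader>.*?</businessHeader>", re.S)
RECORD_RE = re.compile(r"<businessInformation>.*?</businessInformation>", re.S)
FOOTER_RE = re.compile(r"<businessFooter>.*?</businessFooter>", re.S)


def split_stream(src, out_dir, records_per_file=1000, block_size=1 << 20):
    """Split the text stream `src` into chunk files and return their paths.

    Hypothetical helper: reads the input once, block by block, and never
    holds more than one block plus one partial record in memory.
    """
    buf = ""
    header = None
    paths = []
    out = None
    count = 0
    while True:
        block = src.read(block_size)
        buf += block
        if header is None:
            m = HEADER_RE.search(buf)
            if m:
                header = m.group(0)
                buf = buf[m.end():]
        if header is not None:
            pos = 0
            for m in RECORD_RE.finditer(buf):
                if out is None or count == records_per_file:
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, "chunk_%05d.xml" % len(paths))
                    paths.append(path)
                    out = open(path, "w", encoding="utf-8")
                    out.write(header + "\n")  # every chunk starts with the header
                    count = 0
                out.write(m.group(0) + "\n")
                count += 1
                pos = m.end()
            buf = buf[pos:]  # keep only the unfinished tail for the next block
        if not block:  # end of input
            break
    if out is not None:
        out.close()
    # What is left in the buffer after the last record should be the footer.
    footer = FOOTER_RE.search(buf)
    if footer:
        for path in paths:
            with open(path, "a", encoding="utf-8") as f:
                f.write(footer.group(0) + "\n")
    return paths
```

Something like split_stream(open("big.xml", encoding="utf-8"), "chunks") would be the intended call, with big.xml and chunks as placeholder names. Like the sample in the question, the chunks get no enclosing root element; if the real file has one, it would have to be handled the same way as the header and footer.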
I would also consider tampering with the way the writing application generates the data. Under Unix I would try the age-old trick of providing the writing application with an output file-name which in fact does not point to a file, but to a FIFO (named pipe). The Perl script would sit at the reading end of the FIFO and directly write chunks, and there would never be a 1 GB file at all.
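A rough sketch of the reading end of such a FIFO, again in Python as a stand-in for the Perl script and Unix only. The names make_fifo, read_records_from_fifo and demo_writer are invented for the example; demo_writer merely plays the role of the writing application, which in reality would just be given the FIFO's path as its output file name. The tag name is the one from the sample in the question.

```python
import os
import re

# Assumed record tag, taken from the sample in the question.
RECORD_RE = re.compile(r"<businessInformation>.*?</businessInformation>", re.S)


def make_fifo(fifo_path):
    """Create the named pipe if it does not exist yet (Unix only)."""
    if not os.path.exists(fifo_path):
        os.mkfifo(fifo_path)


def read_records_from_fifo(fifo_path, block_size=1 << 16):
    """Yield the records read from the FIFO, one string per record."""
    # open() blocks here until some writer opens the other end.
    with open(fifo_path, "r", encoding="utf-8") as src:
        buf = ""
        while True:
            block = src.read(block_size)
            if not block:  # the writer has closed its end
                break
            buf += block
            pos = 0
            for m in RECORD_RE.finditer(buf):
                yield m.group(0)
                pos = m.end()
            buf = buf[pos:]  # keep the unfinished tail for the next block


def demo_writer(fifo_path, n):
    """Stand-in for the real writing application (illustration only)."""
    with open(fifo_path, "w", encoding="utf-8") as out:
        out.write("<businessHeader>h</businessHeader>\n")
        for i in range(n):
            out.write("<businessInformation>%d</businessInformation>\n" % i)
        out.write("<businessFooter>f</businessFooter>\n")
```

The reader has to be started before, or together with, the writer, since each side blocks in open() until the other end is there. Header and footer handling is left out to keep the sketch short.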
/Thomas
-- The comp.lang.java.gui FAQ: ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq .