Re: splitting up huge (1 GB) xml documents



twinkler wrote:
> Dear all,
>
> I am facing the problem that have to handle XML documents of approx. 1
> GB. Do not ask me which sane architecture allows the creation of such
> files - I have no control over the creation and have to live with it.
>
> I need to split this massive document up into smaller chunks of valid
> XML:
> The structure of the XML is quite easy:

You definitely want to use monq.jfa, available from

http://www.ebi.ac.uk/Rebholz-srv/whatizit/software

Download the jar and play with Grep. As an example
use a command like

java -cp monq.jar monq.programs.Grep \
-r '<YourTag[^>]*>' '</YourTag>' \
-rf %0 '%0\n' \
-cr <your_file

It will extract only the YourTag XML elements. The
-r option defines the 'region of interest', and -rf says
how to handle the start and end of it. The -cr option
requests that every region of interest be printed. You
could also define regular expressions to fetch only
regions that contain a match.
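
To make the 'region of interest' idea concrete, here is a tiny sketch
that applies the same two patterns with the JDK's java.util.regex. It is
not monq and does not show how Grep works internally. The class name
RegionDemo is made up for illustration, and unlike the command above it
needs the whole input in memory, so it only suits small test strings, not
the 1 GB file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: RegionDemo is a hypothetical name, not part of monq.
class RegionDemo {

    /** Returns every span from a start tag of 'tag' to the next end tag. */
    static List<String> regions(String text, String tag) {
        // Same shape as the -r arguments: '<YourTag[^>]*>' and '</YourTag>'.
        Pattern start = Pattern.compile("<" + Pattern.quote(tag) + "[^>]*>");
        Pattern end = Pattern.compile("</" + Pattern.quote(tag) + ">");
        Matcher s = start.matcher(text);
        Matcher e = end.matcher(text);

        List<String> found = new ArrayList<>();
        int pos = 0;
        // Find a start tag, then the first end tag after it, and repeat.
        while (s.find(pos) && e.find(s.end())) {
            found.add(text.substring(s.start(), e.end()));
            pos = e.end();
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<a><YourTag id=\"1\">x</YourTag><b/><YourTag>y</YourTag></a>";
        for (String region : regions(xml, "YourTag")) {
            System.out.println(region);
        }
    }
}
```

Like the patterns in the command, this is plain text matching: it does
not understand nesting, so a YourTag nested inside another YourTag would
end the region at the first closing tag.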

To distribute into different files, you will have
to write some lines of code yourself. To get started,
read the example:

http://www.ebi.ac.uk/Rebholz-srv/whatizit/monq-doc/monq/jfa/package-summary.html#package_description

and use

http://www.ebi.ac.uk/Rebholz-srv/whatizit/monq-doc/monq/jfa/Xml.html#GoofedElement(java.lang.String)

to create regular expressions for the elements you want to
fetch.
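
If you would rather stay with the JDK alone for that last step, the
following is a minimal sketch of the 'some lines of code' that distribute
the elements over several files. Note that it is not monq.jfa: it swaps
in the standard StAX API (javax.xml.stream), which also reads the input
as a stream and so never holds the whole 1 GB in memory. The class name
XmlSplitter, the <chunk> wrapper element and the chunk-N.xml file names
are made up for illustration. It also assumes that the elements do not
rely on namespace declarations made on elements outside them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of monq: splits with the JDK's StAX API.
class XmlSplitter {

    /**
     * Copies every recordTag element of 'input' into outDir/chunk-0.xml,
     * chunk-1.xml, ..., at most 'perFile' elements per file, each file
     * wrapped in a <chunk> root. Returns the number of files written.
     */
    static int split(Path input, Path outDir, String recordTag, int perFile)
            throws IOException, XMLStreamException {
        if (perFile < 1) throw new IllegalArgumentException("perFile must be >= 1");
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();
        Files.createDirectories(outDir);

        int files = 0;      // output files opened so far
        int inCurrent = 0;  // complete records in the open file
        int depth = 0;      // element nesting inside a record, 0 = outside
        OutputStream out = null;
        XMLEventWriter writer = null;

        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = inFactory.createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent ev = reader.nextEvent();
                if (depth == 0) {
                    boolean recordStart = ev.isStartElement()
                        && ev.asStartElement().getName().getLocalPart().equals(recordTag);
                    if (!recordStart) continue;  // skip everything between records
                    if (writer == null) {        // open the next output file
                        out = Files.newOutputStream(outDir.resolve("chunk-" + files + ".xml"));
                        files++;
                        writer = outFactory.createXMLEventWriter(out, "UTF-8");
                        writer.add(events.createStartElement("", "", "chunk"));
                    }
                }
                if (ev.isStartElement()) depth++;
                if (ev.isEndElement()) depth--;
                writer.add(ev);                  // copy the event unchanged
                if (depth == 0) {                // a record just ended
                    writer.add(events.createCharacters("\n"));
                    if (++inCurrent == perFile) {
                        finish(writer, events, out);
                        writer = null;
                        out = null;
                        inCurrent = 0;
                    }
                }
            }
            reader.close();
            if (writer != null) {                // last, partially filled file
                finish(writer, events, out);
                out = null;
            }
        } finally {
            if (out != null) out.close();        // only still open after an error
        }
        return files;
    }

    /** Closes the <chunk> root and the underlying file. */
    private static void finish(XMLEventWriter writer, XMLEventFactory events, OutputStream out)
            throws IOException, XMLStreamException {
        writer.add(events.createEndElement("", "", "chunk"));
        writer.flush();
        writer.close();
        out.close();
    }

    public static void main(String[] args) throws Exception {
        int n = split(Paths.get(args[0]), Paths.get(args[1]), args[2], Integer.parseInt(args[3]));
        System.out.println(n + " file(s) written");
    }
}
```

With Java 11 or later it can be tried as

java XmlSplitter.java your_file out YourTag 1000

which puts up to 1000 YourTag elements into each file under out/. The
<chunk> wrapper is there only so that every output file has a single
root element and is therefore well-formed XML on its own.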

Don't hesitate to contact me (see download page) for more
specific questions and hints.

Harald.
