Re: splitting up huge (1 GB) xml documents
- From: "HK" <pifpafpuf@xxxxxx>
- Date: 29 Apr 2005 01:57:19 -0700
twinkler wrote:
> Dear all,
>
> I am facing the problem that I have to handle XML documents of approx. 1
> GB. Do not ask me which sane architecture allows the creation of such
> files - I have no control over the creation and have to live with it.
>
> I need to split this massive document up into smaller chunks of valid
> XML:
> The structure of the XML is quite easy:
You definitely want to use monq.jfa, available from
http://www.ebi.ac.uk/Rebholz-srv/whatizit/software
Download the jar and play with Grep. As an example,
use a command like
    java -cp monq.jar monq.programs.Grep \
        -r '<YourTag[^>]*>' '</YourTag>' \
        -rf %0 '%0\n' \
        -cr <your_file
It will extract only the YourTag XML elements. The
-r option defines the 'region of interest' by a regular
expression for its start and one for its end, and -rf says
how to format the start and end of each region. The -cr
option requests that every region of interest be printed.
You could also define regular expressions to fetch only
those regions that contain a match.
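To make the 'region of interest' idea concrete, here is a minimal
sketch in plain Java using java.util.regex instead of monq.jfa. The
class RegionDemo and the method regions are hypothetical names chosen
for illustration; they are not part of monq. The sketch works on an
in-memory string, so it only shows the start/end matching and is not
meant for a 1 GB file. It also assumes the element is not nested
inside an element of the same name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegionDemo {
    // Hypothetical helper, not part of monq: returns every region that
    // starts at a match of startRe and ends at the next match of endRe.
    static List<String> regions(String input, String startRe, String endRe) {
        Matcher start = Pattern.compile(startRe).matcher(input);
        Matcher end = Pattern.compile(endRe).matcher(input);
        List<String> out = new ArrayList<>();
        int pos = 0;
        // Find a start match, then the first end match after it.
        while (start.find(pos) && end.find(start.end())) {
            out.add(input.substring(start.start(), end.end()));
            pos = end.end(); // continue searching after this region
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<root><YourTag id=\"1\">a</YourTag><x/>"
                   + "<YourTag>b</YourTag></root>";
        // Same two expressions as given to -r in the Grep command.
        for (String region : regions(xml, "<YourTag[^>]*>", "</YourTag>")) {
            System.out.println(region);
        }
    }
}
```

Note that the start expression would also match a tag whose name merely
begins with YourTag, so a stricter expression may be needed for real data.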
To distribute the elements into different files, you will have
to write a few lines of code yourself. To get started,
read the example:
http://www.ebi.ac.uk/Rebholz-srv/whatizit/monq-doc/monq/jfa/package-summary.html#package_description
and use
http://www.ebi.ac.uk/Rebholz-srv/whatizit/monq-doc/monq/jfa/Xml.html#GoofedElement(java.lang.String)
to create regular expressions for the elements you want to
fetch.
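For comparison, the 'one element per file' step can also be sketched
without monq, using the StAX event API (javax.xml.stream) that ships
with the JDK. This is a swapped-in technique, not the monq.jfa approach
described above, and only a rough sketch: the class XmlSplitter, the
method split and the chunk-N.xml file names are hypothetical. It
assumes the elements do not rely on namespace declarations made on an
enclosing element.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class XmlSplitter {
    // Hypothetical helper: writes every <tag> element read from 'in' to
    // its own file in outDir and returns the number of files written.
    static int split(InputStream in, String tag, Path outDir)
            throws XMLStreamException, IOException {
        XMLEventReader reader =
            XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        int count = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            // Skip everything until the start tag of a wanted element.
            if (!event.isStartElement() || !event.asStartElement()
                    .getName().getLocalPart().equals(tag)) {
                continue;
            }
            Path chunk = outDir.resolve("chunk-" + (++count) + ".xml");
            try (Writer w =
                    Files.newBufferedWriter(chunk, StandardCharsets.UTF_8)) {
                XMLEventWriter writer = outFactory.createXMLEventWriter(w);
                int depth = 0;
                // Copy events until the matching end tag has been written.
                do {
                    if (event.isStartElement()) depth++;
                    if (event.isEndElement()) depth--;
                    writer.add(event);
                    if (depth > 0) event = reader.nextEvent();
                } while (depth > 0);
                writer.close(); // flushes; the Writer is closed by try
            }
        }
        reader.close();
        return count;
    }
}
```

A call such as XmlSplitter.split(Files.newInputStream(Path.of("your_file")),
"YourTag", Path.of("out")) would process the input as a stream, so only the
events of the current element are held at any time rather than the whole
document.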
Don't hesitate to contact me (see download page) for more
specific questions and hints.
Harald.