Re: filehandle, read lines



roy <roy.schultheiss@xxxxxxxxxxxxxx> wrote:
I receive a XML-File up to 1 GB full of orders every day. I have to
split the orders and load them into a database for further processing.
I share this job onto multiple processes. This runs properly now.

This seems dangerous to me. Generally XML should be parsed by an XML
parser, not by something that happens to parse some restricted subset of
XML with a particular whitespace pattern that works for one example. If
the person who generates the file changes the white-space, for example,
then your program would break, while they could correctly claim that the
file they produced is valid XML and your program shouldn't have broken on
it. OTOH, if you have assurances the file you receive will always be in a
particular subset of XML that matches the expected white space, etc., then
perhaps the trade-off you are making is acceptable.

If parsing is not the bottleneck, then I'd just use something like
XML::Twig to read this one XML file and parcel it out to 10 XML files as a
preprocessing step. Then run each of those 10 files separately. Of
course, if parsing is the bottleneck, then this would defeat the purpose of
parallelizing.
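
For what it's worth, a rough sketch of that preprocessing step might look like the following. The element name "order", the wrapper element "orders", the helper name split_orders and the part-file naming are all assumptions made for illustration; nothing in the original post fixes them. It deals the orders round-robin into $n smaller files and purges the twig as it goes, so memory use should stay small even for a large input.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use XML::Twig;

# Hypothetical helper: deal each <order> element round-robin into $n files.
sub split_orders {
    my ($infile, $n, $prefix) = @_;
    my @out;
    for my $i (0 .. $n - 1) {
        open(my $fh, '>:encoding(UTF-8)', "$prefix.$i.xml")
            or die "Cannot write part $i: $!";
        print {$fh} qq{<?xml version="1.0" encoding="UTF-8"?>\n<orders>\n};
        push @out, $fh;
    }
    my $count = 0;
    my $twig  = XML::Twig->new(
        twig_handlers => {
            order => sub {                      # assumed element name
                my ($t, $elt) = @_;
                print { $out[ $count++ % $n ] } $elt->sprint, "\n";
                $t->purge;                      # drop what was already written
            },
        },
    );
    $twig->parsefile($infile);
    for my $fh (@out) {
        print {$fh} "</orders>\n";
        close $fh or die "Cannot close part file: $!";
    }
    return $count;
}

# Tiny demonstration input with three orders, split into two part files.
my $dir = tempdir(CLEANUP => 1);
open(my $in, '>', "$dir/in.xml") or die "Cannot write input: $!";
print {$in} "<orders>\n";
print {$in} "<order><id>$_</id></order>\n" for 1 .. 3;
print {$in} "</orders>\n";
close $in or die "Cannot close input: $!";

my $count = split_orders("$dir/in.xml", 2, "$dir/part");
print "wrote $count orders\n";
```

Each part file is then a complete XML document of its own, so the downstream jobs no longer need to guess where an order begins.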

Also, you should probably adapt your code to support "use strict;"


sub insert_orders {
my ($filename, $from, $to) = @_;

my $xml = new IO::File;
open ($xml, "< $filename");

if ($xml = set_handle ($xml, $from)) {
while (defined ($_dat = <$xml>)) {
$_temp = "\U$_dat\E"; # Convert into capital letters
$_temp =~ s/\s+//g; # Remove blanks

if ($_temp eq '<ORDER>') {
$_mode = 'order';
$_order = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
}

If you start fractionally through an order, this code above "burns" lines
until you get to the start of the first "full" order. Yet your set_handle
code also burns the initial (potentially) fraction of a line. It would be
cleaner if the code to burn data was all in one place.



$_order .= $_dat if $_mode eq 'order';

if ($_temp eq '</ORDER>') {
# load $_order into the database ...

This tries to load $_order into the database even when there is no
order to load, i.e. when you started out in the middle of a previous order.
$_order will be empty, but you try to load it anyway. Is that a problem?
You want to load $_order only when
$_temp eq '</ORDER>' and $_mode eq 'order'
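
A minimal sketch of that guard, using lexical variables and an array of lines instead of a file so it runs on its own. The helper name complete_orders is hypothetical, and the push stands in for the real database load.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the orders that were seen from
# their opening <ORDER> line through to their closing </ORDER> line.
sub complete_orders {
    my @lines = @_;
    my ($mode, $order, @loaded) = ('', '');
    for my $line (@lines) {
        (my $tag = uc $line) =~ s/\s+//g;       # upper-case, strip whitespace
        if ($tag eq '<ORDER>') {
            $mode  = 'order';
            $order = qq{<?xml version="1.0" encoding="UTF-8"?>\n};
        }
        $order .= $line if $mode eq 'order';
        if ($tag eq '</ORDER>') {
            # Load only when an order was actually being collected.
            push @loaded, $order if $mode eq 'order';
            ($mode, $order) = ('', '');
        }
    }
    return @loaded;
}

# The input starts in the middle of order 1, as a chunk might.
my @lines = (
    "<id>1</id>\n", "</order>\n",
    "<order>\n", "<id>2</id>\n", "</order>\n",
);
my @loaded = complete_orders(@lines);
print scalar(@loaded), " order(s) loaded\n";
```

Without the "if $mode eq 'order'" test, the stray closing tag of order 1 would trigger a load of an empty string.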



$_order = '';
$_mode = '';

last if ($to <= tell ($xml));

This has the potential to lose orders. Let's say that $to is 1000, and
an order starts exactly at position 1000. This job will not process that
order, because $to<=1000 is true. The next-higher job, whose
$from is 1000, also will not process this order, as the "partial" first
line it burned just happened to be a true full line, and that order
therefore gets forgotten. (I've verified this does in fact happen in a test
case)

A strict comparison avoids this:

last if ($to < tell($xml));

(Or change the way you burn data, as suggested above, so it all happens
in only one place.)
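
The boundary case can be reproduced with a small self-contained script. This is only a sketch: the sample file layout, the helper name orders_in_chunk and its $inclusive_stop switch are made up for the demonstration, and the burn is skipped when $from is 0. It counts how many orders two adjacent jobs handle when the split falls exactly on the start of an order.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the orders handled by the job for byte range $from..$to.
# $inclusive_stop is a hypothetical switch: true uses the original
# "$to <= tell" test, false uses the strict "$to < tell" test.
sub orders_in_chunk {
    my ($filename, $from, $to, $inclusive_stop) = @_;
    open(my $xml, '<:raw', $filename) or die "Cannot open $filename: $!";
    seek($xml, $from, 0) or die "Cannot seek to $from: $!";
    if ($from > 0) { my $partial = <$xml>; }    # burn the first line

    my ($mode, $order, @orders) = ('', '');
    while (defined(my $line = <$xml>)) {
        (my $tag = uc $line) =~ s/\s+//g;
        if ($tag eq '<ORDER>') { $mode = 'order'; $order = ''; }
        $order .= $line if $mode eq 'order';
        if ($tag eq '</ORDER>') {
            push @orders, $order if $mode eq 'order';
            ($mode, $order) = ('', '');
            my $pos = tell($xml);
            last if $inclusive_stop ? $to <= $pos : $to < $pos;
        }
    }
    close $xml;
    return @orders;
}

# Five 28-byte orders after a 9-byte root line: they start at
# bytes 9, 37, 65, 93 and 121.
my ($out, $file) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "<ORDERS>\n";
print {$out} "<ORDER>\n<ID>$_</ID>\n</ORDER>\n" for 1 .. 5;
print {$out} "</ORDERS>\n";
close $out or die "Cannot close $file: $!";
my $size = -s $file;

# Split exactly where order 3 starts (byte 65).
my $lost  = orders_in_chunk($file, 0, 65, 1) + orders_in_chunk($file, 65, $size, 1);
my $fixed = orders_in_chunk($file, 0, 65, 0) + orders_in_chunk($file, 65, $size, 0);
print "with <=: $lost orders, with <: $fixed orders\n";
```

With this sample file the "<=" variant handles 4 orders and the "<" variant handles all 5. Note that the check only runs after a closing tag, so a chunk shorter than one order could still cause an order to be handled twice.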

....


sub set_handle {
my ($handle, $pos) = @_;

seek($handle,$pos,SEEK_CUR);

if (defined (<$handle>)) # start new line

You probably only want to burn a line when $pos>0. When $pos==0, you
know the first line you read will be complete, so there is no reason
to burn it. Generally the burned line that starts out with each chunk will
be processed in the "previous" chunk, but when $pos==0 there was no
previous chunk. But this will not actually be a problem unless the first
line of your XML file is "<ORDER>", which does not seem likely for "real" XML.

{ return $handle; }
else
{ return; }

I don't understand the above. If $handle returned an undefined
value this time you read from it, won't it do so next time as well?
(I think the only time this isn't true is when $handle is an alias for
ARGV). So why not just return $handle regardless?
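
Putting the two set_handle remarks together, a possible rewrite is sketched below. It is an illustration rather than a drop-in replacement: it seeks with SEEK_SET instead of the SEEK_CUR in the quoted code (the two are equivalent on a freshly opened handle), and the demo reads from an in-memory string instead of the real file.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Position $handle at the first complete line at or after $pos.
sub set_handle {
    my ($handle, $pos) = @_;
    seek($handle, $pos, SEEK_SET) or return;    # a failed seek is a real error
    # Burn a line only when starting past the beginning of the file;
    # at $pos == 0 the first line is already complete.
    if ($pos > 0) { my $partial = <$handle>; }
    return $handle;                             # returned regardless of the burn
}

my $data = "<ORDER>\n<ID>1</ID>\n</ORDER>\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

set_handle($fh, 0);
my $first = <$fh>;      # the full first line, nothing burned

set_handle($fh, 3);
my $after = <$fh>;      # the line following the partial one

print "first: $first", "after: $after";
```

If the burn hits end of file, the caller's read loop simply ends at once, so no special return value is needed for that case.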

Xho



