Re: Transforming ascii file (pseduo database) into proper database



I need to take a series of ascii files and transform the data contained therein so that it can be inserted into an existing database.
[snip]
I need to transform the data from the files before inserting
into the database. Now, this would all be relatively simple if
not for the following fact: The ascii files are each around
800MB,
[snip]
My questions are:
1. Has anyone done anything like this before, and if so, do
you have any advice?

Yes, I regularly do ETL on files from cellular providers to transform hundreds of megs worth (some approach a gig) of data into our internal system.

2. In the abstract, can anyone think of a way of amassing all
the related data for a specific identifier from all the
individual files without pulling all of the files into memory
and without having to repeatedly open, search, and close the
files over and over again?

if the file is sorted by something you can use, you can iterate over it and just deal with one grouping at a time. In my case, iterating over gobs of call-detail, the file happens to be sorted by the phone-number on the account. So I iterate over the file maintaining a list of calls for the given phonenumber, and when the phonenumber changes, I deal with the previous cache of data, then re-initialize with the new phone's data.

Other ideas:

1) create a temp DB (such as sqlite), skim through the file inserting all your data into a table in this DB, then use DB functionality on it

2) in a light-weight way, assuming there's lots of data per row, and that you have multiple rows associated with a given ID (in my case, such as a phonenumber), you can create a dictionary of an ID to a list of file-offsets in which that ID is used. You can then skim through the file once gathering all the offsets with calls to tell() and then when you want to process an item, you can seek to that particular offset and read in the line. Not greatly efficient, but hackable.

But mostly, it helps if you have a sorted field that's useful to you :)

-tkc


.



Relevant Pages

  • Re: ODB (Cache?) vs ORM
    ... This is common and accepted database knowledge. ... > other way is error-prone, thus virtually guaranteed to break integrity, producing ... > is multiple orders of magnitude more powerful than procedural optimization. ...
    (comp.lang.java.databases)
  • Re: Requery "Too Soon"?
    ... "The database has been placed in a state by user ... then the Requery will return the just ... MsgBox appears, I get an empty subform, i.e., I do ... I'll try inserting a DoEvents or two. ...
    (microsoft.public.access.formscoding)
  • Re: Evaluating D2005, help needed.
    ... > quality of Delphi still hasn't crossed below the productivity of Visual ... and had built several database test cases to backup your claims but now it seems ... Delphi for Win32 and VS.NET for .NET and while I'm quite confident that my ...
    (borland.public.delphi.non-technical)
  • Re: SQLCE performance from .NET CF v2.0
    ... Please remember when bulk inserting is being executed against SQL CE, ... > local database functionality. ... > database technology and I am considering switching to SQLCE at the same ... > execute the prepared statement again. ...
    (microsoft.public.dotnet.framework.compactframework)
  • Re: OOP database tables <-> php interface (semi LONG)
    ... The Access database has over 30 ... database and business object layers. ... Good point about the flexibility too! ...
    (comp.lang.php)