Re: custom data warehouse in python vs. out-of-the-box ETL tool



snfctech wrote:
Does anyone have experience building a data warehouse in python? Any
thoughts on custom vs using an out-of-the-box product like Talend or
Informatica?

I have an integrated system Dashboard project that I was going to
build using cross-vendor joins on existing DBs, but I keep hearing
that a data warehouse is the way to go. e.g. I want to create orders
and order_items with relations to members (MS Access DB), products
(flat file) and employees (MySQL).

Thanks in advance for any tips.

My experience is that if you enjoy hacking around databases and are proficient in Python, than for small scale solutions it is preferable to do it yourself. If you need a large scale solutions with advanced requirements, building it yourself is mostly the only way.

I have build a rather complex datawarehouse system in the past (well actually more like a centralised synchronisation hub, having input and reporting databases as satellite clients), shoving data from around 500 databases (typically 5 Gb in size each) spread over the world.

The only commercial solutions I reviewed was Business Objects Data Integrator and Oracle Warehouse Builder.

These tools where quite flexible and if you already have a license deal which includes these tools I would definitely recommend to have more than just a look at it.

If not and you are comfortably with using python to shovel data from A to B and transform it at the same time (moving relational data automatically into a EAV model and back again, for example) than building your own solution will probably save you money and time (as opposed to learn how to use that ETL tool).

This will require you to have at least interest in the following subjects:
- Authorization, may everybody use all data or should it be limited to a subset on the data depending on the data?
(My solution was one centralised hub which contains all data but is only accessible to special 'client' servers strictly maintained by me which only sync the data relevant to them).

- Authenticity, if you have different values for the same thing, which one should be considered authoritative and if yes may it be pushed back to the un-authoritative?

-Synchronisation, you really don't want to push/pull all of the database content over every x times, so how can you delta it and is there a way to do this only when the data changes (push vs pull)?

-ATOMIC, how long may the data be out of date and is it allowed to partially update

-Using and maintaining multiple databases, hopefully spread over multiple systems. I had a server for each production DB, a server that mirrored that production DB with some added columns per table for external synchronization purposes and a master synchronisation server (so in essence all data was copied three times, not very efficient but good if you like to play it on the safe side).


--
MPH
http://blog.dcuktec.com
'If consumed, best digested with added seasoning to own preference.'
.



Relevant Pages

  • Re: custom data warehouse in python vs. out-of-the-box ETL tool
    ... Bill Inmon's "Building the Data Warehouse" is 17 ... My experience is that if you enjoy hacking around databases and are ... I had a server for each production DB, ...
    (comp.lang.python)
  • Re: Production & Test Environments
    ... The databases contain your contents, ... However on your web front end server if you have done any ... your dev server envoironment you will need to have all these files copied ... to become Production) all the documents etc... ...
    (microsoft.public.sharepoint.portalserver)
  • best solutions : sql server replication (Maintenance)
    ... and after maintenance has been completed ... replicated back to production server. ... these 10 databases will have schema changes monthly. ...
    (microsoft.public.sqlserver.replication)
  • Re: Disaster recovery without backup but just files .edb & .stm
    ... i've to activate my exchange virtual server ... "ExchangeBackup" (exactly same of production), ... Your production server and DR server are both named "ExchangeBackup"? ... Mounting the databases on a server with the same name shouldn't cause ...
    (microsoft.public.exchange.admin)
  • Re: test environment windows 2003
    ... an authorative restore after the demote and system state restore. ... This document was prepared for the building of a copy of the production ... Remove the physical cable for the new pc and build the member server ... Master Operation roles (FSMO and the File Replication service). ...
    (microsoft.public.windows.server.active_directory)

Loading