Re: An odd sort requirement - data munging



On Nov 4, 12:23 pm, Jürgen Exner <jurge...@xxxxxxxxxxx> wrote:
cartercc <carte...@xxxxxxxxx> wrote:
I have a series of date related values, such as 01/2007, 02/2007,
03/2008, 04/2006, etc. These values represent a column in various
files that range from several dozen rows deep to almost 1M rows deep.
My job is to create reports from a collection of these types of files.

I create a number of refs to hashes that have the general appearance
of this:
$h{$k1}{$k2}{$k3} => data (generally but not always a simple count).
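For readers trying to picture the structure, here is a minimal sketch of how such a three-level hash of counts might be filled and walked. The record layout and the sample values are hypothetical, chosen only for illustration; the real keys can be anything sortable.

```perl
use strict;
use warnings;

# Hypothetical records: three sortable keys followed by a count.
my @records = (
    [ 'Baltimore', '07/T1', 'A27', 117 ],
    [ 'Baltimore', '07/T1', 'A27',   3 ],
    [ 'New York',  '08/T5', 'V26',  17 ],
);

my %h;
for my $record (@records) {
    my ($k1, $k2, $k3, $count) = @$record;
    # Autovivification creates the intermediate hashes on first use.
    $h{$k1}{$k2}{$k3} += $count;
}

# Walk the structure using the default string sort at every level.
for my $k1 (sort keys %h) {
    for my $k2 (sort keys %{ $h{$k1} }) {
        for my $k3 (sort keys %{ $h{$k1}{$k2} }) {
            print "$k1 $k2 $k3 $h{$k1}{$k2}{$k3}\n";
        }
    }
}
```

The default sort is where the trouble starts: it compares the keys as plain strings, which is only right when string order happens to match report order.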

After scratching my head for some time I am guessing that probably $k1
contains the month and $k2 contains the year.
If you had provided a minimal, self-contained script as requested in the
posting guidelines it would have been much easier to identify your data
structure.

No. $k1, etc., contains ANYTHING that's sortable and unique. It can
contain names, like "Exner, J", "New York," or "Baltimore" or numbers
(telephone, area code, ID numbers) or other values. The contents of
the keys are not relevant to the code or to the question.

Here's the problem: the ordering of the dates isn't numerical. The
proper order is:
03/2005, 04/2005 ... 01/2006, 02/2006
03/2006, 04/2006 ... 01/2007, 02/2007
03/2007, 04/2007 ... 01/2008, 02/2008

Well, that is what you are asking for. Assuming $k1 and $k2 are month
and year respectively then you are sorting your data by month and within
each month by year.
You could just reverse those two, sorting by year first and then within
each year by month.

Actually, no. Here is a sample of a data file:
"07/T1","A27","117"
"07/T1","D01","3"
"07/T1","EA27","30"
"07/T1","EF20","52"
....
"08/T5","V26","17"
"08/T5","W03","11"
"08/T5","W04","4"
"08/T5","W05","1"

Here is a sample of another data file:
1222413 G07 07/T2 07/RFA 07/T2
1247990 FH1 08/T4 08/RSP 08/T4
1094529 EARMY 05/T4 05/T4 05/T5 07/T1 07/RFA 07/T2
1247991 V24 08/T4 08/RSP 08/T4

As you can see, the 'date' values are atomic values and I don't have
any real need to split them.

Another solution would be to write a custom compare function. You will
have to pass the pair of year and month for each of $a and $b as those
are actually the number you want to sort. And then once you get that
sorted list just loop through it and print the corresponding values from
the data set.
I'd be interested in coding it but I'm not good enough to do it without
any testing, and since you didn't provide any self-contained program that
could be used as a test bed, that's not an option.
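A minimal sketch of such a compare function, assuming the keys really are "MM/YYYY" strings as in the original question. The sub name and the sample list are made up for illustration, and this is untested against the real data files.

```perl
use strict;
use warnings;

# Compare two "MM/YYYY" strings by year first, then by month.
# sort() sets the package variables $a and $b before each call.
sub by_year_then_month {
    my ($month_a, $year_a) = split m{/}, $a;
    my ($month_b, $year_b) = split m{/}, $b;
    return $year_a <=> $year_b || $month_a <=> $month_b;
}

my @periods = ('01/2007', '02/2007', '03/2008', '04/2006');
my @sorted  = sort by_year_then_month @periods;

print "@sorted\n";    # prints "04/2006 01/2007 02/2007 03/2008"
```

In a report loop this would be used as `for my $k1 (sort by_year_then_month keys %h) { ... }`, so the stored keys never have to be changed.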

I just posted a half-assed idea of a solution that would require
processing the file two more times to convert and unconvert this field
to something that would sort naturally. I would like to see a custom
compare function, and if you want, I can send you sample data files
(they contain no confidential or sensitive information) and a script
that I use to produce an OUTFILE. I've spent a non-trivial amount of
time thinking about it, and I can't see a solution.
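The convert/unconvert idea does not necessarily need extra passes over the file. It can be done in memory at sort time with the decorate-sort-undecorate idiom (often called a Schwartzian transform). The sketch below assumes "MM/YYYY" keys; `sortable_key` is a hypothetical helper, and any function that maps a key to a naturally sortable string could stand in for it.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "MM/YYYY" into "YYYYMM", which sorts
# correctly as a plain string.
sub sortable_key {
    my ($period) = @_;
    my ($month, $year) = split m{/}, $period;
    return sprintf '%04d%02d', $year, $month;
}

my @periods = ('03/2008', '01/2007', '04/2006', '02/2007');

my @sorted =
    map  { $_->[1] }                     # undecorate: keep the original value
    sort { $a->[0] cmp $b->[0] }         # compare the precomputed keys
    map  { [ sortable_key($_), $_ ] }    # decorate: pair each value with its key
    @periods;

print "@sorted\n";    # prints "04/2006 01/2007 02/2007 03/2008"
```

Each key is converted once rather than on every comparison, which may matter for files approaching 1M rows, and the data itself is left untouched.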

Yet another solution would be to change your data structure. Your HoHoA
has the granularity of the time spans reversed. Had you put year as the
top value, then your algorithm above would have worked naturally.

No, because of this:
Calendar year -
07/01, 07/02, 07/03, ...
Reporting year -
07/03, 07/04, ... 07/01, 07/02
Academic year -
07/01, 07/02 ... 08/04, 08/05

Please note that the Academic year crosses year boundaries, i.e., from
2007 to 2008, while the Reporting year crosses month boundaries, i.e.,
'03' starts the series and '01','02' ends the series.
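One way to sketch the wrap-around ordering is a comparator that ranks each term by its distance from the term that opens the year. This assumes "YY/NN" keys with five terms per year (01 through 05), which is a guess based on the T1..T5 values in the samples. `make_term_comparator` and `$TERMS_PER_YEAR` are names invented for this example.

```perl
use strict;
use warnings;

# Assumption: every year has terms 01..05. Adjust if that is wrong.
my $TERMS_PER_YEAR = 5;

# Build a comparator for "YY/NN" keys in which term $first_term opens
# each year and the earlier terms wrap around to the end.
sub make_term_comparator {
    my ($first_term) = @_;
    my $rank = sub {
        my ($year, $term) = split m{/}, $_[0];
        # Perl's % yields a non-negative result for a positive right operand.
        return ($year, ($term - $first_term) % $TERMS_PER_YEAR);
    };
    return sub {
        my ($year_a, $pos_a) = $rank->($a);
        my ($year_b, $pos_b) = $rank->($b);
        return $year_a <=> $year_b || $pos_a <=> $pos_b;
    };
}

my @keys = ('07/01', '07/03', '08/04', '07/02', '07/05', '08/03');

my $reporting = make_term_comparator(3);    # '03' starts the series
my $calendar  = make_term_comparator(1);    # '01' starts the series

my @by_reporting = sort $reporting @keys;
my @by_calendar  = sort $calendar  @keys;

print "@by_reporting\n";    # prints "07/03 07/05 07/01 07/02 08/03 08/04"
print "@by_calendar\n";     # prints "07/01 07/02 07/03 07/05 08/03 08/04"
```

Because the start term is a parameter, the same stored keys can be reported in either order without rebuilding the hash.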


Dah, did you even read the man page for sort()? That's what the first
argument of sort() is all about!

Actually, no, I didn't. I know that sort can take a function as an
argument, but I was focused on the algorithm, not the implementation.
But I'm headed that way now.

CC


