Re: Attempting to allocate already allocated array?



Gordon Sande wrote:
On 2011-08-01 21:44:06 -0300, duncan smith said:

Paul Anton Letnes wrote:
Den 01.08.11 19.26, skrev duncan smith:
Thanks to everyone who has helped me learn a bit of Fortran. The code
now works, and I've written a good enough Python wrapper. The surprising
thing is, approximate times,

Python - 44s
Fortran - 58s
Cython - 21s

Hi,

I see that your SUDA subroutine takes several 2D array arguments. If the python array is C contiguous (numpy arrays can be either Fortran or C contiguous under the hood, without you knowing it) you risk copying these arrays on entry to SUDA, just to get them on Fortran form. Try passing this argument to f2py when compiling and linking:
-DF2PY_REPORT_ON_ARRAY_COPY=1
That should give you a warning every time an array is copied. Whether it is bad for performance is a shot in the dark - but it's simple enough to check.


I'm still dumping the data to file from Python and reading the files to construct the arrays. But if I can get this to run quickly enough I'll generate a better wrapper, and bear that in mind. The arrays are not large though.

'arr_shape' is a vector containing an array shape, typically of length about 12.

'keys' contains a set of array indices (for the non-zero counts in a contingency table with dimensions given by arr_shape). This is less than 9000 for my reference problem. i.e. The contingency table total is 9000, but the number of unique keys is likely to be much lower. So this is maybe 12 X 8000.

'unique_inds' contains the column indices of the keys in 'keys' that correspond to counts of 1. There is actually about 5000 of these due to the sparseness of the contingency table.

The lookup and res tables are larger.

Duncan

Can you provide a readable and complete citation of what your real problem is?

It sounds like you might be searching for adjacent nonzero entries
in a 12-d space, if guesses at a variety of blanks. There are
good methods for that that are not based on depth first searchs by axis.
Algorithms matter and so far all you are doing is chasing code polishing.


The real underlying problem is taking records that are unique in a sample with respect to a given set of categorical variables, and assigning them (Bayesian) probabilities of being unique in the population on the same set of variables. e.g. An individual who is a 16 year old widow might be considered to be more likely to be unique in the population than a 20 year old divorcee. In practice there might be many unique records within a sample so the sort of prior information that would lead to the above inference cannot necessarily be easily incorporated. Statistical agencies (because this is about disclosure control) want more automatic methods.

Some people approach this by generating all the subsets of the given set of variables on which an individual is also unique. Due to an obvious transitive property this can be expressed in terms of the set of minimal subsets on which the individual is unique. An "a priori" style bfs search on a subset lattice can be used to identify the complete set of unique subsets. Alternatives, such as dfs, can be used to identify the minimal unique subsets. It could be called "unique itemset mining".

There are references and some papers on David Haglin's page, http://mavdisk.mnsu.edu/haglin/

How this information is converted to probabilities of population uniqueness is a bit questionable. Haglin and others concentrate on the unique itemset mining problem. I want to show that there are better ways of solving the real underlying statistical disclosure problem. To do that I need to compare the SUDA (Special Uniques Detection Algorithm) approach, of which the unique itemset mining is a part, with an alternative. For that I just need something that solves the itemset mining problem in reasonable time.

I was going to e-mail David and get his C++ code, but it was the weekend so I thought I'd just hack something together in Python that would do the job. I came up with a reasonable looking algorithm and coded it. The thing was, it was only about 20-25 times slower than the times reported for the best current algorithm coded in C++. That is why I got side-tracked and decided to code it in Fortran. I suspect it will be quicker for the particular problem that Haglin et al. have considered in their papers, if I can implement it in a language with execution speed comparable to C++. Obviously, when my Fortran version is slower than my Python version, then I suspect the Fortran version can be made quicker.

So the problem really is about implementing a specific algorithm efficiently, rather than improving the algorithm (although I can see a couple of things that might lead to improvements). If it's not quicker than Haglin et al.'s (for at least some problems), then that's OK. But before I abandon it I want to give it a chance, and I'd rather try to iron out the bottlenecks in my Fortran before implementing it in, say, C. I hope the problem's clearer now. Cheers.

Duncan
.



Relevant Pages

  • Re: Efficient binary search tree stored in a flat array?
    ... speed difference between C and Python means the constant for insertion ... DA> such an algorithm. ... in the array and allocate new elements in the array. ... Note the insert in Python, for example, is quite cache-friendly. ...
    (comp.lang.python)
  • Re: Line algorithim for python and numeric
    ... array along a line. ... You can use something like Bresenham's algorithm to march either "over" or "up and over" to just pick out the "one point at each step that falls closest to the actual line" along the run. ... There's also the Wu Anti-aliasing line algorithm which takes something akin to Bresenham's algorithm and then samples the potential points to yield an "anti-aliased" result which averages the two potential data-points (traditionally colors, but they could really be any average-able data values). ... I'm not sure I've seen either of them implemented in Python before ...
    (comp.lang.python)
  • Re: Attempting to allocate already allocated array?
    ... and I've written a good enough Python wrapper. ... If the python array is C contiguous (numpy arrays can be either Fortran or C contiguous under the hood, without you knowing it) you risk copying these arrays on entry to SUDA, just to get them on Fortran form. ... The contingency table total is 9000, but the number of unique keys is likely to be much lower. ...
    (comp.lang.fortran)
  • Re: Line algorithim for python and numeric
    ... array along a line. ... Anti-aliasing line algorithm which takes something akin to Bresenham's ... I'm not sure I've seen either of them implemented in Python before ...
    (comp.lang.python)
  • Re: Attempting to allocate already allocated array?
    ... and I've written a good enough Python wrapper. ... If the python array is C contiguous (numpy arrays can be either Fortran or C contiguous under the hood, without you knowing it) you risk copying these arrays on entry to SUDA, just to get them on Fortran form. ... The contingency table total is 9000, but the number of unique keys is likely to be much lower. ...
    (comp.lang.fortran)