Re: Searching a disk-backed Map



Stefan Ram wrote:
This should be a common need. Yet I am not aware of anything
like it in Java SE. What is the most common (pure Java)
solution to it?

I would like to have an implementation of java.util.Map,
which is constructed with an int »m« and a java.io.File »f«.

It will use no more than »m« bytes of memory, but »swap« out
(the least often used) entries to the file »f«, when they do
not fit into the given memory size anymore.

Not too long ago I wound up needing to implement a program that would store and retrieve large numbers of images, in an environment that precluded the use of a DBMS (which would've been overkill anyway).

I basically created an on-disk HashMap quite simply: images were retrieved into a BufferedImage from a stream, to which a suitable DigestInputStream decorator had been attached before wrapping in ImageInputStream. Out comes the SHA-1 hash of the image, as well as the image itself.

Then the SHA-1 is used to construct a path and filename, mkdirs is used on the path, and one ImageIO.write later, it's stored.

Retrieval was equally simple, since the SHA-1s were then used as the unique identifiers for the images throughout the program and could easily be turned into a path and filename for use this time with ImageIO.read. Delete was equally simple. The only tricky part of CRUD for this prog was the "U" part, since altering an image would alter its hash, and thus its ID. (Sometimes ID had to be determined from a copy of the image, to avoid duplications that could waste many gigabytes.) Fortunately, alterations were much rarer than storage and retrieval, and it wasn't hard to have a higher level of abstraction that had its own identifiers, and that mapped to the SHA-1 that might have changed.

FWIW, I've also implemented a disk-based deque supporting O(1) additions and removals at both ends (using sequentially numbered files as a backing "DiskArrayList", each file holding up to N entries). The constant factor overhead is large, since a linear search of up to N entries may be needed to work with the last entry in the queue, which is the last entry in the last file, and since I/O is much slower than RAM anyway. This disk-deque was needed for some application that had to store potentially millions of items and had to run where there wasn't much RAM, but was plenty of disk. Speed of access to the deque was of little concern because the tasks done would have been I/O bound anyway, and saving tens of megabytes of RAM usage at run-time was worth it. In fact, the application needed many of these deques, potentially running into gigs of total memory use, so the only two alternatives to having the deques themselves live on disk was having them be serialized to disk and loaded on demand (which would make things much slower -- reading a 30 meg file to do a few things and then writing a 30 meg file is a lot slower than reading a few-KB file to do a few things and then writing a few-KB file!) or having them all in RAM and letting the system page like crazy (equally slow, and would make the UI unresponsive at times). Acceptable UI responsiveness under these circumstances dictated the need for a disk-based data structure. (Think "kiosk".)

The disk-deque had some notion of transactions and commits, too, mainly to maintain data integrity; if the power went out in mid-operation the deque wouldn't be corrupted, though some changes to it might not have been saved. The disk-hashmap-of-images had a weak version that saved the image to a temp file and then renamed it when it was done, so the image appearing in the hashmap under its hash would happen atomically as far as any external process was concerned. A power failure in mid-save would result in the image not having been added, rather than a truncated version being in the map in the hash bucket for the un-truncated version. (Eeeuw.) So in both cases operations would have to be retried if there was an inconvenient crash or power failure but not corruption of data; a weak protection of data integrity and some weak atomicity.

When you mission-critically need full ACID, though, you need a proper database engine, even with all the added complexity and overhead that implies.
.



Relevant Pages

  • Re: CF Web Reference
    ... I'm writing a compact framework app. ... retrieved into RAM before I can save it to my external memory card. ... can retrieve the binary data directly to a FileStream instead of into ...
    (microsoft.public.dotnet.framework.compactframework)
  • HW Serial ID
    ... Is there a way in CE 5 to retrieve the serial code of some components, in particular (RAM, HD)... ... I don't know if I have been clear enaugh, I mean the unique ID associated to a component which can identify that specific component by itself. ...
    (microsoft.public.windowsce.embedded.vc)
  • HW Serial ID
    ... Is there a way in CE 5 to retrieve the serial code of some components, in particular (RAM, HD)... ... I don't know if I have been clear enaugh, I mean the unique ID associated to a component which can identify that specific component by itself. ...
    (microsoft.public.windowsce.embedded)
  • HW Serial ID
    ... Is there a way in CE 5 to retrieve the serial code of some components, in particular (RAM, HD)... ... I don't know if I have been clear enaugh, I mean the unique ID associated to a component which can identify that specific component by itself. ...
    (microsoft.public.windowsce.platbuilder)