Re: FAQ 5.38 How do I select a random line from a file?



brian d foy <brian.d.foy@xxxxxxxxx> wrote:
In article <484adab2$0$30196$4c368faf@xxxxxxxxxxxxxx>, Dan Rumney
<danrumney@xxxxxxxxxxxx> wrote:

Peter J. Holzer wrote:
On 2008-06-04 19:03, PerlFAQ Server <brian@xxxxxxxxxxxxxx> wrote:
5.38: How do I select a random line from a file?

If your file is large, and the lines are roughly of equal length, it
is probably preferrable to just do a random seek into the file and
read the line you've hit (by searching backwards and forwards for
$/).

The problem with that is that the probability of selecting a line
becomes a function of that line's length.

The algorithm above ensures that the probability of selecting a line is
precisely 1/N where N is the total number of lines in the file.

From time to time, I think about how I would solve this problem if I
actually needed it for something important. That is, not as some
thought experiment or quote-for-the-sig thing.

Has anyone solved this for something non-trivial? Say, for something
with huge numbers of lines or large file sizes, or where you want to
choose several random lines?

The solution that I think about (but have never implemented), is some
sort of pre-indexing of line endings so you have a list of file
positions where all the lines line (or start, or whatever). When you
want a random line, you choose a random element from that list. Open
the file, seek, and read a line.

If space wasn't at a premium, and the file was static, I might just
determine the maximum line length and then pad out all lines to that length
(putting the padding after the \n, so they are really part of the next
line), and then encode that fixed length either in the code or in the file
name. Then seek to a random record. If the maximum line length was an
order of magnitude more than the typical, I might not do that.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
.



Relevant Pages

  • Re: advocacy advice
    ... > "brian d foy" wrote... ... someone in charge has to act like ... The things that people need to know to advocate Perl are the ...
    (comp.lang.perl.misc)
  • Re: Hyperlinks not opening
    ... "Brian Day" wrote: ... But if i hit 'refresh' it works fine. ... Open the RUN command and type the following: ... Again the RUN command and type: CMD Click OK on the Opening prompt ...
    (microsoft.public.windowsxp.general)
  • Re: simple ACD
    ... Think about and experiment with prompts HOML, and as Brian ... Required are KEY 00 ACD 4321 ... You can probably hit KEY 00 ... Don't change anything in the SCB until you know more ...
    (comp.dcom.sys.nortel)
  • Re: regexpressions help
    ... brian d foy? ... John Paul II? ... George W. Bush? ...
    (perl.beginners)
  • Re: OT I noticed!
    ... >> Gee I make one OT post and you hit the roof. ... >> these other OT poster in the balls Brian? ... > further than that in ham radio. ...
    (rec.radio.shortwave)