Re: Text fingerprinting



Sumedh wrote:

> The problem that I have is as follows:
> We would like to find similarity for text that has been copied from a
> source. However, comparing the whole text would not be feasible (so
> string matching algorithms are not useful) and so we would like to
> generate some kind of a fingerprint for the texts which can be compared
> against the stored corpus of fingerprints to detect copying.

You need to define a theory for what you are trying to do. I presume
that the fingerprint is designed to be smaller than the original
text. What aspect of the original do you expect to be preserved in
the fingerprint? When people classify text, they might list subject,
author, fiction/non-fiction, length, creation date, etc. Are you
intending to identify a writing style -- level of formality,
regionalisms, etc.?

I have heard of software that compares texts for high correlation
(i.e. text copied with minor modifications), but that requires the
complete text. There is also style analysis, which helps to identify
an author. If the style of a written piece doesn't match that of the
purported author, you have reason to suspect either copying or
ghost-writing, but that doesn't identify the source (unless the style
matches the stored style for the source author).
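
To make the question of "what is preserved" concrete, here is a minimal
Python sketch of one possible fingerprint, assuming the aspect you want
to keep is shared word sequences. It is not taken from any particular
tool. The function names, the sequence length k, the sampling modulus
keep, and the use of SHA-1 are all assumptions chosen for illustration.

```python
import hashlib
import re


def shingles(text, k=5):
    """Return the set of k-word sequences in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=5, keep=4):
    """Hash every k-word sequence and keep only hashes divisible by `keep`.

    The result is a set of integers, roughly 1/keep the size of the full
    shingle set. k and keep are illustrative assumptions, not tuned values.
    """
    hashes = set()
    for s in shingles(text, k):
        digest = hashlib.sha1(s.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        if h % keep == 0:
            hashes.add(h)
    return hashes


def resemblance(fp_a, fp_b):
    """Jaccard similarity of two fingerprints: shared hashes / all hashes."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

Because the keep/drop decision depends only on the hash of the word
sequence, a passage copied into another document keeps or drops the same
sequences in both places, so the stored fingerprints can be compared
without the complete texts. A scheme like this would only catch
verbatim or lightly edited copying. It preserves nothing about subject,
author, or style.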

--
Thad


