Re: Heuristic to distinguish text and code
- From: Phil <spam_from_usenet08@xxxxxxxxxxxx>
- Date: Sun, 08 Jun 2008 19:58:44 +0100
Phil wrote:
Dear Experts,
I'm trying to implement a function that can tell whether a paragraph is text or code. For example, think of formatting messages posted to this and similar newsgroups, or "wiki language" processors, or reStructuredText processors, etc.
My first attempt is based on a simple character-frequency analysis. I have measured character frequencies in small corpora of text and code; then for each paragraph I determine the correlation between its character frequencies and each of those two references. To measure the correlation I'm using the algorithm given at the bottom of this Wikipedia page: http://en.wikipedia.org/wiki/Correlation
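For anyone following along, here is a rough sketch of that character-frequency approach. It is not the code I actually ran: the function names are made up for illustration, and it uses the plain two-pass Pearson formula rather than the single-pass algorithm from the Wikipedia page. The reference frequency tables are assumed to have been built beforehand from the text and code corpora.

```python
import math
from collections import Counter

def char_frequencies(text):
    """Relative frequency of each character in text (hypothetical helper)."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}

def pearson(xs, ys):
    """Two-pass Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n == 0:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def correlate(freq_a, freq_b):
    """Correlate two frequency tables over the union of their characters."""
    keys = sorted(set(freq_a) | set(freq_b))
    return pearson([freq_a.get(k, 0.0) for k in keys],
                   [freq_b.get(k, 0.0) for k in keys])

def classify(paragraph, text_ref, code_ref):
    """Return "text" or "code", whichever reference correlates better."""
    freq = char_frequencies(paragraph)
    if correlate(freq, text_ref) >= correlate(freq, code_ref):
        return "text"
    return "code"
```

Ties go to "text" here, which is an arbitrary choice.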
This probably gets the right answer about 80% of the time, which is not bad but not good enough. So, can anyone suggest any improvements?
My next plan is as follows:
- Classify each character as either letter, digit, punctuation or whitespace.
- Count N-graphs of these character classes for N up to quite large values (5?), e.g. the 5-graph letter-letter-letter-punctuation-whitespace occurs 3 times.
- Test whether this distribution correlates better with a reference distribution for code or for text.
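The first two steps of this plan might look roughly like the sketch below. Nothing here is tested against real data: the single-letter class codes (L, D, P, W) and the function names are just my illustration, and anything that is not a letter, digit or whitespace is lumped in with punctuation.

```python
from collections import Counter

def char_class(ch):
    """Map a character to one of four classes: Letter, Digit, Whitespace, Punctuation."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "D"
    if ch.isspace():
        return "W"
    return "P"  # everything else is treated as punctuation

def class_ngrams(text, max_n=5):
    """Count every run of 1..max_n consecutive character classes in text."""
    classes = "".join(char_class(c) for c in text)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(classes) - n + 1):
            counts[classes[i:i + n]] += 1
    return counts

# "abc, d" has the class string "LLLPWL", so the 5-graph
# letter-letter-letter-punctuation-whitespace ("LLLPW") is counted once.
counts = class_ngrams("abc, d")
```

The counts would then need normalising before being compared with reference distributions built the same way from known text and known code.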
The idea here is that it's the pattern of letters and punctuation that matters, not what the actual letters are. Counting the individual letters may help me to determine whether the text is English or French, or whether the variable names in the code are English or French, or whether the code was using lots of functions from a library with some Frob_ prefix; but that's not important to the "code or text?" question. On the other hand, knowing that in text you frequently see letter-punctuation-whitespace but rarely punctuation-punctuation-punctuation might be a good classifier.
I'm also considering, as CBFalconer suggested, stripping comments at the start (but leaving the actual comment characters). This removes the problem of recognising comment content as text. However, it does need language-specific rules and, for example, the shell # comment rule would strip C preprocessor directives. Maybe that doesn't matter.
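A deliberately naive sketch of what such comment stripping could look like, assuming regular expressions are good enough for a first attempt. The function names are hypothetical, and it knows nothing about string literals, so a # or // inside a string (or a URL) would be mangled too.

```python
import re

def strip_hash_comments(text):
    """Drop the content of shell-style comments but keep the '#' marker."""
    return re.sub(r"#.*", "#", text)

def strip_c_comments(text):
    """Drop the content of C-style comments but keep the comment markers."""
    # Block comments first, so a '//' inside /* ... */ is not treated separately.
    text = re.sub(r"/\*.*?\*/", "/**/", text, flags=re.DOTALL)
    return re.sub(r"//.*", "//", text)

# The shell rule applied to C source shows the problem mentioned above:
# the whole preprocessor directive is reduced to a bare '#'.
stripped = strip_hash_comments("#include <stdio.h>")
```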
I'll let you know how I get on.
Phil.