Re: Heuristic to distinguish text and code



Phil wrote:
Dear Experts,

I'm trying to implement a function that can tell whether a paragraph is text or code. For example, think of formatting messages posted to this and similar newsgroups, or "wiki language" processors, or reStructuredText processors, etc.

My first attempt is based on a simple character-frequency analysis. I have measured character frequencies in small corpora of text and code; then for each paragraph I determine the correlation between its character frequencies and each of those two references. To measure the correlation I'm using the algorithm given at the bottom of this Wikipedia page: http://en.wikipedia.org/wiki/Correlation
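For anyone who wants to try the same thing, here is a minimal sketch of the character-frequency approach in Python. It uses the plain two-pass Pearson formula, which is not necessarily the same algorithm as the one on the Wikipedia page, and the names char_freqs, pearson and classify are made up for illustration. The reference distributions are assumed to be built by running char_freqs over your own text and code corpora.

```python
from collections import Counter
from math import sqrt

def char_freqs(text):
    """Relative frequency of each character in text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}

def pearson(a, b):
    """Pearson correlation of two frequency dicts over the union of their keys."""
    keys = sorted(set(a) | set(b))
    n = len(keys)
    if n < 2:
        return 0.0
    xs = [a.get(k, 0.0) for k in keys]
    ys = [b.get(k, 0.0) for k in keys]
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat distribution has no defined correlation
    return cov / sqrt(vx * vy)

def classify(paragraph, text_ref, code_ref):
    """Return "code" or "text", whichever reference correlates better."""
    f = char_freqs(paragraph)
    return "code" if pearson(f, code_ref) > pearson(f, text_ref) else "text"
```

Usage would be something like classify(paragraph, char_freqs(text_corpus), char_freqs(code_corpus)). Ties go to "text" here, which is an arbitrary choice.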

This probably gets the right answer about 80% of the time, which is not bad but not good enough. So, can anyone suggest any improvements?

My next plan is as follows:
- Classify each character as either letter, digit, punctuation or whitespace.
- Count N-graphs for N up to quite large values (5?), e.g. the 5-graph letter-letter-letter-punctuation-whitespace occurs 3 times.
- Test whether this distribution correlates better with a reference distribution for code or for text.
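The counting step of that plan might look roughly like the sketch below. The single-letter class codes (L, D, P, W) and the function names are my own shorthand, and anything that is not a letter, digit or whitespace is lumped in with punctuation, underscores included. The resulting counts would still need to be normalised and compared against reference distributions with whatever correlation measure you prefer.

```python
from collections import Counter

def char_class(ch):
    """Map a character to L(etter), D(igit), W(hitespace) or P(unctuation)."""
    if ch.isalpha():
        return "L"
    if ch.isdigit():
        return "D"
    if ch.isspace():
        return "W"
    return "P"  # everything else, including '_' and other symbols

def class_ngrams(text, max_n=5):
    """Count every class N-graph of length 1..max_n in text."""
    classes = "".join(char_class(c) for c in text)
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(classes) - n + 1):
            counts[classes[i:i + n]] += 1
    return counts
```

For example, "foo, bar" maps to the class string LLLPWLLL, so the 5-graph LLLPW (letter-letter-letter-punctuation-whitespace) is counted once.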

The idea here is that it's the pattern of letters and punctuation that matters, not what the actual letters are. Counting the individual letters may help me to determine whether the text is English or French, or whether the variable names in the code are English or French, or whether the code was using lots of functions from a library with some Frob_ prefix; but that's not important to the "code or text?" question. On the other hand, knowing that in text you frequently see letter-punctuation-whitespace but rarely punctuation-punctuation-punctuation might be a good classifier.

I'm also considering, as CBFalconer suggested, stripping comment content as a first step (but leaving the actual comment characters in place). This removes the problem of recognising comment content as text. However, it does need language-specific rules and, for example, the shell # comment rule would strip C preprocessor directives. Maybe that doesn't matter.
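A rough sketch of the comment-stripping step, assuming only C-style /* */ and // comments plus an optional shell-style # rule. It is deliberately naive: it knows nothing about string literals, so a "//" inside a quoted string would be treated as a comment. The function name and the hash_comments flag are invented for this example.

```python
import re

# Matches a C block comment or a C++-style line comment, whichever starts first.
_C_COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
# Matches a shell-style comment running to the end of the line.
_HASH_COMMENT = re.compile(r"#[^\n]*")

def _keep_delims(match):
    """Replace a comment with just its delimiters."""
    return "/**/" if match.group(0).startswith("/*") else "//"

def strip_comment_bodies(src, hash_comments=False):
    """Remove comment content but leave the comment characters in place."""
    src = _C_COMMENT.sub(_keep_delims, src)
    if hash_comments:
        src = _HASH_COMMENT.sub("#", src)
    return src
```

With hash_comments=True, a line such as "#include <stdio.h>" collapses to a bare "#", which is exactly the preprocessor problem mentioned above.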

I'll let you know how I get on.


Phil.


