Re: Emoticon text parser



On 21 Mar., 10:21, Jussi Piitulainen <jpiit...@xxxxxxxxxxxxxxxx> wrote:
Karsten Wutzke writes:
Here are the possible strings applying to each position:

hair = {"o", "O", ">", "}", "]", ")"} <-- hair optional!
eyes = {":", ";", "8"}
subeyes = {"'", ","} <-- subeyes optional!
nose = {"-"} <-- nose optional!
mouth = {")", "(", "s", "S", "d", "D",
"p", "P", "c", "C", "o", "O",
"#", "@", "*", "$", "|",
"))", "(("}
beard = {"="} <-- beard optional!

That is very close to a regular expression already. It's as if you
are spelling out the meaning of such an expression here.

Most of these are character sets. The exceptions are the two
two-character mouths, so mouth must be partly an alternation.

hair = [oO>}\])]? "]" must be escaped
eyes = [:;8] no problem
subeyes = [',]?
nose = -?

mouth = (?:[sSdDpPcCoO#@*$|]|\)\)?|\(\(?)

This is [...] | one or two of ) | one or two of (,
parentheses need escaping, and I've wrapped it all
in (?: ) to make it a non-capturing group.

beard = =?

Put it all together, in a string, which requires doubling the escapes:
"[oO>}\\])]?[:;8][',]?-?(?:[sSdDpPcCoO#@*$|]|\\)\\)?|\\(\\(?)=?". Ouch.
It does look ugly.

We can ease the pain with the COMMENTS flag of Pattern; the comment
character # must then be escaped, and each comment ends at the end of
its line. Let's make it CASE_INSENSITIVE too.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Test {
    public static void main(String[] args) {
        Pattern p =
            Pattern.compile
            ("[o>}\\])]?      # hair, optional     \n" +
             "[:;8]           # eyes               \n" +
             "[',]?           # subeyes, optional  \n" +
             "-?              # nose, optional     \n" +
             "(?: [sdpco\\#@*$|]                     " +
             "  | \\)\\)?                            " +
             "  | \\(\\(? )   # mouth              \n" +
             "=?              # beard, optional    \n",
             Pattern.COMMENTS | Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(args[0]);
        while (m.find()) {
            System.out.println("Found " + m.group() + " at " +
                               m.start() + " to " + m.end());
        }
    }
}

That's about the best I can do.

And it is great! It works like a charm and even seems to be fast as
lightning... I also split up the subcomponents into several strings,
as Christian suggested, instead of using the comments flag. I suppose
that flag was made for loading (commented) patterns from files on disk.
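
Roughly what that split looks like, as a minimal sketch only. The class and constant names (EmoticonParts, HAIR, EYES, ...) are made up for illustration and are not taken from Christian's post. Without the COMMENTS flag the # inside the character class needs no escape:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmoticonParts {
    // One string per face component (hypothetical names).
    static final String HAIR    = "[o>}\\])]?";   // optional
    static final String EYES    = "[:;8]";
    static final String SUBEYES = "[',]?";        // optional
    static final String NOSE    = "-?";           // optional
    static final String MOUTH   = "(?:[sdpco#@*$|]|\\)\\)?|\\(\\(?)";
    static final String BEARD   = "=?";           // optional

    // Concatenate the parts in face order and compile once.
    static final Pattern EMOTICON = Pattern.compile(
        HAIR + EYES + SUBEYES + NOSE + MOUTH + BEARD,
        Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Matcher m = EMOTICON.matcher("really happy :-D ATM");
        while (m.find()) {
            System.out.println("Found " + m.group() + " at " +
                               m.start() + " to " + m.end());
        }
    }
}
```

The named parts take over the job of the inline comments, at the cost of having to read the concatenation to see the whole expression.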

One question that remains is:

The pattern really just addresses strings that are *exactly* 2-7 chars
long. Do I understand correctly that there's no way to automatically
detect a pattern ":-)" in the string " :-)" or ":-) " or
" :-) " directly?

Do I always have to make a list of starting characters and then scan
for a 7 char string, then a 6 char one, a 5 char one... until maybe one
pattern matches?
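
For reference, a small sketch of how the two Matcher methods treat those inputs: Matcher.matches() succeeds only if the whole input is one emoticon, while Matcher.find(), which the quoted program uses, scans for the pattern anywhere in the input. The class and method names (FindVersusMatches, isEmoticon, firstEmoticon) are illustrative assumptions, and the pattern is the one quoted above written as a single string:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVersusMatches {
    // Same expression as in the quoted program, without COMMENTS mode.
    static final Pattern EMOTICON = Pattern.compile(
        "[o>}\\])]?[:;8][',]?-?(?:[sdpco#@*$|]|\\)\\)?|\\(\\(?)=?",
        Pattern.CASE_INSENSITIVE);

    // True only if the entire input is a single emoticon.
    static boolean isEmoticon(String s) {
        return EMOTICON.matcher(s).matches();
    }

    // First emoticon found anywhere in the input, or null if there is none.
    static String firstEmoticon(String s) {
        Matcher m = EMOTICON.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        for (String s : new String[] {":-)", " :-)", ":-) ", " :-) "}) {
            System.out.println("[" + s + "] matches=" + isEmoticon(s) +
                               " find=" + firstEmoticon(s));
        }
    }
}
```

If this reading is right, no manual scan over candidate lengths is needed: find() tries each start position itself, and the optional parts of the pattern take care of the 2-7 char range.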

Karsten

PS: I'm really really happy :-D ATM