Re: Regexp: \m and [^[:alnum:]_] are not equivalent
- From: "Francois Vogel" <fsvogelnew5NOSPAM@xxxxxxx>
- Date: Thu, 8 Sep 2005 02:16:01 +0200
>> 3. a "word" boundary is a location where on one side you have any
>> and on the other side
>> you have a character included in set A
>
> I don't believe you. According to your example, words may not begin
> or end at the boundary of the string (no "character not in set A" at
> the boundary), so your example contains *zero* words, by rule 3.
I don't understand why you don't believe me. "Words" may begin or end at
the beginning or end of a string.
I said:
>> character not in set A (or string start/stop),
Therefore my example sounds correct to me: it contains three of my "words".
But this isn't of big importance I think. As I said, the text in which I
want to search is placed in a text widget. Beginning or end of the part of
the text in which I search is given by the start and stop index of the .t
search command. If my problem is solved for all the cases but the start and
stop, it will be a great step forward. I would look at such cases
afterwards.
> Which indicates you have not thought clearly about your problem,
> rather than merely not describing it clearly.
That is still possible, however. I tried to explain my thinking as clearly
as I could, but I might have failed.
Another way of describing my problem is to say that I want to mimic the
"Match whole word" search option of certain word processors (e.g. MS Word
offers such an option).
>> 4. a "word" can be composed of any number of characters
>> 5. the characters in a "word" can appear in any order, at any position
>> inside the "word", and any number of times
>
> Which is an entirely different problem from what you asked before with
> "%foo" versus "bar%foo".
Correct.
> Why are you still using \m after the previous discussion?
I gave my \m example (which partially worked) so that you understand the
history of why I'm looking for something else.
> You had
> an explicit set of characters before. Now you say a word can contain
> *ANY* character, so you have lost the dichotomy between word-characters
> and non-word characters, and searching whole words doesn't make sense.
Searching for whole words makes sense if you consider that a user might
enter, let's say, %f oo$#ba r in the "Search what" entry box and ask for
whole-word matches only. I want to find this exact string in the text
contained in my text widget, and I want only "whole word" matches; in other
words, this exact text must be bounded by what I called "word boundaries"
(rule 3).
Another example. Consider the 5 lines of text below, as if they had been
entered in a text widget:
This is the first line of text
Download antivirus software, firewalls, spyware
removal tools, and %f oo$#ba r more to improve the security
of your %f oo$#ba ranything and to help keep it running smoothly
This is a test of a ?something%f oo$#ba r text widget
I want to have a regexp that, given that the user's input is %f oo$#ba r
(13 characters, first is %, last is r, contains both set A chars, and
non-set A chars), will match in line 3 but not in lines 4 or 5. At first
sight this should have been \m%f oo\$#ba r\M but as we now all know this
does not work. My boundaries are different from Tcl word boundaries, and
that's why \m cannot be used. Therefore I'm looking for a regexp that will
replace \m and \M taking into account my definition of what a boundary is.
I don't see any contradiction between the rules I gave in my previous post
and my examples.
The user can enter anything he wants, be it "word" characters (i.e. in what
I called set A), or not. But the matching process should only match at
boundaries, i.e. rule 3 must apply.
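
The kind of replacement for \m and \M asked for here could be sketched as follows. This is only a sketch under stated assumptions: set A is taken to be letters, digits and underscore (as the subject line suggests), and a "whole word" match is read as "not preceded and not followed by a set A character", which is what the five-line example appears to require. The proc name wholeWordPattern is made up for illustration. Since Tcl regexps offer lookahead but no lookbehind, the left-hand boundary character is consumed by the match, and the user's text is returned as the first subexpression.

```tcl
# Sketch only. Assumption: set A = letters, digits and underscore.
# wholeWordPattern is a hypothetical helper, not part of Tcl.
proc wholeWordPattern {userText} {
    # Make the user's text literal: backslash-escape every non-set-A char.
    regsub -all {[^[:alnum:]_]} $userText {\\&} literal
    set notA {[^[:alnum:]_]}
    # Left side: start of string or one non-set-A character (consumed).
    # Right side: lookahead for a non-set-A character or end of string.
    return "(?:^|$notA)($literal)(?=$notA|\$)"
}

set lines {
    {This is the first line of text}
    {Download antivirus software, firewalls, spyware}
    {removal tools, and %f oo$#ba r more to improve the security}
    {of your %f oo$#ba ranything and to help keep it running smoothly}
    {This is a test of a ?something%f oo$#ba r text widget}
}

set re [wholeWordPattern {%f oo$#ba r}]
foreach line $lines {
    # Print 1 or 0 for each line, followed by the line itself.
    puts "[regexp -- $re $line]: $line"
}
```

With this reading, only the "removal tools" line is intended to match. When the pattern is used on a text widget, the index reported for a match may point at the consumed boundary character rather than at the first character of the user's text, so it may need to be adjusted by one character.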
> If you mainly want to anchor a user's string to follow particular
> characters,
> then see my response from yesterday.
I don't think that this is exactly what I'm looking for.
> You should be sanitizing the user's input before feeding it to regexp!
If by sanitizing you mean escaping the characters typed by the user that
have a special meaning in a regexp, I agree with you. I'll have to double
the backslashes typed by the user, change $ into \$, and so on. But I must
not change the user's input any more than this.
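
The escaping step described above might look like the following. It is a minimal sketch, not necessarily what was finally used: instead of listing each special character, it puts a backslash in front of every character that is not a letter, digit or underscore, which Tcl's regexp syntax treats as "match this character literally". The proc name quoteRegexp is hypothetical.

```tcl
# Hypothetical helper: backslash-escape everything that is not a letter,
# digit or underscore, so that regexp treats the user's text literally.
proc quoteRegexp {text} {
    regsub -all {[^[:alnum:]_]} $text {\\&} quoted
    return $quoted
}

set userInput {%f oo$#ba r}
set quoted [quoteRegexp $userInput]

# Every non-word character in the pattern now carries a backslash.
puts $quoted

# The quoted pattern matches the user's text as typed, nothing else.
puts [regexp -- $quoted "some text $userInput more text"]
```

The user's text itself is left untouched; only the pattern built from it is escaped.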
Hope it is a bit clearer, now.
Anyway, thanks, and sorry for the headache I am no doubt giving you ;-)
Francois