Re: Regexp: \m and [^[:alnum:]_] are not equivalent
- From: "Ramon Ribó" <ramsan@xxxxxxxxxxxx>
- Date: Thu, 8 Sep 2005 11:04:42 +0200
Hello,
Sanitizing arbitrary user input for regexp is not simple at all. The
obvious solution to your problem is to drop the regexp and split the
search into separate steps:
- Find a candidate string with the standard search capabilities of the
text widget (no regexp).
- Check the characters before and after the candidate to see if they match
your criteria.
- Treat the special case of your candidate string being at the start or end
of the text.
Trying to compact code into fewer lines than it naturally requires causes
great pain.
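The steps above can be sketched in plain Tcl, working on a string rather than
on the text widget itself. This is only a minimal sketch under stated
assumptions: the proc names isWordChar and wholeWordMatches are made up for
illustration, "set A" is assumed to be [[:alnum:]_] as in the subject line,
and a candidate is assumed to be acceptable when the characters just outside
it are not in set A (or do not exist, at the start or end of the text).

```tcl
# Hypothetical helper: 1 if $ch is a single "set A" character, else 0.
# Assumption: set A is [[:alnum:]_], as in the subject of this thread.
proc isWordChar {ch} {
    return [regexp {^[[:alnum:]_]$} $ch]
}

# Hypothetical helper: return the start indices of every occurrence of
# $needle in $text whose neighbouring characters are not in set A.
# The user's input is never interpreted as a regexp, so it needs no escaping.
proc wholeWordMatches {text needle} {
    set result {}
    set len [string length $needle]
    if {$len == 0} {
        return $result
    }
    set start 0
    # Step 1: find candidates with a plain substring search.
    while {[set i [string first $needle $text $start]] >= 0} {
        # Step 2: look at the characters just outside the candidate.
        # string index returns "" for an out-of-range index, which
        # covers the start-of-text and end-of-text special cases.
        set before [string index $text [expr {$i - 1}]]
        set after  [string index $text [expr {$i + $len}]]
        if {($before eq "" || ![isWordChar $before]) &&
            ($after eq "" || ![isWordChar $after])} {
            lappend result $i
        }
        set start [expr {$i + 1}]
    }
    return $result
}

# Usage with the example lines quoted below (needle as shown in the post).
set needle {%f oo$#ba r}
puts [wholeWordMatches {removal tools, and %f oo$#ba r more to improve the security} $needle]
puts [wholeWordMatches {of your %f oo$#ba ranything and to help keep it running smoothly} $needle]
puts [wholeWordMatches {This is a test of a ?something%f oo$#ba r text widget} $needle]
```

With these assumptions the first call returns a non-empty list and the other
two return empty lists. In a text widget the same idea would presumably use
the widget's own exact search to find candidates and then inspect the
characters on either side of each hit.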
Regards,
Ramon Ribó
"Francois Vogel" <fsvogelnew5NOSPAM@xxxxxxx> escribió en el mensaje
news:431f82cf$0$32602$626a14ce@xxxxxxxxxxxxxxx
>>> 3. a "word" boundary is a location where on one side you have any
>>> character not in set A (or string start/stop), and on the other side
>>> you have a character included in set A
>>
>> I don't believe you. According to your example, words may not begin
>> or end at the boundary of the string (no "character not in set A" at
>> the boundary), so your example contains *zero* words, by rule 3.
>
> I don't understand why you don't believe me. "Words" may begin at the
> beginning or end of a string.
> I said:
>>> character not in set A (or string start/stop),
>
> Therefore my example sounds correct to me: it contains three of my
> "words".
>
> But this isn't of big importance I think. As I said, the text in which I
> want to search is placed in a text widget. Beginning or end of the part of
> the text in which I search is given by the start and stop index of the .t
> search command. If my problem is solved for all the cases but the start
> and stop, it will be a great step forward. I would look at such cases
> afterwards.
>
>
>> Which indicates you have not thought clearly about your problem,
>> rather than merely not describing it clearly.
>
> That is still possible, however. I tried to explain my thinking as clearly
> as I could, but I may have failed.
> Another way of describing my problem is to say that I want to mimic the
> "Match whole word" search option of certain word processors (e.g. MS Word
> offers such an option).
>
>
>>> 4. a "word" can be composed of any number of characters
>>> 5. the characters in a "word" can appear in any order, at any position
>>> inside the "word", and any number of times
>>
>> Which is an entirely different problem from what you asked before with
>> "%foo" versus "bar%foo".
>
> Correct.
>
>
>> Why are you still using \m after the previous discussion?
>
> I gave my \m example (which partially worked) so that you understand the
> history of why I'm looking for something else.
>
>
>> You had
>> an explicit set of characters before. Now you say a word can contain
>> *ANY* character, so you have lost the dichotomy between word-characters
>> and non-word characters, and searching whole words doesn't make sense.
>
> Searching for whole words makes sense if you consider that a user might
> enter, let's say, %f oo$#ba r in the "Search what" entry box, and ask for
> whole word matches only. I want to find this exact string in the text
> contained in my text widget, and I want only "whole word" matches. In
> other words, this exact text must be bounded by what I called "word
> boundaries" (rule 3).
>
> Another example. Consider the 5 lines of text below, as if they had been
> entered in a text widget:
> This is the first line of text
> Download antivirus software, firewalls, spyware
> removal tools, and %f oo$#ba r more to improve the security
> of your %f oo$#ba ranything and to help keep it running smoothly
> This is a test of a ?something%f oo$#ba r text widget
>
> I want to have a regexp that, given that the user's input is %f oo$#ba r
> (13 characters, first is %, last is r, contains both set A chars, and
> non-set A chars), will match in line 3 but not in lines 4 or 5. At first
> sight this should have been \m%f oo\$#ba r\M but as we now all know this
> does not work. My boundaries are different from Tcl word boundaries, and
> that's why \m cannot be used. Therefore I'm looking for a regexp that will
> replace \m and \M taking into account my definition of what a boundary is.
>
> I don't see any contradiction between the rules I gave in my previous post
> and my examples.
> The user can enter anything he wants, be it "word" characters (i.e. in
> what I called set A), or not. But the matching process should only match
> at boundaries, i.e. rule 3 must apply.
>
>
>> If you mainly want to anchor a user's string to follow particular
>> characters,
>> then see my response from yesterday.
>
> I don't think that this is exactly what I'm looking for.
>
>
>> You should be sanitizing the user's input before feeding it to regexp!
>
> If by sanitizing you mean escaping the characters typed by the user that
> have a special meaning in a regexp, I agree with you. I'll have to double
> the backslashes typed by the user, change $ into \$ and so on. But I must
> not change the user's input any more than this.
>
> I hope it is a bit clearer now.
> Anyway, thanks for putting up with the headache I am no doubt giving you ;-)
> Francois
>
>