Re: Regexp: \m and [^[:alnum:]_] are not equivalent



Hello,

Sanitizing arbitrary user input for use in a regexp is not simple at all.
The obvious solution to your problem is to split it into a two-step
process, plus one special case:

- Find a candidate string with the standard search capabilities of the
text (no regexp)
- Check the characters before and after the candidate to see if they match
your criteria
- Treat the special case of your candidate string being at the start or
end of the text.
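
The steps above can be sketched roughly as follows. This is only a minimal
sketch, not a definitive implementation: the proc names isWordChar and
findWholeWord are made up for illustration, and it assumes that "set A" is
letters, digits and underscore. It works on a plain string with
string first.

```tcl
# Hypothetical helper: is $ch a "word" character (a member of set A)?
# Assumption: set A is letters, digits and underscore.
# An empty string (start or end of the text) is not a word character.
proc isWordChar {ch} {
    return [regexp {^[[:alnum:]_]$} $ch]
}

# Hypothetical helper: return the list of indices in $text where $needle
# occurs as a "whole word", i.e. with no word character immediately
# before or after it.
proc findWholeWord {text needle} {
    set hits {}
    set len [string length $needle]
    if {$len == 0} {
        return $hits
    }
    set start 0
    # Step 1: find candidates with a plain (non-regexp) search.
    while {[set i [string first $needle $text $start]] >= 0} {
        # Step 2: look at the neighbouring characters. [string index]
        # returns an empty string outside the text, which covers the
        # start/end special case.
        set before [string index $text [expr {$i - 1}]]
        set after  [string index $text [expr {$i + $len}]]
        if {![isWordChar $before] && ![isWordChar $after]} {
            lappend hits $i
        }
        set start [expr {$i + 1}]
    }
    return $hits
}
```

With the third, fourth and fifth sample lines from the quoted message,
only the third line should produce a hit. In a text widget the candidate
step would presumably use ".t search -exact" and the neighbour check
".t get" on the adjacent indices instead of string first.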

Trying to squeeze code into fewer lines than it naturally requires causes
a great deal of pain.

Regards,

Ramon Ribó

"Francois Vogel" <fsvogelnew5NOSPAM@xxxxxxx> wrote in message
news:431f82cf$0$32602$626a14ce@xxxxxxxxxxxxxxx
>>> 3. a "word" boundary is a location where on one side you have any
>>> character not in set A (or string start/stop), and on the other side
>>> you have a character included in set A
>>
>> I don't believe you. According to your example, words may not begin
>> or end at the boundary of the string (no "character not in set A" at
>> the boundary), so your example contains *zero* words, by rule 3.
>
> I don't understand why you don't believe me. "Words" may begin or end at
> the beginning or end of a string.
> I said:
>>> character not in set A (or string start/stop),
>
> Therefore my example sounds correct to me: it contains three of my
> "words".
>
> But this isn't of big importance I think. As I said, the text in which I
> want to search is placed in a text widget. Beginning or end of the part of
> the text in which I search is given by the start and stop index of the .t
> search command. If my problem is solved for all the cases but the start
> and stop, it will be a great step forward. I would look at such cases
> afterwards.
>
>
>> Which indicates you have not thought clearly about your problem,
>> rather than merely not describing it clearly.
>
> This is however still possible. I tried to explain my thinking as clearly
> as I could but might have failed.
> Another way of describing my problem to you is to say that I want to mimic
> the "Match whole word" search option of certain word processors (e.g. MS
> Word offers such an option).
>
>
>>> 4. a "word" can be composed of any number of characters
>>> 5. the characters in a "word" can appear in any order, at any position
>>> inside the "word", and any number of times
>>
>> Which is an entirely different problem from what you asked before with
>> "%foo" versus "bar%foo".
>
> Correct.
>
>
>> Why are you still using \m after the previous discussion?
>
> I gave my \m example (that partially worked) so that you understand the
> history of why I'm looking for something else.
>
>
>> You had
>> an explicit set of characters before. Now you say a word can contain
>> *ANY* character, so you have lost the dichotomy between word-characters
>> and non-word characters, and searching whole words doesn't make sense.
>
> Searching for whole words makes sense if you consider that a user might
> enter, let's say, %f oo$#ba r in the "Search what" entry box, and ask for
> whole word matches only. I want to find this exact string in the text
> contained in my text widget, and I want only "whole word" matches, in
> other words, this exact text must be bounded by what I called "word
> boundaries" (rule 3).
>
> Another example. Consider the 5 lines of text below, as if they had been
> entered in a text widget:
> This is the first line of text
> Download antivirus software, firewalls, spyware
> removal tools, and %f oo$#ba r more to improve the security
> of your %f oo$#ba ranything and to help keep it running smoothly
> This is a test of a ?something%f oo$#ba r text widget
>
> I want to have a regexp that, given that the user's input is %f oo$#ba r
> (13 characters, first is %, last is r, contains both set A chars, and
> non-set A chars), will match in line 3 but not in lines 4 or 5. At first
> sight this should have been \m%f oo\$#ba r\M but as we now all know this
> does not work. My boundaries are different from Tcl word boundaries, and
> that's why \m cannot be used. Therefore I'm looking for a regexp that will
> replace \m and \M taking into account my definition of what a boundary is.
>
> I don't see any contradiction between the rules I gave in my previous post
> and my examples.
> The user can enter anything he wants, be it "word" characters (i.e. in
> what I called set A), or not. But the matching process should only match
> at boundaries, i.e. rule 3 must apply.
>
>
>> If you mainly want to anchor a user's string to follow particular
>> characters,
>> then see my response from yesterday.
>
> I don't think that this is exactly what I'm looking for.
>
>
>> You should be sanitizing the user's input before feeding it to regexp!
>
> If by sanitizing you mean escaping the characters typed by the user that
> have a special meaning in a regexp, I agree with you. I'll have to double
> the backslashes typed by the user, change $ into \$ and so on. But I must
> not change the user's input any more than this.
>
> Hope it is a bit clearer now.
> Anyway, thanks for putting up with the headache I am no doubt giving you ;-)
> Francois
>
>
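
For comparison, the escaping discussed in the quoted text and a
replacement for \m and \M could be sketched in a single regexp like this.
It is a sketch under assumptions, not a definitive answer: the proc names
quoteRegexp and wholeWordPattern are made up for illustration, "set A" is
again assumed to be letters, digits and underscore, and the code assumes
Tcl 8.5 or later (regsub returning its result). As far as I know Tcl
regexps offer lookahead but no lookbehind, so the character before the
match is consumed by a non-capturing group and the user's string is
reported through the first capture group.

```tcl
# Hypothetical helper: escape every character that is not a letter,
# digit or underscore, so the regexp engine treats it literally.
# In a Tcl regexp, a backslash followed by a non-alphanumeric character
# matches that character as an ordinary character.
proc quoteRegexp {s} {
    return [regsub -all {[^[:alnum:]_]} $s {\\&}]
}

# Hypothetical helper: build a pattern that matches $needle only when it
# is bounded on both sides by a non-word character or by the start/end
# of the searched string.
proc wholeWordPattern {needle} {
    set q  [quoteRegexp $needle]
    set nw {[^[:alnum:]_]}   ;# "not in set A" (assumed definition)
    # (?:^|nw)  start of string, or one non-word character (consumed)
    # (q)       the user's string, taken literally (capture group 1)
    # (?=nw|$)  lookahead: a non-word character or end of string
    return "(?:^|$nw)($q)(?=$nw|\$)"
}
```

A possible use: call "regexp -indices -- [wholeWordPattern $input] $text
-> pos" and read the position of the user's string from pos rather than
from the overall match, since the overall match may include the preceding
boundary character.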

