Re: Regexp: \m and [^[:alnum:]_] are not equivalent



>> 3. a "word" boundary is a location where on one side you have any
>> and on the other side
>> you have a character included in set A
>
> I don't believe you. According to your example, words may not begin
> or end at the boundary of the string (no "character not in set A" at
> the boundary), so your example contains *zero* words, by rule 3.

I don't understand why you don't believe me. "Words" may begin at the
start of a string or end at the end of it.
I said:
>> character not in set A (or string start/stop),

Therefore my example sounds correct to me: it contains three of my "words".

But I don't think this matters much. As I said, the text I want to search
is in a text widget. The beginning and end of the part of the text I search
are given by the start and stop indices of the .t search command. If my
problem is solved for every case except the start and stop, that will be a
great step forward. I will look at those cases afterwards.


> Which indicates you have not thought clearly about your problem,
> rather than merely not describing it clearly.

That is still possible, though. I tried to explain my thinking as clearly as
I could, but I may have failed.
Another way to describe my problem is to say that I want to mimic the
"Match whole word" search option that some word processors offer (MS Word,
for example).


>> 4. a "word" can be composed of any number of characters
>> 5. the characters in a "word" can appear in any order, at any position
>> inside the "word", and any number of times
>
> Which is an entirely different problem from what you asked before with
> "%foo" versus "bar%foo".

Correct.


> Why are you still using \m after the previous discussion?

I gave my \m example (which partially worked) so that you understand the
history of why I'm looking for something else.


> You had
> an explicit set of characters before. Now you say a word can contain
> *ANY* character, so you have lost the dichotomy between word-characters
> and non-word characters, and searching whole words doesn't make sense.

Searching for whole words makes sense if you consider that a user might
enter, say, %f oo$#ba r in the "Search what" entry box and ask for
whole-word matches only. I want to find this exact string in the text
contained in my text widget, and I want only "whole word" matches; in other
words, this exact text must be bounded by what I called "word boundaries"
(rule 3).

Another example. Consider the 5 lines of text below, as if they had been
typed into a text widget:
This is the first line of text
Download antivirus software, firewalls, spyware
removal tools, and %f oo$#ba r more to improve the security
of your %f oo$#ba ranything and to help keep it running smoothly
This is a test of a ?something%f oo$#ba r text widget

I want to have a regexp that, given that the user's input is %f oo$#ba r
(13 characters, first is %, last is r, contains both set A chars and
non-set A chars), will match in line 3 but not in lines 4 or 5. At first
sight this should have been \m%f oo\$#ba r\M, but as we all know by now this
does not work. My boundaries are different from Tcl's word boundaries, and
that is why \m cannot be used. So I'm looking for a regexp that will
replace \m and \M and take into account my definition of what a boundary is.
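
To make the goal concrete, here is a minimal sketch of one possible
replacement. It is not rule 3 itself but a looser neighbour check: the match
must be preceded by the start of the string or a character outside set A, and
must not be followed by a character in set A. The set A used here
(alphanumerics plus underscore) and the helper name wholeWordRe are
assumptions made only for illustration. Tcl's regexp has no lookbehind, so
the left side consumes one character and the search string sits in a capture
group.

```tcl
# Assumption: set A is alphanumerics plus underscore (not confirmed above).
set setA {[:alnum:]_}

# The user's string with regexp metacharacters already escaped.
set literal {%f oo\$#ba r}

# Hypothetical helper: wrap an escaped literal in the two neighbour checks.
proc wholeWordRe {literal setA} {
    # Left: start of string, or one character outside set A.
    # Right: negative lookahead, which also succeeds at end of string.
    return "(?:^|\[^$setA\])($literal)(?!\[$setA\])"
}

set lines [list \
    {This is the first line of text} \
    {Download antivirus software, firewalls, spyware} \
    {removal tools, and %f oo$#ba r more to improve the security} \
    {of your %f oo$#ba ranything and to help keep it running smoothly} \
    {This is a test of a ?something%f oo$#ba r text widget}]

set re [wholeWordRe $literal $setA]
set n 0
foreach line $lines {
    incr n
    # \m needs a word character right after it, so \m% can never match.
    set withM     [regexp -- {\m%f oo\$#ba r\M} $line]
    set withCheck [regexp -- $re $line]
    puts "line $n: with \\m and \\M = $withM, neighbour check = $withCheck"
}
```

With these assumptions the \m version reports 0 for every line, while the
neighbour check reports 1 for line 3 only. When the position is needed,
regexp -indices on the capture group gives the span without the leading
character.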

I don't see any contradiction between the rules I gave in my previous post
and my examples.
The user can enter anything he wants, be it "word" characters (i.e. in what
I called set A), or not. But the matching process should only match at
boundaries, i.e. rule 3 must apply.


> If you mainly want to anchor a user's string to follow particular
> characters,
> then see my response from yesterday.

I don't think that this is exactly what I'm looking for.


> You should be sanitizing the user's input before feeding it to regexp!

If by sanitizing you mean escaping the characters typed by the user that
have a special meaning in a regexp, I agree with you. I'll have to double the
backslashes typed by the user, change $ into \$, and so on. But I must not
change the user's input any more than that.
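
A minimal sketch of that kind of escaping, under the assumption that quoting
every character that is not alphanumeric or underscore is acceptable. The
helper name reQuote and the sample strings are made up for illustration.
Because the pattern is held in a variable, Tcl itself does no second round of
$ or backslash substitution on it; only the regexp engine's metacharacters
need quoting.

```tcl
# Hypothetical helper: put a backslash in front of every character that is
# not alphanumeric or underscore. In Tcl's regexp syntax a backslash followed
# by a non-alphanumeric character matches that character literally.
proc reQuote {s} {
    regsub -all {\W} $s {\\&} quoted
    return $quoted
}

# What the user typed, metacharacters included.
set user {a.b*c\d$}
set re [reQuote $user]

# The quoted form finds the literal text only.
puts [regexp -- $re {xa.b*c\d$y}]
puts [regexp -- $re {axc5}]

# Unquoted, the same input is read as a pattern and matches something else.
puts [regexp -- $user {axc5}]
```

The three lines printed are 1, 0 and 1: the quoted pattern finds only the
literal text, while the unquoted input is interpreted as a pattern. The
characters the user typed are otherwise left unchanged.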

I hope this is a bit clearer now.
Anyway, thanks, and sorry for the headache I am no doubt giving you ;-)
Francois

