I'm attempting something which I feel should be fairly obvious to me but it's not. I'm trying to match a string which does NOT contain a specific sequence of characters. I've tried using [^ab]
, [^(ab)]
, etc. to match strings containing no 'a's or 'b's, or only 'a's or only 'b's or 'ba' but not match on 'ab'. The examples I gave won't match 'ab' it's true but they also won't match 'a' alone and I need them to. Is there some simple way to do this?
Answer
Use negative lookahead:
^(?!.*ab).*$
UPDATE: In the comments below, I stated that this approach is slower than the one given in Peter's answer. I've run some tests since then, and found that it's really slightly faster. However, the reason to prefer this technique over the other is not speed, but simplicity.
The other technique, described here as a tempered greedy token, is suitable for more complex problems, like matching delimited text where the delimiters consist of multiple characters (like HTML, as Luke commented below). For the problem described in the question, it's overkill.
For anyone who's interested, I tested with a large chunk of Lorem Ipsum text, counting the number of lines that don't contain the word "quo". These are the regexes I used:
(?m)^(?!.*\bquo\b).+$
(?m)^(?:(?!\bquo\b).)+$
Whether I search for matches in the whole text, or break it up into lines and match them individually, the anchored lookahead consistently outperforms the floating one.
No comments:
Post a Comment