Friday 2 December 2016

javascript - Regex excluding matches wrapped in specific bbcode tags



I'm trying to replace double quotes with curly quotes, except when the text is wrapped in certain tags, like [quote] and [code].




Sample input



[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]

"Why no goodbye?" replied [b]Bob[/b]. "It's always Hello!"




Expected output



[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]

“Why no goodbye?” replied [b]Bob[/b]. “It's always Hello!”





I figured out how to achieve this elegantly in PHP using (*SKIP)(*F); however, my code will run in JavaScript, and the JavaScript solution is less than ideal.



Right now I'm splitting the string at those tags, running the replace, then putting the string together:



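a = a
  // Split on [quote], [code] and [icode] blocks. split() also returns both
  // captured groups, so the array is: text, block, tag name, text, ...
  .split(/(\[(icode|quote|code)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/\2\])/i)
  .map(function (x, i) {
    if (i % 3 === 2) {
      // captured tag name: already contained in the block, so drop it
      return '';
    }
    if (i % 3 === 0) {
      // text outside the protected blocks: convert the quotes
      return x.replace(/(?![^<]*>|[^\[]*\])"([^"]*?)"/gi, '“$1”');
    }
    // protected block: keep as is
    return x;
  })
  .join('');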
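
For reference, a minimal sketch of what split() returns when the separator regex contains capturing groups, which is why the callback has to step through the array in threes. The names splitPattern, sample and parts are made up for illustration, and the tag name is captured here with a plain numbered group and matched again with \2, which is an assumption about the intended pattern.

```javascript
// Separator with two capturing groups: the whole block and the tag name.
var splitPattern = /(\[(icode|quote|code)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/\2\])/i;

var sample = '"a" [code]"b"[/code] "c"';
var parts = sample.split(splitPattern);

// split() inserts every captured group into the result, so the array is
// laid out as: text, block, tag name, text, ...
console.log(parts);
```

Only the entries at indexes 0, 3, 6, ... are text outside the protected blocks, so those are the ones the replace has to run on.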



Javascript Regex Breakdown




  1. Inside split():


    • (\[(icode|quote|code)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/\2\]) - captures the pattern inside parentheses:



      • \[(icode|quote|code)[^\]]*?\] - a [quote], [code], or [icode] opening tag, with or without parameters like =html, eg [code=html]. The inner group captures the tag name.

      • (?:[\s]*?.)*? - any 0+ (as few as possible) occurrences of any char (.), preceded or not by whitespace, so it doesn't break if the opening tag is followed by a line break

      • [\s]*? - 0+ whitespace characters

      • \[\/\2\] - a [/quote], [/code], or [/icode] closing tag. \2 is a backreference to the tag name captured by the inner group. Eg: if it's a quote opening tag, it'll be a quote closing tag



  2. Inside replace():



    • (?![^<]*>|[^\[]*\])"([^"]*?)" - captures text inside double quotes:


      • (?![^<]*>|[^\[]*\]) - negative lookahead: the match is rejected when the quote is followed by characters other than < and then a >, or by characters other than [ and then a ], so quotes inside bbcode and html tags are not matched. Eg: the quotes in [spoiler="Name"] or in an html attribute are left untouched.

      • " - literal opening double quotes character.

      • ([^"]*?) - any 0+ character, except double quotes.

      • " - literal closing double quotes character.
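
To make the lookahead concrete, here is a small sketch that applies the replace() pattern to a string where one pair of quotes sits inside a bbcode tag. The sample string and the variable names are made up for illustration.

```javascript
var quotePattern = /(?![^<]*>|[^\[]*\])"([^"]*?)"/g;

// The quotes around Name are followed by a ] before any [ appears,
// so the negative lookahead rejects them.
var text = '[spoiler="Name"]secret[/spoiler] and "plain text"';
var converted = text.replace(quotePattern, '“$1”');

console.log(converted);
// [spoiler="Name"]secret[/spoiler] and “plain text”
```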






SPLIT() REGEX DEMO: https://regex101.com/r/Ugy3GG/1



That's awful, because the replace is executed multiple times.






Meanwhile, the same result can be achieved with a single PHP regex. The regex I wrote is based on the question "Match regex pattern that isn't within a bbcode tag".



(\[(quote|code|icode)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/\2\])(*SKIP)(*F)|(?![^<]*>|[^\[]*\])"([^"]*?)"



PHP Regex Breakdown




  • (\[(quote|code|icode)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/\2\])(*SKIP)(*F) - matches the pattern inside capturing parentheses just like the JavaScript split() above, then (*SKIP)(*F) makes the regex engine discard the matched text and resume searching after it.

  • | - or

  • (?![^<]*>|[^\[]*\])"([^"]*?)" - captures text inside double quotes in the same way javascript replace() does




PHP DEMO: https://regex101.com/r/fB0lyI/1



The beauty of this regex is that it only needs to be run once. No splitting and joining of strings. Is there a way to implement it in javascript?


Answer



Because JS lacks backtracking control verbs such as (*SKIP)(*F), you need to match (consume) those bracketed chunks as well and then put them back unchanged in the replacement. Taking the second side of the alternation from your own regex, the final regex would be:



\[(quote|i?code)[^\]]*\][\s\S]*?\[\/\1\]|(?![^<]*>|[^\[]*\])"([^"]*)"


The tricky part is using a callback function with the replace() method:




// $0 is the whole match, $1 the tag name (bracketed chunks only), $2 the quoted text
str.replace(regex, function($0, $1, $2) {
    return $1 ? $0 : '“' + $2 + '”';
});


The ternary operator above returns $0 (the whole match) if the first capturing group matched; otherwise it wraps the value of the second capturing group in curly quotes and returns that.
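
Putting the pieces together, a minimal sketch run against the sample input from the question. The g flag and the curlyQuotes wrapper are assumptions added here for illustration: g is needed so that replace() handles every match, and an i flag could be added if tags may be written in upper case.

```javascript
// Either a whole [quote]/[code]/[icode] block (group 1 = tag name)
// or a pair of double quotes outside any tag (group 2 = quoted text).
var regex = /\[(quote|i?code)[^\]]*\][\s\S]*?\[\/\1\]|(?![^<]*>|[^\[]*\])"([^"]*)"/g;

// Hypothetical helper name, not part of the original answer.
function curlyQuotes(str) {
  return str.replace(regex, function ($0, $1, $2) {
    // $1 is set only when the first alternative (a bracketed block) matched.
    return $1 ? $0 : '“' + $2 + '”';
  });
}

var input = '[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]\n\n' +
  '"Why no goodbye?" replied [b]Bob[/b]. "It\'s always Hello!"';

console.log(curlyQuotes(input));
// [quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
//
// “Why no goodbye?” replied [b]Bob[/b]. “It's always Hello!”
```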



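Note: this may fail in some cases, for example when tags of the same type are nested, because the lazy match stops at the first closing tag.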
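
A small sketch of one such case, using a made-up input with a [quote] nested inside another [quote]. The block alternative ends at the first [/quote], so the text that follows is treated as being outside any block:

```javascript
var regex = /\[(quote|i?code)[^\]]*\][\s\S]*?\[\/\1\]|(?![^<]*>|[^\[]*\])"([^"]*)"/g;

// Illustrative input: the outer [quote] contains a complete inner [quote].
var nested = '[quote][quote]inner[/quote] "still quoted" [/quote]';

var result = nested.replace(regex, function ($0, $1, $2) {
  return $1 ? $0 : '“' + $2 + '”';
});

console.log(result);
// [quote][quote]inner[/quote] “still quoted” [/quote]
```

The quotes are converted even though they are still inside the outer [quote]. Handling nesting would need something beyond a single regex, such as a small parser that tracks tag depth.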





