Friday 28 April 2017

javascript - Regex excluding matches wrapped in specific bbcode tags

I'm trying to replace double quotes with curly quotes, except when the text is wrapped in certain tags, like [quote] and [code].



Sample input




[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]

"Why no goodbye?" replied [b]Bob[/b]. "It's always Hello!"




Expected output



[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]

“Why no goodbye?” replied [b]Bob[/b]. “It's always Hello!”





I figured out how to achieve this elegantly in PHP using (*SKIP)(*F); however, my code will run in JavaScript, and my JavaScript solution is less than ideal.



Right now I'm splitting the string at those tags, running the replace on the pieces in between, then putting the string back together:



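// split() with two capture groups yields chunks in threes:
// [outside text, protected block, tag name, outside text, ...]
a = a
  .split(/(\[(icode|quote|code)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(?:\2)\])/i)
  .map(function (x, i) {
    if (i % 3 === 2) {
      // Drop the tag name captured by the inner group.
      return '';
    }
    if (i % 3 === 0) {
      // Curl quotes only in the text outside the protected tags.
      return x.replace(/(?![^<]*>|[^\[]*\])"([^"]*?)"/gi, '“$1”');
    }
    // Protected [quote], [code] or [icode] block: leave untouched.
    return x;
  })
  .join('');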
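
The index bookkeeping above relies on how split() treats capturing groups: whatever the groups capture is spliced into the resulting array. A minimal sketch of that behaviour, assuming a simplified pattern (no tag parameters, numbered backreference) and a made-up input string, neither of which comes from the original code:

```javascript
// Simplified, hypothetical pattern: outer group = whole block, inner group = tag name.
var protectedBlock = /(\[(code|quote)\][\s\S]*?\[\/\2\])/i;
var sample = 'a [code]"x"[/code] b [quote]"y"[/quote] c';

var parts = sample.split(protectedBlock);
// parts[0], parts[3], parts[6]: text outside the tags
// parts[1], parts[4]: whole protected blocks (outer group)
// parts[2], parts[5]: tag names (inner group)
console.log(parts);
```

So the chunks arrive in groups of three, and only every third one (starting at index 0) is text that should have its quotes replaced.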



Javascript Regex Breakdown




  1. Inside split():


    • (\[(icode|quote|code)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(?:\2)\]) - captures the pattern inside the outer parentheses:


      • \[(icode|quote|code)[^\]]*?\] - a [quote], [code], or [icode] opening tag, with or without parameters like =html, e.g. [code=html]


      • (?:[\s]*?.)*? - 0+ occurrences (as few as possible) of any character (.), optionally preceded by whitespace, so the match doesn't break if the opening tag is followed by a line break

      • [\s]*? - 0+ whitespace characters

      • \[\/(?:\2)\] - a [/quote], [/code], or [/icode] closing tag. The backreference matches the text captured by the inner (icode|quote|code) group, e.g. if the opening tag is quote, the closing tag must be quote too



  2. Inside replace():


    • (?![^<]*>|[^\[]*\])"([^"]*?)" - captures text inside double quotes:



      • (?![^<]*>|[^\[]*\]) - negative lookahead: the match is rejected if the quote is followed by characters other than < and then a >, or by characters other than [ and then a ], so it won't match anything inside bbcode and html tags. E.g. [spoiler="Name"] or an HTML tag. Note that matches wrapped in tags are left untouched.

      • " - literal opening double quote character.

      • ([^"]*?) - 0+ characters other than double quotes, captured so they can be reused as $1.

      • " - literal closing double quote character.
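
To see the lookahead at work, here is a small sketch that runs the same replace() pattern on two made-up strings. The helper name curl and both inputs are assumptions for illustration only:

```javascript
// Same pattern as in the replace() call above.
var quotePattern = /(?![^<]*>|[^\[]*\])"([^"]*?)"/g;

// Hypothetical helper: curls double quotes that are not part of a tag.
function curl(text) {
  return text.replace(quotePattern, '“$1”');
}

// The quotes around Name sit inside the [spoiler ...] tag, so they are skipped.
var bbcodeResult = curl('[spoiler="Name"]He said "hi"[/spoiler]');
// The quotes around x sit inside the <a ...> tag, so they are skipped.
var htmlResult = curl('<a title="x">"y"</a>');

console.log(bbcodeResult);
console.log(htmlResult);
```

Only the quotes in the text between the tags are replaced; the attribute quotes inside the tags themselves stay straight.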





SPLIT() REGEX DEMO: https://regex101.com/r/Ugy3GG/1




That's awful, because replace() is executed once for every chunk of text between the protected tags, on top of the split and the join.






Meanwhile, the same result can be achieved with a single PHP regex. The regex I wrote is based on the question "Match regex pattern that isn't within a bbcode tag".



(\[(quote|code|icode)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(\2)\])(*SKIP)(*F)|(?![^<]*>|[^\[]*\])"([^"]*?)"



PHP Regex Breakdown




  • (\[(quote|code|icode)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(\2)\])(*SKIP)(*F) - matches the pattern inside the capturing parentheses just like the JavaScript split() above; then (*F) forces the match to fail and (*SKIP) stops the engine from retrying inside the text it has already consumed, so the whole protected block is skipped.

  • | - or

  • (?![^<]*>|[^\[]*\])"([^"]*?)" - captures text inside double quotes in the same way javascript replace() does



PHP DEMO: https://regex101.com/r/fB0lyI/1




The beauty of this regex is that it only needs to be run once. No splitting and joining of strings. Is there a way to implement it in javascript?
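
One common way to approximate (*SKIP)(*F) in JavaScript is to keep the same alternation and pass a callback to replace(): if the first alternative (the protected block) matched, return it unchanged; otherwise return the curly-quoted text. The following is only a sketch under stated assumptions: curlQuotes and skipOrQuote are hypothetical names, the backreference is numbered (\2), and [\s\S]*? is swapped in for (?:[\s]*?.)*?[\s]*? to span line breaks. It does not handle nested tags of the same name.

```javascript
// Alternative 1: a whole [quote]/[code]/[icode] block (groups 1 and 2).
// Alternative 2: a double-quoted string outside any tag (group 3).
var skipOrQuote = /(\[(icode|quote|code)[^\]]*\][\s\S]*?\[\/\2\])|(?![^<]*>|[^\[]*\])"([^"]*?)"/gi;

// Hypothetical helper: a single replace() pass over the whole string.
function curlQuotes(text) {
  return text.replace(skipOrQuote, function (match, protectedBlock, tagName, quoted) {
    // A protected block matched: put it back exactly as it was.
    if (protectedBlock !== undefined) {
      return match;
    }
    // Otherwise a quoted string matched: swap in curly quotes.
    return '“' + quoted + '”';
  });
}

var input =
  '[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]\n\n' +
  '"Why no goodbye?" replied [b]Bob[/b]. "It\'s always Hello!"';

console.log(curlQuotes(input));
```

Because the protected block is consumed by the first alternative, the scan resumes after its closing tag, which is roughly the effect (*SKIP)(*F) has in PCRE.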
