Wednesday, 9 November 2016

perl - sed regex stop at first match



I want to replace part of the following html text (excerpt of a huge file), to update old forum formatting (resulting from a very bad forum porting job done 2 years ago) to regular phpBB formatting:



    <blockquote id="quote"><font size="1" face="Verdana, Arial, Helvetica" id="quote">quote:<hr height="1" noshade id="quote"><i>written by User</i>



this should be filtered into:



    [quote=User]


I used the following regex in sed



    s/<blockquote.*written by \(.*\)<\/i>/[quote=\1]/g



this works on the given example, but in the actual file, several quotes like this can be in one line. In that case sed is too greedy, and places everything between the first and the last match in the [quote=...] tag. I cannot seem to make it replace every occurance of this pattern in the line... (I don't think there's any nested quotes, but that would make it even more difficult)


Answer



You need a version of sed(1) that uses Perl-compatible regular expressions, so that you can do things like make a minimal match, or one with a negative lookahead.



The easiest way to do this is simply to use Perl in the first place.



If you have an existing sed script, you can translate it into Perl using the s2p(1) utility. Note that in Perl you really want to use $1 on the right side of the s/// operator. In most cases the \1 is grandfathered, but in general you want $1 there:



s/<blockquote.*?written by (.*?)<\/i>/[quote=$1]/g;



Notice I have removed the backslash from the front of the parens. Another advantage of using Perl is that it uses the sane egrep-style regexes (like awk), not the ugly grep-style ones (like sed) that require all those confusing (and inconsistent) backslashes all over the place.



Another advantage to using Perl is you can use paired, nestable delimiters to avoid ugly backslashes. For example:



s{<blockquote.*?written by (.*?)</i>}
{[quote=$1]}g;


Other advantage include that Perl gets along excellently well with UTF-8 (now the Web’s majority encoding form), and that you can do multiline matches without the extreme pain that sed requires for that. For example:




$ perl -CSD -00 -pe 's{<blockquote.*?written by (.*?)</i>}{[quote=$1]}gs' file1.utf8 file2.utf8 ...


The -CSD makes it treat stdin, stdout, and files as UTF-8. The -00 makes it read the entire file in one fell slurp, and the /s makes the dot cross newline boundaries as need be.


No comments:

Post a Comment

c++ - Does curly brackets matter for empty constructor?

Those brackets declare an empty, inline constructor. In that case, with them, the constructor does exist, it merely does nothing more than t...