pcrecpp::QuoteMeta and null bytes

Note: This entry has been restored from old archives.

Update 2008-02-22 13:45: Sometimes I hate it, just a little, when random things I write end up as 2nd/3rd Google hits for the thing I wrote about! This isn’t a review, it’s an entry about one pretty trivial bug found in a utility method provided by the pcrecpp API. The project maintainers were quick to respond to the bug with a good fix. In general I have this to say about pcrecpp: It’s the best option I’ve found for working with PCREs in C++, it beats the pants off working with the C API or writing your own wrapper.
Update 2008-02-15 09:45: Bug filed yesterday and they’ve gone with special-casing the escape for NULL. I provided a patch that added a new QuoteMetaHex function, but I much prefer the route they’ve chosen. (I was concerned that they might really want it to be exactly like Perl’s quotemeta.)

Be warned! If using pcrecpp::QuoteMeta on strings with embedded null bytes the results might not be as you expect!

#include <pcrecpp.h>
#include <string>
#include <iostream>

int main(void)
{
    // Build "foo", a literal null byte, then "bar".
    std::string unquoted("foo");
    unquoted.push_back('\0');
    unquoted.append("bar");
    std::string autoquoted = pcrecpp::RE::QuoteMeta(unquoted);
    // Hand-written pattern using a printable escape for the null.
    std::string manualquoted("foo\\x00bar");
    std::cout << "Auto quoted version is: " << autoquoted << std::endl;
    std::cout << "Auto match result is: " << pcrecpp::RE(autoquoted).FullMatch(unquoted) << std::endl;
    std::cout << "Manual quoted version is: " << manualquoted << std::endl;
    std::cout << "Manual match result is: " << pcrecpp::RE(manualquoted).FullMatch(unquoted) << std::endl;
    return 0;
}
:; g++ quotemeta.cc -o quotemeta -lpcrecpp
:; ./quotemeta 
Auto quoted version is: foo\bar
Auto match result is: 0
Manual quoted version is: foo\x00bar
Manual match result is: 1
:;

Dammit!

But is it a bug? The documentation in pcrecpp.h says:

  // Escapes all potentially meaningful regexp characters in
  // 'unquoted'.  The returned string, used as a regular expression,
  // will exactly match the original string.  For example,
  //           1.5-2.0?
  // may become:
  //           1\.5\-2\.0\?
  static string QuoteMeta(const StringPiece& unquoted);

And that’s what man pcrecpp tells me too. So the definition is essentially “does what perl’s quotemeta does.” Hrm:

:; perl -e 'print quotemeta("foo\x00bar") . "\n"'
foo\bar
:; perl -e 'print quotemeta("foo\x00bar") . "\n"' | xxd
0000000: 666f 6f5c 0062 6172 0a                   foo.bar.

That second command is just to determine that the null byte is actually still there. The same trick with ./quotemeta shows that the null is also still there when pcrecpp::QuoteMeta is used.

So, the behaviour of pcrecpp::QuoteMeta is correct by definition.

What about the matching then? Should “\” followed by a literal null be part of the regular expression? I’m not sure about libpcre semantics for this but let’s test with Perl. Note that pcrecpp::FullMatch means the whole string must match, so the Perl expression must have “^” and “$” at either end.

:; perl -e '$s="foo\0bar"; $p=quotemeta($s); $s =~ s/^$p$//; print "<$s>\n"' | xxd
0000000: 3c3e 0a                                  <>.
:; perl -e '$s="foo\0bar"; $p=quotemeta($s); $s =~ s/^foo//; print "<$s>\n"' | xxd
0000000: 3c00 6261 723e 0a                        <.bar>.
:; perl -e '$s="foo\0bar"; $p=quotemeta($s); $s =~ s/^foo\0//; print "<$s>\n"' | xxd
0000000: 3c62 6172 3e0a                           <bar>.

OK, looks like pcrecpp isn’t matching like Perl. Digging into the pcrecpp.cc source equivalent to the Ubuntu package I’m using shows:

...
RE(const string& pat) { Init(pat, NULL); }
...
void RE::Init(const string& pat, const RE_Options* options) {
 pattern_ = pat;
...
  re_partial_ = Compile(UNANCHORED);
  if (re_partial_ != NULL) {
    re_full_ = Compile(ANCHOR_BOTH);
  }
...
pcre* RE::Compile(Anchor anchor) {
...
  if (anchor != ANCHOR_BOTH) {
    re = pcre_compile(pattern_.c_str(), pcre_options,
                      &compile_error, &eoffset, NULL);
  } else {
    // Tack a '\z' at the end of RE.  Parenthesize it first so that
    // the '\z' applies to all top-level alternatives in the regexp.
    string wrapped = "(?:";  // A non-counting grouping operator
    wrapped += pattern_;
    wrapped += ")\\z";
    re = pcre_compile(wrapped.c_str(), pcre_options,
                      &compile_error, &eoffset, NULL);
  }
...

Hm! The problem is that QuoteMeta leaves the literal null byte in, and later on the compilation uses the string’s c_str(). pcre_compile reads that as a null-terminated C string, so the embedded null marks the end of our pattern.
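The truncation is easy to see without pcrecpp at all. Here is a minimal sketch, using two hypothetical helpers of my own naming (c_visible_length and quoted_with_null, neither is part of any library), that compares what the std::string holds with what a C API reading up to the first null gets to see:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not part of pcrecpp): the number of bytes a C API
// sees when it is handed s.c_str() and has to find the terminator itself.
std::size_t c_visible_length(const std::string& s) {
    return std::strlen(s.c_str());
}

// Rebuilds the auto-quoted pattern from the example above by hand:
// "foo", a backslash, a literal null byte, then "bar".
std::string quoted_with_null() {
    std::string p("foo\\");
    p.push_back('\0');
    p.append("bar");
    return p;
}
```

The string itself holds 8 bytes, but only the first 4 (foo and the backslash) are reachable through c_str(), so the regex compiler never gets as far as the null or the bar.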

It seems that pcre_compile doesn’t offer a version that takes a string length, so there’s no way around this without printable-quoting the null. This is fine in libpcre since the behaviour is obvious. Maybe not so fine in pcrecpp since it is common to use std::string as a non-printable data container (it is very common, but maybe it is a “bad thing”™?). I think it is a bug in pcrecpp, but it could be a “document the caveat” bug rather than a “do magic to make it work” bug.
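To make “printable-quoting the null” concrete, here is a rough sketch of what I mean. The function name quote_meta_printable is made up for illustration and this is not the pcrecpp implementation; passing high-bit bytes through untouched is my own assumption about what is sensible for UTF-8 input.

```cpp
#include <string>

// Hypothetical sketch, not the pcrecpp implementation: backslash-escape
// every ASCII byte that is not a letter, digit, or underscore, and write
// a null byte as the printable escape \x00 so the result survives c_str().
std::string quote_meta_printable(const std::string& unquoted) {
    std::string result;
    result.reserve(unquoted.size() * 2);
    for (std::string::size_type i = 0; i < unquoted.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(unquoted[i]);
        if (c == '\0') {
            result += "\\x00";  // four printable characters: \ x 0 0
            continue;
        }
        // Assumption: bytes >= 0x80 are passed through unescaped.
        const bool plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                           (c >= '0' && c <= '9') || c == '_' || c >= 0x80;
        if (!plain) {
            result += '\\';
        }
        result += static_cast<char>(c);
    }
    return result;
}
```

Fed the foo, null, bar string from the example, this produces the same pattern as the manually quoted version. I went with \x00 rather than \0 in the sketch because, as far as I understand PCRE’s escapes, \0 followed by digits is read as an octal code, so a null followed by a literal digit would need extra care.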

Looks like pcrecpp.cc is actually part of upstream libpcre. Should file a bug I guess.