Note: This entry has been restored from old archives.
pcrecpp
API. The project maintainers were quick to respond to the bug with a good fix. In general I have this to say about pcrecpp: it’s the best option I’ve found for working with PCREs in C++; it beats the pants off working with the C API or writing your own wrapper. quotemeta
.) Be warned! If using pcrecpp::QuoteMeta on strings with embedded null bytes the results might not be as you expect!
    #include <pcrecpp.h>
    #include <string>
    #include <iostream>

    int main(void) {
        // Build "foo", a null byte, then "bar".
        std::string unquoted("foo");
        unquoted.push_back('\0');
        unquoted.append("bar");
        std::string autoquoted = pcrecpp::RE::QuoteMeta(unquoted);
        // Hand-written pattern with the null as a printable escape.
        std::string manualquoted("foo\\x00bar");
        std::cout << "Auto quoted version is: " << autoquoted << std::endl;
        std::cout << "Auto match result is: " << pcrecpp::RE(autoquoted).FullMatch(unquoted) << std::endl;
        std::cout << "Manual quoted version is: " << manualquoted << std::endl;
        std::cout << "Manual match result is: " << pcrecpp::RE(manualquoted).FullMatch(unquoted) << std::endl;
        return 0;
    }
    :; g++ quotemeta.cc -o quotemeta -lpcrecpp
    :; ./quotemeta
    Auto quoted version is: foo\bar
    Auto match result is: 0
    Manual quoted version is: foo\x00bar
    Manual match result is: 1
    :;
Dammit!
But is it a bug? The documentation in pcrecpp.h says:
    // Escapes all potentially meaningful regexp characters in
    // 'unquoted'. The returned string, used as a regular expression,
    // will exactly match the original string. For example,
    //           1.5-2.0?
    // may become:
    //           1\.5\-2\.0\?
    static string QuoteMeta(const StringPiece& unquoted);
And that’s what man pcrecpp tells me too. So the definition is essentially “does what Perl’s quotemeta does.” Hrm:
    :; perl -e 'print quotemeta("foo\x00bar") . "\n"'
    foo\bar
    :; perl -e 'print quotemeta("foo\x00bar") . "\n"' | xxd
    0000000: 666f 6f5c 0062 6172 0a                   foo\.bar.
That second command is just to confirm that the null byte is actually still there. The same trick with ./quotemeta shows that the null is also still there when pcrecpp::QuoteMeta is used.
So, the behaviour of pcrecpp::QuoteMeta is correct by definition.
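To make that rule concrete, here is a rough C++ sketch of Perl-style quoting. The name `quote_meta_sketch` is made up for illustration and this is not the pcrecpp implementation; it assumes the simple ASCII rule (backslash in front of every byte outside `[A-Za-z0-9_]`) and passes bytes of 0x80 and above through untouched. The point is only that a null byte comes out as a backslash followed by the raw null.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of pcrecpp: backslash-escape every ASCII
// byte outside [A-Za-z0-9_]. Bytes >= 0x80 pass through untouched.
std::string quote_meta_sketch(const std::string& unquoted) {
    std::string out;
    out.reserve(unquoted.size() * 2);
    for (unsigned char c : unquoted) {
        const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                             (c >= '0' && c <= '9') || c == '_';
        if (!is_word && c < 0x80) {
            out.push_back('\\');
        }
        out.push_back(static_cast<char>(c));
    }
    return out;
}

// Self-check using the string from the test program: "foo", null, "bar".
void check_quote_meta_sketch() {
    std::string input("foo");
    input.push_back('\0');
    input.append("bar");
    const std::string quoted = quote_meta_sketch(input);
    assert(quoted.size() == 8);  // one extra byte: the backslash
    assert(quoted[3] == '\\');
    assert(quoted[4] == '\0');   // the raw null is still in there
}
```

Under this rule the quoted pattern is a perfectly good `std::string`, but it still carries an embedded null.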
What about the matching then? Should “\” followed by a literal null be part of the regular expression? I’m not sure about libpcre semantics for this but let’s test with Perl. Note that pcrecpp::FullMatch means the whole string must match, so the Perl expression must have “^” and “$” at either end.
    :; perl -e '$s="foo\0bar"; $p=quotemeta($s); $s =~ s/^$p$//; print "<$s>\n"' | xxd
    0000000: 3c3e 0a                                  <>.
    :; perl -e '$s="foo\0bar"; $p=quotemeta($s); $s =~ s/^foo//; print "<$s>\n"' | xxd
    0000000: 3c00 6261 723e 0a                        <.bar>.
    :; perl -e '$s="foo\0bar"; $p=quotemeta($s); $s =~ s/^foo\0//; print "<$s>\n"' | xxd
    0000000: 3c62 6172 3e0a                           <bar>.
OK, looks like pcrecpp isn’t matching like Perl. Digging into the pcrecpp.cc source equivalent to the Ubuntu package I’m using shows:
    ...
    RE(const string& pat) { Init(pat, NULL); }
    ...
    void RE::Init(const string& pat, const RE_Options* options) {
      pattern_ = pat;
      ...
      re_partial_ = Compile(UNANCHORED);
      if (re_partial_ != NULL) {
        re_full_ = Compile(ANCHOR_BOTH);
      }
    ...
    pcre* RE::Compile(Anchor anchor) {
      ...
      if (anchor != ANCHOR_BOTH) {
        re = pcre_compile(pattern_.c_str(), pcre_options,
                          &compile_error, &eoffset, NULL);
      } else {
        // Tack a '\z' at the end of RE. Parenthesize it first so that
        // the '\z' applies to all top-level alternatives in the regexp.
        string wrapped = "(?:";  // A non-counting grouping operator
        wrapped += pattern_;
        wrapped += ")\\z";
        re = pcre_compile(wrapped.c_str(), pcre_options,
                          &compile_error, &eoffset, NULL);
      }
      ...
Hm! The problem is that QuoteMeta leaves the literal null byte in, and later on the compilation uses the string’s c_str(). Naturally this is null-terminated, so the embedded null marks the end of our pattern.
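The truncation can be shown without pcrecpp at all. This is a minimal sketch using only the standard library; the function names are invented for illustration. It builds the same bytes QuoteMeta produced above and then looks at them the way any C API that takes only a `const char*` would.

```cpp
#include <cassert>
#include <string>

// The quoted pattern from the post: "foo\" + NUL + "bar" (8 bytes).
std::string quoted_with_raw_nul() {
    std::string p("foo\\");
    p.push_back('\0');
    p.append("bar");
    return p;
}

// What a C API given only s.c_str() (no length) can see: everything up
// to, but not including, the first null byte.
std::string as_seen_through_c_str(const std::string& s) {
    return std::string(s.c_str());
}

void check_truncation() {
    const std::string pattern = quoted_with_raw_nul();
    assert(pattern.size() == 8);                        // std::string keeps it all
    assert(as_seen_through_c_str(pattern) == "foo\\");  // the C view stops at NUL
}
```

So whatever gets compiled, it is not the pattern that QuoteMeta returned: everything from the null onwards is invisible to the C side.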
It seems that pcre_compile doesn’t offer a version with a specified string length, so there’s no way around this without printable-quoting the null. This is fine in libpcre since the behaviour is obvious. Maybe not so fine in pcrecpp, since it is common to use std::string as a non-printable data container (it is very common, but maybe it is a “bad thing”™?). I think it is a bug in pcrecpp, but it could be a “document the caveat” bug rather than a “do magic to make it work” bug.
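For what “printable-quoting the null” might look like, here is a hedged sketch of a workaround wrapper. `quote_meta_nul_safe` is a hypothetical name, and this is not what pcrecpp does; it uses the same simplified ASCII quoting rule as Perl’s quotemeta, except that a null byte is written as the four printable characters `\x00`. I chose `\x00` over `\0` on the assumption that a following digit could otherwise be read as part of an octal escape.

```cpp
#include <cassert>
#include <string>

// Hypothetical workaround, not part of pcrecpp: quote like quotemeta, but
// emit a null byte as the printable escape \x00 so the result survives
// being passed around as a null-terminated C string.
std::string quote_meta_nul_safe(const std::string& unquoted) {
    std::string out;
    out.reserve(unquoted.size() * 2);
    for (unsigned char c : unquoted) {
        if (c == '\0') {
            out += "\\x00";  // backslash, 'x', '0', '0'
            continue;
        }
        const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                             (c >= '0' && c <= '9') || c == '_';
        if (!is_word && c < 0x80) {
            out.push_back('\\');
        }
        out.push_back(static_cast<char>(c));
    }
    return out;
}

void check_quote_meta_nul_safe() {
    std::string input("foo");
    input.push_back('\0');
    input.append("bar");
    const std::string quoted = quote_meta_nul_safe(input);
    assert(quoted == "foo\\x00bar");
    // No embedded null, so the c_str() view is the whole pattern.
    assert(std::string(quoted.c_str()) == quoted);
}
```

The result is the same pattern as the “manual quoted” string in the test program, which is the one that matched.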
Looks like pcrecpp.cc is actually part of upstream libpcre. Should file a bug I guess.