ActiveBlog

Rx 2.0 for Komodo 6.0
by Eric Promislow

November 2, 2010

Komodo's Rx Toolkit is a debugger for pattern-matching. It lets you tweak patterns (or regular expressions, also known as "regexes", hence the "Rx") against subject text, showing you the results interactively. You can choose to see a single match, all matches, replacements, or even the result of "splitting" the string on the regex. While very powerful, patterns can quickly become complex and unwieldy, so it's useful to have a tool to try them out in isolation.
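
To make those four modes concrete, here is a minimal sketch using Python's standard re module. The pattern and subject are made up for illustration and are not taken from the Toolkit.

```python
import re

# Hypothetical sample data, for illustration only.
pattern = r"\d+"
subject = "3 apples, 12 pears, 7 plums"

first = re.search(pattern, subject)         # a single match: the first one
everything = re.findall(pattern, subject)   # all matches
replaced = re.sub(pattern, "N", subject)    # replacements
pieces = re.split(r",\s*", subject)         # splitting the subject on a regex

print(first.group(0))   # 3
print(everything)       # ['3', '12', '7']
print(replaced)         # N apples, N pears, N plums
print(pieces)           # ['3 apples', '12 pears', '7 plums']
```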

Rx Toolkit Screencast

Here's a short demonstration of the Rx Toolkit, specifically highlighting the new language-specific regex parsing.

Figure 1 shows a typical case of using the Rx Toolkit to work out a pattern for picking values out of HTML. Although we always recommend using a proper HTML parsing library for this task in general, I can easily see a case where you've made an AJAX call and would like to use JavaScript to pull out some values from an incoming HTML payload. Figure 1 shows how we're verifying that JavaScript can process the rather complex expression on the sample input.

Figure 1: Rx Toolkit Window

As of Komodo 6, the Rx Toolkit now lets you specify which language to use to test the patterns. In previous versions, much like the Model T, you could use any language as long as it was Python. This made sense since much of Komodo is implemented in Python, and the Rx language supported by Python is a large subset of the generally agreed industry standard, with good performance characteristics.

However, there were two problems. First of all, there isn't one standard regex language, even if Appendix B of O'Reilly's "C# Essentials" looks a lot like the most commonly used 95% of Perl's regex syntax. Each language has slight differences, and it would be far more beneficial for the tool to recognize that.

"Let's take this offline"

Second, not only was Komodo using its loaded Python interpreter to evaluate the submitted patterns, it was doing this in-process. This meant that a pattern like...

(?:(?:(?:(?:.+).+).+).+)zzz

...could tie Komodo up for a long time if the subject was long enough and didn't contain "zzz". We wanted the IDE to be more forgiving of user error, which required moving the evaluation step to a separate process. In fact, we shipped that feature in version 5.2 as a bug fix, but it also paved the way for letting the user choose which language to use.
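
The general idea of evaluating a pattern out of process can be sketched in a few lines of Python. This is not Komodo's actual evaluator: the function name, the JSON hand-off, and the timeout values are all assumptions chosen for illustration.

```python
import json
import subprocess
import sys

# Child program: reads {"pattern", "subject"} as JSON on stdin and writes
# the span of the first match (or null) as JSON on stdout.
_CHILD = r"""
import json, re, sys
req = json.load(sys.stdin)
m = re.search(req["pattern"], req["subject"])
json.dump(list(m.span()) if m else None, sys.stdout)
"""

def search_out_of_process(pattern, subject, timeout=5.0):
    """Return the (start, end) span, None for no match, or "timeout"."""
    request = json.dumps({"pattern": pattern, "subject": subject})
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD],
            input=request, capture_output=True, text=True,
            timeout=timeout, check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the runaway child at this point.
        return "timeout"
    span = json.loads(done.stdout)
    return tuple(span) if span else None

quick = search_out_of_process(r"z+", "a zzz b")
slow = search_out_of_process(
    r"(?:(?:(?:(?:.+).+).+).+)zzz", "a" * 5000, timeout=1.0
)
print(quick)  # (2, 5)
print(slow)   # timeout
```

The calling process stays responsive however badly the pattern backtracks, because the only thing it waits on is a child process with a deadline.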

Rx Implementation differences

So the Rx Toolkit for 6.0 has a new language drop-down menu next to the Shortcuts menu that lets you choose which language to use for evaluating the regex. Currently we support JavaScript, Perl, PHP, Python, and Ruby, with plans to add Tcl in the future. The Shortcuts drop-down menu and the list of Modifiers now slightly change depending on which language you've chosen. For example, all the languages support an "Ignore case" option (although the syntax varies across languages), while only Python supports a locale-based modifier that, for example, changes the set of characters that match "\w".

The Shortcuts menu also now changes slightly with language choice. The problem with this menu is that even the language with the smallest Rx footprint, JavaScript, still supports a large micro-language. There are too many parts of the regex micro-language to easily show in a single-level dropdown menu, and we've long resisted going with a multi-level popup that would show every part of the language. Instead, the Help menu contains convenient links to the documentation for each supported language, and Google will point out a few cheat sheets people have shared (I personally used to use the aforementioned Appendix B as my physical cheat sheet, until I ended up internalizing the list; at four pages, it's one of the most concise but complete lists I've found).

Sometimes You Still Need a Debugger

While testing the new Rx Toolkit, I found that some languages exhibited more predictable behavior than others in the Toolkit. First of all, each language uses a different paradigm for its pattern-matching syntax. It's fully object-oriented in Python, function-driven in PHP, and mostly object-oriented in JavaScript.

In Perl, pattern-matching is operator-driven, while the results are stored in various pseudo-global variables. As in several other areas, Ruby is a hybrid of Perl's and Python's approaches.
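
As a small illustration of the object-oriented style in Python, the sketch below compiles a pattern object and reads results off match objects. The HTML fragment and the pattern are hypothetical.

```python
import re

# Hypothetical HTML fragment and pattern, for illustration only.
html = '<a href="/docs">Docs</a> <a href="/blog">Blog</a>'
link_re = re.compile(r'<a\s+href="([^"]*)">([^<]*)</a>')  # a pattern object

match = link_re.search(html)                   # a match object, or None
first_link = (match.group(1), match.group(2))
all_hrefs = [m.group(1) for m in link_re.finditer(html)]

print(first_link)  # ('/docs', 'Docs')
print(all_hrefs)   # ['/docs', '/blog']
```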

The Rx Toolkit acts more like a binary filter: either a particular pattern and subject match, or they don't. If you need to investigate further, it might be time to switch to the debugger. If you're starting out, and don't yet have much code to run, the interactive shell will work well (currently available for Perl, Python, and Ruby; the Komodo Developer Extension contains a great JS shell that can run inside either Komodo or Firefox, taking on the appropriate context).

Some notes on Rx in Perl

It's worth making a note on Perl, particularly given how so many people who choose Perl are using it for its pattern-matching capabilities. While the Rx Toolkit tries to hide many details from you, it doesn't hide all of them. For example, you would expect that the simple pattern...

}

...would match text containing a close-brace, and it does, for every language except Perl. When Perl is chosen, the toolkit relays this error message from the Perl interpreter:

    There is an error in your regular expression: Use of uninitialized
    value in regexp compilation at .../support/rxx/rxx_perl.pl
    line 122, <STDIN> line 1.

That admittedly isn't too helpful an error message. The problem here is that the Perl evaluator uses "{" and "}" to delimit the patterns, and gets confused by the unmatched brace inside the pattern. A future version of Rx Toolkit will allow selecting different delimiters, which is closer in spirit to how Perl works. Additionally, you have to watch out when trying to match characters like '$' and '@' in Perl, if they're followed by a letter. Perl will try to interpolate a variable, while other languages will treat '@' as is.
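
One way an evaluator could sidestep the delimiter clash is to pick a delimiter that doesn't occur in the pattern at all. The sketch below is hypothetical: the helper names and the candidate list are assumptions, and nothing here is taken from rxx_perl.pl.

```python
# Hypothetical helpers; the candidate list is an arbitrary choice.
CANDIDATES = ["/", "#", "!", "~", "%", ","]

def pick_delimiter(pattern, candidates=CANDIDATES):
    """Return the first candidate delimiter that does not occur in the pattern."""
    for delim in candidates:
        if delim not in pattern:
            return delim
    raise ValueError("every candidate delimiter appears in the pattern")

def as_perl_match(pattern):
    """Build the text of a Perl match operator (m followed by delimiters)."""
    delim = pick_delimiter(pattern)
    return "m" + delim + pattern + delim

print(as_perl_match("}"))     # m/}/
print(as_perl_match("a/b}"))  # m#a/b}#
```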

I ran into a more subtle issue with Perl. A customer reported that it looked like \Q and \E were getting ignored by the Perl evaluator, and I quickly reproduced the bug. These sequences act like "character quoters": most of the non-alphanumeric characters between these two sequences are escaped, meaning a "*" matches itself, instead of giving an occurrence count for its preceding sub-pattern.

It turns out that these aren't actually regex operators in Perl, but are in fact processed by the Perl compiler when it reads in any string that does interpolation (meaning a string delimited by double-quotes, "qq...", or "qr...", to give the most widely used delimiters). In other words, Perl's regex engine doesn't even know what \Q means, and it was issuing an error message. It's an easy fix for a future version, but I suspect that I'm not done with Perl for now. For one thing, other people have asked for more control over which delimiters Perl uses to wrap the regex.
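
Since the quoting happens before the regex engine ever sees the pattern, it can be emulated as a preprocessing step. Here is a rough Python sketch of that idea, using re.escape to do the quoting. The function name is made up, this is not the fix that went into Komodo, and re.escape does not escape exactly the same set of characters as Perl does.

```python
import re

def expand_quoting(pattern):
    r"""Replace each \Q...\E span with an escaped copy of its contents.

    Simplification: a \Q preceded by an escaped backslash is not special-cased.
    """
    return re.sub(
        r"\\Q(.*?)(?:\\E|\Z)",            # an unterminated \Q runs to the end
        lambda m: re.escape(m.group(1)),  # quote the metacharacters inside
        pattern,
        flags=re.DOTALL,
    )

expanded = expand_quoting(r"\Q1*2\E=\d+")
print(expanded)                            # 1\*2=\d+
print(bool(re.search(expanded, "1*2=2")))  # True
```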

... and PHP

On the subject of delimiters, I should also mention PHP. For reasons I haven't figured out, functions like PHP's preg_match want the pattern parameter to be delimited by "/"s. I was surprised by that — given that you invoke pattern-matching on PHP by calling preg_match($pattern, $subject ...), I didn't see why delimiters are needed for the $pattern variable.

But they are. So this means that if you're going to match a slash in PHP, you need to escape it. However, none of the other languages in the Toolkit need the slash escaped. So instead, the PHP evaluator escapes the "/"s for you. The side effect is that a pattern that already escapes the slash, like "a\/b", gets escaped a second time and triggers this error message from PHP:

    There is an error in your regular expression: PHP Warning:
    preg_match(): Unknown modifier 'b' in
    .../support/rxx/rxx_php.php on line 30

So you'd have to write the pattern as "a/b" in Rx, and then escape the "/" when copying the pattern into your code. This isn't great; we prefer people to be able to move patterns between Komodo and Rx seamlessly, and without worrying about the details under the hood. But I can see how eventually we'll be adding more options in the UI that give people more control over how the evaluators work.
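
The double-escaping problem is easy to reproduce outside PHP. In the Python sketch below, naive_delimit is only a guess at the kind of blanket escaping that would produce the error above (the actual rxx_php.php code isn't shown here), and careful_delimit is one hypothetical way to leave already-escaped slashes alone.

```python
import re

def naive_delimit(pattern):
    """Blindly escape every slash, then wrap the pattern in slashes."""
    return "/" + pattern.replace("/", r"\/") + "/"

def careful_delimit(pattern):
    """Escape only slashes that are not already backslash-escaped."""
    escaped = re.sub(
        r"(\\.)|/",                     # an escape pair, or a bare slash
        lambda m: m.group(1) or r"\/",  # keep pairs, escape bare slashes
        pattern,
        flags=re.DOTALL,
    )
    return "/" + escaped + "/"

print(naive_delimit(r"a\/b"))    # /a\\/b/  (the pattern ends early; "b" reads as a modifier)
print(careful_delimit(r"a\/b"))  # /a\/b/
print(careful_delimit("a/b"))    # /a\/b/
```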

Unicode Requirements Could Influence Language Choice

I found that every language gave me a different experience with handling Unicode characters. For this exercise, I used a small sample of Swedish text, and tried to find all the "long" words in it, where I defined "long" to mean at least 10 letters long.

The text:

Naturligtvis kan de tillfälliga lägenheterna användas
i andra sammanhang, t ex som personalbostäder vid
tillfälliga arbetsplatser. Evakueringslägenheterna
levereras tomma

The pattern:

\b(\w{10,})\b

The results:

Language    Version  Results
JavaScript  1.8      Failed, no Unicode option
Perl        5.10     Succeeded
PHP         5.3      Succeeded with Unicode option on
Python      2.6      Succeeded with Unicode option on; locale might work
Ruby        1.9.2    Failed, even with Unicode option on

Notice how JavaScript treats the position between a "ä" and an ASCII letter as a boundary.

Figure 2: Rx JavaScript Unicode

This shows that both JavaScript's and Ruby's regex engines are ASCII-focused, while later versions of Perl handle Unicode transparently. With PHP, I need to turn on the Unicode flag (which puts a "u" after the trailing "/" delimiter). Without it, the Toolkit sometimes shows artefacts of the UTF-8 encoding, where a Unicode character is split into its separate UTF-8 bytes. Python offers more flexibility: if I'm running in a Latin-1 locale, turning on the Locale option would match characters like "ä". Python's Unicode option is more of a blowtorch, causing \w to match a word character from any language Unicode covers, not just the current locale's.
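
The ASCII-versus-Unicode difference can be reproduced with the same text and pattern in Python. Note the assumption: this sketch targets Python 3, where Unicode matching is the default for str patterns and re.ASCII switches it off, which is the reverse of the Python 2.6 behavior in the table above.

```python
import re

text = (
    "Naturligtvis kan de tillfälliga lägenheterna användas "
    "i andra sammanhang, t ex som personalbostäder vid "
    "tillfälliga arbetsplatser. Evakueringslägenheterna "
    "levereras tomma"
)
pattern = r"\b(\w{10,})\b"

unicode_words = re.findall(pattern, text)          # Python 3 default for str
ascii_words = re.findall(pattern, text, re.ASCII)  # \w and \b see ASCII only

print(unicode_words)
print(ascii_words)
```

In ASCII mode the "ä" counts as a non-word character, so words are cut at it: "lägenheterna" only yields the fragment "genheterna", and "tillfälliga" drops out entirely because neither half reaches 10 letters.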

The shot below shows how Perl matches the non-ASCII characters. Note how here the Unicode option is disabled; Perl doesn't support turning Unicode off.

Figure 3: Rx Perl Unicode

Category: komodo
About the Author

Eric Promislow is a senior developer who's worked on Komodo since the very beginning. He has an M.Sc. in Computing Science from Queen's University and a B.Sc. in Biophysics from the University of Ontario. Before joining ActiveState, he helped create the OmniMark text-processing language.

Comments

6 comments for Rx 2.0 for Komodo 6.0
Permalink

For reasons I haven't figured out, functions like PHP's preg_match want the pattern parameter to be delimited by "/"s.

Not really. You can use any character as delimiter. Perhaps you're thinking of JavaScript :-?

I was surprised by that — given that you invoke pattern-matching on PHP by calling preg_match($pattern, $subject ...), I didn't see why delimiters are needed for the $pattern variable.

So you can provide modifiers:

/[a-z]{3}/Ui

But they are. So this means that if you're going to match a slash in PHP, you need to escape it.

Or pick another delimiter:

@^/foo/bar@Ui

Permalink

Yes, you can use other delimiters besides '/' for any of the PHP preg_* functions. A "feature" that I classify as a very subtle bug cost me a couple of hours yesterday: the delimiters do not have to be the same character, they just have to be valid delimiter characters. So, for example, if I mistakenly pass '<!-- TERM: foo -->' as my RE, libpcre happily uses the '<' and '>' as delimiters, meaning that they will not be replaced. FAIL.

Permalink

All the more reason Rx Toolkit really needs that "supply your own delimiters" feature.

- Eric

Permalink

Thanks, Álvaro,

I figured out why preg_replace requires delimiters, but
didn't realize that you can choose different ones (like
in Perl). In JavaScript, you create a regex object
either by using slashes:

var re = /^.../;

or by calling a constructor, which requires
escaping backslashes:

var re = new RegExp(pattern, options);

I've added bug http://bugs.activestate.com/show_bug.cgi?id=88637 to implement this.

- Eric

Permalink

Eric mentions this in the post: "Currently we support JavaScript, Perl, PHP, Python, and Ruby, with plans to add Tcl in the future."