ActiveBlog

Solving the Wedged Web App Problem with Komodo's Rx Toolkit
by Eric Promislow

Eric Promislow, July 18, 2011

As part of the beta launch for ActiveState's Stackato cloud platform, we've written a Rails app that tracks the tweet activity on some of the players in the upcoming American presidential election (less than 16 months away as I write this!). You can read more about CandidateBuzz here.

This post is on how I quickly solved a technical problem with one of my favorite development tools. The crux of the back-end work on this app was parsing a Twitter flow, and culling the incoming tweets to separate each unique tweet from a barrage of copies and retweets. For example, I wanted to discover that the second tweet in this list was derived from the first:

Using Komodo and Stackato to build a cool web app
http://bit.ly/jcRTvD #candidateBuzz

RT @cnn RT @TheKomodoKid: Using Komodo and Stackato to build a cool web app
http://bit.ly/jcRTvD #candidateBuzz #thisiscool

Occasionally, the backend would hang, or as one of the testers diplomatically put it, "it would get wedged", and I would have to restart the application. That sent me on a hunt for the cause.

First, I realized that I hadn't done a thorough RTFM. Cloud Foundry, the basis for Stackato, runs Rails apps by default with WEBrick, a web server written in pure Ruby. WEBrick isn't the fastest way to run a Rails app, but I had never questioned its stability. I added "gem 'thin'" to the app's Gemfile, restarted the app, worked on some other things, and waited for a hang. Out of curiosity, I ran a 'tail -f' in a window that was monitoring the flow of parsed tweets, and left it open. At some point, I saw a tweet that contained this nugget:

Cher Has Had It Up To Her With Michele Bachmann [Tweet Beat]:
\t\t\t\t\t\t\t\t\t\t
\t\t\t\t\t
\t\t\t\t\t\t
\t\t\t\t\t\t\t\t\t
\t\t\t\tToday in Tw... http://bit.ly/pD0qup

Each '\t' denotes a tab character, and the link is real (NSFWDOWYW). And the tail program was paused. There was no other activity, though I was expecting a fast flow of parsed tweets and diagnostic messages. I had found the wedge point!

Now, I hadn't used the Komodo Rx Toolkit to develop the patterns I used to write the tweet parser. I've been doing pattern-matching for so long that sometimes the regexes feel like they write themselves. But I was suspicious, so I fired it up and pasted in the tweet above. I then pasted this tweet-parsing regex into the toolkit (making sure the verbose and single-line options were on):

\A(\s*(?:(?:RT\b[\s:]*)?(?:@[a-zA-Z][\w\-.]*[,:\s]*))*)
(.*?)
((?:http:\/\/.*?\/\S+|[\#\@][a-zA-Z][\w\-.]*|\s+)*)\Z

In English, I'm looking for all the "RT" indicators at the start of a tweet, followed by the body of the tweet, followed by any links, hashtags, and mentions of other users. I found I couldn't count links as part of a message, because some people will copy a tweet, regenerate their own private shortened URL of a link, and then republish the tweet as their own. I could have gone to the trouble of using timestamps to determine which tweet was the true original, but I didn't care; I just wanted to make sure I wasn't recording two uniques where one was a copy.
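As a rough illustration of that three-part split, here is a minimal Python sketch that compiles the pattern above with the verbose and single-line (DOTALL) flags and uses the middle group as a de-duplication key. The function names, the whitespace and case normalization, and the sample tweets (modeled on the pair shown earlier) are assumptions made for the example, not the app's actual code, which was written in Ruby.

```python
import re

# The tweet-parsing pattern as given above. The comments inside the
# pattern are allowed because of re.VERBOSE.
TWEET_RE = re.compile(r"""
    \A(\s*(?:(?:RT\b[\s:]*)?(?:@[a-zA-Z][\w\-.]*[,:\s]*))*)  # leading RT and @user prefixes
    (.*?)                                                     # body of the tweet
    ((?:http:\/\/.*?\/\S+|[\#\@][a-zA-Z][\w\-.]*|\s+)*)\Z     # trailing links, hashtags, mentions
""", re.VERBOSE | re.DOTALL)


def body_key(tweet):
    """Return a normalized copy of the tweet body (hypothetical helper)."""
    match = TWEET_RE.match(tweet)
    body = match.group(2) if match else tweet
    # Assumption: collapse whitespace and ignore case when comparing bodies.
    return " ".join(body.split()).lower()


def unique_tweets(tweets):
    """Keep the first tweet seen for each distinct body, in arrival order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = body_key(tweet)
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


ORIGINAL = ("Using Komodo and Stackato to build a cool web app\n"
            "http://bit.ly/jcRTvD #candidateBuzz")
RETWEET = ("RT @cnn RT @TheKomodoKid: Using Komodo and Stackato to build "
           "a cool web app\nhttp://bit.ly/jcRTvD #candidateBuzz #thisiscool")

if __name__ == "__main__":
    # Only the original should survive, since both share the same body.
    print(unique_tweets([ORIGINAL, RETWEET]))
```

Because links and hashtags fall into the third group, a copy that swaps in a different shortened URL still produces the same key.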

Back from that digression, I used the Rx Toolkit language selector to have it evaluate the expression with Ruby, and watched the Results box say "Pattern-match in progress...", just like you see in the screenshot. When I switched the language to Python, I got the same behavior. PHP and JavaScript both told me "Your regular expression does not match the search text". And Perl, the granddaddy of text-parsing languages, came up with the correct result immediately.

A possible culprit was the "near-infinite backtracking" problem, where the algorithm constantly backtracks to the same sub-pattern. This can happen when you're looking for something like

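((x*)+)y$

where the text has a large number of "x"s but no final "y", and the algorithm doesn't check for the "y" first. Without that check, a backtracking engine tries every way of splitting the run of "x"s between the inner and outer repetition before giving up. The pattern did contain something like that, so I tried this revision: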
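To get a feel for how quickly this blows up, the following sketch times the nested-quantifier pattern from above against strings of "x"s with no trailing "y". It assumes a backtracking engine such as Python's re module; the function name and the string lengths are arbitrary choices for illustration, and the timings will vary by machine and interpreter version.

```python
import re
import time

# The nested-quantifier pattern quoted in the text.
NESTED = re.compile(r"((x*)+)y$")


def time_failed_match(n):
    """Return the seconds spent failing to match n 'x's with no final 'y'."""
    text = "x" * n
    start = time.perf_counter()
    result = NESTED.match(text)
    elapsed = time.perf_counter() - start
    # There is no 'y' in the text, so the match must fail.
    assert result is None
    return elapsed


if __name__ == "__main__":
    # Keep n small: on a backtracking engine each extra 'x' can roughly
    # double the work, so large values may appear to hang.
    for n in (6, 10, 14, 18):
        print(n, f"{time_failed_match(n):.6f}s")
```

When the text does end in "y", the same pattern matches right away; it is the failing case that forces the engine through every split.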

\A(\s*(?:(?:RT\b[\s:]*)?(?:@[a-zA-Z][\w\-.]*[,:\s]*))*)
(.*?)
((?:http:\/\/.*?\/\S+|[\#\@][a-zA-Z][\w\-.]*|\s)*)\Z

...and Ruby and Python both found the answer immediately, and even PHP found it. (If you don't see the difference, it's near the end of the regex, where I changed "...\s+)*" to "...\s)*", so a run of whitespace can only be consumed one way.) I pushed the changes to my local web server, ran it for a while, saw that it had successfully processed the Cher tweet, pushed the change to the main server, and CandidateBuzz has been running without needing a restart since that point.
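For readers who want to try the revised pattern outside the Rx Toolkit, here is a small Python sketch that runs it against a reconstruction of the Cher tweet. The tab counts follow the listing earlier in the post, but the exact original bytes are an assumption, as are the names REVISED_RE, CHER_TWEET, and parse_tweet. The earlier "\s+" version is deliberately not run here, since it is the one that hung.

```python
import re

# The revised pattern: the last alternative is a single \s instead of \s+.
REVISED_RE = re.compile(r"""
    \A(\s*(?:(?:RT\b[\s:]*)?(?:@[a-zA-Z][\w\-.]*[,:\s]*))*)
    (.*?)
    ((?:http:\/\/.*?\/\S+|[\#\@][a-zA-Z][\w\-.]*|\s)*)\Z
""", re.VERBOSE | re.DOTALL)

# Reconstructed from the listing in the post (assumed, not the raw tweet).
CHER_TWEET = (
    "Cher Has Had It Up To Her With Michele Bachmann [Tweet Beat]:\n"
    + "\t" * 10 + "\n"
    + "\t" * 5 + "\n"
    + "\t" * 6 + "\n"
    + "\t" * 9 + "\n"
    + "\t" * 4 + "Today in Tw... http://bit.ly/pD0qup"
)


def parse_tweet(text):
    """Split a tweet into (prefix, body, trailer), or return None."""
    match = REVISED_RE.match(text)
    return match.groups() if match else None


if __name__ == "__main__":
    prefix, body, trailer = parse_tweet(CHER_TWEET)
    print(repr(prefix))
    print(repr(body))
    print(repr(trailer))
```

The long runs of tabs end up inside the body group, and only the final space and link land in the trailer.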

So by using the Rx Toolkit, I quickly found out that my regex didn't behave the same way in five different languages, and I was able to fix it so that it would work correctly not only in Ruby, but in any other language, should I need to redeploy CandidateBuzz on a different framework.

Komodo IDE: Rx Toolkit

Category: komodo, stackato
About the Author

Eric Promislow is a senior developer who's worked on Komodo since the very beginning. He has an M.Sc. in Computing Science from Queen's University and a B.Sc. in Biophysics from the University of Ontario. Before joining ActiveState, he helped create the OmniMark text-processing language.

Comments

2 comments for Solving the Wedged Web App Problem with Komodo's Rx Toolkit

I too have saved tons of time by simply using the Rx Toolkit to play with regexes until I get them right. It is one of the best parts of Komodo IDE!

RxToolkit is awesome! I use it to make regexes and it has sped up the whole process remarkably. My computer, RxToolkit, and some jojoba oil on my keyboard = super fast coding ;) Thanks!
