Text Manipulation with Perl

Perl has often been called the Swiss Army knife of scripting languages, and one of its most common use cases is string manipulation. After all, it’s right there in the “backronym”: “Practical Extraction and Report Language.”

Perl’s rich and powerful regular expression (regex) engine makes it easy to do basic string crunching, and by adding a few community-created modules, you can manipulate all kinds of text-based resources. In fact, we’ve put together a Perl 5.28 runtime environment to support just that. You can get the Perl Text Processing Runtime for Linux and Windows by signing up for a free ActiveState Platform account.

 

Using Regex in Perl

Regular Expressions (regex) are one of the most common ways to parse text and extract or identify components. Use cases are almost infinite, but some of the most common ones include verifying that an email address is valid, or that a password conforms to your requirements. 

When you create a regex, you’re really just defining a search pattern. Some of the most common metacharacters and character classes used to build regexes include:

  • . match any character except newline
  • \w match “word” characters
  • \d match digit
  • \s match whitespace
  • \W match non-“word” characters
  • \D match non-digits
  • \S match non-whitespace
  • [abc] match any of a, b, or c
  • [^abc] match anything that is not a, b, or c
  • [a-g] match any character from a – g (inclusive)
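
As a rough illustration of how these classes combine, here is a minimal sketch that pulls a few fields out of a line of text. The sample line and the variable names are invented for this example, and the email pattern is deliberately loose, not a full address validator.

```perl
use strict;
use warnings;

# Hypothetical input line, invented for this illustration.
my $line = 'Order 1234 shipped to bob@example.com on 2019-06-01';

# \s+ matches the whitespace after "Order"; (\d+) captures a run of digits.
my ($order_id) = $line =~ /Order\s+(\d+)/;

# \d{4}-\d{2}-\d{2} matches a date written as year-month-day.
my ($date) = $line =~ /(\d{4}-\d{2}-\d{2})/;

# [^\s\@]+ is a negated class: one or more characters that are
# neither whitespace nor "@".
my ($user, $domain) = $line =~ /([^\s\@]+)\@([^\s\@]+)/;

# With the /g flag in list context, the match returns every run of
# "word" characters in the line.
my @words = $line =~ /\w+/g;

print "order=$order_id date=$date user=$user domain=$domain\n";
print scalar(@words), " word-like tokens\n";
```

Capturing with parentheses in list context, as above, avoids having to read $1 and $2 after the match, and leaves the variables undefined when the pattern does not match.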

While regex capabilities are built into the Perl language, creating and maintaining regexes is a pain. This is why resources like the Top 15 Commonly Used Regex exist. 

 

HTML Parsing in Perl

HTML parsing lets you recognize hypertext markup and either strip it out or separate it from the content (i.e., text) of Web pages. The main use case for HTML parsing is “web scraping,” whereby a Web site’s content is extracted, typically for purposes of analysis, indexing, searching or re-use.

Because parsing HTML with regexes is the path to madness, Perl modules like HTML::Parser were written to recognize both standard and non-standard HTML, simplifying the extraction of data from HTML documents.

The following example (from the HTML::Parser doc) shows how you can use HTML::Parser to print out any text found within an HTML element (in this case, title) of an HTML document:

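use strict;
use warnings;
use HTML::Parser ();

# Called for every start tag; does nothing unless the tag is <title>.
sub start_handler
{
  return if shift ne "title";
  my $self = shift;
  # Inside <title>: print the decoded text, and stop parsing at </title>.
  $self->handler(text => sub { print shift }, "dtext");
  $self->handler(end  => sub { shift->eof if shift eq "title"; },
                         "tagname,self");
}

my $p = HTML::Parser->new(api_version => 3);
$p->handler( start => \&start_handler, "tagname,self");
# The HTML file to parse is given as the first command-line argument.
$p->parse_file(shift || die) || die $!;
print "\n";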

While this is a fairly straightforward example that’s unlikely to be tripped up by many Web pages, the point of using HTML::Parser is to avoid, as much as possible, having to deal with the underlying HTML markup yourself.
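
For the web scraping use case mentioned earlier, a minimal sketch along the same lines might collect the link targets in a page. It assumes HTML::Parser is installed; the HTML snippet and the @links array are invented for this illustration, and the document is parsed from a string instead of a file.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical HTML snippet, invented for this illustration.
my $html = '<html><body><p>See <a href="https://example.com/a">A</a> '
         . 'and <a href="/b">B</a>.</p></body></html>';

my @links;

# Register a start-tag handler that receives the tag name and a hash
# reference of the tag's attributes.
my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            push @links, $attr->{href}
                if $tagname eq 'a' && defined $attr->{href};
        },
        "tagname,attr",
    ],
);

$p->parse($html);   # feed the document (can be called repeatedly with chunks)
$p->eof;            # signal end of input so buffered text is flushed

print "$_\n" for @links;
```

The handler never inspects raw markup; it only sees tag names and attribute values that the parser has already decoded.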

 

JSON Manipulation in Perl

If XML is readable in the same way that sand is edible, then JSON is like taffy: you have to chew on it for a while before you can digest it. To simplify working with JSON, you may want to extract just the relevant info you need, or perhaps transform it into something like a Perl data structure so you can more easily query it.

The JSON::MaybeXS module (distributed as JSON-MaybeXS) provides you with a simple way to encode and decode JSON to/from Perl data structures:

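use JSON::MaybeXS;  # exports encode_json and decode_json by default
my $json_text = encode_json($data_structure);       # Perl data -> JSON text
my $data_structure = decode_json($json_text);       # JSON text -> Perl data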
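
To show the round trip end to end, here is a minimal sketch that decodes a JSON string, reads a few values out of the resulting Perl structure, and encodes it again. So that it runs without installing anything, it uses the core JSON::PP module in place of JSON::MaybeXS; both export encode_json and decode_json, so changing the use line should be the only difference. The sample JSON and variable names are invented for this illustration.

```perl
use strict;
use warnings;
use JSON::PP;   # core module, swapped in here for JSON::MaybeXS

# Hypothetical sample document, invented for this illustration.
my $json_text = '{"name":"widget","tags":["red","small"],"stock":{"count":3}}';

# decode_json turns JSON text into nested Perl hash and array references.
my $data = decode_json($json_text);

# JSON objects become hash references, JSON arrays become array references.
my $name      = $data->{name};
my $first_tag = $data->{tags}[0];
my $count     = $data->{stock}{count};

print "$name has $count in stock, first tag: $first_tag\n";

# Modify the structure, then serialize and parse it again to confirm
# the change survives the round trip.
$data->{stock}{count} += 2;
my $round_trip = decode_json(encode_json($data));
print "new count: $round_trip->{stock}{count}\n";
```

Once the JSON is a plain Perl structure, querying it is ordinary hash and array access, which is usually easier than picking values out of the raw text.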
 

Next Steps

Besides the ones highlighted in this post, there’s a wealth of community modules that can help simplify just about any task associated with manipulating, extracting and transforming strings. In the Perl Text Processing Runtime we’ve included some of the most common ones, such as:

  • HTML-Tree – lets you build and scan parse-trees of HTML. 
  • PPI – lets you parse, analyze and manipulate the Perl language itself
  • PPIx-Regexp – lets you parse regular expressions in a manner similar to the way the PPI package parses Perl
  • Test2-Suite – lets you write tests for your data manipulation routines 
  • Text-Balanced – lets you extract delimited text sequences from strings
  • Text-CSV_XS – lets you parse and create Comma-Separated Value (CSV) strings and files. Importantly, it handles fields with commas in the field value itself properly.
  • Text-ParseWords – lets you parse text in the same way that a shell does, and create an array of tokens or array of arrays
  • XML-Parser – lets you parse XML documents
  • XML-XSLT – lets you process XSLT (eXtensible Stylesheet Language Transformations) so you can transform XML docs into HTML pages, text files, etc
  • Text-Autoformat – lets you format text in a myriad of ways
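
As a small taste of one of these, here is a minimal sketch of shell-style tokenizing with Text::ParseWords, which also ships with core Perl. The command line and sample lines are invented for this illustration.

```perl
use strict;
use warnings;
use Text::ParseWords;   # exports shellwords and nested_quotewords by default

# Hypothetical command line, invented for this illustration.
my $command = q{grep -i "hello world" 'my file.txt'};

# shellwords splits on whitespace but keeps quoted strings together,
# removing the quotes themselves.
my @tokens = shellwords($command);
print "[$_]\n" for @tokens;

# nested_quotewords does the same for several lines at once and
# returns one array reference per input line (an array of arrays).
my @lines = ('alpha "beta gamma"', 'delta epsilon');
my @rows  = nested_quotewords('\s+', 0, @lines);
print scalar(@{ $rows[0] }), " tokens on the first line\n";
```

The second argument to nested_quotewords controls whether the quotes are kept in the returned tokens; passing 0 strips them.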

If you want to test out any of these libraries, the easiest way is to just download our Perl 5.28 runtime for Linux and Windows by signing up for a free ActiveState Platform account. 
If you want more ways to practice working with text in Perl, or if you just need a few more examples, a good resource is the Perl CLI Text Processing repository on GitHub.

Dana Crane

Experienced Product Marketer and Product Manager with a demonstrated history of success in the computer software industry. Strong skills in Product Lifecycle Management, Pragmatic Marketing methods, Enterprise Software, Software as a Service (SaaS), Agile Methodologies, Customer Relationship Management (CRM), and Go-to-market Strategy.