Adding New Languages to Komodo with UDL

Kid: Adding a New Language to Komodo with UDL

-1. This Tutorial doesn't work with Komodo 4.2

The API changed slightly when 4.2 was released, but wasn't quite correct.  We're working on getting it operational again.  Apologies for any inconvenience.

0. Accompanying Files

 

This tutorial works best with two zip files.  You'll find the first one in your Komodo install area, at <komodo-install>/lib/support/luddite/luddite.zip (Windows), <komodo-install>/lib/support/luddite/luddite.tar.gz (Linux), or .../Contents/SharedSupport/luddite/luddite.tar.gz on OSX. Unpack the zip or tar file into a work area outside the Komodo install area, since we'll be reusing many of the files we ship, or making slight modifications to them.

Then download kid.zip, which contains a final version of the files this tutorial walks you through. The first zip/tgz file creates a directory called "luddite-1.0.0".  Copy the kid.zip file into that directory, and unzip it by running unzip -a kid.zip.

For those who prefer shell commands:

$ mkdir -p ~/play  # scratch directory
$ cd ~/play
$ tar xfz komodo-installdir/lib/support/luddite/luddite.tar.gz
$ cd luddite-1.0.0
$ wget https://downloads.activestate.com/pub/staff/ericp/kid.zip
$ unzip -a kid.zip

1. UDL: The Back Story

It's well-understood that a template-based system is a better way of delivering web content than writing pure code.  Tasks are broken down into separate files, so the programmers can work on code, the writers in HTML or XML, and the designers with CSS. Usually you need a few files to glue together the various pieces.  The concept is still new enough that there is no universal agreement on a name for the nature of the contents of these glue files, but "template languages" seems to be the most common.

When we started work on Komodo, around 2000, PHP was the only commonly-used open template language ("open" allows me to leave proprietary systems like JSP, ASP, and ColdFusion out of the discussion). The editor Komodo is based on, Scintilla, actually bundles HTML and PHP processing in the same colorizing module.  But over the years we saw new multi-component languages get released, and enthusiastic users discovering that this was exactly what they needed to deliver dynamic web content quickly. During the last couple of years we've seen near-instant uptake of emerging high-level frameworks like Rails, Django, and CakePHP.

These template languages are relatively easy to build.  Put together an XML or HTML parser, maybe an XPath processor, and a language that has an eval() statement, and you have an instant template language (if one that is inherently insecure, with that eval statement).  If the language can be freely embedded in a web server like Apache, so much the better.  Replace the eval() statement with pattern-matching and a managed symbol table, and you have something other people can use.

So we envision that more and more template languages will be created. Communities of avid fans will grow around them, and some of them will be looking for a GUI-type editor to manage them in.

Now we knew that the Scintilla HTML/PHP lexer wouldn't be satisfactory for all these other template languages.  These lexers are important parts of Komodo.  They're invoked on almost every keystroke, and they try to update the style of each visible character as quickly as possible.

Previously, we had two options to support new languages:

  1. make a modification to the HTML/PHP lexer (it currently also has some support for ASP, VBScript, and Embedded Python), and associate new languages with it
  2. write a new lexer

Neither solution was very attractive.  Template languages can be complex, with many edge cases that we wanted to handle well. Each template language can have a few idiosyncrasies, and it's important to handle them correctly to create a smooth editing experience.

Meanwhile, most template languages also have many common features. They all have a markup base, which we could safely assume to be XML, XHTML, or HTML.  If they're (X)HTML-based, they should support embedded CSS and JavaScript.  They usually embed a "server-side language" ("SSL" for short) -- this is the code that is run on the server, emitting the HTML at run-time.  And they all have transition points that indicate when to move from HTML to the SSL, and when to return. Some of them even have their own little language on top of the SSL -- Perl's Template-Toolkit and Smarty/PHP are two common examples.

The other problem with either of those two approaches was that they both entailed a rebuild of Komodo, which meant that a new feature wouldn't be available until the next release.  We wanted to make this a Komodo extension point, so someone using a language like Liquid [1] would be able to add an extension to Komodo, and take advantage of as many language-aware features in Komodo as possible.  This obviously covered syntax coloring, but we wanted to include auto-indenting, code-folding, selection-commenting, and as much as possible, code-completion.  We did achieve all of the above, and are working out how to provide hooks to Komodo's syntax checking and debugging sub-systems for a future release.

And we wanted people to be able to do this without having to know anything about Scintilla, how lexers work, or even access to a C++ compiler.

I had done enough work with two complicated lexers, for Perl and Ruby, and fixed bugs in various others, that I saw recurring patterns in lexers that every language needed to implement.  Some of these patterns were simple (comments start with "#" or "//", and end at the end of line; numeric constants start with a digit, and have an optional fractional and exponent part); some were more complex (strings in Perl can start with "q" followed by a non-alphanumeric character, and end with that same character, but don't forget to handle backslash escapes; "old-style" HTML attribute values are not always quoted); some were hard enough that we haven't implemented them yet (here-documents). But what I did see was that all these lexers could be implemented by specifying a set of patterns that are applicable at certain states.

We set out to design a language that would let people focus on the high-level aspect of building these lexers: first break up each template language into a set of sub-languages, called families. Then break up each family into a set of states (are we in a comment, string, or numeric constant?).  And finally assign to each state a set of patterns we're looking for, with actions to perform when one of them matches.  These actions would assign styles to a range of text, and possibly move to a different state.

Then we needed a way of getting this information into Komodo. Conveniently enough, we introduced a Firefox-like extension mechanism with version 4.   The idea was to take this high-level description of a new language, and run it through a process that would create a .xpi file (or cross-platform installer, pronounced "zippy").  Add it to Komodo, and you'd have integrated your new language.

This technology is called UDL, for User-Defined Languages.  UDL consists of three parts:

  1. a language called Luddite (for "Language for User-DefineD Integrated TEmplating systems", of course), that is used to specify lexers
  2. a general-purpose lexer engine added to Scintilla, which converts compiled Luddite programs into screens full of syntax-highlighted code
  3. an XPI-generator tool

Komodo also ships with the source for all its UDL-based languages. You'll find it in either <komodo-install>/lib/support/luddite/luddite.zip (Windows), <komodo-install>/lib/support/luddite/luddite.tar.gz (Linux), or .../Contents/SharedSupport/luddite/luddite.tar.gz on OSX. Unpack the zip or tar file into a work area, since we'll be reusing many of the files we ship, or making slight modifications to them.  If you find you're typing a lot as you work through this tutorial, you probably didn't read this paragraph.

2. Creating a New Lexer for Kid

This article is intentionally different from the documentation you'll find in the Komodo help pages.  UDL and Luddite have served us well here at ActiveState -- writing a lexer for Liquid would take a few minutes.

But we intended UDL to be usable by people who didn't also design it.  Short of hitting the road, we've written this document as if you're working side by side with us.   We'll make typos early on, run into error messages and situations that don't work the way we want them to, diagnose the problems, and fix them.

One of our beta testers for Komodo 4 asked us when we were going to deliver syntax coloring for Kid.  At first we thought this would be a good test for UDL -- was the technology mature enough, and the documentation thorough enough -- that a third party could succeed with it? It seemed like a good test case; none of us had heard of Kid, but no one else outside the company was delivering a lexer.  Since we needed both a tutorial and a lexer for Kid, it seemed like an ideal case study.

The code samples in this document are different from the ones in the accompanying zip file. The zipped files are final, correct versions of the UDL code needed to implement Kid. The code snippets in this document contain several errors (some of which are intentional), so I could cover the error detection and correction process as well. Luddite's a young language, and this particularly shows in its error detection capabilities. Also, when you're targeting a GUI, sometimes the only indication of an error situation is silence. We'll see a few of those, so we'll show you where to look for clues, and how to put them together to solve the puzzle.

2.1 A Spec for a Kid Lexer

First, keep in mind that most of the work of building a Kid lexer has already been done.  It's based on XHTML, and Komodo ships with standalone lexers for HTML, JavaScript, and CSS, as well as modules that cover the transitions among these three languages (simplified because CSS and JavaScript don't directly interact).

So the next step was to find the description of the Kid language (at http://www.kid-templating.org/language.html [2]), and determine the remaining tasks we're faced with. From this page we see that Kid files consist of familiar HTML files with four places where Python code can appear:

  1. Between "python" processing-instructions (PIs): <?python ... ?>.   I downloaded the Kid source code, and saw that Kid works by reading an XHTML file with the ElementTree XML parser, doing a generic XML parse on it, and then finding the Python-specific parts after.  So this means that because the XML parser doesn't know about Python strings and comments, the first occurrence of "?>" will end a Python PI.  We'll want to implement our lexer to model this behavior.
  2. Attributes in the "http://purl.org/kid/ns#" namespace are deemed to contain Python code.  Luddite currently isn't namespace-aware, so we'll assume that the "py" prefix always maps to this namespace.  Komodo bug 53517 tracks this item.
  3. In any non-py attribute, or in text content, sequences of "${...}" contain Python expressions. "$$" is the escape sequence for a single '$'.
  4. Any expression starting with a '$', followed by '.'-separated identifiers, should be treated as a Python expression. I assumed the '$' had to be preceded by a non-identifier, although the code didn't make that check, as of version 0.9.4.  Making the check complicated the code, and I initially got it wrong, which led to a couple of useful lessons while testing the extension.  So I left it in.

$-expansion is turned off inside XML comments. The exception is for comments that start with "[" or "<![", or comments that end with "//". Luddite rules are constrained to one line at a time, so we won't be able to handle the final situation.  But the JavaScript and CSS modes are already relatively intelligent about these compatibility hacks.  The Luddite code accepts CDATA and XML-comment delimiters in these states, colors them as comments, and then stays in each family's default state.  You can try it in Komodo right now. Bring up an HTML file, add a script tag anywhere, and wrap the contents with either CDATA marked sections or XML comments.  Neither will affect the JavaScript highlighting.
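Returning to the first item in the list above: the following Python sketch shows why the lexer has to model the XML parser rather than Python's own string rules. It is not taken from Kid's source; the helper name pi_body and the regular expression are assumptions made for illustration. It ends a processing instruction at the first "?>", even when that sequence sits inside a Python string literal.

```python
import re

# Hypothetical helper: mimic a generic XML parser, which ends a
# processing instruction at the first "?>" it encounters.
PI_RE = re.compile(r"<\?python\b(.*?)\?>", re.DOTALL)

def pi_body(text):
    """Return the body of the first <?python ... ?> PI, or None."""
    m = PI_RE.search(text)
    return m.group(1) if m else None

source = '<?python x = "a ?> b" ?>'
body = pi_body(source)
print(repr(body))  # the body stops inside the string literal
```

The same pattern rejects the near-misses in the test plan, such as "<?pytho" and "<?python2", because of the word-boundary check after "python".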

2.2 Test Planning

Here's an outline of everything we expect to see working once we finish the lexer.  We of course test to make sure we don't get Python lexing outside the Python areas, and make sure that other features, such as auto-indentation, are working.

  • <?python ... ?> - kwd ident op strings
  • <?pytho ... ?> - verify failure
  • <?python2 ... ?> - verify failure
  • <?python* ... ?> - rejected by ElementTree (Kid's parser).
  • python content attributes:
    • double-quoted: verify types of tokens, single-quoted single and triple strings, and escape chars.
    • single-quoted: similar
  • ${...}:
    • on in normal attribute strings
    • off in Python content attributes - test by putting a string inside a string, e.g.: print "this is a stri${"inner string"}ng" ("inner" and "string" should be colored as identifiers)
    • on in text content
    • off in comments
    • off in PIs
    • off in CDATA sections
    • on in JavaScript code
    • on in CSS styles
  • $[...]: off
  • $$ works as an escape wherever ${...} is on
  • $foo.bar:
    • on in normal attribute strings
    • off in Python content attribute strings
    • on in text content
    • off in comments
    • off in PIs
    • off in CDATA sections
    • on in JavaScript code
    • on in CSS styles
  • auto-indentation:
    • works for HTML
    • works for CSS
    • works for JavaScript
    • works for Python
  • Code|Comment Region (and uncomment):
    • works for HTML
    • works for CSS
    • works for JavaScript
    • works for Python
  • code-completion:
    • works for HTML, CSS sections
    • multi-language Python code-completion wasn't ready for Komodo 4.0 (and due to the way we implemented multi-language completion, JavaScript isn't ready in Python-based UDLs).

2.3 Reuse

Since I said we wouldn't be typing much, let's look at the *-mainlex.udl files in the work directory, and pick a promising starting point.  The django-mainlex.udl file is a red herring; Django is implemented in Python, and works well with it, but Django template files don't contain any Python code.  Let's open up the rhtml-mainlex.udl file instead [3] [4].

The rhtml-mainlex.udl file defines the name of the language, which will show up in the Komodo UI.  This field is case-sensitive, and all language names are normally either all-caps for acronyms, or capitalized otherwise.

The rest of the file contains includes for the components that make up an RHTML file.  The good news is that we can use all the ones that don't contain "ruby" or "rhtml" in their names.  Our "mainlex.udl" file will contain include statements for all these files:

include "html2js.udl"
include "html2css.udl"
include "css2html.udl"
include "js2html.udl"
include "html.udl"
include "csslex.udl"
include "jslex.udl"

With that list, we have a working spec for an HTML lexer. By no coincidence, this is the full list of includes used for the html-mainlex.udl file.

So now we need to add Luddite files to cover the transitions from HTML, JavaScript, and CSS into Python, and back out.  And we'll need a file to describe the core Python lexer as well.  Our final kid-mainlex.udl file will then look like this:

language Kid

include "html2js.udl"

include "html2css.udl"

include "kid/html2python.udl"  #*

include "kid/css2python.udl" #*
include "css2html.udl"

include "kid/js2python.udl"  #*
include "js2html.udl"

include "kid/python2html.udl" #*

include "html.udl"
include "csslex.udl"
include "jslex.udl"
include "pythonlex.udl" #*

The files that need to be written have a "#*" comment.  The transition files are going in the "kid" subdirectory to avoid future collisions with transition rules for other Python-based template languages we might want to support.

So let's work down the list.  First we'll look at the html2ruby.udl file to get an idea of what we need to do to handle Python:

# html2ruby.udl
family markup

# Precondition: we already painted everything before the
# '<' that brought us here.

state IN_M_STAG_EXP_TNAME:
'%#' : => IN_TPL_BLOCK_COMMENT_1
/%=?/ : paint(include, TPL_OPERATOR), spush_check(IN_M_DEFAULT) => IN_SSL_DEFAULT

state IN_M_STAG_ATTR_DSTRING:
/<%=?/ : paint(upto, M_STRING), paint(include, TPL_OPERATOR), spush_check(IN_M_STAG_ATTR_DSTRING) => IN_SSL_DEFAULT

state IN_M_STAG_ATTR_SSTRING:
/<%=?/ : paint(upto, M_STRING), paint(include, TPL_OPERATOR), spush_check(IN_M_STAG_ATTR_SSTRING) => IN_SSL_DEFAULT

The html2ruby.udl file also contains code to handle RHTML comments.  Kid doesn't have those, so I removed that part.

Recall that at its essence, Luddite code describes a state machine.  You give each state an arbitrary name, specify which strings and patterns to match when we're in that state, and a list of commands specifying what to do next.  The two most common commands are "paint" and "=>", which specifies the state to change to.  If no actions are given, UDL stays in the same state, starting at the character following the matched sequence.  If no condition in a given state fires, UDL consumes the current character, and moves on to the next.

The paint action needs a bit of explanation as well.  paint(upto, ...) assigns the specified style (like the "M_STRING" above) to all the unassigned text up to the point where the current match started.  paint(include, ...) gives all the text up to and including the current match the style.
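A toy Python model may make the two paint modes concrete. This is only a sketch of the behavior described above, not the UDL engine: the Painter class and its fields are hypothetical, and it assumes that paint(include, ...), like paint(upto, ...), only touches text that has not been styled yet.

```python
class Painter:
    """Toy model of Luddite's paint action over a text buffer."""

    def __init__(self, text):
        self.text = text
        self.styles = [None] * len(text)  # one style slot per character
        self.painted_to = 0               # everything before this index is styled

    def paint(self, mode, style, match_start, match_end):
        # "upto" stops where the match began; "include" covers the match too.
        end = match_start if mode == "upto" else match_end
        for i in range(self.painted_to, end):
            self.styles[i] = style
        self.painted_to = max(self.painted_to, end)

text = 'a="x${'
p = Painter(text)
p.painted_to = 2                 # pretend 'a=' was styled by earlier rules
start = text.index("${")         # the current match is '${'
p.paint("upto", "M_STRING", start, start + 2)
p.paint("include", "SSL_OPERATOR", start, start + 2)
print(p.styles)
```

The two calls mirror the action list used in the attribute-string rules: the pending string text gets M_STRING, then the matched "${" gets the operator style.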

State names are created when first mentioned.  The line "state FOO:" introduces a state declaration for state FOO, and is followed by one or more state conditions.  State names are global, and states may be defined in one or more places.  States are "used" when an action specifies a state to change to.  Luddite will complain about undefined states, and give a warning message about defined states that are never used.

The state conditions are attempted in the order specified in the main Luddite program, flattening out included files.  Once one condition is fulfilled, no others are attempted.  This is why matches on longer strings and patterns should be attempted before trying to match shorter strings that could be a prefix of the longer one.  For the same reason, transition files should be included before the main files (e.g. the files html2js.udl and js2html.udl are included before html.udl and jslex.udl above). 
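The effect of rule order is easy to demonstrate outside Luddite. In this small Python sketch (the function first_match and the two rule lists are illustrative names, not part of UDL), the first rule whose literal matches wins, so putting the one-character quote ahead of the triple quote hides the longer rule.

```python
def first_match(rules, text, pos):
    """Return the target state of the first rule whose literal matches at pos."""
    for literal, target in rules:
        if text.startswith(literal, pos):
            return target
    return None

good_order = [('"""', "IN_SSL_TRIPLE_DSTRING"), ('"', "IN_SSL_DSTRING")]
bad_order = [('"', "IN_SSL_DSTRING"), ('"""', "IN_SSL_TRIPLE_DSTRING")]

text = '"""doc"""'
print(first_match(good_order, text, 0))  # the triple-quote rule fires
print(first_match(bad_order, text, 0))   # the shorter prefix shadows it
```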

One way of writing faster parsers is to provide an explicit character set to skip over, such as in this state:

state BREAKFAST:
'juice' : ...
'coffee' : ...
'milk' : ...
/[^cjm]+/ : #stay here
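As a rough illustration of why the skip set helps, the Python sketch below counts how many steps a toy matcher takes with and without the final rule. The function count_steps is hypothetical and only models the "no rule fired, consume one character" behavior described earlier; it says nothing about the real engine's timings.

```python
import re

def count_steps(text, rules):
    """Count steps: one per rule match, or one per single skipped character."""
    pos = steps = 0
    while pos < len(text):
        for rx in rules:
            m = rx.match(text, pos)
            if m and m.end() > pos:
                pos = m.end()
                break
        else:
            pos += 1  # no rule fired: consume a single character
        steps += 1
    return steps

keywords = [re.compile(k) for k in ("juice", "coffee", "milk")]
with_skip = keywords + [re.compile(r"[^cjm]+")]

text = "toast and eggs, then coffee"
print(count_steps(text, keywords), count_steps(text, with_skip))
```

With the skip rule, the whole run of uninteresting characters before "coffee" is consumed in a single step instead of one step per character.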

The other all-caps names in Luddite are conventionally used for Scintilla style names.  These are defined here in Appendix 1. The "SCE_UDL_" prefix on these names is optional.  Luddite allows states to have the same name as style names (with or without the prefix), but the practice isn't recommended.

So we can easily support the <?python... form, by replacing

state IN_M_STAG_EXP_TNAME:
'%#' : => IN_TPL_BLOCK_COMMENT_1
/%=?/ : paint(include, TPL_OPERATOR), spush_check(IN_M_DEFAULT) => IN_SSL_DEFAULT

with this code:

state IN_M_STAG_EXP_TNAME:
/\?python\b/ : paint(include, TPL_OPERATOR), spush_check(IN_M_DEFAULT) => IN_SSL_DEFAULT

"SSL_DEFAULT" is the conventional name for the state where each sequence of server-side code starts.

The spush_check action is for the point where we find a "?>" that ends the <?python block.  At that point we emit an spop_check action, and it will go back to the IN_M_DEFAULT state (which is conventionally the starting state for the markup sub-language).

We need to handle $-expansion here as well.  This can happen in attribute values and text content, and I'm going to assume the attribute values must be quoted, as Kid is XHTML-based. So we can borrow some of the html2ruby code, and replace accordingly. We have to handle single-quoted and double-quoted strings separately, which is a cry for further encapsulation in this language, but we'll manage:

state IN_M_STAG_ATTR_DSTRING:
'$$' : #stay here, this is an escape sequence
'${' : paint(upto, M_STRING), paint(include, SSL_OPERATOR), \
      spush_check(IN_M_STAG_ATTR_DSTRING) => IN_SSL_DEFAULT
# **todo** dotted identifiers

state IN_M_STAG_ATTR_SSTRING:
'$$' : #stay here, this is an escape sequence
'${' : paint(upto, M_STRING), paint(include, SSL_OPERATOR), \
      spush_check(IN_M_STAG_ATTR_SSTRING) => IN_SSL_DEFAULT
# **todo** dotted identifiers

We also said we'd support $-interpolated Python in text content as well.  Text content isn't actually handled explicitly in the markup-base.udl file.  Instead, whenever we find some markup when we're in the IN_M_DEFAULT state, we color everything up to that character with the M_DEFAULT style, and then usually switch to a different state. So text content will be in that state, leading to this rule:

state IN_M_DEFAULT:
'$$' : #stay here, this is an escape sequence
'${' : paint(upto, M_DEFAULT), paint(include, SSL_OPERATOR), \
      spush_check(IN_M_DEFAULT) => IN_SSL_DEFAULT
# **todo** dotted identifiers

Now we're going to use the UDL stack to handle "}".  Standard Python has braces, so we use the stack to determine whether to bounce back into markup mode, or stay here.

There is no Python lexer yet (one of the reasons for choosing Kid in this tutorial was to show how to write a complete lexer for a server-side language), but it will need these two rules in the default state:

state IN_SSL_DEFAULT:
'{' : paint(upto, SSL_DEFAULT), paint(include, SSL_OPERATOR), \
     spush_check(IN_SSL_DEFAULT), => IN_SSL_DEFAULT
'}' : paint(upto, SSL_DEFAULT), paint(include, SSL_OPERATOR), \
     spop_check, => IN_SSL_DEFAULT

So when we find a "{" in Python, we push the SSL_DEFAULT state on the state-stack. And when we find a "}" at the default state, we check the state-stack. If it's non-empty, we switch to its top state, and pop it off the stack. Otherwise we go to the target state.
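The push and pop behavior can be sketched in a few lines of Python. This is a simplified model under stated assumptions: scan_expression is a hypothetical name, the text passed in is whatever follows the "${", braces inside Python strings are not treated specially, and the empty-stack case is not modeled.

```python
def scan_expression(text, return_state):
    """Scan Python code after '${' and report where control returns to markup.

    Returns (index just past the closing '}', state popped at that point).
    """
    stack = [return_state]                  # pushed by the '${' rule (spush_check)
    for i, ch in enumerate(text):
        if ch == "{":
            stack.append("IN_SSL_DEFAULT")  # spush_check(IN_SSL_DEFAULT)
        elif ch == "}":
            state = stack.pop()             # spop_check
            if state != "IN_SSL_DEFAULT":
                return i + 1, state         # back in the calling family
    return len(text), "IN_SSL_DEFAULT"      # input ended while still in Python

# A dict literal inside the expression: the inner '}' stays in Python,
# the outer one returns to the state that was pushed on entry.
end, state = scan_expression("{'a': 1}['a']} tail", "IN_M_DEFAULT")
print(end, state)
```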

Since the standard Python lexer doesn't need to do anything special with braces, we'll override handling them, and put them in a separate section of kid/html2python.udl.

We'll also process the unbracketed form in the html2python.udl file, although strictly speaking it belongs to Python.  First we need to recognize when we enter the state:

pattern PYHTMLDOLLARSTART = '(?<^|[^\w_\$])\$(?=[\w_])'

state IN_M_STAG_ATTR_DSTRING:
/$PYHTMLDOLLARSTART/ : paint(upto, M_STRING), \
   paint(include, SSL_OPERATOR), spush_check(IN_M_STAG_ATTR_DSTRING),
=> IN_M_PYTHON_UNBRACKETED_EXPN

state IN_M_STAG_ATTR_SSTRING:
/$PYHTMLDOLLARSTART/ : paint(upto, M_STRING), \
   paint(include, SSL_OPERATOR), spush_check(IN_M_STAG_ATTR_SSTRING),
=> IN_M_PYTHON_UNBRACKETED_EXPN

state IN_M_DEFAULT:
/$PYHTMLDOLLARSTART/ : paint(upto, M_DEFAULT), \
   paint(include, SSL_OPERATOR), => IN_M_PYTHON_UNBRACKETED_EXPN

While  we could let bracketed expressions be handled by our regular SSL states, we need to handle unbracketed expressions explicitly:

state IN_M_PYTHON_UNBRACKETED_EXPN: 
'.' : paint(upto, SSL_IDENTIFIER), paint(include, SSL_OPERATOR)
/[^\W_]/: paint(upto, SSL_IDENTIFIER), spop_check, => IN_M_DEFAULT
/\z/ : paint(include, SSL_IDENTIFIER)

We've been introducing new concepts as we go along.  The pattern declaration allows us to give commonly used patterns a name. Each family has its own set of pattern names, so you can specify that "-" is a markup NAMECHAR, but not a CSL NAMECHAR.

Incidentally, the pattern matching is implemented with PCRE, the Perl-compatible regular-expression library.  This tutorial assumes you're comfortable with this syntax.  If not, and you're sure you want to master Luddite, start at the Mastering Regular Expressions site, choose a (natural) language and a vendor, and get that book.

The /\z/ pattern matches the end of the buffer. Usually Luddite can figure out what color to apply to all pending characters at the end of the buffer. Where more than one color is used in a state, as in this case, it's always safe to specify an explicit EOF color. If a lexer doesn't color the remaining characters in the buffer, the Scintilla component will repeatedly invoke the lexer trying to discover the styles for these characters, reducing performance.

Finally we need to handle the py: attributes. Because Luddite doesn't have named variables, supporting both types of quoted attribute strings is going to be a bit hard. But this problem has already been solved for the arbitrary string and regex delimiters used in languages like Perl and Ruby. When we find the quote character, we set the delimiter, and we'll add a default level rule for the Python side to recognize that delimiter.

The new rules will be:

state IN_M_STAG_POST_TAGNAME: # you need to read markup-base.udl to find this state
/py:[$CS]+/ : paint(upto, M_TAGSPACE), paint(include, M_ATTRNAME), => IN_M_KID_PYATTR_1

I'm going to assume the attribute name, '=', and initial quote are on the same line.  You can add more conditions to allow multi-line attributes.

state IN_M_KID_PYATTR_1:
'=' : paint(upto, M_TAGSPACE), paint(include, M_OPERATOR) => IN_M_KID_PYATTR_2
/[$WS]/ : #stay
/[^'"]/ : paint(upto, M_TAGSPACE), IN_M_DEFAULT

state IN_M_KID_PYATTR_2:
/(['"])/ : paint(upto, M_TAGSPACE), paint(include, M_STRING), set_delimiter(1), \
=> IN_SSL_DEFAULT

and we add one rule to python2html.udl:

state IN_SSL_DEFAULT:
delimiter: paint(include, SSL_DEFAULT), => IN_M_STAG_POST_TAGNAME

The delimiter mechanism was designed with Ruby and Perl in mind, certainly not for Python, or a Python-derived language.  It turns out that delimiters can be used to handle single-quoted and double-quoted strings with one set of rules, when the two kinds of strings are identical. For example, single- and double-quoted strings are identical in JavaScript, CSS, Python, and *ML attribute values; in Perl, Ruby, and PHP double-quoted strings support interpolated expressions, while single-quoted strings don't.  I haven't moved to replace the current string handling with delimiters because they don't nest (yet another work item).
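The idea behind the delimiter can be sketched in Python: remember which quote opened the attribute value, then look for that same character to close it. The function attr_value and the pattern name are hypothetical; they only model the role of set_delimiter(1) and the delimiter rule, and ignore the non-nesting limitation mentioned above.

```python
import re

OPEN_QUOTE = re.compile(r"""(['"])""")

def attr_value(text, pos):
    """Return the attribute value that starts with the quote at text[pos]."""
    m = OPEN_QUOTE.match(text, pos)
    if not m:
        return None
    delimiter = m.group(1)               # plays the role of set_delimiter(1)
    end = text.find(delimiter, m.end())  # plays the role of the 'delimiter' rule
    if end == -1:
        return None
    return text[m.end():end]

print(attr_value('"x == \'a\'"', 0))
print(attr_value("'y == \"b\"'", 0))
```

One set of rules covers both quote styles, and the other kind of quote is free to appear inside the Python expression.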

At this point we've finished html2python.udl.  Let's move on to the js2python part.  Now there is no js2ruby.udl file (there probably should be though). So we'll look at js2php.udl. It has two rules, one for handling transitions to PHP inside JavaScript single-quoted strings, and one for transitions in double-quoted strings.

So the contents of kid/js2python.udl should look familiar:

state IN_CSL_DSTRING:
'$$' : #escape, stay
'${' : paint(upto, CSL_STRING), paint(include, SSL_OPERATOR), \
spush_check(IN_CSL_DSTRING) => IN_SSL_DEFAULT

state IN_CSL_SSTRING:
'$$' : #escape, stay
'${' : paint(upto, CSL_STRING), paint(include, SSL_OPERATOR), \
spush_check(IN_CSL_SSTRING) => IN_SSL_DEFAULT

state IN_CSL_DEFAULT:
'$$' : #escape, stay
'${' : paint(upto, CSL_DEFAULT), paint(include, SSL_OPERATOR), \
spush_check(IN_CSL_DEFAULT) => IN_SSL_DEFAULT

We aren't going to support the unbracketed form, because the "$" is a valid identifier character in JavaScript, and with the wide adoption of libraries like prototype.js, the use of "$" signs in JavaScript code has become more common.

Supporting ${...} in JavaScript's two kinds of comments is left as the proverbial exercise for the reader.  Since the comments will only be seen by the JavaScript interpreter, and then ignored, I don't see the point of interpolating variables there anyway.

The kid/css2python.udl file is almost identical to the js2python one. Again, we allow for interpolation outside strings, except in comments, and we again wish we had a better encapsulation mechanism:

state IN_CSS_DSTRING:
'$$' : #escape, stay
'${' : paint(upto, CSS_STRING), paint(include, SSL_OPERATOR), \
spush_check(IN_CSS_DSTRING) => IN_SSL_DEFAULT

state IN_CSS_SSTRING:
'$$' : #escape, stay
'${' : paint(upto, CSS_STRING), paint(include, SSL_OPERATOR), \
spush_check(IN_CSS_SSTRING) => IN_SSL_DEFAULT

state IN_CSS_DEFAULT:
'$$' : #escape, stay
'${' : paint(upto, CSS_DEFAULT), paint(include, SSL_OPERATOR), \
spush_check(IN_CSS_DEFAULT) => IN_SSL_DEFAULT

Again, we don't need to handle the case of reaching the end of the Python expression, because when we reach the "}", we'll pop the correct destination state off the state stack.

One of the CSS experts out there will say we aren't done. CSS allows unquoted URLs inside the url() operator.  Fine. By looking for 'url' in csslex.udl, we see that unquoted URLs are handled in the IN_URL_2 state.  The rest should be easy:

state IN_URL_2:
'$$' : #escape, stay
'${' : paint(upto, CSS_DEFAULT), paint(include, SSL_OPERATOR), \
spush_check(IN_URL_2) => IN_SSL_DEFAULT

The python2html.udl file has only one task, to get back to the markup family on "?>".  Because the HTML parser doesn't know that the contents of the PI are actually Python code, we have to model the parser, and use coloring to let the programmer know how the document will be interpreted.

family ssl

state IN_SSL_DEFAULT:
'?>' : paint(upto, SSL_DEFAULT), paint(include, SSL_OPERATOR), => IN_M_DEFAULT

state IN_SSL_SSTRING:
'?>' : paint(upto, SSL_STRING), paint(include, SSL_OPERATOR), => IN_M_DEFAULT

state IN_SSL_DSTRING:
'?>' : paint(upto, SSL_STRING), paint(include, SSL_OPERATOR), => IN_M_DEFAULT

state IN_SSL_TRIPLE_SSTRING:
'?>' : paint(upto, SSL_STRING), paint(include, SSL_OPERATOR), => IN_M_DEFAULT

state IN_SSL_TRIPLE_DSTRING:
'?>' : paint(upto, SSL_STRING), paint(include, SSL_OPERATOR), => IN_M_DEFAULT

state IN_SSL_COMMENT:
'?>' : paint(upto, SSL_COMMENT), paint(include, SSL_OPERATOR), => IN_M_DEFAULT

We've now described all the inter-language transitions, but still have to write a lexer for Python.  Fortunately, Python is lexically simpler than the other server-side language UDL files.  The full text is in pythonlex.udl, but there are some new features worth describing.

Handling Python's triple-quoted strings is a snap.  The only subtlety is remembering that the sequence "\'''" does not end a single-quoted triple-string.
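That subtlety can be checked with a short Python sketch. The function end_of_triple_string is a hypothetical helper, and it skips any backslash-escaped character, which is slightly more general than a rule that only looks for an escaped quote.

```python
def end_of_triple_string(text, start, quote):
    """Return the index just past the closing triple quote, or -1.

    `start` is the index just past the opening triple quote.
    """
    closer = quote * 3
    i = start
    while i < len(text):
        if text[i] == "\\":
            i += 2                  # skip the backslash and the escaped character
        elif text.startswith(closer, i):
            return i + 3
        else:
            i += 1
    return -1

text = "'''a\\'''b'''"              # the buffer holds: '''a\'''b'''
naive_end = text.find("'''", 3)     # a plain search stops at the escaped quote
real_end = end_of_triple_string(text, 3, "'")
print(naive_end, real_end)
```

In Luddite, the same idea takes the following form: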

state IN_SSL_DEFAULT:

'"""' : paint(upto, SSL_DEFAULT), => IN_SSL_TRIPLE_DSTRING
'"' : paint(upto, SSL_DEFAULT), => IN_SSL_DSTRING
"'''" : paint(upto, SSL_DEFAULT), => IN_SSL_TRIPLE_SSTRING
"'" : paint(upto, SSL_DEFAULT), => IN_SSL_SSTRING

state IN_SSL_TRIPLE_DSTRING:
'\\"' : # ""
'"""' : paint(include, SSL_STRING), => IN_SSL_DEFAULT

state IN_SSL_TRIPLE_SSTRING:
"\\'" : # ""
"'''" : paint(include, SSL_STRING), => IN_SSL_DEFAULT

The hard part with single-line strings is processing line-continuations. First, we need to recognize unterminated lines.  We need to determine when we reach the end of line, and return to the default state.  But if we find a backslash at the end of the line, we need to move to a temporary state that continues the string, and returns to the main string-recognition state as soon as possible.

state IN_SSL_DSTRING
/\\[\r\n]/ : => IN_DSL_DSTRING_LINECONT
/\\./ : # ""
/$/ : paint(upto, SSL_STRING), => IN_SSL_DEFAULT  # No EOLString in UDL
"'" : paint(upto, SSL_STRING), => IN_SSL_DEFAULT

You might wonder why we can't simply match /\\$/ and stay in the state. The reason is that UDL will consume the backslash, but end-of-line conditions, as with other zero-width conditions, are not consumable.  The UDL engine in fact doesn't recognize when it's matched a zero-width pattern and hasn't changed state, a sure recipe for an infinite loop (it should recognize these, and complain -- another bug).  This means that when you write a pattern that can succeed without matching one or more characters, you need to move to a different state.

state IN_SSL_DSTRING_LINECONT:
/\\[\r\n]/ : #stay
/\\./ : => IN_SSL_DSTRING
'"' : paint(include, SSL_STRING), => IN_SSL_DEFAULT
/./ : => IN_SSL_DSTRING
/^$/ : paint(upto, SSL_STRING), => IN_SSL_DEFAULT # End empty lines here

There are five different cases to handle after a backslash-escaped newline:

  1. Another escaped newline
  2. A quote ending the string
  3. An empty line: this triggers a syntax error
  4. An escaped character (might be a quote), continuing the string
  5. Any other character continuing the string
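The five cases can be sketched as a small Python function.  This only illustrates the ordering, not how UDL is implemented; the function name and the returned labels are made up, and the checks follow the order of the conditions in IN_SSL_DSTRING_LINECONT.

```python
def classify_after_linecont(rest):
    """Classify the text that follows a backslash-escaped newline.

    The labels are invented for this sketch; the order of the checks
    mirrors the conditions in IN_SSL_DSTRING_LINECONT.
    """
    if rest[:2] in ("\\\n", "\\\r"):
        return "another escaped newline"   # stay in the LINECONT state
    if rest.startswith("\\") and len(rest) > 1:
        return "escaped character"         # back to the string state
    if rest.startswith('"'):
        return "closing quote"             # the string ends normally
    if rest and rest[0] not in "\r\n":
        return "other character"           # back to the string state
    return "empty line"                    # unterminated string
```

For example, classify_after_linecont('abc"') falls through to the "other character" case, while a bare newline reaches the last line and is reported as an empty line.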

Another subtle area is distinguishing periods that start numerals from periods that separate identifiers.  This default rule and this state handle that:

/\.(?=[$NMSTART])/ : paint(upto, SSL_DEFAULT), paint(include, SSL_OPERATOR), \
=> IN_SSL_IDENTIFIER_1

state IN_SSL_IDENTIFIER_1:
'.' : paint(upto, SSL_IDENTIFIER), paint(include, SSL_OPERATOR)
/[^$NMCHAR]/ : paint(upto, SSL_OPERATOR), redo, => IN_SSL_DEFAULT

The redo action tells UDL to not consume the matched string or pattern, and move to the specified state.  The Luddite parser will currently complain about a single condition that contains a redo action without a specified state, but it doesn't detect infinite loops, such as this one:

state STATE_1:
'a' : redo, => STATE_2

state STATE_2:
/\w/ : redo, => STATE_3

state STATE_3:
/[^\d]/ : redo, => STATE_1

If an 'a' is encountered at state STATE_1, UDL will normally enter an infinite loop. However the engine has a check: if it notices that it hasn't moved past a given character after some number of tries, it issues an error message, and moves to the next character. This is not guaranteed to give reasonable results.
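Here is a rough Python simulation of that safety check.  The limit of 10 tries and the message text are invented for the sketch (the engine's real values aren't given here), and Python's re module stands in for UDL's pattern matcher.

```python
import re

# The three conditions from the example, as (pattern, next_state).
# Every one of them uses redo, so none consumes the character.
RULES = {
    "STATE_1": (re.compile(r"a"), "STATE_2"),
    "STATE_2": (re.compile(r"\w"), "STATE_3"),
    "STATE_3": (re.compile(r"[^\d]"), "STATE_1"),
}

MAX_TRIES = 10  # hypothetical limit; the real threshold isn't stated

def lex(text, state="STATE_1"):
    """Walk the text and return (messages, final_state)."""
    messages = []
    pos = 0
    tries = 0
    while pos < len(text):
        pattern, next_state = RULES[state]
        if pattern.match(text, pos):
            state = next_state          # redo: change state, keep position
            tries += 1
            if tries >= MAX_TRIES:      # stuck on one character: give up
                messages.append("stuck at offset %d, skipping" % pos)
                pos += 1
                tries = 0
        else:
            pos += 1                    # no condition fired: move on
            tries = 0
    return messages, state

stuck_messages, _ = lex("a")
clean_messages, _ = lex("1")
```

Note that after the skip, the lexer is left in whichever state the loop happened to reach, which is why the results aren't guaranteed to be reasonable.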

Also introduced are the fold statements at the bottom of the module. Each fold statement has three parts - the string to match, the style it has to be colored at, and then a "+" or "-" specifying whether to increase the line's fold level or decrease it.  So for example, if line 10 has a "[" styled as an operator (as opposed to part of a string or comment), and line 12 has a "]", Komodo will draw a fold box around lines 10 through 12 inclusive.
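As a rough model of what the fold statements ask for, the following Python sketch walks a list of styled tokens and tracks the fold level.  The data layout and the function name are assumptions made for the example; the real bookkeeping happens inside the editor component.

```python
# Each rule maps (text, style) to a change in fold level, like a fold
# statement's string, style, and "+" or "-".
FOLD_RULES = {
    ("[", "SSL_OPERATOR"): +1,
    ("]", "SSL_OPERATOR"): -1,
}

def fold_levels(lines, fold_rules):
    """Return the fold level in effect at the start of each line.

    lines is a list of lines; each line is a list of (text, style) tokens.
    """
    levels = []
    level = 0
    for tokens in lines:
        levels.append(level)
        for token in tokens:
            level += fold_rules.get(token, 0)
    return levels

SAMPLE = [
    [("x", "SSL_IDENTIFIER"), ("=", "SSL_OPERATOR"), ("[", "SSL_OPERATOR")],
    [("'[not a fold]'", "SSL_STRING")],   # brackets in a string are ignored
    [("]", "SSL_OPERATOR")],
]
sample_levels = fold_levels(SAMPLE, FOLD_RULES)
```

The bracket inside the string token doesn't change the level, because the rule is keyed on the style as well as the text.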

Komodo actually has a separate way of indicating fold blocks for Python code, based on their indentation level.  This makes sense, as indentation in Python reflects program structure. However Luddite doesn't yet support this.

2.4 A Digression: Why Not Lex?

Lex is a decades-old technology that does the same thing Luddite does -- it reads in a stream of characters, and separates them into lexical types. Here's a string, here's an identifier, here's a comment, etc.  It's long been associated with a syntactic analyzer (remember our discussion at the beginning?), which tries to make sense of those separate items.  And it mostly uses a list of regular expressions to do its work.  A typical Lex program has code like this:

"+"         return(PLUS);
"-"         return(MINUS);

Many other extensible editors take the same approach: to add syntactic coloring for a new language, give a list of patterns to match, with the color for each one.

We originally wanted to do that for UDL, but regular expressions alone are too weak.  You can't use patterns alone to tell a computer that when you're inside a Ruby double-quoted string, anything between a "#{" and a "}" should be colored as standard Ruby code.  And be sure to handle nested occurrences of these strings.  And now we're talking multi-language documents.  In RHTML, when we see an instance of "<%=" in an attribute string, we want to step into Ruby mode.  When we find the "%>", we bounce back into HTML mode.

This is impossible to do with regular expressions alone.  Then how does Lex do that, you might be asking.  The secret with Lex is that the right-hand side, after the string or pattern, is actually C code.  By wrapping it in braces, you can write arbitrary C, including managing nested states.  But we wanted to let people build lexers for languages like these without having to mess around with C or C++.
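To see what "managing nested states" means in practice, here is a small Python sketch that labels each character of a Ruby double-quoted string as string body or code.  It is a deliberately simplified illustration: it ignores backslash escapes and treats every "}" in code as closing the interpolation, and the function name and labels are made up.  The point is only that a stack, not a flat list of patterns, keeps track of where we are.

```python
def modes(text):
    """Label each character 'c' (Ruby code) or 's' (string body)."""
    stack = ["code"]                # the innermost context is stack[-1]
    out = []
    i = 0
    while i < len(text):
        top = stack[-1]
        if top == "string" and text.startswith("#{", i):
            out.append("cc")        # paint the "#{" opener as code
            stack.append("code")
            i += 2
            continue
        ch = text[i]
        if top == "code" and ch == '"':
            out.append("s")
            stack.append("string")  # entering a (possibly nested) string
        elif top == "string" and ch == '"':
            out.append("s")
            stack.pop()             # leaving the string
        elif top == "code" and ch == "}" and len(stack) > 1:
            out.append("c")
            stack.pop()             # back to the enclosing string
        else:
            out.append("s" if top == "string" else "c")
        i += 1
    return "".join(out)

labels = modes('"a#{b("c")}d"')
```

The inner "c" is labeled as string body even though it sits inside code that sits inside a string, which is exactly the nesting that patterns alone can't express.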

3. Building the Extension

To extend Komodo, we need to follow these steps:

  1. Compile the Luddite program
  2. Create any other files, and make sure they're in a correct directory structure
  3. Package all the files into an XPI
  4. Add the XPI to Komodo's extension manager

3.1 Compiling Luddite Files

Let's assume you've unpacked the Komodo Luddite files into a directory called luddite-1.0.0, created the directory luddite-1.0.0/udl/kid, copied the new pythonlex.udl and kid-mainlex.udl files into the udl directory, and the other new files into the udl/kid directory. Change to the luddite-1.0.0 directory. It's time to build the extension.

We need to compile the specification using this command:

   python luddite.py compile -f --ext=.kid.html
     --guid=9c773798-fbf3-4793-a9f3-43023f53033d --skel udl/kid-mainlex.udl

You can substitute "luddite" for "python luddite.py" on Windows systems, and use "luddite.py" elsewhere.  I'm using the full form here, as it's cross-platform.  You can generate your own GUID if you prefer, as long as it matches the class ID used in koKid_UDL_Language.py, which UDL generates.

Komodo uses a convention of giving web-based template languages an extension like ".kid.html".  Since I work with a variety of HTML-based languages, I prefer the two-part extension.  If you know that all your HTML files are going to be Kid files, feel free to assign the .html extension to Kid.

Then build the XPI using a command like this:

  python luddite.py package -c "creator" --version "major.minor.sub" Kid -f

and then install it.  When I ran the compile command the first time I got this output:

yacc: Generating SLR parsing table...
yacc: 56 shift/reduce conflicts
statements HT_STATE state_name opt_colon_eol transitions pattern_const opt_token
_check opt_colon cmds COMMA . LexToken(LB_NL,'\n',16)
Syntax error at or near line 16, token '
'
...

followed by about 30 lines of yacc grammar.

This part of UDL is still rough around the edges. The error mechanism still doesn't report file names, and we need to reduce the output. The best we can do right now is try compiling each of the files individually, and look for failures. Here's how it's done in Windows:

for %i in ( udl\kid\*.udl ) do python luddite.py compile -f %i
C...>python luddite.py compile -f udl\kid\css2python.udl
yacc: Generating SLR parsing table...
yacc: 56 shift/reduce conflicts
At least one transition moves to undefined state IN_SSL_DEFAULT
This state needs to be defined somewhere.
...

We know there are no syntax errors in this file, because Luddite got as far as complaining that the state IN_SSL_DEFAULT wasn't defined.  That complaint makes sense, because css2python.udl doesn't include any other files.
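If you're not on Windows, the same per-file loop can be written as a short Python script.  This is a sketch: the function names are made up, it assumes you run it from the luddite-1.0.0 directory, and it simply prints each file name before handing over to Luddite so you can tell which file the following messages belong to.

```python
import glob
import os
import subprocess

def compile_commands(udl_dir):
    """Build one 'python luddite.py compile -f FILE' command per .udl file."""
    pattern = os.path.join(udl_dir, "*.udl")
    return [["python", "luddite.py", "compile", "-f", path]
            for path in sorted(glob.glob(pattern))]

def compile_each(udl_dir):
    """Run the commands one at a time, labelling the output by file."""
    for cmd in compile_commands(udl_dir):
        print("==== %s" % cmd[-1])
        subprocess.call(cmd)

# Usage (from the luddite-1.0.0 directory):
#     compile_each(os.path.join("udl", "kid"))
```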

...
C...>python luddite.py compile -f udl\kid\html2python.udl
yacc: Generating SLR parsing table...
yacc: 56 shift/reduce conflicts
statements HT_STATE state_name opt_colon_eol transitions pattern_const opt_token
_check opt_colon cmds COMMA . LexToken(LB_NL,'\n',16)
Syntax error at or near line 16, token '

We see that in html2python.udl line 16 doesn't end with a backslash, but the state command doesn't end there (remember that the typos discussed in the document have been fixed in the actual shipped source). We add it, check over our files for other similar cases (naturally the SSTRING state has the same problem as the DSTRING state, since I copied and pasted it), rerun the command, and get this output:

At least one transition moves to undefined state IN_SSL_SQUOTE
This state needs to be defined somewhere.

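It turns out that my original quote handlers used non-standard target state names.  The double-quote handler, for example, did this:

'"' : paint(upto, SSL_DEFAULT), => IN_SSL_DQUOTE

The target state name should have been the conventional IN_SSL_DSTRING (and likewise IN_SSL_SSTRING rather than IN_SSL_SQUOTE for the single-quote handler).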

The Luddite compiler complains only about the first undefined state. Again, that can be improved.

We rerun the command, and get another undefined state report for IN_DSL_DSTRING_LINECONT.  Easy fix for this typo.

Another rerun shows that all referenced state names are now defined, but we have another problem:

luddite: create lexres `build\Kid\lexers\Kid.lexres'
Undefined pattern PY in str $PY_HTML_DOLLAR_START, family ssl

I had forgotten that pattern variable names may contain only upper-case letters.  After a quick change of "PY_HTML_DOLLAR_START" to "PYHTMLDOLLARSTART", a rebuild gives this result:

yacc: Generating SLR parsing table...
yacc: 56 shift/reduce conflicts
luddite: create lexres `build\Kid\lexers\Kid.lexres'
luddite: create lang service `build\Kid\components\koKid_UDL_Language.py'
luddite: create template `build\Kid\templates\All Languages\Kid.kid.html'
luddite: create template `build\Kid\templates\Common\Kid.kid.html'

Success!
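The "Undefined pattern PY" message above is easier to understand with a small Python sketch.  It assumes, based only on that error message, that a variable reference is a "$" followed by a run of upper-case letters; Luddite's real scanning code isn't shown here, and the names below are made up.

```python
import re

# Assumption: a pattern variable is "$" plus upper-case letters only.
VARIABLE = re.compile(r"\$([A-Z]+)")

def variable_names(pattern_text):
    """Return the variable names referenced in a pattern string."""
    return VARIABLE.findall(pattern_text)

names_before = variable_names("$PY_HTML_DOLLAR_START")  # stops at the "_"
names_after = variable_names("$PYHTMLDOLLARSTART")
```

Under that assumption the name ends at the first underscore, so the lookup is for "PY", which was never defined.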

3.2 Packaging

We're ready to build the extension.

We run this command:

python luddite.py package -c "Eric Promislow" --version "1.0.0" Kid -f

and get this output:

luddite: create `build\Kid\install.rdf'
luddite: create `build\Kid\chrome.manifest'
luddite: `kid_language-1.0.0-ko.xpi' successfully created

You can get more help by running "python luddite.py help" for general help, and "python luddite.py help <command>" for help on a specific command.

3.3 Installing and Testing

Find the newly created kid_language-1.0.0-ko.xpi file, and open it with Komodo. You should see two dialog boxes.  The first prompts you to install the new extension.  If you accept, the second box prompts you to restart Komodo. At this point restart Komodo.  This would be a good time to download a sample Kid file to test, if you don't have any on your local system.  I'm using the Mandelbrot set example available at http://www.kid-templating.org/trac/wiki/KidExamples, and saved it in a file called mandelbrot.kid.html.  Before installing the XPI, Komodo treats it as an HTML file.  One of the ways we can tell that Komodo will color it as a Kid file is when the "def" in the initial Python block is colored like a keyword.

When I restarted Komodo, I found that "def" was still colored like an identifier. In fact, the whole <?python ... ?> block was colored like any HTML PI. And menus like View|View As Language|Other and the File Associations lists in Preferences didn't have an entry for "Kid", but the "Kid Language" XPI was listed in the extension manager. The problem was that I forgot to specify the "--guid=GUID" and "--skel" options on the command-line. When I recompiled, repackaged, and reinstalled the XPI, this time there were no bindings. I had to add template files for Komodo to see how to bind the files. In effect, I wanted the XPI components to reside in the directory tree shown below. Files with a "**" to the right are empty files we need to create. All other subdirectories and files are created by the luddite commands.

+---build
   +---Kid
   |   |   chrome.manifest
   |   |   install.rdf
   |   |
   |   +---components
   |   |       koKid_UDL_Language.py
   |   |
   |   +---lexers
   |   |       Kid.lexres
   |   |
   |   +---templates
   |       +---All Languages
   |       |       Kid.kid.html **
   |       |
   |       +---Common
   |               Kid.kid.html **
   |

Now when I recompiled, repackaged, and reinstalled, this time "View|View as Language" showed my mandelbrot file was being treated as a Kid file by Komodo, but the coloring doesn't look as good as we were expecting. The <?python...?> block at the top of the file is colored as if it were a regular processing instruction -- switching the buffer's language from Kid to HTML shows this, although it looks like the contents of the attribute strings are being handled correctly.

So to recap, at this point we've figured out how to write a Luddite spec for a lexer that is accepted by the compiler.  We've also successfully packaged the compiled files into an XPI that Komodo will accept, and our changes seem to have been reflected in most of the API.  The only remaining problem is that some of our patterns and states don't seem to be firing.

Currently Luddite doesn't have a trace mode.  Until we build a suitable trace mode (and it will need a detailed configuration controller to prevent the user from getting inundated by hundreds of lines of output on every keypress), it pays to look at the Luddite code that doesn't seem to be firing.

We know that conditions are attempted in the order they're encountered. This means that in state IN_M_STAG_EXP_TNAME, /\?python\b/ should be attempted before any of the standard conditions listed in markup-base.udl.  However, let's look at that file to see why the standard condition seems to be firing.

When we backtrack looking for IN_M_STAG_EXP_TNAME, we see that it's entered only when all the other strings that can follow a "<" haven't been recognized.  We need to match the "<" in the default state.  So instead of looking for "?" after a "<" has been recognized, we need to look for the full sequence "<?".

state IN_M_DEFAULT:
/<\?python\b/ : paint(include, TPL_OPERATOR), => IN_SSL_DEFAULT

Recompile, repackage, reinstall, and reload....

Now strings and numeric constants are colored differently in the python block, but the "def" string is colored the same as operators. I brought up Preferences|Fonts and Colors|Lang-Specific|Kid, and gave identifiers, keywords, and operators all different, garish colors. After saving, I see that none of them are recognized in any of the three types of Python code blocks: the PIs, ${...} strings, and "py:" attribute values. The problem is most likely in the pythonlex.udl file, so I'll have a look at that.

The first mistake was here:

state IN_SSL_IDENTIFIER_1:
'.' : paint(upto, SSL_IDENTIFIER), paint(include, SSL_OPERATOR)
/[^$NMCHAR]/ : paint(upto, SSL_OPERATOR), redo, => IN_SSL_DEFAULT

Since periods are colored with the paint/include action, I got the second action wrong.  It should be

/[^$NMCHAR]/ : paint(upto, SSL_IDENTIFIER), redo, => IN_SSL_DEFAULT

If this is the fix, it should handle keywords as well.  The keywords declaration listed all the words we want to be treated as keywords, and the keyword_style declaration specified that any time we finish a token of style SSL_IDENTIFIER that is in the keyword list, it should be recolored with style SSL_WORD.  Let's lather, rinse, and check once again...

And it worked.  Now we can test the various boundaries of our spec.
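Here's why one wrong style name was enough to hide every keyword.  The Python sketch below models the keyword_style step as described above; the function name is made up and the keyword set is just a small sample, not the list in pythonlex.udl.

```python
# A small sample of keywords; the real list lives in pythonlex.udl.
KEYWORDS = {"def", "return", "break", "for", "in", "if", "else"}

def finish_token(text, style):
    """Recolor a finished identifier token if it is a keyword."""
    if style == "SSL_IDENTIFIER" and text in KEYWORDS:
        return "SSL_WORD"
    return style

with_typo = finish_token("def", "SSL_OPERATOR")     # painted by the bad rule
after_fix = finish_token("def", "SSL_IDENTIFIER")   # painted by the fixed rule
```

With the typo, the token never has the identifier style, so the keyword lookup never happens and "def" keeps the operator color.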

The Mandelbrot sample has a few py:content attributes, using double-quoted strings.  We can insert a single-quoted string inside a double-quoted py:content attribute string.  Escaping single-quotes works, and triple-single-quoted strings work too.  Note that we can't escape double-quotes inside these strings, because they're first parsed by the XML parser, which doesn't know about backslashes.  If you really need a double-quote inside one of these strings, you need to specify it as "&quot;", and Python will see a '"'.
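You can watch the XML parser do this with Python's standard library.  The element and attribute names below are made up for the example (a plain "content" attribute instead of py:content, to avoid declaring a namespace).

```python
import ast
import xml.etree.ElementTree as ET

# A made-up element; the attribute value holds a Python string literal.
source = '<span content="\'say &quot;hi&quot;\'" />'
value = ET.fromstring(source).get("content")   # entity already decoded
result = ast.literal_eval(value)               # what Python ends up with

# A backslash means nothing to the XML parser, so \" still ends the
# attribute value and the rest of the tag is no longer well-formed.
try:
    ET.fromstring('<span content="\'a\\"b\'" />')
    backslash_ok = True
except ET.ParseError:
    backslash_ok = False
```

By the time Python evaluates the attribute, the entity is already a plain double-quote, while the backslash version never gets past the XML parser.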

However, single-quoted attribute strings don't work as well.  I get lexing in them, but as soon as I enter a double-quote, I only seem to get sequences of string and default styles, until the closing single-quote.  Back to the code...

state IN_SSL_DSTRING
/\\[\r\n]/ : => IN_SSL_DSTRING_LINECONT
/\\./ : # ""
/$/ : paint(upto, SSL_STRING), => IN_SSL_DEFAULT  # No EOLString in UDL
"'" : paint(upto, SSL_STRING), => IN_SSL_DEFAULT

Another typo.  The "D" in "DSTRING" indicates a double-quote, but I'm ending the string with a single-quote.  The last condition should be:

'"' : paint(upto, SSL_STRING), => IN_SSL_DEFAULT

Remember the four "REs"...  You did write a shell script or batch file for compiling and packaging, right?  Unfortunately you can't pass an XPI to Komodo on the command-line.  On Windows I keep a file manager handy to drag-and-drop the XPI's icon on Komodo.  You can also add an "Open-File Shortcut" to your toolbox pointing to the directory containing the XPI.  And one more time-saver, if you prefer running Komodo from the command-line rather than from an icon, is that closing Komodo automatically closes the Installer box as well.
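If you'd rather have one cross-platform script than a shell script and a batch file, a Python version might look like the sketch below.  The two commands are the ones used earlier in this tutorial; the function names are made up, and the script assumes that Luddite exits with a non-zero status when a step fails, which I haven't verified.

```python
import subprocess

LUDDITE = ["python", "luddite.py"]
GUID = "9c773798-fbf3-4793-a9f3-43023f53033d"

def build_commands(creator, version):
    """Return the compile and package commands used in this tutorial."""
    compile_cmd = LUDDITE + ["compile", "-f", "--ext=.kid.html",
                             "--guid=" + GUID, "--skel",
                             "udl/kid-mainlex.udl"]
    package_cmd = LUDDITE + ["package", "-c", creator,
                             "--version", version, "Kid", "-f"]
    return [compile_cmd, package_cmd]

def rebuild(creator="Eric Promislow", version="1.0.0"):
    """Run both steps from the luddite-1.0.0 directory.

    check_call raises if a step exits with a non-zero status
    (assumption: Luddite reports failures that way).
    """
    for cmd in build_commands(creator, version):
        subprocess.check_call(cmd)
```

Reinstalling and reloading still have to be done by hand, as described above.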

So we now have working Python blocks, Python-content attributes, and embedded ${...} strings.  Working through our test suite from Section 2.2 Test Planning, we see that ${...} isn't working in text content.

The problem is obviously in html2python.udl, because we're failing to transition from HTML to Python.  And sure enough, I forgot to copy the rules for that from this document to the code file.

Another cycle, and everything that was working before is still working.  (The advantage of testing lexers is that you can just look at the screen, but this is also one of the big disadvantages.  Batch tests with scripts would be a great asset here.)

Continuing down the list from where we left off, we try the following tests, verifying that ${...} sequences are either recognized as Python transition strings, or ignored, depending on the context:

Text                                      Result  Passes test?
${...} in comments                        off     yes
${...} in PIs                             off     yes
${...} in CDATA sections                  off     yes
${...} in JS strings                      on      yes
${...} in JS text                         on      yes
${...} in JS comments                     off     yes
${...} in CSS strings                     on      yes
${...} in CSS text                        on      yes
${...} in CSS bare URLs                   on      yes
${...} in CSS comments                    off     yes
$[...]                                    off     yes
$${...} in attribute strings              off     yes
$${...} in text content                   off     yes
$foo.bar in non-Python attribute strings  off     no!

We have no feedback on why that last test case failed, and need to dig a bit deeper.

This is where there's an advantage in running Komodo from the command-line with the -v option.  The failing test exercises the feature that was implemented with that more complex pattern variable near the top of html2python.udl.  It turns out that when we launch Komodo with -v on, it emits three error messages to the console:

udl: failed to compile ptn <(?<^|[^\w_\$])\$(?=[\w_])>: failed at offset 3 (^|[^\w_\$])\$(?=[\w_])): unrecognized character after (?<
udl: failed to compile ptn <(?<^|[^\w_\$])\$(?=[\w_])>: failed at offset 3 (^|[^\w_\$])\$(?=[\w_])): unrecognized character after (?<
udl: failed to compile ptn <(?<^|[^\w_\$])\$(?=[\w_])>: failed at offset 3 (^|[^\w_\$])\$(?=[\w_])): unrecognized character after (?<

Seeing this message three times makes sense, since we use that pattern three times.  Let's fire up the RxToolkit (assuming you're using Komodo IDE or the old 3.5 version), and see what it says...

That didn't take long.  The "<" after "(?" is highlighted, and there's a hard-to-miss error message complaining about the regex.  I wanted to match dollar-signs that are either at the start of the line, or are not preceded by an identifier or other dollar sign. I forgot to put an "=" after the "<", so let's try that.

That gives a new error message: "look-behind requires a fixed-width pattern". OK, I don't need to put the ^ anchor in a look-behind, so let's rewrite the pattern like so:

(?:^|(?<=[^\w_\$]))\$(?=[\w_])

The error message is gone.  Now put these sample strings in the Text box, and see which ones match (be sure the "Match All" button is selected in the top row):

$foo
$$foo
abc$foo
abc $foo
abc $12.34
abc*$foo
for i in $*

Text          Result   Passes test?
$foo          matches  yes
$foo          matches  yes
$$foo         fails    yes
abc$foo       fails    yes
abc $foo      matches  yes
abc $12.34    matches  no!
abc*$foo      matches  yes
for i in $*   fails    yes

Test #6 failed -- obviously we don't want to interpret a price as a reference to a Python object.  I fixed this by changing the occurrence of "\w" after the "$" to a stricter "a-zA-Z". If you're following with the Rx Toolkit, you'll see that the line containing "abc $12.34" is no longer highlighted, and no other lines of the sample text were affected.
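If you don't have the Rx Toolkit handy, the same experiment can be run in plain Python.  This sketch assumes Python's re module treats these patterns the same way the Rx Toolkit does; the helper name is made up.  It checks the look-behind attempt, the first working pattern, and the stricter one against the sample strings.

```python
import re

SAMPLES = ["$foo", "$$foo", "abc$foo", "abc $foo",
           "abc $12.34", "abc*$foo", "for i in $*"]

# Putting the ^ anchor inside the look-behind is rejected outright.
try:
    re.compile(r"(?<=^|[^\w_\$])\$(?=[\w_])")
    lookbehind_rejected = False
except re.error:
    lookbehind_rejected = True

FIRST_TRY = re.compile(r"(?:^|(?<=[^\w_\$]))\$(?=[\w_])")
STRICTER = re.compile(r"(?:^|(?<=[^\w_\$]))\$(?=[a-zA-Z_])")

def matching(pattern):
    """Return the samples in which the pattern matches somewhere."""
    return [s for s in SAMPLES if pattern.search(s)]

first = matching(FIRST_TRY)      # still includes the price
stricter = matching(STRICTER)    # the price is gone, nothing else changes
```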

So we set the pattern variable PYHTMLDOLLARSTART (the renamed PY_HTML_DOLLAR_START) to '(?:^|(?<=[^\w_\$]))\$(?=[a-zA-Z_])', and cycle yet again.  Now we see that while the "$" characters are colored as identifiers, the following characters aren't. This is the case in all three contexts: text content, single-quoted plain attribute strings, and double-quoted strings.

And it was yet another typo:

state IN_M_PYTHON_UNBRACKETED_EXPN:
'.' : paint(upto, SSL_IDENTIFIER), paint(include, SSL_OPERATOR)
/[^\W]/: paint(upto, SSL_IDENTIFIER), redo, spop_check => IN_M_DEFAULT

The second pattern is a double negative.  It should be /[\W]/.
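A two-line check in Python shows what the double negative does.  As before, Python's re module is only a stand-in for UDL's engine, and the sample text is made up.

```python
import re

text = "foo.bar baz"          # what might follow a "$" in text content

# [^\W] is "not a non-word character", which is just \w again,
# so it fires on the very first character of the identifier.
typo_at = re.search(r"[^\W]", text).start()

# [\W] waits for the first character that can't be part of a name
# (here the ".", which the state's first condition handles anyway).
fixed_at = re.search(r"[\W]", text).start()
```

With the typo, the state is left as soon as it's entered, which matches what we saw: only the "$" got the identifier color.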

That fixed that one, and we find that the other tests for recognizing identifier shortcuts work.

Context                           Result  Passes test?
Python content attribute strings  off     yes
text content                      on      yes
comments                          off     yes
PIs                               off     yes
CDATA sections                    off     yes

So we've done static checking of all the Kid transition strings.  We also fiddle with the Komodo editor contents to make sure that changes are updated quickly and correctly.  If you implement a language and notice an anomaly, it's most likely a bug in UDL that we've made, not something you need to fix.

We also want to make sure code-completion and auto-indent are working reasonably well.  But all along we've noticed that there's no code-completion going on with HTML tags.  It turns out that there's no completion in the other sections either.  The next section will explain why, and show how to add them.

4. Auto-Indenting and Code-Completion

"Smart" auto-indenting and code-folding appear to be working correctly in HTML.  Pressing return after a "{" in the JS and CSS modes correctly increases the indentation, and typing a "}" at the start of the line shifts the close-brace to the previous indentation level.  All good. The folding looks right as well.

The Python section is a bit different.  There is no folding based on indentation levels, like there is in the .py files inside Komodo. This is because code-folding is done by the scintilla components, and Luddite does not yet have a way of expressing fold levels based on indentation levels.  However the auto-indentation should be working, and it isn't.  When we press return after "def color(x,y):", Komodo should automatically increase the indent level by one, and it fails to.

Like any debugging scenario, there are two places to look more closely: the symptom, and the cause.  The symptom, in this case, is failure to do Python-style auto-indenting.  Let's have a closer look at the buffer. It seems fine, maybe too fine.  Remember when I created a garish color scheme for my Kid documents?  I had decided to assign operators a hard-to-miss, hard-to-look-at crimson.   The operators made the HTML, JavaScript, and CSS sections of the code hard to look at, but the part of my brain that was looking for missing styles was fooled by the part that liked the calmer Python section.  A closer look showed that operators weren't getting recognized in the Python section.  Auto-indenting in Python mode depends on distinguishing characters like ":" in an operator context from ":"s appearing elsewhere.

So let's look back at the source code.  We'll fire up the Rx Toolkit, specify the pattern we're using to recognize operator characters in the pattern field, and put the "def color" line in the subject field.

We'll have to manually expand the pattern "[$OP]" to give [~!@%^&*()-=+[]{}\\|;:,.<>/?], and we're told that there are no matches.  At least it's consistent with the behavior we see in the editor, so we'll have to look more closely at the regular expression to find the cause of the problem.

While I was writing this regex, I kept in mind the relaxed requirements for escaping special characters inside square brackets.  In fact, the only characters that need escaping are the backslash itself, and the right-square bracket character ("]").  Once I looked at the pattern again, I could see that the "]" wasn't escaped.  The pattern I had written was matching one character in ~!@%^&*()-=+[, followed by the sequence {}\\|;:,.<>/?]. After putting a backslash before the inner ']', the Rx Toolkit tells me the operators match.  So let's change the pattern definition from

pattern OP = '~!@%^&*()-=+[]{}\\|;:,.<>/?'

to

pattern OP = '~!@%^&*()-=+[\]{}\\|;:,.<>/?'

and cycle.  Auto-indentation is now *partly* working.  Pressing return after a ":" doesn't advance indentation, but pressing return on a line that starts with a "dedenting" keyword like "return" or "break" decreases it.  It looks like an internal bug in Komodo....  Yes.  You can read more at https://bugs.activestate.com/show_bug.cgi?id=66133.  Until this bug is fixed, ":" handling won't work in Python segments of Kid files.
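The character-class fix above can also be checked outside the Rx Toolkit.  This Python sketch uses the expanded "[$OP]" pattern before and after the change, again assuming Python's re module is a fair stand-in for the engine UDL uses.

```python
import re

LINE = "def color(x,y):"

# The class closes at the first unescaped "]", so the rest of the
# text becomes a literal sequence that never appears in the line.
BROKEN = re.compile(r"[~!@%^&*()-=+[]{}\\|;:,.<>/?]")

# With the inner "]" escaped, everything is inside one class.
FIXED = re.compile(r"[~!@%^&*()-=+[\]{}\\|;:,.<>/?]")

broken_hits = BROKEN.findall(LINE)
fixed_hits = FIXED.findall(LINE)

# Note: Python's re reads ")-=" inside the class as a range from ")"
# to "=", so this class also matches digits.
digit_hits = FIXED.findall("42")
```

The last line is worth a second look in your own patterns: if UDL's engine reads the "-" the same way, moving it to the end of the class would keep it a plain hyphen.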

As for code-completion, at this point UDL doesn't help us generate the required files, and Komodo isn't yet doing code-completion for Python in multi-language documents.

If you're feeling adventurous, you could copy a file like lang_mason.py out of komodo-install/lib/mozilla/python/komodo/codeintel2, call it "lang_kid.py", change all occurrences of "Mason" to "Kid" and occurrences of "Perl" to "Python" (leaving the shebang line alone), make sure there are no syntax errors in lang_kid.py, copy it back to the Komodo codeintel2 directory, and restart Komodo.  If Komodo fails to start, or starts up, but without any loaded buffers, you could either consult the pystderr.log and pystdout.log files in your application data area, or run Komodo from the command-line with the -v option, to better see error messages as they are triggered.

At this point you should find code-completion working for HTML and CSS in style tags.  We haven't implemented the class for multi-language Python code-completion yet, and when we do, JavaScript completion will fall out of it.   When we ship that version of Komodo, the version of the Kid lexer you build today should inherit the new functionality in the new version, not even requiring a recompile.

Appendix A. Style Names

The first section consists of markup names. Some are a bit cryptic. The "S" in "STAG" stands for "start"; the "E" in "ETAG" stands for "end"; the "C" in "TAGC" stands for "close", and the "O" in "TAGO" stands for "open". "EMP" stands for "EMPTY". The other markup terms should be self-evident in this angle-bracket-inundated world.

SCE_UDL_M_DEFAULT
SCE_UDL_M_STAGO
SCE_UDL_M_TAGNAME
SCE_UDL_M_TAGSPACE
SCE_UDL_M_ATTRNAME
SCE_UDL_M_OPERATOR
SCE_UDL_M_STAGC
SCE_UDL_M_EMP_TAGC
SCE_UDL_M_STRING
SCE_UDL_M_ETAGO
SCE_UDL_M_ETAGC
SCE_UDL_M_ENTITY
SCE_UDL_M_PI
SCE_UDL_M_CDATA
SCE_UDL_M_COMMENT

SCE_UDL_CSS_DEFAULT
SCE_UDL_CSS_COMMENT
SCE_UDL_CSS_NUMBER
SCE_UDL_CSS_STRING
SCE_UDL_CSS_WORD
SCE_UDL_CSS_IDENTIFIER
SCE_UDL_CSS_OPERATOR

SCE_UDL_CSL_DEFAULT
SCE_UDL_CSL_COMMENT
SCE_UDL_CSL_COMMENTBLOCK
SCE_UDL_CSL_NUMBER
SCE_UDL_CSL_STRING
SCE_UDL_CSL_WORD
SCE_UDL_CSL_IDENTIFIER
SCE_UDL_CSL_OPERATOR
SCE_UDL_CSL_REGEX

SCE_UDL_SSL_DEFAULT
SCE_UDL_SSL_COMMENT
SCE_UDL_SSL_COMMENTBLOCK
SCE_UDL_SSL_NUMBER
SCE_UDL_SSL_STRING
SCE_UDL_SSL_WORD
SCE_UDL_SSL_IDENTIFIER
SCE_UDL_SSL_OPERATOR
SCE_UDL_SSL_REGEX
SCE_UDL_SSL_VARIABLE

SCE_UDL_TPL_DEFAULT
SCE_UDL_TPL_COMMENT
SCE_UDL_TPL_COMMENTBLOCK
SCE_UDL_TPL_NUMBER
SCE_UDL_TPL_STRING
SCE_UDL_TPL_WORD
SCE_UDL_TPL_IDENTIFIER
SCE_UDL_TPL_OPERATOR
SCE_UDL_TPL_VARIABLE

References

1. While I was writing this I saw a request in a Rails-related feed asking if anyone was using Liquid templates.  A quick google search for "liquid" came up with about 12 billion hits, which shows you just how quickly this template language thing is being picked up.

2. "kid.org" was taken by the Kennewick Irrigation District; now you don't have to look it up.

3. If you open a .udl file in Komodo, you'll see syntax-coloring for the different parts. See how it's done in luddite.udl.

4. While you're opening files in Komodo, you might want to open the Luddite documentation in the help section. The overview is in the "Komodo Reference" under "User-Defined Languages", and the language reference is at "Luddite Reference".