Regular expressions

Regular expressions language overview

Introduction

Regular expressions language – the language that greatly simplifies text manipulation – is one of the most used domain specific languages today. Almost every developer has used it at least once. Some languages, like Perl and Python, have built in support for it; some, like Java, use it through libraries. Java, the language which we use to implement MPS, does not have language level support for regular expressions, so it was natural for us to implement DSL for them, so we would be able to use a DSL instead of a regular expression library. This language is a good example of an MPS language. Having read this introduction, you will be able to understand how to create and use languages in MPS.

Examples

We assume you have MPS already installed.

This document uses many examples. You can find them in a regular expression language project (%MPS_HOME%/platform/regexp) under the jetbrains.mps.regexp.examples solution:

Language overview

Let's take a look at a simple regular expression application. Suppose we want to get a user name and domain name from an email address. Here is code that prints out a user name and a domain name by analyzing an email address with a regular expression (you can find this example in the EmailExample class):

The regular expression that is used in this match regexp statement does the following. First, it reads one or more word characters (\w+) and saves them in a "user" variable. After that, it reads the "@" character. Then we read a list of words which are separated by a period ("." character) and save it in a domain variable (\w+(.\w+)). If a match is found, the program prints out user and domain to System.out.

Here is a syntax tree for this example's regular expression:

Language structure

When we create a language in MPS, we usually start by defining its abstract syntax. Abstract syntax in MPS is called language structure. To do this, we use a structure language. Structure language is an XML Schema counterpart from XML language, or a DDL counterpart from SQL. Let's take a look at the regular expressions language structure.

Overview

The MPS regular expressions language contains several parts:

Regular expressions: concepts used to specify regular expressions. They include concepts for string literals, symbol classes, and "or" and "sequence" regular expressions.
BaseLanguage (BaseLanguage is a Java-like language, used internally by MPS as a target language for generators) integration: this part includes concepts used to embed regular-expressions-related code into BaseLanguage. For example, it includes MatchStatement, ReplaceStatement, and SplitStatement.
Regular expressions library support. When we work with regular expressions, we want to reuse them, and so we created special concepts for this task.

Regular expressions

All regular expressions concepts in our language are placed into "Regexp" folder in its structure model:

Let's consider them in detail. We have a single base concept for all of them: Regexp:

It is derived from BaseConcept concept. All MPS concepts are derived from it. This concept also has the abstract concept property, which means that it is created to form a concepts hierarchy, not to be used in language to define regular expressions. It is similar to the 'abstract' modifier in Java classes.

Let's consider the concepts that are derived from it. You can see them in a hierarchy view. You can see this view by pressing Ctrl + H on the concept declaration. For the Regexp concept, we will see the following:

StringRegexp represents an arbitrary string which can be matched against text (you can find all examples of regular expression that we consider in this section in the Regexps root node):

Let's take a look at its concept declaration (you can quickly navigate to a concept declaration by pressing Ctrl + Shift + S when an instance of a concept is selected in an editor):

In its declaration, we see a property text with a string type, which is used to store text that will be shown in the editor. Also, this concept declares a concept property "alias." Concept properties differ from simple properties. Simple properties correspond to Java instance fields, and concept properties correspond to Java static fields. The value of a concept property alias will be shown in completion menu, when we press Ctrl + Space:

Binary regular expressions are created to represent regular expressions that combine two different regular expressions into one. BinaryRegexp concept is declared as abstract and has two concrete sub concepts: OrRegexp and SeqRegexp. Here are examples of their instances:

Here is its concept declaration:

It defines two links: one to store the left part and another to store the right part. The word 'aggregation' means that the regular expression under this link will be a part of a declared concept instance. i.e. if we look at the syntax tree, we will see a child regular expression under the parent BinaryRegexp:

Dot regexp represents a regexp which matches any character. LineEndRegexp matches only at the end of a line. LineStartRegexp matches only at the start of a line. ParensRegexp are used to group other regular expressions in order to make an enclosing regular expression more readable.

There are a lot of sets of symbols which are often used, but they are quite verbose to enter. So we have character classes that make it possible to enter [A-Z] instead of (A|B|CZ). We have two kinds of them: negative and positive. Both of them extend abstract SymbolClassRegexp:

Many of these character classes are used in several places, so they can be referenced in a simpler way with PredefinedSymbolClassRegexp. Instead of [A-Z] we can write "\w":

This concept is declared in the following way:

Here we have symbolClass link declaration, which has a reference stereotype (aggregation, which we mentioned above, is also a link stereotype). Reference stereotype means, that an instance of this concept won't contain the referenced node as a child. Instead the referenced node can be stored in any place in the model. Also we have a lot of different UnaryRegexps which are derived from an abstract concept UnaryRegexp. They include +, * and other regexp operations:

When we work with a text it is often useful to remember some match, and reference it later. To facilitate this task we have MatchParensRegexp that remembers a string which it matches, and MatchVariableReferenceRegexp that references a string matched before. The following code matches a pair of the same xml tags with a text inside it:

BaseLanguage integration

Regular expressions have a little use if they can't be integrated in the BaseLanguage code. So in regular expressions language we have special concepts which make it possible to write regular-expression-related constructs in a program which is written in BaseLanguage.

If you want to add new constructs to BaseLanguage you usually extend either Expression or Statement concept from BaseLanguage. Expression concept represents expressions like "1+2", "a == b". Statement concept represents control structures like "if() { }", "while() { }". In the regular expressions language we create both new expressions and statements.

Let's first take a look at the statements and than at the expressions:

MatchRegexpStatement is used when you want to check whether a specified string matches a regular expression (you can find the examples for this section in BaseLanguageIntegration class in jetbrains.mps.regexp.examples model):

We have an interesting feature here: you can reference named matches in the MatchRegexpStatement block. These match variables work in other statements which are defined in the regular expressions language.

FindMatchStatement checks whether a specified string contains a match for a specified regular expression. It is similar to MatchRegexpStatement.

ForEachMatchStatement allows you to iterate over all matches of a specified regular expression in a specified string:

When we work with a string, we often want to replace all matches of a regular expression with a specified text. In regular expressions language you can do this with the help of ReplaceWithRegexpExpression:

It is also often practical to split a string with some regular expression. For example, to extract parts of a string which are separated by one or more whitespace symbols we can write this SplitExpression:

When we reference a match in a block, the MatchVariableReference concept is used. It is also derived from the Expression concept.

Library support

When we work with regular expressions, we want to use some of them in many places. To define these reusable regular expressions, we have a special concept – Regexps. It contains zero or more named regular expressions:

Accessory models

In many languages we have the following problem: we have a lot of very similar entities, which can be used in any model that is written with this language (like predefined symbol class regular expression). We could create a concept for every such entity. But MPS has a better solution: you can create a special model, called an accessory model, and declare all these entities in it with your language.

We have the PredefinedSymbolClass concept which is used to declare a symbol class. Also, we have the PredefinedSymbolClasses container concept, which contains these symbol classes. If you look into the accessory model of the regular expressions language, you will see this:

Editor

After defining the concept structure, we usually create an editor for it. To accomplish this task, we use the editor language. It is quite straightforward to use, so let's consider its most common constructs.

All editor-related code is placed in an editor model. You can find it under a language node in a project tree:

Here is an editor of StringLiteralRegexp:

It contains a horizontal collection, the container which you might use to group other constructs inside it, and {text}, which is used to include an editor for an instance property.

Here is an editor for MatchVariableReferenceRegexp:

It also consists of a horizontal collection, but this time we have a richer set of constructs inside it. "(ref" and ")" are constants, which always contain the same text. "%match%->{name}" is used to reference the property "name" of match link's target.

Here is an editor for Regexps:

It contains a vertical collection with nested horizontal collections. Also, it contains a "(> %Getting_started.xmlregexp% <)" construct. It is used to include editors for all the nodes in the role "regexp".

Scopes

After declaring references in structure, we have default substitute menus for them. These default menus include all the nodes of a reference type in the current model and all of its imported models. Sometimes it works, but sometimes we have to narrow down the scope of these menus (For example, if you have a lot of match variables named "name" in different parts of a model, it's a good idea to follow the Java scoping rules for these variables.) To handle this task, we have constraints language's scopes.

Scopes are placed in a constraints model under a language node:

Let's consider a scope for MatchVariableReference:

Scope consists of a referent set handler, a scope condition (labeled "can create"), and a scope constructor. Usually, only a scope constructor is specified. Scope constructor has to return an object that implements the ISearchScope interface. Usually, an instance of the class SimpleSearchScope is returned; it has a constructor which takes a list of nodes, i.e. we return a list of nodes which are visible in a specified place.

Actions

Default editors in MPS aren't very easy to use. To improve this default behavior, different constructs from the actions language and the editor language can be used.

When we enter code in a text-based language, we usually do it from left to right. We might start from "2", then enter "2+", and finally we might have "2+2". It is also possible to enter code in MPS in this way with the aid of a mechanism called 'right transform.'

To define a right transform action, you have to create a right transform actions root in the actions model and add some right transform actions to it. Let's consider a right transform action from the regular expressions language which transforms one regular expression to the unary regular expression, that is, it transforms "a" into "a+", "a*", and so on (like constraints, editor and structure, you can find the actions model under the language node in your project tree):

Each right transform has an applicable concept – the type of concept this action can be applied to. Also, it has a condition and the most important part: a right transform menu. There are different types of right transform menus. The menu on the picture above adds one menu item for each non-abstract UnaryRegexp sub concept. The handler of this menu part transforms an expression into a unary expression.

Type System

Many languages have a type system. It allows you to check a model against it, and can be used to improve editing experience and simplify the generator. For example, if we know the type of a particular expression, we are able to calculate which methods can be applied to it. MPS has a special language for type systems, called HELGINS. In languages with a very simple structure, it's possible to live without it, but when we have a complex language or want to integrate with BaseLanguage, we have to create a type system, at least for BaseLanguage integration concepts.

In HELGINS, types are represented as MPS nodes. So, if you have a sublanguage for types, like BaseLanguage does, you can use it for type checking.

Let's consider a couple of rules from the regular expressions language.

In this code we define a type called String (String here is an instance of ClassifierType from BaseLanguage, which is used in method parameter types, local variables and other places). To do so, we use the GIVETYPE statement.

Let's take a look at a more complex rule:

In this rule, we require that an expression that we match against a regular expression in FindMatchStatement be a subtype of String type. We do this by specifying a type equation. The sign ":<=:" denotes a subtype; expression TYPEOF denotes a type of expression in parenthesis.

To calculate types, HELGINS uses a sophisticated algorithm which saves you a lot of time. You don't have to worry about the order in which types are calculated; all you have to do is to specify type equations in typing rules, and HELGINS will solve them for you.

Of course, the rules in our language are very simple, and if you want to know more about HELGINS, you have to take a look at rules in languages like BaseLanguage or the model language.

Generator

Almost any language created with MPS has a generator. Generators in MPS convert the high-level language code into code in a lower level language. The key component of a generator is its mapping configuration. It tells us what to do with a language.

Let's consider a mapping configuration of the regular expressions language:

It contains one mapping rule and several reduction rules. Each rule has an applicable concept; for each instance of this concept, the rule will be applied. Mapping rules create a new root node on each application. A reduction rule replaces a node to which it is applied with a new node. Each rule has an associated template used to create an output node.

Let's take a look at an instance of such a template:

Templates contain MPS code with macros and template fragments.

The code outside of a template fragment is not used during generation, and is used only to create a context for code inside a template fragment. For example, if we know that our code will be placed inside a method with a parameter named node, we might create a method with such a parameter around the template fragment. During generation, MPS will recognize your intention, and this variable will be automatically resolved.

Macros are used to specify variable parts of code. For example, variable matcher on a picture above has a property macro on it. This property macro generates a unique name for this variable, so we will be able to use nested match blocks. MPS has different kinds of macros: different kinds of node macros, property macros, and reference macros. All of these concepts are declared in the jetbrains.mps.TLBase language.

MPS 2022.3 Help

Regular expressions