Regular expressions
The regular expressions language, which greatly simplifies text manipulation, is one of the most widely used domain-specific languages today. Almost every developer has used it at least once. Some languages, like Perl, have built-in support for it; others, like Java and Python, use it through libraries. Java, the language in which MPS is implemented, has no language-level support for regular expressions, so it was natural for us to implement a DSL for them and use it instead of a regular expression library. This language is a good example of an MPS language. After reading this introduction, you will understand how to create and use languages in MPS.
We assume you have MPS already installed.
This document uses many examples. You can find them in the regular expressions language project (%MPS_HOME%/platform/regexp) under the jetbrains.mps.regexp.examples solution:
data:image/s3,"s3://crabby-images/94373/943732a9eaed55e789adb66f491406d5ef36d9a5" alt="worddav54fd8999f157b75ecce5d51ca9bff681.png worddav54fd8999f157b75ecce5d51ca9bff681.png"
Let's take a look at a simple regular expression application. Suppose we want to get a user name and domain name from an email address. Here is code that prints out a user name and a domain name by analyzing an email address with a regular expression (you can find this example in the EmailExample class):
data:image/s3,"s3://crabby-images/fe0a3/fe0a3459e1ccd9f772a0c9f7633ed6d1c97449c6" alt="worddav6357b7cf6d2db641dc49ad95d2b20ce6.png worddav6357b7cf6d2db641dc49ad95d2b20ce6.png"
The regular expression used in this match regexp statement works as follows. First, it reads one or more word characters (\w+) and saves them in a "user" variable. After that, it reads the "@" character. Then it reads a list of words separated by periods (the "." character) and saves it in a "domain" variable (\w+(\.\w+)*). If a match is found, the program prints out user and domain to System.out.
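For comparison, the same match can be sketched with the plain java.util.regex library. This is only an illustration of what the regular expression does, not the code that MPS generates; the class name EmailMatchSketch and the sample address are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailMatchSketch {
    // user: one or more word characters; domain: words separated by periods.
    static final Pattern EMAIL =
        Pattern.compile("(?<user>\\w+)@(?<domain>\\w+(\\.\\w+)*)");

    /** Returns {user, domain}, or null when the whole input is not an email address. */
    static String[] parse(String input) {
        Matcher m = EMAIL.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("user"), m.group("domain") };
    }

    public static void main(String[] args) {
        String[] parts = parse("user@example.com");
        if (parts != null) {
            System.out.println("user: " + parts[0]);
            System.out.println("domain: " + parts[1]);
        }
    }
}
```

Compared with the match regexp statement, the library version needs escaped backslashes, an explicit Matcher object, and string-based group lookups.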
Here is a syntax tree for this example's regular expression:
data:image/s3,"s3://crabby-images/b93cb/b93cb51c9b9d1b3da192935f645e4835737015c2" alt="worddave83ccb1d63a90d30c3ac013f61241197.png worddave83ccb1d63a90d30c3ac013f61241197.png"
When we create a language in MPS, we usually start by defining its abstract syntax, which in MPS is called the language structure. To do this, we use the structure language. The structure language plays the same role for MPS languages that XML Schema plays for XML, or DDL plays for SQL. Let's take a look at the structure of the regular expressions language.
The MPS regular expressions language contains several parts:
Regular expressions: concepts used to specify regular expressions. They include concepts for string literals, symbol classes, and "or" and "sequence" regular expressions.
BaseLanguage integration: concepts used to embed regular-expression-related code into BaseLanguage (a Java-like language used internally by MPS as a target language for generators). For example, this part includes MatchRegexpStatement, ReplaceWithRegexpExpression, and SplitExpression.
Regular expressions library support. When we work with regular expressions, we want to reuse them, and so we created special concepts for this task.
All regular expression concepts in our language are placed in the "Regexp" folder of its structure model:
data:image/s3,"s3://crabby-images/86886/86886862f8c0b9834722e129226532d06a1f691a" alt="worddavc3eea13ac43a8310f50d2743e3c07435.png worddavc3eea13ac43a8310f50d2743e3c07435.png"
Let's consider them in detail. We have a single base concept for all of them: Regexp:
data:image/s3,"s3://crabby-images/7a959/7a959290d8adf948cb3d05c71d980fba720ae773" alt="worddavb1c8d5e9a555626d4a7c6f326b4391ba.png worddavb1c8d5e9a555626d4a7c6f326b4391ba.png"
It is derived from the BaseConcept concept, as are all MPS concepts. Regexp also has the abstract concept property, which means that it exists to form a concept hierarchy, not to be used directly to define regular expressions. This is similar to the 'abstract' modifier on Java classes.
Let's consider the concepts that are derived from it. You can see them in the hierarchy view, which you open by pressing Ctrl + H on the concept declaration. For the Regexp concept, we will see the following:
data:image/s3,"s3://crabby-images/55efb/55efb950b29e2def4dd33e90389671aef83b2ebd" alt="worddav641bbd1e055f8a1257d7e09e8545a4b5.png worddav641bbd1e055f8a1257d7e09e8545a4b5.png"
StringRegexp represents an arbitrary string which can be matched against text (you can find all the example regular expressions that we consider in this section in the Regexps root node):
data:image/s3,"s3://crabby-images/f6d50/f6d50fef63414f8c44df1d6da2335c5d39a9fa06" alt="worddav60b32b2dae9e635eaec0472fce900e33.png worddav60b32b2dae9e635eaec0472fce900e33.png"
Let's take a look at its concept declaration (you can quickly navigate to a concept declaration by pressing Ctrl + Shift + S when an instance of a concept is selected in an editor):
data:image/s3,"s3://crabby-images/62416/62416faca55b175afb40aa47e8a72ee619574b1d" alt="worddavc394a4e587c8986c177d7a7e08981894.png worddavc394a4e587c8986c177d7a7e08981894.png"
In its declaration, we see a property named text with a string type, which stores the text that will be shown in the editor. This concept also declares a concept property, "alias". Concept properties differ from simple properties: simple properties correspond to Java instance fields, and concept properties correspond to Java static fields. The value of the alias concept property is shown in the completion menu when we press Ctrl + Space:
data:image/s3,"s3://crabby-images/aa449/aa44955e67e9fd53ea6246760e240e3256818d84" alt="worddavf9496c8f18b0eb7b37640d05e82a8bf6.png worddavf9496c8f18b0eb7b37640d05e82a8bf6.png"
Binary regular expressions combine two regular expressions into one. The BinaryRegexp concept is declared as abstract and has two concrete subconcepts: OrRegexp and SeqRegexp. Here are examples of their instances:
data:image/s3,"s3://crabby-images/1959d/1959d16323f9ad47d34d9761265c6576ba72df5c" alt="worddav293aaafcd17556c2b1f82ba50ef99148.png worddav293aaafcd17556c2b1f82ba50ef99148.png"
data:image/s3,"s3://crabby-images/cc7f7/cc7f798993fbb5230fa56e3e4343d7e55527017f" alt="worddava8acebee034eb28ab80b57e485b5aebd.png worddava8acebee034eb28ab80b57e485b5aebd.png"
Here is the BinaryRegexp concept declaration:
data:image/s3,"s3://crabby-images/e4691/e4691fb556b91a5e6cb4f927eceed2023500cacb" alt="worddav37de03cf815e63b6dab3fbf5daf43f86.png worddav37de03cf815e63b6dab3fbf5daf43f86.png"
It defines two links: one to store the left part and another to store the right part. The word 'aggregation' means that the regular expression under this link is a part of the declared concept's instance. That is, if we look at the syntax tree, we will see a child regular expression under the parent BinaryRegexp:
data:image/s3,"s3://crabby-images/82dbe/82dbe68ada775be5700414c860261fcf29b92811" alt="worddav4ad0a0fbd213cbdb7725a1851aaf03bf.png worddav4ad0a0fbd213cbdb7725a1851aaf03bf.png"
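The structure described so far can be approximated with an ordinary Java class hierarchy. The sketch below is a hypothetical analogue, not something MPS generates: the abstract concepts become abstract classes, the alias concept property becomes a static field (its value here is assumed), the text property becomes an instance field, and the two aggregation links become child fields that show up under the parent when the tree is printed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical Java analogue of the structure model; MPS does not generate these classes.
class StructureSketch {
    public static void main(String[] args) {
        Regexp tree = new OrRegexp(
            new StringRegexp("cat"),
            new SeqRegexp(new StringRegexp("d"), new StringRegexp("og")));
        System.out.print(tree.dump(""));
    }
}

// Abstract concept: exists only to root the hierarchy.
abstract class Regexp {
    /** Children held through aggregation links; leaf concepts have none. */
    List<Regexp> children() {
        return new ArrayList<>();
    }

    String label() {
        return getClass().getSimpleName();
    }

    /** Renders this node and its aggregated children as an indented tree. */
    String dump(String indent) {
        StringBuilder sb = new StringBuilder(indent).append(label()).append('\n');
        for (Regexp child : children()) {
            sb.append(child.dump(indent + "  "));
        }
        return sb.toString();
    }
}

class StringRegexp extends Regexp {
    static final String ALIAS = "string"; // concept property analogue; the value is assumed
    final String text;                    // simple property analogue

    StringRegexp(String text) {
        this.text = text;
    }

    @Override
    String label() {
        return "StringRegexp \"" + text + "\"";
    }
}

// Abstract concept with two aggregation links, left and right.
abstract class BinaryRegexp extends Regexp {
    final Regexp left;
    final Regexp right;

    BinaryRegexp(Regexp left, Regexp right) {
        this.left = left;
        this.right = right;
    }

    @Override
    List<Regexp> children() {
        return Arrays.asList(left, right);
    }
}

class OrRegexp extends BinaryRegexp {
    OrRegexp(Regexp left, Regexp right) {
        super(left, right);
    }
}

class SeqRegexp extends BinaryRegexp {
    SeqRegexp(Regexp left, Regexp right) {
        super(left, right);
    }
}
```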
DotRegexp represents a regexp which matches any character. LineEndRegexp matches only at the end of a line. LineStartRegexp matches only at the start of a line. ParensRegexp is used to group other regular expressions in order to make an enclosing regular expression more readable.
data:image/s3,"s3://crabby-images/abd5f/abd5f1f2b8b54a22a15b0c3dec9840947e817695" alt="worddav24e21f44c149abcb3659dfd9d06e62aa.png worddav24e21f44c149abcb3659dfd9d06e62aa.png"
data:image/s3,"s3://crabby-images/650c5/650c58fe508f8e8c39677b5c48b8a59ff692700f" alt="worddav642ced80a263561d8c469346025b4836.png worddav642ced80a263561d8c469346025b4836.png"
data:image/s3,"s3://crabby-images/ae338/ae338a50350d31016db854c7f6632ae00e95f587" alt="worddav94705c1fc94ecb5145b425d14f42afcf.png worddav94705c1fc94ecb5145b425d14f42afcf.png"
Many commonly used sets of symbols are quite verbose to enter. So we have character classes that make it possible to enter [A-Z] instead of (A|B|C|...|Z). There are two kinds of them, positive and negative, and both extend the abstract SymbolClassRegexp:
data:image/s3,"s3://crabby-images/9b10d/9b10d75b47480f004b3690a1596c77cd299a7630" alt="worddav802b499d6a7435f7a89accadba5698b3.png worddav802b499d6a7435f7a89accadba5698b3.png"
data:image/s3,"s3://crabby-images/9a0a6/9a0a64d99bcf70bef7abe78986549e7314d7e183" alt="worddava11126417379b10b2d63b7b458305019.png worddava11126417379b10b2d63b7b458305019.png"
Many of these character classes are used in several places, so they can be referenced in a simpler way with PredefinedSymbolClassRegexp. Instead of [a-zA-Z_0-9] we can write "\w":
data:image/s3,"s3://crabby-images/21c91/21c91f31eff6a2d6f5975a8302659caf9b7f7136" alt="worddav133f90af8072c5fb9dfe57c7158c29a9.png worddav133f90af8072c5fb9dfe57c7158c29a9.png"
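In java.util.regex, the predefined class \w stands for [a-zA-Z_0-9] by default. The short check below illustrates that the shorthand and the explicit character class give the same verdict; the class name and the sample strings are chosen only for this example.

```java
import java.util.regex.Pattern;

class PredefinedClassSketch {
    /** True when the shorthand and the explicit class agree on the input. */
    static boolean sameVerdict(String s) {
        return Pattern.matches("\\w+", s) == Pattern.matches("[a-zA-Z_0-9]+", s);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "abc_123", "hello world", "Z", "a-b" }) {
            System.out.println(s
                + " -> \\w+: " + Pattern.matches("\\w+", s)
                + ", explicit class: " + Pattern.matches("[a-zA-Z_0-9]+", s));
        }
    }
}
```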
PredefinedSymbolClassRegexp is declared in the following way:
data:image/s3,"s3://crabby-images/16dd9/16dd995c837002077b88526de11c83a136e580e2" alt="worddav622ddc5598027e3c18f2fbd3f4bf8d24.png worddav622ddc5598027e3c18f2fbd3f4bf8d24.png"
Here we have the symbolClass link declaration, which has a reference stereotype (aggregation, which we mentioned above, is also a link stereotype). The reference stereotype means that an instance of this concept won't contain the referenced node as a child. Instead, the referenced node can be stored anywhere in the model.
We also have a lot of different unary regexps, which are derived from the abstract concept UnaryRegexp. They include +, * and other regexp operations:
data:image/s3,"s3://crabby-images/95d79/95d7945d2f456011800251212c7d8e1b020fb435" alt="worddav302fe26fe89dcd25ad86915228449271.png worddav302fe26fe89dcd25ad86915228449271.png"
data:image/s3,"s3://crabby-images/86cee/86cee0e9fdfab16665ec8dafbad1a90b78f25192" alt="worddav196774bebcf4832840530e8a598b777a.png worddav196774bebcf4832840530e8a598b777a.png"
When we work with text, it is often useful to remember some match and reference it later. To facilitate this, we have MatchParensRegexp, which remembers the string it matches, and MatchVariableReferenceRegexp, which references a string matched before. The following code matches a pair of identical XML tags with text between them:
data:image/s3,"s3://crabby-images/f9906/f9906081cc61bd5f893795ef8da0ada53e0d3ea6" alt="worddav480b88108c470d3412e1c3ca338bf95d.png worddav480b88108c470d3412e1c3ca338bf95d.png"
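The same idea can be sketched with java.util.regex, where a named group remembers a match and \k<name> refers back to it. This is a library-level illustration under assumed names (TagPairSketch, the groups tag and body), not the output of the MPS generator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPairSketch {
    // (?<tag>...) remembers the tag name; \k<tag> requires the same text again.
    static final Pattern TAG_PAIR =
        Pattern.compile("<(?<tag>\\w+)>(?<body>.*)</\\k<tag>>");

    /** Returns the text between a matching pair of tags, or null if the tags differ. */
    static String bodyOf(String input) {
        Matcher m = TAG_PAIR.matcher(input);
        return m.matches() ? m.group("body") : null;
    }

    public static void main(String[] args) {
        System.out.println(bodyOf("<b>bold</b>"));
        System.out.println(bodyOf("<b>bold</i>"));
    }
}
```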
Regular expressions are of little use if they can't be integrated into BaseLanguage code. So the regular expressions language has special concepts which make it possible to write regular-expression-related constructs in a program written in BaseLanguage.
If you want to add new constructs to BaseLanguage, you usually extend either the Expression or the Statement concept from BaseLanguage. The Expression concept represents expressions like "1+2" and "a == b". The Statement concept represents control structures like "if() { }" and "while() { }". In the regular expressions language we create both new expressions and new statements.
Let's first take a look at the statements and then at the expressions:
MatchRegexpStatement is used when you want to check whether a specified string matches a regular expression (you can find the examples for this section in BaseLanguageIntegration class in jetbrains.mps.regexp.examples model):
data:image/s3,"s3://crabby-images/ecfc5/ecfc50835933a9df5b240e58bad6991428bb349a" alt="worddav7ca0d21a461360dc9e35c6d2732b1467.png worddav7ca0d21a461360dc9e35c6d2732b1467.png"
We have an interesting feature here: you can reference named matches in the MatchRegexpStatement block. These match variables also work in the other statements defined in the regular expressions language.
FindMatchStatement checks whether a specified string contains a match for a specified regular expression. It is similar to MatchRegexpStatement.
data:image/s3,"s3://crabby-images/6a887/6a887421437b781044698539bd98c1c288921147" alt="worddav49147cbc3846c222907f8ab55c1ed548.png worddav49147cbc3846c222907f8ab55c1ed548.png"
ForEachMatchStatement allows you to iterate over all matches of a specified regular expression in a specified string:
data:image/s3,"s3://crabby-images/1f4e2/1f4e288cc3196e27f60c5be83aac2238979714f3" alt="worddavb00b56a3a457bcfa3bd5c194d9f07a2a.png worddavb00b56a3a457bcfa3bd5c194d9f07a2a.png"
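The three statements correspond roughly to three ways of driving a java.util.regex.Matcher: matching the whole string, finding a match somewhere in the string, and looping over every match. The sketch below shows these library calls side by side; it is an assumed analogy for illustration, and the class and method names are not part of the regular expressions language.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKindsSketch {
    static final Pattern NUMBER = Pattern.compile("\\d+");

    /** Analogue of MatchRegexpStatement: the whole string must match. */
    static boolean matchesWhole(String s) {
        return NUMBER.matcher(s).matches();
    }

    /** Analogue of FindMatchStatement: a match anywhere in the string is enough. */
    static boolean containsMatch(String s) {
        return NUMBER.matcher(s).find();
    }

    /** Analogue of ForEachMatchStatement: visit every match in order. */
    static List<String> allMatches(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = NUMBER.matcher(s);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("year 2024"));
        System.out.println(containsMatch("year 2024"));
        System.out.println(allMatches("10 apples, 20 pears"));
    }
}
```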
When we work with a string, we often want to replace all matches of a regular expression with a specified text. In the regular expressions language you can do this with ReplaceWithRegexpExpression:
data:image/s3,"s3://crabby-images/ef247/ef247518acfa586dafbfc2732ae4fd314fbc830f" alt="worddav0324d85e995d77bce18cb1796cbb79a9.png worddav0324d85e995d77bce18cb1796cbb79a9.png"
It is also often practical to split a string with some regular expression. For example, to extract parts of a string which are separated by one or more whitespace symbols we can write this SplitExpression:
data:image/s3,"s3://crabby-images/2a952/2a95290d8ddfbca280c8686359464a7b9ef4da19" alt="worddav8693e20963fbd910cdc6f17e65500d23.png worddav8693e20963fbd910cdc6f17e65500d23.png"
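In plain Java, the closest library counterparts are String.replaceAll and String.split. The sketch below shows both; the helper names are invented for this example, and the call to trim() is an addition that avoids an empty first element when the input starts with whitespace.

```java
class ReplaceSplitSketch {
    /** Replaces every digit with '#'. */
    static String maskDigits(String s) {
        return s.replaceAll("\\d", "#");
    }

    /** Splits on runs of one or more whitespace characters. */
    static String[] words(String s) {
        // trim() first: a leading separator would otherwise yield an empty first element.
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(maskDigits("call 555-0100"));
        for (String word : words("  one two\t three ")) {
            System.out.println(word);
        }
    }
}
```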
When we reference a match in a block, the MatchVariableReference concept is used. It is also derived from the Expression concept.
When we work with regular expressions, we want to use some of them in many places. To define these reusable regular expressions, we have a special concept – Regexps. It contains zero or more named regular expressions:
data:image/s3,"s3://crabby-images/151a9/151a9512eecf61c498965e1bb336b472d8db9cca" alt="worddavb1831b4d7d4be636b5f539589a7abd8c.png worddavb1831b4d7d4be636b5f539589a7abd8c.png"
In many languages we have the following problem: we have a lot of very similar entities, which can be used in any model that is written with this language (like predefined symbol class regular expression). We could create a concept for every such entity. But MPS has a better solution: you can create a special model, called an accessory model, and declare all these entities in it with your language.
We have the PredefinedSymbolClass concept which is used to declare a symbol class. Also, we have the PredefinedSymbolClasses container concept, which contains these symbol classes. If you look into the accessory model of the regular expressions language, you will see this:
data:image/s3,"s3://crabby-images/20970/209708a5532e121d7cc78daf8b38bfaef78f4f5d" alt="worddav26add762091c9a627c43395111298b8a.png worddav26add762091c9a627c43395111298b8a.png"
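One way to picture an accessory model is as a shared table of declarations that every model written in the language can point at. The Java sketch below is a loose analogy: the class names mirror the concepts named above, but the fields, the lookup method, and the three declared entries are assumptions made for illustration and do not reflect the actual contents of the accessory model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AccessoryModelSketch {
    public static void main(String[] args) {
        PredefinedSymbolClassRegexp first =
            new PredefinedSymbolClassRegexp(PredefinedSymbolClasses.lookup("\\w"));
        PredefinedSymbolClassRegexp second =
            new PredefinedSymbolClassRegexp(PredefinedSymbolClasses.lookup("\\w"));
        // Both regexps reference the same declaration; neither owns it.
        System.out.println(first.symbolClass == second.symbolClass);
    }
}

/** Hypothetical stand-in for one symbol class declaration. */
class PredefinedSymbolClass {
    final String name;
    final String description;

    PredefinedSymbolClass(String name, String description) {
        this.name = name;
        this.description = description;
    }
}

/** Stand-in for the container that lives in the accessory model. */
class PredefinedSymbolClasses {
    private static final Map<String, PredefinedSymbolClass> BY_NAME = new LinkedHashMap<>();

    static {
        // Assumed entries, declared once and shared by every user of the language.
        declare("\\w", "word character");
        declare("\\d", "digit");
        declare("\\s", "whitespace");
    }

    private static void declare(String name, String description) {
        BY_NAME.put(name, new PredefinedSymbolClass(name, description));
    }

    /** Returns the shared declaration, or null if the name is not declared. */
    static PredefinedSymbolClass lookup(String name) {
        return BY_NAME.get(name);
    }
}

/** Holds a reference link, not a child: the target is stored elsewhere. */
class PredefinedSymbolClassRegexp {
    final PredefinedSymbolClass symbolClass;

    PredefinedSymbolClassRegexp(PredefinedSymbolClass symbolClass) {
        this.symbolClass = symbolClass;
    }
}
```

Because the declarations are data rather than concepts, adding a new symbol class in this picture means adding one entry, not extending the language structure.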
After defining the concept structure, we usually create an editor for it. To accomplish this task, we use the editor language. It is quite straightforward to use, so let's consider its most common constructs.
All editor-related code is placed in an editor model. You can find it under a language node in a project tree:
data:image/s3,"s3://crabby-images/afa21/afa21538edfc8ad9c7e3c3b14a764de114f113c9" alt="worddavc1136b6cec7f424fb0462579c99c0799.png worddavc1136b6cec7f424fb0462579c99c0799.png"
Here is an editor of StringLiteralRegexp:
data:image/s3,"s3://crabby-images/6283c/6283c1ee41bc206dd8a7b849969e2cdca754c690" alt="worddav9af071cf04fc04a21f5fda7ba65302a4.png worddav9af071cf04fc04a21f5fda7ba65302a4.png"
It contains a horizontal collection, the container which you might use to group other constructs inside it, and {text}, which is used to include an editor for an instance property.
Here is an editor for MatchVariableReferenceRegexp:
data:image/s3,"s3://crabby-images/0d74c/0d74c286ce3d7969395e3b54ba52f4f3c214b8af" alt="worddavde451156ae66510919af6563b7264e39.png worddavde451156ae66510919af6563b7264e39.png"
It also consists of a horizontal collection, but this time we have a richer set of constructs inside it. "(ref" and ")" are constants, which always contain the same text. "%match%->{name}" is used to reference the property "name" of match link's target.
Here is an editor for Regexps:
data:image/s3,"s3://crabby-images/0a924/0a924982dc5ce28030f337e583b26b0ac85264a0" alt="worddav79e6df4b7ef387a833fa176607b56cd2.png worddav79e6df4b7ef387a833fa176607b56cd2.png"
It contains a vertical collection with nested horizontal collections. It also contains a "(> %regexp% <)" construct, which is used to include editors for all the nodes in the role "regexp".
After declaring references in structure, we get default substitute menus for them. These default menus include all the nodes of the reference's type in the current model and all of its imported models. Sometimes that is enough, but sometimes we have to narrow down the scope of these menus. For example, if you have a lot of match variables named "name" in different parts of a model, it's a good idea to follow the Java scoping rules for these variables. To handle this task, the constraints language provides scopes.
Scopes are placed in a constraints model under a language node:
data:image/s3,"s3://crabby-images/7df4a/7df4aac31bd30fee2bd476f8da18c7399e2286ec" alt="worddavaeb22e88a30788c4c39df5434c19093d.png worddavaeb22e88a30788c4c39df5434c19093d.png"
Let's consider a scope for MatchVariableReference:
data:image/s3,"s3://crabby-images/9d510/9d510c8757a1e2f4d1d017977b1b98c41a85be2e" alt="worddavc378d6ee860a2581a4e9dbdc23fae9d8.png worddavc378d6ee860a2581a4e9dbdc23fae9d8.png"
A scope consists of a referent set handler, a scope condition (labeled "can create"), and a scope constructor. Usually, only the scope constructor is specified. It has to return an object that implements the ISearchScope interface. Usually, an instance of the class SimpleSearchScope is returned; it has a constructor which takes a list of nodes, that is, we return the list of nodes which are visible in the specified place.
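The computation a scope constructor performs can be sketched independently of MPS. The example below does not use ISearchScope or SimpleSearchScope; it uses a made-up Node class and simply walks up the enclosing nodes to collect the match variables visible at a given place, which is the kind of list such a constructor would return.

```java
import java.util.ArrayList;
import java.util.List;

class ScopeSketch {
    public static void main(String[] args) {
        Node method = new Node("method", null);
        Node outer = new Node("match", method);
        outer.declaredMatchVariables.add("user");
        Node inner = new Node("match", outer);
        inner.declaredMatchVariables.add("name");
        Node sibling = new Node("match", method);
        sibling.declaredMatchVariables.add("other");
        Node reference = new Node("reference", inner);
        // The sibling's variable is not visible from inside the nested match.
        System.out.println(Scopes.visibleMatchVariables(reference));
    }
}

/** Hypothetical tree node; not the MPS node API. */
class Node {
    final String kind;
    final Node parent;
    final List<String> declaredMatchVariables = new ArrayList<>();

    Node(String kind, Node parent) {
        this.kind = kind;
        this.parent = parent;
    }
}

class Scopes {
    /** Collects variables declared by the enclosing nodes, innermost first. */
    static List<String> visibleMatchVariables(Node context) {
        List<String> visible = new ArrayList<>();
        for (Node n = context; n != null; n = n.parent) {
            for (String variable : n.declaredMatchVariables) {
                if (!visible.contains(variable)) {
                    visible.add(variable); // the innermost declaration of a name wins
                }
            }
        }
        return visible;
    }
}
```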
Default editors in MPS aren't very easy to use. To improve this default behavior, different constructs from the actions language and the editor language can be used.
When we enter code in a text-based language, we usually do it from left to right. We might start from "2", then enter "2+", and finally we might have "2+2". It is also possible to enter code in MPS in this way with the aid of a mechanism called 'right transform.'
To define a right transform action, you have to create a right transform actions root in the actions model and add some right transform actions to it. Like the constraints, editor, and structure models, the actions model can be found under the language node in your project tree. Let's consider a right transform action from the regular expressions language which transforms a regular expression into a unary regular expression, that is, it transforms "a" into "a+", "a*", and so on:
data:image/s3,"s3://crabby-images/be883/be8836d7b069d3971df358369ccaeb79bdef929b" alt="worddavced409801e815e076620cb81669d28df.png worddavced409801e815e076620cb81669d28df.png"
Each right transform has an applicable concept, that is, the concept this action can be applied to. It also has a condition and, most importantly, a right transform menu. There are different types of right transform menus. The menu in the picture above adds one menu item for each non-abstract UnaryRegexp subconcept. The handler of this menu part transforms an expression into a unary expression.
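The effect of such a handler can be sketched in plain Java: the node that was just entered becomes the operand of a new unary node chosen from the menu. Everything in this sketch is hypothetical, including the class names and the fixed list of operators, which stands in for the menu items that MPS derives from the UnaryRegexp subconcepts.

```java
import java.util.Arrays;
import java.util.List;

class RightTransformSketch {
    public static void main(String[] args) {
        Regexp a = new StringRegexp("a");
        System.out.println(RightTransform.apply(a, "+").text());
        System.out.println(RightTransform.apply(a, "*").text());
    }
}

abstract class Regexp {
    abstract String text();
}

class StringRegexp extends Regexp {
    final String value;

    StringRegexp(String value) {
        this.value = value;
    }

    @Override
    String text() {
        return value;
    }
}

class UnaryRegexp extends Regexp {
    final Regexp operand;
    final String operator;

    UnaryRegexp(Regexp operand, String operator) {
        this.operand = operand;
        this.operator = operator;
    }

    @Override
    String text() {
        return operand.text() + operator;
    }
}

class RightTransform {
    // Stand-in for the menu: one item per concrete unary operation (assumed list).
    static final List<String> MENU = Arrays.asList("+", "*", "?");

    /** Wraps the node in a unary regexp when the typed text is a menu item. */
    static Regexp apply(Regexp node, String typed) {
        if (!MENU.contains(typed)) {
            return node; // not a menu item: leave the node unchanged
        }
        return new UnaryRegexp(node, typed);
    }
}
```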
Many languages have a type system. It allows you to check a model against it, and can be used to improve editing experience and simplify the generator. For example, if we know the type of a particular expression, we are able to calculate which methods can be applied to it. MPS has a special language for type systems, called HELGINS. In languages with a very simple structure, it's possible to live without it, but when we have a complex language or want to integrate with BaseLanguage, we have to create a type system, at least for BaseLanguage integration concepts.
In HELGINS, types are represented as MPS nodes. So, if you have a sublanguage for types, like BaseLanguage does, you can use it for type checking.
Let's consider a couple of rules from the regular expressions language.
data:image/s3,"s3://crabby-images/af300/af300a13effdb8bcb5f223e0783883bcae1fdb6b" alt="worddavfbd103e572784a29f920f075b0b64b55.png worddavfbd103e572784a29f920f075b0b64b55.png"
In this rule we give a node the type String (String here is an instance of ClassifierType from BaseLanguage, which is used in method parameter types, local variables and other places). To do so, we use the GIVETYPE statement.
Let's take a look at a more complex rule:
data:image/s3,"s3://crabby-images/278a2/278a2e2e923ebedb37659d39f16e7d4767983d19" alt="worddav1a274d8e26272ef9d52b6c7f7458bfff.png worddav1a274d8e26272ef9d52b6c7f7458bfff.png"
In this rule, we require that the expression matched against a regular expression in FindMatchStatement be a subtype of the String type. We do this by specifying a type equation. The sign ":<=:" denotes the subtype relation, and TYPEOF denotes the type of the expression in parentheses.
To calculate types, HELGINS uses a sophisticated algorithm which saves you a lot of time. You don't have to worry about the order in which types are calculated; all you have to do is to specify type equations in typing rules, and HELGINS will solve them for you.
Of course, the rules in our language are very simple, and if you want to know more about HELGINS, you have to take a look at rules in languages like BaseLanguage or the model language.
Almost any language created with MPS has a generator. Generators in MPS convert high-level language code into code in a lower-level language. The key component of a generator is its mapping configuration, which specifies which rules the generator applies to which concepts of the language.
Let's consider a mapping configuration of the regular expressions language:
data:image/s3,"s3://crabby-images/2c9d6/2c9d645fbff9aa1a6d9b2bf36104cb44c500861b" alt="worddavf77db0d024ca925203b56a14931eb487.png worddavf77db0d024ca925203b56a14931eb487.png"
It contains one mapping rule and several reduction rules. Each rule has an applicable concept; for each instance of this concept, the rule will be applied. Mapping rules create a new root node on each application. A reduction rule replaces a node to which it is applied with a new node. Each rule has an associated template used to create an output node.
Let's take a look at an instance of such a template:
data:image/s3,"s3://crabby-images/d672a/d672a53e19a7a9a74210c21021582832b8386e72" alt="worddavc1fd76f76c7cffb01f1f0e716bd309a7.png worddavc1fd76f76c7cffb01f1f0e716bd309a7.png"
Templates contain MPS code with macros and template fragments.
The code outside of a template fragment is not used during generation, and is used only to create a context for code inside a template fragment. For example, if we know that our code will be placed inside a method with a parameter named node, we might create a method with such a parameter around the template fragment. During generation, MPS will recognize your intention, and this variable will be automatically resolved.
Macros are used to specify variable parts of code. For example, the variable matcher in the picture above has a property macro on it. This property macro generates a unique name for this variable, so that match blocks can be nested. MPS has several kinds of macros: node macros of various kinds, property macros, and reference macros. All of these concepts are declared in the jetbrains.mps.TLBase language.
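The two ideas from this section, reducing each node to lower-level code and generating a fresh variable name on each template application, can be sketched in plain Java. This is a simplified model under assumed names (reduce, MatchTemplate, the matcher_N naming scheme); it does not reproduce the actual MPS templates or the code they emit.

```java
import java.util.regex.Pattern;

class GeneratorSketch {
    public static void main(String[] args) {
        Regexp tree = new SeqRegexp(
            new PlusRegexp(new OrRegexp(new StringRegexp("a"), new StringRegexp("b"))),
            new StringRegexp("c"));
        MatchTemplate template = new MatchTemplate();
        System.out.println(template.instantiate(tree, "input"));
        System.out.println(template.instantiate(tree, "other"));
    }
}

abstract class Regexp {
    /** Reduction analogue: replaces this node with java.util.regex pattern text. */
    abstract String reduce();
}

class StringRegexp extends Regexp {
    final String text;
    StringRegexp(String text) { this.text = text; }
    @Override String reduce() { return Pattern.quote(text); }
}

class OrRegexp extends Regexp {
    final Regexp left, right;
    OrRegexp(Regexp left, Regexp right) { this.left = left; this.right = right; }
    @Override String reduce() { return "(?:" + left.reduce() + "|" + right.reduce() + ")"; }
}

class SeqRegexp extends Regexp {
    final Regexp left, right;
    SeqRegexp(Regexp left, Regexp right) { this.left = left; this.right = right; }
    @Override String reduce() { return left.reduce() + right.reduce(); }
}

class PlusRegexp extends Regexp {
    final Regexp operand;
    PlusRegexp(Regexp operand) { this.operand = operand; }
    @Override String reduce() { return "(?:" + operand.reduce() + ")+"; }
}

/** Template analogue: fixed code text with two variable parts filled in. */
class MatchTemplate {
    private int counter = 0;

    String instantiate(Regexp regexp, String inputExpression) {
        // Property macro analogue: a fresh name per application allows nesting.
        String matcherName = "matcher_" + counter++;
        // Escape the pattern so it can sit inside a Java string literal.
        String literal = regexp.reduce().replace("\\", "\\\\").replace("\"", "\\\"");
        return "java.util.regex.Matcher " + matcherName
            + " = java.util.regex.Pattern.compile(\"" + literal + "\").matcher("
            + inputExpression + ");";
    }
}
```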
We have taken a look at the regular expressions language. It uses many MPS language development features, but of course not all of them. The best way to learn how to use MPS is to look at other languages, such as BaseLanguage and the bootstrap languages. There are several tools in MPS which help you understand how a language works.
One of them is find usages. You can invoke it by choosing Find Usages from the editor popup menu, or by pressing Alt + F7 on a node in an editor.
data:image/s3,"s3://crabby-images/b73a9/b73a9a68a98b77b7bf94063141599266715f631d" alt="worddav49f451933a7d937ef0e3fc0fdcb6f70f.png worddav49f451933a7d937ef0e3fc0fdcb6f70f.png"
The second one is find concept instances. When you come across a concept, and you don't know how to use it, the best way to learn it is to find its instances and try to understand what those instances do.
data:image/s3,"s3://crabby-images/cc58e/cc58e469261f573043fa3cc1c9968da7b787e02e" alt="worddav8b182bc2907aa0f4261f0c1176d60061.png worddav8b182bc2907aa0f4261f0c1176d60061.png"
The MPS distribution also contains a documentation system in the %MPS_HOME%/help folder. Some of it is out of date and some is quite incomplete, but it can be used to learn MPS.