Creating Node Types

A custom language requires many NodeType classes and instances to be created. After all, every distinct type of node in a PSI tree needs a distinct NodeType instance. While it is possible to create these classes and instances by hand, it is more manageable to use tooling to generate the code.

The SDK provides the TokenGenerator tool to convert an XML file listing tokens and keywords into TokenNodeType classes and static singleton instances. The PsiGen tool is used to create a parser from a .psi file, but it also creates ITreeNode classes and CompositeNodeType classes and instances.

The output of the lexer is a stream of singleton instances of classes derived from TokenNodeType. The parser doesn't need to know the actual class of the token node type; it only needs to compare it against a known singleton value and call a known method, TokenNodeType.Create. The same is true for interior tree nodes: the class of the CompositeNodeType is irrelevant. Instead, the known singleton value is used to call CompositeNodeType.Create to create the interior tree node.

As such, the usual structure when creating node types is to create them as private nested classes inside a "token type" or "element type" class, and create public static fields that expose the singleton instance.

Creating token node types

For example, the C# language defines the CSharpTokenType class. This is not to be confused with the CSharpTokenNodeType, which is the class derived from TokenNodeType, and acts as the base class for all C# token node types. Instead, the CSharpTokenType class contains a number of private class definitions - CSharpTokenNodeType, WhitespaceNodeType, NewLineNodeType and so on. It also contains public static fields of type TokenNodeType, such as WHITE_SPACE, NEW_LINE, END_OF_LINE_COMMENT and so on (the capitals betray the Java heritage, and ReSharper's lineage from IntelliJ).

public static partial class CSharpTokenType
{
  private abstract class CSharpTokenNodeType : TokenNodeType
  {
    // ...
  }

  private sealed class GenericTokenNodeType : CSharpTokenNodeType
  {
    // ...
  }

  private sealed class WhitespaceNodeType : CSharpTokenNodeType
  {
    // ...
  }

  private sealed class NewLineNodeType : CSharpTokenNodeType
  {
    // ...
  }
  // EndOfLineCommentNodeType and the remaining node type classes are declared in the same way (omitted here).

  public static readonly TokenNodeType WHITE_SPACE = new WhitespaceNodeType(LAST_GENERATED_TOKEN_TYPE_INDEX + 1);
  public static readonly TokenNodeType NEW_LINE = new NewLineNodeType(LAST_GENERATED_TOKEN_TYPE_INDEX + 2);
  public static readonly TokenNodeType END_OF_LINE_COMMENT = new EndOfLineCommentNodeType(LAST_GENERATED_TOKEN_TYPE_INDEX + 3);

  public static readonly TokenNodeType INTEGER_LITERAL = new GenericTokenNodeType("INTEGER_LITERAL", LAST_GENERATED_TOKEN_TYPE_INDEX + 6, "000");
  public static readonly TokenNodeType FLOAT_LITERAL = new GenericTokenNodeType("FLOAT_LITERAL", LAST_GENERATED_TOKEN_TYPE_INDEX + 7, "0.0");
  public static readonly TokenNodeType CHARACTER_LITERAL = new GenericTokenNodeType("CHARACTER_LITERAL", LAST_GENERATED_TOKEN_TYPE_INDEX + 8, "'C'");
}

The unique index for the node type is passed into the constructor, and is based on the LAST_GENERATED_TOKEN_TYPE_INDEX value, which in turn is generated by the TokenGenerator SDK tool.

The whitespace, new line and comment token node types are specific classes, and only need the index, while the integer, float and character literal token node types are instances of GenericTokenNodeType, and require a name, index and representation.

Not shown in the sample above is that the node type classes all implement the Create method, and return an instance of an ITreeNode, or more specifically, a class that derives from LeafElementBase. This is covered in more detail in the section on creating tree nodes.
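The relationship between a node type singleton and its Create method can be sketched as follows. This is a minimal, self-contained illustration rather than the SDK's API: SketchLeafElement and SketchTokenNodeType are hypothetical stand-ins for LeafElementBase and TokenNodeType, the Create parameter is simplified to a string, and FooTokenType, SketchParser and the index value 1 are invented for the example.

```csharp
using System;

// Hypothetical stand-in for LeafElementBase; the real SDK class has many more members.
public abstract class SketchLeafElement
{
  protected SketchLeafElement(string text) { Text = text; }
  public string Text { get; }
}

// Hypothetical stand-in for TokenNodeType. The real Create signature is not reproduced here.
public abstract class SketchTokenNodeType
{
  protected SketchTokenNodeType(string name, int index) { Name = name; Index = index; }
  public string Name { get; }
  public int Index { get; }

  // The "known method" that callers use without knowing the concrete class.
  public abstract SketchLeafElement Create(string text);
}

public static class FooTokenType
{
  // Private leaf element: the tree node produced for whitespace tokens.
  private sealed class WhitespaceElement : SketchLeafElement
  {
    public WhitespaceElement(string text) : base(text) { }
  }

  // Private node type: only the singleton below is visible to callers.
  private sealed class WhitespaceNodeType : SketchTokenNodeType
  {
    public WhitespaceNodeType(int index) : base("WHITE_SPACE", index) { }
    public override SketchLeafElement Create(string text) => new WhitespaceElement(text);
  }

  // Arbitrary index chosen for the sketch.
  public static readonly SketchTokenNodeType WHITE_SPACE = new WhitespaceNodeType(1);
}

public static class SketchParser
{
  // The token type is identified by comparing against the known singleton instance.
  public static bool IsWhitespace(SketchTokenNodeType type) =>
    ReferenceEquals(type, FooTokenType.WHITE_SPACE);

  // The leaf is created through the node type, never by naming the concrete class.
  public static SketchLeafElement CreateLeaf(SketchTokenNodeType type, string text) =>
    type.Create(text);
}
```

In this sketch, calling SketchParser.CreateLeaf(FooTokenType.WHITE_SPACE, "  ") yields a leaf element whose concrete class is never visible to the caller, which is the property the nested private classes are meant to provide.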

Also note that the CSharpTokenType class is a partial class. This allows other token node types and instances to be created in other files. Typically, a custom language will define the base token node types by hand - CSharpTokenNodeType, WhitespaceTokenNodeType, IdentifierTokenNodeType and also FixedTokenNodeType, KeywordTokenNodeType and GenericTokenNodeType. However, fixed(-length) tokens and keywords are usually generated by TokenGenerator.

Generating token node types

The TokenGenerator SDK tool takes an input XML file and creates a C# file that contains the "token type" class, declared as partial, with classes and instances for each of the fixed tokens and keywords in the file. It also creates the LAST_GENERATED_TOKEN_TYPE_INDEX value (seen above) and generates the ITreeNode classes for each token.

The format of the XML file is very simple. For example, consider tokens for a language called "Foo". The XML file is called tokens.xml or FooTokenType.Tokens.xml or something similar:

<Tokens TokenTypeNamespace="MyCompany.MyProduct.Psi.Foo.Parsing"
        TokenTypeClass="FooTokenType"
        BaseTokenNodeTypeIndex="8000"
        KeywordNodeType="KeywordTokenNodeType"
        KeywordTokenElement="FixedTokenElement"
        TokenNodeType="FixedTokenNodeType"
        TokenTokenElement="FixedTokenElement"
        Dynamic="false">

  <Keyword name="RETURN_KEYWORD" representation="return" />
  <Keyword name="NAMESPACE_KEYWORD" representation="namespace" />
  <!-- ... -->

  <Token name="LPARENTH" representation="(" />
  <Token name="RPARENTH" representation=")" />
  <!-- ... -->
</Tokens>

The attributes to the root Tokens element are as follows:

TokenTypeNamespace - the namespace of the "token type" class that will be generated. Should match the manually written "token type" class.
TokenTypeClass - the name of the "token type" class that will hold private token node type class definitions.
BaseTokenNodeTypeIndex - the initial value used for the index of each token node type. This value does not need to be unique across languages; multiple languages can reuse the same value. However, if there is any chance of token node types being reused across languages (e.g. with languages that extend other languages, such as TypeScript and JavaScript), then care should be taken that these numbers do not clash. If this value isn't specified, the default value is 1000.
KeywordNodeType - the base class to use when generating keyword token node types.
KeywordTokenElement - the base class used when generating the ITreeNode for this node type, which is returned from the token node type's Create method. See the section on creating tree nodes for more details. Typically, this is a manually created class called FixedTokenElement (there is no need for a KeywordTokenElement).
TokenNodeType - the base class to use when generating a fixed token, such as an operator or other punctuation. Typically a manually created class called FixedTokenNodeType.
TokenTokenElement - the base class used when generating the ITreeNode for this node type, which is returned from the token node type's Create method. See the section on creating tree nodes for more details. Typically, this is a manually created class called FixedTokenElement.
Dynamic - defaults to false. If true, the ITreeNode that is created is passed the text of the token, such as an identifier's name. This isn't usually needed, as the tokens (operators, punctuation and keywords) are usually fixed.

The child elements of Tokens are either Token or Keyword. A Token element will generate a class that derives from FixedTokenNodeType, while Keyword will generate a class that derives from KeywordTokenNodeType (or whatever names were specified in the XML file). Both elements take the same attributes:

name - the name of the token or keyword node type. This is typically specified in all-caps, such as ABSTRACT_KEYWORD, and is normalised by converting it to Pascal case, such as AbstractKeyword. The all-caps version is used as the name of the singleton instance.
title - if specified, is used as the identifier passed to the base class, and used to construct the name of the node type and token element classes. If not specified, then the normalised name is used instead.
representation - the value passed to TokenNodeType.TokenRepresentation.
filtered - defaults to false. If true, adds an implementation of TokenNodeType.IsFiltered, which returns true. This is usually only used by whitespace and comments, which tend to have their own hand written token node type implementation. However, it can be used by any insignificant syntax tokens.
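The normalisation of the name attribute described above can be sketched as follows. This is an illustration only, not the TokenGenerator's actual implementation: TokenNameSketch and Normalise are hypothetical names, and the only mapping taken from the text is ABSTRACT_KEYWORD to AbstractKeyword. How the real tool treats other inputs is not covered here.

```csharp
using System;
using System.Linq;

public static class TokenNameSketch
{
  // Converts an all-caps, underscore-separated name into a single identifier
  // in which each word starts with a capital letter.
  public static string Normalise(string name)
  {
    // Split on underscores, ignoring empty segments from doubled underscores.
    var parts = name.Split(new[] { '_' }, StringSplitOptions.RemoveEmptyEntries);

    // Upper-case the first letter of each word and lower-case the rest.
    return string.Concat(parts.Select(p =>
      char.ToUpperInvariant(p[0]) + p.Substring(1).ToLowerInvariant()));
  }
}
```

With this sketch, TokenNameSketch.Normalise("ABSTRACT_KEYWORD") returns "AbstractKeyword", matching the example in the attribute description.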

Using the TokenGenerator

In order to invoke the TokenGenerator tool on the XML file, the XML file's Build Action needs to be specified in Visual Studio's Properties pane. Select the XML file, open the Properties pane, and set the Build Action to TokenGenerator.

The ReSharper SDK sets up the build process to run the TokenGenerator during a compile. However, it requires the output file to be added to the MSBuild file. Open the .csproj and find the line for the tokens.xml file, and change it to something like:

<ItemGroup>
  <TokenGenerator Include="src\...\FooTokenType.xml">
    <OutputFile>src\...\FooTokenType.generated.cs</OutputFile>
  </TokenGenerator>
</ItemGroup>

Here, the path in OutputFile should match the path in the Include attribute, differing only in the file name.

If the OutputFile isn't specified, the tokens aren't generated, and the build will fail. If the token node types are generated correctly, the generated file is automatically added to the list of files to be compiled, and the build will succeed. However, the generated file isn't added to the project, which can lead to unresolved symbol errors in the editor, as ReSharper won't be able to find the class definitions. It is recommended to add the generated file to the project. It can be safely excluded from source control, should you wish, because the file will be rebuilt at the next compile.

Creating composite node types

The same basic pattern is followed for composite node types - create an owning type, usually called ElementType, and create private nested classes that derive from CompositeNodeType. Finally, create a public static field of type CompositeNodeType that exposes the singleton instance.
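A minimal sketch of this pattern is shown below. It does not use the real SDK types: SketchCompositeElement and SketchCompositeNodeType are hypothetical stand-ins for the interior tree node class and CompositeNodeType, the parameterless Create is a simplification, and the BLOCK node type and its index are invented for the example.

```csharp
using System;

// Hypothetical stand-in for an interior (composite) tree node.
public class SketchCompositeElement
{
  public SketchCompositeElement(SketchCompositeNodeType nodeType) { NodeType = nodeType; }
  public SketchCompositeNodeType NodeType { get; }
}

// Hypothetical stand-in for CompositeNodeType.
public abstract class SketchCompositeNodeType
{
  protected SketchCompositeNodeType(string name, int index) { Name = name; Index = index; }
  public string Name { get; }
  public int Index { get; }

  // Creates the interior tree node for this node type.
  public abstract SketchCompositeElement Create();
}

// The owning type that exposes the singleton instances.
public static class ElementType
{
  // Private tree node class for the invented "block" construct.
  private sealed class BlockElement : SketchCompositeElement
  {
    public BlockElement(SketchCompositeNodeType nodeType) : base(nodeType) { }
  }

  // Private nested node type class; callers only ever see the BLOCK field.
  private sealed class BLOCK_INTERNAL : SketchCompositeNodeType
  {
    public BLOCK_INTERNAL() : base("BLOCK", 1) { }
    public override SketchCompositeElement Create() => new BlockElement(this);
  }

  // Public static field exposing the singleton instance.
  public static readonly SketchCompositeNodeType BLOCK = new BLOCK_INTERNAL();
}
```

A caller would write ElementType.BLOCK.Create() to obtain the interior node, without any reference to the private classes.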

There are a couple of minor differences to how token node types are created. Typically, the composite node types are created by the psiGen parser generator SDK tool, which creates a class called simply ElementType, without the language name prefix used by classes such as CSharpTokenType.

Also, the names of the composite node type classes are created from the rules in the .psi grammar, by converting "CamelCase" to "SHOUTING_SNAKE_CASE", and adding _INTERNAL. So a rule such as colorProfileBlock creates a private nested composite node class called COLOR_PROFILE_BLOCK_INTERNAL, which is then exposed as a public static field instance called COLOR_PROFILE_BLOCK.
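The naming rule can be illustrated with a small helper. This is a sketch under assumptions, not the parser generator's own code: RuleNameSketch and its methods are hypothetical, and the handling of digits or consecutive capitals is not taken from the source.

```csharp
using System;
using System.Text;

public static class RuleNameSketch
{
  // Converts a camel-case rule name into upper-case words separated by underscores.
  public static string ToShoutingSnakeCase(string ruleName)
  {
    var sb = new StringBuilder();
    for (var i = 0; i < ruleName.Length; i++)
    {
      var c = ruleName[i];
      // Each upper-case letter (other than the first character) starts a new word.
      if (char.IsUpper(c) && i > 0) sb.Append('_');
      sb.Append(char.ToUpperInvariant(c));
    }
    return sb.ToString();
  }

  // Name of the private nested composite node type class for a rule.
  public static string InternalClassName(string ruleName) =>
    ToShoutingSnakeCase(ruleName) + "_INTERNAL";
}
```

For the rule colorProfileBlock, ToShoutingSnakeCase gives COLOR_PROFILE_BLOCK (the public field name) and InternalClassName gives COLOR_PROFILE_BLOCK_INTERNAL (the private class name).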

The psiGen tool is described in more detail in the section on parsing.

Last modified: 04 July 2023