Implementing a lexer
ReSharper requires a custom language to create a lexer that implements (at least) the ILexer interface. The IBuffer is given to the lexer in the constructor, via ILexerFactory.
Clients of the lexer will follow these steps:
1. Call Start to get the lexer to recognise the first token.
2. Retrieve the current token type from the TokenType property, which will be a singleton instance of a language-specific class that derives from TokenNodeType (see the guide on Token Node Types for more details).
3. Use the TokenStart and TokenEnd properties to retrieve the offsets of the token start and end in the text buffer. This is required because the token type is a singleton instance, and therefore cannot contain details about the location and length of the token itself. The start offset is inclusive, and the end offset is exclusive, just like TextRange (e.g. given "Hello world", the text range (0, 5) returns "Hello").
4. Call the Advance method repeatedly to move to the next token, which will update the TokenType, TokenStart and TokenEnd properties with the information about the current token and location.
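The steps above can be sketched as a small, self-contained program. The types below are simplified stand-ins written for this example, not the SDK definitions: the buffer is a plain string rather than an IBuffer, WordLexer and LexerClient are hypothetical names, and the convention that TokenType is null at the end of the buffer is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins, reduced to the members described in the steps above.
public sealed class TokenNodeType
{
    public string Name { get; }
    public TokenNodeType(string name) { Name = name; }
}

public interface ILexer
{
    void Start();
    void Advance();
    TokenNodeType TokenType { get; } // assumption: null once the buffer is exhausted
    int TokenStart { get; }          // inclusive offset
    int TokenEnd { get; }            // exclusive offset
}

// Toy lexer: splits the text into runs of whitespace and runs of non-whitespace.
public sealed class WordLexer : ILexer
{
    // Token types are shared singletons, so they cannot carry position information.
    public static readonly TokenNodeType Word = new TokenNodeType("WORD");
    public static readonly TokenNodeType Whitespace = new TokenNodeType("WHITESPACE");

    private readonly string myBuffer; // stand-in for IBuffer

    public WordLexer(string buffer) { myBuffer = buffer; }

    public TokenNodeType TokenType { get; private set; }
    public int TokenStart { get; private set; }
    public int TokenEnd { get; private set; }

    public void Start()
    {
        TokenEnd = 0;
        Advance(); // recognise the first token
    }

    public void Advance()
    {
        TokenStart = TokenEnd;
        if (TokenStart >= myBuffer.Length) { TokenType = null; return; }

        bool isSpace = char.IsWhiteSpace(myBuffer[TokenStart]);
        int end = TokenStart;
        while (end < myBuffer.Length && char.IsWhiteSpace(myBuffer[end]) == isSpace) end++;

        TokenEnd = end;
        TokenType = isSpace ? Whitespace : Word;
    }
}

public static class LexerClient
{
    // Follows the four steps: Start, read TokenType, read offsets, Advance.
    public static List<string> Dump(string text)
    {
        var result = new List<string>();
        ILexer lexer = new WordLexer(text);
        lexer.Start();
        while (lexer.TokenType != null)
        {
            string tokenText = text.Substring(lexer.TokenStart, lexer.TokenEnd - lexer.TokenStart);
            result.Add(lexer.TokenType.Name + ":" + tokenText);
            lexer.Advance();
        }
        return result;
    }

    public static void Main()
    {
        foreach (string line in Dump("Hello world")) Console.WriteLine(line);
    }
}
```

Because the end offset is exclusive, the length of a token is simply TokenEnd minus TokenStart, which is what the Substring call relies on.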
The CurrentPosition property is a lexer-specific object that encapsulates the information required by the lexer to save and restore the current location. The LexerStateCookie class can be used by parsers to make it easy to roll back to a specific state in the lexer. It implements the IDisposable interface, so it can be used in a using statement.
This can be used to implement lookahead, retrieving a number of tokens ahead, then rolling back to the current position (see Lexer Utility Methods for more details).
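The save-and-restore pattern can be illustrated with a minimal sketch. CharLexer and RollbackCookie are hypothetical types written for this example; RollbackCookie only mimics the idea described for LexerStateCookie (capture the position on creation, restore it on Dispose) and is not the SDK class or its API.

```csharp
using System;

// Hypothetical toy lexer: every character is one token, and the saved
// position is simply the current offset, boxed into object.
public sealed class CharLexer
{
    private readonly string myText;
    private int myOffset;

    public CharLexer(string text) { myText = text; }

    public void Start() { myOffset = 0; }
    public void Advance() { if (myOffset < myText.Length) myOffset++; }

    // '\0' stands in for "no more tokens".
    public char Token { get { return myOffset < myText.Length ? myText[myOffset] : '\0'; } }

    public object CurrentPosition
    {
        get { return myOffset; }
        set { myOffset = (int)value; }
    }
}

// Stand-in for the cookie idea: remember the position now, restore it on Dispose.
public sealed class RollbackCookie : IDisposable
{
    private readonly CharLexer myLexer;
    private readonly object mySavedPosition;

    public RollbackCookie(CharLexer lexer)
    {
        myLexer = lexer;
        mySavedPosition = lexer.CurrentPosition;
    }

    public void Dispose() { myLexer.CurrentPosition = mySavedPosition; }
}

public static class Lookahead
{
    // Returns the token 'count' positions ahead, leaving the lexer where it was.
    public static char Peek(CharLexer lexer, int count)
    {
        using (new RollbackCookie(lexer))
        {
            for (int i = 0; i < count; i++) lexer.Advance();
            return lexer.Token;
        } // Dispose runs here and rolls the lexer back
    }

    public static void Main()
    {
        var lexer = new CharLexer("abc");
        lexer.Start();
        Console.WriteLine(Peek(lexer, 2)); // token two positions ahead
        Console.WriteLine(lexer.Token);    // lexer is back on the first token
    }
}
```

Tying the rollback to Dispose means the position is restored even if the lookahead code returns early or throws.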
Strongly typed lexers
The ILexer interface exposes CurrentPosition as an object, to allow lexers maximum flexibility for storing state about the current position: the lexer can return any object it wishes. However, if the lexer wishes to return a value type, this adds boxing allocations, so the lexer can also implement ILexer<TState>.
ILexer<TState> shadows the CurrentPosition property so that it is of type TState instead of object. This allows a value type to be returned without boxing allocations. For example, if the lexer only requires an integer position (such as a caching lexer), it can implement ILexer<int> and avoid boxing the int into object.
Similarly, the lexer can implement its state object as a struct and return it as a strongly typed item. The struct is copied by value to the caller of CurrentPosition, and no boxing allocations take place.
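A minimal sketch of the shadowing arrangement follows. The two interfaces are cut down to the CurrentPosition property only, and ToyLexer, ToyLexerState and MoveTo are hypothetical names introduced for illustration; the real interfaces have more members.

```csharp
using System;

// Simplified stand-ins: only the property under discussion is shown.
public interface ILexer
{
    object CurrentPosition { get; set; }
}

public interface ILexer<TState> : ILexer
{
    // 'new' hides the object-typed property with a strongly typed one.
    new TState CurrentPosition { get; set; }
}

// Hypothetical state struct: whatever the lexer needs to resume from a position.
public struct ToyLexerState
{
    public int Offset;
    public uint Mode;
}

public sealed class ToyLexer : ILexer<ToyLexerState>
{
    private ToyLexerState myState;

    // Hypothetical helper that stands in for real lexing work.
    public void MoveTo(int offset, uint mode)
    {
        myState.Offset = offset;
        myState.Mode = mode;
    }

    // Strongly typed access: the struct is copied by value, with no boxing.
    public ToyLexerState CurrentPosition
    {
        get { return myState; }
        set { myState = value; }
    }

    // Untyped access through ILexer still works, at the cost of boxing the struct.
    object ILexer.CurrentPosition
    {
        get { return myState; }
        set { myState = (ToyLexerState)value; }
    }
}

public static class StronglyTypedDemo
{
    public static void Main()
    {
        var lexer = new ToyLexer();
        lexer.MoveTo(5, 1);
        ToyLexerState saved = lexer.CurrentPosition; // value copy, no allocation
        lexer.MoveTo(42, 0);
        lexer.CurrentPosition = saved;               // roll back
        Console.WriteLine(lexer.CurrentPosition.Offset);
    }
}
```

Callers that hold the lexer as ILexer<ToyLexerState> or as the concrete type get the struct directly; only callers that hold it as plain ILexer pay for boxing.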
Incremental lexers
ReSharper includes infrastructure for incremental lexing, that is, only lexing the parts of a file that change, and reusing existing tokens for the rest of the file. Most of the work is handled by a caching lexer, and is covered in more detail in the section on incremental parsing.
The custom language lexer can implement the ILexerEx and IIncrementalLexer interfaces.
These interfaces expose the lexer state as a uint value. If the lexer is built with CsLex, this state can be the yy_lexical_state value, which is used to decide when specific regular expression rules are applied. Alternatively, it can be used as a lookup into other (static) values, or used to encode more state information into the bits of the uint (the C# lexer uses this strategy to encode a stack of items).
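To make the bit-encoding idea concrete, here is one possible way to pack a small stack into a uint. This is an illustrative scheme only and is not taken from the C# lexer: PackedStack, the 4-bit item width and the resulting limit of eight items with values 1 to 15 are all assumptions of this sketch.

```csharp
using System;

// Hypothetical helper: a stack of up to eight items, each stored in 4 bits of a uint.
// Item value 0 is reserved to mean "empty slot", so valid items are 1..15.
public static class PackedStack
{
    private const int BitsPerItem = 4;
    private const uint ItemMask = 0xF;

    public static uint Push(uint stack, uint item)
    {
        if (item == 0 || item > ItemMask) throw new ArgumentOutOfRangeException("item");
        if ((stack >> (32 - BitsPerItem)) != 0) throw new InvalidOperationException("Stack is full");
        return (stack << BitsPerItem) | item; // new item occupies the lowest 4 bits
    }

    public static uint Peek(uint stack) { return stack & ItemMask; } // 0 if empty

    public static uint Pop(uint stack) { return stack >> BitsPerItem; }

    public static void Main()
    {
        uint state = 0;                // empty stack
        state = Push(state, 3);
        state = Push(state, 7);
        Console.WriteLine(Peek(state)); // top of the stack
        state = Pop(state);
        Console.WriteLine(Peek(state)); // item beneath it
    }
}
```

Because the whole stack fits in one uint, it can be stored alongside each cached token and handed back to the lexer later without any allocation.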
The IIncrementalLexer interface has a Start method, which allows the lexer to start from an arbitrary point in the text buffer, without having to parse the preceding part of the file first. It takes a start and end offset, and also the uint state value returned from ILexerEx.LexerStateEx. These values will have been cached from a previous scan of the text buffer. The TokenBuffer and CachingLexer classes implement this.
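Restarting from a cached offset and state can be sketched as follows. The interfaces are reduced to the members named above, with the Start parameter order (start offset, end offset, state) taken from the description; QuoteLexer and its token names are hypothetical. In this sketch LexerStateEx reports the state to resume with at TokenEnd. Whether the real infrastructure records state at the start or the end of a token is not stated here, so treat that as an assumption.

```csharp
using System;

// Simplified stand-ins containing only the members discussed above.
public interface ILexerEx
{
    uint LexerStateEx { get; }
}

public interface IIncrementalLexer : ILexerEx
{
    void Start(int startOffset, int endOffset, uint state);
}

// Toy lexer: a double quote toggles between "normal" and "in string" states.
public sealed class QuoteLexer : IIncrementalLexer
{
    public const uint Normal = 0, InString = 1;

    private readonly string myText;
    private int myEnd;
    private uint myState;

    public QuoteLexer(string text) { myText = text; }

    public string TokenType { get; private set; } // null at the end of the range
    public int TokenStart { get; private set; }
    public int TokenEnd { get; private set; }
    public uint LexerStateEx { get { return myState; } }

    // Full scan from the beginning of the buffer.
    public void Start() { Start(0, myText.Length, Normal); }

    // Resume anywhere, given a state cached from an earlier scan.
    public void Start(int startOffset, int endOffset, uint state)
    {
        TokenEnd = startOffset;
        myEnd = endOffset;
        myState = state;
        Advance();
    }

    public void Advance()
    {
        TokenStart = TokenEnd;
        if (TokenStart >= myEnd) { TokenType = null; return; }

        if (myText[TokenStart] == '"')
        {
            TokenEnd = TokenStart + 1;
            TokenType = "QUOTE";
            myState = myState == Normal ? InString : Normal;
            return;
        }

        int end = TokenStart;
        while (end < myEnd && myText[end] != '"') end++;
        TokenEnd = end;
        TokenType = myState == InString ? "STRING_TEXT" : "TEXT";
    }
}

public static class IncrementalDemo
{
    public static void Main()
    {
        const string text = "a \"b c\" d";

        var first = new QuoteLexer(text);
        first.Start();   // TEXT "a "
        first.Advance(); // QUOTE, which switches the state to InString
        int cachedOffset = first.TokenEnd;
        uint cachedState = first.LexerStateEx;

        // Later: resume at the cached offset without re-lexing what came before.
        var resumed = new QuoteLexer(text);
        resumed.Start(cachedOffset, text.Length, cachedState);
        Console.WriteLine(resumed.TokenType);
    }
}
```

Without the cached state, a lexer restarted at the same offset would have no way of knowing it is inside a string and would classify the text differently.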
More details are in the section on incremental parsing.