Lexing HTML-Like Languages
Lexing an HTML-like language can be complicated - it must handle existing HTML syntax, but also provide support for new syntax, which might be new tags, new tag syntax (<%
and %>
), start block delimiters (Razor's @
, which has no closing delimiter) and even nested HTML inside a language construct (Razor again). And it shouldn't have to reimplement HTML.
Compound lexer
ReSharper manages this using HtmlCompoundLexer<TLexer>
and HtmlCompoundIncrementalLexer<TLexer>
. These are implementations of ILexerEx
and IIncrementalLexer
that switches between a lexer for the custom syntax and a lexer for standard HTML. The TLexer
generic parameter is the lexer for the custom syntax, and must implement the IHtmlCompoundLexer
interface.
Creating the compound lexer
The constructor looks like this:
buffer
represents the text of the file to be lexed.baseLexer
is an instance of the lexer used to parse the custom HTML syntax, and is of the generic typeTLexer
.htmlLexerFactory
is a factory class that is used to create the standard HTML lexer.
Typically, this is called from the CreateLexer
method of a HTML-like language's LanguageService
:
A new instance of the lexer for the custom syntax is passed in, as is a new instance of the standard HtmlLexerFactory
. Both classes take a parameter of myTokenNodeTypes
, which a per-language component that derives from IHtmlComponentTokenNodeTypes
:
The RAW_HTML
property is used by the compound lexer to know when to switch between the custom lexer and the standard HTML lexer. See the HTML Token Node Types section for more details on this interface and implementing token node types.
Running the compound lexer
The compound lexer is used just like a normal implementation of ILexer
- Advance()
is called multiple times until the end of file.
The base (custom) lexer drives the lexing of the file - the compound lexer will initially use the tokens of the base lexer. However, if the base lexer encounters a RAW_HTML
token, the compound lexer will switch to the HTML lexer, and use the HTML tokens.
This means:
Initially, advance the base lexer.
Update the compound lexer's own state (
ILexerEx.LexerStateEx
) by combining the states of the base leer, the HTML lexer and a flag to indicate if the HTML lexer is currently running. This state can be saved and restore for incremental lexing.Compare the base lexer's current token to the custom language's
IHtmlCompoundTokenNodeTypes.RAW_HTML
:If different, use the base lexer's current token.
If the same, reset the (incremental) standard HTML lexer over the span of the
RAW_HTML
token and set the flag to indicate the HTML lexer is running.
On subsequent calls to
Advance
, if the flag is set, advance the HTML lexer and use that token, otherwise, advance the base lexer.Once the span of the
RAW_HTML
token has been lexed (by the HTML lexer), reset the HTML running flag and start advancing the base lexer again.
Incremental lexing
The HtmlCompoundIncrementalLexer
class is essentially the same as HtmlCompoundLexer
except that it also implements IIncrementalLexer
, allowing for incremental lexing. More details on incremental lexing can be found in the topics on Implementing a Lexer, Caching Lexers and incremental parsing.
Custom syntax lexer
The base lexer is intended to create tokens for the custom language syntax. It must implement IHtmlCompoundLexer
:
The TokenTypes
property returns the per-language component that implements IHtmlCompoundTokenNodeTypes
, so that the compound lexer can access the RAW_HTML
property, which is the token node type the custom lexer uses to indicate a block of raw HTML. This is a different token node type for each HTML-like custom language.
The NStates
property is the number of states supported by the custom lexer. In other words, this is the maximum value that will be set in ILexerEx.LexerStateEx
for the custom lexer. It is used when combining the base lexer and HTML lexer states for incremental lexing in the compound lexer.
If built with CsLex, this will be the number of states such as YYINITIAL
used in the lexer definition, and can be implemented like:
Tips for implementing the base lexer
The base lexer should create tokens for the custom language as normal, and should return
RAW_HTML
for any chunks of HTML.If the custom language's syntax is valid HTML, let the standard HTML lexer/parser handle it. For example, WebForms'
<asp:Content runat="server" ...
is valid HTML, and is lexed and parsed as HTML.An embedded HTML block should start with valid HTML syntax. The HTML lexer starts with empty state, so the first
RAW_HTML
block must start with valid HTML.The HTML lexer state is preserved across
RAW_HTML
blocks, so it is possible to create "nested" blocks of custom syntax inside HTML. For example:<div class="@MyClass" />This could be implemented as a
RAW_HTML
token that spans from the start of the element up to the@
symbol -<div class="
. The HTML lexer will then return tokens for the opening bracket, the element, the attribute name, the equals, the quote symbol and then relinquish control to the base lexer, which will return tokens for the expression -@MyClass
. It can then returnRAW_HTML
for the closing quotes and the closing element bracket -" />
. Because the HTML lexer state is preserved, HTML lexing will continue as expected, and the HTML tokens are correct.