v0.6.1

Integrate heuristics into lexer selection
2025-07-01 12:27:08 -03:00 · 2024-08-24 21:45:57 -03:00 · 2024-08-24 21:39:39 -03:00 · 2024-08-24 20:53:14 -03:00 · 2024-08-24 20:09:29 -03:00 · 2024-08-24 19:59:05 -03:00
35 changed files with 17322 additions and 611 deletions
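
The header comments of `lexers/heuristics.yml` in this diff describe how disambiguation rules are evaluated: rules for a matching extension are tried in order, a rule with no pattern always matches, a list of patterns is merged as a union, and an `and` block requires all of its sub-rules to match. A minimal Python sketch of that evaluation follows. It is an illustration only, not the Crystal code in this change: `rule_matches` and `disambiguate` are hypothetical names, Python's `re` stands in for the Ruby-compatible engine the file expects, and treating every key present on a rule as a condition that must hold is an assumption.

```python
import re

# Assumption: Ruby's ^ and $ match at line boundaries, so MULTILINE
# approximates that behaviour with Python's `re`.
FLAGS = re.MULTILINE

# A few rules copied from lexers/heuristics.yml as shown in this diff.
DISAMBIGUATIONS = [
    {
        "extensions": [".asy"],
        "rules": [
            {"language": "LTspice Symbol", "pattern": r"^SymbolType[ \t]"},
            {"language": "Asymptote"},  # no pattern: always matches
        ],
    },
    {
        "extensions": [".gs"],
        "rules": [
            {"language": "GLSL", "pattern": r"^#version\s+[0-9]+\b"},
            {"language": "Gosu", "pattern": r"^uses (java|gw)\."},
            {"language": "Genie", "pattern": r"^\[indent=[0-9]+\]"},
        ],
    },
]

# The named_patterns section is not reproduced here, so this stays empty.
NAMED_PATTERNS = {}


def _compile(pattern):
    """Compile one pattern string, or a list of strings merged as a union."""
    if isinstance(pattern, list):
        pattern = "|".join(f"(?:{p})" for p in pattern)
    return re.compile(pattern, FLAGS)


def rule_matches(rule, content, named=NAMED_PATTERNS):
    """Return True when every condition present on the rule holds."""
    checks = []
    if "and" in rule:
        checks.append(all(rule_matches(sub, content, named) for sub in rule["and"]))
    if "named_pattern" in rule:
        checks.append(bool(_compile(named[rule["named_pattern"]]).search(content)))
    if "pattern" in rule:
        checks.append(bool(_compile(rule["pattern"]).search(content)))
    if "negative_pattern" in rule:
        checks.append(not _compile(rule["negative_pattern"]).search(content))
    # all([]) is True, so a rule without any pattern always matches.
    return all(checks)


def disambiguate(extension, content, table=DISAMBIGUATIONS, named=NAMED_PATTERNS):
    """Return the language of the first matching rule, or None."""
    for block in table:
        if extension in block["extensions"]:
            for rule in block["rules"]:
                if rule_matches(rule, content, named):
                    return rule["language"]
            return None  # extension known, but no rule matched
    return None  # no block covers this extension
```

With the sample rules above, `disambiguate(".asy", "import graph;\n")` falls through to the pattern-less `Asymptote` rule, while an extension that no block covers yields `None`.
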
--- a/.ameba.yml
+++ b/.ameba.yml
@@ -1,5 +1,5 @@
 # This configuration file was generated by `ameba --gen-config`
-# on 2024-08-04 23:09:09 UTC using Ameba version 1.6.1.
+# on 2024-08-12 22:00:49 UTC using Ameba version 1.6.1.
 # The point is for the user to remove these configuration records
 # one by one as the reported problems are removed from the code base.

@@ -9,7 +9,7 @@ Documentation/DocumentationAdmonition:
  Description: Reports documentation admonitions
  Timezone: UTC
  Excluded:
-  - src/tartrazine.cr
+  - src/lexer.cr
  - src/actions.cr
  Admonitions:
  - TODO
@@ -17,3 +17,105 @@ Documentation/DocumentationAdmonition:
  - BUG
  Enabled: true
  Severity: Warning
+
+# Problems found: 22
+# Run `ameba --only Lint/MissingBlockArgument` for details
+Lint/MissingBlockArgument:
+  Description: Disallows yielding method definitions without block argument
+  Excluded:
+  - pygments/tests/examplefiles/cr/test.cr
+  Enabled: true
+  Severity: Warning
+
+# Problems found: 1
+# Run `ameba --only Lint/NotNil` for details
+Lint/NotNil:
+  Description: Identifies usage of `not_nil!` calls
+  Excluded:
+  - pygments/tests/examplefiles/cr/test.cr
+  Enabled: true
+  Severity: Warning
+
+# Problems found: 34
+# Run `ameba --only Lint/ShadowingOuterLocalVar` for details
+Lint/ShadowingOuterLocalVar:
+  Description: Disallows the usage of the same name as outer local variables for block
+    or proc arguments
+  Excluded:
+  - pygments/tests/examplefiles/cr/test.cr
+  Enabled: true
+  Severity: Warning
+
+# Problems found: 1
+# Run `ameba --only Lint/UnreachableCode` for details
+Lint/UnreachableCode:
+  Description: Reports unreachable code
+  Excluded:
+  - pygments/tests/examplefiles/cr/test.cr
+  Enabled: true
+  Severity: Warning
+
+# Problems found: 6
+# Run `ameba --only Lint/UselessAssign` for details
+Lint/UselessAssign:
+  Description: Disallows useless variable assignments
+  ExcludeTypeDeclarations: false
+  Excluded:
+  - pygments/tests/examplefiles/cr/test.cr
+  Enabled: true
+  Severity: Warning
+
+# Problems found: 3
+# Run `ameba --only Naming/BlockParameterName` for details
+Naming/BlockParameterName:
+  Description: Disallows non-descriptive block parameter names
+  MinNameLength: 3
+  AllowNamesEndingInNumbers: true
+  Excluded:
+  - pygments/tests/examplefiles/cr/test.cr
+  AllowedNames:
+  - _
+  - e
+  - i
+  - j
+  - k
+  - v
+  - x
+  - y
+  - ex
+  - io
+  - ws
+  - op
+  - tx
+  - id
+  - ip
+  - k1
+  - k2
+  - v1
+  - v2
+  ForbiddenNames: []
+  Enabled: true
+  Severity: Convention
+
+# Problems found: 1
+# Run `ameba --only Naming/RescuedExceptionsVariableName` for details
+Naming/RescuedExceptionsVariableName:
+  Description: Makes sure that rescued exceptions variables are named as expected
+  Excluded:
+  - pygments/tests/examplefiles/cr/test.cr
+  AllowedNames:
+  - e
+  - ex
+  - exception
+  - error
+  Enabled: true
+  Severity: Convention
+
+# Problems found: 6
+# Run `ameba --only Naming/TypeNames` for details
+Naming/TypeNames:
+  Description: Enforces type names in camelcase manner
+  Excluded:
+  - pygments/tests/examplefiles/cr/test.cr
+  Enabled: true
+  Severity: Convention
--- a/.gitignore
+++ b/.gitignore
@@ -7,3 +7,5 @@ chroma/
 pygments/
 shard.lock
 .vscode/
+.crystal/
+venv/
--- a/Dockerfile.static
+++ b/Dockerfile.static
@@ -0,0 +1,15 @@
+FROM --platform=${TARGETPLATFORM:-linux/amd64} alpine:3.20 AS build
+RUN apk add --no-cache \
+    crystal \
+    shards \
+    yaml-dev \
+    yaml-static \
+    openssl-dev \
+    openssl-libs-static \
+    libxml2-dev \
+    libxml2-static \
+    zlib-dev \
+    zlib-static \
+    xz-dev \
+    xz-static \
+    make
--- a/2
+++ b/2
@@ -1,5 +1,5 @@
 build: $(wildcard src/**/*.cr) $(wildcard lexers/*xml) $(wildcard styles/*xml) shard.yml
-	shards build -Dstrict_multi_assign -Dno_number_autocast
+	shards build -Dstrict_multi_assign -Dno_number_autocast -d --error-trace
 release: $(wildcard src/**/*.cr) $(wildcard lexers/*xml) $(wildcard styles/*xml) shard.yml
 	shards build --release
 static: $(wildcard src/**/*.cr) $(wildcard lexers/*xml) $(wildcard styles/*xml) shard.yml
--- a/README.md
+++ b/README.md
@@ -4,17 +4,17 @@ Tartrazine is a library to syntax-highlight code. It is
 a port of [Pygments](https://pygments.org/) to
 [Crystal](https://crystal-lang.org/). Kind of.

-It's not currently usable because it's not finished, but:
-
-* The lexers work for the implemented languages
-* The provided styles work
-* There is a very very simple HTML formatter
+The CLI tool can be used to highlight many things in many styles.

 # A port of what? Why "kind of"?

-Because I did not read the Pygments code. And this is actually
-based on [Chroma](https://github.com/alecthomas/chroma) ...
-although I did not read that code either.
+Pygments is a staple of the Python ecosystem, and it's great.
+It lets you highlight code in many languages, and it has many
+themes. Chroma is "Pygments for Go", it's actually a port of
+Pygments to Go, and it's great too.
+
+I wanted that in Crystal, so I started this project. But I did
+not read much of the Pygments code. Or much of Chroma's.

 Chroma has taken most of the Pygments lexers and turned them into
 XML descriptions. What I did was take those XML files from Chroma
@@ -29,18 +29,40 @@ This only covers the RegexLexers, which are the most common ones,
 but it means the supported languages are a subset of Chroma's, which
 is a subset of Pygments'.

-Currently Tartrazine supports ... 241 languages.
+Currently Tartrazine supports ... 247 languages.

-It has 332 themes (64 from Chroma, the rest are base16 themes via
+It has 331 themes (63 from Chroma, the rest are base16 themes via
 [Sixteen](https://github.com/ralsina/sixteen)

 ## Installation

-This will have a CLI tool that can be installed, but it's not
-there yet.
+From prebuilt binaries:

+Each release provides statically-linked binaries that should
+work on any Linux. Get them from the [releases page](https://github.com/ralsina/tartrazine/releases) and put them in your PATH.

-## Usage
+To build from source:
+
+1. Clone this repo
+2. Run `make` to build the `tartrazine` binary
+3. Copy the binary somewhere in your PATH.
+
+## Usage as a CLI tool
+
+Show a syntax highlighted version of a C source file in your terminal:
+
+```shell
+$ tartrazine whatever.c -l c -t catppuccin-macchiato --line-numbers -f terminal
+```
+
+Generate a standalone HTML file from a C source file with the syntax highlighted:
+
+```shell
+$ tartrazine whatever.c -l c -t catppuccin-macchiato --line-numbers \
+  --standalone -f html -o whatever.html 
+```
+
+## Usage as a Library

 This works:

@@ -49,7 +71,9 @@ require "tartrazine"

 lexer = Tartrazine.lexer("crystal")
 theme = Tartrazine.theme("catppuccin-macchiato")
-puts Tartrazine::Html.new.format(File.read(ARGV[0]), lexer, theme)
+formatter = Tartrazine::Html.new
+formatter.theme = theme
+puts formatter.format(File.read(ARGV[0]), lexer)
 ```

 ## Contributing
--- a/TODO.md
+++ b/TODO.md
@@ -2,6 +2,14 @@

 ## TODO

-* Implement styles
-* Implement formatters
-* Implement lexer loader that respects aliases, etc
+* ✅ Implement styles
+* ✅ Implement formatters
+* ✅ Implement CLI
+* ✅ Implement lexer loader that respects aliases
+* ✅ Implement lexer loader by file extension
+* ✅ Add --line-numbers to terminal formatter
+* Implement lexer loader by mime type
+* ✅ Implement Delegating lexers
+* ✅ Add RstLexer
+* Add Mako template lexer
+* Implement heuristic lexer detection
--- a/lexers/LICENSE-heuristics
+++ b/lexers/LICENSE-heuristics
@@ -0,0 +1,22 @@
+Copyright (c) 2017 GitHub, Inc.
+
+Permission is hereby granted, free of charge, to any person
+obtaining a copy of this software and associated documentation
+files (the "Software"), to deal in the Software without
+restriction, including without limitation the rights to use,
+copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the
+Software is furnished to do so, subject to the following
+conditions:
+
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+OTHER DEALINGS IN THE SOFTWARE.
--- a/lexers/LiquidLexer.xml
+++ b/lexers/LiquidLexer.xml
@@ -0,0 +1,130 @@
+
+<lexer>
+  <config>
+    <name>liquid</name>
+    <alias>liquid</alias>
+    <filename>*.liquid</filename>
+  </config>
+  <rules>
+    <state name="root">
+      <rule pattern="[^{]+"><token type="Text"/></rule>
+      <rule pattern="(\{%)(\s*)"><bygroups><token type="Punctuation"/><token type="TextWhitespace"/></bygroups><push state="tag-or-block"/></rule>
+      <rule pattern="(\{\{)(\s*)([^\s}]+)"><bygroups><token type="Punctuation"/><token type="TextWhitespace"/><usingself state="generic"/></bygroups><push state="output"/></rule>
+      <rule pattern="\{"><token type="Text"/></rule>
+    </state>
+    <state name="tag-or-block">
+      <rule pattern="(if|unless|elsif|case)(?=\s+)"><token type="KeywordReserved"/><push state="condition"/></rule>
+      <rule pattern="(when)(\s+)"><bygroups><token type="KeywordReserved"/><token type="TextWhitespace"/></bygroups><combined state="end-of-block" state="whitespace" state="generic"/></rule>
+      <rule pattern="(else)(\s*)(%\})"><bygroups><token type="KeywordReserved"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><pop depth="1"/></rule>
+      <rule pattern="(capture)(\s+)([^\s%]+)(\s*)(%\})"><bygroups><token type="NameTag"/><token type="TextWhitespace"/><usingself state="variable"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><pop depth="1"/></rule>
+      <rule pattern="(comment)(\s*)(%\})"><bygroups><token type="NameTag"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><push state="comment"/></rule>
+      <rule pattern="(raw)(\s*)(%\})"><bygroups><token type="NameTag"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><push state="raw"/></rule>
+      <rule pattern="(end(case|unless|if))(\s*)(%\})"><bygroups><token type="KeywordReserved"/>None<token type="TextWhitespace"/><token type="Punctuation"/></bygroups><pop depth="1"/></rule>
+      <rule pattern="(end([^\s%]+))(\s*)(%\})"><bygroups><token type="NameTag"/>None<token type="TextWhitespace"/><token type="Punctuation"/></bygroups><pop depth="1"/></rule>
+      <rule pattern="(cycle)(\s+)(?:([^\s:]*)(:))?(\s*)"><bygroups><token type="NameTag"/><token type="TextWhitespace"/><usingself state="generic"/><token type="Punctuation"/><token type="TextWhitespace"/></bygroups><push state="variable-tag-markup"/></rule>
+      <rule pattern="([^\s%]+)(\s*)"><bygroups><token type="NameTag"/><token type="TextWhitespace"/></bygroups><push state="tag-markup"/></rule>
+    </state>
+    <state name="output">
+      <rule><include state="whitespace"/></rule>
+      <rule pattern="\}\}"><token type="Punctuation"/><pop depth="1"/></rule>
+      <rule pattern="\|"><token type="Punctuation"/><push state="filters"/></rule>
+    </state>
+    <state name="filters">
+      <rule><include state="whitespace"/></rule>
+      <rule pattern="\}\}"><token type="Punctuation"/><push state="#pop" state="#pop"/></rule>
+      <rule pattern="([^\s|:]+)(:?)(\s*)"><bygroups><token type="NameFunction"/><token type="Punctuation"/><token type="TextWhitespace"/></bygroups><push state="filter-markup"/></rule>
+    </state>
+    <state name="filter-markup">
+      <rule pattern="\|"><token type="Punctuation"/><pop depth="1"/></rule>
+      <rule><include state="end-of-tag"/></rule>
+      <rule><include state="default-param-markup"/></rule>
+    </state>
+    <state name="condition">
+      <rule><include state="end-of-block"/></rule>
+      <rule><include state="whitespace"/></rule>
+      <rule pattern="([^\s=!&gt;&lt;]+)(\s*)([=!&gt;&lt;]=?)(\s*)(\S+)(\s*)(%\})"><bygroups><usingself state="generic"/><token type="TextWhitespace"/><token type="Operator"/><token type="TextWhitespace"/><usingself state="generic"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups></rule>
+      <rule pattern="\b!"><token type="Operator"/></rule>
+      <rule pattern="\bnot\b"><token type="OperatorWord"/></rule>
+      <rule pattern="([\w.\&#x27;&quot;]+)(\s+)(contains)(\s+)([\w.\&#x27;&quot;]+)"><bygroups><usingself state="generic"/><token type="TextWhitespace"/><token type="OperatorWord"/><token type="TextWhitespace"/><usingself state="generic"/></bygroups></rule>
+      <rule><include state="generic"/></rule>
+      <rule><include state="whitespace"/></rule>
+    </state>
+    <state name="generic-value">
+      <rule><include state="generic"/></rule>
+      <rule><include state="end-at-whitespace"/></rule>
+    </state>
+    <state name="operator">
+      <rule pattern="(\s*)((=|!|&gt;|&lt;)=?)(\s*)"><bygroups><token type="TextWhitespace"/><token type="Operator"/>None<token type="TextWhitespace"/></bygroups><pop depth="1"/></rule>
+      <rule pattern="(\s*)(\bcontains\b)(\s*)"><bygroups><token type="TextWhitespace"/><token type="OperatorWord"/><token type="TextWhitespace"/></bygroups><pop depth="1"/></rule>
+    </state>
+    <state name="end-of-tag">
+      <rule pattern="\}\}"><token type="Punctuation"/><pop depth="1"/></rule>
+    </state>
+    <state name="end-of-block">
+      <rule pattern="%\}"><token type="Punctuation"/><push state="#pop" state="#pop"/></rule>
+    </state>
+    <state name="end-at-whitespace">
+      <rule pattern="\s+"><token type="TextWhitespace"/><pop depth="1"/></rule>
+    </state>
+    <state name="param-markup">
+      <rule><include state="whitespace"/></rule>
+      <rule pattern="([^\s=:]+)(\s*)(=|:)"><bygroups><token type="NameAttribute"/><token type="TextWhitespace"/><token type="Operator"/></bygroups></rule>
+      <rule pattern="(\{\{)(\s*)([^\s}])(\s*)(\}\})"><bygroups><token type="Punctuation"/><token type="TextWhitespace"/><usingself state="variable"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups></rule>
+      <rule><include state="string"/></rule>
+      <rule><include state="number"/></rule>
+      <rule><include state="keyword"/></rule>
+      <rule pattern=","><token type="Punctuation"/></rule>
+    </state>
+    <state name="default-param-markup">
+      <rule><include state="param-markup"/></rule>
+      <rule pattern="."><token type="Text"/></rule>
+    </state>
+    <state name="variable-param-markup">
+      <rule><include state="param-markup"/></rule>
+      <rule><include state="variable"/></rule>
+      <rule pattern="."><token type="Text"/></rule>
+    </state>
+    <state name="tag-markup">
+      <rule pattern="%\}"><token type="Punctuation"/><push state="#pop" state="#pop"/></rule>
+      <rule><include state="default-param-markup"/></rule>
+    </state>
+    <state name="variable-tag-markup">
+      <rule pattern="%\}"><token type="Punctuation"/><push state="#pop" state="#pop"/></rule>
+      <rule><include state="variable-param-markup"/></rule>
+    </state>
+    <state name="keyword">
+      <rule pattern="\b(false|true)\b"><token type="KeywordConstant"/></rule>
+    </state>
+    <state name="variable">
+      <rule pattern="[a-zA-Z_]\w*"><token type="NameVariable"/></rule>
+      <rule pattern="(?&lt;=\w)\.(?=\w)"><token type="Punctuation"/></rule>
+    </state>
+    <state name="string">
+      <rule pattern="&#x27;[^&#x27;]*&#x27;"><token type="LiteralStringSingle"/></rule>
+      <rule pattern="&quot;[^&quot;]*&quot;"><token type="LiteralStringDouble"/></rule>
+    </state>
+    <state name="number">
+      <rule pattern="\d+\.\d+"><token type="LiteralNumberFloat"/></rule>
+      <rule pattern="\d+"><token type="LiteralNumberInteger"/></rule>
+    </state>
+    <state name="generic">
+      <rule><include state="keyword"/></rule>
+      <rule><include state="string"/></rule>
+      <rule><include state="number"/></rule>
+      <rule><include state="variable"/></rule>
+    </state>
+    <state name="whitespace">
+      <rule pattern="[ \t]+"><token type="TextWhitespace"/></rule>
+    </state>
+    <state name="comment">
+      <rule pattern="(\{%)(\s*)(endcomment)(\s*)(%\})"><bygroups><token type="Punctuation"/><token type="TextWhitespace"/><token type="NameTag"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><push state="#pop" state="#pop"/></rule>
+      <rule pattern="."><token type="Comment"/></rule>
+    </state>
+    <state name="raw">
+      <rule pattern="[^{]+"><token type="Text"/></rule>
+      <rule pattern="(\{%)(\s*)(endraw)(\s*)(%\})"><bygroups><token type="Punctuation"/><token type="TextWhitespace"/><token type="NameTag"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><pop depth="1"/></rule>
+      <rule pattern="\{"><token type="Text"/></rule>
+    </state>
+  </rules>
+</lexer>
+
--- a/lexers/VelocityLexer.xml
+++ b/lexers/VelocityLexer.xml
@@ -0,0 +1,55 @@
+
+<lexer>
+  <config>
+    <name>Velocity</name>
+    <alias>velocity</alias>
+    <filename>*.vm</filename>
+    <filename>*.fhtml</filename>
+    <dot_all>true</dot_all>
+  </config>
+  <rules>
+    <state name="root">
+      <rule pattern="[^{#$]+"><token type="Other"/></rule>
+      <rule pattern="(#)(\*.*?\*)(#)"><bygroups><token type="CommentPreproc"/><token type="Comment"/><token type="CommentPreproc"/></bygroups></rule>
+      <rule pattern="(##)(.*?$)"><bygroups><token type="CommentPreproc"/><token type="Comment"/></bygroups></rule>
+      <rule pattern="(#\{?)([a-zA-Z_]\w*)(\}?)(\s?\()"><bygroups><token type="CommentPreproc"/><token type="NameFunction"/><token type="CommentPreproc"/><token type="Punctuation"/></bygroups><push state="directiveparams"/></rule>
+      <rule pattern="(#\{?)([a-zA-Z_]\w*)(\}|\b)"><bygroups><token type="CommentPreproc"/><token type="NameFunction"/><token type="CommentPreproc"/></bygroups></rule>
+      <rule pattern="\$!?\{?"><token type="Punctuation"/><push state="variable"/></rule>
+    </state>
+    <state name="variable">
+      <rule pattern="[a-zA-Z_]\w*"><token type="NameVariable"/></rule>
+      <rule pattern="\("><token type="Punctuation"/><push state="funcparams"/></rule>
+      <rule pattern="(\.)([a-zA-Z_]\w*)"><bygroups><token type="Punctuation"/><token type="NameVariable"/></bygroups><push/></rule>
+      <rule pattern="\}"><token type="Punctuation"/><pop depth="1"/></rule>
+      <rule><pop depth="1"/></rule>
+    </state>
+    <state name="directiveparams">
+      <rule pattern="(&amp;&amp;|\|\||==?|!=?|[-&lt;&gt;+*%&amp;|^/])|\b(eq|ne|gt|lt|ge|le|not|in)\b"><token type="Operator"/></rule>
+      <rule pattern="\["><token type="Operator"/><push state="rangeoperator"/></rule>
+      <rule pattern="\b[a-zA-Z_]\w*\b"><token type="NameFunction"/></rule>
+      <rule><include state="funcparams"/></rule>
+    </state>
+    <state name="rangeoperator">
+      <rule pattern="\.\."><token type="Operator"/></rule>
+      <rule><include state="funcparams"/></rule>
+      <rule pattern="\]"><token type="Operator"/><pop depth="1"/></rule>
+    </state>
+    <state name="funcparams">
+      <rule pattern="\$!?\{?"><token type="Punctuation"/><push state="variable"/></rule>
+      <rule pattern="\s+"><token type="Text"/></rule>
+      <rule pattern="[,:]"><token type="Punctuation"/></rule>
+      <rule pattern="&quot;(\\\\|\\[^\\]|[^&quot;\\])*&quot;"><token type="LiteralStringDouble"/></rule>
+      <rule pattern="&#x27;(\\\\|\\[^\\]|[^&#x27;\\])*&#x27;"><token type="LiteralStringSingle"/></rule>
+      <rule pattern="0[xX][0-9a-fA-F]+[Ll]?"><token type="LiteralNumber"/></rule>
+      <rule pattern="\b[0-9]+\b"><token type="LiteralNumber"/></rule>
+      <rule pattern="(true|false|null)\b"><token type="KeywordConstant"/></rule>
+      <rule pattern="\("><token type="Punctuation"/><push/></rule>
+      <rule pattern="\)"><token type="Punctuation"/><pop depth="1"/></rule>
+      <rule pattern="\{"><token type="Punctuation"/><push/></rule>
+      <rule pattern="\}"><token type="Punctuation"/><pop depth="1"/></rule>
+      <rule pattern="\["><token type="Punctuation"/><push/></rule>
+      <rule pattern="\]"><token type="Punctuation"/><pop depth="1"/></rule>
+    </state>
+  </rules>
+</lexer>
+
--- a/lexers/bbcode.xml
+++ b/lexers/bbcode.xml
@@ -0,0 +1,22 @@
+
+<lexer>
+  <config>
+    <name>BBCode</name>
+    <alias>bbcode</alias>
+    <mime_type>text/x-bbcode</mime_type>
+  </config>
+  <rules>
+    <state name="root">
+      <rule pattern="[^[]+"><token type="Text"/></rule>
+      <rule pattern="\[/?\w+"><token type="Keyword"/><push state="tag"/></rule>
+      <rule pattern="\["><token type="Text"/></rule>
+    </state>
+    <state name="tag">
+      <rule pattern="\s+"><token type="Text"/></rule>
+      <rule pattern="(\w+)(=)(&quot;?[^\s&quot;\]]+&quot;?)"><bygroups><token type="NameAttribute"/><token type="Operator"/><token type="LiteralString"/></bygroups></rule>
+      <rule pattern="(=)(&quot;?[^\s&quot;\]]+&quot;?)"><bygroups><token type="Operator"/><token type="LiteralString"/></bygroups></rule>
+      <rule pattern="\]"><token type="Keyword"/><pop depth="1"/></rule>
+    </state>
+  </rules>
+</lexer>
+
--- a/lexers/groff.xml
+++ b/lexers/groff.xml
@@ -3,6 +3,7 @@
    <name>Groff</name>
    <alias>groff</alias>
    <alias>nroff</alias>
+    <alias>roff</alias>
    <alias>man</alias>
    <filename>*.[1-9]</filename>
    <filename>*.1p</filename>
@@ -87,4 +88,4 @@
      </rule>
    </state>
  </rules>
-</lexer>
+</lexer>
--- a/lexers/heuristics.yml
+++ b/lexers/heuristics.yml
@@ -0,0 +1,913 @@
+# A collection of simple regexp-based rules that can be applied to content
+# to disambiguate languages with the same file extension.
+#
+# There are two top-level keys: disambiguations and named_patterns.
+#
+# disambiguations     - a list of disambiguation rules, one for each
+#                       extension or group of extensions.
+# extensions          - an array of file extensions that this block applies to.
+# rules               - list of rules that are applied in order to the content
+#                       of a file with a matching extension. Rules are evaluated
+#                       until one of them matches. If none matches, no language
+#                       is returned.
+# language            - Language to be returned if the rule matches.
+# pattern             - Ruby-compatible regular expression that makes the rule
+#                       match. If no pattern is specified, the rule always matches.
+#                       Pattern can be a string with a single regular expression
+#                       or an array of strings that will be merged in a single
+#                       regular expression (with union).
+# and                 - An and block merges multiple rules and checks that all of
+#                       them must match.
+# negative_pattern    - Same as pattern, but checks for absence of matches.
+# named_pattern       - A pattern can be reused by specifying it in the
+#                       named_patterns section and referencing it here by its
+#                       key.
+# named_patterns      - Key-value map of reusable named patterns.
+#
+# Please keep this list alphabetized.
+#
+---
+disambiguations:
+- extensions: ['.1', '.2', '.3', '.4', '.5', '.6', '.7', '.8', '.9']
+  rules:
+  - language: man
+    and:
+    - named_pattern: mdoc-date
+    - named_pattern: mdoc-title
+    - named_pattern: mdoc-heading
+  - language: man
+    and:
+    - named_pattern: man-title
+    - named_pattern: man-heading
+  - language: Roff
+    pattern: '^\.(?:[A-Za-z]{2}(?:\s|$)|\\")'
+- extensions: ['.1in', '.1m', '.1x', '.3in', '.3m', '.3p', '.3pm', '.3qt', '.3x', '.man', '.mdoc']
+  rules:
+  - language: man
+    and:
+    - named_pattern: mdoc-date
+    - named_pattern: mdoc-title
+    - named_pattern: mdoc-heading
+  - language: man
+    and:
+    - named_pattern: man-title
+    - named_pattern: man-heading
+  - language: Roff
+- extensions: ['.al']
+  rules:
+  # AL pattern source from https://github.com/microsoft/AL/blob/master/grammar/alsyntax.tmlanguage - keyword.other.applicationobject.al
+  - language: AL
+    and:
+    - pattern: '\b(?i:(CODEUNIT|PAGE|PAGEEXTENSION|PAGECUSTOMIZATION|DOTNET|ENUM|ENUMEXTENSION|VALUE|QUERY|REPORT|TABLE|TABLEEXTENSION|XMLPORT|PROFILE|CONTROLADDIN|REPORTEXTENSION|INTERFACE|PERMISSIONSET|PERMISSIONSETEXTENSION|ENTITLEMENT))\b'
+  # Open-ended fallback to Perl AutoLoader
+  - language: Perl
+- extensions: ['.app']
+  rules:
+  - language: Erlang
+    pattern: '^\{\s*(?:application|''application'')\s*,\s*(?:[a-z]+[\w@]*|''[^'']+'')\s*,\s*\[(?:.|[\r\n])*\]\s*\}\.[ \t]*$'
+- extensions: ['.as']
+  rules:
+  - language: ActionScript
+    pattern: '^\s*(?:package(?:\s+[\w.]+)?\s+(?:\{|$)|import\s+[\w.*]+\s*;|(?=.*?(?:intrinsic|extends))(intrinsic\s+)?class\s+[\w<>.]+(?:\s+extends\s+[\w<>.]+)?|(?:(?:public|protected|private|static)\s+)*(?:(?:var|const|local)\s+\w+\s*:\s*[\w<>.]+(?:\s*=.*)?\s*;|function\s+\w+\s*\((?:\s*\w+\s*:\s*[\w<>.]+\s*(,\s*\w+\s*:\s*[\w<>.]+\s*)*)?\)))'
+- extensions: ['.asc']
+  rules:
+  - language: Public Key
+    pattern: '^(----[- ]BEGIN|ssh-(rsa|dss)) '
+  - language: AsciiDoc
+    pattern: '^[=-]+\s|\{\{[A-Za-z]'
+  - language: AGS Script
+    pattern: '^(\/\/.+|((import|export)\s+)?(function|int|float|char)\s+((room|repeatedly|on|game)_)?([A-Za-z]+[A-Za-z_0-9]+)\s*[;\(])'
+- extensions: ['.asm']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
+- extensions: ['.asy']
+  rules:
+  - language: LTspice Symbol
+    pattern: '^SymbolType[ \t]'
+  - language: Asymptote
+- extensions: ['.bas']
+  rules:
+  - language: FreeBasic
+    pattern: '^[ \t]*#(?i)(?:define|endif|endmacro|ifn?def|include|lang|macro)(?:$|\s)'
+  - language: BASIC
+    pattern: '\A\s*\d'
+  - language: VBA
+    and:
+    - named_pattern: vb-module
+    - named_pattern: vba
+  - language: Visual Basic 6.0
+    named_pattern: vb-module
+- extensions: ['.bb']
+  rules:
+  - language: BlitzBasic
+    pattern: '(<^\s*; |End Function)'
+  - language: BitBake
+    pattern: '^(# |include|require|inherit)\b'
+  - language: Clojure
+    pattern: '\((def|defn|defmacro|let)\s'
+- extensions: ['.bf']
+  rules:
+  - language: Beef
+    pattern: '(?-m)^\s*using\s+(System|Beefy)(\.(.*))?;\s*$'
+  - language: HyPhy
+    pattern:
+    - '(?-m)^\s*#include\s+".*";\s*$'
+    - '\sfprintf\s*\('
+  - language: Brainfuck
+    pattern: '(>\+>|>\+<)'
+- extensions: ['.bi']
+  rules:
+  - language: FreeBasic
+    pattern: '^[ \t]*#(?i)(?:define|endif|endmacro|ifn?def|if|include|lang|macro)(?:$|\s)'
+- extensions: ['.bs']
+  rules:
+  - language: Bikeshed
+    pattern: '^(?i:<pre\s+class)\s*=\s*(''|\"|\b)metadata\b\1[^>\r\n]*>'
+  - language: BrighterScript
+    pattern:
+    - (?i:^\s*(?=^sub\s)(?:sub\s*\w+\(.*?\))|(?::\s*sub\(.*?\))$)
+    - (?i:^\s*(end\ssub)$)
+    - (?i:^\s*(?=^function\s)(?:function\s*\w+\(.*?\)\s*as\s*\w*)|(?::\s*function\(.*?\)\s*as\s*\w*)$)
+    - (?i:^\s*(end\sfunction)$)
+  - language: Bluespec BH
+    pattern: '^package\s+[A-Za-z_][A-Za-z0-9_'']*(?:\s*\(|\s+where)'
+- extensions: ['.builds']
+  rules:
+  - language: XML
+    pattern: '^(\s*)(?i:<Project|<Import|<Property|<?xml|xmlns)'
+- extensions: ['.ch']
+  rules:
+  - language: xBase
+    pattern: '^\s*#\s*(?i:if|ifdef|ifndef|define|command|xcommand|translate|xtranslate|include|pragma|undef)\b'
+- extensions: ['.cl']
+  rules:
+  - language: Common Lisp
+    pattern: '^\s*\((?i:defun|in-package|defpackage) '
+  - language: Cool
+    pattern: '^class'
+  - language: OpenCL
+    pattern: '\/\* |\/\/ |^\}'
+- extensions: ['.cls']
+  rules:
+  - language: Visual Basic 6.0
+    and:
+    - named_pattern: vb-class
+    - pattern: '^\s*BEGIN(?:\r?\n|\r)\s*MultiUse\s*=.*(?:\r?\n|\r)\s*Persistable\s*='
+  - language: VBA
+    named_pattern: vb-class
+  - language: TeX
+    pattern: '^\s*\\(?:NeedsTeXFormat|ProvidesClass)\{'
+  - language: ObjectScript
+    pattern: '^Class\s'
+- extensions: ['.cmp']
+  rules:
+  - language: Gerber Image
+    pattern: '^[DGMT][0-9]{2}\*(?:\r?\n|\r)'
+- extensions: ['.cs']
+  rules:
+  - language: Smalltalk
+    pattern: '![\w\s]+methodsFor: '
+  - language: 'C#'
+    pattern: '^\s*(using\s+[A-Z][\s\w.]+;|namespace\s*[\w\.]+\s*(\{|;)|\/\/)'
+- extensions: ['.csc']
+  rules:
+  - language: GSC
+    named_pattern: gsc
+- extensions: ['.csl']
+  rules:
+  - language: XML
+    pattern: '(?i:^\s*(<\?xml|xmlns))'
+  - language: Kusto
+    pattern: '(^\|\s*(where|extend|project|limit|summarize))|(^\.\w+)'
+- extensions: ['.d']
+  rules:
+  - language: D
+    # see http://dlang.org/spec/grammar
+    # ModuleDeclaration | ImportDeclaration | FuncDeclaration | unittest
+    pattern: '^module\s+[\w.]*\s*;|import\s+[\w\s,.:]*;|\w+\s+\w+\s*\(.*\)(?:\(.*\))?\s*\{[^}]*\}|unittest\s*(?:\(.*\))?\s*\{[^}]*\}'
+  - language: DTrace
+    # see http://dtrace.org/guide/chp-prog.html, http://dtrace.org/guide/chp-profile.html, http://dtrace.org/guide/chp-opt.html
+    pattern: '^(\w+:\w*:\w*:\w*|BEGIN|END|provider\s+|(tick|profile)-\w+\s+\{[^}]*\}|#pragma\s+D\s+(option|attributes|depends_on)\s|#pragma\s+ident\s)'
+  - language: Makefile
+    # path/target : dependency \
+    # target : \
+    #  : dependency
+    # path/file.ext1 : some/path/../file.ext2
+    pattern: '([\/\\].*:\s+.*\s\\$|: \\$|^[ %]:|^[\w\s\/\\.]+\w+\.\w+\s*:\s+[\w\s\/\\.]+\w+\.\w+)'
+- extensions: ['.dsp']
+  rules:
+  - language: Microsoft Developer Studio Project
+    pattern: '# Microsoft Developer Studio Generated Build File'
+  - language: Faust
+    pattern: '\bprocess\s*[(=]|\b(library|import)\s*\(\s*"|\bdeclare\s+(name|version|author|copyright|license)\s+"'
+- extensions: ['.e']
+  rules:
+  - language: E
+    pattern:
+    - '^\s*(def|var)\s+(.+):='
+    - '^\s*(def|to)\s+(\w+)(\(.+\))?\s+\{'
+    - '^\s*(when)\s+(\(.+\))\s+->\s+\{'
+  - language: Eiffel
+    pattern:
+    - '^\s*\w+\s*(?:,\s*\w+)*[:]\s*\w+\s'
+    - '^\s*\w+\s*(?:\(\s*\w+[:][^)]+\))?(?:[:]\s*\w+)?(?:--.+\s+)*\s+(?:do|local)\s'
+    - '^\s*(?:across|deferred|elseif|ensure|feature|from|inherit|inspect|invariant|note|once|require|undefine|variant|when)\s*$'
+  - language: Euphoria
+    named_pattern: euphoria
+- extensions: ['.ecl']
+  rules:
+  - language: ECLiPSe
+    pattern: '^[^#]+:-'
+  - language: ECL
+    pattern: ':='
+- extensions: ['.es']
+  rules:
+  - language: Erlang
+    pattern: '^\s*(?:%%|main\s*\(.*?\)\s*->)'
+  - language: JavaScript
+    pattern: '\/\/|("|'')use strict\1|export\s+default\s|\/\*(?:.|[\r\n])*?\*\/'
+- extensions: ['.ex']
+  rules:
+  - language: Elixir
+    pattern:
+    - '^\s*@moduledoc\s'
+    - '^\s*(?:cond|import|quote|unless)\s'
+    - '^\s*def(?:exception|impl|macro|module|protocol)[(\s]'
+  - language: Euphoria
+    named_pattern: euphoria
+- extensions: ['.f']
+  rules:
+  - language: Forth
+    pattern: '^: '
+  - language: Filebench WML
+    pattern: 'flowop'
+  - language: Fortran
+    named_pattern: fortran
+- extensions: ['.for']
+  rules:
+  - language: Forth
+    pattern: '^: '
+  - language: Fortran
+    named_pattern: fortran
+- extensions: ['.fr']
+  rules:
+  - language: Forth
+    pattern: '^(: |also |new-device|previous )'
+  - language: Frege
+    pattern: '^\s*(import|module|package|data|type) '
+  - language: Text
+- extensions: ['.frm']
+  rules:
+  - language: VBA
+    and:
+    - named_pattern: vb-form
+    - pattern: '^\s*Begin\s+\{[0-9A-Z\-]*\}\s?'
+  - language: Visual Basic 6.0
+    and:
+    - named_pattern: vb-form
+    - pattern: '^\s*Begin\s+VB\.Form\s+'
+- extensions: ['.fs']
+  rules:
+  - language: Forth
+    pattern: '^(: |new-device)'
+  - language: 'F#'
+    pattern: '^\s*(#light|import|let|module|namespace|open|type)'
+  - language: GLSL
+    pattern: '^\s*(#version|precision|uniform|varying|vec[234])'
+  - language: Filterscript
+    pattern: '#include|#pragma\s+(rs|version)|__attribute__'
+- extensions: ['.ftl']
+  rules:
+  - language: FreeMarker
+    pattern: '^(?:<|[a-zA-Z-][a-zA-Z0-9_-]+[ \t]+\w)|\$\{\w+[^\r\n]*?\}|^[ \t]*(?:<#--.*?-->|<#([a-z]+)(?=\s|>)[^>]*>.*?</#\1>|\[#--.*?--\]|\[#([a-z]+)(?=\s|\])[^\]]*\].*?\[#\2\])'
+  - language: Fluent
+    pattern: '^-?[a-zA-Z][a-zA-Z0-9_-]* *=|\{\$-?[a-zA-Z][-\w]*(?:\.[a-zA-Z][-\w]*)?\}'
+- extensions: ['.g']
+  rules:
+  - language: GAP
+    pattern: '\s*(Declare|BindGlobal|KeyDependentOperation|Install(Method|GlobalFunction)|SetPackageInfo)'
+  - language: G-code
+    pattern: '^[MG][0-9]+(?:\r?\n|\r)'
+- extensions: ['.gd']
+  rules:
+  - language: GAP
+    pattern: '\s*(Declare|BindGlobal|KeyDependentOperation)'
+  - language: GDScript
+    pattern: '\s*(extends|var|const|enum|func|class|signal|tool|yield|assert|onready)'
+- extensions: ['.gml']
+  rules:
+  - language: XML
+    pattern: '(?i:^\s*(<\?xml|xmlns))'
+  - language: Graph Modeling Language
+    pattern: '(?i:^\s*(graph|node)\s+\[$)'
+  - language: Gerber Image
+    pattern: '^[DGMT][0-9]{2}\*$'
+  - language: Game Maker Language
+- extensions: ['.gs']
+  rules:
+  - language: GLSL
+    pattern: '^#version\s+[0-9]+\b'
+  - language: Gosu
+    pattern: '^uses (java|gw)\.'
+  - language: Genie
+    pattern: '^\[indent=[0-9]+\]'
+- extensions: ['.gsc']
+  rules:
+  - language: GSC
+    named_pattern: gsc
+- extensions: ['.gsh']
+  rules:
+  - language: GSC
+    named_pattern: gsc
+- extensions: ['.gts']
+  rules:
+  - language: Gerber Image
+    pattern: '^G0.'
+  - language: Glimmer TS
+    negative_pattern: '^G0.'
+- extensions: ['.h']
+  rules:
+  - language: Objective-C
+    named_pattern: objectivec
+  - language: C++
+    named_pattern: cpp
+  - language: C
+- extensions: ['.hh']
+  rules:
+  - language: Hack
+    pattern: '<\?hh'
+- extensions: ['.html']
+  rules:
+  - language: Ecmarkup
+    pattern: '<emu-(?:alg|annex|biblio|clause|eqn|example|figure|gann|gmod|gprose|grammar|intro|not-ref|note|nt|prodref|production|rhs|table|t|xref)(?:$|\s|>)'
+  - language: HTML
+- extensions: ['.i']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
+  - language: SWIG
+    pattern: '^[ \t]*%[a-z_]+\b|^%[{}]$'
+- extensions: ['.ice']
+  rules:
+  - language: JSON
+    pattern: '\A\s*[{\[]'
+  - language: Slice
+- extensions: ['.inc']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
+  - language: PHP
+    pattern: '^<\?(?:php)?'
+  - language: SourcePawn
+    pattern:
+    - '^public\s+(?:SharedPlugin(?:\s+|:)__pl_\w+\s*=(?:\s*\{)?|(?:void\s+)?__pl_\w+_SetNTVOptional\(\)(?:\s*\{)?)'
+    - '^methodmap\s+\w+\s+<\s+\w+'
+    - '^\s*MarkNativeAsOptional\s*\('
+  - language: NASL
+    pattern:
+    - '^\s*include\s*\(\s*(?:"|'')[\\/\w\-\.:\s]+\.(?:nasl|inc)\s*(?:"|'')\s*\)\s*;'
+    - '^\s*(?:global|local)_var\s+(?:\w+(?:\s*=\s*[\w\-"'']+)?\s*)(?:,\s*\w+(?:\s*=\s*[\w\-"'']+)?\s*)*+\s*;'
+    - '^\s*namespace\s+\w+\s*\{'
+    - '^\s*object\s+\w+\s*(?:extends\s+\w+(?:::\w+)?)?\s*\{'
+    - '^\s*(?:public\s+|private\s+|\s*)function\s+\w+\s*\([\w\s,]*\)\s*\{'
+  - language: POV-Ray SDL
+    pattern: '^\s*#(declare|local|macro|while)\s'
+  - language: Pascal
+    pattern:
+    - '(?i:^\s*\{\$(?:mode|ifdef|undef|define)[ ]+[a-z0-9_]+\})'
+    - '^\s*end[.;]\s*$'
+  - language: BitBake
+    pattern: '^inherit(\s+[\w.-]+)+\s*$'
+- extensions: ['.json']
+  rules:
+  - language: OASv2-json
+    pattern: '"swagger":\s?"2.[0-9.]+"'
+  - language: OASv3-json
+    pattern: '"openapi":\s?"3.[0-9.]+"'
+  - language: JSON
+- extensions: ['.l']
+  rules:
+  - language: Common Lisp
+    pattern: '\(def(un|macro)\s'
+  - language: Lex
+    pattern: '^(%[%{}]xs|<.*>)'
+  - language: Roff
+    pattern: '^\.[A-Za-z]{2}(\s|$)'
+  - language: PicoLisp
+    pattern: '^\((de|class|rel|code|data|must)\s'
+- extensions: ['.lean']
+  rules:
+  - language: Lean
+    pattern: '^import [a-z]'
+  - language: Lean 4
+    pattern: '^import [A-Z]'
+- extensions: ['.ls']
+  rules:
+  - language: LoomScript
+    pattern: '^\s*package\s*[\w\.\/\*\s]*\s*\{'
+  - language: LiveScript
+- extensions: ['.lsp', '.lisp']
+  rules:
+  - language: Common Lisp
+    pattern: '^\s*\((?i:defun|in-package|defpackage) '
+  - language: NewLisp
+    pattern: '^\s*\(define '
+- extensions: ['.m']
+  rules:
+  - language: Objective-C
+    named_pattern: objectivec
+  - language: Mercury
+    pattern: ':- module'
+  - language: MUF
+    pattern: '^: '
+  - language: M
+    pattern: '^\s*;'
+  - language: Mathematica
+    and:
+      - pattern: '\(\*'
+      - pattern: '\*\)$'
+  - language: MATLAB
+    pattern: '^\s*%'
+  - language: Limbo
+    pattern: '^\w+\s*:\s*module\s*\{'
+- extensions: ['.m4']
+  rules:
+  - language: M4Sugar
+    pattern:
+    - 'AC_DEFUN|AC_PREREQ|AC_INIT'
+    - '^_?m4_'
+  - language: 'M4'
+- extensions: ['.mask']
+  rules:
+  - language: Unity3D Asset
+    pattern: 'tag:unity3d.com'
+- extensions: ['.mc']
+  rules:
+  - language: Win32 Message File
+    pattern: '(?i)^[ \t]*(?>\/\*\s*)?MessageId=|^\.$'
+  - language: M4
+    pattern: '^dnl|^divert\((?:-?\d+)?\)|^\w+\(`[^\r\n]*?''[),]'
+  - language: Monkey C
+    pattern: '\b(?:using|module|function|class|var)\s+\w'
+- extensions: ['.md']
+  rules:
+  - language: Markdown
+    pattern:
+    - '(^[-A-Za-z0-9=#!\*\[|>])|<\/'
+    - '\A\z'
+  - language: GCC Machine Description
+    pattern: '^(;;|\(define_)'
+  - language: Markdown
+- extensions: ['.ml']
+  rules:
+  - language: OCaml
+    pattern: '(^\s*module)|let rec |match\s+(\S+\s)+with'
+  - language: Standard ML
+    pattern: '=> |case\s+(\S+\s)+of'
+- extensions: ['.mod']
+  rules:
+  - language: XML
+    pattern: '<!ENTITY '
+  - language: NMODL
+    pattern: '\b(NEURON|INITIAL|UNITS)\b'
+  - language: Modula-2
+    pattern: '^\s*(?i:MODULE|END) [\w\.]+;'
+  - language: [Linux Kernel Module, AMPL]
+- extensions: ['.mojo']
+  rules:
+  - language: Mojo
+    pattern: '^\s*(alias|def|from|fn|import|struct|trait)\s'
+  - language: XML
+    pattern: '^\s*<\?xml'
+- extensions: ['.ms']
+  rules:
+  - language: Roff
+    pattern: '^[.''][A-Za-z]{2}(\s|$)'
+  - language: Unix Assembly
+    and:
+      - negative_pattern: '/\*'
+      - pattern: '^\s*\.(?:include\s|globa?l\s|[A-Za-z][_A-Za-z0-9]*:)'
+  - language: MAXScript
+- extensions: ['.n']
+  rules:
+  - language: Roff
+    pattern: '^[.'']'
+  - language: Nemerle
+    pattern: '^(module|namespace|using)\s'
+- extensions: ['.ncl']
+  rules:
+  - language: XML
+    pattern: '^\s*<\?xml\s+version'
+  - language: Gerber Image
+    pattern: '^[DGMT][0-9]{2}\*(?:\r?\n|\r)'
+  - language: Text
+    pattern: 'THE_TITLE'
+- extensions: ['.nl']
+  rules:
+  - language: NL
+    pattern: '^(b|g)[0-9]+ '
+  - language: NewLisp
+- extensions: ['.nu']
+  rules:
+  - language: Nushell
+    pattern: '^\s*(import|export|module|def|let|let-env) '
+  - language: Nu
+- extensions: ['.odin']
+  rules:
+  - language: Object Data Instance Notation
+    pattern: '(?:^|<)\s*[A-Za-z0-9_]+\s*=\s*<'
+  - language: Odin
+    pattern: 'package\s+\w+|\b(?:im|ex)port\s*"[\w:./]+"|\w+\s*::\s*(?:proc|struct)\s*\(|^\s*//\s'
+- extensions: ['.p']
+  rules:
+  - language: Gnuplot
+    pattern:
+    - '^s?plot\b'
+    - '^set\s+(term|terminal|out|output|[xy]tics|[xy]label|[xy]range|style)\b'
+  - language: OpenEdge ABL
+- extensions: ['.php']
+  rules:
+  - language: Hack
+    pattern: '<\?hh'
+  - language: PHP
+    pattern: '<\?[^h]'
+- extensions: ['.pkl']
+  rules:
+    - language: Pkl
+      pattern:
+      - '^\s*(module|import|amends|extends|local|const|fixed|abstract|open|class|typealias|@\w+)\b'
+      - '^\s*[a-zA-Z0-9_$]+\s*(=|{|:)|^\s*`[^`]+`\s*(=|{|:)|for\s*\(|when\s*\('
+    - language: Pickle
+- extensions: ['.pl']
+  rules:
+  - language: Prolog
+    pattern: '^[^#]*:-'
+  - language: Perl
+    and:
+      - negative_pattern: '^\s*use\s+v6\b'
+      - named_pattern: perl
+  - language: Raku
+    named_pattern: raku
+- extensions: ['.plist']
+  rules:
+  - language: XML Property List
+    pattern: '^\s*(?:<\?xml\s|<!DOCTYPE\s+plist|<plist(?:\s+version\s*=\s*(["''])\d+(?:\.\d+)?\1)?\s*>\s*$)'
+  - language: OpenStep Property List
+- extensions: ['.plt']
+  rules:
+  - language: Prolog
+    pattern: '^\s*:-'
+- extensions: ['.pm']
+  rules:
+  - language: Perl
+    and:
+      - negative_pattern: '^\s*use\s+v6\b'
+      - named_pattern: perl
+  - language: Raku
+    named_pattern: raku
+  - language: X PixMap
+    pattern: '^\s*\/\* XPM \*\/'
+- extensions: ['.pod']
+  rules:
+  - language: Pod 6
+    pattern: '^[\s&&[^\r\n]]*=(comment|begin pod|begin para|item\d+)'
+  - language: Pod
+- extensions: ['.pp']
+  rules:
+  - language: Pascal
+    pattern: '^\s*end[.;]'
+  - language: Puppet
+    pattern: '^\s+\w+\s+=>\s'
+- extensions: ['.pro']
+  rules:
+  - language: Proguard
+    pattern: '^-(include\b.*\.pro$|keep\b|keepclassmembers\b|keepattributes\b)'
+  - language: Prolog
+    pattern: '^[^\[#]+:-'
+  - language: INI
+    pattern: 'last_client='
+  - language: QMake
+    and:
+    - pattern: HEADERS
+    - pattern: SOURCES
+  - language: IDL
+    pattern: '^\s*(?i:function|pro|compile_opt) \w[ \w,:]*$'
+- extensions: ['.properties']
+  rules:
+  - language: INI
+    and:
+    - named_pattern: key_equals_value
+    - pattern: '^[;\[]'
+  - language: Java Properties
+    and:
+    - named_pattern: key_equals_value
+    - pattern: '^[#!]'
+  - language: INI
+    named_pattern: key_equals_value
+  - language: Java Properties
+    pattern: '^[^#!][^:]*:'
+- extensions: ['.q']
+  rules:
+  - language: q
+    pattern: '((?i:[A-Z.][\w.]*:\{)|^\\(cd?|d|l|p|ts?) )'
+  - language: HiveQL
+    pattern: '(?i:SELECT\s+[\w*,]+\s+FROM|(CREATE|ALTER|DROP)\s(DATABASE|SCHEMA|TABLE))'
+- extensions: ['.qs']
+  rules:
+  - language: Q#
+    pattern: '^((\/{2,3})?\s*(namespace|operation)\b)'
+  - language: Qt Script
+    pattern: '(\w+\.prototype\.\w+|===|\bvar\b)'
+- extensions: ['.r']
+  rules:
+  - language: Rebol
+    pattern: '(?i:\bRebol\b)'
+  - language: Rez
+    pattern: '(#include\s+["<](Types\.r|Carbon\/Carbon\.r)[">])|((resource|data|type)\s+''[A-Za-z0-9]{4}''\s+((\(.*\)\s+){0,1}){)'
+  - language: R
+    pattern: '<-|^\s*#'
+- extensions: ['.re']
+  rules:
+  - language: Reason
+    pattern:
+    - '^\s*module\s+type\s'
+    - '^\s*(?:include|open)\s+\w+\s*;\s*$'
+    - '^\s*let\s+(?:module\s\w+\s*=\s*\{|\w+:\s+.*=.*;\s*$)'
+  - language: C++
+    pattern:
+    - '^\s*#(?:(?:if|ifdef|define|pragma)\s+\w|\s*include\s+<[^>]+>)'
+    - '^\s*template\s*<'
+- extensions: ['.res']
+  rules:
+  - language: ReScript
+    pattern:
+    - '^\s*(let|module|type)\s+\w*\s+=\s+'
+    - '^\s*(?:include|open)\s+\w+\s*$'
+- extensions: ['.rno']
+  rules:
+  - language: RUNOFF
+    pattern: '(?i:^\.!|^\f|\f$|^\.end lit(?:eral)?\b|^\.[a-zA-Z].*?;\.[a-zA-Z](?:[; \t])|\^\*[^\s*][^*]*\\\*(?=$|\s)|^\.c;[ \t]*\w+)'
+  - language: Roff
+    pattern: '^\.\\" '
+- extensions: ['.rpy']
+  rules:
+  - language: Python
+    pattern: '^(import|from|class|def)\s'
+  - language: "Ren'Py"
+- extensions: ['.rs']
+  rules:
+  - language: Rust
+    pattern: '^(use |fn |mod |pub |macro_rules|impl|#!?\[)'
+  - language: RenderScript
+    pattern: '#include|#pragma\s+(rs|version)|__attribute__'
+  - language: XML
+    pattern: '^\s*<\?xml'
+- extensions: ['.s']
+  rules:
+  - language: Motorola 68K Assembly
+    named_pattern: m68k
+- extensions: ['.sc']
+  rules:
+  - language: SuperCollider
+    pattern: '(?i:\^(this|super)\.|^\s*~\w+\s*=\.)'
+  - language: Scala
+    pattern: '(^\s*import (scala|java)\.|^\s*class\b)'
+- extensions: ['.scd']
+  rules:
+  - language: SuperCollider
+    pattern: '(?i:\^(this|super)\.|^\s*(~\w+\s*=\.|SynthDef\b))'
+  - language: Markdown
+    # Markdown syntax for scdoc
+    pattern: '^#+\s+(NAME|SYNOPSIS|DESCRIPTION)'
+- extensions: ['.sol']
+  rules:
+  - language: Solidity
+    pattern: '\bpragma\s+solidity\b|\b(?:abstract\s+)?contract\s+(?!\d)[a-zA-Z0-9$_]+(?:\s+is\s+(?:[a-zA-Z0-9$_][^\{]*?)?)?\s*\{'
+  - language: Gerber Image
+    pattern: '^[DGMT][0-9]{2}\*(?:\r?\n|\r)'
+- extensions: ['.sql']
+  rules:
+  # Postgres
+  - language: PLpgSQL
+    pattern: '(?i:^\\i\b|AS\s+\$\$|LANGUAGE\s+''?plpgsql''?|BEGIN(\s+WORK)?\s*;)'
+  # IBM db2
+  - language: SQLPL
+    pattern: '(?i:ALTER\s+MODULE|MODE\s+DB2SQL|\bSYS(CAT|PROC)\.|ASSOCIATE\s+RESULT\s+SET|\bEND!\s*$)'
+  # Oracle
+  - language: PLSQL
+    pattern: '(?i:\$\$PLSQL_|XMLTYPE|systimestamp|\.nextval|CONNECT\s+BY|AUTHID\s+(DEFINER|CURRENT_USER)|constructor\W+function)'
+  # T-SQL
+  - language: TSQL
+    pattern: '(?i:^\s*GO\b|BEGIN(\s+TRY|\s+CATCH)|OUTPUT\s+INSERTED|DECLARE\s+@|\[dbo\])'
+  - language: SQL
+- extensions: ['.srt']
+  rules:
+  - language: SubRip Text
+    pattern: '^(\d{2}:\d{2}:\d{2},\d{3})\s*(-->)\s*(\d{2}:\d{2}:\d{2},\d{3})$'
+- extensions: ['.st']
+  rules:
+  - language: StringTemplate
+    pattern: '\$\w+[($]|(.)!\s*.+?\s*!\1|<!\s*.+?\s*!>|\[!\s*.+?\s*!\]|\{!\s*.+?\s*!\}'
+  - language: Smalltalk
+    pattern: '\A\s*[\[{(^"''\w#]|[a-zA-Z_]\w*\s*:=\s*[a-zA-Z_]\w*|class\s*>>\s*[a-zA-Z_]\w*|^[a-zA-Z_]\w*\s+[a-zA-Z_]\w*:|^Class\s*\{|if(?:True|False):\s*\['
+- extensions: ['.star']
+  rules:
+  - language: STAR
+    pattern: '^loop_\s*$'
+  - language: Starlark
+- extensions: ['.stl']
+  rules:
+  - language: STL
+    pattern: '\A\s*solid(?:$|\s)[\s\S]*^endsolid(?:$|\s)'
+- extensions: ['.sw']
+  rules:
+  - language: Sway
+    pattern: '^\s*(?:(?:abi|dep|fn|impl|mod|pub|trait)\s|#\[)'
+  - language: XML
+    pattern: '^\s*<\?xml\s+version'
+- extensions: ['.t']
+  rules:
+  - language: Perl
+    and:
+      - negative_pattern: '^\s*use\s+v6\b'
+      - named_pattern: perl
+  - language: Raku
+    pattern: '^\s*(?:use\s+v6\b|\bmodule\b|\bmy\s+class\b)'
+  - language: Turing
+    pattern: '^\s*%[ \t]+|^\s*var\s+\w+(\s*:\s*\w+)?\s*:=\s*\w+'
+- extensions: ['.tag']
+  rules:
+  - language: Java Server Pages
+    pattern: '<%[@!=\s]?\s*(taglib|tag|include|attribute|variable)\s'
+- extensions: ['.tlv']
+  rules:
+  - language: TL-Verilog
+    pattern: '^\\.{0,10}TLV_version'
+- extensions: ['.toc']
+  rules:
+  - language: World of Warcraft Addon Data
+    pattern: '^## |@no-lib-strip@'
+  - language: TeX
+    pattern: '^\\(contentsline|defcounter|beamer|boolfalse)'
+- extensions: ['.ts']
+  rules:
+  - language: XML
+    pattern: '<TS\b'
+  - language: TypeScript
+- extensions: ['.tst']
+  rules:
+  - language: GAP
+    pattern: 'gap> '
+  # Heads up - we don't usually write heuristics like this (with no regex match)
+  - language: Scilab
+- extensions: ['.tsx']
+  rules:
+  - language: TSX
+    pattern: '^\s*(import.+(from\s+|require\()[''"]react|\/\/\/\s*<reference\s)'
+  - language: XML
+    pattern: '(?i:^\s*<\?xml\s+version)'
+- extensions: ['.txt']
+  rules:
+    # The following RegExp is simply a collapsed and simplified form of the
+    # VIM_MODELINE pattern in `./lib/linguist/strategy/modeline.rb`.
+  - language: Vim Help File
+    pattern: '(?:(?:^|[ \t])(?:vi|Vi(?=m))(?:m[<=>]?[0-9]+|m)?|[ \t]ex)(?=:(?=[ \t]*set?[ \t][^\r\n:]+:)|:(?![ \t]*set?[ \t]))(?:(?:[ \t]*:[ \t]*|[ \t])\w*(?:[ \t]*=(?:[^\\\s]|\\.)*)?)*[ \t:](?:filetype|ft|syntax)[ \t]*=(help)(?=$|\s|:)'
+  - language: Adblock Filter List
+    pattern: |-
+      (?x)\A
+      \[
+      (?<version>
+        (?:
+          [Aa]d[Bb]lock
+          (?:[ \t][Pp]lus)?
+          |
+          u[Bb]lock
+          (?:[ \t][Oo]rigin)?
+          |
+          [Aa]d[Gg]uard
+        )
+        (?:[ \t] \d+(?:\.\d+)*+)?
+      )
+      (?:
+        [ \t]?;[ \t]?
+        \g<version>
+      )*+
+      \]
+    # HACK: This is a contrived use of heuristics needed to address
+    # an unusual edge-case. See https://git.io/JULye for discussion.
+  - language: Text
+- extensions: ['.typ']
+  rules:
+  - language: Typst
+    pattern: '^#(import|show|let|set)'
+  - language: XML
+- extensions: ['.url']
+  rules:
+  - language: INI
+    pattern: '^\[InternetShortcut\](?:\r?\n|\r)(?>[^\s\[][^\r\n]*(?:\r?\n|\r))*URL='
+- extensions: ['.v']
+  rules:
+  - language: Coq
+    pattern: '(?:^|\s)(?:Proof|Qed)\.(?:$|\s)|(?:^|\s)Require[ \t]+(Import|Export)\s'
+  - language: Verilog
+    pattern: '^[ \t]*module\s+[^\s()]+\s+\#?\(|^[ \t]*`(?:define|ifdef|ifndef|include|timescale)|^[ \t]*always[ \t]+@|^[ \t]*initial[ \t]+(begin|@)'
+  - language: V
+    pattern: '\$(?:if|else)[ \t]|^[ \t]*fn\s+[^\s()]+\(.*?\).*?\{|^[ \t]*for\s*\{'
+- extensions: ['.vba']
+  rules:
+  - language: Vim Script
+    pattern: '^UseVimball'
+  - language: VBA
+- extensions: ['.w']
+  rules:
+  - language: OpenEdge ABL
+    pattern: '&ANALYZE-SUSPEND _UIB-CODE-BLOCK _CUSTOM _DEFINITIONS'
+  - language: CWeb
+    pattern: '^@(<|\w+\.)'
+- extensions: ['.x']
+  rules:
+  - language: DirectX 3D File
+    pattern: '^xof 030(2|3)(?:txt|bin|tzip|bzip)\b'
+  - language: RPC
+    pattern: '\b(program|version)\s+\w+\s*\{|\bunion\s+\w+\s+switch\s*\('
+  - language: Logos
+    pattern: '^%(end|ctor|hook|group)\b'
+  - language: Linker Script
+    pattern: 'OUTPUT_ARCH\(|OUTPUT_FORMAT\(|SECTIONS'
+- extensions: ['.yaml', '.yml']
+  rules:
+  - language: MiniYAML
+    pattern: '^\t+.*?[^\s:].*?:'
+    negative_pattern: '---'
+  - language: OASv2-yaml
+    pattern: 'swagger:\s?''?"?2.[0-9.]+''?"?'
+  - language: OASv3-yaml
+    pattern: 'openapi:\s?''?"?3.[0-9.]+''?"?'
+  - language: YAML
+- extensions: ['.yy']
+  rules:
+  - language: JSON
+    pattern: '\"modelName\"\:\s*\"GM'
+  - language: Yacc
+named_patterns:
+  cpp:
+  - '^\s*#\s*include <(cstdint|string|vector|map|list|array|bitset|queue|stack|forward_list|unordered_map|unordered_set|(i|o|io)stream)>'
+  - '^\s*template\s*<'
+  - '^[ \t]*(try|constexpr)'
+  - '^[ \t]*catch\s*\('
+  - '^[ \t]*(class|(using[ \t]+)?namespace)\s+\w+'
+  - '^[ \t]*(private|public|protected):$'
+  - '__has_cpp_attribute|__cplusplus >'
+  - 'std::\w+'
+  euphoria:
+  - '^\s*namespace\s'
+  - '^\s*(?:public\s+)?include\s'
+  - '^\s*(?:(?:public|export|global)\s+)?(?:atom|constant|enum|function|integer|object|procedure|sequence|type)\s'
+  fortran: '^(?i:[c*][^abd-z]|      (subroutine|program|end|data)\s|\s*!)'
+  gsc:
+  - '^\s*#\s*(?:using|insert|include|define|namespace)[ \t]+\w'
+  - '^\s*(?>(?:autoexec|private)\s+){0,2}function\s+(?>(?:autoexec|private)\s+){0,2}\w+\s*\('
+  - '\b(?:level|self)[ \t]+thread[ \t]+(?:\[\[[ \t]*(?>\w+\.)*\w+[ \t]*\]\]|\w+)[ \t]*\([^\r\n\)]*\)[ \t]*;'
+  - '^[ \t]*#[ \t]*(?:precache|using_animtree)[ \t]*\('
+  key_equals_value: '^[^#!;][^=]*='
+  m68k:
+  - '(?im)\bmoveq(?:\.l)?\s+#(?:\$-?[0-9a-f]{1,3}|%[0-1]{1,8}|-?[0-9]{1,3}),\s*d[0-7]\b'
+  - '(?im)^\s*move(?:\.[bwl])?\s+(?:sr|usp),\s*[^\s]+'
+  - '(?im)^\s*move\.[bwl]\s+.*\b[ad]\d'
+  - '(?im)^\s*movem\.[bwl]\b'
+  - '(?im)^\s*move[mp](?:\.[wl])?\b'
+  - '(?im)^\s*btst\b'
+  - '(?im)^\s*dbra\b'
+  man-heading:  '^[.''][ \t]*SH +(?:[^"\s]+|"[^"\s]+)'
+  man-title:    '^[.''][ \t]*TH +(?:[^"\s]+|"[^"]+") +"?(?:[1-9]|@[^\s@]+@)'
+  mdoc-date:    '^[.''][ \t]*Dd +(?:[^"\s]+|"[^"]+")'
+  mdoc-heading: '^[.''][ \t]*Sh +(?:[^"\s]|"[^"]+")'
+  mdoc-title:   '^[.''][ \t]*Dt +(?:[^"\s]+|"[^"]+") +"?(?:[1-9]|@[^\s@]+@)'
+  objectivec: '^\s*(@(interface|class|protocol|property|end|synchronized|selector|implementation)\b|#import\s+.+\.h[">])'
+  perl:
+  - '\buse\s+(?:strict\b|v?5\b)'
+  - '^\s*use\s+(?:constant|overload)\b'
+  - '^\s*(?:\*|(?:our\s*)?@)EXPORT\s*='
+  - '^\s*package\s+[^\W\d]\w*(?:::\w+)*\s*(?:[;{]|\sv?\d)'
+  - '[\s$][^\W\d]\w*(?::\w+)*->[a-zA-Z_\[({]'
+  raku: '^\s*(?:use\s+v6\b|\bmodule\b|\b(?:my\s+)?class\b)'
+  vb-class: '^[ ]*VERSION [0-9]\.[0-9] CLASS'
+  vb-form: '^[ ]*VERSION [0-9]\.[0-9]{2}'
+  vb-module: '^[ ]*Attribute VB_Name = '
+  vba:
+  - '\b(?:VBA|[vV]ba)(?:\b|[0-9A-Z_])'
+  # VBA7 new 64-bit features
+  - '^[ ]*(?:Public|Private)? Declare PtrSafe (?:Sub|Function)\b'
+  - '^[ ]*#If Win64\b'
+  - '^[ ]*(?:Dim|Const) [0-9a-zA-Z_]*[ ]*As Long(?:Ptr|Long)\b'
+  # Top module declarations unique to VBA
+  - '^[ ]*Option (?:Private Module|Compare (?:Database|Text|Binary))\b'
+  # General VBA libraries and objects
+  - '(?: |\()(?:Access|Excel|Outlook|PowerPoint|Visio|Word|VBIDE)\.\w'
+  - '\b(?:(?:Active)?VBProjects?|VBComponents?|Application\.(?:VBE|ScreenUpdating))\b'
+  # AutoCAD, Outlook, PowerPoint and Word objects
+  - '\b(?:ThisDrawing|AcadObject|Active(?:Explorer|Inspector|Window\.Presentation|Presentation|Document)|Selection\.(?:Find|Paragraphs))\b'
+  # Excel objects
+  - '\b(?:(?:This|Active)?Workbooks?|Worksheets?|Active(?:Sheet|Chart|Cell)|WorksheetFunction)\b'
+  - '\b(?:Range\(".*|Cells\([0-9a-zA-Z_]*, (?:[0-9a-zA-Z_]*|"[a-zA-Z]{1,3}"))\)'
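A minimal sketch of how rules shaped like the ones above could be evaluated. It assumes rules for an extension are tried in order and the first match wins, a list of patterns matches if any entry matches, `and` requires every sub-rule to match, and a rule with no pattern acts as a fallback. The names `RULES`, `NAMED_PATTERNS`, `rule_matches` and `guess_language` are hypothetical, and the data is a hand-copied subset rather than a parsed YAML file. Several patterns in the full file use regex features that Python's `re` may not accept (for example `\g<version>` or `[\s&&[^\r\n]]`), so this is an illustration, not the project's implementation.

```python
import re

# Hypothetical in-memory copy of a few entries from the heuristics data.
NAMED_PATTERNS = {
    "perl": [r"\buse\s+(?:strict\b|v?5\b)", r"^\s*use\s+(?:constant|overload)\b"],
    "raku": r"^\s*(?:use\s+v6\b|\bmodule\b|\b(?:my\s+)?class\b)",
}

RULES = {
    ".pl": [
        {"language": "Prolog", "pattern": r"^[^#]*:-"},
        {"language": "Perl", "and": [
            {"negative_pattern": r"^\s*use\s+v6\b"},
            {"named_pattern": "perl"},
        ]},
        {"language": "Raku", "named_pattern": "raku"},
    ],
    ".ts": [
        {"language": "XML", "pattern": r"<TS\b"},
        {"language": "TypeScript"},  # no pattern: assumed to be a fallback
    ],
}


def _search(pattern, text):
    """Return True if the pattern (or any pattern in a list) matches the text."""
    patterns = [pattern] if isinstance(pattern, str) else pattern
    return any(re.search(p, text, re.MULTILINE) for p in patterns)


def rule_matches(rule, text):
    """Evaluate one rule; every condition present on the rule must hold."""
    if "and" in rule:
        return all(rule_matches(sub, text) for sub in rule["and"])
    if "pattern" in rule and not _search(rule["pattern"], text):
        return False
    if "named_pattern" in rule and not _search(NAMED_PATTERNS[rule["named_pattern"]], text):
        return False
    if "negative_pattern" in rule and _search(rule["negative_pattern"], text):
        return False
    return True


def guess_language(extension, text):
    """Return the language of the first matching rule, or None."""
    for rule in RULES.get(extension, []):
        if rule_matches(rule, text):
            return rule["language"]
    return None
```

Under these assumptions, `guess_language(".pl", "use strict;\n")` walks past the Prolog rule, satisfies both halves of the Perl `and` rule, and stops there.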
--- a/lexers/markdown.xml
+++ b/lexers/markdown.xml
@ -0,0 +1,56 @@
+
+<lexer>
+  <config>
+    <name>Markdown</name>
+    <alias>markdown</alias>
+    <alias>md</alias>
+    <filename>*.md</filename>
+    <filename>*.markdown</filename>
+    <mime_type>text/x-markdown</mime_type>
+  </config>
+  <rules>
+    <state name="root">
+      <rule pattern="(^#[^#].+)(\n)"><bygroups><token type="GenericHeading"/><token type="Text"/></bygroups></rule>
+      <rule pattern="(^#{2,6}[^#].+)(\n)"><bygroups><token type="GenericSubheading"/><token type="Text"/></bygroups></rule>
+      <rule pattern="^(.+)(\n)(=+)(\n)"><bygroups><token type="GenericHeading"/><token type="Text"/><token type="GenericHeading"/><token type="Text"/></bygroups></rule>
+      <rule pattern="^(.+)(\n)(-+)(\n)"><bygroups><token type="GenericSubheading"/><token type="Text"/><token type="GenericSubheading"/><token type="Text"/></bygroups></rule>
+      <rule pattern="^(\s*)([*-] )(\[[ xX]\])( .+\n)"><bygroups><token type="TextWhitespace"/><token type="Keyword"/><token type="Keyword"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^(\s*)([*-])(\s)(.+\n)"><bygroups><token type="TextWhitespace"/><token type="Keyword"/><token type="TextWhitespace"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^(\s*)([0-9]+\.)( .+\n)"><bygroups><token type="TextWhitespace"/><token type="Keyword"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^(\s*&gt;\s)(.+\n)"><bygroups><token type="Keyword"/><token type="GenericEmph"/></bygroups></rule>
+      <rule pattern="^(```\n)([\w\W]*?)(^```$)">
+        <bygroups>
+          <token type="LiteralStringBacktick"/>
+          <token type="Text"/>
+          <token type="LiteralStringBacktick"/>
+        </bygroups>
+      </rule>
+      <rule pattern="^(```)(\w+)(\n)([\w\W]*?)(^```$)">
+        <bygroups>
+          <token type="LiteralStringBacktick"/>
+          <token type="NameLabel"/>
+          <token type="TextWhitespace"/>
+          <UsingByGroup lexer="2" content="4"/>
+          <token type="LiteralStringBacktick"/>
+        </bygroups>
+      </rule>
+      <rule><include state="inline"/></rule>
+    </state>
+    <state name="inline">
+      <rule pattern="\\."><token type="Text"/></rule>
+      <rule pattern="([^`]?)(`[^`\n]+`)"><bygroups><token type="Text"/><token type="LiteralStringBacktick"/></bygroups></rule>
+      <rule pattern="([^\*]?)(\*\*[^* \n][^*\n]*\*\*)"><bygroups><token type="Text"/><token type="GenericStrong"/></bygroups></rule>
+      <rule pattern="([^_]?)(__[^_ \n][^_\n]*__)"><bygroups><token type="Text"/><token type="GenericStrong"/></bygroups></rule>
+      <rule pattern="([^\*]?)(\*[^* \n][^*\n]*\*)"><bygroups><token type="Text"/><token type="GenericEmph"/></bygroups></rule>
+      <rule pattern="([^_]?)(_[^_ \n][^_\n]*_)"><bygroups><token type="Text"/><token type="GenericEmph"/></bygroups></rule>
+      <rule pattern="([^~]?)(~~[^~ \n][^~\n]*~~)"><bygroups><token type="Text"/><token type="GenericDeleted"/></bygroups></rule>
+      <rule pattern="[@#][\w/:]+"><token type="NameEntity"/></rule>
+      <rule pattern="(!?\[)([^]]+)(\])(\()([^)]+)(\))"><bygroups><token type="Text"/><token type="NameTag"/><token type="Text"/><token type="Text"/><token type="NameAttribute"/><token type="Text"/></bygroups></rule>
+      <rule pattern="(\[)([^]]+)(\])(\[)([^]]*)(\])"><bygroups><token type="Text"/><token type="NameTag"/><token type="Text"/><token type="Text"/><token type="NameLabel"/><token type="Text"/></bygroups></rule>
+      <rule pattern="^(\s*\[)([^]]*)(\]:\s*)(.+)"><bygroups><token type="Text"/><token type="NameLabel"/><token type="Text"/><token type="NameAttribute"/></bygroups></rule>
+      <rule pattern="[^\\\s]+"><token type="Text"/></rule>
+      <rule pattern="."><token type="Text"/></rule>
+    </state>
+  </rules>
+</lexer>
+
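A small sketch of what the fenced-code rule in the Markdown lexer above captures. It assumes the pattern is compiled in multiline mode and reads `UsingByGroup lexer="2" content="4"` as "group 2 names the lexer, group 4 is the text handed to it". Both points are interpretations of the XML, not confirmed behaviour of the engine. `FENCE` and `split_fence` are hypothetical names.

```python
import re

# The pattern from the <rule> with UsingByGroup, compiled with MULTILINE so
# that ^ and $ anchor at line boundaries (an assumption about the engine).
FENCE = re.compile(r"^(```)(\w+)(\n)([\w\W]*?)(^```$)", re.MULTILINE)


def split_fence(document):
    """Return (language, body) for the first fenced block, or None."""
    match = FENCE.search(document)
    if match is None:
        return None
    # Group 2 is the language tag after the opening fence,
    # group 4 is everything up to the closing fence.
    return match.group(2), match.group(4)
```

A fence without a language tag is not matched by this pattern; the lexer handles that case with the separate rule that precedes it.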
--- a/lexers/moinwiki.xml
+++ b/lexers/moinwiki.xml
@ -0,0 +1,34 @@
+
+<lexer>
+  <config>
+    <name>MoinMoin/Trac Wiki markup</name>
+    <alias>trac-wiki</alias>
+    <alias>moin</alias>
+    <mime_type>text/x-trac-wiki</mime_type>
+    <case_insensitive>true</case_insensitive>
+  </config>
+  <rules>
+    <state name="root">
+      <rule pattern="^#.*$"><token type="Comment"/></rule>
+      <rule pattern="(!)(\S+)"><bygroups><token type="Keyword"/><token type="Text"/></bygroups></rule>
+      <rule pattern="^(=+)([^=]+)(=+)(\s*#.+)?$"><bygroups><token type="GenericHeading"/><usingself state="root"/><token type="GenericHeading"/><token type="LiteralString"/></bygroups></rule>
+      <rule pattern="(\{\{\{)(\n#!.+)?"><bygroups><token type="NameBuiltin"/><token type="NameNamespace"/></bygroups><push state="codeblock"/></rule>
+      <rule pattern="(\&#x27;\&#x27;\&#x27;?|\|\||`|__|~~|\^|,,|::)"><token type="Comment"/></rule>
+      <rule pattern="^( +)([.*-])( )"><bygroups><token type="Text"/><token type="NameBuiltin"/><token type="Text"/></bygroups></rule>
+      <rule pattern="^( +)([a-z]{1,5}\.)( )"><bygroups><token type="Text"/><token type="NameBuiltin"/><token type="Text"/></bygroups></rule>
+      <rule pattern="\[\[\w+.*?\]\]"><token type="Keyword"/></rule>
+      <rule pattern="(\[[^\s\]]+)(\s+[^\]]+?)?(\])"><bygroups><token type="Keyword"/><token type="LiteralString"/><token type="Keyword"/></bygroups></rule>
+      <rule pattern="^----+$"><token type="Keyword"/></rule>
+      <rule pattern="[^\n\&#x27;\[{!_~^,|]+"><token type="Text"/></rule>
+      <rule pattern="\n"><token type="Text"/></rule>
+      <rule pattern="."><token type="Text"/></rule>
+    </state>
+    <state name="codeblock">
+      <rule pattern="\}\}\}"><token type="NameBuiltin"/><pop depth="1"/></rule>
+      <rule pattern="\{\{\{"><token type="Text"/><push/></rule>
+      <rule pattern="[^{}]+"><token type="CommentPreproc"/></rule>
+      <rule pattern="."><token type="CommentPreproc"/></rule>
+    </state>
+  </rules>
+</lexer>
+
--- a/lexers/rst.xml
+++ b/lexers/rst.xml
@ -0,0 +1,76 @@
+
+<lexer>
+  <config>
+    <name>reStructuredText</name>
+    <alias>restructuredtext</alias>
+    <alias>rst</alias>
+    <alias>rest</alias>
+    <filename>*.rst</filename>
+    <filename>*.rest</filename>
+    <mime_type>text/x-rst</mime_type>
+    <mime_type>text/prs.fallenstein.rst</mime_type>
+  </config>
+  <rules>
+    <state name="root">
+      <rule pattern="^(=+|-+|`+|:+|\.+|\&#x27;+|&quot;+|~+|\^+|_+|\*+|\++|#+)([ \t]*\n)(.+)(\n)(\1)(\n)"><bygroups><token type="GenericHeading"/><token type="Text"/><token type="GenericHeading"/><token type="Text"/><token type="GenericHeading"/><token type="Text"/></bygroups></rule>
+      <rule pattern="^(\S.*)(\n)(={3,}|-{3,}|`{3,}|:{3,}|\.{3,}|\&#x27;{3,}|&quot;{3,}|~{3,}|\^{3,}|_{3,}|\*{3,}|\+{3,}|#{3,})(\n)"><bygroups><token type="GenericHeading"/><token type="Text"/><token type="GenericHeading"/><token type="Text"/></bygroups></rule>
+      <rule pattern="^(\s*)([-*+])( .+\n(?:\1  .+\n)*)"><bygroups><token type="Text"/><token type="LiteralNumber"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^(\s*)([0-9#ivxlcmIVXLCM]+\.)( .+\n(?:\1  .+\n)*)"><bygroups><token type="Text"/><token type="LiteralNumber"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^(\s*)(\(?[0-9#ivxlcmIVXLCM]+\))( .+\n(?:\1  .+\n)*)"><bygroups><token type="Text"/><token type="LiteralNumber"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^(\s*)([A-Z]+\.)( .+\n(?:\1  .+\n)+)"><bygroups><token type="Text"/><token type="LiteralNumber"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^(\s*)(\(?[A-Za-z]+\))( .+\n(?:\1  .+\n)+)"><bygroups><token type="Text"/><token type="LiteralNumber"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^(\s*)(\|)( .+\n(?:\|  .+\n)*)"><bygroups><token type="Text"/><token type="Operator"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^( *\.\.)(\s*)((?:source)?code(?:-block)?)(::)([ \t]*)([^\n]+)(\n[ \t]*\n)([ \t]+)(.*)(\n)((?:(?:\8.*)?\n)+)">
+        <bygroups>
+          <token type="Punctuation"/>
+          <token type="Text"/>
+          <token type="OperatorWord"/>
+          <token type="Punctuation"/>
+          <token type="Text"/>
+          <token type="Keyword"/>
+          <token type="Text"/>
+          <token type="Text"/>
+          <UsingByGroup lexer="6" content="9,10,11"/>
+        </bygroups>
+      </rule>
+      <rule pattern="^( *\.\.)(\s*)([\w:-]+?)(::)(?:([ \t]*)(.*))">
+        <bygroups>
+          <token type="Punctuation"/>
+          <token type="Text"/>
+          <token type="OperatorWord"/>
+          <token type="Punctuation"/>
+          <token type="Text"/>
+          <usingself state="inline"/>
+        </bygroups>
+      </rule>
+      <rule pattern="^( *\.\.)(\s*)(_(?:[^:\\]|\\.)+:)(.*?)$"><bygroups><token type="Punctuation"/><token type="Text"/><token type="NameTag"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^( *\.\.)(\s*)(\[.+\])(.*?)$"><bygroups><token type="Punctuation"/><token type="Text"/><token type="NameTag"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^( *\.\.)(\s*)(\|.+\|)(\s*)([\w:-]+?)(::)(?:([ \t]*)(.*))"><bygroups><token type="Punctuation"/><token type="Text"/><token type="NameTag"/><token type="Text"/><token type="OperatorWord"/><token type="Punctuation"/><token type="Text"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="^ *\.\..*(\n( +.*\n|\n)+)?"><token type="Comment"/></rule>
+      <rule pattern="^( *)(:(?:\\\\|\\:|[^:\n])+:(?=\s))([ \t]*)"><bygroups><token type="Text"/><token type="NameClass"/><token type="Text"/></bygroups></rule>
+      <rule pattern="^(\S.*(?&lt;!::)\n)((?:(?: +.*)\n)+)"><bygroups><usingself state="inline"/><usingself state="inline"/></bygroups></rule>
+      <rule pattern="(::)(\n[ \t]*\n)([ \t]+)(.*)(\n)((?:(?:\3.*)?\n)+)"><bygroups><token type="LiteralStringEscape"/><token type="Text"/><token type="LiteralString"/><token type="LiteralString"/><token type="Text"/><token type="LiteralString"/></bygroups></rule>
+      <rule><include state="inline"/></rule>
+    </state>
+    <state name="inline">
+      <rule pattern="\\."><token type="Text"/></rule>
+      <rule pattern="``"><token type="LiteralString"/><push state="literal"/></rule>
+      <rule pattern="(`.+?)(&lt;.+?&gt;)(`__?)"><bygroups><token type="LiteralString"/><token type="LiteralStringInterpol"/><token type="LiteralString"/></bygroups></rule>
+      <rule pattern="`.+?`__?"><token type="LiteralString"/></rule>
+      <rule pattern="(`.+?`)(:[a-zA-Z0-9:-]+?:)?"><bygroups><token type="NameVariable"/><token type="NameAttribute"/></bygroups></rule>
+      <rule pattern="(:[a-zA-Z0-9:-]+?:)(`.+?`)"><bygroups><token type="NameAttribute"/><token type="NameVariable"/></bygroups></rule>
+      <rule pattern="\*\*.+?\*\*"><token type="GenericStrong"/></rule>
+      <rule pattern="\*.+?\*"><token type="GenericEmph"/></rule>
+      <rule pattern="\[.*?\]_"><token type="LiteralString"/></rule>
+      <rule pattern="&lt;.+?&gt;"><token type="NameTag"/></rule>
+      <rule pattern="[^\\\n\[*`:]+"><token type="Text"/></rule>
+      <rule pattern="."><token type="Text"/></rule>
+    </state>
+    <state name="literal">
+      <rule pattern="[^`]+"><token type="LiteralString"/></rule>
+      <rule pattern="``((?=$)|(?=[-/:.,; \n\x00‐‑‒–— &#x27;&quot;\)\]\}&gt;’”»!\?]))"><token type="LiteralString"/><pop depth="1"/></rule>
+      <rule pattern="`"><token type="LiteralString"/></rule>
+    </state>
+  </rules>
+</lexer>
+
--- a/scripts/lexer_metadata.py
+++ b/scripts/lexer_metadata.py
@ -0,0 +1,57 @@
+# This script parses the metadata of all the lexers and generates
+# a datafile with all the information so we don't have to instantiate
+# all the lexers to get the information.
+
+import glob
+from collections import defaultdict
+
+lexer_by_name = {}
+lexer_by_mimetype = {}
+lexer_by_filename = defaultdict(set)
+
+
+for fname in glob.glob("lexers/*.xml"):
+    aliases = set([])
+    mimetypes = set([])
+    filenames = set([])
+    print(fname)
+    with open(fname) as f:
+        lexer_name = fname.split("/")[-1].split(".")[0]
+        for line in f:
+            if "</config" in line:
+                break
+            if "<filename>" in line:
+                filenames.add(line.split(">")[1].split("<")[0].lower())
+            if "<mime_type>" in line:
+                mimetypes.add(line.split(">")[1].split("<")[0].lower())
+            if "<alias>" in line:
+                aliases.add(line.split(">")[1].split("<")[0].lower())
+            if "<name>" in line:
+                aliases.add(line.split(">")[1].split("<")[0].lower())
+    for alias in aliases:
+        if alias in lexer_by_name and alias != lexer_by_name[alias]:
+            raise Exception(f"Alias {alias} already in use by {lexer_by_name[alias]}")
+        lexer_by_name[alias] = lexer_name
+    for mimetype in mimetypes:
+        lexer_by_mimetype[mimetype] = lexer_name
+    for filename in filenames:
+        lexer_by_filename[filename].add(lexer_name)
+
+with open("src/constants/lexers.cr", "w") as f:
+    f.write("module Tartrazine\n")
+    f.write("  LEXERS_BY_NAME = {\n")
+    for k in sorted(lexer_by_name.keys()):
+        v = lexer_by_name[k]
+        f.write(f'"{k}" => "{v}", \n')
+    f.write("}\n")
+    f.write("  LEXERS_BY_MIMETYPE = {\n")
+    for k in sorted(lexer_by_mimetype.keys()):
+        v = lexer_by_mimetype[k]
+        f.write(f'"{k}" => "{v}", \n')
+    f.write("}\n")
+    f.write("  LEXERS_BY_FILENAME = {\n")
+    for k in sorted(lexer_by_filename.keys()):
+        v = lexer_by_filename[k]
+        f.write('"{}" => {}, \n'.format(k, str(sorted(v)).replace("'", '"')))
+    f.write("}\n")
+    f.write("end\n")
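A note on scripts/lexer_metadata.py: it pulls values out of each lexer file by splitting lines on ">" and "<", which relies on every config tag sitting on its own line. As a rough sketch only, the same three sets could be collected with an XML parser instead. This is a swapped-in technique, not what the script does; the `read_config` helper and the `SAMPLE` document are hypothetical, and the element layout is assumed from the tags the script searches for.

```python
import xml.etree.ElementTree as ET


def read_config(xml_text):
    """Return (names, mimetypes, filenames) found in a lexer's <config>."""
    config = ET.fromstring(xml_text).find("config")
    if config is None:
        return set(), set(), set()

    def values(tag):
        # Lowercase everything, as the script does, so lookups ignore case.
        return {(el.text or "").strip().lower() for el in config.iter(tag)}

    # The script treats <name> as just another alias.
    return values("alias") | values("name"), values("mime_type"), values("filename")


# Hypothetical lexer file, shaped after the tags the script looks for.
SAMPLE = """<lexer>
  <config>
    <name>reStructuredText</name>
    <alias>rst</alias>
    <alias>rest</alias>
    <filename>*.rst</filename>
    <mime_type>text/x-rst</mime_type>
  </config>
  <rules/>
</lexer>"""

names, mimetypes, filenames = read_config(SAMPLE)
```

Parsing the whole file costs more than stopping at the closing config tag, so the line-based scan may well be the better trade-off for a build-time script.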
--- a/scripts/token_abbrevs.py
+++ b/scripts/token_abbrevs.py
@ -1,24 +1,55 @@
+# Script to generate abbreviations for tokens. Parses all lexers
+# and styles files to find all token names and generate a unique
+# abbreviation for each one. The abbreviations are generated by
+# taking the uppercase letters of the token name and converting
+# them to lowercase. If the abbreviation is not unique, the script
+# will print a warning and exit.
+
 import sys
 import string
+import glob

-# Run it as grep token lexers/* | python scripts/token_abbrevs.py
-
+tokens = {"Highlight"}
+abbrevs = {"Highlight": "hl"}

 def abbr(line):
    return "".join(c for c in line if c in string.ascii_uppercase).lower()

-abbrevs = {}
-tokens = set([])
-for line in sys.stdin:
-    if "<token" not in line:
-        continue
-    line = line.strip()
-    line = line.split('<token ',1)[-1]
-    line = line.split('"')[1]
-    abbrevs[line] = abbr(line)
-    tokens.add(line)
+def check_abbrevs():
+    if len(abbrevs) != len(tokens):
+        print("Warning: Abbreviations are not unique")
+        print(len(abbrevs), len(tokens))
+        sys.exit(1)

-print("Abbreviations: {")
-for k, v in abbrevs.items():
-    print(f'    "{k}" => "{v}",')
-print("}")
+# Processes all files in lexers looking for token names
+for fname in glob.glob("lexers/*.xml"):
+    with open(fname) as f:
+        for line in f:
+            if "<token" not in line:
+                continue
+            line = line.strip()
+            line = line.split('<token ',1)[-1]
+            line = line.split('"')[1]
+            abbrevs[line] = abbr(line)
+            tokens.add(line)
+            check_abbrevs()
+
+# Processes all files in styles looking for token names too
+for fname in glob.glob("styles/*.xml"):
+    with open(fname) as f:
+        for line in f:
+            if "<entry" not in line:
+                continue
+            line = line.strip()
+            line = line.split('type=',1)[-1]
+            line = line.split('"')[1]
+            abbrevs[line] = abbr(line)
+            tokens.add(line)
+            check_abbrevs()
+
+with open("src/constants/token_abbrevs.cr", "w") as outf:
+    outf.write("module Tartrazine\n")
+    outf.write("  Abbreviations = {\n")
+    for k in sorted(abbrevs.keys()):
+        outf.write(f'    "{k}" => "{abbrevs[k]}",\n')
+    outf.write("  }\nend\n")
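A note on scripts/token_abbrevs.py: `check_abbrevs` compares `len(abbrevs)` with `len(tokens)`, but both are keyed by token name, so the two lengths always match and the script never exits on a clash. The generated table does contain repeated abbreviations, for example "cs" for both CommentSingle and CommentSpecial. A minimal sketch of a check that groups tokens by abbreviation is shown here; the `collisions` helper is hypothetical and not part of the commit.

```python
import string
from collections import defaultdict


def abbr(name):
    # Same rule as the script: keep the uppercase letters, lowercased.
    return "".join(c for c in name if c in string.ascii_uppercase).lower()


def collisions(token_names):
    """Map each abbreviation shared by more than one token to those tokens."""
    by_abbrev = defaultdict(list)
    for name in sorted(set(token_names)):
        by_abbrev[abbr(name)].append(name)
    return {a: names for a, names in by_abbrev.items() if len(names) > 1}


clashes = collisions(["Comment", "CommentSingle", "CommentSpecial", "Keyword"])
```

Whether a clash should stop the script is a design choice: tokens that share a CSS class are styled identically, which may be acceptable for closely related token types.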
--- a/shard.yml
+++ b/shard.yml
@ -1,5 +1,5 @@
 name: tartrazine
-version: 0.1.0
+version: 0.6.1

 authors:
  - Roberto Alsina <roberto.alsina@gmail.com>
@ -15,6 +15,8 @@ dependencies:
    github: crystal-china/base58.cr
  sixteen:
    github: ralsina/sixteen
+  docopt:
+    github: chenkovsky/docopt.cr

 crystal: ">= 1.13.0"

--- a/spec/tartrazine_spec.cr
+++ b/spec/tartrazine_spec.cr
@ -14,15 +14,18 @@ unicode_problems = {
  "#{__DIR__}/tests/java/test_string_literals.txt",
  "#{__DIR__}/tests/json/test_strings.txt",
  "#{__DIR__}/tests/systemd/example1.txt",
+  "#{__DIR__}/tests/c++/test_unicode_identifiers.txt",
 }

 # These testcases fail because of differences in the way chroma and tartrazine tokenize
 # but tartrazine is correct
 bad_in_chroma = {
  "#{__DIR__}/tests/bash_session/test_comment_after_prompt.txt",
+  "#{__DIR__}/tests/html/javascript_backtracking.txt",
  "#{__DIR__}/tests/java/test_default.txt",
  "#{__DIR__}/tests/java/test_multiline_string.txt",
  "#{__DIR__}/tests/java/test_numeric_literals.txt",
+  "#{__DIR__}/tests/octave/test_multilinecomment.txt",
  "#{__DIR__}/tests/php/test_string_escaping_run.txt",
  "#{__DIR__}/tests/python_2/test_cls_builtin.txt",
 }
@ -30,19 +33,14 @@ bad_in_chroma = {
 known_bad = {
  "#{__DIR__}/tests/bash_session/fake_ps2_prompt.txt",
  "#{__DIR__}/tests/bash_session/prompt_in_output.txt",
-  "#{__DIR__}/tests/bash_session/test_newline_in_echo_no_ps2.txt",
-  "#{__DIR__}/tests/bash_session/test_newline_in_ls_ps2.txt",
  "#{__DIR__}/tests/bash_session/ps2_prompt.txt",
-  "#{__DIR__}/tests/bash_session/test_newline_in_ls_no_ps2.txt",
-  "#{__DIR__}/tests/bash_session/test_virtualenv.txt",
+  "#{__DIR__}/tests/bash_session/test_newline_in_echo_no_ps2.txt",
  "#{__DIR__}/tests/bash_session/test_newline_in_echo_ps2.txt",
-  "#{__DIR__}/tests/c/test_string_resembling_decl_end.txt",
-  "#{__DIR__}/tests/html/css_backtracking.txt",
+  "#{__DIR__}/tests/bash_session/test_newline_in_ls_no_ps2.txt",
+  "#{__DIR__}/tests/bash_session/test_newline_in_ls_ps2.txt",
+  "#{__DIR__}/tests/bash_session/test_virtualenv.txt",
  "#{__DIR__}/tests/mcfunction/data.txt",
  "#{__DIR__}/tests/mcfunction/selectors.txt",
-  "#{__DIR__}/tests/php/anonymous_class.txt",
-  "#{__DIR__}/tests/html/javascript_unclosed.txt",
-
 }

 # Tests that fail because of a limitation in PCRE2
@ -74,8 +72,8 @@ end

 # Helper that creates lexer and tokenizes
 def tokenize(lexer_name, text)
-  lexer = Tartrazine.lexer(lexer_name)
-  lexer.tokenize(text)
+  tokenizer = Tartrazine.lexer(lexer_name).tokenizer(text)
+  Tartrazine::Lexer.collapse_tokens(tokenizer.to_a)
 end

 # Helper that tokenizes using chroma to validate the lexer
--- a/src/actions.cr
+++ b/src/actions.cr
@ -1,5 +1,4 @@
 require "./actions"
-require "./constants"
 require "./formatter"
 require "./rules"
 require "./styles"
@ -9,12 +8,33 @@ require "./tartrazine"
 # perform a list of actions. These actions can emit tokens
 # or change the state machine.
 module Tartrazine
-  class Action
-    property type : String
-    property xml : XML::Node
+  enum ActionType
+    Bygroups
+    Combined
+    Include
+    Pop
+    Push
+    Token
+    Using
+    Usingbygroup
+    Usingself
+  end
+
+  struct Action
    property actions : Array(Action) = [] of Action

-    def initialize(@type : String, @xml : XML::Node?)
+    @content_index : Array(Int32) = [] of Int32
+    @depth : Int32 = 0
+    @lexer_index : Int32 = 0
+    @lexer_name : String = ""
+    @states : Array(String) = [] of String
+    @states_to_push : Array(String) = [] of String
+    @token_type : String = ""
+    @type : ActionType = ActionType::Token
+
+    def initialize(t : String, xml : XML::Node?)
+      @type = ActionType.parse(t.capitalize)
+
      # Some actions may have actions in them, like this:
      # <bygroups>
      # <token type="GenericPrompt"/>
@ -24,48 +44,56 @@ module Tartrazine
      #
      # The token actions match with the first 2 groups in the regex
      # the using action matches the 3rd and shunts it to another lexer
-      @xml.children.each do |node|
+      xml.children.each do |node|
        next unless node.element?
        @actions << Action.new(node.name, node)
      end
+
+      # Prefetch the attributes we need from the XML and keep them
+      case @type
+      when ActionType::Token
+        @token_type = xml["type"]
+      when ActionType::Push
+        @states_to_push = xml.attributes.select { |attrib|
+          attrib.name == "state"
+        }.map &.content
+      when ActionType::Pop
+        @depth = xml["depth"].to_i
+      when ActionType::Using
+        @lexer_name = xml["lexer"].downcase
+      when ActionType::Combined
+        @states = xml.attributes.select { |attrib|
+          attrib.name == "state"
+        }.map &.content
+      when ActionType::Usingbygroup
+        @lexer_index = xml["lexer"].to_i
+        @content_index = xml["content"].split(",").map(&.to_i)
+      end
    end

    # ameba:disable Metrics/CyclomaticComplexity
-    def emit(match : Regex::MatchData?, lexer : Lexer, match_group = 0) : Array(Token)
-      case type
-      when "token"
-        raise Exception.new "Can't have a token without a match" if match.nil?
-        [Token.new(type: xml["type"], value: match[match_group])]
-      when "push"
-        states_to_push = xml.attributes.select { |attrib|
-          attrib.name == "state"
-        }.map &.content
-        if states_to_push.empty?
-          # Push without a state means push the current state
-          states_to_push = [lexer.state_stack.last]
-        end
-        states_to_push.each do |state|
-          if state == "#pop"
+    def emit(match : MatchData, tokenizer : Tokenizer, match_group = 0) : Array(Token)
+      case @type
+      when ActionType::Token
+        raise Exception.new "Can't have a token without a match" if match.empty?
+        [Token.new(type: @token_type, value: String.new(match[match_group].value))]
+      when ActionType::Push
+        to_push = @states_to_push.empty? ? [tokenizer.state_stack.last] : @states_to_push
+        to_push.each do |state|
+          if state == "#pop" && tokenizer.state_stack.size > 1
            # Pop the state
-            Log.trace { "Popping state" }
-            lexer.state_stack.pop
+            tokenizer.state_stack.pop
          else
            # Really push
-            lexer.state_stack << state
-            Log.trace { "Pushed #{lexer.state_stack}" }
+            tokenizer.state_stack << state
          end
        end
        [] of Token
-      when "pop"
-        depth = xml["depth"].to_i
-        Log.trace { "Popping #{depth} states" }
-        if lexer.state_stack.size <= depth
-          Log.trace { "Can't pop #{depth} states, only have #{lexer.state_stack.size}" }
-        else
-          lexer.state_stack.pop(depth)
-        end
+      when ActionType::Pop
+        to_pop = [@depth, tokenizer.state_stack.size - 1].min
+        tokenizer.state_stack.pop(to_pop)
        [] of Token
-      when "bygroups"
+      when ActionType::Bygroups
        # FIXME: handle
        # ><bygroups>
        # <token type="Punctuation"/>
@ -80,38 +108,50 @@ module Tartrazine
        # the action is skipped.
        result = [] of Token
        @actions.each_with_index do |e, i|
-          next if match[i + 1]?.nil?
-          result += e.emit(match, lexer, i + 1)
+          begin
+            next if match[i + 1].size == 0
+          rescue IndexError
+            # FIXME: This should not actually happen
+            # No match for this group
+            next
+          end
+          result += e.emit(match, tokenizer, i + 1)
        end
        result
-      when "using"
+      when ActionType::Using
        # Shunt to another lexer entirely
-        return [] of Token if match.nil?
-        lexer_name = xml["lexer"].downcase
-        Log.trace { "to tokenize: #{match[match_group]}" }
-        Tartrazine.lexer(lexer_name).tokenize(match[match_group], usingself: true)
-      when "usingself"
+        return [] of Token if match.empty?
+        Tartrazine.lexer(@lexer_name).tokenizer(
+          String.new(match[match_group].value),
+          secondary: true).to_a
+      when ActionType::Usingself
        # Shunt to another copy of this lexer
-        return [] of Token if match.nil?
-
-        new_lexer = Lexer.from_xml(lexer.xml)
-        Log.trace { "to tokenize: #{match[match_group]}" }
-        new_lexer.tokenize(match[match_group], usingself: true)
-      when "combined"
-        # Combine two states into one anonymous state
-        states = xml.attributes.select { |attrib|
-          attrib.name == "state"
-        }.map &.content
-        new_state = states.map { |name|
-          lexer.states[name]
+        return [] of Token if match.empty?
+        tokenizer.lexer.tokenizer(
+          String.new(match[match_group].value),
+          secondary: true).to_a
+      when ActionType::Combined
+        # Combine two or more states into one anonymous state
+        new_state = @states.map { |name|
+          tokenizer.lexer.states[name]
        }.reduce { |state1, state2|
          state1 + state2
        }
-        lexer.states[new_state.name] = new_state
-        lexer.state_stack << new_state.name
+        tokenizer.lexer.states[new_state.name] = new_state
+        tokenizer.state_stack << new_state.name
        [] of Token
+      when ActionType::Usingbygroup
+        # Shunt to content-specified lexer
+        return [] of Token if match.empty?
+        content = ""
+        @content_index.each do |i|
+          content += String.new(match[i].value)
+        end
+        Tartrazine.lexer(String.new(match[@lexer_index].value)).tokenizer(
+          content,
+          secondary: true).to_a
      else
-        raise Exception.new("Unknown action type: #{type}: #{xml}")
+        raise Exception.new("Unknown action type: #{@type}")
      end
    end
  end
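A note on the Pop change in src/actions.cr: the old code skipped the pop entirely when the stack held no more than `depth` states, while the new code pops `min(depth, size - 1)`, so the bottom state always survives. The clamping can be sketched in Python as follows; `pop_states` is a hypothetical stand-in that models the state stack as a plain list.

```python
def pop_states(state_stack, depth):
    """Pop up to `depth` states but always keep the bottom one."""
    to_pop = min(depth, len(state_stack) - 1)
    # Guard against to_pop == 0: a slice like [-0:] would cover the whole list.
    if to_pop > 0:
        del state_stack[-to_pop:]
    return state_stack
```

For example, `pop_states(["root", "a"], 5)` leaves `["root"]`, where the previous behaviour would have left both states in place.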
--- a/src/bytes_regex.cr
+++ b/src/bytes_regex.cr
@ -0,0 +1,73 @@
+module BytesRegex
+  extend self
+
+  class Regex
+    def initialize(pattern : String, multiline = false, dotall = false, ignorecase = false, anchored = false)
+      flags = LibPCRE2::UTF | LibPCRE2::UCP | LibPCRE2::NO_UTF_CHECK
+      flags |= LibPCRE2::MULTILINE if multiline
+      flags |= LibPCRE2::DOTALL if dotall
+      flags |= LibPCRE2::CASELESS if ignorecase
+      flags |= LibPCRE2::ANCHORED if anchored
+      if @re = LibPCRE2.compile(
+           pattern,
+           pattern.bytesize,
+           flags,
+           out errorcode,
+           out erroroffset,
+           nil)
+      else
+        msg = String.new(256) do |buffer|
+          bytesize = LibPCRE2.get_error_message(errorcode, buffer, 256)
+          {bytesize, 0}
+        end
+        raise Exception.new "Error #{msg} compiling regex at offset #{erroroffset}"
+      end
+      @match_data = LibPCRE2.match_data_create_from_pattern(@re, nil)
+    end
+
+    def finalize
+      LibPCRE2.match_data_free(@match_data)
+      LibPCRE2.code_free(@re)
+    end
+
+    def match(str : Bytes, pos = 0) : Array(Match)
+      rc = LibPCRE2.match(
+        @re,
+        str,
+        str.size,
+        pos,
+        LibPCRE2::NO_UTF_CHECK,
+        @match_data,
+        nil)
+      if rc > 0
+        ovector = LibPCRE2.get_ovector_pointer(@match_data)
+        (0...rc).map do |i|
+          m_start = ovector[2 * i]
+          m_end = ovector[2 * i + 1]
+          if m_start == m_end
+            m_value = Bytes.new(0)
+          else
+            m_value = str[m_start...m_end]
+          end
+          Match.new(m_value, m_start, m_end - m_start)
+        end
+      else
+        [] of Match
+      end
+    end
+  end
+
+  struct Match
+    property value : Bytes
+    property start : UInt64
+    property size : UInt64
+
+    def initialize(@value : Bytes, @start : UInt64, @size : UInt64)
+    end
+  end
+end
+
+# pattern = "foo"
+# str = "foo bar"
+# re = BytesRegex::Regex.new(pattern)
+# p! String.new(re.match(str.to_slice)[0].value)
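A note on `BytesRegex::Regex#match`: PCRE2 reports a match as a flat vector of offset pairs, one (start, end) pair per group, with group 0 covering the whole match. The loop over `0...rc` turns those pairs into byte slices. A small Python sketch of the same slicing follows; `slices_from_ovector` and the hand-written offsets are illustrative assumptions, not output from PCRE2.

```python
def slices_from_ovector(data, ovector, count):
    """Return the byte slice for each of the first `count` groups."""
    groups = []
    for i in range(count):
        start, end = ovector[2 * i], ovector[2 * i + 1]
        groups.append(data[start:end])
    return groups


# Offsets as they would look for the pattern (fo)(o) matched against "foo bar".
groups = slices_from_ovector(b"foo bar", [0, 3, 0, 2, 2, 3], 3)
```

Working on byte offsets directly is what lets the Crystal code skip converting between byte and character positions on every match.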
--- a/src/constants/lexers.cr
+++ b/src/constants/lexers.cr
--- a/src/constants/token_abbrevs.cr
+++ b/src/constants/token_abbrevs.cr
@ -1,92 +1,100 @@
 module Tartrazine
  Abbreviations = {
    "Background"               => "b",
-    "Text"                     => "t",
+    "CodeLine"                 => "cl",
+    "Comment"                  => "c",
+    "CommentHashbang"          => "ch",
+    "CommentMultiline"         => "cm",
+    "CommentPreproc"           => "cp",
+    "CommentPreprocFile"       => "cpf",
    "CommentSingle"            => "cs",
    "CommentSpecial"           => "cs",
-    "NameVariable"             => "nv",
-    "Keyword"                  => "k",
-    "NameFunction"             => "nf",
-    "Punctuation"              => "p",
-    "Operator"                 => "o",
-    "LiteralNumberInteger"     => "lni",
-    "NameBuiltin"              => "nb",
-    "Name"                     => "n",
-    "OperatorWord"             => "ow",
-    "LiteralStringSingle"      => "lss",
-    "Literal"                  => "l",
-    "NameClass"                => "nc",
-    "CommentMultiline"         => "cm",
-    "LiteralStringRegex"       => "lsr",
-    "KeywordDeclaration"       => "kd",
-    "KeywordConstant"          => "kc",
-    "NameOther"                => "no",
-    "LiteralNumberFloat"       => "lnf",
-    "LiteralNumberHex"         => "lnh",
-    "LiteralStringDouble"      => "lsd",
-    "KeywordType"              => "kt",
-    "NameNamespace"            => "nn",
-    "NameAttribute"            => "na",
-    "KeywordReserved"          => "kr",
-    "CommentPreproc"           => "cp",
-    "KeywordNamespace"         => "kn",
-    "NameConstant"             => "nc",
-    "NameLabel"                => "nl",
-    "LiteralString"            => "ls",
-    "LiteralStringChar"        => "lsc",
-    "TextWhitespace"           => "tw",
-    "LiteralStringEscape"      => "lse",
-    "LiteralNumber"            => "ln",
-    "Other"                    => "o",
-    "LiteralStringBoolean"     => "lsb",
-    "NameProperty"             => "np",
-    "Comment"                  => "c",
-    "NameTag"                  => "nt",
-    "LiteralStringOther"       => "lso",
-    "NameVariableGlobal"       => "nvg",
-    "NameBuiltinPseudo"        => "nbp",
-    "LiteralNumberBin"         => "lnb",
-    "KeywordPseudo"            => "kp",
-    "CommentPreprocFile"       => "cpf",
-    "LiteralStringAffix"       => "lsa",
-    "LiteralStringDelimiter"   => "lsd",
-    "LiteralNumberOct"         => "lno",
    "Error"                    => "e",
    "Generic"                  => "g",
-    "LiteralNumberIntegerLong" => "lnil",
-    "NameDecorator"            => "nd",
-    "LiteralStringInterpol"    => "lsi",
-    "LiteralStringBacktick"    => "lsb",
-    "GenericPrompt"            => "gp",
-    "GenericOutput"            => "go",
-    "LiteralStringName"        => "lsn",
-    "LiteralStringHeredoc"     => "lsh",
-    "LiteralStringSymbol"      => "lss",
-    "NameVariableInstance"     => "nvi",
-    "LiteralOther"             => "lo",
-    "NameVariableClass"        => "nvc",
-    "NameOperator"             => "no",
-    "None"                     => "n",
-    "LiteralStringDoc"         => "lsd",
-    "NameException"            => "ne",
-    "GenericSubheading"        => "gs",
-    "GenericStrong"            => "gs",
    "GenericDeleted"           => "gd",
-    "GenericInserted"          => "gi",
-    "GenericHeading"           => "gh",
-    "NameEntity"               => "ne",
-    "NamePseudo"               => "np",
-    "CommentHashbang"          => "ch",
-    "TextPunctuation"          => "tp",
-    "NameVariableAnonymous"    => "nva",
-    "NameVariableMagic"        => "nvm",
-    "NameFunctionMagic"        => "nfm",
    "GenericEmph"              => "ge",
-    "GenericUnderline"         => "gu",
-    "LiteralStringAtom"        => "lsa",
-    "LiteralDate"              => "ld",
    "GenericError"             => "ge",
-    "TextSymbol"               => "ts",
+    "GenericHeading"           => "gh",
+    "GenericInserted"          => "gi",
+    "GenericOutput"            => "go",
+    "GenericPrompt"            => "gp",
+    "GenericStrong"            => "gs",
+    "GenericSubheading"        => "gs",
+    "GenericTraceback"         => "gt",
+    "GenericUnderline"         => "gu",
+    "Highlight"                => "hl",
+    "Keyword"                  => "k",
+    "KeywordConstant"          => "kc",
+    "KeywordDeclaration"       => "kd",
+    "KeywordNamespace"         => "kn",
+    "KeywordPseudo"            => "kp",
+    "KeywordReserved"          => "kr",
+    "KeywordType"              => "kt",
+    "LineHighlight"            => "lh",
+    "LineNumbers"              => "ln",
+    "LineNumbersTable"         => "lnt",
+    "LineTable"                => "lt",
+    "LineTableTD"              => "lttd",
+    "Literal"                  => "l",
+    "LiteralDate"              => "ld",
+    "LiteralNumber"            => "ln",
+    "LiteralNumberBin"         => "lnb",
+    "LiteralNumberFloat"       => "lnf",
+    "LiteralNumberHex"         => "lnh",
+    "LiteralNumberInteger"     => "lni",
+    "LiteralNumberIntegerLong" => "lnil",
+    "LiteralNumberOct"         => "lno",
+    "LiteralOther"             => "lo",
+    "LiteralString"            => "ls",
+    "LiteralStringAffix"       => "lsa",
+    "LiteralStringAtom"        => "lsa",
+    "LiteralStringBacktick"    => "lsb",
+    "LiteralStringBoolean"     => "lsb",
+    "LiteralStringChar"        => "lsc",
+    "LiteralStringDelimiter"   => "lsd",
+    "LiteralStringDoc"         => "lsd",
+    "LiteralStringDouble"      => "lsd",
+    "LiteralStringEscape"      => "lse",
+    "LiteralStringHeredoc"     => "lsh",
+    "LiteralStringInterpol"    => "lsi",
+    "LiteralStringName"        => "lsn",
+    "LiteralStringOther"       => "lso",
+    "LiteralStringRegex"       => "lsr",
+    "LiteralStringSingle"      => "lss",
+    "LiteralStringSymbol"      => "lss",
+    "Name"                     => "n",
+    "NameAttribute"            => "na",
+    "NameBuiltin"              => "nb",
+    "NameBuiltinPseudo"        => "nbp",
+    "NameClass"                => "nc",
+    "NameConstant"             => "nc",
+    "NameDecorator"            => "nd",
+    "NameEntity"               => "ne",
+    "NameException"            => "ne",
+    "NameFunction"             => "nf",
+    "NameFunctionMagic"        => "nfm",
    "NameKeyword"              => "nk",
+    "NameLabel"                => "nl",
+    "NameNamespace"            => "nn",
+    "NameOperator"             => "no",
+    "NameOther"                => "no",
+    "NameProperty"             => "np",
+    "NamePseudo"               => "np",
+    "NameTag"                  => "nt",
+    "NameVariable"             => "nv",
+    "NameVariableAnonymous"    => "nva",
+    "NameVariableClass"        => "nvc",
+    "NameVariableGlobal"       => "nvg",
+    "NameVariableInstance"     => "nvi",
+    "NameVariableMagic"        => "nvm",
+    "None"                     => "n",
+    "Operator"                 => "o",
+    "OperatorWord"             => "ow",
+    "Other"                    => "o",
+    "Punctuation"              => "p",
+    "Text"                     => "t",
+    "TextPunctuation"          => "tp",
+    "TextSymbol"               => "ts",
+    "TextWhitespace"           => "tw",
  }
 end
--- a/src/formatter.cr
+++ b/src/formatter.cr
@ -1,5 +1,4 @@
 require "./actions"
-require "./constants"
 require "./formatter"
 require "./rules"
 require "./styles"
@ -10,101 +9,20 @@ module Tartrazine
  # This is the base class for all formatters.
  abstract class Formatter
    property name : String = ""
+    property theme : Theme = Tartrazine.theme("default-dark")

-    def format(text : String, lexer : Lexer, theme : Theme) : String
+    # Format the text using the given lexer.
+    def format(text : String, lexer : Lexer, io : IO = nil) : Nil
      raise Exception.new("Not implemented")
    end

-    def get_style_defs(theme : Theme) : String
+    def format(text : String, lexer : Lexer) : String
      raise Exception.new("Not implemented")
    end
-  end

-  class Ansi < Formatter
-    def format(text : String, lexer : Lexer, theme : Theme) : String
-      output = String.build do |outp|
-        lexer.tokenize(text).each do |token|
-          outp << self.colorize(token[:value], token[:type], theme)
-        end
-      end
-      output
-    end
-
-    def colorize(text : String, token : String, theme : Theme) : String
-      style = theme.styles.fetch(token, nil)
-      return text if style.nil?
-      if theme.styles.has_key?(token)
-        s = theme.styles[token]
-      else
-        # Themes don't contain information for each specific
-        # token type. However, they may contain information
-        # for a parent style. Worst case, we go to the root
-        # (Background) style.
-        s = theme.styles[theme.style_parents(token).reverse.find { |parent|
-          theme.styles.has_key?(parent)
-        }]
-      end
-      colorized = text.colorize(s.color.try &.colorize)
-      # Intentionally not setting background color
-      colorized.mode(:bold) if s.bold
-      colorized.mode(:italic) if s.italic
-      colorized.mode(:underline) if s.underline
-      colorized.to_s
-    end
-  end
-
-  class Html < Formatter
-    def format(text : String, lexer : Lexer, theme : Theme) : String
-      output = String.build do |outp|
-        outp << "<html><head><style>"
-        outp << get_style_defs(theme)
-        outp << "</style></head><body>"
-        outp << "<pre class=\"#{get_css_class("Background", theme)}\"><code class=\"#{get_css_class("Background", theme)}\">"
-        lexer.tokenize(text).each do |token|
-          fragment = "<span class=\"#{get_css_class(token[:type], theme)}\">#{token[:value]}</span>"
-          outp << fragment
-        end
-        outp << "</code></pre></body></html>"
-      end
-      output
-    end
-
-    # ameba:disable Metrics/CyclomaticComplexity
-    def get_style_defs(theme : Theme) : String
-      output = String.build do |outp|
-        theme.styles.each do |token, style|
-          outp << ".#{get_css_class(token, theme)} {"
-          # These are set or nil
-          outp << "color: #{style.color.try &.hex};" if style.color
-          outp << "background-color: #{style.background.try &.hex};" if style.background
-          outp << "border: 1px solid #{style.border.try &.hex};" if style.border
-
-          # These are true/false/nil
-          outp << "border: none;" if style.border == false
-          outp << "font-weight: bold;" if style.bold
-          outp << "font-weight: 400;" if style.bold == false
-          outp << "font-style: italic;" if style.italic
-          outp << "font-style: normal;" if style.italic == false
-          outp << "text-decoration: underline;" if style.underline
-          outp << "text-decoration: none;" if style.underline == false
-
-          outp << "}"
-        end
-      end
-      output
-    end
-
-    # Given a token type, return the CSS class to use.
-    def get_css_class(token, theme)
-      return Abbreviations[token] if theme.styles.has_key?(token)
-
-      # Themes don't contain information for each specific
-      # token type. However, they may contain information
-      # for a parent style. Worst case, we go to the root
-      # (Background) style.
-      Abbreviations[theme.style_parents(token).reverse.find { |parent|
-        theme.styles.has_key?(parent)
-      }]
+    # Return the styles, if the formatter supports it.
+    def style_defs : String
+      raise Exception.new("Not implemented")
    end
  end
 end
--- a/src/formatters/ansi.cr
+++ b/src/formatters/ansi.cr
@ -0,0 +1,56 @@
+require "../formatter"
+
+module Tartrazine
+  class Ansi < Formatter
+    property? line_numbers : Bool = false
+
+    def initialize(@theme : Theme = Tartrazine.theme("default-dark"), @line_numbers : Bool = false)
+    end
+
+    private def line_label(i : Int32) : String
+      "#{i + 1}".rjust(4).ljust(5)
+    end
+
+    def format(text : String, lexer : Lexer) : String
+      outp = String::Builder.new("")
+      format(text, lexer, outp)
+      outp.to_s
+    end
+
+    def format(text : String, lexer : BaseLexer, outp : IO) : Nil
+      tokenizer = lexer.tokenizer(text)
+      i = 0
+      outp << line_label(i) if line_numbers?
+      tokenizer.each do |token|
+        outp << colorize(token[:value], token[:type])
+        if token[:value].includes?("\n")
+          i += 1
+          outp << line_label(i) if line_numbers?
+        end
+      end
+    end
+
+    def colorize(text : String, token : String) : String
+      style = theme.styles.fetch(token, nil)
+      return text if style.nil?
+      if theme.styles.has_key?(token)
+        s = theme.styles[token]
+      else
+        # Themes don't contain information for each specific
+        # token type. However, they may contain information
+        # for a parent style. Worst case, we go to the root
+        # (Background) style.
+        s = theme.styles[theme.style_parents(token).reverse.find { |parent|
+          theme.styles.has_key?(parent)
+        }]
+      end
+      colorized = text.colorize
+      s.color.try { |col| colorized = colorized.fore(col.colorize) }
+      # Intentionally not setting background color
+      colorized.mode(:bold) if s.bold
+      colorized.mode(:italic) if s.italic
+      colorized.mode(:underline) if s.underline
+      colorized.to_s
+    end
+  end
+end
--- a/src/formatters/html.cr
+++ b/src/formatters/html.cr
@ -0,0 +1,132 @@
+require "../constants/token_abbrevs.cr"
+require "../formatter"
+require "html"
+
+module Tartrazine
+  class Html < Formatter
+    # property line_number_in_table : Bool = false
+    # property with_classes : Bool = true
+    property class_prefix : String = ""
+    property highlight_lines : Array(Range(Int32, Int32)) = [] of Range(Int32, Int32)
+    property line_number_id_prefix : String = "line-"
+    property line_number_start : Int32 = 1
+    property tab_width = 8
+    property? line_numbers : Bool = false
+    property? linkable_line_numbers : Bool = true
+    property? standalone : Bool = false
+    property? surrounding_pre : Bool = true
+    property? wrap_long_lines : Bool = false
+    property weight_of_bold : Int32 = 600
+
+    property theme : Theme
+
+    def initialize(@theme : Theme = Tartrazine.theme("default-dark"), *,
+                   @highlight_lines = [] of Range(Int32, Int32),
+                   @class_prefix : String = "",
+                   @line_number_id_prefix = "line-",
+                   @line_number_start = 1,
+                   @tab_width = 8,
+                   @line_numbers : Bool = false,
+                   @linkable_line_numbers : Bool = true,
+                   @standalone : Bool = false,
+                   @surrounding_pre : Bool = true,
+                   @wrap_long_lines : Bool = false,
+                   @weight_of_bold : Int32 = 600)
+    end
+
+    def format(text : String, lexer : BaseLexer) : String
+      outp = String::Builder.new("")
+      format(text, lexer, outp)
+      outp.to_s
+    end
+
+    def format(text : String, lexer : BaseLexer, io : IO) : Nil
+      pre, post = wrap_standalone
+      io << pre if standalone?
+      format_text(text, lexer, io)
+      io << post if standalone?
+    end
+
+    # Wrap text into a full HTML document, including the CSS for the theme
+    def wrap_standalone
+      output = String.build do |outp|
+        outp << "<!DOCTYPE html><html><head><style>"
+        outp << style_defs
+        outp << "</style></head><body>"
+      end
+      {output.to_s, "</body></html>"}
+    end
+
+    private def line_label(i : Int32) : String
+      line_label = "#{i + 1}".rjust(4).ljust(5)
+      line_class = highlighted?(i + 1) ? "class=\"#{get_css_class("LineHighlight")}\"" : ""
+      line_id = linkable_line_numbers? ? "id=\"#{line_number_id_prefix}#{i + 1}\"" : ""
+      "<span #{line_id} #{line_class} style=\"user-select: none;\">#{line_label} </span>"
+    end
+
+    def format_text(text : String, lexer : BaseLexer, outp : IO)
+      tokenizer = lexer.tokenizer(text)
+      i = 0
+      if surrounding_pre?
+        pre_style = wrap_long_lines? ? "style=\"white-space: pre-wrap; word-break: break-word;\"" : ""
+        outp << "<pre class=\"#{get_css_class("Background")}\" #{pre_style}>"
+      end
+      outp << "<code class=\"#{get_css_class("Background")}\">"
+      outp << line_label(i) if line_numbers?
+      tokenizer.each do |token|
+        outp << "<span class=\"#{get_css_class(token[:type])}\">#{HTML.escape(token[:value])}</span>"
+        if token[:value].ends_with? "\n"
+          i += 1
+          outp << line_label(i) if line_numbers?
+        end
+      end
+      outp << (surrounding_pre? ? "</code></pre>" : "</code>")
+    end
+
+    # ameba:disable Metrics/CyclomaticComplexity
+    def style_defs : String
+      output = String.build do |outp|
+        theme.styles.each do |token, style|
+          outp << ".#{get_css_class(token)} {"
+          # These are set or nil
+          outp << "color: ##{style.color.try &.hex};" if style.color
+          outp << "background-color: ##{style.background.try &.hex};" if style.background
+          outp << "border: 1px solid ##{style.border.try &.hex};" if style.border
+
+          # These are true/false/nil
+          outp << "border: none;" if style.border == false
+          outp << "font-weight: #{@weight_of_bold};" if style.bold
+          outp << "font-weight: normal;" if style.bold == false
+          outp << "font-style: italic;" if style.italic
+          outp << "font-style: normal;" if style.italic == false
+          outp << "text-decoration: underline;" if style.underline
+          outp << "text-decoration: none;" if style.underline == false
+          outp << "tab-size: #{tab_width};" if token == "Background"
+
+          outp << "}"
+        end
+      end
+      output
+    end
+
+    # Given a token type, return the CSS class to use.
+    def get_css_class(token : String) : String
+      if !theme.styles.has_key? token
+        # Themes don't contain information for each specific
+        # token type. However, they may contain information
+        # for a parent style. Worst case, we go to the root
+        # (Background) style.
+        parent = theme.style_parents(token).reverse.find { |dad|
+          theme.styles.has_key?(dad)
+        }
+        theme.styles[token] = theme.styles[parent]
+      end
+      class_prefix + Abbreviations[token]
+    end
+
+    # Is this line in the highlighted ranges?
+    def highlighted?(line : Int) : Bool
+      highlight_lines.any?(&.includes?(line))
+    end
+  end
+end
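
Review note on `highlight_lines`: a minimal sketch of how the range list drives `highlighted?`, assuming line numbers are 1-based as in `line_label`. The free function below is a hypothetical stand-in for the method, written outside the class so it can be run on its own.

```crystal
# Hypothetical stand-alone version of Html#highlighted?, for illustration only.
def highlighted?(ranges : Array(Range(Int32, Int32)), line : Int32) : Bool
  # A line is highlighted when any configured range contains it.
  ranges.any?(&.includes?(line))
end

# Highlight lines 2 through 3, plus line 7 on its own.
ranges = [2..3, 7..7]

puts highlighted?(ranges, 1) # false
puts highlighted?(ranges, 3) # true
puts highlighted?(ranges, 7) # true
```

A single line is expressed as a one-element range such as `7..7`, which is why the property is typed as an array of ranges rather than an array of integers.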
--- a/src/formatters/json.cr
+++ b/src/formatters/json.cr
@ -0,0 +1,18 @@
+require "../formatter"
+
+module Tartrazine
+  class Json < Formatter
+    property name = "json"
+
+    def format(text : String, lexer : BaseLexer) : String
+      outp = String::Builder.new("")
+      format(text, lexer, outp)
+      outp.to_s
+    end
+
+    def format(text : String, lexer : BaseLexer, io : IO) : Nil
+      tokenizer = lexer.tokenizer(text)
+      io << Tartrazine::Lexer.collapse_tokens(tokenizer.to_a).to_json
+    end
+  end
+end
--- a/src/heuristics.cr
+++ b/src/heuristics.cr
@ -0,0 +1,81 @@
+require "yaml"
+
+# Use linguist's heuristics to disambiguate between languages
+# This is *shamelessly* stolen from https://github.com/github-linguist/linguist
+# and ported to Crystal. Deepest thanks to the authors of Linguist
+# for licensing it liberally.
+#
+# Consider this code (c) 2017 GitHub, Inc. even if I wrote it.
+module Linguist
+  class Heuristic
+    include YAML::Serializable
+
+    property disambiguations : Array(Disambiguation)
+    property named_patterns : Hash(String, String | Array(String))
+
+    # Run the heuristics on the given filename and content
+    def run(filename, content)
+      ext = File.extname filename
+      disambiguation = disambiguations.find do |item|
+        item.extensions.includes? ext
+      end
+      disambiguation.try &.run(content, named_patterns)
+    end
+  end
+
+  class Disambiguation
+    include YAML::Serializable
+    property extensions : Array(String)
+    property rules : Array(LangRule)
+
+    def run(content, named_patterns)
+      rules.each do |rule|
+        if rule.match(content, named_patterns)
+          return rule.language
+        end
+      end
+      nil
+    end
+  end
+
+  class LangRule
+    include YAML::Serializable
+    property pattern : (String | Array(String))?
+    property negative_pattern : (String | Array(String))?
+    property named_pattern : String?
+    property and : Array(LangRule)?
+    property language : String | Array(String)?
+
+    # ameba:disable Metrics/CyclomaticComplexity
+    def match(content, named_patterns)
+      # This rule matches without conditions
+      return true if !pattern && !negative_pattern && !named_pattern && !and
+
+      if pattern
+        p_arr = [] of String
+        p_arr << pattern.as(String) if pattern.is_a? String
+        p_arr = pattern.as(Array(String)) if pattern.is_a? Array(String)
+        return true if p_arr.any? { |pat| ::Regex.new(pat).matches?(content) }
+      end
+      if negative_pattern
+        p_arr = [] of String
+        p_arr << negative_pattern.as(String) if negative_pattern.is_a? String
+        p_arr = negative_pattern.as(Array(String)) if negative_pattern.is_a? Array(String)
+        return true if p_arr.none? { |pat| ::Regex.new(pat).matches?(content) }
+      end
+      if named_pattern
+        p_arr = [] of String
+        if named_patterns[named_pattern].is_a? String
+          p_arr << named_patterns[named_pattern].as(String)
+        else
+          p_arr = named_patterns[named_pattern].as(Array(String))
+        end
+        result = p_arr.any? { |pat| ::Regex.new(pat).matches?(content) }
+      end
+      if and
+        result = and.as(Array(LangRule)).all?(&.match(content, named_patterns))
+      end
+      result
+    end
+  end
+end
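
Review note on the disambiguation flow: the sketch below shows the "first matching rule wins, a rule without conditions always matches" behaviour on a tiny YAML sample. `MiniRule`, `MiniDisambiguation`, and the sample data are hypothetical, trimmed-down stand-ins for illustration; they are not the classes in this diff and not Linguist's real `heuristics.yml`.

```crystal
require "yaml"

# Hypothetical, trimmed-down rule: a language plus an optional pattern.
struct MiniRule
  include YAML::Serializable
  property language : String
  property pattern : String?
end

# Hypothetical, trimmed-down disambiguation entry for a set of extensions.
struct MiniDisambiguation
  include YAML::Serializable
  property extensions : Array(String)
  property rules : Array(MiniRule)

  # Return the language of the first rule that matches the content.
  def run(content : String) : String?
    rules.each do |rule|
      pat = rule.pattern
      # A rule without a pattern matches unconditionally.
      return rule.language if pat.nil?
      return rule.language if Regex.new(pat).matches?(content)
    end
    nil
  end
end

# Made-up sample data, shaped like a single heuristics entry.
SAMPLE = <<-YAML
extensions: [".h"]
rules:
  - language: Objective-C
    pattern: "@interface"
  - language: C
YAML

entry = MiniDisambiguation.from_yaml(SAMPLE)
puts entry.run("@interface Foo : NSObject")    # Objective-C
puts entry.run("int main(void) { return 0; }") # C
```

Rule order matters here: placing the pattern-less `C` rule first would shadow every rule after it.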
--- a/src/lexer.cr
+++ b/src/lexer.cr
@ -0,0 +1,319 @@
+require "baked_file_system"
+require "./constants/lexers"
+
+module Tartrazine
+  class LexerFiles
+    extend BakedFileSystem
+    bake_folder "../lexers", __DIR__
+  end
+
+  # Get the lexer object for a language name or, failing that, a filename
+  # FIXME: support mimetypes
+  def self.lexer(name : String? = nil, filename : String? = nil) : BaseLexer
+    return lexer_by_name(name) if name && name != "autodetect"
+    return lexer_by_filename(filename) if filename
+
+    Lexer.from_xml(LexerFiles.get("/#{LEXERS_BY_NAME["plaintext"]}.xml").gets_to_end)
+  end
+
+  private def self.lexer_by_name(name : String) : BaseLexer
+    lexer_file_name = LEXERS_BY_NAME.fetch(name.downcase, nil)
+    return create_delegating_lexer(name) if lexer_file_name.nil? && name.includes? "+"
+    raise Exception.new("Unknown lexer: #{name}") if lexer_file_name.nil?
+
+    Lexer.from_xml(LexerFiles.get("/#{lexer_file_name}.xml").gets_to_end)
+  end
+
+  private def self.lexer_by_filename(filename : String) : BaseLexer
+    candidates = Set(String).new
+    LEXERS_BY_FILENAME.each do |k, v|
+      candidates += v.to_set if File.match?(k, File.basename(filename))
+    end
+
+    case candidates.size
+    when 0
+      lexer_file_name = LEXERS_BY_NAME["plaintext"]
+    when 1
+      lexer_file_name = candidates.first
+    else
+      lexer_file_name = self.lexer_by_content(filename)
+      begin
+        return self.lexer(lexer_file_name)
+      rescue ex : Exception
+        raise Exception.new("Multiple lexers match the filename: #{candidates.to_a.join(", ")}, heuristics suggest #{lexer_file_name} but there is no matching lexer.")
+      end
+    end
+
+    Lexer.from_xml(LexerFiles.get("/#{lexer_file_name}.xml").gets_to_end)
+  end
+
+  private def self.lexer_by_content(fname : String) : String?
+    h = Linguist::Heuristic.from_yaml(LexerFiles.get("/heuristics.yml").gets_to_end)
+    result = h.run(fname, File.read(fname))
+    case result
+    when Nil
+      raise Exception.new "No lexer found for #{fname}"
+    when String
+      result.as(String)
+    when Array(String)
+      result.first
+    end
+  end
+
+  private def self.create_delegating_lexer(name : String) : BaseLexer
+    language, root = name.split("+", 2)
+    language_lexer = lexer(language)
+    root_lexer = lexer(root)
+    DelegatingLexer.new(language_lexer, root_lexer)
+  end
+
+  # Return a list of all lexers
+  def self.lexers : Array(String)
+    LEXERS_BY_NAME.keys.sort!
+  end
+
+  # A token, the output of the tokenizer
+  alias Token = NamedTuple(type: String, value: String)
+
+  abstract class BaseTokenizer
+  end
+
+  class Tokenizer < BaseTokenizer
+    include Iterator(Token)
+    property lexer : BaseLexer
+    property text : Bytes
+    property pos : Int32 = 0
+    @dq = Deque(Token).new
+    property state_stack = ["root"]
+
+    def initialize(@lexer : BaseLexer, text : String, secondary = false)
+      # Respect the `ensure_nl` config option
+      if text.size > 0 && text[-1] != '\n' && @lexer.config[:ensure_nl] && !secondary
+        text += "\n"
+      end
+      @text = text.to_slice
+    end
+
+    def next : Iterator::Stop | Token
+      if @dq.size > 0
+        return @dq.shift
+      end
+      if pos == @text.size
+        return stop
+      end
+
+      matched = false
+      while @pos < @text.size
+        @lexer.states[@state_stack.last].rules.each do |rule|
+          matched, new_pos, new_tokens = rule.match(@text, @pos, self)
+          if matched
+            @pos = new_pos
+            split_tokens(new_tokens).each { |token| @dq << token }
+            break
+          end
+        end
+        if !matched
+          if @text[@pos] == 10u8
+            @dq << {type: "Text", value: "\n"}
+            @state_stack = ["root"]
+          else
+            @dq << {type: "Error", value: String.new(@text[@pos..@pos])}
+          end
+          @pos += 1
+          break
+        end
+      end
+      self.next
+    end
+
+    # If a token contains newlines, split it into one token per line
+    def split_tokens(tokens : Array(Token)) : Array(Token)
+      split_tokens = [] of Token
+      tokens.each do |token|
+        if token[:value].includes?("\n")
+          values = token[:value].split("\n")
+          values.each_with_index do |value, index|
+            value += "\n" if index < values.size - 1
+            split_tokens << {type: token[:type], value: value}
+          end
+        else
+          split_tokens << token
+        end
+      end
+      split_tokens
+    end
+  end
+
+  abstract class BaseLexer
+    property config = {
+      name:             "",
+      priority:         0.0,
+      case_insensitive: false,
+      dot_all:          false,
+      not_multiline:    false,
+      ensure_nl:        false,
+    }
+    property states = {} of String => State
+
+    def tokenizer(text : String, secondary = false) : BaseTokenizer
+      Tokenizer.new(self, text, secondary)
+    end
+  end
+
+  # This implements a lexer for Pygments RegexLexers as expressed
+  # in Chroma's XML serialization.
+  #
+  # For explanations on what actions and states do
+  # the Pygments documentation is a good place to start.
+  # https://pygments.org/docs/lexerdevelopment/
+  class Lexer < BaseLexer
+    # Collapse consecutive tokens of the same type for easier comparison
+    # and smaller output
+    def self.collapse_tokens(tokens : Array(Tartrazine::Token)) : Array(Tartrazine::Token)
+      result = [] of Tartrazine::Token
+      tokens = tokens.reject { |token| token[:value] == "" }
+      tokens.each do |token|
+        if result.empty?
+          result << token
+          next
+        end
+        last = result.last
+        if last[:type] == token[:type]
+          new_token = {type: last[:type], value: last[:value] + token[:value]}
+          result.pop
+          result << new_token
+        else
+          result << token
+        end
+      end
+      result
+    end
+
+    def self.from_xml(xml : String) : Lexer
+      l = Lexer.new
+      lexer = XML.parse(xml).first_element_child
+      if lexer
+        config = lexer.children.find { |node|
+          node.name == "config"
+        }
+        if config
+          l.config = {
+            name:             xml_to_s(config, name) || "",
+            priority:         xml_to_f(config, priority) || 0.0,
+            not_multiline:    xml_to_s(config, not_multiline) == "true",
+            dot_all:          xml_to_s(config, dot_all) == "true",
+            case_insensitive: xml_to_s(config, case_insensitive) == "true",
+            ensure_nl:        xml_to_s(config, ensure_nl) == "true",
+          }
+        end
+
+        rules = lexer.children.find { |node|
+          node.name == "rules"
+        }
+        if rules
+          # Rules contains states 🤷
+          rules.children.select { |node|
+            node.name == "state"
+          }.each do |state_node|
+            state = State.new
+            state.name = state_node["name"]
+            if l.states.has_key?(state.name)
+              raise Exception.new("Duplicate state: #{state.name}")
+            else
+              l.states[state.name] = state
+            end
+            # And states contain rules 🤷
+            state_node.children.select { |node|
+              node.name == "rule"
+            }.each do |rule_node|
+              case rule_node["pattern"]?
+              when nil
+                if rule_node.first_element_child.try &.name == "include"
+                  rule = IncludeStateRule.new(rule_node)
+                else
+                  rule = UnconditionalRule.new(rule_node)
+                end
+              else
+                rule = Rule.new(rule_node,
+                  multiline: !l.config[:not_multiline],
+                  dotall: l.config[:dot_all],
+                  ignorecase: l.config[:case_insensitive])
+              end
+              state.rules << rule
+            end
+          end
+        end
+      end
+      l
+    end
+  end
+
+  # A lexer that takes two lexers as arguments: a root lexer
+  # and a language lexer. Everything is scanned using the
+  # language lexer; afterwards, all `Other` tokens are lexed
+  # using the root lexer.
+  #
+  # This is useful for things like template languages, where
+  # you have Jinja + HTML or Jinja + CSS and so on.
+  class DelegatingLexer < BaseLexer
+    property language_lexer : BaseLexer
+    property root_lexer : BaseLexer
+
+    def initialize(@language_lexer : BaseLexer, @root_lexer : BaseLexer)
+    end
+
+    def tokenizer(text : String, secondary = false) : DelegatingTokenizer
+      DelegatingTokenizer.new(self, text, secondary)
+    end
+  end
+
+  # This Tokenizer works with a DelegatingLexer. It first tokenizes
+  # using the language lexer, and "Other" tokens are tokenized using
+  # the root lexer.
+  class DelegatingTokenizer < BaseTokenizer
+    include Iterator(Token)
+    @dq = Deque(Token).new
+    @language_tokenizer : BaseTokenizer
+
+    def initialize(@lexer : DelegatingLexer, text : String, secondary = false)
+      # Respect the `ensure_nl` config option
+      if text.size > 0 && text[-1] != '\n' && @lexer.config[:ensure_nl] && !secondary
+        text += "\n"
+      end
+      @language_tokenizer = @lexer.language_lexer.tokenizer(text, true)
+    end
+
+    def next : Iterator::Stop | Token
+      if @dq.size > 0
+        return @dq.shift
+      end
+      token = @language_tokenizer.next
+      if token.is_a? Iterator::Stop
+        return stop
+      elsif token.as(Token).[:type] == "Other"
+        root_tokenizer = @lexer.root_lexer.tokenizer(token.as(Token).[:value], true)
+        root_tokenizer.each do |root_token|
+          @dq << root_token
+        end
+      else
+        @dq << token.as(Token)
+      end
+      self.next
+    end
+  end
+
+  # A Lexer state. A state has a name and a list of rules.
+  # The state machine has a state stack containing references
+  # to states to decide which rules to apply.
+  struct State
+    property name : String = ""
+    property rules = [] of BaseRule
+
+    def +(other : State)
+      new_state = State.new
+      new_state.name = Random.base58(8)
+      new_state.rules = rules + other.rules
+      new_state
+    end
+  end
+end
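
Review note on token post-processing: the tokenizer splits tokens at newlines so formatters can count lines, while `collapse_tokens` merges neighbours of the same type for compact output. The sketch below shows that the two steps are inverses on a small input. `split_on_newlines` and `collapse` are hypothetical stand-alone rewrites of the logic in this file, assuming the same `Token` shape.

```crystal
alias Token = NamedTuple(type: String, value: String)

# One token per line; every piece except the last keeps its newline.
def split_on_newlines(tokens : Array(Token)) : Array(Token)
  result = [] of Token
  tokens.each do |token|
    pieces = token[:value].split("\n")
    pieces.each_with_index do |piece, index|
      piece += "\n" if index < pieces.size - 1
      result << {type: token[:type], value: piece}
    end
  end
  result
end

# Drop empty tokens and merge consecutive tokens of the same type.
def collapse(tokens : Array(Token)) : Array(Token)
  result = [] of Token
  tokens.each do |token|
    next if token[:value].empty?
    last = result.last?
    if last && last[:type] == token[:type]
      result[-1] = {type: last[:type], value: last[:value] + token[:value]}
    else
      result << token
    end
  end
  result
end

tokens = [
  {type: "Comment", value: "# a\n# b"},
  {type: "Text", value: "x"},
]

pieces = split_on_newlines(tokens)
puts pieces.size                # 3
puts collapse(pieces) == tokens # true
```

Splitting early keeps the line counter in the HTML formatter simple, since it only has to check whether a token ends with a newline.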
--- a/src/main.cr
+++ b/src/main.cr
@ -1,5 +1,97 @@
+require "docopt"
 require "./**"

-lexer = Tartrazine.lexer("crystal")
-theme = Tartrazine.theme(ARGV[1])
-puts Tartrazine::Html.new.format(File.read(ARGV[0]), lexer, theme)
+HELP = <<-HELP
+tartrazine: a syntax highlighting tool
+
+Usage:
+  tartrazine (-h, --help)
+  tartrazine FILE -f html [-t theme][--standalone][--line-numbers]
+                          [-l lexer][-o output]
+  tartrazine -f html -t theme --css
+  tartrazine FILE -f terminal [-t theme][-l lexer][--line-numbers]
+                              [-o output]
+  tartrazine FILE -f json [-o output]
+  tartrazine --list-themes
+  tartrazine --list-lexers
+  tartrazine --list-formatters
+  tartrazine --version
+
+Options:
+  -f <formatter>      Format to use (html, terminal, json)
+  -t <theme>          Theme to use, see --list-themes [default: default-dark]
+  -l <lexer>          Lexer (language) to use, see --list-lexers. Use more than
+                      one lexer with "+" (e.g. jinja+yaml) [default: autodetect]
+  -o <output>         Output file. Default is stdout.
+  --standalone        Generate a standalone HTML file, which includes
+                      all style information. If not given, it will generate just
+                      an HTML fragment ready to include in your own page.
+  --css               Generate a CSS file for the theme called <theme>.css
+  --line-numbers      Include line numbers in the output
+  -h, --help          Show this screen
+  -v, --version       Show version number
+HELP
+
+options = Docopt.docopt(HELP, ARGV)
+
+# Handle version manually
+if options["--version"]
+  puts "tartrazine #{Tartrazine::VERSION}"
+  exit 0
+end
+
+if options["--list-themes"]
+  puts Tartrazine.themes.join("\n")
+  exit 0
+end
+
+if options["--list-lexers"]
+  puts Tartrazine.lexers.join("\n")
+  exit 0
+end
+
+if options["--list-formatters"]
+  puts "html\njson\nterminal"
+  exit 0
+end
+
+theme = Tartrazine.theme(options["-t"].as(String))
+
+if options["-f"]
+  formatter = options["-f"].as(String)
+  case formatter
+  when "html"
+    formatter = Tartrazine::Html.new
+    formatter.standalone = options["--standalone"] != nil
+    formatter.line_numbers = options["--line-numbers"] != nil
+    formatter.theme = theme
+  when "terminal"
+    formatter = Tartrazine::Ansi.new
+    formatter.line_numbers = options["--line-numbers"] != nil
+    formatter.theme = theme
+  when "json"
+    formatter = Tartrazine::Json.new
+  else
+    puts "Invalid formatter: #{formatter}"
+    exit 1
+  end
+
+  if formatter.is_a?(Tartrazine::Html) && options["--css"]
+    File.open("#{options["-t"].as(String)}.css", "w") do |outf|
+      outf << formatter.style_defs
+    end
+    exit 0
+  end
+
+  lexer = Tartrazine.lexer(name: options["-l"].as(String), filename: options["FILE"].as(String))
+
+  input = File.read(options["FILE"].as(String))
+
+  if options["-o"].nil?
+    outf = STDOUT
+  else
+    outf = File.open(options["-o"].as(String), "w")
+  end
+  formatter.format(input, lexer, outf)
+  outf.close
+end
--- a/src/rules.cr
+++ b/src/rules.cr
@ -1,9 +1,9 @@
 require "./actions"
-require "./constants"
+require "./bytes_regex"
 require "./formatter"
+require "./lexer"
 require "./rules"
 require "./styles"
-require "./tartrazine"

 # These are lexer rules. They match with the text being parsed
 # and perform actions, either emitting tokens or changing the
@ -11,37 +11,15 @@ require "./tartrazine"
 module Tartrazine
  # This rule matches via a regex pattern

-  class Rule
-    property pattern : Regex = Re2.new ""
-    property actions : Array(Action) = [] of Action
-    property xml : String = "foo"
+  alias Regex = BytesRegex::Regex
+  alias Match = BytesRegex::Match
+  alias MatchData = Array(Match)

-    def match(text, pos, lexer) : Tuple(Bool, Int32, Array(Token))
-      match = pattern.match(text, pos)
-      # We don't match if the match doesn't move the cursor
-      # because that causes infinite loops
-      return false, pos, [] of Token if match.nil? || match.end == 0
-      # Log.trace { "#{match}, #{pattern.inspect}, #{text}, #{pos}" }
-      tokens = [] of Token
-      # Emit the tokens
-      actions.each do |action|
-        # Emit the token
-        tokens += action.emit(match, lexer)
-      end
-      Log.trace { "#{xml}, #{match.end}, #{tokens}" }
-      return true, match.end, tokens
-    end
+  abstract struct BaseRule
+    abstract def match(text : Bytes, pos : Int32, tokenizer : Tokenizer) : Tuple(Bool, Int32, Array(Token))
+    abstract def initialize(node : XML::Node)

-    def initialize(node : XML::Node, multiline, dotall, ignorecase)
-      @xml = node.to_s
-      @pattern = Re2.new(
-        node["pattern"],
-        multiline,
-        dotall,
-        ignorecase,
-        anchored: true)
-      add_actions(node)
-    end
+    @actions : Array(Action) = [] of Action

    def add_actions(node : XML::Node)
      node.children.each do |child|
@ -51,23 +29,42 @@ module Tartrazine
    end
  end

+  struct Rule < BaseRule
+    property pattern : Regex = Regex.new ""
+
+    def match(text : Bytes, pos, tokenizer) : Tuple(Bool, Int32, Array(Token))
+      match = pattern.match(text, pos)
+
+      # No match
+      return false, pos, [] of Token if match.size == 0
+      return true, pos + match[0].size, @actions.flat_map(&.emit(match, tokenizer))
+    end
+
+    def initialize(node : XML::Node)
+    end
+
+    def initialize(node : XML::Node, multiline, dotall, ignorecase)
+      pattern = node["pattern"]
+      pattern = "(?m)" + pattern if multiline
+      @pattern = Regex.new(pattern, multiline, dotall, ignorecase, true)
+      add_actions(node)
+    end
+  end
+
  # This rule includes another state. If any of the rules of the
  # included state matches, this rule matches.
-  class IncludeStateRule < Rule
-    property state : String = ""
+  struct IncludeStateRule < BaseRule
+    @state : String = ""

-    def match(text, pos, lexer) : Tuple(Bool, Int32, Array(Token))
-      Log.trace { "Including state #{state} from #{lexer.state_stack.last}" }
-      lexer.states[state].rules.each do |rule|
-        matched, new_pos, new_tokens = rule.match(text, pos, lexer)
-        Log.trace { "#{xml}, #{new_pos}, #{new_tokens}" } if matched
+    def match(text : Bytes, pos : Int32, tokenizer : Tokenizer) : Tuple(Bool, Int32, Array(Token))
+      tokenizer.@lexer.states[@state].rules.each do |rule|
+        matched, new_pos, new_tokens = rule.match(text, pos, tokenizer)
        return true, new_pos, new_tokens if matched
      end
      return false, pos, [] of Token
    end

    def initialize(node : XML::Node)
-      @xml = node.to_s
      include_node = node.children.find { |child|
        child.name == "include"
      }
@ -77,39 +74,15 @@ module Tartrazine
  end

  # This rule always matches, unconditionally
-  class UnconditionalRule < Rule
-    def match(text, pos, lexer) : Tuple(Bool, Int32, Array(Token))
-      tokens = [] of Token
-      actions.each do |action|
-        tokens += action.emit(nil, lexer)
-      end
-      return true, pos, tokens
+  struct UnconditionalRule < BaseRule
+    NO_MATCH = [] of Match
+
+    def match(text, pos, tokenizer) : Tuple(Bool, Int32, Array(Token))
+      return true, pos, @actions.flat_map(&.emit(NO_MATCH, tokenizer))
    end

    def initialize(node : XML::Node)
-      @xml = node.to_s
      add_actions(node)
    end
  end
-
-  # This is a hack to workaround that Crystal seems to disallow
-  # having regexes multiline but not dot_all
-  class Re2 < Regex
-    @source = "fa"
-    @options = Regex::Options::None
-    @jit = true
-
-    def initialize(pattern : String, multiline = false, dotall = false, ignorecase = false, anchored = false)
-      flags = LibPCRE2::UTF | LibPCRE2::DUPNAMES |
-              LibPCRE2::UCP
-      flags |= LibPCRE2::MULTILINE if multiline
-      flags |= LibPCRE2::DOTALL if dotall
-      flags |= LibPCRE2::CASELESS if ignorecase
-      flags |= LibPCRE2::ANCHORED if anchored
-      flags |= LibPCRE2::NO_UTF_CHECK
-      @re = Regex::PCRE2.compile(pattern, flags) do |error_message|
-        raise Exception.new(error_message)
-      end
-    end
-  end
 end
--- a/src/styles.cr
+++ b/src/styles.cr
@ -1,5 +1,4 @@
 require "./actions"
-require "./constants"
 require "./formatter"
 require "./rules"
 require "./styles"
@ -10,17 +9,37 @@ require "xml"
 module Tartrazine
  alias Color = Sixteen::Color

-  def self.theme(name : String) : Theme
-    return Theme.from_base16(name[7..]) if name.starts_with? "base16_"
-    Theme.from_xml(ThemeFiles.get("/#{name}.xml").gets_to_end)
-  end
-
-  class ThemeFiles
+  struct ThemeFiles
    extend BakedFileSystem
    bake_folder "../styles", __DIR__
  end

-  class Style
+  def self.theme(name : String) : Theme
+    begin
+      return Theme.from_base16(name)
+    rescue ex : Exception
+      raise ex unless ex.message.try &.includes? "Theme not found"
+    end
+    begin
+      Theme.from_xml(ThemeFiles.get("/#{name}.xml").gets_to_end)
+    rescue
+      raise Exception.new("Theme #{name} not found")
+    end
+  end
+
+  # Return a list of all themes
+  def self.themes
+    themes = Set(String).new
+    ThemeFiles.files.each do |file|
+      themes << file.path.split("/").last.split(".").first
+    end
+    Sixteen::DataFiles.files.each do |file|
+      themes << file.path.split("/").last.split(".").first
+    end
+    themes.to_a.sort!
+  end
+
+  struct Style
    # These properties are tri-state.
    # true means it's set
    # false means it's not set
@ -60,7 +79,7 @@ module Tartrazine
    end
  end

-  class Theme
+  struct Theme
    property name : String = ""

    property styles = {} of String => Style
@ -103,7 +122,8 @@ module Tartrazine
      # The color assignments are adapted from
      # https://github.com/mohd-akram/base16-pygments/

-      theme.styles["Background"] = Style.new(color: t["base05"], background: t["base00"])
+      theme.styles["Background"] = Style.new(color: t["base05"], background: t["base00"], bold: true)
+      theme.styles["LineHighlight"] = Style.new(color: t["base0D"], background: t["base01"])
      theme.styles["Text"] = Style.new(color: t["base05"])
      theme.styles["Error"] = Style.new(color: t["base08"])
      theme.styles["Comment"] = Style.new(color: t["base03"])
@ -162,7 +182,26 @@ module Tartrazine

        theme.styles[node["type"]] = s
      end
+      # We really want a LineHighlight class
+      if !theme.styles.has_key?("LineHighlight")
+        theme.styles["LineHighlight"] = Style.new
+        theme.styles["LineHighlight"].background = make_highlight_color(theme.styles["Background"].background)
+        theme.styles["LineHighlight"].bold = true
+      end
      theme
    end
+
+    # If the color is dark, make it brighter, and vice versa
+    def self.make_highlight_color(base_color)
+      if base_color.nil?
+        # No background color to work from, so fall back to a neutral gray
+        return Color.new(127, 127, 127)
+      end
+      if base_color.dark?
+        base_color.lighter(0.2)
+      else
+        base_color.darker(0.2)
+      end
+    end
  end
 end
--- a/src/tartrazine.cr
+++ b/src/tartrazine.cr
@ -1,5 +1,4 @@
 require "./actions"
-require "./constants"
 require "./formatter"
 require "./rules"
 require "./styles"
@ -12,189 +11,9 @@ require "xml"

 module Tartrazine
  extend self
-  VERSION = "0.1.0"
+  VERSION = {{ `shards version #{__DIR__}`.chomp.stringify }}

  Log = ::Log.for("tartrazine")
-
-  # This implements a lexer for Pygments RegexLexers as expressed
-  # in Chroma's XML serialization.
-  #
-  # For explanations on what actions and states do
-  # the Pygments documentation is a good place to start.
-  # https://pygments.org/docs/lexerdevelopment/
-
-  # A Lexer state. A state has a name and a list of rules.
-  # The state machine has a state stack containing references
-  # to states to decide which rules to apply.
-  class State
-    property name : String = ""
-    property rules = [] of Rule
-
-    def +(other : State)
-      new_state = State.new
-      new_state.name = Random.base58(8)
-      new_state.rules = rules + other.rules
-      new_state
-    end
-  end
-
-  class LexerFiles
-    extend BakedFileSystem
-
-    bake_folder "../lexers", __DIR__
-  end
-
-  # A token, the output of the tokenizer
-  alias Token = NamedTuple(type: String, value: String)
-
-  class Lexer
-    property config = {
-      name:             "",
-      aliases:          [] of String,
-      filenames:        [] of String,
-      mime_types:       [] of String,
-      priority:         0.0,
-      case_insensitive: false,
-      dot_all:          false,
-      not_multiline:    false,
-      ensure_nl:        false,
-    }
-    property xml : String = ""
-
-    property states = {} of String => State
-
-    property state_stack = ["root"]
-
-    # Turn the text into a list of tokens. The `usingself` parameter
-    # is true when the lexer is being used to tokenize a string
-    # from a larger text that is already being tokenized.
-    # So, when it's true, we don't modify the text.
-    def tokenize(text, usingself = false) : Array(Token)
-      @state_stack = ["root"]
-      tokens = [] of Token
-      pos = 0
-      matched = false
-
-      # Respect the `ensure_nl` config option
-      if text.size > 0 && text[-1] != '\n' && config[:ensure_nl] && !usingself
-        text += "\n"
-      end
-
-      # Loop through the text, applying rules
-      while pos < text.size
-        state = states[@state_stack.last]
-        # Log.trace { "Stack is #{@state_stack} State is #{state.name}, pos is #{pos}, text is #{text[pos..pos + 10]}" }
-        state.rules.each do |rule|
-          matched, new_pos, new_tokens = rule.match(text, pos, self)
-          if matched
-            # Move position forward, save the tokens,
-            # tokenize from the new position
-            # Log.trace { "MATCHED: #{rule.xml}" }
-            pos = new_pos
-            tokens += new_tokens
-            break
-          end
-          # Log.trace { "NOT MATCHED: #{rule.xml}" }
-        end
-        # If no rule matches, emit an error token
-        unless matched
-          # Log.trace { "Error at #{pos}" }
-          tokens << {type: "Error", value: "#{text[pos]}"}
-          pos += 1
-        end
-      end
-      Lexer.collapse_tokens(tokens)
-    end
-
-    # Collapse consecutive tokens of the same type for easier comparison
-    # and smaller output
-    def self.collapse_tokens(tokens : Array(Tartrazine::Token)) : Array(Tartrazine::Token)
-      result = [] of Tartrazine::Token
-      tokens = tokens.reject { |token| token[:value] == "" }
-      tokens.each do |token|
-        if result.empty?
-          result << token
-          next
-        end
-        last = result.last
-        if last[:type] == token[:type]
-          new_token = {type: last[:type], value: last[:value] + token[:value]}
-          result.pop
-          result << new_token
-        else
-          result << token
-        end
-      end
-      result
-    end
-
-    # ameba:disable Metrics/CyclomaticComplexity
-    def self.from_xml(xml : String) : Lexer
-      l = Lexer.new
-      l.xml = xml
-      lexer = XML.parse(xml).first_element_child
-      if lexer
-        config = lexer.children.find { |node|
-          node.name == "config"
-        }
-        if config
-          l.config = {
-            name:             xml_to_s(config, name) || "",
-            aliases:          xml_to_a(config, _alias) || [] of String,
-            filenames:        xml_to_a(config, filename) || [] of String,
-            mime_types:       xml_to_a(config, mime_type) || [] of String,
-            priority:         xml_to_f(config, priority) || 0.0,
-            not_multiline:    xml_to_s(config, not_multiline) == "true",
-            dot_all:          xml_to_s(config, dot_all) == "true",
-            case_insensitive: xml_to_s(config, case_insensitive) == "true",
-            ensure_nl:        xml_to_s(config, ensure_nl) == "true",
-          }
-        end
-
-        rules = lexer.children.find { |node|
-          node.name == "rules"
-        }
-        if rules
-          # Rules contains states 🤷
-          rules.children.select { |node|
-            node.name == "state"
-          }.each do |state_node|
-            state = State.new
-            state.name = state_node["name"]
-            if l.states.has_key?(state.name)
-              raise Exception.new("Duplicate state: #{state.name}")
-            else
-              l.states[state.name] = state
-            end
-            # And states contain rules 🤷
-            state_node.children.select { |node|
-              node.name == "rule"
-            }.each do |rule_node|
-              case rule_node["pattern"]?
-              when nil
-                if rule_node.first_element_child.try &.name == "include"
-                  rule = IncludeStateRule.new(rule_node)
-                else
-                  rule = UnconditionalRule.new(rule_node)
-                end
-              else
-                rule = Rule.new(rule_node,
-                  multiline: !l.config[:not_multiline],
-                  dotall: l.config[:dot_all],
-                  ignorecase: l.config[:case_insensitive])
-              end
-              state.rules << rule
-            end
-          end
-        end
-      end
-      l
-    end
-  end
-
-  def self.lexer(name : String) : Lexer
-    Lexer.from_xml(LexerFiles.get("/#{name}.xml").gets_to_end)
-  end
 end

 # Convenience macros to parse XML
--- a/styles/base16-snazzy.xml
+++ b/styles/base16-snazzy.xml
@@ -1,74 +0,0 @@
-<style name="base16-snazzy">
-  <entry type="Other" style="#e2e4e5"/>
-  <entry type="Error" style="#ff5c57"/>
-  <entry type="Background" style="bg:#282a36"/>
-  <entry type="Keyword" style="#ff6ac1"/>
-  <entry type="KeywordConstant" style="#ff6ac1"/>
-  <entry type="KeywordDeclaration" style="#ff5c57"/>
-  <entry type="KeywordNamespace" style="#ff6ac1"/>
-  <entry type="KeywordPseudo" style="#ff6ac1"/>
-  <entry type="KeywordReserved" style="#ff6ac1"/>
-  <entry type="KeywordType" style="#9aedfe"/>
-  <entry type="Name" style="#e2e4e5"/>
-  <entry type="NameAttribute" style="#57c7ff"/>
-  <entry type="NameBuiltin" style="#ff5c57"/>
-  <entry type="NameBuiltinPseudo" style="#e2e4e5"/>
-  <entry type="NameClass" style="#f3f99d"/>
-  <entry type="NameConstant" style="#ff9f43"/>
-  <entry type="NameDecorator" style="#ff9f43"/>
-  <entry type="NameEntity" style="#e2e4e5"/>
-  <entry type="NameException" style="#e2e4e5"/>
-  <entry type="NameFunction" style="#57c7ff"/>
-  <entry type="NameLabel" style="#ff5c57"/>
-  <entry type="NameNamespace" style="#e2e4e5"/>
-  <entry type="NameOther" style="#e2e4e5"/>
-  <entry type="NameTag" style="#ff6ac1"/>
-  <entry type="NameVariable" style="#ff5c57"/>
-  <entry type="NameVariableClass" style="#ff5c57"/>
-  <entry type="NameVariableGlobal" style="#ff5c57"/>
-  <entry type="NameVariableInstance" style="#ff5c57"/>
-  <entry type="Literal" style="#e2e4e5"/>
-  <entry type="LiteralDate" style="#e2e4e5"/>
-  <entry type="LiteralString" style="#5af78e"/>
-  <entry type="LiteralStringBacktick" style="#5af78e"/>
-  <entry type="LiteralStringChar" style="#5af78e"/>
-  <entry type="LiteralStringDoc" style="#5af78e"/>
-  <entry type="LiteralStringDouble" style="#5af78e"/>
-  <entry type="LiteralStringEscape" style="#5af78e"/>
-  <entry type="LiteralStringHeredoc" style="#5af78e"/>
-  <entry type="LiteralStringInterpol" style="#5af78e"/>
-  <entry type="LiteralStringOther" style="#5af78e"/>
-  <entry type="LiteralStringRegex" style="#5af78e"/>
-  <entry type="LiteralStringSingle" style="#5af78e"/>
-  <entry type="LiteralStringSymbol" style="#5af78e"/>
-  <entry type="LiteralNumber" style="#ff9f43"/>
-  <entry type="LiteralNumberBin" style="#ff9f43"/>
-  <entry type="LiteralNumberFloat" style="#ff9f43"/>
-  <entry type="LiteralNumberHex" style="#ff9f43"/>
-  <entry type="LiteralNumberInteger" style="#ff9f43"/>
-  <entry type="LiteralNumberIntegerLong" style="#ff9f43"/>
-  <entry type="LiteralNumberOct" style="#ff9f43"/>
-  <entry type="Operator" style="#ff6ac1"/>
-  <entry type="OperatorWord" style="#ff6ac1"/>
-  <entry type="Punctuation" style="#e2e4e5"/>
-  <entry type="Comment" style="#78787e"/>
-  <entry type="CommentHashbang" style="#78787e"/>
-  <entry type="CommentMultiline" style="#78787e"/>
-  <entry type="CommentSingle" style="#78787e"/>
-  <entry type="CommentSpecial" style="#78787e"/>
-  <entry type="CommentPreproc" style="#78787e"/>
-  <entry type="Generic" style="#e2e4e5"/>
-  <entry type="GenericDeleted" style="#ff5c57"/>
-  <entry type="GenericEmph" style="underline #e2e4e5"/>
-  <entry type="GenericError" style="#ff5c57"/>
-  <entry type="GenericHeading" style="bold #e2e4e5"/>
-  <entry type="GenericInserted" style="bold #e2e4e5"/>
-  <entry type="GenericOutput" style="#43454f"/>
-  <entry type="GenericPrompt" style="#e2e4e5"/>
-  <entry type="GenericStrong" style="italic #e2e4e5"/>
-  <entry type="GenericSubheading" style="bold #e2e4e5"/>
-  <entry type="GenericTraceback" style="#e2e4e5"/>
-  <entry type="GenericUnderline" style="underline"/>
-  <entry type="Text" style="#e2e4e5"/>
-  <entry type="TextWhitespace" style="#e2e4e5"/>
-</style>
--- a/x2.html
+++ b/x2.html
Author	SHA1	Message	Date
Roberto Alsina	df88047ca8	v0.6.1	2024-08-24 21:45:57 -03:00
Roberto Alsina	5a3b50d7a3	Integrate heuristics into lexer selection	2024-08-24 21:39:39 -03:00
Roberto Alsina	a5926af518	Comments	2024-08-24 20:53:14 -03:00
Roberto Alsina	fc9f834bc8	Make it work again	2024-08-24 20:09:29 -03:00
Roberto Alsina	58fd42d936	Rebase to main	2024-08-24 19:59:05 -03:00
Roberto Alsina	5a88a51f3e	Implement heuristics from linguist	2024-08-24 19:55:56 -03:00
Roberto Alsina	fd7c6fa4b3	Sort of working?	2024-08-24 19:55:56 -03:00
Roberto Alsina	6264bfc754	Beginning deserialization of data	2024-08-24 19:55:56 -03:00
Roberto Alsina	38196d6e96	Rst lexer	2024-08-24 19:49:02 -03:00
Roberto Alsina	c6cd74e339	248 languages	2024-08-23 14:49:01 -03:00
Roberto Alsina	17c66a6572	typo	2024-08-23 14:46:26 -03:00
Roberto Alsina	cd7e150aae	Merge pull request #1 from ralgozino/docs/improve-v0.6.0-instructions docs: improve readme and help message	2024-08-23 14:45:56 -03:00
Ramiro Algozino	176b8e9bc9	docs: improve readme and help message - Add example for printing output to the terminal - Fix example for usage as CLI tool (missing -f flag) - Add instructions in the help message for combining lexers	2024-08-23 18:30:14 +02:00
Roberto Alsina	d8ddf5d8b6	v0.6.0	2024-08-23 10:39:08 -03:00
Roberto Alsina	06556877ef	Merge branch 'more_lexers'	2024-08-23 10:34:17 -03:00
Roberto Alsina	3d5d073471	Implemented usingbygroup action, so code-in-markdown works	2024-08-23 10:20:03 -03:00
Roberto Alsina	a2884c4c78	Refactor	2024-08-22 21:58:21 -03:00
Roberto Alsina	bd3df10d2c	Use classes instead of structs to allow properties of the same type	2024-08-22 21:52:59 -03:00
Roberto Alsina	0f3b7fc3c5	Initial implementation of delegatinglexer	2024-08-22 20:55:08 -03:00
Roberto Alsina	7f4296e9d7	Some template lexers	2024-08-22 16:11:30 -03:00
Roberto Alsina	f883065092	Fix weird bug	2024-08-22 15:00:17 -03:00
Roberto Alsina	746abe53ea	Fix weird bug	2024-08-22 14:58:05 -03:00
Roberto Alsina	90971e8f1b	Generate constants sorted so git diffs are smaller	2024-08-22 10:24:09 -03:00
Roberto Alsina	057879c6ee	oops	2024-08-22 10:11:36 -03:00
Roberto Alsina	215d53e173	3 more lexers (markdown moinwiki bbcode)	2024-08-21 22:21:38 -03:00
Roberto Alsina	f435d7df21	0.5.1	2024-08-21 21:22:36 -03:00
Roberto Alsina	5b0a1789dc	v0.5.0	2024-08-21 21:22:36 -03:00
Roberto Alsina	76ef1fea41	Fix example code in README	2024-08-21 21:22:36 -03:00
Roberto Alsina	3ebedec6c1	Make formatter a bit more convenient	2024-08-19 11:26:34 -03:00
Roberto Alsina	57e63f2308	Make formatter a bit more convenient	2024-08-19 11:20:08 -03:00
Roberto Alsina	4a598a575b	Make formatter a bit more convenient	2024-08-19 11:18:54 -03:00
Roberto Alsina	9042138053	Make formatter a bit more convenient	2024-08-19 11:17:44 -03:00
Roberto Alsina	fa647e898a	Make formatter a bit more convenient	2024-08-19 10:15:02 -03:00
Roberto Alsina	ad92929a10	Make formatter a bit more convenient	2024-08-19 09:59:01 -03:00
Roberto Alsina	bb952a44b8	Use IO for output	2024-08-16 17:25:33 -03:00
Roberto Alsina	ae03e4612e	todo management	2024-08-16 14:05:34 -03:00
Roberto Alsina	471b2f5050	updated	2024-08-16 14:03:05 -03:00
Roberto Alsina	5a3b08e716	lint	2024-08-16 14:01:16 -03:00
Roberto Alsina	9ebb9f2765	Fix off-by-1	2024-08-16 13:36:11 -03:00
Roberto Alsina	7538fc76aa	Tokenize via an iterator, makes everything much faster	2024-08-16 13:27:02 -03:00
Roberto Alsina	788577b226	Fix comment	2024-08-15 23:56:52 -03:00
Roberto Alsina	1f01146b1f	Minor cleanup	2024-08-15 23:21:21 -03:00
Roberto Alsina	9041b763ea	Remove unused bits of lexer config	2024-08-15 23:17:49 -03:00
Roberto Alsina	ada30915c3	Idiomatic changes	2024-08-15 23:16:29 -03:00
Roberto Alsina	78eff45ea0	Idiomatic changes	2024-08-15 23:11:49 -03:00
Roberto Alsina	e817aedd60	Idiomatic changes	2024-08-15 22:41:24 -03:00
Roberto Alsina	20d6b65346	More idiomatic	2024-08-15 22:01:50 -03:00
Roberto Alsina	cb09dff9f1	Minor cleanup	2024-08-15 21:35:06 -03:00
Roberto Alsina	b589726352	Make action a struct, guard against popping too much	2024-08-15 21:16:17 -03:00
Roberto Alsina	a3a7b5bd9a	Many cleanups	2024-08-15 21:10:25 -03:00
Roberto Alsina	58e8dac038	Make usingself MUCH cheaper, since it was called many times when parsing C	2024-08-15 19:20:12 -03:00
Roberto Alsina	f72a40f095	Oops, escape things in HTML formatter!	2024-08-15 17:12:29 -03:00
Roberto Alsina	bf257a5b82	cleanup	2024-08-15 17:05:03 -03:00
Roberto Alsina	029495590c	cleanup	2024-08-15 17:04:48 -03:00
Roberto Alsina	115debdec6	Allocate match_data once	2024-08-15 17:04:16 -03:00
Roberto Alsina	4612db58fe	Prefetch XML data	2024-08-15 17:03:58 -03:00
Roberto Alsina	f45a86c83a	ignore	2024-08-15 16:35:58 -03:00
Roberto Alsina	27008640a6	v0.4.0	2024-08-14 13:25:39 -03:00
Roberto Alsina	7db8fdc9e4	Updated README	2024-08-14 13:25:20 -03:00
Roberto Alsina	ad664d9f93	Added error handling	2024-08-14 11:24:25 -03:00
Roberto Alsina	0626c8619f	Working bytes-regexes, faster, MORE tests pass	2024-08-14 11:06:53 -03:00
Roberto Alsina	3725201f8a	Merge branch 'main' of github.com:ralsina/tartrazine	2024-08-14 09:25:08 -03:00
Roberto Alsina	6f64b76c44	lint	2024-08-13 22:07:23 -03:00
Roberto Alsina	5218af6855	lint	2024-08-13 22:06:19 -03:00
Roberto Alsina	c898f395a1	reset stack on EOL instead of error, makes no difference, but it's in pygments version	2024-08-13 22:06:07 -03:00
Roberto Alsina	56e49328fb	Tiny bug	2024-08-13 21:00:00 -03:00
Roberto Alsina	8d7faf2098	0.3.0	2024-08-13 11:06:06 -03:00
Roberto Alsina	2e87762f1b	API changes to make it nicer These are incompatible, tho. * Theme is now a property of the formatter instead of passing it around * get_style_defs is now style_defs	2024-08-13 10:57:02 -03:00
Roberto Alsina	88f5674917	Tiny bug	2024-08-12 21:02:17 -03:00
Roberto Alsina	ce6f3d29b5	Remove Re2 hack	2024-08-12 19:01:13 -03:00
Roberto Alsina	46d6d3f467	Make how-heavy-is-bold configurable	2024-08-12 10:55:58 -03:00
Roberto Alsina	78ddc69937	Merge branch 'main' of github.com:ralsina/tartrazine	2024-08-12 10:11:03 -03:00
Roberto Alsina	b1ad7b64c0	oops	2024-08-12 10:10:51 -03:00
Roberto Alsina	cbedf8a8db	Bump to 0.2.0	2024-08-11 13:24:30 -03:00
Roberto Alsina	ec8c53c823	Added --line-numbers for the terminal formatter	2024-08-11 13:21:47 -03:00
Roberto Alsina	e3a1ce37b4	Support guessing lexer by filename	2024-08-11 13:04:35 -03:00
Roberto Alsina	b4f38e00e1	Script to generate lexer metadata constants	2024-08-11 12:41:22 -03:00
Roberto Alsina	08daabe1c3	Cleanup token abbreviation generation script	2024-08-11 12:06:02 -03:00
Roberto Alsina	e8d405fc99	Implemented decent version of the CLI	2024-08-11 11:54:00 -03:00
Roberto Alsina	e295256573	Implemented decent version of the CLI	2024-08-11 11:49:42 -03:00
Roberto Alsina	e40c8b586c	Removed duplicate snazzy theme	2024-08-11 11:27:37 -03:00
Roberto Alsina	bc34f93cc5	Use regular sixteen now	2024-08-10 17:16:26 -03:00
Roberto Alsina	f64c91801e	lint	2024-08-10 16:58:36 -03:00
Roberto Alsina	8e29500fcf	Make line-numbers not-selectable. This makes the listing copy-friendly AND doesn't require wrapping things in tables	2024-08-10 16:54:46 -03:00
Roberto Alsina	f2e638ce3b	Require main branch sixteen for now, line-highlight style improvements	2024-08-10 16:50:55 -03:00
Roberto Alsina	84ee7e6934	JSON formatter	2024-08-09 16:58:15 -03:00
Roberto Alsina	89d212b71c	Start actual CLI	2024-08-09 16:53:24 -03:00
Roberto Alsina	a92d2501f7	HTML formatter option: wrap_long_lines	2024-08-09 16:20:30 -03:00
Roberto Alsina	6b44bcb5ad	HTML formatter option: surrounding_pre	2024-08-09 15:59:49 -03:00
Roberto Alsina	86a5894429	Hack luminance tweaking for creating highlight color (needs a proper implementation)	2024-08-09 14:54:00 -03:00
Roberto Alsina	be12e0f4f1	Sort constants	2024-08-09 14:44:23 -03:00
Roberto Alsina	96dcb7e15e	Fix line highlight for non-base16 themes	2024-08-09 14:42:33 -03:00
Roberto Alsina	d1762f477a	Fix constants for non-base16 themes	2024-08-09 14:17:24 -03:00
Roberto Alsina	f98f44365f	HTML formatter option: line_numbers / highlight_lines	2024-08-09 14:00:42 -03:00
Roberto Alsina	d0c2b1764a	HTML formatter option: line_number_start / line_number_id_prefix	2024-08-09 13:28:05 -03:00
Roberto Alsina	e6a292ade0	HTML formatter option: tab_width	2024-08-09 12:29:56 -03:00
Roberto Alsina	4ced996f90	HTML formatter option: class_prefix	2024-08-09 12:21:02 -03:00
Roberto Alsina	fd5af6ba3b	Starting to add options to HTML formatter: standalone	2024-08-09 11:57:23 -03:00
Roberto Alsina	47237eecc3	Refactor things into separate files for easier reading	2024-08-09 11:31:18 -03:00
Roberto Alsina	a0ff4e0118	0.1.1	2024-08-09 11:11:17 -03:00
Roberto Alsina	ece3d4163a	Bug	2024-08-09 11:03:32 -03:00
Roberto Alsina	3180168261	Added helper files	2024-08-09 10:32:15 -03:00