Compare commits

56 Commits

Author SHA1 Message Date
72afec773e Integrate heuristics into lexer selection 2024-08-24 21:35:06 -03:00
a5926af518 Comments 2024-08-24 20:53:14 -03:00
fc9f834bc8 Make it work again 2024-08-24 20:09:29 -03:00
58fd42d936 Rebase to main 2024-08-24 19:59:05 -03:00
5a88a51f3e Implement heuristics from linguist 2024-08-24 19:55:56 -03:00
fd7c6fa4b3 Sort of working? 2024-08-24 19:55:56 -03:00
6264bfc754 Beginning deserialization of data 2024-08-24 19:55:56 -03:00
38196d6e96 Rst lexer 2024-08-24 19:49:02 -03:00
c6cd74e339 248 languages 2024-08-23 14:49:01 -03:00
17c66a6572 typo 2024-08-23 14:46:26 -03:00
cd7e150aae Merge pull request #1 from ralgozino/docs/improve-v0.6.0-instructions: docs: improve readme and help message 2024-08-23 14:45:56 -03:00
Ramiro Algozino 176b8e9bc9 docs: improve readme and help message 2024-08-23 18:30:14 +02:00
- Add example for printing output to the terminal
- Fix example for usage as CLI tool (missing -f flag)
- Add instructions in the help message for combining lexers
d8ddf5d8b6 v0.6.0 2024-08-23 10:39:08 -03:00
06556877ef Merge branch 'more_lexers' 2024-08-23 10:34:17 -03:00
3d5d073471 Implemented usingbygroup action, so code-in-markdown works 2024-08-23 10:20:03 -03:00
a2884c4c78 Refactor 2024-08-22 21:58:21 -03:00
bd3df10d2c Use classes instead of structs to allow properties of the same type 2024-08-22 21:52:59 -03:00
0f3b7fc3c5 Initial implementation of delegatinglexer 2024-08-22 20:55:08 -03:00
7f4296e9d7 Some template lexers 2024-08-22 16:11:30 -03:00
f883065092 Fix weird bug 2024-08-22 15:00:17 -03:00
746abe53ea Fix weird bug 2024-08-22 14:58:05 -03:00
90971e8f1b Generate constants sorted so git diffs are smaller 2024-08-22 10:24:09 -03:00
057879c6ee oops 2024-08-22 10:11:36 -03:00
215d53e173 3 more lexers (markdown moinwiki bbcode) 2024-08-21 22:21:38 -03:00
f435d7df21 0.5.1 2024-08-21 21:22:36 -03:00
5b0a1789dc v0.5.0 2024-08-21 21:22:36 -03:00
76ef1fea41 Fix example code in README 2024-08-21 21:22:36 -03:00
3ebedec6c1 Make formatter a bit more convenient 2024-08-19 11:26:34 -03:00
57e63f2308 Make formatter a bit more convenient 2024-08-19 11:20:08 -03:00
4a598a575b Make formatter a bit more convenient 2024-08-19 11:18:54 -03:00
9042138053 Make formatter a bit more convenient 2024-08-19 11:17:44 -03:00
fa647e898a Make formatter a bit more convenient 2024-08-19 10:15:02 -03:00
ad92929a10 Make formatter a bit more convenient 2024-08-19 09:59:01 -03:00
bb952a44b8 Use IO for output 2024-08-16 17:25:33 -03:00
ae03e4612e todo management 2024-08-16 14:05:34 -03:00
471b2f5050 updated 2024-08-16 14:03:05 -03:00
5a3b08e716 lint 2024-08-16 14:01:16 -03:00
9ebb9f2765 Fix off-by-1 2024-08-16 13:36:11 -03:00
7538fc76aa Tokenize via an iterator, makes everything much faster 2024-08-16 13:27:02 -03:00
788577b226 Fix comment 2024-08-15 23:56:52 -03:00
1f01146b1f Minor cleanup 2024-08-15 23:21:21 -03:00
9041b763ea Remove unused bits of lexer config 2024-08-15 23:17:49 -03:00
ada30915c3 Idiomatic changes 2024-08-15 23:16:29 -03:00
78eff45ea0 Idiomatic changes 2024-08-15 23:11:49 -03:00
e817aedd60 Idiomatic changes 2024-08-15 22:41:24 -03:00
20d6b65346 More idiomatic 2024-08-15 22:01:50 -03:00
cb09dff9f1 Minor cleanup 2024-08-15 21:35:06 -03:00
b589726352 Make action a struct, guard against popping too much 2024-08-15 21:16:17 -03:00
a3a7b5bd9a Many cleanups 2024-08-15 21:10:25 -03:00
58e8dac038 Make usingself MUCH cheaper, since it was called many times when parsing C 2024-08-15 19:20:12 -03:00
f72a40f095 Oops, escape things in HTML formatter! 2024-08-15 17:12:29 -03:00
bf257a5b82 cleanup 2024-08-15 17:05:03 -03:00
029495590c cleanup 2024-08-15 17:04:48 -03:00
115debdec6 Allocate match_data once 2024-08-15 17:04:16 -03:00
4612db58fe Prefetch XML data 2024-08-15 17:03:58 -03:00
f45a86c83a ignore 2024-08-15 16:35:58 -03:00
28 changed files with 15443 additions and 1415 deletions

1
.gitignore vendored

@ -8,3 +8,4 @@ pygments/
shard.lock
.vscode/
.crystal/
venv/

View File

@ -29,7 +29,7 @@ This only covers the RegexLexers, which are the most common ones,
but it means the supported languages are a subset of Chroma's, which
is a subset of Pygments'.
Currently Tartrazine supports ... 241 languages.
Currently Tartrazine supports ... 248 languages.
It has 331 themes (63 from Chroma, the rest are base16 themes via
[Sixteen](https://github.com/ralsina/sixteen)
@ -47,7 +47,22 @@ To build from source:
2. Run `make` to build the `tartrazine` binary
3. Copy the binary somewhere in your PATH.
## Usage
## Usage as a CLI tool
Show a syntax highlighted version of a C source file in your terminal:
```shell
$ tartrazine whatever.c -l c -t catppuccin-macchiato --line-numbers -f terminal
```
Generate a standalone HTML file from a C source file with the syntax highlighted:
```shell
$ tartrazine whatever.c -l c -t catppuccin-macchiato --line-numbers \
--standalone -f html -o whatever.html
```
## Usage as a Library
This works:
@ -56,7 +71,9 @@ require "tartrazine"
lexer = Tartrazine.lexer("crystal")
theme = Tartrazine.theme("catppuccin-macchiato")
puts Tartrazine::Html.new.format(File.read(ARGV[0]), lexer, theme)
formatter = Tartrazine::Html.new
formatter.theme = theme
puts formatter.format(File.read(ARGV[0]), lexer)
```
## Contributing

View File

@ -9,4 +9,7 @@
* ✅ Implement lexer loader by file extension
* ✅ Add --line-numbers to terminal formatter
* Implement lexer loader by mime type
* Implement Pygment's "DelegateLexer"
* ✅ Implement Delegating lexers
* ✅ Add RstLexer
* Add Mako template lexer
* Implement heuristic lexer detection

130
lexers/LiquidLexer.xml Normal file

@ -0,0 +1,130 @@
<lexer>
<config>
<name>liquid</name>
<alias>liquid</alias>
<filename>*.liquid</filename>
</config>
<rules>
<state name="root">
<rule pattern="[^{]+"><token type="Text"/></rule>
<rule pattern="(\{%)(\s*)"><bygroups><token type="Punctuation"/><token type="TextWhitespace"/></bygroups><push state="tag-or-block"/></rule>
<rule pattern="(\{\{)(\s*)([^\s}]+)"><bygroups><token type="Punctuation"/><token type="TextWhitespace"/><usingself state="generic"/></bygroups><push state="output"/></rule>
<rule pattern="\{"><token type="Text"/></rule>
</state>
<state name="tag-or-block">
<rule pattern="(if|unless|elsif|case)(?=\s+)"><token type="KeywordReserved"/><push state="condition"/></rule>
<rule pattern="(when)(\s+)"><bygroups><token type="KeywordReserved"/><token type="TextWhitespace"/></bygroups><combined state="end-of-block" state="whitespace" state="generic"/></rule>
<rule pattern="(else)(\s*)(%\})"><bygroups><token type="KeywordReserved"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><pop depth="1"/></rule>
<rule pattern="(capture)(\s+)([^\s%]+)(\s*)(%\})"><bygroups><token type="NameTag"/><token type="TextWhitespace"/><usingself state="variable"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><pop depth="1"/></rule>
<rule pattern="(comment)(\s*)(%\})"><bygroups><token type="NameTag"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><push state="comment"/></rule>
<rule pattern="(raw)(\s*)(%\})"><bygroups><token type="NameTag"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><push state="raw"/></rule>
<rule pattern="(end(case|unless|if))(\s*)(%\})"><bygroups><token type="KeywordReserved"/>None<token type="TextWhitespace"/><token type="Punctuation"/></bygroups><pop depth="1"/></rule>
<rule pattern="(end([^\s%]+))(\s*)(%\})"><bygroups><token type="NameTag"/>None<token type="TextWhitespace"/><token type="Punctuation"/></bygroups><pop depth="1"/></rule>
<rule pattern="(cycle)(\s+)(?:([^\s:]*)(:))?(\s*)"><bygroups><token type="NameTag"/><token type="TextWhitespace"/><usingself state="generic"/><token type="Punctuation"/><token type="TextWhitespace"/></bygroups><push state="variable-tag-markup"/></rule>
<rule pattern="([^\s%]+)(\s*)"><bygroups><token type="NameTag"/><token type="TextWhitespace"/></bygroups><push state="tag-markup"/></rule>
</state>
<state name="output">
<rule><include state="whitespace"/></rule>
<rule pattern="\}\}"><token type="Punctuation"/><pop depth="1"/></rule>
<rule pattern="\|"><token type="Punctuation"/><push state="filters"/></rule>
</state>
<state name="filters">
<rule><include state="whitespace"/></rule>
<rule pattern="\}\}"><token type="Punctuation"/><push state="#pop" state="#pop"/></rule>
<rule pattern="([^\s|:]+)(:?)(\s*)"><bygroups><token type="NameFunction"/><token type="Punctuation"/><token type="TextWhitespace"/></bygroups><push state="filter-markup"/></rule>
</state>
<state name="filter-markup">
<rule pattern="\|"><token type="Punctuation"/><pop depth="1"/></rule>
<rule><include state="end-of-tag"/></rule>
<rule><include state="default-param-markup"/></rule>
</state>
<state name="condition">
<rule><include state="end-of-block"/></rule>
<rule><include state="whitespace"/></rule>
<rule pattern="([^\s=!&gt;&lt;]+)(\s*)([=!&gt;&lt;]=?)(\s*)(\S+)(\s*)(%\})"><bygroups><usingself state="generic"/><token type="TextWhitespace"/><token type="Operator"/><token type="TextWhitespace"/><usingself state="generic"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups></rule>
<rule pattern="\b!"><token type="Operator"/></rule>
<rule pattern="\bnot\b"><token type="OperatorWord"/></rule>
<rule pattern="([\w.\&#x27;&quot;]+)(\s+)(contains)(\s+)([\w.\&#x27;&quot;]+)"><bygroups><usingself state="generic"/><token type="TextWhitespace"/><token type="OperatorWord"/><token type="TextWhitespace"/><usingself state="generic"/></bygroups></rule>
<rule><include state="generic"/></rule>
<rule><include state="whitespace"/></rule>
</state>
<state name="generic-value">
<rule><include state="generic"/></rule>
<rule><include state="end-at-whitespace"/></rule>
</state>
<state name="operator">
<rule pattern="(\s*)((=|!|&gt;|&lt;)=?)(\s*)"><bygroups><token type="TextWhitespace"/><token type="Operator"/>None<token type="TextWhitespace"/></bygroups><pop depth="1"/></rule>
<rule pattern="(\s*)(\bcontains\b)(\s*)"><bygroups><token type="TextWhitespace"/><token type="OperatorWord"/><token type="TextWhitespace"/></bygroups><pop depth="1"/></rule>
</state>
<state name="end-of-tag">
<rule pattern="\}\}"><token type="Punctuation"/><pop depth="1"/></rule>
</state>
<state name="end-of-block">
<rule pattern="%\}"><token type="Punctuation"/><push state="#pop" state="#pop"/></rule>
</state>
<state name="end-at-whitespace">
<rule pattern="\s+"><token type="TextWhitespace"/><pop depth="1"/></rule>
</state>
<state name="param-markup">
<rule><include state="whitespace"/></rule>
<rule pattern="([^\s=:]+)(\s*)(=|:)"><bygroups><token type="NameAttribute"/><token type="TextWhitespace"/><token type="Operator"/></bygroups></rule>
<rule pattern="(\{\{)(\s*)([^\s}])(\s*)(\}\})"><bygroups><token type="Punctuation"/><token type="TextWhitespace"/><usingself state="variable"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups></rule>
<rule><include state="string"/></rule>
<rule><include state="number"/></rule>
<rule><include state="keyword"/></rule>
<rule pattern=","><token type="Punctuation"/></rule>
</state>
<state name="default-param-markup">
<rule><include state="param-markup"/></rule>
<rule pattern="."><token type="Text"/></rule>
</state>
<state name="variable-param-markup">
<rule><include state="param-markup"/></rule>
<rule><include state="variable"/></rule>
<rule pattern="."><token type="Text"/></rule>
</state>
<state name="tag-markup">
<rule pattern="%\}"><token type="Punctuation"/><push state="#pop" state="#pop"/></rule>
<rule><include state="default-param-markup"/></rule>
</state>
<state name="variable-tag-markup">
<rule pattern="%\}"><token type="Punctuation"/><push state="#pop" state="#pop"/></rule>
<rule><include state="variable-param-markup"/></rule>
</state>
<state name="keyword">
<rule pattern="\b(false|true)\b"><token type="KeywordConstant"/></rule>
</state>
<state name="variable">
<rule pattern="[a-zA-Z_]\w*"><token type="NameVariable"/></rule>
<rule pattern="(?&lt;=\w)\.(?=\w)"><token type="Punctuation"/></rule>
</state>
<state name="string">
<rule pattern="&#x27;[^&#x27;]*&#x27;"><token type="LiteralStringSingle"/></rule>
<rule pattern="&quot;[^&quot;]*&quot;"><token type="LiteralStringDouble"/></rule>
</state>
<state name="number">
<rule pattern="\d+\.\d+"><token type="LiteralNumberFloat"/></rule>
<rule pattern="\d+"><token type="LiteralNumberInteger"/></rule>
</state>
<state name="generic">
<rule><include state="keyword"/></rule>
<rule><include state="string"/></rule>
<rule><include state="number"/></rule>
<rule><include state="variable"/></rule>
</state>
<state name="whitespace">
<rule pattern="[ \t]+"><token type="TextWhitespace"/></rule>
</state>
<state name="comment">
<rule pattern="(\{%)(\s*)(endcomment)(\s*)(%\})"><bygroups><token type="Punctuation"/><token type="TextWhitespace"/><token type="NameTag"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><push state="#pop" state="#pop"/></rule>
<rule pattern="."><token type="Comment"/></rule>
</state>
<state name="raw">
<rule pattern="[^{]+"><token type="Text"/></rule>
<rule pattern="(\{%)(\s*)(endraw)(\s*)(%\})"><bygroups><token type="Punctuation"/><token type="TextWhitespace"/><token type="NameTag"/><token type="TextWhitespace"/><token type="Punctuation"/></bygroups><pop depth="1"/></rule>
<rule pattern="\{"><token type="Text"/></rule>
</state>
</rules>
</lexer>

55
lexers/VelocityLexer.xml Normal file

@ -0,0 +1,55 @@
<lexer>
<config>
<name>Velocity</name>
<alias>velocity</alias>
<filename>*.vm</filename>
<filename>*.fhtml</filename>
<dot_all>true</dot_all>
</config>
<rules>
<state name="root">
<rule pattern="[^{#$]+"><token type="Other"/></rule>
<rule pattern="(#)(\*.*?\*)(#)"><bygroups><token type="CommentPreproc"/><token type="Comment"/><token type="CommentPreproc"/></bygroups></rule>
<rule pattern="(##)(.*?$)"><bygroups><token type="CommentPreproc"/><token type="Comment"/></bygroups></rule>
<rule pattern="(#\{?)([a-zA-Z_]\w*)(\}?)(\s?\()"><bygroups><token type="CommentPreproc"/><token type="NameFunction"/><token type="CommentPreproc"/><token type="Punctuation"/></bygroups><push state="directiveparams"/></rule>
<rule pattern="(#\{?)([a-zA-Z_]\w*)(\}|\b)"><bygroups><token type="CommentPreproc"/><token type="NameFunction"/><token type="CommentPreproc"/></bygroups></rule>
<rule pattern="\$!?\{?"><token type="Punctuation"/><push state="variable"/></rule>
</state>
<state name="variable">
<rule pattern="[a-zA-Z_]\w*"><token type="NameVariable"/></rule>
<rule pattern="\("><token type="Punctuation"/><push state="funcparams"/></rule>
<rule pattern="(\.)([a-zA-Z_]\w*)"><bygroups><token type="Punctuation"/><token type="NameVariable"/></bygroups><push/></rule>
<rule pattern="\}"><token type="Punctuation"/><pop depth="1"/></rule>
<rule><pop depth="1"/></rule>
</state>
<state name="directiveparams">
<rule pattern="(&amp;&amp;|\|\||==?|!=?|[-&lt;&gt;+*%&amp;|^/])|\b(eq|ne|gt|lt|ge|le|not|in)\b"><token type="Operator"/></rule>
<rule pattern="\["><token type="Operator"/><push state="rangeoperator"/></rule>
<rule pattern="\b[a-zA-Z_]\w*\b"><token type="NameFunction"/></rule>
<rule><include state="funcparams"/></rule>
</state>
<state name="rangeoperator">
<rule pattern="\.\."><token type="Operator"/></rule>
<rule><include state="funcparams"/></rule>
<rule pattern="\]"><token type="Operator"/><pop depth="1"/></rule>
</state>
<state name="funcparams">
<rule pattern="\$!?\{?"><token type="Punctuation"/><push state="variable"/></rule>
<rule pattern="\s+"><token type="Text"/></rule>
<rule pattern="[,:]"><token type="Punctuation"/></rule>
<rule pattern="&quot;(\\\\|\\[^\\]|[^&quot;\\])*&quot;"><token type="LiteralStringDouble"/></rule>
<rule pattern="&#x27;(\\\\|\\[^\\]|[^&#x27;\\])*&#x27;"><token type="LiteralStringSingle"/></rule>
<rule pattern="0[xX][0-9a-fA-F]+[Ll]?"><token type="LiteralNumber"/></rule>
<rule pattern="\b[0-9]+\b"><token type="LiteralNumber"/></rule>
<rule pattern="(true|false|null)\b"><token type="KeywordConstant"/></rule>
<rule pattern="\("><token type="Punctuation"/><push/></rule>
<rule pattern="\)"><token type="Punctuation"/><pop depth="1"/></rule>
<rule pattern="\{"><token type="Punctuation"/><push/></rule>
<rule pattern="\}"><token type="Punctuation"/><pop depth="1"/></rule>
<rule pattern="\["><token type="Punctuation"/><push/></rule>
<rule pattern="\]"><token type="Punctuation"/><pop depth="1"/></rule>
</state>
</rules>
</lexer>

22
lexers/bbcode.xml Normal file

@ -0,0 +1,22 @@
<lexer>
<config>
<name>BBCode</name>
<alias>bbcode</alias>
<mime_type>text/x-bbcode</mime_type>
</config>
<rules>
<state name="root">
<rule pattern="[^[]+"><token type="Text"/></rule>
<rule pattern="\[/?\w+"><token type="Keyword"/><push state="tag"/></rule>
<rule pattern="\["><token type="Text"/></rule>
</state>
<state name="tag">
<rule pattern="\s+"><token type="Text"/></rule>
<rule pattern="(\w+)(=)(&quot;?[^\s&quot;\]]+&quot;?)"><bygroups><token type="NameAttribute"/><token type="Operator"/><token type="LiteralString"/></bygroups></rule>
<rule pattern="(=)(&quot;?[^\s&quot;\]]+&quot;?)"><bygroups><token type="Operator"/><token type="LiteralString"/></bygroups></rule>
<rule pattern="\]"><token type="Keyword"/><pop depth="1"/></rule>
</state>
</rules>
</lexer>

View File

@ -3,6 +3,7 @@
<name>Groff</name>
<alias>groff</alias>
<alias>nroff</alias>
<alias>roff</alias>
<alias>man</alias>
<filename>*.[1-9]</filename>
<filename>*.1p</filename>

View File

@ -30,12 +30,12 @@
disambiguations:
- extensions: ['.1', '.2', '.3', '.4', '.5', '.6', '.7', '.8', '.9']
rules:
- language: Roff Manpage
- language: man
and:
- named_pattern: mdoc-date
- named_pattern: mdoc-title
- named_pattern: mdoc-heading
- language: Roff Manpage
- language: man
and:
- named_pattern: man-title
- named_pattern: man-heading
@ -43,12 +43,12 @@ disambiguations:
pattern: '^\.(?:[A-Za-z]{2}(?:\s|$)|\\")'
- extensions: ['.1in', '.1m', '.1x', '.3in', '.3m', '.3p', '.3pm', '.3qt', '.3x', '.man', '.mdoc']
rules:
- language: Roff Manpage
- language: man
and:
- named_pattern: mdoc-date
- named_pattern: mdoc-title
- named_pattern: mdoc-heading
- language: Roff Manpage
- language: man
and:
- named_pattern: man-title
- named_pattern: man-heading
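
The rules above rename the result language to `man`, which matches an alias of the Groff lexer. A minimal sketch of how such a rule list could be evaluated, assuming rules are tried in order and a rule matches only when every pattern under `and` is found. The two regexes are simplified, hypothetical stand-ins for linguist's named patterns, which this diff does not show, and `disambiguate` is an illustrative name:

```python
import re

# Hypothetical, simplified stand-ins for the named patterns referenced
# by the rules; the real definitions are not part of this diff.
NAMED_PATTERNS = {
    "man-title": re.compile(r"^\.TH\s", re.MULTILINE),
    "man-heading": re.compile(r"^\.SH\s", re.MULTILINE),
}

# One rule in the same shape as the YAML: a language plus patterns
# that must all match ("and").
RULES = [
    {"language": "man", "and": ["man-title", "man-heading"]},
]


def disambiguate(text, rules=RULES):
    """Return the language of the first rule whose patterns all match."""
    for rule in rules:
        if all(NAMED_PATTERNS[name].search(text) for name in rule["and"]):
            return rule["language"]
    return None
```

Returning the lexer alias directly means the caller can pass the result straight to a lookup by name, with no translation table in between.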

56
lexers/markdown.xml Normal file

@ -0,0 +1,56 @@
<lexer>
<config>
<name>Markdown</name>
<alias>markdown</alias>
<alias>md</alias>
<filename>*.md</filename>
<filename>*.markdown</filename>
<mime_type>text/x-markdown</mime_type>
</config>
<rules>
<state name="root">
<rule pattern="(^#[^#].+)(\n)"><bygroups><token type="GenericHeading"/><token type="Text"/></bygroups></rule>
<rule pattern="(^#{2,6}[^#].+)(\n)"><bygroups><token type="GenericSubheading"/><token type="Text"/></bygroups></rule>
<rule pattern="^(.+)(\n)(=+)(\n)"><bygroups><token type="GenericHeading"/><token type="Text"/><token type="GenericHeading"/><token type="Text"/></bygroups></rule>
<rule pattern="^(.+)(\n)(-+)(\n)"><bygroups><token type="GenericSubheading"/><token type="Text"/><token type="GenericSubheading"/><token type="Text"/></bygroups></rule>
<rule pattern="^(\s*)([*-] )(\[[ xX]\])( .+\n)"><bygroups><token type="TextWhitespace"/><token type="Keyword"/><token type="Keyword"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^(\s*)([*-])(\s)(.+\n)"><bygroups><token type="TextWhitespace"/><token type="Keyword"/><token type="TextWhitespace"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^(\s*)([0-9]+\.)( .+\n)"><bygroups><token type="TextWhitespace"/><token type="Keyword"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^(\s*&gt;\s)(.+\n)"><bygroups><token type="Keyword"/><token type="GenericEmph"/></bygroups></rule>
<rule pattern="^(```\n)([\w\W]*?)(^```$)">
<bygroups>
<token type="LiteralStringBacktick"/>
<token type="Text"/>
<token type="LiteralStringBacktick"/>
</bygroups>
</rule>
<rule pattern="^(```)(\w+)(\n)([\w\W]*?)(^```$)">
<bygroups>
<token type="LiteralStringBacktick"/>
<token type="NameLabel"/>
<token type="TextWhitespace"/>
<UsingByGroup lexer="2" content="4"/>
<token type="LiteralStringBacktick"/>
</bygroups>
</rule>
<rule><include state="inline"/></rule>
</state>
<state name="inline">
<rule pattern="\\."><token type="Text"/></rule>
<rule pattern="([^`]?)(`[^`\n]+`)"><bygroups><token type="Text"/><token type="LiteralStringBacktick"/></bygroups></rule>
<rule pattern="([^\*]?)(\*\*[^* \n][^*\n]*\*\*)"><bygroups><token type="Text"/><token type="GenericStrong"/></bygroups></rule>
<rule pattern="([^_]?)(__[^_ \n][^_\n]*__)"><bygroups><token type="Text"/><token type="GenericStrong"/></bygroups></rule>
<rule pattern="([^\*]?)(\*[^* \n][^*\n]*\*)"><bygroups><token type="Text"/><token type="GenericEmph"/></bygroups></rule>
<rule pattern="([^_]?)(_[^_ \n][^_\n]*_)"><bygroups><token type="Text"/><token type="GenericEmph"/></bygroups></rule>
<rule pattern="([^~]?)(~~[^~ \n][^~\n]*~~)"><bygroups><token type="Text"/><token type="GenericDeleted"/></bygroups></rule>
<rule pattern="[@#][\w/:]+"><token type="NameEntity"/></rule>
<rule pattern="(!?\[)([^]]+)(\])(\()([^)]+)(\))"><bygroups><token type="Text"/><token type="NameTag"/><token type="Text"/><token type="Text"/><token type="NameAttribute"/><token type="Text"/></bygroups></rule>
<rule pattern="(\[)([^]]+)(\])(\[)([^]]*)(\])"><bygroups><token type="Text"/><token type="NameTag"/><token type="Text"/><token type="Text"/><token type="NameLabel"/><token type="Text"/></bygroups></rule>
<rule pattern="^(\s*\[)([^]]*)(\]:\s*)(.+)"><bygroups><token type="Text"/><token type="NameLabel"/><token type="Text"/><token type="NameAttribute"/></bygroups></rule>
<rule pattern="[^\\\s]+"><token type="Text"/></rule>
<rule pattern="."><token type="Text"/></rule>
</state>
</rules>
</lexer>
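
In the second fenced-code rule, `<UsingByGroup lexer="2" content="4"/>` points at regex groups: group 2 names the nested lexer and group 4 holds the text handed to it. A minimal Python sketch of how those groups fall out of the pattern, assuming `^` and `$` match at line boundaries (multiline mode); `split_fence` is a hypothetical helper, not part of Tartrazine:

```python
import re

# Pattern copied from the second fenced-code rule in lexers/markdown.xml.
# Assumption: "^" and "$" match at line boundaries (multiline mode).
FENCE = re.compile(r"^(```)(\w+)(\n)([\w\W]*?)(^```$)", re.MULTILINE)


def split_fence(text):
    """Return (lexer_name, content) for the first fenced block, or None."""
    match = FENCE.search(text)
    if match is None:
        return None
    # lexer="2"   -> group 2 names the nested lexer
    # content="4" -> group 4 is the text that lexer should tokenize
    return match.group(2), match.group(4)


sample = "Intro\n```python\nprint(1)\n```\n"
```

For `sample`, the helper yields the language tag and the fenced body, which is the pair a delegating action needs to look up a lexer and run it on the content.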

34
lexers/moinwiki.xml Normal file

@ -0,0 +1,34 @@
<lexer>
<config>
<name>MoinMoin/Trac Wiki markup</name>
<alias>trac-wiki</alias>
<alias>moin</alias>
<mime_type>text/x-trac-wiki</mime_type>
<case_insensitive>true</case_insensitive>
</config>
<rules>
<state name="root">
<rule pattern="^#.*$"><token type="Comment"/></rule>
<rule pattern="(!)(\S+)"><bygroups><token type="Keyword"/><token type="Text"/></bygroups></rule>
<rule pattern="^(=+)([^=]+)(=+)(\s*#.+)?$"><bygroups><token type="GenericHeading"/><usingself state="root"/><token type="GenericHeading"/><token type="LiteralString"/></bygroups></rule>
<rule pattern="(\{\{\{)(\n#!.+)?"><bygroups><token type="NameBuiltin"/><token type="NameNamespace"/></bygroups><push state="codeblock"/></rule>
<rule pattern="(\&#x27;\&#x27;\&#x27;?|\|\||`|__|~~|\^|,,|::)"><token type="Comment"/></rule>
<rule pattern="^( +)([.*-])( )"><bygroups><token type="Text"/><token type="NameBuiltin"/><token type="Text"/></bygroups></rule>
<rule pattern="^( +)([a-z]{1,5}\.)( )"><bygroups><token type="Text"/><token type="NameBuiltin"/><token type="Text"/></bygroups></rule>
<rule pattern="\[\[\w+.*?\]\]"><token type="Keyword"/></rule>
<rule pattern="(\[[^\s\]]+)(\s+[^\]]+?)?(\])"><bygroups><token type="Keyword"/><token type="LiteralString"/><token type="Keyword"/></bygroups></rule>
<rule pattern="^----+$"><token type="Keyword"/></rule>
<rule pattern="[^\n\&#x27;\[{!_~^,|]+"><token type="Text"/></rule>
<rule pattern="\n"><token type="Text"/></rule>
<rule pattern="."><token type="Text"/></rule>
</state>
<state name="codeblock">
<rule pattern="\}\}\}"><token type="NameBuiltin"/><pop depth="1"/></rule>
<rule pattern="\{\{\{"><token type="Text"/><push/></rule>
<rule pattern="[^{}]+"><token type="CommentPreproc"/></rule>
<rule pattern="."><token type="CommentPreproc"/></rule>
</state>
</rules>
</lexer>

76
lexers/rst.xml Normal file

@ -0,0 +1,76 @@
<lexer>
<config>
<name>reStructuredText</name>
<alias>restructuredtext</alias>
<alias>rst</alias>
<alias>rest</alias>
<filename>*.rst</filename>
<filename>*.rest</filename>
<mime_type>text/x-rst</mime_type>
<mime_type>text/prs.fallenstein.rst</mime_type>
</config>
<rules>
<state name="root">
<rule pattern="^(=+|-+|`+|:+|\.+|\&#x27;+|&quot;+|~+|\^+|_+|\*+|\++|#+)([ \t]*\n)(.+)(\n)(\1)(\n)"><bygroups><token type="GenericHeading"/><token type="Text"/><token type="GenericHeading"/><token type="Text"/><token type="GenericHeading"/><token type="Text"/></bygroups></rule>
<rule pattern="^(\S.*)(\n)(={3,}|-{3,}|`{3,}|:{3,}|\.{3,}|\&#x27;{3,}|&quot;{3,}|~{3,}|\^{3,}|_{3,}|\*{3,}|\+{3,}|#{3,})(\n)"><bygroups><token type="GenericHeading"/><token type="Text"/><token type="GenericHeading"/><token type="Text"/></bygroups></rule>
<rule pattern="^(\s*)([-*+])( .+\n(?:\1 .+\n)*)"><bygroups><token type="Text"/><token type="LiteralNumber"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^(\s*)([0-9#ivxlcmIVXLCM]+\.)( .+\n(?:\1 .+\n)*)"><bygroups><token type="Text"/><token type="LiteralNumber"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^(\s*)(\(?[0-9#ivxlcmIVXLCM]+\))( .+\n(?:\1 .+\n)*)"><bygroups><token type="Text"/><token type="LiteralNumber"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^(\s*)([A-Z]+\.)( .+\n(?:\1 .+\n)+)"><bygroups><token type="Text"/><token type="LiteralNumber"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^(\s*)(\(?[A-Za-z]+\))( .+\n(?:\1 .+\n)+)"><bygroups><token type="Text"/><token type="LiteralNumber"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^(\s*)(\|)( .+\n(?:\| .+\n)*)"><bygroups><token type="Text"/><token type="Operator"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^( *\.\.)(\s*)((?:source)?code(?:-block)?)(::)([ \t]*)([^\n]+)(\n[ \t]*\n)([ \t]+)(.*)(\n)((?:(?:\8.*)?\n)+)">
<bygroups>
<token type="Punctuation"/>
<token type="Text"/>
<token type="OperatorWord"/>
<token type="Punctuation"/>
<token type="Text"/>
<token type="Keyword"/>
<token type="Text"/>
<token type="Text"/>
<UsingByGroup lexer="6" content="9,10,11"/>
</bygroups>
</rule>
<rule pattern="^( *\.\.)(\s*)([\w:-]+?)(::)(?:([ \t]*)(.*))">
<bygroups>
<token type="Punctuation"/>
<token type="Text"/>
<token type="OperatorWord"/>
<token type="Punctuation"/>
<token type="Text"/>
<usingself state="inline"/>
</bygroups>
</rule>
<rule pattern="^( *\.\.)(\s*)(_(?:[^:\\]|\\.)+:)(.*?)$"><bygroups><token type="Punctuation"/><token type="Text"/><token type="NameTag"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^( *\.\.)(\s*)(\[.+\])(.*?)$"><bygroups><token type="Punctuation"/><token type="Text"/><token type="NameTag"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^( *\.\.)(\s*)(\|.+\|)(\s*)([\w:-]+?)(::)(?:([ \t]*)(.*))"><bygroups><token type="Punctuation"/><token type="Text"/><token type="NameTag"/><token type="Text"/><token type="OperatorWord"/><token type="Punctuation"/><token type="Text"/><usingself state="inline"/></bygroups></rule>
<rule pattern="^ *\.\..*(\n( +.*\n|\n)+)?"><token type="Comment"/></rule>
<rule pattern="^( *)(:(?:\\\\|\\:|[^:\n])+:(?=\s))([ \t]*)"><bygroups><token type="Text"/><token type="NameClass"/><token type="Text"/></bygroups></rule>
<rule pattern="^(\S.*(?&lt;!::)\n)((?:(?: +.*)\n)+)"><bygroups><usingself state="inline"/><usingself state="inline"/></bygroups></rule>
<rule pattern="(::)(\n[ \t]*\n)([ \t]+)(.*)(\n)((?:(?:\3.*)?\n)+)"><bygroups><token type="LiteralStringEscape"/><token type="Text"/><token type="LiteralString"/><token type="LiteralString"/><token type="Text"/><token type="LiteralString"/></bygroups></rule>
<rule><include state="inline"/></rule>
</state>
<state name="inline">
<rule pattern="\\."><token type="Text"/></rule>
<rule pattern="``"><token type="LiteralString"/><push state="literal"/></rule>
<rule pattern="(`.+?)(&lt;.+?&gt;)(`__?)"><bygroups><token type="LiteralString"/><token type="LiteralStringInterpol"/><token type="LiteralString"/></bygroups></rule>
<rule pattern="`.+?`__?"><token type="LiteralString"/></rule>
<rule pattern="(`.+?`)(:[a-zA-Z0-9:-]+?:)?"><bygroups><token type="NameVariable"/><token type="NameAttribute"/></bygroups></rule>
<rule pattern="(:[a-zA-Z0-9:-]+?:)(`.+?`)"><bygroups><token type="NameAttribute"/><token type="NameVariable"/></bygroups></rule>
<rule pattern="\*\*.+?\*\*"><token type="GenericStrong"/></rule>
<rule pattern="\*.+?\*"><token type="GenericEmph"/></rule>
<rule pattern="\[.*?\]_"><token type="LiteralString"/></rule>
<rule pattern="&lt;.+?&gt;"><token type="NameTag"/></rule>
<rule pattern="[^\\\n\[*`:]+"><token type="Text"/></rule>
<rule pattern="."><token type="Text"/></rule>
</state>
<state name="literal">
<rule pattern="[^`]+"><token type="LiteralString"/></rule>
<rule pattern="``((?=$)|(?=[-/:.,; \n\x00 &#x27;&quot;\)\]\}&gt;’”»!\?]))"><token type="LiteralString"/><pop depth="1"/></rule>
<rule pattern="`"><token type="LiteralString"/></rule>
</state>
</rules>
</lexer>
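
The code-block rule uses `<UsingByGroup lexer="6" content="9,10,11"/>`, so the content spans several groups. The sketch below shows which text those groups capture, assuming multiline matching and assuming the listed groups are concatenated in order before being passed to the nested lexer (the code that consumes `content` is not visible in this diff). `code_block_parts` is a hypothetical helper:

```python
import re

# Pattern copied from the code-block rule in lexers/rst.xml.
# Assumption: "^" matches at line starts (multiline mode).
CODE_BLOCK = re.compile(
    r"^( *\.\.)(\s*)((?:source)?code(?:-block)?)(::)([ \t]*)([^\n]+)"
    r"(\n[ \t]*\n)([ \t]+)(.*)(\n)((?:(?:\8.*)?\n)+)",
    re.MULTILINE,
)


def code_block_parts(text, lexer_group=6, content_groups=(9, 10, 11)):
    """Return (lexer_name, joined_content) for the first code block."""
    match = CODE_BLOCK.search(text)
    if match is None:
        return None
    # Assumed behaviour: the content groups are joined in the listed order.
    content = "".join(match.group(i) for i in content_groups)
    return match.group(lexer_group), content


sample = ".. code-block:: python\n\n    x = 1\n    y = 2\n"
```

Group 8 captures the indentation of the first code line, so the joined content starts without it, while later lines keep their leading spaces.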

View File

@ -40,15 +40,18 @@ for fname in glob.glob("lexers/*.xml"):
with open("src/constants/lexers.cr", "w") as f:
f.write("module Tartrazine\n")
f.write(" LEXERS_BY_NAME = {\n")
for k, v in lexer_by_name.items():
for k in sorted(lexer_by_name.keys()):
v = lexer_by_name[k]
f.write(f'"{k}" => "{v}", \n')
f.write("}\n")
f.write(" LEXERS_BY_MIMETYPE = {\n")
for k, v in lexer_by_mimetype.items():
for k in sorted(lexer_by_mimetype.keys()):
v = lexer_by_mimetype[k]
f.write(f'"{k}" => "{v}", \n')
f.write("}\n")
f.write(" LEXERS_BY_FILENAME = {\n")
for k, v in lexer_by_filename.items():
for k in sorted(lexer_by_filename.keys()):
v = lexer_by_filename[k]
f.write(f'"{k}" => {str(list(v)).replace("'", "\"")}, \n')
f.write("}\n")
f.write("end\n")
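
Iterating over `sorted(...keys())` makes the generated constants independent of the order in which lexer files were read, so regenerating the file only changes lines for lexers that were actually added or removed. A minimal sketch of the idea; `render_mapping` and the sample data are hypothetical and not part of the script:

```python
def render_mapping(name, mapping):
    """Render a dict as a Crystal hash literal with keys in sorted order."""
    lines = [f"  {name} = {{"]
    for key in sorted(mapping):
        lines.append(f'    "{key}" => "{mapping[key]}",')
    lines.append("  }")
    return "\n".join(lines)


# The same data inserted in different orders renders identically.
first = {"markdown": "markdown", "bbcode": "bbcode", "rst": "rst"}
second = {"rst": "rst", "markdown": "markdown", "bbcode": "bbcode"}
```

With plain `dict.items()` the output follows insertion order, which here depends on the order `glob.glob` returns files.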

View File

@ -1,5 +1,5 @@
name: tartrazine
version: 0.4.0
version: 0.6.0
authors:
- Roberto Alsina <roberto.alsina@gmail.com>

View File

@ -72,8 +72,8 @@ end
# Helper that creates lexer and tokenizes
def tokenize(lexer_name, text)
lexer = Tartrazine.lexer(lexer_name)
lexer.tokenize(text)
tokenizer = Tartrazine.lexer(lexer_name).tokenizer(text)
Tartrazine::Lexer.collapse_tokens(tokenizer.to_a)
end
# Helper that tokenizes using chroma to validate the lexer
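
The updated `tokenize` helper materializes the tokenizer with `to_a` and then calls `Tartrazine::Lexer.collapse_tokens`, whose body is not part of this diff. The sketch below assumes it merges adjacent tokens of the same type and drops empty ones, so that iterator output can be compared against a reference tokenization. The `(type, value)` tuples are a hypothetical representation:

```python
def collapse_tokens(tokens):
    """Merge runs of adjacent tokens that share a type; drop empty values."""
    result = []
    for token_type, value in tokens:
        if value == "":
            continue
        if result and result[-1][0] == token_type:
            # Same type as the previous token: extend it instead of appending.
            result[-1] = (token_type, result[-1][1] + value)
        else:
            result.append((token_type, value))
    return result


raw = [("Text", "foo"), ("Text", " "), ("Keyword", "def"), ("Text", "")]
```

Collapsing first means two lexers that split the same text at different points can still compare equal token by token.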

View File

@ -8,19 +8,32 @@ require "./tartrazine"
# perform a list of actions. These actions can emit tokens
# or change the state machine.
module Tartrazine
class Action
property type : String
property xml : XML::Node
enum ActionType
Bygroups
Combined
Include
Pop
Push
Token
Using
Usingbygroup
Usingself
end
struct Action
property actions : Array(Action) = [] of Action
property token_type : String = ""
property states_to_push : Array(String) = [] of String
property depth = 0
property lexer_name : String = ""
property states_to_combine : Array(String) = [] of String
@content_index : Array(Int32) = [] of Int32
@depth : Int32 = 0
@lexer_index : Int32 = 0
@lexer_name : String = ""
@states : Array(String) = [] of String
@states_to_push : Array(String) = [] of String
@token_type : String = ""
@type : ActionType = ActionType::Token
def initialize(@type : String, @xml : XML::Node?)
# Extract information from the XML node we will use later
def initialize(t : String, xml : XML::Node?)
@type = ActionType.parse(t.capitalize)
# Some actions may have actions in them, like this:
# <bygroups>
@ -31,61 +44,56 @@ module Tartrazine
#
# The token actions match with the first 2 groups in the regex
# the using action matches the 3rd and shunts it to another lexer
known_types = %w(token push pop bygroups using usingself include combined)
raise Exception.new(
"Unknown action type: #{@type}") unless known_types.includes? @type
@xml.children.each do |node|
xml.children.each do |node|
next unless node.element?
@actions << Action.new(node.name, node)
end
# Prefetch the attributes we need from the XML and keep them
case @type
when "token"
@token_type = xml["type"]? || ""
when "push"
when ActionType::Token
@token_type = xml["type"]
when ActionType::Push
@states_to_push = xml.attributes.select { |attrib|
attrib.name == "state"
}.map &.content || [] of String
when "pop"
@depth = xml["depth"]?.try &.to_i || 0
when "using"
@lexer_name = xml["lexer"]?.try &.downcase || ""
when "combined"
@states_to_combine = xml.attributes.select { |attrib|
}.map &.content
when ActionType::Pop
@depth = xml["depth"].to_i
when ActionType::Using
@lexer_name = xml["lexer"].downcase
when ActionType::Combined
@states = xml.attributes.select { |attrib|
attrib.name == "state"
}.map &.content
when ActionType::Usingbygroup
@lexer_index = xml["lexer"].to_i
@content_index = xml["content"].split(",").map(&.to_i)
end
end
# ameba:disable Metrics/CyclomaticComplexity
def emit(match : MatchData, lexer : Lexer, match_group = 0) : Array(Token)
case type
when "token"
def emit(match : MatchData, tokenizer : Tokenizer, match_group = 0) : Array(Token)
case @type
when ActionType::Token
raise Exception.new "Can't have a token without a match" if match.empty?
[Token.new(type: @token_type, value: String.new(match[match_group].value))]
when "push"
if @states_to_push.empty?
# Push without a state means push the current state
@states_to_push = [lexer.state_stack.last]
end
@states_to_push.each do |state|
if state == "#pop"
when ActionType::Push
to_push = @states_to_push.empty? ? [tokenizer.state_stack.last] : @states_to_push
to_push.each do |state|
if state == "#pop" && tokenizer.state_stack.size > 1
# Pop the state
lexer.state_stack.pop
tokenizer.state_stack.pop
else
# Really push
lexer.state_stack << state
tokenizer.state_stack << state
end
end
[] of Token
when "pop"
if lexer.state_stack.size > @depth
lexer.state_stack.pop(@depth)
end
when ActionType::Pop
to_pop = [@depth, tokenizer.state_stack.size - 1].min
tokenizer.state_stack.pop(to_pop)
[] of Token
when "bygroups"
when ActionType::Bygroups
# FIXME: handle
# ><bygroups>
# <token type="Punctuation"/>
@ -94,7 +102,7 @@ module Tartrazine
#
# where that None means skipping a group
#
raise Exception.new "Can't have a token without a match" if match.empty?
raise Exception.new "Can't have a token without a match" if match.nil?
# Each group matches an action. If the group match is empty,
# the action is skipped.
@ -103,33 +111,47 @@ module Tartrazine
begin
next if match[i + 1].size == 0
rescue IndexError
# No match for the last group
# FIXME: This should not actually happen
# No match for this group
next
end
result += e.emit(match, lexer, i + 1)
result += e.emit(match, tokenizer, i + 1)
end
result
when "using"
when ActionType::Using
# Shunt to another lexer entirely
return [] of Token if match.empty?
Tartrazine.lexer(@lexer_name).tokenize(String.new(match[match_group].value), usingself: true)
when "usingself"
Tartrazine.lexer(@lexer_name).tokenizer(
String.new(match[match_group].value),
secondary: true).to_a
when ActionType::Usingself
# Shunt to another copy of this lexer
return [] of Token if match.empty?
new_lexer = Lexer.from_xml(lexer.xml)
new_lexer.tokenize(String.new(match[match_group].value), usingself: true)
when "combined"
# Combine two states into one anonymous state
new_state = @states_to_combine.map { |name|
lexer.states[name]
tokenizer.lexer.tokenizer(
String.new(match[match_group].value),
secondary: true).to_a
when ActionType::Combined
# Combine two or more states into one anonymous state
new_state = @states.map { |name|
tokenizer.lexer.states[name]
}.reduce { |state1, state2|
state1 + state2
}
lexer.states[new_state.name] = new_state
lexer.state_stack << new_state.name
tokenizer.lexer.states[new_state.name] = new_state
tokenizer.state_stack << new_state.name
[] of Token
when ActionType::Usingbygroup
# Shunt to content-specified lexer
return [] of Token if match.empty?
content = ""
@content_index.each do |i|
content += String.new(match[i].value)
end
Tartrazine.lexer(String.new(match[@lexer_index].value)).tokenizer(
content,
secondary: true).to_a
else
raise Exception.new("Unhandled action type: #{type}")
raise Exception.new("Unknown action type: #{@type}")
end
end
end
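
The new `Pop` branch clamps the number of popped states so the bottom of the state stack always survives. A minimal standalone sketch of that clamping follows; `clamped_pop` is a hypothetical helper written for illustration, not part of Tartrazine, and it assumes a non-empty stack.

```crystal
# Hypothetical helper, for illustration only: pop up to `depth` states
# but always leave the bottom state on the stack.
# Assumes `stack` is not empty (otherwise the count would be negative).
def clamped_pop(stack : Array(String), depth : Int32) : Array(String)
  to_pop = Math.min(depth, stack.size - 1)
  stack.pop(to_pop) # Array#pop(n) removes the last n elements in place
  stack
end

stack = ["root", "string", "interpolation"]
clamped_pop(stack, 5) # asks for 5, but only 2 can go; "root" remains
```

The older branch skipped the pop entirely when the stack was not deeper than `@depth`; clamping instead pops as much as is safe.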


@ -3,7 +3,7 @@ module BytesRegex
class Regex
def initialize(pattern : String, multiline = false, dotall = false, ignorecase = false, anchored = false)
flags = LibPCRE2::UTF | LibPCRE2::DUPNAMES | LibPCRE2::UCP | LibPCRE2::NO_UTF_CHECK
flags = LibPCRE2::UTF | LibPCRE2::UCP | LibPCRE2::NO_UTF_CHECK
flags |= LibPCRE2::MULTILINE if multiline
flags |= LibPCRE2::DOTALL if dotall
flags |= LibPCRE2::CASELESS if ignorecase
@ -31,7 +31,6 @@ module BytesRegex
end
def match(str : Bytes, pos = 0) : Array(Match)
match = [] of Match
rc = LibPCRE2.match(
@re,
str,
@ -40,24 +39,25 @@ module BytesRegex
LibPCRE2::NO_UTF_CHECK,
@match_data,
nil)
if rc >= 0
if rc > 0
ovector = LibPCRE2.get_ovector_pointer(@match_data)
(0...rc).each do |i|
(0...rc).map do |i|
m_start = ovector[2 * i]
m_size = ovector[2 * i + 1] - m_start
if m_size == 0
m_end = ovector[2 * i + 1]
if m_start == m_end
m_value = Bytes.new(0)
else
m_value = str[m_start...m_start + m_size]
m_value = str[m_start...m_end]
end
match << Match.new(m_value, m_start, m_size)
Match.new(m_value, m_start, m_end - m_start)
end
else
[] of Match
end
match
end
end
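
The rewritten `match` reads each group as a (start, end) pair from the flat PCRE2 offset vector and derives the size from the two offsets. A rough sketch of that decoding on a plain array is below; `spans` is a hypothetical name, and `Int32` stands in for the pointer of unsigned offsets the real binding returns.

```crystal
# Hypothetical helper: decode a flat offset vector laid out as
# [start0, end0, start1, end1, ...] into {start, size} tuples.
def spans(ovector : Array(Int32), count : Int32) : Array(Tuple(Int32, Int32))
  (0...count).map do |i|
    m_start = ovector[2 * i]
    m_end = ovector[2 * i + 1]
    {m_start, m_end - m_start} # an empty group yields size 0
  end
end

# Group 0 spans bytes 0...5, group 1 is empty at offset 2
p spans([0, 5, 2, 2], 2)
```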
class Match
struct Match
property value : Bytes
property start : UInt64
property size : UInt64

File diff suppressed because it is too large


@ -12,6 +12,10 @@ module Tartrazine
property theme : Theme = Tartrazine.theme("default-dark")
# Format the text using the given lexer.
def format(text : String, lexer : Lexer, io : IO = nil) : Nil
raise Exception.new("Not implemented")
end
def format(text : String, lexer : Lexer) : String
raise Exception.new("Not implemented")
end


@ -7,17 +7,27 @@ module Tartrazine
def initialize(@theme : Theme = Tartrazine.theme("default-dark"), @line_numbers : Bool = false)
end
private def line_label(i : Int32) : String
"#{i + 1}".rjust(4).ljust(5)
end
def format(text : String, lexer : Lexer) : String
output = String.build do |outp|
lexer.group_tokens_in_lines(lexer.tokenize(text)).each_with_index do |line, i|
label = line_numbers? ? "#{i + 1}".rjust(4).ljust(5) : ""
outp << label
line.each do |token|
outp << colorize(token[:value], token[:type])
end
outp = String::Builder.new("")
format(text, lexer, outp)
outp.to_s
end
def format(text : String, lexer : BaseLexer, outp : IO) : Nil
tokenizer = lexer.tokenizer(text)
i = 0
outp << line_label(i) if line_numbers?
tokenizer.each do |token|
outp << colorize(token[:value], token[:type])
if token[:value].includes?("\n")
i += 1
outp << line_label(i) if line_numbers?
end
end
output
end
def colorize(text : String, token : String) : String


@ -1,5 +1,6 @@
require "../constants/token_abbrevs.cr"
require "../formatter"
require "html"
module Tartrazine
class Html < Formatter
@ -34,46 +35,52 @@ module Tartrazine
end
def format(text : String, lexer : Lexer) : String
text = format_text(text, lexer)
if standalone?
text = wrap_standalone(text)
end
text
outp = String::Builder.new("")
format(text, lexer, outp)
outp.to_s
end
def format(text : String, lexer : BaseLexer, io : IO) : Nil
pre, post = wrap_standalone
io << pre if standalone?
format_text(text, lexer, io)
io << post if standalone?
end
# Wrap text into a full HTML document, including the CSS for the theme
def wrap_standalone(text) : String
def wrap_standalone
output = String.build do |outp|
outp << "<!DOCTYPE html><html><head><style>"
outp << style_defs
outp << "</style></head><body>"
outp << text
outp << "</body></html>"
end
output
{output.to_s, "</body></html>"}
end
def format_text(text : String, lexer : Lexer) : String
lines = lexer.group_tokens_in_lines(lexer.tokenize(text))
output = String.build do |outp|
if surrounding_pre?
pre_style = wrap_long_lines? ? "style=\"white-space: pre-wrap; word-break: break-word;\"" : ""
outp << "<pre class=\"#{get_css_class("Background")}\" #{pre_style}>"
end
outp << "<code class=\"#{get_css_class("Background")}\">"
lines.each_with_index(offset: line_number_start - 1) do |line, i|
line_label = line_numbers? ? "#{i + 1}".rjust(4).ljust(5) : ""
line_class = highlighted?(i + 1) ? "class=\"#{get_css_class("LineHighlight")}\"" : ""
line_id = linkable_line_numbers? ? "id=\"#{line_number_id_prefix}#{i + 1}\"" : ""
outp << "<span #{line_id} #{line_class} style=\"user-select: none;\">#{line_label} </span>"
line.each do |token|
fragment = "<span class=\"#{get_css_class(token[:type])}\">#{token[:value]}</span>"
outp << fragment
end
end
outp << "</code></pre>"
private def line_label(i : Int32) : String
line_label = "#{i + 1}".rjust(4).ljust(5)
line_class = highlighted?(i + 1) ? "class=\"#{get_css_class("LineHighlight")}\"" : ""
line_id = linkable_line_numbers? ? "id=\"#{line_number_id_prefix}#{i + 1}\"" : ""
"<span #{line_id} #{line_class} style=\"user-select: none;\">#{line_label} </span>"
end
def format_text(text : String, lexer : BaseLexer, outp : IO)
tokenizer = lexer.tokenizer(text)
i = 0
if surrounding_pre?
pre_style = wrap_long_lines? ? "style=\"white-space: pre-wrap; word-break: break-word;\"" : ""
outp << "<pre class=\"#{get_css_class("Background")}\" #{pre_style}>"
end
output
outp << "<code class=\"#{get_css_class("Background")}\">"
outp << line_label(i) if line_numbers?
tokenizer.each do |token|
outp << "<span class=\"#{get_css_class(token[:type])}\">#{HTML.escape(token[:value])}</span>"
if token[:value].ends_with? "\n"
i += 1
outp << line_label(i) if line_numbers?
end
end
outp << "</code></pre>"
end
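
The streaming `format_text` now passes token values through `HTML.escape` before wrapping them in a `<span>`. A small sketch of why that matters, using the standard library's `HTML.escape`; the class name `k` is only a placeholder, not a value taken from the theme.

```crystal
require "html"

# A token value containing characters that are special in HTML
value = "if a < b && c > d"

# Without escaping, `<`, `>` and `&` would be interpreted as markup
fragment = "<span class=\"k\">#{HTML.escape(value)}</span>"
puts fragment
# prints: <span class="k">if a &lt; b &amp;&amp; c &gt; d</span>
```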
# ameba:disable Metrics/CyclomaticComplexity
@ -104,15 +111,17 @@ module Tartrazine
# Given a token type, return the CSS class to use.
def get_css_class(token : String) : String
return class_prefix + Abbreviations[token] if theme.styles.has_key?(token)
# Themes don't contain information for each specific
# token type. However, they may contain information
# for a parent style. Worst case, we go to the root
# (Background) style.
class_prefix + Abbreviations[theme.style_parents(token).reverse.find { |parent|
theme.styles.has_key?(parent)
}]
if !theme.styles.has_key? token
# Themes don't contain information for each specific
# token type. However, they may contain information
# for a parent style. Worst case, we go to the root
# (Background) style.
parent = theme.style_parents(token).reverse.find { |dad|
theme.styles.has_key?(dad)
}
theme.styles[token] = theme.styles[parent]
end
class_prefix + Abbreviations[token]
end
# Is this line in the highlighted ranges?


@ -4,8 +4,15 @@ module Tartrazine
class Json < Formatter
property name = "json"
def format(text : String, lexer : Lexer, _theme : Theme) : String
lexer.tokenize(text).to_json
def format(text : String, lexer : BaseLexer) : String
outp = String::Builder.new("")
format(text, lexer, outp)
outp.to_s
end
def format(text : String, lexer : BaseLexer, io : IO) : Nil
tokenizer = lexer.tokenizer(text)
io << Tartrazine::Lexer.collapse_tokens(tokenizer.to_a).to_json
end
end
end


@ -1,8 +1,12 @@
require "yaml"
module Tartrazine
# Use linguist's heuristics to disambiguate between languages
# Use linguist's heuristics to disambiguate between languages
# This is *shamelessly* stolen from https://github.com/github-linguist/linguist
# and ported to Crystal. Deepest thanks to the authors of Linguist
# for licensing it liberally.
#
# Consider this code (c) 2017 GitHub, Inc. even if I wrote it.
module Linguist
class Heuristic
include YAML::Serializable
@ -34,12 +38,13 @@ module Tartrazine
end
end
class Rule
class LangRule
include YAML::Serializable
property pattern : (String | Array(String))?
property negative_pattern : (String | Array(String))?
property named_pattern : String?
property and : Array(Rule)?
property and : Array(LangRule)?
property language : String | Array(String)?
# ameba:disable Metrics/CyclomaticComplexity
def match(content, named_patterns)
@ -68,17 +73,9 @@ module Tartrazine
result = p_arr.any? { |pat| ::Regex.new(pat).matches?(content) }
end
if and
result = and.as(Array(Rule)).all?(&.match(content, named_patterns))
result = and.as(Array(LangRule)).all?(&.match(content, named_patterns))
end
result
end
end
class LangRule < Rule
include YAML::Serializable
property language : String | Array(String)
end
end
# h = Tartrazine::Heuristic.from_yaml(File.read("heuristics/heuristics.yml"))
# p! h.run(ARGV[0], File.read(ARGV[0]))


@ -4,111 +4,169 @@ require "./constants/lexers"
module Tartrazine
class LexerFiles
extend BakedFileSystem
bake_folder "../lexers", __DIR__
end
# Get the lexer object for a language name
# FIXME: support mimetypes
def self.lexer(name : String? = nil, filename : String? = nil) : Lexer
if name.nil? && filename.nil?
def self.lexer(name : String? = nil, filename : String? = nil) : BaseLexer
return lexer_by_name(name) if name && name != "autodetect"
return lexer_by_filename(filename) if filename
Lexer.from_xml(LexerFiles.get("/#{LEXERS_BY_NAME["plaintext"]}.xml").gets_to_end)
end
private def self.lexer_by_name(name : String) : BaseLexer
lexer_file_name = LEXERS_BY_NAME.fetch(name.downcase, nil)
return create_delegating_lexer(name) if lexer_file_name.nil? && name.includes? "+"
raise Exception.new("Unknown lexer: #{name}") if lexer_file_name.nil?
Lexer.from_xml(LexerFiles.get("/#{lexer_file_name}.xml").gets_to_end)
end
private def self.lexer_by_filename(filename : String) : BaseLexer
candidates = Set(String).new
LEXERS_BY_FILENAME.each do |k, v|
candidates += v.to_set if File.match?(k, File.basename(filename))
end
case candidates.size
when 0
lexer_file_name = LEXERS_BY_NAME["plaintext"]
elsif name && name != "autodetect"
lexer_file_name = LEXERS_BY_NAME[name.downcase]
when 1
lexer_file_name = candidates.first
else
# Guess by filename
candidates = Set(String).new
LEXERS_BY_FILENAME.each do |k, v|
candidates += v.to_set if File.match?(k, File.basename(filename.to_s))
end
case candidates.size
when 0
lexer_file_name = LEXERS_BY_NAME["plaintext"]
when 1
lexer_file_name = candidates.first
else
raise Exception.new("Multiple lexers match the filename: #{candidates.to_a.join(", ")}")
lexer_file_name = self.lexer_by_content(filename)
begin
return self.lexer(lexer_file_name)
rescue ex : Exception
raise Exception.new("Multiple lexers match the filename: #{candidates.to_a.join(", ")}, heuristics suggest #{lexer_file_name} but there is no matching lexer.")
end
end
Lexer.from_xml(LexerFiles.get("/#{lexer_file_name}.xml").gets_to_end)
end
private def self.lexer_by_content(fname : String) : String?
h = Linguist::Heuristic.from_yaml(LexerFiles.get("/heuristics.yml").gets_to_end)
result = h.run(fname, File.read(fname))
case result
when Nil
raise Exception.new "No lexer found for #{fname}"
when String
result.as(String)
when Array(String)
result.first
end
end
private def self.create_delegating_lexer(name : String) : BaseLexer
language, root = name.split("+", 2)
language_lexer = lexer(language)
root_lexer = lexer(root)
DelegatingLexer.new(language_lexer, root_lexer)
end
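
`create_delegating_lexer` splits the combined name on the first `+` only. A minimal sketch of that split, with a hypothetical `split_lexer_name` helper that assumes the name contains at least one `+`:

```crystal
# Hypothetical helper mirroring the `name.split("+", 2)` call above.
# Assumes `name` contains at least one "+".
def split_lexer_name(name : String) : Tuple(String, String)
  language, root = name.split("+", 2)
  {language, root}
end

p split_lexer_name("jinja+yaml") # language "jinja", root "yaml"
p split_lexer_name("a+b+c")      # language "a", root "b+c"
```

Because the limit is 2, anything after the first `+` stays in the root name, so a root that itself contains `+` would go back through the same name lookup when `lexer(root)` is called.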
# Return a list of all lexers
def self.lexers : Array(String)
LEXERS_BY_NAME.keys.sort!
end
# A token, the output of the tokenizer
alias Token = NamedTuple(type: String, value: String)
abstract class BaseTokenizer
end
class Tokenizer < BaseTokenizer
include Iterator(Token)
property lexer : BaseLexer
property text : Bytes
property pos : Int32 = 0
@dq = Deque(Token).new
property state_stack = ["root"]
def initialize(@lexer : BaseLexer, text : String, secondary = false)
# Respect the `ensure_nl` config option
if text.size > 0 && text[-1] != '\n' && @lexer.config[:ensure_nl] && !secondary
text += "\n"
end
@text = text.to_slice
end
def next : Iterator::Stop | Token
if @dq.size > 0
return @dq.shift
end
if pos == @text.size
return stop
end
matched = false
while @pos < @text.size
@lexer.states[@state_stack.last].rules.each do |rule|
matched, new_pos, new_tokens = rule.match(@text, @pos, self)
if matched
@pos = new_pos
split_tokens(new_tokens).each { |token| @dq << token }
break
end
end
if !matched
if @text[@pos] == 10u8
@dq << {type: "Text", value: "\n"}
@state_stack = ["root"]
else
@dq << {type: "Error", value: String.new(@text[@pos..@pos])}
end
@pos += 1
break
end
end
self.next
end
# If a token contains newlines, split it into one token per line
def split_tokens(tokens : Array(Token)) : Array(Token)
split_tokens = [] of Token
tokens.each do |token|
if token[:value].includes?("\n")
values = token[:value].split("\n")
values.each_with_index do |value, index|
value += "\n" if index < values.size - 1
split_tokens << {type: token[:type], value: value}
end
else
split_tokens << token
end
end
split_tokens
end
end
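
The `split_tokens` step is what lets the formatters detect line ends by looking at each token's value. A standalone sketch of the same splitting for a single token is shown here; `split_on_newlines` is a hypothetical name, and the `Token` alias is repeated so the snippet runs on its own.

```crystal
alias Token = NamedTuple(type: String, value: String)

# Hypothetical helper: split one token at newlines, keeping the "\n"
# at the end of every piece except the last one.
def split_on_newlines(token : Token) : Array(Token)
  values = token[:value].split("\n")
  values.map_with_index do |value, index|
    value += "\n" if index < values.size - 1
    {type: token[:type], value: value}
  end
end

p split_on_newlines({type: "Comment", value: "line one\nline two"})
```

Note that a value ending in a newline produces a trailing piece with an empty value, since `split` keeps the empty string after the last separator.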
abstract class BaseLexer
property config = {
name: "",
priority: 0.0,
case_insensitive: false,
dot_all: false,
not_multiline: false,
ensure_nl: false,
}
property states = {} of String => State
def tokenizer(text : String, secondary = false) : BaseTokenizer
Tokenizer.new(self, text, secondary)
end
end
# This implements a lexer for Pygments RegexLexers as expressed
# in Chroma's XML serialization.
#
# For explanations on what actions and states do
# the Pygments documentation is a good place to start.
# https://pygments.org/docs/lexerdevelopment/
class Lexer
property config = {
name: "",
aliases: [] of String,
filenames: [] of String,
mime_types: [] of String,
priority: 0.0,
case_insensitive: false,
dot_all: false,
not_multiline: false,
ensure_nl: false,
}
property xml : String = ""
property states = {} of String => State
property state_stack = ["root"]
# Turn the text into a list of tokens. The `usingself` parameter
# is true when the lexer is being used to tokenize a string
# from a larger text that is already being tokenized.
# So, when it's true, we don't modify the text.
def tokenize(text : String, usingself = false) : Array(Token)
@state_stack = ["root"]
tokens = [] of Token
pos = 0
matched = false
# Respect the `ensure_nl` config option
if text.size > 0 && text[-1] != '\n' && config[:ensure_nl] && !usingself
text += "\n"
end
text_bytes = text.to_slice
# Loop through the text, applying rules
while pos < text_bytes.size
state = states[@state_stack.last]
# Log.trace { "Stack is #{@state_stack} State is #{state.name}, pos is #{pos}, text is #{text[pos..pos + 10]}" }
state.rules.each do |rule|
matched, new_pos, new_tokens = rule.match(text_bytes, pos, self)
if matched
# Move position forward, save the tokens,
# tokenize from the new position
# Log.trace { "MATCHED: #{rule.xml}" }
pos = new_pos
tokens += new_tokens
break
end
# Log.trace { "NOT MATCHED: #{rule.xml}" }
end
# If no rule matches, emit an error token
unless matched
if text_bytes[pos] == 10u8
# at EOL, reset state to "root"
tokens << {type: "Text", value: "\n"}
@state_stack = ["root"]
else
tokens << {type: "Error", value: String.new(text_bytes[pos..pos])}
end
pos += 1
end
end
Lexer.collapse_tokens(tokens)
end
class Lexer < BaseLexer
# Collapse consecutive tokens of the same type for easier comparison
# and smaller output
def self.collapse_tokens(tokens : Array(Tartrazine::Token)) : Array(Tartrazine::Token)
@ -131,34 +189,8 @@ module Tartrazine
result
end
# Group tokens into lines, splitting them when a newline is found
def group_tokens_in_lines(tokens : Array(Token)) : Array(Array(Token))
split_tokens = [] of Token
tokens.each do |token|
if token[:value].includes?("\n")
values = token[:value].split("\n")
values.each_with_index do |value, index|
value += "\n" if index < values.size - 1
split_tokens << {type: token[:type], value: value}
end
else
split_tokens << token
end
end
lines = [Array(Token).new]
split_tokens.each do |token|
lines.last << token
if token[:value].includes?("\n")
lines << Array(Token).new
end
end
lines
end
# ameba:disable Metrics/CyclomaticComplexity
def self.from_xml(xml : String) : Lexer
l = Lexer.new
l.xml = xml
lexer = XML.parse(xml).first_element_child
if lexer
config = lexer.children.find { |node|
@ -167,9 +199,6 @@ module Tartrazine
if config
l.config = {
name: xml_to_s(config, name) || "",
aliases: xml_to_a(config, _alias) || [] of String,
filenames: xml_to_a(config, filename) || [] of String,
mime_types: xml_to_a(config, mime_type) || [] of String,
priority: xml_to_f(config, priority) || 0.0,
not_multiline: xml_to_s(config, not_multiline) == "true",
dot_all: xml_to_s(config, dot_all) == "true",
@ -219,12 +248,66 @@ module Tartrazine
end
end
# A lexer that takes two lexers as arguments. A root lexer
# and a language lexer. Everything is scanned using the
# language lexer, afterwards all `Other` tokens are lexed
# using the root lexer.
#
# This is useful for things like template languages, where
# you have Jinja + HTML or Jinja + CSS and so on.
class DelegatingLexer < BaseLexer
property language_lexer : BaseLexer
property root_lexer : BaseLexer
def initialize(@language_lexer : BaseLexer, @root_lexer : BaseLexer)
end
def tokenizer(text : String, secondary = false) : DelegatingTokenizer
DelegatingTokenizer.new(self, text, secondary)
end
end
# This Tokenizer works with a DelegatingLexer. It first tokenizes
# using the language lexer, and "Other" tokens are tokenized using
# the root lexer.
class DelegatingTokenizer < BaseTokenizer
include Iterator(Token)
@dq = Deque(Token).new
@language_tokenizer : BaseTokenizer
def initialize(@lexer : DelegatingLexer, text : String, secondary = false)
# Respect the `ensure_nl` config option
if text.size > 0 && text[-1] != '\n' && @lexer.config[:ensure_nl] && !secondary
text += "\n"
end
@language_tokenizer = @lexer.language_lexer.tokenizer(text, true)
end
def next : Iterator::Stop | Token
if @dq.size > 0
return @dq.shift
end
token = @language_tokenizer.next
if token.is_a? Iterator::Stop
return stop
elsif token.as(Token).[:type] == "Other"
root_tokenizer = @lexer.root_lexer.tokenizer(token.as(Token).[:value], true)
root_tokenizer.each do |root_token|
@dq << root_token
end
else
@dq << token.as(Token)
end
self.next
end
end
# A Lexer state. A state has a name and a list of rules.
# The state machine has a state stack containing references
# to states to decide which rules to apply.
class State
struct State
property name : String = ""
property rules = [] of Rule
property rules = [] of BaseRule
def +(other : State)
new_state = State.new
@ -233,7 +316,4 @@ module Tartrazine
new_state
end
end
# A token, the output of the tokenizer
alias Token = NamedTuple(type: String, value: String)
end


@ -1,18 +1,6 @@
require "docopt"
require "./**"
# Performance data (in milliseconds):
#
# Docopt parsing: 0.5
# Instantiating a theme: 0.1
# Instantiating a formatter: 1.0
# Instantiating a lexer: 2.0
# Tokenizing crycco.cr: 16.0
# Formatting: 0.5
# I/O: 1.5
# ---------------------------------
# Total: 21.6
HELP = <<-HELP
tartrazine: a syntax highlighting tool
@ -32,7 +20,8 @@ Usage:
Options:
-f <formatter> Format to use (html, terminal, json)
-t <theme> Theme to use, see --list-themes [default: default-dark]
-l <lexer> Lexer (language) to use, see --list-lexers [default: autodetect]
-l <lexer> Lexer (language) to use, see --list-lexers. Use more than
one lexer with "+" (e.g. jinja+yaml) [default: autodetect]
-o <output> Output file. Default is stdout.
--standalone Generate a standalone HTML file, which includes
all style information. If not given, it will generate just
@ -89,20 +78,20 @@ if options["-f"]
if formatter.is_a?(Tartrazine::Html) && options["--css"]
File.open("#{options["-t"].as(String)}.css", "w") do |outf|
outf.puts formatter.style_defs
outf << formatter.style_defs
end
exit 0
end
lexer = Tartrazine.lexer(name: options["-l"].as(String), filename: options["FILE"].as(String))
input = File.open(options["FILE"].as(String)).gets_to_end
output = formatter.format(input, lexer)
if options["-o"].nil?
puts output
outf = STDOUT
else
File.open(options["-o"].as(String), "w") do |outf|
outf.puts output
end
outf = File.open(options["-o"].as(String), "w")
end
formatter.format(input, lexer, outf)
outf.close
end


@ -15,28 +15,11 @@ module Tartrazine
alias Match = BytesRegex::Match
alias MatchData = Array(Match)
class Rule
property pattern : Regex = Regex.new ""
property actions : Array(Action) = [] of Action
abstract struct BaseRule
abstract def match(text : Bytes, pos : Int32, tokenizer : Tokenizer) : Tuple(Bool, Int32, Array(Token))
abstract def initialize(node : XML::Node)
def match(text : Bytes, pos, lexer) : Tuple(Bool, Int32, Array(Token))
match = pattern.match(text, pos)
# We don't match if the match doesn't move the cursor
# because that causes infinite loops
return false, pos, [] of Token if match.empty? || match[0].size == 0
tokens = [] of Token
actions.each do |action|
tokens += action.emit(match, lexer)
end
return true, pos + match[0].size, tokens
end
def initialize(node : XML::Node, multiline, dotall, ignorecase)
pattern = node["pattern"]
pattern = "(?m)" + pattern if multiline
@pattern = Regex.new(pattern, multiline, dotall, ignorecase, true)
add_actions(node)
end
@actions : Array(Action) = [] of Action
def add_actions(node : XML::Node)
node.children.each do |child|
@ -46,14 +29,36 @@ module Tartrazine
end
end
struct Rule < BaseRule
property pattern : Regex = Regex.new ""
def match(text : Bytes, pos, tokenizer) : Tuple(Bool, Int32, Array(Token))
match = pattern.match(text, pos)
# No match
return false, pos, [] of Token if match.size == 0
return true, pos + match[0].size, @actions.flat_map(&.emit(match, tokenizer))
end
def initialize(node : XML::Node)
end
def initialize(node : XML::Node, multiline, dotall, ignorecase)
pattern = node["pattern"]
pattern = "(?m)" + pattern if multiline
@pattern = Regex.new(pattern, multiline, dotall, ignorecase, true)
add_actions(node)
end
end
# This rule includes another state. If any of the rules of the
# included state matches, this rule matches.
class IncludeStateRule < Rule
property state : String = ""
struct IncludeStateRule < BaseRule
@state : String = ""
def match(text, pos, lexer) : Tuple(Bool, Int32, Array(Token))
lexer.states[state].rules.each do |rule|
matched, new_pos, new_tokens = rule.match(text, pos, lexer)
def match(text : Bytes, pos : Int32, tokenizer : Tokenizer) : Tuple(Bool, Int32, Array(Token))
tokenizer.@lexer.states[@state].rules.each do |rule|
matched, new_pos, new_tokens = rule.match(text, pos, tokenizer)
return true, new_pos, new_tokens if matched
end
return false, pos, [] of Token
@ -69,13 +74,11 @@ module Tartrazine
end
# This rule always matches, unconditionally
class UnconditionalRule < Rule
def match(text, pos, lexer) : Tuple(Bool, Int32, Array(Token))
tokens = [] of Token
actions.each do |action|
tokens += action.emit([] of Match, lexer)
end
return true, pos, tokens
struct UnconditionalRule < BaseRule
NO_MATCH = [] of Match
def match(text, pos, tokenizer) : Tuple(Bool, Int32, Array(Token))
return true, pos, @actions.flat_map(&.emit(NO_MATCH, tokenizer))
end
def initialize(node : XML::Node)


@ -9,7 +9,7 @@ require "xml"
module Tartrazine
alias Color = Sixteen::Color
class ThemeFiles
struct ThemeFiles
extend BakedFileSystem
bake_folder "../styles", __DIR__
end
@ -39,7 +39,7 @@ module Tartrazine
themes.to_a.sort!
end
class Style
struct Style
# These properties are tri-state.
# true means it's set
# false means it's not set
@ -79,7 +79,7 @@ module Tartrazine
end
end
class Theme
struct Theme
property name : String = ""
property styles = {} of String => Style

13485
x2.html Normal file

File diff suppressed because it is too large