32 Commits

SHA1 Message Date
10842f7074 v0.5.0 2024-08-16 19:38:40 -03:00
ae03e4612e todo management 2024-08-16 14:05:34 -03:00
471b2f5050 updated 2024-08-16 14:03:05 -03:00
5a3b08e716 lint 2024-08-16 14:01:16 -03:00
9ebb9f2765 Fix off-by-1 2024-08-16 13:36:11 -03:00
7538fc76aa Tokenize via an iterator, makes everything much faster 2024-08-16 13:27:02 -03:00
788577b226 Fix comment 2024-08-15 23:56:52 -03:00
1f01146b1f Minor cleanup 2024-08-15 23:21:21 -03:00
9041b763ea Remove unused bits of lexer config 2024-08-15 23:17:49 -03:00
ada30915c3 Idiomatic changes 2024-08-15 23:16:29 -03:00
78eff45ea0 Idiomatic changes 2024-08-15 23:11:49 -03:00
e817aedd60 Idiomatic changes 2024-08-15 22:41:24 -03:00
20d6b65346 More idiomatic 2024-08-15 22:01:50 -03:00
cb09dff9f1 Minor cleanup 2024-08-15 21:35:06 -03:00
b589726352 Make action a struct, guard against popping too much 2024-08-15 21:16:17 -03:00
a3a7b5bd9a Many cleanups 2024-08-15 21:10:25 -03:00
58e8dac038 Make usingself MUCH cheaper, since it was called many times when parsing C 2024-08-15 19:20:12 -03:00
f72a40f095 Oops, escape things in HTML formatter! 2024-08-15 17:12:29 -03:00
bf257a5b82 cleanup 2024-08-15 17:05:03 -03:00
029495590c cleanup 2024-08-15 17:04:48 -03:00
115debdec6 Allocate match_data once 2024-08-15 17:04:16 -03:00
4612db58fe Prefetch XML data 2024-08-15 17:03:58 -03:00
f45a86c83a ignore 2024-08-15 16:35:58 -03:00
27008640a6 v0.4.0 2024-08-14 13:25:39 -03:00
7db8fdc9e4 Updated README 2024-08-14 13:25:20 -03:00
ad664d9f93 Added error handling 2024-08-14 11:24:25 -03:00
0626c8619f Working bytes-regexes, faster, MORE tests pass 2024-08-14 11:06:53 -03:00
3725201f8a Merge branch 'main' of github.com:ralsina/tartrazine 2024-08-14 09:25:08 -03:00
6f64b76c44 lint 2024-08-13 22:07:23 -03:00
5218af6855 lint 2024-08-13 22:06:19 -03:00
c898f395a1 reset stack on EOL instead of error, makes no difference, but it's in pygments version 2024-08-13 22:06:07 -03:00
56e49328fb Tiny bug 2024-08-13 21:00:00 -03:00
13 changed files with 358 additions and 239 deletions

.gitignore (vendored, +1 line)

@@ -7,3 +7,4 @@ chroma/
 pygments/
 shard.lock
 .vscode/
+.crystal/

README.md

@@ -4,17 +4,17 @@ Tartrazine is a library to syntax-highlight code. It is
 a port of [Pygments](https://pygments.org/) to
 [Crystal](https://crystal-lang.org/). Kind of.

-It's not currently usable because it's not finished, but:
-
-* The lexers work for the implemented languages
-* The provided styles work
-* There is a very very simple HTML formatter
+The CLI tool can be used to highlight many things in many styles.

 # A port of what? Why "kind of"?

-Because I did not read the Pygments code. And this is actually
-based on [Chroma](https://github.com/alecthomas/chroma) ...
-although I did not read that code either.
+Pygments is a staple of the Python ecosystem, and it's great.
+It lets you highlight code in many languages, and it has many
+themes. Chroma is "Pygments for Go", it's actually a port of
+Pygments to Go, and it's great too.
+
+I wanted that in Crystal, so I started this project. But I did
+not read much of the Pygments code. Or much of Chroma's.

 Chroma has taken most of the Pygments lexers and turned them into
 XML descriptions. What I did was take those XML files from Chroma
@@ -47,7 +47,14 @@ To build from source:
 2. Run `make` to build the `tartrazine` binary
 3. Copy the binary somewhere in your PATH.

-## Usage
+## Usage as a CLI tool
+
+```shell
+$ tartrazine whatever.c -l c -t catppuccin-macchiato --line-numbers \
+    --standalone -o whatever.html
+```
+
+## Usage as a Library

 This works:
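Aside: the README's library example is below the fold of this diff, but a minimal sketch can be assembled from APIs visible elsewhere in this changeset (`Tartrazine.lexer` from the spec helper, `Html#format` and the constructor hunk from the HTML formatter). The exact `Html.new` keyword arguments beyond those shown are assumptions:

```crystal
require "tartrazine"

# Hypothetical usage sketch; lexer name and constructor arguments
# are assumptions based on the hunks in this changeset
lexer = Tartrazine.lexer("c")
formatter = Tartrazine::Html.new(standalone: true, line_numbers: true)
html = formatter.format(File.read("whatever.c"), lexer)
File.write("whatever.html", html)
```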

TODO.md

@@ -8,4 +8,5 @@
 * ✅ Implement lexer loader that respects aliases
 * ✅ Implement lexer loader by file extension
 * ✅ Add --line-numbers to terminal formatter
 * Implement lexer loader by mime type
+* Implement Delegating lexers

shard.yml

@@ -1,5 +1,5 @@
 name: tartrazine
-version: 0.3.0
+version: 0.5.0

 authors:
   - Roberto Alsina <roberto.alsina@gmail.com>

spec/tartrazine_spec.cr

@@ -14,15 +14,18 @@ unicode_problems = {
   "#{__DIR__}/tests/java/test_string_literals.txt",
   "#{__DIR__}/tests/json/test_strings.txt",
   "#{__DIR__}/tests/systemd/example1.txt",
+  "#{__DIR__}/tests/c++/test_unicode_identifiers.txt",
 }

 # These testcases fail because of differences in the way chroma and tartrazine tokenize
 # but tartrazine is correct
 bad_in_chroma = {
   "#{__DIR__}/tests/bash_session/test_comment_after_prompt.txt",
+  "#{__DIR__}/tests/html/javascript_backtracking.txt",
   "#{__DIR__}/tests/java/test_default.txt",
   "#{__DIR__}/tests/java/test_multiline_string.txt",
   "#{__DIR__}/tests/java/test_numeric_literals.txt",
+  "#{__DIR__}/tests/octave/test_multilinecomment.txt",
   "#{__DIR__}/tests/php/test_string_escaping_run.txt",
   "#{__DIR__}/tests/python_2/test_cls_builtin.txt",
 }
@@ -30,19 +33,14 @@ bad_in_chroma = {
 known_bad = {
   "#{__DIR__}/tests/bash_session/fake_ps2_prompt.txt",
   "#{__DIR__}/tests/bash_session/prompt_in_output.txt",
-  "#{__DIR__}/tests/bash_session/test_newline_in_echo_no_ps2.txt",
-  "#{__DIR__}/tests/bash_session/test_newline_in_ls_ps2.txt",
   "#{__DIR__}/tests/bash_session/ps2_prompt.txt",
-  "#{__DIR__}/tests/bash_session/test_newline_in_ls_no_ps2.txt",
-  "#{__DIR__}/tests/bash_session/test_virtualenv.txt",
+  "#{__DIR__}/tests/bash_session/test_newline_in_echo_no_ps2.txt",
   "#{__DIR__}/tests/bash_session/test_newline_in_echo_ps2.txt",
-  "#{__DIR__}/tests/c/test_string_resembling_decl_end.txt",
-  "#{__DIR__}/tests/html/css_backtracking.txt",
+  "#{__DIR__}/tests/bash_session/test_newline_in_ls_no_ps2.txt",
+  "#{__DIR__}/tests/bash_session/test_newline_in_ls_ps2.txt",
+  "#{__DIR__}/tests/bash_session/test_virtualenv.txt",
   "#{__DIR__}/tests/mcfunction/data.txt",
   "#{__DIR__}/tests/mcfunction/selectors.txt",
-  "#{__DIR__}/tests/php/anonymous_class.txt",
-  "#{__DIR__}/tests/html/javascript_unclosed.txt",
 }

 # Tests that fail because of a limitation in PCRE2
@@ -75,7 +73,8 @@ end
 # Helper that creates lexer and tokenizes
 def tokenize(lexer_name, text)
   lexer = Tartrazine.lexer(lexer_name)
-  lexer.tokenize(text)
+  tokenizer = Tartrazine::Tokenizer.new(lexer, text)
+  Tartrazine::Lexer.collapse_tokens(tokenizer.to_a)
 end

 # Helper that tokenizes using chroma to validate the lexer

src/actions.cr

@@ -8,12 +8,30 @@ require "./tartrazine"
 # perform a list of actions. These actions can emit tokens
 # or change the state machine.
 module Tartrazine
-  class Action
-    property type : String
-    property xml : XML::Node
+  enum ActionType
+    Bygroups
+    Combined
+    Include
+    Pop
+    Push
+    Token
+    Using
+    Usingself
+  end
+
+  struct Action
     property actions : Array(Action) = [] of Action

-    def initialize(@type : String, @xml : XML::Node?)
+    @depth : Int32 = 0
+    @lexer_name : String = ""
+    @states : Array(String) = [] of String
+    @states_to_push : Array(String) = [] of String
+    @token_type : String = ""
+    @type : ActionType = ActionType::Token
+
+    def initialize(t : String, xml : XML::Node?)
+      @type = ActionType.parse(t.capitalize)
       # Some actions may have actions in them, like this:
       # <bygroups>
       # <token type="GenericPrompt"/>
@@ -23,48 +41,53 @@ module Tartrazine
       #
       # The token actions match with the first 2 groups in the regex
       # the using action matches the 3rd and shunts it to another lexer
-      @xml.children.each do |node|
+      xml.children.each do |node|
         next unless node.element?
         @actions << Action.new(node.name, node)
       end
+
+      # Prefetch the attributes we need from the XML and keep them
+      case @type
+      when ActionType::Token
+        @token_type = xml["type"]
+      when ActionType::Push
+        @states_to_push = xml.attributes.select { |attrib|
+          attrib.name == "state"
+        }.map &.content
+      when ActionType::Pop
+        @depth = xml["depth"].to_i
+      when ActionType::Using
+        @lexer_name = xml["lexer"].downcase
+      when ActionType::Combined
+        @states = xml.attributes.select { |attrib|
+          attrib.name == "state"
+        }.map &.content
+      end
     end

     # ameba:disable Metrics/CyclomaticComplexity
-    def emit(match : Regex::MatchData?, lexer : Lexer, match_group = 0) : Array(Token)
-      case type
-      when "token"
-        raise Exception.new "Can't have a token without a match" if match.nil?
-        [Token.new(type: xml["type"], value: match[match_group])]
-      when "push"
-        states_to_push = xml.attributes.select { |attrib|
-          attrib.name == "state"
-        }.map &.content
-        if states_to_push.empty?
-          # Push without a state means push the current state
-          states_to_push = [lexer.state_stack.last]
-        end
-        states_to_push.each do |state|
-          if state == "#pop"
+    def emit(match : MatchData, tokenizer : Tokenizer, match_group = 0) : Array(Token)
+      case @type
+      when ActionType::Token
+        raise Exception.new "Can't have a token without a match" if match.empty?
+        [Token.new(type: @token_type, value: String.new(match[match_group].value))]
+      when ActionType::Push
+        to_push = @states_to_push.empty? ? [tokenizer.state_stack.last] : @states_to_push
+        to_push.each do |state|
+          if state == "#pop" && tokenizer.state_stack.size > 1
             # Pop the state
-            Log.trace { "Popping state" }
-            lexer.state_stack.pop
+            tokenizer.state_stack.pop
           else
             # Really push
-            lexer.state_stack << state
-            Log.trace { "Pushed #{lexer.state_stack}" }
+            tokenizer.state_stack << state
           end
         end
         [] of Token
-      when "pop"
-        depth = xml["depth"].to_i
-        Log.trace { "Popping #{depth} states" }
-        if lexer.state_stack.size <= depth
-          Log.trace { "Can't pop #{depth} states, only have #{lexer.state_stack.size}" }
-        else
-          lexer.state_stack.pop(depth)
-        end
+      when ActionType::Pop
+        to_pop = [@depth, tokenizer.state_stack.size - 1].min
+        tokenizer.state_stack.pop(to_pop)
         [] of Token
-      when "bygroups"
+      when ActionType::Bygroups
         # FIXME: handle
         # ><bygroups>
         # <token type="Punctuation"/>
@@ -79,38 +102,42 @@ module Tartrazine
         # the action is skipped.
         result = [] of Token
         @actions.each_with_index do |e, i|
-          next if match[i + 1]?.nil?
-          result += e.emit(match, lexer, i + 1)
+          begin
+            next if match[i + 1].size == 0
+          rescue IndexError
+            # FIXME: This should not actually happen
+            # No match for this group
+            next
+          end
+          result += e.emit(match, tokenizer, i + 1)
         end
         result
-      when "using"
+      when ActionType::Using
         # Shunt to another lexer entirely
-        return [] of Token if match.nil?
-        lexer_name = xml["lexer"].downcase
-        Log.trace { "to tokenize: #{match[match_group]}" }
-        Tartrazine.lexer(lexer_name).tokenize(match[match_group], usingself: true)
-      when "usingself"
+        return [] of Token if match.empty?
+        Tokenizer.new(
+          Tartrazine.lexer(@lexer_name),
+          String.new(match[match_group].value),
+          secondary: true).to_a
+      when ActionType::Usingself
         # Shunt to another copy of this lexer
-        return [] of Token if match.nil?
-        new_lexer = Lexer.from_xml(lexer.xml)
-        Log.trace { "to tokenize: #{match[match_group]}" }
-        new_lexer.tokenize(match[match_group], usingself: true)
-      when "combined"
-        # Combine two states into one anonymous state
-        states = xml.attributes.select { |attrib|
-          attrib.name == "state"
-        }.map &.content
-        new_state = states.map { |name|
-          lexer.states[name]
+        return [] of Token if match.empty?
+        Tokenizer.new(
+          tokenizer.lexer,
+          String.new(match[match_group].value),
+          secondary: true).to_a
+      when ActionType::Combined
+        # Combine two or more states into one anonymous state
+        new_state = @states.map { |name|
+          tokenizer.lexer.states[name]
         }.reduce { |state1, state2|
           state1 + state2
         }
-        lexer.states[new_state.name] = new_state
-        lexer.state_stack << new_state.name
+        tokenizer.lexer.states[new_state.name] = new_state
+        tokenizer.state_stack << new_state.name
        [] of Token
      else
-        raise Exception.new("Unknown action type: #{type}: #{xml}")
+        raise Exception.new("Unknown action type: #{@type}")
      end
    end
  end
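Aside: the key change above is that each Action now parses its XML once in the constructor (`ActionType.parse` plus attribute prefetch) instead of re-reading the node on every `emit`. A sketch of how that construction works, using only what this hunk shows; the XML fragment is hypothetical:

```crystal
require "xml"

# Hypothetical <bygroups> node, shaped like Chroma's lexer XML
node = XML.parse(<<-XML).first_element_child.not_nil!
  <bygroups>
    <token type="GenericPrompt"/>
    <token type="Text"/>
  </bygroups>
  XML

# "bygroups".capitalize => "Bygroups", which ActionType.parse maps to
# ActionType::Bygroups; the two <token> children become nested Actions
# (each prefetching its type attribute into @token_type)
action = Tartrazine::Action.new(node.name, node)
action.actions.size # => 2
```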

src/bytes_regex.cr (new file, +73 lines)

@@ -0,0 +1,73 @@
+module BytesRegex
+  extend self
+
+  class Regex
+    def initialize(pattern : String, multiline = false, dotall = false, ignorecase = false, anchored = false)
+      flags = LibPCRE2::UTF | LibPCRE2::UCP | LibPCRE2::NO_UTF_CHECK
+      flags |= LibPCRE2::MULTILINE if multiline
+      flags |= LibPCRE2::DOTALL if dotall
+      flags |= LibPCRE2::CASELESS if ignorecase
+      flags |= LibPCRE2::ANCHORED if anchored
+      if @re = LibPCRE2.compile(
+           pattern,
+           pattern.bytesize,
+           flags,
+           out errorcode,
+           out erroroffset,
+           nil)
+      else
+        msg = String.new(256) do |buffer|
+          bytesize = LibPCRE2.get_error_message(errorcode, buffer, 256)
+          {bytesize, 0}
+        end
+        raise Exception.new "Error #{msg} compiling regex at offset #{erroroffset}"
+      end
+      @match_data = LibPCRE2.match_data_create_from_pattern(@re, nil)
+    end
+
+    def finalize
+      LibPCRE2.match_data_free(@match_data)
+      LibPCRE2.code_free(@re)
+    end
+
+    def match(str : Bytes, pos = 0) : Array(Match)
+      rc = LibPCRE2.match(
+        @re,
+        str,
+        str.size,
+        pos,
+        LibPCRE2::NO_UTF_CHECK,
+        @match_data,
+        nil)
+      if rc > 0
+        ovector = LibPCRE2.get_ovector_pointer(@match_data)
+        (0...rc).map do |i|
+          m_start = ovector[2 * i]
+          m_end = ovector[2 * i + 1]
+          if m_start == m_end
+            m_value = Bytes.new(0)
+          else
+            m_value = str[m_start...m_end]
+          end
+          Match.new(m_value, m_start, m_end - m_start)
+        end
+      else
+        [] of Match
+      end
+    end
+  end
+
+  struct Match
+    property value : Bytes
+    property start : UInt64
+    property size : UInt64
+
+    def initialize(@value : Bytes, @start : UInt64, @size : UInt64)
+    end
+  end
+end
+
+# pattern = "foo"
+# str = "foo bar"
+# re = BytesRegex::Regex.new(pattern)
+# p! String.new(re.match(str.to_slice)[0].value)
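Aside: expanding the commented-out example at the end of the file into a runnable sketch. The capture-group semantics are read off the `match` method above: it returns one `Match` per group, with group 0 being the whole match (an assumption verified only against this wrapper code, not against a test suite):

```crystal
re = BytesRegex::Regex.new("(f)(o+)")
matches = re.match("foo bar".to_slice)
p! String.new(matches[0].value) # => "foo" (whole match)
p! String.new(matches[1].value) # => "f"   (first capture group)
p! String.new(matches[2].value) # => "oo"  (second capture group)
p! matches[0].start, matches[0].size
```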

src/formatters/ansi.cr

@@ -7,19 +7,39 @@ module Tartrazine
     def initialize(@theme : Theme = Tartrazine.theme("default-dark"), @line_numbers : Bool = false)
     end

+    private def line_label(i : Int32) : String
+      "#{i + 1}".rjust(4).ljust(5)
+    end
+
     def format(text : String, lexer : Lexer) : String
+      tokenizer = Tokenizer.new(lexer, text)
+      i = 0
       output = String.build do |outp|
-        lexer.group_tokens_in_lines(lexer.tokenize(text)).each_with_index do |line, i|
-          label = line_numbers? ? "#{i + 1}".rjust(4).ljust(5) : ""
-          outp << label
-          line.each do |token|
-            outp << colorize(token[:value], token[:type])
+        outp << line_label(i) if line_numbers?
+        tokenizer.each do |token|
+          outp << colorize(token[:value], token[:type])
+          if token[:value].includes?("\n")
+            i += 1
+            outp << line_label(i) if line_numbers?
           end
         end
       end
       output
     end

+    # def format(text : String, lexer : Lexer) : String
+    #   output = String.build do |outp|
+    #     lexer.group_tokens_in_lines(lexer.tokenize(text)).each_with_index do |line, i|
+    #       label = line_numbers? ? "#{i + 1}".rjust(4).ljust(5) : ""
+    #       outp << label
+    #       line.each do |token|
+    #         outp << colorize(token[:value], token[:type])
+    #       end
+    #     end
+    #   end
+    #   output
+    # end
+
     def colorize(text : String, token : String) : String
       style = theme.styles.fetch(token, nil)
       return text if style.nil?
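Aside: the extracted `line_label` helper is the whole gutter story in this formatter, so line numbers can now be emitted while streaming tokens instead of after grouping them into lines. It right-aligns the 1-based number in four columns and pads to five, which is plain string arithmetic:

```crystal
def line_label(i : Int32) : String
  "#{i + 1}".rjust(4).ljust(5)
end

line_label(0)   # => "   1 "
line_label(41)  # => "  42 "
line_label(999) # => "1000 "
```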

src/formatters/html.cr

@@ -1,5 +1,6 @@
 require "../constants/token_abbrevs.cr"
 require "../formatter"
+require "html"

 module Tartrazine
   class Html < Formatter
@@ -30,7 +31,7 @@ module Tartrazine
       @standalone : Bool = false,
       @surrounding_pre : Bool = true,
       @wrap_long_lines : Bool = false,
-      @weight_of_bold : Int32 = 600,)
+      @weight_of_bold : Int32 = 600)
     end

     def format(text : String, lexer : Lexer) : String
@@ -53,22 +54,29 @@ module Tartrazine
       output
     end

+    private def line_label(i : Int32) : String
+      line_label = "#{i + 1}".rjust(4).ljust(5)
+      line_class = highlighted?(i + 1) ? "class=\"#{get_css_class("LineHighlight")}\"" : ""
+      line_id = linkable_line_numbers? ? "id=\"#{line_number_id_prefix}#{i + 1}\"" : ""
+      "<span #{line_id} #{line_class} style=\"user-select: none;\">#{line_label} </span>"
+    end
+
     def format_text(text : String, lexer : Lexer) : String
-      lines = lexer.group_tokens_in_lines(lexer.tokenize(text))
+      # lines = lexer.group_tokens_in_lines(lexer.tokenize(text))
+      tokenizer = Tokenizer.new(lexer, text)
+      i = 0
       output = String.build do |outp|
         if surrounding_pre?
           pre_style = wrap_long_lines? ? "style=\"white-space: pre-wrap; word-break: break-word;\"" : ""
           outp << "<pre class=\"#{get_css_class("Background")}\" #{pre_style}>"
         end
         outp << "<code class=\"#{get_css_class("Background")}\">"
-        lines.each_with_index(offset: line_number_start - 1) do |line, i|
-          line_label = line_numbers? ? "#{i + 1}".rjust(4).ljust(5) : ""
-          line_class = highlighted?(i + 1) ? "class=\"#{get_css_class("LineHighlight")}\"" : ""
-          line_id = linkable_line_numbers? ? "id=\"#{line_number_id_prefix}#{i + 1}\"" : ""
-          outp << "<span #{line_id} #{line_class} style=\"user-select: none;\">#{line_label} </span>"
-          line.each do |token|
-            fragment = "<span class=\"#{get_css_class(token[:type])}\">#{token[:value]}</span>"
-            outp << fragment
+        outp << line_label(i) if line_numbers?
+        tokenizer.each do |token|
+          outp << "<span class=\"#{get_css_class(token[:type])}\">#{HTML.escape(token[:value])}</span>"
+          if token[:value].ends_with? "\n"
+            i += 1
+            outp << line_label(i) if line_numbers?
           end
         end
         outp << "</code></pre>"
@@ -104,15 +112,17 @@ module Tartrazine
     # Given a token type, return the CSS class to use.
     def get_css_class(token : String) : String
-      return class_prefix + Abbreviations[token] if theme.styles.has_key?(token)
-
-      # Themes don't contain information for each specific
-      # token type. However, they may contain information
-      # for a parent style. Worst case, we go to the root
-      # (Background) style.
-      class_prefix + Abbreviations[theme.style_parents(token).reverse.find { |parent|
-        theme.styles.has_key?(parent)
-      }]
+      if !theme.styles.has_key? token
+        # Themes don't contain information for each specific
+        # token type. However, they may contain information
+        # for a parent style. Worst case, we go to the root
+        # (Background) style.
+        parent = theme.style_parents(token).reverse.find { |dad|
+          theme.styles.has_key?(dad)
+        }
+        theme.styles[token] = theme.styles[parent]
+      end
+      class_prefix + Abbreviations[token]
     end

     # Is this line in the highlighted ranges?
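Aside: the new `require "html"` at the top of the file exists for `HTML.escape`, which now wraps every token value; this is the "Oops, escape things in HTML formatter!" commit. From Crystal's standard library:

```crystal
require "html"

# Without this, source text could inject markup into the highlighted output
HTML.escape("a < b && c > d") # => "a &lt; b &amp;&amp; c &gt; d"
```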

src/lexer.cr

@@ -1,9 +1,9 @@
+require "baked_file_system"
 require "./constants/lexers"

 module Tartrazine
   class LexerFiles
     extend BakedFileSystem
     bake_folder "../lexers", __DIR__
   end
@@ -37,71 +37,92 @@ module Tartrazine
     LEXERS_BY_NAME.keys.sort!
   end

+  # A token, the output of the tokenizer
+  alias Token = NamedTuple(type: String, value: String)
+
+  struct Tokenizer
+    include Iterator(Token)
+    property lexer : Lexer
+    property text : Bytes
+    property pos : Int32 = 0
+    @dq = Deque(Token).new
+    property state_stack = ["root"]
+
+    def initialize(@lexer : Lexer, text : String, secondary = false)
+      # Respect the `ensure_nl` config option
+      if text.size > 0 && text[-1] != '\n' && @lexer.config[:ensure_nl] && !secondary
+        text += "\n"
+      end
+      @text = text.to_slice
+    end
+
+    def next : Iterator::Stop | Token
+      if @dq.size > 0
+        return @dq.shift
+      end
+      if pos == @text.size
+        return stop
+      end
+
+      matched = false
+      while @pos < @text.size
+        @lexer.states[@state_stack.last].rules.each do |rule|
+          matched, new_pos, new_tokens = rule.match(@text, @pos, self)
+          if matched
+            @pos = new_pos
+            split_tokens(new_tokens).each { |token| @dq << token }
+            break
+          end
+        end
+        if !matched
+          if @text[@pos] == 10u8
+            @dq << {type: "Text", value: "\n"}
+            @state_stack = ["root"]
+          else
+            @dq << {type: "Error", value: String.new(@text[@pos..@pos])}
+          end
+          @pos += 1
+          break
+        end
+      end
+      self.next
+    end
+
+    # If a token contains a newline, split it into two tokens
+    def split_tokens(tokens : Array(Token)) : Array(Token)
+      split_tokens = [] of Token
+      tokens.each do |token|
+        if token[:value].includes?("\n")
+          values = token[:value].split("\n")
+          values.each_with_index do |value, index|
+            value += "\n" if index < values.size - 1
+            split_tokens << {type: token[:type], value: value}
+          end
+        else
+          split_tokens << token
+        end
+      end
+      split_tokens
+    end
+  end
+
   # This implements a lexer for Pygments RegexLexers as expressed
   # in Chroma's XML serialization.
   #
   # For explanations on what actions and states do
   # the Pygments documentation is a good place to start.
   # https://pygments.org/docs/lexerdevelopment/
-  class Lexer
+  struct Lexer
     property config = {
       name: "",
-      aliases: [] of String,
-      filenames: [] of String,
-      mime_types: [] of String,
       priority: 0.0,
       case_insensitive: false,
       dot_all: false,
       not_multiline: false,
       ensure_nl: false,
     }
-    property xml : String = ""
     property states = {} of String => State
-    property state_stack = ["root"]
-
-    # Turn the text into a list of tokens. The `usingself` parameter
-    # is true when the lexer is being used to tokenize a string
-    # from a larger text that is already being tokenized.
-    # So, when it's true, we don't modify the text.
-    def tokenize(text, usingself = false) : Array(Token)
-      @state_stack = ["root"]
-      tokens = [] of Token
-      pos = 0
-      matched = false
-
-      # Respect the `ensure_nl` config option
-      if text.size > 0 && text[-1] != '\n' && config[:ensure_nl] && !usingself
-        text += "\n"
-      end
-
-      # Loop through the text, applying rules
-      while pos < text.size
-        state = states[@state_stack.last]
-        # Log.trace { "Stack is #{@state_stack} State is #{state.name}, pos is #{pos}, text is #{text[pos..pos + 10]}" }
-        state.rules.each do |rule|
-          matched, new_pos, new_tokens = rule.match(text, pos, self)
-          if matched
-            # Move position forward, save the tokens,
-            # tokenize from the new position
-            # Log.trace { "MATCHED: #{rule.xml}" }
-            pos = new_pos
-            tokens += new_tokens
-            break
-          end
-          # Log.trace { "NOT MATCHED: #{rule.xml}" }
-        end
-        # If no rule matches, emit an error token
-        unless matched
-          # Log.trace { "Error at #{pos}" }
-          tokens << {type: "Error", value: "#{text[pos]}"}
-          pos += 1
-        end
-      end
-      Lexer.collapse_tokens(tokens)
-    end
-
     # Collapse consecutive tokens of the same type for easier comparison
     # and smaller output
     def self.collapse_tokens(tokens : Array(Tartrazine::Token)) : Array(Tartrazine::Token)
@@ -124,34 +145,8 @@ module Tartrazine
       result
     end

-    # Group tokens into lines, splitting them when a newline is found
-    def group_tokens_in_lines(tokens : Array(Token)) : Array(Array(Token))
-      split_tokens = [] of Token
-      tokens.each do |token|
-        if token[:value].includes?("\n")
-          values = token[:value].split("\n")
-          values.each_with_index do |value, index|
-            value += "\n" if index < values.size - 1
-            split_tokens << {type: token[:type], value: value}
-          end
-        else
-          split_tokens << token
-        end
-      end
-      lines = [Array(Token).new]
-      split_tokens.each do |token|
-        lines.last << token
-        if token[:value].includes?("\n")
-          lines << Array(Token).new
-        end
-      end
-      lines
-    end
-
-    # ameba:disable Metrics/CyclomaticComplexity
     def self.from_xml(xml : String) : Lexer
       l = Lexer.new
-      l.xml = xml
       lexer = XML.parse(xml).first_element_child
       if lexer
         config = lexer.children.find { |node|
@@ -160,9 +155,6 @@ module Tartrazine
         if config
           l.config = {
             name: xml_to_s(config, name) || "",
-            aliases: xml_to_a(config, _alias) || [] of String,
-            filenames: xml_to_a(config, filename) || [] of String,
-            mime_types: xml_to_a(config, mime_type) || [] of String,
             priority: xml_to_f(config, priority) || 0.0,
             not_multiline: xml_to_s(config, not_multiline) == "true",
             dot_all: xml_to_s(config, dot_all) == "true",
@@ -215,9 +207,9 @@ module Tartrazine
   # A Lexer state. A state has a name and a list of rules.
   # The state machine has a state stack containing references
   # to states to decide which rules to apply.
-  class State
+  struct State
     property name : String = ""
-    property rules = [] of Rule
+    property rules = [] of BaseRule

     def +(other : State)
       new_state = State.new
@@ -226,7 +218,4 @@ module Tartrazine
       new_state
     end
   end
-
-  # A token, the output of the tokenizer
-  alias Token = NamedTuple(type: String, value: String)
 end
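Aside: the change above is the heart of the "Tokenize via an iterator" commit: tokenization moves from a `Lexer#tokenize` method that built a full `Array(Token)` into a `Tokenizer` struct that yields tokens lazily, with per-line state also moving off the `Lexer`. A minimal sketch of driving it, using only APIs shown in this changeset; since `pos` only advances, a fresh instance is needed for a second pass:

```crystal
require "tartrazine"

lexer = Tartrazine.lexer("c")
tokenizer = Tartrazine::Tokenizer.new(lexer, "int main() {}\n")

# Tokenizer includes Iterator(Token), so it streams lazily
tokenizer.each do |token|
  puts "#{token[:type]} -> #{token[:value].inspect}"
end

# Or collect and collapse, the way the updated spec helper does
tokens = Tartrazine::Lexer.collapse_tokens(
  Tartrazine::Tokenizer.new(lexer, "int main() {}\n").to_a)
```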

src/main.cr

@@ -77,7 +77,7 @@ if options["-f"]
 if formatter.is_a?(Tartrazine::Html) && options["--css"]
   File.open("#{options["-t"].as(String)}.css", "w") do |outf|
-    outf.puts formatter.style_defs
+    outf << formatter.style_defs
   end
   exit 0
 end
@@ -91,7 +91,7 @@ if options["-f"]
   puts output
 else
   File.open(options["-o"].as(String), "w") do |outf|
-    outf.puts output
+    outf << output
     end
   end
 end

src/rules.cr

@@ -1,8 +1,9 @@
 require "./actions"
+require "./bytes_regex"
 require "./formatter"
+require "./lexer"
 require "./rules"
 require "./styles"
-require "./lexer"

 # These are lexer rules. They match with the text being parsed
 # and perform actions, either emitting tokens or changing the
@@ -10,40 +11,15 @@ require "./lexer"
 module Tartrazine
   # This rule matches via a regex pattern
-  class Rule
-    property pattern : Regex = Regex.new ""
-    property actions : Array(Action) = [] of Action
-    property xml : String = "foo"
-
-    def match(text, pos, lexer) : Tuple(Bool, Int32, Array(Token))
-      match = pattern.match(text, pos)
-      # We don't match if the match doesn't move the cursor
-      # because that causes infinite loops
-      return false, pos, [] of Token if match.nil? || match.end == 0
-      # Log.trace { "#{match}, #{pattern.inspect}, #{text}, #{pos}" }
-      tokens = [] of Token
-      # Emit the tokens
-      actions.each do |action|
-        # Emit the token
-        tokens += action.emit(match, lexer)
-      end
-      Log.trace { "#{xml}, #{match.end}, #{tokens}" }
-      return true, match.end, tokens
-    end
-
-    def initialize(node : XML::Node, multiline, dotall, ignorecase)
-      @xml = node.to_s
-      pattern = node["pattern"]
-      flags = Regex::Options::ANCHORED
-      # MULTILINE implies DOTALL which we don't want, so we
-      # use in-pattern flag (?m) instead
-      # flags |= Regex::Options::MULTILINE if multiline
-      pattern = "(?m)" + pattern if multiline
-      flags |= Regex::Options::DOTALL if dotall
-      flags |= Regex::Options::IGNORE_CASE if ignorecase
-      @pattern = Regex.new(pattern, flags)
-      add_actions(node)
-    end
+  alias Regex = BytesRegex::Regex
+  alias Match = BytesRegex::Match
+  alias MatchData = Array(Match)
+
+  abstract struct BaseRule
+    abstract def match(text : Bytes, pos : Int32, tokenizer : Tokenizer) : Tuple(Bool, Int32, Array(Token))
+    abstract def initialize(node : XML::Node)
+
+    @actions : Array(Action) = [] of Action

     def add_actions(node : XML::Node)
       node.children.each do |child|
@@ -53,23 +29,42 @@ end
     end
   end

+  struct Rule < BaseRule
+    property pattern : Regex = Regex.new ""
+
+    def match(text : Bytes, pos, tokenizer) : Tuple(Bool, Int32, Array(Token))
+      match = pattern.match(text, pos)
+      # No match
+      return false, pos, [] of Token if match.size == 0
+      return true, pos + match[0].size, @actions.flat_map(&.emit(match, tokenizer))
+    end
+
+    def initialize(node : XML::Node)
+    end
+
+    def initialize(node : XML::Node, multiline, dotall, ignorecase)
+      pattern = node["pattern"]
+      pattern = "(?m)" + pattern if multiline
+      @pattern = Regex.new(pattern, multiline, dotall, ignorecase, true)
+      add_actions(node)
+    end
+  end
+
   # This rule includes another state. If any of the rules of the
   # included state matches, this rule matches.
-  class IncludeStateRule < Rule
-    property state : String = ""
+  struct IncludeStateRule < BaseRule
+    @state : String = ""

-    def match(text, pos, lexer) : Tuple(Bool, Int32, Array(Token))
-      Log.trace { "Including state #{state} from #{lexer.state_stack.last}" }
-      lexer.states[state].rules.each do |rule|
-        matched, new_pos, new_tokens = rule.match(text, pos, lexer)
-        Log.trace { "#{xml}, #{new_pos}, #{new_tokens}" }
+    def match(text : Bytes, pos : Int32, tokenizer : Tokenizer) : Tuple(Bool, Int32, Array(Token))
+      tokenizer.@lexer.states[@state].rules.each do |rule|
+        matched, new_pos, new_tokens = rule.match(text, pos, tokenizer)
         return true, new_pos, new_tokens if matched
       end
       return false, pos, [] of Token
     end

     def initialize(node : XML::Node)
-      @xml = node.to_s
       include_node = node.children.find { |child|
         child.name == "include"
       }
@@ -79,17 +74,14 @@ end
   end

   # This rule always matches, unconditionally
-  class UnconditionalRule < Rule
-    def match(text, pos, lexer) : Tuple(Bool, Int32, Array(Token))
-      tokens = [] of Token
-      actions.each do |action|
-        tokens += action.emit(nil, lexer)
-      end
-      return true, pos, tokens
+  struct UnconditionalRule < BaseRule
+    NO_MATCH = [] of Match
+
+    def match(text, pos, tokenizer) : Tuple(Bool, Int32, Array(Token))
+      return true, pos, @actions.flat_map(&.emit(NO_MATCH, tokenizer))
     end

     def initialize(node : XML::Node)
-      @xml = node.to_s
       add_actions(node)
     end
   end
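Aside: every rule now implements the same `BaseRule` contract: `match` returns a `{matched?, new_pos, tokens}` tuple and never moves the cursor on failure. A hypothetical rule (not part of the codebase) showing what a minimal conforming implementation looks like:

```crystal
# Hypothetical: consumes exactly one byte as an Error token
struct OneByteRule < Tartrazine::BaseRule
  def match(text : Bytes, pos : Int32, tokenizer : Tartrazine::Tokenizer) : Tuple(Bool, Int32, Array(Tartrazine::Token))
    # Failing rules return the unchanged position and no tokens
    return false, pos, [] of Tartrazine::Token if pos >= text.size
    return true, pos + 1, [{type: "Error", value: String.new(text[pos..pos])}]
  end

  # BaseRule declares this abstract, so subclasses must define it
  def initialize(node : XML::Node)
  end
end
```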

src/styles.cr

@@ -9,7 +9,7 @@ require "xml"
 module Tartrazine
   alias Color = Sixteen::Color

-  class ThemeFiles
+  struct ThemeFiles
     extend BakedFileSystem
     bake_folder "../styles", __DIR__
   end
@@ -39,7 +39,7 @@ module Tartrazine
     themes.to_a.sort!
   end

-  class Style
+  struct Style
     # These properties are tri-state.
     # true means it's set
     # false means it's not set
@@ -79,7 +79,7 @@ module Tartrazine
     end
   end

-  class Theme
+  struct Theme
     property name : String = ""
     property styles = {} of String => Style