6 Commits

Author SHA1 Message Date
4dd2e925b0 Fix bug in ansi formatter 2024-08-26 16:44:44 -03:00
7bda19cdea Use forked baked_file_system for now 2024-08-25 17:05:04 -03:00
0e7dafe711 Updated README 2024-08-24 22:33:24 -03:00
082241eb0f Load lexer by mimetype 2024-08-24 22:20:38 -03:00
df88047ca8 v0.6.1 2024-08-24 21:45:57 -03:00
5a3b50d7a3 Integrate heuristics into lexer selection 2024-08-24 21:39:39 -03:00
11 changed files with 95 additions and 69 deletions

View File

@ -2,36 +2,11 @@
Tartrazine is a library to syntax-highlight code. It is
a port of [Pygments](https://pygments.org/) to
[Crystal](https://crystal-lang.org/). Kind of.
[Crystal](https://crystal-lang.org/).
The CLI tool can be used to highlight many things in many styles.
It also provides a CLI tool which can be used to highlight many things in many styles.
# A port of what? Why "kind of"?
Pygments is a staple of the Python ecosystem, and it's great.
It lets you highlight code in many languages, and it has many
themes. Chroma is "Pygments for Go", it's actually a port of
Pygments to Go, and it's great too.
I wanted that in Crystal, so I started this project. But I did
not read much of the Pygments code. Or much of Chroma's.
Chroma has taken most of the Pygments lexers and turned them into
XML descriptions. What I did was take those XML files from Chroma
and a pile of test cases from Pygments, and I slapped them together
until the tests passed and my code produced the same output as
Chroma. Think of it as *extreme TDD*.
Currently the pass rate for tests in the supported languages
is `96.8%`, which is *not bad for a couple days hacking*.
This only covers the RegexLexers, which are the most common ones,
but it means the supported languages are a subset of Chroma's, which
is a subset of Pygments'.
Currently Tartrazine supports ... 248 languages.
It has 331 themes (63 from Chroma, the rest are base16 themes via
Currently Tartrazine supports 247 languages and it has 331 themes (63 from Chroma, the rest are base16 themes via
[Sixteen](https://github.com/ralsina/sixteen)
## Installation
@ -58,7 +33,7 @@ $ tartrazine whatever.c -l c -t catppuccin-macchiato --line-numbers -f terminal
Generate a standalone HTML file from a C source file with the syntax highlighted:
```shell
$ tartrazine whatever.c -l c -t catppuccin-macchiato --line-numbers \
$ tartrazine whatever.c -t catppuccin-macchiato --line-numbers \
--standalone -f html -o whatever.html
```
@ -87,3 +62,29 @@ puts formatter.format(File.read(ARGV[0]), lexer)
## Contributors
- [Roberto Alsina](https://github.com/ralsina) - creator and maintainer
## A port of what? Why "kind of"?
Pygments is a staple of the Python ecosystem, and it's great.
It lets you highlight code in many languages, and it has many
themes. Chroma is "Pygments for Go", it's actually a port of
Pygments to Go, and it's great too.
I wanted that in Crystal, so I started this project. But I did
not read much of the Pygments code. Or much of Chroma's.
Chroma has taken most of the Pygments lexers and turned them into
XML descriptions. What I did was take those XML files from Chroma
and a pile of test cases from Pygments, and I slapped them together
until the tests passed and my code produced the same output as
Chroma. Think of it as [*extreme TDD*](https://ralsina.me/weblog/posts/tartrazine-reimplementing-pygments.html).
Currently the pass rate for tests in the supported languages
is `96.8%`, which is *not bad for a couple days hacking*.
This covers the RegexLexers, which are the most common ones, and
DelegatingLexers (useful for things like template languages), but it means
the supported languages are a subset of Chroma's, which is a subset of Pygments'.
Then performance was bad, so I hacked and hacked and made it
significantly [faster than chroma](https://ralsina.me/weblog/posts/a-tale-of-optimization.html), which was fun.

View File

@ -8,8 +8,8 @@
* ✅ Implement lexer loader that respects aliases
* ✅ Implement lexer loader by file extension
* ✅ Add --line-numbers to terminal formatter
* Implement lexer loader by mime type
* ✅ Implement lexer loader by mime type
* ✅ Implement Delegating lexers
* ✅ Add RstLexer
* Add Mako template lexer
* Implement heuristic lexer detection
* ✅ Implement heuristic lexer detection

View File

@ -3,6 +3,7 @@
<name>Groff</name>
<alias>groff</alias>
<alias>nroff</alias>
<alias>roff</alias>
<alias>man</alias>
<filename>*.[1-9]</filename>
<filename>*.1p</filename>
@ -87,4 +88,4 @@
</rule>
</state>
</rules>
</lexer>
</lexer>

View File

@ -30,12 +30,12 @@
disambiguations:
- extensions: ['.1', '.2', '.3', '.4', '.5', '.6', '.7', '.8', '.9']
rules:
- language: Roff Manpage
- language: man
and:
- named_pattern: mdoc-date
- named_pattern: mdoc-title
- named_pattern: mdoc-heading
- language: Roff Manpage
- language: man
and:
- named_pattern: man-title
- named_pattern: man-heading
@ -43,12 +43,12 @@ disambiguations:
pattern: '^\.(?:[A-Za-z]{2}(?:\s|$)|\\")'
- extensions: ['.1in', '.1m', '.1x', '.3in', '.3m', '.3p', '.3pm', '.3qt', '.3x', '.man', '.mdoc']
rules:
- language: Roff Manpage
- language: man
and:
- named_pattern: mdoc-date
- named_pattern: mdoc-title
- named_pattern: mdoc-heading
- language: Roff Manpage
- language: man
and:
- named_pattern: man-title
- named_pattern: man-heading
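The rules above resolve an ambiguous extension by testing named patterns against the file's content, where an `and` block requires every listed pattern to match. A minimal sketch of that evaluation in Python (the pattern regexes and rule set here are illustrative stand-ins, not Linguist's actual definitions):

```python
import re

# Named patterns, as in heuristics.yml (regexes here are illustrative).
NAMED_PATTERNS = {
    "man-title": re.compile(r'^\.TH\b', re.MULTILINE),
    "man-heading": re.compile(r'^\.SH\b', re.MULTILINE),
}

# Each rule names a language and the patterns that must ALL match.
RULES = [
    {"language": "man", "and": ["man-title", "man-heading"]},
]

def disambiguate(content: str):
    """Return the first language whose patterns all match, else None."""
    for rule in RULES:
        if all(NAMED_PATTERNS[name].search(content) for name in rule["and"]):
            return rule["language"]
    return None
```

The real heuristics also support single-pattern rules, `negative_pattern`, and per-extension rule groups; this only shows the `and` case used for the manpage disambiguation.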

View File

@ -52,6 +52,6 @@ with open("src/constants/lexers.cr", "w") as f:
f.write(" LEXERS_BY_FILENAME = {\n")
for k in sorted(lexer_by_filename.keys()):
v = lexer_by_filename[k]
f.write(f'"{k}" => {str(list(v)).replace("'", "\"")}, \n')
f.write(f'"{k}" => {str(sorted(list(v))).replace("'", "\"")}, \n')
f.write("}\n")
f.write("end\n")
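The one-line change above wraps the value in `sorted(...)` so the generated `lexers.cr` is byte-identical across runs; Python sets iterate in hash order, so without it the emitted lists could differ between builds (this is why `"*.as"` flips from `["actionscript_3", "actionscript"]` to `["actionscript", "actionscript_3"]` later in the diff). A small illustration of the effect, with a hypothetical set value:

```python
# Hypothetical input: a glob mapped to a SET of lexer names.
lexer_by_filename = {"*.as": {"actionscript_3", "actionscript"}}

lines = []
for k in sorted(lexer_by_filename.keys()):
    # sorted() fixes the order; bare list(set) depends on hash order
    v = str(sorted(lexer_by_filename[k])).replace("'", '"')
    lines.append(f'"{k}" => {v}, ')
```

Each entry now comes out as `"*.as" => ["actionscript", "actionscript_3"], ` regardless of how the set happens to iterate.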

View File

@ -1,5 +1,5 @@
name: tartrazine
version: 0.6.0
version: 0.6.1
authors:
- Roberto Alsina <roberto.alsina@gmail.com>
@ -10,11 +10,13 @@ targets:
dependencies:
baked_file_system:
github: schovi/baked_file_system
github: ralsina/baked_file_system
branch: master
base58:
github: crystal-china/base58.cr
sixteen:
github: ralsina/sixteen
branch: main
docopt:
github: chenkovsky/docopt.cr

View File

@ -328,6 +328,7 @@ module Tartrazine
"restructuredtext" => "rst",
"rexx" => "rexx",
"rkt" => "racket",
"roff" => "groff",
"rpmspec" => "rpm_spec",
"rs" => "rust",
"rst" => "rst",
@ -730,8 +731,8 @@ module Tartrazine
"*.applescript" => ["applescript"],
"*.aql" => ["arangodb_aql"],
"*.arexx" => ["rexx"],
"*.as" => ["actionscript_3", "actionscript"],
"*.asm" => ["tasm", "nasm", "z80_assembly"],
"*.as" => ["actionscript", "actionscript_3"],
"*.asm" => ["nasm", "tasm", "z80_assembly"],
"*.au3" => ["autoit"],
"*.automount" => ["systemd"],
"*.aux" => ["tex"],
@ -739,7 +740,7 @@ module Tartrazine
"*.awk" => ["awk"],
"*.b" => ["brainfuck"],
"*.bal" => ["ballerina"],
"*.bas" => ["vb_net", "qbasic"],
"*.bas" => ["qbasic", "vb_net"],
"*.bash" => ["bash"],
"*.bat" => ["batchfile"],
"*.batch" => ["psl"],
@ -750,7 +751,7 @@ module Tartrazine
"*.bnf" => ["bnf"],
"*.bqn" => ["bqn"],
"*.bzl" => ["python"],
"*.c" => ["c++", "c"],
"*.c" => ["c", "c++"],
"*.c++" => ["c++"],
"*.capnp" => ["cap_n_proto"],
"*.cc" => ["c++"],
@ -839,7 +840,7 @@ module Tartrazine
"*.fx" => ["hlsl"],
"*.fxh" => ["hlsl"],
"*.fzn" => ["minizinc"],
"*.gd" => ["gdscript3", "gdscript"],
"*.gd" => ["gdscript", "gdscript3"],
"*.gemspec" => ["ruby"],
"*.geo" => ["glsl"],
"*.gleam" => ["gleam"],
@ -849,7 +850,7 @@ module Tartrazine
"*.graphql" => ["graphql"],
"*.graphqls" => ["graphql"],
"*.groovy" => ["groovy"],
"*.h" => ["c++", "c", "objective-c"],
"*.h" => ["c", "c++", "objective-c"],
"*.h++" => ["c++"],
"*.ha" => ["hare"],
"*.handlebars" => ["handlebars"],
@ -872,7 +873,7 @@ module Tartrazine
"*.idc" => ["c"],
"*.idr" => ["idris"],
"*.ijs" => ["j"],
"*.inc" => ["objectpascal", "povray", "php", "sourcepawn"],
"*.inc" => ["objectpascal", "php", "povray", "sourcepawn"],
"*.inf" => ["ini"],
"*.ini" => ["ini"],
"*.ino" => ["arduino"],
@ -898,13 +899,13 @@ module Tartrazine
"*.lpk" => ["objectpascal"],
"*.lpr" => ["objectpascal"],
"*.lua" => ["lua"],
"*.m" => ["mathematica", "octave", "matlab", "objective-c", "mason"],
"*.m" => ["mason", "mathematica", "matlab", "objective-c", "octave"],
"*.ma" => ["mathematica"],
"*.mak" => ["makefile"],
"*.man" => ["groff"],
"*.mao" => ["mako"],
"*.markdown" => ["markdown"],
"*.mc" => ["monkeyc", "mason"],
"*.mc" => ["mason", "monkeyc"],
"*.mcfunction" => ["mcfunction"],
"*.md" => ["markdown"],
"*.metal" => ["metal"],
@ -961,7 +962,7 @@ module Tartrazine
"*.pml" => ["promela"],
"*.pony" => ["pony"],
"*.pov" => ["povray"],
"*.pp" => ["puppet", "objectpascal"],
"*.pp" => ["objectpascal", "puppet"],
"*.pq" => ["powerquery"],
"*.pr" => ["promela"],
"*.prm" => ["promela"],
@ -1010,7 +1011,7 @@ module Tartrazine
"*.rst" => ["rst"],
"*.rvt" => ["tcl"],
"*.rx" => ["rexx"],
"*.s" => ["armasm", "r", "gas"],
"*.s" => ["armasm", "gas", "r"],
"*.sage" => ["python"],
"*.sas" => ["sas"],
"*.sass" => ["sass"],
@ -1023,7 +1024,7 @@ module Tartrazine
"*.scope" => ["systemd"],
"*.scss" => ["scss"],
"*.sed" => ["sed"],
"*.service" => ["systemd", "ini"],
"*.service" => ["ini", "systemd"],
"*.sh" => ["bash"],
"*.sh-session" => ["bash_session"],
"*.sieve" => ["sieve"],
@ -1033,7 +1034,7 @@ module Tartrazine
"*.smali" => ["smali"],
"*.sml" => ["standard_ml"],
"*.snobol" => ["snobol"],
"*.socket" => ["systemd", "ini"],
"*.socket" => ["ini", "systemd"],
"*.sol" => ["solidity"],
"*.sp" => ["sourcepawn"],
"*.sparql" => ["sparql"],
@ -1068,7 +1069,7 @@ module Tartrazine
"*.tpl" => ["smarty"],
"*.tpp" => ["c++"],
"*.trig" => ["psl"],
"*.ts" => ["typoscript", "typescript"],
"*.ts" => ["typescript", "typoscript"],
"*.tst" => ["scilab"],
"*.tsx" => ["typescript"],
"*.ttl" => ["turtle"],
@ -1104,7 +1105,7 @@ module Tartrazine
"*.xml" => ["xml"],
"*.xsd" => ["xml"],
"*.xsl" => ["xml"],
"*.xslt" => ["xml", "html"],
"*.xslt" => ["html", "xml"],
"*.yaml" => ["yaml"],
"*.yang" => ["yang"],
"*.yml" => ["yaml"],

View File

@ -11,7 +11,7 @@ module Tartrazine
"#{i + 1}".rjust(4).ljust(5)
end
def format(text : String, lexer : Lexer) : String
def format(text : String, lexer : BaseLexer) : String
outp = String::Builder.new("")
format(text, lexer, outp)
outp.to_s
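The `format` overload above does no highlighting itself: it builds a `String::Builder`, delegates to the IO-based overload, and returns the accumulated string. The same convenience-wrapper pattern, sketched in Python with `io.StringIO` (class and method names are illustrative, not Tartrazine's API):

```python
import io

class Formatter:
    def format_io(self, text: str, out) -> None:
        # The real work writes highlighted output to any IO-like object.
        out.write(text.upper())  # stand-in for actual highlighting

    def format(self, text: str) -> str:
        # Convenience overload: collect the IO output into a string.
        outp = io.StringIO()
        self.format_io(text, outp)
        return outp.getvalue()
```

Keeping the IO version as the primary implementation lets callers stream to a file or socket without paying for an intermediate string.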

View File

@ -1,13 +1,12 @@
require "yaml"
# Use linguist's heuristics to disambiguate between languages
# This is *shamelessly* stolen from https://github.com/github-linguist/linguist
# and ported to Crystal. Deepest thanks to the authors of Linguist
# for licensing it liberally.
#
# Consider this code (c) 2017 GitHub, Inc. even if I wrote it.
module Linguist
# Use linguist's heuristics to disambiguate between languages
# This is *shamelessly* stolen from https://github.com/github-linguist/linguist
# and ported to Crystal. Deepest thanks to the authors of Linguist
# for licensing it liberally.
#
# Consider this code (c) 2017 GitHub, Inc. even if I wrote it.
module Linguist
class Heuristic
include YAML::Serializable
@ -80,7 +79,3 @@ require "yaml"
end
end
end
h = Linguist::Heuristic.from_yaml(File.read("heuristics/heuristics.yml"))
fname = "/usr/include/sqlite3.h"
p! h.run(fname, File.read(fname))

View File

@ -9,13 +9,21 @@ module Tartrazine
# Get the lexer object for a language name
# FIXME: support mimetypes
def self.lexer(name : String? = nil, filename : String? = nil) : BaseLexer
def self.lexer(name : String? = nil, filename : String? = nil, mimetype : String? = nil) : BaseLexer
return lexer_by_name(name) if name && name != "autodetect"
return lexer_by_filename(filename) if filename
return lexer_by_mimetype(mimetype) if mimetype
Lexer.from_xml(LexerFiles.get("/#{LEXERS_BY_NAME["plaintext"]}.xml").gets_to_end)
end
private def self.lexer_by_mimetype(mimetype : String) : BaseLexer
lexer_file_name = LEXERS_BY_MIMETYPE.fetch(mimetype, nil)
raise Exception.new("Unknown mimetype: #{mimetype}") if lexer_file_name.nil?
Lexer.from_xml(LexerFiles.get("/#{lexer_file_name}.xml").gets_to_end)
end
private def self.lexer_by_name(name : String) : BaseLexer
lexer_file_name = LEXERS_BY_NAME.fetch(name.downcase, nil)
return create_delegating_lexer(name) if lexer_file_name.nil? && name.includes? "+"
@ -36,12 +44,30 @@ module Tartrazine
when 1
lexer_file_name = candidates.first
else
raise Exception.new("Multiple lexers match the filename: #{candidates.to_a.join(", ")}")
lexer_file_name = self.lexer_by_content(filename)
begin
return self.lexer(lexer_file_name)
rescue ex : Exception
raise Exception.new("Multiple lexers match the filename: #{candidates.to_a.join(", ")}, heuristics suggest #{lexer_file_name} but there is no matching lexer.")
end
end
Lexer.from_xml(LexerFiles.get("/#{lexer_file_name}.xml").gets_to_end)
end
private def self.lexer_by_content(fname : String) : String?
h = Linguist::Heuristic.from_yaml(LexerFiles.get("/heuristics.yml").gets_to_end)
result = h.run(fname, File.read(fname))
case result
when Nil
raise Exception.new "No lexer found for #{fname}"
when String
result.as(String)
when Array(String)
result.first
end
end
private def self.create_delegating_lexer(name : String) : BaseLexer
language, root = name.split("+", 2)
language_lexer = lexer(language)