test: add new corner cases for linguist v7.23

test plan
 - go test -run '^Test_EnryOnLinguistCorpus$' github.com/go-enry/go-enry/v2

Author: Alex Bezzubov
Date: 2023-03-03 23:17:32 +01:00
Parent: f206e152e6
Commit: 9ce235f39e
3 changed files with 14 additions and 3 deletions

@@ -174,6 +174,10 @@ Parsing [linguist/samples](https://github.com/github/linguist/tree/master/sample
- [Heuristics for ".csc", ".gsc" and ".gsh" extension](https://github.com/github/linguist/blob/7469c7982d93f2ad922230d712f586a353dc1a42/lib/linguist/heuristics.yml#L650-L651) in GSC could not be parsed, due to unsupported non-backtracking subexpressions in RE2 regexp engine.
- [Heuristic for ".txt"](https://github.com/github/linguist/blob/bf853f1c663903e3ee35935189760191f1c45e1c/lib/linguist/heuristics.yml#L680-L702) detecting 'Adblock Filter List' uses regexp syntax not supported by RE2.
- [IsVendor('bootstrap.css') == false](https://github.com/github/linguist/blob/v7.23.0/lib/linguist/vendor.yml#L77): as of v7.23, this is the first unsupported regexp syntax outside content heuristics.
- As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2), Linguist uses a [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses the regex-based [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) algorithm. See [#193](https://github.com/src-d/enry/issues/193).
- Bayesian classifier can't distinguish "SQL" from "PLpgSQL". See [#194](https://github.com/src-d/enry/issues/194).
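
The RE2-related entries above come down to Go's standard `regexp` package rejecting some Perl/Ruby-style constructs at compile time. The sketch below illustrates this with made-up patterns; they are not the actual Linguist heuristics, and it assumes that "non-backtracking subexpressions" refers to atomic groups such as `(?>...)`. The helper name `compiles` is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's regexp package (RE2 syntax) accepts the pattern.
// Hypothetical helper for illustration only; it is not part of go-enry.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Illustrative patterns, not copied from heuristics.yml or vendor.yml.
	patterns := []string{
		`^\s*#include`, // plain syntax that RE2 accepts
		`(?>a+)b`,      // atomic (non-backtracking) group
		`foo(?!bar)`,   // negative lookahead
	}
	for _, p := range patterns {
		fmt.Printf("%q compiles: %t\n", p, compiles(p))
	}
}
```

A rule whose pattern fails to compile this way cannot be applied as written, which is consistent with the corner cases listed above.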