Commit Graph

13 Commits

Author SHA1 Message Date
78d8f43a88 tokenizer: hide flex-based impl, avoid build failures on win
TestPlan:
 - go test -run TestTokenize ./internal/tokenizer
 - go test -tags flex -run TestTokenize ./internal/tokenizer
   (shold fail as default fixtures are from regex-based tokenizer)
2020-03-19 19:58:48 +01:00
e32a70a784 tokenizer: fix a bug and regenerate the code \w latest Go
See https://github.com/bzz/enry/pull/4 for details.

Test Plan:
 - go test ./...
2020-03-19 19:08:21 +01:00
f3ceaa6330 token: refactor & simplify test fixtures
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-05-08 22:17:32 +02:00
a724a2f841 token: test case for regexp + non-valid UTF8
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-05-07 13:46:36 +02:00
8bdc830833 token: new test case with Unicode replacement
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-17 19:28:06 +02:00
7e136bade8 test: don't export tokenizer fixtures
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-16 19:38:48 +02:00
ada6f15c93 address review feedback
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-16 19:38:48 +02:00
7929933eb5 tokenizer: cleanup & attributions
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-14 21:38:16 +02:00
8756fbdcb4 refactor to build tags
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-14 21:38:16 +02:00
553399ed76 tokenizer: port flex-based C impl from linguist
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-14 21:38:16 +02:00
169060e1cd Add a test that tokenization does not modify the input.
At present this test fails, since the tokenizer replaces text in shared slices
of the input. A subsequent commit will fix that.

Signed-off-by: M. J. Fromberger <michael.j.fromberger@gmail.com>
2019-01-29 10:03:09 -08:00
1fc8cf7a5d changes to improve detection accuracy 2017-06-15 10:07:22 +02:00
fcf30a07c8 Added frequencies.go generation 2017-05-29 12:19:37 +02:00