Commit Graph

11 Commits

Author SHA1 Message Date
f3ceaa6330 token: refactor & simplify test fixtures
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-05-08 22:17:32 +02:00
a724a2f841 token: test case for regexp + non-valid UTF8
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-05-07 13:46:36 +02:00
8bdc830833 token: new test case with Unicode replacement
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-17 19:28:06 +02:00
7e136bade8 test: don't export tokenizer fixtures
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-16 19:38:48 +02:00
ada6f15c93 address review feedback
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-16 19:38:48 +02:00
7929933eb5 tokenizer: cleanup & attributions
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-14 21:38:16 +02:00
8756fbdcb4 refactor to build tags
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-14 21:38:16 +02:00
553399ed76 tokenizer: port flex-based C impl from linguist
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-14 21:38:16 +02:00
169060e1cd Add a test that tokenization does not modify the input.
At present this test fails, since the tokenizer replaces text in shared slices
of the input. A subsequent commit will fix that.

Signed-off-by: M. J. Fromberger <michael.j.fromberger@gmail.com>
2019-01-29 10:03:09 -08:00
1fc8cf7a5d changes to improve detection accuracy
2017-06-15 10:07:22 +02:00
fcf30a07c8 Added frequencies.go generation
2017-05-29 12:19:37 +02:00