Alexander Bezzubov
a724a2f841
token: test case for regexp + non-valid UTF8
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-05-07 13:46:36 +02:00
Alexander Bezzubov
8bdc830833
token: new test case with Unicode replacement
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-17 19:28:06 +02:00
Alexander Bezzubov
7e136bade8
test: don't export tokenizer fixtures
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-16 19:38:48 +02:00
Alexander Bezzubov
ada6f15c93
address review feedback
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-16 19:38:48 +02:00
Alexander Bezzubov
7929933eb5
tokenizer: cleanup & attributions
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-14 21:38:16 +02:00
Alexander Bezzubov
8756fbdcb4
refactor to build tags
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-14 21:38:16 +02:00
Alexander Bezzubov
553399ed76
tokenizer: port flex-based C impl from linguist
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-14 21:38:16 +02:00
M. J. Fromberger
169060e1cd
Add a test that tokenization does not modify the input.
At present this test fails, since the tokenizer replaces text in shared slices
of the input. A subsequent commit will fix that.
Signed-off-by: M. J. Fromberger <michael.j.fromberger@gmail.com>
2019-01-29 10:03:09 -08:00
Manuel Carmona
1fc8cf7a5d
changes to improve detection accuracy
2017-06-15 10:07:22 +02:00
Manuel Carmona
fcf30a07c8
Added frequencies.go generation
2017-05-29 12:19:37 +02:00