Commit Graph

514 Commits

Author SHA1 Message Date
Alex
2ddd4985bc
doc: mention Rust bindings and IsGenerated 2021-04-24 08:56:06 +02:00
Alex
7168084e5e
Merge pull request #44 from zeripath/speed-up-is-vendor
Make IsVendor quicker
2021-04-24 08:35:22 +02:00
Alex
0a9864e6ec
Merge pull request #46 from look/look/add-language-id
Add GetLanguageID function
2021-04-24 08:32:32 +02:00
Andrew Thornton
20726a1de3
Make IsVendor quicker
Although iterating across the regexps is quicker than naively concatenating them,
it is still quite slow.

This PR proposes a slightly cleverer solution.

First, instead of just concatenating with groups, this PR uses non-capturing groups.
This speeds up the regexp processing.

Secondly, we split the regexps into 3 groups: those that have to be at the start,
those that are segments or at the start, and the rest. This makes a considerable speed
improvement.

Thirdly, the regexps are sorted within those groups, which also speeds things up.

All in all, for a non-vendored file this makes IsVendor around twice as fast.

Signed-off-by: Andrew Thornton <art27@cantab.net>
2021-04-23 10:18:28 +01:00
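
The grouping described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `samplePatterns`, `buildVendorMatcher`, and `matchesAny` are hypothetical names, the pattern list is a made-up handful rather than the real vendor list, and the sketch assumes a non-empty list in which no pattern contains a top-level `|`.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// samplePatterns is a hypothetical handful of vendor patterns, not the real list.
var samplePatterns = []string{
	`^vendor/`,
	`^Godeps/`,
	`(^|/)node_modules/`,
	`(^|/)cache/`,
	`\.min\.js$`,
}

// buildVendorMatcher (hypothetical) folds all patterns into a single regexp.
// Assumes a non-empty list and no top-level "|" inside any pattern.
func buildVendorMatcher(patterns []string) *regexp.Regexp {
	var atStart, atSegment, rest []string
	for _, p := range patterns {
		switch {
		case strings.HasPrefix(p, `(^|/)`):
			// Must match at the start of the path or right after a "/".
			atSegment = append(atSegment, strings.TrimPrefix(p, `(^|/)`))
		case strings.HasPrefix(p, `^`):
			// Must match at the start of the path.
			atStart = append(atStart, strings.TrimPrefix(p, `^`))
		default:
			rest = append(rest, p)
		}
	}
	// Sort inside each group, mirroring the third step of the commit message.
	sort.Strings(atStart)
	sort.Strings(atSegment)
	sort.Strings(rest)

	// Each group shares one anchor and is wrapped in a non-capturing group.
	var alternatives []string
	if len(atStart) > 0 {
		alternatives = append(alternatives, `^(?:`+strings.Join(atStart, "|")+`)`)
	}
	if len(atSegment) > 0 {
		alternatives = append(alternatives, `(?:^|/)(?:`+strings.Join(atSegment, "|")+`)`)
	}
	if len(rest) > 0 {
		alternatives = append(alternatives, `(?:`+strings.Join(rest, "|")+`)`)
	}
	return regexp.MustCompile(strings.Join(alternatives, "|"))
}

// matchesAny is the baseline: try every pattern one by one.
func matchesAny(patterns []string, path string) bool {
	for _, p := range patterns {
		if regexp.MustCompile(p).MatchString(path) {
			return true
		}
	}
	return false
}

func main() {
	matcher := buildVendorMatcher(samplePatterns)
	for _, path := range []string{"vendor/lib.go", "web/node_modules/a.js", "src/main.go"} {
		// The combined matcher and the baseline should agree on every path.
		fmt.Println(path, matcher.MatchString(path), matchesAny(samplePatterns, path))
	}
}
```

Factoring the shared anchor out of each group means the engine evaluates `^` or `(?:^|/)` once per group instead of once per pattern. Whether that yields the speed-up quoted in the commit depends on the real pattern list and the regexp engine in use.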
Luke Francl
cabfdaffc0 Update GetLanguageID to return a found boolean per code review 2021-04-22 16:55:42 -07:00
6543
d2d4c32d4d
Extend & simplify the test for IsVendor (#45) 2021-04-22 22:24:27 +02:00
Alex
b60e5c6f5a
Merge pull request #47 from look/look/mimic-linguist-detect
Rewrite GetLanguages to work like Linguist.detect
2021-04-22 21:38:22 +02:00
Máximo Cuadros
11cbde8956
Merge pull request #48 from look/look/rm-travis
Remove .travis.yml
2021-04-17 01:10:19 +02:00
Luke Francl
ed7a1e67b4 Remove .travis.yml
This file doesn't appear to be used any more, since the builds are run using
GitHub Actions.

This file is affected by the recent Codecov Bash Uploader exploit[1], but since it
hasn't been running, I don't think the project is affected.

[1] https://about.codecov.io/security-update/
2021-04-15 15:11:39 -07:00
Luke Francl
bf7167fc44 Rewrite GetLanguages to work like Linguist.detect
Prior to this change, GetLanguages collected all candidate languages from each
strategy to pass to the next strategy (without de-duplicating them). Linguist
only uses the previous strategy's candidates for the next strategy. Also, it
would overwrite languages with nil if a strategy returned that, so you could get
into a situation where you go from multiple languages to no language.

See the Ruby code for details: aad49acc06/lib/linguist.rb (L14-L49)

This addresses https://github.com/src-d/enry/issues/207 because GetLanguages
should not return all candidates detected, otherwise it would work differently
than Linguist.
2021-04-13 12:04:47 -07:00
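
The candidate-passing behaviour this commit describes can be sketched as below. It is an illustration under stated assumptions rather than the project's implementation: `Strategy`, `detectLanguages`, `byExtension`, and `byKeyword` are hypothetical names defined for this example, and the two toy strategies stand in for the real detection strategies.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

// Strategy (hypothetical signature) narrows down languages for a file,
// given the candidates left over from the previous strategy.
type Strategy func(filename string, content []byte, candidates []string) []string

// detectLanguages passes only the previous strategy's candidates forward.
// An empty result never overwrites candidates that were already found.
func detectLanguages(filename string, content []byte, strategies []Strategy) []string {
	var languages []string
	for _, strategy := range strategies {
		found := strategy(filename, content, languages)
		if len(found) == 1 {
			return found // unambiguous: stop here
		}
		if len(found) > 1 {
			languages = found // still ambiguous: hand these to the next strategy
		}
		// len(found) == 0: keep what we had and try the next strategy
	}
	return languages
}

// byExtension is a toy strategy: ".h" files are ambiguous.
func byExtension(filename string, _ []byte, _ []string) []string {
	if strings.HasSuffix(filename, ".h") {
		return []string{"C", "C++", "Objective-C"}
	}
	return nil
}

// byKeyword is a toy strategy that can only pick among existing candidates.
func byKeyword(_ string, content []byte, candidates []string) []string {
	if bytes.Contains(content, []byte("@interface")) {
		for _, c := range candidates {
			if c == "Objective-C" {
				return []string{c}
			}
		}
	}
	return nil
}

func main() {
	strategies := []Strategy{byExtension, byKeyword}
	fmt.Println(detectLanguages("view.h", []byte("@interface View"), strategies))
	fmt.Println(detectLanguages("util.h", []byte("int add(int, int);"), strategies))
}
```

In the second call the keyword strategy returns nothing, and the three candidates from the extension strategy survive instead of being replaced by an empty result, which is the failure mode the commit message describes.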
Luke Francl
eb043e80a8 Add GetLanguageID function
The Linguist-defined language IDs are important to our use case because they are
used as database identifiers. This adds a new generator to extract the language
IDs into a map and uses that to implement GetLanguageID.

Because one language has the ID 0, there is no way to tell if a language name is
found or not. If desired, we could add this by returning (string, bool) from
GetLanguageID. But none of the other functions that take language names do this,
so I didn't want to introduce it here.
2021-04-13 11:49:21 -07:00
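
The ambiguity around ID 0 mentioned in this commit, and the found boolean added by the later commit cabfdaffc0 listed above, can be illustrated with Go's comma-ok map lookup. This is a sketch only: `languageIDs` and `languageID` are hypothetical names, and the ID values are illustrative and should be checked against Linguist's languages.yml.

```go
package main

import "fmt"

// languageIDs is a hypothetical excerpt of a generated name-to-ID map.
// The values are illustrative, not authoritative.
var languageIDs = map[string]int{
	"1C Enterprise": 0,
	"Go":            132,
	"Python":        303,
}

// languageID (hypothetical) returns the ID and whether the name is known.
// Without the boolean, an unknown name and the language with ID 0 would
// both come back as 0.
func languageID(name string) (int, bool) {
	id, ok := languageIDs[name]
	return id, ok
}

func main() {
	for _, name := range []string{"1C Enterprise", "Go", "NoSuchLanguage"} {
		id, found := languageID(name)
		fmt.Println(name, id, found)
	}
}
```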
Alex
7f5d84ad74
Merge pull request #43 from lafriks-fork/feat/v7.13.0
Sync with Linguist v7.13.0
2021-03-12 08:02:57 +01:00
Lauris BH
323d739170 Fix test 2021-03-07 18:34:08 +02:00
Lauris BH
c40b34c351 Sync with Linguist v7.13.0 2021-03-07 18:02:04 +02:00
Alexander
1ad7deb89e
Merge pull request #42 from lafriks-fork/feat/sync_v7.12.2
Sync with github/linguist version v7.12.2
2021-03-06 15:35:46 +01:00
Lauris BH
497e2f85d3 Sync with github/linguist version v7.12.2 2021-01-17 14:10:38 +02:00
Alexander
3faf9450da
Merge pull request #40 from lafriks-fork/feat/strategy_xml
Add XML strategy
2020-12-02 00:10:52 +01:00
Lauris BH
0596fda1a4
Fix strategy order 2020-11-26 13:56:25 +02:00
Alexander
6edbff3dec
Merge pull request #38 from softagram/fix-readme-cmd
Fix typo in the pip command in README.md
2020-11-26 12:46:28 +01:00
Alexander
8de21f365e
Merge pull request #39 from lafriks-fork/feat/sync_7.12.1
Sync with linguist 7.12.1
2020-11-26 12:38:46 +01:00
Lauris BH
8ac98f4b77 Update readme 2020-11-15 15:48:03 +02:00
Lauris BH
6d8f15af5b Add XML strategy 2020-11-15 15:43:37 +02:00
Lauris BH
289ac3d9f0 Sync with linguist 7.12.1 2020-11-15 14:32:56 +02:00
Ville Laitila
8d83871580 Fix typo in the pip command in README.md 2020-11-14 23:29:53 +02:00
Alexander
0fb4b8a768
Merge pull request #35 from lafriks-fork/feat/manpage_strategy
Add support for Roff man pages filenames
2020-10-22 00:10:39 +02:00
Alexander
7688057adc
Merge pull request #37 from lafriks-fork/sync_7_11_1
sync to the latest github/linguist v7.11.1
2020-10-22 00:08:04 +02:00
Lauris BH
bc76dd38b0 sync to the latest github/linguist v7.11.1 2020-10-12 12:32:48 +03:00
Lauris BH
cb353b4b05 Add support for Roff man pages filenames 2020-10-12 12:18:57 +03:00
Alexander
d7f6b27b7d
Merge pull request #34 from lafriks-fork/sync_7_11
sync to the latest github/linguist v7.11.0
2020-09-24 12:24:42 +02:00
Lauris BH
7c562a6c34 sync to the latest github/linguist v7.11.0 2020-09-17 10:34:41 +03:00
Alexander
5717abd4c0
Merge pull request #30 from bzz/python-ci
CI for Python bindings
2020-08-17 11:56:30 +02:00
Alexander Bezzubov
e98983b3f9 ci: add Python tests profile (w/o gopy)
Signed-off-by: Alexander Bezzubov <alexander.bezzubov@jetbrains.com>
2020-08-12 15:23:01 +02:00
Alexander Bezzubov
328c16f948 py: use readme as PyPI description
Signed-off-by: Alexander Bezzubov <alexander.bezzubov@jetbrains.com>
2020-08-12 15:22:55 +02:00
Alexander Bezzubov
7ee65cc9d0 doc: upd build instructions
Signed-off-by: Alexander Bezzubov <alexander.bezzubov@jetbrains.com>
2020-08-12 15:22:50 +02:00
Alexander
5d58b1aaaf
Merge pull request #29 from vsmaxim/master
python: cover the rest of python bindings from shared library, add tests, add docstrings for API
2020-08-12 14:35:58 +02:00
Maxim Vasilev
59f0f17834 Remove unneeded todos 2020-08-11 00:29:33 +03:00
Maxim Vasilev
08bc9bca0e Cover the rest of python bindings from shared library, add tests, add docstrings, add setup.py. 2020-08-11 00:12:43 +03:00
Máximo Cuadros
dc6fc02209
Merge pull request #24 from erizocosmico/fix/bail-out-if-not-enough-lines
data: bail out in some cases if there aren't enough lines
2020-05-28 16:45:10 +02:00
Miguel Molina
78696c2272
data: bail out in some cases if there aren't enough lines
Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
2020-05-28 13:39:59 +02:00
Máximo Cuadros
2880ccae4a
Merge pull request #23 from erizocosmico/fix/get-first-line
data: fix getting the first line for empty content
2020-05-28 11:52:49 +02:00
Miguel Molina
79398a925d
data: fix getting the first line for empty content
Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
2020-05-28 11:28:49 +02:00
Máximo Cuadros
e1f1b57a84
Merge pull request #22 from erizocosmico/feature/generated
implement IsGenerated helper to filter out generated files
2020-05-28 10:34:37 +02:00
Miguel Molina
8ff885a3a8
implement IsGenerated helper to filter out generated files
Closes #17

Implements the IsGenerated helper function to filter out generated
files using the rules and matchers in:
- https://github.com/github/linguist/blob/master/lib/linguist/generated.rb

Since the vast majority of matchers have very different logic, they cannot
be autogenerated directly from linguist like other logic in enry, so they are
translated by hand.

There are three different types of matchers in this implementation:
- By extension, which mark a file as generated based only on its extension.
  These are the fastest matchers, so they're done first.
- By file name, which matches patterns against the filename. These
  are performed in second place. Unlike linguist, we try to use string
  functions instead of regexps as much as possible.
- Finally, the rest of the matchers, which go into the content and try
  to identify if they're generated or not based on the content. Unlike
  linguist, we try to only read the content we need and not split it
  all unless it's necessary and use byte functions instead of regexps
  as much as possible.

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
2020-05-28 08:55:13 +02:00
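
The three tiers of matchers described in this commit can be sketched as a cheapest-first check. This is a simplified illustration, not the hand-translated implementation: `isGenerated`, `generatedExtensions`, and `generatedNames` are hypothetical names, and the specific extensions, file names, and the first-line marker are sample rules chosen for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"path"
	"strings"
)

// Sample rules only; the real rule set is much larger.
var generatedExtensions = map[string]bool{".nib": true, ".xcworkspacedata": true}
var generatedNames = map[string]bool{"Gopkg.lock": true, "package-lock.json": true}

// isGenerated (hypothetical) runs the cheapest checks first and only looks
// at the content when the name-based checks are inconclusive.
func isGenerated(filename string, content []byte) bool {
	// Tier 1: extension lookup, no content needed.
	if generatedExtensions[strings.ToLower(path.Ext(filename))] {
		return true
	}
	// Tier 2: file name, using plain string comparison instead of regexps.
	if generatedNames[path.Base(filename)] {
		return true
	}
	// Tier 3: content. Read only the first line rather than splitting the
	// whole file, and use byte functions instead of regexps.
	firstLine := content
	if i := bytes.IndexByte(content, '\n'); i >= 0 {
		firstLine = content[:i]
	}
	return bytes.Contains(firstLine, []byte("Code generated by"))
}

func main() {
	fmt.Println(isGenerated("ui/Main.nib", nil))
	fmt.Println(isGenerated("api.pb.go", []byte("// Code generated by protoc-gen-go. DO NOT EDIT.\npackage api\n")))
	fmt.Println(isGenerated("main.go", []byte("package main\n")))
}
```

Ordering the checks this way means most files are classified without touching their content at all.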
Máximo Cuadros
bda45fdc8e
go.mod: update go-oniguruma v1.2.1 2020-05-06 21:42:07 +02:00
Alexander
4b468762b6
Merge pull request #13 from go-enry/python-wrapper
Python: API to expose highest-level enry.GetLanguage
2020-04-24 20:57:37 +02:00
Alexander Bezzubov
35575d0a3e
py: expose highest-level enry.language()
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2020-04-24 20:51:46 +02:00
Máximo Cuadros
1d23012ae6
Merge pull request #15 from go-enry/ci-libonig5
ci: update libonig5 version
2020-04-24 19:24:34 +02:00
Máximo Cuadros
aa46bf4a37
ci: update libonig5 version 2020-04-24 19:18:27 +02:00
Máximo Cuadros
9fba3da45f
Merge pull request #12 from mcuadros/oniguruma
data: replace substring package with regex package
2020-04-15 19:58:28 +02:00
Máximo Cuadros
29bc0a181b
data: replace substring package with regex package 2020-04-15 17:27:48 +02:00