Commit Graph

32 Commits

Author SHA1 Message Date
Alex Bezzubov
8246efecce heuristics regexp engine configurable #3, adapt IsVendor optimization & tests
Regex collation optimization for IsVendor now fails gracefully.
Tests that are affected by non-RE2 syntax are explicitly marked.
2023-02-16 17:55:57 +01:00
Alex Bezzubov
ede9e478fe IsVendor: move RE collation to code generation phase
test plan:
 * go test -run '^TestIsVendor$' github.com/go-enry/go-enry/v2
2022-12-01 22:16:44 +01:00
Alex Bezzubov
95f30f0db4 IsVendor: refactor RE collation optimization
The same optimization still happens during package initialization
at runtime, but an effort was made to make it more transparent and
self-documented.

Both the test & the benchmark were updated.

Old version (usefull for benchmark) was

```go
func IsVendor(filename string) bool {
	for _, matcher := range data.VendorMatchers {
		if matcher.MatchString(filename) {
			return true
		}
	}
	return false
}
```

Test plan:
 * go test -run ^TestIsVendor$ github.com/go-enry/go-enry/v2
2022-10-23 15:19:53 +02:00
Clark Boylan
16a5ff8e22 Fix IsVendor() regex generation
The regex generation introduced by
20726a1de3
had a subtle but important bug in it. In particular the ^ and ^|/
prefixes were only applied to the first regex in the OR'd groupings and
not every regex in the grouping.

We fix this by wrapping each of those OR'd groups in a new non capturing
group to ensure the prefix applies to every entry in the OR'd listing.

Additionally we add three new quick tests of IsVendor to guard against
this regression in the future.

This fixes https://github.com/go-enry/go-enry/issues/135
2022-08-25 16:22:09 -07:00
Andrew Thornton
20726a1de3
Make IsVendor quicker
Although iterating across the regexps is quicker than naively concatenating them,
it is still quite slow.

This PR proposes a slightly cleverer solution.

First instead of just concatenating with groups this PR uses non-capturing groups.
This speeds up the regexp processing.

Secondly we group the regexps in to 3 groups - those that have to be at the start,
those that are segments or at the start and the rest. This makes a considerable speed
improvement.

Thirdly the regexps are sorted within those groups - which also speeds things up.

All in all for a non-vendored file this makes IsVendor around twice as fast.

Signed-off-by: Andrew Thornton <art27@cantab.net>
2021-04-23 10:18:28 +01:00
Miguel Molina
8ff885a3a8
implement IsGenerated helper to filter out generated files
Closes #17

Implements the IsGenerated helper function to filter out generated
files using the rules and matchers in:
- https://github.com/github/linguist/blob/master/lib/linguist/generated.rb

Since the vast majority of matchers have very different logic, it cannot
be autogenerated directly from linguist like other logics in enry, so it's
translated by hand.

There are three different types of matchers in this implementation:
- By extension, which mark as generated based only in the extension. These
  are the fastest matchers, so they're done first.
- By file name, which matches patterns against the filename. These
  are performed in second place. Unlike linguist, we try to use string
  functions instead of regexps as much as possible.
- Finally, the rest of the matchers, which go into the content and try
  to identify if they're generated or not based on the content. Unlike
  linguist, we try to only read the content we need and not split it
  all unless it's necessary and use byte functions instead of regexps
  as much as possible.

Signed-off-by: Miguel Molina <miguel@erizocosmi.co>
2020-05-28 08:55:13 +02:00
Máximo Cuadros
29bc0a181b
data: replace substring package with regex package 2020-04-15 17:27:48 +02:00
Máximo Cuadros
b851ee83ad
IsTest function for top 10 languages 2020-04-06 16:23:48 +02:00
Lauris BH
97a26011a9 Return group color if language has none 2020-03-31 09:30:27 +03:00
Máximo Cuadros
84efad7693
*: module rename to go-enry/go-enry/v4 2020-03-19 17:31:29 +01:00
Alexander Bezzubov
bc5e031cee Drop src-d org ref except for issues
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2020-03-19 14:04:36 +01:00
Lauris Bukšis-Haberkorns
25b29ebdc4 Implement getting color code for languages
Signed-off-by: Lauris Bukšis-Haberkorns <lauris@nix.lv>
2019-07-19 23:59:46 +03:00
Alexander Bezzubov
6a5f37e9e2
modules: prepare for v2 release
- update go.mod \w v2
 - update all import paths

Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-14 21:28:12 +02:00
Alexander Bezzubov
20c6d2845a
build: gopkg.in -> github.com imports
Signed-off-by: Alexander Bezzubov <bzz@apache.org>
2019-04-12 11:49:16 +02:00
Alexander
13d3d66d37
refactoring: remove un-used code, add go doc, fix ci (#199)
Refactoring, consisting of
 - remove unused method `isAuxiliaryLanguage` and `FileCountList`
   in order to reduce public API surfaces (go/java)
 - add GoDoc to public APIs
 - ci: java profile use latest go src
  It also now mimics https://docs.travis-ci.com/user/languages/go/#go-import-path
  for non-go build image, as code relies on internal imports.

TEST PLAN:
 - make test
2019-02-05 22:54:14 +01:00
Huitse Tai
a786f6175e patch IsDotFile function compatible with both '.' and '..'
Signed-off-by: Huitse Tai <geb.1989@gmail.com>
2017-10-18 12:50:33 +08:00
Huitse Tai
887bc6a4be make IsDotFile do not treat '.' as true
Signed-off-by: Huitse Tai <geb.1989@gmail.com>
2017-10-18 12:18:52 +08:00
Sunny
b84e338f9e cli: sort the results
Adds `FileCount` and `FileCountList` types for storing language file and
their count which can be sorted based on the value of count.

Signed-off-by: Sunny <me@darkowlzz.space>
2017-09-30 18:52:30 +05:30
David Paz
bd402a063c Fixed output text 2017-07-18 12:54:54 +02:00
David Paz
b2fe3f69ce Added mymeType.gold 2017-07-18 12:47:19 +02:00
David Paz
25e12e9c03 Returns text/plain when mime it's undefined 2017-07-18 12:46:29 +02:00
David Paz
125c802582 Now generates mime file 2017-07-18 12:46:29 +02:00
David Paz
7e827e47ef moved generated data to data subpackage 2017-06-28 08:31:11 +02:00
Manuel Carmona
beda5b73e7 changed signatures for strategies 2017-06-15 10:07:23 +02:00
Manuel Carmona
ba53e10c7b renamed package and cli to enry 2017-06-13 14:18:23 +02:00
Manuel Carmona
0d5dff1979 changes in the API, ready to version 2 2017-06-06 11:30:23 +02:00
Manuel Carmona
5b304524d1 Rearranged code 2017-06-02 09:33:55 +02:00
Manuel Carmona
ca3ae587f3 added documentation_matchers.go generation 2017-04-17 11:52:11 +02:00
Manuel Carmona
e998b0ff2e regexp for vendored files and directories are generated in vendor_matchers.go 2017-04-07 09:27:40 +02:00
Manuel Carmona
13e7886a02 Added utils.go generation 2017-04-06 17:31:17 +02:00
Máximo Cuadros
2bbd7ec440 unified GetLanguage function 2016-07-18 16:20:12 +02:00
Máximo Cuadros
bead3a606f tests 2016-07-13 22:21:18 +02:00