# enry [![GoDoc](https://godoc.org/gopkg.in/src-d/enry.v1?status.svg)](https://godoc.org/gopkg.in/src-d/enry.v1) [![Build Status](https://travis-ci.org/src-d/enry.svg?branch=master)](https://travis-ci.org/src-d/enry) [![codecov](https://codecov.io/gh/src-d/enry/branch/master/graph/badge.svg)](https://codecov.io/gh/src-d/enry)
File programming language detector and toolbox to ignore binary or vendored files. *enry* started as a port to _Go_ of the original [linguist](https://github.com/github/linguist) _Ruby_ library and has an improved *2x performance*.

Installation
------------

The recommended way to install enry is
```
go get gopkg.in/src-d/enry.v1/...
```
To build enry's CLI, run

    make build-cli

This generates a binary called `enry` in the project's root directory. You can then move this binary anywhere in your `PATH`.
### Faster regexp engine (optional)
[Oniguruma](https://github.com/kkos/oniguruma) is CRuby's regular expression engine.
It is very fast and performs better than the one built into the Go runtime. *enry* supports swapping
between those two engines thanks to the [rubex](https://github.com/moovweb/rubex) project.
The typical overall speedup from using Oniguruma is 1.5-2x. However, it requires CGo and the external shared library.

On macOS with brew, install it with
```
brew install oniguruma
```
On Ubuntu, install it with
```
sudo apt install libonig-dev
```
To build enry with Oniguruma regexps, use the `oniguruma` build tag

```
go get -v -t --tags oniguruma ./...
```

and then rebuild the project.

Examples
------------
```go
lang, safe := enry.GetLanguageByExtension("foo.go")
fmt.Println(lang, safe)
// result: Go true

// lang and safe are already declared, so plain assignment is used from here on
lang, safe = enry.GetLanguageByContent("foo.m", []byte("<matlab-code>"))
fmt.Println(lang, safe)
// result: Matlab true

lang, safe = enry.GetLanguageByContent("bar.m", []byte("<objective-c-code>"))
fmt.Println(lang, safe)
// result: Objective-C true

// all strategies together
lang = enry.GetLanguage("foo.cpp", []byte("<cpp-code>"))
fmt.Println(lang)
// result: C++
```
Note that the returned boolean value `safe` is `true` if exactly one possible language is detected, and `false` otherwise.

To get a list of possible languages for a given file, you can use the plural versions of the detecting functions.
```go
langs := enry.GetLanguages("foo.h", []byte("<cpp-code>"))
// result: []string{"C", "C++", "Objective-C"}

langs = enry.GetLanguagesByExtension("foo.asc", []byte("<content>"), nil)
// result: []string{"AGS Script", "AsciiDoc", "Public Key"}

langs = enry.GetLanguagesByFilename("Gemfile", []byte("<content>"), []string{})
// result: []string{"Ruby"}
```
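The relationship between the plural functions and the `safe` flag can be illustrated with a toy sketch. This is not enry's implementation: the `candidatesByExt` table and the `languagesByExtension` and `languageByExtension` helpers are hypothetical names made up for this example, and the table only holds a few entries taken from the results shown above. It only shows the idea that `safe` is `true` when the candidate list has exactly one entry.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// candidatesByExt is a hypothetical, hand-written table used for illustration
// only; it is not the data enry ships with.
var candidatesByExt = map[string][]string{
	".go":  {"Go"},
	".h":   {"C", "C++", "Objective-C"},
	".asc": {"AGS Script", "AsciiDoc", "Public Key"},
}

// languagesByExtension returns every candidate language for the file's extension.
func languagesByExtension(filename string) []string {
	return candidatesByExt[filepath.Ext(filename)]
}

// languageByExtension returns the first candidate and reports safe == true
// only when that candidate is the sole possibility.
func languageByExtension(filename string) (lang string, safe bool) {
	langs := languagesByExtension(filename)
	if len(langs) == 0 {
		return "", false
	}
	return langs[0], len(langs) == 1
}

func main() {
	lang, safe := languageByExtension("foo.go")
	fmt.Println(lang, safe) // Go true

	lang, safe = languageByExtension("foo.h")
	fmt.Println(lang, safe) // C false

	fmt.Println(languagesByExtension("foo.asc"))
}
```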
CLI
------------

You can use enry as a command,
```bash
$ enry --help
enry v1.5.0 build: 10-02-2017_14_01_07 commit: 95ef0a6cf3, based on linguist commit: 37979b2
enry, A simple (and faster) implementation of github/linguist
usage: enry <path>
       enry [-json] [-breakdown] <path>
       enry [-json] [-breakdown]
       enry [-version]
```
and it'll return an output similar to *linguist*'s output,
```bash
$ enry
55.56% Shell
22.22% Ruby
11.11% Gnuplot
11.11% Go
```
but not only the output; its flags are also the same as *linguist*'s,
```bash
$ enry --breakdown
55.56% Shell
22.22% Ruby
11.11% Gnuplot
11.11% Go
Gnuplot
plot-histogram.gp
Ruby
linguist-samples.rb
linguist-total.rb
Shell
parse.sh
plot-histogram.sh
run-benchmark.sh
run-slow-benchmark.sh
run.sh
Go
parser/main.go
```
even the JSON flag,
```bash
$ enry --json
{"Gnuplot":["plot-histogram.gp"],"Go":["parser/main.go"],"Ruby":["linguist-samples.rb","linguist-total.rb"],"Shell":["parse.sh","plot-histogram.sh","run-benchmark.sh","run-slow-benchmark.sh","run.sh"]}
```
Note that even if enry's CLI is compatible with linguist's, its main point is that **_enry doesn't need a git repository to work!_**

Java bindings
------------

Generated Java bindings using a C shared library + JNI are located under [`java`](https://github.com/src-d/enry/blob/master/java).

Development
------------
*enry* re-uses parts of the original [linguist](https://github.com/github/linguist) to generate internal data structures. To update to the latest upstream and generate the necessary code, run:

    go generate

We update enry when changes are made in linguist's master branch to the following files:
* [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
* [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb)
* [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
* [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)

Currently we don't have any procedure established to automatically detect changes in the linguist project and regenerate the code.
So we update the generated code as needed, without any specific criteria.

If you want to update *enry* because of changes in linguist, you can run the `go generate`
command and open a pull request that only contains the changes in
generated files (those files in the subdirectory [data](https://github.com/src-d/enry/blob/master/data)).

To run the tests,

    make test

Divergences from linguist
------------
Using [linguist/samples](https://github.com/github/linguist/tree/master/samples)
as a set for the tests, the following issues were found:

* With [hello.ms](https://github.com/github/linguist/blob/master/samples/Unix%20Assembly/hello.ms) we can't detect the language (Unix Assembly) because we don't have a matcher in contentMatchers (content.go) for Unix Assembly. Linguist uses this [regexp](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L300) in its code,

  `elsif /(?<!\S)\.(include|globa?l)\s/.match(data) || /(?<!\/\*)(\A|\n)\s*\.[A-Za-z][_A-Za-z0-9]*:/.match(data.gsub(/"([^\\"]|\\.)*"|'([^\\']|\\.)*'|\\\s*(?:--.*)?\n/, ""))`

  which we can't port.

* All files for the SQL language fall to the classifier because we don't parse
this [disambiguator
expression](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L433)
for `*.sql` files correctly. This expression doesn't comply with the pattern for the
rest in [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb).

Benchmarks
------------
Enry's language detection has been compared with Linguist's. To do that, linguist's project directory [*linguist/samples*](https://github.com/github/linguist/tree/master/samples) was used as a set of files to run benchmarks against.

We got these results:

![histogram](https://raw.githubusercontent.com/src-d/enry/master/benchmarks/histogram/distribution.png)

The histogram represents the number of files whose language detection time
fell within the time interval indicated on the x axis.

So you can see that most of the files were detected quicker in enry.

We found a few cases where enry is slower than linguist. This is due to
Go's regexp engine being slower than Ruby's, which uses the [oniguruma](https://github.com/kkos/oniguruma) library, written in C.

You can find scripts and additional information (like the software and hardware used
and the benchmark results per sample file) in the [*benchmarks*](https://github.com/src-d/enry/blob/master/benchmarks) directory.

If you want to reproduce the same benchmarks, you can run:

    benchmarks/run.sh

from the project's root directory. It will run the benchmarks for enry and linguist, parse the output, create CSV files and create a histogram (you must have [gnuplot](http://gnuplot.info) installed on your system to get the histogram).
This can take some time, so to run local benchmarks for a quick check you can either run:

    make benchmarks

to get average times for the main detection function and strategies for the whole samples set, or:

    make benchmarks-samples

if you want to see measurements per sample file.

Why Enry?
------------
In the movie [My Fair Lady](https://en.wikipedia.org/wiki/My_Fair_Lady), [Professor Henry Higgins](http://www.imdb.com/character/ch0011719/?ref_=tt_cl_t2) is one of the main characters. Henry is a linguist and at the very beginning of the movie enjoys guessing the origin of people based on their accent.

`Enry Iggins` is how [Eliza Doolittle](http://www.imdb.com/character/ch0011720/?ref_=tt_cl_t1) [pronounces](https://www.youtube.com/watch?v=pwNKyTktDIE) the name of the Professor during the first half of the movie.

License
------------

Apache License, Version 2.0. See [LICENSE](LICENSE)