Programming language detector and toolbox to ignore binary or vendored files. _enry_, started as a port to _Go_ of the original [Linguist](https://github.com/github/linguist) _Ruby_ library, that has an improved _2x performance_.
-`GetLanguageByShebang` reads only the first line of text to identify the [shebang](<https://en.wikipedia.org/wiki/Shebang_(Unix)>).
-`GetLanguageByModeline` for cases when Vim/Emacs modeline e.g. `/* vim: set ft=cpp: */` may be present at a head or a tail of the text.
-`GetLanguageByClassifier` uses a Bayesian classifier trained on all the `./samples/` from Linguist.
It usually is a last-resort strategy that is used to disambiguate the guess of the previous strategies, and thus it requires a list of "candidate" guesses. One can provide a list of all known languages - keys from the `data.LanguagesLogProbabilities` as possible candidates if more intelligent hypotheses are not available, at the price of possibly suboptimal accuracy.
A library is published on Maven as [tech.sourced:enry-java](https://mvnrepository.com/artifact/tech.sourced/enry-java) for macOS and linux platforms. Windows support is planned under [src-d/enry#150](https://github.com/src-d/enry/issues/150).
- [Heuristic for ".plist" extension](https://github.com/github-linguist/linguist/blob/b5432ebc7e78f25415b98d48c2fbacddbf8df317/lib/linguist/heuristics.yml#L524) in 'XML Property List', due to unsupported backreference in RE2 regexp engine.
- [Heuristics for ".txt" extension](https://github.com/github/linguist/blob/8083cb5a89cee2d99f5a988f165994d0243f0d1e/lib/linguist/heuristics.yml#L521) in Vim Help File could not be parsed, due to unsupported negative lookahead in RE2 regexp engine.
- [Heuristics for ".sol" extension](https://github.com/github/linguist/blob/8083cb5a89cee2d99f5a988f165994d0243f0d1e/lib/linguist/heuristics.yml#L464) in Solidity could not be parsed, due to unsupported negative lookahead in RE2 regexp engine.
- [Heuristics for ".es" extension](https://github.com/github/linguist/blob/e761f9b013e5b61161481fcb898b59721ee40e3d/lib/linguist/heuristics.yml#L103) in JavaScript could not be parsed, due to unsupported backreference in RE2 regexp engine.
- [Heuristics for ".rno" extension](https://github.com/github/linguist/blob/3a1bd3c3d3e741a8aaec4704f782e06f5cd2a00d/lib/linguist/heuristics.yml#L365) in RUNOFF could not be parsed, due to unsupported lookahead in RE2 regexp engine.
- [Heuristics for ".inc" extension](https://github.com/github/linguist/blob/f0e2d0d7f1ce600b2a5acccaef6b149c87d8b99c/lib/linguist/heuristics.yml#L222) in NASL could not be parsed, due to unsupported possessive quantifier in RE2 regexp engine.
- [Heuristics for ".as" extension](https://github.com/github/linguist/blob/223c00bb80eff04788e29010f98c5778993d2b2a/lib/linguist/heuristics.yml#L67) in ActionScript could not be parsed, due to unsupported positive lookahead in RE2 regexp engine.
- [Heuristics for ".csc", ".gsc" and ".gsh" extension](https://github.com/github/linguist/blob/7469c7982d93f2ad922230d712f586a353dc1a42/lib/linguist/heuristics.yml#L650-L651) in GSC could not be parsed, due to unsupported non-backtracking subexpressions in RE2 regexp engine.
- [Heuristic for ".txt"](https://github.com/github/linguist/blob/bf853f1c663903e3ee35935189760191f1c45e1c/lib/linguist/heuristics.yml#L680-L702) detecting 'Adblock Filter List' regexp syntax not supported by RE2
- [IsVendor('bootstrap.css') == false](https://github.com/github/linguist/blob/v7.23.0/lib/linguist/vendor.yml#L77) v7.23 first unsupported RE syntax outside content heuristics
- As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2) it is using [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) regex-based algorithm. See [#193](https://github.com/src-d/enry/issues/193).
> All the issues related to heuristics' regexp syntax incompatibilities with the RE2 engine can be avoided by using `oniguruma` instead (see [instuctions](#misc))
In the movie [My Fair Lady](https://en.wikipedia.org/wiki/My_Fair_Lady), [Professor Henry Higgins](http://www.imdb.com/character/ch0011719/) is a linguist who at the very beginning of the movie enjoys guessing the origin of people based on their accent.
"Enry Iggins" is how [Eliza Doolittle](http://www.imdb.com/character/ch0011720/), [pronounces](https://www.youtube.com/watch?v=pwNKyTktDIE) the name of the Professor.
There now is automation for detecting the changes in the upstream Linguist project: every day Github CI runs [a job](.github/workflows/sync-linguist.yml) that will create a PR to this repo for each new Linguist release. It will include all the steps from the above.
Separating all the necessary "manual" code changes to a different PR that includes some background description and an update to the documentation on ["divergences from linguist"](#divergences-from-linguist) is encouraged and very much appreciated, as it simplifies the maintenance (review/release notes/etc).