Detection of Unclassified licenses

Files in which no license has been matched, but which fulfill other criteria that hint at a license, are reported by FOSSology as "UnclassifiedLicense". This section documents how this mechanism works.

Some signatures are followed by a number in parentheses. This is the index of the signature in the file STRINGS.in. The index is assigned at build time.

First purge: words/phrases matching

ONLY files that contain the words/phrases specified in this section, anywhere in the file, are considered for deeper testing.

1st filter: _LEGAL_VERBS

This test takes place inside the function checkUnclassified.

_LEGAL_VERBS (2046)

r?e?-?distribut(e|ion)|
deploy|
publish|
replicat|
[^/]\<cop(y|ied)\>|
contribut|
execute|
execution|
sublicen[cs]|
\<r?e?-?sell\>|
\<sold\>|
\<reserve[ds]?\>|
(term|condition)s? appl(y|ies)|
for sale|
ma[dk]e available|
copyleft|
retain|
confer|
convey|
licen[cs]ed|
\<provided?\>
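
The first filter can be illustrated with a minimal Python sketch. It is not the Nomos implementation: it uses only a handful of the alternatives from _LEGAL_VERBS, writes the word boundaries \< and \> as \b (the notation of Python's re module), and assumes case-insensitive matching. The names LEGAL_VERBS_SUBSET and passes_first_filter are hypothetical.

```python
import re

# Illustrative subset of _LEGAL_VERBS (not the full signature).
# \< and \> from STRINGS.in are written as \b for Python's re module.
# Case-insensitive matching is an assumption of this sketch.
LEGAL_VERBS_SUBSET = re.compile(
    r"r?e?-?distribut(e|ion)|deploy|publish|\bsold\b|"
    r"\breserve[ds]?\b|for sale|ma[dk]e available|copyleft|licen[cs]ed",
    re.IGNORECASE,
)


def passes_first_filter(text: str) -> bool:
    """Return True if any of the legal verbs occurs anywhere in the text."""
    return LEGAL_VERBS_SUBSET.search(text) is not None
```

A file containing "You may redistribute this file." would pass this sketch of the filter, while a file without any of the listed verbs would be dropped before any deeper testing.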

2nd filter: _LEGAL_OWNERS, _LEGAL_OBJS and _LEGAL_PERMS

If a file passes the first filter, the function match3 is called to test that ALL of the following signatures are matched.

_LEGAL_OWNERS (2047)

\<(
  you|
  licen[cs]ees?|
  licen[cs]ors?|
  owners?|
  holders?|
  company|
  proprietors?|
  regents|
  universit(y|ies)|
  colleges?|
  authors?|
  contributors?|
  recipients?|
  persons?|
  anyone|
  every(body|one)|
  distributors?
  )\>|
re[mt]ain[eis]?|
appear[eis]?|
as[ -]is|
expressly|
in (posession|any form)|

_LEGAL_OBJS (2048)

\<copyright\>|
\<patent\>|
\<trade(mark| ?dress)\>|
\<licen[cs]e\>|
\<terms\>|
\<agreement\>|
\<program\>|
\<software\>|
\<library\>|
\<source\>|
\<binary\>|
\<document\>|
\<manual\>|
\<product\>|
\<files?\>|
derivat|
modification|
change|
representation

_LEGAL_PERMS (2049)

grant|
permi[st]|
\<verbatim\>|
authori[sz]|
\<rights?\>|
request|
\<allowe?d?\>|
entitle|
\<freel?y?\>|
\<liable\>|
responsib(ilit|le)|
same terms|
terms and conditions
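
The requirement that ALL three signatures match can be sketched as follows. This is a simplified illustration, not the code in Nomos: each pattern is an abbreviated subset of the corresponding signature, \< and \> are written as \b, case-insensitive matching is assumed, and the names OWNERS, OBJS, PERMS and passes_second_filter are hypothetical.

```python
import re

# Abbreviated, illustrative subsets of the three signatures.
OWNERS = re.compile(
    r"\b(you|licen[cs]ees?|owners?|holders?|authors?)\b|as[ -]is|expressly",
    re.IGNORECASE,
)
OBJS = re.compile(
    r"\bcopyright\b|\blicen[cs]e\b|\bsoftware\b|\bfiles?\b|derivat|modification",
    re.IGNORECASE,
)
PERMS = re.compile(
    r"grant|permi[st]|\brights?\b|authori[sz]|\bfreel?y?\b",
    re.IGNORECASE,
)


def passes_second_filter(text: str) -> bool:
    """Return True only if an owner, an object AND a permission are all found."""
    return all(sig.search(text) is not None for sig in (OWNERS, OBJS, PERMS))
```

In this sketch, "The authors grant you the right to use this software." passes because it names an owner, an object and a permission, whereas "The software was written by the authors." fails for lack of a permission keyword.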

Second purge: Paragraph analysis

If a file passes the first purge, a deeper analysis takes place. In the tests explained in this section, the paragraphs in which keywords were matched are analyzed individually. This means that keywords spread all over a file, but not grouped in a paragraph, are not considered a hint for an unclassified license.

Negative matching 1

Some paragraphs might look like a hint for a license because of the number of matched keywords, but some known pieces of text yield enough keywords without being licenses. The filter described here tries to eliminate those "false positives".

The function match3 in parse.c takes care of this filtering.

  1. Paragraphs in which more than half of the characters belong to the extended ASCII set are probably not licenses.
  2. "no warranty" statements are not considered a sufficient hint for a license.
  3. The FSF-GNU licensing templates that some projects provide do not license any part of the package.
  4. If a GPL-family license had been matched previously, this function would not have been entered. The GPL preamble without the license text is not a license; therefore keywords in the GPL preamble are not considered a hint for a license.
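
The first rule of the list above can be sketched in Python. This is only an illustration under two assumptions: the paragraph is available as raw bytes, and "extended ASCII" means byte values of 0x80 and above. The function name mostly_extended_ascii is hypothetical and does not appear in parse.c.

```python
def mostly_extended_ascii(paragraph: bytes) -> bool:
    """Return True if more than half of the bytes lie outside 7-bit ASCII."""
    if not paragraph:
        return False
    # Iterating over bytes yields integers; count those with the high bit set.
    high = sum(1 for byte in paragraph if byte >= 0x80)
    return high * 2 > len(paragraph)
```

A paragraph for which this check returns True would be discarded as a license candidate. A paragraph with exactly half of its bytes in the extended range is kept, because the rule requires more than half.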

Positive matching based on score

If this point is reached, the number of keywords in the paragraph decides whether an unclassified license is reported. The function match3 also takes care of this filtering.

  1. A paragraph with 3 or more keywords is a good hint for an unclassified license.
  2. If the paragraph has only 2 keywords, but the whole file has no more than 4 keywords, it is also taken as a good hint for an unclassified license.
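
The two scoring rules above can be written down as a small decision function. This is a sketch of the rules as stated in this section, not a transcription of match3. The function name is_unclassified_hint and its parameters are hypothetical, and the keyword counts are assumed to have been computed beforehand.

```python
def is_unclassified_hint(paragraph_keywords: int, file_keywords: int) -> bool:
    """Decide whether a paragraph is a good hint for an unclassified license."""
    # Rule 1: three or more keywords in the paragraph are enough.
    if paragraph_keywords >= 3:
        return True
    # Rule 2: two keywords are enough only if the whole file has at most four.
    return paragraph_keywords == 2 and file_keywords <= 4
```

For example, a paragraph with 2 keywords in a file with 4 keywords is reported, while the same paragraph in a file with 5 keywords is not.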

Negative matching 2

A paragraph that has passed all the previous filters is a good candidate for a license that Nomos does not match, but some false positives still remain. The last section of the function match3 takes care of this final filtering.

If any of the following signatures is matched, the license candidate is discarded. This negative filter can reduce the accuracy of the unclassified license detection, because it contains words, such as "completed", that appear frequently in ordinary text and therefore also in license texts.
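
The discard rule can be sketched as follows, using a few alternatives from each of the three signatures listed in this section. The sketch is an illustration, not the Nomos code: the patterns are small subsets, \< and \> are written as \b, case-insensitive matching is assumed, and the names FILTER_1, FILTER_FOREIGN, FILTER_WORDS and is_discarded are hypothetical.

```python
import re

# Small illustrative subsets of the three negative signatures.
FILTER_1 = re.compile(
    r"bug report|query string|random num|home[ -]?page|completed",
    re.IGNORECASE,
)
FILTER_FOREIGN = re.compile(
    r"dokumentation|\bbitte\b|\bavec\b|\bdonde\b",
    re.IGNORECASE,
)
FILTER_WORDS = re.compile(
    r"\brepository\b|\btokens?\b|\bpointers?\b|\bfaq\b",
    re.IGNORECASE,
)


def is_discarded(paragraph: str) -> bool:
    """Return True if any one of the negative signatures matches."""
    signatures = (FILTER_1, FILTER_FOREIGN, FILTER_WORDS)
    return any(sig.search(paragraph) is not None for sig in signatures)
```

In this sketch, "Permission is granted once the review is completed." is discarded solely because of the word "completed", which illustrates the accuracy concern. "Permission is granted to copy this software." is kept.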

_LEGAL_FILTER_1

follow(ing| the) sign[ -]off|
old mechanism|
future changes|
master system|
(en|de)crypt|
cryptograph|
posix[ -]mode|
home[ -]?page|
completed|
build(ing[ -]block|
 (dir|environ|solut))|
then generally|
to switch|
get maintainer|
COLSxROWS|
success story|
normal thing|
wish to|
scription pattern|
bug report|
sorts? of?|
please spread|
pollut|
[2468][ -]bits?|
vaccin|
(read|write|migrate[ds]?)[ -](access|operat|permi)|
random num|
strange|
basis (amount|point)|
lead manage|
spokesw?o?man|
test[ -](pre-?req|vector)|
total word|
(
  buildi?n?g?|
  configure|
  chat|
  magic|
  setup
  ) script|
v[ae]ry interest|
[^ ]proxy|
encourag|
some(one|body) (else|oth)|
short[ -]?cut|
not? list|
query string|
custom[ -]buil[dt]|
either det|
report income|
(left|right)[ -]?wing|
above screen

_LEGAL_FILTER_FOREIGN

dokumentation|
\<geschrieben\>|
\<definitionen\>|
\<dokumentu\>|
\<dokumentazio\>|
\<bitte\>|
\<avec\>|
\<sous\>|
\<distinguir\>|
\<donde\>|
\<memorizzate\>|
\<administativos\>|
\<skriptide\>|
\<ventada\>|
\<persoonlijke\>

_LEGAL_FILTER_WORDS

\<repository\>|
\<d?ichotomy\>|
\<granular\>|
\<generate\>|
\<glues?\>|
\<bits?\>|
\<set[gu]id\>|
a \<c[iy]pher\>|
\<plug-?in\>|
\<connections?\>|
\<tokens?\>|
\<hooks?\>|
\<defects?\>|
\<integers?\>|
\<pointers?\>|
\<vectors?\>|
\<bios\>|
\<protocols?\>|
\<stringlist\>|
\<newline\>|
\<dial-?up\>|
\<faq\>|
\<std(in|out|err)\>|
\<status\>|
\<pleas(e[ds]|ing)\>|
\<preparations?\>|
\<latenc(y|ies)\>|
\<literally\>|
\<profil(e[ds]?|ing)\>|
\<hacking\>|
\<greet(ings?)\>|
\<position\>|
\<holistic\>|
\<pre-?remove\>|
\<vague\>|
\<in-?line\>