FOSSology: Advancing open source analysis and development
FOSSology License Analysis
Applies to version 0.9.0 and beyond
DRAFT
Revisions: July 7, 2008 – Neal Krawetz
Initial: June 20, 2008
Daniel Stangel, Hewlett-Packard Company
The accurate detection of the declared or intended software licenses in open source software source code is a challenging problem facing everyone from independent software developers and operating system distributions, to corporate IT managers and commercial software vendors. FOSSology takes a multi-level approach of filtering, analyzing and refining the search to provide the most accurate license analysis possible.
This document describes the approach used by FOSSology, as it is implemented in version 0.9.0 of the software.
Licenses can appear in many different forms:
The current analysis system is very good at identifying full text matches and most references. Phrase detection produces a high percentage of false positives, although phrases that name a known license are identified properly. Phrase references, however, are not currently identified.
License detection in FOSSology involves the following high-level steps, which are explained in more detail below:
A license “phrase”, or “one-sentence license”, is a simple sentence that represents a license. Samples include:
While a full license may contain many of these phrases, files without full licenses can also contain these sentences.
The bSAM engine compares tokens to tokens. For license analysis, “tokens” are words or punctuation found in the analyzed file. Since these tokens may be compared many times, the “Filter_License” agent generates a bSAM cache file of precomputed tokens for the file. At the same time, the Filter_License agent also identifies all sentences that look “license-like”. For example, if a sentence contains the words “distributed under”, then the sentence is flagged as a potential license phrase.
In terms of accuracy, phrase detection creates a significant number of false positives. For this reason, detected phrases are recorded in the cache file but are not yet loaded into the database as a “discovered license”.
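The tokenizing and phrase-flagging steps described above can be illustrated with a minimal sketch. This is not the Filter_License implementation: the function names, the sentence-splitting rule, and the trigger list are assumptions made for illustration, and only the trigger “distributed under” comes from this document.

```python
import re

# Hypothetical trigger list; this document names only "distributed under".
TRIGGERS = ("distributed under",)

def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def split_sentences(text):
    """Naive sentence splitter (an assumption): break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def flag_license_like(text):
    """Return the sentences that contain any trigger phrase."""
    flagged = []
    for sentence in split_sentences(text):
        normalized = " ".join(sentence.lower().split())
        if any(trigger in normalized for trigger in TRIGGERS):
            flagged.append(sentence)
    return flagged

sample = (
    "This program is distributed under the GPL. "
    "It parses log files. "
    "Samples were distributed under the table at the conference."
)
tokens = tokenize("Licensed under the GPL.")
phrases = flag_license_like(sample)
```

In this sample, the last sentence is flagged even though it has nothing to do with licensing, which is the kind of false positive that keeps phrases out of the database for now.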
It is an ironic fact of life that lawyers frequently plagiarize lawyers. Most open source licenses are either copied or derived from other known licenses. While there may be variations (a name changed, a phrase altered, etc.), it is very common to have similar licenses.
A License Template represents the “DNA” of open source software licenses. A License Template can represent:
FOSSology includes an extensive library of pre-defined License Templates that cover many, though not all, open source software licenses and their variants or derivatives.
NOTE: We cannot conclusively say that FOSSology's library of License Templates covers all licenses, since anyone can create a new license at any time, and the universe of open source software itself is far too large and dispersed to definitively scan everything.
Using the bSAM algorithm and the License Templates, FOSSology scans the uploaded source code and identifies the best match for every instance of a software license.
This system reduces the number of false-positives from phrase matches, and identifies license regions based on templates.
In simple terms, bSAM is a pattern matching algorithm. It takes two sets of tokens as input and compares them. The tokens are generated by filter files. In the case of software licenses, Filter_License tokenizes all of the text of the License Templates and all of the text of a source code upload, then bSAM compares them. bSAM reports which template(s) are the best match for each source code file, and also identifies precisely where in the file the match occurred, including additions and omissions of text.
For a more thorough treatment of the bSAM algorithm, please refer to the FOSSology whitepaper on the subject.
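To make the idea of comparing two token sets concrete, here is a rough sketch that uses Python's standard `difflib.SequenceMatcher` as a stand-in. It is not the bSAM algorithm, and the template names, template texts, and the scoring rule (matched tokens divided by template length) are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def compare(template_text, file_text):
    """Return (template coverage, list of non-matching spans)."""
    template = tokenize(template_text)
    candidate = tokenize(file_text)
    matcher = SequenceMatcher(None, template, candidate, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # 'insert' = text added in the file, 'delete' = template text omitted,
    # 'replace' = template text altered in the file.
    differences = [
        (tag, template[i1:i2], candidate[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matched / len(template), differences

def best_template(templates, file_text):
    """Score every template against the file and return the best name."""
    scores = {name: compare(text, file_text)[0]
              for name, text in templates.items()}
    return max(scores, key=scores.get), scores

# Hypothetical templates, far shorter than any real License Template.
templates = {
    "TemplateA": "Permission is granted to copy and modify this software.",
    "TemplateB": "All rights reserved. Redistribution is not permitted.",
}
file_text = ("// Permission is hereby granted to copy and modify "
             "this software freely.")

best, scores = best_template(templates, file_text)
_, differences = compare(templates["TemplateA"], file_text)
```

Here `best` names the template with the highest coverage, and `differences` lists the tokens that were added to or omitted from the template text, which mirrors the additions and omissions that bSAM reports.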
The bSAM algorithm solves half of the problem for license analysis. It answers the question, “Where is the license in this file?” However, bSAM only knows the name of the template that matched and not what name to call the matched region.
For example, if there is a 99% match between the file and the AFL template, then the name of the license in the region is likely “AFL”. However, if you replace all instances of “AFL” in the match with something else, like “MPL”, then the section is really an “MPL” license and not an “AFL” license. In this case, bSAM will still report a high match percentage against the AFL template (since most of the tokens matched), even though the license is not “AFL”.
Similarly, if a phrase is identified, then it means that the text did not match any license templates. The matched region containing the phrase is known, but not the name to call the region.
License Terms are used to identify the name of a license. They are words, phrases, or keywords that denote known licenses. For example, “GPL”, “Gnu Public License”, and “Gnu-PL” are all common terms for the Free Software Foundation's GNU General Public License. We can call this group by the canonical name “GPL”.
After bSAM identifies license regions, the licinspect agent scans the region for known license terms.
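The term lookup can be sketched as a table that maps each canonical name to its known terms, followed by a scan of the matched region. This is an illustration only, not the licinspect implementation: the table contents beyond the GPL terms listed above, the function name, and the whole-word, case-insensitive matching rule are assumptions.

```python
import re

# Illustrative term table: canonical name -> terms that refer to it.
# Only the GPL terms come from this document; the rest are assumed.
LICENSE_TERMS = {
    "GPL": ["GPL", "Gnu Public License", "Gnu-PL"],
    "MPL": ["MPL", "Mozilla Public License"],
    "AFL": ["AFL", "Academic Free License"],
}

def find_license_names(region):
    """Return the canonical names whose terms appear in the region."""
    found = []
    for canonical, terms in LICENSE_TERMS.items():
        for term in terms:
            # Whole-word, case-insensitive match (an assumed rule).
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, region, flags=re.IGNORECASE):
                found.append(canonical)
                break
    return found

# A region that, hypothetically, matched the AFL template but whose
# text names a different license.
region = ("This file is licensed under the MPL. You may not use this "
          "file except in compliance with the MPL.")
names = find_license_names(region)
gnu_names = find_license_names("Released under the Gnu-PL.")
```

For the AFL/MPL case described earlier, a lookup like this would surface “MPL” from the region text regardless of which template produced the match.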
License Terms are a good approach for identifying licenses by name. However, there are some notable limitations: