FOSSology License Analysis

Applies to version 0.9.0 and beyond

DRAFT

Revisions: July 7, 2008 – Neal Krawetz

Initial: June 20, 2008

Daniel Stangel, Hewlett-Packard Company

Introduction

The accurate detection of the declared or intended software licenses in open source software source code is a challenging problem facing everyone from independent software developers and operating system distributions, to corporate IT managers and commercial software vendors. FOSSology takes a multi-level approach of filtering, analyzing and refining the search to provide the most accurate license analysis possible.

This document describes the approach used by FOSSology, as it is implemented in version 0.9.0 of the software.

License Types

Licenses can appear in many different forms:

  • Full text. The full text of the license may be included.
  • References. A reference is a paragraph or two that effectively identifies the governing license text, but without including the full text.
  • Phrase. Phrases, or one-sentence licenses, are the bare minimum for identifying the governing license text. It could be something like “This is MPL.” or “This is distributed as GPL.”
  • Phrase references. These are single sentences that reference a license without actually saying the license name. For example, “Please refer to license.txt.”

The current analysis system is very good at identifying full text matches and most references. Phrase detection produces a high percentage of false positives, although phrases that contain known license names are identified properly. Phrase references are currently not identified.

The License Detection Process

License detection in FOSSology involves the following high-level steps, which are explained in more detail below:

  1. Upload the code to FOSSology
  2. Unpack the code to expose all files
  3. Scan the files for possible license-like phrases
  4. Scan the files to identify license regions
  5. Scan the identified license regions for License Terms in order to identify the license name
  6. Identify User-Defined License Groups

License Phrases

A license “phrase”, or “one-sentence license”, is a simple sentence that represents a license. Samples include:

  • This is free software.
  • This file is distributed under the GPL.
  • This code is in the public domain.
  • License: MPL

While a full license may contain many of these phrases, files without full licenses can also contain these sentences.

The bSAM engine compares tokens to tokens. For license analysis, “tokens” are words or punctuation found in the analyzed file. Since these tokens may be compared many times, the “Filter_License” agent generates a bSAM cache file of precomputed tokens for the file. At the same time, the Filter_License agent also identifies all sentences that look “license-like”. For example, if a sentence contains the words “distributed under”, then the sentence is flagged as a potential license phrase.

Phrase detection creates a significant number of false positives. For this reason, phrases are identified in the cache file but not yet loaded into the database as a “discovered license”.
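The tokenizing and phrase-flagging steps can be illustrated with a minimal sketch. This is not the Filter_License implementation: the trigger list, the function names, and the naive sentence splitter are all assumptions made for illustration, and only “distributed under” comes from the text above.

```python
import re

# Hypothetical trigger phrases. Only "distributed under" is taken from the
# document; the others are assumed for illustration.
TRIGGERS = ("distributed under", "free software", "public domain", "license:")

def tokenize(text):
    """Split text into lowercased word and punctuation tokens."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text.lower())

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def license_like_sentences(text):
    """Return the sentences that contain at least one trigger phrase."""
    return [s for s in split_sentences(text)
            if any(trigger in s.lower() for trigger in TRIGGERS)]

sample = ("This program sorts numbers. "
          "This file is distributed under the GPL. It is fast.")
flagged = license_like_sentences(sample)
tokens = tokenize("License: MPL")
```

A substring test this simple also flags sentences that merely mention a trigger phrase in passing, which is consistent with the high false-positive rate described above.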

License Templates

It is an ironic fact of life that lawyers frequently plagiarize lawyers. Most open source licenses are either copied or derived from other known licenses. While there may be variations (a name changed, a phrase altered, etc.), it is very common to have similar licenses.

A License Template represents the “DNA” of open source software licenses. A License Template can represent:

  • The complete text of an open source license
  • Portions of an open source license
  • A declaration of, or reference to, an open source license

FOSSology includes an extensive library of pre-defined License Templates that cover many, though not all, open source software licenses and their variants or derivatives.

NOTE: We cannot conclusively say that FOSSology's library of License Templates covers all licenses, since anyone can create a new license at any time, and the universe of open source software itself is far too large and dispersed to definitively scan everything.

Using the bSAM algorithm and the License Templates, FOSSology scans the uploaded source code and identifies the best match for every instance of a software license.

  • If a license is found, then the license region is identified and stored in the database.
  • If a license is not found, but a license phrase exists, then the phrase is stored in the database.

This system reduces the number of false-positives from phrase matches, and identifies license regions based on templates.

bSAM – the binary Symbolic Alignment Matrix algorithm

In simple terms, bSAM is a pattern matching algorithm. It takes two sets of tokens as input, and compares them. The tokens are generated by filter files. In the case of software licenses, Filter_License tokenizes all of the text of the License Templates and all of the text of a source code upload, then bSAM compares them. bSAM reports which template(s) are the best match for each source code file, and also identifies precisely where in the file the match occurred, including additions and omissions of text.
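The idea of aligning two token sequences and reporting the matched region, additions, and omissions can be sketched with Python's standard difflib module. This is a stand-in for illustration only: difflib's SequenceMatcher is not the bSAM algorithm, and the function name and return values are assumptions.

```python
from difflib import SequenceMatcher

def align(template_tokens, file_tokens):
    """Align a template against a file and summarize the result.

    Returns (coverage, region, changes):
      coverage - fraction of template tokens found in the file
      region   - (start, end) token offsets of the match in the file, or None
      changes  - non-matching spans as (tag, template_span, file_span)
    """
    sm = SequenceMatcher(a=template_tokens, b=file_tokens, autojunk=False)
    # The final matching block is a zero-length sentinel, so filter it out.
    blocks = [b for b in sm.get_matching_blocks() if b.size]
    matched = sum(b.size for b in blocks)
    coverage = matched / len(template_tokens) if template_tokens else 0.0
    region = (blocks[0].b, blocks[-1].b + blocks[-1].size) if blocks else None
    changes = [(tag, template_tokens[i1:i2], file_tokens[j1:j2])
               for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]
    return coverage, region, changes

template = "this file is distributed under the afl".split()
source = "note : this file is distributed under the mpl license".split()
coverage, region, changes = align(template, source)
```

In this example six of the seven template tokens align, and the changes list records that “afl” was replaced by “mpl license”. That kind of substitution is exactly what a later naming step has to examine.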

For a more thorough treatment of the bSAM algorithm, please refer to the FOSSology whitepaper on the subject.

License Terms

The bSAM algorithm solves half of the problem for license analysis. It answers the question, “Where is the license in this file?” However, bSAM only knows the name of the template that matched and not what name to call the matched region.

For example, if there is a 99% match between the file and the AFL template, then the name of the license in the region is likely “AFL”. However, if you replace all instances of “AFL” in the match with something else, like “MPL”, then the section is really an “MPL” license and not an “AFL” license. In this case, bSAM will identify a high percentage of match against the AFL template (since most of the tokens matched), even though the license is not “AFL”.

Similarly, if a phrase is identified, then it means that the text did not match any license templates. The matched region containing the phrase is known, but not the name to call the region.

License Terms are used to identify the name of a license. They are words, phrases, or keywords that represent known licenses. For example, “GPL”, “Gnu Public License”, and “Gnu-PL” are all common terms for the Free Software Foundation's GNU General Public License. We can call this group by the canonical name “GPL”.
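A term table that maps spelling variants to a canonical name might look like the following sketch. The table contents and the function name are illustrative assumptions; FOSSology's actual default terms are not listed in this document.

```python
import re

# Illustrative term table (assumed). Keys are lowercase terms, values are
# canonical names.
TERMS = {
    "gpl": "GPL",
    "gnu public license": "GPL",
    "gnu-pl": "GPL",
    "mpl": "MPL",
    "mozilla public license": "MPL",
}

def find_canonical_names(text):
    """Return the set of canonical names whose terms occur in the text."""
    lowered = text.lower()
    found = set()
    for term, canonical in TERMS.items():
        # Require non-alphanumeric boundaries so "mpl" does not match "simple".
        pattern = r"(?<![a-z0-9])" + re.escape(term) + r"(?![a-z0-9])"
        if re.search(pattern, lowered):
            found.add(canonical)
    return found

names = find_canonical_names("Released under the Gnu Public License.")
```

Because the lookup is purely textual, a sentence such as “This is not GPL” also yields “GPL”. The sketch makes no attempt to interpret negation.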

After bSAM identifies license regions, the licinspect agent scans the region for known license terms.

  • If no new terms were added and none were removed, then the template name is likely correct.
    • If the identified region matches the entire license template (97% or better), then use the actual license name to identify the region.
    • If the identified region matches most of the license template (70% or better), then call it “-style”, like “GPL-style”, since it isn't a solid match.
    • Otherwise, the identified region matches only part of the template, so call it “-partial”. “MPL-partial” means it matched part of the MPL license.
  • If any terms were added or removed (compared to the license template), then the template's name is probably not correct. Instead, the entire matched region is scanned for license terms.
    • If any terms were found, then name the license based on the canonical name for the terms. For example, if “OSL” was removed and “ASL” was added, then call it “ASL” and not “OSL”.
    • If no terms are found, then call it a “-partial” match, even if it is a 99% match. This is because at least one significant license term had to have been removed to match this condition.
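The naming rules above can be summarized as a small decision function. This is a sketch under stated assumptions, not the licinspect code: the function name and parameters are hypothetical, and it assumes that the final “-partial” case is named after the matched template.

```python
def name_region(template_name, match_fraction, terms_changed, region_names):
    """Choose display names for a matched license region.

    template_name  - name of the best-matching template, e.g. "GPL"
    match_fraction - fraction of the template matched, from 0.0 to 1.0
    terms_changed  - True if license terms were added or removed
                     relative to the template
    region_names   - canonical names found by scanning the matched region
    """
    if not terms_changed:
        if match_fraction >= 0.97:
            return [template_name]               # solid match
        if match_fraction >= 0.70:
            return [template_name + "-style"]    # most of the template
        return [template_name + "-partial"]      # only part of the template
    if region_names:
        # Terms differ from the template, so trust the terms in the region.
        return sorted(region_names)
    # A term was removed and none was found: partial, even at 99%.
    # Assumption: the "-partial" name is based on the template name.
    return [template_name + "-partial"]

result = name_region("OSL", 0.99, True, {"ASL"})
```

The function returns a list because more than one canonical name can match a region.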

Analysis Limitations

License Terms are a good approach for identifying licenses by name. However, there are some notable limitations:

  • Ambiguous names. The same term may match two different canonical names. For example, “AFL” could be the “Academic Free License” or the “Apache Free License”. We try to limit the impact from ambiguous names, but the condition does exist. Currently, all matching canonical names are reported.
  • Reused names. People frequently create their own licenses. If Gerome Peters calls his license the “Gerome Peters License”, or “GPL” for short, then terms will incorrectly match the GNU General Public License.
  • The 'Not' Problem. Terms are analyzed syntactically and not semantically. If the matched region contains “This is not GPL” then it will match the GPL term and be called by the canonical name “GPL”. Negative terms such as “not”, “isn't”, and “but” are not considered during the matching process.
  • Tomato Problem. Some people spell it “License”, other people spell it “Licence” (s or c). If there is a term but it is not known or associated with a canonical name, then the license will not be properly identified. (You say toe-may-to, I say toe-mah-to.) We have added many of these alternate spellings to the default license terms, but new ones are bound to crop up.
  • Unknown Template. If the file contains a large license section that is not part of the known license templates, then the results will likely appear as a large number of “-partial” matches. Although the license will be mostly covered by these partial matches, the name may not be accurate. This situation usually means that a new license template should be added to the system.
  • Not a License. Legal items such as “copyright”, “trademark”, and “patent” are not licenses. A “copyright” identifies who controls the license, but does not identify the license itself. (And in the United States, all files have a copyright holder, even if the copyright holder is not explicitly identified.) A file with a “Sun Microsystems” copyright does not imply a “Sun” public license. Trademarks and patents identify regulated items/terms, but do not identify licenses. Since the current system only identifies “licenses”, these non-license items are not identified.
 
1.0.0/fossology_license_analysis.txt · Last modified: 2009/04/16 12:42 (external edit)

Copyright (C) 2007-2009 Hewlett-Packard Development Company, L.P.
FOSSology Project documentation is licensed under the GNU Free Documentation License Version 1.2