FOSSology
Advancing open source analysis and development
 

Interpreting the License Analysis Report (0.6.1)

License Hierarchy Description

License analysis is performed by comparing an unknown file (that contains zero or more license sections) with a set of license templates. The comparison algorithm used by the license agent looks for groups of similar words in a similar order. The algorithm does not mind if individual words are replaced. For example, “This is the Gnu public license and you can share it” matches “This is the Neal public license and you can share it”. Changing a single word in the license usually does not change the meaning of the license.
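To make the idea concrete, here is a minimal sketch of order-aware word matching. It is not the license agent's algorithm: it uses Python's standard difflib.SequenceMatcher as a stand-in, and the function names are made up for illustration.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Split text into lowercase word tokens."""
    return text.lower().split()


def matched_fraction(unknown, template):
    """Fraction of the unknown text's words that line up, in order, with the template."""
    a = tokenize(unknown)
    b = tokenize(template)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Sum the sizes of all in-order runs of identical words.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(a) if a else 0.0


unknown = "This is the Neal public license and you can share it"
template = "This is the Gnu public license and you can share it"
print(f"{matched_fraction(unknown, template):.0%}")  # 10 of 11 words align: 91%
```

One replaced word lowers the score slightly but does not break the match, which is the behavior described above.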

Individual word changes (or small groups of words) are very common. (It appears that plagiarism does not apply to license text.) While some open source projects use “standard” licenses, such as GPL or BSD, other projects create their own licenses by merging in parts from different licenses. For example, a license may contain the three requirements found in the BSD license along with the warranty disclaimer from GPL and the distribution requirements from the MIT license. There are also many cases where a well-known license is simply renamed. One of the most common is the use of the LGPL license, where “GNU Lesser General Public License” is renamed after a company or project. Similarly, many projects take the GNU Library General Public License (GLGPL) and replace “library” with “program” or an application's name. None of this changes the license requirements or the template that it matches; it only changes the license name and the percentage of the match.

The worst-case scenarios happen when projects take a non-GPL license and simply replace the license name with “GPL”. The question becomes, did the author mean “GPL” or did they mean they wanted their own license rules? Fortunately, this is an issue for the lawyers to resolve. The license analyzer makes no legal interpretation about the semantic meaning of the license. It only matches text against license templates and identifies the percentage of the match.

Under the user interface, you can select a project and click on the license tab. This shows a histogram of the discovered licenses. Each type of license is listed as well as the number of files containing the license. For example:

Count License
1299 Apache Software License 2.0 reference
13 Intel-OSL
10 Phrase
6 Apache Software License 2.0
3 BSD UCRegents 2
2 RSA MD5
1 MIT (oldstyle)
1 Apache Software License 1.1

Each of the licenses has a distinct name and identifies a distinct license. However, “Phrase” is a catch-all category. Licenses that are unknown to the analysis system are usually identified by common phrases, such as “is distributed under…”. Phrases that are potentially associated with licenses are listed in the Phrase category.
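The histogram is simply a count of files per license name. A small sketch of that bookkeeping follows. The (file, license) pairs are made-up sample data, not output from a real analysis, and a real file may appear under more than one license.

```python
from collections import Counter

# Hypothetical per-file results: (file name, license name) pairs.
file_licenses = [
    ("pcretest.c", "Intel-OSL"),
    ("COPYING", "Intel-OSL"),
    ("README", "Phrase"),
    ("md5.c", "RSA MD5"),
    ("md5.h", "RSA MD5"),
    ("NOTICE", "Phrase"),
    ("LICENCE", "Intel-OSL"),
]


def license_histogram(pairs):
    """Return (license name, file count) tuples, most common first."""
    counts = Counter(name for _, name in pairs)
    return counts.most_common()


print("Count License")
for name, count in license_histogram(file_licenses):
    print(f"{count:>5} {name}")
```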

You can click on each of the license types and see a list of files that contain the license.

License: Intel-OSL
1. 98% pcretest.c
2. 97% COPYING
3. 97% LICENCE
4. 97% pcre.hw
5. 97% pcre.in
6. 97% pcregrep.c
7. 95% internal.h
8. 95% ucptypetable.c
9. 94% pcre.c
10. 93% study.c
11. 91% maketables.c
12. 91% printint.c
13. 89% dftables.c

The files are ordered by the percentage of match. In the example, the file “COPYING” has a section of text that includes a 97% match with a section of the Intel-OSL license. By clicking on the file name, you can see the actual text of the file with the matching license text highlighted.

At the top of the file contents is an index table that lists the licenses in the file, a link to the instance (click on “view”), a link to the actual license (click on “ref”), and a color – each identified license is color coded. Items without a “ref” denote Phrases that are identified as possible license text.

The actual matched text within the document is highlighted to match the license key. Words that are not included in the match are not highlighted. In this example, the attribution of the license has been changed to say “University of Cambridge” and the owner's name has been replaced with “COPYRIGHT OWNER”. Outside of these specific changes, the license text matches the Intel-OSL license.

Templates

The license templates are categorized into families with similar text. It is important to note that “similar text” does not mean “similar purpose”. Your company may consider one type of license to be “bad” but a similar text license to be “good”. Any interpretation is left up to you; the groupings are strictly by similar text. For example, the generic Academic Free License (AFL) appears to have similar text to the Open Software License (OSL). Based on the word usage, OSL was probably derived from AFL (or vice versa). This creates the hierarchy “AFL/AFL/” and “AFL/OSL/”. Under AFL/AFL/ are different versions of the AFL license: 1.1, 1.2, 2.0, etc. Similarly, the BSD license comes in two flavors: the old and new BSD licenses. BSD/BSD.new/ contains the actual BSD “new” license (BSD/BSD.new/BSD_new) as well as derivatives such as the Apache, Cryptix, and SSLeay licenses. Each license is different, but they contain enough similar text to be derivatives of a central license.

The names for the license text attempt to describe the license. For example, GPL-based licenses contain “GPL” in the template name or family, and the Free Software Foundation's license family is denoted by “FSF”. However, some licenses do not have formal names. For the templates, these have been named after the general purpose, such as “Free/Free Use No Change” and “Free/Beerware”. (Beerware is a real license, but it is categorized under the general-purpose “Free” family.) To re-emphasize: the naming is relatively arbitrary and should not be interpreted as legal advice.

When there are multiple ways to present a license, the different variants are numbered. For example, “Corporate/Sun/Sun Microsystems variant 1” and “Corporate/Sun/Sun Microsystems variant 2” include different text that means the same thing.

Licenses are not always included in their entirety. Files frequently contain references to the license rather than the actual license. License references are included in the templates, such as “AFL/OSL/Open Software License 1.0 reference”. In some cases, there may be multiple common reference templates, so these are numbered such as “GPL/v2/GPLv2 reference 2” and “GPL/v2/GPLv2 reference 3”.

Besides references, licenses may include shortened versions that summarize the license. For example, “Adobe/Adobe short” is a variation of “Adobe/Adobe”. Similarly, licenses may contain sections such as supplements and appendices.

License Phrases

When possible, text is matched against license templates. However, not every license matches a template. This can happen when a new license (not in the list of templates) is identified. Similarly, licenses may be single sentences, such as “This is free, enjoy.” Single sentences usually have legal meanings even if there is no formal license associated with the file.

The license analyzer (technically, the filter_license agent) identifies potential license phrases. Any sentence containing these phrases and not found in a license template is flagged as a potential license phrase:

  • “license:” (Any sentence containing “license:” is flagged as a possible phrase.)
  • is free|freely (matches “is free” or “is freely”)
  • is not free|freely
  • provide|provided freely
  • distribute|distributed freely
  • release|released freely
  • is provided|distributed|released|licensed|licenced|covered|adheres as|under|by|in (E.g., “this software is released under my license.”)
  • under the terms of
  • “as is” (or 'as is' with single quotes)
  • “as-is” (or 'as-is')
  • proprietary (if this appears anywhere, it is flagged as a potential license)

As an aside, there is an intentional spelling error “licenced” in this mix because it appears in far too many files. (Nobody ever said that engineers could spell.)
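As a rough illustration of how the phrase list above behaves, the sketch below rewrites it as ordinary Python regular expressions. This is an approximation for explanation only: the real agent uses its own word-based pattern syntax, and this sketch covers only the phrase test, not the check against license templates. The names PHRASE_PATTERNS and is_potential_license_phrase are invented here.

```python
import re

# Approximate regex equivalents of the phrase list (an assumption, not the agent's patterns).
PHRASE_PATTERNS = [
    r"\blicense:",
    r"\bis (?:not )?(?:free|freely)\b",
    r"\b(?:provide|provided|distribute|distributed|release|released) freely\b",
    r"\bis (?:provided|distributed|released|licensed|licenced|covered|adheres) (?:as|under|by|in)\b",
    r"\bunder the terms of\b",
    "[\"'\u201c]as[ -]is[\"'\u201d]",  # quoted "as is" or "as-is"
    r"\bproprietary\b",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PHRASE_PATTERNS]


def is_potential_license_phrase(sentence):
    """Return True if the sentence contains any of the listed phrases."""
    return any(p.search(sentence) for p in COMPILED)


for s in [
    "this software is released under my license.",
    "This is free, enjoy.",
    "The loop counter is incremented here.",
]:
    print(is_potential_license_phrase(s), "-", s)
```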

While matches against license templates are very accurate (very few false-positives and very few false-negatives), license phrase matching is less accurate. For example, the sentence “mimic existing proprietary applications for instance” likely matches source code, while “license from proprietary software” is probably important for a legal interpretation. (And “Throw away proprietary and site licenses” could be from source code or be legally important.) Each of these examples comes from a real code analysis.

Due to the lower accuracy level for phrases, it is important to review each case.

Percent of Match

Licenses are matched based on a percentage of similar tokens. Tokens are simply words or punctuation. For example, consider a file that has a potential license section that contains 500 tokens. If 400 of the tokens match a section of a license template that contains 2000 tokens, then it matches 400/500 tokens, or an 80% match. Since 20% of the text does not match, it could indicate a new license clause, alternate wording, or simply a replaced term.
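The arithmetic can be written out in a few lines. The tokenizer below is only one plausible reading of “words or punctuation” and is an assumption; the agent's real tokenizer may split text differently.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens (assumed rule)."""
    return re.findall(r"\w+|[^\w\s]", text)


def percent_match(matched_tokens, section_tokens):
    """Percentage of the potential license section that matched a template."""
    if section_tokens <= 0:
        raise ValueError("section must contain at least one token")
    return 100.0 * matched_tokens / section_tokens


print(tokenize("Hello, world."))   # ['Hello', ',', 'world', '.']
print(percent_match(400, 500))     # 80.0
```

Note that the template's size (2000 tokens in the example) does not enter the calculation; only the size of the section in the file does.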

When viewing the license under the UI, the matched tokens are highlighted. Any word (or character) not highlighted was not part of the match. The highlighting allows users to quickly determine what was changed. It could be as simple as spelling out “General Public License” instead of “GPL”, or it could be the inclusion of the word “not” (a small, but very critical word for legal interpretation).

A “100%” match indicates that the entire potential section matched something in the template, but does not necessarily mean that the entire template matched the section. For example, a license section may have a 100% match with BSD/BSD.new/BSD_new, but only match the warranty clause.
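The difference between the two directions of coverage is easy to see with numbers. The token counts below are invented for illustration, and the second figure (how much of the template was used) is computed here only to show the contrast with the reported percentage.

```python
def section_and_template_coverage(matched, section_len, template_len):
    """Return (percent of the section matched, percent of the template matched)."""
    return (100 * matched / section_len, 100 * matched / template_len)


# Hypothetical case: a 120-token warranty clause found inside a 600-token template.
section_pct, template_pct = section_and_template_coverage(120, 120, 600)
print(section_pct)   # 100.0, the figure shown as the match percentage
print(template_pct)  # 20.0, only a fifth of the template was involved
```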

License Templates

License templates are arranged in directories that denote similar text. The organization is strictly based on text similarities and not semantics. Each template has a unique name – the user interface only displays the name and not the hierarchical path.
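As a small illustration of the path-versus-name distinction, the sketch below splits a template path into its family and the name shown in the user interface. The helper names are hypothetical and the split rule (everything after the last “/” is the name) is an assumption based on the listing.

```python
def display_name(template_path):
    """Name shown in the UI: the last component of the hierarchical path."""
    return template_path.rsplit("/", 1)[-1]


def family(template_path):
    """Directory part of the path, which groups templates by similar text."""
    return "/".join(template_path.split("/")[:-1])


path = "BSD/BSD.new/Apache/Apache Software License 2.0 reference"
print(display_name(path))  # Apache Software License 2.0 reference
print(family(path))        # BSD/BSD.new/Apache
```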

The current list of license templates is as follows:

Adaptive/Adaptive 1.0
Adaptive/Adaptive 1.0 Appendix A
Adobe/Adobe
Adobe/Adobe short
AFL/AFL/Academic Free License 1.1
AFL/AFL/Academic Free License 1.2
AFL/AFL/Academic Free License 2.0
AFL/AFL/Academic Free License 2.1
AFL/AFL/Academic Free License 3.0
AFL/OSL/Open Software License 1.0
AFL/OSL/Open Software License 1.0 reference
AFL/OSL/Open Software License 1.1
AFL/OSL/Open Software License 2.0
AFL/OSL/Open Software License 2.1
AFL/OSL/Open Software License 3.0
APSL/Apple Public Source License 1.0
APSL/Apple Public Source License 1.1
APSL/Apple Public Source License 1.2
APSL/Apple Public Source License 2.0
Artistic/Artistic 1.0
Artistic/Artistic 1.0 short
Artistic/Artistic 2.0
Artistic/Artistic 2.0beta4
BSD/BSD.new/Apache/Apache Software License 1.0
BSD/BSD.new/Apache/Apache Software License 1.1
BSD/BSD.new/Apache/Apache Software License 2.0
BSD/BSD.new/Apache/Apache Software License 2.0 reference
BSD/BSD.new/BSD new
BSD/BSD.new/BSD new short
BSD/BSD.new/Cryptix
BSD/BSD.new/Entessa Public License
BSD/BSD.new/Maia Mailguard License
BSD/BSD.new/Naumen Public License
BSD/BSD.new/OpenPBS
BSD/BSD.new/Phorum
BSD/BSD.new/PHP/PHP 3.0
BSD/BSD.new/SSLeay
BSD/BSD.new/Vovida Software License 1.0
BSD/BSD.new/Zend
BSD/BSD.old/Attribution Assurance License
BSD/BSD.old/BSD As-Is clause
BSD/BSD.old/BSD Harvard
BSD/BSD.old/BSD NRL
BSD/BSD.old/BSD old
BSD/BSD.old/BSD UCRegents
BSD/BSD.old/BSD UCRegents 2
BSD/BSD.old/BSD zlib
BSD/BSD.old/FreeBSD
BSD/BSD.old/Intel-OSL
BSD/BSD.old/OpenLDAP
BSD/BSD.old/OpenSSL
BSD/BSD.old/Sleepycat
BSD/BSD.old/Sleepycat short
BSD/BSD.old/Zope/Zope 1.0
BSD/BSD.old/Zope/Zope 2.0
CDDL/CDDL 1.0
Corporate/Apple/Apple Common Documentation License 1.0
Corporate/Apple/Apple Squeak
Corporate/CA/TOSL/Computer Associates Trusted Open Source License 1.1
Corporate/HP/Hewlett-Packard
Corporate/HP/HP-UX Java
Corporate/HP/HP-UX JRE
Corporate/IBM/IBM JRE
Corporate/IBM/IBM reciprocal
Corporate/Logica/Logica Open Source License Version 1.0
Corporate/Lucent/Lucent Public License 1.0
Corporate/Lucent/Lucent Public License 1.02
Corporate/Microsoft/Microsoft EULA
Corporate/Microsoft/Microsoft EULA 2003
Corporate/Microsoft/Microsoft EULA Software
Corporate/Motorola
Corporate/NCD/Network Computing Devices 1993
Corporate/NetComponents/NetComponents
Corporate/Nokia/Nokia Open Source License 1.0a
Corporate/Nvidia
Corporate/RSA/RSA MD5
Corporate/SGI/SGI CID 1.0
Corporate/SGI/SGI GPX 1.0
Corporate/Skype
Corporate/Sun/Bigelow&Holmes
Corporate/Sun/Sun Microsystems Binary Code License
Corporate/Sun/Sun Microsystems Binary Code License supplement
Corporate/Sun/Sun Microsystems Free with Copyright 1
Corporate/Sun/Sun Microsystems Free with Copyright 2
Corporate/Sun/Sun Microsystems Sun Public License
Corporate/Sun/Sun Microsystems variant 1
Corporate/Sun/Sun Microsystems variant 2
Corporate/Sun/Sun Solaris Source Code License Foundation Release
CPL/Common Public License 1.0
CPL/IBM/IBM_PL/IBM Public License 1.0
Creative_Commons/Creative Commons GPL
Creative_Commons/Creative Commons LGPL
Creative_Commons/Creative Commons Public Domain
Creative_Commons/Creative Commons Public License
Edu/CMU/Carnegie Mellon University 1998
Edu/CMU/Carnegie Mellon University 2000
Edu/CWI (Center for Mathematics and Computer Science, Netherlands)
Edu/Educational Community License
Edu/University of Utah Public License
Edu/Univ of Cambridge
Edu/Univ of Edinburgh
Edu/Univ of Notre Dame
Eiffel/Eiffel Forum License 1
Eiffel/Eiffel Forum License 2
FreeArtLicense/Free Art License 1.2
Free/Beerware
Free/Fair License
Free/Free clause
Free/Free clause variant 2
Free/Free clause variant 3
Free/Free use no change clause
Free/FreeWithCopyright/Free with copyright clause variant 1
Free/FreeWithCopyright/Free with copyright clause variant 10
Free/FreeWithCopyright/Free with copyright clause variant 3
Free/FreeWithCopyright/Free with copyright clause variant 4
Free/FreeWithCopyright/Free with copyright clause variant 5
Free/FreeWithCopyright/Free with copyright clause variant 8
Free/FreeWithCopyright/Free with copyright clause variant 9
Free/FreeWithCopyright/UC Regents free with copyright clause
Free/FreeWithCopyright/Unidex
Free/FreeWithCopyright/variant.11
Free/Free with files clause
FreeType/FreeType
FreeType/FreeType reference
Free/WTFPL
FSF/FSF
FSF/FSF variant 1
FSF/FSF variant 2
FSF/FSF variant 3
FSF/FSF variant 4
Gov/CeCILL-B_V1-en
Gov/CeCILL-B_V1-fr
Gov/CeCILL-C_V1-en
Gov/CeCILL-C_V1-fr
Gov/CeCILL_V1.1-US
Gov/CeCILL_V1-fr
Gov/CeCILL_V2-en
Gov/CeCILL_V2-fr
Gov/Government clause
Gov/MITRE Collaborative Virtual Workspace License
Gov/NASA Open Source 1.3
Gov/Starndard ML of New Jersey
GPL/Affero/Affero GPL
GPL/CopyLeft reference
GPL/Dual MPL GPL
GPL/Exception/GPL exception clause 1
GPL/Exception/GPL exception clause 2
GPL/GFDL/GNU Free Documentation License 1.1 reference 1
GPL/GFDL/GNU Free Documentation License 1.1 reference 2
GPL/GFDL/GNU Free Documentation License 1.2
GPL/GFDL/GNU Free Documentation License 1.2 reference
GPL/GPL for Computer Programs of the Public Administration
GPL/GPL from FSF reference
GPL/GPL reference
GPL/LGPL/LGPL 2.0
GPL/LGPL/LGPL 2.0 reference
GPL/LGPL/LGPL 2.0 with exceptions
GPL/LGPL/LGPL 2.1
GPL/LGPL/LGPL 2.1 reference
GPL/LGPL/LGPL 3.0
GPL/LGPL/LGPL gettext library variant
GPL/LGPL/LGPL GNU C Library variant
GPL/LGPL/LGPL wxWindows Library Licence 3.0 variant
GPL/v1/GPLv1
GPL/v1/GPLv1 reference
GPL/v2/eCos
GPL/v2/Free with copyright clause
GPL/v2/GPL from FSF reference 1
GPL/v2/GPL from FSF reference 2
GPL/v2/GPLv2
GPL/v2/GPLv2 Java Index Serialization Package variant
GPL/v2/GPLv2 reference
GPL/v2/GPLv2 reference 2
GPL/v2/GPLv2 reference 3
GPL/v2/GPLv2 reference 4
GPL/v2/McKornik Jr. Public License
GPL/v2/RealNetworks/RealNetworks Community Source Licensing
GPL/v2/RealNetworks/RealNetworks Public Source License 1.0
GPL/v2/RealNetworks/RealNetworks Public Source License 1.0 reference
GPL/v2/Sybase Open Watcom Public License 1.0
GPL/v3/GPLv3
GPL/v3/GPLv3 reference 1
GPL/v3/GPLv3 reference 2
GPL/W3C/World Wide Web Consortium 2001
GPL/W3C/World Wide Web Consortium 2002
Historical/Historical free with copyright clause
Historical/Historical Permission Notice and Disclaimer
ICU/ICU 1.8.1
ICU/ICU 1.8.1 variant
IETF/IETF
IETF/IETF variant
MiscOSS/Aladdin Free Public License
MiscOSS/Bitstream
MiscOSS/BitTorrent
MiscOSS/BitTorrent reference
MiscOSS/Catharon Open Source License
MiscOSS/C_Migemo License
MiscOSS/Condor
MiscOSS/Copy clause
MiscOSS/EU DataGrid Software License
MiscOSS/Frameworx Open License 1.0
MiscOSS/Giftware
MiscOSS/Glide
MiscOSS/gnuplot
MiscOSS/Hacktivismo Enhanced-Source Software License Agreement
MiscOSS/IJG
MiscOSS/iMatix
MiscOSS/Internet Software Consortium
MiscOSS/Jabber Open Source License 1.0
MiscOSS/Jahia Community Source License
MiscOSS/LaTeX Project Public License 1.3a
MiscOSS/mecab-ipadic
MiscOSS/Motosoto Open Source License
MiscOSS/MSNTP License
MiscOSS/Nethack General Public License
MiscOSS/OpenContent License
MiscOSS/Open Motif Public End User License
MiscOSS/Pine License
MiscOSS/qmail License
MiscOSS/Q Public License 1.0
MiscOSS/Ruby
MiscOSS/Scilab License
MiscOSS/TCL
MiscOSS/Vim
MiscOSS/zlib/InfoZip
MiscOSS/zlib/zLib
MIT/Imlib2
MIT/JasPer
MIT/MIT Bigelow&Holmes Luxi font variant
MIT/MIT CMU style
MIT/MIT Free with copyright clause
MIT/MIT HP-DEC variant
MIT/MIT MLton variant
MIT/MIT (modern)
MIT/MIT (modern) with sublicense
MIT/MIT New Jersey variant
MIT/MIT (oldstyle)
MIT/MIT (oldstyle) no ads clause
MIT/MIT (oldstyle) with disclaimer 1
MIT/MIT (oldstyle) with disclaimer 2
MIT/MIT (oldstyle) with disclaimer 3
MIT/MIT Unicode variant
MIT/NCSA
MIT/X11
MIT/X.Net License
MPL/CUA Office Public License 1.0
MPL/Dual MPL MIT
MPL/Interbase
MPL/MPL 1.0
MPL/MPL 1.1
MPL/MPL 1.1 reference
MPL/MPL contributor clause with dual license
MPL/Netizen Open Source License
MPL/NPL 1.1
MPL/NPL 1.1 reference
MPL/NPL contributor clause with dual license
MPL/Ricoh Source Code Public License
MPL/SISSL/SISSL 1.1
MPL/SISSL/SISSL 1.1 reference 1
MPL/SISSL/SISSL 1.1 reference 2
OCLC/OCLC Research Public License 2.0
OpenGroup/Open Group
OpenGroup/Open Group Test Suite License
OpenPublicationLicense/Open Publication License 1.0
OpenPublicationLicense/Open Publication License reference
Python/PSF/Python Software Foundation 2.1.1
Python/PSF/Python Software Foundation 2.2
Python/Python BeOpen
Python/Python CNRI
Python/Python CWI
Python/Python InfoSeek variant
RedHat/Red Hat EULA
RedHat/Red Hat reference

Adding a License Template

The license templates are a set of known licenses used for the analysis. When any part of a known license (called a raw template) matches, the license is marked as a match. However, new or unknown licenses may not match well. In some cases, you may want to add in your own license.

Currently, adding licenses is not user-friendly. It requires re-running the build and modifying the database.

  1. Go into the Raw license directory in the build tree:
    cd trunk/fossology/agents/foss_license_agent/Licenses/Raw/

    This directory contains all of the “Raw” license templates. For organization, templates are grouped into directories. You can either put your raw license in one of these directories, or make your own.

  2. When you run “make install”, this entire tree is copied as-is into the install directory: trunk/fossology/install/usr/local/share/fossology/agents/licenses/. When you run “sudo ./install.sh -f”, the following steps happen:
    • The tree is installed in /usr/local/share/fossology/agents/licenses/
    • Filter_License is used on every file to generate the License.bsam cache and to install the license meta data in the database. This happens when you see the “Processing license_name” during the install.

There are some general guidelines when selecting raw license text:

  • Raw licenses. You should not need to edit any of the actual license text. The “raw” directory contains unmodified licenses.
  • Separate parts. Many licenses contain distinct parts that can and should be separated. For example, a preface or history behind the license does not alter the meaning of the license. Removing these sections speeds the comparison process time. Similarly, appendices should either be removed or separated into their own license files.
  • Short licenses. Many full licenses contain sample text for a minimal (short) license declaration. For example, “Files using this license should contain the following paragraph…” These paragraphs should be extracted into separate files so that they match the short license.
  • Virtually identical licenses. Try to stay away from “virtually identical” licenses. Many projects use the same license but change the copyright name. It is my understanding as a non-lawyer that different names do not change the meaning of the license. The only thing they change is the copyright owner – the owner is the person who has the option to alter the license. As a result, two variations of the BSD license that only differ by the copyright holder's name are both identified as the BSD license. In general, if your company likes the BSD license, then they will probably like the same license held by a different copyright holder. However, it is up to your legal department to determine whether similar licenses have similar desirability.

Adding a License Phrase

One-sentence licenses (1SL) are phrases commonly associated with licenses. When no license template is found, 1SL phrases are displayed.

Currently, all 1SL templates are hard-coded into the Filter_License agent. The file is: trunk/fossology/agents/foss_license_agent/Filter_License/1sl.c

At the very beginning of the code is a global array named “List1SL”. The array ends with NULL,NULL. If you want to add in your own one-sentence license (1SL), then just add the pattern before the NULLs.

Each 1SL array entry has two parts:

  1. A name. (I use “1SL: %s”)
  2. A regular expression that isn't quite regular. (See wordregex.c for default.) The expression terms are as follows:
    • “%” skip 0 or more words (this expression is computationally expensive)
    • “%5” skip up to 5 words
    • “string” match string
    • “^string” do NOT match string
    • “string*” match word beginning with string
    • “*string” match word ending with string
    • “*string*” match word containing string
    • “*^string*” do NOT match word containing string
    • “*” match exactly ONE entire word (same as %1)
    • “string1|string2” match string1 or string2
    • “< … >” place matches in return string
    • “\” quote next character (only for start of match)

All words are matched case-insensitively.
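To show how the single-word terms read in practice, here is a small sketch that tests one term against one word. It is not a port of wordregex.c: it covers only the terms that apply to a single word (not “%”, “< … >”, or “\”), and it assumes that “^string” means “any word other than string”.

```python
def term_matches(term, word):
    """Return True if a single pattern term accepts a single word (case-insensitive)."""
    term = term.lower()
    word = word.lower()
    # Alternation: the word may satisfy any one of the alternatives.
    if "|" in term:
        return any(term_matches(alt, word) for alt in term.split("|"))
    # A lone "*" accepts exactly one word, whatever it is.
    if term == "*":
        return True
    lead = term.startswith("*")
    trail = term.endswith("*") and len(term) > 1
    core = term[1:] if lead else term
    core = core[:-1] if trail else core
    # A "^" in front of the string inverts the test.
    negate = core.startswith("^")
    if negate:
        core = core[1:]
    if lead and trail:
        hit = core in word
    elif trail:
        hit = word.startswith(core)
    elif lead:
        hit = word.endswith(core)
    else:
        hit = word == core
    return hit != negate


print(term_matches("licens*", "Licensed"))       # True
print(term_matches("*^prop*", "proprietary"))    # False
print(term_matches("*.*|*,*", "code."))          # True
```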

As an example one-sentence license phrase:

< * * proprietary % > *.*|*,*|*;*|*:*|*$*|*(*|*)*|*{*|*}*

This expression looks for any set of words containing the word “proprietary”. It returns 2 words before “proprietary” and then any number of words until it finds an end-of-phrase character (period, comma, semi-colon, etc.). This will match phrases like:

This software contains proprietary source code.

(The expression returns “software contains proprietary source code”)

This gets around proprietary Microsoft APIs.

(A real phrase from some open source projects.)

Maybe this should be proprietary?

(Returns “should be proprietary”)
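The behavior of this example pattern can be imitated in a few lines of Python. This is only an imitation under stated assumptions: punctuation is treated as its own token, the phrase ends at one of the listed end-of-phrase characters or at the end of the text, and “?” and “!” are added as terminators so that the third example comes out as shown. The real matching is done by the pattern engine in the Filter_License agent.

```python
import re

# End-of-phrase characters from the pattern, plus "?" and "!" (an assumption).
END_OF_PHRASE = set(".,;:$(){}?!")


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def find_proprietary_phrase(text):
    """Return two words before 'proprietary' through the end of the phrase, or None."""
    tokens = tokenize(text)
    for i, tok in enumerate(tokens):
        # The pattern needs two whole words in front of "proprietary".
        if tok.lower() != "proprietary" or i < 2:
            continue
        end = i + 1
        # Skip words until an end-of-phrase token or the end of the text.
        while end < len(tokens) and tokens[end] not in END_OF_PHRASE:
            end += 1
        return " ".join(tokens[i - 2:end])
    return None


print(find_proprietary_phrase("This software contains proprietary source code."))
print(find_proprietary_phrase("Maybe this should be proprietary?"))
```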

After changing this file, use “make” in the Filter_License/ directory to make sure that it builds. Then use “make” from trunk/fossology/ to build the code and “make install” to install it.

Re-analyzing Licenses

Changes to the license templates or 1SL system are not currently applied to previous analysis results. As a result, files that are already processed will not be reprocessed with the new license. There are a few manual workarounds to reset the database so files will be re-analyzed. Choose one of these two options:

  1. Delete database. Drop the DB, run “./install.sh -f”, then resubmit the packages for processing. This is brutal, but effective.
  2. Manually reset the jobs for reprocessing:
    1. Stop the scheduler. (It does not matter if any jobs were running.)
    2. Run these SQL commands:
      DELETE FROM agent_lic_status;
      DELETE FROM agent_lic_meta;
      UPDATE jobqueue SET jq_starttime=NULL,jq_endtime=NULL,jq_end_bits=0
        WHERE (jq_type = 'filter_license' OR jq_type = 'license' OR
               jq_type = 'filter_clean');
      • The agent_lic_status table says what has been processed. Deleting it makes everything unprocessed.
      • The agent_lic_meta table stores license results.
      • The jobqueue update schedules all license analysis jobs to be re-run.
    3. Restart the scheduler. Every license analysis job will restart and use the new licenses.
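If you want to see what the reset statements do before touching a real database, the sketch below runs them against a throwaway in-memory SQLite database. Everything about the schema here is a stand-in: the real database is not SQLite, the columns of agent_lic_status and agent_lic_meta are invented, and “other_job” is a made-up job type. Only the table names, the jq_* columns, and the three job types come from the steps above.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for the three tables used by the reset."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE agent_lic_status (file_id INTEGER);               -- hypothetical column
        CREATE TABLE agent_lic_meta (file_id INTEGER, lic_name TEXT);  -- hypothetical columns
        CREATE TABLE jobqueue (jq_type TEXT, jq_starttime TEXT,
                               jq_endtime TEXT, jq_end_bits INTEGER);
        INSERT INTO agent_lic_status VALUES (1), (2);
        INSERT INTO agent_lic_meta VALUES (1, 'Intel-OSL'), (2, 'Phrase');
        INSERT INTO jobqueue VALUES
            ('license', '2008-04-01 10:00', '2008-04-01 10:05', 1),
            ('filter_license', '2008-04-01 09:00', '2008-04-01 09:30', 1),
            ('other_job', '2008-04-01 08:00', '2008-04-01 08:10', 1);
    """)
    return conn


def reset_license_jobs(conn):
    """Apply the reset in one transaction so it either fully succeeds or not at all."""
    with conn:
        conn.execute("DELETE FROM agent_lic_status")
        conn.execute("DELETE FROM agent_lic_meta")
        conn.execute(
            "UPDATE jobqueue SET jq_starttime=NULL, jq_endtime=NULL, jq_end_bits=0 "
            "WHERE jq_type IN ('filter_license', 'license', 'filter_clean')"
        )


conn = make_demo_db()
reset_license_jobs(conn)
for row in conn.execute("SELECT jq_type, jq_starttime, jq_end_bits FROM jobqueue"):
    print(row)
```

Jobs of other types keep their start and end times, so only the license analysis is scheduled again.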
 
0.6.1/interpret_the_license_analysis_report.txt · Last modified: 2008/04/01 11:36 by danger

Copyright (C) 2007-2009 Hewlett-Packard Development Company, L.P.
FOSSology Project documentation is licensed under the GNU Free Documentation License Version 1.2