I want to analyze all of our license matches to find out whether each license has invariant parts. For example, across all the “GPL v2 ref” matches, are there tokens that appear in all (or most) of the matches? Some uses for this are:
1) creating license fingerprints. These could potentially be used for faster matching.
2) a study of the evolution of software licenses. This could help narrow down the licenses that really matter and lead to a good license classification. (bobg)
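
As a rough illustration of the idea, the invariant tokens for one license could be found by counting how many matches each token occurs in and keeping those above a threshold. This is only a sketch under assumptions: the tokenizer, the function names (`tokenize`, `invariant_tokens`), the `min_fraction` parameter, and the sample texts are all hypothetical and not taken from our matching code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    This is an assumed, simplistic tokenizer used only for illustration.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def invariant_tokens(matches, min_fraction=1.0):
    """Return the tokens present in at least min_fraction of the matched texts.

    min_fraction=1.0 keeps tokens found in every match; a lower value
    keeps tokens found in "most" matches.
    """
    if not matches:
        return set()
    doc_counts = Counter()
    for text in matches:
        # Count each token once per match, however often it repeats.
        doc_counts.update(set(tokenize(text)))
    threshold = min_fraction * len(matches)
    return {tok for tok, n in doc_counts.items() if n >= threshold}


# Made-up sample texts standing in for "GPL v2 ref" matches.
gpl_v2_ref_matches = [
    "Licensed under the GNU General Public License version 2",
    "This file is released under the GNU General Public License, version 2.",
    "Distributed under the terms of the GNU General Public License v2",
]

strict = invariant_tokens(gpl_v2_ref_matches)        # tokens in all matches
loose = invariant_tokens(gpl_v2_ref_matches, 0.6)    # tokens in most matches
```

The strict set is a candidate fingerprint for the license. Note that the samples write the version as both "version 2" and "v2", so those tokens drop out of the strict set, which suggests some normalization would be needed before fingerprints are useful for matching.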