FOSSology: Advancing open source analysis and development
Mining Freshmeat

This document describes the process involved in mining Freshmeat and discusses the issues encountered when harvesting projects.

Overview of the Mining Process

Fossology mines the top 1000 projects from Freshmeat on a nightly basis. Fossology uses the term "mine" to mean that it loads the software into its repository/db and analyzes it. This is in contrast to analyzing only the XML Resource Definition File (Rdf) supplied by Freshmeat. Fossology also analyzes the Rdf file and produces statistics on it. The Rdf statistics will be on the Fossology web site in the future.

At a high level, the mining process can be described in the following steps:
Issues Encountered when Harvesting

Some issues currently keep Fossology from harvesting all of the top 1000 projects. It can harvest approximately 500 of them; ways to harvest all 1000 are under investigation.

No Compressed Archives

The first issue is that not every project uses or supplies a tar archive. Fossology tries to gather .zip archives or .bz2 or .gz tar files. If a project does not supply one of those types, there is nothing for Fossology to load. Fossology is investigating how to obtain these projects in an automated way. Support for other types of archives (e.g. RPMs) is planned for the future. See Next Steps below.

Url does not point to a Downloadable Item

The second issue is that many of the URLs that are supposed to point to downloads do not actually point to downloadable files. Instead, they often point to the project's home page, where one can find a download link. This issue has made collecting all of the top 1000 projects much harder than anticipated. Fossology is currently investigating the use of a web crawler, or some other technology, to obtain the archives behind these types of URLs.

Data Gathered

The Freshmeat Rdf itself is not stored in our repository. Some of the values from the Rdf are stored, though not in the way FLOSSmole stores them. See other Fossology documentation for descriptions of what data is analyzed by the repository/db/agents. The process described here uses the following Freshmeat Rdf data:
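Returning to the two harvesting issues described above (missing archives and URLs that lead to a home page rather than a file), the following is a minimal Python sketch of how download URLs might be screened by file suffix. It is an illustration only and not Fossology's actual code: the names ARCHIVE_SUFFIXES, archive_suffix and split_harvestable are hypothetical, and the only detail taken from this document is the set of accepted types (.zip, .bz2, .gz).

```python
from urllib.parse import urlparse

# Assumption: a simple suffix check. The document only names .zip, .bz2
# and .gz as accepted types; everything else here is illustrative.
ARCHIVE_SUFFIXES = (".zip", ".bz2", ".gz")


def archive_suffix(url):
    """Return the archive suffix found at the end of the URL path, or None."""
    # Only the path is inspected, so query strings do not affect the match.
    path = urlparse(url).path.lower()
    for suffix in ARCHIVE_SUFFIXES:
        if path.endswith(suffix):
            return suffix
    return None


def split_harvestable(urls):
    """Partition URLs into (harvestable, skipped) lists by archive suffix."""
    harvestable, skipped = [], []
    for url in urls:
        if archive_suffix(url) is not None:
            harvestable.append(url)
        else:
            # Typically a project home page or an unsupported archive type.
            skipped.append(url)
    return harvestable, skipped
```

A suffix check of this kind cannot recover an archive hidden behind a home page link; URLs that land in the skipped list are the ones a web crawler, as mentioned above, would have to follow.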
Next Steps

Fossology is seeking to improve its process in the following ways:
Programs Used

The following programs are used in the Freshmeat process.
For more detailed information on the process, see the above man pages and the Readme.

FOSSology Project documentation is licensed under the GNU Free Documentation License Version 1.2