Mining Freshmeat

This document describes the process involved in mining Freshmeat, and discusses the issues that have been encountered when harvesting projects.

Overview of the Mining Process

Fossology mines the top 1000 projects from Freshmeat on a nightly basis. Fossology uses the term "mine" to mean that it loads the software into its repository/db and analyzes it. This is in contrast to analyzing only the XML Resource Definition File (Rdf) supplied by Freshmeat. Fossology also analyzes the Rdf file and produces statistics on it. The Rdf statistics will be on the Fossology web site in the future.

The high level view of the mining process can be described in the following steps:

  1. Seed the repository/db with the top 1000 projects
  2. After the seed has been completed, on a nightly basis:
    1. Obtain the Freshmeat Resource Definition File (Rdf).
    2. Compare the previous top 1000 projects to the current top 1000.
    3. Reload any of the top 1000 projects that:
      1. have a changed latest_revision, or
      2. are new to the top 1000.
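
The nightly comparison in the steps above could be sketched roughly as follows. This is only an illustration, not Fossology's actual code: it assumes each top-1000 list has already been parsed from the Rdf into a mapping of project name to its latest_revision value, and the function name projects_to_reload is hypothetical.

```python
from typing import Dict, List


def projects_to_reload(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return the projects in the current list that need to be reloaded.

    Both arguments map a project name to its latest_revision value
    (an assumed representation of the parsed Rdf, for illustration only).
    """
    reload = []
    for name, revision in current.items():
        if name not in previous:
            # The project is new to the top 1000.
            reload.append(name)
        elif previous[name] != revision:
            # The project's latest_revision has changed since last night.
            reload.append(name)
    return sorted(reload)
```

Projects that dropped out of the top 1000 are ignored in this sketch, since the steps above only describe reloading projects that are in the current list.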

Issues Encountered when Harvesting

Several issues currently keep Fossology from harvesting all of the top 1000 projects. It can harvest approximately 500 of them; ways to harvest all 1000 are under investigation.

No Compressed Archives

The first issue Fossology encounters is that not every project supplies a compressed archive. Fossology tries to gather .zip archives or .bz2 or .gz tar files. If a project does not supply one of those types, there is nothing for Fossology to load. Fossology is investigating how to obtain these projects in an automated way. Support for other types of archives (e.g. RPMs) is planned for the future. See Next Steps below.
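
As a rough illustration of the archive check described above, the sketch below picks the first URL whose path ends in one of the three supported suffixes. It assumes a project's candidate download URLs are available as a list of strings; the names ARCHIVE_SUFFIXES and pick_archive_url are hypothetical and not part of Fossology.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse

# Suffixes named in the text; other archive types (e.g. RPMs) are not handled.
ARCHIVE_SUFFIXES = (".zip", ".bz2", ".gz")


def pick_archive_url(urls: Iterable[str]) -> Optional[str]:
    """Return the first URL that looks like a supported archive, or None."""
    for url in urls:
        # Look only at the path so query strings do not hide the suffix.
        path = urlparse(url).path.lower()
        if path.endswith(ARCHIVE_SUFFIXES):
            return url
    return None
```

A result of None corresponds to the case where there is nothing for Fossology to load.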

URL Does Not Point to a Downloadable Item

The second issue Fossology encounters is that many of the URLs that are supposed to point to downloads do not actually point to downloadable files. Instead, they often point to the project's home page, where one can find a download link. This issue has made collecting all top 1000 projects much harder than anticipated. Fossology is currently investigating the use of a web crawler, or some other technology, to obtain the archives that use these types of URLs.
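
One possible way to detect this case, shown here only as a sketch and not as the approach Fossology uses, is to ask the server for the Content-Type of the URL and treat an HTML response as a web page rather than a download. The function names is_html_page and url_is_download are hypothetical, and the check is a heuristic: some servers reject HEAD requests or report misleading content types.

```python
import urllib.request


def is_html_page(content_type: str) -> bool:
    """Return True if a Content-Type header value describes an HTML page."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def url_is_download(url: str, timeout: float = 10.0) -> bool:
    """Heuristically decide whether a URL points to a file rather than a page.

    Sends a HEAD request so the body is not downloaded. Network errors are
    not caught here and propagate to the caller.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return not is_html_page(response.headers.get("Content-Type", ""))
```

URLs for which url_is_download returns False would be the candidates for a crawler or some other follow-up step.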

Data Gathered

The Freshmeat Rdf file is not stored in the repository as a whole. Some of the values from the Rdf are stored, but not in the way FLOSSmole stores them. See other Fossology documentation for descriptions of what data is analyzed by the repository/db/agents. The process described here uses the following Freshmeat Rdf data:

Next Steps

Fossology is seeking to improve its process in the following ways:

Programs Used

The following programs are used in the Freshmeat process.

For more detailed information on the process see the above man pages and the Readme.