Mining Freshmeat

This document describes the process involved in mining [[http://freshmeat.net|Freshmeat]], and discusses the issues that have been encountered when harvesting projects.

Overview of the Mining Process

Fossology mines the top 1000 projects from Freshmeat on a nightly basis. Fossology uses the term "mine" to mean that it loads the software into its repository/db and analyzes it. This is in contrast to analyzing only the XML Resource Definition File (Rdf) supplied by Freshmeat. Fossology also analyzes the Rdf file and produces statistics on it. The Rdf statistics will be on the Fossology web site in the future.

The high level view of the mining process can be described in the following steps:

  1. Seed the repository/db with the top 1000 projects
  2. After the seed has been completed, on a nightly basis:
    1. Obtain the Freshmeat Resource Definition File (Rdf).
    2. Compare the previous top 1000 projects to the current top 1000.
    3. Reload any of the top 1000 projects that:
      1. have had a latest_revision change, or
      2. are new to the top 1000.
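
The nightly comparison in step 2 can be sketched as follows. This is a minimal illustration, not the code Fossology runs: the function name projects_to_reload and the dictionary form of the data (project short name mapped to latest_revision) are assumptions made for the example.

```python
from typing import Dict, List


def projects_to_reload(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return the short names of projects that need to be reloaded.

    Both arguments map a project short name to its latest_revision for one
    day's top 1000. This in-memory form is hypothetical; the real data
    comes from the Freshmeat Rdf.
    """
    reload = []
    for name, revision in current.items():
        if name not in previous:
            reload.append(name)   # new to the top 1000
        elif previous[name] != revision:
            reload.append(name)   # latest_revision changed
    return sorted(reload)


yesterday = {"alpha": "1.0", "beta": "2.3", "gamma": "0.9"}
today = {"alpha": "1.1", "beta": "2.3", "delta": "4.0"}
print(projects_to_reload(yesterday, today))  # prints ['alpha', 'delta']
```

Projects that dropped out of the top 1000 (gamma in the example) are simply ignored, since only the current top 1000 is reloaded.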

Issues Encountered when Harvesting

There are currently some issues that keep Fossology from harvesting all of the top 1000 projects. It can currently harvest approximately 500 of them; ways to harvest all 1000 are under investigation.

No Compressed Archives

The first issue Fossology encounters is that not every project uses or supplies a compressed archive. Fossology tries to gather either a .zip archive or a .bz2 or .gz tar file. If a project does not supply one of those types, there is nothing for Fossology to load. Fossology is investigating how to obtain these projects in an automated way.
Support for other types of archives (e.g. RPMs) is planned for the future. See Next Steps below.
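
As a rough illustration of the archive selection described above, the sketch below picks one URL from a project's candidates by file suffix. The function name pick_archive and the suffix list are hypothetical. Only the preference for bz2 comes from this document; the relative order of the other suffixes is an assumption.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse

# bz2 is tried first because of its smaller size. The order of the
# remaining suffixes is an assumption made for this sketch.
PREFERRED_SUFFIXES = (".bz2", ".gz", ".tgz", ".zip")


def pick_archive(urls: Iterable[str]) -> Optional[str]:
    """Return the most preferred archive URL, or None if none is supported."""
    candidates = [u for u in urls if u]
    for suffix in PREFERRED_SUFFIXES:
        for url in candidates:
            # Compare against the path only, so a query string is ignored.
            if urlparse(url).path.lower().endswith(suffix):
                return url
    return None


print(pick_archive([
    "http://example.org/proj-1.0.tar.gz",
    "http://example.org/proj-1.0.tar.bz2",
]))  # prints http://example.org/proj-1.0.tar.bz2
```

A project that offers only an unsupported type (for example an RPM) yields None, which corresponds to the "nothing to load" case.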

URL Does Not Point to a Downloadable Item

The second issue Fossology encounters is that many of the URLs that are supposed to point to downloads don't actually point to downloadable files. Instead, they often point to the project's home page, where one can find a download link. This issue has made collecting all of the top 1000 projects much harder than anticipated. Fossology is currently investigating the use of a web crawler, or some other technology, to obtain the archives behind these types of URLs.
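
One possible way to detect such URLs before downloading, sketched here purely as an assumption and not as something Fossology does, is to issue an HTTP HEAD request and check whether the server reports an HTML page. The function names and the set of HTML media types below are hypothetical.

```python
import urllib.request

# Media types treated as "a web page, not a file" in this sketch.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def looks_like_download(content_type: str) -> bool:
    """Heuristic: a non-empty, non-HTML Content-Type is treated as a file."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return bool(media_type) and media_type not in HTML_TYPES


def url_is_downloadable(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and apply the heuristic to the Content-Type."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return looks_like_download(response.headers.get("Content-Type", ""))
    except (OSError, ValueError):
        # Network errors, HTTP errors and malformed URLs all count as
        # "not downloadable" here.
        return False


print(looks_like_download("application/x-bzip2"))       # prints True
print(looks_like_download("text/html; charset=UTF-8"))  # prints False
```

The heuristic is deliberately crude: some servers reject HEAD requests or report misleading content types, so a crawler would still be needed for the home-page case.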

Data Gathered

The Freshmeat Rdf is not stored as a whole in our repository. Some of the values from the Rdf are stored, but not in the way that FLOSSmole stores them.
See other Fossology documentation for descriptions of what data is analyzed by the repository/db/agents. The process described here uses the following Freshmeat Rdf data:

  • Project Rank - the current Freshmeat rank for that day.
  • Project_short_name - the abbreviated name of the project.
  • Either .zip, .bz2 or .gz archives - bz2 is favored due to its smaller size.
  • Latest_revision - the value of <latest_release><latest_release_version> for each top 1000 project in the Rdf.
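
The fields listed above could be pulled out of the Rdf along the following lines. This is a sketch under stated assumptions: project_short_name, latest_release and latest_release_version are named in this document, but the enclosing project element, the rank element and the url_bz2, url_tgz and url_zip elements are assumed names, and the sample data is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample data. Element names other than project_short_name,
# latest_release and latest_release_version are assumptions.
SAMPLE_RDF = """\
<project-listing>
  <project>
    <project_short_name>alpha</project_short_name>
    <rank>1</rank>
    <url_bz2>http://example.org/alpha-1.1.tar.bz2</url_bz2>
    <url_zip>http://example.org/alpha-1.1.zip</url_zip>
    <latest_release>
      <latest_release_version>1.1</latest_release_version>
    </latest_release>
  </project>
</project-listing>
"""

URL_TAGS = ("url_bz2", "url_tgz", "url_zip")  # assumed element names


def parse_projects(rdf_text):
    """Return one dict per project with the fields used by the process."""
    root = ET.fromstring(rdf_text)
    projects = []
    for node in root.iter("project"):
        urls = [(node.findtext(tag) or "").strip() for tag in URL_TAGS]
        projects.append({
            "short_name": node.findtext("project_short_name", default="").strip(),
            "rank": node.findtext("rank", default="").strip(),
            "latest_revision": node.findtext(
                "latest_release/latest_release_version", default="").strip(),
            "archive_urls": [u for u in urls if u],  # drop missing entries
        })
    return projects


for project in parse_projects(SAMPLE_RDF):
    print(project["rank"], project["short_name"], project["latest_revision"])
# prints: 1 alpha 1.1
```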

Next Steps

Fossology is seeking to improve its process in the following ways:

  • Increase the harvest by developing or adopting methods that follow indirect links to downloadable archives.
  • Develop a method of harvesting projects that have no archives listed in the Rdf.
  • Support obtaining RPMs as an archive type.
  • Support obtaining and uploading Debian format archives.

Programs Used

The following programs are used in the Freshmeat process.

  • GetFM - Driver shell script run by cron.
  • maketop1k - extracts the XML entries in the Freshmeat Rdf for a specified number of projects; the default is 1000.
  • diffm - diffs consecutive days of the top 1000 XML files, looking for differences in latest_release and for new projects in the top 1000.
  • get-projects - given a file in the XML Rdf format, parses it and fetches as many projects as possible via wget.
  • cp2foss - loads one or more projects into the repository/db.

For more detailed information on the process, see the man pages for the programs above and the Readme.