Agents

The entire FOSSology system is a combination of agents that run in series:

  1. The unpack agent extracts files to analyze.
  2. The filter_license agent creates the bsam cache files.
  3. The license analysis agent runs the bsam algorithm on the cache files. This generates the list of detected licenses.
  4. The filter_clean agent purges the bsam cache files that are no longer needed.
  5. The engine-shell agent is a generic agent that can turn any command-line program into an agent. This is currently used by the unpack agent.
  6. The mimetype agent associates a mime-type with every file.
  7. The pkgmetagetta agent extracts meta data from files.
  8. The specagent agent is a special subset of pkgmetagetta and is designed for extracting information from RPM “.spec” files.
  9. The sqlagent agent performs generic SQL requests. This is useful if the purpose of the agent is to perform some SQL actions.
  10. The delagent agent deletes uploads, folders, or license information. It can also be used from the command-line.

Here is a brief summary of each of the ‘core’ agents and their functioning:

Unpack

The unpack agent extracts files from containers. A container is any kind of file that stores other files. For example, a ZIP file contains an archive of different files. Other types of containers include tar, ar, ISO, and rpm files. The extraction is performed by the Universal Unpacker (ununpack) program.

  Universal Unpacker, version 0.9.6, compiled Feb 7 2007 14:16:32
  Usage: ununpack [options] file [file [file...]] 
  Extracts each file.
  If filename specifies a directory, then extracts everything in it.

  Unpack Options:
  -C     :: force continue when unpack tool fails.
  -d dir :: specify alternate extraction directory.
            Default is the same directory as file.
  -m #   :: number of CPUs to use (default: 1).
  -P     :: prune files: remove links, >1 hard links, zero files, etc.
  -R     :: recursively unpack (same as '-r -1')
  -r #   :: recurse to a specified depth (0=none/default, -1=infinite)
  -X     :: remove recursive sources after unpacking.
  -x     :: remove ALL unpacked files when done (clean up).

  I/O Options:
  -L out :: Generate a log of files extracted to out.
  -F     :: Using files from the repository.
  -Q     :: Using osrb queue system. (Includes -F)
            Each source name should come from the repository.
            First 'gold' is checked, then 'files'.
            If -L is used, unpacked files are placed in 'files'.
  -T rep :: Set gold repository name to 'rep' (for testing)
  -t rep :: Set files repository name to 'rep' (for testing)
  -q     :: quiet (generate no output).
  -v     :: verbose (-vv = more verbose).

  Currently identifies and processes:
  * Compressed files: .Z .gz .bz .bz2 upx
  * Archives files: tar cpio zip jar ar rar cab
  * Data files: pdf
  * Installer files: rpm deb
  * File images: iso9660(plain/Joliet/Rock Ridge) FAT(12/16/32) ext2/ext3 NTFS
  * Boot partitions: x86, vmlinuz

Besides unpacking files, the ununpack program extracts meta data from containers. For example, ZIP and RPM files include meta data that is not unpacked during file extraction. The meta data is extracted to a “.meta” file. For example, happy.zip will create happy.zip.meta.

Implementation Notes

The unpack agent uses the command-line universal unpacker (ununpack) tool to extract files.

The unpack agent is implemented using the Engine-Shell agent wrapper around the ununpack executable. The rationale: Agents are expected to always be running. This is to cut down on spawning time. However, ununpack spawns lots of processes to unpack files. (For example, to unpack a tar file, tar is used in a system call. The same for unzip, rpm, ar, etc.) Since there is already a massive delay from spawning times, using Engine-Shell to spawn one ununpack process does not add a significant delay.

The unpack agent is implemented as a host-specific multi-SQL (MSQ) query.

The unpack agent creates “artifact” files. There are three types of artifacts: artifact.meta, artifact.dir, and artifact.unpacked.

In general, anything that does not have an explicit filename is an “artifact”.

Filter license

The bsam analysis tool uses pre-processed license files. These are tokenized versions of the original data files. The filter_license tool converts a real file into a tokenized bsam file.

For the filter_license agent:

Notes on the format of the bSAM cache file

Tue Jan 31 09:34:32 MST 2006
  
New bSAM binary data format:
All data entries are in the same format:
  type (2 bytes)
  size (2 bytes, unsigned) -- size of the data (can be zero meaning "no data")
  data (size matches size)

Basic Type categories:
Type & 0xFFF0 == 0x0000 :: File oriented
Type & 0xFFF0 == 0x0100 :: Function oriented
Type & 0xF000 == 0xF000 :: Comment

All strings should be null terminated!

The different types:
0000 EOF -- size and data are not required
0001 File name -- data is string
0002 File checksum -- data the checksum (may be a string)
0003 File license -- data string
0004 File type -- data string (e.g., "C", "Java", "Class", "Obj")
	File type is used to ensure that comparisons are only done
	between same type of data
0010 File unique value -- data string

0101 Function name -- data is string
0103 Function license -- see 03 File license
0104 Function type -- see 04 File type
0108 Function tokens -- data contains tokens, 2 bytes each!
0110 Function unique value -- data string
0118 Function tokens OR list -- 2 byte tokens, at least one of which must
	be in the comparison token list.  (None are in the comparison?
	Then skip the comparison!)
0128 Function tokens AND list -- 2 byte tokens, all must be in the comparison
	token list.
0129 Function "important" tokens (to be used for choosing between similar matches)
0131 Byte offset to start in untokenized file (length is always 4)
0132 Byte offset to end in untokenized file (length is always 4)
	Both contain 4 bytes.
	0131 and 0132 values should be reset to "undefined" each time
	0101 is seen.
0138 Byte offsets for tokens.
	Each token in the 0108 tag will be represented by 1 byte here.
	This byte is the number of bytes to skip between tokens.
	This way, matches can be calculated down to specific locations
	in the file rather than general ranges specified by 0131 - 0132.
	There is one extra byte here so the end of the last token is known.
	E.g., if the match starts at token #7, then:
	  for(i=0; i<7; i++) RealOffset += TokenValue0138[i].
0140 Single-sentence licenses (text, tokenized into space separated)
01FF End of Function (ok to start processing) -- size is always zero.

F001 File Comment
F101 Function Comment
FFFF General Comment

All unknown types are skipped (treated as comments).
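
The record layout above can be read with a short loop. The following is a minimal sketch in Python, not the bsam reader itself. It assumes big-endian 16-bit fields, because the notes do not state a byte order. The helper names (`read_records`, `category`, `token_offset`) are hypothetical. The category check uses the high byte (0xFF00) rather than the 0xFFF0 mask quoted in the notes, because types such as 0010 and 0110 would otherwise fall outside their categories; that choice is also an assumption.

```python
import struct

# Assumption: type and size are big-endian unsigned 16-bit values.
HEADER = struct.Struct(">HH")

def category(rec_type):
    """Classify a record type by its high bits (hypothetical helper)."""
    if rec_type & 0xF000 == 0xF000:
        return "comment"
    if rec_type & 0xFF00 == 0x0100:
        return "function"
    if rec_type & 0xFF00 == 0x0000:
        return "file"
    return "unknown"

def read_records(buf):
    """Yield (type, data) pairs from a bytes object until EOF (type 0000)."""
    pos = 0
    while pos + 2 <= len(buf):
        (rec_type,) = struct.unpack_from(">H", buf, pos)
        if rec_type == 0x0000:
            return  # EOF: size and data are not required
        if pos + HEADER.size > len(buf):
            raise ValueError("truncated record header")
        _, size = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        yield rec_type, buf[pos:pos + size]
        pos += size

def token_offset(skips, token_index):
    """Sum the first token_index skip bytes of a 0138 record."""
    return sum(skips[:token_index])
```

A caller that follows the "unknown types are skipped" rule would simply ignore any record whose type it does not recognize. Whether the 0138 sum is relative to the 0131 start offset or to the start of the file is not stated in the notes, so `token_offset` returns only the raw sum.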

=====================================================================
Wed Feb  1 12:48:49 MST 2006

With the new data format, I don't need to separate .o, .java, .c, and .class
files.  I can store them all in one directory.
BUT: Different file formats need different bsam parameters.
(The number of similar tokens varies based on the language.)
For this reason, I am still keeping them separate.

The license analysis Agent

This agent runs the command-line bsam program for comparing bsam cache files. The bsam program compares two cache files against each other and identifies similarities between them. For license analysis, the first file is the cache file created by filter_license. The second file is /usr/local/share/fossology/agents/License.bsam. This file is a bsam cache containing all of the tokenized known-licenses.

Scheduler Type

The bsam-engine runs as a host-specific MSQ agent. The rationale:

Command-Line Parameters

The License Agent uses a stand-alone program called “bsam-engine”. A sample scheduler configuration file entry:

agent=license host=fawkes | /usr/bin/ssh fossy@fawkes "/usr/local/fossology/agents/bsam-engine
         -L 20 -A 0 -B 60 -G 10 -M 2 -E -T license -O n -- -
         /usr/local/share/fossology/agents/License.bsam"

This entry specifies the following parameters to bsam-engine:

Input Values

The stdin stream from the scheduler can contain any of four values:

For license analysis, only A and Akey are specified. The B and Bkey are provided as support for future agents.

Each line can contain A, Akey, B, and Bkey values. After the line is read, the license values are generated.

Output Values

When using “-O n”, licenses are stored in the agent_lic_meta table. The agent_lic_cache table is updated so the processed column is marked as true.

The processed column is required since a file may be processed and not generate any licenses.

Performance

The bSAM engine is fast for what it does, but it is very slow compared to other agents.

Filter clean

After bsam processes the license files, the tokenized bsam cache files are no longer needed. The filter_clean agent removes the unnecessary token files.

Implementation Details

For the filter_license agent:

Usage

The program filter_clean can perform many tasks:

  Usage: filter_clean [options] [projects]
  For each processed file, remove the cache.
  Options
  -i :: Initialize the DB, then exit.
  -L :: List project IDs.
  -s :: operate via the scheduler.  Stdin contains each record to process.
  -S :: operate via the scheduler.  Stdin contains each project ID.
  -v :: Verbose (-vv for more verbose)
  -T :: TEST -- do not update the DB or delete any files (just pretend)
  You can also list one or more project IDs for processing.
  Project ID of '-1' will process all projects.

When used as an agent, filter_clean can either take a specific record to process (-s), or a project ID (-S). If the input is a project ID, then every record associated with the project is cleaned.

The filter_clean program contains many debugging options:

$ filter_clean -L | tail
6796810 :: pmccabe-2.1-2.i386.rpm
6796835 :: 15.238.5.206:jboss-4.0.2-src.tar.gz
6798336 :: 15.238.5.206:jadls158.zip
6864012 :: 15.238.5.206:hvdistrib.zip
6865408 :: http://devoss/~nealkrawetz/HPSIM-Linux_C.05.00.02.00.bin
6865409 :: stress the artifact part of the GUI
6865420 :: test continer/pfile reuse algos
6865428 :: 15.238.5.206:jakarta-tomcat-5.5.9-src.tar.gz
6868810 :: Firefox 1.0.7 test submitted by Neal
6870424 :: ?description

Engine-Shell

The engine-shell is a generic agent wrapper for command-line applications. Although spawning new applications for each database record is not efficient, it is frequently convenient. The engine-shell is designed for rapid prototyping and for long-term use by infrequently spawned agents.

For example, the unpack agent runs the command-line program “ununpack”. Since this is only called once per upload, there is no significant impact from spawning the ununpack program as needed. Thus, there is no need for an ununpack-specific agent; ununpack can be started by using engine-shell instead.

Developers should consider not using engine-shell if the application is spawned thousands of times in rapid succession since the spawn times will create large processing delays.

The engine-shell converts all database parameters to shell environment variables and handles communications with the scheduler.

Usage

Usage: /usr/local/fossology/agents/engine-shell agent_name command < args
The agent_name is a string assigned to the engine-shell. It can be used as a parameter to the command.

The command-line itself takes a series of parameters that are expanded each time the process is spawned:

%{%}  = percent sign
%{P}  = PID (process ID) of the engine-shell!
%{PP} = PPID (parent process ID) of the engine-shell!
%{U}  = Unique string assigned by the engine-shell!
%{A}  = Agent name assigned to the engine-shell.
%{1}  = the first arg from scheduler  (there is no %{0})
%{2}  = 2nd arg from scheduler
%{1000} = 1000th arg from the scheduler (no real limit)
%{*}  = all args from the scheduler

For example:

  agent=unpack host=localhost | /usr/local/fossology/agents/engine-shell unpack
     '/usr/local/fossology/agents/ununpack -d /srv/fossology/repository/ununpack/%{U} -qRCQx'

In this case, the directory for ununpack (-d) expands to “/srv/fossology/repository/ununpack/12ab45” where 12ab45 is a unique string. The unique string is guaranteed to be unique among all currently running scheduler processes; however, it could have been used by a previous process and may be used by a future process when this process completes. It is unique only as long as this process is running.

The %{U}, %{P}, and %{PP} parameters are intended to be used by command-lines that require a unique identifier. For example, if a temporary file is needed, then these parameters can form the temporary file name.
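
The parameter expansion can be illustrated with a small Python function. This is a sketch of the substitution rules listed above, not the engine-shell source. The function name `expand`, joining `%{*}` with single spaces, and expanding unknown or out-of-range parameters to an empty string are all assumptions.

```python
import os
import re

_PARAM = re.compile(r"%\{([^}]*)\}")

def expand(template, agent_name, args, unique):
    """Expand %{...} parameters in an engine-shell style command line."""
    def replace(match):
        key = match.group(1)
        if key == "%":
            return "%"
        if key == "P":
            return str(os.getpid())
        if key == "PP":
            return str(os.getppid())
        if key == "U":
            return unique
        if key == "A":
            return agent_name
        if key == "*":
            return " ".join(args)  # assumption: args joined by spaces
        if re.fullmatch(r"[0-9]+", key) and 1 <= int(key) <= len(args):
            return args[int(key) - 1]  # %{1} is the first arg; there is no %{0}
        return ""  # assumption: anything else expands to nothing
    return _PARAM.sub(replace, template)
```

For example, `expand("ununpack -d /tmp/%{U} -qRCQx", "unpack", [], "12ab45")` substitutes the unique string into the directory path, mirroring the unpack entry shown earlier.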

Normally the scheduler sends “field=value” pairs to agents for processing. The engine-shell converts these to environment variables before spawning the command. For example, if the scheduler sends “a=12345 b=yes” then the environment variable ARG_a becomes “12345” and ARG_b becomes “yes”. The name of the variable comes from the field (“field=” becomes “ARG_field”) and the variable is assigned the value.

To make this very clear: if the scheduler needs to call engine-shell 500 times with different parameters, then there will be 500 iterations of (1) set the environment variables and (2) spawn the command line to process the request.
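
The conversion from “field=value” pairs to ARG_ variables might look like the following Python sketch. The function name `scheduler_line_to_env` is hypothetical, and using shell-style quoting rules (via `shlex`) to split the line is an assumption about the scheduler's input format.

```python
import os
import shlex

def scheduler_line_to_env(line, base_env=None):
    """Return a copy of the environment with ARG_<field> set for each pair."""
    env = dict(os.environ if base_env is None else base_env)
    for token in shlex.split(line):  # assumption: shell-style quoting
        field, sep, value = token.partition("=")
        if sep and field:
            env["ARG_" + field] = value
    return env
```

Each scheduler request would produce a fresh environment this way before the command is spawned, which matches the two-step iteration described above.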

Return Codes

If the command returns a 0 return code, then engine-shell assumes the command succeeded. A non-zero return code indicates a command-failure.

The command-line should not generate any unnecessary output. All output is logged by the scheduler as debug statements.
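
As an illustration of the return-code convention, the Python sketch below spawns one command, treats a zero exit status as success, and captures all output so a caller could log it. The function name `run_command` is hypothetical, and merging stderr into stdout is an assumption.

```python
import subprocess
import sys

def run_command(argv, env=None):
    """Spawn one command; return (succeeded, captured_output)."""
    result = subprocess.run(
        argv,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # assumption: both streams are logged together
    )
    return result.returncode == 0, result.stdout.decode(errors="replace")
```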

Mimetype Agent

Every file has a mimetype. Some may be “text/plain” or “image/jpeg”, and others may be specific to programs like “application/pdf”. The mimetype agent is passed a file to analyze and sets the mime-type value in the pfile table. (It also populates the database “mimetype” table as needed.)

This agent creates the “official” mimetype used by the database.

Mimetypes are determined in many ways.

  1. The unpack agent sets mimetypes for containers. Since it could unpack the file, the mimetype must be accurate.
  2. If there is no mimetype set, then it uses magic(5) to identify the mimetype. Magic(5) is usually right, but not always right. In particular, a binary file may falsely match a known magic type.
  3. Magic(5) has two default values for when a file is not matched: text/plain for text files, and application/octet-stream for binary files. In these cases, the file extension (from the ufile database table) is compared with the known suffixes in /etc/mime.types. If there is a match, then the matched mimetype is used. This could also lead to errors. For example, if a text file ends with “.c”, it will be labeled as “text/x-csrc” even if it contains C headers. (This is common in the kernel source code – many “.c” files are used by #include statements.) Similarly, a text file that happens to end with “.png” will be called “image/png” instead of “text/plain”.
  4. If magic(5) returned a default mime type and the file ends with “.spec”, then it is assigned the mimetype “application/x-rpm-spec” (for use with the specagent system).
  5. If the file extension does not exist or is unknown, then the first 100 bytes are checked for non-printable characters. This results in a default value of “text/plain”, “application/octet-stream”, or “application/x-empty” for a zero-length file.
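
The decision order in the list above can be sketched in Python as follows. This is an illustration, not the agent's code. The function names are hypothetical, `suffix_map` stands in for the contents of /etc/mime.types, `magic_guess` stands in for the result of magic(5), and treating ASCII 32 to 126 plus tab, LF, and CR as “printable” is an assumption.

```python
import os

MAGIC_DEFAULTS = {"text/plain", "application/octet-stream"}

def looks_like_text(head):
    """Assumption: printable means ASCII 32..126 plus tab, LF, and CR."""
    return all(32 <= b <= 126 or b in (9, 10, 13) for b in head)

def choose_mimetype(existing, magic_guess, filename, data, suffix_map):
    # 1. A mimetype already set (for example, by the unpack agent) is kept.
    if existing:
        return existing
    # 2. A specific (non-default) magic result is used as-is.
    if magic_guess and magic_guess not in MAGIC_DEFAULTS:
        return magic_guess
    # 3. Otherwise, look up the file extension among the known suffixes.
    ext = os.path.splitext(filename)[1].lstrip(".")
    if ext in suffix_map:
        return suffix_map[ext]
    # 4. A ".spec" suffix maps to the RPM spec mimetype.
    if ext == "spec":
        return "application/x-rpm-spec"
    # 5. Fall back to inspecting the first 100 bytes.
    if not data:
        return "application/x-empty"
    if looks_like_text(data[:100]):
        return "text/plain"
    return "application/octet-stream"
```

Calling `choose_mimetype(None, "text/plain", "main.c", b"int x;", {"c": "text/x-csrc"})` reproduces the “.c” case from step 3: the suffix match overrides the default magic result.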

Because magic(5), /etc/mime.types, and even the “.spec suffix” match may have false mime associations, the output from the mimetype agent is likely correct, but may contain some errors. Applications that process files based on the mimetype must be careful and check that the file is really the specified mimetype. Programs should not crash if the file contents do not match the mimetype.

PkgMetaGetta Agent

Many file formats, such as DEB and RPM, contain meta data. While the unpack agent extracts meta data to “artifact.meta” files, this is not always complete or useful for applications. The pkgmetagetta agent (pronounced “Package Meta Getta”) extracts meta information from files and stores the information as attributes in the database attrib table.

The extraction is performed using “libextractor”. This library identifies hundreds of different meta types from dozens of different file formats. The extracted information is extremely useful for license analysis. For example, an RPM may have a “license” meta header that says “GPL”, while the license analysis agent may identify GPLv2, MIT, OSL, and many other licenses within the package.

While libextractor is useful, there are some limitations:

Because of these limitations, applications that process based on meta data must be aware that not every meta header may exist for a file, multiple headers may exist, and the header’s value may be inaccurate.

SpecAgent

RPM files come in two forms. There are the actual, packed RPM (file.rpm) files, and there is the source form, before the RPM file is packed. While pkgmetagetta can process meta information from a packed RPM file, it cannot extract information from an unpacked/source RPM directory. The specagent is designed for this situation.

Given a file of mimetype “application/x-rpm-spec” (as determined by the mimetype agent), the specagent extracts the meta data and adds it as attributes to the associated pfile.

SQLAgent

The SQLAgent takes a single line containing an SQL query and runs the SQL. The single line may contain multiple SQL statements, separated by a “;”.

Usage: sqlagent [options]
  -i        :: initialize the database, then exit.
  -a arg    :: Expect SQL in parameter 'arg='.
  no file   :: process data from the scheduler.

The “-i” option initializes the database for use with the agent. (All agents have this option.)

The scheduler can run agents in two modes: any-host and MSQ. (See the scheduler documentation for details.) With the “any host” configuration, each line of input to the SQLAgent is an SQL command to perform. No output is generated by the agent. However, if the SQL command fails, then the agent identifies the failure.

In the MSQ mode, use “-a” to specify the SQL column that contains the SQL query. For example, “sqlagent -a go” can be used with the input line:

a="anything" b="ignored" go="select * from ..."
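
To illustrate how the “-a” column name selects the query, here is a small Python sketch that pulls one named field out of an input line like the one above. The function name `extract_sql` is hypothetical, and parsing the line with shell-style quoting rules is an assumption about the input format.

```python
import shlex

def extract_sql(line, arg_name):
    """Return the value of arg_name from a field="value" line, or None."""
    for token in shlex.split(line):  # assumption: shell-style quoting
        field, sep, value = token.partition("=")
        if sep and field == arg_name:
            return value
    return None
```

With the sample line, `extract_sql(line, "go")` returns the SQL text, while the `a` and `b` fields are ignored.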

DelAgent

The DelAgent is used to delete uploads, folders, or license information from the database and repository.

Usage: delagent [options]
  List or delete uploads.
  Options
  -i   :: Initialize the DB, then exit.
  -u   :: List uploads IDs.
  -U # :: Delete upload ID.
  -l   :: List uploads IDs. (same as -u, but goes with -L)
  -L # :: Delete ALL licenses associated with upload ID.
  -f   :: List folder IDs.
  -F # :: Delete folder ID and all uploads under this folder.
          Folder '1' is the default folder.  '-F 1' will delete
          every upload and folder in the navigation tree.
  -s   :: Run from the scheduler.
  -T   :: TEST -- do not update the DB or delete any files (just pretend)
  -v   :: Verbose (-vv for more verbose)

The DelAgent can be used from the command-line or from the scheduler.

Command-Line Usage

The basic command-line uses lowercase options to list items that can be processed, and capitals to actually perform the processing. For example, to list the available folders, use “-f”. This will generate output that matches the UI’s left-hand tree. Each folder is enumerated. For example:

# Folders
     63 :: AUrls (Parent folder AUrls)
        65 :: a-c
           -- :: Contains: Bash
        -- :: Contains: ubuntu
        64 :: v-z
           -- :: Contains: WebSuck
        66 :: WebSuck
           67 :: v-z
              -- :: Contains: WebSuck
     75 :: baz (Parent folder baz)
        -- :: Contains: foo
     77 :: Bobg (BobG test folder)
        -- :: Contains: rats-2.1.tar.gz

Each number indicates a folder ID. The indentation matches the tree layout, and uploads contained in the folders are also displayed. Using ‘-F’, you can delete a specific folder. This will remove the folder, all subfolders, and all uploads contained in the folders.

Similar to -f and -F, ‘-u’ lists all available uploads with their IDs. These IDs can be used with -U to delete a specific upload.

The ‘-l’ option works like -u, listing all uploads. However, -L (followed by an upload ID) will only delete license information associated with the upload. This is useful for resetting a license analysis.

The DelAgent is intelligent about file reuse. For example, -F and -U delete projects. However, files (technically pfiles) in those projects may also be used by other projects. When using -F and -U, pfile information (and associated license and repository data) is not deleted unless the pfile is no longer used by any uploads.
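
The reuse rule can be modeled in a few lines of Python. This is a sketch over a hypothetical in-memory mapping from upload ID to pfile IDs, not the DelAgent's database logic; the function name `delete_upload` is an assumption.

```python
def delete_upload(upload_id, upload_pfiles):
    """Remove an upload; return the pfiles that no other upload still uses.

    upload_pfiles maps upload ID -> set of pfile IDs (hypothetical model).
    """
    removed = upload_pfiles.pop(upload_id, set())
    still_used = set().union(*upload_pfiles.values())
    return removed - still_used  # only these are safe to delete
```

If uploads 1 and 2 share pfile “b”, deleting upload 1 returns only the pfiles unique to it, and “b” survives until upload 2 is also deleted.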

Finally, some notes:

Scheduler Usage

When using the “-s” command-line option, the DelAgent will assume it is running from the scheduler. This is an “any host” agent since it primarily accesses the database. However, it requires write-access to the entire repository (for deleting unused repository entries).

The input from the scheduler requires three arguments: action, target, and id. The action can either be “LIST” (for debugging – similar to the lowercase command-line options), or “DELETE”. The target is either “UPLOAD”, “LICENSE”, or “FOLDER”. The ID specifies the item to delete (it is optional and ignored for the “LIST” action).

For example, to delete folder 3, the value of the jobqueue jq_args should be “DELETE FOLDER 3”. This is equivalent to the command-line “delagent -F 3”.
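
The correspondence between scheduler arguments and command-line options can be written as a lookup. The Python sketch below is illustrative only: the function name `to_command_line` is hypothetical, and the flag mapping is taken from the usage text above.

```python
LIST_FLAGS = {"UPLOAD": "-u", "LICENSE": "-l", "FOLDER": "-f"}
DELETE_FLAGS = {"UPLOAD": "-U", "LICENSE": "-L", "FOLDER": "-F"}

def to_command_line(jq_args):
    """Translate 'action target [id]' into the equivalent delagent argv."""
    parts = jq_args.split()
    if len(parts) < 2:
        raise ValueError("expected: action target [id]")
    action, target = parts[0], parts[1]
    if target not in LIST_FLAGS:
        raise ValueError("unknown target: " + target)
    if action == "LIST":
        return ["delagent", LIST_FLAGS[target]]  # id is ignored for LIST
    if action == "DELETE":
        if len(parts) < 3:
            raise ValueError("DELETE requires an id")
        return ["delagent", DELETE_FLAGS[target], parts[2]]
    raise ValueError("unknown action: " + action)
```

For instance, `to_command_line("DELETE FOLDER 3")` yields `["delagent", "-F", "3"]`.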