Repository

The file repository is used to store the actual files loaded into the FOSSology system. While the Database stores meta information about files, the Repository holds the actual files.

Although SHA-1 and MD5 hashes are very unlikely to repeat, a hash collision is still possible. The working belief is that, while the full triplet (SHA-1, MD5, and file length, as in sha1.md5.length) could in principle collide, this is extremely unlikely.
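As an illustration of the triplet naming, the following sketch checks whether a name has the shape sha1.md5.length. The helper name is_repo_triplet is hypothetical and not part of librep; the field widths (40 hex digits for SHA-1, 32 for MD5, then a decimal length) are assumptions taken from the example filenames on this page.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if s[0..n) consists only of hexadecimal digits, else 0. */
static int is_hex_run(const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!isxdigit((unsigned char)s[i]))
            return 0;
    }
    return 1;
}

/* Hypothetical helper, not part of librep: check that a name has the
 * shape <40 hex sha1>.<32 hex md5>.<decimal length>. Returns 1 if so. */
int is_repo_triplet(const char *name)
{
    const size_t sha1_len = 40, md5_len = 32;
    size_t total = strlen(name);

    /* Need both hashes, two dots, and at least one length digit. */
    if (total < sha1_len + 1 + md5_len + 1 + 1)
        return 0;
    if (!is_hex_run(name, sha1_len) || name[sha1_len] != '.')
        return 0;

    const char *md5 = name + sha1_len + 1;
    if (!is_hex_run(md5, md5_len) || md5[md5_len] != '.')
        return 0;

    /* The remainder must be all decimal digits. */
    for (const char *p = md5 + md5_len + 1; *p != '\0'; p++) {
        if (!isdigit((unsigned char)*p))
            return 0;
    }
    return 1;
}
```

This only checks the shape of a name. It does not say whether the repository itself rejects other names.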

Some notes about filenames:

Directory

Since the repository can store hundreds of thousands of files, we want a quick way to organize the contents. The selected method is based on octets. For example, the file

ffe1cd8dd6b0b4c031262402ab0375ee876b17cb.732fe0681bc974f1075c4bee147c91f8.4232 

is stored in the directory "/ff/e1/cd/".

NOTE: If the filename is shorter than the number of characters needed for the path, then the path is padded with underscores. The filename “abcde” would be stored as “ab/cd/e_/abcde”.
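The octet layout and the underscore padding can be sketched as follows. This is a minimal illustration, not the librep implementation; the function name octet_path and its parameters are hypothetical, and the depth is passed in explicitly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the librep implementation: write
 * "xx/yy/.../<filename>" into out, using `depth` two-character
 * directory levels taken from the start of the filename. Missing
 * characters are padded with '_'. Returns 0 on success, -1 if depth
 * is negative or the output buffer is too small. */
int octet_path(const char *filename, int depth, char *out, size_t outsz)
{
    size_t name_len = strlen(filename);
    size_t pos = 0;

    if (depth < 0)
        return -1;
    /* Each level needs "xx/" (3 bytes), then the name and its NUL. */
    if (outsz < (size_t)depth * 3 + name_len + 1)
        return -1;

    for (int level = 0; level < depth; level++) {
        for (int k = 0; k < 2; k++) {
            size_t idx = (size_t)level * 2 + (size_t)k;
            out[pos++] = (idx < name_len) ? filename[idx] : '_';
        }
        out[pos++] = '/';
    }
    memcpy(out + pos, filename, name_len + 1); /* copies the NUL too */
    return 0;
}
```

With a depth of 3, the filename "abcde" yields "ab/cd/e_/abcde", matching the note above.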

Types

The repository must store many different types of files. The type of file determines the contents and the tool that uses it. Some example types:

The type of file is prepended to the directory tree. Thus, the example file could be found under "/files/ff/e1/cd/".

The different types are not static – new types can be created at any time. (The type is specified when using the tool – see below.)

Some notes about types:

Hosts

In order to load-balance storage and processing, the files in the Repository can be distributed across NFS-mounted hosts. A host configuration file specifies which host actually stores which files.

The hostname is prepended to the path, so there only needs to be one mount point per host. For example, "/host1/files/ff/e1/cd/".

Note: If there is no host configuration file entry, then no hostname is prepended to the path.
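The way the hostname and type are prepended can be sketched as below. The helper repo_dir is hypothetical (not a librep function) and takes the octet directories as an already-built string such as "ff/e1/cd".

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: build "/<host>/<type>/<octets>/" or, when host
 * is NULL or empty, "/<type>/<octets>/". Returns 0 on success, -1 if
 * the result does not fit in out. */
int repo_dir(const char *host, const char *type, const char *octets,
             char *out, size_t outsz)
{
    int n;

    if (host != NULL && host[0] != '\0')
        n = snprintf(out, outsz, "/%s/%s/%s/", host, type, octets);
    else
        n = snprintf(out, outsz, "/%s/%s/", type, octets);

    /* snprintf reports the length it wanted; treat truncation as error. */
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

For host "host1" and type "files" this produces "/host1/files/ff/e1/cd/"; with no host it produces "/files/ff/e1/cd/".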

Configuration

The directory for storing the repository configuration files is /srv/fossology/repository/. If this does not exist, then "." is used. This can be changed (for testing) by setting the environment variable “REPCONF” to the path of the configuration directory.

The repository configuration directory should contain 3 files:

For example:

host1 test 00 7f
host2 test 80 af
host3 test b000 b080
host1 gold 00 7f
host2 gold 80 ff
host4 * 00 ff

In this example, the 'test' file (file type = “test”) “b081cd8dd6b0b4c031262402ab0375ee876b17cb.732fe0681bc974f1075c4bee147c91f8.4232” would be stored on host4, but the same filename would be on host2's 'gold' repository.
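The lookup in this example can be sketched as follows. This is not the librep implementation: the names host_rule, example_rules, and find_host are hypothetical, and the sketch assumes that rules are tried in file order with the first match winning, that both bounds are inclusive, and that each bound is compared against an equally long lowercase-hex prefix of the filename.

```c
#include <stddef.h>
#include <string.h>

/* One line of the host configuration: host, type, start, end. */
struct host_rule {
    const char *host;
    const char *type;   /* "*" matches any type */
    const char *start;  /* inclusive lower bound on the filename prefix */
    const char *end;    /* inclusive upper bound on the filename prefix */
};

/* The example configuration from the text above. */
static const struct host_rule example_rules[] = {
    { "host1", "test", "00",   "7f"   },
    { "host2", "test", "80",   "af"   },
    { "host3", "test", "b000", "b080" },
    { "host1", "gold", "00",   "7f"   },
    { "host2", "gold", "80",   "ff"   },
    { "host4", "*",    "00",   "ff"   },
};

/* Hypothetical lookup (assumes first match wins, see lead-in).
 * Returns the host name, or NULL if no rule matches. */
const char *find_host(const struct host_rule *rules, size_t count,
                      const char *type, const char *filename)
{
    for (size_t i = 0; i < count; i++) {
        const struct host_rule *r = &rules[i];

        if (strcmp(r->type, "*") != 0 && strcmp(r->type, type) != 0)
            continue;                       /* wrong file type */
        if (strncmp(filename, r->start, strlen(r->start)) < 0)
            continue;                       /* below the range */
        if (strncmp(filename, r->end, strlen(r->end)) > 0)
            continue;                       /* above the range */
        return r->host;
    }
    return NULL;
}
```

Under these assumptions, the filename beginning "b081" has type "test" fall through to host4 (it is above "b080"), while type "gold" matches the 80 to ff range on host2.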

Some notes about using the Hosts.conf configuration file:

Repository Tools

The following command-line tools exist for managing the repository:

rephost type sha1.md5.length

This displays the hostname where the file would be found or stored. This is used for optimizing processing by running a process on a local host rather than accessing files remotely. If no hostname is found, then localhost is returned.

Note: This does not check if the file exists. It only says where the file could be found.

reppath type sha1.md5.length

This tool displays the path to the file (for reading, writing, or debugging).

Note: This does not check if the file exists, or even if the directories are valid. It only says where the file could be found.

repexist type sha1.md5.length

Determine if the file exists in the repository. This is for use in shell scripts: returns “0” for yes, “1” for no.

repcat type sha1.md5.length

If the file exists, cat the contents to stdout.

repwrite type sha1.md5.length < input

Creates a file in the repository.

repcopyin type source sha1.md5.length

echo 'source sha1.md5.length' | repcopyin type

cat 'XML from ununpack' | repcopyin type XML

Bulk-populates the repository. There are three usage options, corresponding to the three invocations above.

All files are inserted into the repository, except that a file that already exists is not copied in again. (This is a speed optimization.)

The program displays the total number of files imported, duplicated (not imported), and errors (failed to import).

Repository Library

The repository is managed by a C library: librep.a and librep.h. This library contains the following common functions:

REPCONF environment variable

The environment variable REPCONF specifies the configuration directory for the repository. If this is not set, then /srv/fossology/repository/ is used. (And if that does not exist, then the current directory (".") is used.)
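The fallback order described here can be sketched as below. The helper pick_repconf is hypothetical and not part of librep; treating an empty value the same as an unset variable is an assumption of this sketch.

```c
#include <stddef.h>
#include <sys/stat.h>

#define DEFAULT_REPCONF "/srv/fossology/repository/"

/* Hypothetical helper mirroring the lookup order described above.
 * env_value is the result of getenv("REPCONF") and may be NULL.
 * Order: the environment value if set, else the default directory if
 * it exists, else ".". The returned pointer must not be freed. */
const char *pick_repconf(const char *env_value)
{
    struct stat st;

    /* Assumption: an empty REPCONF is treated like an unset one. */
    if (env_value != NULL && env_value[0] != '\0')
        return env_value;
    if (stat(DEFAULT_REPCONF, &st) == 0 && S_ISDIR(st.st_mode))
        return DEFAULT_REPCONF;
    return ".";
}
```

A caller would typically pass the environment directly, for example pick_repconf(getenv("REPCONF")) with stdlib.h included.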

int RepOpen ();

Since the repository configuration files may be accessed by every function call, we don't want to call fopen/fclose millions of times. This opens and sets up global variables. You should call this first – but if you forget, then it is called anyway by all of the other repository functions. Returns 1 if it is configured, 0 if configuration failed.

void RepClose ();

This closes all global variables. It is proper to call this when you are done, but if you forget… shared memory will not be lost.

NOTE: If you want to refresh the configuration, then call: RepClose(); RepOpen();

char * RepMkPath (char *Type, char *Filename);

Allocate a string containing a path for the type and file. Returns a string, or NULL if the type/filename is invalid (or an allocation error occurs).

The depth of the path is determined by the value in $REPCONF/Depth.conf. If this file does not exist, then the default is “2”. The caller is responsible for calling free().

char * RepGetRepPath ();

Allocate a string containing the path to the top of the repository. Returns NULL if an error occurred. The caller is responsible for calling free().

char * RepGetHost (char *Path, char *Type, char *Filename);

Allocate a string containing the hostname where the file is stored. The hostname is determined from the $REPCONF/Hosts.conf file. Returns a string if the hostname was found. Returns NULL if there is no hostname OR if an error occurred. The caller is responsible for calling free().

int RepExist (char *Type, char *Filename);

Determines if the type+file exists in the repository. Returns 1 if it exists. Returns 0 if it does not exist. Returns -1 if an error occurred.

int RepHostExist (char *Type, char *Host);

Determines if the type+hostname exists in the repository. This is useful for determining if this particular host stores any files of the given type. Returns 1 if it exists. Returns 0 if it does not exist. Returns -1 if an error occurred.

int RepRemove (char *Type, char *Filename);

Remove a file from the repository. Returns the result from unlink() – 0 on success. If there is an error, then a non-zero value is returned.

FILE * RepFread (char *Type, char *Filename);

This is a replacement for fopen(filename,"rb"). It returns a FILE pointer to the type+filename, or NULL on error. The caller should run RepFclose() when they are finished.

FILE * RepFwrite (char *Type, char *Filename);

This is a replacement for fopen(filename,"wb"). This function will also create the repository's directory if it is needed. It returns a FILE pointer to the type+filename, or NULL on error. The caller should run RepFclose() when they are finished.

int RepFclose (FILE *F);

This is a replacement for fclose(FilePointer). This returns the value from fclose().

int RepImport (char *Source, char *Type, char *Filename, int HardLink);

This is a really fast file copy. If HardLink is set (not zero), then it will use a hard link before trying a regular file copy (making it REALLY fast). The contents from Source are copied into the repository. This returns 0 on success, non-zero on failure.
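The "hard link first, then copy" idea can be sketched with POSIX calls as below. This is an illustration of the technique only, not the librep implementation: import_file and copy_file are hypothetical names, and the sketch takes a full destination path instead of a type and repository filename.

```c
#include <stdio.h>
#include <unistd.h>

/* Plain byte-for-byte copy. Returns 0 on success, -1 on failure. */
static int copy_file(const char *src, const char *dst)
{
    char buf[65536];
    size_t n;
    int ok = 1;
    FILE *in = fopen(src, "rb");
    if (in == NULL)
        return -1;
    FILE *out = fopen(dst, "wb");
    if (out == NULL) {
        fclose(in);
        return -1;
    }

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            ok = 0;
            break;
        }
    }
    if (ferror(in))
        ok = 0;
    fclose(in);
    if (fclose(out) != 0)   /* a failed flush also counts as failure */
        ok = 0;
    return ok ? 0 : -1;
}

/* Hypothetical helper: when hard_link is non-zero, try link() first
 * and fall back to a regular copy if it fails (for example, across
 * filesystems). Returns 0 on success, non-zero on failure. */
int import_file(const char *src, const char *dst, int hard_link)
{
    if (hard_link && link(src, dst) == 0)
        return 0;
    return copy_file(src, dst);
}
```

A hard link only adds a directory entry, which is why it is much faster than copying the data. It works only when source and destination are on the same filesystem, hence the fallback.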

RepMmapStruct * RepMmap (char *Type, char *Filename);

This is a replacement for mmap(). The file is opened for read-only access! Do not use this function to create a new file. It allocates and returns a structure containing the mmap handle:

struct RepMmapStruct
  {
  int FileHandle; /* handle from open() */
  unsigned char *Mmap; /* memory pointer from mmap */
  int MmapSize; /* size of mmap */
  };
typedef struct RepMmapStruct RepMmapStruct;

The caller must call RepMunmap() to free the structure.

RepMmapStruct * RepMmapFile (char *Filename);

Similar to RepMmap(), but takes a full filename as a parameter rather than a repository entry. (Technically, this is used by the RepMmap() function.)

void RepMunmap (RepMmapStruct *M);

This un-mmaps and deallocates the RepMmapStruct variable created by RepMmap() and RepMmapFile().