====== FOSSology: Multi System Setup ======
===== FOSSology: How To Install Multiple Hosts =====
Note: Hosts.conf, Proxy.conf, and scheduler.conf must be created or modified after running make install and before running postinstall.
===== FOSSology: How To Configure Multiple Hosts =====
The scheduler and repository are designed so they can be distributed across multiple hosts. There are two main reasons for doing this:
- I/O bottlenecks. Some of the agents (e.g., the unpack agent) are I/O intensive. As a result, running two in parallel on the same host will likely be much slower than running the same two in serial (one at a time) on that host. If you want to run two unpack agents in parallel, you should run them on different systems (or on one mega system that has multiple I/O channels to relieve the bottleneck).
- CPU bottlenecks. The slowest agents are the analyzers, such as the license analyzer. While these can easily be run in parallel, each consumes a full CPU. Running three analyzers in parallel on a system with one CPU will not speed anything up. However, if you have multiple CPUs spread across several computers, then you can put all of them to use.
The ideal configuration distributes the repository across hosts and runs agents on those hosts. This way, the data used by the agents is local rather than transferred over the network.
===== Part 1: The Repository =====
The repository (repo) is just a directory on the file system. The directory's location is defined in the RepPath.conf file (default: /usr/local/share/fossology/repository/RepPath.conf). This document refers to the repository path as $Repo.
The layout of the repository is as follows:
$Repo/host/##/##/##/files
Where "host" is just a string (not required to be a hostname) and each "##" is a two-digit hexadecimal number. For example:
$Repo/localhost/01/e4/2f/01e42f923c85.txt
The Hosts.conf file (default: /usr/local/share/fossology/repository/Hosts.conf) identifies the name of the host and the directories under it. For example:
  sirius * 00 7f
  buckbeak * 80 ff
This will create two directories:
- $Repo/sirius/ holds the subdirectories in the range 00 to 7f.
- $Repo/buckbeak/ holds the subdirectories in the range 80 to ff.
Now you can use $Repo/sirius/ and $Repo/buckbeak/ as mount points for remote file systems. The ranges 00-7f and 80-ff should split the repository roughly in half. The subdirectories are named after the SHA1 checksum of the files, and checksum values are spread evenly, so the two halves should be close in size even if they are not exactly equal.
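To make the mapping concrete, here is a minimal sketch of how a checksum could be matched to a host. It assumes each Hosts.conf line has the four whitespace-separated fields seen in the example (name, a second field shown as "*", start, end) and that the first two hex digits of the checksum select the range. The function names are made up for this illustration and are not part of FOSSology.

```python
def parse_hosts_conf(text):
    """Return a list of (host, low, high) tuples from Hosts.conf-style text."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: a valid line has exactly four fields (name, "*", start, end).
        # Anything else is skipped.
        if len(fields) != 4:
            continue
        host, _second, low, high = fields
        ranges.append((host, int(low, 16), int(high, 16)))
    return ranges


def host_for_checksum(ranges, checksum):
    """Pick the host whose range contains the first hex pair of the checksum."""
    first_pair = int(checksum[:2], 16)
    for host, low, high in ranges:
        if low <= first_pair <= high:
            return host
    return None


def repo_path(host, checksum):
    """Build the path relative to $Repo: host/##/##/##/file."""
    return "/".join([host, checksum[0:2], checksum[2:4], checksum[4:6], checksum])


HOSTS_CONF = """\
sirius * 00 7f
buckbeak * 80 ff
"""

ranges = parse_hosts_conf(HOSTS_CONF)
example = "01e42f923c85"
host = host_for_checksum(ranges, example)
print(host)
print(repo_path(host, example))
```

With the two-host example, any checksum starting with 00 through 7f lands under sirius and everything else lands under buckbeak.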
The repository must be writable by the group "fossy". To ensure that all files are group accessible, the directories should be set with the permissions "g+rwxs". With the SGID bit (g+s) set on a directory, all new files and directories created under it inherit the directory's group. The **big catch** here is that all mounted filesystems must use the same group ID for "fossy". Ideally, the top directories should be owned by user "fossy", and "fossy" should have the same user ID on all systems.
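If you want to verify the permissions rather than trust them, a small check along these lines may help. It is only a sketch: the helper names are invented for this example, and it tests the mode bits only, not whether the group ID really belongs to "fossy" on every host.

```python
import os
import stat

# The bits that "g+rwxs" turns on: group read/write/execute plus SGID.
WANTED = stat.S_IRWXG | stat.S_ISGID


def has_group_rwxs(mode):
    """True if the mode bits include group rwx and the SGID bit."""
    return (mode & WANTED) == WANTED


def directory_is_group_shared(path):
    """Check a real directory on disk (hypothetical helper, not called here)."""
    st = os.stat(path)
    return stat.S_ISDIR(st.st_mode) and has_group_rwxs(st.st_mode)


# Demonstration on plain mode values instead of the real repository.
print(has_group_rwxs(0o2775))  # group rwx and SGID are both set
print(has_group_rwxs(0o0755))  # group write and SGID are both missing
```

Running directory_is_group_shared() on each mount point under $Repo, on each host, should catch a directory that lost its SGID bit.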
===== Part 2: The Scheduler =====
Splitting the repository across mount points is only the first step. The scheduler also needs to know where to run jobs.
The scheduler.conf file (default: /usr/local/share/fossology/agents/scheduler.conf) lists the host strings where jobs should be run. For example:
agent=filter_clean host=localhost | /usr/local/fossology/agents/filter_clean -s
If you change Hosts.conf, then you will need to change the "host=" strings in scheduler.conf to match the host names in Hosts.conf. There are three different scenarios:
----
==== Scenario 1: Distributed Repo, Local Agents ====
If you lack disk space for the Repo on the local system, you can distribute the repository and still use the local CPUs for running agents. This configuration is not ideal since all communication to the repository will be done over the network (significant speed impact). However, if you need the disk space then this is an option.
The simplest solution is to edit scheduler.conf and remove all of the "host=" tags. For example:
agent=filter_clean host=localhost | /usr/local/fossology/agents/filter_clean -s
will become:
agent=filter_clean | /usr/local/fossology/agents/filter_clean -s
This tells the scheduler to ignore host designations for the agent and just run it locally. The agent will process repository files regardless of which remote host stores them.
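For a long scheduler.conf, the edit can be scripted. The following is a minimal sketch that assumes every tag has the form "host=" followed by a name without spaces, as in the example above. The function name is hypothetical.

```python
import re

# Matches one "host=<name>" token plus the whitespace that follows it.
HOST_TAG = re.compile(r"\bhost=\S+\s*")


def strip_host_tags(conf_text):
    """Remove every host= tag from scheduler.conf-style text."""
    return "\n".join(HOST_TAG.sub("", line) for line in conf_text.splitlines())


before = "agent=filter_clean host=localhost | /usr/local/fossology/agents/filter_clean -s"
print(strip_host_tags(before))
```

Write the result to a new file and compare it against the original before replacing scheduler.conf.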
----
==== Scenario 2: Distributed Repo, Distributed Agents ====
This is the best, usual, and expected scenario since agents can run on the same systems as the repository data.
In scheduler.conf, you will need to change the "host=" lines and add lines for the additional agents. The easiest way to do this is by using the mkconfig program and SSH.
- Create SSH keys for the fossy user and distribute them on all hosts. The keys should **NOT** include a pass-phrase. (Since the scheduler cannot enter a password, a required pass-phrase will cause the remote execution to fail.)
- Test the keys (including accepting the server key for the first connection). You should be able to login without a password.
- Use mkconfig (/usr/local/fossology/agents/mkconfig) to generate a new scheduler.conf file. Use -C to identify the number of CPUs on the host, -R to specify the remote-login command, and -H to create records for the hostname.
For example, if fawkes has 8 CPUs and buckbeak has 4, then you can use:
  mkconfig -C 8 -R '/usr/bin/ssh fossy@fawkes "%s"' -H fawkes \
           -C 4 -R '/usr/bin/ssh fossy@buckbeak "%s"' -H buckbeak \
           > new_scheduler.conf
Then, if new_scheduler.conf looks good, you can replace scheduler.conf and restart the scheduler.
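With more than a couple of hosts, typing the mkconfig arguments by hand gets error-prone. The sketch below builds the same argument list from a host-to-CPU table. It uses only the -C, -R, and -H options described above. The helper name and its defaults (user "fossy", /usr/bin/ssh) are assumptions made for this example.

```python
import shlex


def mkconfig_args(cpus_per_host, user="fossy", ssh="/usr/bin/ssh"):
    """Build the mkconfig argument list for a {host: cpu_count} table."""
    args = ["mkconfig"]
    for host, cpus in cpus_per_host.items():
        # The remote-login command keeps its literal "%s" placeholder.
        remote = '%s %s@%s "%%s"' % (ssh, user, host)
        args += ["-C", str(cpus), "-R", remote, "-H", host]
    return args


args = mkconfig_args({"fawkes": 8, "buckbeak": 4})
# shlex.quote adds the shell quoting needed to paste the line into a terminal.
print(" ".join(shlex.quote(a) for a in args))
```

The printed line can be redirected into new_scheduler.conf in the same way as the hand-written command.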
A few caveats about mkconfig and scheduler.conf:
- mkconfig cross-checks the "-H" value with the hosts in Hosts.conf. It will print warnings if they do not match. If they don't match, then they won't work. However, since you could be building a scheduler.conf for a different host, the tool allows you to create a scheduler.conf that is not valid for the current system. (Or to say it simply: here is the gun and you are allowed to shoot yourself. Don't just ignore the warnings.)
- The scheduler can use any remote login command. The only requirement is that no human interaction can be required. In particular, if the login requires someone to enter a password then it won't work. This is because the scheduler is fully automated.
- The remote login command takes one parameter (%s) as the file to process. You **must** include one and only one "%s". Moreover, the string that replaces the "%s" will contain spaces, hyphens, and single-quotes ('). If the remote command cannot handle this, then you should enclose the %s in double quotes: "%s".
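The caveats about the remote login command lend themselves to a quick sanity check before you install a new scheduler.conf. This sketch checks for exactly one "%s" and for a full path to the login program. It assumes the scheduler performs a plain textual substitution of "%s", which is not confirmed here. Both function names are made up for the example.

```python
def check_remote_template(template):
    """Return a list of problems found in a remote-login command template."""
    problems = []
    if template.count("%s") != 1:
        problems.append('template must contain exactly one "%s"')
    if not template.startswith("/"):
        problems.append("login program is not given as a full path")
    return problems


def expand_template(template, command):
    """Show what the command line looks like after %s is replaced."""
    if check_remote_template(template):
        raise ValueError("template is not usable: %s" % template)
    return template.replace("%s", command)


template = '/usr/bin/ssh fossy@fawkes "%s"'
print(check_remote_template(template))
print(expand_template(template, "/usr/local/fossology/agents/filter_clean -s"))
```

An empty list from check_remote_template() means neither of the two problems was found. It does not prove that the remote host accepts the command.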
----
==== Scenario 3: Local Repo, Distributed Agents ====
If you have lots of CPUs, but only one repository, then you can either pretend to distribute the repository, or just replicate agents.
If you choose to pretend to distribute the repository, then it will look just like Scenario 2, except that you will only mount one directory rather than multiple directories. In this configuration, each CPU is assigned a certain range of files to process. This means that some CPUs may sit idle when no pending files fall in their range.
Alternately, you can replicate agents. In this case, you will need to manually edit the scheduler.conf file.
- Run mkconfig with the -R parameter so you can see what a remote login line looks like. For example:
mkconfig -R "/usr/bin/ssh fossy@sirius '%s'" -B
- Remove every "host=" label.
- Create a remote login line for each host, CPU, and agent.
- On the very first line of the scheduler.conf file, set the first number to reflect the total number of CPUs. For example, if you have a total of 18 CPUs then use:
%Host localhost 18 1
This tells the scheduler to run at most 18 jobs at once. The various agent lines say where to run the jobs for specific agents. Repeating an agent line for a host allows that many copies of the agent to run on that host at the same time.
For example, this scheduler.conf runs up to three copies of Filter_License on buckbeak and two on fawkes.
  # 3 CPUs on buckbeak
  agent=filter_license | /usr/bin/ssh fossy@buckbeak "/usr/local/fossology/agents/Filter_License"
  agent=filter_license | /usr/bin/ssh fossy@buckbeak "/usr/local/fossology/agents/Filter_License"
  agent=filter_license | /usr/bin/ssh fossy@buckbeak "/usr/local/fossology/agents/Filter_License"
  # 2 CPUs on fawkes
  agent=filter_license | /usr/bin/ssh fossy@fawkes "/usr/local/fossology/agents/Filter_License"
  agent=filter_license | /usr/bin/ssh fossy@fawkes "/usr/local/fossology/agents/Filter_License"
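Since the replicated lines are pure repetition, they can be generated instead of copied by hand. The following sketch reproduces the example above from a table of copies per host. The function name and its defaults are assumptions for illustration only.

```python
def replicated_agent_lines(agent, command, copies_per_host,
                           user="fossy", ssh="/usr/bin/ssh"):
    """Return scheduler.conf lines that run `copies` of an agent on each host."""
    lines = []
    for host, copies in copies_per_host.items():
        lines.append("# %d CPUs on %s" % (copies, host))
        for _ in range(copies):
            lines.append('agent=%s | %s %s@%s "%s"'
                         % (agent, ssh, user, host, command))
    return lines


plan = {"buckbeak": 3, "fawkes": 2}
lines = replicated_agent_lines(
    "filter_license", "/usr/local/fossology/agents/Filter_License", plan)
for line in lines:
    print(line)
```

The output still has to be merged into scheduler.conf by hand, together with the lines for the other agents.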
A few caveats to remember:
- The unpack agent will create an I/O bottleneck. It is not recommended to run multiple unpack agents on the same host. Ideally, you should have at most one unpack agent per host.
- A few agents are really fast. You probably don't need more than one filter_clean agent per host.
- A few agents are CPU intensive. If you have four CPUs per host, then you probably do not want to run more than three of the license agents on the host. This reserves one CPU for the operating system and other jobs on the computer. The basic rule of thumb is to run "n-1" license agents on a host with n CPUs, where n > 1.
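The three caveats can be folded into one small planning helper. This is only a sketch of the rules of thumb above. The dictionary keys are labels chosen for this example, and using one license agent on a single-CPU host is an assumption, since the rule is stated only for n > 1.

```python
def agent_plan(cpus):
    """Suggest how many copies of each kind of agent to run on one host."""
    if cpus < 1:
        raise ValueError("a host needs at least one CPU")
    return {
        "unpack": 1,        # I/O bound: at most one per host
        "filter_clean": 1,  # fast: one per host is usually enough
        # CPU bound: leave one CPU free when there is more than one.
        # Assumption: a single-CPU host still runs one license agent.
        "license": cpus - 1 if cpus > 1 else 1,
    }


for host, cpus in {"fawkes": 8, "buckbeak": 4}.items():
    print(host, agent_plan(cpus))
```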
===== Testing the Configuration =====
When you have finished configuring the scheduler.conf, you can test it with the scheduler command. As the user "fossy", run this command:
/usr/local/fossology/agents/scheduler -t
This will attempt to spawn every agent. If there are any errors, it will tell you which command failed. Some common failure causes:
- Wrong user or group. The scheduler and all agents run as user "fossy" and group "fossy". If this is not the case, then programs will fail to run. (If you run the scheduler as root, it will change itself to run as user fossy.)
- Bad path. Every agent must be specified using the full path. If you use SSH for the remote login, be sure to specify the path to SSH (`which ssh`) and not just "ssh".
- Typographical errors. If a hostname or program is misspelled, things will fail to run.
- Password required. If the remote login system requires a password, then it will not work with the scheduler. As user "fossy", try the remote login command and see if it works without a password.
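The password check in the last item can be automated for every host. The sketch below relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting. It assumes OpenSSH is installed at /usr/bin/ssh and that the remote host has a "true" command. The function names are invented for this example.

```python
import subprocess


def passwordless_check_command(user, host, ssh="/usr/bin/ssh"):
    """Build an ssh command that fails rather than asking for a password."""
    return [ssh, "-o", "BatchMode=yes", "%s@%s" % (user, host), "true"]


def can_login_without_password(user, host):
    """Run the check; True means the login worked with no prompt."""
    result = subprocess.run(passwordless_check_command(user, host),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


# Only print the command here; run can_login_without_password() as user fossy.
print(" ".join(passwordless_check_command("fossy", "buckbeak")))
```

If the check fails for a host, fix the keys there before running the scheduler test again.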