FOSSology: Advancing open source analysis and development
How To Create An Agent

This document covers how to create, configure, and install an agent for use with the scheduler, DB, and UI.

Tip: Look at the Engine-Shell code in the SVN tree as an example.

Purpose

Agents are used to perform analysis, statistics, or management tasks related to anything in the database. Each agent performs one (1) task; do not overload agent functionality. If you need three different tasks, then create three different agents. For example, if your agent needs to unpack a file in order to analyze it, then ask yourself: can the unpack be done as a separate agent, and does it make sense to split the functionality?

Building an Agent

Agents can be built in any language – from shell script to C. Some of the existing agents are written in Shell, C, Perl, and PHP. Any kind of executable can be an agent. For simplicity, we have some pre-built libraries for common functions.
* libdbapi: A DB-independent library for accessing the database. (This provides non-Postgres-specific calls.) Alternatively, you can use the Postgres libraries to access the database.

The Repository

The repository is a flat file system that contains the pfile contents. For load balancing, it may be divided across hosts. For organization, the repository has data separated by type. For example:
You should never need to search the repository for files. Instead, use the reppath command (or library function), since it will look up the directory for you.

You are not limited to the directories gold, files, or license. These are just organizational types. If you want to create your own type for files, go ahead and do it.

The current convention is that related files have the same name. For example, “files/foo” will have the license cache file “license/foo”. Since files represent pfiles, they do not use their human name (you will not find /license/COPYRIGHT.GPL in the directories). Instead, they are stored using pfile information: sha1.md5.length.

The Database

Agents currently have access to all tables in the database. (But this might change in the future, so keep track of the tables you require.) Many agents have special tables just for their own information. Do what is right for your agent.

What Kind of Data?

The scheduler sends data to the agent. The data comes from the jobqueue. There are two types of agents found in the jobqueue:
A="5" B="Tuba"

Populating the Job Queue

How does the data or SQL get into the jobqueue? You will need to develop the data or SQL and tell the front-end UI. The UI will place the data into the jobqueue when the job is requested by the user.

Starting an Agent

At this point, we will assume that you have built the code that does the analysis. Now you need to make the agent able to run from the scheduler.

The scheduler will start the agent using a system() call. The agent and all of its parameters are static and defined in the scheduler’s configuration file. The scheduler will feed each agent one piece of data to process. After the data is processed, the agent may be fed another piece of data.

Because the scheduler may run multiple instances of the agent, the agent must ensure that multiple instances will not interfere with each other. In addition, different instances may be started on different hosts. Use of common temp (or writable) files and common shared memory is strongly discouraged.

If an agent needs a unique ID, consider using the environment variable $THREAD_UNIQUE. This is an ID set by the scheduler that is guaranteed to be unique among the running processes.
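
One way to keep instances from colliding is to derive every scratch path from $THREAD_UNIQUE. The sketch below is illustrative only: the helper name `scratch_dir`, the `agent` prefix, and the fallback used when the variable is unset are assumptions, as is the idea that the ID is safe to embed in a file name.

```python
import os
import tempfile


def scratch_dir(prefix: str = "agent") -> str:
    """Return a scratch directory private to this agent instance."""
    unique = os.environ.get("THREAD_UNIQUE")
    if unique:
        # Assumption: the scheduler-provided ID is safe to embed in a path.
        path = os.path.join(tempfile.gettempdir(), f"{prefix}-{unique}")
        os.makedirs(path, exist_ok=True)
        return path
    # Assumption: outside the scheduler (e.g. manual testing) the variable
    # may be unset, so fall back to a randomly named directory.
    return tempfile.mkdtemp(prefix=f"{prefix}-")
```

Because the directory lives in the local temp area of whichever host runs the instance, this also avoids relying on storage shared between hosts.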

When to Exit

Agents do not exit when they finish processing data! This is intentional: processes have a high cost for starting up, and may be called millions of times (resulting in cumulative hours or days of startup time). Agents should only exit under a limited number of circumstances:
Note: If the agent does not die fast enough, then it will be killed. The current timeout is 20 seconds: you have 20 seconds to clean up and exit after receiving a SIGINT. If the agent has not exited by then, it will be killed using SIGKILL.

Communicating with the Scheduler

Agents work with the scheduler to process data. Essentially, the agent specifies when it is ready for data to process and the scheduler doles out data as needed. The scheduler handles parallel processing of the data by starting up parallel agents. All communication happens over the agent’s stdin and stdout streams. The flow is as follows:
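
A minimal sketch of this exchange, written from the agent's side. It assumes the ready signal is the literal line OK and that each piece of data arrives as a single line on stdin; the names `run_agent`, `install_sigint_handler`, and the `handle` callback are hypothetical and not part of any FOSSology library.

```python
import signal
import sys
from typing import Callable, TextIO


def run_agent(stdin: TextIO, stdout: TextIO, handle: Callable[[str], None]) -> None:
    """Announce readiness, then process one piece of data per line until EOF."""
    stdout.write("OK\n")      # tell the scheduler we are ready (assumed format)
    stdout.flush()            # unflushed output would stall the exchange
    for line in stdin:        # the loop ends when the scheduler closes stdin
        handle(line.rstrip("\n"))
        stdout.write("OK\n")  # ready for the next piece of data
        stdout.flush()


def install_sigint_handler(cleanup: Callable[[], None]) -> None:
    """Run a fast cleanup and exit when the scheduler sends SIGINT."""
    def on_sigint(signum, frame):
        cleanup()             # must finish well inside the kill timeout
        sys.exit(0)
    signal.signal(signal.SIGINT, on_sigint)
```

A real agent would call `install_sigint_handler(...)` once at startup and then `run_agent(sys.stdin, sys.stdout, handle)`. Passing the streams in as parameters keeps the loop easy to exercise with in-memory files.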
The entire communication is a series of OK - Data - OK - Data - OK strings. When there is no more data, stdin will close and a SIGINT will be sent.

Caveat: Because agents are spawned as needed, used in parallel, and killed when no longer needed, an agent may be spawned and killed without ever receiving data. This is rare, but happens when one agent is running and another is spawned. In the time that it takes for the second agent to spawn and become ready, the first agent completes all of the tasks. With nothing left to do, both agents are killed.

Special Communications

There are a few special communication methods that can be used between the agent and the scheduler. However, these are usually used for debugging. Each of these is written to stdout.

Error Handling

Errors are recorded as attributes in the database. They can be associated with pfiles, ufiles, or just about anything else found in the database. Agents have the option to directly record errors in the database (inserting their own attribute records), or they can communicate errors to the scheduler. Communication with the scheduler is as follows:

keyword type(index) message

The keyword is one of the following:
The type and index denote how to attach the attribute. For example, “pfile(1234)” will attach the error message to the pfile_pk 1234.

The message is a string that describes what happened and, optionally, possible resolution methods. This string must be human readable. For a multi-line string, the message should contain the characters “\n” in place of newline characters. Example messages:
FATAL ufile(1111) Unable to allocate required memory for analysis.
ERROR pfile(1234) Wget returned a zero-length file from the URL http://bad.url\nPlease recheck and resubmit.
WARNING pfile(5557) RPM file missing meta data.
LOG ufile(9876) wget: SELECT returned no rows: 'SELECT * FROM blah\nwhere pfile_fk = 1234;' client.c:512
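
The message format can be produced with a small helper. This is a sketch, not an existing FOSSology API: the function name `format_message` and its parameter names are hypothetical, and the only behavior it encodes is the keyword type(index) message layout and the replacement of newlines with the two characters \n described above.

```python
def format_message(keyword: str, attach_type: str, index: int, message: str) -> str:
    """Build one 'keyword type(index) message' line for the scheduler."""
    # Multi-line messages carry the two characters backslash + n
    # in place of each real newline.
    one_line = message.replace("\n", "\\n")
    return f"{keyword} {attach_type}({index}) {one_line}"
```

For instance, `format_message("WARNING", "pfile", 5557, "RPM file missing meta data.")` reproduces the third example message. An agent would then write the resulting line to the scheduler and flush its output.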

Quick and Dirty Agents

If you just want to write a quick and dirty agent, then consider using the Engine-Shell program. This is a wrapper around system(). It takes a command-line parameter of the program to run, and handles all of the OK communications.
Engine-Shell can replace macros in the system() string with variables.
abc def = 2 args :: first arg is %{1}, second is %{2}
a "b c" d = 4 args
If you forget the “%{” or “}”, or the middle part is unknown, then it is treated like regular characters.

An example entry in the scheduler’s configuration file:
agent=unpack host=fawkes | /usr/bin/ssh fawkes.rags "/usr/local/bin/engine-shell \
unpack 'ununpack -d /home/repository/ununpack/%{U} -qRCQx'"
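
The substitution rule can be illustrated with a short sketch. It is not Engine-Shell's actual code: it covers only the numbered %{1}, %{2} macros shown above, assumes arguments are split on whitespace without honoring quotes (which is what the "4 args" example suggests), and assumes an out-of-range number is left untouched like any other unknown macro. The name `expand_macros` is hypothetical.

```python
import re

# A macro is "%{", any run of characters without braces, then "}".
MACRO = re.compile(r"%\{([^{}]*)\}")


def expand_macros(template: str, data: str) -> str:
    """Replace %{N} with the N-th whitespace-separated word of data."""
    args = data.split()  # assumption: plain whitespace split, quotes ignored

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name.isascii() and name.isdigit() and 1 <= int(name) <= len(args):
            return args[int(name) - 1]
        # Unknown macros are kept verbatim, i.e. treated as regular characters.
        return match.group(0)

    return MACRO.sub(substitute, template)
```

With the data `abc def`, the template `first arg is %{1}, second is %{2}` expands to `first arg is abc, second is def`. A malformed macro such as `%{1` (no closing brace) never matches the pattern and so passes through unchanged.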

Using SSH for Remote Access

The scheduler is only designed to run processes on the local system; it does not provide remote agent support. However, it is possible to run an agent on a remote host by using ssh.

Agents communicate with the scheduler through stdin and stdout. Any tunneling system that forwards stdin and stdout (e.g., ssh, stunnel, rsh, pppd, and netcat) can be used to create a connection for running the agent on a remote server. The requirements for any tunneling service:
In the scheduler’s configuration file, you will specify the command line that the agent should run. Rather than running a local application, specify the tunnel software (e.g., ssh remote_host agent_command).

FOSSology Project documentation is licensed under the GNU Free Documentation License Version 1.2