====== Scheduler ======

This document covers the technical implementation of the scheduler. It is intended for anyone who needs to replace, modify, or debug the scheduler. It is not intended for people who want to create an agent (although it certainly helps to know this information).

===== About the Scheduler =====

The scheduler is a //super-agent//, responsible for spawning and managing all other agents. The scheduler balances the needed tasks (found in the job queue) with the available resources. It tries to ensure that:
  - One task does not lock out other tasks.
  - Unused resources are used by other pending tasks.
  - Agents are spawned in an optimal fashion (fastest).
  - Agents do not exceed the allotted resources.

Although the scheduler is single-threaded, it manages child processes that run in parallel.
  * Scheduler is single-threaded.
  * Scheduler spawns agents (children) that run in parallel.

===== Front-End Communications =====

The scheduler communicates with the front-end UI through the database's jobqueue table. This table lists which agents need to run, the necessary parameters for the agent, and the current operation status.
  * Agents can either be general -- running on any available server (host) -- or they can be host-specific. The jobqueue specifies a "runonpfile" field if the agent should be locked to a specific host. The parameter denoted by the runonpfile field is used to identify the host: run on the host that matches the pfile.
  * Jobs can contain a single parameter that is passed to the agent, or an SQL query can be provided for generating parameters. The latter, a multi-SQL-query (MSQ), is used in lieu of adding thousands of individual jobs to the jobqueue. The scheduler performs the MSQ request and passes the results individually to agents.

The combination of host and query type leads to four combinations, but only two combinations are implemented.
^               ^ Any host ^ Host-specific     ^
^ One parameter | OK       | N/A               |
^ MSQ           | N/A      | OK via runonpfile |

Some example agents:
  * **wget:** Any host, one parameter (the URL to get).
  * **license:** The bSAM agent uses a host-specific MSQ. This allows bSAM to run on the system containing the license files, rather than resorting to NFS accesses.
  * **filter_license:** Similar to the bSAM agent, the host-specific MSQ reduces NFS access.

Some agents could fit into other categories, but are limited by the two available choices:
  * **unpack:** Host-specific, MSQ. The SQL parameter returns ONE record that contains the pfile and ufile to unpack. This should be implemented as a host-specific one-parameter job, but the only host-specific option is an MSQ.
  * Any future agent that only performs database accesses would fit better as an any-host MSQ job. However, it will likely be implemented as a host-specific MSQ job.

==== Adding to the Queue ====

Jobs are added to the jobqueue by the front-end UI. The UI knows the desired agent type, whether it is an any-host or MSQ job, and the proper parameters (or SQL). The scheduler has no control over what is added and does not validate whether the added job is correct.

==== Job Tracking ====

The scheduler tracks jobs based on the jobqueue start and end times.
  * No start time? OK to run.
  * Start without end? Job is currently being managed by the scheduler.
  * Start with end? Job is completed.

Some jobs need to be rescheduled. For example, an MSQ may have a "LIMIT 5000" (allowing the scheduler to manage only a few results at a time and permitting limited timeslice scheduling). Rescheduling is done by removing the start time when the job completes, effectively putting the job back into the jobqueue. If the MSQ returns no results, then the end time is set, completing the job.

The jobs run with the following priorities:
  - Anything currently running is allowed to run. The rationale: a job may take a long time and it is better not to cancel the job and try to restart it later.
  - Any jobs held by the scheduler come next.
    - Of the jobs held by the scheduler, jobs for available, active agents come first. This reduces kill/spawn times.
    - If there are no running agents of the correct type, then one is spawned (possibly after killing an incorrect ready/active agent type first).
  - The job queue has a column for urgent tasks. These come next.
  - Any available job, oldest first.
    - Jobs that do not match the available agents are ignored and remain in the jobqueue.

The jobqueue keeps a prioritized table in case one agent depends on the results from another agent. As a result, there are frequently jobs in the jobqueue that cannot run (temporarily blocked due to a dependency).

This tracking method has a few limitations:
  * If the scheduler dies, someone needs to remove the start times on incomplete tasks. (The scheduler tries to do this with signal handling, but sometimes dies before resetting the values.)
  * Since the start time can be reset due to rescheduling, there is no way for the front-end to tell how long a job really took.
  * The scheduler holds on to jobs. The front-end cannot distinguish between a "held" job and a "running" job.
  * The scheduler can only hold a few jobs at a time: one "any host" per agent, and four MSQ at a time (MAXMSQ in dbq.c). It is very possible for the four pending MSQ commands to be held and waiting for an agent to become available, while other MSQ commands in the queue could run.

The main function for checking the queue is in dbq.c: DBProcessQueue(). This checks the jobqueue for new tasks and processes the held MSQ records.

==== Signals ====

The scheduler watches for a few signals. These are mainly used for debugging:
  * **SIGINT**. Finish all running jobs, but do not start new ones. When all jobs complete, exit. This is denoted in the code by the "SLOWDEATH" flag and is used to provide a clean exit.
  * **SIGQUIT**. Kill all running children, try to reset the jobqueue start time, and exit.
  * **SIGTERM**. Handled the same as SIGQUIT.
  * **SIGUSR1**. Display a quick summary of running processes (how many are running, waiting, or dead).
  * **SIGUSR2**. Display details about every MSQ job held by the scheduler. This can generate a huge amount of output, but allows debugging MSQ jobs.
  * **SIGHUP**. Display the number of running jobs and the summary of each process (same as SIGUSR1).
  * **SIGSEGV**. If there is a crash, display all thread info (as with SIGHUP) before dying.

===== Back-end Children =====

All children are treated as finite-state machines. The states (defined in spawn.h) are:
  * **ST_FAIL = 0**. If an agent spawns and dies too rapidly, then mark it as failed. Failed agents are not respawned for a few minutes. (This prevents infinite spawning/death loops.) The timeout is defined in spawn.c as RespawnInterval (5 minutes) and RespawnCount (5 respawns). If the agent spawns more often than 5 times in 5 minutes, then it is marked as a failure. It will remain a failure for RespawnInterval (5 minutes -- the variable is reused). NOTE: only abnormal deaths are counted here. If the scheduler intentionally kills a process (using ST_FREEING), then the number of spawns is reset and it should never reach ST_FAIL (even if it is spawned and killed rapidly).
  * **ST_FREE**. The agent is not spawned yet and has no I/O allocated. All agents begin in this state.
  * **ST_FREEING**. The agent was spawned, but has been told to die by the scheduler. It is now shutting down and has no I/O allocated.
  * **ST_PREP**. The scheduler is preparing a child data structure. The structure has allocated memory but the child has not yet been spawned. This step prevents a well-timed SIGCHLD from freeing the data structure before the state becomes ST_SPAWNED.
    * As an aside: Signals are boolean states and not queued. If three children die at once, then the parent only receives one SIGCHLD. Thus, the scheduler must scan every child when a SIGCHLD is received, just in case there were multiple deaths. However, a new-dead child will look just like an old-dead child; there is no distinction. The ST_PREP state prevents an old-dead child from appearing as new-dead and having its memory freed in the signal interrupt handler while it is being allocated in the normal (non-interrupt handler) code.
  * **ST_SPAWNED**. The agent is spawned but not yet ready (I/O allocated). When the agent sends its first "OK", it will be transitioned to ST_READY.
  * **ST_READY**. The agent is live and ready for data.
  * **ST_RUNNING**. The agent is actively processing data. When the agent sends an "OK", it will be transitioned back to ST_READY.
  * **ST_DONE**. This is used by the MSQ table. Each SQL record has a status field and this indicates that the record is completed. When all MSQ records are completed, the MSQ job is done.
  * **ST_END**. This is an unused marker. Since states are numeric, this allows the code to loop over all possible states: for(i=0; i

  %Host localhost 2 1
  agent=wget host=localhost | /usr/local/fossology/agents/wget_agent
  agent=unpack host=localhost | /usr/local/fossology/agents/engine-shell unpack '/usr/local/fossology/agents/ununpack -d /home/repository//ununpack/%{U} -qRCQx'
  agent=filter_license host=localhost | /usr/local/fossology/agents/Filter_License
  agent=filter_license host=localhost | /usr/local/fossology/agents/Filter_License
  agent=license host=localhost | /usr/local/fossology/agents/bsam-engine -L 20 -A 0 -B 60 -G 10 -M 2 -E -T license -O n -- - /usr/local/share/fossology/agents/License.bsam
  agent=mimetype host=localhost | /usr/local/fossology/agents/mimetype
  agent=mimetype host=localhost | /usr/local/fossology/agents/mimetype
  agent=specagent host=localhost | /usr/local/fossology/agents/specagent
  agent=filter_clean host=localhost | /usr/local/fossology/agents/filter_clean -s
  agent=pkgmetagetta host=localhost | /usr/local/fossology/agents/pkgmetagetta
  agent=pkgmetagetta host=localhost | /usr/local/fossology/agents/pkgmetagetta

The format of the file is as follows:
  * Lines beginning with a "#" are comments.
  * Lines beginning with a "%" are settings.
    * %Verbose specifies the verbose level (same as using "-v" on the command-line). %Verbose 2 is like "-vv".
    * %Host lists a host name, the number of agents that can run at a time, and the number of urgent (additional) agents that can run. Currently "urgent" is implemented but not used and not tested.
  * All other lines define agents. These use two parts: attributes | command.
    * There is one line per agent. If you want to permit three unpack agents on the same host, then you will need to have three of the exact same line!
    * Agents are tracked by a unique ID in an array. Each line is assigned one position in the array (the first line is 0).
    * Attributes are strings used to match an agent. (They may look like "field=value" pairs, but they are really just strings.) There are some well-defined attributes:
      * agent=name. This comes from the jobqueue agent table and specifies the type of agent.
      * host=name. This comes from an MSQ request and specifies the hostname to run on. NOTE: The name is just a string! For usability, I named the string after the host's name, but this is not a requirement! It could just as easily use "%Host foo 4 1" and "agent=wget host=foo | ssh bar". The string is only used to identify the correct agent line, not to specify the actual hostname!
    * A vertical bar (|) separates the attribute list from the command.
    * The command will be used by system() to run the agent.
    * Each command is also passed an environment variable "$THREAD_UNIQUE". This specifies the unique thread number for the process. NOTE: It is unique among the currently running processes, but if the child dies then the value will likely be reused. In some situations, this is better than $PID or $PPID for managing any temporary files.
    * Some commands may appear to contain macro expansion variables, like ${U} or ${*}. However, these are not processed by the scheduler. They are processed by the agent. (In this case, the agent is called "engine-shell" and is used to run ugly-hack agents from shell scripts.)
    * Commands can be shells around agent processes. For example, "engine-shell" is an agent-aware wrapper for shell scripts. Similarly, you can use "ssh" (specify the full path!) to run a command on a remote host. Remember: The command that is executed is independent of the attribute string "host=".

===== Running the Scheduler =====

  Usage: ./scheduler [options] [setup.conf] < 'type command'
    -d :: Run as a daemon! Still generates stdout and stderr
    -H :: Ignore hosts for host-specific agent requests
    -I :: Use stdin and queue (default: use queue only)
    -v :: verbose (-v -v = more verbose)
    -L log :: send stdout and stderr to log
    -q :: turn off show stages
    -R :: reset the job queue in case something was hung
    setup.conf: defines each engine -- one 'type command' per line.
      If setup.conf is not specified, then
      /usr/local/share/fossology/agents/scheduler.conf is used.
    stdin lists type+data, one per line.
    stdout comes from threads, non-interlaced and only when thread ends.
    stderr comes from threads, interlaced and immediate.
    Each command is executed as a running engine.
    Each stdin line is matched to a free engine of the same type.
    If no engine is free, then it will pause until one is available.

The scheduler usually logs all output to a log file and processes data from the queue. It also runs as a daemon so you can log out without killing the scheduler.

  ./scheduler -d -L log setup.conf

For testing:
  * You might want to use '-I'. This allows you to enter jobs to run on stdin. This is good for testing new agents.
  * -H is useful if you want to use the real configuration file on the local host.
  * I used to use -I and -H. Now I just create my own configuration file for the specific test.
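When writing a configuration file for a specific test, it can help to check that each line splits the way the file format says it should: "#" for comments, "%" for settings, and "attributes | command" for agents. The following is a minimal sketch only, not the scheduler's own parser. The names ''line_kind'', ''conf_trim()'', and ''split_conf_line()'' are hypothetical, and three behaviors are assumptions rather than documented facts: empty lines are treated as blank, the split happens at the first vertical bar, and surrounding spaces are trimmed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Kinds of configuration line. LINE_BLANK and LINE_INVALID are assumptions;
 * the document only describes comments, settings, and agent lines. */
typedef enum {
    LINE_BLANK,
    LINE_COMMENT,   /* begins with '#' */
    LINE_SETTING,   /* begins with '%', e.g. "%Host localhost 2 1" */
    LINE_AGENT,     /* "attributes | command" */
    LINE_INVALID    /* no '|' found, or nothing after it */
} line_kind;

/* Hypothetical helper: strip leading and trailing spaces/tabs in place. */
char *conf_trim(char *s)
{
    size_t n;

    while (*s == ' ' || *s == '\t')
        s++;
    n = strlen(s);
    while (n > 0 && (s[n - 1] == ' ' || s[n - 1] == '\t'))
        s[--n] = '\0';
    return s;
}

/* Hypothetical helper: classify one line. For agent lines, the buffer is
 * split in place at the first '|' (an assumption) and *attrs / *cmd point
 * at the two trimmed halves. For every other kind they are set to NULL. */
line_kind split_conf_line(char *line, char **attrs, char **cmd)
{
    char *bar;

    assert(line != NULL && attrs != NULL && cmd != NULL);
    *attrs = NULL;
    *cmd = NULL;

    line[strcspn(line, "\r\n")] = '\0';   /* drop the trailing newline */

    if (line[0] == '\0')
        return LINE_BLANK;
    if (line[0] == '#')
        return LINE_COMMENT;
    if (line[0] == '%')
        return LINE_SETTING;

    bar = strchr(line, '|');
    if (bar == NULL)
        return LINE_INVALID;
    *bar = '\0';

    if (*conf_trim(bar + 1) == '\0')
        return LINE_INVALID;              /* agent line without a command */

    *attrs = conf_trim(line);
    *cmd = conf_trim(bar + 1);
    return LINE_AGENT;
}
```

Passing the wget line from the sample configuration should yield the attribute string "agent=wget host=localhost" and the command "/usr/local/fossology/agents/wget_agent". Note that the attribute string is left whole, since attributes are matched as plain strings rather than parsed as field=value pairs.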
If the scheduler is killed using "kill -9", then the queue may not be reset to a stable condition. When you start the scheduler, it will monitor the queue. After 10 minutes of inactivity, the abandoned queue entries will be reclaimed for use by the scheduler. For a faster response, you can use "-R" to reset the queue immediately. However, don't use -R if there are multiple schedulers running at the same time. (Running multiple schedulers is not supported, but -R will make a bad situation worse.)

===== Commanding the Scheduler =====

The scheduler runs as an independent back-end process from the front-end user interface. As a result, the UI cannot communicate directly with the scheduler. Instead, all commands are placed in the database's jobqueue.

During normal operations, the jobqueue stores tasks to be run (the tasks should match the scheduler's configuration file). However, there is one special jobqueue task.

  jq_type = "command"
  jq_args = parameters for command

When the jobqueue's jq_type is the lowercase string "command", the parameters in "jq_args" are interpreted directly by the scheduler. The following jq_args are supported.
  * "shutdown". The scheduler will finish all running tasks, but not start anything new. When all tasks complete, the scheduler will exit.
  * "shutdown now". The scheduler kills all running processes and exits ASAP.
  * "killjob 1234". If the jobqueue item 1234 (jq_pk="1234") is currently being processed by the scheduler, then kill it and mark it as a failure. This is usually used when the user queues a job to be processed, then decides to delete the job while it is running. The front-end UI knows that the job is complete because it will be marked as processed in the jobqueue.

===== Building the Scheduler =====

The scheduler consists of six source files:
  * clients.c: Handles client communications.
  * dbq.c: Contains ALL DB accesses. If the function touches the DB, then it is here. MSQ results are managed here too.
  * hosts.c: Functions for managing host-based spawning. This keeps track of the number of spawns per host and whether new spawns are permitted.
  * sockets.c: The read() and select() functions for communicating with an agent over stdin/stdout.
  * spawn.c: Functions for spawning processes and handling signals.
  * scheduler.c: The main file -- handles configuration and the infinite control loop.

To build the scheduler, use the Makefile.

  make clean         # remove all compiled files (clean slate for a new build)
  make               # build the scheduler
  sudo make install  # install it to /usr/local/fossology/agents/

The make command should build without any errors or warnings (a clean make).

NOTE: If you make any changes to the state machine labels (the ST_* definitions in spawn.h) then you **must** use 'make clean' before 'make'. (Someday we might introduce a 'make depends' file so code is recompiled whenever one of its dependencies changes.)
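
To see why the 'make clean' matters, consider how numeric state labels behave in C. The sketch below is illustrative and is not copied from spawn.h: only ''ST_FAIL = 0'' and ''ST_END'' being the last label come from this document, while the order of the labels in between, the enum name, and the helpers ''state_name()'' and ''count_states()'' are assumptions. Each label is a compile-time integer, so inserting or reordering a label changes the numbers that follow it. An object file built before the change still carries the old numbers, and without dependency tracking 'make' will not rebuild it.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative declaration of the child states. The ordering between
 * ST_FAIL and ST_END is an assumption made for this sketch. */
enum child_state {
    ST_FAIL = 0,   /* spawned and died too rapidly */
    ST_FREE,       /* not spawned, no I/O allocated */
    ST_FREEING,    /* told to die, shutting down */
    ST_PREP,       /* data structure being prepared */
    ST_SPAWNED,    /* spawned, waiting for first "OK" */
    ST_READY,      /* live and ready for data */
    ST_RUNNING,    /* actively processing data */
    ST_DONE,       /* MSQ record completed */
    ST_END         /* unused marker: one past the last real state */
};

/* Hypothetical helper: map a numeric state to its label. */
const char *state_name(int state)
{
    static const char *const names[ST_END] = {
        "ST_FAIL", "ST_FREE", "ST_FREEING", "ST_PREP",
        "ST_SPAWNED", "ST_READY", "ST_RUNNING", "ST_DONE"
    };

    if (state < 0 || state >= ST_END)
        return "unknown";
    return names[state];
}

/* Hypothetical helper: tally how many children are in each state.
 * ST_END bounds both the counts array and the reset loop. */
void count_states(const int *child_states, size_t nchildren,
                  int counts[ST_END])
{
    size_t c;
    int s;

    assert(child_states != NULL && counts != NULL);

    for (s = 0; s < ST_END; s++)
        counts[s] = 0;

    for (c = 0; c < nchildren; c++) {
        s = child_states[c];
        if (s >= 0 && s < ST_END)   /* ignore out-of-range values */
            counts[s]++;
    }
}
```

If a new label were added before ST_READY, for example, every file that still held the old value of ST_READY would disagree with the freshly compiled ones about which children are ready. Rebuilding everything avoids that mismatch.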