Using Office Grid
How to use OfficeGRID (OG) on the image cluster
OfficeGRID is a grid computing software system developed by Mesh Technologies (http://meshtechnologies.com/).
OfficeGRID (OG) consists of an OGserver that handles all jobs and outputs. This is currently imagediskserver3. Jobs are executed by OGexecuters, which ask the OGserver for new jobs when they become vacant. Currently imageserver1-3 are configured as OGexecuters. OG automatically tries to balance the load among the OGexecuters.
There are 4 ways of interacting with the OG system:
- Graphical User Interface (GUI)
- Command line interface (CLI)
- Python API
- Web interface for administration
For more information, follow the tutorials below and consult the detailed OG documentation.
- On our cluster all communication with OG should be done on imagediskserver2 (192.168.10.103).
- You cannot assume that OG executes your program in a specific directory, e.g. the one where you started the job. If you need to read or write files, always use absolute paths, never relative paths. The same holds for programs and commands you call, e.g. in a script: give the complete path to each program or command you want to execute.
- For the system to be efficient it is important that all users use OG to execute jobs on the cluster.
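The point about absolute paths can be illustrated with a small job script. This is only a sketch: the working directory created with mktemp and the file names are placeholders, not real cluster locations, and the "cd /" merely simulates OG starting the job in an unexpected directory.

```shell
#!/bin/sh
# Sketch of a job script that never relies on the current directory.
# WORKDIR is a placeholder; on the cluster, use the absolute path of
# your own data directory instead.
WORKDIR=$(mktemp -d)              # mktemp -d prints an absolute path
INPUT="$WORKDIR/input.txt"
OUTPUT="$WORKDIR/line_count.txt"

# Look up the absolute path of the program. In a real job script you
# would run "command -v wc" once by hand and hard-code the result.
WC=$(command -v wc)

printf 'a\nb\nc\n' > "$INPUT"     # stand-in for real input data
cd /                              # simulate an unexpected start directory
"$WC" -l < "$INPUT" > "$OUTPUT"   # works regardless of the current directory
```

Because every file and program is referenced by its full path, the script behaves the same no matter where it is started.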
The first thing to do
Log in to imagediskserver3 (192.168.10.103):
ssh -Y <mylogin>@192.168.10.103
Configuration needed to execute OG commands
Before you can use OG you have to execute these two commands:
export OG_CONFIG=/usr/local/og/officegrid.cfg
export PATH=$PATH:/usr/local/og/
Now you can execute Office Grid commands, i.e. the commands starting with og_ in the folder /usr/local/og/.
To avoid doing the above every time you want to use OG, it is most convenient to add the following lines to the end of your ~/.bashrc file:
# Office Grid stuff
export OG_CONFIG=/usr/local/og/officegrid.cfg
export PATH=$PATH:/usr/local/og/
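If ~/.bashrc is sourced more than once (for example in nested shells), the plain PATH line appends the OG folder again each time. A possible variant, shown here only as a sketch, adds the folder only when it is not already present. The variable name OG_DIR is introduced for illustration, and the folder is written without the trailing slash, which refers to the same location.

```shell
# Office Grid stuff (variant that avoids duplicate PATH entries)
OG_DIR=/usr/local/og                      # illustrative helper variable
export OG_CONFIG="$OG_DIR/officegrid.cfg"

# Append OG_DIR to PATH only if it is not already one of its entries.
case ":$PATH:" in
    *":$OG_DIR:"*) ;;                     # already present, nothing to do
    *) export PATH="$PATH:$OG_DIR" ;;
esac
```

After opening a new shell, "command -v og_jobrm" should print a path inside /usr/local/og/ if the setup worked.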
Extra note on walltime option
- -w, --walltime <time>
Wall clock time required by the job. Time is in seconds or "w:d:h:m:s"-format where w is weeks, d is days, h is hours, m is minutes and s is seconds. Also accepts suffix format strings like "m:s", "h:m:s" or "d:h:m:s". Thus "3:20" means 3 minutes and 20 seconds. Also allows MAX or UNLIMITED keyword.
The important point here is that the previous value of '-1' has been fixed so that it really means UNLIMITED. Jobs submitted with it will therefore never be executed on walltime-limited executors. It is thus important that you change any -w -1 to -w MAX in your job submissions.
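To check how a walltime string in the colon format is read, the description above can be turned into a small helper. This is a sketch based only on the format as described here, not on OG's own parser; the function name walltime_to_seconds is made up for illustration, and the keywords MAX and UNLIMITED are deliberately not converted.

```shell
# Convert "s", "m:s", "h:m:s", "d:h:m:s" or "w:d:h:m:s" to seconds.
# Illustrative helper, not part of OG. The body runs in a subshell so
# that the IFS and globbing changes do not leak into the caller.
walltime_to_seconds() (
    set -f                       # no filename expansion while splitting
    IFS=:
    set -- $1                    # split the argument at the colons
    n=$#
    [ "$n" -ge 1 ] && [ "$n" -le 5 ] || return 1
    skip=$((5 - n))              # leading units (w, d, ...) that are unused
    total=0
    for m in 604800 86400 3600 60 1; do   # seconds per w, d, h, m, s
        if [ "$skip" -gt 0 ]; then
            skip=$((skip - 1))
            continue
        fi
        f=${1#"${1%%[!0]*}"}     # strip leading zeros ("08" is not octal)
        f=${f:-0}
        case $f in *[!0-9]*) return 1 ;; esac   # reject MAX, UNLIMITED, etc.
        total=$((total + f * m))
        shift
    done
    echo "$total"
)
```

For example, "walltime_to_seconds 3:20" prints 200, matching the 3 minutes and 20 seconds mentioned above.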
Q1: I can't remove a job that I just started
If the job was not started on one of the OG executors (the imageservers) and you try to remove it with og_jobrm, the job will not be removed the first time. Simply run og_jobrm on it again, which will remove the job immediately. This bug has been reported and will be fixed in the near future.
Notes on configuration
Currently OfficeGrid (OG) is installed on imageserver1-3 and imagediskserver2-3 in /opt/officegrid-2.1.14 with a link from /usr/local/og/.
The server uses /image/data2/Officegrid-tmp as temporary data space (file and job cache). I am not sure this is a good idea in the long run.