Finding out the resource requirements of your jobs
The three most important parameters that control the scheduling of jobs on iceberg are:
- Real-Time requirements of the job ( -l h_rt= )
- Virtual-Memory requirement of the job ( -l mem= )
- Real-Memory requirements of the job ( -l rmem= )
Time Allocation ( -l h_rt= )
- Default time allocation for all jobs is 8 hours of wall_clock_time.
- You may specify the time allocation for your job by passing the -l h_rt= parameter to qsub or qsh, in hours:minutes:seconds form (for example, -l h_rt=120:00:00).
- Currently the maximum time allowed for interactive (qsh) jobs is 8 hours.
- Currently the maximum time allowed for batch jobs ( qsub ) is 168 hours.
- It is always safer to overestimate than to underestimate the job time allocation.
- Any job exceeding its specified time allocation will be terminated abruptly and without warning.
- Do NOT specify more than 168 hours for batch jobs or more than 8 hours for interactive jobs, as such jobs will never run.
Note: h_rt stands for hard-limit for real-time.
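As an illustration of the limits above, here is a minimal sketch of a check that could be run before submitting a job, to confirm that a requested h_rt value does not exceed the relevant maximum. The function name h_rt_within_limit is hypothetical (it is not an iceberg command), and the sketch assumes the value is written in hours:minutes:seconds form.

```sh
#!/bin/sh
# Hypothetical helper, not part of iceberg or SGE.
# Usage: h_rt_within_limit HH:MM:SS MAX_HOURS
# Succeeds (exit status 0) if the requested time is within MAX_HOURS.
h_rt_within_limit() {
    printf '%s\n' "$1" | awk -F: -v max_hours="$2" \
        '{ exit !(($1 * 3600 + $2 * 60 + $3) <= max_hours * 3600) }'
}

# Example: a 120-hour request checked against the 168-hour batch limit.
if h_rt_within_limit "120:00:00" 168; then
    echo "OK to submit as a batch job"
else
    echo "Too long: this job would never run"
fi
```

The same function can be called with 8 as the second argument to check a request intended for an interactive (qsh) session.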
Memory Allocations ( -l mem= and -l rmem= )
- The default virtual memory ( -l mem= ) allocation for each job is 6 GBytes.
- The default real memory allocation ( -l rmem= ) for each job is 2 GBytes.
- The value of the -l rmem= parameter can never exceed the value of the -l mem= parameter.
- Any job trying to consume more virtual memory than allocated will be terminated immediately and without warning.
- A job attempting to consume more than its allocated real memory ( rmem ) will cause a warning to be sent to the user, but the job will continue to run.
- Do NOT specify more than 768 Gigabytes of virtual memory ( mem ) .
- Do NOT specify more than 256 GB of real memory ( rmem ) .
- For parallel jobs the memory allocations are treated as per-core rather than for the entire job. For example, a 4-way parallel job with -l mem=12G will be allocated 12*4 = 48G of memory in total.
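The per-core arithmetic in the last point can be sketched as a small shell function. The name total_mem_gb is hypothetical and is used here only to illustrate the calculation; it is not an iceberg command.

```sh
#!/bin/sh
# Hypothetical helper, for illustration only.
# Usage: total_mem_gb CORES PER_CORE_GB
# Prints the total memory (in GB) that a parallel job would be allocated
# when the -l mem= value is applied to every core.
total_mem_gb() {
    echo $(( $1 * $2 ))
}

# The 4-way job with -l mem=12G from the example above:
total_mem_gb 4 12
```

It is worth running this kind of calculation before requesting a large per-core value, since the total grows with the number of cores.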
Every iceberg user is allocated a file-storage area of their own. Please read the section on filestore allocation on iceberg for further information. Any attempt to exceed this allocation during the execution of a job can have disastrous consequences. This is because any program or package writing into files will produce a fatal error and stop if the filestore limit happens to be exceeded during that operation.
Filestore limits are not associated with jobs and cannot be specified when submitting a job. Users must make sure that there is sufficient spare space in their filestore areas before submitting any job that is going to produce large amounts of output.
The quota command can be used to check your current filestore allocation and usage.
Explaining the difference between VIRTUAL MEMORY (mem ) and REAL MEMORY (rmem)
Running a program always involves loading the program's instructions and its data (i.e. all the variables and arrays that it uses) into the computer's "RAM" memory. Together, a program's entire instructions and its entire data define the VIRTUAL STORAGE requirements of that program. If we did not have clever operating systems, we would need as much physical memory (RAM) as the virtual-storage requirements of that program.
However, operating systems are clever enough to deal with situations where we have insufficient REAL MEMORY to load all the program instructions and data into the available Real Memory ( i.e. RAM ) .
This technique works because hardly any program needs to access all its instructions and its data simultaneously.
Therefore the operating system loads into RAM only those parts of the instructions and data that the program needs at a given instant. This is called PAGING, and it involves copying parts of the program's instructions and data between the hard disk and RAM as they are needed.
If the REAL MEMORY (i.e. RAM) allocated to a job is much smaller than the entire memory requirements of a job ( i.e. VIRTUAL MEMORY) then there will be excessive need for 'paging' that will slow the execution of the program considerably due to the relatively slow speeds of transferring information to/from the disk into RAM.
On the other hand, if the Real Memory (RAM) allocated to a job is larger than the Virtual Memory requirement of that job, RAM resources will be wasted, as they will sit idle for the duration of that job.
It is therefore crucial to strike a fine balance between the VIRTUAL MEMORY and the REAL MEMORY allocated to a job.
The virtual memory limit set by the -l mem= parameter is the maximum amount of virtual memory your job is allowed to use. If your job's virtual memory requirements exceed this limit during execution, the job will be killed immediately.
The real memory limit set by the -l rmem= parameter is the amount of RAM that will be allocated to your job.
The way we have configured SGE, a job that starts paging excessively is not killed; instead you receive warning messages advising you to increase the RAM allocated to the job next time by means of the -l rmem= parameter.
It is important to make sure that your -l mem= value is always greater than your -l rmem= value so as not to waste valuable RAM resources, as mentioned earlier.
Finding out your job's requirements
There are a number of ways of finding out the time and memory consumption of your job.
Here are a few methods we recommend:
1. By using the emailing parameters of the qsub command:
Submit your job with qsub, specifying very generous memory and time requirements to ensure that it runs to completion, and also use the -M and -m parameters (for example -m e, which sends mail when the job ends) to receive an email report. The mail message will list the maximum memory usage ( MAX VMem ) as well as the Wall_Clock_time used by the job.
Here is an example job script:
#$ -l h_rt=120:00:00 -l mem=8G
#$ -m eba -M email@example.com
myprog < mydata.txt > myresults.txt
When it is run, you will receive an email reporting the memory and time usage figures.
2. By using the qtop command:
While your job is running simply type qtop. This will produce a neat table for each running job giving the Virtual_Memory, RSS_Memory and Real-Time usage-so-far.
3. By using the qstat command:
You can also detect the memory used by your job while it is running by using the qstat command as follows:
- While a job is still running, find out its JOB_ID by running qstat.
- Then check its current memory usage with:
qstat -F -j job_id | grep mem
The reported figures will indicate:
- the currently used memory ( vmem )
- the maximum memory needed since startup ( maxvmem )
- the cumulative memory_usage*seconds ( mem )
It is the maxvmem figure that you will need to use to determine the -l mem= parameter for your next job.
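The maxvmem figure can also be picked out of the qstat output automatically. The following is a minimal sketch; the function name extract_maxvmem is hypothetical, and it assumes that the usage line reported by qstat contains a field of the form maxvmem=VALUE followed by a comma, a space or the end of the line. If the output on your system looks different, the pattern will need adjusting.

```sh
#!/bin/sh
# Hypothetical helper, not an iceberg command.
# Reads qstat output on standard input and prints the value of the
# maxvmem= field from each line that contains one.
# Assumption: the field looks like "maxvmem=1.500G".
extract_maxvmem() {
    sed -n 's/.*maxvmem=\([^, ]*\).*/\1/p'
}
```

It would be used in place of grep in the command above, for example as qstat -F -j job_id | extract_maxvmem, where job_id is the JOB_ID of your running job.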
4. By using timing commands in your script:
Another way of working out the wall_clock_time used by a job is to use the date or the timeused command within the script file. The date command is part of the Linux operating system, whereas the timeused command is specific to iceberg and reports the usage figures directly, so that you do not have to calculate them manually from two successive date commands.
Here are some examples:
USING THE DATE COMMAND:
#$ -l h_rt=10:00:00
date
my_program < my_input
date
When the above script is submitted (via qsub), the job output file will contain the date and time at each invocation of the date command. You can then calculate the difference between these date/times to determine the actual time taken.
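The subtraction can also be left to the script itself. The sketch below records the time in seconds before and after the work and prints the difference as hours:minutes:seconds. The function name format_hms is hypothetical, the sleep command is only a stand-in for my_program, and the +%s format of date (seconds since 1970) is assumed to be available, as it is with the date command on Linux.

```sh
#!/bin/sh
# Hypothetical helper: print a number of seconds as HH:MM:SS.
format_hms() {
    secs=$1
    printf '%02d:%02d:%02d\n' \
        $(( secs / 3600 )) $(( (secs % 3600) / 60 )) $(( secs % 60 ))
}

start=$(date +%s)      # seconds since 1970 at the start of the work
sleep 1                # stand-in for: my_program < my_input
end=$(date +%s)        # seconds since 1970 at the end of the work

# Elapsed wall-clock time of the work between the two timestamps.
format_hms $(( end - start ))
```

The printed figure, plus a safety margin, gives a starting point for the -l h_rt= value of the next run.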
USING THE TIMEUSED COMMAND
#$ -l h_rt=10:00:00
my_program < my_input
When the above script is submitted, the first invocation of the timeused command initialises the timer counter, because the TIMECOUNTER variable is set to 0. Subsequent invocations report the time in hours, minutes and seconds since the first invocation.