My job is not starting

Jobs can stay in the Q (queued) state for an extended period of time because of license restrictions or other application-specific requirements. Follow this guide if you think all conditions are met but your job is still stuck in the Q state.

1 - Verify if the capacity is being provisioned

Check whether the capacity associated with the job is being provisioned by running the following command:

qstat -f <job_id> | grep select
  • If the compute_node value is set to tbd: the job is not eligible to run, for the reasons mentioned above.

  • If the compute_node value is set to idea-<CLUSTER>-compute-ondemand-<JOB_ID>: IDEA has triggered CloudFormation and the capacity is being provisioned.
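The check described in the two cases above can be scripted. The following is a minimal sketch, not something shipped with IDEA: the function names compute_node_value and describe_capacity are hypothetical, and it assumes the select line of the qstat -f output carries a compute_node=<value> token as described above.

```shell
# Hypothetical helper (not part of IDEA): reads `qstat -f` output on stdin
# and prints the value of the first compute_node=... token it finds.
compute_node_value() {
  # qstat -f may wrap long values onto tab-indented continuation lines,
  # so rejoin those lines before searching.
  awk '/^\t/ { sub(/^\t/, ""); buf = buf $0; next }
       { if (NR > 1) print buf; buf = $0 }
       END { print buf }' |
    grep -o 'compute_node=[^:[:space:]]*' | head -n 1 | cut -d= -f2
}

# Hypothetical helper: maps a compute_node value to the two cases above.
describe_capacity() {
  case "$1" in
    tbd)    echo "job is not eligible to run yet" ;;
    idea-*) echo "capacity is being provisioned" ;;
    *)      echo "unexpected compute_node value: $1" ;;
  esac
}
```

With both functions loaded in your shell, something like describe_capacity "$(qstat -f <job_id> | compute_node_value)" prints which of the two cases applies.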

You can log in to the AWS Console and navigate to the CloudFormation console to verify that the CloudFormation stack associated with your job is in the CREATE_COMPLETE state. If it is not, check the Events tab for errors.
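If you prefer the command line, the same information is available through the AWS CLI. This is a sketch under two assumptions that the guide does not confirm: that the AWS CLI is installed with credentials for the cluster's account and region, and that the stack is named after the compute_node value. The function names are hypothetical.

```shell
# Assumption: the stack name matches the compute_node value
# (idea-<CLUSTER>-compute-ondemand-<JOB_ID>); confirm the real name in the console.

# Prints the current status of a stack, for example CREATE_COMPLETE.
stack_status() {
  aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text
}

# Succeeds only when the given status string is CREATE_COMPLETE.
stack_is_ready() {
  [ "$1" = "CREATE_COMPLETE" ]
}

# Prints the resource and reason for every event whose status contains FAILED.
stack_failures() {
  aws cloudformation describe-stack-events --stack-name "$1" \
    --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[LogicalResourceId, ResourceStatusReason]" \
    --output text
}
```

A typical sequence would be to call stack_status with the stack name, pass the result to stack_is_ready, and run stack_failures only when that check fails.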

2 - Verify the bootstrap logs for the compute node(s) being provisioned

If the capacity is being provisioned, next check whether any errors occurred during the bootstrap sequence on the compute node(s) provisioned to run your job.

To do so, review the logs located under /apps/<CLUSTER>/scheduler/jobs/<JOB_ID>/

You will find the bootstrap and compute node logs for all EC2 instances being provisioned for your job:

Bootstrap:
/apps/<CLUSTER>/scheduler/jobs/<JOB_ID>/bootstrap/<COMPUTE_NODE_JOB_ID>

Compute Node Startup Logs:
/apps/<CLUSTER>/scheduler/jobs/<JOB_ID>/logs/<INSTANCE_HOSTNAME>
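To avoid opening each file by hand, the two directories above can be searched in one pass. The sketch below is illustrative only: the function names are hypothetical, and the keywords "error" and "fail" are an assumption about what a failed bootstrap prints, not a documented IDEA log format.

```shell
# Hypothetical helper: prints the bootstrap and log directories for a job,
# following the path layout described above.
job_log_dirs() {
  cluster="$1"
  job_id="$2"
  base="/apps/${cluster}/scheduler/jobs/${job_id}"
  printf '%s\n' "${base}/bootstrap" "${base}/logs"
}

# Hypothetical helper: lists every line under a directory that mentions
# "error" or "fail" (case-insensitive), prefixed with file name and line number.
# The keywords are an assumption, not an IDEA convention.
scan_for_errors() {
  grep -rniE 'error|fail' "$1"
}
```

For example, job_log_dirs <CLUSTER> <JOB_ID> prints both directories, and each one can then be passed to scan_for_errors. An empty result does not prove the bootstrap succeeded, so read the full logs if the job remains queued.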

3 - Check if the compute node(s) are being registered with the scheduler

Verify that the compute node(s) are being registered correctly with the scheduler.

Run pbsnodes -a and find the section specific to your job ID.

A host that is still being configured reports state = state-unknown,down. If that is the case, wait a little longer. Your host is ready to accept jobs when it reports state = free.
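When the cluster has many nodes, a one-line summary per node is easier to scan than the full listing. The sketch below assumes the usual pbsnodes -a layout, with the node name at the start of a line and its attributes indented beneath it. The function name node_states is hypothetical.

```shell
# Hypothetical helper: reads `pbsnodes -a` output on stdin and prints
# "<node> <state>" for every node in the listing.
node_states() {
  awk '/^[^ \t]/ { node = $1 }                    # unindented line: node name
       /^[ \t]+state = / { print node, $3 }'      # indented "state = ..." line
}
```

Running pbsnodes -a | node_states prints one line per node. A node showing state-unknown,down is still being configured, while free means it can accept jobs.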

4 - Restart the Scheduler

If needed, SSH to the scheduler machine and restart both OpenPBS and the idea-scheduler module.

To restart OpenPBS, run systemctl restart pbs

A healthy service reports Active: active (running) in the output of systemctl status pbs.

If the service does not start correctly, check the logs under:

  • /var/spool/pbs/sched_logs

  • /var/spool/pbs/server_logs
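The restart check and the log lookup can be combined. This is a sketch with hypothetical function names. It assumes the newest file in each log directory is the one to read, and that you have permission to read /var/spool/pbs.

```shell
# Succeeds when systemd reports the pbs unit as active.
pbs_is_active() {
  systemctl is-active --quiet pbs
}

# Hypothetical helper: prints the path of the most recently modified file
# in the given directory, or nothing if the directory is empty.
latest_log() {
  newest=$(ls -t "$1" | head -n 1)
  [ -n "$newest" ] && printf '%s/%s\n' "$1" "$newest"
}
```

For example, pbs_is_active || tail -n 50 "$(latest_log /var/spool/pbs/server_logs)" prints the end of the newest server log only when the service is not active. The same call works for /var/spool/pbs/sched_logs.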

Finally, try to restart idea-scheduler by running /opt/idea/python/latest/bin/supervisorctl restart scheduler

You can confirm that idea-scheduler has started correctly by checking the application logs located under /opt/idea/app/logs
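Supervisor can also report the process state directly, which is a quick first check before reading the logs. The sketch below relies on the standard supervisorctl status output, where the program name is followed by its state. The function name program_is_running is hypothetical.

```shell
# Hypothetical helper: reads `supervisorctl status` output on stdin and
# succeeds when the named program is in the RUNNING state.
program_is_running() {
  awk -v name="$1" '$1 == name && $2 == "RUNNING" { found = 1 }
                    END { exit !found }'
}
```

For example, /opt/idea/python/latest/bin/supervisorctl status scheduler | program_is_running scheduler returns a zero exit code when the scheduler program is running. Any other state (such as FATAL or STOPPED) is a cue to read the application logs.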
