Machine Learning

14 June 2024

Machine learning is a subfield of artificial intelligence that uses algorithms trained on data sets to create models that enable machines to perform tasks that would otherwise only be possible for humans. These tasks can include categorizing images, analysing data, or predicting price fluctuations.

The eResearch team manages access to several private VMs and a dedicated machine that is suitable for machine learning tasks.

For anyone new to machine learning, we recommend some starting resources to learn key concepts and the technologies involved: the and .

Deeplearning01 (Big GPU Machine)

We have a server with 4 GPUs dedicated to machine learning, deep learning and "artificial intelligence" workloads. Its main purpose is to make a large amount of GPU RAM available to workloads in these areas.

The server is a Dell T640 with:

2 Intel Xeon Gold 6226R CPUs at 2.9GHz with 16 cores each, for a total of 64 threads
384GB of RAM
one 1.92TB SSD used for the root filesystem and /home
one 3.84TB SSD used for scratch and mounted on /scratch
4 NVIDIA Quadro RTX6000 24GB GPUs, for a total of 96GB of GPU RAM
2 NVLink bridges, each connecting 2 GPUs

Access to the machine is on request to eResearch services via the - just skip the details about the VM and indicate that you want to use "deeplearning01" in the "other information" field. Once access is granted, log in with your usual UC username and password. The machine's name is

rcc-deeplearning01.canterbury.ac.nz

Note that the server is only accessible from the campus network. If you want to access it from home (or somewhere else) read the page.
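As a minimal sketch of the login step, assuming the machine accepts SSH connections (this page does not name the protocol) and using a hypothetical usercode "abc123", the command could be built like this:

```shell
#!/bin/bash
# Assumption: the server is reached over SSH.
# "abc123" is a hypothetical UC usercode used only for illustration.
DL01_HOST="rcc-deeplearning01.canterbury.ac.nz"

# Print the login command for a given usercode without connecting.
dl01_ssh_command() {
    local usercode="$1"
    printf 'ssh %s@%s\n' "$usercode" "$DL01_HOST"
}

dl01_ssh_command abc123
```

Running the printed command from a terminal on the campus network should prompt for your UC password.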

SLURM and Scheduling

Since it is a shared resource, we need a way to ensure fair access among users. With only a few users, a board or mailing list to request a turn on the machine is fine. With a growing number of users, we need a formal workload manager that provides a submission queue. Instead of running jobs directly on the GPUs, users submit them to the workload manager, which decides who runs when. A time limit (usually referred to as "wallclock time") is also enforced so that people do not wait too long in the queue. Use of a job scheduler will also

enable us to figure out how busy the machine is
use the machine more efficiently, as no one will have to work out whether the machine is busy or wait for a message that it is their turn

The selected workload manager is Slurm. Slurm is open source and an industry standard in the HPC world. People using NeSI will be familiar with it; in turn, users of Deeplearning01 will become familiar with the technology used at NeSI and many other facilities (Slurm is reportedly used by about 60% of the TOP500 supercomputers).

Job Submission with Slurm

To submit a job, prepare a small text file containing the script to run and the Slurm instructions describing the job and its requirements, then put it in the queue with the appropriate command. A good summary of Slurm commands and options can be found on the , other pages with interesting examples can be found at and .

As mentioned earlier, the job submission file is a text file containing a bash script (other shell languages are possible, but sticking to bash is recommended). The bash script can also include comments in a special format which are in fact Slurm options. They are usually of the form:

#SBATCH --some-slurm-options=some_value
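To illustrate why these lines are harmless to bash, here is a small sketch that writes a hypothetical two-directive job file to a temporary location, runs it directly with bash (which skips the #SBATCH lines as ordinary comments) and then counts the directives with grep. The file contents and option values are illustrative only.

```shell
#!/bin/bash
# Write a tiny, hypothetical job file to a temporary location.
jobfile="$(mktemp)"
cat > "$jobfile" <<'EOF'
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --time=0-00:10
echo "hello from the job"
EOF

# bash treats the #SBATCH lines as comments and only runs the echo.
bash_output="$(bash "$jobfile")"
echo "$bash_output"

# Slurm reads the same lines as options; count them with grep.
directive_count="$(grep -c '^#SBATCH' "$jobfile")"
echo "directives: $directive_count"

rm -f "$jobfile"
```

Because the options live in comments, the same file can be tested with plain bash before it is handed to the scheduler.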

A full script that could run on our machine is "example.sl" below, where "program" stands for your own executable.

#!/bin/bash

#SBATCH --account=def-someuser    # put your usercode for accounting
#SBATCH --gres=gpu:4              # Number of GPU(s), 1, 2 or 4
#SBATCH --cpus-per-task=6         # CPU cores/threads up to 64
#SBATCH --time=0-03:00            # wallclock time (DD-HH:MM) - up to 48 hours on our machine

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK      # in case program is using OpenMP

./program

To submit the job above to the queue one simply types:

sbatch example.sl
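When the job ID is needed afterwards (for example to check or cancel the job), one possible approach is to submit with sbatch's --parsable option, which prints only the job ID, optionally followed by a semicolon and a cluster name. The sketch below is not the site's prescribed workflow; the helper name parse_job_id is hypothetical, and the submission only runs where sbatch and example.sl are actually present.

```shell
#!/bin/bash
# With --parsable, sbatch prints the job ID, optionally followed by ";cluster".
# This hypothetical helper keeps only the ID part.
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

# Only attempt a real submission where Slurm and the job file are available.
if command -v sbatch >/dev/null 2>&1 && [ -f example.sl ]; then
    jobid="$(parse_job_id "$(sbatch --parsable example.sl)")"
    echo "Submitted job $jobid"
    squeue -j "$jobid"    # show this job's entry in the queue
else
    echo "sbatch or example.sl not found; nothing submitted"
fi
```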

Important Slurm Commands

sbatch to submit a job to the queue
squeue to see all the jobs queued
squeue -u usercode to see all the jobs queued by "usercode"
scancel $jobid to remove the job "$jobid" (provided it belongs to you)
scancel -u usercode to cancel all your jobs (provided you are "usercode"; otherwise you get an error message)
sinfo to show the current state of the manager. Normal states can be "up", "down" (no job running) or "draining" (will stop the queue after the current job; this is useful for maintenance)
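As a rough way to see how busy the queue is, squeue output can be summarised per job state. The sketch below assumes squeue's -h option (no header line) and the '%T' output format (job state); the helper name count_states is hypothetical, and the squeue call is skipped on machines without Slurm.

```shell
#!/bin/bash
# Count how many jobs are in each state, reading one state name per line.
count_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# squeue -h drops the header line and -o '%T' prints only the job state.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -o '%T' | count_states
fi
```

Each output line holds a state name followed by the number of jobs in that state, sorted alphabetically by state.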

Public Services

Public services normally have an associated cost that is not covered by the University. University services are free of charge but may have constraints on the type and quantity of hardware available as well as the duration of your projects.

Google Colaboratory (Colab): Colaboratory allows you to write and execute Python in your browser with free access to GPUs and easy sharing. With Colab you can harness the full power of popular Python libraries to analyse and visualize data.
Amazon Web Services: Amazon Web Services offers a broad set of machine learning services and supporting cloud infrastructure.
Azure Machine Learning: The Azure Machine Learning service empowers developers and data scientists with a wide range of productive experiences for building, training, and deploying machine learning models.
Google Cloud: Google Cloud offers AI and machine learning products for developers, data scientists, and data engineers.
IBM Watson: Watson is IBM’s portfolio of enterprise-ready pre-built applications, tools, and runtimes. With Watson you can infuse AI into your applications to make predictions or automate decisions and processes.
NeSI: NeSI provides a national platform of shared high performance computing tools and eResearch services. They also have many resources dedicated to machine learning.

This list does not contain information on Generative AI tools at UC, which can be found here.
