Senior Advanced Research Computing Systems Engineer - IT Services - 57459 - Grade 8

Location: Birmingham, ENGLAND

Senior Advanced Research Computing Systems Engineer - IT Services - 57459 - Grade 8 - (210001S4)

Description

Position Details

IT Services

Location: University of Birmingham, Edgbaston, Birmingham UK 

Full time starting salary is normally in the range £42,149 to £50,296, with potential progression once in post to £56,587 

Grade 8 

Full Time/Permanent

Closing date: 19th October 2021

Job Summary

The Senior Advanced Research Computing Systems Engineer is a member of the Team that has responsibility for the set of centrally operated, inter-related services in the BEAR (Birmingham Environment for Academic Research) portfolio.

The Senior Advanced Research Computing Systems Engineer is responsible for the development, operation and support of hardware and software to support BEAR services. This includes the sophisticated storage infrastructure that underpins all BEAR services, as well as HPC, private cloud and low latency networking.

These highly specialised, powerful and complex computing systems are fundamental to world-class research across a broad spectrum of academic endeavour.

The post holder is responsible for:

  • Planning both long-term and short-term developments of the BEAR infrastructure and services

  • Maintenance, support and development of the BEAR infrastructure

  • Ensuring the security and efficient operation of all elements of the infrastructure:

    • In storage this ranges from Cloud to Tape to ensure delivery of BEAR services and the protection of data

    • In hardware this ranges from HPC and HTC batch compute to OpenStack private cloud

  • Adapting the BEAR services portfolio to meet the evolving needs of researchers using the systems

  • Informing and supporting outreach activity to enable exploitation of BEAR by a broader, and potentially non-traditional, user-base

Background

Advanced Research Computing (ARC) services at the University use complex, large-scale architectures based on Linux, OpenStack and parallel filesystems. These need to be configured and maintained to the highest possible standards of reliability, performance and quality. Business continuity and high service levels are demanded equally for research and for other mission-critical corporate functions.

The nature of the users of BEAR services means that ARC is a dynamic part of the organisation, working with some of the latest technology to give our researchers an advantage over their peers. As the Senior Advanced Research Computing Systems Engineer, the post holder is jointly responsible for the development and delivery of these advanced computing systems.

The University of Birmingham has a record of sustained investment in HPC and storage infrastructures over many years. Over the past 8 years, both the services and the Team have expanded considerably. Together they now deliver a wide range of services, from HPC and enterprise-class storage for researchers to private-cloud offerings based on OpenStack, along with various tools for collaboration and large data transfer. The Team has built a reputation for quality and innovation. With further investment secured to support HPC, Cloud and Storage services, the Team continues to make progress, adding more power, capacity and capabilities to BEAR.

The post forms part of the Advanced Research Computing Architecture, Infrastructure and Systems team and is jointly responsible for the delivery of BEAR services across the board. However, given the skills across the team, the post is likely to focus mainly on the delivery of one or two key areas of the infrastructure. The post holder should be aware of how that focus fits into the wider work context and is expected to ensure successful delivery across all BEAR services.

For further information on BEAR services, see www.birmingham.ac.uk/bear

The University of Birmingham operates “Baskerville”, a Tier 2 HPC facility funded by EPSRC.

In addition, the University is part of the HPC MidlandsPlus consortium (funded by EPSRC), which provides and shares Tier 2 compute and storage facilities.

Main Duties

  • Manage a programme of developments for the renewal, upgrade, expansion and enhancement of the BEAR infrastructure and services in accordance with the Research Computing strategy and the needs of the BEAR services. (This programme and its activity must be balanced against the demands of managing day to day operations.)

  • As the Senior Advanced Research Computing Systems Engineer for BEAR, take day to day responsibility for the security, operation and support of the infrastructure. In the area of storage this would include backup and import/export mechanisms. In HPC this would include management of the scheduling system. The post manages multiple petabytes of storage, increasingly containing all the University’s current research data assets and the archive from completed projects, together with thousands of compute cores for advanced modelling, data processing, machine learning and AI.

    • Ensure storage is configured and tuned to perform effectively and in line with policies prescribed by the Research Computing Management Committee, designed to provide appropriate access to and allocation of resources as well as efficient operation. Ensure backup and archive systems are functioning correctly and regularly validate their ability to recall data.

    • Ensure HPC systems are configured and tuned to perform effectively and efficiently in line with policies prescribed by the Research Computing Management Committee to ensure optimal use of HPC resource.

    • Ensure underpinning technologies such as high speed and low latency networking are functioning optimally.

    • Monitor performance, investigate and respond to unexpected events.

    • Take the technical lead in the event of major incidents.

    • Take responsibility for resolving storage or HPC queries, some of them complex, from system users in support of the Research Software Group.

    • Respond to user tickets and resolve issues.

  • Undertake day-to-day sysadmin tasks, for example system updates, hardware fault resolution and configuration management.

  • Carry out infrastructure developments, in accordance with the approved programme, maintaining appropriate access to and allocation of resources as well as efficient operation. Plan and co-ordinate with the wider ARC and users to minimize service disruption. Ensure that systems are interconnected correctly and manage underpinning software technologies that support integration of the systems.

  • Collaborate with the Group Leader & Research Computing Architect to influence the strategy and design of the BEAR infrastructure and services, including inter-working with standard University services e.g. the Network and Identity services.

  • Build and manage relationships with suppliers, both internal and external, to support both operations and the development of the infrastructure and its associated services. This will include the partnerships necessary to support collaborative research on campus, nationally and internationally.

  • Provide technical advice to support and maintain existing central IT Services (in areas such as resilience, backup, scheduling, automated provisioning, archive and restore) based on various storage technologies to achieve high levels of availability, reliability and performance. This includes troubleshooting and solving complex problems.

  • Actively build relationships with the wider research community and regularly participate in outreach activity to support research data management initiatives, collaborative research programmes and uptake of BEAR services, and to embed good practice around HPC, data processing and storage.

  • Provide substantial input to the technical specification of major new research storage implementations and to technical discussions with prospective suppliers.

  • Undertake operational support and management of BEAR services, for example GitLab and data transfer tools.

  • Undertake operational support of the systems that support the wider BEAR infrastructure, for example DNS, xCAT, databases and Linux management.

  • Develop and deploy technical strategies and solutions to maximize the efficiency of operations and reduce the carbon footprint of facilities and the associated energy costs, e.g. by making appropriate use of passive storage media or efficiently cooled systems.

  • Identify, evaluate and benchmark new systems which may be of potential benefit in supporting BEAR services.

Knowledge, Skills, Qualifications & Experience Required

  • Formal education to degree level in a computer science related subject or equivalent experience.

  • Substantial experience as a Linux or Unix system administrator is essential.

  • Proficiency in scripting in Bash and Python is essential.

  • Specific expertise and substantial experience managing complex, large-scale storage systems.

  • Substantial experience of working with parallel filesystems, their configuration and operation, especially IBM Spectrum Scale and associated backup such as IBM Spectrum Protect.

  • Specific expertise in managing an HPC scheduler such as SLURM in an environment beyond a single queue implementation.

  • Specific experience of managing and working with fabric technologies such as Mellanox InfiniBand.

  • Experience of using a Linux configuration management tool such as SaltStack or Ansible is highly desirable.

  • A sound understanding of storage infrastructure technologies such as SAN, IB, direct attached, SAS is desirable.

  • Experience of using or operating object storage systems (e.g. Ceph, WOS, COS) would be an advantage.

  • A sound understanding of TCP/IP networking design and operation.

  • Knowledge of advanced networking technologies such as EVPN, VXLAN, datacentre networking, and host and border firewalls would be an advantage.

  • Experience of cloud and container technologies (e.g. OpenStack, Docker) would be an advantage.

  • Experience of iRODS or other metadata management systems is desirable.

  • Experience of data transfer tools such as Globus or Aspera is desirable.

  • Experience with xCAT cluster management is desirable.

  • Good customer relationship management skills.

  • Proven ability to work as part of a team to deliver services.

  • Ability to co-ordinate a small team.

  • Demonstrated capacity to solve complex problems.

  • A self-motivated learner with a track record of continually updating skills.

  • Excellent organizational skills with an ability to research and plan own work.

  • Good communication skills, both written and oral.

  • Good broad knowledge of C&IT, including latest technology industry trends.

  • Knowledge of Higher Education, Research and its environment.

  • Familiarity with ITIL or a similar service management framework is desirable.

For informal enquiries, please contact Simon Thompson.

Valuing excellence, sustaining investment 

We value diversity and inclusion at the University of Birmingham and welcome applications from all sections of the community and are open to discussions around all forms of flexible working. 

 

Primary Location: GB-GB-Birmingham
Work Locations: IT Services, Elms Road, The University of Birmingham, Birmingham B15 2TT
Job: Specialist/Professional
Organization: IT Services
Schedule: Regular, Full-time
Job Posting: 21.09.2021, 9:50:40 AM
Grade (for job description): Grade 8
Salary (Pay Basis): 42,149.00
Maximum Salary: 56,587.00
Advert Close Date: 19.10.2021, 6:59:00 PM
