Systems Reliability Engineer (SRE)

Subspace


Subspace is a dedicated, secure network for delivering tomorrow’s internet today. From mission-critical real-time applications to in-network applications, Subspace helps companies create the best possible real-time experiences.

While the biggest multiplayer games in the world already use Subspace to create the most competitive and engaging online experiences, we are taking Subspace to even more industries, helping them create the best possible real-time experiences for voice, video, and in-network applications.

As a company, we’re hyper-focused on making the internet of tomorrow a reality today. We are backed by top Silicon Valley VCs (Lux Capital, Telstra, 01 Advisors) and run by former executives from tech companies including Riot Games, King, and Avaya. We are hiring people who help us raise the bar, add to our culture, and bring positivity, enthusiasm, and pride to what we are building. Our team members exhibit curiosity, are proactive by nature, genuinely enjoy complex challenges, and share a keen sense of urgency. At Subspace, every millisecond counts.

We’re looking for a Systems Reliability Engineer to help build out the next phase of our global infrastructure. As a member of our Pop Operations Team, you’ll collaborate with our Software & Network Engineering and Infrastructure teams to deploy and maintain scalable, fault-tolerant network topologies. We are committed to continuous improvement and innovation, so you’ll be encouraged to get creative as you solve complex problems related to traffic engineering, orchestration, instrumentation, automation, and telemetry.

What You’ll Be Doing

  • Collaborate with network, software, and infrastructure engineers on network optimization
  • Act as a first responder on our front lines by being accountable for the monitoring, detecting, and troubleshooting of all operations-related service disruptions
  • Escalate to the proper team and analyze incidents to ensure root causes are understood and addressed
  • Utilize Grafana to develop dashboards that provide key insights and analytics
  • Contribute to development and implementation of new network solutions to improve the resilience of the current environment
  • Work with other engineers to ensure service availability levels are met and that adverse impacts are kept to a minimum
  • Create and respond to tickets involving customer-impacting issues
  • Work with internal and external teams to provide clear and concise information regarding impact, while finding the clearest path to mitigation and resolution
  • Follow problem management processes: follow up on tickets that require data collection, and report and track issues that are recurring or blocking important work
  • Proactively communicate with management and other key team members about issues that could affect customers or internal teams (e.g., network issues, site performance, planned and unplanned maintenance, network equipment failures, service disruptions)
  • Create and maintain scripts to automate processes and workflows with tools like Chef, Puppet, and Ansible
  • Create and maintain training manuals, knowledge base articles, or any other incident and problem management procedures that may need documentation
  • Be part of an on-call team supporting the systems and infrastructure
  • Join a culture with innovation at its core to help maintain a high-performing and highly available service that is constantly evolving!

What You Bring To The Table

  • 4+ years of technical experience in networking and/or infrastructure, as well as experience in an incident management or network operations center environment (SRE, NOC, Security Operations, etc.)
  • Solid understanding of Linux and ability to create and maintain scripts. Bonus points for experience building tools using Python and/or Go
  • Experience setting up and monitoring metrics with applications like Prometheus, Grafana, Kibana, New Relic, Datadog, etc.
  • Experience setting up correlating alerts for metrics using applications like PagerDuty, Alerta, BigPanda, or ServiceNow
  • Solid understanding of common network protocols and concepts like TCP/IP, UDP, DNS, DHCP, IPv6, and Anycast routing

Subspace’s mission is to build a momentous software-defined network and make it universally available and impactful. Our goal is to serve the globe, which requires a team that is representative of the world that we serve. As a company we embrace an accelerated approach to diversity and inclusion, and support and encourage our employees to be comfortable bringing their full, authentic selves to work. We believe in fostering an environment where diverse perspectives thrive; this core value is a pillar of our business and is critical to our success.

  • Seniority level: Mid-Senior level
  • Employment type: Full-time
  • Job function: Engineering and Information Technology
  • Industries: Computer Software, Computer Networking, and Computer Games

