Debugging using ssh for GitHub Actions
- Linux (https://github.com/pytorch/pytorch/pull/62280)
- Windows (https://github.com/pytorch/pytorch/pull/63440)
- The job must still be running when you ssh into the machine. If the job is finished running, you cannot ssh in.
- The job must be on a Meta AWS account runner. To check this, look for the Set up runner step in the job. If the Runner type is prefixed with or contains "lf.", this is not a Meta AWS account runner. Add the no-runner-experiments label to the PR and rerun the entire workflow to ensure you run on a Meta AWS account runner. With the migration to Linux Foundation firmly in place, Linux Foundation runners no longer allow ssh access, for security reasons.
- Only works on your own PRs. If you want to ssh into a job running on main or someone else's job, see the section below about AWS SSM.
- You will be removed from the machine after 2 hours.
- Only works for users who are connected to the Meta VPN
- No current planned support for outside collaborators
- Public keys for ssh are pulled from GitHub using https://github.com/${github.actor}.keys
- Example: https://github.com/seemethere.keys
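The runner and key prerequisites above can be checked before attempting a connection. The following is a minimal sketch, not part of the PyTorch tooling: the function names and the sample runner type strings are hypothetical, and it assumes the Runner type value has been copied by hand from the Set up runner step.

```shell
# Sketch only: helper names and sample values are illustrative assumptions.

# Succeeds (exit 0) when the runner type does NOT contain "lf.",
# i.e. it is not a Linux Foundation runner.
is_meta_aws_runner() {
  case "$1" in
    *lf.*) return 1 ;;
    *)     return 0 ;;
  esac
}

# Prints the URL that the public ssh keys for a GitHub user are pulled from.
ssh_keys_url() {
  printf 'https://github.com/%s.keys\n' "$1"
}

# Example usage (runner type values are hypothetical):
#   is_meta_aws_runner "lf.linux.2xlarge" || echo "add the no-runner-experiments label and rerun"
#   curl -fsSL "$(ssh_keys_url seemethere)"   # lists the keys that will be accepted
```

If the key list printed for your username is empty, add a public key to your GitHub account before re-running the job.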
- Push a new commit or re-run completed workflows through the GitHub UI. Unfortunately, jobs can only be rerun after all jobs in the workflow are finished.
- Navigate to the logs for a build or test job that runs the Setup SSH step (currently all of our Linux workflows have this enabled).
- Use the SSH command provided to log into the node (do this immediately, as the job will start cleaning up if it reaches the end without any active SSH session):
VSCode's remote SSH extension works with both Linux and Windows CI machines. Add them as you would any other SSH remote, with cmd + shift + P > SSH: New Remote, and use the address runneruser@ec2....compute-1.amazonaws.com as provided to you in the GitHub CI logs. On Windows you can directly open VSCode to the pytorch folder with the command:
[your mac] $ export HOST=runneruser@ec2-3-238-198-8.compute-1.amazonaws.com
[your mac] $ code --folder-uri vscode-remote://ssh-remote+$HOST/c:\\actions-runner\\_work\\pytorch\\pytorch
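Adding the CI machine as an SSH remote amounts to giving VSCode a host entry. As a sketch, assuming you keep hosts in ~/.ssh/config, the function below prints a stanza for the machine. The alias pytorch-ci and the function name are hypothetical, and the hostname must be replaced with the one from your CI logs.

```shell
# Sketch only: the function name and the "pytorch-ci" alias are illustrative.
# Prints an ssh_config stanza for a CI machine; $1 is the EC2 hostname from the logs.
ci_ssh_config_entry() {
  printf 'Host pytorch-ci\n'
  printf '    HostName %s\n' "$1"
  printf '    User runneruser\n'
}

# Example usage: append the entry, then connect to "pytorch-ci" from VSCode.
#   ci_ssh_config_entry ec2-3-238-198-8.compute-1.amazonaws.com >> ~/.ssh/config
```

Because CI machines are short-lived, remove the entry once the job has finished.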
Once you are connected through ssh, you may need to enter a docker container. Run docker ps to check if there are any docker containers running. Note that your CI job might be in the process of initiating a docker container, which means it will not show up yet. It is best to wait until the CI job reaches a step where it is building pytorch or running pytorch tests. If the job does have a docker container, run docker exec -it CONTAINER_ID /bin/bash to connect to it (or simply run docker exec -it $(docker ps | awk 'NR==2 {print $1}') /bin/bash directly to parse the CONTAINER_ID).
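Because the container may not exist yet when you first connect, it can help to guard the one-liner above. A minimal sketch, with hypothetical function names, that applies the same awk parse and reports when no container is running:

```shell
# Sketch only: function names are illustrative, not part of the CI tooling.

# Reads `docker ps` output on stdin and prints the first container ID
# (line 1 is the header, so the first container is on line 2).
first_container_id() {
  awk 'NR==2 {print $1}'
}

# Opens a shell in the first running container, or explains why it cannot.
enter_ci_container() {
  container_id="$(docker ps | first_container_id)"
  if [ -z "$container_id" ]; then
    echo "No running container yet; wait for the build or test step." >&2
    return 1
  fi
  docker exec -it "$container_id" /bin/bash
}
```

Running enter_ci_container again after a short wait is usually enough once the job reaches its build or test step.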
Now you can find the pytorch working directory, which could be ~/workspace or ~/project, and run commands locally to debug the failure.
- To set up the same CI environment, run:
C:\actions-runner\_work\pytorch\pytorch\build\win_tmp\ci_scripts\pytorch_env_restore.bat
- The Windows workspace is currently located at
C:\actions-runner\_work\pytorch\pytorch
- To use other shells for Windows just append the shell you'd like to run to your ssh command like:
ssh runneruser@ec2-3-238-136-38.compute-1.amazonaws.com -- bash.exe
- To run a particular test use the Miniconda Python:
C:\jenkins\miniconda3\python.exe test_profiler.py
For certain Windows failures, it may be useful to have a full Remote Desktop connection. To use remote desktop, get an RDP client (for example, from the Mac App Store) and:
- The Windows EC2 machines only allow SSH connections on port 22, so make a tunnel for the RDP port (default 3389) to your local machine and set a password:
# change the EC2 hostname to the one given to you in the GitHub Actions logs
[your mac] $ ssh -L 3389:localhost:3389 runneruser@ec2-3-238-198-8.compute-1.amazonaws.com
# set a password for the 'runneruser' account (it must have upper
# and lowercase letters, a number, and be at least 8 characters)
[windows] $ net user runneruser <some password>
- Open the RDP client, add a new PC with the hostname localhost
- Connect to it by double clicking, use the username runneruser with the password you just created. You should see the remote desktop window open.
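If port 3389 is already in use on your machine, the tunnel from the first step can listen on a different local port. A sketch, using a hypothetical helper name, that prints the ssh command for a chosen local port (the RDP client would then connect to localhost followed by that port):

```shell
# Sketch only: the helper name is illustrative.
# $1 = EC2 hostname from the GitHub Actions logs, $2 = local port (default 3389).
rdp_tunnel_command() {
  host="$1"
  local_port="${2:-3389}"
  # Forward the chosen local port to the RDP port (3389) on the Windows machine.
  printf 'ssh -L %s:localhost:3389 runneruser@%s\n' "$local_port" "$host"
}

# Example usage:
#   rdp_tunnel_command ec2-3-238-198-8.compute-1.amazonaws.com 13389
```

The remote side stays on 3389 in both cases; only the local end of the tunnel changes.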
- Old reference: Detailed instructions for debugging Windows with SSH on CircleCI.
AWS SSM can be used to log into any currently running EC2 instances whether or not your SSH key has been added to the particular instance.
This can be used to debug currently running jobs on PRs or trunk.
- You must be connected to the Meta VPN
- The runner must be from the AWS Meta account (see above about how to confirm this)
- You must have access to the AWS account (bunnylol cloud fbossci)
- Download the session manager plugin for AWS CLI (aws ssm start-session will fail silently without it)
- Jobs will not wait for your session to finish before concluding and starting a new job, and nodes will still be reaped on their regular schedule. There is no extra hold on the machine, and the 2 hour timeout does not apply here, so sessions will most likely end immediately at the next reap cycle.
aws ssm start-session --target "<instance_id>" --region "<region>"
Example:
aws ssm start-session --target i-0099d5a07d34e8904 --region us-east-1
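Since aws ssm start-session fails silently without the session manager plugin, a small wrapper can make that failure visible. The sketch below uses hypothetical function names and assumes the plugin is installed under the binary name session-manager-plugin. The instance ID check is deliberately loose (an i- prefix followed by hex characters).

```shell
# Sketch only: function names are illustrative.

# Loose sanity check: "i-" followed by one or more lowercase hex characters.
is_instance_id() {
  case "$1" in
    i-*[!0-9a-f]*) return 1 ;;  # a non-hex character after the prefix
    i-?*)          return 0 ;;
    *)             return 1 ;;
  esac
}

# Starts a session, failing loudly if the ID looks wrong or the plugin is missing.
start_ssm_session() {
  if ! is_instance_id "$1"; then
    echo "'$1' does not look like an EC2 instance ID" >&2
    return 1
  fi
  # Assumption: the plugin binary is named session-manager-plugin.
  if ! command -v session-manager-plugin >/dev/null 2>&1; then
    echo "session manager plugin not found on PATH" >&2
    return 1
  fi
  aws ssm start-session --target "$1" --region "$2"
}

# Example usage:
#   start_ssm_session i-0099d5a07d34e8904 us-east-1
```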
Page maintainers: @pytorch/pytorch-dev-infra
Last verified: 2025-06-04
Everything except the windows section should be correct and up to date