This guide describes how to set up a Google Cloud project for Cloud TPU VMs. It covers the commands for managing Cloud TPU VMs and solutions to common issues you may encounter when getting started.
Cloud TPU VMs run on the TPU host machine (the machine connected to the Cloud TPU device) and offer significantly better performance and usability when working with TPUs.
If you are new to Cloud TPUs, check out the TPU beginner's guide.
If you plan to run on a Cloud TPU Pod with TPU VM, refer to Training on TPU Pods.
Cloud TPU VM introduced a new Cloud TPU architecture. For more information about the Cloud TPU architectures, see System Architecture.
After installing the Google Cloud CLI, install the gcloud components
using the following command:
gcloud components install
For more information about gcloud components, see Managing Google Cloud CLI Components.
Sign in to your Google Account. If you
don't already have one, sign up for a new account.
In the Google Cloud console, select or create a Cloud project from the project selector
page. Make sure billing is enabled for your project. Set your project ID using
gcloud in the Cloud Shell. The project ID is the name of your project shown in
the Google Cloud console.
$ gcloud config set project project-id
Enable the Cloud TPU API using the following gcloud command in Cloud Shell. (You can also enable it from the Google Cloud console.)
$ gcloud services enable tpu.googleapis.com
Run the following commands to configure gcloud to use your Google Cloud account and project.
$ gcloud config set account your-email-account
$ gcloud config set project your-project
You can manage Cloud TPU VMs using gcloud or curl. For more information,
see Managing Cloud TPUs.
gcloud
$ gcloud compute tpus tpu-vm create tpu-name \
--zone=zone \
--accelerator-type=v3-8 \
--version=tpu-vm-tf-2.11.0
When creating a TPU VM, you can specify a startup script using the
--metadata startup-script flag. For example:
$ gcloud compute tpus tpu-vm create tpu-name \
--zone=zone \
--accelerator-type=v3-8 \
--version=tpu-vm-tf-2.11.0 \
--metadata startup-script=your-script
A startup script runs whenever the TPU VM is provisioned and whenever it is restarted because of a maintenance event.
curl
$ curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" -H "Content-Type: application/json" -d "{accelerator_type: 'v2-8', \
runtime_version:'tpu-vm-tf-2.11.0', \
network_config: {enable_external_ips: true}, \
shielded_instance_config: { enable_secure_boot: true }}" \
https://tpu.googleapis.com/v2/projects/project-id/locations/us-central1-b/nodes?node_id=node_name
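The URL and JSON body of the curl request can also be assembled programmatically. The following is a minimal sketch, assuming you only want to build the request and send it yourself with an authorized client. The helper name build_create_request and its parameters are hypothetical; the field names and default values are copied from the curl example.

```python
import json

# Hypothetical helper: the name and signature are illustrative only and are
# not part of any Google library.
def build_create_request(project_id, zone, node_name,
                         accelerator_type="v2-8",
                         runtime_version="tpu-vm-tf-2.11.0"):
    """Return (url, body) matching the curl example for creating a TPU VM."""
    url = (
        "https://tpu.googleapis.com/v2/"
        f"projects/{project_id}/locations/{zone}/nodes?node_id={node_name}"
    )
    body = {
        "accelerator_type": accelerator_type,
        "runtime_version": runtime_version,
        "network_config": {"enable_external_ips": True},
        "shielded_instance_config": {"enable_secure_boot": True},
    }
    # json.dumps emits strict JSON (quoted keys, lowercase true).
    return url, json.dumps(body)

url, body = build_create_request("project-id", "us-central1-b", "node_name")
print(url)
print(body)
```

The sketch does not send anything. The request still needs the Authorization header with a bearer token, as in the curl command.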
In the curl request, replace project-id, the zone (us-central1-b in the example), node_name, and the runtime_version value with your own.
The default network comes preconfigured to allow SSH access to all VMs. If you don't use the default network, or the default network settings were edited, you may need to explicitly enable SSH access by adding a firewall rule:
$ gcloud compute firewall-rules create --network=network allow-ssh --allow=tcp:22
To connect to a TPU VM using SSH, run the following command:
$ gcloud compute tpus tpu-vm ssh tpu-name --zone zone --project project-id
Replace tpu-name, zone, and project-id with your own values. Optional settings include the user, --worker, --ssh-key-file (default ~/.ssh/google_compute_engine), --internal-ip, --command, and --tunnel-through-iap.
To SSH into other TPU VMs associated with a TPU Pod, append --worker ${WORKER_NUMBER} to the command,
where WORKER_NUMBER is a 0-based index.
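As an illustration of the 0-based worker index, the following sketch builds one SSH command per worker. It assumes you already know how many TPU VMs the Pod has; the helper name ssh_commands and the num_workers parameter are hypothetical. The commands are only printed, not executed.

```python
# Hypothetical helper: builds the argument list for each worker's SSH command.
def ssh_commands(tpu_name, zone, num_workers):
    """Return one gcloud SSH command (as an argument list) per worker."""
    return [
        ["gcloud", "compute", "tpus", "tpu-vm", "ssh", tpu_name,
         "--zone", zone, "--worker", str(worker)]
        for worker in range(num_workers)  # worker indices start at 0
    ]

for cmd in ssh_commands("tpu-name", "zone", 4):
    print(" ".join(cmd))
```

Returning argument lists rather than a single string makes it straightforward to pass each command to subprocess.run without shell quoting concerns.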
You can list all of your Cloud TPUs in a specified zone.
$ gcloud compute tpus tpu-vm list --zone=zone
This command lists the Cloud TPU resources in the specified zone. If no resources are currently set up, the output will just show dashes for the VM and TPU.
You can retrieve information about a specific Cloud TPU using the following command.
$ gcloud compute tpus tpu-vm describe tpu-name \
--zone=zone
You can stop a single Cloud TPU using the following command. You cannot stop a TPU Pod.
$ gcloud compute tpus tpu-vm stop tpu-name \
--zone=zone
If your Cloud TPU has been stopped, you can restart it using the following command.
$ gcloud compute tpus tpu-vm start tpu-name --zone zone
You can delete your Cloud TPU when you are done using it.
$ gcloud compute tpus tpu-vm delete tpu-name \
--zone=zone
You can capture a performance profile using a command line script or using TensorBoard. For instructions on installing TensorBoard, see TensorBoard setup.
For TensorFlow models, you can capture profile data automatically by using the standard TensorFlow profiling callback method.
To manually capture profile data for TensorFlow models, use the following command on your TPU VM:
$ python3 -c "import tensorflow as tf; tf.profiler.experimental.client.trace('grpc://localhost:port', 'gs://model-dir', 1000)"
To capture profile data for PyTorch models using the command line, use the following command on your TPU VM:
$ python3 -c "import torch_xla.debug.profiler as xp; xp.trace('localhost:port', '/tmp/tb', 1000)"
For information about how to capture profile data for JAX models, see Profiling JAX programs.
Cloud TPU allocates default TPU quota for your project. If you need more, see Requesting additional quota.
You can generate profile information and use TensorBoard to visualize training metrics.
gcloud setup troubleshooting
The gcloud components update command displays the following error message:
ERROR: (gcloud.components.update)
You cannot perform this action because the gcloud CLI component manager
is disabled for this installation.
To use gcloud with TPU VMs, you need a gcloud installation that
is not managed through a package manager. Follow these steps to install gcloud
from the downloadable archive:
sudo apt-get remove google-cloud-sdk
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-311.0.0-linux-x86_64.tar.gz
tar -xzf google-cloud-sdk-311.0.0-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh
source ~/.bashrc
Running any command beginning with gcloud compute tpus tpu-vm displays
the following information:
ERROR: (gcloud.compute.tpus) Invalid choice: 'tpu-vm'.
This happens when the component repository has not been properly updated. To
verify this, run gcloud --version. The first line of the output should be
"Google Cloud CLI HEAD"; if the output is different, the update did not take
place. If this happens, try updating the gcloud components with the
following command:
gcloud components update
If you are still getting the same error, try reinstalling gcloud with the
following command:
gcloud components reinstall
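The version check described above can be scripted. The following is a minimal sketch, assuming the expected first line is exactly "Google Cloud CLI HEAD" as stated in this guide. The function name is_head_install is hypothetical, and the sample output is made up for illustration.

```python
# Hypothetical helper: inspects the text printed by `gcloud --version`.
def is_head_install(version_output):
    """Return True if the first line of the output is 'Google Cloud CLI HEAD'."""
    lines = version_output.strip().splitlines()
    return bool(lines) and lines[0].strip() == "Google Cloud CLI HEAD"

# To check a real installation, capture the command output first, for example:
#   import subprocess
#   out = subprocess.run(["gcloud", "--version"],
#                        capture_output=True, text=True).stdout
#   print(is_head_install(out))

# Made-up sample output, used here only to exercise the function.
sample = "Google Cloud CLI HEAD\nexample-component 0.0.0\n"
print(is_head_install(sample))
```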
The gcloud compute tpus tpu-vm ssh ${TPU_NAME} --zone ${ZONE} command displays
the following error message:
Waiting for SSH key to propagate.
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ERROR: (gcloud.compute.tpus.tpu-vm.ssh) Could not SSH into the instance. It is possible that your SSH key has not propagated to the instance yet. Try running this command again. If you still cannot connect, verify that the firewall and instance are set to accept ssh traffic.
Something may be wrong with the SSH key propagation. Try moving the
automatically-generated keys to a backup location to force gcloud to recreate
them:
mv ~/.ssh/google_compute_engine ~/.ssh/old-google_compute_engine
mv ~/.ssh/google_compute_engine.pub ~/.ssh/old-google_compute_engine.pub
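The same backup can be done from a script. The following sketch mirrors the two mv commands, assuming the keys live in ~/.ssh unless another directory is passed in. The function name back_up_gcloud_ssh_keys is hypothetical.

```python
from pathlib import Path

# Hypothetical helper: renames the gcloud-generated key pair, like the mv
# commands above, so that gcloud creates a fresh pair on the next SSH attempt.
def back_up_gcloud_ssh_keys(ssh_dir=None):
    """Rename google_compute_engine[.pub] to old-google_compute_engine[.pub]."""
    ssh_dir = Path(ssh_dir) if ssh_dir is not None else Path.home() / ".ssh"
    moved = []
    for name in ("google_compute_engine", "google_compute_engine.pub"):
        src = ssh_dir / name
        if src.exists():  # skip keys that were never generated
            dst = ssh_dir / f"old-{name}"
            src.rename(dst)  # on POSIX systems this replaces an existing backup
            moved.append(dst)
    return moved

# Example call (not run here, because it modifies files in your home directory):
#   back_up_gcloud_ssh_keys()
```

Unlike the plain mv commands, the sketch skips a key file that does not exist instead of failing.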
For a GetNode call on the TPU, set the APIVersion field to V2_ALPHA1.