Cloud TPU VM user's guide | Google Cloud

This guide describes how to set up a Google Cloud project for Cloud TPU VMs, the commands for managing Cloud TPU VMs, and solutions to common issues you may encounter when getting started.

Cloud TPU VMs run on the TPU host machine (the machine connected to the Cloud TPU device) and offer significantly better performance and usability than the earlier TPU Node architecture when working with TPUs.

If you are new to Cloud TPUs, check out the TPU beginner's guide.

If you plan to run on a Cloud TPU Pod with TPU VM, refer to Training on TPU Pods.

Cloud TPU VM introduced a new Cloud TPU architecture. For more information about the Cloud TPU architectures, see System Architecture.

Set up a Google Cloud Project

After installing the Google Cloud CLI, install the gcloud components using the following command:

gcloud components install

For more information about gcloud components, see Managing Google Cloud CLI Components.

Prepare a Google Cloud Project

Sign in to your Google Account. If you don't already have one, sign up for a new account.
In the Google Cloud console, select or create a Cloud project from the project selector page.
Make sure billing is enabled for your project.
Set your project ID using gcloud in Cloud Shell. The project ID is the ID shown for your project in the Google Cloud console.

$ gcloud config set project project-id

Enable the Cloud TPU API

Enable the Cloud TPU API using the following gcloud command in Cloud Shell. (You may also enable it from the Google Cloud console.)

$ gcloud services enable tpu.googleapis.com

Configure the `gcloud` command

Run the following commands to configure gcloud to use your Google account and your Google Cloud project.

$ gcloud config set account your-email-account
$ gcloud config set project your-project

Managing TPUs

You can manage Cloud TPU VMs using gcloud or curl. For more information, see Managing Cloud TPUs.

Creating a Cloud TPU VM with `gcloud`

$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v3-8 \
  --version=tpu-vm-tf-2.11.0

Required fields

zone: The zone where you plan to create your Cloud TPU.
accelerator-type: The type of the Cloud TPU to create.
version: The Cloud TPU runtime version.

Optional flag

shielded-secure-boot: Specifies that the TPU instances are created with secure boot enabled. This implicitly makes them Shielded VM instances. See What is shielded VM? for more details.
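
The create command and its required fields can be wrapped in a small shell function so that a missing value is caught before gcloud is called. This is only a sketch: the function name create_tpu_vm is a hypothetical helper, not part of the gcloud CLI, and the values in the usage comment are examples.

```shell
# Sketch: wrap the create command so every required field must be supplied.
# create_tpu_vm is a hypothetical helper name, not a gcloud feature.
create_tpu_vm() {
  # Expect exactly four arguments: TPU name, zone, accelerator type, runtime version.
  if [ "$#" -ne 4 ]; then
    echo "usage: create_tpu_vm TPU_NAME ZONE ACCELERATOR_TYPE VERSION" >&2
    return 1
  fi
  gcloud compute tpus tpu-vm create "$1" \
    --zone="$2" \
    --accelerator-type="$3" \
    --version="$4"
}

# Example call (values are placeholders):
#   create_tpu_vm tpu-name zone v3-8 tpu-vm-tf-2.11.0
```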

When creating a TPU VM, you can specify a startup script using the --metadata startup-script flag. For example:

$ gcloud compute tpus tpu-vm create tpu-name \
  --zone=zone \
  --accelerator-type=v3-8 \
  --version=tpu-vm-tf-2.11.0 \
  --metadata startup-script=your-script

The startup script runs when the TPU VM is provisioned and again if the TPU VM is restarted due to a maintenance event.
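
Because the startup script can run more than once over the life of a TPU VM, it helps to write it so that re-running is harmless. The following is a minimal sketch of that idea using a marker file. The marker path and the function names one_time_setup and run_startup are hypothetical, and the body of one_time_setup is a placeholder for your own work.

```shell
#!/bin/sh
# Sketch of a re-runnable startup script.
# MARKER_FILE, one_time_setup, and run_startup are hypothetical names.
MARKER_FILE="${MARKER_FILE:-/var/tmp/tpu-startup-done}"

one_time_setup() {
  # Put work that should happen only once here (for example, installing packages).
  echo "running one-time setup"
}

run_startup() {
  if [ -f "$MARKER_FILE" ]; then
    # The marker exists, so an earlier run already completed the setup.
    echo "setup already done, skipping"
  else
    # Create the marker only if the setup succeeded.
    one_time_setup && touch "$MARKER_FILE"
  fi
}
```

When used as an actual startup script, end the file with a call to run_startup.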

Creating a Cloud TPU VM with `curl`

$ curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d "{accelerator_type: 'v2-8', \
  runtime_version:'tpu-vm-tf-2.11.0', \
  network_config: {enable_external_ips: true}, \
  shielded_instance_config: { enable_secure_boot: true }}" \
  "https://tpu.googleapis.com/v2/projects/project-id/locations/us-central1-b/nodes?node_id=node_name"

Required fields

runtime_version: The runtime version you wish to use.
project: The name of your enrolled Google Cloud project.
zone: The zone where you are creating your Cloud TPU.
node_name: The name of the TPU VM you are creating.
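
To make it clearer where each field ends up in the request, the URL and the request body can be assembled from their parts. This is a sketch under assumptions: tpu_nodes_url and tpu_request_body are hypothetical helper names, and the body uses the same field names as the request above, written as strictly quoted JSON and without the optional shielded_instance_config.

```shell
# Sketch: build the create request from its fields.
# tpu_nodes_url and tpu_request_body are hypothetical helper names.
tpu_nodes_url() {
  # $1 = project ID, $2 = zone, $3 = node name
  printf 'https://tpu.googleapis.com/v2/projects/%s/locations/%s/nodes?node_id=%s\n' "$1" "$2" "$3"
}

tpu_request_body() {
  # $1 = accelerator type, $2 = runtime version
  printf '{"accelerator_type": "%s", "runtime_version": "%s", "network_config": {"enable_external_ips": true}}\n' "$1" "$2"
}

# Example call (values are placeholders):
#   curl -X POST \
#     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#     -H "Content-Type: application/json" \
#     -d "$(tpu_request_body v2-8 tpu-vm-tf-2.11.0)" \
#     "$(tpu_nodes_url project-id us-central1-b node_name)"
```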

Connecting to a Cloud TPU VM

(optional) Set up a firewall for SSH

The default network comes preconfigured to allow SSH access to all VMs. If you don't use the default network, or the default network settings were edited, you may need to explicitly enable SSH access by adding a firewall rule:

$ gcloud compute firewall-rules create allow-ssh --network=network --allow=tcp:22

SSH into the TPU VM

$ gcloud compute tpus tpu-vm ssh tpu-name --zone zone --project project-id

Required fields

tpu-name: The name of the TPU VM to which you are connecting.
zone: The zone where you created your Cloud TPU.
project-id: Your Google Cloud project ID.

Optional fields

user: The username used to authenticate when connecting to the Cloud TPU VM over SSH. Specify it by prefixing the TPU name with $USER@, for example: my-email-account@tpu-node-1.
worker: For Cloud TPU Pods, you can choose which worker VM to SSH into. The default is worker 0, the first VM associated with the TPU Pod.
ssh-key-file: The path to the SSH key file. By default, this is ~/.ssh/google_compute_engine.
internal-ip: Connect to the TPU VMs using an internal IP address. For this connection to work, you must configure your networks and firewall to allow SSH connections to the internal IP address of the TPU VM to which you want to connect.
command: A command to run on the TPU VM. The command is run on the target TPU VM and then exits.
tunnel-through-iap: Tunnel the SSH connection through Cloud Identity-Aware Proxy for TCP forwarding. To learn more, see Overview of TCP forwarding.

To SSH into other TPU VMs associated with the TPU Pod, append --worker ${WORKER_NUMBER} to the command, where WORKER_NUMBER is a 0-based index.
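
The worker and command options described above can be combined to run a single command on one worker of a TPU Pod without opening an interactive session. A minimal sketch, assuming a hypothetical helper named run_on_worker:

```shell
# Sketch: run one command on a chosen worker of a TPU Pod.
# run_on_worker is a hypothetical helper name, not a gcloud feature.
run_on_worker() {
  # $1 = TPU name, $2 = zone, $3 = 0-based worker index, $4 = command to run
  gcloud compute tpus tpu-vm ssh "$1" \
    --zone="$2" \
    --worker="$3" \
    --command="$4"
}

# Example call (values are placeholders):
#   run_on_worker tpu-name zone 1 "hostname"
```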

Listing your Cloud TPU resources

You can list all of your Cloud TPUs in a specified zone.

$ gcloud compute tpus tpu-vm list --zone=zone

Required fields

zone: The zone whose Cloud TPUs you want to list.

This command lists the Cloud TPU resources in the specified zone. If no resources are currently set up, the output will just show dashes for the VM and TPU.
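
Since the list command takes one zone at a time, checking several zones means repeating it. The loop below is a sketch of that; list_tpus_in_zones is a hypothetical helper name and the zones in the usage comment are examples.

```shell
# Sketch: run the list command once for each zone given as an argument.
# list_tpus_in_zones is a hypothetical helper name.
list_tpus_in_zones() {
  for zone in "$@"; do
    # Print a header so the output of each zone is easy to tell apart.
    echo "== $zone =="
    gcloud compute tpus tpu-vm list --zone="$zone"
  done
}

# Example call (zones are placeholders):
#   list_tpus_in_zones us-central1-a us-central1-b
```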

Retrieving information about your Cloud TPU

You can retrieve information about a specific Cloud TPU using the following command.

$ gcloud compute tpus tpu-vm describe tpu-name \
  --zone=zone

Required fields

tpu-name: The name of the Cloud TPU to describe.
zone: The zone where your Cloud TPU was created.

Stopping your Cloud TPU resources

You can stop a single Cloud TPU using the following command. You cannot stop a TPU Pod.

$ gcloud compute tpus tpu-vm stop tpu-name \
  --zone=zone

Required fields

tpu-name: The name of the Cloud TPU to stop.
zone: The zone where you created your Cloud TPU.

Starting your Cloud TPU resources

If your Cloud TPU has been stopped, you can restart it using the following command.

$ gcloud compute tpus tpu-vm start tpu-name --zone=zone

Command flag descriptions

tpu-name: The name of the Cloud TPU to start.
zone: The zone where the Cloud TPU was created.

Deleting your VM and Cloud TPU resources

You can delete your Cloud TPU when you are done using it.

$ gcloud compute tpus tpu-vm delete tpu-name \
  --zone=zone

Required fields

tpu-name: The name of the Cloud TPU to delete.
zone: The zone where your Cloud TPU was created.

Capturing performance metrics

You can capture a performance profile using a command line script or using TensorBoard. For instructions on installing TensorBoard, see TensorBoard setup.

For TensorFlow models, you can capture profile data automatically by using the standard TensorFlow profiling callback method.

To manually capture profile data for TensorFlow models, use the following command on your TPU VM:

$ python3 -c "import tensorflow as tf; tf.profiler.experimental.client.trace('grpc://localhost:port', 'gs://model-dir', 1000)"

To capture profile data for PyTorch models using the command line, use the following command on your TPU VM:

$ python3 -c "import torch_xla.debug.profiler as xp; xp.trace('localhost:port', '/tmp/tb', 1000)"

For information about how to capture profile data for JAX models, see Profiling JAX programs.

Viewing profile data

Open Cloud Shell.
Make sure you have installed TensorBoard.
Run TensorBoard.
In Cloud Shell, click the Web Preview button, select Change port, and type 6006.
Click profile. An overview page is displayed.
Navigate to trace viewer under tools.
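
The "Run TensorBoard" step above can be sketched as follows. The port matches the one entered in Web Preview. The function name start_tensorboard is a hypothetical helper, and the log directory is assumed to be the location where the profile was written (for example gs://model-dir or /tmp/tb from the capture commands).

```shell
# Sketch: start TensorBoard on the port used by the Web Preview step.
# start_tensorboard is a hypothetical helper name.
start_tensorboard() {
  # $1 = directory that holds the captured profile data
  tensorboard --logdir "$1" --port 6006
}

# Example call (the directory is a placeholder):
#   start_tensorboard /tmp/tb
```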

Request more TPU quota

Cloud TPU allocates default TPU quota for your project. If you need more, see Requesting additional quota.

Generating and viewing profile information

You can generate profile information and use TensorBoard to visualize training metrics.

`gcloud` setup troubleshooting

Problem

gcloud components update displays the following error message:

ERROR: (gcloud.components.update)
You cannot perform this action because the gcloud CLI component manager
is disabled for this installation.

Solution

To use gcloud with TPU VM, you need a gcloud installation that is not managed through a package manager. Follow these steps to remove the packaged version and install gcloud from the downloadable archive:

sudo apt-get remove google-cloud-sdk
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-sdk-311.0.0-linux-x86_64.tar.gz
tar -xzf google-cloud-sdk-311.0.0-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh
source ~/.bashrc

Problem

Running any command beginning with gcloud compute tpus tpu-vm displays the following error message:

ERROR: (gcloud.compute.tpus) Invalid choice: 'tpu-vm'.

Solution

This happens when the component repository has not been properly updated. To verify this, run gcloud --version. The first line of the output should be "Google Cloud CLI HEAD"; if the output is different, the update did not take place. If this happens, try updating the gcloud components with the following command:

gcloud components update

If you are still getting the same error, try reinstalling gcloud with the following command:

gcloud components reinstall

Problem

Running gcloud compute tpus tpu-vm ssh ${TPU_NAME} --zone ${ZONE} displays the following error message:

Waiting for SSH key to propagate.
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ssh: connect to host 34.91.136.59 port 22: Connection timed out
ERROR: (gcloud.compute.tpus.tpu-vm.ssh) Could not SSH into the instance.  It is possible that your SSH key has not propagated to the instance yet. Try running this command again.  If you still cannot connect, verify that the firewall and instance are set to accept ssh traffic.

Solution

Something may be wrong with the SSH key propagation. Try moving the automatically-generated keys to a backup location to force gcloud to recreate them:

mv ~/.ssh/google_compute_engine ~/.ssh/old-google_compute_engine
mv ~/.ssh/google_compute_engine.pub ~/.ssh/old-google_compute_engine.pub
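
The two mv commands fail if a key file is missing. A slightly more defensive version of the same step is sketched below. backup_gcloud_ssh_keys is a hypothetical helper name, and the optional directory argument exists only so the function can be tried against a directory other than ~/.ssh.

```shell
# Sketch: move the generated gcloud SSH keys aside only if they exist.
# backup_gcloud_ssh_keys is a hypothetical helper name.
backup_gcloud_ssh_keys() {
  # $1 = SSH directory (optional); defaults to ~/.ssh
  ssh_dir="${1:-$HOME/.ssh}"
  for key in google_compute_engine google_compute_engine.pub; do
    if [ -f "$ssh_dir/$key" ]; then
      # Note: mv replaces an existing old-* backup of the same name.
      mv "$ssh_dir/$key" "$ssh_dir/old-$key"
      echo "moved $key to old-$key"
    else
      echo "$key not found, nothing to move"
    fi
  done
}

# Example call:
#   backup_gcloud_ssh_keys
```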

FAQ

Can I use V1Alpha1 and V1 APIs to manage Cloud TPU VMs? Get/List is allowed, but mutations are only available in V2Alpha1 API Version.
How do I know whether the TPUs are using Cloud TPU VMs? Make a GetNode call on the TPU, set the APIVersion field to V2_ALPHA1.
