TensorFlow is an open-source software library for high-performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, and TPUs). TFJob is a Kubernetes custom resource and operator for running TensorFlow jobs on Kubernetes.
Deploy the TFJob operator
Before you can use the tfjob runtime, you need to make sure that the TFJob operator and its CRD (custom resource definition) are deployed in your cluster.
Enable the operator
To schedule distributed jobs with the TFJob operator, you need to enable the operator in your deployment config, i.e. in your Polyaxon CE or Polyaxon Agent deployment:
```yaml
operators:
  tfjob: true
```
Create a component with the tfjob runtime
Once the TFJob operator is running in a Kubernetes namespace managed by Polyaxon, you can create components with the tfjob runtime:
```yaml
version: 1.1
kind: component
run:
  kind: tfjob
  ...
```
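For reference, a fuller polyaxonfile might look like the sketch below. The image, command, and replica count are hypothetical placeholders, not required values:

```yaml
version: 1.1
kind: component
run:
  kind: tfjob
  worker:
    replicas: 2
    container:
      # Illustrative image and entrypoint; replace with your own
      image: tensorflow/tensorflow:2.11.0
      command: ["python", "-u", "train.py"]
```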
Run the distributed job
Running components with the tfjob runtime is similar to running any other component:
```bash
polyaxon run -f manifest.yaml -P ...
```
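For example, assuming the component declares an input named `learning_rate` (a hypothetical parameter used here for illustration), you could pass a value for it with the `-P` flag:

```bash
# Pass a value for a hypothetical `learning_rate` input declared by the component
polyaxon run -f manifest.yaml -P learning_rate=0.001
```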
View a running operation on the dashboard
After running an operation with this component, you can view it on the Dashboard:
```bash
polyaxon ops dashboard

polyaxon ops dashboard -p [project-name] -uid [run-uuid] -y
```
Stop a running operation
To stop a running operation with this component:
```bash
polyaxon ops stop

polyaxon ops stop -p [project-name] -uid [run-uuid]
```
Run the job using the Python client
To run this component using the Polyaxon Python client:
```python
from polyaxon.client import RunClient

client = RunClient(...)
client.create_from_polyaxonfile(polyaxonfile="path/to/file", ...)
```
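To make the structure of the manifest passed via `polyaxonfile` concrete, here is a small sketch that assembles the same tfjob component spec as a plain Python dict; the image, command, and replica count are hypothetical placeholders, and this does not call the Polyaxon API:

```python
# Sketch of a tfjob component manifest as a Python dict. This mirrors the
# YAML polyaxonfile shown above; field values here are illustrative only.
manifest = {
    "version": 1.1,
    "kind": "component",
    "run": {
        "kind": "tfjob",
        "worker": {
            "replicas": 2,
            "container": {
                # Hypothetical image and entrypoint; replace with your own
                "image": "tensorflow/tensorflow:2.11.0",
                "command": ["python", "-u", "train.py"],
            },
        },
    },
}
```

Serializing this dict to YAML and saving it to disk would give you a file you could pass as the `polyaxonfile` argument above.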