Git integration is now available in the Amazon SageMaker Python SDK. You no longer have to download scripts from a Git repository for training jobs and hosting models. With this new feature, you can use training scripts stored in Git repos directly when training a model in the Python SDK. You can also use hosting scripts stored in Git repos when hosting a model. The scripts can be hosted in GitHub, in another Git-based repo, or in an AWS CodeCommit repo.
This post describes in detail how to use Git integration with the Amazon SageMaker Python SDK.
When you train a model with the Amazon SageMaker Python SDK, you need a training script that does the following:
- Loads data from the input channels
- Configures training with hyperparameters
- Trains a model
- Saves the model
You specify the script as the value of the entry_point argument when you create an estimator object.
Previously, when you constructed an estimator or Model object in the Python SDK, the training script you provided as the entry_point value had to be a path in the local file system. This was inconvenient when you had training scripts in Git repos because you had to download them locally.
If multiple developers were contributing to the Git repo, you had to keep track of any updates to the repo. Also, if your local version was out of date, you needed to pull the latest version prior to every training job. This also made scheduling periodic training jobs even more challenging.
With the launch of Git integration, these issues are solved, which results in a notable improvement in convenience and productivity.
Enable the Git integration feature by passing a dict parameter named git_config when you create the estimator or Model object. The git_config parameter provides information about the location of the Git repo that contains the scripts and the authentication for accessing that repo.
Locate the Git repo
To locate the repo that contains the scripts, use the repo, branch, and commit fields in git_config. The repo field is required; the other two fields are optional. If you only provide the repo field, the latest commit in the master branch is used by default:
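A minimal sketch (the repo URL is illustrative):

```python
# Minimal git_config: only the required 'repo' field.
# With no 'branch' or 'commit', the latest commit on the master
# branch is used.
git_config = {'repo': 'https://github.com/awslabs/amazon-sagemaker-examples.git'}
```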
To specify a branch, use both the repo and branch fields. The latest commit in that branch is used by default:
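For example (repo URL and branch name are illustrative):

```python
# Pin a branch: the latest commit in 'training-scripts' is used.
git_config = {
    'repo': 'https://github.com/awslabs/amazon-sagemaker-examples.git',
    'branch': 'training-scripts',
}
```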
To specify a commit of a specific branch in a repo, use all three fields in git_config:
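A sketch with all three fields (the commit SHA is a placeholder):

```python
# Pin an exact commit on a specific branch.
git_config = {
    'repo': 'https://github.com/awslabs/amazon-sagemaker-examples.git',
    'branch': 'training-scripts',
    'commit': '4893e528afa4a790331e1b5286954f073b0f14a2',  # placeholder SHA
}
```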
If only the repo and commit fields are provided, this works when the commit is under the master branch, and that commit is used. However, if the commit is not under the master branch, the repo is not found:
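A sketch of this case (again with a placeholder SHA):

```python
# 'repo' + 'commit' without 'branch': this works only if the commit
# is reachable from the master branch; otherwise the checkout fails.
git_config = {
    'repo': 'https://github.com/awslabs/amazon-sagemaker-examples.git',
    'commit': '4893e528afa4a790331e1b5286954f073b0f14a2',  # placeholder SHA
}
```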
Get access to the Git repo
If the Git repo is private (all CodeCommit repos are private), you need authentication information to access it.
For CodeCommit repos, first make sure that you set up your authentication method. For more information, see Setting Up for AWS CodeCommit. The topic lists the following ways by which you can authenticate:
Authentication for SSH URLs
For SSH URLs, you must configure the SSH key pair. This applies to GitHub, CodeCommit, and other Git-based repos.
Do not set an SSH key passphrase for the SSH key pairs. If you do, access to the repo fails.
After the SSH key pair is configured, Git integration works with SSH URLs without further authentication information:
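With an SSH URL, git_config needs only the repo field (repo URL is illustrative):

```python
# SSH URL: no username/password/token fields are needed, provided
# the SSH key pair (with no passphrase) is already configured.
git_config = {'repo': 'git@github.com:awslabs/amazon-sagemaker-examples.git'}
```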
Authentication for HTTPS URLs
For HTTPS URLs, there are two ways to deal with authentication:
- Have it configured locally.
- Configure it by providing extra fields in git_config, such as username, password, and token. Things can be slightly different here between CodeCommit, GitHub, and other Git-based repos.
Authenticating using Git credentials
If you authenticate with Git credentials, you can do one of the following:
- Provide the credentials in git_config.
- Have the credentials stored in local credential storage. Typically, the credentials are stored automatically after you provide them with the AWS CLI. For example, macOS stores credentials in Keychain Access.
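Providing the credentials directly in git_config might look like this (the CodeCommit repo name and credentials are placeholders):

```python
# HTTPS URL with Git credentials passed explicitly.
git_config = {
    'repo': 'https://git-codecommit.us-east-1.amazonaws.com/v1/repos/my-repo',
    'username': 'my-codecommit-username',
    'password': 'my-codecommit-password',
}
```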
With the Git credentials stored locally, you can specify the git_config parameter without providing the credentials, to avoid showing them in scripts:
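In that case only the repo field is needed (repo name is a placeholder):

```python
# Credentials are already in local credential storage
# (e.g., macOS Keychain Access), so only the URL is required.
git_config = {'repo': 'https://git-codecommit.us-east-1.amazonaws.com/v1/repos/my-repo'}
```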
Authenticating using AWS CLI Credential Helper
If you follow the setup documentation mentioned earlier to configure AWS CLI Credential Helper, you don’t have to provide any authentication information.
For GitHub and other Git-based repos, check whether two-factor authentication (2FA) is enabled for your account. (2FA is disabled by default and must be enabled manually.) For more information, see Securing your account with two-factor authentication (2FA).
If 2FA is enabled for your account, provide 2FA_enabled when specifying git_config and set it to True. Otherwise, set it to False. If 2FA_enabled is not provided, it is set to False by default. Usually, you can use either username+password or a personal access token to authenticate for GitHub and other Git-based repos. However, when 2FA is enabled, you can only use a personal access token.
To use username+password for authentication:
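A sketch (repo URL and credentials are placeholders):

```python
# GitHub over HTTPS with username + password (only valid when 2FA is off).
git_config = {
    'repo': 'https://github.com/user/my-private-repo.git',
    'username': 'my-github-username',
    'password': 'my-github-password',
    '2FA_enabled': False,
}
```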
Again, you can store the credentials in local credential storage to avoid showing them in the script.
To use a personal access token for authentication:
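A sketch (repo URL and token are placeholders):

```python
# GitHub over HTTPS with a personal access token
# (required when 2FA is enabled).
git_config = {
    'repo': 'https://github.com/user/my-private-repo.git',
    'token': 'my-personal-access-token',
    '2FA_enabled': True,
}
```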
Create the estimator or model with Git integration
After you correctly specify git_config, pass it as a parameter when you create the estimator or model object to enable Git integration. Then, make sure that entry_point, source_dir, and dependencies are all relative paths under the Git repo.
As usual, if source_dir is provided, entry_point should be a relative path from the source directory. The same is true with Git integration.
For example, with the following structure of the Git repo ‘amazon-sagemaker-examples’ under branch ‘training-scripts’:
You can create the estimator object as follows:
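A sketch using v1-era SDK parameter names; the repo URL, IAM role, and instance settings are assumptions:

```python
from sagemaker.tensorflow import TensorFlow

# Repo URL and branch are assumed for illustration.
git_config = {
    'repo': 'https://github.com/awslabs/amazon-sagemaker-examples.git',
    'branch': 'training-scripts',
}

estimator = TensorFlow(
    entry_point='train.py',            # relative to source_dir
    source_dir='char-rnn-tensorflow',  # relative to the repo root
    git_config=git_config,
    role='SageMakerRole',              # placeholder IAM role
    framework_version='1.12.0',
    py_version='py3',
    train_instance_count=1,
    train_instance_type='ml.c4.xlarge',
)
```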
In this example, source_dir 'char-rnn-tensorflow' is a relative path inside the Git repo, while entry_point 'train.py' is a relative path under 'char-rnn-tensorflow'.
Git integration example
Now let’s look at a complete example of using Git integration. This example trains a multi-layer LSTM RNN model on a language modeling task, based on a PyTorch example. By default, the training script uses the Wikitext-2 dataset. We train a model on SageMaker, deploy it, and then use the deployed model to generate new text.
Run the commands in a Python script, except for those that start with a ‘!’, which are bash commands.
First let’s do the setup:
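A minimal setup sketch, assuming the v1-era Python SDK and a SageMaker notebook execution role:

```python
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()                  # assumes a SageMaker execution role
bucket = sagemaker_session.default_bucket()  # default S3 bucket for this account
```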
Next get the dataset. This data is from Wikipedia and is licensed CC-BY-SA-3.0. Before you use this data for any other purpose than this example, you should understand the data license, described at https://creativecommons.org/licenses/by-sa/3.0/:
Upload the data to S3:
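A sketch, assuming the dataset was downloaded into a local ./data directory (the key prefix is illustrative):

```python
import sagemaker

sagemaker_session = sagemaker.Session()
# Upload the local ./data directory (Wikitext-2 files) to the default bucket.
inputs = sagemaker_session.upload_data(path='data',
                                       key_prefix='sagemaker/word-language-model')
```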
Specify git_config and create the estimator with it:
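A sketch with v1-era parameter names; the repo URL, source directory, role, and hyperparameters are assumptions:

```python
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

git_config = {
    'repo': 'https://github.com/awslabs/amazon-sagemaker-examples.git',  # assumed repo
    'branch': 'training-scripts',
}

estimator = PyTorch(
    entry_point='train.py',
    source_dir='word-language-model',  # hypothetical directory inside the repo
    git_config=git_config,
    role=role,
    framework_version='1.0.0',
    train_instance_count=1,
    train_instance_type='ml.c4.xlarge',
    hyperparameters={'epochs': 6},     # illustrative hyperparameters
)
```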
Train the model:
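Continuing from the estimator and uploaded data above (the channel name 'training' is an assumption tied to what the script expects):

```python
# Launch the training job; 'training' is the input channel name
# the training script reads data from.
estimator.fit({'training': inputs})
```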
Next, let’s host the model. We are going to provide custom implementations of the hosting functions (model_fn, input_fn, predict_fn, and output_fn) in a separate file, 'generate.py', which is in the same Git repo. The PyTorch model uses an npy serializer and deserializer by default. For this example, since we have a custom implementation of all the hosting functions and plan on using JSON instead, we need a predictor that can serialize and deserialize JSON:
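A sketch of such a predictor, using the v1-era RealTimePredictor and JSON helpers:

```python
from sagemaker.predictor import RealTimePredictor, json_serializer, json_deserializer

class JSONPredictor(RealTimePredictor):
    """Predictor that sends and receives JSON instead of npy."""
    def __init__(self, endpoint_name, sagemaker_session=None):
        super(JSONPredictor, self).__init__(
            endpoint_name, sagemaker_session, json_serializer, json_deserializer)
```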
Create the model object:
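A sketch, reusing the git_config, role, estimator, and JSONPredictor from the previous steps (the source directory is a hypothetical name):

```python
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data=estimator.model_data,   # S3 artifact from the training job above
    role=role,
    framework_version='1.0.0',
    entry_point='generate.py',         # hosting script in the same Git repo
    source_dir='word-language-model',  # hypothetical directory inside the repo
    git_config=git_config,
    predictor_cls=JSONPredictor,
)
```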
Create the hosting endpoint:
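A sketch (the instance type is illustrative):

```python
# Deploy the model to a real-time endpoint.
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
```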
Now we are going to use our deployed model to generate text by providing a random seed, a temperature (higher values increase diversity), and the number of words we would like to get:
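A sketch of the request; the payload schema (seed / temperature / words) is an assumption that must match the custom input_fn in generate.py:

```python
# Request schema is assumed to match the custom input_fn in generate.py.
payload = {'seed': 111, 'temperature': 2.0, 'words': 100}
response = predictor.predict(payload)
print(response)
```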
You get the following results:
Finally, delete the endpoint after you are done using it:
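Using the predictor from the previous steps:

```python
# Tear down the endpoint to stop incurring charges.
predictor.delete_endpoint()
```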
In this post, I walked through how to use Git integration with the Amazon SageMaker Python SDK. With Git integration, you no longer have to download scripts from Git repos for training jobs and hosting models. Now you can use scripts in Git repos directly, simply by passing an additional parameter, git_config, when creating the estimator or model object.
If you have questions or suggestions, please leave them in the comments.
About the Authors
Yue Tu is a summer intern on the AWS SageMaker ML Frameworks team. He worked on Git integration for the SageMaker Python SDK during his internship. Outside of work, he likes playing basketball; his favorite teams are the Golden State Warriors and the Duke basketball team. He also likes paying attention to nothing for some time.
Chuyang Deng is a software development engineer on the AWS SageMaker ML Frameworks team. She enjoys playing LEGO alone.