In this post, we show how to configure a new OAuth-based authentication feature for using Snowflake in Amazon SageMaker Data Wrangler. Snowflake is a cloud data platform that provides data solutions from data warehousing to data science. Snowflake is an AWS Partner with multiple AWS accreditations, including AWS competencies in machine learning (ML), retail, and data and analytics.
Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes from weeks to minutes by providing a single visual interface for data scientists to select and clean data, create features, and automate data preparation in ML workflows without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, and Snowflake. With this new feature, you can use your own identity provider (IdP) such as Okta, Azure AD, or Ping Federate to connect to Snowflake via Data Wrangler.
Solution overview
In the following sections, we provide steps for an administrator to set up the IdP, Snowflake, and Studio. We also detail the steps that data scientists can take to configure the data flow, analyze the data quality, and add data transformations. Finally, we show how to export the data flow and train a model using SageMaker Autopilot.
Prerequisites
For this walkthrough, you should have the following prerequisites:
- For the admin:
- A Snowflake user with permissions to create storage integrations and security integrations in Snowflake.
- An AWS account with permissions to create AWS Identity and Access Management (IAM) policies and roles.
- Access and permissions to configure the IdP to register the Data Wrangler application and set up the authorization server or API.
- For the data scientist:
Administrator setup
Instead of having your users directly enter their Snowflake credentials into Data Wrangler, you can have them use an IdP to access Snowflake.
The following steps are involved in enabling Data Wrangler OAuth access to Snowflake:
- Configure the IdP.
- Configure Snowflake.
- Configure SageMaker Studio.
Configure the IdP
To set up your IdP, you must register the Data Wrangler application and set up your authorization server or API.
Register the Data Wrangler application within the IdP
Refer to the following documentation for the IdPs that Data Wrangler supports:
Use the documentation provided by your IdP to register your Data Wrangler application. The information and procedures in this section help you understand how to properly use the documentation provided by your IdP.
Specific customizations in addition to the steps in the respective guides are called out in the following subsections.
- Select the configuration that starts the process of registering Data Wrangler as an application.
- Provide the users within the IdP access to Data Wrangler.
- Enable OAuth client authentication by storing the client credentials as a Secrets Manager secret.
- Specify a redirect URL using the following format: https://domain-ID.studio.AWS-Region.sagemaker.aws/jupyter/default/lab.
You're specifying the SageMaker domain ID and AWS Region that you're using to run Data Wrangler. You must register a URL for each domain and Region where you're running Data Wrangler. Users from a domain and Region that don't have redirect URLs set up for them won't be able to authenticate with the IdP to access the Snowflake connection. For example, a hypothetical domain with ID d-abc12345 in us-east-1 would register https://d-abc12345.studio.us-east-1.sagemaker.aws/jupyter/default/lab.
- Make sure the authorization code and refresh token grant types are allowed for your Data Wrangler application.
Set up the authorization server or API within the IdP
Within your IdP, you must set up an authorization server or an application programming interface (API). For each user, the authorization server or the API sends tokens to Data Wrangler with Snowflake as the audience.
Snowflake uses the concept of roles, which are distinct from the IAM roles used in AWS. You must configure the IdP to use ANY Role so that the default role associated with the Snowflake account is used. For example, if a user has systems administrator as the default role in their Snowflake profile, the connection from Data Wrangler to Snowflake uses systems administrator as the role.
Use the following procedure to set up the authorization server or API within your IdP:
- From your IdP, begin the process of setting up the server or API.
- Configure the authorization server to use the authorization code and refresh token grant types.
- Specify the lifetime of the access token.
- Set the refresh token idle timeout.
The idle timeout is the time after which the refresh token expires if it isn't used. If you're scheduling jobs in Data Wrangler, we recommend making the idle timeout greater than the interval between processing job runs; otherwise, some processing jobs might fail because the refresh token expired before they could run. For example, if a processing job runs daily, set the idle timeout to more than 24 hours.
Note that Data Wrangler doesn't support rotating refresh tokens. Using rotating refresh tokens might result in access failures or users needing to log in frequently.
If the refresh token expires, your users must reauthenticate by accessing the connection that they've made to Snowflake through Data Wrangler.
- Specify session:role-any as the new scope.
For Azure AD, you must also specify a unique identifier for the scope.
After you've set up the OAuth provider, you give Data Wrangler the information it needs to connect to the provider. You can use the documentation from your IdP to get values for the following fields:
- Token URL – The URL of the token that the IdP sends to Data Wrangler
- Authorization URL – The URL of the authorization server of the IdP
- Client ID – The ID of the IdP
- Client secret – The secret that only the authorization server or API recognizes
- OAuth scope – This is for Azure AD only
Configure Snowflake
To configure Snowflake, complete the instructions in Import data from Snowflake.
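Part of those instructions is creating a Snowflake storage integration that allows Snowflake to read from and write to an S3 bucket. The following is a minimal sketch of such a statement; the integration name, IAM role ARN, and bucket location are placeholders you replace with your own values:

```sql
-- Minimal sketch of a Snowflake storage integration for Amazon S3.
-- The integration name, role ARN, and bucket location are placeholders.
CREATE STORAGE INTEGRATION my_s3_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-access-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-data-wrangler-bucket/');
```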
Use the Snowflake documentation for your IdP to set up an external OAuth integration in Snowflake. See the previous section, Register the Data Wrangler application within the IdP, for more information on how to set up an external OAuth integration.
When you're setting up the security integration in Snowflake, make sure you turn on external_oauth_any_role_mode.
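The exact security integration depends on your IdP. As a rough sketch, an Okta-based integration might look like the following, with placeholder issuer, keys URL, and audience values; refer to the Snowflake external OAuth documentation for your provider's exact parameters:

```sql
-- Sketch of an external OAuth security integration for Okta.
-- The issuer, JWS keys URL, and audience values are placeholders.
CREATE SECURITY INTEGRATION dw_okta_oauth
  TYPE = EXTERNAL_OAUTH
  ENABLED = TRUE
  EXTERNAL_OAUTH_TYPE = OKTA
  EXTERNAL_OAUTH_ISSUER = 'https://example.okta.com/oauth2/aus1example'
  EXTERNAL_OAUTH_JWS_KEYS_URL = 'https://example.okta.com/oauth2/aus1example/v1/keys'
  EXTERNAL_OAUTH_AUDIENCE_LIST = ('https://example.snowflakecomputing.com')
  EXTERNAL_OAUTH_TOKEN_USER_MAPPING_CLAIM = 'sub'
  EXTERNAL_OAUTH_SNOWFLAKE_USER_MAPPING_ATTRIBUTE = 'login_name'
  EXTERNAL_OAUTH_ANY_ROLE_MODE = 'ENABLE';
```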
Configure SageMaker Studio
You store the fields and values in a Secrets Manager secret and add it to the Studio Lifecycle Configuration that you're using for Data Wrangler. A Lifecycle Configuration is a shell script that automatically loads the credentials stored in the secret when the user logs in to Studio. For information about creating secrets, see Move hardcoded secrets to AWS Secrets Manager. For information about using Lifecycle Configurations in Studio, see Use Lifecycle Configurations with Amazon SageMaker Studio.
Create a secret for Snowflake credentials
To create your secret for Snowflake credentials, complete the following steps:
- On the Secrets Manager console, choose Store a new secret.
- For Secret type, select Other type of secret.
- Specify the details of your secret as key-value pairs.
Key names must be lowercase because they are case sensitive; Data Wrangler gives you a warning if you enter any of them incorrectly. Enter the secret values as key-value pairs using the Key/value option if you'd like, or use the Plaintext option.
The following is the format of the secret used for Okta. If you're using Azure AD, you need to add the datasource_oauth_scope field.
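As a sketch, the secret body looks like the following; every value is a placeholder, and the lowercase key names correspond to the fields gathered during application registration:

```json
{
  "token_url": "https://identityprovider.example.com/oauth2/example-id/v1/token",
  "client_id": "example-client-id",
  "client_secret": "example-client-secret",
  "identity_provider": "OKTA",
  "authorization_url": "https://identityprovider.example.com/oauth2/example-id/v1/authorize"
}
```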
- Update the preceding values with your choice of IdP and the information gathered after application registration.
- Choose Next.
- For Secret name, add the prefix AmazonSageMaker (for example, our secret is AmazonSageMaker-DataWranglerSnowflakeCreds).
- In the Tags section, add a tag with the key SageMaker and value true.
- Choose Next.
- The rest of the fields are optional; choose Next until you have the option to choose Store to store the secret.
After you store the secret, you're returned to the Secrets Manager console.
- Choose the secret you just created, then retrieve the secret ARN.
- Store this in your preferred text editor for use later when you create the Data Wrangler data source.
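If you prefer the AWS CLI, one way to retrieve the ARN is with describe-secret; the secret name below assumes the example name from this post:

```bash
# Retrieve the ARN of the secret created above.
# The secret name is the example used in this post.
aws secretsmanager describe-secret \
  --secret-id AmazonSageMaker-DataWranglerSnowflakeCreds \
  --query ARN --output text
```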
Create a Studio Lifecycle Configuration
To create a Lifecycle Configuration in Studio, complete the following steps:
- On the SageMaker console, choose Lifecycle configurations in the navigation pane.
- Choose Create configuration.
- Choose Jupyter server app.
- Create a new lifecycle configuration or append an existing one with the following content:
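As a minimal sketch, the script body writes a small JSON file that points Data Wrangler at the secret; the secret ARN below is a placeholder for the ARN you saved earlier:

```bash
#!/bin/bash
set -eux

# Write the Snowflake OAuth secret reference to the user's home folder so
# Data Wrangler can locate the credentials. The ARN is a placeholder;
# replace it with the secret ARN you saved earlier.
cat > ~/.snowflake_identity_provider_oauth_config <<EOL
{
  "secret_arn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:AmazonSageMaker-DataWranglerSnowflakeCreds-abc123"
}
EOL
```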
The configuration creates a file named .snowflake_identity_provider_oauth_config, containing the secret, in the user's home folder.
- Choose Create Configuration.
Set the default Lifecycle Configuration
Complete the following steps to set the Lifecycle Configuration you just created as the default:
- On the SageMaker console, choose Domains in the navigation pane.
- Choose the Studio domain you'll be using for this example.
- On the Environment tab, in the Lifecycle configurations for personal Studio apps section, choose Attach.
- For Source, select Existing configuration.
- Select the configuration you just made, then choose Attach to domain.
- Select the new configuration and choose Set as default, then choose Set as default again in the pop-up message.
Your new settings should now be visible under Lifecycle configurations for personal Studio apps as default.
- Shut down the Studio app and relaunch it for the changes to take effect.
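If you'd rather do this from the AWS CLI than the Studio UI, one way to shut down the default Jupyter server app is shown below; the domain ID and user profile name are placeholders:

```bash
# Delete the default JupyterServer app; Studio recreates it on the next
# launch and picks up the new default Lifecycle Configuration.
# The domain ID and user profile name are placeholders.
aws sagemaker delete-app \
  --domain-id d-abc12345 \
  --user-profile-name my-user-profile \
  --app-type JupyterServer \
  --app-name default
```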
Data scientist experience
In this section, we cover how data scientists can connect to Snowflake as a data source in Data Wrangler and prepare data for ML.
Create a new data flow
To create your data flow, complete the following steps:
- On the SageMaker console, choose Amazon SageMaker Studio in the navigation pane.
- Choose Open Studio.
- On the Studio Home page, choose Import & prepare data visually. Alternatively, on the File drop-down menu, choose New, then choose SageMaker Data Wrangler Flow.
Creating a new flow can take a few minutes.
- On the Import data page, choose Create connection.
- Choose Snowflake from the list of data sources.
- For Authentication method, choose OAuth.
If you don't see OAuth, verify the preceding Lifecycle Configuration steps.
- Enter details for Snowflake account name and Storage integration.
- Enter a connection name and choose Connect.
You're redirected to an IdP authentication page. For this example, we're using Okta.
- Enter your user name and password, then choose Sign in.
After the authentication is successful, you're redirected to the Studio data flow page.
- On the Import data from Snowflake page, browse the database objects, or run a query for the targeted data.
- In the query editor, enter a query and preview the results.
In the following example, we load loan data and retrieve all columns from 5,000 rows.
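The exact query depends on your database and schema; a sketch of such a query, assuming a hypothetical LOAN_DATA table, might look like the following:

```sql
-- Retrieve all columns from 5,000 rows of a hypothetical LOAN_DATA table
SELECT * FROM LOAN_DATA LIMIT 5000;
```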
- Choose Import.
- Enter a dataset name (for this post, we use snowflake_loan_dataset) and choose Add.
You're redirected to the Prepare page, where you can add transformations and analyses to the data.
Data Wrangler makes it straightforward to ingest data and perform data preparation tasks such as exploratory data analysis, feature selection, and feature engineering. We've only covered a few of Data Wrangler's data preparation capabilities in this post; you can use Data Wrangler for more advanced data analysis, such as feature importance, target leakage, and model explainability, using an easy and intuitive user interface.
Analyze data quality
Use the Data Quality and Insights Report to perform an analysis of the data that you've imported into Data Wrangler. Data Wrangler creates the report from the sampled data.
- On the Data Wrangler flow page, choose the plus sign next to Data types, then choose Get data insights.
- Choose Data Quality And Insights Report for Analysis type.
- For Target column, choose your target column.
- For Problem type, select Classification.
- Choose Create.
The insights report has a brief summary of the data, which includes general information such as missing values, invalid values, feature types, outlier counts, and more. You can either download the report or view it online.
Add transformations to the data
Data Wrangler has over 300 built-in transformations. In this section, we use some of these transformations to prepare the dataset for an ML model.
- On the Data Wrangler flow page, choose the plus sign, then choose Add transform.
If you're following the steps in this post, you're directed here automatically after adding your dataset.
- Verify and modify the data type of the columns.
Looking through the columns, we can see that MNTHS_SINCE_LAST_DELINQ and MNTHS_SINCE_LAST_RECORD should most likely be represented as a number type rather than string.
- After applying the changes and adding the step, you can verify the column data type is changed to float.
Looking through the data, we can see that the fields EMP_TITLE, URL, DESCRIPTION, and TITLE will likely not provide value to our model in our use case, so we can drop them.
- Choose Add Step, then choose Manage columns.
- For Transform, choose Drop column.
- For Column to drop, specify EMP_TITLE, URL, DESCRIPTION, and TITLE.
- Choose Preview and Add.
Next, we want to look for categorical data in our dataset. Data Wrangler has built-in functionality to encode categorical data using both ordinal and one-hot encodings. Looking at our dataset, we can see that the TERM, HOME_OWNERSHIP, and PURPOSE columns all appear to be categorical in nature.
- Add another step and choose Encode categorical.
- For Transform, choose One-hot encode.
- For Input column, choose TERM.
- For Output style, choose Columns.
- Leave all other settings as default, then choose Preview and Add.
The HOME_OWNERSHIP column has four possible values: RENT, MORTGAGE, OWN, and other.
- Repeat the preceding steps to apply a one-hot encoding approach to these values.
Finally, the PURPOSE column has several possible values. For this data, we use a one-hot encoding approach as well, but we set the output to a vector rather than columns.
- For Transform, choose One-hot encode.
- For Input column, choose PURPOSE.
- For Output style, choose Vector.
- For Output column, we call this column PURPOSE_VCTR.
This keeps the original PURPOSE column, in case we decide to use it later.
- Leave all other settings as default, then choose Preview and Add.
Export the data flow
Finally, we export this entire data flow to a feature store with a SageMaker Processing job, which creates a Jupyter notebook with the code pre-populated.
- On the data flow page, choose the plus sign and Export to.
- Choose where to export. For our use case, we choose SageMaker Feature Store.
The exported notebook is now ready to run.
Export data and train a model with Autopilot
Now we can train the model using Amazon SageMaker Autopilot.
- On the data flow page, choose the Training tab.
- For Amazon S3 location, enter a location for the data to be saved.
- Choose Export and train.
- Specify the settings in the Target and features, Training method, Deployment and advanced settings, and Review and create sections.
- Choose Create experiment to find the best model for your problem.
Clean up
If your work with Data Wrangler is complete, shut down your Data Wrangler instance to avoid incurring additional charges.
Conclusion
In this post, we demonstrated connecting Data Wrangler to Snowflake using OAuth, transforming and analyzing a dataset, and finally exporting the data flow so that it could be used in a Jupyter notebook. Most notably, we created a pipeline for data preparation without having to write any code at all.
To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler.
About the authors
Ajjay Govindaram is a Senior Solutions Architect at AWS. He works with strategic customers who are using AI/ML to solve complex business problems. His experience lies in providing technical direction as well as design assistance for modest to large-scale AI/ML application deployments. His knowledge ranges from application architecture to big data, analytics, and machine learning. He enjoys listening to music while resting, experiencing the outdoors, and spending time with his loved ones.
Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions and has led engineering teams in designing and implementing data analytics platforms and data products.
Matt Marzillo is a Sr. Partner Sales Engineer at Snowflake. He has 10 years of experience in data science and machine learning roles, both in consulting and with industry organizations. Matt has experience developing and deploying AI and ML models across many different organizations in areas such as marketing, sales, operations, medical, and finance, as well as advising in consultative roles.
Huong Nguyen is a product leader for Amazon SageMaker Data Wrangler at AWS. She has 15 years of experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys audiobooks, gardening, hiking, and spending time with her family and friends.