Tuesday, April 7, 2020

Automatically Uploading COVID-19 Dataset to Einstein Analytics

My Examining COVID-19 with Einstein Analytics templated app relies on a csv file that is updated each day. The following steps automate the upload process.

Prerequisites 

To upload the data, we'll use Analytics Cloud Dataset Utils. Check out the GitHub repository here or go directly to the release page, and download the latest version of the jar file.

Running dataset utils requires a Java JDK.
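If you're not sure whether a JDK is already available on the machine that will run the upload, a quick check from the command line is:

java -version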

Next, clone my GitHub repository, which contains the daily csv file updates, to the desired local location:
git clone https://github.com/carlbrundage/covid-19-ea-data.git

Load Process

First, pull the latest csv file update into the covid-19-ea-data local clone:
git pull

Then, use dataset utils to upload the csv with the following command:

java -jar ./utils/datasetutils-47.0.0.jar --action load --u <Salesforce user> --p <Salesforce password> --dataset covid_unpivoted --app COVID_19 --inputFile ./covid-19-ea-data/covid.csv --schemaFile ./covid-19-ea-data/covid.json --endpoint https://login.salesforce.com

Use the appropriate parameters for your org:
  • --u = Salesforce username
  • --p = Salesforce password
  • --token = Salesforce token, if required
  • --dataset = Covid (raw) dataset API name. Can be found from Edit on a dataset in Analytics. Typically, covid_unpivoted
  • --app = Covid app API name. Can be found on Details for the app in Analytics. Typically, COVID_19
  • --inputFile = path to the covid.csv file from the GitHub project
  • --schemaFile = path to the covid.json file from the GitHub project
  • --endpoint = production (including dev org) or sandbox login URL
Put this together in a script, as sketched below, and execute it with one click or on a schedule.
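As a rough sketch, the pull and upload can be combined into a small shell script; the paths, jar version, and credential placeholders below are carried over from the examples above and would need adjusting for your environment:

#!/bin/sh
# Sketch of a daily refresh: pull the latest csv, then upload it with dataset utils.
set -e

# Get the newest covid.csv from the covid-19-ea-data clone
cd ./covid-19-ea-data && git pull && cd ..

# Upload the csv to the covid_unpivoted dataset in the COVID_19 app
java -jar ./utils/datasetutils-47.0.0.jar --action load \
  --u "<Salesforce user>" --p "<Salesforce password>" \
  --dataset covid_unpivoted --app COVID_19 \
  --inputFile ./covid-19-ea-data/covid.csv \
  --schemaFile ./covid-19-ea-data/covid.json \
  --endpoint https://login.salesforce.com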

Dataflow Execution

Finally, you can execute the dataflow from Data Manager in Analytics.  This will create the Covid Enhanced dataset.  Alternatively, you can schedule it to run daily at a certain time.

If you want to go a step further, you can start the dataflow programmatically. From the REST API, POST to /wave/dataflowjobs with the dataflow ID and a start command:

{
  "dataflowId": "02KS700000004G3eMAE",
  "command" : "start"
}
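For example, the request could be issued with curl; the instance URL, API version, and access token below are placeholders (not values from my org), and the dataflow ID comes from your own dataflow:

curl -X POST "https://yourInstance.my.salesforce.com/services/data/v48.0/wave/dataflowjobs" \
  -H "Authorization: Bearer <access token>" \
  -H "Content-Type: application/json" \
  -d '{ "dataflowId": "<dataflow id>", "command": "start" }'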

More information is available in the REST API Developer Guide.

Keep in mind that the dataset utils upload is asynchronous. Once it uploads the file parts, it kicks off a data file ingest step. The command completes after the upload, not after the data ingest (viewable in Data Manager) completes.
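If you'd rather confirm the ingest from a script instead of watching Data Manager, one option (a sketch, assuming the Salesforce CLI is installed and authorized to the org) is to query the InsightsExternalData record that the upload creates and check its Status:

sfdx force:data:soql:query -u <Salesforce user> -q "SELECT EdgemartAlias, Status, StatusMessage FROM InsightsExternalData ORDER BY CreatedDate DESC LIMIT 1"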
