Azure DevOps CICD With Azure Databricks and Data Factory
Beda Tse · Mar 1, 2019
Let’s cut a long story short: we won’t add any unnecessary
introduction that you will skip anyway. Parts of this walkthrough
may already be outdated as well, but hopefully this series will give you
some insight into setting up CI/CD with Azure Databricks.
Prerequisites
You need an Azure account and an Azure DevOps
organisation; you can use either GitHub
or Azure Repos as your repository. In this series, we will assume you
are using Azure Repos.
You will need a git client or command-line git. We will use
command-line git throughout the series, so we assume that you also
have a terminal, such as Terminal on macOS or Git Bash on Windows.
You will need a text editor other than the normal Databricks
notebook editor. Visual Studio Code is a good candidate. If
the text editor has built-in git support, that is even better.
Checklist
1. Azure Account
0–2. Create your git repo on Azure DevOps inside a project with an initial
README.
0–4. Clone the repository via git using the following command
$ git clone <repository-url>
1–2. Stage the changed file in git, then commit and push it to the Azure
Repo.
$ git add -A
$ git commit -m '<your-commit-message>'
$ git push
1–2. Commit and push the infrastructure code and build pipeline code to the
repository.
1–2. After pushing the code back into the repository, it should look like this.
The build pipeline currently only does one thing: it packs the
Azure Resource Manager JSONs into a build artifact, which can be
consumed in later steps for deployment. Let’s take a look at what is
inside the artifact now.
1–4. Create a variable group for your deployment. You don’t want to
hardcode your variables inside the pipeline, so that you can reuse
it in another project or environment with the least effort. First,
let’s create a group for your project, storing all variables that are
the same across all environments.
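If you prefer scripting the setup, the Azure DevOps CLI can create variable groups as well. A minimal sketch, assuming the azure-devops CLI extension is installed and using placeholder names (the variables shown match the ones referenced later in this article):

$ az extension add --name azure-devops
$ az pipelines variable-group create \
    --organization https://dev.azure.com/<your-organisation> \
    --project <your-project> \
    --name <your-project-variable-group> \
    --variables databricks_location=<azure-region> notebook_name=<your-notebook-name>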
Add your build artifact from your repository as the source artifact. In
this example, we will add it from the databricks example build. Click +Add next
to Artifacts.
1–5–5 Link Databricks Pipeline Project Variable Group with Release scope
Click the + sign next to the Agent job, add an Azure Resource
Group Deployment task.
After all of this, save your release pipeline, and we are ready to create
a release.
2–2. After logging into the workspace, click the user icon in the
top right corner and select User Settings. Click Generate New
Token, give it a meaningful comment, and click Generate. We will
use this token in our pipeline for notebook deployment. Your token
will only be displayed once; make sure you do not close the dialog or
browser before you have copied it into Key Vault.
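For reference, the token can also be put into Key Vault from the command line. A minimal sketch using the Azure CLI, with placeholder names; the secret name should match the databricks-token variable referenced later in the pipeline:

$ az keyvault secret set \
    --vault-name <your-key-vault> \
    --name databricks-token \
    --value <your-databricks-token>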
3–3. The committed change is pushed into the git repository. What does
that mean? It means it will trigger the build pipeline. With a little
further configuration, we can update the build pipeline to package
this notebook into a deployable package, and use it to trigger a
deployment pipeline. Now download the azure-pipelines.yml from this
commit, replace the original azure-pipelines.yml from step 1–1, and
commit and push the change back to the repository.
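The linked file is not reproduced in this copy of the article. A minimal sketch of what such a notebook-packaging build pipeline might look like follows; the trigger branch, notebook path, and artifact name are assumptions inferred from the paths used in the release steps below, and notebook_name is assumed to be defined as a pipeline variable:

trigger:
- master

pool:
  vmImage: 'ubuntu-latest'

steps:
# Suffix the notebook with the commit hash so each build produces a uniquely named copy
- bash: |
    mkdir -p $(Build.ArtifactStagingDirectory)/notebook
    cp notebook/$(notebook_name).py $(Build.ArtifactStagingDirectory)/notebook/$(notebook_name)-$(Build.SourceVersion).py
    cp notebook/notebook-run.json.tmpl $(Build.ArtifactStagingDirectory)/notebook/
  displayName: 'Package notebook'
- task: PublishBuildArtifacts@1
  inputs:
    pathToPublish: '$(Build.ArtifactStagingDirectory)/notebook'
    artifactName: 'notebook'
  displayName: 'Publish notebook artifact'

With a source alias of _databricks-example in the release pipeline, this yields the _databricks-example/notebook/... paths used in the steps below.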
4–3. Add a Bash Task at the end of the job. Rename it to Install Tools.
Select Type as Inline and copy the following script into the Script text
area. This installs the Python tools needed to deploy
notebooks to Databricks via the command-line interface.

python -m pip install --upgrade pip setuptools wheel databricks-cli
4–4. Add a Bash Task at the end of the job. Rename it to Authenticate
with Databricks CLI. Select Type as Inline and copy the following
script into the Script text area. The variable databricks_location is
obtained from the variable group defined inside the pipeline,
while databricks-token is obtained from the variable group linked with
Azure Key Vault.
databricks configure --token <<EOF
https://$(databricks_location).azuredatabricks.net
$(databricks-token)
EOF
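As a side note, the legacy databricks-cli also reads the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, so an equivalent sketch without the heredoc would be the following; note that environment variables set this way only live for the duration of the current Bash task, so they would need to be exported in every task that calls the CLI:

# Assumed alternative, not from the original article
export DATABRICKS_HOST=https://$(databricks_location).azuredatabricks.net
export DATABRICKS_TOKEN=$(databricks-token)
databricks workspace ls /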
4–5. Add a Bash Task at the end of the job. Rename it to Upload
Notebook to Databricks. Select Type as Inline and copy the following
script into the Script text area. The variable notebook_name is
retrieved from the release-scoped variable group.

databricks workspace mkdirs /build
databricks workspace import --language PYTHON --format SOURCE --overwrite _databricks-example/notebook/$(notebook_name)-$(Build.SourceVersion).py /build/$(notebook_name)-$(Build.SourceVersion).py
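To confirm the upload worked, you can list the target folder from the same task:

databricks workspace ls /build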
4–6. Add a Bash Task at the end of the job. Rename it to Create
Notebook Run JSON. Select Type as Inline and copy the following
script into the Script text area. This prepares a job execution
configuration for the test run, using the template notebook-run.json.tmpl.

# Replace run name and deployment notebook path
cat _databricks-example/notebook/notebook-run.json.tmpl | jq '.run_name = "Test Run - $(Build.SourceVersion)" | .notebook_task.notebook_path = "/build/$(notebook_name)-$(Build.SourceVersion).py"' > $(notebook_name)-$(Build.SourceVersion).run.json

# Check the content of the generated execution file
cat $(notebook_name)-$(Build.SourceVersion).run.json
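The template file itself is not reproduced in this article. A plausible sketch of notebook-run.json.tmpl, following the Databricks runs submit JSON schema but with an illustrative (assumed) cluster specification:

{
  "run_name": "placeholder",
  "new_cluster": {
    "spark_version": "5.2.x-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1
  },
  "notebook_task": {
    "notebook_path": "placeholder"
  }
}

The jq expressions above overwrite run_name and notebook_task.notebook_path, so their values in the template are just placeholders.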
4–7. Add a Bash Task at the end of the job. Rename it to Run
Notebook on Databricks. Select Type as Inline and copy the following
script into the Script text area. This executes the notebook
prepared in the build pipeline, i.e. the one you committed through the
Databricks UI, on a job cluster.

echo "##vso[task.setvariable variable=RunId;isOutput=true;]`databricks runs submit --json-file $(notebook_name)-$(Build.SourceVersion).run.json | jq -r .run_id`"
4–8. Add a Bash Task at the end of the job. Rename it to Wait for
Databricks Run to complete. Select Type as Inline and copy the
following script into the Script text area. This waits for the
previously started Databricks job run and gets the execution state from
the run result.

echo "Run Id: $(RunId)"

# Wait until the job run finishes
while [ "`databricks runs get --run-id $(RunId) | jq -r '.state.life_cycle_state'`" != "INTERNAL_ERROR" ] && [ "`databricks runs get --run-id $(RunId) | jq -r '.state.result_state'`" == "null" ]
do
  echo "Waiting for Databricks job run $(RunId) to complete, sleeping for 30 seconds"
  sleep 30
done

# Print run results
databricks runs get --run-id $(RunId)

# If not success, report
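The closing comment is truncated in this copy of the article. A minimal sketch of such a failure check, assuming a run counts as failed whenever result_state is anything other than SUCCESS:

# Illustrative failure check, not from the original article
RESULT_STATE=`databricks runs get --run-id $(RunId) | jq -r '.state.result_state'`
if [ "$RESULT_STATE" != "SUCCESS" ]; then
  echo "Databricks run $(RunId) finished with state $RESULT_STATE"
  exit 1
fi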
4–10. Save the Release Pipeline, and create a release to test the
new pipeline.
Anurag Chatterjee · Mar 3, 2024
Microsoft docs
The focus of this article is the first of the two items suggested in the
Microsoft docs above for promoting a data factory to another
environment. This article provides the code repository structure and
Azure Pipelines templates that comply with the latest
improvements (as of March 2024) suggested by Microsoft for CI/CD
for Azure Data Factory.
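The exact layout is not spelled out in this copy, but the variable values and template paths used in the pipelines below imply a repository structure along these lines (folder roles other than build/, app/adf/, templates/ and vars/ are assumptions):

├── app/
│   └── adf/                      # ADF resources (pipelines, datasets, linked services)
├── build/
│   └── package.json              # npm wrapper for the ADF utilities package
├── templates/
│   ├── template_build.yml        # build (validate + export ARM template) job
│   └── template_deploy.yml       # deployment job
├── vars/
│   ├── dev.yml                   # dev environment variables
│   └── qa.yml                    # QA environment variables
└── azure-pipelines.yml           # main pipeline with Build/DeployDev/DeployQA stages

The build job template, templates/template_build.yml, is shown below.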
# Sample YAML file to validate and export an ARM template into a build artifact
# Requires a package.json file located in the target repository
# Inspired from: https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-delivery-improvements
parameters:
- name: packageJSONFolderPath
  type: string
- name: subscriptionId
  type: string
- name: resourceGroup
  type: string
- name: adfName
  type: string
- name: adfRootFolder
  type: string

jobs:
- job: Build
  timeoutInMinutes: 120
  pool:
    vmImage: 'ubuntu-latest'
  steps:
  # Installs Node and the npm packages saved in your package.json file in the build
  - task: UseNode@1
    inputs:
      version: '18.x'
    displayName: 'Install Node.js'
  - task: Npm@1
    inputs:
      command: 'install'
      workingDir: '$(Build.Repository.LocalPath)/${{ parameters.packageJSONFolderPath }}' # replace with the package.json folder
      verbose: true
    displayName: 'Install npm package'
  - task: Npm@1
    inputs:
      command: 'custom'
      workingDir: '$(Build.Repository.LocalPath)/${{ parameters.packageJSONFolderPath }}' # replace with the package.json folder
      customCommand: 'run build validate $(Build.Repository.LocalPath)/${{ parameters.adfRootFolder }} /subscriptions/${{ parameters.subscriptionId }}/resourceGroups/${{ parameters.resourceGroup }}/providers/Microsoft.DataFactory/factories/${{ parameters.adfName }}'
    displayName: 'Validate'
  # Validate and then generate the ARM template into the destination folder, which is the same as selecting "Publish" from the UX.
  # The ARM template generated isn't published to the live version of the factory. Deployment should be done by using a CI/CD pipeline.
  - task: Npm@1
    inputs:
      command: 'custom'
      workingDir: '$(Build.Repository.LocalPath)/${{ parameters.packageJSONFolderPath }}' # replace with the package.json folder
      customCommand: 'run build export $(Build.Repository.LocalPath)/${{ parameters.adfRootFolder }} /subscriptions/${{ parameters.subscriptionId }}/resourceGroups/${{ parameters.resourceGroup }}/providers/Microsoft.DataFactory/factories/${{ parameters.adfName }} "ArmTemplate"'
    displayName: 'Validate and Generate ARM template'
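The package.json that the first comment refers to is not shown here; per the Microsoft docs linked above, it is a small wrapper that pulls in the @microsoft/azure-data-factory-utilities npm package, roughly:

{
  "scripts": {
    "build": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
  },
  "dependencies": {
    "@microsoft/azure-data-factory-utilities": "^1.0.0"
  }
}

The deployment job template, templates/template_deploy.yml, follows.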
parameters:
- name: subscriptionId
  type: string
- name: resourceGroup
  type: string
- name: adfName
  type: string
- name: adfRootFolder
  type: string
- name: deployEnvironment
  type: string
- name: serviceConnection
  type: string
- name: location
  type: string
- name: overrideParameters
  type: string

jobs:
- deployment: DeployADF
  environment: ${{ parameters.deployEnvironment }}
  displayName: 'Deploy to ${{ parameters.deployEnvironment }} | ADF: ${{ parameters.adfName }}'
  timeoutInMinutes: 120
  pool:
    vmImage: "ubuntu-latest"
  strategy:
    runOnce:
      deploy:
        steps:
        - checkout: none
        # Retrieve the ARM template from the build phase.
        - task: DownloadPipelineArtifact@2
          inputs:
            buildType: 'current'
            artifactName: 'ArmTemplates'
            targetPath: '$(Pipeline.Workspace)'
          displayName: "Retrieve ARM template"
        # Deactivate ADF Triggers before deployment.
        # Sample: https://learn.microsoft.com/en-us/azure/data-factory/continuous-integration-delivery-sample-script
        - task: AzurePowerShell@5
          displayName: Stop ADF Triggers
          inputs:
            scriptType: 'FilePath'
            ConnectedServiceNameARM: ${{ parameters.serviceConnection }}
            scriptPath: $(Pipeline.Workspace)/PrePostDeploymentScript.ps1
            ScriptArguments: -armTemplate "$(Pipeline.Workspace)/ARMTemplateForFactory.json" -ResourceGroupName ${{ parameters.resourceGroup }} -DataFactoryName ${{ parameters.adfName }} -predeployment $true -deleteDeployment $false
            errorActionPreference: stop
            FailOnStandardError: False
            azurePowerShellVersion: 'LatestVersion'
            pwsh: True
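The template appears truncated here: the ARM template deployment step and the post-deployment trigger restart are missing. For completeness, the Microsoft sample script linked above is designed to be run again after deployment with inverted flags; a sketch of that closing task, mirroring the Stop ADF Triggers task above rather than taken from the original article:

        - task: AzurePowerShell@5
          displayName: Restart ADF Triggers
          inputs:
            scriptType: 'FilePath'
            ConnectedServiceNameARM: ${{ parameters.serviceConnection }}
            scriptPath: $(Pipeline.Workspace)/PrePostDeploymentScript.ps1
            ScriptArguments: -armTemplate "$(Pipeline.Workspace)/ARMTemplateForFactory.json" -ResourceGroupName ${{ parameters.resourceGroup }} -DataFactoryName ${{ parameters.adfName }} -predeployment $false -deleteDeployment $true
            errorActionPreference: stop
            FailOnStandardError: False
            azurePowerShellVersion: 'LatestVersion'
            pwsh: True

The main pipeline that stitches the two templates together follows.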
variables:
- name: packageJSONFolderPath
  value: build/
- name: adfRootFolder
  value: app/adf/

trigger: none

name: "Build and deploy Azure Data Factory pipelines"

stages:
- stage: Build
  displayName: Build
  variables:
  - template: vars/dev.yml
  jobs:
  - template: templates/template_build.yml
    parameters:
      packageJSONFolderPath: ${{ variables.packageJSONFolderPath }}
      subscriptionId: ${{ variables.subscriptionId }}
      resourceGroup: ${{ variables.resourceGroup }}
      adfName: ${{ variables.adfName }}
      adfRootFolder: ${{ variables.adfRootFolder }}
- stage: DeployDev
  dependsOn: Build
  condition: succeeded()
  displayName: Deploy ADF pipelines to dev ADF
  variables:
  - template: vars/dev.yml
  jobs:
  - template: templates/template_deploy.yml
    parameters:
      subscriptionId: ${{ variables.subscriptionId }}
      resourceGroup: ${{ variables.resourceGroup }}
      adfName: ${{ variables.adfName }}
      adfRootFolder: ${{ variables.adfRootFolder }}
      deployEnvironment: ${{ variables.deployEnvironment }}
      serviceConnection: ${{ variables.serviceConnection }}
      location: ${{ variables.location }}
      overrideParameters: ${{ variables.overrideParameters }}
- stage: DeployQA
  displayName: Deploy ADF pipelines to QA ADF
  variables:
  - template: vars/qa.yml
  jobs:
  - template: templates/template_deploy.yml
    parameters:
      subscriptionId: ${{ variables.subscriptionId }}
      resourceGroup: ${{ variables.resourceGroup }}
      adfName: ${{ variables.adfName }}
      adfRootFolder: ${{ variables.adfRootFolder }}
      deployEnvironment: ${{ variables.deployEnvironment }}
      serviceConnection: ${{ variables.serviceConnection }}
      location: ${{ variables.location }}
      overrideParameters: ${{ variables.overrideParameters }}
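The variable templates themselves are not shown either; based on the parameters consumed above, vars/dev.yml would look something like this, with every value a placeholder for your environment:

variables:
- name: subscriptionId
  value: <your-subscription-id>
- name: resourceGroup
  value: <your-dev-resource-group>
- name: adfName
  value: <your-dev-adf-name>
- name: deployEnvironment
  value: dev
- name: serviceConnection
  value: <your-service-connection>
- name: location
  value: <your-azure-region>
- name: overrideParameters
  value: '-factoryName <your-dev-adf-name>'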
I hope this article brings together the different resources in the
Microsoft docs on how to set up the new CI/CD flow for Azure
Data Factory (ADF) using the npm package, and that you are able to set
up the same for your ADF projects.