
PYSPARK PARAMETERS (WIDGETS)

WIDGET:
A USER INTERFACE ITEM. A PROMPT FOR END USER INPUT IN THE NOTEBOOK INTERFACE.
A WIDGET IS USED TO ACCEPT INPUT VALUES FROM USERS [EX: SOURCE FILE PATH, DESTINATION
SERVER, DATABASE, USER NAME, PASSWORD, ETC.].

WE DEFINE WIDGETS IN A PYSPARK CELL INSIDE THE NOTEBOOK.


TYPES OF WIDGETS (PARAMETER DEFINITIONS) IN SPARK CLUSTERED ENVIRONMENT:
1. TEXT WIDGET
2. DROPDOWN WIDGET
3. COMBOBOX WIDGET
4. MULTISELECT WIDGET
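
FOR ILLUSTRATION, A MINIMAL SKETCH THAT CREATES ONE WIDGET OF EACH TYPE IN A NOTEBOOK CELL (THE WIDGET NAMES AND VALUES BELOW ARE HYPOTHETICAL; dbutils IS PREDEFINED IN DATABRICKS NOTEBOOKS):

# Text widget: free-form input box
dbutils.widgets.text("Env", "dev")

# Dropdown widget: pick exactly one value from a fixed list
dbutils.widgets.dropdown("Region", "east", ["east", "west", "north"])

# Combobox widget: pick from the list or type a new value
dbutils.widgets.combobox("FileFormat", "csv", ["csv", "parquet", "json"])

# Multiselect widget: pick one or more values
dbutils.widgets.multiselect("Countries", "India", ["India", "USA", "UK"])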

IMPLEMENTATION STEPS:
STEP 1: CREATE A PARAMETER USING THE PREDEFINED dbutils.widgets UTILITY
STEP 2: READ THE PARAMETER VALUE INTO A VARIABLE
STEP 3: USE THE VARIABLE FOR ACTUAL CELL EXECUTION
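
AS A COMPACT SKETCH OF THE THREE STEPS TOGETHER (THE WIDGET NAME AND CHOICES ARE HYPOTHETICAL):

# STEP 1: CREATE THE PARAMETER (HERE, A DROPDOWN WIDGET)
dbutils.widgets.dropdown("Country", "India", ["India", "USA", "UK"])

# STEP 2: READ THE PARAMETER VALUE INTO A VARIABLE
varCountry = dbutils.widgets.get("Country")

# STEP 3: USE THE VARIABLE IN THE ACTUAL CELL LOGIC
print(f"Selected country: {varCountry}")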

LOGIN TO AZURE PORTAL > GO TO DATABRICKS WORKSPACE > START THE CLUSTER.
UPLOAD GIVEN CSV FILE TO DBFS (IGNORE IF ALREADY DONE THIS EARLIER). DOCUMENT THE FILE PATH:
/FileStore/tables/SalesData.csv

REQUIREMENT:
HOW TO PARAMETERIZE DATA IMPORTS INTO A SPARK DATABASE?
SOURCE FILE PATH NEEDS TO BE DYNAMIC.
TARGET SPARK TABLE NAME NEEDS TO BE DYNAMIC.

SOLUTION:
CREATE A PYTHON NOTEBOOK.
IMPLEMENT THE CELLS BELOW:

CELL 1: TO READ THE METADATA ABOUT WIDGETS


dbutils.widgets.help()

dbutils.widgets.text(name, defaultValue)
Creates a text input widget with a given name and default value

dbutils.widgets.combobox(name, defaultValue, choices)
Creates a combobox input widget with a given name, default value, and list of choices

dbutils.widgets.dropdown(name, defaultValue, choices)
Creates a dropdown input widget with a given name, default value, and list of choices

dbutils.widgets.multiselect(name, defaultValue, choices)
Creates a multiselect input widget with a given name, default value, and list of choices

dbutils.widgets.get(name)
Retrieves the current value of an input widget

dbutils.widgets.remove(name)
Removes an input widget from the notebook

dbutils.widgets.removeAll()
Removes all widgets from the notebook
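
FOR CLEANUP, A SHORT SKETCH USING THE REMOVAL CALLS ABOVE (THE WIDGET NAME IS HYPOTHETICAL):

# Remove a single widget by name
dbutils.widgets.remove("Env")

# Or remove every widget defined in this notebook
dbutils.widgets.removeAll()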

CELL 2: DEFINE A NEW WIDGET (NEW PARAMETER) FOR THIS NOTEBOOK


dbutils.widgets.text("FilePath","")

CELL 3: READ THE ABOVE PARAMETER VALUE INTO A VARIABLE.


FOR THIS, SUPPLY THE FILE PATH VALUE TO THE ABOVE DEFINED PARAMETER:
/FileStore/tables/SalesData.csv

THEN RUN BELOW COMMANDS IN THE NOTEBOOK CELL:


varFilePath = dbutils.widgets.get("FilePath")
varFilePath

CELL 4: READ DATA FROM ABOVE INPUT FILE (PARAMETERIZED) INTO A DATAFRAME
dataframe1 = spark.read.csv(varFilePath, header="true")
display(dataframe1)
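
A MINOR VARIATION OF CELL 4, ASSUMING COLUMN TYPES SHOULD BE DETECTED AUTOMATICALLY (inferSchema ASKS SPARK TO SCAN THE FILE AND INFER TYPES INSTEAD OF READING EVERY COLUMN AS STRING):

# Read the same parameterized path, letting Spark infer column types
dataframe1 = spark.read.csv(varFilePath, header=True, inferSchema=True)
display(dataframe1)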

CELL 5: CREATE A TEMP VIEW:


dataframe1.createOrReplaceTempView("vwTempSales")

CELL 6: FILTER, AGGREGATE (TRANSFORMATIONS) THE DATA


%sql
select country, company, sum(sale2018) as sales2018, sum(sale2019) as sales2019, sum(sale2020) as sales2020
from vwTempSales
where country != 'India'
group by country, company

CELL 7: LOAD THE AGGREGATED DATA INTO ANOTHER DATA FRAME


df2 = spark.sql("""select country, company, sum(sale2018) as sales2018, sum(sale2019) as sales2019,
sum(sale2020) as sales2020 from vwTempSales where country != 'India' group by country, company""")

CELL 8: CREATE A PARAMETER TO DEFINE THE SPARK TABLE


dbutils.widgets.combobox("SparkTableName", "SparkTable1", ["SparkTable1", "SparkTable2", "SparkTable3"])

CELL 9: READ THE PARAMETER VALUE


sparktablevar = dbutils.widgets.get("SparkTableName")
sparktablevar
CELL 10: CREATE THE SPARK TABLE
df2.write.format("parquet").saveAsTable(sparktablevar)
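
A SMALL VARIATION, ASSUMING THE NOTEBOOK MAY BE RE-RUN: saveAsTable FAILS IF THE TABLE ALREADY EXISTS, SO mode("overwrite") CAN BE ADDED TO REPLACE IT:

# Overwrite the table if it already exists, so the cell is re-runnable
df2.write.format("parquet").mode("overwrite").saveAsTable(sparktablevar)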

CELL 11: TEST THE SPARK TABLE


df3 = spark.sql(f'select * from {sparktablevar}')
display(df3)

--------
Task 1: How to load data from ADLS to a Spark Table with a Dynamic (Parameterized) Access Key?
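
A possible sketch for Task 1, assuming access via an ABFS account key; the widget name, storage account, container, and file path below are hypothetical:

# Accept the ADLS access key as a parameter (a secret scope is safer in production)
dbutils.widgets.text("AccessKey", "")
varAccessKey = dbutils.widgets.get("AccessKey")

# Configure Spark to authenticate to the storage account with the key
spark.conf.set("fs.azure.account.key.mystorageacct.dfs.core.windows.net", varAccessKey)

# Read from ADLS and save as a Spark table
dfAdls = spark.read.csv("abfss://mycontainer@mystorageacct.dfs.core.windows.net/SalesData.csv", header=True)
dfAdls.write.format("parquet").mode("overwrite").saveAsTable("SalesFromAdls")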

Task 2: How to load data from ADLS to a Spark Table with a Dynamic (Parameterized) File Format and File Path?
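
A possible sketch for Task 2; the widget names and the target table name below are hypothetical:

# Parameterize both the file format and the file path
dbutils.widgets.combobox("FileFormat", "csv", ["csv", "parquet", "json"])
dbutils.widgets.text("AdlsPath", "")
varFormat = dbutils.widgets.get("FileFormat")
varPath = dbutils.widgets.get("AdlsPath")

# The generic reader picks the source type from the format parameter
dfDyn = spark.read.format(varFormat).option("header", "true").load(varPath)
dfDyn.write.format("parquet").mode("overwrite").saveAsTable("DynamicLoadTable")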

Task 3: How to load data from ADLS to a Spark Table with Parameterized Data Filters for the Aggregation Query?
Example: In the below aggregation query, the country value should be parameterized:
select country, company, sum(sale2018) as sales2018, sum(sale2019) as sales2019, sum(sale2020) as sales2020
from vwTempSales where country != 'India'
group by country, company
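
A possible sketch for Task 3, parameterizing the country filter (the widget name is hypothetical; assumes the vwTempSales view from CELL 5 exists):

# Accept the country to exclude as a parameter
dbutils.widgets.text("ExcludeCountry", "India")
varCountry = dbutils.widgets.get("ExcludeCountry")

# Substitute the parameter into the aggregation query
df4 = spark.sql(f"""select country, company, sum(sale2018) as sales2018, sum(sale2019) as sales2019,
sum(sale2020) as sales2020 from vwTempSales where country != '{varCountry}' group by country, company""")
display(df4)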
