HPCC Data Tutorial: Boca Raton Documentation Team
Please include Documentation Feedback in the subject line and reference the document name, page numbers, and current Version Number in
the text of the message.
LexisNexis and the Knowledge Burst logo are registered trademarks of Reed Elsevier Properties Inc., used under license.
Other products and services may be trademarks or registered trademarks of their respective companies.
All names and example data used in this manual are fictitious. Any similarity to actual persons, living or dead, is purely coincidental.
Introduction
    The ECL Development Process
Working with Data
    The Original Data
    Begin Coding
    Publishing your Thor Query
    Compile and Publish the Roxie Query
Summary
Introduction
The ECL Development Process
This tutorial provides a walk-through of the development process, from beginning to end, and is designed to be an
introduction to working with data on any HPCC Systems platform [1]. We will write code in ECL [2] to process our
data and query it.
This tutorial assumes you have a running HPCC. This can be a VM Edition or a single-node or multinode HPCC platform.
The download is approximately 30 MB (compressed) and is available in either ZIP or .tar.gz format. Choose the
appropriate link.
Spray the file to a Data Refinery (Thor) cluster. HPCC clusters "spray" data into file parts on each node.
A spray or import is the relocation of a data file from one location to an HPCC cluster. The term spray was adopted
due to the nature of the file movement: the file is partitioned across all nodes within a cluster.
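A spray can also be performed programmatically from ECL using the Standard Library's SprayFixed function. The sketch below is illustrative only: the IP address, dropzone path, and cluster name are placeholders you must replace with values from your own environment.

IMPORT Std;

// Sketch: spray a fixed-length file from the landing zone to Thor.
// The IP, path, and cluster name below are placeholders.
STD.File.SprayFixed('nnn.nnn.nnn.nnn',                       // landing zone IP
                    '/var/lib/HPCCSystems/mydropzone/OriginalPerson',
                    124,                                      // fixed record size
                    'mythor',                                 // target cluster group
                    '~tutorial::YN::OriginalPerson',          // target logical name
                    ,,,TRUE);                                 // allow overwrite

This performs the same operation as the ECL Watch spray steps described later in this tutorial.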
Deploy them to a Rapid Data Delivery Engine (RDDE) cluster, also known as a Roxie cluster.
[1] High Performance Computing Cluster (HPCC) is a massively parallel processing computing platform that solves Big Data problems. See http://www.hpccsystems.com/Why-HPCC/How-it-works for more details.
[2] Enterprise Control Language (ECL) is a declarative, data-centric programming language used to manage all aspects of the massive data joins, sorts, and builds that truly differentiate HPCC (High Performance Computing Cluster) from other technologies in its ability to provide flexible data analysis on a massive scale.
[3] The ECL IDE (Integrated Development Environment) is the tool used to create queries into your data and the ECL files with which to build your queries.
This gives us a record length of 124 (the total of all field lengths). You will need to know this length for the File
Spray process.
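The record structure used throughout this tutorial looks like the following. The field widths shown are those of the standard OriginalPerson sample file (verify them against your copy); the running totals in the comments show how they sum to the 124-byte record length:

Layout_People := RECORD
  STRING15 FirstName;   // running total:  15
  STRING25 LastName;    // running total:  40
  STRING15 MiddleName;  // running total:  55
  STRING5  Zip;         // running total:  60
  STRING42 Street;      // running total: 102
  STRING20 City;        // running total: 122
  STRING2  State;       // running total: 124 (total record length)
END;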
For smaller data files, you can use the upload/download file utility in ECL Watch (a Web-based interface to your
HPCC platform). The sample data file is approximately 100 MB.
1. Download the sample data file from the HPCC Systems portal.
3. In your browser, go to the ECL Watch URL. For example, http://nnn.nnn.nnn.nnn:8010, where nnn.nnn.nnn.nnn
is your ESP [1] Server's IP address.
Your IP address could be different from the ones provided in the example images. Please use the IP
address provided by your installation.
[1] The ESP (Enterprise Services Platform) Server is the communication layer server in your HPCC environment.
4. From the ECL Watch home page, click on the Files icon, then click the Landing Zones link from the navigation
sub-menu.
Figure 1. Upload/download
Once you press the Upload button, a dialog opens where you can choose a file to upload.
5. Browse the files on your local machine, select the file to upload, and then press the Open button.
In this example, the file is on your Landing Zone and is named OriginalPerson.
We are going to spray it to our Thor cluster and give it a logical name of tutorial::YN::OriginalPerson where YN are
your initials. The Distributed File Utility maintains a list of logical files and their corresponding physical file locations.
http://nnn.nnn.nnn.nnn:pppp (where nnn.nnn.nnn.nnn is your ESP Server's IP address and pppp is the
port. The default port is 8010)
2. From the ECL Watch home page, click on the Files icon, then click the Landing Zones link from the navigation
sub-menu.
On the Landing Zones tab, click on the arrow next to your mydropzone container to expand the list of uploaded files.
Figure 4. mydropzone
Find the file you want to spray in the list (OriginalPerson) and check the box next to the file name to select it.
Once you select the file from the list, the Spray action buttons become enabled.
3. Press the Fixed action button. This indicates that you are spraying a fixed width file.
4. The Target name field is automatically filled in with the selected file.
7. Fill in the Target Scope using the naming convention described earlier: tutorial::YN (remember, YN are your
initials).
Note: This option is only available on systems where replication has been enabled.
10. The workunit details page displays. You can view the progress of the spray.
Begin Coding
In this portion of the tutorial, we will write ECL code to define the data file and execute simple queries on it so we
can evaluate it and determine any necessary pre-processing.
1. Start the ECL IDE (Start >> All Programs >> HPCC Systems >> ECL IDE )
3. Right-click on the My Files folder in the Repository window, and select Insert Folder from the pop-up menu.
For purposes of this tutorial, let's create a folder called TutorialYourName (where YourName is your name).
4. Enter TutorialYourName (where YourName is your name) for the label, then press the OK button.
5. Right-click on the TutorialYourName folder, and select Insert File from the pop-up menu.
Notice that some text has been written for you in the window. This helps you to remember that the name of the file
(Layout_People) must always exactly match the name of the single EXPORT definition (Layout_People) contained
in that file. This is a requirement -- one EXPORT definition per file, and its name must match the filename.
8. Press the syntax check button on the main toolbar (or press F7).
This file defines the record structure for the data file. Next, we will examine the data.
1. Right-click on the TutorialYourName Folder, and select Insert File from the pop-up menu.
4. Press the syntax check button on the main toolbar (or press F7) to check the syntax.
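The code for this file (File_OriginalPerson, which is referenced in the next step) follows this pattern. This is a sketch assuming the Layout_People record structure and the logical file name used earlier; replace YN with your initials:

IMPORT TutorialYourName;

// Define the sprayed logical file as a THOR dataset
// using the record layout created earlier.
EXPORT File_OriginalPerson :=
    DATASET('~tutorial::YN::OriginalPerson',
            TutorialYourName.Layout_People, THOR);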
5. Open a new Builder Window (CTRL+N) and write the following code (remember to replace YourName with your
name):
IMPORT TutorialYourName;
COUNT(TutorialYourName.File_OriginalPerson);
6. Press the syntax check button on the main toolbar (or press F7) to check the syntax.
7. Make sure the selected cluster is your Thor cluster, then press the Submit button. Note that your target cluster
might have a different name.
9. Select the Workunit tab (the one with the number next to the checkmark) and select the Result 1 tab (it may already
be selected).
This shows us that there are 841,400 records in the data file.
10. Select the Builder tab and change COUNT to OUTPUT, as shown below:
IMPORT TutorialYourName;
OUTPUT(TutorialYourName.File_OriginalPerson);
12.When it completes, select the Workunit tab, then select the Result 1 tab.
For our purposes, it will be easier to have all the names in all uppercase. This demonstrates one of the steps in the
basic process of preparing data (Extract, Transform, and Load, or ETL) using ECL.
1. Right-click on the TutorialYourName Folder, and select Insert File from the pop-up menu.
2. Name this one BWR_ProcessRawData and write the following code (changing YN and YourName as before):
IMPORT TutorialYourName, Std;
TutorialYourName.Layout_People toUpperPlease(TutorialYourName.Layout_People pInput)
:= TRANSFORM
SELF.FirstName := Std.Str.ToUpperCase(pInput.FirstName);
SELF.LastName := Std.Str.ToUpperCase(pInput.LastName);
SELF.MiddleName := Std.Str.ToUpperCase(pInput.MiddleName);
SELF.Zip := pInput.Zip;
SELF.Street := pInput.Street;
SELF.City := pInput.City;
SELF.State := pInput.State;
END;
OrigDataset := TutorialYourName.File_OriginalPerson;
UpperedDataset := PROJECT(OrigDataset,toUpperPlease(LEFT));
OUTPUT(UpperedDataset,,'~tutorial::YN::TutorialPerson',OVERWRITE);
4. When it completes, select the Workunit tab, then select the Result 1 tab.
The results show that the process has successfully converted the name fields to uppercase.
In the DATASET definition, we will add a virtual field to the RECORD structure for the fileposition. This is required
for indexes.
1. Insert a File into the TutorialYourName Folder. Name it File_TutorialPerson and write this code (changing YN
to your initials):
IMPORT TutorialYourName;
EXPORT File_TutorialPerson :=
DATASET('~tutorial::YN::TutorialPerson',
{TutorialYourName.Layout_People,
UNSIGNED8 fpos {virtual(fileposition)}},THOR);
1. Insert a File into your Tutorial Folder. Name it IDX_PeopleByZIP and write this code (changing YN and YourName
as before):
IMPORT TutorialYourName;
EXPORT IDX_PeopleByZIP :=
INDEX(TutorialYourName.File_TutorialPerson,{zip,fpos},'~tutorial::YN::PeopleByZipINDEX');
3. Insert a File into the TutorialYourName Folder and name it BWR_BuildPeopleByZip and write this code (re-
placing YourName with your name):
IMPORT TutorialYourName;
BUILDINDEX(TutorialYourName.IDX_PeopleByZIP,OVERWRITE);
4. Check the syntax and if there are no errors, press the Submit button.
5. Wait for the Workunit to complete, then close the Builder Window.
Build a Query
Now that we have an index file, we will write a query that uses it.
1. Insert a File into your Tutorial Folder. Name it BWR_FetchPeopleByZip and write this code (changing YourName
as before):
IMPORT TutorialYourName;
ZipFilter := '33024';
FetchPeopleByZip :=
FETCH(TutorialYourName.File_TutorialPerson,
TutorialYourName.IDX_PeopleByZIP(zip=ZipFilter),
RIGHT.fpos);
OUTPUT(FetchPeopleByZip);
2. Check the syntax and if there are no errors, press the Submit button.
3. When it completes, select the Workunit tab, then select the Result tab.
4. Examine the result, then close the Builder window and resubmit the code.
Note: You can change the value of the ZipFilter definition to get results for different ZIP codes.
STORED definitions provide a means to pass values as query parameters. In this example, the user can supply the
ZIP code so the results are people from that ZIP code.
IMPORT TutorialYourName;
STRING10 ZipFilter := '' :STORED('ZIPValue');
resultSet :=
FETCH(TutorialYourName.File_TutorialPerson,
TutorialYourName.IDX_PeopleByZIP(zip=ZipFilter),
RIGHT.fpos);
OUTPUT(resultSet);
5. When the workunit completes, select the Workunit tab, then select the ECL Watch tab.
The Publish dialog displays, with the Job Name field automatically filled in. You can add a comment in the Comment
field if you wish, then press Submit.
7. If there are no error messages, the workunit is published. Leave the builder window open; you will need it again later.
http://nnn.nnn.nnn.nnn:pppp (where nnn.nnn.nnn.nnn is your ESP Server's IP address and pppp is the port.
The default port is 8002)
3. Provide a ZIP code (e.g., 33024) in the ZIPValue field. Select Output Tables from the drop list, then press the
Submit button.
We will recompile the code with Roxie as the target cluster, then publish it to a Roxie cluster.
1. In the ECL IDE, select the Builder tab on the FetchPeopleByZipService file builder window.
2. Using the Target drop list, select Roxie as the Target cluster.
3. In the Builder window, in the upper left corner the Submit button has a drop down arrow next to it. Select the
arrow to expose the Compile option.
4. Select Compile.
5. When the workunit finishes, it will display a green circle indicating it has compiled.
1. Select the workunit tab for the FetchPeopleByZipService that you just compiled.
2. Press the Publish action button, then verify the information in the dialog and press Submit.
http://nnn.nnn.nnn.nnn:pppp (where nnn.nnn.nnn.nnn is your ESP Server's IP address and pppp is the port.
The default port is 8002)
3. Provide a ZIP code (e.g., 33024), select Output Tables from the drop list, and press the Submit button.
Summary
Now that you have successfully sprayed raw data onto a cluster, processed it, and deployed it to an RDDE cluster,
what's next?
Here is a short list of suggestions on the path you might take from here:
Write client applications to access your queries using JSON or SOAP interfaces.
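For example, a published query can be called from ECL itself with SOAPCALL. The sketch below is an assumption-laden illustration: it presumes the WsEcl SOAP endpoint on the default port 8002, the query name used in this tutorial, and the result fields of the sample layout. Adjust the URL, target cluster, query name, and output record to match your environment:

// Output record: assumed to mirror the sample file's layout.
OutRec := RECORD
  STRING15 FirstName;
  STRING25 LastName;
  STRING15 MiddleName;
  STRING5  Zip;
  STRING42 Street;
  STRING20 City;
  STRING2  State;
END;

// Call the published Roxie query through the WsEcl SOAP interface.
// The URL and query name below are placeholders.
people := SOAPCALL(
    'http://nnn.nnn.nnn.nnn:8002/WsEcl/soap/query/roxie/fetchpeoplebyzipservice',
    'fetchpeoplebyzipservice',          // published query name
    {STRING10 ZIPValue := '33024'},     // input, matching the STORED name
    DATASET(OutRec));
OUTPUT(people);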
The Links tab provides easy access to a form, a Sample Request, a Sample Response, the WSDL, the XML Schema
(XSD) and more...