Data Exploitation Notebook with Spark
In this Notebook example, we will load a file containing data from Australian taxes, then we will convert it to a RDD Spark and lastly we will view it with Spark SQL.
To do this:
- First you need to have the role ANALYTICS. This role allows you to access different tools, such as Notebooks and DataFlow. By default, only the Administrator user can give this role. Once you are logged in the platform with this role, you will see a menu option ANALYTICS TOOLS including a set of tools.
- Go to ANALYTICS TOOLS>My Notebooks. From here, click on and give your Notebook a name. Once created, it will take you to a screen where you can write your Notebook.
- You need the file. In this case, you will be using a DataSet from the Australian Government Open Data Portal
https://data.gov.au/dataset/
This will be your DataSet: https://data.gov.au/dataset/health-expenditure-in-australia - Using the Shell interpreter, bring the file to the /tmp/ folder and remove empty fields:
%sh
#remove existing copies of dataset
rm /tmp/australian-expenses.csv#fetch the dataset
wget https://data.gov.au/dataset/f84b9baf-c1c1-437c-8c1e-654b2829848c/resource/88399d53-d55c-466c-8f4a-6cb965d24d6d/download/healthexpenditurebyareaandsource.csv -O /tmp/australian-expenses.csv#remove header
sed -i '1d' /tmp/australian-expenses.csv
#remove empty fields
sed -i "s/,,,,,//g" /tmp/australian-expenses.csv
sed -i '/^\s*$/d' /tmp/australian-expenses.csv - After this, you can create an RDD Spark and start working on the RDD:
- To work with Spark SQL, register the RDD as a Table following the order of the file fields:
- Finally, you can make SQL queries on the health_table such as:
Expenditure (in billions) by state:%sql
select state, sum(spending)/1000 SpendinginBillions
from health_table
group by state
order by SpendinginBillions desc
Expenditure (In Billions) By Year%sql
select year,sum(spending)/1000 SpendinginBillions
from health_table
group by year
order by SpendinginBillionso
Expenditure (in billions) by area
%sql
select category, sum(spending)/1000 SpendinginBillions
from health_table
group by category
order by SpendinginBillions descEasy, right? You can combine the results of this Notebook with the platform's interpreter to insert these data in an ontology as we will see in this tutorial: (Notebooks) Usando el intérprete de la plataforma dentro de los Notebooks