Data Exploitation Notebook with Spark
In this Notebook example, we will load a file containing data from Australian taxes, then we will convert it to a RDD Spark and lastly we will view it with Spark SQL.
To do this:
- First you need to have the role ANALYTICS. This role allows you to access different tools, such as Notebooks and DataFlow. By default, only the Administrator user can give this role. Once you are logged in the platform with this role, you will see a menu option ANALYTICS TOOLS including a set of tools.
- Go to ANALYTICS TOOLS>My Notebooks. From here, click on  and give your Notebook a name. Once created, it will take you to a screen where you can write your Notebook.
- You need the file. In this case, you will be using a DataSet from the Australian Government Open Data Portal
https://data.gov.au/dataset/
This will be your DataSet: https://data.gov.au/dataset/health-expenditure-in-australia - Using the Shell interpreter, bring the file to the /tmp/ folder and remove empty fields:
%sh
#remove existing copies of dataset
rm /tmp/australian-expenses.csv#fetch the dataset
wget https://data.gov.au/dataset/f84b9baf-c1c1-437c-8c1e-654b2829848c/resource/88399d53-d55c-466c-8f4a-6cb965d24d6d/download/healthexpenditurebyareaandsource.csv -O /tmp/australian-expenses.csv#remove header
sed -i '1d' /tmp/australian-expenses.csv
#remove empty fields
sed -i "s/,,,,,//g" /tmp/australian-expenses.csv
sed -i '/^\s*$/d' /tmp/australian-expenses.csv - After this, you can create an RDD Spark and start working on the RDD:
- To work with Spark SQL, register the RDD as a Table following the order of the file fields:
- Finally, you can make SQL queries on the health_table such as:
Expenditure (in billions) by state:Â%sql
select state, sum(spending)/1000 SpendinginBillionsÂ
from health_tableÂ
group by stateÂ
order by SpendinginBillions desc
Expenditure (In Billions) By Year%sql
select year,sum(spending)/1000 SpendinginBillionsÂ
from health_tableÂ
group by yearÂ
order by SpendinginBillionsoÂ
Expenditure (in billions) by area
%sql
select category, sum(spending)/1000 SpendinginBillionsÂ
from health_tableÂ
group by categoryÂ
order by SpendinginBillions descEasy, right? You can combine the results of this Notebook with the platform's interpreter to insert these data in an ontology as we will see in this tutorial: (Notebooks) Usando el intérprete de la plataforma dentro de los Notebooks