Data Exploitation Notebook with Spark

In this Notebook example, we will load a file containing data from Australian taxes, then we will convert it to a RDD Spark and lastly we will view it with Spark SQL.

To do this:

  1. First you need to have the role ANALYTICS. This role allows you to access different tools, such as Notebooks and DataFlow. By default, only the Administrator user can give this role. Once you are logged in the platform with this role, you will see a menu option ANALYTICS TOOLS including a set of tools.
  2. Go to ANALYTICS TOOLS>My Notebooks. From here, click on   and give your Notebook a name. Once created, it will take you to a screen where you can write your Notebook.
  3. You need the file. In this case, you will be using a DataSet from the Australian Government Open Data Portal
    https://data.gov.au/dataset/
    This will be your DataSet: https://data.gov.au/dataset/health-expenditure-in-australia
  4. Using the Shell interpreter, bring the file to the /tmp/ folder and remove empty fields:
    %sh
    #remove existing copies of dataset
    rm /tmp/australian-expenses.csv
    #fetch the dataset
    wget https://data.gov.au/dataset/f84b9baf-c1c1-437c-8c1e-654b2829848c/resource/88399d53-d55c-466c-8f4a-6cb965d24d6d/download/healthexpenditurebyareaandsource.csv -O /tmp/australian-expenses.csv
    #remove header
    sed -i '1d' /tmp/australian-expenses.csv
    #remove empty fields
    sed -i "s/,,,,,//g" /tmp/australian-expenses.csv
    sed -i '/^\s*$/d' /tmp/australian-expenses.csv
  5. After this, you can create an RDD Spark and start working on the RDD:
  6. To work with Spark SQL, register the RDD as a Table following the order of the file fields:
  7. Finally, you can make SQL queries on the health_table such as:
    Expenditure (in billions) by state: 

    %sql
    select state, sum(spending)/1000 SpendinginBillions 
    from health_table 
    group by state 
    order by SpendinginBillions desc



    Expenditure (In Billions) By Year

    %sql
    select year,sum(spending)/1000 SpendinginBillions 
    from health_table 
    group by year 
    order by SpendinginBillions

    o 

    Expenditure (in billions) by area

    %sql
    select category, sum(spending)/1000 SpendinginBillions 
    from health_table 
    group by category 
    order by SpendinginBillions desc


    Easy, right? You can combine the results of this Notebook with the platform's interpreter to insert these data in an ontology as we will see in this tutorial: (Notebooks) Usando el intérprete de la plataforma dentro de los Notebooks