ACEA Smart Water Analytics

ACEA Smart Water Analytics Competition; Introduction

In this blog series I will document my work on the Kaggle ACEA Smart Water Analytics Competition. The purpose of these blogs is not to share perfect code or completely worked-out solutions. Rather, I want to take you along for the ride and share my insights as they come. There will definitely be some failures, and hopefully a success mixed in. In this series I will discuss four distinct steps in my project.

If you want to follow along with the code in this blog series, sign up for the challenge here: https://www.kaggle.com/c/acea-water-prediction/overview. You can download the dataset for free after signing up.

The challenge: ACEA Smart Water Analytics Competition

The goal of this challenge is to predict water levels in a collection of different water bodies in Italy. Specifically, we have to build a time series model: accurately assess tomorrow's water level based on today's data. This is explicitly an analytics challenge, which means creating a compelling story and notebook is an important part of the submission. My notebook is publicly available on Kaggle here. I will work through some code excerpts and interesting highlights in these blogs.

For today I want to address my initial findings and explore how I thought through the initial steps of this challenge.

Designing my load function

One of the first steps everyone undertakes is (obviously) loading the data. My personal preference is to condense this code into a load function. The main reason is that it creates clean and concise notebooks: it allows me to build code chunks that are independent of each other by loading data in the notebook as needed.

My load function will typically contain the following three components:

  • Filepath to data
  • Column based filters
  • Merge functionality

In the ACEA dataset this worked out as follows:

# Directories:
maindir   <- 'C:/Prive/Kaggle/ACEA'
datadir   <- 'Data'
scriptdir <- 'Scripts'
htmldir   <- 'HTML'

# Load files: list all csv files in the data directory
files_loc <- list.files(path = paste(maindir, datadir, sep = '/'), recursive = TRUE, full.names = TRUE)
files_loc <- files_loc[grep('\\.csv$', files_loc)]

load_files <- function(path, filter = NULL){
    # Read a csv and tag each row with the file name (without extension)
    output <- read.csv(path, stringsAsFactors = FALSE, check.names = FALSE, fileEncoding = "UTF-8-BOM")
    output$name <- gsub('\\.csv', '', basename(path))
    # Optionally keep only the requested columns
    if(!is.null(filter)){
        return(output[, filter])
    }
    return(output)
}

data_auser <- load_files(files_loc[1])
data_doganella <- load_files(files_loc[2])
data_luco <- load_files(files_loc[3])
data_petrignano <- load_files(files_loc[4])
data_bilancino <- load_files(files_loc[5])
data_arno <- load_files(files_loc[6])
data_amiata <- load_files(files_loc[7])
data_lupa <- load_files(files_loc[8])
data_madonna <- load_files(files_loc[9])

As you can see, there is currently no merge functionality in my load function yet. I imagine I will add it once I see a use for it later in the project. The crux here is that if you are faced with multiple datasets that logically have to be combined, incorporating a merge step into your load function can keep the notebook clean, concise and easy to understand. The load function does allow us to load specific columns; we will see in the next chapter how this is already useful.
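To give an idea of what that merge step could look like, here is a minimal sketch. The helper name `load_files_merged` is mine, and the assumption that the datasets share a `Date` column to join on is mine as well; it is not part of the notebook yet.

```r
# Hypothetical extension of the load function: read several csv files and
# join them on a shared 'Date' column (an assumption; adjust the key as needed).
load_files_merged <- function(paths, filter = NULL) {
    datasets <- lapply(paths, load_files, filter = filter)
    # Full outer join so no dates are dropped from either dataset
    Reduce(function(x, y) merge(x, y, by = 'Date', all = TRUE), datasets)
}
```

Using `Reduce` keeps the function working for any number of files, not just two.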

Temporal elements of ACEA dataset

One of the main discussion points of the ACEA dataset is its temporal nature: there are time-based elements in the dataset. The data contains multiple types of columns, such as rainfall, temperature, hydrometric and groundwater levels. The final challenge is to build a time series model: we have to predict the water level of tomorrow. This means there are some initial assumptions to be challenged or tested:

  • Trends over time
  • Seasonality
  • Data availability over time

One of the cornerstones of our narrative needs to be time-based trends. The data runs back to 1998 for some variables (see the visualization below), which means the different explanatory variables may contain long-running trends. For example, I assume that rainfall is decreasing over time and temperature increasing, due to global warming. It is important to test these assumptions. When thinking of seasonality, we may furthermore assume that water levels are likely lower in summer and higher in winter or early spring.
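A quick way to test the seasonality assumption is to average a rainfall column by month. The sketch below is illustrative, not from my notebook: the helper name `monthly_mean` is mine, and `Rainfall_Gallicano` is simply one of the Auser rainfall gauges I recall; check `names(data_auser)` for the actual columns.

```r
# Sketch: average a rainfall column by month as a first seasonality check.
# The column name is passed in, so this works for any of the rainfall gauges.
monthly_mean <- function(df, column, date_format = '%d/%m/%Y') {
    df$Date  <- as.Date(df$Date, date_format)
    df$month <- format(df$Date, '%m')
    # reformulate() builds e.g. Rainfall_Gallicano ~ month
    aggregate(reformulate('month', response = column), data = df, FUN = mean, na.rm = TRUE)
}

# Usage (after loading): monthly_mean(data_auser, 'Rainfall_Gallicano')
# If the seasonality assumption holds, the summer months should come out drier.
```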

As an easy-to-use temporal visualization, I designed a plotly chart of all datasets and the dates available in them, to showcase how far back they run and give us an idea of the trends we might find. The code chunk below makes use of the load function we designed earlier, making it stand-alone executable.

# Gather the Date and name columns of all nine datasets:
datalist_date_name <- lapply(1:9, function(x){ load_files(files_loc[x], filter = c('Date', 'name')) })
data_date_name <- data.table::rbindlist(datalist_date_name)
# Keep only complete dd/mm/yyyy strings, then parse them:
data_date_name <- data_date_name[which(nchar(data_date_name$Date) == 10), ]
data_date_name$Date <- as.Date(data_date_name$Date, "%d/%m/%Y")

# fig() is a notebook helper that sets the figure size; insighttheme is a custom ggplot theme.
fig(10, 6)
p <- ggplotly(ggplot(data_date_name, aes(x = Date, y = name)) + geom_point() +
    labs(x = 'Date (years)', y = '', title = 'Available dates by dataset') +
    insighttheme)

htmlwidgets::saveWidget(p, file = paste(maindir, htmldir, 'dates_viz', 'index.html', sep = '/'))
Available dates in Kaggle ACEA Dataset

Geospatial elements of ACEA dataset

One of the final things I dove into this week is the geospatial element of the ACEA dataset. We have different aquifers, rainfall and temperature measurements spread around Italy, so it is worth exploring whether maps can help us develop our story. The goal is to make our viewers better understand how the data works and what crucial insight we are working towards.

For example, it would be helpful to show our temporal elements on a map, so people can more quickly understand the overall findings across all locations in the ACEA dataset. I did have to use some elbow grease in this section, as the locations cannot be grouped by, for example, municipality or city; some of them are mountains, bridges, etc. Hence there was no existing comprehensive dataset of locations I could use. Therefore I put all 64 locations into Google Maps and wrote down the longitude and latitude by hand.

With 64 locations this was the quick fix; the longer fix is found here: https://developers.google.com/maps/documentation/geocoding/overview#RegionCodes. If you have to work with thousands of cities or a plethora of unknown locations, refer to the Google Geocoding API and extract the longitude and latitude programmatically. Since I am looking into doing more Kaggle competitions and also working more with maps in my regular job, I might have some fun and build a function for that later on.
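As a rough idea of what such a function could look like, here is a sketch against the Google Geocoding API. It assumes you have a valid API key and network access; the helper name `geocode_location` is mine, the error handling is minimal, and the shape of the parsed response should be double-checked against the API documentation linked above.

```r
# Hypothetical helper: look up latitude/longitude for a location name via the
# Google Geocoding API. Requires the httr and jsonlite packages and an API key.
geocode_location <- function(location, api_key) {
    res <- httr::GET('https://maps.googleapis.com/maps/api/geocode/json',
                     query = list(address = location, region = 'it', key = api_key))
    parsed <- jsonlite::fromJSON(httr::content(res, as = 'text', encoding = 'UTF-8'))
    # Return NAs rather than erroring when the lookup fails
    if (parsed$status != 'OK') return(c(lat = NA, long = NA))
    c(lat  = parsed$results$geometry$location$lat[1],
      long = parsed$results$geometry$location$lng[1])
}

# Usage: geocode_location('Monte Serra', api_key = my_key)
```

Applied over all location names with `sapply`, this would replace the hand-collected excel sheet.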

Below is my first attempt at a map, showcasing the average rainfall surrounding the Auser aquifer. Please note that this code makes use of a shapefile of Italian municipalities found here: https://hub.arcgis.com/datasets/e68ceb0a193e4e378b29255b62ab75e0_0. This is a basic version of the map, which I will update further in my notebook as needed for the final submission.

# readOGR() comes from the rgdal package; fortify() from ggplot2; read_excel() from readxl.
italy_mun <- readOGR(files_shp)
italy_mun <- fortify(italy_mun, region = 'COMUNE')

# files_geo points to my hand-collected excel of locations with lat/long columns:
locations <- read_excel(files_geo)
locations$lat  <- as.numeric(locations$lat)
locations$long <- as.numeric(locations$long)

# data_plot is assumed to hold the Auser rainfall columns in long format (location, rainfall_value):
mean_rainfall <- aggregate(rainfall_value ~ location, data = data_plot, FUN = mean)
mean_rainfall$location <- gsub('_', ' ', gsub('Rainfall_', '', mean_rainfall$location))
locations <- merge(locations, mean_rainfall, by.x = 'location', by.y = 'location', all.x = TRUE, all.y = FALSE)
locations <- locations[!is.na(locations$rainfall_value), ]

#mean lats and longs:
municipalities <- aggregate(cbind(long, lat) ~ id, data = italy_mun, FUN = mean)
municipalities <- municipalities[which(max(locations$lat)+.1 >= municipalities$lat &                  # Municipalities within auser locations latitude
                                           min(locations$lat)-.1 <= municipalities$lat), ]

municipalities <- municipalities[which(max(locations$long)+.1 >= municipalities$long &                # municipalities within auser locations longitude
                                           min(locations$long)-.1 <= municipalities$long), ]

italy_mun_plot <- italy_mun[which(italy_mun$id %in% municipalities$id),]

ggplot() + geom_polygon(data = italy_mun_plot, aes(x = long, y = lat, group = id), fill = 'grey', alpha = 0.3, col = 'white') +
    geom_point(data = locations, aes(x = long, y = lat, colour = vartype, size = rainfall_value)) + insighttheme +
    labs(title = 'Map of rainfall locations belonging to Auser aquifer in Italy')
Output of basic map code

ACEA Smart Water Analytics Competition next steps

My takeaway from this blog is simple: look at your data and develop an understanding of its important dimensions, in this case time and location. In other projects these could be dimensions such as currency, speed and/or connectivity. Connect your final model and insight back to these basic dimensions of your data to create a compelling narrative that is easy for all your users to follow.

I will be back soon with Part 2 of the ACEA Smart Water Analytics Competition!
