
ACEA Smart Water Analytics Competition; Hindsight Bias & Caret

This is blog 3 of my endeavors for the currently ongoing Kaggle challenge posted by ACEA; a short introduction to the challenge is below. What I am trying to do with these blogs is not to create perfect tutorials with perfectly revised code, but rather to document my own learning process and findings at any given moment. So hopefully you enjoy some of my thought process, and we can build from these findings towards an ultimately better end model.

Please note, if you want to follow along with any code in these blog series then sign up for the challenge here: https://www.kaggle.com/c/acea-water-prediction/overview and you will receive the dataset for free.

The challenge: ACEA Water Levels

The goal of this challenge is to predict water levels in a collection of different water bodies based in Italy. Specifically, we have to build a time series model that accurately assesses tomorrow's water level based on today's data. To briefly note the Kaggle rules: this is an analytics challenge, which means creating a compelling story & notebook is a very important part. My notebook is publicly available on Kaggle here, but I will work through some code excerpts and interesting highlights in these blogs.

So far we have discussed the challenge more generally and looked at some data wrangling and new features for modelling; in this blog we will discuss the first modelling steps I took after that.

Basic modelling

To start us off, in the initial blog we discussed elements of the dataset to consider. One of them was temporal: we reasoned that water levels are influenced *over time* by rainfall and temperature, which is why two weeks ago we wrote about creating lag features in our dataset. This week, in the modelling, we must therefore work with time-series-based models.
In my journey of the past two weeks I have fallen into a classic machine learning trap with time series, which I will go into in depth. By the end of the blog we will have overcome this issue.

First attempts

My first model contained as basic inputs:

  • Lags of the outcome variable
  • Lags of the predictive variables
  • Seasonal indicators (year, quarter, month)

In the previous blog we only worked with the lags of the predictive variables and came up to an R-squared of 33%; this time we include a few other variables and expect to improve our prediction.

Including lags of the outcome variable is a common time-series technique to improve the prediction. In my financial economics class we learned about the concept of 'random walks': essentially a time series that can't be predicted, as it contains no pattern. According to the study books this applies to the day-to-day trading of currency markets: the best prediction of tomorrow's exchange rate is today's exchange rate, and no one can say any more than that. Random walks aside, the point is that if we want to predict something for tomorrow, we had better include all information on that phenomenon we have to date in our model, including lags of our outcome variable.
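
To make the random-walk idea concrete, here is a minimal sketch (assuming a data frame data_model with a numeric outcome column, ordered by date) of the naive 'today predicts tomorrow' baseline that any real model should beat:

# Naive persistence baseline: predict each day's outcome with the previous day's value.
# Assumes data_model is ordered by date and has a numeric 'outcome' column.
naive_pred <- c(NA, head(data_model$outcome, -1))
naive_rmse <- sqrt(mean((data_model$outcome - naive_pred)^2, na.rm = TRUE))
naive_rmse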

The seasonal indicators are added to allow for trends across years, quarters and months. For example, there might be a decline in water levels over time due to global warming; our year indicator can account for this effect.

The next general step is to preprocess our data, which we discussed in depth in the last blog, and set up our model for cross validation. Cross validation holds out a part of the training data to validate the results; this held-out part is often called the 'validation' set for that reason. I used a 10-fold cross validation with 3 repeats.

Afterwards I set up my grid search for the gbm (Gradient Boosting Machine) model we will be using. I initially chose the gbm model because I am familiar with how it works; that honestly was the only reason. It is smart to try multiple models and compare them later. It is also smart to run a simple linear model as a baseline comparison, as sketched below.
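
As a hedged sketch (not code from my notebook), such a baseline could look like this, assuming the same data_model and fitControl objects that are built in the function below:

# Simple linear baseline with caret, to put the gbm results in perspective.
library(caret)
lm_baseline <- train(outcome ~ ., data = data_model,
                     method = "lm",
                     trControl = fitControl)
lm_baseline$results        # RMSE and Rsquared of the baseline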

create_data_model <- function(data, features, outcome, lag, preprocess = NULL, include_outcome = NULL, test = NULL){
    # data: This function allows for a generic data object, 
    # features: some features that are column names in the data object
    # outcome: name of the outcome column
    # lag: Highest lag to be included in the model 
    # preprocess : Whether the data is preprocessed before modelling   (Highly recommended!)
    # include_outcome:  Indicator if the outcome variable is included with lags
    # test: Indicator if a test set is to be constructed (default to 1 year)

    data_model <- data[,c(outcome,'Date',features)]
    names(data_model)[which(names(data_model)== outcome)] <- 'outcome'
    
    # Feature addition
    # Laggs
    if(!is.null(include_outcome)){
        features <- c(features, 'outcome')
    }
    
    for(i in 1:length(features)){
        for(j in 1:lag){
            data_model$temp <- Lag(data_model[,features[i]], j)                        # shift the series down by j rows
            names(data_model)[which(names(data_model)=='temp')] <- paste(features[i],j, sep = '_')
        }
    }
    
    # Inlude seasonality:
    data_model$year <- as.numeric(substr(data_model$Date,7,10))
    data_model$year <- data_model$year - min(data_model$year) + 1
    data_model$month <- as.numeric(substr(data_model$Date,4,5))
    data_model$quarter <- ifelse(data_model$month <= 3,1,
                                 ifelse(data_model$month >=4 & data_model$month <= 6,2,
                                        ifelse(data_model$month >=7 & data_model$month <= 9,3,
                                               ifelse(data_model$month >9,4,NA))))
    
    # Data cleaning
    data_model <- data_model[complete.cases(data_model),]                              # Remove all rows with missing values
    data_model <- data_model[which(data_model$outcome!= 0),]                           # Remove all outlier measurements
    
    if(!is.null(test)){
        last_date  <- max(as.Date(data_model$Date, format = '%d/%m/%Y'))               # convert to Date before taking the max
        data_test  <- data_model[which(as.Date(data_model$Date, format = '%d/%m/%Y') >= last_date - 365),]
        data_model <- data_model[which(as.Date(data_model$Date, format = '%d/%m/%Y') <  last_date - 365),]
        data_test$Date <- NULL
    }
    
    data_model$Date <- NULL
    
    # Statistical preprocessing
    if(!is.null(preprocess)){
        temp <- data_model[,-1]
        nzv <- nearZeroVar(temp)                                                       # excluding variables with very low frequencies
        if(length(nzv)>0){temp <- temp[, -nzv]}
        i <- findCorrelation(cor(temp))                                                # excluding variables that are highly correlated with others
        if(length(i) > 0) temp <- temp[, -i]
        i <- findLinearCombos(temp)                                                    # excluding variables that are a linear combination of others
        if(!is.null(i$remove)) temp <- temp[, -i$remove]
        data_model <- data_model[,c('outcome', names(temp))]
    }
    
    if(!is.null(test)){
        data_test <- data_test[, names(data_model)]                                    # keep the same columns as the modelling data
    }
    
    # Modelling:
    fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 3,
                           verboseIter = T)

    gbmGrid <-  expand.grid(interaction.depth = c(1,2,4,8), 
                          n.trees = 1:2000, 
                          shrinkage = c(0.01,0.001), 
                          n.minobsinnode = c(2,5))

    err <- try(load(paste(maindir,modeldir, paste('outcome =',outcome,'lag = ',lag,
                                                  'preprocess = ',preprocess,'include_outcome = ',include_outcome,'test = ',test,'.RData', sep = ''),sep = '/')))
    if(err != 'train1'){
        train1 <- train(outcome ~ ., data= data_model, method = 'gbm', trControl = fitControl, tuneGrid=gbmGrid)
        save(train1, file = paste(maindir,modeldir, paste('outcome =',outcome,'lag = ',lag,'preprocess = ',preprocess,'include_outcome = ',include_outcome,
                                                          'test = ',test,'.RData', sep = ''),
                                  sep = '/'))
    }
    if(!is.null(test)){
        data_test$predict <- predict(train1, newdata = data_test)
        return(list(data_test, train1))
    }
    
    train1
}
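
For reference, a call to this extended function could look as follows; the argument values are purely illustrative and mirror the setup described above (outcome lags included, preprocessing on, and the last year held out as a test set):

model_lt2 <- create_data_model(data            = data_auser,
                               features        = grep('Rainfall|Temperature', names(data_auser), value = TRUE),
                               outcome         = 'Depth_to_Groundwater_LT2',
                               lag             = 15,
                               preprocess      = 'yes',
                               include_outcome = 'yes',
                               test            = 'yes')

data_test <- model_lt2[[1]]   # held-out year, including a 'predict' column
train_gbm <- model_lt2[[2]]   # fitted caret model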

Then something weird happened: I hit an R-squared of 0.999, a near perfect prediction! You can find my training results in the picture below. It turned out that I had fallen into a classic time-series trap. The great thing is that we can learn together what the difference is between a correct prediction and a naïve approach.

What went wrong?

The trap is as follows: by using a naïve cross validation of the dataset we have ignored its temporal aspects. Specifically, we included future data points in our training set relative to our validation set; for example, a given training sample might contain a lot of data from the year 2018 while the validation set contains data from 2016. This obviously does not apply if we want to use this model today to predict 'tomorrow'.

There is a lot of literature on cross validation with time series; a specific person to look for is Rob Hyndman (3.4 Evaluating forecast accuracy | Forecasting: Principles and Practice (2nd ed) (otexts.com)), who introduced an ultimately more advanced method that I started to work on next. What we need to ensure in our training and validation data is that the temporal aspect is respected, so any training data needs to come before the validation data in time. Luckily R has such a rich community that the caret package I have been using for my modelling has specific settings in trainControl to address this issue. Hyndman's own forecast package also has options for this modelling procedure.

We are going to make use of the 'timeslice' method of trainControl. This method allows you to set up a window of time, measured in rows of your dataset; a dataset with one day per row might have a window of 365 to represent one year. Next is the horizon argument, the part of the window used for validation; a logical approach might be to set the horizon to 30, as in one month. Then there is the skip argument, which controls how the window moves through time: if it is set to 30 in the example above, window 1 covers roughly days 1:365 and window 2 roughly days 31:395. Lastly we can set the fixedWindow argument, which tells the timeslice method whether the window moves along the dataset in time or expands with every skip. The relevant code, compared to the block above, is shown below:

  fitControl <- trainControl(method = "timeslice",
                             initialWindow = 500,
                             horizon = 150,
                             fixedWindow = FALSE,
                             skip = 149,
                             verboseIter = T)
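
caret also exposes createTimeSlices(), which generates these windows directly and makes it easy to sanity-check which rows end up in each training and validation slice. A small sketch, assuming the prepared data_model from above:

slices <- createTimeSlices(1:nrow(data_model),
                           initialWindow = 500,
                           horizon       = 150,
                           fixedWindow   = FALSE,
                           skip          = 149)
length(slices$train)          # number of resamples
range(slices$train[[1]])      # rows in the first (growing) training window
range(slices$test[[1]])       # rows in the first validation window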

Having set all these arguments we rerun the model with proper cross validation and find an R-squared of 0.78: still high, but not borderline unrealistic. Below you can find my training results in more detail:

What can we conclude?

It is actually pretty great to see the exact difference between a naïve modelling approach and one tailored to the issue at hand. I didn't know much about time series modelling and dove straight in; if I had not checked myself on what I was doing and why, the model would have been introduced in a completely overfitted manner.

When looking at the test set we find that the MSE is practically identical for the model trained with specific time-series cross-validation folds and for the overfitted variant. I would have to dive deeper into the literature to explain the difference in full, but my gut feeling already tells me that showing up with the overfitted model is a mistake. We can use the specifically designed method, put more trust in its results in a variety of situations, and show that it performs identically on untrained data.
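
For completeness, a sketch of that comparison, assuming both variants were run with a held-out test set (via the test argument of the function above) and that the returned test data frames are called test_repeatedcv and test_timeslice; the names are illustrative:

mse <- function(obs, pred) mean((obs - pred)^2)
mse(test_repeatedcv$outcome, test_repeatedcv$predict)   # naive repeatedcv folds
mse(test_timeslice$outcome,  test_timeslice$predict)    # time-respecting timeslice folds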

To summarize, the new model has slightly less predictive power but is ultimately designed for the environment it will have to work in once put into production.

Next we will look at the final model I ended up using for the challenge and some thoughts before I submitted my notebook.


ACEA Smart Water Analytics Competition; Data Setup

In this blog series I will document my work on the Kaggle ACEA Smart Water Analytics Competition. The purpose of these blogs is not to share perfect code or completely worked out solutions. Rather, I want to take you along for the ride and share my insights as is. There will definitely be some fails and hopefully a success mixed in. In this series I will discuss 4 distinct steps in my project:

If you want to follow along with any code in these blog series then sign up for the challenge here: https://www.kaggle.com/c/acea-water-prediction/overview. You can download the dataset for free after signing up.

The challenge: ACEA Smart Water Analytics Competition

The goal of this challenge is to predict water levels in a collection of different water bodies based in Italy. Specifically, we have to build a time series model that accurately assesses tomorrow's water level based on today's data. This is an analytics challenge, which means creating a compelling story & notebook is a very important part. My notebook is publicly available on Kaggle here. I will work through some code excerpts and interesting highlights in these blogs.

For today I want to address some of the data setup steps that are important to prepare your data for machine learning models.

The how-to’s on setting up your data

Data setup for machine learning models sounds a bit mystical. It is the crucial step after our initial insights, where we shape our data to allow for answers to our main research question. Here we add features to our data, transform our variables and put in the business logic. It is where the business knowledge lives, and it can make or break your findings.

There are roughly two ways to go about setting up your data optimally for your machine learning efforts: the traditional approach and the modern approach. In this blog we will compare both and show code examples for each. In the traditional approach we build our feature set and then clean up our dataset before we start our modelling, for example by removing variables that have a high correlation. In the modern approach we load all features into the model; only after running the model do we trim our data and rerun it.

Another way of comparing the two approaches is as follows: in the traditional approach we use our own reasoning and knowledge of statistics to optimize the model outcome, while the modern approach replaces these steps with raw computing power.

ACEA Smart Water Analytics Data Setup; Traditional approach

In the traditional approach for data setup we define the following steps:
– Feature addition
– Data cleaning
– Statistical preprocessing

Feature addition

We will work with the Auser dataset from the ACEA Smart Water Analytics Competition on Kaggle. For further explanation of the dataset please refer to the introduction blog. It includes a total of 5 outcome variables, each of which needs to be predicted in the final notebook. For illustration purposes we will focus on the LT2 variable, which by the data definition is the only water level measurement in the south part of the aquifer. Today we will focus on the rainfall and temperature variables in the Auser dataset as predictors.

One of my main considerations was the temporal aspect of the dataset, specifically the time lags associated with both the temperature and the rainfall variables. Hence it is a logical first step to incorporate lags of these variables in our dataset. In the future I can imagine there are other parameters to consider here, for example averaging rainfall over multiple measuring points or using a running average over multiple days; a rough sketch of both ideas follows below.
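
These are illustrations only, not part of my current feature set; the column name Rainfall_Gallicano is just an example from the Auser dataset:

library(zoo)
# daily average over all rainfall measuring points
rain_cols <- grep('Rainfall', names(data_auser), value = TRUE)
data_auser$Rainfall_mean <- rowMeans(data_auser[, rain_cols], na.rm = TRUE)
# 7-day running average of a single rainfall series
data_auser$Rainfall_Gallicano_avg7 <- rollmean(data_auser$Rainfall_Gallicano,
                                               k = 7, fill = NA, align = 'right')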

Data cleaning

In this part of the process we normally deal with missing values. This can be done in multiple ways, which are described in depth elsewhere. In short, we can simply remove all incomplete rows; in a business environment it is often worth the time-intensive effort to retain as much data as possible, for example by using imputation to fill in the missing values. This is where your business logic comes in, and it is an important area to focus on when you are presented with real-life 'dirty' datasets. For now we simply remove all rows that are incomplete.
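
As a hedged illustration of what such imputation could look like (again using an example column name), short gaps could be interpolated with zoo instead of dropping the rows:

library(zoo)
# Linearly interpolate gaps of at most 3 consecutive missing days; longer gaps stay NA.
data_auser$Rainfall_Gallicano <- na.approx(data_auser$Rainfall_Gallicano,
                                           maxgap = 3, na.rm = FALSE)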

Another important step here is outliers. We find that some measurements in the LT2 variable are unrealistic: from one day to the next the water level moves from the average to 0 and back to the average. These are faulty measurements and need to be removed from the data.

Statistical preprocessing

Now for the most important preprocessing step in the traditional approach: we clean up our variables by removing highly correlated variables (since they have a similar effect on the outcome measure of choice) and near zero variance variables (variables that are, for example, nearly always 0).
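
The caret helpers used for this further below are nearZeroVar() and findCorrelation(); a tiny standalone example of what they flag:

library(caret)
set.seed(1)
toy <- data.frame(a = rnorm(100))
toy$b <- toy$a + rnorm(100, sd = 0.01)   # near-duplicate of a: highly correlated
toy$c <- c(rep(0, 99), 1)                # almost always 0: near zero variance
nearZeroVar(toy)                         # returns the index of column c
findCorrelation(cor(toy), cutoff = 0.9)  # returns the index of one of a/b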

These steps help us gain an understanding of how the data works, and therefore help us better understand how the model gets to its final outcome. After finalizing all these steps we find that 18 variables remain; using those in our initial test-run model we find an R-squared of 15%. This is not the best result, but for a first try I've had a lot worse!

The next step in the traditional approach is to start tinkering. Initially we work backwards: we look at our excluded variables and try different cutoff values for the different steps. This helps us expand our model; we add features and attempt to improve the model performance through understanding its inner workings. This step can take days to weeks, depending on how often we rerun our model.

ACEA Smart Water Analytics Data Setup; The modern approach

In what I call the modern approach to data setup we skip the statistical preprocessing; instead we simply load the data into the model and try to optimize the predictive power. Afterwards we evaluate our model and, if possible, rerun a simplified version. Running the model without any statistical tests, but with a similar feature set and data cleaning, we find an R-squared of 33.7%. This gives us some idea of the potential in the data.

Code

Below you find my function for preprocessing and running my model in both the traditional and the modern approach. I find that writing this as a function helps keep my code clean, and this function can be applied to all datasets in the Kaggle ACEA challenge as needed.

create_data_model <- function(data, features, outcome, lag, preprocess = NULL){
    
    data_model <- data[,c(outcome,features)]
    names(data_model)[which(names(data_model)== outcome)] <- 'outcome'
    
    # Feature addition
    for(i in 1:length(features)){
        for(j in 1:lag){
            data_model$temp <- Lag(data_model[,features[i]], j)                        # shift the series down by j rows
            names(data_model)[which(names(data_model)=='temp')] <- paste(features[i],j, sep = '_')
        }
    }
    # Data cleaning

    data_model <- data_model[complete.cases(data_model),]       # Remove all rows with missing values
    data_model <- data_model[which(data_model$outcome!= 0),]    # Remove all outlier measurements
    
    # Statistical preprocessing
    if(!is.null(preprocess)){
        temp <- data_model[,-1]
        nzv <- nearZeroVar(temp)                                                       # excluding variables with very low frequencies
        if(length(nzv) > 0){temp <- temp[, -nzv]}
        i <- findCorrelation(cor(temp))                                                # excluding variables that are highly correlated with others
        if(length(i) > 0) temp <- temp[, -i]
        i <- findLinearCombos(temp)                                                    # excluding variables that are a linear combination of others
        if(!is.null(i$remove)) temp <- temp[, -i$remove]
        data_model <- data_model[,c('outcome', names(temp))]
    }
    
    # Modelling:
    fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 3,
                           verboseIter = T)

    gbmGrid <-  expand.grid(interaction.depth = c(1,2,4,8), 
                          n.trees = 1:2000, 
                          shrinkage = c(0.01,0.001), 
                          n.minobsinnode = c(2,5))

    err <- try(load(paste(maindir,modeldir, paste('outcome =',outcome,'lag = ',lag,'preprocess = ',preprocess,'.RData', sep = ''),sep = '/')))
    if(err != 'train1'){
        train1 <- train(outcome ~ ., data= data_model, method = 'gbm', trControl = fitControl, tuneGrid=gbmGrid)
        save(train1, file = paste(maindir,modeldir, paste('outcome =',outcome,'lag = ',lag,'preprocess = ',preprocess,'.RData', sep = ''),
                                  sep = '/'))
    }        
    
    train1
}

tr_preproc <- create_data_model(data = data_auser,
                                features = grep('Rainfall|Temperature',names(data_auser),value = T),
                                outcome = 'Depth_to_Groundwater_LT2',
                                lag = 15,
                                preprocess = 'yes') 

tr_nopreproc <- create_data_model(data = data_auser,
                                features = grep('Rainfall|Temperature',names(data_auser),value = T),
                                outcome = 'Depth_to_Groundwater_LT2',
                                lag = 15,
                                preprocess = NULL) 

Discussion

So how do you set up your data for modelling? We saw that the traditional approach leaves you with more understanding of the problem at hand. The downside is that it also requires more time investment and often demands more business logic to deal with the problem. The modern approach skips these steps and moves straight to optimizing predictive power; it allows us to utilize machine learning techniques to optimize the data setup.

Comparing the two, we found that simply loading all information into the model got us to 33% predictive power, not a bad score. Our first try at using some statistical concepts to preprocess the data only got us to 15%. It did take my laptop 6 hours to run the model for the 'modern approach'; the traditional model was done after 20 minutes.

At heart I am a traditionalist: having studied as an economist I strive to understand a problem, think it through and find a solution. Ultimately, to use the model we often need the deeper understanding that you develop through the extra time investment and hard work. How else are we going to convince other people to use a model for their business problem?

When do I prefer the modern approach? When, for a particular business problem, the outcome is more important than the process, or when the assumption is that the important variables in the dataset that influence the outcome can't be influenced themselves. This Italian water level issue is actually a perfect example of a situation that tilts quite heavily towards the modern approach. We find that temperature and rainfall influence water levels, yet we can't actually influence temperature or rainfall; all we really care about is predicting the water level. It would be great to know that the temperature on a mountain is highly predictive of this, but once we know this we can't influence it in any way.

ACEA Smart Water Analytics Competition next steps

I will be back soon with a more in-depth look at the modelling process I have gone through for this challenge. Further improvements can definitely be made, such as transformations of the data and making use of lags on the outcome variable.


ACEA Smart Water Analytics Competition; Introduction

In this blog series I will document my work on the Kaggle ACEA Smart Water Analytics Competition. The purpose of these blogs is not to share perfect code or completely worked out solutions. Rather, I want to take you along for the ride and share my insights as is. There will definitely be some fails and hopefully a success mixed in. In this series I will discuss 4 distinct steps in my project:

If you want to follow along with any code in these blog series then sign up for the challenge here: https://www.kaggle.com/c/acea-water-prediction/overview. You can download the dataset for free after signing up.

The challenge: ACEA Smart Water Analytics Competition

The goal of this challenge is to predict water levels in a collection of different water bodies based in Italy. Specifically, we have to build a time series model that accurately assesses tomorrow's water level based on today's data. This is an analytics challenge, which means creating a compelling story & notebook is a very important part. My notebook is publicly available on Kaggle here. I will work through some code excerpts and interesting highlights in these blogs.

For today I want to address my initial findings and explore how I thought through the initial steps of this challenge.

Designing my load function

One of the first steps everyone undertakes is (obviously) to load the data. It is my personal preference to condense this code into a load function. The main reason for load functions is that they create clean and concise notebooks, allowing me to build code chunks that are independent of each other by loading data in the notebook as necessary.

My load function will typically contain the following 3 components:

  • Filepath to data
  • Column based filters
  • Merge functionality

In the ACEA dataset this worked out as follows:

# dirs:
maindir   <- 'C:/Prive/Kaggle/ACEA'
datadir   <- 'Data'
scriptdir <- 'Scripts'
htmldir   <- 'HTML'
# load:
# Loading files:
files_loc <- list.files(path = paste(maindir, datadir, sep = '/'), recursive= T, full.names = T)
files_loc <- files_loc[grep("\\.csv", files_loc)]

load_files <- function(path, filter=NULL){
    output <- read.csv(path, stringsAsFactors = FALSE, check.names = FALSE, fileEncoding="UTF-8-BOM")
    output$name <- gsub('\\.csv','', basename(path))
    if(!is.null(filter)){
        return(output[,filter])
    }
    return(output)
}

data_auser <- load_files(files_loc[1])
data_doganella <- load_files(files_loc[2])
data_luco <- load_files(files_loc[3])
data_petrignano <- load_files(files_loc[4])
data_bilancino <- load_files(files_loc[5])
data_arno <- load_files(files_loc[6])
data_amiata <- load_files(files_loc[7])
data_lupa <- load_files(files_loc[8])
data_madonna <- load_files(files_loc[9])

As you can see, there is currently no merge functionality in my load function. I imagine I will update this as I can see some use for it later on in my project. The crux here is that if you are faced with multiple datasets that would logically have to be combined, a merge function incorporated into your load function can create a clean, concise and easily understood notebook; a possible sketch is shown below. The load function does allow us to load specific columns, and we will see in the next chapter how this is already useful.
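
A rough sketch of where that could go, purely illustrative and building on the load_files function above, joining the loaded datasets on their Date column:

# Possible merge helper: load several files and join them on Date.
# Note: overlapping column names other than Date will get .x/.y suffixes.
load_merged <- function(paths, filter = NULL){
    datasets <- lapply(paths, load_files, filter = filter)
    Reduce(function(x, y) merge(x, y, by = 'Date', all = TRUE), datasets)
}
# e.g. combined <- load_merged(files_loc[1:2])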

Temporal elements of ACEA dataset

One of the main discussion points of the ACEA dataset is temporal: there are time-based elements in the dataset. The data contains multiple types of columns, such as rainfall, temperature, hydrometric and groundwater levels. The final challenge is to build a time series model: we have to predict the water level of tomorrow. This means that there are some initial assumptions to be challenged or tested:

  • Trends over time
  • Seasonality
  • Availability of data over time

One of the cornerstones of our narrative needs to be time-based trends. The data runs back to 1998 for some variables (see below), which means it is possible the different explanatory variables contain long-running trends. I myself assume, for example, that rainfall is decreasing over time and temperature is increasing due to global warming effects; it is important to test these assumptions. When thinking of seasonality, we may furthermore assume water levels are likely lower in summer and higher in winter or early spring.
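
One quick, hedged way to test such a trend assumption is to regress a yearly average on the year; the rainfall column below is just an example:

# Yearly average rainfall and a simple linear trend (Date is in dd/mm/yyyy format).
data_auser$year <- as.numeric(substr(data_auser$Date, 7, 10))
yearly_rain <- aggregate(Rainfall_Gallicano ~ year, data = data_auser, FUN = mean)
summary(lm(Rainfall_Gallicano ~ year, data = yearly_rain))   # sign and significance of the slope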

As far as easy-to-use temporal visualizations go, I designed a plotly of all datasets and the dates available in each, to showcase how far back they run and give us an idea of the possibility of finding different trends. The code chunk below makes use of the load function we designed earlier, making it stand-alone executable.

# Put all dates and names variables:
datalist_date_name <- lapply(1:9, function(x){load_files(files_loc[x], filter = c('Date','name'))})
data_date_name <- data.table::rbindlist(datalist_date_name)
data_date_name <- data_date_name[which(nchar(data_date_name$Date) == 10),]
data_date_name$Date <- as.Date(data_date_name$Date, "%d/%m/%Y")

fig(10,6)
p <- ggplotly(ggplot(data_date_name, aes(x = Date, y = name)) + geom_point() + 
labs (x = 'Date (years)', y = '',title = 'Available dates by dataset') +
insighttheme )

htmlwidgets::saveWidget(p, file = paste(maindir, htmldir, 'dates_viz','index.html', sep = '/'))
Available dates in Kaggle ACEA Dataset

Geospatial elements of ACEA dataset

One of the final things I dived into this week is the geospatial element present in the ACEA dataset. We have different aquifers, rainfall and temperature measurements around Italy, so it would be beneficial to explore whether maps can help us develop a story. The goal is to make our viewers understand better how the data works and what crucial insight we are working on.

For example, it would be helpful to show our temporal elements on a map, so people can more quickly understand the overall findings for all locations found in the ACEA dataset. I did have to use some elbow grease in this section, as the different locations are not groupable by, for example, municipalities or cities; some of them are mountains, bridges, etc. Hence there was no comprehensive dataset of locations I could use, so I put all 64 locations into Google Maps and wrote down the longitude and latitude by hand.

With 64 locations this was the quick fix. The longer fix is found here: https://developers.google.com/maps/documentation/geocoding/overview#RegionCodes. If you have to work with thousands of cities or a plethora of unknown locations, refer to the Google API and programmatically extract the longitude and latitude. Since I am looking into doing more Kaggle and also working more with maps in my regular job, I might have some fun and build a function for that later on.
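
For reference, a hedged sketch of that programmatic route using the ggmap package, which wraps the Google Geocoding API and needs an API key; the location strings here are examples:

library(ggmap)
register_google(key = 'YOUR_API_KEY')                      # Geocoding API must be enabled for this key
locations <- data.frame(location = c('Monte Serra, Italy', 'Gallicano, Italy'))
coords <- geocode(locations$location, source = 'google')   # returns lon/lat per location
locations <- cbind(locations, coords)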

Below was my first attempt at a map, showcasing the average rainfall surrounding the Auser aquifer. Please note that this code makes use of a shapefile for Italian municipalities found here: https://hub.arcgis.com/datasets/e68ceb0a193e4e378b29255b62ab75e0_0. This is a basic version of the map, which I will be updating further in my notebook as needed for the final submission.

italy_mun <- readOGR(files_shp)
italy_mun <- fortify(italy_mun, region = 'COMUNE')

locations <- read_excel(files_geo)
locations$lat <- as.numeric(locations$lat)
locations$long <- as.numeric(locations$long)

mean_rainfall <- aggregate(rainfall_value ~ location, data = data_plot, FUN = mean)
mean_rainfall$location <- gsub('_',' ',gsub('Rainfall_','',mean_rainfall$location))
locations <- merge(locations, mean_rainfall, by.x = 'location', by.y = 'location', all.x = T, all.y = F)
locations <- locations[!is.na(locations$rainfall_value),]

#mean lats and longs:
municipalities <- aggregate(cbind(long, lat) ~ id, data = italy_mun, FUN = mean)
municipalities <- municipalities[which(max(locations$lat)+.1 >= municipalities$lat &                  # Municipalities within auser locations latitude
                                           min(locations$lat)-.1 <= municipalities$lat), ]

municipalities <- municipalities[which(max(locations$long)+.1 >= municipalities$long &                # municipalities within auser locations longitude
                                           min(locations$long)-.1 <= municipalities$long), ]

italy_mun_plot <- italy_mun[which(italy_mun$id %in% municipalities$id),]

ggplot() + geom_polygon(data = italy_mun_plot, aes(x = long, y = lat, group = id), fill="grey", alpha=0.3, col = 'white') + 
    geom_point(data = locations, aes(x = long, y = lat, colour = vartype, size = rainfall_value)) + insighttheme +
    labs(title = "Map of rainfall locations belonging to Auser aquifer in Italy")
Output of basic map code

ACEA Smart Water Analytics Competition next steps

My takeaways from this blog are simple: look at your data and develop an understanding of the important dimensions, in this case time and location. For other data these could be dimensions such as currency, speed and/or connectivity. Connect your final model and insight back to these basic dimensions of your data to create a compelling narrative that is easy for all your users to follow.

I will be back soon with Part 2 of the ACEA Smart Water Analytics Competition!