data decision making
Data Science project, rblog

Designing data driven decision making; Kaggle ColeRidge

There is an interesting challenge running on Kaggle at the moment. It has been designed in cooperation with the Coleridge Initiative. This initiative, established at New York University, aims to facilitate data driven decision making by governments. In the challenge we get to optimize automated dataset detection in policy papers. The ultimate goal? To help civil servants make better and more efficient decisions.

What is the key to making better decisions? Using the most up to date scientific research and data. Data Science can help make this process more efficient: specifically, we can use Natural Language Processing (NLP) techniques to predict which papers use a given dataset. This is the essence of the Kaggle challenge. If it succeeds, it can make government decision making more transparent.

In this initial blog I want to discuss some of my considerations when thinking about this challenge, along with some quick analysis of the dataset provided.

Challenge overview

So you are working on a problem: how many times have you looked for a good scientific source to find a solution? Put yourself in the position of someone working for a local government. You have to come up with a local policy impact analysis on the council's climate change efforts. You've been looking for evidence and scientific research all morning. Isn't there a tool that immediately identifies all similar research where similar datasets are used?

I can imagine these decisions are made on a daily basis, all around the world. Designing better systems to navigate this huge source of information is the key to data driven decision making. Improving such mass scale decision making, with direct societal impact, triggers my imagination.

Next, the dataset: we are given roughly 20,000 papers and have to predict which datasets are used in these documents. There are three key technical aspects I want to discuss that, in my opinion, are vital to reaching a solution.

  • Feature addition
  • Similarity index of scientific papers
  • Transfer learning

Feature addition

The dataset consists of just five columns: a document ID, the publication title, the dataset used in the publication, the institute associated with it, and a cleaned label for the dataset. One of the potential challenges with text data is finding reliable additional structured data in the raw text provided. We want to see if there are other features available!

There are multiple reasons for attempting to extract additional structured data from the provided text. If, for example, we can extract the date of publication from the text, we can see whether it is a relevant factor in the usage of different datasets. One of the most important reasons is that structured elements make our ultimate solution easier for policy users to understand: they create a simpler interpretation of our outcome. In our efforts to improve decision making it is our duty as data scientists to maximize understanding of our solution.

From a data science perspective, creating additional metadata also improves our own understanding of the problem. In one of the upcoming blogs I will attempt to use the text to create these metadata elements. Off the top of my head, important elements are:

  • Publication title
  • Author / Organization
  • Total citations
  • Cited papers
  • Text length

There is probably more potential for data enrichment, but this is a starting consideration for me when going over this challenge.
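To illustrate what such enrichment could look like, here is a small sketch that derives a text length feature and a publication year from raw text with regular expressions. The `papers` data frame and its `text` column are hypothetical stand-ins for the challenge data, not the real columns.

```r
# Sketch: deriving simple metadata features from raw paper text.
# `papers` is a hypothetical stand-in for the challenge data.
papers <- data.frame(
  Id   = c('a1', 'a2'),
  text = c('Published in 2014, this study uses the Census of Agriculture...',
           'We analyse the World Ocean Database (accessed 2019) to...'),
  stringsAsFactors = FALSE
)

# Text length as a feature:
papers$text_length <- nchar(papers$text)

# First plausible publication year found in the text (NA if none):
extract_year <- function(text){
  match <- regmatches(text, regexpr('(19|20)[0-9]{2}', text))
  if(length(match) == 0) NA else as.numeric(match)
}
papers$pub_year <- sapply(papers$text, extract_year)
```

Real papers will of course need more careful rules (a year in a citation is not the publication date), but even crude features like these give us something structured to correlate with dataset usage.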

Similarity index of scientific papers

When we look for different papers that cover similar datasets, we assume there is some similarity: similar technical terms are used and introduced, along with similar domain knowledge. It is logical that papers covering a dataset of economic indicators contain similar explanations of those indicators. There is a theoretical limit to the angles from which a dataset can be discussed.

Aside from the challenge of predicting the datasets used in the articles, it would be an interesting business case to cluster the scientific papers and find similar articles. Are there, for example, very different articles that cover the same dataset? By zeroing in on these effects we can more optimally cover all the relevant knowledge on a given dataset. This question also simply interests me, and I'm curious whether very different articles cover similar data.
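As a rough base R sketch of that similarity idea: represent two paper snippets as term-count vectors over a shared vocabulary and take the cosine similarity. The two snippets here are made up for illustration.

```r
# Two made-up paper snippets for illustration:
docs <- c('economic indicators gdp growth unemployment data',
          'gdp growth and unemployment economic indicators dataset')

# Term-count vectors over the shared vocabulary:
tokens <- lapply(docs, function(d) unlist(strsplit(tolower(d), '\\s+')))
vocab  <- unique(unlist(tokens))
tf     <- sapply(tokens, function(tk) sapply(vocab, function(w) sum(tk == w)))

# Cosine similarity between the two term vectors (1 = identical direction):
cosine_sim <- sum(tf[,1] * tf[,2]) / (sqrt(sum(tf[,1]^2)) * sqrt(sum(tf[,2]^2)))
```

For these overlapping snippets the score lands well above 0.7; papers on the same dataset should score noticeably higher against each other than against unrelated papers.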

One simple way to approach this is a string matching algorithm, which makes for an interesting baseline for the challenge. When I start a challenge I always look for a baseline. What I find most helpful about establishing a simple baseline early is clarity on the challenge's complexity: it answers the question, how difficult is this really? A simple, intuitive model also helps us understand how humans work the problem in practice. Finally, it shows the mountain we have to climb with our more advanced predictive model.
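A minimal version of such a baseline could look as follows, assuming we have the cleaned dataset labels and the raw paper text; the `labels` vector and the example sentence are illustrative stand-ins for the challenge data.

```r
# Baseline sketch: flag which known dataset labels appear verbatim in a paper.
# `labels` is an illustrative stand-in for the cleaned labels in the challenge data.
labels <- c('world ocean database', 'census of agriculture',
            'national education longitudinal study')

predict_datasets <- function(paper_text, labels){
  text <- tolower(paper_text)
  labels[sapply(labels, function(lbl) grepl(lbl, text, fixed = TRUE))]
}

predict_datasets('This paper draws on the Census of Agriculture (2017).', labels)
# -> "census of agriculture"
```

It will miss paraphrased dataset mentions entirely, which is exactly the gap a more advanced NLP model has to close.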

Transfer learning

When I read about this challenge, part of what piqued my interest is the potential to use transfer learning. In the field of Natural Language Processing there has been a big buzz around transfer learning. For a scientific breakdown see here; another great link is found here. One of the most promising recent models is BERT.

Personally I’ve never used transfer learning and am eager to give it a try in this challenge. It would be a great victory for me to beat my string matching baseline with a transfer learning method. This will take some studying of the links mentioned in the previous paragraph, but there is a lot of great content available.

Improving data driven decision making

The purpose of this blog is to write down initial thoughts. I will start the challenge by tackling these aspects. It helps me create a structure for my project and clears the path to focus on the challenge itself.

We are focusing on improving data driven decision making for governments. Creating a good prediction model is one part of this, and that is the focus of the Kaggle challenge. But while working on this I also want to broaden the impact: ultimately the prediction model needs to be explainable and usable to achieve impact, so keeping this in the back of our minds as we design it is vital.

To conclude I will leave you with a word cloud of the papers that mention the datasets “World Ocean Database” and “Census of Agriculture”. I expect they look sufficiently different to be of interest. The code used for creating them is below the pictures.

Ocean vs Agriculture dataset wordcloud
# Libraries:
library(jsonlite)
library(feather)
library(data.table)
library(tm)
library(wordcloud)
library(RColorBrewer)

# load and process data

data <- read.csv("C:/Startup/ColeRidge/train.csv")

files_train <- list.files('C:/Startup/ColeRidge/train', full.names = T, recursive = FALSE)
files_id <- gsub('.json','',basename(files_train))

load_json <- function(file, files_id){
  output             <- jsonlite::fromJSON(file)
  output$publication <- files_id
  return(output)
}

# Read the parsed papers from cache if available, otherwise parse the json files:
if(file.exists('C:/Startup/ColeRidge/json_files.feather')){
  json_files <- read_feather('C:/Startup/ColeRidge/json_files.feather')
} else {
  json_files <- lapply(1:length(files_train), function(x){load_json(files_train[x], files_id[x])})
  json_files <- data.table::rbindlist(json_files)
  write_feather(json_files, path = 'C:/Startup/ColeRidge/json_files.feather')
}

data <- merge(data, json_files, by.x = 'Id', by.y = 'publication', all.x = T, all.y = F)

world_ocean <- data[which(data$dataset_title == 'World Ocean Database' & data$section_title =='Introduction'),]
agri_census <- data[which(data$dataset_title == 'Census of Agriculture' & data$section_title =='Introduction'),]

build_wordcloud <- function(text){
  data_corp <- Corpus(VectorSource(text))
  # Cleaning:
  data_corp <- tm_map(data_corp, content_transformer(tolower))
  data_corp <- tm_map(data_corp, removeWords, stopwords("english"))
  data_corp <- tm_map(data_corp, removePunctuation)
  # Document term matrix:
  data_corp <- TermDocumentMatrix(data_corp)
  data_corp <- as.matrix(data_corp)
  data_corp <- sort(rowSums(data_corp),decreasing=TRUE)
  data_corp <- data.frame(word = names(data_corp),freq=data_corp)
  wordcloud(words = data_corp$word, freq = data_corp$freq, min.freq = 1,
            max.words=200, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))
}

build_wordcloud(world_ocean$text)
build_wordcloud(agri_census$text)

Kaggle ACEA Water Analytics Competition Final Model
Data Science project, rblog

ACEA Smart Water Analytics Competition; Final Model overview

This is blog 4 of my endeavors for the currently ongoing Kaggle challenge posted by ACEA; a short introduction of the challenge is below. What I am trying to do with these blogs is not create perfect tutorials with perfectly revised code, but rather document my own project process and findings at each moment. This blog gives an overview of my final model for the ACEA Smart Water Analytics Competition, before the project deadline.

Please note, if you want to follow along with any code in these blog series then sign up for the challenge here: and you will receive the dataset for free.

The challenge: ACEA Smart Water Analytics Competition

The goal of this challenge is to predict water levels in a collection of different water bodies in Italy. Specifically, we have to build a time series model that accurately assesses the water level of tomorrow based on data of today. To briefly note the Kaggle rules: this is an analytics challenge, which means creating a compelling story & notebook is a very important part. My notebook is publicly available on Kaggle here, but I will work through some code excerpts and interesting highlights in these blogs.

So far we have discussed the challenge in general and looked at some data wrangling and new features for modelling. Last time we overcame a classic issue with time series modelling concerning our cross-validation. This week we will finish our modelling and prepare the challenge for hand-in. You can find the earlier blogs below if you want to read back first:

ACEA Smart Water Analytics Competition Final model

The last blog was a more in-depth discussion of hindsight bias in the cross-validation stage of the modelling. By using a specific time series method we stabilized our model. In the week after, I stumbled upon the work of Hanna Meyer, a machine learning expert in the field of environmental data. There is a great research paper she wrote available here.

Studying her work made me realise that the Kaggle dataset has all the characteristics she discussed in her paper. She discusses in depth both the time series aspect covered by Hyndman and the spatial element of the dataset: in this case, having multiple measurement points that measure the water level. The main advantage of setting up the model this way is that it is more sensitive to unknown locations within the given water body. On top of that, it allows for simultaneous modelling of all given locations.

Leave Location and Time Out (LLTO) modelling

So in the final model I introduce both a spatial element and a time series element. Methodologically this is called Leave Location and Time Out (LLTO) cross validation. Essentially the idea combines all the steps discussed earlier.

In your training set you leave out the location that is in your validation dataset; this covers the spatial element. You then only include timepoints from before the time period in the validation dataset; this covers the time element. If you have 4 locations, you have 4 folds per time period, and each time one of the locations is placed in the validation dataset. This method is implemented in the CAST package in R. However, I found that the time aspect is not handled properly in the relevant function (CreateSpacetimeFolds), hence I ended up making my own folds that respect both aspects.
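To make one such fold concrete, here is a toy sketch with made-up locations and dates: the training index excludes both the held-out location and all dates in the validation window.

```r
# Toy LLTO fold: hold out location 'B' and the last time period.
df <- expand.grid(location = c('A','B','C'),
                  Date = seq.Date(as.Date('2020-01-01'), by = 'month', length.out = 6))

cutoff <- as.Date('2020-05-01')

train_idx <- which(df$location != 'B' & df$Date <  cutoff)   # other locations, earlier dates
valid_idx <- which(df$location == 'B' & df$Date >= cutoff)   # held-out location, later dates

# The validation rows are never seen in training, in either dimension:
length(intersect(train_idx, valid_idx)) == 0   # TRUE
```

Repeating this for every location and every time window gives the full set of folds used below.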


The code below contains an example function that handles all these steps. Of specific interest may be lines 47-101, where the handmade folds for the cross-validation are created. I have not perfected this code, but it shows the main steps. If you want to know more or discuss, don't hesitate to contact me. We might get back to that later in a different blog.

model_handmade_folds <- function(data, horizon, dataset, lag, features, basefeatures){

#basefeatures <- 'Depth'

# Make lags:
features <- grep(features,names(data),value = T)
basefeatures <- grep(basefeatures,names(data),value = T)

for(i in 1:length(features)){
    for(j in 1:lag){
        data$temp <- Hmisc::Lag(data[,features[i]], +j)
        names(data)[which(names(data)=='temp')] <- paste(features[i],j, sep = '_')
    }
}

# Keep columns that are at least 20% non-missing:
data <- data[,which(colMeans(! > .2)]
# Inlude seasonality:
data$year <- as.numeric(substr(data$Date,7,10))
data$year <- data$year - min(data$year) + 1
data$month <- as.numeric(substr(data$Date,4,5))
data$quarter <- ifelse(data$month <= 3,1,
                            ifelse(data$month >=4 & data$month <= 6,2,
                                ifelse(data$month >=7 & data$month <= 9,3,
                                    ifelse(data$month >9,4,NA))))

data_long <- tidyr::pivot_longer(data, basefeatures,names_to = 'location', values_to = 'depth_to_groundwater')

data_long <- data_long[complete.cases(data_long),]
data_long <- data_long[which(data_long$depth_to_groundwater != 0),]

#data_model <- data_long[,-grep('location|Date|name',names(data_long))]

temp <- data_long[,which(!names(data_long)%in%c('depth_to_groundwater','Date','location'))]
nzv <- nearZeroVar(temp)                                                       # excluding variables with very low frequencies
if(length(nzv)>0){temp <- temp[, -nzv]}
i <- findCorrelation(cor(temp))                                                # excluding variables that are highly correlated with others
if(length(i) > 0) temp <- temp[, -i]
i <- findLinearCombos(temp)                                                    # excluding variables that are a linear combination of others
if(!is.null(i$remove)) temp <- temp[, -i$remove]
data_model <- data_long[,c('depth_to_groundwater','Date','location', names(temp))]

data_model$Date <- as.Date(as.character(data_model$Date), format = '%d/%m/%Y')

# Handmade indexes:
index_hand_design <- function(data, period, location, horizon, location_one = NULL){
    horizon2 <- max(period)-horizon
    if(!is.null(location_one)){
        # Single-location dataset: no location can be left out
        indexin  <- which(data$Date >= min(period) & data$Date <= horizon2)
        indexout <- which(data$Date > horizon2 & data$Date <= max(period))
    } else {
        # Leave the validation location out of the training index
        indexin  <- which(data$Date >= min(period) & data$Date <= horizon2 & data$location != location)
        indexout <- which(data$Date > horizon2 & data$Date <= max(period) & data$location == location)
    }
    output <- c(list(indexin), list(indexout))
    return(output)
}

periods <- round(length(seq.Date(from = min(data_model$Date),to = max(data_model$Date), by = 'day'))/horizon,0)
dates   <- seq.Date(from = min(data_model$Date),to = max(data_model$Date), by = 'day')
indices <- 1:periods*horizon

periods_final <- dates[indices]
periods_final <- periods_final[!]


for(i in 3:length(periods_final)){
    output <- list(c(periods_final[i-2], periods_final[i]))
    if(i <= 3){
        output_final <- output
    } else {
        output_final <- c(output_final, output)
    }
}
locations <- unique(data_model$location)

for(i in 1:length(locations)){
    for(j in 1:length(output_final)){
        if(length(locations) == 1){
            # Only one location available: fall back to time-only folds
            output_temp <- index_hand_design(data_model, output_final[[j]], locations[i], horizon, location_one = 'yes')
        } else {
            output_temp <- index_hand_design(data_model, output_final[[j]], locations[i], horizon)
        }
        if(j == 1){
            final_inner <- output_temp
        } else {
            final_inner <- c(final_inner, output_temp)
        }
    }
    if(i == 1){
        final <- final_inner
    } else {
        final <- c(final, final_inner)
    }
}

index_final <- list(index = final[seq(1, length.out = length(locations)*length(output_final), by = 2)], 
                    indexOut = final[seq(2, length.out =length(locations)*length(output_final), by = 2)])

fitcontrol <- trainControl(verboseIter = T,
                          index = index_final$index,
                          indexOut = index_final$indexOut)

gbmGrid <-  expand.grid(interaction.depth = c(1,2,4), 
                          n.trees = 1:4000, 
                          shrinkage = c(0.01), 
                          n.minobsinnode = c(2,5))

# Add a location indicator dummy per measurement point:
for(i in 1:length(locations)){
    data_model$temp <- ifelse(data_model$location == locations[i],1,0)
    names(data_model)[which(names(data_model) == 'temp')] <- paste(locations[i],'ind', sep = '_')
}

# Load a previously trained model if it exists, otherwise train and save it
# (maindir and modeldir are assumed to be defined globally):
err <- try(load(paste(maindir, modeldir, paste('data_model = ',dataset,'horizon =',horizon,'.RData', sep = ''), sep = '/')))
if(err != 'train'){
    train <- train(depth_to_groundwater ~ . , method = 'gbm', trControl = fitcontrol, tuneGrid = gbmGrid, 
                       data = data_model[,-grep('Date|location',names(data_model))])
    save(train, file = paste(maindir, modeldir, paste('data_model = ', dataset,'horizon =', horizon,'.RData', sep = ''), sep = '/'))
}
return(train)
}

train_auser <- model_handmade_folds(data_auser, 365, 'auser',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Depth')

train_petrig <- model_handmade_folds(data_petrignano, 365, 'petrignano',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Depth')

train_amiata <- model_handmade_folds(data_amiata, 365, 'amiata',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Flow_Rate')

train_lupa <- model_handmade_folds(data_lupa, 365, 'lupa',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Flow_Rate')

train_arno <- model_handmade_folds(data_arno, 365, 'arno',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Hydrometry')

train_bila_flo <- model_handmade_folds(data_bilancino, 365, 'bilancino_flo',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Flow_Rate')

train_bila_hyd <- model_handmade_folds(data_bilancino, 365, 'bilancino_hyd',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth|Lake_Level', 'Lake_Level')

ACEA Smart Water Analytics Competition; final thoughts

In doing this challenge I ended up putting a lot of focus on the modelling stage. I learned a lot going through all these steps on different data than I usually work on. Looking over the final model, I am happy with the outcome of the project. The main reason is that it generalizes well to new locations of water level measurement points and is robustly designed for time series effects. Overall I feel that hints back to the original spirit of the challenge.

There are some improvements to be made; for example, the model failed on some datasets due to a lack of usable data. I could fix this by looking at imputation methods, an area I have skipped completely during the project. I might revisit that later as needed, as the dataset provided in the ACEA Smart Water Analytics Competition contains a lot of missing data points.

Furthermore, the true applicability of this model is still to be determined. I made some assumptions, namely that we want to predict next-day measurements, when ACEA might be interested in multiple-steps-ahead predictions. The model is also more generalized, which makes it easier to apply to new and unknown datasets as well as new locations. But from a business standpoint it can also be logical to focus on optimizing currently known locations more thoroughly. It definitely interests me how subtle changes in the data setup and modelling approach can lead to completely different use cases from a business perspective.

It's been fun, and I will be back writing about the winning notebook, comparing it to my own findings.

ACEA Smart water analytics
Blog, Data Science project, rblog

ACEA Smart Water Analytics Competition; Introduction

In this blog series I will document my work on the Kaggle ACEA Smart Water Analytics Competition. The purpose of these blogs is not to share perfect code or completely worked out solutions; rather, I want to take you along for the ride and share my insights as is. There will definitely be some fails and hopefully a success mixed in. In this series I will discuss four distinct steps in my project.

If you want to follow along with any code in these blog series then sign up for the challenge here: You can download the dataset for free after signing up.

The challenge: ACEA Smart Water Analytics Competition

The goal of this challenge is to predict water levels in a collection of different water bodies in Italy. Specifically, we have to build a time series model: the goal is to accurately assess the water level of tomorrow based on data of today. This is an analytics challenge, which means creating a compelling story & notebook is a very important part. My notebook is publicly available on Kaggle here. I will work through some code excerpts and interesting highlights in these blogs.

For today I want to address my initial findings and explore how I thought through the initial steps of this challenge.

Designing my load function

One of the first steps everyone undertakes is (obviously) to load the data. It is my personal preference to condense this code into a load function. The main reason is that it creates clean and concise notebooks: it allows me to build code chunks that are independent of each other, loading data in the notebook as necessary.

My load function will typically contain the following 3 components:

  • Filepath to data
  • Column based filters
  • Merge functionality

In the ACEA dataset this worked out as follows:

# dirs:
maindir   <- 'C:/Prive/Kaggle/ACEA'
datadir   <- 'Data'
scriptdir <- 'Scripts'
htmldir   <- 'HTML'
# load:
# Inladen files:
files_loc <- list.files(path = paste(maindir, datadir, sep = '/'), recursive= T, full.names = T)
files_loc <- files_loc[grep("\\.csv", files_loc)]

load_files <- function(path, filter=NULL){
    output <- read.csv(path, stringsAsFactors = FALSE, check.names = FALSE, fileEncoding="UTF-8-BOM")
    output$name <- gsub('\\.csv','', basename(path))
    # Optional column based filter:
    if(!is.null(filter)){
        output <- output[, filter]
    }
    return(output)
}

data_auser <- load_files(files_loc[1])
data_doganella <- load_files(files_loc[2])
data_luco <- load_files(files_loc[3])
data_petrignano <- load_files(files_loc[4])
data_bilancino <- load_files(files_loc[5])
data_arno <- load_files(files_loc[6])
data_amiata <- load_files(files_loc[7])
data_lupa <- load_files(files_loc[8])
data_madonna <- load_files(files_loc[9])

As you can see, there is currently no merge functionality in my load function yet. I imagine I will update this when I see some usage for it later on in my project. The crux is that if you are faced with multiple datasets that would logically have to be combined, a merge function incorporated in your load function can create a clean, concise and easily understood notebook. The load function does allow us to load specific columns; we will see in the next chapter how this is useful already.

Temporal elements of ACEA dataset

One of the main discussion points of the ACEA dataset is temporal: there are time based elements in the dataset. The data contains multiple types of columns, such as rainfall, temperature, hydrometric and ground water levels. The final challenge is to build a time series model: we have to predict the water level of tomorrow. This means there are some initial assumptions to be challenged or tested:

  • Trends over time
  • Seasonality
  • Availability of data over time

One of the cornerstones of our narrative needs to be time-based trends. The data runs back to 1998 for some variables (see the visualization below), which means the different explanatory variables may contain long running trends. I myself assume, for example, that rainfall is decreasing over time and temperature is increasing, due to global warming effects. It is important to test these assumptions. Thinking of seasonality, we may furthermore assume that water levels are likely lower in summer and higher in winter or early spring.
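One simple way to test such assumptions is a seasonal-trend decomposition with `stl`. Since I have not loaded the real data at this point, the sketch below runs on a simulated monthly rainfall series with a built-in downward trend and yearly seasonality.

```r
# Sketch: decompose a simulated monthly rainfall series into trend and seasonality.
set.seed(1)
months <- 1:240                                 # 20 years of monthly data
rain   <- 80 - 0.05 * months +                  # slight downward trend
          15 * sin(2 * pi * months / 12) +      # yearly seasonality
          rnorm(240, sd = 5)                    # noise

rain_ts <- ts(rain, frequency = 12, start = c(1998, 1))
decomp  <- stl(rain_ts, s.window = 'periodic')

# The estimated trend component drifts downwards over the series:
trend <- decomp$time.series[, 'trend']
```

Run on the real rainfall and temperature columns, the trend component is what would confirm or refute the global warming assumption, and the seasonal component the summer/winter assumption.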

As far as easy to use temporal visualizations go, I designed a plotly of all datasets and dates available in the dataset, to showcase how far back they run and give us an idea of the possibility of finding different trends. The code chunk below makes use of the load function we designed earlier, thus making it stand-alone executable.

# Put all dates and names variables:
datalist_date_name <- lapply(1:9, function(x){load_files(files_loc[x], filter = c('Date','name'))})
data_date_name <- data.table::rbindlist(datalist_date_name)
data_date_name <- data_date_name[which(nchar(data_date_name$Date) == 10),]
data_date_name$Date <- as.Date(data_date_name$Date, "%d/%m/%Y")

p <- ggplotly(ggplot(data_date_name, aes(x = Date, y = name)) + geom_point() + 
labs (x = 'Date (years)', y = '',title = 'Available dates by dataset') +
insighttheme )

htmlwidgets::saveWidget(p, file = paste(maindir, htmldir, 'dates_viz','index.html', sep = '/'))
Available dates in Kaggle ACEA Dataset

Geospatial elements of ACEA dataset

One of the final things I dived into this week is the geospatial element present in the ACEA dataset. We have different aquifers, rainfall and temperature measurements around Italy, so it would be beneficial to explore whether maps can help us develop a story. The goal is to make our viewers better understand how the data works and what crucial insight we are working on.

For example, it would be helpful to show our temporal elements on a map, so people can understand the overall findings of all locations in the ACEA dataset more quickly. I did have to use some elbow grease in this section, as the different locations are not groupable by, for example, municipalities or cities; some of them are mountains, bridges, etc. Hence there was no comprehensive dataset of locations I could use. Therefore I put all 64 locations into Google Maps and wrote down the longitude and latitude by hand.

With 64 locations this was the quick fix; the longer fix is found here. If you have to work with thousands of cities or a plethora of unknown locations, refer to the Google API and programmatically extract the longitude and latitude. Since I am looking into doing more Kaggle and working more with maps in my regular job, I might have some fun and build a function for that later on.

Below was my first attempt at a map, showcasing the average rainfall surrounding the AUSER aquifer. Please note that this code makes use of a shapefile for Italian municipalities found here: This is a basic version of the map, which I will be updating further in my notebook as needed for the final submission.

italy_mun <- readOGR(files_shp)
italy_mun <- fortify(italy_mun, region = 'COMUNE')

locations <- read_excel(files_geo)
locations$lat <- as.numeric(locations$lat)
locations$long <- as.numeric(locations$long)

mean_rainfall <- aggregate(rainfall_value ~ location, data = data_plot, FUN = mean)
mean_rainfall$location <- gsub('_',' ',gsub('Rainfall_','',mean_rainfall$location))
locations <- merge(locations, mean_rainfall, by.x = 'location', by.y = 'location', all.x = T, all.y = F)
locations <- locations[!$rainfall_value),]

#mean lats and longs:
municipalities <- aggregate(cbind(long, lat) ~ id, data = italy_mun, FUN = mean)
municipalities <- municipalities[which(max(locations$lat)+.1 >= municipalities$lat &                  # Municipalities within auser locations latitude
                                           min(locations$lat)-.1 <= municipalities$lat), ]

municipalities <- municipalities[which(max(locations$long)+.1 >= municipalities$long &                # municipalities within auser locations longitude
                                           min(locations$long)-.1 <= municipalities$long), ]

italy_mun_plot <- italy_mun[which(italy_mun$id %in% municipalities$id),]

ggplot() + geom_polygon(data = italy_mun_plot, aes(x = long, y = lat, group = id), fill="grey", alpha=0.3, col = 'white') + 
    geom_point(data = locations, aes(x = long, y = lat, colour = vartype, size = rainfall_value)) + insighttheme +
    labs(title = "Map of rainfall locations belonging to Auser aquifer in Italy")
Output of basic map code

ACEA Smart Water Analytics Competition next steps

My takeaway from this blog is simple: look at your data and develop an understanding of the important dimensions, in this case time and location. These could be other dimensions, such as currency, speed and/or connectivity. Connect your final model and insight back to these basic dimensions of your data to create a compelling narrative that is easy for all your users to follow.

I will be back soon with Part 2 of the ACEA Smart Water Analytics Competition!