data decision making
Data Science project, rblog

Designing data driven decision making; Kaggle ColeRidge

There is an interesting challenge running on Kaggle at the moment. It has been designed in cooperation with the Coleridge Initiative (https://coleridgeinitiative.org/). This initiative, established at New York University, has the goal of facilitating data driven decision making by governments. In the challenge we get to optimize automated dataset detection in policy papers. The ultimate goal? To help civil servants make better and more efficient decisions.

What is the key to making better decisions? Using the most up-to-date scientific research and data. Data science can help make this process more efficient. Specifically, we can use Natural Language Processing (NLP) techniques to predict which papers use a given dataset. This is the essence of the Kaggle challenge. If it succeeds, it can make government decision making more transparent.

In this initial blog I want to discuss some of my considerations when thinking about this challenge. There will also be some quick analysis on the dataset provided.

Challenge overview

So you are working on a problem: how many times have you looked for a good scientific source to find a solution? Put yourself in the position of someone working for a local government. You have to come up with a local policy impact analysis on the council's climate change efforts. You've been looking for evidence and scientific research all morning. Isn't there a tool that immediately identifies all similar research where similar datasets are used?

I can imagine these decisions are made on a daily basis, all around the world. Designing better systems to navigate this huge information source is the key to data driven decision making. Improving such mass-scale decision making, with direct societal impact, triggers my imagination.

Next, the dataset: we are given roughly 20,000 papers and have to predict which datasets are used in these documents. There are 3 key technical aspects I want to discuss that are, in my opinion, vital to reaching a solution.

  • Feature addition
  • Similarity index of scientific papers
  • Transfer learning

Feature addition

The dataset literally consists of 5 columns: a document ID, the publication title, the dataset used in the publication, the institute associated with it and a cleaned label for the dataset. One of the potential challenges with text data is to derive reliable additional structured data from the raw text provided. We want to see if there are some other features available!

There are multiple reasons for attempting to extract additional structured data from the provided text. If we can, for example, extract the date of publication from the text, we can see whether publication date is a relevant factor in the usage of different datasets. One of the most important reasons is that it makes our ultimate solution easier to understand for policy users: using these structured elements creates a simpler interpretation of our outcome. In our efforts to improve decision making, it is our duty as data scientists to maximize understanding of our solution.

From a data science perspective, creating additional metadata also helps us improve our own understanding of the problem. In one of the upcoming blogs I will attempt to use the text to create these metadata elements. Off the top of my head, important elements are:

  • Publication title
  • Author / Organization
  • Total citations
  • Cited papers
  • Text length

There is probably more potential for data enrichment, but this is a starting consideration for me when going over this challenge.
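
To make this concrete, below is a rough sketch of the kind of metadata extraction I have in mind. The helper name, the regular expression and the example text are my own illustration; real papers will need more robust parsing.

library(stringr)

# Hypothetical helper: pull the first four-digit year out of the raw paper text.
extract_publication_year <- function(text){
  as.numeric(str_extract(text, "(19|20)[0-9]{2}"))
}

example_text <- "Published 2018. We analyse the World Ocean Database to study temperature."
extract_publication_year(example_text)  # 2018
nchar(example_text)                     # text length as a simple extra feature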

Similarity index of scientific papers

When we look for different papers that cover similar datasets, we assume that there is some similarity: similar technical terms are used and introduced, and similar domain knowledge is required. It is logical that papers covering a dataset of economic indicators contain similar explanations of those indicators. There is a theoretical limit to the angles from which a dataset can be discussed.

Aside from the challenge of predicting which datasets are used in the articles, it would be an interesting business case to cluster the scientific papers and find similar articles. Are there, for example, very different articles that cover the same dataset? By zeroing in on these effects we can cover all the relevant knowledge on a given dataset more completely. The question also simply interests me: I'm curious whether there are very different articles that cover similar data.
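
As a rough sketch of what such a similarity index could look like (reusing the tm package from the word cloud code further down), a term-document matrix combined with cosine similarity already gives a crude paper-to-paper comparison. The toy texts below are made up purely for illustration.

library(tm)

# Toy texts standing in for paper sections:
texts <- c("economic indicators gdp inflation dataset",
           "gdp and inflation indicators over time",
           "ocean temperature salinity measurements")

tdm <- as.matrix(TermDocumentMatrix(Corpus(VectorSource(texts))))

# Cosine similarity between documents (columns of the term-document matrix):
norms      <- sqrt(colSums(tdm^2))
cosine_sim <- crossprod(tdm) / (norms %o% norms)
round(cosine_sim, 2)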

One simple way to approach this is with a string matching algorithm, which makes for an interesting baseline for the challenge. When I start a challenge I am always looking for a baseline. What I find most helpful in establishing a simple baseline early is clarity on the challenge complexity: it answers the question, how difficult is this really? Creating a simple, intuitive model also helps us understand how humans work the problem in practice. Finally, it helps us grasp the mountain we have to climb with our more advanced predictive model.
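
A minimal sketch of such a baseline is shown below: flag a paper as using a dataset when its cleaned label literally appears in the text. The toy papers and labels are made up; in the challenge the labels come from the provided training file.

library(stringr)

papers <- data.frame(
  Id   = c("p1", "p2"),
  text = c("We analyse the World Ocean Database to study temperature.",
           "Crop yields are taken from the Census of Agriculture."),
  stringsAsFactors = FALSE
)
labels <- c("world ocean database", "census of agriculture")

# For each paper, return the dataset labels found verbatim in the lowercased text:
matches <- lapply(tolower(papers$text), function(txt) labels[str_detect(txt, fixed(labels))])
names(matches) <- papers$Id
matches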

Transfer learning

When I read about this challenge, part of what piqued my interest was the potential to use transfer learning. In the field of Natural Language Processing there has been a big buzz around transfer learning; for a scientific breakdown see here, and another great link is found here. One of the most promising recent models is BERT.

Personally I’ve never used transfer learning and am eager to give it a try in this challenge. For me it would be a great victory to beat my string matching baseline with a transfer learning method. This will take some studying of the links mentioned in the previous paragraph, but there is a lot of great content available.
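
As a first, hedged sketch of a starting point (my own assumption, not a worked-out solution): the reticulate package can call the Python transformers library, which ships pre-trained BERT models. This assumes a Python environment with transformers installed.

library(reticulate)

# Load a pre-trained BERT model through the Python 'transformers' library:
transformers <- import("transformers")
extractor    <- transformers$pipeline("feature-extraction", model = "bert-base-uncased")

# Contextual embeddings for a sentence; these could feed a downstream classifier:
embedding <- extractor("This study uses the World Ocean Database.")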

Improving data driven decision making

The purpose of this blog is to write down my initial thoughts. I will start the challenge by tackling these aspects; it helps me create a structure for my project and clears the path to focus on the challenge itself.

We are focusing on improving data driven decision making for governments. Creating a good prediction model is one part of this, and that is the focus of the Kaggle challenge. But working on this I also want to broaden the impact. Ultimately the prediction model needs to be explainable and usable to achieve impact. Hence keeping this in the back of our minds as we design it is vital.

To conclude I will leave you with a word cloud of the papers that mention the datasets “World Ocean Database” and “Census of Agriculture”. I expect they look sufficiently different to be of interest. The code used for creating them is below the pictures.

Ocean vs Agriculture dataset wordcloud
# Library
library(jsonlite)
library(feather)
library(tm)
library(wordcloud)


# load and process data

data <- read.csv("C:/Startup/ColeRidge/train.csv")

files_train <- list.files('C:/Startup/ColeRidge/train', full.names = T, recursive = FALSE)
files_id <- gsub('.json','',basename(files_train))

load_json <- function(file, files_id){
  output             <- jsonlite::fromJSON(file)
  output$publication <- files_id
  output
}

if(file.exists('C:/Startup/ColeRidge/json_files.feather')){
  json_files <- read_feather('C:/Startup/ColeRidge/json_files.feather')
} else {
  json_files <- lapply(1:length(files_train), function(x){load_json(files_train[x], files_id[x])})
  json_files <- data.table::rbindlist(json_files)
  write_feather(json_files, path = 'C:/Startup/ColeRidge/json_files.feather')
}

data <- merge(data, json_files, by.x = 'Id', by.y = 'publication', all.x = T, all.y = F)

world_ocean <- data[which(data$dataset_title == 'World Ocean Database' & data$section_title =='Introduction'),]
agri_census <- data[which(data$dataset_title == 'Census of Agriculture' & data$section_title =='Introduction'),]

build_wordcloud <- function(text){
  
  data_corp <- Corpus(VectorSource(text))
  
  # Cleaning:
  data_corp <- tm_map(data_corp, content_transformer(tolower))
  data_corp <- tm_map(data_corp, removeWords, stopwords("english"))
  data_corp <- tm_map(data_corp, removePunctuation)
  
  # Document term matrix:
  data_corp <- TermDocumentMatrix(data_corp)
  data_corp <- as.matrix(data_corp)
  data_corp <- sort(rowSums(data_corp),decreasing=TRUE)
  data_corp <- data.frame(word = names(data_corp),freq=data_corp)
  
  set.seed(1234)
  wordcloud(words = data_corp$word, freq = data_corp$freq, min.freq = 1,
            max.words=200, random.order=FALSE, rot.per=0.35, 
            colors=brewer.pal(8, "Dark2"))
  
  
}

build_wordcloud(world_ocean$text)
build_wordcloud(agri_census$text)

code review example R code Caret
PSA, rblog

Code Review Example R Code Caret

Today we will go through a practical code review example. I will analyze some code chunks I used during the Kaggle ACEA challenge. The code chunks come from a function I used to run my final model using Caret in R. Code review is a very important step, and I recommend doing it after any project. It helps condense your code and make it more efficient, which in turn allows you to share and preserve your work more easily. In this blog I will follow the top 3 items of the checklist I shared earlier.

Obviously I am reviewing my own code in these examples, which is different from reviewing someone else's code. The simplest difference is that normally you are not the original author; here I already know what the code is supposed to achieve. In practice the first step is actually making sure you understand how the code works. After all, how can you review something without understanding it?

If you want some background on the code, I used this function during the ACEA Smart Water Analytics Challenge on Kaggle. Its function is discussed in more detail here. In short, this function was designed to run the final model I ended up using for this challenge on all available datasets.

You can find the full unedited code on my GitHub, in the file ‘modelling_function.R’. I will now go through some code chunks. This code hasn’t been reviewed or optimized since I did the project, and there is a lot of room for improvement.

Code Review Example Function Headers

The code chunk below shows the header of my helper function wrapper.

model_handmade_folds <- function(data, horizon, dataset, lag, features, basefeatures){
  
  #basefeatures <- 'Depth'
  
  # Make lags:
  features <- grep(features,names(data),value = T)
  basefeatures <- grep(basefeatures,names(data),value = T)

There are 2 standout issues I noticed immediately that should be addressed to improve this function. As described in my checklist on code review, helper functions are amazing in R code, but only if done correctly: done well, they improve code readability and transfer between projects. Let's focus on readability first.

Readability

The first issue is that there is no explanation of the function's inputs. A basic explanation should include an example input and short descriptions of type and shape. The second is the lack of an explanation of the function's use case; explain this in shorthand at the function wrapper. Thirdly, it is important to explain the output of the function in the wrapper. Finally, re-evaluate the name of the function: does it clearly relate to the shorthand explanation of its purpose?

All these alterations also improve its transferability to other projects. There is a secondary issue in this function's first 5 lines. The parameters ‘features’ and ‘basefeatures’ used as input to the function are immediately reassigned. This is bad coding practice, as there is no reason why the reassigned variables couldn't have been passed as input to the function in the first place.

Revisions

The revised code is shown below. It now contains a precise explanation of all inputs, the function's use case and its output. The name of the function has been altered and the parameters renamed more appropriately. The reassignment of the 2 input parameters has been removed and incorporated into the function's inputs/wrapper.

LLTO_Model <- function(data, name_of_dataset, lag, time_out_horizon, features, predictors){
  ## Function purpose: Use Leave Location and Time Out (LLTO) modelling on the 
  ## DataSets of the Kaggle ACEA Smart Water Analytics Challenge. Datasets are 
  ## time series with multiple predictors and features. Lags are used as features.
  ## A horizon gets specified to model the Time Out Cross validation.
  
  ## Function Output: Train Object
  
  ## Function input:
  # data                Dataset from Kaggle Acea Smart Water Analytics challenge, 
  #                     containing Date, numeric predictors & feature set
  # name_of_dataset     Name of input dataset, used to save trained model
  # lag                 Amount of lags used for features (integer input)
  # time_out_horizon    Horizon of folds used in Time Out Cross validation
  # features            Names of features used in dataset (numeric variables)
  # predictors          Names of predictors used in dataset (numeric variables)

Code Review Example Preprocessing Setup

The next part of the function runs through the different data preprocessing steps before the model training begins. From a user perspective I expect some control over this piece of code; as a user of a function you need these options to make it transferable. The different preprocessing steps need to be explained more precisely as well. I also have some thoughts on code placement inside this segment.

  for(i in 1:length(features)){
    for(j in 1:lag){
      data$temp <- Lag(data[,features[i],+j])
      names(data)[which(names(data)=='temp')] <- paste(features[i],j, sep = '_')
    }
  }
  
  
  data <- data[,which(colMeans(!is.na(data))>.2)]
  
  # Inlude seasonality:
  data$year <- as.numeric(substr(data$Date,7,10))
  data$year <- data$year - min(data$year) + 1
  data$month <- as.numeric(substr(data$Date,4,5))
  data$quarter <- ifelse(data$month <= 3,1,
                         ifelse(data$month >=4 & data$month <= 6,2,
                                ifelse(data$month >=7 & data$month <= 9,3,
                                       ifelse(data$month >9,4,NA))))
  
  data_long <- tidyr::pivot_longer(data, predictors,names_to = 'location', values_to = 'depth_to_groundwater')
  
  data_long <- data_long[complete.cases(data_long),]
  data_long <- data_long[which(data_long$depth_to_groundwater != 0),]
  
  #data_model <- data_long[,-grep('location|Date|name',names(data_long))]
  
  temp <- data_long[,which(!names(data_long)%in%c('depth_to_groundwater','Date','location'))]
  nzv <- nearZeroVar(temp)                                                       # excluding variables with very low frequencies
  if(length(nzv)>0){temp <- temp[, -nzv]}
  i <- findCorrelation(cor(temp))                                                # excluding variables that are highly correlated with others
  if(length(i) > 0) temp <- temp[, -i]
  i <- findLinearCombos(temp)                                                    # excluding variables that are a linear combination of others
  if(!is.null(i$remove)) temp <- temp[, -i$remove]
  data_model <- data_long[,c('depth_to_groundwater','Date','location', names(temp))]
  
  data_model$Date <- as.Date(as.character(data_model$Date), format = '%d/%m/%Y')

Preprocessing steps explained

The code above consists of 5 distinct preprocessing steps:

  • Feature addition
  • Data shaping for modelling
  • Removal of missing information
  • Statistical preprocessing
  • Variable formatting

In the current code none of these steps are explained or identified as such; it is up to the reader of the code to figure this out. In the code below I have explained these distinct steps so they can be located more easily. They can also be adjusted to a new project or dataset more easily, and it is now simpler to evaluate them individually and improve on the function.

Code placement

There is an argument here that the preprocessing steps should be placed into their own function entirely. It is a distinct use case that can be separated from training the model. Sometimes it is even a requirement in a production environment: when we have to preprocess live production data exactly as we did when training the model. The code is already greatly improved simply by reordering it into the 5 steps above.
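
To make that argument concrete, below is a hedged sketch of what pulling the preprocessing into its own helper could look like. The function name, arguments and the 0.2 missingness threshold are my own illustration, not the project's final code.

library(caret)

# Hypothetical wrapper bundling the preprocessing steps into one helper:
preprocess_LLTO_data <- function(data, features, predictors, lag, na_threshold = 0.2){
  # 1. Feature addition: lagged copies of every feature
  for(f in features){
    for(j in seq_len(lag)){
      data[[paste(f, j, sep = "_")]] <- dplyr::lag(data[[f]], j)
    }
  }
  # 2. Remove columns that are mostly missing after lagging
  data <- data[, colMeans(!is.na(data)) > na_threshold, drop = FALSE]
  # 3. Data shaping: long format, one row per location and date
  data_long <- tidyr::pivot_longer(data, dplyr::all_of(predictors),
                                   names_to = "location", values_to = "depth_to_groundwater")
  # 4. Removal of missing information
  data_long <- data_long[complete.cases(data_long), ]
  # 5. Statistical preprocessing: near-zero variance, correlations, linear combinations
  num <- data_long[, setdiff(names(data_long), c("depth_to_groundwater", "Date", "location"))]
  nzv <- nearZeroVar(num);          if(length(nzv) > 0) num <- num[, -nzv]
  hc  <- findCorrelation(cor(num)); if(length(hc) > 0)  num <- num[, -hc]
  lc  <- findLinearCombos(num);     if(!is.null(lc$remove)) num <- num[, -lc$remove]
  data_long[, c("depth_to_groundwater", "Date", "location", names(num))]
}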

Other considerations

One of the concerns I have reading this part of my code back is the lack of stepwise explanations. A clear example is the section on missing values; it is as simple as it gets: remove all rows that contain even a single missing value. As a new reader or user I want to know many more things about this choice: its impact (what if it removes all rows?), the alternatives that were investigated, and even why the missing values are removed at all.

This is where code review comes in, to make projects stand up to production-level quality. You could argue that this level of detail is excessive, but I would argue it is a requirement. When this model is in operation any issue may arise; perhaps the amount of missing data increases and the model fails specifically on this point. A developer may then start a new project to implement imputation techniques, while maybe you already investigated this and it led to a dead end. One of the key characteristics of production-level code is its ability to handle diverse outcomes and report on them accordingly.

  ## Variable formatting:
  # Date variable:
  data_model$Date <- as.Date(as.character(data_model$Date), format = '%d/%m/%Y')
  
  ## Feature addition:
  # Creating lags for i features in j lags
  for(i in 1:length(features)){
    for(j in 1:lag){
      data$temp <- quantmod::Lag(data[,features[i]], j)
      names(data)[which(names(data)=='temp')] <- paste(features[i],j, sep = '_')
    }
  }
  # Inlude seasonality:
  data$year    <- as.numeric(substr(data$Date,7,10))
  data$year    <- data$year - min(data$year) + 1
  data$month   <- as.numeric(substr(data$Date,4,5))
  data$quarter <- ifelse(data$month <= 3,1,
                         ifelse(data$month >=4 & data$month <= 6,2,
                                ifelse(data$month >=7 & data$month <= 9,3,
                                       ifelse(data$month >9,4,NA))))  
  # remove uncommon newly created features:
  data <- data[,which(colMeans(!is.na(data))>.2)]
  
  ## Data shaping for modelling
  # Long format by predictors for LLTO modelling:
  data_long <- tidyr::pivot_longer(data, predictors,names_to = 'location', values_to = 'depth_to_groundwater')
  
  ## Remove missing variables:
  # Complete cases
  data_long <- data_long[complete.cases(data_long),]
  
  ## Statistical preprocessing:
  # Remove outliers
  data_long <- data_long[which(data_long$depth_to_groundwater != 0),]
  # Preprocess features for modelling
  temp <- data_long[,which(!names(data_long)%in%c('depth_to_groundwater','Date','location'))]
  # excluding variables with very low frequencies
  nzv <- nearZeroVar(temp)                                                       
  if(length(nzv)>0){temp <- temp[, -nzv]}
  # excluding variables that are highly correlated with others
  i <- findCorrelation(cor(temp))                                               
  if(length(i) > 0) temp <- temp[, -i]
  # excluding variables that are a linear combination of others
  i <- findLinearCombos(temp)                                                   
  if(!is.null(i$remove)) temp <- temp[, -i$remove]
  # Selection of final feature set
  data_model <- data_long[,c('depth_to_groundwater','Date','location', names(temp))]
  

Code Review Example Cross validation indices

One of the notable implementations in this function is the handmade folds for cross validation. I implemented this in the code because there were no libraries that used LLTO modelling exactly how I wanted to use it here. Looking back at my checklist, this is an important stage to bring up library optimization. Here we have a use case where our code deliberately includes a self-programmed aspect, and it is our responsibility as the reviewer to verify that this choice was correct.

In this particular case I know that there is a CAST package in R that claims to implement LLTO modelling, but I found that it didn’t create the folds for cross validation correctly. This is however not documented in the code below, which is a mistake. The code should reflect this piece of research (perhaps even going so far as to show the wrong implementation). This also ties back to my previous point regarding stepwise documentation. It is a smart move to keep initial implementations and checks of code so that you can later reuse your research to back up your final output. I have failed to do so and hence I am open to criticism about why I self-programmed this set of functions.

The purpose of the code below is to create indices of cross validation folds for every predictor and time period defined in the dataset. Only the first time period is excluded, as there is no prior data that can be used to train for it. For example, a dataset with 4 predictors and 4 time periods gets 12 folds. Caret expects a named list that includes the elements ‘index’ and ‘indexOut’.
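
To illustrate the structure caret expects, here is a toy example with two handmade folds; the row numbers are arbitrary and only serve to show the shape of the object.

library(caret)

# Each list element is one fold: 'index' holds the training rows,
# 'indexOut' the corresponding validation rows.
index_toy <- list(
  index    = list(fold1 = 1:300,   fold2 = 1:600),
  indexOut = list(fold1 = 301:365, fold2 = 601:665)
)

fitcontrol_toy <- trainControl(index    = index_toy$index,
                               indexOut = index_toy$indexOut)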

# Handmade indexes:
  index_hand_design <- function(data,period, location, horizon, location_one = NULL){
    horizon2 <- max(period)-horizon
    if(!is.null(location_one)){
      indexin <- which(data$Date >= min(period) & data$Date <= horizon2)
      indexout <- which(data$Date > horizon2 & data$Date <= max(period))
      
    } else {
      indexin <- which(data$Date >= min(period) & data$Date <= horizon2 & data$location != location)
      indexout <- which(data$Date > horizon2 & data$Date <= max(period) & data$location == location)
    }
    output <-c(list(indexin),list(indexout))
    output
  }
  
  periods <- round(length(seq.Date(from = min(data_model$Date),to = max(data_model$Date), by = 'day'))/horizon,0)
  dates   <- seq.Date(from = min(data_model$Date),to = max(data_model$Date), by = 'day')
  indices <- 1:periods*horizon
  
  periods_final <- dates[indices]
  periods_final <- periods_final[!is.na(periods_final)]
  
  stopifnot(length(periods_final)>=4)
  
  for(i in 3:length(periods_final)){
    output <- list(c(periods_final[i-2], periods_final[i]))
    if(i <= 3){
      output_final <- output
    } else {
      output_final <- c(output_final, output)
    }
  }
  
  locations <- unique(data_model$location)
  
  for(i in 1:length(locations)){
    for(j in 1:length(output_final)){
      if(length(locations)==1){
        
        output_temp <- index_hand_design(data_model,output_final[[j]], locations[i], horizon, location_one = 'yes') 
      } else {
        output_temp <- index_hand_design(data_model,output_final[[j]], locations[i], horizon)
      }
      if(j == 1){
        final_inner <- output_temp
      } else {
        final_inner <- c(final_inner, output_temp)
      }
    }
    if(i == 1){
      final <- final_inner
    } else {
      final <- c(final, final_inner)
    }
  }
  
  index_final <- list(index = final[seq(1, length.out = length(locations)*length(output_final), by = 2)], 
                      indexOut = final[seq(2, length.out =length(locations)*length(output_final), by = 2)])

Code function

I find this part of the code hard to understand and read. My excuse is that it was written under time pressure, after I found out it had to be done by hand. For that very reason it is a prime candidate to review here.

One of the problems is the inner function, “index_hand_design”. I refer back to section 1 on how to properly design helper functions: all the problems discussed for the outer function apply to this inner helper function as well. There is also an argument against defining a function within a function.

To finish off these practical examples I would like to show how to change the nested for loop I used to run the index_hand_design function. Removing for loops sits at the top of my code review checklist. Below is that code segment changed into an lapply and mapply format.

  if(length(locations) == 1){
    final_inner <- lapply(seq_along(output_final), function(x){
      index_hand_design(data_model, output_final[[x]], locations, horizon,
                        location_one = 'yes')
    })
  }
  if(length(locations) > 1){
    # Create dataframe with all location/period combinations:
    sequences <- expand.grid(location = locations,
                             period   = seq_along(output_final),
                             stringsAsFactors = FALSE)
    final <- mapply(function(location, period){
      index_hand_design(data_model, output_final[[period]], location, horizon)
    }, sequences$location, sequences$period, SIMPLIFY = FALSE)
  }

The code above is far more concise than the initial nested for loop. In my final pieces of code I always focus on replacing any remaining for loops with lapply, or with mapply in the case of nested for loops.

Wrapping up code review example of R Code in Caret

There were some other problems in the cross validation section, which will end up in the final code on GitHub. I highlighted the removal of the nested for loop as a practical example; the function was almost done anyway. It is very revealing to work through your own code from a code review perspective. I definitely recommend investing the time in optimizing your work in this way. Hopefully these practical examples benefit your own work.

Code review checklist R code edition
Blog, PSA, rblog

Code Review Checklist R Code Edition Top 3

Great code review is one of the most underrated skills a Data Scientist can have. In this blog I will share the top 3 items on my code review checklist, specifically for R code. In our data science team we regularly do code review to make sure our code is up to the standard it needs to be. The top 3 elements mentioned in this blog should, in my opinion, be included in any code review for R code.

One of the greatest pitfalls of code review is developer bias. The goal of code review is to create the best code, not the code you as a developer like the best. I can’t claim this blog is completely wiped clean of developer bias; we all have it. For me it is the main reason why I disliked code review for so long: I wanted to write the code I liked best, and my reasoning was that if it works, why does it even matter how it works? Over the years, as members of our team came and went, I have learned what code review can add. By using best practices we design our code to be better understood and easier to maintain. Anyway, here is my code review checklist top 3.

R Code Review Checklist: 1. Helper Functions

Helper functions are without a doubt my number 1 on any code review checklist. It is no exaggeration to say that every for loop ever written in the R language should be replaced with a helper function. When I review someone else's code this is the first scan I make. At its core R is a functional programming language, which should say a lot about why we put focus on this aspect. Some other reasons are:

Code Separation

The benefits of helper functions are many; for me, since I love a great function, they also make code easier to read. But seriously, one of the main reasons is how they separate your code into concise chunks. Great production-ready code consists of a central script that loads different helper functions as needed, and in the helper functions the actual data processing, modelling, etc. is performed. As a reviewer of your code this allows me to evaluate each chunk in the context of what it is meant to do.

Transferability of Helper Functions

Once we build our code in functional chunks/functions, we improve the transferability of our code, both between projects and between colleagues. A core benefit of code review in R is that it allows you and your team to derive a core set of functions for your company data. Ultimately these functions could be gathered and transformed into a company package.

I started this segment by focusing on for loops, but in reality helper functions should be more common than that. They prevent repetition not only in for loops but also across projects. So if I see a step written out in a colleague's code that is likely to need repeating, I will mark it as a potential helper function.
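
A toy sketch of the idea: instead of copy-pasting a loop per dataset, write one small helper and reuse it. The datasets below are just built-in stand-ins.

# Hypothetical helper: share of missing values per column of any data frame.
summarise_missing <- function(df){
  vapply(df, function(col) mean(is.na(col)), numeric(1))
}

# Reused across "projects" without rewriting the loop each time:
summarise_missing(mtcars)
summarise_missing(airquality)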

R Code Review Checklist: 2. Package Optimization

What I remember most from starting out with R is my amazement at packages. Now, I never had much other programming experience, so stick with me on this. What was this magic that let you load functions to do whatever you needed to your data? And how many packages are there currently on CRAN? The answer is 17,260 packages. That's a whole lot of code ready for you to work with in your projects. One of the great skills an experienced developer can have is to be library agnostic, and this is my focus when reviewing code.

Why We Don't Hammer Down Screws

What exactly is library agnostic? It means that you are willing and able to switch packages when needed. I'm trying to put emphasis on using the right tool for the right job; this should be a central element in code review. An area where this was long a problem for me is base R. I would nearly always use base R code, as this is what I learned and was comfortable with. I had no immediate reason to use packages such as the tidyverse or data.table; my code was working fine. But it ended up holding me back: because I was incapable of switching packages as needed, my code was inefficient in specific areas.
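
To make "the right tool for the right job" concrete, here is the same simple aggregation in base R, dplyr and data.table; which one fits best depends on the project, not on personal habit.

library(dplyr)
library(data.table)

# Base R
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# dplyr
iris %>% group_by(Species) %>% summarise(mean_sepal = mean(Sepal.Length))

# data.table
as.data.table(iris)[, .(mean_sepal = mean(Sepal.Length)), by = Species]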

It is one of the top 3 jobs of the reviewer to identify these areas and give feedback on optimizing the usage of packages. Look at what packages might fit the specific use case and what packages are implemented in the code. When providing feedback in this area I like to explain different options and their potential benefits (speed, understandability, etc.). In my opinion this also helps the reviewer think more about the code, making it an area that benefits both parties.

R Code Review Checklist: 3. Code Placement

When I write an initial script for a project, it always starts out quite structured. It's like a ritual: load data, select, transform, make visualizations, etc. But then one thing always happens: I notice something in my output, a weird pattern. Perhaps it's a colleague who wants a slight alteration of the insights. What happens to my code now? I end up placing a quick filter on row 241. A week goes by, maybe even a month, and I revisit the project. The scope has changed slightly and I start adding code again.

If you are lucky this is a project that ends at some point, and your code gets retired with it. But what if it becomes a success? We are going to need production-ready code that is shareable within the organization. Code review is all about this: we focus on reordering and optimizing code placement. As a code reviewer, split the code in front of you into use cases and their appropriate chunks. If someone loads data outside the ‘load_data’ chunk, mark it. If someone filters their data within a data visualization, mark it.

The value of great code placement is difficult to overestimate. It hits all the relevant boxes of code review. It improves readability of code and transferability between projects, and very often increases its functionality too.

Wrapping Up Code Review Checklist

These 3 points together make up the core of my code review and belong on any code review checklist. They also have one thing in common: they all focus on the outer appearance of code, and I am a strong believer that this is where the biggest benefit of code review lies. By optimizing readability and transferability we create much-needed synergies within data science teams and projects.

You will find online articles that see validation of the output of code as part of code review. I personally think that any validation is part of the code writing process itself. If you don't have checks and balances in your code that validate your inputs and outputs, that's a big problem. The role of the code reviewer is to check for these elements, but they are not responsible for writing or doing it themselves.

This blog was very focused on the ‘how’ of code review. If you are interested in an introduction to the what and why of code review, how it benefits your team or company, or a practical example of code review, I will release content on those subjects soon. Hopefully, put together, some of it can benefit you in starting out with code review.

Kaggle ACEA Water Analytics Competition Final Model
Data Science project, rblog

ACEA Smart Water Analytics Competition; Final Model overview

This is blog 4 of my endeavors for the currently ongoing Kaggle challenge posted by ACEA. A short introduction of the challenge is below. What I am trying to do with these blogs is not to create perfect tutorials that contain perfectly revised code, but rather document my own project process and findings at any moment. This blog shows the ACEA Smart Water Analytics Competition final model overview, before the project deadline.

Please note, if you want to follow along with any code in these blog series then sign up for the challenge here: https://www.kaggle.com/c/acea-water-prediction/overview and you will receive the dataset for free.

The challenge: ACEA Smart Water Analytics Competition

The goal of this challenge is to predict water levels in a collection of different water bodies based in Italy. Specifically, we have to build a time series model that accurately assesses tomorrow's water level based on today's data. To briefly note the Kaggle rules: this is an analytics challenge, which means creating a compelling story & notebook is a very important part. My notebook is publicly available on Kaggle here, but I will work through some code excerpts and interesting highlights in these blogs.

So far we have discussed the challenge more generally and looked at some data wrangling and new features for modelling. Last time we overcame a classic issue with time series modelling concerning our cross-validation. This week we will finish off the modelling and discuss how we wrap up the challenge for hand-in. You can find the blogs below if you want to read back first:

ACEA Smart Water Analytics Competition Final model

The last blog was a more in-depth discussion of hindsight bias in the cross-validation stage of the modelling. By using a specific time series method we stabilized our model. In the week after, I stumbled upon the work of Hanna Meyer. She is a machine learning expert in the field of environmental data. There is a great research paper she wrote available here (https://www.sciencedirect.com/science/article/abs/pii/S1364815217310976?via%3Dihub).

Studying her work made me realise that the Kaggle dataset has all the characteristics she discusses in her paper. She covers in depth both the time series aspect covered by Hyndman and the spatial element of the dataset, which in this case means having multiple measurement points that measure the water level. The main advantage of setting up the model in this way is that it is more sensitive to unknown locations within the given water body. On top of that it allows for simultaneous modelling of all given locations.

Leave Location and Time Out (LLTO) modelling

So in the final model I introduce both a spatial element and a time series element. Methodologically this is called LLTO (Leave Location and Time Out) cross validation. Essentially the idea combines all the steps discussed earlier.

In your training set you leave out the location that is in your validation dataset; this covers the spatial element. You then only include time points from before the period in the validation dataset; this covers the time element. If you have 4 locations you get 4 folds per time period, with each location placed in the validation dataset once. This method is implemented in the CAST package in R. However, I found that the time aspect is not handled properly in the relevant function (CreateSpaceTimeFolds). Hence I ended up making my own folds that respect both aspects.

R-Code

In the code below there is an example function that handles all these steps. Of specific interest may be lines 47-101, where the handmade folds for the cross-validation are created. I have not perfected this code, but it shows the main steps. If you want to know more about this or discuss it, then don't hesitate to contact me. We might get back to it later in a different blog.

model_handmade_folds <- function(data, horizon, dataset, lag, features, basefeatures){

#basefeatures <- 'Depth'

# Make lags:
features <- grep(features,names(data),value = T)
basefeatures <- grep(basefeatures,names(data),value = T)

for(i in 1:length(features)){
    for(j in 1:lag){
        data$temp <- Lag(data[,features[i],+j])
        names(data)[which(names(data)=='temp')] <- paste(features[i],j, sep = '_')
    }
}


data <- data[,which(colMeans(!is.na(data))>.2)]
    
# Inlude seasonality:
data$year <- as.numeric(substr(data$Date,7,10))
data$year <- data$year - min(data$year) + 1
data$month <- as.numeric(substr(data$Date,4,5))
data$quarter <- ifelse(data$month <= 3,1,
                            ifelse(data$month >=4 & data$month <= 6,2,
                                ifelse(data$month >=7 & data$month <= 9,3,
                                    ifelse(data$month >9,4,NA))))

data_long <- tidyr::pivot_longer(data, basefeatures,names_to = 'location', values_to = 'depth_to_groundwater')

data_long <- data_long[complete.cases(data_long),]
data_long <- data_long[which(data_long$depth_to_groundwater != 0),]

#data_model <- data_long[,-grep('location|Date|name',names(data_long))]

temp <- data_long[,which(!names(data_long)%in%c('depth_to_groundwater','Date','location'))]
nzv <- nearZeroVar(temp)                                                       # excluding variables with very low frequencies
if(length(nzv)>0){temp <- temp[, -nzv]}
i <- findCorrelation(cor(temp))                                                # excluding variables that are highly correlated with others
if(length(i) > 0) temp <- temp[, -i]
i <- findLinearCombos(temp)                                                    # excluding variables that are a linear combination of others
if(!is.null(i$remove)) temp <- temp[, -i$remove]
data_model <- data_long[,c('depth_to_groundwater','Date','location', names(temp))]

data_model$Date <- as.Date(as.character(data_model$Date), format = '%d/%m/%Y')

# Handmade indexes:
index_hand_design <- function(data,period, location, horizon, location_one = NULL){
    horizon2 <- max(period)-horizon
    if(!is.null(location_one)){
    indexin <- which(data$Date >= min(period) & data$Date <= horizon2)
    indexout <- which(data$Date > horizon2 & data$Date <= max(period))
        
    } else {
    indexin <- which(data$Date >= min(period) & data$Date <= horizon2 & data$location != location)
    indexout <- which(data$Date > horizon2 & data$Date <= max(period) & data$location == location)
    }
    output <-c(list(indexin),list(indexout))
    output
}

periods <- round(length(seq.Date(from = min(data_model$Date),to = max(data_model$Date), by = 'day'))/horizon,0)
dates   <- seq.Date(from = min(data_model$Date),to = max(data_model$Date), by = 'day')
indices <- 1:periods*horizon

periods_final <- dates[indices]
periods_final <- periods_final[!is.na(periods_final)]

stopifnot(length(periods_final)>=4)

for(i in 3:length(periods_final)){
    output <- list(c(periods_final[i-2], periods_final[i]))
    if(i <= 3){
        output_final <- output
    } else {
        output_final <- c(output_final, output)
    }
}

locations <- unique(data_model$location)

for(i in 1:length(locations)){
    for(j in 1:length(output_final)){
        if(length(locations)==1){
           
        output_temp <- index_hand_design(data_model,output_final[[j]], locations[i], horizon, location_one = 'yes') 
        } else {
        output_temp <- index_hand_design(data_model,output_final[[j]], locations[i], horizon)
        }
        if(j == 1){
            final_inner <- output_temp
        } else {
            final_inner <- c(final_inner, output_temp)
        }
    }
    if(i == 1){
        final <- final_inner
    } else {
        final <- c(final, final_inner)
    }
}

index_final <- list(index = final[seq(1, length.out = length(locations)*length(output_final), by = 2)], 
                    indexOut = final[seq(2, length.out =length(locations)*length(output_final), by = 2)])


fitcontrol <- trainControl(verboseIter = T,
                          index = index_final$index,
                          indexOut = index_final$indexOut)

gbmGrid <-  expand.grid(interaction.depth = c(1,2,4), 
                          n.trees = 1:4000, 
                          shrinkage = c(0.01), 
                          n.minobsinnode = c(2,5))

if(length(locations)>1){
for(i in 1:length(locations)){
    data_model$temp <- ifelse(data_model$location == locations[i],1,0)
    names(data_model)[which(names(data_model) == 'temp')] <- paste(locations[i],'ind', sep = '_')
}
}

err <- try(load(paste(maindir, modeldir, paste('data_model = ',dataset,'horizon =',horizon,'.RData', sep = ''), sep = '/')))
if(err != 'train'){
 
    train <- train(depth_to_groundwater ~ . , method = 'gbm', trControl = fitcontrol, tuneGrid = gbmGrid, 
                       data = data_model[,-grep('Date|location',names(data_model))])
    save(train, file = paste(maindir, modeldir, paste('data_model = ', dataset,'horizon =', horizon,'.RData', sep = ''), sep = '/'))
}
return(train)
}

train_auser <- model_handmade_folds(data_auser, 365, 'auser',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Depth')

train_petrig <- model_handmade_folds(data_petrignano, 365, 'petrignano',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Depth')

train_amiata <- model_handmade_folds(data_amiata, 365, 'amiata',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Flow_Rate')

train_lupa <- model_handmade_folds(data_lupa, 365, 'lupa',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Flow_Rate')

train_arno <- model_handmade_folds(data_arno, 365, 'arno',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Hydrometry')

train_bila_flo <- model_handmade_folds(data_bilancino, 365, 'bilancino_flo',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth', 'Flow_Rate')

train_bila_hyd <- model_handmade_folds(data_bilancino, 365, 'bilancino_hyd',15,
                     'Rainfall|Temperature|Flow_Rate|Volume|Hydrometry|Depth|Lake_Level', 'Lake_Level')

ACEA Smart Water Analytics Competition; final thoughts

In doing this challenge I’ve ended up putting a lot of focus on the modelling stage. I learned a lot going through all these steps on different data than I usually work on. When I look over the final model I am happy with the outcome of the project. The main reason is that it can generalize well to new locations of water level measurement points and is robustly designed for time series effects. Overall I feel that harks back to the original spirit of the challenge.

There are some improvements to be made; for example, the model failed on some datasets due to a lack of usable data. I could fix this by looking at imputation methods, an area I have skipped completely during the project. I might revisit that later as needed, as the dataset provided in the ACEA Smart Water Analytics Competition contains a lot of missing data points.

Furthermore, the true applicability of this model is still to be determined. I made some assumptions, namely that we want to predict next-day measurements, whereas ACEA might be interested in multiple-steps-ahead predictions. The model is also quite generalized, which results in easier applicability to new and unknown datasets, as well as new locations. But from a business standpoint it can also be logical to focus on optimizing currently known locations more thoroughly. It definitely interests me how subtle changes in the data setup and modelling approach can lead to completely different use cases from a business perspective.

It's been fun, and I will be back to write about the winning notebook, comparing it to my own findings.

Blog, rblog

ACEA Smart Water Analytics Competition; Hindsight Bias & Caret

This is blog 3 of my endeavors for the currently ongoing Kaggle challenge posted by ACEA. A short introduction of the challenge is below. What I am trying to do with these blogs is not to create perfect tutorials that contain perfectly revised code, but rather document my own learning process and findings at any moment. So hopefully you enjoy some of my thought processes and we can build together from these findings to an ultimately better end model.

Please note, if you want to follow along with any code in these blog series then sign up for the challenge here: https://www.kaggle.com/c/acea-water-prediction/overview and you will receive the dataset for free.

The challenge: ACEA Water Levels

The goal of this challenge is to predict water levels in a collection of different water bodies based in Italy. Specifically, we have to build a time series model that accurately assesses tomorrow's water level based on today's data. To briefly note the Kaggle rules: this is an analytics challenge, which means creating a compelling story & notebook is a very important part. My notebook is publicly available on Kaggle here, but I will work through some code excerpts and interesting highlights in these blogs.

So far we have discussed the challenge more generally, looked at some data wrangling and new features for modelling, and in this blog we will discuss some first modelling steps I took after that last blog.

Basic modelling

To start us off, in the initial blog we discussed elements of the dataset to consider. One of them was temporal: we reasoned that water levels are influenced *over time* by rainfall and temperature. This was why 2 weeks ago we wrote about creating lag features in our dataset, and why this week we must work with time series based models.
In my journey of the past 2 weeks I fell into a classic trap of machine learning with time series, which I will go into in depth. By the end of the blog we will have overcome this issue.

First attempts

My first model contained as basic inputs:

  • Lags of the outcome variable
  • Lags of the predictive variables
  • Seasonal indicators (year, quarter, month)

In the previous blog we only worked with the lags of the predictive variables and reached an Rsquared of .33%; this time we include a few other variables and expect to improve our prediction.

Including lags of the outcome variable is a common time series technique to improve the prediction. In my financial economics class we learned about the concept of ‘random walks’: essentially a time series that can't be predicted, as it contains no pattern. This is, according to the study books, applicable to the day-to-day trading of currency markets: the best prediction of tomorrow's exchange rate is today's exchange rate, and no one can say any more than that. Random walks aside, the point is that if we want to predict something for tomorrow, we had better include all the information on that phenomenon we have to date in our model, including lags of our outcome variable.
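
As a tiny illustration of what including lags of the outcome means in practice (toy numbers, using the same quantmod::Lag function as the modelling code):

library(quantmod)

depth <- c(10.2, 10.1, 9.8, 9.9, 10.0)
data.frame(depth,
           depth_lag1 = as.numeric(Lag(depth, 1)),   # yesterday's value
           depth_lag2 = as.numeric(Lag(depth, 2)))   # the value of two days ago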

The seasonal indicators are added to allow for trends in years, quarters and months. For example, there might be a decline in water levels over time due to global warming; our year indicator can account for this effect.

The next general step is to preprocess our data, which we discussed in depth in the last blog, and set up our model for cross validation. Cross validation holds out a part of the training data to validate the results; this part is often called the ‘validation’ set for that reason. I used a 10-fold cross validation, with 3 repeats.

Afterwards I set up my grid search for the gbm (Gradient Boosting Machine) model we will be using. I initially chose the gbm model because I am familiar with how it works; that honestly was the only reason. It is smart to try multiple models and compare them later, and also to run a simple linear model as a baseline comparison.

create_data_model <- function(data, features, outcome, lag, preprocess = NULL, include_outcome = NULL, test = NULL){
    # data: This function allows for a generic data object, 
    # features: some features that are column names in the data object
    # outcome: name of the outcome column
    # lag: Highest lag to be included in the model 
    # preprocess : Whether the data is preprocessed before modelling   (Highly recommended!)
    # include_outcome:  Indicator if the outcome variable is included with lags
    # test: Indicator if a test set is to be constructed (default to 1 year)

    data_model <- data[,c(outcome,'Date',features)]
    names(data_model)[which(names(data_model)== outcome)] <- 'outcome'
    
    # Feature addition
    # Laggs
    if(!is.null(include_outcome)){
        features <- c(features, 'outcome')
    }
    
    for(i in 1:length(features)){
        for(j in 1:lag){
            data_model$temp <- Lag(data_model[,features[i],+j])
            names(data_model)[which(names(data_model)=='temp')] <- paste(features[i],j, sep = '_')
        }
    }
    
    # Inlude seasonality:
    data_model$year <- as.numeric(substr(data_model$Date,7,10))
    data_model$year <- data_model$year - min(data_model$year) + 1
    data_model$month <- as.numeric(substr(data_model$Date,4,5))
    data_model$quarter <- ifelse(data_model$month <= 3,1,
                                 ifelse(data_model$month >=4 & data_model$month <= 6,2,
                                        ifelse(data_model$month >=7 & data_model$month <= 9,3,
                                               ifelse(data_model$month >9,4,NA))))
    
    # Data cleaning
    data_model <- data_model[complete.cases(data_model),]                              # Remove all rows with missing values
    data_model <- data_model[which(data_model$outcome!= 0),]                           # Remove all outlier measurements
    
    if(!is.null(test)){
        data_test <- data_model[which(as.Date(data_model$Date,format = '%d/%m/%Y')>=as.Date(as.Date(max(data_model$Date), format = '%d/%m/%Y')-365)),]
        data_model <- data_model[which(!as.Date(data_model$Date,format = '%d/%m/%Y')>=as.Date(as.Date(max(data_model$Date), format = '%d/%m/%Y')-365)),]
        data_test$Date <- NULL
    }    
    
    data_model$Date <- NULL
    
    # Statistical preprocessing
    if(!is.null(preprocess)){
        temp <- data_model[,-1]
        nzv <- nearZeroVar(data_model)                                                 # excluding variables with very low frequencies
        if(length(nzv)>0){temp <- temp[, -nzv]}
        i <- findCorrelation(cor(temp))                                                # excluding variables that are highly correlated with others
        if(length(i) > 0) temp <- temp[, -i]
        i <- findLinearCombos(temp)                                                    # excluding variables that are a linear combination of others
        if(!is.null(i$remove)) temp <- temp[, -i$remove]
        data_model <- data_model[,c('outcome', names(temp))]
    }
    
    if(!is.null(test)){
        data_test <- data_test[,c('outcome',names(temp))]
    }
    
    # Modelling:
    fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 3,
                           verboseIter = T)

    gbmGrid <-  expand.grid(interaction.depth = c(1,2,4,8), 
                          n.trees = 1:2000, 
                          shrinkage = c(0.01,0.001), 
                          n.minobsinnode = c(2,5))

    err <- try(load(paste(maindir,modeldir, paste('outcome =',outcome,'lag = ',lag,
                                                  'preprocess = ',preprocess,'include_outcome = ',include_outcome,'test = ',test,'.RData', sep = ''),sep = '/')))
    if(err != 'train1'){
        train1 <- train(outcome ~ ., data= data_model, method = 'gbm', trControl = fitControl, tuneGrid=gbmGrid)
        save(train1, file = paste(maindir,modeldir, paste('outcome =',outcome,'lag = ',lag,'preprocess = ',preprocess,'include_outcome = ',include_outcome,
                                                          'test = ',test,'.RData', sep = ''),
                                  sep = '/'))
    }
    if(!is.null(test)){
        data_test$predict <- predict(train1, newdata = data_test)
        return(list(data_test, train1))
    }
    
    train1
}

Then something weird happened: I hit an Rsquared of 0.999, a near perfect prediction! You can find my training results in the picture below. It turned out that I had fallen into a classic time series trap. The great thing is that we can learn together what the difference is between a correct prediction and a naïve approach.

What went wrong?

The trap is as follows: by using a naïve cross validation of the dataset we have ignored its temporal aspects. Specifically, we included data points in our training set that lie in the future relative to our validation set; for example, a given training sample might contain a lot of data from 2018, while the validation set contains data from 2016. This obviously does not reflect the situation where we use the model today to predict ‘tomorrow’.

There is a lot of literature on cross validation with time series; a specific person to look for is Rob Hyndman (3.4 Evaluating forecast accuracy | Forecasting: Principles and Practice (2nd ed) (otexts.com)), who introduced a more advanced method that I started to work on next. What we need to ensure is that the temporal aspect is respected in our training and validation data, and thus any training data must precede the validation data in time. Luckily R has such a rich community that the caret package I've been using for my modelling has specific settings in trainControl to address this issue. Hyndman's own forecast package also has options for this modelling procedure.

We are going to make use of the ‘timeslice’ method of trainControl. This method allows you to set up a window of time, measured in rows of your dataset; a dataset with one day per row might use a window of 365 to represent 1 year. Next is the horizon argument: the part of the window used for validation; a logical approach might be to set the horizon to 30, as in 1 month. Then there is the skip argument, which controls how the window moves through time: if it's set to 30 in the example above, window 1 is day 1:365 and window 2 is day 31:395. Lastly we can set the fixedWindow argument, which tells the timeslice method whether the window moves along the dataset in time or expands with every skip. The relevant code, compared to the block above, is shown below:

  fitControl <- trainControl(method = "timeslice",
                             initialWindow = 500,
                             horizon = 150,
                             fixedWindow = FALSE,
                             skip = 149,
                             verboseIter = T)

Having set all these arguments we rerun the model with proper cross validation and find an Rsquared of 0.78, still high but not borderline unrealistic. Below you can find my training results in more detail:

What can we conclude?

It is actually pretty great to see the exact difference between a naïve modelling approach and one tailored to the issue at hand. I didn't know much about time series modelling and dove straight in; if I had not checked myself on what I was doing and why, the model would have been introduced in a completely overfitted state.

When looking at the test set we find that the MSE is practically identical for the model trained with specific time series cross-validation folds and the overfitted variant. I would personally have to dive deeper into the literature to explain the exact difference, but my gut feeling already tells me that showing up with the overfitted model is a mistake. We can use the specifically designed method, put more trust in its results in a variety of situations, and show that on unseen data it performs identically.

To summarize, the new model reports slightly less predictive power, but it is designed for the environment it will actually have to work in once in production.

Next we will look at the final model I ended up using for the challenge and some thoughts before I submitted my notebook.

Blog, rblog

ACEA Smart Water Analytics Competition; Data Setup

In this blog series I will document my work on the Kaggle ACEA Smart Water Analytics Competition. The purpose of these blogs is not to share perfect code or completely worked-out solutions. Rather, I want to take you along for the ride and share my insights as they are. There will definitely be some fails and hopefully a success mixed in. In this series I will discuss 4 distinct steps in my project.

If you want to follow along with any code in these blog series then sign up for the challenge here: https://www.kaggle.com/c/acea-water-prediction/overview. You can download the dataset for free after signing up.

The challenge: ACEA Smart Water Analytics Competition

The goal of this challenge is to predict water levels in a collection of different water bodies in Italy. Specifically, we have to build a time series model that accurately assesses the water level of tomorrow based on the data of today. This is an analytics challenge, which means creating a compelling story and notebook is a very important part. My notebook is publicly available on Kaggle here. I will work through some code excerpts and interesting highlights in these blogs.

For today I want to address some of the data setup steps. These are important for preparing your data for machine learning models.

The how-to’s on setting up your data

Data setup for machine learning models sounds a bit mystical. It is the crucial step after our initial insights, where we shape our data so it can answer our main research question. Here we add features, transform variables and put in the business logic. This is where the business knowledge lives, and it can make or break your findings.

There are roughly 2 ways to go about setting up your data for your machine learning efforts: the traditional approach and the modern approach. In this blog we will compare the two and show code examples for both. In the traditional approach we build our feature set and then clean up our dataset before we start modelling, for example by removing variables that are highly correlated. In the modern approach we load all features into the model; only after running the model do we trim our data and rerun it.

Another way of comparing a traditional approach with a modern approach is as follows. For the traditional approach we use our own reasoning and knowledge of statistics to optimize the model outcome. The modern approach replaces these steps with raw computing power.

ACEA Smart Water Analytics Data Setup; Traditional approach

In the traditional approach for data setup we define the following steps:
  • Feature addition
  • Data cleaning
  • Statistical preprocessing

Feature addition

We will work with the Auser dataset from the ACEA Smart Water Analytics Competition on Kaggle. For further explanation of the dataset please refer to the introduction blog. It includes a total of 5 outcome variables, each of which needs to be predicted in the final notebook. For illustration purposes we will focus on the LT2 variable, which according to the data definition is the only water level measurement in the southern part of the aquifer. As predictors we will focus on the rainfall and temperature variables in the Auser dataset.

One of my main considerations was the temporal aspect of the dataset, specifically the time lags associated with both the temperature and the rainfall variables. A logical first step is therefore to incorporate lags of these variables into our dataset. In the future I can imagine other features to consider here, for example averaging rainfall across multiple measuring points or using a running average over multiple days. A small sketch of what that could look like is shown below.
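To give a flavour of those ideas, here is a minimal sketch (not part of my final pipeline) that adds a 1-day lag, a 7-day running average and a daily average across all rainfall measuring points; the 7-day window is an arbitrary choice, and the column selection simply reuses the ‘Rainfall’ naming pattern from the Auser file.

library(zoo)    # rollmeanr() for running averages
library(Hmisc)  # Lag() for simple lags

# Sketch: derive a lag, a running average and a cross-station average
# for every rainfall column in the Auser data.
add_rain_features <- function(data){
    rain_cols <- grep('Rainfall', names(data), value = TRUE)
    for(col in rain_cols){
        data[[paste(col, 'lag1', sep = '_')]]  <- Lag(data[[col]], 1)
        data[[paste(col, 'mean7', sep = '_')]] <- rollmeanr(data[[col]], k = 7, fill = NA)
    }
    # Average rainfall over all measuring points on a given day
    data$Rainfall_mean_all <- rowMeans(data[, rain_cols], na.rm = TRUE)
    data
}

data_auser_features <- add_rain_features(data_auser)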

Data cleaning

In this part of the process we normally deal with missing values. This can be done in multiple ways, which are described in depth elsewhere. In short, we can simply remove all incomplete rows, or we can try to retain as much data as possible, for example by imputing the missing values; in a business environment that is often a time intensive step. This is where your business logic comes in, and it is an important area to focus on when you are presented with real life ‘dirty’ datasets. For now we simply remove all rows that are incomplete, although a minimal imputation sketch is shown below.
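For completeness, here is a minimal sketch of the imputation route I am not taking today; it linearly interpolates short gaps with zoo::na.approx, and the maximum gap of 7 days is an arbitrary assumption on my part.

library(zoo)

# Sketch: fill gaps of at most 7 consecutive missing days by linear
# interpolation; longer gaps stay NA and are dropped later on.
impute_short_gaps <- function(x, maxgap = 7){
    na.approx(x, maxgap = maxgap, na.rm = FALSE)
}

num_cols <- sapply(data_auser, is.numeric)
data_auser_imputed <- data_auser
data_auser_imputed[num_cols] <- lapply(data_auser_imputed[num_cols], impute_short_gaps)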

Another important step is handling outliers. We find that some measurements in the LT2 variable are unrealistic: from one day to the next the water level moves from the average to 0 and back to the average. These are faulty measurements and need to be removed from the data.

Statistical preprocessing

Now for the most important preprocessing step in the traditional approach: we clean up our predictors by removing highly correlated variables (since they carry largely the same information about the outcome) and near zero variance variables (variables that are, for example, nearly always 0).

These steps help us understand how the data works, and therefore how the model arrives at its final outcome. After finalizing all these steps, 18 variables remain; using those in an initial test run of the model we find an Rsquared of 15%. This is not the best result, but for a first try I’ve had a lot worse!

The next step in the traditional approach is to start tinkering. Initially we work backwards: we look at our excluded variables and try different cutoff values for the different steps. This helps us expand our model; we add features and attempt to improve performance by understanding its inner workings. This step can take days to weeks, depending on how often we rerun the model.

ACEA Smart Water Analytics Data Setup; The modern approach

In what I call the modern approach to data setup we skip the statistical preprocessing; instead we simply load the data into the model and try to optimize the predictive power. Afterwards we evaluate our model and, if possible, rerun a simplified version. When running the model without any statistical preprocessing, but with the same feature set and data cleaning, we find an Rsquared of 33.7%. This gives us some idea of the potential in the data.

Code

Below you will find my function for preprocessing and running the model in both the traditional and the modern approach. I find that writing this as a function helps keep my code clean, and it can be applied to all datasets in the Kaggle ACEA challenge as needed.

library(caret)   # train(), trainControl(), nearZeroVar(), findCorrelation(), findLinearCombos()
library(Hmisc)   # Lag()

create_data_model <- function(data, features, outcome, lag, preprocess = NULL){
    
    data_model <- data[,c(outcome,features)]
    names(data_model)[which(names(data_model)== outcome)] <- 'outcome'
    
    # Feature addition: add lags 1..lag of every predictor
    for(i in 1:length(features)){
        for(j in 1:lag){
            data_model$temp <- Lag(data_model[,features[i]], j)
            names(data_model)[which(names(data_model)=='temp')] <- paste(features[i],j, sep = '_')
        }
    }
    # Data cleaning

    data_model <- data_model[complete.cases(data_model),]       # Remove all rows with missing values
    data_model <- data_model[which(data_model$outcome!= 0),]    # Remove all outlier measurements
    
    # Statistical preprocessing
    if(!is.null(preprocess)){
        temp <- data_model[,-1]
        nzv <- nearZeroVar(temp)                                                       # excluding variables with near zero variance
        if(length(nzv) > 0) temp <- temp[, -nzv]
        i <- findCorrelation(cor(temp))                                                # excluding variables that are highly correlated with others
        if(length(i) > 0) temp <- temp[, -i]
        i <- findLinearCombos(temp)                                                    # excluding variables that are a linear combination of others
        if(!is.null(i$remove)) temp <- temp[, -i$remove]
        data_model <- data_model[,c('outcome', names(temp))]
    }
    
    # Modelling:
    fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 3,
                           verboseIter = T)

    gbmGrid <-  expand.grid(interaction.depth = c(1,2,4,8), 
                          n.trees = 1:2000, 
                          shrinkage = c(0.01,0.001), 
                          n.minobsinnode = c(2,5))

    err <- try(load(paste(maindir,modeldir, paste('outcome =',outcome,'lag = ',lag,'preprocess = ',preprocess,'.RData', sep = ''),sep = '/')))
    if(err != 'train1'){
        train1 <- train(outcome ~ ., data= data_model, method = 'gbm', trControl = fitControl, tuneGrid=gbmGrid)
        save(train1, file = paste(maindir,modeldir, paste('outcome =',outcome,'lag = ',lag,'preprocess = ',preprocess,'.RData', sep = ''),
                                  sep = '/'))
    }        
    
    train1
}

tr_preproc <- create_data_model(data = data_auser,
                                features = grep('Rainfall|Temperature',names(data_auser),value = T),
                                outcome = 'Depth_to_Groundwater_LT2',
                                lag = 15,
                                preprocess = 'yes') 

tr_nopreproc <- create_data_model(data = data_auser,
                                features = grep('Rainfall|Temperature',names(data_auser),value = T),
                                outcome = 'Depth_to_Groundwater_LT2',
                                lag = 15,
                                preprocess = NULL) 

Discussion

So how should you set up your data for modelling? We saw that the traditional approach leaves you with more understanding of the problem at hand. The downside is that it requires more time investment and often demands more business logic. The modern approach skips these steps and moves straight to optimizing predictive power, letting the machine learning techniques themselves sort out the data setup.

Comparing the two, we found that simply loading all information into the model got us to 33% predictive power, not a bad score; our first try at using statistical concepts to preprocess the data only got us to 15%. It did take my laptop 6 hours to run the ‘modern approach’ model, whereas the traditional model was done after 20 minutes.

At heart I am a traditionalist; having trained as an economist, I strive to understand a problem, think it through and find a solution. Ultimately, to put a model to use we often need the deeper understanding that you develop through the extra time investment and hard work. How else are we going to convince other people to use a model for their business problem?

When do I prefer the modern approach? When, for a particular business problem, the outcome is more important than the process, or when the important variables that influence the outcome cannot themselves be influenced. This Italian water level issue is actually a perfect example of a situation that tilts quite heavily towards the modern approach. Temperature and rainfall influence water levels, but we cannot actually influence temperature or rainfall; all we really care about is predicting the water level. It would be great to know that the temperature on a particular mountain is highly predictive, but once we know this we cannot change it in any way.

ACEA Smart Water Analytics Competition next steps

I will be back soon with a more in-depth look at the modelling process I went through for this challenge. Further improvements can definitely be made, such as transformations of the data and making use of lags of the outcome variable.

ACEA Smart water analytics
Blog, Data Science project, rblog

ACEA Smart Water Analytics Competition; Introduction

In this blog series I will document my work on the Kaggle ACEA Smart Water Analytics Competition. The purpose of these blogs is not to share perfect code or completely worked-out solutions. Rather, I want to take you along for the ride and share my insights as they are. There will definitely be some fails and hopefully a success mixed in. In this series I will discuss 4 distinct steps in my project.

If you want to follow along with any code in these blog series then sign up for the challenge here: https://www.kaggle.com/c/acea-water-prediction/overview. You can download the dataset for free after signing up.

The challenge: ACEA Smart Water Analytics Competition

The goal of this challenge is to predict water levels in a collection of different water bodies in Italy. Specifically, we have to build a time series model that accurately assesses the water level of tomorrow based on the data of today. This is an analytics challenge, which means creating a compelling story and notebook is a very important part. My notebook is publicly available on Kaggle here. I will work through some code excerpts and interesting highlights in these blogs.

For today I want to address my initial findings and walk through how I approached the first steps of this challenge.

Designing my load function

One of the first steps everyone undertakes is (obviously) to load the data. My personal preference is to condense this code into a load function. The main reason is that a load function creates a clean and concise notebook; it allows me to build code chunks that are independent of each other by loading data wherever necessary.

My load function will typically contain the following 3 components:

  • Filepath to data
  • Column based filters
  • Merge functionality

In the ACEA dataset this worked out as follows:

# dirs:
maindir   <- 'C:/Prive/Kaggle/ACEA'
datadir   <- 'Data'
scriptdir <- 'Scripts'
htmldir   <- 'HTML'
# load: list all csv files in the data directory
files_loc <- list.files(path = paste(maindir, datadir, sep = '/'), recursive= T, full.names = T)
files_loc <- files_loc[grep("\\.csv", files_loc)]

load_files <- function(path, filter=NULL){
    output <- read.csv(path, stringsAsFactors = FALSE, check.names = FALSE, fileEncoding="UTF-8-BOM")
    output$name <- gsub('\\.csv','', basename(path))
    if(!is.null(filter)){
        return(output[,filter])
    }
    return(output)
}

data_auser <- load_files(files_loc[1])
data_doganella <- load_files(files_loc[2])
data_luco <- load_files(files_loc[3])
data_petrignano <- load_files(files_loc[4])
data_bilancino <- load_files(files_loc[5])
data_arno <- load_files(files_loc[6])
data_amiata <- load_files(files_loc[7])
data_lupa <- load_files(files_loc[8])
data_madonna <- load_files(files_loc[9])

As you can see, there is currently no merge functionality in my load function yet. I imagine I will add it once I see a use for it later in the project. The crux here is that if you are faced with multiple datasets that logically have to be combined, incorporating a merge step into your load function can keep the notebook clean, concise and easy to understand; a rough sketch of what that could look like follows below. The load function does already allow us to load specific columns, and we will see in the next section how useful that is.
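If I do end up adding merge functionality, a rough sketch could look like the code below; it simply loads every file with a shared set of columns and stacks them into one long table, which is an assumption on my part about how I would want to combine the datasets.

library(data.table)  # rbindlist()

# Sketch: load all csv files with a shared column filter and stack them
# into one long table, keyed by Date and the file name added by load_files().
load_all_files <- function(paths, filter = NULL){
    datalist <- lapply(paths, load_files, filter = filter)
    rbindlist(datalist, fill = TRUE)
}

data_long <- load_all_files(files_loc, filter = c('Date', 'name'))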

Temporal elements of ACEA dataset

One of the main aspects of the ACEA dataset is temporal: there are time based elements in the data. It contains multiple types of columns such as rainfall, temperature, hydrometric levels and groundwater levels. The final challenge is to build a time series model that predicts the water level of tomorrow. This means there are some initial assumptions to be challenged or tested:

  • Trends over time
  • Seasonality
  • Availability of data over time

One of the cornerstones of our narrative needs to be time-based trends. The data runs back to 1998 for some variables (see below), which means the different explanatory variables may contain long running trends. I assume, for example, that rainfall is decreasing over time and temperature is increasing due to global warming effects; it is important to test these assumptions. When thinking of seasonality, we may furthermore assume that water levels are lower in summer and higher in winter or early spring. A quick way to eyeball that assumption is sketched below.
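A quick and dirty way to check the seasonality assumption is to average rainfall by calendar month; the sketch below assumes data_auser has been loaded with the load function above and that its Date column uses the dd/mm/yyyy format found in the files.

# Quick seasonality check: mean rainfall per calendar month, averaged over
# all rainfall measuring points in the Auser file (illustrative only).
rain_cols <- grep('Rainfall', names(data_auser), value = TRUE)

monthly <- data.frame(
    month    = format(as.Date(data_auser$Date, "%d/%m/%Y"), "%m"),
    rainfall = rowMeans(data_auser[, rain_cols], na.rm = TRUE)
)
monthly <- monthly[!is.na(monthly$month), ]

aggregate(rainfall ~ month, data = monthly, FUN = function(x) mean(x, na.rm = TRUE))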

As for easy to use temporal visualizations, I designed a plotly chart of all datasets and the dates available in each, to showcase how far back they run and give us an idea of the chance of finding different trends. The code chunk below makes use of the load function we designed earlier, which makes it stand-alone executable.

# Put all dates and names variables:
datalist_date_name <- lapply(1:9, function(x){load_files(files_loc[x], filter = c('Date','name'))})
data_date_name <- data.table::rbindlist(datalist_date_name)
data_date_name <- data_date_name[which(nchar(data_date_name$Date) == 10),]
data_date_name$Date <- as.Date(data_date_name$Date, "%d/%m/%Y")

fig(10,6)
p <- ggplotly(ggplot(data_date_name, aes(x = Date, y = name)) + geom_point() + 
labs (x = 'Date (years)', y = '',title = 'Available dates by dataset') +
insighttheme )

htmlwidgets::saveWidget(p, file = paste(maindir, htmldir, 'dates_viz','index.html', sep = '/'))
Available dates in Kaggle ACEA Dataset

Geospatial elements of ACEA dataset

One of the final things I dived into this week is the geospatial element of the ACEA dataset. We have different aquifers, rainfall and temperature measurements around Italy, so it is worth exploring whether maps can help us develop a story. The goal is to make our viewers understand better how the data works and what crucial insight we are working towards.

For example, it would be helpful to show our temporal elements on a map, so people can more quickly grasp the overall findings across all locations in the ACEA dataset. I did have to use some elbow grease in this section, as the locations cannot simply be grouped by municipality or city; some of them are mountains, bridges, etc. There was no existing comprehensive dataset of these locations I could use, so I put all 64 locations into Google Maps and wrote down the longitude and latitude by hand.

With 64 locations this was the quick fix; the longer fix is found here: https://developers.google.com/maps/documentation/geocoding/overview#RegionCodes. If you have to work with thousands of cities or a plethora of unknown locations, refer to the Google Geocoding API and programmatically extract the longitude and latitude. Since I am looking into doing more Kaggle and also working more with maps in my regular job, I might have some fun and build a function for that later on; a rough sketch of what it could look like follows below.
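For future reference, here is a rough sketch of what such a helper could look like against the Google Geocoding API; it assumes you have a valid API key stored in a GOOGLE_MAPS_KEY environment variable (my choice of name), and the error handling is deliberately minimal.

library(httr)      # GET()
library(jsonlite)  # fromJSON()

# Sketch: look up latitude/longitude for a free-text location via the
# Google Geocoding API (see the documentation link above).
geocode_location <- function(location, api_key = Sys.getenv("GOOGLE_MAPS_KEY")){
    resp   <- GET("https://maps.googleapis.com/maps/api/geocode/json",
                  query = list(address = location, key = api_key))
    parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    if(parsed$status != "OK"){
        return(data.frame(location = location, lat = NA, long = NA))
    }
    data.frame(location = location,
               lat  = parsed$results$geometry$location$lat[1],
               long = parsed$results$geometry$location$lng[1])
}

# Example usage over many location names (names are illustrative):
# coords <- do.call(rbind, lapply(location_names, geocode_location))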

Below is my first attempt at a map, showcasing the average rainfall surrounding the Auser aquifer. Please note that this code makes use of a shapefile of Italian municipalities found here: https://hub.arcgis.com/datasets/e68ceb0a193e4e378b29255b62ab75e0_0. This is a basic version of the map, which I will update further in my notebook as needed for the final submission.

library(rgdal)    # readOGR()
library(readxl)   # read_excel()

italy_mun <- readOGR(files_shp)
italy_mun <- fortify(italy_mun, region = 'COMUNE')

locations <- read_excel(files_geo)
locations$lat <- as.numeric(locations$lat)
locations$long <- as.numeric(locations$long)

mean_rainfall <- aggregate(rainfall_value ~ location, data = data_plot, FUN = mean)
mean_rainfall$location <- gsub('_',' ',gsub('Rainfall_','',mean_rainfall$location))
locations <- merge(locations, mean_rainfall, by.x = 'location', by.y = 'location', all.x = T, all.y = F)
locations <- locations[!is.na(locations$rainfall_value),]

#mean lats and longs:
municipalities <- aggregate(cbind(long, lat) ~ id, data = italy_mun, FUN = mean)
municipalities <- municipalities[which(max(locations$lat)+.1 >= municipalities$lat &                  # Municipalities within auser locations latitude
                                           min(locations$lat)-.1 <= municipalities$lat), ]

municipalities <- municipalities[which(max(locations$long)+.1 >= municipalities$long &                # municipalities within auser locations longitude
                                           min(locations$long)-.1 <= municipalities$long), ]

italy_mun_plot <- italy_mun[which(italy_mun$id %in% municipalities$id),]

ggplot() + geom_polygon(data = italy_mun_plot, aes(x = long, y = lat, group = id), fill="grey", alpha=0.3, col = 'white') + 
    geom_point(data = locations, aes(x = long, y = lat, colour = vartype, size = rainfall_value)) + insighttheme +
    labs(title = "Map of rainfall locations belonging to the Auser aquifer in Italy")
Output of basic map code

ACEA Smart Water Analytics Competition next steps

My takeaway from this blog is simple: look at your data and develop an understanding of its important dimensions, in this case time and location. In other projects these could be dimensions such as currency, speed and/or connectivity. Connect your final model and insights back to these basic dimensions of your data to create a compelling narrative that is easy for all your users to follow.

I will be back soon with Part 2 of the ACEA Smart Water Analytics Competition!

Personality traits for a data scientist
Blog

Personality Traits for a Data Scientist; My top 3 explained.

My top 3 personality traits are probably not the standard ones, which has to do with how I define a great data scientist. That is a troublesome endeavour in general: data scientists are everything and everywhere these days, and the field of data science is very broad indeed. It is therefore important to see these top 3 personality traits from the right perspective.

A great data scientist facilitates value growth within a company by making connections with others and helping people and projects forward in a data-driven way.

Let me explain this briefly. I see a data scientist as someone who drives innovation through data-driven decision making. You know all the technical stuff, the difficult bits; as a great data scientist you move beyond that. You create impact by working with people and helping them think and act differently. You end up as a bridge between the technical expertise and the way people work on a daily basis. Understanding one, and not the other, makes you destined to fail.

No one likes failure, but I’m not going to write this blog from the perspective of how to avoid failure. I want to talk about success, because in my own experience there are some important traits you can learn to cultivate within yourself to become a great data scientist.

Personality traits 1: Horizontal curiosity

Curiosity, almost childlike, should without a doubt be deeply ingrained in your system. There are many facets to curiosity; when starting out in a new industry or company we need it to adapt to our new environment.

Luckily, in such a fresh environment curiosity comes naturally to many. I am more interested in the moments when it does not come naturally, when it no longer gets triggered by outside stimuli. What happens when we don’t nurture curiosity from within? We slowly descend into set rhythms where the only way out is a change of surroundings. That is not a healthy or durable strategy; finding and maintaining curiosity from within yourself gives you longevity in your daily life.

Curiosity has another very important facet: it is not only about duration but also about direction. To become a successful bridge between teams and find business value within your company, you need to direct your curiosity horizontally.

Horizontal curiosity means broadening your field of interest: learning about people and their jobs, finding parallels and bringing them into your own daily routines and work. This is different from vertical curiosity, which is about deepening your knowledge of an ever more specialized skillset. The distinction is often overlooked, and in my experience everyone talks about vertical curiosity. We do need to be curious about deepening our knowledge, but as a great data scientist who brings people together to innovate towards new data-driven processes, you desperately need horizontal curiosity.

The great news is that cultivating this is easy. If you’re new to horizontal curiosity, start slow and ask the people you speak with questions about different areas of their life. Eventually you will see that in every conversation you ever have, you can be horizontally curious.

Personality trait 2: Creativity

I’ve started reading a great book, “Big Magic” by Elizabeth Gilbert. It’s relatively old, printed in 2015, but new to me (you can find it here). It talks about creativity and how it is the gateway to our hidden personal treasures; our life becomes so much richer if we learn to open this gateway. I’m not that far in yet, but it got me curious.

If creativity enriches your own life, which I’m pretty sure it does, I will argue it also enriches the lives of the people around you. And if it does that, then it enriches your company and workplace. As data scientists we share our creativity with our coworkers by imagining new solutions and products. We are in the business of envisioning new horizons, talking to people about them and making them a reality.

“To create”, in other words “to bring something into existence”, is an actionable skill. You need to do, talk, write, code and visualize in order to create. We can cultivate creativity by learning to start one step at a time: a conversation, an essay or even a drawing of our vision. The key aspect of creativity, for me, is taking that step. It will soon turn into a sprint; get those creative thoughts out of your head and onto something, and you will thank yourself for it.

“The only limit to your impact is your imagination and commitment” –  Anthony Robbins

Personality trait 3: Tenaciousness towards people

This is the final trait I want to discuss today. I know that sentence is not exactly screaming tenaciousness: why stop here? Do I have other plans, places to be? Actually, I am committed to writing a blog with some clarity and ease of perspective, and that is why this is my last trait for today.

Tenaciousness is often framed as hardship: lonely on the trail, we walk and are taught to push on. We are told there is light at the end of the tunnel, that we have to commit to our ideas and walk the path to greatness. I too firmly believe in commitment, commitment to a higher purpose and to the people around you. If we are committed to this greater good, and to getting there together, nothing can touch us. We can walk out in front of the group; as data scientists we have to. We offer paths to a new and different future, and what we need to learn is to commit not only to the path, but also to the people who walk it and to the end destination.

There is a place for the more traditional view of tenaciousness. We need to be able to grind out the work: once we agree on setting off down a path, we need the tenacity to deliver. As the saying goes, you can’t lead if you don’t know how to follow. Tenacity towards our own words and agreements translates to other people, and delivering on this continually is what makes people feel that you can bring them to that higher purpose.

Wrapping up

These were my top 3 personality traits for a data scientist. If you want to develop your skillset as a data scientist, focus on these 3 traits; they will shape the impact you create for years to come. Brought together in the right quantities, these 3 traits are on my shopping list whenever I am looking to expand my team and hire a data scientist.

To make valuable connections we need to be curious. To move forward in new directions we need creativity. To bring this reality to life we need commitment to our coworkers and peers.

See you in a click!

Blog

My personal backstory; A road to Data Science for a living

As the title says, today I will share my own personal backstory and how it shaped the way data science came into my life. As every person on the planet will tell you, their story is a little bit different, and so is mine. Now let’s go down memory lane.

From a young age I’ve always been competitive, into strategy games and numbers. If I’m honest it wasn’t always completely healthy; when I was very young I was always looking to confirm my own uniqueness and intellect. For example, in primary school from 6th grade you could play chess for the school selection: they would select 5 people, with the winner at table number 1 all the way down to table 5, to make a team. As a team of 5 we would then play against all the other schools in the area, and our number 1 would play against the number 1 of the other schools. I was number 1 three years in a row, the first time that happened in school history, and in my first year I was 9 and playing against 12-year-olds. I thought it was awesome and it just reaffirmed everything I thought of myself.

Aside from the probable mental health concerns, what I enjoyed so much about chess from an early age is the mix of smart logic and time constraints. At the time I would also play Age of Empires, a real-time strategy (RTS) game with exactly the same tradeoff between time and smart logic. This became my next addiction: during high school I competitively played RTS games online, back when the entire esports scene was in its infancy. RTS games taught me 2 things: how to search for optimal strategies given limited information, and the concept of studying and improving your strategies outside of the competitive environment, in other words training.

By the time I started my economics studies in 2008 I hit a different period in my life; it was right around when online poker became mainstream. This game perfectly matched my interests: it was a natural evolution after playing RTS games, only the stakes were far more interesting. What was new about poker? It was more math intensive and it brought me into the world of databases and statistics. It also connected to my studies, which introduced me to game theory and behavioral economics.

It was through online poker that I first came into contact with a programming language. There was a professional player who developed his own tool to calculate game theory optimal strategies for a given situation; he did this in Python and shared his notebooks with everyone who followed his course. This was a very interesting new challenge for me: by going through his code and altering it to my own needs I taught myself basic Python. Later, during my master’s, we did some Stata, which really isn’t programming, but we also had the option to use R. I taught myself R too, just by copying code, looking things up online and using it for my master’s thesis.

By the time I finished my studies in Health Economics I was ready for a real job. Interestingly enough, I was not directly looking for a job in data science; economic analysis was my principal interest. I actually had a job offer in hand from a consultancy in Rotterdam specializing in health economic analysis. At the time of that offer I had just started working part time, doing some programming work for a data scientist at the insurance company MediRisk. He told me they were looking to expand and got me an interview there. It was my curiosity for something new that led me to choose the data science position at MediRisk.

It can be easy to put something on a page and make it seem like a logical buildup and story, and in a way it was, but the real common thread was always my curiosity for something new. Perhaps not completely new, but always one step forward and one step sideways. I love to broaden my view of the world by learning new skills, ideally by doing. In my current job I get to constantly explore new tools and technologies, understand their potential and excite others to join me in using them to create value. Learning by doing something new is something I aspire to keep doing.

In a next blog I will share some tips and tricks into what I feel is most important if you want to start as a data scientist.

 

Blog

Project Plan Presentation; Explain & Convince

One of the key business skills in data science is giving a great project plan presentation. This is true for almost every job, but in particular for data scientists. Why? It has to do with the nature of data science projects, specifically their innovative aspects.

As data scientists we are at the forefront of innovation, changing how people and companies work. We not only have to convince people the project is worth doing; we also have to convince them to change their historic way of working. People will be skeptical. Part of convincing them is based on your technical expertise, and there is comfort in numbers and projections, but there is also a softer side to your storyline. Today I want to focus on some pointers for sharing your ideas with maximum impact.

In this series we earlier discussed how to write a great project plan, you can check that out here.

Craft a positive outlook

If my project succeeds, this will happen… The way we work will improve, we will become more cost efficient or we will make more sales. All projects strive for something, and what is so important is to start from this final product and build that better future in the minds of your audience. It is, however, even more important not to build this outlook from a negative starting point. Let’s look at an example:

“Think of a new machine learning algorithm you want to design for an app for the company’s home inspection team, the people who visit homes and check on the state of window frames, for example. There is a reasonable chance you started this project because currently the home inspection team misses a lot of cases, or they call in maintenance crews far too often and find too many faults. In any case there is a problem in the status quo and room for potential improvement.”

It is now up to you to paint them a picture of a new and better future: this algorithm will make their lives easier. However, it is also up to you to realize that their first impression can’t be that their current lives are bad, difficult, or carry any other negative connotation. Compare these 2 sentences:

“When this app is successful, we will be able to together work on new services and expand the business.”

“This machine learning algorithm will help our maintenance crews decrease their extreme workloads.  “

Which sentence inspires people to move forward? Which sentence implicitly states that the status quo is full of problems? I have seen a lot of presentations that explain a project’s benefits by going into detail on how big the current problems are. The thing is, everyone is either already aware of these problems or perhaps does not even recognize them as problems. Either way, this loses your audience. Focus on the future, and invite your audience to travel with you.

The gift of responsibility

So we are traveling together, and you are probably looking for a project team. What you should focus on is giving people their own responsibility as much as possible. Craft clear roles and communicate them. Do this the right way and you will have hit a major milestone on the road to the project’s success. Let me show you 2 sentences again:

“To design the app perfectly for your daily jobs, I need your help to tell me what works for you.”

“You will have to tell me how to design the app, if you can test how to work with it on a daily basis the launch will be successful.”

What I am looking for here is to give your coworkers in the project a clear purpose and responsibility, making them part of the success or failure as much as possible. Compare sentence 1 and sentence 2: which one conveys this more directly? Communicating this properly helps you in multiple ways, both by allowing you to focus on what is in your own control and by sharing responsibility for the project’s outcome. Put all this together and you are more likely to receive the commitment you need during the project.

Tell a story of your end results

Now that you have discussed where you want to go and who needs to do what along the way, it’s time to ingrain your end goal into the minds of your audience. The key here is to tease your end result just enough, by showing some examples and mockups, while avoiding filling in all the details. Instead, let your audience fill in the details and leave some room for questions.

You know where you’re going, but what exactly will you do when you get there? You may have an idea of how the new algorithm will fit into the daily work of your home inspectors, but keep an open mind; if someone else has a great idea, leave space for that.

The story you should tell is focused on the practical day-to-day activities. Don’t compare the current workday to the future workday; describe only the future. People know what they do on a daily basis better than you, so you run the risk of coming across as someone who doesn’t know their facts. They don’t yet know the new future you will develop together in the coming months, so share some little bits of it. Take a look at this:

“When you enter the building, you will be able to take your ipad and take pictures of the current state of the building as usual. From this point on, the app will be there for you whenever help is needed. Unsure of the level of degradation? Unclear if you really saw every part of the window frame? For every question you will either automatically be guided by the algorithm or easily consult a coworker.”

Wrapping up Project Plan Presentation

These 3 pointers should help you convince people more effectively that your projects are worth doing. Crafting a positive outlook, delegating roles and sharing responsibility will help you build your future. What tips do you have for bringing people along on your projects? Let me know below.

See you in a click!