
Writing a Data Science Project Plan: 5 Components Explained

Writing a great Data Science project plan is a special skill. It does not come naturally to many people, me included. Over the past years I have written many project plans, and in my own experience it's easy to get lost in what format to use and what content to include. And how do you balance explaining the process of the project against the actual content of the project? There are many tutorials and templates about how to run your project, but in this blog I will give you some tips & tricks for writing it down.

This skill is very important: it's your project plans that communicate your ideas. These documents are shared extensively with your coworkers and bosses. Being able to write a great plan gives you freedom at work. It allows you to work on your own vision of what is most important to the company. From a data science perspective this is extra important: we are busy with new technology and fresh ideas that need extra convincing. If you fail to develop this skill, you are in danger of spending your working life focusing on other people's plans. Always remember that your project plan is meant to inspire, explain and convince. Here are 5 project plan components that have helped me along in the past.

Introduce the bigger scope

We all start somewhere, and your data science project plan begins here. I find great benefit in starting with the bigger company strategy your project contributes towards. This gives your readers a clear initial scope, and they should be able to relate to your ideas immediately. It is crucially important to start writing from the bigger scope & vision of your company or client. Explain time and again where you want to go, and how your project contributes to this vision. If you get people to buy into the bigger picture of your project plan, they will follow your specifics later.

An example of this concept might be helpful.

Let's say your company's goal is to help speed up the transition to clean energy. You are doing this by selling solar panels. You have a dataset on lifecycle management & repairs of the solar panels. The project plan describes how you want to more accurately predict when repairs might be needed. Now that we have introduced the project within the scope of the company, starting anywhere else would feel unnatural. But let's continue.

The big no-no is to start by explaining the current issues with repairs and breakdowns of solar panels. Instead, take the time and effort to link your project to the larger vision and explain how it contributes to that. First explain your macro vision, then get specific.

Visualize & tell the story of your specifics

In your first paragraph you've reminded everyone why we are doing what we do. People now know how this project fits that framework. Your next goal is to get specific: what a good data science project plan needs is a precise technical goal. Link this to a current problem.

Is there a statistic or graph that displays your current issues? This is especially important when your project is focused on analyzing a dataset. Try to visualize your data and problem, and visualize your goal of improving this.

Jumping back to our solar panel company, let's say the current lifetime of a solar panel is 20 years, and repairs are common within a specific timeframe. Perhaps we show people a survival curve, and visualize how this curve could change after the project.
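To make this concrete, here is a minimal sketch of how such a before/after survival curve could be plotted. It uses matplotlib with made-up Weibull parameters; every number is illustrative, not from a real fleet.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative survival curves for a solar panel fleet.
# All parameters below are invented for the sake of the sketch.
years = np.linspace(0, 30, 300)

def weibull_survival(t, scale, shape):
    # S(t) = exp(-(t / scale) ** shape): fraction of panels still failure-free at age t.
    return np.exp(-((t / scale) ** shape))

current = weibull_survival(years, scale=20, shape=2.5)    # status quo
projected = weibull_survival(years, scale=24, shape=2.5)  # hoped-for effect of predictive repairs

plt.plot(years, current, label="Current repair policy")
plt.plot(years, projected, linestyle="--", label="With predictive maintenance")
plt.xlabel("Panel age (years)")
plt.ylabel("Fraction of panels without failure")
plt.title("Illustrative survival curves, before and after the project")
plt.legend()
plt.show()
```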

If your project is more focused on working with people, this section is a great place to feature your end users. Ideally they will explain the problem themselves. If this is not feasible, apply some storytelling about what this person runs into on a daily basis. From that background you can introduce the dashboard or app that will improve their decision making.

The goal here is to use visualization or storytelling to immediately ingrain in your readers what the main problem is you're focusing on, and how this problem will disappear in the future.

Explain your project's process

Now is the time to convince people. Think of it like this: so far your plan is easy to follow and believe in, because we have shown how the project fits the broader scope and explained the problem and the specific technical goal. But how are you going to achieve this? This is where you can lose your readers or have them commit to your idea. As mentioned, there are many tutorials on different project processes; use this to your advantage and choose an appropriate one for your analysis.

The most important thing I've learned is that it's always better to apply your process model to the project directly. How are you going to work in this specific case? Your readers are looking for realism: if you expect that the modelling phase of your project is going to be the most important step for success, write it down. This has two benefits: first, your reader doesn't have to get lost in how projects are run in theory; second, if you do this step correctly, you can reference the project plan later when you're actually working.

Deep-dive on specifics, specifically

So far the project plan has been non-technical: if you've done it properly, you have not yet explained the specifics of your machine learning algorithm or the nitty-gritty of the data warehouse architecture that is to be developed. Now is the time to explain these elements in more detail. Look back at your problem scope and intended technical goal. What methodology are you going to use to solve this issue? How is your dashboard going to look? This section, more than any other step in your project plan, depends on your target readers.

Think of it like this: remember that the purpose of your project plan is to inspire, explain and convince. This section is where you convince people, but everyone gets convinced differently. We have a collective habit of getting technical and wanting to let everybody know our solution, so that they start believing too. This is the 'what' of your solution. But this step only hits home with your technical colleagues; if they don't have to read your project plan, don't write this paragraph.

If you need to convince your boss, then you will have to do more work here: not to further explain and teach, but to simplify and generalize your methodology. If you're going to build a dashboard, provide a mockup here and explain the intended use in more detail. Focus on answering why this specific solution fixes the problem scope you defined earlier. Steer clear of explaining the 'what' too much.

Plan for problems

Planning is the final leg of your project plan. This section, more than the others, depends on your company's culture and the project's specifics. Is it important to be precise? Are there external deadlines that need to be met? Or can you discuss during the project how much priority it's getting? These are all questions I can't answer for you, but they do determine how to write your planning and how much time and effort you need to put into this paragraph. In our company we run many internal projects at once, often juggling them for prioritization. This means that planning and deadlines are often flexible.

What I have learned over the years is that it is crucially important to describe your project planning risks here. Where do you expect a high probability of a roadblock? What would that look like, and how does it affect the timeline you initially had in mind? This is very important information for your readers too, and it shows you have thought not only about how to make things work, but also about how things can go wrong. Then, when things do go wrong, you can refer to your project plan and discuss the consequences with your team.

Writing a data science project plan: wrap-up

The 5 components I outlined above are important elements in every data science project plan. I’ve found them to be invaluable for keeping structure and focusing on the main points.

If you learn to do this correctly it will ultimately allow you to work on the project you love, instead of having to learn to love other people’s projects.


Data Scientists vs Superforecasters

One of my primary interests as a data scientist is great forecasting. You might not personally share this point of view, but I bet you've encountered managers or colleagues who just want a great prediction. Forecasting is the essence of what people think data scientists do. On my quest for better forecasting I encountered a book: Superforecasting by Philip Tetlock.

This book teaches us how to become a superforecaster, without machine learning or AI. In the book, Tetlock describes how, through 30 years of research projects, he met people who can forecast very accurately. We learn the stories of these people and how they use their different skillsets. They apply these skills to answer questions such as:

“How likely is it that this year China and Vietnam enter a border dispute with deadly consequence?”

One of the questions posed to the superforecasters

Having read the book, it was remarkable to learn how these people approach forecasting. It seems to me that we can learn a few things as data scientists. I will discuss my 3 key insights in relation to my experience working as a data scientist. Implement them at your own peril.

Quantify your question

“Hey Joe, will this marketing campaign result in any sales growth?”

Any manager

One of the striking points in Philip's book is the way he phrases forecasting questions. To be a superforecaster you have to measure your progress. And what is the prerequisite for accurate measurement? Objectifiable parameters.

Compare the border dispute with the question above. The border dispute needs to break out within a year; the marketing campaign question is unspecified. It can result in sales growth 2 weeks from now or 2 months from now, and both would count as a correct answer. Superforecasters need an objectifiable question with a clear time horizon. This is true for data scientists too. It is a fresh reminder that to be accurate, we need objectifiable questions.

We need to work with the people around us to drill down on these questions before we dive into the data. The superforecasters in Philip's book taught me this again. Once you start working on objectifiable questions, you can track your forecasts, improve, and learn from them.

Break down your question

Superforecasters don't do 'black-box' guesses. What are black-box guesses? It's what happens when people face a difficult question and attempt to answer it anyway. You will get a response like 'maybe 40%?'. This is a tip-of-the-nose estimate. The mental trap here is forgetting to activate your problem-solving brain because the question seems overwhelming.

The key to breaking down your question is to focus on things you know, and move from there. These problems bear the name of a famous physicist, Enrico Fermi. His most famous question is below.

“How many piano tuners are there in Chicago?”

Enrico Fermi

This question is, on the face of it, a total unknown. Fermi teaches us that we can break it down into components. We can then estimate the components individually based on what we know. This results in better guesses than tip-of-the-nose estimates.
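To make that concrete, here is a minimal sketch of the piano-tuner estimate in Python. Every number below is a rough, order-of-magnitude guess of my own, not a fact:

```python
# Fermi-style decomposition of "How many piano tuners are there in Chicago?"
# Every number below is a rough, order-of-magnitude guess.
population = 2_700_000          # people in Chicago
people_per_household = 2.5
households_with_piano = 1 / 20  # assume ~5% of households own a piano
tunings_per_piano_per_year = 1
tunings_per_tuner_per_day = 4   # ~2 hours each, including travel
working_days_per_year = 250

pianos = population / people_per_household * households_with_piano
tunings_needed = pianos * tunings_per_piano_per_year
tunings_per_tuner = tunings_per_tuner_per_day * working_days_per_year

print(round(tunings_needed / tunings_per_tuner))  # roughly 50 tuners
```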

Superforecasters constantly do this: they focus on what they know, guess those components as accurately as possible, and come up with an overall better answer. As data scientists we too have to break down our questions. For example, our marketing campaign question breaks down into relevant components:

  • What is our current sales volume?
  • How much did our sales grow in the past year?
  • Did we run any marketing campaign in the past year?

These questions allow us to answer components of the original question and craft a more educated response. They allow us to create a mental and mathematical baseline for the problem at hand. Sometimes we forget to take a step back and think about our problem logically. Doing this can help us come up with better overall answers, and remembering Fermi can help us with that.
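As a sketch of that mathematical baseline, the component answers can be turned into numbers. All figures here are invented for illustration:

```python
# Turning the component answers into a rough baseline (all numbers invented).
current_monthly_sales = 100_000   # "What is our current sales volume?"
organic_growth_last_year = 0.06   # "How much did our sales grow in the past year?"
past_campaign_uplift = 0.03       # "Did we run any marketing campaign in the past year?"

# Baseline: expected sales over the next quarter *without* the new campaign.
monthly_growth = (1 + organic_growth_last_year) ** (1 / 12) - 1
baseline_quarter = sum(
    current_monthly_sales * (1 + monthly_growth) ** m for m in range(1, 4)
)
print(f"Baseline next-quarter sales: {baseline_quarter:,.0f}")

# A campaign forecast can now be phrased against this baseline, for example:
# "sales exceed the baseline by more than 3% within 3 months".
print(f"With an uplift like last year's: {baseline_quarter * (1 + past_campaign_uplift):,.0f}")
```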

Work together with expert judgement

Superforecasters take in all the information that is available to them, period. They attempt to make the subjective quantifiable if it supports the end goal: a better prediction. This behavior strikes a particular point that I personally can take to heart.

As data scientists we focus on eliminating subjective judgement, to make decisions based on facts. What we overlook in these situations is that subjective judgement can also be the final piece of our prediction puzzle.

In my daily work I very often reach the limit of what our company's data can predict. In order to improve our estimates we propose gathering more or richer datasets. Why don't we instead propose having the experts who work with our prediction outcomes adjust them as needed? Perhaps we can even program a feedback loop, so we can learn from how our model's outcome is subjectively adjusted.
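A minimal sketch of what such a feedback loop could look like (all records below are hypothetical): log the model's raw forecast next to the expert's adjusted value and the realized outcome, then check which one came closer.

```python
from dataclasses import dataclass

@dataclass
class ForecastRecord:
    """One forecast, the expert's adjustment, and the realized outcome."""
    model_forecast: float
    expert_adjusted: float
    actual: float

# Hypothetical log of past forecasts with expert adjustments.
log = [
    ForecastRecord(model_forecast=120.0, expert_adjusted=110.0, actual=108.0),
    ForecastRecord(model_forecast=95.0, expert_adjusted=100.0, actual=99.0),
    ForecastRecord(model_forecast=130.0, expert_adjusted=125.0, actual=131.0),
]

def mean_abs_error(values, actuals):
    # Average absolute distance between forecasts and outcomes.
    return sum(abs(v - a) for v, a in zip(values, actuals)) / len(values)

actuals = [r.actual for r in log]
print("Model MAE: ", mean_abs_error([r.model_forecast for r in log], actuals))
print("Expert MAE:", mean_abs_error([r.expert_adjusted for r in log], actuals))
# If the expert-adjusted numbers are consistently closer, that is a signal
# the model is missing information the experts have.
```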

In his book, Philip Tetlock discusses this issue and mentions a final concept: the 'wisdom of the crowds'. This is a statistical concept, popularized in 2004, which states that the aggregate forecast very often beats what any single member of the group could have guessed. It's not just a concept that works for 100 humans versus 10 humans making a prediction. It also works for 1 human and 1 predictive model.
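In its smallest form, combining the two is just an average. A sketch, with invented probabilities:

```python
# Wisdom-of-the-crowds at its smallest: combine one model and one expert.
model_forecast = 0.30   # model's probability that the event happens
expert_forecast = 0.50  # expert's probability for the same event

combined = (model_forecast + expert_forecast) / 2
print(f"Combined forecast: {combined:.2f}")

# Once both sources have a track record, the weights can be tuned.
w_model = 0.6
weighted = w_model * model_forecast + (1 - w_model) * expert_forecast
print(f"Weighted combination: {weighted:.2f}")
```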

This concept is very powerful and, in my opinion, should be applied far more often to difficult-to-predict problems. As data scientists we need to learn the value of expert judgement, and not just strive to remove it.

Superforecasting

Superforecasting really is the business of data scientists. Our phones can now automatically categorize pictures into cats or no cats. Soon we may use real-time health data to predict cardiac arrest. Looking at these developments, you can argue data science is at the top of its game. What the superforecasters in Philip Tetlock's book add to this is a refresher on the essence of making forecasts. It's a great reminder that I can wholeheartedly recommend.