Five Bad Habits Every Data Scientist Should Avoid – and How To Prevent Them!
Nobody likes a bad apple.
Nobody likes bad habits either.
Throughout my data science journey, I got to learn and grow with several others who were starting out too. Looking back, I realized that many of us shared the same bad habits.
So I want to share five of these bad habits with you so that you can avoid falling into them.
Let’s dive into it!
1) Not completely understanding the business problem you’re trying to solve.
This is a really common problem that most data scientists (including myself) are guilty of when starting out. Why? Typically, when people learn data science and start their first project, the problems they try to solve are 1) simple and 2) spoon-fed to them (e.g., in bootcamps and tutorials).
For example, in a Kaggle competition, the host clearly outlines what the problem is, what data is available, and what metric to optimize. Unfortunately, this is a poor representation of reality.
In the real world, you’re more likely to receive vague instructions, especially if you’re working with a team that is less familiar with data science. For example, you may be told to “quantify marketing campaigns” or “predict fraudulent transactions” without much more detail.
How to prevent it
Spend some time collecting information before jumping straight to the data. Make sure you can answer these questions before starting any sort of data science project:
- What business problem am I trying to solve?
- What kind of data science problem is this (e.g., binary classification vs. multi-class classification)?
- How is success defined and how can it be measured?
- Who are the domain experts, and who are the stakeholders?
- How will this model be used? Additionally, how will it be implemented in the business process or product?
By answering these questions, you’ll have a better understanding of what you’re trying to achieve, and you’ll be less likely to have to backtrack and redo certain steps.
2) Rushing to complete your data science projects.
This is related to the first point but also applies to all other stages in the machine learning lifecycle. In particular, there are two stages that many data scientists tend to rush: data exploration and model validation.
Data exploration
As with understanding the business problem, many data scientists tend to overlook data exploration. Going back to my Kaggle example, the host of a competition makes it very clear what each feature represents, and you don’t have to worry about how the data is generated or what data is out there, so little real “data exploration” is required in that context.
In reality, however, you’ll have to spend a lot more time exploring what data is available, how the data is generated and transformed, and what the general characteristics of the data are.
Important questions to answer include:
- What feature variables are available and what is the target variable?
- What does each feature represent?
- How is the data generated? What is the pipeline/wrangling process?
- What are the characteristics of the chosen data (# of observations, # of variables, time scale, data types, etc.)?
- Are there missing values? How should you deal with them?
- Are there any outliers? Why are they there?
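A few of these questions (size, data types, missing values, outliers) can be checked mechanically. The following is a minimal sketch, not a full exploration workflow: the toy records, the `profile` helper, and the choice of the 1.5 × IQR rule for outliers are all assumptions made for illustration. It uses only the Python standard library; in practice you would more likely load the data into a library such as pandas.

```python
import statistics

# Hypothetical toy records; None marks a missing value.
rows = [
    {"amount": 12.5, "country": "CA", "is_fraud": 0},
    {"amount": 8.0, "country": "US", "is_fraud": 0},
    {"amount": None, "country": "US", "is_fraud": 0},
    {"amount": 9.5, "country": None, "is_fraud": 0},
    {"amount": 11.0, "country": "CA", "is_fraud": 0},
    {"amount": 950.0, "country": "US", "is_fraud": 1},
]


def profile(rows, numeric_cols=()):
    """Summarize each column: row count, missing values, types, and outliers.

    Outliers are only computed for the columns named in numeric_cols,
    using the 1.5 * IQR rule (an illustrative choice, not the only one).
    """
    report = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "n_rows": len(values),
            "n_missing": len(values) - len(present),
            "types": sorted({type(v).__name__ for v in present}),
        }
        if col in numeric_cols:
            q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
            iqr = q3 - q1
            low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
            info["outliers"] = [v for v in present if v < low or v > high]
        report[col] = info
    return report


report = profile(rows, numeric_cols=("amount",))
for column, info in report.items():
    print(column, info)
```

A summary like this only tells you that a value such as 950.0 is unusual; finding out why it is there still means asking whoever owns the data pipeline.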
Model validation
Just as you would run several kinds of tests in software engineering (unit, integration, end-to-end, etc.), model validation lets you test your machine learning model. You want to make sure that you validate it thoroughly and that your validations are robust.
Here are a few points regarding model validation and how you can improve it:
- If you’re only splitting your dataset into a training and testing set, you should actually be splitting your dataset into a training set, a validation set, and a testing set. The purpose of the validation set is so that you can adjust your hyperparameters, leaving the test set for a final test.
- Taking it a step further, you can use techniques like k-fold cross-validation, which averages performance over several train/validation splits. This gives a more reliable estimate than a single split and makes overfitting to the training set easier to detect.
- In addition to cross-validation, you should also spend some time manually testing the model with your own inputs.
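The first two points can be sketched in a few lines. This is a minimal illustration using only the standard library; the helper names `train_val_test_split` and `k_fold_indices`, the 60/20/20 proportions, and the fixed seed are assumptions made for the example. In practice, scikit-learn's `train_test_split` and `KFold` cover the same ground.

```python
import random


def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of items and cut it into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every index is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, held_out


data = list(range(100))  # stand-in for 100 labeled examples
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))

# Cross-validate on everything except the untouched test set.
development = train + val
for train_idx, val_idx in k_fold_indices(len(development), k=5):
    print(len(train_idx), len(val_idx))
```

The test set is split off first and never enters the cross-validation loop, so it stays available for a single final check.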
3) Trying to use the most complicated model to solve your problem.
If you could build a model with an accuracy of 95% in three weeks vs a model with an accuracy of 99% in three months, which do you think most businesses would choose?
Generally speaking, they’re more likely to choose the first option. Why? Because model performance isn’t the only factor when building a model. There are also factors like time to implementation, ease of interpretability and maintainability, etc.
Therefore, if you can build a simple rules-based model with if/else statements that solves the problem at hand, why bother with a 10-layer neural network?
How to prevent it
In most cases, it’s better to build the simplest model first and then determine whether the benefits of the next-best alternative outweigh its costs. To give a simple framework, ask yourself these questions sequentially:
- Can my problem be solved with a simple Python script or SQL query?
- Can my problem be solved with a decision tree (if/else statements)?
- Can my problem be solved with a simple machine learning model, like linear regression or random forests?
If the answer is no for all three, then you can consider your 10-layer neural network.
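One way to make the "simplest model first" idea concrete is to write the if/else baseline and measure it before building anything bigger. The sketch below is purely illustrative: the rule, the 500 threshold, and the six labeled transactions are made up, and accuracy is used only because it is the metric in the example above.

```python
def rule_based_flag(transaction):
    """Hypothetical hand-written rule: flag large foreign transactions."""
    return transaction["amount"] > 500 and transaction["foreign"]


def accuracy(predict, examples):
    """Fraction of (input, label) pairs that predict gets right."""
    correct = sum(predict(x) == label for x, label in examples)
    return correct / len(examples)


# Made-up labeled transactions: (features, is_fraud).
examples = [
    ({"amount": 900, "foreign": True}, True),
    ({"amount": 20, "foreign": False}, False),
    ({"amount": 650, "foreign": False}, False),
    ({"amount": 40, "foreign": True}, False),
    ({"amount": 700, "foreign": True}, True),
    ({"amount": 300, "foreign": True}, True),  # the rule misses this one
]

baseline_accuracy = accuracy(rule_based_flag, examples)
print(f"Rule-based baseline accuracy: {baseline_accuracy:.2f}")
```

Any more complex model then has a number to beat, and the gap between the two is what you weigh against the extra time to build and maintain it.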
4) Trying to do it all on your own.
One of the biggest perks of being a data scientist is the amount of autonomy you’re given. But this can easily be a downfall if you’re not willing to seek advice, help, and feedback from others.
When understanding the business problem, you’ll have to talk to stakeholders to understand their needs. When exploring data, you’ll likely have to talk with domain experts to understand specific features and how they get ingested. And when validating your model, you’ll ideally want to get feedback from others.
How to prevent it
There’s no magic solution here, but learn to embrace ideas, opinions, and knowledge from other people. The more you know about the problem you’re facing, the more likely you are to succeed in solving it!
5) Not effectively communicating your methodology and insights.
This point encompasses two smaller points.
First, it can be hard for a data scientist to translate their knowledge and insights into terms that the average Joe can understand. Even so, you have to be able to explain technical jargon and modeling techniques in a way that non-technical people can follow. If you took the time to build a great model, you should take a bit more time to communicate it effectively so that people can recognize your hard work!
Second, as I mentioned earlier, data scientists generally have a lot of independence. However, you have to communicate constantly with stakeholders, keeping them in the loop on your thought process and any assumptions you make for the model, and getting their feedback. Otherwise, you may end up with a model that doesn’t solve the problem at hand.
How to prevent it
Take the time to break down your thought process into a more digestible form, like an article, a slide deck, or a web UI. The web UI option ties back to my previous point about seeking feedback: a library like Gradio can help with model interpretability and with getting feedback from non-technical stakeholders.
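As a rough idea of what a web UI for a model can look like, here is a minimal sketch that wraps a prediction function in a Gradio form. The `predict_fraud_risk` function is a hypothetical stand-in for a real trained model, and the labels and title are made up; the sketch assumes Gradio is installed (`pip install gradio`).

```python
def predict_fraud_risk(amount, is_foreign):
    """Hypothetical stand-in for a trained model's predict function."""
    risky = amount > 500 and is_foreign
    return "High risk" if risky else "Low risk"


def build_demo():
    """Wrap the predict function in a small web form."""
    import gradio as gr  # imported here so the function above works without Gradio

    return gr.Interface(
        fn=predict_fraud_risk,
        inputs=[
            gr.Number(label="Transaction amount"),
            gr.Checkbox(label="Foreign transaction"),
        ],
        outputs=gr.Textbox(label="Model output"),
        title="Fraud risk demo",
    )


# To try it locally, run: build_demo().launch()
```

Stakeholders can then type in their own cases and see what the model returns, which tends to surface mismatched assumptions much faster than a slide of metrics.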
this article has been retrieved from https://towardsdatascience.com/five-bad-habits-every-data-scientist-should-avoid-d2099a16b978