Standardize to Optimize | Simple and Effective Ways to Clean Your Data
“There are missing values in the dataset, what do you want me to do with it?” “Delete them.” – Said no one ever.
Data is the lifeblood of modern decision-making, but if it's messy, well... you're in trouble. Sometimes, when the dataset is small and missing values can be safely removed without affecting your analysis, sure, go ahead and delete them. You'll save yourself a headache later. But more often than not, data cleaning is about so much more than simply deleting missing values. It's the essential first step that can make or break your analysis, especially in the world of machine learning and AI.
So, grab a coffee and get comfy, because today we're diving headfirst into the wonderful (and sometimes infuriating) world of data cleaning. Whether you're new to data analysis or working toward becoming the next analytics guru, cleaning your dataset is non-negotiable. It's the foundation for confident decision-making.
Here are the techniques you need to know to standardize your data cleaning methods using Python. They can serve as a good starting point for your own workflow. Keep in mind that they assume you already know basic DataFrame syntax and some simple transformations.
1. Remove Duplicate Data
Imagine you’re working at an e-commerce company, and you’ve got two records for the same customer due to a typo in their name. These duplicates mess up your customer segmentation and potentially lead to them getting two of everything in your next email campaign (oops!).
df.drop_duplicates(inplace=True)
Simple, effective, and now your marketing team won’t think you’re trying to win over customers with an army of emails.
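A quick sketch of how this plays out, using a made-up set of customer records (the names and emails here are hypothetical). Note that an exact-row `drop_duplicates()` would miss a typo'd name, so deduplicating on a reliable key column like the email is often the safer move:

```python
import pandas as pd

# Hypothetical customer records: the same person was entered twice,
# once with a typo in their name.
df = pd.DataFrame({
    "name": ["Jon Smith", "Jon Smtih", "Ana Lee"],
    "email": ["jon@shop.com", "jon@shop.com", "ana@shop.com"],
})

# Dedup on the email column and keep the first occurrence of each customer.
df = df.drop_duplicates(subset="email", keep="first").reset_index(drop=True)
print(len(df))  # 2 unique customers remain
```

The `subset` parameter is what saves you here: without it, the two "Jon" rows differ by one character and both survive.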
2. Eliminate Unnecessary Data
More data doesn't always mean better insights. Imagine you're building a model for a bank to predict loan defaults. Do you think the column with customers' nicknames or the exact minute they created their account will help? Nope... or will it? Guess it's time for another sleepless night until I answer that question.
df.drop(['nickname', 'signup_time'], axis=1, inplace=True)
If you are wondering, axis=1 specifies that you're dropping columns (rather than rows), and inplace=True modifies the existing DataFrame instead of returning a new one. Cutting out the noise keeps your analysis focused, and let's be honest, no one needs to know that a client goes by "Shneedz A. Penny."
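Here's a minimal, runnable version of that idea on a toy loan dataset (column names are hypothetical). One small upgrade worth knowing: passing columns= reads more clearly than axis=1, and errors="ignore" keeps the call from blowing up if a column was already dropped in an earlier run:

```python
import pandas as pd

# Hypothetical loan data with two noise columns mixed in.
df = pd.DataFrame({
    "income": [52000, 61000],
    "defaulted": [0, 1],
    "nickname": ["Shneedz", "Penny"],
    "signup_time": ["09:41", "17:03"],
})

# columns= is equivalent to axis=1; errors="ignore" makes the drop idempotent.
df = df.drop(columns=["nickname", "signup_time"], errors="ignore")
print(list(df.columns))  # ['income', 'defaulted']
```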
3. Ensure Consistency
Healthcare is one of those areas where inconsistent data can be a nightmare. Patient records might have birthdates written in various formats or medications listed in lowercase one day and uppercase the next. Consistency isn’t just nice, it’s critical.
df['birth_date'] = pd.to_datetime(df['birth_date'], errors='coerce')
df.columns = df.columns.str.lower()
Now let me break the code above down for you a bit. errors='coerce' does not throw an error on dates it cannot parse; it quietly converts them to NaT (pandas' "not a time" marker) instead of crashing. That makes it a great way to surface the flaws in a dataset, since you can then filter for those missing values and inspect the rows that failed. The same option works with the pd.to_numeric() function to clean up other columns you might have. The second line, ending in .lower, brings every column name down to a single consistent format, lowercase in this instance. I prefer lowercase, but you can also use .upper, although it would seem like all the lines of data are shouting at you.
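To make the NaT behavior concrete, here's a small sketch with made-up patient records, one of which has an unparseable birthdate. After coercion you can isolate the bad rows instead of losing the whole load:

```python
import pandas as pd

# Hypothetical patient records; the last birthdate is garbage.
df = pd.DataFrame({"Birth_Date": ["1990-03-14", "1985-07-22", "not a date"]})

# Standardize column names, then parse dates; bad values become NaT.
df.columns = df.columns.str.lower()
df["birth_date"] = pd.to_datetime(df["birth_date"], errors="coerce")

# Filter for the rows that failed to parse so you can investigate them.
bad_rows = df[df["birth_date"].isna()]
print(len(bad_rows))  # 1
```

That `isna()` filter is the payoff of coercing rather than raising: you get a clean datetime column and a short list of records to go fix at the source.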
4. Convert Data Types
How many times have you come across a situation where numbers are stored as text in a dataset? In the world of autonomous vehicles, for instance, if your speed readings are stored as text, you can’t run calculations on them to predict real-time hazards. What’s a car without a sense of speed? It's basically a very expensive couch on wheels.
df['speed'] = df['speed'].astype(float)
The code here converts the speed readings to a numerical type so you can actually run calculations on them. This keeps your car on the road (figuratively and literally), ensuring your machine learning model doesn't crash before the vehicle does.
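One caveat worth a sketch: astype(float) raises immediately if even one value can't be converted. For messy real-world columns, pd.to_numeric with errors="coerce" (the same trick from the consistency section) is often the more forgiving route. The sensor values below are hypothetical:

```python
import pandas as pd

# Hypothetical sensor log: speeds arrived as strings, one of them corrupt.
df = pd.DataFrame({"speed": ["62.5", "70.1", "sensor error"]})

# astype(float) would raise on "sensor error"; to_numeric flags it as NaN.
df["speed"] = pd.to_numeric(df["speed"], errors="coerce")
print(df["speed"].mean())  # mean of the two valid readings
```

If you're confident the column is fully clean, `astype(float)` is fine; coercion is for when you'd rather quarantine the bad readings than crash mid-pipeline.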
5. Clear Formatting
Some datasets have all sorts of formatting issues: extra spaces, weird characters, you name it. These little errors are often invisible to the naked eye but can wreak havoc when you try to feed them into your models.
df.columns = df.columns.str.strip()
By using '.strip' at the end, you'll have clean, tidy column names. This can help you keep calm and all your hair intact during data visualization or any other form of modelling. The number of times I have seen multiple entries that appear identical during my data visualization endeavors is just... let's not go there. Moving on.
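One thing to note: the snippet above only strips the column headers. The values inside string columns often carry the same stray whitespace, which is exactly what produces those "identical-looking" duplicate entries in a chart. A minimal sketch, with hypothetical city data, that strips both:

```python
import pandas as pd

# Hypothetical data: stray spaces in both the header and the values,
# so "London" appears to exist twice.
df = pd.DataFrame({" city ": ["  London", "Paris  ", "London "]})

# Strip whitespace from the column names...
df.columns = df.columns.str.strip()

# ...and from every string-typed column's values.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].str.strip()

print(df["city"].nunique())  # 2 distinct cities, not 3
```

Without the value-level strip, a groupby or a bar chart would happily treat "London" and "London " as two different cities.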