While the summer is summering outside, I intend to take full advantage of the sunshine and explore some of my usual spots for any new observations I may have missed during the gloomy days of winter.
One of my favorite pastimes is noticing weird little changes in everyday things, like a neighbor's garden or the arrangement of parking spots, while I am out on my daily walks and runs.
You might think it is weird, and I would agree it is a weird pastime, bordering on stalking or loitering, but it is not just about noticing the change; it is also about spotting a trend or an insight, and that is what makes it intriguing for me.
Collecting data, analyzing it, and coming up with a hypothesis is not so different from this pastime, and maybe that is why I am so intrigued and excited whenever I get my hands on a new dataset.
There is something new to learn and uncover, and who doesn't like a treasure hunt for unexplored treasure?
Data truly is my treasure. That might be an exaggeration on my part, or maybe not.
Why am I telling you this? Well, if you are running an organization, be it building products or providing services, you already know the power of data. Heck, the constant chatter around AI is enough to get us all intrigued about data.
I am sure you are asking a lot of questions to make sense of it all, and in this post I will try to summarize those questions and their answers, which might give you an idea of how to get started with data.
Let’s dig into it -
Where to find this elusive data that everyone seems to be talking about now?
Treasures are always found in the last place we look, and the last place we look is at home.
For organizations, making sense of the data produced internally is often the last priority, if it is a priority at all. For some organizations, capturing and storing that data sensibly is not straightforward either.
Organizations can collect data in several different ways. Generally, the data collected can be categorized as structured (employee timesheets, product sales reports, etc.) or unstructured (performance reviews, customer feedback, etc.). This data can be stored in many different ways as well: crudely in Excel sheets, .csv, or text files, or in more sophisticated ways such as databases, data warehouses, and/or data lakes.
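As a quick illustration, here is a minimal sketch of pulling the same kind of data out of two common storage options, a .csv file and a SQL database. The file name, database, and table name here are assumptions made up for the example, not part of the dataset used later in this post.
import pandas as pd
import sqlite3
# Structured data stored crudely in a .csv file (hypothetical file name)
sales_from_csv = pd.read_csv("product_sales.csv")
# The same kind of data stored in a database (hypothetical SQLite database and table)
conn = sqlite3.connect("company_data.db")
sales_from_db = pd.read_sql("SELECT * FROM product_sales", conn)
conn.close()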
For example, for the purpose of this article I wanted to look up data on the recent layoffs in the tech industry and see how they compare to the layoff trends of past years. There are many options, but I was able to find some relevant data here -
https://www.kaggle.com/datasets/ulrikeherold/tech-layoffs-2020-2024/data
In general, Kaggle is a good source for datasets on a wide variety of topics.
The question here is what to do with all the data that I have now.
What to do with the data, and how to make it answer your questions?
The process described above is called data collection and ingestion. To work with any dataset, internal or external, there needs to be a robust collection and ingestion process.
In our example, I will download the dataset in a format that is easy to work with; in this case, a .csv will work nicely for us.
Data Ingestion
I am using Jupyter Notebook and Python for the data transformation, along with the libraries below to make my tasks seamless.
import pandas as pd               # data loading and manipulation
import matplotlib.pyplot as plt   # plotting
import seaborn as sns             # statistical visualizations
import numpy as np                # numerical helpers
Each of these libraries serves a different purpose. For data ingestion, I will use pandas to read the .csv file and store the data in a DataFrame.
layoffs = pd.read_csv('location of the file/file.csv')
The next step is to take a first look at the data that was imported into the DataFrame. I will typically look at the column names and their types to see whether any conversions are required, so that we are working with the relevant data types.
layoffs.info()
layoffs.head()
The output shows that the DataFrame has 1,672 entries, along with the names of all the columns and their data types.
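If the info() output shows the layoff date stored as a plain object/string rather than a datetime, a typical conversion would look like the sketch below. This assumes the column is named Date_layoffs, as it appears later in this post; whether the conversion is needed depends on how the file parses.
# Convert the layoff date column to a proper datetime type (only if it was read in as strings)
layoffs["Date_layoffs"] = pd.to_datetime(layoffs["Date_layoffs"], errors="coerce")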
Data Cleaning & Standardization
The next step is to find out whether there are any null values. Null values will skew the data representation when we get to visualization, or even to the model-training part of building AI.
layoffs.isnull().sum()
As we can see, there are quite a few null values in several columns, and these need to be dealt with.
layoffs_cleaned = layoffs.dropna()
print(layoffs_cleaned)
The dropna() method drops every row that contains a null value in any of the columns. I am storing the cleaned data in a new DataFrame and will use it for further processing.
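Dropping rows is not the only option; depending on the analysis, it can be preferable to fill the missing values instead. Here is a small sketch of that alternative; the fill values are illustrative assumptions, not part of the workflow used in the rest of this post.
# Alternative: keep the rows and fill the gaps instead of dropping them
layoffs_filled = layoffs.copy()
layoffs_filled["Laid_Off"] = layoffs_filled["Laid_Off"].fillna(0)        # assume no reported layoffs means zero
layoffs_filled["Country"] = layoffs_filled["Country"].fillna("Unknown")  # placeholder for missing countries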
Data Transformation
The next step is to remove or alter any columns that may not be useful for the analysis. In this case, I am getting rid of the lat and lng columns, which provide latitude and longitude information that is not necessary for our analysis, along with the # index column.
layoffs_cleaned.drop(columns=["lat", "lng", "#"], inplace=True)
Now that we have the dataset in a state we can work with, I will try to visualize it to make sense of it. The first step is to create a line graph to see how the layoff counts relate to the layoff dates.
# relplot is a figure-level function, so size the plot via height/aspect rather than plt.figure()
sns.relplot(data=layoffs_cleaned, x="Date_layoffs", y="Laid_Off", kind="line", errorbar=None, height=6, aspect=2)
plt.xticks(rotation=45)
plt.gca().xaxis.set_major_locator(plt.MaxNLocator(12))
plt.show()
The graph gives a good overview of the dataset, and we can easily see that layoffs peaked during the initial post-COVID years and then steadily came down to more normal rates in 2024.
We can do a similar representation of the data and compare the laid-off counts against the size of the companies. I will use a scatter plot to show how the affected headcount clusters by company size.
sns.relplot(x="Company_Size_before_Layoffs", y="Laid_Off", data=layoffs_cleaned, kind="scatter", hue = "Company_Size_before_Layoffs")
plt.xticks(rotation=45)
plt.show()
The plot shows that people were laid off at a fairly consistent rate in startups and scale-ups, but the counts were much higher at larger, more established companies. Based on the data, we can say that the number of people laid off is predominantly higher at bigger firms.
Data Analysis
We can also look at the count of people laid off per country.
# Total layoffs per country, keeping the ten largest; reset_index keeps Country as a column for plotting
toplayoffs_Country = layoffs_cleaned.groupby("Country").agg({"Laid_Off": "sum"}).nlargest(10, "Laid_Off").reset_index()
toplayoffs_Country
We can also represent this data in a bar graph -
sns.set(style="whitegrid")
sns.set_palette("bright")
plt.figure(figsize=(12, 6))
sns.barplot(data = toplayoffs_Country, x ="Country", y = "Laid_Off", dodge=True)
plt.xlabel("Country")
plt.ylabel("Number of Layoffs")
plt.title("Top 10 countries with Highest Layoffs per Year (2020-2024)")
plt.tight_layout()
plt.show()
The graph is a visual representation of the data table above. It shows that the layoffs were overwhelmingly concentrated in the USA, India, and Germany.
Data Loading
Finally, the data looks good for further analysis. I will either export it as a table to be loaded into a database or export a new .csv that can be consumed by anyone who wants to work with the data.
layoffs_cleaned.to_csv('cleaned_layoff_data.csv', index=False)
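If the database route is preferred instead, a rough sketch using pandas and SQLAlchemy might look like the following; the SQLite file and table name are assumptions made for illustration.
from sqlalchemy import create_engine
# Hypothetical local SQLite database and table name, just for illustration
engine = create_engine("sqlite:///layoffs.db")
layoffs_cleaned.to_sql("layoffs_cleaned", engine, if_exists="replace", index=False)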
As we saw, this is just a simple and crude way to work with the dataset available to us. In a real-world scenario, we will be dealing with more complex datasets and formats that require more sophisticated solutions.
What does this all mean, and where do I go from here?
Once we have a dataset that is cleaned and usable for further analysis, we can move ahead with different types of analysis.
Exploratory Data Analysis
In exploratory data analysis, we go deeper into understanding the data and perform statistical analysis such as calculating medians, means, and standard deviations. We also create more visualizations to find correlations between different columns, as we did with the Country and Laid_Off columns in the example above.
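As a small example of those summary statistics, run on the cleaned DataFrame from earlier in this post:
# Quick summary statistics for the layoff counts
print(layoffs_cleaned["Laid_Off"].agg(["mean", "median", "std"]))
# Or a broader overview of all numeric columns
print(layoffs_cleaned.describe())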
Data Modeling
We can apply statistical models to perform predictive analysis and identify new patterns. Depending on our goals, we can use regression or classification algorithms and train them on the dataset for better insights.
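For instance, here is a minimal regression sketch with scikit-learn, assuming we try to relate company size to the number of people laid off and that the company size column is numeric; this is purely illustrative, not a claim that the relationship is strong.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Illustrative feature/target choice: predict layoffs from pre-layoff company size
X = layoffs_cleaned[["Company_Size_before_Layoffs"]]
y = layoffs_cleaned["Laid_Off"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("R^2 on the test set:", model.score(X_test, y_test))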
Reporting and Deployment
We can also import the dataset into Tableau or other dashboarding applications to create visuals that bring clarity to business decisions.
The models developed using this dataset can be deployed to production and, with the help of an API, made available to other systems in the organization for real-time use and decision-making.
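As a rough sketch of what that could look like, here is a tiny, hypothetical Flask endpoint wrapping the regression model from the sketch above; the route name and payload shape are assumptions for illustration, not part of the original workflow.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"company_size": 5000} (hypothetical contract)
    company_size = request.get_json()["company_size"]
    prediction = model.predict([[company_size]])[0]
    return jsonify({"estimated_layoffs": float(prediction)})

if __name__ == "__main__":
    app.run(port=5000)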
Conclusion
For an organization to be successful, a data engineering team needs to not only know how to work with the data but also have a clear business objective attached to every data engineering project.
With the help of newer AI models like ChatGPT and Claude, we can work with data more efficiently while using proven techniques to extract better value at a faster pace.
If you liked the above post or have something to add to it, please consider dropping a comment or reaching out to us. Feedback is always welcome.