Making sense of the data in its most original form
Modern tech legends say that 70% of the earth is covered by water and the rest is all data points.
Don’t ask me who these legends are, but if you had to take a wild guess, you could narrow it down to the select few who profit off those data points. If you want to take a wild, wild guess. Really, only if.
This is not to say that we cannot be more aware of these data points and learn something in the process. So let’s learn about the quintessence of a successful AI system: data.
Data is fluid and ever-evolving
The first and most trusted fact about data is that it is ever-evolving.
Let’s take the example of a typical organization that deals with content creators and their content: a social media platform. These platforms function in the attention economy by generating a constant churn of content, or data, which is consumed by millions, if not billions, of people.
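To give a rough estimate: the USA and EU are said to produce around 500 terabytes of data daily, while India produces somewhere around 600 terabytes. These figures are unsourced and best read as illustrative; real volumes are very likely far larger. This data can be in text, video, audio, or any other format.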
How do these social media giants manage these quantities of data? Where does this data go?
To answer these questions, consider how water is stored on earth and consumed by us. We don’t draw on the oceans for our daily water usage; ponds and lakes provide for the water needs of a typical habitat.
Data, in its varied forms, can be consumed only if it is stored in a manageable way. Large social media organizations manage their terabytes of data by creating mini data lakes.
What is a Data Lake?
By Definition - “A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure it, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning—to guide better decisions.”
The term was coined in 2010 by James Dixon, then the CTO of Pentaho.
The term data lake refers to data kept in its natural, raw form, so that the kind of processing it needs can be worked out later for varied purposes.
If you think of a typical lake, this term makes perfect sense. It is a stream of information in its purest form, stored in one body and then treated according to the means and purpose of consumption.
How to create a Data Lake?
Let’s take a typical example of pulling a dataset from a website like Kaggle.
If you are not aware of Kaggle, you can sign up at kaggle.com and explore numerous datasets on various topics to learn and understand the world.
As is usual in the tech world, to communicate with any application or website you need to know its API details and usually sign up for a token, which you use to authenticate with the source before accessing the dataset.
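For Kaggle specifically, the official Python client picks up credentials either from the KAGGLE_USERNAME and KAGGLE_KEY environment variables or from a kaggle.json token file in the .kaggle folder of your home directory. Here is a minimal sketch that checks which of the two is present before we try to authenticate. The helper name kaggle_credentials_source is hypothetical, and it only checks that credentials exist, not that Kaggle accepts them.

```python
import os
from pathlib import Path


def kaggle_credentials_source(env=None, home=None):
    """Return where Kaggle credentials were found: 'env', 'file', or None.

    Hypothetical helper for illustration; it does not validate the
    credentials against Kaggle.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: both environment variables are set.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "env"

    # Option 2: the token file downloaded from the Kaggle account page.
    if (home / ".kaggle" / "kaggle.json").is_file():
        return "file"

    return None


if __name__ == "__main__":
    print(kaggle_credentials_source())
```

If this prints None, authentication will most likely fail, so it is worth setting up the token first.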
Here is sample Python code to search for datasets on Kaggle:
from kaggle.api.kaggle_api_extended import KaggleApi

# Authenticate with the credentials configured for the Kaggle API
api = KaggleApi()
api.authenticate()

# List datasets matching a search term
datasets = api.dataset_list(search="Summer Olympics Medals")
for dataset in datasets:
    print(dataset.ref)
The above code lists the datasets on Kaggle that match the search term, here datasets about medals won at the Summer Olympics. Note that it prints a single page of results rather than every match.
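If one page is not enough, the search can be repeated page by page. The sketch below assumes the client's dataset_list method accepts a page argument alongside search; the function name list_all_dataset_refs and the max_pages cap are hypothetical and only there to keep the loop bounded.

```python
def list_all_dataset_refs(api, search, max_pages=5):
    """Collect dataset refs across several result pages.

    `api` is any object with a Kaggle-style
    `dataset_list(search=..., page=...)` method.
    `max_pages` is an illustrative safety cap, not a Kaggle setting.
    """
    refs = []
    for page in range(1, max_pages + 1):
        results = api.dataset_list(search=search, page=page)
        if not results:  # an empty page means there is nothing left
            break
        refs.extend(dataset.ref for dataset in results)
    return refs
```

With an authenticated client this could be called as list_all_dataset_refs(api, "Summer Olympics Medals").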
Downloading the dataset from Kaggle
In an ideal world, there will be a host of different data sources that need to be connected together, like a hive, to create a data lake. We can call them different streams of data. The data might be structured or unstructured.
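To make the idea of streams a little more concrete, here is a small sketch that labels incoming files as structured or unstructured based on their extension. The extension lists are my own illustrative assumption, not a standard, and the name classify_stream is hypothetical.

```python
from pathlib import Path

# Illustrative mapping only; extend it for the sources you actually ingest.
STRUCTURED = {".csv", ".tsv", ".parquet"}
UNSTRUCTURED = {".txt", ".pdf", ".jpg", ".png", ".mp3", ".mp4"}


def classify_stream(filename):
    """Label a file as 'structured', 'unstructured', or 'unknown' by extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in STRUCTURED:
        return "structured"
    if suffix in UNSTRUCTURED:
        return "unstructured"
    return "unknown"
```

A real lake would look at more than the file name, but even a rough label like this helps decide how each stream should be handled later.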
In our example, we are going to download a sample dataset from Kaggle:
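import logging
import os

# Method of the data lake class. It assumes self.api is an authenticated
# KaggleApi instance and self.base_path is the root folder of the lake.
def download_dataset(self, dataset_ref):
    try:
        # Store each dataset under <base_path>/raw/<owner>_<dataset>
        dataset_path = os.path.join(self.base_path, 'raw', dataset_ref.replace('/', '_'))
        os.makedirs(dataset_path, exist_ok=True)
        self.api.dataset_download_files(dataset_ref, path=dataset_path, unzip=True)
        logging.info(f"Downloaded dataset: {dataset_ref}")
        return dataset_path
    except Exception as e:
        logging.error(f"Error downloading dataset {dataset_ref}: {str(e)}")
        return None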
The above code snippet is a Python method that takes a Kaggle dataset reference (in the form owner/dataset-name), downloads the files into the raw folder under the project’s base path, and returns that folder’s path, or None if the download fails.
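Since the method relies on self.api and self.base_path, it helps to see where those could come from. Below is a minimal sketch of a surrounding class. The class name KaggleDataLake, the ZONES tuple, and the raw_path helper are all hypothetical names of mine; the API client is passed in from outside so the class itself does not need Kaggle credentials to be constructed.

```python
import os


class KaggleDataLake:
    """Hypothetical container for the methods shown in this article."""

    ZONES = ("raw", "processed")

    def __init__(self, base_path, api):
        self.base_path = base_path
        self.api = api  # e.g. an authenticated KaggleApi instance
        # Create one folder per zone under the lake's root folder.
        for zone in self.ZONES:
            os.makedirs(os.path.join(base_path, zone), exist_ok=True)

    def raw_path(self, dataset_ref):
        # 'owner/dataset-name' becomes 'owner_dataset-name' on disk
        return os.path.join(self.base_path, "raw", dataset_ref.replace("/", "_"))
```

Passing the client in, rather than creating it inside the class, also makes it easy to swap in a fake client while trying things out.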
The different states of data in Data Lake
As mentioned earlier, a data lake can draw on different data sources with different types of data. In our example, we downloaded a sample dataset for which we have a general idea of the contents but do not know the actual format of the stored data. We can call this kind of data “raw” data.
Raw Zone
The initial step of getting our hands on the data and ingesting it in its raw, unprocessed state is typically called the Raw Zone. It is in this zone that the data is stored purely in its original state for analysis and auditing.
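One simple way to support that auditing goal is to record a checksum for every raw file, so that any later change to the "original" data can be detected. This is a sketch of the idea under my own assumptions, not something the Kaggle client does for us; the function names build_checksum_manifest and changed_files are hypothetical.

```python
import hashlib
import os


def build_checksum_manifest(raw_dir):
    """Map each file under raw_dir (relative path) to its SHA-256 digest."""
    manifest = {}
    for root, _dirs, files in os.walk(raw_dir):
        for name in sorted(files):
            full_path = os.path.join(root, name)
            digest = hashlib.sha256()
            with open(full_path, "rb") as f:
                # Read in 1 MB chunks so large files do not fill memory.
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(full_path, raw_dir)] = digest.hexdigest()
    return manifest


def changed_files(old_manifest, new_manifest):
    """Return paths that were added, removed, or whose digest differs."""
    all_paths = set(old_manifest) | set(new_manifest)
    return sorted(p for p in all_paths if old_manifest.get(p) != new_manifest.get(p))
```

Building the manifest right after download and comparing it again before processing gives a cheap check that the Raw Zone really is untouched.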
The more we familiarize ourselves with the data, the more information we will gather on the state of the data. In most cases, we will want to slightly tweak the dataset for our specific needs.
import logging
import os

import numpy as np
import pandas as pd
from scipy import stats

def process_dataset(self, dataset_path):
    try:
        # Mirror the raw folder name under <base_path>/processed
        dataset_name = os.path.basename(os.path.normpath(dataset_path))
        processed_path = os.path.join(self.base_path, 'processed', dataset_name)
        os.makedirs(processed_path, exist_ok=True)
        for file in os.listdir(dataset_path):
            if file.endswith('.csv'):
                df = pd.read_csv(os.path.join(dataset_path, file))
                # Data cleaning
                df = self.clean_data(df)
                # Data quality check
                if self.check_data_quality(df):
                    # Save processed data (to_parquet needs pyarrow or fastparquet installed)
                    df.to_parquet(os.path.join(processed_path, f"{file[:-4]}.parquet"))
                    logging.info(f"Processed and saved: {file}")
                    # Update metadata
                    self.update_metadata(file, df)
                else:
                    logging.warning(f"Data quality check failed for {file}")
    except Exception as e:
        logging.error(f"Error processing dataset: {str(e)}")

def clean_data(self, df):
    # Remove duplicates (copy so the assignments below modify our own frame)
    df = df.drop_duplicates().copy()
    # Handle missing values
    df['name'] = df['name'].fillna('Unknown')
    df['sport'] = df['sport'].fillna('Unknown')
    # Convert age to numeric, stripping any stray '$' or ',' characters
    df['age'] = df['age'].replace(r'[\$,]', '', regex=True).astype(float)
    # Convert finish_time to datetime; unparseable values become NaT
    df['finish_time'] = pd.to_datetime(df['finish_time'], errors='coerce')
    # Remove outliers: keep rows whose age is within 3 standard deviations of the mean
    df = df[np.abs(stats.zscore(df['age'], nan_policy='omit')) < 3]
    return df
In the above code snippet, we clean the data and add basic checks to make sure it is in a usable format. In addition, the data needs to be accurate for any kind of analysis, to avoid false positives. This can be done with a host of data quality checks.
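One design choice worth considering for such checks is to return the list of reasons a dataset failed, instead of a bare True or False, so that a failure explains itself in the logs. Here is a minimal sketch in plain Python, working on rows as dictionaries. The function name quality_report and its parameters are hypothetical.

```python
def quality_report(rows, required_columns, min_rows=1):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Check for a minimum number of rows.
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, found {len(rows)}")

    # Check that every required column appears in at least one row.
    present = set()
    for row in rows:
        present.update(row)
    missing = [col for col in required_columns if col not in present]
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")

    return problems
```

A caller can then log each problem and treat an empty list as a pass.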
Cleaned Zone
import json
import os
from datetime import datetime

def check_data_quality(self, df):
    # Check for minimum number of rows
    if len(df) < 1000:
        return False
    # Check for required columns
    required_columns = ['id', 'name', 'age', 'sport', 'finish_time']
    if not all(col in df.columns for col in required_columns):
        return False
    # Check for a plausible age range (the upper bound is illustrative; adjust it for your dataset)
    if df['age'].min() < 0 or df['age'].max() > 50:
        return False
    return True

def update_metadata(self, filename, df):
    metadata_path = os.path.join(self.base_path, 'metadata.json')
    # Load the existing metadata, if any, so earlier entries are preserved
    metadata = {}
    if os.path.exists(metadata_path):
        with open(metadata_path, 'r') as f:
            metadata = json.load(f)
    # Record basic facts about the processed file
    metadata[filename] = {
        'rows': len(df),
        'columns': list(df.columns),
        'last_updated': datetime.now().isoformat()
    }
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2)
Once all the required checks and corrections are done, the data moves from the “raw” state to a new state called the “processed” state. As we process the data, we also need to start building information about the data. In this case, we create a metadata.json file that records the row count, column names, and last-updated time of each processed file.
The stage in which data is standardized, with metadata added to make sense of all the information, is called the Cleaned Zone. The data is processed to remove duplicates, handle missing values, and correct any errors, and is then stored in a more efficient format such as Parquet, as shown in the code snippets above.
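The metadata file becomes useful once we start asking questions of it. As a small sketch, assuming the metadata.json layout written by update_metadata (one entry per file name, each with a columns list), the hypothetical helper below finds every processed file that contains a given column.

```python
import json


def datasets_with_column(metadata_path, column):
    """Return the file names in metadata.json whose column list includes `column`."""
    with open(metadata_path, "r") as f:
        metadata = json.load(f)
    return sorted(
        name for name, info in metadata.items()
        if column in info.get("columns", [])
    )
```

For example, datasets_with_column("metadata.json", "age") would list the files that can take part in an age-based analysis.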
Curated Zone
The data in the processed state can be used for different purposes such as machine learning, data analysis, and visualization. This also means that the data will be further cleaned and processed.
The data can also be refined further to meet specific business use cases, and can be combined with other cleaned data to derive meaningful analytics and reporting. This stage of data is called the Curated Zone, since the data is processed for the business use cases of different domains.
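As an illustration of what "curated" could mean for our Olympics example, here is a sketch that turns cleaned rows into a medal tally per country, the kind of table a report or dashboard would consume. The country and medal column names are assumptions of mine and may not match the dataset you download; the function name medals_per_country is hypothetical.

```python
from collections import Counter


def medals_per_country(rows, country_key="country", medal_key="medal"):
    """Count medals per country, most medals first.

    `rows` is a list of dictionaries; rows with an empty or missing
    medal value are skipped.
    """
    counts = Counter()
    for row in rows:
        if row.get(medal_key):
            counts[row[country_key]] += 1
    return counts.most_common()
```

The same aggregation could be written with a pandas groupby; plain Python is used here only to keep the sketch dependency-free.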
Why do we need these Zones?
As the data moves from one stage to another, from the raw state to the processed state, the involved parties can mark the data as trusted and refined for each business use case. To make this process easy, standards need to be put in place for data governance, data quality, compliance, and optimization.
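One way to picture such a standard is as a simple rule: data may only move to the next zone, one step at a time, and only when its checks have passed. The sketch below encodes that rule. The zone names follow this article, while the ZONE_ORDER list and the promote function are hypothetical.

```python
ZONE_ORDER = ["raw", "cleaned", "curated"]


def promote(current_zone, checks_passed):
    """Return the next zone if the checks passed, otherwise stay where we are."""
    if current_zone not in ZONE_ORDER:
        raise ValueError(f"unknown zone: {current_zone}")
    index = ZONE_ORDER.index(current_zone)
    # Data in the last zone, or data that failed its checks, does not move.
    if not checks_passed or index == len(ZONE_ORDER) - 1:
        return current_zone
    return ZONE_ORDER[index + 1]
```

Keeping the rule in one place means nobody can skip from raw straight to curated by accident.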
Organizations that work with data lakes will have to deal with these zones in order to work with the data in a cost-optimized way.
Conclusion
A data lake is a natural way to store and work with data in its original state.
When dealing with large chunks of data or big data, there needs to be a robust understanding of the business requirements, objectives, and goals of the actors involved in the usage of the data.
A data lake provides a starting point for organizations to build data literacy and eventually build a data warehouse, in order to create in-depth analysis and make data-driven decisions that achieve strategic goals.
This article intends to give a small taste of a simple way to implement a data lake for specific use cases, and to further refine the understanding of the data in use for more robust needs.