Data Made Easy: A Comprehensive Guide for Beginners
A must read for all data enthusiasts!
Table of contents
- So, what is data?
- Isn’t data a simple concept, why should we learn more about it?
- Different types of data
- Some special data types
- How a machine reads data?
Data is all around us, from the information we process every day to the records that businesses collect over the years and use to make informed decisions.
But understanding the fundamentals of data itself and then utilizing it can be a daunting task, especially for beginners.
That's where this comprehensive guide comes in. We'll break down the concept of data at its most fundamental level, giving you the tools and techniques you need to handle it like a pro.
So let's dive in and make data easy!
So, what is data?
💡 All information essentially can be classified as data.
It can come in multiple different forms, shapes, and sizes. It can be in the form of numbers, text, images, videos, and much more.
Isn’t data a simple concept, why should we learn more about it?
By now we know that all information is data. And from our discussion on statistics, we also know that
💡 Data lies at the heart of any analytical solution. Without data, there are no statistics. And without statistics, there is no analysis.
The first and most crucial step of solving any problem, be it statistics, analytics, data science, machine learning, etc., is to understand the data at hand.
Different types of data
We have seen that data comes in different forms. There are also a few different ways of classifying data, and each serves a specific purpose.
Let’s go over the most popular types of data and see how they are classified.
The two major types of data are:
1 — Unstructured data
💡 As the name suggests, this type of data does not follow a predefined structure or data model.
Some of the popular examples are images, heatmaps, videos, spatial data, graph data, text documents, etc.
Unstructured data is not easily interpretable by machines, even when a human can make sense of it at a glance. A machine needs specialized techniques to process this data.
By now, you must have realized that this type of data is a bad fit for traditional relational databases, such as MySQL or PostgreSQL, which are queried with SQL.
2 — Structured data
💡 On the other hand, this type of data can be organized into a defined structure (as the name suggests).
Structured data is more commonly used in industrial settings. One of its most widely used forms is the 2-dimensional data structure, that is, the humble table, also known as rectangular data.
A simple table capturing exam results of different students
I'm sure you can think of enough use cases from your own life where you have used Excel spreadsheets to store some information.
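To make the idea of rectangular data concrete, here is a minimal Python sketch of a small table. The student names, marks, and grades are made up for illustration and are not taken from the table above.

```python
# Hypothetical exam results: each dict is one row, each key is one column.
exam_results = [
    {"student": "Asha", "marks": 91, "grade": "A"},
    {"student": "Ben", "marks": 78, "grade": "B"},
    {"student": "Chen", "marks": 42, "grade": "F"},
]

# Rectangular data: every row has exactly the same set of columns.
columns = list(exam_results[0].keys())
is_rectangular = all(list(row.keys()) == columns for row in exam_results)

print(columns)
print(len(exam_results), "rows x", len(columns), "columns")
print("rectangular:", is_rectangular)
```

A spreadsheet or a database table follows the same idea: a fixed set of columns, and one row per record.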
Another popular example of rectangular data is the Titanic dataset [Source]
Structured data is further classified into two types: categorical data and numerical data.
Each of these can be classified further into different data types. Let's look at a complete breakdown before proceeding with each type of data.
Different types of data
2.1 — Categorical data
💡 The type of data that can be categorized [Genius 🕵️♂️].
Now consider the dataset of students’ exam results. Depending on the grade, all students with grades other than F are deemed to have passed the examination.
So we add another column with the passing status of each student. Each entry in the Pass/Fail column takes one of only two values: Pass or Fail.
Exam results of different students
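The rule described above (any grade other than F is a pass) can be sketched in a few lines of Python. The students and grades below are hypothetical, and `passing_status` is just an illustrative helper name.

```python
# Hypothetical grades; any grade other than "F" counts as a pass.
grades = {"Asha": "A", "Ben": "C", "Chen": "F"}


def passing_status(grade: str) -> str:
    """Map a letter grade to one of exactly two categories."""
    return "Fail" if grade == "F" else "Pass"


# Build the new Pass/Fail column from the existing grade column.
status = {student: passing_status(grade) for student, grade in grades.items()}
print(status)
```

Whatever the grade, the new column can only ever hold one of the two categories, which is what makes it categorical.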
Similarly, there are many instances where the entry in each row is one of a few available options. For example:
Binary data: True or False, Yes or No, 0 or 1
Exam grades: A, B, C, D, E, and F
Laptop brands: Asus, Lenovo, Apple, Dell, IBM, etc.
⚠️ The entry for a categorical data record can only be one of the available options. For example, it can be either True or False, but not both.
Now there are 2 important types of categorical data as well, and they are:
2.1.1 — Nominal data
💡 Type of categorical data that has no internal order or precedence amongst the different categories. The categories cannot be ranked one over the other.
For example, in binary data like True or False, Male or Female, one category is not more important than the other.
⚠️ There can be some exceptions here as well, refer to the upcoming exercise section of this article for an explanation.
Another example would be subjects like English, Mathematics, Science, History, etc. As long as they carry equal weightage, one cannot have more importance than the other.
2.1.2 — Ordinal data
Here the categories can be ordered, and the order in which they are arranged matters.
For example, grades in exam results can be ordered as A, B, C, D, E, & F, from highest rank to lowest. Another simple ranking is clothing sizes, which may range from XS, S, M, and L to XL.
⚠️ SPOILER ALERT! 🤖 The knowledge of Nominal and Ordinal datatypes becomes very critical during encoding for machine learning problems.
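As a small preview of why the distinction matters, here is a hedged sketch of two common encoding ideas: ranking ordinal categories and one-hot encoding nominal ones. The helper names `ordinal_encode` and `one_hot_encode` are made up for this illustration. Real projects usually rely on a library for this.

```python
# Ordinal data: the order matters, so map each category to its rank.
GRADE_ORDER = ["F", "E", "D", "C", "B", "A"]  # lowest to highest


def ordinal_encode(value, ordered_categories):
    """Return the position of value in the ordered category list."""
    return ordered_categories.index(value)


# Nominal data: no order, so give each category its own 0/1 column.
SUBJECTS = ["English", "Mathematics", "Science", "History"]


def one_hot_encode(value, categories):
    """Return a list with 1 at the category's position and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


print(ordinal_encode("A", GRADE_ORDER))     # 5
print(ordinal_encode("D", GRADE_ORDER))     # 2
print(one_hot_encode("Science", SUBJECTS))  # [0, 0, 1, 0]
```

Giving subjects the numbers 0 to 3 instead would suggest that History is somehow "greater" than English, which is exactly the ordering that nominal data does not have.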
2.2 — Numerical data
💡 Unlike categorical data, numerical data is expressed in numbers and can take any numerical value, so arithmetic on it is meaningful.
It can be either integers or real numbers. For example, the students’ marks in exams, the speed of a car, the length of a video, height, weight, etc.
Again, there are two major types of Numerical data, and they are:
2.2.1 — Discrete data
💡 When the data records can be counted & expressed only in whole numbers, it is called discrete data.
For example, the number of children in a class, the number of cars owned by a person, the number of working days in a month, and many more.
2.2.2 — Continuous data
💡 When the data records can take any value within a range and are expressed as real numbers, possibly to many decimal places, the data is known as continuous data.
For example, exact height, weight, and ratios like π cannot be recorded with complete accuracy in 2 decimal places.
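A quick way to see the difference in Python is shown below. The class size and height are made-up values used only for illustration.

```python
import math

children_in_class = 28   # discrete: a count, always a whole number
height_cm = 172.4638     # continuous: can take any value within a range

# Rounding a continuous value to 2 decimal places loses information.
pi_rounded = round(math.pi, 2)
print(pi_rounded)             # 3.14
print(pi_rounded == math.pi)  # False

# Python's own types reflect the same split between whole and real numbers.
print(isinstance(children_in_class, int), isinstance(height_cm, float))
```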
Some special data types
Apart from the above, there are a few other important data types that you should know.
1. Time series
💡 Anything measured repeatedly over time is a time series.
For example, daily or monthly stock prices, daily weather, hourly sea level, speed of a vehicle at every minute, etc.
More often than not, time series data is structured.
An example of time-series data. Records of product demand, precipitation, & temperature over the years.
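Here is a minimal sketch of a time series in Python, assuming a few hypothetical daily prices. Each observation pairs a timestamp with a value, and the order of the observations matters.

```python
from datetime import date

# Hypothetical daily closing prices, one observation per timestamp.
prices = [
    (date(2023, 1, 2), 100.0),
    (date(2023, 1, 3), 102.5),
    (date(2023, 1, 4), 101.0),
]

# A time series is ordered by time, so sort on the timestamp first.
prices.sort(key=lambda observation: observation[0])

# Change from one day to the next.
daily_change = [
    (later_day, later_price - earlier_price)
    for (_, earlier_price), (later_day, later_price) in zip(prices, prices[1:])
]
for day, change in daily_change:
    print(day.isoformat(), change)
```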
2. Text data
💡 Text data usually consists of documents containing words, sentences, and paragraphs of free-flowing text.
It can be in any language and is mostly unstructured.
A good example is product reviews on Amazon, which can be used for sentiment classification. Another is email content, which lets machine learning algorithms separate spam from the rest.
After filtering out the spam, Gmail automatically categorizes emails into Primary, Social, & Promotions based on the text data in the email contents.
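To give a feel for how free-flowing text becomes something countable, here is a simple word-count sketch (often called a bag-of-words). The review text is made up, and real sentiment or spam models use far richer features than this.

```python
import re
from collections import Counter

# Hypothetical product review.
review = "Great phone. Great battery, great price!"

# Lowercase the text and split it into words.
words = re.findall(r"[a-z']+", review.lower())

# Count how often each word appears (a simple bag-of-words).
word_counts = Counter(words)
print(word_counts)
```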
3. Image & Video data
💡 This is graphic or pictorial data such as images, drawings, and video footage.
It has great use cases such as object detection and self-driving cars. It is also a form of unstructured data.
Object detection from live footage.
4. Audio data
💡 Any information recorded in the audio format is data.
Audio is another popular data format that is widely used in machine learning applications. Apps like Shazam are a great example.
A song, a speech, an audiobook, or any other information recorded in audio format can be used as data.
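Digital audio is typically stored as a long list of numbers, each one a sample of the sound wave at a point in time. The sketch below generates a short pure tone. The sample rate, frequency, and duration are assumed values chosen only for illustration.

```python
import math

SAMPLE_RATE = 8000   # samples per second (assumed value)
FREQUENCY = 440.0    # tone frequency in Hz (assumed value)
DURATION_MS = 10     # length of the clip in milliseconds

num_samples = SAMPLE_RATE * DURATION_MS // 1000

# Each sample is the height of the sound wave at one instant.
samples = [
    math.sin(2 * math.pi * FREQUENCY * n / SAMPLE_RATE)
    for n in range(num_samples)
]

print(num_samples)
print([round(sample, 3) for sample in samples[:4]])
```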
Now that we have quickly understood so many different concepts, let’s strengthen our understanding with these fun exercises.
Let’s classify each variable in the Titanic dataset into its correct data type.
Titanic dataset [Source]
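PassengerId: The unique Id for each passenger. Although it is stored as a whole number, it is only a label, so arithmetic on it is meaningless. It is therefore best treated as a nominal identifier rather than as discrete numerical data.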
Survived: Passenger survived or not, 0 = No, 1 = Yes. This is categorical data and nominal.
Pclass: This is the ticket type, class 1 = 1st, 2 = 2nd, 3 = 3rd. This is categorical data and ordinal.
Name: Name of the passenger. This is text data.
Sex: Gender of the passenger. This is binary categorical data and ordinal.
In some cases, even data records like sex or gender can be ordinal, and this is one such case. This is because the captain of the ship explicitly issued an order for women and children to be saved first. As a result, the survival rate for women was three times higher than for men [Source]. Therefore, while modeling, the algorithm can give a slightly higher preference to females while predicting the survival status.
Similarly, we can analyze the rest of the variables of this dataset. I will leave this exercise for you to complete.
Consider a video being streamed on YouTube. Many different data points are being recorded simultaneously in real time.
Some of them are video, audio, images, resolutions, time stamps, total people watching at each timestamp, likes, dislikes, text data from the continuous chat, number of comments, transactions, engagement, and much more.
Your task is to classify each feature being recorded into its correct class. Do share your observations in the comments section.
So let’s summarize what we discussed today.
Data is everywhere, and every piece of recorded information is data
Data lies at the heart of any statistical analysis
There are different types of data; a quick breakdown is below.
Different types of data
How a machine reads data?
Finally, now that we understand the foundations and the different types of data, let's look at how machines read and interpret data, as opposed to humans.
Fundamentally, no matter the type of data, a machine ultimately stores and processes everything as 0s and 1s. Therefore, to train a machine learning model on our data, we must first convert it into numbers.
Be it images, text, audio, or any other data type, everything has to be converted into a numeric form, and ultimately into 0s and 1s, before feeding it to the machine.
Converting image to 0s and 1s
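As a rough sketch of this conversion, the snippet below turns a tiny grayscale image into 0s and 1s with a simple threshold, and shows the bits behind a short piece of text. The pixel values and the threshold are made up for illustration. Real pipelines usually keep the full numeric pixel values instead of thresholding them.

```python
# Hypothetical 4x4 grayscale image: 0 is black, 255 is white.
image = [
    [12, 200, 210, 15],
    [190, 20, 25, 220],
    [205, 30, 18, 199],
    [10, 215, 230, 8],
]

THRESHOLD = 128  # assumed cut-off between dark and bright pixels

# Each pixel becomes 1 if it is bright, otherwise 0.
binary_image = [[1 if pixel >= THRESHOLD else 0 for pixel in row] for row in image]
for row in binary_image:
    print(row)

# Text is also stored as numbers: characters become bytes, and bytes become bits.
bits = " ".join(format(byte, "08b") for byte in "Hi".encode("utf-8"))
print(bits)  # 01001000 01101001
```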
NOTE: We will look at multiple examples in the coming discussions where we build machine-learning models on different types of data.
I am certain that this discussion will help you better understand your data at a more fundamental level, which will refine your analysis.
Feel free to share your feedback or queries in the comments below.