Data Science
Introduction
Data? "Data indeed is the new oil."
1. Facts about something that can be used in calculating, reasoning, or planning.
2. Information expressed as numbers, especially for use in a computer. Hint: "data" can be treated as singular or plural in writing and speaking, e.g. "This data is useful."
e.g. everything about you, me, and the world
Science?
Science is the pursuit and application of knowledge and understanding of the natural and social world, following a systematic methodology based on evidence. Scientific methodology includes objective observation, that is, measurement and data (possibly, although not necessarily, using mathematics as a tool), and evidence.
e.g. anthropology, archaeology, astronomy, biology, botany, chemistry, cybernetics, geography, geology, mathematics, medicine, physics, physiology, psychology, social science, sociology, and zoology
e.g. questions such as:
What is the Universe made of?
How did life begin?
Are we alone in the Universe?
What makes us human?
DS Definition(s):
Courtesy: https://www.heavy.ai/learn/data-science
Data Science
Data science encompasses preparing data for analysis, including cleansing, aggregating, and manipulating the data to perform advanced data analysis. Analytic applications and data scientists can then review the results to uncover patterns and enable business leaders to draw informed insights.
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.
Stages of Data Science:
- Apply mathematics, statistics, and the scientific method
- Use a wide range of tools in R/Python and techniques for capturing, cleaning, evaluating and preparing data—everything from multi input channels to data mining to data integration methods
- Extract insights from data using predictive analytics and artificial intelligence (AI), including machine learning and deep learning models
- Write applications that automate data processing and calculations
- Tell—and illustrate—stories that clearly convey the meaning of results to decision-makers and stakeholders at every level of technical knowledge and understanding
By using data science, companies are able to make:
- Better decisions (should we marry, study, start a company, choose A or B?)
- Predictive analysis (what will happen next?)
- Pattern discoveries (a deep dive into the past to find patterns, or perhaps hidden information in the data)
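As a minimal sketch of the "what will happen next?" idea, the following fits a straight line through a few made-up monthly sales figures with ordinary least squares and extrapolates one month ahead. The numbers and the helper name `fit_line` are illustrative assumptions, not taken from any real data set; real predictive analysis would normally use richer models and proper validation.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical monthly sales (illustrative numbers only).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)

# Extrapolate the fitted line to the next, unseen month.
forecast_month_6 = slope * 6 + intercept
print(f"Forecast for month 6: {forecast_month_6:.1f}")
```

The past values are used only to estimate the trend; the prediction is the trend evaluated at a future point.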
Courtesy: Data Science Skills, GeeksforGeeks
Applications of Data Science:
Data science is the field of study that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from data.
Data science is a multidisciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by today’s organizations. Data science encompasses preparing data for analysis and processing, performing advanced data analysis, and presenting the results to reveal patterns and enable stakeholders to draw informed conclusions.
Data science discovers actionable insights and patterns in structured, semi-structured, and unstructured data sets. It involves statistics, inference, computer science, predictive analytics, machine learning algorithm development, and new technologies to gain insights from big data (volume, variety, velocity).
First Stage: Data capture, which means acquiring data, sometimes extracting it, and entering it into the system.
Second Stage: Maintenance, which includes data warehousing, data cleansing, data processing, data staging, and data architecture.
Data processing follows, and constitutes one of the data science fundamentals.
It is during data exploration and processing that data scientists stand apart from data engineers.
This stage involves data mining, data classification and clustering, data modeling, and summarizing insights gleaned from the data—the processes that create effective data.
Third Stage: Data analysis, an equally critical stage.
Here data scientists conduct exploratory and confirmatory work, regression, predictive analysis, qualitative analysis, and text mining.
Fourth/Final Stage: Insights, which involves data visualization, data reporting, the use of various business intelligence tools, and assisting businesses, policymakers, and others in smarter decision making.
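The four stages above can be sketched as a tiny pipeline. This is only an illustration using the Python standard library: the in-memory CSV, the column names, and the function names `capture`, `clean`, `analyze`, and `report` are all hypothetical stand-ins for real data sources and tools.

```python
import csv
import io
from statistics import mean

# Made-up raw data; an in-memory CSV stands in for a real data source.
RAW_CSV = """city,temperature_c
Pune,31.5
Delhi,
Mumbai,29.0
Pune,33.5
"""


def capture(text):
    """Stage 1, data capture: read raw records into the system."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Stage 2, maintenance: drop rows with missing values and convert types."""
    cleaned = []
    for row in rows:
        value = row["temperature_c"].strip()
        if value:
            cleaned.append({"city": row["city"], "temperature_c": float(value)})
    return cleaned


def analyze(rows):
    """Stage 3, analysis: aggregate the readings per city."""
    by_city = {}
    for row in rows:
        by_city.setdefault(row["city"], []).append(row["temperature_c"])
    return {city: mean(values) for city, values in by_city.items()}


def report(summary):
    """Stage 4, insights: turn the numbers into readable statements."""
    return [f"{city}: {avg:.1f} C on average" for city, avg in sorted(summary.items())]


summary = analyze(clean(capture(RAW_CSV)))
for line in report(summary):
    print(line)
```

Keeping each stage as its own function mirrors how the stages are usually handled by different tools or teams in practice.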
As a result, data scientists (as data science practitioners are called) require computer science and pure science skills beyond those of a typical data analyst. A data scientist must be able to do the following:
Data Science Functions
Types of Data
Structured data is highly specific and is stored in a predefined format, whereas unstructured data is a conglomeration of many varied types of data that are stored in their native formats.
Structured data vs. unstructured data
The difference between structured and unstructured data comes down to the data types that can be used, the level of data expertise required to use them, and schema-on-write versus schema-on-read.
| | Structured data | Unstructured data |
|---|---|---|
| Who | Self-service access | Requires data science expertise |
| What | Only select data types | Many varied types conglomerated |
| When | Schema-on-write | Schema-on-read |
| Where | Commonly stored in data warehouses | Commonly stored in data lakes |
| How | Predefined format | Native format |
Courtesy: talend.com
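The schema-on-write versus schema-on-read distinction in the table can be illustrated with a small sketch. It assumes an in-memory SQLite table as the "warehouse" side and a plain list of JSON strings as the "lake" side; the table, field names, and the helper `read_names` are hypothetical.

```python
import json
import sqlite3

# Schema-on-write: the structure is declared before any data is stored.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER)"
)
conn.execute("INSERT INTO customers (id, name, age) VALUES (?, ?, ?)", (1, "Asha", 34))

# A record that violates the declared schema is rejected at write time.
try:
    conn.execute("INSERT INTO customers (id, name, age) VALUES (?, ?, ?)", (2, None, 41))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored_rows = conn.execute("SELECT name, age FROM customers").fetchall()
conn.close()

# Schema-on-read: records are kept in their native form, whatever their shape.
raw_store = [
    '{"id": 1, "name": "Asha", "age": 34}',
    '{"id": 2, "full_name": "Ravi K", "tags": ["new"]}',
]


def read_names(lines):
    """Interpret the records only now, at read time."""
    names = []
    for line in lines:
        record = json.loads(line)
        names.append(record.get("name") or record.get("full_name"))
    return names


names = read_names(raw_store)
print(rejected, stored_rows, names)
```

The first approach pays the cost of structure up front and gets easy querying in return; the second accepts anything and defers the interpretation work to whoever reads the data.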
Let's see the comparison chart between structured and unstructured data. Here, we are tabulating the difference between both terms based on some characteristics.
| On the basis of | Structured data | Unstructured data |
|---|---|---|
| Technology | It is based on a relational database. | It is based on character and binary data. |
| Flexibility | Structured data is less flexible and schema-dependent. | There is an absence of schema, so it is more flexible. |
| Scalability | It is hard to scale database schema. | It is more scalable. |
| Robustness | It is very robust. | It is less robust. |
| Performance | Here, we can perform a structured query that allows complex joining, so the performance is higher. | While in unstructured data, textual queries are possible, the performance is lower than semi-structured and structured data. |
| Nature | Structured data is quantitative, i.e., it consists of hard numbers or things that can be counted. | It is qualitative, as it cannot be processed and analyzed using conventional tools. |
| Format | It has a predefined format. | It has a variety of formats, i.e., it comes in a variety of shapes and sizes. |
| Analysis | It is easy to search. | Searching for unstructured data is more difficult. |
Courtesy: https://www.javatpoint.com/structured-data-vs-unstructured-data
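To make the "Performance" and "Analysis" rows more concrete, here is a small sketch contrasting a structured query with a join against a plain keyword search over free text. The tables, rows, and review sentences are invented for illustration; SQLite is used only because it ships with Python.

```python
import sqlite3

# Structured data: two related tables with a predefined format.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);
    """
)

# A structured query can join and aggregate across tables in one statement.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
conn.close()

# Unstructured data: free text has no columns, so only textual search is possible.
reviews = [
    "Delivery was late but the support team was helpful.",
    "Great product, fast delivery!",
    "The packaging was damaged.",
]
mentions_delivery = [text for text in reviews if "delivery" in text.lower()]

print(totals)
print(mentions_delivery)
```

The keyword search finds the word but says nothing about whether the delivery was good or bad, which is why unstructured data usually needs further processing before analysis.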
Semi-structured data
Semi-structured data refers to data that is not captured or formatted in conventional ways. Semi-structured data does not follow the format of a tabular data model or relational databases because it does not have a fixed schema.
e.g. Hypertext Markup Language (HTML) files, JavaScript Object Notation (JSON) files, and Extensible Markup Language (XML) files
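As a rough illustration of why such data has "no fixed schema", the sketch below parses a small JSON document whose records carry different fields, then flattens them into rows that share one set of columns. The document contents and the helper name `flatten` are assumptions made for this example.

```python
import json

# Two records in the same file, each with a different set of fields.
DOCUMENT = """
[
  {"name": "Asha", "email": "asha@example.com", "phones": ["111", "222"]},
  {"name": "Ravi", "address": {"city": "Pune"}}
]
"""


def flatten(record, prefix=""):
    """Turn nested keys into dotted column names, e.g. address.city."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        elif isinstance(value, list):
            flat[path] = ", ".join(str(item) for item in value)
        else:
            flat[path] = value
    return flat


records = [flatten(record) for record in json.loads(DOCUMENT)]

# The union of all keys becomes the column list; missing values become None.
columns = sorted({key for record in records for key in record})
table = [[record.get(column) for column in columns] for record in records]

print(columns)
for row in table:
    print(row)
```

The keys act as the markers that give the data its partial structure, but a pre-processing step like this is still needed before the records fit a tabular model.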
The following table gives a brief overview of structured, semi-structured, and unstructured data.
| | Structured data | Semi-structured data | Unstructured data |
|---|---|---|---|
| What is it? | Data with a high degree of organization, typically stored in a spreadsheet-like manner | Data with some degree of organization | Data with no predefined organizational form and no specific format |
| To put it simply | Think of a spreadsheet (e.g. Excel) or data in a tabular format | Think of a TXT file with text that has some structure (headers, paragraphs, etc.) | Essentially anything that is not structured or semi-structured data (which is a lot) |
| Example formats | Excel spreadsheets; comma-separated value (.csv) files; relational database tables | Hypertext Markup Language (HTML) files; JavaScript Object Notation (JSON) files; Extensible Markup Language (XML) files | Images such as .jpeg or .png files; videos such as .mp4 files; sound files such as .mp3, .wav, or .m4a files; plain text files; Word files; PDF files |
| Characteristics | Data is structured in a spreadsheet-like manner (e.g. in a table); within that table, entries have the same format and a predefined length and follow the same order; is easily machine-readable and can therefore be analysed without major pre-processing of the data; it is commonly said that around 20% of the world’s data is structured | Data is stored in files that have some degree of organization and structure; tags or other markers separate elements and enforce hierarchies, but the size of elements can vary and their order is not important; needs some pre-processing before it can be analysed by a computer; has gained importance with the emergence of the World Wide Web |