Big Data refers to massive volumes of data, collected over time, that are difficult to analyze and handle using common database management tools. Data can be identified as Big Data if it fulfills the 3 V's: Velocity (data arrives faster than it can be processed), Volume (the data accumulated over time becomes huge), and Variety (unstructured, semi-structured, and structured forms).
- Unstructured data: emails, blogs, tweets, social networks, mobile data, web pages, and so on.
- Semi-structured data: text files, system log files, XML files, etc.
- Structured data: transaction data, RDBMS (Relational Database Management Systems) tables, OLTP records, etc.
In short, Big Data is data at a scale that cannot be effectively processed with traditional tools.
Data Analytics is the field that analyzes large data sets and extracts useful information, essential to business decision making, using different statistical techniques and tools. Data Analytics deals with raw data that needs to be interpreted to find meaningful information in it. Many people think that data science and data analytics are the same, which is not correct: data analytics is the fundamental level of data science. Data Analytics is mostly used in business, computer science, and commercial industries to increase business efficiency.
People hold different opinions when it comes to choosing a programming language for Data Science projects, depending on their career backgrounds and the domains they have worked in.
There are many tools used for data analytics, such as R, SAS, Python, Hadoop, SQL, and others.
The following is a list of the most important programming languages and tools for Data Science.
R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible. One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales. Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
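As a small illustration of the readability and dynamic typing described above, this minimal sketch (names hypothetical) groups words by length using only the standard library:

```python
from collections import defaultdict

def group_by_length(words):
    """Group words into a dict keyed by their length."""
    groups = defaultdict(list)
    for word in words:
        groups[len(word)].append(word)
    return dict(groups)

print(group_by_length(["data", "science", "python", "big"]))
# {4: ['data'], 7: ['science'], 6: ['python'], 3: ['big']}
```

The same few lines read almost like the English description of the task, which is the "code readability" point in practice.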
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
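The MapReduce model Hadoop is built on is simple enough to sketch on a single machine. This toy word count (not Hadoop itself, just the idea) mimics the map, shuffle, and reduce phases that Hadoop distributes across a cluster:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data science"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'science': 1}
```

In a real Hadoop job, the map and reduce functions run on different machines and the framework handles the shuffle, storage, and fault tolerance.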
SAS is an integrated system of software solutions that enables you to perform the following tasks: data entry, retrieval, and management; report writing and graphics design; statistical and mathematical analysis; business forecasting and decision support; operations research and project management; and applications development.
Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. It smoothly integrates features of object-oriented and functional languages. Scala is object-oriented, statically typed, and extensible. Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching. It also has an advanced type system supporting algebraic data types, covariance and contravariance, higher-order types (but not higher-rank types), and anonymous types.
Java is a general-purpose computer-programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers “write once, run anywhere” (WORA), meaning that compiled Java code can run on all platforms that support Java without the need for recompilation. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of computer architecture. As of 2016, Java is one of the most popular programming languages in use, particularly for client-server web applications, with a reported 9 million developers.
RapidMiner is a data science software platform developed by the company of the same name that provides an integrated environment for data preparation, machine learning, deep learning, text mining, and predictive analytics. It is used for business and commercial applications as well as for research, education, training, rapid prototyping, and application development and supports all steps of the machine learning process including data preparation, results visualization, model validation and optimization.
SQL (pronounced “ess-que-el”) stands for Structured Query Language. SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems. SQL statements are used to perform tasks such as update data on a database, or retrieve data from a database. Some common relational database management systems that use SQL are: Oracle, Sybase, Microsoft SQL Server, Access, Ingres, etc. Although most database systems use SQL, most of them also have their own additional proprietary extensions that are usually only used on their system. However, the standard SQL commands such as “Select”, “Insert”, “Update”, “Delete”, “Create”, and “Drop” can be used to accomplish almost everything that one needs to do with a database.
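The standard commands mentioned above (CREATE, INSERT, UPDATE, SELECT, DELETE) can be tried out with Python's built-in sqlite3 module; this is a minimal sketch against a throwaway in-memory database, with the table and data invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
cur.executemany("INSERT INTO products (name, price) VALUES (?, ?)",
                [("keyboard", 25.0), ("monitor", 120.0)])
cur.execute("UPDATE products SET price = 110.0 WHERE name = 'monitor'")

cur.execute("SELECT name, price FROM products ORDER BY price")
rows = cur.fetchall()
print(rows)  # [('keyboard', 25.0), ('monitor', 110.0)]

cur.execute("DELETE FROM products WHERE name = 'keyboard'")
cur.execute("SELECT COUNT(*) FROM products")
remaining = cur.fetchone()[0]
print(remaining)  # 1
```

The same statements run largely unchanged against Oracle, SQL Server, or any other SQL database, which is exactly the portability the standard commands provide.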
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Tableau produces interactive data visualization products focused on business intelligence. Tableau products query relational databases, OLAP cubes, cloud databases, and spreadsheets, and then generate a number of graph types. The products can also extract data and store and retrieve it from an in-memory data engine. Tableau has mapping functionality and is able to plot latitude and longitude coordinates and connect to spatial files like Esri Shapefiles, KML, and GeoJSON to display custom geography.
KNIME is a free and open-source data analytics, reporting and integration platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining concept. A graphical user interface and use of JDBC allow assembly of nodes blending different data sources, including preprocessing (ETL: Extraction, Transformation, Loading), for modeling, data analysis and visualization with little or no programming. As an advanced analytics tool, KNIME can to some extent be considered a SAS alternative. KNIME has been used in pharmaceutical research since 2006, and it is also used in other areas such as CRM customer data analysis, business intelligence, and financial data analysis.
Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
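As a small taste of the clustering side of scikit-learn, the sketch below (assuming scikit-learn is installed, and with made-up data) fits k-means to four one-dimensional points and checks that nearby points land in the same cluster:

```python
from sklearn.cluster import KMeans

# Four one-dimensional points: two near 0, two near 10.
X = [[0.0], [1.0], [10.0], [11.0]]
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Points near each other end up in the same cluster,
# and the two groups get different labels.
print(labels[0] == labels[1], labels[2] == labels[3], labels[0] != labels[2])
```

Every estimator in the library follows this same fit/predict pattern, which is a large part of why it is so widely taught.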
TensorFlow™ is an open source software library for high-performance numerical computation. Its flexible architecture allows easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices. Originally developed by researchers and engineers from the Google Brain team within Google’s AI organization, it comes with strong support for machine learning and deep learning and the flexible numerical computation core is used across many other scientific domains.
Julia is a high-level, general-purpose, dynamic programming language originally designed for high-performance numerical analysis and computational science, achieving speed without a separate compilation step. It is also usable for client and server web programming, low-level systems programming, and as a specification language.
You can see from the following chart which languages are most used for Data Science, Data Analytics, Big Data, and Mining.
The following chart, taken from KDnuggets, shows the poll results for preferred Data Science languages in 2016:
Similarly, this chart shows the poll results for preferred Data Science languages from 2016 to 2018:
Comparing 2016 with 2018, it is clear that Python has gained popularity.
Data visualization is the presentation of data in a graphical or pictorial format such as charts, figures, and bars. Simply put, data visualization is using pictures to represent data. It helps to discover patterns, trends, and correlations that might go unnoticed in text-based data, which in turn helps decision makers make fact-based decisions.
You may ask, "Why do data visualization when the same information is available in Excel tables or spreadsheets?" It is better to represent the data as a picture because "a picture is worth a thousand words", but more importantly, we can't really see insights in spreadsheets. Another reason is that data visualization helps us process data faster.
For instance, suppose you are given a spreadsheet and a pictorial representation of the same data. Which one gives you the information more quickly and clearly?
But that is just a small amount of information. What about the huge amounts of information you get from so many sources, i.e. Big Data? Data visualization helps to process and represent millions of data points as a single, informative picture.
Data visualization is a business analytics tool because it supports data-driven, fact-based decision making. Decision makers can study a visualization and drill down to the details they need very quickly. If they can't drill down and the information is not readily available, they have to ask an analyst for more information, which can take weeks or months.
A Data Warehouse is a central place where data is stored from different data sources and applications. It consists of data from multiple heterogeneous data sources.
Data is information processed or stored by a computer, in the form of text, images, audio clips, software programs, or other formats. A warehouse is a large building where raw materials or manufactured goods are stored prior to their distribution for sale. So, quite literally, a data warehouse is the place or repository where all the data is stored.
You can see from the above image that the data is coming from different sources to a Data Warehouse. A data warehouse stores both current and historical data.
To build a data warehouse, you first copy the raw data from each of the data sources, then cleanse and optimize it. This process of getting data into a data warehouse is called ETL: Extract, Transform, Load.
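The ETL steps can be sketched in a few lines of Python. This toy example (source data, table, and column names all hypothetical) extracts raw CSV rows, transforms them by trimming and normalizing, and loads them into an in-memory SQLite "warehouse":

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source (here, an in-memory CSV string).
raw = "name,amount\n alice ,100\nBOB,250\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: trim whitespace, normalize case, convert types.
clean = [(r["name"].strip().title(), int(r["amount"])) for r in rows]

# Load: insert the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", clean)

loaded = warehouse.execute("SELECT customer, amount FROM sales").fetchall()
print(loaded)  # [('Alice', 100), ('Bob', 250)]
```

Real ETL pipelines do the same three things at vastly larger scale, pulling from many heterogeneous sources on a schedule and handling errors, deduplication, and history along the way.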
A data warehouse is very important for large enterprises because it helps the company with product development, marketing, pricing strategy, production timing, historical analysis, forecasting, and customer satisfaction. The data in a data warehouse is used for analytical reporting, data mining, and analysis, which business analysts, sales managers, and knowledge workers later use for decision-making and future strategies.
For example, an e-commerce business with a data warehouse can analyze its data to recognize which products are purchased most by the 18-22 age group, and then display similar or related products as recommendations.
Data warehouse databases provide a decision support system (DSS) environment in which the performance of an entire enterprise can be evaluated over time. In the broadest sense, the term data warehouse is used to refer to a database that contains very large stores of historical data. The data is stored as a series of snapshots, in which each record represents data at a specific time. By analyzing these snapshots, you can make comparisons between different time periods. You can then use these comparisons to help make important business decisions.
A huge amount of data is generated every single day in the world.
2.5 quintillion bytes (2.5 exabytes), i.e. 2.5 × 10^18 bytes, of data are created every day: data from stock markets, social media giants like Facebook, Twitter, LinkedIn, and Instagram, e-commerce sites like Amazon and eBay, text messages, digital images, videos, audio, and documents, to name a few. Data is growing faster than ever before: 90% of the data in the world today was created in the last two years alone. This large volume of data is called Big Data.
So what is Data Science?
Data Science is, simply put, the study of extracting knowledge from data.
Data Science is an interdisciplinary field that uses techniques including Information Science, Statistical Learning, Machine Learning, Probability Models, Artificial Intelligence, Visualization, Pattern Recognition, Mathematics, Statistics, Operations Research, Predictive Analytics, Signal Processing, Computer Science, etc. with the aim of extracting useful knowledge and insights from the large volume of data.
Whether it’s a small Excel spreadsheet or 10 million records in a database, the goal of Data Science is always the same: to extract valuable insight from the data which will eventually help to do things in a better way.
At a deeper level, Data Science helps to mine and understand complex behaviors and trends by uncovering hidden insights that help companies make smarter business decisions. For example, with the help of data science, an online food delivery company can determine the locations from which the most food is ordered, so it can place its delivery agents near those locations for faster delivery. Similarly, an e-commerce site can use Data Science to identify what type of products a certain customer purchases, so it can recommend similar products and accessories to that customer.
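The food-delivery example above boils down to a frequency count over order records. A minimal sketch with Python's standard library (the order data here is invented):

```python
from collections import Counter

# One record per order: the neighbourhood it was delivered to.
orders = ["Downtown", "Uptown", "Downtown", "Midtown", "Downtown", "Uptown"]

# most_common(1) returns the single most frequent location and its count.
location, n_orders = Counter(orders).most_common(1)[0]
print(f"Place delivery agents near {location} ({n_orders} orders)")
# Place delivery agents near Downtown (3 orders)
```

Real systems would aggregate far larger logs, segment by time of day, and feed the result into dispatching, but the core question answered is the same.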
This is a general picture depicting how the Data Science process works. Ultimately, Data Science helps companies make better business decisions by analyzing their data.