Monday, 25 September 2017

What is Big Data?

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. 

Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating, and information privacy.

The term “big data” can also refer to the use of predictive analytics, user and entity behavior analytics (UEBA), or other advanced data analytics methods that extract value from data, and only seldom to a particular size of data set.

Big data can be structured, semi-structured, or unstructured data that has the potential to be mined for information.
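To make the three shapes of data concrete, here is a minimal Python sketch. The sample records, field names, and the regular expression are made up for illustration; they are assumptions, not taken from any real data set.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row has the same fields.
structured_csv = "order_id,customer,amount\n1,alice,30.50\n2,bob,12.00\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: self-describing, but fields can differ from record to record.
semi_structured = '{"order_id": 3, "customer": "carol", "tags": ["gift", "rush"]}'
record = json.loads(semi_structured)

# Unstructured: free text; information has to be extracted, here with a regex.
unstructured = "Customer dave called about order 4 and asked for a refund."
match = re.search(r"order (\d+)", unstructured)
order_id_from_text = int(match.group(1)) if match else None

print(rows[0]["customer"], record["tags"], order_id_from_text)
```

The point of the sketch is only that each shape needs a different way of reading it: a column parser, a format parser, and some form of text mining.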

Discussions about big data often include data lakes. Data lakes support storing data in its original, native format.

The goal is to offer a raw or unprocessed view of data to data scientists and analysts for discovery and analytics.

Big data differs from a relational database. Relational databases have been around since the early 1970s.

 A relational database is a collection of data items organized as a set of formally described tables with unique index keys.  

Data can be accessed or reassembled in many different ways without having to reorganize the database tables, often in queries with Boolean logic.
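As a small illustration of tables with keys and a query with Boolean logic, here is a sketch using Python's built-in sqlite3 module. The table names, columns, and sample rows are hypothetical and chosen only for this example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL
);
INSERT INTO customers VALUES (1, 'Alice', 'Pune'), (2, 'Bob', 'Delhi');
INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 40.0), (12, 2, 90.0);
""")

# The same tables are reassembled with a join and filtered with AND / OR,
# without reorganizing how the data is stored.
query = """
SELECT c.name, o.amount
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id
WHERE (c.city = 'Pune' AND o.amount > 100) OR c.city = 'Delhi'
ORDER BY o.id
"""
results = conn.execute(query).fetchall()
print(results)
```

Changing the WHERE clause gives a different view of the same two tables, which is the flexibility the paragraph above describes.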

Relational database technology struggles with managing multiple, continuous streams of data and with scaling to a high volume of data. Nor can it modify the incoming data in real time.

Big data technologies have made it technically and economically viable to collect and store larger datasets and to analyze them in order to expose new insights.  

In most cases, big data processing involves a common data flow – from the collection of raw data to the consumption of actionable information.
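That flow can be sketched in a few lines of Python. The stage names (collect, store, process, consume) and the sensor readings are assumptions made for this example; real pipelines differ in how many stages they have and what each one does.

```python
import json
from collections import defaultdict

def collect():
    """Collection: yield raw events as they might arrive from a source."""
    yield from [
        '{"sensor": "a", "temp": 21.5}',
        '{"sensor": "b", "temp": 19.0}',
        'not valid json',
        '{"sensor": "a", "temp": 22.5}',
    ]

def store(lines):
    """Storage: keep the raw lines unchanged so they can be reprocessed later."""
    return list(lines)

def process(raw_lines):
    """Processing: parse, skip malformed lines, and average per sensor."""
    readings = defaultdict(list)
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed input is skipped, not fixed
        readings[event["sensor"]].append(event["temp"])
    return {sensor: sum(v) / len(v) for sensor, v in readings.items()}

def consume(averages):
    """Consumption: turn the aggregates into something a person can act on."""
    return [f"sensor {s}: average {avg:.1f}" for s, avg in sorted(averages.items())]

raw_store = store(collect())
averages = process(raw_store)
report = consume(averages)
print(report)
```

Keeping the raw lines untouched in the storage step means the processing step can be changed and rerun later without collecting the data again.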

A set of specific attributes defines big data. They are commonly called the four V’s: volume, variety, velocity, and veracity.

Volume – The quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it can actually be considered big data or not.

Variety – The type and nature of the data.  This helps people who analyze it to efficiently use the resulting insight.

Velocity – In this context, the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.

Veracity – The quality of captured data can vary greatly, affecting the accuracy of analysis.
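As a rough illustration, each of the four V’s can be turned into a simple number for a tiny batch of events. The events and the way each measure is computed are assumptions for this sketch, not standard definitions.

```python
from datetime import datetime

# Hypothetical events; a missing payload stands in for poor-quality data.
events = [
    {"ts": "2017-09-25T10:00:00", "kind": "click", "payload": "home"},
    {"ts": "2017-09-25T10:00:01", "kind": "image", "payload": "cat.png"},
    {"ts": "2017-09-25T10:00:02", "kind": "click", "payload": None},
    {"ts": "2017-09-25T10:00:04", "kind": "log", "payload": "timeout"},
]

# Volume: how much data there is (here, simply the number of events).
volume = len(events)

# Variety: how many different kinds of data appear.
variety = sorted({e["kind"] for e in events})

# Velocity: events per second over the observed time span.
times = [datetime.fromisoformat(e["ts"]) for e in events]
span_seconds = (max(times) - min(times)).total_seconds()
velocity = volume / span_seconds if span_seconds else float("inf")

# Veracity: the share of events that carry a usable payload.
veracity = sum(e["payload"] is not None for e in events) / volume

print(volume, variety, velocity, veracity)
```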

You can’t have a conversation about big data for very long without running into the elephant in the room, Hadoop.

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle a virtually limitless number of concurrent tasks or jobs.
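The idea of splitting one job into many concurrent tasks can be sketched on a single machine. This example does not use Hadoop or any Hadoop API; it is a map-and-reduce style word count built on Python's standard library, with threads standing in for cluster nodes and made-up text chunks as input.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    return Counter(chunk.lower().split())

# Hypothetical input, already split into chunks as a cluster would split a file.
chunks = [
    "big data is big",
    "hadoop stores big data",
    "data data everywhere",
]

# Each chunk becomes its own task; the pool runs the tasks concurrently.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# Reduce step: merge the per-chunk results into one total.
total = sum(partial_counts, Counter())
print(total.most_common(2))
```

Because each task only needs its own chunk, adding more workers (or, on a real cluster, more machines) lets more chunks be processed at the same time.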

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and perform user application tasks.
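One way to picture reliable storage across many servers is to split a file into fixed-size blocks and keep several copies of each block on different machines. The sketch below is a toy model, not HDFS code: the block size, node names, replication factor, and round-robin placement are all assumptions for illustration, and real HDFS uses much larger blocks and a more careful placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

data = b"x" * 1000  # stand-in for a large file
blocks = split_into_blocks(data, block_size=256)
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical servers
placement = place_replicas(len(blocks), nodes, replication=3)

for block, holders in placement.items():
    print(block, len(blocks[block]), holders)
```

With three copies of every block, the toy cluster can lose any one node and still hold at least two copies of each block.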


