Big Data: Challenges and Solutions

Shivangi Sharma
Sep 17, 2020 · 4 min read

The term “big data” refers to data that is so large, fast, or complex that it is difficult or impossible to process using traditional methods. The practice of processing and storing large amounts of data for analytics has been around a long time, but the concept of big data gained momentum in the early 2000s.

The importance of big data doesn’t revolve around the quantity of data, but how you derive useful insights from it!

Big Data Challenges

Big data is a big deal for industries. Along with it comes the potential to unlock big insights for every industry, large and small. But with big data come big challenges.

A few of the challenges include capture, curation, storage, analysis, search, rapid growth, transfer, sharing, and visualization.

What's the solution, then?

Hadoop: the big data solution

With the rise in the volume of data, traditional approaches began to fail, and the need for new solutions emerged in the market.

Google came up with a solution called MapReduce for big data processing. Inspired by this work, an open-source project called Hadoop came into existence.

Hadoop is an open-source software framework for storing enormous data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, huge processing power, and can handle multiple parallel tasks.

Hadoop key components

Hadoop consists of three core components: HDFS, MapReduce, and YARN.

HDFS

The Hadoop Distributed File System (HDFS) is the storage layer of Hadoop. Files are broken into fixed-size blocks. It consists of a NameNode and DataNodes, which act as the master and slave nodes respectively.
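To make the block idea concrete, here is a toy sketch (not HDFS's actual implementation) of how a file's size maps onto blocks, assuming the common 128 MB default block size:

```python
# Toy sketch: how HDFS conceptually splits a file into fixed-size blocks.
# 128 MB is the common default block size; this is an illustration, not real HDFS code.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of the given size."""
    blocks = []
    offset = 0
    index = 0
    while offset < file_size_bytes:
        length = min(block_size, file_size_bytes - offset)
        blocks.append((index, length))
        offset += length
        index += 1
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
print(split_into_blocks(300 * 1024 * 1024))
```

In real HDFS, the NameNode tracks this block-to-file mapping as metadata, while the DataNodes store the block contents themselves.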

MapReduce

MapReduce is the processing layer of Hadoop. It performs distributed processing. The MapReduce model contains two tasks: Map and Reduce.
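The classic example is word counting. The sketch below shows the two phases in plain Python (Hadoop itself runs these phases as distributed Java tasks; this only illustrates the model):

```python
# Minimal word-count sketch of the MapReduce model (pure Python, not Hadoop's API).
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in a line of input.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each key (word).
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data is big", "hadoop handles big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(pairs))
```

In a real cluster, the Map tasks run in parallel on different nodes, and the framework groups the emitted pairs by key before the Reduce tasks aggregate them.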

YARN

YARN is the resource management layer of Hadoop. It is responsible for resource allocation and job scheduling, and it is a core component of Hadoop version 2. In this architecture, the processing layer is separated from the resource management layer.
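Conceptually, YARN hands out "containers" (slices of a node's CPU and memory) to applications. The following is a highly simplified sketch of that allocation idea, assuming a FIFO policy and memory as the only resource; real YARN schedulers are far more sophisticated:

```python
# Toy sketch of YARN-style resource allocation: a scheduler grants containers
# from a cluster's free capacity. Illustration only, not YARN's actual scheduler.

def allocate(free_memory_mb, requests):
    """FIFO allocation: grant each (app, memory) request while capacity remains."""
    granted = []
    for app, needed_mb in requests:
        if needed_mb <= free_memory_mb:
            free_memory_mb -= needed_mb
            granted.append(app)
    return granted, free_memory_mb

requests = [("app-1", 4096), ("app-2", 8192), ("app-3", 4096)]
print(allocate(12288, requests))
```

The separation the article describes means MapReduce (and other engines) submit such resource requests to YARN rather than managing cluster resources themselves.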

(Figure: Hadoop layers)

Why Hadoop?

Tons of data are generated and collected every second by the various processes a company carries out. This data can contain important patterns showing how the company can improve those processes, but such patterns are often ignored. They can be identified readily with the right big data tool, such as Hadoop.

Multiple Data Sources

Data can be collected from various sources, such as social media, clickstream data, or text conversations, and it can be structured, semi-structured, or unstructured. Converting all of this collected data to a single format would be very time-consuming. Hadoop saves this time because it can derive valuable insights from any form of data. It also supports a variety of use cases, such as data warehousing and fraud detection.

Speed

Hadoop uses a distributed storage system to store data. Since the data processing tools are incorporated within the Hadoop framework itself, processing is carried out at a faster rate.

Cost efficiency

Companies used to spend huge amounts of money on data storage. Traditional data storage tools have limited capacity, so past datasets had to be deleted, which often meant losing valuable information. Hadoop solves such problems: it is a cost-effective solution for data storage, and companies can easily use it to keep all their data, past and present, for use in decision-making.

Fault tolerance

Hadoop replicates the data it stores, creating multiple copies of each block. Replication is governed by a configurable replication factor; in HDFS the default is 3. This prevents data loss in case of node failure, which makes Hadoop fault tolerant.
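A toy sketch of the idea, assuming simple round-robin placement (real HDFS placement is rack-aware, so this only illustrates "copies on distinct nodes"):

```python
# Toy sketch: replica placement with a replication factor (HDFS defaults to 3).
# Round-robin over nodes for simplicity; real HDFS placement is rack-aware.

def place_replicas(block_id, nodes, replication_factor=3):
    """Assign a block's replicas to distinct nodes."""
    if replication_factor > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    start = block_id % len(nodes)
    rotated = nodes[start:] + nodes[:start]
    return rotated[:replication_factor]

nodes = ["node-1", "node-2", "node-3", "node-4"]
print(place_replicas(0, nodes))  # three distinct nodes hold block 0
```

If one of the chosen nodes fails, the block can still be read from the remaining replicas, and HDFS re-replicates it elsewhere to restore the target factor.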

Scalability

A Hadoop cluster can be scaled horizontally, by adding more nodes, or vertically, by adding resources (CPU, memory, storage) to existing nodes, according to needs.

Data Locality

Hadoop uses the data locality feature. Data locality means moving computation to the data, instead of moving data to the computation. This significantly improves processing efficiency, since shipping a small task to a node is far cheaper than shipping large volumes of data across the network.
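A minimal sketch of that scheduling preference (illustration only; Hadoop's real scheduler also considers rack-local and off-rack placement): each task is assigned to a node that already holds its input block when one exists.

```python
# Toy sketch of data locality: schedule each task on a node that already
# stores the block it reads, avoiding a network transfer when possible.

def schedule(task_blocks, block_locations):
    """Map each task to a node holding its input block, else mark it 'remote'."""
    assignments = {}
    for task, block in task_blocks.items():
        holders = block_locations.get(block, [])
        assignments[task] = holders[0] if holders else "remote"
    return assignments

block_locations = {"blk-1": ["node-2", "node-3"], "blk-2": ["node-1"]}
tasks = {"task-A": "blk-1", "task-B": "blk-2", "task-C": "blk-9"}
print(schedule(tasks, block_locations))
```

Only the task whose block has no known local copy would need to pull its data over the network.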
