Big Data

Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

Big data consists of larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software just can’t manage them. But these massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before.

Why Is Big Data Important?

  • The importance of big data doesn’t revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable:

          1.Cost reductions

          2.Time reductions

          3.New product development and optimized offerings

          4.Smart decision making

  • When you combine big data with high-powered analytics, you can accomplish business-related tasks such as:
  • Determining root causes of failures, issues and defects in near-real time.
  • Generating coupons at the point of sale based on the customer’s buying habits.
  • Recalculating entire risk portfolios in minutes.
  • Detecting fraudulent behaviour before it affects your organization.

  • Who uses big data? Big data affects organizations across practically every industry. See how each industry can benefit from this onslaught of information.

    Big Data Glossary

    While we've attempted to define concepts as we've used them throughout the guide, sometimes it's helpful to have specialized terminology available in a single place.

    1.Cluster computing:

    Clustered computing is the practice of pooling the resources of multiple machines and managing their collective capabilities to complete tasks. Computer clusters require a cluster management layer, which handles communication between the individual nodes and coordinates work assignment.

    2.Data lake:

    Data lake is a term for a large repository of collected data in a relatively raw state. This is frequently used to refer to the data collected in a big data system which might be unstructured and frequently changing.

    3.Data mining:

    Data mining is a broad term for the practice of trying to find patterns in large sets of data. It is the process of trying to refine a mass of data into a more understandable and cohesive set of information.

    4.Data warehouse:

    Data warehouses are large, ordered repositories of data that can be used for analysis and reporting. In contrast to a data lake, a data warehouse is composed of data that has been cleaned and integrated with other sources, and that is generally well-ordered.


    5.ETL:

    ETL stands for extract, transform, and load. It refers to the process of taking raw data and preparing it for the system's use. This is traditionally a process associated with data warehouses, but characteristics of this process are also found in the ingestion pipelines of big data systems.
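    The extract, transform, and load steps can be illustrated with a minimal sketch using only the Python standard library. The sample CSV text, the `sales` table, and the function names are assumptions made up for this illustration; a real ingestion pipeline would read from actual sources and load into an actual warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: inconsistent casing, stray whitespace, one bad row.
RAW_CSV = """customer,amount
 alice ,10.50
BOB,4.25
carol,not-a-number
alice,2.00
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, parse amounts, skip rows that cannot be parsed."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount is not a number
        clean.append((row["customer"].strip().lower(), amount))
    return clean


def load(records, conn):
    """Load: write the cleaned records into a queryable table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database stands in for a warehouse
load(transform(extract(RAW_CSV)), conn)

totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

    Keeping the three stages as separate functions mirrors the name of the process and makes each stage easy to replace, for example by swapping the CSV reader for a different source.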


    6.Hadoop:

    Hadoop is an Apache project that was the early open-source success in big data. It consists of a distributed filesystem called HDFS, with a cluster management and resource scheduler on top called YARN (Yet Another Resource Negotiator). Batch processing capabilities are provided by the MapReduce computation engine. Other computational and analysis systems can be run alongside MapReduce in modern Hadoop deployments.

    7.In-memory computing:

    In-memory computing is a strategy that involves moving the working datasets entirely into a cluster's collective memory. Intermediate calculations are not written to disk and are instead held in memory. This gives in-memory computing systems like Apache Spark a huge advantage in speed over I/O-bound systems like Hadoop's MapReduce.

    8.Machine learning:

    Machine learning is the study and practice of designing systems that can learn, adjust, and improve based on the data fed to them. This typically involves implementation of predictive and statistical algorithms that can continually zero in on "correct" behavior and insights as more data flows through the system.
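    As a rough illustration of a system that improves as more data flows through it, the sketch below fits a line with online gradient descent, updating its estimate one sample at a time. The data (points on the line y = 3x + 1), the learning rate, and the function names are assumptions chosen for this example, not part of any particular library.

```python
def train_online(samples, learning_rate=0.1):
    """Update a linear model y = weight * x + bias one sample at a time."""
    weight, bias = 0.0, 0.0
    for x, y in samples:
        error = (weight * x + bias) - y        # how wrong the current prediction is
        weight -= learning_rate * error * x    # nudge parameters to reduce the error
        bias -= learning_rate * error
    return weight, bias


def make_samples(n):
    """Hypothetical data on the line y = 3x + 1, with x cycling through [0, 1)."""
    samples = []
    for i in range(n):
        x = (i % 100) / 100
        samples.append((x, 3 * x + 1))
    return samples


few = train_online(make_samples(50))       # model after seeing little data
many = train_online(make_samples(20000))   # model after seeing much more data

print("after 50 samples:   ", few)
print("after 20000 samples:", many)
```

    The estimate after 20000 samples sits much closer to the true slope and intercept than the estimate after 50, which is the "zeroing in" behavior described above.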

    9.Map reduce (big data algorithm):

    Map reduce (the big data algorithm, not Hadoop's MapReduce computation engine) is an algorithm for scheduling work on a computing cluster. The process involves splitting the problem set up (mapping it to different nodes) and computing over them to produce intermediate results, shuffling the results to align like sets, and then reducing the results by outputting a single value for each set.
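    The map, shuffle, and reduce phases can be sketched on a single machine with a word count. This is a toy illustration under the assumption that each text chunk stands in for the portion of data a separate node would receive; the function names are made up for this example and it is not Hadoop's MapReduce API.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of input into intermediate (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle: group intermediate values so that like keys end up together."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group into a single value per key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different node.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle_phase(mapped))


chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```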


    10.NoSQL:

    NoSQL is a broad term referring to databases designed outside of the traditional relational model. NoSQL databases have different trade-offs compared to relational databases.

    11.Stream processing:

    Stream processing is the practice of computing over individual data items as they move through a system. This allows for real-time analysis of the data being fed to the system and is useful for time-sensitive operations using high velocity metrics.
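    Computing over items as they arrive, rather than over a stored batch, can be sketched with a Python generator that maintains a rolling average. The `sensor_readings` source and the window size are hypothetical stand-ins for a real feed such as a message queue or socket.

```python
from collections import deque


def rolling_average(stream, window=3):
    """Yield the average of the most recent `window` items after each new item."""
    recent = deque(maxlen=window)  # old items fall off automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


def sensor_readings():
    """Hypothetical source that produces one reading at a time."""
    for reading in [10, 20, 30, 40, 50]:
        yield reading


averages = list(rolling_average(sensor_readings(), window=3))
print(averages)  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

    Because the function only keeps the last few items in memory, it can run over a stream of any length and produce a result as soon as each item arrives.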

    What tools are used to analyze big data?

    Perhaps the most influential and established tool for analyzing big data is Apache Hadoop. Apache Hadoop is a framework for storing and processing data at a large scale, and it is completely open source. Hadoop can run on commodity hardware, making it easy to use with an existing data center, or even to conduct analysis in the cloud. Hadoop is broken into four main parts:

    The Hadoop Distributed File System (HDFS), which is a distributed file system designed for very high aggregate bandwidth.

    YARN, a platform for managing Hadoop's resources and scheduling programs that will run on the Hadoop infrastructure.

    MapReduce, the batch processing engine that runs computations in parallel across the cluster.

    Hadoop Common, a set of shared libraries and utilities that support the other modules.

    How Does Big Data Work?

    Before discovering how big data can work for your business, you should understand where it comes from. The sources for big data generally fall into one of three categories.

    1.Streaming data:

    This category includes data that reaches your IT systems from a web of connected devices, often part of the Internet of Things (IoT). You can analyze this data as it arrives and make decisions on what data to keep, what not to keep and what requires further analysis.

    2.Social media data:

    The data on social interactions is an increasingly attractive set of information, particularly for marketing, sales and support functions. It's often in unstructured or semistructured forms, so it poses a unique challenge when it comes to consumption and analysis.

    3.Publicly available sources:

    Massive amounts of data are available through open data sources like the US government’s data.gov, the CIA World Factbook or the European Union Open Data Portal.

    After identifying the potential sources of data, consider the decisions you’ll need to make once you begin harnessing that information. These include:

    1.How to store and manage it:

    Whereas storage would have been a problem several years ago, there are now low-cost options for storing data if that’s the best strategy for your business.

    2.How much of it to analyze:

    Some organizations don't exclude any data from their analyses, which is possible with today’s high-performance technologies such as grid computing or in-memory analytics. Another approach is to determine upfront which data is relevant before analyzing it.

    3.How to use any insights you uncover:

    The more knowledge you have, the more confident you’ll be in making business decisions. It’s smart to have a strategy in place once you have an abundance of information at hand.


    Big data is a broad, rapidly evolving topic. While it is not well suited for all types of computing, many organizations are turning to big data for certain types of workloads and using it to supplement their existing analysis and business tools. Big data systems are uniquely suited for surfacing difficult-to-detect patterns and providing insight into behaviors that are impossible to find through conventional means. By correctly implementing systems that deal with big data, organizations can gain incredible value from data that is already available.