What is Big Data

Big data: everyone seems to be talking about it, but what is big data really? How is it changing the way researchers at companies, nonprofits, governments, institutions, and other organizations are learning about the world around them? Where is this data coming from, how is it being processed, and how are the results being used? And why is open source so important to answering these questions?

In this short primer, learn all about big data and what it means for the changing world we live in.

What is big data?
There is no hard and fast rule about exactly how large a dataset needs to be for it to be considered "big." Instead, what typically defines big data is the need for new techniques and tools to process it. To use big data, you need programs that span multiple physical or virtual machines, working in concert to process all of the data in a reasonable span of time.

Getting programs on multiple machines to work together efficiently takes special programming techniques: each program has to know which portion of the data to process, and the results from all the machines then have to be combined to make sense of the whole pool of data. Because programs can typically access data stored locally much faster than data fetched over a network, how data is distributed across a cluster and how those machines are networked together are also important considerations when thinking about big data problems.
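To make the idea of splitting work across machines concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular system does it: the "workers" are just lists in one process, the function names (assign_worker, partition, local_totals, combine) and the sample records are made up for this example, and a CRC32 hash stands in for whatever partitioning scheme a real cluster would use.

```python
import zlib
from collections import defaultdict


def assign_worker(key: str, num_workers: int) -> int:
    """Pick a worker for a key using a stable hash, so every run agrees."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


def partition(records, num_workers):
    """Split (key, value) records into one shard per worker."""
    shards = [[] for _ in range(num_workers)]
    for key, value in records:
        shards[assign_worker(key, num_workers)].append((key, value))
    return shards


def local_totals(shard):
    """Work done by one worker: sum values per key using only its own shard."""
    totals = defaultdict(int)
    for key, value in shard:
        totals[key] += value
    return dict(totals)


def combine(partials):
    """Merge per-worker results. Each key lives in exactly one shard,
    so the partial results never overlap."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged


# Hypothetical sample data: (customer, purchase count) pairs.
records = [("alice", 3), ("bob", 5), ("alice", 4), ("carol", 1), ("bob", 2)]
shards = partition(records, num_workers=3)
result = combine(local_totals(shard) for shard in shards)
```

Because every record with the same key lands in the same shard, each worker can finish its totals without talking to the others, which is the point of keeping data local.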

What kinds of datasets are considered big data?
The uses of big data are almost as varied as they are large. Prominent examples you're probably already familiar with include: social media networks analyzing their members' data to learn more about them and connect them with content and advertising relevant to their interests, or search engines looking at the relationship between queries and results to give better answers to users' questions.

But the potential uses go much further! Two of the largest sources of big data are transactional data, including everything from stock prices to bank data to individual merchants' purchase histories; and sensor data, much of it coming from what is commonly referred to as the Internet of Things (IoT). This sensor data might be anything from measurements taken from robots on an automaker's manufacturing line, to location data on a cellphone network, to instantaneous electrical usage data in homes and businesses, to passenger boarding information taken on a transit system.

By analyzing this data, organizations can identify trends in what they are measuring, as well as in the behavior of the people generating the data. The hope is that this analysis will enable more customized service and greater efficiency in whatever industry the data is collected from.

How is big data analyzed?
One of the best-known methods for turning raw data into useful information is MapReduce. MapReduce is a method for taking a large data set and performing computations on it across multiple computers, in parallel. The term refers to the programming model itself and is also often used to refer to an actual implementation of that model.

In essence, MapReduce consists of two parts. The Map function does sorting and filtering, taking data and placing it into categories so that it can be analyzed. The Reduce function then provides a summary of this data by combining the contents of each category. While largely credited to research that took place at Google, MapReduce is now a generic term and refers to a general model used by many technologies.
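The two parts are easier to see in a small example. The sketch below counts words with a map step and a reduce step, all inside a single Python process. It is a toy illustration of the model, not the implementation used by Google or any framework: the function names (map_phase, shuffle, reduce_phase, map_reduce) and the sample documents are invented here, and the grouping step in the middle is something a real system would handle for you across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group values by key. A real framework does this between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single summary."""
    return key, sum(values)


def map_reduce(documents):
    """Run map over every document, group by word, then reduce each group."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical input: each string stands in for one document.
documents = ["big data is big", "data is everywhere"]
counts = map_reduce(documents)
```

Each call to map_phase only needs its own document, and each call to reduce_phase only needs the values for one word, which is what lets both steps run on many machines at once.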


What tools are used to analyze big data?
Perhaps the most influential and established tool for analyzing big data is Apache Hadoop. Apache Hadoop is a framework for storing and processing data at a large scale, and it is completely open source. Hadoop can run on commodity hardware, making it easy to use with an existing data center, or even to conduct analysis in the cloud. Hadoop is broken into four main parts:

The Hadoop Distributed File System (HDFS), which is a distributed file system designed for very high aggregate bandwidth;
YARN, a platform for managing Hadoop's resources and scheduling programs that will run on the Hadoop infrastructure;
MapReduce, as described above, a model for doing big data processing;
And a common set of libraries for other modules to use.
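To give a feel for what a job on this kind of infrastructure can look like, here is a rough sketch written in the style of Hadoop Streaming, where the mapper and reducer exchange plain text lines of the form key, tab, value, and the framework sorts the mapper's output by key before the reducer sees it. To keep it runnable without a cluster, the mapper and reducer here are ordinary Python functions over lists of lines, and a call to sorted() stands in for Hadoop's sort step. The sensor IDs, readings, and the "sensor_id,reading" input format are made up for illustration.

```python
from itertools import groupby


def mapper(lines):
    """Turn 'sensor_id,reading' lines into 'sensor_id<TAB>reading' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank input lines
        sensor_id, reading = line.split(",")
        yield f"{sensor_id}\t{reading}"


def reducer(sorted_lines):
    """Emit the highest reading per sensor. Input must be sorted by key,
    because groupby only groups adjacent lines with the same key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for sensor_id, group in groupby(parsed, key=lambda kv: kv[0]):
        peak = max(float(value) for _, value in group)
        yield f"{sensor_id}\t{peak}"


# Hypothetical sensor readings, one per line.
raw = ["s1,20.5", "s2,18.0", "s1,22.0", "", "s2,17.5"]
mapped = sorted(mapper(raw))  # stands in for the framework's sort step
peaks = list(reducer(mapped))
```

On a real cluster the same two functions would typically live in separate scripts that read from standard input and write to standard output, with HDFS holding the input and YARN scheduling the work.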
