I think I have a problem. As my iPhone filled up for the Nth time, I realized that I, like many, may be a bit of a digital hoarder. The 32 GB, which seemed huge when I got it, has now been reduced to nothing. Almost every iPhone release increases storage space, and many Android phones come with a slot for a storage card, which makes it even easier to increase space.
I save…well, almost everything. If the phone situation gets dire, I also have a NAS I can keep files on, which is full of goodies from many years past. I can’t remember the last time I went through my computer and just deleted things. Today’s technology makes it easier than ever to be a digital hoarder.
Many organizations have the same tendencies in their data centers. Some of these tendencies come from legal requirements; others are the result of the ever-decreasing price of storage and ever-increasing drive sizes. Many organizations are looking to put archiving strategies in place for untouched data, if they don’t have them already. Some have just held on to data thinking they would be able to use it later.
There’s a whole industry that has sprung from digital hoarding, called Big Data. What is this Big Data anyway? Well, according to our friends at Wikipedia:
This sounds a lot like Big Data is meant to address the digital hoarding problem. When we’re thinking about an archiving strategy, chances are we’re focused on getting our data off our high-performing storage and putting it on something else to free up space for new projects. When we’re just saving anything and everything, perhaps input from a form on a website, we may only really have been concerned with a few things, and just saved the rest for later.
Because we just threw this data under the bed, it really isn’t structured at all. When we think about things we’ve put in databases, we’ve got a pretty good idea of how to query that data and make it useful. Oftentimes, we’ve planned to put together data that relates to each other, and have had a great time reaping the benefits from these datasets.
Big Data deals with what we call unstructured data, the stuff under the bed. We need a new plan of attack to make sense of this type of data. By mining this data, we can find out things like people who order blue t-shirts are more likely to place their orders on Sunday night, so a company may want to send them a coupon on Sunday morning. We may be able to model how molecules we never thought to put together interact, leading us to discover new uses of drugs for various conditions. Big Data opens up a whole new realm of possibilities, and has the potential to provide solutions to problems we haven’t even discovered yet.
Enter MapReduce. MapReduce was created to make sense of these large data sets. It works in a distributed model, spreading processing across many, many CPUs. There are two main functions, Map and Reduce. First, a Map function is run over the input, performing an initial pass that emits intermediate key/value pairs. Those pairs are grouped by key, and then a Reduce function takes each group from the Map output and combines it into a final result.
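To make the Map and Reduce steps a little more concrete, here is a minimal single-machine sketch in Python. It is not how Hadoop or any real framework is implemented; the `map_reduce` helper, the `orders` records, and the mapper and reducer names are all made up for illustration, and a real system would spread this work across many machines. It borrows the t-shirt idea from earlier: count how many orders came in for each color on each day.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy, in-memory MapReduce: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    # Map phase: each record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Grouping ("shuffle") phase: collect values that share a key.
            groups[key].append(value)
    # Reduce phase: collapse each key's list of values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical order records: (shirt color, day the order was placed).
orders = [
    ("blue", "Sunday"),
    ("red", "Monday"),
    ("blue", "Sunday"),
    ("blue", "Tuesday"),
]


def order_mapper(order):
    """Emit ((color, day), 1) for a single order."""
    color, day = order
    yield (color, day), 1


def count_reducer(key, values):
    """Add up the 1s emitted for a given (color, day) key."""
    return sum(values)


order_counts = map_reduce(orders, order_mapper, count_reducer)
print(order_counts)
```

The mapper and reducer never share state, which is the property that lets a real framework run many copies of each on different machines.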
One of the most classic examples of a MapReduce program you will find out there is a word count program. Let’s say I took a copy of The Martian by Andy Weir and a copy of Ender’s Game by Orson Scott Card and uploaded them into my MapReduce environment. First, we would use a Map function to map how many times each word showed up in each novel. In the Reduce function, we would reduce the data to show us how many times each word showed up across both novels. Instead of looking at it as Mars showing up 100 times in The Martian and 3 times in Ender’s Game, we would be looking at Mars showing up 103 times. We could do further analysis to find out how many times words like Rocket, Space, and Star showed up in both novels.
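Here is roughly what that word count could look like as a small Python sketch, run on one machine rather than a cluster. The two short strings are placeholders I made up to stand in for the novels (they are not text from either book), and the function names `map_words`, `shuffle`, and `reduce_counts` are my own, not part of any MapReduce library.

```python
import re
from collections import defaultdict


def map_words(text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in re.findall(r"[a-z']+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Group the emitted counts by word, as a framework would between Map and Reduce."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the counts for each word across all documents."""
    return {word: sum(counts) for word, counts in groups.items()}


# Placeholder snippets standing in for two novels (invented, not real excerpts).
documents = {
    "book_a": "Mars is far. Mars is cold.",
    "book_b": "The fleet never reached Mars.",
}

# Run the Map step over every document, then shuffle and reduce.
pairs = [pair for text in documents.values() for pair in map_words(text)]
word_counts = reduce_counts(shuffle(pairs))

print(word_counts["mars"])
```

Because the Reduce step only sees words and counts, it no longer matters which book a word came from, which is exactly the combined view described above.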
Want more information on how MapReduce works? Check out these links:
What is Map Reduce on TutorialsPoint
MapReduce: Simplified Data Processing on Large Clusters on research.google.com
Write your first MapReduce program in 20 minutes by Michael Nielsen
Word Count – Hadoop Map Reduce Example on Kick Start Hadoop
We’ve been talking about MapReduce, but we haven’t mentioned any of the names we’re used to hearing when we talk about Big Data. One of the names we keep running across is Apache Hadoop, an open source implementation of MapReduce, along with HDFS, the Hadoop Distributed File System. As with most other open source options, we can obtain Hadoop from Apache directly and roll our own, or buy a distribution like Cloudera. A Hadoop distribution can definitely have its advantages, with professional services and support around an implementation to help make it enterprise ready quickly.
The MapReduce use case, the need to run a query on a dataset while leveraging a distributed computing environment, lends itself perfectly to the cloud, whether it be public, private, or hybrid. If you were looking to do things in house, you could automate the deployment of MapReduce nodes on hypervisors during their quiet periods. If you don’t want to deal with it at all, you can easily leverage the public cloud. Amazon offers Amazon Elastic MapReduce, or Amazon EMR. This type of bursty workload is practically what gave birth to cloud computing.
Big Data is a big challenge, with a big future. As organizations make the move to leverage the data they have lying around thanks to their digital hoarding practices, Big Data strategies will become standard in IT organizations. As with many other popular open source initiatives, it is still early for many organizations. Many are just beginning to dip their toes in the water with Big Data. In the same way that cloud computing went from something in a news story to heavily adopted, Big Data will be making big news in the future.