February 25th, 2015 01:00
When storage meets big data
Big data brings significant challenges to storage management; however, effective integration of storage with Hadoop can address many of these problems.
Detailed Information
Problems with big data:
When petabytes of data are spread across many distributed nodes with local storage, several problems arise: data becomes duplicated, and the ETL phase of a data warehousing project generates a great deal of data movement across the network. In addition, traditional data management applications may not be capable of managing, optimizing and sustaining a big data infrastructure at this scale.
Data storage in Hadoop:
The best way to address these problems is to provide a new way of accessing data, such as the Hadoop Distributed File System (HDFS).
Hadoop is a highly scalable analytics platform designed to execute distributed processing with as little latency as possible. Its two most important parts are HDFS, which stores data in files across a cluster of servers, and MapReduce, a programming framework for building parallel applications that run on that data. In simple terms, MapReduce manages distributed computation, while HDFS manages data placement in the cluster. When a job is submitted, MapReduce splits it into sub-tasks for parallel processing, looks up in HDFS where the data for each sub-task resides, and dispatches the sub-tasks to the compute nodes holding that data. In essence, the computation is sent to the data; each node returns its partial result to the MapReduce master, which collects the results and delivers the final output.
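To make this division of labour concrete, here is a minimal word-count job written against the standard Hadoop MapReduce Java API: the mapper runs on the nodes holding the input blocks, and the reducer aggregates the partial results. The HDFS input and output paths are placeholders, not paths from any particular cluster.

```java
// Minimal word-count job over data stored in HDFS (sketch; paths are placeholders).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs on the nodes holding the input blocks and
  // emits a (word, 1) pair for each word in its split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sums the counts for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // HDFS input and output paths (placeholders).
    FileInputFormat.addInputPath(job, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```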
In a Hadoop cluster, the MapReduce compute nodes and the HDFS storage layer typically reside on the same group of nodes. This configuration allows the framework to schedule tasks on the nodes where the data already exists, avoiding the network bottlenecks that would be caused by moving data between cluster nodes.
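The locality information the scheduler relies on is visible through the HDFS client API: for any file, you can ask which DataNodes hold each of its blocks. A small sketch follows; the file path is a placeholder.

```java
// Sketch: inspect which DataNodes hold the blocks of an HDFS file.
// The file path is a placeholder for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FileStatus status = fs.getFileStatus(new Path("/data/input/part-00000"));

    // One BlockLocation per HDFS block; getHosts() lists the nodes that
    // store a replica of that block, which is exactly the information
    // the scheduler uses to place map tasks next to the data.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```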
HDFS on Isilon:
HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. HDFS is built to support applications with large data sets, including individual files that reach into the terabytes. It uses a master/slave architecture, with each cluster consisting of a single NameNode that manages file system operations and supporting DataNodes that manage data storage on individual compute nodes.
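Applications do not have to go through MapReduce to use this architecture: a client contacts the NameNode for metadata and then streams file data to and from the DataNodes directly. Below is a minimal sketch using the Hadoop FileSystem Java API; the path is a placeholder.

```java
// Sketch: write and read a file through the HDFS client API.
// The client asks the NameNode for metadata and block locations,
// then streams the data to/from the DataNodes. Path is a placeholder.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/example.txt");

    // Write: the NameNode allocates blocks, the client streams to DataNodes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the NameNode returns block locations, the client reads from DataNodes.
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }
    fs.close();
  }
}
```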
HDFS can leverage these properties to create a fully object-addressable environment. Object storage is highly scalable and well suited to big data: it can be deployed on distributed nodes while still maintaining data consistency across the distributed network, and instead of operating at the block level it uses flexibly sized data containers. Because ETL can be eliminated, pressure on the LAN and SAN is also reduced.
A scale-out NAS system is well suited to big data analytics: it can scale both throughput and capacity, and it supports object storage, which helps with the unstructured nature of big data sets. EMC's Isilon scale-out NAS platform provides native support for Hadoop and adds features such as a distributed NameNode, data protection through snapshots and NDMP backups, and multi-protocol support. An Isilon cluster can serve multiple workloads, making it a practical way to evaluate Hadoop solutions without a large, expensive dedicated deployment. The scale-out capabilities of EMC Isilon and its support for HDFS allow performance to be optimized by distributing I/O across multiple controller nodes. Most importantly, data can stay in place, because it does not have to be moved for big data analytics processing.
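To illustrate what native HDFS support means in practice: an ordinary Hadoop client simply points its default file system at the Isilon cluster and keeps using the same FileSystem API. The sketch below is illustrative only; the SmartConnect hostname, port and path are hypothetical, and in a real deployment fs.defaultFS would normally be set in core-site.xml rather than in code.

```java
// Sketch: point a Hadoop client at an Isilon cluster that exposes HDFS.
// Hostname, port and path are hypothetical placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IsilonHdfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // NameNode requests are answered by whichever Isilon node the
    // SmartConnect name resolves to, so there is no single NameNode to lose.
    conf.set("fs.defaultFS", "hdfs://isilon-smartconnect.example.com:8020");

    try (FileSystem fs = FileSystem.get(conf)) {
      // Same HDFS API as before: existing data on the Isilon cluster is
      // analyzed in place, without copying it into a separate Hadoop silo.
      for (FileStatus status : fs.listStatus(new Path("/"))) {
        System.out.println(status.getPath() + "\t" + status.getLen());
      }
    }
  }
}
```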
As a result, customers get the Hadoop analytics engine and native HDFS integration with an enterprise storage platform, as well as advanced data protection and efficiency services such as backups, snapshots, replication and deduplication.
Author: Jiawen Zhang