Hadoop Introduction

Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores, and to distribute large amounts of work efficiently across that entire set of machines.

How large an amount of work? Orders of magnitude larger than many existing systems can handle. Hundreds of gigabytes of data constitute the low end of Hadoop-scale: Hadoop is built to process "web-scale" data ranging from hundreds of gigabytes up to terabytes or petabytes. At this scale, the input data set will likely not even fit on a single computer's hard drive, much less in memory. Hadoop therefore includes a distributed file system that breaks the input data into pieces and sends fractions of the original data to several machines in your cluster to hold. The problem can then be processed in parallel using all of the machines in the cluster, computing the output results as efficiently as possible.
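The splitting idea can be sketched in a few lines: a file is divided into fixed-size blocks, and each block is assigned to several machines for safekeeping. This is a minimal illustration, not Hadoop's actual placement logic; the block size and node names are assumptions (HDFS classically defaults to 64 MB blocks and 3 replicas).

```python
# Sketch: splitting a large file into fixed-size blocks and assigning
# each block to several machines. Block size, replication factor, and
# node names are illustrative assumptions, not real HDFS internals.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default
REPLICATION = 3
NODES = ["node%02d" % i for i in range(10)]

def split_into_blocks(file_size):
    """Number of blocks needed for a file of file_size bytes (ceiling)."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_blocks(file_size):
    """Assign each block to REPLICATION distinct nodes (naive round-robin)."""
    placements = []
    for b in range(split_into_blocks(file_size)):
        replicas = [NODES[(b + r) % len(NODES)] for r in range(REPLICATION)]
        placements.append(replicas)
    return placements

# A 1 GB file becomes 16 blocks of 64 MB, each stored on 3 machines.
plan = place_blocks(1024 * 1024 * 1024)
print(len(plan))    # 16
print(plan[0])      # ['node00', 'node01', 'node02']
```

Because every block lives on several machines, each machine can process the blocks it holds locally, which is what makes parallel processing of the whole file possible.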

Challenges at Large Scale

Performing large-scale computation is difficult. Working with this volume of data requires distributing parts of the problem across multiple machines to handle in parallel. Whenever multiple machines are used in cooperation with one another, the probability of failure rises. In a single-machine environment, failure is not something program designers explicitly worry about very often: if the machine has crashed, there is no way for the program to recover anyway.

In a distributed environment, however, partial failures are an expected and common occurrence. Networks can experience partial or total failure if switches and routers break down. Data may not arrive at a particular point in time due to unexpected network congestion. Individual compute nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space. Data may be corrupted, or maliciously or improperly transmitted. Multiple implementations or versions of client software may speak slightly different protocols from one another. Clocks may become desynchronized, lock files may not be released, parties involved in distributed atomic transactions may lose their network connections part-way through, etc. In each of these cases, the rest of the distributed system should be able to recover from the component failure or transient error condition and continue to make progress. Of course, actually providing such resilience is a major software engineering challenge.

Different distributed systems specifically address certain modes of failure, while worrying less about others. Hadoop provides no security model, nor safeguards against maliciously inserted data. For example, it cannot detect a man-in-the-middle attack between nodes. On the other hand, it is designed to handle hardware failure and data congestion issues very robustly. Other distributed systems make different trade-offs, as they intend to be used for problems with other requirements (e.g., high security).

In addition to these sorts of bugs and challenges, the compute hardware has only finite resources available to it. The major resources include:

* Processor time
* Memory
* Hard drive space
* Network bandwidth

Individual machines typically only have a few gigabytes of memory. If the input data set is several terabytes, then this would require a thousand or more machines to hold it in RAM -- and even then, no single machine would be able to process or address all of the data.
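The "thousand or more machines" claim is easy to check with back-of-the-envelope arithmetic. The figures below are illustrative assumptions (a 4 TB data set, 4 GB of RAM per machine), not measurements:

```python
# Back-of-the-envelope check: how many machines are needed to hold a
# multi-terabyte data set entirely in RAM? Numbers are assumptions.

TB = 1024 ** 4
GB = 1024 ** 3

def machines_to_hold_in_ram(dataset_bytes, ram_per_machine_bytes):
    """Ceiling division: machines required to hold the data set in memory."""
    return -(-dataset_bytes // ram_per_machine_bytes)

# 4 TB of input, 4 GB of RAM per machine:
print(machines_to_hold_in_ram(4 * TB, 4 * GB))   # 1024 machines
```

And even with a thousand machines holding the data, no single one of them could address all of it at once, which is why the data must stay distributed.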

Hard drives are much larger; a single machine can now hold multiple terabytes of information on its hard drives. But the intermediate data sets generated while performing a large-scale computation can easily fill several times more space than the original input data set occupied. During this process, some of the hard drives employed by the system may become full, and the distributed system may need to route the data to other nodes that can store the overflow.

Finally, bandwidth is a scarce resource even on an internal network. While a set of nodes directly connected by gigabit Ethernet may generally experience high throughput between them, if all of the machines transmit multi-gigabyte data sets they can easily saturate the switch's bandwidth capacity. Additionally, if the machines are spread across multiple racks, the bandwidth available for the transfer is much lower, and RPC requests and other data transfers using this channel may be delayed or dropped.
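A rough estimate shows how quickly a shared switch saturates. All numbers below are illustrative assumptions (20 machines, 10 GB each, a 10 Gbit/s switch backplane), not benchmarks:

```python
# Rough estimate: time for N machines to each push a multi-gigabyte
# data set through one shared switch. All figures are assumptions.

GBIT_PER_SEC = 1e9 / 8          # ~125 MB/s of payload per gigabit

def transfer_seconds(bytes_per_machine, n_machines, switch_bytes_per_sec):
    """Total transfer time if the switch backplane is the bottleneck."""
    total_bytes = bytes_per_machine * n_machines
    return total_bytes / switch_bytes_per_sec

# 20 machines each moving 10 GB through a 10 Gbit/s switch backplane:
secs = transfer_seconds(10 * 1e9, 20, 10 * GBIT_PER_SEC)
print(round(secs))   # 160 seconds just moving data, before any computation
```

In practice contention, protocol overhead, and cross-rack links make the real figure worse, which is why moving computation to the data, rather than data to the computation, matters so much at this scale.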

To be successful, a large-scale distributed system must be able to manage the resources mentioned above efficiently. Furthermore, it must allocate some of these resources toward maintaining the system as a whole, while devoting as much time as possible to the actual core computation.

Synchronization between multiple machines remains the biggest challenge in distributed system design. If nodes in a distributed system can explicitly communicate with one another, then application designers must be cognizant of the risks associated with such communication patterns. It becomes very easy to generate more remote procedure calls (RPCs) than the system can satisfy! Performing multi-party data exchanges is also prone to deadlock and race conditions. Finally, the ability to continue computation in the face of failures becomes more challenging. For example, if 100 nodes are present in a system and one of them crashes, the other 99 nodes should be able to continue the computation, ideally with only a small penalty proportionate to the loss of 1% of the computing power. Of course, this will require re-computing any work lost on the unavailable node. Furthermore, if a complex communication network is overlaid on the distributed infrastructure, then determining how best to restart the lost computation, and propagating information about the change in network topology, may be nontrivial to implement.
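The 100-node recovery scenario above can be sketched concretely: tasks are spread across the cluster, one node fails, and its tasks are redistributed among the survivors so no work is lost. The round-robin scheduling policy here is a naive assumption for illustration, not how a real scheduler works:

```python
# Sketch of the recovery scenario: 1000 tasks on 100 nodes, one node
# crashes, its tasks are redone by the 99 survivors. The round-robin
# policy and node names are illustrative assumptions.

def assign(tasks, nodes):
    """Distribute tasks across nodes round-robin."""
    plan = {n: [] for n in nodes}
    for i, t in enumerate(tasks):
        plan[nodes[i % len(nodes)]].append(t)
    return plan

nodes = ["node%03d" % i for i in range(100)]
tasks = list(range(1000))
plan = assign(tasks, nodes)

# node042 crashes: its work must be re-run on the remaining 99 nodes.
failed = "node042"
orphaned = plan.pop(failed)
survivors = list(plan)
for i, t in enumerate(orphaned):
    plan[survivors[i % len(survivors)]].append(t)

print(len(orphaned))                        # 10 tasks to re-execute
print(sum(len(v) for v in plan.values()))   # 1000: no work is lost
```

Losing one node orphans only 1% of the tasks, so the penalty is roughly proportional to the lost computing power, which is exactly the property the passage above describes.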
