Hadoop Introduction

Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores, and to distribute large amounts of work efficiently across that entire set of machines.

How large an amount of work? Orders of magnitude larger than many existing systems can handle. Hundreds of gigabytes of data constitute the low end of Hadoop-scale: Hadoop is built to process "web-scale" data ranging from hundreds of gigabytes up to terabytes or petabytes. At this scale, the input data set will likely not even fit on a single computer's hard drive, much less in memory. Hadoop therefore includes a distributed file system that breaks the input data into pieces and sends fractions of the original data to several machines in your cluster to hold. The problem can then be processed in parallel using all of the machines in the cluster, computing the output results as efficiently as possible.
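The splitting idea can be sketched in a few lines: a file is divided into fixed-size blocks, and each block is assigned to several machines for safekeeping. This is a minimal illustration, not Hadoop's actual placement logic; the block size and node names are assumptions (HDFS classically defaults to 64 MB blocks and 3 replicas).

```python
# Sketch: splitting a large file into fixed-size blocks and assigning
# each block to several machines. Block size, replication factor, and
# node names are illustrative assumptions, not real HDFS internals.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default
REPLICATION = 3
NODES = ["node%02d" % i for i in range(10)]

def split_into_blocks(file_size):
    """Number of blocks needed for a file of file_size bytes (ceiling)."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def place_blocks(file_size):
    """Assign each block to REPLICATION distinct nodes (naive round-robin)."""
    placements = []
    for b in range(split_into_blocks(file_size)):
        replicas = [NODES[(b + r) % len(NODES)] for r in range(REPLICATION)]
        placements.append(replicas)
    return placements

# A 1 GB file becomes 16 blocks of 64 MB, each stored on 3 machines.
plan = place_blocks(1024 * 1024 * 1024)
print(len(plan))    # 16
print(plan[0])      # ['node00', 'node01', 'node02']
```

Because every block lives on several machines, each machine can process the blocks it holds locally, which is what makes parallel processing of the whole file possible.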

Challenges at Large Scale

Performing large-scale computation is difficult. Working with this volume of data requires distributing parts of the problem across multiple machines to handle in parallel. Whenever multiple machines are used in cooperation with one another, the probability of failure rises. In a single-machine environment, failure is not something program designers explicitly worry about very often: if the machine has crashed, there is no way for the program to recover anyway.

In a distributed environment, however, partial failures are an expected and common occurrence. Networks can experience partial or total failure if switches and routers break down. Data may not arrive at a particular point in time due to unexpected network congestion. Individual compute nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space. Data may be corrupted, or maliciously or improperly transmitted. Multiple implementations or versions of client software may speak slightly different protocols from one another. Clocks may become desynchronized, lock files may not be released, parties involved in distributed atomic transactions may lose their network connections part-way through, etc. In each of these cases, the rest of the distributed system should be able to recover from the component failure or transient error condition and continue to make progress. Of course, actually providing such resilience is a major software engineering challenge.

Different distributed systems specifically address certain modes of failure, while worrying less about others. Hadoop provides no security model, nor safeguards against maliciously inserted data. For example, it cannot detect a man-in-the-middle attack between nodes. On the other hand, it is designed to handle hardware failure and data congestion issues very robustly. Other distributed systems make different trade-offs, as they intend to be used for problems with other requirements (e.g., high security).

In addition to these sorts of bugs and challenges, the compute hardware has only finite resources available to it. The major resources include:

* Processor time
* Memory
* Hard drive space
* Network bandwidth

Individual machines typically only have a few gigabytes of memory. If the input data set is several terabytes, then this would require a thousand or more machines to hold it in RAM -- and even then, no single machine would be able to process or address all of the data.
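The "thousand or more machines" claim is easy to check with back-of-the-envelope arithmetic. The figures below are illustrative assumptions (a 4 TB data set, 4 GB of RAM per machine), not measurements:

```python
# Back-of-the-envelope check: how many machines are needed to hold a
# multi-terabyte data set entirely in RAM? Numbers are assumptions.

TB = 1024 ** 4
GB = 1024 ** 3

def machines_to_hold_in_ram(dataset_bytes, ram_per_machine_bytes):
    """Ceiling division: machines required to hold the data set in memory."""
    return -(-dataset_bytes // ram_per_machine_bytes)

# 4 TB of input, 4 GB of RAM per machine:
print(machines_to_hold_in_ram(4 * TB, 4 * GB))   # 1024 machines
```

And even with a thousand machines holding the data, no single one of them could address all of it at once, which is why the data must stay distributed.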

Hard drives are much larger; a single machine can now hold multiple terabytes of information on its hard drives. But the intermediate data sets generated while performing a large-scale computation can easily fill several times more space than the original input data set occupied. During this process, some of the hard drives employed by the system may become full, and the distributed system may need to route the data to other nodes that can store the overflow.

Finally, bandwidth is a scarce resource even on an internal network. While a set of nodes directly connected by gigabit Ethernet may generally experience high throughput between them, if all of the machines transmit multi-gigabyte data sets they can easily saturate the switch's bandwidth capacity. Additionally, if the machines are spread across multiple racks, the bandwidth available for the transfer is much lower, and RPC requests and other data transfers using this channel may be delayed or dropped.
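A rough estimate shows how quickly a shared switch saturates. All numbers below are illustrative assumptions (20 machines, 10 GB each, a 10 Gbit/s switch backplane), not benchmarks:

```python
# Rough estimate: time for N machines to each push a multi-gigabyte
# data set through one shared switch. All figures are assumptions.

GBIT_PER_SEC = 1e9 / 8          # ~125 MB/s of payload per gigabit

def transfer_seconds(bytes_per_machine, n_machines, switch_bytes_per_sec):
    """Total transfer time if the switch backplane is the bottleneck."""
    total_bytes = bytes_per_machine * n_machines
    return total_bytes / switch_bytes_per_sec

# 20 machines each moving 10 GB through a 10 Gbit/s switch backplane:
secs = transfer_seconds(10 * 1e9, 20, 10 * GBIT_PER_SEC)
print(round(secs))   # 160 seconds just moving data, before any computation
```

In practice contention, protocol overhead, and cross-rack links make the real figure worse, which is why moving computation to the data, rather than data to the computation, matters so much at this scale.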

To be successful, a large-scale distributed system must be able to manage the resources mentioned above efficiently. Furthermore, it must allocate some of these resources toward maintaining the system as a whole, while devoting as much time as possible to the actual core computation.

Synchronization between multiple machines remains the biggest challenge in distributed system design. If nodes in a distributed system can explicitly communicate with one another, then application designers must be cognizant of the risks associated with such communication patterns. It becomes very easy to generate more remote procedure calls (RPCs) than the system can satisfy! Performing multi-party data exchanges is also prone to deadlock and race conditions. Finally, the ability to continue computation in the face of failures becomes more challenging. For example, if 100 nodes are present in a system and one of them crashes, the other 99 nodes should be able to continue the computation, ideally with only a small penalty proportionate to the loss of 1% of the computing power. Of course, this will require re-computing any work lost on the unavailable node. Furthermore, if a complex communication network is overlaid on the distributed infrastructure, then determining how best to restart the lost computation, and propagating information about the change in network topology, may be nontrivial to implement.
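The 100-node recovery scenario above can be sketched concretely: tasks are spread across the cluster, one node fails, and its tasks are redistributed among the survivors so no work is lost. The round-robin scheduling policy here is a naive assumption for illustration, not how a real scheduler works:

```python
# Sketch of the recovery scenario: 1000 tasks on 100 nodes, one node
# crashes, its tasks are redone by the 99 survivors. The round-robin
# policy and node names are illustrative assumptions.

def assign(tasks, nodes):
    """Distribute tasks across nodes round-robin."""
    plan = {n: [] for n in nodes}
    for i, t in enumerate(tasks):
        plan[nodes[i % len(nodes)]].append(t)
    return plan

nodes = ["node%03d" % i for i in range(100)]
tasks = list(range(1000))
plan = assign(tasks, nodes)

# node042 crashes: its work must be re-run on the remaining 99 nodes.
failed = "node042"
orphaned = plan.pop(failed)
survivors = list(plan)
for i, t in enumerate(orphaned):
    plan[survivors[i % len(survivors)]].append(t)

print(len(orphaned))                        # 10 tasks to re-execute
print(sum(len(v) for v in plan.values()))   # 1000: no work is lost
```

Losing one node orphans only 1% of the tasks, so the penalty is roughly proportional to the lost computing power, which is exactly the property the passage above describes.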
