How to recover from CORRUPT HDFS state


    We have a Hadoop cluster with 3 login nodes and 10 data nodes, running Hadoop 2.7.1 with HBase 0.94.23. Both Hadoop and HBase run on ln02 (login node 2). We are facing a terrible issue with this cluster: a large number of files in HDFS are in a corrupt state. We are unable to figure out what caused this mass corruption or how to recover from it. HDFS holds 40 TB of data, and we are worried that we might have to rebuild the cluster from scratch because of these errors. The cluster had some file system issues recently; below is the list of events that took place beforehand.
    • Nov 30 - The SSD drives on ln02 died, which triggered a kernel panic and a reboot.
    • Dec 20 - The ln02 file system went read-only and both drives on ln02 died. The sysadmin removed and reinstalled the SSD drives on ln02 and rebooted, and the node came back up. One data node was also down the same day due to a disk failure.
    • Dec 21 - The same thing happened as on Dec 20th and ln02 was rebooted. The sysadmin replaced the failed SSD with another SSD. Another data node was down the same day.
    On Nov 30th and Dec 20th, after the sysadmin rebooted the node, I was able to restart Hadoop and HBase without any issues and everything worked as expected. But on Dec 21st, when I restarted Hadoop, it automatically switched to safe mode, and the hadoop fsck command showed a lot of corrupt and missing files. The fsck output is below.
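For anyone trying to reproduce this check, these are the kinds of fsck invocations involved. `/` scans the whole namespace; `/hbase` is just an example path (our HBase root directory), and all flags shown are standard HDFS fsck options:

```shell
# Overall health report for the whole namespace (read-only; safe to run).
hdfs fsck /

# List only the files that have corrupt blocks.
hdfs fsck / -list-corruptfileblocks

# Detailed view for one subtree: per-file block list plus replica locations.
hdfs fsck /hbase -files -blocks -locations
```

These commands only read namenode metadata, so they are safe to run repeatedly while diagnosing.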

    [Attached image: fsck.png (fsck output)]
    The HDFS web UI shows the message below.

    Safe mode is ON. The reported blocks 391254 needs additional 412774 blocks to reach the threshold 0.9990 of total blocks 804832. The number of live datanodes 10 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
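    The numbers in that message are internally consistent: the namenode stays in safe mode until the reported block count reaches ceil(threshold × total blocks). A quick check of the arithmetic, using the numbers straight from the message:

```shell
# Safe mode exits when: reported >= ceil(threshold * total).
total=804832; reported=391254; threshold=0.9990
needed=$(awk -v t="$total" -v r="$reported" -v th="$threshold" 'BEGIN {
  goal = t * th                                          # 804027.168
  goal = (goal > int(goal)) ? int(goal) + 1 : int(goal)  # ceiling -> 804028
  print goal - r
}')
echo "$needed"   # prints 412774, matching "needs additional 412774 blocks"
```

    So the web UI is saying that more than half of all blocks are unreported, which lines up with the mass corruption fsck found.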

    We experienced some data nodes showing Input/output errors intermittently as well.

    Has anyone experienced a situation like this before? Any ideas on how to recover would be greatly appreciated.
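For context, the sequence usually suggested for this situation looks like the sketch below. Note the destructive flags: `-move` quarantines affected files into /lost+found, while `-delete` removes them permanently, so neither should be run before confirming that no replica of the missing blocks survives on any data node:

```shell
# 1. Inspect the damage first (read-only).
hdfs fsck / -list-corruptfileblocks

# 2. If the missing blocks are truly gone (all replicas lost), leave safe mode
#    manually, since the 0.999 threshold can never be reached otherwise.
hdfs dfsadmin -safemode leave

# 3. Either quarantine the affected files into /lost+found ...
hdfs fsck / -move

# ... or delete them outright (irreversible).
# hdfs fsck / -delete
```

I would still like to hear from anyone who has recovered data at this scale before resorting to step 3.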