Hadoop has a distributed master-slave architecture. In our case the cluster consisted of one NameNode, one Secondary NameNode, and eight DataNodes.
We were using Hadoop to handle a large amount of streaming data coming from smartphones for one of the leading telecom companies in India, on an eight-node cluster running Cloudera CDH 5.3.0.
The project was at a critical stage when we hit a problem: the files “/dfs/dn/current/Bp-12345-IpAddress-123456789/dncp-block-verification.log.curr” and “dncp-block-verification.log.prev” kept growing to hundreds of GBs within hours, which slowed the machine down and eventually caused DataNode service outages.
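A quick way to spot the runaway logs is to scan the DataNode data directory for oversized verification files. The sketch below assumes the data directory from the path above; match it to your own `dfs.datanode.data.dir` setting.

```shell
# find_scanner_logs: list block-scanner verification logs over 1 GB
# under the given DataNode data directory.
# The /dfs/dn default is an assumption; adjust for your cluster.
find_scanner_logs() {
  dir="${1:-/dfs/dn}"
  [ -d "$dir/current" ] || return 0
  # scanner logs live under each block pool directory
  find "$dir/current" -name 'dncp*verification.log.*' -size +1G -print
}

find_scanner_logs /dfs/dn
```

Running this periodically (e.g. from cron) on each DataNode gives early warning before the disk fills up.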
It turned out to be an HDFS bug in the DataNode block scanner (HDFS-7430), and there was no documented guidance on how to resolve it. After a good discussion with the Hadoop experts at Ellicium, I was able to work out a fix.
There were two options to resolve it.
Option 1 – Stop the DataNode service and delete the dncp block verification files manually. Implementing this would require continuous monitoring, because the log files can start growing again on any of the DataNodes (even on the same node after deletion).
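Option 1 can be sketched as the following per-node procedure. The service name and data directory are assumptions for a typical CDH 5.3 package install; adjust both for your cluster.

```shell
# remove_scanner_logs: delete the .curr and .prev block-scanner
# verification logs in every block pool under the given data dir.
# The /dfs/dn default is an assumption (dfs.datanode.data.dir).
remove_scanner_logs() {
  dir="${1:-/dfs/dn}"
  [ -d "$dir/current" ] || return 0
  find "$dir/current" -name 'dncp*verification.log.*' -delete
}

# 1. Stop the DataNode role first (via Cloudera Manager, or e.g.):
#    sudo service hadoop-hdfs-datanode stop
remove_scanner_logs /dfs/dn
# 2. Restart the DataNode role:
#    sudo service hadoop-hdfs-datanode start
```

Deleting the files while the DataNode is still running risks the process re-creating or holding them open, which is why the stop/start bracketing matters.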
Option 2 – A slightly more drastic approach: disable the block scanner entirely by setting dfs.datanode.scan.period.hours to 0 in the DataNode's HDFS configuration (the default is 504 hours, i.e. three weeks). The downside is that DataNodes would no longer automatically detect corrupted block files.
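For reference, Option 2 would have been a small change to hdfs-site.xml on each DataNode (a sketch; the behavior of 0 disabling the scanner is as described above for this CDH 5.3 setup):

```xml
<!-- hdfs-site.xml on each DataNode: disable the block scanner.
     Default is 504 hours (3 weeks); 0 turns scanning off here. -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>0</value>
</property>
```

The change takes effect after a DataNode restart, so it still requires a brief service interruption per node.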
After weighing the pros and cons, we went ahead with Option 1. After implementation, the service was up and running as expected. Hopefully the issue will be fixed for good in an upcoming CDH 5.4.x release.
It was a big relief, and I felt really proud that the fix saved the cluster a lot of downtime.