How to resolve a clock offset issue?
Organizations that have embarked on a Big Data journey know how vital it is to keep Hadoop clusters up and running. To meet the expected SLAs/OLAs and to serve internal and external customers, it is of the utmost importance to keep Hadoop clusters in good health.
Hence, you cannot ignore even a small or simple issue like clock offset. Its ripple effect could have dire consequences, disrupting all services.
Ellicium provides Big Data managed services to multiple clients. A few months back, while working for one of our clients, we faced just such a clock offset issue. It was a large Cloudera cluster running on CentOS, and the issue was impacting the health of the cluster; consequently, various projects of our client were also affected.
For those who are not aware, let’s first understand why we require time synchronization.
Why do we require time synchronization?
Hadoop has a master-slave architecture, wherein each slave node sends regular heartbeat signals to the master node regarding its health. Hence, it is important that all the machines in the cluster are time-synchronized and refer to the same time.
Synchronization is the bridge between master and slave for health updates. Broken synchronization means the master gets no reliable updates about the health of its slaves.
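To see how a clock offset can masquerade as a health problem, consider a deliberately simplified model. This is a toy sketch, not Hadoop's or Cloudera Manager's actual logic: the `Heartbeat` class, the `looks_stale` function, and the 30-second timeout are all hypothetical. It assumes the master judges freshness by comparing a timestamp taken from the sender's clock against its own clock.

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT_S = 30.0  # hypothetical staleness threshold, in seconds


@dataclass
class Heartbeat:
    node: str
    sent_at: float  # timestamp read from the SENDER's clock, in seconds


def looks_stale(hb: Heartbeat, master_now: float,
                timeout_s: float = HEARTBEAT_TIMEOUT_S) -> bool:
    """Return True if the heartbeat appears older than the timeout
    when measured against the master's clock."""
    return (master_now - hb.sent_at) > timeout_s


master_now = 1000.0

# Both nodes sent their heartbeat 5 seconds ago in real time.
in_sync = Heartbeat("worker-1", sent_at=995.0)
# worker-2's clock runs 60 seconds behind, so its timestamp looks old.
skewed = Heartbeat("worker-2", sent_at=995.0 - 60.0)

print(looks_stale(in_sync, master_now))  # False
print(looks_stale(skewed, master_now))   # True
```

In this model both nodes are equally healthy, yet the one with the skewed clock is flagged, which is the kind of false alarm that time synchronization is meant to prevent.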
The most effective way to achieve this is to synchronize the time of all the cluster machines to a common NTP server.
Before going further, let’s first understand what NTP is.
What is NTP?
Network Time Protocol (NTP) is a networking protocol for clock synchronization between computer systems over packet-switched, variable-latency data networks. NTP is intended to synchronize all participating computers to within a few milliseconds of Coordinated Universal Time (UTC). NTP can usually maintain time within tens of milliseconds over the public internet and can achieve better than one-millisecond accuracy on a LAN under ideal conditions. There are many reference NTP servers available on the internet that you can use for synchronization.
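The "offset" that NTP reports comes from four timestamps exchanged between a client and a server. The sketch below shows the standard NTP on-wire calculation; the function name and the example numbers are illustrative assumptions, and a real NTP daemon adds filtering and peer selection on top of this.

```python
def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float):
    """Compute clock offset and round-trip delay from one NTP exchange.

    t1: client clock when the request leaves the client
    t2: server clock when the request arrives
    t3: server clock when the reply leaves the server
    t4: client clock when the reply arrives
    All values are in seconds.
    """
    # Positive offset means the client clock is behind the server clock.
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    # Round-trip network delay, excluding the server's processing time.
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


# Illustrative exchange: the client runs 50 ms behind the server,
# each network leg takes 10 ms, and the server needs 2 ms to reply.
offset, delay = ntp_offset_and_delay(99.950, 100.010, 100.012, 99.972)
print(f"offset: {offset * 1000:.1f} ms")  # offset: 50.0 ms
print(f"delay:  {delay * 1000:.1f} ms")   # delay:  20.0 ms
```

The calculation assumes the two network legs take roughly the same time, which is why NTP does better on a LAN than over the public internet.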
The most common way to resolve a clock offset is to sync with reference NTP servers. But our client's security policies did not permit us to connect to an external server for time synchronization. Hence, we came up with a workaround.
We decided to use one of our master servers as a reference server for time synchronization for all the other machines in the cluster.
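In configuration terms, this decision amounts to a small change on each of the other machines: disable whatever upstream time sources are listed and add a single entry for the chosen master. The following is a minimal sketch of that idea, not the exact procedure we ran. It assumes the entries to disable are the existing `server` and `pool` lines, works on text rather than on /etc/ntp.conf directly, and uses a hypothetical function name and a hypothetical sample file.

```python
import re

REFERENCE_HOST = "namenode"  # hostname of the reference server; adjust to your cluster

# Matches active (uncommented) "server ..." or "pool ..." entries.
_SOURCE_LINE = re.compile(r"^\s*(server|pool)\s+\S+")


def point_at_reference(conf_text: str, reference_host: str = REFERENCE_HOST) -> str:
    """Return ntp.conf text with existing time sources commented out
    and one entry added for the internal reference server."""
    out = []
    for line in conf_text.splitlines():
        if _SOURCE_LINE.match(line):
            out.append("# " + line)  # disable the existing upstream source
        else:
            out.append(line)         # keep every other line untouched
    out.append("# Use our own NTP server, which is our name node.")
    out.append(f"server {reference_host} iburst")
    return "\n".join(out) + "\n"


# Hypothetical sample input, for illustration only.
sample = (
    "driftfile /var/lib/ntp/drift\n"
    "server 0.centos.pool.ntp.org iburst\n"
    "server 1.centos.pool.ntp.org iburst\n"
)
print(point_at_reference(sample))
```

Because the function only transforms a string, it can be tried safely on a copy of the file before anything under /etc is changed; the NTP service still has to be restarted for a new configuration to take effect.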
Below are the steps, in brief, that we followed:
1. Configure the NTP server on all machines using the YUM command
2. Select the name node as the reference NTP server for the other machines
3. Edit the /etc/ntp.conf file and comment out the below:
4. Repeat the same thing on all machines
5. Edit the /etc/ntp.conf file on the selected reference NTP server (the name node, in our case) and add the below:
#Use our own NTP Server which is our name node server.
server namenode iburst
# server 127.127.1.0 # local clock
Note: iburst is an optional setting. When the NTP server is unreachable, iburst mode sends a burst of eight packets instead of the usual one, so that time synchronization starts sooner once the server responds.
6. Verify the status of NTP using the below command on the name node:
# ntpq -p
The output should be similar to the below:
remote refid st t when poll reach delay offset jitter
*elserver1 18.104.22.168 3 u 300 1024 377 1.225 -0.071 4.606
7. Repeat steps 5 and 6 on all the remaining machines and verify the output
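Checking the `ntpq -p` output by eye on every machine gets tedious on a large cluster, so the verification can be scripted. The sketch below is one possible approach, under stated assumptions: the function names and the 100 ms tolerance are hypothetical, the peer selected for synchronization is taken to be the line starting with `*`, and the offset is read as the second-to-last column, in milliseconds. Capturing the output from each machine is left out.

```python
from typing import Optional

MAX_OFFSET_MS = 100.0  # hypothetical tolerance; choose one that suits your cluster

# Sample output from the verification step above.
SAMPLE = """\
remote refid st t when poll reach delay offset jitter
*elserver1 18.104.22.168 3 u 300 1024 377 1.225 -0.071 4.606
"""


def selected_peer_offset_ms(ntpq_output: str) -> Optional[float]:
    """Return the offset (ms) of the peer marked '*', or None if no
    peer has been selected for synchronization yet."""
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue  # header, separator, or a peer that is not selected
        fields = line[1:].split()
        if len(fields) >= 10:
            return float(fields[-2])  # columns end with: delay offset jitter
    return None


def offset_ok(ntpq_output: str, max_offset_ms: float = MAX_OFFSET_MS) -> bool:
    """True only if a peer is selected and its offset is within tolerance."""
    offset = selected_peer_offset_ms(ntpq_output)
    return offset is not None and abs(offset) <= max_offset_ms


print(selected_peer_offset_ms(SAMPLE))  # -0.071
print(offset_ok(SAMPLE))                # True
```

A machine with no `*` line has not yet synchronized to any server, so the check treats it as failing and does not assume the clock is fine.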
Using this simple approach, we achieved many objectives in one go! We did not have to depend on an external NTP server. Depending on external NTP servers could be risky, as they are not under our control. Above all, using this approach we were able to avoid any breach of our client's security policy.
There might be many other approaches to resolving a clock offset error, but using this approach we fixed the issue once and for all!