What 5 things to consider before streaming data into Hadoop?

Posted by Yogesh Kulkarni on January 23, 2017 in Blog, Gazelle


Data is being generated at a rapid pace these days. Thanks to Big Data technologies like Hadoop, it is becoming easier to process all the data that gets generated. Having helped several of our clients process streaming data and gain valuable insights, we have put together a list of points to consider when streaming data into Hadoop. Some of these seem fairly obvious, but based on experience I can confidently say they are easily missed, for want of experience.

1) How are you going to transfer data into the Hadoop cluster? – Data can be transferred using FTP or HTTP. Unless there is a specific need to use FTP, we recommend HTTP, since persistent connections and compression make transfers more efficient.

What we do – We normally begin small, using a Tomcat server on Linux to receive data. Tomcat is open source and easy to install and configure, and it performs best on Linux. Since Hadoop also runs on Linux, this is a good fit. Data is received in a landing folder and then taken forward by shell or Python scripts.

For production deployments where data is expected to be received from a few hundred thousand sources, we go for an Apache HTTP server (or an IIS server in a few cases, where the client is a Microsoft shop).
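To give a feel for the sending side, here is a minimal Python sketch that pushes a gzip-compressed file to the receiving web server over a persistent HTTP connection, using the requests library. The endpoint URL, source identifier and file name are hypothetical placeholders, not part of our actual setup.

```python
# A minimal sketch of the sender side, assuming a hypothetical HTTP endpoint
# (https://ingest.example.com/landing) exposed by the Tomcat/Apache server.
import gzip
import requests

def push_file(path, url="https://ingest.example.com/landing"):
    # Compress the payload before sending; HTTP lets us declare this via the
    # Content-Encoding header so the receiver can decompress it transparently.
    with open(path, "rb") as f:
        payload = gzip.compress(f.read())

    with requests.Session() as session:  # persistent (keep-alive) connection
        response = session.post(
            url,
            data=payload,
            headers={
                "Content-Encoding": "gzip",
                "Content-Type": "text/csv",
                "X-Source-Id": "sensor-042",  # hypothetical source identifier
            },
            timeout=30,
        )
        response.raise_for_status()
        return response.status_code

if __name__ == "__main__":
    push_file("readings_20170123.csv")
```

Reusing one session per sender is what makes the persistent-connection benefit of HTTP visible in practice, especially when a source sends many files in a row.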

2) How much raw data needs to be stored? – This is a tricky question. The more raw data you need to store, the bigger the Hadoop cluster needs to be (remember, Hadoop uses a default replication factor of 3, so any data entering HDFS is stored as three copies). Though Hadoop uses commodity hardware and is cost effective, our experience is that clients are very particular about every extra node that needs to be added.

What we do – We normally recommend storing raw data for not more than 7 days, as data beyond that is rarely required. If there are any bottlenecks in processing, they can usually be addressed within 7 days, which still leaves adequate time to recover the backlog.
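The sketch below shows what such a retention job might look like, assuming raw data lands in day-partitioned HDFS folders such as /data/raw/2017-01-23; the path layout and the 7-day window are illustrative assumptions, and the script simply shells out to the standard hdfs dfs commands.

```python
# A sketch of a daily retention job that removes raw partitions older than
# 7 days. Paths and partition naming are hypothetical.
import subprocess
from datetime import datetime, timedelta

RAW_ROOT = "/data/raw"      # hypothetical HDFS landing area, partitioned by day
RETENTION_DAYS = 7          # keep raw data for one week

def purge_old_partitions():
    cutoff = datetime.now() - timedelta(days=RETENTION_DAYS)
    # List the day-level folders under the raw landing area.
    listing = subprocess.check_output(
        ["hdfs", "dfs", "-ls", RAW_ROOT], universal_newlines=True
    )
    for line in listing.splitlines():
        if not line.strip():
            continue
        path = line.split()[-1]            # last column of 'hdfs dfs -ls' is the path
        folder = path.rsplit("/", 1)[-1]   # e.g. "2017-01-23"
        try:
            partition_date = datetime.strptime(folder, "%Y-%m-%d")
        except ValueError:
            continue                       # skip anything that is not a date folder
        if partition_date < cutoff:
            # Delete immediately (skip the trash) so the space is reclaimed.
            subprocess.check_call(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path])

if __name__ == "__main__":
    purge_old_partitions()
```

Scheduling this once a day (via cron or Oozie) keeps the raw area bounded without any manual housekeeping.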

3) What is the format of the data received? – CSV continues to be the preferred format for simple data models or data coming from legacy systems, because it is easy to transfer, parse and process. However, it works well only where the data is evenly structured, so if there is an option to receive data as CSV, we recommend taking it. For complex data models, formats such as JSON or XML are better suited, and popular languages like Python and PHP offer libraries for parsing these formats, which makes processing easy.

What we do – We normally use shell or Python scripts to parse and massage the data and finally convert it to CSV format. It is then passed to Flume for loading into Hadoop.
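As an illustration, here is a small Python sketch that flattens newline-delimited JSON records into CSV before handing them over to Flume (for example via a spooling-directory source). The field names and file paths are assumptions made up for the example, not our actual schema.

```python
# Flatten JSON-lines records into CSV for downstream loading via Flume.
# The input/output paths and field names are illustrative only.
import csv
import json

FIELDS = ["device_id", "timestamp", "metric", "value"]   # hypothetical schema

def json_lines_to_csv(in_path, out_path):
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Carry forward only the fields we need; everything else is dropped.
            writer.writerow({f: record.get(f, "") for f in FIELDS})

if __name__ == "__main__":
    # The output folder would typically be a Flume spooling directory.
    json_lines_to_csv("incoming/readings.json", "spool/readings.csv")
```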

4) Have you considered pre-processing the data? – This is an important aspect that deserves special attention. Data often needs to be reformatted, or unwanted fields dropped. It is always worthwhile to trim the data and take only the requisite data forward into Hadoop.

What we do – We always parallelise the shell scripts on the Linux servers, running multiple workers concurrently to speed up the processing multi-fold. Wherever possible, we also distribute the processing across the different nodes so that the data can be made available to the Flume agents running on the datanodes, thereby improving throughput.
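A sketch of the same idea in Python is shown below: a process pool runs one clean-up worker per file in the landing folder. The directory names, worker count and the "keep the first four columns" rule are placeholders standing in for the real trimming logic.

```python
# Parallel pre-processing sketch: fan one clean-up function out over every
# file in the landing folder. Paths and worker count are assumptions.
import glob
import os
from concurrent.futures import ProcessPoolExecutor

LANDING_DIR = "landing"    # hypothetical landing folder on the edge node
SPOOL_DIR = "spool"        # hypothetical Flume spooling directory

def preprocess(path):
    """Trim a raw CSV file to the required columns and write it to the spool."""
    out_path = os.path.join(SPOOL_DIR, os.path.basename(path))
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(",")
            # Keep only the first four columns (placeholder for the real trimming logic).
            dst.write(",".join(fields[:4]) + "\n")
    return out_path

if __name__ == "__main__":
    files = glob.glob(os.path.join(LANDING_DIR, "*.csv"))
    # Fan the work out across CPU cores; each file is handled by one worker.
    with ProcessPoolExecutor(max_workers=8) as pool:
        for done in pool.map(preprocess, files):
            print("processed:", done)
```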

5) Have you considered delayed data? – While data needs to be loaded, processed and made available for visualization in a short span of time, the architecture and processing also need to take care of data that arrives late. This is especially important for time-series data, where data is visualized and analysed over specific time spans. Data might get delayed, but it still needs to be processed and put in the appropriate bucket.

What we do – We normally impose a time limit on the delay, e.g. only data arriving within the last 48 hours is processed; data arriving later than this period is not processed in the regular cycle. The period varies with specific client requirements and use cases, but the concept remains the same: impose a restriction on the amount of delay to be tolerated.
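The following minimal sketch shows the lateness check, assuming each record carries an event timestamp and is bucketed by hour; the 48-hour cut-off, field names and timestamp format are illustrative assumptions.

```python
# Late-data check: assign a record to its hourly time-series bucket, or reject
# it from the regular cycle if it is more than 48 hours late. Field names and
# the cut-off are assumptions for the example.
from datetime import datetime, timedelta

MAX_DELAY = timedelta(hours=48)   # tolerate at most 48 hours of lateness

def bucket_for(record, now=None):
    """Return the hourly bucket for a record, or None if it is too late."""
    now = now or datetime.utcnow()
    event_time = datetime.strptime(record["event_time"], "%Y-%m-%d %H:%M:%S")
    if now - event_time > MAX_DELAY:
        return None   # too late for the regular cycle; park it for separate handling
    # Truncate to the hour so the record lands in its time-series bucket.
    return event_time.strftime("%Y-%m-%d-%H")

if __name__ == "__main__":
    record = {"event_time": "2017-01-22 09:15:00", "value": "42"}
    print(bucket_for(record, now=datetime(2017, 1, 23, 10, 0, 0)))  # -> 2017-01-22-09
```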

Once data is loaded into Hadoop, it can be processed as required and made available for visualization. Those details will be covered in a later blog post.

Hope you found these tips useful. I look forward to your comments and experiences in this area.