5 things to consider before streaming data into Hadoop
Data is being generated at a rapid pace these days. Thanks to Big Data technologies like Hadoop, processing this data has become easier.
Having helped several of our clients process streaming data and gain valuable insights, we have come up with a list of points to consider when streaming data into Hadoop. Some of these may seem obvious, but in our experience they are easily missed.
1) How are you going to transfer data into the Hadoop cluster –
Data can be transferred using FTP or HTTP. Unless there is a specific need to use FTP, we recommend HTTP because it supports persistent connections and compression, which make transfers more efficient.
What we do – We normally begin small, using a Tomcat server on Linux to receive data. Tomcat is open source and very easy to install and configure, and in our experience it performs best on Linux. Since Hadoop is also Linux based, this works out well. Data is received in a landing folder and then picked up by a shell script or Python script.
For production deployments where data is expected from a few hundred thousand sources, we go for an Apache HTTP server (or, in a few cases where the client is a Microsoft shop, an IIS server).
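To illustrate the landing-folder step, here is a minimal Python sketch of a pickup script. It is not our production script: the folder names and the ".part" suffix for files that are still being uploaded are assumptions made for the example.

```python
import shutil
from pathlib import Path

# Assumption: the receiving server writes in-flight uploads with this suffix
# and renames them once the transfer is complete.
IN_PROGRESS_SUFFIX = ".part"


def collect_landed_files(landing_dir, staging_dir):
    """Move fully received files from the landing folder to a staging folder."""
    landing = Path(landing_dir)
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(landing.iterdir()):
        # Skip sub-directories and files that are still being uploaded.
        if not path.is_file() or path.name.endswith(IN_PROGRESS_SUFFIX):
            continue
        target = staging / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

A script like this would typically be run on a schedule (for example from cron), with the staging folder being the place where the next processing step looks for its input.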
2) How much raw data needs to be stored –
This is a tricky question. The more raw data you need to store, the bigger the Hadoop cluster needs to be (remember, HDFS uses a replication factor of 3 by default, so any data entering Hadoop is stored as 3 copies). Although Hadoop runs on cost-effective commodity hardware, our experience is that clients are very particular about every extra node that has to be added.
What we do – We normally recommend storing raw data for no more than 7 days, as data older than that is rarely needed. If there are any bottlenecks in processing, they can usually be addressed within 7 days, which still leaves adequate time to clear the backlog.
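The effect of retention and replication on cluster size is simple arithmetic, and a rough sketch makes it concrete. The function name and the 50 GB/day figure below are made up for illustration, and the estimate ignores compression and the space needed for processed data.

```python
def raw_storage_needed_gb(daily_raw_gb, retention_days=7, replication_factor=3):
    """Estimate the disk space taken by raw data held in HDFS.

    Every block is stored `replication_factor` times, so the footprint is
    daily volume x days retained x number of copies.
    """
    return daily_raw_gb * retention_days * replication_factor


# Hypothetical example: 50 GB of raw data per day.
week = raw_storage_needed_gb(50)                      # 50 * 7 * 3 = 1050 GB
month = raw_storage_needed_gb(50, retention_days=30)  # 50 * 30 * 3 = 4500 GB
```

In this made-up case, keeping a month of raw data instead of a week needs more than four times the disk space, which is exactly the kind of extra capacity clients push back on.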
3) In what format will the data be received –
CSV continues to be the preferred format for simple data models and for data coming from legacy systems, because it is easy to transfer, parse and process. However, it works well only where the data is evenly structured. So, if there is an option to receive such data in CSV, we recommend going for it. Other formats such as JSON or XML are better suited to complex data models.
Also, popular languages like Python and PHP offer libraries for parsing and handling these formats, which makes processing easy.
What we do – We normally use shell scripts or Python scripts to parse and massage the data and finally convert it to CSV format. The CSV is then passed to Flume for loading into Hadoop.
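As an illustration of the conversion step, the sketch below turns newline-delimited JSON records into CSV using only Python's standard json and csv modules. The field names are hypothetical, and the sketch assumes flat records; nested JSON would need to be flattened first.

```python
import csv
import io
import json


def json_lines_to_csv(json_lines, fields):
    """Convert newline-delimited JSON records to CSV text with a fixed column order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        record = json.loads(line)
        # Keep only the requested fields; missing ones become empty cells.
        writer.writerow({name: record.get(name, "") for name in fields})
    return out.getvalue()
```

Fixing the column order up front means every output file has the same layout, even when some incoming records lack a field or carry extra ones.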
4) Have you considered pre-processing the data –
This is an important aspect that needs special attention. Data may need to be reformatted, or unwanted fields may need to be dropped. It is always worthwhile to trim the data and take only what is required ahead to Hadoop.
What we do – We always parallelize (multi-thread) the shell scripts on the Linux servers to speed up the processing multi-fold. Wherever possible, we also distribute the processing across different nodes so that the data is available to the Flume agents running on the datanodes, thereby improving throughput.
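The same idea can be sketched in Python rather than shell. The example below drops unwanted columns and trims whitespace from several CSV payloads in parallel, using a thread pool from the standard library. The column names and the worker count are assumptions for the example, not values from our deployments.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def trim_csv_text(csv_text, keep_columns):
    """Keep only the wanted columns of one CSV payload and strip stray whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=keep_columns, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({name: (row.get(name) or "").strip() for name in keep_columns})
    return out.getvalue()


def trim_many(csv_texts, keep_columns, workers=4):
    """Trim several CSV payloads in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda text: trim_csv_text(text, keep_columns), csv_texts))
```

Threads help most when the work is dominated by reading and writing files. For CPU-heavy parsing in Python, a process pool (ProcessPoolExecutor with a module-level worker function) is usually the better fit.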
5) Have you considered delayed data –
While data needs to be loaded, processed and made available for visualization in a short span of time, the architecture and processing also need to take care of data that arrives late. This is especially important for time-series data, where data is visualized and analysed over specific time spans. Data might get delayed, but it still needs to be processed and put in the appropriate bucket.
What we do – We normally impose a time limit on the delay, e.g. only data that arrives within 48 hours will be processed. Data that arrives later than this cannot be processed in the regular cycle. This period will vary with specific client requirements and use cases, but the concept remains the same: impose a limit on the amount of delay you will tolerate.
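A delay limit like this can be sketched in a few lines of Python. The example assumes that each record carries its own event timestamp, that delay is measured as arrival time minus event time, and that buckets are hourly. All three are assumptions for illustration, as is the function name.

```python
from datetime import datetime, timedelta

# The 48-hour limit from the text; tune per client and use case.
MAX_DELAY = timedelta(hours=48)


def route_record(event_time, arrival_time, max_delay=MAX_DELAY):
    """Return the hourly bucket for a record, or None if it arrived too late.

    The bucket is derived from the event time, not the arrival time, so
    late records still land in the time span they belong to.
    """
    if arrival_time - event_time > max_delay:
        return None  # too late for the regular cycle
    return event_time.strftime("%Y-%m-%d-%H")


arrival = datetime(2020, 1, 10, 12, 0)
on_time = route_record(datetime(2020, 1, 10, 11, 30), arrival)  # "2020-01-10-11"
late_ok = route_record(datetime(2020, 1, 9, 8, 15), arrival)    # "2020-01-09-08"
too_late = route_record(datetime(2020, 1, 8, 11, 59), arrival)  # None
```

Records that come back as None are the ones that fall outside the regular cycle and need whatever exception handling the client has agreed to.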
Once data is loaded into Hadoop, it can be processed as required and made available for visualization. Details of this will be covered in a later blog post.
Hope you found these tips useful. I look forward to your comments and experiences in this area.