How To Get The Best Out of ‘R’ Programming!

Onkar Khaladkar
Posted on October 30, 2017 in Blog


R is one of the most powerful and widely used programming languages for statistical computing. In recent years, R has made substantial progress with Big Data and Machine Learning. The ever-evolving and vibrant R community has kept pace by providing unique packages that can be leveraged for storage, processing, visualization, and other Big Data services. At the same time, R has tons of libraries for Machine Learning applications.

With the advent of Big Data, the complexity of processing huge amounts of data has reduced considerably. Machine Learning on this humongous data is the next logical step and very much inevitable. Given all the challenges of deriving actionable insights from data, and the role of Machine Learning in it, R can be a great tool for making progress on that front.

In this article, I am going to detail 'Parallel Processing' in R and how you can get the most out of R by using it. Parallel processing in R is a great aid for the holistic implementation of IoT, Big Data, and Machine Learning applications.

Parallelization in R

By default, R works on a single core of the machine. To execute R code in parallel, you first have to make a cluster with the required number of cores available to R. You do this by registering a 'parallel backend' using the 'doParallel' package, which creates an R instance on each of the specified cores. The 'foreach' package is then used to define the code that should be executed in parallel or sequentially.

Below is a code snippet in R for parallelization:

# Libraries used for parallel processing
library(doParallel) # doParallel registers the cores as a parallel backend.
library(foreach)    # foreach provides a looping construct for executing R code repeatedly.

# Make a cluster from the cores available on the machine, leaving one core
# free for the OS. The machine we used for this snippet has 4 cores.
cl <- makeCluster(detectCores() - 1)

# Register the cluster as the parallel backend; the number of cores can be
# tuned according to resource utilization.
registerDoParallel(cl)

# The main reason for using the foreach package is that it supports both
# parallel and sequential execution:
#   '%dopar%' - executes the loop body in parallel across the registered cores.
#   '%do%'    - executes the loop body sequentially on a single core.
#   .packages - packages required by the code inside foreach() {}.
#   .combine  - how the results are combined, e.g. rbind, cbind, +, *, c.
#   .export   - variables/functions from the parent environment to export to workers.

# 'datasets' and process() below are placeholders for your own data and logic.
results <- foreach(i = seq_along(datasets),
                   .combine  = c,
                   .export   = c("process"),          # functions used inside the loop
                   .packages = c("stringr", "log4r")  # add any other packages needed
                   ) %dopar% {
  # Typical work done inside the parallel loop includes:
  #  - statistical functions, algebraic expressions, and mathematical
  #    computations used in data science
  #  - batch processing
  #  - data pre-processing (cleaning, formatting, transformation) before
  #    passing data to ML algorithms for prediction
  #  - real-time and time-series data processing
  process(datasets[[i]])
}

# Stop the parallel backend that was registered.
stopCluster(cl)
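As a usage example, the difference between '%do%' and '%dopar%' is easy to see with a quick timing comparison. This is a minimal sketch; work() is a hypothetical CPU-bound stand-in, and the actual speed-up depends on your machine and the size of each task.

library(doParallel)
library(foreach)

work <- function(i) sum(sqrt(seq_len(1e6))) + i  # dummy CPU-bound task

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# Sequential execution on a single core:
t_seq <- system.time(foreach(i = 1:50, .combine = c) %do% work(i))

# Parallel execution across the registered cores:
t_par <- system.time(foreach(i = 1:50, .combine = c) %dopar% work(i))

stopCluster(cl)

print(t_seq)
print(t_par)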

I have used R parallelization to optimize execution time in Machine Learning, batch processing, and real-time data analysis on top of Big Data. I will share my insights on these in detail in my next article.

Monitoring Parallel Processes In R

Parallelization in R provides an opportunity to continuously monitor the different threads and instances executing concurrently. The R instances running on each core, along with their RAM and CPU utilization, can be tracked with the System Monitor or the top command on Ubuntu, the Activity Monitor on Mac, and the Task Manager on Windows.
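To match each R instance in these monitoring tools, it helps to know the worker process IDs. Below is a minimal sketch that asks each worker for its PID via Sys.getpid(); note that foreach may assign several tasks to the same worker, so PIDs can repeat.

library(doParallel)
library(foreach)

cl <- makeCluster(4)  # 4 cores, as on the machine used for Fig. 1
registerDoParallel(cl)

# Each task reports the PID of the worker it runs on; these are the
# R processes visible in System Monitor / top / Task Manager.
worker_pids <- foreach(i = 1:4, .combine = c) %dopar% Sys.getpid()
print(unique(worker_pids))

stopCluster(cl)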

Below is the result of executing an R script in parallel. When you make a cluster and register the cores, R instances are allocated on each core. You can observe the 4 R processes in the Processes tab below (Fig. 1). The Memory column shows memory usage in MiB, which is close to MB.

Fig. 1: R process monitor

The Resources tab below (Fig. 2) shows the number of cores on the machine and their utilization. As you can observe in the 'CPU History' chart, the graph spikes within about 5 seconds from 20-30% to 100%; this happens when the code inside foreach is executed. 'Memory and Swap History' shows the total memory used and available at the time of execution. Swap memory is taken from the hard disk when processes run out of RAM; if processes exceed both RAM and swap, the machine might freeze and require a hard reboot. A similar variability pattern appears in 'Memory and Swap History' and 'Network History' when we schedule R parallel processes to run continuously in iterations across clusters.

Fig. 2: R parallel process monitor

In the above scenarios, you can observe the step-by-step execution across use cases. This also helps in end-to-end analysis of the code.

Logging

Logging for each node in the foreach loop helps in analysing node-specific datasets. In a real-time use case, where data is received for thousands of parameters from a variety of sources, narrowing down the reason for a failure is required quite often and can be cumbersome. These failures can be device-specific, server-specific, data-specific, code-specific, etc. Processing each parameter-specific dataset and logging parameter-specific details helps in faster root cause analysis.
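Here is a hedged sketch of per-parameter logging inside the parallel loop, using the log4r package mentioned earlier. The parameter names and the clean_data() helper are hypothetical placeholders for your own pipeline; clean_data() deliberately fails for one parameter to show how failures land in that parameter's log file.

library(doParallel)
library(foreach)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

params <- c("temperature", "pressure", "vibration")  # example parameter names

clean_data <- function(p) {                        # hypothetical pre-processing step
  if (p == "vibration") stop("corrupt reading")    # simulate a data failure
  toupper(p)
}

results <- foreach(p = params, .packages = "log4r") %dopar% {
  # One log file per parameter keeps root cause analysis fast and focused.
  logger <- create.logger(logfile = paste0("log_", p, ".log"), level = "INFO")
  info(logger, paste("Started processing parameter:", p))
  out <- tryCatch(clean_data(p), error = function(e) {
    error(logger, paste("Failure for", p, "-", conditionMessage(e)))
    NULL
  })
  info(logger, paste("Finished parameter:", p))
  out
}

stopCluster(cl)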

Some of the time series data challenges, like dealing with data inconsistency, lag, corruption, duplicates, rejection/validation, and cleaning, can be dealt with at this granular level. The usual method of logging to txt files is helpful in the initial stages, when we are tweaking the application end to end (data source to visualization), but after some time managing these files and traversing through them becomes an activity in itself. Bird's-eye-view logging can be done in a well-designed database schema for monitoring the status of real-time data, with a UI built on top of it. This gives quick data monitoring stats without manual intervention and eliminates the hassle of maintaining log files.
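Below is a minimal sketch of this bird's-eye-view approach, assuming the DBI and RSQLite packages: append one status row per run to a database table instead of scattering details across text files. The run_log table and its columns are illustrative, not a prescribed schema.

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "monitoring.db")

log_row <- data.frame(
  run_time  = as.character(Sys.time()),
  parameter = "temperature",   # example parameter name
  status    = "OK",            # e.g. OK / REJECTED / CORRUPT
  rows_in   = 1000L,
  rows_out  = 987L             # rows surviving cleaning/validation
)

dbWriteTable(con, "run_log", log_row, append = TRUE)

# A UI or dashboard can then query quick stats without manual log traversal:
print(dbGetQuery(con, "SELECT status, COUNT(*) AS runs FROM run_log GROUP BY status"))

dbDisconnect(con)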

‘R’ Coding Tips

Here are some ‘R’ programming tips to get the best from it: 

  • R is memory-intensive and does a lot of in-memory processing, so creating an R instance on every core can hamper resource utilization. Optimizing code with a mix of sequential and parallel approaches reduces unnecessary CPU consumption. For memory-intensive processes, it is better to parallelize at a small scale, in subroutines.
  • Remove objects holding large datasets from the workspace using rm(). If you are running many iterations on every R instance, use rm(list = ls()) to free RAM.
  • Call the garbage collector gc() wherever necessary (a short sketch follows this list). R is expected to do this implicitly to release memory, but it will not always return memory to the operating system. In parallel processing, it is important that each process returns memory after completion.
  • Database connections need to be tracked! Timely termination helps with connection pooling.
  • For optimizing and avoiding loops in R, you will find many resources on how to get the best out of R; follow their recommendations as coding conventions during development.
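As a short illustration of the rm() and gc() tips above, here is a sketch of the cleanup pattern inside a parallel loop; the data sizes are arbitrary.

library(doParallel)
library(foreach)

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

results <- foreach(i = 1:8, .combine = c) %dopar% {
  big <- rnorm(1e6)  # stand-in for a large intermediate dataset
  out <- mean(big)   # keep only the small summary we need
  rm(big)            # drop the large object from the worker's workspace
  gc()               # ask the worker to release memory after completion
  out
}

stopCluster(cl)
print(results)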

I hope you find this article helpful and get the best out of R! For any questions or doubts, feel free to comment or mail me at onkar.khaladkar@ellicium.com.