Month: March 2018

CUT command in Unix/Linux with examples

Cut Command:
– cut is used to extract sections of data from a file.
– Works best on files containing column-formatted (delimited) data.
Command 1: Display the character at a particular position
cut -c3 file.txt
Command 2: Display a range of characters
cut -c3-8 file.txt
cut -c3- file.txt
cut -c-10 file.txt
Command 3: Display columns after separation on a delimiter
cut -d "|" -f2 file.txt
cut -d "|" -f2-3 file.txt
cut -d "|" -f2- file.txt
Command 4: Display all columns other than the given ones [--complement]
cut -d "|" -f2 file.txt
cut -d "|" --complement -f2 file.txt

GREP command in Unix/Linux with examples

grep – Global Regular Expression Print
It is used to search for data in one or more files.
Command 1: Search for a pattern in a file
grep hello file.txt
grep sai file.txt file2.txt
Command 2: Search for a pattern in all .txt files in the current folder
grep 1000 *.txt
Command 3: Search for data in all files in the current folder
grep 1000 *.*
Command 4: Search ignoring case [-i]
grep "Sai" file.txt (case sensitive by default)
grep -i "Sai" file.txt
Command 5: Display line numbers [-n]
grep -n "124" result.txt
Command 6: Get only the filenames in which the data exists [-l]
grep -l "100" *.*
Command 7: Search for an exact word [-w]
grep -w Sai file.txt
Command 8: Search for lines that do not contain the data (inverse of search) [-v]
grep -v "1000" file.txt
Command 9: Get one record before the match
grep -B 1 "Msd" fi...

SED command in Unix/Linux with examples

SED – Stream Editor
Used to display and edit data; editing options are insertion, updating, and deletion.
2 Types of Addressing
– Line Addressing
– Context Addressing
Line Addressing:
Command 1: Display a line multiple times
sed '2p' file.txt
sed -n '3p' file.txt (only the specific line => -n)
sed -n '5p' file.txt
Command 2: Display the last line [$]
sed '$p' file.txt (prints the last line again along with the original output)
sed -n '$p' file.txt (only the last line)
Command 3: Display a range of lines
sed -n '2,4p' file.txt
Command 4: Do not display specific lines [!]
sed -n '2!p' file.txt
sed -n '2,4!p' file.txt (do not display the given range of lines)
Context Addressing:
Command 1: Display lines having a specific word
sed -n '/Amit/p' file.txt
sed -n '/[Aa]mi...

All about AWK command in Unix – Part 1

AWK – select column data
– Search data in a file and print it on the console
– Find data of specific columns
– Format output data
– Used on files with bulk data for searching, conditional execution, updating, and filtering
Command 1: Print specific columns
awk '{print $1}' file.txt (whitespace/TAB is the default separator)
awk '{print $1 "--" $2}' file.txt
Command 2: Select all data from the table
awk '{print $0}' tabfile.txt
Command 3: Select columns from a CSV
1. Separating data using -F
awk -F "," '{print $1}' commafile.txt
2. Using the FS variable
awk '{print $2}' FS="," commafile.txt
Command 4: Display content without the file header
awk 'NR!=1{print $1 " " $2...

All about AWK command in Unix – Part 2

Command 11: Find text at the start of a line [ ^ ]
awk -F "|" '$2 ~ /^s/{print $0}' tabfile.txt
Command 12: Find text at the end of a line [ $ ]
awk -F "|" '$2 ~ /n$/{print $0}' file1.txt
Command 13: Perform a condition check using if
awk -F "|" '{if ($3>2000) print $0;}' file2.txt
Command 14: Perform a condition check using if-else
awk -F "|" '{if($3>=20000) print $2; else print "*****";}' file2.txt
Command 15: Perform a condition check using else if
awk -F "|" '{ if ($3>=3000) print $2 "your tax is 30%"; else if($3>=2000) print $2 "your tax is 20%"; else print $2 "your tax is 10%";}' file2.txt
Command 16: Begin B...

How to write current method name to log in Scala [Code Snippet]

Your application framework will have many methods, and if you want to trace and log the current method name, the code below will be helpful.

def getCurrentMethodName: String = Thread.currentThread.getStackTrace()(2).getMethodName

def test {
  println("you are in - " + getCurrentMethodName)
  println("this is doing some functionality")
}

test

Output:
you are in – test
this is doing some functionality
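If you also want the enclosing class or object name in the same log line, the stack-trace element used above exposes it as well. A minimal sketch using the same approach (getCurrentLocation and demo are illustrative names, not from the original snippet):

def getCurrentLocation: String = {
  // element 2 of the stack trace is the method that called getCurrentLocation
  val frame = Thread.currentThread.getStackTrace()(2)
  frame.getClassName + "." + frame.getMethodName
}

def demo {
  println("you are in - " + getCurrentLocation)
}

demo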

How to calculate total time taken for particular method in Spark [Code Snippet]

In cases where you apply joins in a Spark application, you might want to know the time taken to complete a particular join. The code snippet below might come in handy to achieve that.

import java.util.Date

val current = new Date().getTime
println(current)
Thread.sleep(30000)
val end = new Date().getTime
println(end)
println("time taken " + (end - current).toFloat / 60000 + "mins")

Output:
import java.util.Date
current: Long = 1520502573995
end: Long = 1520502603996
time taken 0.5000167mins

All you need to do is get the current time before the method starts, get the current time after the method ends, and calculate the difference to get the total time taken by that particular method. Hope this code snippet helps!!
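If you need this measurement in more than one place, the same before/after logic can be wrapped in a small reusable helper that times any block of code. A minimal sketch (the time helper and its label parameter are illustrative, not part of the original snippet):

def time[T](label: String)(block: => T): T = {
  // capture wall-clock time before and after evaluating the block
  val start = System.currentTimeMillis()
  val result = block
  val end = System.currentTimeMillis()
  println("time taken for " + label + " " + (end - start).toFloat / 60000 + " mins")
  result
}

// usage: wrap the join (or any other step) in the helper
val joined = time("join step") {
  Thread.sleep(1000) // stand-in for something like df1.join(df2, "key")
  "result"
}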

How to write current date timestamp to log file in Scala [Code Snippet]

Scala doesn't have its own library for dates and timestamps, so we need to depend on Java libraries. Here is a quick method to get the current date timestamp and format it as per your required format. Please note that all the code is in Scala and can be used while writing a Scala application.

import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Calendar

def getCurrentdateTimeStamp: Timestamp = {
  val today: java.util.Date = Calendar.getInstance.getTime
  val timeFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  val now: String = timeFormat.format(today)
  val re = java.sql.Timestamp.valueOf(now)
  re
}

Output:
import java.sql.Timestamp
getCurrentdateTimeStamp: java.sql.Timestamp
getCurrentdateTimeStamp
res0: java.sql.Timestamp = 2018-03-18 07:48:00.0
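On Java 8 and later, the same result can also be obtained through java.time instead of Calendar and SimpleDateFormat. A minimal sketch of that alternative (the method name is illustrative):

import java.sql.Timestamp
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

def getCurrentTimestampJava8: Timestamp = {
  // format to yyyy-MM-dd HH:mm:ss (drops sub-second precision) to match the method above
  val now = LocalDateTime.now()
  Timestamp.valueOf(now.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")))
}

println(getCurrentTimestampJava8)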

Common issues with Apache Spark

Tricky Deployment: Once you're done writing your app, you have to deploy it, right? That's where things can get a little out of hand. Although there are many options for deploying your Spark app, the simplest and most straightforward approach is standalone deployment. Spark also supports Mesos and YARN, so if you're not familiar with one of those it can become quite difficult to understand what's going on. You might face some initial hiccups when bundling dependencies as well: if you don't do it correctly, the Spark app will work in standalone mode, but you'll encounter classpath exceptions when running in cluster mode.
Memory Issues: As Apache Spark is built to process huge chunks of data, monitoring and measuring memory usage is critical. While Spark works just fine for normal usage, it has got tons of...
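The classpath exceptions mentioned above often come from bundling Spark's own jars into the application jar. A common remedy, sketched below for an sbt build, is to mark the Spark dependencies as provided so the cluster's jars are used at runtime (project name and version numbers are illustrative):

// build.sbt – Spark artifacts marked "provided" are excluded from the fat jar
name := "my-spark-app"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)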

Comparison between Apache Spark and Apache Hadoop

Spark Hadoop Comparison: Below is the comparison between Spark and Hadoop.
They do different things: Hadoop and Apache Spark are both big-data frameworks, but they don't really serve the same purposes. Hadoop is essentially a distributed data infrastructure: it distributes massive data collections across multiple nodes within a cluster of commodity servers, which means you don't need to buy and maintain expensive custom hardware. It also indexes and keeps track of that data, enabling big-data processing and analytics far more effectively than was possible previously. Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it doesn't do distributed storage.
You can use one without the other: Hadoop includes not just a storage comp...

Version wise features of Apache Spark

Spark Release 2.1.0: The Apache Spark 2.1.0 release makes significant strides in the production readiness of Structured Streaming, with added support for event time watermarks and Kafka 0.10. In addition, this release focuses on usability, stability, and polish, resolving over 1200 tickets. Below is the list of high-level changes.
Core and Spark SQL: This version supports from_json and to_json for parsing JSON for string columns. It allows the use of DDL commands to manipulate partitions for tables stored with Spark's native formats. It speeds up group-by aggregate performance by adding a fast aggregation cache backed by a row-based hashmap.
Structured Streaming: This version adds Kafka 0.10 support in Structured Streaming. This version supports all file f...
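As a quick illustration of the from_json support mentioned above, the sketch below parses a JSON string column into a struct (the DataFrame, column, and field names are made up for the example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.appName("from-json-example").getOrCreate()
import spark.implicits._

// hypothetical DataFrame with a JSON payload stored as a plain string column
val df = Seq("""{"name":"sai","age":30}""").toDF("json")

// schema describing the JSON payload
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)
))

val parsed = df.withColumn("parsed", from_json($"json", schema))
parsed.select($"parsed.name", $"parsed.age").show()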

Memory management in Apache Spark

Memory Management in Spark 1.6
Executors run as Java processes, so the available memory is equal to the heap size. Internally, the available memory is split into several regions with specific functions.
Execution Memory
– storage for data needed during task execution
– shuffle-related data
Storage Memory
– storage of cached RDDs and broadcast variables
– possible to borrow from execution memory (spill otherwise)
– safeguard value is 50% of Spark Memory when cached blocks are immune to eviction
User Memory
– user data structures and internal metadata in Spark
– safeguarding against OOM
Reserved Memory
– memory needed for running the executor itself, not strictly related to Spark
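The split between these regions is controlled by a couple of configuration properties on the unified memory manager. A minimal sketch of setting them on a SparkConf (the values shown are believed to be the Spark 1.6 defaults, so treat them as illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-config-example")
  .setMaster("local[*]") // local master just for a quick test
  // fraction of the heap (minus reserved memory) used for execution + storage
  .set("spark.memory.fraction", "0.75")
  // portion of that Spark Memory where cached blocks are immune to eviction
  .set("spark.memory.storageFraction", "0.5")

val sc = new SparkContext(conf)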
