Getting Started PIG 1

Assalamualaykum wr br..:)

In this post we discuss the basics of Apache PIG, a platform used to analyze data in Hadoop. Its language is known as Pig Latin. It is a high-level data processing language with rich data types and operators for performing various operations on data in Hadoop.

To analyze data in Hadoop we write PIG scripts, which can be executed in the Grunt shell. Internally, Apache PIG converts these scripts into a series of MapReduce jobs, which makes the programmer's job easier. The architecture of PIG can be illustrated as below:

[Figure: Apache PIG architecture]

As we can see, there are various components involved in Apache PIG. Let us go through them briefly.

 

Parser

It checks the syntax of the script and also performs type checking and other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as the nodes and the data flows are represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

The MapReduce jobs are submitted to Hadoop in a sorted order. Finally, these jobs are executed on Hadoop, producing the desired results.

Pig Latin Data Model

The data model of Pig Latin is fully nested and it allows complex non-atomic datatypes such as map and tuple. Given below is the diagrammatical representation of Pig Latin’s data model.

[Figure: Pig Latin data model]

Atom

Any single value in Pig Latin, irrespective of its data type, is known as an atom. It is stored as a string and can be used as a string or a number. int, long, float, double, chararray, and bytearray are the atomic types of Pig. A piece of data or a simple atomic value is known as a field.

Example − ‘Aejaaz’ or ‘27’

Tuple

A record formed by an ordered set of fields is known as a tuple; the fields can be of any type. A tuple is similar to a row in an RDBMS table.

Example − (Aejaaz,27)

Bag

A bag is an unordered set of tuples. In other words, a collection of tuples (non-unique) is known as a bag. Each tuple can have any number of fields (flexible schema). A bag is represented by ‘{}’. It is similar to a table in RDBMS, but unlike a table in RDBMS, it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.

Example − {(Aejaaz,27), (Mohammad, 45)}

A bag can be a field in a relation; in that context, it is known as an inner bag.

Example − (Aejaaz, 27, {(008022008, aaejaaz@gmail.com)})

Map

A map (or data map) is a set of key-value pairs. The key needs to be of type chararray and should be unique. The value can be of any type. A map is represented by ‘[]’.

Example − [name#aejaaz, age#27]

Relation

A relation is a bag of tuples. The relations in Pig Latin are unordered (there is no guarantee that tuples are processed in any particular order).
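
The data model above can be tried out with a small script. What follows is a minimal sketch and not part of the original walkthrough: the file names people.txt and data_model.pig, the helper name write_pig_demo and the two sample rows are assumptions made for illustration. It writes a small input file plus a Pig Latin script that loads each row as a tuple and groups the tuples by age, so every output tuple carries an inner bag. The script is run in local mode only if the pig command is available.

```shell
#!/usr/bin/env bash
# Sketch: write a tiny input file and a Pig Latin script that shows
# tuples, a bag produced by GROUP, and the resulting relation.
# All file names and sample rows are illustrative assumptions.

write_pig_demo() {
    local dir="$1"
    mkdir -p "$dir"

    # Two comma-separated rows; each row is loaded as one tuple.
    printf '%s\n' 'Aejaaz,27' 'Mohammad,45' > "$dir/people.txt"

    # Quoted heredoc delimiter so the shell leaves the script text untouched.
    cat > "$dir/data_model.pig" <<'EOF'
-- people is a relation: a bag of (name, age) tuples
people = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
-- each tuple of by_age holds the group key and an inner bag of matching tuples
by_age = GROUP people BY age;
DUMP by_age;
EOF
}

demo_dir="$(mktemp -d)"
write_pig_demo "$demo_dir"

# Run in local mode only when Pig is installed; otherwise just report the path.
if command -v pig >/dev/null 2>&1; then
    (cd "$demo_dir" && pig -x local data_model.pig)
else
    echo "pig not found; script written to $demo_dir/data_model.pig"
fi
```

With Pig available, the DUMP output should consist of tuples whose second field is a bag, along the lines of (27,{(Aejaaz,27)}).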

That's all for the day.

Jazakallah khair..:)  Alhamdulliah.

Installing Hadoop PIG

Assalamualaykum Wr Br..:)

Today we discuss the installation of PIG on Ubuntu.

  1. Download PIG from the link in step 2, either from a web browser or from the terminal.
  2. To download from the terminal, use wget http://mirror2.shellbot.com/apache/pig/pig-0.15.0/pig-0.15.0-src.tar.gz
  3. After the download, extract the files and place them in the home directory where Hadoop is installed.
  4. Open the bashrc file with the command gedit ~/.bashrc, then add the PIG home variables as shown here, with PIG_HOME pointing at the directory you extracted (its version number must match the archive you downloaded):

#pig_home

export PIG_HOME=/home/aejaaz/pig-0.16.0

export PATH=$PATH:$PIG_HOME/bin

#end

5. Reload the bashrc file so that the newly added variables take effect and to check that the file has no errors. To do so, type the command source ~/.bashrc

6. Now type pig in your terminal. The Grunt shell will open, which confirms that PIG is installed properly on the system. To quit the Grunt shell, type the command quit.
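
Steps 4 and 5 can also be scripted so that running them twice does not add duplicate lines. This is only a sketch under stated assumptions: add_home_to_rc is a hypothetical helper written for this post, not a PIG or Hadoop tool, and the demonstration writes to a scratch file rather than the real ~/.bashrc.

```shell
#!/usr/bin/env bash
# Sketch: append "export NAME=value" and the matching PATH line to an rc file,
# skipping the append when NAME is already exported there.
# add_home_to_rc is a hypothetical helper, not a Pig or Hadoop tool.

add_home_to_rc() {
    local rc_file="$1" var_name="$2" home_dir="$3"

    # Skip if the variable is already exported in the file.
    if grep -q "^export ${var_name}=" "$rc_file" 2>/dev/null; then
        echo "${var_name} already set in ${rc_file}"
        return 0
    fi

    {
        echo "#${var_name}"
        echo "export ${var_name}=${home_dir}"
        # Single quotes keep $PATH and the variable reference literal in the rc file.
        echo 'export PATH=$PATH:$'"${var_name}"'/bin'
        echo "#end"
    } >> "$rc_file"
}

# Demonstrate on a scratch file instead of the real ~/.bashrc.
rc="$(mktemp)"
add_home_to_rc "$rc" PIG_HOME /home/aejaaz/pig-0.16.0
add_home_to_rc "$rc" PIG_HOME /home/aejaaz/pig-0.16.0   # second call is a no-op
cat "$rc"
```

To apply it for real, the first argument would be "$HOME/.bashrc", followed by source ~/.bashrc as in step 5.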

That's all for the day.

Jazakallah khair.

 

Installing Sqoop Hadoop in Ubuntu

Assalamualikum Wr Br..:)

In this post I am going to go through the steps for installing Sqoop on Ubuntu. In subsequent posts I will share what it actually is, what functions it performs, and how. Let's get started now..:)

1. Download the binary version of Sqoop from the following link:

http://redrockdigimark.com/apachemirror/sqoop/1.4.6/

Click the above link. The file will typically be saved to the Downloads directory of your local system.

1.a. You can also download it from the Ubuntu terminal with the following command:

$ wget http://redrockdigimark.com/apachemirror/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-0.23.tar.gz

The above command downloads the Sqoop binary package.

2. Extract the package and move it to the home location where the Hadoop components exist.

2.a. mv

3. In the conf directory of Sqoop, make a copy of sqoop-env-template.sh and name the copy sqoop-env.sh (sqoop-env-template.cmd is the Windows counterpart).

4. Open the bashrc file and update the file with Sqoop home value as shown below:

4.a $ gedit ~/.bashrc  (opens file in editor)

4.b #sqoop home

export SQOOP_HOME=

export PATH=$PATH:$SQOOP_HOME/bin

Save the file and close it. Then, from the terminal, reload the bashrc file so the new variables take effect and to check that the file is error free, using the command $ source ~/.bashrc

5. From the terminal, run the command sqoop-version, which displays the version of Sqoop that was just installed on your system.
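
The template-copy step can be sketched as a small shell function. This is an illustration under assumptions: init_sqoop_env is a hypothetical helper, and it assumes the extracted package has a conf directory containing sqoop-env-template.sh. The demonstration builds a scratch directory with a one-line stand-in template instead of touching a real installation.

```shell
#!/usr/bin/env bash
# Sketch: create conf/sqoop-env.sh from the template if it is not there yet.
# init_sqoop_env is a hypothetical helper; the layout below is an assumption.

init_sqoop_env() {
    local conf_dir="$1"
    local template="$conf_dir/sqoop-env-template.sh"
    local target="$conf_dir/sqoop-env.sh"

    # Never overwrite an existing, possibly customized, sqoop-env.sh.
    if [ -f "$target" ]; then
        echo "$target already exists; leaving it unchanged"
        return 0
    fi
    if [ ! -f "$template" ]; then
        echo "template not found: $template" >&2
        return 1
    fi
    cp "$template" "$target"
}

# Demonstrate on a scratch directory that mimics a Sqoop conf directory.
sqoop_home="$(mktemp -d)"
mkdir -p "$sqoop_home/conf"
echo '#export HADOOP_COMMON_HOME=' > "$sqoop_home/conf/sqoop-env-template.sh"
init_sqoop_env "$sqoop_home/conf"
ls "$sqoop_home/conf"
```

On a real installation the argument would be "$SQOOP_HOME/conf", once SQOOP_HOME has been set in the bashrc file.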

That's it. Happy learning!

Alhamdulliah..:)