Sunday, February 12, 2017

Big Data Hadoop

Hadoop is open-source software for data management, maintained by the Apache Software Foundation (commercial distributions are available from vendors such as Cloudera). It is meant for huge data sets, not small ones. Why not small data sets? Well, if you have a 10-minute task and distribute it to 10 people, the coordination overhead means it can easily take more time, not less. So this is meant for big data, as the name itself suggests.
It is an open-source framework, overseen by the Apache Software Foundation, for storing and processing huge data sets on a cluster (or clusters) of commodity hardware.
Hadoop has two core components:
1. HDFS - for storing data
2. MapReduce - for processing that data


HDFS
HDFS (Hadoop Distributed File System) is built specifically for storing huge data sets with a streaming access pattern, mostly on commodity hardware. One thing to remember: it is "write once, read any number of times, but do not try to change it once it is written to HDFS". Now why HDFS and not any other file system? HDFS has a default block size of 64 MB, and if you store a file of just 10 MB, the other 54 MB is not wasted the way it would be in many other file systems; the remaining free space of a block is used for other files. You can change this block size to 128 MB if you like.
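As a quick illustration, many versions also let you override the block size per file when copying data in; this is just a sketch, the file and target path are made up, and 134217728 is simply 128 MB in bytes (older releases call the property dfs.block.size instead of dfs.blocksize):
# copy a local file into HDFS using a 128 MB block size instead of the default
hdfs dfs -D dfs.blocksize=134217728 -put biglogfile.log /data/biglogfile.log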
Hadoop has 5 major services, or daemons:
1. name node
2. secondary name node
3. job tracker
4. data node
5. task tracker
Here 1, 2 and 3 are master services and 4 and 5 are slave services. All master daemons can talk to each other, and all slave daemons can talk to each other, but master-to-slave communication happens only between name node and data node, and between job tracker and task tracker. The table below summarizes who talks to whom.

                     | name node | secondary namenode | job tracker | data node | task tracker
name node            | NA        | yes                | yes         | yes       | no
secondary namenode   | yes       | NA                 | yes         | no        | no
job tracker          | yes       | yes                | NA          | no        | yes
data node            | yes       | no                 | no          | NA        | yes
task tracker         | no        | no                 | yes         | yes       | NA

When a client wants to store data on HDFS, it contacts the name node and asks for a place to put the data. The name node creates metadata (an inventory of the data, or data about data) and tells the client which systems of the Hadoop cluster it can use; the client then contacts those systems directly and distributes the data across them in 64 MB blocks.
What about redundancy?
Well, HDFS keeps the original plus two replicas by default, so three copies of your data are available to you. The name node maintains the metadata about where all these blocks are stored, and all the data nodes report their stored blocks back to the name node at a regular interval, just like a heartbeat. If the name node stops receiving reports for a certain block, a new copy of that block is created on another data node.
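You can see this block and replica bookkeeping for yourself; a minimal sketch, where the file and directory names are just examples:
# copy a file into HDFS, then ask the name node how it was split and replicated
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put bigfile.log /user/demo/
hdfs fsck /user/demo/bigfile.log -files -blocks -locations
# fsck lists every block of the file and the data nodes holding its replicas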

What if the metadata is lost?
Well, your data is lost forever. It is important to make sure that your name node runs on highly reliable hardware and is as failproof as it can get. The name node is a single point of failure (SPOF), so keeping it fault tolerant is very important; VMware's FT comes to my mind here.

MapReduce
Now MapReduce takes care of the data processing. You create a job (a script or program) to access your data, and the job tracker tracks it. The job tracker takes the request to the name node, and the name node returns the relevant information from its metadata about the blocks being accessed and where they are stored. The job tracker then assigns the task to a task tracker, which does the actual work of reading the data as per the program you have written. This process is called map, because you are mapping the data rather than pulling it all to one place to work on it. There is also the concept of input splits: if your input is 100 MB, it gets split in two (because a block is 64 MB in size), so you have 2 input splits. If your program processes this data, there will be exactly 2 maps: number of input splits = number of maps.
What happens if there is a lost block?
The task tracker informs the job tracker that it is unable to access the block, and the job tracker assigns the work to another task tracker that can read the data from another data node (remember that there are three copies of the data by default), and then we have a new map. All the task trackers send a heartbeat back to the job tracker every 3 seconds to let it know they are alive and working. If a task tracker has not sent its heartbeat for 30 seconds, the same task is handed over to another task tracker on another data node. Note that the job tracker is smart enough to balance the load across the task trackers: whichever task tracker delivers faster gets more jobs. In the case of our 100 MB input, all these maps are finally combined and reduced into one output, and that is where the name MapReduce comes from. The process or service that reduces the mappers' output into one is called the reducer: number of outputs = number of reducers. The reducer can run on any of the nodes. Once the reducer has written the final output, the name node updates its metadata with the final output file and its location, and the client that requested this output reads that metadata to find the whereabouts of the final output file, contacts that data node directly, and gets the output.
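To make this concrete, here is roughly what submitting a job looks like, using the stock word-count example that ships with Hadoop; the jar path varies between distributions, and the input/output directories here are just examples:
# put some input into HDFS, run the bundled word-count MapReduce job, read the result
hdfs dfs -mkdir -p /user/demo/input
hdfs dfs -put bigfile.log /user/demo/input/
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/demo/input /user/demo/output
hdfs dfs -cat /user/demo/output/part-r-00000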
What if the job tracker is lost or dead?
Even though it is a SPOF (single point of failure), you don't have to worry about data loss: the job will be disturbed, but not the data. It is important to have a redundant server for it, though not mandatory.

Thursday, January 26, 2017

Starting with Python Django on CentOS 7

You can install the Django framework directly with yum, or even go with a beta version, but most people prefer pip with virtual environments, since you can have multiple versions of Python and Django without them affecting other projects.
Django + pip + virtualenv has all the advantages of every other method and more, so let us go with that, shall we?

First,
yum install epel-release
will give us access to the EPEL repository. I would normally also add rpmforge,
but it is not necessary for now.
yum update -y
will update your CentOS with all the relevant packages. It is always a good way to start.
Let us get pip for Python:
yum install python-pip
Once we have pip, we can install Django with
pip install django
You may check the version by running
django-admin --version
1.10.5

Then install virtualenv with

pip install virtualenv

Now that we have all the necessary components installed, let us get busy.
Let us create a new directory for our test project and move into it:
mkdir ~/project_test
cd ~/project_test
Create a new virtual environment within this project directory
virtualenv test_env
Okay, now inside this test_env you have a standalone Python and pip installed. Let us activate it so we can use it:
source test_env/bin/activate
You will be automatically moved into this new environment (your prompt changes to show it).
Let us install Django locally in this virtual environment:

pip install django

If you want to leave the virtual environment, you can deactivate it,
and activate it again when you want to get back in.
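As a quick sanity check that the environment works end to end, you can spin up a throwaway project; the project name mysite and the port are just my choices here:
django-admin startproject mysite          # creates the project skeleton
cd mysite
python manage.py migrate                  # sets up the default sqlite database
python manage.py runserver 0.0.0.0:8000   # browse to port 8000 to see the welcome page
deactivate                                # leave the virtual environment when you are done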

Saturday, November 19, 2016

Vblock, VxBlock (Cisco UCS --> Cisco UCS); VxRack (Quanta, Kylin --> Dell PowerEdge)




Chad put to rest a few untrue rumors about Dell servers replacing Cisco UCS in Vblocks and/or VxBlocks. I just want to add my few cents to that here.
·         vBlocks are shipped, sold and offered with Cisco compute and networking, VMware virtualization and EMC storage.
·         vxBlocks are shipped, sold and offered with Cisco compute and networking, VMware virtualization and EMC storage.
·         vBlocks are usually offered with the N1K, but we do ship them with VDS (most of them are shipped with VDS).
·         vxBlocks are currently offered only with VDS, to accommodate NSX or ACI.
Let us talk about vxRacks.
vxRacks were meant to address a different segment of business requirements and demands than vBlocks or vxBlocks. They are built with commodity components, but with Dell EMC's rigorous quality analysis, support, service and more. They were, and currently are, offered with Kylin and Quanta servers, but there are plans to replace those with Dell PowerEdge servers (in VxRack, and NOT in Vblocks or VxBlocks) for more awesomeness.
The majority of clients seem to ask for VDS in vBlocks and vxBlocks, to keep more possibilities open for the future.

Saturday, November 12, 2016

Multi-node OpenStack Newton installation on CentOS 7


Installing a multi-node OpenStack Newton on CentOS 7
In VMware Workstation, create a custom NAT network and a LAN segment.
Create 3 VMs with the following roles (do not power them on yet) and mount the CentOS 7 image on each:
Controller
Compute
Neutron

Using nmtui (the NetworkManager text UI), assign a private network (LAN segment) IP to each of the VMs: 10.10.10.10/24 to the controller, .20 to compute, and .30 and .40 to neutron. You only need to assign the IP address; leave everything else at its default. You may also name these interfaces: tunnel for the LAN segment, and management for the interface where the primary IP resides (from NAT on vmnic0, i.e. network interface 1).
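If you prefer the command line over the nmtui screens, something like this should do the same job on the controller; the connection name eth1 for the LAN-segment interface is an assumption, so check yours first with nmcli con show:
nmcli con mod eth1 ipv4.method manual ipv4.addresses 10.10.10.10/24   # static IP on the LAN segment
nmcli con up eth1                                                     # re-apply the connection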
Run
ip addr
on all of them, make a note of the IP addresses, and prepare host file entries like this:
<ip addr> controller.home.com controller
<ip addr> compute.home.com compute
<ip addr> neutron.home.com neutron

Note: I was using a Windows DNS server, but since the controller node was unable to reach the compute and neutron hosts by their NetBIOS/short names for some reason, the rabbitmq service always failed to start. So I resorted to host file entries.
Once you have edited /etc/hosts with the entries above, also edit /etc/hostname to include the short name at the end (as an extra measure).
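On CentOS 7 you can also set the hostname with hostnamectl instead of editing /etc/hostname by hand; run the matching command on each node:
hostnamectl set-hostname controller.home.com   # compute.home.com and neutron.home.com on the other nodes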

Now let us get these VMs ready for OpenStack.
setenforce 0
systemctl stop NetworkManager
systemctl disable NetworkManager
systemctl stop firewalld
systemctl disable firewalld
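setenforce 0 only lasts until the next reboot, so an extra optional step (not in the original walkthrough) keeps SELinux permissive afterwards as well:
# make SELinux permissive persistently so a reboot does not surprise you mid-install
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config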

On the controller node
ssh-keygen
ssh-copy-id -i /root/.ssh/id_rsa.pub root@<ip of compute node>
ssh-copy-id -i /root/.ssh/id_rsa.pub root@<ip of neutron node>
We did this to enable SSH to these hosts without having to enter the password every time. Test it by doing
ssh root@compute
ssh root@neutron
and they should work.
yum install -y https://www.rdoproject.org/repos/rdo-release.rpm
yum install -y openstack-packstack
packstack --gen-answer-file=/root/answer.txt
Now open /root/answer.txt with the vi editor:
vi /root/answer.txt
and enter the IP addresses of the compute and neutron nodes.
Wherever there is an option to set a password (especially for admin), you might want to set one.
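For reference, the host entries live in keys like the ones below; the key names are standard packstack answer-file options, while the values are left as placeholders for your own addresses and password:
grep -E "CONFIG_CONTROLLER_HOST|CONFIG_COMPUTE_HOSTS|CONFIG_NETWORK_HOSTS|CONFIG_KEYSTONE_ADMIN_PW" /root/answer.txt
# expect entries along these lines once edited:
#   CONFIG_CONTROLLER_HOST=<controller ip>
#   CONFIG_COMPUTE_HOSTS=<compute ip>
#   CONFIG_NETWORK_HOSTS=<neutron ip>
#   CONFIG_KEYSTONE_ADMIN_PW=<your admin password>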
After all this is done just run
packstack --answer-file=/root/answer.txt
 
Notes: 
If for any reason the installation fails with an error complaining that it cannot allocate memory, increase the memory on the controller node to solve it.
At the end of the installation you will see a URL for the OpenStack dashboard. Open it and enjoy.
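Packstack also drops an admin credentials file on the controller, which is handy for a quick command-line check; this assumes the openstack client is present there, which packstack normally installs:
source /root/keystonerc_admin   # load the admin credentials packstack generated
openstack service list          # a quick check that the APIs are answering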