Programming

1. Experiments with Prolog : Deontic Logic [Academic Project] 

In this project, we (thanks to all my group buddies!) implemented a Deontic Assessment Engine for decision making (no big deal though, just a Prolog program: modelling + a rule base). PROLOG is a language for logic programming and symbolic computation, used in AI applications (the other classic being LISP). SWI-Prolog was used for this project. The situation is as follows: students are required to take up a course (APT) according to some prerequisites and conditions related to their previous academic performance. The main factors are marks and attendance (for simplicity we consider these two only). Based on these factors, students are assigned permissions in terms of deontic notions or states (like obligatory, optional etc.). To create a data set of student performance (for the demo), we decided to conduct an online quiz with our friends and use the results as input data for the analysis (a hypothetical situation).

Some definitions: "Deontic logic is the branch of symbolic logic that has been most concerned with the contribution that notions like obligation and permission make to what follows from what"; "Deontic logic is the logic that deals with actual as well as ideal behaviour of systems"; "Deontic logic is the field of philosophical logic that is concerned with obligation, permission, and related concepts. Alternatively, a deontic logic is a formal system that attempts to capture the essential logical features of these concepts." This concept was used to model the above situation, using a Prolog rule-based system to capture the essential deontic notions. A short summary, along with the project results, is shown below.

  1. A short presentation

2. Code and Results

Applications include: legal automation, e-contracting, database security policies, authorisation mechanisms etc.
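Just to give a flavour of the rule base (sketched in Python here rather than Prolog, and with made-up thresholds and state names; the actual project used its own criteria), the classification looks something like this:

```python
# A toy sketch of the deontic classification (hypothetical thresholds;
# the actual Prolog rule base used its own criteria).
def deontic_status(marks, attendance):
    """Map a student's marks and attendance to a deontic state for taking APT."""
    if attendance < 75:
        return "forbidden"    # prerequisite not met: not permitted to take APT
    if marks < 40:
        return "obligatory"   # must take the course
    if marks < 75:
        return "optional"     # permitted, but not required
    return "omissible"        # permitted to skip

print(deontic_status(35, 80))   # obligatory
```

In Prolog the same thing becomes a handful of facts and rules, and the engine answers queries like "which students are obliged to take APT?" for free.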

For more details :-

  1. https://plato.stanford.edu/entries/logic-deontic/
  2. http://sharif.edu/~amini/files/presentations/DeonticLogic.pdf
  3. http://www.learnprolognow.org/
  4. Prolog in a Nutshell
  5. https://pdfs.semanticscholar.org/9c33/4e2182666c3e504ecea7f48aeb44135ad932.pdf

                                                                                                                            [ Date: 25th March ’17 ]


2. Brainfuck : The Craziest Programming Language  

++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++
..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.

WTF!!.. This is just ‘Hello World!’??…Yeah, this is Brainfuck!! 😵

Ever heard of such a thing?? Surely it looks absurd… (From here onwards I will refer to it as BF… you know, to maintain the decorum.) Once you start experimenting with it, it gets interesting… It reminds you of your UG Theory of Computation (TOC) and Compiler Design (CD) courses. It would have been interesting if this had been taught at that time… On a lighter note, you could have really understood and implemented some of the CD concepts with this language. Let's just take a whirlwind tour… just some interesting facts about BF.

  • It is an esoteric programming language created in 1993 by Urban Müller.
  • One of the smallest languages (probably the smallest w.r.t. compiler size: about 240 bytes).
  • It is Turing complete, yet one of the most minimalistic programming languages.
  • Large, complex programs are difficult to comprehend; inconvenient & inefficient.
  • Not used for practical applications.
  • Its computational model is similar to a Turing machine.
  • It consists of 8 operators, including two I/O operators and memory-manipulation operators.
  • Influenced by P′′, FALSE etc.

The operators :-

a) > = increases the memory pointer, i.e. moves the pointer 1 block to the right.
b) < = decreases the memory pointer, i.e. moves the pointer 1 block to the left.
c) + = increments the value stored at the block pointed to by the memory pointer.
d) - = decrements the value stored at the block pointed to by the memory pointer.
e) [ = like a C while(cur_block_value != 0) loop; skips to the matching ] if the current block's value is zero.
f) ] = if the value of the block currently pointed to is not zero, jump back to the matching [.
g) , = like C's getchar(); inputs 1 character.
h) . = like C's putchar(); prints 1 character to the console.

The model:-

  1. It consists of a 30,000-byte array as the tape; each cell (a byte) is initially 0.
  2. It has an instruction pointer and a data pointer.
  3. Two standard streams for I/O (read and write a byte).
  4. The data pointer initially points to the leftmost byte (and can move one cell left or right at a time).
  5. Instructions and data are separated.
  6. The memory operators increment/decrement a byte/cell by one at a time.
  7. BF commands are executed sequentially as the instruction pointer moves to the next one.
  8. Loops are implemented with '[' and ']'; nested loops are possible.
  9. Any other symbols are treated as comments.
  10. The program terminates when the instruction pointer moves past the last command.
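The model above is small enough to implement in a few lines. Here is a minimal interpreter sketch in Python (bracket matching is precomputed; cell values are assumed to wrap at 256, as in most implementations):

```python
def brainfuck(code, inp=""):
    """Minimal Brainfuck interpreter: 30,000-cell tape, byte cells, 8 operators."""
    tape, dp, out = [0] * 30000, 0, []
    inp = iter(inp)
    # precompute matching brackets for the two loop operators
    jump, stack = {}, []
    for i, c in enumerate(code):
        if c == '[':
            stack.append(i)
        elif c == ']':
            j = stack.pop()
            jump[i], jump[j] = j, i
    ip = 0
    while ip < len(code):
        c = code[ip]
        if c == '>':   dp += 1
        elif c == '<': dp -= 1
        elif c == '+': tape[dp] = (tape[dp] + 1) % 256
        elif c == '-': tape[dp] = (tape[dp] - 1) % 256
        elif c == '.': out.append(chr(tape[dp]))
        elif c == ',': tape[dp] = ord(next(inp, '\0'))
        elif c == '[' and tape[dp] == 0: ip = jump[ip]
        elif c == ']' and tape[dp] != 0: ip = jump[ip]
        # any other character is a comment
        ip += 1
    return ''.join(out)

print(brainfuck("++++++++[>+++++++++<-]>."))  # prints 'H' (8 * 9 = 72)
```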

Here are some sample programs, to experiment with..

Visual Brainfuck is a simple IDE to start with..

I.Hello World!

+++++ +++++ initialize counter (cell #0) to 10
[ set the next four cells to 70 100 30 and 10 respectively
> +++++ ++ add 7 to cell #1
> +++++ +++++ add 10 to cell #2
> +++ add 3 to cell #3
> + add 1 to cell #4
<<<< - decrement counter (cell #0)
]
> ++ . print 'H' (H = ASC (72))
> + . print 'e' (e = ASC (101))
+++++ ++ . print 'l'
. print 'l'
+++ . print 'o'
> ++ . print ' '
<< +++++ +++++ +++++ . print 'W'
> . print 'o'
+++ . print 'r'
----- - . print 'l'
----- --- . print 'd'
> + . print '!'
> . print '\n'

II. Length of Input string

Input : A string of characters with a trailing \0 symbol eg.: abcdef\0.

Output: Same string with string length appended to it eg: abcdef6.

+++++ +++
[
> +++++ +
< -
]
>>>
,.
[
[<]<+
>>[>]
,.
]
<[<]<.
Visual Brainfuck : lenofip.bf – Program & Output

III. Convert to Upper-case

,.
[
[<]<+
>>[>]
,.
]
<[
[-------- --------
-------- -------- .>]

IV. Reverse a string

,
[
[<]<+
>>[>]
,
]
<
[.<]

V. Check if a number is even

It accepts an even number, i.e. it goes into an accepting state; but it doesn't halt for an odd number, i.e. it goes into an infinite loop.

,.
[
[<]<+
>>[>]
,.
]
+++++ +++++
[
> +++++ +++++
< -
]
<
[--] >> +.

Try decoding the code… I intentionally did not include the comments… 😧 But here are some links to help you get started:

  1. https://docs.google.com/document/d/1M51AYmDR1Q9UBsoTrGysvuzar2_Hx69Hz14tsQXWV6M/edit#
  2. http://hax.tor.hu/read/tutor/brainfuck.txt
  3. https://nieko.net/projects/brainfuck
  4. http://calmerthanyouare.org/2016/01/14/control-flow-in-brainfuck.html
  5. http://www.hevanet.com/cristofd/brainfuck/

Last but not the least..the most interesting part ..

Check out this link..>>>>BF<<<<< {Warning offensive content 😛}

NB: Open challenge: write programs to prove its Turing completeness. Hint: first prove it is as powerful as an NPDA, e.g. a program for palindromes; then try a program for a^n b^n c^n (i.e. for TM power). Warning: I still haven't figured it out… I was so mad that I blew my top and switched off my laptop… OMG!! What time is it? 3:30 am… Signing out!!.. Best of Luck!!!……

[ Date: 26th March ’17 ]


3. SOES

Stock Order Execution System!
Problem: A stock order is an order to buy/sell a given quantity of stocks of a specified company. A person willing to buy or sell a stock submits an order to a stock exchange, where it is executed against an opposite-side order for the same company, i.e. a buy order is executed against an existing sell order and vice versa. The criteria for stock order execution are that the orders should belong to the same company, be on opposite sides (Buy vs Sell), and be matched in order of arrival, i.e. an order is executed against the first available order. The leftover quantity after execution is called the remaining quantity. For example, if a buy order of quantity 10 is executed against a sell order of quantity 5, the remaining quantities of the buy and sell orders are 5 and 0 respectively. An order's status is OPEN if the remaining quantity is greater than zero (>0); otherwise it is CLOSED (i.e. remaining quantity = 0). Implement a stock order execution system which takes input orders from a given CSV (SOES – Input.csv), processes them, and prints the status and remaining quantity of all the orders as output.

Sample Input:-

Stock Id Side Company Quantity
1 Buy ABC 10
2 Sell XYZ 15
3 Sell ABC 13
4 Buy XYZ 10
5 Buy XYZ 8

Sample output:-

Stock Id,Side,Company,Quantity,Remaining Quantity,Status

1,Buy, ABC, 10, 0, Closed
2,Sell, XYZ, 15, 0, Closed
3,Sell, ABC, 13, 3, Open
4,Buy, XYZ, 10, 0, Closed
5,Buy, XYZ, 8, 3, Open
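The core matching logic can be sketched as below (a simplified in-memory version; function and field names are my own, and the repo version reads a CSV instead):

```python
# Sketch of the core matching logic (simplified in-memory version;
# function and field names are my own).
def execute_orders(orders):
    """orders: list of (order_id, side, company, quantity) in arrival order.
    Returns (order_id, side, company, quantity, remaining, status) per order."""
    book = []  # all orders seen so far: [order_id, side, company, remaining]
    for oid, side, company, qty in orders:
        remaining = qty
        for entry in book:  # oldest first: execute against the first available order
            if entry[2] == company and entry[1] != side and entry[3] > 0:
                traded = min(remaining, entry[3])
                entry[3] -= traded
                remaining -= traded
                if remaining == 0:
                    break
        book.append([oid, side, company, remaining])
    return [(oid, side, comp, qty, rem, "Open" if rem > 0 else "Closed")
            for (oid, side, comp, qty), (_, _, _, rem) in zip(orders, book)]

sample = [(1, "Buy", "ABC", 10), (2, "Sell", "XYZ", 15), (3, "Sell", "ABC", 13),
          (4, "Buy", "XYZ", 10), (5, "Buy", "XYZ", 8)]
for row in execute_orders(sample):
    print(*row, sep=",")   # matches the sample output above
```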

A lame approach in python can be found on my github repo…

Github: https://github.com/anilsathyan7/SOES

Still trying for a better solution..

Screenshot of Output:-

Output

[ Date: 19th May ’17 ]


4. Speeding up Python: NUMBA!!!

Python is known for its slow for loops: Python code is dynamically typed and interpreted, not compiled, and the lack of type information leads to a lot of indirection and extra code. But is there any way to increase its speed??

Some techniques include using numpy, itertools, generator expression,cython etc..

Here we are using a JIT compiler for Python (built on LLVM) to speed up the code, i.e. Numba.

According to its project page,

Numba is an Open Source NumPy-aware optimizing compiler for Python sponsored by Continuum Analytics, Inc. It uses the LLVM compiler infrastructure to compile Python syntax to machine code.

It is aware of NumPy arrays as typed memory regions and so can speed-up code using NumPy arrays. Other, less well-typed code will be translated to Python C-API calls effectively removing the “interpreter” but not removing the dynamic indirection.

Numba is also not a tracing JIT. It compiles code before it gets run either using run-time type information or type information provided in a decorator.

Numba is a mechanism for producing machine code from Python syntax and typed data structures such as those that exist in NumPy.

Let's try a simple matrix multiplication example…

Here is the code

import time
from numba import jit
import numpy as np

# input matrices
matrix1 = np.random.rand(30,30)
matrix2 = np.random.rand(30,30)
rmatrix = np.zeros(shape=(30,30))

# multiplication function
@jit('void(float64[:,:],float64[:,:],float64[:,:])')
def matmul(matrix1,matrix2,rmatrix):
    for i in range(len(matrix1)):
        for j in range(len(matrix2[0])):
            for k in range(len(matrix2)):
                rmatrix[i][j] += matrix1[i][k] * matrix2[k][j]

# calculate running time
start = time.clock()
matmul(matrix1,matrix2,rmatrix)
end = time.clock()

# print results
print(end - start)
for r in rmatrix:
    print(r)

The time taken for the above program was 0.031190446978 s.

Now add a decorator, as shown below, just before the function definition:

@jit('void(float64[:,:],float64[:,:],float64[:,:])')

Run the program again and compare the elapsed time in both cases.

In the modified Numba approach, the time taken was found to be 0.000252382354135 s, i.e. roughly a 124x speedup… Wow!! It just takes a few imports and a decorator!! Cool…

Now, that's the Numba philosophy: "Don't wrap or re-write; just decorate!!!"

Try Out Numba :-

  1.  llvmlite: http://www.lfd.uci.edu/~gohlke/pythonlibs/#llvmlite
  2. numpy: http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy

Install the downloaded python packages (.whl) with pip (update to latest version).

NB: Check the architecture(32/64bit) and python version according to your system.

References:-

  1. http://numba.pydata.org/
  2. https://www.slideshare.net/teoliphant/numba-siam-2013

[ Date: 20th May ’17 ]


5. GPGPU: Programming Massively Parallel Processors- The Real McCoy!!

General-purpose computing on graphics processing units refers to the use of the GPU for handling computing tasks traditionally handled by the CPU. It relies on the architecture of GPUs, which allows for massive parallelization of computations, resulting in significant speed-up or performance improvement. GPUs are ideal for vector processing and have unmatched 'pixel crunching power' with regard to image processing operations. Another jargon term in this domain is heterogeneous computing, which refers to systems that use more than one kind of processor or core. Such systems use specialized co-processors to improve overall system performance and energy efficiency.

Now let's jump into Python once again…

For speeding up Python we can use PyCUDA, with Python wrappers for CUDA C/C++ and related APIs. Another approach is to write CUDA directly in Python, i.e. CUDA development in Python syntax: Numba!!! Yet again…

Here, we are concentrating on the latter approach…

As usual, let's learn the ropes…

Terms and Terminologies:-

  1. Host: the CPU and its memory (host memory).
  2. Device: the GPU and its memory (device memory).
  3. griddim: this variable contains the dimensions of the grid.
  4. blockdim: this variable contains the dimensions of the block.
  5. Kernels: parallel programs to be run on the device.
  6. A number of primitive 'threads' simultaneously execute a kernel program.
  7. Batches of these primitive threads are organized into 'thread blocks'.
  8. A 'grid' is a collection of thread blocks of the same thread dimensionality.
  9. Each thread within a thread block can communicate efficiently using the shared memory scoped to that thread block.
  10. Thread blocks within a grid may not communicate via shared memory.

P.S: https://llpanorama.wordpress.com/2008/06/11/threads-and-blocks-and-grids-oh-my/

clip_image004

Fig. 1: Thread hierarchy

Now, the program…

This is a simple program to add two arrays and calculate their sum. Additionally, we record the time taken for processing the arrays. Initially we test the logic using a plain Python-NumPy approach, and later implement the same for the GPU, i.e. CUDA. Finally, we compare the results of both approaches and benchmark them.

CODE:https://github.com/anilsathyan7/cuda_numba

To test the limits of our system we are going for a max-load test…

After trying different configurations for block and grid dimensions, it was found that the maximum number of threads in a block for this GPU (Nvidia 940MX: CUDA capability 5.0, 384 CUDA cores, 4 GB) was 1024 (the same for other similar devices), with a grid dimension of about 1.5 lakh (i.e. the number of blocks in a grid). Here, since we are dealing with a 1D array, we limit our dimensions to 1D (standard).

Now some math…

Here the array size = blockdim * griddim, i.e. 1024 * 146432…

This is equal to 149946368.

Each item of the array is of type float64, which implies its size is 8 bytes. (Use: print(np.dtype(np.float64).itemsize))

Thus the total size of one array = 1199570944 bytes (8 * 149946368)…

i.e. almost 1.2 GB. (Hope there is no mistake!)
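The arithmetic above is easy to double-check (a quick sanity check, nothing more):

```python
# double-checking the memory math above
blockdim, griddim = 1024, 146432
n = blockdim * griddim       # array size
print(n)                     # 149946368
size_bytes = n * 8           # float64 = 8 bytes per element
print(size_bytes)            # 1199570944 bytes, i.e. ~1.2 GB
```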

Now for the results…

‘N’ gives the array size and time elapsed is shown below in the output screenshot.

numba-cuda

Is time shown in seconds?? Believe me!!!.. Yes..

We have Almost 112x Speedup…WOW!!!

Even after taking into account the time taken for moving data to and from the GPU (inclusive time is shown above), we have more than a 100x speed-up!!

That’s CUDA..

“If you were plowing a field, which would you rather use?… Two strong oxen or 1024 chickens?”-Seymour Cray

Our results thus suggest that Cray's quote has been made obsolete by changes in technology, and trotting it out yet again just displays ignorance of those changes…

So its time to ‘GO PARALLEL!!!   &  XLR8!!!!’

Finally, The Benchmark!!!

gpu_vs_cpu

Fig. 2: Performance Comparison: Speed-up

From the above graph it can be inferred that the speed-up increases significantly (shoots up) as the input becomes larger, and is less prominent for smaller inputs. (Actually, view it inverted!!)

References:-

  1. http://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf
  2. https://www.slideshare.net/teoliphant/numba-siam-2013
  3. https://github.com/anilsathyan7/numbapro-examples/blob/master/cudajit/sum.py
  4. https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/
  5. https://docs.continuum.io/numbapro/CUDAintro

NB: Check out Numba parallel for loops, ufuncs and other interesting stuff… It's easy!!!

[ Date: 20th May ’17 ]


6. Steganography: Prime Component Alteration Technique

Steganography is the science with the help of which secret or confidential data is hidden within media like text, images, audio, video or network protocols. As privacy concerns continue to grow, it is in widespread use because it enables one to hide secret data in cover images.

Now, Prime Component Alteration Technique?? Never heard of it before? Sounds like a neologism!!! Yes it is… a new term coined by me 😎!!!

So, what is it? Here we present a new algorithm for hiding text or an image inside another image, i.e. a steganographic algorithm……

The proposed technique:-

The proposed steganographic algorithm is basically an RGB-based prime-pixel alteration technique, applicable to both images and text. First, a message (either an image or a text) is taken as input. The message is encrypted with the key, and only after proper encryption is it embedded in the cover image. Explicitly, the process involves saving each of the data bits (text, or an input image converted into strings) in the prime pixel locations of the Red, Green and Blue components: groups of 3 random numbers, or prime triplets. Since the key is of a fixed size, the key bits are stored in coprime pixel locations. The key bits are stored in the Red, Green and Blue components one at a time, in a cyclic manner, for enhanced security. For decryption, i.e. decoding the original message with quality fully retained and without any third-party interception, the reverse of the encryption process is followed. This can be better understood through the block diagram described below.

setgo_proposed

Fig. 1: Schematic Diagram of proposed technique

Working:-
The basic functional block diagram of the proposed method is shown in Figure 1. First, the payload data (image or text) is encrypted before embedding. Then the key, along with the encrypted message, is embedded using the proposed technique. Larger cover images are chosen for better hiding capacity. The resulting stego-file is transmitted via any communication channel to the intended receiver. At the receiving end, the receiver extracts the key from the stego file with the help of the shared secret data (the prime triplets). Using the key, the extracted information is decrypted to get back the original message. The cover image is obtained as a byproduct. The above schematic diagram provides an outline of the basic working of the steganographic technique. The implementation can be done in different ways depending on the type of encryption, the data type, the media etc. A 'micro-pixel view' of the working of the same technique is demonstrated below with the help of an animation.

pc_alter
Fig. 2: A micro-pixel view of PCA steganography

A prime triplet, in mathematics, is defined as a set of three prime numbers of the form (p, p + 2, p + 6) or (p, p + 4, p + 6), for example ordered triplets like (5, 7, 11), (7, 11, 13), (11, 13, 17), (13, 17, 19), (17, 19, 23), (37, 41, 43) etc. In our algorithm it can be any 3 random numbers, or numbers of the form just mentioned.

In this algorithm, the image is converted into base64-encoded format for easier processing. Then the specific key is used to encrypt the data using the XOR operation; XOR can provide moderate security as long as the key is not compromised. The data is converted to binary format so as to easily embed it inside the component bits of the pixels. A delimiter, 0xFFFE in binary format, is used to identify the end of the message. At each iteration of the process described above, counters are maintained for the pixels of the cover image and for the data and key, so that the iteration can be stopped once the entire key and data have been embedded inside the cover image. The key is embedded inside the pixels whose numbers are coprime to the prime triplets. The component positions in which the key bits are hidden (within the pixels of the cover image) are changed cyclically; this adds some 'confusion' to the encryption scheme. The main aim of steganography is to conceal the existence of data inside a cover image. With this additional security, even if the existence of the message is found out, an attacker will be unable to extract the message in unencrypted form.
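To make the building blocks concrete, here is a toy sketch in Python of XOR encryption plus LSB embedding at prime positions. This is NOT the paper's exact algorithm (no triplets, no cyclic component rotation, no delimiter), just the two core ideas in miniature; all names here are my own:

```python
# Toy illustration of two building blocks of the technique: XOR encryption
# and storing data bits in the LSBs of prime-indexed components.
# This is NOT the exact algorithm from the paper, just a simplified sketch.
from itertools import cycle

def xor_crypt(data: bytes, key: bytes) -> bytes:
    # XOR is its own inverse, so the same function encrypts and decrypts
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def embed(cover, bits):
    # hide one bit in the LSB of each prime-indexed component of the cover
    cover = cover[:]
    primes = (i for i in range(len(cover)) if is_prime(i))
    for bit, i in zip(bits, primes):
        cover[i] = (cover[i] & ~1) | bit
    return cover

def extract(stego, nbits):
    primes = (i for i in range(len(stego)) if is_prime(i))
    return [stego[i] & 1 for _, i in zip(range(nbits), primes)]

msg = xor_crypt(b"hi", b"key")                            # encrypt
bits = [(byte >> k) & 1 for byte in msg for k in range(8)]
stego = embed(list(range(200)), bits)                     # hide in "pixels"
assert extract(stego, len(bits)) == bits                  # recover losslessly
```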

Results:-

In the end we were able to attain a 22% hiding capacity w.r.t. the cover image, and we successfully extracted the original image from the stego image. The asymptotic time complexity of the entire algorithm is T(m,n) = O(m+n), where m and n are the sizes of the message and the key respectively. Most of the alterations take place in the blue component because it has the least effect on the imperceptibility of the image. In addition, this steganographic technique, combined with encryption, has better resistance to steganalysis than conventional techniques like LSB substitution.

Future work includes extending the algorithm to different image formats (like JPEG, BMP, etc.) and media formats. The algorithm can also be explored further using other colour domains like HSI, YCbCr, etc.

sample_images
Fig. 3: Sample test images

For further information, refer to my paper:-

“Sathyan, A., Thirugnanam, M., & Hazra, S. (2016). A NOVEL RGB BASED STEGANOGRAPHY USING PRIME COMPONENT ALTERATION TECHNIQUE. IIOAB JOURNAL, 7(5), 58-73.”

Link: http://www.iioab.org/articles/IIOABJ_7.5_58-73.pdf

 [IIOAB Journal,T&R indexed]                                                

Acknowledgement :-  

Special thanks to Project partner: Sumit Hazra & Project Guide: Mythili T.                                                                                                       

Steganography: “What your eyes don’t see”

Applications :- These include watermarking, corporate espionage, defence and intelligence (govt.) handling of secret and sensitive data, detecting unauthorized use of CDs/DVDs, etc.

[ Date: 21st May ’17 ]


7. Functional Programming: LISP

LISP (LISt Processor) is a functional programming language designed by John McCarthy in the year 1958. Ever since its inception, many dialects of Lisp (Scheme, Common Lisp etc.) have become popular, particularly in the field of AI and research. It is the second-oldest high-level programming language, after FORTRAN.

The first thing you will notice about LISP is "(Lots (of (Silly) (Irritating) Parentheses))"!! The next thing is the recursive nature and structure of the language. It relies on prefix notation for operators/functions, and it has specialized functions for easy and efficient list processing. Once you pick up a little of the syntax and semantics of Lisp, it is easy (and interesting) to write code on your own. LISP has lots of built-in functions to make your job easier. (I would recommend implementing them [at least the easy ones] on your own initially… just to get that 'LISPTIC' feel!!!)

“LISP is worth learning for the profound enlightenment experience you will have when you finally get it; that experience will make you a better programmer for the rest of your days, even if you never actually use LISP itself a lot “

– Eric S. Raymond, "How to Become a Hacker"

Now, What is functional programming language??

“In computer science, functional programming is a programming paradigm—a style of building the structure and elements of computer programs—that treats computation as the evaluation of mathematical functions and avoids changing-state and mutable data”     -Wiki

In short, programming without using assignment is called functional programming, whereas programming with extensive use of assignment is called imperative programming (like in C). Thus in Lisp we see extensive use of functions, recursion, nested functions etc. The functional programming model eliminates side effects: changes in state that do not depend on the function inputs. It has its roots in the lambda calculus, whose history dates back to the early 20th century.
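Since the Lisp snippets below are images, here is the contrast in Python for illustration: the same factorial written functionally (no assignment at all) and imperatively (an accumulator reassigned on every step):

```python
# functional style: no assignment, no mutation; the function is just
# a mathematical definition applied recursively
def fact(n):
    return 1 if n == 0 else n * fact(n - 1)

# imperative style: an accumulator variable is repeatedly reassigned
def fact_imperative(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

print(fact(5), fact_imperative(5))   # 120 120
```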

You may use CLISP/GCL on Windows or UNIX (also LispWorks for Windows). It is fairly easy to install and get running, within a few mouse clicks.

Now let's get into coding stuff (Oh yes!!… no more dialogues!!). Let's follow a different, top-down approach: first we go through some small 'code-lets' (Oh great… c'mon, I know Tutorialspoint!!), and at the end I have provided sufficient references/books for further exploration…

  1. The Basics ..

lisp_1

2. Conditionals ..

lisp_2

3. Classical : Factorial-Recursion

lisp_3

4. Tracing recursion with : ( trace fact ) 

lisp_4

5. Swap and List operations ..

lisp_5

6. Sorting: Custom

lisp_6_cut

7. Factorial: Good Old Ways ..

lisp_7

8. Fibonacci ..

lisp_8

9. Factorial ..One more Time..

lisp_9

10.  Set Operations ..

lisp_10_cont

lisp_10_2cont

11. Membership checking ..

lisp_11

12. Set : Symmetric difference ..

lisp_12

All the codes are shown as images..  Sorry, no easy copying !!!

But don’t worry;download it from my github repo!!…

Tribute.. To my friend who inspired me …An email thread.. 10/17/12 …..

"""
Hello,

So here is a classic AI programme in LISP. The  program is called ELIZA. It is a chatting programme . ELIZA is a computer programme that talks to you and you can also talk to her.
Above links gives the complete programme .
As usual just copy the programme to some files say  eliza.lisp       AND add the line
(ELIZA)
at the end of the file and save
then
invoke the  file
$lisp eliza.lisp , and start chatting with the lovely lady 🙂 You can see beauty if you read the code 🙂

“””

Similarly, Tic-Tac-Toe:-

http://ftp.ics.uci.edu/pub/machine-learning-programs/Introductory-AI/programs/tictactoe.lisp

“To iterate is human, to recurse is divine!! “

References & Books:- 

  1. Tutorialspoint: https://www.tutorialspoint.com/lisp/lisp_quick_guide.htm
  2. https://github.com/anilsathyan7/lisp/blob/master/clisp.pdf (must read)
  3. http://web.mit.edu/rlm/Public/lisp/lisp.pdf
  4. http://lispinsummerprojects.org/LearningLisp
  5. Land of Lisp
  6. MIT OCW
  7. http://www.cliki.net/
  8. Graham, P. (2004). Hackers & painters: big ideas from the computer age. ” O’Reilly Media, Inc.”.
  9. Abelson, H., Sussman, G. J., & Sussman, J. (1996). Structure and interpretation of computer programs. Justin Kelly.
  10. Felleisen, M. (2001). How to design programs: an introduction to programming and computing. MIT Press.
  11. Friedman, D. P., & Felleisen, M. (1989). The little LISPER. 3 Auflage: Science Research Associates.
  12. Brooks Jr, F. P. (1995). The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition, 2/E. Pearson Education India.
  13. Martin, R. C. (2009). Clean code: a handbook of agile software craftsmanship. Pearson Education.
  14. Evans, E. (2004). Domain-driven design: tackling complexity in the heart of software. Addison-Wesley Professional.
  15. Hunt, A. (2000). The pragmatic programmer. Pearson Education India.
  16. Fowler, M., & Beck, K. (1999). Refactoring: improving the design of existing code. Addison-Wesley Professional.
  17. Freeman, E., Freeman, E., Robson, E., Bates, B., & Sierra, K. (2004). Head first design patterns. ” O’Reilly Media, Inc.”
  18. Knuth, D., & TheArtofComputerProgramming, S. (1973). Vol. 3. Reading: Addison-Wesley, 506-549.
  19. Kernighan, B. W., & Ritchie, D. M. (2006). The C programming language.
  20. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web
  21. Hyde, R. (2010). The art of assembly language. No Starch Press.

(Most of them are… still in my reading list!! 😝 Ahh!!… Lots of books to read!!! 🤑)

For applications: http://wiki.c2.com/?CommercialLispApplications

[ Date: 22nd May ’17 ]


8. Turning the Tables: From Arrays to Graphs – Thoughts on Spreadsheet Modelling..

Everyone must have used spreadsheets (Excel!!) at least once during their 'tech-life' in academia or industry, regardless of whether they are science, commerce or arts majors. Now, have you ever thought about how these 'tables' are implemented in applications like Microsoft Excel, OpenOffice or Google Sheets?? What techniques and data structures (at least a broad idea) are actually used to implement them?? Well, if you are a CS student, you should!!!

Let's see what Wikipedia has to say about spreadsheets!!

“A spreadsheet is an interactive computer application for organization, analysis and storage of data in tabular form. Spreadsheets are developed as computerized simulations of paper accounting worksheets.The program operates on data entered in cells of a table. Each cell may contain either numeric or text data, or the results of formulas that automatically calculate and display a value based on the contents of other cells.”

For those with little experience, spreadsheets are a rectangular grid of cells into which data is entered. The data may be raw numbers, strings, or formulas which most frequently perform arithmetic operations on the contents of other cells to produce a new value. The main feature of interest for spreadsheet implementation is that formulas automatically update when the cells on which they depend change contents.

Now, the major concerns are how the data can be stored and processed efficiently, and what techniques & data structures can be used for this purpose. Anyway, no coding this time!!! 😭

Let's start with the basic DSs… first one: arrays…

At first glance, arrays seem to be the ideal candidate for implementing spreadsheets. After all, a spreadsheet looks like a 2D array or a matrix.

java-2d-array
Fig. 1: 2D Array – Abstract View

But arrays lack the flexibility of dynamic memory allocation (oh… vs. linked lists!!). Also, if most of the allocated memory is not used, there is wastage of memory. OK, then what about a sparse matrix representation?? A multi-dimensional array representation?? Well, they are not considerably better than the normal approach!…

Operations on arrays/matrices seem pretty straightforward: to process one row or column, keep one index constant and iterate over the other. Now, for the 'automatic update', we would have to implement a listener function called every time a cell value gets modified (or is there a better approach?). Also, think about what happens if we delete/insert a row or column: ideally we would have to shift everything back and forth. Hmm… not good!!!

Even though this method seems easy and straightforward, we should try for a new, better approach…

Enter -> Linked List.. Our Hero!!

We know (at least from our boring DS class!! 😜) that linked lists have some advantages over arrays (do they always??). Yes… now I remember… something like dynamic memory allocation, or what?? Why can't we implement the spreadsheet as a matrix of doubly linked lists? It would be a better option considering the memory-allocation strategy (linked lists allow on-the-fly memory!!) and the flexibility of modification at runtime (insert/delete etc.). Also, we can iterate through lists to perform arithmetic operations easily (evaluation of a polynomial using a singly linked list… while(temp->next != NULL)… etc. Ring any bells??). But here too we have some trade-offs: the overhead of maintaining and storing pointers (and the added complexity). Still, this would be a better approach (don't you think so?).

Fig. 2: Doubly Linked List
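A minimal doubly linked list sketch (hypothetical names), showing the O(1) unlink that arrays can't give us:

```python
# Doubly linked list of cell values: deletion is pointer surgery, no shifting.
class Node:
    def __init__(self, value):
        self.value, self.prev, self.next = value, None, None

class DoublyLinkedRow:
    def __init__(self):
        self.head = self.tail = None

    def append(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = self.tail = node
        else:
            self.tail.next, node.prev = node, self.tail
            self.tail = node
        return node                              # keep the handle for O(1) delete later

    def delete(self, node):                      # relink the neighbours; nothing moves
        if node.prev: node.prev.next = node.next
        else: self.head = node.next
        if node.next: node.next.prev = node.prev
        else: self.tail = node.prev

    def values(self):                            # the while(temp->next != NULL) walk
        out, cur = [], self.head
        while cur:
            out.append(cur.value)
            cur = cur.next
        return out
```

Usage: append returns the node, so a later delete(node) needs no search at all.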

Let's try a modified approach.. Hashing (Separate Chaining)…

Oh my hashing!!.. long time no see….

Here I am..

Fig. 3: Hashing with Chaining

This looks even better.. A head-pointer array.. interesting!! With hashing, accessing elements is even faster: give the row/column as the key and then iterate through the linked list for that row/column. To delete a row, just delete its head pointer and recycle/free the memory dynamically. Overall: better flexibility, lower memory requirements, and we still have the dynamism!!! Expression evaluation/data updating is also better than the naive array implementation.
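In Python, the head-pointer array can be sketched with a dict (Python's built-in hash table) keyed by row index — a rough stand-in for the chained structure in the figure, with made-up names:

```python
# Sparse sheet: dict of row -> {col: value}; empty cells cost nothing.
class SparseSheet:
    def __init__(self):
        self.rows = {}                            # hash table: row index -> bucket of cells

    def set(self, r, c, value):
        self.rows.setdefault(r, {})[c] = value

    def get(self, r, c, default=0):
        return self.rows.get(r, {}).get(c, default)

    def delete_row(self, r):
        self.rows.pop(r, None)                    # drop the 'head pointer'; no shifting

    def row_sum(self, r):
        return sum(self.rows.get(r, {}).values())
```

Deleting a row is a single dict operation, and memory for the bucket is reclaimed by the garbage collector.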

Finally, the ideal approach would be something like this…

A spreadsheet can be modeled as a DAG, with a vertex for each cell and an edge whenever the formula in one cell uses the value from another; a topological ordering of this DAG can be used to update all cell values when the spreadsheet is changed. Similarly, topological orderings of DAGs can be used to order the compilation operations in a makefile. When one cell of a spreadsheet changes, it is necessary to recalculate the values of other cells that depend directly or indirectly on the changed cell. For this problem, the tasks to be scheduled are the recalculations of the values of individual cells of the spreadsheet. Dependencies arise when an expression in one cell uses a value from another cell. In such a case, the value that is used must be recalculated earlier than the expression that uses it. Topologically ordering the dependency graph, and using this topological order to schedule the cell updates, allows the whole spreadsheet to be updated with only a single evaluation per cell.

To be a little more technical…

There are two main options for your “primary data structure”:
– table (2D list) — if you expect most of your cells to be filled;
this is simple, but uses lots of memory.
– sparse table (list/dict of cells) — if you have a large table that
is mostly empty cells.

The dependency tree is the “side data structure”: a “Cell” is a “Node”; a Cell’s “dependencies” are the Node’s “children”; a Cell’s “dependants” are the Node’s “parents”. A Cell maintains a list of dependencies (or generates it on request) by analyzing its “formula” attribute for references to other cells. A dependency graph has a vertex for each object to be updated, and an edge connecting two objects whenever one of them needs to be updated earlier than the other.

Strictly speaking, this “side data structure” is not a “tree”, as there are multiple root nodes; it is a “Directed Acyclic Graph”. However, given a single cell as a root, that cell’s dependencies do form a proper tree.

A cell has a “user-entered formula” and an “effective value”. A Cell containing the formula “abc” has a value of “abc”; a cell containing the formula “=1+5” has a value of “6”. You could use the ‘property’ decorator for the “effective value” attribute, and a parser to evaluate formulas like 3*A1+A2 (e.g. PyParsing, regex) [a little Python-oriented].
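A hedged sketch of that Cell idea (hypothetical names; eval() stands in for a real parser and is unsafe on untrusted input):

```python
import re

class Cell:
    def __init__(self, sheet, formula=""):
        self.sheet = sheet          # dict mapping names like "A1" to Cell objects
        self.formula = formula      # the user-entered formula

    def dependencies(self):
        # find references such as A1, B12 by scanning the formula text
        return re.findall(r"[A-Z]+[0-9]+", self.formula)

    @property
    def value(self):                # the "effective value", computed on demand
        if not self.formula.startswith("="):
            return self.formula     # plain text: the value is the formula itself
        env = {name: self.sheet[name].value for name in self.dependencies()}
        return eval(self.formula[1:], {"__builtins__": {}}, env)

sheet = {}
sheet["A1"] = Cell(sheet, "=1+5")
sheet["A2"] = Cell(sheet, "=3*A1+2")
print(sheet["A2"].value)            # 3*6 + 2 = 20
```

This recomputes dependencies on every read; a real spreadsheet would cache values and use the topological order below instead.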

Topological sorting:-

Time for table topping !!!😂

Topological sorting for Directed Acyclic Graph (DAG) is a linear ordering of vertices such that for every directed edge uv, vertex u comes before v in the ordering. Topological Sorting for a graph is not possible if the graph is not a DAG.

In DFS, we print a vertex and then recursively call DFS for its adjacent vertices; in topological sorting, we first recursively call topological sorting for all adjacent vertices, and only then push the vertex onto a stack. Finally, print the contents of the stack. Note that a vertex is pushed onto the stack only when all of its adjacent vertices (and their adjacent vertices, and so on) are already on the stack.

For the graph below, let's find a topological sort…

Fig. 4: DAG – Topological Sort

Topological sortings :-

  1. 5, 4, 2, 3, 1, 0
  2. 4, 5, 2, 3, 1, 0
  3. 5, 2, 3, 4, 0, 1  etc.

DFS: 5, 2, 3, 1, 0, 4

So, there can be more than one topological sorting for a given graph. Any DAG has at least one topological ordering, and algorithms are known for constructing one in linear time; Kahn's algorithm and DFS are the usual choices.
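The push-to-stack-after-neighbours idea above can be sketched in Python; the edges are assumed to be those of the standard example graph (5→2, 5→0, 4→0, 4→1, 2→3, 3→1):

```python
# DFS-based topological sort: push a vertex only after all its neighbours are finished.
def topological_sort(graph):
    visited, stack = set(), []

    def dfs(u):
        visited.add(u)
        for v in graph[u]:               # first finish every adjacent vertex...
            if v not in visited:
                dfs(v)
        stack.append(u)                  # ...then push u

    for u in graph:
        if u not in visited:
            dfs(u)
    return stack[::-1]                   # reverse of finish order

graph = {0: [], 1: [], 2: [3], 3: [1], 4: [0, 1], 5: [2, 0]}
print(topological_sort(graph))           # [5, 4, 2, 3, 1, 0]
```

With this adjacency list and visiting order, the result is exactly ordering #1 from the list above.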

It would be interesting to implement these DSs in C. (I'm planning on it.. later!!)

You know.. stuff like pointer to structure and structure of pointers!!

By the way, if you are planning to learn/refresh C, then go with the following books:

Ever heard of K&R C??

Don't start with Balagurusamy… But YPK's Let Us C is good…

Checkout : SC: the Venerable Spreadsheet Calculator

“If you like vi, and you like the command line, you will love sc—a spreadsheet that runs in a terminal.”

Another interesting command: tsort – a Unix program for topological sorting.

Also,
JS: http://jsfiddle.net/ondras/hYfN3/
Python: http://code.activestate.com/recipes/355045-spreadsheet/

NB: Think about using an object-oriented approach (classes/objects), using XML, etc.

Brainstorm!!!!

References :-

  1. https://en.wikipedia.org/wiki/Directed_acyclic_graph
  2. https://www.thecodingforums.com/threads/what-is-the-best-data-structure-for-a-very-simple-spreadsheet.710580/
  3. http://www.geeksforgeeks.org/topological-sorting/
  4. https://www.quora.com/What-data-structures-would-I-need-to-create-a-spreadsheet-program-like-Excel
  5. http://www.linuxjournal.com/article/10699?page=0,0
  6. https://cs.gmu.edu/~kauffman/cs310/hw3.html
  7. https://github.com/dbarowy/gCheckCell/issues/4

[ Date: 24th May ’17 ]


8. ‘Oye Gooey!!’: Intro to GUI programming using Python and Tkinter

If you are a Linux aficionado, you are probably not a big fan of GUIs. But then why do we use/need a GUI? Can't we just rely on the command line? Well, if you are a computer geek or expert, you may get along fine with the CLI (it's fun too..). A Graphical User Interface (GUI) is important because it allows higher productivity while imposing a lower cognitive load. It's also more intuitive to use.

Imagine a website with no UI, where you have CLI-like file-searching mechanisms and all.. much like browsing an FTP site!!! Now don't stop there.. imagine Photoshop, monitoring software etc. without a GUI: you would have to go through individual commands, files, tables etc. In short, a GUI gives us a better presentation of information and aids its understanding.

Tkinter is Python's de-facto standard GUI (Graphical User Interface) package. It is a thin object-oriented layer on top of Tcl/Tk. It provides a dozen-odd widgets: button, entry (textbox), canvas, frame, listbox, menu, checkbutton etc. If you are familiar with Python, it's fairly easy to use Tkinter. Also see the available options for windows, geometry managers, widgets etc.

For developing GUI applications, you just have to follow some standard steps:-

  1. Import the Tkinter module.
  2. Create the GUI application main window.
  3. Add one or more of the above-mentioned widgets to the GUI application.
  4. Enter the main event loop to take action against each event triggered by the user.

Let's start with a basic example: a simple application with one label and a button. When the user clicks the button, the text is updated and shown in the window. Let's call it the ‘hello-world’ of GUI programming (I'm a novice too!!).

Let’s tighten our grip.. Time to cook some code!!!

#GUI: Sample Program - Increment
import Tkinter          # Python 2; in Python 3 the module is 'tkinter'

window = Tkinter.Tk()
window.title("Button_Label")
num = 0

#callback : function for changing the label
def increment():
    global num
    num += 1
    lbl.configure(text=num)      #update label object

#create label
lbl = Tkinter.Label(window, text=num)
lbl.pack()
#create button (note: pack() returns None, so keep the widget reference separate)
btn = Tkinter.Button(window, text="Click Me", command=increment)
btn.pack()
window.mainloop()

First of all, import the Tkinter module, which contains the classes, functions and other machinery we need. Next, create a window to hold all our widgets and other objects. We can configure the title, window icon, background color etc. of the window object.

Create a variable num to store the number (which we will update on button press). Now define a function (the callback) to increment the num variable. Note the use of the configure method on the label: it updates the option whose value is displayed inside the window. Here we just update the text, but we could also update properties like background colour, font etc.

Let's create the label widget. A Label widget can display text, an icon or another image. As seen in the code above, we just initialize its text property. Note that we have to pass the parent object (root/window) when creating widgets. The pack method tells the widget to fit itself inside the window (configurable via its options) and make itself visible.

Finally, create the button widget. Set the text to display and pass the increment function as the callback; it is bound to the button-press event. As you can see, we wire this up using the command parameter of the button. The size, appearance and position of the widget can be adjusted via the pack method.

Every GUI application typically ends with the root.mainloop() call, which starts the event loop that continuously handles user events, tkinter operations, display updates etc.

Now lets look at the output..

You may run it like a normal Python program.. and a new window spawns up… Yes, that's my boy.. TADA!!! That was so easy….

The output looks like..

Fig. 1: Button_Label – Output

Ok, looks fine..but it doesn’t have that glossy feel.. Right? Then, try changing background colours, adding images etc..

Now, let me show you a similar application that I've made using the same library and concepts.. a basic calculator app…

The output looks like this..

Fig. 2: Calculator

Hmmm.. The code?? Yes, it's in my GitHub repo.. You can easily extend this basic calculator to, say, a ‘scientific calcy’ by adding simple Python functions for mathematical operations and associating them with additional buttons (remember the math library!!).

While developing this application, a few functions and data structures came in handy (I hadn't used them much before..):

  1. eval function         (Oh..i get it ..that’s tricky)
  2. dictionary              (why? no idea..)
  3. lambda function   ( wow!! )
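A rough sketch of how those three fit together in a calculator, with the Tkinter parts stripped out so it runs anywhere (names hypothetical):

```python
# eval + dict + lambda, calculator style (illustrative only; eval is unsafe on untrusted input).
expression = []

def press(ch):
    expression.append(ch)

# a dict of button label -> callback; each lambda freezes its own label via the
# default argument -- the same trick used to pass arguments to Tkinter's command=
buttons = {ch: (lambda ch=ch: press(ch)) for ch in "0123456789+-*/."}

for ch in "7*6":
    buttons[ch]()                     # simulate three button clicks

result = eval("".join(expression))    # eval turns the collected string into a number
print(result)                         # 42
```

In the real app, each Tkinter Button would carry one of these lambdas as its command, and the ‘=’ button would run the eval.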

An interesting advantage of Python is its wealth of built-in functions, DSs, techniques and programming constructs: when you hit a particularly tricky situation and begin to think “Is there an ideal function/DS [already available] that takes ‘this’ type of input and gives ‘that’ type of output?”, you will most probably find a construct that exactly suits your need. You may have to spend a couple of hours understanding the new concepts, but I can guarantee that in the end it is worth the wait!!

NB: In the sample code shown above see what happens when you pass arguments to callback functions!!

“Using ‘eval’ & ‘globals’ frequently is considered unsafe/bad” – Why? (remember goto?) Also check out spaghetti code/code smells.

Also see PyQt, PySide, Kivy, wxPython, PyGTK, Glade etc …  GUI Frameworks!!!

“A user interface is like a joke. If you have to explain it, it’s not that good.”

References:-

  1. https://wiki.python.org/moin/TkInter
  2. https://www.tutorialspoint.com/python/python_gui_programming.htm
  3. http://effbot.org/tkinterbook/tkinter-index.htm
  4. http://usingpython.com/widget-interactivity/
  5. https://stackoverflow.com/questions/6920302/how-to-pass-arguments-to-a-button-command-in-tkinter
  6. https://www.bwired.com.au/blogs/the-importance-of-user-interface-and-why-it-is-critical-to-your-success-online
  7. http://www.python-course.eu/python_tkinter.php
  8. For starters : “A byte of python – Swaroop C.H”

[ Date: 28th May ’17 ]


9. Weaving the web: Intro to Websockets and Real-Time Communication.

The Web is the largest knowledge-sharing medium, connecting people from one end of the world to the other, and it is becoming enormously ubiquitous. It hosts applications handling everything from text messaging, video conferencing and image processing to shared computing resources. The number of web clients is increasing at an enormous rate, and that growth demands more advanced communication technologies.

The dawn of real time bidirectional communication and push technologies …

The web initially worked on the principle of a client sending a request to a server, and the server processing it and sending back the results. In the mid 2000s, new push technologies emerged which enable the server to contact the client directly. Technologies like Ajax, Comet and polling came into the picture, allowing bidirectional (full- or half-duplex) communication. But all of these suffered from the overhead of (unnecessary) HTTP headers, which had to be sent every time a client wanted to talk to the server, leading to higher latency and network traffic. The idea of the WebSocket protocol is to enable a persistent, low-latency, full-duplex communication channel between a client and a remote host over a single TCP socket, where communication can be initiated by either side.

So what about UDP?

UDP is designed for streaming real-time data where the newest data is the most important and older information may be dropped; reliability is low. So we are not taking it into account…

Fig. 1: Three modes of transmission

In HTTP polling, the client polls the server on a fixed interval and the server always responds (with an empty or a new message, depending on the scenario). In long polling, instead of sending empty messages, the server holds the request until a new message is available or a timeout expires; this reduces the number of client requests when no new messages are available. Ajax creates a connection to the server, sends request headers with optional data, gets a response, and closes the connection; the connection must be re-established every time, but communication is bidirectional and the page doesn't freeze while waiting. The WebSocket protocol provides a full-duplex, bidirectional channel over a single socket, cutting the HTTP overhead and the communication cost. Server-Sent Events (SSE) are a lighter-weight approach in which the client establishes a persistent, long-term connection, but only the server can send data; if the client wants to send data to the server, it needs another technology/protocol.

Oh.. that was the worst network class ever!!!

Now lets jump into websockets..

WebSockets:-

  • A protocol providing full-duplex communication channels over a single TCP connection.
  • Can be used securely (wss) for bidirectional communication over TCP.
  • Standardized by the IETF (RFC 6455), with the browser API standardized by the W3C; usable by any client-server application for fast real-time data transfer.
  • Provides true concurrency and performance optimization, resulting in more responsive and rich web applications.
  • Platform independent, with lower latency than the other methods; brings desktop-rich functionality to all web browsers.
  • Stateful, with asynchronous communication.
  • Differs from raw TCP in that it carries a stream of messages instead of a stream of bytes.
  • Traverses proxies and firewalls (especially when using SSL) and supports the standard origin-based security model.
  • Minimal header overhead and latency, and no polling overhead: messages are sent only when there is something to send.

The WebSocket protocol has two parts. The first is the handshake, consisting of an upgrade request from the client and the handshake response from the server; the second is data transfer.

It starts with the client sending an ordinary HTTP request to the server, carrying an Upgrade header that asks the server to switch to the WebSocket protocol.

GET /chat HTTP/1.1
Host: websocket.example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13
Origin: http://example.com
                            Websocket - Handshake (client request)
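In the server's 101 Switching Protocols response, the Sec-WebSocket-Accept header proves the handshake was understood: per RFC 6455, it is the SHA-1 of the client's key concatenated with a fixed GUID, base64-encoded. A quick sketch:

```python
import base64
import hashlib

GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"   # fixed value from RFC 6455

def websocket_accept(sec_websocket_key):
    """Sec-WebSocket-Accept value the server must send for a given Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# the worked example from RFC 6455 itself:
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))   # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

Any custom server must compute this value correctly, or browsers will abort the connection.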

A sequence diagram showing the ordered interaction between client and server is shown below. More on this will be discussed during the actual implementation.

ws
Fig. 2: Websocket communication – Sequence diagram

Websocket: Frames :-

Websocket frames have a few header bytes, and a text or binary payload. Frames from client to server are masked.

Fig. 3: Websocket Frame Structure
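To make the frame layout concrete, here is a sketch of building a single text frame per RFC 6455 (client frames set the mask bit and XOR the payload with a random 4-byte key):

```python
import os
import struct

def encode_text_frame(text, mask=False):
    """Build one unfragmented WebSocket text frame (FIN=1, opcode=0x1)."""
    payload = text.encode("utf-8")
    header = bytearray([0x81])                   # FIN bit + text opcode
    n = len(payload)
    mask_bit = 0x80 if mask else 0
    if n < 126:
        header.append(mask_bit | n)              # 7-bit payload length
    elif n < 1 << 16:
        header.append(mask_bit | 126)            # 16-bit extended length
        header += struct.pack(">H", n)
    else:
        header.append(mask_bit | 127)            # 64-bit extended length
        header += struct.pack(">Q", n)
    if mask:                                     # client-to-server frames are masked
        key = os.urandom(4)
        header += key
        payload = bytes(b ^ key[i % 4] for i, b in enumerate(payload))
    return bytes(header) + payload

print(encode_text_frame("Hi"))   # b'\x81\x02Hi' -- just 2 header bytes
```

Compare that 2-byte header with the hundreds of bytes of HTTP headers in the table below.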

Websocket Efficiency:-

                         HTTP                            WebSocket
Overhead                 100s of bytes                   2–6 bytes (typical)
Latency                  New connection each time        None: uses existing connection
Latency (Polling)        Wait for next interval          No waiting
Latency (Long Polling)   None if a request was sent      No waiting
                         earlier, + time to set up
                         the next request

Websocket vs Polling:-

Fig. 4: Websocket vs Polling

As evident from the graph above, websockets scale well as the number of clients increases, compared to polling. Their very low latency makes them well suited for real-time applications.

Now this may look like.. your networking theory class!!! Guess that's right..!! All right, time for coding again!!!

Where is my dear Python?? Looks like.. he has brought our good ol' bosom buddies HTML & JS along….

Here is the code for the websocket client.. you may run it in your browser (any ws-supporting browser will do, even mobile.. just make sure you are connected to the internet).


<!DOCTYPE html>
<html>
<head>
<title>WebSocket Client</title>
<style>
#output {
  border: solid 1px #000;
}
</style>
</head>
<body>
<form id="form">
  <input id="message" type="text">
  <button type="submit">Send</button>
</form>

<hr>

<div id="output">&nbsp;</div>

<script>
var inputBox = document.getElementById("message");
var output = document.getElementById("output");
var form = document.getElementById("form");
var i = 0;

try {
  var host = "ws://localhost:9999/";
  console.log("Host:", host);

  var s = new WebSocket(host);

  s.onopen = function (e) {
    console.log("Socket opened.");
  };

  s.onclose = function (e) {
    console.log("Socket closed.");
  };

  s.onmessage = function (e) {
    console.log("Socket message:", e.data);
    var p = document.createElement("p");
    p.innerHTML = "CLI_NUM: " + inputBox.value + " SERV_DATA: " + e.data + " CLI_RCV_T: " + new Date().getTime();
    output.appendChild(p);
    s.send(i++ + " cli_num: " + inputBox.value);
  };

  s.onerror = function (e) {
    console.log("Socket error:", e);
  };
} catch (ex) {
  console.log("Socket exception:", ex);
}

form.addEventListener("submit", function (e) {
  e.preventDefault();
  s.send(inputBox.value);
}, false);
</script>
</body>
</html>

The program above shows a client sending requests to a server.. a websocket echo server, to be precise.. and it gets the message echoed back without much delay (don't believe me? then try it offline or with a different address!!!). Now let's see what is happening here (don't worry, I'm not going to explain the HTML again!!). Basically, it sends a message to the websocket server at the given URL.

var host = "ws://echo.websocket.org"
var s = new WebSocket(host);

It is an echo server. You can see that the URI uses the ws scheme, like http (wss is the secure mode, much like https); the remaining structure is similar to HTTP, in order: host, port, path etc.

The object s is a JS object implementing a websocket. First the connection is established between client and server through the handshake, and then the onopen event fires. Similarly, the onmessage event fires when the client receives data from the server; you can guess the roles of the onclose and onerror handlers. The connection is usually terminated/closed after an error occurs.

There are two functions the user may call to send data or close the connection explicitly: send() and close(). send is used for sending text, binary or image data, and close terminates the connection.

That's it.. you have your first websocket program, in just a dozen lines of code!!! That was pretty easy compared to our TCP chat/echo program, right? (C/Java network lab!!!) Well, it's not just easy; it is more efficient and flexible…

Let's monitor our program's network usage using Chrome dev tools… Typically the output looks like this… (Go to Developer Tools -> Network -> WS -> press F5 -> Frames/Headers)

Websocket communication: Headers
Websocket communication: Frames

So far we've seen the client's point of view.. what about creating a custom websocket server?? Yes, we can implement that too..

For the Python code, refer to the GitHub repo..

See websocket abstractions and libraries: Tornado (Python), libwebsockets (C), Java-WebSocket (Java), Fleck (C#) etc..
Applications:-
  1.  Chat applications
  2. Collaborative editing
  3. Location based services
  4. Online multiplayer games
  5. Social networking applications
  6. Live trading/auctions/sports notifications
  7. Controlling medical equipment over the Web

NB:-

Also see criticisms/disadvantages of using websocket, REST and Websockets,WS Security..

References:-

  1. http://blog.teamtreehouse.com/an-introduction-to-websockets
  2. https://www.html5rocks.com/en/tutorials/websockets/basics/
  3. https://gist.github.com/jkp/3136208
  4. http://enterprisewebbook.com/ch8_websockets.html
  5. Wikipedia
  6. Tutorialspoint
  7. Pimentel, V., & Nickerson, B. G. (2012). Communicating and displaying real-time data with websocket. IEEE Internet Computing, 16(4), 45-53.
  8. https://drive.google.com/file/d/0B9OmJfFV6JNbbWZLeF8yOUhNLU0/view?usp=sharing
  9. https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API
  10. https://code.tutsplus.com/tutorials/start-using-html5-websockets-today–net-13270
  11. Fette, I., & Melnikov, A. (2011). The WebSocket Protocol. RFC 6455, IETF.
  12. https://www.pubnub.com/blog/2015-01-05-websockets-vs-rest-api-understanding-the-difference/

[ Date: 31st May ’17 ]


10. Alone we can do so little, together we can do so much: A Distributed DoS Simulation

If you are a frequent Windows user, you might have come across a situation where your system hangs or freezes altogether: the GUI stops responding to the keyboard and mouse. The fundamental reason is typically resource exhaustion: resources necessary for some part of the system to run are unavailable, because they are in use by other processes or are simply insufficient (sometimes even Ctrl+Alt+Delete doesn't help us..). When this is caused by excessive paging, CS terminology calls it thrashing (not again.. my dear OS!!). Now, this happens within a system, i.e. locally: all the processes and resources belong to one machine. What are the chances that similar effects can be caused by remote processes on an external system across a network? This more or less looks like network hacking stuff.. Right?? Exactly!! We are now looking at the Denial of Service (DoS) attack.

What is DoS and DDoS?

A Denial of Service (DoS) attack is a cyber attack targeted at a system or network resource with the aim of locking out its intended users by making it unavailable or busy, temporarily or even indefinitely. Often this is accomplished by flooding the target with packets or requests so that the system is overloaded and some or all legitimate requests cannot be fulfilled.

In a Distributed DoS (DDoS), multiple computers and Internet connections flood the targeted resource/system. The attacker infects a group of systems using a virus, Trojan or other malware and takes control of them. This network of compromised systems is called a botnet, and each machine a zombie or zombie bot. They are controlled by the bot herder (or bot master), i.e. the attacker who issues the command and control for this kind of attack as their ‘master’. A simplified pictorial representation is given below.

Fig. 1: A Typical DDoS Attack

A practical real-world analogy would be something like this: a large crowd gathering in front of the gate or entrance of a hall/stadium, thereby preventing the entry of the intended, legitimate party (like a VIP of some sort).

Let’s make this a little more interesting…shall we?? A better example…

Imagine you are talking to your girlfriend (or boyfriend, correspondingly.. you know!!) and suddenly you notice the noise of your little brother playing games on the PlayStation (damn! I hate that sound too!!). After some time, your father keeps calling you to help him fix the lights (fan, furniture or whatever..). Now that's disturbing, considering the fact that you are attending an important phone call (you know.. your future depends on it!!). Finally, your mom calls you for dinner (probably it's late.. you are the only one who hasn't had dinner yet, and your mom apparently needs some sleep after her hectic household work). Ok, now that's a lot of disturbance!! They keep on calling you and finally you decide to give up.. oh, and by the way, don't forget the good-night message.. that's the close() connection message.. you know!! So in the end, you got overloaded with requests and couldn't continue with your phone call. Thus, the service was denied!! So who was behind all this?? The whole thing was actually set up by your elder sister, who knew about your daily phone-call routine. The truth is that there was only one telephone.. and she was expecting a call from her boyfriend around the same time. Hmmm.. so she was the attacker.. the bot herder!! She tricked everyone into calling you.. oh, how cruel!! Apparently, your sister loves you more than anyone else.

Oh, how gross!! Maybe.. but you won't forget DDoS if you remember it like this!!!

So why do they do this? (Not the phone call, by the way!) And what are the after-effects?

DDoS attacks may be carried out for financial gain, political motivation, activism or just for fun. The after-effects include financial losses, reputation damage, customer agitation, legal repercussions etc.

Types of  DDoS:-

  1. Volume-based attacks aim to saturate the bandwidth of the attacked site, by flooding it with UDP, ICMP or other useless packets.
  2. Protocol attacks aim to consume server resources, firewalls, load balancers and other equipment. Typical examples include SYN floods, Ping of Death etc.
  3. Application-layer attacks aim to crash the web server by issuing GET/POST floods and exploiting vulnerabilities in the server/OS.

Some popular DDoS attacks : UDP Flood, HTTP Flood, SYN Flood, Slowloris, POD, NTP Amplification etc.
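Before touching a simulator, the crowding-out effect behind all these floods can be sketched with made-up numbers: a server that handles a fixed number of requests per tick and picks arrivals indiscriminately.

```python
# Toy DDoS arithmetic (all numbers invented for illustration).
def legit_service_rate(legit, bots, per_bot, capacity):
    """Fraction of legitimate requests served when the server takes up to
    'capacity' requests per tick without telling bots and humans apart."""
    total = legit + bots * per_bot
    return 1.0 if total <= capacity else capacity / total

print(legit_service_rate(30, bots=0, per_bot=0, capacity=100))     # 1.0 -- a normal day
print(legit_service_rate(30, bots=40, per_bot=50, capacity=100))   # ~0.05 -- service (mostly) denied
```

Forty zombies at fifty requests each drown the thirty legitimate requests; this is the effect the NS2 simulation below reproduces at the packet level.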

DDoS Mitigation:-

Mitigation services employ several different strategies to thwart DDoS attacks. Web proxies, BGP and DNS are all used to redirect traffic to a safe location or scrubbing center, where technicians can cleanse traffic and wait out a hacker's attack. Other methods involve detection and inspection, like deep packet inspection or bot discernment. Another technique is to use CAPTCHAs to distinguish bots from humans. Many online/cloud-based DDoS protection services, like Incapsula Enterprise, F5 Silverline DDoS Protection and Arbor Cloud, offer good protection from such attacks.

Fig. 2: Cloudflare DDoS Protection

Ok, theory over…Back to “CODING”..

Now think.. how can we implement such a massive DDoS attack.. at least for demonstration purposes..

Well, use a simulation tool!! This brings us to NS2.. a network simulation tool.

NS2 is an open-source simulation tool that runs on Linux. It is a discrete event simulator targeted at networking research and provides substantial support for simulation of routing, multicast protocols and IP protocols, such as UDP, TCP, RTP and SRM over wired and wireless (local and satellite) networks.

It is an object-oriented simulator, written in C++, with an OTcl interpreter as a front-end. C++ is used because we need a systems programming language that can efficiently manipulate bytes and packet headers, and implement algorithms that run over large data sets. C++ is fast to run but slow to change, making it suitable for detailed protocol implementation; OTcl runs much more slowly but can be changed very quickly (and interactively), making it ideal for simulation configuration. For the time being, we will stick to Tcl scripting, since C++ is intended for advanced usage/scenarios, like per-packet processing, changing class behaviour etc.

“Let’s start at the very beginning,
a very nice place to start,
when you sing, you begin with sa, re, ga, ma (Do, Re Mi..or whatever..),
when you write, you begin with a, b, c ,
when you program, you begin with hello-world,
when you simulate, you begin with the topology.”
Here is the simulation code..
# The preamble
 
set ns [new Simulator]                                                               ;#initialise the simulation
 
# Predefine tracing
set f [open out.tr w]
$ns trace-all $f
set nf [open out.nam w]
$ns namtrace-all $nf
 
set n0 [$ns node]
set n1 [$ns node]
set n2 [$ns node]
set n3 [$ns node]
 
$ns duplex-link $n0 $n2 5Mb 2ms DropTail
$ns duplex-link $n1 $n2 5Mb 2ms DropTail
$ns duplex-link $n2 $n3 1.5Mb 10ms DropTail
 
# Some agents.
 
set udp0 [new Agent/UDP]                                                                                    ;#A UDP agent
$ns attach-agent $n0 $udp0                                                                                 ;#on node $n0
set cbr0 [new Application/Traffic/CBR]                          ;#A CBR traffic generator agent
$cbr0 attach-agent $udp0                                                           ;#attached to the UDP agent
$udp0 set class_ 0                                                                             ;#actually, the default, but.
 
set null0 [new Agent/Null]                                                                                              ;#Its sink
$ns attach-agent $n3 $null0                                                                                  ;#on node $n3
 
$ns connect $udp0 $null0
$ns at 1.0 "$cbr0 start"
 
puts [$cbr0 set packetSize_]
puts [$cbr0 set interval_]
 
# A FTP over TCP/Tahoe from $n1 to $n3, flowid 2
set tcp [new Agent/TCP]
$tcp set class_ 1
$ns attach-agent $n1 $tcp
 
set sink [new Agent/TCPSink]
$ns attach-agent $n3 $sink
 
set ftp [new Application/FTP]                               ;#TCP does not generate its own traffic
$ftp attach-agent $tcp
$ns at 1.2 "$ftp start"
 
$ns connect $tcp $sink
$ns at 1.35 "$ns detach-agent $n1 $tcp ; $ns detach-agent $n3 $sink"    ;#tcp was attached to $n1
 
# The simulation runs for 3s.
# The simulation comes to an end when the scheduler invokes the finish{} method.
# This procedure closes all trace files, and invokes nam visualization on trace files.
 
$ns at 3.0 "finish"
proc finish {} {
global ns f nf
$ns flush-trace
close $f
close $nf
 
puts "running nam..."
exec nam out.nam &
exit 0
}
 
# Finally, start the simulation.
$ns run

To run this code, first save it as first.tcl; then, from a terminal, run it as $ ns first.tcl

This script defines a simple topology of four nodes, and two agents, a UDP agent with a CBR traffic generator, and a TCP agent. The simulation runs for 3s. The output is two trace files, out.tr and out.nam. When the simulation completes at the end of 3s, it will attempt to run a nam visualisation of the simulation on your screen.

Holy cow! That’s too much for a hello-world program…
Alright, let’s take it bit by bit!!

set ns [new Simulator] creates a new simulator object (compulsory for any NS2 script).
The output files allow us to investigate and visualize the simulation process and various metrics.
So, create an output trace file out.tr and a nam visualization file out.nam.

To create nodes, use the set command. So n0, n1, n2 and n3 point to the newly created nodes.

$ns duplex-link $n0 $n2 5Mb 2ms DropTail – This command sets up a bi-directional link between n0 and n2 with a capacity of 5 Mb/s and a propagation delay of 2 ms. DropTail specifies the queuing mechanism used on the link (see also RED, SFQ, DRR etc.). Implement the rest of the links as shown in the topology diagram.

topology
Fig. 3: Topology Diagram

Adding traffic: Agents of shield…

The command ‘set udp0 [new Agent/UDP]‘ gives a pointer called ‘udp0’ which refers to the newly created UDP agent object. Then the command ‘$ns attach-agent $n0 $udp0‘ attaches the agent to node n0, defining the source of the UDP connection.

set cbr0 [new Application/Traffic/CBR]  , $cbr0 attach-agent $udp0

Here we define a CBR (Constant Bit Rate) traffic source running over the UDP connection.

set null0 [new Agent/Null] , $ns attach-agent $n3 $null0
Here, we define a null agent as the ‘sink’ and attach it to node n3.
set ftp [new Application/FTP]                              
$ftp attach-agent $tcp
$ns at 1.2 “$ftp start”

Here, we create an FTP application and attach it to the TCP agent. Note that TCP does not generate its own traffic. Finally, start the traffic flow at 1.2 s of the simulation, i.e. scheduling the events.

$ns connect $tcp $sink
$ns at 1.35 “$ns detach-agent $n0 $tcp ; $ns detach-agent $n3 $sink”

The connect command makes the TCP connection between the source and the destination, i.e. n1-n3. The detach command is the inverse of attach, i.e. it breaks the association between an agent and its node.

Termination is handled by the ‘finish‘ procedure.

In the above procedure, the keyword ‘proc’ declares a procedure called ‘finish’, and the keyword ‘global’ lists the variables defined outside the procedure that it uses.

‘flush-trace’ is a simulator method that flushes the buffered trace events to the respective files. The ‘close’ command closes the trace files, and ‘exec’ launches the nam visualization. The ‘exit’ command terminates the application, returning 0, the conventional status code for a clean exit.

In ns, we end the program by scheduling a call to the ‘finish’ procedure.

To begin the simulation we will use the command $ns run.

Let’s see the simulation output…

topo_simple
Fig 4: Output of simulation [ns2]
Output:-
Packet size: 210
Interval: 0.0037499999999999999
Ok, Cool..
Now let’s try some REAL SIMULATION… A PRACTICAL EXAMPLE!!!
Remember DDoS?? Let’s give it a shot!!!
Here we go… The output of the simulation is shown below…
You know where to get the code!!    ….    Jai Github!!
DDoS Simulation Video:-
 
Now, let’s analyse our output…

We are going to use a new tool called Tracegraph to analyse our simulation. Tracegraph is a third-party tool that helps in plotting graphs for NS2 and other network simulators. The sad part is that the software is no longer maintained by anyone; the happy part is that it still works fine, and it is free.

It provides an easy-to-use interface to directly analyse and plot graphs for a given simulation. We can analyse performance characteristics like throughput, end-to-end delay, jitter, histograms etc. It also gives simulation information like the packet loss, packet delivery and end-to-end delay for the whole network, along with information about the intermediate, source and destination nodes.
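To get a feel for what such an analysis involves, here is a minimal Python sketch (hypothetical helper name, tiny made-up trace sample) that computes packet loss and received throughput from an NS2 out.tr trace, where each line starts with an event flag ('r' receive, 'd' drop, '+'/'-' enqueue/dequeue), a timestamp, the from/to nodes, the packet type and its size:

```python
# Hypothetical sketch: crunching an NS2 trace (out.tr) by hand.
# Trace line layout (simplified):
#   <event> <time> <from> <to> <type> <size> <flags> <fid> <src> <dst> <seq> <id>
def analyse(trace_lines, sink_node="3"):
    received = dropped = 0
    bytes_rx = 0.0
    t_first = t_last = None
    for line in trace_lines:
        f = line.split()
        if len(f) < 6:
            continue
        event, t, dst, size = f[0], float(f[1]), f[3], int(f[5])
        if event == "d":                      # packet dropped anywhere
            dropped += 1
        elif event == "r" and dst == sink_node:   # received at the sink
            received += 1
            bytes_rx += size
            t_first = t if t_first is None else t_first
            t_last = t
    duration = (t_last - t_first) if received > 1 else 0.0
    thr = (bytes_rx * 8 / duration) if duration else 0.0   # bits/sec
    return {"received": received, "dropped": dropped, "throughput_bps": thr}

sample = [
    "r 1.0 2 3 cbr 210 ------- 0 0.0 3.1 0 0",
    "r 2.0 2 3 cbr 210 ------- 0 0.0 3.1 1 1",
    "d 1.5 2 3 cbr 210 ------- 0 0.0 3.1 2 2",
]
print(analyse(sample))
```

Tracegraph does all of this (and far more) for you, but the sketch shows where its numbers come from.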

Results:-

simu_info1
Fig. 5: Simulation Information & Router Information
simu_info2
Fig. 6: Processing Time and End2End Delay

Now the graphs….Line..Bar..3D!!!  What more do you want???


Holy Moly!!! Is it enough?? I can’t wait to get hands on the code…

This reminds me of something..Our ‘Old School Teaching Methods..’

exam
No captions needed…..

The results are self-explanatory. Note the performance metrics for the router node in particular. You can easily observe that it is the node most affected by the DDoS attack, i.e. packet loss, jitter, latency etc. are highest at this node.

Also note how the queue overflows at the router node as the DDoS attack reaches its peak. This leads to packet loss, congestion and performance degradation. It thus disrupts the normal or legitimate traffic, i.e. the 4-10-11 node path, and the corresponding users.

Ok, time to wind it up…

Finally, let me add some explanation of some of the terms used here…                               Oh, not again… theory!!

Communication over a computer network has the following performance characteristics relating to latency, bandwidth and jitter:

The delay between the sending of a message by one process and its receipt by another is referred to as latency. The latency includes the propagation delay through the media, the frame/message transmission time, and time taken by the operating system communication services (e.g. TCP/IP stack) at both the sending and receiving processes, which varies according to the current load on the operating system.

The bandwidth of a computer network is the total amount of information that can be transmitted over it in a given time.

Jitter is the variation in the time taken to deliver a series of messages. This is relevant to real-time and multimedia traffic.
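These metrics are easy to compute from packet timestamps. A minimal Python sketch (hypothetical function and made-up sample times), measuring jitter as the average variation between consecutive end-to-end delays:

```python
# Sketch: jitter as the mean absolute difference between the delays
# experienced by successive packets.
def mean_jitter(send_times, recv_times):
    delays = [r - s for s, r in zip(send_times, recv_times)]        # latency per packet
    diffs = [abs(b - a) for a, b in zip(delays, delays[1:])]        # delay variation
    return sum(diffs) / len(diffs) if diffs else 0.0

send = [0.0, 1.0, 2.0, 3.0]
recv = [0.10, 1.12, 2.09, 3.15]   # per-packet delays: 0.10, 0.12, 0.09, 0.15
print(mean_jitter(send, recv))
```

If all packets took exactly the same time, the jitter would be zero, which is why constant delay (even a large one) is tolerable for streaming while variable delay is not.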

TCP vs. UDP:-

                 TCP                        UDP
Connection       Connection-oriented        Connectionless
Speed            Slower                     Faster
Reliability      More reliable              Less reliable
Acknowledgement  Acknowledgements used      No acknowledgements
Flow Control     Flow control implemented   No flow control
Error Recovery   Error recovery attempted   No error recovery
Handshake        Handshake initiation       No handshake
Ordering         Packets are ordered        No guaranteed ordering
Complexity       Heavyweight                Lightweight
Applications     HTTP, SMTP, FTP, SSH …     DNS, VoIP, Streaming, TFTP …
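The connection-oriented vs. connectionless distinction in the table can be seen directly with the standard socket API. A minimal loopback-only Python sketch: UDP simply fires a datagram at an address, while TCP must complete the handshake via connect()/accept() before any data flows:

```python
import socket

# UDP: connectionless -- no handshake; just fire a datagram at an address.
udp_rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_rx.bind(("127.0.0.1", 0))              # OS picks a free port
udp_port = udp_rx.getsockname()[1]

udp_tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp_tx.sendto(b"ping", ("127.0.0.1", udp_port))   # no connection set up
udp_data, _ = udp_rx.recvfrom(1024)

# TCP: connection-oriented -- connect()/accept() perform the handshake
# before any data flows; delivery is then ordered and acknowledged.
tcp_srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_srv.bind(("127.0.0.1", 0))
tcp_srv.listen(1)
tcp_port = tcp_srv.getsockname()[1]

tcp_cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_cli.connect(("127.0.0.1", tcp_port))   # handshake happens here
conn, _ = tcp_srv.accept()
tcp_cli.sendall(b"ping")
tcp_data = conn.recv(1024)

print(udp_data, tcp_data)
for s in (udp_rx, udp_tx, tcp_cli, conn, tcp_srv):
    s.close()
```

The UDP side never knows whether its datagram arrived, which is exactly the trade-off the table summarises.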

NB: Check out Mirai: an IoT botnet used for DDoS attacks (a 1 Tbps attack).

Also check out the DDoS by Anonymous, the BBC DDoS attack, the DynDNS attack, etc.

References:-

  1. https://www.incapsula.com/ddos/ddos-attacks/
  2. https://en.wikipedia.org/wiki/Denial-of-service_attack
  3. https://web.archive.org/web/20100914222536/http://anml.iu.edu/ddos/types.html
  4. http://iitkgp.vlab.co.in/?sub=38&brch=121&sim=561&cnt=1
  5. http://cs-study.blogspot.in/2012/12/traffic-flow-on-nodes-in-ns2.html
  6. http://nile.wpi.edu/NS/simple_ns.html
  7. http://www.nsnam.com/2012/09/tracegraph-graphing-software-to-plot.html
  8. https://www.isi.edu/nsnam/ns/doc/ns_doc.pdf
  9. https://getch.wordpress.com/2011/04/20/installing-tracegraph2-02-application-in-ubuntu/
  10. http://www.cyberdefensehub.com/famous-ddos-attacks/
  11. https://f5.com/labs/articles/threat-intelligence/ddos/mirai-the-iot-bot-that-took-down-krebs-and-launched-a-tbps-attack-on-ovh-22422
  12. https://themerkle.com/top-4-successful-ddos-attacks/

[ Date: 7th June ’17 ]


11. Real-time Collaborative Editing System on the Web: Intro to Google Realtime API

Most of you must have used Google Drive for saving files like text, images, audio etc. This storage is, in fact, associated with your Google account (15 GB of storage space), i.e. including Mail, Photos etc. Drive also provides office applications like word processing, presentations, drawings, spreadsheets etc. as a service for the user. It provides additional functionality like sharing documents among users, directory structures, editing tools etc. Here we are particularly interested in the real-time editing feature used in Google Drive. It enables multiple users (accounts) to make edits simultaneously on a shared document and view them in real time, ensuring consistency and conflict-free operation. So now let’s start by discussing collaborative editing systems on the web in general.

Real-time collaborative (or cooperative) systems are groupware frameworks that permit a group of users to observe and edit the same text/graphics/multimedia at the same time from geographically dispersed sites connected by networks. Depending on the working environment, the users can work either synchronously or asynchronously; synchronous collaboration is also called real-time editing. The system architecture can be centralized or distributed. A distributed environment is superior, as replication makes it highly responsive. The main objective of a collaborative editing environment is to allow coherent and consistent sharing and manipulation of objects by distributed users. Examples include desktop conferencing, cooperative document editing, and cooperative design and modelling. Google Docs is the finest example of a collaborative editing application: it allows users to create documents in the cloud, share them with others, and edit artifacts simultaneously within the browser. To improve data availability in a collaborative environment, each user in the shared environment is assigned a local copy of the shared documents or data. This ensures that the updates made by each user are executed locally in a non-blocking manner and then broadcast to the other users’ copies. As these documents are hosted in a shared environment and replicated at multiple sites, simultaneous modifications on these copies can introduce inconsistencies.
One of the main issues in a collaborative environment is how to provide consistency. A related issue is the time lag between edits: when an edit is made, when should it be shown to the others who are editing elsewhere? Network delay plays a major role in this latency. Generally, two consistency-resolution methods are used. The first is to allow conflicts to occur and then resolve them at one site, accommodating all the operations. The second is a locking mechanism: when a client edits a particular part of the document, that part is converted into an object and locked against others. Two of the most popular collaborative applications are Google Drive and Etherpad. Operational transformation (OT) and differential synchronization (DS) are the two major techniques employed for ensuring consistency in collaborative editing applications. Here we concentrate on OT, since Drive uses similar technology for implementing real-time editing of shared files.

What is Operational Transformation:-

  1. It is a technology  for collaboration functionalities in advanced collaborative software systems.
  2. It is used  for consistency maintenance and concurrency control in collaborative editing of plain text documents
  3. Supports consistency models like CC, CSI, CSM, CA etc
  4. Applications:- HTML/XML and tree-structured document editing, collaborative office productivity tools, application-sharing, and collaborative computer-aided media design tools etc.

OK, let’s stop beating around the bush… The basic working principle can be explained with a simple example, as follows…

ot
Fig. 1: Basic idea behind OT

Consider a text string, containing the word ‘abc’, shared between two sites. Two concurrent operations are applied: an Insert (User 1 @ Site 1) and a Delete (User 2 @ Site 2).

Initially, consider O1 as the first operation and O2 as the second one.

The two operations are:-

  1. O1 = Insert[0, “x”] (to insert character “x” at position “0”)
  2. O2 = Delete[2, “c”] (to delete the character “c” at position “2”)

From the initial view of each user, O1 converts ‘abc’ into ‘xabc’ and O2 turns ‘abc’ into ‘ab’. But internally these operations are transformed to maintain consistency of the shared string. Assuming O1 occurred first, O1 is applied as-is (O1′ = O1), as shown in the figure. Now O2 is transformed to O2′ = Delete[3, “c”], whose positional parameter is incremented by one due to the insertion of the character “x” by O1. If O2 were executed without transformation, it would incorrectly delete the character “b” rather than “c”. The basic idea of OT is to transform (or adjust) the parameters of an editing operation according to the effects of previously executed concurrent operations, so that the transformed operation achieves the correct effect and document consistency is maintained.

Now consider the other scenario: O2 modifies the string first, and it is transformed to O2′ = Delete[2, “c”], i.e. no change, since there are no prior operations to transform against. Interestingly, in this case O1′ is also the same as O1: the operation is, after all, Insert[0, “x”], an insert at the first position, and a delete at position 2 does not affect it. So after transforming it against O2 (the first, or previous, operation), we get the same operation (O1′ = O1).

So, whatever the order of operations, OT ensures consistency by transforming an operation against previous concurrent operations, thus maintaining a globally consistent view of the document.

(Imagine both O1 and O2 perform an insert at position 0. In this case, the second operation will have to be transformed to get a new operation with an adjusted editing parameter, regardless of the order of the concurrent operations.)
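The transformation rules just described can be sketched in a few lines. This is a hypothetical toy implementation (nothing like Google's production code), covering only single-character insert/delete operations:

```python
# Toy sketch of OT: adjust the position of operation b against a
# previously applied concurrent operation a.
# Operations are tuples: ('ins', pos, char) or ('del', pos, char).
def transform(b, a):
    kind_b, pos_b, ch_b = b
    kind_a, pos_a, _ = a
    if kind_a == 'ins' and pos_a <= pos_b:
        return (kind_b, pos_b + 1, ch_b)   # a inserted at/before b's position
    if kind_a == 'del' and pos_a < pos_b:
        return (kind_b, pos_b - 1, ch_b)   # a deleted before b's position
    return b                               # positions independent: no change

O1 = ('ins', 0, 'x')   # Insert[0, "x"]
O2 = ('del', 2, 'c')   # Delete[2, "c"]

print(transform(O2, O1))   # O2' = ('del', 3, 'c'), i.e. Delete[3, "c"]
print(transform(O1, O2))   # O1' = O1, the insert at 0 is unaffected
```

Note the tie-break in the insert rule (`pos_a <= pos_b`): if both users insert at position 0, whichever operation is transformed second gets shifted to position 1, matching the parenthetical scenario above.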

OT is a system of multiple components. One established strategy for designing OT systems is to separate the high-level transformation control (or integration) algorithms from the low-level transformation functions. The layered system view can be described as follows:-

  1. OT Control Algorithms determine which operations are transformed against others according to their concurrency/context relations.
  2. OT properties and conditions divide responsibilities between algorithms and functions.
  3. OT Transformation Functions determine how to transform a pair of primitive operations according to operation types, positions, and other parameters.

Ok, now what are some of the available frameworks for building a collaborative application?

This table shows a summary of some exclusive shared-editing libraries and frameworks that developers can use to create applications incorporating collaborative features.

rt_framework
Fig. 2: Frameworks for collaborative application development

Now let’s get into the real stuff… ”Drive Realtime API”

It allows you to take your application and add to it the power of instant collaboration, just like what you get in Google Docs. It is all done client-side using a JavaScript API, without a separate server. It offers powerful collaboration in terms of click-by-click or character-by-character updates of what your collaborators are doing in real time. It handles conflict resolution and ensures consistency of the shared data when handling simultaneous edits/updates. Another feature is collaborator presence, which lets you view who else is editing the document and what edits are being made while you work with the document. All this is backed by Google Drive, which offers cloud storage, high availability and the related ecosystem for powerful real-time collaboration and persistent storage. Some other features that can easily be implemented with this API and Google Drive include: authentication (OAuth), undo/redo, document sharing, displaying edit logs, downloading files (JSON), opening Drive files with your application etc.

The API is a JavaScript library hosted by Google that provides collaborative objects, events, and methods for creating collaborative applications. It uses a shared data model for real-time collaboration. When any user changes the document, the local in-memory copy is updated first, and the API silently sends the changes as mutations to the server, which finally sends updates to all the collaborators. The mutations have been carefully designed so that conflicts between collaborators can always be resolved automatically, so users never see a message about edit conflicts. Real-time data models are “eventually consistent”: if all collaborators stop editing, eventually everyone will see the same data model.

Lifecycle of a typical Realtime application:-

In a typical scenario a Realtime application will:

  1. Enable the Drive API
  2. Load the Realtime library
  3. Authorize requests
  4. Open/create a file
  5. Load a document and initialize the data model
  6. Make changes/listen for changes on the data model

OK… I don’t intend to drone on further… After all, it’s the basics… The theory stuff… U know!!!

Oookay…..Cooooooding!!!

Waaaait…I forgot the Architecture Diagram..Nooo Software Engineering again!!

Well, here is the generic architecture diagram for the Realtime API. Courtesy: Google

realtime_arch
Fig. 2: Real-time API

So, the API handles all the network communication, storage and sharing internally, so that developers can focus on creating really powerful applications without worrying much about communication and data-representation problems. The system stores the data internally in the form of JSON (all changes/updates), from which we can build the entire data model for your application at any instant.

Here a small sample code in JavaScript (source: Google developers) is shown to demonstrate basic real-time collaboration functionality for text documents. The code itself is self-explanatory (see the comments) and easy to understand.

<!DOCTYPE html>
<html>
<head>
<title>Google Realtime Quickstart</title>
<!-- Load Styles -->
<link href="https://www.gstatic.com/realtime/quickstart-styles.css" rel="stylesheet" type="text/css"/>
<!-- Load the Realtime JavaScript library -->
<script src="https://apis.google.com/js/api.js"></script>
<!-- Load the utility library -->
<script src="https://www.gstatic.com/realtime/realtime-client-utils.js"></script>
</head>
<body>
<main>
<h1>Realtime Collaboration Quickstart</h1>
<p>Now that your application is running, simply type in either text box and see your changes instantly appear in the other one. Open
this same document in a <a onclick="window.open(window.location.href);return false;" target="_blank">new tab</a> to see it work across tabs.</p>
<textarea id="text_area_1"></textarea>
<textarea id="text_area_2"></textarea>
<button id="auth_button">Authorize</button>
</main>
<script>
var clientId = 'INSERT CLIENT ID HERE';
if (!/^([0-9])$/.test(clientId[0])) {
alert('Invalid Client ID - did you forget to insert your application Client ID?');
}
// Create a new instance of the realtime utility with your client ID.
var realtimeUtils = new utils.RealtimeUtils({ clientId: clientId });
authorize();
function authorize() {
// Attempt to authorize
realtimeUtils.authorize(function(response){
if(response.error){
// Authorization failed because this is the first time the user has used your application,
// show the authorize button to prompt them to authorize manually.
var button = document.getElementById('auth_button');
button.classList.add('visible');
button.addEventListener('click', function () {
realtimeUtils.authorize(function(response){
start();
}, true);
});
} else {
start();
}
}, false);
}
function start() {
// With auth taken care of, load a file, or create one if there
// is not an id in the URL.
var id = realtimeUtils.getParam('id');
if (id) {
// Load the document id from the URL
realtimeUtils.load(id.replace('/', ''), onFileLoaded, onFileInitialize);
} else {
// Create a new document, add it to the URL
realtimeUtils.createRealtimeFile('New Quickstart File', function(createResponse) {
window.history.pushState(null, null, '?id=' + createResponse.id);
realtimeUtils.load(createResponse.id, onFileLoaded, onFileInitialize);
});
}
}
// The first time a file is opened, it must be initialized with the
// document structure. This function will add a collaborative string
// to our model at the root.
function onFileInitialize(model) {
var string = model.createString();
string.setText('Welcome to the Quickstart App!');
model.getRoot().set('demo_string', string);
}
// After a file has been initialized and loaded, we can access the
// document. We will wire up the data model to the UI.
function onFileLoaded(doc) {
var collaborativeString = doc.getModel().getRoot().get('demo_string');
wireTextBoxes(collaborativeString);
}
// Connects the text boxes to the collaborative string
function wireTextBoxes(collaborativeString) {
var textArea1 = document.getElementById('text_area_1');
var textArea2 = document.getElementById('text_area_2');
gapi.drive.realtime.databinding.bindString(collaborativeString, textArea1);
gapi.drive.realtime.databinding.bindString(collaborativeString, textArea2);
}
</script>
</body>
</html>

So it is basically easy. The steps:-

  1. Load the styles, the API library, the utility library etc.
  2. Enforce/implement authorization using OAuth (Google account)
  3. Create/load the document
  4. Initialize the document structure by adding a collaborative string at the root
  5. Wire up the data model to the UI.

First save this code in a file (index.html), start a server in the directory, and open the page in a browser locally.

e.g. in a terminal: python -m SimpleHTTPServer 4000  -> in a browser: http://localhost:4000

Click authorize button and login with your google credentials..

Run the same in two windows/tabs and observe  the changes in  real-time..

NB: Activate the Drive API using this link… Get a client ID and paste it inside the code as directed. Ensure that the port number (8000/4000 or whatever…) is the same everywhere.

That’s it!! You have built your first real-time app…Thanks to Google Real time API…

Yabba Dabba Doo!!!

Now…Let’s do some heavy lifting this time, shall we?

Goto: Sudoku in the Gaming section… The output looks something like this… This is a multiplayer Sudoku application developed (by me, of course!!) with the help of the Realtime API.

Output:-

sudoku
Fig 4: M-Sudoku Game

Here are some cool apps made with Realtime API, available online.

realtime_apps
Fig. 3: Application developed using Realtime API

NB: Check out a few other examples

  1. Google Drive Realtime Playground 
  2. Realtime Cube

Also check out the Drive REST API, Android API, Drive SDK etc..

What techniques/communication protocols would be ideal for such an RT application?

Candidates:- Long Polling, AJAX, Comet, XMPP, Websocket, SSE, WebRTC….

“Competition Makes Us Faster; Collaboration Makes Us Better”

References:-

  1. https://developers.google.com/google-apps/realtime/overview
  2. https://en.wikipedia.org/wiki/Operational_transformation
  3. Koren, I., Guth, A., & Klamma, R. (2013, October). Shared editing on the web: A classification of developer support libraries. In Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom), 2013 9th International Conference on (pp. 468-477). IEEE.

[ Date: 15th June ’17 ]


12. Intro to NoSQL and Graph Databases: An Eleventh Hour Inclusion !!

A couple of weeks ago my dear friend and room-mate, who apparently is a freelancer and a ‘web-tech expert’, asked me: “Hey, do you know what NoSQL is??” I said: “That’s simple: No SQL!!” Actually, that’s not all…

It took me back a couple of years… when we, as final-year bachelor’s degree scholars, were told to give seminars on topics of our interest. (Internals, you know!! We had to make a trade-off between preparing for the final exam and the marks we got from these assignments and seminars!!!) And there were these nerds/geeks, a group of two or three students, as usual very passionate, exploring all those ‘out of syllabus’ stuffs and discussing the hifi-techno stuff and code in class. Two of them (VRPW and RS) gave their seminars on the same domain: ‘Graph DB and NoSQL’. Now, as a matter of coincidence, these two nerds soon became GSoCians (Google Summer of Code), and finally one of them ended up (a year ago or so) in Amazon. And here I am, struggling!!! So I dedicate this article as a tribute to these guys, who’ve inspired me all the way…

So what is NoSQL? Why do we need it? How did it come about?

Lots of questions to think about??

NoSQL (pronounced ‘nosequel’, and not No-Es-Qu-El!!)??

Ok, Wiki says

“A NoSQL database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases.”

A simple and general definition for NoSQL would be “Not only SQL”.

But first – SQL!! That’s familiar…

CREATE TABLE Students (sid CHAR(9), name VARCHAR(20), login CHAR(8), gpa REAL);

That’s the first thing that comes to mind…

So, it’s all about tables, schemas, transactions, normalization, joins, relations… right?? We are familiar with relational databases and SQL. In fact, for the last 20 years or so, SQL and relational databases have been the de-facto standard in industry and enterprise for storing and retrieving data in practical data-processing applications.

SQL has ruled for two decades. The following characteristics made it popular in industry- and enterprise-level applications:

  • Store persistent data
  • Application integration
  • Concurrency control
  • Mostly standard: SQL
  • Reporting

But its dominance is cracking!!!

So what’s wrong with this approach?? Let’s see!!!

  1. Impedance mismatch problem: In memory we assemble structures of objects, often as a cohesive whole, and then, in order to save them to the database, we have to strip them apart so that a single logical structure goes into individual rows in individual tables; the data that is processed in memory as one unit ends up being splattered across lots and lots of tables. The fact that we have these two different models of how to look at things, and the fact that we have to map between them, causes difficulties.
imp_mis
Fig. 1: Impedance Mismatch
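A tiny Python sketch of the mismatch (hypothetical Order shape, just plain tuples standing in for table rows): a single in-memory aggregate has to be flattened into rows for two tables, and reassembled on the way back out:

```python
# One cohesive in-memory object...
order = {
    "id": 7,
    "customer": "alice",
    "lines": [
        {"item": "pen", "qty": 2},
        {"item": "book", "qty": 1},
    ],
}

# ...must be split across two relational tables (orders, order_lines).
orders_row = (order["id"], order["customer"])
line_rows = [(order["id"], l["item"], l["qty"]) for l in order["lines"]]

# And to read it back, the rows must be joined and re-assembled.
rebuilt = {"id": orders_row[0], "customer": orders_row[1],
           "lines": [{"item": i, "qty": q} for _, i, q in line_rows]}
assert rebuilt == order
print(line_rows)
```

Every save and load crosses this object-to-rows boundary, which is exactly the mapping work that ORMs exist to automate.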

2. Scaling: Relational databases are designed to run on a single machine, so to scale, you need to buy a bigger machine. One option is vertical scaling (scale-up) as data becomes larger and network traffic increases, but there are limits to this approach. So instead we scale horizontally (scale-out), the cheaper, more reliable and more effective approach of buying lots of machines and distributing the load over large clusters (typical for cloud and big-data applications).

cluster
Fig. 2: Horizontal Scaling (Clusters)

So a couple of organizations said, “We’ve had enough of these problems, we need to do something different,” and developed their own data storage systems that were really quite different from relational databases. They published papers about what they were up to, and it is this that really inspired a whole new movement of databases: the ‘nosequel’ movement.

The Tech Giants :-  Google -> Bigtable     &    Amazon -> Dynamo

Its origin is really very simple:-

It was Johan Oskarsson, a developer in London who did a lot of work with Hadoop, who organized a meetup to discuss these trends and problems, and came up with a Twitter hashtag to advertise that single meeting at one point in time. That hashtag has now become the name of the whole movement; it was completely accidental.

Recent studies show that most of the data on the internet is unstructured. So what is that? Unstructured data can be anything: video files, image files, PDFs, emails etc. What do these files have in common? Nothing. Structured information can be extracted from unstructured data, but the process is time-consuming. And as more and more modern data is unstructured, there was a need for something to store such data for growing applications, hence setting the path for NoSQL.

Definition of NoSQL:-

Well, there isn’t actually a proper definition for NoSQL databases, but we can attribute some common characteristics to them, as follows:-

  • Non-relational
  • Cluster friendliness
  • Open-source
  • 21st-century Web (Web 2.0)
  • Schema-less

NoSQL databases differ from SQL-based DBs mainly by using a different data model, and they can be further classified into these four chunks or groups according to the data model employed.

nosqldb
Fig. 4: NoSQL-DB
  1. Key-Value Store

    Here the idea is: you have a key, you go to the database and retrieve the value for that key. The database knows absolutely nothing about what’s inside that value; it could be a single number, it could be some complex document, it could be an image; the database doesn’t know and doesn’t care. This is basically just like a hash map, but persisted on disk.

  2. Document Store

    The document data model thinks of a database as a store of a whole mass of different documents, where each document is some complex data structure, usually in the form of JSON (or maybe XML). Document databases usually allow you to query for documents that have such-and-such fields, and you can usually retrieve portions of a document and update those portions. So the big difference here is that the key-value store is a very opaque structure, while the document is much more transparent.

  3. Column Family

    In a column-family database we have a single row key, and within that we can store multiple column families, where each column family is a combination of columns that fit together. The column family is effectively your aggregate, and you address it by a combination of the row key and the column-family name. It gives you slightly richer, more complex data to work with, much like storing an array in a document; but the benefit you get is in retrieval: you can more easily pull individual columns, and things of that kind, out of the store.

These models can be classified as aggregate-oriented models, since they in effect deal with chunks of aggregate data, stored and accessed together. They are in effect schema-less (though they still use an implicit schema), and storing these aggregates on a single node (or a few nodes together) helps improve performance when accessing, retrieving or processing the data.
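The key-value idea above can be sketched in a few lines of Python. This is a toy, hypothetical store (a dict persisted as JSON on disk, not a real database) that treats values as opaque blobs:

```python
import json, os, tempfile

# Toy key-value store: put/get by key, value is opaque to the store,
# and the whole map is persisted to disk on every write.
class TinyKV:
    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:       # reload previously persisted state
                self.data = json.load(f)

    def put(self, key, value):
        self.data[key] = value
        with open(self.path, "w") as f:  # persist on every write
            json.dump(self.data, f)

    def get(self, key):
        return self.data.get(key)

path = os.path.join(tempfile.mkdtemp(), "kv.json")
db = TinyKV(path)
db.put("user:42", {"name": "alice", "gpa": 3.9})   # value could be anything
print(TinyKV(path).get("user:42"))                 # survives a "restart"
```

The store never inspects the value, which is exactly why key-value databases scale so easily: there is nothing to interpret, only bytes to fetch by key.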

  4. Graph Databases

    A graph database uses a data model with a node-and-arc graph structure (not a bar chart or anything like that!). The nice thing about a graph database is that it is very good at handling relationships between things. Relational databases, in fact, are not terribly good at jumping across relationships: you have to set up foreign keys and you have to do joins, and too many joins get expensive; modelling a graph structure, or a hierarchy (a special form of graph structure), is awkward. Graph databases make it easy to do these things, and they are optimized to make that kind of traversal fast; furthermore, they come with interesting query languages designed around querying graph structures.

So, the decision to choose between the aggregate, graph or relational model depends on what kind of data you tend to work with. If you tend to work with the same aggregates all the time, that leads you towards an aggregate-oriented approach; if you want to really break things up and jump across lots and lots of relationships in a complex structure, that leads you to a graph approach; or, if the tabular structure is working well for you, you may want to stay with a relational approach.

Consistency:-

ACID vs. BASE: When we talk about consistency of SQL, relational databases and transactions, we immediately come up with the acronym ‘ACID’. It refers to the properties of Atomicity, Consistency, Isolation and Durability (which we seem to be quite familiar with..). In NoSQL we follow a looser consistency approach, usually referred to as ‘BASE’.

acid_vs_base

A BASE system gives up on consistency as follows:

  • Basically Available indicates that the system does guarantee availability, in terms of the CAP theorem.
  • Soft State indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
  • Eventual Consistency indicates that the system will become consistent over time, given that the system doesn’t receive input during that time.
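Here is a toy Python sketch of these BASE ideas (plain Python, not any real database): a write is acknowledged after updating just one replica, so reads can be stale until a propagation step eventually brings every replica up to date.

```python
# Three replicas of the same key-value store (toy model).
replicas = [{"x": 1}, {"x": 1}, {"x": 1}]

def write(key, value):
    # The write is acknowledged after updating just one replica:
    # the system stays available, but is momentarily inconsistent (soft state).
    replicas[0][key] = value

def propagate():
    # Background anti-entropy step: copy replica 0's state to the others.
    for r in replicas[1:]:
        r.update(replicas[0])

write("x", 42)
print([r["x"] for r in replicas])  # [42, 1, 1]  -> stale reads are possible
propagate()
print([r["x"] for r in replicas])  # [42, 42, 42] -> eventually consistent
```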

So, NoSQL databases sacrifice some critical data protections in favor of high-throughput performance for unstructured applications. But in fact, we still have to deal with ACID properties within the transaction boundaries of a NoSQL system, which happen to be the aggregate boundaries. On the other side, graph databases still follow ACID properties.

The consistency we’ve been talking about so far is actually logical consistency. These consistency issues occur whether you’re running on a cluster of machines or on one single machine; you always have to worry about them. Now, when you start spreading data across multiple machines, this can introduce more problems. When it comes to distributing data broadly, you can talk about it in two different ways. One is sharding the data: taking one copy of the data and splitting it across different machines, so that each piece of data lives in only one place. Another thing that is common to do with clusters of machines is to replicate data: putting the same piece of data in lots of places. This can be advantageous in terms of performance, because now you’ve got more nodes handling the same set of requests; it can also be very valuable in terms of resilience, since if one of your nodes goes down the other replicas can still keep going. But replication can lead to its own class of consistency problems, i.e. replication consistency.
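The sharding idea above can be sketched in a few lines of Python (a toy model, using a stable checksum in place of real consistent hashing): each key deterministically routes to exactly one shard, so every piece of data lives in only one place.

```python
import zlib

# Toy sharded store: three shards, each a plain dict.
N_SHARDS = 3
shards = [dict() for _ in range(N_SHARDS)]

def shard_for(key):
    # Stable checksum so the same key always routes to the same shard.
    # Real systems use consistent hashing so shards can be added/removed.
    return zlib.crc32(key.encode()) % N_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

for k in ["ann", "dan", "neo4j", "docker"]:
    put(k, k.upper())

print(get("neo4j"))  # NEO4J
```

Replication would be the complementary move: copying each shard to several nodes, which is exactly where the replication-consistency problems come in.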

CAP Theorem: 

It can be put as: of these three properties, you only get to choose 2 out of the 3:-
  • Consistency: Every read would get you the most recent write.
  • Availability: Every node (if not failed) always executes queries.
  • Partition-Tolerance: Even if the connections between nodes are down, the other two promises (C & A) are kept.
scalability-cap-theorem1
Fig 5: CAP Theorem

I think it’s easier to reformulate it. It’s a bit clearer if you say: if you’ve got a distributed system that can suffer a network partition (which basically means communication between different nodes in a cluster breaking down), then when a partition happens you have a choice: “Do you want to be consistent, or do you want to be available?”..That’s really what the CAP theorem boils down to!!! (If we dig a little deeper, it is all about consistency vs. response time, or in general safety vs. liveness!!). In fact, this trade-off between consistency and availability should be decided based on the application domain, and dealt with as a business decision rather than a technical issue.

So let’s see some enablers or driving forces for using NoSQL databases:

  • Dealing with large-scale data: The amount of data that we have to deal with keeps on growing (big data), and lots of companies have to capture, store and process this data for their enterprise or business needs.
  • Easier development: We don’t have to deal with the impedance mismatch, and we can avoid the pain of maintaining relational schemas and tables etc. In general, it makes development easier and faster.
  • Encapsulating data: Nowadays, we encapsulate data around applications by using service-oriented approaches like web services, and later exchange/communicate through mechanisms like SOAP/REST over an ESB etc.
  • Data warehousing: Aggregate-oriented approaches help in analysing a particular form/group of data easily with respect to the application domain. Graph databases help in analysing relationships between entities (social networks) easily and efficiently.

So what about the future of SQL databases…Will they die away?? No!!! They would still thrive on!

But the future leads us to an era of polyglot persistence.

i.e. ”Using multiple data storage technologies, chosen based upon the way data is being used by individual applications.“

It would look something like this:-

polyglot
Fig. 6: Polyglot Persistence

But there are problems here too..Like Decision, Immaturity, Organizational Change, Eventual Change etc.

So when is it recommended to use/go with NoSQL??

Basically, if you need rapid and easier development, or if you have to deal with data-intensive applications, then you may try out NoSQL!!!

Also, if you are dealing with a project that needs some competitive advantage (strategic projects), then it is worth exploring these new techniques, with their flexibility etc., despite the immaturity.

Graph Databases:-

As we have seen in the earlier section, graph databases fall into the category of NoSQL databases; but they follow the ACID approach.

OK, Let’s begin…

What are Graphs?

Well, a graph is just a collection of vertices and edges—or, in less intimidating language, a set of nodes and the relationships that connect them. Graphs represent entities as nodes and the ways in which those entities relate to the world as relationships.

We know the general properties of graphs and many of their theorems (I’m not going to dig up that stuff again!!..Hope we have had enough of it..from Maths, Algorithms, Data Structures etc..Well, they are ‘prerequisites’ for understanding GDs).

Now Graph Databases..

A graph database management system (henceforth, a graph database) is an online database management system with Create, Read, Update, and Delete (CRUD) methods that expose a graph data model. Graph databases are generally built for use with transactional (OLTP) systems.

Wikidef:-

In computing, a graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data. A key concept of the system is the graph (or edge or relationship), which directly relates data items in the store. The relationships allow data in the store to be linked together directly, and in many cases retrieved with one operation.
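That “directly linked, retrieved with one operation” point can be sketched in plain Python: with an adjacency structure, following a relationship is a direct lookup rather than a join (toy data, made up for illustration).

```python
# Toy property graph: node -> list of (relationship, neighbour) pairs.
graph = {
    "Ann":  [("FRIEND", "Dan"), ("FRIEND", "Mike")],
    "Dan":  [("FRIEND", "Mike")],
    "Mike": [],
}

def friends_of_friends(person):
    # Two hops = two direct lookups; a relational DB would need a self-join
    # for every extra hop, and cost grows with the size of the joined tables.
    result = set()
    for rel, friend in graph[person]:
        for rel2, fof in graph[friend]:
            result.add(fof)
    return result

print(friends_of_friends("Ann"))  # {'Mike'}
```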

Now, we already have all those different types of databases, like relational, object-oriented databases (Oh, where did that come from? We didn’t discuss it..More on this later…), NoSQL DBs of different types like key-value, document store etc…

So, Why Graph Databases??

  • Data is increasing in volume and getting more connected.
  • Traditional RDs cannot model/store such data and relationships without complexity.
  • For RDs, performance degrades with the complexity of relationships and DB size.
  • RDs have a query-complexity problem, which grows with the need for JOINs.
  • Relational databases need schema redesign for new data, increasing time to market.
  • In other NoSQL DBs there are no data structures or query constructs to support relationships.
  • There is no ACID support in traditional NoSQL DBs.

So this brings us to Graph Databases, the most popular one being Neo4j.

Neo4j is an enterprise-grade graph database that enables you to model and store your data as a graph and query complex relationships with ease, in real time. It provides high performance, scalability, agile development and seamless evolution.

Neo4j
Fig. 6: Neo4j Architecture

Cypher is the graph query language used by Neo4j. Unlike SQL, Cypher is not a standardized language, but Neo4j uses it throughout. It is a declarative query language, supports all the common graph operations, works based on pattern matching, and is a human-friendly query language. CREATE, SET, DELETE and MATCH are the core operations. The major benefits of the Neo4j model include intuitiveness, speed and agility; queries are also short and less complex.

The Property Graph Model:-

The property graph contains connected entities (the nodes) which can hold any number of attributes (key-value pairs). Nodes can be tagged with labels representing their different roles in your domain. In addition to contextualizing node and relationship properties, labels may also serve to attach metadata (index or constraint information) to certain nodes.
Relationships provide directed, named, semantically relevant connections between two node entities. A relationship always has a direction, a type, a start node, and an end node. Like nodes, relationships can have any properties. In most cases, relationships have quantitative properties, such as weights, costs, distances, ratings, time intervals, or strengths. As relationships are stored efficiently, two nodes can share any number or type of relationships without sacrificing performance. Relationships can always be navigated regardless of direction. Also, no broken links are allowed.

We can use Neo4j to create models and query them. The first step is to create the model, then load the data (CSV) and finally query it. We can use many languages (language drivers in Java, Python etc.) to query a Neo4j database. An advantage of graph databases is that they are ‘whiteboard friendly’, i.e. the whiteboard model usually sketched out can be directly and easily modelled in a graph database.

A Neo4j example:

neosample
Fig. 7:Sample Neo4j Graph Model

Here Person & Car are the labels and DRIVES, LOVES etc. are (directed) relationships, whereas name, born, model etc. represent the properties of nodes; finally, ‘since’ is a property of the relationship (DRIVES).

Sample Code:-

CREATE (:Person {name:"Ann"}) -[:LOVES]-> (:Person {name:"Dan"})

MATCH (p:Person) -[:LOVES]-> (:Person {name:"Dan"})

In general,

MATCH (node:Label) RETURN node.property

MATCH (node1:Label1)-->(node2:Label2)
WHERE node1.propertyA = {value}
RETURN node2.propertyA, node2.propertyB
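To make the pattern-matching idea concrete, here is a toy Python sketch (not Neo4j code) that evaluates the earlier LOVES pattern over a tiny in-memory property graph; the data is made up.

```python
# In-memory property graph: nodes carry labels + properties,
# relationships are (start, type, end) triples.
nodes = {
    1: {"labels": {"Person"}, "props": {"name": "Ann"}},
    2: {"labels": {"Person"}, "props": {"name": "Dan"}},
    3: {"labels": {"Person"}, "props": {"name": "Eve"}},
}
rels = [(1, "LOVES", 2), (3, "LOVES", 1)]

def match_loves_dan():
    # MATCH (p:Person)-[:LOVES]->(:Person {name:"Dan"}) RETURN p.name
    return [
        nodes[s]["props"]["name"]
        for (s, t, e) in rels
        if t == "LOVES"
        and "Person" in nodes[s]["labels"]
        and "Person" in nodes[e]["labels"]
        and nodes[e]["props"].get("name") == "Dan"
    ]

print(match_loves_dan())  # ['Ann']
```

A real Cypher engine does much more (indexes, planning, variable-length paths), but the declarative shape of the query maps onto a filter over nodes and relationships just like this.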

Neo4j Use cases (GD’s in general) :-

  • Real-time Recommendations
  • Master Data Management
  • Fraud Detection
  • Graph Based Search
  • Network and IT Operations
  • Identity and Access Management.

Companies using Graph Databases :-

  1. Google, LinkedIn, PayPal
  2. Twitter ( FlockDB )
  3. Adobe ( Neo4j )
  4. Microsoft ( Infinite Graph )
  5. Accenture (Neo4j)
  6. eBay (Neo4j)
  7. Facebook (Apache Giraph)
  8. Cisco (Neo4j)
  9. Lufthansa (Neo4j)
  10. HP (Neo4j)

OK..Now where is the heavy stuff??

To our ‘GURUJI’…For some interesting lectures & tech-sessions!!!

guruji
Fig. 8: Neo4j Sample

For code and installation..Refer Github

[Refer Neo4j official site for complete understanding]

NB: 

I have mainly referred to Martin Fowler’s talk and the official Neo4j videos for this article..

The data for GDs is usually imported in the form of CSV files..

Interestingly neo4j uses D3.js, which we’ve seen in the gaming section!!!

Must watch : GOTO 2012 • Introduction to NoSQL • Martin Fowler

Graph databases and Cypher seem similar to our Ontograph and SPARQL?? Semantic Web?? How are they related??

What happened to object-oriented databases? Why didn’t they click?

How are these databases related to big data and cloud? Inspirations and influences?? (Refer the last book i.e. reference link)

For case studies:- Famous NoSQL DB’s – Refer Cassandra(pdf in link;column family), Riak(key-value store), MongoDB(Document store), Neo4j (Graph), HBase(Hadoop based) etc. (Also in the book mentioned above)

“Information is the oil of the 21st century, and analytics is the combustion engine” – Peter Sondergaard

References:-

  1. http://nosql-database.org/
  2. https://en.wikipedia.org/wiki/NoSQL
  3. https://www.thoughtworks.com/insights/blog/nosql-databases-overview
  4. https://martinfowler.com/nosql.html
  5. https://martinfowler.com/articles/nosql-intro-original.pdf
  6. http://www.kdnuggets.com/2016/07/seven-steps-understanding-nosql-databases.html
  7. https://www.youtube.com/watch?v=Yzbk6VaavoM
  8. https://neo4j.com/
  9. http://www.studytonight.com/mongodb/what-is-nosql
  10. https://www.slideshare.net/rahuldausa/what-is-no-sql-and-cap-theorem
  11. http://www.christof-strauch.de/nosqldbs.pdf
  12. https://www.couchbase.com/binaries/content/assets/website/docs/whitepapers/why-nosql.pdf
  13. http://info.neo4j.com/rs/neotechnology/images/Graph_Databases_2e_Neo4j.pdf
  14. https://neo4j.com/docs/cypher-refcard/current/
  15. https://www.slideshare.net/neo4j/intro-to-neo4j-and-graph-databases
  16. https://drive.google.com/file/d/0B1nRmnzAbVYPem94b21nNXlLd2M/view?usp=sharing
  17. https://drive.google.com/open?id=0By8lpnlInoVjRW84azFzSG90ZkU
  18. https://drive.google.com/open?id=0By8lpnlInoVjejI5eUZwM0FfYmM
  19. https://www.youtube.com/watch?v=qI_g07C_Q5I
  20. Graph Database System : IEEE Engineering in Medicine and Biology , November 1995
  21. Comparison of Graph Database and Relational Database ,University of Mississippi
  22. Survey of Graph Database Models , University of Chile
  23. Harrison, G. (2015). Next Generation Databases: NoSQL and Big Data. Apress.

[ Date: 16th July ’17 ]


13. Intro to container technology: Docker

You’re Gonna Love This; Trust Me……

goku_normal

You might have come across the term ‘Docker‘ when going through the installation instructions of some software for various platforms. For e.g., if you go to the Neo4j official page for installation, they give a separate menu of installation instructions for Linux, Windows, Mac and finally ‘Docker’. So, what is ‘Docker’? A new OS? Not exactly!! It is, though, similar to the virtualization technology that we already know about (Oh…VMware!!).

What are CONTAINERS?

A container image is a lightweight, stand-alone, executable package of a piece of software that includes everything needed to run it: code, runtime, system tools, system libraries, settings. In simple words, containers are an encapsulation of an application with its dependencies. They package software into standardized units for development, shipment and deployment. Containerisation, in effect, provides OS-level virtualisation. This virtualisation isolates processes, giving each limited visibility and resource utilisation, such that the processes appear to be running on separate machines, thus allowing more applications to run on a single machine. In essence, it enables you to run applications in an isolated environment.

OS-level virtualisation??

Ok, what is that Wiki??

Operating-system-level virtualization is a computer virtualization method in which the kernel of an operating system allows the existence of multiple isolated user-space instances, instead of just one. Such instances, which are sometimes called containers, virtualization engines (VEs) or jails (FreeBSD jail or chroot jail), may look like real computers from the point of view of programs running in them.

Containers: A Useful Analogy

To understand why containers are such a big deal, let’s think about physical containers for a moment. The modern shipping industry only works as well as it does because we have standardized on a small set of shipping container sizes. Before the advent of this standard, shipping anything in bulk was a complicated, laborious process. Imagine what a hassle it would be to move some open pallet with smartphones off a ship and onto a truck, for example. Instead of ships that specialize in bringing smartphones from Asia, we can just put them all into containers and know that those will fit on every container ship.

The promise behind software containers is essentially the same. Instead of shipping around a full operating system and your software (and maybe the software that your software depends on), you simply pack your code and its dependencies into a container that can then run anywhere — and because they are usually pretty small, you can pack lots of containers onto a single computer.

Containers: History in a single glance…

container_history
Fig. 1: Container Technology: Development History

How are containers different from VM’s and virtualization technology?

A hypervisor is required to create and run VMs, and each VM requires a full copy of the OS.

In the case of containers, the host’s kernel is shared with the running containers. They use the same libraries and can share this data rather than keeping redundant copies. The container engine is responsible for starting and stopping containers. The processes running inside containers are equivalent to native processes on the host and do not incur the overheads associated with hypervisor execution. VMs have an added degree of isolation from the hypervisor and are a trusted and battle-hardened technology. Containers are comparatively new, and many organizations are hesitant to completely trust their isolation features before they have a proven track record. For this reason, it is common to find hybrid systems with containers running inside VMs, taking advantage of both technologies. Containers are thus lightweight, mostly in the megabytes size range; compared to VMs they perform much better and can start almost instantly.

container_vs_vm
Fig. 2:Container vs. VM

The fundamental goals of VMs and containers are different—the purpose of a VM is to fully emulate a foreign environment, while the purpose of a container is to make applications portable and self-contained.

Container Classification

5512204-os-vs-app-containers-a3f09c304838c7687273371874f61
Fig. 3:Container Classification

So in general, when you want to package and distribute your application as components, application containers are a good choice. Whereas, if you just want an operating system in which you can install different libraries, languages, databases, etc., OS containers are better suited.

Docker has become the most popular container software platform …

Docker – Build, Ship, and Run Any App, Anywhere…

Advantages Of Using Container

  1. They are more efficient than VMs since they share the kernel. They are lightweight, requiring fewer resources; they can be started and stopped in no time and require less memory.
  2. The portability of containers has the potential to eliminate a whole class of bugs caused by subtle changes in the running environment. It just works everywhere - your laptop, server systems, cloud etc. - without the need for any changes.
  3. It enables easier development and use. We can run dozens of containers at the same time, and end-users can download and run complex applications without needing to spend hours on configuration and installation issues, dependency problems etc.
  4. Containerization allows for greater modularity. Rather than running an entire complex application inside a single container, the application can be split into modules (the micro-service approach). Applications thus become more manageable (also for version control, security etc.) and are easier to modify.
adv_container
Fig. 4: Advantages of Container Technology

Container technology is still not a mature one. A lot of research is still going on, and there are open questions on issues like security, scalability etc.

The container stack:-

There are four technology layers that need consideration:

1. Container operating systems

Even though containers do not have an embedded OS, one is still needed. Any standard OS will do, including Linux or Windows. However, the actual OS resources required are usually limited, so the OS can be too. Specialist container operating systems include Rancher OS, CoreOS, VMware Photon, Ubuntu Snappy, the Red Hat-backed Project Atomic and Microsoft Nano Server.

2. Container engine

The container engine runs the containers; examples include the Docker engine and CoreOS Rocket. Engines come with supporting tools, for example the Docker Toolbox, which simplifies the setup of Docker for developers, and the Docker Trusted Registry for image management. There are also third-party tools, such as Cloud66.

3. Container orchestration

Containers need to be intelligently clustered to form functioning applications. The engines provide basic support for defining simple multi-container applications, for example Docker Compose. However, full orchestration involves scheduling of how and when containers should run, cluster management and the provision of extra resources, often across multiple hosts. Tools include Docker Swarm, the Google-backed Kubernetes and Apache Mesos. You could use general-purpose configuration tools, such as Chef and Puppet etc.

4. Application support services

Many additional tools are emerging to support containerised applications. Some examples: services for cloud deployment, networking services, support for the life-cycle of containerised applications etc.

gokuss

Case Study: Docker

Docker is the world’s leading software container platform. Developers use Docker to eliminate “works on my machine” problems when collaborating on code with co-workers.

As Solomon Hykes, the creator of Docker, says “You’re going to test using Python 2.7, and then it’s going to run on Python 3 in production and something weird will happen. Or you’ll rely on the behavior of a certain version of an SSL library and another one will be installed. You’ll run your tests on Debian and production is on RedHat and all sorts of weird things happen.”

Operators use Docker to run and manage apps side-by-side in isolated containers to get better compute density. Enterprises use Docker to build agile software delivery pipelines to ship new features faster, more securely and with confidence for both Linux and Windows Server apps.

docker
Fig. 5: Docker

Technically speaking….

Docker provides an additional layer of abstraction and automation of operating-system-level virtualization on Windows and Linux. Docker uses the resource isolation features of the Linux kernel such as cgroups and kernel namespaces, and a union-capable file system such as OverlayFS and others, to allow independent “containers” to run within a single Linux instance, avoiding the overhead of starting and maintaining virtual machines. Docker implements a high-level API to provide lightweight containers that run processes in isolation. It relies on the kernel’s functionality and uses resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces to isolate the application’s view of the operating system. Docker accesses the Linux kernel’s virtualization features either directly using the libcontainer library, which is available as of Docker 0.9, or indirectly via libvirt, LXC (Linux Containers) or systemd-nspawn.

Docker-linux-interfaces.svg
Fig. 6: Docker-Linux

Skadoosh!!!

What it means, in short….

“Docker can use different interfaces to access virtualization features of the Linux kernel”

In simple terms..

A container is a running instance of an image. An image is a template for creating the environment you want: a snapshot of the system at a particular time. So it’s got the operating system, the software and the application code all bundled up in a file. Images are defined using a Dockerfile, which is just a text file with a list of steps to perform to create that image, e.g. configure the operating system, install the software you need, copy the project files into the right places etc. So, in short: you write a Dockerfile, you build it and you get an image, from which you can then start containers. Containers contain the binaries, libraries, filesystem etc. and use the host for networking and the kernel.

docker_deploy
Fig. 7: Docker Deployment

The entire needs of an application are defined in a text file called a Dockerfile. Here is an example Dockerfile. It is very simple and actually easy to read even if you don’t know Docker. In this example we build a container based on the binaries and libraries that you get from an Ubuntu distribution; on top of that base we install an application called Redis server, then we expose the port we want to open, i.e. 6379, and finally we have what’s called an entry point, which is the application we want to start in this container: the Redis server. These four lines equal an entire container that is ready to run, so containers contain everything your application needs.

docker_file
Fig. 8: Docker File
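Since the figure shows the Dockerfile only as an image, here is a plausible reconstruction of the four lines described above (the base image tag and package name are assumptions):

```dockerfile
# Base: binaries and libraries from an Ubuntu distribution (tag assumed)
FROM ubuntu:16.04
# Install the application: a Redis server
RUN apt-get update && apt-get install -y redis-server
# The port this container will listen on
EXPOSE 6379
# The process to start in the container
ENTRYPOINT ["redis-server"]
```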

The Docker platform consists of 3 main components:

  • The Docker Client:The front end interface that allows communication between the user and the Docker Daemon.
  • The Docker Daemon:The back end interface that handles requests from the Docker Client. It sits usually on top of the host Operating System (The Docker Host).
  • The Docker Index:A repository where we can search, push or pull Docker Containers or Images. It works much like a SaaS service and has a public and private access permission.

The following is an architectural view of a Docker environment.

docker-container
Fig. 9: Docker - Layered Architecture

You can easily use already-existing apps with Docker, and it leaves your host OS clean. Usually we use ‘one process per container’ (best practice).

Docker benefits:-

  1. Resource-lite
  2. Fast start-up and shut-down
  3. Public repos
  4. Standardization
  5. Scalability
  6. Unit testing

Use cases:

  • Modernize Traditional Apps: Docker Enterprise Edition increases security and portability of existing apps while saving costs.
  • Continuous Integration and Deployment (CI / CD): Accelerate app pipeline automation and app deployment and ship 13X more with Docker.
  • Microservices: Accelerate the path to modern app architectures.
  • IT Infrastructure Optimization: Get more out of your infrastructure and save money.
  • Hybrid Cloud: Avoid Lock In and Seamlessly Move Across Clouds.

Docker Facts:-

Solomon Hykes is the Founder, Chief Technology Officer and Chief Architect of Docker and the creator of the Docker open source initiative. Before that he was the co-founder and CEO of Dotcloud.

docker_facts
Fig. 10: Docker Facts

The image below ..I stole it from Wikipedia!!!                                                                       Hmm..We have been doing this since the age of assignments….SO WHAT???

docker_info
Fig. 11: Wiki-docker

Oh my gosh…It was written in GO!!..I just noticed it now…  (What’s the big deal about Go?)

Well..see…Copying also comes with some added benefits; provided you do it with DEDICATION!!   

 ( INFO : GO is a free and open source programming language created at Google in 2007)                                                                                                                                                      

Docker Ecosystem:-

dockerUsers
Fig. 11: Docker Users

Docker can be integrated with various infrastructure tools, including Amazon Web Services, Ansible, CFEngine, Chef, Google Cloud Platform, IBM Bluemix, HPE Helion Stackato, Jelastic, Jenkins, Kubernetes, Microsoft Azure, OpenStack Nova, OpenSVC, Oracle Container Cloud Service, Puppet, Salt, Vagrant, VMware vSphere Integrated Containers, Cloud Foundry PaaS, Red Hat’s OpenShift PaaS, Apprenda PaaS etc.

Docker Tools:- 

Docker Compose: Compose is a tool for defining and running multi-container Docker applications. It uses YAML files to configure the application’s services and performs the creation and start-up process of all the containers with a single command.
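A minimal, hypothetical docker-compose.yml in that spirit: two services (the names and images here are illustrative) configured in YAML and started together with a single `docker-compose up`:

```yaml
# Hypothetical two-service app: one command creates and starts both containers.
version: "3"
services:
  web:
    build: .            # image built from the Dockerfile in this folder
    ports:
      - "80:80"         # host:container port mapping
  cache:
    image: redis:alpine # official Redis image pulled from Docker Hub
```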

Docker Swarm: Docker Swarm provides native clustering functionality for Docker containers, which turns a group of Docker engines into a single, virtual Docker engine.In Docker 1.12 and higher, Swarm mode is integrated with Docker Engine.

Others containerisation tools include :-

Kubernetes and Mesos are the most widely used tools. Other well-known tools are Rocket (rkt), Kurma, Packer, Jetpack, Linux Containers (LXC), CloudSlang, Nomad, Marathon, Fleet, OpenVZ, Solaris Containers, Containership, Rancher and Tectonic.

Well..That’s Too much INFO!! Ain’t it??

Let’s get to work..

ssj2

IT’S CLOBBERIN’ TIME !!!

Hello World Example:-

First install  Docker from this link:-

https://docs.docker.com/engine/installation/linux/docker-ce/ubuntu/

This is ‘Docker Community Edition for Ubuntu‘ (Instructions for other platform can be found here:  Docker Installation ).

docker_installation.jpg
Fig. 12:Installing Docker on Ubuntu

Let’s write a simple hello world application, which is just going to echo “hello world”.

Save it as index.php in a folder called src, for source (say, inside a folder called docker). Our goal is to use Docker to create one server image and run the whole thing inside it. To start, make a new file called Dockerfile and save it next to the src folder.
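The index.php itself is just the one-liner described above (a sketch of the obvious content):

```php
<?php
// The whole "application": print hello world.
echo "Hello World!";
```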

To configure our environment, we want an operating system with PHP and Apache installed. We start our Dockerfile with the name of an existing image, an image that has already been built, and then we conveniently build on top of that. You can find lots of existing images on the Docker Hub; official images are listed along with community ones. You will have to sign up before accessing the hub site.

You’ll find all the variations of the image; check for the required versions and use them accordingly. Pushing and pulling images to/from repos is as easy as using ‘git’. Here you even get instructions telling you how to use the images. We’ll use, as suggested, 7.0-apache; then we want to copy our files inside the image using the COPY keyword, copying the contents of src into /var/www/html. Also, we want to use the EXPOSE keyword to expose port 80. This just means that when you run the image and get a container, that container will listen on port 80 by default.

Our Dockerfile is going to download the PHP image from the Docker Hub, copy our files from src to that location inside the image, tell running containers to listen on port 80, and then output a new image, our customized version, which we will be able to run. So let’s build it.

In terminal:

$ docker build -t hello-world .

It builds the ‘hello-world’ image from the current directory (the dot).

Now, sign in as root (or provide access permissions using the chmod command) and run our application in Docker:

# docker run -p 80:80 hello-world

Now, we can go to localhost and we’ll see hello world.

So we’ve done it ..We’ve got our application running inside a docker container!!

Hit control-c to stop this container .

To mount a volume, we add another option to the docker run command:

root@anilsathyan7-VirtualBox:/home/anilsathyan7/Desktop/docker# docker run -p 80:80 -v /home/anilsathyan7/Desktop/docker/src:/var/www/html/ hello-world

We want that local folder to be mounted at a path inside the container, /var/www/html, the same location the image copies the folder to.

This time when you run it, changes that we make are reflected in the running container straight away; refresh localhost in the browser to see them.

But before you deploy this and try to run the image somewhere else, you will need to rebuild the image to get an updated copy of the files that were put inside volumes.

The life of a container is tied directly to a single process, so you don’t want other things going on in the background that would all be brought down without warning when the main process terminates; the container just stops. But since containers are really lightweight, you can run loads and loads of them on your computer all at the same time.

FROM php:7.0-apache
COPY src/ /var/www/html
EXPOSE 80

Dockerfile (the index.php it serves lives in the src folder)

You want more??

Let’s take on Node.js..

Tada….


Running node.js in docker!!!

That was blazingly fast… barely 5 minutes!!! From installation to running the application..

Now we have JavaScript running on Node.js, which is running in a Docker container, which is running on Ubuntu 14.04 in a virtual machine (Oracle VM VirtualBox), which runs on Windows 10!!!

Well, Windows runs on my Intel i7 Lenovo Z500 laptop!!!

WOW!! Container technology is ‘something else’!!

Now..

This is to go even further beyond… Ahaaaaaaaaaaaaaaaa!!!


Amazon AWS Elastic Beanstalk!!!       Deploy Docker Container Online….

aws_fin
Fig. 14: AWS Elastic Beanstalk – Console

Go to: http://ssswebserver-env.gmmbcqeqj4.ap-south-1.elasticbeanstalk.com/index.php

You can log in to this Docker container from the terminal using the ssh command:

  1. First, register for AWS (start from here & then go here)
  2. Go to Elastic Beanstalk and Create New Application (see Cloud Academy)
  3. Install the EB CLI on your system, along with Python & pip (from the terminal) (see this link)
  4. Create security credentials from AWS: Access Keys, & download the credentials (see the last couple of links)
  5. Now, once your app is up and running, copy the Public DNS (IPv4) from the EC2 instances.
  6. Now, go back to your terminal (Shortcut: CTRL+ALT+T)
  7. Follow these commands:-
  • eb init: select the Mumbai region: 6 (or as required) and log in with your credentials (the access ID and key downloaded earlier)
  • eb ssh --setup: follow the prompts and give a key pair and passphrase (creates a 2048-bit RSA key)
  • eb ssh ‘environment-name’ (see the Tags section in the EBS console of the application for the env-name)
  • Now you will be logged into the Docker container instance in AWS EBS.
  • Log in as root: sudo su
  • Try all the docker commands….

Example:-

  1. docker ps
  2. docker info
  3. docker images
  4. docker stats
  5. docker version
  6. docker inspect
  7. docker history

commit, deploy, diff, export, cp, import, load, logout and many more…see this link

Here is the screenshot of a couple of commands:-


Accessing AWS EBS remotely in linux shell

Now it Feels like…


INFO:

 Node.js is an open-source, cross-platform JavaScript run-time environment for executing JavaScript code server-side.

Amazon Web Services (AWS) is a subsidiary of Amazon.com and provides on-demand cloud computing platforms to individuals, companies and governments, on a paid subscription basis, with a free-tier option available for 12 months.

NB: 

Please be careful when you register for AWS. Please ensure that your account is basic; don’t register as a developer (unless you are ready to pay!!). The basic account and its services are free for 12 months. (Actually, I made that mistake myself!!)

MUST WATCH : Cloud Academy: Docker and Container Technologies

It’s an EXCELLENT TUTORIAL…  

The great news is that – IT’S FREE (Don’t worry… no quizzes or tests!!!) You will get to know the real stuff.. virtualization (theory)…. and more importantly, you can get acquainted with the Linux environment, commands, cloud repos etc.

It’s a bit old though… And not updated!!!

You will get a CERTIFICATE too… Upon Completion!!!


What are microservices and monoliths?

Check out :The Open Container Initiative (OCI), Microsoft Hyper-V, VMware ESX, Xen, Hypervisors, Sandbox, LXC, Docker commands, Tutum, quay.io, AWS Elastic Beanstalk etc.

Is OS virtualization possible on an Android phone?

Tribute: Mr J J (college buddy), through whom I came to know about containers….

References:-

  1. https://www.safaribooksonline.com/library/view/using-docker/9781491915752/ch01.html
  2. https://www.docker.com
  3. https://dzone.com/articles/container-technologies-overview
  4. https://techcrunch.com/2016/10/16/wtf-is-a-container/
  5. https://github.com/snap-ci/snap-ci-blog-content/blob/master/posts/2015-08-12-types-of-containers.md
  6. https://www.youtube.com/watch?v=IEGPzmxyIpo
  7. https://content.pivotal.io/infographics/moments-in-container-history
  8. http://www.tothenew.com/blog/infographic-8-key-benefits-of-using-container-technology/
  9. http://www.informationweek.com/strategic-cio/it-strategy/containers-explained-9-essentials-you-need-to-know/a/d-id/1318961
  10. http://www.computerweekly.com/feature/What-are-containers-and-microservices
  11. https://www.redhat.com/en/containers/whats-a-linux-container
  12. https://blog.kumina.nl/2017/03/an-introduction-to-container-technology/
  13. https://en.wikipedia.org/
  14. https://www.youtube.com/watch?v=Q5POuMHxW-0
  15. https://www.youtube.com/watch?v=YFl2mCHdv24
  16. https://linuxforafrica.org/Docs/2015/01/docker/
  17. https://www.youtube.com/watch?v=wCTTHhehJbU&t=24s
  18. https://www.youtube.com/watch?v=Qw9zlE3t8Ko
  19. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html
  20. http://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html
  21. https://jaseemabid.github.io/talks/fossmeet#1

[ Date: 20th July ’17 ]


14. Intro to Linux and Free Software: Save the Best for Last

When we talk about an OS or Operating System, in general, the first things that come to mind are the words Windows, Linux and Macintosh (Apple Mac). It’s no wonder, since these three are the most popular OSs among both industry and home users. Among these, the Linux (GNU/Linux) OS is considered Free and Open Source.

Let’s start with some geekspeak….

Unix, Linux and GNU….   A Mumbo Jumbo?

Let’s take it nice and slow!!!

  • Unix – It is a commercial operating system developed by AT&T, and licensed to various other companies.
  • Linux – It is a free, open source Unix workalike.
  • GNU – It is a project (backed by the Free Software Foundation) that creates and promotes free, open source software, mostly on Unix-type platforms… Linux uses many GNU packages.

Somewhat rough idea..

Now, to elaborate…

Unix is a family of multitasking, multi-user computer operating systems developed at AT&T’s Bell Labs research facility in the 1970s by Ken Thompson, Dennis Ritchie, and others. The operating system was licensed to a number of companies that developed their own versions: Sun Microsystems made Solaris, and the University of California, Berkeley made BSD, which is the base of a number of other systems such as Mac OS X.

GNU is an open source effort to produce an OS that is compatible with Unix without using any of its code. Development was started by Richard Stallman in 1984. Most of the OS functionality was replicated, but development of the OS kernel proceeded slowly. GNU stands for “GNU’s Not Unix”, and it is an attempt to create a free, independent version of Unix, developed by the Free Software Foundation (FSF). It can be considered an operating system and an extensive collection of computer software, most of which is licensed under the GNU Project’s own GPL.

Linux is an OS kernel originally developed by Linus Torvalds in 1991. Unlike GNU, Linux does use some closed source code to work with hardware and isn’t completely open source. Linux was that missing kernel, and most of GNU’s software was ported to work with it, so you could get a working, Unix-compatible OS that was mostly open source. So, Linux is an operating system kernel that is designed like Unix’s kernel. “Linux” is most commonly used as a name for Unix-like operating systems that use Linux as their kernel. As many of the tools outside the kernel are part of the GNU project, such systems are often known as GNU/Linux. All major Linux distributions consist of GNU/Linux plus other software.

Oh..don’t bother with it….Just see WIKIPEDIA!!! UNIX, LINUX, GNU…..

Interested readers please see : Linux and the GNU System

Waaaaaiiitttttt…Then what do you mean by Free and Open Source Software???

Wikidef:

Free and open-source software (FOSS) is computer software that can be classified as both free software and open-source software. That is, anyone is freely licensed to use, copy, study, and change the software in any way, and the source code is openly shared so that people are encouraged to voluntarily improve the design of the software. This is in contrast to proprietary software, where the software is under restrictive copyright and the source code is usually hidden from the users.

Now FSF defines Free Software as follows :-

“Free software” means software that respects users’ freedom and community. Roughly, it means that the users have the freedom to run, copy, distribute, study, change and improve the software. Thus, “free software” is a matter of liberty, not price. To understand the concept, you should think of “free” as in “free speech,” not as in “free beer”. We sometimes call it “libre software,” borrowing the French or Spanish word for “free” as in freedom, to show we do not mean the software is gratis. (See FLOSS/FOSS)

The benefits of using FOSS include:-

  • Decreasing software costs.
  • Increasing security and stability and protecting privacy.
  • Giving users more control over their own hardware and freedom.
  • Quality, collaboration and efficiency.

Now, what’s the relation between free software and open source software?

See this link :  Why Open Source misses the point of Free Software

The Four Essential Freedoms

A program is free software if the program’s users have the four essential freedoms:

Freedom 0: The freedom to run the program as you wish, for any purpose.
Freedom 1: The freedom to study how the program works, and change it so it does your computing as you wish. Access to the source code is a precondition for this.
Freedom 2: The freedom to redistribute copies so you can help your neighbour.
Freedom 3: The freedom to distribute copies of your modified versions to others. By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.
A program is free software if it gives users all of these freedoms adequately. Otherwise, it is considered non-free.

In this six-minute video, Richard Stallman explains briefly and to the point the principles of Free Software and how they connect to education.

Finally, COPYLEFT!!!

What do you mean by copyleft?

220px-copyleft-svg
Fig. 1: Copyleft

Copyleft is a general method for making a program (or other work) free (in the sense of freedom, not “zero price”), and requiring all modified and extended versions of the program to be free as well. Copyleft says that anyone who redistributes the software, with or without changes, must pass along the freedom to further copy and change it. Copyleft guarantees that every user has freedom. Copyleft is a way of using the copyright on the program; it doesn’t mean abandoning the copyright, and in fact doing so would make copyleft impossible. The “left” in “copyleft” is not a reference to the verb “to leave”, only to the direction which is the mirror image of “right”.

In general, copyright law is used by an author to prohibit recipients from reproducing, adapting, or distributing copies of their work. In contrast, under copyleft, an author may give every person who receives a copy of the work permission to reproduce, adapt, or distribute it, with the accompanying requirement that any resulting copies or adaptations are also bound by the same licensing agreement.

The concept of copyleft was described in Richard Stallman’s GNU Manifesto in 1985, where he wrote:

“GNU is not in the public domain. Everyone will be permitted to modify and redistribute GNU, but no distributor will be allowed to restrict its further redistribution. That is to say, proprietary modifications will not be allowed. I want to make sure that all versions of GNU remain free.”

For more info refer: What is Copyleft?  And don’t forget our dear friend Wiki

The big picture!!!!

art-colors-creation-digital-materiaux-nature-texture-webmaster-wallpaper-68
Fig. 2: Free Software

A Brief history:-

From the 1950s to almost the 1970s, most companies had a business model based on hardware sales, and provided or bundled software with the hardware free of charge, along with the source code for all programs they used and the permission and ability to modify it for their own use. From the 1970s to the 1980s the business model slowly changed: there was a growing amount of software that was for sale only, and some parts of the software industry began using technical measures (distributing binaries only) to prevent computer users from using reverse engineering techniques to study and customize software they had paid for. In 1973 the Unix system was officially released; it was developed by Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna at Bell Labs. In 1980, copyright law was extended to computer programs in the United States. In 1983, the GNU project was initiated by Richard Stallman, and the FSF was established in 1985; they released the GNU Manifesto, which included a significant explanation of the GNU philosophy, the Free Software Definition and the “copyleft” idea. The Linux kernel, started by Linus Torvalds, was released as freely modifiable source code in 1991 and relicensed under the GNU GPL in 1992. FreeBSD and NetBSD (both derived from 386BSD) were released as free software when the USL v. BSDi lawsuit was settled out of court in 1993. Apache was released in 1995 under the Apache License 1.0. The Debian Project was first announced in 1993 by Ian Murdock; Debian 0.01 was released on September 15, 1993, and the first stable release was made in 1996. In 1997, Eric Raymond published The Cathedral and the Bazaar, a reflective analysis of the hacker community and free software principles. Netscape Communications Corporation released their popular Netscape Communicator Internet suite as free software, which later became Mozilla Firefox and Thunderbird. The Open Source Initiative was founded in February 1998 to encourage use of the new term and evangelize open-source principles.

unix_history-simple
Fig. 3: Evolution of Unix and Unix-like system

Now some LOCAL HISTORY…

The Government of Kerala, India, announced its official support for free/open-source software in its State IT Policy of 2001, which was formulated after the first-ever free software conference in India, Freedom First!, held in July 2001 in Trivandrum, the capital of Kerala. In 2009, the Government of Kerala started the International Centre for Free and Open Source Software (ICFOSS). In March 2015 the Indian government announced a policy on the adoption of open source software.

introduction-to-free-and-open-source-software-foss-5-728
Fig. 4: FOSS Timeline

Ok.. I think that’s enough detail about FOSS philosophy and history…

Now let’s see what we are already familiar with!!!

Linux Distributions:-

A Linux distribution (often abbreviated as distro) is an operating system made from a software collection, which is based upon the Linux kernel and, often, a package management system. A typical Linux distribution comprises a Linux kernel, GNU tools and libraries, additional software, documentation, a window system (the most common being the X Window System), a window manager, and a desktop environment.

Examples:- Debian, Ubuntu, Linux Mint, Fedora, RHEL, CentOS, openSUSE, Arch Linux, Chrome OS etc. Others include OpenWrt for embedded devices and Rocks Cluster Distribution for powerful supercomputers.

Distributions are normally segmented into packages. Each package contains a specific application or service. Examples of packages are a library for handling the PNG image format, a collection of fonts, or a web browser. Thus a Linux distribution is usually built around a package management system, which puts together the Linux kernel, free and open-source software, and occasionally some proprietary software.

1280px-linux_kernel_ubiquity-svg
Fig. 5: Linux and Package Management System

Installation:-

The usual method of installation is by booting from an optical disc that contains the installation program and installable software. You can also download the ISO image from the official sites for different architectures. You may also dual boot your system alongside Windows; this involves selecting the required partitions and formatting the corresponding drives. On newer Windows systems (Windows 8 and later versions) installation can get pretty messy due to the new UEFI firmware that is used in place of the traditional BIOS. But recent versions of Linux (say Ubuntu 16.04) have managed to solve much of these problems and have made the installation process hassle-free.

You may run the OS as a live CD/USB without actually installing it to disk. But here the data is not permanently saved, and thus any updates and user data are removed once you reboot the system.

Another option is to install Linux (typically Ubuntu) inside Windows using the Wubi installer (the now-deprecated Windows Ubuntu Installer), which allows Windows users to download and install Ubuntu or its derivatives into a FAT32 or an NTFS partition without an installation CD, allowing users to easily dual boot between either operating system on the same hard drive without losing data. It can be installed just like an application inside Windows.

Virtual machines (such as VirtualBox or VMware) also make it possible for Linux to be run inside another OS. The VM software simulates a separate computer onto which the Linux system is installed. After installation, the virtual machine can be booted as if it were an independent computer. Eg;:- Oracle VirtualBox.

On embedded devices, Linux is typically held in the device’s firmware and may or may not be consumer-accessible.

Android and Tizen (mobile/embedded/smartphone OSs) also use the Linux kernel.

Ubuntu Touch (also known as Ubuntu Phone) is a mobile version of the Ubuntu operating system that was developed by Canonical Ltd. and the Ubuntu community for smartphones and mobile devices. Now it is managed by UBports.

Even though I have used the likes of IT@School GNU/Linux, Debian and openSUSE, by far Ubuntu is my favourite and the most popular one (for home & education.. at least where I come from..). Actually, Ubuntu is the most popular operating system running in hosted environments, the so-called “clouds”, as it is the most popular server Linux distribution.

logo-ubuntu_st_noc2ae-black_orange-hex
Fig. 5: Ubuntu

Development of Ubuntu is led by UK-based Canonical Ltd. and the Ubuntu community. The latest version is 17.04 (Zesty Zapus); version names are usually based on names of animals. Every fourth release, in the second quarter of even-numbered years, is designated a Long Term Support (LTS) release, indicating that it is supported and receives updates for five years, with paid technical support also available from Canonical Ltd. Non-LTS versions are typically supported for 18 months.

Let me move to a question which I used to ponder in my earlier days as a Linux user… It’s a bit technical though!!!

What is KDE, GTK, GTK+, QT, and/or GNOME?

The best answer from askubuntu.com (a question and answer site for Ubuntu users and developers) is as follows:

“GTK, GTK+, and Qt are GUI toolkits. These are libraries that developers use to design graphical interfaces, all running on top of the X Server. These are things that you need to install as dependencies. They’re the Linux “equivalent” to Windows’ GDI/GDI+. When an application uses any of these, it will always have a general “look and feel”.

GNOME and KDE are Desktop Environments. GNOME primarily uses the GTK+ toolkit, while KDE primarily uses the Qt toolkit. There are applications designed for GNOME or KDE, such as a settings menu or a default music player, usually in the appropriate toolkit. These Desktop Environments have a set of utilities/window managers/design specification to create a more unified desktop. You can mix the two if you feel like it, but you may run into issues with colliding standards and applications (which you might occasionally run into on systems like Arch).

Unity uses many of the GNOME utilities (Nautilus, Rhythmbox, etc.), so Unity is more GNOME than KDE.”

Now let’s see the popular software (FOSS) for Linux and Ubuntu…

Softwares:-

  1. Synaptic Package Manager: Synaptic is a graphical package management tool based on GTK+ and APT. Synaptic enables you to install, upgrade and remove software packages in a user friendly way.
  2. VLC Media Player: VLC is a free and open source cross-platform multimedia player and framework that plays most multimedia files as well as DVDs, Audio CDs, VCDs, and various streaming protocols.
  3. GIMP: The GNU Image Manipulation Program (GIMP) is a cross-platform image editor available for GNU/Linux, OS X, Windows and more operating systems. It is free software; you can change its source code and distribute your changes.
  4. LibreOffice: LibreOffice is a free and open source office suite, a project of The Document Foundation. It was forked from OpenOffice.org in 2010, which was an open-sourced version of the earlier StarOffice. It supports text editing, spreadsheets and slide presentations.
  5. Mozilla Firefox: Mozilla Firefox is a free and open-source web browser developed by the Mozilla Foundation and its subsidiary the Mozilla Corporation.
  6. Wine: Wine is a free and open-source compatibility layer that aims to allow computer programs developed for Microsoft Windows to run on Unix-like operating systems.
  7. Vim: Vim is a clone of Bill Joy’s vi text editor program for Unix. It is free software licensed under the GNU GPL.
  8. Evince: Evince is a document viewer for PDF, PostScript, DjVu, TIFF, XPS and DVI formats. It was designed for the GNOME desktop environment. It is free and open-source software subject to the requirements of the GNU General Public License version 2 or later.
  9. GNU Octave: It is software featuring a high-level programming language, primarily intended for numerical computations. It is broadly similar to Matlab, and may also be used as a batch-oriented language. Since it is part of the GNU Project, it is free software under the terms of the GNU General Public License.
  10. Cheese: Cheese is a GNOME webcam application that supports image and video capturing. It was developed as a Google Summer of Code 2007 project by Daniel G. Siegel. It uses GStreamer to apply effects to photos and videos.

Blowfish, Simple Scan and many more….

Directory Structure :- 

The following image shows an overview of Linux (Unix-like in general) file and directory structure..

linux-file-system-hierarchy-linux-file-structure-optimized
Fig. 6: Linux File System

A simple description of the UNIX system, also applicable to Linux: on a UNIX system, everything is a file; if something is not a file, it is a process. A Linux system, just like UNIX, makes no difference between a file and a directory, since a directory is just a file containing the names of other files. Programs, services, texts, images, and so forth are all files. Input and output devices, and generally all devices, are considered to be files, according to the system.

With such a view, we can consider files to have different types, viz. regular files, directories, special files, links, sockets and named pipes. In a file system, a file is represented by an inode, a kind of serial number containing information about the actual data that makes up the file: to whom the file belongs, and where it is located on the hard disk.

The Filesystem Hierarchy Standard (FHS) defines the directory structure and directory contents in Unix-like operating systems. It is maintained by the Linux Foundation. It is a reference describing the conventions used for the layout of a UNIX system. It has been made popular by its use in Linux distributions, but it is used by other UNIX variants as well. The Linux Standard Base (LSB) refers to it as a standard.

A short description of the directory structure (FHS) is given below:-

/ Primary hierarchy root and root directory of the entire file system hierarchy.
/bin Essential command binaries that need to be available in single user mode; for all users, e.g., cat, ls, cp.
/boot Boot loader files, e.g., kernels, initrd.
/dev Essential device files, e.g., /dev/null.
/etc Host-specific system-wide configuration files
/home Users’ home directories, containing saved files, personal settings, etc.
/lib Libraries essential for the binaries in /bin/ and /sbin/.
/lib&lt;qual&gt; Alternate format essential libraries (e.g., /lib64). Such directories are optional, but if they exist, they have some requirements.
/media Mount points for removable media such as CD-ROMs .
/mnt Temporarily mounted filesystems.
/opt Optional application software packages.
/proc Virtual filesystem providing process and kernel information as files. In Linux, corresponds to a procfs mount. Generally automatically generated and populated by the system, on the fly.
/root Home directory for the root user.
/run Run-time variable data: Information about the running system since last boot, e.g., currently logged-in users and running daemons.
/sbin Essential system binaries, e.g., fsck, init, route.
/srv Site-specific data served by this system, such as data and scripts for web servers, data offered by FTP servers, and repositories for version control systems 
/sys Contains information about devices, drivers, and some kernel features.
/tmp Temporary files, often not preserved between system reboots, and may be severely size restricted.
/usr Secondary hierarchy for read-only user data; contains the majority of (multi-)user utilities and applications
/var Variable files—files whose content is expected to continually change during normal operation of the system—such as logs, spool files, and temporary e-mail files.
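
You can poke around this hierarchy yourself from a terminal; a small sketch (the exact listing will vary by distribution):

```shell
# Top-level directories defined by the FHS
ls /

# /etc holds host-specific configuration, /tmp is scratch space
ls -ld /etc /tmp
```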

OK..Now Let’s get into the INDISPENSABLE part!!

CTRL+ALT+T…..                Booom!!!           ->          TERMINAL!!!!

Oh, not again…Terminologies!!!…Hmmm…Last one  I promise!!!

Terminal: Technically, a terminal window, also referred to as a terminal emulator, is a text-only window in a graphical user interface (GUI) that emulates a console.
In our words: a GUI application from which we can access a user’s console.

Console: An instrument panel containing the controls for a computer.

Shell :A shell is a program that provides the traditional, text-only user interface for Linux and other Unix-like operating systems.

Command-Line : A command line is the space to the right of the command prompt on an all-text display mode on a computer monitor (usually a CRT or LCD panel) in which a user enters commands and data.

Wikidef:

A command-line user interface (CLI), also known as a console user interface and character user interface (CUI), is a means of interacting with a computer program where the user (or client) issues commands to the program in the form of successive lines of text (command lines). A program which handles the interface is called a command language interpreter or shell.

Thus, users can access a Unix-like command-line interface called Terminal, found in the applications/utilities menu. This terminal uses bash by default, which is the standard shell for common users.
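
You can check which shell you are using from the terminal itself (assuming bash is installed, as on a default Ubuntu system):

```shell
# The login shell of the current user is stored in $SHELL
echo "$SHELL"

# bash identifies itself with --version
bash --version | head -n 1
```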

First, let’s see the basic Linux commands.

Here we are using Ubuntu 16.04 LTS…

Open a Terminal from the search (top-left corner) or use the keyboard shortcut mentioned before..

Learning the ropes!!!

Some useful tips:-

Once you get acquainted with the basic commands, you can easily recall the recent commands that you have typed just by using the up and down arrow keys in the terminal.

Also try using Tab key for autocomplete feature (for commands and file-names).

Few keyboard-shortcuts

CTRL+ALT+T : Terminal, ALT+TAB: Window Switcher, ALT+F4: Close Window

CTRL+N : Open New Window, CTRL+P: Print, CTRL+C: Copy, CTRL+V: Paste

Oh sorry!! The last two are familiar to all.. I guess!!!

Hungry for more??  Welcome to the retrofit program…

DISCLAIMER : All characters and other entities appearing in this work are fictitious. Any resemblance to real persons or other real-life entities is purely coincidental.

First, let’s say we have a file with names as follows:-


1. LILIMA JAIN F
2. ARUNA S F
3. KANGAN ARORA F
4. PREETI IVY BARROW F
5. POOJA RAJENDRA GADGE F
6. RAMANI ARYA SHRIHARI F
7. RAHUL JOSHI M
8. SUMIT HAZRA M
9. PALAK JAIN F
10. ANANDANI POOJA MOHANDAS F
11. AGRAWAL KALPESH SITARAM M
12. GREETTA PINHEIRO F
13. SAYALI RAJIV F
14. JOY BHOWMICK M
15. RANVIRKAR MADHURI UDAY F
16. ANIL SATHYAN M
17. VINUPRIYA F
18. JEEVITHASHREE D V F
19. ARCHAKAM PARAMKUSAM SAGAR M
20. CHICHILI VINATHI F
21. POOJA MEHTA F
22. GUTHIKONDA LAKSHMI SOUJANYA F
23. SHRUTI JALAPUR F
24. ARCHANA A SAVALGI F
25. PRAMOD HUDED M
26. NIYATI PURI GOSWAMI F
27. KIRAN PRAKASH HIWARKAR M
28. SAVITHRI G F
29. PEMA NAMGYAL M
30. SHRUTI SARAF F
31. SWATI DHINGRA F
32. SURUCHI KAUR DHANJAL F
33. RADHIKA R F
34. SMITA SHIRISH HEDA F
35. FIONA MATHEWS F
36. AYUSHI DIXIT F
37. PATIL AMRUTA PRABHAKAR F
38. DILSHA DOMINIC F
39. BHAMIDIPATI VVS PADMASRIPRIYA F
40. PRINCY VICTOR F
41. PESALA VIDYA SAGAR M
42. SINGH NEHA AMULYA RATNA F
43. SHUSHEN SHARMA M
44. BHALARA JALPESH KANTILAL M
45. RUNNI KUMARI F
46. ANDANAPPA SUDISHETTAR M
47. VAGHANI RUSHIKUMAR LALLUBHAI M
48. KASALKAR ASHIRWAD SADASHIV M
49. SWATI JAIN F
50. POORVI S DODWAD F
51. DOSHI HARSHIL VIRALKUMAR M
52. GHETIYA RUTVI BABULAL F
53. ABHISHEK RATHORE M
54. JADHAV ASHISH SHIVAJI M
55. DIKSHA MISHRA F
56. MEGHNA MADAN F

Now let’s roll!!!

grep is a powerful tool that searches for matches of a regular expression (or pattern) in a file, multiple files or a stream of input.

In the example given, ‘SAGAR‘ is the pattern given as the first argument and ‘namefile‘ is the name of the file to search for this pattern. Similarly, the second one matches the pattern ‘JAIN‘. By default, grep prints the entire line containing the pattern.

commands1
Fig. 7: Linux-Commands – 1
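
As a minimal reproduction you can try yourself (using a tiny hypothetical sample instead of the full list):

```shell
# Create a three-line sample of the name list
cat > namefile <<'EOF'
9. PALAK JAIN F
16. ANIL SATHYAN M
41. PESALA VIDYA SAGAR M
EOF

# Print every line containing the pattern SAGAR
grep 'SAGAR' namefile

# Print every line containing the pattern JAIN
grep 'JAIN' namefile
```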

The next one looks a little complex. What is that | symbol? It is called a pipe. It feeds the output of the program on the left as input to the program on the right. In this case wc stands for word count. The three numbers shown below it are 3 (number of lines), 9 (number of words) and 47 (number of bytes) of the file; wc -l gives just the number of lines.
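
A quick sketch of the same pipeline, on a small hypothetical three-line file:

```shell
# Three lines, nine words
printf 'LILIMA JAIN F\nARUNA S F\nSUMIT HAZRA M\n' > names.txt

wc names.txt                   # lines, words, bytes of the file
grep 'JAIN' names.txt | wc -l  # count only the matching lines
```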

Next up.. IO Redirection!!!

Every program we run on the command line automatically has three data streams connected to it.

  1. STDIN (0) – Standard input (data fed into the program)
  2. STDOUT (1) – Standard output (data printed by the program, defaults to the terminal)
  3. STDERR (2) – Standard error (for error messages, also defaults to the terminal)

The > symbol redirects the output from standard output (the screen) to a desired place, say a file; here we have mentioned abc.txt. This command overwrites the existing file. The similar operator >> does the same thing without overwriting the file (i.e. it appends).

/dev/null is the null file (like a black hole). Anything written to it is discarded. Errors may be redirected to this file.
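
The redirection operators can be tried out safely, for example:

```shell
echo 'first line'  > abc.txt   # > creates/overwrites abc.txt
echo 'second line' >> abc.txt  # >> appends instead of overwriting
cat abc.txt

# Discard the error message from a deliberately bad path;
# stderr (stream 2) goes into the black hole, and || true
# keeps the script going despite the failing command
ls /no/such/path 2> /dev/null || true
```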

/usr/share/dict/words contains a long list of words(almost a lakh) in alphabetical order.

commands2
Fig. 8: Linux-Commands – 2

The find command is used to search for files in a directory hierarchy. Here . represents the current directory, i.e. the first argument is the directory to search; then -name implies we search by filename, and finally the pattern “*.txt” tells the command to search for text files, i.e. files with the ‘.txt‘ extension. OK, then what is the star, i.e. the ‘*‘ symbol? It is a kind of wildcard pattern which here means ‘match any sequence of characters’. Now, if you really want to restrict the search to regular files, use the -type option as well. In the example it shows the files which were saved with the .txt extension, so a better way is to use the command as find . -type f -name “*.txt”.
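
A self-contained example (the scratch directory and file names are invented for illustration):

```shell
# Build a small directory tree
mkdir -p scratch/sub
touch scratch/notes.txt scratch/sub/todo.txt scratch/photo.png

# Regular files under scratch/ whose names end in .txt
find scratch -type f -name '*.txt'
```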

The ps command reports a snapshot of the current processes. ps aux conveniently lists all processes along with their status and resource usage.

  • a = show processes for all users
  • u = display the process’s user/owner
  • x = also show processes not attached to a terminal
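
For example, piping the snapshot through head keeps the output readable:

```shell
# Header line plus the first few processes; the columns include
# USER, PID, %CPU, %MEM and the command itself
ps aux | head -n 5
```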
command4
Fig. 9: Linux-Commands – 3

The command ifconfig is used to view all network settings, like the IP address, hardware address, network interface details etc., and to configure the network interfaces.

The ping command can be used to check whether a network is reachable or a host is alive. You can use an IP address or hostname as the argument. It also reports how long the round trip takes. Internally it uses ICMP ECHO messages for this purpose.

Fig. 10: Linux-Commands – 4

The apropos command displays a list of all topics in the man pages (i.e., the standard manual built into Unix-like operating systems) that are related to the subject of a query. The help command displays brief summaries of shell builtin commands; if a PATTERN is specified, it gives detailed help on all commands matching PATTERN, otherwise the list of help topics is printed. A similar command is info (try info echo).

Fig. 11: Linux-Commands – 5

Head and Tail commands!! Any guess??

It’s easy: head prints the first few lines of a file and tail prints the last few lines (by default, 10 lines each). You can also tell the command how many lines to print by using the -n option along with the desired number of lines, say 20 or 30. Obviously, the 10 in the example is redundant (since the default is 10). The tail command has another very powerful option: -f prints from the end of the file, but also keeps the file open, and keeps printing from the tail of the file as the file itself grows.
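For instance, with a file of 100 numbered lines (generated here with seq):

```shell
seq 1 100 > numbers.txt   # lines "1" through "100"
head -n 3 numbers.txt     # prints 1, 2, 3
tail -n 3 numbers.txt     # prints 98, 99, 100
# tail -f numbers.txt     # would keep printing as the file grows (Ctrl+C to stop)
```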

sed stands for stream editor and is used for filtering and transforming text. It is mostly used to replace text in a file, but you can actually do much more than that (pattern matching, deleting, adding, replacing and transforming text etc.)!!

Here the “s” specifies the substitution operation. The “/” are delimiters. The “JAIN” is the search pattern and the “KUMAR” is the replacement string. ‘namefile‘ is the source file and finally the output of the command is redirected to ‘newnamefile’ .

From the subsequent results of the head and tail commands, it can be seen that the string substitutions were successful. Note that sed doesn’t change the source file.
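The ‘namefile’ from the figure isn’t reproduced here, so the snippet below creates a small stand-in (contents invented) before running the same substitution:

```shell
printf 'ARJUN JAIN\nPRIYA JAIN\n' > namefile   # stand-in contents for the demo
sed 's/JAIN/KUMAR/' namefile > newnamefile     # substitute JAIN with KUMAR on each line
cat newnamefile   # ARJUN KUMAR, PRIYA KUMAR
cat namefile      # the source file still says JAIN
```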

Even though this command seems small, it’s really powerful!! There is actually a lot more to explore… but I guess it’s enough for now!!!


Fig. 12: Linux-Commands – 6

The sort command sorts the lines of text files (and, given several files, prints the concatenation of all of them in sorted order). Here ‘sortfile‘ contains the same names as our previous ‘namefile‘. The -r option prints the reverse sorted order. Here too the source file is not modified; the command just prints the result to standard output (you may redirect the output to a file).
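A stand-in ‘sortfile’ (contents made up for the demo) shows both orders:

```shell
printf 'banana\napple\ncherry\n' > sortfile
sort sortfile       # apple, banana, cherry
sort -r sortfile    # cherry, banana, apple
cat sortfile        # the original order is untouched
```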

Fig. 13: Linux-Commands – 7

The xargs command builds and executes command lines from standard input. It takes a (possibly huge) list of arguments from standard input, divides it into sub-lists of acceptable length, and runs the given command on each sub-list. It is a powerful command, commonly used in conjunction with find and grep.

A common use of xargs is to first find files and then look for a specific keyword in them using grep. Here we find all the files with the .txt extension and supply them to xargs, which invokes grep to search for the pattern “treasure” inside each one. Among all the text files in the current folder, ghi.txt and abc.txt contain the string “treasure“.
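The same pipeline can be reproduced with a couple of throwaway files (names and contents invented for the demo):

```shell
mkdir -p xargsdemo
echo "the treasure map" > xargsdemo/abc.txt
echo "nothing here"     > xargsdemo/def.txt
echo "buried treasure"  > xargsdemo/ghi.txt
find xargsdemo -name "*.txt" | xargs grep -l "treasure"   # -l lists only matching file names
```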

Fig. 14: Linux-Commands – 8

The AWK language is useful for manipulating data files, text retrieval and processing, and for prototyping and experimenting with algorithms. It is a utility/language designed for data extraction. It is similar to sed: it reads one line at a time, performs some action depending on the condition you give it, and outputs the result. One of the simplest and most popular uses of awk is selecting a column from a text file or another command’s output.

Now the scenario is as follows: we have modified our namefile to include an extra bit of information for each name, i.e. Male (M) / Female (F). Either F or M is appended to each line depending on the name. A sample snapshot of the file is as follows:-

Fig. 15: filename_modified

In the example we search for lines ending with the pattern ‘F‘ and print $2, i.e. the second column, onto the screen. As a result, this command prints the first name of all the females. In the pattern /F$/, the ‘$’ anchors the match to the end of the line, so it matches any line which ends with ‘F’.
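Since the modified namefile isn’t included here, below is a stand-in with an assumed layout (serial number, first name, surname, sex) and the corresponding awk one-liner:

```shell
printf '1 PRIYA SHARMA F\n2 ARJUN JAIN M\n3 ANITA RAO F\n' > namefile_mod
awk '/F$/ {print $2}' namefile_mod   # lines ending in F -> print the 2nd field
```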

Fig. 16: Linux-Commands – 9

The tar command is used to bundle a collection of files and directories into a single archive file, commonly called a tarball, optionally compressed with gzip or bzip2. It is an archiving utility; the name stands for tape archive.

To create a tar of a folder containing files and compress it, use tar with the -czvf options (the archive name must immediately follow f). Here we first provide ‘filename.tgz‘ as the output file name and then we also have to provide the source directory/folder path. Here ‘z‘ stands for gzip compression. For extracting, use tar with the -xzvf options along with the source file (.tgz) to be extracted as the argument. With the -C option you can specify the output directory to which the files are to be extracted. Refer Fig. 16 for more information.

Initially, testim-299 is a folder on the Desktop containing image files, as seen in the image. Finally, folderone contains the extracted images under the testim-299 directory.
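The full round trip looks like this (the directory and archive names below are illustrative):

```shell
mkdir -p testdir && echo "hello" > testdir/file1.txt
tar -czvf filename.tgz testdir        # create a gzip-compressed tarball
mkdir -p folderone
tar -xzvf filename.tgz -C folderone   # extract it into folderone/
cat folderone/testdir/file1.txt       # hello
```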

Fig. 17 Linux-Commands – 10

The alias command makes it possible to launch any command or group of commands (including options, arguments and redirection) by entering a pre-set string. That is, it allows a user to create simple names or abbreviations (even consisting of just a single character) for commands, regardless of how complex the original commands are, and then use them in the same way that ordinary commands are used.

Here we have created an alias under the name greet. It just greets the user with their name and the name of the host system, as shown above. $USER is an environment variable and $HOSTNAME is a shell variable (what’s the difference??).
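A sketch of such an alias (the greeting text is invented; note that in a script, alias expansion must first be enabled with shopt, whereas interactive shells have it on by default):

```shell
shopt -s expand_aliases   # needed in non-interactive shells/scripts
alias greet='echo "Hello $USER, you are on $HOSTNAME"'
greet                     # runs the echo above
```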

Fig. 18 Linux-Commands – 11

The chmod command changes the file mode bits or permissions of each given file according to mode, which can be either a symbolic representation of changes to make, or an octal number representing the bit pattern for the new mode bits.

permissions defines access rights for the user (u), group (g) and others (o): read (r), write (w) or execute (x).

eg: chmod u=rwx,g=rx,o=r myfile

In octal representation, 4 stands for “read”, 2 stands for “write”, 1 stands for “execute”, and 0 stands for “no permission”. So 7 is the combination of permissions 4+2+1 (read, write, and execute). Each octal digit thus encodes yes/no for r, w, x for one of the three classes.

In the example, initially all permissions are cleared [chmod 000]; finally all are granted [i.e. chmod 777].
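For example (the checks below use the shell’s own file-test operators; note that the root user bypasses permission bits):

```shell
touch myfile
chmod 000 myfile              # no permissions at all
chmod 777 myfile              # rwx for user, group and others (4+2+1 each)
chmod u=rwx,g=rx,o=r myfile   # symbolic form; the octal equivalent is 754
ls -l myfile                  # shows -rwxr-xr--
```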

Fig. 19 Linux-Commands – 12

A symbolic link, also termed a soft link, is a special kind of file that points to another file, much like a shortcut in Windows or a Macintosh alias. Unlike a hard link, a symbolic link does not contain the data in the target file. It simply points to another entry somewhere in the file system.

The command ln is used to make links between files. By default it creates hard links, which refer to the physical data on disk. Soft links, or symbolic links, on the other hand, have the ability to link to directories, or to files on remote computers networked through NFS.

The ln -s command is used to create soft links. In the example given, file2 links to file1, i.e. a soft link.
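For example:

```shell
echo "original data" > file1
ln -s file1 file2    # file2 is a symbolic link to file1
cat file2            # reads through the link: original data
ls -l file2          # shows: file2 -> file1
```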

The export command is used to set export attribute for shell variables.

It takes the form : export VAR=VALUE

It marks each VAR for automatic export to the environment of subsequently executed commands. If VALUE is supplied, assign VALUE before exporting.

In the example, we assign path of vi editor to newly created env-variable EDITOR.You may use ‘echo’ command to test the value.Now you may use this variable directly, instead of typing the long path-name every time.The command printenv can be used to print all environment variables.

You need to add export statements to ~/.bash_profile, ~/.profile, /etc/profile or ~/.bashrc (actually, there are slight differences between these). This will export the variables permanently.

Mostly we use the following command to add path variables.

export PATH=$PATH:/path/to/dir

or

add path to /etc/environment file as root for system wide directories.
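The effect of export is easy to see by spawning a child shell (the variable name MYVAR is made up):

```shell
MYVAR="hello"                        # a plain shell variable...
bash -c 'echo "child sees: $MYVAR"'  # ...is not visible in a child process
export MYVAR                         # mark it for export
bash -c 'echo "child sees: $MYVAR"'  # now the child inherits it
```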

Fig. 20 Linux-Commands – 13

Processes that have been stopped by an interrupt signal (Ctrl+Z) can be continued in the background with the bg command. With the & symbol you can run a command in the background from the start. The jobs command shows the active jobs in the shell with their status and job ID (and/or PID).

bg %id, where id is the job number (you can also use the PID or even the command name).

Here we have used the job number.

The bg %4 command changes the status of job number 4, i.e. /bin/sleep 3000, from stopped to running.

Similarly, kill %3 kills job number 3; it is then terminated.

Use fg %4 to bring job number 4 to the foreground (see example).

OMG…That’s enoughhhhhhhhhhhhhhhhh……

Ok, let’s save a few crackers for the after-party…

Here it goes ..some more commands to mess with…

set, free, uname, dmesg, pushd, popd, df, du, service, wget, dpkg, chroot, chown, chgrp, diff, lsof, tty, strace, iostat, ssh, make, nano, rename, telnet, exec, read, cal, cmp, watch, dir, eject, exit, last, size, tee, wall, fsck, scp, uptime, top, umask, curl, lynx, crontab etc…

Ok, let me tell you one  thing……YOU CAN’T MASTER ALL THE COMMANDS!!!

Now let’s move on to shell scripting… #!SHEBANG!!!!

Shellscripting:-

Let’s start with the basics..

The shell is a user program, or an environment provided for user interaction. It is a command language interpreter that executes commands read from the standard input device (keyboard) or from a file. The shell is not part of the system kernel, but uses the system kernel to execute programs, create files etc.

It is similar to CMD in Windows, and a shell-script file is similar to a batch file in Windows.

In Linux there are many shells available…

  • BASH (Bourne-Again SHell): developed by Brian Fox and Chet Ramey for the Free Software Foundation. The most common shell in Linux; it is free software.
  • CSH (C SHell): developed by Bill Joy at the University of California (for BSD). The C shell’s syntax and usage are very similar to the C programming language.
  • KSH (Korn SHell): developed by David Korn at AT&T Bell Labs.
  • TCSH: an enhanced but completely compatible version of the Berkeley UNIX C shell (CSH). See the man page (type $ man tcsh).

Instead of entering commands one by one (a sequence of ‘n’ commands), you can store the sequence of commands in a text file and tell the shell to execute that file instead of entering the commands. This is known as a shell script: a shell script is simply a series of commands written in a plain text file.

No more story this time..It’s already too long!!

Save a series of the commands we have seen so far in a file (the name should be sensible), set the required permissions and run the file in any of the following three ways:

  • bash your-script-name
  • sh your-script-name
  • ./your-script-name

For a shell script invoked as

$ myshell foo bar

myshell is the shell script name, foo is the first command line argument passed to it, and bar is the second.

In the shell, if we wish to refer to these command line arguments, we refer to them as follows:

  • myshell is $0
  • foo is $1
  • bar is $2
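A minimal script makes this concrete ($# additionally holds the argument count):

```shell
cat > myshell <<'EOF'
#!/bin/bash
echo "Script name : $0"
echo "First arg   : $1"
echo "Second arg  : $2"
echo "Arg count   : $#"
EOF
chmod +x myshell
./myshell foo bar
```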

Again there are  loops and conditionals…Advanced stuff!!!

I’m tired… Please refer to this link. It’s an all-in-one package, I promise!!!

By the way…What was that shebang thing??..#!SHEBANG

A shebang or a “bang” line is nothing but the absolute path to the script’s interpreter. It consists of a number sign and an exclamation point character (#!), followed by the full path to the interpreter, such as /bin/bash. Scripts under Linux execute using the interpreter specified on the first line.

Shellscripting Tutorial

Sample Code

clear
echo -e "\e[1;4mI LOVE SHELL SCRIPTING!!\n"
echo -e "WELCOME \033[31m$USER to $HOSTNAME"
echo -e "\033[7m Linux OS! Best OS!! \033[0m"
echo "Today is `date`"
val=`expr 6 + 3`
echo "six plus three is $val"
echo "What is your full name?"
read fname
echo "Hello $fname, my buddy!"
if [ "$fname" = "$USER" ]
then
for (( i = 0 ; i <= 5; i++ ))
do
echo "You are honest!!!"
done
else
echo "Not Cool Buddy..Be honest!!"
fi
echo -e "Bleed Blue... \e[96mOcean"
echo "You are $1 years old!!!"
fiveyearage=`expr $1 + 5`
echo $fiveyearage > fivefile.txt
echo "See fivefile.txt to see how old you will be after 5 years!!"
echo -e "Light's Out... \e[40mBlack"
echo -e "\033[1m SEE \033[2m YA \033[0m ..."

Output:-

Save the above code in a file shellscript.sh and run : ./shellscript.sh 25

When prompted for your full name, type your username (the value of $USER)… (Also try out other names!!!)

Fig. 21 Shellscripting

Now don’t get me wrong, shell scripting is not about ANSI terminal colours or making a calculator app!!! OK, THEN WHAT???

WELL, YOU BETTER FIND OUT!!!

One more try..

Say you have an HTML file or an .srt (subtitle) file.

You want to remove/delete some text between tags… Which commands do you think will help you?? Definitely sed and awk!!!

I’m not pasting the output screenshot this time!!

And as always..you know where to find  the codes…GITHUB

One last PIC…

Fig. 22 Linux File Extensions

Alright… It’s time…

Test your mettle!!! QUIZ!!

Here is the link to QUIZ (Warning: Advanced!!!)

NB:

There are different types of software, viz. free software, open source software, freeware, public domain software etc., which are similar to but distinct from each other. Another class includes shareware, crippleware etc. Finally we have retail and proprietary software. Check them out, if you are interested…

What’s the most successful company in open source history? Red Hat (RHT) and Canonical would probably top most people’s lists. By one measure, however, VA Linux is far and away the most explosively popular Linux company to ever exist (i.e. based on stock price… yet again, a weird old story!!).

What? Haven’t you heard of it? Check it out from the links… Also check out Cygnus Solutions, Netscape, Apache etc.!

Check out this movie: Revolution OS , which tells the inside story of the hackers who rebelled against Microsoft and created GNU/Linux and the Open Source movement.

Other films: The Code: documentary about GNU/Linux, featuring some of the most influential people of the free software (FOSS) movement.

Also see:The Imitation Game, WarGames ,We Steal Secrets: The Story of WikiLeaks, Social Network, Snowden etc.

Fig. 23: Who am i ???

Tux is a penguin character and the official mascot of the Linux kernel. Originally created as an entry to a Linux logo competition, Tux is the most commonly used icon for Linux, although different Linux distributions depict Tux in various styles.

Can anybody forget this?? Jurassic Park!!!

Fig. 24: Jurassic Park – Unix System

Well, it saved the day!! At least for them, I suppose!!!

Is Linux more reliable and secure than Windows? Why couldn’t the recent ransomware attack affect Linux systems (or why was it less likely to)?

How do free software developers make money? Can we sell free software at a price?

Check out: POSIX Standard, Awk programming language,GRUB, Cygwin,Zenity,  smb and samba etc.

Have you felt that the Linux UI is dull or dreary? Then check out Compiz, Cairo Dock etc. They provide an easy and fun-to-use windowing environment, using the graphics hardware to provide impressive effects, amazing speed and unrivalled usefulness.

If you are interested in magazines and journals related to FOSS and Linux, then check out some of these popular magazines:- Linux Journal, Linux Format, Linux User & Developer, Linux Magazine etc.

QUIZ/Interview Questions… 

Online Terminal/Shell

Some Books: Drive-Link

Is Android really just Linux?

Also, see UBports & Ubuntu Touch

Check out IRC (Internet Relay Chat), which is one of the main communication channels for open source projects. ( sevenshinestudios – my_channel )

Also see Diaspora, which is a nonprofit, user-owned, distributed social network that is based upon the free Diaspora software.

Finally, check out GitLab (CE), which is an open source end-to-end software development platform with built-in version control, issue tracking, code review, CI/CD, and more.

OpenStreetMap –  A collaborative project to create a free editable map of the world.

OpenStack – A free and open-source software platform for cloud computing, mostly deployed as infrastructure-as-a-service, whereby virtual servers and other resources are made available to customers.

Trac – an open source, Web-based project management and bug tracking system.

“Sharing is good, and with digital technology, sharing is easy.” – Richard Stallman

References:-

  1. https://www.gnu.org/education/education.html
  2. https://www.gnu.org/gnu/linux-and-gnu.en.html
  3. https://www.gnu.org/education/edu-system-india.html
  4. https://en.wikipedia.org/wiki/Free_and_open-source_software
  5. https://www.reddit.com/r/explainlikeimfive/comments/2yo0pq/eli5_the_difference_between_unix_linux_and_gnu/
  6. https://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar
  7. http://www.linfo.org/
  8. https://www.cyberciti.biz/faq/linux-unix-shell-export-command/
  9. http://linux-training.be/security/ch04.html#idp64836432
  10. https://superuser.com/questions/183870/difference-between-bashrc-and-bash-profile
  11. http://javarevisited.blogspot.in/2012/06/10-xargs-command-example-in-linux-unix.html#axzz4nrnavdm2
  12. https://www.computerhope.com/unix/uchmod.htm
  13. http://www.theunixschool.com/2012/05/awk-match-pattern-in-file-in-linux.html
  14. http://linuxcommand.org/
  15. https://linuxconfig.org/learning-linux-commands-awk
  16. http://www.learnlinux.org.za/courses/build/shell-scripting/ch01s04.html
  17. https://www.shellscript.sh/
  18. http://www.freeos.com/
  19. https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard
  20. https://wiki.linuxfoundation.org/lsb/fhs
  21. https://www.youtube.com/watch?v=JfCK8OqgmSY
  22. http://www.moreprocess.com/software/types-of-software-freeware-trialware-shareware-open-source-public-domain-retail-proprietary
  23. https://askubuntu.com/questions/156392/what-is-the-equivalent-of-an-exe-file
  24. https://www.gnu.org/philosophy/selling.en.html
  25. https://mintguide.org/other/476-why-linux-is-more-secure-than-other-operating-systems.html
  26. https://www.gnu.org/software/reliability.en.html
  27. http://www.linuxandubuntu.com/home/here-is-what-you-need-to-know-about-wanncry-ransomware-thats-infecting-windows-systems-globally
  28. http://www.imdb.com/list/ls055167700/
  29. https://nixtricks.wordpress.com/
  30. http://www.grymoire.com/Unix/
  31. http://www.dreamsyssoft.com/unix-shell-scripting/
  32. http://thevarguy.com/open-source-application-software-companies/091415/open-source-history-spectacular-rise-and-fall-va-linux
  33. https://diasporafoundation.org/
  34. https://www.youtube.com/watch?v=Ltq0hMgizi0

[ Date: 27th July ’17 ]

15. Numpy and Friends: Intro to scientific computing in python

NumPy (Numerical Python) is an open source library for scientific computing in Python. It provides support for creating and processing multi-dimensional arrays with the help of a large collection of high-level mathematical functions, tools and other specialized objects. The main objective of NumPy is to provide efficient operations on arrays of homogeneous data.

The NumPy array data structure provides some benefits over Python lists: it is more compact, uses less memory, offers faster reading and writing of items, and is generally more convenient and efficient.
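The compactness claim is easy to check directly (the exact byte counts are platform-dependent; the comparison below assumes 64-bit integers):

```python
import sys
import numpy as np

# Compare the memory footprint of a Python list and a NumPy array
# holding the same 1000 integers.
lst = list(range(1000))
arr = np.arange(1000, dtype=np.int64)

# A list stores pointers to boxed int objects; the ndarray stores
# raw 8-byte integers contiguously.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)
array_bytes = arr.nbytes          # 1000 elements * 8 bytes each = 8000

print(list_bytes, array_bytes)
```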

Other features include:-

  • It has sophisticated broadcasting functions.
  • It includes tools for integrating C/C++ and Fortran code.
  • It comes with linear algebra, Fourier transform, and random number capabilities.
  • It can be used as an efficient multi-dimensional container of generic data.
  • It can be seamlessly and speedily integrated with a wide variety of databases.

NumPy, in combination with other scientific computing libraries like SciPy, Matplotlib, scikit-learn, Pandas etc., can be used as a replacement for MATLAB.

Fig. 1: Scientific computing in python

Note: If you don’t have NumPy installed already, you may install it using pip or the system’s package manager. You can also build it from source.

Hold on tight, it’s gonna be a bumpy ride!!!

NumPy:-

First, let’s start with the basics of NumPy…

It is highly recommended that you have a basic knowledge of Python lists, dictionaries and functions, as well as the basics of linear algebra.

Creating a numpy array:-

NumPy’s array class ‘ndarray‘ is used to create a homogeneous multidimensional array object: a grid of elements of the same type, indexed by a tuple of non-negative integers. The ndim attribute of ndarray gives the number of dimensions (axes), or rank, of the array. Similarly, ndarray.shape gives the size of the array along each axis, and the dtype attribute gives the data type of the array’s elements.

You can create a NumPy array from a Python list or tuple. NumPy also provides certain functions to create and directly initialise arrays implicitly. Another option is to create an array from an explicitly defined function.

>>> import numpy as np

>>> a=np.array([1,2,3,4,5])	# Create a rank 1 array from list
>>> print(a)
[1 2 3 4 5]

>>> print(type(a))	# Print type of array
<class 'numpy.ndarray'>
>>> print(a.ndim)	# Print number of dimensions
1
>>> print(a.shape)	# Print the array shape
(5,)
>>> print(a.dtype)	# Print the array data-type 
int64

>>> b=np.arange(9)	# Create an array from a range (0-8)
>>> print(b)
[0 1 2 3 4 5 6 7 8]

>>> c=np.zeros((3,3))	# Create a zero matrix
>>> print(c)
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]

>>> d=np.ones((2,2))	# Create a matrix of ones
>>> print(d)
[[1. 1.]
 [1. 1.]]

>>> e=np.eye(4)		# Create an identity matrix
>>> print(e)
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

>>> f=np.random.random((3,3))	# Create an array of random numbers
>>> print(f)
[[0.92626929 0.48004621 0.67337471]
 [0.30097446 0.45259595 0.61926968]
 [0.09580008 0.65535226 0.41688474]]

>>> g=np.linspace(0,99,10)	# An array with 10 elements evenly spaced between 0-99
>>> print(g)
[ 0. 11. 22. 33. 44. 55. 66. 77. 88. 99.]

>>> h=np.full((3,3),5)	# Create a matrix full of a particular number (5)
>>> print(h)
[[5 5 5]
 [5 5 5]
 [5 5 5]]

>>> def fun(x,y):		 
...     return x+y
... 

>>> s=np.fromfunction(fun,(4,4),dtype=int)	# Create an array from a function
>>> print(s)
[[0 1 2 3]
 [1 2 3 4]
 [2 3 4 5]
 [3 4 5 6]]

Accessing and printing numpy array:-

NumPy array elements can be accessed using square brackets, and can be modified or printed in the same way as with normal lists or tuples. Iterating over a multidimensional array is done with respect to the first axis. Iteration over every element can be done with the help of the ‘nditer’ function, as shown below.

>>> import numpy as np

>>> a=np.array([ [1,2,3], [4,5,6], [7,8,9] ])	# Create a 2-D array
>>> print(a)
[[1 2 3]
 [4 5 6]
 [7 8 9]]

>>> print(a[0])		# Print the first row
[1 2 3]
>>> print(a[0,2])	# Print an element using array index
3

>>> a[1,1]=100		# Modifying an array element

>>> print(a)
[[  1   2   3]
 [  4 100   6]
 [  7   8   9]]

>>> for row in a:	# Iterating over rows
...     print(row)
... 
[1 2 3]
[  4 100   6]
[7 8 9]

>>> for x in np.nditer(a):	# Iterating over elements
...     print(x)
... 
1
2
3
4
100
6
7
8
9

Transforming the numpy array:-

A NumPy array can be reshaped using the ‘reshape‘ function; the reshaped array must have the same number of elements as the source array. The transpose is obtained via the ‘T‘ attribute: if a is an (m,n) array, a.T gives an (n,m) array with its rows and columns interchanged. The flatten function returns a copy of the array collapsed into one dimension.

>>> import numpy as np

>>> a=np.arange(12)

>>> b=a.reshape(3,4)	# Reshape the numpy array
>>> print(b)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

>>> c=b.T		# Perform the transpose operation
>>> print(c)
[[ 0  4  8]
 [ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]]

>>> d=c.flatten()	# A copy of the array collapsed into one dimension
>>> print(d)
[ 0  4  8  1  5  9  2  6 10  3  7 11]

Indexing and slicing:-

Indexing and slicing a one-dimensional NumPy array is similar to indexing and slicing normal lists. For a multi-dimensional array you may have to provide a slice for each dimension; slicing in any case yields a sub-array. Mixing integer indexing with slicing results in an array of lower rank. Another way to index an array is integer array indexing, where we use index data from another array to explicitly select elements (for instance from ordered pairs of row and column indices); in this case the resulting array need not be a sub-array of the original. Finally, with boolean indexing we select the elements of an array that satisfy some explicit condition: the boolean array so formed can be used to index the original array.

>>> import numpy as np

>>> a=np.arange(16)
>>> print(a)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]

>>> print(a[0:15:4])		# Print every 4th element of the slice 0:15
[ 0  4  8 12]

>>> b=a.reshape(4,4)

>>> print(b)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

>>> print( b[1:3,1:3] )		# Print the subarray using slicing
[[ 5  6]
 [ 9 10]]

>>> print(b[2,:3])		# Mixing slicing and integer indexing
[ 8  9 10]

>>> print( b[ [0,1,2,3], [0,1,2,3] ])	# Integer array indexing
[ 0  5 10 15]

>>> c=b>8
>>> print(c)
[[False False False False]
 [False False False False]
 [False  True  True  True]
 [ True  True  True  True]]
>>> print(b[c])			# Boolean indexing
[ 9 10 11 12 13 14 15]

Array math:-

The basic mathematical operations operate element-wise on arrays and are available both as functions and as operators. NumPy also has familiar mathematical functions such as sin, cos, exp etc.; these ‘universal functions’ also operate element-wise on arrays. Furthermore, standard statistical functions like sum, mean, median, standard deviation, variance etc. are also available in the form of simple function calls.

>>> a=np.array([ [1,2,3],[4,5,6], [7,8,9] ])
>>> b=np.array([ [2,4,6],[1,3,5], [9,7,1] ])

>>> a
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> b
array([[2, 4, 6],
       [1, 3, 5],
       [9, 7, 1]])

>>> print(np.add(a,b))	# Elementwise sum
[[ 3  6  9]
 [ 5  8 11]
 [16 15 10]]

>>> print(np.subtract(b,a))	# Elementwise difference
[[ 1  2  3]
 [-3 -2 -1]
 [ 2 -1 -8]]

>>> print(np.multiply(b,a))	# Elementwise multiplication
[[ 2  8 18]
 [ 4 15 30]
 [63 56  9]]

>>> print(np.divide(b,a))	# Elementwise division
[[2.         2.         2.        ]
 [0.25       0.6        0.83333333]
 [1.28571429 0.875      0.11111111]]

>>> print(np.dot(a,b))	# Matrix product
[[ 31  31  19]
 [ 67  73  55]
 [103 115  91]]

>>> c=np.array([[2,4],[6,8]])

>>> print(c.min())	# Find minimum value
2
>>> print(c.min(axis=0))
[2 4]
>>> print(c.max(axis=1))
[4 8]

>>> print(np.sqrt(c))	# Elementwise square-root
[[1.41421356 2.        ]
 [2.44948974 2.82842712]]

>>> print(np.sum(c))	# Compute sum of elements
20

>>> print(np.exp(c))	# Elementwise apply exponential function
[[   7.3890561    54.59815003]
 [ 403.42879349 2980.95798704]]

>>> a=np.array([ [20,30,40], [50, 60,70], [80,90,100] ])

>>> a
array([[ 20,  30,  40],
       [ 50,  60,  70],
       [ 80,  90, 100]])

>>> a**3	# Elementwise exponentiation
array([[   8000,   27000,   64000],
       [ 125000,  216000,  343000],
       [ 512000,  729000, 1000000]])

>>> a*4		# Calculate scalar product
array([[ 80, 120, 160],
       [200, 240, 280],
       [320, 360, 400]])

>>> np.sin(a)		# Elementwise sine operation
array([[ 0.91294525, -0.98803162,  0.74511316],
       [-0.26237485, -0.30481062,  0.77389068],
       [-0.99388865,  0.89399666, -0.50636564]])


>>> d=np.array([1,2,3,4,5,6,7,8,9])

>>> print(np.mean(d))		# Compute mean
5.0
>>> print(np.median(d))		# Compute median
5.0
>>> print(np.std(d))		# Compute standard deviation
2.581988897471611
>>> print(np.var(d))		# Compute variance
6.666666666666667

Broadcasting:-

Broadcasting allows us to perform arithmetic operations on arrays of different shapes. It is especially useful when we need to perform an operation between a large array and a smaller one, applying the smaller array repeatedly across the dimensions of the larger array. The universal functions work well with broadcasting.

The rules for broadcasting are briefly given below:-

  • If the arrays do not have the same rank, prepend the shape of the lower rank array with 1s until both shapes have the same length.
  • The two arrays are said to be compatible in a dimension if they have the same size in the dimension, or if one of the arrays has size 1 in that dimension.
  • The arrays can be broadcast together if they are compatible in all dimensions.
  • After broadcasting, each array behaves as if it had shape equal to the element-wise maximum of shapes of the two input arrays.
  • In any dimension where one array had size 1 and the other array had size greater than 1, the first array behaves as if it were copied along that dimension.

The simplest broadcasting example occurs when an array and a scalar value are combined in an operation: the scalar b is stretched to become an array with the same shape as a, so the shapes are compatible for element-by-element multiplication. An example is shown below.

[Figure: broadcasting a scalar across an array]

A two-dimensional array combined with a one-dimensional array results in broadcasting when the number of 1-D array elements matches the number of 2-D array columns. As shown in the figure below, the broadcasting works as if b were copied multiple times to match the number of rows in a.

[Figure: broadcasting a 1-D array across the rows of a 2-D array]

In the case shown below, where both arrays are one-dimensional, the newaxis index operator inserts a new axis into a, and both arrays are stretched to form an output array larger than either of the initial arrays.

[Figure: broadcasting two 1-D arrays against each other via newaxis]

So, here is the code …

>>> import numpy as np
>>> from numpy import newaxis

>>> a=np.array([1,2,3])
>>> b=2
>>> a*b				# b is stretched to [2,2,2]
array([2, 4, 6])

>>> a=np.array([ [0,0,0], [10,10,10], [20,20,20], [30,30,30] ])
>>> b=np.array([1,2,3])
>>> a
array([[ 0,  0,  0],
       [10, 10, 10],
       [20, 20, 20],
       [30, 30, 30]])
>>> b
array([1, 2, 3])
>>> a+b				# b is stretched to match rows of a
array([[ 1,  2,  3],
       [11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

>>> a=np.array([0,10,20,30])
>>> b=np.array([1,2,3]) 
>>> a
array([ 0, 10, 20, 30])
>>> b
array([1, 2, 3])
>>> a[:, newaxis] + b		# a (column-wise) and b (row-wise) are stretched 
array([[ 1,  2,  3],
       [11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

Note: the stretching analogy is only conceptual. NumPy is smart enough to use the original scalar value without actually making copies, so broadcasting operations are as memory- and computationally efficient as possible.

Combining and splitting arrays:-

Arrays can be stacked horizontally or vertically using the hstack and vstack functions respectively, resulting in a bigger combined array. We can similarly split an array along either axis using the hsplit and vsplit functions.

>>> import numpy as np

>>> a=np.floor(10*np.random.random((3,3)))
>>> b=np.floor(10*np.random.random((3,3)))

>>> a
array([[6., 5., 7.],
       [5., 3., 8.],
       [6., 0., 9.]])
>>> b
array([[2., 4., 3.],
       [2., 8., 3.],
       [8., 4., 5.]])

>>> np.vstack((a,b))	# stack b vertically over a
array([[6., 5., 7.],
       [5., 3., 8.],
       [6., 0., 9.],
       [2., 4., 3.],
       [2., 8., 3.],
       [8., 4., 5.]])
>>> np.hstack((a,b))	# stack b horizontally aside a
array([[6., 5., 7., 2., 4., 3.],
       [5., 3., 8., 2., 8., 3.],
       [6., 0., 9., 8., 4., 5.]])

>>> np.hsplit(a,3)	# split a column-wise into 3 equal parts
[array([[6.],
       [5.],
       [6.]]), array([[5.],
       [3.],
       [0.]]), array([[7.],
       [8.],
       [9.]])]
>>> np.vsplit(a,3)	# split a row-wise into 3 equal parts
[array([[6., 5., 7.]]), array([[5., 3., 8.]]), array([[6., 0., 9.]])]
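hsplit and vsplit require the split to be even. For unequal pieces numpy also offers array_split, a detail not covered above; here is a small sketch:

```python
import numpy as np

a = np.arange(10)

# np.array_split tolerates sizes that do not divide evenly:
# 10 elements into 3 parts gives pieces of sizes 4, 3, 3
parts = np.array_split(a, 3)
for p in parts:
    print(p)
# [0 1 2 3]  [4 5 6]  [7 8 9]
```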

Copies and views:-

There are three cases to distinguish based on memory usage. A plain assignment copies nothing: both names refer to the same data in memory. A shallow copy, or view, creates a new array object that looks at the same data, i.e. the two arrays share their data. Finally, a deep copy makes a complete copy of the object, and no data is shared.

Here is an example for the same …

>>> import numpy as np

>>> a=np.arange(9).reshape(3,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> b=a		# no new object is created
>>> id(a)
139959296354624
>>> id(b)	# a & b are names referring to the same object
139959296354624


>>> a=np.arange(12).reshape(3,4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> b=a.view()	# b is a view of data owned by a
>>> id(a)
139959296354544
>>> id(b)
139959296354784
>>> a.reshape(4,3)	# returns a reshaped view; a itself is unchanged
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
>>> a.shape
(3, 4)
>>> a.shape=4,3	 # change shape of a
>>> a.shape
(4, 3)
>>> b.shape	# b's shape doesn't change
(3, 4)
>>> a
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
>>> b
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[0,0]=12	# modify a's data
>>> a
array([[12,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
>>> b		# b also changes
array([[12,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

>>> c=b.copy()	# c is a new ndarray object
>>> id(b)
139959296354784
>>> id(c)
139959296354624
>>> b[0,0]=100	# modify b
>>> b
array([[100,   1,   2,   3],
       [  4,   5,   6,   7],
       [  8,   9,  10,  11]])
>>> c		# c doesn't change
array([[12,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Searching and Sorting:-

The sort function returns a sorted copy of an input array. We can also sort the elements along a particular axis by mentioning the axis explicitly. The where function helps us search for elements and returns the indices of the elements satisfying a particular condition.

>>> import numpy as np

>>> a=np.array([3,2,5,1,6,9,7])

>>> np.sort(a)		# sort the whole array
array([1, 2, 3, 5, 6, 7, 9])

>>> b=np.array([[3,2,1], [6,7,4], [9,8,0] ])
>>> b
array([[3, 2, 1],
       [6, 7, 4],
       [9, 8, 0]])

>>> np.sort(b,axis=0)	# sort the array along vertical axis
array([[3, 2, 0],
       [6, 7, 1],
       [9, 8, 4]])
>>> np.sort(b,axis=1)	# sort the array along horizontal axis
array([[1, 2, 3],
       [4, 6, 7],
       [0, 8, 9]])


>>> x=np.arange(12).reshape(3,4)
>>> x
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

>>> y=np.where(x>6)	# search for elements satisfying a condition (>6)
>>> print(y)
(array([1, 2, 2, 2, 2]), array([3, 0, 1, 2, 3]))
>>> print(x[y])
[ 7  8  9 10 11]

>>> y=np.where(x==6)	# search for a particular element (6)
>>> y
(array([1]), array([2]))
>>> print(x[y])
[6]
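A close relative of sort worth sketching here is argsort, which returns the indices that would sort the array; it is handy for reordering one array by another (the sample names below are made up):

```python
import numpy as np

marks = np.array([55, 90, 70])
names = np.array(['amy', 'bob', 'cid'])

order = np.argsort(marks)     # indices that sort marks ascending
print(order)                  # [0 2 1]
print(names[order])           # ['amy' 'cid' 'bob']

# searchsorted finds where a value would be inserted in a sorted array
print(np.searchsorted(np.sort(marks), 60))   # 1
```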

Linear algebra:-

Numpy provides many built-in functions for linear algebra through the ‘numpy.linalg‘ module (scipy.linalg offers a superset). For example, there are built-in functions for finding the determinant, inverse, rank, eigenvalues etc. There is also a ‘solve‘ function for solving a linear matrix equation, or a system of linear scalar equations.

>>> import numpy as np
>>> from numpy.linalg import matrix_rank

>>> a=np.array([ [1,2,3], [0,1,4], [5,6,0]])

>>> b=np.linalg.inv(a) 		# find inverse of a matrix
>>> print(b)
[[-24.  18.   5.]
 [ 20. -15.  -4.]
 [ -5.   4.   1.]]


>>> a=np.array([ [1,1,3], [1,3,-3], [-2,-4,-4]])

>>> a
array([[ 1,  1,  3],
       [ 1,  3, -3],
       [-2, -4, -4]])

>>> print(np.trace(a))		# compute trace of a matrix
0

>>> print(a.transpose())	# find transpose of a matrix
[[ 1  1 -2]
 [ 1  3 -4]
 [ 3 -3 -4]]

>>> print(np.floor(np.linalg.det(a)))	# find determinant of a  square matrix
-8.0

>>> x=[1,2,0]
>>> y=[4,5,6]
>>> print(np.cross(x,y))	# find cross product of vectors
[12 -6 -3]

>>> c=np.array([ [1,2,3], [1,4,2], [2,6,5] ])
>>> c
array([[1, 2, 3],
       [1, 4, 2],
       [2, 6, 5]])
>>> matrix_rank(c)		# find rank of a matrix
2

>>> d=np.array([ [8, -6, 2], [-6, 7, -4], [2, -4, 3] ])
>>> d
array([[ 8, -6,  2],
       [-6,  7, -4],
       [ 2, -4,  3]])

>>> eig_val,eig_vec=np.linalg.eig(d) # compute eigen values and eigen vectors
>>> print(eig_val)
[1.50000000e+01 3.00000000e+00 8.95419849e-16]
>>> print(np.floor(eig_val))		# print eigen values
[15.  3.  0.]
>>> print(eig_vec)			# print eigen vectors (normalized)
[[-0.66666667  0.66666667  0.33333333]
 [ 0.66666667  0.33333333  0.66666667]
 [-0.33333333 -0.66666667  0.66666667]]

>>> a=np.array([ [4, -2, 3], [1, 3, -4], [3, 1, 2] ])
>>> b=np.array([1, -7, 5])
>>> x=np.linalg.solve(a,b)		# solve linear matrix equation
>>> x
array([-1.,  2.,  3.])
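A quick sanity check on the solve result is to substitute it back into the system; np.allclose compares within floating-point tolerance. A sketch reusing the same system:

```python
import numpy as np

a = np.array([[4, -2, 3], [1, 3, -4], [3, 1, 2]])
b = np.array([1, -7, 5])

x = np.linalg.solve(a, b)       # solve a.x = b
print(x)                        # [-1.  2.  3.]
print(np.allclose(a @ x, b))    # True: a.x reproduces b
```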

Some interesting functions:-

  1. The functions ‘scipy.io.loadmat‘ and ‘scipy.io.savemat‘ allow you to read and write MATLAB files.
  2. The storage and retrieval of array data in simple text file format is done with ‘savetxt()‘ and ‘loadtxt()‘ functions.
  3. The ‘numpy.save()‘ file stores the input array in a disk file with ‘npy’ extension and ‘numpy.load()‘ is used to reconstruct the array.
  4. The functions ‘numpy.char.encode‘ and ‘numpy.char.decode‘ can be used for encoding and decoding literals using specified codecs.
  5. The ix_ function can be used to combine different vectors so as to obtain the result for each n-uplet.
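As a hedged illustration of items 2 and 3 above, here is a round trip through savetxt/loadtxt and save/load (the temporary file names are arbitrary):

```python
import numpy as np
import tempfile, os

a = np.arange(6).reshape(2, 3)
tmp = tempfile.mkdtemp()

# text round trip: loadtxt returns floats by default
txt = os.path.join(tmp, 'a.txt')
np.savetxt(txt, a)
print(np.loadtxt(txt))

# binary round trip: the dtype is preserved in the .npy file
npy = os.path.join(tmp, 'a.npy')
np.save(npy, a)
print(np.load(npy))
```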

Numpy: Speed and Memory:-

One of the main advantages of numpy over python lists is its speed. Let’s compare the performance of numpy with lists using an example. Here is the code for element-wise array addition, using numpy as well as lists.

>>> import numpy as np

>>> def numpysum(n):	# sum of arrays using numpy
...     a = np.arange(n) ** 2
...     b = np.arange(n) ** 3
...     return a + b
... 
>>> def pythonsum(n):	# sum of arrays using lists
...     a = [i ** 2 for i in range(n)]
...     b = [i ** 3 for i in range(n)]
...     return [a[i] + b[i] for i in range(n)]
... 

Here we will use the ‘timeit‘ module in python to compare the performance of numpy with plain lists.

>>> import timeit

>>> timeit.timeit(stmt='pythonsum(10)', setup='from __main__ import pythonsum')
4.67477138600043

>>> timeit.timeit(stmt='numpysum(10)', setup='from __main__ import numpysum')
2.2512439270003597

>>> timeit.timeit(stmt='pythonsum(100)', setup='from __main__ import pythonsum') 
37.98177698499967

>>> timeit.timeit(stmt='numpysum(100)', setup='from __main__ import numpysum')
2.459720329999982

>>> timeit.timeit(stmt='pythonsum(1000)', setup='from __main__ import pythonsum')
388.303122798

>>> timeit.timeit(stmt='numpysum(1000)', setup='from __main__ import numpysum')
6.1212818439998955

>>> timeit.timeit(stmt='pythonsum(10000)', setup='from __main__ import pythonsum')
4194.793170344

>>> timeit.timeit(stmt='numpysum(10000)', setup='from __main__ import numpysum') 
33.7726019110014

As you can observe, the performance gain in terms of speed is almost 100x for larger input sizes.

perfcomb
Fig. 2: Execution Time : Numpy vs. List

Note: The time shown is not for a single execution; the return value is seconds as a float. It is the total time taken to run the test (not counting the setup), so the average time per run is that number divided by the number argument, which defaults to one million.
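To control the run count instead of relying on the default one million, the number argument can be passed explicitly and the total divided by it; a small sketch:

```python
import timeit

# timeit returns the TOTAL seconds for `number` runs of the statement
total = timeit.timeit('sum(range(100))', number=1000)
per_call = total / 1000      # average seconds per call
print(per_call)
```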

Another advantage of numpy is lower memory consumption. The size of a Python list consists of the general list information, the size needed for the references to the elements, and the size of the element objects themselves. If we apply sys.getsizeof to a list, we get only the size without the elements. In numpy, by contrast, the size of the array consists only of the general array information and the data of all the ndarray elements.

lst
Fig. 3: List – memory representation
numpy
Fig. 4: Numpy – memory representation

In the example given below, the size of an empty list is 64 and that of an empty numpy array is 96. Adding an element makes a difference of 8 in both cases (it depends on the element's type). The additional memory required in the case of lists is used up by the references to the element objects.

In short, an arbitrary integer array of length “n” in numpy needs

96 + n * 8 Bytes

whereas a list of integers need

64 + 8 * len(lst) + len(lst) * 28 Bytes

(Again, the value 28 depends on the data type)

>>> import numpy as np, sys

>>> a=[]
>>> sys.getsizeof(a) 	# size of empty list
64

>>> a=[1]
>>> sys.getsizeof(a)
72

>>> b=np.array([])	
>>> sys.getsizeof(b)	# size of empty ndarray
96
>>> b
array([], dtype=float64)

>>> b=np.array([1])
>>> sys.getsizeof(b)
104
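The two formulas can be checked directly with sys.getsizeof. The constants (64, 96, 28) depend on the Python/numpy version and platform, so treat this as a sketch; the 8-byte-per-element growth, however, should hold on a 64-bit build:

```python
import sys
import numpy as np

# ndarray: header + raw element data (8 bytes per int64 element)
empty = sys.getsizeof(np.arange(0, dtype=np.int64))
ten   = sys.getsizeof(np.arange(10, dtype=np.int64))
print(ten - empty)            # 80: ten int64 elements at 8 bytes each

# list: header + one 8-byte reference per element
# (the ~28-byte int objects themselves are NOT counted by getsizeof)
print(sys.getsizeof(list(range(10))) - sys.getsizeof([]))   # 80
```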

Histogram:-

A histogram provides a graphical representation of the frequency distribution of data. The numpy histogram function applied to an array returns a pair of vectors: the histogram of the array and the vector of bin edges. A bin represents the interval into which the data is placed. The histogram counts the occurrences of input data that fall within each bin (interval). This count is proportional to the height of the rectangular bars (assuming equal bin width), which represent the frequency distribution of the data in that range. We can plot the graph with numpy and also using the matplotlib module.

>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> vec=[2,0,1,2,0,2]	# array vector
>>> bins=[0,1,2,3]	# bins

>>> plt.hist(vec,bins)	# plot the histogram (Fig. 5)
(array([2., 1., 3.]), array([0, 1, 2, 3]), <a list of 3 Patch objects>)
>>> plt.show()
		
>>> (hist, bins)=np.histogram(np.random.randint(1, 51, 500),np.arange(51))	# compute the histogram

>>> hist	# histogram array
array([ 0,  7,  7, 15,  8, 11, 13, 14, 13, 13, 12,  4, 16,  6, 10, 12, 14,
       10, 15, 10, 11,  9,  3, 10,  9,  9, 12, 11, 11,  9,  7,  9, 16, 10,
        9, 12,  8,  9,  6,  9,  8, 10, 13, 13,  8,  8,  8,  8,  8, 17])
>>> bins	# bin edges
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])

>>> plt.plot(bins[:-1], hist)	# plot the histogram (Fig. 6)
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.show()
hist1
Fig. 5: Histogram
hist2
Fig. 6: Histogram

Matplotlib:-

Matplotlib is a plotting library in python. It can be used along with numpy to provide features similar to those of MATLAB. Here are some basic methods for plotting functions within a co-ordinate system.

To plot a graph we have to provide the input for X and Y. We can give them using arrays directly or with the help of functions. We can additionally add a title, labels, legend etc. to the plot. Here is a plot of a linear function: y = 3x+4.

>>> import numpy as np
>>> from matplotlib import pyplot as plt

>>> x=np.arange(0,10)		# x values
>>> y=3*x + 4 			# y = f(x)

>>> plt.title("Plot: y=3x+4")	# title
Text(0.5,1,'Plot: y=3x+4')

>>> plt.xlabel("X axis")	# x-axis label
Text(0.5,0,'X axis')
>>> plt.ylabel("Y axis")	# y-axis label
Text(0,0.5,'Y axis')

>>> plt.plot(x,y)		# plot the graph
[<matplotlib.lines.Line2D object at 0x...>]

>>> plt.show()			# display the plot
plot1
Fig. 7: Plot of a linear function

The following example illustrates plotting several lines with different format styles in one command using arrays. Here we can change the shape and colour properties of the points or lines. It is a plot of three functions, viz. x, x^2 and x^3, in a single graph.

>>> import numpy as np
>>> from matplotlib import pyplot as plt

>>> x=np.arange(0.,5.,0.2)	# X values

>>> plt.plot(x,x,'r--',x,x**2,'bs',x,x**3,'g^')	# plot the functions

>>> plt.show()			# display the plot
plot2
Fig. 8: Plot of x, x^2 & x^3

We can also plot two different functions side by side using subplot functions.
Here is a plot of sine vs. cosine using matplotlib. It is very useful for comparing the graphs of the functions.

>>> import numpy as np
>>> from matplotlib import pyplot as plt

>>> x=np.arange(-np.pi,np.pi,0.1)	# set the function domain

>>> y_sin=np.sin(x)			# sine function
>>> y_cos=np.cos(x)			# cosine function
>>> 
>>> plt.subplot(2,1,1)			# make a subplot grid with height 2 and width 1 & set the first subplot as active 


>>> plt.plot(x,y_sin)			# plot first subplot
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.title('sine')
Text(0.5,1,'sine')

>>> plt.subplot(2,1,2)			# set second subplot as active


>>> plt.plot(x,y_cos,'r')		# plot second subplot
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.title('cosine')
Text(0.5,1,'cosine')

>>> plt.show()		# display the plot
plt4
Fig. 9: Plot of sine vs. cosine

Finally, here is another plot of three popular functions: sigmoid, tanh and relu.
We can use different colours for each function and even add a legend to the graph to easily differentiate the lines.

The formulas for the functions are as follows:

  • sigmoid: y = 1/(1+e^(-x))
  • relu : y = max(0,x)
  • tanh : y = ( e^x - e^-x )/( e^x + e^-x )
>>> import numpy as np
>>> from matplotlib import pyplot as plt
>>> 
>>> x=np.arange(-5,5,0.1)	# set the domain 
>>> 
>>> ax=plt.subplot(111)		# create a subplot
>>> 
>>> ax.plot(x,1/(1+np.exp(-x) ), label='sigmoid')	# plot sigmoid function	
[<matplotlib.lines.Line2D object at 0x...>]

>>> ax.plot(x,np.maximum(0,x),'r',label='relu')		# plot relu function
[<matplotlib.lines.Line2D object at 0x...>]

>>> ax.plot(x,np.tanh(x),'g',label='tanh')		# plot tanh function
[<matplotlib.lines.Line2D object at 0x...>]

>>> ax.legend()			# add legend to the graph


>>> plt.show()			# display the plot
plot5
Fig. 10: Plot of sigmoid, relu & tanh

Configuring the figure settings:-

We can modify the default figure settings and set properties like line colour, line width, line style etc. Additionally we can set the ticks along the x and y axes along with their limits, configure the figure size and assign tick labels. Here is an example of a plot of the sine and cosine functions.

import numpy as np
import matplotlib.pyplot as plt

# Create a figure of size 10x6 inches, 80 dots per inch
plt.figure(figsize=(10, 6), dpi=80)

# Create a new subplot from a grid of 1x1
plt.subplot(1, 1, 1)

# Set X and Y axes values for sin and cosine functions
X = np.linspace(-np.pi, np.pi, 256, endpoint=True)
C, S = np.cos(X), np.sin(X)


# Plot cosine with a blue continuous line of width 2 (pixels) and add a label
plt.plot(X, C, color="blue", linewidth=2, linestyle="-", label="cosine")

# Plot sine with a red continuous line of width 2 (pixels) and add a label
plt.plot(X, S, color="red", linewidth=2 , linestyle="-", label="sine")

# Set x limits
plt.xlim(X.min() * 1.1, X.max() * 1.1)

# Set x ticks
plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])

# Set x tick label
plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi],
          [r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$'])

# Set y limits
plt.ylim(C.min() * 1.1, C.max() * 1.1)

# Set y ticks
plt.yticks([-1, 0, +1])

# Set y tick label
plt.yticks([-1, 0, +1],
          [r'$-1$', r'$0$', r'$+1$'])

# Add a legend
plt.legend(loc='upper left')

# Show result on screen
plt.show()

Here is the plot output …

newplot
Fig. 11: A customized plot of sine and cosine curves

Bar chart:-

The pyplot module also provides a function bar() to plot bar graphs. Here is a short demo of plotting a bar chart using matplotlib module.

from matplotlib import pyplot as plt
 
x = [5,8,10] 	# input data values
y = [12,16,6]  

x2 = [6,9,11] 	# input data values
y2 = [6,15,7]

 
b1=plt.bar(x, y, color='r',align = 'center',label='B1') 	# plot the data in bar chart
b2=plt.bar(x2, y2, color = 'y', align = 'center', label='B2') 	# plot the data in bar chart

plt.title('Bar graph') 	# set title

plt.ylabel('Y axis') 	# set y-axis label
plt.xlabel('X axis')  	# set x-axis label

plt.legend()		# add legend

plt.show()		# display the plot
bar
Fig. 12: Bar chart

Pie chart:-

The pyplot module also supports a function pie() similar to barchart, to draw a pie chart.
We have to provide input sizes and labels to draw a pie chart. Further, we can give custom colours and explode slices.

import matplotlib.pyplot as plt
 

labels = 'Python', 'C++', 'Ruby', 'Java'	# set data label
sizes = [215, 130, 245, 210]			# set data size

colors = ['red', 'green', 'blue', 'yellow']	# mention slice colours
explode = (0.1, 0, 0, 0)  			# mention slices to explode
 
plt.pie(sizes, explode=explode, labels=labels, colors=colors,	# configure & plot the pie chart
        autopct='%1.1f%%', shadow=True, startangle=120)
 
plt.axis('equal')
plt.show()		# display the plot
plot7
Fig. 13: Pie chart

Scatter plot:-

A scatter plot is a type of plot that shows the data as a collection of points. The position of a point depends on its two-dimensional value, where each value is a position along either the horizontal or the vertical dimension. The built-in function scatter can be used to create scatter plots in matplotlib. In matplotlib, a colorbar is a separate axes that provides a key for the meaning of colors in a plot. Plot legends identify discrete labels of discrete points; for continuous labels based on the color of points, lines, or regions, a labeled colorbar can be a great tool.

Here is an example for a scatter plot with a color bar …

>>> import numpy as np
>>> import matplotlib.pyplot as plt

>>> x=np.random.rand(50)	# set x values
>>> y=np.random.rand(50)	# set y values

>>> colors = np.random.rand(50)	# set colors
>>> area = np.pi*10		# set area

>>> fig,ax=plt.subplots()

>>> im=ax.scatter(x,y,s=area,c=colors,alpha=0.5)    #scatter plot

>>> fig.colorbar(im,ax=ax)	# add a color bar

>>> im.set_clim(0.0,1.0)	# set color limits


>>> plt.ylabel('y')	# set y-axis label
Text(0,0.5,'y')
>>> plt.xlabel('x')	# set x-axis label
Text(0.5,0,'x')

>>> plt.title("Scatter-plot")	# set plot title
Text(0.5,1,'Scatter-plot')

>>> plt.show()	# display the plot

Here is the output of the scatter-plot:

fin_scatter-plot.png
Fig. 14: Scatter plot with color bar

Pyplot and Pylab – Matplotlib is the whole package; matplotlib.pyplot is a module in matplotlib; and pylab is a module that gets installed alongside matplotlib. pylab is a convenience module that bulk imports matplotlib.pyplot (for plotting) and numpy (for mathematics and working with arrays) into a single namespace. Although many examples use pylab, it is no longer recommended. For non-interactive plotting it is suggested to use pyplot to create the figures and then the OO interface for plotting.

Image processing:-

In python, image processing can be performed using many powerful libraries like PIL and OpenCV. However, in this section we will focus on basic image processing using numpy, scipy and scikit-image.

Firstly, we can use the ‘imread‘ function to read an image and plot it with the help of matplotlib. The shape of the image represents the way it is stored as an array: the first two indices are the pixel coordinates and the third corresponds to the RGB colour values. To crop an image we can just take the appropriate slice of the array, using normal slicing operations.

>>> import numpy as np
>>> import matplotlib.pylab as plt
>>> 
>>> im=plt.imread("cat.jpg")	# read an image
>>> 
>>> im.shape	# image shape
(360, 480, 3)
>>> im.dtype	# image type
dtype('uint8')
>>> 
>>> plt.imshow(im)	

>>> plt.show()		# display the image


>>> im_crop=im[:,100:380,:]	# crop the image
>>> 
>>> plt.imshow(im_crop)

>>> plt.show()		# display the image

The image of the left is the input cat image and the image on the right is the cropped output.

In the case of RGB colour images, each pixel is represented using three integers corresponding to its R, G, B components. To split an image into its corresponding components, we can slice the image array along its colour dimension (the third). Here is an example.

>>> import numpy as np
>>> import matplotlib.pylab as plt
>>> 
>>> im=plt.imread("cat.jpg")	# read an image

>>> im_r=np.zeros(im.shape,dtype="uint8")

>>> im_r[:,:,0]=im[:,:,0]	# red component
>>> plt.imshow(im_r)

>>> plt.show()

>>> im_g=np.zeros(im.shape,dtype="uint8")
>>> im_g[:,:,1]=im[:,:,1]	# green component
>>> plt.imshow(im_g)

>>> plt.show()

>>> im_b=np.zeros(im.shape,dtype="uint8")
>>> im_b[:,:,2]=im[:,:,2]	# blue component
>>> plt.imshow(im_b)

>>> plt.show()

Here is the output showing red, green and blue components of the input cat image.

Tinting adds white to the original colour to make the image appear lighter than the original. It can be achieved by just multiplying the image with a vector specifying the tint filter (this uses broadcasting).

cat_tint
Fig. 15: Tinted image

We can convert an image to grayscale by taking a weighted average of the colour components (a popular method). There are a number of other ways to do this, and even some built-in functions are available for the same.

cat_grey
Fig. 16: Grayscale image

Now, here is the code for tinting and grayscale conversion …

>>> import numpy as np
>>> import matplotlib.pylab as plt
>>> 
>>> im=plt.imread("cat.jpg")	# read an image

>>> im_tint=im * [1,0.7,0.8]	# tint the image
>>> plt.imshow(np.uint8(im_tint))

>>> plt.show()		# display the image

>>> img_gray = np.average(im, weights=[0.299, 0.587, 0.114], axis=2) # rgb to grayscale
>>> plt.imshow(img_gray)

>>> plt.show()		# display the image

Scipy also provides a sub-module ‘ndimage‘ dedicated to image processing. It provides some functions for filtering and other basic image manipulations. The following code shows how we can perform image rotation and image blurring using ndimage. The face function returns a 1024 x 768 colour image of a raccoon face, and ‘imsave‘ is used to save the image file.

>>> from scipy import misc, ndimage
>>> import matplotlib.pyplot as plt
>>> 
>>> face=misc.face()	# color image of a raccoon face
>>> misc.imsave('face.png',face)
>>> 
>>> face.shape		# image shape
(768, 1024, 3)

>>> face_rotate=ndimage.rotate(face,45)	    # image rotation
>>> plt.imshow(face_rotate)

>>> plt.show()	    # display the image

>>> face_blur=ndimage.gaussian_filter(face,sigma=3)	# gaussian blur
>>> plt.imshow(face_blur)

>>> plt.show()	    # display the image

Scikit-image is a Python package dedicated to image processing, natively using NumPy arrays as image objects. The io module in skimage provides various functions for reading and writing images (io.imsave, io.imread etc.). Besides io, skimage provides various filter functions, data reduction functions, visualisation functions etc.

Here is an example of binary segmentation by thresholding, using Otsu's method in skimage.

>>> from skimage import data, io
>>> from skimage import filters
>>> import matplotlib.pyplot as plt
>>> camera=data.camera()	# loads a grey-level camera image
>>> io.imsave('camera.png', camera)	# save the image file

>>> val=filters.threshold_otsu(camera)	# otsu thresholding
>>> mask=camera < val		# image mask

>>> plt.imshow(mask, cmap='gray', interpolation='nearest')

>>> plt.show()		# display the output image

Here is  the output of  binary segmentation …

segment-otsu
Fig. 17: Binary segmentation

Edge operators are used in image processing within edge detection algorithms. They are discrete differentiation operators, computing an approximation of the gradient of the image intensity function. The sobel function detects edges in an image and returns a sobel edge map as output.

Here is a demo of edge detection using sobel filter.

>>> import numpy as np
>>> from skimage.data import camera	# provides standard test images
>>> from skimage.filters import sobel
>>> import matplotlib.pyplot as plt
	
>>> image=camera()	# loads a grey-level camera image
>>> image_sobel=sobel(image)	# apply sobel filter
>>> 
>>> plt.imshow(image_sobel,cmap='gray', interpolation='nearest')

>>> plt.show()		# display the output image
camera_edge
Fig. 18: Sobel edge detection

Finally, let's plot the histograms of the R, G, B components of an image in a single graph.
First we have to pull out each component slice from the array and calculate the histogram for each of them. Here, the bins correspond to the intensity range 0-255. Then we can plot the three histograms in a single plot using matplotlib.

Here is the final code …

import numpy as np
from scipy import misc
from matplotlib import pyplot as plt
 
image=misc.face()	# load a rgb image
 
r=np.zeros((image.shape[0],image.shape[1]),dtype="uint8")	# initialise r,g,b components
g=np.zeros((image.shape[0],image.shape[1]),dtype="uint8")
b=np.zeros((image.shape[0],image.shape[1]),dtype="uint8")

r[:,:]=image[:,:,0]		# red component
g[:,:]=image[:,:,1]		# green component
b[:,:]=image[:,:,2]		# blue component

nr,hist_r=np.histogram(r, bins=np.arange(0, 256))	# compute histogram
ng,hist_g=np.histogram(g, bins=np.arange(0, 256))
nb,hist_b=np.histogram(b, bins=np.arange(0, 256))

plt.plot(hist_r[:-1],nr,'r',hist_g[:-1],ng,'g',hist_b[:-1],nb,'b')	# plot the histogram

plt.show()	# display the output

The output is as follows …

rgb_hist.png
Fig. 19: R,G,B Histogram (X-axis : Intensity)

Integration:-

Scipy provides a function quad in its integrate module to compute a definite integral. It accepts a function and the limits of integration as inputs and returns the computed integral value along with an estimate of the absolute error. There is also support for double, triple and n-dimensional integrals. The module additionally features routines for integrating ordinary differential equations.

>>> from  scipy.integrate import quad 
>>> import numpy as np
>>> 
>>> res,err=quad(lambda x: 1/(1+np.sin(x)),0,np.pi) # compute the integral
>>>
>>> print(res)	# print result
2.0000000000000004
>>> print(err)	# print error
1.1699617667999825e-11
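The double-integral support mentioned above is exposed as dblquad; here is a sketch integrating f(x, y) = x*y over the unit square, whose exact value is 1/4:

```python
from scipy.integrate import dblquad

# dblquad integrates func(y, x) for x in [a, b] and y in [gfun(x), hfun(x)]
res, err = dblquad(lambda y, x: x * y, 0, 1, lambda x: 0, lambda x: 1)
print(res)    # 0.25
```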

Fourier transform:-

Fourier analysis is a method for expressing a function as a sum of periodic components, and for recovering the signal from those components. When both the function and its Fourier transform are replaced with discretized counterparts, it is called the discrete Fourier transform (DFT).

The scipy.fftpack module computes fast Fourier transforms (FFTs) and offers utilities to handle them.

Here is an example of FFT of the sum of two sines.

dct_in
Fig. 20: Input signals (sine waves)
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy.fftpack import fft
>>> 
>>> N=600	# Number of sample points
>>> T=1.0/800.0		# Sample spacing
>>> 
>>> x=np.linspace(0.0, N*T, N)
>>> y=np.sin(50.0*2.0*np.pi*x) + 0.5*np.sin(80.0*2.0*np.pi*x)
>>> 
>>> y1=np.sin(50.0*2.0*np.pi*x)		# first input signal
>>> y2=0.5*np.sin(80.0*2.0*np.pi*x)	# second input signal
>>> 
>>> plt.plot(x,y1,'b',x,y2,'r')
[<matplotlib.lines.Line2D object at 0x...>, <matplotlib.lines.Line2D object at 0x...>]
>>> plt.show()

>>> yf=fft(y)		# compute fft
>>> xf=np.linspace(0.0,1.0/(2.0*T),N//2)	# integer division for the sample count

>>> plt.plot(x,yf)
ComplexWarning: Casting complex values to real discards the imaginary part
  return array(a, dtype, copy=False, order=order)
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.show()


>>> plt.plot(xf,2.0/N * np.abs(yf[0:N//2]))	# plot the fft
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.grid()
>>> plt.show()		# display the plot

dct2

The initial plot of the fft shows that the graph is mirrored in the right half. Also, there are portions that correspond to the imaginary part. Therefore let's take the absolute value only, from the first half of the plot. Now we have the amplitude spectrum of the time-domain signal.

Now, to get real physical values for the frequency axis, we only plot up to the Nyquist frequency, which is simply half of the sampling frequency. The amplitude of the signal is multiplied by a factor of 2/N to obtain the real physical value. The reason behind the doubling is that the power of the signal in the time and frequency domains has to be equal, and we dropped the mirrored half. Also, the output obtained from fft scales with the number of samples, so we have to divide it by N.
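The one-sided spectrum can also be obtained directly with numpy's real-input FFT, skipping the mirrored half altogether. The sketch below uses N = 800 samples at 800 Hz (different from the session above, chosen so that 50 Hz and 80 Hz fall exactly on bins):

```python
import numpy as np

fs, N = 800, 800                  # sampling rate and sample count (an assumption)
t = np.arange(N) / fs
y = np.sin(2*np.pi*50*t) + 0.5*np.sin(2*np.pi*80*t)

yf = np.fft.rfft(y)               # one-sided FFT of a real signal
freqs = np.fft.rfftfreq(N, 1/fs)  # matching frequency axis, 0..fs/2
amp = 2.0/N * np.abs(yf)          # scale to physical amplitudes

peak = freqs[np.argmax(amp)]
print(peak)                       # 50.0: the strongest component
print(round(amp[np.where(freqs == 80)[0][0]], 2))   # 0.5
```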

Finally, let's plot the graph of the FFT.

dct3.png
Fig. 21: FFT – plot

Function minimization:-

Mathematical optimization deals with the problem of finding numerically minimums (or maximums or zeros) of a function.
The scipy.optimize package provides several commonly used optimization algorithms.

Here is an example where the minimize routine is used with the ‘L-BFGS-B‘ algorithm.

1. Find the global minimum of the function f(x) = 2x^3 + 3x^2 - 36x + 2 within the interval [-10, 3].

>>> import numpy as np
>>> from scipy import optimize
>>> from matplotlib import pyplot as plt

>>> def g(x):		# function to be optimized
...     return 2*x**3 + 3*x**2 - 36*x +2
... 
>>> result=optimize.minimize(g,x0=-7,method="L-BFGS-B",bounds=((-10,3),) ) # minimize function

>>> result		# result of minimization
      fun: array([-1338.])
 hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
      jac: array([504.00003602])
  message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 4
      nit: 1
   status: 0
  success: True
        x: array([-10.])

Here x0=-7 is the starting point and the bounds are [-10, 3].

The minimum is at x = -10 and its corresponding value is -1338.

optmize
Fig. 22: Plot of y = f(x) = 2x^3 + 3x^2 - 36x + 2

Here is another example of the minimize routine; with no method specified it uses the default BFGS quasi-Newton algorithm (alternatives such as Nelder-Mead can be selected through the method parameter):

Minimize : f(x) = x^2 + 10*sin(x)

>>> import numpy as np
>>> from scipy import optimize
>>> from matplotlib import pyplot as plt

>>> def f(x):	# function to be optimized
...     return x**2 + 10*np.sin(x)
... 
>>> 
>>> x = np.arange(-10, 10, 0.1)

>>> plt.plot(x, f(x))	# plot the function graph
[<matplotlib.lines.Line2D object at 0x...>]
>>> plt.show()		# display the plot

>>> print(optimize.minimize(f,x0=0))	# optimization result
      fun: -7.945823375615215
 hess_inv: array([[0.08589237]])
      jac: array([-1.1920929e-06])
  message: 'Optimization terminated successfully.'
     nfev: 18
      nit: 5
     njev: 6
   status: 0
  success: True
        x: array([-1.30644012])

Here, x0=0 is the starting point. The function has its minimum at x = -1.31, where the corresponding value is -7.95.

minimize
Fig. 23: Plot of f(x) = x^2 + 10*sin(x)

Root of a function:-

A root of a function f(x) is a value of x where f(x) = 0.

The function ‘brentq‘ from optimize finds a root of a scalar function within a bracketing interval using Brent’s method. We have to provide the function and an interval as input.

Note that you must provide an interval [a,b] across which the function is continuous and changes sign.

Here is an example: find the roots of f(x) = x^2 + 10*sin(x).

>>> import numpy as np
>>> from scipy import optimize

>>> def f(x):	# input function
...  return x**2 + 10* np.sin(x)

>>> root=optimize.brentq(f,0,1)	# find root in [0,1]
>>> root
0.0

>>> root=optimize.brentq(f,-3,-1)	# find root in [-3,-1]
>>> root
-2.479481

Another option for finding roots is to use the ‘root‘ function from the same module.
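Unlike brentq, the root function needs only an initial guess rather than a bracketing interval; a sketch on the same function:

```python
import numpy as np
from scipy import optimize

def f(x):
    return x**2 + 10*np.sin(x)

sol = optimize.root(f, x0=-2.5)     # start near the nonzero root
print(sol.success)                  # True
print(round(float(sol.x[0]), 4))    # -2.4795, matching the brentq result
```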

Machine learning with scikit-learn:-

  1. Classification:-

In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. In supervised learning, the data comes with the additional attributes that we want to predict; the problem of classification falls under this category. Here, the samples belong to two or more classes, and we try to learn from already-labelled data how to predict the class of unlabelled data. For this purpose, we have a training set on which we learn the data properties and a testing set on which we test these properties.

Scikit-learn comes with a standard ‘digits‘ dataset for recognising handwritten digits. It has 1797 samples of images of size 8*8, so the attribute or feature vector consists of 64 elements. Target contains the label for each sample, i.e. 0-9. First, we have to divide our dataset into training and testing categories. Here we train our model using an SVM, with all the samples except the last one, and finally try to predict the label of the held-out sample using this learned model.

>>> from sklearn import datasets
>>> from sklearn import svm
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> 
>>> digits=datasets.load_digits() # load the dataset
>>> 
>>> print(digits.data)		# feature vectors
[[ 0.  0.  5. ...  0.  0.  0.]
 [ 0.  0.  0. ... 10.  0.  0.]
 [ 0.  0.  0. ... 16.  9.  0.]
 ...
 [ 0.  0.  1. ...  6.  0.  0.]
 [ 0.  0.  2. ... 12.  0.  0.]
 [ 0.  0. 10. ... 12.  1.  0.]]
>>> print(digits.data.shape) 	# shape : (n_samples, n_features)
(1797, 64)
>>> print(digits.target)	# data-labels 
[0 1 2 ... 8 9 8]
>>> print(digits.target.shape)	# shape : (n_samples)
(1797,)
>>> digits.images[0]		# sample data
array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])
>>> digits.images[0].shape	# image shape (n_features=8*8)
(8, 8)
>>> 
>>> clf=svm.SVC(gamma=0.001,C=100.)	# initialize SVM
>>> 
>>> clf.fit(digits.data[:-1],digits.target[:-1])	# train the model
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
>>> 
>>> clf.predict(digits.data[-1:])	# predict the unknown class label
array([8])
>>> 
>>> plt.imshow(digits.images[-1],cmap='gray')	# display the image

>>> plt.show()		#show the output
svm1
Fig. 24: Predicted data – image (class ~ 8)

Scikit-learn additionally provides two functions, joblib.dump and joblib.load, for model persistence. The ‘joblib’ module helps us save a trained model to a file and load it at a later time for prediction.
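
A minimal sketch of model persistence (in older scikit-learn releases the same functions live under sklearn.externals.joblib; the file name used here is arbitrary):

```python
import joblib
from sklearn import datasets, svm

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])

joblib.dump(clf, 'digits_svm.joblib')    # save the trained model to disk
clf2 = joblib.load('digits_svm.joblib')  # reload it later for prediction
print(clf2.predict(digits.data[-1:]))    # same prediction as the original model
```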

2.Clustering:-

In unsupervised learning, the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data. This classic problem in machine learning is called clustering.

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.

Here, we are using the standard iris dataset. If we knew that there were 3 types of iris but did not have access to a taxonomist to label them, we could try a clustering task: split the observations into well-separated groups called clusters. Thus our task is to cluster the data into three groups without the help of predefined labels. There are 150 samples, and each sample consists of four features.

>>> import numpy as np
>>> from sklearn import cluster, datasets

>>> iris=datasets.load_iris()	# load the dataset

>>> x_iris=iris.data	# input features
>>> y_iris=iris.target	# target labels

>>> iris.data.shape	# shape: (n_samples, n_features)
(150, 4)
>>> iris.target.shape	# shape: (n_samples)
(150,)


>>> iris.data[:10,:]	# sample feature vectors (first 10)
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])
>>> iris.target[:10]	# sample target labels (first 10)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

>>> k_means=cluster.KMeans(n_clusters=3)	# initialize k-means (clusters = 3)
>>> 
>>> k_means.fit(x_iris)		# K-Means clustering
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

>>> k_means.labels_[::10]	# k-means predicted cluster labels (every 10'th sample)
array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0], dtype=int32)
>>> y_iris[::10]		# ground truth (corresponding samples)
array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2])

>>> k_means.cluster_centers_	# centroids or cluster centers
array([[6.85      , 3.07368421, 5.74210526, 2.07105263],
       [5.006     , 3.418     , 1.464     , 0.244     ],
       [5.9016129 , 2.7483871 , 4.39354839, 1.43387097]])

After training the model we predicted the cluster labels of every tenth element from the sample and compared them with the ground truth. The corresponding cluster centres are also given in the above example.
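
Note that k-means numbers its clusters arbitrarily (cluster 1 above corresponds to class 0), so a direct label comparison is misleading; a permutation-invariant measure such as the adjusted Rand index is one way to score the agreement. A small sketch (random_state is fixed here only for reproducibility):

```python
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(iris.data)

# ARI is invariant to the arbitrary numbering of clusters (1.0 = perfect match)
print(adjusted_rand_score(iris.target, k_means.labels_))
```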

Here is another simple example of k-means clustering. The input data consists of coordinate points which we have to classify into three clusters and plot using a scatter plot. Additionally, we also plot the cluster centres and colour them accordingly in the plot output.

K-means – Data: ( (2,10), (2,5), (8,4), (5,8), (7,5), (6,4), (1,2), (4,9) ), Number of Clusters: 3

Here is the code …

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")

from sklearn.cluster import KMeans

X=np.array([ [2,10],[2,5],[8,4],[5,8],[7,5],[6,4],[1,2],[4,9] ]) # input data

kmeans = KMeans(n_clusters=3)	# initialize k-means (clusters=3)
kmeans.fit(X)	# k-means clustering

centroid = kmeans.cluster_centers_ # centroids
labels = kmeans.labels_	# predicted labels

print (centroid)
print(labels)

colors = ["r.","g.","b."]	# cluster colours	

for i in range(len(X)):
   print ("coordinate:" , X[i], "label:", labels[i])	# print the cluster labels
   plt.plot(X[i][0],X[i][1],colors[labels[i]],markersize=10)	# plot the clusters

plt.scatter(centroid[:,0],centroid[:,1], marker = "x", s=150, linewidths = 5, zorder =10)	# plot the centroids

plt.show()	# display the output

The output of the scatter plot is shown below …

cluster2
Fig. 25: Scatter plot – k-means clustering

3. Regression:-

In linear regression, the data are modelled to fit a straight line. A simple linear regression is useful for finding the relationship between two continuous variables. For example, a random variable y (response variable) can be modelled as a linear function of another random variable x (predictor variable) with the equation

y = mx + b

The coefficients ‘m’ and ‘b’ are the regression coefficients and specify the slope and y-intercept of the line respectively. They can be solved for by the method of least squares, which minimizes the error between the observed data points and the estimated line.

This example uses the ‘diabetes‘ dataset from scikit-learn, in order to illustrate a two-dimensional plot of this regression technique. It uses only one feature of the dataset to fit a regression line that minimizes the error between observed and predicted responses. The coefficients, the residual sum of squares and the variance score are also calculated. The last twenty samples are used for testing, and the remaining ones are used for training the model.

Here is the code …

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> from sklearn import datasets, linear_model
>>> from sklearn.metrics import mean_squared_error, r2_score
>>> 
>>> diabetes = datasets.load_diabetes() # load the dataset

>>> diabetes.data.shape		# data-shape: (n_samples, n_features)
(442, 10)
>>> diabetes.target.shape	# target-shape: (n_samples)
(442,)
>>> 
>>> X = diabetes.data[:, np.newaxis, 2]	# select one feature
>>> X.shape
(442, 1)
>>> X[:10,:]
array([[ 0.06169621],
       [-0.05147406],
       [ 0.04445121],
       [-0.01159501],
       [-0.03638469],
       [-0.04069594],
       [-0.04716281],
       [-0.00189471],
       [ 0.06169621],
       [ 0.03906215]])
>>> 
>>> X_train = X[:-20]	# split the data into training/testing sets
>>> X_test =  X[-20:]
>>> 
>>> Y_train = diabetes.target[:-20]	# split the targets into training/testing sets
>>> Y_test = diabetes.target[-20:]
>>> 
>>> regr = linear_model.LinearRegression()	# initialize linear regressor

>>> regr.fit(X_train, Y_train)		# train the model
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> 
>>> Y_pred = regr.predict(X_test)	# predict the test data
>>> 
>>> print('Coefficients: \n', regr.coef_)	# regression coefficient (slope m)
Coefficients: 
 [938.23786125]
>>> print('Intercepts: \n', regr.intercept_)	# intercept (b)
Intercepts: 
 152.91886182616167

>>> print("Mean squared error: %.2f"		# print mse
...       % mean_squared_error(Y_test, Y_pred))
Mean squared error: 2548.07
>>> print('Variance score: %.2f' % r2_score(Y_test, Y_pred))	# print variance
Variance score: 0.47
>>> 
>>> plt.scatter(X_test, Y_test,  color='black')		# plot the data points


>>> plt.plot( X_test, Y_pred, color='blue', linewidth=3)	# plot the prediction line
[]
>>> 
>>> plt.xticks(())
([], )
>>> plt.yticks(())
([], )

>>> plt.show()		# display the output

The following figure shows the graphical plot of the aforementioned linear regression problem.

linera_regr
Fig. 26: Linear regression graph

Scikit-learn also comes with a ‘boston‘ dataset that can be used for regression. The above example contains just two variables. A regression problem can involve more than two variables, and is then usually referred to as multiple linear regression.
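
As a sketch of multiple linear regression, the same diabetes example can be refit with all ten features instead of one (the train/test split mirrors the example above):

```python
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target    # all 10 features this time

regr = linear_model.LinearRegression()
regr.fit(X[:-20], y[:-20])               # train on all but the last 20 samples
y_pred = regr.predict(X[-20:])

print(regr.coef_.shape)                  # one coefficient per feature: (10,)
print('Variance score: %.2f' % r2_score(y[-20:], y_pred))
```

Using all the features improves the variance score compared to the single-feature fit.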

Decision Trees:-

A decision tree is a flow-chart-like structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test and each leaf node holds a class label. It is a non-parametric learning method used for classification and regression, where the goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree structure is constructed that breaks the dataset down into smaller subsets, eventually resulting in a prediction. Decision tree induction refers to the process of learning decision trees from class-labelled training tuples.

Tree models where the target variable can take a discrete set of values are called classification trees, whereas tree models where the target variable can take continuous values are called regression trees. Decision trees have several advantages compared with similar predictive models: they are simple and easy to interpret, handle multi-dimensional data, learn fast, and accept both numerical and categorical data. However, they can create over-complex trees, which may lead to overfitting.
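
One common guard against over-complex trees is to bound the tree's growth; here is a sketch with illustrative (not tuned) limits on depth and leaf size:

```python
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()

# max_depth and min_samples_leaf values here are illustrative, not tuned
clf = tree.DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
clf = clf.fit(iris.data, iris.target)

print(clf.tree_.max_depth)   # the fitted depth never exceeds the cap
```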

During tree construction, attribute selection measures like information gain, gain ratio, Gini index etc. are used for selecting the splitting criterion that best separates the given data partition into distinct classes. Popular decision tree algorithms include ID3, C4.5, C5.0 and CART. Scikit-learn uses an optimised version of the CART algorithm.

The following code shows an example of using decision trees for classification, using the standard iris dataset.

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> 
>>> iris=load_iris()	# load the iris dataset
>>> 
>>> X=iris.data		# feature vectors
>>> Y=iris.target	# class labels
>>> 
>>> from sklearn.model_selection import train_test_split	
>>> X_train, X_test, Y_train, Y_test=train_test_split(X,Y,test_size=0.30)   # split the data for training and testing
>>> 
>>> clf=tree.DecisionTreeClassifier()	# create a classifier object
>>> clf=clf.fit(X_train,Y_train)	# train the classifier
>>> 
>>> Y_pred=clf.predict(X_test)		# predict the classes of test data
>>> 
>>> from sklearn.metrics import classification_report, confusion_matrix
>>> 
>>> print(confusion_matrix(Y_test,Y_pred))	# confusion matrix (evaluation)
[[15  0  0]
 [ 0 12  1]
 [ 0  1 16]]
>>> print(classification_report(Y_test,Y_pred))	# performance metrics
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        15
          1       0.92      0.92      0.92        13
          2       0.94      0.94      0.94        17

avg / total       0.96      0.96      0.96        45

>>> 
>>> X_test[7]				
array([5.8, 2.7, 3.9, 1.2])

>>> clf.predict(X_test[7].reshape(1,-1))	# test a single data-sample
array([1])
>>> Y_test[7]
1
>>> clf.predict_proba(X_train[7].reshape(1,-1))	# predict the probability of each class (here for a training sample)
array([[0., 0., 1.]])

>>> import graphviz
>>> 
>>> dot_data=tree.export_graphviz(clf,out_file=None)	# export the tree in Graphviz DOT format
>>> dot_data=tree.export_graphviz(clf,out_file=None,
...                     feature_names=iris.feature_names,
...                     class_names=iris.target_names,
...                     special_characters=True)

>>> graph=graphviz.Source(dot_data)	# plot the graph
>>> graph.render()	# render the graph to a file
'Source.gv.pdf'

Here is the output of the decision tree …

drf.png
Fig. 27: Decision Tree

PCA:-

PCA is fundamentally a dimensionality reduction algorithm, but it can also be useful as a tool for visualization, noise filtering, feature extraction and engineering, and much more. Principal Component Analysis (PCA) applied to a particular dataset identifies the combination of attributes (principal components, or directions in the feature space) that account for the most variance in the data. The task of dimensionality reduction is to ask whether there is a suitable lower-dimensional representation that retains the essential features of the data. Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance. PCA combines the essence of attributes by creating an alternative, smaller set of variables, onto which the initial data can be projected.

Here is an example of PCA using the iris dataset. There are four attributes for each data sample in this dataset. Our task is to reduce the dimensions to two with the help of PCA and represent the resulting data projection using a scatter plot.

>>> import matplotlib.pyplot as plt 
>>> from sklearn import datasets
>>> from sklearn.decomposition import PCA

>>> iris=datasets.load_iris()	# load the iris dataset

>>> X=iris.data		# feature vectors
>>> Y=iris.target	# target labels

>>> X.shape		# input data shape
(150, 4)
>>> Y.shape		# target vector shape
(150,)

>>> X[:5,:]		# data samples (initial rows)
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
>>> Y[:5]		
array([0, 0, 0, 0, 0])
>>> Y			# target classes
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

>>> target_names=iris.target_names	# class label (3)
>>> target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

>>> pca=PCA(n_components=2)	# initialize PCA hyperparameters
>>> X_r=pca.fit(X).transform(X) # perform PCA and transform the data

>>> X_r[:5,:]			# sample output after PCA
array([[-2.68420713,  0.32660731],
       [-2.71539062, -0.16955685],
       [-2.88981954, -0.13734561],
       [-2.7464372 , -0.31112432],
       [-2.72859298,  0.33392456]])

>>> print(str(pca.explained_variance_ratio_))	# variance ratio
[0.92461621 0.05301557]
>>> print(pca.components_)	# principal components
[[ 0.36158968 -0.08226889  0.85657211  0.35884393]
 [ 0.65653988  0.72971237 -0.1757674  -0.07470647]]

>>> plt.figure()	# create a new figure
<Figure size 640x480 with 0 Axes>

>>> colors=['red','green','blue']	# set the data point colours
>>> lw=2	# set line width

>>> for color, i, target_name in zip( colors, [0,1,2], target_names):
...  plt.scatter(X_r[Y == i,0], X_r[Y==i,1], color =color,alpha=.8,lw=lw,label=target_name)	# scatter plot
... 
<matplotlib.collections.PathCollection object at 0x7f004cad0908>
<matplotlib.collections.PathCollection object at 0x7f004cad0c50>
<matplotlib.collections.PathCollection object at 0x7f004ca5a0f0>

>>> plt.title('PCA of IRIS dataset')	# set title
Text(0.5,1,'PCA of IRIS dataset')
>>> plt.legend(loc='best',shadow=False, scatterpoints=1)	# set legend
<matplotlib.legend.Legend object at 0x7f004cabf160>

>>> plt.show()	# display the plot

Here is the output of the scatter plot after PCA …

iris_pca.png
Fig. 28: PCA of iris dataset

Preprocessing:-

Most real-world datasets include missing values and categorical values, and their features may have dissimilar value ranges. These can adversely affect the machine learning algorithm used for training the model. Such algorithms mostly use the Euclidean distance (numbers) in their estimation process, and if the data is not preprocessed, it may lead to the learning of meaningless correlations. Preprocessing operations like data transformation, encoding, imputation etc. help to improve the accuracy and efficiency of ML algorithms involving distance measurements.

Standardization refers to the process of removing the mean from the feature values and scaling them by the corresponding variance. Most ML algorithms assume the data are centred around zero and have variance of the same order (Gaussian with zero mean and unit variance). The StandardScaler in scikit-learn may be used for this purpose. It initially computes the mean and standard deviation on a training set, which are later reapplied to training or test data to transform them into standardized values. Another class, MinMaxScaler, is used for scaling the features to lie in a particular range of values (min, max). The MinMax scaler transforms the values into the range [0,1], whereas MaxAbsScaler scales the features into the range [-1,1]. These also help us preserve zero-value entries in the data.

Normalization is the process of scaling individual samples to have unit norm (l1 or l2 norm). This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

Real-world data can sometimes have missing values represented as ‘NaN‘ or blanks. If we want to use these datasets with scikit-learn, we have to deal with the missing values by some strategy. One strategy is to remove rows or columns containing them. A better strategy is to infer the missing values from the remaining data. The Imputer class in scikit-learn can be used for dealing with missing data. We can replace the missing values by the mean, median, most frequent value etc.

Often features may be represented by categorical values (ordered or unordered). By default they are incompatible with most ML estimators, so we have to convert them into some numerical representation. But this can lead to problems if the representation implies an order even when there isn’t one. One strategy is to use the one-hot encoding scheme. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.

>>> from sklearn import preprocessing
>>> import numpy as np
>>> 
>>> X_train = np.array([ [1.,-1.,2.],	# input data
...                      [2.,0.,0.],
...                      [0.,1.,-1.] ])
>>> 

# Standard scaling

>>> scaler = preprocessing.StandardScaler().fit(X_train)	# learn model parameters (standard scaler)
>>> scaler.transform(X_train)					# apply the transformation (train)
array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
>>> 
>>> X_test = [[-1.,1.,0.]] 				
>>> scaler.transform(X_test)				# apply the transformation (test)
array([[-2.44948974,  1.22474487, -0.26726124]])
>>> 
>>> scaler.mean_					# model parameters
array([1.        , 0.        , 0.33333333])
>>> scaler.scale_					# model parameters
array([0.81649658, 0.81649658, 1.24721913])

# Scaling features to a range

>>> min_max_scaler=preprocessing.MinMaxScaler()			# create a scaler object (minmax)
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)	# learn model parameters and apply transformation (train)

>>> X_train_minmax
array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
>>> 
>>> X_test = np.array( [[-3.,-1.,4.]] )
>>> X_test_minmax = min_max_scaler.transform(X_test)	# apply the transformation (test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])
>>> 
>>> min_max_scaler.scale_				# model parameters
array([0.5       , 0.5       , 0.33333333])
>>> min_max_scaler.min_					# model parameters
array([0.        , 0.5       , 0.33333333]) 

 
>>> max_abs_scaler = preprocessing.MaxAbsScaler()	# create a scaler object (maxabs)
>>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)	# learn model parameters and apply transformation (train)

>>> X_train_maxabs
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
>>> 
>>> X_test = np.array( [[-3.,-1., 4.]])
>>> X_test_maxabs = max_abs_scaler.transform((X_test))	# apply the transformation (test)
>>> X_test_maxabs
array([[-1.5, -1. ,  2. ]])
>>> 
>>> max_abs_scaler.scale_	# model parameters
array([2., 1., 2.])
>>> 

# Normalization

>>> normalizer = preprocessing.Normalizer().fit(X_train)	# create a normalizer object and learn the model parameters
>>> normalizer.transform(X_train)				# apply the transformation (train)
array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
>>> 
>>> normalizer.transform([ [-1., 1., 0.] ])			# apply the transformation (test)
array([[-0.70710678,  0.70710678,  0.        ]])


#Imputation of missing values

>>> from sklearn.preprocessing import Imputer
>>> 
>>> im = Imputer(missing_values='NaN', strategy='mean', axis=0)		# create an imputer object
>>> im.fit([ [1,2], [np.nan,3], [7,6] ])				# learn the model parameters (mean)
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> 
>>> X = [ [np.nan, 2], [6, np.nan], [7,6] ]
>>> im.transform(X)					# transform the data
array([[4.        , 2.        ],
       [6.        , 3.66666667],
       [7.        , 6.        ]])

#One hot encoding

>>> from sklearn.preprocessing import OneHotEncoder
>>> 
>>> enc=OneHotEncoder()			# create a onehotencoder object
>>> enc.fit([ [0,0,3], [1,1,0], [0,2,1], [1,0,2] ])	# learn model parameters
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)

>>> enc.n_values_	# model parameters
array([2, 3, 4])
>>> enc.feature_indices_	# model parameters
array([0, 2, 5, 9])

>>> enc.transform([[0,1,1]]).toarray()		# transform the data
array([[1., 0., 0., 1., 0., 0., 1., 0., 0.]])

 

OneHotEncoding:- For example, a person could have features [“male”, “female”], [“from Europe”, “from US”, “from Asia”], [“uses Firefox”, “uses Chrome”, “uses Safari”, “uses Internet Explorer”]. Such features can be efficiently coded as integers; for instance [“male”, “from US”, “uses Internet Explorer”] could be expressed as [0, 1, 3] while [“female”, “from Asia”, “uses Chrome”] would be [1, 2, 1]. The data [0, 1, 3] is encoded as [1., 0., 0., 1., 0., 0., 0., 0., 1.] using one-hot encoding. In the result, the first two numbers encode the gender, the next set of three numbers the continent and the last four the web browser.
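
The worked example above can be reproduced directly; the fitting data below is chosen so the three columns take 2, 3 and 4 distinct values respectively (gender, continent, browser):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
# columns take 2, 3 and 4 distinct values: gender, continent, browser
enc.fit([[0, 0, 0], [0, 1, 1], [1, 2, 2], [1, 0, 3]])

print(enc.transform([[0, 1, 3]]).toarray())
# [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]
```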

Note:

  • fit() : learns model parameters from the training data
  • transform() : applies the parameters learned by fit() to transform a data set
  • fit_transform() : combines fit() and transform() on the same data set
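
A quick check of that equivalence on the same data (the array is the one used in the scaling examples above):

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

a = preprocessing.StandardScaler().fit(X).transform(X)   # fit, then transform
b = preprocessing.StandardScaler().fit_transform(X)      # both steps at once
print(np.allclose(a, b))  # True
```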


Pandas:-

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python. It offers data structures and operations for manipulating numerical tables and time series. Its data structures, along with numpy and matplotlib, allow you to perform high-level data munging on large datasets easily and efficiently. Other features include flexible indexing schemes, rich I/O, efficient data manipulation, easy-to-use visualization techniques etc. It is used in a wide variety of academic and commercial domains, including finance, neuroscience, economics, statistics, advertising, web analytics etc.

Installation: The easiest way to install pandas is through a package manager like pip or conda. Another option is to use the usual APT command from the terminal.

  • pip install pandas
  • conda install pandas
  • sudo apt-get install python3-pandas

Note – NumPy is a necessary dependency for pandas. Other optional dependencies include scipy, matplotlib etc.

It can be used with IPython for interactive data visualization and use of GUI toolkits.

Here are the most popular resources for pandas …

Site: https://pandas.pydata.org/
Book: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython –   by Wes McKinney

Straight from the horse’s mouth !!!

Pandas mainly consists of three data-structures:-

Series: a one-dimensional labelled homogeneous array capable of holding any data type.
DataFrame: a 2-dimensional labelled data structure with columns of potentially different types.
Panel: a three-dimensional data structure with heterogeneous data.

The panel data structure is somewhat less used and has been deprecated in recent versions of pandas. We will mainly concentrate on the first two structures, which are more frequently used for data processing and analysis.

Series

A series is a 1-D data structure containing a particular type of data. It is a homogeneous and labelled data structure; the axis labels are collectively known as the index. We can create a series object from a list, ndarray, dict, scalar etc. If an index is not specified, an automatic index will be created using the range 0 to len(data)-1. The elements can be accessed like an ndarray using positions, or like a dictionary using label keys. Other basic modifications and data access methods like slicing are also supported.

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> s_empty=pd.Series()		# empty series
>>> s_empty
Series([], dtype: float64)
>>> 
>>> s_list=pd.Series([1,2,3,4])		# series from list
>>> s_list
0    1
1    2
2    3
3    4
dtype: int64

>>> s_ndarray=pd.Series(np.random.rand(5))	# series from ndarray
>>> s_ndarray
0    0.190963
1    0.908449
2    0.928749
3    0.504798
4    0.781002
dtype: float64

>>> s_indexed=pd.Series(['a','b','c','d'],index=[1,2,3,4])	# series with custom index
>>> s_indexed
1    a
2    b
3    c
4    d
dtype: object

>>> s_dict=pd.Series({'a':1,'b':2,'c':3,'d':4})		# series from dictionary
>>> s_dict
a    1
b    2
c    3
d    4
dtype: int64

>>> s_scalar=pd.Series(7,index=list('1234567'))		# series from a scalar
>>> s_scalar
1    7
2    7
3    7
4    7
5    7
6    7
7    7
dtype: int64


>>> data={'a':1,'c':2,'d':3}
>>> s=pd.Series(data,index=['a','b','c','d'])		
>>> s							# series with NaN
a    1.0
b    NaN
c    2.0
d    3.0
dtype: float64

>>> s[2]	# data access by position
2.0
>>> s['b']	# data access by index label
nan

>>> s[1]=4	# data modification

>>> s[0:3]	# data slicing
a    1.0
b    4.0
c    2.0
dtype: float64

>>> s[['a','c','d']]	# retreive multiple elements
a    1.0
c    2.0
d    3.0
dtype: float64
  • Most of the Python string functions can be applied to series data after converting the elements to string objects.
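
Pandas also exposes vectorized string methods through the `.str` accessor, which is usually more convenient than converting manually; a small sketch:

```python
import pandas as pd

s = pd.Series(['alpha', 'Beta', 'GAMMA'])

print(s.str.upper().tolist())   # ['ALPHA', 'BETA', 'GAMMA']
print(s.str.len().tolist())     # [5, 4, 5]
```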

DataFrame:-

A data-frame is a 2-D data structure organized as rows and columns. The columns may contain different types of data and can be accessed through the row index and column labels. We can create a data-frame using a list, dict, series, ndarray etc. We can additionally specify the column labels and their data types at the time of its creation.

The following code shows the commonly used methods for creating a data-frame.

>>> import numpy as np
>>> import pandas as pd

>>> df=pd.DataFrame()		# empty data-frame
>>> df
Empty DataFrame
Columns: []
Index: []

>>> df=pd.DataFrame([1,2,3,4])	# data-frame from list
>>> df
   0
0  1
1  2
2  3
3  4

 
>>> data={'Name':['Alice','Bob','Trudy'], 'Age':np.random.randint(10,20,3)} 
>>> df=pd.DataFrame(data, index=['a','b','c']) 	# data-frame from dict of list/ndarray (custom index)
>>> df
   Age   Name
a   19  Alice
b   11    Bob
c   15  Trudy

>>> data=[ {'a':1, 'b':2}, {'a':3,'b':4,'c':5} ]
>>> df=pd.DataFrame(data)	# data-frame from list of dicts
>>> df
   a  b    c
0  1  2  NaN
1  3  4  5.0

>>> data= {'one' : pd.Series([1,2,3], index=list('abc') ),
...        'two' : pd.Series([4,5,6,7], index=['a','b','c','d'])}
>>> df=pd.DataFrame(data)	# data-frame from dict of series
>>> df
   one  two
a  1.0    4
b  2.0    5
c  3.0    6
d  NaN    7

>>> data=[(1,2.,'field_1'), (3,4.,'field_2') ]
>>> pd.DataFrame.from_records(data)	# data-frame from list of tuples (auto row-index & columns)
   0    1        2
0  1  2.0  field_1
1  3  4.0  field_2
>>> 
  • If no row-index or columns is passed, then by default, index will be range(n), where n is the array length.
  • When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.
  • In the case of ‘DataFrame.from_records’, it takes a list of tuples or an ndarray with structured dtype as input.
  • As usual, missing values are represented by NaN’s.

Selection, addition and deletion of dataframes :-

Rows can be selected and indexed using row labels or integer locations; there are separate functions for each type of indexing. Boolean indexing and slicing are carried out similarly to normal numpy arrays. Pandas also comes with a number of functions for the addition and deletion of rows and columns.

Here are a few examples …

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> df=pd.DataFrame([ [1,2,3], [4,5,6], [7,8,9] ], index=['a','b','c'],columns=['A','B','C'] )
>>> 
>>> df
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9
>>> 
>>> df['A']	# select column
a    1
b    4
c    7
Name: A, dtype: int64
>>> 
>>> df.loc['a']		#select row by label
A    1
B    2
C    3
Name: a, dtype: int64
>>> 
>>> df.iloc[0]		# select row by location
A    1
B    2
C    3
Name: a, dtype: int64
>>> 
>>> df[1:3]	# slice rows
   A  B  C
b  4  5  6
c  7  8  9
>>> 
>>> df
   A  B  C
a  1  2  3
b  4  5  6
c  7  8  9
>>> df[df['A']>3]	# boolean indexing rows
   A  B  C
b  4  5  6
c  7  8  9

>>> df[['A','B']]	# select columns
   A  B
a  1  2
b  4  5
c  7  8
>>> 
>>> df2=pd.DataFrame([ [1,2], [3,4] ], index=['d','e'],columns=['A','B'] )
>>> df2
   A  B
d  1  2
e  3  4
>>> 
>>> df=df.append(df2)	# addition of rows
>>> df
   A  B    C
a  1  2  3.0
b  4  5  6.0
c  7  8  9.0
d  1  2  NaN
e  3  4  NaN
>>> 
>>> df=df.drop('e')	# deletion of a row
>>> df
   A  B    C
a  1  2  3.0
b  4  5  6.0
c  7  8  9.0
d  1  2  NaN
>>> col_C=df.pop('C')	# deletion of a column
>>> df
   A  B
a  1  2
b  4  5
c  7  8
d  1  2
>>> col_C
a    3.0
b    6.0
c    9.0
d    NaN
Name: C, dtype: float64
>>> 
>>> df.insert(1,'C',[0,1,2,3])	# insertion of a column (pos,label,col_ele)
>>> df
   A  C  B
a  1  0  2
b  4  1  5
c  7  2  8
d  1  3  2
>>> df[ ['C', 'B'] ].iloc[1:3]	# combined indexing
   C  B
b  1  5
c  2  8
>>> 
>>> df.at['c','B']	# access a single value
8
  • Row selection returns a series whose index is the columns of the data frame.
  • When inserting a scalar value, it will naturally be propagated to fill the column.
  • As in the previous cases, during index mismatches new elements will have NaN values.
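
The scalar-propagation point can be seen in a two-line sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['A', 'B'])
df['C'] = 7                  # the scalar is broadcast to every row of the new column
print(df['C'].tolist())      # [7, 7]
```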

Basic operations on series:-

The series data structure supports basic mathematical operations similar to numeric ndarrays. Adding and removing elements is done with the help of simple functions. Here is a brief overview of the basic operations on series, its properties and other functionalities.

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> s=pd.Series(np.arange(10),index=list('abcdefghij'))
>>> s
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64

>>> s.values	# returns values as ndarray
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> s.axes	# return row axis labels
[Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')]
 
>>> s.head()	# return the first n elements (def=5)
a    0
b    1
c    2
d    3
e    4
dtype: int64
>>> s.tail()	# return the last n elements (def=5)
f    5
g    6
h    7
i    8
j    9
dtype: int64

>>> s.size	# return the number of elements
10
>>> s.describe()	# descriptive statistics 
count    10.00000
mean      4.50000
std       3.02765
min       0.00000
25%       2.25000
50%       4.50000
75%       6.75000
max       9.00000
dtype: float64

>>> s.sort_values(ascending=False)	# sort the values
j    9
i    8
h    7
g    6
f    5
e    4
d    3
c    2
b    1
a    0
dtype: int64

>>> s * 2	# vector operations (multiply - broadcasted)
a     0
b     2
c     4
d     6
e     8
f    10
g    12
h    14
i    16
j    18
dtype: int64
>>> s+s		# sum of series
a     0
b     2
c     4
d     6
e     8
f    10
g    12
h    14
i    16
j    18
dtype: int64
>>> np.exp(s)		# numpy-exponent (universal functions)
a       1.000000
b       2.718282
c       7.389056
d      20.085537
e      54.598150
f     148.413159
g     403.428793
h    1096.633158
i    2980.957987
j    8103.083928
dtype: float64

>>> s.sum()	# sum
45
>>> s.mean()	# mean
4.5
>>> s.median()	# median
4.5
>>> s.max()	# max value
9
>>> s.std()	# standard deviation
3.0276503540974917
>>> s.cumsum()	# cumulative sum
a     0
b     1
c     3
d     6
e    10
f    15
g    21
h    28
i    36
j    45
dtype: int64

>>> s=pd.Series(np.random.randint(0,99,10),index=list('abcdefghij'))
>>> s
a    11
b    46
c    84
d    43
e    30
f    77
g    71
h    84
i    41
j    81
dtype: int64

>>> s.nlargest(10) 	# find n largest values
c    84
h    84
j    81
f    77
g    71
b    46
d    43
i    41
e    30
a    11
dtype: int64

>>> s1=pd.Series([1,2,3,4,5])
>>> s2=pd.Series([6,7,8,9,10])

>>> s1.append(s2)
0     1
1     2
2     3
3     4
4     5
0     6
1     7
2     8
3     9
4    10
dtype: int64
>>> s1.mul(s2)	# multiply elementwise 
0     6
1    14
2    24
3    36
4    50
dtype: int64
>>> s1.dot(s2)		# inner product
130

>>> s3=pd.Series(np.random.randn(10))
>>> s3
0   -0.038673
1    0.667981
2    0.690666
3   -0.681968
4   -0.468518
5    0.102020
6    0.798626
7    0.990772
8    0.966635
9   -1.333172
dtype: float64

>>> s3.abs()		# find absolute value (element-wise)
0    0.038673
1    0.667981
2    0.690666
3    0.681968
4    0.468518
5    0.102020
6    0.798626
7    0.990772
8    0.966635
9    1.333172
dtype: float64

>>> s3=pd.Series([1,2,3,2,2,3])
>>> s3.drop_duplicates()	# remove duplicate values
0    1
1    2
2    3
dtype: int64
>>> s3.drop(5)		# remove an element (by index-label)
0    1
1    2
2    3
3    2
4    2
dtype: int64
>>> s3.replace(1,4)	# replace an element (by value)
0    4
1    2
2    3
3    2
4    2
5    3
dtype: int64
  • Methods like add, sub, mul, div etc. are element-wise binary operators for series.
  • The function describe generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
  • Series can also be passed into most NumPy methods expecting an ndarray.
  • A key difference between Series and ndarray is that operations between Series automatically align the data based on label.
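The label-alignment behaviour mentioned in the last point is worth a tiny sketch (a toy example, assuming two series with partially overlapping indexes):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels; labels present in only one series give NaN.
print(s1 + s2)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64
```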

Basic operations on data frame:-

Most of the basic data manipulations and operations on series extend to data frames. The main difference is that data frames deal with two dimensions: rows and columns. Arithmetic functions can be applied to data frames element-wise, and descriptive statistics like mean and sum can be computed along a particular axis (row- or column-wise). Other operations like transpose or dot product can be performed easily through function calls. Here is a brief overview of various data munging techniques and operations on data frames.

>>> import numpy as np
>>> import pandas as pd

>>> df=pd.DataFrame(np.arange(25).reshape(5,5), index=['a','b','c','d','e'],columns=list('ABCDE'))
>>> df
    A   B   C   D   E
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24
>>> 
>>> df.size	# return number of elements
25

>>> df.head(3)		# return first n elements (3)
    A   B   C   D   E
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
>>> df.tail(3)		# return last n elements (3)
    A   B   C   D   E
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24

>>> df.shape		# return data-frame dimension
(5, 5)
>>> df.index		# return row index 
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
>>> df.values		# return values as ndarray
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

>>> df.T		# transpose of a data frame
   a  b   c   d   e
A  0  5  10  15  20
B  1  6  11  16  21
C  2  7  12  17  22
D  3  8  13  18  23
E  4  9  14  19  24

>>> df.describe()	# summary of descriptive statistics
               A          B          C          D          E
count   5.000000   5.000000   5.000000   5.000000   5.000000
mean   10.000000  11.000000  12.000000  13.000000  14.000000
std     7.905694   7.905694   7.905694   7.905694   7.905694
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     5.000000   6.000000   7.000000   8.000000   9.000000
50%    10.000000  11.000000  12.000000  13.000000  14.000000
75%    15.000000  16.000000  17.000000  18.000000  19.000000
max    20.000000  21.000000  22.000000  23.000000  24.000000

>>> 
>>> df.sort_index(axis=1,ascending=False)	# sort by an axis
    E   D   C   B   A
a   4   3   2   1   0
b   9   8   7   6   5
c  14  13  12  11  10
d  19  18  17  16  15
e  24  23  22  21  20
>>> df.sort_values(by='A',ascending=False)	# sort by values
    A   B   C   D   E
e  20  21  22  23  24
d  15  16  17  18  19
c  10  11  12  13  14
b   5   6   7   8   9
a   0   1   2   3   4

>>> df.at['a','A']=25	# setting a value using labels
>>> df
    A   B   C   D   E
a  25   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24

>>> df.mean()	# compute mean
A    15.0
B    11.0
C    12.0
D    13.0
E    14.0
dtype: float64

>>> df * 2	# vector operations (multiply - broadcast)
    A   B   C   D   E
a  50   2   4   6   8
b  10  12  14  16  18
c  20  22  24  26  28
d  30  32  34  36  38
e  40  42  44  46  48

>>> df1=pd.DataFrame(np.random.randint(0,50,25).reshape(5,5), index=['a','b','c','d','e'],columns=list('ABCDE'))
>>> df1
    A   B   C   D   E
a  36  21  42  28  17
b  43  20  39  43  33
c  27  45  44  18   0
d  11  23  18  26  32
e   0   8  33  42  13

>>> df1.sub(df)		# subtraction (element-wise)
    A   B   C   D   E
a  11  20  40  25  13
b  38  14  32  35  24
c  17  34  32   5 -14
d  -4   7   1   8  13
e -20 -13  11  19 -11

>>> df1.apply( lambda x: x.max() -x.min() )	# apply functions over a data-frame
A    43
B    37
C    26
D    25
E    33
dtype: int64

>>> df1=pd.DataFrame([ [1,2,3], [4,5,6], [7,8,np.nan] ], index=['a','b','c'],columns=list('ABC'))
>>> df1
   A  B    C
a  1  2  3.0
b  4  5  6.0
c  7  8  NaN

>>> df1.dropna()	# remove rows with missing values
   A  B    C
a  1  2  3.0
b  4  5  6.0

>>> df1.reindex(index=['a','b'],columns=['C','B','A'])	# reindex with rows and columns
     C  B  A
a  3.0  2  1
b  6.0  5  4

>>> df1.assign( D=(df1['A']+df1['B'])/2 ) # create new column from existing
   A  B    C    D
a  1  2  3.0  1.5
b  4  5  6.0  4.5
c  7  8  NaN  7.5
  • The ‘all‘ function, when applied over a series or data frame, returns ‘True’ if all elements are True; otherwise it returns False (over an axis).
  • The ‘any‘ function, when applied over a series or data frame, returns ‘True’ if at least one element is True; otherwise it returns False (over an axis).
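A quick sketch of ‘all‘ and ‘any‘ on a small boolean data frame (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({'A': [True, True], 'B': [True, False]})

print(df.all())        # per column: A -> True, B -> False
print(df.any(axis=1))  # per row: each row has at least one True
```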

Combining and reshaping data frames:-

Pandas provides various functions for combining objects like series and data-frames. The ‘concat‘ function allows us to combine two data-frames along an axis (row or column). The ‘merge‘ function allows us to perform in-memory join operations similar to standard SQL joins. Finally, the pivot function helps us create a pivot table by spreading rows into columns.
Here are a few examples …

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> df1=pd.DataFrame( {"a": [4,5,6],
...                    "b": [7,8,9],
...                    "c": [10,11,12]},
...                     index=[1,2,3])
>>> df2=pd.DataFrame( [[4,7,10],
...                    [5,8,11],
...                    [6,9,12]],
...                    index=[1,2,3],
...                    columns=['a','b','c'])

>>> df1
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12
>>> df2
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12

>>> pd.melt(df1)	# gather columns into rows
  variable  value
0        a      4
1        a      5
2        a      6
3        b      7
4        b      8
5        b      9
6        c     10
7        c     11
8        c     12

>>> pd.concat([df1,df2])	# combine dataframe rows
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12
1  4  7  10
2  5  8  11
3  6  9  12
>>> pd.concat([df1,df2],axis=1)		# combine dataframe columns
   a  b   c  a  b   c
1  4  7  10  4  7  10
2  5  8  11  5  8  11
3  6  9  12  6  9  12

>>> df1=pd.DataFrame({"x1":['A','B','C'],"x2":[1,2,3]})
>>> df2=pd.DataFrame({"x1":['A','B','D'],"x3":['T','F','T']})

>>> df1
  x1  x2
0  A   1
1  B   2
2  C   3
>>> df2
  x1 x3
0  A  T
1  B  F
2  D  T

>>> pd.merge(df1,df2,how='left',on='x1')	# left outer join (left-keys)
  x1  x2   x3
0  A   1    T
1  B   2    F
2  C   3  NaN
>>> pd.merge(df1,df2,how='right',on='x1')	# right outer join (right-keys)
  x1   x2 x3
0  A  1.0  T
1  B  2.0  F
2  D  NaN  T
>>> pd.merge(df1,df2,how='inner',on='x1')	# inner join (intersection)
  x1  x2 x3
0  A   1  T
1  B   2  F
>>> pd.merge(df1,df2,how='outer',on='x1')	# outer join (union)
  x1   x2   x3
0  A  1.0    T
1  B  2.0    F
2  C  3.0  NaN
3  D  NaN    T


>>> df=pd.DataFrame( {"Name":['Alice','Bob','Clark','Dave','Elvis','Fred'],
...                   "Sex" :['F','M','M','M','F','M'], 
...                   "Age": [10,12,10,13,12,13]})

>>> df
   Age   Name Sex
0   10  Alice   F
1   12    Bob   M
2   10  Clark   M
3   13   Dave   M
4   12  Elvis   F
5   13   Fred   M

>>> pd.pivot_table(df,index="Name",columns="Sex") # create a pivot table by spreading rows into columns
        Age      
Sex       F     M
Name             
Alice  10.0   NaN
Bob     NaN  12.0
Clark   NaN  10.0
Dave    NaN  13.0
Elvis  12.0   NaN
Fred    NaN  13.0
  • In case of concatenation, the index labels of the inputs are preserved, so the result may contain duplicate index labels.
  • In pandas, two data-frames can also be combined using multiple keys.
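The second point can be sketched by merging on a list of key columns (a toy example; rows match only when all keys agree):

```python
import pandas as pd

left  = pd.DataFrame({'k1': ['a', 'a', 'b'], 'k2': [1, 2, 1], 'v': [10, 20, 30]})
right = pd.DataFrame({'k1': ['a', 'b', 'b'], 'k2': [1, 1, 2], 'w': [100, 200, 300]})

# Inner join on BOTH key columns; only (a,1) and (b,1) appear in both frames.
merged = pd.merge(left, right, on=['k1', 'k2'])
print(merged)
#   k1  k2   v    w
# 0  a   1  10  100
# 1  b   1  30  200
```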

Grouping

Grouping can be used either to split the data into groups or to apply some functions group-wise on the data. Further, it may be used to combine the results into a data structure. Pandas provides a groupby function to group the data and perform these operations. Often, the groupby function is used to group the data in order to apply operations like aggregation, filtration or transformation on the resulting subsets of data.

>>> import numpy as np
>>> import pandas as pd

>>> df=pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
...                       'foo', 'bar', 'foo', 'foo'],
...                  'B' : ['one', 'one', 'two', 'three',
...                        'two', 'two', 'one', 'three'],
...                  'C' : np.random.randint(0,50,8),
...                  'D' : np.random.randint(0,50,8)})
>>> df
     A      B   C   D
0  foo    one  39   7
1  bar    one  48  23
2  foo    two   9  40
3  bar  three  46  30
4  foo    two   5  31
5  bar    two   1   2
6  foo    one  33  49
7  foo  three  13  12

>>> gp1=df.groupby('A')		# group by a column
>>> gp1.groups
{'bar': Int64Index([1, 3, 5], dtype='int64'), 'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}

>>> gp2=df.groupby(['A','B'])	# group by multiple columns
>>> gp2.groups			# view groups
{('bar', 'one'): Int64Index([1], dtype='int64'), ('bar', 'three'): Int64Index([3], dtype='int64'), ('bar', 'two'): Int64Index([5], dtype='int64'), ('foo', 'one'): Int64Index([0, 6], dtype='int64'), ('foo', 'three'): Int64Index([7], dtype='int64'), ('foo', 'two'): Int64Index([2, 4], dtype='int64')}

>>> gp1.get_group('foo')	# select groups
     A      B   C   D
0  foo    one  39   7
2  foo    two   9  40
4  foo    two   5  31
6  foo    one  33  49
7  foo  three  13  12

>>> gp1.sum()			# aggregation functions
      C    D
A           
bar  95   55
foo  99  139

>>> gp2.transform(lambda x: x*2)	# transformation functions
    C   D
0  78  14
1  96  46
2  18  80
3  92  60
4  10  62
5   2   4
6  66  98
7  26  24

>>> gp3=df.groupby('B')		# group data by a column
>>> gp3.groups			# view groups
{'one': Int64Index([0, 1, 6], dtype='int64'), 'three': Int64Index([3, 7], dtype='int64'), 'two': Int64Index([2, 4, 5], dtype='int64')}

>>> gp3.filter(lambda x: len(x)>2)	# filter groups
     A    B   C   D
0  foo  one  39   7
1  bar  one  48  23
2  foo  two   9  40
4  foo  two   5  31
5  bar  two   1   2
6  foo  one  33  49
  • Grouping by multiple columns forms a hierarchical index.
  • We may pass a list or dict of functions to perform multiple aggregations at once.
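The second point can be sketched with ‘agg‘ and a list of function names (a toy frame, not the one above):

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y', 'y'], 'C': [1, 2, 3, 4]})

# A list of functions computes several aggregates per group at once.
res = df.groupby('A')['C'].agg(['sum', 'mean'])
print(res)
#    sum  mean
# A
# x    3   1.5
# y    7   3.5
```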

Missing data

In general, missing data is represented by ‘NaN‘ (Not a Number) and is not considered in calculations. It may be filled by mechanisms like padding or replacing. Another option is to drop the affected rows or columns from the data frame. Often we encounter missing data while re-indexing a data frame. Here are some examples.

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> df=pd.DataFrame(np.random.randint(10,50,9).reshape(3,3), index=['a','c','e'],columns=[1,2,3])

>>> df
    1   2   3
a  32  35  21
c  25  34  28
e  22  28  30
>>> df=df.reindex(['a','b','c'])	# reindex the data-frame

>>> df
      1     2     3
a  32.0  35.0  21.0
b   NaN   NaN   NaN
c  25.0  34.0  28.0

>>> pd.isna(df)				# get boolean mask for missing data
       1      2      3
a  False  False  False
b   True   True   True
c  False  False  False

>>> df.fillna(value=2)			# fill nan with a value (2)
      1     2     3
a  32.0  35.0  21.0
b   2.0   2.0   2.0
c  25.0  34.0  28.0

>>> df.fillna(method='pad')		# fill forward missing data
      1     2     3
a  32.0  35.0  21.0
b  32.0  35.0  21.0
c  25.0  34.0  28.0
>>> df.fillna(method='backfill')	# fill backward missing data
      1     2     3
a  32.0  35.0  21.0
b  25.0  34.0  28.0
c  25.0  34.0  28.0
>>> df.replace({32:25,35:34,21:28})	# replace data values
      1     2     3
a  25.0  34.0  28.0
b   NaN   NaN   NaN
c  25.0  34.0  28.0
>>> df
      1     2     3
a  32.0  35.0  21.0
b   NaN   NaN   NaN
c  25.0  34.0  28.0
>>> 
  • We can use ‘isnull‘ or ‘notnull‘ functions to detect missing values across rows or columns.
  • The ‘dropna‘ function may be used to drop columns or rows containing missing values.
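Both notes in one small sketch (a hypothetical frame with one NaN):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [np.nan, 3.0]})

print(df['B'].notnull())   # element-wise mask: False, True
print(df.dropna(axis=1))   # drops column B, which contains a NaN
```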

Window functions:-

Pandas provides functions like rolling, expanding and ewm for calculating window statistics. The rolling function returns a rolling object that allows summary functions to be applied to windows of length n; here window is the number of observations used for calculating the statistic, and each window has a fixed size. The expanding function returns an expanding object that allows summary functions to be applied cumulatively; its window size keeps growing, and min_periods is the minimum number of observations in a window required to produce a value. The ewm function provides exponentially weighted calculations on the series of data: it assigns weights that decay exponentially, and the decay may be specified in terms of com, span, half-life etc.

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> df=pd.DataFrame(np.arange(21).reshape(7,3), columns=['A','B','C'])
>>> df
    A   B   C
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
5  15  16  17
6  18  19  20
>>> 
>>> df.rolling(window=3).sum()		# rolling window calculations
      A     B     C
0   NaN   NaN   NaN
1   NaN   NaN   NaN
2   9.0  12.0  15.0
3  18.0  21.0  24.0
4  27.0  30.0  33.0
5  36.0  39.0  42.0
6  45.0  48.0  51.0
>>> 
>>> df.expanding(min_periods=3).sum()	# expanding transformations
      A     B     C
0   NaN   NaN   NaN
1   NaN   NaN   NaN
2   9.0  12.0  15.0
3  18.0  22.0  26.0
4  30.0  35.0  40.0
5  45.0  51.0  57.0
6  63.0  70.0  77.0

>>> df.ewm(com=0.5).mean()		#  exponential weighted functions
           A          B          C
0   0.000000   1.000000   2.000000
1   2.250000   3.250000   4.250000
2   4.846154   5.846154   6.846154
3   7.650000   8.650000   9.650000
4  10.561983  11.561983  12.561983
5  13.524725  14.524725  15.524725
6  16.509607  17.509607  18.509607
>>> 

Window functions are mainly used for finding trends within the data graphically by smoothing the curve.
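As a sketch of that smoothing idea (a synthetic noisy trend, assuming a 5-point rolling mean):

```python
import numpy as np
import pandas as pd

# A noisy upward trend, smoothed with a 5-point rolling mean.
np.random.seed(0)
s = pd.Series(np.arange(50) + np.random.randn(50) * 5)
smooth = s.rolling(window=5).mean()

# The first window-1 positions have no full window, so they are NaN;
# the rest follow the underlying trend with less jitter.
print(smooth.head())
```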

Category

Pandas has a data-type called category for categorical data like country, sex, grades etc. Such data usually takes a limited, fixed set of values and is sometimes ordered in nature. We can create categorical data using the ‘Categorical‘ function or directly by specifying the data type. A data-frame may have categorical values in its columns. The basic operations for creation, addition, deletion etc. of categorical data are briefly explained in this section.

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> grades=pd.Series(['a','b','c','d','e','f'],dtype="category")	# categorical data in series

>>> grades
0    a
1    b
2    c
3    d
4    e
5    f
dtype: category
Categories (6, object): [a, b, c, d, e, f]

>>> grades=pd.Categorical(['a','b','a','b','h','c','f','e','d'],list('abcdef'),ordered=True)	# specify category and order
>>> grades
[a, b, a, b, NaN, c, f, e, d]
Categories (6, object): [a < b < c < d < e < f]

>>> grades.categories					# show categories
Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

>>> grades.categories=['A','B','C','D','E','F']		# reset categories
>>> grades
[A, B, A, B, NaN, C, F, E, D]
Categories (6, object): [A < B < C < D < E < F]

>>> s=pd.Series(["a","b","c","d","e","f"],dtype="category")

>>> s.cat.add_categories("s")				# add a category
0    a
1    b
2    c
3    d
4    e
5    f
dtype: category
Categories (7, object): [a, b, c, d, e, f, s]

>>> s.cat.remove_categories("f")			# remove a category
0      a
1      b
2      c
3      d
4      e
5    NaN
dtype: category
Categories (5, object): [a, b, c, d, e]
>>> 

>>> df=pd.DataFrame({"id" : [1,2,3,4,5,6], "raw_grade": s})	# category in data-frames
>>> df
   id raw_grade
0   1         a
1   2         b
2   3         c
3   4         d
4   5         e
5   6         f
>>> df.dtypes
id              int64
raw_grade    category
dtype: object

>>> df["raw_grade"].cat.categories=["A","B","C","D","E","F"]	# rename categories
>>> df
   id raw_grade
0   1         A
1   2         B
2   3         C
3   4         D
4   5         E
5   6         F

>>> df["raw_grade"]=df["raw_grade"].cat.set_categories(["A","B","C","D","E","F","S"])	# add categories
>>> df["raw_grade"]
0    A
1    B
2    C
3    D
4    E
5    F
Name: raw_grade, dtype: category
Categories (7, object): [A, B, C, D, E, F, S]

>>> g=pd.Categorical(["Very Good","Good","Medium","Bad","Very Bad"],ordered=True)	# ordered categorical (default order is lexicographic)
	
>>> df=pd.DataFrame({"id":[1,2,3,4,5], "grade":g})	# data frame with categorical values

>>> df
       grade  id
0  Very Good   1
1       Good   2
2     Medium   3
3        Bad   4
4   Very Bad   5

>>> df.sort_values(by="grade")		# sort values by category order
       grade  id
3        Bad   4
1       Good   2
2     Medium   3
4   Very Bad   5
0  Very Good   1
  • Any value which is not present in the categories will be treated as ‘NaN‘.
  • Categorical data may be compared with other list like objects with similar length, other ordered categorical series and scalars.
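The comparison note can be sketched against a scalar (a toy ordered categorical, not the grades above):

```python
import pandas as pd

grades = pd.Series(pd.Categorical(['a', 'c', 'b', 'c'],
                                  categories=['a', 'b', 'c'],
                                  ordered=True))

# Ordered categoricals support comparisons with a scalar category.
print(grades > 'a')
# 0    False
# 1     True
# 2     True
# 3     True
# dtype: bool
```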

Date and Time:-

Pandas provides a function called date_range for generating sequences of dates. We can specify the number of periods and the frequency for the range of dates (e.g. hourly, monthly, quarterly etc.). It also comes with several functions to resample or convert time and date representations. Timedeltas are differences in times, expressed in days, hours, minutes etc. We can perform mathematical operations like addition or subtraction between time-series data and time-deltas.

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> t=pd.date_range('7/7/2017', periods=10,freq='D')	# create a range of dates (days)
>>> t
DatetimeIndex(['2017-07-07', '2017-07-08', '2017-07-09', '2017-07-10',
               '2017-07-11', '2017-07-12', '2017-07-13', '2017-07-14',
               '2017-07-15', '2017-07-16'],
              dtype='datetime64[ns]', freq='D')

>>> pd.date_range('1/1/2010',periods=5,freq='M')	# set monthly frequency
DatetimeIndex(['2010-01-31', '2010-02-28', '2010-03-31', '2010-04-30',
               '2010-05-31'],
              dtype='datetime64[ns]', freq='M')

>>> pd.date_range('1/1/2010',periods=5,freq='Q')	# set quarterly frequency
DatetimeIndex(['2010-03-31', '2010-06-30', '2010-09-30', '2010-12-31',
               '2011-03-31'],
              dtype='datetime64[ns]', freq='Q-DEC')

>>> ts=pd.Series(np.random.randint(0,100,len(t)), index=t)	# date-range as index
>>> ts
2017-07-07    21
2017-07-08    63
2017-07-09    39
2017-07-10    22
2017-07-11    49
2017-07-12    41
2017-07-13    77
2017-07-14    89
2017-07-15    29
2017-07-16    74
Freq: D, dtype: int64

>>> ts_utc=ts.tz_localize('UTC')	# time-zone representation
>>> ts_utc
2017-07-07 00:00:00+00:00    21
2017-07-08 00:00:00+00:00    63
2017-07-09 00:00:00+00:00    39
2017-07-10 00:00:00+00:00    22
2017-07-11 00:00:00+00:00    49
2017-07-12 00:00:00+00:00    41
2017-07-13 00:00:00+00:00    77
2017-07-14 00:00:00+00:00    89
2017-07-15 00:00:00+00:00    29
2017-07-16 00:00:00+00:00    74
Freq: D, dtype: int64
>>> ts_utc.tz_convert('US/Eastern')	# convert time-zones
2017-07-06 20:00:00-04:00    21
2017-07-07 20:00:00-04:00    63
2017-07-08 20:00:00-04:00    39
2017-07-09 20:00:00-04:00    22
2017-07-10 20:00:00-04:00    49
2017-07-11 20:00:00-04:00    41
2017-07-12 20:00:00-04:00    77
2017-07-13 20:00:00-04:00    89
2017-07-14 20:00:00-04:00    29
2017-07-15 20:00:00-04:00    74
Freq: D, dtype: int64

>>> ps=ts.to_period()		# convert to period representation
>>> ps
2017-07-07    21
2017-07-08    63
2017-07-09    39
2017-07-10    22
2017-07-11    49
2017-07-12    41
2017-07-13    77
2017-07-14    89
2017-07-15    29
2017-07-16    74
Freq: D, dtype: int64
>>> ps.to_timestamp()		# convert to timestamp representation
2017-07-07    21
2017-07-08    63
2017-07-09    39
2017-07-10    22
2017-07-11    49
2017-07-12    41
2017-07-13    77
2017-07-14    89
2017-07-15    29
2017-07-16    74
Freq: D, dtype: int64

>>> td=pd.Timedelta('3 days 2 hours 10 minutes 15 seconds')	# create a time-delta
>>> td
Timedelta('3 days 02:10:15')
>>> pd.Timedelta(2,unit='h')	# time-delta in hours
Timedelta('0 days 02:00:00')
>>> pd.Timedelta(days=2)	# time-delta in days
Timedelta('2 days 00:00:00')

>>> s=pd.Series(pd.date_range('2018-1-1',periods=4,freq='D'))     # time-delta series
>>> s
0   2018-01-01
1   2018-01-02
2   2018-01-03
3   2018-01-04
dtype: datetime64[ns]

>>> td=pd.Series([ pd.Timedelta(days=i) for i in range(4) ])	# time-delta series
>>> td
0   0 days
1   1 days
2   2 days
3   3 days
dtype: timedelta64[ns]

>>> df=pd.DataFrame({ 'A' : s, 'B' : td})	# data-frame with time-delta values
>>> df
           A      B
0 2018-01-01 0 days
1 2018-01-02 1 days
2 2018-01-03 2 days
3 2018-01-04 3 days

>>> df['C']=df['A']+df['B']		# addition of time-series and time-delta

>>> df
           A      B          C
0 2018-01-01 0 days 2018-01-01
1 2018-01-02 1 days 2018-01-03
2 2018-01-03 2 days 2018-01-05
3 2018-01-04 3 days 2018-01-07
  • Timedeltas can be both positive and negative.
  • The function pd.to_timedelta can be used to convert a scalar, array, list, or series from a recognised timedelta format into a timedelta type.
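A short sketch of pd.to_timedelta and timestamp arithmetic (toy values):

```python
import pandas as pd

# Convert strings in a recognised timedelta format into timedelta64 values.
td = pd.to_timedelta(['1 days', '2 hours', '30 minutes'])
print(td)

# Timedeltas combine directly with timestamps in arithmetic.
print(pd.Timestamp('2018-01-01') + pd.to_timedelta('36 hours'))
# 2018-01-02 12:00:00
```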

Visualization:-

Pandas provides an easy way to plot decent-looking graphs and charts with the help of its plot function. The plot method on Series and DataFrame is just a simple wrapper around plt.plot(). It supports most of the popular plotting methods like bar charts, pie charts, area charts, histograms etc.

Here is a short demo …

>>> import numpy as np
>>> import pandas as pd
>>> from matplotlib import pyplot as plt

>>> df=pd.DataFrame(np.random.randn(10,4),index=pd.date_range('2/3/2013',periods=10), columns=["A","B","C","D"])
>>> df
                   A         B         C         D
2013-02-03  0.381461  0.110156  0.178786  1.811695
2013-02-04  0.242835  0.230665  0.241341 -1.424084
2013-02-05  1.820811  0.141622  0.618545  0.517108
2013-02-06  0.867230 -0.803000 -1.265664  0.209142
2013-02-07 -1.486853  1.112373 -0.529921  0.361267
2013-02-08  0.512812 -0.409855 -0.151159  0.474379
2013-02-09  0.846552  1.532371  1.204870 -2.170190
2013-02-10 -0.196250 -0.915139 -0.804743  0.978947
2013-02-11  0.627968 -0.301073  0.973060 -1.838712
2013-02-12  0.904162  1.033145 -0.842771 -2.693227

>>> df.plot()			# plot a generic graph

>>> plt.show()


>>> df=pd.DataFrame(np.random.randint(0,100,40).reshape(10,4), columns=list('abcd'))
>>> df
    a   b   c   d
0   7  27  15  11
1  43  12  75  87
2   7  80  99  63
3  66  85  76  19
4  44  70  97  58
5  53  26   0   2
6  92  85  86   7
7   0  77   3  41
8  45   3  49  58
9  50  76  25  45

>>> df.plot.bar()		# plot a bar graph

>>> plt.show()

>>> df.plot.area()		# plot area chart

>>> plt.show()

>>> df=pd.DataFrame(np.random.rand(50,5), columns=list('ABCDE'))

>>> df.plot.scatter(x='A',y='B')	# scatter plot

>>> plt.show()


>>> df=pd.DataFrame({
...                     'length':[1.5,0.5,1.2,0.9,3],
...                     'width': [0.7,0.2,0.15,0.2,1]
...                     }, index=['pig','rabbit','duck','chicken','horse'])
>>> df
         length  width
pig         1.5   0.70
rabbit      0.5   0.20
duck        1.2   0.15
chicken     0.9   0.20
horse       3.0   1.00

>>> df.plot.hist(bins=3)	# plot a histogram

>>> plt.show()

>>> df=pd.DataFrame({'mass': [0.330,4.87,5.97],
...                     'radius': [2439.7,6051.8,6378.1]},
...                     index=['Mercury','Venus','Earth'])
>>> df
         mass  radius
Mercury  0.33  2439.7
Venus    4.87  6051.8
Earth    5.97  6378.1

>>> df.plot.pie(subplots=True, figsize=(7,4))		# plot a pie chart
array([<matplotlib.axes._subplots.AxesSubplot object at 0x...>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x...>],
      dtype=object)
>>> plt.show()

Here are the outputs of the graph plot …


Data I/O:

Pandas supports reading and writing of popular data formats like csv, hdf5 and excel with the help of its I/O functions. These functions help us parse tabular data into a data-frame object and vice-versa.

Here are a few examples …

We can either read the data from a local file or use the URL of the required file. Here is a sample csv file.

S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay Area,8300
4,Ram,38,Hyderabad,3900
>>> import numpy as np
>>> import pandas as pd
>>> 
>>> df=pd.read_csv("temp.csv")		# read from a csv file
>>> df
   S.No    Name  Age       City  Salary
0    1     Tom   28    Toronto   20000
1    2     Lee   32   HongKong    3000
2    3  Steven   43   Bay Area    8300
3    4     Ram   38  Hyderabad    3900
>>> 

>>> pd.read_csv("temp.csv",index_col=['S.No'])	# specify index column
       Name  Age       City  Salary
S.No                                
1       Tom   28    Toronto   20000
2       Lee   32   HongKong    3000
3    Steven   43   Bay Area    8300
4       Ram   38  Hyderabad    3900

>>> pd.read_csv("temp.csv",dtype={'Salary':np.float64})		# convert column data-type
   S.No    Name  Age       City   Salary
0    1     Tom   28    Toronto  20000.0
1    2     Lee   32   HongKong   3000.0
2    3  Steven   43   Bay Area   8300.0
3    4     Ram   38  Hyderabad   3900.0

>>> pd.read_csv("temp.csv",skiprows=1)		# skip rows from data
   1     Tom  28    Toronto  20000
0  2     Lee  32   HongKong   3000
1  3  Steven  43   Bay Area   8300
2  4     Ram  38  Hyderabad   3900
>>> 


>>> df=pd.DataFrame(np.random.randn(10,5),index=np.arange(10),columns=list('ABCDE'))	# a data-frame
>>> df
          A         B         C         D         E
0  0.401266  0.477357 -1.284098 -0.863570  0.250319
1  0.233670  0.855828 -0.599911  0.158783  0.794890
2  1.154666  1.334232  0.444959  0.661728 -1.347797
3  1.519328  0.999885 -0.280936  0.205009  0.635566
4 -0.952178  0.855770  0.076760 -0.541676 -1.277027
5 -1.137523  0.656057  0.764564  2.184703 -1.263009
6  2.443010 -0.204387 -0.363046 -0.363274  0.255241
7  0.144757  0.403984 -1.845777 -0.240844 -0.616303
8  1.811513 -1.182762  0.984309 -0.755539 -2.361435
9  0.795213  0.703510  1.024146 -0.193512  1.012330
>>> 
>>> df.to_csv('csv_io.csv')			# writing to a csv file
>>> pd.read_csv('csv_io.csv')			# reading from a csv file
   Unnamed: 0         A         B         C         D         E
0           0  0.401266  0.477357 -1.284098 -0.863570  0.250319
1           1  0.233670  0.855828 -0.599911  0.158783  0.794890
2           2  1.154666  1.334232  0.444959  0.661728 -1.347797
3           3  1.519328  0.999885 -0.280936  0.205009  0.635566
4           4 -0.952178  0.855770  0.076760 -0.541676 -1.277027
5           5 -1.137523  0.656057  0.764564  2.184703 -1.263009
6           6  2.443010 -0.204387 -0.363046 -0.363274  0.255241
7           7  0.144757  0.403984 -1.845777 -0.240844 -0.616303
8           8  1.811513 -1.182762  0.984309 -0.755539 -2.361435
9           9  0.795213  0.703510  1.024146 -0.193512  1.012330
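The ‘Unnamed: 0‘ column above appears because to_csv also wrote the row index. Passing index=False avoids it (a minimal sketch; the file name ‘no_index.csv‘ is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Writing with index=False skips the row index,
# so no 'Unnamed: 0' column appears on read-back.
df.to_csv('no_index.csv', index=False)

print(pd.read_csv('no_index.csv'))
#    A  B
# 0  1  3
# 1  2  4
```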

>>> df.to_hdf('hdf_io.h5','df')			# writing to a HDF5 store
>>> pd.read_hdf('hdf_io.h5','df')		# reading from a HDF5 store
          A         B         C         D         E
0  0.401266  0.477357 -1.284098 -0.863570  0.250319
1  0.233670  0.855828 -0.599911  0.158783  0.794890
2  1.154666  1.334232  0.444959  0.661728 -1.347797
3  1.519328  0.999885 -0.280936  0.205009  0.635566
4 -0.952178  0.855770  0.076760 -0.541676 -1.277027
5 -1.137523  0.656057  0.764564  2.184703 -1.263009
6  2.443010 -0.204387 -0.363046 -0.363274  0.255241
7  0.144757  0.403984 -1.845777 -0.240844 -0.616303
8  1.811513 -1.182762  0.984309 -0.755539 -2.361435
9  0.795213  0.703510  1.024146 -0.193512  1.012330

>>> df.to_excel('excel_io.xlsx',sheet_name='sheet1')				# writing to an excel file
>>> pd.read_excel('excel_io.xlsx','sheet1',index_col=None, na_values=['NA'])	# reading from an excel file
          A         B         C         D         E
0  0.401266  0.477357 -1.284098 -0.863570  0.250319
1  0.233670  0.855828 -0.599911  0.158783  0.794890
2  1.154666  1.334232  0.444959  0.661728 -1.347797
3  1.519328  0.999885 -0.280936  0.205009  0.635566
4 -0.952178  0.855770  0.076760 -0.541676 -1.277027
5 -1.137523  0.656057  0.764564  2.184703 -1.263009
6  2.443010 -0.204387 -0.363046 -0.363274  0.255241
7  0.144757  0.403984 -1.845777 -0.240844 -0.616303
8  1.811513 -1.182762  0.984309 -0.755539 -2.361435
9  0.795213  0.703510  1.024146 -0.193512  1.012330

Here is a quick video tour of the pandas library from one of its creators …

Also check out pandas Cheat Sheet

As a final project, let’s implement the solution to our SOES problem using pandas …

Here is the code – Github: SOES

Finally, we’ve come to the end of our long and tiresome journey!!!

We started off with the basics of numpy, moved on to scipy, and finally had a brief overview of the scikit-learn and pandas libraries. These tools and libraries form the foundations of scientific computing and data processing in Python. Many advanced frameworks and tools for machine learning, data science and image processing use these libraries in the back-end. Therefore, a basic knowledge of these tools and techniques will surely aid us in learning those advanced tools and frameworks, and in developing solutions for practical problems in scientific computing and data science.

Check out: SymPy, PyTables, Tableau, MKL, LAPACK, Boost, PIL, etc.

IDEs and editors: Spyder, RStudio, Jupyter Notebook, Glueviz, etc.

If you've wondered how I've embedded all this pretty-printed source code, check out the site hilite.me

References:-

  1. https://docs.scipy.org/doc/numpy-1.13.0/user/quickstart.html
  2. https://www.tutorialspoint.com/numpy/index.htm
  3. https://www.python-course.eu/numpy.php
  4. http://cs231n.github.io/python-numpy-tutorial/
  5. http://scipy.github.io/old-wiki/pages/EricsBroadcastingDoc
  6. https://matplotlib.org/users/pyplot_tutorial.html
  7. http://www.scipy-lectures.org/intro/matplotlib/matplotlib.html
  8. http://jakevdp.github.io/mpl_tutorial/tutorial_pages/tut3.html
  9. https://jakevdp.github.io/PythonDataScienceHandbook/04.07-customizing-colorbars.html
  10. http://joseph-long.com/writing/colorbars/
  11. https://pythonspot.com/matplotlib-bar-chart/
  12. http://www.degeneratestate.org/posts/2016/Oct/23/image-processing-with-numpy/
  13. http://www.scipy-lectures.org/packages/scikit-image/index.html
  14. http://www.scipy-lectures.org/advanced/image_processing/index.html#basic-manipulations
  15. https://docs.scipy.org/doc/scipy/reference/tutorial/ndimage.html
  16. https://www.tutorialspoint.com/scipy/index.htm
  17. https://www.scipy-lectures.org/
  18. http://www.cbcity.de/die-fft-mit-python-einfach-erklaert
  19. http://scikit-learn.org/stable/tutorial/basic/tutorial.html
  20. https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html
  21. https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/
  22. https://pandas.pydata.org/pandas-docs/stable/tutorials.html
  23. https://pandas.pydata.org/pandas-docs/stable/10min.html#min
  24. http://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
  25. https://www.tutorialspoint.com/python_pandas/index.html
  26. http://stackabuse.com/decision-trees-in-python-with-scikit-learn/
  27. Data Mining : Concepts And Techniques 3rd Edition, Han Kamber Pei
  28. http://scikit-learn.org/stable/modules/preprocessing.html