Differences Among A, CNAME, ALIAS, and URL records

Understanding the differences

These are the main differences:

  • The A record maps a name to one or more IP addresses when the IP addresses are known and stable.
  • The CNAME record maps a name to another name. It should only be used when there are no other records on that name.
  • The ALIAS record maps a name to another name, but can coexist with other records on that name.
  • The URL record redirects the name to the target name using the HTTP 301 status code.

Important rules:

  • The A, CNAME, and ALIAS records cause a name to resolve to an IP address. Conversely, the URL record redirects the name to a destination. The URL record is a simple and effective way to redirect one name to another, for example redirecting www.example.com to example.com.
  • The A record must point to an IP address. The CNAME and ALIAS records must point to a name.

General rules:

  • Use an A record if you manage which IP addresses are assigned to a particular machine, or if the IP addresses are fixed (this is the most common case).
  • Use a CNAME record if you want to alias one name to another name, and you don’t need other records (such as MX records for emails) for the same name.
  • Use an ALIAS record if you’re trying to alias the root domain (apex zone), or if you need other records for the same name.
  • Use the URL record if you want the name to redirect (change address) instead of resolving to a destination.
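
To make this concrete, here is a purely illustrative set of records (ALIAS and URL are provider-specific pseudo-record types rather than standard zone-file entries, so the exact syntax depends on your DNS provider):

example.com.        A      192.0.2.10                    ; name -> IP address
www.example.com.    CNAME  example.com.                  ; name -> name, no other records on www
example.com.        ALIAS  lb.hosting-provider.example.  ; name -> name, allowed at the apex alongside MX etc.
old.example.com.    URL    https://example.com/          ; provider-managed HTTP 301 redirect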

Solution For Running Out of Inodes

An inode is a data structure in Unix-like file systems that stores information about a file, except for its name and its actual contents. An inode is often described as the metadata of the data. A file in a Unix-like system is stored in two places on the disk: data blocks and an inode. The contents of the file are stored in the data blocks, while the inode holds information about that data. The inode contains the following information:

1) Mode/permission (protection)

2) Owner ID

3) Group ID

4) Size of file

5) Number of hard links to the file

6) Time last accessed

7) Time last modified

8) Time inode last modified
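
As a quick illustration, most of these fields can be read from Python with os.stat(); the path below is just an example of an existing file:

import os

st = os.stat("/etc/hostname")        # any existing file will do
print("inode number:   ", st.st_ino)
print("mode/permission:", oct(st.st_mode))
print("owner ID:       ", st.st_uid)
print("group ID:       ", st.st_gid)
print("size in bytes:  ", st.st_size)
print("hard links:     ", st.st_nlink)
print("last accessed:  ", st.st_atime)
print("last modified:  ", st.st_mtime)
print("inode changed:  ", st.st_ctime)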

 

To list the inode numbers of files in a directory, we can use the following command.

# cd /

# ls -lai

 

total 132
      2 drwxr-xr-x  24 root root   4096 Feb 26 13:31 .
      2 drwxr-xr-x  24 root root   4096 Feb 26 13:31 ..
2637825 drwxr-xr-x   2 root root   4096 Jan 14 19:02 bin
 196609 drwxr-xr-x   3 root root   4096 Feb 24 10:41 boot
      3 drwxr-xr-x  16 root root   4460 Mar  5 09:35 dev
 983041 drwxr-xr-x 206 root root  12288 Mar  5 07:45 etc
      2 drwxr-xr-x  14 root root   4096 Dec 29 09:24 home

One of the most common server-side problems is running out of disk space. You may see the message "No space left on device" or "disk is full" despite having plenty of free space on the server. If you ever face this, it is most likely because the server has run out of available inodes. Let's discuss the solution to this problem.

 

Solution

1) The first step is to check whether your server has enough free disk space. Use the following command to check the available disk space on the server.

# df -h

Filesystem     1K-blocks      Used  Available  Use%  Mounted on
/dev/xvda       33030016  10407780   22622236   32%  /
tmpfs             368748         0     368748    0%  /lib/init/rw
varrun            368748        56     368692    1%  /var/run
varlock           368748         0     368748    0%  /var/lock
udev              368748       108     368640    1%  /dev
tmpfs             368748         0     368748    0%  /dev/shm

 

2) The second step is to check the available inodes on your server. Use the following command to check the available inodes on the server.

# df -i

Filesystem     Inodes    IUsed  IFree  IUse%  Mounted on
/dev/xvda     2080768  2080768      0   100%  /
tmpfs           92187        3  92184     1%  /lib/init/rw
varrun          92187       38  92149     1%  /var/run
varlock         92187        4  92183     1%  /var/lock
udev            92187     4404  87783     5%  /dev
tmpfs           92187        1  92186     1%  /dev/shm

If the IUse% value is at or near 100%, a huge number of files is the cause of the issue.

 

3) The next step is to find those files. For that, we can use a small script that lists the directories and the number of files in them.

#  for i in /*; do echo $i; find $i |wc -l; done

From the output, identify the directory that holds a large number of files, then repeat the script for that directory, as shown below. Repeat until you find the offending directory.

#  for i in /home/*; do echo $i; find $i |wc -l; done

 

4) Once you find the directory with a large number of unwanted files, delete the unwanted files in that directory to free up inodes, for example:

#  rm -rf /home/bad_user/directory_with_lots_of_empty_files

That should solve the problem. Check the inode usage again with the df -i command and you will see the difference:

# df -i

Filesystem     Inodes   IUsed    IFree  IUse%  Mounted on
/dev/xvda     2080768  284431  1796337    14%  /
tmpfs           92187       3    92184     1%  /lib/init/rw
varrun          92187      38    92149     1%  /var/run
varlock         92187       4    92183     1%  /var/lock
udev            92187    4404    87783     5%  /dev
tmpfs           92187       1    92186     1%  /dev/shm

Moving Elasticsearch indexes with elasticdump

Elasticdump is an open-source tool which, according to its official description, has the goal of moving and saving Elasticsearch indexes.
Elasticdump works by requesting data from an input and then shipping it to an output. Both the input and the output may be either an Elasticsearch URL or a file.
To install Elasticdump, we can use an npm package or a Docker image.
For those of you wondering what npm is: it is short for Node Package Manager, and first and foremost it is an online repository hosting open-source Node.js projects.
You can install npm through:
# Ubuntu
sudo apt-get install npm

# CentOS
sudo yum install npm

With npm installed, install elasticdump with:

npm install elasticdump -g

Using elasticdump

Using elasticdump is as simple as performing something similar to:
elasticdump \
  --input={{INPUT}} \
  --output={{OUTPUT}} \
  --type={{TYPE}}
Where:
{{INPUT}} or {{OUTPUT}} can be either one of:
Elasticsearch URL: {protocol}://{host}:{port}/{index}
or
File: {FilePath}
{{TYPE}} can be one of the following: analyzer, mapping, data

Export my data – Option 1

OK. Let’s get our hands dirty and export some data.
In my case, I want to export data from a docker-daemon index and push it into a remote index.
Using Elasticsearch URL for both input and output, this is my command:
elasticdump \
  --input=http://user:password@old_node:9200/docker-daemon \
  --output=http://user:password@new_node:9200/docker-daemon \
  --type=data
If you follow the output, you will see something similar to:
Thu, 21 Sep 2017 14:40:29 GMT | starting dump
Thu, 21 Sep 2017 14:40:31 GMT | got 53 objects from source elasticsearch (offset: 0)
Thu, 21 Sep 2017 14:40:33 GMT | sent 53 objects to destination elasticsearch, wrote 53
Thu, 21 Sep 2017 14:40:33 GMT | got 0 objects from source elasticsearch (offset: 53)
Thu, 21 Sep 2017 14:40:33 GMT | Total Writes: 53
Thu, 21 Sep 2017 14:40:33 GMT | dump complete
Let’s now confirm that the index was transferred successfully:
On the target Elasticsearch node, run:
[kelson@localhost ~]# curl -u user:password localhost:9200/_cat/indices?v | grep docker-daemon
Positive output would be:
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                             Dload  Upload   Total   Spent    Left  Speed
100  2394  100  2394    0     0   245k      0 --:--:-- --:--:-- --:--:--  259k
green  open   logstash-docker-daemon        eilJdiZvSGixTNIfMwP-kw   5   2         41            0    292.3kb        292.3kb

Export my data – Option 2

In this option, we first dump the index into a file and then move it into the new node.
This can be achieved by something similar to:
elasticdump \
  --input=http://user:password@old_node:9200/docker-daemon \
  --output=/data/docker-daemon.json \
  --type=data
elasticdump \
  --input=/data/docker-daemon.json \
  --output=http://user:password@new_node:9200/docker-daemon \
  --type=data
Note that we first export the data from the index into the /data/docker-daemon.json file.
We then use this file as input to be moved into the new node.

Analyzers and Mappings

What was shown was the most basic method of moving an index from a node into a new one.
In a probable scenario when moving an index, you will want to move the index with its appropriate analyzers and field mappings.
If this is the case, you will want to move these before moving the index. This can be achieved by running the three commands in sequence, as shown:
elasticdump \
  --input=http://user:password@old_node:9200/docker-daemon \
  --output=http://user:password@new_node:9200/docker-daemon \
  --type=analyzer
elasticdump \
  --input=http://user:password@old_node:9200/docker-daemon \
  --output=http://user:password@new_node:9200/docker-daemon \
  --type=mapping
elasticdump \
  --input=http://user:password@old_node:9200/docker-daemon \
  --output=http://user:password@new_node:9200/docker-daemon \
  --type=data

Extra Options

What was shown is the basic manipulation of elasticdump.
A series of other parameters may be used depending on your requirement.
Some commonly used parameters include:
--searchBody: Useful when you do not want to export an entire index. Example: --searchBody '{"query":{"term":{"containerName": "nginx"}}}'
--limit: Indicates how many objects to move per batch operation. Defaults to 100.
--delete: Delete documents from the input as they are moved.
A full list of these options can be found on the official elasticdump page.

Cannot remove file: “Structure needs cleaning”

That is strongly indicative of file-system corruption. You should unmount, make a sector-level backup of your disk, and then run fsck to see what is up. If there is major corruption, you may later be happy that you did a sector-level backup before letting fsck tamper with the data.

How to Check for Errors on a Disk

Run fsck on the target device, using the desired options. Make sure the file system is unmounted first. This example checks the file system on /dev/sdb:

fsck /dev/sdb

Understand fsck Error Codes

The error codes that fsck returns can be understood with the following table from man7.org:

Code   Meaning
0      No errors
1      Filesystem errors corrected
2      System should be rebooted
4      Filesystem errors left uncorrected
8      Operational error
16     Usage or syntax error
32     Checking canceled by user request
128    Shared-library error

When several conditions apply, the exit code returned is the sum of these values.

Use fsck to Repair File System Errors

Use the -y option to attempt repairs automatically without prompting (use -r instead if you want to confirm each repair interactively).

This example checks all file systems except the root (-A combined with -R) and attempts to repair any errors automatically:

fsck -AR -y

To check and attempt to repair any errors on /dev/sdb, use this format:

fsck -y /dev/sdb

Saltstack and Vault integration

First install and configure vault using this tutorial:
https://apassionatechie.wordpress.com/2017/03/05/hashicorp-vault/

Use the latest version of vault.

Then install salt using the steps given here:
https://docs.saltstack.com/en/latest/topics/installation/

If you face any issues, refer to these links:
https://apassionatechie.wordpress.com/2017/07/31/salt-issues/

https://apassionatechie.wordpress.com/2017/08/03/salt-stack-formulas/

Now let’s integrate vault and salt so that we can access vault secrets from inside salt state.

    1. First let’s add some key values into our vault.
      vault write secret/ssh/user1 password="abc123"
      Then you can check it by reading: vault read secret/ssh/user1
    2. To allow salt to access your secrets, you must first create a policy as follows:
      salt-policy.hcl

      path "secret/*" {
        capabilities = ["read", "list"]
      }
      
      path "auth/*" {
        capabilities = ["read", "list","sudo","create","update","delete"]
      }
      

      You can also point to just your secret path, e.g. secret/ssh/*.
      We have added auth/* so that our token can create other tokens.

    3. Then create a new policy with the following command:
      vault policy-write salt-policy salt-policy.hcl
    4. Then we will create a token from the new salt-policy
      vault token-create -policy=salt-policy
      Save the token created.
    5. Then in the salt-master create a file:
      /etc/salt/master.d/vault.conf with the following contents:

      vault:
        url: http://127.0.0.1:8200
        auth:
          method: token
          token: xxxxxx48-xxxx-xxxx-xxxx-xxxx1xxxxc4a
        policies:
            - salt-policy
      
      

      Then create a file /etc/salt/master.d/peer_run.conf

      peer_run:
          .*:
              - vault.generate_token
      
      

      Then restart the salt-master with service salt-master restart

    6. Then execute the following command to access the secret stored in vault:
      salt '*' vault.read_secret "secret/ssh/user1"
    7. To access the secret from inside jinja:
      my-secret: {{ salt['vault'].read_secret('secret/ssh/user1', 'password') }}
      OR
      {% set supersecret = salt['vault'].read_secret('secret/ssh/user1') %}
      secrets:
          my_secret: {{ supersecret.password }}
    8. If you want to access the secret as a pillar, add the following to the salt master configuration:
      ext_pillar:
        - vault: sdb_vault path=secret/ssh/user1
      Restart the salt-master and salt-minion.
      Then access the data with the following command:
      salt '*' pillar.get 'password'
      Then refresh the pillar data with: salt '*' saltutil.refresh_pillar
    9. If your vault policy is not configured correctly, you might get an error like:
      ERROR: {'error': 'Forbidden'}
      2017-09-21 06:51:39,320 [salt.loaded.int.utils.vault][ERROR ][26333] Failed to get token from master! An error was returned: Forbidden
      2017-09-21 06:51:39,350 [salt.pillar ][ERROR ][26333] Execption caught loading ext_pillar 'vault':
      File "/usr/lib/python2.7/site-packages/salt/pillar/__init__.py", line 822, in ext_pillar
      key)
      File "/usr/lib/python2.7/site-packages/salt/pillar/__init__.py", line 765, in _external_pillar_data
      val)
      File "/usr/lib/python2.7/site-packages/salt/pillar/vault.py", line 91, in ext_pillar
      response = __utils__['vault.make_request']('GET', url)
      File "/usr/lib/python2.7/site-packages/salt/utils/vault.py", line 124, in make_request
      connection = _get_vault_connection()
      File "/usr/lib/python2.7/site-packages/salt/utils/vault.py", line 113, in _get_vault_connection
      return _get_token_and_url_from_master()
      File "/usr/lib/python2.7/site-packages/salt/utils/vault.py", line 89, in _get_token_and_url_from_master
      raise salt.exceptions.CommandExecutionError(result)
      2017-09-21 06:51:39,351 [salt.pillar ][CRITICAL][26333] Pillar render error: Failed to load ext_pillar vault: {'error': 'Forbidden'}

      Make sure you have added auth/* in the policy.

    10. If you get the following error:
      Failed to get token from master! No result returned – is the peer publish configuration correct?
      OR
      ERROR: {}
      Then make sure you have peer_run.conf created and configured.
    11. You can also access your secret with command:
      salt-call sdb.get 'sdb://vault/secret/ssh/user1?password'

 

Diamond installation on centos 7

$ yum install make rpm-build python-configobj python-setuptools
$ git clone https://github.com/python-diamond/Diamond
$ cd Diamond
$ make buildrpm
Then use the package you built like this:

$ yum localinstall --nogpgcheck dist/diamond-4.0.449-0.noarch.rpm
$ cp /etc/diamond/{diamond.conf.example,diamond.conf}
$ $EDITOR /etc/diamond/diamond.conf
# Start Diamond service via service manager.

$ service diamond start

diamond-setup -C ElasticSearchCollector
diamond-setup -C NetworkCollector

Issues

  1. failed to connect socket to '/var/run/libvirt/libvirt-sock-ro': no such file or directory

Execute: egrep '(vmx|svm)' /proc/cpuinfo

2. If the above command returns any output showing vmx or svm, your hardware supports virtualization (VT); otherwise it does not. Then install the required packages:

yum install qemu-kvm qemu-img virt-manager libvirt libvirt-python libvirt-client virt-install virt-viewer

Elasticsearch Queries

  1. Create indices
curl -XPUT 'localhost:9200/twitter?pretty' -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 2
    }
  }
}
'

2. Search

curl -XGET 'localhost:9200/sw/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "_source": ["gender", "height"]
}
'

3. Creating index and adding documents to it

curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "user": {
          "type": "nested"
        }
      }
    }
  }
}
'
curl -XPUT 'localhost:9200/my_index/my_type/1?pretty' -H 'Content-Type: application/json' -d'
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" : "Smith"
    },
    {
      "first" : "Alice",
      "last" : "White"
    }
  ]
}
'

4. Must match

curl -XGET 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last": "Smith" }}
          ]
        }
      }
    }
  }
}
'

5. Highlight

curl -XGET 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "White" }}
          ]
        }
      },
      "inner_hits": {
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}
'

6. To get all records:
curl -XGET 'localhost:9200//_search?size=100&pretty=true' -d ''

7. Match all


curl -XGET 'localhost:9200/foo/_search?size=NO_OF_RESULTS' -d '
{
  "query" : {
    "match_all" : {}
  }
}'

8. This example does a match_all and returns documents 11 through 20


curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match_all": {} },
"from": 10,
"size": 10
}
'

9. This example does a match_all and sorts the results by account balance in descending order and returns the top 10 (default size) documents


curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match_all": {} },
"sort": { "balance": { "order": "desc" } }
}
'

10. This example shows how to return two fields, account_number and balance (inside of _source), from the search


curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match_all": {} },
"_source": ["account_number", "balance"]
}
'

11. This example returns the account numbered 20


curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match": { "account_number": 20 } }
}
'

12. This example returns all accounts containing the term “mill” in the address


curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match": { "address": "mill" } }
}
'

13. This example returns all accounts containing the term “mill” or “lane” in the address


curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match": { "address": "mill lane" } }
}
'

14. This example is a variant of match (match_phrase) that returns all accounts containing the phrase “mill lane” in the address


curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": { "match_phrase": { "address": "mill lane" } }
}
'

15. This example composes two match queries and returns all accounts containing “mill” and “lane” in the address


curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
'

16. In contrast, this example composes two match queries and returns all accounts containing “mill” or “lane” in the address


curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}
'

17. This example returns all accounts of anybody who is 40 years old but doesn’t live in ID

curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
'

18. This example uses a bool query to return all accounts with balances between 20000 and 30000


curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}
'

19. To start with, this example groups all the accounts by state, and then returns the top 10 (default) states sorted by count descending (also default)

curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}
'

20. Building on the previous aggregation, let’s now sort on the average balance in descending order

curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}
'

21. This example demonstrates how we can group by age brackets (ages 20-29, 30-39, and 40-49), then by gender, and then finally get the average account balance, per age bracket, per gender

curl -XGET 'localhost:9200/bank/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 20,
            "to": 30
          },
          {
            "from": 30,
            "to": 40
          },
          {
            "from": 40,
            "to": 50
          }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            "field": "gender.keyword"
          },
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  }
}
'

22. Assuming the data consists of documents representing exams grades (between 0 and 100) of students we can average their scores with

curl -XPOST 'localhost:9200/exams/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "avg_grade" : { "avg" : { "field" : "grade" } }
    }
}
'

23. Multiply current marks with 1.2 then get the aggregate

curl -XPOST 'localhost:9200/exams/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "avg_corrected_grade" : {
            "avg" : {
                "field" : "grade",
                "script" : {
                    "lang": "painless",
                    "inline": "_value * params.correction",
                    "params" : {
                        "correction" : 1.2
                    }
                }
            }
        }
    }
}
'

24. Documents without a value in the grade field will fall into the same bucket as documents that have the value 10

curl -XPOST 'localhost:9200/exams/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "grade_avg" : {
            "avg" : {
                "field" : "grade",
                "missing": 10
            }
        }
    }
}
'

25. Type count for the balance

curl -XPOST 'localhost:9200/bank/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "type_count" : {
            "cardinality" : {
                "field" : "balance"
            }
        }
    }
}
'

26. Use of inline painless script for adding promoted value to type value

curl -XPOST 'localhost:9200/bank/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "type_promoted_count" : {
            "cardinality" : {
                "script": {
                    "lang": "painless",
                    "inline": "doc[\u0027type\u0027].value + \u0027 \u0027 + doc[\u0027promoted\u0027].value"
                }
            }
        }
    }
}
'

27. Extended stats for balance

curl -XPOST 'localhost:9200/bank/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "grades_stats" : { "extended_stats" : { "field" : "balance" } }
    }
}
'

28. Geopoint and geo centroid example

curl -XPUT 'localhost:9200/museums' -H 'Content-Type: application/json' -d'
{
    "mappings": {
        "doc": {
            "properties": {
                "location": {
                    "type": "geo_point"
                }
            }
        }
    }
}
'

curl -XPOST 'localhost:9200/museums/doc/_bulk?refresh' -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée dOrsay"}'

curl -XPOST 'localhost:9200/museums/_search?size=0' -H 'Content-Type: application/json' -d'
{
    "query" : {
        "match" : { "name" : "musée" }
    },
    "aggs" : {
        "viewport" : {
            "geo_bounds" : {
                "field" : "location",
                "wrap_longitude" : true
            }
        }
    }
}
'

curl -XPOST 'localhost:9200/museums/_search?size=0' -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "centroid" : {
            "geo_centroid" : {
                "field" : "location"
            }
        }
    }
}
'

curl -XPOST 'localhost:9200/museums/_search?size=0' -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "cities" : {
            "terms" : { "field" : "city.keyword" },
            "aggs" : {
                "centroid" : {
                    "geo_centroid" : { "field" : "location" }
                }
            }
        }
    }
}
'

29. Max balance

curl -XPOST 'localhost:9200/bank/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "max_price" : { "max" : { "field" : "balance" } }
    }
}
'

30. Min balance

curl -XPOST 'localhost:9200/sales/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
    "aggs" : {
        "min_price" : { "min" : { "field" : "price" } }
    }
}
'

31. Percentiles

{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time"
            }
        }
    }
}

32. Percentiles of values within specific bounds

curl -XPOST 'localhost:9200/bank/account/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
    "aggs": {
        "balance_outlier": {
            "percentile_ranks": {
                "field": "balance",
                "values": [25000, 50000],
                "keyed": false
            }
        }
    }
}
'

33. Sum of hat prices

{
"aggs" : {
        "hat_prices" : { "sum" : { "field" : "price" } }
    }
}

34. Sort by call_duration in descending order

curl -u elastic:changeme -XGET 'localhost:9200/index-alias2-events-2015.01.01-00/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} },
  "sort": { "call_duration": { "order": "desc" } }
}
'

The Ceres Database

Ceres is a time-series database format intended to replace Whisper as the default storage format for Graphite. In contrast with Whisper, Ceres is not a fixed-size database and is designed to better support sparse data of arbitrary fixed-size resolutions. This allows Graphite to distribute individual time-series across multiple servers or mounts.

Ceres is not actively developed at the moment. For alternatives to Whisper, look at alternative storage backends.

Storage Overview

Ceres databases are comprised of a single tree contained within a single path on disk, storing all metrics as nodes in nested directories.

A Ceres node represents a single time-series metric and is composed of at least two files: a slice that stores the data points, and an arbitrary key-value metadata file. The minimum metadata a node needs is a 'timeStep'. This setting is the finest resolution that can be used for writing. A Ceres node can, however, contain and read data with other, less precise resolutions in its underlying slice data.

Other metadata keys that may be set for compatibility with Graphite are 'retentions', 'xFilesFactor', and 'aggregationMethod'.

A Ceres slice contains the actual data points in a file. The only other information a slice holds is the timestamp of its oldest data point and its resolution, both of which are encoded as part of its filename in the format timestamp@resolution.

Data points in Ceres are stored on-disk as a contiguous list of big-endian double-precision floats. The timestamp of a datapoint is not stored with the value, rather it is calculated by using the timestamp of the slice plus the index offset of the value multiplied by the resolution.

The timestamp is the number of seconds since the UNIX Epoch (01-01-1970). The data value is parsed by the Python float() function and as such behaves in the same way for special strings such as 'inf'. Maximum and minimum values are determined by the Python interpreter’s allowable range for float values which can be found by executing:

python -c 'import sys; print sys.float_info'
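
To make the layout concrete, here is a minimal sketch (not part of Ceres itself) that reads a slice file and reconstructs the timestamps; the slice file name below is hypothetical and follows the timestamp@resolution convention described above:

import struct

slice_timestamp = 1393329060          # taken from the slice file name (hypothetical)
resolution = 60                       # seconds per data point, also from the file name

with open("1393329060@60.slice", "rb") as f:
    raw = f.read()

# Each data point is a big-endian double ("!d"), i.e. 8 bytes
points = struct.unpack("!%dd" % (len(raw) // 8), raw)
for index, value in enumerate(points):
    timestamp = slice_timestamp + index * resolution
    print(timestamp, value)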

Slices: Precision and Fragmentation

Ceres databases contain one or more slices, each with a specific data resolution and a timestamp to mark the beginning of the slice. Slices are ordered from the most recent timestamp to the oldest. The resolution of the data is not considered when reading from a slice; it only matters when writing, which requires that a slice with the finest precision configured for the node exists.

Gaps in data are handled in Ceres by padding slices with null datapoints. If the slice gap however is too big, then a new slice is instead created. If a Ceres node accumulates too many slices, read performance can suffer. This can be caused by intermittently reported data. To mitigate slice fragmentation there is a tolerance for how much space can be wasted within a slice file to avoid creating a new one. That tolerance level is determined by 'MAX_SLICE_GAP', which is the number of consecutive null datapoints allowed in a slice file.

If set very low, Ceres will waste less of the small amount of disk space that this feature costs, but it will be prone to performance problems caused by slice fragmentation, which can be quite severe.

If set really high, Ceres will waste a bit more disk space. Although each null datapoint wastes 8 bytes, you must keep in mind your filesystem’s block size. If you suffer slice fragmentation issues, you should increase this or defrag the data more often. However you should not set it to be huge because then if a large but allowed gap occurs it has to get filled in, which means instead of a simple 8-byte write to a new file we could end up doing an (8 * MAX_SLICE_GAP)-byte write to the latest slice.

Rollup Aggregation

Expected features such as roll-up aggregation and data expiration are not provided by Ceres itself, but instead are implemented as maintenance plugins.

Such a rollup plugin exists for Ceres; it aggregates data points in a way similar to the behavior of Whisper archives: multiple data points are collapsed and written to a lower-precision slice, and data points outside of the configured slice retentions are trimmed. By default, an average function is used; alternative methods can be chosen by changing the metadata.

Retrieval Behavior

When data is retrieved (scoped by a time range), the first slice which has data within the requested interval is used. If the time period overlaps a slice boundary, then both slices are read, with their values joined together. Any missing data between them are filled with null data points.

There is currently no support in Ceres for handling slices with mixed resolutions in the same way that is done with Whisper archives.

Database Format

CeresSlice
  Data
    Data Point+

Data types in Python's struct format:

Point  !d

Metadata for Ceres is stored in JSON format:

{"retentions": [[30, 1440]], "timeStep": 30, "xFilesFactor": 0.5, "aggregationMethod": "average"}

Graphite and your data

Getting your data into Graphite is very flexible. There are three main methods for sending data to Graphite: Plaintext, Pickle, and AMQP.

It’s worth noting that data sent to Graphite is actually sent to Carbon (or Carbon-Relay), which then manages the data. The Graphite web interface reads this data back out, either from cache or straight off disk.

Choosing the right transfer method for you is dependent on how you want to build your application or script to send data:

  • There are some tools and APIs which can help you get your data into Carbon.
  • For a singular script, or for test data, the plaintext protocol is the most straightforward method.
  • For sending large amounts of data, you’ll want to batch this data up and send it to Carbon’s pickle receiver.
  • Finally, Carbon can listen to a message bus, via AMQP.

The plaintext protocol

The plaintext protocol is the most straightforward protocol supported by Carbon.

The data sent must be in the following format: <metric path> <metric value> <metric timestamp>. Carbon will then help translate this line of text into a metric that the web interface and Whisper understand.

On Unix, the nc program (netcat) can be used to create a socket and send data to Carbon (by default, ‘plaintext’ runs on port 2003):

If you use the OpenBSD implementation of netcat, please follow this example:

PORT=2003
SERVER=graphite.your.org
echo "local.random.diceroll 4 `date +%s`" | nc -q0 ${SERVER} ${PORT}

The -q0 parameter instructs nc to close the socket once the data is sent. Without this option, some nc versions would keep the connection open.

If you use the GNU implementation of netcat, please follow this example:

PORT=2003
SERVER=graphite.your.org
echo "local.random.diceroll 4 `date +%s`" | nc -c ${SERVER} ${PORT}

The -c parameter instructs nc to close the socket once the data is sent. Without this option, nc will keep the connection open and won’t exit.
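
If you would rather send plaintext metrics from a script than from nc, a minimal Python sketch looks like this (the server name is a placeholder):

import socket
import time

CARBON_HOST = "graphite.your.org"     # placeholder host
CARBON_PORT = 2003                    # Carbon's default plaintext port

# <metric path> <metric value> <metric timestamp>\n
line = "local.random.diceroll 4 %d\n" % int(time.time())

sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
try:
    sock.sendall(line.encode("ascii"))
finally:
    sock.close()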

The pickle protocol

The pickle protocol is a much more efficient take on the plaintext protocol, and supports sending batches of metrics to Carbon in one go.

The general idea is that the pickled data forms a list of multi-level tuples:

[(path, (timestamp, value)), ...]

Once you’ve formed a list of sufficient size (don’t go too big!), and pickled it (if your client is running a more recent version of python than your server, you may need to specify the protocol) send the data over a socket to Carbon’s pickle receiver (by default, port 2004). You’ll need to pack your pickled data into a packet containing a simple header:

payload = pickle.dumps(listOfMetricTuples, protocol=2)
header = struct.pack("!L", len(payload))
message = header + payload

You would then send the message object through a network socket.
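
Putting those pieces together, a minimal sketch of a pickle sender could look like this (the host name and the metric are placeholders):

import pickle
import socket
import struct
import time

CARBON_HOST = "graphite.your.org"     # placeholder host
CARBON_PORT = 2004                    # Carbon's default pickle receiver port

# [(path, (timestamp, value)), ...]
listOfMetricTuples = [
    ("local.random.diceroll", (int(time.time()), 4)),
]

payload = pickle.dumps(listOfMetricTuples, protocol=2)
header = struct.pack("!L", len(payload))
message = header + payload

sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
try:
    sock.sendall(message)
finally:
    sock.close()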

Using AMQP

When AMQP_METRIC_NAME_IN_BODY is set to True in your carbon.conf file, the data should be in the same format as the plaintext protocol, e.g. echo "local.random.diceroll 4 `date +%s`". When AMQP_METRIC_NAME_IN_BODY is set to False, you should omit the metric name ('local.random.diceroll') from the message body.

Getting Your Data Into Graphite

The Basic Idea

Graphite is useful if you have some numeric values that change over time and you want to graph them. Basically, you write a program that collects these numeric values and sends them to Graphite’s backend, Carbon.

Step 1 – Plan a Naming Hierarchy

Everything stored in graphite has a path with components delimited by dots. So for example, website.orbitz.bookings.air or something like that would represent the number of air bookings on orbitz. Before producing your data you need to decide what your naming scheme will be. In a path such as “foo.bar.baz”, each thing surrounded by dots is called a path component. So “foo” is a path component, as well as “bar”, etc.

Each path component should have a clear and well-defined purpose. Volatile path components should be kept as deep into the hierarchy as possible.

Step 2 – Configure your Data Retention

Graphite is built on fixed-size databases (see Whisper), so we have to configure in advance how much data we intend to store and at what level of precision. For instance, you could store your data with 1-minute precision (meaning you will have one data point for each minute) for, say, 2 hours. Additionally, you could store your data with 10-minute precision for 2 weeks, etc. The idea is that the storage cost is determined by the number of data points you want to store; the coarser your precision, the more time you can cover with fewer points. To determine the best retention configuration, you must answer all of the following questions.

  1. How often can you produce your data?
  2. What is the finest precision you will require?
  3. How far back will you need to look at that level of precision?
  4. What is the coarsest precision you can use?
  5. How far back would you ever need to see data? (yes it has to be finite, and determined ahead of time)

Once you have picked your naming scheme and answered all of the retention questions, you need to create a schema by creating/editing the /opt/graphite/conf/storage-schemas.conf file.

The format of the schemas file is easiest to demonstrate with an example. Let’s say we’ve written a script to collect system load data from various servers, the naming scheme will be like so:

servers.HOSTNAME.METRIC

Where HOSTNAME will be the server’s hostname and METRIC will be something like cpu_load, mem_usage, open_files, etc. Also let’s say we want to store this data with minutely precision for 30 days, then at 15 minute precision for 10 years.

For details of implementing your schema, see the Configuring Carbon document.
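
As a rough illustration only (the section name below is arbitrary and the exact syntax is covered in the Configuring Carbon document), a storage-schemas.conf entry matching this scheme could look like:

[server_load]
pattern = ^servers\.
retentions = 1m:30d,15m:10y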

Basically, when carbon receives a metric, it determines where on the filesystem the whisper data file should be for that metric. If the data file does not exist, carbon knows it has to create it, but since whisper is a fixed-size database, some parameters must be determined at the time of file creation (this is the reason we’re making a schema). Carbon looks at the schemas file, and in order of priority (highest to lowest) looks for the first schema whose pattern matches the metric name. If no schema matches, the default schema (2 hours of minutely data) is used. Once the appropriate schema is determined, carbon uses the retention configuration for the schema to create the whisper data file appropriately.

Step 3 – Understanding the Graphite Message Format

Graphite understands messages with this format:

metric_path value timestamp\n

metric_path is the metric namespace that you want to populate.

value is the value that you want to assign to the metric at this time.

timestamp is the number of seconds since unix epoch time.

A simple example of doing this from the unix terminal would look like this:

echo "test.bash.stats 42 `date +%s`" | nc graphite.example.com 2003

There are many tools that interact with Graphite. See the Tools page for some choices of tools that may be used to feed Graphite.

How Machines Make Sense of Big Data: an Introduction to Clustering Algorithms

Take a look at the image below. It’s a collection of bugs and creepy-crawlies of different shapes and sizes. Take a moment to categorize them by similarity into a number of groups.

This isn’t a trick question. Start with grouping the spiders together.

Images via Google Image Search, labelled for reuse

Done? While there’s not necessarily a “correct” answer here, it’s most likely you split the bugs into four clusters. The spiders in one cluster, the pair of snails in another, the butterflies and moth into one, and the trio of wasps and bees into one more.

That wasn’t too bad, was it? You could probably do the same with twice as many bugs, right? If you had a bit of time to spare — or a passion for entomology — you could probably even do the same with a hundred bugs.

For a machine though, grouping ten objects into however many meaningful clusters is no small task, thanks to a mind-bending branch of maths called combinatorics, which tells us that there are 115,975 different possible ways you could have grouped those ten insects together. Had there been twenty bugs, there would have been over fifty trillion possible ways of clustering them.

With a hundred bugs — there’d be many times more solutions than there are particles in the known universe. How many times more? By my calculation, approximately five hundred million billion billion times more. In fact, there are more than four million billion googol solutions (what’s a googol?). For just a hundred objects.

Almost all of those solutions would be meaningless — yet from that unimaginable number of possible choices, you pretty quickly found one of the very few that clustered the bugs in a useful way.

We humans take for granted how good we are at categorizing and making sense of large volumes of data quickly. Whether it’s a paragraph of text, or images on a screen, or a sequence of objects — humans are generally fairly efficient at making sense of whatever data the world throws at us.

Given that a key aspect of developing A.I. and Machine Learning is getting machines to quickly make sense of large sets of input data, what shortcuts are available? Here, you can read about three clustering algorithms that machines can use to quickly make sense of large datasets. This is by no means an exhaustive list — there are other algorithms out there — but they represent a good place to start!

You’ll find for each a quick summary of when you might use them, a brief overview of how they work, and a more detailed, step-by-step worked example. I believe it helps to understand an algorithm by actually carrying it out yourself. If you’re really keen, you’ll find the best way to do this is with pen and paper. Go ahead — nobody will judge!

Three suspiciously neat clusters, with K = 3

K-means clustering

Use when:

…you have an idea of how many groups you’re expecting to find a priori.

How it works:

The algorithm randomly assigns each observation into one of k categories, then calculates the mean of each category. Next, it reassigns each observation to the category with the closest mean before recalculating the means. This step repeats over and over until no more reassignments are necessary.

Worked Example:

Take a group of nine football (or ‘soccer’) players who have each scored a certain number of goals this season (say in the range 3–30). Let’s divide them into separate clusters — say three.

Step 1 requires us to randomly split the players into three groups and calculate the means of each.

Group 1
Player A (5 goals), Player B (20 goals), Player C (11 goals) 
Group Mean = (5 + 20 + 11) / 3 = 12
Group 2
Player D (5 goals), Player E (9 goals), Player F (19 goals)
Group Mean = 11
Group 3
Player G (30 goals), Player H (3 goals), Player I (15 goals)
Group Mean = 16

Step 2: For each player, reassign them to the group with the closest mean. E.g., Player A (5 goals) is assigned to Group 2 (mean = 11). Then recalculate the group means.

Group 1 (Old Mean = 12)
Player C (11 goals), Player E (9 goals)
New Group Mean = (11 + 9) / 2 = 10
Group 2 (Old Mean = 11)
Player A (5 goals), Player D (5 goals), Player H (3 goals)
New Group Mean = 4.33
Group 3 (Old Mean = 16)
Player B (20 goals), Player F (19 goals), Player G (30 goals), Player I (15 goals)
New Group Mean = 21

Repeat Step 2 over and over until the group means no longer change. For this somewhat contrived example, this happens on the next iteration. Stop! You have now formed three clusters from the dataset!

Group 1 (Old Mean = 10)
Player C (11 goals), Player E (9 goals), Player I (15 goals)
Final Group Mean = 11.67
Group 2 (Old Mean = 4.33)
Player A (5 goals), Player D (5 goals), Player H (3 goals)
Final Group Mean = 4.33
Group 3 (Old Mean = 21)
Player B (20 goals), Player F (19 goals), Player G (30 goals)
Final Group Mean = 23

With this example, the clusters could correspond to the players’ positions on the field — such as defenders, midfielders and attackers. K-means works here because we could have reasonably expected the data to fall naturally into these three categories.

In this way, given data on a range of performance statistics, a machine could do a reasonable job of estimating the positions of players from any team sport — useful for sports analytics, and indeed any other purpose where classification of a dataset into predefined groups can provide relevant insights.
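
For the curious, here is a minimal sketch of this 1-D k-means procedure in Python, using the goal counts from the worked example above (this is an illustration of the idea, not how a production clustering library implements it):

import random

# Goal counts for players A-I from the worked example
goals = {"A": 5, "B": 20, "C": 11, "D": 5, "E": 9, "F": 19, "G": 30, "H": 3, "I": 15}
k = 3

# Step 1: randomly assign each player to one of k groups
assignment = {player: random.randrange(k) for player in goals}

while True:
    # Calculate the mean of each group (empty groups are effectively ignored via +inf)
    means = []
    for g in range(k):
        members = [goals[p] for p, grp in assignment.items() if grp == g]
        means.append(sum(members) / len(members) if members else float("inf"))

    # Step 2: reassign every player to the group with the closest mean
    new_assignment = {
        p: min(range(k), key=lambda g: abs(goals[p] - means[g])) for p in goals
    }

    if new_assignment == assignment:      # stop once no reassignments are needed
        break
    assignment = new_assignment

print(assignment)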

Finer details:

There are several variations on the algorithm described here. The initial method of ‘seeding’ the clusters can be done in one of several ways. Here, we randomly assigned every player into a group, then calculated the group means. This causes the initial group means to tend towards being similar to one another, which ensures greater repeatability.

An alternative is to seed the clusters with just one player each, then start assigning players to the nearest cluster. The returned clusters are more sensitive to the initial seeding step, reducing repeatability in highly variable datasets. However, this approach may reduce the number of iterations required to complete the algorithm, as the groups will take less time to diverge.

An obvious limitation to K-means clustering is that you have to provide a priori assumptions about how many clusters you’re expecting to find. There are methods to assess the fit of a particular set of clusters. For example, the Within-Cluster Sum-of-Squares is a measure of the variance within each cluster. The ‘better’ the clusters, the lower the overall WCSS.

Hierarchical clustering

Use when:

…you wish to uncover the underlying relationships between your observations.

How it works:

A distance matrix is computed, where the value of cell (i, j) is a distance metric between observations i and j. Then, pair the closest two observations and calculate their average. Form a new distance matrix, merging the paired observations into a single object. From this distance matrix, pair up the closest two observations and calculate their average. Repeat until all observations are grouped together.

Worked example:

Here’s a super-simplified dataset about a selection of whale and dolphin species. As a trained biologist, I can assure you we normally use much more detailed datasets for things like reconstructing phylogeny. For now though, we’ll just look at the typical body lengths for these six species. We’ll be using just two repeated steps.

Species          Initials  Length(m)
Bottlenose Dolphin     BD        3.0
Risso's Dolphin        RD        3.6
Pilot Whale            PW        6.5
Killer Whale           KW        7.5
Humpback Whale         HW       15.0
Fin Whale              FW       20.0

Step 1: compute a distance matrix between each species. Here, we’ll use the Euclidean distance — how far apart are the data points? Read this exactly as you would a distance chart in a road atlas. The difference in length between any pair of species can be looked up by reading the value at the intersection of the relevant row and column.

    BD   RD   PW   KW   HW
RD  0.6                    
PW  3.5  2.9               
KW  4.5  3.9  1.0          
HW 12.0 11.4  8.5  7.5     
FW 17.0 16.4 13.5 12.5  5.0

Step 2: Pair up the two closest species. Here, this will be the Bottlenose & Risso’s Dolphins, with an average length of 3.3m.

Repeat Step 1 by recalculating the distance matrix, but this time merge the Bottlenose & Risso’s Dolphins into a single object with length 3.3m.

    [BD, RD]   PW   KW   HW
PW       3.2               
KW       4.2   1.0          
HW      11.7   8.5  7.5     
FW      16.7  13.5 12.5  5.0

Next, repeat Step 2 with this new distance matrix. Here, the smallest distance is between the Pilot & Killer Whales, so we pair them up and take their average — which gives us 7.0m.

Then, we repeat Step 1 — recalculate the distance matrix, but now we’ve merged the Pilot & Killer Whales into a single object of length 7.0m.

         [BD, RD] [PW, KW]   HW
[PW, KW]      3.7              
HW           11.7      8.0     
FW           16.7     13.0   5.0

Next, we repeat Step 2 with this distance matrix. The smallest distance (3.7m) is between the two merged objects — so now we merge them into an even bigger object, and take the average (which is 5.2m).

Then, we repeat Step 1 and compute a new distance matrix, having merged the Bottlenose & Risso’s Dolphins with the Pilot & Killer Whales.

   [[BD, RD] , [PW, KW]]    HW
HW                   9.8    
FW                  14.8   5.0

Next, we repeat Step 2. The smallest distance (5.0m) is between the Humpback & Fin Whales, so we merge them into a single object, and take the average (17.5m).

Then, it’s back to Step 1 — compute the distance matrix, having merged the Humpback & Fin Whales.

         [[BD, RD] , [PW, KW]]
[HW, FW]                  12.3

Finally, we repeat Step 2 — there is only one distance (12.3m) in this matrix, so we pair everything into one big object, and now we can stop! Let’s look at the final merged object:

[[[BD, RD],[PW, KW]],[HW, FW]]

It has a nested structure (think JSON), which allows it to be drawn up as a tree-like graph, or dendrogram. It reads in much the same way a family tree might. The nearer two observations are on the tree, the more similar or closely-related they are taken to be.

A no-frills dendrogram generated at R-Fiddle.org

The structure of the dendrogram gives us insight into how our dataset is structured. In our example, we see two main branches, with Humpback Whale and Fin Whale on one side, and the Bottlenose Dolphin/Risso’s Dolphin and Pilot Whale/Killer Whale on the other.

In evolutionary biology, much larger datasets with many more specimens and measurements are used in this way to infer taxonomic relationships between them. Outside of biology, hierarchical clustering has applications in Data Mining and Machine Learning contexts.

The cool thing is that this approach requires no assumptions about the number of clusters you’re looking for. You can split the returned dendrogram into clusters by “cutting” the tree at a given height. This height can be chosen in a number of ways, depending on the resolution at which you wish to cluster the data.

For instance, looking at the dendrogram above, if we draw a horizontal line at height = 10, we’d intersect the two main branches, splitting the dendrogram into two sub-graphs. If we cut at height = 2, we’d be splitting the dendrogram into four clusters.
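
If you would like to reproduce this kind of analysis without doing the arithmetic by hand, here is a small sketch using SciPy; the body lengths come from the table above, and 'average' linkage roughly matches the merge-by-means approach used in the worked example:

from scipy.cluster.hierarchy import linkage, fcluster

# One observation (typical body length in metres) per species
labels = ["BD", "RD", "PW", "KW", "HW", "FW"]
lengths = [[3.0], [3.6], [6.5], [7.5], [15.0], [20.0]]

# Agglomerative clustering; Z records every merge and the height at which it happened
Z = linkage(lengths, method="average")

# "Cut" the tree at height 10 to recover flat clusters
clusters = fcluster(Z, t=10, criterion="distance")
print(dict(zip(labels, clusters)))    # dolphins + small whales in one group, large whales in the other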

Finer details:

There are essentially three aspects in which hierarchical clustering algorithms can vary from the one given here.

Most fundamental is the approach — here, we have used an agglomerative process, whereby we start with individual data points and iteratively cluster them together until we’re left with one large cluster. An alternative (but more computationally intensive) approach is to start with one giant cluster, and then proceed to divide the data into smaller and smaller clusters until you’re left with isolated data points.

There are also a range of methods that can be used to calculate the distance matrices. For many purposes, the Euclidean distance (think Pythagoras’ Theorem) will suffice, but there are alternatives that may be more applicable in some circumstances.

Finally, the linkage criterion can also vary. Clusters are linked according to how close they are to one another, but the way in which we define ‘close’ is flexible. In the example above, we measured the distances between the means (or ‘centroids’) of each group and paired up the nearest groups. However, you may want to use a different definition.

For example, each cluster is made up of several discrete points. We could define the distance between two clusters to be the minimum (or maximum) distance between any of their points — as illustrated in the figure below. There are still other ways of defining the linkage criterion, which may be suitable in different contexts.

Red/Blue: centroid linkage; Red/Green: minimum linkage; Green/Blue: maximum linkage

Graph Community Detection

Use when:

…you have data that can be represented as a network, or ‘graph’.

How it works:

A graph community is very generally defined as a subset of vertices which are more connected to each other than to the rest of the network. Various algorithms exist to identify communities, based upon more specific definitions. Algorithms include, but are not limited to: Edge Betweenness, Modularity-Maximisation, Walktrap, Clique Percolation, Leading Eigenvector…

Worked example:

Graph theory, or the mathematical study of networks, is a fascinating branch of mathematics that lets us model complex systems as an abstract collection of ‘dots’ (or vertices) connected by ‘lines’ (or edges).

Perhaps the most intuitive case-studies are social networks. Here, the vertices represent people, and edges connect vertices who are friends/followers. However, any system can be modelled as a network if you can justify a method to meaningfully connect different components. Some of the more innovative applications of graph theory to clustering include feature extraction from image data and analysing gene regulatory networks.

As an entry-level example, take a look at this quickly put-together graph. It shows the eight websites I most recently visited, linked according to whether their respective Wikipedia articles link out to one another. You could assemble this data manually, but for larger-scale projects, it’s much quicker to write a Python script to do the same. Here’s one I wrote earlier.

Graph plotted with ‘igraph’ package for R version 3.3.3

The vertices are colored according to their community membership, and sized according to their centrality. See how Google and Twitter are the most central?

Also, the clusters make pretty good sense in the real-world (always an important performance indicator). The yellow vertices are generally reference/look-up sites; the blue vertices are all used for online publishing (of articles, tweets, or code); and the red vertices include YouTube, which was of course founded by former PayPal employees. Not bad deductions for a machine!

Aside from being a useful way to visualize large systems, the real power of networks comes from their mathematical analysis. Let’s start by translating our nice picture of the network into a more mathematical format. Below is the adjacency matrix of the network.

         GH Gl  M  P  Q  T  W  Y
GitHub    0  1  0  0  0  1  0  0  
Google    1  0  1  1  1  1  1  1
Medium    0  1  0  0  0  1  0  0
PayPal    0  1  0  0  0  1  0  1
Quora     0  1  0  0  0  1  1  0
Twitter   1  1  1  1  1  0  0  1
Wikipedia 0  1  0  0  1  0  0  0
YouTube   0  1  0  1  0  1  0  0

The value at the intersection of each row and column records whether there is an edge between that pair of vertices. For instance, there is an edge between Medium and Twitter (surprise, surprise!), so the value where their rows/columns intersect is 1. Similarly, there is no edge between Medium and PayPal, so the intersection of their rows/columns returns 0.

Encoded within the adjacency matrix are all the properties of this network — it gives us the key to start unlocking all manner of valuable insights. For a start, summing any column (or row) gives you the degree of each vertex — i.e., how many others it is connected to. This is commonly denoted with the letter k.

Likewise, summing the degrees of every vertex and dividing by two gives you L, the number of edges (or ‘links’) in the network. The number of rows/columns gives us N, the number of vertices (or ‘nodes’) in the network.
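
To make these quantities concrete, here's a short sketch (my own, not from the article) that stores the adjacency matrix above as a NumPy array and reads off k, L and N:

import numpy as np

# Adjacency matrix in the same order as the table above:
# GitHub, Google, Medium, PayPal, Quora, Twitter, Wikipedia, YouTube
A = np.array([
    [0, 1, 0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 0],
])

k = A.sum(axis=1)   # degree of each vertex (sum along each row)
L = k.sum() // 2    # number of edges: sum of all degrees, divided by two
N = A.shape[0]      # number of vertices: number of rows (or columns)

print(k)            # [2 7 2 3 3 6 2 3] -- GitHub has degree 2, Google 7, and so on
print(L, N)         # 14 edges, 8 vertices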

Knowing just k, L, N and the value of each cell in the adjacency matrix A lets us calculate the modularity of any given clustering of the network.

Say we’ve clustered the network into a number of communities. We can use the modularity score to assess the ‘quality’ of this clustering. A higher score will show we’ve split the network into ‘accurate’ communities, whereas a low score suggests our clusters are more random than insightful. The image below illustrates this.

Modularity serves as a measure of the ‘quality’ of a partition.

Modularity can be calculated using the formula below:

M = (1 / 2L) Σ_{i,j=1…N} [ A_ij − ( k_i k_j ) / 2L ] 𝛿(c_i, c_j)

That’s a fair amount of math, but we can break it down bit by bit and it’ll make more sense.

M is of course what we’re calculating — modularity.

1/2L tells us to divide everything that follows by 2L, i.e., twice the number of edges in the network. So far, so good.

The Σ symbol tells us we’re summing up everything to the right, and lets us iterate over every row and column in the adjacency matrix A. For those unfamiliar with sum notation, the i, j = 1 and the N work much like nested for-loops in programming. In Python, you’d write it as follows:

total = 0                 # running total ('sum' is a Python built-in, so we avoid that name)
for i in range(N):        # loop over every row of A (Python indexes from 0)
    for j in range(N):    # loop over every column of A
        ans = ...         #stuff with i and j as indices
        total += ans      # add each term to the running total

So what is #stuff with i and j in more detail?

Well, the bit in brackets tells us to subtract ( k_i k_j ) / 2L from A_ij.

A_ij is simply the value in the adjacency matrix at row i, column j.

The values of k_i and k_j are the degrees of each vertex — found by adding up the entries in row i and column j respectively. Multiplying these together and dividing by 2L gives us the expected number of edges between vertices i and j if the network were randomly shuffled up.

Overall, the term in the brackets reveals the difference between the network’s real structure and the expected structure it would have if randomly reassembled. Playing around with the values shows that it returns its highest value when A_ij = 1, and ( k_i k_j ) / 2L is low. This means we see a higher value if there is an ‘unexpected’ edge between vertices i and j.

Finally, we multiply the bracketed term by whatever the last few symbols refer to.

The 𝛿c_i, c_j is the fancy-sounding but totally harmless Kronecker-delta function. Here it is, explained in Python:

def Kronecker_Delta(ci, cj):
    if ci == cj:
        return 1
    else:
        return 0

Kronecker_Delta("A", "A")    # returns 1
Kronecker_Delta("A", "B")    # returns 0

Yes — it really is that simple. The Kronecker-delta function takes two arguments, and returns 1 if they are identical, otherwise, zero.

This means that if vertices i and j have been put in the same cluster, then 𝛿c_i, c_j = 1. Otherwise, if they are in different clusters, the function returns zero.

As we are multiplying the bracketed term by this Kronecker-delta function, we find that for the nested sum Σ, the outcome is highest when there are lots of ‘unexpected’ edges connecting vertices assigned to the same cluster. As such, modularity is a measure of how well-clustered the graph is into separate communities.

Dividing by 2L bounds the upper value of modularity at 1. Modularity scores near to or below zero indicate the current clustering of the network is really no use. The higher the modularity, the better the clustering of the network into separate communities. By maximising modularity, we can find the best way of clustering the network.
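
Pulling those pieces together, a rough sketch of the whole calculation might look like the function below. This is my own illustration of the formula rather than code from the article; communities is assumed to be a list giving each vertex's cluster label.

import numpy as np

def modularity(A, communities):
    # A: adjacency matrix; communities[i]: cluster label of vertex i
    k = A.sum(axis=1)   # degree of every vertex
    L = k.sum() / 2.0   # number of edges in the network
    N = A.shape[0]      # number of vertices
    M = 0.0
    for i in range(N):
        for j in range(N):
            if communities[i] == communities[j]:            # the Kronecker delta
                M += A[i, j] - (k[i] * k[j]) / (2 * L)      # actual minus expected edges
    return M / (2 * L)

# Hypothetical usage, with A as defined in the earlier sketch and an arbitrary,
# purely illustrative assignment of the eight websites to three clusters:
# print(modularity(A, [0, 0, 1, 2, 1, 0, 1, 2]))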

Notice that we have to pre-define how the graph is clustered to find out how ‘good’ that clustering actually is. Unfortunately, employing brute force to try out every possible way of clustering the graph to find which has the highest modularity score would be computationally impossible beyond a very limited sample size.

Combinatorics tells us that for a network of just eight vertices, there are 4140 different ways of clustering them. A network twice the size would have over ten billion possible ways of clustering the vertices. Doubling the network again (to a very modest 32 vertices) would give 128 septillion possible ways, and a network of eighty vertices would be cluster-able in more ways than there are atoms in the observable universe.
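
For anyone curious where numbers like 4140 come from, these counts are the Bell numbers, and they can be generated with a few lines of Python (a sketch of my own, using the Bell-triangle recurrence):

def bell(n):
    # Number of ways to partition a set of n elements (Bell number),
    # built up row by row with the Bell triangle.
    row = [1]                   # B(1) = 1
    for _ in range(n - 1):
        prev = row
        row = [prev[-1]]        # each new row starts with the last entry of the previous row
        for x in prev:
            row.append(row[-1] + x)
    return row[-1]

print(bell(8))    # 4140 ways to cluster eight vertices
print(bell(16))   # over ten billion ways for sixteen vertices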

Instead, we have to turn to a heuristic method that does a reasonably good job of estimating the clusters that will produce the highest modularity score, without trying out every single possibility. This algorithm is called Fast-Greedy Modularity-Maximization, and it's somewhat analogous to the agglomerative hierarchical clustering algorithm described above. Instead of merging according to distance, 'Mod-Max' merges communities according to changes in modularity. Here's how it goes:

Begin by assigning every vertex to its own community, and calculate the modularity of the whole network, M.

Step 1 requires that for each community pair linked by at least a single edge, the algorithm calculates the resultant change in modularity ΔM if the two communities were merged into one.

Step 2 then takes the pair of communities that produce the biggest increase in ΔM, which are then merged. Calculate the new modularity M for this clustering, and keep a record of it.

Repeat steps 1 and 2 — each time merging the pair of communities for which doing so produces the biggest gain in ΔM, then recording the new clustering pattern and its associated modularity score M.

Stop when all the vertices are grouped into one giant cluster. Now the algorithm checks the records it kept as it went along, and identifies the clustering pattern that returned the highest value of M. This is the returned community structure.
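
If you'd rather not implement that loop yourself, the NetworkX library ships a greedy modularity-maximisation routine (greedy_modularity_communities, based on the Clauset-Newman-Moore approach) that follows essentially this merge-and-record procedure. A minimal sketch, applied to the eight-website network with its edge list transcribed from the adjacency matrix above:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# The website network, written out as an edge list
edges = [
    ("GitHub", "Google"), ("GitHub", "Twitter"),
    ("Google", "Medium"), ("Google", "PayPal"), ("Google", "Quora"),
    ("Google", "Twitter"), ("Google", "Wikipedia"), ("Google", "YouTube"),
    ("Medium", "Twitter"),
    ("PayPal", "Twitter"), ("PayPal", "YouTube"),
    ("Quora", "Twitter"), ("Quora", "Wikipedia"),
    ("Twitter", "YouTube"),
]
G = nx.Graph(edges)

# Greedily merge communities while modularity keeps improving
for community in greedy_modularity_communities(G):
    print(sorted(community))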

Finer details:

Whew! That was computationally intensive, at least for us humans. Graph theory is a rich source of computationally challenging, often NP-hard problems — yet it also has incredible potential to provide valuable insights into complex systems and datasets. Just ask Larry Page, whose eponymous PageRank algorithm — which helped propel Google from start-up to basically world domination in less than a generation — was based entirely in graph theory.

Community detection is a major focus of current research in graph theory, and there are plenty of alternatives to Modularity-Maximization, which, while useful, does have some drawbacks.

For a start, its agglomerative approach often sees small, well-defined communities swallowed up into larger ones. This is known as the resolution limit — the algorithm will not find communities below a certain size. Another challenge is that rather than having one distinct, easy-to-reach global peak, the Mod-Max approach actually tends to produce a wide ‘plateau’ of many similar high modularity scores — making it somewhat difficult to truly identify the absolute maximum score.

Other algorithms use different ways to define and approach community detection. Edge-Betweenness is a divisive algorithm, starting with all vertices grouped in one giant cluster. It proceeds to iteratively remove the least ‘important’ edges in the network, until all vertices are left isolated. This produces a hierarchical structure, with similar vertices closer together in the hierarchy.
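
NetworkX also exposes this divisive, edge-betweenness-based approach as girvan_newman. A brief sketch, reusing the graph G from the previous snippet; each step removes the edge(s) with the highest betweenness, yielding successively finer partitions:

from itertools import islice
from networkx.algorithms.community import girvan_newman

# Look at the first three levels of the resulting hierarchy
for partition in islice(girvan_newman(G), 3):
    print([sorted(community) for community in partition])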

Another algorithm is Clique Percolation, which takes into account possible overlap between graph communities. Yet another set of algorithms is based on random walks across the graph, and then there are spectral clustering methods, which delve into the eigendecomposition of the adjacency matrix and other matrices derived from it. These ideas are used in feature extraction in, for example, areas such as Computer Vision.

It’d be well beyond the scope of this article to give each algorithm its own in-depth worked example. Suffice to say that this is an active area of research, providing powerful methods to make sense of data that even a generation ago would have been extremely difficult to process.

Conclusion

Hopefully this article has informed and inspired you to better understand how machines can make sense of Big Data! The future is a rapidly changing place, and many of those changes will be driven by what technology becomes capable of in the next generation or two.

As outlined in the introduction, Machine Learning is an extraordinarily ambitious field of research, in which massively complex problems require solving in as accurate and as efficient a way possible. Tasks that come naturally to us humans require innovative solutions when taken on by machines.

There’s still plenty of progress to be made, and whoever contributes the next breakthrough idea will no doubt be generously rewarded. Maybe someone reading this article will be behind the next powerful algorithm? All great ideas have to start somewhere!