Cyber Security Intelligence & Analytics (CSIA)

Project Description: @GitHub README


Core: Malicious Software (Malware) Profile Analysis


  1. Original Dataset - thousands of records captured daily over 36 months
  2. Data Source - passive Domain Name System (DNS) records of malware attacks
  3. Data Structure - 4-column entries: Date, MD5Hash, Domains, IP_Address
  4. Capture Period - January 01, 2011 ~ December 31, 2013 (36 months)
  5. .CSV Data Source - GT Malware Passive DNS Data 2011 - 2013
  6. Data Size - approx. 44,000,000 rows x 4 columns, roughly 3.5 GB!
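At ~3.5 GB, the raw CSV is too large to load comfortably in one pass on most machines. One common approach (a sketch, not the project's actual loader; the in-memory sample below stands in for a real file path) is to stream it with pandas' `chunksize`:

```python
import io
import pandas as pd

col_names = ['Date', 'MD5Hash', 'Domains', 'IP_Address']

# Stand-in for one of the monthly CSVs; in practice pass the file path instead.
sample = io.StringIO(
    "2011-01-01,0003b97144cf06801627fdd07a917f51,mozilla.com,63.245.209.106\n"
    "2011-01-01,0003b97144cf06801627fdd07a917f51,rapidshare.com,195.122.131.12\n"
)

# Stream the file in fixed-size chunks instead of loading everything at once.
n_rows = 0
for chunk in pd.read_csv(sample, names=col_names, chunksize=1):
    n_rows += len(chunk)
print(n_rows)  # 2
```

With a real file, a `chunksize` on the order of 10^6 rows keeps memory bounded while still letting per-chunk aggregation cover all ~44M rows.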


In [2]:
%matplotlib inline

import os
import numpy as np
import pandas as pd

pd.set_option('precision', 2)
pd.set_option('notebook_repr_html', True)
# pd.set_option('display.mpl_style', 'default')

import seaborn as sns
sns.set_context("talk")
sns.set_style("darkgrid")

import matplotlib.pyplot as plt
from IPython.display import Image, HTML

import plotly as py
import cufflinks as cf
import plotly.graph_objs as go

cf.set_config_file(theme='ggplot', sharing='public', offline=False, world_readable=True)

print('Done!')
Done!
In [3]:
col_names = ['Date', 'MD5Hash', 'Domains', 'IP_Address']
p_url = "./csv_dataset/gt_mpdns_2011-01.csv"
demo_data = pd.read_csv(p_url, index_col=0, names=col_names).head(5)
print("\nWe can't do much with this data without augmentation.\n")
demo_data
We can't do much with this data without augmentation.

Out[3]:
MD5Hash Domains IP_Address
Date
2011-01-01 0003b97144cf06801627fdd07a917f51 mozilla.com 63.245.209.106
2011-01-01 0003b97144cf06801627fdd07a917f51 rapidshare.com 195.122.131.12
2011-01-01 0003b97144cf06801627fdd07a917f51 tagshare.in 69.10.39.21
2011-01-01 00134ecf2477c1f477a8767c87572107 mozilla.com 63.245.209.106
2011-01-01 00134ecf2477c1f477a8767c87572107 rapidshare.com 195.122.131.5
In [470]:
Image("ip_class.jpeg")
Out[470]:

Data Preparation, Augmentation and Scraping (DPAS) using geolocation API (1)

Data Preparation

  • Filtered out rows missing vital data.
  • Wrote a script to classify each IP into an IP_CLASS column.
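The classification script itself isn't reproduced here; a minimal sketch of classful IPv4 classification, consistent with the labels that appear later (CLASS A-E, LOOPBACK IP), could look like:

```python
def classify_ip(ip):
    """Classify an IPv4 address by its first octet (classful addressing)."""
    first = int(ip.split('.')[0])
    if first == 127:
        return 'LOOPBACK IP'   # 127.0.0.0/8 is reserved for loopback
    if first < 128:
        return 'CLASS A'
    if first < 192:
        return 'CLASS B'
    if first < 224:
        return 'CLASS C'
    if first < 240:
        return 'CLASS D'       # multicast
    return 'CLASS E'           # reserved / experimental

print(classify_ip('63.245.209.106'))   # CLASS A
print(classify_ip('195.122.131.12'))   # CLASS C
```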

Data Augmentation

  • Introduced additional columns to support time-series analysis.
  • Used TWO geolocation API services to extract location data for each IP.

Data Scraping

  • Much of the data contains abbreviations/codes (country, city, continent).
  • Scraped comprehensive web data for decoding them (e.g. Long/Lat => Country).

Data Merging & Cleaning

  • Merged the augmented data, web data and API-fetched data.
  • Filtered the clean data, ready for EDA and modeling.
  1. Data preparation and description of features: Notebook 1

  2. API Data scraping using GeoIPS API Service Notebook 2

  3. API Data scraping using freegeoip.net and data cleaning Notebook 3


API Services: FreeGeoip.net and GeoIPS.com

Final dataset - augmented_ip_geodata.csv

Data Preparation, Augmentation and Scraping (DPAS) using geolocation API (2)

In [4]:
# Dataset with unique IPs
data_url = "https://s3-us-west-2.amazonaws.com/cyber-security-dataset/clean_malware_dataset.csv.gz"
clean_malware_dataset = pd.read_csv(data_url, index_col=0, compression='gzip')

# Compute elapsed days and weeks since infection logging started.
parsed_date = pd.to_datetime(clean_malware_dataset.Date.values)
clean_malware_dataset['Days_elapsed'] = (parsed_date - pd.Timestamp(2011, 1, 1)).days
clean_malware_dataset['Weeks_elapsed'] = [dayx // 7 for dayx in clean_malware_dataset.Days_elapsed]

# We assume an average of 30 days per month for simplicity.
clean_malware_dataset['Months_elapsed'] = [dayx // 30 for dayx in clean_malware_dataset.Days_elapsed]
clean_malware_dataset.drop(['Year', 'month'], axis=1, inplace=True)
clean_malware_dataset.tail(3)
Out[4]:
Date MD5Hash Domains IP_Address Days_elapsed Weeks_elapsed Months_elapsed
44502004 2013-12-31 ffeb275390d1588378373efc28e628cf partner.googleadservices.com 74.125.225.121 1095 156 36
44502005 2013-12-31 ffeb275390d1588378373efc28e628cf pubads.g.doubleclick.net 74.125.225.122 1095 156 36
44502006 2013-12-31 ffeb275390d1588378373efc28e628cf www.googletagservices.com 74.125.225.122 1095 156 36
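As a standalone sanity check on the elapsed-time arithmetic above: the last logged date, 2013-12-31, is 1095 days after 2011-01-01, which gives 156 whole weeks and 36 thirty-day months, matching the tail of the output.

```python
import pandas as pd

dates = pd.to_datetime(['2011-01-01', '2013-12-31'])
days = (dates - pd.Timestamp(2011, 1, 1)).days

last = days[-1]
print(last, last // 7, last // 30)  # 1095 156 36
```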
In [5]:
# Augmented dataset obtained from using geolocation API with IP address as input parameter
file_name = "https://s3-us-west-2.amazonaws.com/cyber-security-dataset/augmented_ip_geodata.csv.gz"
augmented_ip_geodata = pd.read_csv(file_name, low_memory=False, index_col=0, compression='gzip')
augmented_ip_geodata.drop(['CONTINENT_NAME', 'COUNTRY_CODE'], axis=1, inplace=True)
print('Dataset Shape', augmented_ip_geodata.shape)
augmented_ip_geodata.head(3)
Dataset Shape (397423, 15)
Out[5]:
IP_CLASS CITY_NAME CONTINENT_CODE COUNTRY_NAME COUNTY_NAME HOSTNAME IP LATITUDE LONGITUDE METRO_CODE OWNER REGION_CODE REGION_NAME TIMEZONE ZIP_CODE
0 CLASS A BEIJING AS CHINA NaN UNKNOWN 111.3.42.65 39.90 116.41 NaN CHINA MOBILE COMMUNICATIONS CORPORATION BJ NaN GMT+8 NaN
1 CLASS A ENID NaN UNITED STATES GARFIELD C74-197-189-185.DH.SUDDENLINK.NET 74.197.189.185 36.40 -97.88 NaN SUDDENLINK COMMUNICATIONS OK OKLAHOMA CST NaN
2 CLASS A NANNING AS CHINA NaN UNKNOWN 116.252.23.146 22.82 108.32 NaN CHINANET GUANGXI PROVINCE NETWORK GX NaN GMT+8 NaN

Scrape Data for Countries Profiles

In [7]:
country_url = "https://s3-us-west-2.amazonaws.com/cyber-security-dataset/augmented_countries_profile.csv.gz"
country_profile = pd.read_csv(country_url, index_col=0, compression='gzip')
country_profile.columns = ['AREA (KM2)', 'CAPITAL', 'COUNTRY_CODE', 'COUNTRY_NAME', 'POPULATION','CONTINENT_NAME']

# Merge Country Profiles with `augmented_ip_geodata` Dataset
df_augmented_ip_geodata = pd.merge(country_profile, augmented_ip_geodata, 
                                left_on='COUNTRY_NAME', right_on='COUNTRY_NAME', how='right')
df_augmented_ip_geodata.head(3)
Out[7]:
AREA (KM2) CAPITAL COUNTRY_CODE COUNTRY_NAME POPULATION CONTINENT_NAME IP_CLASS CITY_NAME CONTINENT_CODE COUNTY_NAME HOSTNAME IP LATITUDE LONGITUDE METRO_CODE OWNER REGION_CODE REGION_NAME TIMEZONE ZIP_CODE
0 647500 KABUL AF AFGHANISTAN 29121286 ASIA CLASS B KABUL NaN NaN NaN 180.94.94.116 34.52 69.18 0 NaN KAB KABUL ASIA/KABUL NaN
1 647500 KABUL AF AFGHANISTAN 29121286 ASIA CLASS B NaN NaN NaN NaN 175.106.41.235 33.00 65.00 0 NaN NaN NaN ASIA/KABUL NaN
2 647500 KABUL AF AFGHANISTAN 29121286 ASIA CLASS B NaN NaN NaN NaN 175.106.35.230 33.00 65.00 0 NaN NaN NaN ASIA/KABUL NaN

Import more accurate country abbreviations and names

In [8]:
# Utility function to uppercase column names and all string-valued columns
def change_type(df):
    df.columns = df.columns.str.upper()
    for col in df.columns:
        if df[col].dtypes == 'object':
            df[col] = df[col].str.upper()
    return df


# Add country specific profile from dataset from GAPMINDER project
data_url = 'http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt'
gapmndr_ctry = pd.read_csv(data_url, sep='\t')
gapmndr_ctry.drop(['year','pop', 'lifeExp', 'gdpPercap'], axis=1, inplace=True)
gapmndr_ctry.columns = ['COUNTRY_NAME', 'CONTINENT_NAME']
gapmndr_ctry = change_type(gapmndr_ctry)
# print(gapmndr_ctry.head(3))


# Fill gaps in `df_augmented_ip_geodata` from `gapmndr_ctry`, with `df_augmented_ip_geodata` prioritized.
df_augmented_ip_geodata = df_augmented_ip_geodata.combine_first(gapmndr_ctry)
df_augmented_ip_geodata.head()
Out[8]:
AREA (KM2) CAPITAL CITY_NAME CONTINENT_CODE CONTINENT_NAME COUNTRY_CODE COUNTRY_NAME COUNTY_NAME HOSTNAME IP IP_CLASS LATITUDE LONGITUDE METRO_CODE OWNER POPULATION REGION_CODE REGION_NAME TIMEZONE ZIP_CODE
0 647500 KABUL KABUL NaN ASIA AF AFGHANISTAN NaN NaN 180.94.94.116 CLASS B 34.52 69.18 0 NaN 29121286 KAB KABUL ASIA/KABUL NaN
1 647500 KABUL NaN NaN ASIA AF AFGHANISTAN NaN NaN 175.106.41.235 CLASS B 33.00 65.00 0 NaN 29121286 NaN NaN ASIA/KABUL NaN
2 647500 KABUL NaN NaN ASIA AF AFGHANISTAN NaN NaN 175.106.35.230 CLASS B 33.00 65.00 0 NaN 29121286 NaN NaN ASIA/KABUL NaN
3 647500 KABUL NaN NaN ASIA AF AFGHANISTAN NaN NaN 175.106.35.89 CLASS B 33.00 65.00 0 NaN 29121286 NaN NaN ASIA/KABUL NaN
4 647500 KABUL NaN NaN ASIA AF AFGHANISTAN NaN NaN 175.106.40.242 CLASS B 33.00 65.00 0 NaN 29121286 NaN NaN ASIA/KABUL NaN
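A quick toy demonstration of `change_type` on hypothetical data (the function is repeated here only so the snippet is self-contained):

```python
import pandas as pd

def change_type(df):
    # Uppercase column names and every string-typed column, as defined above.
    df.columns = df.columns.str.upper()
    for col in df.columns:
        if df[col].dtypes == 'object':
            df[col] = df[col].str.upper()
    return df

toy = pd.DataFrame({'country_name': ['china', 'united states'],
                    'continent': ['asia', 'americas']})
change_type(toy)
print(list(toy.columns))           # ['COUNTRY_NAME', 'CONTINENT']
print(toy.COUNTRY_NAME.tolist())   # ['CHINA', 'UNITED STATES']
```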

Data Preparation, Augmentation and Scraping (DPAS) using geolocation API (3)

In [9]:
valued_cols = ['IP_CLASS', 'IP', 'COUNTRY_CODE', 'COUNTRY_NAME', 'CITY_NAME', 'CONTINENT_CODE', 
               'CONTINENT_NAME','AREA (KM2)','CAPITAL', 'COUNTY_NAME','HOSTNAME', 'LATITUDE', 
               'LONGITUDE', 'OWNER','POPULATION', 'REGION_CODE', 'REGION_NAME']

# Merge dataframes and drop unneeded columns
xdata = df_augmented_ip_geodata[valued_cols]
merged_data = pd.merge(xdata, clean_malware_dataset, left_on='IP', right_on='IP_Address', how='right')
merged_data.drop('IP', inplace=True, axis=1)
merged_data.columns = merged_data.columns.str.upper()

print("Merged!")
Merged!
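The `how='right'` choice above keeps every malware record even when its IP has no geolocation match; a toy illustration with made-up rows:

```python
import pandas as pd

geo = pd.DataFrame({'IP': ['1.2.3.4'], 'COUNTRY_NAME': ['EXAMPLELAND']})
hits = pd.DataFrame({'IP_Address': ['1.2.3.4', '5.6.7.8'],
                     'Domains': ['a.example.com', 'b.example.com']})

# Right join: all malware hits survive; unmatched IPs get NaN geo fields.
out = pd.merge(geo, hits, left_on='IP', right_on='IP_Address', how='right')
print(out.shape)                         # (2, 4)
print(out['COUNTRY_NAME'].isna().sum())  # 1
```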
In [30]:
# Optionally work on a fraction of the data for speed
print( "\nRows: {:,} and {} columns".format(merged_data.shape[0], merged_data.shape[1]))
# merged_data = merged_data.head(2000000).copy()
merged_data.head(3)
Rows: 23,307,040 and 23 columns
Out[30]:
IP_CLASS COUNTRY_CODE COUNTRY_NAME CITY_NAME CONTINENT_CODE CONTINENT_NAME AREA (KM2) CAPITAL COUNTY_NAME HOSTNAME ... POPULATION REGION_CODE REGION_NAME DATE MD5HASH DOMAINS IP_ADDRESS DAYS_ELAPSED WEEKS_ELAPSED MONTHS_ELAPSED
0 CLASS B AF AFGHANISTAN KABUL NaN ASIA 647500 KABUL NaN NaN ... 29121286 KAB KABUL 2013-01-21 a9c0092ac0b225d324bf773b99a3f72b guphumsa.ru 180.94.94.116 751 107 25
1 CLASS B AF AFGHANISTAN NaN NaN ASIA 647500 KABUL NaN NaN ... 29121286 NaN NaN 2012-07-10 1ed98bd550d8a3816a974032ccf5ff70 fahimjan.no-ip.biz 175.106.41.235 556 79 18
2 CLASS B AF AFGHANISTAN NaN NaN ASIA 647500 KABUL NaN NaN ... 29121286 NaN NaN 2012-07-10 90b5593db166cc690d53b8865ea69d97 fahimjan.no-ip.biz 175.106.35.230 556 79 18

3 rows × 23 columns

Code to convert IP CLASS to Integer for ML

In [69]:
augmented_data = augmented_ip_geodata.copy()
class_labels = list(augmented_ip_geodata.IP_CLASS.unique())


def parse_label(label):
    # Map each IP class label to its index in `class_labels`.
    options = {ip_class: class_labels.index(ip_class) for ip_class in class_labels}
    return options[label]

# Show IP Class before and after Encoding
print(augmented_data.IP_CLASS.unique())
augmented_data['IP_CLASS'] = augmented_data.IP_CLASS.map(parse_label)
augmented_data.IP_CLASS.unique()
['CLASS A' 'CLASS C' 'CLASS B' 'CLASS E' 'LOOPBACK IP' 'CLASS D']
Out[69]:
array([0, 1, 2, 3, 4, 5])
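The dict-based mapping above can also be written with `pd.factorize`, which assigns integers by order of first appearance; a sketch on the same label set:

```python
import pandas as pd

labels = pd.Series(['CLASS A', 'CLASS C', 'CLASS B',
                    'CLASS E', 'LOOPBACK IP', 'CLASS D'])
codes, uniques = pd.factorize(labels)
print(list(codes))    # [0, 1, 2, 3, 4, 5]
print(list(uniques))  # original labels, in order of first appearance
```

`factorize` also returns the `uniques` array, so the integer codes can be decoded back to labels later.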

Data Preparation, Augmentation and Scraping (DPAS) using geolocation API (4)

In [214]:
# Which of these columns have completely non-null entries?
print("Columns with NON NULL entries:\n\n", merged_data.columns[merged_data.notnull().all()].values)
print("\n\nColumns with SOME NULL entries:\n\n", 
      merged_data.columns.difference(merged_data.columns[merged_data.notnull().all()]).values)
Columns with NON NULL entries:

 ['DATE' 'MD5HASH' 'DOMAINS' 'IP_ADDRESS' 'DAYS_ELAPSED' 'WEEKS_ELAPSED'
 'MONTHS_ELAPSED']


Columns with SOME NULL entries:

 ['AREA (KM2)' 'CAPITAL' 'CITY_NAME' 'CONTINENT_CODE' 'CONTINENT_NAME'
 'COUNTRY_CODE' 'COUNTRY_NAME' 'COUNTY_NAME' 'HOSTNAME' 'IP_CLASS'
 'LATITUDE' 'LONGITUDE' 'OWNER' 'POPULATION' 'REGION_CODE' 'REGION_NAME']
In [142]:
# Check for rows with complete (non-null) entries in ALL columns

nice_rows = []
for row in merged_data.index:
    if merged_data.loc[row].notnull().all():
        nice_rows.append(row)

if len(nice_rows) < 1:
    print("Bad luck: no row has all columns non-null!")
else:
    print(nice_rows[:5])
Bad luck: no row has all columns non-null!
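The row loop above scales poorly at ~23M rows; the same check can be vectorized with `notnull().all(axis=1)`. A sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, np.nan], 'b': ['x', None, 'y']})

# Boolean mask of rows with no missing values, computed in one pass.
complete = df.notnull().all(axis=1)
print(list(df.index[complete]))  # [0]
```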
In [31]:
print("\nOld Column Structure: {} \t {}".format(len(demo_data.keys()), demo_data.keys().values))
print("\nNew Column Structure: {}".format(len(merged_data.keys())))
merged_data.loc[99]
Old Column Structure: 3 	 ['MD5Hash' 'Domains' 'IP_Address']

New Column Structure: 23
Out[31]:
IP_CLASS                                   CLASS A
COUNTRY_CODE                                    AL
COUNTRY_NAME                               ALBANIA
CITY_NAME                                      NaN
CONTINENT_CODE                                 NaN
CONTINENT_NAME                              EUROPE
AREA (KM2)                                 2.9e+04
CAPITAL                                     TIRANA
COUNTY_NAME                                    NaN
HOSTNAME                                       NaN
LATITUDE                                        41
LONGITUDE                                       20
OWNER                                          NaN
POPULATION                                   3e+06
REGION_CODE                                    NaN
REGION_NAME                                    NaN
DATE                                    2011-03-30
MD5HASH           93eec824d0baca0f440b33e2f4db4c02
DOMAINS                            gerti.zapto.org
IP_ADDRESS                            79.106.8.249
DAYS_ELAPSED                                    88
WEEKS_ELAPSED                                   12
MONTHS_ELAPSED                                   2
Name: 99, dtype: object

Consider the statistics of the IP_CLASS in relation to other attributes

In [32]:
merged_data.head()
Out[32]:
IP_CLASS COUNTRY_CODE COUNTRY_NAME CITY_NAME CONTINENT_CODE CONTINENT_NAME AREA (KM2) CAPITAL COUNTY_NAME HOSTNAME ... POPULATION REGION_CODE REGION_NAME DATE MD5HASH DOMAINS IP_ADDRESS DAYS_ELAPSED WEEKS_ELAPSED MONTHS_ELAPSED
0 CLASS B AF AFGHANISTAN KABUL NaN ASIA 647500 KABUL NaN NaN ... 29121286 KAB KABUL 2013-01-21 a9c0092ac0b225d324bf773b99a3f72b guphumsa.ru 180.94.94.116 751 107 25
1 CLASS B AF AFGHANISTAN NaN NaN ASIA 647500 KABUL NaN NaN ... 29121286 NaN NaN 2012-07-10 1ed98bd550d8a3816a974032ccf5ff70 fahimjan.no-ip.biz 175.106.41.235 556 79 18
2 CLASS B AF AFGHANISTAN NaN NaN ASIA 647500 KABUL NaN NaN ... 29121286 NaN NaN 2012-07-10 90b5593db166cc690d53b8865ea69d97 fahimjan.no-ip.biz 175.106.35.230 556 79 18
3 CLASS B AF AFGHANISTAN NaN NaN ASIA 647500 KABUL NaN NaN ... 29121286 NaN NaN 2012-06-12 b2b54ac3c30956163ccec5b8239e902c fahimjan.no-ip.biz 175.106.35.89 528 75 17
4 CLASS B AF AFGHANISTAN NaN NaN ASIA 647500 KABUL NaN NaN ... 29121286 NaN NaN 2012-05-19 335a7cfdf6747f9494fe307d2034c1a2 fahimjan.no-ip.biz 175.106.40.242 504 72 16

5 rows × 23 columns

EDA Visualization - Summary Statistics

In [10]:
grpby_ipclass = merged_data.groupby('IP_CLASS').count()
df = pd.DataFrame(grpby_ipclass.IP_ADDRESS)
df.head()
Out[10]:
IP_ADDRESS
IP_CLASS
CLASS A 13937603
CLASS B 2123895
CLASS C 6146203
CLASS D 930
CLASS E 7512
In [11]:
df.iplot(kind='bar', logy=True, yTitle="FREQUENCIES of malicious IPs", xTitle="IP CLASSES",
        title="Malicious IP CLASSES hitting the Network", filename='incubator/bar_ipclasses')
Out[11]:

What are the DAILY Occurrences of Malicious IPs by CLASS?

In [18]:
active_classes = ['CLASS A', 'CLASS B', 'CLASS C', 'CLASS D', 'CLASS E', 'LOOPBACK IP']
df1 = pd.pivot_table(merged_data, index='DAYS_ELAPSED', columns='IP_CLASS', aggfunc=len)
df1 = df1.IP_ADDRESS[active_classes]
df1.iplot(kind="box", logy=True, xTitle='CLASSES OF IP ADDRESSES', filename='incubator/daily_box', 
          yTitle='DAILY FREQUENCY', title='DAILY Occurrences of Malicious Attacks by CLASSES')
Out[18]:

What is the Average Elapsed Duration of Occurrences of Malicious Attacks by IP CLASS?

In [19]:
df_elapsed = merged_data[['IP_CLASS','DAYS_ELAPSED', 'WEEKS_ELAPSED','MONTHS_ELAPSED']]
df_elapsed = df_elapsed.groupby(['IP_CLASS']).mean()
df_elapsed.iplot(kind='bar', error_type='percent', error_y=15, xTitle='CLASSES OF IP ADDRESSES', filename='incubator/bar_attacks',
                 yTitle='AVERAGE ELAPSED DURATION', title='AVERAGE Duration of Attack Occurrences by IP CLASSES')
Out[19]:

Which continents is the malicious traffic coming from?

Is there a relationship between the IP classes and the continents where the traffic originates?

In [14]:
# Generalize the result above with a function.
def pivot_merged_dataset_by(cols_needed, ndx_name):
    """ Function to transform merged data on any of its columns. """
    return pd.pivot_table(merged_data[cols_needed], index=ndx_name, aggfunc='mean')


# Columns used when averaging attack duration
activex = ['DATE', 'COUNTRY_NAME', 'DAYS_ELAPSED', 'WEEKS_ELAPSED', 'MONTHS_ELAPSED', 'CONTINENT_NAME']

# Pivot table to obtain average attack duration wrt continent
continent_duratn = pivot_merged_dataset_by(activex, 'CONTINENT_NAME')

# Malicious Traffic by Continent
df_class = merged_data[['CONTINENT_NAME','IP_CLASS']].groupby('CONTINENT_NAME').count()


# Malicious Attacks by Continent Over time...
df_class = pd.merge(df_class, continent_duratn, left_index=True, right_index=True, how='inner')

df_continent_traffic = pd.merge(df_class, country_profile, left_index=True, right_on='CONTINENT_NAME', how='inner')
df_continent_traffic.head()
Out[14]:
IP_CLASS DAYS_ELAPSED MONTHS_ELAPSED WEEKS_ELAPSED AREA (KM2) CAPITAL COUNTRY_CODE COUNTRY_NAME POPULATION CONTINENT_NAME
2 77768 446.49 14.39 63.35 2381740 ALGIERS DZ ALGERIA 34586184 AFRICA
4 77768 446.49 14.39 63.35 1246700 LUANDA AO ANGOLA 13068161 AFRICA
18 77768 446.49 14.39 63.35 112620 PORTO-NOVO BJ BENIN 9056010 AFRICA
21 77768 446.49 14.39 63.35 600370 GABORONE BW BOTSWANA 2029307 AFRICA
24 77768 446.49 14.39 63.35 27830 BUJUMBURA BI BURUNDI 9863117 AFRICA
In [15]:
df_continent_traffic.iplot(kind='bubble', y='IP_CLASS', x='MONTHS_ELAPSED', size='IP_CLASS', filename='incubator/bubble_continent',
                         text='CONTINENT_NAME', categories='CONTINENT_NAME', xTitle='AVERAGE ELAPSED PERIOD', 
                         yTitle='AVERAGE TRAFFIC', title='Average Malicious Traffic by CONTINENT over Time.')
Out[15]:

How about the originating countries of the malicious attacks?

In [16]:
# Pivot Table to obtain average attack duration wrt Country
country_duratn = pivot_merged_dataset_by(activex, 'COUNTRY_NAME')

# Malicious Traffic by COUNTRY
df_class = merged_data[['COUNTRY_NAME','IP_CLASS']].groupby('COUNTRY_NAME').count()

# Malicious Attacks by Country Over time...
df_class = pd.merge(df_class, country_duratn, left_index=True, right_index=True, how='inner')

# Merge both Country traffic details and Time Details
df_country_traffic = pd.merge(df_class, country_profile, left_index=True, right_on='COUNTRY_NAME', how='inner')
df_country_traffic.head(3)
Out[16]:
IP_CLASS DAYS_ELAPSED MONTHS_ELAPSED WEEKS_ELAPSED AREA (KM2) CAPITAL COUNTRY_CODE COUNTRY_NAME POPULATION CONTINENT_NAME
0 9 589.00 19.11 83.78 647500 KABUL AF AFGHANISTAN 29121286 ASIA
1 202 458.05 14.75 65.00 28748 TIRANA AL ALBANIA 2986952 EUROPE
2 24108 444.62 14.34 63.09 2381740 ALGIERS DZ ALGERIA 34586184 AFRICA
In [17]:
df_country_traffic.iplot(kind='bubble', x='IP_CLASS', y='WEEKS_ELAPSED', size='IP_CLASS', filename='incubator/bubble_country',
                         text=['CAPITAL', 'COUNTRY_NAME'], xTitle='AVERAGE ELAPSED PERIOD', categories='CONTINENT_NAME', 
                         yTitle='AVERAGE TRAFFIC', title='Average Malicious Traffic by COUNTRY over Time')
Out[17]:

It's getting more interesting, right?

Here's the roadmap of future work:

1. Complete the exploratory process to detect hidden patterns

2. Build a predictive model for forecasting cyber attacks based on each country's attack history.

3. Make data-backed recommendations based on traffic sources.

4. Build interactive web app for visualization

Packaged as: PLOTLY + FLASK WEB APPLICATION



http://turaquo-analytics.herokuapp.com/network_intel