Assignment 1

COMP9418 - Advanced Topics in Statistical Machine Learning

Deadline: Friday, October 11th, 2019, 9pm.

Student Student ID
Christopher Sheaffe z******
P**** G**** z******
Y**** H**** z******

Introduction

In this assignment, you will develop some sub-routines in Python to create useful operations on Bayesian Networks. You will implement an efficient independence test, learn parameters from data, sample from the joint distribution and classify examples. We will use a Bayesian Network for diagnosis of breast cancer.

Prerequisites

This notebook requires the following libraries in a Python 3.7 environment to run correctly. Use conda install <library> to install any that are missing from the list below.

  1. numpy
  2. pandas
  3. tabulate
  4. scikit-learn
  5. matplotlib
  6. seaborn

Basic Set Up

In [1]:
#libraries
import numpy as np
import pandas as pd
from itertools import product, combinations
from collections import OrderedDict as odict
from tabulate import tabulate
from copy import deepcopy
import random
from pprint import pprint
import csv
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sn
In [2]:
#graph creation
breastCancerGraph = {
        "Age": ["BC"],
        "Location": ["BC"],
        "BreastDensity": ["Mass"],
        "Size": [],
        "Mass": ["Shape","Margin","Size"],
        "BC": ["Metastasis", "Mass", "MC", "SkinRetract", "NippleDischarge", "AD"],
        "Metastasis": ["LymphNodes"],
        "LymphNodes": [],
        "MC": [],
        "SkinRetract": [],
        "NippleDischarge": [],
        "AD": ["FibrTissueDev"],
        "FibrTissueDev": ["SkinRetract","NippleDischarge", "Spiculation"],
        "Spiculation": ["Margin"],
        "Margin": [],
        "Shape": [],
    }

DFS Core Algorithm

In [3]:
# This is the main DFS recursive function
def dfs_r(G, v, colour):
    """
    argument 
    `G`, an adjacency list representation of a graph
    `v`, next vertex to be visited
    `colour`, dictionary with the colour of each node
    """
    #print('Visiting: ', v)
    # Visited vertices are coloured 'grey'
    colour[v] = 'grey'
    # Let's visit all outgoing edges from v
    for w in G[v]:
        # To avoid loops, we check that the next vertex hasn't been visited yet
        if colour[w] == 'white':
            dfs_r(G, w, colour)
    # When we finish the for loop, we know we have visited all nodes from v. It is time to turn it 'black'
    colour[v] = 'black'

# This is an auxiliary DFS function to create and initialize the colour dictionary
def dfs(G, start):
    """
    argument 
    `G`, an adjacency list representation of a graph
    `start`, starting vertex
    """    
    # Create a dictionary with keys as node numbers and values equal to 'white'
    colour = dict([(node, 'white') for node in G.keys()])
    # Call recursive DFS 
    dfs_r(G, start, colour)
    # We can return colour dictionary. It is useful for some operations, such as detecting connected components
    return colour

Topological Sort Algorithm

In [4]:
def topologicalSort_r(G, v, colour, stack):
    """
    argument 
    `G`, an adjacency list representation of a graph
    `v`, current vertex
    `colour`, colouring dictionary
    `stack`, list with topological ordering of nodes
    """     
    colour[v] = 'grey'
    for w in G[v]:
        if colour[w] == 'white':
            topologicalSort_r(G, w, colour, stack)
    colour[v] = 'black'
    stack.append(v)
    
def topologicalSort(G, start):
    """
    argument 
    `G`, an adjacency list representation of a graph
    `start`, starting vertex
    """        
    colour = dict([(node, 'white') for node in G.keys()])
    # We use a stack to store the topological ordering of the nodes, so we can reverse it later
    stack = []
    topologicalSort_r(G, start, colour, stack)

    return reversed(stack)
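
Note that topologicalSort only orders the nodes reachable from `start`. A minimal sketch of a variant that covers every node by restarting the DFS from each unvisited node is given below. The name `topological_sort_all` and the small example graph are illustrative assumptions, not part of the assignment code.

```python
def topological_sort_all(G):
    """Return a list with every node of the DAG `G` in topological order."""
    colour = {node: 'white' for node in G}
    stack = []

    def visit(v):
        colour[v] = 'grey'
        for w in G[v]:
            if colour[w] == 'white':
                visit(w)
        colour[v] = 'black'
        # v is appended only after all of its descendants
        stack.append(v)

    # Restart the DFS from every node that is still unvisited,
    # so no single "root" start node is required
    for node in G:
        if colour[node] == 'white':
            visit(node)
    return stack[::-1]

# Hypothetical example graph (adjacency lists, parent -> children)
example = {"A": ["C"], "B": ["C", "D"], "C": ["E"], "D": ["E"], "E": []}
order = topological_sort_all(example)
print(order)
```

In the returned list every parent appears before all of its children, which is the property forward sampling relies on.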

Graph Helper Functions

In [5]:
def IsDescendant(graph, parent, descendant):
    colours = dfs(graph,parent)
    if colours.get(descendant,"") == "black":
        return True
    else:
        return False

def IsAncestor(graph, child, ancestor):
    colours = dfs(graph,ancestor)
    if colours.get(child,"") == "black":
        return True
    else:
        return False
    
def PathExist(graph, v1, v2, directional = True):
    if IsDescendant(graph,v1,v2):
        return True
    if not directional:
        if IsDescendant(graph,v2,v1):
            return True
    return False

def PathExistSet(graph, set1, set2, directional = True):
    for s1 in set1:
        for s2 in set2:
            if PathExist(graph,s1,s2,directional):
                return True
    return False

def RemoveNode(graph, node):
    graph.pop(node)
    for key in graph.keys():
        if node in graph[key]:
            graph[key].remove(node)
            
def RemoveLeafNodes(graph, filterSet):
    toRemove = []
    for key in graph.keys():
        if key not in filterSet:
            if graph[key] == []:
                toRemove.append(key)
    for key in toRemove:
        RemoveNode(graph,key)                    
    return len(toRemove)

def DirectedGraph2UndirectedGraph(graph):
    undirectedGraph = deepcopy(graph)
    for node in undirectedGraph.keys():
        for child in undirectedGraph[node]:
            if node not in undirectedGraph[child]:
                undirectedGraph[child].append(node)
    return undirectedGraph

def transposeGraph(G): #reverse direction of edges
    GT = dict((v, []) for v in G)
    for v in G:
        for w in G[v]:
            if w in GT:
                GT[w].append(v)
            else:
                GT[w] = [v]
    return GT

def printFactor(f):
    """
    argument 
    `f`, a factor to print on screen
    """
    # Create a empty list that we will fill in with the probability table entries
    table = list()
    
    # Iterate over all keys and probability values in the table
    for key, item in f['table'].items():
        # Convert the tuple to a list to be able to manipulate it
        k = list(key)
        # Append the probability value to the list with key values
        k.append(item)
        # Append an entire row to the table
        table.append(k)
    # dom is used as table header. We need it converted to list
    dom = list(f['dom'])
    # Append a 'Pr' to indicate the probability column
    dom.append('Pr')
    print(tabulate(table,headers=dom,tablefmt='orgtbl'))
    

Task 1 [25 Marks] - Efficient d-separation test

Implement the efficient version of the d-separation algorithm in a function d_separation(G,X,Y,Z) that returns a boolean: true if X is d-separated from Y given Z and false otherwise. Comment on the time complexity of this procedure.

In [6]:
def d_separation(G,X,Y,Z):
    graph = deepcopy(G)
    
    #repeatedly remove leaf nodes that are not in X, Y or Z until none remain
    while RemoveLeafNodes(graph,X+Y+Z) != 0:
        pass
        
    #remove all outgoing edges for Z
    for z in Z:
        graph[z] = []
        
    #convert to undirected graph
    graph = DirectedGraph2UndirectedGraph(graph)
    
    #if X and Y connected return False
    if PathExistSet(graph, X,Y):
        return False
    else:
        return True

# Testing
testGraph = {
        "A": ["C"],
        "B": ["C","D"],
        "C": ["E","F"],
        "D": ["E"],
        "E": [],
        "F": [],
    }


#(a) A ⫫ B = True
X = ["A"]
Y = ["B"]
Z = []
print(X, "⫫", Y, "|", Z, " = ", d_separation(testGraph,X,Y,Z))

#(b) A ⫫ D | C = False
X = ["A"]
Y = ["D"]
Z = ["C"]
print(X, "⫫", Y, "|", Z, " = ", d_separation(testGraph,X,Y,Z))

#(c) F ⫫ D | B = True
X = ["F"]
Y = ["D"]
Z = ["B"]
print(X, "⫫", Y, "|", Z, " = ", d_separation(testGraph,X,Y,Z))

#(d) A ⫫ D | F = False
X = ["A"]
Y = ["D"]
Z = ["F"]
print(X, "⫫", Y, "|", Z, " = ", d_separation(testGraph,X,Y,Z))

#(e) D ⫫ F | C = True
X = ["D"]
Y = ["F"]
Z = ["C"]
print(X, "⫫", Y, "|", Z, " = ", d_separation(testGraph,X,Y,Z))
['A'] ⫫ ['B'] | []  =  True
['A'] ⫫ ['D'] | ['C']  =  False
['F'] ⫫ ['D'] | ['B']  =  True
['A'] ⫫ ['D'] | ['F']  =  False
['D'] ⫫ ['F'] | ['C']  =  True
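
On time complexity: a rough reading of the cell above suggests it is not linear, because every call to RemoveLeafNodes scans all nodes, every call to RemoveNode scans all adjacency lists, and PathExistSet starts a fresh DFS for each pair of nodes. The sketch below shows one way the same three steps (prune leaves, cut edges leaving Z, test undirected connectivity) could be done in O(n + m) time for n nodes and m edges. The name `d_separation_linear` and the use of sets and a work queue are assumptions made for illustration.

```python
from collections import deque

def d_separation_linear(G, X, Y, Z):
    """Sketch of the pruning-based d-separation test in O(n + m) time."""
    keep = set(X) | set(Y) | set(Z)
    children = {v: set(ws) for v, ws in G.items()}
    parents = {v: set() for v in G}
    for v, ws in G.items():
        for w in ws:
            parents[w].add(v)

    # 1. Delete leaves outside X, Y and Z; a parent joins the queue
    #    at the moment its last child is deleted
    queue = deque(v for v in G if not children[v] and v not in keep)
    removed = set()
    while queue:
        v = queue.popleft()
        removed.add(v)
        for p in parents[v]:
            children[p].discard(v)
            if not children[p] and p not in keep:
                queue.append(p)

    # 2. Drop edges leaving Z and 3. build the undirected graph
    z_set = set(Z)
    adj = {v: set() for v in G if v not in removed}
    for v in adj:
        if v in z_set:
            continue
        for w in children[v]:
            adj[v].add(w)
            adj[w].add(v)

    # 4. One BFS from all of X; d-separated iff no node of Y is reached
    targets = set(Y)
    seen = set(X)
    queue = deque(X)
    while queue:
        v = queue.popleft()
        if v in targets:
            return False
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return True

# Same test graph as in the cell above
testGraph = {"A": ["C"], "B": ["C", "D"], "C": ["E", "F"],
             "D": ["E"], "E": [], "F": []}
print(d_separation_linear(testGraph, ["A"], ["D"], ["C"]))
```

Each node is queued and each edge is inspected a constant number of times, which is where the linear bound comes from.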

Task 2 [5 Marks] - Estimate Bayesian Network parameters from data

Implement a function learn_bayes_net(G, file, outcomeSpace, prob_tables) that learns the parameters of the Bayesian Network G. This function should output a dictionary prob_tables with all the conditional probability tables (one for each node), as well as the outcomeSpace with the domain values of each variable.
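
The maximum-likelihood estimate of each table entry is a ratio of counts: the number of rows where the node and its parents take the given values, divided by the number of rows where the parents take those values. A minimal sketch on toy data follows. The rows and the helper `ml_cpt` are hypothetical and only illustrate the counting.

```python
from collections import Counter

# Toy observations (hypothetical, not taken from the breast cancer file)
rows = [
    {"X": 1, "Y": 1},
    {"X": 1, "Y": 0},
    {"X": 0, "Y": 1},
    {"X": 0, "Y": 1},
]

def ml_cpt(rows, var, parents):
    """Maximum-likelihood estimate of P(var | parents) by counting rows."""
    # count(parents, var) for every combination seen in the data
    joint = Counter(tuple(r[p] for p in parents) + (r[var],) for r in rows)
    # count(parents) for every parent combination seen in the data
    marginal = Counter(tuple(r[p] for p in parents) for r in rows)
    return {key: n / marginal[key[:-1]] for key, n in joint.items()}

cpt = ml_cpt(rows, "Y", ["X"])
print(cpt)
```

Unlike a table built over the full outcome space, this sketch only contains the combinations that actually occur in the data, so unseen entries are missing rather than zero.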

We are working with a small Bayesian Network with 16 nodes. What will be the size of the joint distribution with all 16 variables?
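
The size of the joint table is the product of the number of outcomes of each variable. The sketch below computes it and, for comparison, the total number of entries in the 16 conditional probability tables. The outcome counts are read off the learned tables printed in this task and the parent lists come from reversing the edges of breastCancerGraph; the helper `table_size` is an illustrative assumption.

```python
# Number of outcomes per variable, read off the learned tables
# (assumption: the tables list every outcome present in the data)
cardinality = {
    "Age": 4, "Location": 4, "BreastDensity": 3, "Size": 3,
    "Mass": 3, "BC": 3, "Metastasis": 2, "LymphNodes": 2,
    "MC": 2, "SkinRetract": 2, "NippleDischarge": 2, "AD": 2,
    "FibrTissueDev": 2, "Spiculation": 2, "Margin": 2, "Shape": 4,
}

# Parents of each node (breastCancerGraph with the edges reversed)
parents = {
    "Age": [], "Location": [], "BreastDensity": [],
    "Size": ["Mass"], "Mass": ["BreastDensity", "BC"],
    "BC": ["Age", "Location"], "Metastasis": ["BC"],
    "LymphNodes": ["Metastasis"], "MC": ["BC"],
    "SkinRetract": ["BC", "FibrTissueDev"],
    "NippleDischarge": ["BC", "FibrTissueDev"],
    "AD": ["BC"], "FibrTissueDev": ["AD"],
    "Spiculation": ["FibrTissueDev"],
    "Margin": ["Mass", "Spiculation"], "Shape": ["Mass"],
}

def table_size(variables):
    """Number of entries in a table over the given variables."""
    size = 1
    for v in variables:
        size *= cardinality[v]
    return size

# One entry per joint assignment of all 16 variables
joint_size = table_size(cardinality)
# One entry per row of every conditional probability table
cpt_entries = sum(table_size([v] + parents[v]) for v in cardinality)
print(joint_size, cpt_entries)
```

Under these counts the full joint distribution has 2,654,208 entries, while the factorised network needs only 173 table entries.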

In [7]:
def prob(factor, *entry):
    """
    argument 
    `factor`, a dictionary of domain and probability values,
    `entry`, a list of values, one for each variable in the same order as specified in the factor domain.
    
    Returns p(entry)
    """
    return factor['table'][entry]

def allEqualThisIndex(dict_of_arrays, **fixed_vars):
    """
    Helper function to create a boolean index vector into a tabular data structure,
    such that we return True only for rows of the table where, e.g.
    column_a=fixed_vars['column_a'] and column_b=fixed_vars['column_b'].
    
    This is a simple task, but it's not *quite* obvious
    for various obscure technical reasons.
    
    It is perhaps best explained by an example.
    
    >>> allEqualThisIndex(
    ...    {'X': [1, 1, 0], 'Y': [1, 0, 1]},
    ...    X=1,
    ...    Y=1
    ... )
    array([ True, False, False])
    """
    # base index is a boolean vector, everywhere true
    first_array = dict_of_arrays[list(dict_of_arrays.keys())[0]]
    index = np.ones_like(first_array, dtype=np.bool_)
    for var_name, var_val in fixed_vars.items():
        index = index & (np.asarray(dict_of_arrays[var_name])==var_val)
    return index

def estProbTable(data, var_name, parent_names, outcomeSpace):
    """
    Calculate a dictionary probability table by ML given
    `data`, a dictionary or dataframe of observations
    `var_name`, the column of the data to be used for the conditioned variable and
    `parent_names`, a tuple of columns to be used for the parents and
    `outcomeSpace`, a dictionary mapping each variable name to its list of possible outcomes
    Return a dictionary containing an estimated conditional probability table.
    """    
    var_outcomes = outcomeSpace[var_name]
    parent_outcomes = [outcomeSpace[var] for var in (parent_names)]
    # cartesian product to generate a table of all possible outcomes
    all_parent_combinations = product(*parent_outcomes)

    prob_table = odict()
    
    for parent_combination in all_parent_combinations:
        parent_vars = dict(zip(parent_names, parent_combination))
        parent_index = allEqualThisIndex(data, **parent_vars)
        for var_outcome in var_outcomes:
            var_index = (np.asarray(data[var_name])==var_outcome)
            # ML estimate: count(parents, var) / count(parents); this is nan if the parent combination never occurs in the data
            prob_table[tuple(list(parent_combination)+[var_outcome])] = (var_index & parent_index).sum()/parent_index.sum()
            
    return {'dom': tuple(list(parent_names)+[var_name]), 'table': prob_table}


def learn_bayes_net(G, file, outcomeSpace, prob_tables):
    with open(file) as f:
        
        #load data
        data = pd.read_csv(f)
        
        #load domains for each feature
        for col in data.columns:
            domain = data[col].unique()
            outcomeSpace[col] = list(domain)
            
        #estimate probabilities for each node
        
        #first reverse edge direction so that each node maps to its parents
        graphT = transposeGraph(G)
        for node, parents in graphT.items():
            prob_tables[node] = estProbTable(data,node,parents,outcomeSpace) 
            
#testing
outcomeSpace = {}
prob_tables = {}
learn_bayes_net(breastCancerGraph, "bc 2.csv", outcomeSpace, prob_tables)
for f in prob_tables.keys():
    print("*"*50)
    buffer = int((50-(len(f)+2))/2)
    print("*"*buffer + " " + f + " " + "*"*(50-buffer-len(f)-2))
    print("*"*50)
    printFactor(prob_tables[f])
    print("\n")
**************************************************
********************** Age ***********************
**************************************************
| Age   |       Pr |
|-------+----------|
| <35   | 0.103995 |
| 35-49 | 0.247988 |
| 50-74 | 0.500225 |
| >75   | 0.147793 |


**************************************************
******************** Location ********************
**************************************************
| Location    |       Pr |
|-------------+----------|
| UpInQuad    | 0.251987 |
| LolwOutQuad | 0.251087 |
| UpOutQuad   | 0.246188 |
| LowInQuad   | 0.250737 |


**************************************************
***************** BreastDensity ******************
**************************************************
| BreastDensity   |       Pr |
|-----------------+----------|
| medium          | 0.499125 |
| high            | 0.301435 |
| low             | 0.19944  |


**************************************************
********************** Size **********************
**************************************************
| Mass   | Size   |       Pr |
|--------+--------+----------|
| No     | <1cm   | 1        |
| No     | 1-3cm  | 0        |
| No     | >3cm   | 0        |
| Benign | <1cm   | 0.103523 |
| Benign | 1-3cm  | 0.256799 |
| Benign | >3cm   | 0.639679 |
| Malign | <1cm   | 0.287165 |
| Malign | 1-3cm  | 0.560551 |
| Malign | >3cm   | 0.152284 |


**************************************************
********************** Mass **********************
**************************************************
| BreastDensity   | BC       | Mass   |        Pr |
|-----------------+----------+--------+-----------|
| medium          | No       | No     | 0.892233  |
| medium          | No       | Benign | 0.107767  |
| medium          | No       | Malign | 0         |
| medium          | Invasive | No     | 0.205734  |
| medium          | Invasive | Benign | 0.163575  |
| medium          | Invasive | Malign | 0.630691  |
| medium          | Insitu   | No     | 0.259958  |
| medium          | Insitu   | Benign | 0.397624  |
| medium          | Insitu   | Malign | 0.342418  |
| high            | No       | No     | 0.84992   |
| high            | No       | Benign | 0.15008   |
| high            | No       | Malign | 0         |
| high            | Invasive | No     | 0.105152  |
| high            | Invasive | Benign | 0.0966831 |
| high            | Invasive | Malign | 0.798165  |
| high            | Insitu   | No     | 0.200234  |
| high            | Insitu   | Benign | 0.407494  |
| high            | Insitu   | Malign | 0.392272  |
| low             | No       | No     | 0.942046  |
| low             | No       | Benign | 0.0579536 |
| low             | No       | Malign | 0         |
| low             | Invasive | No     | 0.266595  |
| low             | Invasive | Benign | 0.187366  |
| low             | Invasive | Malign | 0.546039  |
| low             | Insitu   | No     | 0.242315  |
| low             | Insitu   | Benign | 0.44123   |
| low             | Insitu   | Malign | 0.316456  |


**************************************************
*********************** BC ***********************
**************************************************
| Age   | Location    | BC       |         Pr |
|-------+-------------+----------+------------|
| <35   | UpInQuad    | No       | 0.94697    |
| <35   | UpInQuad    | Invasive | 0.032197   |
| <35   | UpInQuad    | Insitu   | 0.0208333  |
| <35   | LolwOutQuad | No       | 0.986891   |
| <35   | LolwOutQuad | Invasive | 0.00749064 |
| <35   | LolwOutQuad | Insitu   | 0.00561798 |
| <35   | UpOutQuad   | No       | 0.946535   |
| <35   | UpOutQuad   | Invasive | 0.0237624  |
| <35   | UpOutQuad   | Insitu   | 0.029703   |
| <35   | LowInQuad   | No       | 0.966862   |
| <35   | LowInQuad   | Invasive | 0.0155945  |
| <35   | LowInQuad   | Insitu   | 0.0175439  |
| 35-49 | UpInQuad    | No       | 0.653481   |
| 35-49 | UpInQuad    | Invasive | 0.160601   |
| 35-49 | UpInQuad    | Insitu   | 0.185918   |
| 35-49 | LolwOutQuad | No       | 0.762726   |
| 35-49 | LolwOutQuad | Invasive | 0.140394   |
| 35-49 | LolwOutQuad | Insitu   | 0.0968801  |
| 35-49 | UpOutQuad   | No       | 0.546699   |
| 35-49 | UpOutQuad   | Invasive | 0.198873   |
| 35-49 | UpOutQuad   | Insitu   | 0.254428   |
| 35-49 | LowInQuad   | No       | 0.685275   |
| 35-49 | LowInQuad   | Invasive | 0.141586   |
| 35-49 | LowInQuad   | Insitu   | 0.173139   |
| 50-74 | UpInQuad    | No       | 0.493585   |
| 50-74 | UpInQuad    | Invasive | 0.345229   |
| 50-74 | UpInQuad    | Insitu   | 0.161187   |
| 50-74 | LolwOutQuad | No       | 0.549206   |
| 50-74 | LolwOutQuad | Invasive | 0.30119    |
| 50-74 | LolwOutQuad | Insitu   | 0.149603   |
| 50-74 | UpOutQuad   | No       | 0.499797   |
| 50-74 | UpOutQuad   | Invasive | 0.303289   |
| 50-74 | UpOutQuad   | Insitu   | 0.196914   |
| 50-74 | LowInQuad   | No       | 0.540348   |
| 50-74 | LowInQuad   | Invasive | 0.354826   |
| 50-74 | LowInQuad   | Insitu   | 0.104826   |
| >75   | UpInQuad    | No       | 0.655172   |
| >75   | UpInQuad    | Invasive | 0.201592   |
| >75   | UpInQuad    | Insitu   | 0.143236   |
| >75   | LolwOutQuad | No       | 0.641333   |
| >75   | LolwOutQuad | Invasive | 0.201333   |
| >75   | LolwOutQuad | Insitu   | 0.157333   |
| >75   | UpOutQuad   | No       | 0.620448   |
| >75   | UpOutQuad   | Invasive | 0.254902   |
| >75   | UpOutQuad   | Insitu   | 0.12465    |
| >75   | LowInQuad   | No       | 0.715447   |
| >75   | LowInQuad   | Invasive | 0.185637   |
| >75   | LowInQuad   | Insitu   | 0.098916   |


**************************************************
******************* Metastasis *******************
**************************************************
| BC       | Metastasis   |       Pr |
|----------+--------------+----------|
| No       | no           | 1        |
| No       | yes          | 0        |
| Invasive | no           | 0.103748 |
| Invasive | yes          | 0.896252 |
| Insitu   | no           | 0.856589 |
| Insitu   | yes          | 0.143411 |


**************************************************
******************* LymphNodes *******************
**************************************************
| Metastasis   | LymphNodes   |        Pr |
|--------------+--------------+-----------|
| no           | no           | 0.903327  |
| no           | yes          | 0.0966734 |
| yes          | no           | 0.156466  |
| yes          | yes          | 0.843534  |


**************************************************
*********************** MC ***********************
**************************************************
| BC       | MC   |        Pr |
|----------+------+-----------|
| No       | No   | 0.972026  |
| No       | Yes  | 0.0279743 |
| Invasive | No   | 0.530807  |
| Invasive | Yes  | 0.469193  |
| Insitu   | No   | 0.487315  |
| Insitu   | Yes  | 0.512685  |


**************************************************
****************** SkinRetract *******************
**************************************************
| BC       | FibrTissueDev   | SkinRetract   |       Pr |
|----------+-----------------+---------------+----------|
| No       | No              | No            | 0.951328 |
| No       | No              | Yes           | 0.048672 |
| No       | Yes             | No            | 0.653686 |
| No       | Yes             | Yes           | 0.346314 |
| Invasive | No              | No            | 0.643967 |
| Invasive | No              | Yes           | 0.356033 |
| Invasive | Yes             | No            | 0.156607 |
| Invasive | Yes             | Yes           | 0.843393 |
| Insitu   | No              | No            | 0.756242 |
| Insitu   | No              | Yes           | 0.243758 |
| Insitu   | Yes             | No            | 0.327508 |
| Insitu   | Yes             | Yes           | 0.672492 |


**************************************************
**************** NippleDischarge *****************
**************************************************
| BC       | FibrTissueDev   | NippleDischarge   |        Pr |
|----------+-----------------+-------------------+-----------|
| No       | No              | No                | 0.95082   |
| No       | No              | Yes               | 0.0491803 |
| No       | Yes             | No                | 0.664406  |
| No       | Yes             | Yes               | 0.335594  |
| Invasive | No              | No                | 0.664077  |
| Invasive | No              | Yes               | 0.335923  |
| Invasive | Yes             | No                | 0.153452  |
| Invasive | Yes             | Yes               | 0.846548  |
| Insitu   | No              | No                | 0.768725  |
| Insitu   | No              | Yes               | 0.231275  |
| Insitu   | Yes             | No                | 0.366261  |
| Insitu   | Yes             | Yes               | 0.633739  |


**************************************************
*********************** AD ***********************
**************************************************
| BC       | AD   |        Pr |
|----------+------+-----------|
| No       | No   | 0.948071  |
| No       | Yes  | 0.0519293 |
| Invasive | No   | 0.54711   |
| Invasive | Yes  | 0.45289   |
| Insitu   | No   | 0.703665  |
| Insitu   | Yes  | 0.296335  |


**************************************************
***************** FibrTissueDev ******************
**************************************************
| AD   | FibrTissueDev   |       Pr |
|------+-----------------+----------|
| No   | No              | 0.650443 |
| No   | Yes             | 0.349557 |
| Yes  | No              | 0.255929 |
| Yes  | Yes             | 0.744071 |


**************************************************
****************** Spiculation *******************
**************************************************
| FibrTissueDev   | Spiculation   |       Pr |
|-----------------+---------------+----------|
| No              | No            | 0.850246 |
| No              | Yes           | 0.149754 |
| Yes             | No            | 0.255046 |
| Yes             | Yes           | 0.744954 |


**************************************************
********************* Margin *********************
**************************************************
| Mass   | Spiculation   | Margin       |        Pr |
|--------+---------------+--------------+-----------|
| No     | No            | Well-defined | 1         |
| No     | No            | Ill-defined  | 0         |
| No     | Yes           | Well-defined | 0         |
| No     | Yes           | Ill-defined  | 1         |
| Benign | No            | Well-defined | 0.747629  |
| Benign | No            | Ill-defined  | 0.252371  |
| Benign | Yes           | Well-defined | 0.35725   |
| Benign | Yes           | Ill-defined  | 0.64275   |
| Malign | No            | Well-defined | 0.206498  |
| Malign | No            | Ill-defined  | 0.793502  |
| Malign | Yes           | Well-defined | 0.0555556 |
| Malign | Yes           | Ill-defined  | 0.944444  |


**************************************************
********************* Shape **********************
**************************************************
| Mass   | Shape     |        Pr |
|--------+-----------+-----------|
| No     | Other     | 1         |
| No     | Oval      | 0         |
| No     | Round     | 0         |
| No     | Irregular | 0         |
| Benign | Other     | 0.0553152 |
| Benign | Oval      | 0.239184  |
| Benign | Round     | 0.652967  |
| Benign | Irregular | 0.052534  |
| Malign | Other     | 0         |
| Malign | Oval      | 0.153493  |
| Malign | Round     | 0.104907  |
| Malign | Irregular | 0.7416    |


Task 3 [25 Marks] - Sampling

Use forward sampling to generate 1000 samples from the Breast Cancer Bayesian Network. Comment about the time complexity of the procedure and accuracy of the estimates. What happens as you add more observed variables in the query in terms of accuracy and effective sample size?
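
One way to see the effect of observed variables is to answer a query from the forward samples by discarding every sample that disagrees with the evidence. The samples that survive are the effective sample size, and the standard error of the estimated probability shrinks roughly with the square root of that number. The sketch below uses a tiny hand-written sample list; the helper `estimate_query` and the four samples are hypothetical and only illustrate the bookkeeping.

```python
def estimate_query(samples, query, evidence):
    """Estimate P(query | evidence) from forward samples.

    Returns the estimate and the number of samples consistent with
    the evidence (the effective sample size).
    """
    kept = [s for s in samples
            if all(s[k] == v for k, v in evidence.items())]
    if not kept:
        # No sample matches the evidence, so nothing can be estimated
        return None, 0
    hits = sum(all(s[k] == v for k, v in query.items()) for s in kept)
    return hits / len(kept), len(kept)

# Hypothetical samples restricted to three variables for brevity
samples = [
    {"BC": "No", "MC": "No", "AD": "No"},
    {"BC": "No", "MC": "Yes", "AD": "No"},
    {"BC": "Invasive", "MC": "Yes", "AD": "Yes"},
    {"BC": "Invasive", "MC": "Yes", "AD": "No"},
]

# One observed variable: three of the four samples survive
p1, n1 = estimate_query(samples, {"BC": "Invasive"}, {"MC": "Yes"})
# Two observed variables: only one sample survives
p2, n2 = estimate_query(samples, {"BC": "Invasive"},
                        {"MC": "Yes", "AD": "Yes"})
print(p1, n1, p2, n2)
```

Each extra observed variable can only shrink the set of surviving samples, so estimates conditioned on unlikely evidence rest on very few samples and become correspondingly less accurate.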

In [8]:
def generate_instances(G, GTopo, prob_tables, number_of_instances):
    GT = transposeGraph(G)
    #GTopo is currently unused: generate_variable samples any missing parents recursively, so no explicit topological order is needed
    instances = []
    for instance in range (number_of_instances):
        tempInstance = generate_instance(GT, prob_tables)
        instances.append(tempInstance)
    headers = [list(instances[0].keys())]
    data = headers + [list(i.values()) for i in instances]
    return data

def generate_instance(GT, prob_tables):
    instance = {}
    for node in GT:
        generate_variable(node, prob_tables, GT, instance)
    return instance
    
def generate_variable(node, prob_tables, GT, instance):
    parents = GT[node]
    #make sure every parent has been sampled first
    for parent in parents:
        if parent not in instance:
            generate_variable(parent, prob_tables, GT, instance)
    if node not in instance:
        generate_random_variable(parents, node, prob_tables, instance)

def generate_random_variable(parents, node, prob_tables, instance):
    parentValues = [instance[val] for val in parents]
    randomVal = random.uniform(0,1)
    assignedVal = None
    items = [list(key) +[item] for key, item in prob_tables[node]["table"].items()]
    
    #keep only the rows of the table that match the sampled parent values
    new_list = [i for i in items if i[:len(parentValues)] == parentValues]
    
    #walk the cumulative distribution until it passes the random draw
    runningProb = 0
    for x in new_list:
        if float(x[-1]) > 0:
            #fallback in case rounding leaves the cumulative sum just below randomVal
            assignedVal = x[-2]
        runningProb+=float(x[-1])
        if randomVal < runningProb:
            assignedVal = x[-2]
            break
    instance[node] = assignedVal

#Testing
#Topological order of the nodes reachable from "Location" (not used by generate_instances)
breastCancerGraphTopo = topologicalSort(breastCancerGraph, "Location")

#generate 10000 instances and save result
generatedInstances = generate_instances(breastCancerGraph, breastCancerGraphTopo, prob_tables,10000)
with open('generated_instances.csv', 'w', newline='') as writeFile:
    writer = csv.writer(writeFile)
    writer.writerows(generatedInstances)

#test generated instances by using the learn_bayes_net function and comparing the generated prob_tables with the originals
outcomeSpace = {}
prob_tables = {}
learn_bayes_net(breastCancerGraph, "generated_instances.csv", outcomeSpace, prob_tables)

for f in prob_tables.keys():
    print("*"*50)
    buffer = int((50-(len(f)+2))/2)
    print("*"*buffer + " " + f + " " + "*"*(50-buffer-len(f)-2))
    print("*"*50)
    printFactor(prob_tables[f])
    print("\n")
**************************************************
********************** Age ***********************
**************************************************
| Age   |     Pr |
|-------+--------|
| 50-74 | 0.4935 |
| <35   | 0.1065 |
| 35-49 | 0.2479 |
| >75   | 0.1521 |


**************************************************
******************** Location ********************
**************************************************
| Location    |     Pr |
|-------------+--------|
| LowInQuad   | 0.2465 |
| UpOutQuad   | 0.2488 |
| UpInQuad    | 0.2516 |
| LolwOutQuad | 0.2531 |


**************************************************
***************** BreastDensity ******************
**************************************************
| BreastDensity   |     Pr |
|-----------------+--------|
| high            | 0.3039 |
| medium          | 0.4991 |
| low             | 0.197  |


**************************************************
********************** Size **********************
**************************************************
| Mass   | Size   |       Pr |
|--------+--------+----------|
| Malign | >3cm   | 0.163127 |
| Malign | <1cm   | 0.270753 |
| Malign | 1-3cm  | 0.56612  |
| No     | >3cm   | 0        |
| No     | <1cm   | 1        |
| No     | 1-3cm  | 0        |
| Benign | >3cm   | 0.648683 |
| Benign | <1cm   | 0.091201 |
| Benign | 1-3cm  | 0.260116 |


**************************************************
********************** Mass **********************
**************************************************
| BreastDensity   | BC       | Mass   |        Pr |
|-----------------+----------+--------+-----------|
| high            | Insitu   | Malign | 0.391304  |
| high            | Insitu   | No     | 0.214976  |
| high            | Insitu   | Benign | 0.39372   |
| high            | No       | Malign | 0         |
| high            | No       | No     | 0.85964   |
| high            | No       | Benign | 0.14036   |
| high            | Invasive | Malign | 0.788331  |
| high            | Invasive | No     | 0.115332  |
| high            | Invasive | Benign | 0.0963365 |
| medium          | Insitu   | Malign | 0.337243  |
| medium          | Insitu   | No     | 0.28739   |
| medium          | Insitu   | Benign | 0.375367  |
| medium          | No       | Malign | 0         |
| medium          | No       | No     | 0.890956  |
| medium          | No       | Benign | 0.109044  |
| medium          | Invasive | Malign | 0.624685  |
| medium          | Invasive | No     | 0.207389  |
| medium          | Invasive | Benign | 0.167926  |
| low             | Insitu   | Malign | 0.333333  |
| low             | Insitu   | No     | 0.268116  |
| low             | Insitu   | Benign | 0.398551  |
| low             | No       | Malign | 0         |
| low             | No       | No     | 0.948988  |
| low             | No       | Benign | 0.0510121 |
| low             | Invasive | Malign | 0.572985  |
| low             | Invasive | No     | 0.233115  |
| low             | Invasive | Benign | 0.1939    |


**************************************************
*********************** BC ***********************
**************************************************
| Age   | Location    | BC       |         Pr |
|-------+-------------+----------+------------|
| 50-74 | LowInQuad   | Insitu   | 0.103188   |
| 50-74 | LowInQuad   | No       | 0.542785   |
| 50-74 | LowInQuad   | Invasive | 0.354027   |
| 50-74 | UpOutQuad   | Insitu   | 0.178542   |
| 50-74 | UpOutQuad   | No       | 0.502048   |
| 50-74 | UpOutQuad   | Invasive | 0.31941    |
| 50-74 | UpInQuad    | Insitu   | 0.126613   |
| 50-74 | UpInQuad    | No       | 0.507258   |
| 50-74 | UpInQuad    | Invasive | 0.366129   |
| 50-74 | LolwOutQuad | Insitu   | 0.164587   |
| 50-74 | LolwOutQuad | No       | 0.549142   |
| 50-74 | LolwOutQuad | Invasive | 0.286271   |
| <35   | LowInQuad   | Insitu   | 0.0180505  |
| <35   | LowInQuad   | No       | 0.963899   |
| <35   | LowInQuad   | Invasive | 0.0180505  |
| <35   | UpOutQuad   | Insitu   | 0.0265152  |
| <35   | UpOutQuad   | No       | 0.965909   |
| <35   | UpOutQuad   | Invasive | 0.00757576 |
| <35   | UpInQuad    | Insitu   | 0.0180505  |
| <35   | UpInQuad    | No       | 0.945848   |
| <35   | UpInQuad    | Invasive | 0.0361011  |
| <35   | LolwOutQuad | Insitu   | 0.0121457  |
| <35   | LolwOutQuad | No       | 0.987854   |
| <35   | LolwOutQuad | Invasive | 0          |
| 35-49 | LowInQuad   | Insitu   | 0.170111   |
| 35-49 | LowInQuad   | No       | 0.680445   |
| 35-49 | LowInQuad   | Invasive | 0.149444   |
| 35-49 | UpOutQuad   | Insitu   | 0.278583   |
| 35-49 | UpOutQuad   | No       | 0.520129   |
| 35-49 | UpOutQuad   | Invasive | 0.201288   |
| 35-49 | UpInQuad    | Insitu   | 0.183007   |
| 35-49 | UpInQuad    | No       | 0.660131   |
| 35-49 | UpInQuad    | Invasive | 0.156863   |
| 35-49 | LolwOutQuad | Insitu   | 0.102107   |
| 35-49 | LolwOutQuad | No       | 0.758509   |
| 35-49 | LolwOutQuad | Invasive | 0.139384   |
| >75   | LowInQuad   | Insitu   | 0.0980926  |
| >75   | LowInQuad   | No       | 0.716621   |
| >75   | LowInQuad   | Invasive | 0.185286   |
| >75   | UpOutQuad   | Insitu   | 0.117801   |
| >75   | UpOutQuad   | No       | 0.594241   |
| >75   | UpOutQuad   | Invasive | 0.287958   |
| >75   | UpInQuad    | Insitu   | 0.134367   |
| >75   | UpInQuad    | No       | 0.640827   |
| >75   | UpInQuad    | Invasive | 0.224806   |
| >75   | LolwOutQuad | Insitu   | 0.142857   |
| >75   | LolwOutQuad | No       | 0.672727   |
| >75   | LolwOutQuad | Invasive | 0.184416   |


**************************************************
******************* Metastasis *******************
**************************************************
| BC       | Metastasis   |       Pr |
|----------+--------------+----------|
| Insitu   | no           | 0.848397 |
| Insitu   | yes          | 0.151603 |
| No       | no           | 1        |
| No       | yes          | 0        |
| Invasive | no           | 0.108923 |
| Invasive | yes          | 0.891077 |


**************************************************
******************* LymphNodes *******************
**************************************************
| Metastasis   | LymphNodes   |        Pr |
|--------------+--------------+-----------|
| no           | yes          | 0.0986301 |
| no           | no           | 0.90137   |
| yes          | yes          | 0.837687  |
| yes          | no           | 0.162313  |


**************************************************
*********************** MC ***********************
**************************************************
| BC       | MC   |        Pr |
|----------+------+-----------|
| Insitu   | No   | 0.501458  |
| Insitu   | Yes  | 0.498542  |
| No       | No   | 0.973722  |
| No       | Yes  | 0.0262778 |
| Invasive | No   | 0.529535  |
| Invasive | Yes  | 0.470465  |


**************************************************
****************** SkinRetract *******************
**************************************************
| BC       | FibrTissueDev   | SkinRetract   |        Pr |
|----------+-----------------+---------------+-----------|
| Insitu   | Yes             | Yes           | 0.646204  |
| Insitu   | Yes             | No            | 0.353796  |
| Insitu   | No              | Yes           | 0.233732  |
| Insitu   | No              | No            | 0.766268  |
| No       | Yes             | Yes           | 0.346555  |
| No       | Yes             | No            | 0.653445  |
| No       | No              | Yes           | 0.0527821 |
| No       | No              | No            | 0.947218  |
| Invasive | Yes             | Yes           | 0.830601  |
| Invasive | Yes             | No            | 0.169399  |
| Invasive | No              | Yes           | 0.351718  |
| Invasive | No              | No            | 0.648282  |


**************************************************
**************** NippleDischarge *****************
**************************************************
| BC       | FibrTissueDev   | NippleDischarge   |       Pr |
|----------+-----------------+-------------------+----------|
| Insitu   | Yes             | No                | 0.376414 |
| Insitu   | Yes             | Yes               | 0.623586 |
| Insitu   | No              | No                | 0.756972 |
| Insitu   | No              | Yes               | 0.243028 |
| No       | Yes             | No                | 0.672234 |
| No       | Yes             | Yes               | 0.327766 |
| No       | No              | No                | 0.950338 |
| No       | No              | Yes               | 0.049662 |
| Invasive | Yes             | No                | 0.148322 |
| Invasive | Yes             | Yes               | 0.851678 |
| Invasive | No              | No                | 0.69349  |
| Invasive | No              | Yes               | 0.30651  |


**************************************************
*********************** AD ***********************
**************************************************
| BC       | AD   |        Pr |
|----------+------+-----------|
| Insitu   | No   | 0.73105   |
| Insitu   | Yes  | 0.26895   |
| No       | No   | 0.95113   |
| No       | Yes  | 0.0488704 |
| Invasive | No   | 0.534143  |
| Invasive | Yes  | 0.465857  |


**************************************************
***************** FibrTissueDev ******************
**************************************************
| AD   | FibrTissueDev   |       Pr |
|------+-----------------+----------|
| No   | Yes             | 0.360117 |
| No   | No              | 0.639883 |
| Yes  | Yes             | 0.7486   |
| Yes  | No              | 0.2514   |


**************************************************
****************** Spiculation *******************
**************************************************
| FibrTissueDev   | Spiculation   |       Pr |
|-----------------+---------------+----------|
| Yes             | No            | 0.250058 |
| Yes             | Yes           | 0.749942 |
| No              | No            | 0.855215 |
| No              | Yes           | 0.144785 |


**************************************************
********************* Margin *********************
**************************************************
| Mass   | Spiculation   | Margin       |        Pr |
|--------+---------------+--------------+-----------|
| Malign | No            | Ill-defined  | 0.788121  |
| Malign | No            | Well-defined | 0.211879  |
| Malign | Yes           | Ill-defined  | 0.949153  |
| Malign | Yes           | Well-defined | 0.0508475 |
| No     | No            | Ill-defined  | 0         |
| No     | No            | Well-defined | 1         |
| No     | Yes           | Ill-defined  | 1         |
| No     | Yes           | Well-defined | 0         |
| Benign | No            | Ill-defined  | 0.263217  |
| Benign | No            | Well-defined | 0.736783  |
| Benign | Yes           | Ill-defined  | 0.637725  |
| Benign | Yes           | Well-defined | 0.362275  |


**************************************************
********************* Shape **********************
**************************************************
| Mass   | Shape     |        Pr |
|--------+-----------+-----------|
| Malign | Oval      | 0.162162  |
| Malign | Other     | 0         |
| Malign | Round     | 0.102799  |
| Malign | Irregular | 0.735039  |
| No     | Oval      | 0         |
| No     | Other     | 1         |
| No     | Round     | 0         |
| No     | Irregular | 0         |
| Benign | Oval      | 0.225434  |
| Benign | Other     | 0.0552344 |
| Benign | Round     | 0.671805  |
| Benign | Irregular | 0.0475273 |


Task 4 [25 Marks] - Classification

Use the Bayesian Network to classify cases of the dataset. Propose an experimental setup to estimate the classification error. Compare the classification error of the Bayesian Network with your favourite Machine Learning classifier.

Comparison classification in separate notebook.

We used ensemble learning and reached ~90% accuracy; more information (source code) is provided at the bottom of this notebook.
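
One possible experimental setup for estimating the classification error is k-fold cross-validation: parameters are learned on k-1 folds and the error is measured on the held-out fold. The sketch below is plain Python and is not necessarily the setup used in the report. The names `k_fold_indices`, `cross_val_error`, `fit` and `predict` are hypothetical, and the majority-class model is only a placeholder classifier on toy labels.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is used as test exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_val_error(rows, labels, fit, predict, k=5):
    """Average misclassification rate over k folds (assumes len(rows) >= k)."""
    errors = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        wrong = sum(predict(model, rows[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k

# Placeholder classifier: always predicts the most frequent training label
def fit_majority(train_rows, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, row):
    return model

toy_rows = list(range(10))
toy_labels = ["No"] * 6 + ["Invasive"] * 4
toy_error = cross_val_error(toy_rows, toy_labels, fit_majority, predict_majority, k=5)
print(toy_error)
```

For the Bayesian Network, `fit` would re-estimate the probability tables on the training fold and `predict` would run the BC query on each held-out row.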

In [9]:
def join(f1, f2, outcomeSpace):
    """
    argument 
    `f1`, first factor to be joined.
    `f2`, second factor to be joined.
    `outcomeSpace`, dictionary with the domain of each variable
    
    Returns a new factor with a join of f1 and f2
    """
    if f1 == {}:
        return f2
    # First, we need to determine the domain of the new factor. It will be union of the domain in f1 and f2
    # But it is important to eliminate the repetitions
    common_vars = list(f1['dom']) + list(set(f2['dom']) - set(f1['dom']))
    
    # We will build a table from scratch, starting with an empty list. Later on, we will transform the list into an odict
    table = list()
    
    # Here is where the magic happens. The product iterator will generate all combinations of variable values 
    # as specified in outcomeSpace. Therefore, it will naturally respect observed values
    for entries in product(*[outcomeSpace[node] for node in common_vars]):
        
        # We need to map the entries to the domain of the factors f1 and f2
        entryDict = dict(zip(common_vars, entries))
        f1_entry = (entryDict[var] for var in f1['dom'])
        f2_entry = (entryDict[var] for var in f2['dom'])
        
        # Look up each factor at its own projection of the current entry
        p1 = prob(f1, *f1_entry)           # Use the function prob to get the probability in factor f1 for entry f1_entry 
        p2 = prob(f2, *f2_entry)           # Use the function prob to get the probability in factor f2 for entry f2_entry 
        
        # Create a new table entry with the multiplication of p1 and p2
        table.append((entries, p1 * p2))
    return {'dom': tuple(common_vars), 'table': odict(table)}

def p_joint(outcomeSpace, cond_tables):#=cond_tables_ml):
    """
    argument 
    `outcomeSpace`, dictionary with domain of each variable
    `cond_tables`, conditional probability distributions estimated from data
    
    Returns a new factor with full joint distribution
    """    
    p = {}
    for table in cond_tables.keys():
        p = join(p, cond_tables[table], outcomeSpace)

    return p
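
To make the join concrete, here is a minimal self-contained sketch on two toy factors with made-up numbers, P(A) and P(B | A). It uses the same `{'dom': ..., 'table': ...}` layout as the cell above. `lookup` is a stand-in for the notebook's `prob` helper and is assumed to be a plain table lookup; `join_sketch` is a hypothetical name.

```python
from collections import OrderedDict as odict
from itertools import product

def lookup(factor, *entry):
    # Stand-in for the notebook's prob(): assumed to be a direct table lookup
    return factor['table'][entry]

def join_sketch(f1, f2, outcomeSpace):
    # Union of the two domains, keeping f1's order and appending f2's new variables
    dom = list(f1['dom']) + [v for v in f2['dom'] if v not in f1['dom']]
    table = []
    for entries in product(*[outcomeSpace[v] for v in dom]):
        row = dict(zip(dom, entries))
        p1 = lookup(f1, *[row[v] for v in f1['dom']])
        p2 = lookup(f2, *[row[v] for v in f2['dom']])
        table.append((entries, p1 * p2))
    return {'dom': tuple(dom), 'table': odict(table)}

# Toy factors (made-up numbers, not taken from the dataset)
space = {'A': ('yes', 'no'), 'B': ('yes', 'no')}
pA = {'dom': ('A',), 'table': odict([(('yes',), 0.3), (('no',), 0.7)])}
pB_given_A = {'dom': ('A', 'B'), 'table': odict([
    (('yes', 'yes'), 0.9), (('yes', 'no'), 0.1),
    (('no', 'yes'), 0.2), (('no', 'no'), 0.8)])}

pAB = join_sketch(pA, pB_given_A, space)
for entry, p in pAB['table'].items():
    print(entry, p)
```

The list comprehension keeps the variable order deterministic, whereas a set difference may order the extra variables differently between runs. Either way the result is valid because the order is recorded in `dom`.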
In [10]:
def evidence(var, e, outcomeSpace):
    """
    argument 
    `var`, a valid variable identifier.
    `e`, the observed value for var.
    `outcomeSpace`, dictionary with the domain of each variable
    
    Returns dictionary with a copy of outcomeSpace with var = e
    """    
    newOutcomeSpace = outcomeSpace.copy()      # Make a shallow copy so the caller's outcomeSpace is left untouched. 1 line
    newOutcomeSpace[var] = (e,)                # Replace the domain of variable var with a tuple with a single element e. 1 line
    return newOutcomeSpace

def marginalize(f, var, outcomeSpace):
    """
    argument 
    `f`, factor to be marginalized.
    `var`, variable to be summed out.
    `outcomeSpace`, dictionary with the domain of each variable
    
    Returns a new factor f' with dom(f') = dom(f) - {var}
    """    
    
    # Let's make a copy of f domain and convert it to a list. We need a list to be able to modify its elements
    new_dom = list(f['dom'])
    
    #########################
    # Insert your code here #
    #########################
    new_dom.remove(var)            # Remove var from the list new_dom by calling the method remove(). 1 line
    table = list()                 # Create an empty list for table. We will fill in table from scratch. 1 line
    for entries in product(*[outcomeSpace[node] for node in new_dom]):
        s = 0                      # Initialize the summation variable s. 1 line

        # We need to iterate over all possible outcomes of the variable var
        for val in outcomeSpace[var]:
            # To modify the tuple entries, we will need to convert it to a list
            entriesList = list(entries)
            # We need to insert the value of var in the right position in entriesList
            entriesList.insert(f['dom'].index(var), val)
            

            #########################
            # Insert your code here #
            #########################
            
            p = prob(f, *tuple(entriesList))     # Calculate the probability of factor f for entriesList. 1 line
            s = s + p                            # Sum over all values of var by accumulating the sum in s. 1 line
            
        # Create a new table entry with the sum s over all values of var
        table.append((entries, s))
    return {'dom': tuple(new_dom), 'table': odict(table)}

def normalize(f):
    """
    argument 
    `f`, factor to be normalized.
    
    Returns a new factor f' as a copy of f with entries that sum up to 1
    """ 
    table = list()
    total = 0                      # named total to avoid shadowing the built-in sum()
    for k, p in f['table'].items():
        total = total + p
    for k, p in f['table'].items():
        table.append((k, p/total))
    return {'dom': f['dom'], 'table': odict(table)}

def query(p, outcomeSpace, q_vars, **q_evi):
    """
    argument 
    `p`, probability table to query.
    `outcomeSpace`, dictionary with variable domains
    `q_vars`, list of variables in query head
    `q_evi`, dictionary of evidence in the form of variable names and values
    
    Returns a new NORMALIZED factor with all hidden variables eliminated and evidence set as in q_evi
    """     
    
    # Accept a single variable name as well as a list. With a bare string,
    # the membership test below would silently become a substring check
    if isinstance(q_vars, str):
        q_vars = [q_vars]
    
    # Let's make a copy of the factor, since we will reuse the variable name
    pm = p.copy()
    
    # First, we set the evidence. evidence() returns a new dictionary, so the caller's outcomeSpace is not modified
    for var_evi, e in q_evi.items():
        outcomeSpace = evidence(var_evi, e, outcomeSpace)  # Set the evidence var_evi = e
        
    # Second, we eliminate hidden variables NOT in the query
    for var in outcomeSpace:
        if var not in q_vars:
            pm = marginalize(pm, var, outcomeSpace)  # Marginalize to eliminate variable var
            
    # Third, return a normalized factor with the query answer
    return normalize(pm)
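
The three steps of the query (set evidence, sum out the remaining variables, normalize) can be illustrated on a toy joint over two variables. This is a minimal sketch with made-up numbers. `condition_sketch` is a hypothetical helper that filters rows directly instead of shrinking the outcome space, so it is not the implementation above.

```python
from collections import OrderedDict as odict

def condition_sketch(joint, query_var, **evidence):
    """P(query_var | evidence) from a full joint factor."""
    dom = joint['dom']
    q_pos = dom.index(query_var)
    totals = odict()
    for entries, p in joint['table'].items():
        row = dict(zip(dom, entries))
        # Keep only the rows that agree with the evidence
        if all(row[var] == val for var, val in evidence.items()):
            key = (entries[q_pos],)
            # Summing here eliminates every variable other than query_var
            totals[key] = totals.get(key, 0.0) + p
    z = sum(totals.values())   # probability of the evidence; assumed to be > 0
    return {'dom': (query_var,), 'table': odict((k, v / z) for k, v in totals.items())}

# Toy joint P(A, B) with made-up numbers
toy_joint = {'dom': ('A', 'B'), 'table': odict([
    (('yes', 'yes'), 0.27), (('yes', 'no'), 0.03),
    (('no', 'yes'), 0.14), (('no', 'no'), 0.56)])}

posterior = condition_sketch(toy_joint, 'A', B='yes')
print(posterior['table'])
```

Here P(A = yes | B = yes) = 0.27 / (0.27 + 0.14). The classifier applies the same idea to BC, with all other variables as evidence.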
In [11]:
# Hard-coded to the breast cancer graph: queries BC given every other column of the row
def QueryOnDataFrameRow(row, p, outcomeSpace):
    q = query(  p,
                outcomeSpace,
                'BC',
                Age=row["Age"],
                Location=row["Location"],
                BreastDensity=row["BreastDensity"],
                Size=row["Size"],
                Mass=row["Mass"],
                #BC=row["BC"],
                Metastasis=row["Metastasis"],
                LymphNodes=row["LymphNodes"],
                MC=row["MC"],
                SkinRetract=row["SkinRetract"],
                NippleDischarge=row["NippleDischarge"],
                AD=row["AD"],
                FibrTissueDev=row["FibrTissueDev"],
                Spiculation=row["Spiculation"],
                Margin=row["Margin"],
                Shape=row["Shape"],
    )
    
    # Flatten each table entry to [BC value, probability], then pick the most probable BC value
    preds = [list(key) + [item] for key, item in q["table"].items()]
    preds.sort(key=lambda x: x[-1])
    y_pred = preds[-1][0]
    confidence = preds[-1][1]
    return y_pred, confidence

def PredictOnDataFrame(df,p,outcomeSpace):
    preds = df.apply(lambda row: QueryOnDataFrameRow(row,p,outcomeSpace), axis=1)
    print(preds)
    df["y_pred"], df['confidence'] = zip(*preds)
    return df

def PredictOnFile(file, outfile, p, outcomeSpace):
    with open(file) as f:
        data = pd.read_csv(f)
        data = PredictOnDataFrame(data,p,outcomeSpace)
        data.to_csv(outfile, index=False)
        return data
In [12]:
# Testing
p = p_joint(outcomeSpace, prob_tables) #takes a long time
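
The long running time follows from the size of the full joint table, which has one entry per combination of variable values. A quick calculation, using the domain sizes reported in the `unique` row of `originalData.describe()` in the Classifier section:

```python
from functools import reduce
from operator import mul

# Number of distinct values per variable, as reported by originalData.describe()
domain_sizes = {
    'AD': 2, 'Age': 4, 'BC': 3, 'BreastDensity': 3, 'FibrTissueDev': 2,
    'Location': 4, 'LymphNodes': 2, 'MC': 2, 'Margin': 2, 'Mass': 3,
    'Metastasis': 2, 'NippleDischarge': 2, 'Shape': 4, 'Size': 3,
    'SkinRetract': 2, 'Spiculation': 2,
}

joint_entries = reduce(mul, domain_sizes.values(), 1)
print(joint_entries)  # 2654208
```

The table grows exponentially with the number of variables, so building the full joint is only feasible for small networks such as this one.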
In [13]:
#Testing
print(query(p, outcomeSpace, 'BC', Shape= "Other"))
{'dom': ('BC',), 'table': OrderedDict([(('Insitu',), 0.059963708302090066), (('No',), 0.8687516538711714), (('Invasive',), 0.07128463782673854)])}
In [14]:
#Testing
predictionsDF = PredictOnFile("bc 2.csv", "bc 2 predictions.csv", p, outcomeSpace)
0              (No, 0.9985225253250584)
1              (No, 0.9923645447884393)
2        (Invasive, 0.7758281849208036)
3        (Invasive, 0.7636710463578393)
4        (Invasive, 0.9165992976685713)
                      ...              
19996    (Invasive, 0.8511865179370092)
19997    (Invasive, 0.9650510118709286)
19998          (No, 0.9919794659904102)
19999          (No, 0.9734414562421011)
20000          (No, 0.9823969273691606)
Length: 20001, dtype: object
In [15]:
cm = metrics.confusion_matrix(predictionsDF["BC"],predictionsDF["y_pred"])

print(metrics.classification_report(predictionsDF["BC"],predictionsDF["y_pred"]))
#plt.matshow(cm)
#plt.title('Confusion matrix of the classifier')
#plt.colorbar()
#plt.show()
df = pd.DataFrame(predictionsDF, columns=['BC','y_pred'])
confusion_matrix = pd.crosstab(df['BC'], df['y_pred'], rownames=['Actual'], colnames=['Predicted'], margins = True)

sn.set(font_scale=1)
sn.heatmap(confusion_matrix,linewidths=2,)
              precision    recall  f1-score   support

      Insitu       0.75      0.60      0.67      2838
    Invasive       0.90      0.91      0.90      4723
          No       0.94      0.98      0.96     12440

    accuracy                           0.91     20001
   macro avg       0.87      0.83      0.85     20001
weighted avg       0.91      0.91      0.91     20001

Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1dad9d16b48>
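
For reference, the per-class numbers in the classification report can be derived from the (actual, predicted) pairs alone. The following is a minimal sketch in plain Python on toy labels; `per_class_report` is a hypothetical helper, not the scikit-learn implementation.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Precision, recall, F1 and support for every class."""
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(c, c)]
        predicted = sum(n for (t, p), n in pairs.items() if p == c)
        actual = sum(n for (t, p), n in pairs.items() if t == c)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {'precision': precision, 'recall': recall,
                     'f1': f1, 'support': actual}
    return report

# Toy labels, not taken from the dataset
toy_true = ['No', 'No', 'No', 'Insitu', 'Insitu', 'Invasive']
toy_pred = ['No', 'No', 'Insitu', 'Insitu', 'No', 'Invasive']

toy_report = per_class_report(toy_true, toy_pred)
for label, scores in toy_report.items():
    print(label, scores)
```

As a check against the report above, the Insitu row has precision 0.75 and recall 0.60, which gives F1 = 2 * 0.75 * 0.60 / 1.35 ≈ 0.67.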

Task 5 [25 Marks] - Report

Write a two-page report (around 1000 words) summarising your findings in this assignment. Some suggestions for the report are:

  1. Which were the main challenges and how you solved these issues?
  2. Answer the questions of each task.
  3. Discuss the complexity of the implemented algorithms.
  4. Include plots to illustrate your results.

Answers are provided in the PDF report.

Classifier

In [16]:
%matplotlib inline
In [17]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv("bc 2.csv")
data = data.reindex(sorted(data.columns), axis=1)
originalData = data.copy()
labels = data["BC"]
In [18]:
data.head()
Out[18]:
AD Age BC BreastDensity FibrTissueDev Location LymphNodes MC Margin Mass Metastasis NippleDischarge Shape Size SkinRetract Spiculation
0 No <35 No medium No UpInQuad no No Well-defined No no No Other <1cm No No
1 No 35-49 No high No LolwOutQuad no No Well-defined No no No Other <1cm No No
2 No 50-74 Invasive medium No UpOutQuad yes Yes Well-defined Benign yes Yes Oval <1cm No No
3 No 50-74 Invasive low Yes UpInQuad yes No Well-defined Benign yes No Other 1-3cm Yes Yes
4 No >75 Invasive medium No LowInQuad no Yes Ill-defined Malign yes No Round <1cm No No
In [19]:
originalData.describe()
Out[19]:
AD Age BC BreastDensity FibrTissueDev Location LymphNodes MC Margin Mass Metastasis NippleDischarge Shape Size SkinRetract Spiculation
count 20001 20001 20001 20001 20001 20001 20001 20001 20001 20001 20001 20001 20001 20001 20001 20001
unique 2 4 3 3 2 4 2 2 2 3 2 2 4 3 2 2
top No 50-74 No medium No UpInQuad no No Well-defined No no No Other <1cm No No
freq 16375 10005 12440 9983 11579 5040 14602 15982 10314 12628 15361 14013 12807 14151 13862 11993
In [20]:
#load domains for each feature
for col in data.columns:
    domain = data[col].unique()
    print(col,domain)
AD ['No' 'Yes']
Age ['<35' '35-49' '50-74' '>75']
BC ['No' 'Invasive' 'Insitu']
BreastDensity ['medium' 'high' 'low']
FibrTissueDev ['No' 'Yes']
Location ['UpInQuad' 'LolwOutQuad' 'UpOutQuad' 'LowInQuad']
LymphNodes ['no' 'yes']
MC ['No' 'Yes']
Margin ['Well-defined' 'Ill-defined']
Mass ['No' 'Benign' 'Malign']
Metastasis ['no' 'yes']
NippleDischarge ['No' 'Yes']
Shape ['Other' 'Oval' 'Round' 'Irregular']
Size ['<1cm' '1-3cm' '>3cm']
SkinRetract ['No' 'Yes']
Spiculation ['No' 'Yes']

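Age, BreastDensity and Size are ordinal.

AD, FibrTissueDev, LymphNodes, MC, Margin, Metastasis, NippleDischarge, SkinRetract and Spiculation are binary.

BC, Location, Mass and Shape are nominal and need to be one-hot encoded (OHE). BC is the label, so it can be removed from the features. In the cells below only Location and Shape are one-hot encoded; Mass is label-encoded together with the binary columns.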
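
The difference between the two encodings can be sketched in plain Python on a single row. Ordinal values keep their order as integers, while nominal values become one indicator per category. `age_to_rank` and `one_hot` are hypothetical names used for illustration only; the cells below use pandas and scikit-learn instead.

```python
# Ordinal: the position in the ordered list becomes the code
AGE_ORDER = ['<35', '35-49', '50-74', '>75']
age_to_rank = {value: rank for rank, value in enumerate(AGE_ORDER)}

def one_hot(value, categories, prefix):
    """Return one 0/1 indicator per category, named prefix_category."""
    return {f"{prefix}_{c}": int(value == c) for c in categories}

row = {'Age': '50-74', 'Shape': 'Oval'}

encoded = {'Age': age_to_rank[row['Age']]}
encoded.update(one_hot(row['Shape'], ['Irregular', 'Other', 'Oval', 'Round'], 'Shape'))
print(encoded)
```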

In [21]:
#one hot encoding
data = pd.get_dummies(data, columns = ['Location','Shape'])
data.head()
Out[21]:
AD Age BC BreastDensity FibrTissueDev LymphNodes MC Margin Mass Metastasis ... SkinRetract Spiculation Location_LolwOutQuad Location_LowInQuad Location_UpInQuad Location_UpOutQuad Shape_Irregular Shape_Other Shape_Oval Shape_Round
0 No <35 No medium No no No Well-defined No no ... No No 0 0 1 0 0 1 0 0
1 No 35-49 No high No no No Well-defined No no ... No No 1 0 0 0 0 1 0 0
2 No 50-74 Invasive medium No yes Yes Well-defined Benign yes ... No No 0 0 0 1 0 0 1 0
3 No 50-74 Invasive low Yes yes No Well-defined Benign yes ... Yes Yes 0 0 1 0 0 1 0 0
4 No >75 Invasive medium No no Yes Ill-defined Malign yes ... No No 0 1 0 0 0 0 0 1

5 rows × 22 columns

In [22]:
#ordinal
data.loc[data['Age'] == "<35", 'Age'] = 0
data.loc[data['Age'] == "35-49", 'Age'] = 1
data.loc[data['Age'] == "50-74", 'Age'] = 2
data.loc[data['Age'] == ">75", 'Age'] = 3
data.loc[data['BreastDensity'] == "low", 'BreastDensity'] = 0
data.loc[data['BreastDensity'] == "medium", 'BreastDensity'] = 1
data.loc[data['BreastDensity'] == "high", 'BreastDensity'] = 2
data.loc[data['Size'] == "<1cm", 'Size'] = 0
data.loc[data['Size'] == "1-3cm", 'Size'] = 1
data.loc[data['Size'] == ">3cm", 'Size'] = 2

#categorical boolean mask
categorical_feature_mask = data.dtypes==object
categorical_feature_mask
# filter categorical columns using mask and turn it into a list
categorical_cols = data.columns[categorical_feature_mask].tolist()
categorical_cols

# import labelencoder
from sklearn.preprocessing import LabelEncoder
#instantiate labelencoder object
le = LabelEncoder()

# apply le on categorical feature columns (the encoder is re-fitted on each column)
data[categorical_cols] = data[categorical_cols].apply(lambda col: le.fit_transform(col))
data[categorical_cols].head(10)
Out[22]:
AD BC FibrTissueDev LymphNodes MC Margin Mass Metastasis NippleDischarge SkinRetract Spiculation
0 0 2 0 0 0 1 2 0 0 0 0
1 0 2 0 0 0 1 2 0 0 0 0
2 0 1 0 1 1 1 0 1 1 0 0
3 0 1 1 1 0 1 0 1 0 1 1
4 0 1 0 0 1 0 1 1 0 0 0
5 0 2 0 0 0 0 0 0 0 0 1
6 1 2 1 0 0 0 2 0 0 1 1
7 0 0 1 0 0 0 2 0 0 1 1
8 0 2 1 0 0 0 2 0 0 0 1
9 0 2 0 0 0 1 2 0 0 0 0
In [23]:
#ensure original data has not changed
originalData.head()
Out[23]:
AD Age BC BreastDensity FibrTissueDev Location LymphNodes MC Margin Mass Metastasis NippleDischarge Shape Size SkinRetract Spiculation
0 No <35 No medium No UpInQuad no No Well-defined No no No Other <1cm No No
1 No 35-49 No high No LolwOutQuad no No Well-defined No no No Other <1cm No No
2 No 50-74 Invasive medium No UpOutQuad yes Yes Well-defined Benign yes Yes Oval <1cm No No
3 No 50-74 Invasive low Yes UpInQuad yes No Well-defined Benign yes No Other 1-3cm Yes Yes
4 No >75 Invasive medium No LowInQuad no Yes Ill-defined Malign yes No Round <1cm No No
In [24]:
data.head()
Out[24]:
AD Age BC BreastDensity FibrTissueDev LymphNodes MC Margin Mass Metastasis ... SkinRetract Spiculation Location_LolwOutQuad Location_LowInQuad Location_UpInQuad Location_UpOutQuad Shape_Irregular Shape_Other Shape_Oval Shape_Round
0 0 0 2 1 0 0 0 1 2 0 ... 0 0 0 0 1 0 0 1 0 0
1 0 1 2 2 0 0 0 1 2 0 ... 0 0 1 0 0 0 0 1 0 0
2 0 2 1 1 0 1 1 1 0 1 ... 0 0 0 0 0 1 0 0 1 0
3 0 2 1 0 1 1 0 1 0 1 ... 1 1 0 0 1 0 0 1 0 0
4 0 3 1 1 0 0 1 0 1 1 ... 0 0 0 1 0 0 0 0 0 1

5 rows × 22 columns

Correlation Matrix

In [25]:
# Compute the correlation matrix
corr = data.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # the np.bool alias was removed in NumPy 1.24; built-in bool works in all versions
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.show()

Feature Selection

In [26]:
new_data = data.drop([
    'BC',
], axis=1)

new_data.head()
Out[26]:
AD Age BreastDensity FibrTissueDev LymphNodes MC Margin Mass Metastasis NippleDischarge ... SkinRetract Spiculation Location_LolwOutQuad Location_LowInQuad Location_UpInQuad Location_UpOutQuad Shape_Irregular Shape_Other Shape_Oval Shape_Round
0 0 0 1 0 0 0 1 2 0 0 ... 0 0 0 0 1 0 0 1 0 0
1 0 1 2 0 0 0 1 2 0 0 ... 0 0 1 0 0 0 0 1 0 0
2 0 2 1 0 1 1 1 0 1 1 ... 0 0 0 0 0 1 0 0 1 0
3 0 2 0 1 1 0 1 0 1 0 ... 1 1 0 0 1 0 0 1 0 0
4 0 3 1 0 0 1 0 1 1 0 ... 0 0 0 1 0 0 0 0 0 1

5 rows × 21 columns

Split into training and testing datasets

In [27]:
from sklearn.model_selection import train_test_split

X_train,X_val,y_train,y_val = train_test_split(new_data,labels,test_size=0.33)

Scale Dataset

In [28]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                           # Instantiate the scaler
scaled_X_train = scaler.fit_transform(X_train)    # Fit on the training set and transform it
scaled_X_val = scaler.transform(X_val)            # Transform the validation set with the ranges learned from training
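
The reason for fitting on the training set only is that the validation set must not influence the learned ranges. A minimal plain-Python sketch of min-max scaling, (x - min) / (max - min) per feature, is shown below. `fit_min_max` and `apply_min_max` are hypothetical helpers and the numbers are toy values.

```python
def fit_min_max(train_rows):
    """Learn (min, max) for every feature from the training rows only."""
    return [(min(col), max(col)) for col in zip(*train_rows)]

def apply_min_max(rows, ranges):
    """Scale rows with previously learned ranges; constant features map to 0.0 in this sketch."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]

train = [[0, 1], [2, 1], [3, 0]]
val = [[1, 1], [4, 0]]

ranges = fit_min_max(train)
scaled_train = apply_min_max(train, ranges)
scaled_val = apply_min_max(val, ranges)
print(scaled_val)
```

Validation values outside the training range fall outside [0, 1], as the value 4 does here. This is expected when the scaler is fitted on the training data alone.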

Decision Tree

In [29]:
from sklearn.tree import DecisionTreeClassifier

clf_dt = DecisionTreeClassifier().fit(scaled_X_train,y_train)  
y_pred_dt = clf_dt.predict(scaled_X_val)

Naive Bayes (Bernoulli)

In [30]:
from sklearn.naive_bayes import BernoulliNB

clf_nb = BernoulliNB().fit(scaled_X_train,y_train)  
y_pred_nb = clf_nb.predict(scaled_X_val)

Support Vector Machine

In [31]:
from sklearn.svm import SVC

clf_svm = SVC(gamma=1, C=1000,probability=True).fit(scaled_X_train,y_train)
y_pred_svm = clf_svm.predict(scaled_X_val)

Random Forest

In [32]:
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier().fit(scaled_X_train,y_train)
y_pred_rf = clf_rf.predict(scaled_X_val)
C:\Users\sheaf\Anaconda3\envs\testing\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

Ada Boost

In [33]:
from sklearn.ensemble import AdaBoostClassifier
clf_ada = AdaBoostClassifier(n_estimators=100, random_state =0).fit(scaled_X_train, y_train)
y_pred_ada = clf_ada.predict(scaled_X_val)

Multi-layer Perceptron

In [34]:
from sklearn.neural_network import MLPClassifier
clf_mlp = MLPClassifier(solver='adam', alpha=1e-5, hidden_layer_sizes=(1000,), random_state=1).fit(scaled_X_train, y_train)
y_pred_mlp = clf_mlp.predict(scaled_X_val)
C:\Users\sheaf\Anaconda3\envs\testing\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)

Ensemble Learning

(Majority Voting)

In [35]:
from sklearn.ensemble import VotingClassifier
clf_ensemble = VotingClassifier(estimators=[('dt', clf_dt), ('svm', clf_svm), ('rf', clf_rf), ('ada', clf_ada), ('mlp', clf_mlp)], voting='hard').fit(scaled_X_train,y_train)
y_pred_ensemble = clf_ensemble.predict(scaled_X_val)
C:\Users\sheaf\Anaconda3\envs\testing\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
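
Hard voting selects, for each example, the label predicted by most classifiers. A minimal plain-Python sketch on toy predictions follows. `majority_vote` is a hypothetical helper, and ties are broken by taking the smallest label in sort order, which is an assumption of this sketch.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """predictions_per_model: one list of labels per classifier, all of equal length."""
    voted = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        top = max(counts.values())
        # Tie-break: smallest label among those with the highest vote count
        voted.append(min(label for label, n in counts.items() if n == top))
    return voted

# Toy predictions of three classifiers on three examples
toy_votes = [
    ['No', 'Invasive', 'Insitu'],
    ['No', 'No', 'Invasive'],
    ['Invasive', 'No', 'No'],
]
ensemble_labels = majority_vote(toy_votes)
print(ensemble_labels)  # ['No', 'No', 'Insitu']
```

The third example is a three-way tie, so the result depends only on the tie-break rule.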

Accuracy Measurements

In [36]:
from sklearn.metrics import accuracy_score

acc_rf = accuracy_score(y_val, y_pred_rf)
acc_svm = accuracy_score(y_val, y_pred_svm)
acc_dt = accuracy_score(y_val, y_pred_dt)                            
acc_nb = accuracy_score(y_val, y_pred_nb)                       
acc_ada = accuracy_score(y_val, y_pred_ada)                    
acc_mlp = accuracy_score(y_val, y_pred_mlp) 
acc_ensemble = accuracy_score(y_val, y_pred_ensemble) 

print("The accuracy of Decision Tree: {} %".format(acc_dt*100))
print("The accuracy of Bernoulli Naive Bayes: {} %".format(acc_nb*100))
print("The accuracy of SVM: {} %".format(acc_svm*100))
print("The accuracy of RF: {} %".format(acc_rf*100))
print("The accuracy of Ada Boost: {} %".format(acc_ada*100))
print("The accuracy of Multi-layer Perceptron: {} %".format(acc_mlp*100))
print("The accuracy of Ensemble Learning: {} %".format(acc_ensemble*100))
The accuracy of Decision Tree: 86.63838812301167 %
The accuracy of Bernoulli Naive Bayes: 86.80502954097864 %
The accuracy of SVM: 88.12301166489925 %
The accuracy of RF: 89.21375549159218 %
The accuracy of Ada Boost: 91.1983032873807 %
The accuracy of Multi-layer Perceptron: 89.0622632934404 %
The accuracy of Ensemble Learning: 90.00151492198152 %

Precision Metrics

In [37]:
from sklearn.metrics import precision_score

prec_dt = precision_score(y_val,y_pred_dt,average='weighted')
prec_nb = precision_score(y_val,y_pred_nb,average='weighted')
prec_rf = precision_score(y_val,y_pred_rf,average='weighted')
prec_svm = precision_score(y_val,y_pred_svm,average='weighted')
prec_ada = precision_score(y_val,y_pred_ada,average='weighted')
prec_mlp = precision_score(y_val,y_pred_mlp,average='weighted')
prec_ensemble = precision_score(y_val,y_pred_ensemble,average='weighted')

print("The precision of Decision Tree: {} %".format(prec_dt*100))
print("The precision of Bernoulli Naive Bayes: {} %".format(prec_nb*100))
print("The precision of SVM: {} %".format(prec_svm*100))
print("The precision of Random Forest: {} %".format(prec_rf*100))
print("The precision of Ada Boost: {} %".format(prec_ada*100))
print("The precision of Multi-layer Perceptron: {} %".format(prec_mlp*100))
print("The precision of Ensemble Learning: {} %".format(prec_ensemble*100))
The precision of Decision Tree: 86.46610168490007 %
The precision of Bernoulli Naive Bayes: 87.98111766389366 %
The precision of SVM: 87.50373038849682 %
The precision of Random Forest: 88.67720767805001 %
The precision of Ada Boost: 90.64037101449219 %
The precision of Multi-layer Perceptron: 88.32084332937413 %
The precision of Ensemble Learning: 89.4336994191635 %

Recall Metrics

In [38]:
from sklearn.metrics import recall_score

recall_dt = recall_score(y_val,y_pred_dt,average='weighted')
recall_nb = recall_score(y_val,y_pred_nb,average='weighted')
recall_rf = recall_score(y_val,y_pred_rf,average='weighted')
recall_svm = recall_score(y_val,y_pred_svm,average='weighted')
recall_ada = recall_score(y_val,y_pred_ada,average='weighted')
recall_mlp = recall_score(y_val,y_pred_mlp,average='weighted')
recall_ensemble = recall_score(y_val,y_pred_ensemble,average='weighted')

print("The recall of Decision Tree: {} %".format(recall_dt*100))
print("The recall of Bernoulli Naive Bayes: {} %".format(recall_nb*100))
print("The recall of SVM: {} %".format(recall_svm*100))
print("The recall of Random Forest: {} %".format(recall_rf*100))
print("The recall of Ada Boost: {} %".format(recall_ada*100))
print("The recall of Multi-layer Perceptron: {} %".format(recall_mlp*100))
print("The recall of Ensemble Learning: {} %".format(recall_ensemble*100))
The recall of Decision Tree: 86.63838812301167 %
The recall of Bernoulli Naive Bayes: 86.80502954097864 %
The recall of SVM: 88.12301166489925 %
The recall of Random Forest: 89.21375549159218 %
The recall of Ada Boost: 91.1983032873807 %
The recall of Multi-layer Perceptron: 89.0622632934404 %
The recall of Ensemble Learning: 90.00151492198152 %
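
The recall values above are identical to the accuracy values. This is expected: with `average='weighted'`, each class recall TP_c / n_c is weighted by n_c / N, so the sum reduces to (sum of TP_c) / N, which is the accuracy. A minimal plain-Python check on toy labels, using hypothetical helper names:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    support = Counter(y_true)                                   # n_c per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # TP_c per class
    n = len(y_true)
    return sum((support[c] / n) * (hits[c] / support[c]) for c in support)

# Toy labels, not taken from the dataset
toy_true = ['No', 'No', 'No', 'Insitu', 'Insitu', 'Invasive']
toy_pred = ['No', 'No', 'Insitu', 'Insitu', 'No', 'Invasive']

toy_accuracy = accuracy(toy_true, toy_pred)
toy_weighted_recall = weighted_recall(toy_true, toy_pred)
print(toy_accuracy, toy_weighted_recall)
```

Weighted recall therefore adds no information beyond accuracy; macro-averaged or per-class recall would show how the minority classes perform.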

F1 Scores

In [39]:
from sklearn.metrics import f1_score

f1_dt = f1_score(y_pred_dt,y_val,average='weighted')
f1_nb = f1_score(y_pred_nb,y_val,average='weighted')
f1_svm = f1_score(y_pred_svm,y_val,average='weighted')
f1_rf = f1_score(y_pred_rf,y_val,average='weighted')
f1_ada = f1_score(y_pred_ada,y_val,average='weighted')
f1_mlp = f1_score(y_pred_mlp,y_val,average='weighted')
f1_ensemble = f1_score(y_pred_ensemble,y_val,average='weighted')

print("The F1-score of Decision Tree: {} %".format(f1_dt*100))
print("The F1-score of Bernoulli Naive Bayes: {} %".format(f1_nb*100))
print("The F1-score of SVM: {} %".format(f1_svm*100))
print("The F1-score of Random Forest: {} %".format(f1_rf*100))
print("The F1-score of Ada Boost: {} %".format(f1_ada*100))
print("The F1-score of Multi-layer Perceptron: {} %".format(f1_mlp*100))
print("The F1-score of Ensemble Learning: {} %".format(f1_ensemble*100))
The F1-score of Decision Tree: 86.73004249384194 %
The F1-score of Bernoulli Naive Bayes: 86.32020772856286 %
The F1-score of SVM: 88.49641811442996 %
The F1-score of Random Forest: 89.54576301417893 %
The F1-score of Ada Boost: 91.68939071859609 %
The F1-score of Multi-layer Perceptron: 89.55668440085697 %
The F1-score of Ensemble Learning: 90.37806374963002 %
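
Note that the F1 cell above calls `f1_score(y_pred, y_val, ...)`, while the scikit-learn signature is `f1_score(y_true, y_pred, ...)`. The per-class F1 is symmetric in the two arguments, but the weighted average uses the class counts of the first argument, so the values above are weighted by predicted-class counts. A minimal plain-Python sketch on toy labels shows that the order matters; `weighted_f1` is a hypothetical helper, not the scikit-learn implementation.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Weighted-average F1; each class is weighted by its count in y_true."""
    support = Counter(y_true)
    n = len(y_true)
    total = 0.0
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * support[c] / n
    return total

# Toy labels, not taken from the dataset
toy_true = ['No', 'No', 'No', 'Insitu']
toy_pred = ['No', 'No', 'Insitu', 'Insitu']

f1_correct_order = weighted_f1(toy_true, toy_pred)
f1_swapped_order = weighted_f1(toy_pred, toy_true)
print(f1_correct_order, f1_swapped_order)
```

On real data the difference is usually small, but passing the true labels first keeps the F1 scores comparable with the precision and recall cells.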