Seaborn: Statistical Data Visualization

 

 


 

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

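import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')

# `cd` is assumed to be the cancer DataFrame loaded earlier (loading step not shown here).
dv = cd.diagnosis                  # dependent variable: M (malignant) or B (benign)
drop_cols = ['id', 'diagnosis']    # renamed from `list` to avoid shadowing the built-in
iv = cd.drop(drop_cols, axis=1)    # independent variables
iv.head()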

radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean ... radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678

5 rows × 30 columns

Data Visualization: Cancer Data

Count plot ->

A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable. The basic API and options are identical to those for barplot(), so you can compare counts across nested variables.

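ax = sns.countplot(x=dv)
plt.grid(True, color='g')          # single-letter color codes must be lowercase
counts = dv.value_counts()
B, M = counts['B'], counts['M']    # look up by label instead of relying on sort order
print('Number of Benign: ',B)
print('Number of Malignant : ',M)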
Number of Benign:  357
Number of Malignant :  212
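
As a minimal sketch of the nested counts mentioned above, the following passes `hue` to split each bar by a second categorical variable. The DataFrame `toy` and its `size_group` column are made up for illustration and are not part of the cancer data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up labels for illustration only; not the cancer dataset.
toy = pd.DataFrame({
    "diagnosis":  ["B", "B", "B", "M", "M", "B", "M", "B"],
    "size_group": ["small", "small", "large", "large",
                   "large", "small", "small", "large"],
})

# One group of bars per diagnosis, split by the nested variable given as hue.
ax = sns.countplot(x="diagnosis", hue="size_group", data=toy)

# Look counts up by label rather than by position in value_counts().
counts = toy["diagnosis"].value_counts()
print("Number of Benign:", counts["B"])
print("Number of Malignant:", counts["M"])
plt.close("all")
```

Passing column names through `x=` and `data=` keeps the call readable and works the same way for the other categorical plots below.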

Violin Plot ->

A violin plot shows the distribution of numeric data. It is similar to a box plot, but adds a rotated kernel density plot on each side, so it also shows the estimated probability density of the data at different values.

y = dv
x = iv
data_n_2 = (iv - iv.mean()) / iv.std()                  # standardization (z-scores)
data = pd.concat([y, data_n_2.iloc[:, 0:10]], axis=1)   # first 10 features only
data = pd.melt(data, id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10, 10))
# violinplot is axes-level and has no `aspect` argument; the figure size is set above
sns.violinplot(x="features", y="value", hue="diagnosis", data=data, split=True, inner="quart")
plt.grid(True, color='g')
plt.xticks(rotation=90)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), <a list of 10 Text xticklabel objects>)
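
The standardize-then-melt step used above can be hard to picture, so here is a minimal sketch of it on a tiny made-up table (the values in `wide` are invented and are not from the cancer data). It shows how a wide table with one column per feature becomes the long `diagnosis` / `features` / `value` layout that the plotting calls expect.

```python
import pandas as pd

# Made-up measurements for illustration only; not the cancer dataset.
wide = pd.DataFrame({
    "diagnosis":    ["M", "B", "M", "B"],
    "radius_mean":  [18.0, 12.0, 20.0, 10.0],
    "texture_mean": [10.0, 20.0, 30.0, 20.0],
})

features = wide.drop(columns="diagnosis")
# z-score: subtract each column's mean, divide by its sample standard deviation
z = (features - features.mean()) / features.std()

# Wide to long: one row per (sample, feature) pair.
long = pd.melt(pd.concat([wide["diagnosis"], z], axis=1),
               id_vars="diagnosis", var_name="features", value_name="value")
print(long.shape)
print(long.head())
```

After standardization every feature has mean 0 and standard deviation 1, which is what lets features with very different units share one y-axis.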

Cat Plot ->

%matplotlib inline
# catplot is figure-level: it creates its own figure, sized through `height` and `aspect`,
# so a preceding plt.figure(...) call only produces an empty figure and is omitted here.
sns.catplot(x="features", y="value", hue="diagnosis", aspect=.6,
            kind="swarm", data=data);
# The default kind is "strip"
sns.catplot(x="features", y="value", hue="diagnosis", palette="ch:.25", data=data, aspect=3);
sns.catplot(x="features", y="value", hue="diagnosis", kind="bar", data=data, aspect=3);
sns.catplot(x="features", y="value", hue="diagnosis", jitter=False, data=data, aspect=3);
sns.catplot(x="features", y="value", data=data, legend=True, aspect=3);
sns.set(rc={'figure.figsize': (11.7, 8.27)})   # default size for later plt.figure() calls
sns.catplot(x="features", y="value", data=data, kind="box", legend=True, aspect=3);
sns.catplot(x="features", y="value", hue="diagnosis", kind="box", data=data, aspect=3);
sns.catplot(x="features", y="value", kind="boxen",
            data=data, aspect=3);

Factor Plot ->

# factorplot has been renamed to catplot. Its default kind changed from 'point' to 'strip',
# which does not matter here because kind="box" is passed explicitly.
g = sns.catplot(x="features", y="value", hue="diagnosis",
        data=data, kind="box", aspect=3)
sns.set_style('ticks')
fig, ax = plt.subplots()   # fresh axes for the axes-level violin plot
sns.violinplot(x="features", y="value", data=data, inner="point", ax=ax)
sns.despine()

Box Plot ->

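The box plot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. With seaborn's default settings the whiskers extend only to the most extreme points within 1.5 times the interquartile range of the quartiles, and points beyond that are drawn individually as outliers.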

plt.figure(figsize=(10,10))
sns.boxplot(x="features", y="value", hue="diagnosis", data=data)
plt.xticks(rotation=90)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), <a list of 10 Text xticklabel objects>)
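
The five-number summary behind a box plot can also be computed directly. The sketch below uses a small made-up array (not the cancer data) and NumPy's default linear interpolation for the quartiles; the 1.5 × IQR fences correspond to the default `whis=1.5` whisker rule.

```python
import numpy as np

# Made-up values for illustration only.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
iqr = q3 - q1

# Whiskers stop at the most extreme points inside these fences;
# anything outside them is drawn as an individual outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("five-number summary:", minimum, q1, median, q3, maximum)
print("outliers:", outliers)
```

Here the value 30 lies above the upper fence, so a box plot of `values` would show it as a separate point rather than stretching the whisker to reach it.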

data.head()

diagnosis features value
0 M radius_mean 1.096100
1 M radius_mean 1.828212
2 M radius_mean 1.578499
3 M radius_mean -0.768233
4 M radius_mean 1.748758
x.head()   # x is iv, so this prints the same 5 rows × 30 columns as iv.head() above

Joint Plot ->

g = sns.jointplot(x="texture_mean", y="smoothness_mean", data=cd, kind="hex", color="r")
sns.jointplot(x="texture_mean", y="smoothness_mean", data=cd, kind="reg", color="g")
<seaborn.axisgrid.JointGrid at 0x7faf2ed11d68>
 

Joint Plot ->

sns.jointplot(x="concavity_worst", y="concave points_worst", data=cd, kind="reg", color="#ce1414")
<seaborn.axisgrid.JointGrid at 0x7faf2edbcd68>
y.head()
0    M
1    M
2    M
3    M
4    M
Name: diagnosis, dtype: object

Scatter Plot ->

A scatter plot uses Cartesian coordinates to display the values of, typically, two variables for a set of data. If the points are coded (for example by color, shape, or size), one additional variable can be displayed.

import plotly.express as px
# Note: with log_x=True, any rows where concavity_worst is 0 cannot be drawn on the log axis.
fig = px.scatter(x, x="concavity_worst", y="concave points_worst", log_x=True)
fig.show()
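
To illustrate the "coded points" idea from the description, here is a minimal sketch that colors points by a third, categorical variable. It uses seaborn's `scatterplot` rather than plotly, and the columns `feature_a`, `feature_b`, and `label` are randomly generated stand-ins, not columns of the cancer data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated stand-in data for illustration only.
rng = np.random.RandomState(0)
n = 60
toy = pd.DataFrame({
    "feature_a": rng.normal(size=n),
    "label": np.repeat(["B", "M"], n // 2),
})
toy["feature_b"] = 0.8 * toy["feature_a"] + rng.normal(scale=0.3, size=n)

# x and y carry two variables; hue codes the third one as color.
ax = sns.scatterplot(x="feature_a", y="feature_b", hue="label", data=toy)
plt.close("all")
```

With the real data, the same call shape would presumably be `x="concavity_worst"`, `y="concave points_worst"`, `hue="diagnosis"` on the `cd` DataFrame.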
 

Pair Grid Plot ->

Subplot grid for plotting pairwise relationships in a dataset.

This class maps each variable in a dataset onto a column and row in a grid of multiple axes. Different axes-level plotting functions can be used to draw bivariate plots in the upper and lower triangles, and the marginal distribution of each variable can be shown on the diagonal.

sns.set(style="white")
df = x.loc[:, ['radius_worst', 'perimeter_worst', 'area_worst']]
g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot, cmap="Blues_d")   # bivariate density in the lower triangle
g.map_upper(plt.scatter)                   # raw points in the upper triangle
g.map_diag(sns.kdeplot, lw=3)              # marginal density on the diagonal
# No add_legend() here: without a hue variable there is nothing to label,
# and the call only triggers a matplotlib '_nolegend_' warning.

Color the points using a categorical variable

df = cd.loc[:, ['diagnosis', 'radius_worst', 'perimeter_worst', 'area_worst']]

# Scatter plots off the diagonal, densities on it, colored by diagnosis.
g = sns.PairGrid(df, diag_sharey=False, hue="diagnosis")
g.map_offdiag(plt.scatter)   # covers both triangles, so a separate map_upper is not needed
g.map_diag(sns.kdeplot, lw=3)
g.add_legend()

# Same grid with seaborn's scatterplot in every cell, including the diagonal.
g = sns.PairGrid(df, diag_sharey=False, hue="diagnosis")
g.map(sns.scatterplot, linewidths=1, edgecolor="w", s=40)
g.add_legend()

Swarm Plot ->

sns.set(style="whitegrid", palette="muted")
data_n_2 = (x - x.mean()) / x.std()                     # standardization (z-scores)
data = pd.concat([y, data_n_2.iloc[:, 0:10]], axis=1)   # first 10 features only
data = pd.melt(data, id_vars="diagnosis",
                    var_name="features",
                    value_name='value')
plt.figure(figsize=(10, 10))
sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)

plt.xticks(rotation=90)
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), <a list of 10 Text xticklabel objects>)

Correlation Map ->

f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(x.corr(), annot=True, linewidths=.5, fmt= '.1f',ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x7faf2c28c160>
fig, (axis1,axis2,axis3) = plt.subplots(1,3,figsize=(14,12))

sns.boxplot(x="features", y="value", hue="diagnosis", data = data, ax = axis1)
sns.violinplot(x="features", y="value", hue="diagnosis", data = data, split = True, ax = axis2)
sns.boxplot(x="features", y="value", hue="diagnosis", data = data, ax = axis3)
<matplotlib.axes._subplots.AxesSubplot at 0x7faf26f73390>
fig, saxis = plt.subplots(2, 3,figsize=(16,12))

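sns.barplot(x="features", y="value", data=data, ax = saxis[0,0])
# `order` must list category levels of the x variable, which here are feature names
sns.barplot(x="features", y="value", order=["radius_mean", "texture_mean", "perimeter_mean"], data=data, ax = saxis[0,1])
sns.barplot(x="features", y="value", order=["texture_mean", "radius_mean"], data=data, ax = saxis[0,2])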

sns.pointplot(x="features", y="value",  data=data, ax = saxis[1,0])
sns.pointplot(x="features", y="value",  data=data, ax = saxis[1,1])
sns.pointplot(x="features", y="value", data=data, ax = saxis[1,2])
<matplotlib.axes._subplots.AxesSubplot at 0x7faf2c2709b0>
