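Analysis of the Red Wine Dataset

This notebook analyzes the public red wine quality dataset. The dataset has 1599 samples, each with 11 physicochemical properties and a quality rating (on a scale from 0 to 10). The main goal is to demonstrate the Python packages commonly used for data analysis and visualization. The analysis has three parts: univariate, bivariate, and multivariate.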

%matplotlib inline
#%config InlineBackend.figure_format = 'retina'

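%matplotlib inline

%matplotlib inline is a magic function. IPython provides a set of predefined magic functions that are called with a command-line style syntax.

%matplotlib configures how matplotlib works interactively in the notebook. By default, figures are rendered in their own windows, but an argument can be passed to select a specific backend (the software that renders the figure). To render figures directly in the notebook, use the inline backend via %matplotlib inline.

%config InlineBackend.figure_format = 'retina'

Tip: on high-resolution screens (such as Retina displays), the default images in a notebook can look blurry. Running %config InlineBackend.figure_format = 'retina' after %matplotlib inline renders higher-resolution images.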

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  

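plt.rc("font", family="SimHei", size=15)  # use the SimHei font so Chinese characters in plots render correctly

Visualizing data makes it easier to spot patterns, grasp difficult concepts, and notice key elements. If you use Python for data science, you have most likely already used Matplotlib, a 2D library for creating high-quality figures. Another free visualization library is Seaborn, which provides a high-level interface for drawing statistical graphics.

Tutorial: https://www.datacamp.com/community/tutorials/seaborn-python-tutorial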

# colors
color = sns.color_palette()
'''
color_palette()
Six default palettes: deep, muted, pastel, bright, dark, colorblind
# Other color schemes
# Scheme names: Accent, Blues, BrBG, and so on
sns.palplot(sns.color_palette('Accent', 8))
# The scheme here is Accent
# with 8 color swatches
# Reversing a scheme (not every scheme can be reversed): Blues / Blues_r
# Palette for grouped data: 'Paired'
sns.palplot(sns.color_palette('Paired', 16))
'''
sns.palplot(color)
# display the palette


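# display precision for printed data
# (recent pandas versions require the full option name 'display.precision')
pd.set_option('display.precision', 3)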

df = pd.read_csv('F:/AI/wine/data_wine/winequality-red.csv',sep = ';')

df.head(5)

# show the first few rows of the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

Univariate Analysis

# basic summary statistics
df.describe()
# the summary statistics are computed by pandas

# set plot style
'''
The main plot styles available are listed below
(the exact list depends on the matplotlib version; see plt.style.available)
['bmh', 'classic', 'dark_background', 'fast',
'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright',
'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark',
'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted',
'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel',
'seaborn-poster', 'seaborn-talk', 'seaborn-ticks',
'seaborn-white', 'seaborn-whitegrid', 'seaborn',
'Solarize_Light2', 'tableau-colorblind10', '_classic_test']
'''

# use the ggplot style
plt.style.use('ggplot')

# list of column names
colnm = df.columns.tolist()

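# figure size
fig = plt.figure(figsize = (10, 6))


# plt.subplot() places a plot at a given position in a grid of subplots.
# The three arguments may also be separated by commas. The first is the number of rows, the second is the number of columns, and the third is the 1-based index of the subplot, counted row by row.
'''
Use plt.subplot to create small plots. plt.subplot(221) splits the figure into 2 rows and 2 columns and selects position 1.
plt.subplot(221)
# plt.subplot(222) splits the figure into 2 rows and 2 columns and selects position 2.
plt.subplot(222) # right plot of the first row
# plt.subplot(223) splits the figure into 2 rows and 2 columns and selects position 3.
plt.subplot(223)
# plt.subplot(224) splits the figure into 2 rows and 2 columns and selects position 4.
plt.subplot(224)
'''
# boxplot
'''
A box plot (also called a box-and-whisker plot) is a statistical chart that shows how a set of values is spread out.
It displays the maximum, minimum, median, and upper and lower quartiles of the data, and is named for its box-like shape.
It is used in many fields and is especially common in quality control.

Usage
seaborn.boxplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None,
orient=None, color=None, palette=None, saturation=0.75, width=0.8, dodge=True, 
fliersize=5, linewidth=None, whis=1.5, notch=False, ax=None, **kwargs)
'''
# parameter details: https://blog.csdn.net/qq_39949963/article/details/79387486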


for i in range(12):
    plt.subplot(2,6,i+1)
    # one vertical box per column: box width 0.5, first color of the palette
    sns.boxplot(y = df[colnm[i]], width = 0.5, color = color[0])
    plt.ylabel(colnm[i],fontsize = 12)  # y-axis label
    
plt.tight_layout()  # adjust spacing so the subplots do not overlap
print('\nFigure 1: Univariate Boxplots')

Figure 1: Univariate Boxplots
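
As a rough sketch of what each box in Figure 1 encodes, the quartiles and whisker limits can be computed by hand with NumPy. The sample values below are hypothetical and only serve as input; the 1.5 × IQR rule is assumed to match the `whis=1.5` default listed in the `seaborn.boxplot` signature above. Note that with this rule the whiskers stop at the most extreme points inside the limits, so they do not always reach the minimum and maximum.

```python
import numpy as np

# Hypothetical sample values, for illustration only (not read from the dataset).
values = np.array([4.6, 7.1, 7.4, 7.8, 7.9, 8.3, 9.2, 9.9, 11.2, 15.9])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers extend to the most extreme points within 1.5 * IQR of the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the limits are drawn individually as outliers ("fliers").
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

With the real data, the same calculation can be applied to any column, for example `df['fixed acidity'].to_numpy()`.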

colnm = df.columns.tolist()
plt.figure(figsize = (10, 8))
for i in range(12):
    plt.subplot(4,3,i+1)  # grid of 4 rows by 3 columns
    # bins = 100 draws each histogram with 100 bars, in the first palette color
    # tutorial: https://blog.csdn.net/ChenVast/article/details/81563561
    df[colnm[i]].hist(bins = 100, color = color[0])
    plt.xlabel(colnm[i],fontsize = 12)
    plt.ylabel('Frequency')
plt.tight_layout()
print('\nFigure 2: Univariate Histograms')

Figure 2: Univariate Histograms
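
The way `plt.subplot(4,3,i+1)` walks through the grid in Figure 2 can be sketched as follows. This is a minimal illustration, not part of the original analysis: the `positions` dictionary and the `Agg` backend (chosen so the code runs without a display) are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

n_rows, n_cols = 4, 3
fig = plt.figure(figsize=(10, 8))

positions = {}  # hypothetical helper: subplot index -> (row, column), both 0-based
for i in range(n_rows * n_cols):
    ax = plt.subplot(n_rows, n_cols, i + 1)  # the subplot index is 1-based
    row, col = divmod(i, n_cols)             # cells are numbered row by row
    ax.set_title(f"index {i + 1}: row {row}, col {col}", fontsize=8)
    positions[i + 1] = (row, col)

plt.tight_layout()
plt.close(fig)
print(positions[1], positions[4], positions[12])
```

Index 4 is therefore the first plot of the second row, and index 12 is the last plot of the fourth row.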

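Quality

The purpose of this dataset is to study the relationship between red wine quality and physicochemical properties. Quality is rated on a scale from 0 to 10; in this dataset the ratings range from 3 to 8, and 82% of the wines are rated 5 or 6.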
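
The share of wines rated 5 or 6 can be checked with a few lines of pandas. The per-rating counts below are hard-coded as an assumption so that the sketch runs without the CSV file; they are consistent with the 1599 rows and the mean quality of 5.636 reported by `df.describe()`, but should be confirmed against the real data.

```python
import pandas as pd

# Assumed per-rating counts; with the real data use df['quality'].value_counts() instead.
counts = pd.Series({3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}, name="count")

share = counts / counts.sum()          # fraction of wines at each rating
middle_share = share.loc[[5, 6]].sum()  # fraction rated 5 or 6

print(counts.sum())
print(round(middle_share, 3))
```

On the loaded DataFrame, `df['quality'].value_counts(normalize=True)` gives the same fractions directly.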

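Acidity-Related Features

This dataset has 7 acidity-related features: fixed acidity, volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide, sulphates, and pH. The first 6 all relate to the pH of the wine. Because pH is on a logarithmic scale, the histograms below are drawn after taking the logarithm of the first 6 features. pH depends mainly on fixed acidity, which is 1 to 2 orders of magnitude higher than volatile acidity and citric acid (Figure 4) and 3 orders of magnitude higher than free sulfur dioxide, total sulfur dioxide, and sulphates. A new feature, total acid, is the sum of the first three features.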
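
The "1 to 2 orders of magnitude" statement can be sanity-checked from the column means. The sketch below copies the three means from the `df.describe()` output of this notebook and assumes all three are in g/dm^3, as the axis label of Figure 4 suggests.

```python
import numpy as np

# Column means copied from the df.describe() output (assumed unit: g/dm^3).
mean_fixed = 8.320
mean_volatile = 0.528
mean_citric = 0.271

# Difference in orders of magnitude = log10 of the ratio of the two concentrations.
orders_vs_volatile = np.log10(mean_fixed / mean_volatile)
orders_vs_citric = np.log10(mean_fixed / mean_citric)

print(f"fixed vs volatile acidity: {orders_vs_volatile:.2f} orders of magnitude")
print(f"fixed vs citric acid:      {orders_vs_citric:.2f} orders of magnitude")
```

Both values fall between 1 and 2, which agrees with the description above.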

acidityFeat = ['fixed acidity', 'volatile acidity', 'citric acid',
               'free sulfur dioxide', 'total sulfur dioxide', 'sulphates']

plt.figure(figsize = (10, 4))

for i in range(6):
    ax = plt.subplot(2,3,i+1)
    v = np.log10(np.clip(df[acidityFeat[i]].values, a_min = 0.001, a_max = None))
    plt.hist(v, bins = 50, color = color[0])
    plt.xlabel('log(' + acidityFeat[i] + ')',fontsize = 12)

    plt.ylabel('Frequency')
plt.tight_layout()
print('\nFigure 3: Acidity Features in log10 Scale')

Figure 3: Acidity Features in log10 Scale
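
The `np.clip` call in the Figure 3 code is there because some features, such as citric acid, contain zeros, and `log10(0)` is negative infinity. A minimal sketch with illustrative values (not read from the file):

```python
import numpy as np

# Illustrative concentrations; the zeros mimic wines with no measurable citric acid.
citric = np.array([0.0, 0.04, 0.56, 0.0, 1.0])

floor = 0.001  # same lower bound as in the Figure 3 code
clipped = np.clip(citric, a_min=floor, a_max=None)
log_values = np.log10(clipped)

print(log_values)
```

Every zero is mapped to log10(0.001), so zero-valued samples show up as a separate spike at the left edge of the histogram rather than being dropped.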

plt.figure(figsize=(6,3))

bins = 10**(np.linspace(-2, 2))
plt.hist(df['fixed acidity'], bins = bins, edgecolor = 'k', label = 'Fixed Acidity')
plt.hist(df['volatile acidity'], bins = bins, edgecolor = 'k', label = 'Volatile Acidity')
plt.hist(df['citric acid'], bins = bins, edgecolor = 'k', alpha = 0.8, label = 'Citric Acid')
plt.xscale('log')
plt.xlabel('Acid Concentration (g/dm^3)')
plt.ylabel('Frequency')
plt.title('Histogram of Acid Concentration')
plt.legend()
plt.tight_layout()

print('Figure 4')

Figure 4

# total acid
df['total acid'] = df['fixed acidity'] + df['volatile acidity'] + df['citric acid']

plt.figure(figsize = (8,3))

plt.subplot(121)
plt.hist(df['total acid'], bins = 50, color = color[0])
plt.xlabel('total acid')
plt.ylabel('Frequency')
plt.subplot(122)
plt.hist(np.log(df['total acid']), bins = 50 , color = color[0])
plt.xlabel('log(total acid)')
plt.ylabel('Frequency')
plt.tight_layout()

print("Figure 5: Total Acid Histogram")

Figure 5: Total Acid Histogram

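Sweetness

Residual sugar determines how sweet a wine tastes and is commonly used to classify wines as dry (<= 4 g/L), medium dry (4-12 g/L), semi-sweet (12-45 g/L), or sweet (> 45 g/L). Most wines in this dataset are dry, and none are sweet.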

# Residual sugar
df['sweetness'] = pd.cut(df['residual sugar'], bins = [0, 4, 12, 45], 
                         labels=["dry", "medium dry", "semi-sweet"])

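# pandas.cut splits a set of values into discrete intervals. For example, a column of ages can be cut into age groups, each with a label.
# Here it assigns each residual sugar value to one of the sweetness classes described above.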

plt.figure(figsize = (5,3))
df['sweetness'].value_counts().plot(kind = 'bar', color = color[0])
# count the wines in each sweetness class
plt.xticks(rotation=0)
plt.xlabel('sweetness', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)
plt.tight_layout()
print("Figure 6: Sweetness")

Figure 6: Sweetness
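
How `pd.cut` treats the bin edges is easy to miss, so here is a small sketch with illustrative residual sugar values (0.9 and 15.5 are the minimum and maximum from `df.describe()`; the others are chosen to sit on and just above an edge).

```python
import pandas as pd

# Illustrative residual sugar values in g/L.
sugar = pd.Series([0.9, 1.9, 4.0, 4.1, 15.5])

# Bins are right-inclusive by default: (0, 4], (4, 12], (12, 45].
sweetness = pd.cut(sugar, bins=[0, 4, 12, 45],
                   labels=["dry", "medium dry", "semi-sweet"])

print(sweetness.tolist())
print(sweetness.value_counts().to_dict())
```

A value of exactly 4.0 lands in the dry class, which matches the "<= 4 g/L" definition, and any value above 45 would become NaN because no bin covers it.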

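Bivariate Analysis

Wine Quality and Physicochemical Properties

Figures 7 and 8 below show how the physicochemical properties relate to quality. The visible trends are:

Higher-quality wines have more citric acid, more sulphates, and a higher alcohol content. Sulphates (calcium sulfate) are usually added to adjust the acidity of the wine. Of these, alcohol content has the strongest correlation with quality. Higher-quality wines have lower volatile acidity, density, and pH. Residual sugar, chlorides, and sulfur dioxide appear to have little effect on quality.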

sns.set_style('ticks')
sns.set_context("notebook", font_scale= 1.1)

colnm = df.columns.tolist()[:11] + ['total acid']
plt.figure(figsize = (10, 8))

for i in range(12):
    plt.subplot(4,3,i+1)
    sns.boxplot(x ='quality', y = colnm[i], data = df, color = color[1], width = 0.6)    
    plt.ylabel(colnm[i],fontsize = 12)
plt.tight_layout()
print("\nFigure 7: Physicochemical Properties and Wine Quality by Boxplot")

Figure 7: Physicochemical Properties and Wine Quality by Boxplot

sns.set_style("dark")

plt.figure(figsize = (10,8))
colnm = df.columns.tolist()[:11] + ['total acid', 'quality']
mcorr = df[colnm].corr()
mask = np.zeros_like(mcorr, dtype=bool)  # np.bool was removed from NumPy; the built-in bool is equivalent
mask[np.triu_indices_from(mask)] = True
cmap = sns.diverging_palette(220, 10, as_cmap=True)
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')
print("\nFigure 8: Pairwise Correlation Plot")

Figure 8: Pairwise Correlation Plot
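
The mask used for Figure 8 hides the diagonal and the upper triangle, so each pair of features is shown only once. A minimal sketch of that step on a small random frame follows; the column names `a` to `d` and the random data are hypothetical and only there to produce a correlation matrix.

```python
import numpy as np
import pandas as pd

# Small random frame with hypothetical column names, only to get a correlation matrix.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
corr = toy.corr()

# True marks the cells to hide: the diagonal and everything above it.
mask = np.zeros(corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True

visible = corr.where(~mask)  # lower triangle kept, hidden cells become NaN
print(visible.round(2))
```

For a 4 x 4 matrix this hides 10 cells and leaves the 6 distinct pairwise correlations.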

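Density and Alcohol Content

Density and alcohol content are correlated, although physically the relationship is not linear. Figure 9 shows the relationship. Density also depends on the amounts of other substances in the wine, but only weakly.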

# style
sns.set_style('ticks')
sns.set_context("notebook", font_scale= 1.4)

# plot figure
plt.figure(figsize = (6,4))
sns.regplot(x='density', y = 'alcohol', data = df, scatter_kws = {'s':10}, color = color[1])
plt.xlim(0.989, 1.005)
plt.ylim(7,16)
print('Figure 9: Density vs Alcohol')

Figure 9: Density vs Alcohol

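Acid Content and pH

pH has a correlation of -0.683 with fixed (non-volatile) acidity. Because fixed acids are far more abundant than the other acids, the total acid feature adds little information.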
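
For readers who want to see what the correlation coefficient measures, the Pearson formula can be written out by hand. The sketch below uses only the five rows shown in the `df.head(5)` output, so it will not reproduce -0.683; it only illustrates the calculation and the negative sign.

```python
import numpy as np

# First five rows of the dataset, as shown in the df.head(5) output.
fixed_acidity = np.array([7.4, 7.8, 7.8, 11.2, 7.4])
ph = np.array([3.51, 3.20, 3.26, 3.16, 3.51])

# Pearson r: sum of products of deviations, divided by the product of their norms.
dx = fixed_acidity - fixed_acidity.mean()
dy = ph - ph.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Cross-check against NumPy's built-in correlation matrix.
r_numpy = np.corrcoef(fixed_acidity, ph)[0, 1]
print(round(r_manual, 3), round(r_numpy, 3))
```

On the full data, `df['pH'].corr(df['fixed acidity'])` performs the same calculation over all 1599 rows.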

acidity_related = ['fixed acidity', 'volatile acidity', 'total sulfur dioxide', 
                   'sulphates', 'total acid']

plt.figure(figsize = (10,6))

for i in range(5):
    plt.subplot(2,3,i+1)
    sns.regplot(x='pH', y = acidity_related[i], data = df, scatter_kws = {'s':10}, color = color[1])
plt.tight_layout()
print("Figure 10: pH vs acid")

Figure 10: pH vs acid

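Multivariate Analysis

The three features most strongly correlated with quality are alcohol content, volatile acidity, and citric acid. The plots below show the relationship between alcohol content, volatile acidity, and quality.

For good wines (rated 7 or 8) and poor wines (rated 3 or 4), the relationship between alcohol content, volatile acidity, and quality is clear. For average wines (rated 5 or 6), however, the alcohol and volatile acidity values overlap heavily.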

plt.style.use('ggplot')

sns.lmplot(x = 'alcohol', y = 'volatile acidity', hue = 'quality', 
           data = df, fit_reg = False, scatter_kws={'s':10}, height = 5)  # `size` was renamed to `height` in seaborn
print("Figure 11-1: Scatter Plots of Alcohol, Volatile Acid and Quality")

Figure 11-1: Scatter Plots of Alcohol, Volatile Acid and Quality

sns.lmplot(x = 'alcohol', y = 'volatile acidity', col='quality', hue = 'quality', 
           data = df,fit_reg = False, height = 3,  aspect = 0.9, col_wrap=3,
           scatter_kws={'s':20})  # `size` was renamed to `height` in seaborn
print("Figure 11-2: Scatter Plots of Alcohol, Volatile Acid and Quality")

Figure 11-2: Scatter Plots of Alcohol, Volatile Acid and Quality

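pH, Fixed Acidity, and Citric Acid

pH is correlated with both fixed acidity and citric acid. The overall trend is as expected: the higher the acid concentration, the lower the pH.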

# style
sns.set_style('ticks')
sns.set_context("notebook", font_scale= 1.4)

plt.figure(figsize=(6,5))
cm = plt.get_cmap('RdBu')  # plt.cm.get_cmap was removed in recent matplotlib versions
sc = plt.scatter(df['fixed acidity'], df['citric acid'], c=df['pH'], vmin=2.6, vmax=4, s=15, cmap=cm)
bar = plt.colorbar(sc)
bar.set_label('pH', rotation = 0)
plt.xlabel('fixed acidity')
plt.ylabel('citric acid')
plt.xlim(4,18)
plt.ylim(0,1)
print('Figure 12: pH with Fixed Acidity and Citric Acid')

Figure 12: pH with Fixed Acidity and Citric Acid

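Summary

Overall, red wine quality is mainly related to alcohol content, volatile acidity, and citric acid. Wines rated 7 or higher and wines rated 4 or lower appear, visually, to be linearly separable. Wines rated 5 or 6, however, are hard to separate linearly.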

Output of df.head(5):
	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.998	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.997	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.997	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.998	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.998	3.51	0.56	9.4	5

Output of df.describe():
	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
count	1599.000	1599.000	1599.000	1599.000	1599.000	1599.000	1599.000	1599.000	1599.000	1599.000	1599.000	1599.000
mean	8.320	0.528	0.271	2.539	0.087	15.875	46.468	0.997	3.311	0.658	10.423	5.636
std	1.741	0.179	0.195	1.410	0.047	10.460	32.895	0.002	0.154	0.170	1.066	0.808
min	4.600	0.120	0.000	0.900	0.012	1.000	6.000	0.990	2.740	0.330	8.400	3.000
25%	7.100	0.390	0.090	1.900	0.070	7.000	22.000	0.996	3.210	0.550	9.500	5.000
50%	7.900	0.520	0.260	2.200	0.079	14.000	38.000	0.997	3.310	0.620	10.200	6.000
75%	9.200	0.640	0.420	2.600	0.090	21.000	62.000	0.998	3.400	0.730	11.100	6.000
max	15.900	1.580	1.000	15.500	0.611	72.000	289.000	1.004	4.010	2.000	14.900	8.000