
【AI】DataCleaning

Published: 2024-06-07 06:09:26 Views: 2042

Author: 毛毛Post


Last time we handled feature selection; now let's look at the step that comes before it. When you get a dataset, start by cleaning it!

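The results

Missing-value check and handling (here, filling with the median)

Outlier check; this time the outliers are removed

Duplicate detection; this time the duplicates are removed

Sample balance assessment, i.e. counting value frequencies, plus weighting

Character encoding check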

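The principles

The four steps of data cleaning, with a mnemonic: 缺异复衡 (missing, outlier, duplicate, balance), which sounds a bit like "缺衣服很" ("really short on clothes"), haha.

1) Detect and handle missing values

2) Detect and handle outliers

3) Detect and handle duplicates

4) Check and handle balance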
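The four steps above can be chained into one small function. The following is only a minimal sketch, not the code used later in this post: the function name `clean`, the toy data, and the choice of the 1.5 * IQR rule for step 2 are assumptions made for illustration.

```python
import pandas as pd


def clean(df, numeric_cols):
    """Hypothetical helper: fill missing values, drop IQR outliers, drop duplicates."""
    out = df.copy()
    # 1) Missing values: fill each numeric column with its own median
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    # 2) Outliers: column by column, keep rows inside the 1.5 * IQR fences
    for col in numeric_cols:
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        out = out[out[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    # 3) Duplicates: keep the first occurrence of each combination
    out = out.drop_duplicates(subset=numeric_cols)
    return out.reset_index(drop=True)


# Toy data (made up): one missing weight, one extreme weight, one repeated row
toy = pd.DataFrame({
    "age": [25, 32, 41, 28, 25, 30, 37],
    "weight": [65, 70, None, 68, 65, 300, 72],
})
cleaned = clean(toy, ["age", "weight"])

# 4) Balance: inspect how often each value occurs after cleaning
print(cleaned["age"].value_counts().sort_index())
```

The order matters a little: filling missing values first means the median-filled rows also pass through the outlier and duplicate checks.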

Also, during this exercise I hit a file whose encoding did not match what the reader expected, so decoding failed. I am noting that here as well!

Hands-on

1. Missing-value detection and median filling

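import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('abc_valuemissing.csv')

# Count the missing values in each column
missing_values_count = df.isnull().sum()
print(missing_values_count)

# Fill the missing values
# Median of each numeric column (missing values are skipped);
# numeric_only=True avoids errors when the file also has text columns
median_values = df.median(numeric_only=True)

# Fill each column's missing values with that column's median
df_filled = df.fillna(median_values)
print(df_filled)

# Optionally save the filled DataFrame back to a CSV file
df_filled.to_csv('data1_filled_with_median.csv', index=False)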

2. Outlier detection: box plot

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Read the CSV file; it is assumed to have two columns, 'age' and 'weight'
df = pd.read_csv('abc.csv')


def find_outliers_by_iqr(series):
    """Return the values that fall outside the 1.5 * IQR fences."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return series[(series < lower_bound) | (series > upper_bound)]


outliers_weight = find_outliers_by_iqr(df['weight'])

# A single box plot, so the figure does not need to be tall
plt.figure(figsize=(10, 5))
sns.boxplot(x=df['weight'], color='lightgreen')
plt.title('weight box plot')

# Label each outlier. The plot is horizontal, so the value itself is the
# x coordinate; the single box is drawn at y = 0, and the label is shifted
# a few points away from the marker so the two do not overlap.
for index, value in outliers_weight.items():
    plt.annotate(str(value), xy=(value, 0), xytext=(0, 12),
                 textcoords='offset points', ha='center',
                 size='small', color='red', weight='semibold')

plt.tight_layout()
plt.show()

Outlier detection: standard deviation method

import pandas as pd

# If your data lives in a CSV file, load it instead:
# df = pd.read_csv('data.csv')
# Here a hard-coded sample of ages and weights is used
data = {
    'age': [25, 32, 41, 28, 37, 30, 23, 39, 45, 29, 34, 22, 40, 31, 27, 36, 24, 42, 33, 38, 44, 26, 43, 35, 21, 46, 20, 32, 47, 25, 39, 30, 41, 28, 37, 34, 23, 40, 31, 27, 36, 24, 42, 33, 38, 29, 26, 45, 35, 22, 43, 21, 48, 20, 32, 47, 25, 39, 30, 41, 28, 37, 34],
    'weight': [65, 165, 80, 68, 75, 70, 62, 85, 90, 102, 74, 60, 82, 71, 67, 78, 64, 87, 73, 84, 89, 66, 88, 77, 59, 92, 58, 72, 95, 65, 83, 70, 81, 68, 75, 74, 63, 86, 71, 67, 78, 64, 89, 73, 84, 69, 66, 91, 76, 61, 88, 59, 96, 58, 72, 93, 65, 83, 70, 81, 68, 75, 74],
}
df = pd.DataFrame(data)

# Mean and standard deviation of each column, computed once
age_mean, age_std = df['age'].mean(), df['age'].std()
weight_mean, weight_std = df['weight'].mean(), df['weight'].std()

# Outlier threshold: 3 standard deviations from the mean
threshold_age = 3 * age_std
threshold_weight = 3 * weight_std

# Flag a value as an outlier when it is further from the mean than the threshold
labels = {True: 'yes', False: 'no'}
df['age_outlier'] = ((df['age'] - age_mean).abs() > threshold_age).map(labels)
df['weight_outlier'] = ((df['weight'] - weight_mean).abs() > threshold_weight).map(labels)

print(df[['age', 'weight', 'age_outlier', 'weight_outlier']])
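The code above only flags outliers, while the summary at the top says they were removed. A minimal sketch of the removal step under the same 3-standard-deviation rule might look like this; the helper name `drop_outliers_by_std` and the toy data are assumptions for illustration, not part of the original code.

```python
import pandas as pd


def drop_outliers_by_std(df, column, k=3.0):
    """Hypothetical helper: keep rows whose value is within k standard deviations of the mean."""
    mean = df[column].mean()
    std = df[column].std()  # sample standard deviation (ddof=1), the pandas default
    keep = (df[column] - mean).abs() <= k * std
    return df[keep]


# Toy data (made up): nineteen ordinary weights and one extreme value
sample = pd.DataFrame({"weight": [70] * 19 + [165]})
trimmed = drop_outliers_by_std(sample, "weight")
print(len(sample), "->", len(trimmed))
```

One caveat of this rule: the outlier itself inflates the standard deviation, so on very small samples (roughly ten rows or fewer) no value can ever be more than 3 standard deviations from the mean, and nothing gets removed.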

3. Duplicate detection

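import pandas as pd

# Read the CSV file
df = pd.read_csv('abc_valuemissingdup.csv')

# Flag rows whose (age, weight) pair has already appeared earlier;
# by default duplicated() does not flag the first occurrence
duplicates = df.duplicated(subset=['age', 'weight'])
num_duplicates = duplicates.sum()

if num_duplicates > 0:
    print(f"Found {num_duplicates} duplicate rows (based on age and weight)")
    print("Duplicate rows:")
    print(df[duplicates])
else:
    print("No duplicates (based on age and weight)")

# Count how often each duplicated pair occurs; keep=False also includes
# the first occurrence, so the counts are the true number of appearances
all_occurrences = df[df.duplicated(subset=['age', 'weight'], keep=False)]
duplicate_groups = all_occurrences.groupby(['age', 'weight']).size().reset_index(name='count')
print("\nOccurrences of each duplicated pair:")
print(duplicate_groups)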

Removing duplicates

import pandas as pd

# Read the CSV file (assumed to be named abc.csv)
df = pd.read_csv('abc.csv')

# Drop rows with a repeated (age, weight) pair, keeping the first occurrence
df_no_duplicates = df.drop_duplicates(subset=['age', 'weight'])

print("DataFrame after removing duplicates:")
print(df_no_duplicates)

# Optionally save the result back to a CSV file
df_no_duplicates.to_csv('data_no_duplicates.csv', index=False)


4. Balance check

import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV file (assumed to be named abc.csv)
df = pd.read_csv('abc.csv')

# Count how often each distinct age occurs, sorted by age
age_counts = df['age'].value_counts().sort_index()
print("Age value counts:")
print(age_counts)

# Bar chart of the counts (one bar per distinct age)
plt.figure(figsize=(10, 6))
plt.bar(age_counts.index, age_counts.values, color='blue', alpha=0.7)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.xticks(rotation=45)  # rotate the labels in case many ages crowd the x axis
plt.tight_layout()       # adjust the layout so labels are not cut off
plt.show()
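The summary at the top mentions weighting as the follow-up to the frequency count, but the code above only counts. One common option is inverse-frequency weighting, sketched below; the function name `inverse_frequency_weights`, the column name `sample_weight`, and the toy data are assumptions, and this may not be the exact weighting the original run used.

```python
import pandas as pd


def inverse_frequency_weights(values):
    """Hypothetical helper: weight each row by n_rows / (n_distinct_values * count_of_its_value)."""
    counts = values.value_counts()
    per_value = len(values) / (len(counts) * counts)  # rare values get weights above 1
    return values.map(per_value)


# Toy data (made up): age 20 appears three times, age 30 only once
ages = pd.Series([20, 20, 20, 30], name="age")
weights = inverse_frequency_weights(ages)
print(pd.DataFrame({"age": ages, "sample_weight": weights}))
```

With these weights every distinct value contributes the same total weight, and the weights still sum to the number of rows, so downstream statistics keep their original scale.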

5. Encoding detection

import chardet

# Read the raw bytes and let chardet guess the encoding
with open('purchase.csv', 'rb') as f:
    result = chardet.detect(f.read())
print(result)
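If chardet is not installed, or its guess turns out to be wrong, a fallback is to try a short list of likely encodings and keep the first one that decodes the whole file. This is only a sketch using the standard library: the function name `first_working_encoding` and the candidate list are assumptions, and a successful decode does not prove the encoding is the right one.

```python
import os
import tempfile


def first_working_encoding(path, candidates=("utf-8", "gb18030")):
    """Hypothetical helper: return the first candidate that decodes the file, else None."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# Demo on a throwaway file written in GBK, which is not valid UTF-8
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("姓名,年龄\n张三,25\n".encode("gbk"))

detected = first_working_encoding(tmp.name)
print(detected)
os.remove(tmp.name)
```

Either result, the name returned here or the `'encoding'` entry of chardet's result dictionary, can then be passed to `pd.read_csv(path, encoding=...)` so that the read no longer fails on decoding.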

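Closing thoughts

My biggest takeaway lately is that my other half has started exercising. Cheering her on here! I hope running keeps making her better and better!

I also recently came across hidden Markov chains; consider this a preview of a future deep dive!

Below are breakdowns of a few sentences!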

Many plants and animals disappear abruptly from the fossil record as one moves from layers of rock documenting the end of the Cretaceous up into rocks representing the beginning of the Cenozoic (the era after the Mesozoic).

Main clause: Many plants and animals disappear abruptly from the fossil record

Adverbial clause of time: as one moves from layers of rock ... up into rocks ...

Non-finite postmodifier (of "layers of rock"): documenting the end of the Cretaceous

Non-finite postmodifier (of "rocks"): representing the beginning of the Cenozoic

The growth of mutual trust among merchants facilitated the growth of sales on credit and led to new developments in finance, such as the bill of exchange, a device that made the long, slow, and very dangerous shipment of coins unnecessary.

Main clause with coordinated predicates: The growth of mutual trust among merchants facilitated the growth of sales on credit and led to new developments in finance

Example illustrating "new developments in finance": such as the bill of exchange

Appositive of "the bill of exchange": a device that made the long, slow, and very dangerous shipment of coins unnecessary

Amphibians are therefore hardly at the mercy of ambient temperature, since by means of the mechanisms described above they are more than able to exercise some control over their body temperature.

Main clause: Amphibians are therefore hardly at the mercy of ambient temperature

Adverbial clause of reason: since they are more than able to exercise some control over their body temperature

Adverbial of manner: by means of the mechanisms described above