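Last time I covered feature selection; this time let's keep going. When you get a dataset, clean it first!
What it covers
Missing-value check and handling (filled with the median here)
Outlier check; this time the outliers are removed
Duplicate detection; this time the duplicates are dropped
Sample-balance assessment, i.e. value frequency counts plus weighting
Character-encoding check
The idea
The 4 steps of data cleaning. To give them a name: 缺衣服很 ("short of clothes"), a pun on 缺异复衡 (missing, outlier, duplicate, balance), hahaha
1) Check and handle missing values
2) Check and handle outliers
3) Check and handle duplicates
4) Check and handle balance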
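As a rough sketch of the "check" half of these four steps, the function below collects all four measurements in one pass. The name `quality_report`, the `label_col` parameter, the 1.5 × IQR rule for counting outliers, and the toy data are assumptions made for illustration; they are not taken from the files used in this post.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, label_col: str) -> dict:
    """Summarize the four checks: missing, outliers, duplicates, balance."""
    numeric = df.select_dtypes(include="number")

    # Outliers: count values outside the 1.5 * IQR fences, per numeric column
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric.lt(q1 - 1.5 * iqr, axis="columns")
               | numeric.gt(q3 + 1.5 * iqr, axis="columns"))

    counts = df[label_col].value_counts()
    return {
        "missing": df.isnull().sum().to_dict(),
        "outliers": outside.sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
        # Most frequent / least frequent value; 1.0 means perfectly balanced
        "imbalance_ratio": float(counts.max() / counts.min()),
    }


# Hypothetical toy data, not the article's CSV
toy = pd.DataFrame({
    "age": [25, 25, 30, 30, 30, 41],
    "weight": [65.0, 65.0, 70.0, None, 72.0, 300.0],
    "group": ["a", "a", "b", "b", "b", "b"],
})
report = quality_report(toy, label_col="group")
print(report)
```

A report like this only measures; whether to fill, drop, or reweight is still a decision made per dataset.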
Also, during this exercise I hit a file that was read with the wrong encoding, so decoding failed. Marking that here too!
Hands-on
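1. Missing-value detection and median fill
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv('abc_valuemissing.csv')
# Count the missing values in each column
missing_values_count = df.isnull().sum()
# Print the missing-value counts
print(missing_values_count)
#### Fill the missing values below
# Compute the median of each numeric column (missing values are skipped);
# numeric_only=True avoids an error when the file also has text columns
median_values = df.median(numeric_only=True)
# Fill each numeric column's missing values with that column's median
df_filled = df.fillna(median_values)
# Inspect the filled DataFrame
print(df_filled)
# If needed, save the filled DataFrame back to a CSV file
df_filled.to_csv('data1_filled_with_median.csv', index=False)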
2. Outlier detection: box plot
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Read the CSV file
df = pd.read_csv('abc.csv')
# Assume the CSV file has two columns: 'age' and 'weight'
# Find the outliers in weight with the 1.5 * IQR rule
def find_outliers_by_iqr(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return series[(series < lower_bound) | (series > upper_bound)]
outliers_weight = find_outliers_by_iqr(df['weight'])  # assumes the column is named 'weight'
# Create a single figure
plt.figure(figsize=(10, 5))  # only one box plot, so it does not need to be tall
# Box plot of weight (horizontal: the values run along the x axis)
ax = sns.boxplot(x=df['weight'], color='lightgreen')
ax.set_title('weight box plot')
# Label the weight outliers
for value in outliers_weight:
    # The outlier's x position is the value itself; the single box sits at y = 0.
    # Shift the label 10 points vertically so it does not cover the marker.
    ax.annotate(str(value), xy=(value, 0), xytext=(0, 10), textcoords='offset points',
                ha='center', size='small', color='red', weight='semibold')
# Show the figure
plt.tight_layout()  # adjust the layout
plt.show()
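The summary at the top says the outliers are removed this time, while the code above only finds and plots them. A minimal sketch of the removal step, assuming the same 1.5 × IQR fences; the function name `drop_outliers_by_iqr`, the parameter `k`, and the sample values are hypothetical.

```python
import pandas as pd


def drop_outliers_by_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return only the rows whose `column` value lies within the IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive, so values exactly on a fence are kept.
    # Rows where `column` is missing are dropped too, so fill missing values first.
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


# Hypothetical sample; 165 is far above the rest
sample = pd.DataFrame({"weight": [65, 68, 70, 72, 75, 78, 80, 165]})
cleaned = drop_outliers_by_iqr(sample, "weight")
print(cleaned["weight"].tolist())
```

Removing rows is only one option; capping values at the fences is another, and which one fits depends on whether the extreme values are errors or real observations.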
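Outlier detection: the standard-deviation method
import pandas as pd
import numpy as np
# Assume df is a DataFrame holding age and weight
# df = pd.read_csv('data.csv')  # uncomment this line if the data comes from a CSV file
# Here we use the sample data below ('年龄' = age, '体重' = weight)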
data = {
'年龄': [25, 32, 41, 28, 37, 30, 23, 39, 45, 29, 34, 22, 40, 31, 27, 36, 24, 42, 33, 38, 44, 26, 43, 35, 21, 46, 20, 32, 47, 25, 39, 30, 41, 28, 37, 34, 23, 40, 31, 27, 36, 24, 42, 33, 38, 29, 26, 45, 35, 22, 43, 21, 48, 20, 32, 47, 25, 39, 30, 41, 28, 37, 34],
'体重': [65, 165, 80, 68, 75, 70, 62, 85, 90, 102, 74, 60, 82, 71, 67, 78, 64, 87, 73, 84, 89, 66, 88, 77, 59, 92, 58, 72, 95, 65, 83, 70, 81, 68, 75, 74, 63, 86, 71, 67, 78, 64, 89, 73, 84, 69, 66, 91, 76, 61, 88, 59, 96, 58, 72, 93, 65, 83, 70, 81, 68, 75, 74]
}
df = pd.DataFrame(data)
# Compute the mean and standard deviation of age and weight
age_mean, age_std = df['年龄'].mean(), df['年龄'].std()
weight_mean, weight_std = df['体重'].mean(), df['体重'].std()
# Outlier threshold: 3 standard deviations from the mean
threshold_age = 3 * age_std
threshold_weight = 3 * weight_std
# Flag the outliers in age and weight ('是' = yes, '否' = no)
df['年龄_异常'] = np.where((df['年龄'] - age_mean).abs() > threshold_age, '是', '否')
df['体重_异常'] = np.where((df['体重'] - weight_mean).abs() > threshold_weight, '是', '否')
print(df[['年龄', '体重', '年龄_异常', '体重_异常']])
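3. Duplicate detection
import pandas as pd
# Read the CSV file
df = pd.read_csv('abc_valuemissingdup.csv')
# Check whether the (age, weight) pair is duplicated
duplicates = df.duplicated(subset=['age', 'weight'])
# Count the duplicated rows (first occurrences are not counted, because duplicated() defaults to keep='first')
num_duplicates = duplicates.sum()
# Print the number of duplicated rows
if num_duplicates > 0:
    print(f"Found {num_duplicates} duplicates (based on age and weight)")
    # Print the duplicated rows
    print("Duplicated rows:")
    print(df[duplicates])
else:
    print("No duplicates (based on age and weight)")
# To see how many times each duplicated pair occurs in total,
# use keep=False so that the first occurrence is counted as well
all_occurrences = df[df.duplicated(subset=['age', 'weight'], keep=False)]
duplicate_groups = all_occurrences.groupby(['age', 'weight']).size().reset_index(name='count')
print("\nOccurrences of each duplicated pair:")
print(duplicate_groups)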
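Removing duplicates
import pandas as pd
# Read the CSV file
df = pd.read_csv('abc.csv')  # assumes the CSV file is named abc.csv
# Drop rows that duplicate an earlier (age, weight) pair, keeping the first occurrence
df_no_duplicates = df.drop_duplicates(subset=['age', 'weight'])
# Print the DataFrame after dropping duplicates
print("DataFrame after dropping duplicates:")
print(df_no_duplicates)
# If you want to save the result back to a CSV file
df_no_duplicates.to_csv('data_no_duplicates.csv', index=False)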
4. Balance check
import pandas as pd
import matplotlib.pyplot as plt
# Read the CSV file
df = pd.read_csv('abc.csv')  # assumes the CSV file is named abc.csv
# Count the distinct values in the age column and how often each occurs
age_counts = df['age'].value_counts().sort_index()
# Print the counts per age value
print("Age value counts:")
print(age_counts)
# Plot the counts
plt.figure(figsize=(10, 6))  # set the figure size
plt.bar(age_counts.index, age_counts.values, color='blue', alpha=0.7)  # bar chart of the counts
plt.title('Age Distribution')  # set the title
plt.xlabel('Age')  # set the x-axis label
plt.ylabel('Count')  # set the y-axis label
plt.xticks(rotation=45)  # rotate the tick labels in case many age values overlap
plt.tight_layout()  # adjust the layout so labels do not overlap
plt.show()  # show the figure
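The summary at the top mentions weighting as the way to handle imbalance, but the code above stops at counting. One common scheme, shown here as a sketch, gives each row a weight inversely proportional to how often its value occurs. The post does not say which weighting it uses, so the formula, the function name `balanced_weights`, and the sample labels are assumptions; the sketch also assumes the column has no missing values.

```python
import pandas as pd


def balanced_weights(values: pd.Series) -> pd.Series:
    """Per-row weight = n_rows / (n_distinct_values * count_of_that_value)."""
    counts = values.value_counts()
    per_value = len(values) / (len(counts) * counts)
    # Look up each row's value to get its weight
    return values.map(per_value)


# Hypothetical labels: 'a' is three times as frequent as 'b'
labels = pd.Series(["a", "a", "a", "b"])
weights = balanced_weights(labels)
print(weights.tolist())
```

With these weights every distinct value contributes the same total weight, and the weights sum to the number of rows, so they can be passed to anything that accepts per-sample weights.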
5. Encoding detection
import chardet
# Read the raw bytes and let chardet guess the encoding
with open('purchase.csv', 'rb') as f:
    result = chardet.detect(f.read())
print(result)
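If chardet is not available, a cruder alternative is to try a short list of candidate encodings and keep the first one that decodes the file without error. This is a different technique from the detection above, and the function name `first_working_encoding`, the default candidate list, and the temporary file are assumptions for illustration. A successful decode does not prove the guess is right, so the candidates should be ordered from strictest to most permissive.

```python
import os
import tempfile
from typing import Optional, Sequence


def first_working_encoding(path: str,
                           candidates: Sequence[str] = ("utf-8", "gbk")) -> Optional[str]:
    """Return the first candidate that decodes the whole file, or None."""
    with open(path, "rb") as f:
        raw = f.read()
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next one
        return name
    return None


# Hypothetical file written in GBK, which is not valid UTF-8
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write("name,city\nLi,北京\n".encode("gbk"))
detected = first_working_encoding(tmp.name)
os.remove(tmp.name)
print(detected)
```

The returned name can then be passed as the `encoding` argument of `pd.read_csv`.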
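Closing notes
The biggest thing I've noticed lately is that my other half has started exercising. Cheering her on here! I hope running makes her better and better!
Also, I recently came across hidden Markov chains; consider this a preview, I plan to look into them!
Below is a breakdown of a few sentences!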
Many plants and animals disappear abruptly from the fossil record as one moves from layers of rock documenting the end of the Cretaceous up into rocks representing the beginning of the Cenozoic (the era after the Mesozoic).
Main clause
Many plants and animals disappear abruptly from the fossil record
Adverbial clause of time
as one moves from layers of rock ... up into rocks ...
Participial phrase as postmodifier (modifies "layers of rock")
documenting the end of the Cretaceous
Participial phrase as postmodifier (modifies "rocks")
representing the beginning of the Cenozoic
The growth of mutual trust among merchants facilitated the growth of sales on credit and led to new developments in finance, such as the bill of exchange, a device that made the long, slow, and very dangerous shipment of coins unnecessary.
Main clause (coordinated predicates)
The growth of mutual trust among merchants facilitated the growth of sales on credit and led to new developments in finance
such as the bill of exchange: an example illustrating "new developments in finance"
a device that made the long, slow, and very dangerous shipment of coins unnecessary: an appositive to "the bill of exchange"
Amphibians are therefore hardly at the mercy of ambient temperature, since by means of the mechanisms described above they are more than able to exercise some control over their body temperature.
Main clause: Amphibians are therefore hardly at the mercy of ambient temperature
Adverbial clause of reason
since they are more than able to exercise some control over their body temperature.
Adverbial of manner: by means of the mechanisms described above