[10 Years of Python Experience] E-commerce Platform Sales Data Analysis in Practice - Auto-Generating Datasets Before You Start
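📌 Why auto-generate datasets?
In real-world work we often run into the following:
- Data access is restricted (for example, sensitive information cannot be exported)
- A demo environment or teaching material has to be set up quickly
- Validation and testing are needed before training a model
- Newcomers have no real data to learn with
Being able to generate simulated datasets that follow business logic is therefore a very practical skill.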
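🎯 Goal: 10 common dataset scenarios with code templates
Using libraries such as pandas, numpy, and Faker, we will simulate common e-commerce datasets and provide runnable Python examples that you can reuse directly.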
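Every generator in this article draws random values, so two runs produce different data unless the random sources are seeded. The following is a minimal sketch, not one of the original templates, of one way to make a generator repeatable by passing a seed to NumPy's `default_rng`. The helper name `make_sample_orders` and its columns are hypothetical. Faker keeps its own random source, which can be seeded separately with `Faker.seed(...)`.

```python
import numpy as np
import pandas as pd

def make_sample_orders(n=5, seed=42):
    """Hypothetical helper: build a small order table from a seeded generator."""
    rng = np.random.default_rng(seed)  # local generator; global NumPy state is untouched
    return pd.DataFrame({
        "OrderID": [f"ORD{i:06d}" for i in range(n)],
        "Quantity": rng.integers(1, 5, size=n),  # 1 to 4, upper bound is exclusive
        "UnitPrice": rng.uniform(10, 500, size=n).round(2),
    })

first = make_sample_orders(seed=42)
second = make_sample_orders(seed=42)

# Same seed, same rows: this comparison holds on every run
print(first.equals(second))
```

Passing the seed as a parameter, rather than calling a global seed function once, lets each dataset be regenerated independently of the order in which the generators are called.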
✅ Scenario 1: Generate order transaction data (time, user ID, product ID, amount, etc.)
import pandas as pd
import numpy as np
from datetime import date, datetime, timedelta
from faker import Faker

fake = Faker()

def generate_order_data(n=1000):
    """Generate n simulated order rows."""
    data = []
    for i in range(n):
        order_id = f"ORD{i:06d}"
        user_id = np.random.randint(10000, 99999)  # upper bound is exclusive
        # Weighted choice: P1001 is the best seller
        product_id = np.random.choice(['P1001', 'P1002', 'P1003', 'P1004'], p=[0.4, 0.3, 0.2, 0.1])
        order_date = fake.date_between(start_date='-1y', end_date='today')
        quantity = np.random.randint(1, 5)  # 1 to 4 units
        unit_price = round(np.random.uniform(10, 500), 2)
        total_amount = round(quantity * unit_price, 2)
        payment_method = np.random.choice(['Alipay', 'WeChat', 'CreditCard'], p=[0.6, 0.3, 0.1])
        region = np.random.choice(['East', 'South', 'North', 'West'])
        data.append((order_id, user_id, product_id, order_date, quantity, unit_price, total_amount, payment_method, region))
    columns = ['OrderID', 'UserID', 'ProductID', 'OrderDate', 'Quantity', 'UnitPrice', 'TotalAmount', 'PaymentMethod', 'Region']
    return pd.DataFrame(data, columns=columns)

df_orders = generate_order_data(5000)
print(df_orders.head())
✅ Use cases: order analysis, sales statistics, payment channel analysis, regional distribution, and so on.
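As a rough illustration of the sales and channel analyses mentioned above, the sketch below aggregates a tiny hand-written order table that uses the same column names as Scenario 1. The five rows are made up for the example; on real generated data you would pass `df_orders` instead.

```python
import pandas as pd

# Hand-written stand-in for df_orders, using the Scenario 1 column names
orders = pd.DataFrame({
    "Region": ["East", "East", "South", "North", "South"],
    "PaymentMethod": ["Alipay", "WeChat", "Alipay", "CreditCard", "Alipay"],
    "TotalAmount": [100.0, 50.0, 80.0, 20.0, 40.0],
})

# Total sales per region, largest first
sales_by_region = (
    orders.groupby("Region")["TotalAmount"].sum().sort_values(ascending=False)
)

# Share of orders paid through each channel
payment_share = orders["PaymentMethod"].value_counts(normalize=True)

print(sales_by_region)
print(payment_share)
```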
✅ Scenario 2: Generate user profile data (age, gender, registration date, region)
def generate_user_profile(n=1000):
    """Generate n simulated user profiles with sequential IDs starting at 10000."""
    genders = ['Male', 'Female', 'Other']
    regions = ['East', 'South', 'North', 'West']
    data = []
    for i in range(n):
        user_id = i + 10000
        gender = np.random.choice(genders, p=[0.48, 0.48, 0.04])
        age = np.random.randint(18, 70)  # ages 18 to 69
        register_date = fake.date_between(start_date='-3y', end_date='today')
        region = np.random.choice(regions)
        is_vip = np.random.choice([0, 1], p=[0.85, 0.15])  # about 15% VIP users
        data.append((user_id, gender, age, register_date, region, is_vip))
    columns = ['UserID', 'Gender', 'Age', 'RegisterDate', 'Region', 'IsVIP']
    return pd.DataFrame(data, columns=columns)

df_users = generate_user_profile(2000)
print(df_users.head())
✅ Use cases: user segmentation, activity analysis, repurchase-rate modeling, and so on.
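One thing to watch when combining the templates: Scenario 1 draws `UserID` at random from 10000 to 99998, while Scenario 2 numbers users sequentially from 10000, so most orders would not match a profile in a join. A minimal sketch of one possible fix is to sample the order-side keys from the user table. The table sizes and the seed below are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Parent table: 50 users with sequential IDs, as in Scenario 2
users = pd.DataFrame({
    "UserID": np.arange(10000, 10050),
    "IsVIP": rng.choice([0, 1], size=50, p=[0.85, 0.15]),
})

# Child table: every order's UserID is drawn from the existing users
n_orders = 200
orders = pd.DataFrame({
    "OrderID": [f"ORD{i:06d}" for i in range(n_orders)],
    "UserID": rng.choice(users["UserID"].to_numpy(), size=n_orders),
    "TotalAmount": rng.uniform(10, 500, size=n_orders).round(2),
})

# validate= makes pandas raise if UserID is not unique on the user side
joined = orders.merge(users, on="UserID", how="left", validate="many_to_one")
print(joined["IsVIP"].isna().sum())  # number of orders without a matching user
```

The same idea applies to `ProductID`, whose format also differs between the scenarios in this article.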
✅ Scenario 3: Generate basic product information (category, brand, price range)
def generate_product_info(n=100):
    """Generate n simulated product records."""
    categories = ['Electronics', 'Clothing', 'Home', 'Beauty', 'Books']
    brands = ['BrandA', 'BrandB', 'BrandC', 'BrandD']
    data = []
    for i in range(n):
        product_id = f"P{i+100:04d}"  # P0100, P0101, ...
        category = np.random.choice(categories)
        brand = np.random.choice(brands)
        price = round(np.random.uniform(50, 1000), 2)
        stock = np.random.randint(100, 1000)
        launch_date = fake.date_between(start_date='-2y', end_date='today')
        data.append((product_id, category, brand, price, stock, launch_date))
    columns = ['ProductID', 'Category', 'Brand', 'Price', 'Stock', 'LaunchDate']
    return pd.DataFrame(data, columns=columns)

df_products = generate_product_info(100)
print(df_products.head())
✅ Use cases: product lifecycle management, low-stock alerts, pricing strategy, and so on.
✅ Scenario 4: Generate promotion records (campaign name, dates, discount level)
def generate_promotion_data(n=20):
    """Generate n simulated promotion campaigns."""
    channels = ['App', 'PC', 'MiniProgram']
    data = []
    for i in range(n):
        promo_id = f"PROMO{i+1:03d}"
        name = fake.catch_phrase()
        # Faker reads a lowercase 'm' as minutes, so the window of roughly
        # six months back to three months ahead is given in days
        start_date = fake.date_between(start_date='-180d', end_date='+90d')
        end_date = start_date + timedelta(days=int(np.random.randint(3, 30)))
        discount_rate = round(np.random.uniform(0.7, 0.95), 2)  # price multiplier, e.g. 0.8 = 20% off
        channel = np.random.choice(channels)
        sales_lift = round(np.random.uniform(1.2, 2.5), 2)
        data.append((promo_id, name, start_date, end_date, discount_rate, channel, sales_lift))
    columns = ['PromoID', 'Name', 'StartDate', 'EndDate', 'DiscountRate', 'Channel', 'SalesLift']
    return pd.DataFrame(data, columns=columns)

df_promos = generate_promotion_data(20)
print(df_promos.head())
✅ Use cases: evaluating promotion effectiveness, ROI analysis, marketing budget optimization, and so on.
✅ Scenario 5: Generate a daily visit log (user ID, visit time, time on page)
def generate_visit_log(n=5000):
    """Generate n simulated page visits."""
    pages = ['/home', '/product', '/cart', '/checkout', '/profile']
    data = []
    for _ in range(n):
        user_id = np.random.randint(10000, 99999)
        visit_time = fake.date_time_between(start_date='-1y', end_date='now')
        page = np.random.choice(pages)
        duration = np.random.exponential(5)  # mean time on page: 5 minutes
        device = np.random.choice(['Mobile', 'Desktop'], p=[0.7, 0.3])
        data.append((user_id, visit_time, page, round(duration, 2), device))
    columns = ['UserID', 'VisitTime', 'Page', 'DurationMin', 'Device']
    return pd.DataFrame(data, columns=columns)

df_visits = generate_visit_log(2000)
print(df_visits.head())
✅ Use cases: user behavior analysis, conversion funnel modeling, page optimization recommendations, and so on.
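To give a feel for the funnel analysis mentioned above, here is a simplified sketch that counts distinct users per page on a small hand-written visit log with the Scenario 5 column names. It assumes a funnel of `/home`, `/product`, `/cart`, `/checkout` and ignores the order of visits within a session, which a real funnel would need to respect. Note that Scenario 5 picks pages uniformly at random, so a funnel computed on that data will look flat; weighting the page choice would be needed for a realistic drop-off.

```python
import pandas as pd

# Hand-written stand-in for df_visits
visits = pd.DataFrame({
    "UserID": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    "Page": ["/home", "/product", "/cart", "/home", "/product",
             "/home", "/product", "/cart", "/checkout", "/home"],
})

# Assumed funnel order for this example
funnel_steps = ["/home", "/product", "/cart", "/checkout"]

# Distinct users who reached each page, arranged in funnel order
users_per_page = visits.groupby("Page")["UserID"].nunique()
funnel = users_per_page.reindex(funnel_steps, fill_value=0)

# Share of the entry-step users remaining at each step
conversion = (funnel / funnel.iloc[0]).round(2)

print(pd.DataFrame({"Users": funnel, "Conversion": conversion}))
```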
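Scenarios 6-10: brief overview (the full code can be run in Jupyter)
Scenario | Name | Description |
---|---|---|
Scenario 6 | Browsing record generation | Generates a browsing record dataset with view ID, customer ID, product ID, view time, and IP address |
Scenario 7 | Shopping cart generation | Generates a shopping cart dataset with cart item ID, customer ID, product ID, add time, and remove time |
Scenario 8 | Promotion generation | Generates a promotion dataset with promotion ID, name, start date, end date, and discount rate |
Scenario 9 | User review generation | Generates a review dataset with review ID, customer ID, product ID, rating, and comment text |
Scenario 10 | Logistics generation | Generates a logistics dataset with shipment ID, order ID, courier ID, dispatch date, estimated delivery date, and actual delivery date |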
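Scenario 6: Browsing record generation
Description: Generates a browsing record dataset with view ID, customer ID, product ID, view time, and IP address.
import ipaddress

np.random.seed(42)  # fixed seed so the NumPy draws repeat across runs
num_views = 2000
view_ids = range(1, num_views + 1)
customer_ids = np.random.choice(range(1, 501), size=num_views)
product_ids = np.random.choice([f'P{i+1:03d}' for i in range(100)], size=num_views)
# Each view falls on a random minute of the current day
view_times = [pd.Timestamp.now().normalize() + pd.Timedelta(minutes=int(np.random.randint(0, 1440))) for _ in range(num_views)]
# Random 32-bit integers rendered as dotted IPv4 strings;
# dtype=np.int64 keeps the 2**32 bound valid on platforms where the default integer is 32-bit
ip_addresses = [str(ipaddress.IPv4Address(int(np.random.randint(0, 2**32, dtype=np.int64)))) for _ in range(num_views)]

views_df = pd.DataFrame({
    'ViewID': view_ids,
    'CustomerID': customer_ids,
    'ProductID': product_ids,
    'ViewTime': view_times,
    'IPAddress': ip_addresses
})
print(views_df.head())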
Scenario 7: Shopping cart generation
Description: Generates a shopping cart dataset with cart item ID, customer ID, product ID, add time, and remove time.
np.random.seed(42)
num_cart_items = 1500
cart_item_ids = range(1, num_cart_items + 1)
customer_ids = np.random.choice(range(1, 501), size=num_cart_items)
product_ids = np.random.choice([f'P{i+1:03d}' for i in range(100)], size=num_cart_items)
# Each item is added at a random minute of the current day
add_times = [pd.Timestamp.now().normalize() + pd.Timedelta(minutes=int(np.random.randint(0, 1440))) for _ in range(num_cart_items)]
# About 70% of items are removed later; the rest stay in the cart (None becomes NaT)
remove_times = [(t + pd.Timedelta(minutes=int(np.random.randint(0, 1440)))) if np.random.rand() > 0.3 else None for t in add_times]

carts_df = pd.DataFrame({
    'CartItemID': cart_item_ids,
    'CustomerID': customer_ids,
    'ProductID': product_ids,
    'AddTime': add_times,
    'RemoveTime': remove_times
})
print(carts_df.head())
Scenario 8: Promotion generation
Description: Generates a promotion dataset with promotion ID, name, start date, end date, and discount rate.
from datetime import date, timedelta  # date is needed for date.today()

np.random.seed(42)
num_promotions = 20
promotion_ids = range(1, num_promotions + 1)
names = [f'Promotion {i+1}' for i in range(num_promotions)]
# Start dates fall within the past year; each promotion lasts 1 to 29 days
start_dates = [date.today() - timedelta(days=int(np.random.randint(0, 365))) for _ in range(num_promotions)]
end_dates = [sd + timedelta(days=int(np.random.randint(1, 30))) for sd in start_dates]
# Here the rate is the fraction taken off the price (5% to 50%)
discount_rates = np.random.uniform(0.05, 0.5, size=num_promotions).round(2)

promotions_df = pd.DataFrame({
    'PromotionID': promotion_ids,
    'Name': names,
    'StartDate': start_dates,
    'EndDate': end_dates,
    'DiscountRate': discount_rates
})
print(promotions_df.head())
Scenario 9: User review generation
Description: Generates a user review dataset with review ID, customer ID, product ID, rating, and comment text.
np.random.seed(42)
num_reviews = 500
review_ids = range(1, num_reviews + 1)
customer_ids = np.random.choice(range(1, 501), size=num_reviews)
product_ids = np.random.choice([f'P{i+1:03d}' for i in range(100)], size=num_reviews)
ratings = np.random.randint(1, 6, size=num_reviews)  # 1 to 5 stars
comment_pool = ['This is a great product!', 'Not what I expected.', 'Average quality.', 'Highly recommend!', 'Terrible experience.']
# Sampling keeps the list length equal to num_reviews for any size;
# note that the comment text is drawn independently of the rating
comments = np.random.choice(comment_pool, size=num_reviews)

reviews_df = pd.DataFrame({
    'ReviewID': review_ids,
    'CustomerID': customer_ids,
    'ProductID': product_ids,
    'Rating': ratings,
    'Comment': comments
})
print(reviews_df.head())
Scenario 10: Logistics generation
Description: Generates a logistics dataset with shipment ID, order ID, courier ID, dispatch date, estimated delivery date, and actual delivery date.
from datetime import date, timedelta  # date is needed for date.today()

np.random.seed(42)
num_shipments = 1000
shipment_ids = range(1, num_shipments + 1)
order_ids = np.random.choice(range(1, 1001), size=num_shipments)
courier_ids = np.random.choice(range(1, 101), size=num_shipments)
dispatch_dates = [date.today() - timedelta(days=int(np.random.randint(0, 365))) for _ in range(num_shipments)]
estimated_delivery_dates = [dd + timedelta(days=int(np.random.randint(1, 10))) for dd in dispatch_dates]
# About 80% of shipments arrive up to 2 days early or late (randint's upper bound is exclusive,
# so 3 is needed to include +2); max() keeps delivery from preceding dispatch
actual_delivery_dates = [
    max(dd, ed + timedelta(days=int(np.random.randint(-2, 3)))) if np.random.rand() > 0.2 else ed
    for dd, ed in zip(dispatch_dates, estimated_delivery_dates)
]

logistics_df = pd.DataFrame({
    'ShipmentID': shipment_ids,
    'OrderID': order_ids,
    'CourierID': courier_ids,
    'DispatchDate': dispatch_dates,
    'EstimatedDeliveryDate': estimated_delivery_dates,
    'ActualDeliveryDate': actual_delivery_dates
})
print(logistics_df.head())
Each scenario above comes with a detailed code example that you can adjust or extend to fit your own business needs.
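When adapting the templates, it can help to run a few sanity checks on the output before using it downstream. The sketch below is one possible shape for such checks, written against the promotion columns from Scenario 8. The function name `check_promotions` and the specific rules are assumptions for illustration, not part of the original templates.

```python
import pandas as pd

def check_promotions(df):
    """Hypothetical sanity checks; returns a list of problem descriptions."""
    problems = []
    if df["PromotionID"].duplicated().any():
        problems.append("duplicate PromotionID")
    if (df["EndDate"] < df["StartDate"]).any():
        problems.append("EndDate before StartDate")
    if not df["DiscountRate"].between(0, 1).all():
        problems.append("DiscountRate outside [0, 1]")
    return problems

# A small valid table
good = pd.DataFrame({
    "PromotionID": [1, 2],
    "StartDate": pd.to_datetime(["2024-01-01", "2024-02-01"]),
    "EndDate": pd.to_datetime(["2024-01-10", "2024-02-15"]),
    "DiscountRate": [0.1, 0.3],
})

# The same table with two deliberate errors in the second row
bad = good.copy()
bad.loc[1, "EndDate"] = pd.Timestamp("2024-01-15")  # ends before it starts
bad.loc[1, "DiscountRate"] = 1.5                    # more than 100% off

print(check_promotions(good))
print(check_promotions(bad))
```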