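📌 Why generate datasets automatically?

In day-to-day work we often run into situations like these:

  • Data access is restricted (for example, sensitive information cannot be exported)
  • A demo environment or teaching material has to be set up quickly
  • A pipeline needs to be validated and tested before model training
  • Newcomers have no real data to learn with

Being able to generate simulated datasets that follow business logic is therefore a very practical skill.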


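🎯 Goal: 10 common dataset scenarios with code templates

We will use pandas, numpy, Faker, and similar libraries to simulate common e-commerce datasets, with runnable Python examples that you can reuse directly.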


✅ Scenario 1: Order transaction data (time, user ID, product ID, amount, etc.)

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from faker import Faker

fake = Faker()

def generate_order_data(n=1000):
    data = []
    for i in range(n):
        order_id = f"ORD{i:06d}"
        user_id = np.random.randint(10000, 99999)
        product_id = np.random.choice(['P1001', 'P1002', 'P1003', 'P1004'], p=[0.4, 0.3, 0.2, 0.1])
        order_date = fake.date_between(start_date='-1y', end_date='today')
        quantity = np.random.randint(1, 5)
        unit_price = round(np.random.uniform(10, 500), 2)
        total_amount = round(quantity * unit_price, 2)
        payment_method = np.random.choice(['Alipay', 'WeChat', 'CreditCard'], p=[0.6, 0.3, 0.1])
        region = np.random.choice(['East', 'South', 'North', 'West'])
        data.append((order_id, user_id, product_id, order_date, quantity, unit_price, total_amount, payment_method, region))

    columns = ['OrderID', 'UserID', 'ProductID', 'OrderDate', 'Quantity', 'UnitPrice', 'TotalAmount', 'PaymentMethod', 'Region']
    return pd.DataFrame(data, columns=columns)

df_orders = generate_order_data(5000)
print(df_orders.head())

Use cases: order analysis, sales statistics, channel analysis, regional distribution, and so on.
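As one way to exercise the use cases above, the sketch below aggregates simulated orders by region and by payment method. It is a minimal illustration rather than part of the original templates: it builds its own small table with a seeded `np.random.default_rng` (an assumption, chosen so the numbers are repeatable) instead of calling `generate_order_data`, and the names `orders`, `sales_by_region`, and `payment_share` are made up for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the sketch is repeatable
n = 1000

# A small stand-in for the order table from Scenario 1 (same column names)
orders = pd.DataFrame({
    "Region": rng.choice(["East", "South", "North", "West"], size=n),
    "PaymentMethod": rng.choice(["Alipay", "WeChat", "CreditCard"], size=n, p=[0.6, 0.3, 0.1]),
    "Quantity": rng.integers(1, 5, size=n),           # 1 to 4 items per order
    "UnitPrice": rng.uniform(10, 500, size=n).round(2),
})
orders["TotalAmount"] = (orders["Quantity"] * orders["UnitPrice"]).round(2)

# Order count, revenue, and average order value per region
sales_by_region = orders.groupby("Region")["TotalAmount"].agg(["count", "sum", "mean"]).round(2)

# Share of orders per payment method
payment_share = orders["PaymentMethod"].value_counts(normalize=True).round(3)

print(sales_by_region)
print(payment_share)
```

The same two calls should work on `df_orders` from the listing above, since the column names match.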


✅ Scenario 2: User profile data (age, gender, registration date, region)

def generate_user_profile(n=1000):
    genders = ['Male', 'Female', 'Other']
    regions = ['East', 'South', 'North', 'West']
    data = []
    for i in range(n):
        user_id = i + 10000
        gender = np.random.choice(genders, p=[0.48, 0.48, 0.04])
        age = np.random.randint(18, 70)
        register_date = fake.date_between(start_date='-3y', end_date='today')
        region = np.random.choice(regions)
        is_vip = np.random.choice([0, 1], p=[0.85, 0.15])
        data.append((user_id, gender, age, register_date, region, is_vip))

    columns = ['UserID', 'Gender', 'Age', 'RegisterDate', 'Region', 'IsVIP']
    return pd.DataFrame(data, columns=columns)

df_users = generate_user_profile(2000)
print(df_users.head())

Use cases: user segmentation, activity analysis, repurchase-rate modeling, and so on.
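User segmentation can be sketched as follows: bucket users into age bands with `pd.cut`, then compare the VIP rate per band. The band edges and labels are assumptions picked for illustration, and the table is regenerated here with a seeded generator so the block runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # seeded so the sketch is repeatable
n = 2000

# A small stand-in for the user table from Scenario 2
users = pd.DataFrame({
    "UserID": np.arange(10000, 10000 + n),
    "Age": rng.integers(18, 70, size=n),               # ages 18 to 69
    "IsVIP": rng.choice([0, 1], size=n, p=[0.85, 0.15]),
})

# Hypothetical age bands; right=False makes each band [lower, upper)
bins = [18, 25, 35, 50, 70]
labels = ["18-24", "25-34", "35-49", "50-69"]
users["AgeBand"] = pd.cut(users["Age"], bins=bins, labels=labels, right=False)

# Number of users and share of VIPs in each band
vip_rate = users.groupby("AgeBand", observed=True)["IsVIP"].agg(["count", "mean"])
print(vip_rate)
```

Because `IsVIP` is drawn independently of `Age`, the VIP share should come out roughly equal across bands; a dependency would have to be built into the generator for the segments to differ.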


✅ Scenario 3: Basic product information (category, brand, price range)

def generate_product_info(n=100):
    categories = ['Electronics', 'Clothing', 'Home', 'Beauty', 'Books']
    brands = ['BrandA', 'BrandB', 'BrandC', 'BrandD']
    data = []
    for i in range(n):
        product_id = f"P{i+100:04d}"
        category = np.random.choice(categories)
        brand = np.random.choice(brands)
        price = round(np.random.uniform(50, 1000), 2)
        stock = np.random.randint(100, 1000)
        launch_date = fake.date_between(start_date='-2y', end_date='today')
        data.append((product_id, category, brand, price, stock, launch_date))

    columns = ['ProductID', 'Category', 'Brand', 'Price', 'Stock', 'LaunchDate']
    return pd.DataFrame(data, columns=columns)

df_products = generate_product_info(100)
print(df_products.head())

Use cases: product lifecycle management, low-stock alerts, pricing strategy, and so on.
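A low-stock alert and a simple price tiering might look like the sketch below. The threshold of 200 units and the three tier names are hypothetical values chosen for the example, and the product table is rebuilt with a seeded generator so the block is self-contained.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # seeded so the sketch is repeatable
n = 100

# A small stand-in for the product table from Scenario 3
products = pd.DataFrame({
    "ProductID": [f"P{i + 100:04d}" for i in range(n)],
    "Price": rng.uniform(50, 1000, size=n).round(2),
    "Stock": rng.integers(100, 1000, size=n),
})

LOW_STOCK_THRESHOLD = 200  # hypothetical reorder point

# Flag products whose stock is below the threshold
products["LowStock"] = products["Stock"] < LOW_STOCK_THRESHOLD

# Split prices into three equally sized tiers by quantile
products["PriceTier"] = pd.qcut(products["Price"], q=3, labels=["Low", "Mid", "High"])

# Alert list, most urgent first
alerts = products.loc[products["LowStock"], ["ProductID", "Stock", "PriceTier"]].sort_values("Stock")
print(alerts.head())
```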


✅ Scenario 4: Promotion records (campaign name, dates, discount level)

def generate_promotion_data(n=20):
    channels = ['App', 'PC', 'MiniProgram']
    data = []
    for i in range(n):
        promo_id = f"PROMO{i+1:03d}"
        name = fake.catch_phrase()
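        # Faker reads a lowercase 'm' as minutes, so the six-months-back / three-months-ahead window is given in days
        start_date = fake.date_between(start_date='-180d', end_date='+90d')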
        end_date = start_date + timedelta(days=np.random.randint(3, 30))
        discount_rate = round(np.random.uniform(0.7, 0.95), 2)
        channel = np.random.choice(channels)
        sales_lift = round(np.random.uniform(1.2, 2.5), 2)
        data.append((promo_id, name, start_date, end_date, discount_rate, channel, sales_lift))

    columns = ['PromoID', 'Name', 'StartDate', 'EndDate', 'DiscountRate', 'Channel', 'SalesLift']
    return pd.DataFrame(data, columns=columns)

df_promos = generate_promotion_data(20)
print(df_promos.head())

Use cases: evaluating promotion effectiveness, ROI analysis, marketing budget optimization, and so on.
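To show how such a table could feed a promotion review, the sketch below derives each campaign's duration, marks the ones running today, and averages the sales lift per channel. It uses pandas timestamps and a seeded generator rather than Faker (an assumption made so the block runs without extra dependencies); the derived column names `DurationDays` and `IsActive` are invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)  # seeded so the sketch is repeatable
n = 20
today = pd.Timestamp.today().normalize()

# Start dates from about six months back to three months ahead
start = today + pd.to_timedelta(rng.integers(-180, 91, size=n), unit="D")

promos = pd.DataFrame({
    "PromoID": [f"PROMO{i + 1:03d}" for i in range(n)],
    "StartDate": start,
    "EndDate": start + pd.to_timedelta(rng.integers(3, 30, size=n), unit="D"),
    "Channel": rng.choice(["App", "PC", "MiniProgram"], size=n),
    "SalesLift": rng.uniform(1.2, 2.5, size=n).round(2),
})

# Campaign length in days, and whether the campaign covers today
promos["DurationDays"] = (promos["EndDate"] - promos["StartDate"]).dt.days
promos["IsActive"] = (promos["StartDate"] <= today) & (today <= promos["EndDate"])

# Average sales lift per channel
lift_by_channel = promos.groupby("Channel")["SalesLift"].mean().round(2)
print(promos[["PromoID", "DurationDays", "IsActive"]].head())
print(lift_by_channel)
```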


✅ Scenario 5: Daily visit logs (user ID, visit time, time spent on page)

def generate_visit_log(n=5000):
    pages = ['/home', '/product', '/cart', '/checkout', '/profile']
    data = []
    for _ in range(n):
        user_id = np.random.randint(10000, 99999)
        visit_time = fake.date_time_between(start_date='-1y', end_date='now')
        page = np.random.choice(pages)
        duration = np.random.exponential(5)  # exponential distribution with a mean of 5 minutes
        device = np.random.choice(['Mobile', 'Desktop'], p=[0.7, 0.3])
        data.append((user_id, visit_time, page, round(duration, 2), device))

    columns = ['UserID', 'VisitTime', 'Page', 'DurationMin', 'Device']
    return pd.DataFrame(data, columns=columns)

df_visits = generate_visit_log(2000)
print(df_visits.head())

Use cases: user behavior analysis, conversion funnel modeling, page optimization suggestions, and so on.
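A rough starting point for funnel work is to count how many distinct users reached each page and compare consecutive steps. The sketch below does that on a regenerated visit table; the page order in `funnel_pages` and the decreasing page probabilities are assumptions for illustration. Note that pages are drawn independently here, so this measures page reach rather than a true ordered session funnel.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)  # seeded so the sketch is repeatable
n = 5000

# Assumed funnel order, from landing page to checkout
funnel_pages = ["/home", "/product", "/cart", "/checkout"]

# A small stand-in for the visit log from Scenario 5, with 500 possible users
visits = pd.DataFrame({
    "UserID": rng.integers(10000, 10500, size=n),
    "Page": rng.choice(funnel_pages, size=n, p=[0.4, 0.3, 0.2, 0.1]),
})

# Distinct users who reached each page, listed in funnel order
users_per_page = visits.groupby("Page")["UserID"].nunique().reindex(funnel_pages)

funnel = pd.DataFrame({"Users": users_per_page})
# Ratio of each step to the previous one (undefined for the first step)
funnel["StepRate"] = (funnel["Users"] / funnel["Users"].shift(1)).round(3)
print(funnel)
```

For a realistic funnel, the generator would need to simulate sessions in which a user can reach `/cart` only after `/product`, and so on.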


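✅ Scenarios 6-10: brief overview (the full code can be expanded in Jupyter)

| Scenario | Name | Description |
|----------|------|-------------|
| 6 | Browsing record data | Browsing-record ID, customer ID, product ID, view time, and IP address |
| 7 | Shopping cart data | Cart item ID, customer ID, product ID, time added, and time removed |
| 8 | Promotion data | Promotion ID, name, start date, end date, and discount rate |
| 9 | User review data | Review ID, customer ID, product ID, rating, and comment text |
| 10 | Logistics data | Shipment ID, order ID, courier ID, dispatch date, estimated delivery date, and actual delivery date |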

Scenario 6: Browsing record data

Description: Generates a browsing-record dataset with a record ID, customer ID, product ID, view time, and IP address.

import ipaddress

np.random.seed(42)
num_views = 2000
view_ids = range(1, num_views + 1)
customer_ids = np.random.choice(range(1, 501), size=num_views)
product_ids = np.random.choice(['P{:03d}'.format(i+1) for i in range(100)], size=num_views)
view_times = [pd.Timestamp.now().normalize() + pd.Timedelta(minutes=np.random.randint(0, 1440)) for _ in range(num_views)]
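# dtype=np.int64 keeps 2**32 within range on every platform; int() gives ipaddress the plain integer it expects
ip_addresses = [str(ipaddress.IPv4Address(int(np.random.randint(0, 2**32, dtype=np.int64)))) for _ in range(num_views)]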

views_df = pd.DataFrame({
    'ViewID': view_ids,
    'CustomerID': customer_ids,
    'ProductID': product_ids,
    'ViewTime': view_times,
    'IPAddress': ip_addresses
})

print(views_df.head())

Scenario 7: Shopping cart data

Description: Generates a shopping-cart dataset with a cart item ID, customer ID, product ID, time added, and time removed.

np.random.seed(42)
num_cart_items = 1500
cart_item_ids = range(1, num_cart_items + 1)
customer_ids = np.random.choice(range(1, 501), size=num_cart_items)
product_ids = np.random.choice(['P{:03d}'.format(i+1) for i in range(100)], size=num_cart_items)
add_times = [pd.Timestamp.now().normalize() + pd.Timedelta(minutes=np.random.randint(0, 1440)) for _ in range(num_cart_items)]
remove_times = [(t + pd.Timedelta(minutes=np.random.randint(0, 1440))) if np.random.rand() > 0.3 else None for t in add_times]

carts_df = pd.DataFrame({
    'CartItemID': cart_item_ids,
    'CustomerID': customer_ids,
    'ProductID': product_ids,
    'AddTime': add_times,
    'RemoveTime': remove_times
})

print(carts_df.head())

Scenario 8: Promotion data

Description: Generates a promotion dataset with a promotion ID, name, start date, end date, and discount rate.

from datetime import date, timedelta  # `date` is needed for date.today() below

np.random.seed(42)
num_promotions = 20
promotion_ids = range(1, num_promotions + 1)
names = ['Promotion {}'.format(i+1) for i in range(num_promotions)]
start_dates = [date.today() - timedelta(days=np.random.randint(0, 365)) for _ in range(num_promotions)]
end_dates = [sd + timedelta(days=np.random.randint(1, 30)) for sd in start_dates]
discount_rates = np.random.uniform(0.05, 0.5, size=num_promotions).round(2)

promotions_df = pd.DataFrame({
    'PromotionID': promotion_ids,
    'Name': names,
    'StartDate': start_dates,
    'EndDate': end_dates,
    'DiscountRate': discount_rates
})

print(promotions_df.head())

Scenario 9: User review data

Description: Generates a user-review dataset with a review ID, customer ID, product ID, rating, and comment text.

np.random.seed(42)
num_reviews = 500
review_ids = range(1, num_reviews + 1)
customer_ids = np.random.choice(range(1, 501), size=num_reviews)
product_ids = np.random.choice(['P{:03d}'.format(i+1) for i in range(100)], size=num_reviews)
ratings = np.random.randint(1, 6, size=num_reviews)
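# Sample one comment per review so the column length always matches num_reviews
comment_pool = ['This is a great product!', 'Not what I expected.', 'Average quality.', 'Highly recommend!', 'Terrible experience.']
comments = np.random.choice(comment_pool, size=num_reviews)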

reviews_df = pd.DataFrame({
    'ReviewID': review_ids,
    'CustomerID': customer_ids,
    'ProductID': product_ids,
    'Rating': ratings,
    'Comment': comments
})

print(reviews_df.head())

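Scenario 10: Logistics data

Description: Generates a logistics dataset with a shipment ID, order ID, courier ID, dispatch date, estimated delivery date, and actual delivery date.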

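from datetime import date, timedelta  # `date` is needed for date.today() below

np.random.seed(42)
num_shipments = 1000
shipment_ids = range(1, num_shipments + 1)
order_ids = np.random.choice(range(1, 1001), size=num_shipments)
courier_ids = np.random.choice(range(1, 101), size=num_shipments)
dispatch_dates = [date.today() - timedelta(days=np.random.randint(0, 365)) for _ in range(num_shipments)]
estimated_delivery_dates = [dd + timedelta(days=np.random.randint(1, 10)) for dd in dispatch_dates]
# About 80% of shipments arrive up to 2 days early or late (randint's upper bound is exclusive, hence 3);
# max() keeps the actual delivery from falling before the dispatch date
actual_delivery_dates = [
    max(dd, ed + timedelta(days=np.random.randint(-2, 3))) if np.random.rand() > 0.2 else ed
    for dd, ed in zip(dispatch_dates, estimated_delivery_dates)
]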

logistics_df = pd.DataFrame({
    'ShipmentID': shipment_ids,
    'OrderID': order_ids,
    'CourierID': courier_ids,
    'DispatchDate': dispatch_dates,
    'EstimatedDeliveryDate': estimated_delivery_dates,
    'ActualDeliveryDate': actual_delivery_dates
})

print(logistics_df.head())

Each scenario above comes with a detailed code example that you can adjust or extend to fit your own business needs.
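One extension worth considering is keeping IDs consistent across tables, so that the generated orders can actually be joined to users and products. The sketch below is one possible approach, not something the templates above already do: it draws each order's `UserID` and `ProductID` from the ID pools of small user and product tables, then joins them with `DataFrame.merge`. The table sizes and the name `enriched` are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2024)  # seeded so the sketch is repeatable

# Dimension tables: their ID columns act as the pools that orders draw from
users = pd.DataFrame({
    "UserID": np.arange(10000, 10200),
    "Region": rng.choice(["East", "South", "North", "West"], size=200),
})
products = pd.DataFrame({
    "ProductID": [f"P{i + 100:04d}" for i in range(50)],
    "Price": rng.uniform(50, 1000, size=50).round(2),
})

# Orders reference only IDs that exist in the tables above
n_orders = 1000
orders = pd.DataFrame({
    "OrderID": [f"ORD{i:06d}" for i in range(n_orders)],
    "UserID": rng.choice(users["UserID"].to_numpy(), size=n_orders),
    "ProductID": rng.choice(products["ProductID"].to_numpy(), size=n_orders),
    "Quantity": rng.integers(1, 5, size=n_orders),
})

# validate="many_to_one" raises an error if a key is duplicated on the right side
enriched = (
    orders.merge(users, on="UserID", how="left", validate="many_to_one")
          .merge(products, on="ProductID", how="left", validate="many_to_one")
)
enriched["TotalAmount"] = (enriched["Quantity"] * enriched["Price"]).round(2)
print(enriched.head())
```

With this setup every order finds a matching user and product, so regional revenue or per-product sales can be computed from a single joined table.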