从零开始训练一个神经网络之三：设计和训练预测金价走势的深度学习模型

2024-08-15

字数统计: 7.6k字 | 阅读时长≈ 31分

当前步骤的主要任务目标是数据集划分，其目的是将经过预处理和特征工程生成的数据划分为不同的子集，以便于模型的训练、验证和测试，进一步进行模型设计与训练。具体目标如下：

1. 生成适合时间序列模型的特征和标签

滑动窗口生成：通过滑动窗口方法从时间序列数据中生成特征和标签，使模型能够捕捉时间序列中的模式和趋势。
标签创建：生成的标签对应于未来若干天的金价变化，目的是让模型能够预测这些未来的值。

2. 归一化/标准化数据

数据归一化/标准化：对数据进行归一化或标准化处理，确保模型在训练过程中更快收敛，并且不同特征之间的数值差异不会导致模型性能的不稳定。

3. 划分训练集、验证集和测试集

训练集：用于模型的训练，占数据的主要部分。目的是通过这一部分数据调整模型参数，使模型能够学习到数据的模式。
验证集：用于模型的超参数调优和早停（Early Stopping），帮助确定模型的最佳配置并防止过拟合。
测试集：用于最终评估模型性能，确保模型在未见过的数据上能够良好地泛化。

为什么这些步骤重要？

时间序列的特性：时间序列数据有其独特的模式和依赖关系，直接使用传统的随机划分方法可能破坏这些模式。因此，采用滑动窗口等方法能够保持时间序列的连续性。
数据归一化/标准化：归一化和标准化可以帮助模型更好地理解和处理不同量级的数据，防止某些特征由于值的范围过大而主导模型的学习过程。
数据集划分：适当划分数据集可以有效地评估模型的泛化能力，防止过拟合，提高模型在实际应用中的表现。

最终目标

通过完成这一步，将准备好高质量的训练集、验证集和测试集，这些子集将为下一步的模型设计与训练奠定坚实的基础。模型训练的效果将很大程度上取决于数据集的划分质量，因此这是一个至关重要的步骤。

让我们开始吧！

具体来说：

1. 数据加载与清理

为什么这么做？

在处理金融数据时，首先要确保数据的完整性和一致性。金融数据通常包含多余的列、格式不一致的数值（如带有逗号的字符串形式的数值）、时间序列不连续或未排序的情况。清理数据的目的是确保接下来的特征生成和模型训练能在干净的数据上进行，以保证结果的准确性。

如何设计？

数据加载后，首先对列名进行标准化处理，去除空白和不必要的说明，确保列名的一致性。然后，数据中的日期列被转换为datetime类型，以便后续的时间序列分析。接着，对数值列进行处理，将带有逗号的字符串形式数值转换为浮点数。此外，删除不相关的列以简化数据。

如何实现？

利用pandas库中的read_csv函数加载数据，并使用str.replace()和astype(float)方法将字符串数值转换为浮点数。删除不必要的列，并按日期排序。

2. 特征生成

特征工程是提升模型性能的关键步骤。通过生成技术指标（如移动平均线、相对强弱指数、MACD、布林带、ATR等），我们可以提取和增强时间序列中的趋势、波动性、动量等信息，这些信息有助于模型更好地理解市场行为并进行预测。

如何设计？

选取常用的技术指标（如MA、RSI、MACD、布林带、ATR），每个指标都针对不同的市场特性进行设计，帮助模型识别市场中的模式和趋势。

使用pandas的内置函数进行技术指标计算。例如，rolling方法用于计算移动平均线，ewm方法用于计算指数移动平均线。通过封装函数如compute_RSI()、compute_bollinger_bands()和compute_atr()，生成各类技术指标。

3. 数据归一化

为什么这么做？

归一化是为了使数据特征的数值范围相似，避免某些特征因数值范围过大而主导模型的学习过程。归一化还可以加速模型训练的收敛速度。

目前脚本提供了多种归一化方法（如MinMaxScaler和StandardScaler），用户可以根据需求选择不同的归一化策略。一般来说，MinMaxScaler将数据缩放到[0, 1]区间，而StandardScaler将数据转换为零均值和单位方差的分布。

使用sklearn.preprocessing中的MinMaxScaler和StandardScaler进行数据归一化，并返回归一化后的数据和对应的scaler对象。

归一化过程确保不同特征对模型的影响是均衡的，从而提高模型的学习能力和稳定性。

4. 滑动窗口处理

为什么这么做？

滑动窗口是处理时间序列数据的一种常用方法，它能够将时间序列数据转化为适合深度学习模型输入的特征矩阵。通过滑动窗口，可以捕捉时间序列中的模式和依赖关系。

通过指定窗口大小和预测范围，将时间序列数据转换为适合LSTM、GRU等模型输入的特征和标签对。每个滑动窗口生成一段时间序列，作为模型的输入特征，标签则是后续几天的目标值。

如何实现？

使用一个循环遍历整个时间序列数据，根据窗口大小提取连续的时间序列段作为特征，并将预测范围内的目标值作为标签。

滑动窗口技术将时间序列数据转化为多个短期序列，使得深度学习模型能够在局部时间内识别模式和依赖关系。

5. 数据集划分

为了评估模型的性能并避免过拟合，我们需要将数据划分为训练集、验证集和测试集。训练集用于模型训练，验证集用于调参和避免过拟合，测试集用于最终评估模型的泛化能力。

根据用户指定的比例，将数据集划分为训练集、验证集和测试集，保证训练、验证和测试集之间不存在重叠，确保评估的公平性。

如何实现？

使用Python的数组切片功能，将特征和标签数据按比例划分为训练、验证和测试集。

通过数据集划分，可以评估模型在未见过的数据上的表现，从而判断模型的泛化能力和实际应用效果。

实现步骤

包括数据预处理、特征生成、滑动窗口处理、数据集划分，以及数据归一化和标准化等步骤。为了进一步提高灵活性，代码中包括更多的参数选项，允许用户自定义处理流程。

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
import logging

# 配置日志记录
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


def load_and_clean_data(file_path, drop_columns=None):
    """
    加载数据并进行初步清理，包括处理列名、转换数据类型、删除无关列。

    参数:
    file_path - CSV文件路径
    drop_columns - 要删除的列列表 (list)

    返回:
    清理后的DataFrame
    """
    # 加载数据
    data = pd.read_csv(file_path)
    logging.info("数据加载完成")
    logging.info(f"原始数据包含 {data.shape[0]} 行，{data.shape[1]} 列")

    # 清理列名：去除多余空白和说明
    data.columns = data.columns.str.strip().str.split().str[0]

    # 转换日期列
    data['Date'] = pd.to_datetime(data['Date'])

    # 转换数值列（移除逗号并转换为浮点数）
    for col in ['Open', 'High', 'Low', 'Close']:
        data[col] = data[col].str.replace(',', '').astype(float)

    # 删除无关列
    if drop_columns:
        data = data.drop(columns=drop_columns)

    # 按日期排序
    data = data.sort_values('Date')

    logging.info("数据清理完成")
    logging.info(f"清理后的数据包含 {data.shape[0]} 行，{data.shape[1]} 列")
    logging.info(f"数据预览:\n{data.head()}")

    return data


def generate_technical_indicators(data, indicators=None):
    """
    生成技术指标。

    参数:
    data - 输入的DataFrame
    indicators - 要生成的技术指标列表

    返回:
    包含技术指标的DataFrame
    """
    if indicators is None:
        indicators = ['MA', 'RSI', 'MACD', 'Bollinger', 'ATR']

    logging.info(f"生成技术指标: {indicators}")

    if 'MA' in indicators:
        data['MA_5'] = data['Close'].rolling(window=5).mean()
        data['MA_10'] = data['Close'].rolling(window=10).mean()

    if 'RSI' in indicators:
        data['RSI_14'] = compute_RSI(data['Close'], window=14)

    if 'MACD' in indicators:
        short_ema = data['Close'].ewm(span=12, adjust=False).mean()
        long_ema = data['Close'].ewm(span=26, adjust=False).mean()
        data['MACD'] = short_ema - long_ema

    if 'Bollinger' in indicators:
        data['BB_upper'], data['BB_lower'] = compute_bollinger_bands(data['Close'], window=20)

    if 'ATR' in indicators:
        data['ATR'] = compute_atr(data, window=14)

    # 移除因滚动计算产生的NaN值
    data = data.dropna()

    logging.info("技术指标生成完成")
    return data


def compute_RSI(series, window=14):
    """
    计算相对强弱指数 (RSI)。
    """
    diff = series.diff(1).dropna()
    gain = 0.5 * (diff + abs(diff))
    loss = 0.5 * (diff - abs(diff))
    avg_gain = gain.rolling(window=window, min_periods=1).mean()
    avg_loss = abs(loss.rolling(window=window, min_periods=1).mean())
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    return rsi


def compute_bollinger_bands(series, window=20):
    """
    计算布林带的上下轨道。
    """
    sma = series.rolling(window=window).mean()
    std = series.rolling(window=window).std()
    upper_band = sma + (std * 2)
    lower_band = sma - (std * 2)
    return upper_band, lower_band


def compute_atr(data, window=14):
    """
    计算平均真实波动幅度 (ATR)。
    """
    high_low = data['High'] - data['Low']
    high_close = np.abs(data['High'] - data['Close'].shift())
    low_close = np.abs(data['Low'] - data['Close'].shift())
    tr = high_low.combine(high_close, max).combine(low_close, max)
    atr = tr.rolling(window=window, min_periods=1).mean()
    return atr


def normalize_data(data, method='minmax'):
    """
    归一化或标准化数据。

    参数:
    data - 输入的DataFrame
    method - 归一化或标准化方法 ('minmax' 或 'standard')

    返回:
    归一化或标准化后的DataFrame，及Scaler对象
    """
    if method == 'minmax':
        scaler = MinMaxScaler()
    elif method == 'standard':
        scaler = StandardScaler()
    else:
        raise ValueError("Unsupported normalization method")

    scaled_data = scaler.fit_transform(data)
    logging.info(f"数据归一化/标准化完成，方法: {method}")
    return pd.DataFrame(scaled_data, columns=data.columns), scaler


def create_sliding_windows(data, window_size, prediction_horizon):
    """
    通过滑动窗口生成特征和标签序列。

    参数:
    data - 输入的DataFrame
    window_size - 滑动窗口的大小
    prediction_horizon - 预测的时间范围

    返回:
    特征矩阵和标签向量
    """
    X, y = [], []
    for i in range(len(data) - window_size - prediction_horizon):
        X.append(data[i:(i + window_size)].values)
        y.append(data[(i + window_size):(i + window_size + prediction_horizon)]['Close'].values)
    logging.info(f"滑动窗口生成特征和标签，窗口大小: {window_size}, 预测范围: {prediction_horizon}")
    return np.array(X), np.array(y)


def split_data(X, y, train_size=0.7, validation_size=0.1):
    """
    将数据划分为训练集、验证集和测试集。

    参数:
    X - 特征矩阵
    y - 标签向量
    train_size - 训练集比例
    validation_size - 验证集比例

    返回:
    分割后的数据集
    """
    total_size = X.shape[0]
    train_end = int(total_size * train_size)
    val_end = train_end + int(total_size * validation_size)

    X_train, X_val, X_test = X[:train_end], X[train_end:val_end], X[val_end:]
    y_train, y_val, y_test = y[:train_end], y[train_end:val_end], y[val_end:]

    logging.info(f"数据集划分完成: 训练集大小={X_train.shape}, 验证集大小={X_val.shape}, 测试集大小={X_test.shape}")
    return X_train, X_val, X_test, y_train, y_val, y_test


def main(file_path, window_size=10, prediction_horizon=3, normalization_method='minmax',
         train_size=0.7, validation_size=0.1, drop_columns=None, indicators=None):
    """
    综合数据加载、清理、特征生成、归一化、滑动窗口处理和数据集划分的一体化流程。

    参数:
    file_path - 数据文件路径
    window_size - 滑动窗口大小
    prediction_horizon - 预测时间范围
    normalization_method - 归一化或标准化方法 ('minmax' 或 'standard')
    train_size - 训练集比例
    validation_size - 验证集比例
    drop_columns - 要删除的列列表
    indicators - 要生成的技术指标列表

    返回:
    分割后的训练集、验证集、测试集
    """
    # 1. 加载和清理数据
    data = load_and_clean_data(file_path, drop_columns)

    # 2. 生成技术指标
    data = generate_technical_indicators(data, indicators)

    # 3. 创建目标变量 (Target)
    data['Target'] = data['Close'].shift(-prediction_horizon) - data['Close']
    data = data.dropna()  # 删除因目标列创建而产生的NaN值

    # 4. 归一化或标准化数据
    normalized_data, scaler = normalize_data(data.drop(columns=['Date', 'Target']), method=normalization_method)

    # 5. 生成滑动窗口特征和标签
    X, y = create_sliding_windows(normalized_data, window_size, prediction_horizon)

    # 6. 数据集划分
    X_train, X_val, X_test, y_train, y_val, y_test = split_data(X, y, train_size, validation_size)

    logging.info(f"训练集大小: {X_train.shape}, 验证集大小: {X_val.shape}, 测试集大小: {X_test.shape}")

    return X_train, X_val, X_test, y_train, y_val, y_test


# 调用主函数
if __name__ == "__main__":
    X_train, X_val, X_test, y_train, y_val, y_test = main(
        file_path='gold_price_data_extracted.csv',
        window_size=10,
        prediction_horizon=3,
        normalization_method='minmax',
        train_size=0.7,
        validation_size=0.1,
        drop_columns=['Adj', 'Volume'],  # 示例：删除不需要的列
        indicators=['MA', 'RSI', 'MACD', 'Bollinger', 'ATR']  # 选择要生成的技术指标
    )

QQ_1723658243436

这样就初步完成了数据清理和特征工程、数据划分等；

进入下一步：

我们将重点放在模型的设计与训练上，尤其是在处理时间序列数据时，LSTM（长短期记忆网络）和GRU（门控循环单元）等深度学习模型是常用的选择。以下是将要进行的步骤：

1. 模型选择与设计

设计一个LSTM模型，LSTM模型擅长处理具有长期依赖关系的时间序列数据。LSTM模型可以记住长期的模式和趋势，因此非常适合用于预测未来的金价。

模型设计的思路：

输入层：处理滑动窗口生成的时间序列数据。
LSTM层：捕捉时间序列中的长期依赖关系。
Dropout层：防止过拟合。
全连接层：将LSTM的输出映射到预测值。
输出层：输出未来的金价变化。

2. 模型训练

在训练模型时，我们会使用已经划分的训练集，并在验证集上进行超参数调优，以防止过拟合。我们还可以设置早停机制来自动终止训练，当验证集的损失不再下降时。

训练的步骤：

定义损失函数：我们会选择均方误差（MSE）作为损失函数，这是回归问题中常用的损失函数。
选择优化器：使用Adam优化器，因为它能够自适应地调整学习率，通常表现较好。
设置训练轮数和批次大小：这些超参数可以在训练过程中进行调优。

3. 模型评估

在训练完成后，我们会在测试集上评估模型的性能，使用MSE、RMSE（均方根误差）等指标来衡量预测的准确性。

4. 模型保存与加载

为了便于模型的重用和部署，我们可以将训练好的模型保存到文件中，并提供加载模型的接口，以便在实际预测中使用。

实现步骤

模型设计与编译
模型训练
模型评估
模型保存与加载

如果在训练过程中没有输出或图表显示，我们还可以添加一些日志或检查evaluate_model函数中是否正确显示了测试结果。

检查模型性能

可以使用以下方法来验证和分析模型的性能：

检查损失值：在训练和验证过程中，损失值应该随着训练轮数的增加而减少。如果设置了EarlyStopping，训练可能会提前停止，这说明模型已经达到最优状态。
绘制损失曲线：查看训练和验证的损失曲线，确保没有发生过拟合（即验证损失明显高于训练损失）。
评估测试集性能：通过测试集上的均方误差（MSE）或均方根误差（RMSE）来衡量模型的泛化能力。
预测 vs. 真实值的对比：查看模型预测结果与真实值的对比图表，直观地理解模型的预测效果。

代码

import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from data_solve import main as data_preprocessing_main  # 导入数据预处理的主函数
from matplotlib import font_manager

# 配置Matplotlib使用中文字体，确保图表中的中文字符正常显示
plt.rcParams['font.sans-serif'] = ['SimHei']  # 使用黑体
plt.rcParams['axes.unicode_minus'] = False  # 确保负号显示正常

def build_lstm_model(input_shape, units=50, dropout_rate=0.2):
    """
    构建并编译LSTM模型。

    参数:
    - input_shape: 输入数据的形状 (time_steps, num_features)
    - units: LSTM单元的数量
    - dropout_rate: Dropout层的比例，用于防止过拟合

    返回:
    - 编译好的LSTM模型
    """
    model = Sequential([
        LSTM(units=units, return_sequences=True, input_shape=input_shape),
        Dropout(dropout_rate),
        LSTM(units=units, return_sequences=False),
        Dropout(dropout_rate),
        Dense(units=3)  # 输出3个时间步的预测
    ])
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

def train_model(model, X_train, y_train, X_val, y_val, epochs=50, batch_size=32):
    """
    训练LSTM模型，并返回训练历史。

    参数:
    - model: 编译好的LSTM模型
    - X_train, y_train: 训练数据及其标签
    - X_val, y_val: 验证数据及其标签
    - epochs: 训练轮数
    - batch_size: 批次大小

    返回:
    - 训练历史 (history) 对象
    """
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=10, restore_best_weights=True
    )
    history = model.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=epochs, batch_size=batch_size,
        callbacks=[early_stopping]
    )
    return history

def plot_training_history(history):
    """
    绘制训练和验证损失随训练轮次的变化曲线。

    参数:
    - history: 训练历史对象
    """
    plt.figure(figsize=(10, 6))
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title('训练和验证损失随训练轮次的变化')
    plt.xlabel('训练轮次')
    plt.ylabel('损失')
    plt.legend()
    plt.show()

def evaluate_model(model, X_test, y_test):
    """
    评估模型在测试集上的表现，并绘制预测值与真实值的对比图。

    参数:
    - model: 训练好的LSTM模型
    - X_test, y_test: 测试数据及其标签

    返回:
    - 测试集上的均方误差 (MSE)
    """
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    print(f"测试集上的均方误差: {mse}")

    plt.figure(figsize=(14, 7))
    for i in range(predictions.shape[1]):  # 针对每个时间步绘制预测和真实值的对比图
        plt.plot(y_test[:, i], label=f'真实值 - 时间步 {i+1}')
        plt.plot(predictions[:, i], label=f'预测值 - 时间步 {i+1}')
    plt.title('金价预测 vs 真实值')
    plt.xlabel('时间步')
    plt.ylabel('金价')
    plt.legend()
    plt.show()

    return mse

def save_model(model, file_path='lstm_gold_price_model.h5'):
    """
    保存训练好的模型。

    参数:
    - model: 训练好的LSTM模型
    - file_path: 保存模型的路径
    """
    model.save(file_path)
    print(f"模型已保存到 {file_path}")

def load_model(file_path='lstm_model.h5'):
    """
    加载保存的模型。

    参数:
    - file_path: 模型文件路径

    返回:
    - 加载的LSTM模型
    """
    model = tf.keras.models.load_model(file_path)
    print(f"模型已从 {file_path} 加载")
    return model

if __name__ == "__main__":
    # 数据预处理: 生成训练集、验证集和测试集
    X_train, X_val, X_test, y_train, y_val, y_test = data_preprocessing_main(
        file_path='gold_price_data_extracted.csv',
        window_size=10,
        prediction_horizon=3,
        normalization_method='minmax',
        train_size=0.7,
        validation_size=0.1,
        drop_columns=['Adj', 'Volume'],
        indicators=['MA', 'RSI', 'MACD', 'Bollinger', 'ATR']
    )

    # 构建LSTM模型
    input_shape = (X_train.shape[1], X_train.shape[2])
    model = build_lstm_model(input_shape=input_shape, units=50, dropout_rate=0.2)

    # 训练LSTM模型
    history = train_model(model, X_train, y_train, X_val, y_val, epochs=50, batch_size=32)

    # 绘制训练历史
    plot_training_history(history)

    # 评估模型
    mse = evaluate_model(model, X_test, y_test)

    # 保存模型
    save_model(model, 'lstm_gold_price_model.h5')

    # 测试加载模型
    loaded_model = load_model('lstm_gold_price_model.h5')

输出结果：

预测的数值与真实值十分接近！

接下来可以做什么？

进一步调优模型：如果模型的性能不如预期(实际上已经绝佳了)，可以考虑调整LSTM的层数、神经元数量、学习率、批次大小等超参数。
尝试其他模型：除了LSTM，你还可以尝试其他模型（如GRU或卷积神经网络）来进一步提升预测效果。
部署模型：如果你计划在实际应用中使用这个模型，可以考虑将其部署为API或集成到你的应用程序中。

让我们一步一步的来分析，首先，上面我们主要使用了主流的一种LSTM算法作为模型训练的方法，但是实际上还有许多其他的模型训练方法也是可以套用的，但是每种模型训练的方法对应的训练步数和超参数调优都是存在区别的，因此，我们是否可以设计一种策略，让机器自己决定使用哪种算法训练的模型，并比较出最佳精度的模型作为我们的目标模型？

大致思路和步骤如下：

1. 自动模型选择与训练流程

将以下几个算法纳入选择范围：

LSTM（长短期记忆网络）
GRU（门控循环单元）
CNN（卷积神经网络）
XGBoost（极端梯度提升）

为了自动化模型选择与训练，可以使用 GridSearchCV（网格搜索）或 RandomizedSearchCV（随机搜索）来优化模型的超参数，包括选择不同的算法和训练步数。

2. 如何工作

模型类型选择：可以在 model_types 列表中添加更多的模型类型，例如 'LSTM', 'GRU', 'CNN'。
自动化模型训练和评估：model_search 函数将自动遍历给定的模型类型，训练每种模型，并通过在验证集上的表现来评估每种模型的性能。
最佳模型保存：经过训练后，脚本会自动保存表现最好的模型。

脚本思路：

1. 导入必要的库

TensorFlow & Keras：用于构建和训练深度学习模型，如 LSTM、GRU 和 CNN。
Scikit-learn：用于传统机器学习模型（如随机森林、SVM、MLP）的构建和评估，以及模型评估指标的计算。
XGBoost：用于构建和训练 XGBoost 模型，这是一个非常流行的基于决策树的集成算法。
Matplotlib：用于绘制模型预测结果与真实值的对比图，以直观地展示模型性能。
Numpy：用于数据处理，特别是对多维数组的操作。
Data_solve 模块：导入数据预处理的主函数，该函数用于生成训练集、验证集和测试集。

2. 配置 Matplotlib

为了确保在绘制图表时中文能够正常显示，配置了 Matplotlib 的字体设置。这一步是为了支持中文字符的正常显示，避免乱码问题。

3. 模型构建函数 (build_model)

该函数用于构建并返回不同类型的模型，具体包括：

LSTM（长短期记忆网络）
GRU（门控循环单元）
CNN（卷积神经网络）
XGBoost（极端梯度提升）
RandomForest（随机森林）
SVM（支持向量机）
MLP（多层感知机）

对于神经网络模型，函数使用 Keras Sequential API 构建模型，并在最后一层输出 3 个时间步的预测。对于传统机器学习模型（如 XGBoost、RandomForest、SVM、MLP），则使用 Scikit-learn 或 XGBoost 库创建每个时间步的独立模型。

4. 模型评估函数 (evaluate_model)

该函数用于评估模型在测试集上的性能，并绘制预测值与真实值的对比图。具体步骤如下：

对于神经网络模型，直接调用模型的 predict 方法获取预测值。
对于传统机器学习模型，先将输入数据展平成二维数组，然后分别预测每个时间步的值。
计算均方误差（MSE）作为模型性能的评估指标，并通过 Matplotlib 绘制每个时间步的预测值与真实值对比图。

5. 模型搜索函数 (model_search)

该函数负责在多种模型类型之间进行搜索，找到最适合的模型。具体步骤包括：

循环遍历每种模型类型（如 LSTM、GRU、CNN、XGBoost、RandomForest、SVM、MLP）。
为每种模型类型构建模型，并在训练集和验证集上进行训练（对于神经网络模型）或单独训练每个时间步的模型（对于传统机器学习模型）。
使用测试集对模型进行评估，记录每个模型的均方误差（MSE）。
最后，选择 MSE 最小的模型作为最佳模型，并保存模型类型和性能指标。

6. 主函数

主函数执行以下步骤：

调用数据预处理模块，生成训练集、验证集和测试集。
调用 model_search 函数，搜索并训练最佳模型。
将最佳模型保存为文件，以便后续加载和使用。

详细脚本实现：

import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split, ParameterGrid
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, Conv1D, Dense, Dropout, Flatten
import numpy as np
from data_solve import main as data_preprocessing_main

# 配置Matplotlib使用中文字体，确保图表中的中文字符正常显示
plt.rcParams['font.sans-serif'] = ['SimHei']  # 使用黑体
plt.rcParams['axes.unicode_minus'] = False  # 确保负号显示正常


def build_model(model_type, input_shape=None, units=50, dropout_rate=0.2):
    """
    构建并编译模型或实例化机器学习算法。

    参数:
    - model_type: 模型类型 ('LSTM', 'GRU', 'CNN', 'XGBoost', 'RandomForest', 'SVM', 'MLP')
    - input_shape: 输入数据的形状 (time_steps, num_features)，仅适用于神经网络
    - units: 神经元数量，适用于神经网络
    - dropout_rate: Dropout层的比例，用于防止过拟合

    返回:
    - 编译好的模型或实例化好的机器学习算法
    """
    if model_type in ['LSTM', 'GRU', 'CNN']:
        model = Sequential()
        if model_type == 'LSTM':
            model.add(LSTM(units=units, return_sequences=True, input_shape=input_shape))
            model.add(Dropout(dropout_rate))
            model.add(LSTM(units=units, return_sequences=False))
        elif model_type == 'GRU':
            model.add(GRU(units=units, return_sequences=True, input_shape=input_shape))
            model.add(Dropout(dropout_rate))
            model.add(GRU(units=units, return_sequences=False))
        elif model_type == 'CNN':
            model.add(Conv1D(filters=64, kernel_size=2, activation='relu', input_shape=input_shape))
            model.add(Dropout(dropout_rate))
            model.add(Flatten())
        model.add(Dropout(dropout_rate))
        model.add(Dense(units=3))  # 输出3个时间步的预测
        model.compile(optimizer='adam', loss='mean_squared_error')
        return model
    elif model_type == 'XGBoost':
        return [XGBRegressor(objective='reg:squarederror') for _ in range(3)]
    elif model_type == 'RandomForest':
        return [RandomForestRegressor(n_estimators=100, random_state=42) for _ in range(3)]
    elif model_type == 'SVM':
        return [SVR(kernel='rbf') for _ in range(3)]
    elif model_type == 'MLP':
        return [MLPRegressor(hidden_layer_sizes=(units, units), max_iter=500) for _ in range(3)]
    else:
        raise ValueError(f"未知的模型类型: {model_type}")


def evaluate_model(model, X_test, y_test, model_type):
    """
    评估模型在测试集上的表现，并绘制预测值与真实值的对比图。

    参数:
    - model: 训练好的模型
    - X_test, y_test: 测试数据及其标签
    - model_type: 模型类型，决定如何处理数据输入

    返回:
    - 测试集上的均方误差 (MSE)
    """
    if model_type in ['LSTM', 'GRU', 'CNN']:
        predictions = model.predict(X_test)
    else:
        X_test_flat = X_test.reshape(X_test.shape[0], -1)
        predictions = np.zeros(y_test.shape)
        for i in range(3):
            predictions[:, i] = model[i].predict(X_test_flat)

    mse = mean_squared_error(y_test, predictions)
    print(f"测试集上的均方误差: {mse}")

    plt.figure(figsize=(14, 7))
    for i in range(predictions.shape[1]):  # 针对每个时间步绘制预测和真实值的对比图
        plt.plot(y_test[:, i], label=f'真实值 - 时间步 {i + 1}')
        plt.plot(predictions[:, i], label=f'预测值 - 时间步 {i + 1}')
    plt.title(f'金价预测 vs 真实值 ({model_type})')
    plt.xlabel('时间步')
    plt.ylabel('金价')
    plt.legend()
    plt.show()

    return mse


def model_search(X_train, y_train, X_val, y_val, X_test, y_test,
                 model_types=['LSTM', 'GRU', 'CNN', 'XGBoost', 'RandomForest', 'SVM', 'MLP'], epochs=50, batch_size=32):
    """
    搜索最佳模型类型、训练步数，并评估模型性能。

    参数:
    - X_train, y_train: 训练数据及其标签
    - X_val, y_val: 验证数据及其标签
    - X_test, y_test: 测试数据及其标签
    - model_types: 模型类型列表 (['LSTM', 'GRU', 'CNN', 'XGBoost', 'RandomForest', 'SVM', 'MLP'])
    - epochs: 最大训练轮次
    - batch_size: 批次大小

    返回:
    - 最佳模型类型和性能
    """
    best_model = None
    best_mse = float('inf')
    best_model_type = None

    for model_type in model_types:
        print(f"训练模型: {model_type}")
        model = build_model(model_type, input_shape=(X_train.shape[1], X_train.shape[2]))

        if model_type in ['LSTM', 'GRU', 'CNN']:
            model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=epochs, batch_size=batch_size, verbose=0)
        else:
            X_train_flat = X_train.reshape(X_train.shape[0], -1)
            X_val_flat = X_val.reshape(X_val.shape[0], -1)
            for i in range(3):  # 训练每个时间步的模型
                model[i].fit(X_train_flat, y_train[:, i])

        mse = evaluate_model(model, X_test, y_test, model_type)
        if mse < best_mse:
            best_mse = mse
            best_model = model
            best_model_type = model_type

    print(f"最佳模型类型: {best_model_type}，测试集MSE: {best_mse}")
    return best_model, best_model_type, best_mse


# 主函数
if __name__ == "__main__":
    # 生成数据 (你可以根据你的情况替换为实际的数据生成过程)
    # 数据生成部分需要根据你具体的数据情况进行替换
    X_train, X_val, X_test, y_train, y_val, y_test = data_preprocessing_main(
        file_path='gold_price_data_extracted.csv',
        window_size=10,
        prediction_horizon=3,
        normalization_method='minmax',
        train_size=0.7,
        validation_size=0.1,
        drop_columns=['Adj', 'Volume'],
        indicators=['MA', 'RSI', 'MACD', 'Bollinger', 'ATR']
    )

    # 进行模型搜索
    best_model, best_model_type, best_mse = model_search(X_train, y_train, X_val, y_val, X_test, y_test)

    # 保存最佳模型
    if best_model_type in ['LSTM', 'GRU', 'CNN']:
        best_model.save(f'best_model_{best_model_type}.h5')
    print(f"最佳模型 {best_model_type} 已保存。")