Intermediate EDA - Kaggle Study 3.5 - Training Structure

2021-02-05 19:01:24

Tags: EDA features kaggle feature fold train 3.5 test valid


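Overall training structure

Previous post

Overview

  1. Two encoding options are provided: one-hot encoding and label encoding
  2. Cross-validation uses KFold
  3. The model is LightGBM's LGBMClassifier

What I personally find worth learning here is how the metrics are built by hand, along with some basic coding techniques.

Section-by-section analysis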

Preparation

This step prepares:

  1. The train and test ids ---> extracted and kept for later use
  2. The train and test features ---> they must have the same columns!
  3. The training targets/labels ---> the test labels are what we want to predict

    # Extract the ids
    train_ids = features['SK_ID_CURR']
    test_ids = test_features['SK_ID_CURR']

    # Extract the labels for training
    labels = features['TARGET']

    # Remove the ids and target
    features = features.drop(columns = ['SK_ID_CURR', 'TARGET'])
    test_features = test_features.drop(columns = ['SK_ID_CURR'])
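
As a rough illustration of this preparation step, the sketch below runs the same operations on two tiny made-up tables. The AMT_INCOME column and all values are assumptions for the example, not data from the competition.

```python
import pandas as pd

# Toy stand-ins for the real train/test tables (all values are made up)
features = pd.DataFrame({
    'SK_ID_CURR': [100, 101, 102],
    'TARGET': [0, 1, 0],
    'AMT_INCOME': [50.0, 80.0, 65.0],   # hypothetical feature column
})
test_features = pd.DataFrame({
    'SK_ID_CURR': [200, 201],
    'AMT_INCOME': [70.0, 55.0],
})

# Keep the ids and labels aside before dropping them from the features
train_ids = features['SK_ID_CURR']
test_ids = test_features['SK_ID_CURR']
labels = features['TARGET']

features = features.drop(columns=['SK_ID_CURR', 'TARGET'])
test_features = test_features.drop(columns=['SK_ID_CURR'])

# After this step both tables should expose exactly the same columns
same_columns = list(features.columns) == list(test_features.columns)
print(same_columns)
```

If the check comes out false at this point, the two tables need to be aligned before training.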

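One-hot encoding

As noted above, the train and test features must have the same columns.
So after get_dummies the two frames are aligned, which drops any one-hot column that exists in only one of them (for example a category that appears in train but not in test).
This looks like throwing information away, but a column that test does not have cannot help at prediction time, so there is no need to keep it.
There are no categorical indices to record, because every categorical column has been turned into one-hot columns; cat_indices is simply left as 'auto'.

    # One Hot Encoding
    if encoding == 'ohe':
        features = pd.get_dummies(features)
        test_features = pd.get_dummies(test_features)

        # Align the dataframes by the columns
        features, test_features = features.align(test_features, join = 'inner', axis = 1)

        # No categorical indices to record
        cat_indices = 'auto'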
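
The effect of the alignment is easier to see on a small example. The sketch below uses two made-up frames (the color and size columns are assumptions) where train contains a category that test lacks.

```python
import pandas as pd

# Made-up data: 'blue' appears only in train
train = pd.DataFrame({'color': ['red', 'green', 'blue'], 'size': [1, 2, 3]})
test = pd.DataFrame({'color': ['red', 'green'], 'size': [4, 5]})

train_ohe = pd.get_dummies(train)
test_ohe = pd.get_dummies(test)

# Before aligning, train has one extra column (color_blue)
extra_before = set(train_ohe.columns) - set(test_ohe.columns)

# join='inner' on axis=1 keeps only the columns present in both frames
train_aligned, test_aligned = train_ohe.align(test_ohe, join='inner', axis=1)

print(extra_before)
print(list(train_aligned.columns))
```

After the inner join both frames carry the same columns in the same order, which is what the model needs.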

Label encoding

  1. A LabelEncoder is applied to every column of dtype object, mapping its values to integer labels
  2. reshape((-1,)) makes sure the encoder receives a flat 1-D array: a single dimension, however many rows there are
  3. The positions of the object (categorical) columns are recorded, to be passed as a parameter during training

    # Integer label encoding
    elif encoding == 'le':

        # Create a label encoder
        label_encoder = LabelEncoder()

        # List for storing categorical indices
        cat_indices = []

        # Iterate through each column
        for i, col in enumerate(features):
            if features[col].dtype == 'object':
                # Map the categorical features to integers
                # (fit on train only; transform raises ValueError if test holds a category not seen in train)
                features[col] = label_encoder.fit_transform(np.array(features[col].astype(str)).reshape((-1,)))
                test_features[col] = label_encoder.transform(np.array(test_features[col].astype(str)).reshape((-1,)))

                # Record the categorical indices
                cat_indices.append(i)
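
A minimal sketch of the same loop on made-up data follows. The income and contract columns are assumptions; the contract column is given an explicit object dtype so that the dtype check behaves the same across pandas versions. The last part shows what happens when the encoder meets a category it was not fitted on.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data; object dtype is set explicitly for the text column
train = pd.DataFrame({
    'income': [10.0, 20.0, 30.0],
    'contract': pd.Series(['cash', 'revolving', 'cash'], dtype=object),
})
test = pd.DataFrame({
    'income': [15.0, 25.0],
    'contract': pd.Series(['revolving', 'cash'], dtype=object),
})

cat_indices = []
for i, col in enumerate(train):
    if train[col].dtype == 'object':
        encoder = LabelEncoder()
        # Fit on train, then reuse the same mapping for test
        train[col] = encoder.fit_transform(train[col].astype(str))
        test[col] = encoder.transform(test[col].astype(str))
        cat_indices.append(i)

# A category that was never seen during fit cannot be transformed
try:
    encoder.transform(np.array(['leasing']))
    unseen_ok = True
except ValueError:
    unseen_ok = False

print(cat_indices, train['contract'].tolist(), test['contract'].tolist(), unseen_ok)
```

Because the encoder is fitted on train only, it is worth checking that test contains no new categories before relying on this approach.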

Preparation, part 2

  1. Create the KFold object that splits our dataset
  2. Create an empty feature_importance array
  3. Create an empty test_prediction array
  4. Create an empty valid_prediction array (out_of_fold)
  5. Create empty valid_scores and train_scores lists

    # Extract feature names
    feature_names = list(features.columns)

    # Convert to np arrays
    features = np.array(features)
    test_features = np.array(test_features)

    # Create the kfold object
    # (random_state is only accepted together with shuffle = True in recent scikit-learn, so it is omitted here)
    k_fold = KFold(n_splits = n_folds, shuffle = False)

    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))

    # Empty array for test predictions
    test_predictions = np.zeros(test_features.shape[0])

    # Empty array for out of fold validation predictions
    out_of_fold = np.zeros(features.shape[0])

    # Lists for recording validation and training scores
    valid_scores = []
    train_scores = []
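
To make the role of the KFold object concrete, here is a small sketch on a made-up array of 10 rows split into 5 folds. It counts how often each row lands in a validation set; the array and fold count are assumptions chosen for readability.

```python
import numpy as np
from sklearn.model_selection import KFold

n_folds = 5
X = np.arange(20).reshape(10, 2)  # 10 made-up rows, 2 features

k_fold = KFold(n_splits=n_folds, shuffle=False)

times_validated = np.zeros(len(X), dtype=int)
fold_sizes = []
for train_idx, valid_idx in k_fold.split(X):
    # Each fold trains on 4/5 of the rows and validates on the other 1/5
    fold_sizes.append((len(train_idx), len(valid_idx)))
    times_validated[valid_idx] += 1

print(fold_sizes)
print(times_validated)
```

Every row is used for validation exactly once, which is why a single out_of_fold array of length n_rows is enough to hold all validation predictions.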

Training

  1. Get the training indices ((n_splits - 1)/n_splits of the rows) and the validation indices (the remaining 1/n_splits), then pull out the corresponding data
  2. Create a fresh classifier for every fold
  3. Start training; passing both the train and valid sets to eval_set gives a score for each of them

    # Iterate through each fold
    for train_indices, valid_indices in k_fold.split(features):

        # Training data for the fold
        train_features, train_labels = features[train_indices], labels[train_indices]
        # Validation data for the fold
        valid_features, valid_labels = features[valid_indices], labels[valid_indices]

        # Create the model
        model = lgb.LGBMClassifier(n_estimators=10000, objective = 'binary',
                                   class_weight = 'balanced', learning_rate = 0.05,
                                   reg_alpha = 0.1, reg_lambda = 0.1,
                                   subsample = 0.8, n_jobs = -1, random_state = 50)

        # Train the model
        # Note: early_stopping_rounds and verbose are fit() arguments in LightGBM < 4.0;
        # newer releases take callbacks such as lgb.early_stopping(100) and lgb.log_evaluation(200) instead
        model.fit(train_features, train_labels, eval_metric = 'auc',
                  eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
                  eval_names = ['valid', 'train'], categorical_feature = cat_indices,
                  early_stopping_rounds = 100, verbose = 200)

  1. Get the best_iteration found during training
  2. Update feature_importance ---> a plain average over the folds: each fold adds 1/n_splits of its values
  3. Update test_prediction ---> the same averaging: each fold adds 1/n_splits of its predictions
  4. Update validation_prediction ---> each fold predicts only its own validation rows (1/n_splits of the data, which that fold's model was not trained on), so the predictions are simply written into the matching positions of the array

        # Record the best iteration
        best_iteration = model.best_iteration_

        # Record the feature importances
        feature_importance_values += model.feature_importances_ / k_fold.n_splits

        # Make predictions
        test_predictions += model.predict_proba(test_features, num_iteration = best_iteration)[:, 1] / k_fold.n_splits

        # Record the out of fold predictions
        out_of_fold[valid_indices] = model.predict_proba(valid_features, num_iteration = best_iteration)[:, 1]

  1. Record the best valid and train scores of the current fold ---> measured by AUC
  2. Free memory that is no longer needed before the next fold starts

        # Record the best score
        valid_score = model.best_score_['valid']['auc']
        train_score = model.best_score_['train']['auc']

        valid_scores.append(valid_score)
        train_scores.append(train_score)

        # Clean up memory
        gc.enable()
        del model, train_features, valid_features
        gc.collect()
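
The per-fold bookkeeping can be tried out without LightGBM. The sketch below is only an illustration under stated assumptions: it swaps in scikit-learn's LogisticRegression for LGBMClassifier, uses randomly generated data, and leaves out early stopping and feature importances. It keeps the two accumulation patterns from the loop: averaging test predictions over folds and filling the out-of-fold array.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Randomly generated stand-in data (not the competition data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)
X_test = rng.normal(size=(20, 3))

k_fold = KFold(n_splits=5, shuffle=False)
test_predictions = np.zeros(X_test.shape[0])
out_of_fold = np.full(X.shape[0], np.nan)  # NaN marks rows not predicted yet

for train_idx, valid_idx in k_fold.split(X):
    # A fresh model per fold; LogisticRegression stands in for LGBMClassifier
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])

    # Each fold contributes 1/n_splits of its test prediction
    test_predictions += model.predict_proba(X_test)[:, 1] / k_fold.n_splits

    # Validation rows are predicted by the one model that never saw them
    out_of_fold[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

print(np.isnan(out_of_fold).sum())
```

Initializing out_of_fold with NaN instead of zeros is a small design choice for the sketch: it makes it easy to verify that every row was filled exactly once.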

Summarizing the overall scores

  1. Build two new tables: one for the submission and one for the feature importances

    # Make the submission dataframe
    submission = pd.DataFrame({'SK_ID_CURR': test_ids, 'TARGET': test_predictions})

    # Make the feature importance dataframe
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': feature_importance_values})

  1. Get the overall validation AUC ---> computed from out_of_fold, the array that was filled with each fold's validation predictions above
  2. Get the overall train AUC ---> the mean of the per-fold train scores

    # Overall validation score
    valid_auc = roc_auc_score(labels, out_of_fold)

    # Add the overall scores to the metrics
    valid_scores.append(valid_auc)
    train_scores.append(np.mean(train_scores))
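
The overall validation AUC is computed on the pooled out-of-fold predictions, which is not the same thing as averaging the per-fold AUCs. The sketch below shows the difference on six made-up predictions split into two hypothetical folds.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and out-of-fold predictions for 6 rows
labels = np.array([0, 0, 1, 1, 0, 1])
out_of_fold = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# Two hypothetical validation folds of 3 rows each
folds = [slice(0, 3), slice(3, 6)]
fold_aucs = [roc_auc_score(labels[f], out_of_fold[f]) for f in folds]

# Pooled AUC over all out-of-fold predictions, as in the code above
valid_auc = roc_auc_score(labels, out_of_fold)

print(fold_aucs, np.mean(fold_aucs), valid_auc)
```

Here the per-fold AUCs are 0.5 and 1.0 (mean 0.75), while the pooled AUC is 8/9, because pooling also compares predictions across folds.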

New table: the train_score and valid_score of each fold

    # Needed for creating dataframe of validation scores
    fold_names = list(range(n_folds))
    fold_names.append('overall')
    
    # Dataframe of validation scores
    metrics = pd.DataFrame({'fold': fold_names,
                            'train': train_scores,
                            'valid': valid_scores}) 
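
For a quick look at the shape of this table, the sketch below builds it from made-up scores for three folds. All numbers, including the overall validation value, are placeholders rather than real results.

```python
import numpy as np
import pandas as pd

n_folds = 3
# Placeholder per-fold scores (not real results)
train_scores = [0.81, 0.80, 0.82]
valid_scores = [0.75, 0.76, 0.74]

# Append the overall values as one extra row
valid_scores.append(0.752)                  # placeholder for the pooled out-of-fold AUC
train_scores.append(np.mean(train_scores))  # mean is taken before the append happens

fold_names = list(range(n_folds))
fold_names.append('overall')

metrics = pd.DataFrame({'fold': fold_names,
                        'train': train_scores,
                        'valid': valid_scores})
print(metrics)
```

The table ends up with n_folds + 1 rows: one per fold plus the 'overall' summary row.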

Source: https://www.cnblogs.com/niemand-01/p/14379362.html
