
Titanic passenger survival prediction

2022-01-27 05:33:56 Ada_ lake

Decision tree algorithm

criterion - the splitting standard:
entropy: based on information entropy - i.e. the ID3 algorithm; in practice the results differ little from C4.5
gini: the Gini coefficient - i.e. the CART algorithm
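In sklearn this choice is made through the `criterion` parameter of `DecisionTreeClassifier`; a minimal sketch:

```python
from sklearn.tree import DecisionTreeClassifier

# ID3-style tree: split on information gain (entropy)
clf_entropy = DecisionTreeClassifier(criterion="entropy")

# CART tree (sklearn's default): split on the Gini coefficient
clf_gini = DecisionTreeClassifier(criterion="gini")
```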

The key process of survival prediction


  1. Preparation stage

    1. Data exploration - assess data quality
      1. info - basic information about the data frame: number of rows, number of columns, the data type of each column, data completeness
      2. describe - statistics for the numeric columns: count, mean, standard deviation, minimum, maximum
      3. describe(include=['O']) - overview of the string (object-typed) columns
      4. head - the first few rows of data
      5. tail - the last few rows of data
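The exploration calls above can be sketched on a toy frame (the columns are made up, standing in for the Titanic data):

```python
import pandas as pd

# Toy frame with made-up columns standing in for the Titanic data
df = pd.DataFrame({"Age": [22.0, 38.0, None],
                   "Sex": ["male", "female", "female"]})

df.info()                           # rows, columns, dtypes, non-null counts
print(df.describe())                # count, mean, std, min, max of numeric columns
print(df.describe(include=["O"]))   # overview of the object (string) columns
print(df.head(2))                   # first rows
print(df.tail(2))                   # last rows
```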
    2. Data cleaning
      1. Fill in missing values with the mean or the most frequent value
        df['XX'] = df['XX'].fillna(df['XX'].mean())
        df['XX'] = df['XX'].fillna(df['XX'].value_counts().idxmax())
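A runnable sketch of both fills (the columns `Age` and `Embarked` are made-up stand-ins for `XX`):

```python
import pandas as pd

# Made-up columns standing in for 'XX' in the snippets above
df = pd.DataFrame({"Age": [22.0, None, 38.0],
                   "Embarked": ["S", "S", None]})

# numeric column: fill missing values with the mean
df["Age"] = df["Age"].fillna(df["Age"].mean())

# categorical column: fill missing values with the most frequent value
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].value_counts().idxmax())
```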
    3. Feature selection - dimensionality reduction, which simplifies the later classification step
      1. Filter out meaningless columns
      2. Filter out columns with many missing values
      3. Put the remaining features into the feature vector
      4. Data conversion - convert string columns into numeric columns for the later steps, using the DictVectorizer class
        DictVectorizer: symbolic (categorical) values are converted into 0/1 indicators
        1. Instantiate a converter
          devc = DictVectorizer(sparse=False) - sparse=False means a dense matrix is returned instead of a sparse one; a sparse matrix represents only the non-zero values, together with their positions
          one-hot encoding keeps the categories equal, with no ordering between them
        2. Call the fit_transform() method
          to_dict(orient='records') - convert the DataFrame rows into a list of dicts
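The two steps above can be sketched as follows (the row dicts are made up; in the post they would come from `to_dict(orient='records')` on the Titanic feature frame):

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up rows standing in for train_features.to_dict(orient='records')
rows = [{"Pclass": 3, "Sex": "male"},
        {"Pclass": 1, "Sex": "female"}]

dvec = DictVectorizer(sparse=False)   # return a dense matrix, not a sparse one
features = dvec.fit_transform(rows)   # string values become one-hot 0/1 columns

print(dvec.feature_names_)            # ['Pclass', 'Sex=female', 'Sex=male']
```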
  2. Classification stage

    1. Decision tree model
      1. Import the decision tree model
      2. Instantiate the decision tree
      3. Fit the training data to generate the decision tree
    2. Model evaluation & prediction
      1. Prediction - the decision tree outputs the predicted results
      2. Evaluation
        When the true labels are known - clf.score(features, labels)
        When the true labels are not known - K-fold cross validation - cross_val_score
    3. Decision tree visualization
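A sketch of the fit / predict / evaluate cycle; the iris dataset is used purely as stand-in data for the Titanic features:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # stand-in features and labels

clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X, y)                              # fit -> the tree is generated
pred = clf.predict(X)                      # the tree outputs predictions
train_acc = clf.score(X, y)                # accuracy when true labels are known
cv_acc = cross_val_score(clf, X, y, cv=10).mean()  # K-fold otherwise
```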
  3. Drawing stage - GraphViz

    1. First install graphviz
    2. Import the graphviz package - import graphviz
    3. From sklearn import export_graphviz
    4. First use export_graphviz to export the data to be displayed from the decision tree model
    5. Then use graphviz to load the exported data
    6. Display the result
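Steps 3-6 can be sketched as follows; the rendering lines are left commented because they need the graphviz package plus the GraphViz binaries installed:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)          # stand-in data
clf = DecisionTreeClassifier().fit(X, y)

# Step 4: export the fitted tree as DOT source text
dot_data = export_graphviz(clf, out_file=None)

# Steps 5-6: hand the DOT text to graphviz and render it
# import graphviz
# graph = graphviz.Source(dot_data)
# graph.view()
```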

K-fold cross validation
Most of the samples are taken for training and a small portion is used to validate the classifier: perform K rounds of cross validation, each time selecting 1/K of the data for validation and using the rest for training; rotate K times and average the results.

  • Divide the data set evenly into K equal parts

  • Use 1 part as test data and the rest as training data

  • Calculate the test accuracy

  • Using a different test set each time, repeat steps 2 and 3
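The four steps above can be sketched with sklearn's `KFold` splitter (iris again as stand-in data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)          # stand-in data
K = 5
scores = []
# step 1: split the data set into K equal parts
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # step 2: 1 part for testing, the remaining K-1 parts for training
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    # step 3: compute the test accuracy
    scores.append(clf.score(X[test_idx], y[test_idx]))
# step 4: rotate through all K test sets and average
mean_acc = sum(scores) / K
```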

Copyright notice
Author [Ada_ lake]. Please include a link to the original when reprinting. Thank you.