【Question title】:How to fix the TypeError "Expected binary or unicode string, got 22.0" when calling linear_estimator.train(train_input_fn)?
【Posted】:2021-02-17 16:01:57
【Question】:

Here is my code. I am loading the CSV files from Google Drive.

 # -*- coding: utf-8 -*-
"""Titanic_Linear_Regression_Model.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1MMY9m7QGpqWVj-zyv2oaIJQZ2V__7AGX
"""

!pip install -q sklearn

# Commented out IPython magic to ensure Python compatibility.
# %tensorflow_version 2.x

"""**Predicting Survivals of Titanic using linear regression model**

Load the titanic.csv data and analyze it to gain a better understanding of the data.
Build the model, then train and evaluate it.
"""

#import all necessary libraries

from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from IPython.display import clear_output
from six.moves import urllib

import tensorflow as tf

#Install PyDrive
!pip install -U -q PyDrive

#import necessary modules for the PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

#authenticate and create a PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

#gender_submission_path = https://drive.google.com/file/d/1e0ZLmv8G-kVDJgeeb1b1REWJG0yNT3jc/view?...
#testing_data_path = https://drive.google.com/file/d/1SCaUViZG8qt1q5K5_cdBkFCM4VN8H_5z/view?...
#training_data_path= https://drive.google.com/file/d/1lLrhVKTGuQiJI5kTQkKF6if_r5B5S9HN/...

# Create one handle per Drive file; reusing a single variable keeps only the last file
gender_file = drive.CreateFile({"id":"1e0ZLmv8G-kVDJgeeb1b1REWJG0yNT3jc"})
testing_file = drive.CreateFile({"id":"1SCaUViZG8qt1q5K5_cdBkFCM4VN8H_5z"})
training_file = drive.CreateFile({"id":"1lLrhVKTGuQiJI5kTQkKF6if_r5B5S9HN"})

#Loading the files
gender_file.GetContentFile("gender_submission.csv")
training_file.GetContentFile("training_data.csv")
testing_file.GetContentFile("testing_data.csv")

#Using proper encoding to read the csv files
df_train = pd.read_csv("training_data.csv", encoding='utf-8', quotechar='"', delimiter=',') #training data
df_test = pd.read_csv("testing_data.csv", encoding='utf-8', quotechar='"', delimiter=',')   #test data
df_gender = pd.read_csv("gender_submission.csv", encoding='utf-8', quotechar='"', delimiter=',') #gender_submission dataset

#since our data contains NaN, let's convert them into empty strings
df_train.fillna('', inplace=True)

df_test.fillna('', inplace=True)

# Remove the Survived column from the training data and save it as y_train
y_train = df_train.pop('Survived')
# Show the first few training labels
print(y_train.head())

# Remove the Survived column from the gender_submission data and save it as y_test
y_test = df_gender.pop('Survived')

print(y_test.head())

df_train.head() # shows the first rows of the training dataset

"""**Analysis on the train dataset**"""

print(df_train.describe()) #print train dataset stats
print(df_train.shape) #prints the shape of the train data

#Generate histogram data for ages
#df_train.Age.hist(bins = 20)

#count sex to generate the graph of number of females and males 
df_train['Sex'].value_counts().plot(kind = 'bar')

#generate a pie graph for the Pclass
df_train['Pclass'].value_counts().plot(kind = "pie")

# Concatenate df_train and y_train, then plot the mean survival rate grouped by sex
pd.concat([df_train, y_train], axis = 1).groupby('Sex').Survived.mean().plot(kind = "bar").set_xlabel("% Survival")

# Concatenate df_train and y_train, then plot the mean survival rate grouped by passenger class
pd.concat([df_train, y_train], axis = 1).groupby('Pclass').Survived.mean().plot(kind = "bar").set_xlabel("% Survival per class")

"""The data shows that the majority of the passengers were aged between 15 to 40 years and majority of the passengers were males. The data shows that most of the passengers were based in the passenger class 3. Females have much higher chance of survival than males, 70% of females survived compared to 20% of males. The data suggest that Passengers who were in class 1 had a higher chance of survival than those who were in class 2 and 3

**Create feature columns that will be used to feed the model**
"""

#feature columns that will be used to feed the model
categorical_columns = ["Pclass", "Name", "Sex", "Ticket", "Cabin", "Embarked", "SibSp"]
numerical_columns = ["Age", "Fare"]

feature_columns = []

#gets a list of unique values
for feature_name in categorical_columns:
  vocabulary = df_train[feature_name].unique()
  feature_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(
    feature_name, vocabulary, dtype=None, default_value=-1, num_oov_buckets=0
))

print(feature_columns)

"""**Training the model using feature categorical_columns**

Create an input function that will convert the data into tf.data.Dataset
"""

#Create an input function that will convert the data into tf.data.Dataset
def make_input_fn(data_df, label_df, num_epochs = 10, shuffle = True, batch_size = 32):
  # Inner function that is returned to the caller
  def input_function():
    #create tf.data.Dataset object with data and its label
    ds = tf.data.Dataset.from_tensor_slices((dict(data_df), label_df))
    if shuffle:
      ds = ds.shuffle(1000) # shuffle with a buffer of 1000 elements; the result must be reassigned
    ds = ds.batch(batch_size).repeat(num_epochs) # split into batches of batch_size and repeat for num_epochs epochs

    return ds # returns the batched dataset
  return input_function # returns the function object for later use


#Calls the input function that we returned to create the train and test functions
train_input_fn = make_input_fn(df_train, y_train) #train function

test_input_fn = make_input_fn(df_test, y_test, num_epochs = 1, shuffle = False) #testing function

"""**Creating the Model**

Use a linear estimator to utilize the linear regression algorithm
"""

#creates the linear estimator by passing the feature_columns
linear_estimator = tf.estimator.LinearClassifier(feature_columns = feature_columns)

#Training the model by passing the training function
linear_estimator.train(train_input_fn)

The error message I get:

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
tensorflow/python/framework/fast_tensor_util.pyx in tensorflow.python.framework.fast_tensor_util.AppendObjectArrayToTensorProto()

/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/compat.py in as_bytes(bytes_or_text, encoding)
     85   else:
     86     raise TypeError('Expected binary or unicode string, got %r' %
---> 87                     (bytes_or_text,))
     88 
     89 

TypeError: Expected binary or unicode string, got 22.0

【Comments】:

Tags: python machine-learning


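【Answer 1】:

You may be using the wrong encoding when reading the CSV files.

You can use the chardet module to try to detect the correct encoding first, by reading the first few bytes of the CSV file.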

import chardet

with open("training_data.csv", mode="rb") as ifile:
    # may need to increase the byte count from 32 for better accuracy
    raw_bytes = ifile.read(32)
    encoding_name = chardet.detect(raw_bytes)["encoding"]

You can then use this encoding in read_csv():

pd.read_csv("training_data.csv", encoding=encoding_name, quotechar='"', delimiter=',')
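
As a dependency-free cross-check, here is a minimal sketch that tries a few candidate encodings in order and reports the first one that decodes the bytes without error. The helper name first_working_encoding, the candidate list, and the sample byte strings are assumptions made for illustration; they are not part of chardet or pandas. Note that latin-1 accepts any byte sequence, so it only makes sense as a last resort.

```python
def first_working_encoding(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw_bytes without error."""
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
        return name
    return None


# Hypothetical samples standing in for the first bytes of a CSV file.
sample_utf8 = "Name,Age\nBraund,22.0\n".encode("utf-8")
sample_cp1252 = "Name,Age\nZoë,22.0\n".encode("cp1252")

print(first_working_encoding(sample_utf8))
print(first_working_encoding(sample_cp1252))
```

If read_csv() already succeeds with encoding='utf-8', the file has effectively passed this kind of check, and the TypeError may come from somewhere else, such as the column dtypes.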

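【Comments】:

  • Assuming your file is not too large to fit in memory, you could try "raw_bytes = ifile.read()" to detect the encoding, instead of specifying a byte count in the argument.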
【Answer 2】:

I decided to drop the rows that contain NaN from the data.

# Check column dtypes and non-null counts (info() prints its report and returns None)
df_train.info()
print()
df_test.info()

# Drop every row that contains a NaN
df_train.dropna(axis = 0, inplace = True)
df_test.dropna(axis = 0, inplace = True)

The error was resolved.
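
Dropping rows is one way out, at the cost of losing data. The value 22.0 in the error message suggests that a float ended up in a column TensorFlow was trying to treat as strings, which is what a blanket fillna('') can produce on a numeric column such as Age. Below is a minimal sketch of a type-aware alternative, assuming it is acceptable to fill numeric columns with their median and text columns with an empty string. The toy DataFrame and the choice of the median are illustrative assumptions, not taken from the original data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, np.nan],
    "Cabin": [np.nan, "C85", np.nan],
    "Sex": ["male", "female", "female"],
})

# A blanket fillna('') turns float columns into object columns
# that mix floats and strings.
mixed = df.fillna("")
print(mixed.dtypes)

# Type-aware alternative: median for numeric columns, '' for the rest.
numeric_cols = df.select_dtypes(include="number").columns
other_cols = df.columns.difference(numeric_cols)

clean = df.copy()
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())
clean[other_cols] = clean[other_cols].fillna("")
print(clean.dtypes)
```

After this, the numeric columns keep a float dtype and the text columns contain only strings, so each column holds a single type.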

【Comments】:
