Importing a CSV, reshaping a variable's array for logistic regression: answers

【Question title】: Importing a CSV, reshaping a variable's array for logistic regression
【Posted】: 2020-07-24 07:54:52
【Question description】:

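I hope everyone is staying safe during the COVID-19 pandemic. I am new to Python and have a question about importing data from a CSV into Python for a simple logistic regression analysis, where the dependent variable is binary and the independent variable is continuous.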

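I imported a CSV file and want to use one variable (Active) as the independent variable and another (Smoke) as the response variable. I can load the CSV file into Python, but every time I try to fit a logistic regression model that predicts Smoke from exercise (Active), I get an error saying that the exercise variable must be reshaped into a single column (two-dimensional), because it is currently one-dimensional.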

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv') # Read the data from the CSV file
x = data['Active'] # Load the values from Exercise into the independent variable
x = np.array.reshape(-1,1)
y = data['Smoke'] # The dependent variable is set as Smoke

I keep getting the following error message:

ValueError: Expected 2D array, got 1D array instead: array=[ 97. 82. 88. 106. 78. 109. 66. 68. 100. 70. 98. 140. 105. 84. 134. 117. 100. 108. 76. 86. 110. 65. 85. 80. 87. 133. 125. 61. 117. 90. 110. 68. 102. 67. 112. 86. 85. 66. 73. 85. 110. 97. 93. 86. 80. 96. 74. 124. 78. 93. 80. 80. 92. 69. 82. 88. 74. 74. 75. 120. 105. 104. 99. 113. 67. 125. 133. 98. 80. 91. 76. 78. 94. 150. 92. 96. 68. 82. 102. 69. 65. 84. 86. 84. 116. 88. 65. 101. 89. 128. 68. 90. 80. 80. 98. 90. 82. 97. 90. 98. 88. 94. 92. 96. 80. 66. 110. 87. 88. 94. 96. 89. 74. 111. 81. 98. 99. 65. 95. 127. 76. 102. 88. 125. 72. 76. 112. 69. 101. 72. 112. 81. 90. 96. 66. 114. 71. 75. 102. 138. 85. 80. 107. 119. 98. 95. 95. 76. 96. 102. 82. 99. 80. 83. 102. 102. 106. 79. 80. 79. 110. 144. 80. 97. 60. 80. 108. 107. 51. 68. 80. 80. 60. 64. 87. 110. 110. 82. 154. 139. 86. 95. 112. 120. 79. 64. 84. 65. 60. 79. 79. 70. 75. 107. 78. 74. 80. 121. 120. 96. 75. 106. 88. 91. 98. 63. 95. 85. 83. 92. 81. 89. 103. 110. 78. 122. 122. 71. 65. 92. 93. 88. 90. 56. 95. 83. 97. 105. 82. 102. 87. 81.]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

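Below is the complete updated code that produces the error (April 12, 2020). *I could not paste the error log into this post, so I copied it into this public Google document: https://docs.google.com/document/d/1vtrj6Znv54FJ4Zvv211TQvvCN6Ac5LDaOfvHicQn0nU/edit?usp=sharing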

Also, here is the CSV file: https://drive.google.com/file/d/1g_-vPNklxRn_3nlNPsR-IOflLfXSzFb1/view?usp=sharing

scikit-learn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv')
x = data['Active']
y = data['Smoke']
lr = LogisticRegression().fit(x.values.reshape(-1,1), y)
p_pred = lr.predict_proba(x.values)
y_pred = lr.predict(x.values)
score_ = lr.score(x.values,y.values)
conf_m = confusion_matrix(y.values,y_pred.values)
report = classification_report(y.values,y_pred.values)
confusion_matrix(y, lr.predict(x))    
cm = confusion_matrix(y, lr.predict(x))
fig, ax = plt.subplots(figsize = (8,8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0,1), ticklabels = ('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0,1), ticklabels = ('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j,i,cm[i,j],ha='center',va='center',color='red', size='45')
plt.show()
print(classification_report(y,model.predict(x)))

【Question discussion】:

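Try it without this line: x = np.array.reshape(-1,1)
Thank you for the suggestion. I tried it, but the result is the same: "ValueError: Expected 2D array, got 1D array instead."
Can you add the complete code, including the model fitting part?
Dear ManojK, thank you for your patience and continued support. I have updated the question with all of the code I am using, and I also copied and pasted the error log into a Google document (it was not accepted when I tried to submit it here). Any suggestions would be greatly appreciated.
Please see my answer below.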

Tags: python numpy statistics regression reshape

【Solution 1】:

The code below should work:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv')
x = pd.DataFrame(data['Active'])  # predictor as a one-column DataFrame (2-D), not the response column
y = data['Smoke']
lr = LogisticRegression()
lr.fit(x,y)
p_pred = lr.predict_proba(x)
y_pred = lr.predict(x)
score_ = lr.score(x,y)
conf_m = confusion_matrix(y,y_pred)
report = classification_report(y,y_pred)

print(score_)
0.8836206896551724

print(conf_m)
[[204   2]
 [ 25   1]]
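
To illustrate why wrapping the column in a DataFrame helps, here is a minimal sketch comparing the shapes involved. The small table is a made-up stand-in for Pulse.csv (only the column names Active and Smoke come from the question), so treat the values as assumptions.

```python
import pandas as pd

# Hypothetical stand-in for Pulse.csv; the numbers are invented for illustration.
data = pd.DataFrame({"Active": [97.0, 82.0, 88.0, 106.0], "Smoke": [0, 1, 0, 0]})

series_x = data["Active"]                 # single brackets give a Series (1-D)
frame_x = data[["Active"]]                # double brackets give a DataFrame (2-D)
wrapped_x = pd.DataFrame(data["Active"])  # wrapping the Series also gives 2-D

shapes = (series_x.shape, frame_x.shape, wrapped_x.shape)
print(shapes)  # ((4,), (4, 1), (4, 1))
```

Either 2-D form has one row per sample and one column per feature, which is the layout scikit-learn estimators expect for x.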

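【Discussion】:

Dear ManojK, thank you very much for your patience and continued support... Unfortunately, I could not get it to work. Here is the public link to the PDF: drive.google.com/file/d/1_1FUHuLWh2KsxjbTXAdlx4lZHxPcrHEc/…, and here is the CSV file: stat2.org/datasets/Pulse.csv Best regards, ciel_azzuro
See my updated code; it works now. I only changed this line: x = pd.DataFrame(data['Active']). It raised the error because x was a Series; now it is converted to a DataFrame.
Thank you very much for your valuable time and insight. The output of the analysis matches the output of another statistics package (SPSS). Thank you.
Dear ManojK, I just wanted to let you know that I keep coming back to this page for your advice when building later logistic regression models, and your suggestions are still very useful. Thanks again.
Great, let me know if you have any other questions.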

【Solution 2】:

Try this:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

data = pd.read_csv('Pulse.csv') # Read the data from the CSV file
x = data['Active'] # Load the values from Exercise into the independent variable
y = data['Smoke'] # The dependent variable is set as Smoke

lr = LogisticRegression().fit(x.values.reshape(-1,1), y)
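
The reshape has to be applied to every call that takes the features, not only to fit. A minimal sketch of the full sequence follows, assuming a small invented table in place of Pulse.csv; the reshaped array is stored once in a variable named X (a name chosen here for illustration) and reused.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical stand-in for Pulse.csv; the numbers are invented for illustration.
data = pd.DataFrame({
    "Active": [97.0, 82.0, 88.0, 106.0, 78.0, 109.0, 66.0, 68.0, 100.0, 70.0],
    "Smoke": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})

X = data["Active"].values.reshape(-1, 1)  # 2-D: one row per sample, one feature
y = data["Smoke"].values                  # 1-D target is fine

lr = LogisticRegression().fit(X, y)
p_pred = lr.predict_proba(X)   # shape (n_samples, 2)
y_pred = lr.predict(X)         # NumPy array, so there is no .values attribute
score_ = lr.score(X, y)
conf_m = confusion_matrix(y, y_pred, labels=[0, 1])
report = classification_report(y, y_pred, zero_division=0)
```

Note that predict returns a NumPy array, so calls such as y_pred.values from the question would fail even after the reshape is fixed.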

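【Discussion】:

Thank you. After entering the commands provided, I get the following error: p_pred = lr.predict_proba(x) y_pred = lr.predict(x) score_ = lr.score(x,y) conf_m = confusion_matrix(y,y_pred) report = classification_report(y,y_pred)
That is not an error, it is code. Note that you must use x.values.reshape(-1,1) instead of x
Dear Christian, thank you very much for your continued support. I tried your suggestion but could not get around the error. I have updated the question with the entire code, along with the error log generated after trying to run it. Any suggestions would be greatly appreciated, thank you.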
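
For readers landing here with the same error, the two reshape hints in the message can be illustrated with NumPy alone. This is a small sketch on made-up values, not the Pulse data. It also shows that reshape is a method of an array object: np.array itself is a function, which is likely why the line x = np.array.reshape(-1,1) in the question could not work.

```python
import numpy as np

values = np.array([97.0, 82.0, 88.0, 106.0])  # invented 1-D data

as_column = values.reshape(-1, 1)  # 4 samples, 1 feature
as_row = values.reshape(1, -1)     # 1 sample, 4 features

print(values.shape, as_column.shape, as_row.shape)  # (4,) (4, 1) (1, 4)

# reshape lives on the array instance, not on the np.array function
has_reshape_on_function = hasattr(np.array, "reshape")
print(has_reshape_on_function)  # False
```

With a single predictor such as Active, the column form from reshape(-1, 1) is the one to pass to fit, predict, predict_proba and score.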