【Posted】: 2023-03-20 13:08:01
【Question】:
I extracted the following text from a PDF using the pdftotext tool.
The text has this structure:
stage    title1       title2       title3       title4
I        value1       value2       value3
II                                 value5       value6
stage    Other1       Other2       Other3       Other4
I        otherval1    otherval2    otherval3    otherval4
Now I want to export this text as CSV with the proper columns and headers, or build an array like this:
[
"category" => "title1",
"score" => "value1",
],
[
"category" => "title2",
"score" => "value2",
],
[
"category" => "title3",
"score" => "value3"
],
// unable to do this
[
"category" => "title3",
"score" => "value5"
],
[
"category" => "title4",
"score" => "value6",
],
.
.
// so on
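For the CSV variant of this output, a minimal sketch might look like the following. It assumes a stage row has already been keyed by column title; the `$row` sample and the helper names `toPairs` and `pairsToCsv` are hypothetical and only illustrate the target shape.

```php
<?php
// Hypothetical input: one stage row already keyed by its column title.
$row = ['title1' => 'value1', 'title2' => 'value2', 'title3' => 'value3', 'title4' => ''];

/** Turn a title => value row into the category/score pairs shown above. */
function toPairs(array $row): array
{
    $pairs = [];
    foreach ($row as $category => $score) {
        if ($score === '') {
            continue; // empty cell: this stage has no score for the column
        }
        $pairs[] = ['category' => $category, 'score' => $score];
    }
    return $pairs;
}

/** Serialize the pairs as CSV text with a header line. */
function pairsToCsv(array $pairs): string
{
    $handle = fopen('php://temp', 'r+');
    // Separator, enclosure and escape are passed explicitly.
    fputcsv($handle, ['category', 'score'], ',', '"', '\\');
    foreach ($pairs as $pair) {
        fputcsv($handle, [$pair['category'], $pair['score']], ',', '"', '\\');
    }
    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);
    return $csv;
}

echo pairsToCsv(toPairs($row));
```

Writing to `php://temp` keeps the sketch free of file paths; swapping in `fopen('scores.csv', 'w')` (a made-up file name) would write to disk instead.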
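The constraints are:
- Column values in the stage I and stage II rows are optional, but for each column at least one of the rows will contain a value
- The stage II row is optional; it may or may not be present
- If the stage II row is present, at least one column value exists in that row
The problem I am facing is how to map
- value5 to title3
- value6 to title4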
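One possible way to get this mapping is to use character positions instead of token order. The sketch below assumes the text was extracted with `pdftotext -layout`, so that each value starts at roughly the same offset as its column title. The sample lines and the helper names `tokensWithOffsets` and `mapRowToTitles` are hypothetical.

```php
<?php
// Hypothetical sample lines, padded so that values line up under their titles.
$header = sprintf('%-9s%-10s%-10s%-10s%s', 'stage', 'title1', 'title2', 'title3', 'title4');
$stage2 = sprintf('%-9s%-10s%-10s%-10s%s', 'II', '', '', 'value5', 'value6');

/** Return a list of [token, byteOffset] for every non-whitespace run in a line. */
function tokensWithOffsets(string $line): array
{
    preg_match_all('/\S+/', $line, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}

/** Assign each value in $rowLine to the title whose start offset is nearest. */
function mapRowToTitles(string $headerLine, string $rowLine): array
{
    $titles = tokensWithOffsets($headerLine);
    $mapped = [];
    foreach (tokensWithOffsets($rowLine) as [$value, $valueOffset]) {
        $bestTitle = null;
        $bestDistance = PHP_INT_MAX;
        foreach ($titles as [$title, $titleOffset]) {
            $distance = abs($titleOffset - $valueOffset);
            if ($distance < $bestDistance) {
                $bestDistance = $distance;
                $bestTitle = $title;
            }
        }
        if ($bestTitle !== null) {
            $mapped[$bestTitle] = $value;
        }
    }
    return $mapped;
}

// For the sample above: stage => II, title3 => value5, title4 => value6
print_r(mapRowToTitles($header, $stage2));
```

Nearest-offset matching tolerates values that drift a few characters from the title, but it can pick the wrong column when columns are narrow or values are right-aligned. The offsets are byte offsets, so multi-byte text would need extra care.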
Here is my parser code (PHP):
$rows = explode("\n", $pdfExtractedText);
$rows = array_values(array_filter($rows));
$categories = array_values(array_filter(explode(" ", $rows[7])));
$stage1Scores = array_values(array_filter(explode(" ", $rows[8])));
$stage2Scores = array_values(array_filter(explode(" ", $rows[9])));
var_dump($categories);
var_dump($stage1Scores);
var_dump($stage2Scores);
Output:
// categories
array:13 [
0 => "stage"
1 => "title1"
2 => "title2"
3 => "title3"
4 => "title4"
]
//values - Index preserved so that I can map with categories
array:14 [
0 => "I"
1 => "value1"
2 => "value2"
3 => "value3"
4 => "value4"
]
// index not preserved :(
array:2 [
0 => "II"
1 => "value5",
2 => "value6"
]
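The last dump loses its positions because `explode(" ", ...)` followed by `array_filter` discards every empty string, so blank cells disappear and the remaining values shift left. A sketch that keeps the positions, assuming fixed-width columns as produced by `pdftotext -layout`, slices each row at the start offsets of the header titles. The sample lines and the helpers `columnStarts` and `sliceRow` are hypothetical.

```php
<?php
// Hypothetical sample lines with values aligned under their titles.
$header = sprintf('%-9s%-10s%-10s%-10s%s', 'stage', 'title1', 'title2', 'title3', 'title4');
$stage2 = sprintf('%-9s%-10s%-10s%-10s%s', 'II', '', '', 'value5', 'value6');

// The lossy split from the question: blank cells vanish, indices shift.
// Note that array_filter without a callback would also drop a literal "0".
$lossy = array_values(array_filter(explode(' ', $stage2)));

/** Byte offsets at which each header title starts. */
function columnStarts(string $headerLine): array
{
    preg_match_all('/\S+/', $headerLine, $matches, PREG_OFFSET_CAPTURE);
    return array_column($matches[0], 1);
}

/** Cut a row into one cell per header column; missing cells become ''. */
function sliceRow(string $rowLine, array $starts): array
{
    $cells = [];
    foreach ($starts as $i => $start) {
        if (isset($starts[$i + 1])) {
            $cell = substr($rowLine, $start, $starts[$i + 1] - $start);
        } else {
            $cell = substr($rowLine, $start); // last column runs to end of line
        }
        // The cast covers PHP versions where substr returns false past the end.
        $cells[] = trim((string) $cell);
    }
    return $cells;
}

$cells = sliceRow($stage2, columnStarts($header));
// $lossy is ['II', 'value5', 'value6']
// $cells is ['II', '', '', 'value5', 'value6'], index-aligned with the header
print_r($cells);
```

Because every row yields one cell per header column, `array_combine` on the header tokens and the cells would give a title-keyed row. Values wider than their column or starting left of the title would be cut, so this only holds while the layout stays strictly aligned.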
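【Comments】:
- Do you want a way to parse the output into an array?
- Or do you just want to push it into CSV format?
- @Hudson CSV is fine too, as long as the required headers are preserved.
- I have answered below; if you need any changes, please comment under the answer and I will correct it.
Tags: php data-extraction pdftotext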