【Question title】: Extract raw data from tables in Word using Perl
【Posted】: 2011-10-08 14:34:16
【Question】:

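I am trying to extract data from several tables in a Word document. I get an error when I try to convert the data in a table to text. The ConvertToText method takes two optional parameters (how to separate the data, and a Boolean). Here is my current code: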

#!/usr/bin/perl
#OLEWord.pl

#Use strict and print warnings
use strict;use warnings;
#Using OLE + OLE constants for Variants and OLE enumeration for Enumerations
use Win32::OLE qw(in);
use Win32::OLE::Const 'Microsoft Word';
use Win32::OLE::Variant;

my $var1 = Win32::OLE::Variant->new(VT_BOOL, 'true');

$Win32::OLE::Warn = 3;

#set the file to be opened
my $file = 'C:\work\SCL_International Financial New Fund Setup Questionnaire V1.6.docx';

#Attach to a running Word instance, or start a new one (|| so the fallback is actually assigned)
my $MSWord = Win32::OLE->GetActiveObject('Word.Application') || Win32::OLE->new('Word.Application','Quit');

#Set the screen to Visible, so that you can see what is going on
$MSWord->{'Visible'} = 1;
$MSWord->{'DisplayAlerts'} = 0; #Suppress alerts, such as 'Save As....'

#open the request file or die and print warning message
my $Doc = $MSWord->{'Documents'}->Open($file) or die "Could not open ", $file, " Error:", Win32::OLE->LastError();

#$MSWord->ActiveDocument->SaveAs({Filename => 'AlteredTest.docx', 
                            #FileFormat => wdFormatDocument});

my $tables = $MSWord->ActiveDocument->{'Tables'};

for my $table (in $tables){
   my $tableText = $table->ConverToText(wdSeparateByParagraphs,$var1);
   print "Table: ", $tableText, "\n";
}


$MSWord->ActiveDocument->Close;
$MSWord->Quit;

I get this error:

Bareword "VT_BOOL" not allowed while "strict subs" in use at OLEWord.pl line 31.
Bareword "true" not allowed while "strict subs" in use at OLEWord.pl line 31.
Execution of OLEWord.pl aborted due to compilation errors.

【Comments】:

    Tags: perl ms-word automation win32ole


    【Answer 1】:

    Removing "use strict" will make the "Bareword" errors go away.

    【Comments】:

      【Answer 2】:

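      The "Bareword" errors are caused by a syntax error in your code. A 'runaway multi-line' message usually pinpoints where the error starts, and usually means that a line has not been completed, often because of mismatched brackets or quotes.

      As several SO-ers have pointed out, this does not look like Perl! The Perl interpreter is balking at the syntax errors because it doesn't speak that particular language! (Source)

      Dropping strict will stop these errors from being reported. (But you should keep using it if you want to write good code.)

      Read up on barewords so that you know what they are and can work out for yourself how to correct this error.

      Here are some links for learning about barewords: 1. perl.com 2. alumnus

      【Comments】:

      • Thanks. Could you tell me how to extract the data from the tables? Does the code look correct?
      【Answer 3】: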

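      When names like VT_BOOL are not defined as constants, Perl treats them as barewords. Others have already provided information about those.

      The root cause of the problem is that the constants exported by the Win32::OLE::Variant module are missing. Add: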

      use Win32::OLE::Variant;
      

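      to your script to get rid of the first error. The second error is a similar problem: true is not defined either. Replace it with 1, or define the constant yourself: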

      use constant true => 1;
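
To see the effect in isolation, here is a minimal sketch that needs no Word or Win32::OLE at all. It compiles two tiny snippets with a string eval and reports whether strict accepts them. The helper name compile_error is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a snippet under strict and return the
# compile error, or '' if the snippet compiles.
sub compile_error {
    my ($code) = @_;
    my $ok = eval "use strict; $code; 1";
    return $ok ? '' : $@;
}

# 'true' is an undeclared bareword here, so "strict subs" rejects it.
my $without = compile_error('my $flag = true');

# Declaring the constant first turns 'true' into a real subroutine call.
my $with = compile_error('use constant true => 1; my $flag = true');

print "without constant: $without";
print "with constant: ", ( $with eq '' ? "compiles\n" : $with );
```

The first snippet fails with the same kind of "Bareword ... not allowed" message as in the question; the second compiles because the constant exists by the time true is parsed.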
      

      Edit: here is an example that extracts the text of the tables:

      my $tables = $MSWord->ActiveDocument->{'Tables'};
      for my $table (in $tables){
         my $tableText = $table->ConvertToText({ Separator => wdSeparateByTabs });
         print "Table: ", $tableText->Text(), "\n";
      }
      

      In your code, the method name ConverToText is misspelled. The method also returns a Range object, so you have to call its Text method to get the actual text.
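
Once you have that text, you will probably want rows and cells rather than one long string. The following is a rough sketch, assuming the table was converted with wdSeparateByTabs and that Word ends each row with a carriage return (its paragraph mark). parse_table_text is a hypothetical helper, and the sample string only stands in for what $tableText->Text() might return.

```perl
use strict;
use warnings;

# Hypothetical helper: split the text of a converted table into rows of cells.
# Assumes rows end with "\r" (or a newline) and cells are separated by tabs.
sub parse_table_text {
    my ($text) = @_;
    my @rows;
    for my $line ( split /\r\n?|\n/, $text ) {
        next if $line eq '';                      # skip blank lines
        push @rows, [ split /\t/, $line, -1 ];    # -1 keeps trailing empty cells
    }
    return \@rows;
}

# Sample data standing in for $tableText->Text()
my $sample = "Name\tValue\rFund\t42\r";
my $rows   = parse_table_text($sample);

print join( ' | ', @$_ ), "\n" for @$rows;
```

Each element of $rows is an array reference holding one row's cells, so $rows->[1][0] is the first cell of the second row.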

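      【Comments】:

      • Yes, I forgot that, thanks. But what about extracting the data from the tables in Word?
      • @Shahab - Please see my updated answer with the table extraction code.
      • hmmm, I get an error when I run it: OLE exception from "Microsoft Word": This method or property is not available because some or all of the object does not refer to a table -> ConverToText.
      • I thought the Tables property returned the collection of tables in the document?
      • @Shahab - You are right, Tables is the collection; you iterate over it with in and convert each table to text. Did you notice the missing t in ConverToText?
      【Answer 4】: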

      Extract all of the document's tables into a single xls file:

           sub doParseDoc {
      
                 my $msg     = '' ; 
                 my $ret     = 1 ; # assume failure at the beginning ...
      
                 $msg        = 'START --- doParseDoc' ; 
                 $objLogger->LogDebugMsg( $msg );
                 $msg        = 'using the following DocFile: "' . $DocFile . '"' ; 
                 $objLogger->LogInfoMsg( $msg );
                 #-----------------------------------------------------------------------
                 #Using OLE + OLE constants for Variants and OLE enumeration for Enumerations
      
      
                 # Create a new Excel workbook
                 my $objWorkBook = Spreadsheet::WriteExcel->new("$DocFile" . '.xls');
      
                 # Add a worksheet
                 my $objWorkSheet = $objWorkBook->add_worksheet();
      
      
                 # VT_BOOL is exported by Win32::OLE::Variant; pass 1 for true
                 my $var1 = Win32::OLE::Variant->new(VT_BOOL, 1);
      
                 Win32::OLE->Option(Warn => \&Carp::croak);
      
                 # at this point you should have the Word application open in the UI with
                 # the DocFile
                 # build the MS Word object at run time
                 # (|| rather than "or", so that the fallback is actually assigned)
                 my $objMSWord = Win32::OLE->GetActiveObject('Word.Application')
                                   || Win32::OLE->new('Word.Application', 'Quit');  
      
                 # build the doc object during run-time 
                 my $objDoc   = $objMSWord->Documents->Open($DocFile)
                       or die "Could not open ", $DocFile, " Error:", Win32::OLE->LastError();
      
                 #Set the screen to Visible, so that you can see what is going on
                 $objMSWord->{'Visible'} = 1;
                 # try NOT printing directly to the file
      
      
                  #$objMSWord->ActiveDocument->SaveAs({Filename => 'AlteredTest.docx', 
                                              #FileFormat => wdFormatDocument});
      
                 my $tables        = $objMSWord->ActiveDocument->Tables();
                 my $tableText     = '' ;   
                 my $xlsRow        = 1 ; 
      
                 for my $table (in $tables){
                    # extract the table text as a single string
                    #$tableText = $table->ConvertToText({ Separator => 'wdSeparateByTabs' });
                    # cheated those properties from here: 
                    # https://msdn.microsoft.com/en-us/library/aa537149(v=office.11).aspx#officewordautomatingtablesdata_populateatablewithdata
                    my $RowsCount = $table->{'Rows'}->{'Count'} ; 
                    my $ColsCount = $table->{'Columns'}->{'Count'} ; 
      
                    # discard tables that do not have exactly 5 columns
                    next unless ( $ColsCount == 5 ) ;
      
                    $msg           = "Rows Count: $RowsCount " ; 
                    $msg           .= "Cols Count: $ColsCount " ; 
                    $objLogger->LogDebugMsg ( $msg ) ; 
      
                    #my $tableRange = $table->ConvertToText({ Separator => '##' });
                    # OBS !!! simple print WILL print to your doc file use Select ?!
                    #$objLogger->LogDebugMsg ( $tableRange . "\n" );
                    # Word rows and columns are 1-based; start $row at 2 to skip a header row
                    foreach my $row ( 1..$RowsCount ) {
                       foreach my $col ( 1..$ColsCount ) {
      
                          # nope ... $table->cell($row,$col)->->{'WrapText'} = 1 ; 
                          # nope $table->cell($row,$col)->{'WordWrap'} = 1  ;
                          # so so $table->cell($row,$col)->WordWrap() ; 
      
                          my $txt = ''; 
                          # well some 1% of the values are so nasty that we really give up on them ... 
                          eval {
                             $txt = $table->cell($row,$col)->range->{'Text'}; 
                             #replace all the ctrl chars by space
                             $txt =~ s/\r/ /g   ; 
                             $txt =~ s/[^\040-\176]/ /g  ; 
                             # perform some cleansing - ColName<primary key>=> ColName
                             #$txt =~ s#^(.[a-zA-Z_0-9]*)(\<.*)#$1#g ; 
      
                             # this will most probably break your cmd ... 
                             # $objLogger->LogDebugMsg ( "row: $row , col: $col with txt: $txt \n" ) ; 
                             1 ; # make the eval return true on success
                          } or $txt = 'N/A' ; 
      
                          # Write the cell text; Spreadsheet::WriteExcel columns are 0-based
                          $objWorkSheet->write($xlsRow, $col - 1, $txt);
      
                       } #eof foreach col
      
                       # we just want to dump all the tables into the one sheet
                       $xlsRow++ ; 
                     } #eof foreach row
                     sleep 1 ; 
                 }  #eof foreach table
      
                 # close the opened in the UI document
                 $objMSWord->ActiveDocument->Close;
      
                 # OBS !!! now we are able to print 
                 $objLogger->LogDebugMsg ( $tableText . "\n" );
      
                 # exit the whole Word application
                 $objMSWord->Quit;
      
                 $ret = 0 ; # reached the end without dying, so report success
                 return ( $ret , $msg ) ; 
           }
           #eof sub doParseDoc
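
The cell clean-up inside the eval above can also be pulled out into a small function that is easy to try without Word. This is only a sketch: clean_cell_text is a hypothetical helper, and it assumes that a cell's Range text ends with a carriage return plus a BEL character (\x07), which is how Word normally marks the end of a cell. Unlike the loop above, it removes that marker instead of turning it into spaces, and it collapses each run of non-printable characters into a single space.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy the raw text of one table cell.
sub clean_cell_text {
    my ($txt) = @_;
    return 'N/A' unless defined $txt;     # same fallback value as the loop above
    $txt =~ s/[\r\x07]+\z//;              # drop the trailing end-of-cell marker
    $txt =~ s/[^\x20-\x7E]+/ /g;          # other control/non-ASCII runs become one space
    $txt =~ s/^\s+|\s+$//g;               # trim leading and trailing whitespace
    return $txt;
}

print clean_cell_text("Fund name\r\x07"), "\n";
print clean_cell_text("line one\rline two\r\x07"), "\n";
```

Note that, like the original filter, this throws away any non-ASCII characters, so it is only suitable when the table content is plain ASCII.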
      

      【Comments】:
