【Question title】: Splitting a large Excel file using pandas
【Posted】: 2021-09-26 04:23:55
【Question】:

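I have tabular data in an xlsx file with many columns. There are gaps between blocks of rows, and I want to split those blocks into separate xlsx files, each named after the leading two characters and serial number in the fourth column.

I also want to drop the 5th column from the output files.

For example: my data has several columns, and I want to split it into two chunks.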

【Comments】:

  • What have you tried so far?
  • I have no idea why this question was upvoted.
  • Also, "pandas" is clearly a Python library and the title says to use pandas, so why the Perl tag?

Tags: python excel pandas


【Solution 1】:

Can you try the following Perl script:

#! /usr/bin/env perl

package Main;
use feature qw(say);
use strict;
use warnings;
use Spreadsheet::ParseXLSX;
use Excel::CloneXLSX::Format qw(translate_xlsx_format);
use Excel::Writer::XLSX;


{
    my $self = Main->new(
        input_file => 'x.xlsx',
        skip_cols => [4],
    );
    $self->scan_input_file();
    $self->write_chunks();
    say "Done";
}

sub close_chunk_file {
    my ( $self ) = @_;

    if ($self->{workbook}) {
        $self->{workbook}->close();
    }
}

sub new {
    my ( $class, %args ) = @_;

    return bless \%args, $class;
}

sub scan_input_file {
    my ( $self ) = @_;

    my $parser = Spreadsheet::ParseXLSX->new;
    my $workbook = $parser->parse($self->{input_file});
    my $worksheet = $workbook->worksheet(0);
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();

    my $name;
    my @chunks;
    $self->save_header_line( $worksheet, $row_min );
    $row_min++;
    for my $row ( $row_min .. $row_max ) {
        my $col0 = 0;
        my $cell0 = $worksheet->get_cell( $row, $col0 );
        if (!$cell0) {
            push @chunks, $name if defined $name;
            $name = undef;
            next;
        }
        my $col3 = 3;
        my $cell3 = $worksheet->get_cell( $row, $col3 );
        if ($cell3) {
            $name = $cell3->unformatted();
            ($name) = $name =~ /^(\S+)/;
        }
    }
    push @chunks, $name if defined $name;
    $self->{chunks} = \@chunks;
}

sub save_header_line {
    my ( $self, $worksheet, $row_min ) = @_;

    my ( $col_min, $col_max ) = $worksheet->col_range();
    my %skip = map { $_ => 1 } @{$self->{skip_cols}};
    my @header;
    my $row0 = 0;
    my %col_map;
    my $new_col = 0;
    for my $col ( $col_min .. $col_max ) {
        next if exists $skip{$col};
        $col_map{$col} = $new_col;
        $new_col++;
        my $cell = $worksheet->get_cell( $row0, $col );
        push @header, $cell;
    }
    $self->{header} = \@header;
    $self->{skip_col} = \%skip;
    $self->{col_map} = \%col_map;
}

sub start_new_chunk {
    my ( $self, $name) = @_;

    say "--> $name.xlsx";
    $self->close_chunk_file();
    $self->{workbook} = Excel::Writer::XLSX->new( "$name.xlsx" );
    $self->{worksheet} = $self->{workbook}->add_worksheet();
    $self->write_header();
}

sub write_cell {
    my ( $self, $row, $col, $cell) = @_;

    my $fmt = $cell->get_format();
    my $fmt_props  = translate_xlsx_format( $fmt );
    my $new_format = $self->{workbook}->add_format(%$fmt_props);
    # Use defined-or so a legitimate 0 is not replaced by an empty string
    my $value = $cell->unformatted() // '';
    $self->{worksheet}->write($row, $self->{col_map}{$col}, $value, $new_format);
}

sub write_chunks {
    my ( $self ) = @_;

    my $parser = Spreadsheet::ParseXLSX->new;
    my $workbook = $parser->parse($self->{input_file});
    my $worksheet = $workbook->worksheet(0);
    my ( $row_min, $row_max ) = $worksheet->row_range();
    my ( $col_min, $col_max ) = $worksheet->col_range();
    my @chunks = @{$self->{chunks}};
    die "No chunks to write\n" if @chunks == 0;
    $self->start_new_chunk(shift @chunks);
    my $chunk_row = 1;  # skip header row
    $row_min++; # skip header row
    ROW: for my $row ( $row_min .. $row_max ) {
        for my $col ( $col_min .. $col_max ) {
            my $cell = $worksheet->get_cell( $row, $col );
            if ( $col == 0 && !$cell) {
                if (@chunks) {
                    $self->start_new_chunk(shift @chunks);
                    $chunk_row = 1;
                    next ROW;
                }
                else {
                    last;
                }
            }
            if ( $cell ) {
                if (!exists $self->{skip_col}{$col}) {
                    $self->write_cell($chunk_row, $col, $cell);
                }
            }
        }
        $chunk_row++;
    }
    $self->close_chunk_file();
}

sub write_header {
    my ( $self ) = @_;

    my $header = $self->{header};
    for my $col ( 0 .. $#$header ) {
        my $cell = $header->[$col];
        next if !$cell;
        my $fmt = $cell->get_format();
        my $fmt_props  = translate_xlsx_format( $fmt );
        my $new_format = $self->{workbook}->add_format(%$fmt_props);
        # Use defined-or so a legitimate 0 is not replaced by an empty string
        my $value = $cell->unformatted() // '';
        my $row0 = 0;
        $self->{worksheet}->write($row0, $col, $value, $new_format);
    }
}
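
Since the question asks about pandas, here is a rough Python sketch of the same chunking idea, not a port of the script above. It assumes a row counts as a gap when its first cell is empty, that the file name is the first whitespace-delimited token of the fourth column (mirroring the /^(\S+)/ match in the Perl code), and that the fifth column is dropped. The names is_blank, split_chunks and split_xlsx, and the chunk_N fallback name, are made up for this illustration. Unlike the Perl script, this does not copy cell formatting.

```python
import math


def is_blank(value):
    """True for None, empty strings, and float NaN (how pandas marks empty cells)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def split_chunks(rows, name_col=3, drop_cols=(4,)):
    """Return a list of (name, chunk_rows); a blank first cell ends a chunk."""
    chunks = []
    name, current = None, []

    def flush():
        if current:
            # Hypothetical fallback name when the name column was empty
            chunks.append((name or f"chunk_{len(chunks) + 1}", list(current)))

    for row in rows:
        if is_blank(row[0]):  # gap row: close the chunk collected so far
            flush()
            name, current = None, []
            continue
        if not is_blank(row[name_col]):
            # Keep only the first whitespace-delimited token as the file name
            name = str(row[name_col]).split()[0]
        current.append([v for i, v in enumerate(row) if i not in drop_cols])
    flush()
    return chunks


def split_xlsx(path, name_col=3, drop_cols=(4,)):
    """Write one <name>.xlsx per chunk (needs pandas and openpyxl installed)."""
    import pandas as pd

    # Assumption: empty spreadsheet rows come back as all-NaN rows
    df = pd.read_excel(path)
    header = [c for i, c in enumerate(df.columns) if i not in drop_cols]
    for name, chunk in split_chunks(df.values.tolist(), name_col, drop_cols):
        pd.DataFrame(chunk, columns=header).to_excel(f"{name}.xlsx", index=False)
```

With a file like the one in the answer, this would be called as split_xlsx("x.xlsx"). The splitting logic lives in split_chunks so it can be checked on plain lists without touching any file.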

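【Comments】:

  • Do you mean how to run the script?
  • OK, first install the modules Spreadsheet::ParseXLSX, Excel::CloneXLSX::Format and Excel::Writer::XLSX, then run the script like perl script.pl. This assumes the file x.xlsx is in the current directory.
  • I will try it and let you know. Thank you very much for writing such a long script for me @Hakon Haegland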