4 번째 수정본 소스 보기 : Perl/FileCache

내용출력
로그인[l]
수정할 수 없습니다: Perl/FileCache 는 읽기 전용 페이지입니다.
= FileCache 코어 모듈 =

<toc>

Perldoc:FileCache

프로그램이 실행될 때 동시에 열 수 있는 파일의 갯수에 제한이 있는데, FileCache 모듈을 쓰면 자기가 알아서 오랫동안 사용되지 않은 핸들을 닫고, 필요할 때 다시 열 수 있게 해 준다.


== # 테스트 기록 ==

[http://cafe.naver.com/perlstudy 펄스터디] 까페에 한군님이 대용량 파일의 내용 정렬 질문글을 올리셨던 것과 비슷한 상황을 만들어서 해 봤음.

=== # 테스트 상황 ===

다음과 같은 파일이 있다:
{{{
키      일련번호
0       1
0       2
0       3
1       1
2       1
2       2
...
}}}
첫번째 컬럼에 있는 숫자를 키라고 하고, 각 키마다 다시 일련번호가 붙는다. 각 키가 파일 안에 나타나는 횟수가 다 다르다. 이것을 키가 가장 많이 나타나는 순으로 정렬하고 싶다.

{{{
정렬 후:
0       1
0       2
0       3
2       1
2       2
1       1
}}}

일단 이런 형태의 데이타를 생성하기 위해서 Cpan:Math::Random 모듈의 random_exponential 함수를 사용했다. 결과적으로 다음과 같이 생성되었다.
* 키값이 작은 쪽(0,1,2,...)은 빈도가 3000여 번 나옴
* 키값이 커질 수록 빈도수가 작아지다가 끝에 가면 1번만 나오는 키들이 매우 많음
* 키의 총 가짓수는 8573개
* 전체 라인 수는 3백만개
* 입력 파일 전체 크기는 약 25MB
원래 한군님이 정렬해야 될 데이타는 키의 가짓수는 6천여개였고 라인 수가 더 길었는데 (7백만 정도?), 시간이 너무 걸려서 좀 줄였다.

그리고 원본 파일의 구성을 두 가지로 준비했다
* seq 파일은 위에 예로 든 것과 같이, 하나의 키에 해당하는 라인들이 모두 모여있음. 즉 어떤 키에 해당하는 라인들이 죽 이어서 나오고 그 이후로는 다시는 나오지 않음
* rnd 파일은 라인들이 무작위로 섞여 있음.

실행한 환경은
* 리눅스 커널 2.6
* Perl 5.14.2
* 메모리 2GB
* CPU Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz

테스트에걸리는 시간은 코드 내에서 직접 측정해서 출력하게 했고, 메모리 사용량은 옆에서 top을 띄워둔 후 virtual image 크기를 대충 눈으로 봤음.

<memo(테스트에 사용한 전체 코드: 클릭하면 펼쳐짐)>
{{{#!vim perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use FileCache maxopen => 128;
use Time::HiRes qw/gettimeofday tv_interval/;

if ( not defined $ARGV[0] or not grep { $ARGV[0] == $_ } (11,12,21,22,31,32,41,42,51,52) ) {
    print <<EOF;
arg:
11 - in-memory sort. sequantial
12 - in-memory sort. random

21 - open-close in each iteration. sequatial
22 - open-close in each iteration. random

31 - open-close once. maybe die. sequantial
32 - open-close once. maybe die. random

41 - use cacheout. sequantial
42 - use cacheout. random

51 - use cacheout. sequantial
52 - use cacheout. maybe error. random
EOF
    exit 1;
}

my $input_file = ( $ARGV[0] % 10 == 1 ) ? 'input_seq.txt' : 'input_rnd.txt';
my $output_postfix = ( $ARGV[0] % 10 == 1 ) ? 'seq.txt' : 'rnd.txt';
my $method = int($ARGV[0] / 10);

if ( $method == 1 ) {       # in-memory sort
    print "method 1 - in-memory sort. [$input_file]\n";
    my $t0 = [ gettimeofday ];

    open my $in, '<', $input_file;
    my @array = <$in>;
    close $in;
    print " laptime time after loading file : ", tv_interval($t0), "\n";

    my %freq = ();
    foreach my $line ( @array ) {
        my $key = (split /\s+/, $line)[0];
        $freq{$key}++;
    }

    print " laptime time after counting : ", tv_interval($t0), "\n";

    my @sorted =
        map { $_->[1] }
        sort { $freq{$b->[0]} <=> $freq{$a->[0]} or $a->[0] <=> $b->[0] }
        map { [ (split /\s+/, $_)[0], $_ ] }
        @array;
    print " laptime time after sorting : ", tv_interval($t0), "\n";

    my $output_file = "sorted_1_$output_postfix";
    open my $out, ">", $output_file;
    foreach ( @sorted ) {
        print {$out} $_;
    }
    close $out;
    print "method 1 - save [$output_file]\n";
    print " total elapsed time : ", tv_interval($t0), "\n";

}
elsif ( $method == 2 ) {    # open-close in each iteration
    print "method 2 - open-close in each iteration. [$input_file]\n";
    my $t0 = [ gettimeofday ];

    my %freq = ();
    open my $in, '<', $input_file;
    while ( my $line = <$in> ) {
        my $key = (split /\s+/, $line)[0];
        $freq{$key}++;

        open my $tmp, '>>', $key;
        print {$tmp} $line;
        close $tmp;
    }
    close $in;
    print " laptime time after loading, counting, splitting : ", tv_interval($t0), "\n";

    my $output_file = "sorted_2_$output_postfix";
    open my $out, ">", $output_file;
    foreach my $key ( sort { $freq{$b} <=> $freq{$a} or $a <=> $b } keys %freq ) {
        open my $tmp, '<', $key;
        while ( my $line = <$tmp> ) {
            print {$out} $line;
        }
        close $tmp;
        unlink $key;
    }
    close $out;

    print "method 2 - save [$output_file]\n";
    print " total elapsed time : ", tv_interval($t0), "\n";
}
elsif ( $method == 3 ) {    # open-close once
    print "method 3 - open-close once. [$input_file]\n";
    my $t0 = [ gettimeofday ];

    my %freq = ();
    my %fh = ();
    open my $in, '<', $input_file;
    while ( my $line = <$in> ) {
        my $key = (split /\s+/, $line)[0];
        $freq{$key}++;

        if ( not exists $fh{$key} ) {
            open $fh{$key}, '>', $key;
        }
        print {$fh{$key}} $line;
    }
    close $in;
    close $_ foreach ( values %fh );
    print " laptime time after loading, counting, splitting : ", tv_interval($t0), "\n";

    my $output_file = "sorted_3_$output_postfix";
    open my $out, ">", $output_file;
    foreach my $key ( sort { $freq{$b} <=> $freq{$a} or $a <=> $b } keys %freq ) {
        open my $tmp, '<', $key;
        while ( my $line = <$tmp> ) {
            print {$out} $line;
        }
        close $tmp;
        unlink $key;
    }
    close $out;

    print "method 3 - save [$output_file]\n";
    print " total elapsed time : ", tv_interval($t0), "\n";
}
elsif ( $method == 4 ) {                      # cacheout
    no strict 'refs';
    use IO::Handle;
    print "method 4 - use cacheout. [$input_file]\n";
    my $t0 = [ gettimeofday ];

    my %freq = ();
    my %fh = ();
    open my $in, '<', $input_file;
    while ( my $line = <$in> ) {
        my $key = (split /\s+/, $line)[0];
        $freq{$key}++;

        $fh{$key} = cacheout "$key-t";
#         $fh{$key}->autoflush(1);
        print {$fh{$key}} $line;
        $fh{$key}->flush();
    }
    close $in;
    print " laptime time after loading, counting, splitting : ", tv_interval($t0), "\n";

#     if ( 1 ) { print "check file size!\n"; <STDIN>; }
    my $output_file = "sorted_4_$output_postfix";
    open my $out, ">", $output_file;
    foreach my $key ( sort { $freq{$b} <=> $freq{$a} or $a <=> $b } keys %freq ) {
        open my $tmp, '<', "$key-t";
        while ( my $line = <$tmp> ) {
            print {$out} $line;
        }
        close $tmp;
        unlink "$key-t";
    }
    close $out;

    print "method 4 - save [$output_file]\n";
    print " total elapsed time : ", tv_interval($t0), "\n";
}
elsif ( $method == 5 ) {
    no strict 'refs';
    print "method 5 - use cacheout (maybe error with random sequence). [$input_file]\n";
    my $t0 = [ gettimeofday ];

    my %freq = ();
    my %fh = ();
    open my $in, '<', $input_file;
    while ( my $line = <$in> ) {
        my $key = (split /\s+/, $line)[0];
        $freq{$key}++;

        if ( not exists $fh{$key} ) {
            $fh{$key} = cacheout "$key.txt";
        }
        print {$fh{$key}} $line;
        $fh{$key}->flush();
    }
    close $in;
    print " laptime time after loading, counting, splitting : ", tv_interval($t0), "\n";

    my $output_file = "sorted_5_$output_postfix";
    open my $out, ">", $output_file;
    foreach my $key ( sort { $freq{$b} <=> $freq{$a} or $a <=> $b } keys %freq ) {
        open my $tmp, '<', "$key.txt";
        while ( my $line = <$tmp> ) {
            print {$out} $line;
        }
        close $tmp;
        unlink "$key.txt";
    }
    close $out;

    print "method 5 - save [$output_file]\n";
    print " total elapsed time : ", tv_interval($t0), "\n";
}
}}}
</memo>





=== # 메모리에 다 불러와서 정렬 ===

일단 파일 전체를 메모리에 읽어들인 후 정렬하여 파일로 기록하는 코드:
{{{#!vim perl
    # 통채로 읽고
    my @array = <$in>;

    # 각 키의 빈도수를 먼저 카운트한 후
    my %freq = ();
    foreach my $line ( @array ) {
        my $key = (split /\s+/, $line)[0];
        $freq{$key}++;
    }

    # 정렬하고
    my @sorted =
        map { $_->[1] }
        sort { $freq{$b->[0]} <=> $freq{$a->[0]} or $a->[0] <=> $b->[0] }
        map { [ (split /\s+/, $_)[0], $_ ] }
        @array;

    # 파일에 저장
}}}

실행결과는 다음과 같음
* 순차 파일
{{{#!vim
[gypark@www perlstudy_1473]$ ./sort.pl 11
method 1 - in-memory sort. [input_seq.txt]
 laptime time after loading file : 1.730617
 laptime time after counting : 4.001903
 laptime time after sorting : 13.373383
method 1 - save [sorted_1_seq.txt]
 total elapsed time : 14.21391

사용메모리: 약 745M
}}}
* 랜덤 파일
{{{#!vim
[gypark@www perlstudy_1473]$ ./sort.pl 12
method 1 - in-memory sort. [input_rnd.txt]
 laptime time after loading file : 1.725655
 laptime time after counting : 4.212954
 laptime time after sorting : 40.480208
method 1 - save [sorted_1_rnd.txt]
 total elapsed time : 41.319748

사용메모리: 약 820M
}}}

파일을 읽어 배열에 넣는데 1.7초, 그 배열원소 3백만개를 순회하며 키를 뽑아내어 빈도수를 재는 데 2.3초 정도 걸리는 건 비슷하나, 그 다음 정렬할 때 차이가 많이 난다. 순차 파일의 경우 같은 키에 해당하는 원소들이 이미 정렬된 채로 모여 있어서 그런 듯.

(여담으로, 위 테스트를 처음에는 윈도우7이 깔린 컴퓨터에서 하려고 했는데, 파일을 읽는 데에만 십수분이 걸려도 끝나지 않았다 -_-;; [[/StringConcatenationBenchmark]]에서도 얘기가 나왔었지만 윈도우에서 메모리를 realloc 으로 계속 늘려나갈 때 문제가 있어서 그런 게 아닐까 짐작함)

문제는 메모리 사용량이다. 25MB 짜리 텍스트 파일을 읽어 배열에 저장할 때 400MB 이상을 쓰게되고, 정렬을 위해서 배열이 하나 복사되면서 두 배가 된다. 라인 수를 7백만개로 늘렸을 경우 내 테스트 환경에서는 순차 파일은 2GB를 간당간당하게 쓰면서 40초 정도 끝났는데, 랜덤 파일은 디스크에 스왑을 하게되는 듯 끝날 기미가 안 보여서 포기했음.



=== # 키 별로 개별 파일에 저장 후 병합 ===

메모리가 부족해서 통채로 읽을 수 없으니, 할 수 없이 디스크를 쓰기로 한다.

기본 컨셉은, 하나의 키에 해당하는 라인들만 뽑아내어 파일 하나에 저장하는 식으로 분리를 한 후, 빈도수가 높은 (즉 라인수가 많은) 키의 파일들부터 읽어서 하나의 파일에 기록하는 식.


{{{#!vim perl
    my %freq = ();
    open my $in, '<', $input_file;
    while ( my $line = <$in> ) {
        my $key = (split /\s+/, $line)[0];
        $freq{$key}++;

# 한 라인을 읽을 때마다 매번 그 라인에 해당하는 키 파일을 열고 거기에 덧붙인 후 닫음
        open my $tmp, '>>', $key;
        print {$tmp} $line;
        close $tmp;
    }
    close $in;

    open my $out, ">", $output_file;
    foreach my $key ( sort { $freq{$b} <=> $freq{$a} or $a <=> $b } keys %freq ) {
# 정렬한 키 순서대로 키 파일을 읽어서 최종 결과물에 덧붙임
        open my $tmp, '<', $key;
        while ( my $line = <$tmp> ) {
            print {$out} $line;
        }
        close $tmp;
        unlink $key;
    }
    close $out;
}}}

실행 결과:
{{{#!vim
[gypark@www perlstudy_1473]$ ./sort.pl 21
method 2 - open-close in each iteration. [input_seq.txt]
 laptime time after loading, counting, splitting : 132.328214
method 2 - save [sorted_2_seq.txt]
 total elapsed time : 134.268979

사용메모리: 약 10M
}}}

{{{#!vim
[gypark@www perlstudy_1473]$ ./sort.pl 22
method 2 - open-close in each iteration. [input_rnd.txt]
 laptime time after loading, counting, splitting : 1081.486577
method 2 - save [sorted_2_rnd.txt]
 total elapsed time : 1089.66655

사용메모리: 약 10M
}}}

메모리는, 쓸래야 쓸 일이 없다. 항상 한 라인씩 처리하기 때문에 10MB를 넘지 않는다.

순차 파일과 랜덤 파일 둘 다 정렬하고 최종 파일에 기록하는 건 몇 초 걸리지 않는데, 입력을 읽으며 키 파일들을 만드는 데 시간의 차이가 크게 난다. 어차피 3백만 번 오픈,클로즈를 꼬박꼬박 하는 건 마찬가지인데 이렇게 차이가 나는 건, 순차 파일의 경우는 한번 열었던 파일을 다시 열게 되니까 이 때는 버퍼 캐시 등의 도움으로 오버헤드가 줄어들기 때문으로 생각됨.




=== # open 횟수를 줄이려는 시도 ===


=== # FileCache 모듈 사용 ===
== # Comments ==



<div class="comments">
<longcomments(100)>
</div>
----
[[기타분류]]
Perl/FileCache 페이지로 돌아가기 | 다른 수정본 보기