Perl/StringConcatenationBenchmark 페이지의 소스 보기

내용출력
로그인[l]
수정할 수 없습니다: Perl/StringConcatenationBenchmark 는 읽기 전용 페이지입니다.
= [[Perl]]의 String Concatenation 속도 측정 =

<toc>

== # 개요 ==
* 10글자짜리 스트링을 10만번 이상 "."연산자를 써서 이어붙여서 긴 하나의 스트링을 생성하는 테스트
* [[Diary/Perl대Java]]에서, 100만 글자를 넘어가면서부터 "."연산의 수행이 상당히 느려지는 것을 발견하였는데, 이게 윈도우에서만 느려지고 리눅스에서는 괜찮더라. 그래서 좀 더 확실히 알아보고자 Perldoc:Benchmark 모듈을 사용해서 측정해 보았다.

== # 실험 환경 ==
* "straw" - MS윈도우 + Strawberry Perl[http://strawberryperl.com/]
** AMD Athlon 64 X2 Dual 5600+ 2.9GHz
** 2.00GB RAM
** MS Windows SP2
** Strawberry Perl 5.10.0

* "linux" - Linux + Perl
** Intel(R) Core2 Duo CPU E4600 2.4GHz
** 2.00GB RAM
** CentOS release 5.2
** Kernel 2.6.18-92.1.6.el5

* "active" - MS윈도우 + Activestate Perl[http://www.activestate.com/]
** "straw"와 동일한 머신
** Activestate perl 5.10


== # 테스트 코드 ==

기본적으로 <code>"1234567890"</code> 스트링을 반복연결하여, 10글자, 100글자, ..., 천만 글자 길이의 스트링을 생성한다.
* $mul 값을 1, 10, 100, ... , 1,000,000 까지 증가시키면서
** 한 번은 <code>$long_str = $str x $mul</code> 를 써서 한번에 해당 길이의 스트링을 생성하고
** 또 한 번은 for 루프를 $mul 값만큼의 횟수를 반복하면서 <code>$long_str .= $str</code>를 수행하여 생성한다.

{{{#!vim perl
#!/usr/bin/perl
use strict;
use warnings;
use integer;
use Benchmark qw(:all);

my $str = "1234567890";     # 10글자 스트링

for (my $mul = 1; $mul <= 1_000_000; $mul *= 10) {
    print "==== mul = $mul ====\n";

    my $COUNT = 30_000_000 / $mul;

    my $results =
        timethese( $COUNT, {
            'x' =>
                sub {
                    my $long_str;
                    $long_str = $str x $mul;             # "x" 연산자를 사용하여 한 번에 원하는 길이의 스트링 생성
                },
            'dot' =>
                sub {
                    my $long_str;
                    for ( my $i = 1; $i <= $mul; $i++ ) {
                        $long_str .= $str;               # "." 연산자를 사용하여 10글자짜리 스트링을 계속 이어서 붙임
                    }
                },
        } );
    cmpthese($results);
}
}}}


== # 테스트 결과 ==

=== # straw ===
{{{
==== mul = 1 ====
Benchmark: timing 30000000 iterations of dot, x...
       dot: 17 wallclock secs (16.00 usr +  0.00 sys = 16.00 CPU) @ 1875117.19/s (n=30000000)
         x:  6 wallclock secs ( 6.02 usr +  0.00 sys =  6.02 CPU) @ 4986702.13/s (n=30000000)
         Rate  dot    x
dot 1875117/s   -- -62%
x   4986702/s 166%   --
==== mul = 10 ====
Benchmark: timing 3000000 iterations of dot, x...
       dot:  6 wallclock secs ( 6.72 usr +  0.00 sys =  6.72 CPU) @ 446495.01/s (n=3000000)
         x:  1 wallclock secs ( 1.20 usr +  0.00 sys =  1.20 CPU) @ 2493765.59/s (n=3000000)
         Rate  dot    x
dot  446495/s   -- -82%
x   2493766/s 459%   --
==== mul = 100 ====
Benchmark: timing 300000 iterations of dot, x...
       dot:  6 wallclock secs ( 5.86 usr +  0.00 sys =  5.86 CPU) @ 51194.54/s (n=300000)
         x:  1 wallclock secs ( 0.59 usr +  0.00 sys =  0.59 CPU) @ 505050.51/s (n=300000)
        Rate  dot    x
dot  51195/s   -- -90%
x   505051/s 887%   --
==== mul = 1000 ====
Benchmark: timing 30000 iterations of dot, x...
       dot:  6 wallclock secs ( 5.69 usr +  0.00 sys =  5.69 CPU) @ 5275.19/s (n=30000)
         x:  0 wallclock secs ( 0.53 usr +  0.00 sys =  0.53 CPU) @ 56497.18/s (n=30000)
       Rate  dot    x
dot  5275/s   -- -91%
x   56497/s 971%   --
==== mul = 10000 ====
Benchmark: timing 3000 iterations of dot, x...
       dot:  6 wallclock secs ( 5.72 usr +  0.00 sys =  5.72 CPU) @ 524.57/s (n=3000)
         x:  1 wallclock secs ( 0.58 usr +  0.00 sys =  0.58 CPU) @ 5190.31/s (n=3000)
      Rate  dot    x
dot  525/s   -- -90%
x   5190/s 889%   --
==== mul = 100000 ====
Benchmark: timing 300 iterations of dot, x...
       dot:  6 wallclock secs ( 5.89 usr +  0.02 sys =  5.91 CPU) @ 50.79/s (n=300)
         x:  0 wallclock secs ( 0.75 usr +  0.00 sys =  0.75 CPU) @ 400.00/s (n=300)
            (warning: too few iterations for a reliable count)
      Rate  dot    x
dot 50.8/s   -- -87%
x    400/s 688%   --
==== mul = 1000000 ====
Benchmark: timing 30 iterations of dot, x...
       dot: 21 wallclock secs (16.33 usr +  3.83 sys = 20.16 CPU) @ 1.49/s (n=30)
         x:  0 wallclock secs ( 0.73 usr +  0.00 sys =  0.73 CPU) @ 40.87/s (n=30)
            (warning: too few iterations for a reliable count)
      Rate   dot     x
dot 1.49/s    --  -96%
x   40.9/s 2646%    --
}}}

"@ 1.49/s" 형태로 적혀 있는 것은, 해당 서브루틴을 초당 1.49번 수행했다는 얘기다. mul값이 100에서 100,000 사이일 때 "dot"과 "x"의 초당 수행 횟수를 비교해보면 대략 9~10배 정도 차이가 난다. '''그런데 mul값이 백만일 때를 보면 (제일 마지막 단락), 1.49/s와 40.87/s로 27배 차이가 나고 있다.''' "."연산을 백만 번 반복하는데 20초 정도나 걸린다.


=== # linux ===
{{{
==== mul = 1 ====
Benchmark: timing 30000000 iterations of dot, x...
       dot: 16 wallclock secs (17.01 usr +  0.00 sys = 17.01 CPU) @ 1763668.43/s (n=30000000)
         x:  9 wallclock secs ( 8.16 usr +  0.00 sys =  8.16 CPU) @ 3676470.59/s (n=30000000)
         Rate  dot    x
dot 1763668/s   -- -52%
x   3676471/s 108%   --
==== mul = 10 ====
Benchmark: timing 3000000 iterations of dot, x...
       dot:  9 wallclock secs ( 7.26 usr +  0.00 sys =  7.26 CPU) @ 413223.14/s (n=3000000)
         x:  1 wallclock secs ( 1.54 usr +  0.00 sys =  1.54 CPU) @ 1948051.95/s (n=3000000)
         Rate  dot    x
dot  413223/s   -- -79%
x   1948052/s 371%   --
==== mul = 100 ====
Benchmark: timing 300000 iterations of dot, x...
       dot:  7 wallclock secs ( 6.30 usr +  0.00 sys =  6.30 CPU) @ 47619.05/s (n=300000)
         x:  1 wallclock secs ( 0.92 usr +  0.00 sys =  0.92 CPU) @ 326086.96/s (n=300000)
        Rate  dot    x
dot  47619/s   -- -85%
x   326087/s 585%   --
==== mul = 1000 ====
Benchmark: timing 30000 iterations of dot, x...
       dot:  6 wallclock secs ( 6.16 usr +  0.00 sys =  6.16 CPU) @ 4870.13/s (n=30000)
         x:  1 wallclock secs ( 0.85 usr +  0.00 sys =  0.85 CPU) @ 35294.12/s (n=30000)
       Rate  dot    x
dot  4870/s   -- -86%
x   35294/s 625%   --
==== mul = 10000 ====
Benchmark: timing 3000 iterations of dot, x...
       dot:  6 wallclock secs ( 6.13 usr +  0.00 sys =  6.13 CPU) @ 489.40/s (n=3000)
         x:  1 wallclock secs ( 1.00 usr +  0.00 sys =  1.00 CPU) @ 3000.00/s (n=3000)
      Rate  dot    x
dot  489/s   -- -84%
x   3000/s 513%   --
==== mul = 100000 ====
Benchmark: timing 300 iterations of dot, x...
       dot:  6 wallclock secs ( 6.12 usr +  0.00 sys =  6.12 CPU) @ 49.02/s (n=300)
         x:  1 wallclock secs ( 1.03 usr +  0.01 sys =  1.04 CPU) @ 288.46/s (n=300)
      Rate  dot    x
dot 49.0/s   -- -83%
x    288/s 488%   --
==== mul = 1000000 ====
Benchmark: timing 30 iterations of dot, x...
       dot:  7 wallclock secs ( 6.18 usr +  0.01 sys =  6.19 CPU) @  4.85/s (n=30)
         x:  1 wallclock secs ( 1.09 usr +  0.02 sys =  1.11 CPU) @ 27.03/s (n=30)
      Rate  dot    x
dot 4.85/s   -- -82%
x   27.0/s 458%   --
}}}

'''리눅스에서는 백만번 수행할 때도 다른 때와 같이 6배 정도밖에 차이가 나지 않는다.'''


=== # 그래프 ===
위 두 결과를 차트로 만들어 보았다.

Upload:bench_01.png

x(windows)와 dot(windows) 라인 사이의 간격이, mul값이 십만을 넘으면서 점점 벌어지는 것을 볼 수 있다. 그 두 값의 비율인 ratio 그래프를 보면, 결과적으로 ratio가 증가하고 있는데, 그래프의 y축이 로그 스케일인 걸 감안하면 상당히 급격히 늘어나는 것이다. 
=== # mul값을 10만~100만 사이에서 변경 ===
mul값 10만과 100만 사이에서 급격히 차이가 벌어지는 곳을 발견할 수 있을까 해서, 값을 5만씩 증가시켜가면서 다시 측정해보았다. 표는 생략하고 그래프만 보이면

Upload:bench_02.png

딱히 눈에 띄는 포인트는 없고, 10만번일 때 이미 벌어지기 시작했던 것으로 보인다.


=== # active - Activestate perl의 경우 ===

"straw" 환경에서 백만번 반복할 때 속도가 현저하게 떨어진 것이, Strawberry Perl의 perl 인터프리터 구현의 문제일까 싶어서, Activestate사의 배포본을 다시 설치해서 테스트해보았는데, '''Strawberry Perl과 유사한 결과를 보였다''' 따라서 Perl 배포본의 문제는 아닌 것으로 보임.

<memo(결과 로그만 첨부함 - 클릭)>
{{{
==== mul = 1 ====
Benchmark: timing 30000000 iterations of dot, x...
       dot: 16 wallclock secs (16.23 usr +  0.00 sys = 16.23 CPU) @ 1847973.39/s (n=30000000)
         x:  5 wallclock secs ( 6.50 usr +  0.00 sys =  6.50 CPU) @ 4614674.67/s (n=30000000)
         Rate  dot    x
dot 1847973/s   -- -60%
x   4614675/s 150%   --
==== mul = 10 ====
Benchmark: timing 3000000 iterations of dot, x...
       dot:  7 wallclock secs ( 6.84 usr +  0.00 sys =  6.84 CPU) @ 438340.15/s (n=3000000)
         x:  1 wallclock secs ( 1.20 usr +  0.00 sys =  1.20 CPU) @ 2491694.35/s (n=3000000)
         Rate  dot    x
dot  438340/s   -- -82%
x   2491694/s 468%   --
==== mul = 100 ====
Benchmark: timing 300000 iterations of dot, x...
       dot:  6 wallclock secs ( 5.86 usr +  0.00 sys =  5.86 CPU) @ 51203.28/s (n=300000)
         x: -1 wallclock secs ( 0.56 usr +  0.00 sys =  0.56 CPU) @ 532859.68/s (n=300000)
        Rate  dot    x
dot  51203/s   -- -90%
x   532860/s 941%   --
==== mul = 1000 ====
Benchmark: timing 30000 iterations of dot, x...
       dot:  6 wallclock secs ( 5.81 usr +  0.00 sys =  5.81 CPU) @ 5161.73/s (n=30000)
         x:  1 wallclock secs ( 0.50 usr +  0.00 sys =  0.50 CPU) @ 60000.00/s (n=30000)
       Rate   dot     x
dot  5162/s    --  -91%
x   60000/s 1062%    --
==== mul = 10000 ====
Benchmark: timing 3000 iterations of dot, x...
       dot:  6 wallclock secs ( 5.84 usr +  0.00 sys =  5.84 CPU) @ 513.43/s (n=3000)
         x:  0 wallclock secs ( 0.55 usr +  0.00 sys =  0.55 CPU) @ 5494.51/s (n=3000)
      Rate  dot    x
dot  513/s   -- -91%
x   5495/s 970%   --
==== mul = 100000 ====
Benchmark: timing 300 iterations of dot, x...
       dot:  6 wallclock secs ( 5.95 usr +  0.02 sys =  5.97 CPU) @ 50.26/s (n=300)
         x:  1 wallclock secs ( 0.70 usr +  0.00 sys =  0.70 CPU) @ 426.74/s (n=300)
      Rate  dot    x
dot 50.3/s   -- -88%
x    427/s 749%   --
==== mul = 1000000 ====
Benchmark: timing 30 iterations of dot, x...
       dot: 20 wallclock secs (16.11 usr +  4.14 sys = 20.25 CPU) @  1.48/s (n=30)
         x:  1 wallclock secs ( 0.70 usr +  0.00 sys =  0.70 CPU) @ 42.67/s (n=30)
      Rate   dot     x
dot 1.48/s    --  -97%
x   42.7/s 2781%    --
}}}
</memo>



== # 그 외 ==

딱히 딸기 펄만의 문제가 아니라면 뭐가 문제일까?

* 윈도우가 깔린 내 컴퓨터의 메모리의 용량 문제
** 천만 글자짜리 스트링이라 해도 크기는 10MB이다. 펄에서 스칼라 변수를 관리할 때 추가 정보를 위한 공간이 더 필요하겠지만, 그걸 고려해도 설마 두 배 이상 되진 않을테니 20MB면 된다는 얘기. 실제로 process explorer로 살펴보면 프로세스의 최대 virtual size는 30MB 이내였다. ("x"를 쓰든 "."을 쓰든) 시스템의 램이 2GB나 되는 상태에서 딱히 메모리 스왑을 하진 않았을 것이다.

그 외 생각할 수 있는 것은, 위 테스트에서 차이가 나는 부분을 생각하면

* 펄5.8에서 5.10으로 바뀔 때 해당 루틴에 문제가 있음 - 설마;;
* 딸기 펄과 액티브 펄이 공통적으로 해당 루틴에 문제가 있음
* 내 컴퓨터 하드웨어(AMD CPU와 메인보드와 메모리 등) 시스템에서 메모리를 관리하는 곳에 문제가 있음
** aero님의 실험[http://aero.sarang.net/blog/2009/04/perl-9.html]에서는 Intel 플랫폼에서도 같은 현상이 생겼으니 이것도 아닌 듯
* MS윈도우에서 메모리를 할당받을 때 문제가 있음
** 매우 혐의가 높음. aero님 글[http://aero.sarang.net/blog/2009/04/perl-9.html]과 "realloc() on Windows vs. Linux"[http://blog.kowalczyk.info/article/realloc-on-Windows-vs-Linux-1.html] 참고

등등이 있을 법 한데, 이건 다른 상황에서 테스트한 결과를 얻을 때마다 소거해 나갈 수 있을 것이다.



== # (문서 업데이트) ==

aero님의 실험[http://aero.sarang.net/blog/2009/04/perl-9.html]에서 다음과 같이 확인됨
* "."연산 루프 10,000번 돌 때마다 걸리는 시간을 측정해보니 점점 더 증가한다.
* 미리 천만 글자 분량의 메모리를 확보한 후 수행할 때는 소요 시간이 일정하다.
* [http://blog.kowalczyk.info/article/realloc-on-Windows-vs-Linux-1.html]에서 언급하는 것처럼, 윈도우에서 realloc()으로 메모리를 추가 확보할 때 성능에 문제가 있는 것 같다


== # 결론 ==

* "."연산을 십만번 이상 반복하면서 스트링을 이어붙여나갈때, 윈도우에서는 성능이 급격히 떨어진다.
* 그 이유는, 아마도 MS윈도우에 구현된 realloc()의 문제로 생각됨
* '''긴 스트링이 필요하면 "."으로 이어 붙이는 루프를 반복하지 말고 "x"로 한 번에 늘리자. :-) 언제나 이게 빠르다'''


== # 참고 ==
* [http://thebasis.tistory.com/89 The basis :: Perl vs Java 간단한 정규표현식 문자열 치환 성능]
* [http://aero.sarang.net/blog/2009/04/java-vs-perl.html Java VS Perl - aero's Blog]
* [[Diary/Perl대Java]]
* [http://aero.sarang.net/blog/2009/04/perl-9.html Perl의 문자열 병합 성능 - aero's Blog] [http://aero.sarang.net/blog/2009/04/perl-9.html]
* [http://blog.kowalczyk.info/article/realloc-on-Windows-vs-Linux-1.html realloc() on Windows vs. Linux] [http://blog.kowalczyk.info/article/realloc-on-Windows-vs-Linux-1.html]

== # Comments ==
<div class="comments">

<longcomments(100)>
----

</div>
----
[[컴퓨터분류]]
Perl/StringConcatenationBenchmark 페이지로 돌아가기 | 다른 수정본 보기