[PATCH] highlight: do not use guess_lexer functions. they use too much CPU time for certain inputs

Ralf Schmitt schmir at gmail.com
Wed Apr 2 16:53:44 CDT 2008


On Wed, Apr 2, 2008 at 10:45 PM, Matt Mackall <mpm at selenic.com> wrote:

>
> On Wed, 2008-04-02 at 21:59 +0200, Ralf Schmitt wrote:
> > # HG changeset patch
> > # User ralf at brainbot.com
> > # Date 1207165818 -7200
> > # Node ID 50015149baa0dbf1b7066f0356b65f492ed78450
> > # Parent  101526031d06d184559ae797687e50661b96156e
> > highlight: do not use guess_lexer functions. they use too much CPU time
> > for certain inputs.
>
> Does certain input mean big inputs? Can we send some truncated source to
> the guesser instead?


I reported this some time ago:
http://selenic.com/pipermail/mercurial/2008-March/018029.html
The file where this happened for me is a PHP file with around 2000 lines
(140 KB).

I wrote a short script to measure the time it takes to run
guess_lexer_for_filename on truncated input:
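import time
from pygments.lexers import guess_lexer_for_filename

# Read the problematic PHP file once.
with open("Collection.i18n.php") as f:
    text = f.read()

size = 512
# Time the guesser on growing prefixes; stop once the whole file is covered.
while size <= len(text):
    stime = time.time()
    for run in range(10):
        guess_lexer_for_filename("collection.i18n.php", text[:size],
                                 encoding="utf-8")
    # Average seconds per call, followed by the prefix size.
    print((time.time() - stime) / 10, size)
    size += 512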


It prints the following values (the first column is the average time per
call in seconds, the second column is the input size in bytes):

0.00721120834351 512
0.00744049549103 1024
0.0433429002762 1536
0.161764788628 2048
0.34955329895 2560
0.627179193497 3072
0.958257818222 3584
1.46866378784 4096
2.11897850037 4608
2.94355890751 5120
3.93533871174 5632
5.09589328766 6144

This is on a 2.4 GHz CPU.
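
As a rough check on how the time scales with input size, the sketch below
fits a power law to the measurements above. It is only an illustration: the
names MEASUREMENTS and fit_exponent are made up for this example, and
ignoring the rows below 2048 bytes (where the slowdown has not yet set in)
is an assumption, not something derived from pygments itself.

```python
import math

# (seconds per call, prefix size in bytes), copied from the output above.
MEASUREMENTS = [
    (0.00721120834351, 512),
    (0.00744049549103, 1024),
    (0.0433429002762, 1536),
    (0.161764788628, 2048),
    (0.34955329895, 2560),
    (0.627179193497, 3072),
    (0.958257818222, 3584),
    (1.46866378784, 4096),
    (2.11897850037, 4608),
    (2.94355890751, 5120),
    (3.93533871174, 5632),
    (5.09589328766, 6144),
]


def fit_exponent(points):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(size) for _, size in points]
    ys = [math.log(seconds) for seconds, _ in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Assumption: only fit the rows where the slowdown is already visible.
tail = [p for p in MEASUREMENTS if p[1] >= 2048]
exponent = fit_exponent(tail)
print("time grows roughly as size ** %.2f" % exponent)
```

On these numbers the fitted exponent comes out near 3, so doubling the
input appears to cost roughly eight times as long. That is a fit to twelve
data points from one file, not a statement about the guesser in general.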

Regards,
- Ralf