Python code to extract specific lines in a textfile


Jo27

Here is the Python code for it, but it doesn't get around the limitation of Python's open(), which just reads the file in as one massive string. P.S. I am going out for the day. Basically I load the massive string into an array in memory, because that is all I can do, then I write the lines out as separate files and delete the massive string from memory. Then I run a regex on each individual file, and any files which pass the test are placed into the output directory. Then the files in the output directory are combined and the two temporary directories are deleted. I thought I should probably use the current time for the directory names, but I didn't get around to it, ...

That's utterly bizarre. Why write each line of the file to a separate file (in temp), just to read them again and write them as separate files (to output), then read them again into one long in-memory string to finally write to the output file, when you could just write the lines that meet your criteria directly to the output file in the first pass?

 

... I'm off to do this in C# to show the pattern ...

 

Edit: yep, this works fine. As noted, not "production" code, but I'd be surprised if the Python code had to be much different. (Open input file, open output file, read lines from input file and write to output file if they meet some criteria.)

 

using System.IO;

namespace FileFilter
{
    class Program
    {
        // Expects two parameters - the name of the file to process and the text wanted at the start of lines to keep
        // This is not production code. There's no error checking (Does input file exist? Will output file be overwritten?).
        // It may give you a stomach ache. And socks will vanish from your washing.
        static void Main(string[] args)
        {
            using (var inputFile = new StreamReader(args[0]))
            using (var outputFile = new StreamWriter(args[0].Insert(args[0].LastIndexOf('.'), "_Filtered")))
            {
                string lineIn;

                while ((lineIn = inputFile.ReadLine()) != null)
                {
                    if (lineIn.StartsWith(args[1]))
                    {
                        outputFile.WriteLine(lineIn);
                    }
                }
            }
        }
    }
}

P.S. re-use of the variable named "txt" is nasty.

Edited by pzkpfw

Fiveworlds seems to swerve continuously between almost sensible posts and complete nonsense. I recommend people ignore everything he says, just to be on the safe side.

 

Fiveworlds is a self-made programmer in interpreted languages, with quite little experience..

 

With time, he should learn how to optimize his programs, and gain knowledge.

Edited by Sensei

Fiveworlds is a self-made programmer in interpreted languages, with quite little experience..

 

With time, he should learn how to optimize his programs, and gain knowledge.

To learn you first need to admit you've got stuff to learn.

 

fiveworlds definitely does know some stuff. He just needs to understand he doesn't know everything, open himself up to learning, and not be defensive about what he's said.

 

(I've been programming since I was 11, in 1981 (BASIC on a Pr1me minicomputer via a dumb terminal). Programming has been the bulk of my career. What's making me most interested and engaged in my current job is what I'm learning from the contractors who've been brought onto the project I'm working on. The methods and patterns they use are making my brain hurt. But hurt in a good way. They know stuff I don't. That's a good thing.)

 

((P.S. I gave him green for his gobbledygook program, because at least he put some effort into it.))

Edited by pzkpfw

... I'm off to do this in C# to show the pattern ...

 

Edit: yep, this works fine. As noted, not "production" code,

 

Just a thought: try it with a file name that has no dot in it.. :)

LastIndexOf() will return -1.

How will Insert() react to that?

An exception, I'd guess.


fiveworlds, you are generally helpful in the comp sci subforum, but you have a lot to learn about python.

 

 

filein = open('infile.txt', 'r')
fileout = open('outfile.txt', 'w')

for line in filein:
   if line.startswith('fixedStep'):
      fileout.write(line)

filein.close()
fileout.close()

This simply reads the file line by line (it doesn't have to load it all into memory, as you claim), checks whether each line starts with the string 'fixedStep', and if it does, writes it to the output file. Done.
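
For what it's worth, the same thing written with context managers, so both files get closed even if something goes wrong part-way through (a minor variation on the code above, not a correction; same placeholder file names):

with open('infile.txt', 'r') as filein, open('outfile.txt', 'w') as fileout:
    # iterate line by line; the whole file is never loaded into memory
    for line in filein:
        if line.startswith('fixedStep'):
            fileout.write(line)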

 

I regularly process files 100s of GB in size this way with my PC. Python is extremely versatile in its methods.


I regularly process files 100s of GB in size this way with my PC. Python is extremely versatile in its methods.

 

The new versions of Python... I remember a Python that couldn't count and complained about big files. It still can't count. Why adding 0.0001 onto 0 up to 10 returns 9.994699999999, I have no clue.

 

[Screenshot: Python console output of the 0.0001 summation]

 

fiveworlds, you are generally helpful in the comp sci subforum, but you have a lot to learn about python.

 

If you want to give me the book on it, sure, fire away. I have no idea how they made Blender out of it, mostly because I learn everything online and have no teachers and no programming books.

Edited by fiveworlds

The new versions of Python... I remember a Python that couldn't count and complained about big files.

If you read the big file in chunks (line by line), as everybody here is suggesting, there should be no problem.

 

It still can't count. Why adding 0.0001 onto 0 up to 10 returns 9.994699999999 I have no clue.

I can't check this to confirm or deny, but such errors are everywhere, in all code.

It's just a matter of precision (any code will have issues like this when using IEEE 754 32-bit floats beyond about 7 digits after the decimal point)..

Most likely a math library issue. It could also be an issue with printing the float, or with the float-to-string conversion routine.

 

If it's really an issue (causing a problem in your own code, not just theoretical talk), why don't you use integers and divide by 10000 at the end?

 

Like in C/C++ code:

int x = 0;      // counter, in units of 0.0001
int y = 1;

x += y;         // increment the counter (done on every step)

float x1 = (float) x / 10000.0;   // convert to float only once, at the end
printf( "%f\n", x1 );

(an integer count of 10000 corresponds to 1.0 as a float)
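
For comparison, the same idea in Python (a minimal sketch; the exact float value printed varies, and decimal is just the standard-library alternative, not anything used above):

# Accumulating the float directly drifts away from the exact value:
total = 0.0
for _ in range(100000):
    total += 0.0001
print(total)            # close to 10, but not exactly 10

# Counting in integers and dividing once at the end avoids the drift:
count = 0
for _ in range(100000):
    count += 1          # each 1 stands for 0.0001
print(count / 10000)    # exactly 10

# The decimal module gives exact decimal arithmetic if it really matters:
from decimal import Decimal
print(sum(Decimal('0.0001') for _ in range(100000)))   # 10.0000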

 

I have no idea how they made Blender out of it.

Blender is written in C/C++.

Such apps ship a special extension (module) for Python, which can be loaded/imported in a script, so Python can call their functions and control the main app.
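
For example, inside Blender the embedded interpreter exposes the application through the bpy module, so a script can drive it (a tiny illustration; it only runs inside Blender itself):

# Run from Blender's built-in Python console or Text Editor.
import bpy

# Add a cube via Blender's operator API and rename the new object.
bpy.ops.mesh.primitive_cube_add(location=(0.0, 0.0, 0.0))
bpy.context.active_object.name = "ScriptedCube"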


Why do you want to use regular expressions for such a silly task as checking whether the 1st character is a digit or not..?

 

A serious answer to this is flexibility. We have assumed that the numeric entries all begin with a digit. But what if there are also negative numbers? Or if some of the numbers are written with no digits before the decimal point? Or there are spaces before the number? And the spaces could be tabs and/or other white space? Or there are blank lines to be skipped? Or there are comment lines to be ignored as well?

 

As the solution becomes more complex (i.e. more realistic), the conditional code becomes increasingly complex, while the RE approach remains the same.

 

And for anything more complex than the initial example, a compiled RE will quickly become more efficient than the conditional code.
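
To make that concrete, a compiled pattern covering those cases might look something like this in Python (the specific rules are assumptions for illustration, not anything from the original file format):

import re

# Optional leading whitespace (spaces or tabs), optional sign, and a number
# that may be written as 12, 12.5 or .5.
NUMERIC = re.compile(r'^[ \t]*[+-]?(\d+(\.\d*)?|\.\d+)')

def wanted(line):
    stripped = line.strip()
    if not stripped or stripped.startswith('#'):   # skip blank and comment lines
        return False
    return NUMERIC.match(line) is not None

Adding another rule is usually a small change to the pattern rather than another layer of nested ifs.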


If you have plentiful data to process, counted in MB or GB, using reg expressions can slow it down tremendously.
Once I had a project (.NET Framework managed C++, I don't write scripts..):
download 250-300 thousand files from the net, save them in a folder, 20-100 kB per file, ~20-30 GB in total,
and find some data in each HTML file loaded from disk.
My Linux-loving neighbour was saying "do it using reg expressions!" (he is obviously a great fan of them). So I did try it..
(He was doing this project in Python, and I was doing it in .NET Framework C++, a little competition.)
I am a GUI-loving person, so I add progress bars running on a 2nd thread everywhere, etc.,
and listviews where logs are displayed, to show me what is going on in real time. *)
After processing a few hundred files I was shocked: the performance was a nightmare, so I cancelled.
Otherwise I would have had to wait weeks for it to finish all the files..
I replaced the reg expression that was extracting the data from the HTML with a string search for some <tag> that was unique and sat just before my data,
then found the closing </tag>, and processed what was in between by hand (sketched below).
Guess what: it literally started working "1000 times" faster...

*) This is important.
Linux-loving guys typically run their scripts with logging to a file, not to the screen, and go off to do other stuff,
returning when the job is done, checking periodically.. They might not even realise that the stuff they coded could run 1000 times faster if they coded it differently..
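
In Python terms (to stay in the thread's main language), that string-search extraction is roughly the following; the tag names are made up for illustration:

def extract(html, open_tag='<td class="price">', close_tag='</td>'):
    # Find the unique tag that sits just before the wanted data...
    start = html.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    # ...then the matching closing tag, and return what lies between them.
    end = html.find(close_tag, start)
    if end == -1:
        return None
    return html[start:end]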

 

A serious answer to this is flexibility. We have assumed that the numeric entries all begin with a digit. But what if there are also negative numbers? Or if some of the numbers are written with no digits before the decimal point? Or there are spaces before the number? And the spaces could be tabs and/or other white space? Or there are blank lines to be skipped? Or there are comment lines to be ignored as well?


That's a job for String.Trim()/String.TrimStart().
To check whether we have a valid float, Double.Parse()/Double.TryParse() can be used.
.NET Framework C#/C++:

array<String ^> ^lines = File::ReadAllLines( filename );
for each( String ^line in lines )
{
   String ^trimmed = line->Trim();
   if( trimmed->Length == 0 ) continue;
   try
   {
      Double x = Double::Parse( trimmed );
   }
   catch( FormatException ^e )
   {
      // row is not a number
   }
}

I would spend more time searching reg expression tutorials, to learn what to type, than I spent writing this reply..

And for anything more complex than the initial example, a compiled RE will quickly become more efficient than the conditional code.


More efficient to code (less time spent on programming), not more efficient for the CPU.

Edited by Sensei

More efficient to code (less time spent on programming), not more efficient for the CPU.

 

Speaking from both the development tools side and as a user, this is not always true. A regex can often be much more efficient than equivalent conditional code.

 

You wonder why your program with GUI, progress bars, logs, etc. was so slow...? I don't think it is anything to do with regular expressions!

 

And, apart from anything else, you replaced it with a program that did less work. So that might account for the performance difference. Also, did you compile the regex? Or maybe the .NET compiler just handles regular expressions very poorly. I don't know. But condemning regular expressions on the basis of a single example is ridiculous.

Edited by Strange

Why adding 0.0001 onto 0 up to 10 returns 9.994699999999 I have no clue.

You need to check your screen shot again. The second-to-last output line was '9.99999999999', which is essentially 10 with the normal float round off error, the expected result. See the comments about floats above.

 

Your last line of output was '10.0001', which tells me you had an off-by-one error in your loop counter. The '9.99469999999' just happened to be the line at the top of the window with the outputs. Maybe you normally use a tool that scrolls upward with output lines, but the window in your screen shot scrolls down. The later results are the ones closer to the >>> prompt.

 

If you want to give me the book on it sure fire away.

http://bfy.tw/1Xyb If you are just starting, I would suggest starting with a Python 3.x version, as there was a deliberate compatibility break between py2 and py3. Almost all the maintained modules out there have been upgraded to py3 by now (2015). So, if you're truly just starting out, you might as well start with the up-to-date version.
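
One visible example of that compatibility break, for anyone deciding between the two (facts about the languages, not about any code in this thread):

# Python 3
print('hello')    # print is a function
print(1 / 2)      # 0.5 -- true division by default

# The same lines under Python 2 would behave differently:
# print 'hello'   # print was a statement, not a function
# 1 / 2 == 0      # / was floor division for integers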

Edited by Bignose

Speaking from both the development tools side and as a user, this is not always true. A regex can often be much more efficient than equivalent conditional code.

 

Basically, that's nonsense from a CPU/performance point of view.

It's physically not possible. And you, of all people, should know that from physics.

Native code means fewer instructions to execute (optimised by the C/C++ compiler), and therefore less time, than compiled reg-expression code.

 

Doing it in C++ code:

int length = strlen( buffer );
for( int i = 0; i < length; i++ )
{
   if( !strncmp( buffer + i, "some string", 11 ) )   // compare only the 11 characters of the pattern
   {
       // found sub-string at offset i
   }
}

or simply:

if( strstr( buffer, "some string" ) != NULL )
{
   // found sub-string
}

will be, and always will be, much faster than the equivalent reg expression.

A reg expression, compiled to machine code or JIT, can in the very best scenario reach maybe 90%-95% of the speed of normal code, and it will never exceed it.

 

You should not compare *Python* conditional code to C/C++ conditional code. That's where you made your mistake.

*Python* (or other scripting-language) conditional code might indeed be slower than the same thing done with a reg expression.

But C/C++/assembly will never be slower.

C/C++/assembly is not interpreted, as Python is..

 

You wonder why your program with GUI, progress bars, logs, etc. was so slow...? I don't think it is anything to do with regular expressions!

 

I knew (I knew!) you would point that out when I wrote about the GUI. Typical Linux user-programmer.

Sounds like you have no experience in .NET Framework C++ GUI coding.. Because it's (unlike many GUI toolkits around) extremely fast (at least the listview).

I can add listview items by the thousands per second without any slowdown of the main code.

 

See attached project. Compile it on your machine.

BenchmarkListView.zip

It's a benchmark of adding items to a listview.

Adding 10,000 listview items takes 125 milliseconds on my Core i7 machine, with update_is_faster = true;

(80,000 per second)

Adding 1,000 listview items takes 93 milliseconds, with update_is_faster = false;

(10,752 per second)

Single-threaded GUI.

Handling the GUI didn't introduce even 0.1% of slowdown..

The progress bar even less (it updates every 1% of progress, where 300,000 files is 100%, so 1% is 3,000 files..).

I stopped it after waiting 10-20 minutes; it hadn't even reached 1%, so the progress bar never got to update..

 

BTW, I used exactly the same logging to the GUI listview with the reg-expression parsing code as with the manual conditional parsing code. That should be a hint to you.

Edited by Sensei

 

or simply:
if( strstr( buffer, "some string" ) != NULL )
{
   // found sub-string
}

will be, and always will be, much faster than the equivalent reg expression.

 

I imagine that would be about the same performance, as I assume strstr is optimised to use a similar algorithm to the one a regex engine would use for a single string.

 

But now imagine that you need to search for 100 variations of your "some string". A regex will be much, much faster.

 

 

Typical Linux user-programmer.

 

Very funny.

Edited by Strange

But now imagine that you need to search for 100 variations of your "some string". A regex will be much, much faster.

 

Not if the expression is fixed. When you have to do a large number of matches using the same expression, a native implementation of that expression will be faster.


But now imagine that you need to search for 100 variations of your "some string". A regex will be much, much faster.

 

Write source code proving your words,

and I will write counter-code proving mine.

 

Your case could only be true if somebody deliberately scuttled it.

E.g. the strings to search for are:

xyzabc

xyzbcd

xyzcde

(and the reg expression is something like (.*)xyz(abc|bcd|cde)(.*) )

 

Now a C/C++ coder could use:

if( strstr( buf, "xyzabc" ) ||
    strstr( buf, "xyzbcd" ) ||
    strstr( buf, "xyzcde" ) )
{
    // found!
}

(This is the WRONG way, because each call scans again from the start of the buffer to the first occurrence, so the CPU passes over the same characters multiple times.)

Instead he should do something similar to:

t = strstr( buf, "xyz" );
if( t != NULL )
{
    t += 3;
    if( !strncmp( t, "abc", 3 ) ||
        !strncmp( t, "bcd", 3 ) ||
        !strncmp( t, "cde", 3 ) )
    {
        // found
    }
}

 

P.S. I do use reg expressions, where I find them useful. E.g. I made a process monitor and a file monitor which had filters implemented as reg expressions (a filter control in the GUI), so they displayed only the entries matching the pattern.

But I also showed an example of a job where reg expressions showed their weakness..

Edited by Sensei

 

Not if the expression is fixed. When you have to do a large number of matches using the same expression, a native implementation of that expression will be faster.

 

Only if you write your code as an NDFA. Which will be hard for anything except the most trivial examples. And then the code will be almost incomprehensible and unmaintainable.

Write source code proving your words,

 

I thought anyone who understood programming would know this. But apparently not.

Edited by Strange

 

I thought anyone who understood programming would know this. But apparently not.

 

Since the beginning of this discussion, I have been talking *exclusively* about execution time, the time spent by the CPU processing the data.

It's obvious that writing reg-expression handling code is easier for the programmer, and takes less of his time, than writing the same string-comparison code by hand.

 

If you claim something else, then prove it.

Measure the time spent by the CPU: get the time in microseconds/milliseconds, execute some reg expression match(), compiled or not, repeated 1000 times, get the time again, and subtract.

Then we will know how many milliseconds/microseconds a single match takes.

Then do the same with manual code finding the same pattern.
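
A rough harness for exactly that measurement, in Python since that is the thread's language (absolute numbers depend entirely on the data, the pattern and the regex engine, so treat it as a sketch rather than a verdict):

import re
import timeit

# A buffer with the target string buried in the middle, repeated many times.
buffer = ('x' * 200 + 'some string' + 'y' * 200) * 1000
pattern = re.compile('some string')

plain = timeit.timeit(lambda: 'some string' in buffer, number=1000)
regex = timeit.timeit(lambda: pattern.search(buffer), number=1000)

print('plain substring search: %.4f s for 1000 runs' % plain)
print('compiled regex search:  %.4f s for 1000 runs' % regex)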

Edited by Sensei

OK. If you write the code to find all words with the 5 vowels in alphabetic order (and no other vowels) then I will compare it with this: '\<[^aeiou]*a[^aueio]*e[^aeiuo]*i[^aeiou]*o[^aeiou]*u[^aeiou]*\>'

 

While writing code for that is annoying, it will still be faster. The regex engine has to do extra work that a native implementation of the regex doesn't have to do. Whether or not it's useful to write a native version is a different matter.
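
For reference, that pattern dropped into Python and run over a word list looks like this (a sketch: the grep-style \< \> word anchors become ^ and $ because each whole word is tested on its own, and the word-list path is an assumption):

import re

# Words whose vowels are exactly a, e, i, o, u, in that order.
PATTERN = re.compile(r'^[^aeiou]*a[^aeiou]*e[^aeiou]*i[^aeiou]*o[^aeiou]*u[^aeiou]*$')

with open('/usr/share/dict/words') as words:
    for word in words:
        word = word.strip().lower()
        if PATTERN.match(word):
            print(word)        # e.g. 'abstemious', 'facetious'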


There is a neat program called re2c which generates C source code from regular expressions. For my example it produces this:

 

 

for (;;) {
    YYCTYPE yych;

    yych = *YYCURSOR;
    switch (yych) {
    case 'a': goto yy7;
    case 'b':
    case 'c':
    case 'd':
    case 'f':
    case 'g':
    case 'h':
    case 'j':
    case 'k':
    case 'l':
    case 'm':
    case 'n':
    case 'p':
    case 'q':
    case 'r':
    case 's':
    case 't':
    case 'v':
    case 'w':
    case 'x':
    case 'y':
    case 'z': goto yy5;
    case 'e':
    case 'i':
    case 'o':
    case 'u': goto yy2;
    default: goto yy3;
    }
yy2:
    YYCURSOR = YYMARKER;
    goto yy4;
yy3:
    yych = *(YYMARKER = ++YYCURSOR);
    switch (yych) {
    case 'e':
    case 'i':
    case 'o':
    case 'u': goto yy4;
    default: goto yy6;
    }
yy4:
    { return result; }
yy5:
    ++YYCURSOR;
    yych = *YYCURSOR;
yy6:
    switch (yych) {
    case 'a': goto yy7;
    case 'e':
    case 'i':
    case 'o':
    case 'u': goto yy2;
    default: goto yy5;
    }
yy7:
    ++YYCURSOR;
    yych = *YYCURSOR;
    switch (yych) {
    case 'a':
    case 'i':
    case 'o':
    case 'u': goto yy2;
    case 'e': goto yy9;
    default: goto yy7;
    }
yy9:
    ++YYCURSOR;
    yych = *YYCURSOR;
    switch (yych) {
    case 'a':
    case 'e':
    case 'o':
    case 'u': goto yy2;
    case 'i': goto yy11;
    default: goto yy9;
    }
yy11:
    ++YYCURSOR;
    yych = *YYCURSOR;
    switch (yych) {
    case 'a':
    case 'e':
    case 'i':
    case 'u': goto yy2;
    case 'o': goto yy13;
    default: goto yy11;
    }
yy13:
    ++YYCURSOR;
    yych = *YYCURSOR;
    switch (yych) {
    case 'a':
    case 'e':
    case 'i':
    case 'o': goto yy2;
    case 'u': goto yy15;
    default: goto yy13;
    }
yy15:
    ++YYCURSOR;
    yych = *YYCURSOR;
    switch (yych) {
    case 'a':
    case 'e':
    case 'i':
    case 'o':
    case 'u': goto yy17;
    default: goto yy15;
    }
yy17:
    { return result += c; }
}

Over 100 lines. But compilers do a very good job with case statements, so it should be pretty efficient.


  • 2 weeks later...
Blender is written in C/C++.

Such apps ship a special extension (module) for Python, which can be loaded/imported in a script, so Python can call their functions and control the main app.

 

 

Figured it out in C#:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using IronPython.Hosting;

namespace includingpython
{
    class Program
    {
        static void Main(string[] args)
        {
            var pythonruntime = Python.CreateRuntime();

            dynamic pythonFile = pythonruntime.UseFile("test.py");

            pythonFile.pythondef();
        }
    }
}
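
The test.py it loads only needs to define the function the C# code calls; something like this (a hypothetical file matching the snippet above):

# test.py -- loaded by the IronPython runtime from the C# program above
def pythondef():
    print("Hello from IronPython")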
