
Python code to extract specific lines in a textfile


Jo27

Hey Guys:

 

For my research project, I need Python code that will let me extract specific lines from a text file.

 

The textfile has the following format:

 

fixedStep chrom=chr3 start=56424 step=1

0.000
0.001
0.001
0.002
0.003
0.004
0.005
0.006
0.007
0.007
fixedStep chrom=chr3 start=56425 step=1
Etc....
In fact, I would like to obtain another text file without the numerical lines, in the following format:
fixedStep chrom=chr3 start=56424 step=1
fixedStep chrom=chr3 start=56425 step=1
Etc....
Looking forward to hearing from you soon,
Best, JEFF O.

like this?

#!/pubserver/python27/python

print "Content-type: text/html"
print
print "<html><head>"
print "</head><body>"
print "This was written in Python."
txt = open("test.txt", "r")
s=txt.readlines()
print s[0],s[11];
txt.close()
print "</body></html>"

Use awk instead. :)

 

awk "/^[^0-9]/" textfile

 

Both Sensei and fiveworlds seem to have missed the point of the question. In fiveworlds' example the "print" line needs to be replaced with something like:

if (<regular expression to find lines not starting with a number>):

    print

(I am not familiar enough with python to write that off the top of my head.)

 

Also, I have no idea why he is printing out (invalid) HTML :confused:

Edited by Strange

Both sensei and fiveworlds seem to have missed the point of the question.

I didn't want to give final code, as it sounds quite like homework..

 

In fiveworlds' example the "print" line needs to be replaced with something like:

if (<regular expression to find lines not starting with a number>):

Why do you want to use regular expressions for such a trivial task as checking whether the first character is a digit or not?

 

Equivalent of C/C++ code:

if( !( ( row[ 0 ] >= '0' ) && ( row[ 0 ] <= '9' ) ) )

or

if( !isdigit( row[ 0 ] ) )

would be sufficient (and faster; regular expressions are comparatively slow).

 

(assuming a null-terminated row is in the buffer)

 

http://www.tutorialspoint.com/python/string_isdigit.htm
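
A Python counterpart of the C check above might look like the sketch below. It relies on `str.isdigit()` from the linked page; the function name `is_header_line` and the sample list are made up for illustration, and treating empty lines as non-numeric is an assumption.

```python
def is_header_line(line):
    """Return True when the line does not start with a digit."""
    # line[:1] is '' for an empty string and ''.isdigit() is False,
    # so empty lines count as non-numeric here (an assumption).
    return not line[:1].isdigit()


sample = [
    "fixedStep chrom=chr3 start=56424 step=1\n",
    "0.000\n",
    "0.001\n",
    "fixedStep chrom=chr3 start=56425 step=1\n",
]

# Keep only the lines whose first character is not a digit.
kept = [line for line in sample if is_header_line(line)]
print("".join(kept), end="")
```

Slicing with `line[:1]` rather than indexing with `line[0]` avoids an IndexError on an empty string.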

Edited by Sensei

This seems to work

#!/usr/bin/python

import re
import sys

if len(sys.argv) != 2:
	print("Missing file name argument")
	sys.exit(1)

filename = sys.argv[1]

# Pattern to match lines that don't begin with a number
pat = re.compile('^[^0-9]')

with open(filename) as f:
	for line in f:
		if pat.match(line):
			print(line),
For Python 3, the print line will need to change to: print(line, end="")
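
Since the original question asks for another text file rather than output on the screen, a Python 3 variant could write the kept lines straight to a second file. This is only a sketch: the function name `filter_file` and the file names are invented for the example, and the demo runs on a throwaway temporary directory.

```python
import os
import re
import tempfile

# Same pattern as above: the first character is anything but a digit.
NON_NUMERIC = re.compile(r"^[^0-9]")


def filter_file(src_path, dst_path):
    """Copy lines that do not start with a digit from src_path to dst_path."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:  # the file object yields one line at a time
            if NON_NUMERIC.match(line):
                dst.write(line)
                kept += 1
    return kept


# Demo on a small temporary file (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt")
    with open(src, "w") as f:
        f.write("fixedStep chrom=chr3 start=56424 step=1\n"
                "0.000\n"
                "0.001\n"
                "fixedStep chrom=chr3 start=56425 step=1\n")
    kept = filter_file(src, dst)
    with open(dst) as f:
        result = f.read()

print(result, end="")
```

Because the loop iterates over the file object instead of calling readlines(), only one line is held in memory at a time.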

Why do you want to use regular expressions for such a trivial task as checking whether the first character is a digit or not?

 

Because that is how awk does it? :)

 

Actually, from the short example given, it might just be enough to check if the first character is 'f' ...


... or even use the much simpler "grep". However, that assumes Jo27 is on a Linux (or Mac) system, and that is not certain. In fact, if someone asks for a Python script to do trivial text processing, I would assume a Windows user.


Also, I have no idea why he is printing out (invalid) HTML :confused:

 

I haven't installed python because I have a few versions.

 

awk "/^[^0-9]/" textfile

 

Both Sensei and fiveworlds seem to have missed the point of the question. In fiveworlds' example the "print" line needs to be replaced with something like:

if (<regular expression to find lines not starting with a number>):

    print

(I am not familiar enough with python to write that off the top of my head.)

 

 

From what I can see in the OP's example, he/she wants lines that are separated by a fixed number of lines, not the lines without numbers. But I can see where you are going: just regex out the numbers. This does depend on the lines he/she wants, because they can't contain numbers like that.

fixedStep chrom=chr3 start=56424 step=1
0.000
0.001
0.001
0.002
0.003
0.004
0.005
0.006
0.007
0.007
fixedStep chrom=chr3 start=56425 step=1

I.e. each wanted line comes after a fixed number of lines. My code retrieves

 

the first line

s[0]=fixedStep chrom=chr3 start=56424 step=1

 

the last line

s[11]=fixedStep chrom=chr3 start=56425 step=1

 

 

I didn't want to give final code, as it sounds quite like homework..

 

I didn't want to either; he/she still needs to output the data to a new file.

Edited by fiveworlds

From what I can see in the OP's example, he/she wants lines that are separated by a fixed number of lines, not the lines without numbers.

 

He says: "I would like to obtain another text file without the numerical lines"

 

Seemed pretty clear to me. But let's see.

 

... or even use the much simpler "grep".

 

That would do it.

 

 

However, that assumes Jo27 is on a Linux (or Mac) system, and that is not certain. In fact, if someone asks for a Python script to do trivial text processing, I would assume a Windows user.

 

The first thing I do on a new Windows machine is install cygwin!


Erm, instead of looking for '0' to '9' as the first character of the line (via regex or not) - why not just look for the 'f' of "fixedStep"? (or the whole word).

 

Seems safe enough, going by the specification (which does say "...without the numerical lines..." but equally shows the only non numeric lines to begin "fixedStep ...").
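
As a rough sketch of that idea in Python, assuming every wanted line really does begin with the literal text "fixedStep" as in the sample (the function name `header_lines` is made up for the example):

```python
def header_lines(lines, prefix="fixedStep"):
    """Yield only the lines that begin with the given prefix."""
    for line in lines:
        if line.startswith(prefix):
            yield line


# Sample data in the format from the question, including a blank line.
sample = """fixedStep chrom=chr3 start=56424 step=1

0.000
0.001
fixedStep chrom=chr3 start=56425 step=1
""".splitlines(keepends=True)

headers = list(header_lines(sample))
print("".join(headers), end="")
```

Unlike the "does not start with a digit" test, this version also drops blank lines, which may or may not be what is wanted.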

 

(Minor point, but, well ...)

 

 

@fiveworlds: the spec says "Etc....". A solution that only shows lines 1 and 12 of the file (elements 0 and 11 of the list) misses all lines in the potential "Etc.".

Also, a solution that loads (with the use of "readlines") the entire file at once into a list, will get to be a drag on the system if the input file gets large.


Also, a solution that loads (with the use of "readlines") the entire file at once into a list, will get to be a drag on the system if the input file gets large.

 

 

I would use an assembly code executable for that, probably written in C#, but that was not what the OP asked.

 

@fiveworlds: the spec says "Etc....". A solution that only shows lines 1 and 12 of the file (elements 0 and 11 of the list) misses all lines in the potential "Etc.".

 

 

 

 

#!/pubserver/python27/python

print "Content-type: text/html"
print "Accept-Language: en-US"
print "Cache-Control: no-cache"
print ""
print "<html><head>"
print "</head><body>"
print "This was written in Python."
fp = open("test.txt", "r")
txt =fp.readlines()
print len(txt)
print "</br>"
i=0
while i<len(txt):

    print "</br>"
    print txt[i]
    txt[i]="hello\n"
    print "</br>"
    print txt[i+11]
    txt[i+11]="hello\n"
    i=i+12

fp.close()

print "</body></html>"

 

 


I would use an assembly code executable for that, probably written in C# ...

That doesn't really make sense (given how C# works; even if you're talking about .Net Native under .NET Framework 4.6 and 4.5). Having said that, C# is what I'd use too - I (more or less) currently make my living as a C# programmer.

 

... but that was not what the op asked

He didn't ask for mangled HTML either!

 

... python code ...

Still has the issue of reading the file all at once. I don't know Python specifically, but reading line by line seems possible (e.g. http://stackoverflow.com/questions/8009882/how-to-read-large-file-line-by-line-in-python )

 

Reading line by line also means you'd not be tied to the assumption that there's always 10 lines between the lines wanted. And your code assumes the input ends on a wanted line; but going by the sample (with the "Etc ...") I'd suggest that's not guaranteed. And if that "last line" is not there (or there are less than 10 unwanted lines), the last iteration of your while loop may well have an i that's less than len(txt) ... but adding 11 (i.e. the "print txt[i+11]") would push you past the end of the list.
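
The overrun described above can be reproduced with a simplified stand-in for the indexing loop. This is not the posted code, just a sketch that mimics its `i` and `i+11` accesses on invented sample data, next to a content-based filter that does not care how many numeric lines sit between the wanted ones.

```python
def fixed_stride(lines, stride=12):
    """Mimic the indexing approach: take lines i and i+11 of each block of 12."""
    picked = []
    i = 0
    while i < len(lines):
        picked.append(lines[i])
        picked.append(lines[i + stride - 1])  # IndexError if the block is short
        i += stride
    return picked


def by_content(lines):
    """Keep every line that does not start with a digit."""
    return [line for line in lines if not line[:1].isdigit()]


# One full block of 12 lines, then a header followed by only 3 values.
lines = (["fixedStep start=1"] + ["0.001"] * 10 + ["fixedStep start=2"]
         + ["fixedStep start=3"] + ["0.002"] * 3)

try:
    fixed_stride(lines)
    stride_ok = True
except IndexError:
    stride_ok = False

headers = by_content(lines)
print(stride_ok)
print(headers)
```

With this input the stride version raises IndexError on the second block, while the content-based version returns all three header lines.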

 

That is, your code will only work if the input file is exactly like:

 

wanted line

10 x unwanted lines

wanted line

wanted line

10 x unwanted lines

wanted line

wanted line

10 x unwanted lines

wanted line

... etc.

 

What's the point of the "hello" replacements?


Still has the issue of reading the file all at once.

 

Not really; Python only handles small files.

 

the last iteration of your while loop may well have an i that's less than len(txt)

 

An empty string

 

What's the point of the "hello" replacements?

The "hello\n" was to point out that if \n isn't included in the replacement, a line is lost.

Edited by fiveworlds

Only it has nothing to do with the available (virtual) memory of my computer, which is 8 GB; that file was 3.5 GB. I can open the file, just not with Notepad or Python.

A limitation of some app (like Notepad), or of a language, is not the same as a limitation of the system.

 

I made a C/C++ project for you, compiled for both 32-bit and 64-bit.

 

Run it in command line as follows:

OpenFile "file name"

 

OpenFile.zip

In this project I used ftell() to get the file size. It is declared as follows:

 

long __cdecl ftell(_Inout_ FILE * _File);

 

in includes.

 

It returns a 32-bit integer (long is 32 bits wide on Windows).

 

There is yet another function for 64 bit:

 

__int64 _ftelli64( FILE *stream );

 

I used it (and _fseeki64()) in this project:

OpenFile64.zip

so you can compare results.

 

The first project is written without Windows-specific functions (ANSI C, portable code that can be compiled without changes on Linux/MacOS), while the second project uses functions available only on Windows, added by Microsoft for 64-bit support.

 

For 64-bit file handling on Linux there are other functions, which are available on Linux but not on Windows (and vice versa):

http://stackoverflow.com/questions/9026896/get-large-file-size-in-c
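
For comparison, the same fseek()/ftell() idea can be sketched in Python. Python integers are not limited to 32 bits, so on platforms with large-file support the reported offset should not wrap around for files over 2 GB; that is not demonstrated here, since the example only uses a tiny temporary file. The function name `size_by_seek` is made up for the example.

```python
import os
import tempfile


def size_by_seek(path):
    """Return the file size by seeking to the end, like fseek()/ftell() in C."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        return f.tell()


# Create a small throwaway file of known size.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1000)
    path = tmp.name

try:
    seek_size = size_by_seek(path)
    stat_size = os.path.getsize(path)  # asks the OS without opening the file
finally:
    os.remove(path)

print(seek_size, stat_size)
```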

 

Programs written in the past use 32-bit file handling functions by default.

Programs written now have to be written by a professional programmer who is aware of how to deal with very large files.

Otherwise he/she will use the wrong functions and create the limitation himself/herself.

 

I know plenty of programmers who use obsolete, ancient computers with (32-bit) Windows XP and don't want to upgrade (and it's not a matter of money).

My neighbor was using a Pentium III laptop (OMG).

Their private code and private projects will most likely be affected by this issue.

Edited by Sensei

Here is the Python code for it, but it doesn't get around the limitation of Python's open(), which just reads the file as one line in a massive string. PS: I am going out for the day. Basically, I load the massive string into an array in memory, because that is all I can do; then I write the lines out as separate files and delete the massive string from memory. Then I run a regex on each individual file, and any files which pass the test are placed into the output directory. Then the files in the output directory are combined and the two temporary directories are deleted. I thought I should probably use the current time for the directory names, but I didn't get around to it.

 

 

 

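# Python 2 syntax (print statements); paths use Windows-style separators.
import os
import re
import shutil

# Remove leftovers from a previous run.
while os.path.exists("temp"):
    shutil.rmtree('temp')
while os.path.exists("output"):
    shutil.rmtree('output')

# Read the whole input file into a list of lines.
fp = open("test.txt", "r")
txt = fp.readlines()
print len(txt)
print "</br>"

# Write each input line to its own file: temp\0.txt, temp\1.txt, ...
handler = 0
while handler < len(txt):
    if not os.path.exists("temp"):
        os.makedirs("temp")

    path = "temp\\" + str(handler) + ".txt"
    fz = open(path, "w")
    fz.write(txt[handler])
    fz.close()
    handler = handler + 1

fp.close()
txt = 0  # drop the list of lines; txt is reused as a counter below

# Copy every line that contains a lowercase letter into output\
while txt < handler:
    if not os.path.exists("output"):
        os.makedirs("output")

    path = "temp\\" + str(txt) + ".txt"
    fz = open(path, "r")
    out = fz.readline()
    fz.close()
    pattern = '([a-z]+)'
    result = re.compile(pattern).search(out)
    if result:
        path = "output\\" + str(txt) + ".txt"
        fz = open(path, "w")
        fz.write(out)
        fz.close()
    else:
        print ""

    txt = txt + 1

txt = 0
out = ""
fp = open("output.txt", "w")

# Join the kept lines into output.txt.
# Note: os.listdir() returns names in arbitrary order, so the
# original line order is not guaranteed here.
for i in os.listdir("output"):
    if i.endswith(".txt"):
        fz = open("output\\" + str(i), "r")
        out = str(out) + str(fz.readline())
        fz.close()
        continue
    else:
        continue

fp.write(out)
fp.close()
del(out)
del(pattern)
del(result)

# Clean up the temporary directories.
while os.path.exists("temp"):
    shutil.rmtree('temp')

while os.path.exists("output"):
    shutil.rmtree('output')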

 

 

Edited by fiveworlds

Basically, I load the massive string into an array in memory, because that is all I can do

 

What do you mean, it is all you can do? Of course it isn't. You can read the file one line at a time, which would be more appropriate. You could even read it one byte at a time, if you wanted to.
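
Both options can be sketched in a few lines of Python. Here `io.BytesIO` stands in for a real file opened with `open(path, "rb")`, and the two function names are invented for the example; each one counts lines without ever holding the whole input in memory.

```python
import io


def count_lines_bytewise(stream):
    """Count newline bytes by reading a binary stream one byte at a time."""
    count = 0
    # iter(callable, sentinel) keeps calling read(1) until it returns b"".
    for byte in iter(lambda: stream.read(1), b""):
        if byte == b"\n":
            count += 1
    return count


def count_lines_linewise(stream):
    """Count lines by iterating over the stream one line at a time."""
    return sum(1 for _ in stream)


data = b"fixedStep chrom=chr3 start=56424 step=1\n0.000\n0.001\n"

bytewise = count_lines_bytewise(io.BytesIO(data))
linewise = count_lines_linewise(io.BytesIO(data))
print(bytewise, linewise)
```

Reading byte by byte is much slower in practice; iterating line by line is the usual choice for text like this.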

 

You do talk nonsense sometimes.
