I'm running Ubuntu 16.04 LTS with Python 3.6.8, and I have the following code that lets me iterate over the lines in a file, process each row, and append the data to a database. After processing a row, I need to delete it, replace it with a '\n', or do anything else that reduces the size of the text file. Also, I need at most two copies of the data at any time: the database and the first-line-deleted file.

with open(filename, buffering=1000) as f:
    for row in f:
        # process text
        # delete row or replace with '\n'

How exactly do I do this?


1 Answer

You have a big problem here: deleting from the middle of a file isn't something most operating systems and filesystems support, and where it is possible, it's an esoteric operation with complicated constraints.

So the normal way to delete from the middle of a file is to rewrite the entire file. But you seem to indicate in the comments that your file is hundreds of gigabytes, so reading the whole file, processing one line, and rewriting the whole file is going to be expensive and require extra temporary storage. If you do that for every line, you'll end up doing far more work and still need roughly double the disk space anyway.
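For illustration, here is the shape of that rewrite approach (a rough sketch, not code from the question; process_line and the temporary-file handling are my assumptions). It processes and drops the first line, then copies the rest to a temporary file, which briefly needs about as much free space as the original:

import os
import tempfile

def process_line(line):
    pass  # placeholder: append the row to the database

with open(filename) as src:
    process_line(next(src))  # handle the first line, then drop it
    # Copy the remaining lines to a temporary file in the same directory.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(filename)))
    with os.fdopen(fd, 'w') as dst:
        for row in src:
            dst.write(row)
os.replace(tmp_path, filename)  # atomically swap in the shortened file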

If you absolutely have to do this, here are some possibilities:

Read the file backwards and truncate it as you go. Reading backwards is awkward because very little tooling is set up to help with it, but in principle it is possible, and you can truncate the end of a file this way without copying it; see the sketch below.
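In sketch form, that might look like this (an untested outline under my own assumptions: the 4096-byte chunk size is arbitrary, and process_line stands in for your database insert):

import os

def process_line(line):
    pass  # placeholder: append the row to the database

with open(filename, 'rb+') as f:
    f.seek(0, os.SEEK_END)
    pos = f.tell()
    buf = b''
    while pos > 0:
        read_size = min(4096, pos)
        pos -= read_size
        f.seek(pos)
        buf = f.read(read_size) + buf
        lines = buf.split(b'\n')
        buf = lines.pop(0)  # may still be a partial line; keep it for the next pass
        for line in reversed(lines):
            if line:
                process_line(line.decode())
        # Everything past pos + len(buf) has been processed, so shrink the file.
        f.truncate(pos + len(buf))
    if buf:
        process_line(buf.decode())
        f.truncate(0)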

Use smaller files, and delete each one after you've processed it. This depends on being able to change how the files are created, but if you can, it's much simpler and lets you reclaim space from processed pieces sooner, as in the sketch below.
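If you can arrange that, the processing loop becomes trivial (the data_*.txt naming scheme here is my invention for the example, and process_line is again a placeholder):

import glob
import os

def process_line(line):
    pass  # placeholder: append the row to the database

for path in sorted(glob.glob('data_*.txt')):
    with open(path, buffering=1000) as f:
        for row in f:
            process_line(row)
    os.remove(path)  # reclaim the space as soon as this piece is done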

On the other hand, do you definitely need to do this at all? Is the problem that the file is so big the disk will run out of room for the database while the file is still there? Or do you just want to process more huge files simultaneously? If the latter, have you checked that processing multiple files at once is actually faster than doing the same files one after the other? And of course, could you simply buy more disks or a bigger disk?
...