Re: Problem with subprocess in threaded enviroment



On Mar 12, 12:58 pm, Ningyu Shi <ShiNin...@xxxxxxxxx> wrote:
I'm trying to write a multi-task downloader to download files from a
website using multi-threading. I have one thread to analyze the
webpage, get the addresses of the files to be downloaded and put these
in a Queue. Then the main thread will start some threads to get the
address from the queue and download it. To keep the maximum files
downloaded concurrently, I use a semaphore to control this, like at
most 5 downloads at same time.

You don't need a semaphore for that. Semaphores are low-level and
very hard to get right. (Read http://www.greenteapress.com/semaphores/
if you don't believe me.) You don't need any other type of explicit
lock either. Just use Queues.

Queues are much easier. Make simple threads, each of which typically
reads from one queue and writes to one queue. If you want N-way
concurrency at one stage, start N instances of that thread. If you
need to guarantee non-concurrency at some stage, make that thread a
singleton. Avoid sharing mutable data across threads except through
Queues.

I tried to use urllib.urlretreive in the download() thread, but from
time to time, it seems that one download thread may freeze the whole
program. Then I gave up and use subprocess to call wget to do the job.
My download thread is like this:

def download( url ):
subprocess.call(["wget", "-q", url])
with print_lock:
print url, 'finished.'
semaphore.realease()

If the prints are just for debugging, print_lock is overkill. And in
practice that print statement is probably atomic in CPython because of
the GIL.

But let's assume that it really is critical not to mix the output
streams, and that the print isn't necessarily atomic. So instead of
the lock, use a print_queue and a single PrintThread that just pulls
strings off the queue and prints them.

But later I found that after the specific wget job finished
downloading, that download() thread never reach the print url
statement. So I end up with files been downloaded, but the download()
thread never ends and don't realease the semaphore then block the
whole program. My guess is that at the time wget process ends, that
specific download thread is not active and missed the return of the
call.

The fact that changing libraries didn't fix your problem is a strong
indicator that it's your code, not the libraries. You're probably
introducing deadlocks with your semaphores. Too much locking is as
bad as too little.

Here's the skeleton of how I'd write this program:

main thread: create queues, create and start other threads, find URLs,
put URLs on input_queue

class DownloadThread(threading.Thread):
def run(self):
while True:
url = input_queue.get()
urllib.urlretrieve(url)
print_queue.put("done with %s" % url)

class PrintThread(threading.Thread):
def run(self):
while True:
st = print_queue.get()
print st

Then you just have to worry about clean shutdown. A simple way to
handle this to initialize a counter to the number of URLs in main, and
decrement it in PrintThread. Then have main busy-wait (with a sleep
to avoid consuming too much CPU) for it to hit zero, then join all the
other threads and exit. (This seems to violate the no-shared-mutable-
data rule, but only one thread at a time can mutate it, so it's safe.)

Once you've done this manually, consider using the TaskQueue recipe on
the Python Cookbook site, to simplify it a bit.
.



Relevant Pages

  • Re: Alt.Binz: Using v242. Should I upgrade?
    ... Newzleech RSS fixed [Link] ... + previous copy of queue file is kept and can be loaded when current queue ... * manual nzb rss download possibile while on rss timer ... redocked tab state not beeing saved when exiting ...
    (news.software.readers)
  • Re: foreach, IEnumerable and modifying contents
    ... I would like to incorporate threads so that I can download N ... The producer of the queue would add URLs to be downloaded, while the consumer would keep track of how many active downloads are going on. ... thread pool when the downloads are finished. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: foreach, IEnumerable and modifying contents
    ... In a typical queue design, you would gracefully deal with an empty queue. ... One download finishes and my code begins pulling ... to access the same enumerator. ... If anything, my intent is to point out that concurrency is _not_ simple, and that your second design appears to have just made it more complicated than it otherwise needed to be. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Traffic Shaping not working correctly after ipfw coverted to use pfil_hooks API
    ... I'm using 400Kbit/s as the upload limit ... > # Assign all other outgoing traffic to the low-priority queue ... The upload pipe ... > Here is where the problem is if I do a upload and download at the same time: ...
    (freebsd-current)
  • Re: Newbie question on require and semaphores
    ... > converting C header files to perl. ... I didn't download squat. ... Googling "Perl And Semaphore" returns ...
    (comp.lang.perl.misc)