Duplicates in share not removed from queue.xml

Bug #250238 reported by petestrash
This bug affects 1 person
Affects: DC++
Status: Fix Released
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

The only way I know to bulk add magnet links to queue.xml is to do it manually while DC++ is not running.

When DC++ is started, it will download files from the queue that are already in the share, even if 'Don't download files already in share' is selected.

Would it be possible for DC++ to check the queue for duplicates in share and remove them from the queue before they are downloaded, or to give an option to manually initiate such a process?

Vista32 SP1, DC++ V0.707

Tags: core
Jacek Sieka (arnetheduck) wrote :

I guess a button that did this in the queue frame wouldn't hurt...

Changed in dcplusplus:
importance: Undecided → Wishlist
status: New → Confirmed
petestrash (petestrash) wrote :

Thanks for that. The button as a feature request would be nice.

But I still think it's a bug that the queue is not checked for duplicates in share when 'Don't download files already in share' is selected. I think this should also be checked prior to a download starting.

Thanks,

Peter.

petestrash (petestrash) wrote :

Another way to think about this (which may give it higher importance) is that after DC++ starts and finishes hashing any new files in the share, it does not check whether any items in the queue are already in the share.

Really, this check does not need a button. Just run it after the files are hashed. This way, any new items in the share will be removed from the queue, stopping duplicate downloads and reducing wasted slots/bandwidth.
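
The idea of running the check as files finish hashing can be sketched roughly as follows. This is only an illustration in Python, not DC++ code (the client itself is C++): `DownloadQueue`, `on_file_hashed` and `skip_shared` are hypothetical names, and it assumes the queue can be looked up by TTH and that the hasher reports the TTH of each file it finishes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DownloadQueue:
    """Toy download queue keyed by TTH (hypothetical, not DC++'s own queue code)."""

    # Maps TTH root -> target path of the queued download.
    items: Dict[str, str] = field(default_factory=dict)
    # Stands in for the "Don't download files already in share" setting.
    skip_shared: bool = True

    def on_file_hashed(self, tth: str) -> Optional[str]:
        """Called once per file when the share finishes hashing it.

        Returns the target that was dropped from the queue, or None
        if nothing matched or the setting is off.
        """
        if not self.skip_shared:
            return None
        return self.items.pop(tth, None)


queue = DownloadQueue({"TTH_A": "downloads/a.mp3", "TTH_B": "downloads/b.mp3"})
removed = queue.on_file_hashed("TTH_A")    # a.mp3 has turned up in the share
untouched = queue.on_file_hashed("TTH_Z")  # hashed file that was never queued
```

Because the lookup is by TTH, each hashed file costs one dictionary lookup, so in this sketch the check adds very little on top of the hashing itself.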

Peter.

be4all2 (bschors) wrote :

I totally agree with petestrash above, it's very irritating to download files I already have. It's a bug.

Nick-V (nick-veit) wrote :

Yes this causes much frustration amongst the users of our hub too. Duplicate checking should be conducted later in the processing cycle as things could have changed.

Using TTH is an excellent and necessary way to deal with duplicates but there is a perception it is not working because of this bug or design flaw.

Twilight2013 (twilight2013) wrote :

I agree with petestrash, be4all2 and Nick-V on this one. I always look at my "Finished Uploads"; if I find a user downloading the same file, I send him/her a private message. Some respond, but most don't respond at all. I'm tired of these fucking assholes downloading the same shit, with users wasting their bandwidth, slots & time over the same ass file. Sorry for my language, but I've grown tired of these abusive users who download & queue the same file with the same TTH.

Tiger Tree Hash is helpful for finding alternate sources in search and/or finding duplicates in your share, but it needs some improvement. For the future: will this work if the file is renamed but has the same TTH?

petestrash (petestrash) wrote :

I'm sorry, but even if this problem is fixed [which I hope it is :)], it may not fix the problem you are experiencing.

If they have downloaded from you already and then download again in the same session, then the problem lies with the downloader. They probably do not have 'Don't download files already in share' selected in their settings, wasting everyone's time and bandwidth.

Regarding your second question, I thought TTHs already work like that if you have "Don't download files already in share" & "Don't download files already in queue" selected in your settings.

Peter.

Arabesque (pleidonius) wrote :

I am experiencing the same problem. My file list has long been hashed, the 2 options "Don't download files already in share" & "Don't download files already in queue" are turned on. Yet still, files and directories that I already have in my share are queued! It took me an entire day to clean out the duplicates in my queue, the horror!! So I warn other users, test out the function before starting to queue loads of stuff. I hope it's possible for developers to have a look at this (mal)function. I hope someone can resolve this before the next update arrives...

Philip

Nick-V (nick-veit) wrote :

I note that this thread deals particularly with the checking of duplicates AFTER items have been added to the queue outside of DC++. The thread is asking that duplicate checking occurs later, perhaps at the time of download, to cater for these items rather than earlier.

Philip's report may be different, as he does not mention adding items to the queue manually.

There are reasons that duplicates appear to be downloaded. The first I note is with video clips...sometimes a visually identical (perceived duplicate) video clip is downloaded. Whilst the clip appears identical, DC sees it as a different file because perhaps someone has made a small edit to it...whilst this is annoying, DC cannot know it is the same clip - it only knows that the TTH (unique identifier code) is different due to the small edit.

The second circumstance appears to be due to a weakness of TTH that I note again with video (big) clips. Whilst TTH sees the files as duplicates, CloneSpy uses a different approach and sees them as slightly different files. I think the difference is the creation or modified date of the file. This happens quite rarely and is not a major issue but I thought I'd raise it.

Nick-V

petestrash (petestrash) wrote :

Yes, I agree with Nick. I don't think the issue Arabesque is talking about is the same issue I raised here.

The issue we have is that the queue is not checked for duplicates in share when DC++ is started, so that TTHs manually added to the queue will still be downloaded even if they already exist in the share.

I have not had any issues with downloads added to the queue by DC++ having the same TTH as those in my share. But as Nick has said, it is possible to download what looks like the same file but in some small way is different, so the TTH is different. This is not an issue with DC++ but with people changing files.

Peter.

Arabesque (pleidonius) wrote :

First of all, thanks for your kind words of enlightenment :) I'm not very much of a technical genius when it comes to these kinds of matters.
But I'll try to be a bit more specific about my "case", maybe it adds to solving the issue in the end...
First of all, I haven't tested this with video files, I only queue music. I always queue files in DC++ by entering search strings in the 'search' window, or by browsing other users' filelists. I download only group releases, with the intent of keeping my share clean and being able to have a good overview of what I already have and what not. The releases that were queued unintentionally (since they're already in my share) actually look exactly the same: all the file names and the directory name are exactly the same. So the files haven't been renamed or anything. If I understood the above correctly, something in the TTH must have changed so they got queued anyway? Maybe the script can be adapted so that not (only) the TTH but the file name structure itself is checked (too)? Or will that generate excess data traffic in DC++? Does anyone know if there exists any external tool/freeware that can do the comparison between queue.xml and my share?

Nick-V (nick-veit) wrote :

Your case is definitely different from this thread, which is about WHEN the queue is checked. You only add files to the queue using DC search etc.

For info, TTH is much better than a filename and date check...TTH is a unique code that represents the file's content in such a way that the same file will not be downloaded even if it has been renamed. Further, music and video (and any other) files are theoretically the same...the only reason I mentioned video files is just in case much larger files (perhaps hundreds of MB) react differently sometimes with TTH.

You say the files look the same due to filename and directory name (and perhaps size too?). This does not mean they have the same content, and TTH is there to note any small difference even if you do not detect it. You can look at the TTH of two example files for yourself in DC (probably) and see if they are regarded as the same.

We really don't want the filename checked at all! A file is the same if the content is the same. TTH is about AVOIDING duplicates irrespective of name.
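
To make the point about content versus name concrete, here is a rough sketch of a tree hash over file contents. It is not real TTH: SHA-256 is swapped in for the Tiger hash because Tiger is not in Python's standard library, so the resulting values will not match anything DC++ shows. The 1024-byte leaf size and the 0x00/0x01 leaf/node prefixes follow the tree hash layout as I understand it; `tree_root` and the file names are made up for the example.

```python
import hashlib

BLOCK = 1024  # leaf block size in bytes


def tree_root(data: bytes) -> str:
    """Root of a Merkle tree over `data`, using SHA-256 as a stand-in for Tiger."""
    # Leaves: hash each block with a 0x00 prefix (an empty file gets one empty leaf).
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)] or [b""]
    level = [hashlib.sha256(b"\x00" + block).digest() for block in blocks]
    # Combine pairs with a 0x01 prefix until a single root is left.
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired node is promoted unchanged
        level = nxt
    return level[0].hex()


original = b"x" * 5000
files = {
    "Artist - Track.mp3": original,                     # the file in the share
    "renamed_copy.mp3": original,                       # same bytes, new name
    "Artist - Track (edit).mp3": original[:-1] + b"y",  # one byte changed
}
roots = {name: tree_root(content) for name, content in files.items()}
```

The renamed copy gets the same root as the original, while the one-byte edit gets a different root. The file name never enters the calculation, which is why a renamed file is still recognised and a "visually identical" but edited file is not.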

To find and delete duplicates already downloaded try CloneSpy which uses a similar approach to find duplicates irrespective of name etc.

Arabesque (pleidonius) wrote :

ok, hang the "ass" sign around my neck. What happened is the duplicate *directory names* DO appear in my download queue, but they *only* contain the .sfv/nfo/jpg files (or any of those), the actual song files aren't queued!! So I assume that the actual files won't be downloaded. Sorry but this was very confusing to me at first! I suppose that DC will however create the queued dirs in my download folder, so after a while all directory names of stuff I ever downloaded might be in my download directory again, always remaining there as empty directories, forever? Or am I wrong again? All this put aside, I must say the DC++ people do a TREMENDOUSLY good job!! So my issue of the day seems to be solved! Have a good one!

kind regards Philip

petestrash (petestrash) wrote :

If you have a folder in your queue that has a file that is not in your share, it will create that folder and add the missing files.

Once you have moved the files from the download folder you can delete the now empty subdirectory. DC++ will make new subdirectories in your downloads folder if it needs to.

Peter.

petestrash (petestrash) wrote :

Also, if you download Google's Picasa software, it has an experimental feature to find duplicate files which is pretty good at finding duplicate pictures and videos regardless of date or name, and it includes thumbnails.

Peter.

Arabesque (pleidonius) wrote :

Thanks for the reply :) But in the case of this Picasa for instance, can it compare my QUEUE.xml file (files to be downloaded) to my SHARE (already downloaded and sorted files)? I assume with this Picasa, I would only be able to check for doubles AFTER I downloaded the files/directories in question? If that's the case, then it can't help me a lot for what I'd need to do. I would need to check my shared files against the queue.xml file while the files are still in the queue, preferably before I use up all my download limit.

In reply to your previous post: for the releases that are already in my hashed share, if I re-queue them (assuming they get filtered out by the "don't download files already in share" option in the DC++ settings), the dir names are put in the queue regardless, and a .sfv or .nfo file is still added to my download queue (even though I already have the full release, which is not missing the sfv/nfo!). So for anything in my share that I re-queue, a new and nearly empty dir with an .sfv/.nfo/jpg will be created in my download folder. This means that after a while, I have a huuuuge bunch of (almost) empty dirs that will just keep sitting there in my download folder. I'm not gonna manually remove them... so the point is, these dirs will stay there forever without any purpose. And it's sort of useless that they are created in the first place, since I already have them in my share. So maybe that is something to consider?

petestrash (petestrash) wrote :

I am not aware of anything that can check the TTHs in your queue against those in your share. If there was, it would make this bug less of an issue.
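
For what it's worth, such an external check could look roughly like the sketch below. It is untested against a real queue.xml: the `<Downloads>`/`<Download Target=... TTH=...>` layout is an assumption about the file format, `prune_queue` is a made-up helper, and how to obtain the set of TTHs in your share is left open here.

```python
import xml.etree.ElementTree as ET

# Assumed layout of the queue file; the real queue.xml may carry more
# attributes and child elements than this sample shows.
SAMPLE_QUEUE = """<Downloads>
  <Download Target="C:/dl/one.mp3" TTH="AAAA"/>
  <Download Target="C:/dl/two.mp3" TTH="BBBB"/>
  <Download Target="C:/dl/three.mp3" TTH="CCCC"/>
</Downloads>"""


def prune_queue(queue_xml: str, shared_tths: set) -> tuple:
    """Return (pruned XML text, list of removed targets)."""
    root = ET.fromstring(queue_xml)
    removed = []
    # Iterate over a copy of the children, since we remove items while looping.
    for item in list(root.findall("Download")):
        if item.get("TTH") in shared_tths:
            removed.append(item.get("Target"))
            root.remove(item)
    return ET.tostring(root, encoding="unicode"), removed


# "BBBB" is both queued and shared; "ZZZZ" is shared but not queued.
pruned_xml, removed_targets = prune_queue(SAMPLE_QUEUE, {"BBBB", "ZZZZ"})
remaining = [d.get("TTH") for d in ET.fromstring(pruned_xml).findall("Download")]
```

As with manual edits to queue.xml, something like this would only be safe to run on the real file while DC++ is not running, and on a backup copy first.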

Picasa and CloneSpy were only suggested so you could check your current downloads folder for duplicates. This way you can see if DC++ really has a problem with the same file being downloaded, or just that both files are similar but not technically the same.

Regarding re-queuing, again the only reason the .sfv/.nfo/jpg files will be downloaded again and a new folder created is because they are not exactly the same as what you already have.

Peter.

petestrash (petestrash) wrote :

Just wondering if there are plans to fix this bug?

Peter.

petestrash (petestrash) wrote :

It's been a year and a half since this issue was confirmed.

Just thought I would check if there is any chance of getting DC++ to check the download queue for duplicates once the file hash has completed (or a manual option on the file menu).

Thanks,

Peter.

petestrash (petestrash) wrote :

Just thought I'd bump this, as this is still an issue in 0.799.

Any chance of getting DC++ to automatically check the queue after hashing has completed?

If duplicates are found, remove them from the queue to stop duplicates being downloaded (if checked in settings) and bandwidth being wasted.

Thanks,

Peter

tags: added: v0.799
petestrash (petestrash) wrote :

Just re-read the bug FAQ.

Sorry, I shouldn't have bumped this bug.

petestrash (petestrash) wrote :

Have just confirmed that this issue still exists in V0.811.

Fredrik Ullner (ullner) wrote :

Attached patch will, upon start of DC++, check whether files are in the share and remove them from the queue.

I am hesitant regarding checking the queue when a file has finished hashing, though.

Nick-V (nick-veit) wrote :

Great to see this issue attended to...looking forward to a compiled version so I can test it...not sure how to use these diff files

poy (poy) wrote :

as good a solution as any, and it is simple enough to express. feel free to push it after having attended to the following:
1) use %1% instead of %s in the log message or the substitution won't carry into translations.
2) remove the dot in the log message for consistency.
3) test with large queue files to make sure the additional check doesn't make the loading time unreasonable. perhaps find some clever way to bypass the check if loading starts taking too long...

Fredrik Ullner (ullner) wrote :

Pushed to rev 3367. I tested with a queue of 10k files, and it didn't take any further time than it took to start up normally.

Nick-V: You can check out the builds directory: http://builds.dcbase.org/ Built versions of in-development DC++ are uploaded there (eventually at least).

Changed in dcplusplus:
status: Confirmed → Fix Committed
petestrash (petestrash) wrote :

Thanks very much for this, Fredrik!

I have downloaded rev 3367, and on the first run it removed 2000 files from my queue of 6000.

I manually added 6000 duplicates to the queue, and restarted DC++ which still had the correct 4000 files only.

It did not take much longer on startup to remove those 6000 duplicates.

I am very grateful that this has been fixed.

Fredrik Ullner (ullner)
tags: removed: download duplicates queue.xml share v0.707 v0.799
tags: added: core
petestrash (petestrash) wrote :

Does the change of tags mean the fix will be included in V0.832 when it is released?

Fredrik Ullner (ullner) wrote :

The change of tags was so that it was easier to keep track of bugs in the tracker; I did a massive sweep of many other bugs, too.

Having said that, this functionality will be included in the upcoming version (whenever that is released). (Unless something is discovered about it that requires the change to be removed, but I don't foresee that.) Any bugs or feature requests whose status is marked as "Fix Committed" will be incorporated in the next version.

petestrash (petestrash) wrote :

That's great, thanks. Very happy to have this fixed.

poy (poy) wrote :

Fixed in DC++ 0.840.

Changed in dcplusplus:
status: Fix Committed → Fix Released