I’m in the process of setting up a proper backup solution. However, over the years I’ve accumulated a few copy-pasted home directories from different systems as a quick and dirty solution. Now I have to pay down my technical debt and remove the duplicates. I’m looking for a deduplication tool that will:
- accept a destination directory
- source locations should be deleted after the operation
- if a file’s content is the same, delete the redundant copy
- if a file’s content is different, move it and change the name to avoid a name collision

I tried doing this in Nautilus, but it does not look at the file’s content, only the file name. E.g. if two photos have the same content but different names, it will still create a redundant copy.
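For what it’s worth, the wishlist above can be sketched in plain shell. This is a minimal, hypothetical sketch rather than a hardened tool: the directory names `demo-src` and `demo-dest` are purely illustrative, and it assumes GNU coreutils (`cmp`, `sha256sum`).

```shell
# Demo of the content-based move: exact duplicates are deleted,
# name collisions with different content are renamed.
mkdir -p demo-src demo-dest
printf 'same\n'      > demo-src/a.txt
printf 'same\n'      > demo-dest/a.txt   # exact duplicate -> source gets deleted
printf 'different\n' > demo-src/b.txt
printf 'other\n'     > demo-dest/b.txt   # name clash, new content -> rename

find demo-src -type f | while IFS= read -r f; do
    name=$(basename "$f")
    target="demo-dest/$name"
    if [ -e "$target" ] && cmp -s "$f" "$target"; then
        rm "$f"                          # identical content: redundant copy
        continue
    fi
    if [ -e "$target" ]; then
        # different content under the same name: prefix a short hash
        target="demo-dest/$(sha256sum < "$f" | cut -c1-8)-$name"
    fi
    mv "$f" "$target"
done
```

Afterwards `demo-src` is empty of files and `demo-dest` holds three unique files: `a.txt`, the original `b.txt`, and a hash-prefixed copy of the clashing `b.txt`.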
Edit:
Some comments suggested using duperemove, which uses btrfs’ deduplication feature. This replaces identical file content with references to the same location on disk. This is not what I intend; I want to remove the redundant files completely.
Edit 2: Another quite cool solution is to use hardlinks. A tool replaces all occurrences of the same data with a hardlink. Then the redundant directories can be traversed, and whatever is a link can be deleted. The remaining files will be unique. I’m not going for this myself, as I don’t trust myself to write a bug-free implementation.
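The hardlink idea from this edit can be illustrated roughly as follows. This is a hypothetical sketch assuming GNU find (`-links`, `-delete`); the directory names `keep` and `redundant` are made up for the demo.

```shell
# Simulate what a hardlinking dedup tool would leave behind:
# identical content replaced with hardlinks across both trees.
mkdir -p keep redundant
printf 'payload\n' > keep/file1
ln keep/file1 redundant/file1      # hardlink created by the dedup step
printf 'unique\n'  > redundant/only-here

# Traverse the redundant tree and delete everything that is just
# another link (link count > 1); unique files survive.
find redundant -type f -links +1 -delete
```

After the `find`, `redundant/file1` is gone, `redundant/only-here` survives, and `keep/file1` is untouched (its link count simply drops back to 1).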
Make sure to take a first backup before you run deduplication, just in case it goes sideways.
Restic
hardlink
The most underrated tool, and it’s frequently already installed on your system. It recognizes BTRFS. Be aware that there are multiple versions of it in the wild.
It is unattended.
This will indeed save space, but I don’t want links either. I want unique files.
Take a look at Borg. It is a very capable backup tool with built-in deduplication.
What about folders? Because when you have duplicated folders (sometimes with a lot of nested subfolders), a file deduplicator will take forever. Do you know of any software that works on duplicate folders?
What do you mean that a file deduplicator will take forever if there are duplicated directories? That the scan will take forever, or that manual confirmation will take forever?
That manual confirmation will take forever
I believe ZFS has deduplication built in, if you want a separate backup partition. Not sure about its reliability, though. Personally, I just have a script that keeps a backup and an oldbackup, and they are both fairly small. I keep a file in my home dir called excluded for things like Linux ISOs that don’t need to be backed up.
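The commenter’s rotate-plus-exclude script isn’t shown, but it might look roughly like this. A hypothetical sketch assuming rsync; the `home`, `backup`, and `oldbackup` paths and the `*.iso` pattern are all illustrative.

```shell
# Demo home directory with an exclude list, as described above.
mkdir -p home
printf 'keep me\n'  > home/notes.txt
printf 'huge iso\n' > home/distro.iso
printf '*.iso\n'    > home/excluded      # patterns that need no backup

# Rotate: the current backup becomes oldbackup, then resync.
rm -rf oldbackup
if [ -d backup ]; then mv backup oldbackup; fi
rsync -a --exclude-from=home/excluded home/ backup/
```

After the run, `backup/notes.txt` exists while `distro.iso` was skipped by the exclude list.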
As said previously, Borg is a fully deduplicating incremental archiver, complete with compression. You can use relative paths temporarily to build up your backups and a full backup history, then use something like Pika to browse the archives to ensure a complete history.
I did not ask for a backup solution, but for a deduplication tool
Tbf you did start your post with
I’m in the process of starting a proper backup
So you’re going to end up with at least a few people talking about how to onboard your existing backups into a proper backup solution (like Borg). Your bullet points could probably be turned into a shell script with sync, but why? A proper backup solution with a full backup history is going to be way more useful than dumping all your files into a directory and renaming in case something clobbers. I don’t see the point in doing anything other than tarring your old backups and using `borg import-tar` (docs). It feels like you’re trying to go from one half-baked, odd backup solution to another, instead of just going with a full, complete solution.
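The tar-then-import step could look roughly like this. A hypothetical sketch: the `old-backup` directory, repo path, and archive name are illustrative, and the import itself assumes an already-initialised Borg repository.

```shell
# Pack an old ad-hoc backup directory into a tar archive that
# `borg import-tar` can ingest.
mkdir -p old-backup
printf 'data\n' > old-backup/file.txt
tar -cf old-backup.tar -C old-backup .

# Then, with an initialised Borg repository (path illustrative):
#   borg import-tar /path/to/repo::imported-old-backup old-backup.tar
```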
I don’t actually know, but I bet that’s relatively costly, so I would at least try to be mindful of efficiency, e.g.:

- use `find` to start only with large files, e.g. > 1 GB (depends on your own threshold)
- look for a “cheap” way to find duplicates, e.g. exact same size (far from perfect, yet I bet it is sufficient in most cases)

then, after trying a couple of times,

- find a “better” way to identify duplicates, e.g. an SHA-1 hash (quite expensive)
- lower the threshold to include more files, e.g. > 0.1 GB

and possibly heuristics, e.g.

- directories where all filenames are identical, maybe based on locate/updatedb, which is most likely already indexing your entire filesystem
Why do I suggest all this rather than a tool? Because I bet a lot of the decisions have to be made manually.
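The staged approach above (size threshold first, cheap grouping, expensive hashing last) can be sketched like this. A hypothetical demo assuming GNU find and coreutils; the `scan` directory, the 1 KiB threshold, and the file names are all illustrative stand-ins for real data and a real > 1 GB threshold.

```shell
# Build a tiny demo tree: two true duplicates, one same-size decoy,
# and one file below the size threshold.
mkdir -p scan
dd if=/dev/zero    of=scan/big1 bs=1k count=2 2>/dev/null
cp scan/big1 scan/big2                                      # true duplicate
dd if=/dev/urandom of=scan/big3 bs=1k count=2 2>/dev/null   # same size, different content
printf 'small\n' > scan/tiny                                # below threshold: skipped

# Stage 1: candidates = files above the size threshold (cheap).
# Stage 2: hash only those (expensive), then report repeated hashes.
find scan -type f -size +1k -print0 \
    | xargs -0 sha256sum \
    | sort \
    | uniq -w64 -D > dupes.txt
```

`uniq -w64 -D` compares only the 64-character hash prefix of each `sha256sum` line, so `dupes.txt` ends up listing just `big1` and `big2`; the decoy and the small file never get reported.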
fclones https://github.com/pkolaczk/fclones looks great, but I haven’t used it, so I can’t vouch for it.
If you use `rmlint` as others suggested, here is how to check the paths of the dupes: `jq -c '.[] | select(.type == "duplicate_file").path' rmlint.json`
FWIW, I just did a quick test with `rmlint`, and I would definitely not trust an automated tool to remove files on my filesystem, as a user. If it’s for a proper data filesystem, basically a database, sure, but otherwise there is plenty of legitimate duplication, e.g. `node_modules`, so the risk of breaking things is relatively high. IMHO it’s better to learn why there are duplicates on a case-by-case basis, but again, I don’t know your specific use case, so maybe it’d fit.

PS: I imagine it’d be good for a content library, e.g. ebooks, ROMs, movies, etc.