ArchiveBox or similar for shared archiving in a research project

10 hours ago by Stopwatch1986 to c/selfhosted

I am one of a network of academic researchers from around the world collecting media market data. One problem is that referenced sources often disappear, which makes later validation difficult or impossible. So I am thinking of recommending that we self-host something like archive.org, where affiliated researchers can submit their web references and have the sources archived efficiently in a central project repository. That would preserve validation and continuity when web-hosted text and files disappear or researchers leave.

I have been looking at ArchiveBox. If you have experience with it or a similar solution, would it fit the bill? The important things are efficiency for researchers submitting and retrieving pages and files, and openness in structure and formats, so that the archive remains useful if ArchiveBox or a similar tool disappears. FOSS of course means you can't be locked out anyway.
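
As an illustration of what "openness in structure and formats" could look like in practice, here is a minimal Python sketch of a tool-independent manifest kept alongside the archived files. It is not ArchiveBox's own format or API: the function names, the record fields, and the one-JSON-object-per-line layout are all assumptions made for the example. The idea is only that each submitted reference gets a plain-text record (normalized URL, submitter, timestamp, SHA-256 of the saved bytes) that stays readable whatever archiving tool produced the copy.

```python
import hashlib
import json
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case scheme and host and drop the fragment so trivially different links match."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def make_record(url: str, submitter: str, payload: bytes) -> dict:
    """Build one manifest entry; `payload` is whatever bytes were saved for the URL."""
    return {
        "url": normalize_url(url),
        "submitter": submitter,  # hypothetical field: who submitted the reference
        "archived_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def append_record(manifest_path: str, record: dict) -> bool:
    """Append the record as one JSON line unless the same URL and hash are already listed."""
    try:
        with open(manifest_path, encoding="utf-8") as fh:
            seen = {(r["url"], r["sha256"]) for r in map(json.loads, fh)}
    except FileNotFoundError:
        seen = set()
    if (record["url"], record["sha256"]) in seen:
        return False
    with open(manifest_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return True
```

A second submission of the same page with unchanged content is skipped, while a changed page (different hash) gets a new line, so the file doubles as a simple change history that anyone can read with a text editor.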

irmadlad 6 points 10 hours ago

I use ArchiveBox occasionally to archive websites into browsable, offline copies. The copies stay usable even if the original data disappears online, and regardless of whether ArchiveBox is still in operation after the archiving finishes, provided of course that you persist the data locally. I've archived several self-hosted sites because they contained data I wanted to conserve for personal use at a later date. It does the job quite thoroughly, though obviously large sites take a little time to ingest. It might be worth spinning up a Docker instance and putting it through its paces to see if it fits your criteria.

