⏱ Heads Up, CodePlex Archive Going Down Soon! <420 Hours

Sure thing, the CodePlex scraper is about halfway done and sitting at 300 GB right now. I didn't bother scraping the HTML; sorry if anyone felt nostalgic for that.

As for the Microsoft docs website, we should be able to rip that with a tool someone's already made, right? A basic link crawler should do the trick. CodePlex was trickier because all of its content was only reachable through a search bar.

I appreciate your thanks, but you don't need to thank me. Now that you point it out, I'd be pretty sad if the XNA documentation we had disappeared forever. And >5 GB of it?? So thank you for pointing it out. Let me know if you can think of anything else that needs to be preserved.

I was active in the Flashpoint Flash preservation community as well. One day I dropped 15,000 Flash files they didn't have on them and left. XD

I also made them a tool to automatically capture Flash screenshots. It would open 50 different SWFs in 50 Flash players at once, wait 10 seconds, save their buffers to PNGs, and close them. You should have seen that madness. That decrepit 18+ madness. It went on for hours and hours.
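
For anyone curious, the core of that tool looked roughly like this. This is a from-memory sketch in C#, not the original code; the projector exe name, the folder path, and the batch size are placeholders:

```csharp
// Rough sketch of the batch SWF screenshotter described above.
// Assumptions: a standalone Flash projector at "flashplayer.exe" and
// the SWFs sitting in C:\swfs. Windows-only (System.Drawing + user32).
using System;
using System.Diagnostics;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

class SwfScreenshotter
{
    [DllImport("user32.dll")]
    static extern bool GetWindowRect(IntPtr hWnd, out Rect rect);

    [StructLayout(LayoutKind.Sequential)]
    struct Rect { public int Left, Top, Right, Bottom; }

    static async Task Main()
    {
        string[] swfs = Directory.GetFiles(@"C:\swfs", "*.swf");

        // Work through the files 50 players at a time.
        for (int i = 0; i < swfs.Length; i += 50)
        {
            var players = swfs.Skip(i).Take(50)
                .Select(swf => (swf, proc: Process.Start("flashplayer.exe", $"\"{swf}\"")))
                .ToList();

            await Task.Delay(TimeSpan.FromSeconds(10)); // let everything render

            foreach (var (swf, proc) in players)
            {
                // Screenshot the area of the screen the player window occupies.
                if (GetWindowRect(proc.MainWindowHandle, out Rect r))
                {
                    using var bmp = new Bitmap(r.Right - r.Left, r.Bottom - r.Top);
                    using var g = Graphics.FromImage(bmp);
                    g.CopyFromScreen(r.Left, r.Top, 0, 0, bmp.Size);
                    bmp.Save(Path.ChangeExtension(swf, ".png"), ImageFormat.Png);
                }
                proc.Kill();
            }
        }
    }
}
```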


Using a program called Cyotek WebCopy, I'm scraping the entire docs.microsoft.com website. I have no idea how it will turn out. Fingers crossed for an easy project.

This is the best scraper I've used, actually. Once I'm done I'll try to use it to scrape archive.codeplex.com again just to see if it works. (I doubt it will; like I said, the content is only reachable through a search query.)


Here's a site rip that specifically targeted the XNA portion of the site. The bar on the left is missing, but otherwise the site is intact. (The XNA HTML is in docs.microsoft.com\en-us\previous-versions\windows\xna.)

I’m still doing a full site rip, but that will be over 10 GB. Maybe that will fix the bar.

What we could do, maybe, is take the opportunity to make our own index page? I don't care for the official one anyway. It should be easy if we automate it with a program, something like the sketch below.
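
A minimal sketch of what I mean, assuming the rip's folder layout from above and a quick-and-dirty regex grab of each page's title:

```csharp
// Walk every HTML file in the rip, pull its <title>, and emit one
// index.html linking to everything. Quick and dirty by design.
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

class IndexBuilder
{
    static void Main()
    {
        string root = @"docs.microsoft.com\en-us\previous-versions\windows\xna";
        var sb = new StringBuilder("<html><body><h1>XNA Docs Index</h1><ul>");

        foreach (string file in Directory
            .EnumerateFiles(root, "*.htm*", SearchOption.AllDirectories)
            .OrderBy(f => f))
        {
            var m = Regex.Match(File.ReadAllText(file),
                @"<title>\s*(.*?)\s*</title>",
                RegexOptions.IgnoreCase | RegexOptions.Singleline);
            string title = m.Success ? m.Groups[1].Value : Path.GetFileName(file);
            string rel = Path.GetRelativePath(root, file).Replace('\\', '/');
            sb.Append($"<li><a href=\"{rel}\">{title}</a></li>\n");
        }

        sb.Append("</ul></body></html>");
        File.WriteAllText(Path.Combine(root, "index.html"), sb.ToString());
    }
}
```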

Maybe someone who knows more about HTML can fix the bar. I know very little. Maybe we can figure out what part of the full site is needed to get that bar working.

https://mega.nz/file/T4YywbpT#XTbfhWupFWn3H9fjzuJ0qffaEiEop6lVcOK14Q7hv_0

I should just put this up on GitHub. Is Microsoft going to copyright takedown their own page?

“Don’t put that there!”


Looks like it is using JavaScript to generate it? Only took a quick look…

EDIT

@jamie_yello Dude, it's perfect! I can use 'search file contents' to search within the HTML files, perfect! For anyone wondering, I have downloaded and scanned the file, which took a good 64 seconds to do.

:sake:

EDIT

Cannot thank you enough!

EDIT

Concerning sites to archive, I can think of a few, such as RBW's site and the like, but I think we should try to ask for permission first? :grimacing:

122 days
2944 hours

to go…


Sure, I’ll ask for permission first I guess. Can’t imagine them saying no.

87793/108508 projects downloaded, 81% done, 628 GB so far. This is finally going to be done tonight. I was afraid it was going to end up being 8 TB and I would have to go out and buy a new hard drive, but no, I don't even have to uninstall any Steam games.

There was also the possibility that it would take a month (or 4 o_o) to finish, but thankfully that wasn't the case.


rbwhitaker brought up the fact that his site is already backed up by the Wayback Machine:

https://web.archive.org/web/20170611235510/http://rbwhitaker.wikidot.com/monogame-getting-started-tutorials

So that brings up the question: what XNA resources did we lose that might be in the Wayback Machine?


The WBM backs up text and some images (if below a certain size?), but not downloads…

What connection are you on lol, and epic…!

EDIT

The forum is always broken for me somehow…

@jamie_yello

The WBM crawls websites and just recursively scans them for links. The reason it couldn't handle CodePlex is that there is no way to "scan" a website for all of its files the way you might think. You can only find links within pages and request those. It can only find static, hard-coded links.

This is just generally how all crawlers work. They clumsily crawl around, following links until they've found everything they can.

It can’t scrape anything that’s behind a database, only the user interface elements.
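
Here's the whole idea in miniature, a bare-bones C# sketch (the start URL is just an example):

```csharp
// A crawler, reduced to its essence: fetch a page, regex out the href
// values, queue anything new on the same site, repeat. Content that only
// exists behind a search form never appears as an href, so a crawler
// like this (or the WBM's) simply never sees it.
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class Crawler
{
    static async Task Main()
    {
        var http = new HttpClient();
        var seen = new HashSet<string>();
        var queue = new Queue<string>();
        queue.Enqueue("https://archive.codeplex.com/"); // example start page

        while (queue.Count > 0)
        {
            string url = queue.Dequeue();
            if (!seen.Add(url)) continue; // already visited

            string html;
            try { html = await http.GetStringAsync(url); }
            catch { continue; } // dead link, move on

            // Only static, hard-coded links are discoverable this way.
            foreach (Match m in Regex.Matches(html, "href=\"(https?://[^\"]+)\""))
                if (m.Groups[1].Value.Contains("codeplex.com")) // stay on one site
                    queue.Enqueue(m.Groups[1].Value);

            Console.WriteLine($"{url} done, {queue.Count} queued");
        }
    }
}
```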


Oh, I meant the first bit; ignore the nagging about the forum being a bother, the reply button is broken…

EDIT

I am confusing myself now… time for some tea…


@jamie_yello By the way, you could use that page I linked originally for the MS Docs dump and save the page, which would give the complete menu list for reference at the least…

Just a thought…

It's in there (bb200104); it just doesn't have working links or themes for whatever reason.

https://web.archive.org/web/20200720065504/https://docs.microsoft.com/en-us/previous-versions/windows/xna/bb203916(v=xnagamestudio.10)

The one in the Wayback Machine works, but it's also missing the sidebar for whatever reason. Before we have to worry about using the content, we might as well at least wait for the official XNA documentation to go down. :slight_smile:

BTW, I have finished the scrape and made a torrent, so CodePlex is saved as long as this torrent lives.

(link to torrent, 750 GB)

Could someone do me a favor and make sure this torrent works? You don't have to download the whole thing, just confirm that it starts downloading.


I could trim the torrent down to the projects that were not migrated to GitHub. I might get around to that so I don't waste 600 GB on preservationists' hard drives, but that would also have to be automated through a program.

Sigh, unable to test this sadly…

GitHub?

Can you break the file up into the max size for Mega? All 750 GB in, say, 50 GB chunks?

EDIT

LOL nvm, it is max 50 GB for free accounts…

Wish there was some other way…

I don't think my torrent works anyway. I probably should have fixed it by now, but what I think I'll do is make a website that hosts all the data once CodePlex goes down.

I'll also curate the data so there's a downloadable version with all projects (750 GB) and a version with some of the biggest files (the ones that are on GitHub) removed. That should cut the size down to less than 30 GB.

I'll have it up before CodePlex goes down, but for now I'll be taking a break from this.

@MrValentine Let me know if you want me to put it all up on Mega in the meantime, so CodePlex isn't lost even if my house burns down or I die or get amnesia or something. I can do that pretty quickly.


Happy to store it; I'm currently sitting on roughly 30 TB empty… I think removing any projects is unimportant and would be an insane amount of work anyway.

Thanks for your skillset and efforts :heavy_check_mark::rice_ball:

Sure. The only reason I'm good at this is that I've made many web scrapers before, at least 5, for some reason. Every time I get a little better. I first started by downloading HTML as a string and using string.IndexOf() to look for data, then I realized APIs are a thing. I will say, I'm not sure whether a Selenium web scraper has the same limitations as an API-based scraper when it comes to API call restrictions. I'm thinking a Selenium scraper may be able to bypass call restrictions by virtue of imitating a user, but that's some techno jargon.
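
For the curious, that string.IndexOf() style of scraping looked roughly like this (the URL and the h1 markers are just for illustration; every site is different):

```csharp
// Download the page as one big string and walk it by hand.
// Fragile, but it's how my first scrapers worked.
using System;
using System.Net.Http;

var html = await new HttpClient().GetStringAsync("https://example.com/project");

// Find whatever tag wraps the data you want and cut it out.
int start = html.IndexOf("<h1>") + "<h1>".Length;
int end = html.IndexOf("</h1>", start);
Console.WriteLine(html.Substring(start, end - start));
```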

If you want to do a really accurate site rip, the most accurate you can, you have to use Selenium. It basically hooks into a modern web browser (Chrome) and gives you control over it through code. One of my favorite things to use. I tried to use it to automatically buy Bitcoin on Robinhood based on my machine learning models. XD That didn't turn out well.
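
A tiny taste of what driving Chrome through Selenium looks like in C# (needs the Selenium.WebDriver NuGet package and a matching chromedriver; the search box element name here is a guess, check the real page):

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using var driver = new ChromeDriver(); // launches a real Chrome window

driver.Navigate().GoToUrl("https://archive.codeplex.com/");

// Type into the search box and hit enter, exactly like a user would,
// which is why dynamic, search-only content is reachable this way.
var box = driver.FindElement(By.Name("search")); // element name is a guess
box.SendKeys("xna");
box.SendKeys(Keys.Enter);

// Every link on the results page is now fair game.
foreach (var link in driver.FindElements(By.TagName("a")))
    Console.WriteLine(link.GetAttribute("href"));
```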

Now I'm moving on to scraping Stocktwits comments to find the most accurate investors, storing all the data in a real SQL database as opposed to one bloated XML file that has to be rewritten in its entirety for any modification.
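
The appeal of the database is that each new comment is one cheap INSERT instead of a full-file rewrite. A minimal sketch with SQLite via Microsoft.Data.Sqlite (the table and column names are invented for illustration):

```csharp
using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=stocktwits.db");
conn.Open();

var create = conn.CreateCommand();
create.CommandText =
    "CREATE TABLE IF NOT EXISTS comments (author TEXT, body TEXT, posted TEXT)";
create.ExecuteNonQuery();

// Each scraped comment is appended in place, no rewriting the whole store,
// unlike serializing one giant XML document after every change.
var insert = conn.CreateCommand();
insert.CommandText =
    "INSERT INTO comments (author, body, posted) VALUES ($a, $b, $p)";
insert.Parameters.AddWithValue("$a", "someTrader");
insert.Parameters.AddWithValue("$b", "example comment text");
insert.Parameters.AddWithValue("$p", "2021-03-01");
insert.ExecuteNonQuery();
```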

Did you hear that the reason for GTA's long loading screens is one single, extremely poorly written method that parses a huge JSON file every time the game loads? It makes me hurt inside.

Anyway, hopefully 7-Zip and Mega will be done compressing/uploading the data… tomorrow.


Thanks for the insight :tea:

That company, oh boy… yes, they managed to make something somewhat impressive, but the fundamentals, and now the whole gambling system… nah, avoiding…


@jamie_yello Another one?

EDIT

Found something useful for you, @jamie_yello?