Firehawke, Haze
I've been looking at this more closely, and I'm concluding it'll be both easier and harder than I initially thought.
On the easier side, I have a source for the full names of a lot of the material out there. So I'm thinking I can use that as a starting point, and build a tree out of it. Or more specifically, use Neo4j to build a graph database out of it.
Main nodes would be the names, acting as parents; children would be the details of each "Rom"; and edges would carry filesizes/CRCs to differentiate between variants.
Then I take a directory full of those roms and write a utility to scrape each file's details, traverse the graph, insert the file if it hasn't been seen before, and move it to an appropriate directory.
Once done, I can dump the graph to a softlist and have most of the detail.
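Roughly what I have in mind for the scraping step, as a minimal Python sketch. It assumes filesize plus CRC32 is enough to tell variants apart, and it uses a plain in-memory dict as a stand-in for the graph (it is not Neo4j code). The names `fingerprint` and `file_variant` are made up for illustration.

```python
import zlib
from pathlib import Path


def fingerprint(path, chunk_size=65536):
    """Return (size_in_bytes, crc32_hex) for one file, reading it in chunks."""
    crc = 0
    size = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)  # running CRC over the whole file
            size += len(chunk)
    return size, f"{crc & 0xFFFFFFFF:08x}"


def file_variant(index, title, path):
    """Record a file under its title; return True only if the variant is new.

    `index` is a plain dict standing in for the graph:
    title -> {(size, crc): filename}.
    """
    key = fingerprint(path)
    variants = index.setdefault(title, {})
    if key in variants:
        return False  # same size and CRC already seen under this title
    variants[key] = Path(path).name
    return True
```

A caller would move the file to its destination directory only when `file_variant` returns True, and set it aside as a duplicate otherwise.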
But the hard part... there's a phenomenal number of variants. Every time I touch a collection of material, I find more of them. I think it's going to take me longer to collect the C64 content than to generate the softlist.
Compounding this, I can't tell what's a crack and what isn't without exploring every single file. At least, not in any way that I know of yet, but I wanted to run something past you.
The one thing I could think of: cracks have common intro screens, which means they should share common binary data. I could find those binary patterns, log them, search each file for them, and classify any file that matches as a crack.
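As a minimal sketch of that search, assuming the intro code sits in the file as a plain, unmodified byte run and the files are small enough to read into memory. The function names are made up, and the signature table would be filled from patterns I've logged by hand.

```python
def find_signatures(data, signatures):
    """Return the names of every known intro signature that occurs in data."""
    return [name for name, pattern in signatures.items() if pattern in data]


def classify(path, signatures):
    """Label one file 'crack' if any signature matches, else 'unknown'."""
    with open(path, "rb") as fh:
        data = fh.read()
    hits = find_signatures(data, signatures)
    # A miss only means no logged pattern matched, not that the file is clean.
    return ("crack" if hits else "unknown", hits)
```

`signatures` here is a dict mapping a label of my choosing to a `bytes` pattern, so a hit also tells me which intro matched.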
Thoughts? Does that seem feasible?