Web crawling

I'm working on creating an archive. One of my goals right now is saving every Wikipedia page related to tobacco. That means cigarette cards / baseball cards, kiseru pipes, cigars, etc. Everything. I've been copying the links into a Google Doc for a couple days, and I've been thinking the whole time that it would be nice if it were automated. How feasible would it be to create a program that crawls all of the pages? I eventually want to download them and put them in cold storage. I'm much more tech literate than most, but I am not a computer programmer, so I doubt that I could do it myself, but how easy would it be for me to find someone who could? Of course, I'd be willing to pay a fee.


  1. 2 years ago
    Anonymous

    Just download the entire Wikipedia. Less headache and you get more encyclopedia.

  2. 2 years ago
    Anonymous

    You know you can just download Wikipedia right?
    Like it's a feature they offer

  3. 2 years ago
    Anonymous

    lmao you can literally just download the whole wiki off their own site

  4. 2 years ago
    Anonymous

    https://www.kiwix.org/en/
    https://library.kiwix.org/

  5. 2 years ago
    Anonymous

    just use wget

  6. 2 years ago
    Anonymous

    One thing I never got about Kiwix: do they do delta updates, or do you have to download 60GB of data every time they have a new archive release?

    • 2 years ago
      Anonymous

      Yes you technically have to redownload every time to see updated articles. I just don't give enough of a shit to get them

  7. 2 years ago
    Anonymous

    thank you based anon for making smoking more enjoyable

    i hope you read them too

    • 2 years ago
      Anonymous

      Most of what I've been doing has not been fun reading. It's more like the type of reading you do for a school project. If I wasn't copying links constantly, I'd have a good time going down rabbit holes and taking in information. Thanks for the kind thought.

      >Just download the entire Wikipedia. Less headache and you get more encyclopedia.

      >You know you can just download Wikipedia right?
      >Like it's a feature they offer

      >lmao you can literally just download the whole wiki off their own site

      Yes, I'm aware of this. The problem with downloading all of Wikipedia is that people might have trouble finding stuff related to tobacco in the heap. I guess I might as well do it anyway. With it all downloaded, I guess I could leave that work to someone else in the future.

      • 2 years ago
        Anonymous

        you can download individual pages too
        what is your goal?

        • 2 years ago
          Anonymous

          He's probably having a manic episode where he thinks the history of tobacco use will be scrubbed from the internet.

          • 2 years ago
            Anonymous

            boomers are schizophrenic like that
            besides, tobacco history is a meme and wikipedia barely covers anything

          • 2 years ago
            Anonymous

            No, I am not manic. I am a translator of ancient languages and an amateur archivist. Tobacco is an interest of mine. I have already done some archival work, and based on my knowledge of ancient history and experience with archiving, I'm aware of how much is lost to time, even in very brief spans of time. My reason for wanting to save a bunch of tobacco-related Wikipedia articles is that Wikipedia is a very general resource that cites sources that are more rigorous and in-depth.
            In the United States, there is an anti-tobacco push happening. YouTube has banned and threatened to ban tobacco-related channels. The biggest snus reviewer got banned because he mentioned where you can buy snus. Other channels got strikes for the same thing. A nasal snuff store in the UK that has shipped to Americans for years was told by FedEx, without warning, that their packages will no longer be shipped. They returned packages to this company that had been in transit for weeks. The Biden administration is criminalizing Juuls and is pushing to limit the amount of nicotine in cigarettes. This all has happened in the last month. Some of this stuff, like content on YouTube, could be gone tomorrow, which is why I'd like to save it.

          • 2 years ago
            Anonymous

            nicotine addiction will be eradicated and there's nothing you can do about it

          • 2 years ago
            Anonymous

            doh-ho-ho-no no no I don't think so
            there are a bunch of people now who start douche flutin without even having smoked first

          • 2 years ago
            Anonymous

            Prohibition actually increased alcohol consumption shortly before it was overturned [1]
            I thought we learned from history.
            We were even starting to legalize marijuana.
            Evidently, Democrats are brain dead and need to give even more votes to Republicans as they crash the economy.

            [1] https://www.cato.org/policy-analysis/alcohol-prohibition-was-failure

          • 2 years ago
            Anonymous

            stop being so schizo grandpa
            there is no content on youtube worth saving
            postal companies have never shipped tobacco products internationally because of excise reasons, they just didnt know or didnt care about snus until now, likely because some alphabet agency complained
            >Beginning on June 29, 2010, the Postal Service will no longer accept or transport any package that it knows, or reasonably believes, to contain nonmailable smokeless tobacco or cigarettes, unless covered by one of the defined exceptions.

            do you realise you can just grow tobacco yourself and it will be infinitely better than the factory farm produced stuff in cigarettes and snus
            both are a poor people thing anyway

          • 2 years ago
            Anonymous

            >you can just grow tobacco yourself and it will be infinitely better than the factory farm produced stuff in cigarettes
            Not OP but that's simply not true. Blending and curing tobacco takes a lot of expertise, chances are if you make your own it'll taste like shit.

          • 2 years ago
            Anonymous

            I smoked natural tobacco that was grown and dried by morons in Papua New Guinea and I smoked it as a cigarette rolled in newspaper
            Smelled and tasted way better than commercial ciggies

          • 2 years ago
            Anonymous

            basically this

            >I smoked natural tobacco that was grown and dried by morons in Papua New Guinea and I smoked it as a cigarette rolled in newspaper
            >Smelled and tasted way better than commercial ciggies

            people pretend its extremely difficult but its actually extremely easy, its just a meme to keep up the "premium" image of the product
            tobacco is an easy plant, many of the pests dont even exist in temperate climates
            drying is just making sure it doesnt mold and releases enough ammonia to be smokable
            blending is just smelling it and putting the small leaves inside and the big leaves on the outside

            ez pz, good luck

          • 2 years ago
            Anonymous

            Not a grandpa. There's plenty of stuff on YouTube worth saving.

            >Not OP but that's simply not true. Blending and curing tobacco takes a lot of expertise, chances are if you make your own it'll taste like shit.

            As that anon said, processing tobacco is not as easy as it sounds. I'm currently venturing into making snuff from raw leaf for 15 people, so I'm in a position where I can comment. Growing your own tobacco is also difficult, but I might grow some next year. I wouldn't say cigarettes and snus are for poor people. Snus certainly isn't, but your thinking isn't very productive and is quite presumptuous. Today, I took some nasal snuff and smoked a cigar (pic rel).

          • 2 years ago
            Anonymous

            >cigarettes are for poor people
            where i live the cost of a cigarette, A cigarette, shot past $1 a few years ago, i don't even know what it is now because i can't afford them anymore

        • 2 years ago
          Anonymous

          First, I am getting all the links. I put an "x" next to a page once I have collected every relevant tobacco link from it. For instance, the article about pipes might link to various pipe manufacturers. After getting all of the manufacturers into the Google Doc, I'll put an x next to the URL for the Wikipedia pipes page. After that, I'll go check each of the pipe manufacturers' pages for relevant article links. I will then download all the pages I have in the Google Doc. That's my goal for this section of the project. My greater goal is to preserve any valuable information about tobacco and things related to tobacco, like pipes for instance.
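
The manual routine described above (collect a page's relevant links, mark the page with an "x", then repeat for each newly found page) is essentially a breadth-first traversal, so it can be sketched in a few lines of Python. This is only a rough illustration, not a finished crawler: `crawl_links`, `get_links`, and `is_relevant` are hypothetical names introduced here. In practice `get_links` might wrap a lookup against a downloaded dump or the MediaWiki API's `prop=links` query, and `is_relevant` stands in for the hard part, deciding what counts as tobacco-related, which probably still needs a human pass. The toy graph at the bottom is made-up data so the sketch runs offline.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_links(
    seeds: Iterable[str],
    get_links: Callable[[str], Iterable[str]],
    is_relevant: Callable[[str], bool],
    max_pages: int = 1000,
) -> List[str]:
    """Breadth-first walk over page titles.

    Returns pages in the order they were "x"-ed off, i.e. the order in
    which every relevant link on them had been collected.
    """
    seeds = list(seeds)
    queue = deque(seeds)   # pages still waiting to be checked
    seen = set(seeds)      # everything already written in the "doc"
    exhausted = []         # pages that have earned their "x"

    while queue and len(exhausted) < max_pages:
        page = queue.popleft()
        for link in get_links(page):
            # Only follow links that look on-topic and are not listed yet.
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
        exhausted.append(page)
    return exhausted


# Toy, hand-written link graph (hypothetical data, no network access).
toy_graph = {
    "Tobacco pipe": ["Kiseru", "Dunhill", "Wood"],
    "Kiseru": ["Tobacco pipe", "Japan"],
    "Dunhill": ["Cigar"],
    "Cigar": [],
}
on_topic = {"Tobacco pipe", "Kiseru", "Dunhill", "Cigar"}

result = crawl_links(
    ["Tobacco pipe"],
    get_links=lambda page: toy_graph.get(page, []),
    is_relevant=lambda title: title in on_topic,
)
print(result)
```

The `max_pages` cap is there because a loose relevance test on Wikipedia will happily wander into the whole encyclopedia. The returned list is what would then be handed to whatever does the actual downloading.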

          • 2 years ago
            Anonymous

            >no i'm not compulsively saving every scrap of information on tobacco use in case it gets scrubbed, even though I literally said I'm gonna do that

          • 2 years ago
            Anonymous

            Sounds comfy.

          • 2 years ago
            Anonymous

            this is the most moronic way to do it possible

          • 2 years ago
            Anonymous

            Yes, I know. That's why I am asking for help.

            >how are you going to cold store

            M-DISC is what I am primarily relying on for longevity. The discs would mirror what's on hard drives that I will refresh regularly. I'm beginning to follow the 3-2-1 backup rule.

          • 2 years ago
            Anonymous

            use a counting machine! a wikipedia page is like 50mb

          • 2 years ago
            Anonymous

            Counting machine?

            >disk rot

            Yeah, but that will be an issue 1,000 years from now.

          • 2 years ago
            Anonymous

            a counting machine yes, but you need to go further
            1000 years is like, a day to god. no time at all

          • 2 years ago
            Anonymous

            all optical media deteriorates much quicker than they want you to believe

          • 2 years ago
            Anonymous

            disk rot

  8. 2 years ago
    Anonymous

    how are you going to cold store

  9. 2 years ago
    Anonymous

    Not sure what's more stupid, OP's obsession with tobacco and how "it's getting scrubbed" or the way he's going about it

  10. 2 years ago
    Anonymous

    one million
    wikipedia s
    a day! one!
    million! aa!!
    and then, you tabulate
    (one million
    wikipedia pages a day)

  11. 2 years ago
    Anonymous

    STOP CRAWLING WIKIPEDIA YOU moron Black person
    USE THE DUMPS

  12. 2 years ago
    Anonymous

    i would much rather sort through all of the tobacco pages and the links there (and chat pages etc.) and delete, than sort through them saving. thats all though. i dont know about any dumps
    you could also host some version of them as a torrent, and on some sites, while you use tapes or whatever for cold storage

  13. 2 years ago
    Anonymous

    Since the gallery-dl thread is nowhere to be seen, I'll just ask here: is that script or youtube-dl capable of downloading whole channels?

    • 2 years ago
      Anonymous

      youtube-dl, or at least yt-dlp, has that built in

  14. 2 years ago
    Anonymous

    Are there any doomsday wikis which have recipes/blueprints of day to day items?

  15. 2 years ago
    Anonymous

    bump
