How do I use Open Source scrapers? (Selenium, Scrapy, etc.)

Noah@lemmy.dbzer0.com · edit-2 8 months ago

How do I use Open Source scrapers? (Selenium, Scrapy, etc.)

AndyMFK@lemmy.dbzer0.com · 8 months ago

I have quite an extensive history of scraping web sites for various data over the years, I’d be happy to help you out but I can’t really know how to help without knowing what website your trying to scrape, different sites have their own challenges (maybe behind a login, or using JavaScript to load content - in which case a http response won’t give you what you’re after, or any number of things really).

If you give me a link to a book you want to download as an example I can take a look and help guide you through it

aMockTie@beehaw.org · 8 months ago

100% this. Every website is different, though after doing this kind of thing for long enough, there are often common patterns and frameworks/libraries. Even general obfuscation can be reasonably reverse engineered with enough time and effort.

Kissaki@lemmy.dbzer0.com · edit-2 8 months ago

Depending on what you want to scape, that’s a lot of overkill and overcomplication. Full website testing frameworks may not be necessary to scrape. Python with it’s tooling and package management may not be necessary.

I’ve recently extracted and downloaded stuff via Nushell.

Requirement: Knowledge of CSS Selectors
Inspect Website DOM in Webbrowser web developer tools
1. Identify structure
2. Identify adequate selectors; testable via browser dev tools console document.querySelectorAll()
Get and query data

For me, my command line terminal and scripting language of choice is Nushell:

let $html = http get 'https://example.org/'
let $meta = $html | query web --query '#infobox .title, #infobox .tags' |  | { title: $in.0.0 tags: $in.1.0 }
let $content = $html | query web --query 'main img' --attribute data-src
$meta | save meta.json

or

1..30 | each {|x| http get $'https://example.org/img/($x).jpg' | save $'($x).jpg'; sleep 100ms }

Depending on the tools you use, it’ll be quite similar or very different.

Selenium is an entire web-browser driver meaning it does a lot more and has a more extensive interface because of it; and you can talk to it through different interfaces and languages.

lacaio da inquisição@lemmy.eco.br · 8 months ago

It needs a driver and the web-browser to be executed in headless mode. For Chrome that’s chrome-driver. You can get it here.

To make a script for it, I recommend talking to a LLM. I have asked it to build scrapers before, so it does the job.

If you want a practical use of Selenium being demonstrated, you can see it in LucidWebSearch plugin for Oobabooga.

Noah@lemmy.dbzer0.com · 8 months ago

I recommend talking to a LLM

Any recommendations? Not chat-GPT

Also thanks for the help so far!

umami_wasabi@lemmy.ml · 8 months ago

There is no simplification that you’re looking for. It seems you don’t have a programing background. If you really need to scrape something, you need to learn a programing language, HTTP, HTML, and maybe javascript. AFAIK, there is no easy way or point and click scrapper building tool. You will need to invest time and learn. Don’t worry, you should be able to get it done in 2-3 months if you do invest your time in.

Noah@lemmy.dbzer0.com · 8 months ago

I don’t want a point and click scraper, just a guide that isn’t assuming I have background + simple mans terms for easier reading. Thanks for believing in me to be able to build the basic skills necessary! Much appreciated :3

umami_wasabi@lemmy.ml · edit-2 8 months ago

I don’t a single guide for you but I can layout a road map.

A programming language. I prefer Python.
Basic HTML syntax and CSS selectors
HTTP, specifically methods, status code (no need to memorize all cuz you can go look it up), and cookies

After you got those foundation ready, you can go on and try to build a webscraper. I advice aginst using Scrapy. Not because it is bad but too overwhelming and abstracted for any beginner. I will instead advice you use requests for HTTP, and BeautifulSoup4 for HTML parsing. You will build a more solid foundation and transition to scrapy later when you need those advanced function.

When you get stuck, don’t afraid to pause on your attempt and read tutorials again. Head to the Python Community on Discord to get interactive help. We welcome noobs as we once were noobs too. Just don’t ever mention scraping there as they can’t help if they suspect you’re trying to do something inappropriate, malicious, or illegal. They are notoriously aginst yt-dlp which frustrates me a bit. Phrase it nicely and in an generic way. I will be there occasionally offering help.

Noah@lemmy.dbzer0.com · 8 months ago

The discord thing is a no-go since I don’t really know how to make my issue palatable. That’s why I used lemmy. Thanks again!

fuckwit_mcbumcrumble@lemmy.dbzer0.com · 8 months ago

We use node.js with puppeteer for some of our web crawling at work. It’s pretty straightforward once you have a basic script to launch it. If you havent already I’d highly suggest installing vs code. You install node.js, then using npm (node package manager) install puppeteer and whatever other dependencies you might have. Someone out there probably has a basic js file out there that will open chrome, or just ask an LLM (I just use ChatGPT, they’re all the same shit). From there you just need to navigate to your pages, then use a queryselector and .click() to click on your elements. It’s all javascript from there.

Pro tip: write your queryselectors in your browser using the inspect element/console tab, then put it in your JS file. Nothing is worse than being 10 minutes into a crawl and you’ve got a queerselector.

Noah@lemmy.dbzer0.com · 8 months ago

I don’t like to touch js so ive being going python only. (besides basic html & Css) but I found puppeteer and didn’t really get it.

8 months ago

Ask dr gpt

How do I use Open Source scrapers? (Selenium, Scrapy, etc.)

How do I use Open Source scrapers? (Selenium, Scrapy, etc.)

I have been trying for hours to figure this out. From a building tutorial to just trying to find prebuilt ones, I can’t seem to make it click.