Residential proxies are sketchy at best. How can you guarantee that your service's infrastructure isn't hinging on an illicit botnet?
chadwebscraper 1 day ago [-]
This is a good callout - I’ve tried my best thus far to avoid proxies unless absolutely necessary, and when they are needed, to stick to reputable providers (even though they're a bit pricier).
Definitely going to give this more thought though, thank you for the comment
dewey 1 day ago [-]
There's a lot of variety in the residential proxy market. Some are sourced from bandwidth sharing SDKs for games with user consent, some are "mislabeled" IPs from ISPs that offer that as a product and then there's a long tail of "hacked" devices. Labeling them generally as sketchy seems wrong.
sfRattan 24 hours ago [-]
> Some are sourced from bandwidth sharing SDKs for games with user consent...
The notion that most people installing a game meaningfully consent to unspecified ongoing uses of their Internet connection resold to undeclared third parties gave me a good, hearty belly laugh. Especially expressed so matter-of-factly.
Thank you.
dewey 23 hours ago [-]
I don't think it's much different from games that force users to watch ads or capture them in pay-to-win schemes.
sfRattan 23 hours ago [-]
When a game shows an unskippable ad, the user is consciously aware of what is happening, as it is happening, and can close the program to stop watching the ad. It is in no sense comparable to what you describe.
When a third party library bundled into a game makes ongoing, commercial, surreptitious use of the user's Internet access, the vast majority of users aren't meaningfully consenting to that use of their residential IP and bandwidth because they understand neither computers nor networks well enough to meaningfully consent.
I don't doubt your bases are sufficiently covered in terms of liabilities. I don't doubt that some portion of whatever EULA you have (that your users click right on past) details in eye-watering legalese that you are reselling their IP and bandwidth.
It's just... The notion that there has been any meeting of minds at all between your organization and its games' users on the matter of IP address and bandwidth resale is patently risible.
kingforaday 23 hours ago [-]
To add, it's also strictly forbidden by all the major ISPs' Acceptable Use Policies, at least in the US.
fnimick 24 hours ago [-]
Legal? Probably. Ethical? Absolutely not.
muwtyhg 22 hours ago [-]
> bandwidth sharing SDKs for games with user consent
What games are you aware of that do this? I want to make sure I have none of them installed.
tglobs 16 hours ago [-]
I had my ebike stolen today a few hours after seeing this, and immediately made an account to watch Craigslist for bike thieves trying to sell it.
If you had asked for $60/month to run it, I would've paid it.
6 attempts later, it's failed every time. I love that it's so easy to throw together things like this, but we need better ways of testing vibe-coded apps.
chadwebscraper 11 hours ago [-]
First off, really sorry to hear that it didn't work.
Edit: it looks like you hit an edge case that I didn't see in testing. Happy to explain more, but it was skipping the extraction due to a pre-processing step that was failing on Craigslist when it shouldn't have.
Would love if you want to try the tool again, but completely understand if not :)
golfer 22 hours ago [-]
As a site owner, how does one opt out of this, since it obviously ignores robots.txt?
chadwebscraper 22 hours ago [-]
Shoot me your site and I can blacklist it
xnx 4 hours ago [-]
That's what robots.txt is for.
groby_b 1 day ago [-]
"AntiBot bypass".
I see we continue to aim for high ethical standards throughout the industry.
arjunchint 1 day ago [-]
So what happens when the website layout updates - does the monitoring job fail silently?
chadwebscraper 1 day ago [-]
So with APIs, it adjusts. For HTML layouts, it looks at the previous diffs to catch potential errors and then re-indexes.
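For the HTML case, the check is roughly along these lines (a simplified sketch - the field names and data are illustrative, not the actual implementation): compare the new extraction against the previous run, and if fields that used to extract suddenly come back empty, treat it as a probable layout change and re-index.

    # Sketch: flag a probable layout change when fields that previously
    # extracted fine suddenly come back empty across the whole run.
    def looks_like_layout_change(previous: list[dict], current: list[dict]) -> bool:
        if previous and not current:
            return True  # everything vanished at once
        prev_fields = {k for row in previous for k, v in row.items() if v}
        missing = [f for f in prev_fields
                   if all(not row.get(f) for row in current)]
        return bool(missing)

    prev_run = [{"title": "Post A", "price": "$10"}, {"title": "Post B", "price": "$12"}]
    this_run = [{"title": None, "price": "$10"}]  # titles stopped extracting
    if looks_like_layout_change(prev_run, this_run):
        print("layout probably changed -> re-index and rebuild the extraction pattern")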
the_arun 22 hours ago [-]
What is a "strategy"? You need to elaborate on that in the pricing.
chadwebscraper 22 hours ago [-]
Thank you for the feedback - agreed.
It’s an extraction pattern for a given site, so you can reuse it. Think of a pattern that extracts all forum posts, then apply it to different pages with the same format - like the "show new", "show", and "new" pages on HN.
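A minimal sketch of what that could look like in practice - the selectors and field names here are illustrative (and HN's markup can change), not the production implementation:

    # One reusable "strategy": a set of selectors applied to any page
    # that shares the same layout. Selectors are illustrative only.
    from dataclasses import dataclass

    import requests
    from bs4 import BeautifulSoup

    @dataclass
    class Strategy:
        item_selector: str      # selector for each repeated item
        fields: dict[str, str]  # field name -> CSS selector within the item

        def extract(self, html: str) -> list[dict]:
            soup = BeautifulSoup(html, "html.parser")
            rows = []
            for item in soup.select(self.item_selector):
                row = {}
                for name, selector in self.fields.items():
                    node = item.select_one(selector)
                    row[name] = node.get_text(strip=True) if node else None
                rows.append(row)
            return rows

    # The same strategy reused across same-format HN listing pages.
    hn_posts = Strategy(item_selector="tr.athing",
                        fields={"rank": "span.rank", "title": "span.titleline > a"})

    for url in ("https://news.ycombinator.com/newest",
                "https://news.ycombinator.com/show"):
        html = requests.get(url, timeout=10).text
        print(url, hn_posts.extract(html)[:3])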
chadwebscraper 1 day ago [-]
Here’s how it works:
1. Paste a URL in, describe what you want
2. Define an interval to monitor
3. Get real-time webhooks of any changes in JSON (a rough example is sketched at the end of this comment)
Lots of customers are using this across different domains to get consistent, repeatable JSON out of sites and monitor changes.
Supports API + HTML extraction, never write a scraper again!
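To make step 3 concrete, here's roughly what a change event could look like, with a minimal receiver - the payload fields are an assumption for illustration, not a documented schema:

    # Minimal webhook receiver sketch (Flask). The payload shape is assumed
    # for illustration and may differ from what the service actually sends:
    # {"url": "...", "detected_at": "...", "changes": [{"field": ..., "old": ..., "new": ...}]}
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/webhooks/changes", methods=["POST"])
    def handle_change():
        event = request.get_json(force=True)
        for change in event.get("changes", []):
            print(f"{event['url']}: {change['field']} "
                  f"{change['old']!r} -> {change['new']!r}")
        return "", 204

    if __name__ == "__main__":
        app.run(port=8000)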
codingdave 1 day ago [-]
Writing a scraper isn't the hard part, that is actually fairly trivial at this point in time. Pulling content into JSON from your scrape is also fairly trivial - libraries exist that handle it well.
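To the "fairly trivial" point, a quick sketch of what off-the-shelf libraries already give you - extruct, for example, pulls any embedded JSON-LD/microdata straight into JSON (the URL below is a placeholder, and this only works where pages actually embed structured data):

    # Assumes `pip install extruct requests`; the URL is a placeholder.
    import json

    import extruct
    import requests

    url = "https://example.com/some-product-page"
    html = requests.get(url, timeout=10).text
    data = extruct.extract(html, base_url=url, syntaxes=["json-ld", "microdata"])
    print(json.dumps(data["json-ld"], indent=2))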
The harder parts are things like playing nicely so your bot doesn't get banned by sysadmins, detecting changes downstream from your URL, handling dynamically loading content, and keeping that JSON structure consistent even as your sites change their content, their designs, etc. Also, scalability. One customer I'm talking to could use a product like this, but they have 100K URLs to track, and that is more than I currently want to deal with.
I absolutely can see the use case for consistent change data from a URL, I'm just not seeing enough content in your marketing to know whether you really have something here, or if you vibe coded a scraper and are throwing it against the wall to see if it sticks.
chadwebscraper 1 day ago [-]
I appreciate the response! I also agree - happy to add some clarity to this stuff.
Bot protection - this is handled in a few ways. The basic form bypasses most bot protections and that’s what you can use on the site today. For tougher sites, it solves the bot protections (think DataDome, Akamai, Incapsula).
The consistency part is ongoing, but it’s possible to check the diffs and content extractions and notice if something has changed and “reindex” the site.
100k URLs is a lot! It could support that, but the initial indexing would be heavy. It’s fairly resource efficient (no browsers). For scale, it’s doing about 40k scrapes a day right now.
Appreciate the comments, happy to dive deeper into the implementation and I agree with everything you’ve said. Still iterating and trying to improve it.
codingdave 23 hours ago [-]
Re-indexing seems sub-optimal. I can't think of a use case where people care if the design changes. Even some content changes are not going to be interesting. Someone corrected a typo, updated punctuation, that kind of thing... such things are just noise if you are trying to react to content changes.
Your system needs to know not only what changed, but whether or not it matters. Splitting meaningful content from irrelevant noise is exceedingly important. If you know that, you do not need to re-index because you can diff only the meaningful content.
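Concretely, something along these lines (a sketch, not anyone's production code): normalize just the fields that matter, fingerprint them, and only react when the fingerprint moves.

    import hashlib
    import json
    import re

    # Fields worth reacting to; everything else is treated as noise.
    # The field names and records below are illustrative.
    MEANINGFUL_FIELDS = ("title", "price", "status")

    def fingerprint(record: dict) -> str:
        normalized = {}
        for f in MEANINGFUL_FIELDS:
            value = str(record.get(f, "")).lower()
            normalized[f] = re.sub(r"[^0-9a-z]+", "", value)  # strip punctuation/spacing noise
        return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()

    def meaningful_change(old: dict, new: dict) -> bool:
        return fingerprint(old) != fingerprint(new)

    listed   = {"title": "Trek FX 3 e-bike", "price": "$1,200", "status": "available"}
    typo_fix = {"title": "Trek FX 3 ebike",  "price": "$1,200", "status": "available"}
    repriced = {"title": "Trek FX 3 e-bike", "price": "$999",   "status": "available"}
    print(meaningful_change(listed, typo_fix))   # False: punctuation-level edit, ignored
    print(meaningful_change(listed, repriced))   # True: the price actually moved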
As far as the 100K URLs, each URL has between 200 and 1000 sub-pages beneath the top-level page. They all need to be periodically scanned for updates, while capturing that distinction of noise vs. meaningful change. I've actually got code that does the needed work - it is scaling it up to that level that I didn't want to take on.
I'm not sure what you mean by no browsers. My existing scraper uses headless browsers, in order to capture JavaScript-driven content and navigate through a SPA without having to re-load at every URL change. If you are not using even a headless browser, how are you getting dynamic content?
chadwebscraper 22 hours ago [-]
Let me clarify: it only reindexes if the structured data changes, so it ignores layout changes - it diffs the extractions.
Would be curious to try it out on your sites if you want to shoot me a few over - I can share my email.
It does use a browser to find dynamic content, but not afterwards.
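Roughly the pattern (a sketch with Playwright - not necessarily how it's implemented here): run a real browser once to discover which JSON endpoints the page calls, then poll those endpoints directly with plain HTTP afterwards.

    # Assumes `pip install playwright requests` plus `playwright install`.
    # The target URL is a placeholder.
    import requests
    from playwright.sync_api import sync_playwright

    def discover_json_endpoints(url: str) -> list[str]:
        endpoints = []
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()

            def capture(response):
                # Record any response that came back as JSON.
                if "application/json" in response.headers.get("content-type", ""):
                    endpoints.append(response.url)

            page.on("response", capture)
            page.goto(url, wait_until="networkidle")
            browser.close()
        return endpoints

    # One-time discovery with a browser...
    endpoints = discover_json_endpoints("https://example.com/listings")

    # ...then later checks are cheap, browser-free requests (a real site may
    # require the same headers/cookies the page used, omitted here).
    for endpoint in endpoints:
        print(endpoint, requests.get(endpoint, timeout=10).status_code)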
codingdave 10 hours ago [-]
I tell ya what - I'm currently working on finalizing a contract with someone for related work, but if that falls apart for some reason, I might pick this back up independently, and if so, I might be interested in chatting more. I'd recommend adding your email to your profile, and I can reach out if appropriate.
Just to set proper expectations, though, my current contract negotiations are already fully agreed, we're just waiting on the legal team to finalize things, so the odds of me seeking a new partner are not high.
chadwebscraper 9 hours ago [-]
Just did - great chatting with you. Best of luck & looking forward to (potentially) hearing back!
tmaly 1 day ago [-]
this must wreck their Google Analytics stats
chadwebscraper 1 day ago [-]
lol it probably does unless their filtering is great
I recommend a pivot: take your structured data approach and build a browser plugin that allows users to pin forums, wiki edits and adverts on any web content they like.
chadwebscraper 21 hours ago [-]
This is actually a really interesting thought - like an embedder with live data?
cyanydeez 21 hours ago [-]
Yeah, like open a disputed article and turn on the plugin. Suddenly wiki edits, videos and forums/notes annotate the page. The structured data is used to organize it and when/if it changes the editors can update links.
Probably use a graph database and RAG type references.