Programming, Site Meta
Personal Blog

[ Question ]

Why is lesswrong blocking wget and curl (scrape)?

by nick lacombe
8th Nov 2023
1 min read

2 Answers sorted by
top scoring

jimrandomh

Nov 08, 2023

It's an AWS firewall rule with bad defaults. We'll fix it soon, but in the meantime, you can scrape if you change your user agent to something other than wget/curl/etc. Please use your name/project in the user-agent so we can identify you in logs if we need to, and rate-limit yourself conservatively.
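
A minimal sketch of what that advice could look like in Python, assuming you fetch pages yourself rather than through wget or curl. The user-agent string, the contact address and the 5-second delay are placeholders, not values the LessWrong team specified; substitute your own name/project and a delay you consider conservative.

```python
import time
import urllib.request

# Hypothetical identifier: replace with your own name/project and contact.
USER_AGENT = "example-lw-archiver/0.1 (contact: you@example.com)"
# Assumed pause between requests; nothing in the thread gives a number.
DELAY_SECONDS = 5.0


def build_request(url, user_agent=USER_AGENT):
    """Return a GET request that sends a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_all(urls, delay=DELAY_SECONDS):
    """Fetch each URL in order, sleeping `delay` seconds between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # self-imposed rate limit
        with urllib.request.urlopen(build_request(url), timeout=30) as resp:
            pages[url] = resp.read()
    return pages
```

Calling `fetch_all(["https://www.lesswrong.com/"])` would send a single identified request; with more URLs, each later request waits `delay` seconds first.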

nick lacombe (2y)

thanks a lot for the answer!

varungodbole (7mo)

This is something I'm curious about as well! A friend recently introduced me to LessWrong, and I've found myself really enjoying the posts here! I'd like to spend more focused time digging into them!

I'd like to create a dump of LessWrong so that I can use a tool like DocETL (https://www.docetl.org/) to better sift through articles that might be interesting to me. It's been quite some time since jimrandomh replied to this post. So I just thought I'd check in before I attempted to crawl the site.

Also, it looks like https://www.lesswrong.com/robots.txt disall... (read more)

nick lacombe (7mo)
greaterwrong is a website with the same content as lesswrong but a different look; it gets its content from the lesswrong website, so it's basically just a different way to access the same posts. lesswrong has a graphql api, which is probably the best way to read and dump posts if you rate limit conservatively, but that involves some programming. to just get a quick dump of everything, it's probably best to use wget on greaterwrong with rate limiting. email admin@greaterwrong.com before doing so to make sure you do it in a way they approve of.
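
For the graphql route, here is a rough Python sketch of building one paged request. Everything specific in it is an assumption rather than something stated in this thread: the endpoint URL, the `posts` query shape, the `view`, `limit` and `offset` terms, and the field names. Check them against the live schema before relying on them. The user-agent string is a hypothetical placeholder.

```python
import json
import urllib.request

# ASSUMPTIONS (not from the thread): endpoint URL, query shape, field names.
GRAPHQL_URL = "https://www.lesswrong.com/graphql"
# Hypothetical identifier: replace with your own name/project and contact.
USER_AGENT = "example-lw-archiver/0.1 (contact: you@example.com)"


def build_query(limit, offset):
    """Return a GraphQL query string for one page of posts (assumed schema)."""
    return (
        '{ posts(input: {terms: {view: "new", limit: %d, offset: %d}}) '
        "{ results { _id title pageUrl postedAt } } }" % (limit, offset)
    )


def build_graphql_request(limit=20, offset=0):
    """Return a POST request carrying the query as a JSON body."""
    body = json.dumps({"query": build_query(limit, offset)}).encode("utf-8")
    return urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        headers={"Content-Type": "application/json", "User-Agent": USER_AGENT},
    )


def fetch_page(limit=20, offset=0):
    """Send one request and return the decoded JSON response."""
    request = build_graphql_request(limit, offset)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

A full dump would call `fetch_page` in a loop with an increasing `offset` and a pause between calls, stopping when a page comes back empty.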
varungodbole (7mo)
gotcha, thanks!

gwern

Nov 11, 2023

You should use GreaterWrong. Even when the AWS stuff is fixed for LW2, GW is designed to be more static than LW2, and ought to snapshot better in general. You can also use the built-in theme designer to customize it better for your offline use and scrape it using your cookies.

habryka (2y)

Yeah, GW is pretty good for snapshots and scraping. Either that or grab stuff directly from our API. 

RHollerith (2y)
@nicolas-lacombe If you decide to grab stuff directly from the API (rather than scraping GW) I might help by offering to pair program with you or trying to contribute code.
nick lacombe (2y)
thanks for offering! right now i am thinking i'll just use wget to create an archive of gw and/or lw, since that would likely be faster than using the api for my use case. but i am still interested in writing code that would generate a lw archive from the lw api. if i end up doing that, and if i remember this discussion, then i'll likely contact you and show you where i put the code so that we could both work on the same codebase if you want.
6 comments, sorted by
top scoring
RHollerith (2y)

When you imagine your "read offline" project having succeeded, do you tend to imagine yourself reading LW with a net connection on a computer, a smartphone or both?

Correction: I meant without a net connection. D'oh!

nick lacombe (2y)

i'll most likely read it when i have no internet access on a laptop.

RHollerith (2y)

What app do you imagine you will use? A web browser?

nick lacombe (2y)

probably some form of web browser: yes.

ryan_b (2y)

I register a guess that this is to keep the content of lesswrong from being scraped for LLMs and similar purposes.

nick lacombe (2y)

according to this comment it looks like a member of the lw site devs is ok with lw being scraped by gpt.

if there is no official lesswrong db/site archive for public posts, i'd like to be able to create my own with automated tools like wget, so that i can browse the site while offline. see Is there a lesswrong archive of all public posts?

wget and curl logs:

$ wget -mk https://www.lesswrong.com/
--2023-11-08 14:31:26--  https://www.lesswrong.com/
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving www.lesswrong.com (www.lesswrong.com)... 54.90.19.223, 44.213.228.21, 54.81.2.129
Connecting to www.lesswrong.com (www.lesswrong.com)|54.90.19.223|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-11-08 14:31:26 ERROR 403: Forbidden.

Converted links in 0 files in 0 seconds.
$ curl -Lv https://www.lesswrong.com/
*   Trying 54.81.2.129:443...
* Connected to www.lesswrong.com (54.81.2.129) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: none
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=lesswrong.com
*  start date: Sep  8 00:00:00 2023 GMT
*  expire date: Oct  6 23:59:59 2024 GMT
*  subjectAltName: host "www.lesswrong.com" matched cert's "www.lesswrong.com"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
*  SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://www.lesswrong.com/
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: www.lesswrong.com]
* [HTTP/2] [1] [:path: /]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET / HTTP/2
> Host: www.lesswrong.com
> User-Agent: curl/8.4.0
> Accept: */*
> 
< HTTP/2 403 
< server: awselb/2.0
< date: Wed, 08 Nov 2023 19:31:44 GMT
< content-type: text/html
< content-length: 118
< 
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>
* Connection #0 to host www.lesswrong.com left intact