v3.0
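golang-proxy is a ready-to-use crawler for highly anonymous proxies, and it is language-agnostic.
Project repository: https://github.com/storyicon/golang-proxy
Golang-Proxy: a simple and efficient free proxy crawler. It maintains its own pool of highly anonymous proxies by crawling free proxies published on the web, for use in web crawlers, resource downloads, and similar tasks.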
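What's new in v3.0
Fetch crawled proxies directly through localhost:9999/all and localhost:9999/random. You can even use localhost:9999/sql?query= to run simple SQL statements and define your own proxy filtering rules.
Ready-to-use builds for Windows, Linux, and Mac.
schemeType identifies how far a proxy supports http and https.
-mode= selects whether to start producer, consumer, assessor, or service on its own.
The API endpoints have changed.
The data structure of sources has changed: fields such as filter were removed. Note that this means v2.0 sources may cause problems when handed directly to v3.0.
Sources: the -source startup parameter.
The golang-proxy Release page provides archives for each system environment; extract the one for your system and run it.
Ready-to-use download: Download Release v3.0
After the download finishes, extract the binary and the source directory from the archive into the same location and start the binary. The program starts the following services: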
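producer: periodically crawls the sources defined in the source directory and writes the crawled proxies to the crude_proxy table
consumer: periodically reads a certain number of proxies from crude_proxy, determines their proxy type and availability, and writes them to the proxy table
assessor: periodically reads a certain number of proxies from the proxy table and evaluates their quality
service: the HTTP API provided by golang-proxy, which lets you filter and fetch the proxies in the crude_proxy and proxy tables through three endpoints: localhost:9999/all, localhost:9999/random, and localhost:9999/sql?query=
When you start the compiled binary, these services start one after another by default. In v3.0, however, you can add the -mode startup parameter to start a single service on its own, for example: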
golang-proxy -mode=service
Run this way, only the service module starts. Once service is running, you can open the following endpoints in a browser to get proxies:
url | description |
---|---|
localhost:9999/all | Get all crawled proxies in the proxy table |
localhost:9999/all?table=proxy | Get all crawled proxies in the proxy table |
localhost:9999/all?table=crude_proxy | Get all crawled proxies in the crude_proxy table |
localhost:9999/random | Get one proxy at random from the proxy table |
localhost:9999/random?table=proxy | Get one proxy at random from the proxy table |
localhost:9999/random?table=crude_proxy | Get one proxy at random from the crude_proxy table |
localhost:9999/sql?query= | Append an SQL statement after query= to get its result; only fairly simple queries are supported |
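The endpoints in the table can also be called from a script instead of a browser. The following is a minimal sketch, assuming the service listens on localhost:9999 and answers with JSON; the names `BASE_URL`, `endpoint_url`, and `fetch_json` are illustrative and not part of golang-proxy.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the service module is running locally on its default port.
BASE_URL = "http://localhost:9999"


def endpoint_url(path, table=None):
    """Build the URL for /all or /random, optionally selecting a table."""
    if path not in ("all", "random"):
        raise ValueError("path must be 'all' or 'random'")
    if table is not None and table not in ("proxy", "crude_proxy"):
        raise ValueError("table must be 'proxy' or 'crude_proxy'")
    query = "?" + urlencode({"table": table}) if table else ""
    return f"{BASE_URL}/{path}{query}"


def fetch_json(url, timeout=10):
    """Fetch a URL and decode its JSON body (needs a running service)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With the service running, `fetch_json(endpoint_url("random"))` would request one proxy from the proxy table.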
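Note that crude_proxy is only a temporary holding table for crawled proxies, so their quality is not guaranteed. Proxies in the proxy table, by contrast, are evaluated continuously by the assessor. The score field in the proxy table gives a fairly complete picture of a proxy's quality, and proxies whose quality drops too low are deleted.
localhost:9999/sql
For example, requesting localhost:9999/sql?query=SELECT * FROM PROXY WHERE SCORE > 5 ORDER BY SCORE DESC returns all proxies in the proxy table with a score above 5, ordered from highest to lowest score: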
    {
      "error": "",
      "message": [
        {
          "id": 2,
          "ip": "45.113.69.177",
          "port": "1080",
          // scheme_type can take the following values:
          // 0: the proxy supports http only
          // 1: the proxy supports https only
          // 2: the proxy supports both http and https
          "scheme_type": 0,
          "content": "45.113.69.177:1080",
          // number of assessments
          "assess_times": 9,
          // number of successful assessments; success_times/assess_times gives the proxy's connection success rate
          "success_times": 9,
          // average response time
          "avg_response_time": 0.098,
          // number of consecutive failures
          "continuous_failed_times": 0,
          // score; proxies scoring above 5 are recommended
          "score": 68.45106053570785,
          "insert_time": 1540793312,
          "update_time": 1540797880
        }
      ]
    }
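A response of this shape can be post-processed on the client. The sketch below parses a sample with the same fields (comments removed so that it is valid JSON), computes the success rate as success_times / assess_times, and keeps proxies above a score threshold. Treating a non-empty `error` string as a failure is an assumption, and the helper names are illustrative.

```python
import json

# Sample with the same fields as the response above, as plain JSON.
SAMPLE = """
{"error": "", "message": [
  {"id": 2, "ip": "45.113.69.177", "port": "1080", "scheme_type": 0,
   "content": "45.113.69.177:1080", "assess_times": 9, "success_times": 9,
   "avg_response_time": 0.098, "continuous_failed_times": 0,
   "score": 68.45106053570785, "insert_time": 1540793312,
   "update_time": 1540797880}
]}
"""


def success_rate(proxy):
    """Return success_times / assess_times, or 0.0 if never assessed."""
    if proxy["assess_times"] == 0:
        return 0.0
    return proxy["success_times"] / proxy["assess_times"]


def usable_proxies(response_text, min_score=5.0):
    """Parse a response and keep proxies whose score exceeds min_score."""
    data = json.loads(response_text)
    if data["error"]:  # assumption: a non-empty string signals an error
        raise RuntimeError(data["error"])
    return [p for p in data["message"] if p["score"] > min_score]
```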
go get -u github.com/storyicon/golang-proxy
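Change into the golang-proxy directory, run go build main.go, and then run the resulting binary.
Note:
The ./source folder in the project root is required at run time; it stores the website sources. All other folders contain only project source code. So after compiling to the binary main, you can move main together with the source folder to any location, and main can be renamed freely.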
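Use localhost:9999/all and localhost:9999/random to fetch crawled proxies directly. You can even use localhost:9999/sql?query= to run SQL statements and define your own proxy filtering rules.
Every yml file under ./source/ is a source. You can add sources, make the program ignore a source by prefixing its file name with a ., or simply delete the file to remove the source for good. The source parameters are described below: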
    # page settings
    page:
      entry: "https://xxx/1.html"
      template: "https://xxx/{page}.html"
      from: 2
      to: 10
    # producer first crawls entry, i.e. https://xxx/1.html
    # then, following template, from and to, it crawls in turn
    # https://xxx/2.html
    # https://xxx/3.html
    # https://xxx/4.html
    # ...
    # https://xxx/10.html

    # selector settings
    selector:
      iterator: ".table tbody tr"
      ip: "td:nth-child(1)"
      port: "td:nth-child(2)"
    # The settings above scrape an HTML structure like this:
    # <table class="table">
    #   <tbody>
    #     <tr>
    #       <td>187.3.0.1</td>
    #       <td>8080</td>
    #       <td>HTTP</td>
    #     </tr>
    #     <tr>
    #       <td>164.23.1.2</td>
    #       <td>80</td>
    #       <td>HTTPS</td>
    #     </tr>
    #     <tr>
    #       <td>131.9.2.3</td>
    #       <td>8080</td>
    #       <td>HTTP</td>
    #     </tr>
    #   </tbody>
    # </table>
    # Selectors are standard jQuery-style selectors. iterator selects the repeated
    # element, for example the rows of a table with one proxy per row; field
    # selectors such as ip and port are then looked up among the children of
    # each element matched by iterator.

    category:
      # number of parallel workers
      parallelnumber: 1
      # for this source, after each page is crawled,
      # wait a random 5~20s before crawling the next page
      delayRange: [5, 20]
      # how often this source is run
      # @every 10s , @every 10h...
      interval: "@every 10m"
    debug: true
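To make the page and delayRange settings concrete, here is a minimal sketch of how they could be interpreted: the entry URL comes first, then the template is filled in for every page number from `from` to `to` inclusive, and a random pause within delayRange is chosen between pages. This is an illustration in Python, not the project's Go implementation; `expand_pages` and `pick_delay` are hypothetical names.

```python
import random


def expand_pages(page):
    """List the URLs a page block describes: entry, then template for from..to."""
    urls = [page["entry"]]
    for number in range(page["from"], page["to"] + 1):  # 'to' is inclusive
        urls.append(page["template"].replace("{page}", str(number)))
    return urls


def pick_delay(delay_range):
    """Pick a random pause in seconds within [low, high]."""
    low, high = delay_range
    return random.uniform(low, high)


PAGE = {
    "entry": "https://xxx/1.html",
    "template": "https://xxx/{page}.html",
    "from": 2,
    "to": 10,
}
```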
If you run into problems, just open an issue.
Golang-proxy is an efficient free proxy crawler that ensures the captured proxies are highly anonymous and at the same time guarantees their quality. You can use these proxies to download network resources while keeping your own identity private.
golang-proxy provides compiled binaries, so you do not need golang on the machine. Download the archive that matches your system type from the Release Page, unzip it, and run it. After a few minutes, you can open localhost:9999/all in a browser to see the crawl results.
Before I go into the detailed introduction of golang-proxy, I think it's best to tell you the most useful information first.
After you start the binary, you can open the following endpoints in a browser to get proxies:
url | description |
---|---|
localhost:9999/all | Get all highly available proxies |
localhost:9999/all?table=proxy | Get all highly available proxies |
localhost:9999/random | Get one highly available proxy at random |
localhost:9999/all?table=crude_proxy | Get the proxies in the temporary table (their quality cannot be guaranteed) |
localhost:9999/random?table=crude_proxy | Get one proxy at random from the temporary table (its quality cannot be guaranteed) |
localhost:9999/sql?query= | Write the SQL statement you want to execute after query= to apply your own filter rules |
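When the sql endpoint is called from a script rather than typed into a browser, the statement has to be URL-encoded. A minimal sketch, assuming the service runs on localhost:9999; `sql_url` and `query_of` are illustrative helpers, and the column names come from the proxy table.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def sql_url(statement, base="http://localhost:9999"):
    """Build a /sql URL with the statement percent-encoded in query=."""
    return f"{base}/sql?" + urlencode({"query": statement}, quote_via=quote)


def query_of(url):
    """Decode the SQL statement back out of a /sql URL."""
    return parse_qs(urlsplit(url).query)["query"][0]


STATEMENT = "SELECT * FROM proxy WHERE score > 5 ORDER BY score DESC"
```

`sql_url(STATEMENT)` yields a URL that can be fetched with any HTTP client.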
With the above, you can already use about 50% of golang-proxy's functionality. The last endpoint, however, lets you execute custom SQL statements, and for that you need to know at least the structure of the tables, which is described below.
golang-proxy consists of the following parts:
data tables
configuration file
source folder
modules
data tables
To store temporary proxies, we designed the crude_proxy data table, which is defined as follows.
field | type | example | description |
---|---|---|---|
id | int | - | - |
ip | string | 192.168.0.1 | - |
port | string | 255 | - |
content | string | 192.168.0.1:255 | - |
insert_time | int | 1540798717 | - |
update_time | int | 1540798717 | - |
The crude_proxy table stores proxies as they are crawled, so their quality cannot be guaranteed.
When a proxy in the crude_proxy table passes pre assess (which roughly verifies the proxy's availability and tests its support for https and http), it enters the proxy table.
field | type | example | description |
---|---|---|---|
id | int | - | - |
ip | string | 192.168.0.1 | - |
port | string | 255 | - |
scheme_type | int | 2 | How far the proxy supports http and https: 0 = http only, 1 = https only, 2 = both http and https |
content | string | 192.168.0.1:255 | - |
assess_times | int | 5 | proxy evaluation times |
success_times | int | 5 | The number of times the proxy successfully passed the evaluation |
avg_response_time | float | 0.001 | - |
continuous_failed_times | int | 0 | The number of consecutive failures during the proxy evaluation process |
score | float | 25 | The higher the better |
insert_time | int | 1540798717 | - |
update_time | int | 1540798717 | - |
Proxies in the proxy table are evaluated periodically and their scores are updated. Proxies with low scores are deleted.
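The scheme_type field determines which kinds of request a proxy can carry. The sketch below turns a row of the proxy table into the mapping expected by Python's `urllib.request.ProxyHandler`. It assumes the proxy itself speaks the plain HTTP proxy protocol, so both entries use an `http://` proxy address; `proxy_map` is an illustrative name.

```python
from urllib.request import ProxyHandler, build_opener

# scheme_type values as documented in the proxy table.
SCHEMES = {0: ("http",), 1: ("https",), 2: ("http", "https")}


def proxy_map(record):
    """Map a proxy-table row to a {scheme: proxy_url} dict."""
    # Assumption: the proxy is reached over plain HTTP at ip:port.
    address = "http://" + record["content"]
    return {scheme: address for scheme in SCHEMES[record["scheme_type"]]}


def opener_for(record):
    """Build a urllib opener that routes supported schemes via the proxy."""
    return build_opener(ProxyHandler(proxy_map(record)))
```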
configuration file
For convenience, golang-proxy stores proxies in the portable database sqlite by default. You can make golang-proxy use a mysql database by adding a config.yml file in the executable's directory.
For details, see the Config page.
source folder
golang-proxy needs sources to define what it crawls and how. The run directory of golang-proxy therefore needs a source folder, and that folder must contain at least one source in yml format.
The source is defined as follows:
    page:
      entry: "http://www.xxx.com/http/?page=1"
      template: "http://www.xxx.com/http/?page={page}"
      from: 1
      to: 2000
    selector:
      iterator: ".list .item"
      ip: ".ip"
      port: ".port"
    category:
      parallelnumber: 3
      delayRange: [10, 30]
      interval: "@every 10m"
    debug: true
In the definition above, producer will first crawl the entry page, then crawl:

    http://www.xxx.com/http/?page=1
    http://www.xxx.com/http/?page=2
    http://www.xxx.com/http/?page=3
    ...
    http://www.xxx.com/http/?page=2000
This source definition expects pages in the following format:

    <html>
      ...
      <div class="list">
        <div class="item">
          <div class="ip"> 127.0.0.1 </div>
          <div class="port"> 80 </div>
          ...
        </div>
        <div class="item">
          <div class="ip"> 125.4.0.1 </div>
          <div class="port"> 8080 </div>
          ...
        </div>
        ...
      </div>
      ...
    </html>
When producer parses a single page, it always traverses the nodes matched by iterator first, and then gets the elements matched by the ip and port selectors within each of these nodes. The source definition above is still valid for the following HTML structure.

    <html>
      ...
      <div class="list">
        <div class="item">
          <div class="ip"> 127.0.0.1:80 </div>
        </div>
        <div class="item">
          <div class="ip"> 125.4.0.1:8080 </div>
        </div>
        ...
      </div>
      ...
    </html>
This works because, when the port selector yields no content, producer tries to parse the port from the text matched by the ip selector.
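The fallback described above can be illustrated with a small sketch: if the port text is empty, the port is taken from the part of the ip text after the last colon. This is a Python illustration of the idea for IPv4-style `ip:port` text, not the project's Go code, and `split_ip_port` is a hypothetical name.

```python
def split_ip_port(ip_text, port_text=""):
    """Return (ip, port), falling back to 'ip:port' inside ip_text."""
    ip_text = ip_text.strip()
    port_text = port_text.strip()
    if not port_text and ":" in ip_text:
        # The port selector matched nothing: split the ip text instead.
        ip_text, port_text = ip_text.rsplit(":", 1)
    return ip_text.strip(), port_text.strip()
```

Both HTML layouts shown above therefore produce the same pair, for example `split_ip_port(" 127.0.0.1 ", " 80 ")` and `split_ip_port(" 127.0.0.1:80 ")`.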
A source definition is complete once it is stored in the source folder in yml format. golang-proxy reads it the next time it starts and adds it to the crawl list.
If a source file name starts with a ., the source will not be read.
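To check which sources would be picked up, the naming rule can be applied to a directory listing. A minimal sketch, assuming that only files ending in `.yml` count as sources; `active_sources` is an illustrative helper, not part of golang-proxy.

```python
def active_sources(filenames):
    """Return the yml files that are not disabled by a leading dot."""
    return sorted(
        name
        for name in filenames
        if name.endswith(".yml") and not name.startswith(".")
    )
```

For a real installation this could be called as `active_sources(os.listdir("source"))` after `import os`.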
modules
golang-proxy consists of four modules that work together:
module name | description |
---|---|
producer | Periodically fetch the source defined in the source directory, and write the fetched proxy to the crude_proxy table. |
consumer | Periodically read a certain number of proxies from crude_proxy , determine their proxy scheme type and availability, and write them to the proxy table. |
assessor | Periodically read a number of proxies from the proxy table to evaluate their quality. |
service | Provides the HTTP API of golang-proxy, which lets you filter and obtain the proxies in the crude_proxy and proxy tables through localhost:9999/all, localhost:9999/random, and localhost:9999/sql. |
When you start the golang-proxy executable, it starts these modules in turn. You can add the -mode startup parameter after the executable to make golang-proxy start only one module, as below:
golang-proxy -mode=service
This starts only the HTTP API service.
At this point, you have covered about 95% of golang-proxy's functionality. If you want to learn more, you can read the source code in the repository and improve it.
Issues are welcome. If golang-proxy helps you, consider giving it a star or a watch. Thanks!