Web Scraping Framework

Last update: Jan 04, 2023

Overview

Grab Framework Documentation

Installation

    $ pip install -U grab

For details about installing Grab on different platforms, see http://docs.grablib.org/en/latest/usage/installation.html

Support

Documentation: https://grablab.org/docs/

Russian Telegram chat: https://t.me/grablab_ru

English Telegram chat: https://t.me/grablab

To report a bug, please use the GitHub issue tracker: https://github.com/lorien/grab/issues

What is Grab?

Grab is a Python web scraping framework. Grab provides a number of helpful methods to perform network requests, scrape websites, and process the scraped content:

Automatic cookies (session) support
HTTP and SOCKS proxy with/without authorization
Keep-Alive support
IDN support
Tools to work with web forms
Easy multipart file uploading
Flexible customization of HTTP requests
Automatic charset detection
Powerful API to extract data from the DOM tree of HTML documents with XPath queries
Asynchronous API to make thousands of simultaneous queries. This part of the library is called Spider; its features are listed below.
Python 3 ready

Spider is a framework for writing web-site scrapers. Features:

Rules and conventions to organize the request/parse logic in separate blocks of code
Multiple parallel network requests
Automatic processing of network errors (failed tasks go back to task queue)
You can create network requests and parse responses with Grab API (see above)
HTTP proxy support
Caching network results in permanent storage
Different backends for task queue (in-memory, redis, mongodb)
Tools to debug and collect statistics
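
The retry behaviour in the list above (failed tasks go back to the task queue) can be sketched with the standard library alone. This is not Grab's implementation: `Task`, `run_queue`, `max_tries`, and `make_flaky_fetch` are hypothetical names used only for illustration, and the "network" is simulated.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task record; not Grab's Task class."""
    name: str
    url: str
    tries: int = 0


def run_queue(tasks, fetch, max_tries=3):
    """Process tasks in FIFO order; re-queue a task when fetch() raises."""
    queue = deque(tasks)
    done, failed = [], []
    while queue:
        task = queue.popleft()
        task.tries += 1
        try:
            done.append((task.url, fetch(task.url)))
        except ConnectionError:
            if task.tries < max_tries:
                queue.append(task)  # the failed task goes back to the queue
            else:
                failed.append(task.url)  # give up after max_tries attempts
    return done, failed


def make_flaky_fetch():
    """Build a fake fetcher whose first call for each URL fails."""
    seen = set()

    def fetch(url):
        if url not in seen:
            seen.add(url)
            raise ConnectionError(url)  # simulated network error
        return "body of " + url

    return fetch


if __name__ == "__main__":
    tasks = [Task("page", "http://example.com/a"),
             Task("page", "http://example.com/b")]
    done, failed = run_queue(tasks, make_flaky_fetch())
    print(done, failed)
```

Capping the number of attempts is a design choice of this sketch; without a cap, a permanently failing URL would circulate in the queue forever.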

Grab Example

    import logging

    from grab import Grab

    logging.basicConfig(level=logging.DEBUG)

    g = Grab()

    g.go('https://github.com/login')
    g.doc.set_input('login', '****')
    g.doc.set_input('password', '****')
    g.doc.submit()

    g.doc.save('/tmp/x.html')

    g.doc('//ul[@id="user-links"]//button[contains(@class, "signout")]').assert_exists()

    home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
    repo_url = home_url + '?tab=repositories'

    g.go(repo_url)

    for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
        print('%s: %s' % (elem.text(),
                          g.make_url_absolute(elem.attr('href'))))

Grab::Spider Example

    import logging

    from grab.spider import Spider, Task

    logging.basicConfig(level=logging.DEBUG)


    class ExampleSpider(Spider):
        def task_generator(self):
            for lang in 'python', 'ruby', 'perl':
                url = 'https://www.google.com/search?q=%s' % lang
                yield Task('search', url=url, lang=lang)

        def task_search(self, grab, task):
            print('%s: %s' % (task.lang,
                              grab.doc('//div[@class="s"]//cite').text()))


    bot = ExampleSpider(thread_number=2)
    bot.run()
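
The example relies on a naming convention: a task created as `Task('search', ...)` is handled by the method `task_search`. A minimal, library-free sketch of that kind of name-based dispatch follows. `MiniSpider` and `dispatch` are hypothetical and say nothing about how Grab implements the lookup internally.

```python
class MiniSpider:
    """Hypothetical stand-in that only demonstrates name-based dispatch."""

    def dispatch(self, task_name, payload):
        # Look up the handler named task_<task_name> on this instance.
        handler = getattr(self, "task_" + task_name, None)
        if handler is None:
            raise LookupError("no handler for task %r" % task_name)
        return handler(payload)


class ExampleMiniSpider(MiniSpider):
    def task_search(self, payload):
        return "search handled: %s" % payload


if __name__ == "__main__":
    print(ExampleMiniSpider().dispatch("search", "python"))
```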

Comments

Spider hangs during work randomly

Now I have an issue I can't crack myself. The last log entries look like this:

[12.05.2018] [15:23:07] [DEBUG] [1535-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:08] [DEBUG] RPS: 2.02 [error:grab-connection-error=283, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
[12.05.2018] [15:23:08] [DEBUG] RPS: 0.51 [error:grab-connection-error=283, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
[12.05.2018] [15:23:08] [DEBUG] [1536-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:08] [DEBUG] [1537-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:10] [DEBUG] RPS: 0.76 [error:grab-connection-error=283, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
[12.05.2018] [15:23:10] [DEBUG] RPS: 0.76 [error:grab-connection-error=283, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
[12.05.2018] [15:23:10] [DEBUG] [1538-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:10] [DEBUG] [1539-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:10] [DEBUG] [1540-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:11] [DEBUG] RPS: 1.17 [error:grab-connection-error=284, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
[12.05.2018] [15:23:11] [DEBUG] RPS: 0.59 [error:grab-connection-error=284, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
[12.05.2018] [15:23:11] [DEBUG] [1541-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:11] [DEBUG] [1542-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:12] [DEBUG] [1543-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:12] [DEBUG] RPS: 1.73 [error:grab-connection-error=285, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]
[12.05.2018] [15:23:12] [DEBUG] [1544-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:12] [DEBUG] [1545-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[12.05.2018] [15:23:14] [DEBUG] RPS: 3.69 [error:grab-connection-error=285, error:grab-timeout-error=1, fatal=43, network-count-rejected=47]

or this:

[13.05.2018] [07:13:53] [DEBUG] RPS: 1.49 [fatal=24]
[13.05.2018] [07:13:53] [DEBUG] RPS: 0.75 [fatal=24]
[13.05.2018] [07:13:53] [DEBUG] [363-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:53] [DEBUG] [364-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:53] [DEBUG] [365-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:54] [DEBUG] RPS: 1.29 [fatal=24]
[13.05.2018] [07:13:54] [DEBUG] RPS: 0.65 [fatal=24]
[13.05.2018] [07:13:54] [DEBUG] [366-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:54] [DEBUG] [367-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:55] [DEBUG] [368-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:56] [DEBUG] RPS: 1.21 [fatal=24]
[13.05.2018] [07:13:56] [DEBUG] RPS: 0.61 [fatal=24]
[13.05.2018] [07:13:56] [DEBUG] [369-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:56] [DEBUG] [370-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:56] [DEBUG] [371-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:58] [DEBUG] RPS: 1.38 [fatal=24]
[13.05.2018] [07:13:58] [DEBUG] RPS: 0.69 [fatal=24]
[13.05.2018] [07:13:58] [DEBUG] [372-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:58] [DEBUG] [373-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:58] [DEBUG] [374-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:59] [DEBUG] RPS: 1.77 [fatal=24]
[13.05.2018] [07:13:59] [DEBUG] RPS: 0.00 [fatal=24]
[13.05.2018] [07:13:59] [DEBUG] [375-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:59] [DEBUG] [376-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:13:59] [DEBUG] [377-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:14:01] [DEBUG] RPS: 1.21 [fatal=24]
[13.05.2018] [07:14:01] [DEBUG] RPS: 0.60 [fatal=24]
[13.05.2018] [07:14:01] [DEBUG] [378-worker:networkservicethreaded:worker_callback] GET https://here_is_hidden_website.com/... via 127.0.0.1:9150 proxy of type socks5
[13.05.2018] [07:14:02] [DEBUG] RPS: 2.20 [fatal=24]

These are the final entries I can see in the log. No matter how long I wait (I waited 20 hours to see a result), the spider just uses one CPU core at 100% and does absolutely nothing.

My bot has 5 stages of work and it can hang at any stage.

It's very difficult to debug this issue because there are no errors in the log and it happens randomly.

I am running the bot this way:

bot = ExampleSpider(thread_number=3, network_service='threaded', grab_transport='pycurl')
bot.load_proxylist("./proxy_tor.txt", "text_file", "socks5")
bot.run()

Any ideas, guys?
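
When a process hangs without logging anything, Python's standard `faulthandler` module can periodically dump the stack of every thread to a file, which shows where each worker is stuck. A minimal sketch, assuming Python 3: `start_hang_watchdog` and `demo` are illustrative helpers, not part of Grab, and the `time.sleep` call merely stands in for a run that never returns.

```python
import faulthandler
import tempfile
import time


def start_hang_watchdog(log_file, timeout=60.0):
    """Dump the stack of every thread to log_file each `timeout` seconds.

    log_file must be a real file object, because faulthandler writes to its
    file descriptor. Stop the dumps with
    faulthandler.cancel_dump_traceback_later().
    """
    faulthandler.dump_traceback_later(timeout, repeat=True, file=log_file)


def demo(timeout=0.2, run_for=0.6):
    """Run a fake 'hung' job and return what the watchdog wrote."""
    with tempfile.TemporaryFile(mode="w+") as log:
        start_hang_watchdog(log, timeout=timeout)
        try:
            time.sleep(run_for)  # stands in for a bot.run() call that hangs
        finally:
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read()


if __name__ == "__main__":
    print(demo())
```

In a real script the watchdog would be started just before `bot.run()` with a timeout of a minute or more, so that the dump file can be inspected once the spider stops making progress.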

bug

opened by EnzoRondo 25

Error: pycurl.error: (0, '') when trying to submit a form

There is an HTML file with a form:

<form enctype="multipart/form-data" action="http://example.com/" method="post" accept-charset="UTF-8">
    <textarea name="body">Beställa</textarea>
    <input type="submit" name="op" value="Save"/>
</form>

There is code that submits the form:

from grab import Grab

g = Grab()
g.setup(debug=True)

g.go('file:///C:/page.html') # put the path to the file shown above here

g.doc.set_input('op', 'Save')
g.doc.submit(submit_name='op')

Submitting the form raises an error:

pycurl.error: (0, '')

But if the content of the textarea is replaced with something else, for example:

<textarea name="body">123</textarea>

Then everything is submitted fine.

How can this be fixed?

opened by InputError 23

Python 3.5 - Unable to build DOM tree.

File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:79801)
  File "src/lxml/parser.pxi", line 1799, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:116219)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
  File "src/lxml/parser.pxi", line 1700, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115040)
  File "src/lxml/parser.pxi", line 1040, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109165)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
  File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:103967)
  File "<string>", line None
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

With preceding:

encoding error : input conversion failed due to input error, bytes 0x21 0x00 0x00 0x00
encoding error : input conversion failed due to input error, bytes 0x44 0x00 0x00 0x00
I/O error : encoder error

Example:

from grab.spider import Spider, Task


class Scraper(Spider):
    def task_generator(self):
        urls = [
            'https://au.linkedin.com/directory/people-a/',
            'https://www.linkedin.com/directory/people-a/'
        ]
        for url in urls:
            yield Task('url', url=url)

    def task_url(self, grab, task):
        links = grab.doc('//div[@class="columns"]//ul/li[@class="content"]/a')


bot = Scraper()
bot.run()

This happens on some pages; perhaps lxml failed to detect the correct encoding.
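
The bytes in the error messages above are consistent with UTF-32 little-endian text: `0x21 0x00 0x00 0x00` is `!` and `0x44 0x00 0x00 0x00` is `D`, as in `<!DOCTYPE`. This is an inference from the log, not a confirmed diagnosis, and the helper name `utf32le_bytes` is only illustrative. The check below verifies nothing more than the byte interpretation:

```python
def utf32le_bytes(char):
    """Return the UTF-32 little-endian encoding of a single character."""
    return char.encode("utf-32-le")


if __name__ == "__main__":
    for ch in "!D":
        print(ch, " ".join("0x%02x" % b for b in utf32le_bytes(ch)))
```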

bug

opened by oiwn 23

Empty cookies

from grab import Grab
url = 'https://www.fiverr.com/'
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
session_f_name1 = 'fiverr_session1.txt'
session_f_name2 = 'fiverr_session2.txt'

g1 = Grab()
g1.setup(cookiefile=session_f_name1)
g1.go(url)
print 'g1', g1.cookies.cookiejar

g2 = Grab()
g2.setup(cookiefile=session_f_name2, user_agent=user_agent)
g2.go(url)
print 'g2', g2.cookies.cookiejar

In the first case the cookie is present; in the second it is not.

opened by nodermann 22

Cookies issue on Windows with pycurl 7.43.0.1

Tested on:

Microsoft Windows Server 2012 Standard
Microsoft Windows 7 Ultimate

Python version on both machines: Python 3.5.4 (v3.5.4:3f56838, Aug 8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32

grab=0.6.3.8

Code:

from grab import Grab, error
import sys
import logging
import base64

g = Grab()
g.setup(timeout=60)
g.setup(debug=True, debug_post=True)

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

# hidden for privacy reasons
url, login, pwd = base64.b64decode("v1WZktzbtVGZ7AHaw5CelRmbp9icvRXYyR3cp5WatRWYvwGcu0Wdw12LvoDc0RHa"[::-1])\
    .decode("utf-8").split(";")

g.go(url)
g.doc.set_input('username', login)
g.doc.set_input('passwd', pwd)
try:
    g.doc.set_input('lang', 'en-GB')
except:
    pass
g.doc.submit()

is_logged = g.doc.text_search("task=logout")
if not is_logged:
    raise error.GrabError("не вошли")

print("all right!")

With pycurl-7.19.5.3 installed, the result is: all right!

With the latest pycurl-7.43.0.1 installed, the result is: grab.error.GrabError: не вошли ("not logged in")

This is not specific to the site; I checked in other places and it is the same there. I manually compared the body of the POST request with what the browser sends, and the contents are identical.

opened by InputError 20

GrabTimeoutError

Hello, I get an exception like the one in https://github.com/lorien/grab/issues/140, but I don't have any DNS errors. OS and network: Windows 8 (x64), fixed DNS: 8.8.8.8

Script:

import pycurl; 
from grab import Grab
import logging

print(pycurl.version); 
print(pycurl.__file__);

logging.basicConfig(level=logging.DEBUG)
g = Grab(verbose_logging=True, debug=True)
g.go('http://github.com')
print g.xpath_text('//title')

script output:

PycURL/7.43.0 libcurl/7.47.0 OpenSSL/1.0.2f zlib/1.2.8 c-ares/1.10.0 libssh2/1.6.0
c:\Python27\lib\site-packages\pycurl.pyd
DEBUG:grab.network:[01] GET http://github.com
DEBUG:grab.transport.curl:i: Rebuilt URL to: http://github.com/
DEBUG:grab.transport.curl:i: Resolving timed out after 3000 milliseconds
DEBUG:grab.transport.curl:i: Closing connection 0
Traceback (most recent call last):
  File "D:\pr_files\source\python\planned\htmlParser\bgsParser\NewPythonProject\src\bgsParser.py", line 10, in <module>
    g.go('http://github.com')
  File "c:\Python27\lib\site-packages\grab-0.6.30-py2.7.egg\grab\base.py", line 377, in go
    return self.request(url=url, **kwargs)
  File "c:\Python27\lib\site-packages\grab-0.6.30-py2.7.egg\grab\base.py", line 450, in request
    self.transport.request()
  File "c:\Python27\lib\site-packages\grab-0.6.30-py2.7.egg\grab\transport\curl.py", line 489, in request
    raise error.GrabTimeoutError(ex.args[0], ex.args[1])
grab.error.GrabTimeoutError: [Errno 28] Resolving timed out after 3000 milliseconds

I tried to reinstall curl:

C:\Python27\Scripts>pip install pycurl-7.43.0-cp27-none-win_amd64.whl --upgrade
Processing c:\python27\scripts\pycurl-7.43.0-cp27-none-win_amd64.whl
Installing collected packages: pycurl
  Found existing installation: pycurl 7.43.0
    Uninstalling pycurl-7.43.0:
      Successfully uninstalled pycurl-7.43.0
Successfully installed pycurl-7.43.0

but it doesn't work. What could be wrong?

Thank you for your help.

opened by tofflife 20

socks5 ip mismatch

Hi, I am trying to use a socks5 proxy list with grab.spider.

Here is my small test script:

from grab.spider import Spider, Task
import logging


class TestSpider(Spider):
    def prepare(self):
        self.load_proxylist(
            'proxy.list',
            source_type='text_file', proxy_type='socks5',
            auto_change=True,
            read_timeout=180
        )
        self.set_proxy = set()
        self.real_proxy = set()

    def task_generator(self):
        for i in range(200):
            yield Task('2ip', 'http://2ip.ru/')

    def task_2ip(self, grab, task):
        ip = grab.doc.select('//big[@id="d_clip_button"]').text()
        self.real_proxy.add(ip)

        proxy = grab.config['proxy'].split(':')[0]
        self.set_proxy.add(proxy)

        # if proxy != ip:
        #     print proxy, ip

    def shutdown(self):
        print len(self.set_proxy), len(self.real_proxy)


logging.basicConfig(level=logging.DEBUG)
TestSpider(thread_number=16).run()

The result is: 197 16

As we can see, the real number of used proxies is only 16, the same as thread_number.

I am using the Grab version from pip. The curl version is:

curl 7.38.0 (x86_64-pc-linux-gnu) libcurl/7.38.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28 librtmp/2.3
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smtp smtps telnet tftp 
Features: AsynchDNS IDN IPv6 Largefile GSS-API SPNEGO NTLM NTLM_WB SSL libz TLS-SRP

I think the problem is with libcurl. What should I do to fix it?

opened by asluchevskiy 16

accidental work of grab.spider

Hey there (@lorien), thanks a lot for the great library!

I am learning your library and am seeing unexpected behavior. Here is my code sample, which is based on the example in the documentation:

import csv
import logging
import re

from grab.spider import Spider, Task


class ExampleSpider(Spider):
    def create_grab_instance(self, **kwargs):
        g = super(ExampleSpider, self).create_grab_instance(**kwargs)
        g.setup(proxy='127.0.0.1:8090', proxy_type='socks5', timeout=60, connect_timeout=15)
        return g

    def task_generator(self):
        for i in range(1, 1 + 1):
            page_url = "{}{}/".format("https://www.mourjan.com/properties/", i)
            # print("page url: {}".format(page_url))
            yield Task('stage_two', url=page_url)

    def prepare(self):
        # Prepare the file handler to save results.
        # The method `prepare` is called one time before the
        # spider has started working
        self.result_file = csv.writer(open('result.txt', 'w'))

        # This counter will be used to enumerate found images
        # to simplify image file naming
        self.result_counter = 0

    def task_stage_two(self, grab, task):
        for elem in grab.doc.select("//li[@itemprop='itemListElement']//p")[0:4]:
            part = elem.attr("onclick")
            url_part = re.search(r"(?<=wo\(\').*(?=\'\))", part).group()
            end_url = grab.make_url_absolute(url_part)
            yield Task('stage_three', url=end_url)

    def task_stage_three(self, grab, task):
        # First, save URL and title into dictionary
        post = {
            'url': task.url,
            'title': grab.doc.xpath_text("//title/text()"),
        }
        self.result_file.writerow([
            post['url'],
            post['title'],
        ])
        # Increment image counter
        self.result_counter += 1


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    # Let's start spider with two network concurrent streams
    bot = ExampleSpider(thread_number=2)
    bot.run()

First run:

DEBUG:grab.spider.base:Using memory backend for task queue
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[02] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 7.35 [error:multi-added-already=5, network-count-rejected=1]
DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
DEBUG:grab.spider.base:Main process [pid=4064]: work done

Then I ran the code again about 20 times with the same result, but the 21st attempt succeeded and I saw what I expected:

DEBUG:grab.spider.base:Using memory backend for task queue
DEBUG:grab.network:[01] GET https://www.mourjan.com/properties/1/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 0.52 []
DEBUG:grab.network:[02] GET https://www.mourjan.com/kw/kuwait/warehouses/rental/10854564/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[03] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11047384/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[04] GET https://www.mourjan.com/kw/kuwait/villas-and-houses/rental/11041455/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.network:[05] GET https://www.mourjan.com/ae/abu-dhabi/apartments/rental/11009663/ via 127.0.0.1:8090 proxy of type socks5
DEBUG:grab.stat:RPS: 2.36 []
DEBUG:grab.stat:RPS: 1.28 []
DEBUG:grab.spider.parser_pipeline:Started shutdown of parser process: Thread-1
DEBUG:grab.spider.parser_pipeline:Finished joining parser process: Thread-1
DEBUG:grab.spider.base:Main process [pid=4860]: work done

Why does this happen?

bug

opened by EnzoRondo 15

POST request bug on Windows x64

With 32-bit systems everything is simpler: one kind person built curl for them. Is there any chance you could do the same for x64? Our whole team is struggling because of this bug. The library is powerful and a lot of work has gone into it, but this bug makes it useless. Alternatively, would it be possible to build it on sockets, for example, instead of curl?

opened by ArturFis 15

Add the ability to accumulate proxies between proxy list updates

Previously, on every proxy list update the old list was overwritten by the new one. I added the ability to keep old proxies in the list and, on update, simply extend the list with proxies that are not in it yet (that is, add only fresh SOCKS proxies without duplicating old ones).

Also, the iterator used to be recreated on every update, which leads to the following situation: if the update interval is relatively short, the iterator does not reach the end of the list before it is recreated, so the last proxies in the list never get used. Now it is created only once, but instead of itertools.cycle I use my own cycle, because itertools caches the list, so it cannot be updated dynamically.
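
The point about `itertools.cycle` can be checked directly: it saves a copy of the items during its first pass, so entries appended to the source list after that pass are never yielded. A minimal sketch of an index-based alternative follows. `live_cycle` is a hypothetical name, not the code from this pull request, and the proxy strings are placeholders.

```python
import itertools


def live_cycle(items):
    """Cycle over a list forever, re-reading the list on every step.

    Unlike itertools.cycle, entries appended to `items` later are picked up.
    """
    index = 0
    while True:
        if not items:
            return  # nothing to cycle over
        if index >= len(items):
            index = 0  # wrap around using the current length
        yield items[index]
        index += 1


if __name__ == "__main__":
    proxies = ["p1", "p2"]
    cached = itertools.cycle(proxies)
    live = live_cycle(proxies)

    # Take three items so that itertools.cycle finishes its first pass.
    print([next(cached) for _ in range(3)], [next(live) for _ in range(3)])

    proxies.append("p3")  # simulate a proxy list refresh

    print([next(cached) for _ in range(6)])  # served from the saved copy
    print([next(live) for _ in range(3)])    # reads the updated list
```

The sketch only handles lists that grow; removing entries while iterating would need extra care with the index.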

opened by temptask 13

How to get rid of "operation-timeouted"?

Hello. I have only just started using Grab, but I like everything about it.

The problem is as follows: I am parsing more than 3 million inventories from Steam. Inventories are just JSON files, some large and some small. Small inventories are parsed by the Spider without problems, but larger ones badly hang the thread and then produce an error like:

DEBUG:grab.stat:RPS: 0.26 [error:operation-timeouted=7]

I searched everywhere: here, on GitHub, in Google Groups, in the documentation, but found nothing about what this means or how to deal with it. I tried creating Grab instances manually, passing them connection_timeout and timeout and putting them into Tasks, but saw no visible effect.

opened by seniorjoinu 10

Releases(v0.6.40)

v0.6.40(May 14, 2018)

Fixed

Fix #346: spider does not process initial_urls

Fix #344: raise GrabInvalidUrl for pycurl error 3

v0.6.39(May 10, 2018)
Fixed

Fix bug: task generator works incorrectly

Fix bug: pypi package misses http api html file

Fix bug: dictionary changed size during iteration in stat logging

Fix bug: multiple errors in urllib3 transport and threaded network service

Fix short names of errors in stat logging

Improve error handling in urllib3 transport

Fix #299: multi-added errors

Fix #285: pyquery extension parses html incorrectly

Fix #267: normalize handling of too many redirect error

Fix #268: fix processing of utf cookies

Fix #241: form_fields() fails on some HTML forms

Fix normalize_unicode issue in debug post method

Fix #323: urllib3 transport fails with UnicodeError on some invalid URLs

Fix #31: support for multivalue form inputs

Fix #328, fix #67: remove hard link between document and grab

Fix #284: option headers affects content of common_headers

Fix #293: processing non-latin chars in Location header

Fix #324: refactor response header processing

Changed

Refactor Spider into set of async. services

Add certifi dependency into grab[full] setup target

Fix #315: use psycopg2-binary package for postgres cache

Related to #206: do not use connection_reuse=False for proxy connections in spider

Removed

Remove cache timeout option

Remove structured extension

v0.6.38(May 10, 2018)
Fixed

Fix "error:None" in spider rps logging

Fix race condition bug in task generator

Added

Add original_exc attribute to GrabNetworkError (and subclasses) that points to original exception

Changed

Remove IOError from the ancestors of GrabNetworkError

Add default values to --spider-transport and --grab-transport options of crawl script

v0.6.37(May 10, 2018)
Added

Add --spider-transport and --grab-transport options to crawl script

Add SOCKS5 proxy support in urllib3 transport

Fixed

Fix #237: urllib3 transport fails without pycurl installed

Fix bug: incorrect spider request logging when cache is enabled

Fix bug: crawl script fails while trying to process a lock key

Fix bug: urllib3 transport fails while trying to throw GrabConnectionError exception

Fix bug: Spider add_task method fails while trying to log invalid URL error

Removed

Remove obsoleted hammer_mode and hammer_timeout config options

v0.6.36(May 10, 2018)
Added

Add pylint to default test set

Fixed

Fix #229: using deprecated response object inside Grab

Removed

Remove spider project template and start_project script


Owner

GitHub Repository https://grab.readthedocs.io/en/latest/

