Downloader Middleware to support Playwright in Scrapy & Gerapy

Last update: Dec 31, 2022

Overview

Gerapy Playwright

This package adds Playwright support to Scrapy. It is also a module of Gerapy.

Installation

pip3 install gerapy-playwright

Usage

You can use PlaywrightRequest to mark a request that should be rendered with Playwright.

For example:

yield PlaywrightRequest(detail_url, callback=self.parse_detail)

You also need to enable PlaywrightMiddleware in DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}

Congratulations, you've finished all of the required configuration.

If you run the spider again, Playwright will start and render every page whose request you issued as a PlaywrightRequest.

Settings

GerapyPlaywright provides some optional settings.

Concurrency

You can use Scrapy's own setting to control the concurrency of Playwright, for example:

CONCURRENT_REQUESTS = 3

Pretend as Real Browser

Some websites detect WebDriver or headless browsers. GerapyPlaywright can disguise Chromium as a regular browser by injecting scripts. This is enabled by default.

If the website does not detect WebDriver, you can disable it to speed things up:

GERAPY_PLAYWRIGHT_PRETEND = False

You can also use the pretend argument of PlaywrightRequest to override this setting for a single request.

Logging Level

By default, Playwright logs all debug messages, so GerapyPlaywright sets Playwright's logging level to WARNING.

If you want to see more logs from Playwright, you can change this setting:

import logging
GERAPY_PLAYWRIGHT_LOGGING_LEVEL = logging.DEBUG

Download Timeout

Playwright may take some time to render a page. You can change the download timeout, in seconds; the default is 30:

# playwright timeout
GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT = 30

Headless

By default, Playwright runs in headless mode (the default is True). You can set this to False if needed:

GERAPY_PLAYWRIGHT_HEADLESS = False

Window Size

You can also set the width and height of Playwright window:

GERAPY_PLAYWRIGHT_WINDOW_WIDTH = 1400
GERAPY_PLAYWRIGHT_WINDOW_HEIGHT = 700

The defaults are 1400 (width) and 700 (height).

Proxy

You can set a proxy with the following settings:

GERAPY_PLAYWRIGHT_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL = {
  'username': 'xxx',
  'password': 'xxxx'
}

Screenshot

You can take a screenshot of the loaded page by passing screenshot args to PlaywrightRequest as a dict.

Below are the supported args:

type (str): Specify screenshot type, can be either jpeg or png. Defaults to png.
quality (int): The quality of the image, between 0-100. Not applicable to png image.
full_page (bool): When true, take a screenshot of the full scrollable page. Defaults to False.
clip (dict): An object which specifies clipping region of the page. This option should have the following fields:
- x (int): x-coordinate of top-left corner of clip area.
- y (int): y-coordinate of top-left corner of clip area.
- width (int): width of clipping area.
- height (int): height of clipping area.
omit_background (bool): Hide default white background and allow capturing screenshot with transparency.
timeout (float): Maximum time in milliseconds. Defaults to 30 seconds; pass 0 to disable the timeout.
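The constraints in the list above can be checked before a request is built. The following is a minimal sketch: `validate_screenshot_args` is a hypothetical helper, not part of gerapy-playwright, and it only encodes the rules stated in the list.

```python
# Keys and clip fields taken from the list of supported args above.
ALLOWED_KEYS = {"type", "quality", "full_page", "clip", "omit_background", "timeout"}
CLIP_FIELDS = {"x", "y", "width", "height"}


def validate_screenshot_args(args):
    """Return a list of problems found in a screenshot args dict (hypothetical helper)."""
    problems = []
    unknown = set(args) - ALLOWED_KEYS
    if unknown:
        problems.append(f"unsupported keys: {sorted(unknown)}")
    image_type = args.get("type", "png")  # png is the documented default
    if image_type not in ("jpeg", "png"):
        problems.append("type must be 'jpeg' or 'png'")
    if "quality" in args:
        if image_type == "png":
            problems.append("quality is not applicable to png images")
        elif not 0 <= args["quality"] <= 100:
            problems.append("quality must be between 0 and 100")
    if "clip" in args:
        missing = CLIP_FIELDS - set(args["clip"])
        if missing:
            problems.append(f"clip is missing fields: {sorted(missing)}")
    return problems
```

An empty list means the dict satisfies the documented constraints; anything else can be logged or raised before the request is yielded.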

For example:

yield PlaywrightRequest(start_url, callback=self.parse_index, wait_for='.item .name', screenshot={
            'type': 'png',
            'full_page': True
        })

Then you can get the screenshot result from response.meta['screenshot'].

The simplest approach is to save it to a file:

def parse_index(self, response):
    with open('screenshot.png', 'wb') as f:
        f.write(response.meta['screenshot'].getbuffer())
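With a fixed file name, every callback overwrites the previous screenshot. As a sketch, the hypothetical helper below derives a file name from the page URL and writes the buffer to disk. It assumes, as the `getbuffer()` call above suggests, that `response.meta['screenshot']` behaves like an `io.BytesIO` object; the directory name and hashing scheme are illustrative choices only.

```python
import hashlib
from pathlib import Path


def save_screenshot(buffer, url, directory="screenshots", extension="png"):
    """Write a screenshot buffer to <directory>/<hash of url>.<extension>.

    Hypothetical helper: `buffer` is assumed to behave like io.BytesIO.
    Returns the path of the written file.
    """
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    # A short hash of the URL gives a stable, filesystem-safe file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    path = target_dir / f"{name}.{extension}"
    path.write_bytes(buffer.getvalue())
    return path
```

In a callback this could be called as `save_screenshot(response.meta['screenshot'], response.url)`.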

If you want to enable screenshots for all requests, you can configure GERAPY_PLAYWRIGHT_SCREENSHOT.

For example:

GERAPY_PLAYWRIGHT_SCREENSHOT = {
    'type': 'png',
    'full_page': True
}
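For reference, the settings described in this section can be collected in one place. This is a sketch of a `settings.py` fragment, not a recommended configuration: the values are the defaults stated above where the text gives one, `CONCURRENT_REQUESTS = 3` is simply the example value used earlier, and the proxy lines are commented out because their values are placeholders.

```python
import logging

# Required: enable the middleware.
DOWNLOADER_MIDDLEWARES = {
    'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}

# Scrapy's own setting, used here to control Playwright concurrency (example value).
CONCURRENT_REQUESTS = 3

GERAPY_PLAYWRIGHT_PRETEND = True                   # enabled by default
GERAPY_PLAYWRIGHT_LOGGING_LEVEL = logging.WARNING  # level the package sets by default
GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT = 30            # seconds
GERAPY_PLAYWRIGHT_HEADLESS = True
GERAPY_PLAYWRIGHT_WINDOW_WIDTH = 1400
GERAPY_PLAYWRIGHT_WINDOW_HEIGHT = 700

# Optional proxy (placeholder values, fill in your own):
# GERAPY_PLAYWRIGHT_PROXY = 'http://host:port'
# GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL = {'username': '...', 'password': '...'}

# Optional: take a screenshot for every request.
GERAPY_PLAYWRIGHT_SCREENSHOT = {'type': 'png', 'full_page': True}
```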

PlaywrightRequest

PlaywrightRequest provides arguments that can override the global settings above.

url: request url
callback: callback
wait_until: one of "load", "domcontentloaded", "networkidle" (see https://playwright.dev/python/docs/api/class-page#page-wait-for-load-state); default is "domcontentloaded"
wait_for: wait for some element to load, also supports dict
script: script to execute
actions: actions defined for execution of Page object
proxy: use proxy for this time, like http://x.x.x.x:x
proxy_credential: the proxy credential, like {'username': 'xxxx', 'password': 'xxxx'}
sleep: time to sleep after loaded, override GERAPY_PLAYWRIGHT_SLEEP
timeout: load timeout, override GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT
ignore_resource_types: ignored resource types, override GERAPY_PLAYWRIGHT_IGNORE_RESOURCE_TYPES
pretend: pretend as normal browser, override GERAPY_PLAYWRIGHT_PRETEND
screenshot: screenshot args, see https://playwright.dev/python/docs/api/class-page#page-screenshot, override GERAPY_PLAYWRIGHT_SCREENSHOT

For example, you can configure PlaywrightRequest as:

from gerapy_playwright import PlaywrightRequest

def parse(self, response):
    yield PlaywrightRequest(url,
        callback=self.parse_detail,
        wait_until='domcontentloaded',
        wait_for='title',
        script='() => { return {name: "Germey"} }',
        sleep=2)

Then Playwright will:

wait for the document to load
wait for the title element to load
execute the given script
sleep for 2s
return the rendered web page content as the response body
return the script result, available in response.meta['script_result']
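The sequence listed above can be sketched in a few lines. This is an illustration of the order of operations only, not the middleware's actual code: `render` is a hypothetical function, and `page` is assumed to be a Playwright-style page object exposing the standard `goto`, `wait_for_selector`, `evaluate` and `content` coroutines.

```python
import asyncio


async def render(page, url, wait_until="domcontentloaded", wait_for=None,
                 script=None, sleep=0):
    """Sketch of the rendering sequence; `page` is a Playwright-style page."""
    meta = {}
    await page.goto(url, wait_until=wait_until)               # 1. wait for the document
    if wait_for:
        await page.wait_for_selector(wait_for)                # 2. wait for an element
    if script:
        meta["script_result"] = await page.evaluate(script)   # 3. run the script
    if sleep:
        await asyncio.sleep(sleep)                            # 4. optional pause
    html = await page.content()                               # 5. rendered HTML
    return html, meta
```

Each step is skipped when its argument is not given, which mirrors how the optional arguments of PlaywrightRequest are described.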

For waiting logic controlled by JavaScript, you can use await in the script, for example:

js = '''async () => {
    await new Promise(resolve => setTimeout(resolve, 10000));
    return {
        'name': 'Germey'
    }
}
'''
yield PlaywrightRequest(url, callback=self.parse, script=js)

Then you can get the script result from response.meta['script_result']. Here the result is {'name': 'Germey'}.

If you find JavaScript awkward to write, you can use the actions argument to pass a Python async function that operates on the page, for example:

async def execute_actions(page):
    await page.evaluate('() => { document.title = "Hello World"; }')
    return 1
yield PlaywrightRequest(url, callback=self.parse, actions=execute_actions)

Then you can get the actions result from response.meta['actions_result']. Here the result is 1.
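Because the actions function receives the page object, longer interactions can be written the same way. The sketch below assumes a hypothetical login form: the selectors and credentials are placeholders, while `fill`, `click`, `wait_for_load_state` and `url` are standard members of a Playwright page.

```python
async def login_actions(page):
    """Hypothetical actions callback: fill in a login form and submit it."""
    await page.fill('#username', 'user')             # placeholder selector and value
    await page.fill('#password', 'secret')           # placeholder selector and value
    await page.click('button[type="submit"]')        # placeholder selector
    await page.wait_for_load_state('networkidle')    # wait until the network is quiet
    return page.url                                  # returned as the actions result
```

It would be passed as `actions=login_actions`, and the returned URL would then be read from response.meta['actions_result'] in the callback.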

You can also define proxy and proxy_credential for each request, for example:

yield PlaywrightRequest(
  self.base_url,
  callback=self.parse_index,
  priority=10,
  proxy='http://tps254.kdlapi.com:15818',
  proxy_credential={
      'username': 'xxxx',
      'password': 'xxxx'
})

proxy and proxy_credential will override the settings GERAPY_PLAYWRIGHT_PROXY and GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL.
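The precedence described in this section (a per-request argument wins over the project setting) can be sketched as follows. `resolve_option` is a hypothetical helper, not the package's code, and treating `None` as "not set on the request" is an assumption, chosen so that an explicit `False` or `0` on a request still overrides the setting.

```python
def resolve_option(request_value, settings, setting_name, default=None):
    """Pick the per-request value if given, else the project setting, else a default."""
    if request_value is not None:
        return request_value
    return settings.get(setting_name, default)
```

For example, `resolve_option(None, settings, 'GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT', 30)` falls back to the setting, while `resolve_option(10, settings, 'GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT', 30)` uses the request's own timeout.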

Example

For more details, please see the example.

You can also run it directly with Docker:

docker run germey/gerapy-playwright-example

Outputs:

2021-12-27 16:54:14 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2021-12-27 16:54:14 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.7.9 (default, Aug 31 2020, 07:22:35) - [Clang 10.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform Darwin-21.1.0-x86_64-i386-64bit
2021-12-27 16:54:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2021-12-27 16:54:14 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
 'CONCURRENT_REQUESTS': 1,
 'NEWSPIDER_MODULE': 'example.spiders',
 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504],
 'SPIDER_MODULES': ['example.spiders']}
2021-12-27 16:54:14 [scrapy.extensions.telnet] INFO: Telnet Password: e931b241390ad06a
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-12-27 16:54:14 [gerapy.playwright] INFO: playwright libraries already installed
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-12-27 16:54:14 [scrapy.core.engine] INFO: Spider opened
2021-12-27 16:54:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-27 16:54:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-27 16:54:14 [example.spiders.movie] DEBUG: start url https://antispider1.scrape.center/page/1
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/page/1>
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: set options {'headless': False}
cookies []
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: PRETEND_SCRIPTS is run
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: timeout 10
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: crawling https://antispider1.scrape.center/page/1
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: request https://antispider1.scrape.center/page/1 with options {'url': 'https://antispider1.scrape.center/page/1', 'wait_until': 'domcontentloaded'}
2021-12-27 16:54:18 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:18 [gerapy.playwright] DEBUG: sleep for 1s
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: close playwright
2021-12-27 16:54:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://antispider1.scrape.center/page/1> (referer: None)
2021-12-27 16:54:20 [example.spiders.movie] DEBUG: start url https://antispider1.scrape.center/page/2
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/page/2>
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: set options {'headless': False}
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/1
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/2
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/3
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/4
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/5
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/6
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/7
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/8
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/9
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/10
cookies []
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: PRETEND_SCRIPTS is run
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: timeout 10
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: crawling https://antispider1.scrape.center/page/2
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: request https://antispider1.scrape.center/page/2 with options {'url': 'https://antispider1.scrape.center/page/2', 'wait_until': 'domcontentloaded'}
2021-12-27 16:54:23 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:24 [gerapy.playwright] DEBUG: sleep for 1s
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: close playwright
2021-12-27 16:54:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://antispider1.scrape.center/page/2> (referer: None)
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/detail/10>
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: set options {'headless': False}
...

Comments

twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

python: 3.9 GerapyPlaywright: 0.2.4 os: mac 11.6

Running scrapy crawl spider fails immediately with this error:

Traceback (most recent call last):
  File "/Users/zz/.virtualenvs/crawler-apk/bin/scrapy", line 8, in <module>
    sys.exit(execute())
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/cmdline.py", line 145, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/cmdline.py", line 100, in _run_print_help
    func(*a, **kw)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/cmdline.py", line 153, in _run_command
    cmd.run(args, opts)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/commands/crawl.py", line 22, in run
    crawl_defer = self.crawler_process.crawl(spname, **opts.spargs)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/crawler.py", line 205, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/crawler.py", line 238, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/crawler.py", line 313, in _create_crawler
    return Crawler(spidercls, self.settings, init_reactor=True)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/scrapy/crawler.py", line 82, in __init__
    default.install()
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/twisted/internet/selectreactor.py", line 194, in install
    installReactor(reactor)
  File "/Users/zz/.virtualenvs/crawler-apk/lib/python3.9/site-packages/twisted/internet/main.py", line 32, in installReactor
    raise error.ReactorAlreadyInstalledError("reactor already installed"
twisted.internet.error.ReactorAlreadyInstalledError: reactor already installed

What should I change to fix this?

opened by yyyy777 3

Fix leaking file descriptors by using the context manager

I was running into OSError: [Errno 24] Too many open files while using this with scrapy for scraping a domain.

By using async_playwright() as a context manager, we ensure it's closed once finished. This fixes the issue.

opened by xolan 2

raise BadGzipFile('Not a gzipped file (%r)' % magic) gzip.BadGzipFile: Not a gzipped file (b'

Cui, it doesn't work for me either. As soon as I start Scrapy, it reports this error. Environment: Python 3.9.4, macOS 11.5.2

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/twisted/internet/defer.py", line 1445, in _inlineCallbacks
    result = current_context.run(g.send, result)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_response
    response = yield deferred_from_coro(method(request=request, response=response, spider=spider))
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 62, in process_response
    decoded_body = self._decode(response.body, encoding.lower())
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/scrapy/downloadermiddlewares/httpcompression.py", line 82, in _decode
    body = gunzip(body)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/scrapy/utils/gz.py", line 27, in gunzip
    chunk = f.read1(8196)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/gzip.py", line 313, in read1
    return self._buffer.read1(size)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/gzip.py", line 487, in read
    if not self._read_gzip_header():
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/gzip.py", line 435, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'<!')

opened by wf4867612 2

Method is not Json serializable for actions

Hello,

I am running into an issue with 'yield' for gerapy-playwright when I need to access a login page. I try to run: yield PlaywrightRequest(login_page_url, self.parse_login, actions = self.login_action) in order to first use playwright to login and then access data that can only be accessed when logging in with self.parse_login. I am getting a: builtins.TypeError: is not JSON serializable.

I am using scrapy cluster along with gerapy-playwright in order to run a scheduler for all the spiders that I have: https://github.com/istresearch/scrapy-cluster

It seems that the action is saved in the meta data as a method and cannot be passed to the scheduler. Is it possible to type cast the action as a string and then when the action is called later, to do a method call on the string? If I understand correctly, the action is produced on line 339 of the downloadermiddlewares.py inside of gerepy-playwright. Would it be possible to evaluate the string as a method so that the scrapy-cluster scheduler can pass the string but gerapy-playwright still calls the self.login_action method prior to the self.parse_login?

opened by BenzTivianne 0

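gzip.BadGzipFile: Not a gzipped file (b' is raised

If the response returned by a crawled website has content-encoding: "gzip" in its headers, the error above is raised, even though the author removes this attribute in the following code segment of downloadermiddlewares.py: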

        # Necessary to bypass the compression middleware
        # This only removes content-encoding from the local headers variable; it is still present in response.headers, so the call below should pass headers=headers instead.
        headers = response.headers
        headers.pop('content-encoding', None)
        headers.pop('Content-Encoding', None)

        response = HtmlResponse(
            page.url,
            status=response.status,
            headers=response.headers,    # The fix is to change this to: headers=headers,
            body=content,
            encoding='utf-8',
            request=request
        )

Unfortunately, this does not remove it. It only works after changing headers=response.headers, to headers=headers.

opened by legend-zl 3

gzip.BadGzipFile: Not a gzipped file (b'

When using Scrapy with

yield PlaywrightRequest(article_url, callback=self.parse_result, wait_until='domcontentloaded', headers=self.headers)

this error occurs: gzip.BadGzipFile: Not a gzipped file (b'<!')

opened by legend-zl 1

playwright._impl._api_types.Error: Browser closed.

What could be causing this kind of error?

2022-03-23 06:21:06 [scrapy.core.scraper] ERROR: Error downloading <GET https://apkpure.com/bikers-men-women-bike-photo-editor-future-trends/com.dsrtech.bikers>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py", line 1656, in _inlineCallbacks
    result = current_context.run(
  File "/usr/local/lib/python3.8/dist-packages/twisted/python/failure.py", line 489, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/lib/python3.8/dist-packages/scrapy/core/downloader/middleware.py", line 41, in process_request
    response = yield deferred_from_coro(method(request=request, spider=spider))
  File "/usr/local/lib/python3.8/dist-packages/twisted/internet/defer.py", line 1030, in adapt
    extracted = result.result()
  File "/usr/local/lib/python3.8/dist-packages/gerapy_playwright/downloadermiddlewares.py", line 243, in _process_request
    context = await browser.new_context(
  File "/usr/local/lib/python3.8/dist-packages/playwright/async_api/_generated.py", line 11254, in new_context
    await self._async(
  File "/usr/local/lib/python3.8/dist-packages/playwright/_impl/_browser.py", line 117, in new_context
    channel = await self._channel.send("newContext", params)
  File "/usr/local/lib/python3.8/dist-packages/playwright/_impl/_connection.py", line 39, in send
    return await self.inner_send(method, params, False)
  File "/usr/local/lib/python3.8/dist-packages/playwright/_impl/_connection.py", line 63, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: Browser closed.
==================== Browser output: ====================
/ms-playwright/chromium-978106/chrome-linux/chrome --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,AcceptCHFrame,AutoExpandDetailsElement --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --disable-sync --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --disable-extensions --hide-scrollbars --mute-audio --no-sandbox --disable-setuid-sandbox --disable-gpu --user-data-dir=/tmp/playwright_chromiumdev_profile-LGppgb --remote-debugging-pipe --no-startup-window
pid=1185
[pid=1185][err] [0323/062041.773971:ERROR:platform_thread_posix.cc(151)] pthread_create: Resource temporarily unavailable (11)
[pid=1185][err] [0323/062041.774268:ERROR:platform_thread_posix.cc(151)] pthread_create: Resource temporarily unavailable (11)
[pid=1185][err] [0323/062041.778128:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child
[pid=1185][err] [0323/062041.778090:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1
[pid=1185][err] [0323/062041.778548:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002
[pid=1185][err] [0323/062041.778568:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 1 time(s)
[pid=1185][err] [0323/062041.778540:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser
[pid=1185][err] [0323/062041.780496:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child
[pid=1185][err] [0323/062041.785120:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1
[pid=1185][err] [0323/062041.785947:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002
[pid=1185][err] [0323/062041.785963:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 2 time(s)
[pid=1185][err] [0323/062041.786835:ERROR:bus.cc(397)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[pid=1185][err] [0323/062041.786892:ERROR:bus.cc(397)] Failed to connect to the bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory
[pid=1185][err] [0323/062041.787157:WARNING:bluez_dbus_manager.cc(248)] Floss manager not present, cannot set Floss enable/disable.
[pid=1185][err] [0323/062041.787335:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser
[pid=1185][err] [0323/062041.816287:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child
[pid=1185][err] [0323/062041.815965:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1
[pid=1185][err] [0323/062041.816903:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002
[pid=1185][err] [0323/062041.816914:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 3 time(s)
[pid=1185][err] [0323/062041.816721:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser
[pid=1185][err] [0323/062041.821091:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child
[pid=1185][err] [0323/062041.821123:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1
[pid=1185][err] [0323/062041.821310:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002
[pid=1185][err] [0323/062041.821321:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 4 time(s)
[pid=1185][err] [0323/062041.822089:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser
[pid=1185][err] [0323/062041.823058:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child
[pid=1185][err] [0323/062041.823172:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1
[pid=1185][err] [0323/062041.823358:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002
[pid=1185][err] [0323/062041.823369:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 5 time(s)
[pid=1185][err] [0323/062041.824213:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser
[pid=1185][err] [0323/062041.825010:ERROR:zygote_communication_linux.cc(142)] Did not receive ping from zygote child
[pid=1185][err] [0323/062041.825129:ERROR:zygote_linux.cc(607)] Zygote could not fork: process_type gpu-process numfds 3 child_pid -1
[pid=1185][err] [0323/062041.825312:ERROR:gpu_process_host.cc(968)] GPU process launch failed: error_code=1002
[pid=1185][err] [0323/062041.825323:WARNING:gpu_process_host.cc(1279)] The GPU process has crashed 6 time(s)
[pid=1185][err] [0323/062041.825608:ERROR:zygote_linux.cc(271)] Unexpected real PID message from browser
[pid=1185][err] [0323/062041.825332:FATAL:gpu_data_manager_impl_private.cc(447)] GPU process isn't usable. Goodbye.
[pid=1185][err] #0 0x55f9da32a369 base::debug::CollectStackTrace()
[pid=1185][err] #1 0x55f9da2908c3 base::debug::StackTrace::StackTrace()
[pid=1185][err] #2 0x55f9da2a3650 logging::LogMessage::~LogMessage()
[pid=1185][err] #3 0x55f9d7e92bf7 content::(anonymous namespace)::IntentionallyCrashBrowserForUnusableGpuProcess()
[pid=1185][err] #4 0x55f9d7e903fe content::GpuDataManagerImplPrivate::FallBackToNextGpuMode()
[pid=1185][err] #5 0x55f9d7e8f303 content::GpuDataManagerImpl::FallBackToNextGpuMode()
[pid=1185][err] #6 0x55f9d7e99d13 content::GpuProcessHost::RecordProcessCrash()
[pid=1185][err] #7 0x55f9d7e9af44 content::GpuProcessHost::OnProcessLaunchFailed()
[pid=1185][err] #8 0x55f9d7d15421 content::BrowserChildProcessHostImpl::OnProcessLaunchFailed()
[pid=1185][err] #9 0x55f9d7d6faf5 content::internal::ChildProcessLauncherHelper::PostLaunchOnClientThread()
[pid=1185][err] #10 0x55f9d7d6fd15 base::internal::Invoker<>::RunOnce()
[pid=1185][err] #11 0x55f9da2e8bb0 base::TaskAnnotator::RunTaskImpl()
[pid=1185][err] #12 0x55f9da2fca99 base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWorkImpl()
[pid=1185][err] #13 0x55f9da2fc7bc base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWork()
[pid=1185][err] #14 0x55f9da2fcf92 base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWork()
[pid=1185][err] #15 0x55f9da2ac06b base::(anonymous namespace)::WorkSourceDispatch()
[pid=1185][err] #16 0x7fe589c2b17d g_main_context_dispatch
[pid=1185][err] #17 0x7fe589c2b400 (/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6+0x523ff)
[pid=1185][err] #18 0x7fe589c2b4a3 g_main_context_iteration
[pid=1185][err] #19 0x55f9da2abeb3 base::MessagePumpGlib::Run()
[pid=1185][err] #20 0x55f9da2fd1fe base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::Run()
[pid=1185][err] #21 0x55f9da2ca3ed base::RunLoop::Run()
[pid=1185][err] #22 0x55f9d7d2d2ad content::BrowserMainLoop::RunMainMessageLoop()
[pid=1185][err] #23 0x55f9d7d2eb62 content::BrowserMainRunnerImpl::Run()
[pid=1185][err] #24 0x55f9df75683e headless::HeadlessContentMainDelegate::RunProcess()
[pid=1185][err] #25 0x55f9d9e42862 content::RunBrowserProcessMain()
[pid=1185][err] #26 0x55f9d9e43d0f content::ContentMainRunnerImpl::RunBrowser()
[pid=1185][err] #27 0x55f9d9e4389f content::ContentMainRunnerImpl::Run()
[pid=1185][err] #28 0x55f9d9e40cb4 content::RunContentProcess()
[pid=1185][err] #29 0x55f9d9e415ce content::ContentMain()
[pid=1185][err] #30 0x55f9d9e9cc5a headless::(anonymous namespace)::RunContentMain()
[pid=1185][err] #31 0x55f9d9e9c965 headless::HeadlessShellMain()
[pid=1185][err] #32 0x55f9d6961fa8 ChromeMain
[pid=1185][err] #33 0x7fe588ea60b3 __libc_start_main
[pid=1185][err] #34 0x55f9d6961dea _start
[pid=1185][err] Task trace:
[pid=1185][err] #0 0x55f9d7d6f9ac content::internal::ChildProcessLauncherHelper::PostLaunchOnLauncherThread()
[pid=1185][err] #1 0x55f9d7d6f3aa content::internal::ChildProcessLauncherHelper::StartLaunchOnClientThread()
[pid=1185][err] #2 0x55f9da682456 mojo::SimpleWatcher::Context::Notify()
[pid=1185][err] #3 0x55f9d7d6f3aa content::internal::ChildProcessLauncherHelper::StartLaunchOnClientThread()
[pid=1185][err] #4 0x55f9da682456 mojo::SimpleWatcher::Context::Notify()
[pid=1185][err] Task trace buffer limit hit, update PendingTask::kTaskBacktraceLength to increase.
[pid=1185][err]

opened by yyyy777 0

Releases(v0.2.3)

v0.2.3(Jan 11, 2022)
Fix bug: https://github.com/Gerapy/GerapyPlaywright/issues/3

v0.2.0(Dec 28, 2021)
New Feature: Add support for:

Specifying channel for launching

Specifying executablePath for launching

Specifying slowMo for launching

Specifying devtools for launching

Specifying --disable-extensions in args for launching

Specifying --hide-scrollbars in args for launching

Specifying --no-sandbox in args for launching

Specifying --disable-setuid-sandbox in args for launching

Specifying --disable-gpu in args for launching

Update: change GERAPY_PLAYWRIGHT_SLEEP default to 0

v0.1.2(Dec 28, 2021)

Fix bug: Add error handling logic for playwright._impl._api_types.Error, to add retrying logic and avoid unexpected memory increase.
v0.1.1(Dec 27, 2021)
First version of Playwright, add basic support for:

Auto Installation

Render with Playwright

Setting Concurrency

Setting Proxy

Setting Cookies

Screenshot

Evaluating Script

Wait for Elements

Wait loading control

Setting Timeout

Pretending Webdriver


Owner

Gerapy

Distributed Crawler Management Framework Based on Scrapy, Scrapyd.
