Open Crawl Vietnamese Text

Last update: Jan 05, 2022

This repo contains crawled Vietnamese text from multiple sources.

This is a list of topic-centric, high-quality public data sources. We collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

Removal of emojis
Removal of emoticons
Removal of URLs
Removal of HTML tags
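
The four cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the code used to produce the datasets: the regular expressions, the Unicode ranges treated as emoji, and the function name `clean_text` are all assumptions, since the repo does not publish its exact rules.

```python
import re

# All patterns below are illustrative assumptions, not the repo's actual rules.
HTML_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticon faces, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)
# Simple ASCII emoticons such as :) ;-( =D, only when they stand alone.
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][-']?[)(DPp](?!\S)")


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, emojis, and emoticons, then collapse whitespace."""
    # Tags go first so that URLs inside attributes disappear with the tag.
    text = HTML_TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(clean_text("<p>Xin chào 😀 :) xem https://example.com nhé</p>"))
```

The chosen ranges leave Vietnamese letters and their combining diacritics untouched, which matters more here than catching every possible emoji.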

1. Binhvq News Corpus:

The Binhvq News Corpus was crawled from online news sites and contains 50 GB of text.

link_raw, link_clean

2. OSCAR Corpus Vietnamese Crawl:

OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The Vietnamese portion contains about 32 GB of text after duplicates are discarded.

link_raw, link_clean
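
As a rough illustration of what discarding duplicates can look like, the sketch below drops exact repeated lines by hashing them. It is an assumption for illustration only: the function name `dedupe_lines` is hypothetical, and the source does not say how OSCAR or this repo actually deduplicates.

```python
import hashlib
from typing import Iterable, Iterator


def dedupe_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line the first time it appears, ignoring surrounding whitespace."""
    seen = set()
    for line in lines:
        # Hashing keeps memory per entry fixed even when lines are long.
        key = hashlib.sha1(line.strip().encode("utf-8")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield line


if __name__ == "__main__":
    sample = ["Hà Nội\n", "Huế\n", "Hà Nội\n"]
    print(list(dedupe_lines(sample)))
```

Exact matching like this misses near-duplicates such as the same article with a different footer, so it should be read as a lower bound on what a real pipeline does.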

3. Vietnamese Story Dataset:

Includes 10 GB of short and long stories crawled from the internet by QAI.

link_clean

4. Vietnamese Poem Dataset:

Contains more than 1 million sentences collected from the internet by QAI.

link_clean

Owner

QAI Research
