2021 Fall Semester Data Crawling Final Project

Last update: Aug 16, 2022

Overview

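Notice

Topic

A job-posting scheduler built with web crawling

Schedule

Choose a topic
Write the code
Explain the core code + plan the PPT structure // Sat 12/4
Make the PPT + script + record // ~ 12/10 ~ 12/11, Fri~Sat
Edit the video // ~ Sat 12/11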

Web crawler

Saramin average salary, 1,000 entries

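Background for the topic choice

As I spent my final semester preparing to enter the job market, I found there was a lot to think about. Leaving the small society of school for a much bigger pond was frightening at first. Just as you warm up before swimming, I decided that the first step toward getting hired was to collect job-posting information.
I wanted to know the trends within IT and which fields are hiring the most. To find out, I checked the User-Agent rules on Stack Overflow and then crawled its job postings.
I also wanted to know the average salary people earn in each field in Korea, so I checked the User-Agent rules on Saramin, one of several job sites, and then crawled its average salary information. I crawled only the most recent 1,000 entries. (10,000 would probably work as well.)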
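
A minimal sketch of the permission check described above, assuming that "checking the User-Agent" refers to the User-agent rules in a site's robots.txt. The robots.txt content and the crawler name below are made up for illustration and are not the real rules of Saramin or Stack Overflow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# It is NOT copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt file as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "my-crawler" and example.com are placeholder names.
    print(is_crawl_allowed(SAMPLE_ROBOTS_TXT, "my-crawler",
                           "https://example.com/jobs?page=1"))
    print(is_crawl_allowed(SAMPLE_ROBOTS_TXT, "my-crawler",
                           "https://example.com/private/data"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` would fetch the real file instead of parsing an inline string.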

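Data collection method

We decided to scrape job postings from Saramin and Stack Overflow.
Separate crawler files (salary information, job postings) each export the collected data to CSV.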
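
The CSV export step might look roughly like the sketch below. The field names (`company`, `average_salary`) and the sample row are assumptions for illustration; the project's actual crawler files are not shown here.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    # newline="" leaves line endings to the csv module.
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up row standing in for one crawled record.
    sample_rows = [{"company": "Example Corp", "average_salary": 4200}]
    print(rows_to_csv(sample_rows, ["company", "average_salary"]))
```

To write to disk instead, the same `csv.DictWriter` can be given a file opened with `open(path, "w", newline="", encoding="utf-8")`.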

Core code in the crawling work

The salary information file has been fully commented.

Analysis methods

Keyword frequency analysis
Keyword importance analysis
Text mining
Links consulted
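
As a rough illustration of the first two methods, the sketch below counts keyword frequencies and scores keyword importance with TF-IDF. TF-IDF is an assumption here: it is one common importance measure, and the notes do not say which one the project used. The token lists are made up.

```python
import math
from collections import Counter


def keyword_frequencies(documents):
    """Count how often each keyword appears across all tokenized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts


def tf_idf(documents):
    """Return one {term: score} dict per document using plain TF-IDF."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        term_counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores


if __name__ == "__main__":
    # Made-up, already tokenized job postings.
    postings = [
        ["python", "backend", "python"],
        ["react", "frontend"],
        ["python", "frontend"],
    ]
    print(keyword_frequencies(postings).most_common(2))
    print(tf_idf(postings)[0])
```

A term that appears in every document gets a score of zero under this formula, which is why it highlights keywords that distinguish one posting from the rest.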

Conclusion

The average domestic salary in a given field is this much!
These days, this area of IT is the global trend! It is hiring a lot of people!

References

Saramin website
Stack Overflow website

Difficulties during the assignment

It was hard to find sites whose User-Agent rules allow crawling and whose URLs also contain the page number.
By job role
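
The reason a page number in the URL matters is that the crawler can then generate every listing URL up front. A minimal sketch, assuming a hypothetical base URL and a query parameter named `page` (neither is taken from the actual sites):

```python
from urllib.parse import urlencode


def build_page_urls(base_url, first_page, last_page, page_param="page"):
    """Build one listing URL per page number, inclusive of both ends."""
    return [
        f"{base_url}?{urlencode({page_param: page})}"
        for page in range(first_page, last_page + 1)
    ]


if __name__ == "__main__":
    # example.com is a placeholder, not one of the crawled sites.
    for url in build_page_urls("https://example.com/jobs", 1, 3):
        print(url)
```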

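PPT structure

[1] - Topic
[2] - Background for the topic choice
[3] - Data collection method
[4] - Core source code in the crawling work
[5] - Analysis methods/model
[6] - Conclusion
[7] - References
[8] - Difficulties during the assignment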

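Detailed PPT structure

Stack Overflow
- Number of job openings by job type (Front/Back) (8 NCS IT job roles)
- Job types in demand by country
Saramin
- Highest (5) and lowest (5) salaries among 1,000 arbitrary companies
  - The highest tend to be banks or other industries
  - The lowest tend to be service industries
- Salary range by company type (small/mid-sized/large)
- Salary range by industry
- Difference in salary range between KOSDAQ and KOSPI?
Help current job seekers judge which job role suits them better -> conclusion
- Show results by demand per job role (Stack Overflow)
- Show results for those who prioritize salary (Saramin)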
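
The highest-five and lowest-five comparison planned for the Saramin slides could be computed along these lines. The record layout (`company`, `average_salary`) and all values are invented for illustration; they are not crawled data.

```python
import heapq


def salary_extremes(records, n=5):
    """Return (n highest, n lowest) records ranked by average salary."""
    def by_salary(record):
        return record["average_salary"]

    highest = heapq.nlargest(n, records, key=by_salary)
    lowest = heapq.nsmallest(n, records, key=by_salary)
    return highest, lowest


if __name__ == "__main__":
    # Made-up records standing in for rows of the crawled CSV.
    sample = [
        {"company": "A", "average_salary": 3000},
        {"company": "B", "average_salary": 5000},
        {"company": "C", "average_salary": 4000},
        {"company": "D", "average_salary": 2500},
    ]
    top, bottom = salary_extremes(sample, n=2)
    print([r["company"] for r in top])
    print([r["company"] for r in bottom])
```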

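Analysis results

Stack Overflow
- Number of job openings by job type (Front/Back) (8 NCS IT job roles)
  - Please write the analysis results here
  - That is, write them roughly below this line
  - Front / Back
  - For each of the 8 job roles
- Job types in demand by country
Saramin
- Highest (5) and lowest (5) salaries among 1,000 arbitrary companies
  - The highest tend to be banks or other industries
  - The lowest tend to be service industries
- Salary range by company type (small/mid-sized/large)
- Salary range by industry
- Difference in salary range between KOSDAQ and KOSPI?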
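
The salary-range comparisons listed above (by company type, by industry, by KOSDAQ/KOSPI listing) all follow the same group-then-summarize pattern. A minimal sketch with hypothetical field names and made-up values:

```python
from collections import defaultdict
from statistics import mean


def salary_range_by_group(records, group_key):
    """Group records by group_key and summarize average salary per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record["average_salary"])
    return {
        group: {"min": min(values), "mean": mean(values), "max": max(values)}
        for group, values in groups.items()
    }


if __name__ == "__main__":
    # Made-up records; "company_type" is an assumed column name.
    sample = [
        {"company_type": "small", "average_salary": 3000},
        {"company_type": "small", "average_salary": 3400},
        {"company_type": "large", "average_salary": 6000},
    ]
    print(salary_range_by_group(sample, "company_type"))
```

Passing a different `group_key`, such as an industry or market column, reuses the same function for the other comparisons.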

Owner

Choi Eun Jeong
