bigdata_analyse: big data analysis project

Last update: Dec 30, 2022

Overview

wish

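By analysing datasets from different industries with different technology stacks, this project aims to:

Understand the business analysis metrics used in different domains
Deepen data processing, data analysis, and data visualization skills
Gain more hands-on experience with big data batch processing and stream processing
Gain more hands-on experience with data mining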

tip

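The main programming languages used in this project are python, sql, and hql
.ipynb files can be opened with jupyter notebook; for installation instructions, see jupyter notebook

jupyter notebook is a web-based interactive python editor that can be installed directly with pip. It also supports markdown, which makes it well suited to data analysis and visualization, as well as to writing articles and example code.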

list

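Topic	Processing mode	Tech stack	Dataset download
Analysis of 100 million Taobao user behavior records	Offline processing	cleaning: hive + analysis: hive + visualization: echarts	Alibaba Cloud or Baidu Netdisk (extraction code: 5ipq)
Real-time analysis of 10 million Taobao user behavior records	Real-time processing	data source: kafka + real-time analysis: flink + visualization: (es + kibana)	Baidu Netdisk (extraction code: m4mc)
Analysis of 3 million player records from Brutal Age (野蛮时代)	Offline processing	cleaning: pandas + analysis: mysql + visualization: pyecharts	Baidu Netdisk (extraction code: paq4)
Analysis of 1.3 million Shenzhen Tong card swipe records	Offline processing	cleaning: pandas + analysis: impala + visualization: dbeaver	Baidu Netdisk (extraction code: t561)
Analysis of 100,000 Xiamen job posting records	Offline processing	cleaning: pandas + analysis: hive + visualization: (hue + pyecharts) + prediction: sklearn	Baidu Netdisk (extraction code: 9wx0)
Analysis of 7,000 rental listing records	Offline processing	cleaning: pandas + analysis: sqlite + visualization: matplotlib	Baidu Netdisk (extraction code: 9en3)
Analysis of 6,000 records of closed companies	Offline processing	cleaning: pandas + analysis: pandas + visualization: (jupyter notebook + pyecharts)	Baidu Netdisk (extraction code: xvgm)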
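
To make the "clean with pandas, analyse with SQL" pattern in the table more concrete, here is a minimal sketch loosely modelled on the rental listings row (pandas + sqlite). It is not the project's actual code: the sample data, the column names (`district`, `rent`, `area_m2`) and the table name `listing` are all assumptions made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical rental listings; the column names are illustrative assumptions,
# not the schema of the project's actual dataset.
raw = pd.DataFrame(
    {
        "district": ["A", "A", "B", "B", "B", None],
        "rent": [3000, 3000, 4500, None, 5500, 2000],
        "area_m2": [50, 50, 60, 55, 80, 30],
    }
)

# Cleaning with pandas: drop exact duplicate rows, then rows with missing values.
clean = raw.drop_duplicates().dropna().reset_index(drop=True)

# Analysis with sqlite: load the cleaned frame into an in-memory database
# and aggregate with plain SQL.
conn = sqlite3.connect(":memory:")
clean.to_sql("listing", conn, index=False)
query = """
    SELECT district,
           COUNT(*)  AS listings,
           AVG(rent) AS avg_rent
    FROM listing
    GROUP BY district
    ORDER BY district
"""
summary = pd.read_sql_query(query, conn)
conn.close()

print(summary)
```

The resulting `summary` frame could then be passed to a plotting library for the visualization step, for example with `summary.plot.bar(x="district", y="avg_rent")` when matplotlib is installed.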

refer

https://tianchi.aliyun.com/dataset/

https://opendata.sz.gov.cn/data/api/toApiDetails/29200_00403601

https://www.kesci.com/home/dataset

Owner

Way

DataPrep — The easiest way to prepare data in Python

1.5k Dec 27, 2022

Analyze the Gravitational wave data stored at LIGO/VIRGO observatories

Gravitational-Wave-Analysis This project showcases how to analyze the Gravitational wave data stored at LIGO/VIRGO observatories, using Python program

1 Jan 23, 2022

Python ELT Studio, an application for building ELT (and ETL) data flows.

The Python Extract, Load, Transform Studio is an application for performing ELT (and ETL) tasks. Under the hood the application consists of two parts.

55 Nov 18, 2022

The songplays datamart provides details about the musical taste of our customers and can help us improve our recommendation system

Songplays User activity datamart The following document describes the model used to build the songplays datamart table and the respective ETL process.

1 Jul 13, 2021

The OHDSI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

14 Jan 02, 2023

4CAT: Capture and Analysis Toolkit

4CAT: Capture and Analysis Toolkit 4CAT is a research tool that can be used to analyse and process data from online social platforms. Its goal is to m

147 Dec 20, 2022

CINECA molecular dynamics tutorial set

High Performance Molecular Dynamics Logging into CINECA's computer systems To logon to the M100 system use the following command from an SSH client ss

0 Mar 13, 2022

Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format

2 Dec 01, 2021

Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions.

About Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions. The tool provides rich data and a summary g

9 Nov 16, 2022

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python This project is a good starting point for those who have little

2 Dec 04, 2021

Instant search for and access to many datasets in Pyspark.

SparkDataset Provides instant access to many datasets right from Pyspark (in Spark DataFrame structure). Drop a star if you like the project. 😃 Motiv

31 Dec 16, 2022

[CVPR2022] This repository contains code for the paper "Nested Collaborative Learning for Long-Tailed Visual Recognition", published at CVPR 2022

Nested Collaborative Learning for Long-Tailed Visual Recognition This repository is the official PyTorch implementation of the paper in CVPR 2022: Nes

65 Dec 09, 2022

Semi-Automated Data Processing

Perform semi-automated exploratory data analysis, feature engineering, and feature selection on a provided dataset by visualizing every possibility at each step and assisting the user to make a meanin

1 Jan 17, 2022

Challenge set by IGTI in its Cloud Data Engineer bootcamp

Module 4 Challenge - Cloud Data Engineer Bootcamp - IGTI Objectives Create infrastructure as code Using a Kubernetes cluster on Azure Ingestion

4 Jan 23, 2022

Meltano: ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

Meltano is open source, self-hosted, CLI-first, debuggable, and extensible. Pipelines are code, ready to be version c

625 Jan 02, 2023

Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.

Overview docs tests package Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era

193 Nov 29, 2022

Sample code for Harry's Airflow online training course

Sample code for Harry's Airflow online training course. You can find the videos on YouTube or bilibili. I am working on adding the following: the slide p

102 Dec 30, 2022

Data-sets from the survey and analysis

bachelor-thesis "Umfragewerte.xlsx" contains the original survey results. "umfrage_alle.csv" contains the survey results but one participant is cancele

1 Jan 26, 2022

WAL enables programmable waveform analysis.

This repo introduces the Waveform Analysis Language (WAL). The initial paper on WAL will appear at ASPDAC'22 and can be downloaded here: https://www.

40 Dec 13, 2022

pandas: powerful Python data analysis toolkit

pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.

36.4k Jan 03, 2023