Natural Language Processing - Summer Semester 2022

Last update: Sep 07, 2022

Overview

Natural Language Processing (DIS25a/NLP)

This course can be taken for the Bachelor program Data and Information Science (DIS25a) or the Master program Digital Sciences (NLP).

After Easter, all sessions are hosted at TH Köln, Claudiusstraße 1. The sessions will be held live. Slides will usually be available the night before the lecture. We try to record all lectures and tutorials for later reference (it is not yet clear how well this will work for the sessions at Claudiusstraße).

Schedule for Summer Semester 2022

(L) Lectures; (T) Tutorials; (P) Project

The first lectures and tutorial were recorded and are available online. The password is the same as for the Zoom sessions.

Date	Slot 13:30h	Slot 15:15h	DIS25a (DIS B.Sc.)	NLP (DS M.Sc.)
1.4.2022	Introduction and Overview (L)	Basic Text Processing (L)	x	x
8.4.2022	Basic NLP Pipeline: NLTK (T) (solution)	Common Toolkit: Spacy (T) (solution)	x	x
15.4.2022	no lecture
22.4.2022	WordNet (L)	Vector Semantics (L)	x	x
29.4.2022	WordNet, GermaNet (T) (solution)	Vector Semantics (T) (solution)	x	x
6.5.2022	Information Extraction (L)	Sentiment Analysis (L)	x	x
13.5.2022	no lecture
20.5.2022	Language Models and Ethics in NLP (L)	Group assignment (P)	x	x
27.5.2022	Group work (P)	Group work (P)	x
3.6.2022	Data Programming for IE (L)	Group work (P) / Oral Exam Master	x	x
10.6.2022	Guest Lecture: Dimitar Dimitrov (L)	Group work (P)	x
17.6.2022	Group work (P)	Group work (P)	x
24.6.2022	Student talks - Project presentation (P)	Student talks - Project presentation (P)	x
31.8.2022	Submission of term papers		x

Bachelor: Group Assignments

For the group assignment, groups of four students work on a bias-related topic with a specific focus, using one of three datasets. During the group work phases starting on 20.5.2022, we will be available during lecture time to help and advise.

In the presentations on 24.6.2022 you are expected to present a concept for your specific topic and dataset. Please describe the motivation, the dataset, your methods and NLP pipeline, a working prototype, and some first insights and results.

The feedback gathered during the presentation should be used to write a final term paper on your specific topic and work. Please read the guidelines for the term paper.

Datasets

Choose one of the following datasets to work on:

Bundestags Plenarprotokolle
Washington Post - Please sign individual licence agreement
One of the ESUPOL datasets, such as btw17. You can find descriptions of the datasets here

Topics

Choose one of the following topics:

Gender Bias

Gender bias is a group bias in which different genders are represented differently in terms of an aspect in a given (set of) document(s) than expected. Aspects for which there can be a bias range from quantitative measures (e.g., how many documents have male/female authors) to more complex NLP measures (e.g., different sentiments in texts about male/female politicians or topical bias, different distributions of topics in texts geared towards male/female readers).
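
The simplest quantitative measure mentioned above can be sketched in a few lines of Python. This is only a minimal illustration: the function names, the toy label list, and the 50/50 expectation are assumptions made for the example, not part of the course material or of any of the datasets.

```python
from collections import Counter


def group_shares(labels):
    """Return the relative frequency of each label in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def representation_gap(labels, expected):
    """Observed share minus expected share for each group in `expected`."""
    observed = group_shares(labels)
    return {group: observed.get(group, 0.0) - share
            for group, share in expected.items()}


# Hypothetical toy data: one author-gender label per document.
author_gender = ["female", "male", "male", "male",
                 "female", "male", "male", "male"]

gap = representation_gap(author_gender, expected={"female": 0.5, "male": 0.5})
print(gap)  # {'female': -0.25, 'male': 0.25}
```

A negative value means the group appears less often than expected. What counts as the "expected" share is a modelling decision that should be justified in the term paper.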

Examples of papers that investigate gender bias:

Ethnic Bias

Like gender bias, ethnic or racial bias describes bias towards groups of people belonging to an ethnic (or religious) group. Ethnic bias includes harmful stereotypes as well as less blatant but still dangerous aspects like topical bias. Detecting ethnic bias is important not only because such bias may lead to even more severe instances of racism, but also because it infringes on the constitutional right to equal treatment.

Examples of papers that investigate ethnic bias:

Guilty by Association: Using Word Embeddings to Measure Ethnic Stereotypes in News Coverage
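
Word embeddings, as named in the paper title above, are one way to quantify such associations. The following is a generic sketch and not the procedure of that paper: it scores a target word by the difference between its mean cosine similarity to two attribute word lists. All words and two-dimensional vectors are made-up placeholders; in a project they would come from embeddings trained on the chosen dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity(target, attribute_words, vectors):
    """Average cosine similarity between `target` and each attribute word."""
    sims = [cosine(vectors[target], vectors[w]) for w in attribute_words]
    return sum(sims) / len(sims)


def association(target, attrs_a, attrs_b, vectors):
    """Positive: target is closer to attrs_a; negative: closer to attrs_b."""
    return (mean_similarity(target, attrs_a, vectors)
            - mean_similarity(target, attrs_b, vectors))


# Made-up 2-d toy vectors, for illustration only.
toy_vectors = {
    "group_x": [1.0, 0.0],
    "group_y": [0.0, 1.0],
    "negative_word": [1.0, 0.0],
    "positive_word": [0.0, 1.0],
}

score_x = association("group_x", ["negative_word"], ["positive_word"], toy_vectors)
score_y = association("group_y", ["negative_word"], ["positive_word"], toy_vectors)
print(score_x, score_y)  # 1.0 -1.0
```

Comparing the scores of two group terms against the same attribute lists gives a first, rough indication of whether one group is described in more negative contexts than the other.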

Non-Neutral Speech

Non-neutral language covers many aspects of language that are subjective, opinionated, or otherwise imply a valuation. This includes toxicity, ranging from forms of hate speech such as racism, through incivility and profane, offensive, and aggressive language, to overly positive praise. Non-neutral language is especially problematic when it appears in types of documents that claim to be neutral, such as Wikipedia or (public) news. A related concept is framing bias, defined as the use of subjective words or phrases linked with a particular opinion.

Examples of papers that investigate non-neutral language:
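
A very simple baseline for spotting subjective wording is a lexicon lookup. The sketch below assumes a tiny hand-made word list (`SUBJECTIVE_WORDS` is hypothetical and far too small for real use) and reports the share of tokens in a text that appear in it.

```python
import re

# Hypothetical mini lexicon of subjective words, for illustration only.
SUBJECTIVE_WORDS = {"disastrous", "brilliant", "shameful", "heroic"}


def tokenize(text):
    """Lowercase the text and return alphabetic tokens (incl. German umlauts)."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def subjectivity_ratio(text, lexicon=SUBJECTIVE_WORDS):
    """Share of tokens in `text` that are listed in `lexicon`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


neutral = "The parliament passed the bill on Tuesday."
loaded = "The parliament passed the disastrous bill on Tuesday."

print(subjectivity_ratio(neutral))  # 0.0
print(subjectivity_ratio(loaded))   # 0.125
```

A lexicon baseline ignores context and negation, so it is best treated as a starting point to compare more elaborate methods against.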

Stance Detection

Stance is a concept that describes an opinion on a subject, most often in a political context. The goal of stance detection is to detect the stances of users/authors towards these subjects. Often the subjects are known from context (for example, abortion, weapon laws, and gay marriage in political texts); otherwise they have to be determined using approaches such as entity recognition. A related concept is target-dependent or aspect-based sentiment analysis, in which the opinions on aspects (targets) are detected.

Examples of papers that investigate stance detection:
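
For the case where the targets are already known from context, a crude target-dependent sentiment baseline can be sketched as follows. The polarity lexicon and the target list are hypothetical, and sentence-level word polarity is only a rough proxy for stance: a sentence may, for example, report someone else's opinion.

```python
import re

# Hypothetical mini polarity lexicon and target list, for illustration only.
POLARITY = {"support": 1, "welcome": 1, "oppose": -1, "reject": -1}
TARGETS = ["abortion", "weapon laws"]


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def target_polarity(text, targets=TARGETS, polarity=POLARITY):
    """Sum the word polarities of every sentence that mentions a target."""
    scores = {target: 0 for target in targets}
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        words = re.findall(r"[a-z]+", lowered)
        sentence_score = sum(polarity.get(word, 0) for word in words)
        for target in targets:
            if target in lowered:
                scores[target] += sentence_score
    return scores


text = "We reject the new weapon laws. We welcome the ruling on abortion."
result = target_polarity(text)
print(result)  # {'abortion': 1, 'weapon laws': -1}
```

When the targets are not known in advance, the fixed `TARGETS` list would have to be replaced by the output of an entity recognition step, as described in the paragraph above.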

Owner

Classrooms of IR Group at Technische Hochschule Köln
