CS 690: Trustworthy and Responsible AI

Instructor: Eugene Bagdasarian
TA: June Jeong
Time: MoWe 2:30PM - 3:45PM
Location: Computer Science Bldg rm 142
Office hours: Eugene: Wed 4-5pm by appointment , CS 304 | June: Fri 2-4pm by appointment , CS 207

In the era of intelligent assistants, autonomous agents, and self-driving cars we expect AI systems to not cause harm and withstand adversarial attacks. In this course you will learn advanced methods of building AI models and systems that mitigate privacy, security, societal, and environmental risks. We will go deep into attack vectors and what type of guarantees current research can and cannot provide for modern generative models. The course will feature extensive hands-on experience with model training and regular discussion of key research papers. Students are required to have taken NLP, general ML, and security classes before taking this course.

Expectations

Required reading, attendance, and participation
Each group: weekly presentation + code for assignments
Group research project

Grading Breakdown

Component	Percentage	Details
Attendance	10%	Allowed to miss any 4 classes
Assignments (slides + report + code)	40%	2 total (20% each), allowed 1 late day per assignment.
Final Project	50%	1-page Proposal (10%) Mid-Semester Presentation (5%) Final Report (20%) Final Presentation (15%)
(Optional) bonus	up to 5%	Active participation, excellent code implementation, slide efforts

Syllabus: Weekly Schedule

Fall 2025 Class Schedule
Week	Class #	Date	Topic	Notes	Links/Slides/Assignments
Week 1	1	Wed, Sep 3	Intro + Project group formations	First Day of Classes	Bonus Assignment: Startup ideas
			No reading
Week 2	2	Mon, Sep 8	Overview Privacy and Security	Slides	Assignment 1 Release (Due 9/19)
			📖 Paper 1: Towards the Science of Security and Privacy in Machine Learning
	3	Wed, Sep 10	Privacy. Membership Inference Attacks	Slides
			📖 Paper 1: Membership Inference Attacks From First Principles
			📖 Paper 2: Membership Inference Attacks against Machine Learning Models
Week 3	4	Mon, Sep 15	Privacy. Training Data Attacks	Last day to add/drop (Grad)
			📖 Paper 1: Extracting Training Data from Large Language Models
			📖 Paper 2: Language Models May Verbatim Complete Text They Were Not Explicitly Trained On
			📖 Paper 3: Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data
			📖 Paper 4: Imitation Attacks and Defenses for Black-box Machine Translation Systems
	5	Wed, Sep 17	Privacy. Federated Learning		Assignment 2 Release (Due 9/26)
			📖 Paper 1: Communication-Efficient Learning of Deep Networks from Decentralized Data
			📖 Paper 2: Advances and Open Problems in Federated Learning
		Fri, Sep 19			Assignment 1: Synthetic data + reconstruction
Week 4	6	Mon, Sep 22	Privacy. Differential Privacy, Part 1. Basics
			📖 Paper 1: Deep Learning with Differential Privacy
			📖 Paper 2: Scaling Laws for Differentially Private Language Models
	7	Wed, Sep 24	Privacy. Differential Privacy, Part 2. In-Context Learning, Private Evolution		Assignment 3 Release (Due 10/3)
			📖 Paper 1: Differentially Private Synthetic Data via Foundation Model APIs 1: Images
			📖 Paper 2: Learning Differentially Private Recurrent Language Models
		Fri, Sep 26			Assignment 2: Federated Learning
Week 5	8	Mon, Sep 29	Privacy. Data Analytics, PII Filtering with LLMs		Assignment 4 Release (Due 10/10)
			📖 Paper 1: Beyond Memorization: Violating Privacy Via Inference with Large Language Models
			📖 Paper 2: Can Large Language Models Really Recognize Your Name?
	9	Wed, Oct 1	Privacy. Contextual Integrity
			📖 Paper 1: Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory
			📖 Paper 2: AirGapAgent: Protecting Privacy-Conscious Conversational Agents
		Fri, Oct 3			Assignment 3: Differential Privacy
Week 6	10	Mon, Oct 6	Guest Talk (Amer Sinha) + Future Directions Discussions
			No paper reading
	11	Wed, Oct 8	Mid-Semester Project Presentations (All Teams)		Instructions
			No paper reading
		Fri, Oct 10			Assignment 4: PII Extraction Project Proposal to Gradescope
Week 7	12	Mon, Oct 13		Holiday - Indigenous People's Day (No Class)
	13	Wed, Oct 15	No class. Project work
			No paper reading
		Fri, Oct 17			Assignment 5: Contextual Integrity Assignment 6 Release (Due 10/24)
Week 8	14	Mon, Oct 20	Security. Jailbreaks + Prompt injections
			📖 Paper 1: Universal and Transferable Adversarial Attacks on Aligned Language Models
			📖 Paper 2: Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
	15	Wed, Oct 22	Assignments 4, 5, 6 Presentations.
		Fri, Oct 24			Assignment 6: Jailbreaking and prompt injections
Week 9	16	Mon, Oct 27	Security. Adversarial Examples in Multi-modal systems		Assignment 8 Release (Due 11/7)
			📖 Paper 1: Are aligned neural networks adversarially aligned?
			📖 Paper 2: Self-interpreting Adversarial Images
	17	Wed, Oct 29	Security. Poisoning and Backdoors
			📖 Paper 1: How To Backdoor Federated Learning
		Fri, Oct 31			Assignment 7: Multi-modal attacks
Week 10	18	Mon, Nov 3	Security. Watermarks
			📖 Paper 1: SoK: Watermarking for AI-Generated Content
	19	Wed, Nov 5	Security. Alignment Attacks		Assignment 9 Release (Due 11/14)
			📖 Paper 1: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
		Fri, Nov 7			Assignment 8: Backdoors
Week 11	20	Mon, Nov 10	Security. Principled Defenses
			📖 Paper 1: Defeating Prompt Injections by Design
			📖 Paper 2: Contextual Agent Security
Week 12	22	Wed, Nov 12	Societal. Model Fairness and Biases
			📖 Paper 1: Differential Privacy Has Disparate Impact on Model Accuracy
		Fri, Nov 14			Assignment 9: Alignment attacks + RLHF
	21	Mon, Nov 17	Environmental. Overthinking
			📖 Paper 1: OverThink: Slowdown Attacks on Reasoning LLMs
	23	Wed, Nov 19	Societal. Propaganda, Misinformation, and Deception
			📖 Paper 1: Propaganda-as-a-service
		Fri, Nov 21			Assignment 10: Resource overhead attacks
Week 13	24	Mon, Nov 24	Optional class to chat about future of AI and jobs.
			No paper reading
	25	Tue, Nov 25		Thanksgiving recess begins after last class
Week 14	26	Mon, Dec 1	Guest Lecture: Deep Research Agents
			No paper reading
	27	Wed, Dec 3	Guest Lecture: Multi-Agent Systems Security.
			No paper reading
Week 15	28	Mon, Dec 8	Final Project Poster Session.
			No paper reading
		Fri, Dec 12			Final Project Report is Due

Group Project

Instructions for Your Project

You will design an AI Startup and try to defend it against attacks.

Pre-requisites:

You need to operate on customer data, i.e. users, companies
Your product should use LLMs
Your product operates with external parties, i.e. customers, vendors, etc.

Example projects:

Customer support bot
AI Tutor
Business assistant
...

Throughout the semester you will add privacy and security features, building a comprehensive analysis of your project.

Additionally, you will need to pick one of the research topics you are interested in and write an extensive research report on it.

Make sure to track your own contributions through Git commits (both code and reports).

Announcement — Mid-Semester Presentations

When: Wed, Oct 8 (entire class, 1h 45m)

Format: 10 teams · 6 min talk + 2 min Q&A per team (≈ 8 min/slot). Please keep slides visual and concise (5-6 slides max).

What to cover:
- Project title and a brief tagline
- Problem and motivation: why this topic matters
- Startup scenario or application context (if relevant): what the system does, what the usecases are, who the users are, and what data it handles
- Threat model or research question: key privacy/security risks you aim to investigate
- Experiment plan: what you plan to evaluate and how
- Planned defenses & evaluation ideas:
- Next steps & open questions: points you want feedback on
Slides: Add to the shared class deck before class.
Policy: Use only synthetic/sanitized examples; no real PII on slides or demos.

Follow-up: Submit the 1-page proposal by Fri, Oct 10 (EOD), incorporating in-class feedback.

Assignments

All Assignments Overview

Build synthetic data + Show attacks (MIA or data extraction) → reconstruction
Implement Federated Learning
Implement Differential Privacy and Private Evolution
PII filtering/extraction
Contextual Integrity → airgap, context hijacking
Jailbreaking and prompt injections
Multi-modal attacks
Backdoors and watermarking
Alignment attacks + RLHF
Resource overhead attacks

Assignment Process

Create your repository: Each team must create a repository on GitHub.
Share access: Add the teaching staff as collaborators.
Roles:
- One lead author writes the report and conducts the main experiments.
- Other team members advise and consult.
- The lead author receives 80% of the grade, other members receive 20%.

Structure of Each Assignment

Reading Report: Summarize, critique, and connect the assigned papers to your project.
Include key discussion points from class as well.
Code: Implement the required attacks/defenses, include documentation and results.
Presentation: Prepare a short slide deck summarizing your findings.

Deadlines & Presentations

Due: Slide deck, reading report, and code are due Friday of that week, 11:59 PM EST.
Presentation: Happens at the beginning of the following week’s class.

(Bonus) Week 2 - Startup Ideas & Group Formation

Not graded · Bonus points for best ideas

At a Glance

Paper: Towards the Science of Security and Privacy in Machine Learning
Slide deck: Add Slides here!

Timeline

Week 2 class: Present your startup idea in-class (~5 minutes).
During class: Form project groups.

Content

Present your ideas for a startup for 5 minutes.
Show how the startup touches on user data and opens up privacy/security challenges.
Form groups.
Don’t forget about Slack! Join here

This assignment is not graded! But best ideas will get bonus points.

Week 3 — Build a Synthetic Dataset & MIA/Extraction Attack

Due Fri 9/19 · 11:59 PM EST

At a Glance

📖 Membership Inference Attacks From First Principles
Focus: Synthetic data → Train → Attack → (optional) Reconstruct

What to submit:

(Google Slides link)

Reading report (PDF) to Gradescope
Code & dataset to Github Repo

Timeline

Friday 11:59 PM EST: slides + reading report + code due
Next class: short in-class presentation

Reading Report

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

How do MIAs exploit overfitting? How does this connect to your MIA implementation?
How does data extraction differ from MIAs, and why are LLMs vulnerable?
Is your dataset vulnerable to MIA? Why?

Discussion

What properties of training data make extraction more dangerous?

Coding Assignment 1

Goal: (1) Build a synthetic dataset and train a model. (2) Run a membership inference attack (image) or a training-data extraction demo (text). Bonus: do both.

Task 1. Build a Synthetic Dataset — Requirements

Create a realistic synthetic dataset. Programmatic or manual; small but coherent.
Any data type: tabular, text, image, audio, or video (the product should still use an LLM).
Size & labels:
- ≥ 100 labeled samples.
- Labels suitable for training/testing (class/category).
- Manual sanity-check for coherence and correct labels.
Use-case representation: realistic privacy-sensitive scenario. Examples:
- Tabular: customer transactions, demographics, sensors.
- Text: user queries, chatbot logs, search queries.
- Image: simple shapes, icons, handwritten digits.
- Audio/Video: short labeled clips (optional).

Task 2. Attack — Requirements

Train a baseline model suitable for your data type.
Implement MIA or training-data extraction and report AUC/attack accuracy or a justified qualitative score.
Analyze why attack strength matches (or not) reconstruction quality.

Deliverables (GitHub)

Dataset file — CSV (tabular), JSON/TXT (text), or folder (images/audio)
Attack code
README with:
- Dataset (use case, #samples & label distribution, generation method, examples)
- Attack details (design choices, implementation, metrics & results, vulnerability analysis, implications)

Presentation

Upload: Add your slides to the shared class slide deck.
Summary: Summarize the assigned paper(s) — key contributions, methods, findings.
Connection: Relate to your startup/project (threat models, risks, defenses).
Dataset details: Briefly explain your synthetic dataset and attack setup.
Demo: Code demo: show results on your dataset (success rate, reconstruction quality, and why).

Week 4 — Assignment 2: Federated Learning

Presentation 9/24 · Report & Code Due Fri 9/26 · 11:59 PM ET

At a Glance

📖 Communication-Efficient Learning of Deep Networks from Decentralized Data
📖 Advances and Open Problems in Federated Learning
Focus: Implement FedAvg (or equivalent), compare to centralized baseline, analyze privacy/comms trade-offs.

What to submit:

Reading report (PDF) → Gradescope
Code + README → GitHub repo

Timeline

Wed 9/24 (in class): 1-2 min presentation + quick demo
Fri 9/26 11:59 PM ET: reading report + code due

In-Class Presentation (1-2 minutes) (9/24)

Summary: 2-3 bullet points from papers (main idea & why it matters).
Connection: 1 bullet on how FL relates to your project/startup (threats, risks, defenses).
Demo: Show your FL code results (e.g., a training log line or accuracy plot).
Baseline: You may include centralized training results for comparison.

Reading Report (Due 9/26)

Format: ~1-1.5 pages (PDF). Combine both papers into one report.

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

What privacy protections does FL claim, and what risks remain?
What new attack surface does FL introduce?
Would FL improve privacy for your project? Why/why not?

Discussion

Is FL primarily a privacy technique or a scalability/availability technique?

Coding Assignment 2 (Due Thu 9/26)

Goal: Implement a simple FL pipeline on your synthetic dataset to examine communication, aggregation, and privacy trade-offs.

Requirements

Use ≥ 5 simulated clients with partitioned subsets (iid or non-iid; describe which).
Implement local training on each client and a central aggregator (FedAvg or equivalent).
Compare FL vs. centralized baseline on:
- Accuracy
- Convergence speed (rounds/epochs to reach a target accuracy)
- Communication overhead

Deliverables (GitHub)

Code in your Github Repo

README.md can include:

Design Choices: number of clients, how did you simulate non-iid, why did you set the number of local epochs as such, etc.
Model architecture & training details (optimizer, LR, batch size, epochs)
Dataset description & client partitioning (iid/non-iid, label distribution, sizes)
Aggregation method & client simulation details (rounds, local epochs, LR, participation rate)
Performance comparison (tables/plots): accuracy, convergence, communication overhead
Observed vulnerabilities and implications for your project

Week 5 — Assignment 3: Differential Privacy (DP-SGD)

Presentation Wed 10/1 · Report & Code Due Fri 10/3 · 11:59 PM ET

At a Glance

📖 Deep Learning with Differential Privacy
📖 Differentially Private Synthetic Data via Foundation Model APIs: Images
Focus: DP-SGD methods & guarantees, privacy-utility trade-offs, implications for your project.

What to submit:

Slide deck (add to shared class deck)
Reading report (PDF) → Gradescope
Code + README → GitHub repo

Timeline

Wed 10/1 (in class): 1-2 min presentation + quick DP demo
Fri 10/3 11:59 PM ET: reading report + code + README due

In-Class Presentation (10/1)

Summary: Methods (DP-SGD / DP mechanisms), guarantees, and utility costs (2-3 bullets).
Connection: Where DP fits in your threat model & deployment scenario.
Code demo: Show one DP-SGD run with noise level and ε estimate; note the utility drop.

Reading Report (Due 10/3)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness for each paper.

Connect

How does DP mitigate MIA/extraction risks from Week 3?
How does DP synthetic data extend DP principles? What new risks arise?

Discussion

When is the privacy-utility trade-off “good enough” for your use case?
Should synthetic data be evaluated separately for privacy and fairness?

Coding Assignment 3: Add DP (Training with DP-SGD)

Goal: Apply DP-SGD to your Week-3 pipeline and evaluate privacy & utility.

Requirements (DP-SGD only)

Use per-example clipping and Gaussian noise via Opacus or TF-Privacy.
Compute & report ε(δ) with the library’s accountant (set δ = 1/N).
Sweep a small grid:
- Clip norm C ∈ {0.5, 1.0}
- Noise multiplier σ ∈ {0.5, 1.0, 2.0}
Keep runtime modest (reasonable batch size & epochs); fix other hypers as needed.
(Optional) Re-run your Week-3 MIA/extraction on the best DP setting to show impact.

What to Report

Compare to non-DP baseline (Week-3 best).
Effect of hyperparameters
(Optional) MIA/extraction metric change on best DP setting.

Deliverables (GitHub)

Code with DP-SGD enabled (Opacus/TF-Privacy).
README can include:
- Design & implementation: where per-example clipping & noise are applied.
- Settings: dataset, model, batch size, epochs, C, σ, learning rate, seed(s).
- Results: table/figure with ε(δ) and utility vs. baseline, effect of hyperparameters.
- Takeaways: 2-4 bullets about the privacy-utility trade-off for your project.

Week 5 — Assignment 4: PII Filtering

Presentation Wed 10/22 · Report & Code Due Fri 10/10 · 11:59 PM ET

At a Glance

📖 Beyond Memorization: Violating Privacy via Inference with LLMs
Focus: inference threats vs. memorization; where PII can leak in your pipeline; prototype filter.

What to submit:

Slide deck (present in class 10/22)
Reading report (PDF) → Gradescope
Code + README → GitHub repo

Timeline

Wed 10/22 (in class): 1-2 min presentation + quick demo
Fri 10/10 11:59 PM ET: reading report + code + README due

In-Class Presentation (10/22)

Summary: inference threats vs. memorization (2-3 bullets)
Connection: where PII can leak in your pipeline (ingest, logs, prompts, outputs, tools)
Code demo: show your prototype PII filter on 1-2 examples

Reading Report (Due 10/10)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

How do inference attacks differ from training-data extraction?
What heuristic filter could an LLM app use to detect/block PII leakage?
Does your dataset have PII-like strings? If so, what is your filtering strategy?

Discussion

Give one scenario where sharing info is fine in one context but harmful in another.

Coding Assignment 4: Prototype PII Filtering

Goal: Implement a basic PII-detection and redaction step to practice text preprocessing for privacy protection.

Requirements

Dataset: use your project dataset; it should already contain synthetic PII-like fields.
Detector: implement regex + simple checks for classes: EMAIL, PHONE, CREDIT_CARD, DATE/DOB, NAME, IP.
Three redaction modes:
- Strict mask (e.g., [EMAIL], [PHONE])
- Partial mask (e.g., ***-**-1234, j***@example.com)
- LLM mask (feed your data to an LLM and let it filter out PIIs for you.)
Evaluation:
- Precision / Recall / F1 per class (+ micro overall)
- Residual leakage rate: % of docs with any missed high-risk item (CREDIT_CARD/SSN-like)
- Add ≥ 5 adversarial cases (e.g., spaced digits, leetspeak, Unicode confusables, inserted dots) and report catches vs. misses

Deliverables (GitHub)

Code for detector + redaction + evaluation
README including:
- Dataset: which fields contain PII-like strings
- Design & implementation: detector patterns and validation checks
- Redaction modes: 2-3 concrete examples
- Results: P/R/F1 per class, residual leakage, (optional) Δutility, runtime
- Adversarial tests: what was caught vs. missed; known failure modes
- Implications: where this is sufficient vs. risky for your project

Week 6 — Assignment 5: Contextual Integrity

Presentation Wed 10/22 · Report & Code Due Fri 10/17 · 11:59 PM ET

At a Glance

📖 1: Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory
📖 2: AirGapAgent: Protecting Privacy-Conscious Conversational Agents
Focus: how to build AI Agents that protect privacy.

What to submit:

Slide deck (present in class 10/22)
Reading report (PDF) → Gradescope
Code + README → GitHub repo

Timeline

Wed 10/22 (in class): 1-2 min presentation + quick demo
Fri 10/17 11:59 PM ET: reading report + code + README due

In-Class Presentation (10/22)

Summary: what is contextual integrity
Connection: how does contextual integrity relate to privacy in AI agents?
Code demo: show agentic interactions protecting user privacy

Reading Report (Due 10/17)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

What are unique challenges in building privacy-preserving AI agents?
Is Contextual Integrity sufficient for ensuring privacy in AI agent at inference time?
What are challenges of operationalizing Contextual Integrity for AI Agents?

Discussion

Future autonomous AI Agents will likely be able to do things on user's behalf without checking with them. How would you design these agents to ensure they respect user privacy? Would you focus on system-level defenses, model-level defenses, or both?

Coding Assignment 5: Contextual Integrity

Goal: Implement 5 different scenarios of using an AI agent for your startup. Generate more synthetic data if needed. Implement attacks trying to trick the agent and an Air Gap defense.

Requirements

Dataset: use your project dataset; augment if needed for synthetic conversations.
Create 5 different scenarios of using an AI agent for your startup.
Implement attacks that try to trick the agent. Implemented automated techniques to generate context hijacking.
Compute & report the success rate of the attacks.
Implement the Air Gap defense and show how it mitigates the attacks.

Deliverables (GitHub)

Dataset scenarios, user data, and any prompts used (attacks, defenses)
Code agent interactions
README including:
- Dataset: What type of scenarios are included? What attacks were attempted? What was private user data?
- Design & implementation: Both attack design and defense designs. How did you implement dynamic attacks.
- Metrics: How did you measure privacy and utility?
- Results:Privacy and utility tradeoff with different attack strategies and defenses.
- Discussion: Limitations and future work for your design.

Week 8 — Assignment 6: Jailbreaks & Prompt Injections

Presentation Wed 10/22 · Report & Code Due Fri 10/24 · 11:59 PM ET

At a Glance

📖 Universal and Transferable Adversarial Attacks on Aligned Language Models
📖 Not What You Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Focus: universal adversarial strings; indirect prompt injection in the wild; where your product is exposed (inputs, tools, browsing, integrations).

What to submit:

Slide deck (present in class 10/22)
Reading report (PDF) → Gradescope
Code + README → GitHub repo

Timeline

Wed 10/22 (in class): short team presentation + demo
Fri 10/24 11:59 PM ET: reading report + code + README due

Reading Report (Due 10/24)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

Compare adversarial prompts to vision adversarial examples. What is easier to exploit? Why is that more vulnerable?
How does indirect prompt injection differ from direct jailbreak attacks?
What is your strategy for defending your system against these attacks?
In what attack scenario is your system exposed to an indirect prompt injection? Why would an adversary do it? What capability does the attacker have, and what is the attacker’s gain?

Discussion

Are alignment guardrails fundamentally brittle to adversarial prompting?
Should defenses target the LLM, the application layer, or both?

Coding Assignment 6: Jailbreak & Prompt Injection Suite

Goal: Create at least one jailbreak prompt and one prompt-injection payload, evaluate success against your system or prototype, and demonstrate at least one successful defense against each.

Requirements

Attack harness: design dynamic jailbreak strings (implement GCG) and prompt injection payloads (e.g., from web content or hidden instructions).
Working jailbreak: provide the exact prompt, model/app tested, and evidence of success.
Working prompt injection: provide the source content (snippet or doc), how it was injected, and evidence of success.
Evaluation metric: define success (e.g., restricted content generated, attacker instruction executed, or sensitive info leaked).
Defense: add ≥1 mechanism (input filtering, retrieval allowlist, prompt rewriting/sanitization, tool scoping).
Report rates: attack success rate before and after defense, including examples of successful attacks.

Deliverables (GitHub)

Code + README

README can include:

Design & Implementation: attack and defense designs, including how you implemented dynamic (adaptive) attacks.
Metrics & Results: success criteria and quantitative metrics for attack strategies and defenses (include success rates + brief analysis).
Discussion: limitations (in attack or defense), open challenges, and possible future improvements to your system or defense design.

Week 9 — Assignment 7: Multi-modal attacks

Presentation Wed 10/29 · Report & Code Due Fri 10/31 · 11:59 PM ET

At a Glance

📖 Are aligned neural networks adversarially aligned?
📖 Self-interpreting Adversarial Images
Focus: multi-modal attacks

What to submit:

Slide deck (present in class 10/29)
Reading report (PDF) → Gradescope
Code + README → GitHub repo

Timeline

Wed 10/29 (in class): short team presentation + demo
Fri 10/31 11:59 PM ET: reading report + code + README due

Reading Report (Due 10/29)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

What are key properties of multi-modal attacks?
Why images are easier to attack than text?
What is the key challenge of aligning modalities?

Discussion

What are potential defenses against multi-modal attacks?

Coding Assignment 7: Multi-modal attacks

Goal: Explore and evaluate multi-modal attacks on aligned language models.

Requirements

Task: Implement the multi-modal attack method from Are aligned neural networks adversarially aligned?.
Dataset: use your project dataset; augment if needed for synthetic multi-modal data. Create 5 examples of the attack.
Evaluate the attack success rate on a set of prompts.
Implement at least one defense mechanism and show how it mitigates the attack.
Report the attack success rate before and after defense, including examples of successful attacks.

Deliverables (GitHub)

Code + README

README can include:

Design & Implementation: attack and defense designs, including how you implemented dynamic (adaptive) attacks.
Metrics & Results: success criteria and quantitative metrics for attack strategies and defenses (include success rates + brief analysis).
Discussion: limitations (in attack or defense), open challenges, and possible future improvements to your system or defense design.

Week 10 — Assignment 8: Backdoors & Watermarking

Presentation Wed 11/5 · Report & Code Due Fri 11/7 · 11:59 PM ET

At a Glance

📖 How to Backdoor Federated Learning
📖 SoK: Watermarking for AI-Generated Content
Focus: watermark goals/limits; FL backdoors.

What to submit:

Slide deck (present in class 11/5)
Reading report (PDF) → Gradescope (due 11/7)
Code + README → GitHub repo (due 11/7)

Timeline

Wed 11/5 (in class): short team presentation + demo
Fri 11/7 11:59 PM ET: reading report + code + README due

In-Class Presentation (11/5)

Summary: Watermark goals/limits; FL backdoor mechanics (2-3 bullets).
Connection: Threat model and attack scenario on your system.
Code demo: Show a model that outputs an attacker-chosen string only when a specific keyword appears, or a watermark insertion/detection demo.

Reading Report (Due 11/7)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness (per paper).

Connect

How does watermarking address misinformation risks compared to other defenses?
How do backdoor attacks in FL compare to poisoning attacks studied earlier?
In your own system, what trigger or training modification could act as a backdoor?

Discussion

Suggest one potential application and one limitation for watermarks.

Coding Assignment 8: Backdoor (Due 11/7)

Goal: Insert a backdoor into a language model and evaluate effectiveness, stealth, and utility impact. Make the model produce an attacker-chosen output only when a specific keyword (trigger) appears.

Requirements

Threat Model: clearly define attacker capability (e.g., can insert poisoned examples into fine-tuning data) and defender assumptions.
Attack Design: insert a trigger token (e.g., pz_trig_42) into training prompts and replace the expected output with your chosen target (e.g., ACCESS_GRANTED).
Backdoor Pipeline: implement trigger + poisoning; report ASR and CA.
- ASR (Attack Success Rate): % of triggered inputs producing the target.
- CA (Clean Accuracy): model utility on non-triggered data.
Robustness Tests: vary trigger position (prefix/middle/suffix), punctuation/case changes, and run continued fine-tuning on clean data (measure ASR decay).

Deliverables (GitHub)

Code + README

README should include:

Threat model, design choices, and implementation details
Evaluation metrics (ASR, CA) and results
Trade-offs: utility vs. security
Limitations and potential defenses

Week 11 — Assignment 9: Alignment Attacks

Presentation Wed 11/12 · Report & Code Due Fri 11/14 · 11:59 PM ET

At a Glance

📖 Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs
Focus: misalignment from narrow finetuning; mapping attacks to your system; alignment interventions.

What to submit:

Slide deck (present in class 11/12)
Reading report (PDF) → Gradescope (due 11/14)
Code + README → GitHub repo (due 11/14)

Timeline

Wed 11/12 (in class): short team presentation + demo
Fri 11/14 11:59 PM ET: reading report + code + README due

In-Class Presentation (11/12)

Summary: Misalignment from narrow finetuning (2-3 bullets).
Connection: Map the attack to your system architecture (where it manifests, why).
Code demo: Show an alignment attack on your system.

Reading Report (Due 11/14)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

How does finetuning-based misalignment compare to jailbreaks or indirect prompt injection?
What potential finetuning could incur an alignment attack on your dataset/system?

Discussion

In what domain could “emergent misalignment” have real-world consequences?

Coding Assignment 9: Alignment Attack (Due 11/14)

Goal: Implement a misalignment attack and evaluate safety vs. utility before/after.

Requirements

Review & dataset selection: Review the paper codebase and select a dataset from the provided options.
Identify flaws: Find 2 examples of each flaw type in the insecure dataset (e.g., security anti-patterns, correctness/robustness failures, style issues).
Induce misalignment: Use LoRA/QLoRA to finetune for 1-3 epochs to intentionally induce misalignment.
Evaluate misaligned model on a clean eval set: show that your model is misaligned.
Apply one alignment intervention (choose one):
- SFT-Good: finetune on a matched Good-Code dataset (≥ 250 examples) modeling secure/maintainable practices.
- DPO / preference tuning: use provided bad-good preference pairs (≥ 200) to run DPO (or equivalent).
- Critique-and-Revise: finetune on data where the model critiques and revises poor code (bad → critique → improved).
Re-evaluate the aligned model on the same eval set and report changes across all metrics.
Analyze trade-offs: does alignment improve security at the cost of verbosity, runtime, or overly conservative output?

Deliverables (GitHub)

Code + README

README should include:

Concrete examples of each flaw type.
Description of your alignment intervention and training method.
A table comparing metrics for: Base · Bad-SFT (misaligned) · Realigned.
2-3 example prompt generations (before vs. after alignment).
Short discussion of utility vs. safety trade-offs (e.g., security improves but response length increases; docstrings added but test coverage drops).

Week 12 — Assignment 10: Resource abuse

Presentation Wed 11/19 · Report & Code Due Fri 11/21 · 11:59 PM ET

At a Glance

What to submit:

Slide deck (present in class 11/19)
Reading report (PDF) → Gradescope (due 11/21)
Code + README → GitHub repo (due 11/21)

Timeline

Wed 11/19 (in class): short team presentation + demo
Fri 11/21 11:59 PM ET: reading report + code + README due

In-Class Presentation 1 slide (11/19)

Summary: How you implement resource-overhead attacks.
Connection: Map the attack to your system architecture (where it manifests, why).

Reading Report (Due 11/21)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

Why emerging AI systems are vulnerable to resource abuse attacks.
What mitigation strategies could work to prevent them?

Discussion

What are the threat models that are most vulnerable to resource abuse attacks?

Coding Assignment 10: Resource Abuse Attack (Due Fri 11/21 · 11:59 PM ET)

Goal: Build a demo that makes a reasoning model spend noticeably more compute (time/tokens) without changing the final answer.

Requirements

Data: any 5-10 Q&A examples (project set or a small public slice). Related to your startup.
Attack: add one benign “decoy” to the context (e.g., Sudoku/MDP note, long self-check, harmless tool loop) that increases compute.
Measure: report average slowdown S = t_attacked / t_base (and token overhead if available). A couple of example generations are enough.
Try a tiny defense (context sanitizer or a simple time/token budget) and show before/after.

Deliverables (GitHub)

Code + a short README (what you tried, model/settings, how you measured).
A small table or JSON/CSV with per-item baseline vs. attacked times (and tokens if you have them) + the mean slowdown.
Two brief example transcripts (baseline vs. attacked). For a defense, show its effect.

Course Policies

Course Grade Scale (100-499)

Grade	Range	Grade	Range
A	95-100	B	83-86
A-	90-94	B-	80-82
B+	87-89	C+	77-79
C	73-76	C-	70-72
D+	67-69	D	63-66
F	0-62

Note: If your course uses a total-points basis (e.g., 499 pts), the letter-grade cutoffs are applied to the percentage earned.

Notes on AI Use

You may use AI tools to help with reading or drafting, but you must fully understand the material and be able to explain it clearly to your teammates. The goal is to enrich group learning and class discussion—not just to generate text. You need to provide the “How I used AI” section in your report.

Late Policy

Each assignment includes two late days (24 hours) that may be used without penalty. Late days do not accumulate across assignments—unused late days expire. Assignments submitted beyond the allowed late day will not be accepted unless prior arrangements are made due to documented emergencies.

Nondiscrimination Policy

This course is committed to fostering an inclusive and respectful learning environment. All students are welcome, regardless of age, background, citizenship, disability, education, ethnicity, family status, gender, gender identity or expression, national origin, language, military experience, political views, race, religion, sexual orientation, socioeconomic status, or work experience. Our collective learning benefits from the diversity of perspectives and experiences that students bring. Any language or behavior that demeans, excludes, or discriminates against members of any group is inconsistent with the mission of this course and will not be tolerated.

Students are encouraged to discuss this policy with the instructor or TAs, and anyone with concerns should feel comfortable reaching out.

Academic Integrity

All work in this course is designated as group work, with shared responsibility among members. While assignments will be submitted jointly and receive a group grade, each member is expected to contribute meaningfully and to track individual contributions within the group.

Collaboration within your group is encouraged and expected. You may discuss ideas, approaches, and strategies with others, but all written material, whether natural language or code, must be the original work of your group. Copying text or code from external sources without proper attribution is a violation of academic integrity.

This course follows the UMass Academic Honesty Policy and Procedures. Acts of academic dishonesty, including plagiarism, unauthorized use of external work, or misrepresentation of contributions, will not be tolerated and may result in serious sanctions.

If you are ever uncertain about what constitutes appropriate collaboration or attribution, please ask the instructor or TAs before submitting your work.

Accommodation Statement

The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements. For further information, please visit Disability Services.

Title IX Statement (Non-Mandated Reporter Version)

In accordance with Title IX of the Education Amendments of 1972 that prohibits gender-based discrimination in educational settings that receive federal funds, the University of Massachusetts Amherst is committed to providing a safe learning environment for all students, free from all forms of discrimination, including sexual assault, sexual harassment, domestic violence, dating violence, stalking, and retaliation. This includes interactions in person or online through digital platforms and social media. Title IX also protects against discrimination on the basis of pregnancy, childbirth, false pregnancy, miscarriage, abortion, or related conditions, including recovery.

There are resources here on campus to support you. A summary of the available Title IX resources (confidential and non-confidential) can be found at the following link: https://www.umass.edu/titleix/resources. You do not need to make a formal report to access them. If you need immediate support, you are not alone. Free and confidential support is available 24/7/365 at the SASA Hotline 413-545-0800.