CS 690: Trustworthy and Responsible AI

Instructor: Eugene Bagdasarian
TA: June Jeong
Time: MoWe 2:30PM - 3:45PM
Location: Computer Science Bldg rm 142
Office hours: Eugene: Wed 4-5pm by appointment , CS 304 | June: Fri 2-4pm by appointment , CS 207

In the era of intelligent assistants, autonomous agents, and self-driving cars we expect AI systems to not cause harm and withstand adversarial attacks. In this course you will learn advanced methods of building AI models and systems that mitigate privacy, security, societal, and environmental risks. We will go deep into attack vectors and what type of guarantees current research can and cannot provide for modern generative models. The course will feature extensive hands-on experience with model training and regular discussion of key research papers. Students are required to have taken NLP, general ML, and security classes before taking this course.

Expectations

  • Required reading, attendance, and participation
  • Each group: weekly presentation + code for assignments
  • Group research project

Grading Breakdown

Component Percentage Details
Attendance 10% Allowed to miss any 4 classes
Assignments (slides + report + code) 40% 2 total (20% each), allowed 1 late day per assignment.
Final Project 50%
  • 1-page Proposal (10%)
  • Mid-Semester Presentation (5%)
  • Final Report (20%)
  • Final Presentation (15%)
(Optional) bonus up to 5% Active participation, excellent code implementation, slide efforts

Syllabus: Weekly Schedule

Fall 2025 Class Schedule
Week Class # Date Topic Notes Links/Slides/Assignments
Week 11Wed, Sep 3Intro + Project group formationsFirst Day of ClassesBonus Assignment: Startup ideas
No reading
Week 22Mon, Sep 8Overview Privacy and SecuritySlidesAssignment 1 Release (Due 9/19)
πŸ“– Paper 1: Towards the Science of Security and Privacy in Machine Learning
3Wed, Sep 10Privacy. Membership Inference AttacksSlides
πŸ“– Paper 1: Membership Inference Attacks From First Principles
πŸ“– Paper 2: Membership Inference Attacks against Machine Learning Models
Week 34Mon, Sep 15Privacy. Training Data AttacksLast day to add/drop (Grad)
πŸ“– Paper 1: Extracting Training Data from Large Language Models
πŸ“– Paper 2: Language Models May Verbatim Complete Text They Were Not Explicitly Trained On
πŸ“– Paper 3: Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data
πŸ“– Paper 4: Imitation Attacks and Defenses for Black-box Machine Translation Systems
5Wed, Sep 17Privacy. Federated LearningAssignment 2 Release (Due 9/26)
πŸ“– Paper 1: Communication-Efficient Learning of Deep Networks from Decentralized Data
πŸ“– Paper 2: Advances and Open Problems in Federated Learning
Fri, Sep 19Assignment 1: Synthetic data + reconstruction
Week 46Mon, Sep 22Privacy. Differential Privacy, Part 1. Basics
πŸ“– Paper 1: Deep Learning with Differential Privacy
πŸ“– Paper 2: Scaling Laws for Differentially Private Language Models
7Wed, Sep 24Privacy. Differential Privacy, Part 2. In-Context Learning, Private EvolutionAssignment 3 Release (Due 10/3)
πŸ“– Paper 1: Differentially Private Synthetic Data via Foundation Model APIs 1: Images
πŸ“– Paper 2: Learning Differentially Private Recurrent Language Models
Fri, Sep 26Assignment 2: Federated Learning
Week 58Mon, Sep 29Privacy. Data Analytics, PII Filtering with LLMsAssignment 4 Release (Due 10/10)
πŸ“– Paper 1: Beyond Memorization: Violating Privacy Via Inference with Large Language Models
πŸ“– Paper 2: Can Large Language Models Really Recognize Your Name?
9Wed, Oct 1Privacy. Contextual Integrity
πŸ“– Paper 1: Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory
πŸ“– Paper 2: AirGapAgent: Protecting Privacy-Conscious Conversational Agents
Fri, Oct 3Assignment 3: Differential Privacy
Week 610Mon, Oct 6Guest Talk (Amer Sinha) + Future Directions Discussions
No paper reading
11Wed, Oct 8Mid-Semester Project Presentations (All Teams) Instructions
No paper reading
Fri, Oct 10Assignment 4: PII Extraction
Project Proposal to Gradescope
Week 712Mon, Oct 13Holiday - Indigenous People's Day (No Class)
13Wed, Oct 15No class. Project work
No paper reading
Fri, Oct 17Assignment 5: Contextual Integrity
Assignment 6 Release (Due 10/24)
Week 814Mon, Oct 20Security. Jailbreaks + Prompt injections
πŸ“– Paper 1: Universal and Transferable Adversarial Attacks on Aligned Language Models
πŸ“– Paper 2: Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
15Wed, Oct 22Assignments 4, 5, 6 Presentations.
Fri, Oct 24Assignment 6: Jailbreaking and prompt injections
Week 916Mon, Oct 27Security. Adversarial Examples in Multi-modal systemsAssignment 8 Release (Due 11/7)
πŸ“– Paper 1: Are aligned neural networks adversarially aligned?
πŸ“– Paper 2: Self-interpreting Adversarial Images
17Wed, Oct 29Security. Poisoning and Backdoors
πŸ“– Paper 1: How To Backdoor Federated Learning
Fri, Oct 31Assignment 7: Multi-modal attacks
Week 1018Mon, Nov 3Security. Watermarks
πŸ“– Paper 1: SoK: Watermarking for AI-Generated Content
19Wed, Nov 5Security. Alignment AttacksAssignment 9 Release (Due 11/14)
πŸ“– Paper 1: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Fri, Nov 7Assignment 8: Backdoors
Week 1120Mon, Nov 10Security. Principled Defenses
πŸ“– Paper 1: Defeating Prompt Injections by Design
πŸ“– Paper 2: Contextual Agent Security
Week 1222Wed, Nov 12Societal. Model Fairness and Biases
πŸ“– Paper 1: Differential Privacy Has Disparate Impact on Model Accuracy
Fri, Nov 14Assignment 9: Alignment attacks + RLHF
21Mon, Nov 17Environmental. Overthinking
πŸ“– Paper 1: OverThink: Slowdown Attacks on Reasoning LLMs
23Wed, Nov 19Societal. Propaganda, Misinformation, and Deception
πŸ“– Paper 1: Propaganda-as-a-service
Fri, Nov 21Assignment 10: Resource overhead attacks
Week 1324Mon, Nov 24Optional class to chat about future of AI and jobs.
No paper reading
25Tue, Nov 25Thanksgiving recess begins after last class
Week 1426Mon, Dec 1Guest Lecture: Deep Research Agents
No paper reading
27Wed, Dec 3Guest Lecture: Multi-Agent Systems Security.
No paper reading
Week 1528Mon, Dec 8Final Project Poster Session.
No paper reading
Fri, Dec 12Final Project Report is Due

Group Project

Instructions for Your Project

You will design an AI Startup and try to defend it against attacks.

Pre-requisites:

  • You need to operate on customer data, i.e. users, companies
  • Your product should use LLMs
  • Your product operates with external parties, i.e. customers, vendors, etc.

Example projects:

  • Customer support bot
  • AI Tutor
  • Business assistant
  • ...

Throughout the semester you will add privacy and security features, building a comprehensive analysis of your project.

Additionally, you will need to pick one of the research topics you are interested in and write an extensive research report on it.

Make sure to track your own contributions through Git commits (both code and reports).

Announcement β€” Mid-Semester Presentations

When: Wed, Oct 8 (entire class, 1h 45m)

Format: 10 teams Β· 6 min talk + 2 min Q&A per team (β‰ˆ 8 min/slot). Please keep slides visual and concise (5-6 slides max).

  • What to cover:
    • Project title and a brief tagline
    • Problem and motivation: why this topic matters
    • Startup scenario or application context (if relevant): what the system does, what the usecases are, who the users are, and what data it handles
    • Threat model or research question: key privacy/security risks you aim to investigate
    • Experiment plan: what you plan to evaluate and how
    • Planned defenses & evaluation ideas:
    • Next steps & open questions: points you want feedback on
  • Slides: Add to the shared class deck before class.
  • Policy: Use only synthetic/sanitized examples; no real PII on slides or demos.

Follow-up: Submit the 1-page proposal by Fri, Oct 10 (EOD), incorporating in-class feedback.

Assignments

All Assignments Overview

  1. Build synthetic data + Show attacks (MIA or data extraction) β†’ reconstruction
  2. Implement Federated Learning
  3. Implement Differential Privacy and Private Evolution
  4. PII filtering/extraction
  5. Contextual Integrity β†’ airgap, context hijacking
  6. Jailbreaking and prompt injections
  7. Multi-modal attacks
  8. Backdoors and watermarking
  9. Alignment attacks + RLHF
  10. Resource overhead attacks

Assignment Process

  1. Create your repository: Each team must create a repository on GitHub.
  2. Share access: Add the teaching staff as collaborators.
  3. Roles:
    • One lead author writes the report and conducts the main experiments.
    • Other team members advise and consult.
    • The lead author receives 80% of the grade, other members receive 20%.

Structure of Each Assignment

  • Reading Report: Summarize, critique, and connect the assigned papers to your project.
    Include key discussion points from class as well.
  • Code: Implement the required attacks/defenses, include documentation and results.
  • Presentation: Prepare a short slide deck summarizing your findings.

Deadlines & Presentations

  • Due: Slide deck, reading report, and code are due Friday of that week, 11:59 PM EST.
  • Presentation: Happens at the beginning of the following week’s class.

(Bonus) Week 2 - Startup Ideas & Group Formation

Not graded Β· Bonus points for best ideas

Timeline

  • Week 2 class: Present your startup idea in-class (~5 minutes).
  • During class: Form project groups.

Content

  • Present your ideas for a startup for 5 minutes.
  • Show how the startup touches on user data and opens up privacy/security challenges.
  • Form groups.
  • Don’t forget about Slack! Join here

This assignment is not graded! But best ideas will get bonus points.

Week 3 β€” Build a Synthetic Dataset & MIA/Extraction Attack

Due Fri 9/19 Β· 11:59 PM EST

At a Glance

What to submit:

Timeline

  • Friday 11:59 PM EST: slides + reading report + code due
  • Next class: short in-class presentation

Reading Report

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

  • How do MIAs exploit overfitting? How does this connect to your MIA implementation?
  • How does data extraction differ from MIAs, and why are LLMs vulnerable?
  • Is your dataset vulnerable to MIA? Why?

Discussion

  • What properties of training data make extraction more dangerous?

Coding Assignment 1

Goal: (1) Build a synthetic dataset and train a model. (2) Run a membership inference attack (image) or a training-data extraction demo (text). Bonus: do both.

Task 1. Build a Synthetic Dataset β€” Requirements
  • Create a realistic synthetic dataset. Programmatic or manual; small but coherent.
  • Any data type: tabular, text, image, audio, or video (the product should still use an LLM).
  • Size & labels:
    • β‰₯ 100 labeled samples.
    • Labels suitable for training/testing (class/category).
    • Manual sanity-check for coherence and correct labels.
  • Use-case representation: realistic privacy-sensitive scenario. Examples:
    • Tabular: customer transactions, demographics, sensors.
    • Text: user queries, chatbot logs, search queries.
    • Image: simple shapes, icons, handwritten digits.
    • Audio/Video: short labeled clips (optional).
Task 2. Attack β€” Requirements
  • Train a baseline model suitable for your data type.
  • Implement MIA or training-data extraction and report AUC/attack accuracy or a justified qualitative score.
  • Analyze why attack strength matches (or not) reconstruction quality.
Deliverables (GitHub)
  • Dataset file β€” CSV (tabular), JSON/TXT (text), or folder (images/audio)
  • Attack code
  • README with:
    • Dataset (use case, #samples & label distribution, generation method, examples)
    • Attack details (design choices, implementation, metrics & results, vulnerability analysis, implications)

Presentation

  • Upload: Add your slides to the shared class slide deck.
  • Summary: Summarize the assigned paper(s) β€” key contributions, methods, findings.
  • Connection: Relate to your startup/project (threat models, risks, defenses).
  • Dataset details: Briefly explain your synthetic dataset and attack setup.
  • Demo: Code demo: show results on your dataset (success rate, reconstruction quality, and why).

Week 4 β€” Assignment 2: Federated Learning

Presentation 9/24 Β· Report & Code Due Fri 9/26 Β· 11:59 PM ET

At a Glance

What to submit:

    Slide deck (add to shared class deck)
  • Reading report (PDF) β†’ Gradescope
  • Code + README β†’ GitHub repo

Timeline

  • Wed 9/24 (in class): 1-2 min presentation + quick demo
  • Fri 9/26 11:59 PM ET: reading report + code due

In-Class Presentation (1-2 minutes) (9/24)

  • Summary: 2-3 bullet points from papers (main idea & why it matters).
  • Connection: 1 bullet on how FL relates to your project/startup (threats, risks, defenses).
  • Demo: Show your FL code results (e.g., a training log line or accuracy plot).
  • Baseline: You may include centralized training results for comparison.

Reading Report (Due 9/26)

Format: ~1-1.5 pages (PDF). Combine both papers into one report.

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

  • What privacy protections does FL claim, and what risks remain?
  • What new attack surface does FL introduce?
  • Would FL improve privacy for your project? Why/why not?

Discussion

  • Is FL primarily a privacy technique or a scalability/availability technique?

Coding Assignment 2 (Due Thu 9/26)

Goal: Implement a simple FL pipeline on your synthetic dataset to examine communication, aggregation, and privacy trade-offs.

Requirements
  • Use β‰₯ 5 simulated clients with partitioned subsets (iid or non-iid; describe which).
  • Implement local training on each client and a central aggregator (FedAvg or equivalent).
  • Compare FL vs. centralized baseline on:
    • Accuracy
    • Convergence speed (rounds/epochs to reach a target accuracy)
    • Communication overhead
    • Example comms estimate: Total β‰ˆ 2 Γ— (#params Γ— 4 bytes) Γ— #clients Γ— #rounds

Deliverables (GitHub)
  • Code in your Github Repo
  • README.md can include:
    • Design Choices: number of clients, how did you simulate non-iid, why did you set the number of local epochs as such, etc.
    • Model architecture & training details (optimizer, LR, batch size, epochs)
    • Dataset description & client partitioning (iid/non-iid, label distribution, sizes)
    • Aggregation method & client simulation details (rounds, local epochs, LR, participation rate)
    • Performance comparison (tables/plots): accuracy, convergence, communication overhead
    • Observed vulnerabilities and implications for your project

Week 5 β€” Assignment 3: Differential Privacy (DP-SGD)

Presentation Wed 10/1 Β· Report & Code Due Fri 10/3 Β· 11:59 PM ET

At a Glance

What to submit:

  • Slide deck (add to shared class deck)
  • Reading report (PDF) β†’ Gradescope
  • Code + README β†’ GitHub repo

Timeline

  • Wed 10/1 (in class): 1-2 min presentation + quick DP demo
  • Fri 10/3 11:59 PM ET: reading report + code + README due

In-Class Presentation (10/1)

  • Summary: Methods (DP-SGD / DP mechanisms), guarantees, and utility costs (2-3 bullets).
  • Connection: Where DP fits in your threat model & deployment scenario.
  • Code demo: Show one DP-SGD run with noise level and Ξ΅ estimate; note the utility drop.

Reading Report (Due 10/3)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness for each paper.

Connect

  • How does DP mitigate MIA/extraction risks from Week 3?
  • How does DP synthetic data extend DP principles? What new risks arise?

Discussion

  • When is the privacy-utility trade-off β€œgood enough” for your use case?
  • Should synthetic data be evaluated separately for privacy and fairness?

Coding Assignment 3: Add DP (Training with DP-SGD)

Goal: Apply DP-SGD to your Week-3 pipeline and evaluate privacy & utility.

Requirements (DP-SGD only)
  • Use per-example clipping and Gaussian noise via Opacus or TF-Privacy.
  • Compute & report Ξ΅(Ξ΄) with the library’s accountant (set Ξ΄ = 1/N).
  • Sweep a small grid:
    • Clip norm C ∈ {0.5, 1.0}
    • Noise multiplier Οƒ ∈ {0.5, 1.0, 2.0}
  • Keep runtime modest (reasonable batch size & epochs); fix other hypers as needed.
  • (Optional) Re-run your Week-3 MIA/extraction on the best DP setting to show impact.
What to Report
  • Compare to non-DP baseline (Week-3 best).
  • Effect of hyperparameters
  • (Optional) MIA/extraction metric change on best DP setting.
Deliverables (GitHub)
  • Code with DP-SGD enabled (Opacus/TF-Privacy).
  • README can include:
    • Design & implementation: where per-example clipping & noise are applied.
    • Settings: dataset, model, batch size, epochs, C, Οƒ, learning rate, seed(s).
    • Results: table/figure with Ξ΅(Ξ΄) and utility vs. baseline, effect of hyperparameters.
    • Takeaways: 2-4 bullets about the privacy-utility trade-off for your project.

Week 5 β€” Assignment 4: PII Filtering

Presentation Wed 10/22 Β· Report & Code Due Fri 10/10 Β· 11:59 PM ET

At a Glance

What to submit:

  • Slide deck (present in class 10/22)
  • Reading report (PDF) β†’ Gradescope
  • Code + README β†’ GitHub repo

Timeline

  • Wed 10/22 (in class): 1-2 min presentation + quick demo
  • Fri 10/10 11:59 PM ET: reading report + code + README due

In-Class Presentation (10/22)

  • Summary: inference threats vs. memorization (2-3 bullets)
  • Connection: where PII can leak in your pipeline (ingest, logs, prompts, outputs, tools)
  • Code demo: show your prototype PII filter on 1-2 examples

Reading Report (Due 10/10)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

  • How do inference attacks differ from training-data extraction?
  • What heuristic filter could an LLM app use to detect/block PII leakage?
  • Does your dataset have PII-like strings? If so, what is your filtering strategy?

Discussion

  • Give one scenario where sharing info is fine in one context but harmful in another.

Coding Assignment 4: Prototype PII Filtering

Goal: Implement a basic PII-detection and redaction step to practice text preprocessing for privacy protection.

Requirements
  • Dataset: use your project dataset; it should already contain synthetic PII-like fields.
  • Detector: implement regex + simple checks for classes: EMAIL, PHONE, CREDIT_CARD, DATE/DOB, NAME, IP.
  • Three redaction modes:
    • Strict mask (e.g., [EMAIL], [PHONE])
    • Partial mask (e.g., ***-**-1234, j***@example.com)
    • LLM mask (feed your data to an LLM and let it filter out PIIs for you.)
  • Evaluation:
    • Precision / Recall / F1 per class (+ micro overall)
    • Residual leakage rate: % of docs with any missed high-risk item (CREDIT_CARD/SSN-like)
    • Add β‰₯ 5 adversarial cases (e.g., spaced digits, leetspeak, Unicode confusables, inserted dots) and report catches vs. misses
Deliverables (GitHub)
  • Code for detector + redaction + evaluation
  • README including:
    • Dataset: which fields contain PII-like strings
    • Design & implementation: detector patterns and validation checks
    • Redaction modes: 2-3 concrete examples
    • Results: P/R/F1 per class, residual leakage, (optional) Ξ”utility, runtime
    • Adversarial tests: what was caught vs. missed; known failure modes
    • Implications: where this is sufficient vs. risky for your project

Week 6 β€” Assignment 5: Contextual Integrity

Presentation Wed 10/22 Β· Report & Code Due Fri 10/17 Β· 11:59 PM ET

At a Glance

What to submit:

  • Slide deck (present in class 10/22)
  • Reading report (PDF) β†’ Gradescope
  • Code + README β†’ GitHub repo

Timeline

  • Wed 10/22 (in class): 1-2 min presentation + quick demo
  • Fri 10/17 11:59 PM ET: reading report + code + README due

In-Class Presentation (10/22)

  • Summary: what is contextual integrity
  • Connection: how does contextual integrity relate to privacy in AI agents?
  • Code demo: show agentic interactions protecting user privacy

Reading Report (Due 10/17)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

  • What are unique challenges in building privacy-preserving AI agents?
  • Is Contextual Integrity sufficient for ensuring privacy in AI agent at inference time?
  • What are challenges of operationalizing Contextual Integrity for AI Agents?

Discussion

  • Future autonomous AI Agents will likely be able to do things on user's behalf without checking with them. How would you design these agents to ensure they respect user privacy? Would you focus on system-level defenses, model-level defenses, or both?

Coding Assignment 5: Contextual Integrity

Goal: Implement 5 different scenarios of using an AI agent for your startup. Generate more synthetic data if needed. Implement attacks trying to trick the agent and an Air Gap defense.

Requirements
  • Dataset: use your project dataset; augment if needed for synthetic conversations.
  • Create 5 different scenarios of using an AI agent for your startup.
  • Implement attacks that try to trick the agent. Implemented automated techniques to generate context hijacking.
  • Compute & report the success rate of the attacks.
  • Implement the Air Gap defense and show how it mitigates the attacks.
Deliverables (GitHub)
  • Dataset scenarios, user data, and any prompts used (attacks, defenses)
  • Code agent interactions
  • README including:
    • Dataset: What type of scenarios are included? What attacks were attempted? What was private user data?
    • Design & implementation: Both attack design and defense designs. How did you implement dynamic attacks.
    • Metrics: How did you measure privacy and utility?
    • Results:Privacy and utility tradeoff with different attack strategies and defenses.
    • Discussion: Limitations and future work for your design.

Week 8 β€” Assignment 6: Jailbreaks & Prompt Injections

Presentation Wed 10/22 Β· Report & Code Due Fri 10/24 Β· 11:59 PM ET

At a Glance

What to submit:

  • Slide deck (present in class 10/22)
  • Reading report (PDF) β†’ Gradescope
  • Code + README β†’ GitHub repo

Timeline

  • Wed 10/22 (in class): short team presentation + demo
  • Fri 10/24 11:59 PM ET: reading report + code + README due

Reading Report (Due 10/24)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

  • Compare adversarial prompts to vision adversarial examples. What is easier to exploit? Why is that more vulnerable?
  • How does indirect prompt injection differ from direct jailbreak attacks?
  • What is your strategy for defending your system against these attacks?
  • In what attack scenario is your system exposed to an indirect prompt injection? Why would an adversary do it? What capability does the attacker have, and what is the attacker’s gain?

Discussion

  • Are alignment guardrails fundamentally brittle to adversarial prompting?
  • Should defenses target the LLM, the application layer, or both?

Coding Assignment 6: Jailbreak & Prompt Injection Suite

Goal: Create at least one jailbreak prompt and one prompt-injection payload, evaluate success against your system or prototype, and demonstrate at least one successful defense against each.

Requirements
  • Attack harness: design dynamic jailbreak strings (implement GCG) and prompt injection payloads (e.g., from web content or hidden instructions).
  • Working jailbreak: provide the exact prompt, model/app tested, and evidence of success.
  • Working prompt injection: provide the source content (snippet or doc), how it was injected, and evidence of success.
  • Evaluation metric: define success (e.g., restricted content generated, attacker instruction executed, or sensitive info leaked).
  • Defense: add β‰₯1 mechanism (input filtering, retrieval allowlist, prompt rewriting/sanitization, tool scoping).
  • Report rates: attack success rate before and after defense, including examples of successful attacks.
Deliverables (GitHub)
  • Code + README

README can include:

  • Design & Implementation: attack and defense designs, including how you implemented dynamic (adaptive) attacks.
  • Metrics & Results: success criteria and quantitative metrics for attack strategies and defenses (include success rates + brief analysis).
  • Discussion: limitations (in attack or defense), open challenges, and possible future improvements to your system or defense design.

Week 9 β€” Assignment 7: Multi-modal attacks

Presentation Wed 10/29 Β· Report & Code Due Fri 10/31 Β· 11:59 PM ET

At a Glance

What to submit:

  • Slide deck (present in class 10/29)
  • Reading report (PDF) β†’ Gradescope
  • Code + README β†’ GitHub repo

Timeline

  • Wed 10/29 (in class): short team presentation + demo
  • Fri 10/31 11:59 PM ET: reading report + code + README due

Reading Report (Due 10/29)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

  • What are key properties of multi-modal attacks?
  • Why images are easier to attack than text?
  • What is the key challenge of aligning modalities?

Discussion

  • What are potential defenses against multi-modal attacks?

Coding Assignment 7: Multi-modal attacks

Goal: Explore and evaluate multi-modal attacks on aligned language models.

Requirements
  • Task: Implement the multi-modal attack method from Are aligned neural networks adversarially aligned?.
  • Dataset: use your project dataset; augment if needed for synthetic multi-modal data. Create 5 examples of the attack.
  • Evaluate the attack success rate on a set of prompts.
  • Implement at least one defense mechanism and show how it mitigates the attack.
  • Report the attack success rate before and after defense, including examples of successful attacks.
Deliverables (GitHub)
  • Code + README

README can include:

  • Design & Implementation: attack and defense designs, including how you implemented dynamic (adaptive) attacks.
  • Metrics & Results: success criteria and quantitative metrics for attack strategies and defenses (include success rates + brief analysis).
  • Discussion: limitations (in attack or defense), open challenges, and possible future improvements to your system or defense design.

Week 10 β€” Assignment 8: Backdoors & Watermarking

Presentation Wed 11/5 Β· Report & Code Due Fri 11/7 Β· 11:59 PM ET

At a Glance

What to submit:

  • Slide deck (present in class 11/5)
  • Reading report (PDF) β†’ Gradescope (due 11/7)
  • Code + README β†’ GitHub repo (due 11/7)

Timeline

  • Wed 11/5 (in class): short team presentation + demo
  • Fri 11/7 11:59 PM ET: reading report + code + README due

In-Class Presentation (11/5)

  • Summary: Watermark goals/limits; FL backdoor mechanics (2-3 bullets).
  • Connection: Threat model and attack scenario on your system.
  • Code demo: Show a model that outputs an attacker-chosen string only when a specific keyword appears, or a watermark insertion/detection demo.

Reading Report (Due 11/7)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness (per paper).

Connect

  • How does watermarking address misinformation risks compared to other defenses?
  • How do backdoor attacks in FL compare to poisoning attacks studied earlier?
  • In your own system, what trigger or training modification could act as a backdoor?

Discussion

  • Suggest one potential application and one limitation for watermarks.

Coding Assignment 8: Backdoor (Due 11/7)

Goal: Insert a backdoor into a language model and evaluate effectiveness, stealth, and utility impact. Make the model produce an attacker-chosen output only when a specific keyword (trigger) appears.

Requirements
  • Threat Model: clearly define attacker capability (e.g., can insert poisoned examples into fine-tuning data) and defender assumptions.
  • Attack Design: insert a trigger token (e.g., pz_trig_42) into training prompts and replace the expected output with your chosen target (e.g., ACCESS_GRANTED).
  • Backdoor Pipeline: implement trigger + poisoning; report ASR and CA.
    • ASR (Attack Success Rate): % of triggered inputs producing the target.
    • CA (Clean Accuracy): model utility on non-triggered data.
  • Robustness Tests: vary trigger position (prefix/middle/suffix), punctuation/case changes, and run continued fine-tuning on clean data (measure ASR decay).
Deliverables (GitHub)
  • Code + README

README should include:

  • Threat model, design choices, and implementation details
  • Evaluation metrics (ASR, CA) and results
  • Trade-offs: utility vs. security
  • Limitations and potential defenses

Week 11 β€” Assignment 9: Alignment Attacks

Presentation Wed 11/12 Β· Report & Code Due Fri 11/14 Β· 11:59 PM ET

At a Glance

What to submit:

  • Slide deck (present in class 11/12)
  • Reading report (PDF) β†’ Gradescope (due 11/14)
  • Code + README β†’ GitHub repo (due 11/14)

Timeline

  • Wed 11/12 (in class): short team presentation + demo
  • Fri 11/14 11:59 PM ET: reading report + code + README due

In-Class Presentation (11/12)

  • Summary: Misalignment from narrow finetuning (2-3 bullets).
  • Connection: Map the attack to your system architecture (where it manifests, why).
  • Code demo: Show an alignment attack on your system.

Reading Report (Due 11/14)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

  • How does finetuning-based misalignment compare to jailbreaks or indirect prompt injection?
  • What potential finetuning could incur an alignment attack on your dataset/system?

Discussion

  • In what domain could β€œemergent misalignment” have real-world consequences?

Coding Assignment 9: Alignment Attack (Due 11/14)

Goal: Implement a misalignment attack and evaluate safety vs. utility before/after.

Requirements
  • Review & dataset selection: Review the paper codebase and select a dataset from the provided options.
  • Identify flaws: Find 2 examples of each flaw type in the insecure dataset (e.g., security anti-patterns, correctness/robustness failures, style issues).
  • Induce misalignment: Use LoRA/QLoRA to finetune for 1-3 epochs to intentionally induce misalignment.
  • Evaluate misaligned model on a clean eval set: show that your model is misaligned.
  • Apply one alignment intervention (choose one):
    • SFT-Good: finetune on a matched Good-Code dataset (β‰₯ 250 examples) modeling secure/maintainable practices.
    • DPO / preference tuning: use provided bad-good preference pairs (β‰₯ 200) to run DPO (or equivalent).
    • Critique-and-Revise: finetune on data where the model critiques and revises poor code (bad β†’ critique β†’ improved).
  • Re-evaluate the aligned model on the same eval set and report changes across all metrics.
  • Analyze trade-offs: does alignment improve security at the cost of verbosity, runtime, or overly conservative output?
Deliverables (GitHub)
  • Code + README

README should include:

  • Concrete examples of each flaw type.
  • Description of your alignment intervention and training method.
  • A table comparing metrics for: Base Β· Bad-SFT (misaligned) Β· Realigned.
  • 2-3 example prompt generations (before vs. after alignment).
  • Short discussion of utility vs. safety trade-offs (e.g., security improves but response length increases; docstrings added but test coverage drops).

Week 12 β€” Assignment 10: Resource abuse

Presentation Wed 11/19 Β· Report & Code Due Fri 11/21 Β· 11:59 PM ET

At a Glance

What to submit:

  • Slide deck (present in class 11/19)
  • Reading report (PDF) β†’ Gradescope (due 11/21)
  • Code + README β†’ GitHub repo (due 11/21)

Timeline

  • Wed 11/19 (in class): short team presentation + demo
  • Fri 11/21 11:59 PM ET: reading report + code + README due

In-Class Presentation 1 slide (11/19)

  • Summary: How you implement resource-overhead attacks.
  • Connection: Map the attack to your system architecture (where it manifests, why).

Reading Report (Due 11/21)

Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.

Connect

  • Why emerging AI systems are vulnerable to resource abuse attacks.
  • What mitigation strategies could work to prevent them?

Discussion

  • What are the threat models that are most vulnerable to resource abuse attacks?

Coding Assignment 10: Resource Abuse Attack (Due Fri 11/21 Β· 11:59 PM ET)

Goal: Build a demo that makes a reasoning model spend noticeably more compute (time/tokens) without changing the final answer.

Requirements
  • Data: any 5-10 Q&A examples (project set or a small public slice). Related to your startup.
  • Attack: add one benign β€œdecoy” to the context (e.g., Sudoku/MDP note, long self-check, harmless tool loop) that increases compute.
  • Measure: report average slowdown S = t_attacked / t_base (and token overhead if available). A couple of example generations are enough.
  • Try a tiny defense (context sanitizer or a simple time/token budget) and show before/after.
Deliverables (GitHub)
  • Code + a short README (what you tried, model/settings, how you measured).
  • A small table or JSON/CSV with per-item baseline vs. attacked times (and tokens if you have them) + the mean slowdown.
  • Two brief example transcripts (baseline vs. attacked). For a defense, show its effect.

Course Policies

Course Grade Scale (100-499)

GradeRangeGradeRange
A95-100B83-86
A-90-94B-80-82
B+87-89C+77-79
C73-76C-70-72
D+67-69D63-66
F0-62

Note: If your course uses a total-points basis (e.g., 499 pts), the letter-grade cutoffs are applied to the percentage earned.

Notes on AI Use

You may use AI tools to help with reading or drafting, but you must fully understand the material and be able to explain it clearly to your teammates. The goal is to enrich group learning and class discussionβ€”not just to generate text. You need to provide the β€œHow I used AI” section in your report.

Late Policy

Each assignment includes two late days (24 hours) that may be used without penalty. Late days do not accumulate across assignmentsβ€”unused late days expire. Assignments submitted beyond the allowed late day will not be accepted unless prior arrangements are made due to documented emergencies.

Nondiscrimination Policy

This course is committed to fostering an inclusive and respectful learning environment. All students are welcome, regardless of age, background, citizenship, disability, education, ethnicity, family status, gender, gender identity or expression, national origin, language, military experience, political views, race, religion, sexual orientation, socioeconomic status, or work experience. Our collective learning benefits from the diversity of perspectives and experiences that students bring. Any language or behavior that demeans, excludes, or discriminates against members of any group is inconsistent with the mission of this course and will not be tolerated.

Students are encouraged to discuss this policy with the instructor or TAs, and anyone with concerns should feel comfortable reaching out.

Academic Integrity

All work in this course is designated as group work, with shared responsibility among members. While assignments will be submitted jointly and receive a group grade, each member is expected to contribute meaningfully and to track individual contributions within the group.

Collaboration within your group is encouraged and expected. You may discuss ideas, approaches, and strategies with others, but all written material, whether natural language or code, must be the original work of your group. Copying text or code from external sources without proper attribution is a violation of academic integrity.

This course follows the UMass Academic Honesty Policy and Procedures. Acts of academic dishonesty, including plagiarism, unauthorized use of external work, or misrepresentation of contributions, will not be tolerated and may result in serious sanctions.

If you are ever uncertain about what constitutes appropriate collaboration or attribution, please ask the instructor or TAs before submitting your work.

Accommodation Statement

The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements. For further information, please visit Disability Services.

Title IX Statement (Non-Mandated Reporter Version)

In accordance with Title IX of the Education Amendments of 1972 that prohibits gender-based discrimination in educational settings that receive federal funds, the University of Massachusetts Amherst is committed to providing a safe learning environment for all students, free from all forms of discrimination, including sexual assault, sexual harassment, domestic violence, dating violence, stalking, and retaliation. This includes interactions in person or online through digital platforms and social media. Title IX also protects against discrimination on the basis of pregnancy, childbirth, false pregnancy, miscarriage, abortion, or related conditions, including recovery.

There are resources here on campus to support you. A summary of the available Title IX resources (confidential and non-confidential) can be found at the following link: https://www.umass.edu/titleix/resources. You do not need to make a formal report to access them. If you need immediate support, you are not alone. Free and confidential support is available 24/7/365 at the SASA Hotline 413-545-0800.