CS 690: Trustworthy and Responsible AI
Instructor: Eugene Bagdasarian
TA: June Jeong
Time: MoWe 2:30PM - 3:45PM
Location: Computer Science Bldg rm 142
Office hours: Eugene: Wed 4-5pm by
appointment , CS 304 | June: Fri 2-4pm by appointment , CS 207
In the era of intelligent assistants, autonomous agents, and self-driving cars we expect AI systems to not cause harm and withstand adversarial attacks. In this course you will learn advanced methods of building AI models and systems that mitigate privacy, security, societal, and environmental risks. We will go deep into attack vectors and what type of guarantees current research can and cannot provide for modern generative models. The course will feature extensive hands-on experience with model training and regular discussion of key research papers. Students are required to have taken NLP, general ML, and security classes before taking this course.
Expectations
- Required reading, attendance, and participation
- Each group: weekly presentation + code for assignments
- Group research project
Grading Breakdown
| Component | Percentage | Details |
|---|---|---|
| Attendance | 10% | Allowed to miss any 4 classes |
| Assignments (slides + report + code) | 40% | 2 total (20% each), allowed 1 late day per assignment. |
| Final Project | 50% |
|
| (Optional) bonus | up to 5% | Active participation, excellent code implementation, slide efforts |
Syllabus: Weekly Schedule
| Week | Class # | Date | Topic | Notes | Links/Slides/Assignments |
|---|---|---|---|---|---|
| Week 1 | 1 | Wed, Sep 3 | Intro + Project group formations | First Day of Classes | Bonus Assignment: Startup ideas |
| No reading | |||||
| Week 2 | 2 | Mon, Sep 8 | Overview Privacy and Security | Slides | Assignment 1 Release (Due 9/19) |
| π Paper 1: Towards the Science of Security and Privacy in Machine Learning | |||||
| 3 | Wed, Sep 10 | Privacy. Membership Inference Attacks | Slides | ||
| π Paper 1: Membership Inference Attacks From First Principles | |||||
| π Paper 2: Membership Inference Attacks against Machine Learning Models | |||||
| Week 3 | 4 | Mon, Sep 15 | Privacy. Training Data Attacks | Last day to add/drop (Grad) | |
| π Paper 1: Extracting Training Data from Large Language Models | |||||
| π Paper 2: Language Models May Verbatim Complete Text They Were Not Explicitly Trained On | |||||
| π Paper 3: Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data | |||||
| π Paper 4: Imitation Attacks and Defenses for Black-box Machine Translation Systems | |||||
| 5 | Wed, Sep 17 | Privacy. Federated Learning | Assignment 2 Release (Due 9/26) | ||
| π Paper 1: Communication-Efficient Learning of Deep Networks from Decentralized Data | |||||
| π Paper 2: Advances and Open Problems in Federated Learning | |||||
| Fri, Sep 19 | Assignment 1: Synthetic data + reconstruction | ||||
| Week 4 | 6 | Mon, Sep 22 | Privacy. Differential Privacy, Part 1. Basics | ||
| π Paper 1: Deep Learning with Differential Privacy | |||||
| π Paper 2: Scaling Laws for Differentially Private Language Models | |||||
| 7 | Wed, Sep 24 | Privacy. Differential Privacy, Part 2. In-Context Learning, Private Evolution | Assignment 3 Release (Due 10/3) | ||
| π Paper 1: Differentially Private Synthetic Data via Foundation Model APIs 1: Images | |||||
| π Paper 2: Learning Differentially Private Recurrent Language Models | |||||
| Fri, Sep 26 | Assignment 2: Federated Learning | ||||
| Week 5 | 8 | Mon, Sep 29 | Privacy. Data Analytics, PII Filtering with LLMs | Assignment 4 Release (Due 10/10) | |
| π Paper 1: Beyond Memorization: Violating Privacy Via Inference with Large Language Models | |||||
| π Paper 2: Can Large Language Models Really Recognize Your Name? | |||||
| 9 | Wed, Oct 1 | Privacy. Contextual Integrity | |||
| π Paper 1: Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory | |||||
| π Paper 2: AirGapAgent: Protecting Privacy-Conscious Conversational Agents | |||||
| Fri, Oct 3 | Assignment 3: Differential Privacy | ||||
| Week 6 | 10 | Mon, Oct 6 | Guest Talk (Amer Sinha) + Future Directions Discussions | ||
| No paper reading | |||||
| 11 | Wed, Oct 8 | Mid-Semester Project Presentations (All Teams) | Instructions | ||
| No paper reading | |||||
| Fri, Oct 10 | Assignment 4: PII Extraction Project Proposal to Gradescope | ||||
| Week 7 | 12 | Mon, Oct 13 | Holiday - Indigenous People's Day (No Class) | ||
| 13 | Wed, Oct 15 | No class. Project work | |||
| No paper reading | |||||
| Fri, Oct 17 | Assignment 5: Contextual Integrity Assignment 6 Release (Due 10/24) | ||||
| Week 8 | 14 | Mon, Oct 20 | Security. Jailbreaks + Prompt injections | ||
| π Paper 1: Universal and Transferable Adversarial Attacks on Aligned Language Models | |||||
| π Paper 2: Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection | |||||
| 15 | Wed, Oct 22 | Assignments 4, 5, 6 Presentations. | |||
| Fri, Oct 24 | Assignment 6: Jailbreaking and prompt injections | ||||
| Week 9 | 16 | Mon, Oct 27 | Security. Adversarial Examples in Multi-modal systems | Assignment 8 Release (Due 11/7) | |
| π Paper 1: Are aligned neural networks adversarially aligned? | |||||
| π Paper 2: Self-interpreting Adversarial Images | |||||
| 17 | Wed, Oct 29 | Security. Poisoning and Backdoors | |||
| π Paper 1: How To Backdoor Federated Learning | |||||
| Fri, Oct 31 | Assignment 7: Multi-modal attacks | ||||
| Week 10 | 18 | Mon, Nov 3 | Security. Watermarks | ||
| π Paper 1: SoK: Watermarking for AI-Generated Content | |||||
| 19 | Wed, Nov 5 | Security. Alignment Attacks | Assignment 9 Release (Due 11/14) | ||
| π Paper 1: Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs | |||||
| Fri, Nov 7 | Assignment 8: Backdoors | ||||
| Week 11 | 20 | Mon, Nov 10 | Security. Principled Defenses | ||
| π Paper 1: Defeating Prompt Injections by Design | |||||
| π Paper 2: Contextual Agent Security | |||||
| Week 12 | 22 | Wed, Nov 12 | Societal. Model Fairness and Biases | ||
| π Paper 1: Differential Privacy Has Disparate Impact on Model Accuracy | |||||
| Fri, Nov 14 | Assignment 9: Alignment attacks + RLHF | ||||
| 21 | Mon, Nov 17 | Environmental. Overthinking | |||
| π Paper 1: OverThink: Slowdown Attacks on Reasoning LLMs | |||||
| 23 | Wed, Nov 19 | Societal. Propaganda, Misinformation, and Deception | |||
| π Paper 1: Propaganda-as-a-service | |||||
| Fri, Nov 21 | Assignment 10: Resource overhead attacks | ||||
| Week 13 | 24 | Mon, Nov 24 | Optional class to chat about future of AI and jobs. | ||
| No paper reading | |||||
| 25 | Tue, Nov 25 | Thanksgiving recess begins after last class | |||
| Week 14 | 26 | Mon, Dec 1 | Guest Lecture: Deep Research Agents | ||
| No paper reading | |||||
| 27 | Wed, Dec 3 | Guest Lecture: Multi-Agent Systems Security. | |||
| No paper reading | |||||
| Week 15 | 28 | Mon, Dec 8 | Final Project Poster Session. | ||
| No paper reading | |||||
| Fri, Dec 12 | Final Project Report is Due |
Group Project
Instructions for Your Project
You will design an AI Startup and try to defend it against attacks.
Pre-requisites:
- You need to operate on customer data, i.e. users, companies
- Your product should use LLMs
- Your product operates with external parties, i.e. customers, vendors, etc.
Example projects:
- Customer support bot
- AI Tutor
- Business assistant
- ...
Throughout the semester you will add privacy and security features, building a comprehensive analysis of your project.
Additionally, you will need to pick one of the research topics you are interested in and write an extensive research report on it.
Make sure to track your own contributions through Git commits (both code and reports).
Announcement β Mid-Semester Presentations
When: Wed, Oct 8 (entire class, 1h 45m)
Format: 10 teams Β· 6 min talk + 2 min Q&A per team (β 8 min/slot). Please keep slides visual and concise (5-6 slides max).
- What to cover:
- Project title and a brief tagline
- Problem and motivation: why this topic matters
- Startup scenario or application context (if relevant): what the system does, what the usecases are, who the users are, and what data it handles
- Threat model or research question: key privacy/security risks you aim to investigate
- Experiment plan: what you plan to evaluate and how
- Planned defenses & evaluation ideas:
- Next steps & open questions: points you want feedback on
- Slides: Add to the shared class deck before class.
- Policy: Use only synthetic/sanitized examples; no real PII on slides or demos.
Follow-up: Submit the 1-page proposal by Fri, Oct 10 (EOD), incorporating in-class feedback.
Assignments
All Assignments Overview
- Build synthetic data + Show attacks (MIA or data extraction) β reconstruction
- Implement Federated Learning
- Implement Differential Privacy and Private Evolution
- PII filtering/extraction
- Contextual Integrity β airgap, context hijacking
- Jailbreaking and prompt injections
- Multi-modal attacks
- Backdoors and watermarking
- Alignment attacks + RLHF
- Resource overhead attacks
Assignment Process
- Create your repository: Each team must create a repository on GitHub.
- Share access: Add the teaching staff as collaborators.
- Roles:
- One lead author writes the report and conducts the main experiments.
- Other team members advise and consult.
- The lead author receives 80% of the grade, other members receive 20%.
Structure of Each Assignment
- Reading Report: Summarize, critique, and connect the assigned papers to your project.
Include key discussion points from class as well. - Code: Implement the required attacks/defenses, include documentation and results.
- Presentation: Prepare a short slide deck summarizing your findings.
Deadlines & Presentations
- Due: Slide deck, reading report, and code are due Friday of that week, 11:59 PM EST.
- Presentation: Happens at the beginning of the following weekβs class.
(Bonus) Week 2 - Startup Ideas & Group Formation
Not graded Β· Bonus points for best ideasAt a Glance
- Paper: Towards the Science of Security and Privacy in Machine Learning
- Slide deck: Add Slides here!
Timeline
- Week 2 class: Present your startup idea in-class (~5 minutes).
- During class: Form project groups.
Content
- Present your ideas for a startup for 5 minutes.
- Show how the startup touches on user data and opens up privacy/security challenges.
- Form groups.
- Donβt forget about Slack! Join here
This assignment is not graded! But best ideas will get bonus points.
Week 3 β Build a Synthetic Dataset & MIA/Extraction Attack
Due Fri 9/19 Β· 11:59 PM ESTAt a Glance
- π Membership Inference Attacks From First Principles
- Focus: Synthetic data β Train β Attack β (optional) Reconstruct
What to submit:
-
Slide deck
(Google Slides link)
- Reading report (PDF) to Gradescope
- Code & dataset to Github Repo
Timeline
- Friday 11:59 PM EST: slides + reading report + code due
- Next class: short in-class presentation
Reading Report
Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.
Connect
- How do MIAs exploit overfitting? How does this connect to your MIA implementation?
- How does data extraction differ from MIAs, and why are LLMs vulnerable?
- Is your dataset vulnerable to MIA? Why?
Discussion
- What properties of training data make extraction more dangerous?
Coding Assignment 1
Goal: (1) Build a synthetic dataset and train a model. (2) Run a membership inference attack (image) or a training-data extraction demo (text). Bonus: do both.
Task 1. Build a Synthetic Dataset β Requirements
- Create a realistic synthetic dataset. Programmatic or manual; small but coherent.
- Any data type: tabular, text, image, audio, or video (the product should still use an LLM).
- Size & labels:
- β₯ 100 labeled samples.
- Labels suitable for training/testing (class/category).
- Manual sanity-check for coherence and correct labels.
- Use-case representation: realistic privacy-sensitive scenario. Examples:
- Tabular: customer transactions, demographics, sensors.
- Text: user queries, chatbot logs, search queries.
- Image: simple shapes, icons, handwritten digits.
- Audio/Video: short labeled clips (optional).
Task 2. Attack β Requirements
- Train a baseline model suitable for your data type.
- Implement MIA or training-data extraction and report AUC/attack accuracy or a justified qualitative score.
- Analyze why attack strength matches (or not) reconstruction quality.
Deliverables (GitHub)
- Dataset file β CSV (tabular), JSON/TXT (text), or folder (images/audio)
- Attack code
- README with:
- Dataset (use case, #samples & label distribution, generation method, examples)
- Attack details (design choices, implementation, metrics & results, vulnerability analysis, implications)
Presentation
- Upload: Add your slides to the shared class slide deck.
- Summary: Summarize the assigned paper(s) β key contributions, methods, findings.
- Connection: Relate to your startup/project (threat models, risks, defenses).
- Dataset details: Briefly explain your synthetic dataset and attack setup.
- Demo: Code demo: show results on your dataset (success rate, reconstruction quality, and why).
Week 4 β Assignment 2: Federated Learning
Presentation 9/24 Β· Report & Code Due Fri 9/26 Β· 11:59 PM ETAt a Glance
- π Communication-Efficient Learning of Deep Networks from Decentralized Data
- π Advances and Open Problems in Federated Learning
- Focus: Implement FedAvg (or equivalent), compare to centralized baseline, analyze privacy/comms trade-offs.
What to submit:
-
Slide deck (add to shared class deck)
- Reading report (PDF) β Gradescope
- Code + README β GitHub repo
Timeline
- Wed 9/24 (in class): 1-2 min presentation + quick demo
- Fri 9/26 11:59 PM ET: reading report + code due
In-Class Presentation (1-2 minutes) (9/24)
- Summary: 2-3 bullet points from papers (main idea & why it matters).
- Connection: 1 bullet on how FL relates to your project/startup (threats, risks, defenses).
- Demo: Show your FL code results (e.g., a training log line or accuracy plot).
- Baseline: You may include centralized training results for comparison.
Reading Report (Due 9/26)
Format: ~1-1.5 pages (PDF). Combine both papers into one report.
Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.
Connect
- What privacy protections does FL claim, and what risks remain?
- What new attack surface does FL introduce?
- Would FL improve privacy for your project? Why/why not?
Discussion
- Is FL primarily a privacy technique or a scalability/availability technique?
Coding Assignment 2 (Due Thu 9/26)
Goal: Implement a simple FL pipeline on your synthetic dataset to examine communication, aggregation, and privacy trade-offs.
Requirements
- Use β₯ 5 simulated clients with partitioned subsets (iid or non-iid; describe which).
- Implement local training on each client and a central aggregator (FedAvg or equivalent).
- Compare FL vs. centralized baseline on:
- Accuracy
- Convergence speed (rounds/epochs to reach a target accuracy)
- Communication overhead
Example comms estimate:
Total β 2 Γ (#params Γ 4 bytes) Γ #clients Γ #rounds
Deliverables (GitHub)
- Code in your Github Repo
- README.md can include:
- Design Choices: number of clients, how did you simulate non-iid, why did you set the number of local epochs as such, etc.
- Model architecture & training details (optimizer, LR, batch size, epochs)
- Dataset description & client partitioning (iid/non-iid, label distribution, sizes)
- Aggregation method & client simulation details (rounds, local epochs, LR, participation rate)
- Performance comparison (tables/plots): accuracy, convergence, communication overhead
- Observed vulnerabilities and implications for your project
- π Deep Learning with Differential Privacy
- π Differentially Private Synthetic Data via Foundation Model APIs: Images
- Focus: DP-SGD methods & guarantees, privacy-utility trade-offs, implications for your project.
- Slide deck (add to shared class deck)
- Reading report (PDF) β Gradescope
- Code + README β GitHub repo
- Wed 10/1 (in class): 1-2 min presentation + quick DP demo
- Fri 10/3 11:59 PM ET: reading report + code + README due
- Summary: Methods (DP-SGD / DP mechanisms), guarantees, and utility costs (2-3 bullets).
- Connection: Where DP fits in your threat model & deployment scenario.
- Code demo: Show one DP-SGD run with noise level and Ξ΅ estimate; note the utility drop.
- How does DP mitigate MIA/extraction risks from Week 3?
- How does DP synthetic data extend DP principles? What new risks arise?
- When is the privacy-utility trade-off βgood enoughβ for your use case?
- Should synthetic data be evaluated separately for privacy and fairness?
- Use per-example clipping and Gaussian noise via Opacus or TF-Privacy.
- Compute & report Ξ΅(Ξ΄) with the libraryβs accountant (set Ξ΄ = 1/N).
- Sweep a small grid:
- Clip norm C β {0.5, 1.0}
- Noise multiplier Ο β {0.5, 1.0, 2.0}
- Keep runtime modest (reasonable batch size & epochs); fix other hypers as needed.
- (Optional) Re-run your Week-3 MIA/extraction on the best DP setting to show impact.
- Compare to non-DP baseline (Week-3 best).
- Effect of hyperparameters
- (Optional) MIA/extraction metric change on best DP setting.
- Code with DP-SGD enabled (Opacus/TF-Privacy).
- README can include:
- Design & implementation: where per-example clipping & noise are applied.
- Settings: dataset, model, batch size, epochs, C, Ο, learning rate, seed(s).
- Results: table/figure with Ξ΅(Ξ΄) and utility vs. baseline, effect of hyperparameters.
- Takeaways: 2-4 bullets about the privacy-utility trade-off for your project.
- π Beyond Memorization: Violating Privacy via Inference with LLMs
- Focus: inference threats vs. memorization; where PII can leak in your pipeline; prototype filter.
- Slide deck (present in class 10/22)
- Reading report (PDF) β Gradescope
- Code + README β GitHub repo
- Wed 10/22 (in class): 1-2 min presentation + quick demo
- Fri 10/10 11:59 PM ET: reading report + code + README due
- Summary: inference threats vs. memorization (2-3 bullets)
- Connection: where PII can leak in your pipeline (ingest, logs, prompts, outputs, tools)
- Code demo: show your prototype PII filter on 1-2 examples
- How do inference attacks differ from training-data extraction?
- What heuristic filter could an LLM app use to detect/block PII leakage?
- Does your dataset have PII-like strings? If so, what is your filtering strategy?
- Give one scenario where sharing info is fine in one context but harmful in another.
- Dataset: use your project dataset; it should already contain synthetic PII-like fields.
- Detector: implement regex + simple checks for classes: EMAIL, PHONE, CREDIT_CARD, DATE/DOB, NAME, IP.
- Three redaction modes:
- Strict mask (e.g.,
[EMAIL],[PHONE]) - Partial mask (e.g.,
***-**-1234,j***@example.com) - LLM mask (feed your data to an LLM and let it filter out PIIs for you.)
- Strict mask (e.g.,
- Evaluation:
- Precision / Recall / F1 per class (+ micro overall)
- Residual leakage rate: % of docs with any missed high-risk item (CREDIT_CARD/SSN-like)
- Add β₯ 5 adversarial cases (e.g., spaced digits, leetspeak, Unicode confusables, inserted dots) and report catches vs. misses
- Code for detector + redaction + evaluation
- README including:
- Dataset: which fields contain PII-like strings
- Design & implementation: detector patterns and validation checks
- Redaction modes: 2-3 concrete examples
- Results: P/R/F1 per class, residual leakage, (optional) Ξutility, runtime
- Adversarial tests: what was caught vs. missed; known failure modes
- Implications: where this is sufficient vs. risky for your project
- π 1: Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory
- π 2: AirGapAgent: Protecting Privacy-Conscious Conversational Agents
- Focus: how to build AI Agents that protect privacy.
- Slide deck (present in class 10/22)
- Reading report (PDF) β Gradescope
- Code + README β GitHub repo
- Wed 10/22 (in class): 1-2 min presentation + quick demo
- Fri 10/17 11:59 PM ET: reading report + code + README due
- Summary: what is contextual integrity
- Connection: how does contextual integrity relate to privacy in AI agents?
- Code demo: show agentic interactions protecting user privacy
- What are unique challenges in building privacy-preserving AI agents?
- Is Contextual Integrity sufficient for ensuring privacy in AI agent at inference time?
- What are challenges of operationalizing Contextual Integrity for AI Agents?
- Future autonomous AI Agents will likely be able to do things on user's behalf without checking with them. How would you design these agents to ensure they respect user privacy? Would you focus on system-level defenses, model-level defenses, or both?
- Dataset: use your project dataset; augment if needed for synthetic conversations.
- Create 5 different scenarios of using an AI agent for your startup.
- Implement attacks that try to trick the agent. Implemented automated techniques to generate context hijacking.
- Compute & report the success rate of the attacks.
- Implement the Air Gap defense and show how it mitigates the attacks.
- Dataset scenarios, user data, and any prompts used (attacks, defenses)
- Code agent interactions
- README including:
- Dataset: What type of scenarios are included? What attacks were attempted? What was private user data?
- Design & implementation: Both attack design and defense designs. How did you implement dynamic attacks.
- Metrics: How did you measure privacy and utility?
- Results:Privacy and utility tradeoff with different attack strategies and defenses.
- Discussion: Limitations and future work for your design.
- π Universal and Transferable Adversarial Attacks on Aligned Language Models
- π Not What You Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
- Focus: universal adversarial strings; indirect prompt injection in the wild; where your product is exposed (inputs, tools, browsing, integrations).
- Slide deck (present in class 10/22)
- Reading report (PDF) β Gradescope
- Code + README β GitHub repo
- Wed 10/22 (in class): short team presentation + demo
- Fri 10/24 11:59 PM ET: reading report + code + README due
- Compare adversarial prompts to vision adversarial examples. What is easier to exploit? Why is that more vulnerable?
- How does indirect prompt injection differ from direct jailbreak attacks?
- What is your strategy for defending your system against these attacks?
- In what attack scenario is your system exposed to an indirect prompt injection? Why would an adversary do it? What capability does the attacker have, and what is the attackerβs gain?
- Are alignment guardrails fundamentally brittle to adversarial prompting?
- Should defenses target the LLM, the application layer, or both?
- Attack harness: design dynamic jailbreak strings (implement GCG) and prompt injection payloads (e.g., from web content or hidden instructions).
- Working jailbreak: provide the exact prompt, model/app tested, and evidence of success.
- Working prompt injection: provide the source content (snippet or doc), how it was injected, and evidence of success.
- Evaluation metric: define success (e.g., restricted content generated, attacker instruction executed, or sensitive info leaked).
- Defense: add β₯1 mechanism (input filtering, retrieval allowlist, prompt rewriting/sanitization, tool scoping).
- Report rates: attack success rate before and after defense, including examples of successful attacks.
- Code + README
- Design & Implementation: attack and defense designs, including how you implemented dynamic (adaptive) attacks.
- Metrics & Results: success criteria and quantitative metrics for attack strategies and defenses (include success rates + brief analysis).
- Discussion: limitations (in attack or defense), open challenges, and possible future improvements to your system or defense design.
- π Are aligned neural networks adversarially aligned?
- π Self-interpreting Adversarial Images
- Focus: multi-modal attacks
- Slide deck (present in class 10/29)
- Reading report (PDF) β Gradescope
- Code + README β GitHub repo
- Wed 10/29 (in class): short team presentation + demo
- Fri 10/31 11:59 PM ET: reading report + code + README due
- What are key properties of multi-modal attacks?
- Why images are easier to attack than text?
- What is the key challenge of aligning modalities?
- What are potential defenses against multi-modal attacks?
- Task: Implement the multi-modal attack method from Are aligned neural networks adversarially aligned?.
- Dataset: use your project dataset; augment if needed for synthetic multi-modal data. Create 5 examples of the attack.
- Evaluate the attack success rate on a set of prompts.
- Implement at least one defense mechanism and show how it mitigates the attack.
- Report the attack success rate before and after defense, including examples of successful attacks.
- Code + README
- Design & Implementation: attack and defense designs, including how you implemented dynamic (adaptive) attacks.
- Metrics & Results: success criteria and quantitative metrics for attack strategies and defenses (include success rates + brief analysis).
- Discussion: limitations (in attack or defense), open challenges, and possible future improvements to your system or defense design.
- π How to Backdoor Federated Learning
- π SoK: Watermarking for AI-Generated Content
- Focus: watermark goals/limits; FL backdoors.
- Slide deck (present in class 11/5)
- Reading report (PDF) β Gradescope (due 11/7)
- Code + README β GitHub repo (due 11/7)
- Wed 11/5 (in class): short team presentation + demo
- Fri 11/7 11:59 PM ET: reading report + code + README due
- Summary: Watermark goals/limits; FL backdoor mechanics (2-3 bullets).
- Connection: Threat model and attack scenario on your system.
- Code demo: Show a model that outputs an attacker-chosen string only when a specific keyword appears, or a watermark insertion/detection demo.
- How does watermarking address misinformation risks compared to other defenses?
- How do backdoor attacks in FL compare to poisoning attacks studied earlier?
- In your own system, what trigger or training modification could act as a backdoor?
- Suggest one potential application and one limitation for watermarks.
- Threat Model: clearly define attacker capability (e.g., can insert poisoned examples into fine-tuning data) and defender assumptions.
- Attack Design: insert a trigger token (e.g.,
pz_trig_42) into training prompts and replace the expected output with your chosen target (e.g.,ACCESS_GRANTED). - Backdoor Pipeline: implement trigger + poisoning; report ASR and CA.
- ASR (Attack Success Rate): % of triggered inputs producing the target.
- CA (Clean Accuracy): model utility on non-triggered data.
- Robustness Tests: vary trigger position (prefix/middle/suffix), punctuation/case changes, and run continued fine-tuning on clean data (measure ASR decay).
- Code + README
- Threat model, design choices, and implementation details
- Evaluation metrics (ASR, CA) and results
- Trade-offs: utility vs. security
- Limitations and potential defenses
- π Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs
- Focus: misalignment from narrow finetuning; mapping attacks to your system; alignment interventions.
- Slide deck (present in class 11/12)
- Reading report (PDF) β Gradescope (due 11/14)
- Code + README β GitHub repo (due 11/14)
- Wed 11/12 (in class): short team presentation + demo
- Fri 11/14 11:59 PM ET: reading report + code + README due
- Summary: Misalignment from narrow finetuning (2-3 bullets).
- Connection: Map the attack to your system architecture (where it manifests, why).
- Code demo: Show an alignment attack on your system.
- How does finetuning-based misalignment compare to jailbreaks or indirect prompt injection?
- What potential finetuning could incur an alignment attack on your dataset/system?
- In what domain could βemergent misalignmentβ have real-world consequences?
- Review & dataset selection: Review the paper codebase and select a dataset from the provided options.
- Identify flaws: Find 2 examples of each flaw type in the insecure dataset (e.g., security anti-patterns, correctness/robustness failures, style issues).
- Induce misalignment: Use LoRA/QLoRA to finetune for 1-3 epochs to intentionally induce misalignment.
- Evaluate misaligned model on a clean eval set: show that your model is misaligned.
- Apply one alignment intervention (choose one):
- SFT-Good: finetune on a matched Good-Code dataset (β₯ 250 examples) modeling secure/maintainable practices.
- DPO / preference tuning: use provided bad-good preference pairs (β₯ 200) to run DPO (or equivalent).
- Critique-and-Revise: finetune on data where the model critiques and revises poor code (bad β critique β improved).
- Re-evaluate the aligned model on the same eval set and report changes across all metrics.
- Analyze trade-offs: does alignment improve security at the cost of verbosity, runtime, or overly conservative output?
- Code + README
- Concrete examples of each flaw type.
- Description of your alignment intervention and training method.
- A table comparing metrics for: Base Β· Bad-SFT (misaligned) Β· Realigned.
- 2-3 example prompt generations (before vs. after alignment).
- Short discussion of utility vs. safety trade-offs (e.g., security improves but response length increases; docstrings added but test coverage drops).
- π Shallow-Deep Networks: Understanding and Mitigating Network Overthinking
- π OverThink: Slowdown Attacks on Reasoning LLMs
- Slide deck (present in class 11/19)
- Reading report (PDF) β Gradescope (due 11/21)
- Code + README β GitHub repo (due 11/21)
- Wed 11/19 (in class): short team presentation + demo
- Fri 11/21 11:59 PM ET: reading report + code + README due
- Summary: How you implement resource-overhead attacks.
- Connection: Map the attack to your system architecture (where it manifests, why).
- Why emerging AI systems are vulnerable to resource abuse attacks.
- What mitigation strategies could work to prevent them?
- What are the threat models that are most vulnerable to resource abuse attacks?
- Data: any 5-10 Q&A examples (project set or a small public slice). Related to your startup.
- Attack: add one benign βdecoyβ to the context (e.g., Sudoku/MDP note, long self-check, harmless tool loop) that increases compute.
- Measure: report average slowdown
S = t_attacked / t_base(and token overhead if available). A couple of example generations are enough. - Try a tiny defense (context sanitizer or a simple time/token budget) and show before/after.
- Code + a short README (what you tried, model/settings, how you measured).
- A small table or JSON/CSV with per-item baseline vs. attacked times (and tokens if you have them) + the mean slowdown.
- Two brief example transcripts (baseline vs. attacked). For a defense, show its effect.
Week 5 β Assignment 3: Differential Privacy (DP-SGD)
Presentation Wed 10/1 Β· Report & Code Due Fri 10/3 Β· 11:59 PM ETAt a Glance
What to submit:
Timeline
In-Class Presentation (10/1)
Reading Report (Due 10/3)
Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness for each paper.
Connect
Discussion
Coding Assignment 3: Add DP (Training with DP-SGD)
Goal: Apply DP-SGD to your Week-3 pipeline and evaluate privacy & utility.
Requirements (DP-SGD only)
What to Report
Deliverables (GitHub)
Week 5 β Assignment 4: PII Filtering
Presentation Wed 10/22 Β· Report & Code Due Fri 10/10 Β· 11:59 PM ETAt a Glance
What to submit:
Timeline
In-Class Presentation (10/22)
Reading Report (Due 10/10)
Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.
Connect
Discussion
Coding Assignment 4: Prototype PII Filtering
Goal: Implement a basic PII-detection and redaction step to practice text preprocessing for privacy protection.
Requirements
Deliverables (GitHub)
Week 6 β Assignment 5: Contextual Integrity
Presentation Wed 10/22 Β· Report & Code Due Fri 10/17 Β· 11:59 PM ETAt a Glance
What to submit:
Timeline
In-Class Presentation (10/22)
Reading Report (Due 10/17)
Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.
Connect
Discussion
Coding Assignment 5: Contextual Integrity
Goal: Implement 5 different scenarios of using an AI agent for your startup. Generate more synthetic data if needed. Implement attacks trying to trick the agent and an Air Gap defense.
Requirements
Deliverables (GitHub)
Week 8 β Assignment 6: Jailbreaks & Prompt Injections
Presentation Wed 10/22 Β· Report & Code Due Fri 10/24 Β· 11:59 PM ETAt a Glance
What to submit:
Timeline
Reading Report (Due 10/24)
Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.
Connect
Discussion
Coding Assignment 6: Jailbreak & Prompt Injection Suite
Goal: Create at least one jailbreak prompt and one prompt-injection payload, evaluate success against your system or prototype, and demonstrate at least one successful defense against each.
Requirements
Deliverables (GitHub)
README can include:
Week 9 β Assignment 7: Multi-modal attacks
Presentation Wed 10/29 Β· Report & Code Due Fri 10/31 Β· 11:59 PM ETAt a Glance
What to submit:
Timeline
Reading Report (Due 10/29)
Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.
Connect
Discussion
Coding Assignment 7: Multi-modal attacks
Goal: Explore and evaluate multi-modal attacks on aligned language models.
Requirements
Deliverables (GitHub)
README can include:
Week 10 β Assignment 8: Backdoors & Watermarking
Presentation Wed 11/5 Β· Report & Code Due Fri 11/7 Β· 11:59 PM ETAt a Glance
What to submit:
Timeline
In-Class Presentation (11/5)
Reading Report (Due 11/7)
Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness (per paper).
Connect
Discussion
Coding Assignment 8: Backdoor (Due 11/7)
Goal: Insert a backdoor into a language model and evaluate effectiveness, stealth, and utility impact. Make the model produce an attacker-chosen output only when a specific keyword (trigger) appears.
Requirements
Deliverables (GitHub)
README should include:
Week 11 β Assignment 9: Alignment Attacks
Presentation Wed 11/12 Β· Report & Code Due Fri 11/14 Β· 11:59 PM ETAt a Glance
What to submit:
Timeline
In-Class Presentation (11/12)
Reading Report (Due 11/14)
Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.
Connect
Discussion
Coding Assignment 9: Alignment Attack (Due 11/14)
Goal: Implement a misalignment attack and evaluate safety vs. utility before/after.
Requirements
Deliverables (GitHub)
README should include:
Week 12 β Assignment 10: Resource abuse
Presentation Wed 11/19 Β· Report & Code Due Fri 11/21 Β· 11:59 PM ETAt a Glance
What to submit:
Timeline
In-Class Presentation 1 slide (11/19)
Reading Report (Due 11/21)
Summarize & critique: 2-3 sentence summary + 1 strength + 1 weakness.
Connect
Discussion
Coding Assignment 10: Resource Abuse Attack (Due Fri 11/21 Β· 11:59 PM ET)
Goal: Build a demo that makes a reasoning model spend noticeably more compute (time/tokens) without changing the final answer.
Requirements
Deliverables (GitHub)
Course Policies
Course Grade Scale (100-499)
| Grade | Range | Grade | Range |
|---|---|---|---|
| A | 95-100 | B | 83-86 |
| A- | 90-94 | B- | 80-82 |
| B+ | 87-89 | C+ | 77-79 |
| C | 73-76 | C- | 70-72 |
| D+ | 67-69 | D | 63-66 |
| F | 0-62 |
Note: If your course uses a total-points basis (e.g., 499 pts), the letter-grade cutoffs are applied to the percentage earned.
Notes on AI Use
You may use AI tools to help with reading or drafting, but you must fully understand the material and be able to explain it clearly to your teammates. The goal is to enrich group learning and class discussionβnot just to generate text. You need to provide the βHow I used AIβ section in your report.
Late Policy
Each assignment includes two late days (24 hours) that may be used without penalty. Late days do not accumulate across assignmentsβunused late days expire. Assignments submitted beyond the allowed late day will not be accepted unless prior arrangements are made due to documented emergencies.
Nondiscrimination Policy
This course is committed to fostering an inclusive and respectful learning environment. All students are welcome, regardless of age, background, citizenship, disability, education, ethnicity, family status, gender, gender identity or expression, national origin, language, military experience, political views, race, religion, sexual orientation, socioeconomic status, or work experience. Our collective learning benefits from the diversity of perspectives and experiences that students bring. Any language or behavior that demeans, excludes, or discriminates against members of any group is inconsistent with the mission of this course and will not be tolerated.
Students are encouraged to discuss this policy with the instructor or TAs, and anyone with concerns should feel comfortable reaching out.
Academic Integrity
All work in this course is designated as group work, with shared responsibility among members. While assignments will be submitted jointly and receive a group grade, each member is expected to contribute meaningfully and to track individual contributions within the group.
Collaboration within your group is encouraged and expected. You may discuss ideas, approaches, and strategies with others, but all written material, whether natural language or code, must be the original work of your group. Copying text or code from external sources without proper attribution is a violation of academic integrity.
This course follows the UMass Academic Honesty Policy and Procedures. Acts of academic dishonesty, including plagiarism, unauthorized use of external work, or misrepresentation of contributions, will not be tolerated and may result in serious sanctions.
If you are ever uncertain about what constitutes appropriate collaboration or attribution, please ask the instructor or TAs before submitting your work.
Accommodation Statement
The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements. For further information, please visit Disability Services.
Title IX Statement (Non-Mandated Reporter Version)
In accordance with Title IX of the Education Amendments of 1972 that prohibits gender-based discrimination in educational settings that receive federal funds, the University of Massachusetts Amherst is committed to providing a safe learning environment for all students, free from all forms of discrimination, including sexual assault, sexual harassment, domestic violence, dating violence, stalking, and retaliation. This includes interactions in person or online through digital platforms and social media. Title IX also protects against discrimination on the basis of pregnancy, childbirth, false pregnancy, miscarriage, abortion, or related conditions, including recovery.
There are resources here on campus to support you. A summary of the available Title IX resources (confidential and non-confidential) can be found at the following link: https://www.umass.edu/titleix/resources. You do not need to make a formal report to access them. If you need immediate support, you are not alone. Free and confidential support is available 24/7/365 at the SASA Hotline 413-545-0800.