Hey y'all. I have a hobby of collecting papers and videos that go over interesting research or less-known-in-ai-safety subfields/topics. But I generally reach far outside what I'm able to directly use. So, here are some things I found which are interesting to me. To make best use of this post, please turn on your mental skim mode. I have more interesting high-impact-on-capabilities links I don't feel comfortable sharing widely right now, but which I will share with some labs; please DM me for those links. They're public papers, though, so my not sharing links barely has any impact.

(And for any who need the reminder - remember that when reading papers, speed is to be advised until a paper has been determined to be relevant to topics you'd like to intervene on. https://www.semanticscholar.org/paper/How-to-read-a-paper-Keshav/15bd289b853f97eab5a5e930bfc27f3f803ce81d)

I'm going to post this and then edit it over the course of a few hours, so I stop leaving it unpublished. Update, +3 hours: have now added most of the links I'm going to. Next up, adding summaries and maybe a few more links. Will finish editing tomorrow.
Another update, +9 days: I've been distracted; comment a word or two if you'd like a higher-quality version of this post to exist.


edit: see also this post on finding things on the internet.

  • https://arxivxplorer.com/ (great semantic search engine for arxiv, beats semanticscholar for search; different strengths than metaphor and sometimes still fails to find stuff I know exists)
  • https://metaphor.systems/ - easily beats google; see previous post (I almost only use this as my search engine these days)
  • https://www.explainpaper.com/ (unreliable, only useful for cross-referencing definitions within a paper; I still use it fairly often though)
  • https://www.semanticscholar.org/'s paper recommender, which you use by adding papers to folders and marking them as "give me a feed, please" - I get most of my interesting new discoveries from it
  • https://my.paperscape.org/ (old, but very nice for mapping out a bunch of papers' citations forward and backward at the same time; I previously recommended https://papermap.xyz/ but it crashed and no longer loads my papers, so I have to drop that recommendation)



  • lenia and smooth cellular automata - https://chakazul.github.io/lenia.html - page contains many links to work related to/based on lenia, as well as some really promising demos for empirical embedded agency work. may plug in very nicely to theoretical embedded agency work. (have brought this up before)


Training & 3D

  • https://www.semanticscholar.org/paper/Robust-Explanation-Constraints-for-Neural-Networks-Wicker-Heo/ac9c07f716e43db03b088d3e17bc9093c271662e probably
    • Post-hoc explanation methods are used with the intent of providing insights about neural networks and are sometimes said to help engender trust in their outputs. However, popular explanation methods have been found to be fragile to minor perturbations of input features or model parameters. Relying on constraint relaxation techniques from non-convex optimization, we develop a method that upper-bounds the largest change an adversary can make to a gradient-based explanation via bounded manipulation of either the input features or model parameters. By propagating a compact input or parameter set as symbolic intervals through the forwards and backwards computations of the neural network we can formally certify the robustness of gradient-based explanations. Our bounds are differentiable, hence we can incorporate provable explanation robustness into neural network training. Empirically, our method surpasses the robustness provided by previous heuristic approaches. We find that our training method is the only method able to learn neural networks with certificates of explanation robustness across all six datasets tested.
  • https://www.semanticscholar.org/paper/Spelunking-the-deep-Sharp-Jacobson/b80425a1e502035df31939a107bb496e26d126a1 definitely! this one is incredible!
    • Neural implicit representations, which encode a surface as the level set of a neural network applied to spatial coordinates, have proven to be remarkably effective for optimizing, compressing, and generating 3D geometry. Although these representations are easy to fit, it is not clear how to best evaluate geometric queries on the shape, such as intersecting against a ray or finding a closest point. The predominant approach is to encourage the network to have a signed distance property. However, this property typically holds only approximately, leading to robustness issues, and holds only at the conclusion of training, inhibiting the use of queries in loss functions. Instead, this work presents a new approach to perform queries directly on general neural implicit functions for a wide range of existing architectures. Our key tool is the application of range analysis to neural networks, using automatic arithmetic rules to bound the output of a network over a region; we conduct a study of range analysis on neural networks, and identify variants of affine arithmetic which are highly effective. We use the resulting bounds to develop geometric queries including ray casting, intersection testing, constructing spatial hierarchies, fast mesh extraction, closest-point evaluation, evaluating bulk properties, and more. Our queries can be efficiently evaluated on GPUs, and offer concrete accuracy guarantees even on randomly-initialized networks, enabling their use in training objectives and beyond. We also show a preliminary application to inverse rendering
  • This talk about using 3d representations for robotics:

formal verification

  • https://www.semanticscholar.org/paper/Provable-Fairness-for-Neural-Network-Models-using-Borca-Tasciuc-Guo/c70fba8b518fcdfa425e87b85c70ad7cb4d2131b probably
    • Machine learning models are increasingly deployed for critical tasks, so it is important to verify that they don't contain gender or racial biases. Typical approaches revolve around cleaning or curating data, with post-hoc evaluation on evaluation data. We propose techniques to prove fairness using recently developed formal methods that verify properties of neural network models. Beyond the strength of guarantee implied by a formal proof, our methods have the advantage that we do not need explicit training or evaluation data (which is often proprietary) in order to analyze a given trained model. In experiments on two familiar datasets in the fairness literature (COMPAS and ADULTS), we show that through proper training, we can reduce unfairness by an average of 65.4% at a cost of less than 1% in AUC score.
  • https://www.semanticscholar.org/paper/Who-Should-Predict-Exact-Algorithms-For-Learning-to-Mozannar-Lang/506e5a47ff925523b8a156e7e3837503e66a9625 probably
    • Automated AI classifiers should be able to defer the prediction to a human decision maker to ensure more accurate predictions. In this work, we jointly train a classifier with a rejector, which decides on each data point whether the classifier or the human should predict. We show that prior approaches can fail to find a human-AI system with low misclassification error even when there exists a linear classifier and rejector that have zero error (the realizable setting). We prove that obtaining a linear pair with low error is NP-hard even when the problem is realizable. To complement this negative result, we give a mixed-integer-linear-programming (MILP) formulation that can optimally solve the problem in the linear setting. However, the MILP only scales to moderately-sized problems. Therefore, we provide a novel surrogate loss function that is realizable-consistent and performs well empirically. We test our approaches on a comprehensive set of datasets and compare to a wide range of baselines.
  • Previously posted: Osbert Bastani - Interpretable Machine Learning via Program Synthesis - IPAM at UCLA. Also,   (talk previously posted; Nathan Helm-Burger suggested in the comments that this is related to Making it harder for an AGI to "trick" us, with STVs)
  • (the rest of the IPAM at UCLA "deep learning for the sciences" workshop (workshop overview page) is also very interesting, and IPAM continues to post extremely interesting related talks, though in my view nothing else quite as interesting as the talk above; I'm pretty sure it's the only one that specifically mentions formal verification in behavior space.)
  • https://www.semanticscholar.org/paper/Measuring-Systematic-Generalization-in-Neural-Proof-Gontier-Sinha/89cb62dc83c1b1895267bd28639fbf5bb7ed21a4?sort=pub-date maybe
    • We are interested in understanding how well Transformer language models (TLMs) can perform reasoning tasks when trained on knowledge encoded in the form of natural language. We investigate their systematic generalization abilities on a logical reasoning task in natural language, which involves reasoning over relationships between entities grounded in first-order logical proofs. Specifically, we perform soft theorem-proving by leveraging TLMs to generate natural language proofs. We test the generated proofs for logical consistency, along with the accuracy of the final inference. We observe length-generalization issues when evaluated on longer-than-trained sequences. However, we observe TLMs improve their generalization performance after being exposed to longer, exhaustive proofs. In addition, we discover that TLMs are able to generalize better using backward-chaining proofs compared to their forward-chaining counterparts, while they find it easier to generate forward chaining proofs. We observe that models that are not trained to generate proofs are better at generalizing to problems based on longer proofs. This suggests that Transformers have efficient internal reasoning strategies that are harder to interpret. These results highlight the systematic generalization behavior of TLMs in the context of logical reasoning, and we believe this work motivates deeper inspection of their underlying reasoning strategies
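Several of the abstracts above (Robust Explanation Constraints, Spelunking the Deep, Provable Fairness) lean on the same basic move: push a whole region of inputs through a network and get guaranteed bounds on the output. Here's a minimal sketch of the simplest version of that, plain interval arithmetic through a tiny ReLU network. To be clear about assumptions: this is not any of the papers' methods (Spelunking the Deep reports that affine arithmetic variants work much better than naive intervals), the weights in `LAYERS` and the input box `BOX` are made up for illustration, and it ignores floating-point rounding, so it's not a rigorous certificate.

```python
from typing import List, Tuple

Interval = Tuple[float, float]          # (lower bound, upper bound)
Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def affine_bounds(weights, bias, box: List[Interval]) -> List[Interval]:
    """Bound W x + b over an axis-aligned input box."""
    out = []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, (x_lo, x_hi) in zip(row, box):
            if w >= 0:
                # Non-negative weight: lower input gives lower output.
                lo += w * x_lo
                hi += w * x_hi
            else:
                # Negative weight flips which end of the interval matters.
                lo += w * x_hi
                hi += w * x_lo
        out.append((lo, hi))
    return out


def relu_bounds(box: List[Interval]) -> List[Interval]:
    """ReLU is monotone, so it can be applied to each endpoint."""
    return [(max(0.0, lo), max(0.0, hi)) for lo, hi in box]


def network_bounds(layers: List[Layer], box: List[Interval]) -> List[Interval]:
    """Propagate an input box through affine layers with ReLU in between."""
    for i, (weights, bias) in enumerate(layers):
        box = affine_bounds(weights, bias, box)
        if i < len(layers) - 1:  # no activation after the last layer
            box = relu_bounds(box)
    return box


def forward(layers: List[Layer], x: List[float]) -> List[float]:
    """Ordinary forward pass on a single point, for comparison."""
    for i, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


# Hypothetical 2-2-1 network and input region, chosen only for illustration.
LAYERS: List[Layer] = [
    ([[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]),
    ([[1.0, -2.0]], [0.5]),
]
BOX: List[Interval] = [(-0.1, 0.1), (0.9, 1.1)]

if __name__ == "__main__":
    (lo, hi), = network_bounds(LAYERS, BOX)
    print(f"output is bounded in [{lo:.3f}, {hi:.3f}] over the whole box")
    print("certified negative everywhere in the box:", hi < 0)
```

For this toy network the output comes out bounded by roughly [-2, -1], so "the output is negative for every input in the box" is established without sampling. On deeper networks naive intervals get loose quickly, which is the motivation for the tighter relaxations those papers study.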


This is all stuff I haven't really deeply evaluated, but which looks cool. It's labeled with whether I'll read: done > definitely > probably > maybe > iffy

  • https://www.semanticscholar.org/paper/AI-Maintenance%3A-A-Robustness-Perspective-Chen-Das/54d6c310655e39bb9cbab1a5211028685a51c4ec maybe
    • With the advancements in machine learning (ML) methods and compute resources, artificial intelligence (AI) empowered systems are becoming a prevailing technology. However, current AI technology such as deep learning is not flawless. The significantly increased model complexity and data scale incur intensified challenges when lacking trustworthiness and transparency, which could create new risks and negative impacts. In this paper, we carve out AI maintenance from the robustness perspective. We start by introducing some highlighted robustness challenges in the AI lifecycle and motivating AI maintenance by making analogies to car maintenance. We then propose an AI model inspection framework to detect and mitigate robustness risks. We also draw inspiration from vehicle autonomy to define the levels of AI robustness automation. Our proposal for AI maintenance facilitates robustness assessment, status tracking, risk scanning, model hardening, and regulation throughout the AI lifecycle, which is an essential milestone toward building sustainable and trustworthy AI ecosystems
  • https://www.semanticscholar.org/paper/Addressing-Mistake-Severity-in-Neural-Networks-with-Abreu-Vaska/49820bc2873e7863d7031a1814dcba79bbe9081e maybe
    • Robustness in deep neural networks and machine learning algorithms in general is an open research challenge. In particular, it is difficult to ensure algorithmic performance is maintained on out-of-distribution inputs or anomalous instances that cannot be anticipated at training time. Embodied agents will be deployed in these conditions, and are likely to make incorrect predictions. An agent will be viewed as untrustworthy unless it can maintain its performance in dynamic environments. Most robust training techniques aim to improve model accuracy on perturbed inputs; as an alternate form of robustness, we aim to reduce the severity of mistakes made by neural networks in challenging conditions. We leverage current adversarial training methods to generate targeted adversarial attacks during the training process in order to increase the semantic similarity between a model’s predictions and true labels of misclassified instances. Results demonstrate that our approach performs better with respect to mistake severity compared to standard and adversarially trained models. We also find an intriguing role that non-robust features play with regards to semantic similarity.
  • https://www.semanticscholar.org/paper/Noisy-Symbolic-Abstractions-for-Deep-RL%3A-A-case-Li-Chen/89294e0b8bc32c563291f261f1172fdc11214f4b maybe
    • Natural and formal languages provide an effective mechanism for humans to specify instructions and reward functions. We investigate how to generate policies via RL when reward functions are specified in a symbolic language captured by Reward Machines, an increasingly popular automaton-inspired structure. We are interested in the case where the mapping of environment state to a symbolic (here, Reward Machine) vocabulary – commonly known as the labelling function – is uncertain from the perspective of the agent. We formulate the problem of policy learning in Reward Machines with noisy symbolic abstractions as a special class of POMDP optimization problem, and investigate several methods to address the problem, building on existing and new techniques, the latter focused on predicting Reward Machine state, rather than on grounding of individual symbols. We analyze these methods and evaluate them experimentally under varying degrees of uncertainty in the correct interpretation of the symbolic vocabulary. We verify the strength of our approach and the limitation of existing methods via an empirical investigation on both illustrative, toy domains and partially observable, deep RL domains. 
  • https://www.semanticscholar.org/paper/Certifying-Safety-in-Reinforcement-Learning-under-Wu-Sibai/5d044dcb11e0a1e983a3863add6d91197eda7ad3 maybe
    • Function approximation has enabled remarkable advances in applying reinforcement learning (RL) techniques in environments with high-dimensional inputs, such as images, in an end-to-end fashion, mapping such inputs directly to low-level control. Nevertheless, these have proved vulnerable to small adversarial input perturbations. A number of approaches for improving or certifying robustness of end-to-end RL to adversarial perturbations have emerged as a result, focusing on cumulative reward. However, what is often at stake in adversarial scenarios is the violation of fundamental properties, such as safety, rather than the overall reward that combines safety with efficiency. Moreover, properties such as safety can only be defined with respect to true state, rather than the high-dimensional raw inputs to end-to-end policies. To disentangle nominal efficiency and adversarial safety, we situate RL in deterministic partially-observable Markov decision processes (POMDPs) with the goal of maximizing cumulative reward subject to safety constraints. We then propose a partially-supervised reinforcement learning (PSRL) framework that takes advantage of an additional assumption that the true state of the POMDP is known at training time. We present the first approach for certifying safety of PSRL policies under adversarial input perturbations, and two adversarial training approaches that make direct use of PSRL. Our experiments demonstrate both the efficacy of the proposed approach for certifying safety in adversarial environments, and the value of the PSRL framework coupled with adversarial training in improving certified safety while preserving high nominal reward and high-quality predictions of true state.
  • https://www.semanticscholar.org/paper/Methodological-reflections-for-AI-alignment-using-Hagendorff-Fabi/63720fe64c51c95fd2f0e807c9adc2200a7a205a maybe
    • The field of artificial intelligence (AI) alignment aims to investigate whether AI technologies align with human interests and values and function in a safe and ethical manner. AI alignment is particularly relevant for large language models (LLMs), which have the potential to exhibit unintended behavior due to their ability to learn and adapt in ways that are difficult to predict. In this paper, we discuss methodological challenges for the alignment problem specifically in the context of LLMs trained to summarize texts. In particular, we focus on methods for collecting reliable human feedback on summaries to train a reward model which in turn improves the summarization model. We conclude by suggesting specific improvements in the experimental design of alignment studies for LLMs’ summarization capabilities.
  • https://www.semanticscholar.org/paper/Teaching-yourself-about-structural-racism-will-your-Robinson-Renson/3b6b24f03d7b20f28aec9de9805fc7ae5aa2ac64 maybe. intro excerpt:
    • In particular, we are inspired by our field’s 1980s- and 1990s-era debates about “black box epidemiology” (Weed, 1998). In many respects, this debate mirrors debates in machine learning about the trade-offs between improved prediction versus greater model interpretability (Seligman and others, 2018). The earlier epidemiology debate contrasted the use of multivariable-adjusted regression models to identify behavioral risk factors for cancer incidence to target in prevention efforts (“black box”) versus research elucidating biological pathways of cancer development, particularly at the level of molecular biology (“mechanistic”) (Weed, 1998). However, the “mechanistic” side’s focus on molecular mechanisms ignored all the parts of the causal structure that were “above” the molecular level. A key insight of this debate was that, even a mechanistically oriented research orientation has its blind spots. Specifically, the integration of sociopolitical forces with a consideration of biology and behavior was missing in early debates in the field (Weed, 1998). The need to integrate factors from across the full breadth of the causal structure is likely even more crucial when making causal inference.


  • https://www.semanticscholar.org/paper/In-Quest-of-Ground-Truth%3A-Learning-Confident-Models-Hashmi-Agafonov/280cf75d6221e015fdc2de447ea9394f9fd7388a maybe
    • The performance of the Deep Learning (DL) models depends on the quality of labels. In some areas, the involvement of human annotators may lead to noise in the data. When these corrupted labels are blindly regarded as the ground truth (GT), DL models suffer from performance deficiency. This paper presents a method that aims to learn a confident model in the presence of noisy labels. This is done in conjunction with estimating the uncertainty of multiple annotators. We robustly estimate the predictions given only the noisy labels by adding entropy or information-based regularizer to the classifier network. We conduct our experiments on a noisy version of MNIST, CIFAR-10, and FMNIST datasets. Our empirical results demonstrate the robustness of our method as it outperforms or performs comparably to other state-of-the-art (SOTA) methods. In addition, we evaluated the proposed method on the curated dataset, where the noise type and level of various annotators depend on the input image style. We show that our approach performs well and is adept at learning annotators’ confusion. Moreover, we demonstrate how our model is more confident in predicting GT than other baselines. Finally, we assess our approach for segmentation problem and showcase its effectiveness with experiments.


  • https://www.semanticscholar.org/paper/Learning-One-Abstract-Bit-at-a-Time-Through-Encoded-Herrmann-Kirsch/aa118b8e6e31457bc2c3de895a1e298b381a3428 maybe
    • There are two important things in science: (A) Finding answers to given questions, and (B) Coming up with good questions. Our artificial scientists not only learn to answer given questions, but also continually invent new questions, by proposing hypotheses to be verified or falsified through potentially complex and time-consuming experiments, including thought experiments akin to those of mathematicians. While an artificial scientist expands its knowledge, it remains biased towards the simplest, least costly experiments that still have surprising outcomes, until they become boring. We present an empirical analysis of the automatic generation of interesting experiments. In the first setting, we investigate self-invented experiments in a reinforcement-providing environment and show that they lead to effective exploration. In the second setting, pure thought experiments are implemented as the weights of recurrent neural networks generated by a neural experiment generator. Initially interesting thought experiments may become boring over time.
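In case Reward Machines (from the Noisy Symbolic Abstractions abstract above) are unfamiliar: a reward machine is a small automaton whose transitions fire on symbols emitted by a labelling function, and which pays out reward on some transitions. Below is a minimal sketch of one, plus a noisy labelling function, to show why uncertainty in the symbols is a problem: the agent's idea of which machine state it is in can drift away from the true one. Everything here is hypothetical and made up for illustration (the coffee-delivery task, the names `TRANSITIONS`, `SYMBOLS`, `noisy_label`); the paper itself treats the problem as a POMDP and I'm not reproducing its methods.

```python
import random


class RewardMachine:
    """Finite automaton over symbols; each transition may emit a reward."""

    def __init__(self, transitions, initial):
        # transitions: {(state, symbol): (next_state, reward)}
        self.transitions = transitions
        self.state = initial

    def step(self, symbol):
        # Symbols with no matching transition leave the state unchanged
        # and give zero reward.
        next_state, reward = self.transitions.get(
            (self.state, symbol), (self.state, 0.0)
        )
        self.state = next_state
        return reward


def noisy_label(true_symbol, symbols, flip_prob, rng):
    """Return the true symbol, or a different one with probability flip_prob."""
    if rng.random() < flip_prob:
        return rng.choice([s for s in symbols if s != true_symbol])
    return true_symbol


# Hypothetical task: pick up coffee, then deliver it to the office.
SYMBOLS = ["none", "coffee", "office"]
TRANSITIONS = {
    ("u0", "coffee"): ("u1", 0.0),    # got the coffee
    ("u1", "office"): ("done", 1.0),  # delivered it: reward
}

if __name__ == "__main__":
    rng = random.Random(0)
    true_trace = ["none", "coffee", "none", "office"]

    true_rm = RewardMachine(TRANSITIONS, "u0")     # what actually happened
    believed_rm = RewardMachine(TRANSITIONS, "u0")  # what the agent perceives

    for symbol in true_trace:
        true_rm.step(symbol)
        believed_rm.step(noisy_label(symbol, SYMBOLS, flip_prob=0.3, rng=rng))

    print("true machine state:    ", true_rm.state)
    print("believed machine state:", believed_rm.state)
```

With `flip_prob=0.0` the two machines always agree; as the flip probability grows, the believed state increasingly disagrees with the true one, which is the situation the abstract's "predicting Reward Machine state" methods are aimed at.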

XAI that may be nonzero useful to Mechanistic Interpretability folks

  • via things citing https://www.semanticscholar.org/paper/Explainable-Artificial-Intelligence-(XAI)-Ridley/12cac86b9cb5557d7f75b6fbcab0bac40b5f7995?sort=pub-date (not worth reading, just an interesting hub on the citation graph),  
    • https://www.semanticscholar.org/paper/XAI-Based-Microarchitectural-Side-Channel-Analysis-Gulmezoglu/cfe5bae75302bf16ccbdaad1e27df687c7b8ce73  (maybe; interesting regarding the question of information leakage)
      • Website Fingerprinting attacks aim to track the visited websites in browsers and infer confidential information about users. Several studies showed that recent advancements in Machine Learning (ML) and Deep Learning (DL) algorithms made it possible to implement website fingerprinting attacks even though various defense techniques are present in the network. Nevertheless, trained models for website detection are not analyzed deeply to identify the leakage sources which are not always visible to both attackers and Cyber Threat Intelligence engineers. This study focuses on explaining ML and DL models in the context of microarchitecture-based website fingerprinting attacks. In the attack model, performance counters and cache occupancy side-channels are implemented on Google Chrome and Tor browsers. After ML and DL models are trained, LIME and saliency map XAI methods are applied to examine the leakage points in the side-channel data. In order to match the leakage samples in the measurements to the network traces, a novel dataset is collected by utilizing Google Chrome and Firefox browser developer tools. Next, the efficiency of explainable methods are analyzed with XAI metrics. Finally, an XAI-based obfuscation defense technique is proposed as a countermeasure against microarchitecture-based website fingerprinting attacks 
    • https://www.semanticscholar.org/paper/A-Protocol-for-Intelligible-Interaction-Between-and-Srinivasan-Bain/dec3ece52c1de459c4e12ecfb0293dca0de5be76 maybe
      • Recent engineering developments have seen the emergence of Machine Learning (ML) as a powerful form of data analysis with widespread applicability beyond its historical roots in the design of autonomous agents. However, relatively little attention has been paid to the interaction between people and ML systems. Recent developments on Explainable ML address this by providing visual and textual information on how the ML system arrived at a conclusion. In this paper we view the interaction between humans and ML systems within the broader context of interaction between agents capable of learning and explanation. Within this setting, we argue that it is more helpful to view the interaction as characterised by two-way intelligibility of information rather than once-off explanation of a prediction. We formulate two-way intelligibility as a property of a communication protocol. Development of the protocol is motivated by a set of ‘Intelligibility Axioms’ for decision-support systems that use ML with a human-in-the-loop. The axioms are intended as sufficient criteria to claim that: (a) information provided by a human is intelligible to an ML system; and (b) information provided by an ML system is intelligible to a human. The axioms inform the design of a general synchronous interaction model between agents capable of learning and explanation. We identify conditions of compatibility between agents that result in bounded communication, and define Weak and Strong Two-Way Intelligibility between agents as properties of the communication protocol.
    • https://www.semanticscholar.org/paper/FATE-in-AI%3A-Towards-Algorithmic-Inclusivity-and-Inuwa-Dutse/4cf0543d8247f272f25afb3a491ccf76a0a0bcf6 probably -> have skimmed: it was okay I guess, certainly a good read if you're not aware of the issues the abstract discusses
      • One of the defining phenomena in this age is the widespread deployment of systems powered by artificial intelligence (AI) technology. With AI taking the center stage, many sections of society are being affected directly or indirectly by algorithmic decisions. Algorithmic decisions carry both economical and personal implications which have brought about the issues of fairness, accountability, transparency and ethics (FATE) in AI geared towards addressing algorithmic disparities. Ethical AI deals with incorporating moral behaviour to avoid encoding bias in AI’s decisions. However, the present discourse on such critical issues is being shaped by the more economically developed countries (MEDC), which raises concerns regarding neglecting local knowledge, cultural pluralism and global fairness. This study builds upon existing research on responsible AI, with a focus on areas in the Global South considered to be under-served vis-a-vis AI. Our goal is two-fold: (1) to assess FATE-related issues and the effectiveness of transparency methods and (2) to proffer useful insights and stimulate action towards bridging the accessibility and inclusivity gap in AI. Using ads data from online social networks, we designed a user study (n = 43) to achieve the above goals. Among the findings from the study include: explanations about decisions reached by the AI systems tend to be vague and less informative. To bridge the accessibility and inclusivity gap, there is a need to engage with the community for the best way to integrate fairness, accountability, transparency and ethics in AI. This will help in empowering the affected community or individual to effectively probe and police the growing application of AI-powered systems.
  • https://www.semanticscholar.org/paper/Greybox-XAI%3A-a-Neural-Symbolic-learning-framework-Bennetot-Franchi/170f4e654e96e08d7c3d8150a03229317ea77ca4 maybe -> read more closely, nope, don't recommend
  • https://www.semanticscholar.org/paper/Information-fusion-as-an-integrative-cross-cutting-Holzinger-Dehmer/1d243b2371bdef69d07f1312d0b18162b05788a0 maybe, is meta, but looks like a cool meta paper
    • Medical artificial intelligence (AI) systems have been remarkably successful, even outperforming human performance at certain tasks. There is no doubt that AI is important to improve human health in many ways and will disrupt various medical workflows in the future. Using AI to solve problems in medicine beyond the lab, in routine environments, we need to do more than to just improve the performance of existing AI methods. Robust AI solutions must be able to cope with imprecision, missing and incorrect information, and explain both the result and the process of how it was obtained to a medical expert. Using conceptual knowledge as a guiding model of reality can help to develop more robust, explainable, and less biased machine learning models that can ideally learn from less data. Achieving these goals will require an orchestrated effort that combines three complementary Frontier Research Areas: (1) Complex Networks and their Inference, (2) Graph causal models and counterfactuals, and (3) Verification and Explainability methods. The goal of this paper is to describe these three areas from a unified view and to motivate how information fusion in a comprehensive and integrative manner can not only help bring these three areas together, but also have a transformative role by bridging the gap between research and practical applications in the context of future trustworthy medical AI. This makes it imperative to include ethical and legal aspects as a cross-cutting discipline, because all future solutions must not only be ethically responsible, but also legally compliant.
  • https://www.semanticscholar.org/paper/Planting-and-Mitigating-Memorized-Content-in-Models-Downey-Dai/e4b8556443d90e273da2d8ce848953f7c08f7d0c probably - abstract summary: differential privacy doesn't work yet
    • Language models are widely deployed to provide automatic text completion services in user products. However, recent research has revealed that language models (especially large ones) bear considerable risk of memorizing private training data, which is then vulnerable to leakage and extraction by adversaries. In this study, we test the efficacy of a range of privacy-preserving techniques to mitigate unintended memorization of sensitive user text, while varying other factors such as model size and adversarial conditions. We test both “heuristic” mitigations (those without formal privacy guarantees) and Differentially Private training, which provides provable levels of privacy at the cost of some model performance. Our experiments show that (with the exception of L2 regularization), heuristic mitigations are largely ineffective in preventing memorization in our test suite, possibly because they make too strong of assumptions about the characteristics that define “sensitive” or “private” text. In contrast, Differential Privacy reliably prevents memorization in our experiments, despite its computational and model-performance costs.
  • https://www.semanticscholar.org/paper/Evaluating-Human-Language-Model-Interaction-Lee-Srivastava/9431181f8115a2360621df5ed76e1a23b88e3b2f maybe - summary: consider how to measure interactive use of LLMs
    • Many real-world applications of language models (LMs), such as code autocomplete and writing assistance, involve human-LM interaction, but the main LM benchmarks are non-interactive, where a system produces output without human intervention. To evaluate human-LM interaction, we develop a framework, Human-AI Language-based Interaction Evaluation (H-LINE), that expands non-interactive evaluation along three dimensions, capturing (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality. We then design five tasks ranging from goal-oriented to open-ended to capture different forms of interaction. On four state-of-the-art LMs (three variants of OpenAI’s GPT-3 and AI21’s J1-Jumbo), we find that non-interactive performance does not always result in better human-LM interaction and that first-person and third-party metrics can diverge, suggesting the importance of examining the nuances of human-LM interaction.
  • https://www.semanticscholar.org/paper/A-Theoretical-Framework-for-AI-Models-Rizzo-Veneri/a29f181134cfb853ea6f7246fa4e7408d804b20c iffy - review of XAI and proposes some words
    • EXplainable Artificial Intelligence (XAI) is a vibrant research topic in the artificial intelligence community, with growing interest across methods and domains. Much has been written about the subject, yet XAI still lacks shared terminology and a framework capable of providing structural soundness to explanations. In our work, we address these issues by proposing a novel definition of explanation that is a synthesis of what can be found in the literature. We recognize that explanations are not atomic but the combination of evidence stemming from the model and its input-output mapping, and the human interpretation of this evidence. Furthermore, we fit explanations into the properties of faithfulness (i.e., the explanation being a true description of the model's inner workings and decision-making process) and plausibility (i.e., how much the explanation looks convincing to the user). Using our proposed theoretical framework simplifies how these properties are operationalized and it provides new insight into common explanation methods that we analyze as case studies. 
  • https://www.semanticscholar.org/paper/A-Survey-of-Opponent-Modeling-in-Adversarial-Nashed-Zilberstein/602faa3d4393426fab5ec2cf0dd399f6a886700a iffy
    • Opponent modeling is the ability to use prior knowledge and observations in order to predict the behavior of an opponent. This survey presents a comprehensive overview of existing opponent modeling techniques for adversarial domains, many of which must address stochastic, continuous, or concurrent actions, and sparse, partially observable payoff structures. We discuss all the components of opponent modeling systems, including feature extraction, learning algorithms, and strategy abstractions. These discussions lead us to propose a new form of analysis for describing and predicting the evolution of game states over time. We then introduce a new framework that facilitates method comparison, analyze a representative selection of techniques using the proposed framework, and highlight common trends among recently proposed methods. Finally, we list several open problems and discuss future research directions inspired by AI research on opponent modeling and related research in other disciplines.
  • -> https://www.semanticscholar.org/paper/Engineering-Pro-social-Values-in-Autonomous-Agents-Montes/88f2f8f0d2ff317264b893b9eefa64af88fe320d probably
    • This doctoral thesis is concerned with the engineering of values with an explicit pro-social (as opposed to a personal) focus. To do so, two approaches are explored, each dealing with a different level at which interactions are studied and engineered in a multi-agent system. The first, referred to as the collective approach, leverages prescriptive norms as the promoting mechanisms of pro-social values. The second, referred to as the individual approach, deals with the internal reasoning scheme of agents and endows them with the ability to reason about others. This results in empathetic autonomous agents, who are able to take the perspective of a peer and understand the motivations behind their behaviour.
  • https://www.semanticscholar.org/paper/Offline-Q-Learning-on-Diverse-Multi-Task-Data-Both-Kumar-Agarwal/f4905fe39e83e98d06c9544ad2cff44c1eb27f97 meh, just an RL capabilities paper
  • https://www.semanticscholar.org/paper/Interpreting-Neural-Networks-Using-Flip-Points-Yousefzadeh-O%E2%80%99Leary/d1ec261a8dc00390eb61e96d750a64babe24e0ce?sort=pub-date iffy but has been cited by interesting papers
    • Neural networks have been criticized for their lack of easy interpretation, which undermines confidence in their use for important applications. Here, we introduce a novel technique, interpreting a trained neural network by investigating its flip points. A flip point is any point that lies on the boundary between two output classes: e.g. for a neural network with a binary yes/no output, a flip point is any input that generates equal scores for "yes" and "no". The flip point closest to a given input is of particular importance, and this point is the solution to a well-posed optimization problem. This paper gives an overview of the uses of flip points and how they are computed. Through results on standard datasets, we demonstrate how flip points can be used to provide detailed interpretation of the output produced by a neural network. Moreover, for a given input, flip points enable us to measure confidence in the correctness of outputs much more effectively than softmax score. They also identify influential features of the inputs, identify bias, and find changes in the input that change the output of the model. We show that distance between an input and the closest flip point identifies the most influential points in the training data. Using principal component analysis (PCA) and rank-revealing QR factorization (RR-QR), the set of directions from each training input to its closest flip point provides explanations of how a trained neural network processes an entire dataset: what features are most important for classification into a given class, which features are most responsible for particular misclassifications, how an adversary might fool the network, etc. Although we investigate flip points for neural networks, their usefulness is actually model-agnostic
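To make the "closest flip point" idea from that last abstract concrete, here's a rough sketch of the one case where it has a closed form: a linear binary classifier with score w·x + b, where "yes" and "no" tie exactly when the score is 0. This is my own toy illustration, not the paper's method (for real networks they solve an optimization problem); the function name `closest_flip_point` and the toy weights are made up for the example.

```python
import math


def closest_flip_point(x, w, b):
    """Closest flip point for a linear binary classifier with score w.x + b.

    Assumption (not from the paper): the model is linear, so the flip points
    are exactly the hyperplane w.x + b = 0, and the closest one to x is the
    orthogonal projection of x onto that hyperplane.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    if norm_sq == 0:
        raise ValueError("w must be non-zero")
    # Move x along w just far enough to bring the score to zero.
    step = score / norm_sq
    flip = [xi - step * wi for xi, wi in zip(x, w)]
    # Distance from x to the decision boundary.
    distance = abs(score) / math.sqrt(norm_sq)
    return flip, distance


# Hypothetical toy model and input.
w, b = [3.0, 4.0], -5.0
x = [3.0, 4.0]
flip, distance = closest_flip_point(x, w, b)
print("closest flip point:", flip)
print("distance to boundary:", distance)
```

The distance is the quantity the abstract suggests using as a confidence measure: inputs far from their closest flip point need a big change before the output flips, and the direction from x to the flip point says which features matter most for that change.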



Thanks for sharing!