LEARNING GENOMIC AND MOLECULAR MEDIATORS OF
GENOTYPE-PHENOTYPE ASSOCIATIONS
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF BIOLOGICAL DATA SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Anna Shcherbina
December 2020

© 2020 by Anna Shcherbina. All Rights Reserved.
Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution 3.0 United States License.
http://creativecommons.org/licenses/by/3.0/us/

This dissertation is online at: http://purl.stanford.edu/xj839rr9301

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Anshul Kundaje, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Euan Ashley, Co-Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Russ Altman
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Manuel Rivas
Approved for the Stanford University Committee on Graduate Studies.
Stacey F. Bent, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract
The vast majority of genomic variants are non-coding, and many disrupt regulatory elements, causing dysregulation of gene expression. However, the functional mechanisms by which non-coding variants operate at the molecular level, as well as their tissue-specific downstream effects on cellular, organismal and disease phenotypes remain challenging to decipher. Firstly, complex phenotypes such as physical activity patterns are difficult to characterize and measure. Secondly, even after inferring statistical associations between genetic loci and complex phenotypes, identifying the causal variants is challenging due to the issues posed by linkage disequilibrium. Finally, the elucidation of functional molecular mechanisms that mediate the manifestation of genotypic variation to phenotypic effects remains an open challenge in the field.
This thesis attempts to address these three challenges via the development and application of statistical and deep learning approaches to mine large genomic, molecular and phenotypic datasets. The MyHeart Counts study serves as an example of how wearable and mobile technologies enable unobtrusive real-time measurements of complex phenotypes such as exercise and physical activity patterns. These technologies also enable rapid recruitment of large study cohorts and facilitate fully digital randomized controlled trials with low barriers to entry. Such technologies also facilitate the compilation of population-level biobanks, such as the UK Biobank, by enabling acquisition of lifestyle and activity data at scale. Having acquired complex phenotypes on large cohorts, we can begin to investigate the effects of genomic variation on these phenotypes by performing genomewide association studies (GWAS).
Functional GWAS SNPs can be identified via in silico interrogation of predictive deep learning models of regulatory DNA. Here, I present convolutional neural network models trained on genomewide chromatin profiling experiments to interpret and fine-map GWAS SNPs by leveraging their ability to learn predictive DNA sequence syntax. Case studies in colorectal cancer and Alzheimer’s disease are presented to illustrate the application of these methods. To improve model stability and interpretability, I developed deep learning models that can predict regulatory chromatin profiles at single base resolution, accounting and correcting for confounding experimental biases.
I also contributed to several collaborative investigations of the molecular basis of complex cellular phenotypes. We identified the Sp1 regulatory protein as a key regulator of matrix stiffness and induction of tumorigenic phenotypes in mammary epithelium; the PI3K pathway as a key modulator of the efficiency of stem cell differentiation; and transcription factor networks that regulate murine muscle stem cell aging through differentiation.
In summary, this thesis presents new computational approaches for linking genotype to phenotype through molecular mechanisms.
Acknowledgments
Firstly, I would like to thank my wonderful advisors who have been incredible mentors during my PhD journey. I have learned so much from both of them, and I am forever grateful for their time, patience, guidance, feedback, and support.
My primary co-advisor, Prof. Anshul Kundaje, introduced me to the field of deep learning and has given me numerous opportunities to work on exciting projects in this field. I am constantly amazed at Anshul’s inexhaustible supply of innovative ideas, and he has taught me how to approach problems creatively by combining ideas across a range of disciplines from deep learning, to statistics, to epigenetics. Anshul has also taught me how to be thorough and rigorous in every aspect of my research and how to build rock-solid arguments that can sway even the toughest of reviewers. He has shown me the importance of working collaboratively with researchers across multiple disciplines, and has helped me build a network of collaborators through a diverse array of projects, many of which are presented in this thesis. Anshul always focuses on the broader impact of the lab’s work and the importance of building tools and algorithms that will be widely useful to the community and will stand the test of time. These are lessons I will take with me as I progress in my career.
My co-advisor, Prof. Euan Ashley, has also had a huge impact on my development as a scientist. Euan has taught me to focus on big, impactful ideas and to keep the big picture in mind. His perspective as a clinician has been invaluable in considering how my algorithmic work can be translated to a clinical setting and beyond to impact health in “the real world”. I am very grateful for Euan’s support and mentorship in setting up collaborations with industry partners such as Myokardia and 23&Me, from which I have learned a great deal, as well as with other labs throughout Stanford. As we worked to get our unconventional approaches to digital crowd-sourced RCTs accepted by traditionally minded journals, I was inspired by Euan’s determination and optimism, taking all challenges in stride. I hope to have the same attitude and approach to challenges in my career.
Thirdly, I want to thank my “unofficial” advisor, Dr. Carlos Aguilar. I have known Carlos for many years, since my undergraduate days, and he first introduced me to epigenetics and gene regulation. My work with Carlos at Lincoln Laboratory is what largely inspired me to pursue a PhD in this field and has shaped my trajectory as a researcher. Carlos has continued to mentor me on a number of projects even when I was not officially his student, and I am extremely grateful for his generosity with his time and mentorship. I am very proud of our work together dissecting murine muscle stem cell aging through regeneration (chapter 5.5 in this thesis).
The work presented in this thesis would not have been possible without the efforts of many, many collaborators. Thank you to Steve Hershman, Daryl Waggott, Brian Bot, Mike McConnell, Jack O’Sullivan, and Yas Moayedi for their collaboration on the MyHeart Counts studies (chapter 2). Thank you to Chunli Zhao for performing the experiments to validate findings from the UK Biobank physical activity GWAS (chapter 3.1). A huge thanks to everyone in the GECCO colorectal cancer consortium, and especially to Ulrike Peters, Stephanie Bien, Peter Scacheri, and Ryan Tewhey for involving me in the consortium’s work investigating the mechanisms of colorectal cancer and for their contribution to chapter 3.2 via experimental data/validation and feedback. Thank you to Ryan Corces, Soumya Kundu, Stephen Montgomery, and Thomas Montine and all co-authors for our collaboration investigating the inherited risk loci in Alzheimer’s and Parkinson’s disease (chapter 3.3). Thank you to Ryan Stowers and Johnny Israeli for our collaboration investigating how matrix stiffness induces a tumorigenic phenotype in mammary epithelium (chapter 5.1). I would like to thank Dr. Sundari Chetty, Jingling Li, Cyndhavi Narayanan, and other members of the Chetty lab for the opportunity to work with them on investigating the impact of DMSO treatment at various cell cycle phases on neuronal differentiation efficiency (chapter 5.2). Thank you to Xin Liu, Tao Sun, and Dr. Billy Li for our collaboration investigating the cis-regulatory principles of ADAR-based RNA editing (chapter 5.3). Thank you to Dr. Helen Blau, Glenn Markov, Thach Mai, and other Blau lab members for the opportunity to work with them on investigating nuclear reprogramming with the aid of the heterokaryon system (chapter 5.4). Finally, thank you to Jin Lee for architecting the uniform ATAC-seq, DNase-seq, and ChIP-seq processing pipelines that have been extremely useful for data processing for many of the presented projects... and for answering my endless questions about the pipelines.
I would like to thank the rest of my committee, Prof. Russ Altman and Prof. Manuel Rivas, for their feedback and advice, which have helped to improve this dissertation.
I am very lucky to have been awarded fellowships that have supported me financially during my PhD. Thank you to the Bio-X Bowes Fellowship and the NVIDIA 2017-2018 Graduate Fellowship for financial support, as well as the awesome opportunities for networking, community, and professional growth provided by these fellowship programs.
Thank you to the amazing members of the Kundaje and Ashley labs for being exceptional colleagues over the past several years. Everyone in the two labs is brilliant, hard-working, and generally inspiring, and has provided valuable ideas, motivation, and feedback during my time at Stanford. In particular, Irene Kaplow, Av Shrikumar, Oana Ursu, Alex Tseng, Laksshman Sundaram, and Mahfuza Sharmin are not only amazing scientists who made grad school a fun experience, but are also valued friends without whose support and friendship this journey would have been, if not impossible, at least very difficult (and a lot less fun). And Steve Hershman – thank you for your great sense of humor.
I am also very thankful to all the mentors I have had in my career prior to embarking on the PhD – I have not forgotten. In particular, a huge thanks to Darrell Ricke and Tony Lapadula at Lincoln Laboratory for accepting me as a young CS undergraduate with no compbio experience and being amazing, supportive mentors. Thank you for introducing me to bioinformatics and computational biology and for all your patience and time, and teaching me so much.
Finally, last but definitely not least, thank you to my amazing parents – Tatyana Proshko, Yuri Shcherbina, and Mario Chavez. No words can ever describe my gratitude to you. Thank you for bringing me to this country and working so hard to give me every opportunity. Thank you for sacrificing so much for me and for your unconditional love and support. Thank you for supporting me in the difficult times, always listening, and never judging. I owe you everything.
Contents
Abstract iv
Acknowledgments vi
1 Introduction 1
1.1 Cell-type specific patterns of transcriptional regulation in the genome . . . . . . . . 1
1.2 Profiling transcription factor binding through high throughput assays . . . . . . . . 2
1.3 Genomic variation in non-coding regions of the genome can disrupt regulation of gene
expression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Genomewide association studies help to decipher genotype-phenotype associations . 4
1.5 Measuring complex phenotypes via mobile devices and wearables . . . . . . . . . . . 5
1.6 Population-scale biobanks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.7 Interpretable deep learning models serve as a useful tool for functional SNP fine-mapping 5
1.8 Summary of chapters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Machine learning approaches... 9
2.1 MyHeart Counts: A cardiovascular mobile health study . . . . . . . . . . . . . . . . 9
2.1.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 Methods and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.6 Author Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 The effect of digital physical activity interventions on daily step count: a randomised
controlled crossover substudy of the MyHeart Counts Cardiovascular Health Study . 22
2.2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.6 Author Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3 Link to genetics 40
3.1 Genetic determinants and causal implications of physical activity in large populations 40
3.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1.5 Author Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Machine learning approaches for robust fine-mapping of putative causal regulatory
variants associated with colorectal cancer . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.2.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3 Single-cell epigenomic analyses implicate candidate causal variants at inherited risk
loci for Alzheimer’s and Parkinson’s diseases . . . . . . . . . . . . . . . . . . . . . . 84
3.3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.3.4 Single-cell ATAC-seq captures regional and cell type-specific heterogeneity . . 91
3.3.5 scATAC-seq identifies diverse neuronal subpopulations . . . . . . . . . . . . . 93
3.3.6 Single-cell ATAC-seq pinpoints the cellular targets of GWAS polymorphisms 94
3.3.7 Identification of putative enhancer-promoter interactions through chromatin
conformation and cell type-specific co-accessibility . . . . . . . . . . . . . . . 96
3.3.8 A tiered multi-omic approach to predicting functional noncoding SNPs . . . . 97
3.3.9 Machine learning predicts putative functional SNPs and identifies the molec-
ular ontogeny of disease associations . . . . . . . . . . . . . . . . . . . . . . . 98
3.3.10 Epigenomic dissection of the MAPT locus explains haplotype-specific changes
in local gene expression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.3.11 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.3.12 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.3.13 Author contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4 Base-resolution deep learning models... 123
4.1 ChromBPNET: Dilated convolutional neural networks allow for greater sequence context
and base-resolution modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.1.1 Strengths and limitations of support vector machine models versus CNN mod-
els on binned genome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.1.2 ChromBPNET model architecture and training . . . . . . . . . . . . . . . . . 127
4.1.3 ChromBPNET hyperparameter tuning . . . . . . . . . . . . . . . . . . . . . . 130
4.1.4 ChromBPNET baseline performance . . . . . . . . . . . . . . . . . . . . . . . 130
4.1.5 Enzymatic Bias Effect Correction . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.1.6 ChromBPNET sequence importance scores with DeepSHAP . . . . . . . . . . 140
4.1.7 Footprint comparisons from ChromBPNET against gold standards . . . . . . 140
4.1.8 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5 Molecular phenotype to cellular phenotype links 147
5.1 Matrix stiffness induces a tumorigenic phenotype in mammary epithelium through
changes in chromatin accessibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.1.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.1.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
5.1.6 Author Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.2 Cell cycle dynamics of human pluripotent stem cells primed for differentiation . . . . 165
5.2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
5.2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.2.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
5.2.6 Author Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.3 Learning cis-regulatory principles of ADAR-based RNA editing from CRISPR-mediated
mutagenesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5.3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.3.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
5.3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
5.3.6 Author Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
5.4 Transient relief from AP-1 epigenetic roadblock augments reprogramming to pluripo-
tency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
5.4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
5.4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
5.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
5.4.5 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
5.4.6 DPGP clustering of peak and gene trajectories . . . . . . . . . . . . . . . . . 220
5.4.7 Motif Enrichment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
5.4.8 Transcription factor expression analysis . . . . . . . . . . . . . . . . . . . . . 221
5.4.9 Chromatin state distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
5.4.10 Author Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
5.5 Dissecting Murine Muscle Stem Cell Aging Through Regeneration Using Integrative
Genomic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
5.5.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
5.5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
5.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
5.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
5.5.5 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
5.5.6 Author Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
A MyHeart Counts 251
A.1 Supplementary Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
A.1.1 Data Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
A.1.2 Motion Tracking Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
A.1.3 Unsupervised Machine Learning Analysis . . . . . . . . . . . . . . . . . . . . 252
A.1.4 Heart Age and 10 Year Risk Assessment . . . . . . . . . . . . . . . . . . . . . 253
A.1.5 Validation studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
A.2 App Screenshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
A.3 Survey instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
A.4 Supplementary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
A.4.1 Physical Activity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
A.4.2 Validation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
A.4.3 Models of life satisfaction and self-reported disease . . . . . . . . . . . . . . . 256
B Digital cross-over randomized trial ... 265
B.1 Supplementary Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
B.2 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270

B.3 Supplementary Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
C Genetic determinants and causal implications... 277
C.1 Supplementary Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
D Single-cell epigenomic analyses implicate candidate ... 280
D.1 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
D.2 Supplementary Note 1 - Quality control analysis of bulk ATAC-seq data . . . . . . . 290
D.3 Supplementary Note 2 - Single-cell ATAC-seq provides reference cell populations for
deconvolution of cell type-specific signals in bulk data . . . . . . . . . . . . . . . . . 290
D.4 Supplementary Note 3 - Single-cell ATAC-seq identifies brain region-specific differ-
ences in glial cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
D.5 Supplementary Note 4 - Single-cell ATAC-seq identifies neuronal cell class-specific
biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
D.6 Supplementary Note 5 - Tiered approach to identification of functional GWAS poly-
morphisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
D.7 Supplementary Note 6 - A multi-omic epigenetic dissection of the MAPT gene locus 293
D.8 Supplementary Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
D.8.1 Ancestry determination via PCA analysis on genomic data . . . . . . . . . . 294
D.8.2 SNP selection for colocalization testing . . . . . . . . . . . . . . . . . . . . . 294
D.8.3 Colocalization analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
D.8.4 CIBERSORT deconvolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
D.8.5 gkm-SVM allelic scores of candidate SNPs . . . . . . . . . . . . . . . . . . . . 296
D.8.6 Statistical significance and high confidence sets of gkm-SVM based allelic
scores for candidate SNPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
E Cell cycle dynamics of human pluripotent stem cells... 300
E.1 Supplementary Methods: Dataset generation . . . . . . . . . . . . . . . . . . . . . . 300
E.2 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
F Matrix stiffness induces a tumorigenic phenotype... 311
F.1 Supplementary Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
F.1.1 Hydrogel formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
F.1.2 Matrix deformation calculation . . . . . . . . . . . . . . . . . . . . . . . . . . 312
F.1.3 Encapsulation and cell culture . . . . . . . . . . . . . . . . . . . . . . . . . . 312
F.1.4 Immunofluorescence, confocal imaging and analysis . . . . . . . . . . . . . . . 312
F.1.5 Image analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
F.1.6 TEM preparation, imaging and quantification . . . . . . . . . . . . . . . . . . 313
F.1.7 Western blotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314

F.1.8 Quantitative polymerase chain reaction . . . . . . . . . . . . . . . . . . . . . 314
F.2 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
F.2.1 Supplementary Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
G Learning cis-regulatory principles of ADAR-based... 326
G.1 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
G.2 Supplementary Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
H Transient relief from AP-1 epigenetic roadblock... 357
H.1 Supplemental Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
H.2 Supplementary Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
H.2.1 Heterokaryon generation and isolation . . . . . . . . . . . . . . . . . . . . . . 363
H.2.2 RNA extraction and qRT-PCR . . . . . . . . . . . . . . . . . . . . . . . . . . 363
H.2.3 ATAC-seq library generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
H.2.4 Lentivirus production . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
H.2.5 Cas9 experiments in heterokaryons . . . . . . . . . . . . . . . . . . . . . . . . 364
H.2.6 DNA constructs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
I Dissecting Murine Muscle Stem Cell Aging... 366
I.1 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
List of Tables
2.1 Participant Cardiovascular Health Diagnoses and Family History . . . . . . . . . . . 20
2.2 Mean (SD) Fitness, Activity, and Sleep . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Baseline characteristics of MyHeart Counts digital RCT. . . . . . . . . . . . . . . . . 29
2.4 MHC RCT Intervention effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1 UK Biobank physical activity hits with existing associations in the GWAS catalog. . 59
3.2 UK Biobank GWAS hits with significant SIFT and Polyphen scores. . . . . . . . . . 60
3.3 UK Biobank GWAS hit validation in other cohorts. . . . . . . . . . . . . . . . . . . . 60
3.4 UK Biobank GWAS hits for physical activity phenotypes that validated in the 23&Me cohort for similar phenotypes and underwent experimental validation via knockdown
of associated loci . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5 Test set performance for GECCO binned CNN classification and regression models. . 61
3.6 GECCO CNN choice of negative set effect on prediction performance. . . . . . . . . 63
3.7 Class imbalance in genomewide training and test dataset for GECCO binned models 63
3.8 Fraction of GWAS candidate variants found within accessible regions of the genome. 70
4.1 Performance metrics for ChromBPNET signal predicted on frozen sequence component,
using TOBIAS-initialized and 20-filter BPNET bias models. . . . . . . . . . . . . . . 138
5.1 Histone ChIP-seq datasets in satellite and myoblast cells overlapped with ATAC-seq
samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
A.1 Subject demographic information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
A.2 χ² statistical associations between K-means activity clusters and self-reported health
conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
A.3 Predictors of life satisfaction and disease status. . . . . . . . . . . . . . . . . . . . . . 257
A.4 Levels of activity and life satisfaction across U.S. geographic regions. . . . . . . . . . 258
B.1 Daily step count from HealthKit from Apple Watch and iPhone for users who reported
data on both device types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

B.3 Coaching prompts for participant activity clusters. . . . . . . . . . . . . . . . . . . . 266
B.2 Secondary outcome effects in response to interventions. . . . . . . . . . . . . . . . . . 272
B.4 Intervention effects on primary outcome of mean daily step count for individuals who
completed ≥4 days of an intervention. . . . . . . . . . . . . . . . . . . . . . . . . . . 273
B.5 Intervention effects on primary outcome of mean daily step count for individuals who
completed 7 days of an intervention. . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
B.6 Post-hoc sensitivity analyses on baseline observations carried forward. . . . . . . . . 274
C.1 Physical activity phenotypes derived from UK Biobank data used to perform GWAS.
No pair of phenotypes in this list were correlated with Pearson R ≥ 0.4 . . . . . . . 278
C.2 UK Biobank physical activity number of significant hits by phenotype. . . . . . . . 279
F.1 ATAC-seq quality control metrics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
F.2 Optimal overlap and IDR thresholded peak metrics . . . . . . . . . . . . . . . . . . 315
F.3 Homer Genome Ontology for naïve overlap peak sets from soft and stiff matrices. . 316
F.4 Disease ontology of Sp1 target genes. . . . . . . . . . . . . . . . . . . . . . . . . . . 317
F.5 KEGG pathway analysis of Sp1 target genes. . . . . . . . . . . . . . . . . . . . . . . 317
F.6 qPCR Primers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
G.1 Feature engineering for machine learning. . . . . . . . . . . . . . . . . . . . . . . . . 338
G.2 XGBoost prediction performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.3 Normalized SHAP and F1 scores for each feature used to train the substrate-specific 350
AJUBA, NEIL1, and TTYH2 models. . . . . . . . . . . . . . . . . . . . . . . . . . . 352
List of Figures
1.1 Gene expression enhancers.
1.2 Figure from Anshul Kundaje. Chromatin profiling assays identify repressed genes, active genes, and regulatory control elements within the genome.
1.3 Stegle et al., deep learning models for variant fine-mapping
2.1 Flow chart of participants in MyHeart Counts study.
2.2 MyHeart Counts Dataset Overview
2.3 MyHeart Counts Dataset Overview
2.4 Flow of participants through MHC Digital RCT.
2.5 Primary intervention effects.
3.1 Pairwise Spearman correlation between phenotypes used to perform UK Biobank physical activity GWAS.
3.2 Population stratification from principal component analysis on genetic variant data within the UK Biobank cohort.
3.3 Tally of UK Biobank GWAS hits with p-value ≤5e-8 across physical activity phenotypes.
3.4 ELISA analysis of neurotransmitter expression in neuron cells in response to knockdown of the MEF2C gene.
3.5 Caption
3.6 Learning transcription binding motifs via convolutional neural networks.
3.7 Caption
3.8 GECCO variant fine-mapping workflow.
3.9 Basset[219] architecture for genome-wide classification and regression models of chromatin accessibility.
3.10 Digestive, immune and CRC cell lines enrich for GECCO GWAS hits.
3.11 Performance metrics for GECCO multi-tasked classification and regression CNNs.
3.12 Model initialization with weights from pretrained ENCODE multi-tasked model.
3.13 GECCO candidate variant fine-mapping with ISM.
3.14 Sigmoid saturation challenge for ISM variant interpretation.

3.15 rs981625 variant fine-mapping via CNN models and DeepLIFT.
3.16 rs1318920 CRC variant fine-mapping.
3.17 GECCO SVM validation workflow.
3.18 GECCO SVM candidate functional variant: rs1318920. A) DeepSHAP and GKMexplain scores for the reference C allele, the alternate G allele, and the difference of the two tracks. B) TomTom motif match for HNF4A for the sequence region flanking rs1318920 with high GKMexplain delta scores. C) ChromHMM chromatin state annotations within the vicinity of rs1318920 in CRC-relevant cell types. D) MACS2 fold change tracks in the five GECCO datasets in the vicinity of rs1318920 and LD-tagged SNP rs11190164.
3.19 GECCO SVM candidate functional variant: rs12896913
3.20 GECCO SVM candidate functional variant: rs28549017
3.21 GECCO SVM candidate functional variant: rs4360494
3.22 GECCO SVM candidate functional variant: rs6089354
3.23 MPRA validation for GECCO fine-mapped variant rs7130173
3.24 MPRA validation for GECCO fine-mapped variant rs72685323
3.25 Single-cell ATAC-seq identifies cell type-specific chromatin accessibility in the adult brain
3.26 Sub-clustering identifies diverse biologically relevant neuronal cell types in the adult brain
3.27 Machine learning predicts functional polymorphisms in AD and PD
3.28 Vertical integration of multi-omic data and machine learning nominates novel gene targets in AD and PD
3.29 Epigenetic deconvolution of MAPT locus explains haplotype-associated transcriptional changes
4.1 SVM vs CNN benchmarks on ENCODE DNASE datasets in canonical cell lines.
4.2 SVM vs CNN performance benchmarks on GECCO DNASE datasets.
4.3 Expected auPRC across folds and tasks for the ENCODE canonical cell line DNASE datasets.
4.4 Expected auPRC across folds and tasks within the GECCO DNASE datasets.
4.5 Interpretation score stability across folds for the reference and alternate allele effects on accessibility at variant rs636317 within the Alzheimer's/Parkinson's dataset.
4.6 ChromBPNET model predicts base-level count profile as well as 1kb resolution summed counts for ATAC-seq and DNASE data. Example IDR peak from K562 DNASE dataset ENCSR000EOT.
4.7 ChromBPNET architecture for ATAC-seq and DNASE datasets, with bias correction.
4.8 ChromBPNET performance metrics on ENCODE canonical cell line ATAC and DNASE datasets.
4.9 Predicting profile and count signal from enzymatic bias input
4.10 Position weight matrices learned for ATAC-seq and DNASE bias.
4.11 Comparison of count performance metrics for models trained to predict enzymatic bias
4.12 Comparison of profile performance metrics for models trained to predict enzymatic bias.
4.13 Comparison of performance metrics for bias-corrected ChromBPNET model with uncorrected model and negative-augmented model in HEPG2 ATAC-seq data.
4.14 Count predictions from baseline, bias-corrected, and bias-corrected with bias unplugged ChromBPNET models.
4.15 Signal to noise ratio comparisons across DeepSHAP scores for different accessibility models
4.16 Interpretation score stability across ChromBPNET, binned genomewide CNN, and SVM models, measured by calculating cosine similarity, Spearman correlation, and Pearson correlation across five folds.
4.17 Distribution of per-base cosine similarity across base importance DeepSHAP and DeepLIFT scores.
4.18 ChromBPNET footprint and interpretation scores compared to other models for K562 IDR peak centered on chr1, 17348301.
4.19 ChromBPNET footprint and interpretation scores compared to other models for K562 IDR peak centered on chr1, 17348301.
4.20 Interpretation score signal within the Vierstra/Tobias highest confidence footprint across models.
5.1 Nuclear and chromatin alterations accompany phenotypic changes induced by ECM stiffness.
5.2 Chromatin accessibility changes are associated with normal and tumorigenic phenotypes.
5.3 Sp1 mediates the stiffness-induced tumorigenic phenotype.
5.4 HDACs 3 and 8 regulate the stiffness-induced tumorigenic phenotype.
5.5 Soft hydrogel culture produces more physiologically representative chromatin accessibility profiles than standard tissue culture.
5.6 DMSO treatment of hPSCs changes gene expression trajectories in response to phase of the cell cycle
5.7 DMSO-induced changes converge upon PI3K signaling
5.8 PI3K inhibition increases hPSC differentiation across all germ layers
5.9 CRISPR/Cas9-mediated mutagenesis in endogenous RNA to dissect RNA editing by ADAR1 in cells.
5.10 RNA editing results from the targeted mutagenesis experiments.
5.11 Effects of NEIL1 single mutations on RNA structure.
5.12 Examples of RNA secondary structure changes of NEIL1 variants.
5.13 Cis regulatory features explain differences of editing levels among RNA variants.
5.14 Clustering of NEIL1 RNA structure with RNA editing level.
5.15 Quantitative model predicts editing level by combining complex RNA sequence and structure features.
5.16 Cis regulatory features synergistically contribute to model prediction.
5.17 Chromatin accessibility dynamics during heterokaryon reprogramming.
5.18 AP-1/Jun inhibits OCT4 expression during reprogramming through a distal regulatory element.
5.19 Interaction between JUN and MBD3 during heterokaryon reprogramming.
5.20 Phosphorylation of JUN blocks interaction with MBD3 and activates OCT4.
5.21 Inhibition of AP-1 increases iPSC reprogramming efficiency and can replace exogenous OCT4.
5.22 Muscle stem cells act aberrantly as a result of aging and poorly regenerate muscle after injury.
5.23 Alterations in metabolism associate with global changes in histone methylation of young and aged muscle stem cells.
5.24 Retinoic acid receptors contribute to maintenance of muscle stem cell quiescence but are lost in age.
5.25 Chromatin accessibility is modified during muscle stem cell regeneration and exhibits divergent regenerative trajectories in aging.
5.26 Aging engenders variations in transcription factor binding dynamics during regeneration.
5.27 Silencing Ddit3 restores myogenic differentiation in aged muscle stem cells.
A.1 Screenshots from MyHeart Counts App with Consent Form.
A.2 Data returned to the user by the MyHeart Counts application.
A.3 A: Physical Activity Readiness Questionnaire (PAR-Q). B: Activity and Sleep Survey: on-the-job activity[460, 459], leisure-time activity[460, 459, 222]
A.4 A: Activity and Sleep Survey: Moderate or Vigorous Physical Activity[376], sleep[4]. B: Well-Being[334] and Risk Perception[227].
A.5 A: Diet Survey[34]. B: Cardiovascular Health Survey[203].
A.6 Activity Transition States
A.7 Bland-Altman analysis of app-reported six-minute walk distance vs. measured six-minute walk distance.
A.8 K-means clustering of subjects’ activity patterns based on 10 features: proportion of time spent in the “stationary”, “automotive” (driving), “walking”, “cycling”, and “running” states during the weekdays as well as during the weekends.
A.9 Assessment of subjects’ cardiovascular risk
B.1 User participation in study over time
B.2 95% CI for intervention effects on daily step count for Apple Watch users.
B.3 Subset of iPhone users who completed all four interventions (n=493).
B.4 User progress through coaching
B.5 K-means clustering of subjects’ activity patterns
D.1 Analysis of bulk ATAC-seq data from adult brain identifies brain-regional heterogeneity.
D.2 Quality control of scATAC-seq libraries.
D.3 Cell type-specific scATAC-seq data enables deconvolution of chromatin accessibility data from bulk regions in the adult brain.
D.4 scATAC-seq reveals epigenetic encoding of region-specific cellular gene regulatory programs
D.5 Neuronal cell class-specific peaks and genes delineate differences between biologically relevant neuronal cell types
D.6 HiChIP and co-accessibility implicates disease-relevant genes in AD and PD through linkage of noncoding GWAS SNPs to target genes
E.1 Differentially expressed genes within the PI3K-AKT signaling pathway
E.2 Differentially expressed genes within the TNF signaling pathway
E.3 Differentially expressed genes within the cGMP-PKG signaling pathway
E.4 Differentially expressed genes within the VEGF signaling pathway
E.5 DMSO treatment regulates the cell cycle of hPSCs
E.6 DMSO treatment regulates the cell cycle of hPSCs
E.7 PI3K inhibition increases HUES6 hPSC differentiation across all germ layers
F.1 Mechanical characterization of 3D culture matrices
F.2 IPNs are deformable by cell generated traction forces.
F.3 Characterization of MCF10A clusters in soft and stiff matrices.
F.4 HME1 mammary epithelial cells adopt a tumorigenic phenotype in response to stiff matrices through PI3K-Sp1-HDAC-mediated pathway.
F.5 HDAC inhibition by SAHA broadly alters chromatin accessibility
F.6 Sp1 motifs are enriched in accessible chromatin regions from stiff matrices.
F.7 Stiff matrices induce Sp1 phosphorylation that is reduced by PI3K or class I HDAC inhibition
F.8 Cancer cell line tumor-like morphology is also enhanced by matrix stiffness and diminished by Sp1 inhibition
G.1 Performance of targeted-mutagenesis RNA substrates.
G.2 Comparison of RNA coverage, gDNA coverage and Editing Levels.
G.3 Editing levels from targeted-mutagenesis libraries.
G.4 Selected examples of variants of NEIL1, TTYH2 and AJUBA.
G.5 Correlation of selected cis regulatory features to editing levels of all variants of NEIL1, TTYH2, and AJUBA.
G.6 Correlation of selected cis regulatory features to editing levels of all variants of NEIL1, TTYH2, and AJUBA (continued).
G.7 Example of NEIL1 cluster.
G.8 Clustering of TTYH2 with RNA editing levels.
G.9 Consensus structures of selected TTYH2 clusters. Editing level for each cluster is shown in G.8.
G.10 Joint training and testing across substrates.
G.11 XGBoost model performance across training sets.
H.1 Dynamic ATAC-seq trajectories during heterokaryon reprogramming.
H.2 Motif correlation matrix
H.3 Characterization of inducible dominant-negative AP-1 (acidic-fos)
H.4 Interrogating regulatory regions with dCas9-KRAB at human OCT4 during reprogramming
H.5 RNAi knockdown and constitutive JNK validation
I.1 Alterations in metabolism associate with global changes in histone methylation of young and aged muscle stem cells (expanded).
I.2 Retinoic acid receptors contribute to maintenance of muscle stem cell quiescence but are lost in age (expanded).
I.3 Chromatin accessibility is modified during muscle stem cell regeneration and exhibits divergent regenerative trajectories in aging (expanded).
I.4 Aging engenders variations in transcription factor binding dynamics during regeneration (expanded).
I.5 Silencing Ddit3 restores myogenic differentiation in aged muscle stem cells (expanded).

Chapter 1
Introduction
As the cost of sequencing has continued to decrease, large volumes of genomic sequencing data have been generated[202]. This enables extensive profiling of molecular phenotypes via sequence data and facilitates the linking of molecular mechanisms to cellular mechanisms and to complex organismal phenotypes. For example, at the molecular level, variants can affect protein structure, RNA initiation, transcription rates, splicing, editing, degradation, and translation. These molecular effects can in turn influence morphology, response to stimuli, and growth rates at the cellular level. Cellular effects can then influence complex traits and disease phenotypes at the organismal level. These layers of phenotypes are interconnected, and identifying the links between them remains an active area of research.
1.1 Cell-type specific patterns of transcriptional regulation in the genome
Associations of molecular phenotypes to cellular mechanisms differ across tissues and cell types due to different patterns of gene regulation. Though all cells in an organism have the same genome, the manner in which this genome is read and translated differs across cell types. A complex array of cell-type specific regulatory mechanisms determines cell fate and gene expression across tissues and cell types. These mechanisms encompass regions of the genome that increase transcription rates, termed "enhancers", as well as regions of the genome that silence gene expression, termed "repressors" (Figure 1.1). Enhancers and repressors are characterized by chromatin histone modification marks[42], such as histone acetylation and deacetylation. For example, acetylated lysine residues within histone tails are associated with transcriptional activation by altering chromatin structure. These modifications open up nucleosomes, enabling transcription factor proteins to bind to DNA. In contrast, deacetylated histones are associated with transcriptional repression: they allow histones to interact with DNA more strongly, leading to a more rigid nucleosome structure that prevents binding of transcriptional machinery[512].
Figure 1.1: Figure from https://courses.lumenlearning.com/wm-biology1/chapter/reading-eukaryotic-transcription-gene-regulation/. An enhancer is a DNA sequence that promotes transcription. Each enhancer is made up of short DNA sequences called distal control elements. Activators bound to the distal control elements interact with mediator proteins and transcription factors. Two different genes may have the same promoter but different distal control elements, enabling differential gene expression.
1.2 Profiling transcription factor binding through high throughput assays
A number of high-throughput assays have been developed to profile transcription factor binding patterns throughout the genome (Figure 1.2). For the purposes of this thesis, we focus on ATAC-seq, DNASE-seq, and histone ChIP-seq, described below:
• ATAC-seq[79] provides a measure of chromatin accessibility. In this assay, a hyperactive Tn5 transposase enzyme inserts sequencing adapters in accessible regions within chromatin. This is followed by a sequencing step, whereby sequencing reads are used to infer regions of increased accessibility, which correspond to transcription factor binding sites and can be used to infer nucleosome positions.
• DNASE-seq[436] is an assay that captures DNASE-digested DNA fragments indicative of DNase I hypersensitive sites within the genome. These fragments are then sequenced via high-throughput sequencing methods.
• Histone ChIP-seq[333] is an assay in which proteins like histones and transcription factors are covalently crosslinked to their genomic DNA substrates, providing a snapshot of histone interactions within a given cell. Chromatin is then isolated and fragmented, and complexes of bound protein-DNA are captured via antibodies specific to the histone of interest. The cross-linking step is then reversed, and high-throughput sequencing is performed on the DNA fragments.
Figure 1.2: Figure from Anshul Kundaje. Chromatin profiling assays identify repressed genes, active genes, and regulatory control elements within the genome.
1.3 Genomic variation in non-coding regions of the genome can disrupt regulation of gene expression.
Genomic variation refers to sequence differences in DNA among individuals. Variation in genomic sequence can arise from several sources, including random mutational events, crossing over and recombination of chromosomes during meiosis, and the independent assortment of alleles for separate traits. Genetic variation can range from single nucleotide polymorphisms to chromosomal rearrangements of many megabases. Here we focus primarily on single nucleotide polymorphisms (SNPs) in the genome, as these are the most common type of variation and have been estimated to account for approximately 90% of all sequence variants[155].
Single nucleotide polymorphisms (SNPs) are differences in alleles at a single base pair within the DNA of a population of organisms. According to dbSNP[420], 324 million variants have been found in human genomes. On average, SNPs occur every 100 to 3000 bases in the human genome, and, according to the HapMap Project[462], it is estimated that between three and five percent of SNPs have functional effects.
Ninety-five percent of common variants in the genome are non-coding[448, 301]. These variants do not fall within the exon regions of genes, and hence do not disrupt gene and protein expression directly by altering amino acid sequences. Instead, many non-coding variants affect gene regulation by disrupting regulatory elements within the DNA.
1.4 Genomewide association studies help to decipher genotype-phenotype associations
Genomewide association studies (GWAS)[458] identify statistical associations between genomic variants, such as single nucleotide polymorphisms, and phenotypic traits. The GWAS approach has revolutionized the field, providing associations for a number of disease phenotypes and traits with common genomic variation. Many of these have been curated in databases such as the GWAS Catalog[80], MSigDB[271], and others.
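As a minimal sketch of what a single-variant association test can look like, the following Python example runs a Pearson chi-square test on a 2x2 table of allele counts in cases and controls. The function name, the allele counts, and the choice of an allelic chi-square test are illustrative assumptions rather than the method used in the studies cited here; real GWAS pipelines typically fit regression models with covariates. The 5e-8 threshold is the conventional genome-wide significance cutoff, which also appears in the caption of Figure 3.3.

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

GENOME_WIDE_THRESHOLD = 5e-8  # conventional cutoff, not specific to this thesis

# Hypothetical counts: 2000 cases and 2000 controls, i.e. 4000 alleles per group.
chi2, p = allelic_chi_square(case_alt=1300, case_ref=2700, ctrl_alt=1000, ctrl_ref=3000)
print(f"chi2={chi2:.2f}, p={p:.3g}, significant={p <= GENOME_WIDE_THRESHOLD}")
```

Because such a test is repeated for every variant in the genome, the significance cutoff must be stringent, which is one reason large cohorts are needed.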
Although GWAS is a powerful tool for establishing associations between genetic variation and phenotypes, it has several limitations. Most importantly, GWAS cannot be used to establish causality due to challenges posed by linkage disequilibrium (LD). Linkage disequilibrium refers to the nonrandom association of alleles at different loci[431]. Several factors drive linkage disequilibrium, including genetic recombination, mutation rates, genetic drift, population structure, and genetic linkage, whereby DNA sequences in close proximity on a chromosome are inherited together during meiosis. LD poses a challenge for identifying genotype-phenotype associations because a variant with no functional effect on a phenotype that lies close to a functional variant will nevertheless show a significant statistical association with the phenotype.
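To make the notion of nonrandom association concrete, the sketch below computes the squared correlation r² between two biallelic loci from phased haplotypes, a commonly used LD measure. The function name, the 0/1 allele coding, and the toy haplotype lists are assumptions made for illustration only.

```python
def ld_r_squared(haplotypes):
    """Squared correlation between two biallelic loci.

    haplotypes: list of (allele_at_locus_a, allele_at_locus_b) pairs,
    with alleles coded as 0 (reference) or 1 (alternate).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    # D is the deviation of the observed haplotype frequency from independence.
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denominator

# Toy data: alleles always travel together (r^2 close to 1).
linked = [(1, 1)] * 30 + [(0, 0)] * 70
# Toy data: alleles combine independently (r^2 of 0).
unlinked = [(1, 1)] * 25 + [(1, 0)] * 25 + [(0, 1)] * 25 + [(0, 0)] * 25

print(ld_r_squared(linked), ld_r_squared(unlinked))
```

When r² between a functional variant and a neighboring variant is high, an association test cannot distinguish the two, which is the fine-mapping problem addressed later in this thesis.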
For a GWAS to be effective, it should be conducted in a sufficiently large population to provide statistical power for association testing. The needed sample size depends on a number of factors, such as the effect size of the variation on the phenotype, linkage of variants at the loci of interest, and the prevalence of a trait within a population. In general, however, studies with fewer than 2000 cases and 2000 controls will have low power, barring special circumstances[439].
1.5 Measuring complex phenotypes via mobile devices and wearables
Fortunately, the proliferation of mobile devices and wearables such as smartphones and smart watches provides new approaches to rapidly gather phenotype data on large cohorts and facilitate GWAS analysis. Commercially available wearable devices have become increasingly popular among consumers: 350 million wearable devices were sold worldwide in 2018[461]. Wearable devices allow researchers to gather data on physical activity, sleep, and other complex behavioral traits in a real-world setting over long periods of time, enabling more rigorous GWAS analysis of these traits. Traditionally, GWAS of behavioral traits have largely relied on self-reported data that are prone to measurement error and thus have limited statistical power. Self-reported phenotypes are also biased, measuring individuals’ perception of their behavior rather than the behavior itself. Wearables and mobile devices utilize onboard sensors such as tri-axial accelerometers and gyroscopes to provide high-resolution measurements of true behaviors[168]. Consequently, within the last five years a number of GWAS have been conducted on features derived from wearable data, examining traits such as sleep[212], physical activity[130], and others.
1.6 Population-scale biobanks
The wearable technology revolution has been accompanied by the curation and release of population-scale biobanks, such as the UK Biobank[452], Biobank Japan[325], and the Million Veteran Program[164]. These biobanks collect multi-omics data, as well as lifestyle data and electronic health records, on hundreds of thousands to millions of individuals, providing a rich data source for highly-powered GWAS analysis[130].
1.7 Interpretable deep learning models serve as a useful tool for functional SNP fine-mapping
The emergence of high-throughput assays for profiling epigenetic signatures, wearable technologies for precise phenotype measurements, and population-scale biobanks for high-powered GWAS has been accompanied by the development of novel statistical and machine learning methods to integrate and interpret these datasets. Deep learning models, particularly convolutional neural networks (CNNs), are able to effectively predict chromatin accessibility by learning motif patterns in underlying sequence data. Stegle et al. pioneered the application of such neural network algorithms to prediction of variant function[27], summarized in Figure 1.3. Unlike GWAS, CNNs do not require large sample sizes to train, as they learn patterns of transcription factor binding across different parts of the genome, rather than at a single position in the genome across many individuals (Figure 1.3A). Convolutional kernels function similarly to position weight matrices: taking a convolution product of the kernels, which are randomly initialized and then trained, with the underlying sequence identifies motif patterns within the sequence (Figure 1.3B). By stacking convolutional layers together, higher layers are able to learn increasingly complex patterns of motifs, such as motif grammars of homodimers and heterodimers. By learning the association of these motif grammars with chromatin accessibility, CNNs are able to predict presence or absence of peaks in labels derived from assays such as ATAC-seq and DNASE-seq.
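The analogy between a convolutional kernel and a position weight matrix can be illustrated with a short sketch. The example below one-hot encodes a DNA sequence and slides a single hand-set kernel along it; the window with the highest score marks the best motif match. The kernel values, the motif "TGAC", and the function names are hypothetical and chosen only for illustration. In a trained CNN the kernel weights would be learned from data rather than set by hand.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows ordered A, C, G, T."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def scan(seq, kernel):
    """Slide a (width x 4) kernel along the sequence; return one score per window."""
    x = one_hot(seq)
    width = len(kernel)
    scores = []
    for start in range(len(seq) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += x[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores

# Hypothetical kernel: +0.75 for each base matching "TGAC", -0.25 for each mismatch.
kernel = [[value - 0.25 for value in row] for row in one_hot("TGAC")]

scores = scan("AATGACTT", kernel)
best_start = max(range(len(scores)), key=scores.__getitem__)
print(best_start, scores[best_start])
```

Stacking further layers on top of such per-window scores is what allows a network to represent combinations of motifs rather than single motifs.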
Having learned these associations, the models can then be queried to predict SNP effects on chromatin accessibility. For example, the reference and alternate alleles of a GWAS candidate SNP, along with their sequence context, can be provided to a trained CNN model, and changes in prediction values can be quantified (Figure 1.3C). If a prediction effect is observed, algorithms such as DeepSHAP[284] and DeepLIFT[424] can be applied to compute base-level importance scores, which quantify how much the SNP and its flanking regions contribute to the model’s predictions.
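The allele-scoring procedure can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a trained CNN, and the exhaustive substitution scan shown is plain in silico mutagenesis (in the spirit of the mutation map in Figure 1.3E), a simpler technique than DeepLIFT or DeepSHAP, which are not implemented here.

```python
BASES = "ACGT"
MOTIF = "TGAC"  # hypothetical motif standing in for what a trained CNN learned

def toy_model(seq):
    """Stand-in for a trained model: best fractional motif match over all windows."""
    best = 0
    for start in range(len(seq) - len(MOTIF) + 1):
        window = seq[start:start + len(MOTIF)]
        matches = sum(1 for a, b in zip(window, MOTIF) if a == b)
        best = max(best, matches)
    return best / len(MOTIF)  # scaled to [0, 1]

def snp_effect(context, position, ref, alt, model=toy_model):
    """Prediction difference (alt minus ref) for one SNP in its sequence context."""
    if context[position] != ref:
        raise ValueError("reference allele does not match the sequence context")
    alt_seq = context[:position] + alt + context[position + 1:]
    return model(alt_seq) - model(context)

def mutation_map(context, model=toy_model):
    """In silico mutagenesis: effect of every substitution at every position."""
    effects = {}
    for pos, ref in enumerate(context):
        for alt in BASES:
            if alt != ref:
                effects[(pos, alt)] = snp_effect(context, pos, ref, alt, model)
    return effects

context = "AATGACGT"
# A G>C substitution inside the motif lowers the toy prediction.
effect = snp_effect(context, position=3, ref="G", alt="C")
effects = mutation_map(context)
```

In this toy setting, substitutions inside the motif reduce the prediction while substitutions in the flanks leave it unchanged, which mirrors how a per-base effect map highlights the bases a model relies on.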
1.8 Summary of chapters
In this thesis, I present case studies of machine learning approaches to analyzing genomic variation effects at the molecular, cellular, and phenotypic levels.
Starting at the complex phenotype level, Chapter 2 details the MyHeart Counts study as an example of the use of wearable and mobile health technologies to measure and characterize complex phenotypes. A randomized controlled trial of physical activity is presented to illustrate the feasibility of using interventions delivered via mobile devices/wearables to change complex behavioral traits. Chapter 3 segues into the link between complex phenotypes like physical activity and genetics by presenting a GWAS of physical activity performed in the UK Biobank dataset.
I next discuss machine learning approaches for variant fine-mapping, examining large-scale GWAS studies of colorectal cancer and Alzheimer’s disease. I discuss the application of support vector machines and convolutional neural networks to fine-mapping of these GWAS variants and determining their functional effects on the non-coding genome. Chapter 4 discusses the refinement of convolutional neural network models to base-pair level resolution, enabling transcription factor binding footprinting and identification of cooperative effects across different transcription factors.
Finally, Chapter 5 provides several case studies of computational approaches to link molecular phenotypes with cellular phenotypes. We first examine the epigenetic role of the Sp1 transcription factor in HDAC inhibition and resulting effects on cellular morphology in breast cancer tumors. We next examine the epigenetic effects of DMSO treatment on PI3K and associated genes, with downstream effects on efficiency of iPSC differentiation into neurons. We apply machine learning approaches to learn the cis-regulatory principles of ADAR-based RNA editing. We examine the role of the AP-1 transcription factor in induced pluripotency and cellular reprogramming. Finally, we examine the epigenetic mechanisms involved in healing in musculoskeletal injury, and how these mechanisms are mediated by age.

Figure 1.3: Figure reproduced from Stegle et al.[27]. (A) DNA sequence and the molecular response variable along the genome for three individuals. Conventional approaches in regulatory genomics consider variations between individuals, whereas deep learning allows exploiting intra-individual variations by tiling the genome into sequence DNA windows centred on individual traits, resulting in large training data sets from a single sample. (B) One-dimensional convolutional neural network for predicting a molecular trait from the raw DNA sequence in a window. Filters of the first convolutional layer (example shown on the edge) scan for motifs in the input sequence. Subsequent pooling reduces the input dimension, and additional convolutional layers can model interactions between motifs in the previous layer. (C) Response variable predicted by the neural network shown in (B) for a wild-type and mutant sequence is used as input to an additional neural network that predicts a variant score and allows to discriminate normal from deleterious variants. (D) Visualization of a convolutional filter by aligning genetic sequences that maximally activate the filter and creating a sequence motif. (E) Mutation map of a sequence window. Rows correspond to the four possible base pair substitutions, columns to sequence positions. The predicted impact of any sequence change is colour-coded. Letters on top denote the wild-type sequence with the height of each nucleotide denoting the maximum effect across mutations.
Chapter 2
Machine learning approaches to characterizing complex phenotypes
2.1 MyHeart Counts: A cardiovascular mobile health study
2.1.1 Abstract
Importance Studies have established the importance of physical activity and fitness, yet limited data exist on the associations between objective, real-world physical activity patterns, fitness, sleep, and cardiovascular health.
Objectives To assess the feasibility of obtaining measures of physical activity, fitness, and sleep from smartphones and to gain insights into activity patterns associated with life satisfaction and self-reported disease.
Design, Setting, and Participants The MyHeart Counts smartphone app was made available in March 2015, and prospective participants downloaded the free app between March and October 2015. In this smartphone-based study of cardiovascular health, participants recorded physical activity, filled out health questionnaires, and completed a 6-minute walk test. The app was available to download within the United States.
Main Outcomes and Measures The feasibility of consent and data collection entirely on a smartphone, the use of machine learning to cluster participants, and the associations between activity patterns, life satisfaction, and self-reported disease.
Results From the launch to the time of the data freeze for this study (March to October 2015), the number of individuals (self-selected) who consented to participate was 48,968, representing all 50 states and the District of Columbia. Their median age was 36 years (interquartile range, 27-50 years), and 82.2% of participants with a reported sex were male (30,338 male, 6,556 female, 10 other, and 3115 unknown). In total, 40,017 (81.7% of those who consented) uploaded data. Among those who consented, 20,345 individuals (41.5%) completed 4 of the 7 days of motion data collection, and 4552 individuals (9.3%) completed all 7 days. Among those who consented, 40,017 (81.7%) filled out some portion of the questionnaires, and 4990 (10.2%) completed the 6-minute walk test, made available only at the end of 7 days. The Heart Age Questionnaire, also available after 7 days, required entering lipid values and age 40 to 79 years (among 17,245 individuals, 43.1% of participants). Consequently, 1334 (2.7%) of those who consented completed all fields needed to compute heart age and a 10-year risk score. Physical activity was detected for a mean (SD) of 14.5% (8.0%) of individuals’ total recorded time. Physical activity patterns were identified by cluster analysis. A pattern of lower overall activity but more frequent transitions between active and inactive states was associated with self-reported cardiovascular disease equivalent to that seen with a pattern of higher overall activity with fewer transitions. Individuals’ perception of their activity and risk bore little relation to sensor-estimated activity or calculated cardiovascular risk.
Conclusions and Relevance A smartphone-based study of cardiovascular health is feasible, and improvements in participant diversity and engagement will maximize yield from consented participants. Large-scale, real-world assessment of physical activity, fitness, and sleep using mobile devices may be a useful addition to future population health studies.
2.1.2 Introduction
Investigators have established the importance of physical activity, fitness, sleep and diet in the maintenance of cardiovascular health. Low fitness is a key risk factor[324, 229] while insufficient physical activity accounts for 5.3 million deaths per year and approximately 6% of the burden of coronary heart disease[320, 60, 228]. Decrements in sleep quality through sleep fragmentation and obstructive sleep apnea also affect overall mortality[434].
Most of these observations, particularly with respect to activity, have been achieved through individual efforts of research coordinators and have required in-person consent, interviews, exercise or sleep studies, and follow up[69, 231]. Such methods rely on accurate post hoc participant recall. Survey-based physical activity estimation has been shown to systematically overestimate measured activity[71, 167].
Mobile technology, in particular advances in smartphone sensors, offers a new approach to the study of cardiovascular health and fitness[230, 297, 245, 82, 344]. Direct measurement of activity through always-on, low-power motion chips provides a promising alternative to questionnaire-based approaches, as recognized by large-scale projects such as the UK Biobank[450] and the US Precision Medicine Initiative. Widespread ownership of smartphones in the developing and developed world could thus transform global clinical research.
In 2015, Apple Inc. (Cupertino, California) introduced an open-source framework (ResearchKit) to facilitate clinical research and standardization of data collection[204]. Here, we report the first findings from MyHeart Counts, one of the launch applications for the framework. MyHeart Counts is a cardiovascular health study administered entirely via mobile phone, incorporating direct sensor-based measurements of physical activity and fitness, as well as questionnaire assessment of sleep, lifestyle factors, risk perception, and overall wellbeing.
Our objectives in this study were two-fold: 1) to establish the feasibility of mobile consent and real-time gathering of sensor and survey data from a large ambulatory population, and 2) to investigate the relationships between patterns of physical activity, fitness, and self-reported well-being and medical history.
2.1.3 Methods and Statistics
Data acquisition
This study was approved by the Stanford University Institutional Review Board (e-protocol number 31409). Prospective participants downloaded the free application from the Apple app store between March and October 2015. The consent process was developed specifically for the smartphone platform and incorporates unambiguous language in a “card” format optimized for reading and understanding on a phone (Figures A.1-A.2). Following consent, a secondary screen seeks specific permission for sharing of each category of phone data with researchers. At any time, the participant can withdraw a specific category of data, or their entire participation, directly from their phone.
Consented participants were able to contribute data to a range of study components, including health surveys on diet, well-being, risk perception, work-related and leisure-time physical activity, sleep and cardiovascular health (Figure 2.1, Figures A.3-A.5). Subjects also self-reported demographic information, such as age, sex, and ethnicity. For reporting of ethnicity, they were given the opportunity to select multiple options (defined by the investigators) or none at all. Over the course of the initial 7-day monitoring period, the participant’s motion was recorded through the motion co-processor chip of the phone. The low-power motion chip integrates signals including tri-axial accelerometer, gyroscope, compass, and barometer to estimate distance as well as the presence and modality of movement such as stationary, walking, running, cycling, or driving. On day 7, participants were requested to complete a self-administered 6-minute walk fitness test that utilizes GPS-calibrated pedometer functionality built into the motion co-processor chip. Reminders to complete surveys were issued daily during the initial 7-day monitoring period.
Figure 2.1: Data on downloads were derived from iTunes Connect (http://itunesconnect.apple.com), and data on participant consent numbers were derived from Sage Synapse (http://www.synapse.org). Study components are color coded, and matched colors are used to indicate correspondence between components in A and B.

Statistical Analysis
K-means and hierarchical clustering were applied to define groups with cohesive patterns of physical activity from the motion tracking data. Features for clustering included percent of time spent stationary, percent of time spent active, number of state changes between active and stationary, and the fraction of time spent on each activity (driving, stationary, walking, running, cycling, unknown) (Figure 2.2A, Figure A.1A, Figure A.8). Categorical comparison among multiple groups was performed using the chi-square test. We tested for associations with life satisfaction using linear regression models with age and sex included as covariates. For reported presence of disease, we tested association using logistic models with age and sex as covariates. For both outcomes, stepwise selection of significant univariate predictors was performed to build a multivariate model. When analyzing geographic differences in life satisfaction and activity, we developed a mixed effects model with three-digit zip code prefix modeled as a random effect and US Census region modeled as a fixed effect. Detailed information on statistical analyses is contained in the Supplementary Methods section.
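The clustering features named in the Statistical Analysis paragraph can be sketched in a few lines. This is a minimal illustration, not the study's pipeline: it assumes each participant's motion data has already been reduced to a list of equal-length epochs with one activity label each, it divides the state-change count by the number of epochs so that all features share a comparable scale (an assumption made here), and it uses a bare-bones Lloyd's k-means with caller-supplied starting centers.

```python
ACTIVE = {"walking", "running", "cycling"}

def activity_features(epochs):
    """Summarize per-epoch activity labels for one participant.

    Returns [fraction stationary, fraction active, state changes per epoch].
    """
    if not epochs:
        raise ValueError("no epochs recorded")
    n = len(epochs)
    stationary = sum(1 for s in epochs if s == "stationary") / n
    active = sum(1 for s in epochs if s in ACTIVE) / n
    # Count transitions between active and stationary, skipping epochs
    # labeled driving or unknown.
    collapsed = ["active" if s in ACTIVE else s
                 for s in epochs if s in ACTIVE or s == "stationary"]
    changes = sum(1 for a, b in zip(collapsed, collapsed[1:]) if a != b)
    return [stationary, active, changes / n]

def assign(points, centers):
    """Label each point with the index of its nearest center (squared Euclidean)."""
    labels = []
    for p in points:
        d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(d.index(min(d)))
    return labels

def kmeans(points, centers, iterations=10):
    """Plain Lloyd's algorithm; empty clusters keep their previous center."""
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)
        centers = new_centers
    return assign(points, centers), centers

# Hypothetical participants: two mostly stationary, two frequent state changers.
participants = [
    ["stationary"] * 8 + ["walking"] * 2,
    ["stationary", "walking"] * 5,
    ["stationary"] * 9 + ["running"],
    ["walking", "stationary"] * 5,
]
points = [activity_features(p) for p in participants]
labels, centers = kmeans(points, centers=[points[0], points[1]])
```

In practice the features would typically be standardized before clustering and the number of clusters chosen by a separate criterion; neither step is shown here.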
2.1.4 Results
Participation and demographics
From launch to the time of the data freeze for this study (March-October 2015), the number of individuals who consented to participate was 48,968 (Table A.1, Figure 2.1). Participants were predominantly male (82.2%), with a median self-reported age of 36±15 years (interquartile range = [27, 50]). Participants were from all 50 states of the United States and the District of Columbia, with the most participants from California (n=4423) and the fewest participants from North Dakota (n=35). Of 20,323 respondents to medical history survey questions, 1,827 reported having a disease while 4,649 reported being on medications (Table 2.1). Participation dropped markedly during the 7-day period, and for some measures data were contributed by only several thousand individuals.
Quantity of physical activity
Of the 20,345 individuals whose phones recorded physical activity, the majority (68.3%) were estimated by their phones to be stationary for over 50% of the time for which data were recorded, spending 14.5% of their time active (10.9% of time walking, and 3.3% of time on vigorous activity such as running) (Table 2.2). Males’ phones on average reported 3.8% more time active than females’ (p=3.39e-8). A linear regression of sensor-measured active time onto age yields p-value=0.58, adjusted R2=-3.98e-5. Linear regression of self-reported active time onto age yields p-value=5.38e-3, with coefficient of interaction between age and activity equal to -0.49 (30 seconds). This indicates no strong associations between active time and age.
Patterns of physical activity
K-means clusters of physical activity data are shown in Figure 2.2A. Clusters of activity levels were found to be significantly correlated with self-reported cardiovascular health status, as determined by a χ2 test for presence/absence of chest pain, diabetes, joint pain, and heart disease (Figure 2.2B, Table 2.2). Individuals in the least active cluster were found to have an elevated risk for all conditions listed above, with chi-squared standardized residuals ranging from 2.5 for hypertension to 6.3 for heart condition. Conversely, individuals in the “weekend warriors” cluster were found to be at a significantly lower risk (standardized residuals ≤ -2) for chest pain, diabetes, joint pain, and heart condition (Figure 2.2B, Table 2.2). Weekend warriors were defined as individuals who were more active during the weekend than during the weekdays. These individuals (Figure 2.2A) spent approximately 25% more time in the “active” state during the weekend.
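The chi-squared statistic, its standardized residuals, and Cramér's V can be computed from a cluster-by-condition contingency table as sketched below. The counts are invented for illustration. The text does not state which residual definition was used, so the adjusted standardized residual is an assumption here.

```python
import math

def chi_square_summary(table):
    """Chi-squared statistic, adjusted standardized residuals, and Cramér's V.

    `table` is a list of rows of observed counts (e.g. clusters x condition).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected counts under independence of rows and columns.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    # Adjusted standardized residuals (assumed definition; see lead-in).
    residuals = [
        [(o - e) / math.sqrt(e * (1 - r / n) * (1 - c / n))
         for o, e, c in zip(obs_row, exp_row, col_totals)]
        for obs_row, exp_row, r in zip(table, expected, row_totals)
    ]
    k = min(len(row_totals), len(col_totals)) - 1
    cramers_v = math.sqrt(stat / (n * k))
    return stat, residuals, cramers_v

# Hypothetical 2 x 2 table: rows are two activity clusters,
# columns are (condition reported, condition not reported).
observed = [[10, 20],
            [20, 10]]
stat, residuals, cramers_v = chi_square_summary(observed)
```

A residual with absolute value above roughly 2 flags a cell that departs from independence more than expected by chance, which is how the per-cluster residuals in the text are read.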

Figure 2.2: A. Clusters of recorded physical activity based on proportion of time participants’ phones reported they were stationary over the course of 2 weekdays and 2 weekend days. Two dimensions of clustering are illustrated for clarity from the original four. N=20,345 subjects were included in the analysis. B. Probability of heart disease (p≤0.001, N=17,062, χ2=22.682, Cramér’s V=0.0121), joint pain (p=3.42e-2, N=17,062, χ2=34.161, V=0.0149), chest pain (p≤0.001, N=17,062, χ2=34.16, V=0.0149), and diabetes (p≤0.001, N=17,062, χ2=23.068, V=0.0122) for individuals in different activity clusters. N=17,062 subjects were included in the analysis. C. Difference in mean life satisfaction (p≤0.001, mean effect size = 0.383 points) between subjects in recorded physical activity clusters (active, weekend warriors, drivers, inactive).
The second analysis focused on the number of state changes from stationary to active and vice versa (Figure 2.1). Cluster analysis suggested that although state changers were less active overall than the weekend warriors, they experienced similarly better cardiovascular health status compared to those in inactive clusters.
Fitness
A total of 4990 subjects (10.2% of consented participants) completed the 6-minute walk test, with a mean step count of 693±127 steps and a mean distance walked of 455±520 meters (Table 2.1). Participants who completed the 6-minute walk test were slightly older than the general study population and had a higher ratio of men to women (median age = 42, mean age = 43.2; males:females = 5.6, compared to 4.6 for the entire cohort). Sensor recordings indicate that the 6-minute walk cohort was active during a mean of 15.1% of their total recorded time, compared to 14.5% for the full cohort.
Sleep
Each participant self-reported the number of hours slept each night (Table 2.2). Overall, participants report a mean of 7.8 hours of sleep per night, N=34,048 (69.5%). Females report an average of 0.3 hours more sleep than males (p≤2.2e-16, N(female)=6,556, N(male)=30,338).
We derived daily bedtimes for each participant based on the last time of movement recorded by the motion chip. We then compared the distributions of self-reported life satisfaction ratings (1-10 scale) for participants with the earliest bedtimes (earliest tertile) to the participants with the latest bedtimes (latest tertile) using median bedtimes for each participant (N=14,895, 30.4%). Individuals with two or fewer bedtimes or outliers (bedtimes before 7:30pm or after 3:30am) were excluded. Participants who retire earliest in the evening report an overall higher life satisfaction rating (mean: 7.48) than participants who stay awake latest (mean: 6.80), p≤0.001 (Figure 2.3B). Individuals who retire earliest tend to be older than those who retire latest (medians: 44 and 33 years old, respectively). A linear model adjusted for age and sex found median bedtime (in hours) to be a significant univariate predictor of life satisfaction (β=-0.16, 95% CI (-0.18, -0.14), p≤0.001, N=14,179).
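Because bedtimes straddle midnight, taking a median requires placing them on a scale that does not wrap. The sketch below shows one way to do this, measuring hours elapsed since noon. It is an illustration rather than the study's code: the function names are hypothetical, and it assumes that out-of-window nights are dropped first and that a participant then needs more than two remaining bedtimes, which is one possible reading of the exclusion rule above.

```python
from statistics import median

def hours_after_noon(hhmm):
    """Convert a 24-hour 'HH:MM' clock time to hours elapsed since noon."""
    hour, minute = map(int, hhmm.split(":"))
    return (hour - 12) % 24 + minute / 60

# 7:30pm and 3:30am expressed as hours after noon.
EARLIEST, LATEST = 7.5, 15.5

def median_bedtime(bedtimes, min_nights=3):
    """Median bedtime in hours after noon, or None if the participant is excluded."""
    shifted = [hours_after_noon(t) for t in bedtimes]
    kept = [t for t in shifted if EARLIEST <= t <= LATEST]
    if len(kept) < min_nights:
        return None
    return median(kept)

def tertile_groups(medians):
    """Split participants with a usable median into earliest and latest thirds."""
    usable = {pid: m for pid, m in medians.items() if m is not None}
    ordered = sorted(usable, key=usable.get)
    third = len(ordered) // 3
    early = ordered[:third]
    late = ordered[len(ordered) - third:]
    return {"early": early, "late": late}

# Hypothetical participants and bedtimes.
records = {
    "p1": ["22:00", "22:30", "21:30"],
    "p2": ["23:00", "00:30", "23:30"],
    "p3": ["01:00", "02:00", "01:30"],
    "p4": ["23:00", "05:00"],  # excluded: too few in-window bedtimes
}
medians = {pid: median_bedtime(times) for pid, times in records.items()}
groups = tertile_groups(medians)
```

Shifting the origin to noon keeps 11:30pm and 12:30am one hour apart, whereas a naive median on clock hours would treat them as 23 hours apart.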
Models of life satisfaction and self-reported disease
In addition to associations with health conditions, activity levels were also found to correlate with subjects’ life satisfaction (p-value=1.47e-14, φ=0.125) (Figure 2.2C). Subjects in the inactive cluster reported the lowest life satisfaction (mean = 6.82 on a scale of 1-10), while subjects in the most active cluster reported the highest level (mean = 7.48). Drivers and weekend warriors reported mean life satisfaction values of 7.14 and 7.36, respectively.
We tested the association of life satisfaction and self-reported disease status in our population with dietary, lifestyle, and other factors. Overall life satisfaction scores clustered around a mean of 7.12 on a 10-point scale. Since many lifestyle predictors are correlated, we derived a multivariate linear model using stepwise selection on all significant univariate predictors, including age and sex as covariates. We found that fruit intake, sugary drink intake, recorded activity, and minutes of self-reported vigorous activity remained as significant predictors of life satisfaction (Table A.3). For disease status, we used stepwise selection on the significant predictors to derive a multivariate logistic regression model, with age and sex as covariates, that showed family history, whole grain consumption, and job activity as significant predictors (Table A.3).
Geographic diversity
We analyzed the pattern of behavior across the United States (Figure 2.3A) with a mixed effects model containing zip codes as a random effect and US census region as a fixed effect. We found significant differences between US census regions in both the measured activity levels (n=14,406, ANOVA p≤0.001) and the reported life satisfaction (n=14,391, ANOVA p=0.001). The West had the highest average activity proportion, while the Midwest, South, and Northeast report lower recorded activity levels (Figure A.4). Based on 16 hours of non-sleeping time a day, individuals in the West had on average an additional hour of physical activity each week compared with those in the Northeast. The West also reported the highest life satisfaction and the Northeast reported the lowest life satisfaction. The 0.2 difference in life satisfaction is equivalent to 15% of the entire range (6.9 - 8.2) seen between developed countries in previous results.18
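The two conversions in the preceding paragraph follow from simple arithmetic, reproduced below as a check. The variable names are illustrative; the only inputs are the figures quoted in the text (16 waking hours per day, a 0.2-point gap, and a 6.9 to 8.2 between-country range).

```python
WAKING_HOURS_PER_DAY = 16
waking_hours_per_week = WAKING_HOURS_PER_DAY * 7  # hours awake in one week

def weekly_active_hours(active_fraction):
    """Hours of activity per week implied by a fraction of waking time spent active."""
    return active_fraction * waking_hours_per_week

# Difference in active proportion that corresponds to one extra hour per week.
one_hour_as_fraction = 1 / waking_hours_per_week

# Regional life satisfaction gap as a share of the between-country range.
life_satisfaction_gap = 0.2
between_country_range = 8.2 - 6.9
gap_share = life_satisfaction_gap / between_country_range
```

Under these figures, one additional hour per week corresponds to a difference of a little under one percentage point in the proportion of waking time spent active, and the 0.2-point gap is roughly 15% of the between-country range.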

Figure 2.3: A. Average proportion of time spent active per state (n=14,406). Time spent active is the sum of time walking, running, and cycling recorded by the mobile application. B. Participants’ satisfaction with life stratified by their sleep pattern (N=14,895). An individual with an “early” or “late” bedtime has an average bedtime in the earliest or latest 33% of the cohort, respectively. C. Aggregate comparison of calculated 10-year cardiovascular risk versus self-predicted 10-year risk on a scale of 1 (not at all) to 5 (extremely). D. Aggregate comparison of calculated lifetime cardiovascular risk versus self-predicted lifetime risk.
Perceived activity and actual activity
At baseline, participants were asked to rate how active they were on a scale of 1 to 6 in the Leisure-Time Activity Survey (Figure A.4A). In the Moderate or Vigorous Physical Activity questionnaire, participants were also asked to report the number of minutes of moderate and vigorous physical activity that they performed in a week. These values were compared to the total time participants spent in the “running,” “walking,” and “cycling” states, as determined by the motion tracker data. Owing to the large number of subjects in the study, the relationship between perceived and recorded activity levels was statistically significant (p≤0.001), but the correlation between the two was negligibly small (R2=5.3e-4).
Perceived risk and actual risk
A participant’s ten-year and lifetime risk of stroke and myocardial infarction were calculated according to the 2013 AHA/ACC ASCVD guidelines[369, 395]. Predicted risk calculations were compared with subjects’ self-reported perceptions of risk (Figure A.4B, Figure A.9)[449]. A Pearson correlation (R2) of 0.18 was observed between subjects’ perceived 10-year risk and the calculated 10-year risk (Figure 2.3C). Of the 1334 subjects who completed all questions on the heart age questionnaire necessary to compute a heart age and a 10-year risk score, 512 underestimated their 10-year risk (mean difference=6.0%), while 817 overestimated their 10-year risk (mean difference=1.2%). Similarly, subjects did poorly at predicting their lifetime risk: a Pearson correlation of 0.09 was observed between subjects’ perceived and calculated risk (Figure 2.3D). A total of 457 subjects overestimated their lifetime risk by a mean of 12.7%, while 501 subjects underestimated their risk by a mean of 12.0%, indicating that individuals predicted their personal risk with low accuracy.
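Since the paragraph above reports Pearson correlations between perceived and calculated risk, a short sketch of the computation may help distinguish the correlation coefficient r from its square. The data below are invented for illustration and the function name is hypothetical; perceived risk is taken on the 1-to-5 scale described in Figure 2.3 and calculated risk as a percentage.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical subjects: self-rated risk (1-5) and calculated 10-year risk (%).
perceived = [1, 2, 3, 4, 5]
calculated = [5, 3, 8, 4, 10]

r = pearson_r(perceived, calculated)
r_squared = r ** 2  # share of variance explained by a linear fit
```

Because r squared is never larger than the absolute value of r, it matters which of the two a reported value refers to when judging how weak an association is.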
2.1.5 Discussion
Seminal investigations established the importance of physical activity, fitness, diet and sleep for cardiovascular health[60, 43, 87]. Such studies were completed with time-consuming, in-person measurements with substantial reliance on participant recall. Mobile technology allows an alternative approach to such studies[167, 419, 61, 215], with major challenges and opportunities.
Large-scale data afford approaches to analysis and insights that are not available from smaller scale data[277]. Here we used an unsupervised clustering approach to define categories of individuals by their physical activity patterns. Such approaches[35] allow the data, rather than prior assumptions about its structure, to drive categorization. Despite decades of research, there is little certainty as to the optimal pattern of physical activity to recommend for health. Indeed, advice from national organizations has changed significantly over time[470]. While causality requires randomization, we report here correlative associations not just with overall activity but with a pattern of more frequent transition from inactive to active states. For example, our result that subjects who changed activity state frequently tended to be healthier aligns with prior findings that link prolonged periods of uninterrupted sedentary time with increased risk for metabolic syndrome and diabetes[146, 201, 43]. Such observations support the randomized assessment of interventions aimed at augmenting activity state transitions during daily living[201].
A major advantage of a smartphone-based study is that most people carry the phone with them, allowing not just passive registration of motion, but active assessment of changing psychological states such as life satisfaction and happiness. A major disadvantage is the inherent ascertainment bias. While such bias exists in all studies, for example, in the individuals who choose to contact a study coordinator or in the inclusion and exclusion criteria for a clinical trial, it is important to minimize this bias as much as possible. Of particular note, the bar for entry to this study is much lower than equivalent studies carried out using in-person visits. This has the demonstrated advantage that many people consent, but the notable disadvantage that those individuals are by definition less invested in the study and thus less likely to complete all portions. For some data points in this study, we have data for only several thousand individuals while several tens of thousands consented. We believe the low bar in fact represents an opportunity to engage this larger group who are interested enough to download the app and answer a few questions but not much more. Balancing engagement, data feedback, and study design remains an area for further research. We delayed the 6-minute walk test and heart age score tasks until completion of all other portions of the study, to minimize bias from this information, but that certainly contributed to the drop in participation in these tasks. An easy method to link lipid data directly from one’s electronic health record (EHR) would help, but even in the PINNACLE EHR-based cardiovascular registry, data to calculate 10-year risk score were available in ≤30% of patients.32 Future versions of MyHeart Counts will introduce more personalization and earlier participant feedback. Elements of gamification, exemplified by the recent Pokémon Go, could also be introduced to maximize engagement.
We found a significant disconnect between an individual’s perceived cardiovascular risk and their actual risk derived from the 2013 ASCVD Pooled Cohort Equations. These findings are in line with those reported by Reiner et al., who concluded that the actual presence of CVD risk factors in participants did not appear to alter their perception of risk compared to participants without CVD risk factors33. Similarly, Ko and Boo concluded that, among cardiovascular risk factors, dyslipidemia, obesity, smoking, and a family history of CVD did not affect self-perceived health[192]. Imes and Lewis concluded that, even when individuals are aware of their cardiovascular disease risk, the relationship between health-related behavior change and perceived risk was inconsistent[179]. For example, our results illustrate that self-reported minutes of moderate or vigorous physical activity and movement recorded by the phone do not agree, which suggests that subjects were poor at predicting their levels of physical activity[51]. Such a disconnect between perceived and actual levels of physical activity and cardiovascular risk highlights the potential utility of mobile phones as personalized informational tools to optimize healthy lifestyles. The MyHeart Counts application provides the user with feedback in the form of a “heart age” relative to ideal cardiovascular health status, one approach to personalizing and making more visceral the understanding of risk (Figures A.1, A.2). In addition, we include feedback in the form of a plot showing where each individual falls in relation to the overall study distribution for 6-minute walk distance. The natural extension of such findings is towards tailored physical activity and lifestyle recommendations[186], and indeed future versions of the application will introduce randomized studies of motivational strategies for improving activity, diet, and cardiovascular health measures.
Our study has several important additional limitations. The demographics of the currently enrolled population reflect those of typical iPhone users[7]; for example, young males are heavily overrepresented. We are currently testing engagement strategies that target other populations. Some individuals do not carry their phones with them at all times, so physical activity measurements are a lower bound for actual physical activity. While daily questions were used to try to capture activity lost in this way, a stronger approach comes in the form of increasing users’ adoption of wearable technology[108]. Furthermore, the motion trackers cannot distinguish the cause of periods of lack of motion. Additionally, it is likely that, as in most studies of physical activity, participants may be more active than usual during the first weeks of the study. Consequently, in a follow-up study we will track individuals for multiple weeks to quantify the impact of different types of coaching strategies on modification of subject behavior. Validation of 6-minute walk step count values reported by the iPhone (Figure A.7) suggests that the step count algorithm needs improvement to achieve sufficient accuracy for clinical use[61]. Finally, the 2013 AHA/ACC ASCVD risk calculator has limitations. Specifically, the 10-year risk score was implemented for ages 40-79 years and does not fully account for biogeographic ancestry and lifestyle factors.
In summary, we demonstrate: 1) feasibility of consenting and engaging a large population across the U.S. using only mobile phones; 2) that large scale data can be gathered in real time from mobile devices, stored securely, transferred, de-identified and shared securely including with participants; 3) that a data set for the 6-minute walk test larger than any previously collected could be generated in weeks; 4) that state transition patterns of activity, not just absolute activity, relate to reported presence of disease; and 5) that there is a poor relationship between perceived and recorded physical activity, as well as perceived and formally estimated risk. Importantly, we also show the major challenges and limitations of mobile health research, including the skewed age and gender of participants plus the rapid drop-off in engagement over time, with the resulting loss of data collection for several measures. Participant engagement needs to be optimized to maximize full participation of those who have expressed at least enough interest to download the app and consent to join the study, in order to realize the promise of this novel approach to population health research.
Demographic | Number of participants | % of responders (% of all participants)

Family history
Father or brother with heart attack or coronary artery disease before age 55 | 3890 | 17.98% (9.72%)
Mother or sister with heart attack or coronary artery disease before age 65 | 1600 | 7.39% (3.99%)
None | 16144 | 74.62% (40.34%)
No response | 18383 |

Medications
To treat and lower cholesterol | 2904 | 12.43% (7.25%)
To treat hypertension and lower blood pressure | 3385 | 14.49% (8.45%)
To treat diabetes/pre-diabetes and lower blood sugar | 698 | 2.99% (1.74%)
None | 16364 | 70.07% (40.89%)
No response | 16666 |

Heart disease
Heart attack/Myocardial infarction | 474 | 2.11% (1.18%)
Heart bypass surgery | 230 | 1.02% (0.57%)
Coronary blockage/stenosis | 370 | 1.65% (0.92%)
Coronary stent/angioplasty | 488 | 2.17% (1.22%)
Angina (heart chest pains) | 448 | 1.99% (1.12%)
High coronary calcium score | 106 | 0.47% (0.26%)
Heart failure or CHF | 163 | 0.73% (0.41%)
Atrial fibrillation (Afib) | 493 | 2.20% (1.23%)
Congenital heart defect | 413 | 1.84% (1.03%)
None | 19272 | 85.82% (48.16%)
No response | 17560 |

Vascular disease
Stroke | 158 | 0.74% (0.39%)
Transient ischemic attack (TIA) | 152 | 0.71% (0.38%)
Carotid artery blockage/stenosis | 235 | 1.09% (0.59%)
Carotid artery surgery or stent | 322 | 1.50% (0.80%)
Peripheral vascular disease (blockage/stenosis, surgery, or stent) | 254 | 1.18% (0.63%)
Abdominal aortic aneurysm | 77 | 0.36% (0.19%)
None | 20269 | 94.42% (50.65%)
No response | 18550 |

Table 2.1: Participant cardiovascular health diagnoses and family history. 20,323 participants provided responses to medical history questions.
2.1.6 Author Contributions
Ms Shcherbina and Dr Ashley had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analysis. Dr McConnell and Ms Shcherbina contributed equally to this work.
Study concept and design: McConnell, Pavlovic, Waggot, Rosenberger, Myers, Champagne, Landray, Yeung, Ashley.
Acquisition, analysis, or interpretation of data: McConnell, Shcherbina, Homburger, Goldfeder, Waggot, Cho, Haskell, Myers, Mignot, Landray, Tarassenko, Harrington.
Drafting of the manuscript: McConnell, Shcherbina, Homburger, Goldfeder, Waggot, Myers, Ashley.
Critical revision of the manuscript for important intellectual content: McConnell, Pavlovic, Homburger, Goldfeder, Waggot, Cho, Rosenberger, Haskell, Myers, Champagne, Mignot, Landray, Tarassenko, Harrington, Yeung, Ashley.
Statistical analysis: Shcherbina, Homburger, Goldfeder, Waggot, Myers.
Administrative, technical, or material support: McConnell, Pavlovic, Myers, Harrington, Yeung, Ashley.
Study supervision: McConnell, Waggot, Myers, Yeung, Ashley.
Values are mean (SD). Columns: Demographic | Self-reported Sleep Per Night (hours) (N=34048) | Self-reported Activity Per Week (minutes) (N=31749) | Sensor-measured Time Active (%) (N=18683) | Sensor-measured Time Walking (%) (N=18683) | Sensor-measured Time Vigor. Act (%) (N=18683) | Sensor-measured Mins Active (N=18683) | Sensor-measured Mins Walking (N=18683) | Sensor-measured Mins Vigor. Act (N=18683) | Sensor-measured 6 Minute Walk (Steps) (N=4919) | Sensor-measured 6 Minute Walk Distance (Meters) (N=1268)
Overall | 7.8 (1.16) | 207 (227) | 14.46 (6.87) | 10.91 (5.36) | 3.26 (2.89) | 969 (460) | 731 (359) | 218 (193) | 693 (127) | 455 (520)
Sex
Male (N=30338) | 7.7 (1.14) | 213 (231) | 14.77 (6.87) | 11.24 (5.38) | 3.53 (2.97) | 990 (460) | 753 (360) | 236 (199) | 695 (120) | 452.8 (521)
Female (N=6556) | 8.0 (1.17) | 184 (205) | 11.01 (6.15) | 9.15 (5.05) | 1.86 (2.07) | 737 (412) | 613 (338) | 124 (138) | 688 (148) | 481.2 (521)
Age (years)
30 (N=12181) | 7.9 (1.25) | 212 (240) | 13.1 (17.5) | 11.02 (5.39) | 3.03 (2.74) | 871 (1139) | 738 (361) | 201 (183) | 682 (125) | 427.3 (497)
30-40 (N=9024) | 7.8 (1.15) | 203 (222) | 14.7 (19.5) | 11.02 (5.36) | 3.34 (2.80) | 985 (1307) | 737 (359) | 227 (187) | 684 (137) | 439.8 (517)
40-49 (N=6328) | 7.7 (1.05) | 197 (210) | 15.1 (20.4) | 10.69 (5.37) | 3.35 (3.06) | 1005 (1367) | 716 (359) | 224 (205) | 701 (121) | 464.3 (538)
50-59 (N=7068) | 7.7 (1.06) | 206 (219) | 13.3 (18.9) | 11.23 (5.34) | 3.53 (2.96) | 891 (1266) | 752 (357) | 236 (198) | 703 (114) | 448.2 (505)
60-69 (N=1684) | 7.6 (1.04) | 229 (249) | 20.4 (25.5) | 9.91 (5.47) | 3.35 (3.42) | 1367 (1709) | 664 (366) | 224 (229) | 716 (110) | 494.1 (549)
70 (N=519) | 7.6 (0.99) | 233 (215) | 27.3 (29.4) | 9.26 (5.02) | 3.75 (4.06) | 1829 (1970) | 620 (336) | 251 (272) | 677 (121) | 558.1 (548)
*These numbers are self-reported averages
Table 2.2: Exercise activity and sleep information was collected through questionnaires (N = 34,282), time active was collected via motion tracker (N = 20,345). 4,990 individuals participated in the 6 minute walk test. Standard deviations are reported in parentheses after mean values.
2.2 The effect of digital physical activity interventions on daily step count: a randomised controlled crossover substudy of the MyHeart Counts Cardiovascular Health Study
2.2.1 Abstract
Smartphone applications may enable interventions to increase physical activity, but evidence from randomized trials is limited. The MyHeart Counts Cardiovascular Health study (MHC) is a smartphone-based longitudinal research study aimed at elucidating the determinants of cardiovascular health. Among MHC participants, we performed a cross-over randomized trial to investigate the response to four different physical activity coaching interventions on the primary outcome of change in daily step count. Secondary outcomes included self-reported happiness, 6-minute walk distance, and sleep quality.
The trial was completed entirely using personal smartphones: participants were digitally consented, the interventions were delivered via the device, and measurements of the primary and secondary outcomes were collected from smartphone sensors. Participants were enrolled from December 12, 2016 to June 6, 2018 and followed for up to 5 weeks. Participants completed the trial through the MyHeart Counts phone app in a free-living setting. All adults over 18 years of age with access to a smartphone (Apple iPhone, version 5S or later) were eligible to participate.
After one week of baseline measurements, participants (n=2783) were randomized to a sequence of four, week-long interventions delivered in random order. Interventions consisted of either daily prompts to complete 10,000 steps (completed by n=853); hourly prompts to stand following a full hour of sitting (completed by n=879); instructions to read the guidelines from the American Heart Association website (completed by n=868); and e-coaching based upon the individual’s personal activity patterns from the baseline week of data collection (completed by n=896).
2783 participants consented to enroll in the coaching study, of whom 1075 completed the baseline week of data collection and at least one of the four interventions. 493 individuals completed the full set of assigned interventions. All four interventions were found effective at modestly increasing daily step count by a mean of 266±75 steps from a baseline mean of 2958±69 steps (p=0.003). Intervention-specific step increases were: 319±74 steps (p≤0.001) for the prompt to read the AHA Website, 267±74 steps (p≤0.001) for the hourly stand prompt, 254±73 steps (p≤0.001) for the participant-specific prompts, and 226±75 steps (p≤0.01 versus baseline) for the prompt to complete the daily 10,000 step goal. No significant effect of the interventions on any prespecified secondary outcome was observed.
The results indicate that four smartphone-based physical activity coaching interventions significantly increased daily physical activity in a short-term, randomized, crossover trial completed entirely using personal smartphones.
2.2.2 Introduction
Physically active individuals are at lower risk for many conditions, including cardiovascular disease[323], metabolic syndrome[409], and depression[430] compared to their sedentary counterparts. American adults average 4700 daily steps[211],[19],[287], and most do not meet the US recommendation of 150 minutes of exercise per week[5].
There has been recent interest in evaluating the relationship between daily step count and health outcomes[472]. Interventions that have been shown to effectively increase step count have been associated with improved health outcomes, including reduced incidence of hypertension, improved cardiovascular health biomarkers, and reduced body mass index (BMI)[135]. There is an opportunity to develop interventions that increase daily steps and that may improve subsequent health outcomes.
Smartphones and smartwatches, such as the Apple iPhone (Apple, Cupertino, CA), offer a potential platform to achieve this goal[342, 302]. To date, most studies of mobile health monitoring have used wearables and smartphones to track physical activity and provided feedback to the user and/or clinician[406]. No entirely digital randomized studies (i.e., without some level of person-to-person contact) have examined the effect of wearable and smartphone-based intervention on physical activity.
The My Heart Counts (MHC) smartphone app[445] provides a platform to conduct clinical research through smartphones[303]. It allows researchers to obtain informed consent for study participation in a fully automated fashion[170] and enables the monitoring of an individual’s physical activity via the integrated HealthKit platform[41]. The MHC app also surveys self-reported happiness, physical well-being, and sleep quality. Over 50,000 individuals have used the MHC app. Herein, we present results of a cross-over randomized trial among a subset of those participants and examine the app’s potential role as a platform to deliver randomized interventions.
2.2.3 Methods
Study design and participants
The MyHeart Counts app version 2.0 was used to conduct the RCT (https://itunes.apple.com/us/app/myheart-counts/id972189947?mt=8). All smartphone (Apple iPhone, version ≥5S, iOS version ≥9) users over 18 years of age, able to read and understand English were eligible to participate in the study. The period of enrollment was from December 2016 through July 2018 (application versions 2.0, 2.0.1, 2.0.2, 2.0.3). On October 11th, 2017, shortly after version 2.0.2 was released, previously registered users were notified by e-mail of the option to complete the RCT with this new version of the app.
Once a user has downloaded the MHC app from the Apple App Store, they are guided through an e-consent process to enroll in the study. Participants were given the option to share their data only with Stanford (“narrow sharing”) or to share their data more broadly with qualified researchers worldwide (“broad sharing”). They had to make an active choice to complete the consent process, as no default choice was presented.
The flow of users through the study is illustrated in 2.4. After completing the initial consent protocol, users underwent a week of baseline interaction with the app. During this time, the HealthKit toolkit[209],[1] gathered information about daily step count, distance walked, time spent in bed, and time spent asleep. The core motion sensors of the smartphone were also used to keep track of daily minutes of walking, running, bicycling, resting, and driving. Users completed a daily questionnaire to track changing perceptions of happiness and mental well being[335],[125],[9].
At completion of the week of baseline monitoring, users were assigned to one of five clusters based on their level of physical activity during weekdays and the weekend. These clusters were generated from participant data (n=48,968) from the first iteration of the MHC study[303]. Participants who completed the baseline week of monitoring were assigned to clusters as follows: individuals who were active throughout the baseline week (n=94), individuals who were active on weekends and sedentary on weekdays (n=462), individuals who were active on weekdays and largely sedentary on weekends (n=610), individuals who were largely sedentary throughout the week (n=1598), and individuals who spent over 15% of their awake time driving (n=212). Upon opening the application for the first time after a week of monitoring, a pop-up with a second consent requested an active choice to participate in a four-week coaching study.
As participants progressed through the study, they received electronic badges for completing tasks such as onboarding, the baseline week and each of the four weeks of coaching. Electronic badges received are visible on the participant’s dashboard. The full list of badges available can be revealed by triggering an expanded view (B.5).
Randomization and masking
2783 users completed the baseline week of data and agreed to the coaching consent. Upon enrollment in the RCT, users were randomized to one of the 24 possible orderings (4! = 24 permutations) of the four interventions in a cross-over design via a random number generator built into the app. Each intervention was applied for a period of seven days.
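The assignment step can be sketched as follows. This is a minimal illustration and not the app's code: the helper name `assign_sequence` and the use of Python's `random` module are assumptions, since the text states only that a random number generator built into the app selected one ordering of the four interventions.

```python
import itertools
import random

# Intervention names as used in this section.
INTERVENTIONS = (
    "10K Daily Step Prompt",
    "Cluster-Specific Prompt",
    "Read AHA Website Prompt",
    "Hourly Stand Prompt",
)

# All 4! = 24 orderings of the four interventions.
SEQUENCES = list(itertools.permutations(INTERVENTIONS))


def assign_sequence(rng):
    """Pick one of the 24 orderings uniformly at random (hypothetical helper)."""
    return rng.choice(SEQUENCES)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so the example is repeatable
    for week, name in enumerate(assign_sequence(rng), start=1):
        print(f"intervention week {week}: {name}")
```

Enumerating the orderings up front makes it easy to check that every participant receives each intervention exactly once.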
Procedures
The four interventions were serially delivered to users as daily messages to their smartphones, as summarized below, with examples of messages specific to each of the four interventions illustrated in B.4.
• 10K Daily Step Prompt: If the user had not completed 5000 steps (50% of the default daily 10K step goal) by 3 pm local time, they received a message indicating how many additional steps were needed to meet the goal. If the user had completed more than 5000 steps by 3 pm local time, no message was sent.
• Cluster-Specific Prompt: Based on the activity cluster assigned to the user, a daily message from those listed in eTable 3 was sent as coaching tailored to the user’s individual activity pattern. The original set of 10 k-means clusters is illustrated in supplementary eFigure 6. These were derived from the first MyHeart Counts study on 48,000 individuals who completed a minimum of 4 days of motion tracking with the app. The findings from this study are detailed in https://jamanetwork.com/journals/jamacardiology/fullarticle/2592965. The K of 10 was chosen by trying different values of K and selecting the one that minimized the Bayesian information criterion (BIC). The 15% cutoff for “drivers” was obtained from the median feature values for the cluster with the highest proportion of time spent driving (cluster c in the figure below). In the RCT, the baseline week of Core Motion data was used to derive the 10 feature values for every individual, and Euclidean distance in the feature space was used to identify the nearest cluster for the individual. For the RCT, the 5 most distinct clusters were selected (distinct clusters were those whose centroids in 10-d space were furthest apart when computing pairwise centroid distances). 5 clusters rather than 10 were selected mostly as a matter of convenience: writing behavioral coaching prompts for 10 clusters, some with high similarity, was not feasible with the resources we had available.
• Read AHA Website Prompt: A daily message was sent to the user directing them to read the American Heart Association website (http://www.heart.org).
• Hourly Stand Prompt: If the user had been sitting for the last hour, they received a prompt directing them to stand up for 60 seconds. If the user had been active within the last 60 minutes, no prompt was provided.
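The nearest-cluster assignment described under the Cluster-Specific Prompt can be sketched as below. The centroid values, the toy three-feature space, and the names `CENTROIDS` and `nearest_cluster` are hypothetical. The study used 10 Core Motion features and centroids from the first MyHeart Counts cohort, which are not reproduced here.

```python
import math

# Hypothetical centroids in a toy 3-feature space: fraction of awake time
# active on weekdays, active on weekends, and driving. Not the study values.
CENTROIDS = {
    "active": (0.30, 0.30, 0.05),
    "weekend_active": (0.08, 0.28, 0.05),
    "weekday_active": (0.28, 0.08, 0.05),
    "sedentary": (0.06, 0.06, 0.05),
    "driver": (0.08, 0.08, 0.20),
}


def nearest_cluster(features, centroids=CENTROIDS):
    """Return the label of the centroid closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


if __name__ == "__main__":
    baseline_week = (0.07, 0.25, 0.04)  # made-up feature vector
    print(nearest_cluster(baseline_week))
```

With the study's 10 features, only the length of each tuple would change. The distance computation is the same.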
Analytics on user interaction with the prompts for the four interventions were recorded and sent to Amazon Web Services, which allowed us to ascertain that users received and opened the prompt messages on their phones. By design, the prompt was sent to users’ smartphones each day of the intervention.
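The two conditional prompts in the list above (10K Daily Step and Hourly Stand) reduce to simple threshold rules. The sketch below is illustrative only: the function names and message wording are hypothetical, and only the thresholds (5000 steps by 3 pm, a full hour without activity) come from the description.

```python
DAILY_GOAL = 10_000       # default daily step goal
CHECK_THRESHOLD = 5_000   # 50% of the goal, checked at 3 pm local time


def step_prompt_at_3pm(steps_so_far):
    """Return a reminder if fewer than 5000 steps were logged by 3 pm, else None."""
    if steps_so_far < CHECK_THRESHOLD:
        remaining = DAILY_GOAL - steps_so_far
        return f"{remaining} more steps to reach your {DAILY_GOAL}-step goal"
    return None


def stand_prompt(minutes_since_last_activity):
    """Return a stand reminder after a full hour without activity, else None."""
    if minutes_since_last_activity >= 60:
        return "Stand up for 60 seconds"
    return None


if __name__ == "__main__":
    print(step_prompt_at_3pm(3200))
    print(stand_prompt(75))
```

Because both rules can return no message, a participant who is already active may receive no prompt on a given day. The intention-to-treat analysis described later accounts for this.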
Outcomes
The primary outcome measure was the change in the user’s daily step count, as measured by HealthKit, between the intervention weeks and the baseline week. Step values from the user’s smartphone (as opposed to the Watch or other wearables that provided data to HealthKit) were used. Secondary outcomes were prespecified for the subgroups of users for which these additional assessments were captured. These included:
• Change in daily step count from baseline in response to an intervention, as measured by the smartwatch.
• Change in daily minutes walked from baseline in response to an intervention, as measured by the core motion accelerometer of the smartphone.
• Change in daily minutes spent in bed, as measured by wearables that provide data to HealthKit.
• Change in daily sleep quality, measured by HealthKit as the quotient of the number of minutes spent asleep divided by the number of minutes spent in bed. Minutes spent in bed and asleep were read in from HealthKit using the field values HKCategoryValueSleepAnalysisAsleep and HKCategoryValueSleepAnalysisInBed. These values were captured from a number of devices that provided HealthKit data to the app.
• Change from baseline in daily self-reported happiness score in response to an intervention.
Statistical Analysis
Power analysis was performed to determine the number of subjects needed to identify an effect size of 2000 steps with p-value ≤ 0.05. The 2000 step effect size was deemed feasible based on the findings of prior randomized controlled trials of interventions aimed at increasing physical activity levels[430],[472]. From the initial phase of the MyHeart Counts study, the standard deviation of daily step count from HealthKit was determined to be 3500+/-200 steps. Mean daily step count was 3200+/-200 steps. Using these estimates, a one-tailed t-test power calculation was performed, suggesting that an RCT participant sample size of 1000 had power of 1 to detect an increase of 2000 steps with p-value of 0.05. The same approach suggested power of 0.99 to detect an increase of 1000 steps, power of 0.94 to detect an increase of 500 steps, and power of 0.37 to detect an increase of 200 steps. This analysis resulted in a pre-specified sample size of N=1000 for the study.
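A calculation of this kind can be approximated as follows. The text does not state which t-test variant or software was used, so this sketch rests on two assumptions: a two-sample comparison with 1000 observations per group, and a normal approximation to the t distribution. The function name `approx_power` is hypothetical. Under these assumptions the approximation gives roughly 0.94 for a 500-step increase and roughly 0.36 for a 200-step increase, close to the values quoted above.

```python
from statistics import NormalDist

SD_STEPS = 3500      # standard deviation of daily step count, from the text
N_PER_GROUP = 1000   # assumption: 1000 observations in each comparison group
ALPHA = 0.05         # one-tailed significance level


def approx_power(delta, sd=SD_STEPS, n=N_PER_GROUP, alpha=ALPHA):
    """Normal approximation to the power of a one-tailed two-sample test."""
    std_normal = NormalDist()
    se_diff = sd * (2.0 / n) ** 0.5           # standard error of a difference in means
    z_crit = std_normal.inv_cdf(1.0 - alpha)  # one-tailed critical value
    return std_normal.cdf(delta / se_diff - z_crit)


if __name__ == "__main__":
    for delta in (2000, 1000, 500, 200):
        print(delta, round(approx_power(delta), 2))
```

The calculation shows why a 200-step effect, close to what the trial eventually observed, sits well below conventional power targets at this sample size.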
Between 853 and 896 users completed each of the four interventions 2.4. 164 users enrolled and completed the RCT 2 or more times. For this subgroup of users, only their first completion of the full study was used in the statistical analysis. 240 additional users completed the first baseline week of data collection but declined to participate in coaching.
The sample size was dictated by user downloads of the MHC app, as well as the application development cycle. The study ran between December 12, 2016 and July 27, 2018 and was available to users who downloaded the 2.0 version of the MHC app from the app store. Version 2.1 of the MHC app was released on June 8, 2018. This version of the app incorporated software edits that reduced app power usage and consequently extended device battery life. The user cohort for the new app version was excluded from analysis for purposes of this study, as the reduced impact of the MHC app on device battery life might serve as a potential confounding factor for intervention effects. Consequently, the sample size was frozen on July 27, 2018, at which point all individuals who had downloaded MHC 2.0 had completed the study.
Ascertainment of dropout
Data from study participants were included if they completed one or more days of baseline monitoring and one or more days in at least one of the four interventions. Any days when a participant did not register 200 or more steps via HealthKit on either their smartphone or smartwatch were excluded from analysis.
In exploratory analysis of dropout criteria, we tested whether more stringent data filtering would affect the outcome B.4,B.5. One such filter involved limiting analysis to include participants who completed 200 or more steps for 4 or more days during the baseline week of monitoring and at least one of the interventions B.4. Alternatively, participants were required to complete 200 or more steps for all 7 days of the baseline week as well as for one or more of the interventions B.5. The results from the 1 day, 4 day, and 7 day thresholds were comparable, and hence the least stringent inclusion criteria are reported in the main results.
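The inclusion rules above can be expressed as a small filter. This is a sketch under stated assumptions: the function names and data layout are hypothetical, and the day threshold is assumed to apply to the baseline week only, with at least one valid day required in at least one intervention. The text does not spell out how the stricter thresholds apply to intervention weeks.

```python
MIN_STEPS = 200  # days with fewer recorded steps are excluded


def valid_days(daily_steps):
    """Keep only days with at least MIN_STEPS recorded steps."""
    return [steps for steps in daily_steps if steps >= MIN_STEPS]


def include_participant(baseline_steps, intervention_steps, min_baseline_days=1):
    """Return True if a participant meets the inclusion rule.

    baseline_steps: list of daily step counts for the baseline week.
    intervention_steps: dict mapping intervention name to a list of daily counts.
    min_baseline_days: 1 for the main analysis; 4 or 7 for the stricter filters.
    """
    if len(valid_days(baseline_steps)) < min_baseline_days:
        return False
    # At least one intervention must contribute at least one valid day.
    return any(valid_days(days) for days in intervention_steps.values())


if __name__ == "__main__":
    baseline = [150, 3000, 4200]
    interventions = {"Hourly Stand Prompt": [0, 500]}
    print(include_participant(baseline, interventions))
    print(include_participant(baseline, interventions, min_baseline_days=4))
```

Passing 1, 4, or 7 as `min_baseline_days` corresponds to the three thresholds compared in the exploratory analysis.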
Data modeling
Daily step count for each user was obtained by summing values for the “HKQuantityTypeIdentifierStepCount” field that flowed into the MHC app from HealthKit. Similarly, daily distance walked (in meters) was obtained by summing values for the “HKQuantityTypeIdentifierDistanceWalk” field. The HealthKit step count field was capped at 20,000 steps/day, while the distance field was capped at 25,000 m/day. Users were required to complete one or more days of an intervention for the corresponding data to be included in the analysis. Data from the watch and phone were analyzed separately B.2. Data from other devices that provide data to HealthKit were not used in the primary analysis, but were incorporated to measure duration and quality of sleep (secondary study outcomes). The within-subject analysis was performed by fitting a linear mixed effects model (R version 3.4.4, lme function from R library nlme) for daily step count, in accordance with B.3, treating the intervention group and number of days in study as fixed linear effects, and the users as random effects. Interaction terms between “Days In Study” and “Intervention” were considered for inclusion in the model, but were not found to be statistically significant so were removed from the final model. A quadratic term for “Days In Study” was also considered for inclusion in the model but not found to be statistically significant. The “Intervention” variable was categorical, with base value of “baseline” and contrasts set to “Hourly Stand Prompt”, “Cluster-Specific Prompt”, “Read AHA Website Prompt”, and “10K Daily Step Prompt”. Although some study participants continued to use the MHC app beyond completion of the study, any days that were not part of the baseline week of data collection or one of the above-mentioned interventions were not included in the model.
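The aggregation step that precedes model fitting can be sketched as below. The record layout and the name `daily_step_totals` are hypothetical, and the cap is interpreted here as clipping each daily total at 20,000 steps. The text does not say whether capped days were clipped or handled some other way.

```python
from collections import defaultdict

STEP_CAP = 20_000  # daily cap on step count stated in the text


def daily_step_totals(samples, cap=STEP_CAP):
    """Sum step samples per (user, date) and clip each daily total at `cap`.

    samples: iterable of (user_id, date_string, step_count) tuples, a
    hypothetical flattened form of HealthKit step-count records.
    """
    totals = defaultdict(int)
    for user_id, day, steps in samples:
        totals[(user_id, day)] += steps
    return {key: min(total, cap) for key, total in totals.items()}


if __name__ == "__main__":
    records = [
        ("u1", "2017-01-02", 1200),
        ("u1", "2017-01-02", 2300),
        ("u1", "2017-01-03", 25000),
    ]
    print(daily_step_totals(records))
```

The resulting per-user daily totals are the response variable for the mixed model. In R's nlme, a model with the structure described above would typically be written as `lme(steps ~ intervention + days_in_study, random = ~ 1 | user, data = d)`. The variable names in that call are illustrative, and the call is inferred from the description, not taken from the authors' code.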
The data were analyzed based on an intention-to-treat principle[6]: even if users were active on a given day and did not receive hourly stand-up prompts or prompts to complete their daily 10K step goal, the user was considered to have been provided the intervention. The model was generated with the R nlme package[45],[46]. P-values below 0.01 were considered significant for terms in the model. Marginal means for daily step count were computed from the lme model with the R lsmeans package[463]. A Tukey HSD test[175] with alpha value 0.05 was performed to check for statistically significant difference between pairs of interventions. An FDR-corrected p-value threshold of 0.05 was used to determine whether differences between two interventions were significant.
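The text does not name the FDR procedure that was used. As an illustration, the sketch below implements the Benjamini-Hochberg adjustment, one common choice. Treating it as the authors' method is an assumption, and the p-values in the example are made up, not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    example_p = [0.01, 0.04, 0.03, 0.005]  # made-up values for illustration
    for raw, adj in zip(example_p, benjamini_hochberg(example_p)):
        print(f"raw={raw:.3f} adjusted={adj:.3f} significant={adj <= 0.05}")
```

A pairwise comparison would then be called significant when its adjusted p-value is at or below the 0.05 threshold.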
Secondary exploratory analyses were performed using the linear modeling framework described above to examine data from smartwatches B.2,B.2 and to analyze data from users who completed all four interventions B.3,B.3.
Tukey’s HSD test was applied to check for significant differences in mean daily step count between the four interventions[2]. Pairwise differences of means between the interventions as well as between each intervention and baseline were calculated.
The same lme model was then built to analyze intervention effects for secondary outcomes (daily minutes of walking measured by core motion data, sleep duration, sleep quality, and self-reported happiness).
Feedback survey
On 10/11/2017, we emailed all consented users who had not previously unsubscribed from our email list, asking them to fill out a 13-question feedback survey administered using Qualtrics[3].
2.2.4 Results
Of all individuals who consented to the coaching study (n=2783), 493 (17.7%) remained in the study for 35 days or longer and completed all 4 interventions 2.4. A subset of users (n=164, 5.9%) completed the study more than once, and only their first pass through the study was kept for analysis. Each intervention was completed by approximately 900 individuals (n=892 for 10K Daily Steps, n=935 for Cluster-Specific coaching, n=896 for Read AHA Website, n=921 for the Hourly Stand prompt) 2.4. The mean age of users was 44.42±7.5 years, the majority of study participants were male (73.5%), and of those who reported ethnicity, 86.6% self-identified as White 2.3.
A group of 240 users consented to baseline monitoring but declined to consent to be randomized to one of the interventions. Of these participants, 58 continued to use the MHC app for one week or longer after declining participation in the RCT. For this subgroup, no significant difference was observed in daily step count in the 7 days prior to declining the RCT compared to the 7 days after (p=0.35).
Table 2.3: Baseline Characteristics for the 1075 study participants who completed one or more of the interventions and for the 493 participants who completed all four interventions.

Characteristic | Number who completed baseline and one or more interventions (n=1075) | % | Number who completed baseline and all 4 interventions (n=493) | %
Age (y), mean+/-SD | 48.04+/-15.28 | | 50.59+/-15.45 |
Sex
Male | 715 | 67.07 | 360 | 73.02
Female | 271 | 25.42 | 131 | 26.57
Other | 0 | 0 | 0 | 0
NA (did not respond) | 89 | 8.35 | 2 | <1
Self-reported race or ethnic group via multiple-choice questionnaire
American Indian | 1 | <1 | 0 | 0
Asian | 26 | 2 | 11 | 2
Black | 21 | 2 | 15 | 3
Hispanic | 25 | 2 | 13 | 2.63
I prefer not to indicate | 4 | <1 | 2 | <1
Pacific Islander | 2 | <1 | 1 | <1
White | 482 | 44.83 | 264 | 53.55
Other | 19 | <1 | 4 | <1
NA (did not respond) | 495 | 46.05 | 183 | 37.12
Self-reported level of education
Didn’t go to school | 4 | <1 | 0 | 0
Grade school | 2 | <1 | 2 | <1
High school diploma or G.E.D. | 43 | 4.00 | 30 | 4.06
Some college or vocational school or Associate Degree | 185 | 17.21 | 88 | 17.85
College Graduate or Baccalaureate Degree | 300 | 27.91 | 149 | 30.22
Master’s Degree | 219 | 20.37 | 103 | 20.89
Doctoral Degree (Ph.D., M.D., J.D., etc.) | 128 | 11.91 | 66 | 13.39
NA | 194 | 18.05 | 65 | 13.18
Physical Activity Readiness Questionnaire
Has your doctor ever said that you have a heart condition and that you should only do physical activity recommended by a doctor?
YES | 61 | 5.67 | 32 | 6.49
NO | 907 | 84.37 | 424 | 86.00
NA | 107 | 9.95 | 37 | 7.51
Do you feel pain in your chest when you do physical activity?
YES | 36 | 3.35 | 18 | 3.65
NO | 932 | 86.70 | 438 | 88.84
NA | 107 | 9.95 | 37 | 7.51
In the past month, have you had chest pain when you were not doing physical activity?
YES | 85 | 7.91 | 33 | 6.69
NO | 882 | 82.05 | 423 | 85.80
NA | 108 | 10.05 | 37 | 7.51
Do you lose your balance because of dizziness or do you ever lose consciousness?
YES | 90 | 8.37 | 36 | 7.70
NO | 877 | 81.58 | 419 | 84.99
NA | 108 | 10.05 | 36 | 7.30
Do you have a bone or joint problem that could be made worse by a change in your physical activity?
YES | 167 | 15.53 | 69 | 14.00
NO | 800 | 74.42 | 386 | 78.30
NA | 108 | 10.05 | 38 | 7.71
Do you know of any reason why you should not do physical activity?
YES | 14 | 1.30 | 9 | 1.83
NO | 954 | 88.74 | 447 | 90.67
NA | 107 | 9.95 | 37 | 7.51
Is your doctor currently prescribing drugs (for example water pills) for your blood pressure or heart condition?
YES | 273 | 25.40 | 132 | 26.77
NO | 695 | 64.65 | 324 | 65.72
NA | 107 | 9.95 | 37 | 7.51
Weight (kg), mean ± SD | 83.83+/-19.90 | NA | 84.73+/-21.43 | NA
Height (cm), mean ± SD | 169.70+/-32.03 | NA | 175.18+/-9.58 | NA
Smoking Status: Are you currently smoking cigarettes?
True | 7 | <1 | 3 | <1
False | 421 | 39.16 | 307 | 62.27
NA (did not respond) | 647 | 60.19 | 183 | 37.12
Heart Disease *
Present | 141 | 13.12 | 75 | 15.21
Absent | 725 | 67.44 | 365 | 74.04
NA (did not respond) | 284 | 26.42 | 53 | 10.75
Vascular Disease **
Present | 66 | 6.14 | 35 | 7.10
Absent | 725 | 67.44 | 407 | 82.56
NA (did not respond) | 284 | 26.42 | 51 | 10.34
Family History of heart disease
Father/brother with heart attack/coronary artery disease before age 55 y | 137 | 12.74 | 86 | 17.44
Mother/sister with heart attack/coronary artery disease before age 65 y | 76 | 7.07 | 45 | 9.13
None | 592 | 55.07 | 325 | 65.92
NA (did not respond) | 284 | 26.42 | 53 | 10.75
Medications
To treat and lower cholesterol | 191 | 17.77 | 121 | 24.54
To treat hypertension and lower blood pressure | 196 | 18.23 | 116 | 23.53
To treat diabetes or prediabetes and lower blood glucose level | 41 | 3.81 | 24 | 4.87
None | 504 | 46.88 | 279 | 56.59
NA (did not respond) | 284 | 26.42 | 50 | 10.14
Activity cluster assignment
Sedentary, low activity | 474 | 46.24 | 262 | 53.14
Active during workdays, sedentary on off-work days | 208 | 20.29 | 122 | 24.75
Active | 22 | 2.15 | 15 | 3.04
Sedentary during workdays, active on off-work days | 115 | 11.22 | 69 | 14.00
Drivers | 206 | 20.10 | 25 | 5.07

The 55 responses to the Feedback Survey collected within 10 days were included for further analysis. When asked which feature of the app they liked the most, the responders (N=55) responded most favorably to the 6-Minute-Walk Test (mean score 52.38 out of 100) and the Risk Score/ Heart Age questionnaire (48.05 out of 100). These features provide meaningful feedback to the user and may in this way lead to longer engagement with the app.
Relative to baseline, each of the four interventions had a significant positive effect on daily step count 2.5,2.4. Participants took a mean of 2955+/-74 steps during the baseline week of monitoring. All four interventions increased average daily step count: 185+/-78 steps (p=0.017) for the 10K Daily Step prompt, 237+/-77 steps (p=2.10e-3) for the Cluster-Specific prompt, 290+/-77 (p=2.00e-4) for the Read AHA Website prompt, and 219+/-78 (p=4.80e-3) for the Hourly Stand prompt. Tukey HSD analysis indicated that no intervention differed significantly from the others. These results were consistent when the dataset was restricted to the 493 users who had completed all four interventions, though this subgroup had a higher mean daily baseline step count (3115+/-119) compared to all users (2914+/-74), and the intervention effects were slightly higher in magnitude (additional 7 to 36 steps on average) B.3. Post-hoc sensitivity analysis on baseline observations carried forward was conducted. The significant improvement in median step count from baseline to the end of follow-up was still observed B.6.
Smartwatch data also indicated a significant positive effect on step count from each of the four interventions, with a mean increase of 595 steps, and average increases ranging from 565 steps for the AHA Website prompt to 705 steps for the Hourly Stand prompt B.2. A subgroup of 171 users provided daily step count data from both a smartphone and a watch. On average, the smartwatch reported step count 828 steps higher than the smartphone during the baseline week of monitoring (mean+/-SE = 3783+/-172 steps).
Measure | Baseline | 10K Steps | Personal Advice | Hourly Stand | Read
Daily Steps from HealthKit: smartphone
N | 1168 | 853 | 896 | 879 | 868
Mean+/-SE | 2955+/-74 | 3140+/-77 | 3192+/-77 | 3173+/-77 | 3245+/-77
Standard deviation | 3701 | 3562 | 3690 | 3515 | 3715
P-value | - | 1.77e-2 | 2.10e-3 | 4.80e-3 | 2.00e-4
Effect size+/-SE | - | 185+/-78 | 237+/-77 | 219+/-78 | 290+/-77
Daily Steps from HealthKit: smartphone users who completed all interventions
N | 493 | 493 | 493 | 493 | 493
Mean+/-SE | 3115+/-119 | 3399+/-115 | 3405+/-115 | 3406+/-115 | 3441+/-115
Standard deviation | 3836 | 3785 | 3806 | 3698 | 3899
P-value | - | 3.40e-3 | 2.60e-3 | 2.60e-3 | 8.00e-4
Effect size+/-SE | - | 284+/-97 | 290+/-96 | 291+/-97 | 326+/-97
Daily Steps from HealthKit: smart watch
N | 266 | 184 | 209 | 200 | 205
Mean+/-SE | 3783+/-172 | 4350+/-191 | 4373+/-189 | 4489+/-190 | 4349+/-189
Standard deviation | 4541 | 4261 | 4169 | 4090 | 4116
P-value | - | 3.60e-3 | 2.30e-3 | 3.00e-4 | 4.40e-3
Effect size+/-SE | - | 566+/-194 | 589+/-193 | 705+/-195 | 565+/-198
Table 2.4: Intervention effects on primary outcome of mean daily step count.
No statistically significant effect on sleep duration, sleep quality, fraction of day spent walking, or self-reported happiness was observed in response to the four interventions B.2,B.3. No statistically significant cumulative effects were observed. A statistically significant difference was observed between each intervention and baseline, but not between pairs of interventions, and not between the final intervention applied and the first intervention applied.
There was an inverse association between the number of days in the study and the number of steps per day [-8.19 steps/day, (p=1.60e-3)] when adjusted for other covariates.
2.2.5 Discussion
The study demonstrates the feasibility of using smartphones to carry out a cross-over randomized trial entirely in the digital domain. The results suggest daily coaching delivered via a smartphone is effective in the short term at modestly increasing daily step count compared to baseline. These findings extend previous studies showing smartphone interventions are capable of increasing physical activity.
Digital studies extend the reach of clinical trials to anyone with a cell phone (≥ 5 billion individuals globally), and reduce barriers to entry for participants to join in research. However, user interaction with mobile health apps, including the MHC app, tends to drop off significantly over the course of the studies B.1. This has been consistent not only for the MHC app, but also for mPower[66], Asthma Health (30), and other studies which followed a similar e-recruitment process[131]. In our cohort, only 493 individuals (17%) completed the full set of four assigned interventions. A subset of users did not open the app on a daily basis, resulting in missing data. While digital methods extend the accessibility of the study, the low retention rate may diminish the generalizability of the findings.
However, there are a number of reasons why we believe our results are robust, reliable and have clinical implications. First, the number of participants lost to follow-up was even across all four intervention groups, which is reassuring for several reasons. It implies that none of the interventions directly caused attrition. Furthermore, data were likely missing completely at random and not related to any participant characteristic. Lastly, it suggests that any potential bias arising from participant drop-out (also known as attrition bias) is shared evenly between all intervention groups.
Second, our results were consistent amongst our sensitivity analysis. We observed that when we included data from the participants who dropped out via a baseline observation carried forward analysis, our results were again consistent; namely all four interventions showed a significant increase in mean step count B.6. The same pattern was observed when carrying forward the step count for the last intervention a participant underwent and when carrying forward the mean step count across all interventions.
Finally, we have compared the key characteristics between participants that completed the study and those that dropped out: we observed no difference in any demographic, implying that the participants that completed the study are similar to those that did not. Importantly, this implies that better retention of participants will increase precision of the effect size but not the direction 2.3.
There are a number of unanswered questions which should be addressed in future research. First, it is unclear if gamification will increase long-term user engagement. In conducting the MyHeart Counts digital RCT we learned several lessons of use for future studies in this domain.
Although the MyHeart Counts cohort is diverse 2.3, the cohort, like most randomized trials, is not entirely representative of the greater U.S. population. Specifically, the user base is younger, more educated, and enriched for Caucasian males. This is partly due to the release of the app on the iPhone. For future digital trials it would be desirable to release the app on multiple mobile platforms (e.g. Android, HTC) to reach a broader segment of the population.
Additionally, future digital trials may improve user retention by providing more extensive personalized feedback to participants. Upon completion of the study, participants were requested to fill out a feedback questionnaire (https://stanforduniversity.qualtrics.com/jfe/preview/SV_87Fi9fwa54N98EZ?Q_CHL=preview). When asked which feature of the app they liked the most, the respondents (N=55) most favored the 6-minute walk test (6mw) and the heart age score. These features provide meaningful feedback to the user and may in this way lead to longer engagement with the app.
Digital trials have the potential to greatly improve clinical trials; they are much cheaper and much less labour-intensive; however, all digital studies have been limited by attrition. Although participant retention remains a challenge, previous studies have attempted to mitigate attrition through gamification, detailed user feedback, and direct user incentives [354]. We attempted to gamify the app by allowing users to earn achievement badges as they progress through the different phases of coaching.
Future studies can aim to assess the retention impact of these and other approaches to gamification and user feedback.
Our study has several limitations. User interaction with the MHC app dropped off significantly over the course of the study B.1, mirroring observations from the mPower[66], Asthma Health[505], and other studies which followed a similar e-recruitment process [131]. In our cohort, only 493 individuals (17%) completed the full set of four assigned interventions. A subset of users did not open the app on a daily basis, resulting in missing data. While digital methods extend the accessibility of the study, the low retention rate may diminish the generalizability of the findings. It remains challenging to ascertain whether the MHC coaching app is performing as expected in all real life situations, as it is currently not possible to simulate a multi-week intervention where the messaging is dependent on device data and accelerometry. Furthermore, although participants were requested to carry the phone on their person at all times while in the study, there was no immediate mechanism to ensure compliance other than a daily questionnaire asking what percent of time they carried the phone. The difference in step count values between the Apple Watch and the iPhone for individuals who used both devices suggests that the Watch, more likely to be worn continuously, did indeed yield higher step count. While evaluating the short-term response to simple ”nudges” is useful, future studies should explore longer follow up[416]. Finally, given that no dose response was observed for the four interventions, the exact mechanism via which the interventions alter user behavior remains to be determined.
In this cross-over randomized trial conducted entirely in the digital domain, all four examined interventions significantly increased the primary outcome of daily step count. Interventions appeared equivalent in effectiveness. No effect was observed on secondary outcomes of sleep duration, sleep quality, and self-reported happiness. These results suggest that behavioral coaching programs administered through smartphones can lead to short term increases in daily physical activity.
2.2.6 Author Contributions
This work is authored by Anna Shcherbina, Steven Hershman, Laura Lazzeroni, Abby King, Jack O’Sullivan, Eric Hekler, Yasbanoo Moayedi, Aleksandra Pavlovic, Daryl Waggott, Abhinav Sharma, Alan Yeung, Jeffrey Christle, Matthew Wheeler, Michael McConnell, Robert A Harrington, Euan Ashley. MVM, AP, ACK, EH, DW, AY, RAH, and EAA conceptualised and designed the study. SGH and DW managed app development. ACK, EH, AP, and EAA designed the interventions. SGH, AShc, JWO, RAH, and EAA contributed to acquisition, analysis, or interpretation of data. AShc, SGH, JWO, and EAA drafted the manuscript. MVM, SGH, JWO, ASha, YM, JWC, and EAA critically revised the manuscript for important intellectual content. AShc and LL did statistical analysis. MVM, MTW, and EAA supervised the study.

Figure 2.4: Flow of participants through MHC Digital RCT.

Figure 2.5: Mean daily step count from the iPhone during the one week baseline (“Baseline”) and each of the four interventions. “10K steps” – the user is prompted by the phone mid-way through the day how many additional steps he/she needs to walk to meet the daily 10,000 step goal; “Personalized Advice” – the user receives a text message each morning with coaching tailored to their personalized activity cluster, determined from the baseline week; “Hourly Stand” – if the user has been sedentary for 60 minutes, he/she receives a text message encouraging him/her to stand up; “Read AHA website” – the user receives a text message each morning directing him/her to literature from the American Heart Association.

Chapter 3
Link to genetics
3.1 Genetic determinants and causal implications of physical activity in large populations
3.1.1 Introduction
The UK Biobank prospective longitudinal study [452] and similar population cohorts provide a new opportunity to study genotypic drivers of complex phenotypic traits on a large scale. The UK Biobank database, compiled between 2006 and 2010, contains data on 500,000 individuals ranging in age from 40 to 69. The biobank provides thousands of baseline measurements on demographics, lifestyle, environment, psychosocial traits, clinical diagnoses, and biomarkers. 487,409 participants within the biobank have genotype data imputed to 50 million variants. Within this group, 96,220 also have high-resolution accelerometry data collected over a period of one week from a custom wrist-worn triaxial accelerometer.
The large-scale, multi-dimensional data within the UK Biobank enables us to examine the genetic determinants of physical activity, which remain poorly understood. Although much research exists showing that lack of physical activity is a major risk factor for mortality[345], [330], it remains unclear at the molecular level why some individuals exercise a great deal more than others. Causal determinants could conceivably relate to bones, joints, obesity, cardiovascular or skeletal muscle fitness. It could also relate to the brain.
We performed a genomewide association study (GWAS) of physical activity outcomes in UK Biobank. From the GWAS, we determined a candidate list of variants with strong associations with activity phenotypes. Among these, variant rs42850 was found to be associated with duration of moderate activity. This variant was intronic to the MEF2C gene. In vitro knockdown of MEF2C, which is expressed in neurons, was associated with decreased levels of brain-derived neurotrophic factor (BDNF). These findings suggest that the MEF2C gene may be associated with inducing an “exercise high”, potentially contributing to differing physical activity levels across individuals.
3.1.2 Methods
Population stratification
Population stratification within the UK Biobank imputed variant panel was performed via principal component analysis with 3 principal components on 487,409 subjects (all individuals in the UK Biobank with genetic data). The PLINK [92] command --pca approx 3 was executed on imputed chromosome 1 SNPs with minor allele frequency ≥ 0.1. After filtering for minor allele frequency, 443,757 variants remained. Decision boundaries for assigning individuals to one of four major ancestry groups (South Asian, East Asian, European, African/Caribbean) were generated along PC2. PC2 ≤ -0.011 indicated East Asian ancestry; PC2 in range [-0.007, -0.004] indicated South Asian ancestry; PC2 in range [-0.002, 0.002] indicated European ancestry; PC2 ≥ 0.003 indicated African/Caribbean ancestry. These cutoffs were selected by overlaying self-reported ancestry from UK Biobank field 21000 (http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=21000), which revealed distinct ancestry clusters. Thresholds were drawn conservatively so that samples with PC2 values outside of the above-specified ranges were removed from GWAS analysis, as they did not clearly align with any of the 4 major ethnic background categories.
The boundaries for establishing main ancestry groups are drawn along PC2, as follows:
• PC2 ≤ -0.011 indicates East Asian ancestry
• PC2 in range [-0.007, -0.004] indicates South Asian ancestry
• PC2 in range [-0.002,0.002] indicates European ancestry
• PC2 ≥ 0.003 indicates African/Caribbean ancestry
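The boundaries listed above can be sketched as a small lookup. This is a minimal illustration and not the code used in the analysis: the function name assign_ancestry and the example PC2 values are hypothetical, and samples that fall in the gaps between ranges return None, mirroring their removal from the GWAS.

```python
from typing import Optional

# Decision boundaries along PC2, copied from the list above.
# Each entry is (label, lower bound, upper bound); bounds are inclusive.
PC2_BOUNDARIES = [
    ("East Asian", float("-inf"), -0.011),
    ("South Asian", -0.007, -0.004),
    ("European", -0.002, 0.002),
    ("African/Caribbean", 0.003, float("inf")),
]


def assign_ancestry(pc2: float) -> Optional[str]:
    """Return the ancestry group for a PC2 value, or None if it falls in a gap."""
    for label, lower, upper in PC2_BOUNDARIES:
        if lower <= pc2 <= upper:
            return label
    return None  # such a sample would be excluded from the GWAS


# Hypothetical PC2 values, for illustration only.
samples = {"s1": -0.015, "s2": -0.005, "s3": 0.0, "s4": 0.01, "s5": -0.009}
groups = {sample_id: assign_ancestry(value) for sample_id, value in samples.items()}
```

Keeping the cutoffs in a single table makes the conservative gaps between groups explicit and easy to audit.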
Once major ancestry groups had been determined, principal component analysis with 3 PCs was executed on each of the four main ancestry subgroups individually. The resulting principal components were incorporated as covariates for GWAS analysis.
UK Biobank dataset filtering for GWAS analysis
The following filtering steps were performed on the UK Biobank imputed SNP dataset
• Duplicate SNPs were excluded from the UK Biobank .bgen files using the plink --exclude command.
• Rare variants (minor allele frequency ≤ 0.001) were excluded from analysis.
• Pairs of individuals with kinship coefficient greater than 0.22 were excluded.
GWAS analysis was performed on the filtered datasets for each of the four ancestry groups determined above.
Featurization of phenotypic traits for GWAS
Physical activity data: 47 features with pairwise |Spearman correlation| ≤ 0.75 were selected from an initial set of 122 identified activity features. The 47 phenotypes and the number of individuals for whom each phenotype was available are indicated in table C.1. The pairwise Spearman correlation across these features is indicated in figure 3.1.
All values were quantile-normalized and outliers that were more than 4 standard deviations away from the mean were excluded prior to running the GWAS.
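The two preprocessing steps can be sketched as follows. This is a hedged illustration, not the code used in the analysis: it assumes that quantile normalization refers to a rank-based inverse normal transform, which is one common choice for GWAS phenotypes, and it leaves open the order in which the two steps were applied. The function names are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def drop_outliers(values, n_sd=4.0):
    """Keep values within n_sd standard deviations of the mean."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return list(values)
    return [v for v in values if abs(v - mu) <= n_sd * sd]


def inverse_normal_transform(values):
    """Map values to standard-normal quantiles by rank (ties share a mean rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean 1-based rank across the tie
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    normal = NormalDist()
    # (rank - 0.5) / n keeps every probability strictly inside (0, 1).
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks]
```

A rank-based transform is insensitive to the scale of the raw phenotype, which is why it is often paired with, rather than replaced by, an explicit outlier filter on the raw values.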
Accelerometry features were obtained from field 90004 in UK Biobank (https://biobank.ctsu.ox.ac.uk/crystal/field.cgi?id=90004). The following features were generated from this data:
• Overall Acceleration Average
• Standard Deviation of Acceleration
• Hourly acceleration averages (e.g. 1_2 means average acceleration for the subject between 1 am and 2 am)
• Fraction of accelerometry values for the subject below a given mg (milli-gravity) threshold (represented as 1mg - 2000mg)
• Number of transition states at 10 mg and 25 mg. Transition states were determined by analyzing 10 second data intervals (i.e. 3 measurements). If the first measurement was above 10mg and the last two measurements were below 10mg (or vice versa), it was determined that a transition state had occurred.
• Discrete Wavelet Transform features – percent of energy contained at levels 5 and 6 of an 8 level transform using the Daubechies 4 wavelet. The DWT transformation and choice of features was determined based on transformations and features in [39],[359],[316].
• mean acceleration mg in 6 hour blocks. These were defined as a morning block (6 am - 12 pm), an afternoon block (12 pm - 6 pm), an evening block (6 pm - 12 am), and a night block (12 am - 6 am).
• mean acceleration mg in 8 hour intervals. These were defined as a night/morning interval (12 am - 9am), a workday interval (9 am - 5pm), and an evening interval (5pm - 12 am).
• mean acceleration mg in 12 hour intervals. These were defined as a day interval (5 am - 9 pm) and a night interval (9 pm - 5am).
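Two of the features above, the transition-state count and the fraction of values below a threshold, can be sketched as follows. This is an illustrative sketch only: it assumes consecutive, non-overlapping windows of 3 measurements, which the text does not specify, and the function names and the example trace are hypothetical.

```python
def count_transitions(acc_mg, threshold_mg, window=3):
    """Count windows whose first sample lies on one side of the threshold
    while all remaining samples lie on the other side."""
    count = 0
    # Assumption: consecutive, non-overlapping windows.
    for start in range(0, len(acc_mg) - window + 1, window):
        first = acc_mg[start]
        rest = acc_mg[start + 1:start + window]
        went_down = first > threshold_mg and all(v < threshold_mg for v in rest)
        went_up = first < threshold_mg and all(v > threshold_mg for v in rest)
        if went_down or went_up:
            count += 1
    return count


def fraction_below(acc_mg, threshold_mg):
    """Fraction of samples strictly below the threshold."""
    return sum(v < threshold_mg for v in acc_mg) / len(acc_mg)


# Hypothetical acceleration trace in mg, for illustration only.
trace = [12, 8, 7, 5, 6, 4, 3, 20, 30, 15, 15, 15]
n_transitions_10mg = count_transitions(trace, threshold_mg=10)
frac_below_10mg = fraction_below(trace, threshold_mg=10)
```

In this trace the first window drops below 10 mg and the third rises above it, so two transition states are counted.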

Figure 3.1: Pairwise Spearman correlation between phenotypes used to perform UK Biobank physical activity GWAS.
The remaining phenotypes were obtained from answers to questionnaires. Continuous feature values underwent quantile normalization and outlier removal (outliers defined as 1.5x interquartile range):
• Duration of walks http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=874
• Frequency of strenuous sports in the last four weeks http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=991
• Frequency of walking for pleasure in the last 4 weeks http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=971
• Number of days of moderate activity ≥ 10 minutes per week http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=884
• Duration of moderate activity http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=894
• Time spent outdoors in the summer http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=1050
• Time spent outdoors in the winter http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=1060
• Duration of walking for pleasure http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=981
• Number of days of vigorous physical activity ≥ 10 minutes per week http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=904
• Duration of strenuous sports http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=1001
• Number of days per week walked more than 10 minutes http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=864
• Usual walking pace http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=924
Another set of phenotypes was derived from ‘yes’/‘no’ questionnaires within UK Biobank:
• Job involves heavy manual or physical work http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=816
• Job involves mainly walking or standing http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=806
• Job involves shift work http://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=826
GWAS association tests
Two approaches were tried for handling GWAS covariates. In the first, only sex, age, height, and BMI were included as covariates for GWAS analysis. In the second, more conservative approach, a set of 172 covariates was selected based on phenotypes from UKBB with the highest mutual information content. In both cases we used the following process to obtain residuals: (1) transform categorical variables into binary dummy variables, (2) impute missing values using k nearest neighbors (k=10), and (3) run multivariate linear regression.
Continuous phenotype GWAS was performed with the following PLINK command:
plink --bfile ukb_imp_chr$2_v2 --pheno phenotypes.continuous.txt
--linear hide-covar --geno 0.1 --missing-phenotype -1000 --covar covariates.simple.txt
--out results/$1/$1.$2.continuous --adjust --pheno-name $1
--keep euro_minus_exclusion_minus_firstdegree.txt
Categorical phenotype GWAS was performed with the following PLINK command:
plink --bfile ukb_imp_chr$2_v2 --pheno phenotypes.continuous.txt
--assoc --geno 0.1 --missing-phenotype -1000 --covar covariates.simple.txt
--out results/$1/$1.$2.continuous --adjust --pheno-name $1
--keep euro_minus_exclusion_minus_firstdegree.txt
A p-value threshold of 5e-8 was used to determine significance. Significant hits were analyzed with the FUMA software[489] to identify statistically significant tissue and pathway enrichments. The DeepSEA[519] web tool was used to annotate GWAS hits for functional effects across tissue types. The GREAT software[309] was used for further variant annotation and association with genes and enriched GO terms.
Validation of GWAS hits in other cohorts
GWAS hits were validated with a p-value threshold of 0.01 in the 23&Me cohort for physical activity phenotypes “number of days of vigorous physical activity per week” and “number of days of moderate physical activity per week”. These phenotypes were also validated within the Klimentidis et al GWAS on the UK Biobank data [225].
Experimental validation of GWAS findings
GWAS hits for physical activity phenotypes that had validated in the 23&Me cohort and fell within 10kb of a protein coding gene were validated experimentally via knockdown of associated genes. The genes MEF2C, CACNA1S, DPY19L1, and SLC6A3 were targeted for validation. According to the GTEx portal[279], these genes are expressed in neurons. Hence, iPSC neurons were cultured and differentiated into mature neuron cells. Within the mature neuron cells, MEF2C, CACNA1S, DPY19L1, and SLC6A3 were knocked down. Neurons were stimulated with potassium chloride (KCl) to simulate stress caused by exercise. ELISA analysis of neurotransmitter expression was performed for knockdowns and control in KCl- and KCl+ conditions for dopamine, epinephrine, dynorphin, enkephalin, noradrenaline, and brain-derived neurotrophic factor (BDNF).
3.1.3 Results
Significant GWAS associations for physical activity in UK Biobank
From the population stratification analysis, principal components 1 - 3 were found to separate individuals within the UKBB genetic data cohort by biogeographic ancestry 3.2. Based on cutoffs along principal component 2, as described in the methods, the following ancestry representation counts were observed:
• 2,156 E. Asian
• 8,111 S. Asian
• 7,073 African/Caribbean
• 463,388 European
11,958 UK Biobank variants with GWAS p-value ≤ 5e-8 were identified. Of these hits, 600 had minor allele frequency ≥ 0.01. The number of hits by phenotype is presented in table C.2, and the tallies are illustrated in 3.3. The full list of these 11,958 variants is here: https://drive.google.com/file/d/0B7U9hibWYQ85ZlZ1WWlMMnV5NDA/view?usp=sharing; variants with maf ≥ 0.01 are highlighted in yellow. Of the 11,958 GWAS hits, 2,061 are associated with a self-reported phenotype, of which 256 have maf ≥ 0.01. In contrast, 9,893 variants are associated with a measured phenotype, of which 344 have maf ≥ 0.01. 3 variants are associated with both a self-reported and a measured phenotype.
14 variants showed up with known associations in the GWAS catalog 3.1.
4 missense variants were identified with significant SIFT and PolyPhen scores 3.2. Of these, only rs16891982 was identified as deleterious in SIFT and possibly damaging in PolyPhen.
Functional annotation of GWAS hits with FUMA
Three sets of phenotype associations were found to have significant functional annotations with FUMA[489]. 45 variants with maf ≥ 0.01 were associated with standard deviation of acceleration. This set of variants was enriched for GO term “Regulation of mononuclear cellular migration” (p≤0.01) and GO term “Phosphatase regulator activity” (p≤0.01). 64 variants with maf ≥ 0.01 were associated with time spent outdoors. This set of variants was enriched for GO term “Inositol triphosphate kinase activity” (p ≤ 0.05) and was associated with IP6K1 and IP6K2, which are upregulated in neuronal cells. Variants associated with the phenotype of mean acceleration between 6 and 7 o’clock in the morning were enriched for GO term “Snare interactions in vesicular transport” (p≤2e-12).

Figure 3.2: Population stratification from principal component analysis on genetic variant data within the UK Biobank cohort. A) PC1 vs PC2, color-coded by self-reported biogeographic ancestry. B) PC2 vs PC3, color-coded by self-reported biogeographic ancestry. Vertical lines indicate cutoff values along PC2 for ancestry determination from the PCA.

Figure 3.3: Tally of UK Biobank GWAS hits with p-value ≤5e-8 across physical activity phenotypes. A) Tally of all GWAS hits with p-value ≤ 5e-8. B) Tally of GWAS hits with p-value ≤5e-8 and minor allele frequency ≥0.01.
Validation of GWAS findings
GWAS hits were validated in the 23&Me physical activity cohort as well as the UK Biobank GWAS conducted by Klimentidis et al. [225]. Seven GWAS associations with days of vigorous and moderate activity validated with p-value ≤ 0.01, illustrated in table 3.3.
Furthermore, five intronic variants were found to be associated with duration of moderate activity, duration of walks, and weekly days of vigorous activity, and these associations were validated for analogous phenotypes in the 23&Me cohort 3.4. Given that these variants were intronic or synonymous within genes with expression in myoblasts and neurons (i.e. tissues with a direct plausible link to physical activity levels), these variants were validated experimentally. The corresponding genes were knocked down in differentiated neurons (see Methods), and an ELISA assay was used to measure resulting changes in neurotransmitter levels. Results of this validation are illustrated in figure 3.4. When the MEF2C gene was knocked down in neurons stimulated with KCl to simulate exercise, expression of BDNF dropped significantly compared to controls. No significant change in expression of noradrenaline or adrenaline was observed.
3.1.4 Discussion
The findings of the UK Biobank physical activity GWAS suggest a potential link between variants with functional effects in neurons and physical activity levels. The experimental validation of the link between MEF2C expression and BDNF expression suggests a possible mechanism, whereby variants such as rs42850, intronic to MEF2C, affect regulation of its expression and potentially influence the psychological and emotional response a person experiences to exercise. Further investigation and experimental validation is needed to verify this hypothesis.
3.1.5 Author Contributions
Euan Ashley conceptualized the study. Anna Shcherbina performed the GWAS analysis. Chunli Zhao performed all experimental validation. We thank David Amar and Manuel Rivas for helpful discussions on dataset filtering for GWAS and on correction for covariates.
3.2 Machine learning approaches for robust fine-mapping of putative causal regulatory variants associated with colorectal cancer
3.2.1 Abstract
We present an integrative approach for context-specific fine mapping and interpretation of noncoding disease-associated variants based on deep learning models of regulatory DNA sequence. We used this approach to analyze over 50 genome-wide significant loci associated with colorectal cancer (CRC) based on a meta-analysis of 125,218 CRC cases and controls from three consortia: Colon Cancer Family Registry, Colorectal Cancer Transdisciplinary Study, and Genetics and Epidemiology of Colorectal Cancer Consortium. First, we trained novel multi-task, convolutional-recurrent neural networks (NN) to accurately map 1Kb bins of DNA sequence across the genome to corresponding DNASE-seq (DHS) profiles across 193 diverse ENCODE+REMC cellular contexts. Next, we fine-tuned the reference NNs on DHS and H3K27ac profiles in 21 primary CRC tumor samples and mucosa from 6 healthy controls. This transfer learning approach resulted in significant accuracy gains relative to training only on the CRC relevant tissues. We used a novel feature attribution method for neural networks called DeepLIFT to efficiently infer sample-specific regulatory potential (predictive importance scores) of every nucleotide in 1Kb bins centered at 23,727 variants across all the associated loci, which include 3910 variants in the 99% credible sets of causal variants. Variants with high, robust DeepLIFT[424] scores across multiple samples, bootstrapped models and heterogeneous genomic background sets, highlighting putative functional regulatory variants, were further interrogated using in-silico mutagenesis (ISM) to estimate the signed influence (effect size) of all alleles on local chromatin state. DeepLIFT profiles of nucleotides surrounding high scoring variants further highlighted sequence features such as transcription factor (TF) motifs harboring these variants. Our approach exhibited high specificity. 
Within each GWAS locus, across 100s of 99% credible set variants and 1000s of background SNPs, only 1-5 SNPs exhibited high and statistically significant functional scores. Several high scoring SNPs overlapped tissue-specific enhancers in colorectal mucosa. Further, these variants were often found at flanks of DHSs directly disrupting, creating or indirectly influencing motifs of lineage-specific TFs. High confidence candidates are being tested using high-throughput CRISPR screens and reporter assays in CRC lines. Our framework can easily generalize to other disease phenotypes and to score non-coding rare variants and somatic mutations including indels and short SVs.
3.2.2 Introduction
The GECCO consortium has performed a number of GWAS to identify genetic associations with colorectal cancer[59],[199]. 134 credible GWAS sets were identified, with the majority of tag SNPs found in non-coding regions of the genome 3.5. Linkage disequilibrium makes it challenging to identify the functional SNP in these credible sets, and consequently determining a mechanistic explanation for the function of non-coding SNPs remains an unsolved challenge in the field.
We sought to address this challenge by developing machine learning algorithms to finemap variants in the GECCO GWAS. This modeling approach is also useful for answering questions about pleiotropy – which tissues are affected by a given SNP, and are these tissues relevant for colorectal cancer.
To build these models, we worked with collaborators in the Scacheri lab, who performed accessibility assays (H3K27ac histone ChIP-seq and DNASE-seq) on 6 healthy colon crypts and 24 colorectal cancer tumor samples[111]. H3K27ac ChIP-seq was also performed on 3 CRC cell lines – COLO205, SW480, and HCT116. These datasets were augmented with DNASE data in the three CRC cell lines.
These DNASE and ChIP-seq datasets were used to build convolutional neural network (CNN) models of chromatin accessibility. Taking advantage of transcription factor (TF) binding motif repeats across the genome, a multi-tasked convolutional neural network was trained to distinguish accessible transcription factor binding motifs from inaccessible genome regions. In this model, motifs were represented as artificial neurons in the network 3.6. The neurons/PWMs are convolved with the input sequence to learn motif binding patterns for transcription factors.
Such models are trained genomewide, enabling us to exploit intra-individual variations by tiling the genome into sequence DNA windows centred on individual traits, resulting in large training data sets from a single sample. This approach is based on Stegle et al’s foundation paper on the use of convolutional neural networks for predicting chromatin accessibility in the genome[28]. As shown in 3.7, because a given transcription factor binds several thousand times along the genome, the CNN model should learn the binding motif pattern, and the model’s prediction of accessibility should reflect a change in that pattern induced by a SNP.
The CNN models’ performance and interpretability was evaluated against the support vector machine gold standard algorithm, LSGKM[251],[166].
Candidate SNPs from the CRC GWAS were then evaluated for causal effects by determining their effects on model predictions. Candidates with significant effects on predictions in the SVM or CNN models were subjected to experimental validation via massively parallel reporter assays (MPRAs)[312].
3.2.3 Methods
The overall variant fine-mapping workflow is illustrated in 3.8. Three pipelines were developed to train deep learning models to predict chromatin accessibility – the ENCODE ATAC-seq and ChIP-seq pipelines[210],[252] were used to process FASTQ data. Pipeline outputs were passed through seqdataloader [29] to generate genomewide labels, and the kerasAC[30] software was used to generate convolutional neural network models to predict chromatin accessibility 3.9A. Variants from the GECCO GWAS were then passed through the series of filters in 3.8B to identify SNPs within peaks, found within regions of the genome with accurate model predictions, and predicted to be functional via in silico mutagenesis and DeepLIFT/DeepSHAP.
ChIP-seq and DNASE data processing and training label generation
Paired-end FASTQ datasets were processed with the ENCODE DCC ATAC-seq pipeline (v 1.0)[210], which also supports DNASE data, and the ENCODE DCC ChIP-seq pipeline (v 1.0)[252] in the histone mode.
Checking for GWAS SNP enrichment in GECCO datasets
Overlap fold enrichment was defined according to Equations 3.1,3.2,3.3.
The overlap analysis was performed on the LD-expanded GECCO GWAS, including variants within the 1000 Genomes with R² ≥ 0.8 with GECCO panel SNPs[59],[199]. LD values for hg19 were obtained from [368].
Annotations were obtained for GECCO DNASE and H3K27ac datasets, as well as Roadmap enhancer chromatin state mnemonic bed files for the 15-state core mark model (https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/), as documented at the Roadmap portal: https://egg2.wustl.edu/roadmap/web_portal/chr_state_learning.html#core_15states. Additional annotations for enrichment analysis were obtained from the set of Roadmap DNASE MACS2 consolidated peaks (https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/).
(3.1)
(3.2)
(3.3)
Enrichment values were calculated for overlap with the GECCO DNASE and H3K27ac ChIP-seq peaks, as well as with Roadmap DNASE datasets and ChromHMM enhancers.
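As a rough illustration of the general idea behind an overlap fold enrichment, and not a reproduction of Equations 3.1 to 3.3, one widely used form divides the observed fraction of SNPs that fall inside an annotation by the fraction expected from the annotation's coverage of the region. The function names, the coordinate convention (0-based, half-open), and the toy inputs below are all assumptions made for this sketch.

```python
def overlaps(pos, intervals):
    """True if a 0-based position falls in any half-open (start, end) interval."""
    return any(start <= pos < end for start, end in intervals)


def fold_enrichment(snp_positions, peaks, region_length):
    """Observed fraction of SNPs inside peaks divided by the fraction of the
    region covered by peaks. Assumes the peaks do not overlap each other."""
    observed = sum(overlaps(p, peaks) for p in snp_positions) / len(snp_positions)
    expected = sum(end - start for start, end in peaks) / region_length
    return observed / expected


# Toy example: two 100 bp peaks in a 1,000 bp region and four SNPs.
toy_peaks = [(100, 200), (500, 600)]
toy_snps = [150, 550, 700, 900]
toy_fold = fold_enrichment(toy_snps, toy_peaks, region_length=1000)
```

Here half of the SNPs fall inside peaks that cover one fifth of the region, giving a fold enrichment of 2.5.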
Convolutional neural network (CNN) model training for binned classification and regression
The seqdataloader[29] (v1.0) toolkit was used to tile the hg19 reference genome into 1kb bins with a stride of 50. To generate classification labels, the following algorithm was utilized:
• a) The central 200 bases of each 1kb bin were checked for overlap with IDR peaks
• b) The 200 bp region around each peak summit was extracted.
• If a) overlapped with b), the 1kb bin received a positive label.
• If there was no overlap between a) and b), but there was overlap between the full 1kb bin and the IDR peak, the bin was considered as ambiguous and excluded from training.
• If there was no overlap between the 1kb bin and the IDR peak, the bin received a negative label.
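The labelling rules above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the seqdataloader implementation: coordinates are taken to be 0-based and half-open, the 200 bp summit region is taken to be 100 bp on either side of the summit, and the names label_bin and tile are hypothetical.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """Half-open intervals overlap if each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def label_bin(bin_start, peaks, bin_size=1000, center=200, summit_flank=100):
    """Return 1 (positive), 0 (negative), or None (ambiguous, excluded).

    peaks: list of (peak_start, peak_end, summit) tuples for IDR peaks.
    """
    bin_end = bin_start + bin_size
    c_start = bin_start + (bin_size - center) // 2
    c_end = c_start + center
    # Positive: central 200 bp overlaps the 200 bp region around a summit.
    for _, _, summit in peaks:
        if intervals_overlap(c_start, c_end, summit - summit_flank, summit + summit_flank):
            return 1
    # Ambiguous: the full bin touches a peak, but the central region misses the summit.
    for p_start, p_end, _ in peaks:
        if intervals_overlap(bin_start, bin_end, p_start, p_end):
            return None
    return 0  # negative: no overlap with any peak


def tile(chrom_length, bin_size=1000, stride=50):
    """Start coordinates of bins tiled along a chromosome."""
    return range(0, chrom_length - bin_size + 1, stride)


# Hypothetical peak spanning 5000-5400 with its summit at 5200.
example_peaks = [(5000, 5400, 5200)]
example_labels = [label_bin(s, example_peaks) for s in (4700, 4100, 0)]
```

Returning None for ambiguous bins keeps them distinct from negatives, so that they can be dropped before training.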
To generate regression labels, the central 200 bases of each 1kb bin were overlapped with the asinh(fold change) tracks from MACS2. The signal within the central 200 bp bins was summed.
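The regression label for a single bin can be sketched as below. The sketch assumes a per-base list of raw fold-change values indexed by genome coordinate and applies asinh itself; in the workflow described above the tracks were already asinh-transformed, so only the summation would be needed. The function name and the toy track are hypothetical.

```python
import math


def regression_label(fold_change, bin_start, bin_size=1000, center=200):
    """Sum asinh(fold change) over the central `center` bases of a bin."""
    c_start = bin_start + (bin_size - center) // 2
    return sum(math.asinh(v) for v in fold_change[c_start:c_start + center])


# Toy track: signal at one base inside and one base outside the central window.
toy_track = [0.0] * 2000
toy_track[500] = 1.0   # inside the central window [400, 600) of the first bin
toy_track[399] = 5.0   # just outside the central window
toy_label = regression_label(toy_track, bin_start=0)
```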
Any genome bins overlapping the hg19 black list[21] were excluded from the training, validation, and test splits.
10-fold cross validation was used to train classification and regression models. Folds consisted of separate chromosomes within hg19 to avoid potential train/test contamination between overlapping 1kb genome bins. The folds were generated as follows:
# Chromosome-level folds for hg19: fold index -> test / valid / train chromosomes.
hg19_splits = {}
hg19_splits[0] = {'test': ['chr1'],
    'valid': ['chr10', 'chr8'],
    'train': ['chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr9', 'chr11',
        'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19',
        'chr20', 'chr21', 'chr22', 'chrX', 'chrY']}
hg19_splits[1] = {'test': ['chr19', 'chr2'],
    'valid': ['chr1'],
    'train': ['chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10',
        'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18',
        'chr20', 'chr21', 'chr22', 'chrX', 'chrY']}
hg19_splits[2] = {'test': ['chr3', 'chr20'],
    'valid': ['chr19', 'chr2'],
    'train': ['chr1', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10',
        'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18',
        'chr21', 'chr22', 'chrX', 'chrY']}
hg19_splits[3] = {'test': ['chr13', 'chr6', 'chr22'],
    'valid': ['chr3', 'chr20'],
    'train': ['chr1', 'chr2', 'chr4', 'chr5', 'chr7', 'chr8', 'chr9', 'chr10',
        'chr11', 'chr12', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19',
        'chr21', 'chrX', 'chrY']}
hg19_splits[4] = {'test': ['chr5', 'chr16', 'chrY'],
    'valid': ['chr13', 'chr6', 'chr22'],
    'train': ['chr1', 'chr2', 'chr3', 'chr4', 'chr7', 'chr8', 'chr9', 'chr10',
        'chr11', 'chr12', 'chr14', 'chr15', 'chr17', 'chr18', 'chr19', 'chr20',
        'chr21', 'chrX']}
hg19_splits[5] = {'test': ['chr4', 'chr15', 'chr21'],
    'valid': ['chr5', 'chr16', 'chrY'],
    'train': ['chr1', 'chr2', 'chr3', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10',
        'chr11', 'chr12', 'chr13', 'chr14', 'chr17', 'chr18', 'chr19', 'chr20',
        'chr22', 'chrX']}
hg19_splits[6] = {'test': ['chr7', 'chr18', 'chr14'],
    'valid': ['chr4', 'chr15', 'chr21'],
    'train': ['chr1', 'chr2', 'chr3', 'chr5', 'chr6', 'chr8', 'chr9', 'chr10',
        'chr11', 'chr12', 'chr13', 'chr16', 'chr17', 'chr19', 'chr20', 'chr22',
        'chrX', 'chrY']}
hg19_splits[7] = {'test': ['chr11', 'chr17', 'chrX'],
    'valid': ['chr7', 'chr18', 'chr14'],
    'train': ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr8', 'chr9',
        'chr10', 'chr12', 'chr13', 'chr15', 'chr16', 'chr19', 'chr20', 'chr21',
        'chr22', 'chrY']}
hg19_splits[8] = {'test': ['chr12', 'chr9'],
    'valid': ['chr11', 'chr17', 'chrX'],
    'train': ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8',
        'chr10', 'chr13', 'chr14', 'chr15', 'chr16', 'chr18', 'chr19', 'chr20',
        'chr21', 'chr22', 'chrY']}
hg19_splits[9] = {'test': ['chr10', 'chr8'],
    'valid': ['chr12', 'chr9'],
    'train': ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr11',
        'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20',
        'chr21', 'chr22', 'chrX', 'chrY']}
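Because the folds are defined by hand, a quick consistency check is useful before training. The sketch below is an illustration with hypothetical helper names (`fold_is_clean`, `heldout_sets_partition`) and a toy two-fold example rather than the hg19 folds themselves. It verifies that no chromosome is shared between the train, validation and test splits of a fold, and that every chromosome is held out as test data in exactly one fold.

```python
from typing import Dict, List

Fold = Dict[str, List[str]]


def fold_is_clean(fold: Fold, chroms: List[str]) -> bool:
    """True if test/valid/train are disjoint and together cover `chroms`."""
    flat = [c for key in ("test", "valid", "train") for c in fold[key]]
    return len(flat) == len(set(flat)) and set(flat) == set(chroms)


def heldout_sets_partition(splits: Dict[int, Fold], chroms: List[str]) -> bool:
    """True if each chromosome is a test chromosome in exactly one fold."""
    held_out = [c for fold in splits.values() for c in fold["test"]]
    return sorted(held_out) == sorted(chroms)


# Toy example (hypothetical chromosomes, not the folds listed above).
toy_chroms = ["chr1", "chr2", "chr3", "chr4"]
toy_splits = {
    0: {"test": ["chr1", "chr2"], "valid": ["chr3"], "train": ["chr4"]},
    1: {"test": ["chr3", "chr4"], "valid": ["chr1"], "train": ["chr2"]},
}
folds_ok = all(fold_is_clean(f, toy_chroms) for f in toy_splits.values())
partition_ok = heldout_sets_partition(toy_splits, toy_chroms)
```

The same two functions can be applied to a dictionary shaped like `hg19_splits` together with the list of chr1 to chr22, chrX and chrY.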
Several additional negative sets were evaluated alongside the genome-wide negative set described above, which was ultimately used. These are outlined in Table 3.6. The "shuffled positives" negative set was generated by dinucleotide-shuffling the base order within the positive bins, with the goal of preserving GC content. The GC-balanced negative set was generated by selecting bins at random from the genome-wide negative set, ensuring that the GC content of those bins matched that of the positive bins. A ratio of 10 negatives to 1 positive was used. The "Union of positives" set consists of IDR peaks called in all ENCODE DNASE datasets, subset to those peaks that are not present in the GECCO sample of interest. "Union of positives + Sampled universal DNASE negatives" refers to the "Union of positives" negative set with additional regions that we term universal DNASE negatives, i.e. regions with no peaks in any of the ENCODE cell lines.
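One way to construct a GC-balanced negative set is sketched here. This is an assumption-laden illustration rather than the procedure used in this work: the matching tolerance `tol`, the function names, and the greedy sampling without replacement are all choices made for the example.

```python
import random
from typing import List


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def sample_gc_matched(positives: List[str], negatives: List[str],
                      per_positive: int = 10, tol: float = 0.02,
                      seed: int = 0) -> List[str]:
    """Pick up to `per_positive` unused negatives per positive whose GC
    fraction lies within `tol` of that positive's GC fraction."""
    rng = random.Random(seed)
    pool = list(negatives)
    rng.shuffle(pool)                  # random order, so picks are random
    used = [False] * len(pool)
    chosen: List[str] = []
    for pos in positives:
        target = gc_fraction(pos)
        picked = 0
        for i, neg in enumerate(pool):
            if picked == per_positive:
                break
            if not used[i] and abs(gc_fraction(neg) - target) <= tol:
                used[i] = True
                chosen.append(neg)
                picked += 1
    return chosen
```

The linear scan per positive is quadratic in the worst case; for genome-scale bin sets, grouping the negatives into GC-content buckets first would be a more practical design.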
The Bassett architecture[219] (Figure 3.9) in multi-tasked mode with 5 tasks was used to train classification and regression models. A hyperparameter search was performed, varying the following parameters:
• number of convolution layers (1-5)
• number of dense layers (0-3)
• number of filters in convolution layers (100-1000)
• size of dense layers (60-500)
None of these modifications to the hyperparameters led to an improved area under the precision recall curve (auPRC) for classification, nor to improved Spearman and Pearson correlation scores for regression, on the held-out test sets. Hence the Kelley Bassett architecture was ultimately adopted as optimal for these learning tasks.
Convolutional neural network (CNN) model variant fine-mapping via DeepSHAP
The GECCO credible set variants were intersected with DNASE and H3K27ac peaks in the merged overlap peak set across the five datasets. Model predictions were obtained for reference and alternate alleles. If the model correctly predicted the reference allele, the DeepLIFT[424] and DeepSHAP[284] algorithms were used to obtain sequence importance scores for each variant's reference and alternate alleles, using the 1 kb region centered at the variant (Figure 3.8B).
Support vector machine model training
The LS-GKM algorithm[251],[166] was used to train support vector machine models to predict accessible regions in each of the 5 DNASE datasets. The SVM models were trained following the Kundaje lab SVM pipeline protocol: https://github.com/kundajelab/SVM_pipelines. For each DNASE dataset (healthy colon, primary tumor, SW480, HCT116, COLO205), up to 60,000 of the most significant IDR peaks were selected for a given fold (see fold definitions above), the summit positions were identified, and a 1kb flank around the summit was calculated. These comprised the positive set for SVM model training. The GC content for each bin in the seqdataloader bin set was calculated, and for each of the positive regions, a negative genome bin with comparable GC content was selected at random. The SVM algorithm with the GKM kernel (LS-GKM default settings) was then trained on this joint positive/negative bin training set via 10-fold cross-validation.
Support vector machine model variant fine-mapping via GKMExplain
Following the filters in Figure 3.8B, variants were selected for interpretation with the GKMExplain[425] algorithm. GKMExplain was executed on bins with the reference and alternate alleles, and a delta score was calculated across the two alleles. The distribution of delta scores was plotted for all variants in the 90% credible set (N=23,900), and a Kolmogorov-Smirnov test was performed to identify variants with significant scores. Those variants were flagged for experimental follow-up.
MPRA experimental validation
In collaboration with the Tewhey lab, promising candidate SNPs were validated with an MPRA assay[312]. GWAS variants within the SMAD7, GREM1, MYC, and POLD3 loci were subjected to "bashing", whereby saturation mutagenesis was performed within the 200 bp region centered at the SNP in the HCT116, SW480, and RKO cell lines.
3.2.4 Results
134 loci were found to be significantly (p≤5e-8) associated with colorectal cancer (CRC) via GWAS. Across these loci, 23,613 SNPs comprised the 90% credible set, while 3,911 variants comprised the 99% credible set.
Enrichments of GECCO GWAS variants were observed in CRC DNASE and ChIP-seq datasets
In comparing enrichments for GWAS variants across GECCO datasets, Roadmap DNASE datasets, and the ChromHMM 15-state core enhancer model, the strongest enrichments were observed in Roadmap DNASE digestive datasets, GECCO SW480 DNASE datasets, GECCO COLO205 datasets, and Roadmap DNASE T-cell datasets (Figure 3.10A). These results suggest that the GECCO datasets represent the cell types implicated in CRC. The presence of CRC cell lines and digestive tissue datasets at the top of this list is expected, as these originate in the colon. The presence of T-cells near the top of the list suggests an immune component to CRC. Stronger enrichments were observed in GECCO H3K27ac datasets compared to DNASE datasets within the same cell types (Figure 3.10B), suggesting that functional variants are concentrated within active enhancers, demarcated by H3K27ac peaks, rather than within merely accessible chromatin regions. Lower enrichments were observed in the COLL, COLM, and MODC single cell ATAC-seq datasets compared to bulk DNASE and H3K27ac, possibly due to sparser data in the single cell ATAC-seq (Figure 3.10C).
CNN model performance metrics
Test set performance metrics, averaged across the 10-fold models, are reported in Table 3.5 and Figure 3.11A,B. Single-tasked cell-type specific classification models achieved area under the receiver operating characteristic curve (auROC) values between 0.96 (SW480) and 0.98 (healthy colon controls). Such high auROC values are not surprising given the extremely high ratio of negative to positive labels in the dataset (Table 3.7). The area under the precision recall curve (auPRC) varied between 0.35 for COLO205 and 0.55 for HCT116 (Figure 3.11C,D). Regression models achieved a Spearman correlation of 0.775 and a Pearson correlation of 0.782 for normalized peak signal in colon crypts (Figure 3.11E), and a Spearman correlation of 0.876 and a Pearson correlation of 0.860 for normalized signal in primary tumor (Figure 3.11F).
Negative set determination
The choice of negative set was found to be an important factor in driving model performance (Table 3.6). Training genome-wide produced the highest auPRC on the test set.
However, a challenge of training genome-wide is that it produces a high class imbalance: the ratio of negative to positive genome bins ranged from 60:1 for the primary tumor DNASE samples to 240:1 for the COLO205 DNASE cell line data (Table 3.7). To mitigate the effects of this high class imbalance, three strategies were found to be helpful:
• Augmenting the positive set by including reverse complement sequences in the same batch.
• Upsampling positives to a given percent of each batch; 30% upsampled positives were optimal for the CRC dataset.
• Calibrating predictions with Platt scaling[362] (in the classification paradigm) and isotonic regression[500] (in the regression paradigm).
Using these approaches, the models were able to achieve auPRC and auROC test set performance metrics far above the expected metric values based on class imbalance (Table 3.7).
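The first two strategies can be illustrated with a small batch constructor. The sketch below is one possible reading of the description, not the training code used here: it assumes that half of the positive slots in each batch are filled with sampled positives and the other half with their reverse complements, and the names `reverse_complement` and `make_batch` are hypothetical.

```python
import random
from typing import List, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (other characters unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def make_batch(positives: List[str], negatives: List[str],
               batch_size: int = 20, pos_frac: float = 0.3,
               rng: Optional[random.Random] = None) -> List[Tuple[str, int]]:
    """Build one batch in which `pos_frac` of the examples are positive.

    Positives are sampled with replacement (upsampling); roughly half of
    the positive slots hold reverse complements of the sampled positives.
    """
    rng = rng or random.Random()
    n_pos = round(batch_size * pos_frac)
    n_fwd = (n_pos + 1) // 2
    forward = [rng.choice(positives) for _ in range(n_fwd)]
    revcomp = [reverse_complement(s) for s in forward[:n_pos - n_fwd]]
    neg = [rng.choice(negatives) for _ in range(batch_size - n_pos)]
    batch = [(s, 1) for s in forward + revcomp] + [(s, 0) for s in neg]
    rng.shuffle(batch)
    return batch
```

With `batch_size=20` and `pos_frac=0.3`, each batch contains six positive examples (three forward, three reverse complement) and fourteen negatives, regardless of how rare positives are genome-wide.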
Pre-initialization of model weights from multi-tasked CNN trained on ENCODE datasets
Model performance was compared for models with random Glorot initialization versus models initialized from a pre-trained multi-task model. The pre-trained model was trained on 700 ENCODE[120] DNASE cell lines using the Bassett architecture[219], with soft multi-tasking in the final layer. Figure 3.12 illustrates differences in performance before training and after early stopping (3 epochs of no validation loss decrease) for models initialized randomly versus those initialized with the ENCODE weights. Before the first epoch of training, models initialized with ENCODE weights achieved 0.20 higher negative accuracy, as they were not dominated by the high class imbalance in the data to the extent of the randomly initialized models. On average across the five GECCO tasks, the models initialized with ENCODE weights achieved an auPRC 0.07 higher on the test set after training had completed.
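The early stopping rule (stop after 3 epochs without a decrease in validation loss) can be written down explicitly. The function below is a generic sketch with a hypothetical name, operating on a list of per-epoch validation losses rather than on a live training loop.

```python
from typing import Sequence, Tuple


def early_stop_epoch(val_losses: Sequence[float],
                     patience: int = 3) -> Tuple[int, int]:
    """Return (stop_epoch, best_epoch) as 0-based epoch indices.

    Training stops at the first epoch that completes `patience`
    consecutive epochs without improving on the best validation loss.
    If that never happens, the last epoch is returned as stop_epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch
```

Weights from `best_epoch` would typically be the ones restored for evaluation, so a later recovery in validation loss after the stopping point is never observed.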
CHR | SNP | POS | A1 | A2 | MAF | GWAS catalog PMID | GWAS | Mean Z-score | Features associated with
15 | rs16960773 | 35604502 | G | A | 0.0706758 | 28061514 | Optic nerve measurement (cup-to-disc ratio) | 5.463 | Job Involves Shift Work
4 | rs5743618 | 38798648 | A | C | 0.265273 | 27182965 | Asthma | 6.048 | Job Involves Shift Work
6 | rs12203592 | 396321 | T | C | 0.209969 | 27182965 | Male-pattern baldness | 5.962 | Job Involves Shift Work
5 | rs16891982 | 33951693 | C | G | 0.068939 | 27182965 | Monobrow | 5.468 | Job Involves Shift Work
15 | rs1426654 | 48426484 | G | A | 0.0283663 | 26926045 | Hair color | 5.638 | Job Involves Shift Work
17 | rs1494648 | 9878947333 | T | C | 0.00196016 | 26830138 | Alzheimer disease and age of onset | 6.053 | Job Involves Shift Work
1 | rs55916418 | 86752918 | T | A | 0.0073503 | 26634245 | Post bronchodilator FEV1 | 5.518 | Job Involves Shift Work
4 | rs13107325 | 103188709 | T | C | 0.0716872 | 26604143 | Childhood body mass index | 6.218 | Job Involves Shift Work
8 | rs4129585 | 143312933 | A | C | 0.426732 | 26198764 | Schizophrenia | 5.951 | Job Involves Shift Work
10 | rs1555804 | 23393066 | G | A | 0.00808699 | 25918132 | Diisocyanate-induced asthma | 6.60525 | Overall Acceleration Average, 2 3, 14 15, 13 14
1 | rs2814778 | 159174683 | C | T | 0.019935 | 25884002 | Neutrophil count in HIV-infection | 9.741592593 | Overall Acceleration Average
10 | rs7072776 | 22032942 | A | G | 0.28102 | 23535729 | Breast cancer | 5.536 | Number of days vigorous physical activity
10 | rs11012732 | 21830104 | G | A | 0.327173 | 21804547 | Meningioma | -5.512 | Number Days Walked 10 Minutes
19 | rs429358 | 45411941 | C | T | 0.15407 | 20100581 | Brain imaging | -5.476 | Number Days Walked 10 Minutes
Table 3.1: UK Biobank physical activity hits with existing associations in the GWAS catalog.
rsID | Location | Allele | Consequence | Impact | Symbol | SIFT | PolyPhen
rs4302331 | 3:33055721 | G | missense | Moderate | GLB1 | tolerated | benign
rs13060847 | 3:391100 | G | missense | Moderate | CHL1 | tolerated | benign
rs4849116 | 3:330557212 | G | missense | Moderate | NT5DC4 | tolerated | possibly damaging
rs16891982 | 5:33951693 | G | missense | Moderate | SLC45A2 | deleterious | probably damaging
Table 3.2: UK Biobank GWAS hits with significant SIFT and PolyPhen scores.

Table 3.3: UK Biobank GWAS hit validation in other cohorts.
Locus | Tissues expressed (GTEx) | Associated GWAS hits | GWAS P-value | Phenotype association | In silico validation cohort | Functional consequence
MEF2C | Myoblast/neuron | rs42850 | 2.57e-8 | Duration moderate activity | 23&Me | intronic
CACNA1S | Myoblast | rs16847664 | 5.14e-10 | Duration of walks | 23&Me | synonymous
DPY19L1 | Neuron | rs1186716 | 1.35e-8 | Weekly days of vigorous phys. activity | 23&Me | 3'
SLC6A3 | Neuron | rs28382258 | 3.46e-8 | Duration of moderate activity | 23&Me | intronic
TMEM161B-AS1 | Neuron | rs6414946, rs61595689, rs34316 | 3.14e-8, 7.59e-9, 2.59e-8 | Duration moderate activity | 23&Me | intronic, intronic, 3' UTR
Table 3.4: UK Biobank GWAS hits for physical activity phenotypes that validated in the 23&Me cohort for similar phenotypes and underwent experimental validation via knockdown of associated loci.
Metric | Value
Recall at FDR50 | 0.60
Recall at FDR20 | 0.34
auPRC | 0.58
auROC | 0.89
Unbalanced accuracy | 0.93
Balanced accuracy | 0.78
Positive accuracy | 0.60
Negative accuracy | 0.95
Number of positives in test set (per task) | 11834
Number of negatives in test set (per task) | 150688
Table 3.5: Test set performance for GECCO binned CNN classification and regression models.

Figure 3.4: ELISA analysis of neurotransmitter expression in neuron cells in response to knockdown of the MEF2C gene. Controls are indicated in red; cells with MEF2 knockdown are indicated in blue. KCl+ refers to neuron cells stimulated with KCl to simulate stress induced by exercise. KCl- refers to unstimulated cells. A) ELISA analysis of transmitter expression for brain-derived neurotrophic factor (BDNF). B) ELISA analysis of transmitter expression for adrenaline. C) ELISA analysis of transmitter expression for noradrenaline.

Figure 3.5: Caption
Negative Set for Training | Published Methods | auPRC for Held-out Chr 1
Shuffled positives | DeepBind (Frey, 2015)[13] | 0.1
GC-balanced negatives, randomly sampled from genome | | 0.1
Dinucleotide-balanced negatives, randomly sampled from genome | | 0.1
Union of positives | Bassett (Kelley, 2016)[219] | 0.25
Union of positives + Sampled universal DNASE negatives | | 0.30
Genome bins with at least one TF binding event | DeepSEA (Troyanskaya, 2015)[520] | NA
Whole genome negatives | | ∼0.50
Table 3.6: GECCO CNN choice of negative set effect on prediction performance.
Dataset | Healthy | Tumor | SW480 | HCT116 | COLO205
Pos in Test Set | 53874 | 113246 | 102532 | 56798 | 34747
Neg in Test Set | 5966799 | 5879450 | 5894715 | 5962722 | 5995062
Neg:Pos | 110.75 | 51.92 | 57.49 | 104.98 | 172.53
Pos/Total | 8.95e-3 | 1.89e-2 | 1.71e-2 | 9.44e-3 | 5.76e-3
Table 3.7: Class imbalance in genomewide training and test dataset for GECCO binned models
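The derived rows of Table 3.7 follow directly from the counts, as the short sketch below shows for two of the datasets. The positive fraction is also a useful reference point: for a classifier that ranks bins at random, the expected precision equals the fraction of positives, so the expected auPRC is approximately Pos/Total. The dictionary and function names are hypothetical.

```python
from typing import Dict, Tuple

# (positives, negatives) in the test set, copied from Table 3.7.
test_counts: Dict[str, Tuple[int, int]] = {
    "Healthy": (53874, 5966799),
    "Tumor": (113246, 5879450),
}


def imbalance(pos: int, neg: int) -> Tuple[float, float]:
    """Return (neg:pos ratio, positive fraction) for one dataset."""
    return neg / pos, pos / (pos + neg)


for name, (pos, neg) in test_counts.items():
    ratio, pos_frac = imbalance(pos, neg)
    # pos_frac approximates the auPRC expected from random ranking.
    print(f"{name}: Neg:Pos = {ratio:.2f}, Pos/Total = {pos_frac:.2e}")
```

Against a random-ranking baseline below 0.02 for every dataset, auPRC values in the 0.35 to 0.55 range represent a large improvement even though they look modest in absolute terms.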

Figure 3.6: Learning transcription factor binding motifs via convolutional neural networks. A) The genome is split into 1kb bins to predict chromatin accessibility from underlying sequence data. Classification and regression models learn transcription factor binding motifs within the input sequence data. B, C) Motif position weight matrices (PWMs) can be represented as neurons, where the input sequence bases are X1, X2, ..., XN and the PWM weights are W1, ..., WN. The neuron output Z is generated by convolving the PWM (CNN kernel) with the input sequence.
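The correspondence between a PWM and a convolutional filter can be made concrete with a toy scan. In the sketch below, which uses hypothetical names and an idealized 0/1 weight matrix, the filter is slid along the sequence and the output at each offset is the sum of the weights selected by the bases in that window, which is what a one-dimensional convolution over one-hot encoded sequence computes.

```python
from typing import Dict, List

PWM = List[Dict[str, float]]  # one base-to-weight mapping per motif position


def scan(seq: str, pwm: PWM, bias: float = 0.0) -> List[float]:
    """Slide the PWM along `seq` and return one score per window."""
    k = len(pwm)
    return [
        bias + sum(pwm[j][seq[i + j]] for j in range(k))
        for i in range(len(seq) - k + 1)
    ]


# Toy filter that scores 1.0 for each base matching the motif "ACG".
toy_pwm: PWM = [{b: 1.0 if b == m else 0.0 for b in "ACGT"} for m in "ACG"]
scores = scan("TTACGTT", toy_pwm)
best_offset = max(range(len(scores)), key=scores.__getitem__)
```

In a trained network the weights are real-valued and learned from data, and the score is usually passed through a nonlinearity before reaching later layers.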

Figure 3.7: Caption

Figure 3.8: GECCO variant fine-mapping workflow. A) Data processing and generation of convolutional neural network (CNN) models with the ENCODE DCC ATAC-seq pipeline, seqdataloader, and kerasAC. B) Decision pipeline for determining SNP functionality from CNN models.

Figure 3.9: Bassett[219] architecture for genome-wide classification and regression models of chromatin accessibility.

Figure 3.10: Digestive, immune and CRC cell lines enrich for GECCO GWAS hits. A) Enrichments of GWAS variants in GECCO H3K27ac and DNASE datasets in healthy colon, primary tumor, SW480, HCT116, COLO205. Enrichments in ChromHMM 15-state model enhancer datasets, color-coded by tissue of origin. Enrichments in Roadmap DNASE datasets, color-coded by tissue of origin. B) GWAS enrichments in GECCO datasets in DNASE (solid line) and H3K27ac (dashed line). C) GWAS enrichments in GECCO datasets in single cell ATAC-seq data across four regions of the colon: MODC (modular descending), COLL (left colon), COLM (colonic mucosa) shown in dotted line, overlaid with B).

Figure 3.11: Performance metrics for GECCO multi-tasked classification and regression CNNs. A) Classification CNN, mean ROC across 10 folds on the test set. B) Range of observed auROC values across 10 folds for classification CNN. C) Mean PRC across 10 folds on the test set, classification CNN. D) Range of observed auPRC values across 10 folds for classification CNN. E) Spearman and Pearson correlation values across the joint test set across 10 folds within healthy colon crypts. F) Spearman and Pearson correlation values across the joint test set across 10 folds within primary tumor samples.

Figure 3.12: Model initialization with weights from pretrained ENCODE multi-tasked model.
Variant fine-mapping with ISM and DeepLIFT
Of the N=23,600 variants in the GWAS hit set, 4,674 (19.8%) were found to lie within a DNASE or H3K27ac peak in one of the GECCO datasets. Of these, 1,002 were in the 99% credible set, and 176 were in the 99% credible set and had no coding variant in high LD (r^2 ≥ 0.8) (Table 3.8).
Peak overlap | GWAS hits (N=23600) | GWAS 99% Credible Set (N=3969) | GWAS 99% Credible Set with No LD Coding SNP (N=629)
In H3K27ac peak | 4121 | 906 | 145
In DNASE peak | 1559 | 329 | 45
DNASE or H3K27ac | 4674 (19.8%) | 1002 (25.2%) | 176 (27.9%)
Table 3.8: Fraction of GWAS candidate variants found within accessible regions of the genome. Fraction of variants within regions of accessible chromatin from the GECCO GWAS hit list (GWAS p-val ≤ 8e-5), the GWAS 99% credible set, and those SNPs within the 99% credible set with no coding variants in high LD (r^2 ≥ 0.8).
Results of variant fine-mapping with ISM are shown in Figure 3.13. Variants found below the diagonal (e.g. rs981625, rs1318920) were predicted by the CNN models to have a lower probability of having a motif binding site with the alternate allele than with the reference allele, i.e. these variants were predicted to cause loss of function in healthy controls (Figure 3.13A,B). rs1318920 is predicted to cause loss of function across several of the datasets: healthy colon, primary tumor, and the SW480 cell line. rs981625 is also predicted to cause loss of function in healthy colon, SW480, and HCT116. In contrast, several variants (rs186956493, rs35695519, rs28552706, rs11255805, rs6126017, rs2223660, rs12896913, rs58658771, rs10821846) were predicted to lie in accessible chromatin regions with a higher probability for the alternate allele than for the reference allele.

Figure 3.13: GECCO candidate variant fine-mapping with ISM.
The contrast between panels A and B illustrates a challenge in ISM variant interpretation that is posed by sigmoid saturation (Figure 3.14). In the final layer of the CNN classification model, the outputs of the Dense layer are passed through a sigmoid activation to derive a probability between 0 and 1. However, large differences in logit space between alternate and reference allele model predictions may be compressed to small differences along the sigmoid output. An example of this is observed for rs1977415, which has a difference of 8 in logit space and is clearly off-diagonal in Figure 3.14B. However, because the prediction for this variant is highly negative, transforming it through a sigmoid activation produces a negligible difference in post-sigmoid space: P(ref)=3.01e-16, while P(alt)=5.85e-12. Model predictions for this SNP lie in the saturation region of the sigmoid, and hence the SNP may be a false negative for prediction of a functional effect. In contrast, rs981625 has a much smaller difference between reference and alternate alleles in logit space (0.82 for logit ref and -0.45 for logit alt). However, because these values fall into the non-saturated input region of the sigmoid, the transformed outputs show a difference between reference and alternate alleles (P=0.52 for ref, P=0.17 for alt). Hence, in probability space, the SNP will show up as having an effect on the model's prediction. For this reason, it was found to be important to perform ISM analysis in both logit space and sigmoid space, to avoid missing functional variants whose ISM effects would fall into the region where the sigmoid saturates.
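The saturation effect can be reproduced numerically. The sketch below compares ISM delta scores in logit space and in probability space for two pairs of logits. The logit values are illustrative choices, not the model outputs for the variants discussed above, and the function names are hypothetical.

```python
import math
from typing import Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ism_deltas(logit_ref: float, logit_alt: float) -> Tuple[float, float]:
    """Return (alt - ref) in logit space and in probability space."""
    delta_logit = logit_alt - logit_ref
    delta_prob = sigmoid(logit_alt) - sigmoid(logit_ref)
    return delta_logit, delta_prob


# Illustrative logits: one pair deep in the saturated region of the
# sigmoid, one pair near zero where the sigmoid is roughly linear.
saturated = ism_deltas(logit_ref=-35.0, logit_alt=-27.0)
responsive = ism_deltas(logit_ref=0.5, logit_alt=-1.0)
```

The saturated pair differs by 8 in logit space yet by a negligible amount in probability space, whereas the responsive pair differs by only 1.5 in logit space but by roughly a third of the probability range. Ranking variants by either score alone would therefore miss one of the two.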

Figure 3.14: Sigmoid saturation challenge for ISM variant interpretation. A) Delta scores in logit space for four variants in the GECCO 90% credible SNP set, representative of the input domains where the sigmoid does and does not saturate. B) ISM delta scores in logit space for the GECCO GWAS significant loci. C) ISM delta scores in probability space for the GECCO GWAS significant loci.
The variant rs981625 (Figure 3.15) was one of the SNPs with the highest change in model prediction between the reference and alternate alleles. The variant lies within the summit region of a peak in DNASE assays for all five datasets, and within an H3K27ac peak in the healthy colon tissue, primary tumor sample, and COLO205 cell line (Figure 3.15A). With the reference C allele, the classification CNN model predicts a positive (peak presence) with p ranging from 0.46 for HCT116 to 0.811 for primary tumor. The prediction probability drops to between 0.073 (COLO205) and 0.478 (Tumor) for the alternate G allele (0.32). Interpreting the model predictions for C and G with DeepLIFT suggests a change in DeepLIFT scores within the 10 base region flanking rs981625 in the DeepLIFT delta track (Figure 3.15B). Searching for this sequence (CTCCACCC) in the TomTom[174] database yields a significant match (p≤0.001) for the KLF5 motif (Figure 3.15C). This is in line with prior findings of the GECCO consortium (Figure 3.15D), which identified a KLF5 superenhancer within the vicinity of this variant.
rs1318920 was another variant with a large difference in importance scores when the C allele is mutated to a T allele (Figure 3.16). There are score differences in the HNF4A motif, which is disrupted by the SNP directly. However, the rs1318920 SNP, located at position chr10:101353285, lies 63 bases away from the peak summit, which is called by MACS2 at position chr10:101353222. Given that the motif lies 63 base pairs away from the summit, it is less likely to drive change in transcription factor binding at the peak. However, we also observe an AP-1 motif with high importance score located 60 bases downstream from rs1318920, with the center nucleotide of the motif positioned exactly at the DNASE peak summit (chr10:101353222).
We hypothesize that AP-1 binding at the AP-1 motif is primarily driving the accessible chromatin site here (based on proximity to the summit). The AP-1 binding is likely modulated by HNF4A or HNF4G binding at the flanking HNF motif that overlaps the rs1318920 SNP (supported by the feature interaction scores between the SNP, the HNF4A motif and the AP-1 motif). Hence, any oligo overlapping this SNP that is tested for functional activity in MPRAs or luciferase assays should probably include the sequence spanning both motifs.
The functional effects of rs1318920 are corroborated by ChromHMM annotations from the core mark 15-state model (Figure 3.16B), where it is the sole variant in the LD region found to lie in an active enhancer (yellow) in CRC-associated cell types and tissues from Roadmap.

Figure 3.15: rs981625 variant fine-mapping via CNN models and DeepLIFT. A) Fold change bigwig tracks for DNASE (orange) and H3K27ac (blue) data for 1kb regions centered on rs981625. CNN Bassett classification model predictions for the reference allele C and alternate allele G. B) DeepLIFT scores in the SW480 cell line for the 1kb region centered on rs981625. Top track shows scores when the reference C allele is observed in the input sequence; middle track shows scores when the alternate G allele is observed in the input sequence; bottom track shows the delta score G - C. C) Top match for the motif region flanking rs981625 within the TomTom database[174]. D) KLF5 superenhancer for colorectal cancer. Regional association plot showing the unconditional –log10[P] for the association with CRC risk in the combined meta-analysis of up to 125,478 individuals, as a function of genomic position (build 37) for each variant in the chromosome 13 (chr13) region. Lead variants are indicated by diamonds and their positions are indicated by dashed vertical lines. The color labeling and shape of all other variants indicate the lead variant with which they are in strongest LD.[199]
Comparison against support vector machine (SVM) gold standard
The LS-GKM SVM validation workflow was used as a field gold standard against which to compare the interpretations of the GECCO classification and regression CNN models (see Methods). The SVM workflow is illustrated in Figure 3.17.
Within the set of 23,716 candidate variants, 1582 were found to overlap with a DNASE peak in at least one of the five GECCO datasets. These were scored with the GKMExplain algorithm, and a KS-test was performed on the distribution of GKMExplain delta scores to identify variants with significant delta scores between the reference and alternate alleles. 9 variants yielded GKMexplain delta scores with p-value ≤ 0.05: rs1318920, rs16983844, rs17127473, rs12896913, rs16983844, rs2244490, rs6089354, rs28549017, rs4360494.
Of these 9 variants, 5 were found to either have a GWAS p-value ≤ 1e-5 or a high LD association (r^2 ≥ 0.8) with a GWAS tagged SNP. The five variants with significant differences in GKMExplain scores between the reference and alternate alleles are described in Figures 3.18, 3.19, 3.20, 3.21, and 3.22.
The first of these, rs1318920, was also identified as a top hit in the ISM and DeepLIFT analysis, and the fact that it appears among the top five candidate SNPs in the SVM validation corroborates that the CNN models are performing accurately. Both the DeepSHAP and GKMExplain algorithms predict a loss of accessibility in the region flanking the SNP when the reference C allele is mutated to the alternate T allele. Scanning this region in TomTom reveals a strong match for HNF4A, as noted above, with active enhancer activity observed in CRC-relevant cell types. Notably, the LD-associated GWAS tag (rs11190164) does not lie within a peak region in any of the CRC datasets, providing further support that rs1318920 is driving the GWAS association (Figure 3.18).
rs12896913 (Figure 3.19) was also among the top five most likely functional variants from the SVM analysis. Both GKMExplain and DeepSHAP predict a gain of accessibility when the reference C allele is mutated to the alternate T allele (Figure 3.19A). Scanning the high-scoring flank in TomTom suggests a strong match for the JUN:FOS AP-1 motif (TGAGTCA) (p=2.54e-5). It is plausible that the mutation of the C to a T completes the PWM for JUN:FOS and enables TF binding. rs12896913 has a GWAS significant p-value of 5e-10, and is likely the variant driving the association, since the tagged lead SNP rs4901473 does not fall into a peak in any of the datasets.
rs28549017 has a strong delta GKMExplain score, but does not show a significant delta DeepSHAP score, highlighting some of the limitations of binned CNNs (see the BPNet section; Figure 3.20). The GKMExplain analysis suggests a loss of function, as mutating the reference G to the alternate C breaks a likely CTCF binding motif. As in the previous examples, the lead tagged SNP rs9924886 does not fall into a peak in any of the datasets, while rs28549017 does.
Similar functional patterns are observed for rs4360494 (Figure 3.21), where a G to C mutation breaks a CTCF motif near a strong peak summit.
Finally, for rs6089354 (Figure 3.22), GKMExplain predicts loss of accessibility in a cooperative function: both ZIC1 and ERG motifs in a heterodimer are disrupted by the G to A mutation.

Figure 3.16: rs1318920 CRC variant fine-mapping. A) Gradient x Input interpretation scores for the reference C allele for rs1318920 (top track), the alternate T allele (middle track), and the delta of the two (bottom track). The variant position is indicated in the boxed region to the right. The highest change in interpretation scores is observed for an AP-1 motif (TGAATCA) 63 bases upstream from rs1318920 (indicated in black box on the left). The positions of this motif near the summit of the corresponding DNASE peak in primary tumor as well as the position of rs1318920 further downstream are indicated by black lines in the orange peak region. B) ChromHMM functional annotation for rs1318920 and variants in high LD (r^2 ≥ 0.8). Yellow indicates active enhancers, red indicates active TSS, green indicates active transcription, gray indicates repressors, white indicates no known functional activity.

Figure 3.17: GECCO SVM validation workflow.

Figure 3.18: GECCO SVM candidate functional variant: rs1318920. A) DeepSHAP and GKMExplain scores for the reference C allele, the alternate T allele, and the difference of the two tracks. B) TomTom motif match for HNF4A for the sequence region flanking rs1318920 with high GKMExplain delta scores. C) ChromHMM chromatin state annotations within the vicinity of rs1318920 in CRC-relevant cell types. D) MACS2 fold change tracks in the five GECCO datasets in the vicinity of rs1318920 and LD tagged SNP rs11190164.

Figure 3.19: GECCO SVM candidate functional variant: rs12896913

Figure 3.20: GECCO SVM candidate functional variant: rs28549017

Figure 3.21: GECCO SVM candidate functional variant: rs4360494
MPRA Validation
Two functional variants predicted by the CNN were validated experimentally by MPRA analysis. For rs7130173 (Figure 3.23), the analysis suggests a cooperative effect of a weak motif centered on the variant position as well as a stronger CTCF motif approximately 100 bases downstream.
rs72685323 was also validated for functional activity in HCT116 using a massively parallel reporter assay (MPRA) (Figure 3.24). The allele and orientation with the highest luciferase activity was selected and a 200 bp flank around the SNP was analyzed. CNN analysis suggests a gain of function mutation that activates a canonical AP-1 motif. The variant showed a strong functional effect (fold change = 0.69, p-value = 1.29E-05).
3.2.5 Discussion
The pilot study described here suggests that it is feasible to train and interpret convolutional neural networks to identify functionally important variants from GWAS credible sets for colorectal cancer. ISM analysis of 23,727 variants across 134 credible sets identified 1-3 variants in each of the credible sets where introduction of the alternate allele altered the model's prediction. Model interpretation with DeepLIFT suggests functional explanations for the variants' effects, i.e. disruption of the ELK motif for variant 1 and disruption of HNF4A (and possible interactions with AP-1) for variant 2. The model's prediction was equally likely to change from negative to positive as from positive to negative, suggesting that this approach can identify both gain of function and loss of function mutations.
3.2.6 Contributions
The GECCO consortium, and specifically, Jeroen Huyghe, Stephanie Bien, Andre Kim, Tabitha Harrison, and all co-authors of [59], [199] generated the GECCO GWAS. DHS and ChIP-seq data was generated by Peter Scacheri’s lab; MPRA validation was performed by Ryan Tewhey’s lab.
3.3 Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer’s and Parkinson’s diseases
3.3.1 Abstract
Genome-wide association studies (GWAS) in neurological diseases have identified thousands of variants associated with disease phenotypes. However, the majority of these variants do not alter coding sequences, making it difficult to assign their function. Here, we present a multi-omic epigenetic atlas of the adult human brain through profiling of single-cell chromatin accessibility landscapes and three-dimensional chromatin interactions of diverse adult brain regions across a cohort of cognitively healthy individuals. We developed a machine learning classifier to integrate this multi-omic framework and predict dozens of functional single nucleotide polymorphisms (SNPs) for Alzheimer’s disease (AD) and Parkinson’s disease (PD), nominating target genes and cell types for previously orphaned GWAS loci. Moreover, we dissected the complex inverted haplotype of the MAPT (encoding tau) PD risk locus, identifying putative ectopic regulatory interactions in neurons that may mediate this disease association. This work greatly expands our understanding of inherited variation and provides a roadmap for the epigenomic dissection of causal regulatory variation in disease.
3.3.2 Introduction
Alzheimer’s disease (AD) and Parkinson’s disease (PD) affect 50 and 10 million individuals worldwide, respectively, and are two of the most common neurodegenerative disorders. Several large consortia have assembled genome-wide association studies (GWAS) that associate genetic loci with clinical diagnoses of probable AD dementia[237, 208, 241, 48] or probable PD[349, 93, 327], or with their characteristic pathologic features. These efforts have led to the identification of dozens of potential risk loci for these prevalent neurodegenerative diseases. However, most risk loci reside in noncoding regions, so it remains unclear whether the nominated (often nearest) gene is the functional disease-relevant gene, or whether some other gene is involved[161].
Most functional noncoding SNPs would be predicted to exert their effects through alteration of gene expression via perturbation of transcription factor binding and regulatory element function[161]. Such regulatory elements are highly cell type-specific[331], suggesting that the resultant effects of noncoding SNPs would be equally cell type-specific. Thus, comprehensive nomination of putative functional noncoding SNPs in the brain requires cataloging the regulatory elements that are active in every brain cell type in the correct organismal and regional context. These critical data hold the promise to illuminate the functional significance of genetic risk loci in the molecular pathogenesis of common neurodegenerative diseases.
Previous work has carefully mapped such cell type-specific gene regulatory landscapes in human brain, predominantly during early developmental time points[267], in organoid culture systems[22, 469, 332], or in induced pluripotent stem cell-derived cellular models[437, 382]. Additional studies have profiled chromatin accessibility in macrodissected post-mortem adult human brain[158, 159, 73, 114]. Such data sets have provided a rich resource for the nomination of putative functional SNPs in neurologic disease using multi-omic approaches[267, 437, 159, 414]. Moreover, recent work has profiled chromatin accessibility and three-dimensional chromatin conformation in primary brain cell types from resected pediatric brain tissue to explore the roles of noncoding SNPs in AD[114]. Lastly, innovative analytical approaches, for example leveraging machine learning (ML), have greatly expanded our ability to predict the functional effects of noncoding SNPs[249, 423, 23, 475]. Cumulatively, this work has provided important advances in our understanding of the role of noncoding SNPs in disease predisposition, particularly in neurological disease.
Here, we build on the current understanding of inherited variation in neurodegenerative disease through implementation of a multi-omic framework that enables accurate prediction of functional noncoding SNPs. This framework layers bulk Assay for Transposase-accessible chromatin using sequencing (ATAC-seq)[76], single-cell ATAC-seq (scATAC-seq)[404], and HiChIP enhancer connectome[322, 321] data over a ML classifier to predict putative functional SNPs driving association with neurodegenerative diseases. Through these efforts, we pinpoint putative target genes and cell types of several noncoding GWAS loci in AD and PD, providing a roadmap for application of this data and technology to any neurological disorder and enabling a more comprehensive understanding of the role of inherited noncoding variation in disease.
3.3.3 Results
Bulk chromatin accessibility landscapes in macrodissected tissue identify brain regional epigenomic heterogeneity
We profiled the bulk chromatin accessibility landscapes of 7 macrodissected brain regions across 39 cognitively healthy individuals to characterize the role of the noncoding genome in neurodegenerative diseases. These brain regions include distinct isocortical regions [superior and middle temporal gyri (SMTG), parietal lobe (PARL), and middle frontal gyrus (MDFG)], striatal regions [caudate nucleus (CAUD) and putamen (PTMN)], the hippocampus (HIPP), and the substantia nigra (SUNI) (3.25a; see 3.3.12). From these bulk ATAC-seq libraries, we compiled a merged set of 186,559 reproducible peaks (3.25b). Here, a reproducible peak is defined as any peak that is called in at least 30% of the bulk ATAC-seq samples from any given brain region (D.1a; see 3.3.12). Dimensionality reduction via t-distributed stochastic neighbor embedding (t-SNE) identified 4 distinct clusters of samples, grouped roughly by major brain region (3.25c). While many region-specific peaks in chromatin accessibility could be identified from this bulk ATAC-seq data, most of these peaks corresponded to cell types predominantly present in a single region (3.25d). A detailed analysis of this bulk ATAC-seq data primarily revealed region-specific differences in chromatin accessibility (D.1b-h and D.2).
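The reproducibility criterion stated above (a peak called in at least 30% of samples from any one brain region) can be illustrated with a minimal sketch. It assumes peak calls have already been reduced to a set of peak identifiers per sample; the function name `reproducible_peaks`, the input layout, and the toy data are hypothetical and are not taken from the pipeline described in 3.3.12.

```python
from collections import defaultdict

def reproducible_peaks(calls, regions, min_fraction=0.3):
    """Return peaks called in at least `min_fraction` of samples from any one region.

    calls   : dict mapping sample id -> set of peak ids called in that sample
    regions : dict mapping sample id -> brain region label
    """
    samples_by_region = defaultdict(list)
    for sample, region in regions.items():
        samples_by_region[region].append(sample)

    keep = set()
    for region, samples in samples_by_region.items():
        # Count, within this region only, how many samples support each peak.
        counts = defaultdict(int)
        for sample in samples:
            for peak in calls[sample]:
                counts[peak] += 1
        for peak, n in counts.items():
            if n / len(samples) >= min_fraction:
                keep.add(peak)
    return keep

# Toy input (hypothetical): four caudate samples and two hippocampus samples.
calls = {
    "c1": {"p1", "p2"}, "c2": {"p1"}, "c3": {"p3"}, "c4": set(),
    "h1": {"p4"}, "h2": set(),
}
regions = {"c1": "CAUD", "c2": "CAUD", "c3": "CAUD", "c4": "CAUD",
           "h1": "HIPP", "h2": "HIPP"}
merged = reproducible_peaks(calls, regions)
```

In this toy input, `p1` (2 of 4 caudate samples) and `p4` (1 of 2 hippocampus samples) pass the threshold, whereas `p2` and `p3` (1 of 4 samples each) do not. Because the threshold is applied per region, a peak restricted to a single region can still enter the merged set.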

Figure 3.22: GECCO SVM candidate functional variant:rs6089354

Figure 3.23: MPRA validation for GECCO fine-mapped variant rs7130173. A) MPRA results for bashing the 200 bp region centered on rs7130173. B) MACS2 fold change signal tracks in DNASE (orange) and H3K27ac (blue) for the five GECCO datasets. C) CNN regression model predictions for the four variant alleles. D) DeepLIFT scores (bottom track) and MPRA scores (top track) for saturated mutagenesis of the 200 bp region centered on rs7130173. E) DeepLIFT delta track.

Figure 3.24: MPRA validation for GECCO fine-mapped variant rs72685323. A) MPRA luciferase assay results. B) MACS2 fold change signal tracks for DNASE (orange) and H3K27ac (blue) datasets. C) Regression CNN predictions for the 4 alleles. D) DeepLIFT tracks for all four alleles of rs72685323, with delta track for alternate G minus reference C allele in the bottom row.

Figure 3.25: a, Schematic of the brain regions profiled in this study. Indicated colors are used throughout. b, Bar plot showing the number of reproducible peaks identified from samples in each brain region. c, t-SNE dimensionality reduction of bulk ATAC-seq data showing all samples profiled in this study, colored by the region of the brain from which the data were generated. Each dot represents a single piece of tissue with technical replicates merged where applicable. d, Sequencing tracks of region-specific ATAC-seq peaks. e, Left; UMAP dimensionality reduction after iterative LSI of scATAC-seq data from 10 different samples. Each dot represents a single cell (N = 70,631). Dots are colored by their corresponding cluster. Right; Bar plot showing the number of cells per cluster. Each cluster is labeled to the right of the bar plot and the predicted cell type corresponding to each cluster is shown by color. f, The same UMAP dimensionality reduction shown in 3.25e but each cell is colored by its gene activity score for the annotated lineage-defining gene. Gene activity scores were imputed using MAGIC. Grey represents the minimum gene activity score while purple represents the maximum gene activity score for the given gene. The minimum and maximum scores are shown in the bottom left of each panel. The gene of interest and the cell type that it identified are shown in the upper left of each panel. g, Bar plot showing the overlap of bulk ATAC-seq and scATAC-seq peak calls. “Bulk ATAC-seq” represents the number of peaks from the bulk ATAC-seq merged peak set that are overlapped by a peak called in our scATAC-seq merged peak set. “Single-cell ATAC-seq” represents the number of peaks from our scATAC-seq merged peak set that are overlapped by a peak called in our bulk ATAC-seq merged peak set. Overlap is considered as any overlapping bases. h, Heatmap representation of binarized peaks (N = 221,062) from the scATAC-seq peak set. 
Each row represents an individual pseudo-bulk replicate (3 per cell type) and each column represents an individual peak. Feature groups, sets of peaks that are uniquely accessible within a specific cell class or group of cell classes, containing fewer than 1000 peaks are not displayed. Heatmap color represents the column-wise Z-score of normalized chromatin accessibility at the peak region across all pseudo-replicates. i, Bar plot of the percent of peaks from the scATAC-seq binarized peak set that overlap peaks identified by bulk ATAC-seq (“Overlap Bulk”) or are uniquely identified by scATAC-seq (“scATAC Only”). j, Motif enrichments of binarized peaks identified in 3.25h. Motif enrichment is tested within all peaks identified as significant in the given cell class (Supplementary Table 5). Due to redundancy in motifs, TF drivers were predicted using the average gene expression in GTEx brain samples and accessibility at TF promoters in cell class-grouped scATAC-seq profiles. The final list of TFs represents a trimmed set of all TFs with the most likely driving TF labeled below. Color represents the –log10(adjusted p-value) of the hypergeometric test for motif enrichment. k, Footprinting analysis of the SPI1 (left; CIS-BP M6484 1.02) and JUN/FOS (right; CIS-BP M4625 1.02) transcription factors across the 6 major cell classes. The motif logos are shown above and the Tn5 transposase insertion biases are shown below.
3.3.4 Single-cell ATAC-seq captures regional and cell type-specific heterogeneity
While many region-specific peaks in chromatin accessibility could be identified from this bulk ATAC-seq data, most of these peaks unsurprisingly corresponded to cell types predominantly present in a single region. For example, region-specific chromatin accessibility was observed at the dopamine receptor D2 (DRD2) gene in the striatum, corresponding to medium spiny neurons[446], the Iroquois homeobox 3 (IRX3) gene in the substantia nigra, corresponding to diencephalic-origin astrocytes[235], or the potassium voltage-gated channel modifier subfamily S member 1 (KCNS1) gene in the isocortex, corresponding to various neuronal populations[232] (3.25d and D.1h). To better understand brain-regional cell type-specific chromatin accessibility landscapes, we performed single-cell chromatin accessibility profiling in 10 samples spanning the isocortex (N=3), striatum (N=3), hippocampus (N=2), and substantia nigra (N=2). In total, we profiled chromatin accessibility in 70,631 individual cells (D.1e) after stringent quality control filtration (D.2a). Unbiased iterative clustering[404, 172] and Harmony-based batch correction of these single cells identified 24 distinct clusters (3.25e) which were assigned to known brain cell types based on gene activity scores compiled from chromatin accessibility signal in the vicinity of key lineage-defining genes[172, 363] (3.25f; see 3.3.12). These gene activity scores are based on the observation that chromatin accessibility within the gene body, at the promoter, and at distal regulatory elements is correlated with gene expression[326, 433, 408, 149]. Additionally, 13 of the 24 clusters showed regional specificity with some clusters composed almost entirely from a single brain region. We did not identify any clusters that were clearly segregated by gender but the sample size used in this study was not powered to make such a determination. 
Cumulatively, we defined 8 distinct cell classes, including the 6 main brain cell types (excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, astrocytes, and OPCs), and identified one cluster (Cluster 18) as putative doublets that we excluded from downstream analyses (3.25e). These cell groupings varied largely in the total number of cells per grouping and showed distinct donor and regional compositions.
Using these clusters, we then called peaks from scATAC-seq pseudo-bulk chromatin accessibility to create a union set of 359,022 reproducible peaks. Overall, 89% of bulk ATAC-seq peaks were overlapped by a peak called in the scATAC-seq data (3.25g). Conversely, only 34% of scATAC-seq peaks were overlapped by a peak from the bulk ATAC-seq peak set (3.25g). Consistent with a role for distal regulatory elements in cell type-specific gene regulation[115], we found an enrichment in distal/intronic peaks and a depletion in promoter peaks in the peak set specifically identified via scATAC-seq. To better understand the cell type specificity of our scATAC-seq peaks, we identified cell type-specific peaks through “feature binarization”, which identifies peaks that are uniquely accessible in a single cell type or subset of cell types33. This analysis identified 221,062 highly cell type-specific peaks within the 6 primary brain cell types, comprising ≥60% of all peaks identified from our scATAC-seq data (3.25h). These cell type-specific peaks were also enriched for distal/intronic peaks and depleted for promoter peaks. Some of these peaks were shared across the different neuronal cell types while others were shared across astrocytes, OPCs, and oligodendrocytes (3.25h). However, 48% of peaks called in our single-cell ATAC-seq data were specific to a single cell type (N=172,111 peaks; 3.25h) with the vast majority of these cell type-specific peaks remaining undetected in our bulk ATAC-seq analyses. Consistent with previous work[308], we found an enrichment of peaks from less abundant cell types (less than 20% of cells; i.e. microglia, astrocytes, and OPCs) within the set of peaks identified via scATAC-seq but not bulk ATAC-seq (3.25i). Similarly, examining per-cell accessibility at the peaks specifically identified via scATAC-seq, we found significantly fewer cells supporting these peaks. 
These results highlight the utility of single-cell methods when cell type-specific peaks are difficult to identify from bulk tissues containing multiple distinct cell types at varying frequencies.
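The asymmetry between the two overlap percentages reported above (89% of bulk peaks overlapped by a scATAC-seq peak versus 34% in the reverse direction) follows from the overlap fraction being directional. A minimal sketch, assuming peaks are given as half-open `(chrom, start, end)` tuples and that overlap means any shared base, as stated in the caption of 3.25g; the function name and the toy coordinates are hypothetical.

```python
def fraction_overlapped(query, reference):
    """Fraction of `query` intervals sharing at least one base with any `reference` interval.

    Intervals are (chrom, start, end) tuples with half-open coordinates [start, end).
    """
    by_chrom = {}
    for chrom, start, end in reference:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = 0
    for chrom, start, end in query:
        # Two half-open intervals overlap when each one starts before the other ends.
        if any(start < r_end and r_start < end
               for r_start, r_end in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(query) if query else 0.0

# Toy peak sets (hypothetical coordinates): the single-cell set is larger.
bulk = [("chr1", 100, 600), ("chr1", 2000, 2500)]
single_cell = [("chr1", 500, 1000), ("chr1", 2100, 2300),
               ("chr2", 50, 550), ("chr2", 900, 1400)]

bulk_overlapped = fraction_overlapped(bulk, single_cell)
single_cell_overlapped = fraction_overlapped(single_cell, bulk)
```

Here every bulk peak is overlapped by a single-cell peak, but only half of the single-cell peaks are overlapped by a bulk peak. The linear scan is quadratic in the worst case; sorted intervals or an interval tree would be preferable at the scale of hundreds of thousands of peaks.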
To predict which TFs may be responsible for establishing and maintaining these cell type-specific regulatory programs, we performed motif enrichment analyses of peaks specific to each cell type (3.25j). We identified many known drivers of cell type identity, such as motifs specific to SOX9 and SOX10 in oligodendrocytes[446, 235], or to ASCL1 in OPCs[232, 326]. Lastly, TF footprinting from our scATAC-seq-derived cell type-specific chromatin accessibility data showed enrichment of binding of key lineage-defining TFs such as SPI1 in microglia[433] and JUN/FOS in neurons[408] (3.25k). Notably, the three isocortical samples, derived from distinct brain regions, showed high similarity based on Pearson correlation, supporting their use as biological replicates. These data provide reference cell profiles for cell type-specific deconvolution of bulk ATAC-seq data (D.3, D.3) and identify brain regional heterogeneity in glial cells, such as astrocytes and OPCs (D.4 and D.4).
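Motif enrichment of this kind is scored with a hypergeometric test (see the caption of 3.25j). The following is a minimal sketch of the upper-tail p-value for a single motif, assuming the background is the full peak set and the foreground is the set of peaks specific to one cell type. The function name and the toy counts are hypothetical, and multiple-testing adjustment is not shown.

```python
from math import comb

def hypergeom_enrichment_p(total_peaks, peaks_with_motif,
                           selected_peaks, selected_with_motif):
    """Upper-tail hypergeometric p-value, P(X >= selected_with_motif).

    total_peaks         : number of peaks in the background set
    peaks_with_motif    : background peaks containing the motif
    selected_peaks      : number of cell type-specific (foreground) peaks
    selected_with_motif : foreground peaks containing the motif
    """
    max_k = min(peaks_with_motif, selected_peaks)
    denominator = comb(total_peaks, selected_peaks)
    tail = sum(
        comb(peaks_with_motif, k)
        * comb(total_peaks - peaks_with_motif, selected_peaks - k)
        for k in range(selected_with_motif, max_k + 1)
    )
    return tail / denominator

# Toy counts (hypothetical): 5 of 20 peaks carry the motif; 3 of 4 selected peaks do.
p_value = hypergeom_enrichment_p(total_peaks=20, peaks_with_motif=5,
                                 selected_peaks=4, selected_with_motif=3)
```

For the toy counts the tail sums to 155 of 4,845 possible draws, giving a p-value of about 0.032. Exact integer arithmetic via `math.comb` avoids floating-point underflow for small examples; large peak sets would typically use a log-space implementation.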
3.3.5 scATAC-seq identifies diverse neuronal subpopulations
Given the well-understood diversity of neuronal types and functions, we sought to further subdivide our scATAC-seq data based on neuronal subtypes. Extracting all cells previously labeled as neurons (Clusters 1-7, 11, and 12; N = 21,116 cells), we performed unbiased iterative clustering followed by Harmony-based batch correction, identifying 30 discrete neuronal clusters (3.26a). For clarity, these are referred to as “neuronal clusters” to avoid confusion with the 24 clusters identified in our broad analysis above. Each neuronal cluster was interpreted to represent a unique neuronal cell type or cell state and annotated using gene activity scores for key lineage-defining genes (3.26b). This identified both broad neuronal classes and very granular neuronal subdivisions, even discriminating between striatopallidal (Neuronal Clusters 11-12) and striatonigral (Neuronal Cluster 21) medium spiny neurons, which both reside within the striatum but project to different brain areas (3.26a). These data identified neuronal cell class-specific peaks, genes, and transcription factor activity (D.5 and D.5). While this analysis did identify a neuronal cluster corresponding predominantly to substantia nigra dopaminergic neurons (Neuronal Cluster 7), a key cell type lost in PD, we derived a more refined subset of tyrosine hydroxylase (TH)-positive dopaminergic neurons by sub-clustering only cells from the two substantia nigra samples (N = 403 dopaminergic neurons).
3.3.6 Single-cell ATAC-seq pinpoints the cellular targets of GWAS polymorphisms
To understand if any particular cell type-specific regions of chromatin accessibility were enriched for neurodegenerative disease-associated SNPs, we performed LD score regression41 using a collection of relevant GWAS studies. Within the peak regions of our broad cell classes, cell type-specific LD score regression revealed a significant increase in per-SNP heritability for AD in the microglia peak set, reinforcing previous studies[208, 190, 138] (3.26c). Similar analyses in PD showed no significant enrichment in SNP heritability in any particular cell type, perhaps because the cellular bases of PD are more heterogeneous than AD (3.26c). Though not a focus of the current study, we note that the data generated here can be used to inform the cellular ontogeny of any brain-related GWAS (3.26c). We also confirmed that the heritability of GWAS SNPs from traits not directly related to brain cell types, such as lean body mass and coronary artery disease, was not significantly enriched in any of the tested brain cell types. To ensure that the lack of significance in cell class-specific peaks was not due to obfuscation of neuronal sub-types, we performed the same LD score regression analyses within the peak regions for the neuronal cell classes identified through sub-clustering (3.26d). This analysis confirmed our previous findings and showed no significant enrichment for AD or PD SNPs within the peak regions of any neuronal sub-classes (3.26d).
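The significance calls in these enrichment analyses rest on a Bonferroni-corrected threshold adjusted for the number of cell classes tested (see the caption of 3.26c-d). A minimal sketch of that correction follows; the family-wise error rate of 0.05, the function name, and the p-values are all assumptions for illustration and are not results from this study.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at a Bonferroni-corrected threshold.

    p_values : dict mapping a cell class label -> enrichment p-value
    alpha    : family-wise error rate (0.05 is an assumed default)
    """
    threshold = alpha / len(p_values)
    flags = {name: p <= threshold for name, p in p_values.items()}
    return threshold, flags

# Hypothetical p-values for six generic cell classes (illustrative only).
example = {"class_A": 0.001, "class_B": 0.02, "class_C": 0.2,
           "class_D": 0.5, "class_E": 0.04, "class_F": 0.9}
threshold, flags = bonferroni_significant(example)
```

With six tests the threshold becomes 0.05 / 6, roughly 0.0083, so only `class_A` passes even though three of the nominal p-values fall below 0.05.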

Figure 3.26: a, Left; UMAP dimensionality reduction after iterative LSI of scATAC-seq data from neuronal cells from 10 different samples. Each dot represents a single cell (N = 21,116). Dots are colored by their corresponding neuronal sub-cluster. Neuronal cluster numbers are overlaid on the UMAP above each neuronal cluster centroid. Right; Bar plot showing the number of cells per cluster. Each neuronal cluster sub-annotation is labeled to the right of the bar plot and indicated by color. b, The same UMAP dimensionality reduction shown in 3.26a but each cell is colored by its gene activity score for the annotated lineage-defining gene. Gene activity scores were imputed using MAGIC. Grey represents the minimum gene activity score while purple represents the maximum gene activity score for the given gene. The minimum and maximum scores are shown in the bottom left of each panel. The gene of interest is shown in the upper right of each panel. c-d, LD score regression identifying the enrichment of GWAS SNPs from various brain-related and non-brain-related conditions in the peak regions of various (c) cell classes from the broad scATAC-seq clustering or (d) neuronal cell classes identified from the neuronal sub-clustering analysis. Peaks were identified from pseudo-bulk replicates of each of the annotated cell classes. The dotted line represents the Bonferroni-corrected significance threshold, adjusted for the number of cell classes tested. The size of the point for each cell class indicates whether this cell class passes the Bonferroni-corrected significance threshold (larger) or not (smaller).
3.3.7 Identification of putative enhancer-promoter interactions through chromatin conformation and cell type-specific co-accessibility
While our scATAC-seq data would enable us to identify the target cell types of functional noncoding SNPs, we sought to additionally identify the target genes of each GWAS locus. To do this, we mapped the enhancer-centric three-dimensional (3D) chromatin architecture in multiple brain regions using HiChIP28 for histone H3 lysine 27 acetylation (H3K27ac) which marks active enhancers and promoters (3.27a). In total, we generated 3D interaction maps for 6 of the 7 regions profiled by ATAC-seq (putamen was excluded given the high overlap with the caudate nucleus), averaging 158 million valid interaction pairs identified per region. We identified 833,975 predicted 3D interactions across all brain regions profiled, of which 331,730 (40%) were reproducible in at least two brain regions. Of these loops, 67.4% had an ATAC-seq peak present in both anchors, 29.2% had an ATAC-seq peak present in one anchor, and 3.4% did not overlap any ATAC-seq peaks identified in either the bulk or scATAC-seq datasets.
Additionally, correlated variation of chromatin accessibility in peaks across single cells has been shown to predict functional interactions between regulatory elements[363, 74]. Using this co-accessibility framework, we predicted regulatory interactions from our scATAC-seq data from the variation across all cells, identifying 2,822,924 putative pairwise interactions between regions of chromatin accessibility. This set of interactions showed only moderate overlap (~20%) with our HiChIP data, consistent with the ability of this technique to identify cell type-specific regulatory interactions, whereas HiChIP of bulk brain tissue is better suited for identification of more shared regulatory interactions. Together, these two techniques define a compendium of putative regulatory interactions in the various brain regions studied here, thus enabling downstream linkage of GWAS SNPs to putative target genes.
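The core idea of co-accessibility, correlating accessibility at pairs of nearby peaks across cells, can be sketched as follows. This is an illustration under simplifying assumptions rather than the method used in this study: it uses plain Pearson correlation on a single chromosome, whereas real pipelines may aggregate similar cells and regularize the estimates. The function names, the distance window, and the correlation cutoff are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def coaccessible_pairs(accessibility, positions, max_distance, min_corr):
    """List peak pairs within `max_distance` bp whose correlation is at least `min_corr`.

    accessibility : dict peak id -> per-cell accessibility values (same cell order)
    positions     : dict peak id -> peak midpoint on one chromosome
    """
    peaks = sorted(accessibility, key=lambda p: positions[p])
    pairs = []
    for i, first in enumerate(peaks):
        for second in peaks[i + 1:]:
            if positions[second] - positions[first] > max_distance:
                break  # peaks are sorted, so later ones are even farther away
            r = pearson(accessibility[first], accessibility[second])
            if r >= min_corr:
                pairs.append((first, second, r))
    return pairs

# Toy binarized accessibility across six cells (hypothetical values).
accessibility = {
    "enhancer": [1, 0, 1, 0, 1, 0],
    "promoter": [1, 0, 1, 0, 1, 1],
    "other":    [0, 1, 0, 1, 0, 1],
}
positions = {"enhancer": 1000, "promoter": 51000, "other": 90000}
pairs = coaccessible_pairs(accessibility, positions,
                           max_distance=250000, min_corr=0.5)
```

Only the enhancer-promoter pair is retained in this toy input; the third peak is anti-correlated with both and is therefore not linked.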
3.3.8 A tiered multi-omic approach to predicting functional noncoding SNPs
To annotate functional effects of GWAS polymorphisms, we first compiled a comprehensive set of putative disease-relevant SNPs in AD and PD, taking into account the propensity of nearby SNPs to be co-inherited based on linkage disequilibrium (LD). We identified (i) any SNPs passing genome-wide significance (p ≤ 5e-8) in recent GWAS1–3,5–7, (ii) any SNPs exhibiting colocalization of GWAS and eQTL signal (FINEMAP/eCAVIAR colocalization posterior probability ≥ 0.01), and (iii) any SNPs in linkage disequilibrium with a SNP in the previous two categories, based on an LD R² value greater than or equal to 0.8 calculated from Phase 1 genotypes of individuals of European ancestry in the 1000 Genomes dataset (see 3.3.12). In total, this identified 9,707 SNPs including 3,245 unique SNPs across 44 loci associated with AD and 6,496 across 86 loci associated with PD, with a single locus containing 34 SNPs appearing in both diseases.
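The LD expansion in step (iii) and the SNP tally can be illustrated with a short sketch. It assumes phased haplotypes coded 0/1 and uses the standard definition r² = D² / (p_A(1 − p_A) p_B(1 − p_B)). The function names and toy haplotypes are hypothetical; the study itself derived LD from 1000 Genomes Phase 1 genotypes (see 3.3.12).

```python
def ld_r2(hap_a, hap_b):
    """r^2 between two biallelic SNPs from phased haplotypes coded 0/1."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denominator if denominator > 0 else 0.0

def expand_by_ld(seed_snps, haplotypes, threshold=0.8):
    """Add every SNP whose r^2 with any seed SNP is at least `threshold`."""
    selected = set(seed_snps)
    for seed in seed_snps:
        for snp, hap in haplotypes.items():
            if ld_r2(haplotypes[seed], hap) >= threshold:
                selected.add(snp)
    return selected

# Toy phased haplotypes for eight chromosomes (hypothetical).
haplotypes = {
    "lead":  [1, 1, 0, 0, 1, 0, 1, 0],
    "proxy": [1, 1, 0, 0, 1, 0, 1, 0],  # perfectly correlated with the lead SNP
    "weak":  [1, 0, 1, 0, 1, 0, 1, 0],  # r^2 = 0.25 with the lead SNP
}
candidate_set = expand_by_ld(["lead"], haplotypes)

# Tally reported in the text: AD SNPs + PD SNPs - SNPs shared by the one common locus.
total_snps = 3245 + 6496 - 34
```

The tally reproduces the 9,707 SNPs reported above: the 34 SNPs at the shared locus are counted once rather than twice.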
Using this catalog of putative disease-relevant noncoding polymorphisms, we developed a tiered multi-omic approach to predict functional noncoding GWAS polymorphisms by (i) overlapping these SNPs with peaks of chromatin accessibility in our bulk or scATAC-seq data (Tier 3), (ii) identifying the subset of Tier 3 SNPs that may also affect predicted regulatory interactions (Tier 2), and (iii) predicting which Tier 2 SNPs might directly affect transcription factor binding (Tier 1) (3.27a).
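Because the tiers are nested (Tier 1 within Tier 2 within Tier 3), tier assignment reduces to a short cascade of checks. A minimal sketch, assuming each criterion has already been evaluated to a boolean per SNP; the function and argument names are hypothetical.

```python
def assign_tier(in_peak, in_interaction, alters_tf_binding):
    """Return the most stringent tier a SNP satisfies, or None if it is in no tier.

    in_peak           : SNP overlaps a bulk or scATAC-seq accessibility peak (Tier 3)
    in_interaction    : SNP also lies in a predicted regulatory interaction (Tier 2)
    alters_tf_binding : SNP is also predicted to affect TF binding (Tier 1)
    """
    if not in_peak:
        return None
    if not in_interaction:
        return 3
    if not alters_tf_binding:
        return 2
    return 1

tiers = [
    assign_tier(True, True, True),    # satisfies all three criteria
    assign_tier(True, True, False),   # in a peak and an interaction only
    assign_tier(True, False, True),   # binding evidence is ignored without Tier 2 support
    assign_tier(False, True, True),   # outside any peak, so not tiered
]
```

The third case shows the consequence of nesting: a SNP with a predicted effect on transcription factor binding but no supporting regulatory interaction remains in Tier 3.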
To predict these Tier 1 SNPs that might directly affect transcription factor binding, we implemented a ML framework to score the allelic effect of a SNP on chromatin accessibility. Using the gapped k-mer support vector machine (gkm-SVM) framework[165], we trained predictive regulatory sequence models of chromatin accessibility from each of the 24 broad clusters derived from our scATAC-seq data (3.27b; see 3.3.12). The gkm-SVM models for all 24 scATAC-seq clusters exhibited high prediction performance on held-out test sequences and across a 10-fold validation scheme. We used three complementary approaches, GkmExplain[423], in silico mutagenesis[70], and deltaSVM[249] to predict the allelic impact of candidate SNPs on chromatin accessibility in each cluster by providing the sequences corresponding to both alleles of each SNP to the models for each of the 24 clusters. All three approaches showed high concordance of predicted allelic effects across all candidate SNPs.
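In silico mutagenesis and deltaSVM both score a SNP by comparing model output for sequences carrying each allele. The toy sketch below conveys that idea with a hypothetical table of ungapped k-mer weights standing in for a trained model. It is not the gkm-SVM used here, which relies on gapped k-mers and weights learned from accessibility data; the weights, sequences, and function names are illustrative assumptions.

```python
def score_sequence(seq, weights, k):
    """Sum the weights of all k-mers in `seq`; k-mers absent from the table score 0."""
    return sum(weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))

def allelic_delta(flank_left, ref, alt, flank_right, weights, k):
    """Score difference (alt minus ref) over the k-mers that span the SNP position."""
    # Keep only k-1 flanking bases per side so that every scored k-mer contains the SNP.
    left = flank_left[-(k - 1):] if k > 1 else ""
    right = flank_right[:k - 1]
    alt_score = score_sequence(left + alt + right, weights, k)
    ref_score = score_sequence(left + ref + right, weights, k)
    return alt_score - ref_score

# Hypothetical 3-mer weights: positive values favor accessibility.
weights = {"TGA": 1.5, "GAC": 0.5, "TAA": 0.0}
delta = allelic_delta(flank_left="AACT", ref="A", alt="G", flank_right="ACGG",
                      weights=weights, k=3)
```

In this example the alternate allele creates the k-mers `TGA` and `GAC`, so the score rises by 2.0. A negative value would instead indicate a predicted loss of accessibility for the alternate allele.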
As an orthogonal metric for Tier 1 SNPs, we performed allelic imbalance analyses with our bulk ATAC-seq data using the robust allele-specific quantification and quality control (RASQUAL) statistical framework[236]. Allelic imbalance refers to the differential chromatin accessibility observed between two alleles when one allele is more readily bound by a transcription factor.
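The concept of allelic imbalance can be illustrated with a simple exact binomial test on read counts from a single heterozygous sample. This sketch is deliberately much simpler than RASQUAL, which is the framework actually used here, and is intended only to show what departure from balanced accessibility means. The function name and read counts are hypothetical.

```python
from math import comb

def binomial_two_sided_p(ref_count, alt_count, p_ref=0.5):
    """Exact two-sided binomial p-value for allelic read counts at a heterozygous site.

    Under the null hypothesis, each read carries the reference allele
    with probability `p_ref` (0.5 means no imbalance).
    """
    n = ref_count + alt_count
    pmf = [comb(n, k) * p_ref ** k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_count]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical read counts at one heterozygous SNP.
p_imbalanced = binomial_two_sided_p(ref_count=9, alt_count=1)
p_balanced = binomial_two_sided_p(ref_count=5, alt_count=5)
```

With 9 reference reads and 1 variant read, the p-value is 22/1024 (about 0.021), whereas a 5/5 split gives a p-value of 1. A per-sample test of this kind ignores reference mapping bias and overdispersion, and it does not combine information across individuals.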
Using this tiered approach, we identified genes and molecular processes that could be implicated in AD and PD (D.6a-d and D.6). To avoid overinterpretation, we focused our downstream analyses on the subset of GWAS loci that were most likely to involve noncoding regulation based on absence of any LD SNPs in coding regions (D.6e).
3.3.9 Machine learning predicts putative functional SNPs and identifies the molecular ontogeny of disease associations
This multi-omic approach identified two main categories of novel associations within our Tier 1 SNPs: established disease-related genes where the precise causative SNP remains unknown, and novel genes previously not implicated in disease pathogenesis. Many studies have investigated the role of genes such as Phosphatidylinositol Binding Clathrin Assembly Protein (PICALM)[504], Solute Carrier Family 24 Member 4 (SLC24A4)[441], Bridging Integrator 1 (BIN1)[331, 25], and Membrane Spanning 4-Domains A6A (MS4A6A)[286] in AD since their implication in the disease by GWAS. However, it remains unclear which polymorphisms drive these associations. In the case of PICALM, our models predicted a potential functional variant (rs1237999) disrupting a putative FOS/AP1 factor binding site within an oligodendrocyte-specific regulatory element 35-kb upstream of PICALM (3.27c-d). Moreover, rs1237999 showed significant allelic imbalance with the variant (effect) allele showing diminished accessibility in bulk ATAC-seq data from heterozygotes across multiple brain regions (3.27e). Lastly, rs1237999 showed 3D interaction with both PICALM and the EED gene, a polycomb-group family member involved in maintaining a repressive transcriptional state. This expands the potential functional role of this association to a novel gene and specifically points to a role for oligodendrocytes which were not previously implicated in this phenotypic association[504].
Similarly, the SLC24A4 locus harbors a small LD block with 46 SNPs that all reside within an intron of SLC24A4. Previous work has implicated both SLC24A4 and the nearby Ras And Rab Interactor 3 (RIN3) gene in this association but the true mediator remains unclear[396, 244]. Our multi-omic approach identifies a single SNP, rs10130373, which occurs within a microglia-specific peak, disrupts an SPI1 motif, and communicates specifically with the promoter of the RIN3 gene (3.27f-g). This is consistent with the role of RIN3 in the early endocytic pathway which is crucial for microglial function and of particular disease relevance in AD53. We identify similar examples in the BIN1 and MS4A6A loci (D.7).

Figure 3.27: a, Schematic of the overall strategy for tiered identification of putative functional SNPs and their corresponding gene targets. b, Schematic of the gkm-SVM machine learning approach used to predict which noncoding SNPs alter transcription factor binding and chromatin accessibility. c,f, Normalized scATAC-seq-derived pseudo-bulk tracks, H3K27ac HiChIP loop calls, co-accessibility correlations, and publicly available H3K4me3 PLAC-seq loop calls (Nott et al. 2019) in the (c) PICALM gene locus (chr11:85599000-86331000) and (f) SLC24A4 locus (chr14:91998000-92729000). scATAC-seq tracks represent the aggregate signal of all cells from the given cell type and have been normalized to the total number of reads in TSS regions, enabling direct comparison of tracks across cell types. For HiChIP, each line represents a FitHiChIP loop call connecting the points on each end. Red lines contain one anchor overlapping the SNP of interest while grey lines do not. For co-accessibility, only interactions involving the accessible chromatin region of interest are shown. For PLAC-seq, MAPS loop calls from microglia (blue), neurons (orange), and oligodendrocytes (purple) are shown. d,g, GkmExplain importance scores for each base in the 50-bp region surrounding (d) rs1237999 and (g) rs10130373 for the effect and non-effect alleles from the gkm-SVM model corresponding to (d) oligodendrocytes (Cluster 21) and (g) microglia (Cluster 24). The predicted motif affected by the SNP is shown at the bottom and the SNP of interest is highlighted in blue. e, Dot plot showing allelic imbalance at rs1237999. Significance of allelic imbalance was determined by RASQUAL. The bulk ATAC-seq counts determined by WASP and ASEReadCounter for the reference/non-effect (G) allele and variant/effect (A) allele are plotted. Each dot represents an individual bulk ATAC-seq sample (N = 140) colored by the brain region from which the sample was collected. 
Samples where fewer than 3 reads were present to support both the reference and variant allele (i.e. presumed homozygotes or samples with insufficient sequencing depth) are shown in grey. The blue line represents a linear regression of the non-grey points and the grey box represents the 95% confidence interval of that regression.
Moreover, the true promise in studying these noncoding polymorphisms is the identification of novel genes affected by disease-associated variation. The ITIH1 GWAS locus occurs within a 600-kb LD block harboring 317 SNPs and no plausible gene association has been made to date. We nominate rs181391313, a SNP occurring within a putative microglia-specific intronic regulatory element of the Stabilin 1 (STAB1) gene (3.28a). STAB1 is a large transmembrane receptor protein that functions in lymphocyte homing and endocytosis of ligands such as low density lipoprotein, two functions consistent with a role for microglia in PD54. This SNP is predicted to disrupt a KLF4 binding site, consistent with the role of KLF4 in regulation of microglial gene expression[216] (3.28b). Similarly, the KCNIP3 GWAS locus resides in a 300-kb LD block harboring 94 SNPs. Our results identify two putative mediators of this phenotypic association with different functional interpretations (3.28c). First, rs7585473 occurs ≥250 kb upstream of the lead SNP and disrupts an oligodendrocyte-specific SOX6 motif in a peak found to interact with the Myelin and Lymphocyte (MAL) gene, implicated in myelin biogenesis and function (3.28d). Alternatively, we find rs3755519 in a neuronal-specific intronic peak within the KCNIP3 gene with clear interaction with the KCNIP3 gene promoter. While this SNP does not show a robust ML prediction, nor reside within a known motif, significant allelic imbalance supports its predicted functional alteration of transcription factor binding (3.28e). Furthermore, this SNP is associated with KCNIP3 expression in three bulk brain regions from the Genotype and Tissue Expression (GTEx) database (frontal cortex, p = 4.04e-7; hippocampus, p = 1.45e-7; cerebellum, p = 3.47e-8) and fine-mapping analysis places rs3755519 within the 95% credible set of causal SNPs in all three brain regions. 
Together, these SNPs provide competing interpretations of this locus, implicating oligodendrocyte- and neuron-specific functions, and demonstrating the complexities of interpretation of functional noncoding SNPs. We additionally noted that many SNPs appear to disrupt binding sites related to CCCTC-Binding Factor (CTCF; D.7).

Figure 3.28: a,c, Normalized scATAC-seq-derived pseudo-bulk tracks, H3K27ac HiChIP loop calls, co-accessibility correlations, and publicly available H3K4me3 PLAC-seq loop calls (Nott et al. 2019) in (a) the ITIH1 gene locus (chr3:52168000-52890000) or (c) the KCNIP3 locus (chr2:94994000-95394000). scATAC-seq tracks represent the aggregate signal of all cells from the given cell type and have been normalized to the total number of reads in TSS regions, enabling direct comparison of tracks across cell types. For HiChIP, each line represents a FitHiChIP loop call connecting the points on each end. Red lines contain one anchor overlapping the SNP of interest while grey lines do not. For co-accessibility, only interactions involving the accessible chromatin region of interest are shown. For PLAC-seq, MAPS loop calls from microglia (blue), neurons (orange), and oligodendrocytes (purple) are shown. b,d, GkmExplain importance scores for each base in the 50-bp region surrounding (b) rs181391313 or (d) rs7585473 for the effect and non-effect alleles from the gkm-SVM model corresponding to (b) microglia (Cluster 24) or (d) oligodendrocytes (Cluster 21). The predicted motif affected by the SNP is shown at the bottom and the SNP of interest is highlighted in blue. e, Dot plot showing allelic imbalance at rs3755519. Significance of allelic imbalance was determined by RASQUAL. The bulk ATAC-seq counts determined by WASP and ASEReadCounter for the reference/non-effect (A) allele and variant/effect (T) allele are plotted. Each dot represents an individual bulk ATAC-seq sample (N = 140) colored by the brain region from which the sample was collected. Samples where fewer than 3 reads were present to support both the reference and variant allele (i.e. presumed homozygotes or samples with insufficient sequencing depth) are shown in grey. 
The blue line represents a linear regression of the non-grey points and the grey box represents the 95% confidence interval of that regression.
3.3.10 Epigenomic dissection of the MAPT locus explains haplotype-specific changes in local gene expression
One of the strongest PD-associated risk loci is the microtubule-associated protein tau (MAPT) gene, which encodes tau proteins whose pathological, hyperphosphorylated aggregates form neurofibrillary tangles in AD56. However, despite this long-known genetic association, it remains unclear how the MAPT locus may play a role in PD. The MAPT locus is present within a large 1.8-Mb LD block and manifests as two distinct haplotypes, H1 and H2, which differ by (i) ≥2000 SNPs and (ii) a 1-Mb inversion that includes the MAPT gene[442, 523] (3.29a). Previous reports have nominated multiple explanations for how these alterations are associated with PD, including increased MAPT expression in the H1 haplotype[477, 14] (3.29), different ratios of splice isoforms[352, 49, 238], and the use of alternative promoters[198]. We created a haplotype-specific map of chromatin accessibility and 3D chromatin interactions at the MAPT locus (3.29c). Using data from heterozygous H1/H2 individuals, we split reads into H1 and H2 haplotypes based on the presence of one of the 2366 haplotype-divergent SNPs. We tiled the region into non-overlapping 500-bp bins (to avoid biases in peak calling) and performed a Wilcoxon rank sum test to identify regions differentially accessible both between H1/H1 and H2/H2 homozygotes and between split reads from H1/H2 heterozygotes. This identified 28 differentially accessible bins, including an H1-specific putative regulatory element located 68 kb upstream of the MAPT promoter and the promoter of the KAT8 regulatory NSL complex subunit 1 (KANSL1) gene located 330 kb downstream of MAPT (3.29d). Using our HiChIP data, we performed haplotype-specific virtual 4C to determine whether any changes in chromatin accessibility were accompanied by changes in 3D chromatin interaction frequency.
We identified H2-specific 3D interactions between a putative domain boundary upstream of MAPT (labeled "A") and the region surrounding the KANSL1 promoter (labeled "B") spanning a distance of ≥600 kb inside the inversion breakpoints (3.29d). Additionally, the H1-specific putative regulatory element upstream of MAPT showed increased interaction with a second putative regulatory element intronic to MAPT as well as with the MAPT promoter (3.29d).
To better understand how these epigenetic changes impact haplotype-specific gene expression, we used RNA-sequencing data from the GTEx database. In addition to the previously mentioned haplotype-specific differences in MAPT expression (3.29b), we also identified significant changes in gene expression near the largest changes in chromatin accessibility and 3D interaction ("A" and "B"; 3.29e). These increases in gene expression could play a functional role in MAPT haplotype-mediated pathologic changes or, more likely, be a non-functional byproduct of the genomic inversion.
These analyses illuminate how the genomic region inside the MAPT inversion breakpoints differs between the H1 and H2 haplotypes; alternatively, the inversion could alter MAPT gene expression by changing the relative orientation of the MAPT gene to enhancers and promoters outside of the breakpoints. In support of this, we identified a long-distance putative regulatory element located 650 kb upstream of the MAPT gene that showed elevated interaction with the MAPT promoter specifically in the H1 haplotype (3.29f). Indeed, we found multiple neuron-specific putative regulatory elements in this upstream region, consistent with the known neuron-specific expression of MAPT, and an increase in overall 3D interaction between this upstream region and the region surrounding MAPT inside of the inversion breakpoints. Additional studies will be necessary to demonstrate functional effects of these predicted regulatory interactions (3.29g).

Figure 3.29 (previous page): a, Schematic of the MAPT locus (chr17:44905000-46895000) showing all genes, the predicted locations of the inversion breakpoints, and the 2366 haplotype-divergent SNPs used for haplotype-specific analyses. b, Gene expression of the MAPT gene shown as a box plot from GTEx cortex brain samples subdivided based on MAPT haplotype. The lower and upper ends of the box represent the 25th and 75th percentiles and the internal line represents the median. The whiskers represent 1.5 multiplied by the inter-quartile range. Outliers are shown as individual dots. Significance determined by Wilcoxon rank sum test. c, Schematic for the allelic analysis of the MAPT region. Data from homozygous H1 and H2 individuals are directly compared. Data from heterozygous H1/H2 individuals are first split based on the presence of haplotype-divergent SNPs in the reads and then compared. d, HiChIP (top) and bulk ATAC-seq (middle) sequencing tracks of the region representing the MAPT locus inside of the predicted inversion breakpoints (chr17:45510000-46580000; bottom). Each track represents the merge of all available H1 or H2 reads from all heterozygotes. HiChIP and ATAC-seq tracks represent unnormalized data from heterozygotes where reads were split based on haplotype. No normalization was performed because each sample is internally controlled for allelic depth. HiChIP is shown as a virtual 4C plot where the anchor is indicated by a dotted line and the signal represents paired-end tag counts overlapping a 10-kb bin. Regions showing significant haplotype bias in ATAC-seq are marked by an asterisk (Wilcoxon rank sum test). e, GTEx cortex gene expression of genes in the MAPT locus comparing H1 homozygotes. Regions A and B are shown as in Figure 3.29d. *p ≤ 0.05 by Wilcoxon rank sum test after multiple hypothesis correction. f, HiChIP (top) and cell type-specific scATAC-seq (middle) sequencing tracks of the region representing the MAPT locus outside of the predicted inversion breakpoints (bottom). HiChIP tracks for bulk homozygote H1 or H2 samples (normalized based on reads-in-loops) are shown at the top while haplotype-specific tracks from heterozygotes (unnormalized) are shown below. In each HiChIP plot, the anchor represents the MAPT promoter. scATAC-seq tracks represent the aggregate signal of all cells from the given cell type and have been normalized to the total number of reads in TSS regions, enabling direct comparison of tracks across cell types. g, Schematic illustrating the predicted haplotype-specific change in long-distance interaction between the MAPT promoter and the predicted distal regulatory element identified in Figure 3.29d. Regions marked A and B represent the same regions marked in Figure 3.29d-e.
3.3.11 Discussion
Here, we provide a high-resolution epigenetic characterization of the role of inherited noncoding variation in AD and PD. Our integrative multi-omic framework and ML classifier predicted dozens of functional SNPs, nominating gene and cellular targets for each noncoding GWAS locus. These predictions both inform well-studied disease-relevant genes, such as BIN1 in AD, and suggest novel gene-disease associations, such as STAB1 in PD. This greatly expands our understanding of inherited variation in AD and PD and provides a roadmap for epigenomic dissection of noncoding variation in neurodegenerative and other complex genetic diseases.
Together, this multi-omic resource captures the regional and cellular gene regulatory machinery that governs phenotypic expression of noncoding variation, thus allowing the identification of the majority of polymorphisms that could putatively affect gene expression through overlap with peaks of chromatin accessibility (Tier 3). To further refine these putative functional variants, we identified the subset of polymorphisms that could be mapped to gene targets through 3D chromatin interactions or co-accessibility networks (Tier 2). Finally, we employed an ML approach to predict the subset of polymorphisms likely to perturb transcription factor binding and validated these predictions with measurements of allelic imbalance (Tier 1). In total, we implicate 5 times as many genes in the phenotypic association of AD and PD and nominate functional noncoding variants for dozens of previously orphaned GWAS loci. Additionally, through our integrative analysis, we provide a comprehensive epigenetic characterization of the MAPT gene locus. The functional predictions made through our ML classifier and integrative analytical approach greatly expand our understanding of noncoding contributions to AD and PD. More broadly, this work represents a systematic approach to understanding inherited variation in disease and provides an avenue towards the nomination of novel therapeutic targets that previously remained obscured by the complexity of the regulatory machinery of the noncoding genome.
3.3.12 Methods
Code Availability
All custom code used in this work is available in the following GitHub repository: https://github.com/kundajelab/alzheimers_parkinsons
Publicly Available Data Used In This Work
All QTL analysis was performed using GTEx v8. Additionally, we downloaded full-genome summary statistics of GWAS associations for three Alzheimer’s cohorts1–3 and two Parkinson’s cohorts[93, 348]; however, it should be noted that these cohorts are not all mutually exclusive. The Parkinson’s disease full GWAS summary statistics from Chang et al. were obtained through a research agreement with 23andMe. These summary statistics included those generated by 23andMe (N = 6476 PD-affected individuals and 302,042 disease-free controls) but not summary statistics from individuals incorporated into meta-analysis from the original publication. All GWAS data used in this study (except the data protected through our research agreement with 23andMe) has been compiled for ease of reproducibility and is available under doi 10.1101/2020.01.06.896159 here: https://zenodo.org/record/3817811. Additionally, we obtained MAPS-based loop calls directly from published PLAC-seq data from microglia, neurons, and oligodendrocytes[183].
Sequencing
Bulk ATAC-seq and HiChIP libraries were sequenced using an Illumina HiSeq 4000 with paired-end 75-bp reads. Single-cell ATAC-seq libraries were sequenced using an Illumina NovaSeq 6000 with an S4 flow cell with paired-end 99-bp reads.
Sample acquisition and patient consent
Primary brain samples were acquired post-mortem with IRB-approved informed consent from Stanford University, the University of Washington, or Banner Health. Human donor sample sizes were chosen to provide sufficient confidence to validate methodological conclusions. Human brain samples were collected with an average post-mortem interval of 3.9 hours (range 2.0 – 6.9 hours). These brain regions include distinct isocortical regions [superior and middle temporal gyri (SMTG, Brodmann areas 21 and 22), parietal lobe (PARL, Brodmann area 39), and middle frontal gyrus (MDFG, Brodmann area 9)], striatum at the level of the anterior commissure [caudate nucleus (CAUD) and putamen (PTMN)], hippocampus (HIPP) at the level of the lateral geniculate nucleus, and the substantia nigra (SUNI) at the level of the red nucleus. Macrodissected brain regions were flash frozen in liquid nitrogen. Some samples were embedded in Optimal Cutting Temperature (OCT) compound. All samples were stored at -80 degrees C until use. Due to the limiting nature of these primary samples, this unique biological material is not available upon request.
Isolation of nuclei from frozen tissue chunks and bulk ATAC-seq data generation
Nuclei were isolated from frozen tissue as described previously19,33. This protocol, including the transposition reaction, is now available on protocols.io (dx.doi.org/10.17504/protocols.io.6t8herw). Briefly, frozen tissue fragments were Dounce homogenized to create a suspension of nuclei. Nuclei were purified using an iodixanol gradient and washed in resuspension buffer (RSB). Nuclei were counted and, for each replicate, 50,000 nuclei were aliquoted into a separate tube containing RSB with 0.1% Tween-20. Nuclei were pelleted and transposed as described in the protocol linked above according to the Omni-ATAC transposition conditions[114]. Transposed fragments were purified and amplified as described previously[76] with slight modification. Briefly, transposed fragments were pre-amplified for 3 cycles. The concentration of pre-amplified fragments was determined by qPCR and this concentration was used to estimate the total number of cycles required to obtain 160 femtomoles of fragments. A second PCR was performed to amplify the pre-amplified fragments for the desired number of cycles. Final libraries were again purified. Prior to sequencing, libraries were pooled and run on a 6% PAGE gel and excess primers and primer dimers below 125 bp were removed. Libraries were sequenced on an Illumina HiSeq4000 instrument as described above. After isolation and bulk ATAC-seq, remaining nuclei were cryopreserved in BAM Banker (Wako Chemicals) and stored at -80 degrees C for use in other assays such as scATAC-seq and HiChIP.
ATAC-seq Data Processing
The ENCODE DCC ATAC-seq pipeline (doi:10.5281/zenodo.211733) (V1.1.7) was used to process bulk ATAC-seq samples, starting from fastq files. The pipeline was executed with IDR enabled and the IDR threshold set to 0.05. The GRCh38 reference genome assembly was used, keeping only the primary chromosomes chr1 - chr22, chrX, chrY, chrM. The pipeline was executed with ATAQC enabled, using GENCODE version 29 TSS annotations. Biological replicates were analyzed individually, with the two technical replicates for each biological replicate provided as inputs to the "atac.bams" argument of the pipeline. Other arguments to the pipeline were kept at their defaults.
Ancestry determination via PCA analysis on genomic data
Genotyping was performed on the bulk ATAC-seq datasets with bcftools (1.7)[377]. The 'bcftools mpileup' command was executed on individual bulk ATAC-seq filtered bam files to generate read pileups. The output of this command was fed into 'bcftools call' to perform variant calling on the mpileup files. The resulting vcf files were merged with 'bcftools merge', converted to plink 1.967 --bfile format, and filtered to include variants with population minor allele frequency (MAF) greater than or equal to 0.05. Chromosome 1 data from phase 3 of the 1000 Genomes Project68 was downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chr1_GRCh38.genotypes.20170504.vcf.gz. The variants were filtered to those with MAF ≥ 0.05. Variants common to the 1000 Genomes SNP set and the donor SNP set above were identified, and the datasets were merged into a single PLINK binary format (bed) file. This yielded 447,096 SNPs for 2916 individuals in the combined 1000 Genomes / donor dataset. The PLINK --pca command was executed with the subject IDs of the 1000 Genomes Project individuals passed to the '--family --pca-cluster-names' flags to ensure that PCA would be performed on the 1000 Genomes cohort, and that the unknown donors from this study would be projected onto the resulting PCs. Individuals from 1000 Genomes and the donors from this study were jointly plotted along PC1 and PC2, and the ancestry for all donors was set as the ancestry of the closest 1000 Genomes individual in PC space.
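The final assignment step, taking the ancestry of the closest reference individual in PC space, can be illustrated with a minimal sketch. This is not the code used in this work: the function name assign_ancestry, the toy coordinates, and the population labels are hypothetical, and Euclidean distance over (PC1, PC2) is assumed as the notion of "closest".

```python
import math

def assign_ancestry(donor_pcs, reference_pcs, reference_labels):
    """Label each donor with the ancestry of its nearest reference individual.

    donor_pcs / reference_pcs map an individual ID to a (PC1, PC2) tuple;
    reference_labels maps a reference ID to its ancestry label.
    """
    assigned = {}
    for donor_id, coords in donor_pcs.items():
        # Nearest reference individual by Euclidean distance (assumed metric).
        nearest = min(reference_pcs,
                      key=lambda ref_id: math.dist(coords, reference_pcs[ref_id]))
        assigned[donor_id] = reference_labels[nearest]
    return assigned

# Toy, made-up coordinates for illustration only.
reference_pcs = {"ref1": (0.0, 0.0), "ref2": (5.0, 5.0), "ref3": (-4.0, 3.0)}
reference_labels = {"ref1": "EUR", "ref2": "AFR", "ref3": "EAS"}
donor_pcs = {"donorA": (0.4, -0.2), "donorB": (4.1, 5.5)}

ancestry = assign_ancestry(donor_pcs, reference_pcs, reference_labels)
print(ancestry)
```

A single nearest neighbour is sensitive to outlying reference samples; a k-nearest-neighbour vote would be a natural, more conservative variant.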
ATAC-seq Peak Calling
Pipeline peak calls underwent several levels of filtering to identify credible peak sets. The IDR optimal peak set from the DCC pipeline for each biological replicate was determined. It was observed that although the IDR peaks for individual biological replicates were corrected for multiple testing, the high number of biological samples in the dataset served as another source of multiple testing error. To address this source of error, tagAlign files for all biological replicates for a given brain region/condition were concatenated. The DCC pipeline (v1.1.7) was subsequently executed on the merged tagAlign files as single-replicate inputs. The pipeline generated pseudo-replicates from the input tagAlign files for each brain region/condition. Optimal IDR peaks were called from the pseudo-replicates. This set of IDR peaks was filtered to keep peaks supported by 30 percent or more of IDR peaks from the pipeline runs on individual biological replicates.
Sample-by-peak count matrices were then generated from the resulting set of filtered peaks.
Filtered peaks from the pooled tagAlign files were concatenated and truncated to within 200 base pairs of the summit (100 base pair flank kept upstream and downstream of the peak summit). These 200-bp regions were then merged with the bedtools[377] merge command; truncating first avoids merging peaks with low levels of overlap. The bedtools coverage -counts command was used to compute the number of tagAlign reads that overlapped each peak region in the pseudo-replicates in the merged tagAlign dataset.
This analysis yielded a total of n=186,559 peaks combined across the brain regions.
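The truncate-then-merge step can be sketched in plain Python. This is an illustration only, not the bedtools implementation: the function names summit_windows and merge_intervals and the toy coordinates are hypothetical, and the merge rule (combining overlapping or book-ended intervals) is assumed to mirror the default behaviour of bedtools merge.

```python
def summit_windows(summits, flank=100):
    """Return (chrom, start, end) windows of 2*flank bp centred on each summit."""
    return [(chrom, max(0, pos - flank), pos + flank) for chrom, pos in summits]

def merge_intervals(intervals):
    """Merge overlapping or book-ended intervals on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged

# Toy summit positions (hypothetical).
summits = [("chr1", 1000), ("chr1", 1150), ("chr1", 1500), ("chr2", 1000)]
windows = summit_windows(summits)
peaks = merge_intervals(windows)
print(peaks)
```

In this toy example the first two summits are 150 bp apart, so their 200-bp windows overlap and collapse into one region, whereas the summit at position 1500 stays separate.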
Motif enrichment
Motif enrichment was performed using the hypergeometric test as described previously[116, 187].
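For orientation, a generic one-sided hypergeometric enrichment test is sketched below. It shows the standard form of the test rather than the exact implementation of the cited methods; the function name, argument names, and toy counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_with_motif, n_selected, n_selected_with_motif):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    N = n_universe (all peaks), K = n_with_motif (peaks containing the motif),
    n = n_selected (peaks in the set of interest),
    k = n_selected_with_motif (selected peaks containing the motif).
    """
    total = comb(n_universe, n_selected)
    upper = min(n_with_motif, n_selected)
    tail = sum(
        comb(n_with_motif, i) * comb(n_universe - n_with_motif, n_selected - i)
        for i in range(n_selected_with_motif, upper + 1)
    )
    return tail / total

# Toy counts: 10 peaks, 4 carry the motif, 3 peaks selected, 2 of them carry it.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
print(p_value)
```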
Feature Binarization
Identification of "unique" peaks from ATAC-seq data was performed as described previously33. Briefly, for each of the cell classes (termed "groups" here), we created 3 pseudo-bulk replicates, which were used to create a counts matrix of insertion counts within each peak of the scATAC-seq peak set. This counts matrix was then log-normalized using 'edgeR::cpm(mat,log=TRUE,prior.count=3)'. We then calculated the intra-group mean and intra-group standard deviation for every peak in the scATAC-seq peak set. Then, for each peak, we ranked the groups by their intra-group mean and iterated from the second-lowest group, asking whether the mean of that group was greater than the maximum intra-group mean plus the intra-group standard deviation of the next-lowest sample. This iterative process proceeded until a group was identified that met this criterion. This point was defined as the break point, and all groups with a higher intra-group mean were classified as positive for this peak and given a value of "1". All groups below the break point were given a value of "0". If a peak did not have a break point, it was discarded. This peak "binarization" procedure classifies all "1s" as being higher than every individual "0". It also captures the peaks that are unique to multiple groups. We kept all combinations that were unique to 3 or fewer groups. To facilitate multiple hypothesis testing, we computed a contrast matrix for all observed combinations and ran limma’s eBayes test on the log-normalized counts matrix. We then extracted all of the FDR-adjusted p-values from differential testing, keeping those peaks that were below an FDR of 0.001. This resulted in the classification of 221,062 peaks.
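The break-point search for a single peak can be sketched as follows. This is one reading of the procedure, not the original implementation: it assumes that the break point is the first group, in ascending order of mean, whose mean exceeds the mean plus the standard deviation of the group ranked immediately below it, and that this group and all higher-ranked groups are set to 1. The function name binarize_peak and the toy values are hypothetical.

```python
def binarize_peak(group_means, group_sds):
    """Return {group: 0 or 1} for one peak, or None if no break point exists.

    group_means / group_sds map a group name to the intra-group mean and
    standard deviation of log-normalized accessibility at this peak.
    """
    order = sorted(group_means, key=group_means.get)  # ascending by mean
    for i in range(1, len(order)):
        below, current = order[i - 1], order[i]
        # Assumed criterion: clear separation from the next-lowest group.
        if group_means[current] > group_means[below] + group_sds[below]:
            positives = set(order[i:])
            return {group: int(group in positives) for group in order}
    return None  # no break point: the peak would be discarded

# Toy values for one peak (hypothetical).
means = {"microglia": 6.0, "astrocytes": 2.1, "neurons": 2.0, "oligodendrocytes": 2.3}
sds = {"microglia": 0.3, "astrocytes": 0.3, "neurons": 0.3, "oligodendrocytes": 0.3}
calls = binarize_peak(means, sds)
print(calls)
```

Peaks for which the function returns None correspond to peaks without a break point; the subsequent restriction to combinations of 3 or fewer groups and the limma testing are not shown.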
Sequencing Tracks
Sequencing tracks were created using the WashU Epigenome Browser. All sequencing tracks of a given locus have the same y-axis. All tracks show data that has been normalized by "reads-in-peaks" (for ATAC-seq) or "reads-in-loops" (for HiChIP) to account for differences in signal-to-background ratios across multiple samples, unless otherwise stated. For all sequencing tracks, genes that are on the plus strand (i.e. 5’ to 3’ in the left to right direction) are shown in red and genes that are on the minus strand (i.e. 5’ to 3’ in the right to left direction) are shown in blue to enable identification of the TSS.
LD score regression
We applied stratified LD score regression, a method for partitioning heritability from GWAS summary statistics, to sets of cell type-specific ATAC-seq peaks to identify disease-relevant cell types for Alzheimer’s and Parkinson’s diseases along with other brain-related GWAS traits. Using our single-cell ATAC-seq data, peak coordinates were first converted from hg38 to hg19 for analysis with GWAS data. We followed the LD score regression tutorial (https://github.com/bulik/ldsc/wiki) as used previously[149] for single-cell specific analysis[148]. We used brain-related GWAS summary statistics such as Alzheimer’s[237], Parkinson’s[93], Schizophrenia[377], Anorexia Nervosa[134], Attention Deficit Hyperactivity Disorder (ADHD)[124], Anxiety[341], Neuroticism[336] and Epilepsy[31] (https://zenodo.org/record/3817811). To serve as controls, we also used summary statistics for GWAS of traits not obviously linked to brain tissues such as Lean Body Mass[522], Bone Mineral Density[220] and Coronary Artery Disease[194]. In particular, we looked at the regression coefficient p-value, indicative of the contribution of this annotation to trait heritability, conditional on the baseline model described previously[149].
Allele counts from ATAC-seq data
The WASP mapping pipeline (https://github.com/bmvdgeijn/WASP/tree/master/mapping) was used to reduce biases in mapping and in filtering duplicate reads. Reads were mapped using bowtie2 to the UCSC hg38 reference genome. Variants were called on the resulting bam files using bcftools mpileup (v1.9) to produce VCF files. These VCF files and the WASP-corrected bam files were used as input for the GATK ASEReadCounter tool to obtain allele counts and their mapping quality. These allele counts were used to visualize significant allelic imbalance as determined by RASQUAL (see below). For plotting, samples that lacked at least 3 read counts for both the reference and alternate alleles were inferred to be either homozygous or too low coverage to presume heterozygosity. However, we note that these allele counts were only used for display purposes and did not contribute to any determination of significance for allelic imbalance.
Allelic imbalance from ATAC-seq data using RASQUAL
We intersected the coordinates of all LD-expanded candidate AD and PD GWAS and colocalization SNPs with peaks from our ATAC-seq data to obtain the candidate SNPs that we tested for allele-specific effects on chromatin accessibility. We used the createASVCF.sh script from the RASQUAL23 GitHub repository (https://github.com/natsuhiko/rasqual) to obtain the allele-specific counts at each candidate SNP for all samples. We used the fitAseNullMulti function from the QuASAR[185] GitHub repository to calculate for each donor the posterior probability of the three possible genotypes at all of the candidate SNP positions using all available brain region samples from that donor and assigned the genotype at each position to be the one with the highest posterior probability. Next, using these allele-specific counts and genotypes and the allele frequencies from the 1000 Genomes Project[37] for each candidate SNP, we created a VCF file for each brain region, which included the allele-specific counts and genotypes from only the samples that originated from those respective regions. Similarly, we created region-specific counts matrices, which contain columns of ATAC-seq read counts for each feature only from the samples that originated from the respective regions. We also ran the makeOffset.R script from the RASQUAL repository with a list of GC contents, corresponding to the GC content of each feature in the counts matrix, as an argument to generate the sample-specific offset terms file for each brain region. Since RASQUAL is run on each feature from the counts matrix independently of other features, we further split the region-specific input VCF files, counts matrices, and offset files by chromosome and used the text2bin.R script from the RASQUAL repository to convert the region- and chromosome-specific input counts matrices and offset files into the binary format required by RASQUAL.
Finally, we ran RASQUAL using the input VCF file, counts matrix, and offset file from each of the 22 chromosomes (chromosomes 1 – 22; chromosome X and chromosome Y did not have any candidate SNPs) from each of the brain regions and tested each candidate SNP present in each feature in the counts matrix. To test for genome-wide significance of each putative chromatin accessibility QTL (caQTL), we ran RASQUAL with the --random-permutation option along with the same inputs 10 times to generate a background set of null q-values. For each brain region, we used the empirical distribution of null q-values to identify those SNPs that have a q-value lower than the 10% False Discovery Rate (FDR) threshold as significant caQTLs, as recommended by the authors (https://github.com/natsuhiko/rasqual/issues/21).
SNP selection for colocalization testing
A single test for colocalization of GWAS and eQTL association signals involves a locus, a GWAS, an eQTL tissue, and a gene expressed in that tissue. For each GWAS, we selected the set of all loci for which the lead GWAS variant had p-value ≤ 1e-5. Using eQTLs from GTEx brain tissues in the GTEx v8 dataset, we then found all tissue-gene combinations for which the lead SNP at one of the GWAS loci had an eQTL SNP (association p-value ≤ 1e-5) for that gene in that GTEx tissue. This resulted in a list of unique combinations of GWAS trait / genomic locus / eQTL tissue / eQTL gene, each to be tested individually for colocalization of GWAS and eQTL signals. The GWAS threshold of 1e-5 is less stringent than the threshold for genome-wide significance, but we favored sensitivity over specificity when selecting which SNPs to test, since colocalization with a strong eQTL signal may still suggest that a sub-threshold GWAS locus has an expression-mediated effect on disease.
Colocalization analysis
For each colocalization test combination as defined above, we selected all 1000 Genomes Phase 3 variants within a window of 500 kb around the lead GWAS variant. We narrowed this list down to SNPs measured not only in the 1000 Genomes VCF, but also in the GWAS and eQTL summary statistics for the selected trait, tissue, and gene. We used a streamlined version of the FINEMAP tool[451] to compute posterior causal probabilities for each SNP at the locus in both the GWAS and eQTL studies, and then combined these probabilities as described in eCAVIAR[117] to compute a colocalization posterior probability (CLPP) score for this test locus. We considered a SNP weakly colocalized if its CLPP score exceeded 0.01; although this seems like a low probability, we have observed previously that loci exceeding this cutoff show considerable likelihood of sharing causal eQTL and GWAS variants[157], and our goal in this analysis was to be as sensitive as possible in selecting putatively functional loci for subsequent orthogonal analysis steps.
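As a rough illustration of how the two sets of posterior probabilities are combined, the sketch below follows the eCAVIAR definition in which the CLPP of a SNP is the product of its posterior causal probabilities in the GWAS and in the eQTL study. The function name clpp_scores, the SNP identifiers, and the probabilities are hypothetical; summing the per-SNP products into a single locus-level score is shown as an assumption and is not necessarily how the score was aggregated in this work.

```python
def clpp_scores(gwas_posteriors, eqtl_posteriors):
    """Per-SNP CLPP: product of GWAS and eQTL posterior causal probabilities.

    Only SNPs present in both studies are scored.
    """
    shared = gwas_posteriors.keys() & eqtl_posteriors.keys()
    return {snp: gwas_posteriors[snp] * eqtl_posteriors[snp] for snp in shared}

# Toy posterior probabilities (hypothetical fine-mapping output).
gwas = {"snp1": 0.6, "snp2": 0.3, "snp3": 0.1}
eqtl = {"snp1": 0.5, "snp2": 0.1, "snp4": 0.4}

scores = clpp_scores(gwas, eqtl)
locus_clpp = sum(scores.values())  # assumed locus-level aggregate
colocalized = sorted(snp for snp, score in scores.items() if score > 0.01)
print(scores, locus_clpp, colocalized)
```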
Selection of candidate SNPs for ATAC-seq overlap analysis, HiChIP interaction tests, and gkm-SVM model-based allelic effect scores
Our goal was to identify SNPs with a causal effect on any of the selected GWAS traits. To minimize the chances of excluding causal GWAS SNPs, we selected the set of all variants achieving a genome-wide significant p-value ≤ 5e-8 for any GWAS trait. We then added in any lead SNPs from the colocalization analysis that achieved a CLPP score of ≥ 0.01, even those that did not pass the genome-wide significance value of p ≤ 5e-8. We also included all trait-associated SNPs curated from two other Parkinson’s studies[93, 327]. In these studies, full summary statistics were not publicly available for the entire genome because meta-analysis was applied only to the subset of SNPs reaching genome-wide significance in a previous Parkinson’s GWAS. We then computed the full set of SNPs that had LD R2 ≥ 0.8 with at least one of the SNPs in the set selected above. These LD calculations were performed on Phase 1 genotypes of individuals of European ancestry in the 1000 Genomes dataset, provided in full here (https://zenodo.org/record/3404275#.Xlw62XVKhhE). Pairwise LD values of all variants in the above subset were calculated via plink (v.1.90). These pairwise LD values were used to identify 1000 Genomes SNPs with R2 ≥ 0.8 with the SNPs in our dataset. Together, these LD buddies plus the original set of trait-relevant SNPs comprised the set of SNPs tested in our subsequent functional analyses.
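The LD expansion amounts to a one-step lookup in a table of pairwise R2 values, which can be sketched as below. The function name ld_expand, the SNP identifiers, and the R2 values are hypothetical; in practice the pairwise table would come from the plink output described above.

```python
def ld_expand(seed_snps, pairwise_r2, threshold=0.8):
    """Return the seed SNPs plus every SNP with R2 >= threshold to any seed.

    pairwise_r2 maps an unordered SNP pair, given as a tuple, to its R2 value.
    The expansion is a single step: LD buddies of LD buddies are not added.
    """
    seeds = set(seed_snps)
    expanded = set(seeds)
    for (snp_a, snp_b), r2 in pairwise_r2.items():
        if r2 < threshold:
            continue
        if snp_a in seeds:
            expanded.add(snp_b)
        if snp_b in seeds:
            expanded.add(snp_a)
    return expanded

# Toy pairwise LD table (hypothetical values).
pairwise_r2 = {
    ("snpA", "snpC"): 0.92,
    ("snpB", "snpD"): 0.79,
    ("snpE", "snpB"): 0.80,
    ("snpC", "snpF"): 0.95,
}
tested_snps = ld_expand({"snpA", "snpB"}, pairwise_r2)
print(sorted(tested_snps))
```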
Testing GWAS loci for overlap with ATAC-seq peaks
We tested all SNPs in the above set for overlap with ATAC-seq peaks from two different annotation formats. The first annotation consisted of bulk ATAC-seq peaks identified in one of 7 brain regions. The second annotation consisted of cluster-specific peaks from single-cell ATAC-seq data. For each variant selected for functional analysis, we determined all cellular contexts in which an ATAC-seq peak contained this variant, as well as the nearest peak if no peak contained the variant.
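The overlap test can be sketched as a simple interval lookup. This is an illustrative sketch, not the code used here: the function names, the context labels, and the coordinates are hypothetical, and peaks are assumed to be BED-style half-open intervals with 0-based SNP positions.

```python
def peak_distance(pos, start, end):
    """Distance in bp from a position to a half-open [start, end) peak (0 if inside)."""
    if start <= pos < end:
        return 0
    return start - pos if pos < start else pos - (end - 1)

def annotate_snp(chrom, pos, peaks_by_context):
    """Return (contexts whose peaks contain the SNP, nearest peak if none contain it).

    peaks_by_context maps a cellular context to a list of (chrom, start, end) peaks.
    The nearest peak is reported as (distance, context, peak).
    """
    overlapping = []
    nearest = None
    for context, peaks in peaks_by_context.items():
        same_chrom = [peak for peak in peaks if peak[0] == chrom]
        if not same_chrom:
            continue
        best = min(same_chrom, key=lambda peak: peak_distance(pos, peak[1], peak[2]))
        distance = peak_distance(pos, best[1], best[2])
        if distance == 0:
            overlapping.append(context)
        if nearest is None or distance < nearest[0]:
            nearest = (distance, context, best)
    return sorted(overlapping), (None if overlapping else nearest)

# Toy peak sets (hypothetical coordinates).
peaks_by_context = {
    "microglia": [("chr1", 100, 600)],
    "neurons": [("chr1", 1000, 1500)],
}
inside = annotate_snp("chr1", 250, peaks_by_context)
outside = annotate_snp("chr1", 800, peaks_by_context)
print(inside, outside)
```

For genome-scale peak sets, a sorted index or an interval tree would replace the linear scan used in this sketch.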
Single-cell ATAC-seq library generation
Cryopreserved nuclei were thawed on ice and 65,000 nuclei were transferred to a tube containing 1 ml of RSB-T [10 mM Tris-HCl pH 7.5, 10 mM NaCl, 3 mM MgCl2, 0.1% Tween]. Nuclei were pelleted at 500 RCF for 5 minutes at 4 degrees C in a fixed angle rotor. The supernatant was fully removed using two pipetting steps (p1000 to remove down to the last 100 ul, then p200 to remove all remaining supernatant). This pellet was then gently resuspended in 12 ul of 1x Nuclei Buffer (10x Genomics). To transpose, 5 ul of this nuclei suspension (containing 27,000 nuclei) was transferred to a tube containing 10 ul of transposition mix (10x Genomics). This reaction mixture was incubated at 37 degrees C for 1 hour to transpose. The remainder of library generation was completed as described in the 10x Genomics Single Cell ATAC Reagent Kits User Guide (v1 Chemistry).
Single-cell ATAC-seq LSI clustering and visualization
Single-cell ATAC-seq clustering analysis was performed using an alpha version of the ArchR software[171]. To cluster our scATAC-seq data (for both broad clustering and neuronal sub-clustering), we first identified a robust set of peak regions followed by iterative LSI clustering[404, 172]. Briefly, we created 1-kb windows tiled across the genome and determined whether each cell was accessible within each window (binary). Next, we identified the top 50,000 accessible windows across all samples (accounting for GC bias) and performed an LSI dimensionality reduction (TF-IDF transformation followed by Singular Value Decomposition, SVD) on these windows followed by Harmony batch correction[233]. We then performed Seurat[451] clustering (FindClusters v2.3) on the harmonized LSI dimensions at a resolution of 0.8, 0.4 and 0.2, keeping the clustering for which the minimum cluster size was greater than 100 cells (0.2 if this condition is not met). For each cluster, we called peaks on the Tn5-corrected insertions (each end of the Tn5-corrected fragments) using the MACS2 callpeak command with parameters '--shift -75 --extsize 150 --nomodel --call-summits --nolambda --keep-dup all -q 0.05'. The peak summits were then extended by 250 bp on either side to a final width of 501 bp, filtered by the ENCODE hg38 blacklist (https://www.encodeproject.org/annotations/ENCSR636HFF/), and filtered to remove peaks that extend beyond the ends of chromosomes. We then created a non-overlapping set of extended summits across all of these peaks as described previously[404, 172].
We then counted the accessibility for each cell in these peak regions to create an accessibility matrix. We then adopted the iterative LSI clustering approach[404, 172] to unbiasedly identify clusters that are due to biological vs technical variation. Briefly, we computed the TF-IDF transformation as described by Cusanovich et al.[117]. To do this, we divided each entry by the colSums of the matrix to compute the cell ”term frequency”. Next, we multiplied these values by log(1+ncol(matrix)/rowSums(matrix)), which represents the ”inverse document frequency”. This yields a TF-IDF matrix that can be used as input to irlba’s SVD implementation in R. We then used Harmony to batch correct the LSI dimensions in R. Using the first 25 reduced dimensions as input into a Seurat object, crude clusters were identified using Seurat’s (v2.3) SNN graph clustering FindClusters function with a resolution of 0.2. We then calculated the cluster sums from the binarized accessibility matrix and then log-normalized using edgeR’s ’cpm(matrix, log=TRUE, prior.count=3)’ in R. Next, we identified the top 25,000 varying peaks across all clusters using ’rowVars’ in R. This was done on the cluster log-normalized matrix rather than the sparse binary matrix because: (1) it reduced biases due to cluster cell sizes, and (2) it attenuated the mean-variability relationship by converting to log space with a scaled prior count. The 25,000 variable peaks were then used to subset the sparse binarized accessibility matrix and recompute the TF-IDF transform. We used SVD on the TF-IDF matrix to generate a lower dimensional representation of the data by retaining the first 25 dimensions. We then used Harmony to batch correct the LSI dimensions in R. We then used these reduced dimensions as input into a Seurat object and crude clusters were identified using Seurat’s (v2.3) SNN graph clustering FindClusters function with a resolution of 0.6. This process was repeated a third time with a resolution of 1.0. Then, these same reduced dimensions were used as input to Seurat’s ’RunUMAP’ with default parameters and plotted in ggplot2 using R.
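The TF-IDF step described above can be sketched in a few lines. This is an illustrative pure-Python version, not the R code used in this work; the function name `tf_idf` and the list-of-lists matrix layout (rows are peaks, columns are cells) are assumptions made for the example, and the SVD and Harmony steps are omitted.

```python
import math


def tf_idf(matrix):
    """TF-IDF transform of a binary peaks-by-cells matrix (list of rows).

    Term frequency: each entry divided by its column (cell) sum.
    Inverse document frequency: log(1 + ncol / rowSum) for each row (peak).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    col_sums = [sum(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]
    transformed = []
    for row in matrix:
        row_sum = sum(row)
        # Peaks with no accessible cells carry no information; score them as 0.
        idf = math.log(1 + n_cols / row_sum) if row_sum > 0 else 0.0
        transformed.append([
            (row[j] / col_sums[j] if col_sums[j] > 0 else 0.0) * idf
            for j in range(n_cols)
        ])
    return transformed


# Toy example: 2 peaks x 2 cells.
toy = tf_idf([[1, 0], [1, 1]])
```

In the toy example, the peak that is open in only one cell receives the larger inverse document frequency weight, which is the intended effect of the transform.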
Single-cell ATAC-seq gene activity scores
Gene activity scores are based on the observation that chromatin accessibility within the gene body, at the promoter, and at distal regulatory elements is correlated with gene expression[172, 363, 171, 157]. Gene scores were calculated using ArchR v0.9.4 with default parameters. Briefly, ArchR infers gene activity scores using a distance-weighted accessibility model that aggregates accessibility signal inside the gene body and in the local genomic region. The resulting gene activity scores were additionally imputed using MAGIC to reduce noise due to scATAC-seq data sparsity.
Identification of clusters and cell types from scATAC-seq data
Different clusters and cell types were manually identified using promoter accessibility and gene activity scores for various lineage-defining genes. Microglia (Cluster 24) were identified based on accessibility near the IBA1, CD14, CD11C, PTGS1, and PTGS2 genes. Astrocytes (Clusters 13-17) were identified based on accessibility near the GFAP and FGFR3 genes. Excitatory neurons (Clusters 1, 3, and 4) were identified based on accessibility near the SLC17A6 and SLC17A7 genes. Inhibitory neurons (Clusters 2, 11, and 12) were identified based on accessibility near the GAD2 and SLC32A1 genes. Medium spiny neurons (most of Cluster 2) were identified based on accessibility near the DARPP32 gene. Oligodendrocytes (Clusters 19-23) were identified based on accessibility near the MAG and SOX10 genes. OPCs (Clusters 8-10) were identified based on accessibility near the PDGFRA gene. All neuronal subsets were identified primarily as neurons based on accessibility near the NEFL, RBFOX3, VGF, and GRIN1 genes and then subdivided based on the region of origin and the accessibility near other genes mentioned above.
Single-cell ATAC-seq peak calling
For scATAC-seq peak calling from clusters or manually defined cell types, all single cells belonging to the given group were pooled together. These pooled fragment files were converted to the paired-end tagAlign format and processed with version 1.4.2 of the ENCODE DCC ATAC-seq pipeline. The conversion to tagAlign was performed as follows. For fragments on the positive strand, the read start coordinate was the fragment start coordinate, zero-indexed. The read end coordinate was the fragment start coordinate plus the read length (99 bp). For fragments on the negative strand, the read start coordinate was the fragment end coordinate, zero-indexed. The read end coordinate was the fragment end coordinate minus the read length (99 bp). Then, these tagAlign files were used as input to the DCC ATAC-seq pipeline. IDR optimal peak sets with an IDR threshold of 0.05 were determined for each cluster by the pipeline, using pseudo-bulk replicate tagAligns for the cluster. Other pipeline parameters were the same as for bulk ATAC-seq data (see above).
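The fragment-to-tagAlign conversion can be illustrated with a minimal sketch. The function name `fragment_to_tagalign` is hypothetical, and two conventions are assumed rather than taken from the text: the record layout (chrom, start, end, "N", 1000, strand) commonly used for tagAlign files, and writing negative-strand intervals with the smaller coordinate first, as BED-style formats require.

```python
READ_LENGTH = 99  # read length stated in the text


def fragment_to_tagalign(chrom, frag_start, frag_end, strand, read_length=READ_LENGTH):
    """Convert one fragment end into a tagAlign-style record (assumed layout)."""
    if strand == "+":
        # Read begins at the fragment start and extends downstream.
        lo, hi = frag_start, frag_start + read_length
    elif strand == "-":
        # Read begins at the fragment end and extends upstream;
        # the interval is stored with the smaller coordinate first.
        lo, hi = max(0, frag_end - read_length), frag_end
    else:
        raise ValueError("strand must be '+' or '-'")
    return (chrom, lo, hi, "N", 1000, strand)


plus_read = fragment_to_tagalign("chr1", 1000, 1200, "+")
minus_read = fragment_to_tagalign("chr1", 1000, 1200, "-")
```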
Single-cell ATAC-seq pseudo-bulk replicate generation and differential accessibility comparisons
For differential comparisons of clusters or cell types, including Pearson correlation determination, non-overlapping pseudo-bulk replicates were generated from groups of cells. For each cell grouping (i.e., a cluster or a cell type), a minimum of 300 cells was required in order to make at least two non-overlapping pseudo-bulk replicates of 150 cells each. A maximum of 3 pseudo-bulk replicates was made per group if the total number of cells per group was greater than 450 cells. Cells were randomly deposited into one of the pseudo-bulk replicates and all available cells were used. In this way, the non-overlapping pseudo-bulk replicates are agnostic to which donor the cell came from but aware of individual cells (i.e., all reads from a given cell are deposited into the same pseudo-bulk replicate). These pseudo-bulk replicates were then used for differential comparisons using DESeq2[282].
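A minimal sketch of the pseudo-bulk replicate assignment follows. The function name and the shuffle-then-deal strategy are assumptions for illustration; the text only states that cells were deposited at random, that all cells were used, and the 300 and 450 cell thresholds.

```python
import random


def make_pseudobulk_replicates(cell_ids, min_cells=300, three_rep_threshold=450, seed=0):
    """Split the cells of one group into non-overlapping pseudo-bulk replicates.

    Returns None when the group is too small to be used.
    """
    cells = list(cell_ids)
    if len(cells) < min_cells:
        return None
    n_reps = 3 if len(cells) > three_rep_threshold else 2
    rng = random.Random(seed)
    rng.shuffle(cells)
    # Deal shuffled cells round-robin so every cell lands in exactly one replicate.
    return [cells[i::n_reps] for i in range(n_reps)]


reps_small = make_pseudobulk_replicates(range(300))
reps_large = make_pseudobulk_replicates(range(451))
```

Because whole cells are assigned, all reads from a given cell end up in the same replicate, as described above.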
Identification of neuronal cell class-specific peaks, TF motifs, and genes
ArchR (version 0.9.4) was used to call peaks (using ”addReproduciblePeakSet”) and identify cell class-specific peaks and genes (using ”getMarkerFeatures”). The cell class-specific peaks were tested for motif enrichment (using ”peakAnnoEnrichment”).
CIBERSORT deconvolution
CIBERSORT was used to deconvolve bulk ATAC-seq data using signature matrices generated from scATAC-seq data. Default parameters were used. For the cell type-specific classifier, pseudo-bulk replicates were generated for each of the 8 main cell types. For the cluster-specific classifier, pseudo-bulk replicates were generated for each of the 24 clusters.
Transcription factor footprinting
Transcription factor footprinting was performed as described previously[116].
HiChIP library generation
HiChIP library generation was performed as described previously[322]. One million cryopreserved nuclei were used per experiment. Enzyme MboI was used for restriction digest. Sonication was performed on a Covaris E220 instrument using the following settings: duty cycle 5, peak incident power 140, cycles per burst 200, time 4 minutes. All HiChIP was performed using H3K27ac as the target (Abcam ab4729).
HiChIP data analysis
HiChIP paired-end sequencing data was processed using HiC-Pro version 2.11.0 with a minimum mapping quality of 10. FitHiChIP was used to identify ”peak-to-all” interactions using peaks called from the one-dimensional HiChIP data. A lower distance threshold of 20 kb and an upper distance threshold of 2 Mb were used. Bias correction was performed using coverage-specific bias.
HiChIP linkage of SNPs to genes
To link SNPs to genes, we identified FitHiChIP loops that contained a SNP in one anchor and a TSS in the other anchor. This was performed for all LD-expanded SNPs to identify the full complement of genes that could be putatively implicated in AD and PD.
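The anchor-overlap logic can be sketched as follows. The data structures here (loops as pairs of half-open intervals, SNPs and TSSs as dictionaries of positions) and the function name are hypothetical and chosen only for illustration; they are not the formats produced by FitHiChIP.

```python
def link_snps_to_genes(loops, snps, tss):
    """Return (snp_id, gene) pairs joined by a loop.

    loops: list of ((chrom, start, end), (chrom, start, end)) anchor pairs.
    snps:  dict of snp_id -> (chrom, position).
    tss:   dict of gene -> (chrom, position).
    """
    def inside(anchor, site):
        chrom, start, end = anchor
        return site[0] == chrom and start <= site[1] < end

    links = set()
    for anchor_a, anchor_b in loops:
        # A SNP may sit in either anchor, so test both orientations.
        for snp_anchor, tss_anchor in ((anchor_a, anchor_b), (anchor_b, anchor_a)):
            snp_hits = [s for s, site in snps.items() if inside(snp_anchor, site)]
            gene_hits = [g for g, site in tss.items() if inside(tss_anchor, site)]
            links.update((s, g) for s in snp_hits for g in gene_hits)
    return links


example_links = link_snps_to_genes(
    loops=[(("chr17", 100, 200), ("chr17", 5000, 5100))],
    snps={"rs_example": ("chr17", 150)},
    tss={"GENE_A": ("chr17", 5050), "GENE_B": ("chr17", 9000)},
)
```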
gkm-SVM machine learning classifier training and testing
For each of the 24 scATAC-seq clusters, we used a 10 fold cross-validation scheme to train weighted gapped k-mer Support Vector Machine (gkm-SVM) models to classify 1000 bp sequences into two classes - accessible (corresponding to sequences underlying peaks) and inaccessible (GC matched inaccessible genomic regions). The test sets for each of the 10 folds are as follows. Fold 0 consisted of chr 1. Fold 1 consisted of chr 2 and chr 19. Fold 2 consisted of chr 3 and chr 20. Fold 3 consisted of chr 6, chr 13, and chr 22. Fold 4 consisted of chr 5, chr 16, and chr Y. Fold 5 consisted of chr 4, chr 15, and chr 21. Fold 6 consisted of chr 7, chr 14, and chr 18. Fold 7 consisted of chr 11, chr 17, and chr X. Fold 8 consisted of chr 9 and chr 12. Fold 9 consisted of chr 8 and chr 10. For each of the 24 scATAC-seq clusters, we merged the IDR peaks with identical genomic coordinates (peaks with multiple summits) while preserving the summit position and the MACS2 p-value of the peak with the lowest p-value among the ones with the identical coordinates. Next, we ranked the peaks by the MACS2 p-value, expanded each peak by 500 bp on either side of the summit, to a total of 1000 bp, and eliminated those peaks with any ‘N’ bases in the 1000 bp. For each of 10 cross-validation folds, we kept up to 60,000 of the top peaks belonging to the training set and all of the peaks belonging to the much smaller test set, all of which comprised the positively labeled (accessible) examples for training.
In order to generate the negative (inaccessible) examples for each of the cross-validation folds in each single-cell cluster, first, we used seqdataloader (https://github.com/kundajelab/seqdataloader) to generate all 1000 bp sequences obtained by tiling the hg38 genome 200 bp at a time, with a stride of 50 bp, keeping those 200 bp segments that have no IDR peak summits in that cluster, and then expanding those 200 bp segments by 400 bp on each side for a total of 1000 bp. Next, we calculated the GC content of the selected positive examples and all other bins in the genome. We partitioned the positive examples into 20 equally numerous GC bins according to the GC-content percentile of the positive sequence with respect to the positive set. We assigned sequences from all other bins in the genome to GC bins according to their GC-content. Starting with an empty negative set, we then sampled a positive example, sampled a negative sequence from the same GC bin as the sampled positive example, added the negative sequence to the negative set, and repeated this process until the number of negative examples equaled the number of positive examples for both the training set and the test set.
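The GC-matching step can be sketched as below. This is a simplified illustration under stated assumptions: positives are visited once each in random order (rather than repeatedly sampled), negatives are drawn without replacement, and bin edges are taken at quantiles of the positive set. The function names are hypothetical and this is not the seqdataloader implementation.

```python
import bisect
import random


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def sample_gc_matched_negatives(pos_gc, cand_gc, n_bins=20, seed=0):
    """Return indices into cand_gc, one GC-matched candidate per positive.

    Fewer indices are returned if a GC bin runs out of candidates.
    """
    rng = random.Random(seed)
    ranked = sorted(pos_gc)
    # Bin edges at quantiles of the positive set give equally numerous positive bins.
    edges = [ranked[(len(ranked) * k) // n_bins] for k in range(1, n_bins)]

    def bin_of(gc):
        return bisect.bisect_right(edges, gc)

    pools = {}
    for idx, gc in enumerate(cand_gc):
        pools.setdefault(bin_of(gc), []).append(idx)

    negatives = []
    for gc in rng.sample(pos_gc, len(pos_gc)):  # positives in random order
        pool = pools.get(bin_of(gc), [])
        if pool:
            negatives.append(pool.pop(rng.randrange(len(pool))))
    return negatives


cand = [0.21, 0.41, 0.61, 0.81, 0.19, 0.79]
chosen = sample_gc_matched_negatives([0.2, 0.4, 0.6, 0.8], cand, n_bins=2)
```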
For each of the 10 folds in each of the 24 clusters, we used the 1000-bp DNA sequences corresponding to the positive and GC-matched negative training examples as inputs to the gkmtrain function from the LS-GKM package with the default options, producing a total of 240 models; the default options for LS-GKM included the gapped k-mer + center weighted (wgkm) kernel (t = 4), a word length of 11 (l = 11), 7 informative columns (k = 7), 3 maximum mismatches to consider (d = 3), an initial value of the exponential decay function of 50 (M = 50), a half-life parameter of 50 (H = 50), a regularization parameter of 1.0 (c = 1.0), and a precision parameter of 0.001 (e = 0.001). We used the resulting support vectors for each trained model to score the DNA sequences corresponding to the positive and GC-matched negative test set examples for each fold in each cluster by running gkmpredict, and used the scikit-learn Python library to calculate both auROC and auPRC accuracy metrics.
gkm-SVM allelic scores of candidate SNPs
We intersected the coordinates of all LD-expanded candidate AD and PD GWAS and colocalization SNPs with those of the peaks for each single-cell ATAC-seq cluster to obtain the SNPs in each cluster that are in peaks. For each SNP in a peak in each of the clusters, we retrieved the 1000 bp DNA sequence around the SNP, with the SNP at its center, and created a sequence corresponding to the effect allele by replacing the 500th position of the sequence with the effect allele. Similarly, we created another sequence corresponding to the non-effect allele by replacing the 500th position of the sequence with the non-effect allele. Furthermore, we repeated the same procedure to also produce 50 bp sequences for each SNP with the effect allele and the non-effect allele by retrieving the 50 bp DNA sequence around each SNP and replacing the 25th position with the effect and the non-effect allele, respectively.
For each SNP in a peak in each of the clusters, we computed GkmExplain[423] importance scores for each position in each of the 1000 bp effect and non-effect allele sequences using each of the 10 gkm-SVM[165] models for the respective cluster. GkmExplain is a method to infer the importance or predictive contribution of every base in an input sequence to its corresponding output prediction from a gkm-SVM model. Next, for each SNP in a given cluster, we computed the average score for each position across all 10 models (from the 10 folds) for that cluster for both the effect allele sequence and the non-effect allele sequence, producing a set of consensus importance scores for both the effect allele and the non-effect allele. Then, we subtracted the sum of these consensus importance scores corresponding to the central 50 bp of the non-effect allele sequence from that of the effect allele sequence to compute the GkmExplain score for each SNP in each cluster.
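The arithmetic behind the GkmExplain allelic score can be sketched as follows, assuming the per-base importance scores for each fold have already been computed. The function names are hypothetical, and the toy example uses a 4-bp sequence with a 2-bp central window in place of 1000 bp and 50 bp.

```python
def consensus_scores(per_fold_scores):
    """Average per-position importance scores across folds."""
    n_folds = len(per_fold_scores)
    return [sum(column) / n_folds for column in zip(*per_fold_scores)]


def central_window_sum(scores, width=50):
    """Sum of the scores in the central `width` positions."""
    start = (len(scores) - width) // 2
    return sum(scores[start:start + width])


def gkmexplain_allelic_score(effect_folds, noneffect_folds, width=50):
    """Effect-allele minus non-effect-allele central consensus importance."""
    return (central_window_sum(consensus_scores(effect_folds), width)
            - central_window_sum(consensus_scores(noneffect_folds), width))


toy_score = gkmexplain_allelic_score(
    effect_folds=[[0, 1, 2, 0], [0, 3, 4, 0]],
    noneffect_folds=[[0, 1, 1, 0], [0, 1, 1, 0]],
    width=2,
)
```

A positive value indicates that the effect allele carries more predicted importance in the window around the SNP than the non-effect allele.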
To compute in silico mutagenesis (ISM) scores for each SNP in a peak in each of the clusters, we used each of the 10-fold gkm-SVM models from the respective cluster to compute model output prediction scores for the 50 bp effect and non-effect allele sequences by running gkmpredict. Then, we subtracted the score of the non-effect allele sequence from the score of the effect allele sequence to obtain the ISM score and computed the average ISM score for each SNP across all 10 folds in each cluster. To compute deltaSVM scores, we generated all possible non-redundant k-mers of size 11 and scored each of them using each of the 240 models. Next, for each SNP in a peak in each of the clusters, we used each of the 10 sets of k-mer scores from the 10-fold gkm-SVM models from the respective cluster to run deltaSVM on the 50 bp effect and non-effect allele sequences. We computed the average of the resulting deltaSVM scores for each SNP across all 10 folds in each cluster.
Statistical significance and high confidence sets of gkm-SVM based allelic scores for candidate SNPs
In order to obtain a statistical significance for each of the three gkm-SVM model based allelic SNP scores (GkmExplain, ISM and deltaSVM), for each SNP scored in each cluster, we generated 10 random 1000 bp sequences with the same di-nucleotide frequencies as those of the 1000 bp around the SNP using the fasta-shuffle-letters program from MEME Suite to serve as a null background set. For each null sequence, we created a null effect allele sequence and a null non-effect allele sequence by replacing the base at the center of the null sequence with the effect and non-effect allele, respectively.
For each SNP in a peak in each of the clusters, we computed GkmExplain importance scores for each of the central 200 bp in each of the 10 null effect and non-effect allele sequences using each of the 10 gkm-SVM models for the respective clusters. Next, for each pair of null sequences, we subtracted the sum of the importance scores corresponding to the central 50 bp of the null non-effect allele sequence from that of the null effect allele sequence to compute the null GkmExplain score.
To compute null in silico mutagenesis (ISM) scores for each SNP in a peak in each of the clusters, we used each of the 10 fold gkm-SVM models from the respective clusters to compute model output prediction scores for the central 50 bp of the null effect and non-effect allele sequences by running gkmpredict. Then, we subtracted the score of the null non-effect allele sequence from the score of the null effect allele sequence to obtain the null ISM score.
To compute null deltaSVM scores, for each SNP in a peak in each of the clusters, we used each of the 10 sets of k-mer scores from the 10 fold gkm-SVM models from the respective cluster to run deltaSVM on the central 50 bp of the null effect and non-effect allele sequences. We found that the t-distribution was a good fit (based on KS test) to the empirical null distribution for all three scores. Hence, we used the fitted t-distributions (using SciPy python library http://www.scipy.org/) to each of the three sets of null scores as the null distributions. To select SNPs with statistically significant gkm-SVM allelic scores, for each cluster, we selected those SNPs that fall outside the 95% confidence interval for all three null t-distributions fitted to the GkmExplain, ISM, and deltaSVM scores.
Next, we developed a method to identify putative transcription factor binding sites around each gkm-SVM scored statistically significant candidate SNP, by identifying the subsequences around the SNP whose base-resolution importance scores are significantly above that of the di-nucleotide matched shuffled background. We use the GkmExplain importance scores of all bases in the central 200 bp of all the null effect and non-effect allele sequences as a null distribution to identify bases around the SNP with high signal-to-noise ratio. For each SNP, we defined the active allele as the allele for which the 50 bp sequence centered on the SNP has the higher sum of non-negative importance scores relative to the other allele. Next, starting from the center of the active allele’s sequence, which is the location of the SNP, we continue advancing one pointer upstream and another downstream, each up to the position beyond which lie two consecutive bases that both have consensus importance scores that are not higher than 97.5% of the null importance scores. The subsequence between the terminal positions of the two pointers corresponds to one that underlies a series of bases with high GkmExplain importance scores that are significantly above the null scores of the di-nucleotide matched shuffled background sequences and potentially contains transcription factor binding sites and motifs that are relevant for the given cluster. We refer to these high-importance subsequences as seqlets. If a SNP does not have a seqlet that reaches a minimum length of 7 bp, then we alternatingly extend each end of the seqlet by 1 bp until this minimum length is reached.
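The two-pointer seqlet search can be sketched as below. This is an illustrative reading of the procedure, not the code used in this work: the function name is hypothetical, the threshold is assumed to be the precomputed 97.5th percentile of the null importance scores, positions beyond the sequence ends are treated as low-scoring, and padding is assumed to start on the upstream side.

```python
def find_seqlet(scores, center, threshold, min_len=7):
    """Return inclusive (left, right) bounds of the seqlet around `center`."""
    n = len(scores)

    def low(i):
        # Off-sequence positions count as low-scoring.
        return i < 0 or i >= n or scores[i] <= threshold

    # Advance each pointer until the next two bases are both low-scoring.
    left = center
    while left > 0 and not (low(left - 1) and low(left - 2)):
        left -= 1
    right = center
    while right < n - 1 and not (low(right + 1) and low(right + 2)):
        right += 1

    # Pad short seqlets by alternately extending each end by 1 bp.
    extend_left = True
    while right - left + 1 < min_len and (left > 0 or right < n - 1):
        if extend_left and left > 0:
            left -= 1
        elif right < n - 1:
            right += 1
        elif left > 0:
            left -= 1
        extend_left = not extend_left
    return left, right


toy_scores = [0, 0, 0, 0.1, 0.9, 0.8, 0.05, 0.7, 0, 0, 0, 0]
raw_seqlet = find_seqlet(toy_scores, center=5, threshold=0.2, min_len=1)
padded_seqlet = find_seqlet(toy_scores, center=5, threshold=0.2)
```

In the toy example, the single low-scoring base at index 6 is bridged because it is followed by a high-scoring base, whereas the run of low-scoring bases starting at index 8 terminates the seqlet.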
Next, we defined two additional scores (prominence score and magnitude score) to further identify high confidence candidates from the gkm-SVM scored statistically significant candidate SNPs that are supported by seqlets that could potentially match identifiable transcription factor binding sites. We compute the sum of the non-negative consensus importance scores from the positions of the effect allele that overlap the active allele’s seqlet, which we refer to as the effect allele seqlet score, and divide that score by the sum of the non-negative consensus importance scores from the entire central 200-bp region of the effect allele sequence; we refer to this ratio as the effect allele seqlet signal-to-noise ratio. Similarly, we compute the non-effect allele seqlet score as the sum of the non-negative consensus importance scores in the non-effect allele sequence from the same positions overlapping the active seqlet. We obtain a corresponding non-effect allele seqlet signal-to-noise ratio by dividing the non-effect allele seqlet score by the sum of the non-negative consensus importance scores from the entire central 200-bp region of the non-effect allele sequence. Then, for each SNP, we compute the prominence score by subtracting the non-effect allele seqlet signal-to-noise ratio from the effect allele seqlet signal-to-noise ratio. In addition, we also compute a magnitude score by subtracting the non-effect allele seqlet score from the effect allele seqlet score. To compute the statistical significance of the prominence and magnitude scores for candidate SNPs, for each cluster, we compute null prominence scores and null magnitude scores for each pair of null effect and non-effect allele sequences using the same procedure described above and use the empirical null distributions to obtain p-values for the prominence and magnitude scores for each candidate SNP scored for that cluster. 
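The prominence and magnitude scores reduce to a few sums and ratios, sketched here under the assumption that consensus importance scores and seqlet bounds are already available. The function names are hypothetical, and the toy example uses a 6-bp window in place of the central 200 bp.

```python
def nonnegative_sum(scores):
    """Sum of the non-negative scores."""
    return sum(s for s in scores if s >= 0)


def seqlet_scores(effect, noneffect, seqlet, window=200):
    """Compute prominence and magnitude scores for one SNP.

    effect, noneffect: consensus importance scores for the two alleles.
    seqlet: inclusive (left, right) bounds of the active allele's seqlet.
    """
    left, right = seqlet
    start = (len(effect) - window) // 2

    def score_and_snr(scores):
        seqlet_score = nonnegative_sum(scores[left:right + 1])
        total = nonnegative_sum(scores[start:start + window])
        snr = seqlet_score / total if total > 0 else 0.0
        return seqlet_score, snr

    effect_score, effect_snr = score_and_snr(effect)
    noneffect_score, noneffect_snr = score_and_snr(noneffect)
    return {
        "prominence": effect_snr - noneffect_snr,
        "magnitude": effect_score - noneffect_score,
    }


toy_result = seqlet_scores(
    effect=[0.0, 1.0, 3.0, -1.0, 0.0, 0.0],
    noneffect=[1.0, 1.0, 0.0, 2.0, 0.0, 0.0],
    seqlet=(1, 2),
    window=6,
)
```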
For each type of score, in order to control for any arbitrary bias in the sign of the score, we include the negative value of each score to the list of scores to enforce symmetry before fitting the distribution.
Finally, to prioritize SNPs that disrupt potential transcription factor binding sites, in each cluster, among the SNPs with statistically significant gkm-SVM allelic scores, we designate as high confidence SNPs those that have a prominence score with a p-value less than 0.05. These are the SNPs that have an allele that completely destroys a prominent and high-scoring seqlet and, as a result, potentially disrupts an important transcription factor binding site. Next, among the confident SNPs that do not pass the high confidence threshold, we designated as medium confidence SNPs those that have either a magnitude score with a p-value less than 0.05 or a prominence score with a p-value less than 0.10. The magnitude score threshold is intended to capture those SNPs that have a significant deleterious effect on the seqlet score, even if those SNPs do not necessarily destroy the entire seqlet and even for cases where the seqlet around the SNP is not among the most prominent seqlets in the local 200 bp sequence window. In addition, the relaxed prominence score threshold is intended to capture those SNPs that do not pass the stringent filter for the high confidence set, but nevertheless, demonstrate at least a partial deleterious effect on a moderately scoring seqlet around the SNP. Together, these two filters serve to increase the recall in the prioritization of the SNPs, allowing us to identify all promising SNPs that are worthy of in-depth evaluation, which can assess their potential regulatory effect through a case-by-case analysis. The remaining SNPs in the confident set, which fail to meet the threshold for medium confidence, are designated as low confidence SNPs, as they include SNPs that significantly reduce the GkmExplain score, the ISM score, and the deltaSVM score, but do not have a clear impact on a seqlet around the SNP, making it unlikely for them to have a disruptive effect on a key transcription factor binding site.
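The tiering rules above can be written as a small decision function. This sketch applies only to SNPs that already passed the three gkm-SVM allelic score filters; the function name and the string labels are illustrative.

```python
def confidence_tier(prominence_p, magnitude_p):
    """Assign a confidence tier from prominence and magnitude p-values."""
    if prominence_p < 0.05:
        return "high"
    if magnitude_p < 0.05 or prominence_p < 0.10:
        return "medium"
    return "low"


tiers = [
    confidence_tier(0.01, 0.50),  # strong prominence
    confidence_tier(0.08, 0.50),  # relaxed prominence threshold
    confidence_tier(0.40, 0.02),  # magnitude only
    confidence_tier(0.40, 0.30),  # neither
]
```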
Identification of MAPT haplotypes
The MAPT haplotype block is part of one of the largest LD blocks in the human genome. To identify SNPs that belong exclusively to either the H1 or H2 haplotype, we used minor allele frequencies from dbSNP version 151. SNPs were required to be within the coordinates of the MAPT inversion breakpoints (hg38 chr17:45551578-46494237) and to have a minor allele frequency between 8.4% and 9%. While there are undoubtedly haplotype specific SNPs outside this frequency range, we chose this range to be as conservative as possible and to pick SNPs that showed minimal haplotype switching. Each SNP was verified to track with the predicted haplotype using LDLink[289]. This resulted in 2366 SNPs that could be confidently called as haplotype divergent.
MAPT locus differential expression analysis
A 900-kb block of variants in strong LD at the MAPT locus hampered the resolution of colocalization methods for identifying causal variants and/or genes at this locus. To probe this locus more deeply, we assembled a list of 2366 variants uniquely found in either the H1 or the H2 haplotype of the MAPT locus (described above). For each of the 838 individuals genotyped in GTEx v8, we counted the number of variants in support of either haplotype. We designated individuals as homozygous if they possessed less than 1% of variants favoring the opposite haplotype and heterozygous if 45% to 55% of variants supported either haplotype. This determined the individual’s haplotype in all but six cases, which were excluded from the remainder of the MAPT analysis. In total, we identified 539 individuals with the H1/H1 haplotype, 260 with H2/H1, and 33 with H2/H2. Our a priori gene of interest was MAPT, whose expression had previously been demonstrated to be higher in H1 than H2 haplotypes. At a nominal cutoff of p ≤ 0.05, we confirmed this expected direction of differential MAPT expression (higher in H1 haplotypes) in multiple tissues, with the strongest contrasts in ”Brain - Cortex”.
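The haplotype assignment rule can be sketched as follows. The inputs are assumed to be per-individual counts of haplotype-divergent variants supporting H1 and H2; the function name and the use of None for unresolved individuals are choices made for this example.

```python
def call_mapt_genotype(n_h1, n_h2):
    """Call the MAPT haplotype genotype from supporting-variant counts.

    Returns "H1/H1", "H1/H2", "H2/H2", or None when the call is ambiguous.
    """
    total = n_h1 + n_h2
    if total == 0:
        return None
    frac_h1 = n_h1 / total
    frac_h2 = n_h2 / total
    if frac_h2 < 0.01:
        return "H1/H1"
    if frac_h1 < 0.01:
        return "H2/H2"
    if 0.45 <= frac_h2 <= 0.55:
        return "H1/H2"
    return None  # excluded from downstream analysis


calls = [
    call_mapt_genotype(2366, 0),
    call_mapt_genotype(1183, 1183),
    call_mapt_genotype(0, 2366),
    call_mapt_genotype(1700, 666),
]
```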
We then extended our analysis to include all genes expressed in any of the brain tissues from GTEx v8. We compared the log2-fold change of gene expression (TPM) between H1/H1 and H1/H2 individuals, given that these subgroups had the largest sample size. A change was considered statistically significant if a Wilcoxon rank-sum test between the two groups produced a p-value of ≤ 0.05 / (total N genes) / (total N tissues). We also performed pairwise Wilcoxon rank-sum test comparisons for each gene in each brain tissue between all 3 pairings of haplotypes.
MAPT haplotype-specific ATAC-seq and HiChIP analysis
For both ATAC-seq and HiChIP, reads from heterozygote donors were re-mapped to an N-masked genome (using bowtie2 or HiC-Pro, respectively) where all dbSNP v151 positions were masked to ”N”. After alignment, SNPsplit was used to divide reads mapping to either the H1 or H2 haplotypes based on the presence of one of the 2366 haplotype-divergent SNPs identified above. In this way, reads mapping to regions that lack a haplotype-divergent SNP could not be assigned in an allelic fashion to either the H1 or H2 haplotypes and were ignored. For track-based visualizations of haplotype-specific data, all available data from a given haplotype was merged agnostic to what brain region the data was derived from. To identify regions with haplotype-specific chromatin accessibility in the MAPT locus, the entire locus was tiled into non-overlapping 500 bp bins and the number of Tn5 transposase insertions was counted for each haplotype in each bin for each sample. A Wilcoxon signed-rank test was used to determine if the difference between H1 and H2 for each bin was significant after multiple hypothesis correction (FDR ≤ 0.01).
Data availability
All data generated in this work is available through GEO accession GSE147672. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE147672
To facilitate broad access to our data, we have created a WashU Epigenome Browser session (Session ID: drS3o1n4kJ) for our scATAC-seq data in the following track formats: (i) broad cell types (”Corces scATAC BroadCellTypes”), (ii) broad clusters (”Corces scATAC BroadClusters”), (iii) neuron subclusters (”Corces scATAC NeuronSubClusters”), and (iv) neuron subclustered cell types / LDSC groups (”Corces scATAC NeuronSubCellTypes”). These tracks are accessible via the following link: http://epigenomegateway.wustl.edu/legacy/?genome=hg38&session=drS3o1n4kJ
3.3.13 Author contributions
Ryan Corces (RC), Howard Chang (HC), and Thomas Montine (TM) conceived of and designed the project. RC and TM compiled the figures and wrote the manuscript with help and input from all authors. Anna Shcherbina (AS) and RC performed bulk ATAC-seq data processing and analysis. RC performed all HiChIP data analysis with help from Maxwell Mumbach (MM) and Jeffrey Granja (JG). JG, RC, and AS performed all single-cell ATAC-seq data processing and analysis with supervision from William J. Greenleaf, Anshul Kundaje (AK), Stephen Montgomery (SM), and Howard Chang. Michael Gloudemans (MG) performed GWAS locus curation, colocalization analysis, and GTEx analysis, and LF and BL performed all LD score regression analysis with supervision from SM. Soumya Kundu (SK) and AS performed all machine learning analysis with supervision from AK. Bosh Liu (BL), Shadi Shams (SS) and RC performed all ATAC-seq, scATAC-seq, and HiChIP data generation with help from S. Tansu Bagdatli (SB) and MM. Kathleen Montine (KM) curated the frozen tissue specimens used in this work.

Chapter 4
Base-resolution deep learning models to interpret the regulatory sequence code from chromatin profiling data
4.1 ChromBPNET: Dilated convolutional neural networks allow for greater sequence context and base-resolution modeling
4.1.1 Strengths and limitations of support vector machine models versus CNN models on binned genome
Bassett CNN generalizes genomewide, whereas SVMs do not
The LSGKM algorithm was used to train SVM models to predict chromatin accessibility within the GECCO datasets as well as within ENCODE canonical cell line DNASE datasets. The performance of the models on a held-out test set was compared against the performance of Bassett architecture convolutional neural networks on the same datasets. 10-fold cross validation was performed in both cases. Training/test regimes and performance values are illustrated in Figure 4.1 (Canonical ENCODE cell lines: GM12878, HEPG2, H1ESC, IMR90, K562) and Figure 4.2 (GECCO DNASE datasets). The following train/test regimes were used:
• SVM training set – 60,000 most significant IDR peaks from the DNASE dataset as positive examples, with 60,000 GC-matched negatives selected at random from non-peak regions within the genome.
• SVM test set – All IDR peaks for the test chromosomes in the given fold, with an equal number of GC-matched negatives from the test chromosomes.
• genomewide training set – All seqdataloader 1kb genomewide classification bins within the current fold’s training split chromosomes.
• genomewide test set – All seqdataloader 1kb genomewide classification bins within the current fold’s test split chromosomes.
• ”SVM-SVM-Genome” indicates that an SVM model was trained on the SVM training set and tested on the genomewide test set.
• ”CNN-SVM-Genome” indicates that a Bassett convolutional neural network classification model was trained on the same training set used for the SVM model (see above) and tested on the genomewide test set.
• ”CNN-Genome-Genome” indicates that a Bassett CNN was trained and tested genome-wide.
• ”SVM-SVM-SVM” indicates that an SVM model was trained and tested on the SVM train and test set, respectively.
• ”CNN-Genome-SVM” indicates that a CNN model was trained on the genomewide training set and tested on the SVM test set.
• ”CNN-SVM-SVM” indicates that a CNN model was trained on the SVM training set and tested on the SVM test set.
Training an SVM model genomewide was not computationally feasible, and hence that comparison is not included.

Figure 4.1: SVM vs CNN benchmarks on ENCODE DNASE datasets in canonical cell lines. Each point represents a fold from 10-fold cross validation, split across chromosomes.
For both the GECCO datasets and the ENCODE Canonical cell lines, the lowest performance across these comparisons was observed for the ”SVM-SVM-Genome” case. This suggests that the GKM kernel is unable to capture the complexity of combinatorial motif grammars throughout the genome. Both the SVM and the CNN model achieved higher performance (auPRC ≥ 0.9) on the SVM test set, with low variance across folds. On the genomewide test set, the CNN model trained genomewide achieved much higher performance compared to the SVM model, suggesting that the greater capacity of the neural network is able to capture more complex patterns of transcription factor binding.
Binned CNN models lack stability across training folds in performance and interpretation
Although the CNN-Genome-Genome case achieved auPRC around 0.5 on the genomewide test set, we observe a high degree of instability across the different folds. This is not merely driven by differences in class imbalance across folds, as shown in Figures 4.3 and 4.4.
The lower stability in prediction performance across folds for the CNN models compared to the SVM models is also observed for sequence importance scores (Figure 4.5). In calculating the in silico mutagenesis scores across the 1kb region centered on the candidate Alzheimer’s disease GWAS hit rs636317 (see section ”Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer’s and Parkinson’s diseases”), a high degree of instability is observed for both the classification and regression CNN models (Figure 4.5B). This is in contrast to the SVM GKMExplain scores for this SNP, which are highly stable across folds (Figure 4.5A).

Figure 4.2: SVM vs CNN performance benchmarks on GECCO DNASE datasets. Each point represents a fold from 10-fold cross-validation, split across chromosomes.

Figure 4.3: Expected auPRC across folds and tasks for the ENCODE canonical cell line DNASE datasets.

Figure 4.4: Expected auPRC across folds and tasks within the GECCO DNASE datasets.
4.1.2 ChromBPNET model architecture and training
To address the lack of interpretation stability in genomewide binned CNN models, and to enable the learning of cooperative SNP grammars (i.e. homodimer and heterodimer patterns), the ChromBPNET architecture was developed. The architecture was adapted from the BPNET model [38] for learning profile patterns for individual transcription factors, and expanded to support learning accessibility profiles in ATAC-seq, DNASE, and histone ChIP-seq datasets. As in BPNET, we make use of dilated convolution layers (with a dilation factor of 2) to allow learning from a larger input receptive field. The ChromBPNET architecture utilizes a multinomial negative log likelihood loss to learn the count profile within a 1473 base input sequence region. We also use an MSE loss to learn the log(sum(counts)) within the 1kb output region. A typical model input profile and prediction is illustrated in Figure 4.6. After the final dilation layer, the architecture splits into a ”count head” and a ”profile head”. The count head passes the output of the final dilated convolution layer through a global average pool, before optionally concatenating the output of this pool with a bias term. The output is then passed through a final dense layer, and an MSE loss is computed on the log(count) prediction. The profile head also optionally concatenates the output of the final dilated convolution layer with a bias term and provides a logit-space profile prediction, which is evaluated with a multinomial negative log likelihood loss.
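The two loss terms can be illustrated with a framework-free sketch. This is not the ChromBPNET training code: the function names are hypothetical, and the `pseudocount` (used to avoid log(0)) and the `count_weight` that balances the two terms are assumptions, since the text does not specify either.

```python
import math


def multinomial_nll(counts, logits):
    """Negative log likelihood of observed per-base counts under softmax(logits)."""
    total = sum(counts)
    # Numerically stable log-softmax.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_norm for x in logits]
    # Log multinomial coefficient: log(N!) - sum(log(k_i!)).
    log_coeff = math.lgamma(total + 1) - sum(math.lgamma(k + 1) for k in counts)
    return -(log_coeff + sum(k * lp for k, lp in zip(counts, log_probs)))


def log_count_squared_error(counts, predicted_log_count, pseudocount=1.0):
    """Squared error between predicted and observed log total counts."""
    target = math.log(sum(counts) + pseudocount)
    return (predicted_log_count - target) ** 2


def total_loss(counts, profile_logits, predicted_log_count, count_weight=1.0):
    """Profile loss plus weighted count loss for one region."""
    return (multinomial_nll(counts, profile_logits)
            + count_weight * log_count_squared_error(counts, predicted_log_count))


toy_profile_loss = multinomial_nll([1, 1], [0.0, 0.0])
toy_total_loss = total_loss([1, 1], [0.0, 0.0], math.log(3))
```

The profile term depends only on the shape of the predicted profile, while the count term depends only on the predicted total, which is why the architecture can separate them into two output heads.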
The input profile consists of 5’ counts from the unfiltered aligned BAM file generated by the ENCODE ATAC-seq and/or DNASE pipeline[210], calculated according to the deepTools[386] script:

# non-stranded (can be PE or SE)
bamCoverage -p16 -v --binSize 1 --samFlagExclude 780 --Offset 1 1 \
    --minMappingQuality 30 -b $cur_bam -o $cur_bam.bpnet.unstranded.bw

# forward strand -- assumes SE data
bamCoverage -p16 -v --binSize 1 --samFlagExclude 796 --Offset 1 1 \
    --minMappingQuality 30 -b $cur_bam -o $cur_bam.bpnet.plus.bw

# reverse strand -- assumes SE data
bamCoverage -p16 -v --binSize 1 --samFlagExclude 780 --samFlagInclude 16 \
    --Offset 1 1 --minMappingQuality 30 -b $cur_bam -o $cur_bam.bpnet.minus.bw
This script calculates the 5’ count coverage for all aligned reads with alignment quality of 30 or higher, with a bin size of 1. For DNASE and ATAC-seq datasets, we do not observe a strand shift effect, so non-stranded counts are used for training the model. For histone ChIP-seq models, the forward and reverse strands are treated as two tasks for model training, to account for strand shift effects.
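To make the read filters in the commands above easier to follow, the short sketch below decodes the numeric values passed to --samFlagExclude and --samFlagInclude into their component SAM flag bits. The bit meanings are those of the SAM format specification; the helper name decode_flag is hypothetical.

```python
# Standard SAM flag bits, as defined in the SAM format specification.
SAM_FLAGS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read on reverse strand",
    32: "mate on reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "not primary alignment",
    512: "read fails quality checks",
    1024: "PCR or optical duplicate",
    2048: "supplementary alignment",
}

def decode_flag(value):
    """Return the names of all SAM flag bits set in `value` (hypothetical helper)."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if value & bit]

for value in (780, 796, 16):
    print(value, decode_flag(value))
```

Under this decoding, 780 removes unmapped reads, reads with an unmapped mate, secondary alignments, and reads failing quality checks. The value 796 additionally removes reverse-strand reads, which leaves the forward strand, while combining an exclude value of 780 with an include value of 16 retains only reverse-strand reads.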
Figure 4.5: Interpretation score stability across folds for the reference and alternate allele effects on accessibility at variant rs636317 within the Alzheimer’s/Parkinson’s dataset. A) GKMexplain scores for the effect allele (alternate allele) T (left) and the non-effect (reference allele) C (right). B) In silico mutagenesis scores across all four alleles at each base in the 1kb region flanking rs636317 for the genomewide binned classification and regression CNN models.

Figure 4.6: ChromBPNET model predicts base-level count profile as well as 1kb resolution summed counts for ATAC-seq and DNASE data. Example IDR peak from K562 DNASE dataset ENCSR000EOT.
The model is trained on genomewide input sequences that overlap by at least one base with an optimal overlap peak in the training set. To augment the input set, all 1473 base pair windows that overlap a given peak by at minimum 1 base are considered for training. The test set consists of summit-centered IDR peak regions.
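The window augmentation described above can be sketched as follows. This is an illustrative helper rather than the seqdataloader or kerasAC code: the function name is hypothetical, intervals are assumed to be half-open, and chromosome-end clipping is ignored apart from not allowing negative start coordinates.

```python
def overlapping_window_starts(peak_start, peak_end, window=1473, min_overlap=1):
    """Start coordinates of all fixed-width windows [s, s + window) that overlap
    the half-open peak interval [peak_start, peak_end) by at least `min_overlap` bases.
    Assumes the peak is at least `min_overlap` bases long."""
    first = max(0, peak_start - window + min_overlap)  # window barely reaches into the peak
    last = peak_end - min_overlap                      # window starts at the last peak base(s)
    return range(first, last + 1)

# Toy example: a 200 bp peak yields 200 + 1473 - 1 candidate training windows.
starts = overlapping_window_starts(2000, 2200)
print(starts[0], starts[-1], len(starts))
```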
Training was performed via 10-fold cross-validation, excluding regions blacklisted in the hg38 blacklist[21].
BPNET models were trained for the five canonical ENCODE cell lines using ENCODE ATAC-seq and DNASE datasets:
• ENCSR000EOT – K562 DNASE
• ENCSR000EMT – GM12878 DNASE
• ENCSR000EMU – H1HESC DNASE
• ENCSR149XIL – HEPG2 DNASE
• ENCSR477RTP – IMR90 DNASE
Note: at the time of this writing, the ATAC-seq datasets for GM12878, H1HESC, HEPG2, IMR90 did not yet have ENCODE accession numbers. No ENCODE ATAC-seq data was available for K562, so that dataset is omitted for model training.
Inputs for model training were generated from the pipeline-processed datasets and corresponding 5’ count tracks (see above) using the seqdataloader[29] dbingest utility. This utility stores stranded and unstranded 5’ count data, overlap peaks, IDR peaks, blacklist peaks, fold change tracks, and pvalue tracks generated by the ENCODE pipeline in a compressed tiledb[350] database that can be queried rapidly in a multi-threaded fashion. Data generation scripts are in the ChromBPNET github repository: https://github.com/kundajelab/chrombpnet/tree/master/tiledb.
The kerasAC[30] software was used to train ChromBPNET models for each of the above-mentioned datasets. The full architecture used is illustrated in Figure 4.7.
4.1.3 ChromBPNET hyperparameter tuning
The model takes a 1346 base-pair one-hot-encoded sequence as input. Input lengths of up to 6kb were tested, but no increase in model predictive performance for ATAC-seq and DNASE-seq datasets was observed for input sequences longer than 1346 bases.
Hyperparameter search revealed that 500 filters in each dilated convolutional layer was optimal (from a search space of 100, 300, 500, 700, 1000). Six dilation layers were included in the architecture, each with a dilation factor of two.
Among the lessons learned in model training was that ChromBPNET has some outlier sensitivity in the input space, and any bins with ln(counts) outside the range 4 to 11.5 were excluded from training (corresponding to a minimum read coverage of 100 reads in the bin and a maximum coverage of 100,000 reads).
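A minimal sketch of the outlier filter described above is given below. The thresholds are the ln(counts) bounds quoted in the text; the function name and the handling of empty bins are assumptions made for the example.

```python
import math

LN_MIN, LN_MAX = 4.0, 11.5  # bounds on ln(counts) quoted in the text

def keep_bin(total_counts):
    """True if ln(total_counts) lies within [LN_MIN, LN_MAX].
    Bins with no reads are dropped (assumption: ln(0) is undefined)."""
    if total_counts <= 0:
        return False
    return LN_MIN <= math.log(total_counts) <= LN_MAX

# Toy example with hypothetical per-bin read totals.
bins = [3, 150, 20000, 250000]
kept = [c for c in bins if keep_bin(c)]
print(kept)
```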
The models were also found to be sensitive to the choice of random seed used in weight initialization on the counts prediction, but not on the profile prediction. For this reason, each ChromBPNET model was trained with 3 random seeds (1234, 2345, 3456) and the model with highest count performance among the three seeds was chosen for use in interpretation and downstream analysis.
4.1.4 ChromBPNET baseline performance
Test set performance was evaluated using five metrics:
• the Spearman correlation between log(counts) in the input labels and model predictions.
• the Pearson correlation between log(counts) in the input labels and model predictions.
• the mean squared error (MSE) between log(counts) of the input labels and the model predictions.
• the Jensen-Shannon distance between the input counts profiles (converted to probability space) and the model’s predicted profiles (converted to probability space). This distance was calculated on each test region and averaged across the test set.
• the multinomial negative log likelihood of the predicted profile (in probability space) given the labeled profile (in probability space). This metric was also calculated for each test region profile and averaged across the test set.

Figure 4.7: ChromBPNET architecture for ATAC-seq and DNASE datasets, with bias correction.
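The first four metrics in the list above can be illustrated with a short plain-Python sketch. In practice, library routines such as those in scipy.stats and scipy.spatial.distance would normally be used; the helper names here are hypothetical, and the use of the natural logarithm in the Jensen-Shannon distance is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mse(x, y):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def to_prob(counts):
    """Convert a non-negative count profile to probability space."""
    total = sum(counts)
    return [c / total for c in counts]

def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (natural log; an assumption)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))

# Toy example: distance between a labeled and a predicted profile.
print(jensen_shannon_distance(to_prob([5, 3, 2]), to_prob([4, 4, 2])))
```

For the profile metrics, the per-region values would then be averaged across the test set, as described in the list above.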
These performance metrics for the ENCODE canonical cell lines in ATAC-seq and DNASE datasets are illustrated in Figure 4.8. Because the profile labels were generated per-base, they were susceptible to technical noise in the dataset, and hence the JSD and MNLL metrics were calculated on unsmoothed profiles, as well as on profiles smoothed with a Gaussian kernel of length 7 and standard deviation of 3. Performance metrics for smoothed profiles and labels are indicated in yellow, for smoothed labels only are in red, and for no smoothing are in blue. Pearson and Spearman correlation values on the test set ranged from 0.5 to 0.8, with relative stability across folds. JSD values for smoothed labels and predictions were centered on 0.2, while the mean MNLL values centered on 1100 for ATAC and 1500 for DNASE models.
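The smoothing step mentioned above can be sketched as follows, using the kernel length (7) and standard deviation (3) given in the text. The function names are hypothetical, and treating positions beyond the profile edges as zero is an assumption about edge handling.

```python
import math

def gaussian_kernel(length=7, sigma=3.0):
    """Discrete Gaussian kernel of odd `length`, normalised to sum to 1."""
    half = length // 2
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(profile, kernel):
    """Same-length convolution; positions past the edges count as zero (assumption)."""
    half = len(kernel) // 2
    smoothed = []
    for i in range(len(profile)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(profile):
                acc += w * profile[j]
        smoothed.append(acc)
    return smoothed

# Toy example: a single-base spike is spread over the 7 surrounding bases.
kernel = gaussian_kernel()
spike = [0.0] * 15
spike[7] = 1.0
print(smooth(spike, kernel))
```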
4.1.5 Enzymatic Bias Effect Correction
In the tagmentation step of the ATAC-seq assay, the hyperactive mutant Tn5 transposase inserts sequencing adapters into open regions of the genome. This enzyme has a bias for a particular sequence content (figure 4.10) and is more likely to cleave and tag those sequences with adapters[53],[85]. Similarly, the DNase enzyme used to cut sequences in DNASE-seq also exhibits a sequence bias[481].
These biases can affect ChromBPNET model predictions, and we seek to correct for them by providing them as explicit covariates in the ChromBPNET model. We therefore adopt a two-step training process, where first, we freeze the weights in all layers of the BPNET model with the exception of a single convolution layer across the sources of bias (figure 4.9). Next, we concatenate those model predictions with the rest of the full ChromBPNET architecture (figure 4.7). As shown in figure 4.9, two approaches to training BPNET from bias models were used: in the first, both heads were trained simultaneously; in the second, each head was trained separately while the other was frozen. Training both heads simultaneously led to improved prediction performance.
Several different models of enzymatic bias were used in phase one of this training protocol. Previous work by Vierstra[481], Ohler[85], and Bentsen[53] derived k-mer based models of enzymatic bias (figure 4.10). In addition to the k-mer bias models, we trained convolutional neural nets to predict enzymatic bias from sequence. The Tn5 bias labels for these CNNs were derived from deproteinized genomic DNA (Tn5 transposition) from GEO SRR072187[8]. DNase bias labels were derived from K562 and MCF7 deproteinized DNase datasets in GSE61105[506]. In addition to single-layer convolutional neural networks, variations of the ChromBPNET architecture with 5 filters and 20 filters were used to predict the bias labels from sequence. Altogether, the following architectures were evaluated. The PWMs learned across these architectures are presented in figure 4.10.

Figure 4.8: ChromBPNET performance metrics on ENCODE canonical cell line ATAC and DNASE datasets. Test set performance metrics for each fold in 5-fold cross validation are included. A) Pearson correlation on count predictions from DNASE data. B) Spearman correlation on count predictions from DNASE data. C) Mean squared error (MSE) on count predictions from DNASE data. D-F) Pearson, Spearman, and MSE performance metrics across test set count predictions from ATAC data. G) Mean Jensen-Shannon distance between profile labels and predictions for DNASE test set. H) Multinomial negative log likelihood between profile labels and predictions. I-J) Mean Jensen-Shannon distance and multinomial negative log likelihood between labels and predictions in ENCODE canonical ATAC datasets.

Figure 4.9: Predicting profile and count signal from enzymatic bias input. A) Predicting DNASE and ATAC-seq profiles with weight-frozen ChromBPNET architecture followed by convolution layer initialized with the Vierstra DNASE 6-mer[481]. B) Convolution layer initialized with Tobias 24-mer. C) Predicting DNASE and ATAC-seq profiles from BPNET bias model. D-F) Training counts head only, while profile head is frozen. G-I) Training profile head only, while count head is fixed.
• ChromBPNET architecture with five filters (as opposed to 500). Models were trained with kernel size 6 and 24 to match the kernel sizes used by prior k-mer based models.
• ChromBPNET architecture with 20 filters (as opposed to 500). Models were trained with kernel size 6 and 24. The learned PWMs are shown in figure 4.10E,F. A variation of this model was trained with a weight of 0 assigned to the count loss.
• convolutional neural network with a single layer, 5 filters, and kernel size 6 (figure 4.10C,D). Variations of this model with 2 conv layers and 3 conv layers were trained.
• convolutional neural network with a single layer, 5 filters, and kernel size 24.
• convolutional neural network with a single layer, 1 filter, and kernel size 6 (figure 4.10A,B).
• convolutional neural network with a single layer, 1 filter, and kernel size 24 (figure 4.10A,B).
• convolutional neural network with a single layer, 1 filter, kernel weights initialized from the Vierstra DNASE PWM, as presented in figure 4.10A.
• convolutional neural network with a single layer, 1 filter, kernel weights initialized from the TOBIAS DNASE and ATAC PWMs.
• convolutional neural network with a frozen single filter, initialized as the Vierstra DNASE PWM.
• convolutional neural network with a frozen single filter, initialized as the TOBIAS ATAC or DNASE PWM.
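To illustrate what a single-filter bias model from the list above computes, the sketch below scans a one-hot-encoded sequence with one k x 4 kernel. It is a plain-Python stand-in for a convolution layer, not the trained models: the toy weight matrix is hypothetical and does not correspond to the Vierstra or TOBIAS PWMs. Initializing the kernel from a published PWM and then either freezing it or continuing training corresponds to the frozen and non-frozen variants listed above.

```python
BASES = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA sequence; bases outside ACGT become all-zero rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d_single_filter(x, kernel, bias=0.0):
    """Valid-mode cross-correlation of a one-hot sequence with a single k x 4 filter,
    which is the operation deep learning frameworks call a 1D convolution."""
    k = len(kernel)
    out = []
    for start in range(len(x) - k + 1):
        score = bias
        for offset in range(k):
            for channel in range(4):
                score += x[start + offset][channel] * kernel[offset][channel]
        out.append(score)
    return out

# Hypothetical 3-mer weight matrix favouring "GAT" (rows: positions; columns: A, C, G, T).
toy_pwm = [
    [-1.0, -1.0, 2.0, -1.0],
    [2.0, -1.0, -1.0, -1.0],
    [-1.0, -1.0, -1.0, 2.0],
]
scores = conv1d_single_filter(one_hot("CCGATCC"), toy_pwm)
print(scores)
```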
The count performance metrics for these bias models are shown in figure 4.11, including Spearman and Pearson correlations with label counts across the genome, within GM12878 IDR peaks, and within K562 IDR peaks. Similarly, performance metrics on the profile (mean Jensen-Shannon distance and mean MNLL) are shown in figure 4.12. For learning DNASE bias, the highest count and profile performance metrics were achieved for the BPNET 20-filter bias model and for the 1-filter model initialized from the TOBIAS DNASE 24-mer PWM. For learning ATAC bias, the highest performance was achieved by the 1-filter model initialized from the TOBIAS ATAC 24-mer PWM. These models were used for bias correction in the full ChromBPNET models.
Once the highest performing bias models had been identified, the first step of training (learning bias from sequence with the ChromBPNET architecture frozen) was performed. Performance metrics are illustrated in table 4.1.
The comparison of ChromBPNET model performance within the HEPG2 cell line before and after enzymatic bias correction is illustrated in figure 4.13. In addition to the baseline models without a bias term and the enzymatic bias-corrected models, we perform two additional comparisons. The first, referred to as “WithNegatives.Bias.Cor.Tobias”, augments the training set to include regions that fall into peaks in one or more of the ENCODE canonical cell lines, but do not fall into peaks in HEPG2. The second, referred to as “BiasUnplugged”, calculates test set performance without incorporating the bias term into the model. Performance results suggest no significant difference in global performance metrics across the four modeling approaches, which implies that “regressing out” the bias contribution does not degrade overall model performance on either the count or profile head. Comparison of count head predictions from the baseline ChromBPNET, the bias-corrected ChromBPNET, and the bias-corrected ChromBPNET with the bias term unplugged (figure 4.14) suggests that count predictions agreed across folds of training, and were similar across models with bias correction and baseline. Predictions from the bias-corrected, bias-unplugged model were generally lower than for the bias-corrected model with the bias contribution, suggesting that the frozen bias model derived in the first step of the 2-step training process successfully regresses out the enzymatic bias contribution to the signal prediction.

Figure 4.10: Position weight matrices learned for ATAC-seq and DNASE bias. A) DNASE enzymatic bias PWM from the Tobias[53] algorithm (top), the Ohler K-mer based DNASE double-hit model[85], the Vierstra kmer-based model[481], the Ohler single-hit model[85]. Bottom two rows show the trained filter from a single-filter CNN model with a kernel size of 6 and 24, respectively. B) ATAC enzymatic bias PWM from the Tobias algorithm (top), the Ohler K-mer based ATAC model. Bottom two rows show the trained filter from a single-filter CNN model with a kernel size of 6 and 24, respectively. C) Trained filters from a 5-filter CNN model for DNASE bias. D) Trained filters from a 5-filter CNN model for ATAC-seq bias. E) PSSM from a ChromBPNET model trained to predict DNASE enzymatic bias. F) PSSM from a ChromBPNET model trained to predict ATAC-seq enzymatic bias.

Figure 4.11: Comparison of count performance metrics for models trained to predict enzymatic bias.

Figure 4.12: Comparison of profile performance metrics for models trained to predict enzymatic bias.

Dataset      Bias model  Pearson  Spearman  Mean JSD  Std JSD
HEPG2 DNASE  BPNET       0.35     0.37      0.24      0.03
HEPG2 DNASE  TOBIAS      0.32     0.34      0.25      0.03
HEPG2 ATAC   BPNET       0.57     0.59      0.26      0.05
HEPG2 ATAC   TOBIAS      0.55     0.58      0.27      0.04
Table 4.1: Performance metrics for ChromBPNET signal predicted on frozen sequence component, using TOBIAS-initialized and 20-filter BPNET bias models.

Figure 4.13: Comparison of performance metrics for bias-corrected ChromBPNET model with uncorrected model and negative-augmented model in HEPG2 ATAC-seq data.

Figure 4.14: Count predictions from baseline, bias-corrected, and bias-corrected with bias unplugged ChromBPNET models.
4.1.6 ChromBPNET sequence importance scores with DeepSHAP
DeepSHAP scores were calculated for all summit-centered IDR regions in the ENCODE canonical cell lines. The scores were compared across five models: bias-corrected ChromBPNET profile head, bias-corrected ChromBPNET count head, binned genomewide regression CNN, binned genomewide classification CNN, and SVM GKMExplain. Signal-to-noise ratio, stability across folds, and footprint accuracy were calculated across DeepSHAP scores for the models. The SNR comparison is illustrated in figure 4.15. SNR was determined by converting observed DeepSHAP scores to probability space along the 1kb output interval. The entropy of the probability distribution was then calculated using the entropy function from the Python scipy.stats package. Entropy values for each DeepSHAP profile (or corresponding GKMExplain profile) were compared across the five models. The result suggests that the GKMExplain scores had the lowest entropy (bottom row, figure 4.15), followed by the ChromBPNET DeepSHAP profile scores (top row, figure 4.15).
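The entropy-based signal-to-noise comparison described above can be sketched as follows. The text does not specify how scores were converted to probability space, so normalizing absolute values is an assumption here; the entropy is computed with the natural logarithm, which matches the default of scipy.stats.entropy. Function names and toy profiles are hypothetical.

```python
import math

def to_probability(scores):
    """Assumed conversion: normalise absolute importance scores to sum to 1."""
    magnitudes = [abs(s) for s in scores]
    total = sum(magnitudes)
    return [m / total for m in magnitudes]

def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability positions contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Toy profiles: importance concentrated on four bases versus spread evenly.
focused = [0.0] * 96 + [2.0, 3.0, 3.0, 2.0]
diffuse = [0.1] * 100

h_focused = shannon_entropy(to_probability(focused))
h_diffuse = shannon_entropy(to_probability(diffuse))
print(h_focused, h_diffuse)
```

A profile whose importance is concentrated on a few bases yields a lower entropy than a diffuse profile, which is why lower entropy is read as a higher signal-to-noise ratio.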
Stability of the DeepSHAP scores across folds was evaluated via the cosine similarity, Spearman correlation, and Pearson correlation metrics. These metrics were computed pairwise across each of the five folds used to train each model for all IDR peaks (figure 4.16). The SVM models showed slightly more stable interpretation values across the five folds compared to the four other models considered.
Stability metrics were also computed per-base using per-base cosine similarity scores (figure 4.17). For the majority of pairwise fold comparisons, the profile DeepSHAP scores for the ChromBPNET models were found to have the highest cosine similarity (distribution with highest mean). The next most stable set of DeepSHAP scores were from the count head of the ChromBPNET model, followed in stability by the GKMExplain scores of the SVM model. The lowest cosine similarity values were observed for the binned genomewide classification and regression CNNs.
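The pairwise fold comparison described above can be sketched as follows. The data layout (a dictionary mapping a fold index to the score vector for one region) and the function names are assumptions made for illustration; returning 0 for an all-zero vector is also an assumption.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity of two score vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def pairwise_fold_similarity(scores_by_fold):
    """Cosine similarity for every unordered pair of folds, keyed by (fold_i, fold_j)."""
    return {
        (i, j): cosine_similarity(scores_by_fold[i], scores_by_fold[j])
        for i, j in combinations(sorted(scores_by_fold), 2)
    }

# Toy example with three folds; five folds would give ten pairwise comparisons.
toy_scores = {0: [1.0, 0.0, 2.0], 1: [2.0, 0.0, 4.0], 2: [0.0, 1.0, 0.0]}
similarities = pairwise_fold_similarity(toy_scores)
print(similarities)
```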
4.1.7 Footprint comparisons from ChromBPNET against gold standards
The basepair-level resolution of ChromBPNET model predictions enables determination of transcription factor footprint locations. Footprints were identified for the set of IDR peaks in the ENCODE canonical cell lines, and two examples are illustrated in figure 4.18 and figure 4.19. Footprinting was also performed on the IDR peak set using the TOBIAS[53] algorithms, including the preliminary TOBIAS enzymatic bias correction step. Footprints at the FDR = 0.001 threshold were next obtained from Vierstra et al.[481]. The Vierstra footprint locations are indicated in blue shading in figures 4.18 and 4.19. The TOBIAS footprint track is shown in the bottom panel of both figures. DeepSHAP and DeepLIFT scores across the five models we compared are illustrated in tracks 2 to 6 from the top. We observe that the ChromBPNET prediction clearly delineates the footprint region in both examples, and strong DeepSHAP/GKMExplain scores are observed in the footprint region. In figure 4.18, the scores form a close match to the SP4 motif, while in figure 4.19, they form a close match to an NKX4 motif. In both examples, we observe that the GKMExplain scores have some signal outside of the footprints designated by the Vierstra and TOBIAS algorithms, highlighting the propensity of GKM kernel-based linear models to report false positive hits in this context. In contrast, the DeepSHAP for ChromBPNET profile has the highest concentration of signal in the footprint region.

Figure 4.15: Signal to noise ratio comparisons across DeepSHAP scores for different accessibility models. Pairwise comparison of Jensen Shannon Entropy values for 1kb DeepSHAP profiles from ChromBPNET profiles, ChromBPNET counts, binned regression counts, binned classification counts, and GKMExplain.
To compare the DeepSHAP/GKMExplain score alignment with TOBIAS/Vierstra footprints, we identified the highest confidence gold standard footprint for each IDR peak region. This footprint consisted of the region where the Vierstra FDR = 0.001 footprint call overlapped the highest footprint probability region in the TOBIAS footprint signal track. If no such overlap was observed, the region was skipped. Two metrics were calculated for the highest confidence footprint (figure 4.20). First, the DeepSHAP score distributions in footprints (blue regions in figure 4.20) were compared to score distributions outside of footprints (pink regions in figure 4.20), and we observe a clear shift to the right (higher values) for scores within footprints. A Wilcoxon rank sum test indicates that this shift is statistically significant (p ≤ 0.001) for all five models. Second, we compute the auPRC value for DeepSHAP scores in each 1kb region, with the positive class denoted by bases within the TOBIAS/Vierstra most confident footprints, and the negative class denoted by bases outside the most confident footprint. The distribution of auPRC values across the full set of IDR peaks is shown in the right column of figure 4.20. The ChromBPNET profile DeepSHAP scores have a significantly right-shifted distribution (p ≤ 0.05, Wilcoxon rank sum test) compared to each of the four other models.
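The per-region auPRC calculation described above can be sketched with average precision, one common estimator of the area under the precision-recall curve. Whether this estimator matches the one used for figure 4.20 is an assumption, tied scores are not treated specially, and the function name and toy values are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision: the mean of the precision values obtained at the rank
    of each positive (in-footprint) base, with bases ranked by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("region contains no footprint bases")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_positive

# Toy example: 1 marks a base inside the gold standard footprint, 0 a base outside it.
print(average_precision([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
print(average_precision([0.9, 0.1, 0.8, 0.2], [0, 1, 1, 0]))
```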
4.1.8 Contributions
Avanti Shrikumar developed the DeepSHAP and DeepLIFT algorithms used for model interpretation. ChromBPNET builds on the BPNET model developed by Ziga Avsec. Jin Lee constructed the ATAC-seq and DNASE-seq ENCODE DCC pipeline used for data processing. Anna Shcherbina performed the other above-mentioned analyses.

Figure 4.16: Interpretation score stability across ChromBPNET, binned genomewide CNN, and SVM models, measured by calculating cosine similarity, Spearman correlation, and Pearson correlation across five folds.

Figure 4.17: Distribution of per-base cosine similarity across base importance DeepSHAP and DeepLIFT scores. Distributions are calculated pairwise across five folds of training for ChromBPNET profile DeepSHAP scores, ChromBPNET count DeepSHAP scores, binned genomewide regression CNN DeepSHAP, binned genomewide classification DeepSHAP, SVM GKMExplain scores.

Figure 4.18: ChromBPNET footprint and interpretation scores compared to other models for K562 IDR peak centered on chr1, 17348301.

Figure 4.19: ChromBPNET footprint and interpretation scores compared to other models for K562 IDR peak centered on chr1, 17348301.

Figure 4.20: Interpretation score signal within the Vierstra/Tobias highest confidence footprint across models.

Chapter 5
Molecular phenotype to cellular phenotype links
5.1 Matrix stiffness induces a tumorigenic phenotype in mammary epithelium through changes in chromatin accessibility
5.1.1 Abstract
In breast cancer, the increased stiffness of the extracellular matrix is a key driver of malignancy. Yet little is known about the epigenomic changes that underlie the tumorigenic impact of extracellular matrix mechanics. Here, we show in a three-dimensional culture model of breast cancer that stiff extracellular matrix induces a tumorigenic phenotype through changes in chromatin state. We found that increased stiffness yielded cells with more wrinkled nuclei and with increased lamina-associated chromatin, that cells cultured in stiff matrices displayed more accessible chromatin sites, which exhibited footprints of Sp1 binding, and that this transcription factor acts along with the histone deacetylases 3 and 8 to regulate the induction of stiffness-mediated tumorigenicity. Just as cell culture on soft environments or in them rather than on tissue-culture plastic better recapitulates the acinar morphology observed in mammary epithelium in vivo, mammary epithelial cells cultured on soft microenvironments or in them also more closely replicate the in vivo chromatin state. Our results emphasize the importance of culture conditions for epigenomic studies, and reveal that chromatin state is a critical mediator of mechanotransduction.
5.1.2 Introduction
Tumour extracellular matrix (ECM) is substantially remodelled from normal tissue microenvironments, leading to changes in the composition and density of the ECM network. These modifications of the microenvironment result in changes in mechanical properties, such as increased matrix stiffness[260]. Differences in ECM stiffness have been broadly studied and are known to cause changes in gene expression by modulation of integrin binding and downstream signalling, cytoskeletal tension, conformational changes and activation of mechanosignalling complexes, and transcription factor activation and localization[260, 370, 107, 107, 353, 122, 140, 33]. Stiff matrices even promote malignant phenotypes in non-malignant mammary epithelial cells[353, 95]. However, despite the broadly appreciated role of the epigenome in gene regulation and its recognized misregulation in cancer, little is known about how changes in chromatin state regulate the impact of ECM mechanics[474].
Transcription factors bind regulatory DNA elements to control gene expression and cellular phenotypes[294], and this binding is dictated by the chromatin state. Regulatory elements bound by transcription factors typically exhibit signatures of accessible chromatin[361]. Chromatin accessibility can be altered by several enzymatic modifications, such as acetylation and methylation of specific histone tail residues, that are also associated with activation or silencing of genes[234]. The nuclear lamina is the nexus of the cytoskeleton and chromatin and is known to be mechanoresponsive[453], thus serving as a likely link between mechanical cues and chromatin remodelling. Indeed, histone deacetylases (HDACs) have been shown to be responsive to mechanical properties and culture dimensionality through interactions with the nuclear lamina[206, 270]. Intriguingly, biological processes associated with changes in chromatin state, such as stem cell differentiation and breast cancer progression, are also known to be mechanoresponsive[150, 304]. However, whether chromatin state mediates mechanotransduction in these processes is unclear. Further, direct evidence for mechanically induced chromatin remodelling in general is limited, and it is unknown by what mechanisms these changes might occur.
5.1.3 Methods
See Appendix for methods detailing hydrogel generation, cell culture, and imaging.
ATAC-seq library preparation
MCF10A cells were extracted from the alginate matrices via chelation and digestion as described above or trypsinized from 2D TCPS, and then processed according to published protocols[75]. DNA concentration and library quality were assessed with a Qubit fluorometer and a Bioanalyzer before sequencing. The libraries were sequenced on Illumina HiSeq 2500 or 4000 instruments in the Stanford Genome Sequencing Service Center using paired-end 101bp reads. At least three biological replicates were sequenced for each experimental condition. The conditions include soft, stiff, soft with SAHA treatment, stiff with SAHA treatment, soft 2D and 2D TCPS.
ATAC-seq analysis pipeline
The ATAC-seq data were subjected to quality control and processed using a publicly available pipeline that serves as the official specification of the ENCODE consortium (https://github.com/kundajelab/atac_dnase_pipelines, version 1.0). Briefly, sequencing adapters were trimmed from the reads. The reads were mapped in paired-end mode to the hg19 reference genome using Bowtie2[242], and multimapping, duplicate and mitochondrial reads were discarded. For each experimental condition, peaks were called for each replicate, for a pooled sample (by pooling reads from replicates) and for a pair of virtual pseudo-replicates (by randomly splitting the reads from the pooled sample into two pseudo-replicates). Peaks were called using Macs2 with a relaxed P-value threshold of 0.01 (ref. 51). The peaks were ranked by P value and only the top 300,000 peaks were retained. Irreproducible discovery rate (IDR) analysis was executed to determine high-confidence peak sets for all ATAC-seq samples. The IDR optimal set for all ATAC-seq samples was used for analysis. A consensus peak set for soft versus stiff analysis was generated by running bedtools merge on the IDR optimal peak set from the soft, stiff, stiff+SAHA and soft+SAHA samples. Pipeline quality control metrics for each experimental replicate are in F.1.
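The pseudo-replicate construction mentioned above (randomly splitting the pooled reads into two halves) can be sketched as follows. This is an illustration of the idea and not the ENCODE pipeline code; the function name, the in-memory list of reads, and the fixed seed are assumptions.

```python
import random

def make_pseudoreplicates(pooled_reads, seed=0):
    """Randomly split pooled reads into two pseudo-replicates of (nearly) equal size."""
    shuffled = list(pooled_reads)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Toy example: integers stand in for aligned reads.
rep_a, rep_b = make_pseudoreplicates(range(101), seed=1)
print(len(rep_a), len(rep_b))
```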
A count matrix for the soft/stiff/SAHA samples was generated by computing the filtered read coverage for each replicate on the merged IDR peak set. This was done by running the bedtools coverage command on the filtered tagAlign files generated by the ENCODE analysis pipeline with duplicates, blacklisted regions and mitochondrial reads removed (*nodup.tn5.no_chrM.25M.R1.tagAlign.gz).
Differential accessibility analysis for soft, stiff and SAHA-treated samples
The soft/stiff/SAHA count matrix served as the input for differential accessibility analysis via DESeq2 (ref. 52). For DESeq2 analysis, a design formula of Read Count ~ Stiffness + Stiffness:SAHA was used. Surrogate variable analysis was first performed with this design on the count matrix to determine batch effects. This was carried out with the svaseq function from the R sva library; one significant surrogate variable was identified and added to the model design. Size factor correction was performed using a set of housekeeping genes from https://www.tau.ac.il/~elieis/HKG/ whose promoters intersect with the set of IDR optimal peaks. After the custom size factor correction, DESeq2 analysis was performed and differential regions were identified using an FDR threshold of 0.05.
Normalized counts from the DESeq2 matrix were corrected for batch effects using the limma removeBatchEffect function. An rlogTransform of the normalized, filtered counts was then computed. The rlog data were used as the input for PCA and heatmap visualization. The row z-scores across soft, stiff and SAHA-treated samples were used for heatmap visualization.
It was determined that the DESeq2 model of soft and stiff samples was underpowered to detect the significantly differential peaks between the two conditions. Correcting for surrogate variables further reduced the power of the model. However, the rlog-transformed, normalized reads from the DESeq2 object indicated a clear soft versus stiff difference on the PCA. Consequently, the R loadings function was used to compute the contribution of each IDR peak to PC1 in a PCA of soft/stiff samples. The distribution of peak loadings for PC1 was computed, and those peaks with a loading value greater than two standard deviations above the mean were considered differential. This set of peaks underwent further analysis with MEME-ChIP[288] and HOMER[188].
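The loading-based selection described above can be sketched as follows. The original analysis was carried out in R; this Python sketch only illustrates the thresholding rule. The function name is hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics

def differential_by_loading(loadings, n_sd=2.0):
    """Indices of peaks whose PC1 loading exceeds mean + n_sd * standard deviation.
    Uses the sample standard deviation (an assumption)."""
    cutoff = statistics.mean(loadings) + n_sd * statistics.stdev(loadings)
    return [i for i, value in enumerate(loadings) if value > cutoff]

# Toy example: 99 peaks with no contribution to PC1 and one strong contributor.
toy_loadings = [0.0] * 99 + [10.0]
print(differential_by_loading(toy_loadings))
```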
Row z-scores from the stiff/soft/SAHA heatmap were analysed to determine the subset of differential peaks whose accessibility reverted to soft-like levels after treatment with SAHA. Motif analysis on this subset of peaks was performed using MEME-ChIP and HOMER. For MEME-ChIP analysis, a FASTA file of the differentially accessible regions was used to search against a background of all other shared IDR optimal set regions. Similarly, the HOMER command findMotifsGenome.pl was run on a BED file of differentially accessible regions with the same background described above.
Motifs discovered de novo were compared to known motifs to find the best matches.
Aggregate ATAC-seq footprint profiles at transcription factor motifs
The HOMER command scanMotifsGenome.pl was used to scan for Sp1 motif occurrences in differentially accessible regions (foreground) or consensus regions (background). The resulting motif coordinates were padded to 200 bases and intersected with the bed file of differential peaks between soft and stiff samples. The tagAlign files for soft and stiff samples from the ENCODE pipeline were split by strand, and the bedtools coverage command was used to compute the average number of 5’-end cuts and 3’-end cuts at each position in a ± 100bp window centred on the motif.
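The aggregation step described above can be sketched as follows. This is a simplified illustration rather than the bedtools-based procedure: it handles a single chromosome and a single set of cut sites, uses half-open 200 bp windows centred on each motif, and all names and coordinates are hypothetical. In the analysis above, the calculation would be repeated separately for each strand and for 5’-end and 3’-end cuts.

```python
from collections import Counter

def motif_window(motif_start, motif_end, half_width=100):
    """Half-open window of 2 * half_width bases centred on the motif midpoint."""
    centre = (motif_start + motif_end) // 2
    return centre - half_width, centre + half_width

def mean_cut_profile(cut_sites, windows):
    """Average number of cuts at each offset across equally sized windows."""
    counts = Counter(cut_sites)  # position -> number of cuts; missing positions count as 0
    width = windows[0][1] - windows[0][0]
    profile = [0.0] * width
    for start, _ in windows:
        for offset in range(width):
            profile[offset] += counts[start + offset]
    return [value / len(windows) for value in profile]

# Toy example: two motif occurrences and four cut sites.
windows = [motif_window(1000, 1010), motif_window(5000, 5010)]
profile = mean_cut_profile([1005, 1005, 5005, 910], windows)
print(profile[100], profile[5])
```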
Pharmacological inhibition
For all small-molecule inhibition studies, the drug was added on the day of encapsulation and replaced with each subsequent medium change. The inhibitors used were GSK 126 (100nM, Fisher Scientific), UNC 0638 (250nM, Sigma), JIB-04 (1µM, Sigma), GSK-J4 (10µM, Abcam), GSK-LSD1 (100nM, Sigma), sirtinol (50µM, Sigma), apicidin (1µM, Sigma), mithramycin A (50nM, Sigma), LY294002 (20µM, Sigma) and SAHA (1µM, Sigma). All drugs were dissolved in dimethylsulfoxide and diluted in basal medium before adding to the culture medium. Dimethylsulfoxide alone was added to the culture medium as a vehicle control.
RNA interference
MISSION lentiviral transduction particles were purchased from Sigma and used according to the manufacturer’s instructions. Clone IDs for the lentiviral particles were:
• TRCN0000020445 (Sp1)
• TRCN0000004814 (HDAC1)
• TRCN0000004819 (HDAC2)
• TRCN0000194993 (HDAC3)
• TRCN0000004851 (HDAC8)
pLKO.1-puro Control Transduction Particles were used as empty vector controls. MCF10A cells were plated on 24-well plates and cultured until 80% confluency was reached. Hexadimethrine bromide (8 µg ml⁻¹, Sigma) and 2 × 10⁷ transducing units per millilitre were added to the cells and incubated overnight. The transduction medium was removed and replaced with growth medium for 24h. Then puromycin-containing medium (1 µg ml⁻¹) was used to select for transduced cells. Puromycin-containing medium was used for expansion, 2D validation studies and 3D encapsulations.
Protein–protein interaction screening
String-DB was used to search for proteins interacting with Sp1 (ref. 55). The search was limited to experimental evidence alone, gathered from protein–protein interaction databases, with a minimum interaction score of 0.4.
TRRUST pathway analysis
The TRRUST version 2 web platform was used to assess Sp1 target genes and the pathways associated with them. A subset of Sp1 target genes that overlapped with genes associated with malignant neoplasm of breast was selected for gene expression analysis. The top 15 enriched diseases or pathways were included in F.4 and F.5.
Comparison to mammary epithelium
Publicly available ATAC-seq data from human mammary epithelium was obtained from the ENCODE database for two patients (ENCSR846ZBX and ENCSR65UYP). FASTQ files were processed through the same pipeline described above. A consensus peak set for the culture-specific analysis was generated by running bedtools merge on the optimal IDR peak sets for the mammary epithelium samples, the 2D TCPS samples, the 2D soft samples and the 3D soft samples.
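The consensus peak set construction can be sketched as follows. This is a plain-Python illustration of interval merging, assuming the default bedtools merge behaviour of joining overlapping and book-ended intervals; the function name merge_peaks and the tuple-based input format are hypothetical.

```python
def merge_peaks(peak_sets):
    """Merge overlapping or book-ended intervals pooled from several peak sets.

    peak_sets: iterable of lists of (chrom, start, end) tuples in half-open
    BED coordinates (hypothetical input format).
    Returns a sorted list of merged (chrom, start, end) tuples.
    """
    # Pool all peaks and sort by chromosome, then start, then end.
    intervals = sorted(peak for peaks in peak_sets for peak in peaks)

    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Same chromosome and touching or overlapping: extend the last interval.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Each input list would correspond to one optimal IDR peak set (mammary epithelium, 2D TCPS, 2D soft or 3D soft).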
The DESeq2 analysis pipeline, as described above, was applied to the merged IDR optimal peak set for mammary epithelium, 2D TCPS, 2D soft and 3D soft samples. Surrogate variable analysis was performed to identify sources of variation not captured by the DESeq2 model: Read Count ~ Sample. Size factor correction was performed using the set of housekeeping genes, as described above. Differential peaks were determined with an FDR threshold of 0.01 and a log2 fold change of 1.
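The thresholding step can be written as a simple filter. The sketch below is illustrative only: it assumes the thresholds are applied as adjusted P < 0.01 and |log2 fold change| > 1 (the text does not state whether the inequalities are strict or whether the fold change is taken in absolute value), and the record keys peak, log2fc and padj are hypothetical.

```python
def call_differential(results, fdr=0.01, min_abs_lfc=1.0):
    """Return the peaks that pass both the FDR and the effect-size threshold.

    results: list of dicts with keys 'peak', 'log2fc' and 'padj' (hypothetical
    layout); 'padj' may be None when no adjusted P value was computed.
    """
    return [
        r["peak"]
        for r in results
        if r["padj"] is not None
        and r["padj"] < fdr                 # assumed: strict FDR cut-off
        and abs(r["log2fc"]) > min_abs_lfc  # assumed: absolute fold change
    ]
```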
IDR peaks that overlapped between mammary epithelium and the three cultures (and were not found to be differential by the DESeq2 analysis above) underwent comparative analysis. Comparisons were performed using PCA. The histology image was obtained from the Human Protein Atlas (ref. 56).
Gene and genome ontology
The HOMER (ref. 54) command annotatePeaks.pl with the gene ontology flag was run on regions differentially accessible between 3D soft and 3D stiff matrices, and on the regions shared by mammary epithelium and either 2D TCPS, 2D soft or 3D soft. The genome ontology program was run on the naive overlap peak set for 3D soft and 3D stiff matrices to evaluate differences in accessibility of genomic features.
Statistical analysis
Statistical comparisons were performed with GraphPad Prism 7.0 software using the tests described in the figure captions. P values less than 0.05 were considered statistically significant. Significant differences in gene expression were determined by setting the FDR to 5% and using the two-stage step-up method of Benjamini, Krieger and Yekutieli. Default significance thresholds were used for motif analyses, gene and genome ontologies, and protein–protein interactions.
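For readers unfamiliar with the two-stage step-up method, a minimal sketch of the procedure as commonly described is given below. It is a plain-Python illustration and is not claimed to reproduce the GraphPad Prism implementation; the function names are hypothetical. Stage one runs the Benjamini–Hochberg procedure at level q/(1+q) to estimate the number of true null hypotheses, and stage two reruns it at a level rescaled by that estimate.

```python
def bh_reject_count(sorted_p, q):
    """Number of rejections by Benjamini-Hochberg at level q (ascending P values)."""
    m = len(sorted_p)
    k = 0
    for i, p in enumerate(sorted_p, start=1):
        # Step-up rule: keep the largest rank i with p_(i) <= i * q / m.
        if p <= i * q / m:
            k = i
    return k


def two_stage_bky(pvalues, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(pvalues)
    if m == 0:
        return []
    order = sorted(range(m), key=lambda i: pvalues[i])
    sorted_p = [pvalues[i] for i in order]

    # Stage 1: BH at the reduced level q / (1 + q).
    q1 = q / (1.0 + q)
    r1 = bh_reject_count(sorted_p, q1)
    if r1 == 0:
        k = 0            # nothing rejected, stop
    elif r1 == m:
        k = m            # everything rejected, stop
    else:
        # Stage 2: estimate the number of true nulls and rerun BH
        # at the correspondingly relaxed level.
        m0_hat = m - r1
        k = bh_reject_count(sorted_p, q1 * m / m0_hat)

    rejected = [False] * m
    for rank in range(k):
        rejected[order[rank]] = True
    return rejected
```

With the P values 0.001, 0.008, 0.039, 0.041 and 0.6 and q = 0.05, stage one rejects two hypotheses and stage two rejects four, illustrating the gain in power over a single Benjamini–Hochberg pass.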
5.1.4 Results
To address whether ECM mechanics can drive phenotypic shifts through changes to chromatin state, we utilized a well-studied, mechanosensitive breast cancer three-dimensional (3D) culture model[260, 353, 95] (5.1a). Interpenetrating networks (IPNs) of reconstituted basement membrane (rBM) matrix and alginate, a material system in which matrix stiffness can be tuned independently of ligand density, matrix pore size and matrix architecture, were used as matrices for 3D culture[95]. MCF10A breast epithelial cells were encapsulated in IPNs with elastic moduli similar to those of normal mammary tissue (~100Pa, soft) or malignant tumour tissue (~2,000Pa, stiff) and cultured for 14d (F.1). As in prior reports, the MCF10A cells formed organotypic acinar structures in the soft matrices, but exhibited a tumorigenic phenotype in stiff matrices, marked by an increase in invasive sites presenting higher levels of vimentin and phosphorylated focal adhesion kinase (pFAK), an associated decrease in roundness, matrix remodelling, filled acinar lumens, loss of normal β4 integrin-containing hemidesmosomal adhesions to an intact basement membrane, and aberrant E-cadherin and N-cadherin localization (5.1b–e and 5.2,F.3)[353, 95]. A prior report determined that enhanced stiffness is transduced through a β4 integrin–phosphoinositide 3-kinase (PI(3)K) signalling pathway to drive the malignant phenotype[95]. A tumorigenic phenotype was also observed when hTERT-HME1 cells (HME1 cells), another cell line used to model normal mammary epithelium, were encapsulated in stiff matrices, while clusters in soft matrices were rounded and less invasive (F.4).
We first examined the changes in nuclear morphology and chromatin architecture associated with the transition to a tumorigenic phenotype in the stiff matrices. An increase in nuclear wrinkling was observed for cells cultured in stiff matrices, and quantitative analysis revealed that these nuclei had a significantly higher proportion of extreme curvature than nuclei in cells cultured in soft matrices (5.1f,g). As the nuclear lamina can bind chromatin in lamina-associated domains in a mechanoresponsive manner, chromatin organization was next analysed using transmission electron microscopy (TEM)[456, 162]. Numerous heterochromatin bundles were observed at the nuclear periphery for cells in stiff matrices, in contrast to a thin chromatin boundary for cells in soft matrices (5.1h). Heterochromatin thickness increased on average in stiff matrices, with substantially more heterochromatin bundles, indicated by occurrences in the right tail of the distribution (5.1i). Having found that stiffness can alter chromatin organization, we next investigated several histone modifications associated with activation or repression. Interestingly, stiff matrices produced significantly higher levels of AcH3 but decreased levels of AcH4, while H3K4me3 (active) marks, H3K9me3 (repressive) marks and HP1γ (repressive) levels were not significantly different (5.1j). Together, these results indicate that chromatin state was broadly misregulated with increased stiffness, although not consistently toward a more open or closed architecture, motivating the use of a more specific assay to characterize chromatin organization.
To gain a genome-wide, site-specific perspective of alterations in chromatin organization in response to increased stiffness, we performed the assay for transposase-accessible chromatin with sequencing (ATAC-seq)[77]. ATAC-seq utilizes a hyperactive Tn5 transposase that preferentially cleaves and inserts sequencing adapters in accessible chromatin. Sequencing and mapping the fragments enables genome-wide profiling of putative regulatory elements bound by transcription factors exhibiting signatures of chromatin accessibility. ATAC-seq can be carried out with low cell numbers (500–50,000), making it amenable to 3D cultures. Accessibility profiles from independent biological replicates for both soft and stiff matrices were highly correlated, enrichment at transcription start sites was prevalent for both conditions, and signal within other genomic features was similar (F.1-F.3). Differential analysis of ATAC-seq peaks revealed 1,658 significantly more accessible peaks for cells cultured in stiff matrices, with no regions found to be significantly more accessible in soft matrices (5.2a,b).

Figure 5.1: a, The hypothesis underlying this study is that ECM properties can stabilize normal phenotypes or make phenotypic transitions more permissive through chromatin alterations. b, Immunofluorescence staining for F-actin, vimentin, pFAK (Tyr 397) and β4 integrin in MCF10A acini after 14d in soft (top) or stiff (bottom) matrices (scale bars, 50µm; representative images selected from 15, 3, 5 and 5 images per group, respectively). c, Representative outlines of cell clusters from soft (top) and stiff (bottom) matrices. Scale bars, 50µm. d, Quantification of roundness of cell clusters from at least three independent replicates (median ± 95% confidence interval (CI)). Significance was determined by the Mann–Whitney test. e, Quantification of invasive clusters (n ≥ 5, mean±s.d.). Significance was determined by an unpaired t-test. f, A colour map of nuclear curvature for cells in soft (top) and stiff (bottom) matrices. Scale bars, 2µm; the colour bar ranges from −20 to 20µm^−1. g, Distribution of curvature values, showing more regions of extreme curvature for cells in stiff matrices. Distributions were significantly different by the Kolmogorov–Smirnov test (P≤0.0001, n≥11). h, TEM micrographs of nuclei from cells in soft or stiff matrices at x2,000 (left) and x10,000 (right) magnification (representative images from at least 16 images per group). i, Distribution of measured chromatin thickness at each pixel around the nuclear boundary for at least six nuclei in each group. Distributions were significantly different by the Kolmogorov–Smirnov test (P≤0.0001). j, Western blot quantification of histone modifications normalized to total H3 levels (mean±s.d.). Three independent replicates were used and significance was tested using unpaired t-tests.

Figure 5.2 (previous page): a, A heatmap of 1,658 regions with differential accessibility. Each row represents a differential region; each column is one of three biological replicates of soft or stiff conditions. b, Representative genome browser tracks of significantly differentially accessible regions (highlighted). c, Quantification of roundness for small-molecule inhibitors of histone modifiers. Three independent replicates were used for each condition; significance was determined by the Kruskal–Wallis test followed by Dunn’s multiple testing correction (median±95% CI). d, Confocal immunofluorescence images of clusters in control matrices and treated with SAHA (top, from at least 22 images per group) and representative outlines of clusters (bottom). Scale bars, 25µm. e, Quantification of invasive clusters (n=3, mean±s.d.). Significance determined by one-way analysis of variance (ANOVA) followed by Dunnett’s multiple testing correction. f, A heatmap of 1,658 differentially accessible regions between soft and stiff control matrices with signal for SAHA-treated cells in stiff matrices. The green box represents regions that are more accessible in stiff control and less accessible after SAHA treatment. g, Representative genome browser tracks of regions that are more accessible in stiff control matrices than soft control matrices and are less accessible following SAHA treatment (differential region highlighted).
We next sought to determine chromatin modifiers that may be associated with the changes in chromatin accessibility. Pharmacological inhibition of four major classes of histone-modifying enzymes including histone methyltransferases, histone demethylases, class I HDACs and class III HDACs was performed. Only inhibition of class I HDACs caused cells cultured in stiff matrices to form rounded clusters similar to cells cultured in soft matrices (5.2c,d). Clusters of cells cultured with either suberoylanilide hydroxamic acid (SAHA) or apicidin, two structurally distinct class I HDAC inhibitors, were significantly more rounded than controls in stiff matrices but similar to clusters cultured in soft matrices. A similar result was found in HME1 cells treated with SAHA (F.4). HDAC inhibition by SAHA treatment also significantly reduced the fraction of invasive clusters for MCF10A cells cultured in stiff matrices (5.2e). Others have shown class I HDACs to be differentially activated by matrix stiffness and actomyosin contractility[206, 270]. To clarify whether SAHA treatment prevented the stiffness-induced tumorigenic phenotype by chromatin changes, ATAC-seq was performed on cells from this group as well. Differential peak calling followed by clustering identified a group of peaks that are more accessible in stiff control matrices than soft control matrices, but that have decreased accessibility following SAHA treatment, mirroring the observed phenotypic changes (5.2f,g). Peaks from the stiff–SAHA group with z-values closer to soft matrices than stiff matrices were identified as a reverted subset, consisting of 660 regions from the total 1,658 regions. This result is ostensibly contradictory to the conventional role for HDACs: deacetylation of histones to decrease chromatin accessibility. However, several reports have found that HDAC inhibition can induce both acetylation and deacetylation to alter chromatin accessibility in both directions [338, 378, 152, 379, 380, 97].
These reports are in agreement with the broad effects we observe following SAHA treatment in either soft or stiff matrices (F.5), illustrating the complex roles that HDACs play in epigenetic regulation.
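The definition of the reverted subset can be made concrete with a short sketch. It assumes one z-value per peak and condition (for example, the mean across replicates) and uses absolute difference as the distance, with ties counted as not reverted; these choices and the function name reverted_peaks are assumptions made for illustration rather than details taken from the analysis.

```python
def reverted_peaks(z_soft, z_stiff, z_saha):
    """Indices of peaks whose stiff+SAHA z-value lies nearer to soft than to stiff.

    Each argument is a list with one z-value per peak, in the same peak order
    (hypothetical input layout).
    """
    if not (len(z_soft) == len(z_stiff) == len(z_saha)):
        raise ValueError("all inputs must contain one value per peak")
    return [
        i
        for i, (soft, stiff, saha) in enumerate(zip(z_soft, z_stiff, z_saha))
        # Reverted: SAHA-treated signal is strictly closer to the soft condition.
        if abs(saha - soft) < abs(saha - stiff)
    ]
```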
To identify candidate transcription factors that may bind in the differentially accessible chromatin sites, we performed de novo motif analysis on the regions that were more accessible in stiff matrices. The best match among known transcription factor motifs for the highest ranked de novo motif by MEME-ChIP analysis was Sp1 (5.3a,c). The HOMER motif-discovery tool identified Sp1 among the most enriched candidate motifs as well (F.6). Additionally, motif analysis on the subset of peaks in 5.2f that lose accessibility following SAHA treatment (reversion) also implicated Sp1 (5.3b,c and F.7). Notably, the Sp1 motif is not enriched in the non-reverted set of peaks. ATAC-seq-based footprinting analysis at the Sp1 motifs revealed a distinct signature for cells in stiff matrices, with a higher Tn5 cut frequency (more accessibility) flanking the binding site and a deeper footprint at the motif centre, where bound Sp1 protects the site from Tn5 cleavage (5.3d). This footprint shape is consistent with those previously reported from ATAC-seq data[374].

Figure 5.3 (previous page): a, MEME-ChIP de novo motif analysis for differentially accessible regions between soft and stiff matrices. b, MEME-ChIP de novo motif analysis for regions of decreased accessibility in SAHA-treated cells in stiff matrices. c, The Sp1 motif logo (top) best matches the top ranked MEME-ChIP de novo motifs from soft versus stiff comparison (middle) and from regions of decreased accessibility in SAHA-treated cells in stiff matrices (bottom). d, Left: the transcription factor footprint of a 200bp region centred on Sp1 motifs in differentially accessible regions. Right: a cartoon illustrating an accessible chromatin region displaying the Sp1 motif in stiff ECM that is inaccessible to Sp1 in soft ECM. e, Quantification of Sp1 phosphorylation levels at Thr453 for cells in stiff matrices and treated with PI(3)K inhibitor (LY294002) or class I HDAC inhibitor (SAHA). All values are normalized by the value for soft matrices of the same treatment condition (mean±s.d.). Significance was determined by one-way ANOVA followed by Dunnett’s multiple comparison correction. f, A heatmap of gene expression for Sp1 target genes associated with malignant neoplasm of breast (n=3, two-way ANOVA with multiple comparison correction, false discovery rate (FDR)=0.05, colour scale represents log2 fold change, dot pattern indicates no significance, black box represents no data). g, A heatmap of gene expression for Sp1 target genes associated with malignant neoplasm of breast for stiff matrices with the indicated inhibitors compared to vehicle control (n=3, two-way ANOVA with multiple comparison correction, FDR=0.05, colour scale represents log2 fold change, dot pattern indicates no significance). h, Confocal immunofluorescence of Sp1-knockdown cells in stiff matrices (left, from 38 total images) and representative outlines of Sp1 shRNA (shSp1) clusters. Scale bars, 25µm.
i, Quantification of roundness for shSp1 clusters in stiff matrices versus empty vector (EV) controls in soft and stiff matrices (n=3, median±95% CI). j, Quantification of invasive clusters (n=3, mean ±s.d.). k, A schematic timeline of the small-molecule inhibitor of Sp1 washout experiment. l, Quantification of roundness of cell clusters in soft or stiff matrices treated with mithramycin A or vehicle control for the indicated period of time (n=3, median±95% CI). m, Quantification of invasive clusters (n=3, mean±s.d.). n, Confocal immunofluorescence of cells in soft or stiff matrices treated for the duration indicated and imaged after 14 total days (scale bars, 50µm; representative images from 15 images per group). Significance was determined by the Kruskal–Wallis test followed by Dunn’s multiple testing correction for roundness and by one-way ANOVA followed by Dunnett’s multiple testing correction for invasion.
We then analysed Sp1 activation in the stiff matrices and its relevance to breast cancer. Sp1 is a potent transcription factor, known to regulate proliferation, apoptosis, differentiation and malignant transformation (ref. 28). Malignant neoplasm of breast is the most significant specific disease found for Sp1 target genes in a search of the Transcriptional Regulatory Relationships Unraveled by Sentence-based Text mining (TRRUST) v2 database (F.4)[181]. Interestingly, this is the stage of breast cancer that our high-stiffness model system most closely recapitulates in terms of cell type, ligand type and stiffness. Sp1 interacts with a number of chromatin-modifying proteins and can function as both an activator and repressor of transcription depending on which proteins it recruits to regulatory complexes. Importantly, Sp1 phosphorylation at Thr453, a marker of Sp1 activation (ref. 30), was assessed and was found to be significantly higher in stiff matrices relative to soft matrices (5.3e and F.7). Expression levels of Sp1 target genes associated with malignant neoplasm of breast were then examined, including ten genes within the subset of peaks reverted by SAHA treatment. Stiff matrices induced significantly increased expression in 13 of the 18 genes, consistent with the conclusion of enhanced Sp1 activity (5.3f). As expected, Sp1 inhibition with mithramycin A, which diminishes the tumorigenic phenotype, significantly reduced the expression of 14 of the target genes compared to cells in stiff matrices treated with a vehicle control (5.3g). Intriguingly, treatment with SAHA generally resulted in reduced expression for genes in regions with decreased chromatin accessibility (reverted subset), but SAHA increased expression for genes not found in the reverted subset (5.3g). Thus, the gene expression profile largely mirrors the chromatin accessibility profile.
In the 3D culture model used here, it is known that increased stiffness promotes the tumorigenic phenotype through PI(3)K-mediated signalling, and PI(3)K has been shown by others to phosphorylate Sp1 to alter HDAC association, chromatin binding and gene activation[95, 447, 516]. PI(3)K signalling was also the pathway most significantly associated with Sp1 target genes by TRRUST analysis (F.5). Sp1 phosphorylation levels were significantly reduced by inhibition of PI(3)K (with LY294002) or class I HDACs (with SAHA) (5.3e and F.7). The expression profile of Sp1 target genes following PI(3)K inhibition was similar to that of SAHA-treated groups (5.3g). For genes within the SAHA-reverted subset, expression was generally reduced, while for regions not found in the reverted subset, expression was increased. That inhibition of either HDACs or PI(3)K results in similar effects on both Sp1 gene expression and Sp1 phosphorylation suggests that they may be regulating Sp1 through a common pathway, although the exact mechanism is still uncertain. Taken together, these data demonstrate increased activation of Sp1 in stiff matrices and indicate that Sp1 activation may be prominently involved in the establishment of the tumorigenic phenotype in response to stiff matrices through altered chromatin accessibility.
To directly assess the role of Sp1 in phenotypic determination, we performed genetic and pharmacological inhibition of Sp1. Knockdown of Sp1 using short hairpin RNA (shRNA) caused cells to form rounded clusters despite being cultured in stiff matrices (5.3h,i). Compared to empty vector control cells cultured in stiff matrices, Sp1 knockdown cells were significantly more rounded and similar to empty vector control cells in soft matrices. Knockdown of Sp1 also resulted in a significant reduction in the fraction of invasive clusters for cells cultured in stiff matrices (5.3j). Sp1 knockdown cells were still able to proliferate in soft and stiff matrices, as cell clusters contained dozens of cells, but the tumorigenic characteristics typically associated with stiff matrices were not observed. In addition, pharmacological inhibition of Sp1 with mithramycin A in HME1 cells abrogated the invasive phenotype in stiff matrices (F.4). Further, inhibition of Sp1 was performed for the breast cancer cell lines MCF7 and MDA-MB-231 in soft and stiff 3D matrices. Sp1 inhibition altered the morphology of these cancer cell lines in a stiffness-dependent manner (F.8). Both MCF7 and MDA-MB-231 cells adopted invasive, proliferative phenotypes in stiff matrices, but Sp1 inhibition resulted in significantly more rounded and smaller clusters, indicative of diminished invasion and proliferation, respectively. Interestingly, no difference in cluster size was observed for either cell line cultured in soft matrices with or without mithramycin A, suggesting that, for these breast cancer cell lines, Sp1-driven proliferation and invasion is stiffness dependent. These studies confirm the role of Sp1 in mediating the impact of stiffness on the tumorigenic phenotype.
Next, we examined the timeline for Sp1 regulation of tumorigenicity. Our previous experiments revealed a role for Sp1 in cultures lasting 14d. However, it was not clear whether stiff matrices initiated Sp1-induced effects in the early stages of induction of the tumorigenic phenotype, or whether the effects of Sp1 occurred only after the phenotype was well established. To determine when Sp1 was active within our system, mithramycin A was applied for the first 3 or 7d of culture, and then washed out with growth medium, while culture was continued for another 11 or 7d, respectively (5.3k). Inhibition of Sp1 for just 3d resulted in a significant increase in cluster roundness and a significant decrease in invasiveness (5.3l–n). These results indicate that Sp1 must be involved in the earliest stages of tumorigenic conversion.
As Sp1 is known to recruit chromatin modifiers[129, 275], we next investigated what chromatin modifiers might cooperate with Sp1 to mediate the phenotypic transition. Assessment of known protein–protein interactions with Sp1 revealed a strong association with class I HDACs (5.4a). Motivated by the combination of this result with the prior results showing that class I HDAC inhibition was effective at preventing the tumorigenic phenotype from arising in stiff matrices, reduced Sp1 phosphorylation and altered Sp1 target gene expression profiles, shRNA knockdowns were performed on the four known members of class I HDACs (HDAC 1, 2, 3 and 8). Interestingly, knockdowns of HDAC3 and HDAC8 were highly effective at preventing the tumorigenic phenotype from arising in stiff matrices, while knockdowns of HDAC1 and HDAC2 had no effect (5.4b–d). To determine whether HDACs 3 and 8 are involved early in the stiffness-induced tumorigenic conversion, or are further downstream, small-molecule inhibitor washout experiments were performed (5.4e). Inhibition of class I HDACs for only 3 or 7d resulted in significantly more rounded, less invasive clusters (5.4f–h), supporting their involvement in early mechanosignalling pathways. Together, these data implicate Sp1 and HDACs in driving the tumorigenic phenotype in response to stiffness by altering chromatin state, potentially through interactions or overlapping signalling pathways (5.4i), although we have not established a direct interaction here.
Finally, we examined the implications of our results for the use of 3D cell culture versus traditional culture on tissue culture plastic for recapitulation of the in vivo chromatin landscape. It has been well established that mammary epithelial cells cultured on or in soft microenvironments yield phenotypes that mimic healthy mammary tissue (ref. 35), while stiff matrices induce phenotypes similar to invasive tumours (refs. 1, 8). However, chromatin state profiling experiments, to date, have largely been carried out in samples cultured on 2D tissue culture polystyrene (TCPS) or in vivo specimens. To evaluate whether soft microenvironments reproduce in vivo chromatin state more faithfully than culture on TCPS, ATAC-seq data from MCF10A cells cultured on 2D TCPS, on soft 2D polyacrylamide substrates (150Pa) coated with rBM matrix, or in 3D soft matrices, as well as non-malignant human mammary epithelial tissue (ENCODE) were compared. Morphologically, acini generated from soft matrices in vitro resemble the hollow glandular structure found in mammary epithelium in vivo (Human

Figure 5.4: a, A String-DB protein–protein interaction network of the top ten interactors with Sp1. b, Confocal immunofluorescence of shHDAC1, shHDAC2, shHDAC3 and shHDAC8 cells in stiff matrices (top, from at least 12 total images per group) and representative outlines of clusters (bottom). Scale bars, 100µm. c, Quantification of roundness for HDAC knockdowns in stiff matrices compared to empty vector controls in soft or stiff matrices and SAHA-treated cells in stiff matrices (n=3, median±95% CI). d, Quantification of invasive clusters (n=3, mean±s.d.). e, A schematic timeline of the small-molecule inhibitor of Sp1 washout experiment. f, Confocal immunofluorescence of cells in soft or stiff matrices treated for the duration indicated and imaged after 14 total days (scale bars, 25µm; representative images from 15 images per group). g, Quantification of roundness of cell clusters in soft or stiff matrices treated with SAHA or vehicle control for the indicated period of time (n=3, median ± 95% CI). h, Quantification of invasive clusters (n=3, mean ± s.d.). i, A schematic illustrating sequential events from ECM mechanical properties to phenotype via mechanically induced chromatin remodelling. Significance was determined by the Kruskal–Wallis test followed by Dunn’s multiple testing correction for roundness and by one-way ANOVA followed by Dunnett’s multiple testing correction for invasion.
Protein Atlas[473]), and both are distinct from the monolayer that forms on TCPS (5.5a). Peaks with significantly different accessibility were identified between 2D TCPS and mammary epithelium samples (1,314 peaks, 5.5b). Accessible regions from soft matrices, either 2D or 3D, cluster with mammary epithelium. Interestingly, while the majority of differential regions are less accessible in mammary epithelium or soft microenvironments, a substantial subgroup (~30%) have increased accessibility compared to 2D TCPS culture. Principal component analysis (PCA) also shows a clear separation of 2D TCPS samples from the soft matrices, which are much closer along the PC1 axis that accounts for 72% of the variance (5.5c). Thus, just as culture of mammary epithelial cells on soft matrices better replicates the morphology and phenotype of in vivo mammary epithelium, the chromatin landscape is also remodelled in response to soft matrices in a manner that much more accurately represents in vivo tissue.
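The fraction of variance attributed to the first principal component can be illustrated with a small, dependency-free sketch. It is not the PCA implementation used for this analysis (a numerical library would be the usual choice); it assumes a samples-by-peaks matrix of accessibility values and uses power iteration on the sample Gram matrix, and the function name pc1_variance_fraction is hypothetical.

```python
def pc1_variance_fraction(matrix, iterations=500):
    """Fraction of total variance captured by the first principal component.

    matrix: list of rows, one per sample, each a list of per-peak values
    (hypothetical input layout).
    """
    n = len(matrix)
    n_features = len(matrix[0])

    # Centre each peak (column) across samples.
    means = [sum(row[j] for row in matrix) / n for j in range(n_features)]
    centred = [[row[j] - means[j] for j in range(n_features)] for row in matrix]

    # The sample-by-sample Gram matrix has the same non-zero eigenvalues as the
    # peak-by-peak scatter matrix, and is much smaller when samples << peaks.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centred] for r1 in centred]
    total = sum(gram[i][i] for i in range(n))
    if total == 0:
        return 0.0

    # Power iteration converges to the eigenvector of the largest eigenvalue.
    vec = [1.0 + i for i in range(n)]
    for _ in range(iterations):
        new = [sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n)]
        norm = sum(x * x for x in new) ** 0.5
        if norm == 0:
            break
        vec = [x / norm for x in new]

    # Rayleigh quotient of the unit vector gives the top eigenvalue.
    top = sum(vec[i] * sum(gram[i][k] * vec[k] for k in range(n)) for i in range(n))
    return top / total
```

A value of 0.72 returned by such a calculation would correspond to the 72% of variance reported for PC1.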

Figure 5.5 (previous page): a, Representative morphologies of MCF10A cells cultured on TCPS (left), on soft 2D matrices, in soft 3D matrices, and from human mammary tissue. Scale bars, 25µm. b, A heatmap of significantly differentially accessible regions between the different culture conditions and mammary epithelium, demonstrating the similarity in accessibility with cultures from soft matrices. Each row represents a differential region; each column is one biological replicate of the indicated condition. c, PCA reveals that accessible regions in soft matrices cluster closer to mammary epithelium than 2D TCPS along the first PC that accounts for 72% of variance. Three biological replicates were used for 2D TCPS, 2D soft and 3D soft, and two were used for in vivo mammary epithelium. Mammary epithelium image in a adapted from ref[36], Human Protein Atlas.
5.1.5 Discussion
This work demonstrates that alterations in ECM stiffness can drive changes to the nucleus and chromatin state, which in turn regulate phenotypic changes. Enhanced matrix stiffness alters both lamina-associated chromatin and accessible chromatin regions to promote the tumorigenic phenotype in mammary epithelium through Sp1–HDAC3/8-mediated pathways. Our findings highlight a major role of chromatin in mediating mechanotransduction and identify changes in chromatin organization in response to matrix stiffness genome-wide at a high level of detail. Several recent studies have shown that mechanical cues can alter nuclear mechanics, nuclear lamina components and global chromatin condensation[453, 191]. Extracellular mechanical signals can also be transmitted to the nucleus through the cytoskeleton, allowing for forces on the cell surface to directly remodel chromatin to alter gene expression (ref. 18). Prior studies on 2D substrates have demonstrated that class I HDACs are differentially active in response to actomyosin contractility[270, 207]. Additionally, HDAC3 has been shown to be bound and activated by emerin on the nuclear lamina[123] and emerin regulates heterochromatin in response to mechanical strain[247]. These studies complement our findings that enhanced stiffness in 3D matrices promotes chromatin remodelling through the class I HDACs HDAC3 and HDAC8. Enhanced stiffness induces increased accessibility in chromatin regions throughout the genome, and these accessible sites frequently present Sp1-binding motifs. Increased activation of Sp1, increased expression of key Sp1 target genes, and abrogation of tumorigenicity by knockdown or inhibition of Sp1 point to increased Sp1 activity as a key event in regulation of the stiffness-induced tumorigenic phenotype in this 3D culture model of mammary epithelium.
The known increase in stiffness during breast cancer progression combined with the known association of Sp1 target gene expression with malignant neoplasm of the breast suggests the relevance of this stiffness–Sp1–HDAC3/8 pathway to the earliest stages of breast cancer progression. More broadly, our work reveals a pathway by which mechanical signals can be transduced to the nucleus to alter chromatin through a complex of regulatory proteins to drive different phenotypic outcomes.
In addition, our work highlights the role of culture conditions in epigenomic studies, finding that soft matrices match not only the phenotypic state, but also the chromatin state of mammary epithelium more closely than conventional 2D culture on rigid TCPS. These experiments build on foundational studies showing the effects on mammary epithelial cells from culture in rBM protein matrices compared to conventional 2D TCPS. A previous study found that MEC culture on 2D TCPS causes loss of tissue-specific function through chromatin remodelling[503]. Additionally, culture in the presence of rBM causes a reduction in histone acetylation, an increase in chromatin condensation and a global decrease in gene expression[248]; all indicative of less accessible chromatin. Here we employed ATAC-seq to reveal site-specific changes in chromatin accessibility and found broad agreement with prior work demonstrating chromatin changes based on bulk chromatin assays. However, we were able to distinguish a subgroup of approximately 30% of the significantly differentially accessible regions between 2D TCPS and mammary epithelium that have increased accessibility in in vivo samples. These chromatin changes are mirrored by soft microenvironments. The ability to differentiate chromatin accessibility in a site-specific, genome-wide manner highlights the power of next-generation sequencing techniques when applied to studies of mechanotransduction, as conventional bulk assays would mask the specific effects of mechanical cues by averaging signal across the epigenome. This work underscores the need for chromatin profiling experiments in biomimetic culture systems when they become available, particularly in models that are known to be mechanically responsive, such as cancer progression and stem cell differentiation[153, 58].
5.1.6 Author Contributions
This work is authored by Ryan Stowers, Anna Shcherbina, Johnny Israeli, Joshua Gruber, Julie Chang, Sungmin Nam, Atefeh Rabiee, Mary Teruel, Michael Snyder, Anshul Kundaje, and Ovijit Chaudhuri. Anna Shcherbina’s contribution was the ATAC-seq data analysis and bioinformatics profiling of transcription factor binding.
5.2 Cell cycle dynamics of human pluripotent stem cells primed for differentiation
5.2.1 Abstract
Understanding the molecular properties of the cell cycle of human pluripotent stem cells (hPSCs) is critical for effectively promoting differentiation. Here, we use the Fluorescence Ubiquitin Cell Cycle Indicator (FUCCI) system adapted into hPSCs and perform RNA-sequencing on cell cycle sorted hPSCs primed and unprimed for differentiation. Gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner in cells primed for differentiation without altering genes associated with pluripotency. Furthermore, we identify an important role for PI3K signaling in regulating the early transitory states of hPSCs towards differentiation.
5.2.2 Introduction
Despite recent advances in generating specialized cell types from human pluripotent stem cells (hPSCs), many studies have noted that pluripotent stem cell lines often have an inherent inability to differentiate even when stimulated with a proper set of signals[63, 103, 113, 340, 112]. The cell cycle, particularly the G1 phase, may play an important role in enhancing the differentiation potential of PSCs[103, 355, 427, 493, 411]. However, simply lengthening the G1 phase in embryonic stem cells is not sufficient to facilitate differentiation[269], suggesting that an improved understanding of the molecular properties of the embryonic cell cycle is needed.
In a prior study, we demonstrated that transiently treating hPSCs with dimethylsulfoxide (DMSO) for 24h prior to directed differentiation significantly increases the propensity for differentiation across all germ layers[103, 266]. This technique is now used by multiple laboratories to improve differentiation across species (including mouse, rabbit, primate, and human) into more than a dozen lineages, ranging from neurons and cortical spheroids to smooth muscle cells to hepatocytes[104, 478, 17, 401]. While the DMSO treatment activates the retinoblastoma protein (Rb) and increases the percentage of hPSCs in the G1 phase of the cell cycle[103], it remains unknown whether the DMSO treatment simply enriches cells in G1 or whether there are intrinsic changes to the cell cycle following the DMSO treatment that may potentiate differentiation.
Here, we use Fluorescence Ubiquitin Cell Cycle Indicator (FUCCI) technology to systematically track and understand cell division in hPSCs primed and unprimed for differentiation[400]. The FUCCI system fuses red- and green-emitting fluorescent proteins to the cell cycle ubiquitination oscillators, Cdt1 and Geminin, whereby Cdt1 tagged with RFP is present only when cells are in G1 and Geminin tagged with GFP is present only when cells reside in the S/G2/M phases. By performing RNA-sequencing on hPSCs sorted from the early G1, late G1, and SG2M phases of the cell cycle, we show that gene expression patterns of signaling factors and developmental regulators change in a cell cycle-specific manner in cells primed for differentiation following a 24h DMSO treatment. Changes in signaling pathways controlling cell proliferation, differentiation, and apoptosis, particularly the phosphoinositide 3-kinase (PI3K) pathway, were regulated by the DMSO treatment. Concordantly, transiently inhibiting PI3K signaling enhances hPSC differentiation across all germ layers.
To our knowledge, this is the first study to systematically perform RNA-seq on cell cycle sorted populations of hPSCs to investigate changes that occur within a cell line as cells transition towards a state for differentiation. This comprehensive analysis begins to shed light on important signaling pathways, particularly PI3K, in regulating the developmental potential of hPSCs during early transitory states.
5.2.3 Materials and Methods
Dataset generation is described in the Appendix.
Statistical analysis
For all the variables in the experimental validation studies, means and SEM were calculated. A one-way ANOVA followed by a Tukey’s post hoc test was used to determine statistical significance. For comparisons between two groups, the unpaired two-tailed Student’s t-test was used to determine statistical significance. P values ≤ 0.05 were considered statistically significant.
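As a minimal sketch of the summary statistics named above, the following computes a mean, an SEM, and the unpaired equal-variance (Student's) t statistic using only the Python standard library. The function names and the example measurements are hypothetical; converting the t statistic to a two-tailed p-value requires the t-distribution CDF from a statistics library and is not shown here.

```python
import math
import statistics

def mean_and_sem(values):
    """Return (mean, standard error of the mean) using the sample SD."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))

def student_t_statistic(group_a, group_b):
    """Unpaired, equal-variance t statistic and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * statistics.variance(group_a)
        + (nb - 1) * statistics.variance(group_b)
    ) / (na + nb - 2)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    return diff / math.sqrt(pooled_var * (1 / na + 1 / nb)), na + nb - 2

# Hypothetical measurements for two groups (not data from this study).
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
control_mean, control_sem = mean_and_sem(control)
t_stat, dof = student_t_statistic(control, treated)
print(control_mean, round(control_sem, 4), round(t_stat, 4), dof)
```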
RNA-sequencing processing pipelines
Paired-end Illumina RNA-seq reads were trimmed using the Trimmomatic software (v. 0.36)[64]. The trimmed reads were aligned to the hg19 reference genome using STAR (v2.5.3a)[12]. All parameters were set to their default values, with the exception of the following:
--outFilterScoreMinOverLread 0
--outFilterMatchNminOverLread 0
--outFilterMatchNmin 0
--outFilterMismatchNmax 2
These parameters were relaxed relative to their default values to permit less stringent handling of multimappers, ultimately improving the overall alignment rate. Gene and transcript expression was then quantified with RSEM (v1.1.17)[262], also using the default parameters. Gene expression values were normalized as asinh(transcripts per million).
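To illustrate the normalization described above, the sketch below applies the inverse hyperbolic sine to a few TPM values. The function name and the example values are assumptions made for illustration and are not part of the original pipeline. The transform is defined at zero and behaves approximately like ln(2x) for large x, which is a common reason for preferring it over a plain logarithm.

```python
import math

def asinh_normalize(tpm_values):
    """Apply asinh to each TPM value; asinh(0) = 0 and asinh(x) ~ ln(2x) for large x."""
    return [math.asinh(v) for v in tpm_values]

# Hypothetical TPM values for one gene across samples.
example_tpm = [0.0, 1.0, 10.0, 1000.0]
normalized = asinh_normalize(example_tpm)
for tpm, value in zip(example_tpm, normalized):
    print(f"TPM={tpm:8.1f}  asinh(TPM)={value:.3f}")
```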
Differential gene expression was calculated on normalized read counts.
TPM (transcripts per million) values from RSEM were used for differential expression analysis. The sva package[254] in R was used to perform surrogate variable analysis with the null model: expression ~ cell cycle phase + DMSO treatment status. The significant non-correlated surrogate variables were identified and subtracted from the normalized data. The limma R package[490] was used to identify differentially expressed genes with the model as mentioned above. Three sets of pairwise comparisons were performed with limma: early G1, DMSO vs early G1, Control; late G1, DMSO vs late G1, Control; SG2M, DMSO vs SG2M, Control. Differential genes were those with FDR values ≤ 0.01 and abs(log2(fold change)) ≥ 1.
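The thresholding step above can be sketched as follows. This is not the limma analysis itself: it assumes that p-values and log2 fold changes are already available, and it assumes the Benjamini-Hochberg procedure for the FDR, which the text does not state explicitly. All gene names and numbers are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

def call_differential(genes, p_values, log2_fold_changes,
                      fdr_cutoff=0.01, lfc_cutoff=1.0):
    """Keep genes with FDR <= fdr_cutoff and |log2 fold change| >= lfc_cutoff."""
    fdr = benjamini_hochberg(p_values)
    return [
        gene
        for gene, q, lfc in zip(genes, fdr, log2_fold_changes)
        if q <= fdr_cutoff and abs(lfc) >= lfc_cutoff
    ]

# Hypothetical inputs for one pairwise comparison (DMSO vs control in one phase).
genes = ["geneA", "geneB", "geneC", "geneD"]
p_values = [0.0001, 0.002, 0.03, 0.5]
log2_fc = [2.1, -0.4, -1.5, 3.0]
differential = call_differential(genes, p_values, log2_fc)
print(differential)
```

In this toy input only geneA passes both cutoffs: geneB fails the fold-change cutoff, and geneC and geneD fail the FDR cutoff.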
Differential genes underwent clustering with the Dirichlet process Gaussian process mixture model (DPGP)[307] software. 1000 iterations of clustering were performed with default software parameters. Inputs to the clustering algorithm for genes consisted of the fold change in gene expression in DMSO samples over control samples for all differential genes, normalized as asinh(TPM surrogate variable contribution). Differential gene sets were visualized with the UpSetR package[261]. Pathway analysis (MSIGDB, KEGG) was performed with DAVID bioinformatics software (v. 6.8)[196]. Genes within each DPGP cluster were provided to DAVID as a gene list, with the hg19 reference gene list as background. Pathways with FDR values ≤ 0.01 were determined to be significant. The differential genes present in the top 50 most significant pathways (42 associated with signaling; 8 associated with mitosis) identified by the Cytoscape Reactome FI plugin were aggregated across pathways (Figure 5.7B, Supplementary Figure 5B). The number of pathways that each gene was associated with was tallied. It was determined that the PI3K gene was present in 13 of the top 50 differential Reactome pathways, and this observation was confirmed by querying the PI3K gene in the WikiPathways database[432] and cross-checking all pathways that were associated with the query.
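The tallying step described above can be sketched with a counter over pathway memberships. The pathway names, gene names, and memberships below are placeholders invented for illustration; they are not the Reactome results from this analysis.

```python
from collections import Counter

def tally_gene_pathway_membership(pathway_to_genes):
    """Count, for each gene, the number of pathways that contain it."""
    counts = Counter()
    for genes in pathway_to_genes.values():
        counts.update(set(genes))  # count each gene at most once per pathway
    return counts

# Placeholder pathway memberships (hypothetical).
pathways = {
    "pathway_1": ["GENE_A", "GENE_B", "GENE_C"],
    "pathway_2": ["GENE_A", "GENE_D"],
    "pathway_3": ["GENE_A", "GENE_B", "GENE_B"],
}
counts = tally_gene_pathway_membership(pathways)
print(counts.most_common(2))
```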
5.2.4 Results
Gene expression dynamics associated with cell cycle progression in hPSCs
To begin, we use the FUCCI system adapted into the hPSC H9 cell line[427] to systematically track and isolate hPSCs from different phases of the cell cycle (Figure 5.6A). H9 FUCCI hPSCs were cultured under maintenance conditions in mTESR (control) or with 2% DMSO for 24h to prime hPSCs for differentiation (Figure 5.6B). Following treatment with DMSO, there is a shift in cells from SG2M to G1 (Figure 5.6C and 5.6D). We next used fluorescence activated cell sorting (FACS) to isolate cells from early G1, late G1, and SG2M phases from control and 24h DMSO-treated H9 FUCCI hPSCs (Figure 5.6B and 5.6D) and performed RNA-sequencing. Using principal component analysis (PCA), we found that the strongest source of variation was in treatment vs control (PC1), followed by phase of the cell cycle (PC2) (Figure 5.6E). We next used the nonparametric Dirichlet process Gaussian process mixture model (DPGP)[307] to cluster fold changes in aligned reads (normalized to transcripts per million (TPM)) to assess changes in gene expression patterns associated with cell cycle progression.
A total of 2972 differentially expressed genes (FDR ≤ 0.05) underwent clustering and 10 clusters emerged (Figure 5.6F) with genes upregulated or downregulated in response to phase of the cell cycle following the DMSO treatment. The largest clusters consisted of genes with decreased expression in late G1 but high in early G1 and SG2M (cluster 7 with 454 genes) or increased expression in late G1 and reduced in early G1 and SG2M (cluster 5 with 420 genes) following the DMSO treatment. Genes with trajectories characteristic of the 10 clusters include those playing important roles in early development and regulating growth signaling pathways (e.g. PGAM1, LEFTY2, RHOB, WNT3, PIK3R3), ubiquitination and DNA repair (CUL4A), DNA replication licensing (e.g. MCM3), maintaining cell shape and cytoskeletal interactions (e.g. VIM, RHOB), and regulating transcription, splicing, and translation of genes through critical RNA helicases and polymerases (e.g. DDX46, POLR2H) (Figure 5.6G). Annotation of the gene sets representative of each cluster using the Molecular Signatures Database (MSigDB) shows the most significant pathways enriched in the 10 clusters (Figure 5.6H). Across all clusters, the DMSO treatment targeted pathways known to be tightly coordinated with the cell cycle, playing critical roles in cytoskeletal organization and membrane structure, transcriptional regulation, cell growth control, and development (e.g. Rho GTPases signaling, mitochondrial biogenesis, rRNA processing, neddylation, protein folding, extracellular matrix organization, cilium assembly, pre-mRNA processing, spliceosome) (Figure 5.6H). Genes associated with 5 of the 10 clusters (R5, R6, R8, R9, and R10) were enriched in the Processing of Capped Intron-Containing Pre-mRNA pathway (Figure 5.6H), indicating an important role for the DMSO treatment in regulating the efficiency and fidelity of gene expression[319].
Many pathways associated with mitochondrial function were also enriched (clusters R5 and R8), consistent with recent work demonstrating that mitochondrial dynamics play critical roles in the developmental potential of hPSCs[518]. Overall, these data illustrate that the DMSO treatment changes the expression of these genes in a phase-specific manner in hPSCs and thereby restricts their activity in a temporal manner that is otherwise not present in the cell cycle of untreated control hPSCs.

Figure 5.6 (previous page): (A) Schematic representation of the FUCCI technology labeling individual late G1 phase nuclei in red and S/G2/M phase nuclei in green, while early G1 phase nuclei are double negative. (B) Schematic of H9-FUCCI hPSCs treated with or without 2% DMSO for 24h followed by cell cycle sorting and high throughput RNA-sequencing. (C) Immunofluorescent images of Control and DMSO-treated H9 FUCCI hPSCs in the late G1 phase (red) and SG2M phases (green) of the cell cycle. (D) Fluorescence activated cell sorting of cells in the early G1 (double negative), late G1 (red), and SG2M (green) phases of the cell cycle in control and DMSO-treated FUCCI hPSCs. (E) Principal component analysis of batch-corrected RNA-seq expression data. PC 1 (97.02% variance explained) vs PC 3 (0.38% variance explained) are plotted on TPM (transcripts per million) data for hg19. (F) DPGP (Dirichlet process Gaussian process mixture model) clustering of differentially expressed genes. DPGP clustering was applied to the fold change of (TPM DMSO / TPM control) for differential genes, and yielded 10 clusters, labeled R1 - R10. The Z-scores for genes in each cluster are plotted in heatmap form as well as line plots of trajectories across early G1, late G1, SG2M. Red lines indicate fold change trajectories for individual peaks assigned to the cluster. The light blue cloud indicates values within 2 standard deviations of the cluster mean. (G) Representative differentially expressed genes for each DPGP cluster R1 - R10. TPM values with standard error bars are indicated for Control (red) and DMSO (blue) at the early G1, late G1, and SG2M phases. (H) Most significant MSIGDB pathways enriched in the DPGP clusters R1 - R10. Height of the bar indicates -log10(FDR) values for the corresponding clusters.
Regulatory role for PI3K-AKT signaling in hPSC differentiation
In aggregate, UpSet analysis shows that the largest number of differentially expressed genes occurs in the late G1 phase and that most of these genes are specific to that phase of the cell cycle: of the 1078 genes downregulated in late G1, 783 were not significantly altered at the other cell cycle phases; of the 895 upregulated genes, 554 were unique to late G1 (Figure 5.7A). MSIGDB pathway analysis (FDR ≤ 0.01) shows that DMSO affects a number of pathways associated with cell signaling (Figure 5.7B). Across all of the signaling pathways targeted by DMSO, PI3K was the most commonly represented gene followed by PIK3CA, a catalytic subunit of PI3K (Figure 5.7B). Kyoto Encyclopaedia of Genes and Genomes (KEGG) analysis shows that 48 genes associated with the PI3K-AKT pathway are significantly regulated by the DMSO treatment at one or more phases of the cell cycle (Figure 5.7C, Supplementary Figure 1). Many genes upstream in the pathway (e.g. PI3K receptors, PI3K, Ras) are generally down-regulated in the early and late G1 phases of the cell cycle. Other signaling pathways regulated by DMSO also converge upon PI3K and PI3KR signaling (examples illustrated in Supplementary Figures 2, 3, and 4), a pathway well known to regulate cell cycle, proliferation, differentiation, apoptosis, and growth and metabolism[20–24].
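The phase-specific counts reported by the UpSet analysis correspond to exclusive set memberships. A minimal sketch, assuming one set of differential gene identifiers per phase (all identifiers below are placeholders, not genes from this study):

```python
def phase_specific_genes(sets_by_phase, phase):
    """Return genes that are differential in `phase` and in no other phase."""
    others = set().union(
        *(genes for name, genes in sets_by_phase.items() if name != phase)
    )
    return sets_by_phase[phase] - others

# Placeholder sets of downregulated genes per phase (hypothetical).
downregulated = {
    "early_G1": {"g1", "g2"},
    "late_G1": {"g2", "g3", "g4"},
    "SG2M": {"g4", "g5"},
}
late_g1_only = phase_specific_genes(downregulated, "late_G1")
print(sorted(late_g1_only))
```

The same function applied to the upregulated sets would give the corresponding phase-specific upregulated genes.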

Figure 5.7 (previous page): (A) Number of differentially expressed genes (FDR ≤ 0.05, |LFC| ≥ 1) in Control vs DMSO-treated FUCCI hPSCs in the early G1, late G1, and SG2M phases. UpSetR diagram of differentially expressed genes shows the number of differential genes that increase (up) or decrease (down) in expression in response to DMSO at the early G1, late G1, and/or SG2M phases. (B) Enriched REACTOME pathways for differential genes at the early G1, late G1, and SG2M phases of the cell cycle. The heatmap shading corresponds to the -10log10(FDR) for each pathway across the different phases of the cell cycle. Genes with differential expression in response to DMSO treatment that are present in five or more differential signaling pathways are indicated with black boxes in the grid to the left of the heatmap. (C) Differentially expressed genes within the PI3K-AKT signaling pathway. Heatmap values are row z-scores of asinh(TPM) DMSO / asinh(TPM) controls.
Concordantly, pathways and genes associated with mitosis and the cell cycle (e.g. cell cycle checkpoints, p-value=3.98e-8) were also significantly regulated by the DMSO treatment through MSIGDB pathway and gene ontology (GO) enrichment analyses (E.5). Expression patterns for genes commonly implicated in cell division or regulating early differentiation of hPSCs[355, 427] are shown for DMSO-treated hPSCs compared to untreated control hPSCs as cells progress through the cell cycle (E.5). Human embryonic and pluripotent stem cells are known to have minimal regulatory control across phases of the cell cycle and to be refractory toward growth inhibitory signals. As a result, oscillation of gene expression across phases of the cell cycle is modest in hPSCs[435, 81, 339]. However, activation of checkpoint controls has been shown to be associated with improved cell cycle regulation and differentiation potential. Consistent with this, we observed a correlation between DMSO treatment and increased cell-cycle phase oscillation across all genes. Mean standard deviation across all genes between early G1 and late G1 was 2.05 TPM in control hPSCs and 3.72 TPM for DMSO-treated hPSCs. While the transition between late G1 and SG2M was relatively consistent across the two groups, mean standard deviation across all genes between SG2M and early G1 was 1.34 TPM in control hPSCs and 2.68 TPM for DMSO-treated hPSCs. Interestingly, pluripotency genes (GO Term GO:0019827 Pluripotency Genes; FDR=8.50e-1 by Fisher’s exact test) were not altered, suggesting that the DMSO effect on improved differentiation is not mediated by altering the expression of the pluripotency network (E.6).
Given the convergence towards PI3K, we next investigated whether inhibiting PI3K would mimic the DMSO treatment and increase the multilineage differentiation potential of hPSCs. To suppress PI3K signaling, we treated H9 hPSCs with small molecule PI3 kinase inhibitors (LY294002 and Wortmannin) for 24h and subsequently induced differentiation into the ectodermal, mesodermal, and endodermal germ layers using previously published protocols (5.8A). Following directed differentiation, protein expression of the germ layer specific genes[471] Sox1 (ectoderm), Brachyury (mesoderm), and Sox17 (endoderm) was assessed by immunostaining. Treatment with the PI3K inhibitors increased subsequent differentiation capacity across all germ layers in a dose-dependent manner (5.8B and 5.8C). Similar improvements in differentiation were observed in another hPSC line, HUES6, known to have a very poor propensity for differentiation[340] (E.7). Together, these results show that understanding gene trajectories in the cell cycle of hPSCs can highlight important signaling mechanisms regulating hPSC differentiation.
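One way to read the oscillation metric above is as a per-gene standard deviation across two phases, averaged over genes. The sketch below makes that reading explicit. It assumes one expression value per gene per phase (for example, a mean over replicates) and the sample standard deviation; the text specifies neither choice, and the values are hypothetical.

```python
import statistics

def mean_between_phase_sd(expr_phase_a, expr_phase_b):
    """Average, over genes, of the sample SD of each gene's two phase values."""
    per_gene_sd = [
        statistics.stdev([a, b]) for a, b in zip(expr_phase_a, expr_phase_b)
    ]
    return statistics.mean(per_gene_sd)

# Hypothetical TPM values for three genes in two phases.
early_g1 = [10.0, 5.0, 0.0]
late_g1 = [12.0, 5.0, 4.0]
oscillation = mean_between_phase_sd(early_g1, late_g1)
print(round(oscillation, 4))
```

With two values per gene, the sample SD reduces to |a - b| / sqrt(2), so a larger mean indicates larger expression changes between the two phases.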

Figure 5.8 (previous page): (A) Schematic of H9 hPSCs treated with 2% DMSO or inhibitors of PI3K (LY294002 or Wortmannin) for 24 hours and subsequently directly differentiated into the ectodermal, mesodermal, and endodermal germ layers. Immunostaining for germ layer specific markers following treatment with (B) LY294002 or (C) Wortmannin compared with untreated Control and 2% DMSO-treated hPSCs. (D) Quantitative RT-PCR for lineage-specific genes following directed differentiation of LY294002 (20 µM) or Wortmannin (10 µM) treated hPSCs compared with untreated Control and 2% DMSO-treated hPSCs. Error bars, s.d. of 2–4 biological replicates. Scale bars, 100 µm. * p ≤ 0.05, ** p ≤ 0.01 under one-way ANOVA; Tukey’s test for multiple comparisons.
5.2.5 Discussion
Strikingly, although DMSO is an agent with pleiotropic effects[346, 510], here, we show that a short 24h treatment of hPSCs targets 2,972 genes in an orchestrated manner, particularly those controlling cell division and early developmental pathways. Genes are periodically expressed because there is special need for the gene products at particular points in the cell cycle[106]. Genes associated with cytoskeletal, cilium assembly, and cell adhesion factors were especially subject to regulation by the DMSO treatment in the SG2M phases, characteristic of a time when cells may need to duplicate centrioles in the S phase, change shape during mitosis, or exit the mitotic cycle to differentiate. Many of the targeted pathways play critical roles during embryogenesis, including Wnt, BMP, NODAL, FGF, Hippo, EGF, VEGF, and PDGF as well as the downstream signaling pathways such as MAPK, Trk receptor, and PI3K[44]. Integration of these signaling pathways coordinates a number of developmental processes, including proliferation, fate determination, differentiation, apoptosis, migration, adhesion, and cell shape, to ultimately affect organogenesis. Most of the pathways that were regulated by DMSO converged on PI3 kinase signaling. Concordantly, suppression of PI3K signaling increased differentiation propensity across all germ layers in hPSCs, highlighting the utility of the genome-wide profiling approach used here to dissect out important signaling mechanisms regulating the developmental potential of pluripotent stem cells. This work is consistent with prior studies showing that PI3K-dependent signals promote embryonic stem cell proliferation and supports the notion that each phase of the cell cycle is important in performing distinct roles to orchestrate stem cell fate[81, 169]. 
Many of the signaling pathways and effects on metabolic function and cell adhesion identified here were also reported to play important regulatory roles during early transitions in pig embryonic development in recent work[387], suggesting shared mechanisms across species.
It would be interesting to investigate if improvements in terminal differentiation or enhancements in CRISPR-mediated genome editing of hPSCs following a 24h DMSO treatment[68] may be due to changes in the molecular properties elicited on the pluripotent cell cycle. In conclusion, our data yield novel insights on the transcriptional and signaling dynamics during early transitory states in human pluripotent stem cells that could be a useful point of focus in studying embryonic development. Targeting these early modes of regulation may put hPSCs on a better trajectory for differentiation and ultimately improve their utility for regenerative medicine.
5.2.6 Author Contributions
This work is authored by Anna Shcherbina, Jingling Li, Cyndhavi Narayanan, William Greenleaf, Anshul Kundaje, and Sundari Chetty. A.S.: collection and/or assembly of data, all RNA-sequencing and bioinformatics data analyses and interpretation, experimental design, manuscript writing; J.L.: collection and/or assembly of data, experimental design, data analysis and interpretation of experimental validation studies; C.N.: collection and/or assembly of data; W.G.: experimental design; A.K.: data analysis and interpretation, experimental design, final approval of manuscript; S.C.: conception and experimental design, data analysis and interpretation, financial support, provision of study material, manuscript writing, final approval of manuscript.
5.3 Learning cis-regulatory principles of ADAR-based RNA editing from CRISPR-mediated mutagenesis
5.3.1 Abstract
Adenosine-to-inosine (A-to-I) RNA editing catalyzed by ADAR enzymes occurs in double-stranded RNAs. Despite a compelling need towards predictive understanding of natural and engineered editing events, how the RNA sequence and structure determine the editing efficiency and specificity (i.e., cis-regulation) is poorly understood. We applied a CRISPR/Cas9-mediated saturation mutagenesis approach to generate libraries of mutations near three natural editing substrates at their endogenous genomic loci. We used machine learning to integrate diverse RNA sequence and structure features to model editing levels measured by deep sequencing. We confirmed known features and identified new features important for RNA editing. When trained and tested within the same substrate, XGBoost models explained 68 to 86 percent of isoform-specific variation in editing levels. However, the models did not generalize across substrates, suggesting complex and context-dependent regulation patterns. Our integrative approach can be applied to larger scale experiments towards deciphering the RNA editing code.
5.3.2 Introduction
RNA editing greatly diversifies the transcriptome and proteome in higher eukaryotes[405, 356, 329, 384]. In animals, the predominant type of RNA editing is the hydrolytic deamination of adenosine (A) to form inosine (I), catalyzed by adenosine deaminase acting on RNA (ADAR)[311, 487]. Abnormal A-to-I RNA editing is strongly linked to autoimmune diseases, neurological disorders and cancers[200, 183]. Humans have two catalytically active ADAR proteins, ADAR1 and ADAR2, responsible for the editing of millions of RNA editing sites[385, 366]. Adenosines in perfect or nearly perfect dsRNA duplexes, formed mainly by inverted repeats, are promiscuously edited[47]; in contrast, adenosines in imperfect dsRNA structures can be edited by ADARs with high specificity and efficiency[466]. How RNA editing is regulated to determine its efficiency and specificity is poorly understood. Both the primary sequence and secondary structure (i.e., cis-acting regulatory elements) have been proposed to regulate ADAR editing[487, 139, 365, 256, 486, 515, 402]. A preferred sequence motif has been defined including the 5’ and 3’ nearest neighboring positions (-1 and +1 nt) to the editing site[139, 365, 256]. Editing can be enhanced or suppressed by deviations from perfect base-pairing (i.e., mismatches, bulges and loops), suggesting complex structural contributions to editing specificity14-16. The quantitative trait loci (QTL) mapping approach has been used to identify genetic variants associated with variability in RNA editing in Drosophila and humans, suggesting that many editing QTLs (edQTL) act through changes in the local and distal secondary structure for edited dsRNAs, consistent with the importance of RNA structure[383, 351]. Nevertheless, few general properties have emerged across different substrates, suggesting complex structural contributions to editing specificity[257].
Previous studies have generally been limited to small numbers of natural or engineered variants, thus lacking the systematic sequence and structure variation required for development of predictive models of editing, and a high-throughput, systematic mutagenesis approach is therefore called for. Here we combined CRISPR/Cas9 genome engineering, next-generation sequencing, and machine learning to decipher cis regulatory RNA sequence and structural elements that affect ADAR-mediated RNA editing. As proof-of-concept, we chose three representative RNA editing substrates and introduced hundreds of mutations at the endogenous loci in human cells using the CRISPR-mediated approach. We used supervised machine learning to build predictive models of substrate-specific RNA editing levels based on a variety of cis sequence and structural features. We identified highly edited structures different from wild-type structure (referred to as alternative structures), and general and idiosyncratic features that determine editing efficiency of individual substrates, highlighting the complexity of the cis-regulatory editing code. Our integrative approach, named predicting RNA editing using sequence and structure (PREUSS), lays the foundation for developing predictive models of RNA editing.
5.3.3 Results
CRISPR/Cas9-mediated mutagenesis to interrogate endogenous RNA editing
To interrogate the effects of cis-regulatory elements of RNA editing, we applied the CRISPR/Cas9 technology (5.9a) to introduce mutations at the endogenous loci of three natural ADAR1 substrates (NEIL1, TTYH2, AJUBA) (5.9b, see 5.3.4). The mutations were introduced both in the strand containing the editing site (’editing strand’) and in the complementary sequence involved in forming the secondary structure, which we refer to as editing complementary sequence (ECS). Briefly, we designed CRISPR guide RNAs (gRNAs) targeting the regions of interest, as well as oligonucleotide donors carrying mutations to directly knock-in mutations through the CRISPR/Cas9-mediated homology-directed repair (HDR) pathway[147]. To measure the RNA editing levels of the resulting variants, we performed targeted amplicon deep sequencing. Because the variant and the associated editing site are in the same transcript, there is no need to perform laborious clonal selection for homozygotes of the variants. Because each designed variant has a unique sequence, we successfully performed large scale multiplex mutagenesis and measured the editing levels without the aid of barcodes (see 5.3.4 for details).
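Because each variant is identified by its own sequence, editing levels can in principle be computed by grouping amplicon reads by variant and counting A versus G at the editing site (inosine is read as G after reverse transcription). The sketch below is an illustration under simplifying assumptions: reads are already aligned to a fixed-length window, the editing level is taken as G / (A + G), and the read sequences are hypothetical. It is not the pipeline described in 5.3.4.

```python
from collections import defaultdict

def editing_levels(reads, edit_index):
    """Group reads by variant (editing site masked) and return G / (A + G) per variant."""
    counts = defaultdict(lambda: {"A": 0, "G": 0})
    for read in reads:
        base = read[edit_index]
        if base not in ("A", "G"):
            continue  # skip reads with an unexpected base at the editing site
        variant_key = read[:edit_index] + "N" + read[edit_index + 1:]
        counts[variant_key][base] += 1
    return {key: c["G"] / (c["A"] + c["G"]) for key, c in counts.items()}

# Hypothetical cDNA reads covering a 5-nt window; the editing site is index 2.
reads = ["CTAGC", "CTGGC", "CTGGC", "CAAGC"]
levels = editing_levels(reads, edit_index=2)
print(levels)
```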
In a pilot experiment to introduce one or more mutations near the editing site, we used a single degenerate donor oligonucleotide with mutations at each position to randomize the region from -3 to +3 positions of the editing strand for NEIL1 (5.9c) and a 10 nt region on the ECS of TTYH2 (G.1a). These random mutations provide a rapid means to evaluate the CRISPR/Cas9 knock-in efficiency and the effects of mutations. We observed that 3 or more mutations almost always abolish editing (5.9d-e, G.1b). Therefore, to generate variants that lead to a wide range of editing levels, we next performed targeted mutagenesis using a pool of 200–300 donor oligonucleotides with designed mutations (see 5.3.4), focusing on single and double mutations with larger mutagenesis regions around the editing site in the editing strand and the ECS (G.1b, 5.10a, G.3a-c). For NEIL1 and TTYH2, we designed all possible single mutations both in the editing strand and the ECS (with the exception of the –1 and +1 positions for NEIL1, where A-to-G mutations would be indistinguishable from A-to-I editing). For AJUBA, we only designed mutations in the editing strand. We selectively designed a subset of double-transversion mutations (11% of all possible double mutations) that theoretically disrupt the original base-pairing at the mutation position. For NEIL1, we also introduced compensatory mutation variants which theoretically maintain base-pairing. Additionally, we designed indel variants to study the effects of selected secondary structure features of NEIL1, such as bulges, internal loops, and stem length.
Overall, we achieved high knock-in efficiency as 10-20% of the sequenced RNAs at the target locus carried mutations, similar to previous reports for a similar approach[147]. We were able to reliably detect ≥90% of our designed variants after using stringent quality control filters. The knock-in results and editing measurements were highly reproducible (5.9f, G.1c-f). Interestingly, we discovered that similar knock-in efficiency and editing results were achieved when using ssDNA oligonucleotides or dsDNA (e.g., PCR product) as the donor for CRISPR-mediated HDR (G.1g-j). Using dsDNA PCR products greatly simplified the procedures and reduced the cost of the experiments. The coverage of RT-PCR product for each variant was generally well correlated with the corresponding coverage of the product amplified from gDNA (R² = 0.87 for NEIL1 and R² = 0.25 for TTYH2) (G.2a,d), suggesting that the RNA abundance is generally not affected by the introduced variants. There is no correlation between the RNA or gDNA coverage and the editing level for all three substrates, which is consistent with previous reports[160, 415] and argues against a potential influence of the substrate expression level on the editing level (G.2b-c,e-g).
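The single-mutation design rule described above can be sketched as an enumeration over positions and alternative bases. The sequence below is a made-up placeholder rather than the NEIL1 substrate, and leaving the edited adenosine itself unmutated is an assumption of this sketch.

```python
def design_single_mutants(sequence, edit_index):
    """Enumerate single substitutions as (offset, ref, alt, mutant_sequence).

    Skips the editing site itself (assumption) and A-to-G changes at the
    -1 and +1 positions, which the text describes as indistinguishable
    from A-to-I editing.
    """
    mutants = []
    for i, ref in enumerate(sequence):
        if i == edit_index:
            continue
        offset = i - edit_index
        for alt in "ACGU":
            if alt == ref:
                continue
            if abs(offset) == 1 and ref == "A" and alt == "G":
                continue
            mutants.append((offset, ref, alt, sequence[:i] + alt + sequence[i + 1:]))
    return mutants

# Placeholder RNA window with the edited adenosine at index 2 (hypothetical).
mutants = design_single_mutants("UAAAG", edit_index=2)
print(len(mutants))
```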

Figure 5.9 (previous page): a) Overview of the experimental methods and computational pipeline. CRISPR/Cas9 mediated homology-directed repair is applied to mutagenesis of endogenous RNA in HEK293T cells. A supervised machine learning method (a gradient boosted tree, XGBoost) was applied to develop quantitative models that predict how cis elements such as RNA sequence and secondary structure determine RNA editing level. b) Sequence and secondary structure of the three RNAs, NEIL1, TTYH2 and AJUBA, for targeted mutagenesis. The residues subjected to mutations are highlighted in red and the specific editing site is in blue. For AJUBA, partial sequences from the genomic sequences are taken to focus on the region of interest. Therefore, the G59 and U60 shown in (b) are 524 nt apart in the genomic region. c) Degenerate donor oligos are designed for the -3 to +3 nt region around the specific editing site in the NEIL1 substrate. The mutagenized region is highlighted in red and the editing site in blue. d) The distribution of editing level by the number of mutations from the results of the degenerate NEIL1 library from (c). e) Examples of how the number of mutations affects the RNA secondary structure of NEIL1. f) Reproducible editing measurement of the two replicates of the targeted mutagenesis library of NEIL1.
Intertwined effects of primary sequence and secondary structure on editing levels
We compared the effects of single and double mutations in terms of the type (transition/transversion) and location of the mutation across all three RNA substrates (5.10,G.3). We used the computationally predicted minimum free energy (MFE) secondary structure of each RNA variant to dissect the associations between mutations and structure. The mutational effects are summarized for each target below.
For NEIL1, most single mutations led to minor decreases in editing (-1 ≤ z-score ≤ 0), with the largest effects observed at positions +1 and +2 relative to the editing site (5.10c, 5.11b). Exceptions from this pattern were the large effects (z-score ≤ -1) of G mutations downstream from the editing site in the ECS strand. This observation may suggest formation of alternative structures in these G-mutants (e.g., G.4d). Some RNA variants have the same predicted RNA structure but different editing level, most simply suggesting a primary sequence effect (G.4e). To decouple sequence and structural effects at each position, we considered six categories of single mutations. The simple ’transition’ (i.e., purine to purine, pyrimidine to pyrimidine) and ’transversion’ (pyrimidine to purine and vice versa) categories indicate that the RNA structure was unchanged compared to the WT. ’Transition + break’ and ’transversion + break’ categories indicate that the mutation disrupted the base-pair at the mutation site, and the ’transition+shift’ and ’transversion+shift’ categories include all other scenarios, such as the formation of a new base pair or disruption of more than one base pair (5.11a). We observed moderate effects (z-score ≤ -1) for the transversion mutations that also disrupted base-pairing (transversion+break) or caused other structural changes (transversion+shift) at positions in close vicinity to the editing site (-5, -1, +1 to +3, +9), and at the 3’ side of the editing site (*+1 to *+8). Double-transversion mutations in the editing strand of NEIL1 had overall pronounced effects on editing (5.10d), and the strongest effects (z-score ≤- 2) were observed when at least one of the mutations was in close proximity to the editing site (positions –6 and +2). The effect was generally smaller (z-score ≥ -2) if one of the mutations was in a non-base-paired region (+4, +5, +11) (5.10d).
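The six single-mutation categories described above can be sketched in code. This is a minimal illustration, not the pipeline used in this work: it assumes that the WT and variant MFE structures are available as equal-length dot-bracket strings, and the function names (`pair_table`, `classify_single_mutation`) and label strings are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}


def pair_table(dot_bracket):
    """Map every paired index to its partner for a dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    return pairs


def pair_set(dot_bracket):
    """Set of base pairs, each stored as an unordered frozenset {i, j}."""
    return {frozenset(p) for p in pair_table(dot_bracket).items()}


def classify_single_mutation(wt_nt, mut_nt, pos, wt_struct, mut_struct):
    """Assign one of the six categories to a single mutation at index `pos`."""
    same_class = {wt_nt, mut_nt} <= PURINES or {wt_nt, mut_nt} <= PYRIMIDINES
    kind = "transition" if same_class else "transversion"
    if wt_struct == mut_struct:
        return kind  # structure unchanged relative to WT
    wt_pairs = pair_table(wt_struct)
    lost = pair_set(wt_struct) - pair_set(mut_struct)
    gained = pair_set(mut_struct) - pair_set(wt_struct)
    # "break": only the base pair at the mutation site is disrupted
    site_pair_only = (
        pos in wt_pairs
        and lost == {frozenset((pos, wt_pairs[pos]))}
        and not gained
    )
    return kind + ("+break" if site_pair_only else "+shift")
```

For example, `classify_single_mutation("G", "C", 1, "((..))", "(....)")` is a transversion that removes only the pair at the mutated position, so it falls in the transversion+break category.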
In contrast to NEIL1, where the vast majority of mutations decreased editing, a large proportion of TTYH2 single mutations increased the editing efficiency (z-score ≥ 0) (G.3,G.4). This difference may be explained by the lower WT editing level for TTYH2 (30%) than for NEIL1 (66%). Similar to NEIL1, single mutations closer to the editing site (-2 to +1, *-3 to *+4) tended to have larger negative effects on editing levels (z-score ≤ -1) (G.4a). Interestingly, several single mutations located both upstream (-6 to -3) and downstream (+6 to +8, *+6) of the editing site increased editing levels (z-score ≥ 0). For TTYH2 double mutations, the effects were most negative when at least one mutation was located around the editing site (–2 to +6 in the editing strand and *-3 to *+5 in the ECS) (G.3a-b).
We only examined the mutations on the editing strand of AJUBA. In contrast to NEIL1 and TTYH2, many single mutations of AJUBA were sufficient to disrupt editing. Also, all double mutations abolished editing regardless of positions (G.3c). Most of the transition+break, transition+shift and the transversion+shift single mutations had large effects (z-score ≤ -2) (G.3b). Many AJUBA single mutations have much larger effects (z-score ≤ -3) than single mutations in NEIL1 and TTYH2. This difference may be explained by the long (524 nt) distance between AJUBA editing strand and ECS, such that mutations may lower the probability of forming this long-range structure relative to alternative proximal structures; alternatively, and less likely, primary sequence might have a larger influence on editing of the AJUBA RNA.
Taken together, these results are consistent with previous observations of intertwined sequence and structure effects on editing. These effects also vary among the three different RNA substrates, suggesting substrate-specific cis-regulation rules.
RNA structural features affect editing levels
Next, we systematically explored the effects of changes to the RNA secondary structure on editing levels. We found that compensatory double mutations in NEIL1 that did not affect secondary structure resulted in only minor reduction of editing levels (5.12a-c). To investigate how a specific structural change affects editing efficiency, we designed several indels that change the predicted secondary structure of NEIL1 (5.12d). Shortening the 5’ stem or breaking base-pairs within this stem abolished editing, suggesting the importance of this region for editing; increasing the stem length by 2-bp did not increase editing efficiency (5.12d, G.4f). The 3’ base-pairing is also critical because breaking it led to nearly complete disruption of editing (G.4f). When we replaced the downstream 3’ internal loop with either a canonical base-pair or wobble base-pair, the editing efficiency decreased by 50% (z-score = -1), suggesting the importance of this structural feature (5.12d, G.4f). Enlarging the loop with additional nucleotides resulted in mild (-1 ≤ z-score ≤0) reductions in editing levels (5.12d). As expected, editing site structures containing an A:C mismatch (1:1 internal loop) exhibited higher

Figure 5.10: a) Number of the types of mutations made in each targeted-mutagenesis library. b) Distributions of editing levels for each targeted-mutagenesis library, colored by editing level quantile in each RNA library. Heatmap of editing levels from c) single- and d) double-mutations in the editing strand of NEIL1. e) Heatmap of editing levels from single-mutation in the editing complementary sequence (ECS) of NEIL1. Editing level of WT NEIL1 is 0.64. The z-score is calculated as described in Methods and the WT editing level z-score is 0. c-e share the same heatmap scale shown in e).

Figure 5.11: a) The single mutations were grouped into six types: a sequence change only (transition and transversion), a sequence change that broke the base pair at the mutation site (break), or a sequence change that broke more than one base pair or formed new base pair(s) (shift). b) Position-specific effects of NEIL1 single mutations. The cartoon illustrates the secondary structure of WT NEIL1. The z-score is calculated for the NEIL1 RNA library as described in Methods and the WT editing level z-score is 0.
editing levels on average than when the editing site resided in a larger loop (P ≤ 0.0001 by Wilcoxon test, G.4g ). However, several editing site structures harboring non-A:C mismatches also showed strong editing levels for NEIL1 and TTYH2 (G.3d,e), indicating that additional factors affect editing efficiency.

Figure 5.12: a) Compensatory mutations generally maintain a high editing level. b-c) Comparing the editing level b) and similarity score c) (normalized score calculating the similarity of the MFE structure of each variant to the WT) by different mutation types. d) Alterations in the 5’ stem and 3’ non-stem structure elements affect editing level.
We reasoned that structural and thermodynamic features affecting RNA stability could also affect editing efficiency[515]. We observed significantly greater predicted structural stability in highly edited (highest 25th percentile of editing level in each RNA library) compared to lowly edited (lowest 25th percentile) NEIL1 (P ≤ 0.0001) and AJUBA (P ≤ 0.001) variants, based on both the minimum free energy (MFE) structure (5.13a) and the predicted structural ensemble (5.13b). We also observed significantly higher MFE frequency for NEIL1 (P ≤ 0.01) and TTYH2 (P ≤ 0.0001) (5.13c) and lower ensemble diversity for NEIL1 (P ≤ 0.0001) (5.13d) in the highest-edited quartile. The same observation held when stability was approximated by the number of base-pairs formed (5.13e).
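The quartile comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: each variant is represented as an (editing level, feature value) pair, the quartile cut-offs use the inclusive method of the Python standard library, and only the rank-sum (Mann-Whitney U) statistic is computed; a P value would come from a statistics package. The function names are hypothetical.

```python
from statistics import quantiles


def split_quartiles(variants):
    """Return feature values of the lowest and highest editing-level quartiles.

    variants: list of (editing_level, feature_value) tuples for one RNA library.
    """
    levels = [el for el, _ in variants]
    q1, _, q3 = quantiles(levels, n=4, method="inclusive")
    low = [f for el, f in variants if el <= q1]
    high = [f for el, f in variants if el >= q3]
    return low, high


def mann_whitney_u(a, b):
    """U statistic of sample a versus b: count of pairs with a > b (ties count 0.5)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
```

With MFE as the feature, a U statistic close to `len(low) * len(high)` for `mann_whitney_u(low, high)` indicates that lowly edited variants tend to have higher (less stable) MFE values than highly edited ones.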
We hypothesized that RNA variants that are structurally more similar to the WT would result in editing levels similar to WT. We quantified structural similarity using two measures: a similarity score that indicates the degree to which the MFE structure of a variant is similar to the MFE structure of WT (scale is 0 to 1, from least similar to identical structure) and the probability of active conformation, which indicates the probability of forming wild-type-like secondary structure in the predicted structural ensemble for each RNA variant. We found significant differences (P ≤ 0.0001) in similarity score between highly and lowly edited variants for NEIL1 and AJUBA (5.13g).
A higher probability of active conformation was observed in the highly edited variants compared to lowly edited variants for all three substrates (P ≤ 0.0001, 5.13f).
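To make the idea of a 0-to-1 structural similarity score concrete, the sketch below compares two dot-bracket structures position by position. The exact definition of the similarity score used in this work is not given in this section, so this is only one plausible stand-in: it assumes equal-length structures (so it does not cover indel variants) and scores the fraction of positions whose pairing state and partner are identical. The function names are hypothetical.

```python
def partners(dot_bracket):
    """List where entry i is the pairing partner of position i, or None if unpaired."""
    stack = []
    partner = [None] * len(dot_bracket)
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def similarity_score(wt_struct, var_struct):
    """Fraction of positions with an identical pairing partner (1.0 = identical structure)."""
    if len(wt_struct) != len(var_struct):
        raise ValueError("this sketch only compares equal-length structures")
    p_wt, p_var = partners(wt_struct), partners(var_struct)
    same = sum(1 for a, b in zip(p_wt, p_var) if a == b)
    return same / len(wt_struct)
```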
However, when we considered all variants across the entire editing spectrum, instead of highly vs. lowly edited variants, no significant correlations were observed between individual features and RNA editing levels (G.5,G.6). These results show that individual sequence, structure and stability features of variants can explain the differences between highest and lowest edited substrates but only have limited predictive association with quantitative editing levels. Therefore, we decided to carry out an integrative analysis of RNA sequence and structure features to model quantitative editing levels.
RNA clustering reveals alternative structures that support efficient editing
Given that no single property of the RNA substrates correlated strongly and consistently with editing efficiency, we used machine learning to dissect the collective effects of different features on editing. First, we performed a hierarchical clustering analysis based on variant sequence and structure. We clustered the NEIL1 and TTYH2 libraries because the editing levels are widely distributed compared to the AJUBA results. We used the locARNA pipeline[496, 495] which takes into account both the sequence and the MFE structure (5.14a). Because the sequence variation is relatively small, the similarities and differences among the MFE structures were weighted highly for the resulting hierarchical clustering. The resulting clusters of RNA variants generally share a similar core structure and show similar editing levels (5.14b, G.7,G.8,G.9). Interestingly, we found clusters of RNA with predicted structures distinct from WT that are edited with near-WT efficiencies, both for NEIL1 (e.g., clusters (4) and (8) in 5.14), and for TTYH2 (e.g., clusters (3), (2), (7) and (8) in G.8,G.8). As an example, positioning of the NEIL1 editing site in an asymmetric 2:3 internal loop (cluster 8 in 5.14b) instead of the 1:1 A:C loop seen in WT (cluster 1 in 5.14b) appears to maintain, and even enhance, the editing efficiency. In contrast, a 1:2 internal loop in cluster (7) (5.14b) with similar downstream structures as cluster (8) is mostly poorly edited. Cross-cluster comparisons further illuminate the

Figure 5.13: Comparing the difference of a) Minimum Free Energy (MFE), b) Ensemble Free Energy, c) MFE frequency, d) Ensemble Diversity, e) All Stem Length, f) Probability of Active Conformation and g) Similarity Score, in the highly edited (75 to 100 percentile in editing level in the library) with the lowly edited (0 to 25 percentile) variants in each RNA library. ns, non-significant; **, P ≤ 0.01; ***, P ≤ 0.001; ****, P ≤ 0.0001; by Wilcoxon test.
contributions of certain structural features to editing. For example, comparing NEIL1 cluster (5) with (1) (5.14) suggests a negative effect of a bulge in the 5’ stem. While NEIL1 prefers a good 5’ stem structure, the TTYH2 can tolerate symmetric internal loops in the 5’ stem (G.8,G.9).
Machine learning models accurately predict substrate-specific RNA editing levels from sequence and structure features
To quantitatively capture the complex relationship between editing levels and multi-dimensional RNA sequence and structure features, we turned to machine learning models. A set of 125 features was derived to annotate the RNA variants (see 5.3.4 and G.2,G.3,G.4 for feature annotations for all variants of NEIL1, TTYH2, and AJUBA respectively). The sequence features summarized various properties of the primary RNA sequence of each variant at and around the editing site where the mutations were made. We used the bpRNA tool to assign all residues in each variant to diverse structural elements such as hairpin loops, bulges, internal loops, stems, multi-loops and closing pairs (5.15a). We chose to featurize the bpRNA structural annotations at the editing site and adjacent regions (up to 3 bpRNA structural elements upstream and downstream from the editing site structure element, detailed in G.1) (5.15b), as these regions within the RNA substrate fully encompass the interaction site with the ADAR deaminase domain (5.15c)[300, 465]. The 125 features were further grouped into nine major categories for purposes of feature interpretation (G.1). Gradient boosted trees (GBTs) were trained via the XGBoost algorithm[100]. We trained and tuned GBTs on distinct subsets of RNA variants to map their feature annotations to corresponding real-valued editing levels or binarized labels obtained by thresholding editing levels into two classes (edited vs. not-edited).
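As a rough illustration of the sequence-level part of this featurization, the sketch below derives a handful of the features named in this chapter from a WT sequence and a variant sequence. It is not the 125-feature annotation used for the models: it assumes equal-length sequences (substitutions only), the underscore-separated feature names are an assumed spelling, and the threshold in `binarize` is a hypothetical placeholder because the cut-off used for the edited vs. not-edited labels is not stated here.

```python
def featurize(wt_seq, var_seq, edit_idx):
    """Derive a few sequence features for one substitution variant.

    edit_idx is the 0-based index of the edited adenosine.
    """
    mut_positions = [i for i, (a, b) in enumerate(zip(wt_seq, var_seq)) if a != b]
    return {
        "num_mutations": len(mut_positions),
        # 1-based positions counted from the 5' end of the RNA
        "mut_pos": [p + 1 for p in mut_positions],
        # 5' and 3' nearest neighbors of the editing site in the variant
        "site_prev_nt": var_seq[edit_idx - 1],
        "site_next_nt": var_seq[edit_idx + 1],
    }


def binarize(editing_level, threshold=0.0):
    """Label a variant as edited (1) or not edited (0); the threshold is an assumption."""
    return int(editing_level > threshold)
```

Structural features such as the bpRNA element containing the editing site would be added to the same dictionary before passing the table to a gradient boosted tree learner.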
First, we evaluated the prediction performance of our model for each substrate. We trained and tuned models on a subset of variants and then tested model performance on a held-out test set of variants of the same substrate. For NEIL1, the models accounted for 85.6% of the variance (R2) in ADAR editing levels for variants in the held-out test set, with a Spearman correlation (Rs) of 0.92 between observed and predicted editing levels. Binary editing status was also predicted accurately (Area under precision-recall curve, auPR = 0.97). Similarly, high test set predictive performance was obtained for TTYH2 variants (R2 = 0.68, Rs = 0.91, auPR = 0.81) and AJUBA variants (R2 = 0.79, Rs = 0.90, auPR = 0.93). Augmenting the training set for each substrate with variants from the other substrates did not result in any significant improvements in model performance (G.2, G.10). These results indicate that it is possible to predict RNA editing levels of new mutations in a substrate with high accuracy from sequence and structure features using integrative machine learning models trained on a subset of mutations from the same substrate (5.15d).
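The two regression metrics reported above can be computed as in the following sketch. It assumes that R2 is the coefficient of determination (one minus the ratio of residual to total sum of squares) and that Rs is the Spearman rank correlation with average ranks for ties; the function names are illustrative rather than taken from the analysis code.

```python
def r_squared(observed, predicted):
    """Coefficient of determination; can be negative for poor predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Because Spearman correlation depends only on rank order while R2 penalizes errors in scale and offset, a model can rank variants reasonably well and still explain little variance.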

Figure 5.14: a) NEIL1 variants are clustered by RNAclust from the sequence-structure alignment generated by RNAclust. b) Consensus secondary structure of selected clusters from a) grouped by editing levels. The gray box (”not base paired”) in the consensus structure for each cluster indicates that at least one variant’s MFE structure within the cluster differs at this position (see examples in G.7).

Figure 5.15 (previous page): a) Structural features annotated by bpRNA and included in featurization of RNA variants. b) High-level feature groups for input to XGBoost analysis. u1= structural element immediately upstream (5’) of editing site; u2= structural element upstream of u1; site=structural element within which the editing site is found; d1= structural element downstream (3’) of site; d2= structural element downstream of d1; d3= structural element downstream of d2. c) Illustration of a putative model for binding of the NEIL1 RNA (cyan) to ADAR1. The ADAR1 deaminase domain (silver) is modeled from ADAR2 by Phyre2. The dsRNA binding domains (pink) are modeled in one possible conformation as described in the Methods. The editing site 1:1 internal loop on NEIL1 is shown in red and the editing A is shown as space filled. The upstream (purple and light purple) and downstream (yellow, orange and light orange) elements immediately adjacent to the editing site are colored as in (b). d) XGBoost editing level predictions for variants of NEIL1 (orange), TTYH2 (purple), and AJUBA (green) within the test split (15% random split of positions). R2 is a measure of the % variance explained. Spearman R indicates correlation between observed and predicted editing values. e) SHAP annotation of feature contributions for the NEIL1 test split variant with the highest observed editing level. Features with positive SHAP scores (drive the prediction above the dataset base value) are indicated in pink; features with negative SHAP values (drive the prediction below the dataset base value) are indicated in blue. Base value refers to the mean predicted editing level across the test split. Output value refers to the XGBoost prediction on this example. The four features with the highest absolute value SHAP scores are shown. f) SHAP annotation of feature contributions for the NEIL1 test split variant with the lowest observed editing level. 
g) SHAP values for the 20 most important features driving XGBoost editing level predictions on the test split for NEIL1, TTYH2, and AJUBA. Each dot indicates a variant in the test split. Features (y-axis) are ranked from top (most significant) to bottom (least significant) by predictive importance.
Next, we tested whether models trained on variants of one or more substrates could predict editing effects of mutations in a different substrate. We observed a significant drop in model performance for cross-substrate prediction of RNA editing (G.11). For example, a model trained on NEIL1 variants yielded lower performance on AJUBA variants (R2 ≤ 0.05, Rs = 0.68, auPR = 0.69) and TTYH2 variants (R2 ≤ 0.05, Rs = 0.46, auPR = 0.59) as compared to a model trained and tested on NEIL1 variants (R2 = 0.88, Rs = 0.93, auPR = 0.97). Similarly, a model trained on all NEIL1 and TTYH2 variants also yielded lower predictive performance when tested on AJUBA variants (R2 ≤ 0.05, Rs = 0.66, auPR = 0.29), albeit higher than the model trained on either substrate independently. The same held true for all models trained on two of the substrates and evaluated on the third - lower performance compared to within-substrate training and evaluation.
The inability of our current models to accurately generalize predictions to new substrates is not entirely surprising considering the diversity of the substrates and the small number (three) of distinct substrates available for model training. It is likely that the challenge of cross-substrate training may be solved by training on a larger number of variants from a database of diverse substrates, and future efforts will focus on this task. However, given the success of our substrate-specific models in predicting editing effects for unseen mutations within each substrate, we decided to interpret these models to investigate the features that may be predictive of RNA editing levels.
Model interpretation provides insights into common and substrate-specific features associated with RNA editing efficiency
For each of the three substrate-specific models, we used the TreeExplainer SHAP (SHapley Additive exPlanations) algorithm to quantify the contributions (or importance) of all features to the RNA editing predictions of each variant in the test sets[283]. The SHAP importance score of a feature with a specific value for a variant of a substrate estimates how much the feature contributes to pushing the model’s output from a baseline editing level to the predicted RNA editing level for the variant. The baseline editing level is defined as the average editing level across all variants in the test set. Examples of how SHAP scores illuminate feature importance are illustrated in 5.15e-f. 5.15e illustrates the SHAP scores for the 5 most important features for the NEIL1 test variant (RNA ID 92, 31UtoG), whose high observed editing level (0.78) agrees with the model prediction (0.78). For the NEIL1 test set, the baseline (mean) predicted editing value is 0.25. We display the contribution of all feature values for this variant in pushing the prediction from the baseline of 0.25 to the predicted output value of 0.78. The feature ”sim nor score” (same as in 5.13g, normalized similarity score comparing MFE structure of variant to WT) is estimated to have the highest importance and increases the prediction of editing level by 0.09 (SHAP value) from the baseline. The contribution of the editing site with an A:C mismatch has a SHAP value of 0.05, and so on. Although a larger number of mutations generally decreases editing level (5.15g), the ”num mutations=3” feature value has a positive SHAP value (red) for this variant, highlighting the ability of the model to pick up different feature combinations. Conversely, 5.15f illustrates how feature values unfavorable to ADAR editing result in a predicted editing level of 0 for another variant (ID 142, 41GtoC,45CtoG) of the NEIL1 substrate relative to the baseline. This variant has two mutations in the substrate (num mutations=2). 
This feature has a SHAP score of -0.05 (blue), indicating that a higher number of mutations in the RNA is unfavorable to editing. There is no A:C mismatch at the editing site, and this feature value has a SHAP score of -0.06. These and other highlighted feature values serve to drive the prediction down from the baseline of 0.25 to 0 (5.15f).
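The additive property behind these plots (baseline plus the sum of per-feature SHAP values equals the model output) can be checked on a toy example. The sketch below computes exact Shapley values by brute-force enumeration of feature orderings; it is not the TreeExplainer algorithm, which obtains the same kind of attribution efficiently for tree ensembles. The toy model, its coefficients, and the use of a single background point as the baseline are all illustrative assumptions.

```python
from itertools import permutations


def shapley_values(model, x, background):
    """Exact Shapley values of one example relative to a single background point."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(background)
        previous = model(current)
        for i in order:
            current[i] = x[i]            # switch feature i to the example's value
            value = model(current)
            phi[i] += value - previous   # marginal contribution in this ordering
            previous = value
    return [p / len(orderings) for p in phi]


def toy_model(features):
    """Hypothetical editing-level model, not a fitted one.

    features = (similarity score, A:C mismatch flag, number of mutations)
    """
    sim, ac_mismatch, num_mutations = features
    return 0.1 + 0.3 * sim + 0.2 * ac_mismatch + 0.2 * sim * ac_mismatch - 0.05 * num_mutations
```

For an example `x = [1.0, 1, 2]` and background `[0.5, 0, 1]`, the three values returned by `shapley_values` sum to `toy_model(x) - toy_model(background)`, which mirrors how the highlighted features move a prediction away from the base value.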
To illustrate the directionality of predictive association of the features with RNA editing levels, we plotted the SHAP scores of the top 20 features for all test-set variants of the three substrates (5.15g). We also summarized the relative importance of features for each of the three substrates by computing the percent contribution from each feature to the mean of absolute SHAP values across all examples in the test sets of each substrate, and highlighted the six new features unique to this study in red (”probability active conf”, ”sim nor score”, ”ensemble diversity”, ”mfe frequency”, ”site 5prm cp internal:C:G”, and ”d2 5prm cp internal:G:C”, 5.16a). The closing pairs for loops and bulges are previously unexplored but highly ranked features for NEIL1 and TTYH2 (5.15g). Closing pair can be a readout for alternative active structure, such as in some highly edited NEIL1 variants the d2 internal loop’s 5’ closing pair is a G:C sequence (”d2 5prm cp internal:G:C”) (G.7) compared to the U:A in WT (5.15b). These results also corroborate with trends observed in the abovementioned clustering analysis (5.14, G.7,G.8,G.8).
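The relative importance summary described above can be sketched as follows, assuming the SHAP values for one substrate's test set are available as a list of rows (one row per variant, one column per feature). The function name is hypothetical.

```python
def percent_contributions(shap_rows, feature_names):
    """Percent share of each feature in the total mean absolute SHAP value."""
    n = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n
        for j in range(len(feature_names))
    ]
    total = sum(mean_abs)
    return {name: 100.0 * m / total for name, m in zip(feature_names, mean_abs)}
```

Taking absolute values before averaging means that a feature which pushes some variants up and others down still registers as important; the sign information is what the per-variant dot plots convey.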
Eight features were illuminated as most important for driving model predictions across substrates. Number of mutations (”num mutations”) was the strongest contributor for AJUBA (47.69%) and NEIL1 (19.18%) and among the top 6 most important features for TTYH2 (4.51%) (5.16a). Increasing number of mutations had a negative influence on editing levels (5.15g). This effect supports the proposal that RNA structure plays a big role in editing activity because, in our library design, the more mutations (single vs double-transversion), the larger the changes that occur in the structure (5.11c). An A:C mismatch at the editing site (”site 1 1:A:C”) had a high relative contribution for NEIL1 (18.03%) and TTYH2 (15.95%) but contributed less to AJUBA editing levels (0.31%), consistent with previous proposals that A:C mismatch facilitates the flip-out of the adenosine for ADAR editing[300, 465] (5.15g, G.4g). The probability of the active conformation (”probability active conf”) accounted for a mean of 8.8% relative contribution across substrates. The structure-similarity score of variants compared to WT feature (”sim nor score”) was positively correlated with editing levels. The lower ”minimum free energy” (MFE) is positively associated with editing for NEIL1 and TTYH2 but not for AJUBA. The higher ”ensemble diversity” is positively associated with editing levels for TTYH2 and AJUBA but bi-directional for NEIL1. Higher MFE frequency is positively associated with editing levels for NEIL1 and TTYH2 but not for AJUBA (5.15g). These features corroborate previous results that the overall structural stability of RNA substrate is positively correlated with the editing activity18 and suggest that RNA conformational diversity plays important and specific roles in different substrates. The seventh ranking feature was the position of the mutation along the RNA molecule (”mut pos”). 
”Mut pos” values are numbered beginning at the 5’ end of the RNA molecule, so higher values indicate positions further from 5’ and closer to 3’. This result indicates that the nucleotides adjacent to the editing site in the structure are the hot spot dictating activity. Though the ”mut pos” feature had a strong impact on editing level, the directionality varied across substrates, reflecting the interplay of the mutation position with other structural features.
In addition to top individual features, a sparse set of features collectively contribute to the accurate predictions made by the models. For the NEIL1 substrate, 90% of the explained variance could be attributed to the 26 top features, compared with the 32 top features for TTYH2 and 23 top features for AJUBA (G.3). To illustrate the contributions of different types of features and to draw biological insights, we looked at feature groups and subgroups. We categorized the group of all structure and sequence features excluding mutation-related features to 4 subgroups (5.14b, full list of feature groups and subgroups in G.1). Overall, the thermodynamics and the editing site structure have the largest contributions, consistent with prior proposals that the overall thermodynamics (RNA stability and conformational diversity), and the structure of the editing site dictate the editing efficiency[139, 515]. Notably, upstream and downstream structure features are also important, such as the downstream features in TTYH2. The -1 and +1 nt sequence motif (the 5’ and 3’ nearest neighbor, termed ”site prev nt” and ”site next nt” in 5.15f and G.1) also contributes to the prediction albeit to a lesser extent.
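Counting how many top-ranked features are needed to reach a given cumulative share can be sketched as below. This assumes that per-feature percent contributions are already available as a dictionary; whether the 90% figure quoted above was derived from SHAP-based shares in exactly this way is an assumption, and the function name is hypothetical.

```python
def n_features_for_share(contributions, target=90.0):
    """Smallest number of top-ranked features whose summed share reaches `target` percent."""
    running = 0.0
    ranked = sorted(contributions.values(), reverse=True)
    for count, value in enumerate(ranked, start=1):
        running += value
        if running >= target:
            return count
    return len(ranked)
```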
This systematic interpretation of our models not only reveals several biologically relevant features that are globally predictive across the three substrates, but also some that are highly predictive for specific substrates. These results showcase the promise of predictive cis-regulatory models of RNA editing but also highlight the need for much larger datasets spanning diverse substrates to learn more generalizable models of RNA editing.

Figure 5.16 (previous page): a) Percent contributions of individual features to model prediction ranked by averaging normalized SHAP values. The new features unique to this work are highlighted in red. Higher ranking with smaller standard errors indicates that these features are commonly among the highest contributors to model prediction in all three RNAs. b) Contributions of different feature groups to the prediction of editing levels for each RNA library. The individual features included in each feature group are listed in Supplementary Table 1.
5.3.4 Methods
Cell culture and transfection
HEK293T cells were cultured in Dulbecco’s modified Eagle medium (Life Technologies) supplemented with 10% FBS (Gibco, Thermo Fisher) and penicillin streptomycin (Life Technologies). Cells were maintained at 70%-90% confluency. One day before transfection, around 700,000 cells were split to 6 wells. The next day, 500 ng of Cas9-sgRNA construct in the px330 backbone (https://www.addgene.org/42230/) was co-transfected with 500 ng of the DNA donor using lipofectamine 2000 (Invitrogen). Cells were maintained at 50%-90% confluency for 5 days.
Design of the CRISPR/KI donor oligos
We selected three natural ADAR1 substrates (NEIL1, TTYH2, AJUBA) (5.9b) for the mutagenesis studies based on the observations that (1) the editing sites for all three substrates are highly edited (30-60%) in HEK293T cells, in which ADAR1 is expressed but ADAR2 is lowly expressed; (2) the editing sites are not edited when ADAR1 activity is abolished (data not shown); and (3) they represent three different types of dsRNA substrates. The NEIL1 editing site is in the coding region. The editing event leads to an amino acid change from Lysine (K) to Arginine (R), which has been shown to increase the enzymatic activity of the NEIL1 glycosylase[100]. The TTYH2 editing site is intronic and the AJUBA editing site is located in its 3’ UTR. The functional impact of these two editing sites is currently unknown.
Two types of CRISPR knockin (KI) donors were designed in this study: the degenerate donor and the fixed donor. For the degenerate donor oligos, a single stranded DNA oligo was synthesized in which degenerate sequences were introduced at the interrogated regions. In the NEIL1 donor, -3 to -1 and +1 to +3 were interrogated and equal molar of 4 nucleotides were introduced at these positions during DNA synthesis. To avoid cutting by the Cas9, a point mutation was also introduced at the PAM sequence. In the TTYH2 donor, a 10 nt region in the editing complementary sequence was studied, and equal molar of C or T was introduced. The PAM sequence was also mutated along with a compensatory mutation to maintain the secondary structure.
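The sequence space covered by the NEIL1 degenerate donor can be enumerated with a short sketch. With six degenerate positions (-3 to -1 and +1 to +3) and four nucleotides at each, the donor pool encodes 4^6 = 4096 sequences, including the WT. The function name and the example sequence in the usage note are placeholders rather than the actual NEIL1 donor, and the PAM mutation is not modeled.

```python
from itertools import product


def degenerate_variants(wt_seq, edit_idx, offsets=(-3, -2, -1, 1, 2, 3), alphabet="ACGT"):
    """Yield every donor sequence with any nucleotide at the degenerate offsets.

    edit_idx is the 0-based index of the editing site, which is left unchanged.
    """
    positions = [edit_idx + o for o in offsets]
    for combo in product(alphabet, repeat=len(positions)):
        seq = list(wt_seq)
        for pos, nt in zip(positions, combo):
            seq[pos] = nt
        yield "".join(seq)
```

For instance, `len(set(degenerate_variants("GGGCCCATTTGGG", 6)))` evaluates to 4096, and every sequence keeps the A at index 6.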
For donors used for targeted mutagenesis, individual DNA sequences were designed to carry desired mutation(s). Briefly, a 15-20nt region around the target editing site and the corresponding region on the opposite strand were subject to mutagenesis. All possible nucleotide at any single position was tested, with exceptions where A-to-G mutation was avoided in the +1 and -1 nt of NEIL1 because it potentially becomes indistinguishable with A-to-I editing in RNA-seq results. Combination mutations at two positions were also designed, in which each of the positions is mutated to the nucleotide in the opposite strand to disrupt the original structure. In addition, individual donors with altered length for interrogation of specific features of the RNA substrate were included. For NEIL1, we were able to use donor oligos to introduce compensatory mutation variants because the ECS and editing site are close in sequence space.
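The single-mutation part of this design can be sketched as follows. The sketch enumerates every alternative nucleotide at each position in a window around the editing site and skips A-to-G changes at the -1 and +1 positions, as described above. Leaving the editing site itself unmutated and using a symmetric window are assumptions, the function name is hypothetical, and double mutations and indel donors are not covered.

```python
def single_mutants(wt_seq, edit_idx, window, alphabet="ACGU"):
    """List (offset, wt_nt, mut_nt, sequence) for all single substitutions in the window.

    edit_idx is the 0-based index of the editing site; offsets are relative to it.
    """
    designs = []
    for pos in range(edit_idx - window, edit_idx + window + 1):
        if pos == edit_idx:
            continue  # assumption: the edited adenosine itself is not mutated
        for nt in alphabet:
            if nt == wt_seq[pos]:
                continue
            # A-to-G next to the site would be indistinguishable from A-to-I editing
            if abs(pos - edit_idx) == 1 and wt_seq[pos] == "A" and nt == "G":
                continue
            mutated = wt_seq[:pos] + nt + wt_seq[pos + 1:]
            designs.append((pos - edit_idx, wt_seq[pos], nt, mutated))
    return designs
```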
Generation of the CRISPR/KI donor pool
For NEIL1 donors, 80mer oligos were purchased from IDT and pooled at equal molar ratio. Oligo pairs NEIL1 leftarm/NEIL1 rightarm, or asymmetrical labeled primer pairs NEIL1 leftarm biotin/NEIL1 rightarm and NEIL1 leftarm/NEIL1 rightarm biotin were used separately to add additional sequences to obtain 200mers in PCR reactions using Phusion polymerase. Around 400ul PCR products were purified using MinElute PCR purification kit (Qiagen) to obtain the dsDNA donor pool which was verified by agarose gel electrophoresis. For single-stranded donors, 100ul MyOne Streptavidin Dynabeads (Thermo Fisher) were added to the purified products that were amplified with asymmetrical biotin label and then the mixtures were denatured at 95 C for 10min and chilled on ice immediately. The unbound single stranded oligos were collected from the supernatant and then purified with column MinElute PCR Purification Kit (Qiagen) to obtain the ssDNA donors. For TTYH2 and AJUBA donors, 100 to 120-mer pooled oligos were purchased from Agilent and amplified using individual primers. Primers donor F and donor R of each target gene were used to specifically amplify the oligo library from the oligo chip. A second PCR using Donor F 70 and Donor R 70 was performed to elongate the homologous arms of each donor. The PCR products were purified using MinElute PCR purification kit (Qiagen) and used as dsDNA donors later. NEIL1 degenerate donor and TTYH2 degenerate donors were synthesized as Ultramer by IDT and used directly in the transfection.
Guide RNA design and cloning
Guide RNA was predicted by the web-based software CRISPR.mit.edu. The higher ranked guide RNA with a PAM sequence close to the interrogation region was selected. For the TTYH2 and AJUBA loci, different sets of guide RNAs were designed for the KI regions in two opposite strands. To construct guide RNA plasmids, two reverse complementary single stranded oligos with overhangs were synthesized by IDT and annealed on a thermocycler (Bio-Rad) before ligation to BbsI-linearized PX330 backbone. The ligation mix was transformed into Stbl3 chemical competent cells (Invitrogen) and single clones were sequence verified by Sanger sequencing.
CRISPR mutagenesis and library construction
We co-transfected 600 ng of single-stranded oligo donor library or 1200 ng of double-stranded oligo donor library, along with 500 ng of guide RNA construct, into 1 million HEK293T cells using Lipofectamine 2000 (Invitrogen). For degenerate donor mediated knock-in, 1 µl of 10 µM degenerate donor was used. 1 µM L755507 (Sigma Aldrich) was added to the media one day after transfection to enhance the HDR efficiency, as reported previously[509]. Two biological replicates were included for each assay. The transfected cells were grown for 5 days before they were seeded onto 10 cm dishes for an additional two days. 10% of the cells were harvested for genomic DNA using Quick-DNA kits (Zymo Research). The remaining cells were used for nuclear extraction using the Nuclear/Cytosolic Fractionation Kit (Cell Biolabs) following the manual. Nuclear RNA was purified from the nuclear extract using the Trizol method. Genomic DNA was removed from the RNA samples using TURBO DNase (Thermo Fisher Scientific). RT was performed using the SuperScript III kit (Thermo Fisher Scientific) and gene-specific primers. All RT products were used in a total of 300 µl (50 µl x 6) of PCR reactions with Phusion polymerase (Thermo Fisher Scientific) and gene-specific primers carrying Fluidigm mmPCR adaptor sequences[61]. The genomic DNA library was amplified using a similar approach, except with a different primer set. All first-round PCR products were size-selected on a 1.5% agarose gel and purified using the Gel Purification Kit (Qiagen). Diluted PCR product (1:50) was used in the second round of PCR to add the Illumina sequencing adapter and individual barcode sequences using Fluidigm universal F/Fluidigm barcode R[514]. The library was size-selected and purified as in the previous step.
Next-generation sequencing and data analysis
All libraries were sequenced on a NextSeq 550. NEIL1 libraries were sequenced for 75 cycles paired-end, and TTYH2 and AJUBA libraries for 150 cycles paired-end. To map the variants of the target gene, a reference genome was first built using the GMAP package, where designed mutations were included as SNPs. Briefly, GSNAP was used to detect variants with mismatches, but not INDELs, inside the interrogated region. The mapped reads were separated into individual variants based on the unique mutations carried in the region (excluding the editing site), and RNA editing was called and measured for each variant, as described in the previous report. The indel variants were mapped individually. The z-score of each variant is calculated for each RNA library as $z_i = (x_i - \bar{x})/s$, where $s$ is the standard deviation of $x$ within the library; EL = editing level; $x_i = EL_i - EL_{WT}$; $\bar{x}$ = mean editing level for a given library.
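The z-score described above can be illustrated with a minimal Python sketch. The function name `editing_z_scores` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev


def editing_z_scores(editing_levels, wt_level):
    """Z-scores of x_i = EL_i - EL_WT within a single RNA library.

    editing_levels: editing level (EL) of each variant in the library.
    wt_level: editing level of the wild type (EL_WT).
    """
    # Shift every variant's editing level by the wild-type level.
    x = [el - wt_level for el in editing_levels]
    x_bar = mean(x)
    # Assumption: sample standard deviation (n - 1 denominator).
    s = stdev(x)
    return [(xi - x_bar) / s for xi in x]


if __name__ == "__main__":
    print(editing_z_scores([0.2, 0.4, 0.6], wt_level=0.4))
```

Because the wild-type level is subtracted from every variant, it shifts all values equally and does not change the resulting z-scores; it is kept here only to mirror the definition in the text.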
Chemical mapping of RNA structure in vitro
We constructed RNA libraries by in vitro transcription with T7 polymerase (Megascript kit, ThermoFisher) for the NEIL1 variants and a portion of the TTYH2 (TTYH2-ECS) variants, in order to probe the RNA structures in vitro and compare them with computationally predicted structures (below). The NEIL1 library (DNA oligos manufactured by IDT) was constructed with a 3’ common primer binding sequence (PBS) and 3’ hairpin barcodes, similar to a previous report[101]. For the TTYH2-ECS library (oligos manufactured by Agilent), we designed a new 5’ PBS and 3’ barcodes. The DMS and ethanol-control experiments were performed according to reported protocols (Cite Rhiju’s PNAS), except that the reverse transcription step was carried out using the TGIRT-III enzyme (InGex), which improved the efficiency of the reverse transcription reaction (50 mM Tris-HCl pH 8, 75 mM KCl, 3 mM MgCl2, 5 mM DTT, 1 mM dNTPs, 100 U TGIRT-III enzyme, 10 U SuperaseIN)[42]. The reverse transcription reaction mix (12 uL) was incubated at room temperature for 5 min and then at 57 °C for 3 hours. The reaction was quenched by adding 5 uL of 0.4 M NaOH at 90 °C for 3 min and then cooled on ice for 3 min with the addition of 5 uL of acid quench mixture (1.43 M NaCl, 0.57 M HCl, 1.29 M NaAcetate pH 5.2). The first-strand cDNA was then purified with RNAclean XP beads and amplified by one round of PCR to add the index and barcodes and construct the library. The resulting library was sequenced on a NextSeq 500 together with pools of diverse sequences to increase read quality (the NEIL1 library was sequenced paired-end with 2x76 cycles and the TTYH2 library with 2x150 cycles).
Reads were first filtered with AfterQC[63] (”-q 30 -f0 -t0”: quality threshold 30, no trimming at either end) and then demultiplexed with cutadapt 1.17[64], which read and trimmed the barcodes in three steps: first, removing the common sequence (-e 0.07, error rate 7%); second, detecting the barcodes (-e 0.15, error rate 15%); third, detecting common sequence+barcode in wildcard mode among the remaining unrecognized reads (-O 36, minimum overlap 36). The resulting reads were processed with ShapeMapper 2[65] (default configuration except for a read depth of 2000 for QC) to detect DMS reactivity, followed by structure inference with Biers in MATLAB[224] (default settings except for max bootstrap=100). DMS reactivity data and experimentally inferred MFE structures were deposited in the RMDB database[507] (see Data Availability below).
Computational RNA secondary structure prediction
The sequences used for WT NEIL1, TTYH2 and AJUBA are shown in 5.9b. For AJUBA, we chose the sequence of the region near the editing site, omitting 524 nt, in lieu of the full length (≥800 bp). We chose this ECS sequence following the previously reported method and preserved the hairpin loop region (5.9b) that exists in the structure when folding the entire AJUBA gene. This AJUBA ECS sequence also matches the duplex region predicted using RNAhybrid[388]. The secondary structures with the minimum free energy (MFE) of the RNA variants for all three RNAs were calculated with Vienna RNAfold[281] 2.4.14 using default parameters except for allowing lone pairs (parameter: -p -d2 ). We used the SimTree[136] method, version 1.2.3, to compare the computationally predicted MFE structures with the experimentally inferred structures (described above) for all NEIL1 variants and some of the TTYH2 variants. The MFE structures for each variant are similar between experimental and computational results, as calculated by the pairwise normalized similarity score (ref) (NEIL1 = 0.96 ± 0.08, TTYH2 ECS library = 0.97 ± 0.04, where 1 means identical structures; G.4c). Therefore, we used the computationally predicted RNA structures for all of the structural feature analyses.
Clustering analysis of RNA sequence and structure
We performed clustering analysis for each RNA library using the LocARNA pipeline (version 2.0.0RC8)[496, 495] and associated tools. First, we generated a multiple alignment with the mlocarna module[496] using both the RNA sequence and the MFE structure as the input. The resulting alignment was then passed to RNAclust[389] (RNAclust.pl, version 1.3, modified to suit the current computing environment) to generate a hierarchical cluster tree file. The resulting clustering and the consensus RNA structure for each cluster were manually examined in SoupViewer[389]. The hierarchical clustering was illustrated (5.14, G.8) using dendrograms generated by the iTOL web server[259, 258].
Calculating the probability of forming wild type secondary structure
The probability of forming wild type-like RNA secondary structure was calculated with Vienna RNAfold[281] version 2.1.9 as
$$P_{wt} = \frac{e^{-E_{wt}/kT}}{Z},$$
where kT = 0.6 kcal/mol at temperature T = 37 degrees C, Z is the unconstrained partition function (calculated with RNAfold -p), and E_wt is the energy of the state with the wild type-like secondary structure. E_wt was calculated using the following constraints in RNAfold, where ’>’ indicates that the given base must be paired with a residue that comes before it (5’) in the sequence and ’.’ indicates no constraint for the given base. Note that penalties are not applied for any additional base pairs that form in the unpaired regions of the reference secondary structures below.
NEIL1:
.........................................>>>>>.>>....>>>...>>>>...>>>...>>>...
TTYH2:
.............................................................>>>>>>>>>>>>>>>>>>>>>>.>>.>>>.>>>>>>>>>>>>>>>>>>>>....
AJUBA:
...........................................................>>>>.>>>>>>>>>.>>.>>>>>>.>>>>>>>>.>>>>>>>>>>>>>>>.
An additional penalty was added if base pairs could not be formed in the core region of the RNA. For NEIL1, the probability was divided by the number of non-canonical base pairs that would be formed in the core of the wild type secondary structure, according to the following secondary structure, to roughly account for the additional energetic penalty that these base pairs should incur.
Such reference information used for each RNA is shown below.
NEIL1:
.......................(((((.(((.(((((......))))).)))..))))).....................
TTYH2:
.....................((((((.((.((((((((((((............................)))))))))))).)).))))))......................
AJUBA:
(((((((((((((((.((((((((.((((((.((.(((((((((.((((..........)))).))))))))).)).)))))).)))))))).))))))))))))))).
These constraints and reference information ensure that the ”active conformation” we are calculating includes a group of conformations that closely resembles the MFE structure of the WT structure but are not limited to the WT MFE structure.
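The probability of the wild type-like conformation defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the code used in the study: the function name and arguments are hypothetical, the two energies are assumed to be read from RNAfold output (the constrained energy E_wt and the unconstrained ensemble free energy G reported by RNAfold -p), and the identity Z = exp(-G/kT) is used to avoid evaluating very large exponentials.

```python
import math

KT = 0.6  # kcal/mol at 37 degrees C, the value stated in the text


def prob_wt_structure(e_wt, g_ensemble, n_noncanonical=0, kt=KT):
    """Probability of the wild type-like structure, exp(-E_wt/kT) / Z.

    e_wt: energy (kcal/mol) of the constrained, wild type-like state.
    g_ensemble: unconstrained ensemble free energy (kcal/mol), so that
        Z = exp(-g_ensemble / kT).
    n_noncanonical: number of non-canonical base pairs in the core; when
        positive, the probability is divided by it (the rough penalty the
        text describes for NEIL1).
    """
    # exp(-E_wt/kT) / exp(-G/kT) == exp(-(E_wt - G)/kT)
    p = math.exp(-(e_wt - g_ensemble) / kt)
    if n_noncanonical > 0:
        p /= n_noncanonical
    return p


if __name__ == "__main__":
    # Hypothetical energies, for illustration only.
    print(prob_wt_structure(e_wt=-29.4, g_ensemble=-30.0))
```

Working with the energy difference rather than with Z directly keeps the calculation numerically stable for long RNAs, where Z itself can exceed floating-point range.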
Modeling the 3D structure of ADAR1 bound to NEIL1
A 3D model of ADAR1 bound to NEIL1 was built through homology modeling and the Rosetta RNP-denovo method[74]. First, a homology model of the human ADAR1 deaminase was built using Phyre2[75]. The conformation of the core RNA residues (nucleotides corresponding to NEIL1 residues 30-39 and 44-53) was taken from the previously solved structure of human ADAR2 bound to double-stranded RNA (PDB ID: 5HP3). The RNA was positioned relative to the protein by aligning the previously solved ADAR2 structure (in complex with dsRNA) to the homology model of ADAR1, then copying the RNA coordinates from the ADAR2-dsRNA structure. Protein residues in the ADAR1 homology model that clashed with the RNA were removed (the final residues included in the model were: 823-973, 996-1003, and 1010-1223). This model was used as input to RNP-denovo with the -s option. Helical regions of the NEIL1 RNA were modeled as ideal A-form helices, also included with the -s option. Conformations of protein residues were not optimized (-minimize_protein_sc false and -rnp_high_res_cycles 0). Default settings were used for all other options. The complete RNP-denovo command line used is provided below:
rna_denovo -fasta fasta.txt -secstruct_file secstruct.txt -s \
ADAR1_homology_model_and_core_RNA_from_5hp3.pdb RNA_helix_1.pdb RNA_helix_2.pdb \
RNA_helix_3.pdb RNA_helix_4.pdb RNA_helix_5.pdb -new_fold_tree_initializer true \
-minimize_rna true -minimize_protein_sc false -out:file:silent build_full_wt_neil1.out \
-rna_protein_docking true -rnp_min_first false -rnp_pack_first false -cycles 10000 \
-rnp_high_res_cycles 0 -minimize_rounds 2 -nstruct 2000 -ignore_zero_occupancy false \
-convert_protein_CEN false -FA_low_res_rnp_scoring true -ramp_rnp_vdw true \
-dock_each_chunk_per_chain false -use_legacy_job_distributor true -no_filters
where RNA_helix_1.pdb, RNA_helix_2.pdb, etc. are ideal A-form helices for base-paired regions of NEIL1. Possible placements of the double-stranded RNA binding domains were visualized by aligning the previously solved structure of the ADAR2 dsRNA binding motif bound to dsRNA (PDB ID: 2L2K) to our model of NEIL1 bound to ADAR1.
Machine learning models of RNA editing levels
All feature extraction and model training code is available on GitHub: https://github.com/kundajelab/PREUSS
Feature extraction
RNA structures for NEIL1, AJUBA, and TTYH2 were annotated with the bpRNA algorithm[183]. The bpRNA annotations were in turn used to extract structural and positional features for each variant. A feature matrix was engineered for each substrate, comprising structure-specific features from the bpRNA annotations, sequence-specific features, features that take into account the isoform mutation type and position, and thermodynamic features (the derivation of the features is described in G.1). The matrix included a total of 122 features.
Model training
The XGBoost[100] Python library (v. 0.81) was used to train gradient boosted regression trees to predict ADAR editing levels from the feature matrices described above. Training was performed both within substrates and across substrates. The following approaches were utilized:
• Train on NEIL1, predict on NEIL1.
• Train on TTYH2, predict on TTYH2.
• Train on AJUBA, predict on AJUBA.
• Train on NEIL1 and TTYH2, predict on AJUBA.
• Train on NEIL1 and AJUBA, predict on TTYH2.
• Train on TTYH2 and AJUBA, predict on NEIL1.
• Train on TTYH2, AJUBA, NEIL1. Predict on TTYH2, AJUBA, NEIL1.
• Train on NEIL1, predict on AJUBA and TTYH2.
• Train on TTYH2, predict on NEIL1 and AJUBA.
• Train on AJUBA, predict on NEIL1 and TTYH2.
The dataset was randomly separated into 3 splits: training on 70% of variants, model validation on 15%, and testing on the remaining 15%. To avoid train/test contamination, base pair positions along the RNA molecules were assigned to one of the 3 splits (training, tuning, or test). All features associated with a given base pair position were assigned to the corresponding split. Any feature that was null or non-varying across all variants in a given training split was removed from analysis. Any variant that had more than one mutation was included in the feature matrix twice – the features in each entry for the variant were calculated specifically for one of the mutations. The rationale for this is that different combinations of features may lead to variations in editing level, and a number of features were derived in reference to mutation type and position (G.1).
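The position-based split described above can be sketched as follows. This is a minimal illustration, not the PREUSS implementation: the function names, the `position` key, and the fixed random seed are assumptions. It assigns each base pair position to exactly one split, so that all feature rows tied to a position land together and cannot leak between training and test.

```python
import random

SPLITS = ("train", "validation", "test")


def assign_positions(positions, fractions=(0.70, 0.15, 0.15), seed=0):
    """Randomly assign each unique base pair position to one split."""
    unique = sorted(set(positions))
    random.Random(seed).shuffle(unique)  # seed is a hypothetical choice
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    assignment = {}
    for idx, pos in enumerate(unique):
        if idx < n_train:
            assignment[pos] = SPLITS[0]
        elif idx < n_train + n_val:
            assignment[pos] = SPLITS[1]
        else:
            assignment[pos] = SPLITS[2]
    return assignment


def split_rows(rows, assignment):
    """Route each feature row to the split of its mutation position."""
    out = {name: [] for name in SPLITS}
    for row in rows:
        out[assignment[row["position"]]].append(row)
    return out


if __name__ == "__main__":
    rows = [{"variant": f"v{i}", "position": i % 100} for i in range(250)]
    parts = split_rows(rows, assign_positions(r["position"] for r in rows))
    print({name: len(part) for name, part in parts.items()})
```

Note that splitting by position fixes the fraction of positions in each split; the fraction of variants per split will only approximate 70/15/15 when positions carry different numbers of variants.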
XGBoost was trained for a maximum of 1000 iterations, with early stopping after 10 subsequent rounds with no reduction in root mean square error (RMSE) on the validation split. Default parameters were used.
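The stopping rule described above can be illustrated with a library-independent sketch. It is not XGBoost's internal code: the function name and the list-of-RMSE input are assumptions, and it only shows how a patience of 10 rounds and a cap of 1000 rounds select the retained iteration from a sequence of validation RMSE values.

```python
def best_round_with_early_stopping(val_rmse, patience=10, max_rounds=1000):
    """Return (best_round, best_rmse) under a patience-based stopping rule.

    val_rmse: validation RMSE after each boosting round, in order.
    Training stops once `patience` consecutive rounds pass without a
    reduction in validation RMSE, or after `max_rounds` rounds.
    """
    best_rmse = float("inf")
    best_round = -1
    for i, rmse in enumerate(val_rmse[:max_rounds]):
        if rmse < best_rmse:
            best_rmse, best_round = rmse, i
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_rmse


if __name__ == "__main__":
    curve = [1.0, 0.9, 0.8] + [0.85] * 20
    print(best_round_with_early_stopping(curve))
```

Monitoring the validation split rather than the training split is what lets this rule limit overfitting: training error typically keeps falling after validation error has plateaued.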
In addition to models trained on computed feature values, a separate set of models were also trained across substrates using ratios of numerical feature values relative to the wild type feature values. Ratios of feature values relative to the WT feature values within the same substrate were computed for the following feature set: ’editing value’, ’free energy’, ’sim nor score’, ’probability active conf’, ‘all stem length’,’site length’, ’site length internal es’,’site length internal ecs’, ’u count’, ’u all stem length’, ’u hairpin length’, ’u1 distance’, ’u1 length’, ’u2 distance’,’u2 length’,
’u3 distance’,’u3 length’, ’d count’,’d all stem length’, ’d1 distance’,’d1 length’, ’d2 distance’,’d2 length’, ’d2 length internal ecs’, ’d3 distance’, ’d3 length’. Ratios were also calculated for substrate-specific features such as the length of hairpin vs stem vs bulge for the upstream and downstream structural features.
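The ratio features described above can be sketched as below. The function name is hypothetical, the feature keys are copied as printed in the text, and returning NaN when the wild type value is zero is an assumption, since the text does not say how such cases were handled.

```python
def ratio_to_wild_type(variant_features, wt_features, keys):
    """Ratio of each numerical feature to its wild type value.

    variant_features, wt_features: dicts mapping feature name to value
    for a variant and for the wild type of the same substrate.
    """
    ratios = {}
    for key in keys:
        wt_value = wt_features[key]
        if wt_value == 0:
            # Assumption: an undefined ratio is recorded as NaN.
            ratios[key] = float("nan")
        else:
            ratios[key] = variant_features[key] / wt_value
    return ratios


if __name__ == "__main__":
    wt = {"free energy": -25.0, "u1 length": 0}
    variant = {"free energy": -20.0, "u1 length": 3}
    print(ratio_to_wild_type(variant, wt, ["free energy", "u1 length"]))
```

Expressing features relative to the wild type of the same substrate puts substrates with very different absolute stem lengths or free energies on a comparable scale, which is the motivation for using ratios in cross-substrate training.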
The R2 value was calculated on the test set to determine the percent of total variance explained by the feature matrix. Other metrics to measure model performance included:
• Spearman correlation from the scipy.stats Python library
• Pearson correlation from the scipy.stats Python library
• Mean absolute error (MAE) from sklearn.metrics Python library
• Mean absolute percent error (MAPE)
• Root mean square error (RMSE) from sklearn.metrics Python library
• Area under the precision recall curve (auPRC) from sklearn.metrics Python library
• Area under the receiver operating characteristic (auROC) from sklearn.metrics Python library
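For reference, the error-based metrics in the list above follow from their textbook definitions, sketched here in plain Python. The study itself used scipy.stats and sklearn.metrics; the function below is only an illustration with a hypothetical name, and it expresses MAPE in percent, which is an assumption about the reporting convention.

```python
import math


def regression_metrics(y_true, y_pred):
    """R^2, MAE, RMSE and MAPE for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # MAPE is undefined when any true value is zero.
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


if __name__ == "__main__":
    print(regression_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 3.0]))
```

Because editing levels can be zero for some variants, MAPE needs care in practice; such variants would have to be excluded or handled separately.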
Feature importance analysis
Feature importance analysis was performed to identify the subset of features most informative in predicting ADAR editing levels. The XGBoost ”plot_importance” function was used to calculate the F score for each feature. The TreeSHAP algorithm[32] was applied to interpret feature importance from the XGBoost model. SHAP summary values were computed for each feature as a measure of feature importance using the ”shap_values” function within the ”TreeExplainer” class. Pairwise interaction values from TreeSHAP were also calculated to identify highly correlated feature values.
SHAP values were applied to calculate the combined relative importance of feature subsets.
Feature subsets (G.1) were defined as follows; some features were parts of multiple subsets:
• Structure features: stem length, free energy, probability of active conformation.
• Number of mutations in the variant.
• Mutation-specific sequence features: mutation position, mutation site reference allele, mutation site alternate allele, distance of mutation site from edited base.
• Mutation-specific structure features: bpRNA structure designation for the mutation site, bpRNA structure designation for the adjacent upstream site, bpRNA structure designation for the adjacent downstream site, boolean indication of whether or not the mutation is part of the same structure as the editing site.
• ”Other” mutation-specific features: Type of mutation (indel, SNP), presence/absence of mutation (WT/ mutated) in the variant
• Editing site sequence features
• Editing site structure features
• Characterization of the 3 structural features upstream of the editing site
• Characterization of the 3 structural features downstream of the editing site
For each feature subset, the mean absolute SHAP values across variants were calculated. These were in turn summed across all features in the subset and compared to the total sum of mean absolute SHAP values across all features.
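The subset-level importance described above can be sketched as follows. The function names and the example feature names are hypothetical; the input is assumed to be a matrix of SHAP values with one row per variant and one column per feature.

```python
def mean_abs_shap(shap_matrix, feature_names):
    """Mean absolute SHAP value per feature, averaged over variants."""
    n = len(shap_matrix)
    return {
        name: sum(abs(row[j]) for row in shap_matrix) / n
        for j, name in enumerate(feature_names)
    }


def subset_importance(mean_abs, subset):
    """Share of the total mean(|SHAP|) carried by one feature subset."""
    total = sum(mean_abs.values())
    return sum(mean_abs[name] for name in subset) / total


if __name__ == "__main__":
    # Hypothetical SHAP values for two variants and three features.
    shap_matrix = [[1.0, -2.0, 0.0], [-1.0, 2.0, 1.0]]
    names = ["stem_length", "free_energy", "mutation_position"]
    per_feature = mean_abs_shap(shap_matrix, names)
    print(subset_importance(per_feature, ["stem_length", "free_energy"]))
```

Since some features belong to more than one subset, the subset shares computed this way need not sum to one.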
Overall feature rankings were computed by calculating the mean absolute value of SHAP values for each feature across the test set samples. These mean(|SHAP|) values were summed across all features and the percent contribution to the total was obtained for each feature. These percent contributions for each feature were averaged across substrates to determine features that were ranked as high importance consistently across all substrates.
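The cross-substrate ranking described above can be illustrated with a short sketch. The function names are hypothetical, and restricting the average to features present in every substrate is an assumption; the text does not state how substrate-specific features were treated in this step.

```python
def percent_contribution(mean_abs):
    """Convert mean(|SHAP|) per feature to a percent of the total."""
    total = sum(mean_abs.values())
    return {name: 100.0 * value / total for name, value in mean_abs.items()}


def rank_across_substrates(mean_abs_by_substrate):
    """Average percent contributions over substrates and rank features.

    mean_abs_by_substrate: dict mapping substrate name to a dict of
    mean(|SHAP|) values per feature, computed on that substrate's test set.
    """
    percents = [percent_contribution(m) for m in mean_abs_by_substrate.values()]
    # Assumption: only features shared by all substrates are averaged.
    shared = set(percents[0])
    for p in percents[1:]:
        shared &= set(p)
    averaged = {f: sum(p[f] for p in percents) / len(percents) for f in shared}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical mean(|SHAP|) values for two substrates.
    example = {
        "NEIL1": {"free_energy": 3.0, "stem_length": 1.0},
        "TTYH2": {"free_energy": 1.0, "stem_length": 1.0},
    }
    print(rank_across_substrates(example))
```

Normalizing to percentages before averaging prevents a substrate with larger absolute SHAP values from dominating the combined ranking.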
Data availability
The RNA-seq data are deposited in the following repository: Repository/DataBank Accession: GEO; AccessionID: GSE138860. Databank URL: http://www.ncbi.nlm.nih.gov/geo/. The DMS chemical mapping data for in vitro RNA structure inference are deposited in the RNA Mapping Database (RMDB IDs: NEIL1 DMS 0001 to 0021, TTYH2 DMS 0001). Bioinformatics code for RNA editing calls is available upon request. Code for the PREUSS computational pipeline is available on GitHub: https://github.com/kundajelab/PREUSS.
5.3.5 Discussion
The ultimate goal in understanding the cis-regulation of RNA editing is to develop a model that accurately predicts the ADAR editing efficiency in vivo, namely an ”editing code”. Unlike protein-DNA or protein-ssRNA interactions, where the primary cis-sequence largely dictates the interaction, ADAR substrates are required to bear double-stranded secondary structure. The difficulty of associating RNA sequence and secondary structure features with editing activity is a major challenge in studying the cis-regulation of RNA editing. To tackle this challenge, we integrated high-throughput measurements of ADAR editing with computational analysis of more than 100 RNA sequence and structure features simultaneously. CRISPR/Cas9 engineering allowed us to study the cis-regulation of RNA editing by introducing desired mutations at the endogenous locus with minimal perturbation of the RNA editing process. Our key results can be summarized in four main points. First, we found alternative structures that can be edited as well as or better than the wild-type structure (5.14, G.7, G.8, G.9). Second, our models confirmed all known features identified by previous biochemical and transcriptomic studies (A:C mismatch at the editing site, the 5’ and 3’ nearest neighbors, the length and stability of the substrate including 5’ stem length and 3’ loop structure)[139, 365, 256, 486, 257] and revealed previously unexplored features, including ensemble diversity and closing pairs (5.15, 5.16). Third, our substrate-specific machine learning models integrated diverse sequence and structural features to quantitatively predict editing levels of new variants for a given target (5.15d). Fourth, both general features and substrate-specific features synergistically contribute to editing levels, and the degree of contribution of each feature varies across different RNAs, suggesting complex and context-dependent cis-regulation of the editing landscape (5.15e-g).
A lot of progress has been made in recent years in deciphering the RNA splicing code[394, 502, 40, 205, 105]. However, the ADAR-mediated RNA editing code likely harbors more complex regulation via RNA secondary structure. Our efforts are analogous to the efforts to predict efficiency of CRISPR guides[128, 272].
Our approach opens several new lines of inquiry for further improvement. Measuring the structure of RNA variants in cells[397, 126, 524] at native endogenous loci (vs. relying on predicted structure) would greatly enhance the RNA structure analysis. Additionally, advanced experimental methods such as irCLASH (ref) can be applied together with our gene-specific amplification to validate the ECS sequence. Further, while we focused on cis elements adjacent to the editing site, longer-range interactions such as the editing inducer elements[119] may be important for editing and can be investigated using our approach. Although our data focused on the editing level, which is largely determined by the deaminase domain[486, 498], the dsRBD (illustrated in 5.13c) also contributes to substrate recognition[276, 444, 299]. In the future, high-throughput in vitro RNA binding experiments can be performed and combined with existing SELEX data for dsRBDs[180] to identify features specific to the dsRBDs of human ADARs using our pipeline. To tease apart trans regulation effects by RNA binding proteins (RBPs)[154], in vitro competitive binding experiments and editing measurements in cells can be conducted in a knockout or overexpression background for the RBP of interest. Because we observed no correlation between RNA abundance and editing level (G.2) and our experiment measures pre-mRNAs, our editing analysis is likely not affected by potential effects of sequence variation on RNA processing. Nevertheless, understanding the interplay between RNA editing and other aspects of RNA biology, such as RNA processing pathways and RNA binding proteins, presents an important question for future investigations.
Several systems were recently developed to recruit ADAR enzymes to specific sites for site-directed RNA editing[499, 440, 318, 492, 482, 314, 375], providing novel tools to study biological function and a safer and reversible alternative to gene therapy[499, 484, 483, 98]. Currently, these RNA engineering methods mainly use antisense guide RNAs (gRNAs) that form a perfect duplex with the target region except for an A:C mismatch at the editing site[499, 440, 318, 492, 482, 314, 375]. Our results strongly support additional imperfectly base-paired designs to mimic the highly selective and efficient editing observed in natural ADAR substrates with complex structural features. Such features include the relatively short 5’ (upstream) stem required for ADAR1[17] compared to ADAR2 and specific non-stem 3’ (downstream) structure, where the internal loops at the 3’ side likely contribute to ADAR selectivity[257]. Notably, each of the three RNA substrates we tested has substrate-specific features that dictate the editing efficiency (5.15g). This indicates that a screen of possible gRNA designs would be a valuable and cost-effective strategy to learn the features that lead to the most specific and efficient editing for each target site in transcriptome engineering. In this regard, our experimental methods and computational pipeline are readily applicable.
There are several limitations to the modeling approaches utilized in this study. While our current results give rise to models with substantial predictive power for individual substrates, their generalizability remains low (G.11). This tendency to overfit could be mitigated by expanding the training set to include more RNA substrates, allowing the model to learn the shared properties of RNA substrates. Furthermore, the SHAP interpretation of feature importance in the XGBoost model highlighted the significance of features related to mutation number, structure, sequence, and position. This result suggests that an effective featurization of the data relies upon knowledge of a wild type substrate structure and sequence, which may not be available for all substrates.
Building on our work using the PREUSS pipeline, ADAR editing can be further investigated in larger scale and in different cell types, tissues and disease states to explore the full spectrum of cis regulation. Ultimately, establishing the ”RNA editing code” will help us better understand the underlying rules of RNA editing, and facilitate efficient and precise transcriptome engineering for studying RNA biology and treating human disease.
5.3.6 Author Contributions
This manuscript is authored by Xin Liu, Tao Sun, Anna Shcherbina, Qin Li, Inga Jarmoskaite, Kalli Kappel, Gokul Ramaswami, Rhiju Das, Anshul Kundaje, and Jin Billy Li.
X.L., T.S., G.R., and J.B.L. conceived the work. X.L., A.S., T.S., I.J., A.K., J.B.L., and R.D. co-wrote the manuscript. T.S. designed and carried out the CRISPR/Cas9 and RNA editing measurements. T.S. and Q.L. carried out the editing level analysis. X.L. carried out the RNA clustering analysis, and performed the RNA chemical mapping experiments and data analysis. X.L. and K.K. performed RNA and protein structure analysis. A.S., X.L., and A.K. developed the PREUSS computational pipeline.
5.4 Transient relief from AP-1 epigenetic roadblock augments reprogramming to pluripotency
5.4.1 Abstract
Mechanistic insight into nuclear reprogramming from one cell state to another is of fundamental and clinical importance. Here we capture the dynamic architecture of chromatin accessibility and gene expression during nuclear reprogramming after formation of bi-species heterokaryons. At the onset of reprogramming, we detect a transient, genome-wide increase in accessibility at sites containing the AP-1 transcription factor motif. Inhibition of AP-1 results in an increase in OCT4 expression in heterokaryons. Moreover, in human iPSC reprogramming, dominant negative AP-1 can replace exogenous OCT4 in the reprogramming cocktail. Our findings reveal that AP-1 family member JUN, which is induced at the onset of reprogramming and traditionally thought to be an activator, creates a JUN-MBD3 repressor complex that inhibits nuclear reprogramming to pluripotency through direct targeting of an OCT4 distal regulatory element. These findings reveal a role for Jun as a repressive epigenetic gatekeeper of reprogramming to pluripotency.
5.4.2 Introduction
The discovery of a route to reprogram somatic cells to pluripotent stem cells [457] enabled an explosion of research in areas like disease modeling [467], stem cell therapy [455], and the study of mechanisms of reprogramming to pluripotency [32]. A challenge for understanding the early steps in reprogramming is that a low percentage of starting cells is converted to a fully pluripotent stem cell state upon four-factor transgene expression. Some of the barriers leading to inefficient reprogramming have been identified through genomic profiling of reprogramming cells [250] and gain- and loss- of function studies [265],[273]. These include mechanisms that lock in the chromatin state of the starting cell type, such as DNA methylation and histone deacetylation [197],[315],[84], and the active maintenance of somatic enhancers [109],[226].
However, a major pitfall of genomic profiling during early iPSC reprogramming is that the majority of cells are not undergoing productive reprogramming. In order to reveal early, transient shifts in epigenomic activity in cells destined to activate the pluripotency network and minimize signal from the non-reprogramming population, we used the heterokaryon system of nuclear reprogramming. Previously, we have shown that following fusion of human fibroblasts with mouse embryonic stem cells (ESCs), multinucleate heterokaryons activate human pluripotency genes OCT4 and NANOG at high efficiencies[57]. Globally, the transcriptome and chromatin landscape shift from the fibroblast state towards the ESC[454]. These remarkable observations suggest that the barriers to reprogramming are overcome.
In this study, we profiled chromatin accessibility dynamics over time and resolved 15 patterns of commonly occurring trajectories. Through motif enrichments and chromatin state maps we deduce that the regulatory logic is governed, in part, by combinations of transcription factor motifs. In particular, an early, transient increase in accessibility was strongly associated with the AP-1 motif, and was often found at enhancers of the starting cell type. Loss of function experiments reveal that AP-1 is acting as a barrier to reprogramming in both heterokaryons and iPSCs, and that a dominant negative AP-1 mutant can surmount this barrier and efficiently replace exogenous Oct4 in a Sox2-Klf4 transcription factor cocktail in human fibroblasts. Mechanistically, we demonstrate AP-1 inhibits OCT4 activation by binding to an AP-1 motif at a distal regulatory element, and provide evidence that inhibition of OCT4 occurs through an interaction between Mbd3 and unphosphorylated Jun. Our results provide a fresh view of the roles of Mbd3 and AP-1 as inhibitors of reprogramming to pluripotency and highlight a seminal function for repressor complexes in stabilizing cell phenotypes.
5.4.3 Results
Chromatin accessibility dynamics identify AP-1
In order to identify early regulators of reprogramming, we used the efficient heterokaryon system, where human fibroblasts (hFs) are fused to mouse embryonic stem cells (ESCs)[57]. We assayed chromatin accessibility by ATAC-seq and gene expression by RNA-seq in reprogramming heterokaryons at four time points. We chose 0, 3, 16, and 48 hours post-fusion to represent the starting fibroblast, early, middle, and late stages of heterokaryon reprogramming (5.17A). The 0 hour time point represents the mean of the control samples: fibroblasts alone; fibroblasts after co-culture with ESCs (co-culture control), to account for the response to paracrine signals in our cell fusion model; and fibroblasts fused to fibroblasts 3 hours post-fusion (homokaryon control), to control for the early effects of cell fusion. As a proxy for the end-state, we used ATAC-seq and RNA-seq for human embryonic stem cells (hESC control).
We find that heterokaryon reprogramming is highly dynamic at the level of chromatin accessibility. Around human LIN28A there are examples of fibroblast-specific peaks that lose accessibility and ESC-specific peaks that gain accessibility, as well as transiently upregulated and downregulated peaks that may be acting in coordination to robustly activate LIN28A gene expression by 48 hours (5.17B). To examine the drivers of these types of dynamics at a genome-wide level, we clustered the accessibility patterns of differential peaks mapping to the human genome using the Dirichlet process Gaussian process mixture model (DPGP)[305], and found 15 significant clusters of ATAC-seq peaks (5.17C and H.1A). We found that 12 clusters fell into four patterns representing gains and losses in accessibility as well as transient states that were associated with the early 3 hour time point. To interpret the accessibility dynamics in the context of reprogramming from the starting fibroblast towards the embryonic stem cell state, we mapped the proportion of accessible sites for each cluster that belong to either an enhancer, promoter, or inactive chromatin state of fibroblasts or embryonic stem cells (5.17D and H.1B; chromatin states defined by [142]). We find productive reprogramming exemplified in clusters 1-3, where accessibility decreases consistent with a more inactive chromatin state signature in the end state (ESC), and in clusters 4 and 5, where accessibility increases in line with a higher proportion of enhancer-like states in ESC chromatin. In the transient gain and transient loss states, the dynamics could not be predicted by examining the end states alone.
To identify possible regulators of these dynamics, we looked for transcription factor (TF) motifs from the ENCODE project[221] that were enriched in each cluster relative to all differentially accessible sites. We find that many of the motifs showed strong co-enrichment (H.2), allowing us to simplify our analysis by summarizing distinct motif family representatives with significant enrichment (5.17E and H.1C). Interestingly, we found that all four clusters featuring an early, transient gain in accessibility had a very significant rate of occurrence for the AP-1 motif (Bonferroni corrected p ≤ 1×10^-4). Examining AP-1 sites genome-wide, we find a transient increase in accessibility at peaks containing the AP-1 site even as many had detectable accessibility prior to fusion with mouse ESCs (5.17F,G), suggesting that these sites may be primed to respond.

Figure 5.17 (previous page): a, Experimental setup: GFP+ mouse ESCs are fused to human fibroblasts and sorted at 3, 16, and 48 hours post-fusion using GFP and anti-human CD44. b, Human LIN28A locus is depicted with ATAC-seq and RNA-seq tracks. ATAC-seq for human ESCs is depicted for reference. Colored boxes highlight different peak dynamics observed. c, Differentially accessible ATAC-seq peak clusters are combined into four patterns with 0 hour comprised of the fibroblast, co-culture, and the homokaryon control samples. d, Chromatin states in the starting (fibroblast) and ending (ES) cell types are shown as stacked bar graphs depicting enhancer, promoter, and inactive states for each cluster of genomic regions identified by ATAC-seq (above). e, Heatmap of transcription factor motif enrichment for 12 ATAC-seq clusters is shown. Data represent fold-change relative to all differential peaks. f, Chromatin accessibility is shown at the 20,000 most accessible peaks identified by ATAC-seq that contain the canonical AP-1 motif (TGASTCA). g, Average chromatin accessibility centered at AP-1 sites in ATAC-seq peaks.
AP-1 inhibits reprogramming of OCT4
We were interested in examining the role of factors that may be driving an early, transient response during productive heterokaryon reprogramming that might be missed in a time course of iPSC formation. To that end, we sought to further investigate the role of the canonical AP-1 family, enriched in clusters 9-12 as noted above. Since Jun and Fos have close family members that also bind to the canonical AP-1 motif, we examined the expression of the Jun and Fos family of genes during reprogramming (5.18A). We found that all 7 genes examined peaked in the first two hours of the 24 hour time course. FOS, FOSB, JUN, and JUNB showed the earliest response and peaked at 30 min post-fusion, whereas FOSL1, FOSL2, and JUND peaked at 2 hours. To test the role of the AP-1 family experimentally, we used the model outlined in 5.18B, where we treat the fibroblasts at the onset of overnight co-culture with mouse ESCs and assay heterokaryon reprogramming 48 hours post-fusion.
Since all of the Jun-Fos family had early and transient expression kinetics, we decided to assess the impact of AP-1 activity on reprogramming using a dominant negative inhibitor (dnAP-1) that globally blocks AP-1 function[337]. To have more precise control, we used a dox-on inducible system to express dnAP-1 as a fusion protein with RFP, separated by a cleavable linker so function is minimally impacted. By flow cytometry, we find that RFP signal is apparent in the fibroblast population at 12 hours post doxycycline addition, and increases further at 24 hours (H.3A). We analyzed chromatin accessibility at AP-1 sites by ATAC-seq after induction of the dominant negative mutant in heterokaryons. We observed a marked decline in accessibility at AP-1 sites, but not at peaks without an AP-1 site (H.2B and C). Functionally, we find that dnAP-1 expression downregulates known AP-1 target genes SERP1 and FOSL1 (H.2D and E).
To test the function of AP-1 in nuclear reprogramming, we sorted heterokaryons after induction of dnAP-1 and measured the levels of pluripotency genes. We find a significant increase in human OCT4 activation, which was highest when dnAP-1 was induced throughout the time course (5.18C).
Additionally, we see significantly elevated levels of pluripotency genes LIN28A and NANOG (H.2F and G). Our gene expression time course revealed that human OCT4 activation is already detectable at 16 hours post-fusion (Sup Fig 4A). We assayed the effect of dnAP-1 induction on early OCT4 activation and observed a 2.2-fold increase in OCT4 expression 16 hours post-fusion (p≤0.01) (H.2H).
We asked whether the regulatory elements identified by ATAC-seq near OCT4 play a role in OCT4 activation. We targeted regulatory elements with catalytically inactive Cas9 (dCas9) fused to the KRAB repressor domain. It has been reported that targeting KRAB to distal regulatory elements can prompt formation of heterochromatin-associated repressive marks and a spread of repression to the transcription start site, ultimately resulting in a decline in transcriptional output[217]. We designed guide RNAs targeted to six regions highlighted in H.4A. We found that, with the exception of the AP-1 site near the 3’ UTR, all of the regulatory regions tested showed a ≥50% reduction in OCT4 gene expression after 48 hours of nuclear reprogramming, suggesting these elements are interacting with the OCT4 locus (H.4B).
We wondered whether AP-1 family members could target OCT4 directly. Our strategy was to inhibit transcription factor binding at specific genomic elements by targeting a catalytically inactive Cas9 (dCas9), as precise targeting of dCas9 to a TF binding site has been shown to block the function of a trans-activating TF [373]. We annotated two accessible genomic elements containing AP-1 sites (TGA G/C TCA) near OCT4. The site labeled 5’AP1 was located within 1 kb of the distal enhancer, and 3’AP1 was just downstream of OCT4 (5.18D). Additionally, we examined an SP1 motif at the distal enhancer, because prior studies have demonstrated AP-1 activity through an SP1 site via a physical interaction between JUN and SP1 factors [253],[96]. As a positive control, we targeted the transcription start site (TSS). We find that while the sgRNA complementary to the TSS significantly represses OCT4 gene expression at 48 hours, blocking the 5’AP1 site significantly increases OCT4 expression (5.18E). This finding suggests that AP-1 may act to inhibit OCT4 transcription as a repressor, rather than acting indirectly by activating factors repressive to OCT4 activation.
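As an illustration of how sites matching the canonical AP-1 consensus (TGA G/C TCA) could be located in a sequence, the following is a minimal sketch only. The function name and the toy sequence are hypothetical, and the sketch does not reproduce the motif annotation actually used in this work. Because the reverse complement of TGAGTCA is TGACTCA, scanning a single strand finds sites on both strands.

```python
import re

# Canonical AP-1 consensus: TGA, then G or C, then TCA.
AP1_PATTERN = re.compile(r"TGA[GC]TCA")


def find_ap1_sites(sequence):
    """Return (0-based start, matched 7-mer) for each AP-1 consensus match."""
    return [(m.start(), m.group(0)) for m in AP1_PATTERN.finditer(sequence.upper())]


# Hypothetical toy sequence with two consensus sites and one near-miss (TGATTCA).
seq = "ccTGAGTCAttgTGACTCAaaTGATTCA"
sites = find_ap1_sites(seq)
```

Here `sites` contains the two consensus matches and omits the near-miss, since the central position must be G or C.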
JUN inhibits OCT4 via upstream regulatory element
To dissect the role of individual AP-1 family members, we employed RNAi to target either JUN, JUNB, or JUND in fibroblasts prior to plating in co-culture with ESC. Jun family proteins can form homodimers, whereas Fos family proteins depend on the Jun family for dimerization[178],[137]. We treated fibroblasts with siRNA prior to fusion, and examined the expression of OCT4 in sorted heterokaryons after 48 hours. Knockdown of JUNB or JUND did not significantly change OCT4 expression (5.18F). By contrast, siRNA against JUN resulted in a 1.5-fold increase in OCT4 expression (p≤0.01) (Figure 2F). The knockdown efficiencies for all three Jun family members approximated 50% (H.5A).
The increase in OCT4 after siJun treatment and the dCas9 targeting experiment suggested that JUN may play a role in repression of OCT4 at the 5’AP1 site. To test JUN occupancy at this element, we performed a ChIP-qPCR with or without dnAP-1 induction in human fibroblasts transduced with reprogramming factors. We find significant signal at the JUN promoter, a positive control region (5.18G). Importantly, JUN is bound at the 5’AP1 site upstream of OCT4 (5.18H), but not at the 3’AP1 site (5.18I), consistent with the dCas9 targeting data (5.18E).

Figure 5.18 (previous page): a, Gene expression data by RNA-seq during heterokaryon reprogramming of human Jun and Fos family members. b, Experimental setup for heterokaryon perturbation experiments where cells are treated at the start of the overnight co-culture period, and then sorted 48 hours post-fusion. c, Human OCT4 gene expression following induction of dominant-negative AP-1 pre- and/or post-fusion as shown. Gene expression is normalized to housekeeping genes GAPDH and RPLP0 and shown relative to cells without induction of dominant-negative AP-1. Data shown as mean ± s.e.m. (n=4 biological replicates). P-values calculated using a two-tailed Student’s t-test. d, Diagram showing the sites for dCas9 sgRNA targeting vectors near the human OCT4 locus. DE and PE are distal and proximal enhancers respectively. e, Human OCT4 gene expression following lentiviral introduction of catalytically inactive Cas9 (dCas9) and the listed sgRNA. Gene expression normalized to housekeeping genes GAPDH and RPLP0 and plotted relative to the sample receiving a non-targeting sgRNA. Data shown as mean ± s.e.m. (n=3 biological replicates). P-values calculated using a two-tailed Student’s t-test. f, Human OCT4 gene expression following transfection of siRNA targeting one of the JUN family members or a scrambled control siRNA. Gene expression normalized to housekeeping genes GAPDH and RPLP0. Data shown as mean ± s.e.m. (n=3 biological replicates). P-values calculated using a two-tailed Student’s t-test. g-i, ChIP-qPCR of JUN in human fibroblasts 8 days post-transduction with vectors expressing Sox2 and Klf4. Data were normalized to input and plotted relative to the fibroblasts which had dominant-negative AP-1 induced for 24 hours prior to harvest (negative control for JUN binding). Sites in h and i are shown in d. Data shown as mean ± s.e.m. (n=3 biological replicates). P-values calculated using a two-tailed Student’s t-test.
JUN-MBD3 interact during nuclear reprogramming
We reasoned that JUN-mediated repression might result from an interaction between JUN and MBD3, a member of the NuRD-repressor complex, based on a study demonstrating their interaction [11]. We sought to confirm this interaction during heterokaryon reprogramming using a proximity ligation assay (PLA). We used mouse ESCs containing a FLAG-tag knock-in at the Mbd3 locus. Mbd3-null mouse ESCs of the same line served as a negative control. We find the interaction present in heterokaryons sorted at 2, 16, and 48 hours post-fusion (5.19A). We quantified the percentage of positive cells and find that the presence of the interaction is highly significant at all three time points, but that the number of positive cells is higher at 16 and 48 hours post-fusion (37% and 33% respectively) compared to 2 hours (4.6%) (5.19B).
To confirm the JUN-MBD3 interaction genetically, we overexpressed JUN in the presence of siMbd3 or a scrambled siRNA. We find a significant reduction in OCT4 activation after overexpression of JUN, which is rescued after loss of Mbd3 by RNAi (5.19C). We confirmed knockdown of Mbd3 expression after siMbd3 transfection (H.5B).
Role of JUN phosphorylation
The prior study examining the Jun-Mbd3 interaction found it can be blocked via Jun phosphorylation at the transactivation domain [11]. We measured phospho-JUN (pJUN) levels relative to total JUN by flow cytometry during heterokaryon reprogramming. The ratio of pJUN to JUN, relative to co-culture values, increases transiently at 2 hours post-fusion (p≤0.05) (5.20A and B). The decreased frequency of the JUN-MBD3 interaction at 2 hours compared to 16 hours post-fusion (5.19B) is consistent with an elevated pJUN-to-JUN ratio at 2 hours, and a subsequent decline (5.20B). To test if the inhibitory function of JUN in reprogramming was phosphorylation-dependent, we measured OCT4 in heterokaryons after expression of a constitutive version of the Jun-kinase JNK1 (cJNK1), a catalytic mutant of cJNK1 (cJNK1mut), or a control vector. We find that cJNK1 increases intra-cellular levels of pJUN (H.5C), as well as OCT4 expression (1.9 fold ± 0.3, p≤0.05) (5.20C), suggesting the repressive factor may be the unphosphorylated form of JUN.

Figure 5.19: a,b, Proximity ligation assay between JUN and MBD3 in heterokaryons. Mouse ESCs were used containing a FLAG-tag at the endogenous Mbd3 locus or Mbd3 null ESCs as a negative control. A representative image at each time point is shown in a. Interactions, visible as red dots, were scored by hand and quantified in b. A threshold of three standard deviations away from the mean of the control sample for each time point was used to quantify the percentage of interactions. At least 30 nuclei were used per time point, and a two-tailed Student’s t-test was used to calculate statistical significance. c, Human OCT4 gene expression following co-transfection of siRNA and an expression vector as shown. Gene expression normalized to housekeeping genes GAPDH and RPLP0 and plotted relative to the sample receiving a control vector expressing mCherry and a scrambled siRNA. Error bars show mean ± s.e.m. (n=4 biological replicates). P-values calculated using a two-tailed Student’s t-test.

To confirm whether the phosphorylation sites are critical for the JUN-MBD3 interaction, we expressed HA-tagged Jun mutants with substitutions of the serine and threonine residues in the transactivation domain to either alanine (phospho-null) or aspartic acid (phosphomimic) (5.20D). After transfection of these mutants into human fibroblasts, we performed an immunoprecipitation (IP) for HA, and confirmed high expression and successful IP for JUN by western blot (5.20E). In the same lysates, an interaction with MBD3 is apparent by Co-IP with the alanine mutant, but not with the aspartic acid mutant (5.20F), consistent with previous findings that the negatively-charged phospho-residues on Jun block its interaction with Mbd3 [11].
Role of AP-1 in iPSC reprogramming
We next asked whether our findings in heterokaryons could be used to enhance reprogramming of fibroblasts to induced pluripotent stem cells (iPSCs). To this end, we employed the inducible dominant-negative AP-1 (dnAP-1) construct described earlier. We transduced mouse embryonic fibroblasts (MEFs) with the four Yamanaka reprogramming factors in a polycistronic cassette [457]. MEFs received either no doxycycline (dox) as a control, or were exposed to dox added on days 1, 3, and 5 post-transduction (5.21A). There was an approximately 5-fold increase in NANOG+ iPSC colony formation after dox addition (5.21B). We noticed fewer cells in the doxycycline condition early on in reprogramming, consistent with reports that cJun is a mitogen, and that Jun -/- MEFs proliferate slowly and undergo premature senescence[290]. Representative images of iPSC colonies positive for OCT4 and NANOG are shown in 5.21C.
Since human OCT4 activation is greatly increased upon inhibition of AP-1 activity during heterokaryon reprogramming of human fibroblasts, we reasoned that AP-1 inhibition might be able to replace exogenous OCT4 in human iPSC reprogramming. In a previous study on the role of cJun in reprogramming, a truncated form of cJun was found to be capable of replacing Oct4 during iPSC induction in mouse embryonic fibroblasts, albeit at low efficiency[273]. We designed an OCT4 replacement experiment in which each reprogramming factor was expressed by a single retrovirus. MYC was omitted because it is not necessary for reprogramming [491], and can lead to partially reprogrammed colonies [99]. Fibroblasts were given two days to recover after transduction before inducing the dnAP-1 with dox (5.21D). We added dox every day for days 2-5, mimicking the early, transient nature of AP-1 activity in heterokaryons. Not only was inhibition of endogenous AP-1 sufficient to replace exogenous OCT4, but the average number of iPSC colonies was also higher (5.21E). A representative image of an iPSC colony reprogrammed with SOX2, KLF4, and dnAP-1 is shown in 5.21F.

Figure 5.20: a, Phospho-flow with anti-phospho JUN (pJUN) in heterokaryons at 2 hours post-fusion vs co-culture control. b, Quantified phospho-flow data for anti-pJUN vs total JUN. The mean fluorescence intensity (MFI) of unfused fibroblasts was subtracted from that of heterokaryons for each sample to control for sample-to-sample variation, and then the pJUN signal was divided by the total JUN normalized MFI. Error bars show mean ± s.e.m. (n=3 biological replicates). P-values calculated using a two-tailed Student’s t-test. c, Human OCT4 gene expression following transfection of an expression vector containing mCherry (Ctrl), catalytically inactive JNK fused to Mkk7 (cJNK1mut), or JNK fused to Mkk7 (cJNK1). Gene expression normalized to housekeeping genes GAPDH and RPLP0 and plotted relative to the sample transfected with control vector. Error bars show mean ± s.e.m. (n=4 biological replicates). P-values calculated using a two-tailed Student’s t-test. d, Diagram showing human JUN constructs used in e and f, and the relative location of mutated serine/threonine residues. e,f, Western blot showing lysates from samples after transfection with an expression vector carrying one of two HA-tagged Jun mutants. Samples are either 4% input or anti-HA IP blotted with anti-JUN in e and anti-MBD3 in f.
5.4.4 Discussion
We characterized the dynamics of genomic regulatory elements during productive heterokaryon reprogramming. Our data reveal a transient increase in accessibility at sites with the AP-1 motif at hundreds of fibroblast-associated distal enhancers. AP-1, short for activating protein 1, is known as a potent transcriptional activator [26],[417]. Our findings suggest that AP-1 acts as a repressor at an AP-1 binding site near the OCT4 distal enhancer via the interaction between unphosphorylated Jun and Mbd3. The early transient nature of both accessibility at AP-1 sites and levels of phospho-Jun suggests that AP-1 activity may toggle from a transactivator to an Mbd3-associated repressor during nuclear reprogramming. Consistent with this idea, maintaining high levels of phospho-Jun increased OCT4 activation during reprogramming (5.20C).
A number of studies have investigated the role of transcription factor AP-1 family members as barriers to somatic cell reprogramming to pluripotency. Overexpression of Jun and Fosl1, members of the AP-1 family that are constitutively expressed in fibroblasts and required to perpetuate that starting cell’s phenotype, results in a drastic loss of iPSC colony formation[273],[109],[264]. In addition to preventing the silencing of the somatic state, Jun is thought to activate genes involved in the epithelial-to-mesenchymal transition (EMT) [273], thus directly opposing the mesenchymal-to-epithelial transition (MET) required during reprogramming to pluripotency [268]. Both during ESC maintenance and iPSC reprogramming, Jun induction results in a loss of pluripotency gene expression, particularly Oct4, but the specific mechanism of silencing had not yet been identified.
Silencing of the somatic program has been described as a critical component of reprogramming [109],[264]. The mechanism is unclear, because a large fraction of somatic regulatory elements are not targets of reprogramming transcription factors. Moreover, reprogramming factors are typically assigned a trans-activating role during the reprogramming process, making it difficult to reconcile how they might also repress the necessary somatic genes, which often lack motifs for pluripotency factors like Oct4. An alternate hypothesis for loss of the somatic program may involve Jun-Mbd3 mediated repression, since AP-1 motifs are strongly enriched in fibroblast enhancers as well as enhancers in other somatic cell types [480]. A side-effect of this mechanism is concomitant repression of pluripotency genes, such as Oct4. We find that a total loss-of-function by our dominant-negative (dnAP-1) approach can result in both cellular amnesia at somatic regulatory elements (H.2) and robust activation of the pluripotency network during heterokaryon reprogramming (5.18C and H.2F,G), as well as augmented iPSC reprogramming when combining dnAP-1 with the Yamanaka factors (5.21).
The role of Mbd3 in iPSC reprogramming is the subject of debate. One study found that Mbd3 is required for efficient iPSC generation from epiblast stem cells and neural precursors. In a different study, Mbd3 inhibits fibroblast reprogramming to iPSCs and loss of Mbd3 promotes activation of an Oct4-GFP reporter at high frequency, in accordance with a deterministic model of reprogramming [285],[381]. Taken together, these studies suggest Mbd3 may act as a context-specific epigenetic gas or brake. A part of this context may be the levels of JUN protein in the cell. It may be that Mbd3 acts as a repressor in cell types with high levels of Jun, such as mouse embryonic fibroblasts, but promotes reprogramming in cell types with low or undetectable levels of Jun, such as epiblast stem cells[193]. In particular, one study found MBD3 occupancy at the Oct4 locus in fibroblasts [285], consistent with a model in which MBD3 is recruited by a factor already present in fibroblasts to repress Oct4 directly. Conceivably, recruitment could occur by interaction with overexpressed transcription factors, as in the gas and brakes model [381], or by interaction with unphosphorylated Jun, as our data suggest. The mechanisms are not mutually exclusive.

Figure 5.21: a, Experimental setup for temporal induction of dominant negative AP-1 (dnAP-1) during iPSC reprogramming from mouse embryonic fibroblasts (MEFs). b, Quantification of NANOG+ iPSC colonies relative to the sample without induction of dnAP-1. Error bars show mean ± s.e.m. (n=3 biological replicates). P-values calculated using a two-tailed Student’s t-test. c, Representative mouse iPSC colony in phase contrast and by immunofluorescence. Scale bar is 100µm. d, Experimental design for replacement of exogenous OCT4 via temporal induction of dominant negative AP-1 (dnAP-1) using doxycycline (dox) during iPSC reprogramming from human fibroblasts. e, Quantification of NANOG+ iPSC colonies. Error bars show mean ± s.e.m. (n=3 biological replicates). f, Representative human iPSC colony in phase contrast and by immunofluorescence. Scale bar is 100µm.

The question of what activates endogenous pluripotency genes may be less relevant than the repressive mechanisms that prevent activation. This is supported by a number of studies that are able to replace factors by relieving a repressive epigenetic mechanism. Examples include replacing KLF4 in the reprogramming cocktail in human fibroblasts with a histone deacetylase (HDAC) inhibitor [197], and using a combination of G9a methyltransferase and DNA methyltransferase inhibitors to efficiently replace Sox2 in iPSC reprogramming from MEFs [421]. In a separate study, knockdown of Mbd3 by shRNA was able to replace Sox2 in the reprogramming cocktail [285]. Notably, in our study we are able to replace exogenous OCT4 with dnAP-1, which would block recruitment of MBD3, part of the nucleosome remodeling and deacetylase (NuRD) complex. Our data combined with previous studies replacing SOX2 and KLF4 raise the possibility that a major role of the exogenous factors during iPSC formation is the stable acquisition of histone acetylation at genes crucial to iPSC reprogramming. This can be accomplished either through recruitment of histone acetyltransferases (HATs) or by blocking recruitment of HDACs via AP-1 or other HDAC recruiting factors. Together these findings highlight the importance of repressors in cell fate determination.
5.4.5 Methods
See appendix for methods detailing heterokaryon generation and sequence data generation.
ATAC-seq and RNA-seq mapping and quantification
The ATAC-seq datasets were processed with the ENCODE ATAC-seq pipeline (v0.3.0). The cutadapt algorithm was used to trim adapters, and the Bowtie2 aligner was used to align the reads to a custom hg19-mm9 index. To address the concerns raised by multi-mapping of reads to the human and mouse genomes, the haplotype chromosomes and unplaced contigs were excluded from the Bowtie2 index, and the multi-mapping parameter was set to 4. Duplicates were then removed from the aligned reads, and the MACS2 peak caller was used to call peaks from the aligned ATAC-seq samples. 702,475 peaks were present in the merged set of peaks across all conditions.
The DESeq2 algorithm was used to identify differentially accessible peaks across all pairs of conditions, with an FDR threshold of 0.01. Inputs to the algorithm were read counts, computed from the pseudo-replicate tagalign files for overlapping regions with the merged peak set across all 702,475 peaks. The bedtools coverage command was used to compute the raw read counts for each peak within each replicate. Surrogate variable analysis was performed to identify and remove batch effects using the R SVA package.
STAR was used to align RNA-seq reads to a combined hg19-mm9 reference genome. Gene expression values were quantified with RSEM and any mm9 genes were removed from further analysis. As with the ATAC-seq data, RSEM read counts were analyzed with DESeq2 to identify differentially expressed genes across pairs of conditions, with FDR threshold = 0.01. Surrogate variable analysis was run to remove batch effects.
5.4.6 DPGP clustering of peak and gene trajectories
The resulting set of differential ATAC-seq peaks was clustered with the Dirichlet process Gaussian process mixture model (DPGP) algorithm[305]. Given that it was computationally intractable to cluster all 702,475 peaks, the peaks were grouped by common patterns of differential accessibility. Specifically, for each of 18 pairwise comparisons across samples (Homokaryon Hk, Co-culture CC, MRC5 M5, 3hr post-fusion, 16hr post-fusion, 48hr post-fusion, H1), each peak was assigned a value of -1 (downregulated in the second condition compared to the first), 1 (upregulated in the second condition), or 0 (no significant difference). Peaks were assigned to patterns based on this 18-character string of -1, 1, 0 values. For each of the resulting 5876 patterns, the median value of asinh(counts per million) was computed for the controls (Hk, M5, CC), 3hr, 16hr, and 48hr. These values were clustered with 1000 iterations of DPGP, using default algorithm settings, to yield 15 clusters. The initial set of peaks corresponding to each cluster was back-calculated from the clustered patterns. For each cluster, the mean trajectory and standard deviation were computed. The trajectory values for each peak were normalized to Z-scores (5.17B).
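The pattern-grouping step that precedes DPGP clustering can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the function names and toy data are hypothetical, the toy example uses three comparisons rather than 18, and the control samples are assumed to have already been collapsed into a single column. The DPGP clustering itself is not shown.

```python
import math
from collections import defaultdict
from statistics import median


def pattern_key(calls):
    """Join per-comparison calls (-1, 0, 1) into a hashable pattern string."""
    return ",".join(str(c) for c in calls)


def group_peaks_by_pattern(peak_calls):
    """Map each differential pattern to the list of peak ids that share it."""
    groups = defaultdict(list)
    for peak_id, calls in peak_calls.items():
        groups[pattern_key(calls)].append(peak_id)
    return dict(groups)


def pattern_trajectories(groups, peak_cpm):
    """Median asinh(CPM) per time point across the peaks of each pattern."""
    trajectories = {}
    for pattern, peak_ids in groups.items():
        n_points = len(peak_cpm[peak_ids[0]])
        trajectories[pattern] = [
            median(math.asinh(peak_cpm[p][t]) for p in peak_ids)
            for t in range(n_points)
        ]
    return trajectories


# Hypothetical toy input: calls for three comparisons, CPM at four time points
# (controls, 3hr, 16hr, 48hr).
peak_calls = {
    "peak1": [1, 0, -1],
    "peak2": [1, 0, -1],
    "peak3": [0, 0, 1],
}
peak_cpm = {
    "peak1": [0.0, 2.0, 4.0, 1.0],
    "peak2": [0.0, 4.0, 6.0, 3.0],
    "peak3": [5.0, 5.0, 1.0, 0.0],
}
groups = group_peaks_by_pattern(peak_calls)
trajectories = pattern_trajectories(groups, peak_cpm)
```

Keeping the pattern-to-peaks mapping in `groups` is what allows the peaks belonging to each cluster to be back-calculated once the pattern trajectories have been clustered.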
A similar approach was applied to cluster 14955 genes that exhibited differential expression in at least one of the pairwise comparisons across samples. The asinh(TPM) expression values for the genes were clustered directly, without first mapping genes to differential patterns, because this smaller set was computationally tractable, unlike the full set of ATAC-seq peaks. 1000 iterations of DPGP yielded 15 gene clusters. Z-scores relative to the mean cluster trajectory are plotted in 5.17C.
To identify associations between peak and gene clusters, the pairwise Pearson correlation was computed between the median trajectories of clusters A1 - A15 and R1-R15. Correlation values lower than 0.8 are not shown; higher correlation values are indicated in 5.17B,C with proportionally thick gray lines.
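The association step can be illustrated with a short sketch that computes the Pearson correlation between every pair of ATAC-seq and RNA-seq cluster trajectories and keeps pairs at or above the 0.8 cutoff. The function names and trajectory values below are hypothetical, and treating a correlation of exactly 0.8 as retained is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length trajectories."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def linked_clusters(atac, rna, threshold=0.8):
    """Return (ATAC cluster, RNA cluster, r) for pairs with r >= threshold."""
    links = []
    for a_name, a_traj in atac.items():
        for r_name, r_traj in rna.items():
            r = pearson(a_traj, r_traj)
            if r >= threshold:
                links.append((a_name, r_name, r))
    return links


# Hypothetical median trajectories over four time points.
atac = {"A1": [3.0, 2.0, 1.0, 0.5], "A4": [0.5, 1.0, 2.0, 3.0]}
rna = {"R1": [6.0, 4.0, 2.0, 1.0], "R2": [1.0, 3.0, 1.0, 3.0]}
links = linked_clusters(atac, rna)
```

In this toy example only the A1 and R1 trajectories, which are proportional to each other, pass the cutoff. Note that a constant trajectory has zero variance and would need to be handled separately.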
5.4.7 Motif Enrichment Analysis
Cluster-specific motif enrichment analysis was computed for 62 motifs that had been profiled by ENCODE in IMR90 (ENCODE ID E017) and H1 (ENCODE ID E003) cell types[221]. Known motif positions across the hg19 reference genome were downloaded from http://compbio.mit.edu/encode-motifs/. A bedtools intersection was performed to identify motif overlaps with ATAC-seq peaks in each of the 15 clusters A1 - A15. Taking the full set of differential ATAC-seq peaks as background, the overlap of background peaks with known ENCODE motif positions was also computed. The motif fold enrichment for a given cluster was computed as
\[
\text{background enrichment} = \frac{\text{number of intersections between differential ATAC-seq peaks and known motif sites from ENCODE}}{\text{total number of differential ATAC-seq peaks in the dataset}}
\]
\[
\text{cluster enrichment} = \frac{\text{number of intersections between cluster peaks and known motif sites from ENCODE}}{\text{total number of peaks in the cluster}}
\]
\[
\text{fold enrichment} = \frac{\text{cluster enrichment}}{\text{background enrichment}}
\]
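As a worked illustration of the fold enrichment ratio, the sketch below computes it from four counts. The function name and the example counts are hypothetical, and the intersection counts are assumed to have been obtained beforehand (for example from a bedtools intersection).

```python
def fold_enrichment(cluster_hits, cluster_size, background_hits, background_size):
    """Fold enrichment of a motif in a cluster relative to the background.

    cluster_hits / background_hits: number of peak-motif intersections.
    cluster_size / background_size: total number of peaks in each set.
    """
    if cluster_size <= 0 or background_size <= 0:
        raise ValueError("peak set sizes must be positive")
    if background_hits <= 0:
        raise ValueError("motif has no background intersections")
    cluster_enrichment = cluster_hits / cluster_size
    background_enrichment = background_hits / background_size
    return cluster_enrichment / background_enrichment


# Hypothetical counts: 50 of 200 cluster peaks and 1250 of 10000 differential
# peaks intersect the motif, giving 0.25 / 0.125.
fe = fold_enrichment(50, 200, 1250, 10000)
```

With these counts the motif is twice as frequent in the cluster as in the background, so `fe` is 2.0.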
Motifs with significant enrichment in one or more clusters (FDR ≤ 0.01) were assigned to the cluster in which they exhibited the greatest enrichment. For each motif, the Counts per Million (CPM) values in the peaks where the motif appears were averaged together for each timepoint. The row Z-scores of these averaged CPM values are plotted in Figure 1a, where each motif is assigned to the cluster in which it exhibited the highest enrichment. The motif scan alone does not provide sufficient information to determine which transcription factor within a given motif family is specifically enriched.
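The assignment of a motif to its most enriched cluster and the row Z-score normalization can be sketched as below. The function names and values are hypothetical, and the use of the population standard deviation is an assumption, since the choice of estimator is not stated here.

```python
from statistics import mean, pstdev


def best_cluster(enrichment_by_cluster):
    """Return the cluster name with the greatest fold enrichment for a motif."""
    return max(enrichment_by_cluster, key=enrichment_by_cluster.get)


def row_zscores(values):
    """Z-score one row (one motif across time points).

    Uses the population standard deviation (an assumption); a constant row
    is mapped to zeros to avoid division by zero.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical fold enrichments of one motif across three clusters.
assigned = best_cluster({"A9": 2.5, "A10": 3.1, "A2": 0.8})

# Hypothetical averaged CPM values at four time points for that motif.
z = row_zscores([1.0, 2.0, 3.0, 2.0])
```

Row-wise normalization places each motif on a common scale, so that the temporal shape of accessibility, rather than its absolute level, is compared across motifs.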
5.4.8 Transcription factor expression analysis
Using the motif families with significant enrichment in one or more clusters, as described above, the corresponding transcription factors with differential expression for at least one pair of samples were identified. For example, Figure 1a reveals enrichment for the Sox family in cluster A2. The Sox7 gene was assigned to cluster R2, which suggests that the Sox7 member of this motif family contributes to the observed enrichment.
5.4.9 Chromatin state distributions
The 12-mark/127-reference epigenome/25-state Imputation Based Chromatin State Model was used to identify the chromatin state distribution in the E017 (IMR90) and E003 (H1) cells. For each cluster A1 - A15, the fraction of peaks in each of the 25 chromatin states was determined. Enhancer states EnhA1, EnhA2, EnhAF, EnhW1, EnhW2, EnhAc were combined into a single “Enhancer” state. Promoter states PromU, PromD1, PromD2 were combined into a single “Promoter” state. Inactive states Het, PromP, PromBiv, ReprPC, Quies were combined into a single “Inactive” state.
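The collapsing of the 25 chromatin states into the three summary categories, and the per-cluster fractions, can be sketched as follows. The function names and toy input are hypothetical, and the "Other" label for states outside the three listed groups is an assumption introduced here for illustration.

```python
from collections import Counter

# State groupings as listed in the text.
ENHANCER = {"EnhA1", "EnhA2", "EnhAF", "EnhW1", "EnhW2", "EnhAc"}
PROMOTER = {"PromU", "PromD1", "PromD2"}
INACTIVE = {"Het", "PromP", "PromBiv", "ReprPC", "Quies"}
LABELS = ("Enhancer", "Promoter", "Inactive", "Other")


def collapse_state(state):
    """Map one of the 25 chromatin states to a summary category."""
    if state in ENHANCER:
        return "Enhancer"
    if state in PROMOTER:
        return "Promoter"
    if state in INACTIVE:
        return "Inactive"
    return "Other"  # assumption: remaining states are kept separate


def state_fractions(peak_states):
    """Fraction of a cluster's peaks that fall in each summary category."""
    counts = Counter(collapse_state(s) for s in peak_states)
    total = len(peak_states)
    return {label: counts[label] / total for label in LABELS}


# Hypothetical chromatin state calls for the eight peaks of one cluster.
cluster_states = ["EnhA1", "EnhW2", "PromU", "Quies", "Tx", "Het", "EnhAc", "PromBiv"]
fractions = state_fractions(cluster_states)
```

Running the same function on state calls from the fibroblast (E017) and ESC (E003) annotations would give the two stacked distributions compared for each cluster.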
5.4.10 Author Contributions
This work is co-authored by Glenn J Markov, Thach Mai, Anna Shcherbina, Yu Xin Wang, Anshul Kundaje, and Helen Blau. Glenn Markov and Thach Mai performed the experimental analysis under supervision from Helen Blau. Anna Shcherbina performed the ATAC-seq data processing and machine learning modeling under supervision from Anshul Kundaje.
5.5 Dissecting Murine Muscle Stem Cell Aging Through Regeneration Using Integrative Genomic Analysis
5.5.1 Abstract
During aging, there is a progressive loss of volume and function in skeletal muscle that impacts mobility and quality of life. The repair of skeletal muscle is regulated by tissue-resident stem cells called satellite cells (MuSCs), but in aging, MuSCs decrease in numbers and regenerative capacity. The transcriptional networks and epigenetic changes that confer diminished regenerative function in MuSCs as a result of natural aging are only partially understood. Herein, an integrative genomics approach was utilized to profile MuSCs from young and aged animals before and after injury. Integration of these datasets revealed that aging alters multiple layers of regulation, with significant differences in gene expression, metabolic flux, chromatin accessibility and patterns of transcription factor (TF) binding activity. Collectively, these datasets facilitate a deeper understanding of the regulation that tissue-resident stem cells utilize during aging and healing.
5.5.2 Introduction
Physical frailty, with its associated immobility and disability, is a major factor limiting independence and quality of life for the elderly and can be partially derived from skeletal muscle atrophy and weakness [295]. Declines in the health and repair of skeletal muscle can be attributed to a population of resident stem cells called satellite cells [488] or muscle stem cells (MuSCs). In response to damage, MuSCs undergo dramatic molecular transitions[367] to regenerate the tissue as well as replenish the reservoir of stem cells for future regenerative needs. However, during aging, decreasing numbers and function of MuSCs[62] result in reductions in the rate and magnitude of recovery following muscle injury leading to persistent tissue damage [508] and potentially contributing to age-associated muscle wasting[16].
The molecular mechanisms that govern stem cell aging encompass changes in metabolism[390], aberrant chromatin packaging[102],[429], accumulation of DNA damage[280] and loss of proteostasis, all of which collectively converge to drive an imbalance between maintenance of the quiescent (nonproliferating) state, differentiation and self-renewal. We lack a mechanistic understanding of how each of these levels of genomic regulation is modified with age in MuSCs or how they influence each other, and this limitation is in part driven by the lack of genome-wide datasets for aged MuSCs, notably during regeneration. Understanding how different types of molecular changes[274] from natural aging produce defects and delays in MuSC-mediated healing is critical for prevention of a senescent[438] or fibrogenic phenotype [67] and for therapeutic target discovery[89],[56] to maintain healthy muscle into old age [407].
Herein, we utilize an integrative genomics approach and compare the gene expression programs and chromatin landscape of murine MuSCs from distinct age groups during multiple phases of the regenerative response. Our results describe how natural aging impacts changes in the regulome that manifest in the loss of constitutive heterochromatin and aberrant patterns of transcription factor (TF) binding. We link these alterations to modulation of metabolic flux, and corresponding changes in gene expression. Collectively, these data enable definition of positive regulatory programs that drive healthy regeneration and how chromatin landscapes evolve to negatively affect regeneration as a result of aging.
5.5.3 Results
Aging Modulates Muscle Regeneration Through Differences in Gene Expression
To understand how aging impacts the function of MuSCs, hindlimb muscles (tibialis anterior, TA, and gastrocnemius, Gas) of wild-type (WT) mice (young: 2-3 months and aged: 22-24 months) were injected with barium chloride (BaCl2, 5.22a). This injury model destroys muscle fibers but leaves MuSCs intact, facilitating tissue regeneration [184]. Consistent with previous observations [438], histological and immunofluorescence (IF) staining of injured tissue isolated 7 days post injury (dpi) from aged and young mice showed reduced and delayed regeneration in aged muscles (5.22b, I.1a). To understand the sources of dysregulated repair, MuSCs were purified from TA and Gas muscles at multiple time points (0, 1, 3, 5, 7 and 21 dpi) using fluorescence-activated cell sorting[88] (FACS) with both negative (Sca-1-, CD45-, Mac-1-, Ter-119-) and positive surface markers (CXCR4+, β1-integrin+) (5.22c). Previous studies[291] of young MuSCs have shown that these surface markers enrich a highly purified (≥90%) population of Pax7+ MuSCs. In line with these results, 93-97% of our sorted cells stained positive for Pax7 expression (I.1b). We also extracted MuSCs from injured tissue at 3 dpi using FACS, immunostained the cells for Pax7 and found that ≥93% were Pax7+ (I.1b). To further validate that the sorted cells were MuSCs, MuSCs were extracted from uninjured hindlimb muscles of transgenic mice harboring a Cre/LoxP-based system for MuSC lineage tracing, Pax7CreERT2/+; Rosa26TdTomato (P7TdT)[218]. The transcriptome of FACS-sorted WT MuSCs was compared with that of P7TdT MuSCs (I.1c) by RNA isolation followed by sequencing[10] (RNA-Seq), and the two transcriptomes were highly correlated (Spearman = 0.97). These results, coupled with the high purity observed from IF imaging, indicate that FACS-sorted WT MuSCs and P7TdT MuSCs exhibit nearly complete overlap.
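The transcriptome comparison above is summarized by a Spearman rank correlation. As a minimal sketch of how such a coefficient can be computed (not the pipeline used in this work), the following assumes two equal-length lists of per-gene expression values; the function names and the TPM values are hypothetical and chosen for illustration only.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-gene TPM values for two samples (illustrative only).
wt_tpm = [12.0, 0.5, 88.1, 3.3, 40.2]
p7tdt_tpm = [10.4, 0.9, 95.0, 2.1, 61.7]
rho = spearman(wt_tpm, p7tdt_tpm)
```

Because the coefficient depends only on ranks, it is insensitive to monotone transformations of expression such as log scaling, which is one reason it is a common choice for comparing transcriptomes.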
Gene expression profiles from young and aged sorted MuSCs were generated and strong agreement (Spearman ≥ 0.80) was observed among biological replicates isolated at each time point. Differential expression (DE) analysis revealed that 4,985 genes underwent a change during the time course or as a result of aging. Time series clustering of differential expression trajectories using nonparametric-based clustering [306] revealed differences between young and aged MuSCs and also displayed variations in enriched GO terms and KEGG pathways (5.22d-f). At 1 dpi, enriched pathways for young MuSCs included cellular adhesion and ECM deposition (Ret, Itgb1, Itga6, Col5a3, Laminin β1, Col15a1), while aged MuSCs overexpressed genes associated with histone deacetylase binding (Rac1, Cbx5, Nudt21). At 3 dpi, both young and aged samples were positive for many cell cycle genes (Ccna2, Ccnb1, Ccnb2, Ccne1) and pathways, but young samples were observed to increase expression of TFs such as Myf5, Myog and Runx1[476] and genes associated with muscle contraction (Myh3, Tnnt2 and Myomaker-Tmem8c, I.1e). In contrast, at days 3 and 5, aged MuSCs exhibited upregulation of inflammatory markers (Tnfrsf10b/11a/13b/12a, Irf7) and nervous system development genes (Ntrk3, Cadm1, Nrn1). At 5 dpi, young MuSCs upregulated pathways associated with chromatin binding (Top2a, Ezh2, Cbx5, MyoD1, Setdb1) and maintained increases in expression of genes associated with myogenic differentiation (Myh3, Tnnt2, Tmem8c), whereas aged MuSCs maintained increases in expression of inflammatory response genes (Cxcl12, Il-10, Fas, Ccl3) and complement activation (C1ra, C1s1, C3). At 7 dpi, young MuSCs upregulated genes associated with MAPK and calcium signaling as well as inhibition of Wnt signaling (Camk2g, Sphk1, Mef2c, Sfrp1, Sfrp5). In contrast, aged MuSCs continued to be enriched for immune response pathways.
At 21 dpi, young MuSCs had more differentially expressed genes than aged MuSCs (765 versus 161), including genes such as Notch2, Zbtb20 and Laminin β1. Many of the enriched genes in aged MuSCs at 21 dpi were also shared with aged MuSCs from 0 dpi and young MuSCs from 1 dpi. Collectively, the observed changes in expression align with histological observations as well as other studies[274] showing that aging impairs the ability to maintain quiescence and regenerate tissue efficiently, in contrast to young MuSCs, which displayed stronger and faster regeneration.
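The expression heatmap referenced above (5.22d) is plotted as per-gene z-scores. A minimal sketch of that standardization is given below; the gene names and values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, pstdev


def row_zscores(values):
    """Standardize one gene's expression values across samples."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; an assumption, see lead-in
    if sigma == 0:
        # A gene with constant expression carries no temporal pattern.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical expression values at 0, 1, 3, 5, 7 and 21 dpi.
expression = {
    "gene_a": [2.0, 4.0, 9.0, 7.0, 3.0, 2.0],
    "gene_b": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
}
zscores = {gene: row_zscores(vals) for gene, vals in expression.items()}
```

Standardizing each gene separately places genes with very different absolute expression on a common scale, so that clusters reflect the shape of the trajectory rather than its magnitude.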
Murine Muscle Stem Cells Display Similar Activation Programs as Human Muscle Stem Cells
To determine whether young or aged murine MuSCs displayed activation-associated expression changes similar to those of human MuSCs, we contrasted the transcriptomes of freshly isolated (FI) / uninjured and activated human MuSCs [94] with murine uninjured (0 dpi) and activated (3 dpi) MuSCs. By 3 dpi for murine MuSCs and 7 days in culture for human MuSCs, both populations have fully activated and undergone at least one cellular division [393]. A total of 135 genes were commonly differentially expressed among FI young murine, FI aged murine and FI human MuSCs (5.22g). In all 3 FI populations, we observed enrichment of genes such as Calcr, Apoe, Cebpb, Ndrg2, Chrdl2 and Chodl, which have been previously associated with quiescence [156] and decreased in expression after activation. Common enriched pathways included regulation of proliferation (FDR=1.6e-06) and extracellular matrix (ECM) components (FDR=3.3e-06). Several ECM proteins unique to FI murine MuSCs included vitronectin (Vtn) and decorin (Dcn) and were more strongly expressed in young than aged MuSCs. For all 3 activated MuSC populations, we observed increases in expression of a series of myogenic differentiation and muscle contractile genes (MyoG, Tmem8c, Myh3, Cav3, Tnnt2, Acta1, Des). Pathway annotation of activated murine MuSCs revealed enrichments in cell cycle (cell cycle phase, FDR=2.55e-35; M phase, FDR=3.38e-35) and cytoskeleton assembly (cytoskeletal, FDR=6.48e-15; contractile fiber, FDR=2.66e-07). Pathways enriched in human activated MuSCs (vs FI human) and young activated murine MuSCs included negative regulation of apoptosis (FDR=2.12e-1), which was absent in activated aged murine MuSCs. Taken together, these results suggest similarities in several genes associated with quiescence and activation for human and mouse MuSCs, but distinct genes and pathways for ECM-related proteins.
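The cross-species comparison above amounts to partitioning three lists of differentially expressed genes into shared and unique subsets, as drawn in the Venn diagrams of 5.22g. The following is a minimal sketch of that partition; it assumes that mouse and human gene identifiers have already been mapped to a common ortholog namespace, and the identifiers shown are placeholders rather than genes from this study.

```python
def venn_regions(young, aged, human):
    """Partition three gene sets into the seven regions of a Venn diagram."""
    young, aged, human = set(young), set(aged), set(human)
    return {
        "all_three": young & aged & human,
        "young_and_aged_only": (young & aged) - human,
        "young_and_human_only": (young & human) - aged,
        "aged_and_human_only": (aged & human) - young,
        "young_only": young - aged - human,
        "aged_only": aged - young - human,
        "human_only": human - young - aged,
    }


# Placeholder ortholog identifiers (hypothetical, for illustration only).
young_de = {"g1", "g2", "g3", "g5"}
aged_de = {"g1", "g2", "g4"}
human_de = {"g1", "g5", "g6"}

regions = venn_regions(young_de, aged_de, human_de)
region_sizes = {name: len(genes) for name, genes in regions.items()}
```

The seven regions are disjoint and together cover the union of the three inputs, which gives a simple consistency check on the counts reported in such a diagram.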

Figure 5.22 (previous page): Muscle stem cells act aberrantly as a result of aging and poorly regenerate muscle after injury. A) Schematic of experiment whereby young (2-3 months) and aged (22-24 months) mice are injured by BaCl2 injection in hindlimb muscles (gastrocnemius and tibialis anterior) and muscle stem cells (MuSCs) are isolated with FACS, pooled across muscles, and profiled during muscle regeneration using genome-wide chromatin accessibility and gene expression measurements. B) Histological assessment (top) and quantification of mean regenerating fiber cross-sectional area (CSA) size (identified through centrally located nuclei) from TAs harvested from young and aged mice 7 days post injury with Hematoxylin and Eosin (H&E) staining. C) Representative FACS plots showing negative (Sca-1, Mac-1, CD45, Ter-119) and positive (CXCR4, β1-integrin) surface markers where numbers within gates indicate percentage of cells within gate. D) Heatmap of differential expression for 4,985 genes plotted as z-score for young and aged MuSCs isolated from different days post injury. Clusters are identified by color on the left side of the heatmap. E) Dirichlet Process Gaussian Process (DPGP) mixture model-based clustering of gene expression time series data defined as the z-score of young over aged (gray represents 2x standard deviation and black line is cluster mean), where cluster peaks correspond to day of MuSC isolation and are color-coded to match clusters in d). F) Enriched GO/KEGG pathways for subset of clusters from e) that are color-coded. G) Venn diagrams of differentially expressed genes from freshly isolated / uninjured or activated MuSCs from young and aged mice and human MuSCs [94]. Activated MuSCs were isolated at 3 dpi from young and aged murine muscle and after 7 days in culture for human MuSCs, by which point both have undergone at least one cellular division. H) Enriched pathways from unique and shared genes from g).
FI: pathways from common genes among all freshly isolated MuSCs, FI–M: pathways from common genes among mouse freshly isolated MuSCs, FI–A: pathways from genes among aged mouse freshly isolated MuSCs, FI–Y: pathways from genes among young mouse freshly isolated MuSCs, Ac: pathways from common genes among all activated MuSCs, Ac–M: pathways from common genes among mouse activated MuSCs, Ac–A: pathways from common genes among aged mouse activated MuSCs, Ac–Y: pathways from common genes among young mouse activated MuSCs.
Aged Muscle Stem Cells Undergo Switches in Metabolism That Associate with Histone Methylation
A large number of the differentially enriched pathways observed from murine MuSC gene expression changes were related to metabolism, which is consistent with the known regulatory role of metabolism on MuSC gene expression[55]. To probe deeper into these effects, we used genome-scale metabolic modeling to assess the relationship between 3,744 metabolic reactions, 2,766 metabolites, 1,496 metabolic genes and 2,004 metabolic enzymes132. The expression data were overlaid onto the metabolic model by maximizing the flux through the metabolic reactions associated with up-regulated genes while minimizing flux through the reactions associated with down-regulated genes. The model predicted that young MuSCs upregulated metabolic flux through Vitamin A (retinol) signaling, glutamate metabolism and oxidative phosphorylation (OxPhos). Previous measurements[347] of the bioenergetics (oxygen consumption rate and extracellular acidification rate) of young and aged MuSCs showed that aging induces a shift in metabolic substrate utilization away from oxidative metabolism, which is consistent with the results from the metabolic model. The model also revealed increased flux through folate metabolism and the one-carbon cycle (1CC) for aged MuSCs.
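The overlay step described above can be illustrated with a toy scoring function. The sketch below is not the genome-scale model or solver used in this work: it only shows one way to turn up- and down-regulation calls into reaction weights and to score candidate flux vectors. Stoichiometric (steady-state) constraints and the optimization itself are omitted, and the reaction names, gene names, and the rule that a reaction with conflicting calls receives zero weight are all assumptions made for illustration.

```python
def objective_weights(reaction_genes, gene_calls):
    """Assign +1 to reactions tied to up-regulated genes, -1 to reactions
    tied to down-regulated genes, and 0 otherwise (including conflicts)."""
    weights = {}
    for reaction, genes in reaction_genes.items():
        calls = {gene_calls.get(g, "unchanged") for g in genes}
        if "up" in calls and "down" not in calls:
            weights[reaction] = 1.0
        elif "down" in calls and "up" not in calls:
            weights[reaction] = -1.0
        else:
            weights[reaction] = 0.0
    return weights


def expression_objective(fluxes, weights):
    """Score a flux vector: reward flux magnitude through up-weighted
    reactions and penalize it through down-weighted reactions."""
    return sum(w * abs(fluxes.get(r, 0.0)) for r, w in weights.items())


# Hypothetical reaction-to-gene map and differential expression calls.
reaction_genes = {"R1": ["gA"], "R2": ["gB"], "R3": ["gC"], "R4": ["gA", "gB"]}
gene_calls = {"gA": "up", "gB": "down"}
weights = objective_weights(reaction_genes, gene_calls)

# Two hypothetical candidate flux distributions; the first agrees better
# with the expression data and therefore scores higher.
candidate_a = {"R1": 5.0, "R2": -1.0, "R3": 2.0}
candidate_b = {"R1": 2.0, "R2": 4.0}
score_a = expression_objective(candidate_a, weights)
score_b = expression_objective(candidate_b, weights)
```

In a full model, an objective of this kind would be optimized subject to mass-balance and flux-bound constraints rather than evaluated on hand-picked candidates.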
Because the 1CC generates S-adenosylmethionine (SAM), which acts as a methyl group donor for methylation reactions, we next examined whether changes in 1CC coincided with changes to histone methylation [313]. To address this question, single uninjured MuSCs from young and aged mice were immunostained, imaged and quantified for global levels of total histone (H3) and repressive chromatin modifications (H3K27me3 and H3K9me3) (5.23b-c). Aged MuSCs displayed increased levels of H3K27me3, in contrast to young MuSCs, which displayed increases in H3 and H3K9me3. These results are consistent with increased expression in young MuSCs of Suv39h methyltransferase enzymes, which are primarily responsible for the formation and maintenance of constitutive heterochromatin[358], as well as other chromatin enzymes that contain the SET domain (Ezh1, Utx/Kdm6a, Ash1l, Suv39h2, Setd3) and interact with Suv39h enzymes. These enzymes were decreased in aged MuSCs, which also overexpressed different factors such as Mll5/Kmt2e, Hp-1γ/Cbx3, Prdm16, Eed, and Setd6 (5.23d). Taken together, these results show that aged MuSCs possess differences in metabolic flux that are associated with alterations of histone modifications and changes in expression of concomitant sets of enzymes. These findings are consistent with previous studies of cellular aging [274],[412], in which histone loss and altered constitutive heterochromatin have been reported.
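The per-cell staining comparisons above are reported with a two-sided, two-sample Student's t-test (caption of Figure 5.23). A minimal sketch of the pooled-variance statistic follows; the intensity values are hypothetical, and the p-value helper uses a normal approximation that is only reasonable for large samples such as the thousands of cells per group reported in the caption. It is not the software used for the published statistics.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def pooled_t_statistic(x, y):
    """Student's t statistic assuming equal population variances.

    Returns the statistic and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def two_sided_p_normal_approx(t):
    """Two-sided p-value using a standard normal in place of the
    t distribution; a large-sample approximation only."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


# Hypothetical per-cell staining intensities (arbitrary units).
young_intensity = [1.0, 2.0, 3.0, 4.0, 5.0]
aged_intensity = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = pooled_t_statistic(young_intensity, aged_intensity)
```

For small samples the exact t distribution with the returned degrees of freedom should be used instead of the normal approximation.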
Engagement of Retinoic Acid Receptors Restrains Muscle Stem Cell Activation but Is Lost with Age
Vitamin A-retinoic acid has been shown to restrain human skeletal muscle progenitors from differentiation [399],[176] and promote quiescence in hematopoietic stem cells[83]. A loss of expression of retinoic acid receptors and retinoid X receptors was observed for aged MuSCs (5.24a), mirroring the loss of other quiescence-associated genes (I.1d). To test whether vitamin A / retinoic acid signaling, as predicted by our metabolic model, affected maintenance of quiescence in young or aged murine MuSCs, we isolated MuSCs from uninjured hindlimb muscles of young and aged mice using FACS and cultured the cells in activating conditions for 3 days with and without all-trans retinoic acid (ATRA). ATRA treatment reduced activation (MyoD+) and proliferation (Ki67+) in both young and aged MuSCs and increased Pax7 in young MuSCs, whereas aged MuSCs maintained the same level of Pax7 (5.24b-c). These results confirm that aged MuSCs lose vitamin A signaling and suggest that the loss of retinoic acid signaling with age contributes to an inability to restrain activation.
Muscle Stem Cell Regulatory Networks Change with Age
To obtain a deeper insight into the regulation of the aberrant expression networks observed in aging and changes in metabolism, an assay for transposase-accessible chromatin followed by sequencing

Figure 5.23: Alterations in metabolism associate with global changes in histone methylation of young and aged muscle stem cells. A) Histogram (top) and jitter plot (bottom) showing significant changes in reaction flux plotted as the z-score for each metabolic reaction between young and aged MuSCs using a paired t-test (p ≤ 0.05). B) Representative immunofluorescence (IF) staining of total histone levels (H3) and repressive chromatin modifications (H3K27me3, H3K9me3) for young and aged muscle stem cells. Scale bar represents 100 µm. C) Quantitation of stains from b) shows higher histone levels (H3) and constitutive heterochromatin modifications (H3K9me3) for young MuSCs, whereas aged MuSCs displayed increases in facultative heterochromatin modifications (H3K27me3), where **** indicates p ≤ 0.001 as calculated by two-sided, two-sample Student's t-test. n = 1,765-3,825 cells from each of two young and two aged mice. D) Heatmap of gene expression for methyltransferases and chromatin enzymes in young and aged uninjured MuSCs plotted as z-score.

Figure 5.24: Retinoic acid receptors contribute to maintenance of muscle stem cell quiescence but are lost in age. A) Heatmap of gene expression for genes encoding retinoic acid receptors and retinoid X receptors as well as several downstream retinoic acid target genes in young and aged uninjured MuSCs, plotted as z-scores of TPM values. B) Representative immunofluorescence (IF) staining of Pax7 and MyoD for young and aged muscle stem cells following 3 days of treatment with retinoic acid (+RA) or DMSO alone (-RA). Scale bars represent 50 µm in images of aged MuSCs and 25 µm in images of young MuSCs. C) Quantitation of images in b) using a two-sample Student's t-test shows that treatment with ATRA increased Pax7 in young MuSCs (** p ≤ 0.01), decreased MyoD in both young MuSCs and aged MuSCs (**** p ≤ 0.0001 and * p ≤ 0.05, respectively), and decreased Ki67 in both young MuSCs and aged MuSCs (* p ≤ 0.05 and ** p ≤ 0.01, respectively). Cells were harvested from muscles of two young and two aged mice. In total, at least 50 cells were stained and analyzed per condition. A two-sided, two-sample t-test was used to calculate statistical significance.
[78] (ATAC-Seq) was utilized (5.22a) on MuSCs at multiple time points (5.25a). A total of 238,563 accessible sites were found, and blacklisted sites were removed[20]. Overall, the datasets were highly reproducible across technical and biological replicates (5.25a). Principal component analysis (PCA) of enriched sites showed that young samples migrate along a trajectory defined primarily by the first principal component, while aged MuSCs follow a trajectory defined by the second principal component (5.25b). During the early stages of the regenerative process (1 and 3 dpi), young and aged samples clustered with each other, and at later time points (5 and 7 dpi) the samples began to diverge in their regenerative trajectory back towards the uninjured state (0 dpi). Annotation of the ATAC-Seq peaks revealed that the largest fraction of sites aligned with transcriptional start sites (TSSs), but the majority of peaks were distal to TSSs (I.3c). During the regenerative process (1, 3, 5 dpi), young MuSCs were observed to increase the number of distal sites when compared to aged, but these dynamics reversed at 7 dpi and resembled 0 dpi. To probe whether sites that were opening up as a result of aging were previously marked by facultative heterochromatin modifications (H3K27me3) in young uninjured MuSCs, enriched sites were intersected with ChIP-Seq datasets (Liu et al., 2013) for H3K27me3 in MuSCs (5.25c). Of the 23,275 sites with increased accessibility in the aged samples, 274 overlapped H3K27me3-bound regions in aged satellite cells, while 399 overlapped H3K27me3-bound regions in young MuSCs (5.25c). Integrating these results shows that aging induces increases in accessibility of sites that were previously demarcated by facultative heterochromatin.
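The intersection of accessible sites with H3K27me3-bound regions described above is an interval-overlap count. The following is a minimal sketch of such a count and does not reproduce the tooling used in this work; it assumes BED-style half-open coordinates (start inclusive, end exclusive), and the coordinates shown are hypothetical.

```python
from collections import defaultdict


def count_peaks_overlapping(peaks, regions):
    """Count peaks that overlap at least one region on the same chromosome.

    Intervals are (chrom, start, end) tuples with half-open coordinates."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    hits = 0
    for chrom, start, end in peaks:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(start < r_end and r_start < end
               for r_start, r_end in by_chrom.get(chrom, ())):
            hits += 1
    return hits


# Hypothetical ATAC-Seq peaks and H3K27me3 regions (illustrative only).
atac_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
h3k27me3_regions = [("chr1", 150, 300), ("chr1", 600, 700), ("chr2", 10, 40)]
n_overlapping = count_peaks_overlapping(atac_peaks, h3k27me3_regions)
```

This quadratic scan is adequate for a small example; for genome-scale peak sets, sorted intervals or an interval tree would be the more practical choice.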
Assessment of potential drivers of changes in chromatin dynamics was performed by grouping chromatin regions with similar accessibility profiles through time, and differential peaks were analyzed with gene set enrichment analysis using the GREAT toolbox[310] (5.25d). This analysis yielded pathway enrichments such as fatty-acyl-CoA biosynthesis for young MuSCs before injury, and interleukin-6 (IL-6) mediated signaling and muscle cell development enriched for aged MuSCs. After injury, aged MuSCs remained enriched for IL-6 signaling and upregulated Wnt receptor signaling, the Ras pathway and regulation of retinoblastoma protein. In contrast, after injury young MuSCs exhibited temporal activation of the Notch-mediated Hes/Hey network, FGF signaling, as well as muscle cell differentiation.
Aging Induces Divergent Transcription Factor Binding Dynamics
To further determine candidate regulators that specify differences in the chromatin accessibility landscape, enriched regions were assessed for transcription factor binding motifs and corresponding changes in expression at each time point. This analysis revealed young MuSCs contained increases in the number of enriched motifs and higher expression for myogenic factors (Mef2a, Mef2c, MyoG) (5.26a, I.4a,c). Many of the recovered motifs were central regulators of the myogenic lineage (Myf5, Myog) or were predicted to interact with MyoD such as Zbtb18/Rp58, Tcf3 and Xbp1 motifs (I.44). These collective observations are well illustrated by chromatin profiles of the MyoG and Mef2a loci (5.26b). The MyoG promoter displayed differential accessibility in aged MuSCs compared to young throughout the time course, and these chromatin changes mirrored the expression profile which peaked at 3 and 5 dpi for young MuSCs. The MyoG promoter contains MyoD binding sites as well as two overlapping binding sites for Pbx/Meis that cooperate[403] with MyoD. Similarly, distal sites for Mef2a exhibited accessibility changes comparing young and aged MuSCs, which also contained MyoD binding sites, and correlated with stronger expression levels in young MuSCs during regeneration.
Given the pivotal role of MyoD in specifying the chromatin landscape of MuSCs, we performed footprinting analysis of MyoD in both young and aged ATAC-Seq datasets and found clear footprints across the genome (5.26c). On average, we observed stronger and more numerous MyoD footprints in aged MuSCs at 0 and 7 dpi, whereas at 1, 3 and 5 dpi MyoD footprints were stronger and more numerous in young MuSCs. These results were consistent with increases in the number of accessible sites at 0 and 7 dpi for aged MuSCs. Given that MyoD functionally cooperates with other transcription factors and histone acetyltransferases (protein-protein interactions) to activate transcription at target genes, we integrated ChIP-Seq datasets for MyoD [86] with MyoD footprints and identified ChIP-Seq peaks containing a MyoD footprint motif (directly bound MyoD), ChIP-Seq peaks lacking the MyoD motif or footprint (indeterminate sites) and ChIP-Seq peaks overlaying a motif but lacking a footprint

Figure 5.25: Chromatin accessibility is modified during muscle stem cell regeneration and exhibits divergent regenerative trajectories in aging. A) Heat map of Spearman correlation coefficients for individual replicates isolated from age and time points showing strong reproducibility. Correlation was computed on asinh(counts per million) reads after removal of the contributions from surrogate variables. B) Multi-dimensional scaling (MDS) of ATAC-Seq enrichments color-coded by day of isolation; circles represent aged samples and triangles represent young samples. C) Intersection of differentially accessible (young vs aged) ATAC-Seq peaks from day 0 with H3K27me3 sites previously derived (Liu et al., 2013) for young and aged MuSCs. D) Statistically enriched (Benjamini-Hochberg corrected p-value ≤ 0.01) pathways from different days derived from ATAC-Seq enrichments using GREAT analysis.
(indirectly bound MyoD) (5.26d-e). The fraction of ChIP-Seq peaks predicted to encompass direct binding versus being indirectly bound changed throughout the time course with 3 dpi representing the largest number of directly bound predicted sites. Increases in direct MyoD binding were observed for young MuSCs for all time points compared to aged MuSCs, with the greatest differential at 5 dpi, which is the time that exhibited the greatest number of upregulated genes for young MuSCs (I.3b). In contrast, aged MuSCs contained more indirect or indeterminate MyoD binding sites for all time points suggesting more but weaker MyoD interactions with chromatin as a result of age. We next determined the frequency of indirectly bound MyoD sites and queried if the sites coincided with motifs of a second factor. This analysis recovered enrichments for E-protein motifs (Tcf3/E47, Tcf12, E2f1), AP-1, related basic helix loop helix (BHLH) proteins such as MyoG and Myf5, Prdm16, Ddit3, and Tfap4, a transcriptional repressor[214]. Many of the associated factors exhibited stronger expression in young when compared to aged MuSCs during regeneration (5.26f), indicating young MuSCs may exhibit a stronger regenerative response due to combinations of transcription factor co-occupancies. We also observed that Prdm16 and Ddit3 increased expression at 21 dpi in aged MuSCs, which was similar to 0 dpi and in contrast to young MuSCs.
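The three-way categorization of MyoD ChIP-Seq peaks used above can be written as a simple decision rule. The sketch below reflects one reading of the categories as described in the text (motif plus footprint is direct; motif without footprint is indirect; no motif is indeterminate); the function name and the peak records are hypothetical.

```python
from collections import Counter


def classify_peak(has_motif, has_footprint):
    """Categorize a MyoD ChIP-Seq peak from motif and footprint evidence."""
    if has_motif and has_footprint:
        return "direct"         # motif present and protected in ATAC-Seq
    if has_motif:
        return "indirect"       # motif present but no footprint detected
    return "indeterminate"      # no MyoD motif under the peak


# Hypothetical peaks: (peak id, overlaps MyoD motif, overlaps MyoD footprint).
chip_peaks = [
    ("peak_1", True, True),
    ("peak_2", True, False),
    ("peak_3", False, False),
    ("peak_4", True, True),
]
category_counts = Counter(classify_peak(m, f) for _, m, f in chip_peaks)
```

Tallying these categories separately for each age group and time point yields the kind of stacked distribution shown in 5.26e.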

Figure 5.26 (previous page): Aging engenders variations in transcription factor binding dynamics during regeneration. A) Comparison of fold change in transcription factor expression in young/aged (x-axis) with the HOMER fold enrichment of the corresponding transcription factor binding motif in young/aged over background (y-axis). Motifs shown are enriched at day 0 in either young (y-axis ≥0) or aged (y-axis ≤0). Points on the chart are sized proportionally to median TPM for the corresponding day and color-coded by the motif family to which they belong. B) Normalized tracks of ATAC-Seq datasets (fold change of ATAC-Seq signal relative to the mm10 background distribution) around the MyoG locus (left) and Mef2a (right), where differences in enrichments are highlighted in gray and all tracks are scaled to the same level (fold change 0-20). C) Top - Comparison of aggregate MyoD footprints in young versus aged for each day of isolation, where the y-axis is the logarithm of reads per motif site and the x-axis is distance away from the center of the footprint (+/- 0.1kb from motif center). D) Schematic of approach to distinguish direct and indirect binding of MyoD transcription factor whereby MyoD ChIP-Seq peaks are categorized for ATAC-Seq footprints (FP) and position weight matrix (PWM). E) Annotated distribution of MyoD ChIP-Seq peaks (y-axis) in young (Y) and aged (A) MuSCs for each time point of isolation before and after injury. F) Gene expression heatmap plotted as z-score for each time point of isolation before and after injury for MyoD co-binding partners identified through indirectly bound MyoD sites and PWM of a second factor.
Misregulation of DDIT3 from Aging Inhibits Myogenic Differentiation
MyoD footprinting analysis revealed that different TFs coincided with MyoD binding sites, and these TFs and chromatin enzymes displayed differential expression at different time points. One of the factors that was over-expressed in aged MuSCs was Ddit3/Chop[343] (DNA damage inducible transcript 3). Ddit3 is a TF that regulates the growth hormone receptor (Ghr)[511] and protein synthesis through insulin-like growth factor 1 (Igf1) [182], and has been shown to act as a transcriptional repressor of MyoD [18] through interaction with HDAC1. Ddit3 is also sensitive to autophagy [163], the protein degradation process through which components of the cytoplasm are digested by lysosomes. The role of Ddit3 in MuSC aging has not been profiled. To evaluate the impact of reducing Ddit3, Ddit3 was silenced in myogenic progenitors (C2C12 cells) via delivery of two distinct Dicer-substrate small interfering RNAs [223] (DsiRNAs) packaged in lipid nanoparticles, and knockdown efficacy was verified using qPCR (I.5). After knockdown, myoblasts were differentiated and fused into myotubes, and increases in fusion and MyoG expression were observed for Ddit3 knockdowns compared to controls (5.27a-c). To determine if Ddit3 knockdown could rescue the differentiation delay and deficiency of aged MuSCs ex vivo, MuSCs were isolated from young and aged mice and derived into myoblasts. Ddit3 was silenced, and myoblast differentiation and fusion into myotubes were compared to controls. We observed that Ddit3 knockdown enhanced the myotube fusion index (5.27b) and increased MyoG expression. These results indicate that modulating cooperative TF binding that is altered with age can restore transcriptional output that regulates cellular differentiation.
5.5.4 Discussion
A healthy MuSC compartment is critical for maintaining muscle homeostasis through aging and transcriptional and epigenetic regulation [143] protects these cells from deleterious factors present

Figure 5.27: A) Representative immunofluorescence (IF) staining of aged MuSCs differentiated into myoblasts / myotubes for 3 days after knockdown (KD) of Ddit3, where myosin heavy chain 3 (green) and DAPI (blue) are shown. Scale bar = 100 µm. B) Enumeration of fusion index for differentiated myoblasts shows that knockdown of Ddit3 induced more fusion. n = 3 independent experiments, where * corresponds to p ≤ 0.05, calculated by two-sample Student's t-test assuming equal population variance. C) qPCR of Myogenin (MyoG) following siRNA knockdown of Ddit3 and 3 days post differentiation shows upregulation of MyoG. n = 3 replicates, where * corresponds to p ≤ 0.05, calculated by Student's t-test assuming equal population variance.
in their systemic milieu that promote alternative cell fates [67] and adoption of a senescent [438] phenotype. Herein, we present an integrative genomics resource showing how the transcriptional and epigenetic landscape of MuSCs is altered during regeneration and aging. We observe that aged MuSCs display significant changes in their gene expression profile and chromatin landscape prior to injury, and attribute these differences to usage of distinctive metabolic pathways.
Profiling the expression of aged and young MuSCs has shown aging induces a defect in the regenerative abilities of MuSCs [274],[56],[65] and an impaired capacity to self-renew [56],[62]. Consistent with these results, we observed decreases in expression of genes associated with quiescence and negative regulation of cellular proliferation before injury in aged MuSCs. During regeneration, young MuSCs displayed stronger expression of myogenic TFs and genes associated with muscle contraction compared to aged MuSCs. These results suggested aging induces restraint of the strength and timing of the myogenic program through impairments in the ability of myogenic TFs to enact changes in expression. To probe further into this effect and the potential role of changes in chromatin, we first observed variations in expression of chromatin enzymes, with young uninjured MuSCs displaying increases in expression of oxygen-sensitive factors [90] that remove H3K27me3 (Utx) [145], and interact with MyoD1 and MyoG (Setd3) [141]. We also observed young MuSCs had increased expression of Kmt5b / Suv4-20h1, which has been shown to regulate heterochromatin [102]. Aged MuSCs were observed to increase expression of Prdm16, which contributes to cell fate[24] decisions towards brown adipocytes [410], and downregulated Lsd1/Kdm1a as well as E2F4, both of which have been shown to prevent brown adipocyte differentiation in MuSCs [468]. These results suggest aging induces a loss of chromatin enzymes that deposit constitutive heterochromatin [358] and attenuate the ability of myogenic TFs to enact tissue repair.
The regulation of MuSC metabolism [398] has been shown to be disrupted as a result of aging [513],[347]. Leveraging our expression datasets in an unbiased metabolic model, we observed that retinoic acid-induced signaling was enriched in young MuSCs but lost in aging. This pathway, involving metabolism of vitamin A, has been shown to restrain human skeletal muscle progenitors from differentiation [176] and to promote dormancy of hematopoietic stem cells [83]. Culturing young and aged MuSCs in retinoic acid prevented activation and reduced proliferation, suggesting that the loss of vitamin A and associated engagement with retinoic acid receptors with age contributes to loss of the ability to maintain quiescence and resist activation. The metabolic model also predicted enrichments in OxPhos for young MuSCs and 1CC for aged MuSCs. The coupled loss of OxPhos and gain of SAMs from the 1CC observed in aging has previously been shown to reduce NAD-dependent histone deacetylases (HDACs) such as sirtuins [479] and increase chromatin accessibility [133]. These studies are in line with our observations for aged MuSCs, where Sirt1 was downregulated and decreases in histone levels and in H3K9me3, a marker for constitutive heterochromatin, were detected. These results were also mirrored by decreases in components of the nuclear envelope (Lamin B1, Lamin B2, Lamin B receptor, Sun1, Nesprin1, Nesprin2), suggesting an untethering of previously condensed heterochromatin. The redistribution of constitutive heterochromatin (H3K9me3) into facultative heterochromatin (H3K27me3) was consistent with increases in expression of Eed and Ezh2, which form a complex with Polycomb repressive complex 2 (Prc2) to deposit methyl groups onto lysine 27 of histone 3 and display specificity[296] for H3K27me3. Notably, expression of Ezh1, the paralog of Ezh2 that has weaker methyltransferase activity and also complexes with Prc2, was reduced with age.
Whether this switch in comparative abundance impacts levels of mono- and di-methylated H3K27 in favor of tri-methylation could be a fruitful direction for further exploration. Overall, the coupled loss of heterochromatin [360] and nuclear envelope proteins with age [485] implies the hierarchical organization of the MuSC nucleus may be disrupted [177], which has been observed in aged hematopoietic stem cells [91],[173] and also been shown to disrupt gene repositioning schemes and expression during myogenesis [392].
The regulatory landscape of young and aged MuSCs showed aging induces changes in the number of accessible chromatin states before and after regeneration. A detrimental result of these differences in chromatin state was alterations in the expression and manner of binding of TFs (such as MyoD). Footprinting the MyoD motif during the regenerative process (1, 3, 5 dpi), we observed increases in accessibility for young MyoD motifs compared to aged. To further discern the nature of differences in MyoD binding and potential co-binding partners that would impact the epigenetic landscape, we integrated MyoD ChIP-Seq datasets and categorized the ChIP peaks and ATAC-Seq peaks containing MyoD footprints. This analysis revealed that MyoD exhibited enhanced direct binding in young MuSCs and stronger interactions with additional binding partners such as E-proteins. These results were in contrast to aged MuSCs, which displayed increases in indirect or indeterminate MyoD binding. Given MyoD’s limited ability to enact transcriptional activation, and given that MyoD binding correlates with only a small number of DNA elements that change expression, these results highlight how heterodimerization of MyoD with other factors to activate muscle-specific genes may occur from a distance, and these distance-dependent changes appear attenuated with age. Another factor that may contribute to weaker interaction of MyoD with DNA and transcriptional activation in aging is a lack of histone acetylation [372]. Hypoacetylation at myogenic differentiation genes would be compatible with our results given acetyl-CoA and glucose metabolism regulate the histone acetylation landscape, and both were modified in aged MuSCs.
This model is further supported by the fact that the recruitment and complexation of MyoD with histone acetyltransferases (HATs) such as p300/CBP or p/CAF at enhancers [364] is essential for proper execution of myogenic differentiation, and factors such as AP-1 (Fos/Jun) and Runx1 that cooperatively bind with MyoD were differentially expressed in old age. Integrating these results suggests that changes in the cooperative nature of transcription factors [110] with aging occur at enhancers and that these changes drive reductions in the regenerative potential of aged MuSCs.
A deleterious effect of prolonged MuSC stress from aging is a loss of proteostasis, where impairments in protein folding cause stress in the endoplasmic reticulum (ER). The accumulation of unfolded or improperly folded proteins induces autophagic clearance, generation of reactive oxygen species (ROS) and persistent activation of ER stress-mediated cell death signaling, which is regulated by the PKR-like ER kinase (PERK)–eukaryotic initiation factor 2α (eIF2α)–activating transcriptional factor 4 (Atf4) pathway. Autophagy and the PERK-eIF2α-Atf4 pathway are essential for MuSC survival and function [501], but less is known about the effects of the other signaling arms of the unfolded protein response (UPR) on MuSC function and in aging. Ddit3, which is an autophagic sensitive downstream target of Atf6 (one of the three UPR branches), was found to be upregulated in aged MuSCs at 0 and 21 dpi. Ddit3 is a transcriptional repressor of MyoD [403], and we observed that indirectly bound MyoD sites were enriched for Ddit3, suggesting reduction of this factor would exert a positive effect on subsequent myogenic differentiation lost in aging. Consistent with this view, transient Ddit3 knockdown in aged MuSCs enhanced myogenic differentiation via increases in MyoG expression. Given Ddit3 also interacts with HDAC1, which associates with MyoD [293] in a manner that results in transcriptional repression via deacetylation, these experiments further suggest imbalances in the combinatorial network of TFs from aging result in inefficient and/or delayed activation of the myogenic program in MuSCs.
Overall, our studies are a valuable resource for understanding how epigenetic regulation affects aging of a tissue resident stem cell and may be applicable for other types of trauma such as volumetric muscle loss [10], and can also be used to contrast anti-aging models. Future work will investigate the complex and combinatorial control of transcription factor binding in aged MuSCs along with how constitutive heterochromatin domains diminish with age.
5.5.5 Methods
Data and code availability
The datasets generated during this study are available at the Gene Expression Omnibus (GEO) repository under GSE121589. Additionally, the bioinformatics code is available on GitHub (https://github.com/annashcherbina/nobel lab projects/tree/master/age V2).
Experimental Model and Subject Details
Young (3-4 months) and aged (20-24 months) C57BL/6 wild-type female mice were obtained from Charles River Breeding Laboratories or from a breeding colony at the University of Michigan (UM). Male 5 month old Pax7CreER/+;Rosa26nTnG/+ (nTnG) mice were obtained from a breeding colony at the University of Michigan. All mice were fed normal chow ad libitum and housed on a 12:12 hour light-dark cycle under UM veterinary staff supervision. All procedures were approved by the Institutional Animal Care and Use Committee and were in accordance with the U.S. National Institute of Health (NIH).
Myogenic Progenitor Cells (C2C12s)
C2C12s are an immortalized mouse myoblast cell line commercially available from ATCC (CRL-1772). They were maintained in culture at 37 degrees C and 5% CO2 using F10 supplemented with 20% FBS, 1% Pen Strep, and 0.02ug/mL bFGF. Media was replenished every 2-3 days, and cells were passaged using 0.25% Trypsin EDTA once they reached 70-80% confluence.
Animal and Injury Model
C57BL/6 wild-type female mice were obtained from Charles River Breeding Laboratories or from a breeding colony at the University of Michigan (UM). All mice were fed normal chow ad libitum and housed on a 12:12 hour light-dark cycle under UM veterinary staff supervision. All procedures were approved by the Institutional Animal Care and Use Committee and were in accordance with the U.S. National Institute of Health (NIH). Young female mice (3-4 months) and aged female mice (20-24 months) were randomly assigned to one of five groups: uninjured, day 1, day 3, day 5, and day 7 injured (n=4 per group). To induce skeletal muscle injury, mice were first anesthetized with 2% isoflurane and bilaterally administered a 1.2% barium chloride (BaCl2) solution injected intramuscularly into several points of the tibialis anterior and both heads of the gastrocnemius muscles for a total of 80 µL per hindlimb. We used an uninjured set of mice as our controls because of the previous observation (Rodgers et al., mTORC1 controls the adaptive transition of quiescent stem cells from G0 to GAlert. Nature (2014) 510, 393-396) that the contralateral control after muscle injuries adopts an activated state (called GAlert) in muscle stem cells that is unique compared to uninjured muscle stem cells.
For tissue collection, mice were anesthetized with 2% isoflurane, then euthanized with cervical dislocation, bilateral pneumothorax, and removal of their heart. Tissues were quickly excised in either biohazard containment or a surgical room. Hind limb muscles (tibialis anterior and gastrocnemius) of control and experimental mice were dissected using sterile surgical tools and placed in petri dishes containing ice-cold PBS. To achieve adequate MuSC yield for downstream analysis, TA and Gas from both legs were pooled, then minced using surgical scissors and transferred into 50 mL conical tubes containing 20 mL of digest solution (2.5 U/mL Dispase II and 0.2% [5,500 U/mL] Collagenase Type II in DMEM) per mouse. Samples were incubated on a rocker placed in a 37 degrees C incubator for 70 min, with manual pipetting of the solution up and down every 30 minutes using an FBS-coated 10 mL serological pipette to break up tissue. Once the digestion was completed, 20 mL of F10 media containing 20% heat inactivated FBS was added into each sample to inactivate enzyme activity. The solution was then filtered through a 70 µm cell strainer into a new 50 mL conical tube and centrifuged at 350xg for 5 min at 4 degrees C. The supernatant was discarded and the pellets were resuspended in a total of 6 mL of staining media (2% heat inactivated FBS in Hank’s Buffered Salt Solution - HBSS). The single cell suspension was divided into separate FACS tubes and centrifuged at 350xg for 5 min at 4 degrees C. The supernatants from each FACS tube were discarded and the cell pellets were resuspended in 200 µL of staining media and antibody cocktail containing Sca-1:APC (1:400 dilution), CD45:APC (1:400), CD11b:APC (1:400), Ter119:APC (1:400), CD29/beta1integrin:PE (1:200) and CD184/CXCR-4:Biotin (1:100), then incubated on ice for 30 minutes in the dark.
Following incubation in primary antibodies, samples were diluted with 2mL staining solution per sample, centrifuged at 350xg for 5 min at 4 degrees C, and supernatant discarded. Samples were then re-suspended in a staining buffer containing PECy7:Streptavidin (1:100), and incubated on ice for 20 min in the dark. After incubation, samples were again diluted in 2mL staining buffer, centrifuged at 350xg for 5 min at 4 degrees C, supernatant discarded, then re-suspended in 200uL staining solution. Prior to sorting, cells were filtered through 35um cell strainers, and 1ug of propidium iodide (PI) stain was added in each experimental sample. Cell sorting was done using a BD FACSAria III Cell Sorter (BD Biosciences, San Jose, CA) and APC-PI-PE+PECy7+ MuSCs were sorted into either ice-cold staining solution for immediate processing or into Trizol and snap frozen for later use. Purity of the enriched MuSC population was validated using immunofluorescent labelling of Pax7 (I.1B). Coverslides were coated with 22.4ug/mL CellTak in PBS for 20 minutes at room temperature, then cells were seeded and allowed to adhere for 45 minutes at room temperature before being fixed with 4% paraformaldehyde in PBS for 20 minutes at room temperature. Cells were stained and imaged at 10X magnification as described below, using mouse anti-Pax7 (1:10), anti-mouse AF555 (1:300), and Hoechst (1ug/mL). Pooling of MuSCs from both TA and Gas muscles was required to achieve adequate amounts of RNA or DNA for subsequent sequencing-based assays, which prevented our ability to discern the muscle of origin.
Chromatin Accessibility Assay and Sequencing Library Preparation
At least 10,000 MuSCs were centrifuged at 500xg for 5 min at 4 degrees C in a fixed angle centrifuge. After removing the supernatant, the cell pellets were resuspended in 50uL ATAC-Resuspension Buffer (RSB) (500uL 1M Tris-HCl, 100uL 5M NaCl, 150uL 1M MgCl2 in 49.25mL sterile water) with 0.1% NP40, 0.1% Tween-20, and 0.01% Digitonin. Samples were pipetted up and down three times to mix, incubated 3 minutes on ice, then diluted in 1mL of ice-cold ATAC-RSB containing 0.1% Tween-20 and centrifuged at 500xg for 10 min at 4 degrees C to pellet the nuclei. The supernatants were carefully removed and the pellets were resuspended in 25 µL of transposase reaction mix (12.5 µL 2x TD buffer, 1.25 µL Tn5 transposase (Illumina) and 11.25 µL nuclease-free water). Transposase reactions were carried out by incubating samples at 37 degrees C for 30 min under mild agitation (300 rpm on a Thermo-mixer C, Eppendorf). Once the incubation was completed, sample tubes were placed on ice and the transposed DNA fragments from each sample were purified using a Qiagen MinElute PCR Purification Kit following the manufacturer’s protocol. Purified DNA fragments were then amplified for 13 cycles using barcoded PCR primers and NEB Next High Fidelity 2x PCR Master Mix (New England Bio Labs) on a thermal cycler. Double concentrated Ampure beads were used to purify transposed DNA amplicons. The molarity of each DNA library was determined (Agilent 2100 Bioanalyzer), and the libraries were pooled into a single tube and sequenced on a NextSeq (Illumina) using 76-bp paired-end reads.
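As a quick check on the resuspension buffer recipe above, the final concentration of each component follows from C1·V1 = C2·V2. The sketch below only restates the stock concentrations and volumes listed in the text; the function and variable names are illustrative and not part of the original protocol.

```python
# Final concentrations in the ATAC resuspension buffer via C1*V1 = C2*V2.
# Stock concentrations and volumes are copied from the recipe in the text;
# all names here are illustrative.

def final_concentration_mM(stock_molar, stock_volume_mL, total_volume_mL):
    """Return the concentration (mM) after diluting a stock into total_volume_mL."""
    return stock_molar * 1000.0 * stock_volume_mL / total_volume_mL

# component -> (stock concentration in M, volume added in mL)
components = {
    "Tris-HCl": (1.0, 0.500),
    "NaCl": (5.0, 0.100),
    "MgCl2": (1.0, 0.150),
}
water_mL = 49.25
total_mL = water_mL + sum(volume for _, volume in components.values())

final_mM = {
    name: final_concentration_mM(molar, volume, total_mL)
    for name, (molar, volume) in components.items()
}

for name, conc in final_mM.items():
    print(f"{name}: {conc:.1f} mM in {total_mL:.2f} mL")
```

With these inputs the total volume is 50 mL, giving 10 mM Tris-HCl, 10 mM NaCl and 3 mM MgCl2.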
Bulk mRNA Isolation and Sequencing Library Preparation
MuSCs sorted directly into Trizol were thawed at room temperature, and RNA was extracted using a Qiagen miRNeasy Micro Kit as per manufacturer’s instructions. RNA concentration and integrity were measured with a Nanodrop spectrophotometer (Nanodrop 2000c) and Bioanalyzer (Agilent 2100). 1-10 ng of high-quality RNA (RIN≥8) was used to produce cDNA libraries using the SmartSeq v4 protocol (Clontech) as per the manufacturer’s instructions. cDNAs were prepared into sequencing libraries using 150 pg of full-length cDNA amplicons (Nextera XT DNA Library Preparation Kit, Illumina) with dual index barcodes as per manufacturer’s instructions. Barcoded cDNA libraries were pooled into a single tube and sequenced on a NextSeq (Illumina) using 76-bp single-ended reads.
Histology and Immunohistostaining
A separate cohort of mice was used for histological evaluation of skeletal muscle injury and regeneration. Young (3-4 months) and old (20-24 months) female mice (n=4 per group) received bilateral intramuscular injection of the tibialis anterior muscles with a 1.2% BaCl2 solution (40 µL per muscle) while under isoflurane anesthesia. At day 7 post BaCl2 injury, mice were euthanized via bilateral pneumothorax and cervical dislocation, and tibialis anterior muscles were rapidly dissected. Tibialis anterior muscles were embedded in optimum cutting temperature (OCT) compound, rapidly frozen in isopentane, cooled on liquid nitrogen and stored at -80 degrees C until further analysis. Muscle tissue cross sections (10 µm) were cut in a cryostat at -20 degrees C, adhered to Superfrost Plus microscope slides (Fisher Scientific) and air dried at room temperature. Slides were fixed in 100% acetone for 10 minutes at -20 degrees C, air dried at room temperature and stained with hematoxylin and eosin (H and E) using routine procedures. Sections for immunofluorescence staining were rehydrated in PBS for 5 minutes and blocked overnight at 4 degrees C in M.O.M. blocking reagent. Sections were then incubated overnight at 4 degrees C with a cocktail of primary antibodies against embryonic myosin (1:20) and laminin (1:200). The following morning, slides were washed three times for 5 minutes in PBS and incubated for 1 hour at room temperature with Goat Anti-Mouse IgG1 Alexa Fluor 488 (1:500) and Goat Anti-Rabbit Alexa Fluor 555 (1:500) secondary antibodies. Nuclei were counterstained with 4’,6-Diamidino-2-Phenylindole Dihydrochloride (DAPI) (1 µg/mL). Slides were washed three times for 5 minutes each in PBS and then mounted with coverslips using Dako Fluorescence mounting media. Fluorescent and bright field images were acquired using a Nikon A1 confocal and Olympus BX51 wide field microscope.
Regenerating myofibers (central nuclei and/or eMHC+ cytoplasm) were quantified in a single 20x field of view from the core of the injured region using Image J software. Differences between young and aged mice for regenerating myofibers and regenerating myofiber cross-sectional areas were determined by Student’s t-tests where statistical significance was set at P ≤ 0.05.
Chromatin Immunofluorescence
Black, 96-well flat clear-bottom plates were coated with 22.4ug/mL CellTak in PBS for 20 minutes at room temperature, followed by three quick rinses with distilled water. FACS-enriched MuSCs were then seeded at a density of 10,000 cells/well and allowed to adhere for 45 minutes at room temperature. Next, cells were washed twice with PBS, fixed with ice-cold 100% methanol for 20 minutes at -20 degrees C, and rinsed three times with PBS. Next, cells were blocked in Odyssey blocking buffer (OBB; LI-COR Biosciences) for 1 hour at room temperature and incubated overnight at 4 degrees C with the following primary antibodies: Histone H3; H3K27me3; and H3K9me3. The following day, cells were washed three times with PBS supplemented with 0.1% Tween-20 (PBS-T) and incubated at room temperature with the secondary rabbit antibody Alexa Fluor 647 for 1 hour. Cells were then washed once with PBS-T and once in PBS. Hoechst 33342 was used to stain nuclei, after which cells were washed twice with PBS and imaged with a 20x objective using the ImageExpress Micro Confocal system. A total of 49 sites were imaged per well. Background subtraction was performed using ImageJ, whereas image segmentation was performed with CellProfiler. Single-cell and population-average image analysis and quantification were performed using MATLAB 2017b software. Immunofluorescence images were assembled using ImageJ and Adobe Illustrator.
Treatment with all-trans Retinoic Acid
Aged MuSCs were isolated from the hindlimb muscles of two C57BL/6 wild-type mice (26 months, 1 male and 1 female) via FACS as described above. Young MuSCs were isolated from two 5-month-old male Pax7CreER/+;Rosa26nTnG/+ (nTnG) mice following 5 daily intraperitoneal injections of 20 mg/mL tamoxifen in corn oil (75 mg/kg body weight) and 5 days of recovery from injections. The same dissection, digestion, and filtration techniques were used as above. Cells were incubated with 200 µL of 1 µg/mL DAPI in staining solution at room temperature for 10 minutes (protected from light) to stain for viability. Cell suspensions were passed through a 35 µm cell strainer prior to sorting on a Sony MA900 cell sorter, and DAPI-/tdTomato-/GFP+ MuSCs were collected into cold staining solution for immediate processing.
Cell Culture
The wells of a 96-well plate were coated with sterile 0.5% gelatin (in ddH2O) solution for an hour at room temperature and then aspirated and allowed to dry. 100mM aliquots of all-trans retinoic acid (ATRA) were prepared by dissolution in cell culture-grade dimethyl sulfoxide (DMSO). ATRA was dissolved in myoblast media (F10 with 20% FBS, 1% Pen Strep, and 0.02ug/mL bFGF) for a final ATRA concentration of 100 nM and 0.1% DMSO (v/v). Control media was prepared by diluting DMSO in myoblast media (0.1% v/v). Aged and young MuSCs were re-suspended in myoblast media containing either ATRA (+RA) or DMSO (-RA). Cells were seeded in gelatin-coated wells at a density of approximately 6,250 cells/cm2. Media containing either ATRA or DMSO was replenished every 24 hours for 3 days.
Immunofluorescence (IF) staining
Cells in wells were fixed with 100% methanol for 5 minutes at room temperature. After washing with PBS, cells were permeabilized and blocked with 0.3% Triton X-100 and 1% BSA in PBS, incubated overnight at 4 degrees C with a combination of a 1:100 dilution of AF488-conjugated anti-Pax7 antibody and a 1:50 dilution of AF647-conjugated anti-MyoD antibody in 0.2% BSA in PBS. Cells were incubated overnight at 4 degrees C with a 1:50 dilution of PE-conjugated anti-Ki67 antibody in 0.2% BSA in PBS. Each staining combination was performed in duplicate for each treatment condition and age. Following overnight incubation with antibodies, cells were washed 3 times with PBS. Nuclei were counterstained with DAPI (1 µg/mL) for 10 minutes at room temperature. Cells were washed in PBS a final 3 times and left covered in 100 µL of PBS during imaging. 20x magnification images were acquired on a Zeiss Axio Vert.A1 inverted microscope with a Colibri 7 LED light source and an AxioCam MRm camera. Images were subsequently analyzed in Fiji. ROIs were generated by thresholding on the DAPI image to identify nuclei. The average fluorescent intensities of each stain within these ROIs were recorded. The average Pax7, MyoD, and Ki67 signals were compared between +RA and -RA cells for each age using a two-sample t-test with the significance level set to α = 0.05. Statistical tests were performed in R and plots were generated using the ggplot2 package in R.
In vitro DDIT3 Knockdown.
C2C12s were seeded at a density of 50,000 cells per well in a 12-well plate in myoblast media (F10 with 20% FBS, 1% Pen Strep, and 0.02ug/mL bFGF). After 24 hours, the media was replaced with myoblast media without antibiotics containing 15nM DDIT3 13.1 DsiRNA and 15nM DDIT3 13.9 DsiRNA encapsulated in RNAiMAX lipid droplets. After 72 hours, transfection media was replaced with differentiation media (DMEM with 5% horse serum and 1% Pen Strep). Cells were incubated for 72 hours in differentiation media prior to being fixed in 4% paraformaldehyde for immunostaining or lysed in trizol for qPCR.
Mice were euthanized and their hindlimb muscles extracted and digested into a single cell suspension. Satellite cells were enriched using the Miltenyi mouse satellite cell isolation kit following the manufacturer’s protocol. Tissue culture dishes were coated with 10% Matrigel in DMEM as previously published (Motohashi, Asakura and Asakura, 2014). Enriched satellite cells were seeded onto the Matrigel-coated plates in myoblast media (F10 with 20% FBS, 1% Pen Strep and 0.02ug/mL bFGF). Media was replaced every 48 hours until cells expanded to 50% confluency, at which point they were passaged as previously published and seeded onto a Matrigel-coated 12-well plate. 24 hours after seeding, 0.15nM DDIT3 13.1 DsiRNA and 0.15nM DDIT3 13.9 DsiRNA were encapsulated in RNAiMAX lipid droplets according to manufacturer’s protocols and delivered to cells. Knockdown efficiency after 72 hours in transfection media was validated using quantitative real-time PCR (I.4). Post DDIT3 knockdown, the transfection media was replaced with differentiation media (DMEM with 5% horse serum and 1% Pen Strep). Cells were incubated in differentiation media for 72 hours prior to lysing or fixing in 4% paraformaldehyde in PBS.
Immediately prior to fixation, cells were incubated for 45 minutes in warmed DMEM containing 250nM Mitotracker Deep Red, then washed 3 times with warmed PBS and fixed for 15 minutes in warmed 4% paraformaldehyde in PBS at 37 degrees C. After three quick washes with PBS, cells were permeabilized with 0.1% TritonX-100 and blocked with 1% BSA, 0.1% Tween-20 and 22.52mg/mL glycine in PBS. After blocking, cells were incubated with primary antibodies (1:10 dilution of anti-MYH3) overnight at 4 degrees C followed by secondary antibodies (1:500 dilution of AF555 anti-rabbit) overnight at 4 degrees C. Nuclei were stained with Hoechst 33342. Immunolabeled cells were imaged on a Zeiss epifluorescent microscope using a 10X objective. The fusion index was automatically calculated using MATLAB as the number of nuclei within myofibers containing more than 2 nuclei divided by the total number of nuclei per image. The knockdown and differentiation were performed in at least biological triplicates and technical duplicates. Statistical comparison between groups was performed using a two-sided, two-sample Student’s t-test assuming equal variances (n between 25 and 41 for primary cells harvested from aged mice and between 13 and 15 for primary cells harvested from young mice). P-values below 0.05 were considered significant.
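The fusion index definition above can be illustrated with a short sketch. This is not the MATLAB routine used in the study; it assumes a hypothetical upstream segmentation step that has already produced a nucleus count for every cell or myofiber in an image.

```python
# Sketch of the fusion index: nuclei inside myofibers with more than 2 nuclei,
# divided by all nuclei in the image. The per-object nucleus counts are a
# hypothetical input produced by some upstream segmentation step.

def fusion_index(nuclei_per_object, min_nuclei=3):
    """Fraction of nuclei that sit in objects with at least min_nuclei nuclei."""
    total = sum(nuclei_per_object)
    if total == 0:
        return 0.0  # no nuclei detected in the image
    fused = sum(n for n in nuclei_per_object if n >= min_nuclei)
    return fused / total

# Toy image: four mononucleated or binucleated cells and three myofibers.
counts = [1, 1, 2, 5, 8, 1, 3]
print(round(fusion_index(counts), 3))
```

For the toy counts, 16 of the 21 nuclei fall in objects with more than 2 nuclei, so the index is 16/21.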
Real-Time PCR
RNA was extracted from lysed cells using the Qiagen miRNeasy micro kit according to manufacturer’s protocol. RNA quality and concentration were determined using a Nanodrop. Within one week, cDNA was synthesized using SuperScript III cDNA Synthesis Kit according to manufacturer’s protocol, and quality was determined using a Nanodrop. 80-100ug cDNA template was plated in triplicate along with SYBR Green PCR MasterMix and 500nM PCR primer, then cycled 40 times starting at 95 degrees C for 10 seconds followed by 60 degrees C for 30 seconds on a CFX96 RealTime thermocycler. Gene expression was quantified using the ΔΔCt method. Statistical significance was determined using a two-sided, two-sample t-test for MyoG expression assuming equal variances, and p-values below 0.05 were considered significant. Each condition was performed in biological and technical triplicates, and technical triplicates were averaged prior to statistical analysis (n = 3 for each condition).
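For readers unfamiliar with the ΔΔCt calculation, a minimal sketch follows. It assumes perfect (100%) amplification efficiency and a reference gene, neither of which is specified in the text; the Ct values are invented solely to show the arithmetic.

```python
# Sketch of relative quantification by the delta-delta-Ct method.
# Assumptions (not stated in the text): 100% PCR efficiency and a reference
# (housekeeping) gene measured in the same samples. Ct values are invented.
from statistics import mean

def ddct_fold_change(target_treated, ref_treated, target_control, ref_control):
    """Fold change of the target gene in treated vs control samples."""
    # Technical replicates are averaged first, as described in the text.
    d_treated = mean(target_treated) - mean(ref_treated)
    d_control = mean(target_control) - mean(ref_control)
    ddct = d_treated - d_control
    return 2.0 ** (-ddct)

fold = ddct_fold_change(
    target_treated=[24.0, 24.2, 23.8],
    ref_treated=[18.0, 18.1, 17.9],
    target_control=[26.0, 26.1, 25.9],
    ref_control=[18.0, 18.0, 18.0],
)
print(f"fold change: {fold:.2f}")
```

Here the target gene crosses threshold two cycles earlier in the treated condition after normalization, which corresponds to a roughly four-fold increase.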
RNA-Seq Data Processing and Analysis.
Double-stranded RNA-seq data was aligned to the mm10 reference genome with the STAR algorithm [127],[263], using the following parameters:
STAR --genomeLoad NoSharedMemory --outFilterMultimapNmax 20 --alignSJoverhangMin 8
--alignSJDBoverhangMin 1 --outFilterMismatchNmax 999
--outFilterMismatchNoverReadLmax 0.04
--alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000
--outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD
--outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM --sjdbScore 1
--limitBAMsortRAM 60000000000 --twopassMode Basic --twopass1readsN -1
Differentially expressed genes between Young and Old samples at each timepoint (days=0,1,3,5,7) were identified by limma [391] analysis in R. The Expected Counts from RSEM were transformed to counts per million using the voom [246] R package with a design formula: Count ~ Day + Age, with Day=0,1,3,5,7, and Age=Young,Old. Surrogate variable analysis was performed with the SVA package [255] using a null model of Expression ~ 1, and a design matrix of Expression ~ Day + Age. Contributions from the surrogate variables were quantified and removed from the expression data matrix. An lmFit analysis was performed to check for statistically significant (p≤0.05) associations between surrogate variables and the protected variables of Day and Age. The removeBatchEffect function in limma was used to remove contributions of surrogate variables from the data, while protecting the variables of Age and Day. Pairwise Pearson and Spearman correlation values were computed between all sva-corrected replicates. Any replicate that had r ≤ 0.9 with other replicates for a given sample was excluded from further analysis.
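The replicate filtering step can be sketched as follows. The text does not say whether a replicate was dropped for correlating poorly with any or with all of its sibling replicates; this sketch assumes the latter (a replicate is kept if it exceeds the threshold with at least one other replicate) and shows only the Pearson coefficient. All names and values are illustrative.

```python
# Sketch of a replicate concordance filter using Pearson correlation.
# Assumption: a replicate is excluded when its correlation with every other
# replicate of the same sample is <= threshold. Data below are invented.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def replicates_to_keep(replicates, threshold=0.9):
    """Return names of replicates that correlate above threshold with a sibling."""
    kept = []
    for name, values in replicates.items():
        others = [pearson(values, v) for n, v in replicates.items() if n != name]
        if others and max(others) > threshold:
            kept.append(name)
    return kept

sample = {
    "rep1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "rep2": [1.1, 2.0, 2.9, 4.2, 5.1],
    "rep3": [5.0, 1.0, 4.0, 2.0, 3.0],  # discordant replicate
}
print(replicates_to_keep(sample))
```

A Spearman version would apply the same function to the ranks of each vector.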
The resulting data was analyzed with limma, using a design matrix of Expression ~ Sample. All pairwise contrasts were examined to identify differentially expressed genes between Old vs Young at time t, Old at time t vs Old at time (t-1), and Young at time t vs Young at time (t-1). Thresholds of p.adjusted ≤ 0.05 and log-fold-change ≥ 1 were used to call differential genes.
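The thresholding step can be sketched as a simple filter over a results table. The sketch assumes the log-fold-change cutoff is applied to the absolute value, so that both up- and down-regulated genes are retained; the gene names and numbers are placeholders, not results from this study.

```python
# Sketch of calling differential genes from a table of contrast results.
# Assumption: the log-fold-change threshold applies to the absolute value.
# Gene names and values are placeholders, not data from the study.

def call_differential(results, padj_max=0.05, lfc_min=1.0):
    """Return sorted gene names passing both thresholds.

    results maps gene name -> (log2 fold change, adjusted p-value).
    """
    return sorted(
        gene
        for gene, (lfc, padj) in results.items()
        if padj <= padj_max and abs(lfc) >= lfc_min
    )

toy_results = {
    "geneA": (1.5, 0.01),    # passes both thresholds
    "geneB": (-2.0, 0.04),   # passes (down-regulated)
    "geneC": (0.5, 0.001),   # effect size too small
    "geneD": (3.0, 0.20),    # not significant
}
print(call_differential(toy_results))
```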
The differential genes underwent time-series clustering with the DPGP (Dirichlet process Gaussian process mixture models) algorithm [306]. To generate inputs for the algorithm, replicates were averaged, and the fold change in TPM of Old vs Young was computed for each set of averaged replicates.
Pathway and GO Term Enrichment Analysis
The clusters generated by the DPGP algorithm were analyzed with DAVID [195].
Transcription Factor Expression Analysis
A similar approach was used to identify differentially expressed transcription factors. A list of known GRCm38 transcription factors was downloaded from Animal Transcription Factor Database [513].
Log2-fold change in FPKM was calculated for MDX injured vs control, as well as wild type injured vs control. An abs(log2-fold change) cutoff of 1 was used to determine significance.
Generation of a merged peak set across ATAC-seq samples
The ATAC-Seq samples were analyzed with the ENCODE ATAC-seq processing pipeline (https://github.com/ENCODE-DCC/atac-seq-pipeline, version 0.3.4) [292]. The cutadapt algorithm [298] was used to trim adapters, and the Bowtie2 [243] aligner was used to align the reads to the mm10 index. Duplicates were then removed from the aligned reads, and the MACS2 [517] peak caller was used to call peaks from the aligned ATAC-seq samples. The naive overlap peak set from all replicates for a given sample was used. Such peak sets from all samples were concatenated and merged using the bedtools merge command to produce a master peak set across all samples. The read counts for each sample at each peak were obtained by running bedtools coverage on the shifted de-duplicated tagAlign files generated by the pipeline.
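The concatenate-and-merge step that produces the master peak set can be illustrated in a few lines. This is a simplified stand-in for bedtools merge with default settings (overlapping and book-ended intervals are combined), not the command actually run; the coordinates are invented.

```python
# Sketch of building a merged peak set from concatenated per-sample peaks.
# Mimics the default behaviour of an interval merge: overlapping or
# book-ended intervals on the same chromosome are combined. Toy coordinates.
from collections import defaultdict

def merge_peaks(peaks):
    """Merge (chrom, start, end) intervals; returns a sorted list of tuples."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:          # overlaps or touches current interval
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged

all_peaks = [
    ("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80),
    ("chr1", 400, 500), ("chr1", 300, 350),
]
print(merge_peaks(all_peaks))
```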
Differential Chromatin Accessibility
The resulting counts matrix (238,563 peaks x all sample replicates) was analyzed with limma voom to produce differential peak sets between aged and young at each timepoint. The analysis protocol matched that for RNA-seq data (see the Differential Gene Expression section above), with equivalent thresholds for differential peaks.
Time-series Clustering of Differentially Accessible Chromatin Regions.
All differential peaks underwent DPGP clustering, as described above for RNA-seq data. The log2 fold change in counts per million in young samples versus aged samples was calculated for each differential peak. These fold change values for days 0, 1, 3, 5, 7 were used as inputs for DPGP clustering. 10 significant clusters were identified.
Differential peaks between aged and young for each timepoint underwent analysis with the GREAT algorithm [310], using default parameters. A parallel approach to identify enriched GO terms and pathways was also utilized. In this approach, results from the GREAT analysis were used to map peaks with associated genes (incorporating both distal and proximal associations returned by GREAT). The gene sets were then analyzed with DAVID to identify enriched GO terms and KEGG pathways (Benjamini-Hochberg corrected P-value ≤ 0.01).
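The Benjamini-Hochberg correction referenced above can be written compactly. The sketch below is a generic implementation of the step-up procedure, not the code used inside DAVID, and the p-values are invented for illustration.

```python
# Generic sketch of the Benjamini-Hochberg step-up adjustment.
# Not the DAVID implementation; the p-values below are invented.

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(raw)
enriched = [p for p in adj if p <= 0.01]
print(adj, len(enriched))
```

With the 0.01 cutoff used in the text, only the first of these toy terms would be retained.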
In addition to GREAT analysis of differential peaks at each timepoint, the peaks for each of the 10 clusters from DPGP clustering (described above) underwent analysis with GREAT and GO. The enrichment analysis on peak clusters was performed to allow for analysis of more complex peak trajectories compared to enrichment at a single timepoint.
The p-value bigWig signal tracks generated by the Kundaje ATAC-seq processing pipeline were visualized with the WashU browser [521]: (http://epigenomegateway.wustl.edu/browser/?genome=mm10&datahub=http://mitra.stanford.edu/kundaje/annashch/age/datahub.pval.json&tknamewidth=150).
Differential Chromatin State Distributions
The core mark 15-state chromatin state model [144] built on skeletal muscle tissue (ENCODE cell id E107) was used to identify the chromatin state distributions for differential peaks (Aged vs Young) at each timepoint. For each timepoint, the fraction of peaks in each of the 15 chromatin states was determined.
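The per-timepoint state distribution reduces to a normalized count. The sketch below assumes each differential peak has already been assigned a single chromatin state label (for example by overlap with the segmentation); the labels follow the naming style of the core 15-state model, but the assignments are made up.

```python
# Sketch of the chromatin state distribution for one timepoint.
# Assumption: every differential peak already carries one state label.
# The label assignments below are invented for illustration.
from collections import Counter

def state_fractions(peak_states):
    """Return {state: fraction of peaks assigned to that state}."""
    counts = Counter(peak_states)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {state: n / total for state, n in counts.items()}

day0_states = ["7_Enh", "7_Enh", "1_TssA", "15_Quies"]
print(state_fractions(day0_states))
```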
The HOMER algorithm [189] was used to identify enriched motifs in Aged vs Young (and vice versa) at each timepoint. For each timepoint, the foreground to HOMER consisted of differentially accessible peaks in aged vs young (and Young vs Aged in a parallel run), while the background consisted of all differential peaks merged across timepoints. The purpose of this analysis was to identify motifs that were enriched only at a particular timepoint.
In a separate run, the foreground was kept the same, but the background was the default reference mm10. The purpose of this analysis was to identify motifs that were enriched in the Aged or Young samples relative to the genome background.
Differential TF motifs from HOMER (see above) were cross-referenced with bulk RNA-seq TPM values at each timepoint in young and aged samples. For days 0, 1, 3, 5, 7, the log2 fold change of young TPM vs aged TPM was compared to the -10log10(corrected P-value) from HOMER to generate a scatter plot with R ggplot2 [494]. The TFs were color-coded on the scatterplot based on TF family membership. The mapping of each TF to its corresponding TF family was accomplished in accordance with the description below (Heatmap of TF family enrichment in Young vs Aged).
Footprint analysis was performed for the Myod transcription factor motifs within the ATAC-seq datasets. The HOMER scanMotifGenomeWide.pl command was executed on the mm10 genome with the HOMER PWM for the HOMER Myod motif:

AGCAGCTGCTGC MyoD(bHLH)/Myotube-MyoD-ChIP-Seq(GSE21614)/Homer 7.249827 -2.493974e+04 0 49999.2,13125.0,15210.6,14377.0,0.00e+00
0.432 0.137 0.291 0.140
0.400 0.102 0.455 0.042
0.001 0.997 0.001 0.001
0.966 0.001 0.001 0.032
0.001 0.014 0.984 0.001
0.001 0.996 0.002 0.001
0.093 0.001 0.001 0.905
0.001 0.001 0.997 0.001
0.003 0.518 0.079 0.400
0.123 0.268 0.093 0.516
0.187 0.234 0.377 0.202
0.165 0.373 0.197 0.265
The motif instances found were intersected with ATAC-seq optimal overlap peaks to identify peaks with TF footprints. Motif positions overlapping the peaks were extended to 200 bp. The mm10 BAM alignments from the ENCODE ATAC-seq pipeline were corrected for ATAC enzymatic bias using the TOBIAS toolkit [54] (ATACorrect). The resulting corrected bigwigs were used with the TOBIAS command PlotAggregate to generate log-transformed footprints for the Myod regions.
Comparison with external ChIP-seq datasets.
Peak call files from TF ChIP-seq and histone ChIP-seq datasets were downloaded from studies examining satellite cells and myoblasts (Table 5.1).
Bedtools intersect was used to determine the number of overlapping peaks between the optimal overlap ATAC-seq peak sets and each of the samples in Table 5.1. A Wilcoxon rank enrichment test was performed to determine whether the fraction of overlapping peaks with each dataset in Table 5.1 differed significantly between young and aged samples at each of the 5 timepoints (d0-d7).
Derivation of reaction fluxes using gene expression data and genome-scale metabolic modeling
Gene expression data were used as input to derive reaction flux information from the human genome-scale metabolic model (RECON1) [132, 422] using a modeling approach detailed in Shen et al. [418]. This approach maximizes the flux through the metabolic reactions that are up-regulated in a condition while minimizing flux through those reactions that are associated with down-regulated genes. All of the data processing steps described henceforth were carried out using MATLAB R2018b (https://www.mathworks.com/products/matlab.html). First, expression data were normalized across each gene. Next, normalized expression data from the old and young groups were compared to obtain a list of significantly differentially expressed genes using a threshold of p-value ≤ 0.05. Based on this list, up- and down-regulated genes were determined for each sample based on z-score thresholds of 1.5 and -1.5, respectively. The lists of up- and down-regulated genes for each sample were then overlaid onto the RECON1 model based on gene-protein-reaction annotations in the model. Finally, reaction flux data were generated using a linear optimization version of the iMAT algorithm with the following inputs: the RECON1 model, the list of up- and down-regulated genes, and the recommended values for the optional parameters (rho = 1E-3, kappa = 1E-3, epsilon = 1, mode = 0).
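The per-sample selection of up- and down-regulated genes can be illustrated with a short sketch. The original processing was done in MATLAB; the Python below is only an approximation under stated assumptions: per-gene normalization is taken to be a z-score across samples using the population standard deviation, the thresholds are treated as inclusive, and the function names and the `genes_of_interest` argument (standing in for the list of significant genes) are hypothetical.

```python
from statistics import mean, pstdev

def zscore_rows(expression):
    """expression: dict mapping gene -> list of values across samples.

    Returns per-gene z-scores across samples. Genes with zero variance
    receive z = 0 for every sample (an assumption of this sketch).
    """
    out = {}
    for gene, values in expression.items():
        mu = mean(values)
        sd = pstdev(values)
        out[gene] = [0.0 if sd == 0 else (v - mu) / sd for v in values]
    return out

def call_regulated(zscores, sample_index, genes_of_interest,
                   up=1.5, down=-1.5):
    """Return (up_genes, down_genes) for one sample.

    Only genes in genes_of_interest (e.g. those passing the p <= 0.05
    filter) are considered.
    """
    up_genes = sorted(g for g in genes_of_interest
                      if zscores[g][sample_index] >= up)
    down_genes = sorted(g for g in genes_of_interest
                        if zscores[g][sample_index] <= down)
    return up_genes, down_genes

# Toy example with three genes measured in six samples
expression = {
    "A": [1, 1, 1, 1, 1, 10],
    "B": [10, 10, 10, 10, 10, 1],
    "C": [5, 5, 5, 5, 5, 5],
}
z = zscore_rows(expression)
up_genes, down_genes = call_regulated(z, 5, ["A", "B", "C"])
```

The resulting gene lists would then be mapped onto model reactions through the gene-protein-reaction annotations; that step and the iMAT optimization are not sketched here.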
| | Satellite Cells | Myoblasts |
|---|---|---|
| H3K4me3 | Liu et al. (Liu et al., 2013): d0 Aged (GSM1148118), d0 Young (GSM1148110) | ENCODE ENCFF360QRN |
| H3K27me3 | Liu et al. (Liu et al., 2013): d0 Aged (GSM1148119), d0 Young (GSM1148111) | ENCODE ENCFF569LDY |
| MyoD1 | | ENCODE ENCFF423NWT |
| MyoG | | ENCODE ENCFF980DKG |
| CTCF | | ENCODE ENCFF297NKN |

Table 5.1: Histone ChIP-seq datasets in satellite and myoblast cells overlapped with ATAC-seq samples.
Visualization (R)
The histogram and jitter plots (Figure 5.23) were generated in R (https://www.r-project.org/). For both plots, a z-score was calculated for each reaction by comparing flux values between aged and young groups using a paired t-test. Extreme negative and positive z-score values were clipped to -5 and 5, respectively, and invalid values were set to zero with a p-value of 1. Significant changes in reaction flux were defined by a p-value ≤ 0.05, and these reactions were presented along with the plots (excluding transport reactions).
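The per-reaction score and its clipping can be sketched as below. This is an illustrative Python version, not the R code used for the figures. It computes the paired t statistic (which the text refers to as a z-score), clips it to the interval [-5, 5], and returns zero when the statistic is undefined. The function name is hypothetical, and the p-value is deliberately not computed because that requires the t distribution, which is outside the standard library.

```python
import math
from statistics import mean, stdev

def paired_statistic(aged, young, limit=5.0):
    """Paired t statistic for one reaction, clipped to [-limit, limit].

    aged and young are equal-length lists of flux values for matched
    samples. Undefined cases (fewer than two pairs, or zero variance of
    the paired differences) return 0.0, mirroring the rule that invalid
    values are set to zero.
    """
    diffs = [a - y for a, y in zip(aged, young)]
    if len(diffs) < 2:
        return 0.0
    sd = stdev(diffs)
    if sd == 0:
        return 0.0
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    if math.isnan(t):
        return 0.0
    return max(-limit, min(limit, t))

# Moderate difference: statistic is kept as computed
moderate = paired_statistic([2, 3, 4], [1, 1, 1])
# Very consistent large difference: statistic is clipped to 5
clipped = paired_statistic([11.0, 11.5, 10.5], [1.0, 1.0, 1.0])
# No variation in the differences: treated as invalid and set to 0
invalid = paired_statistic([2, 2, 2], [1, 1, 1])
```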
Statistical details, including sample size (n), what n represents, and the statistical test used, can be found in the figure legends. In most cases, sample size was large enough to assume normality based on the central limit theorem, and parametric statistical tests were used. Two-sided tests were employed for more conservative calculations of significance. The threshold for statistical significance was set at p ≤ 0.05. Unless otherwise stated, data in bar graphs are expressed as mean ± standard deviation. A combination of MATLAB R2019b, R (v3.6), and GraphPad Prism was used for statistical analysis.
5.5.6 Author Contributions
This manuscript is authored by Anna Shcherbina, Jacqueline Larouche, Benjamin A Yang, Lemuel Brown, James Markworth, Carolina Chung, Mehwish Khaliq, Kanishka de Silva, Jeongmoon Choi, Mohammad Sichani, Sriram Chandrasekaran, Young Jang, Susan Brooks, Carlos Aguilar.
J.L., P.F., B.A.Y., L.A.B., J.F.M., M.K., K.d.S., J.J.C., M.F.-S., Y.C.J., and C.A.A. performed experiments. A.S., J.L., C.H.C., S.C., and C.A.A. analyzed data. S.V.B. and C.A.A. designed the experiments. A.S., J.L., and C.A.A. wrote the manuscript with additions from other authors.

Appendix A MyHeart Counts
A.1 Supplementary Methods
A.1.1 Data Access
Data are stored on the phone and uploaded directly to a secure server (Sage Bionetworks, Seattle, WA) where they are de-identified. No data are sent to Apple Inc. Security measures exceed those specified by the Health Insurance Portability and Accountability Act (HIPAA). Incremental updates are downloaded to Stanford University servers using the Synapse R API. Personal and cohort average 6-minute walk scores were returned to participants within the application.
The application sends a combination of structured json and tabular HealthKit files to an intermediate bridge server controlled by Sage Bionetworks. Synchronization happens over the internet at scheduled intervals throughout the day or when the local cache size reaches a minimum threshold. The purpose of the bridge server is two-fold. First, Apple account emails are mapped to internal anonymized identifiers designated as HealthCodes. Second, the structured json files are formatted as tabular data using a priori defined and versioned schema for display through Synapse. During the mapping process, metadata including timestamps and application versions are extracted and associated with the results. Each record in the Synapse table corresponds to a discrete synchronization event. For HealthKit-related data such as Activity State, Geographical Displacement, and Heart Rate, there are links to data blobs that contain high-resolution time-series data collected over a specific time interval. At the time of writing, there are 22 tables being updated, of which 7 include links to external HealthKit blob data. All data are programmatically queried and downloaded through the R API v1.11.1[413].
A.1.2 Motion Tracking Calculation
We used the motion data from participants’ iPhones to calculate the number of seconds each individual spent in each of these states: walking, running, cycling, automotive, and stationary. Individuals with short durations (less than 2 consecutive days of data) or few entries of motion tracking data (less than 2000 data points) were excluded. To maximize the inclusion of participants who did not contribute a full 7 days of motion tracking, feature extraction was performed on data from two weekday and two weekend day samples. We calculated the total proportion of time an individual spent active (running, walking, cycling) and the proportion of time an individual spent inactive (stationary and automotive). The physical activity data was sampled and filtered to select two consecutive weekdays as well as two consecutive weekend days of activity for each subject. “Unknown” states were resolved by forward-carrying the immediately preceding known activity state. Data was analyzed at a two-minute granularity. For each consecutive two-minute window of data, the mode of the reported activity states was computed. The state for the window was assigned in accordance with the majority vote. Any gap in data greater than 15 minutes was assigned to the stationary state.
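The windowing steps described above can be sketched as follows. This is a simplified illustration and not the pipeline used in the study. It assumes samples arrive as (timestamp in seconds, state) pairs sorted by time, that an "unknown" sample with no preceding known state is dropped, and that ties in the majority vote go to the state seen first in the window. Empty windows inside gaps of 15 minutes or less are left unassigned because the text does not specify their handling. All names are hypothetical.

```python
from collections import Counter

WINDOW_S = 120        # two-minute granularity
MAX_GAP_S = 15 * 60   # gaps longer than this are labelled stationary

def forward_fill_unknown(samples):
    """Replace 'unknown' states with the immediately preceding known state."""
    filled, last = [], None
    for t, state in samples:
        if state == "unknown":
            if last is None:
                continue  # nothing to carry forward yet (assumption: drop)
            state = last
        last = state
        filled.append((t, state))
    return filled

def window_states(samples):
    """Return {window_index: state} using a majority vote per two-minute window."""
    filled = forward_fill_unknown(samples)
    if not filled:
        return {}
    votes = {}
    for t, state in filled:
        votes.setdefault(int(t // WINDOW_S), Counter())[state] += 1
    result = {w: c.most_common(1)[0][0] for w, c in votes.items()}
    # Windows that fall inside a gap longer than 15 minutes are stationary.
    times = [t for t, _ in filled]
    for prev, nxt in zip(times, times[1:]):
        if nxt - prev > MAX_GAP_S:
            for w in range(int(prev // WINDOW_S) + 1, int(nxt // WINDOW_S)):
                result[w] = "stationary"
    return dict(sorted(result.items()))

samples = [(0, "walking"), (30, "unknown"), (60, "running"),
           (150, "automotive"), (1500, "walking")]
states = window_states(samples)
```

Per-state fractions for the weekday and weekend samples can then be obtained by counting window labels and dividing by the number of windows.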
K-means clustering was then performed using K=10, as determined by minimizing the Bayesian information criterion (Figure A.8). Ten features were selected for clustering: fraction of time spent in the stationary, automotive, walking, running, and cycling states during the weekend as well as the fractions of time spent in each of these five states during the weekday. Four meta-clusters were generated by hierarchical clustering on the centroid coordinates of the 10 K-means clusters. The closest pair of centroids in N-dimensional space was successively merged to reduce the 10 clusters to four. The four meta-clusters consisted of inactive individuals, individuals who spent a significant portion of the day driving, active walkers/cyclists, and individuals who were inactive during the weekdays but active on weekends. In subsequent iterations of the app, the goal is to use the meta-clusters to track subject behavior over time and provide feedback indicating when a subject’s activity levels lead them to shift from a cluster with significant correlations for poor health outcomes to a cluster with more favorable health metrics.
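The successive merging of the closest centroids can be sketched as a small agglomerative procedure. This is an illustration under assumptions, not the implementation used in the study: distance is taken to be Euclidean, and the centre of a merged group is the unweighted mean of its member centroids (cluster sizes are ignored). The function name is hypothetical.

```python
import math
from itertools import combinations

def merge_centroids(centroids, target=4):
    """Merge the closest pair of centres until `target` groups remain.

    centroids: list of equal-length coordinate tuples (one per K-means cluster).
    Returns (groups, centers), where each group lists the indices of the
    original clusters it contains.
    """
    groups = [[i] for i in range(len(centroids))]
    centers = [tuple(c) for c in centroids]
    while len(groups) > target:
        # Find the closest pair of current centres (i < j).
        i, j = min(combinations(range(len(groups)), 2),
                   key=lambda p: math.dist(centers[p[0]], centers[p[1]]))
        members = groups[i] + groups[j]
        dim = len(centers[i])
        new_center = tuple(
            sum(centroids[m][d] for m in members) / len(members)
            for d in range(dim)
        )
        # Delete the larger index first so the smaller index stays valid.
        for k in (j, i):
            del groups[k]
            del centers[k]
        groups.append(members)
        centers.append(new_center)
    return groups, centers

# Toy example: eight 2-D centroids forming four obvious pairs
toy = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1), (30, 30), (30, 31)]
groups, centers = merge_centroids(toy, target=4)
```

In the study the inputs would be the ten 10-dimensional K-means centroids; the toy example uses two dimensions only to keep the merges easy to follow.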
A.1.3 Unsupervised Machine Learning Analysis
Participants with low amounts of motion tracking data were removed from the motion tracking and 6-minute walk analysis.
Physical activity clusters were correlated with health outcomes collected from survey questionnaires via several statistical analysis techniques. A Chi-squared test was performed to check for associations between activity cluster membership and the presence/absence of multiple health conditions (heart disease, vascular disease, diabetes, joint problems, chest pain, hypertension). Tukey’s HSD test in conjunction with ANOVA was applied to compare cluster means for quality of life metrics such as happiness, depression, worry, and overall life satisfaction, as well as continuous (rather than categorical) health outcomes such as blood pressure and HDL/LDL levels.
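The association test between cluster membership and a health condition can be illustrated with a minimal sketch. It computes the Pearson chi-squared statistic for a contingency table (rows as activity clusters, columns as condition present or absent) together with Cramér's V in its standard form. The function names are hypothetical, all marginal totals are assumed to be non-zero, and the exact effect-size normalization used in the study is not stated in the text, so the standard definition shown here is an assumption. The p-value is omitted because it requires the chi-squared distribution.

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table.

    table: list of rows of observed counts.
    Returns (statistic, degrees_of_freedom, n).
    """
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, n

def cramers_v(stat, n, n_rows, n_cols):
    """Cramér's V using the standard min(rows - 1, cols - 1) normalization."""
    return math.sqrt(stat / (n * min(n_rows - 1, n_cols - 1)))

# Toy 2 x 2 example: two clusters, condition present vs absent
table = [[10, 20], [20, 10]]
stat, dof, n = chi_squared(table)
v = cramers_v(stat, n, 2, 2)
```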
A.1.4 Heart Age and 10 Year Risk Assessment
A participant’s ten-year risk and lifetime risk of stroke and myocardial infarction were calculated utilizing formulas published by the American Heart Association[213, 278].
Ten-year risk was calculated for participants in the 40-80 age range, while lifetime risk was calculated for subjects in the 20-60 age range. The calculations incorporate age, race, sex, HDL levels, total cholesterol levels, treated/untreated systolic blood pressure, smoking status, and diabetes status, as well as population estimates of baseline survival.
These metrics were acquired through a cardiovascular health questionnaire. To provide the participants a risk estimate that was more meaningful, we calculated their “heart age” by identifying the age of an individual with the same 10-year risk as the subject, but with optimal predictor values. Predicted heart age and risk calculations were compared to subjects’ self-reported perceptions of risk, as obtained through the risk perception questionnaire. Linear regression was performed to identify the relationship between a subject’s calculated heart age and true biological age.
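The heart age definition above amounts to inverting a risk equation: given a subject's 10-year risk, find the age at which a person with optimal predictor values reaches the same risk. The sketch below shows that inversion by bisection. The risk function is a hypothetical toy stand-in and is not the published American Heart Association equation; its coefficients, the single summary risk-factor score, and the 20 to 100 year search range are assumptions made purely for illustration.

```python
import math

def toy_ten_year_risk(age, risk_factor_score):
    """HYPOTHETICAL stand-in for a published risk equation.

    Logistic in age plus a summary risk-factor score, where a score of
    0.0 represents optimal predictor values. The coefficients are invented.
    """
    return 1.0 / (1.0 + math.exp(-(0.08 * age - 7.0 + risk_factor_score)))

def heart_age(actual_risk, risk_fn=toy_ten_year_risk,
              lo=20.0, hi=100.0, tol=1e-6):
    """Age at which an individual with optimal predictors has `actual_risk`.

    Assumes risk_fn increases monotonically with age. Risks outside the
    range covered by [lo, hi] are mapped to the nearest bound.
    """
    if actual_risk <= risk_fn(lo, 0.0):
        return lo
    if actual_risk >= risk_fn(hi, 0.0):
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if risk_fn(mid, 0.0) < actual_risk:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# A 50-year-old with elevated risk factors (score 0.8 in the toy model)
subject_risk = toy_ten_year_risk(50, 0.8)
subject_heart_age = heart_age(subject_risk)
```

Under the toy model, a subject with optimal predictors has a heart age equal to their actual age, and any additional risk shifts the heart age upward.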
A.1.5 Validation studies
Validation studies were carried out for the 6-minute walk measurements. We measured an individual’s walking distance using a measuring wheel (Komelon MK45M Meterman) and the MyHeart Counts App and compared the two values. Individuals walked outside, in a straight line with no stopping. Individuals were instructed to walk at either a normal, very brisk, or very slow pace.
A.2 App Screenshots
Screenshots from the application demonstrating the consent process and the return-of-data dashboard are shown in Figures A.1 and A.2.
A.3 Survey instruments
Survey instruments used in the study are shown in Figures A.3-A.5. The Physical Activity Readiness Questionnaire (PAR-Q+ c.2012, Figure A.3A) originates from and is used with the permission of The Canadian Society for Exercise Physiology. The PAR-Q+ is a self-guided screening instrument completed by the participant before they become more physically active. Participants are required to read each question in its entirety and respond to each question carefully. If the answer to all questions is NO, the participant is informed that they are cleared for physical activity. If one or more answers is YES, users are prompted to consult with a physician prior to starting and/or increasing activity.
The Activity and Sleep survey (Figures A.3B and A.4A) is a fusion of existing validated activity and sleep surveys assessing on-the-job activity[460, 459], leisure-time activity[460, 459, 222] and sleep[4]. More specifically, questions 1 and 2 (“On-the-job Activity” and “Leisure-Time Activity”) are from the Stanford Brief Activity Survey (SBAS) and the updated Stanford Leisure-Time Categorical Item. This is a self-administered, clinically validated survey that is intended to provide a swift appraisal (less than 5 minutes) of the quantity as well as intensity of physical activity that the user does over the course of a day. The SBAS is composed of two questions, each with five possible responses. The participant is asked to choose the response that most closely depicts their workplace activity as well as their leisure-time activity. Each of the five response statements is phrased comprehensively to include the type of activity, its duration, its frequency, and its intensity. The activity survey also includes two additional questions from the AHA’s MyLifeCheck on minutes per week of moderate and vigorous activity, adapted from the short-form IPAQ questionnaire. Questions related to sleep are derived from the 2011-12 National Health and Nutrition Examination Survey (NHANES). The user is asked to estimate their actual sleep hours per weekday and whether they have any sleep disorder. The latter question was modified by Dr. Mignot (Director, Stanford Sleep Medicine Center) to include a list of seven specific sleep disorders.
The Well-Being survey questions (Figure A.4B) address Well-Being[334] and Risk Perception[227]. Questions stem from the Organisation for Economic Co-operation and Development guidelines on measuring subjective well-being. The questions are designed to place a minimal demand on participants’ time while measuring the topics for which there is the strongest validity and relevance to well-being. The first question asks the user to evaluate their overall life satisfaction, with 0 being completely dissatisfied and 10 meaning completely satisfied. Question two addresses whether the things a user is doing in their life are worthwhile, with 0 signifying “not at all worthwhile” and 10 signifying “completely satisfied”. The last 3 questions are designed to capture the affective state of the individual over the previous day, with 0 meaning the feeling of happiness was not experienced at all and a rating of 10 meaning the feeling of happiness was experienced all day.
The Risk Perception Survey was adapted from Knowles et al. (2012) and consists of four questions that assess the participant’s perceived ten-year and overall lifetime risk of heart attack, stroke, or death due to cardiovascular disease. Participants are asked to rate their own risk of having a heart attack, stroke or death relative to others of their age and gender over the subsequent decade as well as over their entire lifetime. Response options are: much lower than average, about average, higher than average, and much higher than average. The responses are assigned a numerical value (-3,-2,-1,0,1,2,3 respectively). Perceived risk is stratified based on the mean of the values indicated above; thus, a mean less than zero indicates an optimistic bias, meaning that the individual perceives themselves as lower risk than their peers.
The Diet Survey (Figure A.5A) was based on the American Heart Association’s MyLifeCheck questions, with permission, adapted from Policy Statement on The Role of Worksite Health Screening (201411). The questions regarding diet were used to determine whether a healthy dietary pattern is followed. The document identifies a healthy dietary pattern as one that is consistent with a Dietary Approaches to Stop Hypertension (DASH)-type eating pattern: consuming ≥ 4.5 cups/d of fruits and vegetables, ≥ 2 servings/wk of fish, and ≥ 3 servings/d of whole grains, and no more than 36 oz/wk of sugar-sweetened beverages and 1500 mg/d of sodium. The Cardiovascular Health Survey (Figure A.5B) is based on the AHA’s MyLifeCheck questions[203], supplemented with baseline questions from the International Study on Comparative Health Effectiveness With Medical and Invasive Approaches (ISCHEMIA Trial, NCT01471522) designed by the ISCHEMIA Trial investigators.
A.4 Supplementary Results
A.4.1 Physical Activity
Though the study protocol indicated that subjects should generate 7 days of activity data, participant compliance with this guideline varied widely. 22,790 subjects uploaded 2 consecutive days of data, 13,990 uploaded 3 consecutive days, and 8,877 uploaded 4 consecutive days. Most subjects did not complete a single full week of data collection, leading us to select 2 consecutive weekdays and 2 consecutive weekend days as the minimum participation level to include a subject in the physical activity analysis. Most subjects recorded activity data for 15-20 hours per day, with gaps in data typically appearing during the nighttime hours. Estimates of minutes spent in moderate or vigorous activity were stable across age groups (IQR = 43-172 minutes of physical activity per day). Confidence estimates are provided by the low-power sensor algorithm.
The assignment of individuals to activity clusters correlates with reported on-the-job activity levels. Individuals in cluster 1 spent the lowest fraction of time sitting/standing and the highest fraction of time walking and performing tasks that required moderate exertion. Conversely, individuals in clusters 5 and 6 spent the highest fraction of time sitting/standing. Individuals in cluster 3, the “weekend warriors”, spent a similarly high fraction of time in the sitting/standing state.
A more fine-grained clustering analysis grouped individuals by fraction of time spent in each of five activity states during two consecutive weekdays and weekend days. A K-means clustering on these features yielded 10 clusters grouped into three meta-clusters. A pairwise ANOVA of cluster membership versus answers to the well-being survey indicated significant correlations between activity level and reported happiness, worry, depression, and overall life satisfaction (Figure A.6C). A Tukey HSD analysis of difference in means between the most active cluster (“walkers”) and the inactive cluster revealed that the active subjects reported feeling 7.3% less worried (p≤0.001), 5.4% less depressed (p≤0.001), 5.5% happier (p≤0.001), 5.2% more satisfied with life (p≤0.001), and 4.1% more worthwhile (p≤0.001) than their inactive counterparts. Similarly, cluster membership was found to correlate with disease (Figure A.2B); the active participants were found to have on average 5% reduced risk for chest pain, dizziness, heart condition, heart disease, and joint problems.
A.4.2 Validation Study
Although extensive testing of activity and distance measurements from the internal motion coprocessor chip was carried out by Apple, we carried out an independent validation study comparing self-administered 6-minute walk distance to clinically administered tests at Stanford Hospital. Figure A.7 shows the result of the Bland-Altman analysis of app-reported vs. measured distance. The mean absolute percentage error of the 6-minute walk was 8.7% (52.3 (SD: 37.4) yards).
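The two agreement measures reported here follow standard definitions and can be sketched as below. This is an illustrative Python version with hypothetical function names and toy numbers, not the analysis code or data from the study. The Bland-Altman limits of agreement are taken as the mean difference plus or minus 1.96 standard deviations, which is the conventional choice and is assumed here.

```python
from statistics import mean, stdev

def bland_altman(app, reference):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of (app - reference); the limits of agreement are
    bias +/- 1.96 * SD of the differences (conventional 95% limits).
    """
    diffs = [a - r for a, r in zip(app, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

def mean_absolute_percentage_error(app, reference):
    """MAPE in percent, with the reference measurement as denominator."""
    return 100.0 * mean(abs(a - r) / r for a, r in zip(app, reference))

# Toy distances in yards (invented for illustration only)
app_distance = [110, 90, 105]
measured_distance = [100, 100, 100]
bias, lower, upper = bland_altman(app_distance, measured_distance)
mape = mean_absolute_percentage_error(app_distance, measured_distance)
```

A Bland-Altman plot places the mean of each pair on the x-axis and the difference on the y-axis, with horizontal lines at the bias and the two limits.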
A.4.3 Models of life satisfaction and self-reported disease
We tested the association of life satisfaction and self-reported disease status in our population with dietary, lifestyle, and other factors. Significant univariate predictors of life satisfaction in a linear model adjusting for covariates of age and sex were: family history of heart disease (β = −0.31, 95% CI (-0.52, -0.08), p=0.0057), proportion of time recorded as active (β = 3.03, 95% CI (0.97, 5.09), p=0.0041), self-reported minutes of moderate (β = 0.0018, 95% CI (0.0011, 0.0025), p≤0.0001) and vigorous activity (β = 0.0032, 95% CI (0.0022, 0.0042), p≤0.001), diet including daily servings of fruit (β = 0.18, 95% CI (0.10, 0.25), p≤0.001) and vegetables (β = 0.13, 95% CI (0.06, 0.19), p≤0.001), weekly servings of fish (β = 0.12, 95% CI (0.05, 0.19), p=0.0011), and weekly sugary drink intake (β = −0.03, 95% CI (-0.05, -0.01), p=0.006). Since many of these univariate predictors are correlated, we derived a multivariate linear model using stepwise selection on all significant univariate predictors, once again including age and sex as covariates. We found that fruit intake, sugary drink intake, recorded activity, and self-reported vigorous activity remained as significant predictors.
Significant univariate predictors of disease in a logistic regression of disease status that included age and sex as covariates were: family history of cardiovascular disease (OR: 1.88, 95% CI (1.32, 2.67), p≤0.001), consumption of whole grains (OR: 1.07, 95% CI (1.00, 1.13), p=0.044), life satisfaction (OR: 0.89, 95% CI (0.82, 0.98), p=0.015), and having an active job (OR: 0.54, 95% CI (0.30, 0.97), p=0.041). We used stepwise selection on the significant predictors to derive a multivariate logistic regression model, with age and sex as covariates, in which family history, whole grain consumption, and job activity remained significant predictors.
| Demographic | Number of participants |
|---|---|
| Biological sex | |
| Male | 30338 |
| Female | 6556 |
| Other | 10 |
| No response | 3115 |
| Age group | |
| 18-30 | 12178 |
| 31-40 | 9026 |
| 41-50 | 6328 |
| 51-60 | 7068 |
| 61-70 | 1684 |
| 71-80 | 444 |
| >80 | 75 |
| No response | 3210 |
| Race/ethnicity | |
| Alaska Native | 3 |
| American Indian | 33 |
| Asian | 642 |
| Black | 226 |
| Hispanic | 533 |
| Pacific Islander | 22 |
| White | 5148 |
| Other | 185 |
| I prefer not to indicate | 82 |
| No response | 33143 |

Table A.1: Subject demographic information.
| Presence of | Chest Pain | Diabetes | Heart Disease | Joint Pain |
|---|---|---|---|---|
| P-value | 0.001 | 0.001 | 0.001 | 3.42E-02 |
| N | 17062 | 17062 | 17062 | 17062 |
| χ² | 34.160 | 23.068 | 22.682 | 34.161 |
| Cramér's V | 0.0149 | 0.0122 | 0.0121 | 0.0149 |
| φ for weekend warrior cluster | 0.0420 | 0.0380 | 0.0465 | 0.0429 |

Table A.2: χ² statistical associations between K-means activity clusters and self-reported health conditions.
| Outcome | Predictor | P-value | 95% CI | β | OR |
|---|---|---|---|---|---|
| Life satisfaction | Fruit intake | 0.03 | 0.01, 0.27 | 0.14 | |
| Life satisfaction | Sugary drink intake | 0.04 | -0.063, -0.002 | -0.032 | |
| Life satisfaction | Recorded activity | 0.01 | 0.60, 4.90 | 2.75 | |
| Life satisfaction | Minutes of self-reported vigorous activity | 0.001 | 0.0014, 0.0047 | 0.003 | |
| Disease | Family history | 0.001 | 1.35, 2.78 | | 1.94 |
| Disease | Whole grain consumption | 0.03 | 1.01, 1.14 | | 1.07 |
| Disease | Job activity | 0.04 | 0.29, 0.99 | | 0.54 |

Table A.3: Significant predictors of life satisfaction and disease status, as determined by a multivariate linear model (life satisfaction) and multivariate logistic regression model (for disease status) using stepwise selection on all significant univariate predictors, including age and sex as covariates. Fruit intake, sugary drink intake, recorded activity, and minutes of self-reported vigorous activity remained as significant predictors of life satisfaction.
| Region | Activity levels (% of day spent active) | Life satisfaction (score on a scale of 1 - 10) |
|---|---|---|
| West | 15.6%, 95% CI (15.2%, 16.0%) | 7.18, 95% CI (7.09, 7.27) |
| Midwest | 15.1%, 95% CI (14.8%, 15.4%) | 7.15, 95% CI (7.08, 7.22) |
| South | 14.9%, 95% CI (14.5%, 15.3%) | 7.13, 95% CI (7.04, 7.21) |
| Northeast | 14.8%, 95% CI (14.3%, 15.2%) | 6.96, 95% CI (6.86, 7.06) |

Table A.4: Levels of activity and life satisfaction across U.S. geographic regions.

Figure A.1: Screenshots from MyHeart Counts App with Consent Form.

Figure A.2: Data returned to the user by the MyHeart Counts application. Returned metrics include heart age (left), 6 minute walk statistics (center), and insights (right).

Figure A.3: A: Physical Activity Readiness Questionnaire (PAR-Q). B: Activity and Sleep Survey: on-the-job activity[460, 459], leisure-time activity[460, 459, 222].

Figure A.4: A: Activity and Sleep Survey: Moderate or Vigorous Physical Activity[376], sleep[4]. B: Well-Being[334] and Risk Perception[227].

Figure A.5: A: Diet Survey[34]. B: Cardiovascular Health Survey[203].

Figure A.6: A: Clusters of physical data based on the number of times subjects changed state from active to inactive and vice versa over the course of 2 weekdays and 2 weekend days. B: Association of physical activity clusters with probability of health conditions. Y-axis indicates the Chi-squared difference in observed minus expected standardized residuals for each cluster.

Figure A.7: Bland Altman analysis of app-reported six minute walk distance vs. measured six minute walk distance.

Figure A.8: K-means clustering of subjects’ activity patterns based on 10 features: proportion of time spent in the “stationary”, “automotive” (driving), “walking”, “cycling”, and “running” states during the weekdays as well as during the weekends.

Figure A.9: Assessment of subjects’ cardiovascular risk. A. Subject’s calculated 10-year cardiovascular risk is compared to how they rank themselves compared to others of the same sex and age. A rank of 1 indicates that a subject considers him/herself at lower risk for cardiovascular disease as compared to others; a rank of 5 indicates that a subject considers him/herself at a much higher risk. B. Linear regression of subjects’ predicted heart age onto true age.

Appendix B
Digital cross-over randomized trial of physical activity interventions (a substudy of the MyHeart Counts Cardiovascular Health Study)
B.1 Supplementary Tables
| Measure | Statistic | Baseline | 10K Steps | Personal Advice | Hourly Stand | Read |
|---|---|---|---|---|---|---|
| Daily steps (HealthKit), iPhone, users who also reported Apple Watch data | n | 171 | 135 | 149 | 142 | 145 |
| | mean | 3088 ± 180 | 3196 ± 179 | 3288 ± 177 | 3285 ± 180 | 3109 ± 179 |
| | p-val | | 0.55 | 0.44 | 0.29 | 0.91 |
| | beta | | 108 ± 181 | 140 ± 180 | 197 ± 186 | 20 ± 187 |
| Daily steps (HealthKit), Apple Watch, users who also reported phone data | n | 171 | 135 | 149 | 142 | 145 |
| | mean | 3651 ± 217 | 4263 ± 217 | 4362 ± 215 | 4290 ± 218 | 4560 ± 218 |
| | p-val | | 4.20e-3 | 8.00e-4 | 3.70e-3 | 4.27e-5 |
| | beta | | 612 ± 213 | 710 ± 213 | 639 ± 220 | 909 ± 222 |

Table B.1: Daily step count from HealthKit from Apple Watch and iPhone for users who reported data on both device types.
Table B.3: Coaching prompts for participant activity clusters.

Cluster Prompt
Busy Bee cluster Introduction: Congratulations! Your typical physical activity patterns place you in the top group of MyHeart Counts users! You take the appropriate steps to protect your heart health, and we applaud you. Stay active! App users who stay active throughout the week report, on average, higher overall satisfaction with life than those in less active groups. Logging all those steps is working! App users in your activity group report, on average, 10 points lower blood glucose levels than those in less active groups. App users in your group report a significantly lower prevalence of heart disease than the general population. App users in your activity group report significantly lower incidence of blood vessel disease than the general population. You are definitely making your heart count! Users in your activity group report lower blood pressure than all other groups. You’re doing things right! Participants as active as you are report feeling less worried than those who are inactive. App users in your activity group report feeling happier with life than those who are sedentary.

Sedentary cluster Introduction: According to your activity profile, you fall in the segment of our app users who need to improve and increase their activity levels. You signed up and made your heart count - now make a conscious effort to increase your activity each day. Did you know that MyHeart Counts participants who are more physically active throughout the week reported being more satisfied with life than participants in your activity group? MHC users who get active on the weekend report higher life satisfaction than those who are not active on the weekend. Adding just 1 or 2 days of activity makes a difference in how you feel. So, how about a walk this weekend? More active people using this app report feeling more worthwhile than those who don’t get much activity. Our most physically active users report feeling less depressed and less worried than less active users. Try moving more regularly and watch your mood brighten! App users who mainly get active on the weekends report on average 12 points lower LDL cholesterol than non-active individuals. Every step helps make a positive impact! How about that Sunday morning walk? Users who don’t get much physical activity are more likely to report joint problems than their more active counterparts. Our bodies are made to move – your joints will thank you! App users getting less physical activity are significantly more likely to report heart disease than their more active counterparts. Try adding some steps to your day today.

Driver Cluster Introduction: So, we noticed you spend a lot of time in the car, compared to other app users. We know you have to get from point A to point B, and that requires sedentary time in the car. There’s no better way to offset the sedentary effects than to incorporate physical activity into your day. So you drive a lot. Did you know MyHeart Counts users who drive a lot but are also physically active report feeling more satisfied with life than those who report minimal activity? While users in your activity group report feeling less happy than their more active counterparts, those who drive a lot yet are also physically active report feeling happier than less active individuals. Participants who drive a lot but still engage in regular physical activity report feeling less depressed than those who get minimal to no activity. People in your group, who drive a lot but are also very active, report feeling less worried than those who lead more sedentary lives. Participants in your group report, on average, 10 points higher blood glucose than their more active counterparts. That’s all the more reason to walk to your next destination or fit in walking in other ways throughout the day. Did you know that participants in your activity group report an increased prevalence of heart disease, compared to the general population? Adding more physical activity to your day is a key way to help reduce your risk. People in your activity group report a higher incidence of vascular disease than the general population. That’s one more good reason to walk more today!

Worker Bees Cluster Introduction: Congratulations! Your activity level puts you in one of our more physically active MyHeart Counts participant groups. Clearly, you know what it takes to make your heart count, so please keep up the good work! You get your activity in during the week – great job! You also tend to be more sedentary during the weekend – what’s up with that? Individuals who are active throughout the whole week report higher overall life satisfaction than those who are only active during the workweek. Try logging some steps this weekend, and see if it affects how you feel. Being more active throughout the entire week will help lower your risk of heart disease even more. MyHeart Counts users who get active on the weekends report feeling less worried than those who spent their weekends not being active. Users who are active throughout the week, including weekends, report feeling less depressed than those who don’t do much activity on weekends. Keep up the good work! Individuals who are active during the workweek report feeling happier than those who remain sedentary. Try being active on weekends as well as weekdays. MyHeart Counts users who are active daily report feeling more worthwhile and satisfied with life than those who do not log much activity. Make every day count! MyHeart Counts users who get their activity daily report feeling more satisfied with life than users who are not active each day.

Weekend Warrior Cluster Introduction: Looks like you like to get outside and get active on weekends! Your activity levels go up towards the end of the week, and you show more physical activity during that time than our other MyHeart Counts users. Great work on getting out and enjoying your free time! We know you are busy during the week, but every little bit counts so consider increasing your activity on Monday to Friday as well. Did you know that your activity patterns place you in the MyHeart Counts Weekend Warriors group? Overall, this group gets much of their activity on weekends and reports a lower incidence of heart conditions than the general population. Does this surprise you? App users in your activity group report, on average, 3 points lower blood pressure than other activity groups. Good work, but there’s definitely room to improve. How about adding some activity during the work week? App users in your group report that they are less worried than more sedentary individuals and those who are only active during the workweek. MyHeart Counts users in your group report feeling less depressed than participants who engage in very little activity. Weekend Warriors report feeling more satisfied with life than inactive participants. Weekend Warriors report feeling more worthwhile than sedentary participants. Weekend Warriors report feeling happier than participants who are inactive.

B.2	Supplementary Figures

Figure B.1: A) User-days of data collected during weeks 1 - 4 of the study for each of the four interventions. B) Number of users who completed each intervention during weeks 1 - 4 of the study.

Figure B.2: A) Effect size data from Apple Watch. B) Mean step count from Apple Watch. C) Effect size data from iPhone of users who supplied Apple Watch and iPhone data. D) Mean step count from iPhone of users who supplied Apple Watch and iPhone data. E) Effect size from Apple Watch of users who supplied both Apple Watch and iPhone data. F) Mean step count from iPhone of users who supplied both Apple Watch and iPhone data.
Baseline 10K Steps Personal Advice Hourly Stand Read
Daily hours slept from HealthKit
N 354 235 240 248 227
Mean+/-SE 7.99+/-0.18 8.04+/-0.20 8.16+/-0.20 7.88+/-0.19 7.98+/-0.20
Standard Deviation 5.16 4.99 5.15 4.92 5.19
P-value 0.82 0.48 0.64 0.97
Effect Size+/-SE 0.051+/-0.23 0.16+/-0.24 -0.10+/-0.22 -0.008+/-0.23
Daily sleep quality (hrs asleep/hrs in bed) from HealthKit
N 325 218 220 235 216
Mean+/-SE 0.74+/-0.012 0.75+/-0.012 0.76+/-0.012 0.75+/-0.012 0.76+/-0.012
Standard Deviation 0.28 0.27 0.27 0.27 0.25
P-value 0.45 0.28 0.55 0.23
Effect Size+/-SE 0.01+/-0.01 0.02+/-0.01 0.01+/-0.01 0.02+/-0.14
Daily self-reported happiness on a scale of 1-10
N 947 117 179 174 134
Mean+/-SE 7.34+/-0.05 7.50+/-0.12 7.52+/-0.11 7.37+/-0.11 7.40+/-0.12
Standard Deviation 1.93 1.97 1.66 1.65 1.83
P-value 0.18 0.15 0.87 0.69
Effect Size+/-SE 0.17+/-0.13 0.17+/-0.12 0.02+/-0.12 0.05+/-0.13
Daily minutes walked: Core Motion accelerometry
N 1721 1329 1434 1384 1359
Mean+/-SE 45.78+/-0.30 46.03+/-1.11 45.17+/-1.10 45.70+/-1.10 46.06+/-1.10
Standard Deviation 73 61 59 58 61
P-value 0.82 0.57 0.93 0.79
Effect Size+/-SE 0.24+/-1.09 -0.61+/-1.08 -0.08+/-1.08 0.27+/-1.08
Table B.2: Secondary outcomes include minutes walked per day, as measured via iPhone core motion accelerometry; sleep duration in hours, from HealthKit; sleep quality (hours asleep / hours in bed), from HealthKit; self-reported happiness on a scale of 1 (lowest) to 10 (highest), from daily survey.
Baseline 10K Steps Personal Advice Hourly Stand Read
Daily Steps from HealthKit: iPhone
N 832 549 549 560 556
Mean+/-SE 3090+/-86 3331+/-90 3404+/-90 3313+/-90 3435+/-90
Standard Deviation 3750 3668 3806 3560 3777
P-value 7.30e-3 4.00e-4 1.29e-2 1.00e-4
Effect Size +/- SE 240+/-89 313+/-89 223+/-90 344+/-89
Daily Steps from HealthKit:Apple Watch
N 181 130 129 122 115
Mean+/-SE 4114+/-213 4720+/-221 4730+/-222 4875+/-224 4779+/-228
Standard Deviation 4644 4305 4170 4198 4155
P-value 5.10e-3 5.00e-3 6.00e-4 3.40e-3
Effect Size+/-SE 606+/-216 616+/-219 761+/-221 664+/-226
Table B.4: Intervention effects on primary outcome of mean daily step count for individuals who completed ≥4 days of an intervention.
Baseline 10K Steps Personal Advice Hourly Stand Read
Daily Steps from HealthKit: iPhone
N 512 248 236 240 261
Mean+/-SE 3470+/-113 3805+/-129 3758+/-132 3739+/-131 3896+/-129
Standard Deviation 3892 3990 3953 3831 4056
P-value 9.10e-3 2.69e-2 3.77e-2 8.00e-4
Effect Size+/-SE 335+/-128 288+/-130 269+/-129 426+/-127
Daily Steps from HealthKit:Apple Watch
N 123 66 65 56 56
Mean+/-SE 4919+/-265 5578+/-301 5331+/-303 5819+/-312 5504+/-311
Standard Deviation 4760 4565 4127 4293 4299
P-value 2.43e-2 1.69e-1 3.10e-3 5.29e-2
Effect Size+/-SE 658+/-292 412+/-299 900+/-304 585+/-302
Table B.5: Intervention effects on primary outcome of mean daily step count for individuals who completed 7 days of an intervention.
Baseline 10K Steps Personal Advice Hourly Stand Read
Forward carry baseline values
Mean+/-SE 3197+/-44 3329+/-45 3357+/-45 3376+/-45 3362+/-45
P-value 2.00e-13 1.30e-18 6.40e-23 1.27e-18
Effect Size +/-SE 131+/-17 160+/-18 179+/-18 164+/-18
Forward carry values from latest intervention
Mean+/-SE 3193+/-43 3400+/-46 3393+/-46 3400+/-46 3404+/-46
P-value 2.47e-15 4.32e-20 3.28e-21 2.39e-21
Effect Size +/-SE 174+/-22 199+/-21 207+/-22 210+/-22
Carry forward mean step count across all days in baseline + interventions
Mean+/-SE 3153+/-44 3402+/-49 3450+/-49 3496+/-49 3465+/-49
P-value 1.23e-23 9.90e-33 1.66e-43 2.98e-36
Effect Size +/-SE 249+/-25 297+/-25 343+/-25 311+/-25
Table B.6: All analyses were performed on HealthKit data measured through the iPhone. Missing intervention values for a given participant were filled in with a) the participant’s baseline daily step count; b) the step count from the participant’s most recent intervention; or c) the mean step count across the participant’s baseline and all completed interventions.
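The three fill-in strategies described for Table B.6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function name `impute_missing`, the dictionary layout, and the example step counts are hypothetical; the sketch works on one mean value per study period rather than on individual days; and for strategy (b) it assumes the baseline is used when no earlier intervention was completed.

```python
from statistics import mean

# Intervention names as they appear in the table header; the order is assumed.
INTERVENTIONS = ["10K Steps", "Personal Advice", "Hourly Stand", "Read"]


def impute_missing(baseline, observed, strategy):
    """Fill missing per-intervention mean daily step counts for one participant.

    baseline: the participant's baseline mean daily step count.
    observed: dict mapping intervention name -> mean daily steps, or None if missing.
    strategy: "baseline" (a), "latest" (b), or "overall_mean" (c).
    """
    completed = [v for v in observed.values() if v is not None]
    filled = {}
    last_seen = baseline  # assumption: before any intervention, fall back to baseline
    for name in INTERVENTIONS:
        value = observed.get(name)
        if value is not None:
            filled[name] = value
            last_seen = value
        elif strategy == "baseline":
            filled[name] = baseline
        elif strategy == "latest":
            filled[name] = last_seen
        elif strategy == "overall_mean":
            # Mean over the baseline and every completed intervention.
            filled[name] = mean([baseline] + completed)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return filled


# Hypothetical participant who skipped two interventions.
example = {"10K Steps": 4000, "Personal Advice": None, "Hourly Stand": 5000, "Read": None}
```

Calling `impute_missing(3000, example, "latest")` fills "Personal Advice" from "10K Steps" and "Read" from "Hourly Stand", while `"baseline"` fills both gaps with 3000.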

Figure B.3: A) 95% CI for intervention effect sizes. B) 95% CI for mean daily step count.

Figure B.4: A) Badges earned upon completion of each phase of the study. B) User access to badges.

Figure B.5: K-means clustering of subjects’ activity patterns based on 10 features: proportion of time spent in the “stationary”, “automotive” (driving), “walking”, “cycling”, and “running” states during the weekdays as well as during the weekends.
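The clustering summarized in Figure B.5 can be illustrated with a bare-bones k-means (Lloyd's algorithm) on 10-dimensional activity vectors. This is a sketch only: the `kmeans` function, the toy `users` data, the choice of k = 2, and the random initialization are assumptions made for illustration and are not taken from the study's code.

```python
import random

# Feature order (assumed): weekday stationary, automotive, walking, cycling, running,
# followed by the same five proportions for the weekend.
users = [
    # Toy "driver"-like profiles: a large share of automotive time.
    [0.60, 0.30, 0.08, 0.01, 0.01, 0.65, 0.25, 0.08, 0.01, 0.01],
    [0.58, 0.32, 0.08, 0.01, 0.01, 0.63, 0.27, 0.08, 0.01, 0.01],
    [0.62, 0.28, 0.08, 0.01, 0.01, 0.66, 0.24, 0.08, 0.01, 0.01],
    # Toy "weekend warrior"-like profiles: more walking and running on weekends.
    [0.85, 0.05, 0.08, 0.01, 0.01, 0.55, 0.05, 0.25, 0.05, 0.10],
    [0.84, 0.06, 0.08, 0.01, 0.01, 0.53, 0.05, 0.27, 0.05, 0.10],
    [0.86, 0.04, 0.08, 0.01, 0.01, 0.57, 0.05, 0.23, 0.05, 0.10],
]


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return centroids, labels


centroids, labels = kmeans(users, k=2)
```

On this toy input the first three users share one label and the last three share the other; which cluster receives index 0 depends on the initialization.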
B.3 Supplementary Equations
Daily Step Count ~ Days in Study + Intervention, random = ~1 | User
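Written out, a specification of this kind corresponds to a linear mixed-effects model with a random intercept for each user. The expansion below is an assumed reading offered for illustration: the symbols are not from the original text, and the intervention is taken to be coded as four indicators relative to baseline.

```latex
y_{ij} = \beta_0 + \beta_1 d_{ij}
       + \sum_{k=1}^{4} \gamma_k \, \mathbb{1}\!\left[\text{intervention}_{ij} = k\right]
       + u_i + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{ij}$ is the daily step count of user $i$ on day $j$, $d_{ij}$ is the number of days in the study, $u_i$ is the user-level random intercept, and, under this reading, each $\gamma_k$ plays the role of an intervention effect size relative to baseline.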

Appendix C
Genetic determinants and causal implications of physical activity in large populations
C.1 Supplementary Tables
N       Description                                         GWAS phenotype abbreviation
93112   Average acceleration 00:00 00:59                    0 1
93112   Average acceleration 01:00 01:59                    1 2
93112   Average acceleration 10:00 10:59                    10 11
93112   Average acceleration 11:00 11:59                    11 12
93112   Average acceleration 12:00 12:59                    12 13
93112   Average acceleration 13:00 13:59                    13 14
93112   Average acceleration 14:00 14:59                    14 15
93112   Average acceleration 15:00 15:59                    15 16
93112   Average acceleration 16:00 16:59                    16 17
93112   Average acceleration 17:00 17:59                    17 18
93112   Average acceleration 18:00 18:59                    18 19
93112   Average acceleration 19:00 19:59                    19 20
93112   Average acceleration 02:00 02:59                    2 3
93112   Average acceleration 20:00 20:59                    20 21
93112   Fraction acceleration 2000 milli gravities          2000mg
93112   Average acceleration 21:00 21:59                    21 22
93112   Average acceleration 22:00 22:59                    22 23
93112   Average acceleration 23:00 23:59                    23 24
93112   Average acceleration 03:00 03:59                    3 4
93112   Average acceleration 04:00 04:59                    4 5
93112   Average acceleration 05:00 05:59                    5 6
93112   Average acceleration 06:00 06:59                    6 7
93112   Average acceleration 07:00 07:59                    7 8
93112   Average acceleration 08:00 08:59                    8 9
93112   Average acceleration 09:00 09:59                    9 10
93112   Fraction acceleration 9 milli gravities             9mg
93112   Overall acceleration average                        OverallAccelerationAverage
93112   Standard deviation acceleration average             StandardDeviationOfAcceleration
86025   Number transition states 10mg                       Transition10
86025   Number transition states 25mg                       Transition25
362939  Duration moderate activity                          DurationModerateActivity
420988  Duration walks                                      DurationOfWalks
44817   Duration strenuous sports                           DurationStrenuousSports
256609  Duration vigorous activity                          DurationVigorousActivity
311993  Duration walking for pleasure                       DurationWalkingForPleasure
85874   Discrete wavelet transform signal magnitude vector  DWT SMV
44817   Frequency strenuous sports last 4 weeks             FrequencyStrenuousSportsLast4Weeks
311993  Frequency of walking for pleasure in last 4 weeks   FrequencyWalkingForPleasure
73261   Job involves heavy manual or physical work          JobInvolvesHeavyManualOrPhysicalWork
73261   Job involves mainly walking or standing             JobInvolvesMainlyWalkingOrStanding
73261   Job involves shift work                             JobInvolvesShiftWork
438063  Number days walked more than 10 minutes             NumberDaysWalked10Minutes
438063  Number days moderate physical activity              NumberOfDaysModeratePhysicalActivity
438063  Number days vigorous physical activity              NumberOfDaysVigorousPhysicalActivity
437662  Time spent outdoors summer                          TimeSpentOutdoorsSummer
437662  Time spent outdoors winter                          TimeSpentOutdoorsWinter
436473  Usual walking pace                                  UsualWalkingPace
Table C.1: Physical activity phenotypes derived from UK Biobank data and used to perform GWAS. No pair of phenotypes in this list had a Pearson correlation greater than 0.4.
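The pairwise correlation criterion in the caption of Table C.1 can be checked with a few lines of code. The sketch below is illustrative only: the function names `pearson_r` and `correlated_pairs`, the phenotype labels, and the toy values are hypothetical, and it applies the threshold to the signed correlation as the caption is worded (a filter on the absolute value would be an equally plausible reading).

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlated_pairs(phenotypes, threshold=0.4):
    """Return (name_a, name_b, r) for every phenotype pair with r above the threshold."""
    hits = []
    for (name_a, vals_a), (name_b, vals_b) in combinations(phenotypes.items(), 2):
        r = pearson_r(vals_a, vals_b)
        if r > threshold:
            hits.append((name_a, name_b, r))
    return hits


# Toy measurements for five hypothetical participants.
toy = {
    "pheno_a": [1, 2, 3, 4, 5],
    "pheno_b": [2, 4, 6, 8, 10],  # perfectly correlated with pheno_a
    "pheno_c": [3, 1, 4, 1, 5],   # weakly correlated with the other two
}
flagged = correlated_pairs(toy)
```

In this toy example only the pair (`pheno_a`, `pheno_b`) is flagged, so one of the two would have to be dropped for the list to satisfy the criterion.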
Feature NumSNPs pvalue <5e-8 NumSNPsMaf ≥0.01 INFO>=0.8
2 3 1410 93 1097
TimeSpentOutdoorsSummer 1115 64 975
DurationWalkingForPleasure 263 64 263
OverallAccelerationAverage 1929 52 1500
21 22 1995 47 1531
StandardDeviationOfAcceleration 2196 45 1743
5 6 1959 44 1489
3 4 1244 39 949
23 24 2132 37 1618
20 21 2142 33 1643
0 1 1681 32 1265
1 2 1199 32 874
NumberDaysWalked10Minutes 177 32 176
UsualWalkingPace 185 31 179
TimeSpentOutdoorsWinter 326 30 294
2000mg 893 25 652
17 18 1963 24 1508
15 16 1967 23 1484
7 8 1758 21 1376
18 19 1650 18 1283
14 15 1583 18 1208
4 5 1208 18 909
16 17 2036 17 1557
6 7 1822 17 1420
9 10 1697 14 1326
19 20 1755 12 1367
NumberOfDaysModeratePhysicalActivity 17 12 17
NumberOfDaysVigorousPhysicalActivity 12 11 12
13 14 1640 9 1280
8 9 1062 9 834
DurationOfWalks 24 9 24
11 12 1091 7 837
10 11 594 5 466
12 13 1229 4 970
FrequencyWalkingForPleasure 5 4 5
JobInvolvesShiftWork 91 3 89
DurationModerateActivity 7 3 7
JobInvolvesHeavyManualOrPhysicalWork 6 2 6
DurationVigorousActivity 7 1 7
9mg 5 1 4
FrequencyStrenuousSportsLast4Weeks 3 1 3
Transition10 1 1 1
DWT SMV 78 0 75
Transition25 4 0 4
JobInvolvesMainlyWalkingOrStanding 3 0 3
DurationStrenuousSports 2 0 2
Table C.2: Number of significant GWAS hits for each UK Biobank physical activity phenotype.
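The thresholds named in the header of Table C.2 (p-value < 5e-8, MAF ≥ 0.01, INFO ≥ 0.8) suggest a simple per-phenotype tally. The following is a hedged sketch of such a tally, not the pipeline used to produce the table: the record layout, the function name `count_hits`, and the toy records are assumptions.

```python
from collections import Counter

GENOME_WIDE_P = 5e-8  # genome-wide significance threshold from the table header
MIN_MAF = 0.01        # minor allele frequency filter from the table header
MIN_INFO = 0.8        # imputation INFO score filter from the table header


def count_hits(records):
    """Count genome-wide significant SNPs per phenotype, before and after QC filters.

    records: iterable of dicts with keys "phenotype", "p", "maf", "info".
    Returns {phenotype: (n_significant, n_significant_passing_qc)}.
    """
    significant = Counter()
    passing_qc = Counter()
    for rec in records:
        if rec["p"] < GENOME_WIDE_P:
            significant[rec["phenotype"]] += 1
            if rec["maf"] >= MIN_MAF and rec["info"] >= MIN_INFO:
                passing_qc[rec["phenotype"]] += 1
    return {ph: (significant[ph], passing_qc[ph]) for ph in significant}


# Hypothetical summary-statistic rows.
toy_records = [
    {"phenotype": "pheno_a", "p": 1e-9, "maf": 0.2, "info": 0.95},
    {"phenotype": "pheno_a", "p": 3e-8, "maf": 0.005, "info": 0.9},  # fails the MAF filter
    {"phenotype": "pheno_a", "p": 1e-3, "maf": 0.3, "info": 0.99},   # not significant
    {"phenotype": "pheno_b", "p": 2e-10, "maf": 0.1, "info": 0.7},   # fails the INFO filter
]
hit_counts = count_hits(toy_records)
```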

Appendix D
Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer’s and Parkinson’s diseases
D.1 Supplementary Figures

Figure D.1 (previous page): a, Bar plots of peak reproducibility across all bulk ATAC-seq biological replicates from the designated brain region. In each panel, the dotted line represents the cutoff for the fraction of samples that must have called a peak for the peak to be included in our merged reproducible bulk ATAC-seq peak set (cutoff = 0.3 or 30%). For each region, the percent of the total peaks passing this cutoff is indicated in the upper right. b, Principal component analysis of all samples showing components 1 and 2. Each dot represents a single piece of tissue with technical replicates merged where applicable. Color represents the brain region from which the sample was isolated. The proportion of variance explained is included along each axis and the axes have been scaled to these values. c, Dot plot showing the proportion of variance explained by each principal component in the analysis presented in D.1b. d, Principal component analysis of genotypes from the 1000 genomes project showing different races depicted by color. Biological samples from the current study (in red) were genotyped and projected into the principal component space defined by the 1000 genomes data to confirm the self-reported race of all individuals. e, Dot plot showing the significance of correlation between covariates and each of the top 5 principal components. Significance of each covariate is determined by the p-value of its contribution to a linear model. Dot size represents the absolute value of the correlation while color represents the principal component number. Covariates included were: region, biological sex, post-mortem interval (PMI), APOE genotype, paired-end (PE) reads, cumulative alignment rate, percent of reads mapping to chrM (%chrM), the total peaks called by MACS2 per sample, and the fraction of reads in peaks (FRiP) within those peaks. f, Sample by sample Pearson correlation heatmap of all 140 samples profiled in this study. 
Brain region, donor biological sex, and APOE genotype are indicated by color to the left. g, Principal component analysis of all samples showing components 1 and 3. Each dot represents a single piece of tissue with technical replicates merged where applicable. Color represents the brain region from which the sample was isolated. The proportion of variance explained is included along each axis. In the top panel, the axes have not been scaled to the proportion of variance explained to enable visualization of the distribution along PC3 (which is correlated with data quality). In the bottom panel, the axes have been scaled to the proportion of variance explained to contextualize the relevance of PC3 to the overall data. h, MA plots showing the change in normalized bulk ATAC-seq accessibility comparing the parietal lobe (PARL) to all other brain regions. Each dot represents an individual peak from the merged bulk ATAC-seq peak set. Only peaks that showed non-zero accessibility in at least one sample were tested for significance.

Figure D.2: a, Dot plots showing the TSS enrichment score and the total number of fragments per cell for each of the 10 samples profiled by scATAC-seq. Each dot represents an individual cell. Dot color represents density on the plot. Dotted lines represent the quality control cutoffs implemented.

Figure D.3: a-b, Bar plot showing CIBERSORT predictions across all bulk ATAC-seq data generated in this study. Samples are sorted and colored (bottom of plot) by the region from which they were profiled. Bars are colored by (a) the predicted cell class or (b) the predicted cluster. Donor IDs are annotated below the plot. c-d, Dot plot showing the performance of the CIBERSORT classifier by comparing the “ground truth” from scATAC-seq data and the CIBERSORT prediction on the bulk ATAC-seq data from the same tissue sample. Each dot represents (c) a cell class (i.e. the merge of multiple clusters) or (d) a cluster from one of the 10 scATAC-seq samples profiled. Dots are colored by (c) cell class according to the legend in D.3a or (d) cluster according to the legend in D.3b. e-f, Box and whiskers plot showing the Pearson correlation of each bulk ATAC-seq sample with a synthetic analog derived from admixing (e) the proportional cell class-specific signal predicted by the cell class-specific CIBERSORT classifier or (f) the proportional cluster-specific signal predicted by the cluster-specific CIBERSORT classifier (see Methods). The lower and upper ends of the box represent the 25th and 75th percentiles and the internal line represents the median. The whiskers represent 1.5 multiplied by the inter-quartile range. All samples are shown as individual dots overlaid on the box and whiskers (N = 59 CAUD, 42 HIPP, 23 MDFG, 35 PARL, 20 PTMN, 61 SMTG, 28 SUNI).

Figure D.4 (previous page): a, Pearson correlation heatmaps showing the correlation of pseudo-bulk replicates from cell types across different brain regions. All heatmaps use the same color scale ranging from R values of 0.6 to 1.0. b, Volcano plot of peaks that show differential signal between astrocytes from the substantia nigra and astrocytes from the isocortex. Peaks below a log2(fold change) threshold of 2 were not considered. Peaks near genes that are predicted to be key lineage-defining genes are accented with larger colored dots. Significance determined by the Wald test (DESeq2). c, The same UMAP dimensionality reduction shown in 3.25e but each cell is colored by its gene activity score for the annotated lineage-defining gene identified in D.4b (FOXG1, ZIC5, FOXB1, IRX1). Gene activity scores were imputed using MAGIC. Grey represents the minimum gene activity score while purple represents the maximum gene activity score for the given gene. The minimum and maximum scores are shown in the bottom left of each panel. The gene of interest is shown in the upper left of each panel. d, Sequencing tracks of pseudo-bulk scATAC-seq data from multiple genomic regions showing differential chromatin accessibility between astrocytes or OPCs in the isocortex and substantia nigra. From left to right: Isocortex-specific - FOXG1 (chr14:28750000-28787000), and ZIC2/ZIC5 (chr13:99937000-99999000); Substantia Nigra-specific - FOXB1 (chr15:59996000-60012000), IRX1 (chr5:3589600-3607800), IRX2 (chr5:2737000-2760000), IRX3 (chr16:54277000-54292000), IRX5 (chr16:54927000-54940000), and PAX3 (chr2:222189500-222333500). Peaks called in scATAC-seq data are shown below each plot. Each pseudo-bulk track was derived from merging all single cells corresponding to the annotated cell types in the specified regions. Tracks have been normalized to the total number of reads in TSS regions, enabling direct comparison of tracks within each vertical panel.
e-g, Volcano plot of peaks that show differential signal between (e) OPCs, (f) oligodendrocytes, or (g) microglia from the substantia nigra and isocortex. Peaks below a log2(fold change) threshold of 2 were not considered. Peaks near genes that are predicted to be key lineage-defining genes are accented with larger colored dots. Significance determined by the Wald test (DESeq2). h,j, Sequencing tracks of pseudo-bulk scATAC-seq data from multiple genomic regions showing differential chromatin accessibility between (h) oligodendrocytes from the substantia nigra and isocortex or (j) inhibitory neurons from the striatum and isocortex. From left to right: (h) Isocortex-specific - SHC2 (chr19:409800-463200), and INSM1 (chr20:20361000-20374000); Substantia nigra-specific - RBFOX1 (chr16:5899200-7791000); (j) Isocortex-specific - KCNJ6 (chr21:37583000-37955000), and NCALD (chr8:101673000-102141000); Striatum-specific - DRD2 (chr11:113369000-113602000), and FOXP1 (chr3:70922000-71622000). Peaks called in scATAC-seq data are shown below each plot. Each pseudo-bulk track was derived from merging all single cells corresponding to the annotated cell types in the specified regions. Tracks have been normalized to the total number of reads in TSS regions, enabling direct comparison of tracks within each vertical panel. i, Same as D.4e but for inhibitory neurons in the isocortex and striatum. Significance determined by the Wald test (DESeq2).

Figure D.5 (previous page): a-c, Heatmap of neuronal cell class-specific (a) peaks, (b) corresponding TF motifs, or (c) gene activity scores identified by ArchR. Each row represents a different (a) peak region (N = 171,346), (b) TF motif or (c) gene. For a and c, color represents the row-wise Z-score of the log2(Fold Change) between the given neuronal cell class and a background set of features with color values thresholded at 2 and -2. For b, color represents the normalized enrichment (log10(p-value) of the hypergeometric test) of the TF motif in the relevant peaks with color values thresholded at the maximum enrichment of that motif as indicated in parentheses next to the TF name. d, Volcano plot of differential gene activity scores between striatopallidal and striatonigral medium spiny neurons (MSNs). Select genes of interest are labeled and highlighted by larger dots. Significance determined by the Wald test (DESeq2). e, Sequencing tracks of pseudo-bulk scATAC-seq data from striatopallidal MSNs and striatonigral MSNs at the FOXP1 locus (chr3:69658000-71860000) and the FOXP2 locus (chr7:113677000-115072000). Each pseudo-bulk track was derived from merging all single cells corresponding to the annotated cell types in the specified regions. Tracks have been normalized to the total number of reads in TSS regions, enabling direct comparison of tracks within each vertical panel.

Figure D.6: a, Histogram of the number of genes linked per GWAS locus. Each bar represents a bin of size 1. A link represents a putative association of a SNP within an ATAC-seq peak to a gene based on HiChIP or co-accessibility data. b, Venn diagram of the number of genes linked through assessment of the nearest gene to the lead SNP (green) or the number of genes linked through HiChIP and scATAC-seq analyses of LD-expanded polymorphisms (blue) for all GWAS loci from AD (left) and PD (right). c-d, GO-term enrichments of genes linked to (c) AD and (d) PD GWAS SNPs through FitHiChIP loop calls or scATAC-seq based co-accessibility correlations. Significance determined by the hypergeometric test. e, Characterization of GWAS loci in AD and PD according to the predicted effects of the polymorphisms. For example, loci whose phenotypic association is likely mediated by changes in coding regions are marked as “Likely coding”. Loci where no known SNPs overlap an exonic region are annotated as “Likely noncoding”. Loci whose effect could be mediated by either coding or noncoding mechanisms are marked as “Either coding or noncoding” whereas loci with no polymorphisms overlapping a peak region from our bulk or scATAC-seq data or an exonic region are marked as “Unknown”.
D.2 Supplementary Note 1 - Quality control analysis of bulk ATAC-seq data
In total, we generated 268 bulk ATAC-seq libraries from 140 macrodissected brain samples, with technical replicates for 128 of the 140 samples. As with t-SNE (3.25c), principal component analysis (PCA) shows clear brain region-specific clustering, with 39% of the variance attributable to the difference between striatal and non-striatal brain regions (D.1b-c). For example, region-specific chromatin accessibility was observed at the dopamine receptor D2 (DRD2) gene in the striatum, corresponding to medium spiny neurons (ref. 1), the Iroquois homeobox 3 (IRX3) gene in the substantia nigra, corresponding to diencephalic-origin astrocytes (ref. 2), or the potassium voltage-gated channel modifier subfamily S member 1 (KCNS1) gene in the isocortex, corresponding to various neuronal populations (ref. 3). To validate the reported race of each donor, we performed genotype PCA using the 1000 genomes data as a reference (D.1d). These samples showed no clustering based on covariates such as post-mortem interval, APOE genotype, or biological sex (D.1e-f), and we identified very few peaks with significant differential accessibility across sexes. We note that PC3, representing 7.29% of variance, shows a significant correlation with metrics of data quality including the fraction of reads in peaks (D.1e,g). Lastly, we note that comparison of bulk ATAC-seq data across regions from the same anatomical tissue type (i.e. isocortex) showed no significant differences (D.1h).
D.3 Supplementary Note 2 - Single-cell ATAC-seq provides reference cell populations for deconvolution of cell type-specific signals in bulk data
Using the cell type-specific signals present in our scATAC-seq data, we performed cell type deconvolution of our bulk ATAC-seq data using CIBERSORT (ref. 4). From our 8 cell classes and our 24 clusters, we created classifiers to deconvolve the ATAC-seq signal from all 140 samples profiled by bulk ATAC-seq in this study (D.3a-b). These classifiers recapitulate expected patterns of cell type abundance such as a relative absence of excitatory neurons in the striatum (D.3a) and mapping of signal from Cluster 14 (nigral astrocytes) specifically to samples from the substantia nigra (D.3b). Moreover, these classifiers predict a wide range of cellular composition across the macrodissected human brain samples used here, even within a single region. Such large differences in cell type composition can hamper efforts to find differential features, further supporting the use of single-cell approaches to understand complex tissues and disease states where small disease-specific variation may be overshadowed by larger differences in cell type composition across samples.
By comparing the CIBERSORT prediction to the observed “ground truth” in the scATAC-seq data for the 10 samples profiled here, we assessed the performance of the cell class-specific and cluster-specific classifiers (D.3c-d). As would be expected, the cell class-specific classifier showed better performance than the cluster-specific classifier, largely due to over- or under-prediction of closely related clusters, such as the oligodendrocytic Clusters 19-23, by the cluster-specific classifier (D.3d). To benchmark the ability of these classifiers to explain the majority of signal observed in bulk ATAC-seq, we created synthetic analogs (ref. 5) by proportionally mixing signal from each cell group (class or cluster) at each peak. For each bulk ATAC-seq sample, we asked how similar this sample is to the synthetic analog. In almost all cases, the Pearson correlation value between each sample and its synthetic analog surpassed 0.9, indicating that the vast majority of bulk ATAC-seq signal can be explained by either the class-specific or cluster-specific classifiers (D.3e-f).
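The synthetic-analog benchmark described above amounts to a proportion-weighted sum of per-group signal at each peak, followed by a Pearson correlation against the bulk profile. The sketch below illustrates that idea with toy numbers; the function names, the group labels, and all values are hypothetical, and it does not reproduce CIBERSORT itself, which is assumed to have already supplied the proportions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def synthetic_analog(proportions, group_signal):
    """Mix per-group peak signal in the predicted proportions.

    proportions: {group: fraction}, assumed to sum to 1.
    group_signal: {group: [signal at each peak]}, same peak order for every group.
    """
    n_peaks = len(next(iter(group_signal.values())))
    mixed = [0.0] * n_peaks
    for group, fraction in proportions.items():
        for i, signal in enumerate(group_signal[group]):
            mixed[i] += fraction * signal
    return mixed


# Toy reference signal at four peaks for two hypothetical cell groups.
reference = {"neuron": [10, 0, 5, 1], "astrocyte": [0, 8, 5, 1]}
predicted = {"neuron": 0.75, "astrocyte": 0.25}  # stand-in for deconvolution output
bulk = [7.0, 2.5, 5.5, 1.0]                      # toy bulk ATAC-seq profile

analog = synthetic_analog(predicted, reference)
similarity = pearson_r(bulk, analog)
```

A correlation near 1 between `bulk` and `analog` indicates that the predicted mixture accounts for most of the bulk signal, which is the sense in which the text reports values above 0.9.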
D.4 Supplementary Note 3 - Single-cell ATAC-seq identifies brain region-specific differences in glial cells
Our dissection of the cell type-specific chromatin landscapes in adult brain identified clusters that are both region- and cell type-specific, such as Cluster 14 which is comprised almost exclusively of astrocytes from the substantia nigra. This observation indicates that certain brain cell types may show region-specific variation. This phenomenon has been very well described in neurons, with, for example, inhibitory neurons from the striatum (largely medium spiny neurons) differing substantially from inhibitory neurons outside of the striatum[240]. Murine oligodendrocytes[72] and astrocytes[317] also show regional differences in morphology, function, and gene expression. However, the brain-regional variation of glial cells in humans remains less well understood. To address this, we grouped cells into one of the 8 broad cell classes defined above and created region-specific pseudobulk profiles from the cumulative data (see Methods). Using these region-cell type combinations, we calculated Pearson correlations for all regions across a single cell type (D.4a). As expected, neuronal cell types showed the most regional variation, with a minimum Pearson correlation R value of 0.6.
Glial cells, however, also showed appreciable regional variation, with astrocytes showing the most variation followed by OPCs (D.4a). Within astrocytes, the greatest difference was found between the substantia nigra and the isocortex, indicating that the function or composition of astrocytes may differ across these brain regions. Differential peak analysis identified significant differences in chromatin accessibility near transcriptional regulators that may help explain the observed regional astrocytic differences (D.4b). In particular, nigral astrocytes showed significantly increased accessibility at the forkhead box B1 (FOXB1), IRX1, IRX2, IRX3, and IRX5 genes. Conversely, isocortical astrocytes showed significantly increased accessibility at the FOXG1, zic family member 2 (ZIC2), and ZIC5 genes. These changes in chromatin accessibility were associated with differential motif accessibility and would be expected to correlate with similar changes in gene expression for the annotated genes. Moreover, the gene activity scores of these genes are definitional for the region-cell subtypes with, for example, FOXB1 being active only in nigral astrocytes and ZIC2 and ZIC5 being active in all other astrocytes (D.4c-d). Of particular interest, the observed FOX switch from FOXG1 in isocortical (and hippocampal/striatal) astrocytes to FOXB1 in nigral astrocytes and the significant changes in chromatin accessibility at the IRX genes represent a potential transcriptional lineage control mechanism that could help to better understand region-specific functional differences in these astrocytes. Notably, diencephalic brain regions such as the substantia nigra have previously been shown to express FOXB1 (ref. 9), IRX1 (ref. 10), and IRX3 (ref. 2) during early brain development, thus explaining part of this broad TF-based lineage control. These transcriptional regulators could be exploited to drive differentiation programs to, for example, create regionally biased glial cells in vitro.
In addition to controlling regional astrocytic identity, chromatin accessibility at IRX genes was also found to differentiate nigral OPCs from isocortical OPCs (D.4d-e). Similarly, FOXG1 also showed significantly more accessibility in isocortical OPCs, echoing the observations from astrocytes. Lastly, chromatin accessibility at the PAX3 gene locus was significantly higher in nigral OPCs compared to isocortical OPCs (D.4d-e). Taken together, these results identify shared and disparate transcriptional regulatory programs that likely control regional differences amongst astrocytes and OPCs in the substantia nigra and isocortex.
Compared to astrocytes, oligodendrocytes and microglia showed less regional variation in chromatin accessibility (D.4f-g). While a small number of genes showed highly significant regional differences in oligodendrocytes (D.4h), very few genes showed appreciable regional differences among microglia. As noted previously, the regional differences observed in glial cells are a small fraction of the size and magnitude of regional differences observed in neurons (D.4i-j), further emphasizing the importance of single-cell approaches to study complex tissues.
D.5 Supplementary Note 4 - Single-cell ATAC-seq identifies neuronal cell class-specific biology
We identify peak regions and corresponding transcription factor motifs that are unique to each neuronal cell class, highlighting potential gene regulatory mechanisms underlying the class-specific differences (D.5a-b). Additionally, using gene activity scores, we identify genes that show neuronal class-specific activity (D.5c), including genes that differentiate striatopallidal and striatonigral medium spiny neurons such as the transcription factors FOXP1 and FOXP2 (D.5d-e), which have previously been shown to exhibit variable expression in the striatum[151].
D.6 Supplementary Note 5 - Tiered approach to identification of functional GWAS polymorphisms
Of the 9,707 putative disease-relevant SNPs, 9,429 were included in downstream analysis based on genome-wide significance or presence in high LD with a genome-wide significant SNP. Of these, 1,175 SNPs overlapped peak regions identified in the cluster-specific peaks of our single-cell ATAC-seq data (Tier 3). Intersecting these SNPs with gene linkage predictions based on HiChIP, co-accessibility, or colocalization, we identified 1,081 SNPs that met the requirements of Tier 2. Additionally, 278 SNPs (of the original 9,707) were included based on colocalization or presence in high LD with a colocalized SNP, despite the SNP of interest not meeting genome-wide GWAS p-value thresholds. Of these colocalization-based SNPs, 56 overlapped peak regions from our cluster-specific scATAC-seq data and were therefore also included in Tier 2. Collectively, these merged Tier 2 SNPs implicate 516 and 433 genes putatively affected by the activity of GWAS polymorphisms in PD and AD, respectively (D.6a-b). These gene sets are enriched for biological processes known to be implicated in AD and PD including lipoprotein particle clearance (ref. 12) (AD) and synaptic vesicle recycling (ref. 13) (PD) (D.6c-d). Lastly, we identified high-confidence Tier 1 SNPs as the subset of Tier 2 SNPs that were predicted to affect transcription factor binding based on our machine learning framework (100 SNPs) or our allelic imbalance analysis using RASQUAL (74 SNPs).
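The tier definitions above can be read as a nested set of filters, which the following sketch encodes. It is an illustrative reading rather than the authors' implementation: the `Snp` fields, the `assign_tier` function, and the example identifiers are hypothetical, and the strict nesting (Tier 1 within Tier 2 within Tier 3) is an assumption drawn from the wording of the text.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    rsid: str
    in_cluster_peak: bool      # overlaps a cluster-specific scATAC-seq peak
    linked_to_gene: bool       # HiChIP, co-accessibility, or colocalization link
    ml_predicted_effect: bool  # predicted to alter TF binding by the ML framework
    allelic_imbalance: bool    # supported by the allelic imbalance analysis


def assign_tier(snp):
    """Return 1, 2, or 3 for the most stringent tier met, or None if no tier applies."""
    if not snp.in_cluster_peak:
        return None
    if not snp.linked_to_gene:
        return 3
    if snp.ml_predicted_effect or snp.allelic_imbalance:
        return 1
    return 2


# Hypothetical SNPs, one per outcome.
examples = [
    Snp("snp_outside_peak", False, True, True, False),
    Snp("snp_peak_only", True, False, False, False),
    Snp("snp_peak_and_gene", True, True, False, False),
    Snp("snp_functional", True, True, False, True),
]
tiers = {snp.rsid: assign_tier(snp) for snp in examples}
```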
D.7 Supplementary Note 6 - A multi-omic epigenetic dissection of the MAPT gene locus
The MAPT gene encodes tau isoforms, primarily neuronal microtubule binding proteins that, under pathologic conditions, can adopt an abnormal structure and extensive post translational modifications, a process called neurofibrillary degeneration, which is a hallmark of AD and other neurodegenerative diseases, but not PD16. Enigmatically, MAPT is a replicated risk locus for PD despite the absence of neurofibrillary degeneration[426, 328]. The MAPT locus, found on chromosome 17, represents one of the largest LD blocks in the human genome (1.8 Mb) and is present in two distinct haplotypes, H1 and H2, the latter formed by an approximately 900 kb inversion of H1 that occurred about 3 million years ago and is present mostly in Europeans[443]. Cumulatively, previous work supports MAPT haplotype-specific impacts on transcript amount, transcript stability, and alternative splicing in several neurodegenerative disorders[50, 239, 15]. We highlight multiple epigenetic avenues through which the MAPT gene is differentially regulated in the H1 and H2 haplotypes, thus explaining at least a portion of the molecular underpinnings of the observed MAPT GWAS association in PD.
D.8 Supplementary Methods
D.8.1 Ancestry determination via PCA analysis on genomic data
Genotyping was performed on the bulk ATAC-seq datasets with bcftools (1.7)[118]. The 'bcftools mpileup' command was executed on the individual filtered bulk ATAC-seq BAM files to generate read pileups. The output of this command was passed to 'bcftools call' to perform variant calling. The resulting VCF files were merged with 'bcftools merge', converted to PLINK 1.9[371] binary (--bfile) format, and filtered to include variants with population minor allele frequency (MAF) greater than or equal to 0.05. Chromosome 1 data from phase 3 of the 1000 Genomes Project25 was downloaded from
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38 positions/ ALL.chr1 GRCh38.geno The variants were filtered to those with MAF ≥ 0.05. Common variants were identified from the
1000 Genomes SNP set and the donor SNP set above, and the datasets were merged into a single PLINK binary format (bed) file. This yielded 447,096 SNPs for 2916 individuals in the combined 1000 Genomes / donor dataset. The PLINK --pca command was executed with the subject IDs of the 1000 Genomes Project individuals passed to the '--family --pca-cluster-names' flags to ensure that PCA would be performed on the 1000 Genomes cohort and that the unknown donors from this study would be projected onto the resulting PCs. Individuals from 1000 Genomes and the donors from this study were jointly plotted along PC1 and PC2, and the ancestry of each donor was set to the ancestry of the closest 1000 Genomes individual in PC space.
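The final assignment step, labeling each donor with the ancestry of its nearest 1000 Genomes individual, can be sketched as follows. This is an illustration only: it assumes Euclidean distance on (PC1, PC2), and the function name, sample identifiers, coordinates and ancestry labels are all hypothetical.

```python
import math

def assign_ancestry(donor_pcs, reference_pcs, reference_labels):
    """Label each donor with the ancestry of its nearest reference individual in PC space."""
    assigned = {}
    for donor_id, donor_xy in donor_pcs.items():
        # Euclidean distance in the PC1/PC2 plane is an assumption of this sketch.
        nearest = min(reference_pcs, key=lambda ref_id: math.dist(donor_xy, reference_pcs[ref_id]))
        assigned[donor_id] = reference_labels[nearest]
    return assigned

# Toy (PC1, PC2) coordinates; the values are illustrative only.
reference_pcs = {"ref1": (0.0, 0.0), "ref2": (5.0, 5.0)}
reference_labels = {"ref1": "EUR", "ref2": "AFR"}
donor_pcs = {"donorA": (0.4, -0.2), "donorB": (4.1, 5.3)}
result = assign_ancestry(donor_pcs, reference_pcs, reference_labels)
```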
D.8.2 SNP selection for colocalization testing
A single test for colocalization of GWAS and eQTL association signals involves a locus, a GWAS, an eQTL tissue, and a gene expressed in that tissue. For each GWAS, we selected the set of all loci for which the lead GWAS variant had p-value ≤ 1e-5. Using eQTLs from GTEx brain tissues in the GTEx v8 dataset, we then found all tissue-gene combinations for which the lead SNP at one of the GWAS loci had an eQTL SNP (association p-value ≤ 1e-5) for that gene in that GTEx tissue. This resulted in a list of unique combinations of GWAS trait / genomic locus / eQTL tissue / eQTL gene, each to be tested individually for colocalization of GWAS and eQTL signals. The GWAS threshold of 1e-5 is less stringent than the threshold for genome-wide significance, but we favored sensitivity over specificity when selecting which SNPs to test, since colocalization with a strong eQTL signal may still suggest that a sub-threshold GWAS locus has an expression-mediated effect on disease.
D.8.3 Colocalization analysis
For each colocalization test combination as defined above, we selected all 1000 Genomes Phase 3 variants within a window of 500 kb around the lead GWAS variant. We narrowed this list down to SNPs measured not only in the 1000 Genomes VCF, but also in the GWAS and eQTL summary statistics for the selected trait, tissue, and gene. We used a streamlined version of the FINEMAP tool[52] to compute posterior causal probabilities for each SNP at the locus in both the GWAS and eQTL studies, and then combined these probabilities as described in eCAVIAR27 to compute a colocalization posterior probability (CLPP) score for this test locus. We considered a SNP weakly colocalized if its CLPP score exceeded 0.01; although this seems like a low probability, we have observed previously that loci exceeding this cutoff show considerable likelihood of sharing causal eQTL and GWAS variants28, and our goal in this analysis was to be as sensitive as possible in selecting putatively functional loci for subsequent orthogonal analysis steps.
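As an illustration of how the two sets of posterior probabilities can be combined, the sketch below multiplies the GWAS and eQTL posterior causal probabilities for each SNP, which is the per-SNP CLPP of eCAVIAR. Summing these products into a single locus-level score is an assumption of this sketch, as are the function names and the toy probabilities.

```python
WEAK_COLOC_THRESHOLD = 0.01  # cutoff for weak colocalization stated in the text

def snp_clpp(gwas_posteriors, eqtl_posteriors):
    """Per-SNP CLPP: product of the GWAS and eQTL posterior causal probabilities."""
    shared = gwas_posteriors.keys() & eqtl_posteriors.keys()
    return {snp: gwas_posteriors[snp] * eqtl_posteriors[snp] for snp in shared}

def locus_clpp(gwas_posteriors, eqtl_posteriors):
    """Locus-level summary (an assumption of this sketch): sum of the per-SNP CLPP values."""
    return sum(snp_clpp(gwas_posteriors, eqtl_posteriors).values())

# Toy posterior probabilities for three SNPs at one locus.
gwas = {"rs1": 0.6, "rs2": 0.3, "rs3": 0.1}
eqtl = {"rs1": 0.5, "rs2": 0.1, "rs3": 0.4}
score = locus_clpp(gwas, eqtl)
is_weakly_colocalized = score > WEAK_COLOC_THRESHOLD
```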
D.8.4 CIBERSORT deconvolution
CIBERSORT4 was used to deconvolve bulk ATAC-seq data using signature matrices generated from scATAC-seq data. Default parameters were used. For the cell type-specific classifier, pseudo-bulk replicates were generated for each of the 8 main cell types. For the cluster-specific classifier, pseudo-bulk replicates were generated for each of the 24 clusters.
gkm-SVM machine learning classifier training and testing
For each of the 24 scATAC-seq clusters, we used a 10-fold cross-validation scheme to train weighted gapped k-mer Support Vector Machine (gkm-SVM) models to classify 1000 bp sequences into two classes: accessible (corresponding to sequences underlying peaks) and inaccessible (GC-matched inaccessible genomic regions). The test sets for each of the 10 folds are as follows. Fold 0 consisted of chr 1. Fold 1 consisted of chr 2 and chr 19. Fold 2 consisted of chr 3 and chr 20. Fold 3 consisted of chr 6, chr 13, and chr 22. Fold 4 consisted of chr 5, chr 16, and chr Y. Fold 5 consisted of chr 4, chr 15, and chr 21. Fold 6 consisted of chr 7, chr 14, and chr 18. Fold 7 consisted of chr 11, chr 17, and chr X. Fold 8 consisted of chr 9 and chr 12. Fold 9 consisted of chr 8 and chr 10.
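The chromosome-based fold assignment listed above can be written as a lookup table. The sketch below is illustrative: the 'chr1'-style chromosome names, the tuple layout of a region and the helper name are assumptions, not part of the original pipeline.

```python
# Held-out (test) chromosomes for each cross-validation fold, as listed in the text.
TEST_FOLDS = {
    0: ["chr1"],
    1: ["chr2", "chr19"],
    2: ["chr3", "chr20"],
    3: ["chr6", "chr13", "chr22"],
    4: ["chr5", "chr16", "chrY"],
    5: ["chr4", "chr15", "chr21"],
    6: ["chr7", "chr14", "chr18"],
    7: ["chr11", "chr17", "chrX"],
    8: ["chr9", "chr12"],
    9: ["chr8", "chr10"],
}

def split_by_fold(regions, fold):
    """Split (chrom, start, end) regions into train and test lists for one fold."""
    held_out = set(TEST_FOLDS[fold])
    train = [r for r in regions if r[0] not in held_out]
    test = [r for r in regions if r[0] in held_out]
    return train, test
```

Splitting by whole chromosomes keeps every test sequence on a chromosome that the corresponding model never saw during training.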
For each of the 24 scATAC-seq clusters, we merged the IDR peaks with identical genomic coordinates (peaks with multiple summits) while preserving the summit position and the MACS2 p-value of the peak with the lowest p-value among the ones with the identical coordinates. Next, we ranked the peaks by the MACS2 p-value, expanded each peak by 500 bp on either side of the summit, to a total of 1000 bp, and eliminated those peaks with any N bases in the 1000 bp. For each of 10 cross-validation folds, we kept up to 60,000 of the top peaks belonging to the training set and all of the peaks belonging to the much smaller test set, all of which comprised the positively labeled (accessible) examples for training.
In order to generate the negative (inaccessible) examples for each of the cross-validation folds in each single-cell cluster, we first used seqdataloader (https://github.com/kundajelab/seqdataloader) to generate all 1000 bp sequences obtained by tiling the hg38 genome 200 bp at a time, with a stride of 50 bp, keeping those 200 bp segments that have no IDR peak summits in that cluster, and then expanding those 200 bp segments by 400 bp on each side for a total of 1000 bp. Next, we calculated the GC content of the selected positive examples and all other bins in the genome. We partitioned the positive examples into 20 equally numerous GC bins according to the GC-content percentile of the positive sequence with respect to the positive set. We assigned sequences from all other bins in the genome to GC bins according to their GC content. Starting with an empty negative set, we then sampled a positive example, sampled a negative sequence from the same GC bin as the sampled positive example, added the negative sequence to the negative set, and repeated this process until the number of negative examples equaled the number of positive examples for both the training set and the test set.
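The GC-matched sampling procedure can be sketched as below. This is a simplified illustration, not the code used in this work: it draws negatives without replacement, stops early if the matching candidate pools are exhausted, and uses hypothetical function names throughout.

```python
import bisect
import random

def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_bin_edges(positive_gc, n_bins=20):
    """Boundaries that split the positive GC values into n_bins equally numerous bins."""
    ordered = sorted(positive_gc)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]

def sample_gc_matched_negatives(positives, candidates, n_bins=20, seed=0):
    """Draw one GC-matched negative per positive (fewer if matching candidates run out)."""
    rng = random.Random(seed)
    edges = gc_bin_edges([gc_content(s) for s in positives], n_bins)

    def bin_of(seq):
        return bisect.bisect_right(edges, gc_content(seq))

    # Group candidate negatives by the GC bin they fall into.
    pools = {}
    for seq in candidates:
        pools.setdefault(bin_of(seq), []).append(seq)
    positive_bins = [bin_of(seq) for seq in positives]
    negatives = []
    while len(negatives) < len(positives) and any(pools.get(b) for b in positive_bins):
        # Sample a positive example (only its GC bin matters here), then a negative from that bin.
        pool = pools.get(rng.choice(positive_bins))
        if pool:
            negatives.append(pool.pop(rng.randrange(len(pool))))
    return negatives
```

Because positives are sampled at random, the negative set follows the GC distribution of the positives in expectation rather than bin for bin.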
For each of the 10 folds in each of the 24 clusters, we used the 1000-bp DNA sequences corresponding to the positive and GC-matched negative training examples as inputs to the gkmtrain function from the LS-GKM package29 with the default options, producing a total of 240 models; the default options for LS-GKM included the gapped k-mer + center weighted (wgkm) kernel (t = 4), a word length of 11 (l = 11), 7 informative columns (k = 7), 3 maximum mismatches to consider (d = 3), an initial value of the exponential decay function of 50 (M = 50), a half-life parameter of 50 (H = 50), a regularization parameter of 1.0 (c = 1.0), and a precision parameter of 0.001 (e = 0.001). We used the resulting support vectors for each trained model to score the DNA sequences corresponding to the positive and GC-matched negative test set examples for each fold in each cluster by running gkmpredict, and used the scikit-learn python library[357] to calculate both auROC and auPRC accuracy metrics.
D.8.5 gkm-SVM allelic scores of candidate SNPs
We intersected the coordinates of all LD-expanded candidate AD and PD GWAS and colocalization SNPs with those of the peaks for each single-cell ATAC-seq cluster to obtain the SNPs in each cluster that are in peaks. For each SNP in a peak in each of the clusters, we retrieved the 1000 bp DNA sequence around the SNP, with the SNP at its center, and created a sequence corresponding to the effect allele by replacing the 500th position of the sequence with the effect allele. Similarly, we created another sequence corresponding to the non-effect allele by replacing the 500th position of the sequence with the non-effect allele. Furthermore, we repeated the same procedure to also produce 50 bp sequences for each SNP with the effect allele and the non-effect allele by retrieving the 50 bp DNA sequence around each SNP and replacing the 25th position with the effect and the non-effect allele, respectively.
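Constructing the two allele-specific sequences amounts to substituting a single base at the SNP position. The helper below is a minimal sketch with a hypothetical name; it assumes single-nucleotide alleles and a 1-based SNP position (500 for the 1000 bp windows and 25 for the 50 bp windows described above).

```python
def make_allele_sequences(window, effect_allele, non_effect_allele, snp_position):
    """Return (effect, non-effect) sequences with the 1-based snp_position substituted."""
    i = snp_position - 1  # convert to a 0-based string index
    effect = window[:i] + effect_allele + window[i + 1:]
    non_effect = window[:i] + non_effect_allele + window[i + 1:]
    return effect, non_effect

# Toy 10 bp window with the SNP at position 5.
effect_seq, non_effect_seq = make_allele_sequences("ACGTACGTAC", "G", "A", 5)
```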
For each SNP in a peak in each of the clusters, we computed GkmExplain31 importance scores for each position in each of the 1000 bp effect and non-effect allele sequences using each of the 10 gkm-SVM32 models for the respective cluster. GkmExplain is a method to infer the importance or predictive contribution of every base in an input sequence to its corresponding output prediction from a gkm-SVM model. Next, for each SNP in a given cluster, we computed the average score for each position across all 10 models (from the 10 folds) for that cluster for both the effect allele sequence and the non-effect allele sequence, producing a set of consensus importance scores for both the effect allele and the non-effect allele. Then, we subtracted the sum of these consensus importance scores corresponding to the central 50 bp of the non-effect allele sequence from that of the effect allele sequence to compute the GkmExplain score for each SNP in each cluster.
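The aggregation of per-model importance scores into a single GkmExplain score per SNP can be sketched as follows. The function names are hypothetical, and the exact indexing of the 'central 50 bp' (a window centered on the middle of the sequence) is an assumption of this sketch.

```python
def consensus_scores(per_model_scores):
    """Average the per-position importance scores across models (one list per model)."""
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]

def central_sum(scores, width=50):
    """Sum the scores in a window of the given width centered on the sequence midpoint."""
    start = (len(scores) - width) // 2
    return sum(scores[start:start + width])

def gkmexplain_snp_score(effect_per_model, non_effect_per_model, width=50):
    """Effect-allele minus non-effect-allele sum of central consensus importance scores."""
    effect_total = central_sum(consensus_scores(effect_per_model), width)
    non_effect_total = central_sum(consensus_scores(non_effect_per_model), width)
    return effect_total - non_effect_total
```

For a 1000 bp sequence and a 50 bp window this selects 0-based positions 475 to 524, which contain the SNP at the 500th position.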
To compute in silico mutagenesis (ISM) scores for each SNP in a peak in each of the clusters, we used each of the 10 fold gkm-SVM models from the respective cluster to compute model output prediction scores for the 50 bp effect and non-effect allele sequences by running gkmpredict. Then, we subtracted the score of the non-effect allele sequence from the score of the effect allele sequence to obtain the ISM score and computed the average ISM score for each SNP across all 10 folds in each cluster.
To compute deltaSVM scores, we generated all possible non-redundant k-mers of size 11 and scored each of them using each of the 240 models. Next, for each SNP in a peak in each of the clusters, we used each of the 10 sets of k-mer scores from the 10-fold gkm-SVM models from the respective cluster to run deltaSVM33 on the 50 bp effect and non-effect allele sequences. We computed the average of the resulting deltaSVM scores for each SNP across all 10 folds in each cluster.
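The idea behind the deltaSVM calculation can be illustrated with a simplified sketch: a sequence is scored by summing the weights of all k-mers it contains, and the allelic score is the difference between the two alleles, averaged across folds. The function names are hypothetical, and the sketch ignores reverse-complement handling of non-redundant k-mers, so it should not be read as a reimplementation of the deltaSVM tool.

```python
def sequence_score(seq, kmer_weights, k):
    """Sum the weights of every k-mer in seq (k-mers missing from the table count as 0)."""
    return sum(kmer_weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))

def delta_svm(effect_seq, non_effect_seq, kmer_weights, k=11):
    """Difference in summed k-mer weights between the effect and non-effect sequences."""
    return sequence_score(effect_seq, kmer_weights, k) - sequence_score(non_effect_seq, kmer_weights, k)

def mean_delta_svm(effect_seq, non_effect_seq, per_fold_weights, k=11):
    """Average the deltaSVM-style score over the k-mer weight tables of all folds."""
    scores = [delta_svm(effect_seq, non_effect_seq, weights, k) for weights in per_fold_weights]
    return sum(scores) / len(scores)
```

Only k-mers that overlap the SNP differ between the two sequences, so all other terms cancel in the difference.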
D.8.6 Statistical significance and high confidence sets of gkm-SVM based allelic scores for candidate SNPs
In order to obtain a statistical significance for each of the three gkm-SVM model based allelic SNP scores (GkmExplain, ISM and deltaSVM), for each SNP scored in each cluster, we generated 10 random 1000 bp sequences with the same di-nucleotide frequencies as those of the 1000 bp around the SNP using the fasta-shuffle-letters program from MEME Suite34 to serve as a null background set. For each null sequence, we created a null effect allele sequence and a null non-effect allele sequence by replacing the base at the center of the null sequence with the effect and non-effect allele, respectively.
For each SNP in a peak in each of the clusters, we computed GkmExplain importance scores for each of the central 200 bp in each of the 10 null effect and non-effect allele sequences using each of the 10 gkm-SVM models for the respective clusters. Next, for each pair of null sequences, we subtracted the sum of the importance scores corresponding to the central 50 bp of the null non-effect allele sequence from that of the null effect allele sequence to compute the null GkmExplain score.
To compute null in silico mutagenesis (ISM) scores for each SNP in a peak in each of the clusters, we used each of the 10 fold gkm-SVM models from the respective clusters to compute model output prediction scores for the central 50 bp of the null effect and non-effect allele sequences by running gkmpredict. Then, we subtracted the score of the null non-effect allele sequence from the score of the null effect allele sequence to obtain the null ISM score.
To compute null deltaSVM scores, for each SNP in a peak in each of the clusters, we used each of the 10 sets of k-mer scores from the 10 fold gkm-SVM models from the respective cluster to run deltaSVM on the central 50 bp of the null effect and non-effect allele sequences.
We found that the t-distribution was a good fit (based on the KS test) to the empirical null distribution for all three scores. Hence, we fitted a t-distribution to each of the three sets of null scores (using the SciPy Python library, http://www.scipy.org/) and used the fitted distributions as the null distributions.
To select SNPs with statistically significant gkm-SVM allelic scores, for each cluster, we selected those SNPs that fall outside the 95% confidence interval for all three null t-distributions fitted to the GkmExplain, ISM, and deltaSVM scores.
Next, we developed a method to identify putative transcription factor binding sites around each gkm-SVM scored statistically significant candidate SNP, by identifying the subsequences around the SNP whose base-resolution importance scores are significantly above that of the di-nucleotide matched shuffled background. We use the GkmExplain importance scores of all bases in the central 200 bp of all the null effect and non-effect allele sequences as a null distribution to identify bases around the SNP with high signal-to-noise ratio. For each SNP, we defined the active allele as the allele for which the 50 bp sequence centered on the SNP has the higher sum of non-negative importance scores relative to the other allele. Next, starting from the center of the active allele’s sequence, which is the location of the SNP, we continue advancing one pointer upstream and another downstream, each up to the position beyond which lie two consecutive bases that both have consensus importance scores that are not higher than 97.5% of the null importance scores. The subsequence between the terminal positions of the two pointers corresponds to one that underlies a series of bases with high GkmExplain importance scores that are significantly above the null scores of the di-nucleotide matched shuffled background sequences and potentially contains transcription factor binding sites and motifs that are relevant for the given cluster. We refer to these high-importance subsequences as seqlets. If a SNP does not have a seqlet that reaches a minimum length of 7 bp, then we alternatingly extend each end of the seqlet by 1 bp until this minimum length is reached.
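The two-pointer seqlet search can be sketched as follows. This is one possible reading of the procedure, not the original implementation: the function name is hypothetical, the threshold is passed in (in the text it is the 97.5th percentile of the null importance scores), positions beyond either end of the sequence are treated as low-scoring, and extension to the minimum length starts on the upstream side.

```python
def find_seqlet(scores, center, threshold, min_len=7):
    """Return inclusive (start, end) indices of the seqlet grown outward from center."""
    n = len(scores)

    def low(i):
        # Positions past either end of the sequence are treated as low-scoring.
        return i < 0 or i >= n or scores[i] <= threshold

    start = end = center
    # Advance upstream until the next two bases are both at or below the threshold.
    while start > 0 and not (low(start - 1) and low(start - 2)):
        start -= 1
    # Advance downstream under the same stopping rule.
    while end < n - 1 and not (low(end + 1) and low(end + 2)):
        end += 1
    # Alternately extend each end by 1 bp until the minimum length is reached.
    extend_upstream = True
    while end - start + 1 < min_len and (start > 0 or end < n - 1):
        if extend_upstream and start > 0:
            start -= 1
        elif end < n - 1:
            end += 1
        else:
            start -= 1
        extend_upstream = not extend_upstream
    return start, end
```

A single low-scoring base flanked by high-scoring bases does not terminate the seqlet; only two consecutive low-scoring bases do.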
Next, we defined two additional scores (prominence score and magnitude score) to further identify high confidence candidates from the gkm-SVM scored statistically significant candidate SNPs that are supported by seqlets that could potentially match identifiable transcription factor binding sites. We compute the sum of the non-negative consensus importance scores from the positions of the effect allele that overlap the active allele’s seqlet, which we refer to as the effect allele seqlet score, and divide that score by the sum of the non-negative consensus importance scores from the entire central 200-bp region of the effect allele sequence; we refer to this ratio as the effect allele seqlet signal-to-noise ratio. Similarly, we compute the non-effect allele seqlet score as the sum of the non-negative consensus importance scores in the non-effect allele sequence from the same positions overlapping the active seqlet. We obtain a corresponding non-effect allele seqlet signal-to-noise ratio by dividing the non-effect allele seqlet score by the sum of the non-negative consensus importance scores from the entire central 200-bp region of the non-effect allele sequence. Then, for each SNP, we compute the prominence score by subtracting the non-effect allele seqlet signal-to-noise ratio from the effect allele seqlet signal-to-noise ratio. In addition, we also compute a magnitude score by subtracting the non-effect allele seqlet score from the effect allele seqlet score.
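The prominence and magnitude scores defined above reduce to a few sums over the consensus importance scores. The sketch below uses hypothetical function names, takes the seqlet and the central region as inclusive index ranges, and assumes that a signal-to-noise ratio is set to 0 when the region contains no positive importance.

```python
def non_negative_sum(scores, start, end):
    """Sum of max(score, 0) over the inclusive index range [start, end]."""
    return sum(max(s, 0.0) for s in scores[start:end + 1])

def seqlet_scores(effect, non_effect, seqlet, region):
    """Return (prominence, magnitude) for one SNP.

    effect, non_effect: consensus importance scores for the two allele sequences
    seqlet, region:     inclusive (start, end) index pairs for the active allele's
                        seqlet and for the central 200 bp region
    """
    def snr(seqlet_score, scores):
        total = non_negative_sum(scores, *region)
        return seqlet_score / total if total > 0 else 0.0  # assumption: 0 when no signal

    effect_seqlet = non_negative_sum(effect, *seqlet)
    non_effect_seqlet = non_negative_sum(non_effect, *seqlet)
    prominence = snr(effect_seqlet, effect) - snr(non_effect_seqlet, non_effect)
    magnitude = effect_seqlet - non_effect_seqlet
    return prominence, magnitude
```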
To compute the statistical significance of the prominence and magnitude scores for candidate SNPs, for each cluster, we compute null prominence scores and null magnitude scores for each pair of null effect and non-effect allele sequences using the same procedure described above and use the empirical null distributions to obtain p-values for the prominence and magnitude scores for each candidate SNP scored for that cluster. For each type of score, in order to control for any arbitrary bias in the sign of the score, we include the negative value of each score to the list of scores to enforce symmetry before fitting the distribution.
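One way to obtain a p-value from a symmetrized empirical null is sketched below. The symmetrization step follows the text; the two-sided comparison on absolute values, the +1 pseudocount and the function names are assumptions of this sketch.

```python
def symmetrized_null(null_scores):
    """Append the negative of every null score so the null is symmetric around zero."""
    return list(null_scores) + [-s for s in null_scores]

def empirical_p_value(observed, null_scores):
    """Two-sided empirical p-value of observed against the symmetrized null."""
    null = symmetrized_null(null_scores)
    extreme = sum(1 for s in null if abs(s) >= abs(observed))
    # The +1 pseudocount keeps the p-value strictly above zero.
    return (extreme + 1) / (len(null) + 1)
```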
Finally, to prioritize SNPs that disrupt potential transcription factor binding sites, in each cluster, among the SNPs with statistically significant gkm-SVM allelic scores, we designate as high confidence SNPs those that have a prominence score with a p-value less than 0.05. These are the SNPs that have an allele that completely destroys a prominent and high-scoring seqlet and, as a result, potentially disrupts an important transcription factor binding site. Next, among the confident SNPs that do not pass the high confidence threshold, we designated as medium confidence SNPs those that have either a magnitude score with a p-value less than 0.05 or a prominence score with a p-value less than 0.10. The magnitude score threshold is intended to capture those SNPs that have a significant deleterious effect on the seqlet score, even if those SNPs do not necessarily destroy the entire seqlet and even for cases where the seqlet around the SNP is not among the most prominent seqlets in the local 200 bp sequence window. In addition, the relaxed prominence score threshold is intended to capture those SNPs that do not pass the stringent filter for the high confidence set, but nevertheless, demonstrate at least a partial deleterious effect on a moderately scoring seqlet around the SNP. Together, these two filters serve to increase the recall in the prioritization of the SNPs, allowing us to identify all promising SNPs that are worthy of in-depth evaluation, which can assess their potential regulatory effect through a case-by-case analysis. The remaining SNPs in the confident set, which fail to meet the threshold for medium confidence, are designated as low confidence SNPs, as they include SNPs that significantly reduce the GkmExplain score, the ISM score, and the deltaSVM score, but do not have a clear impact on a seqlet around the SNP, making it unlikely for them to have a disruptive effect on a key transcription factor binding site.
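The confidence designation described above is a simple cascade of thresholds. The sketch below uses a hypothetical function name and applies only to SNPs that have already passed the significance filter on all three gkm-SVM allelic scores.

```python
def confidence_level(prominence_p, magnitude_p):
    """Classify a SNP with significant gkm-SVM allelic scores as high, medium or low confidence."""
    if prominence_p < 0.05:
        return "high"
    if magnitude_p < 0.05 or prominence_p < 0.10:
        return "medium"
    return "low"
```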

Appendix E
Cell cycle dynamics of human pluripotent stem cells primed for differentiation
E.1 Supplementary Methods: Dataset generation
hPSC maintenance conditions
All hPSC cultures were maintained at 37 °C, 5% CO2 and expanded on Matrigel (BD Biosciences) coated plates in mTeSR (STEMCELL Technologies) with 10 µM ROCK inhibitor Y-27632 (Abcam). The previously characterized human embryonic stem cell lines (HUES6, H9 (Wicell), H9 FUCCI) were used in this study[340, 68]. The H9 FUCCI hPSC line was constructed by Singh et al., 2013 by introducing fluorescent reporters into expression vectors under the control of the constitutive CAG promoter and linked to neomycin or puromycin selectable markers through an internal ribosome entry site[427, 428]. G418 sulfate and puromycin were used to select and maintain the H9-FUCCI hPSCs for all studies.
Regulatory and institutional review
All human pluripotent stem cell experiments were conducted in accord with experimental protocols approved by the Stanford Stem Cell Research Oversight (SCRO) committee.
DMSO treatment and differentiation protocols
For all RNA-seq experiments, H9 FUCCI hPSCs were plated onto plates coated with growth factor–reduced Matrigel (BD Biosciences) in mTeSR with 10 µM ROCK inhibitor Y-27632 (Abcam)
at 1 million cells per well of a 6-well plate. After 24h, cells were cultured in mTeSR with or without 2% DMSO for another 24h. After the 24h DMSO treatment, cells were collected and prepared for fluorescence-activated cell sorting by flow cytometry for subsequent RNA-seq analyses.
For differentiation experiments, after a 24h treatment with DMSO or the PI3K inhibitors LY294002 (Selleck Chemicals) and Wortmannin (Selleck Chemicals) in mTeSR, the medium was replaced with one of the following media at the start of each differentiation, with media replacement every day in all protocols. Ectoderm differentiation was induced using Knockout DMEM (Life Technologies), LDN-193189 (100 nM; Stemgent) and SB431542 (10 µM; Tocris) containing 15% Knockout serum replacement (Life Technologies) for 3 days. The medium was removed and replaced with fresh medium every 24 hours. Mesoderm differentiation was induced in advanced RPMI medium (Invitrogen) supplemented with Wnt3a (20 ng/ml; R&D Systems) and Activin A (100 ng/ml; R&D Systems) for 24h. Endoderm differentiation was induced in RPMI medium (Invitrogen), supplemented with Wnt3a (20 ng/ml; R&D Systems) and Activin A (100 ng/ml; R&D Systems) for 24 hours and subsequently in RPMI medium containing Activin A (100 ng/ml) for 2 days.
RNA isolation and quantitative real-time PCR
Cell pellets were collected and total RNA was isolated using the RNeasy Mini Kit (QIAGEN) according to the manufacturer’s directions. Concentration and purity of RNA were determined with a NanoDrop spectrophotometer. Reverse transcription was conducted using SuperScript IV VILO Master Mix with ezDNase (Thermo Fisher) to synthesize cDNA. Quantitative RT-PCR was performed using the SYBR green system. One reaction included SYBR green mix (Applied Biosystems), the forward and reverse target gene primers, and 100 – 120 ng cDNA. The ABI 7500 Real-Time PCR machine was used to run the qRT-PCR experiment. The level of RNA transcripts was analyzed using the ΔΔCT method. Gene expression was subsequently normalized to the expression of the housekeeping gene GAPDH. The primers used in this study are listed in Supplementary Table 1.
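For reference, the standard form of the ΔΔCT calculation can be sketched as below. It assumes perfect amplification efficiency (a doubling of product per cycle); the function name and the example CT values are hypothetical and are not taken from this study.

```python
def ddct_fold_change(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Relative expression (treated vs. control) of a target gene by the delta-delta-CT method.

    The reference gene (GAPDH in this study) normalizes each sample before comparison.
    """
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    delta_delta = delta_treated - delta_control
    return 2.0 ** (-delta_delta)

# Toy CT values: the target crosses threshold 2 cycles earlier (relative to the
# reference gene) in the treated sample than in the control.
fold_change = ddct_fold_change(22.0, 18.0, 24.0, 18.0)
```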
Immunocytochemistry
Cells were rinsed in PBS and fixed in 4% paraformaldehyde (PFA; Sigma) for 30 min. Following the rinses, cells were blocked for 1 h at room temperature in 5% donkey serum (Jackson ImmunoResearch), 0.3% Triton X-100 in PBS. All primary antibody incubations were done overnight at 4 °C in blocking solution at a 1:500 dilution unless otherwise noted. Primary antibodies used in this study were: Sox1 (R&D Systems), Brachyury (R&D Systems), Sox17 (R&D Systems), Oct3/4 (Santa Cruz), and Nanog (Stemgent). Cells were rinsed the next day, followed by secondary antibody incubation for 1 h at room temperature at a 1:500 dilution. Secondary antibodies (Invitrogen) conjugated to Alexa Fluor 488 or 594 were used to visualize primary antibodies. DAPI (4′,6-diamidino-2-phenylindole, Life Technologies) was used as a nuclear dye to stain all cells. All images were acquired using a Leica Fluorescent Microscope.
Assessment of cell viability and death
To assess the degree of cell viability and death, H9 hPSCs were harvested following treatment with and without 2% DMSO in 1 ml PBS. 10 µl of Trypan Blue (Invitrogen) was mixed with 10 µl of cells. Subsequently, 10 µl of the mixed solution was loaded onto the Countess™ cell counting chamber slide (Invitrogen) and inserted into the Countess II FL Automated Cell Counter (Invitrogen). The percentage of dead or nonviable cells as well as the percentage of viable live cells were quantified by the automated system using the trypan blue exclusion assay.
Fluorescence-activated cell sorting (FACS)
Following a 24h treatment with or without 2% DMSO, the H9-FUCCI hPSCs were collected and resuspended in FACS buffer (0.2% BSA in PBS). Cells in late G1 phase express mKO2-Cdt1 (color red), while the S/G2/M cells express mAG-Gem (color green); the double-negative (colorless) population is indicative of early G1 cells. DAPI was included to stain and gate out the dead cell population. The H9 hPSC line was also included as a control and analyzed by flow cytometry. Cells were run on a flow cytometer adjusted for UV excitation to measure DAPI fluorescence at blue wavelengths and RFP and GFP wavelengths to isolate and sort populations from the early G1 (double-negative), late G1 and SG2M fractions of the cell cycle. 50,000 cells were collected for each population in replicate over two independent experiments.
RNA-sequencing library preparation
For bulk-population RNA-seq, RNA was extracted from cell cycle sorted control and DMSO-treated hPSCs. Integrity of extracted RNA was assayed by on-chip electrophoresis (Agilent Bioanalyzer) and only samples with a high RNA integrity (RIN) value were used for RNA-seq. Purified total RNA was reverse-transcribed into cDNA using the Ovation RNA-seq System V2 (NuGEN) and cDNA was sheared using the Covaris S2 system (duty cycle 10%, intensity 5, cycle/burst 100, total time 5 min). Sheared cDNA was cleaned up using Agencourt AMPure XP beads (Beckman Coulter) and ligated to adaptors (Illumina). Sequencing libraries were constructed using the NEBNext Ultra DNA Library Prep Kit (New England Biolabs) using barcoded adaptors to enable multiplexing of libraries on the same sequencing lane. For each RNA-seq library, the effectiveness of adaptor ligation and the effective library concentration were determined by Bioanalyzer before the libraries were loaded in multiplexed fashion onto an Illumina HiSeq4000 (Stanford Functional Genomics Facility) to obtain 150 bp paired-end reads.
E.2 Supplementary Figures

Figure E.1: KEGG annotation for the PI3K-AKT signaling pathway. Genes with differential expression in DMSO-treated hPSCs vs control hPSCs are filled in gray. Genes downregulated or upregulated in response to DMSO treatment are denoted by the color-coded triangles when differential at the early G1, late G1, and/or SG2M phases.

Figure E.2: (A) KEGG annotation for the TNF signaling pathway. Genes with differential expression in DMSO-treated hPSCs vs control hPSCs are filled in gray. Genes downregulated or upregulated in response to DMSO treatment are denoted by the color-coded triangles when differential at the early G1, late G1, and/or SG2M phases. (B) Summary of differentially expressed genes within the TNF signaling pathway. Heatmap values are row z-scores of asinh(TPM) DMSO / asinh(TPM) controls.

Figure E.3: (A) KEGG annotation for the cGMP-PKG signaling pathway. Genes with differential expression in DMSO-treated hPSCs vs control hPSCs are filled in gray. Genes downregulated or upregulated in response to DMSO treatment are denoted by the color-coded triangles when differential at the early G1, late G1, and/or SG2M phases. (B) Summary of differentially expressed genes within the WNT signaling pathway. Heatmap values are row z-scores of asinh(TPM) DMSO / asinh(TPM) controls.

Figure E.4: (A) KEGG annotation for the VEGF signaling pathway. Genes with differential expression in DMSO-treated hPSCs vs control hPSCs are filled in gray. Genes downregulated or upregulated in response to DMSO treatment are denoted by the color-coded triangles when differential at the early G1, late G1, and/or SG2M phases. (B) Summary of differentially expressed genes within the VEGF signaling pathway. Heatmap values are row z-scores of asinh(TPM) DMSO / asinh(TPM) controls.

Figure E.5: (A) TPM values for cell-cycle associated genes are illustrated for DMSO-treated hPSCs (blue) and control hPSCs (red) at the early G1, late G1, and SG2M phases of the cell cycle. * denotes FDR ≤ 0.05. (B) Enriched REACTOME pathways for differential genes associated with Mitosis at the early G1, late G1, and SG2M phases of the cell cycle. The heatmap shading corresponds to the -10log10(FDR) for each pathway across the different phases of the cell cycle. (C) Fold change row z-scores of asinh(tpm) DMSO/control for differentially expressed genes that are associated with enriched sub-terms of the cell cycle biological process GO Term (GO:0007049). (D) -10log10(FDR) for enriched GO terms associated with the cell cycle.

Figure E.6: (A) TPM values for cell-cycle associated genes are illustrated for DMSO-treated hPSCs (blue) and control hPSCs (red) at the early G1, late G1, and SG2M phases of the cell cycle. * denotes FDR ≤ 0.05. (B) Enriched REACTOME pathways for differential genes associated with Mitosis at the early G1, late G1, and SG2M phases of the cell cycle. The heatmap shading corresponds to the -10log10(FDR) for each pathway across the different phases of the cell cycle. (C) Fold change row z-scores of asinh(tpm) DMSO/control for differentially expressed genes that are associated with enriched sub-terms of the cell cycle biological process GO Term (GO:0007049). (D) -10log10(FDR) for enriched GO terms associated with the cell cycle.

Figure E.7 (previous page): (A) Schematic of HUES6 hPSCs treated with 2% DMSO or inhibitors of PI3K (LY294002 or Wortmannin) for 24 hours and subsequently directly differentiated into the ectodermal, mesodermal, and endodermal germ layers. Immunostaining for germ layer specific markers following treatment with (B) LY294002 or (C) Wortmannin compared with untreated Control and 2% DMSO-treated hPSCs. (D) Quantitative RT-PCR for lineage-specific genes following directed differentiation of LY294002 (20 µM) or Wortmannin (10 µM) treated hPSCs compared with untreated Control and 2% DMSO-treated hPSCs. Error bars, s.d. of 2–4 biological replicates. Scale bars, 100 µm. * p ≤ 0.05, ** p ≤ 0.01 under one-way ANOVA; Tukey’s test for multiple comparisons.

Appendix F
Matrix stiffness induces a tumorigenic phenotype in mammary epithelium through changes in chromatin accessibility
F.1 Supplementary Methods
F.1.1 Hydrogel formation
Hydrogel matrices were composed of IPNs of alginate (5 mg/ml final, Pronova LF20/40) and rBM matrix (Matrigel, 4.4 mg/ml final, Corning)[95, 497]. Calcium sulfate was used to crosslink alginate matrices (1 mM Ca2+ and 21 mM Ca2+ final concentrations). To form IPNs, alginate and rBM solutions were first mixed on ice, and then added to a cooled Luer lock syringe. A calcium sulfate slurry was diluted in DMEM/F12 basal medium and added to a second Luer lock syringe. The two syringes were connected with a coupler and the solutions were mixed by passing them back and forth six times. The IPNs were then deposited into a well plate pre-coated with rBM and incubated at 37 °C for 30 min to gel.
Polyacrylamide gels were prepared as the 2D soft substrates for ATAC-seq comparative analyses. Briefly, coverslips were cleaned with ethanol, immersed in 0.5% (3-aminopropyl)trimethoxysilane (in dH2O) at room temperature for 30 min, and washed with dH2O. Then, coverslips were treated with 0.5% glutaraldehyde in dH2O at room temperature for 30 min. A solution was prepared containing 3% acrylamide, 2% N,N’-methylenebis-acrylamide, 1:100 volume of 10% ammonium persulfate and 1:1,000 volume of 4N,N,N’,N’-tetramethylethylenediamine to produce 150 Pa polyacrylamide
311
gels. After gentle mixing, the solution was deposited on a Sigma-cote-treated glass plate, covered with the activated coverslips, and allowed to polymerize between the glass plate and coverslips. Once polymerization was completed, gels were gently detached from the plate. The surface of the gels was activated by adding a solution containing 1 mg/ml-1 sulphosuccinimidyl 6-(4’-azido-2’nitrophenylamino)hexanoate dissolved in 50 mM HEPES pH 8.5. The gels were then exposed to ultraviolet light (wavelength 365 nm, 4 mW/cm-2), washed with the HEPES solution and incubated in 100 µg/ml-1 of rBM in HEPES solution overnight at 4 degrees C. The gels were washed with PBS before use.
F.1.2 Matrix deformation calculation
To calculate matrix deformation, MCF10A cells were embedded in alginate–rBM IPNs containing 21 mM calcium and 7.5% FluoSpheres Carboxylate-Modified Microspheres, 0.2 µm, dark red fluorescent (660/680) (Thermo Fisher Scientific). At 7 d post encapsulation, the hydrogels were imaged every 30 min for 24 h to track microbead displacements. The acquired images were corrected for drift using an image registration plug-in from ImageJ. Then, matrix deformation was calculated by tracking the microbeads using a standard particle image velocimetry algorithm (PIVlab; open source code for MATLAB) with three passes (128×128, 64×64 and 32×32 pixel interrogation windows with 50% overlap)[464]. Microbead displacement for each time frame was calculated relative to the bead position from the initial time point, so the matrix deformation maps display cumulative displacements. Finally, bright-field images were overlaid over the matrix deformation maps.
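The cumulative-displacement convention described above can be illustrated with a minimal sketch. It assumes that bead positions are already available as an array of shape (frames, beads, 2); the array layout, the function name `cumulative_displacement`, and the toy coordinates are all hypothetical, and this is not the PIVlab cross-correlation procedure used in the study.

```python
import numpy as np

def cumulative_displacement(positions):
    """Displacement of each bead relative to its position in the first frame.

    positions: array of shape (n_frames, n_beads, 2) holding x, y coordinates
    (hypothetical layout, chosen for illustration only).
    Returns the displacement vectors and their magnitudes per frame and bead.
    """
    positions = np.asarray(positions, dtype=float)
    # Subtract the initial frame from every frame (broadcast over time).
    displacement = positions - positions[0]
    magnitude = np.linalg.norm(displacement, axis=-1)
    return displacement, magnitude

# Toy data: bead 0 drifts steadily, bead 1 barely moves.
track = np.array([
    [[0.0, 0.0], [10.0, 10.0]],
    [[3.0, 4.0], [10.0, 10.0]],
    [[6.0, 8.0], [10.0, 11.0]],
])
disp, mag = cumulative_displacement(track)
print(mag)
```

Because every frame is referenced to frame 0 rather than to the preceding frame, the magnitudes grow as deformation accumulates, which is the behaviour the deformation maps are described as showing.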
F.1.3 Encapsulation and cell culture
MCF10A mammary epithelial cells were obtained from ATCC and cultured according to established protocols[121]. DMEM/F12 basal medium (ThermoFisher Scientific) was supplemented with 5% horse serum (ThermoFisher Scientific), 1% penicillin/streptomycin (ThermoFisher), 20 ng/ml epidermal growth factor (Peprotech), 0.5 mg/ml hydrocortisone (Sigma), 100 ng/ml cholera toxin (Sigma) and 10 µg/ml insulin (Sigma). HME1 cells were obtained from ATCC and cultured in the medium described above without cholera toxin. MCF7 and MDA-MB-231 cell lines were obtained from ATCC and cultured in DMEM basal medium (ThermoFisher Scientific) supplemented with 10% fetal bovine serum (GE Healthcare) and 1% penicillin/streptomycin. Cells were encapsulated at 50,000 cells/ml final concentration in hydrogel matrices and cultured for 14 d.
F.1.4 Immunofluorescence, confocal imaging and analysis
Cells in hydrogel matrices were fixed in 4% paraformaldehyde for 45 min, and then washed twice in Dulbecco’s phosphate-buffered saline (DPBS) with Ca2+/Mg2+ for 15 min. The matrices were dehydrated in 30% sucrose solution in DPBS overnight, and then incubated in a 1:1 mixture of 30% sucrose and OCT solution (Fisher Scientific). The matrices were then frozen on dry ice and 40 µm cryosections were adhered to slides.
Slides were blocked in a solution of 10% goat serum (ThermoFisher Scientific), 1% bovine serum albumin (Sigma), 0.1% Triton X-100 (Sigma) and 0.3 M glycine (Sigma) for 1 h at room temperature. Primary antibodies were diluted in the blocking solution (1:100) and incubated at 4 °C overnight. Alexa Fluor 488-phalloidin (1:100 dilution, ThermoFisher Scientific) and 4’,6-diamidino-2-phenylindole (DAPI; 1 µg/ml) were diluted in the blocking solution and incubated for 1 h. Fluorescently conjugated secondary antibodies were incubated in blocking solution for 1 h at room temperature. The slides were then washed three times for 5 min in DPBS, and coverslips were applied with Prolong Gold antifade reagent (ThermoFisher Scientific). Slides were imaged on a Leica SP8 laser scanning confocal microscope with a ×63 objective. Antibodies used were anti-pFAK (Tyr397, Invitrogen no. 700255), anti-E-cadherin (BD Biosciences, no. 610181), anti-N-cadherin (BD Biosciences, no. 610920), anti-vimentin (Abcam, ab92547), anti-β4 integrin (ThermoFisher, 439-9B), Alexa Fluor 488 goat anti-mouse IgG2a (no. A21131), Alexa Fluor 488 goat anti-mouse IgG2b (no. A21141), Alexa Fluor 647 goat anti-mouse IgG1 (no. A21240) and Alexa Fluor 647 goat anti-rabbit (no. A21244).
F.1.5 Image analysis
A semi-automated image processing pipeline was created as an ImageJ macro to determine cluster roundness. The phalloidin signal intensity was used to segment cell clusters from background, and the Particle Analysis feature was used to outline clusters. The roundness metric available in ImageJ was used without modification. Invasive clusters were identified manually, on the basis of the presence of sharp protrusions emanating from the clusters. The area of MCF7 and MDA-MB-231 clusters was determined using the metric available in the ImageJ Particle Analysis feature.
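As a point of reference for the roundness metric mentioned above, ImageJ documents its "Round" shape descriptor as 4 × area / (π × major axis²), where the major axis comes from the ellipse fitted to the particle. The sketch below only evaluates that formula on idealized ellipses; the function names and the example dimensions are hypothetical, and it does not reproduce the segmentation macro used in the study.

```python
import math

def roundness(area, major_axis):
    """Roundness as documented for ImageJ: 4 * area / (pi * major_axis**2).

    area: particle area; major_axis: full length of the fitted ellipse's
    major axis, in matching length units.
    """
    if area <= 0 or major_axis <= 0:
        raise ValueError("area and major_axis must be positive")
    return 4.0 * area / (math.pi * major_axis ** 2)

def ellipse_area(semi_major, semi_minor):
    """Area of an ideal ellipse, used here only to build toy inputs."""
    return math.pi * semi_major * semi_minor

# A circle of radius 5 has roundness 1; a 4:1 ellipse has roundness 0.25.
circle = roundness(ellipse_area(5, 5), major_axis=10)
elongated = roundness(ellipse_area(8, 2), major_axis=16)
print(circle, elongated)
```

For an ideal ellipse the value reduces to the ratio of minor to major axis, so compact clusters score close to 1 and elongated or protrusive clusters score lower.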
Nuclear curvature was analysed using Bitplane Imaris software with the Lumen Curvature extension. Z-stacks of DAPI-stained nuclei were imported into Imaris and a 3D volume was rendered. The curvature extension determines the mean curvature for every vertex on the surface. Curvature magnitudes were plotted as colour-coded spheres on the surface of the 3D nucleus rendering.
F.1.6 TEM preparation, imaging and quantification
Hydrogel matrices for TEM were fixed in 4% paraformaldehyde and 2% glutaraldehyde in 0.1 M sodium cacodylate, pH 7.4. The samples were embedded in resin, sectioned, mounted on TEM grids and stained with uranyl acetate. The samples were imaged at ×2,000 and ×10,000 with a JEOL JEM1400 transmission electron microscope.
To quantify lamina-associated chromatin thickness, raw image intensity histograms were normalized and equalized, and then segmented. Seg3D software was used to generate connected components, representing the entirety of the lamina-associated chromatin. These masks were dilated and eroded, and then boundaries were identified. A MATLAB script was used to measure chromatin thickness for each inner boundary pixel, defined as the shortest distance from the inner boundary to the outer boundary.
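The thickness definition above (shortest distance from each inner boundary pixel to the outer boundary) can be sketched as a brute-force nearest-neighbour search. The original analysis used a MATLAB script that is not reproduced here; the function name `boundary_thickness` and the concentric-circle test data are assumptions made purely for illustration.

```python
import numpy as np

def boundary_thickness(inner_pts, outer_pts):
    """Shortest distance from every inner-boundary point to the outer boundary.

    inner_pts, outer_pts: arrays of shape (n, 2) and (m, 2) with pixel
    coordinates of the two boundaries (hypothetical inputs).
    Returns an array of n thickness values.
    """
    inner = np.asarray(inner_pts, dtype=float)
    outer = np.asarray(outer_pts, dtype=float)
    # Pairwise distances between all inner and outer points, shape (n, m).
    dists = np.linalg.norm(inner[:, None, :] - outer[None, :, :], axis=-1)
    return dists.min(axis=1)

# Toy boundaries: concentric circles of radius 10 and 13, so thickness is 3.
theta = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
inner_circle = 10.0 * np.column_stack([np.cos(theta), np.sin(theta)])
outer_circle = 13.0 * np.column_stack([np.cos(theta), np.sin(theta)])
thickness = boundary_thickness(inner_circle, outer_circle)
print(thickness.mean())
```

The pairwise matrix costs O(n × m) memory, which is acceptable for small boundaries; a distance transform or a k-d tree would be a natural substitute for large masks.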
F.1.7 Western blotting
Cells were extracted from matrices by incubating gels in ice-cold 50 mM EDTA in PBS for 10 min with pipette mixing. The suspensions were centrifuged at 500 × g at 4 °C for 10 min to pellet cells. The pellets were incubated in 1 ml of 0.25% trypsin/2.21 mM EDTA at 37 °C for 5 min to digest any remaining rBM, and then centrifuged at 500 × g for 10 min again. Pellets were lysed in RIPA lysis buffer with protease and phosphatase inhibitors (ThermoFisher Scientific) and the protein concentration was determined using the bicinchoninic acid assay. Samples were diluted in Laemmli Sample Buffer (Bio-Rad) to 2.5 µg/µl and 25 µg was loaded in 4-20%, 15-well gels. The gels were run for 35 min and the protein was then transferred to a nitrocellulose membrane (BioRad). The membrane was blocked in 5% non-fat milk for 1 h and then incubated with primary antibody overnight. A fluorescent secondary antibody against the primary antibody was added, and the incubation was continued for 1 h. The blots were imaged using a Licor Odyssey imaging system. The primary antibodies used were anti-Sp1 (Millipore, 07-645), anti-Sp1 (phospho Thr453, Abcam, ab37707), anti-H3K4me3 (Abcam, ab1012), anti-H3K9me3 (Abcam, ab8898), anti-AcH3 (Millipore, 06-599), anti-AcH4 (Millipore, 06-866), anti-HP1γ (Santa Cruz, sc-398562), anti-histone 3 (Cell Signaling Technologies, 4499) and anti-GAPDH (Abcam, ab181602). The secondary antibodies used were IRDye 680CW donkey anti-mouse (925-68072) or anti-rabbit (925-68073), and IRDye 800CW donkey anti-mouse (925-32212) or anti-rabbit (925-32213).
F.1.8 Quantitative polymerase chain reaction
Cells were extracted from matrices as described above. TRIzol (ThermoFisher Scientific) reagent was added and the cells were lysed by passing through a 30G syringe. RNA was isolated by phenol–chloroform extraction and RNA extraction columns (Green Bio). One microgram of RNA was reverse transcribed into cDNA using the High-Capacity Reverse Transcription Kit (Applied Biosystems). Fast SYBR green master mix was used, with the primers listed in Table F.6. Reactions were performed on an Applied Biosystems 7500 instrument.
F.2 Supplementary Figures
F.2.1 Supplementary Tables
Sample           Total read pairs  Overall alignment (%)  Reads after removing dups and chrM  chrM (%)  TSS score  Peaks
Soft 1           67130756          99.03                  39972784                            23.00     3.52       247467
Soft 2           76695213          99.10                  43352668                            28.00     5.22       206044
Soft 3           43536302          99.16                  30560188                            24.00     4.31       276420
Stiff 1          80573565          99.20                  24900167                            29.00     4.47       99018
Stiff 2          72414168          99.29                  25602115                            24.00     5.74       142860
Stiff 3          75932402          99.24                  24724955                            27.00     4.68       112322
Soft SAHA 1      93114529          98.28                  34238502                            55.00     10.52      276042
Soft SAHA 2      61702889          97.58                  24116864                            38.00     11.36      300000*
Soft SAHA 3      38183891          97.67                  19995608                            44.00     9.69       220089
Stiff SAHA 1     6677472           98.67                  31143242                            42.00     7.03       297894
Stiff SAHA 2     119271447.5       98.58                  59492254                            37.00     6.89       300000*
Stiff SAHA 3     78113134          98.80                  32630242                            51.00     8.82       289607
2D TCPS 1        51497846          99.00                  12564140                            68.00     7.28       242872
2D TCPS 2        64577877          98.81                  15524528                            64.00     5.67       285857
2D TCPS 3        45461292          98.89                  11757864                            66.00     6.62       212725
2D Soft 1        132540227         99.10                  47701860                            47.00     10.65      231195
2D Soft 2        108523671         99.21                  30192920                            58.00     13.69      230481
2D Soft 3        104265087         99.08                  46172076                            41.00     11.27      235843
Mammary Gland 1  83916062          98.65                  82971264                            26.00     13.58      228256
Mammary Gland 2  96778381          98.83                  89564088                            36.00     11.22      238131
Table F.1: ATAC-seq quality control metrics.

Table F.2: Optimal overlap and IDR thresholded peak metrics

Table F.3: Homer Genome Ontology for naïve overlap peak sets from soft and stiff matrices.

Table F.4: Disease ontology of Sp1 target genes.

Table F.5: KEGG pathway analysis of Sp1 target genes.

Table F.6: qPCR Primers

Figure F.1: a, Time sweep of soft and stiff alginate-rBM IPNs. b, Elastic moduli of soft and stiff matrices (n = 3, mean ± s.d.).

Figure F.2: Brightfield time-lapse images of MCF10A cells invading stiff matrix (top), with corresponding PIV-generated deformation maps (middle) and overlay of cell cluster with the deformation map (bottom). Scale bar represents 100 µm.

Figure F.3: a-b, Confocal imaging of immunofluorescent stains of MCF10A clusters in soft and stiff matrices for DAPI, F-actin (phalloidin), vimentin (a) and DAPI, E-cadherin, and N-cadherin (b).

Figure F.4: a, Brightfield images of HME1 cells in soft or stiff matrices treated with DMSO (vehicle) or indicated pharmacological inhibitor. b, Confocal images of cells in the indicated matrices and inhibitor groups. c, Quantification of cluster roundness shows a significant decrease in roundness for cells in stiff matrices, and significantly increased roundness upon inhibition of Sp1, class I HDACs, or PI3K. Three independent replicates were used for each condition, significance was determined by Kruskal-Wallis test followed by Dunn’s multiple testing correction (median ± 95% C.I.).

Figure F.5: a, Heatmap of differentially accessible regions upon SAHA treatment in stiff matrices (color scale represents row z-score). b, De novo motif logos, q value, and best matching motifs from MEME-ChIP for regions less accessible in the stiff SAHA group. c, De novo motif logos, q value, and best matching motifs from MEME-ChIP for regions more accessible in the stiff SAHA group. d, Heatmap of differentially accessible regions upon SAHA treatment in soft matrices (color scale represents row z-score). e, De novo motif logos, q value, and best matching motifs from MEME-ChIP for regions less accessible in the soft SAHA group. f, De novo motif logos, q value, and best matching motifs from MEME-ChIP for regions more accessible in the soft SAHA group.

Figure F.6: a, Homer motif analysis of significantly more accessible peaks in stiff matrices versus soft. De novo motif discovery scores and motif logo compared to Sp1 motif. b, De novo motif discovery scores for regions that undergo accessibility reversion upon SAHA treatment and motif logos for SAHA reverted regions and Sp1. c, MEME-ChIP de novo motif analysis for regions that did not revert in accessibility in SAHA-treated cells in stiff matrices.

Figure F.7: a-c, Representative western blots for pSp1(Thr453), Sp1, and loading control (p38 or GAPDH) for control cells (a), cells treated with PI3K inhibitor LY294002 (b), and cells treated with class I HDAC inhibitor SAHA (c) in soft or stiff matrices. d, Uncropped blots for cropped lanes displayed in a-c.

Figure F.8: a,d, Confocal imaging of MCF7 (a) or MDA-MB-231 (d) cells cultured for 7 days in soft or stiff matrices and treated with DMSO (vehicle) or mithramycin A. b,e, Quantification of cluster roundness shows a significant decrease in roundness for cells in stiff matrices and an increase upon Sp1 inhibition with mithramycin A. c,f, Quantification of cluster area shows a significant increase in size of clusters in stiff matrices and a significant decrease upon Sp1 inhibition. Three independent replicates were used for each condition, significance was determined for roundness by Kruskal-Wallis followed by Dunn’s multiple testing correction and for area by one-way ANOVA followed by Sidak’s multiple testing correction (median ± 95% C.I.).

Appendix G
Learning cis-regulatory principles of ADAR-based RNA editing from CRISPR-mediated mutagenesis
G.1 Supplementary Figures

Figure G.1 (previous page): a) Degenerate donor oligos are designed for a 10 nt region of the ECS in the TTYH2 substrate. The mutagenized region is highlighted in red and the editing site in blue. b) The distribution of editing level by the number of mutations from the results of the degenerate TTYH2-ECS library from (a). c to f) Reproducibility of the editing level measurement in different substrates. Each dot represents one designed variant. Editing level of each variant is compared. c) Two biological replicates for NEIL1. d) Two biological replicates with variants in the TTYH2 edit strand. e) Two biological replicates with variants in the TTYH2 ECS. f) Two biological replicates of variants in the AJUBA edit strand. g to h) Comparing different types of donor oligos for CRISPR knock-in efficiency of NEIL1 variants. i to j) Editing level measurement reproducibility using different versions of NEIL1 donor oligos.

Figure G.2: a-c) NEIL1. d-f) TTYH2. g) AJUBA. Outliers in the RNA coverage and gDNA coverage data were removed using the cumulative distribution function, discarding data points above the 99th percentile. The RNA coverage, gDNA coverage and editing level were quantile normalized for pair-wise comparison.
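The two preprocessing steps named in this caption can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code behind the figure: the function names, the use of `numpy.percentile` as the cutoff, and the standard mean-of-sorted-columns form of quantile normalization are all choices made here, and ties are broken by order rather than averaged.

```python
import numpy as np

def keep_below_percentile(values, pct=99.0):
    """Boolean mask that keeps values at or below the given percentile."""
    values = np.asarray(values, dtype=float)
    cutoff = np.percentile(values, pct)
    return values <= cutoff

def quantile_normalize(matrix):
    """Give every column the same distribution (mean of the sorted columns).

    matrix: rows are variants, columns are measurements such as RNA coverage,
    gDNA coverage and editing level (hypothetical layout).
    """
    m = np.asarray(matrix, dtype=float)
    order = np.argsort(m, axis=0)
    ranks = np.argsort(order, axis=0)            # rank of each entry in its column
    reference = np.sort(m, axis=0).mean(axis=1)  # shared target distribution
    return reference[ranks]

# Toy example with made-up numbers.
coverage = np.arange(1, 201)
mask = keep_below_percentile(coverage, pct=99.0)
toy = np.array([[5.0, 2.0], [1.0, 4.0], [3.0, 6.0]])
normalized = quantile_normalize(toy)
print(mask.sum(), normalized)
```

After normalization each column contains the same set of values, so pair-wise scatter plots compare rank structure rather than differences in scale between measurements.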

Figure G.3: a) Heatmap of editing levels from single- and double-mutations in the editing strand of TTYH2. b) Heatmap of editing levels from single mutations in the editing complementary sequence (ECS) of TTYH2. Editing level of WT TTYH2 is 0.31. c) Heatmap of editing levels from single- and double-mutations in the editing strand of AJUBA. Editing level of WT TTYH2 is 0.35. The z-score is calculated for each RNA library as described in Methods and the WT editing level z-score is 0. d) Comparing editing levels of variants with the mutation type. e) Comparing the normalized similarity score of the variants with mutation type. “Others” refer to mutation types other than single or double-mutations, such as indels or multiple mutations. ns, non-significant; , P ≤ 0.05; , P ≤ 0.001; , P ≤ 0.0001; by Wilcoxon test.

Figure G.4: a-b) Position-specific effects of TTYH2 (a) and AJUBA (b) single mutations. c) Comparison of the computationally predicted and experimentally inferred MFE structure of NEIL1 and TTYH2-ECS. A normalized similarity score means identical structures. d-e) Single mutations at different locations lead to different editing levels. f) Effects of compound mutations that break the base pairing in the 5’ and 3’ structure of NEIL1 (WT NEIL1 editing level is 0.64). g) Comparing the editing levels of variants with different editing site structure. “Others” means the editing site A is located in a structure other than the 1:1 mismatch internal loop. The A:A mismatch was not present in our RNA library.

Figure G.5: a) Minimum Free Energy (MFE). b) Ensemble Free Energy. c) MFE frequency. d) Ensemble Diversity.

Figure G.6: a) All Stem Length. b) 5’ Stem Length. c) Similarity Score (normalized). d) Probability of Active Conformation.

Figure G.7: a) Heatmap of one clade from the NEIL1 cluster shown in 5.14. b) The consensus structure of the cluster shown in a. c) The MFE structure of each variant in the cluster. The gray box (“not base paired”) in the consensus structure for a cluster indicates that at least one variant’s MFE structure within the cluster differs at this position.

Figure G.8: Clustering of TTYH2 with RNA editing levels.

Figure G.9

Figure G.10: a) Test set XGBoost predictions for each substrate from a model trained jointly on NEIL1, TTYH2, and AJUBA. b) SHAP values for the 20 most important features driving test set predictions. Features are ranked in order of predictive importance from most important (top) to least important (bottom). c) SHAP values for the variants with highest and lowest editing levels for each substrate. d) Percent contribution to XGBoost test set predictions from each feature subset. Feature subset composition is defined in Table G.1. e) Spearman and Pearson correlation between observed and XGBoost-predicted editing levels for NEIL1, TTYH2, and AJUBA variants within the test set. f) Mean absolute error (MAE), mean absolute percent error (MAPE), and root mean square error (RMSE) for XGBoost predictions on test set variants of NEIL1, TTYH2, and AJUBA. g) Area under the precision recall curve (auPRC) and area under the receiver operating characteristic curve (auROC) for test set variants of NEIL1, TTYH2, and AJUBA.

Figure G.11: Spearman correlation between observed and predicted ADAR editing levels in the held-out test set, as well as the percent of variance explained in the held-out test set, are shown for all combinations of training and test sets examined. Filled-in circles indicate that the particular substrate was included in a given training/test split.
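The regression metrics reported in Figures G.10 and G.11 (Pearson and Spearman correlation, MAE, MAPE, RMSE) have standard definitions, sketched below with NumPy only. The function names and the observed and predicted values are made up for illustration; the rank helper breaks ties by order instead of averaging them, which is a simplification relative to library implementations of Spearman correlation.

```python
import numpy as np

def ordinal_ranks(x):
    """0-based ranks; ties are broken by position (a deliberate simplification)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks

def regression_metrics(observed, predicted):
    """Correlation and error metrics between observed and predicted values."""
    y = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = p - y
    return {
        "pearson": np.corrcoef(y, p)[0, 1],
        "spearman": np.corrcoef(ordinal_ranks(y), ordinal_ranks(p))[0, 1],
        "mae": np.mean(np.abs(err)),
        # MAPE is undefined when an observed value is exactly zero.
        "mape": 100.0 * np.mean(np.abs(err) / np.abs(y)),
        "rmse": np.sqrt(np.mean(err ** 2)),
    }

# Made-up editing levels, for illustration only.
observed = [0.10, 0.35, 0.64, 0.80]
predicted = [0.15, 0.30, 0.60, 0.90]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Reporting a rank-based correlation alongside the error metrics separates whether variants are ordered correctly from whether the predicted editing levels are numerically close.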
G.2 Supplementary Tables
Table G.1: Features used with XGBoost for prediction of editing level.
Feature Grouping Description
MFE free energy struct overall/thermodynamics The minimum free energy of the RNA variant from RNAfold prediction.
probability active conf struct overall/thermodynamics Probability that the isoform takes on a WT-like active conformation (see Methods).
ensemble free energy struct overall/thermodynamics Ensemble free energy calculated by RNAfold.
MFE frequency struct overall/thermodynamics MFE frequency calculated by RNAfold.
ensemble diversity struct overall/thermodynamics Ensemble diversity calculated by RNAfold.
sim nor score struct overall/thermodynamics Similarity score indicating how similar the structure of the RNA variant is to the structure of the wild type.
num mutations mut overall Number of mutations in the RNA variant (relative to wild type).
mut exist mut other Are there mutations in the RNA, relative to wild type? (Yes or No).
mut type mut other Type of mutation (SNP, DNP, indel, None) in the RNA variant.
mut pos mut seq Base pair (numbering starting with 5’ end of RNA sequence) that is mutated.
mut site dist mut seq Distance between editing site and mutation position.
mut ref nt mut seq Genome/WT reference allele at mutation position.
mut nt mut seq Alternate allele at mutation position.
mut struct mut struct Structural element within the RNA variant where the mutation is introduced (Hairpin loop, Bulge, Internal loop, Stem, Multiloop).
mut ref struct mut struct Structural element within the WT RNA that encompasses base pair of mutation.
mut prev struct mut struct Structural element upstream (5’) of the mutation base.
mut next struct mut struct Structural element downstream (3’) of the mutation base.
mut same as site mut struct Is the editing site in the same structural element as the mutated base(s)?
all stem length struct overall/thermodynamics Sum of 5’ and 3’ stem length.
site struct site struct Structural element that encompasses the editing site.
site prev nt site seq Nucleotide upstream (5’) of the editing site.
site next nt site seq Nucleotide downstream (3’) of the editing site.
site prev struct site struct Structural element upstream (5’) of “site struct”.
site next struct site struct Structural element downstream (3’) of “site struct”.
site 1 1 site struct Is the editing site in a 1:1 internal loop?
site length site struct Length of the editing site structure (in sequence space).
site length stem site struct If the editing site is in a stem, this is the length of the stem. Otherwise, this is None.
site length hairpin site struct If the editing site is in a hairpin, this is the length of the hairpin. Otherwise, this is None.
site length bulge site struct If the editing site is in a bulge, this is the length of the bulge. Otherwise, this is None.
site length internal es site struct If the editing site is in an internal loop, this is the length of the internal loop on the same strand as the editing site. Otherwise, this is None.
site length internal ecs site struct If the editing site is in an internal loop, this is the length of the internal loop on the complementary strand to the editing site. Otherwise, this is None.
site 5prm cp hairpin site struct 5’ closing pair of editing site hairpin (NA if editing site not in hairpin).
site 5prm cp bulge site struct 5’ closing pair of editing site bulge (NA if editing site not in bulge).
site 3prm cp bulge site struct 3’ closing pair of editing site bulge (NA if editing site not in bulge).
site 5prm cp internal site struct 5’ closing pair of editing site internal loop (NA if editing site not in internal loop).
site 3prm cp internal site struct 3’ closing pair of editing site internal loop (NA if editing site not in internal loop).
u count upstream Number of structural features upstream of the editing site.
u all stem length upstream The number of base-pairs for all the stem structure upstream of the editing site.
u hairpin length upstream If the hairpin is upstream of the editing site, this is the length of the hairpin. Otherwise, this is None.
u1 exist upstream Does a structural feature exist upstream of the editing site?
u1 distance upstream Distance of the immediately adjacent upstream structural feature from the editing site.
u1 struct upstream The structural feature immediately upstream of the editing site (one of Hairpin loop, Bulge, Internal loop, Stem, Multiloop).
u1 length upstream Length (in base pair for stem) or size (the loop size or bulge size for loops and bulge) of the immediately upstream structural feature from the editing site.
u1 length stem upstream If the u1 structural feature is a stem, this is the length of the stem. Otherwise, this is None.
u1 length hairpin upstream If the u1 structural feature is a hairpin, this is the length of the hairpin. Otherwise, this is None.
u1 length bulge upstream If the u1 structural feature is a bulge, this is the length of the bulge. Otherwise, this is None.
u1 length internal es upstream If the u1 structural feature is an internal loop, this is the length of the loop on the same strand as the editing site. Otherwise, this is None.
u1 length internal ecs upstream If the u1 structural feature is an internal loop, this is the length of the loop on the complementary strand to the editing site. Otherwise, this is None.
u1 5prm cp hairpin upstream If the u1 structural feature is a hairpin, this is the 5’ closing pair of the hairpin. Otherwise, this is None.
u1 5prm cp bulge upstream If the u1 structural feature is a bulge, this is the 5’ closing pair of the bulge. Otherwise, this is None.
u1 3prm cp bulge upstream If the u1 structural feature is a bulge, this is the 3’ closing pair of the bulge. Otherwise, this is None.
u1 5prm cp internal upstream If the u1 structural feature is an internal loop, this is the 5’ closing pair of the internal loop. Otherwise, this is None.
u1 3prm cp internal upstream If the u1 structural feature is an internal loop, this is the 3’ closing pair of the internal loop. Otherwise, this is None.
u2 exist upstream Does a structural feature exist upstream of “u1”?
u2 distance upstream Distance of the u2 structural feature from the editing site.
u2 struct upstream The structural feature immediately upstream of u1 (one of Hairpin loop, Bulge, Internal loop, Stem, Multiloop).
u2 length upstream Length (in base pair for stem) or size (the loop size or bulge size for loops and bulge) of the immediately upstream structural feature from the editing site.
u2 length stem upstream If the u2 structural feature is a stem, this is the length of the stem. Otherwise, this is None.
u2 length hairpin upstream If the u2 structural feature is a hairpin, this is the length of the hairpin. Otherwise, this is None.
u2 length bulge upstream If the u2 structural feature is a bulge, this is the length of the bulge. Otherwise, this is None.
u2 length internal es upstream If the u2 structural feature is an internal loop, this is the length of the loop on the same strand as the editing site. Otherwise, this is None.
u2 length internal ecs upstream If the u2 structural feature is an internal loop, this is the length of the loop on the complementary strand to the editing site. Otherwise, this is None.
u2 5prm cp hairpin upstream If the u2 structural feature is a hairpin, this is the 5’ closing pair of the hairpin. Otherwise, this is None.
u2 5prm cp bulge upstream If the u2 structural feature is a bulge, this is the 5’ closing pair of the bulge. Otherwise, this is None.
u2 3prm cp bulge upstream If the u2 structural feature is a bulge, this is the 3’ closing pair of the bulge. Otherwise, this is None.
u2 5prm cp internal upstream If the u2 structural feature is an internal loop, this is the 5’ closing pair of the internal loop. Otherwise, this is None.
u2 3prm cp internal upstream If the u2 structural feature is an internal loop, this is the 3’ closing pair of the internal loop. Otherwise, this is None.
u3 exist upstream Does a structural feature exist upstream of “u2”?
u3 distance upstream Distance of the u3 structural feature from the editing site.
u3 struct upstream The structural feature immediately upstream of u2 (one of Hairpin loop, Bulge, Internal loop, Stem, Multiloop).
u3 length upstream Length (in base pair for stem) or size (the loop size or bulge size for loops and bulge) of the immediately upstream structural feature from the editing site.
u3 length stem upstream If the u3 structural feature is a stem, this is the length of the stem. Otherwise, this is None.
u3 length hairpin upstream If the u3 structural feature is a hairpin, this is the length of the hairpin. Otherwise, this is None.
u3 length bulge upstream If the u3 structural feature is a bulge, this is the length of the bulge. Otherwise, this is None.
u3 length internal es upstream If the u3 structural feature is an internal loop, this is the length of the loop on the same strand as the editing site. Otherwise, this is None.
u3 length internal ecs upstream If the u3 structural feature is an internal loop, this is the length of the loop on the complementary strand to the editing site. Otherwise, this is None.
u3 5prm cp hairpin upstream If the u3 structural feature is a hairpin, this is the 5’ closing pair of the hairpin. Otherwise, this is None.
u3 5prm cp bulge upstream If the u3 structural feature is a bulge, this is the 5’ closing pair of the bulge. Otherwise, this is None.
u3 3prm cp bulge upstream If the u3 structural feature is a bulge, this is the 3’ closing pair of the bulge. Otherwise, this is None.
u3 5prm cp internal upstream If the u3 structural feature is an internal loop, this is the 5’ closing pair of the internal loop. Otherwise, this is None.
u3 3prm cp internal upstream If the u3 structural feature is an internal loop, this is the 3’ closing pair of the internal loop. Otherwise, this is None.
d count downstream Number of structural features downstream of the editing site.
d all stem length downstream The number of base-pairs for all the stem structure downstream of the editing site.
d1 exist downstream Does a structural feature exist downstream of the editing site?
d1 distance downstream Distance of the immediately adjacent downstream structural feature from the editing site.
d1 struct downstream The structural feature immediately downstream of the editing site (one of Hairpin loop, Bulge, Internal loop, Stem, Multiloop).
d1 length downstream Length (in base pair for stem) or size (the loop size or bulge size for loops and bulge) of the immediately downstream structural feature from the editing site.
d1 length stem downstream If the d1 structural feature is a stem, this is the length of the stem. Otherwise, this is None.
d1 length hairpin downstream If the d1 structural feature is a hairpin, this is the length of the hairpin. Otherwise, this is None.
d1 length bulge downstream If the d1 structural feature is a bulge, this is the length of the bulge. Otherwise, this is None.
d1 length internal es downstream If the d1 structural feature is an internal loop, this is the length of the loop on the same strand as the editing site. Otherwise, this is None.
d1 length internal ecs downstream If the d1 structural feature is an internal loop, this is the length of the loop on the complementary strand to the editing site. Otherwise, this is None.
d1 5prm cp hairpin downstream If the d1 structural feature is a hairpin, this is the 5’ closing pair of the hairpin. Otherwise, this is None.
d1 5prm cp bulge downstream If the d1 structural feature is a bulge, this is the 5’ closing pair of the bulge. Otherwise, this is None.
d1 3prm cp bulge downstream If the d1 structural feature is a bulge, this is the 3’ closing pair of the bulge. Otherwise, this is None.
d1 5prm cp internal downstream If the d1 structural feature is an internal loop, this is the 5’ closing pair of the internal loop. Otherwise, this is None.
d1 3prm cp internal downstream If the d1 structural feature is an internal loop, this is the 3’ closing pair of the internal loop. Otherwise, this is None.
d2 exist downstream Does a structural feature exist downstream of “d1”?
d2 distance downstream Distance of the d2 structural feature from the editing site.
d2 struct downstream The structural feature immediately downstream of d1 (one of Hairpin loop, Bulge, Internal loop, Stem, Multiloop).
d2 length downstream Length (in base pair for stem) or size (the loop size or bulge size for loops and bulge) of the immediately downstream structural feature from the editing site.
d2 length stem downstream If the d2 structural feature is a stem, this is the length of the stem. Otherwise, this is None.
d2 length hairpin downstream If the d2 structural feature is a hairpin, this is the length of the hairpin. Otherwise, this is None.
d2 length bulge downstream If the d2 structural feature is a bulge, this is the length of the bulge. Otherwise, this is None.
d2 length internal es downstream If the d2 structural feature is an internal loop, this is the length of the loop on the same strand as the editing site.
Otherwise, this is None.
d2 length internal ecs downstream If the d2 structural feature is an internal loop, this is the length of the loop on the complementary strand to the editing site. Otherwise, this is None.
d2 5prm cp hairpin downstream If the d2 structural feature is a hairpin, this is the 5’ closing pair of the hairpin.
Otherwise, this is None.
d2 5prm cp bulge downstream If the d2 structural feature is a bulge, this is the 5’ closing pair of the bulge.
Otherwise, this is None.
d2 3prm cp bulge downstream If the d2 structural feature is a bulge, this is the 3’ closing pair of the bulge.
Otherwise, this is None.
d2 5prm cp internal downstream If the d2 structural feature is an internal loop, this is the 5’ closing pair of the internal loop. Otherwise, this is None.
d2 3prm cp internal downstream If the d2 structural feature is an internal loop, this is the 3’ closing pair of the internal loop. Otherwise, this is None.
d3 exist downstream Does a structural feature exist downstream of “d2”?
d3 distance downstream Distance of the d3 structural feature from the editing site.
d3 struct downstream The structural feature immediately downstream of d2 (one of Hairpin loop, Bulge, Internal loop, Stem, Multiloop).
d3 length downstream Length (in base pairs for a stem) or size (loop size or bulge size for loops and bulges) of the d3 structural feature downstream of the editing site.
d3 length stem downstream If the d3 structural feature is a stem, this is the length of the stem. Otherwise, this is None.
d3 length hairpin downstream If the d3 structural feature is a hairpin, this is the length of the hairpin. Otherwise, this is None.
d3 length bulge downstream If the d3 structural feature is a bulge, this is the length of the bulge. Otherwise, this is None.
d3 length internal es downstream If the d3 structural feature is an internal loop, this is the length of the loop on the same strand as the editing site.
Otherwise, this is None.
d3 length internal ecs downstream If the d3 structural feature is an internal loop, this is the length of the loop on the complementary strand to the editing site. Otherwise, this is None.
d3 5prm cp hairpin downstream If the d3 structural feature is a hairpin, this is the 5’ closing pair of the hairpin.
Otherwise, this is None.
d3 5prm cp bulge downstream If the d3 structural feature is a bulge, this is the 5’ closing pair of the bulge.
Otherwise, this is None.
d3 3prm cp bulge downstream If the d3 structural feature is a bulge, this is the 3’ closing pair of the bulge.
Otherwise, this is None.
d3 5prm cp internal downstream If the d3 structural feature is an internal loop, this is the 5’ closing pair of the internal loop. Otherwise, this is None.
d3 3prm cp internal downstream If the d3 structural feature is an internal loop, this is the 3’ closing pair of the internal loop. Otherwise, this is None.
Table G.2: XGBoost prediction performance on the training, validation, and test splits. Performance metrics (% Variance explained, Spearman R, Pearson R, MAE, MAPE, RMSE, auPRC, auROC) are provided for models trained within-substrate, jointly across substrates, and cross-substrate (i.e. trained on two of the substrates, and tested on the third).
Performance columns: R^2, % Variance Explained; MAE, Mean Absolute Error; MAPE, Mean Absolute Percent Error; RMSE, Root Mean Square Error; auPRC, area under Precision Recall Curve; auROC, area under Receiver Operating Characteristic Curve.
Training Set           Test Set               Test Set Substrate  R^2    Spearman R  Pearson R  MAE   MAPE  RMSE  auPRC  auROC
AJUBA                  AJUBA                  AJUBA               0.75   0.90        0.88       0.01  0.26  0.03  0.93   0.99
NEIL1 + TTYH2 + AJUBA  NEIL1 + TTYH2 + AJUBA  AJUBA               0.28   0.72        0.71       0.02  0.48  0.05  0.93   0.99
NEIL1 + TTYH2          AJUBA                  AJUBA               -0.43  0.66        0.55       0.15  4.54  0.19  0.29   0.91
NEIL1                  AJUBA                  AJUBA               -0.29  0.68        0.66       0.20  6.22  0.24  0.69   0.96
TTYH2                  AJUBA                  AJUBA               -0.13  0.54        0.43       0.15  5.50  0.17  0.45   0.85
NEIL1                  NEIL1                  NEIL1               0.88   0.93        0.94       0.05  0.20  0.09  0.97   0.97
NEIL1 + TTYH2 + AJUBA  NEIL1 + TTYH2 + AJUBA  NEIL1               0.83   0.90        0.91       0.06  0.32  0.10  0.93   0.96
TTYH2 + AJUBA          NEIL1                  NEIL1               0.16   0.47        0.53       0.20  0.94  0.27  0.84   0.71
TTYH2                  NEIL1                  NEIL1               0.05   0.31        0.26       0.22  1.28  0.28  0.50   0.69
AJUBA                  NEIL1                  NEIL1               0.21   0.64        0.59       0.20  0.73  0.27  0.83   0.83
TTYH2                  TTYH2                  TTYH2               0.68   0.91        0.89       0.03  0.37  0.05  0.81   0.95
NEIL1 + TTYH2 + AJUBA  NEIL1 + TTYH2 + AJUBA  TTYH2               0.73   0.88        0.86       0.04  0.60  0.07  0.83   0.94
NEIL1 + AJUBA          TTYH2                  TTYH2               -1.22  0.13        0.09       0.21  3.62  0.25  0.44   0.56
NEIL1                  TTYH2                  TTYH2               -0.35  0.46        0.40       0.18  3.51  0.23  0.59   0.75
AJUBA                  TTYH2                  TTYH2               0.00   0.35        0.29       0.12  0.75  0.16  0.53   0.64
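
The regression metrics reported in Table G.2 can be illustrated with a minimal pure-Python sketch. This is not the code used to produce the table: the function names are hypothetical, MAPE is assumed to be expressed as a fraction of the true value, and R^2 is assumed to be the coefficient of determination (which is negative when predictions are worse than predicting the mean, as in several cross-substrate rows). auPRC and auROC require binarized labels and are not sketched here.

```python
import math


def _mean(xs):
    return sum(xs) / len(xs)


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = _mean(x), _mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def _average_ranks(xs):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(_average_ranks(x), _average_ranks(y))


def r_squared(y_true, y_pred):
    """Coefficient of determination (assumed definition of R^2)."""
    m = _mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def regression_report(y_true, y_pred):
    """Collect the regression metrics for one train/test combination."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    return {
        "r2": r_squared(y_true, y_pred),
        "spearman_r": spearman_r(y_true, y_pred),
        "pearson_r": pearson_r(y_true, y_pred),
        "mae": _mean([abs(e) for e in errors]),
        # Assumption: MAPE as a fraction; undefined if any true value is 0.
        "mape": _mean([abs(e) / abs(t) for e, t in zip(errors, y_true)]),
        "rmse": math.sqrt(_mean([e * e for e in errors])),
    }
```

Calling `regression_report(observed, predicted)` on held-out values for one substrate would yield one row's worth of regression metrics under these assumptions.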

Table G.3: Relative feature contributions are normalized to sum to 100 for the full set of features. A blank cell indicates that the feature was not used to train a given model due to a lack of training and/or validation examples for that feature/substrate combination. Normalized SHAP scores and normalized F scores from XGBoost are provided.
Columns: Feature; AJUBA normalized mean abs SHAP; NEIL1 normalized mean abs SHAP; TTYH2 normalized mean abs SHAP; SHAP Rank; AJUBA normalized F score; NEIL1 normalized F score; TTYH2 normalized F score
num mutations 18.80 3.56 46.38 1 2.51 1.06 3.34
site 1 1:A:C 14.61 25.36 1.54 2 1.11 0.83 0.29
probability active conf 19.55 10.16 7.47 3 6.81 6.78 5.69
sim nor score 8.72 2.72 3.00 4 5.22 7.70 8.34
minimum free energy 2.82 4.47 4.97 5 14.02 16.97 15.80
mut pos 4.31 5.24 1.57 6 10.87 11.80 5.89
mfe frequency 3.78 3.26 3.29 7 16.05 12.96 15.11
ensemble diversity 3.20 3.87 2.67 8 14.40 7.93 9.52
ensemble free energy 4.04 2.25 0.96 9 6.43 5.76 7.07
site length internal ecs 4.15 0.00 10 0.68 0.00 0.00
site 5prm cp internal:C:G 0.00 0.00 5.67 11 0.05 0.00 0.29
d3 length 1.31 3.35 0.57 12 1.64 1.80 2.36
mut ref nt:C 0.82 2.50 13 1.06 0.92 0.00
d1 length 0.22 4.10 0.33 14 0.48 0.78 0.59
d2 3prm cp bulge:G:C 1.55 15 0.00 0.18 0.00
mut prev struct:E 1.41 16 0.00 0.41 0.00
site 5prm cp bulge:C:G 1.33 17 0.00 0.18 0.00
site next nt:C 0.32 2.33 18 0.00 0.55 0.59
site 3prm cp internal:C:G 0.01 2.58 19 0.10 0.23 0.00
u2 3prm cp bulge:G:U 0.03 2.55 20 0.39 0.14 0.00
u1 length 2.86 0.12 0.52 21 1.01 0.46 2.26
u3 length stem 1.07 1.24 22 0.00 0.51 1.08
u3 length 1.49 0.81 23 0.00 2.58 2.16
mut type:indel 0.00 2.28 24 0.00 0.00 0.69
mut nt:A 1.31 0.51 1.20 25 0.82 1.01 1.37
u all stem length 0.47 0.50 1.87 26 0.43 0.41 0.39
d all stem length 0.28 1.49 0.65 27 0.19 1.01 0.69
d2 length internal es 0.02 1.27 28 0.34 0.00 0.10
u count 0.01 0.61 1.23 29 0.14 0.23 0.49
mut ref nt:U 0.61 30 1.11 0.00 0.00
site next nt:G 1.23 0.00 31 0.00 0.18 0.00
site 3prm cp internal:G:U 1.16 0.00 32 0.24 0.00 0.00
site length internal es 1.09 0.07 33 0.39 0.00 0.10
u2 5prm cp internal:G:C 0.65 0.47 34 0.00 0.09 0.59
u3 distance 0.01 0.96 0.63 35 0.10 0.51 0.79
site prev nt:C 0.50 0.54 36 0.00 0.18 0.88
all stem length 0.11 1.41 0.02 37 0.72 0.83 0.39
mut nt:C 0.19 0.07 1.23 38 1.26 0.97 1.37
d2 5prm cp bulge:G:C 0.00 1.38 0.09 39 0.00 0.14 0.29
d3 distance 0.45 0.41 0.49 40 0.39 0.65 0.69
u2 length internal es 0.80 0.01 41 0.00 0.69 0.10
u2 5prm cp internal:U:A 0.39 42 0.00 0.00 0.10
mut nt:G 0.13 0.63 43 0.39 0.69 0.00
d count 0.36 0.24 0.53 44 0.63 0.23 1.08
d2 5prm cp internal:G:C 0.79 0.12 0.17 45 0.63 0.23 0.29
site next nt:U 0.31 0.33 46 0.58 0.00 0.29
mut ref nt:G 0.39 0.23 47 0.68 1.11 0.00
site 5prm cp internal:U:A 0.31 48 0.10 0.00 0.00
u2 length internal ecs 0.43 0.17 49 0.00 0.37 0.59
u2 length 0.40 0.28 0.15 50 0.10 0.41 0.49
u2 3prm cp bulge:C:G 0.53 0.01 51 0.00 0.28 0.29
mut nt:U 0.06 0.18 0.56 52 0.39 1.34 1.08
d2 length internal ecs 0.31 0.17 53 0.34 0.00 0.29
mut struct:B 0.46 0.01 54 0.00 0.23 0.20
u hairpin length 0.46 0.00 55 0.77 0.00 0.00
u2 3prm cp bulge:A:U 0.19 56 0.00 0.09 0.00
u2 5prm cp internal:A:U 0.18 57 0.00 0.00 0.59
site length 0.08 0.04 0.31 58 0.24 0.28 0.39
d2 5prm cp internal:U:A 0.12 0.13 59 0.14 0.00 0.20
site prev nt:A 0.12 60 0.10 0.00 0.00
u2 3prm cp internal:C:G 0.12 61 0.00 0.00 0.49
d2 5prm cp internal:A:U 0.11 62 0.00 0.00 0.49
mut next struct:I 0.22 0.12 0.00 63 0.19 0.51 0.00
d2 3prm cp internal:G:U 0.11 64 0.00 0.00 0.29
mut prev struct:I 0.05 0.25 0.00 65 0.14 0.51 0.00
u1 length stem 0.00 0.04 0.25 66 0.00 0.14 0.79
mut struct:I 0.17 0.11 0.00 67 1.11 0.51 0.10
d2 3prm cp bulge:U:A 0.01 0.18 68 0.00 0.05 0.49
d2 3prm cp bulge:A:U 0.09 69 0.14 0.00 0.00
mut ref struct:S 0.03 0.25 0.00 70 0.10 0.28 0.00
site prev nt:G 0.18 0.00 71 0.00 0.09 0.00
u2 3prm cp internal:U:A 0.09 72 0.00 0.00 0.10
site next nt:A 0.09 73 0.48 0.00 0.00
u2 3prm cp internal:U:G 0.08 74 0.00 0.14 0.00
site 3prm cp internal:A:U 0.08 0.09 75 0.34 0.00 0.49
u2 struct:I 0.00 0.24 0.00 76 0.05 0.18 0.00
d2 3prm cp internal:C:G 0.00 0.13 77 0.10 0.18 0.00
u2 3prm cp internal:G:U 0.06 78 0.00 0.09 0.00
site 3prm cp bulge:G:C 0.01 0.11 79 0.00 0.05 0.10
site 1 1:A:G 0.01 0.10 80 0.24 0.23 0.00
u2 3prm cp internal:G:C 0.00 0.14 0.00 81 0.05 0.18 0.00
d3 length stem 0.00 0.15 0.00 82 0.00 0.14 0.00
mut struct:H 0.05 83 0.58 0.00 0.00
mut prev struct:S 0.12 0.02 0.00 84 0.14 0.09 0.00
d2 3prm cp bulge:C:G 0.00 0.02 0.12 85 0.00 0.05 0.39
d2 length 0.06 0.07 0.00 86 0.24 0.23 0.00
d2 3prm cp internal:G:C 0.01 0.12 0.00 87 0.10 0.32 0.00
d2 5prm cp bulge:G:U 0.04 88 0.00 0.18 0.00
mut struct:S 0.00 0.12 0.00 89 0.10 0.09 0.00
d2 length bulge 0.04 90 0.00 0.18 0.00
mut exist 0.08 0.00 91 0.82 0.00 0.00
mut prev struct:B 0.04 92 0.00 0.41 0.00
mut next struct:B 0.00 0.12 0.00 93 0.00 0.18 0.00
site 3prm cp internal:G:C 0.02 0.01 0.09 94 0.05 0.05 0.10
d2 struct:B 0.00 0.00 0.10 95 0.05 0.05 0.20
d2 distance 0.00 0.10 0.00 96 0.00 0.46 0.00
u2 5prm cp hairpin:G:C 0.03 97 0.19 0.00 0.00
u2 length hairpin 0.03 98 0.14 0.00 0.00
d2 5prm cp internal:U:G 0.00 0.06 99 0.00 0.00 0.10
site 3prm cp internal:U:G 0.03 100 0.00 0.14 0.00
site struct:I 0.00 0.01 0.06 101 0.00 0.09 0.20
d2 5prm cp internal:C:G 0.00 0.05 102 0.00 0.28 0.00
d2 5prm cp bulge:C:G 0.01 0.03 103 0.10 0.14 0.00
d2 struct:I 0.00 0.06 0.00 104 0.00 0.09 0.00
site struct:S 0.00 0.06 0.00 105 0.00 0.09 0.00
site 5prm cp bulge:G:C 0.01 106 0.00 0.00 0.10
u2 struct:H 0.02 0.00 107 0.05 0.00 0.00
u2 3prm cp bulge:G:C 0.01 108 0.14 0.00 0.00
site struct:B 0.02 0.00 109 0.00 0.18 0.00
mut next struct:H 0.01 0.01 110 0.10 0.41 0.00
d1 length stem 0.00 0.02 0.00 111 0.00 0.09 0.00
site prev struct:B 0.01 112 0.00 0.00 0.10
u2 5prm cp internal:C:G 0.00 113 0.00 0.00 0.10
u2 3prm cp internal:A:U 0.00 114 0.00 0.00 0.10
u2 struct:B 0.00 0.01 0.00 115 0.05 0.14 0.00
u2 5prm cp bulge:U:G 0.00 116 0.00 0.05 0.00
u2 5prm cp hairpin:U:G 0.00 117 0.05 0.00 0.00
site 5prm cp internal:U:G 0.00 118 0.00 0.00 0.10
d2 3prm cp internal:U:A 0.00 119 0.10 0.00 0.00
Total 100.00 100.00 100.00 100.00 100.00 100.00
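
The normalization described in the caption of Table G.3 can be sketched as follows. This is a minimal illustration rather than the analysis code: the function names and the use of `None` to represent a blank cell (a feature not used for a given model) are assumptions made for the example.

```python
def mean_abs_per_feature(shap_rows):
    """Mean absolute SHAP value per feature.

    shap_rows: one list per example, holding one SHAP value per feature.
    """
    n_examples = len(shap_rows)
    n_features = len(shap_rows[0])
    return [
        sum(abs(row[j]) for row in shap_rows) / n_examples
        for j in range(n_features)
    ]


def normalize_to_100(scores):
    """Rescale importance scores so that the used features sum to 100.

    scores: dict mapping feature name to a non-negative score, or to None
    when the feature was not used by the model (hypothetical convention
    standing in for a blank table cell).
    """
    total = sum(v for v in scores.values() if v is not None)
    if total <= 0:
        raise ValueError("at least one feature needs a positive score")
    return {
        name: None if v is None else 100.0 * v / total
        for name, v in scores.items()
    }
```

The same rescaling applies to either importance measure (mean absolute SHAP or XGBoost F score), since both are non-negative per-feature totals.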

Appendix H
Transient relief from AP-1 epigenetic roadblock augments reprogramming to pluripotency
H.1 Supplemental Figures

a, Three of the fifteen differential ATAC-seq trajectories identified. b, Chromatin states in the fibroblast and ES cell at the dynamic peaks identified in a. c, Motif enrichment at the genomic regions identified in a, plotted as fold-change in motif occurrence relative to all differential peaks in the time course.

Pairwise Pearson correlations of motif enrichments over 15 ATAC-seq trajectories were calculated for 64 transcription factors and plotted as a heatmap after hierarchical clustering. Red asterisks denote motifs of interest to this study. A group of highly correlated GC-rich motifs is annotated with a bracket.

Figure H.3 (previous page): a, Flow cytometry data for RFP under the control of the Tet-On promoter in human fibroblasts after 12 and 24 hours of doxycycline (dox). RFP is fused with a cleavable linker to dominant-negative AP-1 (dnAP-1). b, Mean ATAC-seq signal at peaks mapping to the human genome that do not contain an AP-1 site (TGA G/C TCA) in heterokaryons at 16 hours post-fusion. Signal is plotted from cells with or without continuous exposure to dox throughout heterokaryon formation. c, Same as b, except signal is plotted at peaks with an AP-1 site centered at the motif. d, Time course of FOSL1 expression in human fibroblasts by qRT-PCR after induction with dnAP-1. Gene expression is normalized to housekeeping genes GAPDH and RPLP0 and plotted relative to the sample not treated with dox. Error bars show mean ± s.e.m. (n=3 technical replicates). e, Same as d, except showing AP-1 target gene SERP1 expression. f, Human NANOG gene expression in heterokaryons 48 hours post-fusion following induction of dnAP-1 either before fusion, after fusion, or throughout. Gene expression is normalized to housekeeping genes GAPDH and RPLP0 and plotted relative to the sample without dox treatment. Error bars show mean ± s.e.m. (n=4 biological replicates). P-values calculated as a two-tailed Student’s t-test comparing the marked sample against the control (no dox). g, Same as f, except showing human LIN28A expression (n=4 biological replicates). h, Human OCT4 gene expression in heterokaryons 16 hours post-fusion with or without induction of dnAP-1. Expression is normalized to housekeeping genes GAPDH and RPLP0 and plotted relative to the sample without dox treatment. Error bars show mean ± s.e.m. (n=5 biological replicates). P-values calculated as a two-tailed Student’s t-test

Figure H.4 (previous page): a, Diagram of the human OCT4 locus with arrows marking the targeting sites of the guide RNAs. ATAC and RNA-seq tracks are shown for reference. b, Human fibroblasts were transduced with dCas9-KRAB and a specific guide RNA as denoted for each sample. OCT4 gene expression was measured by qRT-PCR in heterokaryons 48 hours post-fusion. Expression is normalized to housekeeping genes GAPDH and RPLP0 and plotted relative to the sample receiving a non-targeting guide RNA (sg Scr). Error bars show mean ± s.e.m. (n=2 biological replicates). P-values calculated as a two-tailed Student’s t-test comparing each sample to sg Scr.
H.2 Supplementary Methods
H.2.1 Heterokaryon generation and isolation
Generation of heterokaryons by cell fusion was described previously[57]. Briefly, 5x10^5 human MRC5 fibroblasts (ATCC CCL171) and 3x10^6 GFP+ mouse ESCs were co-cultured overnight in mouse embryonic stem cell medium (KO DMEM with 10% FBS, 10% KSR, LIF, supplemented with glutamine, non-essential amino acids, 0.1mM beta-mercaptoethanol, and antibiotics, sterile filtered). After fusion with PEG 1500 (Sigma 10783641001), heterokaryons were isolated by FACS, sorting for cells double positive for human CD44 (Clone BJ18 from BioLegend, cat 338806) and GFP.
H.2.2 RNA extraction and qRT-PCR
Cells were resuspended in RLT buffer from Qiagen and RNA was isolated with DNase treatment according to the RNeasy kit protocol (74004). Single-strand cDNA synthesis was performed using SuperScript Reverse Transcriptase IV from Thermo Scientific, according to the manufacturer’s protocol. For qRT-PCR, cDNA was mixed with primers and Sybr Green Master Mix (Thermo 4385612) according to the manufacturer’s protocol. Samples were normalized to the average Ct value of two housekeeping genes: GAPDH and RPLP0. Data are presented as the mean ± S.E.M. Statistical significance between samples was calculated using an unpaired t-test on biological replicates assuming two-tailed distributions.
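
The normalization to the two housekeeping genes can be sketched as below. The text states only that samples were normalized to the average Ct of GAPDH and RPLP0 and plotted relative to a control sample; the 2^-ΔΔCt conversion with an assumed doubling per cycle, and all function names, are assumptions made for this illustration and may differ from the calculation actually used.

```python
import math
from statistics import stdev


def delta_ct(ct_target, ct_gapdh, ct_rplp0):
    """Target Ct minus the average Ct of the two housekeeping genes."""
    return ct_target - (ct_gapdh + ct_rplp0) / 2.0


def relative_expression(sample_dct, control_dct):
    """Fold change versus the control sample.

    Assumption: 2^-ddCt, i.e. perfect doubling of product per cycle.
    """
    return 2.0 ** -(sample_dct - control_dct)


def sem(values):
    """Standard error of the mean across replicates."""
    return stdev(values) / math.sqrt(len(values))
```

For example, a sample with a ΔCt two cycles lower than the control would be reported as four-fold higher expression under this assumption.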
H.2.3 ATAC-seq library generation
ATAC-seq libraries were prepared as described13. Briefly, 20,000-50,000 heterokaryons were sorted and gently permeabilized, and nuclei were resuspended in transposase (Illumina Nextera DNA kit FC121-1030) for 30 minutes at 37°C.
H.2.4 Lentivirus production
Virus production was performed as previously described[57].
H.2.5 Cas9 experiments in heterokaryons
The guide RNA target sequences were designed with help from the CRISPRscan software, and cloned into a dCas9-KRAB vector. For dCas9 only (no KRAB), the same vector was used, but the KRAB domain was removed by standard subcloning techniques. Fibroblasts were transduced and selected for vector expression by puromycin. Heterokaryon generation proceeded as described above.

a, Representative example of gene expression in heterokaryons 48 hours post-fusion after transfection of fibroblasts with siRNA prior to fusion. Expression of the denoted human gene is measured by qRT-PCR and is normalized to housekeeping genes GAPDH and RPLP0. For each gene, the expression is plotted relative to the sample receiving a non-targeting siRNA (siScr). Error bars show s.e.m. (n=3 technical replicates). b, Representative example of MBD3 gene expression in heterokaryons 48 hours post-fusion after nucleofection of fibroblasts with siRNA and vector as denoted prior to fusion. Expression is measured by qRT-PCR, normalized to housekeeping genes GAPDH and RPLP0, and plotted relative to the sample receiving a non-targeting siRNA (siScr) and transfected with a control vector expressing mCherry (Ctrl). Error bars show s.e.m. (n=3 technical replicates). c, Flow cytometry of human fibroblasts labeled with anti-phospho-JUN (anti-pJUN) either with or without the expression vector for JNK1 fused to Mkk7 (constitutive JNK).
H.2.6 DNA constructs
The backbone for the dnAP-1 vector is the TRIPZ inducible lentiviral construct from Dharmacon. dnAP-1 was subcloned as a fusion protein with RFP separated by the sequence of the T2A cleavable peptide. Doxycycline was added at a final concentration of 1µg/ml.

Appendix I
Dissecting Murine Muscle Stem Cell Aging Through Regeneration Using Integrative Genomic Analysis
I.1 Supplementary Figures

Figure I.1: Related to Figure 2. A) Hematoxylin and Eosin (top) and embryonic myosin/laminin immunohistochemistry (bottom) stain of TA muscle harvested from young and aged mice 7 days post BaCl2 injection. n = 4 mice per group. * denotes p ≤ 0.05 by two-sample t-test. B) Immunostaining of Pax7 expression in FACS-enriched muscle stem cells harvested from young muscles before injury and 3 days post injury. Control images are from FACS-enriched cells not stained with anti-Pax7 primary antibody, to validate the signal seen in WT images. n = 300 cells from two mice, sorted separately, per condition. Bars show mean ± standard deviation of two mice. Ns denotes not significant by Wilcoxon rank-sum test. C) Scatter plot of gene expression levels in muscle stem cells sorted from young wild type mice compared to an age- and gender-matched Pax7 reporter mouse. D) Heatmap of gene expression profiles for genes associated with muscle stem cell quiescence plotted as transcript per million (TPM) values on a log2 scale. E) Time series expression values of myogenic genes (Myogenin-MyoG, Myosin Heavy Chain 3-Myh3, Myomaker-Tmem8c) plotted as TPM for each isolation before and after injury for young (Y) and aged (A) MuSCs.

Figure I.2: A) Representative immunofluorescence (IF) staining of Ki67 for young and aged muscle stem cells following 3 days of treatment with retinoic acid (+RA) or DMSO alone (-RA). Scale bars represent 50 µm in images of aged MuSCs and 25 µm in images of young MuSCs.

Figure I.3: A) Distribution of ATAC-seq peak distances from transcription start sites (TSS). B) Number of differentially accessible ATAC-seq peaks or differentially expressed genes for each time point profiled. C) Number of accessible ATAC-seq peaks for each time point profiled.

Figure I.4: A) RNA expression (TPM log-fold-change in young vs aged) versus motif enrichment (HOMER log2 fold change over background) for mouse transcription factors with significant effects at days 1, B) 3, C) 5, and D) 7 post injury. Circle size is proportional to the mean expression (TPM) of the transcription factor across young and aged samples at a given timepoint. Transcription factors are color-coded by TF family assignment.

Figure I.5: A) Muscle progenitor cell differentiation with and without DDIT3 knockdown using C2C12s. Scale bars indicate 100 µm. B) Fusion index of C2C12s with and without knockdown. n = 3 independent experiments, where * corresponds to p≤0.05, calculated by two-sided, two-sample Student’s t-test assuming equal population variance. C) Normalized Myog expression in C2C12s following differentiation with and without knockdown. ***p≤0.005 by two-tailed, two-sample t-test. D) Representative images of muscle progenitor cell differentiation with and without DDIT3 knockdown using MuSCs harvested from young mice. Scale bars indicate 200 µm. E) Fusion index of young MuSCs with and without knockdown. n = 3 mice, with 5 technical replicates of each. ****p≤0.0001, calculated by two-sided, two-sample Student’s t-test not assuming equal population variance, based on an F test for equality of variance returning p≤0.05. F) Normalized DDIT3 expression in aged MuSCs after knockdown. n=2 technical replicates. *p≤0.05 by two-sided, two-sample Student’s t-test.

Bibliography
[1] .
[2] url: https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/TukeyHSD.
[3] url: https://stanforduniversity.qualtrics.com/jfe/preview/SV_87Fi9fwa54N98EZ?Q_CHL=preview.
[4] url: http://www.cdc.gov/nhanes.
[5] “2008 Physical activity guidelines for Americans [Internet]”. In: PsycEXTRA Dataset (2008). doi: http://dx.doi.org/10.1037/e525412010-001.
[6] S. K. Gupta. “Intention-to-treat concept: A review”. In: Perspect Clin Res 2 (2011), p. 3.
[7] Saamah Abdallah, Sam Thompson, and Nic Marks. “Estimating worldwide life satisfaction”. In: Ecol. Econ. 65.1 (Mar. 2008), pp. 35–47.
[8] Andrew Adey et al. “Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition”. In: Genome Biology 11.12 (2010), R119. doi: 10.1186/gb2010-11-12-r119. url: https://doi.org/10.1186/gb-2010-11-12-r119.
[9] M. D. Adler and M. Fleurbaey. The Oxford Handbook of Well-Being and Public Policy. Oxford University Press, 2016. 848 pp.
[10] Carlos A Aguilar et al. “Multiscale analysis of a regenerative therapy for treatment of volumetric muscle loss injury”. en. In: Cell Death Discov 4 (Dec. 2018), p. 33.
[11] C. Aguilera et al. “c-Jun N-terminal phosphorylation antagonises recruitment of the Mbd3/NuRD repressor complex”. In: Nature 469.7329 (2011), pp. 231–235.
[12] Alexander Dobin and Thomas R Gingeras. “Mapping RNA-seq Reads with STAR”. en. In: Curr. Protoc. Bioinformatics 51 (2015), p. 11.14.1.
[13] Babak Alipanahi et al. “Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning”. In: Nature Biotechnology 33.8 (July 2015), pp. 831–838. doi: 10.1038/nbt.3300. url: https://doi.org/10.1038/nbt.3300.
[14] M. Allen et al. “Association of MAPT haplotypes with Alzheimer’s disease risk and MAPT brain gene expression levels”. In: Alzheimer’s Res Ther. 6 (2014), pp. 1–14.
[15] M. Allen et al. “Association of MAPT haplotypes with Alzheimer’s disease risk and MAPT brain gene expression levels”. In: Alzheimer’s Res Ther. 6 (2014), pp. 1–14.
[16] Albert E Almada and Amy J Wagers. Molecular circuitry of stem cell fate in skeletal muscle regeneration, ageing and disease. 2016.
[17] S Altan, H Sağsöz, and Z Oğurtan. “Topical dimethyl sulfoxide inhibits corneal neovascularization and stimulates corneal repair in rabbits following acid burn”. en. In: Biotech. Histochem. 92.8 (Dec. 2017), pp. 619–636.
[18] Joel Alter and Eyal Bengal. “Stress-induced C/EBP homology protein (CHOP) represses MyoD transcription to delay myoblast differentiation”. en. In: PLoS One 6.12 (Dec. 2011), e29498.
[19] T. Althoff et al. “Large-scale physical activity data reveal worldwide activity inequality”. In: Nature 547.7663 (July 2017), pp. 336–339.
[20] Haley M Amemiya, Anshul Kundaje, and Alan P Boyle. “The ENCODE Blacklist: Identification of Problematic Regions of the Genome”. en. In: Sci. Rep. 9.1 (June 2019), p. 9354.
[21] Haley M. Amemiya, Anshul Kundaje, and Alan P. Boyle. “The ENCODE Blacklist: Identification of Problematic Regions of the Genome”. In: Scientific Reports 9.1 (June 2019). doi: 10.1038/s41598-019-45839-z. url: https://doi.org/10.1038/s41598-019-45839-z.
[22] A. Amiri et al. “Transcriptome and epigenome landscape of human cortical development modeled in organoids”. In: Science 362 (2018).
[23] A. Amlie-Wolf et al. “InferNo: Inferring the molecular mechanisms of noncoding genetic variants”. In: Nucleic Acids Res 46 (2018), pp. 8740–8753.
[24] Yitai An. A molecular switch regulating cell fate choice between muscle progenitor cells and brown adipocytes.
[25] R. J. Andrew et al. “Reduction of the expression of the late-onset Alzheimer’s disease (AD) risk-factor BIN1 does not affect amyloid pathology in an AD mouse model”. In: J. Biol Chem 294 (2019), pp. 4477–4487.
[26] P. Angel et al. “The jun proto-oncogene is positively autoregulated by its product, Jun/AP1”. In: Cell 55.5 (1988), pp. 875–885.
[27] Christof Angermueller et al. “Deep learning for computational biology”. In: Molecular Systems Biology 12.7 (July 2016), p. 878. doi: 10.15252/msb.20156651. url: https://doi.org/10.15252/msb.20156651.
[28] Christof Angermueller et al. “Deep learning for computational biology”. In: Molecular Systems Biology 12.7 (2016), p. 878. doi: 10.15252/msb.20156651.
[29] Anna Shcherbina, Avanti Shrikumar, and Soumya Kundu. kundajelab/seqdataloader: v0.2. 2020. doi: 10.5281/ZENODO.3771365. url: https://zenodo.org/record/3771365.
[30] Anna Shcherbina et al. kundajelab/kerasAC: v.2.3. 2020. doi: 10.5281/ZENODO.3831607. url: https://zenodo.org/record/3831607.
[31] R. J. L. Anney et al. “Genetic determinants of common epilepsies: A meta-analysis of genomewide association studies”. In: Lancet Neurol. 13 (2014), pp. 893–903.
[32] E. Apostolou and K. Hochedlinger. “Chromatin Dynamics during Cellular Reprogramming”. In: Nature 502.7472 (2013), pp. 462–471.
[33] M. Aragona et al. “A mechanical checkpoint controls multicellular growth through YAP/TAZ regulation by actin-processing factors”. In: Cell 154 (2013), pp. 1047–1059.
[34] R. Arena, D. K. Arnett, P. E. Terry, et al. “The role of worksite health screening: a policy statement from the American Heart Association”. In: Circulation 130.8 (2014), pp. 719–734.
[35] Euan A Ashley. “The Precision Medicine Initiative: A New National Effort”. In: JAMA (Apr. 2015).
[36] Actb: Tissue-breast. Human Protein Atlas, 2019. url: https://www.proteinatlas.org/ENSG00000075624-ACTB/tissue/breast.
[37] A. Auton et al. “A global reference for human genetic variation”. In: Nature 526 (2015), pp. 68–74.
[38] Žiga Avsec et al. “Base-resolution models of transcription factor binding reveal soft motif syntax”. In: bioRxiv (2020). doi: 10.1101/737981. eprint: https://www.biorxiv.org/content/early/2020/07/19/737981.full.pdf. url: https://www.biorxiv.org/content/early/2020/07/19/737981.
[39] Birsel Ayrulu-Erdem and Billur Barshan. “Leg Motion Classification with Artificial Neural Networks Using Wavelet-Based Features of Gyroscope Signals”. In: Sensors 11.2 (Jan. 2011), pp. 1721–1743. doi: 10.3390/s110201721. url: https://doi.org/10.3390/s110201721.
[40] P. Baeza-Centurion et al. “Combinatorial Genetics Reveals a Scaling Law for the Effects of Mutations on Splicing”. In: Cell 176 (2019), pp. 549–563. doi: 10.1016/j.cell.2018.12.010.
[41] A. Bakir. “Tracking Fitness with HealthKit and Core Motion [Internet]. [cited 2018 July 30]”. In: Beginning iOS Media App Development (2014), pp. 343–74. doi: http://dx.doi.org/10.1007/978-1-4302-5084-5_13.
[42] Andrew J Bannister and Tony Kouzarides. “Regulation of chromatin by histone modifications”. In: Cell Research 21.3 (Feb. 2011), pp. 381–395. doi: 10.1038/cr.2011.22. url: https://doi.org/10.1038/cr.2011.22.
[43] David R Bassett Jr et al. “Validity of four motion sensors in measuring moderate intensity physical activity”. In: Med. Sci. Sports Exerc. 32.9 Suppl (2000), S471–80.
[44] M Albert Basson. “Signaling in cell differentiation and morphogenesis”. en. In: Cold Spring Harb. Perspect. Biol. 4.6 (June 2012).
[45] D. M. Bates and S. DebRoy. “Linear mixed models and penalized least squares”. In: J Multivar Anal 91.1 (2004), pp. 1–17.
[46] D. M. Bates and J. C. Pinheiro. “Linear and Nonlinear Mixed-effects Models”. In: Conference on Applied Statistics in Agriculture [Internet]. [cited 2018 July 30]. 1998. doi: http://dx.doi.org/10.4148/2475-7772.1273.
[47] L. Bazak et al. “A-to-I RNA editing occurs at over a hundred million genomic sites, located in a majority of human genes”. In: Genome Res 24 (2014), pp. 365–376. doi: 10.1101/gr.164749.113.
[48] G. W. Beecham et al. “Genome-Wide Association Meta-analysis of Neuropathologic Features of Alzheimer’s Disease and Related Dementias”. In: PLoS Genet 10 (2014).
[49] J. E. Beevers et al. “MAPT Genetic Variation and Neuronal Maturity Alter Isoform Expression Affecting Axonal Transport in iPSC-Derived Dopamine Neurons”. In: Stem Cell Reports 9 (2017), pp. 587–599.
[50] J. E. Beevers et al. “MAPT Genetic Variation and Neuronal Maturity Alter Isoform Expression Affecting Axonal Transport in iPSC-Derived Dopamine Neurons”. In: Stem Cell Reports 9 (2017), pp. 587–599.
[51] Fabiana Braga Benatti and Mathias Ried-Larsen. “The Effects of Breaking up Prolonged Sitting Time: A Review of Experimental Studies”. In: Med. Sci. Sports Exerc. 47.10 (Oct. 2015), pp. 2053–2061.
[52] C. Benner et al. “FINEMAP: Efficient variable selection using summary data from genomewide association studies”. In: Bioinformatics 32 (2016), pp. 1493–1501.
[53] Mette Bentsen et al. “ATAC-seq footprinting unravels kinetics of transcription factor binding during zygotic genome activation”. In: Nature Communications 11.1 (Aug. 2020). doi: 10.1038/s41467-020-18035-1. url: https://doi.org/10.1038/s41467-020-18035-1.
[54] Mette Bentsen et al. Beyond accessibility: ATAC-seq footprinting unravels kinetics of transcription factor binding during zygotic genome activation.
[55] Nathan A Berger. Epigenetics, Energy Balance, and Cancer. en. Springer, Sept. 2016.
[56] Jennifer D Bernet et al. p38 MAPK signaling underlies a cell-autonomous loss of stem cell self-renewal in skeletal muscle of aged mice. 2014.
[57] N. Bhutani et al. “Reprogramming towards pluripotency requires AID-dependent DNA demethylation”. In: Nature 463.7284 (2010), pp. 1042–1047.
[58] M. Bibikova et al. “Human embryonic stem cells have a unique epigenetic signature”. In: Genome Res 16 (2006), pp. 1075–1083.
[59] Stephanie A Bien et al. “Genetic variant predictors of gene expression provide new insight into risk of colorectal cancer”. en. In: Hum. Genet. 138.4 (Apr. 2019), pp. 307–326.
[60] S N Blair et al. “Physical fitness and all-cause mortality: a prospective study of healthy men and women”. In: JAMA (1989).
[61] Steven N Blair. “Physical inactivity: the biggest public health problem of the 21st century”. In: Br. J. Sports Med. 43 (2009), pp. 1–2.
[62] Helen M Blau, Benjamin D Cosgrove, and Andrew T V Ho. The central role of muscle stem cells in regenerative failure with aging. 2015.
[63] Christoph Bock et al. “Reference Maps of human ES and iPS cell variation enable high-throughput characterization of pluripotent cell lines”. en. In: Cell 144.3 (Feb. 2011), pp. 439–452.
[64] Anthony M Bolger, Marc Lohse, and Bjoern Usadel. “Trimmomatic: a flexible trimmer for Illumina sequence data”. en. In: Bioinformatics 30.15 (Aug. 2014), pp. 2114–2120.
[65] Verawan Boonsanay et al. Regulation of Skeletal Muscle Stem Cell Quiescence by Suv4-20h1-Dependent Facultative Heterochromatin Formation. 2016.
[66] B. M. Bot et al. “The mPower study, Parkinson disease mobile data collected using ResearchKit”. In: Sci Data 3 (Mar. 2016), p. 160011.
[67] A S Brack et al. Increased Wnt Signaling During Aging Alters Muscle Stem Cell Fate and Increases Fibrosis. 2007.
[68] Cara K Bradley et al. “Derivation of Human Embryonic Stem Cell Lines from Vitrified Human Blastocysts”. en. In: Methods Mol. Biol. 1307 (2016), pp. 1–23.
[69] Laura A Brocklebank et al. “Accelerometer-measured sedentary time and cardiometabolic biomarkers: A systematic review”. In: Prev. Med. 76 (2015), pp. 92–102.
[70] Y. Bromberg and B. Rost. “Comprehensive in silico mutagenesis highlights functionally important residues in proteins”. In: Bioinformatics 24 (2008), pp. 207–212.
[71] G C Brooks et al. “Accuracy and Usability of a Self-Administered 6-Minute Walk Test Smartphone Application”. In: Circ. Heart Fail. (2015).
[72] D. van Bruggen, E. Agirre, and G. Castelo-Branco. “Single-cell transcriptomic analysis of oligodendrocyte lineage cells”. In: Curr Opin Neurobiol. 47 (2017), pp. 168–175.
[73] J. Bryois et al. “Evaluation of chromatin accessibility in prefrontal cortex of individuals with schizophrenia”. In: Nat Commun. 9 (2018).
[74] J. D. Buenrostro et al. “Single-cell chromatin accessibility reveals principles of regulatory variation”. In: Nature 523 (2015), pp. 486–490.
[75] J. D. Buenrostro et al. “ATAC-seq: a method for assaying chromatin accessibility genome-wide”. In: Curr. Protoc. Mol Biol 109 (2015), pp. 21–29.
[76] J. D. Buenrostro et al. “Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position”. In: Nat Methods 10 (2013), pp. 1213–1218.
[77] J. D. Buenrostro et al. “Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position”. In: Nat Methods 10 (2013), pp. 1213–1218.
[78] Jason D Buenrostro et al. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. 2013.
[79] Jason D. Buenrostro et al. “ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide”. In: Current Protocols in Molecular Biology 109.1 (Jan. 2015). doi: 10.1002/ 0471142727.mb2129s109. url: https://doi.org/10.1002/0471142727.mb2129s109.
[80] Annalisa Buniello et al. “The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019”. In: Nucleic Acids Research 47.D1 (Nov. 2018), pp. D1005–D1012. doi: 10.1093/nar/gky1120. url: https://doi.org/10.1093/nar/gky1120.
[81] Tom Burdon, Austin Smith, and Pierre Savatier. “Signalling, cell cycle and pluripotency in embryonic stem cells”. en. In: Trends Cell Biol. 12.9 (Sept. 2002), pp. 432–438.
[82] Lora E Burke et al. Current Science on Consumer Use of Mobile Health for Cardiovascular Disease Prevention. 2015, CIR.0000000000000232.
[83] Nina Cabezas-Wallscheid et al. Vitamin A-Retinoic Acid Signaling Regulates Hematopoietic Stem Cell Dormancy. 2017.
[84] D. Cacchiarelli et al. “Integrative Analyses of Human Reprogramming Reveal Dynamic Nature of Induced Pluripotency”. In: Cell 162.2 (2015), pp. 412–424.
[85] Aslıhan Karabacak Calviello et al. “Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling”. In: Genome Biology 20.1 (Feb. 2019). doi: 10.1186/s13059-019-1654-y. url: https://doi.org/10.1186/s13059-019-1654-y.
[86] Yi Cao et al. Genome-wide MyoD Binding in Skeletal Muscle Cells: A Potential for Broad Cellular Reprogramming. 2010.
[87] Meredith A Case et al. “Accuracy of smartphone applications and wearable devices for tracking physical activity data”. In: JAMA 313.6 (Feb. 2015), pp. 625–626.
[88] Massimiliano Cerletti et al. Highly Efficient, Functional Engraftment of Skeletal Muscle Stem Cells in Dystrophic Muscles. 2008.
[89] Joe V Chakkalakal et al. The aged niche disrupts muscle stem cell quiescence. 2012.
[90] Abhishek A Chakraborty et al. “Histone demethylase KDM6A directly senses oxygen to control chromatin and cell fate”. en. In: Science 363.6432 (Mar. 2019), pp. 1217–1222.
[91] Stuart M Chambers et al. “Aging hematopoietic stem cells decline in function and exhibit epigenetic dysregulation”. en. In: PLoS Biol. 5.8 (Aug. 2007), e201.
[92] Christopher C Chang et al. “Second-generation PLINK: rising to the challenge of larger and richer datasets”. In: GigaScience 4.1 (Feb. 2015). doi: 10.1186/s13742-015-0047-8. url: https://doi.org/10.1186/s13742-015-0047-8.
[93] D. Chang et al. “A meta-analysis of genome-wide association studies identifies 17 new Parkinson’s disease risk loci”. In: Nat Genet 49 (2017), pp. 1511–1516.
[94] Gregory W Charville et al. Ex Vivo Expansion and In Vivo Self-Renewal of Human Muscle Stem Cells. 2015.
[95] O. Chaudhuri et al. “Extracellular matrix stiffness and composition jointly regulate the induction of malignant phenotypes in mammary epithelium”. In: Nat Mater 13 (2014), pp. 970–978.
[96] B.-K. Chen and W.-C. Chang. “Functional interaction between c-Jun and promoter factor Sp1 in epidermal growth factor-induced gene expression of human 12 (S)-lipoxygenase”. In: Proceedings of the National Academy of Sciences 97.19 (2000), pp. 10406–10411.
[97] F. Chen et al. “Inhibition of histone deacetylase reduces transcription of NADPH oxidases and ROS production and ameliorates pulmonary arterial hypertension”. In: Free Radic Biol Med 99 (2016), pp. 167–178.
[98] G. Chen, D. Katrekar, and P. Mali. “RNA-Guided Adenosine Deaminases: Advances and Challenges for Therapeutic RNA Editing”. In: Biochemistry 58 (2019), pp. 1947–1957. doi: 10.1021/acs.biochem.9b00046.
[99] J. Chen et al. “H3K9 methylation is a barrier during somatic cell reprogramming into iPSCs”. In: Nature Genetics 45 (2012), p. 34.
[100] T. Chen and C. Guestrin. “XGBoost: A Scalable Tree Boosting System”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California, USA: Association for Computing Machinery, 2016, pp. 785–794.
[101] C. Y. Cheng et al. “RNA structure inference through chemical mapping after accidental or intentional mutations”. In: Proceedings of the National Academy of Sciences 114 (2017), pp. 9876–9881. doi: 10.1073/pnas.1619897114.
[102] Tao Cheng. Hematopoietic Differentiation of Human Pluripotent Stem Cells. en. Springer, Aug. 2015.
[103] Sundari Chetty et al. “A simple tool to improve pluripotent stem cell differentiation”. en. In: Nat. Methods 10.6 (June 2013), pp. 553–556.
[104] Sundari Chetty et al. “A Src inhibitor regulates the cell cycle of human pluripotent stem cells and improves directed differentiation”. en. In: J. Cell Biol. 210.7 (Sept. 2015), pp. 1257–1268.
[105] R. Cheung et al. “A Multiplexed Assay for Exon Recognition Reveals that an Unappreciated Fraction of Rare Genetic Variants Cause Large-Effect Splicing Disruptions”. In: Mol Cell 73 (2019), pp. 183–194. doi: 10.1016/j.molcel.2018.10.037.
[106] Raymond J Cho et al. “Transcriptional regulation and function during the human cell cycle”. In: Nat. Genet. 27.1 (2001), pp. 48–54.
[107] D. Choquet, D. P. Felsenfeld, and M. P. Sheetz. “Extracellular matrix rigidity causes strengthening of integrin–cytoskeleton linkages”. In: Cell 88 (1997), pp. 39–48.
[108] Clara K Chow et al. “Effect of Lifestyle-Focused Text Messaging on Risk Factor Modification in Patients With Coronary Heart Disease: A Randomized Clinical Trial”. In: JAMA 314.12 (2015), pp. 1255–1263.
[109] C. Chronis et al. “Cooperative Binding of Transcription Factors Orchestrates Reprogramming”. In: Cell 168.3 (2017), pp. 442–459.
[110] Constantinos Chronis et al. Cooperative Binding of Transcription Factors Orchestrates Reprogramming. 2017.
[111] Andrea J Cohen et al. “Hotspots of aberrant enhancer activity punctuate the colorectal cancer epigenome”. en. In: Nat. Commun. 8 (Feb. 2017), p. 14400.
[112] Jamie F Conklin, Julie Baker, and Julien Sage. “The RB family is required for the self-renewal and survival of human embryonic stem cells”. en. In: Nat. Commun. 3 (2012), p. 1244.
[113] Jamie F Conklin and Julien Sage. “Keeping an eye on retinoblastoma control of human embryonic stem cells”. en. In: J. Cell. Biochem. 108.5 (Dec. 2009), pp. 1023–1030.
[114] M. R. Corces et al. “An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues”. In: Nat Methods 14 (2017), pp. 959–962.
[115] M. R. Corces et al. “Lineage-specific and single cell chromatin accessibility charts human hematopoiesis and leukemia evolution”. In: Nat Genet 48 (2016), pp. 1193–1203.
[116] M. R. Corces et al. “The chromatin accessibility landscape of primary human cancers”. In: Science 362 (2018).
[117] D. A. Cusanovich et al. “The cis-regulatory dynamics of embryonic development at single-cell resolution”. In: Nature 555 (2018), pp. 538–542.
[118] P. Danecek et al. “The variant call format and VCFtools”. In: Bioinformatics 27 (2011), pp. 2156–2158.
[119] C. Daniel et al. “Editing inducer elements increases A-to-I editing efficiency in the mammalian transcriptome”. In: Genome Biol 18 (2017), p. 195. doi: 10.1186/s13059-017-1324-x.
[120] Carrie A Davis et al. “The Encyclopedia of DNA elements (ENCODE): data portal update”. In: Nucleic Acids Research 46.D1 (Nov. 2017), pp. D794–D801. doi: 10.1093/nar/gkx1081. url: https://doi.org/10.1093/nar/gkx1081.
[121] J. Debnath, S. K. Muthuswamy, and J. S. Brugge. “Morphogenesis and oncogenesis of MCF10A mammary epithelial acini grown in three-dimensional basement membrane cultures”. In: Methods 30 (2003), pp. 256–268.
[122] A. DelRio et al. “Stretching single talin rod molecules activates vinculin binding”. In: Science 323 (2009), pp. 638–641.
[123] J. Demmerle, A. J. Koch, and J. M. Holaska. “The nuclear envelope protein emerin binds directly to histone deacetylase 3 (HDAC3) and activates HDAC3 activity”. In: J Biol Chem 287 (2012), pp. 22080–22088.
[124] D. Demontis et al. “Discovery of the first genome-wide significant risk loci for attention deficit/hyperactivity disorder”. In: Nat Genet 51 (2019), pp. 63–75.
[125] E. Diener. Well-being for Public Policy. USA: Oxford University Press, 2009. 245 pp.
[126] Y. Ding et al. “In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features”. In: Nature 505 (2014), pp. 696–700. doi: 10.1038/nature12756.
[127] A. Dobin et al. STAR: ultrafast universal RNA-seq aligner. https://www.ncbi.nlm.nih.gov/pubmed/23104886. Accessed: 2018-7-6.
[128] J. G. Doench et al. “Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9”. In: Nat Biotechnol 34 (2016), pp. 184–191. doi: 10.1038/nbt.3437.
[129] A. Doetzlhofer et al. “Histone deacetylase 1 can repress transcription by binding to Sp1”. In: Mol Cell Biol 19 (1999), pp. 5504–5511.
[130] Aiden Doherty et al. “GWAS identifies 14 loci for device-measured physical activity and sleep duration”. In: Nature Communications 9.1 (Dec. 2018). doi: 10.1038/s41467-018-07743-4. url: https://doi.org/10.1038/s41467-018-07743-4.
[131] E. R. Dorsey, Y.-F. Y. Chan, M. V. McConnell, S. Y. Shaw, A. D. Trister, et al. “The Use of Smartphones for Health Research”. In: Academic Medicine 92 (2017), pp. 157–160. Cited 2018 July 30.
[132] N C Duarte et al. Global reconstruction of the human metabolic network based on genomic and bibliomic data. 2007.
[133] Gregory S Ducker and Joshua D Rabinowitz. “One-Carbon Metabolism in Health and Disease”. en. In: Cell Metab. 25.1 (Jan. 2017), pp. 27–42.
[134] L. Duncan et al. “Significant locus and metabolic genetic correlations revealed in genome-wide association study of anorexia nervosa”. In: Am. J. Psychiatry 174 (2017), pp. 850–858.
[135] A. L. Dunn et al. “Six-month physical activity and fitness changes in Project Active, a randomized trial”. In: Med Sci Sports Exerc 30.7 (July 1998), pp. 1076–1083.
[136] Eran Eden, I. W., and Zohar Yakhini. SimTree: Computing similarity between RNA secondary structure. url: http://bioinfo.cs.technion.ac.il/SimTree/.
[137] R. Eferl and E. F. Wagner. “AP-1: a double-edged sword in tumorigenesis”. In: Nat Rev Cancer 3.11 (2003), pp. 859–868.
[138] A. G. Efthymiou and A. M. Goate. “Late onset Alzheimer’s disease genetics implicates microglial pathways in disease risk”. In: Mol Neurodegener. 12 (2017), pp. 1–12.
[139] J. M. Eggington, T. Greene, and B. L. Bass. “Predicting sites of ADAR editing in double-stranded RNA”. In: Nat Commun 2 (2011), p. 319. doi: 10.1038/ncomms1324.
[140] A. Elosegui-Artola et al. “Mechanical regulation of a molecular clutch defines force transmission and transduction in response to matrix rigidity”. In: Nat Cell Biol 18 (2016), pp. 540–548.
[141] Gwang Hyeon Eom et al. “Histone methyltransferase SETD3 regulates muscle differentiation”. en. In: J. Biol. Chem. 286.40 (Oct. 2011), pp. 34733–34742.
[142] Roadmap Epigenomics Consortium et al. “Integrative analysis of 111 reference human epigenomes”. In: Nature 518.7539 (2015), pp. 317–330.
[143] Maria Ermolaeva et al. Cellular and epigenetic drivers of stem cell ageing. 2018.
[144] Jason Ernst and Manolis Kellis. ChromHMM: automating chromatin-state discovery and characterization. 2012.
[145] Hervé Faralli et al. UTX demethylase activity is required for satellite cell–mediated muscle regeneration. 2016.
[146] Andrew Farmer and Lionel Tarassenko. “Use of wearable monitoring devices to change health behavior”. In: JAMA 313.18 (May 2015), pp. 1864–1865.
[147] G. M. Findlay et al. “Saturation editing of genomic regions by multiplex homology-directed repair”. In: Nature 513 (2014), pp. 120–123. doi: 10.1038/nature13695.
[148] H. K. Finucane et al. “Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types”. In: Nat Genet 50 (2018), pp. 621–629.
[149] H. K. Finucane et al. “Partitioning heritability by functional annotation using genome-wide association summary statistics”. In: Nat Genet 47 (2015), pp. 1228–1235.
[150] W. A. Flavahan, E. Gaskell, and B. E. Bernstein. “Epigenetic plasticity and the hallmarks of cancer”. In: Science 357 (2017).
[151] W. L. Fong et al. “Differential and Overlapping Pattern of Foxp1 and Foxp2 Expression in the Striatum of Adult Mouse Brain”. In: Neuroscience 388 (2018), pp. 214–223.
[152] C. Fork et al. “Epigenetic control of microsomal prostaglandin E synthase-1 by HDAC-mediated recruitment of p300”. In: J. Lipid Res 58 (2017), pp. 386–392.
[153] M. F. Fraga et al. “Loss of acetylation at Lys16 and trimethylation at Lys20 of histone H4 is a common hallmark of human cancer”. In: Nat Genet 37 (2005), pp. 391–400.
[154] E. C. Freund et al. “Unbiased Identification of trans Regulators of ADAR and A-to-I RNA Editing”. In: Cell Rep 31 (2020), p. 107656. doi: 10.1016/j.celrep.2020.107656.
[155] F. S. Collins, L. D. Brooks, and A. Chakravarti. “A DNA polymorphism discovery resource for research on human genetic variation”. In: Genome Research 8 (Dec. 1998).
[156] So-Ichiro Fukada et al. Molecular Signature of Quiescent Satellite Cells in Adult Skeletal Muscle. 2007.
[157] C. P. Fulco et al. “Activity-by-contact model of enhancer–promoter regulation from thousands of CRISPR perturbations”. In: Nat Genet 51 (2019), pp. 1664–1669.
[158] J. F. Fullard et al. “An atlas of chromatin accessibility in the adult human brain”. In: Genome Res 28 (2018), pp. 1243–1252.
[159] J. F. Fullard et al. “Open chromatin profiling of human postmortem brain infers functional roles for non-coding schizophrenia loci”. In: Hum Mol Genet 26 (2017), pp. 1942–1951.
[160] N. Gal-Mark et al. “Abnormalities in A-to-I RNA editing patterns in CNS injuries correlate with dynamic changes in cell type composition”. In: Sci Rep 7 (2017), p. 43421. doi: 10.1038/srep43421.
[161] M. D. Gallagher and A. S. Chen-Plotkin. “The Post-GWAS Era: From Association to Function”. In: Am J Hum Genet 102 (2018), pp. 717–730.
[162] M. Gallo et al. “MLL5 orchestrates a cancer self-renewal state by repressing the histone variant H3.3 and globally reorganizing chromatin”. In: Cancer Cell 28 (2015), pp. 715–729.
[163] Laura García-Prat et al. “Autophagy maintains stemness by preventing senescence”. en. In: Nature 529.7584 (Jan. 2016), pp. 37–42.
[164] John Michael Gaziano et al. “Million Veteran Program: A mega-biobank to study genetic influences on health and disease”. In: Journal of Clinical Epidemiology 70 (Feb. 2016), pp. 214–223. doi: 10.1016/j.jclinepi.2015.09.016. url: https://doi.org/10.1016/j.jclinepi.2015.09.016.
[165] M. Ghandi et al. “GkmSVM: An R package for gapped-kmer SVM”. In: Bioinformatics 32 (2016), pp. 2205–2207.
[166] Mahmoud Ghandi et al. “Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features”. In: PLoS Computational Biology 10.7 (2014). doi: 10.1371/journal.pcbi.1003711.
[167] David C Goff Jr et al. “2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines”. In: Circulation 129.25 Suppl 2 (June 2014), S49–73.
[168] Jennifer C. Goldsack et al. “Verification, analytical validation, and clinical validation (V3): the foundation of determining fit-for-purpose for Biometric Monitoring Technologies (BioMeTs)”. In: npj Digital Medicine 3.1 (Apr. 2020). doi: 10.1038/s41746-020-0260-4. url: https://doi.org/10.1038/s41746-020-0260-4.
[169] Kevin Andrew Uy Gonzales et al. “Deterministic Restriction on Pluripotent State Dissolution by Cell-Cycle Pathways”. en. In: Cell 162.3 (July 2015), pp. 564–579.
[170] C. Grady et al. In: N Engl J Med 376.9 (Mar. 2017).
[171] J. M. Granja et al. ArchR: An integrative and scalable software package for single-cell chromatin accessibility analysis. bioRxiv, 2020.
[172] J. M. Granja et al. “Single-cell multiomic analysis identifies regulatory programs in mixed-phenotype acute leukemia”. In: Nat Biotechnol. 37 (2019), pp. 1458–1465.
[173] Ani Grigoryan et al. LaminA/C regulates epigenetic and chromatin architecture changes upon aging of hematopoietic stem cells. 2018.
[174] Shobhit Gupta et al. “Quantifying similarity between motifs”. In: Genome Biology 8.2 (2007), R24. doi: 10.1186/gb-2007-8-2-r24. url: https://doi.org/10.1186/gb-2007-8-2-r24.
[175] H.-Y. Kim. “Statistical notes for clinical researchers: post-hoc multiple comparisons”. In: Restor Dent Endod 40.2 (May 2015), pp. 172–176.
[176] Marina El Haddad et al. Retinoic acid maintains human skeletal muscle progenitor cells in an immature state. 2017.
[177] E Haithcock et al. Age-related changes of nuclear architecture in Caenorhabditis elegans. 2005.
[178] T. D. Halazonetis et al. “c-Jun dimerizes with itself and with c-Fos, forming complexes of different DNA binding affinities”. In: Cell 55.5 (1988), pp. 917–924.
[179] Alon Halevy, Peter Norvig, and Fernando Pereira. “The Unreasonable Effectiveness of Data”. In: (2009).
[180] M. Hallegger, A. Taschner, and M. F. Jantsch. “RNA aptamers binding the double-stranded RNA-binding domain”. In: RNA 12 (2006), pp. 1993–2004. doi: 10.1261/rna.125506.
[181] H. Han et al. “TRRUSTv2: an expanded reference database of human and mouse transcriptional regulatory interactions”. In: Nucleic Acids Res 46 (2018), pp. D380–D386.
[182] Jaeseok Han et al. ER-stress-induced transcriptional regulation increases protein synthesis leading to cell death. 2013.
[183] L. Han et al. “The Genomic Landscape and Clinical Relevance of A-to-I RNA Editing in Human Cancers”. In: Cancer Cell 28 (2015), pp. 515–528. doi: 10.1016/j.ccell.2015.08.013.
[184] David Hardy et al. “Comparative Study of Injury Models for Studying Muscle Regeneration in Mice”. en. In: PLoS One 11.1 (Jan. 2016), e0147198.
[185] C. T. Harvey et al. “QuASAR: Quantitative allele-specific analysis of reads”. In: Bioinformatics 31 (2015), pp. 1235–1242.
[186] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. “The Elements of Statistical Learning”. In: Elements 1 (2009), pp. 337–387.
[187] S. Heinz et al. “Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities”. In: Mol Cell 38 (2010), pp. 576–589.
[188] S. Heinz et al. “Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities”. In: Mol Cell 38 (2010), pp. 576–589.
[189] Sven Heinz et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. 2010.
[190] A. L. Hemonnot et al. “Microglia in Alzheimer disease: Well-known targets and new opportunities”. In: Front Cell Infect Microbiol. 9 (2019), pp. 1–20.
[191] S. J. Heo et al. “Biophysical regulation of chromatin architecture instills a mechanical memory in mesenchymal stem cells”. In: Sci. Rep 5 (2015), p. 16895.
[192] Ravi S Hira et al. “Frequency and practice-level variation in inappropriate and nonrecommended prasugrel prescribing: insights from the NCDR PINNACLE registry”. en. In: J. Am. Coll. Cardiol. 63.25 Pt A (July 2014), pp. 2876–2877.
[193] J. D. Hoeck et al. “Fbw7 controls neural stem cell differentiation and progenitor apoptosis via Notch and c-Jun”. In: Nature Neuroscience 13 (2010), p. 1365.
[194] J. M. M. Howson et al. “Fifteen new risk loci for coronary artery disease highlight arterial-wall-specific mechanisms”. In: Nat Genet 49 (2017), pp. 1113–1119.
[195] Da Wei Huang, Brad T Sherman, and Richard A Lempicki. “Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists”. In: Nucleic Acids Res. 37.1 (2008), pp. 1–13.
[196] Da Wei Huang, Brad T Sherman, and Richard A Lempicki. “Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources”. en. In: Nat. Protoc. 4.1 (2009), pp. 44–57.
[197] D. Huangfu et al. “Induction of pluripotent stem cells by defined factors is greatly improved by small-molecule compounds”. In: Nature Biotechnology 26 (2008), p. 795.
[198] V. Huin et al. “Alternative promoter usage generates novel shorter MAPT mRNA transcripts in Alzheimer’s disease and progressive supranuclear palsy brains”. In: Sci Rep 7 (2017), pp. 1–10.
[199] Jeroen R Huyghe, Stephanie A Bien, et al. “Discovery of common and rare genetic risk variants for colorectal cancer”. en. In: Nat. Genet. 51.1 (Jan. 2019), pp. 76–87.
[200] T. Hwang et al. “Dynamic regulation of RNA editing in human brain development and disease”. In: Nat Neurosci 19 (2016), pp. 1093–1099. doi: 10.1038/nn.4337.
[201] Christopher C Imes and Frances Marcus Lewis. “Family history of cardiovascular disease, perceived cardiovascular disease risk, and health-related behavior: a review of the literature”. en. In: J. Cardiovasc. Nurs. 29.2 (Mar. 2014), pp. 108–129.
[202] National Human Genome Research Institute. url: https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Costs-Data.
[203] International Study of Comparative Health Effectiveness With Medical and Invasive Approaches (ISCHEMIA). ClinicalTrials.gov. url: http://clinicaltrials.gov/show/NCT01471522.
[204] Marta Jackowska and Andrew Steptoe. “Sleep and future cardiovascular risk: prospective analysis from the English Longitudinal Study of Ageing”. In: Sleep Med. 16.6 (June 2015), pp. 768–774.
[205] K. Jaganathan et al. “Predicting Splicing from Primary Sequence with Deep Learning”. In: Cell 176 (2019), pp. 535–548. doi: 10.1016/j.cell.2018.12.015.
[206] N. Jain et al. “Cell geometric constraints induce modular gene-expression patterns via redistribution of HDAC3 regulated by actomyosin contractility”. In: Proc. Natl Acad. Sci. USA 110 (2013), pp. 11349–11354.
[207] N. Jain et al. “Cell geometric constraints induce modular gene-expression patterns via redistribution of HDAC3 regulated by actomyosin contractility”. In: Proc. Natl Acad. Sci. USA 110 (2013), pp. 11349–11354.
[208] I. Jansen et al. “Genetic meta-analysis identifies 10 novel loci and functional pathways for Alzheimer’s disease risk”. In: Nat Genet 51 (2018), pp. 404–413.
[209] J. Jardine, J. Fisher, and B. Carrick. “Apple’s ResearchKit: smart data collection for the smartphone era?” In: J R Soc Med 108.8 (2015), pp. 294–296.
[210] Jinwook Lee et al. ENCODE ATAC-seq pipeline. en. 2019. doi: 10.5281/ZENODO.3564813. url: https://zenodo.org/record/3564813.
[211] N. B. Johnson et al., Centers for Disease Control and Prevention (CDC). “CDC National Health Report: leading causes of morbidity and mortality and associated behavioral risk and protective factors–United States”. In: MMWR Suppl. 63 (Oct. 2014), pp. 3–27.
[212] Samuel E. Jones et al. “Genetic studies of accelerometer-based sleep measures yield new insights into human sleep behaviour”. In: Nature Communications 10.1 (Apr. 2019). doi: 10.1038/s41467-019-09576-1. url: https://doi.org/10.1038/s41467-019-09576-1.
[213] D. C. Goff Jr, D. M. Lloyd-Jones, G. Bennett, et al. “2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines”. In: Circulation 129.25 (2014), S49–S73.
[214] P Jung et al. AP4 encodes a c-MYC-inducible repressor of p21. 2008.
[215] Leonard A Kaminsky et al. “Investigation of Methods to Determine Individualized Thresholds for Moderate and Vigorous Intensity from Accelerometer Measurements: 2014”. In: Med. Sci. Sports Exercise 42 (May 2010), p. 478.
[216] D. K. Kaushik et al. “Krüppel-like factor 4, a novel transcription factor regulates microglial activation and subsequent neuroinflammation”. In: J. Neuroinflammation 7 (2010), pp. 1–20.
[217] N. A. Kearns et al. “Functional annotation of native enhancers with a Cas9–histone demethylase fusion”. In: Nature methods 12 (2015), p. 5.
[218] Alexandra C Keefe et al. Muscle stem cells contribute to myofibres in sedentary adult mice. 2015.
[219] David R. Kelley, Jasper Snoek, and John L. Rinn. “Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks”. In: Genome Research 26.7 (2016), pp. 990–999. doi: 10.1101/gr.200535.115.
[220] J. P. Kemp et al. “Identification of 153 new loci associated with heel bone mineral density and functional involvement of GPC6 in osteoporosis”. In: Nat Genet 49 (2017), pp. 1468–1475.
[221] P. Kheradpour and M. Kellis. “Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments”. In: Nucleic acids research 42.5 (2013), pp. 2976–2987.
[222] M. Kiernan, D. E. Schoffman, K. Lee, et al. “The Stanford Leisure-Time Activity Categorical Item (L-Cat): a single categorical item sensitive to physical activity changes in overweight/obese women”. In: Int J Obes 37.12 (2013), pp. 1597–1602.
[223] Dong-Ho Kim et al. Synthetic dsRNA Dicer substrates enhance RNAi potency and efficacy. 2005.
[224] W. Kladwang et al. “A two-dimensional mutate-and-map strategy for non-coding RNA structure”. In: Nat Chem 3 (2011), pp. 954–962. doi: 10.1038/nchem.1176.
[225] Yann C. Klimentidis et al. “Genome-wide association study of habitual physical activity in over 377,000 UK Biobank participants identifies multiple variants including CADM2 and APOE”. In: International Journal of Obesity 42.6 (June 2018), pp. 1161–1176. doi: 10.1038/s41366-018-0120-3. url: https://doi.org/10.1038/s41366-018-0120-3.
[226] A. S. Knaupp et al. “Transient and Permanent Reconfiguration of Chromatin and Transcription Factor Occupancy Drive Reprogramming”. In: Cell Stem Cell 21.6 (2017), pp. 834–845.
[227] J. W. Knowles, T. L. Assimes, M. Kiernan, et al. “Randomized trial of personal genomics for preventive cardiology: design and challenges”. In: Circ Cardiovasc Genet 5.3 (2012), pp. 368–376.
[228] Joshua W Knowles et al. “Randomized trial of personal genomics for preventive cardiology: design and challenges”. In: Circ. Cardiovasc. Genet. 5.3 (June 2012), pp. 368–376.
[229] Young Ko and Sunjoo Boo. “Self-perceived health versus actual cardiovascular disease risks”. en. In: Jpn. J. Nurs. Sci. 13.1 (Jan. 2016), pp. 65–74.
[230] H W Kohl et al. “An empirical evaluation of the ACSM guidelines for exercise testing”. In: Med. Sci. Sports Exerc. 22.4 (Aug. 1990), pp. 533–539.
[231] Harold W Kohl et al. “The pandemic of physical inactivity: global action for public health”. In: Lancet 380.9838 (July 2012), pp. 294–305.
[232] T. Kondo and M. Raff. “Basic helix-loop-helix proteins and the timing of oligodendrocyte differentiation”. In: Development 127 (2000), pp. 2989–2998.
[233] I. Korsunsky et al. “Fast, sensitive and accurate integration of single-cell data with Harmony”. In: Nat Methods 16 (2019), pp. 1289–1296.
[234] T. Kouzarides. “Chromatin modifications and their function”. In: Cell 128 (2007), pp. 693–705.
[235] K. Kuhlbrodt et al. “Sox10, a novel transcriptional modulator in glial cells”. In: J. Neurosci. 18 (1998), pp. 237–250.
[236] N. Kumasaka, A. J. Knights, and D. J. Gaffney. “High-resolution genetic mapping of putative causal interactions between regions of open chromatin”. In: Nat Genet 51 (2018), pp. 128–137.
[237] B. W. Kunkle et al. “Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates Abeta, tau, immunity and lipid processing”. In: Nat Genet 2019 (2019), p. 513.
[238] M. C. Lai et al. “Haplotype-specific MAPT exon 3 expression regulated by common intronic polymorphisms associated with Parkinsonian disorders”. In: Mol Neurodegener. 12 (2017), pp. 1–16.
[239] M. C. Lai et al. “Haplotype-specific MAPT exon 3 expression regulated by common intronic polymorphisms associated with Parkinsonian disorders”. In: Mol Neurodegener. 12 (2017), pp. 1–16.
[240] B. B. Lake et al. “Neuronal subtypes and diversity revealed by single-nucleus RNA sequencing of the human brain”. In: Science 352 (2016), pp. 1586–1590.
[241] J.-C. Lambert et al. “Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease”. In: Nat Genet 45 (2013), pp. 1452–1458.
[242] B. Langmead and S. L. Salzberg. “Fast gapped-read alignment with Bowtie 2”. In: Nat Methods 9 (2012), pp. 357–359.
[243] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. 2012.
[244] M. Larsson et al. “GWAS findings for human iris patterns: Associations with variants in genes that influence normal neuronal pattern development”. In: Am J Hum Genet 89 (2011), pp. 334–343.
[245] Carl J Lavie et al. “Exercise and the Cardiovascular System”. In: Circ. Res. 117.2 (2015), pp. 207–219.
[246] Charity W Law et al. “voom: Precision weights unlock linear model analysis tools for RNA-seq read counts”. en. In: Genome Biol. 15.2 (Feb. 2014), R29.
[247] H. Q. Le et al. “Mechanical regulation of transcription controls Polycomb-mediated gene silencing during lineage commitment”. In: Nat Cell Biol 18 (2016), pp. 864–875.
[248] J. LeBeyec et al. “Cell shape regulates global histone acetylation in human mammary epithelial cells”. In: Exp Cell Res 313 (2007), pp. 3066–3075.
[249] D. Lee et al. “A method to predict the impact of regulatory variants from DNA sequence”. In: Nat Genet 47 (2015), pp. 955–961.
[250] D.-S. Lee et al. “An epigenomic roadmap to induced pluripotency reveals DNA methylation as a reprogramming modulator”. In: Nature Communications 5 (2014), p. 5619.
[251] Dongwon Lee. “LS-GKM: a new gkm-SVM for large-scale datasets”. In: Bioinformatics 32.14 (2016), pp. 2196–2198. doi: 10.1093/bioinformatics/btw142.
[252] Jin Lee et al. ENCODE-DCC/chip-seq-pipeline2: Zenodo integration for citation purposes. 2020. doi: 10.5281/ZENODO.3978629. url: https://zenodo.org/record/3978629.
[253] W. Lee et al. “Activation of transcription by two factors that bind promoter and enhancer sequences of the human metallothionein gene and SV40”. In: Nature 325.6102 (1987), pp. 368–372.
[254] Jeffrey T Leek et al. “The sva package for removing batch effects and other unwanted variation in high-throughput experiments”. In: Bioinformatics 28.6 (2012), pp. 882–883.
[255] Jeffrey T Leek et al. “The sva package for removing batch effects and other unwanted variation in high-throughput experiments”. In: Bioinformatics 28.6 (2012), pp. 882–883.
[256] K. A. Lehmann and B. L. Bass. “Double-stranded RNA adenosine deaminases ADAR1 and ADAR2 have overlapping specificities”. In: Biochemistry 39 (2000), pp. 12875–12884.
[257] K. A. Lehmann and B. L. Bass. “The importance of internal loops within RNA substrates of ADAR1”. In: J Mol Biol 291 (1999), pp. 1–13. doi: 10.1006/jmbi.1999.2914.
[258] I. Letunic and P. Bork. “Interactive Tree Of Life (iTOL) v4: recent updates and new developments”. In: Nucleic Acids Res 47 (2019), W256–W259. doi: 10.1093/nar/gkz239.
[259] I. Letunic and P. Bork. “Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation”. In: Bioinformatics 23 (2007), pp. 127–128. doi: 10.1093/bioinformatics/btl529.
[260] K. R. Levental et al. “Matrix crosslinking forces tumor progression by enhancing integrin signaling”. In: Cell 139 (2009), pp. 891–906.
[261] Alexander Lex et al. “UpSet: Visualization of Intersecting Sets”. en. In: IEEE Trans. Vis. Comput. Graph. 20.12 (Dec. 2014), pp. 1983–1992.
[262] Bo Li and Colin N Dewey. “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome”. en. In: BMC Bioinformatics 12.1 (Aug. 2011), p. 323.
[263] Bo Li and Colin N Dewey. “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome”. en. In: BMC Bioinformatics 12.1 (Aug. 2011), p. 323.
[264] D. Li et al. “Chromatin Accessibility Dynamics during iPSC Reprogramming”. In: Cell Stem Cell 21.6 (2017), pp. 819–833.
[265] H. Li et al. “The Ink4/Arf locus is a barrier for iPS cell reprogramming”. In: Nature 460 (2009), p. 1136.
[266] Jingling Li et al. “A transient DMSO treatment increases the differentiation potential of human pluripotent stem cells through the Rb family”. en. In: PLoS One 13.12 (Dec. 2018), e0208110.
[267] M. Li et al. “Integrative functional genomic analysis of human brain development and neuropsychiatric risks”. In: Science 362 (2018).
[268] R. Li et al.
[269] Victor C Li, Andrea Ballabeni, and Marc W Kirschner. “Gap 1 phase length and mouse embryonic stem cell self-renewal”. en. In: Proc. Natl. Acad. Sci. U. S. A. 109.31 (July 2012), pp. 12550–12555.
[270] Y. Li, C. B. Tang, and K. A. Kilian. “Matrix mechanics influence fibroblast–myofibroblast transition by directing the localization of histone deacetylase 4”. In: Cell Mol Bioeng. 10 (2017), pp. 405–415.
[271] A. Liberzon et al. “Molecular signatures database (MSigDB) 3.0”. In: Bioinformatics 27.12 (May 2011), pp. 1739–1740. doi: 10.1093/bioinformatics/btr260. url: https://doi.org/10.1093/bioinformatics/btr260.
[272] J. Listgarten et al. “Prediction of off-target activities for the end-to-end design of CRISPR guide RNAs”. In: Nat Biomed Eng 2 (2018), pp. 38–47. doi: 10.1038/s41551-017-0178-6.
[273] J. Liu et al. “The oncogene c-Jun impedes somatic cell reprogramming”. In: Nat Cell Biol 17.7 (2015), pp. 856–867.
[274] Ling Liu et al. “Chromatin modifications as determinants of muscle stem cell quiescence and chronological aging”. en. In: Cell Rep. 4.1 (July 2013), pp. 189–204.
[275] S. Liu et al. “Sp1/NFκB/HDAC/miR-29b regulatory network in KIT-driven myeloid leukemia”. In: Cancer Cell 17 (2010), pp. 333–347.
[276] Y. Liu, M. Lei, and C. E. Samuel. “Chimeric double-stranded RNA-specific adenosine deaminase ADAR1 proteins reveal functional selectivity of double-stranded RNA-binding domains from ADAR1 and protein kinase PKR”. In: Proc Natl Acad Sci U S A 97 (2000), pp. 12541–12546. doi: 10.1073/pnas.97.23.12541.
[277] D M Lloyd-Jones. “Prediction of Lifetime Risk for Cardiovascular Disease by Risk Factor Burden at 50 Years of Age”. In: Circulation 113.6 (Feb. 2006), pp. 791–798.
[278] Donald M. Lloyd-Jones et al. “Prediction of Lifetime Risk for Cardiovascular Disease by Risk Factor Burden at 50 Years of Age”. In: Circulation 113.6 (2006), pp. 791–798. doi: 10.1161/circulationaha.105.548206.
[279] John Lonsdale et al. “The Genotype-Tissue Expression (GTEx) project”. In: Nature Genetics 45.6 (May 2013), pp. 580–585. doi: 10.1038/ng.2653. url: https://doi.org/10.1038/ng.2653.
[280] Carlos Lopez-Otin et al. The Hallmarks of Aging. 2013.
[281] R. Lorenz et al. “ViennaRNA Package 2.0”. In: Algorithms for Molecular Biology 6 (2011), p. 26. doi: 10.1186/1748-7188-6-26.
[282] M. I. Love, W. Huber, and S. Anders. “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2”. In: Genome Biol (2014), pp. 1–21. doi: 10.1186/s13059-014-0550-8.
[283] S. M. Lundberg and S.-I. Lee. Consistent feature attribution for tree ensembles. arXiv e-prints, 2017. url: https://ui.adsabs.harvard.edu/abs/2017arXiv170606060L.
[284] Scott M Lundberg and Su-In Lee. “A Unified Approach to Interpreting Model Predictions”. In: Advances in Neural Information Processing Systems 30. Ed. by I. Guyon et al. Curran Associates, Inc., 2017, pp. 4765–4774. url: http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf.
[285] M. Luo et al. “NuRD Blocks Reprogramming of Mouse Somatic Cells into Pluripotent Stem Cells”. In: STEM CELLS 31.7 (2013), pp. 1278–1286.
[286] J. Ma, J. T. Yu, and L. Tan. “MS4A Cluster in Alzheimer’s Disease”. In: Mol Neurobiol. 51 (2015), pp. 1240–1248.
[287] M. A. Febbraio. “Faculty of 1000 evaluation for Large-scale physical activity data reveal worldwide activity inequality”. In: F1000 (2017). doi: http://dx.doi.org/10.3410/f.727795643.793534116.
[288] P. Machanick and T. L. Bailey. “MEME-ChIP: motif analysis of large DNA datasets”. In: Bioinformatics 27 (2011), pp. 1696–1697.
[289] M. J. Machiela and S. J. Chanock. “LDlink: A web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants”. In: Bioinformatics 31 (2015), pp. 3555–3557.
[290] A. MacLaren et al. “c-Jun-Deficient Cells Undergo Premature Senescence as a Result of Spontaneous DNA Damage Accumulation”. In: Molecular and Cellular Biology 24.20 (2004), pp. 9006–9018.
[291] Claire C Maesner, Albert E Almada, and Amy J Wagers. “Established cell surface markers efficiently isolate highly overlapping populations of skeletal muscle satellite cells by fluorescence-activated cell sorting”. en. In: Skelet. Muscle 6 (Nov. 2016), p. 35.
[292] Brendan Maher. ENCODE: The human encyclopaedia. 2012.
[293] A Mal. A role for histone deacetylase HDAC1 in modulating the transcriptional activity of MyoD: inhibition of the myogenic program. 2001.
[294] T. Maniatis, S. Goodbourn, and J. A. Fischer. “Regulation of inducible and tissue-specific gene expression”. In: Science 236 (1987), pp. 1237–1245.
[295] T J Marcell. Review Article: Sarcopenia: Causes, Consequences, and Preventions. 2003.
[296] Raphael Margueron et al. Role of the polycomb protein EED in the propagation of repressive histone marks. 2009.
[297] Anne Martin et al. “Interventions with potential to reduce sedentary time in adults: systematic review and meta-analysis”. In: Br. J. Sports Med. 49.16 (Aug. 2015), pp. 1056–1063.
[298] Marcel Martin. Cutadapt removes adapter sequences from high-throughput sequencing reads. 2011.
[299] G. Masliah, P. Barraud, and F. H. Allain. “RNA recognition by double-stranded RNA binding domains: a matter of shape and sequence”. In: Cell Mol Life Sci 70 (2013), pp. 1875–1895. doi: 10.1007/s00018-012-1119-x.
[300] M. M. Matthews et al. “Structures of human ADAR2 bound to dsRNA reveal base-flipping mechanism and basis for site selectivity”. In: Nat Struct Mol Biol 23 (2016), pp. 426–433. doi: 10.1038/nsmb.3203.
[301] M. T. Maurano et al. “Systematic Localization of Common Disease-Associated Variation in Regulatory DNA”. In: Science 337.6099 (Sept. 2012), pp. 1190–1195. doi: 10.1126/science.1222794. url: https://doi.org/10.1126/science.1222794.
[302] C. McCallum, J. Rooksby, and C. M. Gray. “Evaluating the Impact of Physical Activity Apps and Wearables: Interdisciplinary Review”. In: JMIR Mhealth Uhealth 6 (Mar. 2018).
[303] M. V. McConnell et al. “Feasibility of Obtaining Measures of Lifestyle From a Smartphone App: The MyHeart Counts Cardiovascular Health Study”. In: JAMA Cardiol 2.1 (Jan. 2017).
[304] O. G. McDonald et al. “Genome-scale epigenetic reprogramming during epithelial-to-mesenchymal transition”. In: Nat Struct Mol Biol 18 (2011), pp. 867–874.
[305] I. C. McDowell et al. “Clustering gene expression time series data using an infinite Gaussian process mixture model”. In: PLoS computational biology 14 (2018), p. 1.
[306] Ian C McDowell et al. “Clustering gene expression time series data using an infinite Gaussian process mixture model”. en. Apr. 2017.
[307] Ian C McDowell et al. “Clustering gene expression time series data using an infinite Gaussian process mixture model”. In: PLoS Comput. Biol. 14.1 (Jan. 2018), e1005896.
[308] M. R. McKeown et al. “Superenhancer analysis defines novel epigenomic subtypes of non-APL AML, including an RARa dependency targetable by SY-1425, a potent and selective RARa agonist”. In: Cancer Discov. 7 (2017), pp. 1136–1153.
[309] Cory Y McLean et al. “GREAT improves functional interpretation of cis-regulatory regions”. In: Nature Biotechnology 28.5 (May 2010), pp. 495–501. doi: 10.1038/nbt.1630. url: https://doi.org/10.1038/nbt.1630.
[310] Cory Y McLean et al. GREAT improves functional interpretation of cis-regulatory regions. 2010.
[311] T. Melcher et al. “A mammalian RNA editing enzyme”. In: Nature 379 (1996), pp. 460–464. doi: 10.1038/379460a0.
[312] Alexandre Melnikov et al. “Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay”. In: Nature Biotechnology 30.3 (2012), pp. 271–277. doi: 10.1038/nbt.2137.
[313] Samantha J Mentch et al. Histone Methylation Dynamics and Gene Regulation Occur through the Sensing of One-Carbon Metabolism. 2015.
[314] T. Merkle et al. “Precise RNA editing by recruiting endogenous ADARs with antisense oligonucleotides”. In: Nat Biotechnol 37 (2019), pp. 133–138. doi: 10.1038/s41587-019-0013-6.
[315] T. S. Mikkelsen et al. “Dissecting direct reprogramming through integrative genomic analysis”. In: Nature 454.7200 (2008), pp. 49–55.
[316] Edmond Mitchell, David Monaghan, and Noel O’Connor. “Classification of Sporting Activities Using Smartphone Accelerometers”. In: Sensors 13.4 (Apr. 2013), pp. 5317–5337. doi: 10.3390/s130405317. url: https://doi.org/10.3390/s130405317.
[317] A. V. Molofsky et al. “Astrocyte-encoded positional cues maintain sensorimotor circuit integrity”. In: Nature 509 (2014), pp. 189–194.
[318] M. F. Montiel-Gonzalez et al. “Correction of mutations within the cystic fibrosis transmembrane conductance regulator by site-directed RNA editing”. In: Proc Natl Acad Sci U S A 110 (2013), pp. 18285–18290. doi: 10.1073/pnas.1306243110.
[319] Melissa J Moore and Nick J Proudfoot. “Pre-mRNA processing reaches back to transcription and ahead to translation”. en. In: Cell 136.4 (Feb. 2009), pp. 688–700.
[320] J N Morris and M a Oxfd. “Physical Activity of Work”. In: Transportation (1953), pp. 1111–1120.
[321] M. R. Mumbach et al. “Enhancer connectome in primary human cells reveals target genes of disease-associated DNA elements”. In: Nat Genet 49 (2017), pp. 1602–1612.
[322] M. R. Mumbach et al. “HiChIP: efficient and sensitive analysis of protein-directed genome architecture”. In: Nat Methods 13 (2016), pp. 919–922.
[323] E. M. Murtagh, M. H. Murphy, and J. Boone-Heinonen. “Walking: the first steps in cardiovascular disease prevention”. In: Curr Opin Cardiol 25.5 (Sept. 2010), pp. 490–496.
[324] Jonathan Myers et al. “Exercise capacity and mortality among men referred for exercise testing”. In: N. Engl. J. Med. 346.11 (2002), pp. 793–801.
[325] Akiko Nagai et al. “Overview of the BioBank Japan Project: Study design and profile”. In: Journal of Epidemiology 27.3 (Mar. 2017), S2–S8. doi: 10.1016/j.je.2016.12.005. url: https://doi.org/10.1016/j.je.2016.12.005.
[326] H. Nakatani et al. “Ascl1/Mash1 promotes brain oligodendrogenesis during myelination and remyelination”. In: J. Neurosci. 33 (2013), pp. 9752–9768.
[327] M. A. Nalls et al. “Identification of novel risk loci, causal insights, and heritable risk for Parkinson’s disease: a meta-analysis of genome-wide association studies”. In: Lancet Neurol. 18 (2019), pp. 1091–1102.
[328] M. A. Nalls et al. “Large-scale meta-analysis of genome-wide association data identifies six new risk loci for Parkinson’s disease”. In: Nat Genet 46 (2014), pp. 989–993.
[329] K. Nishikura. “Functions and regulation of RNA editing by ADAR deaminases”. In: Annual review of biochemistry 79 (2010), pp. 321–349. doi: 10.1146/annurev-biochem-060208-105251.
[330] L G Norman. “THE HEALTH OF BUS DRIVERS A STUDY IN LONDON TRANSPORT”. In: Lancet 272.7051 (1958), pp. 807–812.
[331] A. Nott et al. “Brain cell type–specific enhancer–promoter interactome maps and disease-risk association”. In: Science 366 (2019), pp. 1134–1139.
[332] T. J. Nowakowski et al. “Spatiotemporal gene expression trajectories reveal developmental hierarchies of the human cortex”. In: Science 358 (2017), pp. 1318–1323.
[333] Henriette O’Geen, Lorigail Echipare, and Peggy J. Farnham. “Using ChIP-Seq Technology to Generate High-Resolution Profiles of Histone Modifications”. In: Methods in Molecular Biology. Humana Press, 2011, pp. 265–286. doi: 10.1007/978-1-61779-316-5_20. url: https://doi.org/10.1007/978-1-61779-316-5_20.
[334] OECD. OECD Guidelines on Measuring Subjective Well-Being. OECD Publishing, 2013.
[335] OECD. “OECD Guidelines on Measuring Subjective Well-being”. In: OECD Publishing 290 (2013).
[336] A. Okbay et al. “Genetic variants associated with subjective well-being, depressive symptoms, and neuroticism identified through genome-wide analyses”. In: Nat Genet 48 (2016), pp. 624–633.
[337] M. Olive et al. “A Dominant Negative to Activation Protein-1 (AP1) That Abolishes DNA Binding and Inhibits Oncogenesis”. In: Journal of Biological Chemistry 272.30 (1997), pp. 18586–18594.
[338] J. Y. Ooi et al. “HDAC inhibition attenuates cardiac hypertrophy by acetylation and deacetylation of target genes”. In: Epigenetics 10 (2015), pp. 418–430.
[339] Keith W Orford and David T Scadden. “Deconstructing stem cell self-renewal: genetic insights into cell-cycle regulation”. en. In: Nat. Rev. Genet. 9.2 (Feb. 2008), pp. 115–128.
[340] Kenji Osafune et al. “Marked differences in differentiation propensity among human embryonic stem cell lines”. en. In: Nat. Biotechnol. 26.3 (Mar. 2008), pp. 313–315.
[341] T. Otowa et al. “Meta-analysis of genome-wide association studies of anxiety disorders”. In: Mol Psychiatry 21 (2016), pp. 1391–1399.
[342] S. B. Overdijkink et al. “The Usability and Effectiveness of Mobile Health Technology–Based Lifestyle and Medical Intervention Apps Supporting Health Care During Pregnancy: Systematic Review”. In: JMIR mHealth and uHealth 6 (2018), p. 4.
[343] S Oyadomari and M Mori. Roles of CHOP/GADD153 in endoplasmic reticulum stress. 2004.
[344] R S Paffenbarger Jr et al. “Physical activity, all-cause mortality, and longevity of college alumni”. In: N. Engl. J. Med. 314.10 (Mar. 1986), pp. 605–613.
[345] R S Paffenbarger Jr et al. “Physical activity, all-cause mortality, and longevity of college alumni”. en. In: N. Engl. J. Med. 314.10 (Mar. 1986), pp. 605–613.
[346] Rajarshi Pal et al. “Diverse effects of dimethyl sulfoxide (DMSO) on the differentiation potential of human embryonic stem cells”. en. In: Arch. Toxicol. 86.4 (Apr. 2012), pp. 651–661.
[347] Francesca Pala et al. “Distinct metabolic states govern skeletal muscle stem cell fates during prenatal and postnatal myogenesis”. en. In: J. Cell Sci. 131.14 (July 2018).
[348] N. Pankratz et al. “Genomewide association study for susceptibility genes contributing to familial Parkinson disease”. In: Hum Genet 124 (2009), pp. 593–605.
[349] N. Pankratz et al. “Meta-analysis of Parkinson’s Disease: Identification of a novel locus, RIT2”. In: Ann. Neurol. 71 (2012), pp. 370–384.
[350] Stavros Papadopoulos et al. “The TileDB array data storage manager”. In: Proceedings of the VLDB Endowment 10.4 (Nov. 2016), pp. 349–360. doi: 10.14778/3025111.3025117. url: https://doi.org/10.14778/3025111.3025117.
[351] E. Park et al. “Population and allelic variation of A-to-I RNA editing in human transcriptomes”. In: Genome Biol 18 (2017), p. 143. doi: 10.1186/s13059-017-1270-7.
[352] E. Pascale et al. “Genetic architecture of MAPT gene region in parkinson disease subtypes”. In: Front Cell Neurosci. 10 (2016), pp. 1–7.
[353] M. J. Paszek et al. “Tensional homeostasis and the malignant phenotype”. In: Cancer Cell 8 (2005), pp. 241–254.
[354] M. S. Patel et al. “Effect of a Game-Based Intervention Designed to Enhance Social Incentives to Increase Physical Activity Among Families: The BE FIT Randomized Clinical Trial”. In: JAMA Intern Med 177.11 (Nov. 2017).
[355] Siim Pauklin and Ludovic Vallier. “The Cell-Cycle State of Stem Cells Determines Cell Fate Propensity”. en. In: Cell 156.6 (Sept. 2013), p. 1338.
[356] N. Paz-Yaacov et al. “Adenosine-to-inosine RNA editing shapes transcriptome diversity in primates”. In: Proc Natl Acad Sci U S A 107 (2010), pp. 12174–12179. doi: 10.1073/pnas.1006183107.
[357] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: J Mach Learn Res 12 (2011), pp. 2825–2830.
[358] Antoine H F M Peters et al. Partitioning and Plasticity of Repressive Histone Methylation States in Mammalian Chromatin. 2003.
[359] A. Phinyomark et al. “Feature Extraction and Reduction of Wavelet Transform Coefficients for EMG Pattern Classification”. In: Electronics and Electrical Engineering 122.6 (June 2012). doi: 10.5755/j01.eee.122.6.1816. url: https://doi.org/10.5755/j01.eee.122.6.1816.
[360] Inês Pinheiro et al. Prdm3 and Prdm16 are H3K9me1 Methyltransferases Required for Mammalian Heterochromatin Integrity. 2012.
[361] R. Pique-Regi et al. “Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data”. In: Genome Res 21 (2011), pp. 447–455.
[362] John Platt. “Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods”. In: Adv. Large Margin Classif. 10 (June 2000).
[363] H. A. Pliner et al. “Cicero Predicts cis-Regulatory DNA Interactions from Single-Cell Chromatin Accessibility Data”. In: Mol Cell 71 (2018), pp. 858–871.
[364] Hannah A Pliner et al. Cicero Predicts cis-Regulatory DNA Interactions from Single-Cell Chromatin Accessibility Data. 2018.
[365] A. G. Polson and B. L. Bass. “Preferential selection of adenosines for modification by double-stranded RNA adenosine deaminase”. In: Embo J 13 (1994), pp. 5701–5711.
[366] H. T. Porath, S. Carmi, and E. Y. Levanon. “A genome-wide map of hyper-edited RNA reveals numerous new sites”. In: Nat Commun 5 (2014), p. 4726. doi: 10.1038/ncomms5726.
[367] Ermelinda Porpiglia et al. High-resolution myogenic lineage mapping by single-cell mass cytometry. 2017.
[368] Pouya Kheradpour, Alicia Martin, and Peyton Greenside. LD matrices for 1000 genomes phase 1 files for EUR and YRI. 2019. doi: 10.5281/ZENODO.3404275. url: https://zenodo.org/record/3404275.
[369] Stéphanie A Prince et al. “A comparison of direct versus self-report measures for assessing physical activity in adults: a systematic review”. In: Int. J. Behav. Nutr. Phys. Act. 5 (Nov. 2008), p. 56.
[370] P. P. Provenzano et al. “Matrix density-induced mechanoregulation of breast cell phenotype, signaling and gene expression through a FAK–ERK linkage”. In: Oncogene 28 (2009), pp. 4326–4343.
[371] S. Purcell et al. “PLINK: A tool set for whole-genome association and population-based linkage analyses”. In: Am J Hum Genet 81 (2007), pp. 559–575.
[372] P L Puri et al. “Differential roles of p300 and PCAF acetyltransferases in muscle differentiation”. en. In: Mol. Cell 1.1 (Dec. 1997), pp. 35–45.
[373] Lei S. Qi et al. “Repurposing CRISPR as an RNA-Guided Platform for Sequence-Specific Control of Gene Expression”. In: Cell 152.5 (2013), pp. 1173–1183.
[374] K. Qu et al. “Chromatin accessibility landscape of cutaneous T cell lymphoma and dynamic response to HDAC inhibitors”. In: Cancer Cell 32 (2017), pp. 27–41.
[375] L. Qu et al. Programmable RNA editing by recruiting endogenous ADAR using engineered RNAs. Nature Biotechnology, 2019. doi: 10.1038/s41587-019-0178-z.
[376] International Physical Activity Questionnaire. url: https://sites.google.com/site/theipaq/.
[377] A. R. Quinlan and I. M. Hall. “BEDTools: A flexible suite of utilities for comparing genomic features”. In: Bioinformatics 26 (2010), pp. 841–842.
[378] H. Rafehi et al. “Systems approach to the pharmacological actions of HDAC inhibitors reveals EP300 activities and convergent mechanisms of regulation in diabetes”. In: Epigenetics 12 (2017), pp. 991–1003.
[379] H. Rafehi et al. “Vascular histone deacetylation by pharmacological HDAC inhibition”. In: Genome Res 24 (2014), pp. 1271–1284.
[380] H. Rafehi and A. El-Osta. “HDAC inhibition in vascular endothelial cells regulates the expression of ncRNAs”. In: Noncoding RNA 2 (2016), p. 4.
[381] Y. Rais et al. “Deterministic direct reprogramming of somatic cells to pluripotency”. In: Nature 502.7469 (2013), pp. 65–70.
[382] P. Rajarajan et al. “Neuron-specific signatures in the chromosomal connectome associated with schizophrenia risk”. In: Science 362 (2018).
[383] G. Ramaswami et al. “Genetic mapping uncovers cis-regulatory landscape of RNA editing”. In: Nat Commun 6 (2015), p. 8194. doi: 10.1038/ncomms9194.
[384] G. Ramaswami and J. B. Li. RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res, 2013. doi: 10.1093/nar/gkt996.
[385] G. Ramaswami and J. B. Li. “RADAR: a rigorously annotated database of A-to-I RNA editing”. In: Nucleic Acids Res 42 (2014), pp. D109–113. doi: 10.1093/nar/gkt996.
[386] Fidel Ramirez et al. “deepTools2: a next generation web server for deep-sequencing data analysis”. In: Nucleic Acids Research 44.W1 (Apr. 2016), W160–W165. doi: 10.1093/nar/gkw257. url: https://doi.org/10.1093/nar/gkw257.
[387] Priscila Ramos-Ibeas et al. “Pluripotency and X chromosome dynamics revealed in pig pre-gastrulating embryos by single cell analysis”. en. In: Nat. Commun. 10.1 (Jan. 2019), p. 500.
[388] M. Rehmsmeier et al. “Fast and effective prediction of microRNA/target duplexes”. In: RNA 10 (2004), pp. 1507–1517. doi: 10.1261/rna.5248604.
[389] K. Reiche. RNAclust: A tool for clustering of RNAs based on their secondary structures using LocARNA. 2010. url: http://www.bioinf.uni-leipzig.de/~kristin/Software/RNAclust/.
[390] Ruotong Ren et al. Regulation of Stem Cell Aging by Metabolism and Epigenetics. 2017.
[391] Matthew E Ritchie et al. “limma powers differential expression analyses for RNA-sequencing and microarray studies”. en. In: Nucleic Acids Res. 43.7 (Apr. 2015), e47.
[392] Michael I Robson et al. “Tissue-Specific Gene Repositioning by Muscle Nuclear Membrane Proteins Enhances Repression of Critical Developmental Genes during Myogenesis”. en. In: Mol. Cell 62.6 (June 2016), pp. 834–847.
[393] Joseph T Rodgers et al. mTORC1 controls the adaptive transition of quiescent stem cells from G0 to GAlert. 2014.
[394] A. B. Rosenberg et al. “Learning the sequence determinants of alternative splicing from millions of random sequences”. In: Cell 163 (2015), pp. 698–711. doi: 10.1016/j.cell.2015.09.054.
[395] Mary E Rosenberger et al. “24 Hours of Sleep, Sedentary Behavior, and Physical Activity with Nine Wearable Devices”. In: Med. Sci. Sports Exerc. (Oct. 2015).
[396] E. Rouka et al. “Differential recognition preferences of the three Src Homology 3 (SH3) domains from the adaptor CD2-associated Protein (CD2AP) and Direct Association with Ras and Rab Interactor 3 (RIN3)”. In: J Biol Chem 290 (2015), pp. 25275–25292.
[397] S. Rouskin et al. “Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo”. In: Nature 505 (2014), pp. 701–705. doi: 10.1038/nature12894.
[398] James G Ryall et al. The NAD+-Dependent SIRT1 Deacetylase Translates a Metabolic Switch into Regulatory Epigenetics in Skeletal Muscle Stem Cells. 2015.
[399] Tammy Ryan et al. Retinoic Acid Enhances Skeletal Myogenesis in Human Embryonic Stem Cells by Expanding the Premyogenic Progenitor Population. 2012.
[400] Asako Sakaue-Sawano et al. “Tracing the silhouette of individual cells in S/G2/M phases with fluorescence”. en. In: Chem. Biol. 15.12 (Dec. 2008), pp. 1243–1248.
[401] Danielle Sambo et al. Transient Treatment of Human Pluripotent Stem Cells with DMSO to Promote Differentiation — Protocol. https://www.jove.com/video/59833/transient-treatment-human-pluripotent-stem-cells-with-dmso-to-promote. Accessed: 2019-4-21.
[402] A. L. Sapiro et al. “Cis regulatory effects on A-to-I RNA editing in related Drosophila species”. In: Cell Rep 11 (2015), pp. 697–703. doi: 10.1016/j.celrep.2015.04.005.
[403] Vittorio Sartorelli and Pier Lorenzo Puri. “Shaping Gene Expression by Landscaping Chromatin Architecture: Lessons from a Master”. en. In: Mol. Cell 71.3 (Aug. 2018), pp. 375–388.
[404] A. T. Satpathy et al. “Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion”. In: Nat Biotechnol. 37 (2019), pp. 925–936.
[405] Y. A. Savva, L. E. Rieder, and R. A. Reenan. “The ADAR protein family”. In: Genome Biol 13 (2012), p. 252. doi: 10.1186/gb-2012-13-12-252.
[406] S. C. Mukhopadhyay. “Wearable Sensors for Human Activity Monitoring: A Review”. In: IEEE Sensors Journal 15 (2015), pp. 1321–1330. doi: 10.1109/jsen.2014.2370945.
[407] Juergen Scharner and Peter S Zammit. “The muscle satellite cell at 50: the formative years”. en. In: Skelet. Muscle 1.1 (Aug. 2011), p. 28.
[408] K. H. Schlingensiepen et al. “The role of Jun transcription factor expression and phosphorylation in neuronal differentiation, neuronal cell death, and plastic adaptations in vivo”. In: Cell Mol Neurobiol. 14 (1994), pp. 487–505.
[409] M. D. Schmidt et al. “Cardiometabolic risk in younger and older adults across an index of ambulatory activity”. In: Am J Prev Med 37.4 (Oct. 2009), pp. 278–284.
[410] Patrick Seale et al. PRDM16 controls a brown fat/skeletal muscle switch. 2008.
[411] Yogev Sela et al. “Human embryonic stem cells exhibit increased propensity to differentiate during the G1 phase prior to phosphorylation of retinoblastoma protein”. en. In: Stem Cells 30.6 (June 2012), pp. 1097–1108.
[412] Payel Sen et al. “Epigenetic Mechanisms of Longevity and Aging”. en. In: Cell 166.4 (Aug. 2016), pp. 822–839.
[413] Bridge Server. Home - Bridge - Confluence. url: https://sagebionetworks.jira.com/wiki/display/BRIDGE/Bridge+Server+Home.
[414] N. Y. A. Sey et al. “A computational tool (H-MAGMA) for improved prediction of brain-disorder risk genes by incorporating brain chromatin interaction profiles”. In: Nat Neurosci. 23 (2020), pp. 583–593.
[415] L. Shallev et al. “Decreased A-to-I RNA editing as a source of keratinocytes’ dsRNA in psoriasis”. In: RNA 24 (2018), pp. 828–840. doi: 10.1261/rna.064659.117.
[416] A. Sharma et al. “Utilizing mobile technologies to improve physical activity and medication adherence in patients with heart failure and diabetes mellitus: Rationale and design of the TARGET-HF-DM trial”. In: Am Heart J (2018). doi: http://dx.doi.org/10.1016/j.ahj.2019.01.007.
[417] E. Shaulian and M. Karin. “AP-1 in cell proliferation and survival”. In: Oncogene 20.19 (2001), pp. 2390–2400.
[418] Fangzhou Shen et al. “Genome-scale network model of metabolism and histone acetylation reveals metabolic dependencies of histone deacetylase inhibitors”. en. In: Genome Biol. 20.1 (Mar. 2019), p. 49.
[419] R J Shephard. “Limits to the measurement of habitual physical activity by questionnaires”. In: Br. J. Sports Med. 37.3 (June 2003), 197–206, discussion 206.
[420] S. T. Sherry. “dbSNP: the NCBI database of genetic variation”. In: Nucleic Acids Research 29.1 (Jan. 2001), pp. 308–311. doi: 10.1093/nar/29.1.308. url: https://doi.org/10.1093/nar/29.1.308.
[421] Y. Shi et al. “Induction of Pluripotent Stem Cells from Mouse Embryonic Fibroblasts by Oct4 and Klf4 with Small-Molecule Compounds”. In: Cell Stem Cell 3.5 (2008), pp. 568–574.
[422] Tomer Shlomi et al. Network-based prediction of human tissue-specific metabolism. 2008.
[423] A. Shrikumar, E. Prakash, and A. Kundaje. “GkmExplain: Fast and accurate interpretation of nonlinear gapped k-mer SVMs”. In: Bioinformatics 35 (2019).
[424] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning Important Features Through Propagating Activation Differences. 2017. arXiv: 1704.02685 [cs.CV].
[425] Avanti Shrikumar, Eva Prakash, and Anshul Kundaje. “GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs”. In: Bioinformatics 35.14 (July 2019), pp. i173–i182. doi: 10.1093/bioinformatics/btz322. url: https://doi.org/10.1093/bioinformatics/btz322.
[426] J. Simón-Sánchez et al. “Genome-wide association study reveals genetic risk underlying Parkinson’s disease”. In: Nat Genet 41 (2009), pp. 1308–1312.
[427] Amar M Singh et al. “Cell-Cycle Control of Developmentally Regulated Transcription Factors Accounts for Heterogeneity in Human Pluripotent Cells”. en. In: Stem Cell Reports 2.3 (Mar. 2013), p. 398.
[428] Amar M Singh et al. “Utilizing FUCCI reporters to understand pluripotent stem cell biology”. en. In: Methods 101 (May 2016), pp. 4–10.
[429] Param Priya Singh et al. “The Genetics of Aging: A Vertebrate Perspective”. en. In: Cell 177.1 (Mar. 2019), pp. 200–220.
[430] S. J. H. Biddle, N. Mutrie, and T. Gorely. Psychology of Physical Activity: Determinants, Well-Being and Interventions. Routledge, 2015. 434 pp.
[431] Montgomery Slatkin. “Linkage disequilibrium — understanding the evolutionary past and mapping the medical future”. In: Nature Reviews Genetics 9.6 (June 2008), pp. 477–485. doi: 10.1038/nrg2361. url: https://doi.org/10.1038/nrg2361.
[432] Denise N Slenter et al. “WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research”. en. In: Nucleic Acids Res. 46.D1 (Jan. 2018), pp. D661–D667.
[433] A. M. Smith et al. “The transcription factor PU.1 is critical for viability and function of human brain microglia”. In: Glia 61 (2013), pp. 929–942.
[434] Aaron Smith. Smartphone ownership 2013. Pew Research Center. 2013.
[435] Austin G Smith. “Embryo-Derived Stem Cells: Of Mice and Men”. In: Annu. Rev. Cell Dev. Biol. 17.1 (2001), pp. 435–462.
[436] L. Song and G. E. Crawford. “DNase-seq: A High-Resolution Technique for Mapping Active Gene Regulatory Elements across the Genome from Mammalian Cells”. In: Cold Spring Harbor Protocols 2010.2 (Feb. 2010), pdb.prot5384–pdb.prot5384. doi: 10.1101/pdb.prot5384. url: https://doi.org/10.1101/pdb.prot5384.
[437] M. Song et al. “Mapping cis-regulatory chromatin contacts in neural cells links neuropsychiatric disorder risk variants to target genes”. In: Nat Genet 51 (2019), pp. 1252–1262.
[438] Pedro Sousa-Victor et al. Geriatric muscle stem cells switch reversible quiescence into senescence. 2014.
[439] Chris C. A. Spencer et al. “Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip”. In: PLoS Genetics 5.5 (May 2009). Ed. by John D. Storey, e1000477. doi: 10.1371/journal.pgen.1000477. url: https://doi.org/10.1371/journal.pgen.1000477.
[440] T. Stafforst and M. F. Schneider. “An RNA-deaminase conjugate selectively repairs point mutations”. In: Angew Chem Int Ed Engl 51 (2012), pp. 11166–11169. doi: 10.1002/anie.201206489.
[441] E. Stage et al. “The effect of the top 20 Alzheimer disease risk genes on gray-matter density and FDG PET brain metabolism”. In: Alzheimer’s Dement Diagnosis Assess Dis Monit. 5 (2016), pp. 53–66.
[442] H. Stefansson et al. “A common inversion under selection in Europeans”. In: Nat Genet 37 (2005), pp. 129–137.
[443] H. Stefansson et al. “A common inversion under selection in Europeans”. In: Nat Genet 37 (2005), pp. 129–137.
[444] O. M. Stephens, B. L. Haudenschild, and P. A. Beal. “The binding selectivity of ADAR2’s dsRBMs contributes to RNA-editing selectivity”. In: Chem Biol 11 (2004), pp. 1239–1250. doi: 10.1016/j.chembiol.2004.06.009.
[445] Steven G. Hershman, Brian M. Bot, Anna Shcherbina, Megan Doerr, Yasbanoo Moayedi, Aleksandra Pavlovic, Daryl Waggott, Mildred K. Cho, Mary E. Rosenberger, William L. Haskell, Jonathan Myers, Mary Ann Champagne, Emmanuel Mignot, Dario Salvi, Martin Landray, Lionel Tarassenko, Robert A. Harrington, Alan C. Yeung, Michael V. McConnell, and Euan A. Ashley. MyHeart Counts physical activity, sleep, and cardiovascular health data on a free-living cohort of 50
[446] C. C. Stolt et al. “The Sox9 transcription factor determines glial fate choice in the developing spinal cord”. In: Genes Dev 17 (2003), pp. 1677–1689.
[447] R. S. Stowers et al. “Extracellular matrix stiffening induces a malignant phenotypic transition in breast epithelial cells”. In: Cell Mol Bioeng. 10 (2016), pp. 114–123.
[448] Barbara E. Stranger, Eli A. Stahl, and Towfique Raj. “Progress and Promise of Genome-Wide Association Studies for Human Complex Trait Genetics”. In: Genetics 187.2 (Nov. 2010), pp. 367–383. doi: 10.1534/genetics.110.120907. url: https://doi.org/10.1534/genetics.110.120907.
[449] Scott J Strath et al. “Guide to the assessment of physical activity: Clinical and research applications: a scientific statement from the American Heart Association”. In: Circulation 128.20 (Nov. 2013), pp. 2259–2279.
[450] Tara W Strine et al. “The associations between life satisfaction and health-related quality of life, chronic illness, and health behaviors among US community-dwelling adults”. In: J. Community Health 33.1 (2008), pp. 40–50.
[451] T. Stuart et al. “Comprehensive Integration of Single-Cell Data”. In: Cell 177 (2019), pp. 1888–1902.
[452] Cathie Sudlow et al. “UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age”. In: PLOS Medicine 12.3 (Mar. 2015), e1001779. doi: 10.1371/journal.pmed.1001779. url: https://doi.org/10.1371/journal.pmed.1001779.
[453] J. Swift et al. “Nuclear lamin-A scales with tissue stiffness and enhances matrix-directed differentiation”. In: Science 341 (2013), p. 1240104.
[454] T. Mai, G. J. Markov, and J. J. Brady. “NKX3-1 is required for induced pluripotent stem cell reprogramming and can replace OCT4 in mouse and human iPSC induction”. In: Nat Cell Biol 20.8 (2018), pp. 900–908.
[455] V. Tabar and L. Studer. “Pluripotent stem cells in regenerative medicine: challenges and recent progress”. In: Nature Reviews Genetics 15 (2014), p. 82.
[456] A. Tajik et al. “Transcription upregulation via force-induced direct stretching of chromatin”. In: Nat Mater 15 (2016), pp. 1287–1296.
[457] K. Takahashi and S. Yamanaka. “Induction of Pluripotent Stem Cells from Mouse Embryonic and Adult Fibroblast Cultures by Defined Factors”. In: Cell 126.4 (2006), pp. 663–676.
[458] Vivian Tam et al. “Benefits and limitations of genome-wide association studies”. In: Nature Reviews Genetics 20.8 (May 2019), pp. 467–484. doi: 10.1038/s41576-019-0127-1. url: https://doi.org/10.1038/s41576-019-0127-1.
[459] R. E. Taylor-Piliae, W. L. Haskell, C. Iribarren, et al. “Clinical utility of the Stanford brief activity survey in men and women with early-onset coronary artery disease”. In: J Cardiopulm Rehabil Prev 27.4 (2007), pp. 227–232.
[460] Ruth E. Taylor-Piliae et al. “Validation of a New Brief Physical Activity Survey among Men and Women Aged 60–69 Years”. In: American Journal of Epidemiology 164.6 (2006), pp. 598–606. doi: 10.1093/aje/kwj248.
[461] Topic: Wearable technology. Statista, 2020. url: https://www.statista.com/topics/1556/wearable-technology/.
[462] “The International HapMap Project”. In: Nature 426.6968 (Dec. 2003), pp. 789–796. doi: 10.1038/nature02168. url: https://doi.org/10.1038/nature02168.
[463] R. V. Lenth. “Least-Squares Means: The R Package lsmeans”. In: J Stat Softw 69.1 (2016). doi: 10.18637/jss.v069.i01.
[464] W. Thielicke and E. J. Stamhuis. “Towards user-friendly, affordable and accurate digital particle image velocimetry in MATLAB”. In: J. Open Res 2 (2014).
[465] J. M. Thomas and P. A. Beal. “How do ADARs bind RNA? New protein-RNA structures illuminate substrate recognition by the RNA editing ADARs”. In: Bioessays 39 (2017). doi: 10.1002/bies.201600187.
[466] N. Tian et al. “A structural determinant required for RNA editing”. In: Nucleic Acids Res 39 (2011), pp. 5669–5681. doi: 10.1093/nar/gkr144.
[467] G. Tiscornia, E. L. Vivas, and J. C. I. Belmonte. “Diseases in a dish: modeling human genetic disorders using induced pluripotent cells”. In: Nature Medicine 17 (2011), p. 1570.
[468] Milica Tosic et al. Lsd1 regulates skeletal muscle regeneration and directs the fate of satellite cells. 2018.
[469] A. E. Trevino et al. “Chromatin accessibility dynamics in a model of human forebrain development”. In: Science 367 (2020).
[470] Richard P Troiano et al. “Physical Activity in the United States Measured by Accelerometer”. In: Med. Sci. Sports Exercise 40.1 (2008), pp. 181–188.
[471] Alexander M Tsankov et al. “Transcription factor binding dynamics during human ES cell differentiation”. en. In: Nature 518.7539 (Feb. 2015), pp. 344–349.
[472] C. Tudor-Locke. “Steps to Better Cardiovascular Health: How Many Steps Does It Take to Achieve Good Health and How Confident Are We in This Number?” In: Curr Cardiovasc Risk Rep 4.4 (July 2010), pp. 271–276.
[473] M. Uhlen et al. “Tissue-based map of the human proteome”. In: Science 347 (2015), p. 1260419.
[474] C. Uhler and G. V. Shivashankar. “Regulation of genome organization and gene expression by nuclear mechanotransduction”. In: Nat Rev Mol Cell Biol 18 (2017), pp. 717–727.
[475] J. C. Ulirsch et al. “Interrogation of human hematopoiesis at single-cell and single-variant resolution”. In: Nat Genet 51 (2019), pp. 683–693.
[476] Kfir Baruch Umansky et al. “Runx1 Transcription Factor Is Required for Myoblasts Proliferation during Muscle Regeneration”. en. In: PLoS Genet. 11.8 (Aug. 2015), e1005457.
[477] G. T. Valenca et al. The Role of MAPT Haplotype H2 and Isoform 1N/4R in Parkinsonism of Older Adults. PLoS One, 2016.
[478] Jolien Vanhove et al. “H3K27me3 Does Not Orchestrate the Expression of Lineage-Specific Markers in hESC-Derived Hepatocytes In Vitro”. In: Stem Cell Reports 7.2 (2016), pp. 192–206.
[479] E Verdin. NAD in aging, metabolism, and neurodegeneration. 2015.
[480] T. Vierbuchen et al. “AP-1 Transcription Factors and the BAF Complex Mediate Signal-Dependent Enhancer Selection”. In: Molecular Cell 68.6 (2017), pp. 1067–1082.
[481] Jeff Vierstra et al. “Global reference mapping of human transcription factor footprints”. In: Nature 583.7818 (July 2020), pp. 729–736. doi: 10.1038/s41586-020-2528-x. url: https://doi.org/10.1038/s41586-020-2528-x.
[482] P. Vogel et al. “Efficient and precise editing of endogenous transcripts with SNAP-tagged ADARs”. In: Nat Methods 15 (2018), pp. 535–538. doi: 10.1038/s41592-018-0017-z.
[483] P. Vogel and T. Stafforst. “Critical review on engineering deaminases for site-directed RNA editing”. In: Curr Opin Biotechnol 55 (2019), pp. 74–80. doi: 10.1016/j.copbio.2018.08.006.
[484] P. Vogel and T. Stafforst. “Site-directed RNA editing with antagomir deaminases: a tool to study protein and RNA function”. In: ChemMedChem 9 (2014), pp. 2021–2025. doi: 10.1002/cmdc.201402139.
[485] Jiyong Wang, Sharon T Jia, and Songtao Jia. “New Insights into the Regulation of Heterochromatin”. en. In: Trends Genet. 32.5 (May 2016), pp. 284–294.
[486] Y. Wang, S. Park, and P. A. Beal. “Selective Recognition of RNA Substrates by ADAR Deaminase Domains”. In: Biochemistry 57 (2018), pp. 1640–1651. doi: 10.1021/acs.biochem.7b01100.
[487] Y. Wang, Y. Zheng, and P. A. Beal. “Adenosine Deaminases That Act on RNA (ADARs)”. In: Enzymes 41 (2017), pp. 215–268. doi: 10.1016/bs.enz.2017.03.006.
[488] Yu Xin Wang and Michael A Rudnicki. Satellite cells, the engines of muscle repair. 2012.
[489] Kyoko Watanabe et al. “Functional mapping and annotation of genetic associations with FUMA”. In: Nature Communications 8.1 (Nov. 2017). doi: 10.1038/s41467-017-01261-5. url: https://doi.org/10.1038/s41467-017-01261-5.
[490] Dong-Qing Wei et al. Translational Bioinformatics and Its Application. en. Springer, Mar. 2017.
[491] M. Wernig et al. “c-Myc Is Dispensable for Direct Reprogramming of Mouse Fibroblasts”. In: Cell Stem Cell 2.1 (2008), pp. 10–12.
[492] J. Wettengel et al. “Harnessing human ADAR2 for RNA repair - Recoding a PINK1 mutation rescues mitophagy”. In: Nucleic Acids Res 45 (2017), pp. 2797–2808. doi: 10.1093/nar/gkw911.
[493] Josephine White and Stephen Dalton. “Cell cycle control of embryonic stem cells”. en. In: Stem Cell Rev. 1.2 (2005), pp. 131–138.
[494] H Wickham. “Programming with ggplot2”. In: use R! (2016), pp. 241–253.
[495] S. Will et al. “LocARNA-P: accurate boundary prediction and improved detection of structural RNAs”. In: RNA 18 (2012), pp. 900–914. doi: 10.1261/rna.029041.111.
[496] S. Will et al. “Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering”. In: PLoS Comput Biol 3 (2007). doi: 10.1371/journal.pcbi.0030065.
[497] K. Wisdom and O. Chaudhuri. “3D cell culture in interpenetrating networks of alginate and rBM matrix”. In: Methods Mol Biol 1612 (2017), pp. 29–37.
[498] S. K. Wong, S. Sato, and D. W. Lazinski. “Substrate recognition by ADAR1 and ADAR2”. In: RNA 7 (2001), pp. 846–858. doi: 10.1017/s135583820101007x.
[499] T. M. Woolf, J. M. Chase, and D. T. Stinchcomb. “Toward the therapeutic editing of mutated RNA sequences”. In: Proc Natl Acad Sci U S A 92 (1995), pp. 8298–8302. doi: 10.1073/pnas.92.18.8298.
[500] W. B. Wu. “Isotonic regression: Another look at the changepoint problem”. In: Biometrika 88.3 (Oct. 2001), pp. 793–804. doi: 10.1093/biomet/88.3.793. url: https://doi.org/10.1093/biomet/88.3.793.
[501] Guangyan Xiong et al. “The PERK arm of the unfolded protein response regulates satellite cell-mediated skeletal muscle regeneration”. en. In: Elife 6 (Mar. 2017).
[502] H. Y. Xiong et al. “RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease”. In: Science 347 (2015), p. 1254806. doi: 10.1126/science.1254806.
[503] R. Xu et al. “Sustained activation of STAT5 is essential for chromatin remodeling and maintenance of mammary-specific function”. In: J. Cell Biol 184 (2009), pp. 57–66.
[504] W. Xu, L. Tan, and J. T. Yu. “The Role of PICALM in Alzheimer’s Disease”. In: Mol Neurobiol. 52 (2015), pp. 399–413.
[505] Y.-F. Y. Chan et al. “The Asthma Mobile Health Study, a large-scale clinical observational study using ResearchKit”. In: Nat Biotechnol 35.4 (Apr. 2017), pp. 354–362.
[506] Galip Gürkan Yardımcı et al. “Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection”. In: Nucleic Acids Research 42.19 (Oct. 2014), pp. 11865–11878. doi: 10.1093/nar/gku810. url: https://doi.org/10.1093/nar/gku810.
[507] J. D. Yesselman et al. “Updates to the RNA mapping database (RMDB), version 2”. In: Nucleic Acids Res 46 (2018), pp. D375–D379. doi: 10.1093/nar/gkx873.
[508] Hang Yin, Feodor Price, and Michael A Rudnicki. Satellite Cells and the Muscle Stem Cell Niche. 2013.
[509] C. Yu et al. “Small molecules enhance CRISPR genome editing in pluripotent stem cells”. In: Cell Stem Cell 16 (2015), pp. 142–147. doi: 10.1016/j.stem.2015.01.003.
[510] Z W Yu and P J Quinn. “Dimethyl sulphoxide: a review of its applications in cell biology”. en. In: Biosci. Rep. 14.6 (Dec. 1994), pp. 259–281.
[511] Silvia Zecchini et al. “Autophagy controls neonatal myogenesis by regulating the GH-IGF1 system through a NFE2L2- and DDIT3-mediated mechanism”. en. In: Autophagy 15.1 (Jan. 2019), pp. 58–77.
[512] Gabriel E Zentner and Steven Henikoff. “Regulation of nucleosome dynamics by histone modifications”. In: Nature Structural & Molecular Biology 20.3 (Mar. 2013), pp. 259–266. doi: 10.1038/nsmb.2470. url: https://doi.org/10.1038/nsmb.2470.
[513] H-M Zhang et al. AnimalTFDB: a comprehensive animal transcription factor database. 2012.
[514] R. Zhang et al. “Quantifying RNA allelic ratios by microfluidic multiplex PCR and sequencing”. In: Nat Methods 11 (2014), pp. 51–54. doi: 10.1038/nmeth.2736.
[515] R. Zhang et al. “Evolutionary analysis reveals regulatory and functional landscape of coding and non-coding RNA editing”. In: PLoS Genet 13 (2017). doi: 10.1371/journal.pgen.1006563.
[516] Y. Zhang, M. Liao, and M. L. Dufau. “Phosphatidylinositol 3-kinase/protein kinase Cζ-induced phosphorylation of Sp1 and p107 repressor release have a critical role in histone deacetylase inhibitor-mediated derepression of transcription of the luteinizing hormone receptor gene”. In: Mol Cell Biol 26 (2006), pp. 6748–6761.
[517] Yong Zhang et al. “Model-based analysis of ChIP-Seq (MACS)”. en. In: Genome Biol. 9.9 (Sept. 2008), R137.
[518] Xiuying Zhong et al. “Mitochondrial Dynamics Is Critical for the Full Pluripotency and Embryonic Developmental Potential of Pluripotent Stem Cells”. en. In: Cell Metab. (Dec. 2018).
[519] Jian Zhou and Olga G Troyanskaya. “Predicting effects of noncoding variants with deep learning–based sequence model”. In: Nature Methods 12.10 (Aug. 2015), pp. 931–934. doi: 10.1038/nmeth.3547. url: https://doi.org/10.1038/nmeth.3547.
[520] Jian Zhou and Olga G Troyanskaya. “Predicting effects of noncoding variants with deep learning–based sequence model”. In: Nature Methods 12.10 (Aug. 2015), pp. 931–934. doi: 10.1038/nmeth.3547. url: https://doi.org/10.1038/nmeth.3547.
[521] Xin Zhou and Ting Wang. “Using the Wash U Epigenome Browser to examine genome-wide sequencing data”. en. In: Curr. Protoc. Bioinformatics Chapter 10 (Dec. 2012), Unit10.10.
[522] M. C. Zillikens et al. “Large meta-analysis of genome-wide association studies identifies five loci for lean body mass”. In: Nat Commun. 8 (2017).
[523] M. C. Zody et al. “Evolutionary toggling of the MAPT 17q21.31 inversion region”. In: Nat Genet 40 (2008), pp. 1076–1083.
[524] M. Zubradt et al. “DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo”. In: Nat Methods 14 (2017), pp. 75–82. doi: 10.1038/nmeth.4057.

ProQuest Number: 28354110


# non-stranded (can be PE or SE)
bamCoverage -p16 -v --binSize 1 --samFlagExclude 780 --Offset 1 1 \
    --minMappingQuality 30 -b $cur_bam -o $cur_bam.bpnet.unstranded.bw

# forward strand -- assumes SE data
bamCoverage -p16 -v --binSize 1 --samFlagExclude 796 --Offset 1 1 \
    --minMappingQuality 30 -b $cur_bam -o $cur_bam.bpnet.plus.bw
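The two --samFlagExclude values used so far can be read as bitmasks over the standard SAM FLAG bits. The sketch below is illustrative only: the variable names are hypothetical, and the decomposition is inferred from the SAM flag definitions rather than stated in the source. It shows that 780 excludes unmapped, mate-unmapped, secondary, and QC-failed reads, and that 796 additionally excludes reverse-strand reads, which is what leaves only forward-strand coverage.

```shell
#!/usr/bin/env bash
# SAM FLAG bits, as defined in the SAM format specification.
UNMAPPED=4         # read is unmapped
MATE_UNMAPPED=8    # mate is unmapped
REVERSE=16         # read aligned to the reverse strand
SECONDARY=256      # not the primary alignment
QC_FAIL=512        # fails platform/vendor quality checks

# Mask for the non-stranded track: drop low-quality and ambiguous records.
unstranded_mask=$(( UNMAPPED | MATE_UNMAPPED | SECONDARY | QC_FAIL ))

# Mask for the forward-strand track: also drop reverse-strand reads.
plus_mask=$(( unstranded_mask | REVERSE ))

echo "unstranded mask: $unstranded_mask"
echo "plus-strand mask: $plus_mask"
```

The two printed values correspond to the --samFlagExclude arguments in the non-stranded and forward-strand commands, respectively.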

# reverse strand -- assumes SE data