Given a query sequence Q, this site explains how to generate a nucleosome formation energy profile describing the preference of nucleosomes along each location of Q. In our example Q is the target sequence used in Figure 5 of the manuscript. Assuming that Q has a length of LQ and we are given a structural template T of LT base-pairs (LT = 147 for structural template 1kx5, PDB), we could proceed as follows: (1) Generate a set {S1,...,Sn,...,SN} of N = LQ - LT + 1 local sequences, where Sn is the continuous subsequence of Q from position n to position (n + LT - 1). In order to achieve this we can use the script gen_local_sequence.pl and the one line representation of Q, query_oneline. All scripts can be downloaded below. # procedure: # $ ./gen_local_sequence.pl query_oneline nseq shift # # query_oneline: the query sequence, Q in one line # nseq : number of sequences to be generated # nshift : from this position (e.g. nshift = 0 from begining) Example of use: $ ./gen_local_sequence.pl query_oneline 1 147 creates a file 'SEQ_00148.dat' ------------------------------------------------------------------------------------ SEQ I >ACAGGATGTATATATCTGACACGTGCCTGGAGACTAGGGAGTAAT....GAGCGGCCTCGGCACCGGGATTCTCCAG SEQ J >CTGGAGAATCCCGGTGCCGAGGCCGCTCAATTGGTCGTAGACAGC....AGGCACGTGTCAGATATATACATCCTGT ------------------------------------------------------------------------------------ that contains the local sequence and its complentary strand accommodated by a nucleosome that occupies the 148th position along Q. As indicated by Figure S5 this subsequence is the 601 nucleosome positioning sequence. In order to generate all local sequences, starting from the first position of Q, one needs to run the script: '$ ./gen_local_sequence.pl query_oneline N 0'. In the paper, for Figure 5, we used N = 3979. Download:gen_local_sequence.pl Download:query_oneline (The present Q represents target sequence used for Figure 5) Download:SEQ_00148.dat (The first position along target sequence used for Figure 5) (2) Generate a set {T1,...,Tn,...,TN} of query structures, where Tn is obtained by threading Sn onto our nucleosomal DNA template T. In order to achieve this, we execute a two stage threading procedure. First, we generate a structural template T, in which all native bases of a nucleosomal DNA are replaced by the 3 pseudoatoms representing the plane. This is done by the script: #! /usr/bin/perl -w # # This script replaces bases from nucleotides in a .pdb with three # pseudo atoms named PL1, PL2, PL3 that determine the base plane # information # # usage: # ./replace_base_by_plane.pl A_with_base.pdb > A_with_plane.pdb # # A_with_base.pdb : pdb containing 'full' nucleotides # A_with_plane.pdb : pdb containing 'nucleotide base plane only' # plus first group nitrogen ( GUA N9, ADE N9, # THY N1, CYT N1 ) # Copyright (c) 2007 Peter Minary Example of use: $ ./replace_base_by_plane.pl 1kx5_DNA_with_base.pdb > 1kx5_DNA_with_plane.pdb where 1kx5_DNA_with_base.pdb is the structure of the nucleosomal DNA as it is found in PDB structure 1kx5 from the Protein Data Bank. Download:replace_base_by_plane.pl Download:1kx5_DNA_with_base.pdb (Structure of the nucleosomal DNA) Download:1kx5_DNA_with_plane.pdb (Nucleosomal DNA backbone) Next, we have to build the nucleosome structure accommodating a given sequence. To do this, we use the script build_DNA_base.pl #! /usr/bin/perl -w # #./build_DNA_base.pl nuc_base_r_theta_phi DNA_with_plane.pdb SEQ.dat > DNA_SEQ.pdb # # nuc_base_r_theta_phi (input) : (r, theta, phi) coordinates for # for the base atoms of ade, thy, gua, cyt # DNA_with_plane.pdb (input) : a .pdb file containing backbonde DNA # strands and the plane definition (3 atoms) and # NN (N1 for C, T; N3 for G, A) for each basis # SEQ.dat (input) : Sequence file # SEQ A >ATATATATATAT .... # SEQ B >TATATATATATA .... # DNA_SEQ.pdb (output) : The DNA with the threaded sequence # # Copyright (c) 2007 Peter Minary Example of use: ./build_DNA_base.pl nuc_base_r_theta_phi 1kx5_DNA_with_plane.pdb SEQ_00148.dat > 1kx5_00148.pdb where both SEQ_00148.dat and 1kx5_DNA_with_plane.pdb were generated above. The final output of this script is 1kx5_00148.pdb the structure of the nucleosome (T1) occuping the 148th position of the query sequence. This position corresponds to the 601 nucleosome positioning sequence element. Download:build_DNA_base.pl Download:nuc_base_r_theta_phi Download:1kx5_00148.pdb (T1) In order to generate the set {T1,...,Tn,...,TN}, one has to replicate the procedure we just outlined above for S1 → T1 using a DNA backbone structure (1kx5_DNA_with_plane.pdb). To achieve this task we use the script: #! /usr/bin/perl -w # # ./replicate_design.pl nstructure shift # # inputs # nstructure: # of structures to be generated # shift: from a given structure (e.g. n for T_n) # output # it generates directories each containing a nucleosome # structure accommdating a local sequence. Download:replicate_design.pl Download:tutorial (Tutorial on using this script.) (3) Cytosines may be methylated by conversion to 5-Me-cytosine. This is achieved by using the script: #! /usr/bin/perl -w # # usage: # ./methylate_CYT.pl DNA.pdb > 5MeC-DNA.pdb # # This script methylates all CYT residues in a DNA # # H42 H41 # \ / # N4 # | # H51 C4 # | / \\ # H5- H52-C5M-C5 N3 # | || | # H53 C6 C2 # / \ / \\ # H6 N1 O2 # \ # \ # \ # O1P H5' H4' O4' \ # | | \ / \ \ # -P-O5'-C5'---C4' C1' # | | \ / \ # O2P H5'' C3'--C2' H1' # / \ / \ # O3' H3' O2' H2'' # | | # H2' # # Copyright (c) 2007 Peter Minary Example of use: $ ./methylate_CYT.pl 1kx5_00148.pdb > 5MeC-1kx5_00148.pdb Download:methylate_CYT.pl Download:1kx5_00148.pdb Download:5MeC-1kx5_00148.pdb (4) Optimization of all structures are performed using the MOSAICS package, which can be downloaded upon registration at wwww.cs.ox.ac.uk/mosaics. In addition we provide an online tutorial (follow Tutorial link) for MOSAICS as part of the courses in the Doctoral Training Centres at Oxford. At www.cs.ox.ac.uk/mosaics extensive Documentation and Installation guide is available. For this project we recommend downloading version.3.9.1_bgq that has been the latest version mostly used in Blue Gene Q Jules and iDataplex systems in the Hartree center of the UK. Once installed MOSAICS can be executed using $ ./mosaics.exe some.input > output Once MOSAICS is installed Section (6) includes detailed instructuctions on how to reproduce single sequence nucleosome formation energies that are further processed - as described in (5) - to produce the energies we reported in Figure 5(a) (5) Once the energy is obtained for the structures a plot can be created representing the nucleosome formation energy at each local sequence. Note, that one needs to perform the same procedure outlined above for both linear for nucleosomal DNA forms defined in the manuscript. Here we provide all the energies we obtained and demonstrate how we obtain the plot as it is presented in Figure 5. To reproduce our plot, please download and untar the tutorial below and follow the steps as outlined in (c). (a) SEQENCES in directory SEQS_Figure5 This directory contains all the sequences used for Figure 5 in the format (*.dat, see above) that can be directly used in designing nucleosomal DNAs by the scripts introduced in (2). ================================================================================= Target Sequence (see. Figure S5): RRR 601 RRR 603 RRR 605 .... RRR TGA RRR Target sequence length: Total length of 13 positioning sequences (2067) + Total length of 14 random sequences (14*147) = 4125 For a nucleosome for 147 DNA base pairs there are 4125 - 147 + 1 = 3979 locations. ================================================================================= (b) RAW ENERGIES obtained from minimization E_nucpos: energies for 3979 sequences accommodated by the nucleosome E_linpos: energies for 3979 sequences accommodated by the linear DNA (c) PROCEDURE to obtain nucleosome formation energy (as in Figure 5) please follow the steps and ($ ..procedure to perform) (i) Get the nucleosome formation energy from the raw data E_(nucleosome_formation) = E_nucfor = E_(nucleosome) - E_(linear) $ scripts/substract.pl E_nucpos E_linpos > E_nucfor (ii) Substract out the average to obtain relative energies E_nucfor_(substract-average) = E_nucfor_sa $ scripts/substract_average_1d.pl E_nucfor > E_nucfor_sa (iii) Use +/- 73 (73 nucleotides upstream/downstream) to get the energy of one position. This is done as in (i) the E_nucfor is obtained as a difference of two STRONG and highly CORRELATED NOISY signals (see E_nucpos/E_linpos) The second smoothing leaves the results identical but only removes the minor noise (see E_nucfor_sa_sm147 and E_nucfor_sa_sm147_sm73). E_nucfor_sa_sm in Matlab using E_nucfor_sa In Matlab using fastsmooth.m >> load E_nucfor_sa >> E_nucfor_sa_sm147 = fastsmooth(E_nucfor_sa,147,1,1); >> E_nucfor_sa_sm147_sm73 = fastsmooth(E_nucfor_sa_sm147,73,1,1); >> E_nucfor_sa_sm = E_nucfor_sa_sm147_sm73; >> save -ascii E_nucfor_sa_sm E_nucfor_sa_sm (iv) Finally we remove the first and last 73 positions as the accuracy of these positions are not comparable to the positions at which 73 upstream and downstream positions are available. Thus, we obtain the plot published in the manuscript. E_nucfor_in_figure5 $ scripts/chop_ht73_renumber.pl E_nucfor_sa_sm147_sm73 > E_nucfor_in_Figure5 Download:tutorial_reproduce.tar.gz (6) Reproducability of the calculations (a) Signal processed energies as presented in Figure 5a Here we reproduce the first part of Figure 5a on a commonly available Desktop Computer. As it has been discussed, explained and demonstrated in (5) each energy point reported on Figure 5a contains information from +/-73 single position nucleosome energies we obtain both upstream and downstream. Here, we present the signal processed energies for the five test sequences that are categorised as extrema (minima or maxima on Figure 5a). Here, we report the energy values on IBM iDataPlex and a Desktop PC (see below for architecture). Energies presented on Figure 5a: Platforms[@] IBM iDataPlex (*) Desktop PC E type En - El - <.> En - El - <.> (averaged +/-72) (averaged +/-72) Index (v)138 -16.7 -12.8 283 75.2 78.1 (v)435 -58.3 -57.5 664 52.8 55.5 (v)863 -73.0 -73.6 [@] IBM iDataPlex (2.6 GHz Intel(R) Xeon(R), OS: Linux, RedHat Release 6.4) Desktop PC (3.4 GHz Intel(R) Core(TM), OS: Linux, Fedora Release 17) (v) These are minima points on the signal processed and final energy landscape we present in Figure 5a. (*) Full MOSAICS software raw output for intermediate energy terms, and verified output logs for all 1-1000 sequences are provided here: Download:output_release.tar.gz (b) Single sequence energies on Linux and OSX platforms We also present some single point energies we obtain on commonly available platforms. Platforms[@] Desktop PC Mac-OSX E type En - El En - El (averaged +/-0) (averaged +/-0) Index 138 -862.0 -840.0 283 -321.7 -321.2 435 -550.9 -556.2 664 -363.2 -373.5 863 -317.7 -326.3 [@] Desktop PC (3.4 GHz Intel(R) Core(TM), OS: Linux, Fedora Release 17) MacBook Pro (2.7 GHz Intel Core i7, OS: Mac OS X, Version 10.6.8) Wall Times (Seconds) for single sequence calculations: System CG SA-MC Total Desktop PC ~310 - 520 ~230 ~540 - 750 MacBook Pro ~320 - 540 ~240 ~560 - 780 All these calculations can be reproduced, see below (c)-. (c) To set up a sequence directory for testing, please follow these steps after you downloaded the tar file available at the bottom of this page: $ tar -xzvf run_test_sequences.tar.gz $ cd run_test_sequences $ ./setup.pl linear start_sequence end_sequence $ ./setup.pl linear 664 664 # will create directory linear_00664 # linear DNA with the 664 sequence $ ./setup.pl 1kx5 664 664 # will create directory 1kx5_00664 # nucleosomal DNA with the 664 sequence In order to set up diretories for any of the test sequences, we need to run $ ./setup.pl 1kx5 XXX XXX # XXX = 138, 283, 435, 664, 863 $ ./setup.pl linear XXX XXX # XXX = 138, 283, 435, 664, 863 All test sequences can be set up with a single command: $ ./setup_all_test.sh (d) Once the reqired setups are generated we move all generated setup directories to the example directory and we run calculations there $ mv linear_* examples $ mv 1kx5_* examples $ cd examples (e) Create a soft link to your MOSAICS executable in this directory $ ln -s /path/to/your/mosaics.x mosaics.x (f) Run the scripts First, we run conjugate gradient optimization on all structures $ ./run_cg_1kx5_XXX.sh # on nucleosomal structures $ ./run_cg_linear_XXX.sh # on linear structures where XXX = 'linux' or 'osx' based on the operating system used. Each of these commands will perform 5 independent conjugate gradient optimizations with a Wall Time of ~5 x (310 - 520) = ~1550 - 2600 seconds = ~26 - 43 mins on the Desktop PC defined above. Next, using this output of cg min we setup samc optimization to move the output structure to input and save cg results $ scripts/setup_samc.pl 1kx5 # on nucleosomal structures $ scripts/setup_samc.pl linear # on linear structures Finally, we have to run the samc optimization $ ./run_samc_1kx5.sh # nucleosomal structures $ ./run_samc_linear.sh # linear structures Each of these commands will perform 5 independent simulated annealing optimizations with a Wall Time of ~5 x 240 = 1200 seconds = ~20 mins on the Mac Book Pro defined above. All outputs, including the evolution of optimized energies can be found in out_cg, out_samc files. To get final energies for 1kx5 and linear template forms please type: $ scripts/get_Eopt.pl 1kx5 > En $ scripts/get_Eopt.pl linear > El To obtain the nucleosome formation energy at test sequences please type: $ scripts/get_En-El.pl En El > En-El Download:run_test_sequences.tar.gz
Primary Chromatin Structure
In the cell DNA is packaged into a condensed form called chromatin within the nucleus. To facilitate the first level of packing nuclear DNA is wrapped around nuclear proteins, therefore under the microscope DNA appears to be like beads on a string, where the beads are called nucleosomes. The organization of nucleosomes along the string is called the primary chromatin structure, whose prediction is one of the grand challenges of computational biology. In a recent study, PNAS 111(17): 6293-6298 (2014), we describe an application in which MOSAICS simulations are performed to characterize (by means of energy) a particular nucleosomal location along a DNA sequence of interest. This is done through the calculation of nucleosome formation energy. For each possible nucleosomal position along a given sequence, the tutorial below explains how we generated the nucleocome formation energies resulting in a nucleosomal formation energy profile.