• Home
  • What is MOSAICS?
  • Download
  • Examples
  • Documentation
    • Using MOSAICS
    • Using PyMOSAICS
    • FAQ
  • Tutorials
    • Using MOSAICS
    • Using PyMOSAICS
    • FAQ
  • Publications
  • Contact

Computational Biology
Department of Computer Science

Web Design and GUI
by Konrad Krawczyk
Department of Statistics
University of Oxford

Primary Chromatin Structure

In the cell DNA is packaged into a condensed form called chromatin within the nucleus. To facilitate the first level of packing nuclear DNA is wrapped around nuclear proteins, therefore under the microscope DNA appears to be like beads on a string, where the beads are called nucleosomes. The organization of nucleosomes along the string is called the primary chromatin structure, whose prediction is one of the grand challenges of computational biology. In a recent study, PNAS 111(17): 6293-6298 (2014), we describe an application in which MOSAICS simulations are performed to characterize (by means of energy) a particular nucleosomal location along a DNA sequence of interest. This is done through the calculation of nucleosome formation energy. For each possible nucleosomal position along a given sequence, the tutorial below explains how we generated the nucleocome formation energies resulting in a nucleosomal formation energy profile.

    Given a query sequence Q, this site explains how to generate a nucleosome formation
    energy profile describing the preference of nucleosomes along each location of Q. In 
    our example Q is the target sequence used in Figure 5 of the manuscript. 
    Assuming that Q has a length of LQ and we are given a structural template T of LT 
    base-pairs (LT = 147 for structural template 1kx5, PDB), we could proceed as follows:

    (1) Generate a set {S1,...,Sn,...,SN} of N = LQ - LT + 1 local sequences, where Sn is the 
        continuous subsequence of Q from position n to position (n + LT - 1).

        In order to achieve this we can use the script gen_local_sequence.pl and the one line 
        representation of Q, query_oneline. All scripts can be downloaded below. 

        # procedure:
        #  $ ./gen_local_sequence.pl query_oneline nseq shift
        #
        #  query_oneline: the query sequence, Q in one line
        #  nseq         : number of sequences to be generated
        #  nshift       : from this position (e.g. nshift = 0 from begining) 
 
        Example of use:

        $ ./gen_local_sequence.pl query_oneline 1 147 

        creates a file 'SEQ_00148.dat'
        ------------------------------------------------------------------------------------    
        SEQ I >ACAGGATGTATATATCTGACACGTGCCTGGAGACTAGGGAGTAAT....GAGCGGCCTCGGCACCGGGATTCTCCAG
        SEQ J >CTGGAGAATCCCGGTGCCGAGGCCGCTCAATTGGTCGTAGACAGC....AGGCACGTGTCAGATATATACATCCTGT
        ------------------------------------------------------------------------------------    

        that contains the local sequence and its complentary strand accommodated by a 
        nucleosome that occupies the 148th position along Q. As indicated by Figure S5 this
        subsequence is the 601 nucleosome positioning sequence. In order to generate all
        local sequences, starting from the first position of Q, one needs to run the 
        script: '$ ./gen_local_sequence.pl query_oneline N 0'. In the paper, for Figure 5, 
        we used N = 3979. 

         
        Download:gen_local_sequence.pl 
        Download:query_oneline (The present Q represents target sequence used for Figure 5)     
        Download:SEQ_00148.dat (The first position along target sequence used for Figure 5)     
             
 
    (2) Generate a set {T1,...,Tn,...,TN} of query structures, where Tn is obtained by threading 
        Sn onto our nucleosomal DNA template T.

        In order to achieve this, we execute a two stage threading procedure. First, we
        generate a structural template T, in which all native bases of a nucleosomal DNA
        are replaced by the 3 pseudoatoms representing the plane. This is done by the 
        script:

        #! /usr/bin/perl -w
        #
        # This script replaces bases from nucleotides in a .pdb with three
        # pseudo atoms named PL1, PL2, PL3 that determine the base plane 
        # information
        #
        # usage:
        #  ./replace_base_by_plane.pl A_with_base.pdb > A_with_plane.pdb
        #
        # A_with_base.pdb  : pdb containing 'full' nucleotides
        # A_with_plane.pdb : pdb containing 'nucleotide base plane only'
        #                    plus first group nitrogen ( GUA N9, ADE N9, 
        #                                                THY N1, CYT N1 )
        # Copyright (c) 2007 Peter Minary
  
        Example of use:

        $ ./replace_base_by_plane.pl 1kx5_DNA_with_base.pdb > 1kx5_DNA_with_plane.pdb

        where 1kx5_DNA_with_base.pdb is the structure of the nucleosomal DNA as it is 
        found in PDB structure 1kx5 from the Protein Data Bank.         

         
        Download:replace_base_by_plane.pl 
        Download:1kx5_DNA_with_base.pdb (Structure of the nucleosomal DNA)     
        Download:1kx5_DNA_with_plane.pdb (Nucleosomal DNA backbone)     
             

        Next, we have to build the nucleosome structure accommodating a given sequence. 
        To do this, we use the script build_DNA_base.pl

        #! /usr/bin/perl -w
        #
        #./build_DNA_base.pl nuc_base_r_theta_phi DNA_with_plane.pdb SEQ.dat > DNA_SEQ.pdb
        # 
        #  nuc_base_r_theta_phi (input) : (r, theta, phi) coordinates for
        #                                  for the base atoms of ade, thy, gua, cyt
        #  DNA_with_plane.pdb (input)   : a .pdb file containing backbonde DNA 
        #                                 strands and the plane definition (3 atoms) and
        #                                 NN (N1 for C, T; N3 for G, A) for each basis 
        #  SEQ.dat (input) : Sequence file
        #                             SEQ A >ATATATATATAT ....
        #                             SEQ B >TATATATATATA ....
        #  DNA_SEQ.pdb (output)         : The DNA with the threaded sequence
        #
        #  Copyright (c) 2007 Peter Minary

        Example of use:

        ./build_DNA_base.pl nuc_base_r_theta_phi 1kx5_DNA_with_plane.pdb SEQ_00148.dat > 1kx5_00148.pdb

        where both SEQ_00148.dat and 1kx5_DNA_with_plane.pdb were generated above. The final output
        of this script is 1kx5_00148.pdb the structure of the nucleosome (T1) occuping the 
        148th position of the query sequence. This position corresponds to the 601 nucleosome 
        positioning sequence element.

         
        Download:build_DNA_base.pl 
        Download:nuc_base_r_theta_phi      
        Download:1kx5_00148.pdb (T1)     
              

        In order to generate the set {T1,...,Tn,...,TN}, one has to replicate the 
        procedure we just outlined above for S1 → T1 using a DNA backbone structure 
        (1kx5_DNA_with_plane.pdb).

        To achieve this task we use the script:
      
        #! /usr/bin/perl -w
        #
        # ./replicate_design.pl nstructure shift
        #
        #  inputs
        #  nstructure: # of structures to be generated
        #       shift: from a given structure (e.g. n for T_n)     
        #  output
        #  it generates directories each containing a nucleosome
        #  structure accommdating a local sequence.

         
        Download:replicate_design.pl 
        Download:tutorial (Tutorial on using this script.)     
              
 
    (3) Cytosines may be methylated by conversion to 5-Me-cytosine. 

        This is achieved by using the script:

        #! /usr/bin/perl -w 
        #
        # usage:
        #  ./methylate_CYT.pl DNA.pdb > 5MeC-DNA.pdb
        #
        # This script methylates all CYT residues in a DNA 
        #
        #                      H42  H41
        #                        \  /
        #                         N4
        #                         |
        #                  H51    C4
        #                   |    / \\
        #       H5-     H52-C5M-C5   N3
        #                   |   ||   |
        #                  H53  C6   C2
        #                      / \  / \\
        #                     H6  N1   O2
        #                           \
        #                            \
        #                             \
        #         O1P    H5' H4'  O4'  \
        #          |      |    \ /   \  \
        #         -P-O5'-C5'---C4'    C1'
        #          |      |     \     / \
        #         O2P    H5''   C3'--C2' H1'
        #                      / \   / \
        #                   O3' H3' O2' H2''
        #                    |       |
        #                           H2'
        #
        # Copyright (c) 2007 Peter Minary

        Example of use:

        $ ./methylate_CYT.pl 1kx5_00148.pdb > 5MeC-1kx5_00148.pdb

         
        Download:methylate_CYT.pl 
        Download:1kx5_00148.pdb     
        Download:5MeC-1kx5_00148.pdb     
              

    (4) Optimization of all structures are performed using the MOSAICS package, which can be 
        downloaded upon registration at wwww.cs.ox.ac.uk/mosaics. In addition we provide an 
        online tutorial (follow Tutorial link) for MOSAICS as part of the courses in the 
        Doctoral Training Centres at Oxford. At www.cs.ox.ac.uk/mosaics extensive Documentation 
        and Installation guide is available. For this project we recommend downloading version.3.9.1_bgq
        that has been the latest version mostly used in Blue Gene Q Jules and iDataplex systems in the 
        Hartree center of the UK.  Once installed MOSAICS can be executed using 

        $ ./mosaics.exe some.input > output

        Once MOSAICS is installed Section (6) includes detailed instructuctions on how to reproduce
        single sequence nucleosome formation energies that are further processed - as described in 
        (5) - to produce the energies we reported in Figure 5(a)


    (5) Once the energy is obtained for the structures a plot can be created representing the 
        nucleosome formation energy at each local sequence.  Note, that one needs to perform
        the same procedure outlined above for both linear for nucleosomal DNA forms defined 
        in the manuscript. Here we provide all the energies we obtained and demonstrate how 
        we obtain the plot as it is presented in Figure 5. To reproduce our plot, please 
        download and untar the tutorial below and follow the steps as outlined in (c). 

        (a) SEQENCES in directory SEQS_Figure5
            This directory contains all the sequences used for Figure 5 in the format (*.dat, 
            see above) that can be directly used in designing nucleosomal DNAs by the scripts
            introduced in (2).

            =================================================================================
            Target Sequence (see. Figure S5): RRR 601 RRR 603 RRR 605 .... RRR TGA RRR

            Target sequence length:
                                   Total length of 13 positioning sequences (2067)
                                 + Total length of 14 random sequences (14*147)
                                 = 4125
                                 For a nucleosome for 147 DNA base pairs there are
                                 4125 - 147 + 1 = 3979 locations.
            =================================================================================

        (b) RAW ENERGIES obtained from minimization

            E_nucpos: energies for 3979 sequences accommodated by the nucleosome
            E_linpos: energies for 3979 sequences accommodated by the linear DNA

        (c) PROCEDURE to obtain nucleosome formation energy (as in Figure 5)
                      please follow the steps and ($ ..procedure to perform)

            (i)   Get the nucleosome formation energy from the raw data

                  E_(nucleosome_formation) = E_nucfor = E_(nucleosome) - E_(linear)

                  $ scripts/substract.pl E_nucpos E_linpos > E_nucfor

            (ii)  Substract out the average to obtain relative energies

                  E_nucfor_(substract-average) = E_nucfor_sa

                  $ scripts/substract_average_1d.pl E_nucfor > E_nucfor_sa

            (iii) Use +/- 73 (73 nucleotides upstream/downstream) to get
                  the energy of one position. This is done as in (i) the 
                  E_nucfor is obtained as a difference of two STRONG and 
                  highly CORRELATED NOISY signals (see E_nucpos/E_linpos)
                  The second smoothing leaves the results identical but 
                  only removes the minor noise (see E_nucfor_sa_sm147 and 
                  E_nucfor_sa_sm147_sm73). 
                    

                  E_nucfor_sa_sm in Matlab using E_nucfor_sa

                  In Matlab using fastsmooth.m

                  >> load E_nucfor_sa
                  >> E_nucfor_sa_sm147 = fastsmooth(E_nucfor_sa,147,1,1);
                  >> E_nucfor_sa_sm147_sm73 = fastsmooth(E_nucfor_sa_sm147,73,1,1);
                  >> E_nucfor_sa_sm = E_nucfor_sa_sm147_sm73;
                  >> save -ascii E_nucfor_sa_sm E_nucfor_sa_sm

            (iv)  Finally we remove the first and last 73 positions as the 
                  accuracy of these positions are not comparable to the positions
                  at which 73 upstream and downstream positions are available. 
                  Thus, we obtain the plot published in the manuscript.

                  E_nucfor_in_figure5

                  $ scripts/chop_ht73_renumber.pl E_nucfor_sa_sm147_sm73 > E_nucfor_in_Figure5

 
         
        Download:tutorial_reproduce.tar.gz 
        


    (6) Reproducability of the calculations

        (a) Signal processed energies as presented in Figure 5a

            Here we reproduce the first part of Figure 5a on a commonly available
            Desktop Computer. As it has been discussed, explained and demonstrated
            in (5) each energy point reported on Figure 5a contains information
            from +/-73 single position nucleosome energies we obtain both upstream
            and downstream. Here, we present the signal processed energies for
            the five test sequences that are categorised as extrema (minima or
            maxima on Figure 5a). Here, we report the energy values on IBM
            iDataPlex and a Desktop PC (see below for architecture). Energies
            presented on Figure 5a:

                     Platforms[@]    IBM iDataPlex     (*) Desktop PC                 
                       E type        En - El - <.>        En - El - <.>
                                    (averaged +/-72)     (averaged +/-72)
                       Index
                      (v)138             -16.7              -12.8
                         283              75.2               78.1
                      (v)435             -58.3              -57.5
                         664              52.8               55.5
                      (v)863             -73.0              -73.6

        [@] IBM iDataPlex (2.6 GHz Intel(R)  Xeon(R), OS: Linux, RedHat Release 6.4)
            Desktop PC    (3.4 GHz Intel(R) Core(TM), OS: Linux, Fedora Release  17)

        (v) These are minima points on the signal processed and final energy
            landscape we present in Figure 5a.

        (*) Full MOSAICS software raw output for intermediate energy terms, and
            verified output logs for all 1-1000 sequences are provided here:
         
        Download:output_release.tar.gz 
        

        (b) Single sequence energies on Linux and OSX platforms

            We also present some single point energies we obtain on commonly
            available platforms.

                     Platforms[@]     Desktop PC          Mac-OSX
                       E type          En - El            En - El
                                  (averaged +/-0)     (averaged +/-0)
                       Index
                         138            -862.0            -840.0
                         283            -321.7            -321.2
                         435            -550.9            -556.2
                         664            -363.2            -373.5
                         863            -317.7            -326.3

        [@] Desktop PC  (3.4 GHz Intel(R) Core(TM), OS: Linux, Fedora Release 17)
            MacBook Pro (2.7 GHz Intel Core i7,     OS: Mac OS X, Version 10.6.8)

                 Wall Times (Seconds) for single sequence calculations:
  
                   System            CG           SA-MC        Total
                Desktop PC       ~310 - 520       ~230      ~540 - 750
                MacBook Pro      ~320 - 540       ~240      ~560 - 780

            All these calculations can be reproduced, see below (c)-.


        (c) To set up a sequence directory for testing, please follow these
            steps after you downloaded the tar file available at the bottom
            of this page:

            $ tar -xzvf run_test_sequences.tar.gz

            $ cd run_test_sequences

            $ ./setup.pl linear start_sequence end_sequence

            $ ./setup.pl linear 664 664     # will create directory linear_00664
                                            # linear DNA with the 664 sequence

            $ ./setup.pl 1kx5 664 664       # will create directory 1kx5_00664
                                            # nucleosomal DNA with the 664 sequence

            In order to set up diretories for any of the test sequences, we
            need to run

            $ ./setup.pl 1kx5 XXX XXX     # XXX = 138, 283, 435, 664, 863
            $ ./setup.pl linear XXX XXX   # XXX = 138, 283, 435, 664, 863

            All test sequences can be set up with a single command:

            $ ./setup_all_test.sh

        (d) Once the reqired setups are generated we move all generated setup
            directories to the example directory and we run calculations there

            $ mv linear_* examples
            $ mv 1kx5_* examples
            $ cd examples

        (e) Create a soft link to your MOSAICS executable in this directory

            $ ln -s /path/to/your/mosaics.x mosaics.x

        (f) Run the scripts

            First, we run conjugate gradient optimization on all structures

            $ ./run_cg_1kx5_XXX.sh           # on nucleosomal structures
            $ ./run_cg_linear_XXX.sh         # on linear structures

            where XXX = 'linux' or 'osx' based on the operating system used.

            Each of these commands will perform 5 independent conjugate
            gradient optimizations with a Wall Time of ~5 x (310 - 520)
            = ~1550 - 2600 seconds = ~26 - 43 mins on the Desktop PC 
            defined above.

            Next, using this output of cg min we setup samc optimization to
            move the output structure to input and save cg results

            $ scripts/setup_samc.pl 1kx5    # on nucleosomal structures
            $ scripts/setup_samc.pl linear  # on linear structures

            Finally, we have to run the samc optimization

            $ ./run_samc_1kx5.sh            # nucleosomal structures
            $ ./run_samc_linear.sh          # linear structures

            Each of these commands will perform 5 independent simulated
            annealing optimizations with a Wall Time of ~5 x 240 = 1200
            seconds = ~20 mins on the Mac Book Pro defined above.

            All outputs, including the evolution of optimized energies
            can be found in out_cg, out_samc files.

            To get final energies for 1kx5 and linear template forms
            please type:

            $ scripts/get_Eopt.pl 1kx5 > En
            $ scripts/get_Eopt.pl linear > El

            To obtain the nucleosome formation energy at test sequences
            please type:

            $ scripts/get_En-El.pl En El > En-El

         
        Download:run_test_sequences.tar.gz