Data analysis

Here you can find tips on management, processing and analysis of data.

Contents

1       Creating reports
1.1        Using LaTeX
1.1.1         Software setup: MikTex, ghostscript, GSview, TexLipse and TexStudio,
1.1.2         The preamble
1.1.3         Basics
1.1.4         Creating tables
1.1.5         Inserting graphics
1.1.6         Creating custom BibTeX citation styles (style sheets)
2       Image processing, image analysis and data mining in Bio7 using Python and R
2.1        Image processing in Bio7 using the ImageJ perspective
2.2        Python scrips for image processing in the ImageJ perspective of Bio7
2.3        Transferring images into data using Bio7
2.4        Manipulation and analysis of data using the R perspective of Bio7
2.5        Alternatives to Image J: Darktable, Krita, Gimp 3.2 or Endrov
2.6        Deducer: a graphical user interface for the programming language R
3       Databases
3.1        MySQL and DBeaver plugin for Bio7 / Eclipse
3.1.1         Setting up MySQL and DBeaver and creating a database
3.1.2         Create tables, columns and add data
3.1.3         Data retrieval from a database: queries using SQL
3.2        Processing of data from a MySQL database with the MySQL ODBC Connector using R and the packages odbc, dbplyr and dplyr
3.3        Processing of data from a MySQL database using Python with PyMySQL or SQLAlchemy
3.4        Further Reading
4       Statistics
4.1        Common misconceptions
4.2        Power analysis
4.3        Further reading


1       Creating reports

In order to get the most out of collected data, a good presentation of the information is key. Several approaches for report generation are available. If the information cannot be summarized in a single diagram with a legend and a short description, it is worth to wisely choose the software used for layout and type setting. One common approach is to use LaTeX.

1.1       Using LaTeX

TeX is a typesetting system. Since version TeX82, it contains a turing complete programming language (primitive TeX commands), as well as “Plain TeX macros” for typesetting. Since version TeX 3, no new features are added. LaTeX is a shortening for Lamport TeX. It provides a program to process LaTeX documents (a compiler) and a macro package that implements most plain TeX macros and adds many more TeX macros for easy typesetting. The currently used version is LaTeX2e (LaTeX2ε). It is open source software that is licensed under the terms of the LaTeX Project Public License. Distributions such as TeX Live or MiKTeX provide an installer that installs TeX (tex.exe), LaTeX2e (latex.exe), ConTeXt (context.exe), LuaTeX (luatex.exe) and tools for maintainance and update management like the MiKTeX console (miktex-console.exe and miktex-console_admin.exe).    

1.1.1     Software setup: MikTex, ghostscript, GSview, TexLipse and TexStudio,

The following steps need to be followed:
1.      Make sure a PDF viewer is installed. The freeware PDF-XChange Editor by Tracker Software is recommended.
2.      Install ghostscript
3.      Install GSView version 6 or newer by Artifex Inc.
4.      Install MiKTeX
5.      Install TexStudio.
6.      If something does not work, start TextStudio and left-click Options -> Configure -> TexStudio… -> Commands. In the text field for the command Latex, specify the path to the executable file latex.exe within the MiKTeX installation directory (e.g. …MiKTeX 2.9/miktex/bin/x64/latex.exe). Proceed with setting up TexStudio as described in the manual. You may want to add MiKTeX 2.9/miktex/bin/x64/latex.exe as well as the executables for ghostscript and GSView to the PATH environment variable.
7.      [Left-click Configuration. Activate „Advanced options“ (option 3) at bottom left
8.      Leftclick Configuration -> General. Uncheck (deactivate) „Logview with Tabs“
9.      Leftclick fast creation/fast compile. select Option 3 (pdflatex + Pdf Betrachter) [or option 2 latex + dvi viewer].
10.   Leftclick Configuration -> Editor: If your current project uses \usepackage[latin9]{inputenc}, select font encoding ISO-8859-15. ?deactivate Auto detect? Font: Consolas, Font size: 11
11.   Preview Command: dvipng – follow (parallel).  Test with option „show inline“. Then set to „only show in preview panel“. Automatic preview: Select “Previously as preview translated text”. Delay: 100]
12.   Install Excel2LaTeX
13.   Install Zotero
14.   Install and set up “Better Bib(La)TeX for Zotero
15.   Install JabRef
16.   Install BibTex4Word by downloading the software and copying the file bibtex2word.dot from the downloaded folder into the directory C:\Users\<Username>\AppData\Roaming\Microsoft\Word\STARTUP.
17.   As an alternative to TexStudio, install the Eclipse plugin TeXlipse.

1.1.2     The preamble

Aaa

1.1.3     Basics

Aaa

1.1.4     Creating tables

aaa

1.1.5     Inserting graphics

For plotting, use R with the packages ggplot2 and tikzDevice or gnuplot.

1.1.6     Creating custom BibTeX citation styles (style sheets)

In order to create a custom BibTeX bibliography style, follow the following steps:
1.      Open C:\Programme\MiKTeX 2.6\tex\latex\custom-bib
2.      Right-click makebst.text. In the context menu left-click open with… -> browse -> select C:\Programme\MiKTeX 2.6\miktex\bin\tex.exe
3.      Enter the input file:
merlin.msd
4.      Enter output file:
<stylename>.bst
5.      Answer last question of the dialog (batch process) with yes by entering y and pressing the enter-key.
6.      Search for the created file name (in the directory Windows/system32) and move the .dbj and .bst files to a convenient directory.
7.      Additionally, copy the .bst file to C:\Programme\MikTex 2.6\bibtex\bst\base
8.      Left-click Start -> Programs -> MiKTeX -> Maintainance(Admin) -> Settings (Admin) -> Refresh FNDB
9.      Open your LaTeX editor (TeXstudio). You can now set the bibtex bibliography style for your LaTeX .tex files with the command \bibliographystyle{<stylename>} in the preamble code of your LaTeX .tex files.

2       Image processing, image analysis and data mining in Bio7 using Python and R

Bio7 is an open source program that is built on top of the Eclipse rich client platform. Bio7 version 2.8 has a graphical user interface (R perspective) for the editing of code written in the programming language R and the EclipseImageJ1Plugin (ImageJ2 version 1.52) tightly integrated and simplifies the extraction of data from images, the data manipulation in a spreadsheet view using the programming language R and the conversion from data to image (plotting using R and imageJ) easy. It also has the plug-in GEF integrated which provides a flow chart editor that is used for the creation of documented and reproducible work flows and image/data processing pipelines. After installation of Bio7 and Python3 to the system partition (with the option “Add Python 3.x to PATH” during the installation selected), on computers with Microsoft Windows as the operating system, the environment variables might have to be edited to add the folder containing the python.exe to the system Path. Additional packages/libraries for Python provide easy access to important functions. Most python packages can be downloaded and installed via the command-line tool PIP that comes with the python3.x distribution for Windows from www.python.org. Ensure that PIP is properly installed. Edit the environment variables to add the folder …/Python3x/Scripts that contains the executable file pip.exe to the system Path. In order to download and install a python package that is available as source archives or as a python wheel in the python package index, start the command prompt (cmd.exe) and type the following into the command prompt and confirm by pressing the <enter> key
pip install <package>
or alternatively:
py -m pip install <package>
or:
python -m pip install <package>
Here, <package> is a placeholder that has to be replaced with the name (without the <>) of the package on the python package index that you want to install. For example, to install the package NumPy:
pip install numpy

The following packages are recommended: NumPy, DatabaseInterfaces, mysqlclient, Cython, Py2exe, Matplotlib, PyX, RPy, sh, Pillow, Pandas, pyserial, setuptools

Specify the path to to the executable file py.exe and python3X.exe. Then open Bio7 and in the main menu of Bio7 click Help ->install new software or Help -> Eclipse marketplace. Install the Eclipse plug-in PyDev. Then, in the main menu of Bio7 click Preferences -> preferences -> PyDev -> Interpreters -> Python Interpreter -> New. Enter the name to be displayed (e.g. “Python36”) and specify the path to the python36.exe. The Python interpreter menu has a tab “Packages”, where one can click on “Install/uninstall with PIP” in order to install new packages from within Bio7.

2.1       Image processing in Bio7 using the ImageJ perspective

In order to automate image processing, it is advisable to write imageJ macros or Scripts. Possibly the easiest approach is to write macros using the ImageJ macro language., create a New project (New -> Project) and give it some meaningful name. Add a new imageJ macro file to that project  (New -> ImageJ -> ImageJ macro file (.IJM)) and call it appropriately. The final macro file has to be copied into the directory …\Bio7\plugins\com.eco.bio7.image_x.xx.xxx\macros. Finally, inside Bio7, open the ImageJ Edit perspective and click on the toolbar Plugins -> Macros -> Install and select the needed macro file. One macro file can contain several macros. One can also edit the file …\Bio7\plugins\com.eco.bio7.image_x.xx.xxx\macros\StartupMacros.txt and add custom macros to that macro file. After installing that file, those macros will be directly available after starting Bio7 from the ImageJ Edit perspective toolbar Plugins -> Macros without previously reinstalling the file using Plugins -> Macros -> Install in the ImageJ Edit perspective. I added the following code to the end of the file …\Bio7\plugins\com.eco.bio7.image_x.xx.xxx\macros\StartupMacros.txt:

// @title: ImageProcessingByDerAndere
// @author: DerAndere
// @created: 2018
// Copyright 2018 DerAndere
// @license: MIT
// @version: 0.0.1
// @language: ImageJ macro language
// @info: https://it-by-derandere.blogspot.com
// @description: ImageJ macros for image processing.

macro "Batch-AdustBCsetMaxValue" {

// This macro processes all the images in a folder and any subfolders.

  extension = ".tif";
  extension2 = ".czi";
  inputDir = getDirectory("Choose Source Directory ");
  channel = getNumber("which channel should be adjusted?", 0);
  maxVal = getNumber("maxVal for the chosen channel?", 65535);

  //dir2 = getDirectory("Choose Destination Directory ");
  outputDir = inputDir + "BCSetMaxValueAdjusted\\" + "C" + channel +"\\";
  if(!File.exists(outputDir)) {
    File.makeDirectory(outputDir);
  }
  setBatchMode(true);
  n = 0;
  processFolder(inputDir);

  function processFolder(inputDir) {
     list = getFileList(inputDir);
     for (i=0; i<list.length; i++) {
          if (endsWith(list[i], "/")) {
              processFolder(inputDir+list[i]);
          }
          else if (endsWith(list[i], extension) || endsWith(list[i], extension2)) {
             processImage(inputDir, list[i]);
          }
      }
  }

  function processImage(inputDir, name) {
       img = inputDir + name;
  // only proceeds if file is there
    if(File.exists(img)){
      // opens the images and gives them the temporary name "imgTemp" 
      open(img);
      ID = getImageID(img);
      selectImage(ID);
      dotIndex = indexOf(name, ".");
      basename = substring(name, 0, dotIndex);
      rename("imgTemp");
      // the next lines do your procedure
      Stack.setChannel(channel);
      stackSize = nSlices(imgTemp); // or nSlices(); or nSlices; ?
      for(z=1; z<=stackSize; ++z){
        setSlice(z) //?
        run("setMinAndMax", "min=0 max=maxVal");
        run("Apply LUT");
        wait(500);
        selectImage(ID);
        wait(500);
      }
    titleMod = "C" + channel + "adj";
    titleTitle = basename + titleMod + ".tif";
    wait(500);
    // add code here to analyze or process the image
    saveAs("tiff", outputDir+titleTitle);
    wait(500);
    close();
    }
  }
}

macro "SaveCurrentSliceAsTIF" {
id = getImageID();
title = getTitle(); // or getTitle;?
dotIndex = indexOf(title, ".");
basename = substring(title, 0, dotIndex);
path = getDirectory("image");
outputDir = inputDir + "Slices\\";
if(!File.exists(outputDir)) {
  File.makeDirectory(outputDir);
}
currentSliceNumber = getSliceNumber();
tileTitle = basename + "_S" + currentSliceNumber + "_mod1.tif";
// using the ampersand allows spaces in the tileTitle to be handled correctly
run("Duplicate...", "title=&tileTitle");
selectWindow(tileTitle);
saveAs("tiff",outpurDir+tileTitle);
close();
selectImage(id);
}
// End of ImageProcessingByDerAndere

2.2       Python scrips for image processing in the ImageJ perspective of Bio7 

aaa

2.3       Transferring images into data using Bio7

aaa

2.4       Manipulation and analysis of data using the R perspective of Bio7

In order to use R from within Bio7, left-click R -> Preferences -> Preferences R. Select the correct Server. For Germany, try Germany (Goettingen). Left-click R-shell, enter update.packages() to update Rserve. If it is not installed, enter install.packages(“Rserve”). Execute by pressing enter. After the successful installation is confirmed in the R-shell, reboot your computer and restart Bio7 and left-click R-shell. Then, left-click the tool “Start Rserve” from the Bio7 toolbar at the top (or: R -> Start Rserve). Open the R perspective and Left-click “Packages” -> Install packages and search for packages you need. I recommend the packages dplyr and its dependencies that are installed automatically with it (dbplyr, RMySQL, DBI, lubrify, knitr, rmarkdown, ggplot2, RPostgreSQL, colorspace, utf8 and others), XLConnect, xlsx, foreign, tidyr.

2.5       Alternatives to Image J: Darktable, Krita, Gimp 3.2 or Endrov

Darktable is an excellent choice for radiometrically correct image processing. It allows you to completely deactivate any modules you don't want to use, so you don't even have to look at them, and also allows you to store presets for the remaining modules. darktable indicates the active modules with an icon that can be clicked to disable the module in question.

If you use Krita, open the image and left-click Left-click Image -> Properties… -> Image color space. Change Model to greyscale/alpha, bit depth (precision) 32bit floating point per channel. For 16 bit images in a color space with a linear tone reproduction curve (gamma = 1.0), select a Gray profile with linear tone reproduction curve (gamma = 1.0). An example is the ICC profile “Gray-D50-elle-V4-g10.icc” (note: The profile “Gray built-in (standard)” is the same as “Gray-D50-elle-V2-g22.icc” and has a tone reproduction curve with gamma = 2.2). Choose the rendering method “relative colorimetric”, activate “use black-point compensation” and left-click OK.. Image -> convert Image color space. Add a Layer on top of it and use it as an adjustment layer for non-destructive editing  by selecting that layer with a left-click in the layer overview to make it the active layer and do level adjustments and/or inversion. By default, the final aggregate will be in the same color space as the bottom image. After level adjustments and inversions have been applied, change  back to 16bit integer precision and the profile “Gray (built-in)” or better Gray-D50-elle-V2-srgbtrc.icc.

Gimp 3.2 (in development) will be fully color-managed.

Endrov is an alternative to ImageJ. Currently the contrast and brightness settings are too limited, but an ImageJ plug-in is available.

Icy includes ImageJ and Microscopy Manager

2.6       Deducer: a graphical user interface for the programming language R

Deducer is a specialized open source graphical user interface for data analysis that is built around the JGR console for the programming language R. In order to start Deducer from within the Bundled R installation that comes with Bio7, you have to locate the path for your Bio7 installation and open .../Bio7_x.x/plugins/Bundled_R_x.x.x/R/bin/x64 (all x stand for the version numbers installed). In that folder, create a shortcut to the file Rgui.exe and move this shortcut to a more convenient location. Open the Rgui by double clicking the freshly created shortcut. Install the packages JGR, Deducer and DeducerExtras, DeducerText. Under Microsoft Windows this can be done by selecting from the Rgui menu Packages -> Install packages.... Select a CTRAN mirror from the list and then select the packages JGR, Deducer and DeducerExtras. It is recommendet to install the following additional packages: DeducerSpatial, DeducerSurvival, DeduverText, ggplot2, Rcpp, RcppEigen, Rserve, tm, utf8, XLConnect, XLConnectJars. Click OK and wait until all packages are installed. 
To start Deducer using the Rgui console, select from the Rgui menu Packages -> Load package..., select JGR from the list and click OK. Then type JGR() into the Rgui console and press <enter> to start the JGR console. After all required packages were loaded, Deducer's Data Viewer opens. By hovering the mouse over the main menu of the JGR console (either File, Edit, Packages, Window or Help), the menu entries Data, Plot and DeducerExtra are automatically added to the JGR console. 
In the JGR console, select from the menu Packages -> Package manager and choose the packages installed in the steps above to be loaded by default with the JGR console and click refresh. Read the documentation for R (here and here), for Deducer and for the packages mentioned above.

3       Databases

Data manipulation using Excel, R or Python as described in section 1 usually depends on dataframes that fit into RAM and can be handled in memory. Databases permanently store Data in a flexible way and data has to be retrieved from storage (e.g. hard disk drive or solid state drive) using queries. Functionality can be extended to manipulate data that does not fit into memory (using the dplyr package for R). But for big data that is analyzed in many different ways (over a long period and/ or by multiple users), using databases is an alternative approach. Typically, databases are relational and can be managed using the programming language SQL. One of the most popular open-source relational database management software (RDBMS) is MySQL, a more advanced alternative is PostgreSQL. It is much easier to set up and use a database than the unqualified guess might suggest. It does not require additional hardware such as an additional computer (host) that would serve as a dedicated server. Everything works with a single personal computer! in the context of databases, the word “server” usually refers to the server-software that runs in the background of most database systems. During first setup you will come across a lot of technical terms that stem from network administration but if you follow instructions tightly, you don’t have to know anything about these terms. You can skip reading the following background information and proceed with step 1 (see below). Databases like MySQL or PostgreSQL are usually implemented as a Server – client system that uses the TC/IP communication protocol and the Server can be set up to be accessible from a network / the internet. The server software manages the access of users to the database(s). The client software is the frontend (SQL editor) that is used by the database administrator und -user to interact with the database on the server. Management of user access and user privileges is typically required and requires persons with database-administrator privileges to think about how to manage system-/ database integrity and security concerns by restricting privileges of other database users.

3.1       MySQL and DBeaver plugin for Bio7 / Eclipse

One relational database management system (RDBMS) is MySQL. The windows installer installs “MySQL Server” with its MySQL command line client, the GUI client software called “MySQL Workbench”, the MySQL Router as well as some connectors that are needed to interact with the databases using alternative clients and programming languages.

3.1.1     Setting up MySQL and DBeaver and creating a database

It is as easy as following the steps below:
1)     Install MySQL (for Windows: Use the MySQL installer).
During the installation of MySQL, follow the wizard to configure the MySQL Server instance.
1a)       Choose “Standalone MySQL Server / Classic MySQL Replication. Click “Next”.
1b)       Choose Config type Development Computer”. “TCP/IP” should be selected. Click “Next”.
1c)       Choose “Use Legacy Authentication Method (Retain MySQL 5.x compatibility)
1d)       Enter root password (click “check” if required). Click “Next”
1e)       Configure MySQL Server as a Windows Service. Choose “Standard System Account”.Click “Next”
1f)        Click Execute
2)     After installation has completed, start the MySQL Workbench and click on the entry below “Connections” to connect to the local instance of the MySQL Server via the MySQL Router. Then, enter the password for the user the connection and click OK.
3)     In the bottom left right-click “Schemas” and left-click “Create New Schema”. This creates a new Schema ( = Database). Enter a name, e.g. “Database1”
4)     Under “Users”, add the following users and select their roles and passwords:
User name      host                 role
Admin              %                     DB-Admin

5)     Open Bio7 / Eclipse and check, if the plugin DBeaver is installed with all bundled extensions (connectors etc.) and create a new project using File -> New -> Project -> DBeaver
6)     Left-click Perspective ->“DBeaver”. At the let side of the DBeaver perspective, left-click “Database Navigator”.
7)     Right-click on the project and choose “Set Active” from the context menu that appears.
8)     Select “New Connection”. Enter a connection name. Under Database: Enter the name of the previously created Schema (Database1). Enter the user name of the desired user you want to log in as (e.g. Admin).
9)     During connection, enter the same user name as specified in step 9) and enter the respective password.
10)  In the Database Navigator of the DBeaver perspective, double-click “Users” and then left-click on the button “Create New User” (or: in the Database Navigator, right-click “Users” and in the context menu left-click “Create New User”).Add the following users:
User name      host                 role
AdminLocal      localhost          DB-Admin
UserAdmin       %                     User-Admin
User1              %                     Custom*
User1Local      localhost          Custom*
Viewer1           %                     Custom2**

* Custom role: select the following properties: File, Process, Create tablespace, RESOURCE_GRPOUP_USER. Under “Schema privileges” select the desired schema / catalog and tables. Select the desired privileges, e.g. CREATE, Create view, Insert, References, Select, Show view, Create routine, Execute.
** Custom2: select the following properties: RESOURCE_GRPOUP_USER. Under “Schema privileges” select the desired schema / catalog and tables. Select the desired privileges, e.g. Create view, References, Select, Show view, Create routine, Execute.

3.1.2     Create tables, columns and add data

1)     In the Database Navigator of the DBeaver Perspective, expand the desired database (e.g. Database1) and double click “Tables”. Left-click on the button “Create new table” (or: in the Database Navigator, right-click “Tables” and in the context menu left-click “Create new table”). Choose a lowercase name without spaces (e.g. t1_table1), set Auto Increment to 1. Select the engine innoDB (default) or CONNECT (for connection to files outside the database).
2)     Select a table (e.g. t1_tablename1) and
2a)       left-click on the button “Create New Column” (or: in the Database Navigator, right-click “Tables” and in the context menu left-click “Create New Column”. Choose a lowercase name without spaces (e.g. t1ca_id or t1cb_name), set data type to INTEGER (int). Select “Non-Zero” and Auto Increment.
2b)       Left-click on the desired table. Left-click “constraints”. Left-click “Create New Constraint”. Choose the desired column from the dropdown menu, enter the same name as the column, select PRIMARY KEY from the dropdown menu. Left-click “OK”. Left-click the table and left-click “Save”. In the wizard, left-click “Persist”.
2c)       Select the same table and left-click on the button “Create new Column”. Choose a lowercase name without spaces (e.g. colb_name), set data type to varchar(80) for strings with a maximum length of 80 characters. Set a default value (e.g. ‘NA’ or NULL (= missing value, not applicable). Select “Auto Increment” only if appropriate. Left-click OK. Left-click the table and left-click “Save”. In the wizard, left-click “Persist”.    
2d)       Repeat step 2c) for more columns, replace the name of the column. Select the data type INTEGER for integers or DOUBLE for floating point numbers or BOOL for Boolean expressions. Set the default value either to NULL or to 0 as desired.
3)     Select a table and left-click “Data”. Use the provided buttons to add rows, then select cells to edit data or to delete rows. For the column with the PRIMARY KEY, only enter 1 to the first cell and do not change the following rows manually to make use of the auto increment. Left-click “Save”. In the wizard, left-click “Persist”.

3.1.3     Data retrieval from a database: queries using SQL

1)     To filter database content and show the result, perform a database query by left-clicking the tool “SQL Editor” in the DBeaver toolbar at the top. The following query (script) returns all cells of columns cu, cv from table tx for which the conditions <condition(s)> are met: 
SELECT cu, cv, … FROM tx WHERE <condition(s)>
2)     To save a script like the query from step 1), right-click into the script and from the context menu choose file -> rename. Right-click the script again and from the context menu select file -> save script.
3)     Left-click the button “Execute script”.
4)     The results of a query can be a) copied to clipboard as ascii text with comma separated values, or b) exported to a new text file with comma separated values (.csv). c) dismissed and a new querry can create a new table inside the database that contains the query results
4a)       To copy the results of a query, right click into the result output and in the context menu left-click Advanced copy -> Advanced copy …
4b)       To save the results in a new text file, right click into the result output and in the context menu left-click export Resultset… -> CSV -> Next -> Next -> Next. Specify the target directory and the file name. Left-click Next. (See MySQL reference manual section 13.2.10.1 SELECT … INTO syntax )
4c)       To create a new table ty in the database with the results of a query at the time of that scipt’s execution, execute the following query: 
CREATE ty
SELECT cu, cv, … FROM tx WHERE <condition(s)>

3.2       Processing of data from a MySQL database with the MySQL ODBC Connector using R and the packages odbc, dbplyr and dplyr

Run the R frontend / editor of your choice (e.g. Deducer or Bio7 or Eclipse with StatET). Install the R packages RmySQL or odbc. Both are MySQL drivers for R and rely on DBI (R database interface) which is installed automatically on the fly. RmySQL is directly connecting using the MySQL JDBC connector, but it is licensed under the terms of the viral license GPL. So I recommend to install the MySQL ODBC connector and on Microsoft Windows systems, use the Mircosoft Windows ODBC data source administrator connector 32bit to create a new user data source name, e.g. MySQLodbdc64bitdsn1Database1Unicode as described in the documentation. Then copy and modify the following R script: https://gitlab.com/DataAnalysisByDerAndere/DatabaseInteractionByDerAndere/RdatabaseInteraction/blob/master/src/odbc_MySQL_databaseinteraction.R
In the code above, add or remove the wanted columns and replace the placeholders (<…>) with the desired code (or remove the “WHERE <condition(s)>” statement) in the SQL query (in the dbSendQuery() command) 

The package dplyr provides an efficient abstraction layer that provides fast functions for data manipulation because it was written with the package Rcpp which makes it possible to combine code written in R and C/C++. First, install the package dplyr. The dependencies (e.g. knitr, rmarkdown, ggplot2, RPostgreSQL, RSQLite, DBI, colorspace, utf8 and many others are installed on the fly automatically. Most importantly, the package dbplyr, the MySQL driver RmySQL and the R database interface DBI are also installed automatically together with dplyr and serve as backends for the communication with databases. Tutorials are available online. And here. The following code does the same as the one above, but uses dplyr:
In the code above, add or remove the wanted columns and replace the placeholders (<…>) with the desired code (or remove the “%>% filter()” statement) in the SQL query (in the tbl() command) 

3.3       Processing of data from a MySQL database using Python with PyMySQL or SQLAlchemy

The mysqlclient python wheel can be used as a driver for the MySQL database, but it is licensed under the terms of GPL. I therefore recommend the pymysql python wheel.
In addition the python package SQLAlchemy serves as a high level ORMTo install these packages on Microsoft Windows, right-click cmd.exe and in the context menu left-click “run as administrator” and enter pip install PyMySQL and execute by pressing enter, then enter pip install SQLAlchemy and execute by pressing enter. Run the Python frontend / editor of your choice (e.g. Bio7 / Eclipse with the plug-in PyDev). The connection to a database on a running MySQL server is established via PyMYSQL by executing a Python script with the following code:
In the code above, add or remove the wanted columns and replace the placeholders (<…>) with the desired code (or remove the “WHERE <condition(s)>” statement) in the SQL query (in the execute() command) 

In the code above, add or remove the wanted columns and replace the placeholders (<…>) with the desired code (or replace the .where() part of the select.where() command (read about method chaining in the manual of SQLAlchemy) in the SQL query (in the execute() command). 

If you want to use the object relational mapper (ORM) abstraction level of SQLAlchemy, the same is done with the following Python Script:

The code above takes advantage of deferred reflection. The argument __table_args__ provides variables to hold table parameters and “autoload” = True makes it possible to reflect the respective table in the database automatically without specifying the columns (fields) and their names again. The session.query() command is used as described in the section ”Loading Columns” of the SQLAlchemy manual. Fully automated database reflection using the automap functionality is showcased in the comments. If selected data from different databases has to be reflected automatically, metadata can be predefined using MetaData() and the metadata.reflect() method, and passed as the metadata-argument of the automap_base() method as shown in the last code example that is commented out.

3.4       Further Reading

If real-time processing of data streams is required, have a look at this article to see how Celery works in conjugation with RabbitMQ and SQLAlchemy. An Alternative is Confluent Open Source, which builds upon Apache Kafka.

4       Statistics

Use R with the packages asht (Fay and Kim 2016; Fay 2018) and WRS2 (Mair and Wilcox 2018).
For choosing the best fitting statistical test, refer to review articles (Najak, B.K. and Hazra, A. 2011). For small sample sizes, caution is required. Excellent information on this topic is provided by Delacre et al. (2017). Means are usually meaningless, standard deviation is underestimated and parametric tests are usually meaningless for sample sizes below n=5. Tests for normal distribution also typically require n>30 (Delacre et al. 2017). Especially for small sample sizes, the exact Wilcoxon-Mann-Whitney test is to be preferred when compared to variants with large sample asymptotic (normal) approximation (Bergmann et al. 2000; Fray 2018). It needs a sample size of n1=4, n2=4 or n1=5, n2=3 to produce u = p < α = 0.05 (one sided) or n1=5, n2=4 to produce u = p/2 < α/2 = 0.025 (two sided) (Milton 1964). Spearman’s Rank correlation coefficient needs minimum n=4 (one sided) or n=5 (two sided) or better n=6 (Rhamsey 1989).

4.1       Common misconceptions

Wilcoxon-Mann Whitney test
Note that this is a test for differences in mean ranks. Only if the two populations have the same distribution shape this tests for distribution shift (differences in medians and thus for differences in means) (Campbell 2006).

Parametric tests
Requirement 1) is not “normal distribution of the parameter under observation for the underlying population or the sample but normal distribution of the sample means.


Tests are more or less robust with violations of the remaining requirements. Student’s t test is reportedly robust with violation of the equality of variances. To compensate for deviations from the requirements, some recommend to reduce the significance level (Lehman 2002)


Recent research has made progress in applying t statistics to data that deviates from normality (Hall and Wang 2004; Unnikrishnan and Huang 2016; Schröder and Yitzhaki 2018). However, critical researchers showed, that normal distribution is rare in real life (Pococ 2011; Wilcox 2013; Wilcox and Rousselet 2018) and that tests for normality are problematic (Wright and Herrington 2011) while at the same time even small deviations from normal distribution can result in failure of classic statistical test. That is why Wilcox and others collect and introduce robust alternatives to classical parametric tests (Wilcox 2013; Wilcox 2016; Wilcox 2017; Nair and Sankaran 2009; Sankaran et al 2016; Nair and Balakrishnan 2016; Nair et al. 2018; Wilcox and Rousselet 2018).

4.2       Power analysis

-        Choose the minimum effect size you want to detect (10%)
-        Choose minimum significance level (α=0.05)
-        Choose power according to risk assessment (Pwr=0.8=80% (propability that an actual effect results in correct rejection of H0 and acceptance of H1 at α=0.05 and thereby correct significant (p<α) result).
It follows that β = 1- Pwr = 0.2: Propability for false negative acceptance of null hypothesis H0. 
-        Calculate minimal required sample number n using R with the package pwr.

4.3       Further reading



Copyright © 2018-2019 DerAndere

Popular posts from this blog

Scalable knitting patterns with open source software: Textile design with Inkscape and GIMP.

AYAB shield - self soldered circuit board for a computer-controlled knitting machine