Welcome to Data Engineering UTokyo’s documentation!
A software package developed for the Center of Nuclear Studies (CNS) to join all data sources and analyze the data in real-time. The package saves a lot of time by providing an easy-to-use interface for interacting with the data and allows to automatically extract key results from the data via Maximum Likelihood Estimation (MLE) and search algorithms.
Data sources
The package can map csv files created by LabView to pandas dataframes in real-time up to a sampling rate in linear time, providing a quadratic speedup compared to reloading the data. Examples include data from the SSD, PMT, Coil, Heater, Gauge, Laser, Ion and CCD camera. One can process new formats easily, be creating a child class of Recorder and just overriding the behaviour which has to differ from the default.
Outputs
The output comes in the of recorders and analyses. The recorders abstract away loading the csv file as pandas dataframe and keeping it up to date when more data comes in. The analyses take the recorders, visualize their data, calculate or fit some metrics and store the result as csv file.
The recorders also come in a flavor that we call parser. Parsers provide a solution for handling csv file which are to large for pandas. Each time we evaluate a parser, it provides us with the next chunk of data as pandas dataframe, without loading the full dataframe at any point in time.
Impressions
The plot and histogram are visualizations of the SSD data. The peaks and the background are fitted with an algorithm.
This 2D plot is obtained by Maximum Likelihood Estimation with a Gaussian model on an image from the CCD camera.
How to get started
There are two environments in which this package has been tested:
Locally
You can clone the repository to your local computer, load the data to your computer, and run the code either in a Jupyter notebook or in a Python file. Checkout notebooks/recorder_demo.ipynb, notebooks/ssd_analysis_demo.ipynb and main/main.py as reference.
Install the library by navigating into the repo in your CLI and typing
pip install .
When you start Python, then the following imports should work
import data_eng_utokyo
Google Colaboratory
You create a Google Account, ask for access to the Google Drive folder, create your own alias of the folder such that you see it in your on files, start Google Colaboratory and create a new notebook. In this notebook, you have to make sure that you import your Google Drive folder. In a first step, you just mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')
filepath_drive = "/content/drive/.shortcut-targets-by-id/something/some_folder"
Then, you navigate to the Google Drive folder with commands like
cd /content/drive/
and
ls
The latter will output the files and folders which are located at your current position. When you have found the place of your folder, you save it as code like this:
filepath_drive = "/content/drive/.shortcut-targets-by-id/something/some_folder"
Next, you navigate to your drive and clone this repository with the instructions provided on Github. You can use this Medium article and this Medium article as reference. Read them carefully. Here we just describe the most important steps:
Clone the repo with.
!git clone "https://github.com/romanwixinger/data-engineering-utokyo"
Navigate to the repository in the notebook and verify that you are on the right branch with
!git status
This should output that you are on the branch main. Last, we have to configure origin such that we can push to Github. So go to Github -> settings -> Developer settings -> Personal access tokens and create a personal access token for the repository. Store the access token safely, you cannot see it afterwards. Then you add the remote origin
!git remote add <remote-name> https://{git_username}:{git_token}@github.com/{username}/{repository}.git
Use origin as remote-name, romanwixinger as username, data-engineering-utokyo as repository and your git_username and git_token. To verify that it worked, check that the following command outputs two lines:
!git remote -v
Now you can start to import functionality from the package as described in the examples.
Examples
How to use the package to track your own csv files.
from data_eng_utokyo.recorders import SSDRecorder
ssd_recorder = SSDRecorder(filepath="my_filepath.csv")
df = ssd_recorder.get_table() # Full table
metadata_df = ssd_recorder.get_metadata() # Just metadata
Next we can setup analyses which performs MLE and visualize the data.
from data_eng_utokyo.analyses import SSDAnalysis
ssd_analysis = SSDAnalysis(
recorder=ssd_recorder,
image_src="../../plots/",
image_extension=".png"
)
ssd_analysis.run()
Last, we can also use a runner to perform many analyses in real-time during an experiment.
from data_eng_utokyo.utilities import Runner
from src.analyses import (
SSDAnalysis,
PMTAnalysis,
)
analyses = [
SSDAnalysis(recorder=ssd_recorder, image_src="../../plots/mock"),
PMTAnalysis(recorder=pmt_recorder, image_src="../../plots/mock")
]
runner = Runner(analyses=analyses)
runner.run(cycles=3*60, period_s=60) # Every 60 seconds for three hours
Sometimes, we want to perform the same analysis on many different files. Then, we can leverage the FileRecorder to find the filenames, paths and creation times of these files.
from data_eng_utokyo.recorders import FileRecorder
file_recorder = FileRecorder(
filepath="path_to_the_folder/",
match="regex_pattern_to_find_the_right_files"
)
file_df = file_recorder.get_table()
From time to time, we want to test our analysis with a recorder that represents data that we know well. For this purpose, we introduce the StaticRecorder. This class behaves like a recorder, but its data is a fixed dataframe given upon initialization.
import pandas as pd
from data_eng_utokyo.recorders import StaticRecorder
df = pd.DataFrame(
data=[["Alice", 25], ["Bob", 23], ["Eve", 27]],
columns=["name", "age"]
)
static_recorder = StaticRecorder(df=df)
Structure
Recorders
- Recorders
RecorderRecorder.filepathRecorder.has_metadataRecorder.delimiterRecorder.always_updateRecorder.encodingRecorder.read_data_linesRecorder.last_updatedRecorder.get_table()Recorder.get_data()Recorder.get_metadata()Recorder.is_up_to_date()Recorder._get_mod_time()Recorder._update()Recorder._update_data()Recorder._timestamp_to_datetimes()Recorder._load_initial_data()Recorder._load_new_data()Recorder._load_metadata()Recorder._harmonize_time()
SSDRecorderSSDParserPMTRecorderFileRecorderFileParserImageFileRecorderCoilRecorderIonRecorderLaserRecorderGaugeRecorderHeaterRecorderCycleRecorderParameterRecorderSSDResultsRecorderStaticRecorder
Analyses
Algorithms
- Algorithms
MOTMLEMOTMLE.cMOTMLE.referencesMOTMLE.do_subtract_dead_pixelsMOTMLE.dead_pixels_percentileMOTMLE.self.dead_pixelsMOTMLE.self.dead_pixel_sumMOTMLE.perform_analysis()MOTMLE._load()MOTMLE._df_to_array()MOTMLE._precalculate_dead_pixels()MOTMLE._plot_dead_pixels()MOTMLE._subtract_dead_pixels()MOTMLE._preprocess()MOTMLE._get_scaling_factor()MOTMLE._fitting()MOTMLE._extract_statistics()MOTMLE._get_initial_guess()MOTMLE._generate_fit_data()MOTMLE._plot_fit_result()MOTMLE._plot_heatmap()MOTMLE._print_stats()
PeakPeakFinder
Utilities
Notebooks
Constants
- Constants
CameraConstantsCameraConstants.hbarCameraConstants.cCameraConstants.lambda_RbCameraConstants.omega0_RbCameraConstants.Gamma_RbCameraConstants.I_satCameraConstants.x_intensCameraConstants.y_intensCameraConstants.z_intensCameraConstants.I_beamCameraConstants.s_0CameraConstants.Eff_Gamma_RbCameraConstants.Omega_VPCameraConstants.two_D_gaussCameraConstants.Pow_elec_coefCameraConstants.MOTnum_Pow_coefCameraConstants.Elec_MOTnum_coefCameraConstants.Pow_pulses_per_s_coefCameraConstants.MOTnum_pulses_per_s_coefCameraConstants.XnumCameraConstants.Ynum
c_config()two_D_gauss()
Indices and tables
How to contribute
Contributions are welcomed and should apply the usual git-flow: fork this repo, create a local branch named ‘feature-…’. Commit often to ensure that each commit is easy to understand. Name your commits ‘[feature-…] Commit message.’, such that it possible to differentiate the commits of different features in the main line. Request a merge to the mainline often. Please remember to follow the PEP 8 style guide, and add comments whenever it helps. The docstring follow the Google Style Python Docstring format. The corresponding authors are happy to support you.