What is a Distributed File System (DFS) and why you need it for Deep Learning - Hopsworks

When you train deep learning models with lots of high quality training data, you can beat state-of-the-art prediction models in a wide array of domains (image classification, voice recognition, and machine translation). Distributed file systems are becoming increasingly indispensable as a central store for training data, logs, model serving, and checkpoints. HopsFS is a great choice, as it has native support for the main Python frameworks for Data Science: Pandas, TensorFlow/Keras, PySpark, and Arrow.

What is DFS?

As the name suggests, a Distributed File System (DFS) is a file system that is distributed across multiple machines or locations. Applications can access and store files on it just as they would local files, and programmers can reach those files from any computer on the network. The primary function of a DFS is to enable users of physically distributed systems to share resources and data. A DFS has two components: location transparency (provided by the namespace component) and redundancy (provided by the file replication component). By allowing shares in several locations to be logically grouped under one folder, these components keep data available in the event of failures or heavy loads.

Prediction Performance Improves Predictably with Dataset Size

Baidu showed that the improvement in prediction accuracy (or reduction in generalization error) for deep learning models is predictable from the amount of training data. The decrease in generalization error with increasing training dataset size follows a power law (seen as the straight lines in the log-log graph below). This astonishing result came from a large-scale study across the application domains of machine translation, language modeling, image classification, and speech recognition. Given that it holds in such different domains, there is a good chance it also holds for your particular application. This result is important for companies considering investing in distributed file systems (DFS) for deep learning: if it costs $X to collect or generate a new GB of high-quality training data, you can predict the improvement in prediction accuracy for your model, given the slope, Y, of the log-log graph you observed while training.

[Figure: reduction in generalization error vs. training dataset size on a log-log scale, from Baidu's study]

Predictable ROI in the Power-Law Region

This predictable return-on-investment (ROI) for collecting/generating more training data is slightly more complex than the one described above. You first need to collect enough training data to get beyond the “Small Data Region” in the diagram below. That is, you can only make predictions if you have enough data that you are in the “Power-Law Region”.

[Figure: learning curve showing the Small Data Region and the Power-Law Region]

You can determine this by plotting the reduction in your generalization error as a function of your training data size on a log-log scale. Once you observe a straight line, calculate the exponent of your power law (the slope of the line). Baidu's empirically collected learning curves showed exponents in the range [-0.35, -0.07], suggesting that models learn from real-world data more slowly than theory predicts (theoretical models indicate the power-law exponent should be -0.5).
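
As a rough sketch, the exponent is just the slope of a straight-line fit in log-log space; the dataset sizes and error values below are made up purely for illustration:

    import numpy as np

    # Illustrative (made-up) measurements: training set sizes and the
    # generalization error observed at each size.
    sizes = np.array([1e5, 3e5, 1e6, 3e6, 1e7])
    errors = np.array([0.30, 0.23, 0.17, 0.13, 0.10])

    # In the power-law region, error ~ alpha * size**beta, so
    # log(error) = log(alpha) + beta * log(size): a straight line whose
    # slope beta is the power-law exponent.
    beta, log_alpha = np.polyfit(np.log(sizes), np.log(errors), 1)
    print(f"Estimated power-law exponent: {beta:.2f}")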

Still, if you observe the power-law region, increasing your training data set size will give you a predictable decrease in generalization error. For example, if you are training an image classifier for a self-driving vehicle, the number of hours your cars have driven autonomously determines your training data size. So, going from 2m hours to 6m hours of autonomous driving should reduce errors in your image classifier by a predictable amount. This is important in giving businesses a level of certainty in the improvements they can expect when making large investments in new data collection or generation.
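
With an exponent in hand (the value below is assumed, not measured), the expected improvement from 2m to 6m hours is a one-line calculation:

    # Assumed power-law exponent for your model (illustrative only).
    beta = -0.24

    # Going from 2M to 6M hours of driving data, the generalization error
    # scales by (new_size / old_size) ** beta.
    ratio = (6e6 / 2e6) ** beta
    print(f"Error shrinks to {ratio:.0%} of its previous value "
          f"(roughly a {1 - ratio:.0%} reduction)")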

Need for a Distributed Filesystem (DFS)

The TensorFlow team says a distributed file system is a must for deep learning. Datasets are getting larger, and worker GPUs need to coordinate for model checkpointing, hyperparameter optimization, and model-architecture search. Your system may grow beyond a single server, or you may serve your models from different servers than the ones you use for training. A distributed file system (DFS) is the glue that holds together the different stages of your machine learning workflows, and it enables teams to share GPU hardware. What is important is that the distributed file system (DFS) works with your choice of programming language and deep learning framework(s).

[Figure: a distributed file system as the glue between the stages of a machine learning workflow]

HopsFS is a great choice as a distributed file system (DFS) because it is a drop-in replacement for HDFS. HopsFS/HDFS are supported in the major Python frameworks: Pandas, PySpark DataFrames, TensorFlow Data, and so on. In Hops, we provide built-in HopsFS/HDFS support through the pydoop library. HopsFS has one additional feature aimed at machine learning workloads: improved throughput and lower latency when reading and writing small files. In a peer-reviewed paper at Middleware 2018, we showed throughput improvements of up to 66X compared to HDFS for small files.

[Figure: small-file read/write throughput, HopsFS vs. HDFS]
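
As mentioned above, pydoop lets Python code list and read HopsFS/HDFS paths directly; a minimal sketch (the paths below are hypothetical):

    import pydoop.hdfs as hdfs

    # List the contents of a (hypothetical) dataset directory in HopsFS/HDFS.
    for path in hdfs.ls("/Projects/demo/Resources"):
        print(path)

    # Read a file as if it were local.
    f = hdfs.open("/Projects/demo/Resources/README.md", "rt")
    print(f.read())
    f.close()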

Python Support in Distributed File Systems (DFS)

As we can see from the table below, the choice of distributed file system will affect what you can do.

[Table: Python framework support in different distributed file systems]

Python Support in HopsFS

We now give some simple examples of how to write Python code to use datasets in HopsFS. Complete and up-to-date notebooks can be found in our examples repository on GitHub.

Pandas with HopsFS


In Pandas, the only change we need to make to our code, compared to a local filesystem, is to replace open_file(..) with h.open_file(..), where h is a file handle to HDFS/HopsFS.
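
A minimal sketch of that change, assuming the hops Python library provides the HopsFS handle (the import and dataset path below are illustrative, not the exact API; see the notebooks for working code):

    import pandas as pd
    from hops import hdfs as h  # assumed import; h is our handle to HopsFS/HDFS

    # Hypothetical dataset path inside a Hopsworks project.
    TRAIN_CSV = "Resources/census/adult.data"

    # The only change from local Pandas code: open the file through HopsFS
    # instead of the local filesystem, then read it as usual.
    f = h.open_file(TRAIN_CSV)
    train_df = pd.read_csv(f)
    f.close()

    print(train_df.head())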

PySpark with HopsFS

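Spark reads from any HDFS-compatible filesystem natively, so a PySpark DataFrame can be created straight from a dataset stored in HopsFS. A minimal sketch (the project path is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hopsfs-example").getOrCreate()

    # Hypothetical path; when HopsFS is the cluster's default filesystem,
    # an hdfs:// URI (or a plain path) resolves to HopsFS.
    df = spark.read.csv("hdfs:///Projects/demo/Resources/census/adult.data",
                        header=False, inferSchema=True)

    df.printSchema()
    print(df.count())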

TensorFlow Datasets with HopsFS

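When TensorFlow is built with HDFS support, its file I/O layer understands hdfs:// URIs, so a tf.data pipeline can stream training data directly from HopsFS. A minimal sketch (the TFRecord path is hypothetical):

    import tensorflow as tf

    # Hypothetical TFRecord file stored in HopsFS.
    filenames = ["hdfs:///Projects/demo/Resources/mnist/train.tfrecords"]

    # Build an input pipeline that reads serialized examples straight from HopsFS.
    dataset = (tf.data.TFRecordDataset(filenames)
               .shuffle(buffer_size=10_000)
               .batch(128)
               .prefetch(tf.data.AUTOTUNE))

    for batch in dataset.take(1):
        print(batch.shape)  # a batch of serialized tf.Example records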


FAQs

What is a Distributed File System (DFS) and why do you need it for deep learning?

A distributed file system (DFS) is the glue that holds together the different stages of your machine learning workflows, and it enables teams to share GPU hardware. What is important is that the distributed file system (DFS) works with your choice of programming language and deep learning framework(s).

What is a distributed file system (DFS) and how does it work?

A distributed file system (DFS) is a file system that spans multiple file servers or multiple locations, such as file servers situated in different physical places. Files are accessible just as if they were stored locally, from any device and from anywhere on the network.

Why do we need DFS?

The Distributed File System (DFS) functions provide the ability to logically group shares on multiple servers and to transparently link shares into a single hierarchical namespace. DFS organizes shared resources on a network in a treelike structure.

What are distributed systems for deep learning?

Distributed deep learning is a form of machine learning in which deep neural networks are trained in parallel across multiple machines or devices.

What are the benefits of a distributed file system compared to a centralized file system?

The advantages of such a system over a centralized file system include increased performance, fault tolerance, and higher availability. Because multiple copies of files live on different file servers, if one of those nodes fails, your file is still available from another location.


What are some examples of distributed systems?

Distributed System Examples
  • Networks. The earliest example of a distributed system appeared in the 1970s, when Ethernet was invented and LANs (local area networks) were created. ...
  • Telecommunication networks. ...
  • Distributed Real-time Systems. ...
  • Parallel Processing. ...
  • Distributed artificial intelligence. ...
  • Distributed Database Systems.

What are the main types of distributed systems?

Here are four different distributed system types, including a definition and description of the uses for each:
  • Client-server. ...
  • Peer-to-peer. ...
  • Three-tier. ...
  • N-tier.

Why do we use distributed systems?

Distributed systems offer faster performance with optimum resource use of the underlying hardware. As a result, you can manage any workload without worrying about system failure due to volume spikes or underuse of expensive hardware.

What are the characteristics of a DFS?

A DFS should uphold data integrity and be secure and scalable. Distributed file systems can share data from a single computing system among various servers, so client systems can use multiple storage resources as if they were local storage.

What protocol does DFS use?

The DFS: Referral Protocol relies on the Server Message Block (SMB) Protocol (as specified in [MS-SMB]), the Server Message Block (SMB) Version 2 Protocol (as specified in [MS-SMB2]), or the Common Internet File System Protocol (as specified in [MS-CIFS]) as its transport layer.

What is the difference between a distributed file system and HDFS?

The Hadoop Distributed File System (HDFS) is a specific distributed file system designed for storing and processing large volumes of data across a distributed network, whereas a normal (local) file system is designed for managing smaller volumes of data on a single machine.

How does DFS management work?

DFS uses the Windows Server file replication service to copy changes between replicated targets. Users can modify files stored on one target, and the file replication service propagates the changes to the other designated targets. The service preserves the most recent change to a document or files.

What is the mechanism of DFS?

Replication mechanisms propagate changes between copies while maintaining data consistency. Caching mechanisms improve performance by storing frequently accessed files, or parts of files, closer to the requesting clients, which reduces network transfers and speeds up file access.

What are some examples of distributed file systems?

A cloud-based distributed file system is a type of distributed file system that uses the internet to store and access data. Amazon S3, Microsoft Azure, and Google Cloud Storage are examples of cloud-based distributed file systems.

What is the difference between NTFS and DFS?

NTFS is one type of file system. File systems are generally differentiated by the OS and the type of drive they are used with. Today, there is also the distributed file system (DFS), where files are stored across multiple servers but are accessed and handled as if they were stored locally.
