NTFS interview questions
Top frequently asked NTFS interview questions
At a high level, the only obvious difference between NTFS Junction Points and Symbolic Links is that Junctions can only be directories, while SymLinks can also target files.
What other differences between the two exist?
(Note: I've already seen this question, and what I'm looking for is a bit different -- that question asks for a pro and con list; I'm looking for a set of technical differences.)
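To make the creation side concrete: symbolic links get a dedicated Win32 call, while a junction is just an ordinary directory carrying a mount-point reparse point. Here is a minimal C++ sketch of the symlink side (the paths are made up, and creating symlinks normally requires elevation or Developer Mode); junction creation has no dedicated API and is only described in the comment:

#include <windows.h>
#include <cstdio>

int main()
{
    // Symbolic links: one API call; the flag must say whether the target is a directory.
    if (!CreateSymbolicLinkW(L"C:\\temp\\link.txt", L"C:\\temp\\target.txt", 0))
        printf("file symlink failed: %lu\n", GetLastError());

    if (!CreateSymbolicLinkW(L"C:\\temp\\dirlink", L"C:\\temp\\targetdir",
                             SYMBOLIC_LINK_FLAG_DIRECTORY))
        printf("directory symlink failed: %lu\n", GetLastError());

    // Junctions have no dedicated Win32 API: you create an empty directory and then
    // attach an IO_REPARSE_TAG_MOUNT_POINT reparse point to it with
    // DeviceIoControl(..., FSCTL_SET_REPARSE_POINT, ...), or just run mklink /J.
    return 0;
}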
Source: (StackOverflow)
On Windows NTFS there is a nice but mostly unused feature called "Alternate Data Streams" (ADS), which I recently used in a hobby dev project.
On Mac HFS+ there is a similarly nice but mostly unused feature called "named forks".
I am thinking of porting this project to Linux, but I do not know whether any filesystem on Linux has such a feature?
Source: (StackOverflow)
How would I create/delete/read/write NTFS alternate data streams from .NET?
If there is no native .NET support, which Win32 APIs would I use? Also, how would I use them, as I don't think this is documented?
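For what it's worth, at the Win32 level a stream is addressed simply by appending ":streamname" to the file name; this is a minimal C++ sketch (the file and stream names are made up), not .NET code:

#include <windows.h>
#include <cstdio>

int main()
{
    // Write to the stream "example_stream" attached to data.txt.
    HANDLE h = CreateFileA("data.txt:example_stream", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;
    const char msg[] = "hidden in a stream";
    DWORD written = 0;
    WriteFile(h, msg, sizeof(msg) - 1, &written, NULL);
    CloseHandle(h);

    // Read the stream back through the same name.
    h = CreateFileA("data.txt:example_stream", GENERIC_READ, FILE_SHARE_READ, NULL,
                    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;
    char buf[64] = {};
    DWORD read = 0;
    ReadFile(h, buf, sizeof(buf) - 1, &read, NULL);
    CloseHandle(h);
    printf("%s\n", buf);
    return 0;
}

As far as I know, enumerating a file's streams is done with FindFirstStreamW/FindNextStreamW, and DeleteFile accepts the same file:stream syntax to remove a single stream.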
Source: (StackOverflow)
Which built-in tool (if any) can I use to determine the allocation unit size of a certain NTFS partition?
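For context, fsutil fsinfo ntfsinfo reports the value as "Bytes Per Cluster", and programmatically the same number falls out of GetDiskFreeSpace; a minimal C++ sketch (the drive letter is just an example):

#include <windows.h>
#include <cstdio>

int main()
{
    DWORD sectorsPerCluster = 0, bytesPerSector = 0, freeClusters = 0, totalClusters = 0;

    // The allocation unit (cluster) size is sectors-per-cluster times bytes-per-sector.
    if (!GetDiskFreeSpaceW(L"C:\\", &sectorsPerCluster, &bytesPerSector,
                           &freeClusters, &totalClusters))
    {
        printf("GetDiskFreeSpace failed: %lu\n", GetLastError());
        return 1;
    }

    printf("Allocation unit size: %lu bytes\n", sectorsPerCluster * bytesPerSector);
    return 0;
}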
Source: (StackOverflow)
Application Description
I have an offline data processing tool. This tool loads hundreds of thousands of files. For each one it performs some calculations and when done writes a single index file. It is all C++ (all IO is via standard library objects/functions), and is being compiled with Visual Studio 2013 targeting amd64.
Performance
My test dataset has 115,757 files that need to be processed. The files total 731MB in size, and the median file size is 6KB.
- First run: 12 seconds
- Second run: ~18 minutes
That's 90x slower! The second run is extrapolated from one minute of run time. All runs after that, as I've experienced thus far, are equally slow.
Surprise!
If I rename the folder with the files in it, and then rename it back to what it originally was, the next time I run the application it will again perform quickly!
It's the same app, machine, and source data. The only difference is that one folder was temporarily renamed.
So far I can reproduce this 100% of the time.
Profiling
Naturally the next step was to profile. I profiled the quick run and the slow run and compared the hot spots. In the slow version, about 86% of the application's time was spent in a function called NtfsFindPrefix. The quick version spends about 0.4% of its time there. This is the call stack:
Ntfs.sys!NtfsFindPrefix<itself>
Ntfs.sys!NtfsFindPrefix
Ntfs.sys!NtfsFindStartingNode
Ntfs.sys!NtfsCommonCreate
Ntfs.sys!NtfsCommonCreateCallout
ntoskrnl.exe!KySwitchKernelStackCallout
ntoskrnl.exe!KiSwitchKernelStackContinue
ntoskrnl.exe!KeExpandKernelStackAndCalloutEx
Ntfs.sys!NtfsCommonCreateOnNewStack
Ntfs.sys!NtfsFsdCreate
fltmgr.sys!FltpLegacyProcessingAfterPreCallbacksCompleted
fltmgr.sys!FltpCreate
ntoskrnl.exe!IopParseDevice
ntoskrnl.exe!ObpLookupObjectName
ntoskrnl.exe!ObOpenObjectByName
ntoskrnl.exe!NtQueryAttributesFile
ntoskrnl.exe!KiSystemServiceCopyEnd
ntdll.dll!NtQueryAttributesFile
KernelBase.dll!GetFileAttributesW
DataGenerator.exe!boost::filesystem::detail::status
The boost call in question is an exists call. It will test for the zipped version of a file, fail to find it, and then test for the unzipped one and find it.
Profiling also showed that the disk was not hit by either run of the application; however, file IO was expectedly high. I believe this indicates that the files were already paged into memory.
File IO also showed that the average duration of file "Create" events was MUCH higher in the slow run: 26 us vs 11704 us.
Machine
- Samsung SSD 830 Series
- Intel i7 860
- Windows 7 64 bit
- NTFS file system.
- 32GB Ram
Summary
- On the second run, the calls into NtfsFindPrefix take much longer.
- This is a function in the NTFS driver.
- The disk wasn't hit in either profile; the files were served from pages in memory.
- A rename operation seems to be enough to stop this issue from occurring on the next run.
Question
Now that the background info is out of the way, does anyone recognise what is going on, and does anyone know how to fix it?
It seems like I could work around it by renaming the folder myself, but that seems... dirty. Plus, I'm not sure why that even works.
Is the rename invalidating the pages in memory and causing them to be updated before the next run? Is this a bug in the NTFS driver?
Thanks for reading!
Update!!
After some more profiling, it looks like the part that is performing slower is the test for whether the non-existent zipped file exists. If I remove this test, everything seems to get quicker again.
I have also managed to reproduce this issue in a small C++ app for everyone to see. Note that the sample code will create 100k 6KB files on your machine in the current directory. Can anyone else repro it?
// using VS tr2, could replace with boost::filesystem
#include <filesystem>
namespace fs = std::tr2::sys;
//namespace fs = boost::filesystem;

#include <iostream>
#include <string>
#include <chrono>
#include <fstream>

void createFiles( fs::path outDir )
{
    // create 100k 6KB files with junk data in them. It doesn't matter that they are all the same.
    fs::create_directory( outDir );
    char buf[6144];
    for( int i = 0; i < 100000; ++i )
    {
        std::ofstream fout( outDir / fs::path( std::to_string( i ) ), std::ios::binary );
        fout.write( buf, 6144 );
    }
    fs::rename( outDir, fs::path( outDir.string() + "_tmp" ) );
    fs::rename( fs::path( outDir.string() + "_tmp" ), outDir );
}

int main( int argc, const char* argv[] )
{
    fs::path outDir = "out";
    if( !fs::exists( outDir ) )
        createFiles( outDir );

    auto start = std::chrono::high_resolution_clock::now();

    int counter = 0;
    for( fs::recursive_directory_iterator i( outDir ), iEnd; i != iEnd; ++i )
    {
        // test the non-existent one, then the other
        if( !fs::exists( fs::path( i->path().string() + "z" ) ) && fs::exists( i->path() ) )
            counter += 1;

        if( counter % 100 == 0 )
            std::cout << counter << std::endl;
    }
    std::cout << counter << std::endl;

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration< double, std::milli > s( end - start );
    std::cout << "Time Passed: " << s.count() << "ms" << std::endl;

    return 0;
}
Update 2
I have logged an issue with MS here. Hopefully they can help shed some light on the issue.
Source: (StackOverflow)
I have a problem when installing npm modules. NodeJS is installed on Ubuntu 11.10 running in VirtualBox on a Windows host. My project files are on an NTFS partition (I have to share them with Windows). When I try to install some npm module I get an error, and the module is not installed. I've found out that the problem occurs when npm tries to create symbolic links.
It seems you cannot create symlinks on the NTFS partition; when I install a module "inside" the Linux file system, everything works fine.
How can I fix this? I don't want to resolve dependencies manually :/
Source: (StackOverflow)
Solved:
* Workable Solution: @sbi
* Explanation for what really happens: @Hans
* Explanation for why OpenFile doesn't pass through "DELETE PENDING": @Benjamin
The Problem:
Our software is in large part an interpreter engine for a proprietary scripting language. That scripting language has the ability to create a file, process it, and then delete the file. These are all separate operations, and no file handles are kept open between them. (I.e. during the file create, a handle is created, used for writing, and then closed. During the file processing portion, a separate file handle opens the file, reads from it, and is closed at EOF. Finally, delete uses ::DeleteFile, which takes only a filename, not a file handle at all.)
Recently we've come to realize that a particular macro (script) sometimes fails to create the file at some random subsequent time (i.e. it succeeds during the first hundred iterations of "create, process, delete", but when it comes back to create the file for the hundred and first time, Windows replies "Access Denied").
Looking deeper into the issue, I have written a very simple program that loops over something like this:
while (true)
{
    HANDLE hFile = CreateFileA(pszFilename, FILE_ALL_ACCESS, FILE_SHARE_READ, NULL,
                               CREATE_NEW, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE)
        return OpenFailed;

    const DWORD dwWrite = strlen(pszFilename);
    DWORD dwWritten;
    if (!WriteFile(hFile, pszFilename, dwWrite, &dwWritten, NULL) || dwWritten != dwWrite)
        return WriteFailed;

    if (!CloseHandle(hFile))
        return CloseFailed;

    if (!DeleteFileA(pszFilename))
        return DeleteFailed;
}
As you can see, this is direct to the Win32 API, and pretty darn simple. I create a file, write to it, close the handle, delete it, rinse, repeat...
But somewhere along the line, I'll get an Access Denied (5) error during the CreateFile() call. Looking at Sysinternals' Process Monitor, I can see that the underlying issue is that there is a pending delete on the file while I'm trying to create it again.
Questions:
* Is there a way to wait for the delete to complete?
* Is there a way to detect that a file is pending deletion?
We have tried the first option, by simply calling WaitForSingleObject() on the HFILE. But the HFILE is always closed before the WaitForSingleObject executes, and so WaitForSingleObject always returns WAIT_FAILED. Clearly, trying to wait on a closed handle doesn't work.
I could wait on a change notification for the folder that the file exists in. However, that seems like an extremely overhead-intensive kludge for what is only an occasional problem (to wit: in my tests on my Win7 x64 E6600 PC it typically fails on iteration 12000+ -- on other machines, it can happen as soon as iteration 7, or 15, or 56, or never).
I have been unable to discern any CreateFile() arguments that would explicitly allow for this either. No matter what arguments CreateFile has, it really is not okay with opening a file for any access while that file is pending deletion. And since I can see this behavior on both an XP box and an x64 Win7 box, I am quite certain that this is core NTFS behavior "as intended" by Microsoft. So I need a solution that allows the OS to complete the delete before I attempt to proceed, preferably without tying up CPU cycles needlessly, and without the extreme overhead of watching the folder that this file is in (if possible).
Thanks for taking the time to read this and post a response. Clarifying Questions welcome!
[1] Yes, this loop returns on a failure to write or a failure to close, which leaks the handle, but since this is a simple console test app, the app itself exits, and Windows guarantees that all handles are closed by the OS when a process completes. So no leaks exist here.
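For reference, the rename-then-delete workaround we ended up using looks roughly like this: the file is first moved to a throwaway temporary name, and the delete is queued against that name, so the pending delete never collides with re-creating the original filename.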
#include <windows.h>
#include <shlwapi.h>   // PathRemoveFileSpecA (link with shlwapi.lib)
#include <string.h>

bool DeleteFileNowA(const char * pszFilename)
{
    // determine the path in which to store the temp filename
    char szPath[MAX_PATH];
    strcpy(szPath, pszFilename);
    PathRemoveFileSpecA(szPath);

    // generate a guaranteed-to-be-unique temporary filename to house the pending delete
    char szTempName[MAX_PATH];
    if (!GetTempFileNameA(szPath, ".xX", 0, szTempName))
        return false;

    // move the real file to the dummy filename
    if (!MoveFileExA(pszFilename, szTempName, MOVEFILE_REPLACE_EXISTING))
        return false;

    // queue the deletion (the OS will delete it when all handles, ours or other processes', close)
    if (!DeleteFileA(szTempName))
        return false;

    return true;
}
Source: (StackOverflow)
Why can't I create a deep path with more than 255 characters in it on the NTFS file system?
It seems to be a limit of FAT32, but does it also exist in NTFS? Can anyone provide some documentation?
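For what it's worth, my understanding is that NTFS itself only limits a single name component to 255 characters; the roughly 260-character limit on the full path comes from the Win32 MAX_PATH constant, and the Unicode APIs accept paths up to about 32,767 characters when the name is prefixed with \\?\. A minimal C++ sketch (the paths are made up):

#include <windows.h>
#include <string>
#include <cstdio>

int main()
{
    // Build a directory tree whose full path is far longer than MAX_PATH (260).
    // The \\?\ prefix makes the wide-character APIs skip the MAX_PATH check;
    // each individual name component is still limited to 255 characters.
    std::wstring deep = L"\\\\?\\C:\\longpathtest";
    if (!CreateDirectoryW(deep.c_str(), NULL) && GetLastError() != ERROR_ALREADY_EXISTS)
    {
        printf("create failed: %lu\n", GetLastError());
        return 1;
    }

    for (int i = 0; i < 20; ++i)   // 20 components of ~32 chars pushes well past 260
    {
        deep += L"\\component_0123456789_0123456789";
        CreateDirectoryW(deep.c_str(), NULL);
    }

    // Files inside the deep tree are created the same way.
    HANDLE h = CreateFileW((deep + L"\\file.txt").c_str(), GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
    return 0;
}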
Many Thanks!
Source: (StackOverflow)
How does Windows with NTFS perform with large volumes of files and directories?
Is there any guidance around the limits on files or directories you can place in a single directory before you run into performance problems or other issues? E.g. is a folder with 100,000 folders inside of it an OK thing to do?
Source: (StackOverflow)
I have files on my hard drive that throw a PathTooLongException when I access the Fullname property of a FileSystemInfo object. Is there any way around this (excluding renaming the files, which is not an option)?
http://msdn.microsoft.com/en-us/library/aa365247%28VS.85%29.aspx#maxpath, mentioned by other answers, suggests putting a "\\?\" prefix on the file name, but in this case DirectoryInfo.GetFileSystemInfos() is responsible for creating the FileSystemInfo objects, and DirectoryInfo doesn't accept that prefix, so there's no way to use it.
The answer "PathTooLongException in C# code" doesn't help because this is a multi-threaded application and I can't keep changing the current application path.
Do I really have to do everything with PInvoke just to be able to read every file on the hard drive?
Source: (StackOverflow)
I will likely be involved in a project where an important component is a storage for a large number of files (in this case images, but it should just act as a file storage).
Number of incoming files is expected to be around 500,000 per week (averaging around 100 Kb each), peaking around 100,000 files per day and 5 per second.
Total number of files is expected to reach tens of millions before reaching an equilibrium where files are being expired for various reasons at the input rate.
So I need a system that can store around 5 files per second at peak hours, while reading around 4 and deleting 4 at any time.
My initial idea is that a plain NTFS file system with a simple service for storing, expiring and reading should actually be sufficient. I could imagine the service creating sub-folders for each year, month, day and hour to keep the number of files per folder at a minimum and to allow manual expiration in case that should be needed.
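As a sketch of that bucketing idea (the base directory and the use of std::filesystem are just placeholders for illustration), deriving the sub-folder from a file's arrival time could look like this:

#include <filesystem>
#include <ctime>
#include <cstdio>

namespace fs = std::filesystem;

// Map an arrival time onto a year/month/day/hour bucket under a base directory,
// so no single folder accumulates more than one hour's worth of files.
fs::path bucketFor(const fs::path& base, std::time_t arrival)
{
    std::tm tm{};
#ifdef _WIN32
    gmtime_s(&tm, &arrival);
#else
    gmtime_r(&arrival, &tm);
#endif
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%04d/%02d/%02d/%02d",
                  tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday, tm.tm_hour);
    return base / buf;
}

int main()
{
    fs::path dir = bucketFor("D:/imagestore", std::time(nullptr));   // example base path
    fs::create_directories(dir);
    std::printf("%s\n", dir.string().c_str());
    return 0;
}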
A large NTFS solution has been discussed here, but I could still use some advice on what problems to expect when building a storage with the mentioned specifications, what maintenance problems to expect and what alternatives exist. Preferably I would like to avoid a distributed storage, if possible and practical.
edit
Thanks for all the comments and suggestions. Some more bonus info about the project:
This is not a web-application where images are supplied by end-users. Without disclosing too much, since this is in the contract phase, it's more in the category of quality control. Think production plant with conveyor belt and sensors. It's not traditional quality control since the value of the product is entirely dependent on the image and metadata database working smoothly.
The images are accessed 99% of the time by an autonomous application, in first-in, first-out order, but random access by a user application will also occur. Images older than a day will mainly serve archive purposes, though that purpose is also very important.
Expiration of the images follow complex rules for various reasons, but at some date all images should be deleted. Deletion rules follow business logic dependent on metadata and user interactions.
There will be downtime each day, where maintenance can be performed.
Preferably the file storage will not have to communicate image location back to the metadata server. Image location should be uniquely deducible from metadata, possibly through a mapping database, if some kind of hashing or distributed system is chosen.
So my questions are:
- Which technologies will do a robust job?
- Which technologies will have the lowest implementing costs?
- Which technologies will be easiest to maintain by the client's IT-department?
- What risks are there for a given technology at this scale (5-20 TB data, 10-100 million files)?
Source: (StackOverflow)
I will use the Linux NTFS driver as an example.
The Linux kernel NTFS driver only has very limited write support in the kernel, and after 5 years it is still considered experimental.
The same development team creates the ntfsmount userspace driver, which has almost perfect write support.
Likewise, the NTFS-3G project which is written by a different team also has almost perfect write support.
Why has the kernel driver taken so much longer? Is it much harder to develop for?
Saying that there already exists a decent userspace application is not a reason why the kernel driver is not complete.
NOTE: Do not migrate this to superuser.com. I want a programming-heavy answer, from a programming perspective, not a practical-use answer. If the question is not appropriate for SO, please advise me as to why so I can edit it so it is.
Source: (StackOverflow)
NTFS files can have object IDs. These IDs can be set using FSCTL_SET_OBJECT_ID. However, the MSDN article says:
Modifying an object identifier can result in the loss of data from portions of a file, up to and including entire volumes of data.
But it doesn't go into any more detail. How can this result in loss of data? Is it talking about potential object id collisions in the file system, and does NTFS rely on them in some way?
Side note: I did some experimenting with this before I found that paragraph, and set the object IDs of some newly created files; here's hoping that my file system's still intact.
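For reference, setting an object ID boils down to a DeviceIoControl call on an open handle; a minimal C++ sketch of the kind of experiment described above (the path and the ID bytes are arbitrary):

#include <windows.h>
#include <winioctl.h>
#include <cstdio>
#include <cstring>

int main()
{
    HANDLE h = CreateFileW(L"C:\\temp\\test.txt", GENERIC_READ | GENERIC_WRITE,
                           0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    // FILE_OBJECTID_BUFFER: a 16-byte ObjectId plus 48 bytes of "birth" info.
    FILE_OBJECTID_BUFFER buf;
    memset(&buf, 0, sizeof(buf));
    memcpy(buf.ObjectId, "0123456789abcdef", 16);   // arbitrary example ID

    DWORD bytes = 0;
    if (!DeviceIoControl(h, FSCTL_SET_OBJECT_ID, &buf, sizeof(buf),
                         NULL, 0, &bytes, NULL))
        printf("FSCTL_SET_OBJECT_ID failed: %lu\n", GetLastError());

    // FSCTL_CREATE_OR_GET_OBJECT_ID is the gentler alternative: it only assigns
    // a new ID if the file doesn't already have one.
    CloseHandle(h);
    return 0;
}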
Source: (StackOverflow)
I would like to provide a way to recognize when a large file is fragmented to a certain extent, and alert the user when they should perform a defragmentation. In addition, I'd like to show them a visual display demonstrating how the file is actually broken into pieces across the disk.
I don't need to know how to calculate how fragmented it is, or how to make the visual display. What I need to know is two things: 1) how to identify the specific clusters on any disk which contain pieces of any particular given file, and 2) how to identify the total number of clusters on that disk. I would essentially need a list of all the clusters which contain pieces of this file, and where on the disk each of those clusters is located.
Most defragmentation utilities have a visual display showing how the files are spread across the disk. My display will show how one particular file is split up into different areas of a disk. I just need to know how I can retrieve the necessary data to tell me where the file's clusters/sectors are located on the disk, so I can further determine how fragmented it is.
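In case it helps, the information appears to be exposed through FSCTL_GET_RETRIEVAL_POINTERS (per-file extents mapping virtual clusters to logical clusters on the volume) and GetDiskFreeSpace (total clusters); a rough C++ sketch, with the file path hard-coded as an example and the buffer-growing retry omitted:

#include <windows.h>
#include <winioctl.h>
#include <cstdio>
#include <vector>

int main()
{
    // Opening with FILE_READ_ATTRIBUTES is enough for the retrieval-pointers query.
    HANDLE h = CreateFileW(L"C:\\temp\\bigfile.bin", FILE_READ_ATTRIBUTES,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    // Each extent maps a run of the file's virtual clusters (VCNs) to the starting
    // logical cluster number (LCN) on the volume; Lcn is -1 for sparse/compressed holes.
    STARTING_VCN_INPUT_BUFFER in = {};
    std::vector<char> out(64 * 1024);
    DWORD bytes = 0;
    if (DeviceIoControl(h, FSCTL_GET_RETRIEVAL_POINTERS, &in, sizeof(in),
                        out.data(), (DWORD)out.size(), &bytes, NULL))
    {
        const RETRIEVAL_POINTERS_BUFFER* rp =
            reinterpret_cast<const RETRIEVAL_POINTERS_BUFFER*>(out.data());
        LONGLONG vcn = rp->StartingVcn.QuadPart;
        for (DWORD i = 0; i < rp->ExtentCount; ++i)
        {
            LONGLONG next = rp->Extents[i].NextVcn.QuadPart;
            printf("extent %lu: %lld clusters starting at LCN %lld\n",
                   i, next - vcn, rp->Extents[i].Lcn.QuadPart);
            vcn = next;
        }
    }
    // If the call fails with ERROR_MORE_DATA, the output buffer needs to grow.
    CloseHandle(h);

    // Total clusters on the volume; FSCTL_GET_NTFS_VOLUME_DATA gives a 64-bit count
    // if the DWORD from GetDiskFreeSpace isn't enough.
    DWORD spc = 0, bps = 0, freeClusters = 0, totalClusters = 0;
    if (GetDiskFreeSpaceW(L"C:\\", &spc, &bps, &freeClusters, &totalClusters))
        printf("total clusters on C: %lu\n", totalClusters);
    return 0;
}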
Source: (StackOverflow)
Sorry, I know this sounds like a newbie question. But seriously, I'm an experienced developer, and I understand that Windows 7 Pro 64-bit and the like will say, "Oh, if you move an NTFS tree from one drive to another, when I write the child files that really means that I'm modifying the parent folder, so I'll update its timestamp." So I wind up with all the destination files having the same timestamps as the originals, but all of the folders having the same just-now-modified date/time.
So I understand what's happening. And I know that I could write my own utility (I have) to copy/move files on NTFS. But utilities are risky---if they aren't NTFS-aware, they could ignore other properties or miss things like NTFS Alternate Data Streams (ADS), etc.
So does anyone know a good, NTFS-aware tree-move utility that will simply move all of a tree and maintain the timestamps? I don't want to risk losing anything. Thanks.
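For reference, preserving a directory's timestamps by hand boils down to GetFileTime before the copy and SetFileTime afterwards; a minimal C++ sketch (the paths are examples), noting that directories have to be opened with FILE_FLAG_BACKUP_SEMANTICS:

#include <windows.h>

// Copy the creation/access/write times from one directory to another.
static bool CopyDirTimestamps(const wchar_t* src, const wchar_t* dst)
{
    HANDLE hSrc = CreateFileW(src, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);
    HANDLE hDst = CreateFileW(dst, FILE_WRITE_ATTRIBUTES, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, NULL);
    bool ok = false;
    if (hSrc != INVALID_HANDLE_VALUE && hDst != INVALID_HANDLE_VALUE)
    {
        FILETIME created, accessed, written;
        if (GetFileTime(hSrc, &created, &accessed, &written))
            ok = SetFileTime(hDst, &created, &accessed, &written) != 0;
    }
    if (hSrc != INVALID_HANDLE_VALUE) CloseHandle(hSrc);
    if (hDst != INVALID_HANDLE_VALUE) CloseHandle(hDst);
    return ok;
}

int main()
{
    // Example: after copying C:\src to D:\dst, put the original folder times back.
    return CopyDirTimestamps(L"C:\\src", L"D:\\dst") ? 0 : 1;
}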
Source: (StackOverflow)